# Q-Learning Using Python And OpenAI Gym

In this article, we will build and play our very first reinforcement learning (RL) game using Python and the OpenAI Gym library. Gym offers tons of game environments, ranging from simple text-based games to complex real-time ones; more details can be found on the Gym website. Installing the library is simple, just type this command:

```shell
pip install gym
```

We will use the gym library to build and play a text-based game called FrozenLake-v0. The following description of the game is taken as-is from the Gym site:

"Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend."

The surface is described using a grid like the following. The game ends when you reach the goal or fall into a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.

SFFF
FHFH
FFFH
HFFG

where
• S: starting point, safe
• F: frozen surface, safe
• H: hole, fall to your doom
• G: goal, where the frisbee is located

In the Q-learning reinforcement learning technique:
• The goal is to learn a policy that tells an agent what action to take under what circumstances.
• For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over all successive steps, starting from the current state.
• Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly-random policy.
• "Q" names the function Q(s, a) that can be said to stand for the "quality" of an action a taken in a given state s.
In this FrozenLake environment there are 16 states (each grid cell is a state), and 4 actions are possible in each state: Left, Down, Right and Up.

To begin our program, import the following libraries in your notebook:

```python
import numpy as np
import gym
import random
import time
from IPython.display import clear_output
```

Now, we create the environment:

```python
env = gym.make("FrozenLake-v0")
```

After this, we create our q_matrix, initialized to zeros. Its 16 rows are the 16 possible states and its 4 columns are the 4 possible actions.

```python
action_size = env.action_space.n
state_size = env.observation_space.n

q_matrix = np.zeros((state_size, action_size))

q_matrix
```

Now, we want to train our agent so that, after training, the q_matrix holds the learned value for each state-action pair, which the agent can then use to play the game. After many iterations a good Q-table is ready.

Mathematically, the update performed at each step is shown in the following equation, where alpha is the learning rate and gamma is the discount factor:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
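As a quick numeric illustration of this update rule, here is a single hand-worked transition. The transition values (state 14, action 2, reward 1, next state 15) are made up for the example, not taken from a real FrozenLake run:

```python
import numpy as np

q_matrix = np.zeros((16, 4))   # 16 states x 4 actions, as in FrozenLake
alpha = 0.1                    # learning rate
gamma = 0.99                   # discount factor

# Hypothetical transition: from state s, taking action a earned reward r
# and landed in state s_next
s, a, r, s_next = 14, 2, 1.0, 15

# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
q_matrix[s, a] = q_matrix[s, a] + alpha * (
    r + gamma * np.max(q_matrix[s_next, :]) - q_matrix[s, a]
)
print(q_matrix[s, a])  # 0.1: the table moves a fraction alpha toward the target
```

Since the table starts at zero, the new value is simply alpha times the target r + gamma * max Q(s', a') = 0.1 * 1.0; repeated visits keep nudging the entry toward that target.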

We also need to choose the learning rate, which is normally between 0.001 and 0.5. The exploration rate starts at 1 and slowly decays; exploration is what keeps the agent from getting stuck in the same trajectory, so it keeps probing the environment for different paths that may lead to higher returns. The discount rate is the gamma factor applied to future rewards, whose value generally lies between 0.9 and 0.999.
```python
num_episodes = 10000
max_steps = 100

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.05
exploration_decay_rate = 0.0001

cumulative_rewards_all_episodes = []
```
Since the full training code is long, I have attached it along with this article. After the end of the 10000 training games, you will see the updated q_matrix learned by the agent.
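The structure of that training loop can still be sketched here. To keep the sketch self-contained and runnable without gym, it swaps FrozenLake for a hypothetical 5-state chain environment (ChainEnv is my stand-in, not part of gym); with gym you would call env.reset() and env.step(action) the same way. The hyperparameter names match those above:

```python
import random
import numpy as np

class ChainEnv:
    """Stand-in environment: 5 states in a row. Action 1 moves right,
    action 0 moves left; reaching state 4 gives reward 1 and ends the episode."""
    n_states, n_actions = 5, 2
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        self.state = min(self.state + 1, 4) if action == 1 else max(self.state - 1, 0)
        done = self.state == 4
        return self.state, (1.0 if done else 0.0), done, {}

random.seed(0)
env = ChainEnv()
q_matrix = np.zeros((env.n_states, env.n_actions))

num_episodes, max_steps = 1000, 100
learning_rate, discount_rate = 0.1, 0.99
exploration_rate = 1.0
max_exploration_rate, min_exploration_rate = 1.0, 0.05
exploration_decay_rate = 0.01

cumulative_rewards_all_episodes = []
for episode in range(num_episodes):
    state = env.reset()
    rewards_current_episode = 0.0
    for step in range(max_steps):
        # Epsilon-greedy: exploit the best known action, or explore at random
        if random.uniform(0, 1) > exploration_rate:
            action = int(np.argmax(q_matrix[state, :]))
        else:
            action = random.randrange(env.n_actions)
        new_state, reward, done, info = env.step(action)
        # Q-learning update (the equation above)
        q_matrix[state, action] += learning_rate * (
            reward + discount_rate * np.max(q_matrix[new_state, :]) - q_matrix[state, action]
        )
        state = new_state
        rewards_current_episode += reward
        if done:
            break
    # Decay the exploration rate toward its minimum after each episode
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
    cumulative_rewards_all_episodes.append(rewards_current_episode)

print(np.argmax(q_matrix, axis=1)[:4])  # greedy policy for states 0 to 3 (1 = move right)
```

After training, the greedy policy read off the table with np.argmax moves right from every non-terminal state, which is exactly how the attached FrozenLake code uses its learned q_matrix.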

Now, with this knowledge, the agent can play the game:

```python
# Watch our agent play Frozen Lake by playing the best action
# from each state according to the Q-matrix

for episode in range(3):
    # Initialize new episode params
    state = env.reset()
    done = False
    print("*****EPISODE ", episode + 1, "*****\n\n\n\n")
    time.sleep(1)

    for step in range(max_steps):
        # Show current state of environment on screen,
        # choose the action with the highest Q-value for the
        # current state, then take that action
        clear_output(wait=True)
        env.render()
        time.sleep(0.3)

        action = np.argmax(q_matrix[state, :])
        new_state, reward, done, info = env.step(action)

        if done:
            clear_output(wait=True)
            env.render()
            if reward == 1:
                # Agent reached the goal and won the episode
                print("****You reached the goal!****")
                time.sleep(3)
            else:
                # Agent stepped in a hole and lost the episode
                print("****You fell through a hole!****")
                time.sleep(3)
                clear_output(wait=True)
            break
        # Set new state
        state = new_state

env.close()
```
The above code will render the environment as the agent plays the game.