Q-Learning Using Python And OpenAI Gym

In this article, we will build and play our very first reinforcement learning (RL) game using Python and the OpenAI Gym environment. The OpenAI Gym library has tons of gaming environments, from text-based games to real-time complex environments. More details can be found on their website. Installing the gym library is simple; just type this command:
pip install gym
We will be using the gym library to build and play a text-based game called FrozenLake-v0. The following description is taken as-is from the Gym site:
"Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend."
The surface is described using a grid like the following. The game ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.

SFFF
FHFH
FFFH
HFFG

where S: starting point, safe; F: frozen surface, safe; H: hole, fall to your doom; G: goal, where the frisbee is located.
In the Q-learning reinforcement learning technique:
  • The goal is to learn a policy that tells an agent what action to take under what circumstances.
  • For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state.
  • Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly-random policy.
  • "Q" names the function Q(s,a) that can be said to stand for the "quality" of an action a taken in a given state s.
In this FrozenLake environment there are 16 states: each grid point is a state. Four actions are possible from each state: Left, Right, Up and Down.
To begin our program, import the following libraries in your notebook:
import numpy as np
import gym
import random
import time
from IPython.display import clear_output
Now, we create the environment:
env = gym.make("FrozenLake-v0")
After this, we create our q_matrix, initialized to zeros. The 16 rows of this matrix are the 16 possible states, and the 4 columns are the 4 possible actions.
action_size = env.action_space.n
state_size = env.observation_space.n

q_matrix = np.zeros((state_size, action_size))
q_matrix
Now we want to train our agent so that, after training, the q_matrix holds the learned value for each state-action pair, and the agent can use it to play the game. After a lot of iterations, a good Q-table is ready.
Mathematically, the update is given by the following equation, where alpha is the learning rate and gamma is the discount factor:
Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
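Applied in code, a single step of this update looks like the following. This is a minimal sketch with made-up numbers (a toy 2-state, 2-action table), not part of the article's attached code:

```python
import numpy as np

# Hypothetical toy example: 2 states, 2 actions, one observed transition
q_matrix = np.zeros((2, 2))
learning_rate = 0.1    # alpha
discount_rate = 0.99   # gamma

state, action, reward, new_state = 0, 1, 1.0, 1

# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
q_matrix[state, action] += learning_rate * (
    reward + discount_rate * np.max(q_matrix[new_state, :]) - q_matrix[state, action]
)
print(q_matrix[state, action])  # moves 10% of the way toward the target, 0.1
```

Because the table starts at zero, the target here is just the immediate reward, and the entry moves a fraction alpha toward it.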
We also need to decide the learning rate, which is normally between 0.001 and 0.5. The exploration rate starts at 1 and slowly decays; exploration keeps the agent from getting stuck in the same trajectory, so it keeps trying different paths through the environment that may lead to higher returns. The discount rate is the gamma factor applied to future rewards, and its value is generally between 0.9 and 0.999.
num_episodes = 10000
max_steps = 100

learning_rate = 0.1
discount_rate = 0.99

exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.05
exploration_decay_rate = 0.0001

cumulative_rewards_all_episodes = []
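To show the shape of the training loop these hyperparameters feed into, here is a sketch that mirrors the structure described above: epsilon-greedy action selection, the Q-learning update, and exploration decay. The CorridorEnv class is my own tiny stand-in environment (not part of the article or of gym) so the snippet runs on its own; the attached code uses the real FrozenLake env instead.

```python
import numpy as np
import random

class CorridorEnv:
    """Stand-in environment (illustrative stub, not from the article):
    a 1-D corridor of 5 states. Action 1 moves right toward the goal,
    action 0 moves left. Reaching the last state gives reward 1 and ends
    the episode."""
    def __init__(self, n_states=5):
        self.n_states = n_states
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        if action == 1:
            self.state = min(self.state + 1, self.n_states - 1)
        else:
            self.state = max(self.state - 1, 0)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}

env = CorridorEnv()
q_matrix = np.zeros((env.n_states, 2))

num_episodes = 2000          # fewer episodes than the article; the toy env is small
max_steps = 100
learning_rate = 0.1
discount_rate = 0.99
exploration_rate = 1.0
max_exploration_rate = 1.0
min_exploration_rate = 0.05
exploration_decay_rate = 0.001

cumulative_rewards_all_episodes = []

for episode in range(num_episodes):
    state = env.reset()
    rewards_current_episode = 0.0
    for step in range(max_steps):
        # Epsilon-greedy: exploit the best known action, or explore a random one
        if random.uniform(0, 1) > exploration_rate:
            action = int(np.argmax(q_matrix[state, :]))
        else:
            action = random.randrange(2)
        new_state, reward, done, info = env.step(action)
        # Q-learning update with learning rate alpha and discount gamma
        q_matrix[state, action] += learning_rate * (
            reward + discount_rate * np.max(q_matrix[new_state, :]) - q_matrix[state, action]
        )
        state = new_state
        rewards_current_episode += reward
        if done:
            break
    # Decay the exploration rate toward its minimum as the agent learns
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)
    cumulative_rewards_all_episodes.append(rewards_current_episode)
```

After training, np.argmax(q_matrix[s, :]) picks "right" in every non-terminal state, which is exactly how the learned Q-table is used to play.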
Since the code is long, I have attached it along with this article. You will see the following updated q_matrix, learned by the agent by the end of 10,000 games.
[learned q_matrix output]
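The attached code also fills cumulative_rewards_all_episodes with one total reward (0 or 1) per episode. A common way to check training progress is to print the average reward per thousand episodes; this sketch uses made-up reward data in place of the real list, which comes from training:

```python
import numpy as np

# Illustrative stand-in for the list the training loop fills:
# one total reward (0 or 1) per episode, 10,000 episodes in all
cumulative_rewards_all_episodes = [0] * 500 + [1] * 500 + [1] * 9000

# Split into 10 chunks of 1,000 episodes and print the success rate of each
rewards_per_thousand = np.split(np.array(cumulative_rewards_all_episodes), 10)
for i, chunk in enumerate(rewards_per_thousand, start=1):
    print(i * 1000, ":", chunk.mean())
```

A rising average across the chunks is a quick sanity check that the agent is actually improving as the Q-table fills in.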
Now, with this knowledge, the agent can play the game.
# Watch our agent play Frozen Lake by playing the best action
# from each state according to the Q-matrix
for episode in range(3):
    # Initialize new episode params
    state = env.reset()
    done = False
    print("*****EPISODE ", episode + 1, "*****\n\n\n\n")
    time.sleep(1)

    for step in range(max_steps):
        # Show current state of environment on screen
        clear_output(wait=True)
        env.render()
        time.sleep(0.3)

        # Choose action with highest Q-value for current state
        action = np.argmax(q_matrix[state, :])

        # Take new action
        new_state, reward, done, info = env.step(action)

        if done:
            clear_output(wait=True)
            env.render()
            if reward == 1:
                # Agent reached the goal and won episode
                print("****You reached the goal!****")
                time.sleep(3)
            else:
                # Agent stepped in a hole and lost episode
                print("****You fell through a hole!****")
                time.sleep(3)
                clear_output(wait=True)
            break

        # Set new state
        state = new_state

env.close()
The above code will render the environment as the agent plays the game.