Q-Learning Using Python And OpenAI Gym

Veena Sarda

In this article, we will build and play our very first reinforcement learning (RL) game using Python and the OpenAI Gym environment. The OpenAI Gym library offers a wide range of game environments, from text-based games to complex real-time ones. More details can be found on their website. Installing the gym library is simple; just type this command:

pip install gym

We will use the gym library to build and play a text-based game called FrozenLake-v0. The following description of the game is taken as is from the Gym site:

"Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc. However, the ice is slippery, so you won't always move in the direction you intend."

The surface is described using a grid like the following. The game ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.

SFFF
FHFH
FFFH
HFFG

where S: starting point, safe; F: frozen surface, safe; H: hole, fall to your doom; G: goal, where the frisbee is located.

In the Q-learning reinforcement learning technique:

The goal is to learn a policy that tells an agent what action to take under what circumstances.
For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over any and all successive steps, starting from the current state.
Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time and a partly-random policy.
"Q" names the function Q(s,a) that can be said to stand for the "quality" of an action a taken in a given state s.

In this FrozenLake environment there are 16 states: each grid cell is a state. Four actions are possible in each state: Left, Right, Up and Down.
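To make the state numbering concrete, here is a minimal sketch that maps a state index back to its grid cell. It assumes Gym's row-major numbering for FrozenLake (state = row * 4 + column) and its integer action codes (0 = Left, 1 = Down, 2 = Right, 3 = Up). The names LAKE_MAP, ACTION_NAMES and state_to_tile are illustrative and not part of the Gym API.

```python
# Illustrative helper, not part of Gym: the 4x4 map from the description above.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

# Assumed action encoding: Gym's FrozenLake numbers the actions in this order.
ACTION_NAMES = ["Left", "Down", "Right", "Up"]


def state_to_tile(state, lake_map=LAKE_MAP):
    """Return (row, column, tile letter) for a row-major state index."""
    n_cols = len(lake_map[0])
    row, col = divmod(state, n_cols)
    return row, col, lake_map[row][col]


if __name__ == "__main__":
    # Start cell, one of the holes, and the goal cell.
    for state in (0, 5, 15):
        print(state, state_to_tile(state))
```

Under this numbering, state 0 is the start cell in the top-left corner and state 15 is the goal in the bottom-right corner.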

To begin our program, import the following libraries in your notebook:

import numpy as np
import gym
import random
import time
from IPython.display import clear_output

Now, we create the environment:

env = gym.make("FrozenLake-v0")

After this, we create our q_matrix, initialized to 0. Its 16 rows correspond to the 16 possible states, and its 4 columns to the 4 possible actions.

action_size = env.action_space.n
state_size = env.observation_space.n
q_matrix = np.zeros((state_size, action_size))
q_matrix

Next, we train the agent so that each entry of q_matrix holds an estimate of the best expected return for that state-action pair; the agent can then use the matrix to play the game. After many iterations, a good Q-table is ready.

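Mathematically, the standard Q-learning update is shown in the following equation, where alpha is the learning rate, gamma is the discount factor, r is the reward received, and s' is the new state reached after taking action a in state s:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$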

We also need to choose a learning rate, which is normally between 0.001 and 0.5. The exploration rate starts at 1 and slowly decays. Exploration keeps the agent from getting stuck on the same trajectory: it keeps trying different paths through the environment that may lead to higher returns. The discount rate is the gamma factor applied to future rewards; its value is generally between 0.9 and 0.999.

num_episodes = 10000
max_steps = 100
learning_rate = 0.1
discount_rate = 0.99
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.05
exploration_decay_rate = 0.0001
cumulative_rewards_all_episodes = []
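The settings above fix where the exploration rate starts and the floor it decays towards, but not how it gets from one to the other. A common choice, assumed here rather than taken from the article's attached code, is exponential decay per episode. The function name decayed_exploration_rate is hypothetical; its defaults mirror the values listed above.

```python
import math


def decayed_exploration_rate(episode, min_rate=0.05, max_rate=1.0,
                             decay_rate=0.0001):
    """Exponentially decay from max_rate towards min_rate (assumed schedule)."""
    return min_rate + (max_rate - min_rate) * math.exp(-decay_rate * episode)


if __name__ == "__main__":
    # Print the rate at the start, middle and end of a 10000-episode run.
    for episode in (0, 5000, 10000):
        print(episode, round(decayed_exploration_rate(episode), 3))
```

Under this assumed schedule the rate is still roughly 0.4 after 10000 episodes, so the agent keeps exploring a good deal throughout training. A larger exploration_decay_rate would make it settle on the learned values sooner.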

The full training code is long, so it is attached to this article. Running it produces the updated q_matrix that the agent has learned by the end of 10000 games.
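For readers without the attachment, the following is a minimal sketch of what a tabular Q-learning training loop can look like; it is not the author's attached code. It assumes the classic Gym API used elsewhere in this article (reset() returns the state, step() returns four values), an epsilon-greedy action choice, and an exponential exploration decay. The function name train and its rng parameter are hypothetical.

```python
import math
import random


def train(env, q_matrix, num_episodes=10000, max_steps=100,
          learning_rate=0.1, discount_rate=0.99,
          max_exploration_rate=1.0, min_exploration_rate=0.05,
          exploration_decay_rate=0.0001, rng=random):
    """Update q_matrix in place; return the total reward of each episode."""
    rewards_all_episodes = []
    exploration_rate = max_exploration_rate

    for episode in range(num_episodes):
        state = env.reset()  # assumption: classic API, reset() returns the state
        episode_reward = 0

        for _step in range(max_steps):
            n_actions = len(q_matrix[state])
            if rng.random() < exploration_rate:
                # Explore: pick a random action.
                action = rng.randrange(n_actions)
            else:
                # Exploit: pick the action with the highest Q-value.
                action = max(range(n_actions),
                             key=lambda a: q_matrix[state][a])

            # Assumption: classic API, step() returns four values.
            new_state, reward, done, info = env.step(action)

            # Q-learning update towards reward + discounted best next value.
            target = reward + discount_rate * max(q_matrix[new_state])
            q_matrix[state][action] += learning_rate * (
                target - q_matrix[state][action])

            state = new_state
            episode_reward += reward
            if done:
                break

        # Assumed schedule: exponential decay towards the minimum rate.
        exploration_rate = min_exploration_rate + (
            max_exploration_rate - min_exploration_rate
        ) * math.exp(-exploration_decay_rate * episode)
        rewards_all_episodes.append(episode_reward)

    return rewards_all_episodes
```

With the setup from this article it would be called as rewards = train(env, q_matrix), which leaves the learned values in q_matrix. The indexing is written so that it works both with a NumPy array and with a plain list of lists.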

With this knowledge, the agent can now play the game.

# Watch our agent play Frozen Lake by playing the best action
# from each state according to the Q-matrix
for episode in range(3):
    # Initialize new episode params
    state = env.reset()
    done = False
    print("*****EPISODE ", episode+1, "*****\n\n\n\n")
    time.sleep(1)

    for step in range(max_steps):
        # Show current state of environment on screen
        clear_output(wait=True)
        env.render()
        time.sleep(0.3)

        # Choose action with highest Q-value for current state
        action = np.argmax(q_matrix[state,:])
        # Take new action
        new_state, reward, done, info = env.step(action)

        if done:
            clear_output(wait=True)
            env.render()
            if reward == 1:
                # Agent reached the goal and won episode
                print("****You reached the goal!****")
                time.sleep(3)
            else:
                # Agent stepped in a hole and lost episode
                print("****You fell through a hole!****")
                time.sleep(3)
            clear_output(wait=True)
            break

        # Set new state
        state = new_state

env.close()

The above code will render the environment as the agent plays the game.
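Watching three episodes gives only a rough impression, because the slippery ice makes the outcome random. As a hedged complement, the sketch below plays many episodes greedily without rendering and reports the fraction that reach the goal. It assumes the same classic Gym API as the code above; the function name evaluate is hypothetical.

```python
def evaluate(env, q_matrix, num_episodes=100, max_steps=100):
    """Play greedily and return the fraction of episodes that reach the goal."""
    wins = 0
    for _episode in range(num_episodes):
        state = env.reset()  # assumption: classic API, reset() returns the state
        for _step in range(max_steps):
            row = q_matrix[state]
            # Greedy choice: the action with the highest Q-value in this state.
            action = max(range(len(row)), key=lambda a: row[a])
            # Assumption: classic API, step() returns four values.
            state, reward, done, info = env.step(action)
            if done:
                # A reward of 1 means the goal was reached; 0 means a hole.
                wins += int(reward == 1)
                break
    return wins / num_episodes
```

Calling evaluate(env, q_matrix) after training returns a value between 0 and 1. Episodes that hit max_steps without finishing count as losses.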