Python Implementation of Q-Learning
We will implement Q-learning using the FrozenLake-v1 environment from OpenAI Gym, where the goal is to navigate a frozen lake and reach the goal without falling into holes.
Step 1: Install Required Libraries
!pip install gym numpy matplotlib seaborn
Step 2: Import Necessary Libraries
import numpy as np
import gym
import random
- numpy → For handling numerical operations.
- gym → OpenAI Gym provides the FrozenLake environment.
- random → Used for exploration strategies.
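Optionally (this is an addition, not part of the original steps), you can seed Python's random module so the epsilon-greedy exploration is reproducible across runs:
# Optional: fix the seed of Python's random module, which drives the
# epsilon-greedy coin flip (random.uniform) in the training loop below
random.seed(42)
# Note: env.action_space.sample() draws from the action space's own RNG;
# after creating the environment in Step 3 you can seed that too with
# env.action_space.seed(42)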
Step 3: Initialize Environment and Q-Table
# Create the FrozenLake environment
env = gym.make("FrozenLake-v1", is_slippery=False) # `is_slippery=False` makes it deterministic
# Get number of states and actions
state_space = env.observation_space.n # Number of states
action_space = env.action_space.n # Number of actions
# Initialize the Q-table with zeros
Q_table = np.zeros((state_space, action_space))
print("State space:", state_space)
print("Action space:", action_space)
print("Initial Q-table:\n", Q_table)
- env.observation_space.n → Number of discrete states (16 on the default 4×4 map).
- env.action_space.n → Number of possible actions (4: left, down, right, up).
- Q_table → A matrix of size (state_space × action_space), initialized to zeros. To see the map these states come from, try the sketch below.
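One way to print the map the agent is navigating (assuming gym ≥ 0.26, where the render mode is passed to gym.make) is to create a copy of the environment with render_mode="ansi" and print its text rendering:
# Sketch: print the default 4x4 map (S = start, F = frozen, H = hole, G = goal)
demo_env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="ansi")
demo_env.reset()
print(demo_env.render())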
Step 4: Set Hyperparameters
# Hyperparameters
alpha = 0.1            # Learning rate
gamma = 0.9            # Discount factor
epsilon = 1.0          # Initial exploration rate
epsilon_decay = 0.995  # Decay rate for epsilon
epsilon_min = 0.01     # Minimum epsilon value
episodes = 1000        # Total number of training episodes
max_steps = 100        # Max steps per episode
- alpha (Learning Rate): Controls how much new information overrides old Q-values.
- gamma (Discount Factor): Determines how much weight future rewards carry relative to immediate ones.
- epsilon (Exploration Rate): Balances exploration (random actions) and exploitation (best-known action).
- epsilon_decay: Gradually reduces exploration over time (see the quick calculation below).
- epsilon_min: Floor that preserves a small amount of exploration even late in training.
- episodes: The number of times the agent plays the game during training.
- max_steps: Limits steps per episode to prevent infinite loops.
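To see where this schedule ends up, here is a quick closed-form check of epsilon after k episodes (this is just the per-episode multiplication unrolled, not part of the training code):
# epsilon after k episodes: max(epsilon_min, 1.0 * 0.995**k)
for k in (0, 100, 500, 1000):
    print(k, max(0.01, 1.0 * 0.995 ** k))
# 0.995**1000 is about 0.0067, so epsilon hits the floor of 0.01 around
# episode 920 and the agent keeps a little exploration for the rest of training.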
Step 5: Implement Q-Learning Algorithm
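The agent repeatedly interacts with the environment and updates the table with the standard Q-learning rule:

Q(s, a) ← Q(s, a) + α · [r + γ · max_a′ Q(s′, a′) − Q(s, a)]

where s is the current state, a the chosen action, r the observed reward, and s′ the next state. The code below implements exactly this update inside an epsilon-greedy training loop.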
# Train the Q-learning agent
for episode in range(episodes):
    state = env.reset()[0]  # reset() returns (observation, info) in gym >= 0.26
    done = False

    for step in range(max_steps):
        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore: choose a random action
        else:
            action = np.argmax(Q_table[state, :])  # Exploit: choose the best-known action

        # Take the action and observe the result
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # Episode ends on termination or truncation

        # Q-learning update
        Q_table[state, action] = Q_table[state, action] + alpha * (
            reward + gamma * np.max(Q_table[next_state, :]) - Q_table[state, action]
        )

        # Move to the next state
        state = next_state
        if done:
            break  # Stop the episode if the game is over

    # Decay epsilon after each episode
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

print("Training completed!")
Step 6: Evaluate the Trained Agent
# Test the trained Q-learning agent
test_episodes = 10
test_rewards = []

for episode in range(test_episodes):
    state = env.reset()[0]
    done = False
    total_reward = 0

    for step in range(max_steps):
        action = np.argmax(Q_table[state, :])  # Always exploit the best-known action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
        state = next_state
        if done:
            break

    test_rewards.append(total_reward)
    print(f"Episode {episode + 1}: Reward = {total_reward}")

print(f"Average reward over {test_episodes} episodes: {np.mean(test_rewards)}")
- The agent now always picks the best action (greedy policy).
- The average reward measures how reliably the agent reaches the goal.
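Since the only nonzero reward in FrozenLake is the +1 for reaching the goal, the average reward is exactly the success rate, and you can report it as a percentage:
# Fraction of test episodes in which the agent reached the goal
success_rate = np.mean(np.array(test_rewards) > 0)
print(f"Success rate: {success_rate:.0%}")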
Step 7: Visualize the Final Q-Table
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
sns.heatmap(Q_table, annot=True, cmap="coolwarm")
plt.xlabel("Actions")
plt.ylabel("States")
plt.title("Final Q-Table")
plt.show()
- The heatmap visualizes Q-values, showing which actions are best for each state.
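Another way to read the table is as a policy. Here is a minimal sketch that takes the greedy action per state and lays it out on the grid, assuming the default 4×4 map and the standard FrozenLake action encoding (0 = left, 1 = down, 2 = right, 3 = up):
# Greedy policy per state, arranged on the 4x4 grid
arrows = np.array(["←", "↓", "→", "↑"])  # action indices 0..3
policy = np.argmax(Q_table, axis=1)      # best action for each of the 16 states
print(arrows[policy].reshape(4, 4))
# States the agent never acts in (holes, the goal) keep all-zero Q-values,
# so argmax defaults to action 0 (←) there.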
Final Thoughts
In this tutorial, we:
- Initialized the environment and the Q-table
- Implemented Q-learning with an epsilon-greedy strategy
- Updated Q-values using the Bellman equation
- Decayed epsilon to shift from exploration to exploitation
- Evaluated the trained agent
- Visualized the final Q-table
Key Takeaways
- Q-learning learns from trial and error.
- The epsilon-greedy strategy ensures exploration before settling on the best actions.
- The Q-table improves over time, leading to better decision-making.