Python Implementation of Q-Learning
We will implement Q-learning using the FrozenLake-v1 environment from OpenAI Gym, where the goal is to navigate a frozen lake and reach the goal without falling into holes.
Step 1: Install Required Libraries
!pip install gym numpy matplotlib seaborn
Step 2: Import Necessary Libraries
import numpy as np
import gym
import random
- numpy → For handling numerical operations.
- gym → OpenAI Gym provides the FrozenLake environment.
- random → Used for exploration strategies.
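Optionally (this is an addition, not part of the original steps), you can seed Python's random module so the epsilon-greedy exploration is reproducible across runs:
# Optional: fix the seed of Python's random module, which drives the
# epsilon-greedy coin flip (random.uniform) in the training loop below
random.seed(42)
# Note: env.action_space.sample() draws from the action space's own RNG;
# after creating the environment in Step 3 you can seed that too with
# env.action_space.seed(42)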
Step 3: Initialize Environment and Q-Table
# Create the FrozenLake environment
env = gym.make("FrozenLake-v1", is_slippery=False) # `is_slippery=False` makes it deterministic
# Get number of states and actions
state_space = env.observation_space.n # Number of states
action_space = env.action_space.n # Number of actions
# Initialize the Q-table with zeros
Q_table = np.zeros((state_space, action_space))
print("State space:", state_space)
print("Action space:", action_space)
print("Initial Q-table:\n", Q_table)
- env.observation_space.n → Number of discrete states (16 on the default 4×4 map).
- env.action_space.n → Number of possible actions (4: left, down, right, up).
- Q_table → A matrix of size (state_space × action_space), initialized to zeros. To see the map these states come from, try the sketch below.
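One way to print the map the agent is navigating (assuming gym ≥ 0.26, where the render mode is passed to gym.make) is to create a copy of the environment with render_mode="ansi" and print its text rendering:
# Sketch: print the default 4x4 map (S = start, F = frozen, H = hole, G = goal)
demo_env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="ansi")
demo_env.reset()
print(demo_env.render())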
Step 4: Set Hyperparameters
# Hyperparameters
alpha = 0.1            # Learning rate
gamma = 0.9            # Discount factor
epsilon = 1.0          # Initial exploration rate
epsilon_decay = 0.995  # Decay rate for epsilon
epsilon_min = 0.01     # Minimum epsilon value
episodes = 1000        # Total number of training episodes
max_steps = 100        # Max steps per episode
- alpha (Learning Rate): Controls how much new information overrides old Q-values.
- gamma (Discount Factor): Determines how much weight future rewards carry relative to immediate ones.
- epsilon (Exploration Rate): Balances exploration (random actions) and exploitation (best-known action).
- epsilon_decay: Gradually reduces exploration over time (see the quick calculation below).
- epsilon_min: Floor that preserves a small amount of exploration even late in training.
- episodes: The number of times the agent plays the game during training.
- max_steps: Limits steps per episode to prevent infinite loops.
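To see where this schedule ends up, here is a quick closed-form check of epsilon after k episodes (this is just the per-episode multiplication unrolled, not part of the training code):
# epsilon after k episodes: max(epsilon_min, 1.0 * 0.995**k)
for k in (0, 100, 500, 1000):
    print(k, max(0.01, 1.0 * 0.995 ** k))
# 0.995**1000 is about 0.0067, so epsilon hits the floor of 0.01 around
# episode 920 and the agent keeps a little exploration for the rest of training.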
Step 5: Implement Q-Learning Algorithm
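The agent repeatedly interacts with the environment and updates the table with the standard Q-learning rule:

Q(s, a) ← Q(s, a) + α · [r + γ · max_a′ Q(s′, a′) − Q(s, a)]

where s is the current state, a the chosen action, r the observed reward, and s′ the next state. The code below implements exactly this update inside an epsilon-greedy training loop.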
# Train the Q-learning agent
for episode in range(episodes):
    state = env.reset()[0]  # reset() returns (observation, info) in gym >= 0.26
    done = False

    for step in range(max_steps):
        # Epsilon-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # Explore: choose a random action
        else:
            action = np.argmax(Q_table[state, :])  # Exploit: choose the best-known action

        # Take the action and observe the result
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated  # Episode ends on termination or truncation

        # Q-learning update
        Q_table[state, action] = Q_table[state, action] + alpha * (
            reward + gamma * np.max(Q_table[next_state, :]) - Q_table[state, action]
        )

        # Move to the next state
        state = next_state
        if done:
            break  # Stop the episode if the game is over

    # Decay epsilon after each episode
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

print("Training completed!")
Step 6: Evaluate the Trained Agent
# Test the trained Q-learning agent
test_episodes = 10
test_rewards = []

for episode in range(test_episodes):
    state = env.reset()[0]
    done = False
    total_reward = 0

    for step in range(max_steps):
        action = np.argmax(Q_table[state, :])  # Always exploit the best-known action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total_reward += reward
        state = next_state
        if done:
            break

    test_rewards.append(total_reward)
    print(f"Episode {episode + 1}: Reward = {total_reward}")

print(f"Average reward over {test_episodes} episodes: {np.mean(test_rewards)}")
- The agent now always picks the best action (greedy policy).
- The average reward measures how reliably the agent reaches the goal.
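Since the only nonzero reward in FrozenLake is the +1 for reaching the goal, the average reward is exactly the success rate, and you can report it as a percentage:
# Fraction of test episodes in which the agent reached the goal
success_rate = np.mean(np.array(test_rewards) > 0)
print(f"Success rate: {success_rate:.0%}")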
Step 7: Visualize the Final Q-Table
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
sns.heatmap(Q_table, annot=True, cmap="coolwarm")
plt.xlabel("Actions")
plt.ylabel("States")
plt.title("Final Q-Table")
plt.show()
- The heatmap visualizes Q-values, showing which actions are best for each state.
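Another way to read the table is as a policy. Here is a minimal sketch that takes the greedy action per state and lays it out on the grid, assuming the default 4×4 map and the standard FrozenLake action encoding (0 = left, 1 = down, 2 = right, 3 = up):
# Greedy policy per state, arranged on the 4x4 grid
arrows = np.array(["←", "↓", "→", "↑"])  # action indices 0..3
policy = np.argmax(Q_table, axis=1)      # best action for each of the 16 states
print(arrows[policy].reshape(4, 4))
# States the agent never acts in (holes, the goal) keep all-zero Q-values,
# so argmax defaults to action 0 (←) there.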
Final Thoughts
In this tutorial, we:
- Initialized the environment and the Q-table
- Implemented Q-learning with an epsilon-greedy strategy
- Updated Q-values using the Bellman equation
- Decayed epsilon to shift from exploration to exploitation
- Evaluated the trained agent
- Visualized the final Q-table
Key Takeaways
- Q-learning learns from trial and error.
- The epsilon-greedy strategy ensures exploration before settling on the best actions.
- The Q-table improves over time, leading to better decision-making.