Python implementation of Deep Q-Networks (DQN)
This tutorial walks through a Python implementation of Deep Q-Networks (DQN) using TensorFlow/Keras and OpenAI Gym.
Step 1: Install Dependencies
First, install the required libraries if you haven't already:
pip install numpy gym tensorflow
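Note that the code below uses the Gym API introduced in version 0.26, where env.reset() returns (observation, info) and env.step() returns five values. If you are on an older release, upgrade (or adapt the unpacking accordingly):

pip install "gym>=0.26"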
Step 2: Import Required Libraries
We start by importing the necessary libraries for reinforcement learning, neural networks, and handling the environment.
import numpy as np
import random
import gym
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from collections import deque
Step 3: Create the DQN Agent
The DQN agent consists of:
- A neural network to predict Q-values.
- Experience replay to store and reuse past experiences.
- A target network for stable training.
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size      # Number of state features in the environment
        self.action_size = action_size    # Number of possible actions
        self.memory = deque(maxlen=2000)  # Experience replay buffer
        self.gamma = 0.95                 # Discount factor
        self.epsilon = 1.0                # Initial exploration rate
        self.epsilon_min = 0.01           # Minimum exploration rate
        self.epsilon_decay = 0.995        # Decay rate for epsilon
        self.learning_rate = 0.001        # Learning rate for the neural network
        self.model = self.build_model()         # Primary Q-network
        self.target_model = self.build_model()  # Target Q-network
        self.update_target_model()              # Initialize target model weights

    def build_model(self):
        """Builds the neural network model for DQN"""
        model = Sequential([
            Dense(24, input_dim=self.state_size, activation='relu'),
            Dense(24, activation='relu'),
            Dense(self.action_size, activation='linear')  # Q-values for each action
        ])
        model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def update_target_model(self):
        """Copies weights from the primary model to the target model"""
        self.target_model.set_weights(self.model.get_weights())

    def remember(self, state, action, reward, next_state, done):
        """Stores the experience in memory for experience replay"""
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        """Selects an action using the epsilon-greedy strategy"""
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)    # Exploration (random action)
        q_values = self.model.predict(state, verbose=0)  # Exploitation (use model)
        return np.argmax(q_values[0])                    # Action with highest Q-value

    def replay(self, batch_size):
        """Trains the network using randomly sampled experiences"""
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = self.model.predict(state, verbose=0)
            if done:
                target[0][action] = reward  # No future reward if the episode ends
            else:
                t = self.target_model.predict(next_state, verbose=0)
                target[0][action] = reward + self.gamma * np.amax(t)
            self.model.fit(state, target, epochs=1, verbose=0)
        # Reduce epsilon (less exploration over time)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
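Before wiring the agent to a real environment, a quick smoke test with dummy data can confirm that the pieces fit together. The sketch below is purely illustrative (the shapes are chosen to match CartPole: 4 state features, 2 actions):

# Illustrative smoke test with made-up values (not part of training)
agent = DQNAgent(state_size=4, action_size=2)   # CartPole-like dimensions
dummy_state = np.zeros((1, 4))                  # a single state, already reshaped to (1, state_size)
action = agent.act(dummy_state)                 # epsilon-greedy action (0 or 1)
agent.remember(dummy_state, action, 1.0, dummy_state, False)
agent.replay(batch_size=32)                     # returns immediately: buffer still too small
print("Sampled action:", action)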
Step 4: Train the DQN Agent
We use OpenAI Gym's CartPole-v1 environment as an example.
# Initialize the environment
env = gym.make("CartPole-v1")
state_size = env.observation_space.shape[0] # Number of state features
action_size = env.action_space.n # Number of possible actions
agent = DQNAgent(state_size, action_size) # Create the agent
# Training parameters
episodes = 1000 # Total number of episodes
batch_size = 32 # Number of samples for experience replay
for e in range(episodes):
    state, _ = env.reset()
    state = np.reshape(state, [1, state_size])  # Reshape state for network input
    for time in range(500):  # Max steps per episode
        action = agent.act(state)  # Choose action using the epsilon-greedy policy
        next_state, reward, terminated, truncated, _ = env.step(action)  # Take action
        done = terminated or truncated  # Episode ends on failure or time limit
        next_state = np.reshape(next_state, [1, state_size])  # Reshape next state
        agent.remember(state, action, reward, next_state, done)  # Store experience
        state = next_state  # Update current state
        if done:
            print(f"Episode {e+1}/{episodes}: Score {time}")
            agent.update_target_model()  # Update target network every episode
            break
    # Train the model using experience replay
    agent.replay(batch_size)

env.close()  # Close environment
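Training for 1000 episodes can take a while, so it may be worth checkpointing the model periodically from inside the episode loop. A minimal sketch (the interval and filename are arbitrary choices; saving is covered in Step 5 below):

# Optional: place inside the episode loop, e.g. right after agent.replay(batch_size)
if (e + 1) % 50 == 0:                    # arbitrary checkpoint interval
    agent.model.save("dqn_cartpole.h5")  # same save call as in Step 5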
Step 5: Save and Load the Model
To save the trained model:
agent.model.save("dqn_cartpole.h5")
To load a saved model:
from tensorflow.keras.models import load_model
agent.model = load_model("dqn_cartpole.h5")
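Note that load_model restores only the primary Q-network. If you plan to resume training with the loaded weights, re-sync the target network afterwards:

agent.update_target_model()  # copy the loaded weights into the target network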
Step 6: Test the Trained Model
After training, let's run the agent to see how well it performs.
# Re-create the environment with rendering enabled (the previous one was closed).
# Rendering classic-control environments may require pygame (pip install pygame).
env = gym.make("CartPole-v1", render_mode="human")

for e in range(10):  # Run 10 test episodes
    state, _ = env.reset()
    state = np.reshape(state, [1, state_size])
    for time in range(500):
        # With render_mode="human", the environment window is drawn automatically each step
        action = np.argmax(agent.model.predict(state, verbose=0)[0])  # Greedy policy
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = np.reshape(next_state, [1, state_size])
        state = next_state
        if done:
            print(f"Test Episode {e+1}: Score {time}")
            break

env.close()  # Close environment
Key Takeaways
- Neural Network as Function Approximator: The DQN uses a deep neural network to predict Q-values for each action, enabling it to handle high-dimensional state spaces.
- Experience Replay: A replay buffer stores past experiences, allowing the model to learn from randomly sampled mini-batches. This approach helps break the correlation between consecutive experiences, leading to more stable learning.
- Target Network: A separate target network provides stable target Q-values. It is updated periodically with the weights of the primary network, which reduces rapid fluctuations and aids convergence.
- Epsilon-Greedy Exploration: The agent balances exploration (trying new actions) with exploitation (using known good actions) by decaying the exploration rate (epsilon) over time.
- Loss Function and Optimization: The loss function (mean squared error) measures the difference between the predicted Q-values and the target Q-values derived from the Bellman equation. Gradient descent (or variants like Adam) is used to minimize this loss and update the network parameters. (See the worked equation after this list.)
- Hyperparameter Tuning: Parameters such as learning rate, discount factor, epsilon decay, and batch size are crucial for the performance and stability of the DQN. Fine-tuning these can significantly improve learning outcomes.
- Practical Implementation: The implementation shows how to set up the DQN agent, integrate it with an OpenAI Gym environment (like CartPole), and manage training, saving, and testing processes.
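For reference, the update performed in replay() corresponds to the standard DQN target, where $\theta$ denotes the weights of the primary network and $\theta^-$ those of the target network:

$$
y =
\begin{cases}
r & \text{if the episode terminates at } (s, a),\\
r + \gamma \max_{a'} Q_{\theta^-}(s', a') & \text{otherwise,}
\end{cases}
$$

and the network is trained by minimizing the mean squared error between its prediction and this target:

$$
L(\theta) = \big( y - Q_{\theta}(s, a) \big)^2 .
$$

In the code, $\gamma$ is self.gamma, $Q_{\theta^-}$ is the target_model, and model.fit performs one gradient step on this loss with the Adam optimizer.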