Python implementation of Deep Q-Networks (DQN)
This tutorial walks through a Python implementation of Deep Q-Networks (DQN) using TensorFlow/Keras and OpenAI Gym.
Step 1: Install Dependencies
First, install the required libraries if you haven't already:
pip install numpy gym tensorflow
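Note that the code below uses the Gym API introduced in version 0.26, where env.reset() returns (observation, info) and env.step() returns five values. If you are on an older release, upgrade (or adapt the unpacking accordingly):

pip install "gym>=0.26"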
Step 2: Import Required Libraries
We start by importing the necessary libraries for reinforcement learning, neural networks, and handling the environment.
import numpy as np
import random
import gym
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from collections import deque
Step 3: Create the DQN Agent
The DQN agent consists of:
- A neural network to predict Q-values.
- Experience replay to store and reuse past experiences.
- A target network for stable training.
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size      # Number of state features in the environment
        self.action_size = action_size    # Number of possible actions
        self.memory = deque(maxlen=2000)  # Experience replay buffer
        self.gamma = 0.95                 # Discount factor
        self.epsilon = 1.0                # Initial exploration rate
        self.epsilon_min = 0.01           # Minimum exploration rate
        self.epsilon_decay = 0.995        # Decay rate for epsilon
        self.learning_rate = 0.001        # Learning rate for the neural network
        self.model = self.build_model()         # Primary Q-network
        self.target_model = self.build_model()  # Target Q-network
        self.update_target_model()              # Initialize target model weights

    def build_model(self):
        """Builds the neural network model for DQN"""
        model = Sequential([
            Dense(24, input_dim=self.state_size, activation='relu'),
            Dense(24, activation='relu'),
            Dense(self.action_size, activation='linear')  # Q-values for each action
        ])
        model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))
        return model

    def update_target_model(self):
        """Copies weights from the primary model to the target model"""
        self.target_model.set_weights(self.model.get_weights())

    def remember(self, state, action, reward, next_state, done):
        """Stores the experience in memory for experience replay"""
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        """Selects an action using the epsilon-greedy strategy"""
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_size)    # Exploration (random action)
        q_values = self.model.predict(state, verbose=0)  # Exploitation (use model)
        return np.argmax(q_values[0])                    # Action with highest Q-value

    def replay(self, batch_size):
        """Trains the network using randomly sampled experiences"""
        if len(self.memory) < batch_size:
            return
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = self.model.predict(state, verbose=0)
            if done:
                target[0][action] = reward  # No future reward if the episode ends
            else:
                t = self.target_model.predict(next_state, verbose=0)
                target[0][action] = reward + self.gamma * np.amax(t)
            self.model.fit(state, target, epochs=1, verbose=0)
        # Reduce epsilon (less exploration over time)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
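Before wiring the agent to a real environment, a quick smoke test with dummy data can confirm that the pieces fit together. The sketch below is purely illustrative (the shapes are chosen to match CartPole: 4 state features, 2 actions):

# Illustrative smoke test with made-up values (not part of training)
agent = DQNAgent(state_size=4, action_size=2)   # CartPole-like dimensions
dummy_state = np.zeros((1, 4))                  # a single state, already reshaped to (1, state_size)
action = agent.act(dummy_state)                 # epsilon-greedy action (0 or 1)
agent.remember(dummy_state, action, 1.0, dummy_state, False)
agent.replay(batch_size=32)                     # returns immediately: buffer still too small
print("Sampled action:", action)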
Step 4: Train the DQN Agent
We use OpenAI Gym's CartPole-v1 environment as an example.
# Initialize the environment
env = gym.make("CartPole-v1")
state_size = env.observation_space.shape[0] # Number of state features
action_size = env.action_space.n # Number of possible actions
agent = DQNAgent(state_size, action_size) # Create the agent
# Training parameters
episodes = 1000 # Total number of episodes
batch_size = 32 # Number of samples for experience replay
for e in range(episodes):
    state, _ = env.reset()
    state = np.reshape(state, [1, state_size])  # Reshape state for network input
    for time in range(500):  # Max steps per episode
        action = agent.act(state)  # Choose action using the epsilon-greedy policy
        next_state, reward, terminated, truncated, _ = env.step(action)  # Take action
        done = terminated or truncated  # Episode ends on failure or time limit
        next_state = np.reshape(next_state, [1, state_size])  # Reshape next state
        agent.remember(state, action, reward, next_state, done)  # Store experience
        state = next_state  # Update current state
        if done:
            print(f"Episode {e+1}/{episodes}: Score {time}")
            agent.update_target_model()  # Update target network every episode
            break
    # Train the model using experience replay
    agent.replay(batch_size)

env.close()  # Close environment
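Training for 1000 episodes can take a while, so it may be worth checkpointing the model periodically from inside the episode loop. A minimal sketch (the interval and filename are arbitrary choices; saving is covered in Step 5 below):

# Optional: place inside the episode loop, e.g. right after agent.replay(batch_size)
if (e + 1) % 50 == 0:                    # arbitrary checkpoint interval
    agent.model.save("dqn_cartpole.h5")  # same save call as in Step 5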
Step 5: Save and Load the Model
To save the trained model:
agent.model.save("dqn_cartpole.h5")
To load a saved model:
from tensorflow.keras.models import load_model
agent.model = load_model("dqn_cartpole.h5")
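Note that load_model restores only the primary Q-network. If you plan to resume training with the loaded weights, re-sync the target network afterwards:

agent.update_target_model()  # copy the loaded weights into the target network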
Step 6: Test the Trained Model
After training, let's run the agent to see how well it performs.
# Re-create the environment with rendering enabled (the previous one was closed).
# Rendering classic-control environments may require pygame (pip install pygame).
env = gym.make("CartPole-v1", render_mode="human")

for e in range(10):  # Run 10 test episodes
    state, _ = env.reset()
    state = np.reshape(state, [1, state_size])
    for time in range(500):
        # With render_mode="human", the environment window is drawn automatically each step
        action = np.argmax(agent.model.predict(state, verbose=0)[0])  # Greedy policy
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = np.reshape(next_state, [1, state_size])
        state = next_state
        if done:
            print(f"Test Episode {e+1}: Score {time}")
            break

env.close()  # Close environment
Key Takeaways
- Neural Network as Function Approximator: The DQN uses a deep neural network to predict Q-values for each action, enabling it to handle high-dimensional state spaces.
- Experience Replay: A replay buffer stores past experiences, allowing the model to learn from randomly sampled mini-batches. This approach helps break the correlation between consecutive experiences, leading to more stable learning.
- Target Network: A separate target network provides stable target Q-values. It is updated periodically with the weights of the primary network, which reduces rapid fluctuations and aids convergence.
- Epsilon-Greedy Exploration: The agent balances exploration (trying new actions) with exploitation (using known good actions) by decaying the exploration rate (epsilon) over time.
- Loss Function and Optimization: The loss function (mean squared error) measures the difference between the predicted Q-values and the target Q-values derived from the Bellman equation. Gradient descent (or variants like Adam) is used to minimize this loss and update the network parameters. (See the worked equation after this list.)
- Hyperparameter Tuning: Parameters such as learning rate, discount factor, epsilon decay, and batch size are crucial for the performance and stability of the DQN. Fine-tuning these can significantly improve learning outcomes.
- Practical Implementation: The implementation shows how to set up the DQN agent, integrate it with an OpenAI Gym environment (like CartPole), and manage training, saving, and testing processes.
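For reference, the update performed in replay() corresponds to the standard DQN target, where $\theta$ denotes the weights of the primary network and $\theta^-$ those of the target network:

$$
y =
\begin{cases}
r & \text{if the episode terminates at } (s, a),\\
r + \gamma \max_{a'} Q_{\theta^-}(s', a') & \text{otherwise,}
\end{cases}
$$

and the network is trained by minimizing the mean squared error between its prediction and this target:

$$
L(\theta) = \big( y - Q_{\theta}(s, a) \big)^2 .
$$

In the code, $\gamma$ is self.gamma, $Q_{\theta^-}$ is the target_model, and model.fit performs one gradient step on this loss with the Adam optimizer.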