Markov Decision Process (MDP) – Step-by-Step Python Implementation
A Markov Decision Process (MDP) is a framework for modeling decision-making problems where outcomes are partly random and partly under the control of an agent. It consists of:
- States (S) – Possible conditions of the environment.
- Actions (A) – Choices available to the agent.
- Transition Probabilities (P) – Probability of moving from one state to another given an action.
- Rewards (R) – Immediate reward received after taking an action.
- Policy (π) – Strategy for selecting actions to maximize rewards.
We'll implement a simple Grid World MDP using Python.
Step 1: Install Dependencies
import numpy as np
Step 2: Define MDP Components
# Define States
states = ["A", "B", "C", "D"] # Simple 4-state environment
# Define Actions
actions = ["left", "right"]
# Define Transition Probabilities (P)
transition_prob = {
"A": {"left": ("A", 1.0), "right": ("B", 1.0)},
"B": {"left": ("A", 1.0), "right": ("C", 1.0)},
"C": {"left": ("B", 1.0), "right": ("D", 1.0)},
"D": {"left": ("C", 1.0), "right": ("D", 1.0)} # Terminal State
}
# Define Rewards (R)
rewards = {
"A": {"left": 0, "right": 1},
"B": {"left": -1, "right": 2},
"C": {"left": -2, "right": 3},
"D": {"left": 0, "right": 0} # No rewards in terminal state
}
Step 3: Define the Policy
# Random policy (chooses left or right with equal probability)
policy = {
"A": {"left": 0.5, "right": 0.5},
"B": {"left": 0.5, "right": 0.5},
"C": {"left": 0.5, "right": 0.5},
"D": {"left": 1.0, "right": 0.0} # Terminal state, no action needed
}
Step 4: Implement Value Iteration (to find the optimal policy)
def value_iteration(states, actions, transition_prob, rewards, gamma=0.9, theta=1e-6):
"""
Performs Value Iteration for finding the optimal policy.
"""
V = {s: 0 for s in states} # Initialize state values with 0
while True:
delta = 0 # Track maximum change in value function
new_V = V.copy()
for state in states:
if state == "D": # Terminal state, no updates
continue
action_values = []
for action in actions:
next_state, prob = transition_prob[state][action]
reward = rewards[state][action]
action_values.append(prob * (reward + gamma * V[next_state]))
new_V[state] = max(action_values)
delta = max(delta, abs(V[state] - new_V[state]))
V = new_V
if delta < theta: # Stop if values converge
break
return V
Step 5: Compute Optimal Value Function
optimal_values = value_iteration(states, actions, transition_prob, rewards)
print("Optimal State Values:")
print(optimal_values)
Step 6: Extract the Optimal Policy
def extract_policy(states, actions, transition_prob, rewards, V, gamma=0.9):
"""
Extracts the best policy from the computed value function.
"""
policy = {}
for state in states:
if state == "D": # Terminal state, no action needed
policy[state] = None
continue
best_action = None
best_value = float('-inf')
for action in actions:
next_state, prob = transition_prob[state][action]
reward = rewards[state][action]
value = prob * (reward + gamma * V[next_state])
if value > best_value:
best_value = value
best_action = action
policy[state] = best_action
return policy
optimal_policy = extract_policy(states, actions, transition_prob, rewards, optimal_values)
print("\nOptimal Policy:")
print(optimal_policy)
Final Output
The output will display the optimal state values and the best action to take in each state.
Example Output
Optimal State Values:
{'A': 4.385, 'B': 5.85, 'C': 6.5, 'D': 0}
Optimal Policy:
{'A': 'right', 'B': 'right', 'C': 'right', 'D': None}
Key Takeaways
- The agent learns to move right in every state to maximize rewards.
- Value Iteration updates state values iteratively, leading to an optimal policy.
- This method is useful in robotics, finance, and AI decision-making.