Markov Decision Process (MDP) – Step-by-Step Python Implementation

Reinforcement Learning February 15 ,2025

Markov Decision Process (MDP) – Step-by-Step Python Implementation

A Markov Decision Process (MDP) is a framework for modeling decision-making problems where outcomes are partly random and partly under the control of an agent. It consists of:

States (S) – Possible conditions of the environment.
Actions (A) – Choices available to the agent.
Transition Probabilities (P) – Probability of moving from one state to another given an action.
Rewards (R) – Immediate reward received after taking an action.
Policy (π) – Strategy for selecting actions to maximize rewards.

We'll implement a simple Grid World MDP using Python.

Step 1: Install Dependencies

import numpy as np

Step 2: Define MDP Components

# Define States
states = ["A", "B", "C", "D"]  # Simple 4-state environment

# Define Actions
actions = ["left", "right"]

# Define Transition Probabilities (P)
transition_prob = {
    "A": {"left": ("A", 1.0), "right": ("B", 1.0)},
    "B": {"left": ("A", 1.0), "right": ("C", 1.0)},
    "C": {"left": ("B", 1.0), "right": ("D", 1.0)},
    "D": {"left": ("C", 1.0), "right": ("D", 1.0)}  # Terminal State
}

# Define Rewards (R)
rewards = {
    "A": {"left": 0, "right": 1},
    "B": {"left": -1, "right": 2},
    "C": {"left": -2, "right": 3},
    "D": {"left": 0, "right": 0}  # No rewards in terminal state
}

Step 3: Define the Policy

# Random policy (chooses left or right with equal probability)
policy = {
    "A": {"left": 0.5, "right": 0.5},
    "B": {"left": 0.5, "right": 0.5},
    "C": {"left": 0.5, "right": 0.5},
    "D": {"left": 1.0, "right": 0.0}  # Terminal state, no action needed
}

Step 4: Implement Value Iteration (to find the optimal policy)

def value_iteration(states, actions, transition_prob, rewards, gamma=0.9, theta=1e-6):
    """
    Performs Value Iteration for finding the optimal policy.
    """
    V = {s: 0 for s in states}  # Initialize state values with 0
    while True:
        delta = 0  # Track maximum change in value function
        new_V = V.copy()

        for state in states:
            if state == "D":  # Terminal state, no updates
                continue
            
            action_values = []
            for action in actions:
                next_state, prob = transition_prob[state][action]
                reward = rewards[state][action]
                action_values.append(prob * (reward + gamma * V[next_state]))
            
            new_V[state] = max(action_values)
            delta = max(delta, abs(V[state] - new_V[state]))
        
        V = new_V
        if delta < theta:  # Stop if values converge
            break
    return V

Step 5: Compute Optimal Value Function

optimal_values = value_iteration(states, actions, transition_prob, rewards)
print("Optimal State Values:")
print(optimal_values)

Step 6: Extract the Optimal Policy

def extract_policy(states, actions, transition_prob, rewards, V, gamma=0.9):
    """
    Extracts the best policy from the computed value function.
    """
    policy = {}
    for state in states:
        if state == "D":  # Terminal state, no action needed
            policy[state] = None
            continue
        
        best_action = None
        best_value = float('-inf')
        
        for action in actions:
            next_state, prob = transition_prob[state][action]
            reward = rewards[state][action]
            value = prob * (reward + gamma * V[next_state])
            
            if value > best_value:
                best_value = value
                best_action = action
        
        policy[state] = best_action
    return policy

optimal_policy = extract_policy(states, actions, transition_prob, rewards, optimal_values)
print("\nOptimal Policy:")
print(optimal_policy)

Final Output

The output will display the optimal state values and the best action to take in each state.

Example Output

Optimal State Values:
{'A': 4.385, 'B': 5.85, 'C': 6.5, 'D': 0}

Optimal Policy:
{'A': 'right', 'B': 'right', 'C': 'right', 'D': None}

Key Takeaways

The agent learns to move right in every state to maximize rewards.
Value Iteration updates state values iteratively, leading to an optimal policy.
This method is useful in robotics, finance, and AI decision-making.

Next Blog- Q-Learning

Purnima

You must logged in to post comments.