Reinforcement Learning | February 02, 2025

Markov Decision Process (MDP) – Step-by-Step Python Implementation

A Markov Decision Process (MDP) is a framework for modeling decision-making problems where outcomes are partly random and partly under the control of an agent. It consists of:

  1. States (S) – Possible conditions of the environment.
  2. Actions (A) – Choices available to the agent.
  3. Transition Probabilities (P) – Probability of moving from one state to another given an action.
  4. Rewards (R) – Immediate reward received after taking an action.
  5. Discount Factor (γ) – How much future rewards count relative to immediate ones (we use γ = 0.9 below).
  6. Policy (π) – The agent's strategy for selecting actions; solving the MDP means finding a policy that maximizes cumulative reward (see the return formula just after this list).
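
Formally, the agent tries to maximize the expected discounted return G = r_1 + γ·r_2 + γ²·r_3 + ..., so a reward received k steps in the future is weighted by γᵏ. With γ = 0.9, future rewards still matter, but the agent prefers collecting them sooner.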

We'll implement a simple four-state MDP (a one-dimensional grid world) using Python.

Step 1: Import Dependencies

import numpy as np  # not strictly needed for this example, which uses only built-in dicts

Step 2: Define MDP Components

# Define States
states = ["A", "B", "C", "D"]  # Simple 4-state environment

# Define Actions
actions = ["left", "right"]

# Define Transition Probabilities (P)
# Each (state, action) pair maps to a single (next_state, probability) pair,
# so this small environment is deterministic.
transition_prob = {
    "A": {"left": ("A", 1.0), "right": ("B", 1.0)},
    "B": {"left": ("A", 1.0), "right": ("C", 1.0)},
    "C": {"left": ("B", 1.0), "right": ("D", 1.0)},
    "D": {"left": ("C", 1.0), "right": ("D", 1.0)}  # Terminal State
}

# Define Rewards (R)
rewards = {
    "A": {"left": 0, "right": 1},
    "B": {"left": -1, "right": 2},
    "C": {"left": -2, "right": 3},
    "D": {"left": 0, "right": 0}  # No rewards in terminal state
}
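
As a quick sanity check, here is a small helper (a sketch of ours, not part of the tutorial's value-iteration code) that looks up the next state and reward for a state-action pair using the two dictionaries above:

def step(state, action):
    """Return (next_state, reward) for a given state-action pair."""
    next_state, prob = transition_prob[state][action]
    reward = rewards[state][action]
    return next_state, reward

print(step("A", "right"))  # ('B', 1): moving right from A lands in B with reward 1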

Step 3: Define an Initial (Random) Policy

# Random policy (chooses left or right with equal probability)
policy = {
    "A": {"left": 0.5, "right": 0.5},
    "B": {"left": 0.5, "right": 0.5},
    "C": {"left": 0.5, "right": 0.5},
    "D": {"left": 1.0, "right": 0.0}  # Terminal state, no action needed
}
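
This uniform random policy is only for illustration; value iteration in the next step computes the optimal policy directly, without using it. To see what the random policy actually does, here is a minimal rollout sketch (the run_episode helper is our addition), using Python's built-in random module:

import random

def run_episode(policy, start="A", max_steps=20):
    """Simulate one episode under a stochastic policy and return the total reward."""
    state, total_reward = start, 0
    for _ in range(max_steps):
        if state == "D":  # terminal state reached
            break
        # Sample an action according to the policy's probabilities
        action = random.choices(list(policy[state]), weights=list(policy[state].values()))[0]
        total_reward += rewards[state][action]
        state, _ = transition_prob[state][action]
    return total_reward

print(run_episode(policy))  # total reward varies from run to run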

Step 4: Implement Value Iteration (to find the optimal policy)
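
Value iteration repeatedly applies the Bellman optimality backup V(s) ← max over actions a of [ R(s, a) + γ * V(s') ], where s' is the state reached from s under action a. Because every transition in this environment is deterministic, the usual expectation over next states reduces to the single term coded below.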

def value_iteration(states, actions, transition_prob, rewards, gamma=0.9, theta=1e-6):
    """
    Performs value iteration to compute the optimal state-value function.
    """
    V = {s: 0 for s in states}  # Initialize state values with 0
    while True:
        delta = 0  # Track maximum change in value function
        new_V = V.copy()  # synchronous update: new values are computed from the previous sweep's V

        for state in states:
            if state == "D":  # Terminal state, no updates
                continue
            
            action_values = []
            for action in actions:
                next_state, prob = transition_prob[state][action]
                reward = rewards[state][action]
                action_values.append(prob * (reward + gamma * V[next_state]))
            
            new_V[state] = max(action_values)
            delta = max(delta, abs(V[state] - new_V[state]))
        
        V = new_V
        if delta < theta:  # Stop if values converge
            break
    return V

Step 5: Compute Optimal Value Function

optimal_values = value_iteration(states, actions, transition_prob, rewards)
print("Optimal State Values:")
print(optimal_values)

Step 6: Extract the Optimal Policy

def extract_policy(states, actions, transition_prob, rewards, V, gamma=0.9):
    """
    Extracts the best policy from the computed value function.
    """
    policy = {}
    for state in states:
        if state == "D":  # Terminal state, no action needed
            policy[state] = None
            continue
        
        best_action = None
        best_value = float('-inf')
        
        for action in actions:
            next_state, prob = transition_prob[state][action]
            reward = rewards[state][action]
            value = prob * (reward + gamma * V[next_state])
            
            if value > best_value:
                best_value = value
                best_action = action
        
        policy[state] = best_action
    return policy

optimal_policy = extract_policy(states, actions, transition_prob, rewards, optimal_values)
print("\nOptimal Policy:")
print(optimal_policy)

Final Output

The output displays the optimal state values and the best action to take in each state (floating-point values below are rounded for readability).

Example Output

Optimal State Values:
{'A': 5.23, 'B': 4.7, 'C': 3.0, 'D': 0}

Optimal Policy:
{'A': 'right', 'B': 'right', 'C': 'right', 'D': None}
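
These numbers can be verified by hand against the Bellman backup: V(D) = 0; V(C) = max(3 + 0.9*0, -2 + 0.9*4.7) = max(3, 2.23) = 3; V(B) = max(2 + 0.9*3, -1 + 0.9*5.23) = max(4.7, 3.707) = 4.7; V(A) = max(1 + 0.9*4.7, 0 + 0.9*5.23) = max(5.23, 4.707) = 5.23. In every state the "right" branch wins, which is exactly the extracted policy.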

Key Takeaways

  • The agent learns to move right in every state to maximize rewards.
  • Value Iteration updates state values iteratively, leading to an optimal policy.
  • This method is useful in robotics, finance, and AI decision-making.

Next Blog: Q-Learning

Purnima