Reinforcement learning is a learning paradigm in which an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions and learns to maximize the cumulative reward over time.
Example: Training a robot to navigate a maze by giving it positive rewards for reaching the goal and negative rewards for hitting obstacles.
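To make this interaction loop concrete, here is a minimal sketch of how an agent, an environment, and a cumulative reward fit together. The CoinFlipEnv class and the purely random agent are hypothetical stand-ins used only to illustrate the interface; they are not part of the maze example below.

import random

# Hypothetical toy environment: the agent guesses a coin flip and is
# rewarded +1 for a correct guess and -1 for a wrong one.
class CoinFlipEnv:
    def step(self, action):
        outcome = random.choice([0, 1])           # the environment's response
        reward = 1 if action == outcome else -1   # reward or penalty
        return reward

env = CoinFlipEnv()
cumulative_reward = 0
for t in range(10):                   # one episode of 10 interactions
    action = random.choice([0, 1])    # a (random) agent picks an action
    reward = env.step(action)         # the environment returns a reward
    cumulative_reward += reward       # the quantity the agent tries to maximize
print("Cumulative reward:", cumulative_reward)

A learning agent would replace the random choice with a policy that improves from the observed rewards, which is exactly what the Q-learning example below does.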
Let's consider a simple example of reinforcement learning using Q-learning:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm

# Define the environment
environment = [
    [-1, -1, -1, -1,   0],  # Row 0: Penalty cells with one neutral cell at the end
    [-1, -1, -1,  0,  -1],  # Row 1: Penalty cells with one neutral cell
    [-1, -1,  0, -1, 100],  # Row 2: Penalty cells with the goal (100) at the end
    [-1,  0, -1, -1,  -1],  # Row 3: Penalty cells with one neutral cell
    [ 0, -1, -1, -1,  -1]   # Row 4: One neutral cell with penalty cells
]

# Define the Q-learning parameters
num_episodes = 1000
learning_rate = 0.5
discount_factor = 0.9
epsilon = 0.1

# Initialize the Q-table (the grid is square, so num_states covers both dimensions)
num_states = len(environment)
num_actions = 4
Q = np.zeros((num_states, num_states, num_actions))

# Define the reward function
def get_reward(state):
    return environment[state[0]][state[1]]

# Define the action mapping
actions = {
    0: (-1, 0),  # Up
    1: (0, 1),   # Right
    2: (1, 0),   # Down
    3: (0, -1)   # Left
}

# Q-learning algorithm
for episode in range(num_episodes):
    state = (0, 0)  # Start from the top-left corner
    while True:
        # Choose action (epsilon-greedy strategy)
        if np.random.uniform(0, 1) < epsilon:
            action = np.random.choice(num_actions)
        else:
            action = np.argmax(Q[state[0], state[1]])
        next_state = tuple(np.array(state) + np.array(actions[action]))
        # Check if the next state is valid
        if (0 <= next_state[0] < num_states) and (0 <= next_state[1] < len(environment[0])):
            reward = get_reward(next_state)
            Q[state[0], state[1], action] += learning_rate * (
                reward + discount_factor * np.max(Q[next_state[0], next_state[1]])
                - Q[state[0], state[1], action]
            )
            state = next_state
            # Check if the goal state is reached
            if reward == 100:
                break
        else:
            # If the next state is invalid, penalize the move so another action is chosen
            Q[state[0], state[1], action] -= learning_rate

# Print the optimal path
state = (0, 0)
path = [state]
while get_reward(state) != 100:
    action = np.argmax(Q[state[0], state[1]])
    next_state = tuple(np.array(state) + np.array(actions[action]))
    if (0 <= next_state[0] < num_states) and (0 <= next_state[1] < len(environment[0])):
        state = next_state
        path.append(state)
    else:
        break
print("Optimal path:", path)

# Visualize the path with arrows
fig, ax = plt.subplots()
cmap = plt.get_cmap('coolwarm')
bounds = [-1.5, -0.5, 0.5, 100.5]
norm = BoundaryNorm(bounds, cmap.N)
img = ax.imshow(environment, cmap=cmap, norm=norm)

# Draw arrows on the path
for i in range(len(path) - 1):
    start = path[i]
    end = path[i + 1]
    dx = end[1] - start[1]
    dy = end[0] - start[0]
    ax.arrow(start[1], start[0], dx, dy, head_width=0.2, head_length=0.2, fc='black', ec='black')

# Mark the start and goal
ax.text(0, 0, 'Start', ha='center', va='center', color='white', fontsize=12, fontweight='bold')
ax.text(4, 2, 'Goal', ha='center', va='center', color='white', fontsize=12, fontweight='bold')

# Set grid and labels
ax.set_xticks(np.arange(len(environment[0])))
ax.set_yticks(np.arange(len(environment)))
ax.set_xticklabels(np.arange(len(environment[0])))
ax.set_yticklabels(np.arange(len(environment)))
ax.grid(color='gray', linestyle='-', linewidth=0.5)
plt.colorbar(img, ticks=[-1, 0, 100], orientation='vertical', label='Reward')
plt.show()
Output:
Optimal path: [(0, 0), (0, 1), (1, 1), (1, 2), (1, 3), (1, 4), (2, 4)]

The plot shows the optimal path (maximizing reward) learned by the reinforcement learning algorithm.
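As a quick sanity check, the reward collected along the printed path can be re-computed directly from the environment grid. The snippet below repeats the grid and the printed path and sums the rewards of the visited cells, which for this path works out to -1 - 1 - 1 + 0 - 1 + 100 = 96.

# Re-compute the total reward collected along the printed path
environment = [
    [-1, -1, -1, -1,   0],
    [-1, -1, -1,  0,  -1],
    [-1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  -1],
    [ 0, -1, -1, -1,  -1]
]
path = [(0, 0), (0, 1), (1, 1), (1, 2), (1, 3), (1, 4), (2, 4)]
total = sum(environment[r][c] for r, c in path[1:])  # skip the start cell
print("Total reward along the path:", total)  # -1 - 1 - 1 + 0 - 1 + 100 = 96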
Explanation:
The environment is a 5x5 grid in which each cell holds a reward: -1 for penalty cells, 0 for neutral cells, and 100 for the goal. The Q-table stores one value per (row, column, action) combination, and get_reward() retrieves the reward for a given state. Training runs for a fixed number of episodes (num_episodes); in each episode the agent starts at the top-left corner and picks actions with an epsilon-greedy strategy, exploring a random action with probability epsilon and otherwise exploiting the action with the highest Q-value. After every valid move, the Q-value of the chosen (state, action) pair is updated toward the received reward plus the discounted best Q-value of the next state, while moves that would leave the grid are penalized. Once training finishes, the greedy path from the start to the goal is read off the Q-table and visualized on the grid.
Intuition: Reinforcement learning is like learning through trial and error. The agent interacts with the environment, receives rewards or penalties based on its actions, and learns to make better decisions over time. By exploring different actions and updating the Q-values, the agent learns to maximize the cumulative reward and find the optimal path to the goal.
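The heart of the algorithm is the single update applied after every valid move: Q(s, a) is nudged toward the received reward plus the discounted best Q-value of the next state. The sketch below applies that update once by hand with the same learning_rate and discount_factor as above; the concrete numbers (current Q-value 0.0, reward -1, best next-state Q-value 2.0) are made up purely for illustration.

# One hand-worked application of the Q-learning update rule:
#   Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
learning_rate = 0.5      # alpha, as in the example above
discount_factor = 0.9    # gamma, as in the example above

q_sa = 0.0        # current estimate Q(state, action)   (illustrative value)
reward = -1       # reward received for the move        (illustrative value)
max_q_next = 2.0  # best Q-value from the next state    (illustrative value)

q_sa += learning_rate * (reward + discount_factor * max_q_next - q_sa)
print(q_sa)  # 0.5 * (-1 + 0.9 * 2.0 - 0.0) = 0.4

Repeating this update over many episodes propagates the goal's large reward backwards through the grid, which is why the greedy path read off the trained Q-table leads from the start to the goal.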