Reinforcement learning is a learning paradigm in which an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties based on its actions and learns to maximize the cumulative reward over time.
Example: Training a robot to navigate a maze by giving it positive rewards for reaching the goal and negative rewards for hitting obstacles.
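To make this interaction loop concrete, here is a minimal sketch of how an agent, an environment, and a cumulative reward fit together. The CoinFlipEnv class and the purely random agent are hypothetical stand-ins used only to illustrate the interface; they are not part of the maze example below.

import random

# Hypothetical toy environment: the agent guesses a coin flip and is
# rewarded +1 for a correct guess and -1 for a wrong one.
class CoinFlipEnv:
    def step(self, action):
        outcome = random.choice([0, 1])           # the environment's response
        reward = 1 if action == outcome else -1   # reward or penalty
        return reward

env = CoinFlipEnv()
cumulative_reward = 0
for t in range(10):                   # one episode of 10 interactions
    action = random.choice([0, 1])    # a (random) agent picks an action
    reward = env.step(action)         # the environment returns a reward
    cumulative_reward += reward       # the quantity the agent tries to maximize
print("Cumulative reward:", cumulative_reward)

A learning agent would replace the random choice with a policy that improves from the observed rewards, which is exactly what the Q-learning example below does.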
Let's consider a simple example of reinforcement learning using Q-learning:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm

# Define the environment
environment = [
    [-1, -1, -1, -1,   0],  # Row 0: Penalty cells with one neutral cell at the end
    [-1, -1, -1,  0,  -1],  # Row 1: Penalty cells with one neutral cell
    [-1, -1,  0, -1, 100],  # Row 2: Penalty cells with the goal (100) at the end
    [-1,  0, -1, -1,  -1],  # Row 3: Penalty cells with one neutral cell
    [ 0, -1, -1, -1,  -1]   # Row 4: One neutral cell with penalty cells
]

# Define the Q-learning parameters
num_episodes = 1000
learning_rate = 0.5
discount_factor = 0.9
epsilon = 0.1

# Initialize the Q-table (the grid is square, so num_states covers both dimensions)
num_states = len(environment)
num_actions = 4
Q = np.zeros((num_states, num_states, num_actions))

# Define the reward function
def get_reward(state):
    return environment[state[0]][state[1]]

# Define the action mapping
actions = {
    0: (-1, 0),  # Up
    1: (0, 1),   # Right
    2: (1, 0),   # Down
    3: (0, -1)   # Left
}

# Q-learning algorithm
for episode in range(num_episodes):
    state = (0, 0)  # Start from the top-left corner
    while True:
        # Choose action (epsilon-greedy strategy)
        if np.random.uniform(0, 1) < epsilon:
            action = np.random.choice(num_actions)
        else:
            action = np.argmax(Q[state[0], state[1]])
        next_state = tuple(np.array(state) + np.array(actions[action]))
        # Check if the next state is valid
        if (0 <= next_state[0] < num_states) and (0 <= next_state[1] < len(environment[0])):
            reward = get_reward(next_state)
            Q[state[0], state[1], action] += learning_rate * (
                reward + discount_factor * np.max(Q[next_state[0], next_state[1]])
                - Q[state[0], state[1], action]
            )
            state = next_state
            # Check if the goal state is reached
            if reward == 100:
                break
        else:
            # If the next state is invalid, penalize the move so another action is chosen
            Q[state[0], state[1], action] -= learning_rate

# Print the optimal path
state = (0, 0)
path = [state]
while get_reward(state) != 100:
    action = np.argmax(Q[state[0], state[1]])
    next_state = tuple(np.array(state) + np.array(actions[action]))
    if (0 <= next_state[0] < num_states) and (0 <= next_state[1] < len(environment[0])):
        state = next_state
        path.append(state)
    else:
        break
print("Optimal path:", path)

# Visualize the path with arrows
fig, ax = plt.subplots()
cmap = plt.get_cmap('coolwarm')
bounds = [-1.5, -0.5, 0.5, 100.5]
norm = BoundaryNorm(bounds, cmap.N)
img = ax.imshow(environment, cmap=cmap, norm=norm)

# Draw arrows on the path
for i in range(len(path) - 1):
    start = path[i]
    end = path[i + 1]
    dx = end[1] - start[1]
    dy = end[0] - start[0]
    ax.arrow(start[1], start[0], dx, dy, head_width=0.2, head_length=0.2, fc='black', ec='black')

# Mark the start and goal
ax.text(0, 0, 'Start', ha='center', va='center', color='white', fontsize=12, fontweight='bold')
ax.text(4, 2, 'Goal', ha='center', va='center', color='white', fontsize=12, fontweight='bold')

# Set grid and labels
ax.set_xticks(np.arange(len(environment[0])))
ax.set_yticks(np.arange(len(environment)))
ax.set_xticklabels(np.arange(len(environment[0])))
ax.set_yticklabels(np.arange(len(environment)))
ax.grid(color='gray', linestyle='-', linewidth=0.5)
plt.colorbar(img, ticks=[-1, 0, 100], orientation='vertical', label='Reward')
plt.show()
Output:
Optimal path: [(0, 0), (0, 1), (1, 1), (1, 2), (1, 3), (1, 4), (2, 4)]

The plot shows the optimal path (maximizing reward) learned by the reinforcement learning algorithm.
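As a quick sanity check, the reward collected along the printed path can be re-computed directly from the environment grid. The snippet below repeats the grid and the printed path and sums the rewards of the visited cells, which for this path works out to -1 - 1 - 1 + 0 - 1 + 100 = 96.

# Re-compute the total reward collected along the printed path
environment = [
    [-1, -1, -1, -1,   0],
    [-1, -1, -1,  0,  -1],
    [-1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  -1],
    [ 0, -1, -1, -1,  -1]
]
path = [(0, 0), (0, 1), (1, 1), (1, 2), (1, 3), (1, 4), (2, 4)]
total = sum(environment[r][c] for r, c in path[1:])  # skip the start cell
print("Total reward along the path:", total)  # -1 - 1 - 1 + 0 - 1 + 100 = 96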
Explanation:
The environment is a 5x5 grid in which each cell holds a reward: -1 for penalty cells, 0 for neutral cells, and 100 for the goal. The Q-table stores one value per (row, column, action) combination, and get_reward() retrieves the reward for a given state. Training runs for a fixed number of episodes (num_episodes); in each episode the agent starts at the top-left corner and picks actions with an epsilon-greedy strategy, exploring a random action with probability epsilon and otherwise exploiting the action with the highest Q-value. After every valid move, the Q-value of the chosen (state, action) pair is updated toward the received reward plus the discounted best Q-value of the next state, while moves that would leave the grid are penalized. Once training finishes, the greedy path from the start to the goal is read off the Q-table and visualized on the grid.
Intuition: Reinforcement learning is like learning through trial and error. The agent interacts with the environment, receives rewards or penalties based on its actions, and learns to make better decisions over time. By exploring different actions and updating the Q-values, the agent learns to maximize the cumulative reward and find the optimal path to the goal.
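The heart of the algorithm is the single update applied after every valid move: Q(s, a) is nudged toward the received reward plus the discounted best Q-value of the next state. The sketch below applies that update once by hand with the same learning_rate and discount_factor as above; the concrete numbers (current Q-value 0.0, reward -1, best next-state Q-value 2.0) are made up purely for illustration.

# One hand-worked application of the Q-learning update rule:
#   Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
learning_rate = 0.5      # alpha, as in the example above
discount_factor = 0.9    # gamma, as in the example above

q_sa = 0.0        # current estimate Q(state, action)   (illustrative value)
reward = -1       # reward received for the move        (illustrative value)
max_q_next = 2.0  # best Q-value from the next state    (illustrative value)

q_sa += learning_rate * (reward + discount_factor * max_q_next - q_sa)
print(q_sa)  # 0.5 * (-1 + 0.9 * 2.0 - 0.0) = 0.4

Repeating this update over many episodes propagates the goal's large reward backwards through the grid, which is why the greedy path read off the trained Q-table leads from the start to the goal.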