From 5bb505ae26d90886812fec0d079153a1910f56be Mon Sep 17 00:00:00 2001
From: Ajay Dhangar <99037494+ajay-dhangar@users.noreply.github.com>
Date: Sun, 10 Nov 2024 21:33:36 +0530
Subject: [PATCH] Update qLearning.md
---
docs/machine-learning/qLearning.md | 45 +++++++++++++++---------------
1 file changed, 23 insertions(+), 22 deletions(-)
diff --git a/docs/machine-learning/qLearning.md b/docs/machine-learning/qLearning.md
index 095f156fb..1a3be3b1f 100644
--- a/docs/machine-learning/qLearning.md
+++ b/docs/machine-learning/qLearning.md
@@ -1,16 +1,16 @@
---
-
-id: q-learning
-title: Q-Learning Algorithm
-sidebar_label: Q-Learning
-description: "An overview of the Q-Learning Algorithm, a model-free reinforcement learning method that learns the optimal action-value function to guide decision-making."
-tags: [machine learning, reinforcement learning, q-learning, algorithms, model-free]
-
+id: q-learning
+title: Q-Learning Algorithm
+sidebar_label: Q-Learning
+description: "An overview of the Q-Learning Algorithm, a model-free reinforcement learning method that learns the optimal action-value function to guide decision-making."
+tags: [machine learning, reinforcement learning, q-learning, algorithms, model-free]
---
### Definition:
The **Q-Learning Algorithm** is a model-free reinforcement learning algorithm used to find the optimal action-selection policy for any given finite Markov Decision Process (MDP). It works by learning the value of actions in specific states without needing a model of the environment and aims to optimize long-term rewards.
+
+
### Characteristics:
- **Model-Free**:
Q-Learning does not require prior knowledge of the environment's dynamics and learns directly from experience.
@@ -24,16 +24,16 @@ The **Q-Learning Algorithm** is a model-free reinforcement learning algorithm us
### How It Works:
Q-Learning learns an **action-value function (Q-function)** that maps state-action pairs to their expected cumulative rewards. The Q-value is updated using the following equation:
-\[
+$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
-\]
+$$
-- **\( s \)**: Current state
-- **\( a \)**: Action taken in the current state
-- **\( r \)**: Reward received after taking action \( a \)
-- **\( s' \)**: Next state after taking action \( a \)
-- **\( \alpha \)**: Learning rate (controls how much new information overrides old)
-- **\( \gamma \)**: Discount factor (determines the importance of future rewards)
+- $s$: Current state
+- $a$: Action taken in the current state
+- $r$: Reward received after taking action $a$
+- $s'$: Next state after taking action $a$
+- $\alpha$: Learning rate (controls how much new information overrides old)
+- $\gamma$: Discount factor (determines the importance of future rewards)
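+
+For readers who prefer code, the same update can be written in one line. The following is a minimal sketch assuming a NumPy Q-table `Q` indexed by integer states and actions; the variable names and numbers are illustrative and not taken from the implementation further below:
+
+```python
+import numpy as np
+
+Q = np.zeros((5, 4))                  # hypothetical Q-table: 5 states, 4 actions
+alpha, gamma = 0.1, 0.9               # learning rate and discount factor
+state, action, reward, next_state = 0, 2, 1.0, 3
+
+# Temporal-difference target and update, mirroring the equation above.
+td_target = reward + gamma * np.max(Q[next_state])
+Q[state, action] += alpha * (td_target - Q[state, action])
+```
+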
### Steps Involved:
1. **Initialization**:
@@ -51,6 +51,8 @@ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s,
5. **Repeat**:
Continue until the learning converges or a stopping condition is met.
+
+
### Problem Statement:
Given an environment defined by states and actions with unknown dynamics, the goal is to learn the optimal Q-function that allows an agent to make decisions maximizing cumulative rewards over time.
@@ -59,10 +61,10 @@ Given an environment defined by states and actions with unknown dynamics, the go
A matrix where each row represents a state, and each column represents an action. The values represent the learned Q-values for state-action pairs.
- **Epsilon-Greedy Strategy**:
- A common method to balance exploration and exploitation. The agent selects a random action with probability \( \epsilon \) and the best-known action with probability \( 1 - \epsilon \).
+  A common method to balance exploration and exploitation. The agent selects a random action with probability $\epsilon$ and the best-known action with probability $1 - \epsilon$ (a minimal sketch follows this list).
- **Convergence**:
- Q-Learning converges to the optimal Q-function given an infinite number of episodes and a decaying learning rate.
+  Q-Learning converges to the optimal Q-function given an infinite number of episodes and a decaying learning rate.
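+
+As a rough illustration of the Q-table and the epsilon-greedy rule described above, here is a minimal sketch; the table shape and the `epsilon` value are assumptions for this example, not part of the implementation further below:
+
+```python
+import numpy as np
+
+n_states, n_actions = 5, 4
+Q = np.zeros((n_states, n_actions))   # Q-table: one row per state, one column per action
+epsilon = 0.1                         # exploration probability
+
+def choose_action(state: int) -> int:
+    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
+    if np.random.rand() < epsilon:
+        return np.random.randint(n_actions)   # random exploratory action
+    return int(np.argmax(Q[state]))           # best-known (greedy) action
+```
+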
### Example:
Consider a grid-world environment where an agent navigates to collect rewards:
@@ -73,9 +75,10 @@ Consider a grid-world environment where an agent navigates to collect rewards:
**Update Step**:
After moving from (1,1) to (1,2) with action "Right" and receiving a reward of 0:
-\[
+
+$$
Q(1,1, \text{Right}) \leftarrow Q(1,1, \text{Right}) + \alpha \left[ 0 + \gamma \max_{a'} Q(1,2, a') - Q(1,1, \text{Right}) \right]
-\]
+$$
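+
+To make the numbers concrete, suppose (purely for illustration) that $\alpha = 0.5$, $\gamma = 0.9$, the current estimate is $Q(1,1, \text{Right}) = 0.2$, and the best action from (1,2) is worth $\max_{a'} Q(1,2, a') = 0.8$; none of these values come from the example above. The update would then give:
+
+$$
+Q(1,1, \text{Right}) \leftarrow 0.2 + 0.5 \left[ 0 + 0.9 \times 0.8 - 0.2 \right] = 0.2 + 0.5 \times 0.52 = 0.46
+$$
+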
### Python Implementation:
Here is a basic implementation of Q-Learning in Python:
@@ -123,6 +126,4 @@ print("Training completed.")
```
### Conclusion:
-Q-Learning is a powerful and foundational reinforcement learning technique that enables agents to learn optimal policies through direct interaction with an environment. Its simplicity and effectiveness make it a popular choice for many RL applications.
-
----
+Q-Learning is a powerful and foundational reinforcement learning technique that enables agents to learn optimal policies through direct interaction with an environment. Its simplicity and effectiveness make it a popular choice for many RL applications.