A Comprehensive Overview of Q-Learning and Actor-Critic Methods
Table of Contents
- 1. Q-Learning: A Foundational Approach
- 2. Basic Actor-Critic
- 3. Neural Network Parameterization
- 4. Deep Deterministic Policy Gradient (DDPG)
- 5. Twin Delayed Deep Deterministic Policy Gradient (TD3)
- 6. Proximal Policy Optimization (PPO)
- 7. Soft Actor-Critic (SAC)
- 8. Asynchronous Advantage Actor-Critic (A3C)
- 9. A3C Full Example
- 10. Evolution Timeline of the Methods
- 11. Concluding Remarks
1. Q-Learning: A Foundational Approach
Q-Learning attempts to learn the optimal state-action value function:

$$Q^*(s, a) = \max_\pi \, \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s,\ a_0 = a,\ \pi\Big]$$

For a discrete action space, we can keep a table or a neural network to represent $Q(s, a)$.
1.1 Bellman Update
The core Bellman optimality update is:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Big]$$
Python Snippet
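A minimal tabular sketch, assuming a NumPy table `Q` of shape `(n_states, n_actions)` and an $\epsilon$-greedy behaviour policy (function and parameter names are illustrative):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """Behaviour policy: random action with probability epsilon, else greedy."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on the transition (s, a, r, s_next)."""
    bootstrap = 0.0 if done else np.max(Q[s_next])   # greedy bootstrap, zero at terminal states
    td_target = r + gamma * bootstrap
    Q[s, a] += alpha * (td_target - Q[s, a])
```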
1.2 Deep Q-Network (DQN)
In deep Q-learning, we approximate $Q(s, a)$ with a neural network $Q_\theta(s, a)$. The loss to minimize is:

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \Big[ \big( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \big)^2 \Big]$$

Here, $\theta$ and $\theta^-$ denote online and target network parameters.
Python Snippet
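A minimal PyTorch sketch of this loss, assuming `q_net` and `target_net` both map a batch of states to per-action values (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """Mean squared TD error against the frozen target network."""
    # Q_theta(s, a) for the actions actually taken
    q_sa = q_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r + gamma * max_a' Q_theta^-(s', a'), cut off at terminal transitions
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_sa, target)
```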
Thus, Q-learning (and its deep counterpart) is typically off-policy because it learns about the greedy policy while potentially following a different data-collecting policy (e.g., $\epsilon$-greedy).
2. Basic Actor-Critic
Unlike Q-learning, actor-critic methods maintain:
- A policy $\pi_\theta(a \mid s)$ (the “actor”).
- A value function $V_\phi(s)$ or $Q$-function (the “critic”).
2.1 Policy Gradient Theory
We want to maximize the expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \gamma^t r_t\Big]$$

The policy gradient theorem says:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(Q^{\pi_\theta}(s_t, a_t) - b(s_t)\big)\Big]$$

where $b(s_t)$ is a baseline (often the value function $V(s_t)$).
Python Snippet
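A minimal sketch for discrete actions, assuming `policy_net` outputs logits and the advantages (return minus baseline) are precomputed:

```python
import torch

def policy_gradient_loss(policy_net, states, actions, advantages):
    """Surrogate loss whose gradient is the policy gradient: -E[log pi(a|s) * advantage]."""
    dist = torch.distributions.Categorical(logits=policy_net(states))
    log_probs = dist.log_prob(actions)
    # Advantages are treated as constants so gradients flow only through the policy.
    return -(log_probs * advantages.detach()).mean()
```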
2.2 Critic Objective
The critic (value-based) is learned via MSE against a bootstrapped target:

$$L(\phi) = \mathbb{E}\Big[\big(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\big)^2\Big]$$
Python Snippet
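A minimal sketch using a one-step TD target, assuming `value_net(states)` returns a `(batch, 1)` tensor:

```python
import torch
import torch.nn.functional as F

def critic_loss(value_net, states, rewards, next_states, dones, gamma=0.99):
    """Regress V_phi(s) toward the bootstrapped target r + gamma * V_phi(s')."""
    values = value_net(states).squeeze(-1)
    with torch.no_grad():
        next_values = value_net(next_states).squeeze(-1)
        targets = rewards + gamma * (1.0 - dones) * next_values
    return F.mse_loss(values, targets)
```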
3. Neural Network Parameterization
Below is a typical two-layer MLP for both actor and critic, with explicit shapes (state dimension $d_s$, action dimension $d_a$, hidden width $h$):

- Actor $\pi_\theta(a \mid s)$:
  - First hidden layer: $d_s \to h$.
  - Second hidden layer: $h \to h$.
  - Output layer depends on discrete vs. continuous actions (action logits, or a mean and log-std of dimension $d_a$).
- Critic $V_\phi(s)$ or $Q_\phi(s, a)$:
  - Similarly a two-layer MLP.
  - Output dimension = 1 (scalar).
Python Snippet for a 2-Layer MLP
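A minimal sketch of such an MLP (the hidden width of 256 is an illustrative choice):

```python
import torch.nn as nn

class MLP(nn.Module):
    """Two-hidden-layer MLP used for both the actor and the critic."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# Illustrative shapes for a task with 8-dimensional states and 2 discrete actions:
# actor  = MLP(in_dim=8, out_dim=2)   # logits over actions
# critic = MLP(in_dim=8, out_dim=1)   # scalar V(s)
```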
4. Deep Deterministic Policy Gradient (DDPG)
4.1 Deterministic Actor, Q-Critic
- Actor: a deterministic policy $a = \mu_\theta(s)$.
- Critic: $Q_\phi(s, a)$.
- Critic Loss:

$$L(\phi) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\Big[\big(r + \gamma\, Q_{\phi^-}\big(s', \mu_{\theta^-}(s')\big) - Q_\phi(s, a)\big)^2\Big]$$
Python Snippet
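A minimal sketch, assuming `critic(states, actions)` returns a `(batch, 1)` tensor and the target networks share that interface:

```python
import torch
import torch.nn.functional as F

def ddpg_critic_loss(critic, target_critic, target_actor,
                     states, actions, rewards, next_states, dones, gamma=0.99):
    """Regress Q_phi(s, a) toward r + gamma * Q_phi^-(s', mu_theta^-(s'))."""
    q = critic(states, actions).squeeze(-1)
    with torch.no_grad():
        next_actions = target_actor(next_states)
        target_q = target_critic(next_states, next_actions).squeeze(-1)
        y = rewards + gamma * (1.0 - dones) * target_q
    return F.mse_loss(q, y)
```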
- Actor Update uses the deterministic policy gradient:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \mathcal{D}}\Big[\nabla_a Q_\phi(s, a)\big|_{a = \mu_\theta(s)}\ \nabla_\theta \mu_\theta(s)\Big]$$
Python Snippet
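A minimal sketch; only the actor's parameters should be stepped by the optimizer that minimizes this loss:

```python
def ddpg_actor_loss(critic, actor, states):
    """Deterministic policy gradient: push mu_theta(s) toward actions the critic rates highly."""
    actions = actor(states)
    # Maximizing Q(s, mu_theta(s)) is equivalent to minimizing its negation.
    return -critic(states, actions).mean()
```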
DDPG is off-policy and uses a replay buffer plus target networks to improve stability.
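The target networks are commonly refreshed by Polyak averaging after each learning step; a minimal sketch (the value of `tau` is illustrative):

```python
import torch

def soft_update(target_net, online_net, tau=0.005):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    with torch.no_grad():
        for tgt, src in zip(target_net.parameters(), online_net.parameters()):
            tgt.mul_(1.0 - tau).add_(tau * src)
```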
5. Twin Delayed Deep Deterministic Policy Gradient (TD3)
5.1 Twin Critics
To reduce overestimation in DDPG, TD3 uses two critics $Q_{\phi_1}(s, a)$ and $Q_{\phi_2}(s, a)$ and takes the minimum of their target estimates. The critic target (with target-policy smoothing noise) is:

$$y = r + \gamma \min_{i=1,2} Q_{\phi_i^-}\big(s',\ \mu_{\theta^-}(s') + \epsilon\big), \qquad \epsilon \sim \mathrm{clip}\big(\mathcal{N}(0, \tilde{\sigma}),\ -c,\ c\big)$$
Python Snippet
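A minimal sketch of this target, with target-policy smoothing noise; `noise_std`, `noise_clip`, and `max_action` are illustrative assumptions:

```python
import torch

def td3_target(target_actor, target_q1, target_q2, rewards, next_states, dones,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """y = r + gamma * min_i Q_i^-(s', mu^-(s') + clipped noise)."""
    with torch.no_grad():
        mu = target_actor(next_states)
        noise = (torch.randn_like(mu) * noise_std).clamp(-noise_clip, noise_clip)
        next_actions = (mu + noise).clamp(-max_action, max_action)
        q1 = target_q1(next_states, next_actions).squeeze(-1)
        q2 = target_q2(next_states, next_actions).squeeze(-1)
        return rewards + gamma * (1.0 - dones) * torch.min(q1, q2)
```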
5.2 Delayed Updates
TD3 updates the actor (and the target networks) only once every few critic updates, so the policy gradient is computed against a more accurate, less noisy value estimate.
6. Proximal Policy Optimization (PPO)
6.1 Probability Ratio and Clipping
PPO is an on-policy method. We define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

The clipped objective is:

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\ 1 - \epsilon,\ 1 + \epsilon\big)\,\hat{A}_t\big)\Big]$$
Python Snippet
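A minimal sketch of the clipped surrogate on per-timestep tensors (the negation turns the maximization into a loss to minimize):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO."""
    ratio = torch.exp(log_probs_new - log_probs_old)                 # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```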
7. Soft Actor-Critic (SAC)
7.1 Maximum Entropy RL
SAC encourages exploration via an entropy term $\mathcal{H}\big(\pi(\cdot \mid s)\big)$. The objective is:

$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\Big[r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\Big]$$
7.2 Two Critics
Like TD3, SAC uses twin critics $Q_{\phi_1}$ and $Q_{\phi_2}$. The target is:

$$y = r + \gamma\Big(\min_{i=1,2} Q_{\phi_i^-}(s', a') - \alpha \log \pi_\theta(a' \mid s')\Big)$$

with $a' \sim \pi_\theta(\cdot \mid s')$.
Python Snippet
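A minimal sketch of the soft target, assuming the policy exposes a `sample(states)` method returning actions and their log-probabilities (that interface and the fixed `alpha` are illustrative assumptions):

```python
import torch

def sac_target(policy, target_q1, target_q2, rewards, next_states, dones,
               gamma=0.99, alpha=0.2):
    """Soft Bellman target: min of twin target critics minus the scaled log-probability."""
    with torch.no_grad():
        next_actions, next_log_probs = policy.sample(next_states)   # assumed interface
        q1 = target_q1(next_states, next_actions).squeeze(-1)
        q2 = target_q2(next_states, next_actions).squeeze(-1)
        soft_q = torch.min(q1, q2) - alpha * next_log_probs
        return rewards + gamma * (1.0 - dones) * soft_q
```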
8. Asynchronous Advantage Actor-Critic (A3C)
8.1 Parallelization Insight
A3C runs multiple worker processes, each with local copies of the policy $\pi_\theta$ and the value function $V_\phi$. They asynchronously update the shared global parameters.
8.2 Advantage Actor-Critic Loss
A typical A3C loss (value-based critic) is:

$$L = -\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t + c_v\,\big(R_t - V_\phi(s_t)\big)^2 - c_e\,\mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big)$$

where $\hat{A}_t = R_t - V_\phi(s_t)$ is the advantage estimate, and $c_v$, $c_e$ weight the value and entropy terms.
Python Snippet
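A minimal sketch of the combined loss on a single rollout (the coefficient values are illustrative):

```python
import torch

def a3c_loss(logits, values, actions, returns, value_coef=0.5, entropy_coef=0.01):
    """Policy-gradient term + value regression - entropy bonus."""
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)
    advantages = returns - values                        # A_t = R_t - V_phi(s_t)
    policy_loss = -(log_probs * advantages.detach()).mean()
    value_loss = advantages.pow(2).mean()
    entropy = dist.entropy().mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```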
A3C’s asynchronous updates help decorrelate data and speed up training on CPUs.
9. A3C Full Example
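A compact end-to-end sketch under the following assumptions: PyTorch with `torch.multiprocessing`, gymnasium's `CartPole-v1`, Hogwild-style gradient sharing, and per-worker Adam state instead of the shared RMSProp used in the original paper. Hyperparameters are illustrative.

```python
import gymnasium as gym
import torch
import torch.nn as nn
import torch.multiprocessing as mp

class ActorCritic(nn.Module):
    """Shared body with a policy head (logits) and a value head."""
    def __init__(self, obs_dim=4, n_actions=2, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.body(x)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def worker(rank, global_model, lr, n_updates=500, rollout_len=20, gamma=0.99):
    torch.manual_seed(rank)                               # decorrelate the workers
    env = gym.make("CartPole-v1")
    local_model = ActorCritic()
    # Per-worker optimizer over the *shared* global parameters
    # (the original A3C shares the RMSProp statistics as well).
    optimizer = torch.optim.Adam(global_model.parameters(), lr=lr)
    obs, _ = env.reset(seed=rank)
    for _ in range(n_updates):
        local_model.load_state_dict(global_model.state_dict())   # sync with global weights
        log_probs, values, rewards, entropies = [], [], [], []
        done = False
        for _ in range(rollout_len):
            logits, value = local_model(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            obs, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            log_probs.append(dist.log_prob(action))
            values.append(value)
            rewards.append(float(reward))
            entropies.append(dist.entropy())
            if done:
                obs, _ = env.reset()
                break
        # Bootstrap from V(s) of the last state unless the episode ended.
        with torch.no_grad():
            R = 0.0 if done else local_model(torch.as_tensor(obs, dtype=torch.float32))[1].item()
        policy_loss, value_loss = 0.0, 0.0
        for t in reversed(range(len(rewards))):
            R = rewards[t] + gamma * R
            advantage = R - values[t]
            policy_loss = policy_loss - log_probs[t] * advantage.detach() - 0.01 * entropies[t]
            value_loss = value_loss + advantage.pow(2)
        loss = policy_loss + 0.5 * value_loss
        # Compute gradients locally, then push them onto the shared parameters.
        local_model.zero_grad()
        loss.backward()
        for local_p, global_p in zip(local_model.parameters(), global_model.parameters()):
            global_p._grad = local_p.grad
        optimizer.step()

if __name__ == "__main__":
    global_model = ActorCritic()
    global_model.share_memory()                      # put the global parameters in shared memory
    workers = [mp.Process(target=worker, args=(rank, global_model, 1e-3)) for rank in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```

Each worker decorrelates its own experience simply by running its own environment instance with a different seed, which is what lets A3C skip the replay buffer.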
10. Evolution Timeline of the Methods
- **Q-Learning**:
  - Learns $Q(s, a)$ using the Bellman update.
  - Great for discrete action spaces (DQN for the deep version).
  - Off-policy; the max over actions makes it awkward for large or continuous action spaces.
- **Actor-Critic (baseline)**:
  - Combines the policy gradient with a learned critic to reduce variance.
  - Works in both discrete and continuous settings.
- **DDPG (2015)**:
  - Deterministic policy + replay buffer + target networks for continuous control.
  - Issues: overestimation, sensitive hyperparameters.
- **A3C (2016)**:
  - Multiple asynchronous workers for faster training.
  - No replay buffer, but can have higher variance.
- **TD3 (2018)**:
  - Twin critics + delayed updates to reduce overestimation in DDPG.
  - Deterministic policy, so it needs explicit exploration noise.
- **PPO (2017)**:
  - On-policy, with a clipped objective for stable learning.
  - Popular and relatively easy to tune.
- **SAC (2018)**:
  - Maximum entropy RL for robust exploration.
  - Twin critics to reduce overestimation.
  - Often state-of-the-art in continuous control tasks.
Hence, each method emerges to address specific challenges:
- Overestimation (TD3, SAC).
- Exploration (SAC’s entropy).
- Stability (PPO clipping, twin critics).
- Efficiency (replay buffers, asynchronous runs).
11. Concluding Remarks
- Q-learning (and DQN) forms the foundation for many discrete-action RL approaches.
- Actor-Critic methods extend naturally to continuous actions and can reduce variance with a learned critic.
- DDPG introduced a deterministic actor with an off-policy, replay-buffer approach, later refined by TD3 to address overestimation.
- PPO simplified stable on-policy learning with a clipped objective.
- SAC combined twin critics with maximum entropy to encourage robust exploration.
- A3C leveraged asynchronous CPU processes to speed up training without replay buffers.