Bridging GRPO and Transformer Learning Mechanisms to Enhance Language Model Training
In this post, we will explore how GRPO (Group Relative Policy Optimization) naturally extends the standard Transformer training paradigm (which is typically pure next-token prediction using maximum likelihood) into a reinforcement learning (RL) framework. We will derive the connections step-by-step and discuss how these insights can guide future improvements.
Review
1. Why a Naive Transformer Can Learn
We’ll review how all these elements combine so that the Transformer learns the conditional distribution $p_\theta(x_t \mid x_{<t})$ and thus factorizes

$$
p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})
$$

for a text sequence $x = (x_1, \dots, x_T)$. Finally, we connect it to the negative log-likelihood (NLL) objective.
1.1 Causal Masking
In a causal language model, position $i$ must only attend to previous positions $j \le i$. To enforce this in self-attention, we define a mask $M \in \mathbb{R}^{T \times T}$ such that

$$
M_{ij} =
\begin{cases}
0 & j \le i \\
-\infty & j > i
\end{cases}
$$

For each self-attention head, the attention weights are given by a softmax over the scaled dot products plus the mask:

$$
\alpha_{ij} = \frac{\exp\!\left(\tfrac{q_i^\top k_j}{\sqrt{d_k}} + M_{ij}\right)}{\sum_{j'=1}^{T} \exp\!\left(\tfrac{q_i^\top k_{j'}}{\sqrt{d_k}} + M_{ij'}\right)},
$$

where $\alpha_{ij}$ is the attention weight of position $i$ on position $j$. Because $M_{ij} = -\infty$ for $j > i$, those positions contribute zero weight after the softmax, enforcing causality.
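To make this concrete, here is a minimal PyTorch sketch (toy sizes; names like `scores` are illustrative, not from any particular codebase) that builds the additive causal mask and applies it before the softmax:

```python
import torch
import torch.nn.functional as F

T, d_k = 5, 8                           # toy sequence length and key/query dimension

# Additive causal mask: 0 where j <= i, -inf where j > i.
mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

# Random query/key vectors standing in for the learned projections.
q = torch.randn(T, d_k)
k = torch.randn(T, d_k)

# Scaled dot-product scores plus the mask, then a row-wise softmax.
scores = q @ k.T / d_k**0.5 + mask
attn = F.softmax(scores, dim=-1)

print(attn)   # row i places zero weight on every position j > i
```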
1.2 Query, Key, Value Transformations
Each multi-head self-attention block creates queries, keys, and values via learned linear projections of the hidden states $h_1, \dots, h_T \in \mathbb{R}^{d_{\text{model}}}$. For head $m$ (out of $H$ total heads), we define:

$$
q_i^{(m)} = W_Q^{(m)} h_i, \qquad k_i^{(m)} = W_K^{(m)} h_i, \qquad v_i^{(m)} = W_V^{(m)} h_i,
$$

- $q_i^{(m)}, k_i^{(m)}, v_i^{(m)}$ each have dimension $d_k$ (or $d_v$ for the values), where typically $d_k = d_v = d_{\text{model}} / H$.

For position $i$, the attention weights w.r.t. position $j$ become:

$$
\alpha_{ij}^{(m)} = \frac{\exp\!\left(\tfrac{q_i^{(m)\top} k_j^{(m)}}{\sqrt{d_k}} + M_{ij}\right)}{\sum_{j'} \exp\!\left(\tfrac{q_i^{(m)\top} k_{j'}^{(m)}}{\sqrt{d_k}} + M_{ij'}\right)}.
$$

The context vector for head $m$ at position $i$ is then

$$
z_i^{(m)} = \sum_{j=1}^{i} \alpha_{ij}^{(m)} v_j^{(m)},
$$

noting that $\alpha_{ij}^{(m)} = 0$ for $j > i$ via the causal mask.
Multi-Head Concatenation and Output Projection
If there are $H$ heads, we concatenate their outputs:

$$
z_i = \left[ z_i^{(1)} ; z_i^{(2)} ; \dots ; z_i^{(H)} \right],
$$

which has dimension $H \cdot d_v = d_{\text{model}}$. We then apply another linear projection

$$
\tilde{h}_i = W_O z_i,
$$

where $W_O$ has shape $d_{\text{model}} \times (H \cdot d_v)$. The residual connection is typically added next (plus layer normalization, discussed below).
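Putting Sections 1.1 and 1.2 together, here is a minimal sketch of one multi-head causal self-attention block in PyTorch (the sizes `d_model = 64`, `n_heads = 4` and the class name are arbitrary choices for illustration, not a reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        # W_Q, W_K, W_V and the output projection W_O.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        B, T, _ = h.shape

        # Project and split into heads: (B, n_heads, T, d_k).
        def split(x):
            return x.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(h)), split(self.w_k(h)), split(self.w_v(h))

        # Scaled dot-product scores with the additive causal mask.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        scores = q @ k.transpose(-2, -1) / self.d_k**0.5 + mask
        attn = F.softmax(scores, dim=-1)

        # Per-head context vectors, re-concatenated and projected by W_O.
        z = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.w_o(z)

# Toy usage: batch of 2 sequences, length 5, d_model = 64, 4 heads.
block = CausalSelfAttention(d_model=64, n_heads=4)
out = block(torch.randn(2, 5, 64))
print(out.shape)  # torch.Size([2, 5, 64])
```

In a full Transformer layer, the residual connection and LayerNorm (next sections) would wrap this block.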
1.3 Linear Transformations
A linear layer in a Transformer can be written:

$$
\mathrm{Linear}(x) = W x + b,
$$

where $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ and $b \in \mathbb{R}^{d_{\text{out}}}$. In the multi-head attention context, we have three such linear layers ($W_Q, W_K, W_V$) to form queries, keys, and values, plus a final one ($W_O$) to map the concatenated heads back to $\mathbb{R}^{d_{\text{model}}}$.

Additionally, a typical feed-forward sub-layer within the Transformer is also a composition of two linear layers with a nonlinearity (often ReLU or GELU):

$$
\mathrm{FFN}(x) = W_2\, \sigma(W_1 x + b_1) + b_2,
$$

where $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ and $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$.
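A corresponding sketch of the feed-forward sub-layer; the choice $d_{\text{ff}} = 4\, d_{\text{model}}$ below is a common convention, assumed here only for illustration:

```python
import torch
import torch.nn as nn

d_model, d_ff = 64, 256   # d_ff = 4 * d_model is a common (but not required) choice

# FFN(x) = W2 * GELU(W1 x + b1) + b2, applied independently at each position.
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.GELU(),
    nn.Linear(d_ff, d_model),
)

x = torch.randn(2, 5, d_model)   # (batch, seq_len, d_model)
print(ffn(x).shape)              # torch.Size([2, 5, 64])
```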
1.4 Layer Normalization
LayerNorm is applied either before or after the main sub-layer operations (depending on the Transformer variant). The standard formula for a hidden vector $h \in \mathbb{R}^{d_{\text{model}}}$ is:

$$
\mathrm{LayerNorm}(h) = \gamma \odot \frac{h - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,
$$

where

- $\mu = \frac{1}{d_{\text{model}}} \sum_{k} h_k$ is the mean of the components of $h$,
- $\sigma^2 = \frac{1}{d_{\text{model}}} \sum_{k} (h_k - \mu)^2$ is the variance of $h$,
- $\gamma$ and $\beta$ are learnable parameters of dimension $d_{\text{model}}$,
- $\epsilon$ is a small constant (e.g., $10^{-5}$) to avoid division by zero,
- $\odot$ denotes elementwise multiplication.

Thus, LayerNorm re-centers and re-scales each dimension of $h$, per sample, improving training stability.
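The formula above can be checked directly against PyTorch's built-in `nn.LayerNorm`; this sketch mirrors the equations, with the default initialization $\gamma = 1$, $\beta = 0$:

```python
import torch
import torch.nn as nn

def layer_norm(h, gamma, beta, eps=1e-5):
    # Per-sample mean and variance over the feature dimension.
    mu = h.mean(dim=-1, keepdim=True)
    var = h.var(dim=-1, keepdim=True, unbiased=False)
    # Re-center, re-scale, then apply the learnable gamma / beta.
    return gamma * (h - mu) / torch.sqrt(var + eps) + beta

d_model = 64
h = torch.randn(2, 5, d_model)
gamma, beta = torch.ones(d_model), torch.zeros(d_model)

ours = layer_norm(h, gamma, beta)
ref = nn.LayerNorm(d_model)(h)               # default init matches gamma=1, beta=0
print(torch.allclose(ours, ref, atol=1e-5))  # True
```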
1.5 Auto-Regressive Next-Token Probability
After $L$ layers of (1) self-attention with causal masking, (2) feed-forward layers, and (3) layer normalization + residual connections, we obtain a final hidden state $h_t^{(L)}$ for each position $t$. We then project to vocabulary logits:

$$
\ell_t = W_{\text{vocab}}\, h_t^{(L)} + b_{\text{vocab}},
$$

where $W_{\text{vocab}}$ has size $|V| \times d_{\text{model}}$ (assuming a vocabulary of size $|V|$) and $b_{\text{vocab}} \in \mathbb{R}^{|V|}$. A softmax over these logits yields:

$$
p_\theta(x_{t+1} = w \mid x_{\le t}) = \frac{\exp(\ell_{t,w})}{\sum_{w' \in V} \exp(\ell_{t,w'})}.
$$

This defines the auto-regressive distribution for the next token.
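A minimal sketch of the vocabulary projection and softmax, with placeholder sizes (`V = 1000`) and random hidden states standing in for the Transformer output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, V = 64, 1000                     # hidden size, vocabulary size (toy values)
to_logits = nn.Linear(d_model, V)         # W_vocab and b_vocab

h_final = torch.randn(2, 5, d_model)      # final hidden states for a batch of sequences
logits = to_logits(h_final)               # (batch, seq_len, V)
probs = F.softmax(logits, dim=-1)         # auto-regressive next-token distribution

print(probs.shape)                        # torch.Size([2, 5, 1000])
print(probs.sum(dim=-1))                  # each row sums to 1
```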
1.6 Negative Log-Likelihood Objective
With the entire network parameterized by $\theta$, the language modeling likelihood of an observed sequence $x = (x_1, \dots, x_T)$ factorizes as:

$$
p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}).
$$

The training objective is to minimize the negative log-likelihood over a large corpus $\mathcal{D}$:

$$
\mathcal{L}_{\text{NLL}}(\theta) = - \sum_{x \in \mathcal{D}} \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}),
$$

which is equivalent to cross-entropy loss. Concretely, for each token $x_t$, the cross-entropy w.r.t. the model’s predicted distribution is $-\log p_\theta(x_t \mid x_{<t})$, and backpropagation through the Q, K, V transformations, the linear layers, and the layer normalization updates all parameters $\theta$ so as to increase the likelihood of the observed tokens.
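A sketch of one NLL training step. The "model" here is a placeholder (an embedding plus a linear head) standing in for the full Transformer stack described above; the shift between inputs and targets is what makes this next-token prediction:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V = 1000
# Stand-in "model": embedding followed by a linear head. In practice this would
# be the full causal Transformer described in the previous sections.
model = nn.Sequential(nn.Embedding(V, 64), nn.Linear(64, V))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

tokens = torch.randint(0, V, (2, 6))      # (batch, seq_len) toy token IDs

# Predict token t+1 from the prefix up to t: shift inputs and targets by one.
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = model(inputs)                    # (batch, seq_len - 1, V)

# Cross-entropy = mean over positions of -log p_theta(x_t | x_<t).
loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(loss.item())
```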
1.7 Final Takeaway
- Causal Masking: Ensures that position $i$ never sees positions $j > i$.
- Self-Attention: Uses the $Q$, $K$, $V$ projections to selectively integrate information from previous tokens.
- Linear Layers: Transform the embeddings, attention outputs, and feed-forward representations.
- LayerNorm: Re-centers and re-scales hidden states per sample, stabilizing training.
- Vocabulary Projection: A final linear layer maps the final hidden state $h_t^{(L)}$ to vocabulary logits $\ell_t$, from which we apply softmax.
- Training: Minimizing $\mathcal{L}_{\text{NLL}}(\theta)$ aligns the model parameters to make accurate next-token predictions on the training corpus.
- Key: Minimizing this Negative Log-Likelihood (NLL) via cross-entropy is equivalent to maximizing the likelihood of the training data. This procedure uses purely supervised learning with no environment-based “reward.”
Reinforcement Learning in DeepSeek R1
2. Enter GRPO: An RL Overlay on Transformers
TL;DR What’s the difference?
When we switch from pure next-token prediction to a reinforcement learning perspective, we view each generated token (or entire sequence) as an “action” that yields a reward. The policy is still a Transformer, but its training signal changes from simple log-likelihood to expected reward.
2.1. Group Relative Policy Optimization in a Nutshell
- Group Baseline: Instead of learning a value function as in PPO, GRPO uses a group-level baseline for variance reduction. For a group of $G$ outputs $\{o_1, \dots, o_G\}$ sampled from the same prompt $q$, it computes the group-average reward $\bar{r} = \frac{1}{G} \sum_{i=1}^{G} r_i$, where $r_i$ is the reward of the $i$-th output.
- Group-Relative Advantage: $A_i = \dfrac{r_i - \bar{r}}{\operatorname{std}(r_1, \dots, r_G)}$, i.e., each reward is centered by the group mean and (in the standard formulation) scaled by the group standard deviation. This advantage replaces the role of a learned value function in PPO.
- Policy Ratio: As with PPO, GRPO measures the ratio of the new policy to the old policy, $\rho_i(\theta) = \dfrac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}$.
- KL Divergence or Clipping: GRPO can control policy updates via a KL penalty or by clipping this ratio within a trust region, similar to PPO.
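A minimal sketch of the group baseline and group-relative advantage, assuming the rewards for the $G$ sampled outputs of a single prompt are already available (toy values below):

```python
import torch

# Rewards r_1..r_G for G outputs sampled from the same prompt (toy values).
rewards = torch.tensor([0.1, 0.9, 0.4, 0.7])

# Group baseline: the average reward of the group.
baseline = rewards.mean()

# Group-relative advantage: center by the group mean and (optionally)
# scale by the group standard deviation for further variance reduction.
advantages = (rewards - baseline) / (rewards.std() + 1e-8)

print(baseline.item())
print(advantages)
```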
2.2. Surrogate Objective with Group Averages
If we adopt a clipping-based approach (akin to PPO), the GRPO objective might look like:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\Big( \rho_i(\theta)\, A_i,\ \operatorname{clip}\big(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_i \Big) \right].
$$

Alternatively, if we incorporate a direct KL penalty w.r.t. a reference policy $\pi_{\text{ref}}$, we might write:

$$
\mathcal{J}_{\text{GRPO-KL}}(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G} \rho_i(\theta)\, A_i \right] - \beta\, \mathbb{D}_{\text{KL}}\!\left( \pi_\theta \,\Vert\, \pi_{\text{ref}} \right).
$$

In both cases, the key difference from pure cross-entropy is that we are directly maximizing a reward-based objective rather than maximizing the likelihood of tokens in a dataset.
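A sketch of the clipped surrogate, assuming sequence-level log-probabilities under the current and old policies have already been computed (a per-token variant would additionally average over tokens; the names `logp_new`, `logp_old` are illustrative):

```python
import torch

def grpo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate (to be minimized), one term per group output."""
    # Policy ratio rho_i = pi_theta(o_i | q) / pi_old(o_i | q), from log-probs.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Take the pessimistic (min) of the two, average over the group, negate.
    # A KL penalty w.r.t. a frozen reference policy could be added here,
    # as in the second objective above.
    return -torch.min(unclipped, clipped).mean()

# Toy inputs: G = 4 outputs with precomputed log-probs and advantages.
logp_old = torch.tensor([-12.0, -8.5, -10.0, -9.0])
logp_new = logp_old + 0.1 * torch.randn(4)     # pretend the policy moved slightly
logp_new.requires_grad_(True)
advantages = torch.tensor([-1.2, 1.0, -0.1, 0.3])

loss = grpo_clipped_loss(logp_new, logp_old, advantages)
loss.backward()                                # gradients flow into logp_new
print(loss.item())
```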
3. Where They Intersect: The Transformer as a Policy Network
Underneath both pure next-token MLE and GRPO (or PPO), the neural architecture is typically the same Transformer. What changes is the loss function and associated training data:
- Pure MLE: The training data is a static corpus of text, and the objective is to predict the next token, with no explicit reward.
- GRPO / PPO:
  - We still use the Transformer architecture for the policy $\pi_\theta$.
  - We gather data (policy outputs) and compute or receive a reward signal $r$.
  - We update $\theta$ to improve this reward-based objective.
Thus, the mechanism of self-attention, feed-forward layers, and so on does not change. Instead, the training objective shifts from “match the next token distribution in the dataset” to “output sequences with high reward.”
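To make "data generated on the fly" concrete, here is a sketch of sampling a group of $G$ completions for one prompt and scoring them; the `policy` (an embedding plus a linear head) and `reward_fn` are placeholders, not a real API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

V, G, max_new = 1000, 4, 8                 # vocab size, group size, tokens to sample

# Placeholder policy: any causal LM mapping token IDs to next-token logits works here.
policy = nn.Sequential(nn.Embedding(V, 64), nn.Linear(64, V))

def reward_fn(completion: torch.Tensor) -> float:
    # Placeholder reward; in practice this is a reward model or a programmatic check.
    return float(completion.float().mean() % 1)

prompt = torch.randint(0, V, (1, 5))       # toy prompt token IDs

group, rewards = [], []
for _ in range(G):
    seq = prompt.clone()
    for _ in range(max_new):
        logits = policy(seq)[:, -1, :]     # next-token logits at the last position
        next_tok = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
        seq = torch.cat([seq, next_tok], dim=1)
    group.append(seq)
    rewards.append(reward_fn(seq[0, prompt.shape[1]:]))

rewards = torch.tensor(rewards)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(rewards, advantages)                 # inputs to the GRPO update sketched above
```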
4. Potential Improvements: Unifying Likelihood and Reward Signals
A core tension arises in practical LLM training: we want linguistic fluency (which is well-captured by next-token prediction on large corpora) but also want task-specific behaviors (captured by a reward model or user feedback). Some ways forward:
- Two-Stage Training:
  - First, pre-train the Transformer with standard cross-entropy on a large text corpus (for broad language fluency).
  - Then, fine-tune with GRPO on a smaller dataset or a reward signal. This approach is used in many RLHF (Reinforcement Learning from Human Feedback) pipelines.
- Hybrid Objective: Combine the likelihood term and the reward term in a single objective, e.g. $\mathcal{L}_{\text{hybrid}}(\theta) = \lambda_{\text{MLE}}\, \mathcal{L}_{\text{NLL}}(\theta) - \lambda_{\text{RL}}\, \mathcal{J}_{\text{GRPO}}(\theta)$. Tuning $\lambda_{\text{MLE}}$ vs. $\lambda_{\text{RL}}$ balances linguistic correctness with reward maximization (a minimal sketch follows this list).
- Better Baselines:
  - Instead of using a simple group average or a learned value function, one could integrate more sophisticated baselines (e.g., learned critics that exploit contextual cues).
  - Or incorporate group-level variance reduction plus longer-horizon estimates (GAE-like expansions) for tasks requiring multiple steps.
- Dynamic Reference Policies:
  - Periodically update $\pi_{\text{ref}}$ to be the current $\pi_\theta$.
  - Use an adaptive schedule for the KL penalty coefficient $\beta$ so that the policy can explore initially but is later constrained when it’s sufficiently good.
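As referenced in the Hybrid Objective item above, here is a minimal sketch of such a combined loss; the weights `lam_mle`, `lam_rl` and all inputs are placeholders, and the reward term reuses the clipped surrogate from Section 2.2:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(lm_logits, lm_targets, logp_new, logp_old, advantages,
                lam_mle=1.0, lam_rl=1.0, eps=0.2):
    """L = lam_mle * NLL  -  lam_rl * clipped GRPO surrogate (minimized jointly)."""
    # Supervised term: cross-entropy on a batch of corpus text.
    V = lm_logits.shape[-1]
    nll = F.cross_entropy(lm_logits.reshape(-1, V), lm_targets.reshape(-1))

    # Reward term: clipped surrogate over a group of sampled outputs.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()

    return lam_mle * nll - lam_rl * surrogate

# Toy shapes: 2 corpus sequences of length 5, group of 4 sampled outputs.
loss = hybrid_loss(
    lm_logits=torch.randn(2, 5, 1000, requires_grad=True),
    lm_targets=torch.randint(0, 1000, (2, 5)),
    logp_new=torch.randn(4, requires_grad=True),
    logp_old=torch.randn(4),
    advantages=torch.randn(4),
)
loss.backward()
print(loss.item())
```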
5. Mathematical Summary: The Relation Between MLE and GRPO
We can compare the respective training objectives more formally:
- Standard Transformer (MLE):

  $$
  \mathcal{L}_{\text{MLE}}(\theta) = - \sum_{x \in \mathcal{D}} \sum_{t} \log p_\theta(x_t \mid x_{<t})
  $$

  - Data is from a static corpus.
  - The “policy” $p_\theta$ is trained to predict tokens accurately.
- GRPO Objective:

  $$
  \mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\Big( \rho_i(\theta)\, A_i,\ \operatorname{clip}\big(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_i \Big) \right]
  $$

  - Data is generated on the fly by sampling from $\pi_{\theta_{\text{old}}}$.
  - Rewards $r_i$ can come from a reward model or human feedback.
  - The group-based baseline $\bar{r}$ reduces variance.
Bridging the Two Objectives
In many real-world applications (e.g., RLHF), the pipeline is:
- Pre-train with MLE on a huge corpus.
- Train a reward model (from human preferences or other signals).
- Fine-tune with an RL method like GRPO or PPO to align the model with desired behavior.
This approach leverages both the linguistic understanding from MLE and the targeted reward optimization from RL.
6. Concluding Thoughts
- GRPO can be viewed as a natural RL extension of a Transformer that otherwise does purely next-token prediction.
- Transformer architecture remains the same—what changes is the objective and the data generation process (from static to on-policy).
- Hybrid / multi-stage training can preserve fluency while encouraging the model to generate high-value responses in certain tasks.
- Future improvements might focus on more efficient baselines, better reference policy management, and smoother transitions between MLE and RL phases.
Takeaway: By understanding both the rigorous foundation of maximum likelihood Transformer training and how GRPO modifies it into a reward-driven RL scheme, we can better tailor language models to produce desirable outputs beyond mere likelihood matching. The synergy of large-scale pre-training and reward-guided fine-tuning is likely to remain a core strategy for building advanced, aligned language models.
Further Reading & References
- “Attention Is All You Need” by Vaswani et al., 2017, for the original Transformer.
- “Proximal Policy Optimization” by Schulman et al., 2017, for PPO.
- “Learning to Summarize from Human Feedback” by Stiennon et al., 2020, for a practical RLHF application.
- “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models” by Shao et al., 2024, which introduces Group Relative Policy Optimization (GRPO), simplifying PPO’s baseline with group-level statistics.