Post-training LLMs
Post-training is about changing a model's behavior after the extensive initial pretraining stage.
1. Supervised Fine-tuning (SFT)
SFT teaches the model to imitate desired behavior by training on examples of prompts paired with target responses.
The loss function for supervised fine-tuning:
SFT minimizes the negative log-likelihood of the response tokens (i.e., maximizes their likelihood) using a cross-entropy loss.
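A sketch of the standard formulation, with $x$ the prompt, $y$ the target response, and $\pi_\theta$ the model being trained:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}}\left[\sum_{t=1}^{|y|}\log \pi_\theta\big(y_t \mid x,\, y_{<t}\big)\right]$$

Only the response tokens contribute to the sum; prompt tokens are typically masked out of the loss.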
Done correctly, the result is an instruct model (fine-tuned model) that can answer user queries properly.
SFT Use cases:
- Jumpstarting new model behavior:
  - Pre-trained models -> instruct models
  - Non-reasoning models -> reasoning models
  - Letting the model use certain tools without providing tool descriptions in the prompt
- Improving model capabilities:
  - Distilling capabilities into smaller models by training on high-quality synthetic data generated by larger models
Common methods for high-quality SFT data curation:
- Distillation: Generate responses from a stronger and larger instruct model.
- Best of K / rejection sampling: Generate multiple responses from the original model and select the best among them (see the sketch after this list).
- Filtering: Start from a larger-scale SFT dataset and filter by response quality and prompt diversity.
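A minimal sketch of best-of-K / rejection sampling for SFT data curation. The `generate` and `score` callables are assumptions (e.g., a sampling call into the current model and a reward model or LLM judge); the names and the quality threshold are illustrative, not from the source.

```python
from typing import Callable, Dict, List

def best_of_k(
    prompt: str,
    generate: Callable[[str], str],      # samples one response from the model (assumed interface)
    score: Callable[[str, str], float],  # quality score, e.g. reward model or LLM judge (assumed)
    k: int = 8,
) -> str:
    """Sample k candidate responses and keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda response: score(prompt, response))

def build_sft_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    k: int = 8,
    min_score: float = 0.7,              # illustrative quality bar
) -> List[Dict[str, str]]:
    """Keep only prompts whose best-of-k response clears the quality bar."""
    dataset = []
    for prompt in prompts:
        best = best_of_k(prompt, generate, score, k=k)
        if score(prompt, best) >= min_score:
            dataset.append({"prompt": prompt, "response": best})
    return dataset
```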
Quality > quantity for improving capabilities:
- 1,000 high-quality, diverse examples > 1,000,000 mixed-quality examples.
2. Direct Preference Optimization (DPO)
- DPO uses contrastive learning from positive and negative samples.
- DPO minimizes a contrastive loss that penalizes the negative (rejected) response and encourages the positive (chosen) response.
- The DPO loss is a cross-entropy loss on the reward difference of an implicitly "re-parameterized" reward model.
The loss function for DPO:
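The standard DPO objective (from the original DPO paper), where $y_w$ is the preferred (positive) response, $y_l$ the dispreferred (negative) response, $\pi_{\text{ref}}$ a frozen reference policy, $\sigma$ the sigmoid, and $\beta$ a temperature hyperparameter:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

The scaled log-ratios $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ play the role of the "re-parameterized" rewards mentioned above, and the loss is a binary cross-entropy on their difference.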
Best Use Cases for DPO:
- Making small modifications to model responses:
  - Identity
  - Multilingual
  - Instruction following
  - Safety
- Improving model capabilities:
  - Better than SFT at improving model capabilities due to its contrastive nature
  - Online DPO is better for improving capabilities than offline DPO
Common methods for high-quality DPO data curation:
- Correction: Generate responses from the original model as negatives, then make enhancements to produce the positives.
  - Example: "I'm Llama" (negative) -> "I'm Athene" (positive)
- Online / on-policy: Both the positive and negative examples can come from the current model's distribution. Generate multiple responses from the current model for the same prompt, then collect the best response as the positive sample and the worst as the negative (see the sketch after this list).
  - The best / worst response can be chosen using reward functions or human judgement.
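A minimal sketch of the online / on-policy pairing described above. The `generate` and `score` callables are again assumptions (a sampling call into the current model and a reward function or human judgement); the margin filter is an illustrative extra, not from the source.

```python
from typing import Callable, Dict, List, Optional

def make_preference_pair(
    prompt: str,
    generate: Callable[[str], str],      # samples one response from the current model (assumed)
    score: Callable[[str, str], float],  # reward function or human judgement (assumed)
    k: int = 8,
    min_margin: float = 0.0,
) -> Optional[Dict[str, str]]:
    """Generate k on-policy responses; pair the best (positive) with the worst (negative)."""
    candidates = [generate(prompt) for _ in range(k)]
    ranked = sorted(candidates, key=lambda response: score(prompt, response))
    rejected, chosen = ranked[0], ranked[-1]
    # Drop prompts where best and worst are too close to carry a useful preference signal.
    if score(prompt, chosen) - score(prompt, rejected) <= min_margin:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```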
Avoid overfitting:
- DPO performs implicit reward learning, which can easily overfit to shortcuts when the preferred answers contain easy-to-learn patterns that the non-preferred answers lack.
  - Example: the positive samples always contain a few special words while the negative samples do not.
3. Online Reinforcement Learning
Online Learning:
- The model learns by generating new responses in real time: it iteratively collects fresh responses and their rewards, updates its weights, and explores new responses as it learns.
Offline Learning:
- The model learns purely from pre-collected prompt-response (and optionally reward) tuples; no fresh responses are generated during learning.
Process: the online RL loop repeatedly samples prompts, generates responses with the current policy, scores them with a reward function, and updates the policy weights (see the sketch below).
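A schematic of one iteration of that loop, assuming a hypothetical `policy` object with `generate` and `update` methods and a `reward_fn` callable; the `update` call stands in for a PPO/GRPO-style policy-gradient step.

```python
from typing import Callable, List, Tuple

def online_rl_step(
    policy,                                   # current policy model (assumed interface)
    prompts: List[str],
    reward_fn: Callable[[str, str], float],   # trained reward model or verifiable reward
    samples_per_prompt: int = 4,
) -> List[Tuple[str, str, float]]:
    """Generate fresh on-policy responses, score them, and update the policy weights."""
    batch = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            response = policy.generate(prompt)      # fresh responses from the current policy
            reward = reward_fn(prompt, response)    # score each response
            batch.append((prompt, response, reward))
    policy.update(batch)  # PPO/GRPO-style update on the newly collected, scored batch
    return batch
```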
Reward Function in Online RL:
1. Trained Reward Model
- Usually initialized from an existing instruct model, then trained on large-scale human- or machine-generated preference data
- Works for any open-ended generations
- Good for improving chat & safety
- Less accurate for correctness-based domains such as coding, math, and function calling
2. Verifiable Reward
- Requires preparing ground-truth answers for math, unit tests for coding, or a sandboxed execution environment for multi-turn agentic behavior (see the sketch below)
- More reliable than a trained reward model in those domains
- Used more often for training reasoning models
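A minimal example of a verifiable reward for math-style prompts: compare the model's final answer against a prepared ground-truth answer and return a binary reward. The `Answer: ...` extraction convention is an illustrative assumption, not a fixed format from the source.

```python
def math_reward(response: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the extracted final answer matches the ground truth."""
    answer = None
    for line in response.splitlines():
        if line.lower().startswith("answer:"):       # assumed answer format
            answer = line.split(":", 1)[1].strip()
    if answer is None:
        return 0.0
    return 1.0 if answer == ground_truth.strip() else 0.0

# Example: math_reward("3 * 4 = 12\nAnswer: 12", "12") -> 1.0
```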
Policy Training in Online RL:
- Both GRPO and PPO are very effective online RL algorithms!
- GRPO:
  - Well-suited for binary (often correctness-based) rewards
  - Requires a larger number of samples
  - Requires less GPU memory (no value model needed; see the group-relative advantage sketch after this list)
- PPO:
  - Works well with either a reward model or binary rewards
  - More sample-efficient with a well-trained value model
  - Requires more GPU memory (value model)
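A sketch of the group-relative advantage at the core of GRPO: sample a group of responses per prompt, then normalize each response's reward by the group mean and standard deviation, which is why no separate value model is needed.

```python
from statistics import mean, pstdev
from typing import List

def grpo_advantages(group_rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (reward - group mean) / group std, one per sampled response."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Example with a binary (correctness-based) reward over 4 sampled responses:
# grpo_advantages([1.0, 0.0, 0.0, 1.0]) -> approximately [1.0, -1.0, -1.0, 1.0]
```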