Post-training LLMs

Post-training is about changing a model's behavior after the extensive initial pretraining.


1. Supervised Fine-tuning (SFT)

SFT teaches the model to imitate desired behavior by training on examples of prompts paired with desired responses.

The loss function for supervised fine-tuning:

$$\mathcal{L}_{\text{SFT}} = -\sum_{i=1}^{N} \log p_\theta\!\left(\text{Response}^{(i)} \mid \text{Prompt}^{(i)}\right)$$

SFT minimizes the negative log-likelihood of the responses (i.e., maximizes their likelihood) using a cross-entropy loss.
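
To make the loss concrete, here is a minimal PyTorch-style sketch (not from the original notes) that applies cross-entropy only to the response tokens by masking out the prompt; the tensor names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, response_mask):
    """Negative log-likelihood of the response tokens only.

    logits:        (batch, seq_len, vocab) model outputs
    input_ids:     (batch, seq_len) prompt + response token ids
    response_mask: (batch, seq_len) 1 for response tokens, 0 for prompt/padding
    """
    # Shift so that token t is predicted from the tokens before it
    logits = logits[:, :-1, :]
    labels = input_ids[:, 1:]
    mask = response_mask[:, 1:].float()

    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)

    # Average negative log-likelihood over response tokens
    return -(token_log_probs * mask).sum() / mask.sum()
```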

The result, when done correctly, is an instruct model (a fine-tuned model) that can answer users' queries properly.

SFT Use cases:

Common methods for high-quality SFT data curation:

Quality > quantity for improving capabilities:

2. Direct Preference Optimization (DPO)

The loss function for DPO:

$$\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta\left(\log \frac{\pi_\theta(y_{\text{pos}} \mid x)}{\pi_{\text{ref}}(y_{\text{pos}} \mid x)} - \log \frac{\pi_\theta(y_{\text{neg}} \mid x)}{\pi_{\text{ref}}(y_{\text{neg}} \mid x)}\right)\right)$$
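
A minimal PyTorch sketch of this loss (not from the original notes), assuming per-sequence log-probabilities have already been computed under the policy and the frozen reference model; the function and argument names are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(policy_pos_logps, policy_neg_logps,
             ref_pos_logps, ref_neg_logps, beta=0.1):
    """DPO loss from per-sequence log-probabilities (summed over response tokens).

    policy_*: log pi_theta(y|x) under the model being trained
    ref_*:    log pi_ref(y|x) under the frozen reference model
    """
    # Log-ratios of policy to reference for the preferred and rejected responses
    pos_logratio = policy_pos_logps - ref_pos_logps
    neg_logratio = policy_neg_logps - ref_neg_logps

    # -log sigmoid(beta * (pos_logratio - neg_logratio))
    losses = -F.logsigmoid(beta * (pos_logratio - neg_logratio))
    return losses.mean()
```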


Best Use Cases for DPO:

Common methods for high-quality DPO data curation:

Avoid overfitting:

3. Online Reinforcement Learning


Online Learning:

Offline Learning:

Process:

Reward Function in Online RL:

1. Trained Reward Model

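A reward model is commonly trained with a pairwise (Bradley-Terry style) objective that pushes the score of the preferred response above the rejected one. The sketch below is illustrative, not from the original notes, and the tensor names are assumptions:

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Pairwise loss: the chosen response should receive a higher scalar reward.

    chosen_scores, rejected_scores: (batch,) scalar scores from the reward model.
    """
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```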

2. Verifiable Reward
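
As a toy illustration of a verifiable reward (not from the original notes), the sketch below scores a completion 1.0 if its final numeric answer matches the ground truth and 0.0 otherwise; the "Answer:" format is an assumption for illustration only:

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Toy verifiable reward: exact match on a final numeric answer.

    Assumes the completion ends with a line like 'Answer: 42'.
    """
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$", completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == ground_truth.strip() else 0.0
```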

Policy Training in Online RL


$$\mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, o \sim \pi_{\theta_{\text{old}}}(O \mid q)} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} \min\!\left( \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})} A_t,\ \text{clip}\!\left( \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})},\, 1-\epsilon,\, 1+\epsilon \right) A_t \right) \right]$$
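
A token-level sketch of this clipped surrogate in PyTorch (to be maximized), assuming per-token log-probabilities, advantage estimates, and a generation mask are already available; the names are assumptions, not from the original notes:

```python
import torch

def ppo_clipped_objective(logprobs, old_logprobs, advantages, mask, eps=0.2):
    """Token-level PPO clipped surrogate objective.

    logprobs:     (batch, seq_len) log pi_theta(o_t | q, o_<t)
    old_logprobs: (batch, seq_len) log pi_theta_old(o_t | q, o_<t)
    advantages:   (batch, seq_len) advantage estimates A_t
    mask:         (batch, seq_len) 1 for generated tokens, 0 for prompt/padding
    """
    mask = mask.float()
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages

    per_token = torch.minimum(unclipped, clipped)
    # Average over the |o| generated tokens of each sequence, then over the batch
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    return per_seq.mean()
```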

Full Fine-tuning vs Parameter-Efficient Fine-tuning (PEFT)

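As one illustrative example of PEFT (not necessarily the specific method intended in the notes), the sketch below shows a minimal LoRA-style adapter: the pretrained weight is frozen and only a small low-rank update is trained. All class and parameter names here are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-style adapter: frozen base weight plus a trainable low-rank update."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)

        in_f, out_f = base_linear.in_features, base_linear.out_features
        self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # down-projection
        self.lora_b = nn.Parameter(torch.zeros(out_f, rank))        # up-projection
        self.scaling = alpha / rank

    def forward(self, x):
        # y = W x + (B A x) * scaling, where only A and B receive gradients
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```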
