A demo presentation with math, code, and figures
Nathan Lambert
2026-02-22


The loss function for RLHF with a KL penalty:

$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\left[r(x, y)\right] - \beta \, \mathbb{D}_{\mathrm{KL}}\left(\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\right)$$

Inline math works too: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[R \cdot \nabla_\theta \log \pi_\theta(a|s)]$
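The KL-penalized objective above can be sketched in plain Python. This is a minimal illustrative version assuming a typical PPO-style implementation, where the reward is shaped per sample with the single-sample KL estimate $\log \pi_\theta - \log \pi_{\mathrm{ref}}$; the function name and values are hypothetical.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Shape the reward with a KL penalty toward the reference model.

    Uses the common single-sample KL estimate logp_policy - logp_ref,
    which discourages the policy from drifting away from the SFT model.
    """
    kl = logp_policy - logp_ref
    return reward - beta * kl

# A response the policy likes more than the reference does gets penalized:
r = kl_penalized_reward(reward=1.0, logp_policy=-2.0, logp_ref=-3.0, beta=0.1)
print(r)  # 1.0 - 0.1 * 1.0 = 0.9
```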

```python
from colloquium import Deck

deck = Deck(title="My Talk", author="Researcher")
deck.add_title_slide(subtitle="A research presentation")
deck.add_slide(
    title="Key Results",
    content="Our method achieves **state-of-the-art** performance.",
)
deck.build("output/")
```


This wider column has the main explanation text. The 60/40 split gives more room to the primary content while keeping a sidebar for supplementary info.
“Supplementary details go in the narrower column.”
Supporting points:




Colloquium supports images in any layout. Here the wordmark sits in the wider column alongside explanatory text.
Images auto-scale to fit their container while maintaining aspect ratio.


| Model | Accuracy | F1 Score | Training Time |
|---|---|---|---|
| Baseline | 82.1% | 79.3% | 2h |
| Ours (small) | 91.4% | 89.7% | 4h |
| Ours (large) | 95.2% | 93.8% | 12h |
“The results demonstrate significant improvements across all metrics.”

This slide uses the align and size directives to center all text and increase the font size.
Great for emphasis slides.

All content on this slide is vertically centered, like a title slide but with ## heading style.



text-4xl
text-3xl
text-2xl
text-xl — Key takeaways
text-lg — Callouts and introductions
text-base — Default body text
text-sm — Dense lists, supporting details
text-xs — Footnotes, references, fine print

What is RLHF?
RLHF (Reinforcement Learning from Human Feedback) is a technique for aligning language models with human preferences using reward models trained on human comparisons.

You are a helpful AI research assistant.
What is RLHF?
RLHF is a technique for aligning language models with human preferences.

Can you explain the RLHF training pipeline?
The RLHF pipeline has three main steps: supervised fine-tuning on demonstrations, reward model training on human comparisons, and RL optimization of the policy against the reward model.
What's the role of KL divergence?
The KL penalty prevents the policy from diverging too far from the SFT model. Without it, the model can exploit the reward model with degenerate outputs — this is called reward hacking.
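To make the KL penalty concrete, here is a small illustrative snippet (the distributions are hypothetical, not from the slides) computing $\mathbb{D}_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ for two categorical next-token distributions:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

policy = [0.7, 0.2, 0.1]     # pi_theta: concentrating on one token
reference = [0.4, 0.3, 0.3]  # pi_ref: the SFT model stays more spread out

print(round(kl_divergence(policy, reference), 4))
```

As the policy's distribution sharpens away from the reference, this divergence grows, and the penalty term pulls the optimization back.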

High-contrast callout for key takeaways.
Softer card for supporting notes.
Neutral bordered panel for references or caveats.


The foundational work on RLHF (Christiano et al., 2017) introduced learning reward models from human comparisons.
InstructGPT (Ouyang et al., 2022) scaled this approach to large language models, demonstrating significant alignment improvements.
For a comprehensive overview, see Lambert (2024).

The reward model is trained with the Bradley-Terry preference model:

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right]$$

where $y_w$ is the preferred response and $y_l$ is the rejected response.
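The Bradley-Terry loss above can be sketched in a few lines of Python. This is an illustrative scalar version, where `r_w` and `r_l` stand in for the reward-model scores $r_\theta(x, y_w)$ and $r_\theta(x, y_l)$:

```python
import math

def bradley_terry_loss(r_w, r_l):
    """Negative log-likelihood that the preferred response wins:
    -log sigmoid(r_w - r_l)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

# Small loss when the reward model already ranks the preferred response higher,
# large loss when it ranks the rejected response higher:
print(round(bradley_terry_loss(2.0, 0.0), 4))
print(round(bradley_terry_loss(0.0, 2.0), 4))
```

Minimizing this loss pushes the score margin $r_\theta(x, y_w) - r_\theta(x, y_l)$ up on the human-labeled comparisons.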



Questions?


