AI Training and Alignment

The Core Pipeline

Every modern AI assistant goes through roughly the same lifecycle. Tap a stage to see what happens there.

1. Training Data

A model can only learn patterns that exist in what it reads. Frontier models are trained on trillions of tokens: web pages, books, code, reference works, licensed datasets, and increasingly images, audio, and video. The quality of that mixture matters as much as its size — a model trained on sloppy text learns sloppy habits.

Cleaning and filtering: raw web data is full of duplicates, spam, boilerplate, and junk. Labs deduplicate aggressively and filter for quality, because repeated or low-value text wastes training and skews behavior.
Mixture choices: how much code vs. prose vs. math vs. other languages is a deliberate recipe. More code in the mix is part of why modern models are decent programmers.
Hard problems: copyright disputes, privacy of scraped personal data, bias baked into source text, and uneven language coverage are all live, unresolved issues.
Synthetic data: labs now also train on data generated or curated by other models — useful for filling gaps, risky if errors compound.

Key distinction: training data is not a database the model looks things up in. After training, the data is gone — what remains is a web of statistical patterns squeezed into the model's weights. That is why a model can "know" something approximately but misremember the details.

2. Pretraining

Pretraining is one deceptively simple game played at staggering scale: guess the next token. The model reads a chunk of text with the ending hidden, predicts what comes next, and gets measured on how wrong it was — a number called the loss. An algorithm called backpropagation works out how to adjust each of the model's billions of weights to be slightly less wrong next time. Repeat trillions of times on thousands of specialized chips for months, and the model is forced to absorb grammar, facts, reasoning patterns, and style — because all of those help it guess better.

Train a tiny model yourself

Real demo — a real (toy) statistical model learning from real example sentences

This miniature model counts which word tends to follow each pair of words. It starts knowing nothing. Feed it examples and watch its prediction sharpen and its surprise (loss) fall — the same feedback loop as a real model, minus the neural network and about a trillion tokens.

Examples seen0 sentences

Surprise (loss)maximum — it knows nothing

Training round0 of 4

Untrained, every word is an equally good guess. Click “Feed more examples” to start training.

Why it works: predicting the next token well in arbitrary text quietly requires understanding the text. You cannot reliably finish "the capital of France is" without knowing the answer.
Scale: frontier training runs use trillions of tokens, weights numbering in the hundreds of billions or more, and months of compute. Bigger runs have historically produced predictably better predictions — the "scaling laws."
The catch: the result is a base model — a brilliant text continuer, not an assistant. Ask it a question and it may continue with three more questions, because that is what the pattern looked like online.

3. Fine-Tuning

Fine-tuning takes the base model and keeps training it — but now on a much smaller, hand-built dataset of examples showing the behavior you actually want. For assistants, that means example conversations: a user asks, the assistant answers helpfully. This is called supervised fine-tuning (SFT), and it is the step that teaches the model that a question deserves an answer rather than a continuation.

Base model vs. tuned assistant

Illustrative examples — written to show the typical behavior gap, not live model output

Base model continues the text

Tuned assistant answers

Instruction tuning: a broad SFT diet of many task types teaches the general skill of following instructions, not just answering trivia.
Domain adaptation: the same technique can specialize a model for code, medicine, law, or support — at some cost to general flexibility.
Same engine underneath: fine-tuning does not add knowledge so much as reshape behavior. The base model already "knew" how to answer; tuning made answering the default.

4. RLHF and Alignment

Fine-tuning shows the model what good answers look like. But "good" is fuzzy — helpful, honest, harmless, appropriately confident — and you cannot write enough perfect examples to pin it down. So labs use reinforcement learning from human feedback (RLHF): show people two candidate answers, ask which is better, train a reward model to predict those preferences, then use reinforcement learning to nudge the assistant toward answers the reward model scores highly. You are about to generate that exact training signal yourself.

You are the human feedback

Pick the better answer in each round — your choices become a reward signal

No preferences recorded yet. Your picks will be tallied here, the way a reward model tallies millions of them.

The chain: human rankings → reward model → reinforcement learning. No single person writes the assistant's personality; it emerges from millions of comparisons.
Variants: RLAIF replaces some human raters with AI raters; Constitutional AI has the model critique itself against written principles; DPO trains directly on preference pairs and skips the separate reward model.
The balance problem: if raters reward caution too much, the model becomes uselessly over-cautious; reward confidence too much and it becomes a smooth-talking guesser. Alignment is steering between those ditches, and it is never finished.

Honest framing: alignment shapes behavior; it does not install guarantees. A well-aligned model is more likely to be honest and refuse harm — it is not incapable of error.

5. Evals and Red Teaming

Before a new model ships, it gets examined: benchmark suites for reasoning, coding, and math; human evaluations of usefulness and style; safety testing for jailbreaks, dangerous capabilities, and privacy leaks; and red teams whose whole job is to make the model misbehave. The most important rule is the least glamorous: a new model must not quietly get worse at things the old one did well.

Run the release gate

A simplified eval dashboard — candidate model vs. the current one

Benchmarks saturate: once models near 100% on a test, it stops measuring anything. Eval suites are constantly rebuilt, and scores on old benchmarks mean less every year.
Contamination: if a benchmark's questions leaked into training data, the score is memorization, not skill. Labs work to detect and exclude this.
Red teaming: adversarial testers probe for jailbreaks, harmful instructions, persuasion, and misuse before — and after — release.

6. Deployment Layer

The trained model is not what you talk to. Products wrap it in a system prompt that sets role and rules, tool access, retrieval, memory, moderation filters, and rate limits. That wrapper is why the same underlying model can feel like a different product in different apps.

System prompts set persona, boundaries, and output style before your message arrives
Tool routing and permissions decide what the model may actually do
Moderation and monitoring catch failures the training missed
Agent wrappers add loops and autonomy — raising both capability and risk

7. RAG and Embeddings

Retraining a model to teach it your documents is slow and expensive. Retrieval-augmented generation (RAG) skips that: store your documents, search them when a question arrives, and place the best matches into the model's context so it can answer from evidence. The search usually uses embeddings — every text gets converted to a list of numbers (a vector) where similar meanings land near each other, so "watering schedule" can find "irrigation notes" without sharing a single word.

Retrieve, then answer

Real demo — a real (toy) keyword retriever scoring a real mini document library

Pick a question. The retriever will score every document, pull the best matches into context, and answer only from what it retrieved.

Why it beats retraining for facts: documents update instantly, answers can cite sources, and nothing is baked irreversibly into weights
Why it is not magic: retrieval can miss the right document or pull an irrelevant one — then the model confidently answers from the wrong evidence
When fine-tuning still wins: style, format, and skills are behaviors, not facts — you cannot retrieve your way into being better at code review

8. Limits and Open Questions

An honest explainer ends with what nobody fully knows yet.

Hallucination is reduced, not solved — fluent text without enough grounding remains the signature failure
Interpretability is young: researchers can map some internal features and circuits, but nobody can fully explain a frontier model's reasoning weight-by-weight
Bias from training data persists despite filtering and tuning
Emergence and saturation: which abilities appear at which scale is still debated, and benchmarks keep getting outgrown
What alignment can guarantee — versus merely make likely — is the central open question of the field