Future deep dive
AI Training and Alignment
This is a placeholder outline for the next major explainer. The main How AI Works page is about using AI systems at runtime. This page will explain how models are trained, tuned, evaluated, and shaped into assistants.
The Core Pipeline
The future page should make the whole lifecycle visible before going deep into each piece.
Collect and filter data
Pretrain base model
Fine-tune behavior
Align with feedback
Evaluate and red-team
Deploy with safeguards
1. Training Data
What the model learns from, and why quality matters as much as scale.
- Web text, books, code, images, audio, video, and licensed datasets
- Cleaning, deduplication, filtering, and dataset mixture choices
- Copyright, privacy, bias, language coverage, and synthetic data
- Why training data is not the same thing as live memory or a database
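The cleaning and deduplication bullet above can be made concrete with a small sketch. It assumes a corpus that fits in memory, exact-match deduplication after simple normalization, and a crude minimum-length filter. The names `normalize`, `clean_corpus`, and `min_words` are hypothetical, and real pipelines typically add near-duplicate detection and learned quality filters on top.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def clean_corpus(docs: list[str], min_words: int = 3) -> list[str]:
    """Drop very short documents and exact duplicates (after normalization)."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # crude quality filter: too short to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content as a document we already kept
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The cat sat on the mat.",
    "the cat   sat on the mat.",  # duplicate once case and spacing are normalized
    "Buy now!",                   # too short, filtered out
    "Gradient descent updates weights to reduce loss.",
]
cleaned = clean_corpus(docs)
print(cleaned)
```

Hashing the normalized text keeps memory use small, but it only catches exact repeats, which is why this is a starting point rather than a full filter.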
2. Pretraining
How a base model learns broad patterns before it becomes an assistant.
- Tokens, batches, loss, gradients, weights, and backpropagation
- Next-token prediction for language models
- Compute, GPUs, checkpoints, scaling laws, and training runs
- Why base models can be powerful but awkward or unsafe to chat with
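A minimal sketch of the loss, gradient, and weight-update loop for next-token prediction, assuming a toy bigram model (one table of logits, no neural network) and whitespace tokens. The gradient is written out by hand, so this illustrates gradient descent on a cross-entropy loss rather than backpropagation through many layers. The corpus, the learning rate, and all names are illustrative assumptions.

```python
import math

text = "the cat sat on the mat the cat sat on the mat"
tokens = text.split()
vocab = sorted(set(tokens))
idx = {word: i for i, word in enumerate(vocab)}
V = len(vocab)

# Training pairs: (previous token id, next token id)
pairs = [(idx[a], idx[b]) for a, b in zip(tokens, tokens[1:])]

# weights[i][j] is the logit for "token j follows token i"
weights = [[0.0] * V for _ in range(V)]


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_loss() -> float:
    """Average cross-entropy of the true next token over all pairs."""
    return sum(-math.log(softmax(weights[p])[n]) for p, n in pairs) / len(pairs)


losses = []
lr = 0.5
for step in range(50):
    losses.append(mean_loss())
    grads = [[0.0] * V for _ in range(V)]
    for p, n in pairs:
        probs = softmax(weights[p])
        for j in range(V):
            # Gradient of cross-entropy w.r.t. logits: probs - one_hot(target)
            grads[p][j] += (probs[j] - (1.0 if j == n else 0.0)) / len(pairs)
    for i in range(V):
        for j in range(V):
            weights[i][j] -= lr * grads[i][j]
losses.append(mean_loss())

# The first loss equals ln(V) because all logits start at zero (uniform guess).
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The loss cannot reach zero here because "the" is followed by both "cat" and "mat" in the toy text, so the best the model can do is split probability between them.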
3. Fine-Tuning
How broad capability is shaped toward a narrower task or product style.
- Supervised fine-tuning with example prompts and ideal answers
- Instruction tuning for following user requests
- Domain adaptation for coding, medicine, law, support, or local tasks
- The tradeoff between specialization and general flexibility
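One way to picture supervised fine-tuning data is as prompt and answer pairs where only the answer tokens contribute to the loss. The sketch below assumes whitespace tokenization and a simple 0/1 loss mask. `build_sft_example` and `masked_mean` are hypothetical helpers, and masking the prompt is a common choice rather than the only one.

```python
def build_sft_example(prompt: str, answer: str) -> tuple[list[str], list[int]]:
    """Concatenate prompt and ideal answer; mark which tokens are trained on."""
    prompt_tokens = prompt.split()
    answer_tokens = answer.split()
    tokens = prompt_tokens + answer_tokens
    # 0 = prompt token (ignored by the loss), 1 = answer token (trained on)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(answer_tokens)
    return tokens, loss_mask


def masked_mean(per_token_loss: list[float], loss_mask: list[int]) -> float:
    """Average the per-token loss over answer positions only."""
    total = sum(loss * m for loss, m in zip(per_token_loss, loss_mask))
    return total / max(1, sum(loss_mask))


tokens, loss_mask = build_sft_example(
    "Summarize: the cat sat on the mat", "A cat sat."
)
print(list(zip(tokens, loss_mask)))

# Made-up per-token losses, only to show that prompt positions are ignored.
example_losses = [1.0] * 7 + [2.0, 4.0, 6.0]
print(masked_mean(example_losses, loss_mask))
```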
4. RLHF and Alignment
The missing piece Jez spotted: how models are tuned toward preferred behavior.
- Human preference comparisons and ranked answers
- Reward models that estimate which answer people prefer
- Reinforcement learning that nudges the model toward answers the reward model scores highly
- RLAIF, constitutional AI, safety tuning, refusal behavior, and over-refusal
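The step from preference comparisons to a reward model can be sketched with a pairwise (Bradley-Terry style) loss: the model is penalized whenever it scores the rejected answer above the preferred one. The example assumes a linear reward over three hand-made features; the feature meanings, the data, and the learning rate are all invented for illustration. It covers only the reward-model step, not the reinforcement learning that follows.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Each item: (features of the preferred answer, features of the rejected answer).
# Hypothetical features: [is_polite, answers_question, rambling_score]
comparisons = [
    ([1.0, 1.0, 0.2], [0.0, 1.0, 0.2]),
    ([1.0, 1.0, 0.1], [1.0, 0.0, 0.1]),
    ([0.0, 1.0, 0.3], [0.0, 0.0, 0.9]),
    ([1.0, 1.0, 0.5], [0.0, 0.0, 0.5]),
]

w = [0.0, 0.0, 0.0]  # reward model weights, one per feature


def reward(features: list[float]) -> float:
    return sum(wi * fi for wi, fi in zip(w, features))


def pairwise_loss() -> float:
    """Mean of -log sigmoid(reward(preferred) - reward(rejected))."""
    return sum(
        -math.log(sigmoid(reward(c) - reward(r))) for c, r in comparisons
    ) / len(comparisons)


initial_loss = pairwise_loss()
lr = 0.5
for _ in range(200):
    grad = [0.0] * len(w)
    for c, r in comparisons:
        p = sigmoid(reward(c) - reward(r))  # modeled P(preferred beats rejected)
        for k in range(len(w)):
            grad[k] += -(1.0 - p) * (c[k] - r[k]) / len(comparisons)
    for k in range(len(w)):
        w[k] -= lr * grad[k]
final_loss = pairwise_loss()

print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
print("weights:", [round(x, 2) for x in w])
```

In a full RLHF setup, a reward model like this would then score the language model's own outputs during a reinforcement learning phase.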
5. Evals and Red Teaming
How labs check whether a model is useful, risky, or regressing.
- Benchmarks for reasoning, coding, math, instruction following, and multimodal tasks
- Human evals for usefulness, style, truthfulness, and preference
- Safety evals for jailbreaks, cyberattack assistance, biological risk, persuasion, privacy, and misuse
- Regression testing when prompts, models, or tools change
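Regression testing can be sketched as a per-case comparison between a baseline run and a candidate run. The case names and results below are invented, and `find_regressions` is a hypothetical helper. The point is that an overall pass rate can stay flat while an individual case breaks.

```python
def find_regressions(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> list[str]:
    """Return cases that passed in the baseline but fail (or are missing) now."""
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )


def pass_rate(results: dict[str, bool]) -> float:
    return sum(results.values()) / len(results)


# Hypothetical eval results: case name -> passed?
baseline = {
    "math_01": True,
    "code_07": True,
    "refuse_unsafe_03": True,
    "style_02": False,
}
candidate = {
    "math_01": True,
    "code_07": False,  # used to pass, now fails
    "refuse_unsafe_03": True,
    "style_02": True,  # newly fixed
}

regressions = find_regressions(baseline, candidate)
print("baseline pass rate:", pass_rate(baseline))
print("candidate pass rate:", pass_rate(candidate))
print("regressions:", regressions)
```

Both runs pass three of four cases, yet `code_07` regressed, which is why per-case tracking matters alongside aggregate scores.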
6. Deployment Layer
What gets added around the trained model before users ever see it.
- System prompts, tool routing, retrieval, memory, moderation, and rate limits
- Product policies, permission boundaries, and monitoring
- Why the same base model can feel different in different products
- How agent wrappers change capability and risk
7. RAG and Embeddings
How models use external knowledge without retraining.
- Embeddings as vectors that place similar meanings close together, so text can be searched by meaning
- Vector databases and document retrieval
- Grounding answers in retrieved files or sources
- When RAG beats fine-tuning, and when it does not
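The retrieve-then-answer flow can be sketched without any model at all. This example assumes a toy "embedding" that is just a bag-of-words count vector, cosine similarity for ranking, and an in-memory list standing in for a vector database. Real embedding models capture meaning beyond exact word overlap, which this stand-in does not. The documents, the query, and the prompt wording are illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


documents = [
    "A refund is available within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes three to five business days.",
]

# The "vector database": each document stored next to its vector.
index = [(doc, embed(doc)) for doc in documents]


def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


query = "How do I get a refund for my purchase?"
top = retrieve(query)

# Grounding: the retrieved text is placed in the prompt, not in the weights.
prompt = f"Answer using only this source:\n{top[0]}\n\nQuestion: {query}"
print(prompt)
```

Updating what the system knows means editing `documents`, not retraining anything, which is the core contrast with fine-tuning.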
8. Limits and Open Questions
The page should stay honest about what is known, what is product choice, and what is still debated.
- Hallucination, calibration, interpretability, and mechanistic understanding
- Bias, cultural assumptions, and power concentration
- Capability jumps, emergent behavior, and benchmark saturation
- What alignment can realistically guarantee
Interactive Ideas for Later
Good future modules would make training concrete instead of turning into a wall of theory.
- A tiny loss/gradient demo where a model gets less wrong over repeated examples
- A preference-ranking demo that turns human choices into a reward signal
- A base-model vs chat-model comparison showing why post-training matters
- A RAG simulator that retrieves source snippets before answering
- An eval dashboard showing pass, fail, regression, and safety checks