Build a DeepSeek Model (From Scratch)

Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat, Naman Dwivedi

MEAP began November 2025
Last updated December 2025
Publication in Summer 2026 (estimated)

ISBN 9781633434325
325 pages (estimated)

Included with a Manning Online subscription

printed in black & white

table of contents

1 Introduction to DeepSeek

1.1 Why DeepSeek? A turning point in open-source AI

1.2 The key innovations we will build

1.2.1 Architecture

1.2.2 Training

1.2.3 Post-training

1.3 Book structure and scope

1.4 What this book will teach you and what it won’t

1.5 What you will need to follow along

1.6 Summary

2 Solving the inference bottleneck with the key-value cache

2.1 The LLM inference loop: Generating text one token at a time

2.1.1 Distinguishing pre-training from inference

2.1.2 The autoregressive process: Appending tokens to build context

2.1.3 Visualizing autoregressive generation with GPT-2

2.2 The core task: Predicting the next token

2.2.1 From input embeddings to context vectors: A mathematical walkthrough

2.2.2 From context vectors to logits

2.2.3 The key insight: Why only the last row matters

2.3 The problem of redundant computations

2.3.1 Intuition: Are we calculating the same thing over and over?

2.3.2 A mathematical proof: Visualizing repeated calculations

2.3.3 The performance impact: From quadratic to linear complexity

2.4 The solution: Caching for efficiency

2.4.1 What to cache? A step-by-step derivation

2.4.2 The new inference loop with KV caching

2.4.3 Demonstrating the speedup of KV caching

2.5 The dark side of the KV cache: The memory cost

2.5.1 The KV cache formula: Deconstructing the size

2.5.2 The scaling problem in practice

2.6 The memory-first approach: Multi-Query Attention (MQA)

2.6.1 The core idea: Sharing a single key and value

2.6.2 The impact on the KV cache formula

2.6.3 The performance trade-off: Loss of expressivity

2.6.4 Implementing an MQA layer from scratch

2.7 The middle ground: Grouped-Query Attention (GQA)

2.7.1 The core idea: Sharing keys and values within groups

2.7.2 The tunable knob: Balancing memory and performance

2.7.3 Implementing a GQA layer from scratch

2.8 The performance vs. memory trade-off

2.9 Summary

3 The DeepSeek breakthrough: Multi-Head Latent Attention (MLA)

3.1 MLA: The best of both worlds

3.2 The MLA architecture: A visual walkthrough

3.2.1 The query path (unchanged)

3.2.2 The key/value path (the innovation)

3.3 The mathematical magic: How the latent matrix helps

3.3.1 A step-by-step derivation of Q, K, and V in MLA

3.3.2 The absorption trick: How attention scores are calculated

3.3.3 The final step: Calculating the context vector

3.4 The new inference loop with MLA

3.4.1 What happens when a new token arrives?

3.4.2 Caching the latent vector: The only thing we store

3.4.3 Decompressing the cache and calculating attention

3.5 Quantifying the gains

3.5.1 The new KV cache formula: A 64x reduction

3.5.2 Preserving performance: Why head diversity is maintained

3.6 Building an MLA module from scratch

3.7 The problem of order

3.8 Attempt #1: The naive approach - integer positional encodings

3.8.1 The simple idea: Using position numbers directly

3.8.2 The major flaw: Polluting semantic embeddings with large magnitudes

3.9 Attempt #2: A step forward - Binary positional encodings

3.9.1 Solving the magnitude problem with binary representation

3.9.2 Uncovering a deeper pattern: Oscillation frequencies

3.9.3 The new problem: The issue with discontinuous jumps

3.10 Attempt #3: The "Attention Is All You Need" breakthrough - sinusoidal positional encodings

3.10.1 From discrete jumps to smooth waves: Introducing sine and cosine

3.10.2 The power of rotation: Encoding relative positions

3.10.3 The remaining flaw: Still polluting the token embeddings

3.11 The state-of-the-art: Rotary Positional Encoding (RoPE)

3.11.1 The core insights: Injecting position into attention and preserving magnitude

3.11.2 The mechanism: Rotating query and key vectors

3.12 The new challenge: Why standard RoPE and MLA don’t mix

3.13 The incompatibility problem: Why standard MLA and RoPE don’t work together

3.14 The DeepSeek solution: Decoupled rotary position embedding

3.14.1 The new architecture: A visual and mathematical deep dive

3.14.2 Combining the paths to calculate final attention

3.14.3 The cost of decoupling: A trade-off between memory and computation

3.14.4 The new inference loop in action: What happens when a new token arrives

3.15 Quantifying the gains: The final cache memory comparison

3.16 Building MLA + decoupled RoPE from scratch

3.17 The payoff: An empirical head-to-head comparison

3.18 Summary

4 Mixture-of-Experts (MoE) in DeepSeek: Scaling intelligence efficiently

4.1 The intuition behind mixture of experts

4.1.1 The problem with dense FFNs in transformer: High parameter count and computational cost

4.1.2 The sparsity solution: Activating only a subset of experts per token

4.1.3 Expert specialization: The "why" behind sparsity

4.2 The mechanics of MoE: A hands-on mathematical walkthrough

4.2.1 The goal: Combining multiple expert outputs into one

4.2.2 Sparsity in action: Top-K selection for load balancing

4.2.3 The routing mechanism: From input to expert scores

4.2.4 From scores to weights: Top-K selection and softmax normalization

4.2.5 The final output: Creating the weighted sum of expert outputs

4.3 The challenge of balance: Ensuring all experts contribute

4.3.1 Attempt #1: The auxiliary loss

4.3.2 Attempt #2: The load balancing loss

4.3.3 A hard cap: The capacity factor

4.4 The DeepSeek innovations: Towards ultimate expert specialization

4.4.1 Core problems with traditional MoE

4.4.2 Innovation #1: Fine-grained expert segmentation

4.4.3 Innovation #2: Shared expert isolation

4.4.4 Innovation #3: Auxiliary-loss-free load balancing

4.5 Building a complete DeepSeek-MoE language model from scratch

4.6 The payoff: An empirical head-to-head comparison

4.7 Summary

5 Multi-token prediction and FP8 quantization

5.1 The core idea: From single-token to multi-token prediction

5.2 The four key advantages of MTP

5.2.1 Densification of training signals

5.2.2 Improved data efficiency

5.2.3 Better planning by prioritizing "choice points"

5.2.4 Higher inference speed via speculative decoding

5.3 The DeepSeek MTP architecture: A visual and mathematical walkthrough

5.3.1 The starting point: The shared transformer trunk

5.3.2 The MTP modules: A sequential chain of prediction

5.3.3 The final loss calculation

5.4 Implementing a causal multi-token prediction module from scratch

5.5 Quantization: Trading precision for speed and memory

5.5.1 What is quantization?

5.5.2 Why quantize? The memory cost of high-precision parameters

5.5.3 Understanding numerical formats: The building blocks of quantization

5.5.4 The basic mechanism: Scaling

5.5.5 The five pillars of DeepSeek’s FP8 training

5.5.6 Pillar 1: The mixed precision framework

5.5.7 Pillar 2: Fine-grained quantization

5.5.8 Pillar 3: Increasing accumulation precision

5.5.9 Pillar 4: Mantissa over exponents

5.5.10 Pillar 5: Online quantization

5.6 Summary

6 The DeepSeek training pipeline: Building a foundation model

7 Post-training: Supervised fine-tuning and reinforcement learning

8 Knowledge distillation: Making powerful models practical

Appendix

Appendix A: DeepSeek in context: A comparison with other LLMs

Overview

4 Mixture-of-Experts (MoE) in DeepSeek: Scaling intelligence efficiently

Mixture-of-Experts (MoE) replaces the Transformer’s dense feed-forward networks with a pool of smaller, specialized “experts,” and activates only a few per token via a learned router. This sparsity lets models amass far more total parameters without proportional compute, because most experts stay idle for any given token. During pretraining, experts naturally specialize (e.g., grammar, domains, patterns), so the router can dispatch tokens to the most relevant specialists and then merge their outputs, achieving high capacity at low per-token cost.

Mechanically, the router computes expert scores per token with a linear projection, applies top-k selection to enforce sparsity, normalizes the kept scores with softmax, and forms the final representation as a weighted sum of the chosen experts’ outputs—preserving the input/output shape expected by the Transformer. A central challenge is load imbalance: some experts can become hotspots while others “die.” Traditional remedies include an auxiliary loss that penalizes variance in expert importance, a load-balancing loss that aligns expert probabilities with actual routed token fractions, and a capacity factor that hard-limits tokens per expert to prevent overloads.

DeepSeek advances MoE along three fronts. First, fine-grained expert segmentation uses many smaller experts to curb knowledge hybridity and sharpen specialization without increasing total capacity. Second, shared expert isolation splits the layer into dense shared experts (holding common, reusable knowledge) and routed experts (freed to specialize deeply), with outputs summed alongside the residual path. Third, auxiliary-loss-free load balancing introduces a dynamic, per-expert bias on router logits that is updated each step to nudge routing toward underused experts and away from overloaded ones—decoupling balance from the main training objective. A from-scratch implementation demonstrates these ideas in practice, and empirical results show lower validation loss and higher throughput than a standard MoE, validating both efficiency and effectiveness.

Our four-stage journey to build the DeepSeek model. This chapter focuses on the highlighted component, DeepSeek-style Mixture-of-Experts (MoE), the second major innovation in the core architecture.

The standard Feed-Forward Network (FFN) in a Transformer block, featuring an expansion-contraction architecture.

The architectural change of MoE. The single, dense FFN is replaced by a collection of four smaller, specialized expert networks.

An example of expert routing in a Mixture-of-Experts model. For the input "What is 1+1?", the router must decide which specialized experts to activate. The routing mechanism might prioritize grammatical components (like the question mark and the verb "is"), highlighting how routing is a nuanced decision based on learned patterns.

The initial challenge of MoE. The input matrix is passed through each of the three expert networks in parallel, resulting in three separate expert output matrices.

The principle of sparsity or load balancing. For each token, we decide to route it to only a subset (k=2) of the available experts.

The routing mechanism. The input matrix is multiplied by a learned routing matrix to produce an expert selector matrix, which contains a raw score for each expert for each token.

The top-k selection process. For each row, only the two highest scores are kept, and the rest are masked out.

The masked scores are replaced with negative infinity in preparation for the softmax function.

The softmax function converts the scores into a final expert selector weight matrix.

Calculating the final output for a single token. The weights from the selector matrix are used to create a weighted sum of the corresponding expert outputs.

The complete MoE process for a batch of tokens. The expert selector weight matrix guides the weighted summation of the expert outputs to produce a single, final output matrix of the same shape as the input.

The Expert Selector Weight Matrix. Each row corresponds to a token, and each column corresponds to an expert.

Calculating Expert Importance by summing the probabilities down each column.

The Auxiliary Loss is calculated from the Coefficient of Variation of the Expert Importance scores.

An example demonstrating that equal expert importance does not guarantee a balanced token load.

Calculating the Router Probability (pi) for each expert.

Calculating the Fraction of Tokens Dispatched (fi) for each expert.

An illustration of imbalanced routing without an expert capacity. In this scenario, the router has sent all tokens in the batch to Expert 1, leaving the other experts idle. This overloading of a single expert is the problem that expert capacity is designed to prevent.

A comparison between a conventional MoE with a few large experts (top) and DeepSeek's fine-grained approach with many smaller experts (bottom).

The DeepSeekMoE architecture with Shared and Routed Experts. All tokens are processed by the dense Shared Experts, while the router selectively sends each token to a sparse subset of the Routed Experts.

The final output of the DeepSeekMoE layer is the sum of the outputs from the dense shared experts and the sparse routed experts.

Calculating the number of tokens routed to each expert based on the top-k selection.

Calculating the load violation for each expert.

The direction of the bias update based on the expert's load status.

The bias term is added to the raw router logits before the top-k selection process.

The complete forward pass of the DeepSeekMoE module, showing the three parallel data paths: the dense shared expert path, the sparse routed expert path with dynamic load balancing, and the residual connection.

A comparison of the validation loss curves for the Standard MoE and DeepSeek-MoE models. Despite having a similar number of parameters, the DeepSeek-MoE architecture consistently achieves a lower loss, indicating superior learning. Both models were trained for 5,000 iterations.

Expert selection frequency for the Standard-MoE model in a sample batch. The uneven distribution highlights the problem of imbalanced routing.

Expert selection frequency for the DeepSeek-MoE model in a sample batch. The distribution is remarkably uniform, demonstrating the effectiveness of the dynamic bias mechanism.

Summary

Dense Feed-Forward Networks (FFNs) in standard Transformers are computationally expensive, as all of their parameters are activated for every single token, creating a bottleneck for both training and inference.
Mixture of Experts (MoE) replaces the single, dense FFN with a committee of smaller, specialized "expert" networks.
The efficiency of MoE comes from sparsity: for any given token, a routing mechanism activates only a small subset of the total experts (e.g., the top 2), leaving the rest dormant and their computations unperformed.
During pre-training, experts learn to specialize in handling specific types of information (e.g., punctuation, verbs, or Python code), which is why activating only a few is effective.
The routing mechanism is a small, learnable linear layer that generates scores for each expert. A top-k selection identifies the most relevant experts, and a softmax function converts their scores into weights for combining their outputs.
Imbalanced routing, where some experts are over-utilized and others are ignored, leads to inefficient learning and performance degradation.
Traditional MoE models use an Auxiliary Loss term to penalize imbalance, but this can interfere with the primary training objective of learning the language.
DeepSeek's first innovation, Fine-Grained Expert Segmentation, uses a massive number of smaller experts to solve the problem of Knowledge Hybridity, allowing for deeper specialization.
DeepSeek's second innovation, Shared Expert Isolation, uses a small set of dense "generalist" experts to learn common knowledge, solving the problem of Knowledge Redundancy and freeing up the routed "specialist" experts.
DeepSeek's third innovation, Auxiliary-Loss-Free Load Balancing, dynamically adjusts router scores with a bias term, enforcing balance without interfering with the main training loss and resolving the core trade-off of traditional balancing methods.

FAQ

What is a Mixture-of-Experts (MoE) and how does it differ from a dense FFN in a Transformer?

MoE replaces the single, large, dense Feed-Forward Network (FFN) in each Transformer block with a pool of smaller FFNs called experts. Instead of activating all parameters for every token (dense FFN), MoE uses sparsity: for each token, only a small subset of experts is activated. This yields a model with many more total parameters (greater knowledge capacity) while keeping per-token compute low.

How does sparsity (top-k gating) make MoE efficient?

Sparsity means each token is routed to only k experts (k is a hyperparameter), while others remain inactive. Implementation steps:
- Compute expert scores (logits) per token
- Keep only the top-k scores per token; mask the rest with −∞
- Apply softmax so masked experts become 0 and selected experts get weights that sum to 1

This avoids computing all experts, cutting both training and inference cost.
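The three steps above can be sketched for a single token in plain Python. This is a minimal illustration, not the book's implementation: the function name `top_k_gate` and the example logits are hypothetical, and a real model would run this over tensors for a whole batch.

```python
import math

def top_k_gate(logits, k):
    """Return routing weights for one token: softmax over the top-k logits, 0 elsewhere."""
    # Indices of the k largest scores for this token.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Mask every other expert with -inf so that softmax assigns it exactly 0.
    masked = [logits[i] if i in keep else -math.inf for i in range(len(logits))]
    # Numerically stable softmax; math.exp(-inf) evaluates to 0.0.
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four experts; experts 0 and 2 have the two highest logits.
weights = top_k_gate([2.0, 0.5, 1.0, -1.0], k=2)
```

Only the experts with non-zero weight need to be evaluated afterwards, which is where the compute saving comes from.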

How does the router compute expert scores, and what are the key tensor shapes?

The router is a linear layer that maps token embeddings to expert scores:
- Input Matrix: (T, d) for T tokens, embedding size d
- Routing Matrix (learned): (d, E) for E experts
- Expert Selector (logits) Matrix: (T, E) = Input × Routing

Each row is a token; each column is an expert; entries are raw suitability scores.
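To make the shapes concrete, here is a small sketch that builds the three matrices as nested lists. The sizes (T=4, d=8, E=3) and the random initialization are illustrative assumptions; in a trained model the routing matrix would be learned.

```python
import random

def matmul(a, b):
    """Multiply an (n, m) matrix by an (m, p) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

T, d, E = 4, 8, 3  # tokens, embedding size, experts (illustrative values)
rng = random.Random(0)

x = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)]        # input matrix, shape (T, d)
w_route = [[rng.gauss(0, 1) for _ in range(E)] for _ in range(d)]  # routing matrix, shape (d, E)

logits = matmul(x, w_route)  # expert selector matrix, shape (T, E)
shape = (len(logits), len(logits[0]))
```

Row t of `logits` holds the raw score of every expert for token t, which is the input to the top-k step.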

After routing, how are multiple expert outputs combined into a single output?

For each token:
- Select the top-k experts and get their softmax-normalized weights
- Compute each selected expert’s output vector for that token
- Form a weighted sum of those vectors using the routing weights

Stack per-token results to get a final (T, d) matrix matching the input shape.
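The combination step can be sketched as follows. The toy "experts" here simply scale their input by a constant, which is an assumption made to keep the arithmetic easy to follow; real experts are small feed-forward networks. The helper name `moe_combine` is hypothetical.

```python
def moe_combine(tokens, weights, experts):
    """Combine expert outputs per token using routing weights.

    tokens:  list of T vectors of length d
    weights: list of T rows of length E (zero for unselected experts)
    experts: list of E callables mapping a d-vector to a d-vector
    """
    out = []
    for x, row in zip(tokens, weights):
        y = [0.0] * len(x)
        for w, expert in zip(row, experts):
            if w == 0.0:
                continue  # sparsity: experts the router did not select are never evaluated
            h = expert(x)
            y = [acc + w * v for acc, v in zip(y, h)]
        out.append(y)
    return out

# Toy experts: each scales its input by a different constant.
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 10.0, 100.0)]
tokens = [[1.0, 2.0], [3.0, 4.0]]               # shape (T=2, d=2)
weights = [[0.75, 0.25, 0.0], [0.0, 0.5, 0.5]]  # shape (T=2, E=3), top-2 per token
output = moe_combine(tokens, weights, experts)  # shape (T=2, d=2), same as the input
```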

Why does sparsity work in practice? What is expert specialization?

During pre-training, different experts converge to specialize on particular patterns (e.g., punctuation, verbs, domain-specific tokens). The router learns to activate only the most relevant specialists per token, so most experts can be ignored for any given token without losing performance.

What is load imbalance in MoE, and why is it harmful?

Load imbalance occurs when the router over-selects a few experts while others rarely get tokens. Consequences:
- Inefficient learning: underused experts don’t update and become dead weight
- Performance bottlenecks: overused experts become hotspots, hurting throughput and specialization

How do the Auxiliary Loss and the Load Balancing Loss differ?

- Auxiliary Loss: Encourages balanced expert “importance” by penalizing dispersion (e.g., coefficient of variation) of per-expert probability mass. It may still miss token-count imbalance.
- Load Balancing Loss: Considers both
  - pi: mean router probability for expert i (normalized importance)
  - fi: fraction of tokens dispatched to expert i (actual load)

Minimizes Σ(fi × pi), scaled by number of experts and λ, pushing both distributions toward uniformity.
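The difference between the two losses can be illustrated with a toy batch. This sketch makes several assumptions that are not stated above: the auxiliary loss is taken as λ times the squared coefficient of variation (a common formulation), tokens are dispatched top-1 for simplicity, λ = 0.01, and the probability values are invented for the example.

```python
def importance_cv_loss(probs, lam=0.01):
    """Auxiliary loss: lam * CV^2 of per-expert importance (column sums of probs)."""
    importance = [sum(col) for col in zip(*probs)]
    mean = sum(importance) / len(importance)
    var = sum((v - mean) ** 2 for v in importance) / len(importance)
    return lam * var / mean ** 2  # CV^2 = variance / mean^2

def dispatch_fraction(chosen, num_experts):
    """f_i: fraction of tokens actually dispatched to expert i."""
    return [chosen.count(i) / len(chosen) for i in range(num_experts)]

def load_balancing_loss(probs, chosen, lam=0.01):
    """Load-balancing loss: lam * E * sum_i f_i * p_i."""
    num_tokens, num_experts = len(probs), len(probs[0])
    p = [sum(col) / num_tokens for col in zip(*probs)]  # p_i: mean router probability
    f = dispatch_fraction(chosen, num_experts)
    return lam * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens, two experts. Both columns sum to 2.0, so importance is perfectly equal.
probs = [[0.625, 0.375]] * 3 + [[0.125, 0.875]]
chosen = [max(range(len(row)), key=row.__getitem__) for row in probs]  # top-1 dispatch

aux = importance_cv_loss(probs)
f = dispatch_fraction(chosen, num_experts=2)
lb = load_balancing_loss(probs, chosen)
```

In this toy batch the importance-based loss is zero even though expert 0 receives three of the four tokens, which is the gap that the dispatched fraction fi is meant to expose.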

What is Expert Capacity and the Capacity Factor?

Expert Capacity is a hard cap on how many tokens any expert can process in a batch. It is typically:

capacity ≈ ceil((tokens_per_batch / num_experts) × capacity_factor)

- Capacity Factor (>1.0, e.g., 1.25) allows slight imbalance without drops
- If an expert exceeds capacity, excess tokens are dropped for that forward pass

This prevents per-batch overloads even though the balancing losses enforce balance only softly.
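A short sketch of the capacity formula and the dropping rule follows. The function names are hypothetical, and dropping tokens in arrival order is an assumption; implementations may choose which tokens to drop differently.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.25):
    """Hard cap on tokens per expert, following the formula above."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def apply_capacity(chosen, num_experts, capacity):
    """Accept tokens in arrival order until an expert is full.

    chosen: expert index picked for each token (top-1 for simplicity).
    Returns (kept, dropped) lists of token indices.
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(chosen):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)  # expert is full for this forward pass
    return kept, dropped

cap = expert_capacity(tokens_per_batch=8, num_experts=4)  # ceil(2 * 1.25)
kept, dropped = apply_capacity([0, 0, 0, 0, 1, 2, 0, 3], num_experts=4, capacity=cap)
```

Here five of the eight tokens ask for expert 0, so the two that arrive after it is full are dropped.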

What are DeepSeek’s key MoE innovations and what problems do they solve?

- Fine-Grained Expert Segmentation: Use many smaller experts instead of a few large ones to avoid knowledge hybridity. Scale expert hidden sizes down so total capacity stays constant (e.g., 64×1024 ≈ 16×4096).
- Shared Expert Isolation: Add a small dense set of shared experts seen by every token to centralize common knowledge, freeing routed experts to specialize deeply. Final output per token: Residual(x) + Sum(Shared_Outputs) + Weighted_Sum(Routed_Outputs).
- Auxiliary-Loss-Free Load Balancing: Replace balancing losses with a dynamic bias added to router logits to encourage uniform expert utilization without compromising the main language-learning objective.
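The capacity arithmetic behind fine-grained segmentation (64×1024 versus 16×4096) can be checked with a few lines. The embedding size of 512 and the top-k values (2 of 16 versus 8 of 64) are assumptions chosen for illustration; the sketch counts only the two weight matrices of each expert FFN and ignores biases.

```python
import math

def ffn_params(d_model, d_hidden):
    """Weights in a two-layer FFN (expansion + contraction), ignoring biases."""
    return 2 * d_model * d_hidden

d_model = 512  # hypothetical embedding size, for illustration only

# Total capacity: 16 large experts versus 64 experts that are each 4x smaller.
coarse_total = 16 * ffn_params(d_model, 4096)
fine_total = 64 * ffn_params(d_model, 1024)

# Assumption: top-k grows 4x (2 -> 8) so active parameters per token stay the same.
coarse_active = 2 * ffn_params(d_model, 4096)
fine_active = 8 * ffn_params(d_model, 1024)

# Number of distinct expert subsets a single token can be routed to.
coarse_combos = math.comb(16, 2)
fine_combos = math.comb(64, 8)
```

Under these assumptions total and active parameter counts are identical, while the number of possible expert combinations per token grows by several orders of magnitude, which is one way to read the claim that smaller experts allow sharper specialization.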

How does DeepSeek’s auxiliary-loss-free load balancing work, and what are the empirical benefits?

Mechanism:
- After each step, count tokens per expert (post top-k)
- Compute load violation per expert vs average load
- Update a persistent bias b_i (e.g., b_i += η × tanh(violation)), increasing bias for underloaded experts and decreasing it for overloaded ones
- Add bias to router logits before next step’s top-k

Benefits observed in the chapter’s experiment:
- More uniform expert utilization per batch
- Lower validation loss (e.g., 1.9451 vs 1.9854)
- Higher throughput (~22% faster), demonstrating improved efficiency and learning
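The mechanism can be sketched with one update step on a toy batch. Several details are assumptions rather than the chapter's code: the violation is defined as (average load − expert load) / average load so that it is positive for underloaded experts, η = 0.1, routing is top-1, and all six tokens share the same invented logits.

```python
import math

def select_top_k(logits, bias, k):
    """Pick the k experts with the highest biased score for one token."""
    scored = [s + b for s, b in zip(logits, bias)]
    return sorted(range(len(scored)), key=lambda i: scored[i], reverse=True)[:k]

def update_bias(bias, token_counts, eta=0.1):
    """Raise the bias of underloaded experts and lower it for overloaded ones."""
    avg = sum(token_counts) / len(token_counts)
    new_bias = []
    for b, count in zip(bias, token_counts):
        violation = (avg - count) / avg  # > 0 when underloaded (sign convention assumed)
        new_bias.append(b + eta * math.tanh(violation))
    return new_bias

# Toy batch: every token prefers expert 0, with expert 1 a close second.
batch_logits = [[1.0, 0.9, 0.0]] * 6
bias = [0.0, 0.0, 0.0]

counts = [0, 0, 0]
for logits in batch_logits:
    for expert in select_top_k(logits, bias, k=1):
        counts[expert] += 1  # tokens per expert, counted after top-k

new_bias = update_bias(bias, counts)
before = select_top_k(batch_logits[0], bias, k=1)
after = select_top_k(batch_logits[0], new_bias, k=1)
```

After a single update the overloaded expert 0 carries a negative bias and the close runner-up, expert 1, wins the selection instead. In a realistic batch with varied tokens, only the tokens with near-tied scores would be expected to switch, so the load shifts gradually rather than all at once.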

eBook

pdf, ePub, online

$47.99 $33.59

you save $14.40 (30%)
