1 Introduction to DeepSeek
Large language models now write, reason, and code at near-human levels, and this book invites a technically curious reader to build one from the ground up. Centering on DeepSeek, an open-source model that marked a turning point by matching the performance of leading proprietary systems at a fraction of the cost, the chapter sets the motivation: learn modern LLMs by reconstructing their core ideas in code. The goal is both practical mastery and deeper understanding, aligning with the broader spirit of democratizing AI through transparent methods and hands-on learning.
The chapter outlines the pillars behind DeepSeek’s leap: architectural swaps that replace standard attention with Multi-Head Latent Attention to relieve speed and memory pressure, and the feed-forward blocks with a Mixture-of-Experts layer to scale capacity efficiently. It introduces a new training objective, Multi-Token Prediction, to speed learning and inference, and FP8 quantization to push computational efficiency. Beyond architecture, the training pipeline overlaps tasks to keep hardware saturated, and post-training blends supervised fine-tuning, reinforcement learning, and rejection sampling to instill strong reasoning, culminating in DeepSeek-R1. Finally, knowledge distillation compresses these capabilities into smaller, practical checkpoints spanning roughly 1.5B to 70B parameters.
Readers are given a four-stage roadmap: foundational inference concepts (including the KV cache), core architecture (MLA and MoE), advanced training (MTP, FP8, and parallelism), and post-training (supervised fine-tuning, RL, and distillation). The book focuses on principles and reproducible implementations rather than proprietary data, exact weights, or production deployment. Prerequisites include Python, basic deep learning, and a working grasp of transformers; experiments are designed to run on consumer hardware, from CPUs to single GPUs with 8–12GB of VRAM, with larger GPUs simply enabling richer explorations. The promise is a clear, step-by-step path to understanding—and building—a mini-DeepSeek that captures the essence of state-of-the-art LLM design.
A simple interaction with the DeepSeek chat interface.
The title and abstract of the DeepSeek-R1 research paper.
A detailed view of a standard Transformer block, the foundational architecture used in models like LLaMA and the GPT series. It is composed of a multi-head attention block and a feed-forward network (FFN).
A simplified view of the DeepSeek model architecture. It modifies the standard Transformer by replacing the core components with Multi-Head Latent Attention (MLA) and a Mixture-of-Experts (MoE) layer. This design also utilizes RMS Norm (Root Mean Square Normalization) and a specialized Decoupled RoPE (Rotary Position Embedding).
An illustration of the DualPipe training pipeline on a single device. By overlapping the forward pass (the initial blocks), backward pass (the hatched blocks), and combined computations, this scheduling strategy minimizes GPU idle time and maximizes hardware utilization during large-scale training.
The multi-step post-training pipeline used to create DeepSeek-R1 from the DeepSeek-V3 base model. This process involves a combination of reinforcement learning (Pure RL), data generation (Rejection sampling), and fine-tuning to instill advanced reasoning capabilities.
Benchmark performance of DeepSeek-R1 against other leading models (as of January 2025).
The concept of knowledge distillation. A large, powerful "teacher" model (like DeepSeek-R1) is used to generate training data to teach a much smaller, more efficient "student" model, transferring its capabilities without the high computational cost.
The four-stage roadmap for building a mini-DeepSeek model in this book. We will progress from foundational concepts (Stage 1) and core architecture (Stage 2) to advanced training (Stage 3) and post-training techniques (Stage 4), implementing each key innovation along the way.
Summary
- Large Language Models (LLMs) have become a dominant force in technology, but the knowledge to build them has often been confined to a few large labs.
- DeepSeek marked a pivotal moment by releasing open-source models with performance that rivaled the best proprietary systems, demonstrating that cutting-edge AI could be developed and shared openly.
- This book will guide you through a hands-on process of building a mini-DeepSeek model, focusing on its key technical innovations to provide a deep, practical understanding of modern LLM architecture and training.
- The core innovations we will implement are divided into four stages: (1) KV Cache Foundation, (2) Core Architecture (MLA & MoE), (3) Advanced Training Techniques (MTP & FP8), and (4) Post-training (RL & Distillation).
- By building these components yourself, you will gain not just theoretical knowledge but also the practical skills to implement and adapt state-of-the-art AI techniques.
FAQ
Why is DeepSeek presented as a turning point in open-source AI?
DeepSeek demonstrated that an openly available model can rival top proprietary systems. Its R1 model, released in early 2025, matched or exceeded leading models (e.g., OpenAI’s o1-1217) on tough reasoning benchmarks like AIME 2024 and competitive coding, while being trained at a fraction of the cost. It also pushed openness by releasing weights and detailing methods.
What will I build and learn in this book?
You will build a mini-DeepSeek model from scratch, learning the core architectural ideas (Multi-Head Latent Attention and Mixture-of-Experts), the training objective (Multi-Token Prediction), efficiency techniques (FP8 quantization, DualPipe scheduling), and post-training methods (reinforcement learning and distillation). The focus is on understanding and implementing each component step by step.
How does DeepSeek’s architecture differ from a standard Transformer?
While built on the Transformer foundation, DeepSeek replaces standard multi-head attention with Multi-Head Latent Attention (MLA) and the feed-forward network with a Mixture-of-Experts (MoE) layer. It also uses RMSNorm and a decoupled RoPE variant. These changes target the speed and memory bottlenecks of attention and the challenge of scaling model capacity efficiently.
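One of these changes is small enough to show right away. Below is a minimal RMSNorm layer in PyTorch: a sketch of the general technique (rescale activations by their root mean square, then apply a learned gain), not DeepSeek’s exact implementation; the module name and `eps` value are illustrative.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square normalization: rescale by the RMS of the features, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms_inv * self.weight

x = torch.randn(2, 8, 512)          # (batch, sequence, features)
print(RMSNorm(512)(x).shape)        # torch.Size([2, 8, 512])
```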
What is Multi-Head Latent Attention (MLA) and why does it matter?
MLA is DeepSeek’s attention mechanism that tackles the speed and memory bottleneck of standard attention, especially on long sequences. It reduces the heavy compute/memory pressure while maintaining quality, enabling more efficient inference and training. The book starts from standard attention and builds up to MLA to show how and why it works.
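To preview the core idea before the full build-up later in the book, here is a minimal sketch of the low-rank key/value compression behind MLA: keys and values are reconstructed from a small shared latent vector, so only that latent needs to be cached during generation. All names and dimensions below are illustrative, and the sketch omits multi-head splitting and the decoupled RoPE path.

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Toy latent KV compression: cache a small latent c instead of full-size K and V."""
    def __init__(self, d_model: int = 512, d_latent: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # compress the hidden state
        self.up_k = nn.Linear(d_latent, d_model, bias=False)  # reconstruct keys on the fly
        self.up_v = nn.Linear(d_latent, d_model, bias=False)  # reconstruct values on the fly

    def forward(self, h: torch.Tensor):
        c = self.down(h)                      # (batch, seq, d_latent) -- the only thing cached
        return c, self.up_k(c), self.up_v(c)

h = torch.randn(1, 16, 512)
c, k, v = LatentKV()(h)
print(c.shape, k.shape, v.shape)  # caching c (64 dims) instead of k and v (512 dims each)
```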
What is the Mixture-of-Experts (MoE) layer in DeepSeek?
MoE replaces the standard feed-forward block with multiple “experts.” A router sends tokens to a subset of experts, increasing model capacity without activating all parameters for every token. This design scales model capability efficiently and balances computational load across experts.
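The sketch below shows the routing pattern in its simplest form: a linear router scores each token, the top-2 experts are selected, and their outputs are mixed with the renormalized router weights. It deliberately omits capacity limits, shared experts, and the load-balancing loss, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer: each token is processed by only top_k experts."""
    def __init__(self, d_model: int = 256, d_ff: int = 512, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):           # dispatch tokens to their experts
            for slot in range(self.top_k):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(10, 256)        # 10 tokens
print(ToyMoE()(x).shape)        # torch.Size([10, 256])
```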
What is Multi-Token Prediction (MTP)?
MTP is a training objective that predicts multiple future tokens at once. By supervising several steps ahead, it accelerates learning and can speed up inference, improving overall training efficiency compared to single-token prediction.
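As a minimal illustration of the objective, the sketch below attaches one extra output head per prediction depth and averages a cross-entropy loss against targets shifted by one and two positions. The hidden states and heads are random placeholders; DeepSeek’s actual MTP module is more elaborate, so treat this only as the shape of the idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multi_token_loss(hidden, heads, tokens, depths=(1, 2)):
    """hidden: (B, T, D) final hidden states; heads: one projection per prediction depth;
    tokens: (B, T) token ids. Averages the loss over all prediction depths."""
    losses = []
    for head, d in zip(heads, depths):
        logits = head(hidden[:, :-d])        # position t predicts the token d steps ahead
        targets = tokens[:, d:]              # same sequence shifted by d
        losses.append(F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1)))
    return torch.stack(losses).mean()

B, T, D, V = 2, 16, 64, 1000
hidden = torch.randn(B, T, D)                          # stand-in for a transformer's output
tokens = torch.randint(0, V, (B, T))
heads = nn.ModuleList([nn.Linear(D, V), nn.Linear(D, V)])
print(multi_token_loss(hidden, heads, tokens))
```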
What is FP8 quantization and how does it help?
FP8 quantization compresses weights and activations into an 8-bit floating-point format. This reduces memory footprint and speeds up computation while preserving model quality, making large models more practical to train and serve.
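To see what the 8-bit format buys, the sketch below quantizes a tensor into PyTorch’s `float8_e4m3fn` dtype with a single per-tensor scale and checks the round-trip error. It assumes a recent PyTorch release that exposes float8 dtypes, and it only illustrates the storage format; DeepSeek’s mixed-precision training recipe involves finer-grained scaling and higher-precision accumulation.

```python
import torch

def fp8_roundtrip(x: torch.Tensor):
    """Quantize with one per-tensor scale into float8_e4m3fn, then dequantize for comparison."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max      # largest representable e4m3 value (~448)
    scale = x.abs().max().clamp(min=1e-12) / fp8_max    # map the tensor's range onto the fp8 range
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)         # stored in 1 byte per element
    x_back = x_fp8.to(torch.float32) * scale            # back to full precision
    return x_fp8, x_back

x = torch.randn(1024, 1024)
x_fp8, x_back = fp8_roundtrip(x)
print(x_fp8.element_size(), "byte/element; max abs error:", (x - x_back).abs().max().item())
```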
How does the training pipeline (DualPipe) maximize hardware utilization?
DualPipe overlaps the forward pass of the current batch with the backward pass of the previous batch. By coordinating data loading, preprocessing, and compute, the schedule keeps the GPU busy with minimal idle time, improving throughput during large-scale training.
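A back-of-the-envelope calculation shows why this scheduling matters. In a plain synchronous pipeline, each stage idles for roughly one forward-and-backward slot per other stage while the pipeline fills and drains; schedulers like DualPipe attack the remaining idle time by overlapping forward, backward, and communication work. The numbers below use the standard bubble-fraction accounting, not DualPipe’s exact schedule.

```python
def bubble_fraction(n_stages: int, n_microbatches: int) -> float:
    """Idle fraction of a basic synchronous pipeline schedule:
    each stage waits ~(n_stages - 1) slots while the pipeline fills and drains."""
    bubble = n_stages - 1
    return bubble / (n_microbatches + bubble)

for m in (4, 16, 64):
    print(f"{m:3d} micro-batches, 8 stages -> {bubble_fraction(8, m):.0%} of slots idle")
```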
What post-training steps created DeepSeek-R1 from DeepSeek-V3?
- Step 1 (Foundation): Lightly fine-tune the base model (DeepSeek-V3) with a small “cold-start” dataset.
- Step 2 (Pure RL): Apply reinforcement learning to discover effective reasoning strategies.
- Step 3 (Self-Labeling): Use rejection sampling to generate and select high-quality synthetic data (see the sketch after this list).
- Step 4 (Blending Data): Combine synthetic data with supervised examples.
- Step 5 (Final RL): Finish with a broad RL phase to improve robustness and generalization.
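To make the rejection-sampling step concrete, here is a minimal sketch of the pattern: sample several candidate answers per prompt, score them with a reward function or verifier, and keep only the best ones as synthetic training data. The `generate` and `score` callables are toy stand-ins, not DeepSeek’s actual pipeline.

```python
import random

def rejection_sample(prompts, generate, score, n_candidates=8, threshold=0.8):
    """Keep only prompts whose best candidate clears the threshold, as new fine-tuning data."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        scored = [(score(prompt, c), c) for c in candidates]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score >= threshold:                  # reject prompts with no good candidate
            dataset.append({"prompt": prompt, "response": best})
    return dataset

# Toy stand-ins so the sketch runs end to end; a real pipeline would call an LLM and a verifier.
def toy_generate(prompt):            # stand-in for sampling from a model
    return f"candidate {random.randint(0, 9)} for {prompt!r}"

def toy_score(prompt, candidate):    # stand-in for a verifier or reward model
    return random.random()

print(len(rejection_sample(["2+2?", "capital of France?"], toy_generate, toy_score)))
```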
Who is this book for, and what are the prerequisites and hardware needs?
You should be comfortable with Python, the basics of backpropagation, and core PyTorch operations, with some exposure to Transformers. A CPU-only laptop can run most examples (slowly), while a single GPU with 8–12GB of VRAM is recommended; 24–48GB helps for larger MoE experiments. The book does not reproduce proprietary data, train 100B+ parameter models, or cover production serving and safety systems.
What are the core innovations the book focuses on?
- Architecture: Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE)
- Training: Multi-Token Prediction (MTP), FP8 quantization, and DualPipe scheduling
- Post-training: Reinforcement learning and model distillation to produce practical, smaller models