1 Understanding Large Language Models
Large language models are deep neural networks that have transformed natural language processing by moving beyond narrowly scoped, task-specific systems to broadly capable generators and analyzers of text. Trained on vast corpora, they capture rich context and subtle patterns, enabling applications from translation and summarization to question answering, coding assistance, and conversational agents. While they appear to “understand” language through coherent, context-aware output, this refers to statistical competence rather than human-like comprehension; their power stems from scale—both in model parameters and data—and from advances in deep learning.
The core enabler is the transformer architecture and its self-attention mechanism, which selectively focuses on relevant parts of input sequences to model long-range dependencies. Two major families illustrate how transformers are used: encoder-based models like BERT, optimized for understanding tasks via masked word prediction, and decoder-only models like GPT, which are autoregressive generators trained with next-word prediction. Despite this simple pretraining objective, GPT-style models exhibit zero-shot and few-shot generalization and emergent abilities such as translation, benefiting from exposure to diverse, large-scale datasets. Pretraining produces a general foundation model that can be adapted efficiently through finetuning, making LLMs practical despite the high compute costs of training from scratch.
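To make the self-attention idea concrete, the following minimal sketch (a simplified scaled dot-product attention in PyTorch, without trainable query/key/value projections or causal masking) shows how each token's output becomes a weighted combination of every token in the sequence:

```python
import torch

# Toy input: a sequence of four tokens, each as a 3-dimensional embedding.
inputs = torch.tensor(
    [[0.43, 0.15, 0.89],   # "This"
     [0.55, 0.87, 0.66],   # "is"
     [0.57, 0.85, 0.64],   # "an"
     [0.22, 0.58, 0.33]]   # "example"
)

# Attention scores: pairwise similarity between every pair of tokens.
scores = inputs @ inputs.T                       # shape: (4, 4)

# Scale and normalize into attention weights that sum to 1 per row.
d_k = inputs.shape[-1]
weights = torch.softmax(scores / d_k**0.5, dim=-1)

# Each context vector is a weighted sum over all input vectors, giving
# every token selective access to the whole sequence.
context = weights @ inputs                       # shape: (4, 3)
print(context)
```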
The chapter also motivates building LLMs to deepen understanding and to unlock benefits such as domain specialization, privacy, on-device deployment, and full control over updates. It outlines a hands-on path: prepare text data and implement attention; pretrain a compact GPT-like model for educational purposes; evaluate its capabilities; and then finetune it for instruction following or classification. While industrial-scale pretraining is expensive, open pretrained weights and careful engineering allow meaningful experimentation on consumer hardware, providing the skills to adapt and deploy models tailored to real-world needs.
As this hierarchical depiction of the relationship between the different fields suggests, LLMs represent a specific application of deep learning techniques, leveraging their ability to process and generate human-like text. Deep learning is a specialized branch of machine learning that focuses on using multi-layer neural networks. Machine learning and deep learning, in turn, are fields aimed at implementing algorithms that enable computers to learn from data and perform tasks that typically require human intelligence.

LLM interfaces enable natural language communication between users and AI systems. This screenshot shows ChatGPT writing a poem according to a user's specifications.

Pretraining an LLM involves next-word prediction on large text datasets. A pretrained LLM can then be finetuned using a smaller labeled dataset.

A simplified depiction of the original transformer architecture, which is a deep learning model for language translation. The transformer consists of two parts: an encoder that processes the input text and produces an embedding representation (a numerical representation that captures many different factors in different dimensions) of the text, and a decoder that uses this embedding to generate the translated text one word at a time. Note that this figure shows the final stage of the translation process, where the decoder has to generate only the final word ("Beispiel"), given the original input text ("This is an example") and a partially translated sentence ("Das ist ein"), to complete the translation.

A visual representation of the transformer's encoder and decoder submodules. On the left, the encoder segment exemplifies BERT-like LLMs, which focus on masked word prediction and are primarily used for tasks like text classification. On the right, the decoder segment showcases GPT-like LLMs, designed for generative tasks and producing coherent text sequences.

In addition to text completion, GPT-like LLMs can solve various tasks based on their inputs without needing retraining, finetuning, or task-specific model architecture changes. Sometimes it is helpful to provide examples of the target task within the input, which is known as a few-shot setting. However, GPT-like LLMs are also capable of carrying out tasks without any specific example, which is called a zero-shot setting.
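To illustrate the distinction, consider how the prompt itself differs between the two settings; the prompt strings below are hypothetical examples written for illustration, not output from any particular model:

```python
# Zero-shot: the task is described directly, with no worked examples.
zero_shot_prompt = "Translate English to German: This is an example =>"

# Few-shot: a handful of input-output pairs precede the actual query,
# so the model can infer the task pattern from the prompt alone.
few_shot_prompt = (
    "Translate English to German:\n"
    "cheese => Käse\n"
    "house => Haus\n"
    "This is an example =>"
)
```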

In the next-word pretraining task for GPT models, the system learns to predict the upcoming word in a sentence by looking at the words that have come before it. This approach helps the model understand how words and phrases typically fit together in language, forming a foundation that can be applied to various other tasks.
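A minimal sketch of how such training pairs can be derived from raw text, using naive whitespace splitting purely for illustration (real LLMs use proper tokenizers), looks as follows:

```python
# Each training example pairs a context (the words seen so far) with the
# word that immediately follows it; the next word acts as the "label."
text = "The lion roams the savanna at dusk"
tokens = text.split()   # naive whitespace splitting, for illustration only

for i in range(1, len(tokens)):
    context = tokens[:i]
    target = tokens[i]
    print(" ".join(context), "-->", target)
```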

The GPT architecture employs only the decoder portion of the original transformer. It is designed for unidirectional, left-to-right processing, making it well suited for text generation and next-word prediction tasks, producing text iteratively, one word at a time.
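The iterative, left-to-right generation process can be sketched as a simple greedy-decoding loop; here, `model` is a hypothetical placeholder for a trained GPT-like network that maps token IDs to next-token logits:

```python
import torch

def generate(model, token_ids, max_new_tokens):
    """Greedy decoding sketch: append one predicted token per iteration."""
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(token_ids)            # (batch, seq_len, vocab_size)
        next_token_logits = logits[:, -1, :]     # only the last position is used
        next_token = torch.argmax(next_token_logits, dim=-1, keepdim=True)
        # The new token is appended and fed back in, so the model only ever
        # conditions on tokens to its left (unidirectional processing).
        token_ids = torch.cat([token_ids, next_token], dim=1)
    return token_ids
```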

The stages of building LLMs covered in this book include implementing the LLM architecture and data preparation process, pretraining an LLM to create a foundation model, and finetuning the foundation model to become a personal assistant or text classifier.

Summary
- LLMs have transformed the field of natural language processing, which previously mostly relied on explicit rule-based systems and simpler statistical methods. The advent of LLMs introduced new deep learning-driven approaches that led to advancements in understanding, generating, and translating human language.
- Modern LLMs are trained in two main steps.
- First, they are pretrained on a large corpus of unlabeled text by using the prediction of the next word in a sentence as a "label."
- Then, they are finetuned on a smaller, labeled target dataset to follow instructions or perform classification tasks.
- LLMs are based on the transformer architecture. The key idea of the transformer architecture is an attention mechanism that gives the LLM selective access to the whole input sequence when generating the output one word at a time.
- The original transformer architecture consists of an encoder for parsing text and a decoder for generating text.
- LLMs for generating text and following instructions, such as GPT-3 and ChatGPT, only implement decoder modules, simplifying the architecture.
- Large datasets consisting of billions of words are essential for pretraining LLMs. In this book, we will implement and train LLMs on small datasets for educational purposes but also see how we can load openly available model weights.
- While the general pretraining task for GPT-like models is to predict the next word in a sentence, these LLMs exhibit "emergent" properties, such as the ability to classify, translate, or summarize texts.
- Once an LLM is pretrained, the resulting foundation model can be finetuned more efficiently for various downstream tasks.
- LLMs finetuned on custom datasets can outperform general LLMs on specific tasks.
[1] Readers with a background in machine learning may note that labeling information is typically required for traditional machine learning models and deep neural networks trained via the conventional supervised learning paradigm. However, this is not the case for the pretraining stage of LLMs. In this phase, LLMs leverage self-supervised learning, where the model generates its own labels from the input data. This concept is covered later in this chapter.
[2] GPT-3, The $4,600,000 Language Model, https://www.reddit.com/r/MachineLearning/comments/h0jwoz/d_gpt3_the_4600000_language_model/