Generative AI has surged into mainstream attention by creating new content—text, images, audio, code—rather than merely classifying existing data. This chapter sets the stage by contrasting generative and discriminative approaches, framing generative models as systems that learn data distributions to synthesize novel samples. It also motivates why learning to build these systems from the ground up matters: a transparent understanding leads to better results, practical control over outputs, and more responsible use. Python and PyTorch are introduced as the practical toolkit for this journey thanks to readable syntax, broad community support, dynamic computation graphs, and fast GPU training.
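Since the chapter leans on PyTorch's dynamic computation graphs and GPU support, a minimal sketch of both may be useful; the tensor shape and device-selection logic here are illustrative, not from the chapter itself.

```python
import torch

# Pick a GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# The computation graph is built dynamically, as each operation runs.
x = torch.randn(3, requires_grad=True, device=device)  # leaf tensor tracked by autograd
y = (x ** 2).sum()   # graph nodes are recorded on the fly
y.backward()         # backpropagate through the recorded graph
print(x.grad)        # dy/dx = 2x, computed automatically
```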
Two model families anchor the chapter. GANs pit a generator against a discriminator in an iterative contest, yielding increasingly realistic outputs and enabling powerful tasks like domain translation. Transformers address sequence problems with self-attention, capturing long-range dependencies while enabling parallel training—key to modern large language models and multimodal systems. The narrative connects these ideas to statistical foundations (conditional vs joint distributions), surveys Transformer variants (encoder-only, decoder-only, encoder–decoder), and highlights diffusion models and their role in text-to-image generation through progressive denoising and iterative refinement.
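Since self-attention is the mechanism that lets Transformers relate any two positions in a sequence, a minimal sketch may help; this is the standard scaled dot-product attention computation with illustrative shapes, not code from the chapter.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Standard attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # pairwise similarities
    weights = torch.softmax(scores, dim=-1)            # attention distribution
    return weights @ v                                 # weighted sum of values

x = torch.randn(1, 5, 16)                    # a batch of 5 embedded tokens
out = scaled_dot_product_attention(x, x, x)  # self-attention: q = k = v = x
```

Because every position attends to every other position in a single matrix multiplication, long-range dependencies are captured directly and the whole sequence can be processed in parallel during training.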
Beyond concepts, the chapter outlines a hands-on path: setting up Python and PyTorch, learning tensors, and completing an end-to-end project before building generative models from scratch. Readers implement GANs (including DCGAN and CycleGAN) and core Transformer components, explore smaller-scale language modeling, and leverage pretrained weights and transfer learning where training from scratch is impractical. Throughout, the chapter emphasizes practical benefits—controlling attributes of generated outputs, adapting models to downstream tasks—and encourages an informed perspective on the technology’s disruptive potential and risks, laying a solid foundation for the rest of the book.
A comparison of generative models versus discriminative models. A discriminative model (top half of the figure) takes data as input and produces probabilities of different labels, denoted Prob(dog) and Prob(cat). In contrast, a generative model (bottom half) learns the defining characteristics of these images in order to synthesize new images of dogs and cats.
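In probabilistic terms, the discriminative model estimates conditional probabilities such as Prob(dog | image), while the generative model learns the distribution of the images themselves, Prob(image), or the joint distribution Prob(image, label), and draws new samples from it.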
GAN architecture and its components. GANs employ a dual-network architecture comprising a generative model (left) tasked with capturing the underlying data distribution and a discriminative model (center) that estimates the likelihood that a given sample comes from the authentic training dataset (considered "real") rather than from the generative model (considered "fake").
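A minimal sketch of these two networks may make the architecture concrete; the layer sizes assume flattened 28 × 28 grayscale images and are illustrative rather than taken from the chapter.

```python
import torch
from torch import nn

latent_dim = 100

generator = nn.Sequential(       # maps random noise to a fake sample
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 28 * 28),
    nn.Tanh(),                   # outputs scaled to [-1, 1]
)

discriminator = nn.Sequential(   # scores a sample as real vs. fake
    nn.Linear(28 * 28, 256),
    nn.LeakyReLU(0.2),
    nn.Linear(256, 1),
    nn.Sigmoid(),                # probability that the input is "real"
)

z = torch.randn(16, latent_dim)  # a batch of random noise vectors
fake = generator(z)              # the generator's attempt at realistic data
score = discriminator(fake)      # the discriminator's estimate, in (0, 1)
```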
Examples from the anime faces training dataset
Anime face images generated by the trained DCGAN generator
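A DCGAN generator upsamples a noise vector into an image with transposed convolutions; the sketch below, with illustrative channel counts, produces 64 × 64 RGB output.

```python
import torch
from torch import nn

latent_dim = 100

# Each ConvTranspose2d doubles spatial resolution (after the first 1x1 -> 4x4 step).
dcgan_generator = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),  # 4x4
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),         # 8x8
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),           # 16x16
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),            # 32x32
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),                                 # 64x64 RGB
)

z = torch.randn(1, latent_dim, 1, 1)  # noise reshaped to a 1x1 spatial map
img = dcgan_generator(z)              # shape: (1, 3, 64, 64)
```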
Changing hair color with CycleGAN. If we feed images with blond hair (first row) to a trained CycleGAN model, the model converts blond hair to black hair in these images (second row). The same trained model can also convert black hair (third row) to blond hair (bottom row).
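The trick that lets one model translate in both directions is cycle consistency: translating an image to the other domain and back should recover the original. A minimal sketch, with placeholder generators standing in for the trained networks:

```python
import torch
from torch import nn

G_ab = nn.Identity()  # placeholder for the blond-to-black generator
G_ba = nn.Identity()  # placeholder for the black-to-blond generator
l1 = nn.L1Loss()

real_a = torch.randn(1, 3, 256, 256)  # stand-in for an image from domain A

fake_b = G_ab(real_a)                 # translate A -> B
recovered_a = G_ba(fake_b)            # translate back, B -> A
cycle_loss = l1(recovered_a, real_a)  # penalize failure to reconstruct the original
```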
The Transformer architecture. The encoder (left side of the diagram) converts the input sequence (e.g., the English phrase “How are you?”) into an abstract representation that captures its meaning, then passes it to the decoder (right side of the diagram). The decoder constructs the output (e.g., the French translation of the English phrase) by predicting one word at a time, based on the previous words in the sequence and the abstract representation from the encoder.
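PyTorch ships an encoder-decoder Transformer module, so the caption's architecture can be exercised directly; the dimensions below are illustrative, and real use would add token embeddings and masking.

```python
import torch
from torch import nn

# A small encoder-decoder Transformer (default layout: seq_len, batch, d_model).
model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)

src = torch.randn(10, 1, 64)  # stand-in for an embedded source sentence
tgt = torch.randn(7, 1, 64)   # stand-in for the target generated so far
out = model(src, tgt)         # one decoder state per target position: (7, 1, 64)
```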
The diffusion model adds progressively more noise to images and learns to reconstruct them. The left column contains four original flower images. Moving right, noise is added at each time step until, in the rightmost column, the four images are pure random noise. These images are then used to train a diffusion-based model to progressively remove noise from noisy inputs and generate new data samples.
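The forward (noising) process the caption describes has a convenient closed form: the image at step t is a weighted mix of the original image and Gaussian noise, governed by a variance schedule. A minimal sketch with an illustrative schedule:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # per-step noise variances
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal retention

def noisy_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) directly, without stepping through 1..t."""
    noise = torch.randn_like(x0)
    a = alphas_bar[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

x0 = torch.randn(1, 3, 64, 64)  # stand-in for a clean flower image
x_t = noisy_sample(x0, t=500)   # partially noised image at step 500
```

A denoising network is then trained to predict the added noise, so that running the process in reverse turns pure noise into a new sample.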
Image generated by DALL-E 2 with text prompt “an astronaut in a space suit riding a unicorn”
Summary
- Generative AI is technology that can produce diverse forms of new content, including text, images, code, music, audio, and video.
- Discriminative models specialize in assigning labels, while generative models create new instances of data.
- PyTorch, with its dynamic computation graphs and support for GPU training, is well suited for deep learning and generative modeling.
- GANs are a generative modeling method consisting of two neural networks: a generator and a discriminator. The generator tries to create data samples realistic enough that the discriminator judges them real; the discriminator tries to correctly distinguish fake samples from real ones (see the training-step sketch after this summary list).
- Transformers are deep neural networks that use the attention mechanism to capture long-range dependencies among elements in a sequence. The original Transformer has an encoder and a decoder. When it’s used for English-to-French translation, for example, the encoder converts the English sentence into an abstract representation before passing it to the decoder, which generates the French translation one word at a time based on the encoder’s output and the previously generated words.
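As referenced in the GAN summary point above, here is a minimal sketch of one adversarial training step; network and batch sizes are illustrative, and optimizer updates are omitted.

```python
import torch
from torch import nn

generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                          nn.Linear(256, 28 * 28), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1), nn.Sigmoid())
bce = nn.BCELoss()

real = torch.randn(16, 28 * 28)  # stand-in for a batch of real samples
z = torch.randn(16, 100)         # random noise vectors
fake = generator(z)

# Discriminator objective: score real samples as 1 and fakes as 0.
d_loss = bce(discriminator(real), torch.ones(16, 1)) + \
         bce(discriminator(fake.detach()), torch.zeros(16, 1))

# Generator objective: make the discriminator score fakes as 1.
g_loss = bce(discriminator(fake), torch.ones(16, 1))
```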