Overview

1 A tale of two models: transformers and diffusions

Generative AI creates new content by learning patterns from data, and text-to-image systems are a standout example because they translate natural language into compelling visuals. This chapter frames these systems as multimodal models—connecting text inputs to image outputs—contrasted with unimodal models that handle a single data type. It surveys the rapid spread of text-to-image tools across design, marketing, healthcare, education, and data augmentation, and motivates hands-on understanding by showing how capabilities like captioning and text–image similarity underpin real-world applications.

The chapter then introduces two complementary approaches powering modern text-to-image generation. Transformer-based methods adapt attention from NLP to vision, often converting images to discrete codebook tokens (via VQGAN) and text to tokens (via models like BART), then training the transformer to predict image tokens conditioned on text—an idea exemplified by DALL·E-style systems and vision transformers that treat images as sequences of patches. Diffusion models take a different route: they learn to reverse a noise-adding process, denoising step by step to form images that match a prompt. To make this efficient, latent diffusion performs denoising in a low-dimensional latent space and uses a VAE decoder to produce high-resolution outputs, while guidance from models like CLIP helps align results with prompts. Stable Diffusion operationalizes this recipe at scale and, being open-source, makes these techniques widely accessible.

Despite striking progress, current systems still face notable limitations. Models can misinterpret negation or constraints (the “pink elephant” issue), raise legal and ethical questions around training data and originality, and struggle with geometric consistency due to limited 3D understanding. Broader concerns include heavy compute and energy costs, potential misuse for misinformation, and biases that can echo or overcorrect societal stereotypes. To equip readers to reason about and improve these systems, the book charts a practical, from-scratch path using Python and PyTorch: building transformers and ViTs, implementing diffusion and guidance, training CLIP, working with latent diffusion and Stable Diffusion, and mastering VQGAN and a DALL·E-style generator—so readers understand not just how these models work, but how to build and extend them responsibly.

Comparison of unimodal and multimodal models. Unimodal models handle only one type of data as both the input and output. For example, GPT-3 is a unimodal model since it processes text as input and generates text as output. Multimodal models, on the other hand, operate with more than one type of data. A prominent example is text-to-image generation models (say, Stable Diffusion), where the input is text (plus, when editing an existing image, that image) and the output is an image.
Examples of generating captions for images. Above each image, we first display the original caption from the training dataset, created by humans. We then feed the image into a trained image-to-text model to generate a caption, which is displayed above the image as well. While the generated captions differ from those created by humans, they accurately describe what’s going on in these images.
How the min-DALL-E model generates an image based on the prompt "panda with top hat reading a book." The model divides an image into 256 patches, organized in a 16x16 grid. When generating an image based on a text prompt, the model first predicts the top left patch. In the next iteration, the model predicts the patch next to it, based on the first patch and the prompt. The process is repeated until we have all the needed patches in the image. In this figure, the top left subplot shows the output when 32 image patches are generated. The second subplot in the top row shows the output when 64 patches are generated. The rest of the images show the outputs when 96, 128, ..., and 256 patches are generated.
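
The patch-by-patch generation the caption describes boils down to a simple autoregressive loop. The sketch below is illustrative only: `transformer` and `vqgan_decoder` are hypothetical stand-ins for a trained min-DALL-E-style model and a VQGAN decoder, and the sampling details (temperature, top-k, and so on) are simplified.

```python
import torch

# Hypothetical names for illustration: `transformer` maps (text tokens, image tokens
# generated so far) to logits over the next image token; `vqgan_decoder` turns the
# 256 image tokens (a 16x16 grid) back into pixels.
def generate_image_tokens(transformer, text_tokens, num_patches=256, temperature=1.0):
    image_tokens = torch.empty(0, dtype=torch.long)
    for _ in range(num_patches):
        # Predict a distribution over the codebook for the next patch,
        # conditioned on the prompt and all previously generated patches.
        logits = transformer(text_tokens, image_tokens)        # (codebook_size,)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample one codebook index
        image_tokens = torch.cat([image_tokens, next_token])
    return image_tokens  # 256 codebook indices, later decoded by VQGAN

# image = vqgan_decoder(generate_image_tokens(transformer, text_tokens))
```
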
A diagram of VQGAN. The encoder in VQGAN compresses an image into a lower-dimensional latent space. The latent vector for each image is divided into different patches. The continuous latent vector for each patch is then compared to the discrete vectors in the codebook. The quantized latent vector uses discrete vectors in the codebook to approximate the continuous latent vector for each image patch. The quantized latent vectors are then passed through the decoder in VQGAN to reconstruct the image.
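
The quantization step can be sketched in a few lines of PyTorch. This is a minimal illustration, assuming the encoder output has already been reshaped into one latent vector per patch; the sizes in the example are made up.

```python
import torch

def quantize(latents, codebook):
    """Replace each continuous patch vector with its nearest codebook entry.

    latents:  (num_patches, d) continuous encoder outputs, one row per patch
    codebook: (codebook_size, d) learned discrete embedding vectors
    """
    # Euclidean distance between every patch vector and every codebook vector
    distances = torch.cdist(latents, codebook)   # (num_patches, codebook_size)
    indices = distances.argmin(dim=1)            # nearest codebook index per patch
    quantized = codebook[indices]                # (num_patches, d) quantized latents
    return quantized, indices                    # decoder consumes `quantized`; a
                                                 # transformer consumes `indices`

# Example with made-up sizes: 256 patches, 16-dim latents, 1024 codebook entries
latents = torch.randn(256, 16)
codebook = torch.randn(1024, 16)
quantized, indices = quantize(latents, codebook)
```
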
The left side of this figure depicts how a transformer-based text-to-image model is trained, while the right side illustrates how the trained model generates an image from a text prompt. During training, images are encoded into image tokens by a VQGAN encoder, and the corresponding captions are processed by a BART encoder and then a BART decoder. The objective is to train the BART decoder so that its output tokens match the image tokens produced by the VQGAN encoder; in effect, the decoder's outputs serve as predicted image tokens. To generate an image with the trained model, the text prompt is fed into the BART encoder and then the BART decoder to produce predicted image tokens, which are passed through the VQGAN decoder to produce the final output, as shown at the top center of the figure.
A diagram of the forward diffusion process. We start with a clean image from the training set, x₀, and add noise ε₀ to it to form a noisy image x₁, which is a weighted sum of x₀ and ε₀. We repeat this process for 1,000 time steps until the image x₁₀₀₀ becomes random noise.
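
Because the forward process is just repeated Gaussian noising, the noisy image at any time step can be produced in a single jump using the closed-form expression xₜ = sqrt(ᾱₜ)·x₀ + sqrt(1 − ᾱₜ)·ε, where ᾱₜ is the cumulative product of the noise-schedule terms. A minimal sketch, assuming a simple linear schedule (the schedule used by a given model may differ):

```python
import torch

def forward_diffusion(x0, t, alphas_cumprod):
    """Jump straight from a clean image x0 to its noisy version x_t."""
    eps = torch.randn_like(x0)                      # Gaussian noise, same shape as the image
    a_bar = alphas_cumprod[t]
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return x_t, eps

# Illustrative linear schedule over 1,000 steps
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(3, 64, 64)                         # stand-in for a training image
x500, noise = forward_diffusion(x0, t=500, alphas_cumprod=alphas_cumprod)
```
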
How a trained latent diffusion model (LDM) generates an image based on a text prompt. The text prompt ("a banana riding a motorcycle, wearing sunglasses and a straw hat," for example, as shown at the top left corner) is first encoded into a text embedding. To generate an image in the lower-dimensional latent space, we start with pure noise (bottom left). The trained U-Net iteratively denoises this latent image, conditioned on the text embedding so the result matches the prompt, with guidance from a trained contrastive language-image pretraining (CLIP) model. The denoised latent image (bottom right) is then passed to a trained VAE decoder, which converts it into the high-resolution final output (top right).
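
The generation procedure in this figure can be summarized as a short loop. The sketch below is purely schematic: `text_encoder`, `unet`, `scheduler`, and `vae_decoder` are hypothetical stand-ins for the trained components, not a specific library API.

```python
import torch

@torch.no_grad()
def generate(prompt, text_encoder, unet, scheduler, vae_decoder, steps=50):
    cond = text_encoder(prompt)                       # text embedding that conditions the U-Net
    latents = torch.randn(1, 4, 64, 64)               # start from pure noise in latent space
    for t in scheduler.timesteps(steps):              # e.g., 1000, 980, ..., 0
        noise_pred = unet(latents, t, cond)           # predict the noise at this step
        latents = scheduler.step(noise_pred, t, latents)  # remove a little of that noise
    return vae_decoder(latents)                       # lift the clean latent to a full-resolution image
```
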
Intermediate decoded outputs from a trained latent diffusion model at time steps 800, 600, ..., 200, and 0. The text prompt is "a banana riding a motorcycle, wearing sunglasses and a straw hat."
The StableDiffusionPipeline class in the diffusers library lets you use Stable Diffusion as an off-the-shelf tool to generate impressive images in just a few lines of code. This figure shows the image generated from the prompt "an astronaut in a spacesuit riding a unicorn."
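
For reference, a typical way to reproduce such an image with the diffusers library looks like the snippet below. The checkpoint name and the use of a CUDA GPU are assumptions; substitute any compatible Stable Diffusion checkpoint and device.

```python
import torch
from diffusers import StableDiffusionPipeline

# Example checkpoint ID (assumption); any compatible Stable Diffusion checkpoint
# on the Hugging Face Hub should work.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # use "cpu" (and drop float16) if no GPU is available

prompt = "an astronaut in a spacesuit riding a unicorn"
image = pipe(prompt).images[0]   # a PIL.Image
image.save("astronaut_unicorn.png")
```
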
The eight steps to building a text-to-image generator from scratch. Steps 1-4 establish the foundation: you’ll learn to build a transformer, understand how a vision transformer (ViT) processes images, implement a basic diffusion model, and use classifier-free guidance to control image generation. In steps 5-8, you’ll train your own CLIP model to connect images and text, create a diffusion-based generator (such as Stable Diffusion), master the VQGAN architecture for discrete image encoding, and build a transformer-based generator inspired by DALL-E. Each step brings you closer to understanding and creating a text-to-image generator.
An image generated by ChatGPT 4o using the prompt "draw me an image of a man without a beard."

Summary

  • Text-to-image models are a type of multimodal generative model designed to transform a text description into a corresponding image.
  • Unimodal models operate within a single type of data modality, such as text-only or image-only models. In contrast, multimodal models connect different data modalities, enabling interactions across text, images, audio, and video.
  • Transformer-based text-to-image generation models treat images as sequences by dividing them into patches, each patch acting as an element in the sequence. Image generation is then a sequence prediction problem, where the model predicts patches from top-left to bottom-right based on a text prompt.
  • In diffusion-based text-to-image generation models, we start with an image of pure noise. The model iteratively denoises it based on the text prompt, reducing noise with each step until a clear image matching the prompt is produced.
  • Instead of conducting forward and reverse diffusion processes on high-resolution images, latent diffusion models (LDMs) conduct diffusion in a lower-dimensional latent space, making the process faster and more efficient. After the reverse diffusion finishes, a variational autoencoder (VAE) decoder converts the denoised latents into the high-resolution final output.
  • Despite significant advancements, text-to-image generative models face challenges like the Pink Elephant problem, copyright disputes, geometry inconsistencies, and social, ethical, and environmental concerns.

FAQ

What’s the difference between unimodal and multimodal models?
Unimodal models handle a single data type for both input and output (for example, a language model that takes text and produces text). Multimodal models connect different data types—text-to-image systems take text as input and produce images as output, sometimes also accepting an image when editing.
How do vision transformers apply attention to images?
Vision transformers split an image into small patches (for example, a 16x16 grid), embed each patch, and treat the sequence like words in a sentence. Self-attention then models relationships across patches to understand global structure, borrowing the same mechanisms that revolutionized NLP.
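
A minimal patch-embedding sketch makes this concrete. It assumes the common ViT setup of 16x16-pixel patches over a 224x224 image, so a single strided convolution produces the 196-token "sentence" that the transformer then attends over.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch embeddings, as in a ViT."""

    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # A stride-`patch_size` convolution cuts the image into non-overlapping
        # patches and linearly projects each one in a single operation.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (batch, 196, embed_dim): a "sentence" of patches
        return x

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                          # torch.Size([1, 196, 768])
```
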
How does a transformer-based text-to-image model (like DALL-E mini) generate images?
It uses two key parts: VQGAN encodes an image into discrete “image tokens,” and a text transformer (BART) encodes a caption into “text tokens.” During training, the BART decoder is optimized (via cross-entropy) to predict the image token sequence. At inference, the prompt is turned into tokens by BART, which are then decoded by VQGAN into an image.
What does VQGAN do, and why are “image tokens” important?
VQGAN’s encoder compresses an image and quantizes each patch to the nearest vector in a learned codebook, turning the image into a sequence of integers (codebook indices). Its decoder can reconstruct an image from those indices. Treating images as token sequences enables transformer-style generation.
What are forward and reverse diffusion processes?
Forward diffusion gradually adds Gaussian noise to clean images over many time steps until they become pure noise. Reverse diffusion trains a denoising network (often a U-Net) to iteratively remove noise, transforming random noise back into a clean image that matches the data distribution.
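
A typical training step for the denoising network can be sketched as follows. Here `model` is a hypothetical denoiser (usually a U-Net) that takes a noisy batch and its time steps; the noise-prediction objective shown is the standard DDPM-style loss, though individual models may use variants.

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, alphas_cumprod):
    """One reverse-diffusion training step: teach `model` to predict the added noise."""
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))   # random time step per image
    eps = torch.randn_like(x0)                                  # the noise we will add
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps          # forward diffusion in one jump
    eps_pred = model(x_t, t)                                    # network's guess of the noise
    return F.mse_loss(eps_pred, eps)                            # standard DDPM objective
```
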
How do latent diffusion models (LDMs) and Stable Diffusion speed up generation?
LDMs run diffusion in a lower-dimensional latent space (e.g., 4x64x64) instead of pixel space, cutting computation dramatically. A VAE decoder then lifts the denoised latent back to a high-resolution image, while text conditioning and CLIP guidance align the output with the prompt; Stable Diffusion is a popular open-source implementation of this approach.
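
A quick back-of-the-envelope comparison shows why working in latent space helps; the 512x512 output resolution is an assumption for illustration.

```python
# Rough size comparison between pixel space and the latent space mentioned above.
pixel_values  = 3 * 512 * 512    # RGB image the VAE ultimately produces
latent_values = 4 * 64 * 64      # latent tensor the U-Net actually denoises

print(pixel_values, latent_values, pixel_values / latent_values)
# 786432 16384 48.0  -> each denoising step touches ~48x fewer values in latent space
```
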
What practical applications do text-to-image models enable?
They accelerate content creation (art, illustration, marketing), rapid product and game concept prototyping, and education by visualizing complex ideas. Related skills include image captioning (turning images into text) and image selection using CLIP to find the image that best matches a prompt.
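
As an illustration of the image-selection use case, the snippet below scores candidate images against a prompt with a public CLIP checkpoint via the transformers library. The checkpoint name and image file paths are placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# One publicly available CLIP checkpoint (assumption); others work the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a banana riding a motorcycle"
images = [Image.open(p) for p in ["candidate_1.png", "candidate_2.png"]]  # placeholder files

inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
logits_per_image = model(**inputs).logits_per_image   # similarity of each image to the prompt
best = logits_per_image.squeeze(1).argmax().item()
print(f"Best match: candidate_{best + 1}.png")
```
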
What is the “pink elephant” problem in text-to-image generation?
It’s the tendency to include an element that the prompt explicitly negates (e.g., “a man without a beard” still yields a beard). This reflects difficulty with negation and constraint-following, a known limitation that can surface across different models.
Do text-to-image models “steal” from artists, or do they create new works?
There are two perspectives: critics argue models can reproduce protected works or styles seen in training data, especially if overrepresented. Proponents counter that models learn statistical patterns and generate novel images via stochastic denoising, not exact copies; the legal and ethical debates are ongoing.
What social, environmental, and ethical concerns surround these models?
Major issues include high energy use for training and inference, potential misuse (deepfakes, misinformation), and societal biases reflected or amplified in outputs. Mitigations span efficiency techniques (e.g., pruning, quantization), content moderation and policy, and rigorous bias auditing—while avoiding overcorrection that harms factual accuracy.
