Overview

1 A tale of two models: transformers and diffusions

Generative AI creates new content by learning patterns from data, and text-to-image systems are a standout example because they translate natural language into compelling visuals. This chapter frames these systems as multimodal models—connecting text inputs to image outputs—contrasted with unimodal models that handle a single data type. It surveys the rapid spread of text-to-image tools across design, marketing, healthcare, education, and data augmentation, and motivates hands-on understanding by showing how capabilities like captioning and text–image similarity underpin real-world applications.

The chapter then introduces two complementary approaches powering modern text-to-image generation. Transformer-based methods adapt attention from NLP to vision, often converting images to discrete codebook tokens (via VQGAN) and text to tokens (via models like BART), then training the transformer to predict image tokens conditioned on text—an idea exemplified by DALL·E-style systems and vision transformers that treat images as sequences of patches. Diffusion models take a different route: they learn to reverse a noise-adding process, denoising step by step to form images that match a prompt. To make this efficient, latent diffusion performs denoising in a low-dimensional latent space and uses a VAE decoder to produce high-resolution outputs, while guidance from models like CLIP helps align results with prompts. Stable Diffusion operationalizes this recipe at scale and, being open-source, makes these techniques widely accessible.

Despite striking progress, current systems still face notable limitations. Models can misinterpret negation or constraints (the “pink elephant” issue), raise legal and ethical questions around training data and originality, and struggle with geometric consistency due to limited 3D understanding. Broader concerns include heavy compute and energy costs, potential misuse for misinformation, and biases that can echo or overcorrect societal stereotypes. To equip readers to reason about and improve these systems, the book charts a practical, from-scratch path using Python and PyTorch: building transformers and ViTs, implementing diffusion and guidance, training CLIP, working with latent diffusion and Stable Diffusion, and mastering VQGAN and a DALL·E-style generator—so readers understand not just how these models work, but how to build and extend them responsibly.

Comparison of unimodal and multimodal models. Unimodal models handle only one type of data as both the input and output. For example, GPT-3 is a unimodal model since it processes text as input and generates text as output. Multimodal models, on the other hand, operate with more than one type of data. A prominent example is text-to-image generation models (say, Stable Diffusion), where the input is text (plus, when editing an existing image, that image) and the output is an image.
Examples of generating captions for images. Above each image, we first display the original caption from the training dataset, created by humans. We then feed the image into a trained image-to-text model to generate a caption, which is displayed above the image as well. While the generated captions differ from those created by humans, they accurately describe what’s going on in these images.
How the min-DALL-E model generates an image based on the prompt "panda with top hat reading a book." The model divides an image into 256 patches, organized in a 16x16 grid. When generating an image based on a text prompt, the model first predicts the top left patch. In the next iteration, the model predicts the patch next to it, based on the first patch and the prompt. The process is repeated until we have all the needed patches in the image. In this figure, the top left subplot shows the output when 32 image patches are generated. The second subplot in the top row shows the output when 64 patches are generated. The rest of the images show the outputs when 96, 128, ..., and 256 patches are generated.
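
The patch-by-patch generation the caption describes boils down to a simple autoregressive loop. The sketch below is illustrative only: `transformer` and `vqgan_decoder` are hypothetical stand-ins for a trained min-DALL-E-style model and a VQGAN decoder, and the sampling details (temperature, top-k, and so on) are simplified.

```python
import torch

# Hypothetical names for illustration: `transformer` maps (text tokens, image tokens
# generated so far) to logits over the next image token; `vqgan_decoder` turns the
# 256 image tokens (a 16x16 grid) back into pixels.
def generate_image_tokens(transformer, text_tokens, num_patches=256, temperature=1.0):
    image_tokens = torch.empty(0, dtype=torch.long)
    for _ in range(num_patches):
        # Predict a distribution over the codebook for the next patch,
        # conditioned on the prompt and all previously generated patches.
        logits = transformer(text_tokens, image_tokens)        # (codebook_size,)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # sample one codebook index
        image_tokens = torch.cat([image_tokens, next_token])
    return image_tokens  # 256 codebook indices, later decoded by VQGAN

# image = vqgan_decoder(generate_image_tokens(transformer, text_tokens))
```
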
A diagram of VQGAN. The encoder in VQGAN compresses an image into a lower-dimensional latent space. The latent vector for each image is divided into different patches. The continuous latent vector for each patch is then compared to the discrete vectors in the codebook. The quantized latent vector uses discrete vectors in the codebook to approximate the continuous latent vector for each image patch. The quantized latent vectors are then passed through the decoder in VQGAN to reconstruct the image.
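
The quantization step can be sketched in a few lines of PyTorch. This is a minimal illustration, assuming the encoder output has already been reshaped into one latent vector per patch; the sizes in the example are made up.

```python
import torch

def quantize(latents, codebook):
    """Replace each continuous patch vector with its nearest codebook entry.

    latents:  (num_patches, d) continuous encoder outputs, one row per patch
    codebook: (codebook_size, d) learned discrete embedding vectors
    """
    # Euclidean distance between every patch vector and every codebook vector
    distances = torch.cdist(latents, codebook)   # (num_patches, codebook_size)
    indices = distances.argmin(dim=1)            # nearest codebook index per patch
    quantized = codebook[indices]                # (num_patches, d) quantized latents
    return quantized, indices                    # decoder consumes `quantized`; a
                                                 # transformer consumes `indices`

# Example with made-up sizes: 256 patches, 16-dim latents, 1024 codebook entries
latents = torch.randn(256, 16)
codebook = torch.randn(1024, 16)
quantized, indices = quantize(latents, codebook)
```
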
The left side of this figure depicts how a transformer-based text-to-image model is trained, while the right side illustrates how the trained model generates an image from a text prompt. During training, images are encoded into image tokens by a VQGAN encoder, and the corresponding captions are processed by a BART encoder and then a BART decoder. The objective is to train the BART decoder so that its output tokens match the image tokens produced by the VQGAN encoder; in effect, the decoder's outputs serve as predicted image tokens. To generate an image with the trained model, the text prompt is fed into the BART encoder and then the BART decoder to produce predicted image tokens, which are passed through the VQGAN decoder to produce the final output, as shown at the top center of the figure.
A diagram of the forward diffusion process. We start with a clean image from the training set, x₀, and add noise ε₀ to it to form a noisy image x₁, which is a weighted sum of x₀ and ε₀. We repeat this process for 1,000 time steps until the image x₁₀₀₀ becomes random noise.
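
Because the forward process is just repeated Gaussian noising, the noisy image at any time step can be produced in a single jump using the closed-form expression xₜ = sqrt(ᾱₜ)·x₀ + sqrt(1 − ᾱₜ)·ε, where ᾱₜ is the cumulative product of the noise-schedule terms. A minimal sketch, assuming a simple linear schedule (the schedule used by a given model may differ):

```python
import torch

def forward_diffusion(x0, t, alphas_cumprod):
    """Jump straight from a clean image x0 to its noisy version x_t."""
    eps = torch.randn_like(x0)                      # Gaussian noise, same shape as the image
    a_bar = alphas_cumprod[t]
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return x_t, eps

# Illustrative linear schedule over 1,000 steps
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(3, 64, 64)                         # stand-in for a training image
x500, noise = forward_diffusion(x0, t=500, alphas_cumprod=alphas_cumprod)
```
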
How a trained latent diffusion model (LDM) generates an image based on a text prompt. The text prompt ("a banana riding a motorcycle, wearing sunglasses and a straw hat," for example, as shown at the top left corner) is first encoded into a text embedding. To generate an image in the lower-dimensional latent space, we start with pure noise (bottom left). The trained U-Net iteratively denoises this latent image, conditioned on the text embedding so the result matches the prompt, with guidance from a trained contrastive language-image pretraining (CLIP) model. The denoised latent image (bottom right) is then passed to a trained VAE decoder, which converts it into the high-resolution final output (top right).
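
The generation procedure in this figure can be summarized as a short loop. The sketch below is purely schematic: `text_encoder`, `unet`, `scheduler`, and `vae_decoder` are hypothetical stand-ins for the trained components, not a specific library API.

```python
import torch

@torch.no_grad()
def generate(prompt, text_encoder, unet, scheduler, vae_decoder, steps=50):
    cond = text_encoder(prompt)                       # text embedding that conditions the U-Net
    latents = torch.randn(1, 4, 64, 64)               # start from pure noise in latent space
    for t in scheduler.timesteps(steps):              # e.g., 1000, 980, ..., 0
        noise_pred = unet(latents, t, cond)           # predict the noise at this step
        latents = scheduler.step(noise_pred, t, latents)  # remove a little of that noise
    return vae_decoder(latents)                       # lift the clean latent to a full-resolution image
```
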
Intermediate decoded outputs from a trained latent diffusion model at time steps 800, 600, ..., 200, and 0. The text prompt is "a banana riding a motorcycle, wearing sunglasses and a straw hat."
The StableDiffusionPipeline class in the diffusers library lets you use Stable Diffusion as an off-the-shelf tool to generate impressive images in just a few lines of code. This figure shows the image generated from the prompt "an astronaut in a spacesuit riding a unicorn."
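
For reference, a typical way to reproduce such an image with the diffusers library looks like the snippet below. The checkpoint name and the use of a CUDA GPU are assumptions; substitute any compatible Stable Diffusion checkpoint and device.

```python
import torch
from diffusers import StableDiffusionPipeline

# Example checkpoint ID (assumption); any compatible Stable Diffusion checkpoint
# on the Hugging Face Hub should work.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # use "cpu" (and drop float16) if no GPU is available

prompt = "an astronaut in a spacesuit riding a unicorn"
image = pipe(prompt).images[0]   # a PIL.Image
image.save("astronaut_unicorn.png")
```
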
The eight steps to building a text-to-image generator from scratch. Steps 1-4 establish the foundation: you’ll learn to build a transformer, understand how a vision transformer (ViT) processes images, implement a basic diffusion model, and use classifier-free guidance to control image generation. In steps 5-8, you’ll train your own CLIP model to connect images and text, create a diffusion-based generator (such as Stable Diffusion), master the VQGAN architecture for discrete image encoding, and build a transformer-based generator inspired by DALL-E. Each step brings you closer to understanding and creating a text-to-image generator.
An image generated by ChatGPT 4o using the prompt "draw me an image of a man without a beard."

Summary

  • Text-to-image models are a type of multimodal generative model designed to transform a text description into a corresponding image.
  • Unimodal models operate within a single type of data modality, such as text-only or image-only models. In contrast, multimodal models connect different data modalities, enabling interactions across text, images, audio, and video.
  • Transformer-based text-to-image generation models treat images as sequences by dividing them into patches, each patch acting as an element in the sequence. Image generation is then a sequence prediction problem, where the model predicts patches from top-left to bottom-right based on a text prompt.
  • In diffusion-based text-to-image generation models, we start with an image of pure noise. The model iteratively denoises it based on the text prompt, reducing noise with each step until a clear image matching the prompt is produced.
  • Instead of conducting forward and reverse diffusion processes on high-resolution images, latent diffusion models (LDMs) conduct diffusion in a lower-dimensional latent space, making the process faster and more efficient. After the reverse diffusion finishes, a variational autoencoder (VAE) decoder converts the denoised latents into the high-resolution final output.
  • Despite significant advancements, text-to-image generative models face challenges like the Pink Elephant problem, copyright disputes, geometry inconsistencies, and social, ethical, and environmental concerns.

FAQ

What’s the difference between unimodal and multimodal models?
Unimodal models handle a single data type for both input and output (for example, a language model that takes text and produces text). Multimodal models connect different data types—text-to-image systems take text as input and produce images as output, sometimes also accepting an image when editing.
How do vision transformers apply attention to images?
Vision transformers split an image into small patches (for example, a 16x16 grid), embed each patch, and treat the sequence like words in a sentence. Self-attention then models relationships across patches to understand global structure, borrowing the same mechanisms that revolutionized NLP.
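
A minimal patch-embedding sketch makes this concrete. It assumes the common ViT setup of 16x16-pixel patches over a 224x224 image, so a single strided convolution produces the 196-token "sentence" that the transformer then attends over.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Turn an image into a sequence of patch embeddings, as in a ViT."""

    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # A stride-`patch_size` convolution cuts the image into non-overlapping
        # patches and linearly projects each one in a single operation.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (batch, 196, embed_dim): a "sentence" of patches
        return x

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                          # torch.Size([1, 196, 768])
```
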
How does a transformer-based text-to-image model (like DALL-E mini) generate images?
It uses two key parts: VQGAN encodes an image into discrete “image tokens,” and a text transformer (BART) encodes a caption into “text tokens.” During training, the BART decoder is optimized (via cross-entropy) to predict the image token sequence. At inference, the prompt is turned into tokens by BART, which are then decoded by VQGAN into an image.
What does VQGAN do, and why are “image tokens” important?
VQGAN’s encoder compresses an image and quantizes each patch to the nearest vector in a learned codebook, turning the image into a sequence of integers (codebook indices). Its decoder can reconstruct an image from those indices. Treating images as token sequences enables transformer-style generation.
What are forward and reverse diffusion processes?
Forward diffusion gradually adds Gaussian noise to clean images over many time steps until they become pure noise. Reverse diffusion trains a denoising network (often a U-Net) to iteratively remove noise, transforming random noise back into a clean image that matches the data distribution.
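
A typical training step for the denoising network can be sketched as follows. Here `model` is a hypothetical denoiser (usually a U-Net) that takes a noisy batch and its time steps; the noise-prediction objective shown is the standard DDPM-style loss, though individual models may use variants.

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, alphas_cumprod):
    """One reverse-diffusion training step: teach `model` to predict the added noise."""
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],))   # random time step per image
    eps = torch.randn_like(x0)                                  # the noise we will add
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps          # forward diffusion in one jump
    eps_pred = model(x_t, t)                                    # network's guess of the noise
    return F.mse_loss(eps_pred, eps)                            # standard DDPM objective
```
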
How do latent diffusion models (LDMs) and Stable Diffusion speed up generation?
LDMs run diffusion in a lower-dimensional latent space (e.g., 4x64x64) instead of pixel space, cutting computation dramatically. A VAE decoder then lifts the denoised latent back to a high-resolution image, while text conditioning and CLIP guidance align the output with the prompt; Stable Diffusion is a popular open-source implementation of this approach.
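
A quick back-of-the-envelope comparison shows why working in latent space helps; the 512x512 output resolution is an assumption for illustration.

```python
# Rough size comparison between pixel space and the latent space mentioned above.
pixel_values  = 3 * 512 * 512    # RGB image the VAE ultimately produces
latent_values = 4 * 64 * 64      # latent tensor the U-Net actually denoises

print(pixel_values, latent_values, pixel_values / latent_values)
# 786432 16384 48.0  -> each denoising step touches ~48x fewer values in latent space
```
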
What practical applications do text-to-image models enable?
They accelerate content creation (art, illustration, marketing), rapid product and game concept prototyping, and education by visualizing complex ideas. Related skills include image captioning (turning images into text) and image selection using CLIP to find the image that best matches a prompt.
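
As an illustration of the image-selection use case, the snippet below scores candidate images against a prompt with a public CLIP checkpoint via the transformers library. The checkpoint name and image file paths are placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# One publicly available CLIP checkpoint (assumption); others work the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a banana riding a motorcycle"
images = [Image.open(p) for p in ["candidate_1.png", "candidate_2.png"]]  # placeholder files

inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
logits_per_image = model(**inputs).logits_per_image   # similarity of each image to the prompt
best = logits_per_image.squeeze(1).argmax().item()
print(f"Best match: candidate_{best + 1}.png")
```
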
What is the “pink elephant” problem in text-to-image generation?
It’s the tendency to include an element that the prompt explicitly negates (e.g., “a man without a beard” still yields a beard). This reflects difficulty with negation and constraint-following, a known limitation that can surface across different models.
Do text-to-image models “steal” from artists, or do they create new works?
There are two perspectives: critics argue models can reproduce protected works or styles seen in training data, especially if overrepresented. Proponents counter that models learn statistical patterns and generate novel images via stochastic denoising, not exact copies; the legal and ethical debates are ongoing.
What social, environmental, and ethical concerns surround these models?
Major issues include high energy use for training and inference, potential misuse (deepfakes, misinformation), and societal biases reflected or amplified in outputs. Mitigations span efficiency techniques (e.g., pruning, quantization), content moderation and policy, and rigorous bias auditing—while avoiding overcorrection that harms factual accuracy.
