Overview

1 How AI works

This chapter offers a clear, opinionated tour of how modern AI—especially large language models (LLMs)—works in practice. It explains popular terms that have entered mainstream conversation (tokens, embeddings, temperature, context window) and shows how everyday experiences with chatbots are shaped as much by “wrappers” around models as by the models themselves. Along the way, it highlights real-world constraints and trade-offs: token-based billing, the growing but finite context windows, quirks of tokenization that make some languages costlier to handle than English, and known weaknesses such as letter-level reasoning. It also demystifies why chat systems feel helpful and up to date: the surrounding software inserts system prompts, orchestrates tools and function calls, and retrieves external documents (RAG) to augment a model that is otherwise a static next-token predictor.

Under the hood, LLMs are autoregressive predictors that generate text one token at a time from a fixed vocabulary. Tokens are mapped to high-dimensional embeddings that capture meaning, enabling the transformer architecture to contextualize each token with attention, multihead mechanisms, and many stacked layers before scoring the probabilities of the next token. Sampling settings (temperature, Top-p, Top-k) control creativity versus conservatism, while context-window limits govern how much text can be considered at once. The chapter details how wrappers expand raw capabilities—turning single-token predictions into full responses, delimiting turns for chat, calling external tools for real-time data, and injecting retrieved passages so outputs can cite or reflect recent or private information. It also explains practical frictions, from token billing and chat-history growth to the challenges of non-English tokenization and analyzing letter-level structure.

Zooming out, the chapter situates LLMs within machine learning at large: models are built by designing an architecture and then learning billions of parameters from data. LLMs rely heavily on self-supervised pretraining, followed by fine-tuning and reinforcement learning with human feedback (RLHF) to better align behavior with human preferences; their quality is optimized via loss functions and large-scale stochastic gradient descent. Beyond language, convolutional neural networks (CNNs) extract hierarchical visual features, U-Nets transform images to images, and diffusion models denoise inputs to generate images and even video, while multimodal systems combine CNNs and LLMs to move between text and visuals. The chapter closes with a sober principle: there is no universally best model. Progress comes from crafting architectures and scaffolding—transformers, CNNs, tool use, and retrieval—tailored to each task, with humans guiding models far more than it may seem.

To generate full sentences, the LLM wrapper used the LLM to generate one word, then attached that word to the initial prompt, then used the LLM again to generate one more word, and so on.

OpenAI’s API lets users define a system prompt, which is a piece of text inserted into the beginning of the user’s prompt.

When the current date is supplied as part of the system prompt, the LLM can answer questions about the current date.

ChatGPT called a function to search the web behind the scenes and inserted the results into the user’s prompt. This creates the illusion that the LLM browses the web.

Each token is mapped to a vector of numbers. We can imagine that each number in the vector represents a topic. Here’s an imaginary list of topics and their respective numbers for the “dog” token.

An imaginary embedding vector for the “elephant” token

We can think the numbers in an embedding vector as coordinates that place the token in a multidimensional “meaning space.”

LLMs often struggle to analyze individual letters in words.

LLM overview. In step 1, the tokens are mapped to embeddings one by one. In step 2, each embedding is improved by contextualizing it using the previous tokens in the prompt. In step 3, the much-improved embeddings are used to make predictions about the next token.

The attention mechanism calculates the relative relevance of all tokens in the context window to contextualize or disambiguate the last token.

Training examples are generated by subdividing existing sentences and turning the last token in each into the desired autocomplete label.

A diffusion model is trained to improve a corrupted image paired with its caption.

A diffusion model is used repeatedly to have a desired image emerge from Gaussian noise.

Summary

LLMs are designed to guess the best next word that completes an input prompt.
LLMs subdivide inputs into valid tokens (common words or pieces of words) from an internal vocabulary.
LLMs calculate the probability that each possible token is the one that comes next after the input.
A wrapper around the LLM enhances its capabilities. For examples, it makes the LLM eat its own output repeatedly to generate full outputs, one token at a time.
Current LLMs represent information using embedding vector, which are lists of numbers.
Current LLMs follow the transformer architecture, which is a method to progressively contextualize input tokens.
LLMs are created using machine learning, meaning that data is used to define missing parameters inside the model.
There are different types of machine learning, including supervised, self-supervised, and unsupervised learning.
In supervised learning, the computer learns by example—it is fed with examples of how to perform the task. In the case of self-supervised learning, these examples are generated automatically by scanning data.
Popular LLMs were first trained in a self-supervised way using publicly available data, and then, they were refined using manually generated data to align them to the users’ objectives.
CNNs are a popular architecture to process other types of data, such as images.
CNNs are combined with transformers to create multimodal AI.

FAQ

What do large language models (LLMs) actually do?

They are autoregressive models trained to predict the next token given prior tokens. Apps then loop that single-step prediction: they feed each chosen token back into the model to produce full sentences, stopping when a special end-of-text token appears.

What’s the difference between the base LLM and apps like ChatGPT?

The base LLM only predicts the next token. An LLM wrapper (e.g., ChatGPT) adds behavior: turns single-token guesses into full replies, inserts hidden “system prompts” (e.g., current date, role), formats chat history with boundary tokens, and can orchestrate tools and real-time data.

How can an LLM answer about live data or “browse” the web?

Through tool/function calling. The system prompt advertises callable functions (e.g., get_current_weather). When the LLM suggests one, the wrapper executes it, then injects the results back into the prompt so the model can incorporate fresh information in its answer.

What is retrieval-augmented generation (RAG)?

RAG first retrieves relevant documents (from a database or the web) based on your prompt, then augments the prompt with those snippets so the LLM can ground its answer. Key challenges: finding good matches, fitting within the model’s context window, and managing cost for longer prompts.

What are tokens, and why do they affect cost and non‑English usage?

Models read and write tokens (common words, subwords, characters, and special codes), not raw text. APIs bill by tokens, so longer prompts/outputs cost more. Because vocabularies are optimized for English, words in other languages often split into more tokens—making those inputs/outputs longer and pricier.

What are embeddings and why do they matter?

Embeddings map tokens (and sometimes images) to vectors that encode meaning. Similar meanings land near each other in this “space.” Simple math on embeddings (dot products, projections) lets models compare concepts and extract task-relevant information, enabling effective next-token prediction.

How does the transformer architecture work in simple terms?

It has three stages: (1) map tokens to initial embeddings, (2) contextualize each token via attention over prior tokens (repeated across many layers/heads), and (3) project the contextualized embeddings to a probability over the vocabulary. A fixed context window caps how much the model can read at once.

How is the randomness/creativity of outputs controlled?

By sampling settings applied to the model’s probability distribution: temperature (lower = safer, higher = more creative), top‑p (sample only from tokens covering a chosen cumulative probability), and top‑k (sample from the top k tokens). These control variety versus determinism.

How are LLMs trained and improved?

They’re first pretrained with self‑supervised learning on internet-scale text (predict the next token). They’re then “tamed” with human-written examples (supervised fine‑tuning) and often refined with reinforcement learning from human feedback (RLHF). Training minimizes loss (e.g., cross‑entropy) via stochastic gradient descent across billions of parameters.

How do AI systems handle images and text‑to‑image generation?

Convolutional neural networks (CNNs) process images by stacking learnable filters (convolutions) and downsampling to form semantic embeddings. For text‑to‑image, diffusion models start from noise and iteratively denoise guided by a text embedding (often via a U‑Net), making the requested image “emerge” over multiple steps. Multimodal systems combine LLMs and CNNs by aligning their embeddings.

To generate full sentences, the LLM wrapper used the LLM to generate one word, then attached that word to the initial prompt, then used the LLM again to generate one more word, and so on.

OpenAI’s API lets users define a system prompt, which is a piece of text inserted into the beginning of the user’s prompt.

When the current date is supplied as part of the system prompt, the LLM can answer questions about the current date.

ChatGPT called a function to search the web behind the scenes and inserted the results into the user’s prompt. This creates the illusion that the LLM browses the web.

Each token is mapped to a vector of numbers. We can imagine that each number in the vector represents a topic. Here’s an imaginary list of topics and their respective numbers for the “dog” token.

An imaginary embedding vector for the “elephant” token

We can think the numbers in an embedding vector as coordinates that place the token in a multidimensional “meaning space.”

LLMs often struggle to analyze individual letters in words.

LLM overview. In step 1, the tokens are mapped to embeddings one by one. In step 2, each embedding is improved by contextualizing it using the previous tokens in the prompt. In step 3, the much-improved embeddings are used to make predictions about the next token.

The attention mechanism calculates the relative relevance of all tokens in the context window to contextualize or disambiguate the last token.

Training examples are generated by subdividing existing sentences and turning the last token in each into the desired autocomplete label.

A diffusion model is trained to improve a corrupted image paired with its caption.

A diffusion model is used repeatedly to have a desired image emerge from Gaussian noise.

pro $24.99 per month

lite $19.99 per month

team

pro

team

pro

team

pro

team