Overview

1 The World of Large Language Models

Language sits at the core of human connection, and modern Natural Language Processing has evolved from early rule-based efforts to deep learning systems that can recognize patterns at scale. That evolution enabled the rise of Large Language Models—probabilistic next-word predictors trained on vast corpora—which can write, summarize, translate, converse, and reason across broad topics. They surpass early voice assistants by sustaining rich, context-aware dialogue and are increasingly complemented by multimodal models that combine text with images or audio. Framing LLMs as foundational building blocks, the chapter positions the book as a practical guide to applying them in real-world settings rather than dwelling on mathematical formalism.

The chapter surveys a wide application surface: conversational agents, text and code generation, information retrieval, language understanding, recommendation, content editing, and agentic task execution. Building such applications entails careful orchestration—from defining the use case and acquiring suitable compute, to pretraining on massive datasets, then fine-tuning for domain-specific needs. Training leverages large-scale web data and specialized hardware (e.g., GPUs/TPUs), with iterative optimization of weights and biases to internalize linguistic patterns. Retrieval-Augmented Generation features prominently: it first fetches relevant context from a curated knowledge base and then conditions the model’s response on that evidence, improving specificity and freshness for focused domains.

Scale is both the superpower and the constraint: large models capture nuanced grammar and context yet demand significant compute and cost. The chapter underscores key challenges—data bias, ethical risks, limited interpretability, and hallucinations—arguing for guardrails, validation, and responsible deployment. It also maps the startup landscape catalyzed by LLMs, from lightweight wrappers to infrastructure platforms (vector databases and frameworks) and capital-intensive model labs competing at the frontier. Funding patterns mirror complexity and defensibility, with infrastructure and “GPU-rich” players attracting outsized investment. The chapter closes by setting expectations for the book’s focus: building effective LLM applications and exploring the components that make them reliable, practical, and impactful.

Figure: An output for a given prompt using ChatGPT.
Figure: Rube Goldberg's famous self-operating napkin. Constructing an LLM application demands a thoughtful orchestration of resources, from computational power to application definition, echoing the complexity of Rube Goldberg's contraptions.
Figure: A Python code snippet demonstrating how to use the Ares API to retrieve information about taco spots in San Francisco from the internet. Instead of just returning URLs, the API returns actual answers with web URLs as sources.
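The chapter's snippet is not reproduced here, but a minimal sketch of such a call might look like the following. The endpoint URL, header name, and response field names below are assumptions for illustration, not confirmed details of the Ares API; check the provider's documentation for the real values.

```python
# A minimal sketch of calling a web-search answer API such as Ares.
# The endpoint URL, header name, and response fields are assumptions
# for illustration; consult the provider's documentation.
import requests

ARES_ENDPOINT = "https://api-ares.traversaal.ai/live/predict"  # assumed endpoint
API_KEY = "your-api-key-here"  # hypothetical placeholder

payload = {"query": ["best taco spots in San Francisco"]}
headers = {"x-api-key": API_KEY, "content-type": "application/json"}

response = requests.post(ARES_ENDPOINT, json=payload, headers=headers, timeout=30)
response.raise_for_status()
data = response.json().get("data", {})

# The API is described as returning an actual answer plus its source URLs,
# rather than a bare list of links.
print(data.get("response_text"))  # synthesized answer (assumed field name)
print(data.get("web_url"))        # source URLs (assumed field name)
```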
Figure: Retrieval-Augmented Generation is used to enhance the capabilities of LLMs, especially in generating relevant and contextually appropriate responses. The approach incorporates an initial retrieval step before generating a response, to leverage information from a knowledge base.

Summary

  • Large language models (LLMs) are the latest breakthrough in natural language processing after statistical models and deep learning. LLMs stand on the shoulders of this prior research but take language understanding to new heights through scale.
  • Pretrained on massive text corpora, LLMs like GPT-3 capture broad knowledge about language in their model parameters. This allows them to achieve state-of-the-art performance on language tasks.
  • Applications powered by LLMs include text generation, classification, translation, and semantic search to name a few.
  • LLMs utilize multi-billion parameter Transformer architectures. Training such gigantic models requires massive computational resources only recently made possible through advances in AI hardware.
  • Bias and safety are key challenges with large models. Extensive testing is required to prevent unintended model behavior across diverse demographics.
  • Numerous startups are offering LLM model APIs, democratizing access and allowing innovation in the realm of Generative AI.

FAQ

What is a Large Language Model (LLM) in simple terms?

An LLM is an AI system trained on massive amounts of text to predict the next word in a sequence. By learning patterns, context, and nuances in language, it can generate coherent, human-like text, answer questions, summarize, translate, and converse.
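To make the "predict the next word" idea concrete, here is a minimal sketch using the Hugging Face transformers library with the small GPT-2 checkpoint; any causal language model behaves the same way in principle.

```python
# A small demonstration of next-word prediction with a pretrained model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The model repeatedly predicts a likely next token, so a prompt is
# continued into fluent text one token at a time.
result = generator("Language models learn to", max_new_tokens=20, num_return_sequences=1)
print(result[0]["generated_text"])
```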

How are LLMs different from early virtual assistants like Siri or Alexa?

Early assistants primarily followed predefined rules and handled narrow, scripted tasks. LLMs go beyond that by generating rich, context-aware language, sustaining multi-turn conversations, and adapting to varied topics with human-like fluency.

What are the main applications of LLMs?

Common applications include conversational assistants, text and code generation, information retrieval and search, language understanding (sentiment, intent, NER), recommendation systems, content creation/editing, and agent-based task fulfillment.

What is Retrieval-Augmented Generation (RAG) and when should I use it?

RAG enhances an LLM by retrieving relevant snippets from a selected corpus before generation, giving the model fresh, context-specific grounding. It’s especially useful for specialized domains or up-to-date information, though it relies on the quality of the underlying documents and doesn’t guarantee correctness.

How does RAG work at a high level?

It typically follows four steps: (1) Retrieval: search a curated knowledge base for relevant content; (2) Candidate selection: identify the most relevant passages; (3) Context integration: add those passages to the prompt; (4) Response generation: the LLM produces an answer using both its prior training and the retrieved context.
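A toy sketch of those four steps follows. Real systems use embedding models and a vector database for retrieval; this version uses plain bag-of-words similarity so the whole flow fits in a few self-contained lines, and the final generation step is left as a printed prompt you would send to an LLM of your choice.

```python
# A toy walk-through of the four RAG steps with keyword-overlap retrieval.
from collections import Counter
import math

knowledge_base = [
    "The refund window for annual plans is 30 days from purchase.",
    "Support is available via chat from 9am to 5pm Pacific time.",
    "Annual plans renew automatically unless cancelled in settings.",
]

def score(query: str, doc: str) -> float:
    # Steps 1-2 (retrieval + candidate selection): rank documents by
    # cosine similarity over bag-of-words counts.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    dot = sum(q[w] * d[w] for w in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

query = "How long do I have to get a refund?"
top_doc = max(knowledge_base, key=lambda doc: score(query, doc))

# Step 3 (context integration): put the retrieved passage into the prompt.
prompt = (
    "Answer using only the context below.\n\n"
    f"Context: {top_doc}\n\nQuestion: {query}\nAnswer:"
)

# Step 4 (response generation): send this prompt to an LLM.
print(prompt)
```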

Why do LLMs need so much data, and what datasets are used?

Vast, diverse data helps models learn general language patterns, semantics, and contextual cues, improves robustness, and reduces overfitting. One widely used source is Common Crawl, a public web-scale corpus spanning hundreds of billions of pages collected over many years.

What’s the difference between training and fine-tuning an LLM?

Training (pretraining) teaches the model general language patterns on massive datasets by adjusting weights and biases to predict the next word. Fine-tuning adapts that pretrained model to a specific task or domain (e.g., legal or medical) using a smaller, targeted dataset for better task-specific performance.
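The mechanics can be shown in miniature. The sketch below trains a tiny next-word model in PyTorch on "general" text, then continues training on a small "domain" corpus; real LLMs are Transformers with billions of parameters, but the pretrain-then-fine-tune pattern is the same. The corpora and hyperparameters here are arbitrary toy values.

```python
# A toy illustration of pretraining vs. fine-tuning with a tiny
# next-word language model in PyTorch.
import torch
import torch.nn as nn

general_text = "the cat sat on the mat the dog sat on the rug".split()
legal_text = "the court sat on the motion the court granted the motion".split()

vocab = sorted(set(general_text + legal_text))
idx = {w: i for i, w in enumerate(vocab)}

def pairs(words):
    # (current word, next word) training pairs for next-word prediction
    xs = torch.tensor([idx[w] for w in words[:-1]])
    ys = torch.tensor([idx[w] for w in words[1:]])
    return xs, ys

model = nn.Sequential(nn.Embedding(len(vocab), 16), nn.Linear(16, len(vocab)))
loss_fn = nn.CrossEntropyLoss()

def train(words, steps, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    xs, ys = pairs(words)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(xs), ys)  # adjust weights to predict next word
        loss.backward()
        opt.step()

train(general_text, steps=200, lr=0.1)  # "pretraining" on broad text
train(legal_text, steps=50, lr=0.01)    # "fine-tuning" on domain text
```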

What compute is required to train or tune LLMs?

Training LLMs is compute-intensive, typically requiring distributed clusters of GPUs (e.g., NVIDIA data center GPUs) or TPUs. Depending on model size and resources, training can take weeks or months; fine-tuning is faster but still resource-demanding.

What limitations should I watch for (bias, ethics, interpretability, hallucinations)?

LLMs may reflect biases in their training data, produce misleading or harmful content, and are often hard to interpret. They can also “hallucinate”—confidently generating incorrect or nonsensical information—so validation and guardrails are essential.
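One simple form such a guardrail can take is a grounding check: before showing a RAG answer, measure how much of it is supported by the retrieved context and fall back to a refusal when support is weak. This is a minimal sketch; the word-overlap heuristic and the 0.5 threshold are arbitrary illustrations, not recommended production values.

```python
# A naive grounding check as one guardrail against hallucination.
def grounded_enough(answer: str, context: str, threshold: float = 0.5) -> bool:
    answer_words = {w.lower().strip(".,") for w in answer.split()}
    context_words = {w.lower().strip(".,") for w in context.split()}
    if not answer_words:
        return False
    overlap = len(answer_words & context_words) / len(answer_words)
    return overlap >= threshold

context = "The refund window for annual plans is 30 days from purchase."
answer = "You have 30 days from purchase to request a refund on annual plans."

if grounded_enough(answer, context):
    print(answer)
else:
    print("I can't verify that from the available documents.")
```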

What are the core components of an LLM application?

Typical components include: defining the use case; choosing or accessing the base model; data pipelines and retrieval (e.g., RAG with a vector database); prompt and context orchestration; training/fine-tuning where needed; serving infrastructure (GPUs/TPUs); and monitoring, evaluation, and safety controls.
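At the configuration level, those components might be named and wired roughly as follows. Every value in this sketch is a hypothetical placeholder; the point is the shape of the system, not specific products.

```python
# A sketch of the components of an LLM application, as a config map.
# All values are hypothetical placeholders.
app_config = {
    "use_case": "customer-support assistant",
    "base_model": "an API-hosted or self-hosted LLM",
    "retrieval": {
        "knowledge_base": "curated support docs",
        "index": "a vector database of document embeddings",
    },
    "orchestration": "prompt template that injects retrieved context",
    "adaptation": "optional fine-tuning on domain transcripts",
    "serving": "GPU/TPU-backed inference endpoint",
    "operations": ["monitoring", "evaluation", "safety controls"],
}

for component, choice in app_config.items():
    print(f"{component}: {choice}")
```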
