This chapter cuts through the hype around large language models to provide a clear, practical view of where generalist LLMs and domain-specific small language models fit. It traces the shift sparked by the Transformer architecture and the rise of self-supervised training, which enabled models to scale beyond supervised labeling bottlenecks and acquire broad linguistic, reasoning, and generative abilities. The goal is to help readers judge capabilities and limits realistically, understand when smaller models are enough or preferable, and frame the trade-offs that matter for real applications, compliance, and risk.
Small Language Models are built on the same Transformer foundations as larger systems but prioritize efficiency: they typically contain up to a few billion parameters, run well on CPUs or consumer GPUs, can execute on edge or on-prem environments, and preserve data locality. They inherit the encoder/decoder family split (e.g., BERT-like vs. GPT-like), benefit from techniques such as RLHF, and support a wide range of text-centric tasks. Crucially, SLMs can be specialized with relatively modest effort via transfer learning and parameter-efficient fine-tuning, enabling focused accuracy in regulated or sensitive domains. Emerging agentic AI patterns further strengthen their case, as many agent actions require compact, fast models and heterogeneous systems that mix SLMs with LLMs when needed.
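The parameter-efficient fine-tuning mentioned above can be made concrete with the low-rank adapter idea behind methods such as LoRA: the pretrained weight matrix stays frozen and only two small matrices are trained. This is a minimal NumPy sketch, not the chapter's implementation; the dimensions and variable names are illustrative.

```python
import numpy as np

d, r = 64, 4                          # model dim vs. low adapter rank (illustrative)
rng = np.random.default_rng(2)
W = rng.normal(size=(d, d))           # frozen pretrained weight, never updated
A = rng.normal(size=(r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                  # trainable up-projection; zero init means
                                      # the adapter starts as a no-op

def adapted_forward(x):
    # Only A and B (2 * d * r values) are updated during fine-tuning,
    # instead of the full d * d matrix.
    return x @ (W + B @ A).T

trainable = A.size + B.size
print(trainable / W.size)  # 0.125: an 8x reduction in trainable parameters
```

With realistic model dimensions (thousands, not 64) the ratio shrinks far below 1%, which is why such adapters make specializing an SLM feasible on modest hardware.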
From a business perspective, the open-source surge provides strong baselines, letting teams fine-tune instead of training from scratch—dramatically lowering development costs while keeping deployment and inference as the main engineering challenges. The chapter details risks with closed, generalist LLMs—data exposure, lack of transparency and reproducibility, hard-to-audit training data, hallucinations, and code-safety concerns—contrasting them with the privacy, compliance, and accuracy benefits of domain-specific models that run within organizational boundaries. It also highlights sustainability advantages of smaller models and reframes alignment as shaping model intent within a narrow domain. The book’s focus is hands-on: optimizing and quantizing SLMs for inference, serving through diverse APIs, deploying across hardware (including laptops), and integrating with RAG and agentic systems, assuming basic familiarity with Python, PyTorch, Transformers, and related tooling.
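The quantization the book applies to SLM inference can be previewed with a toy example. The sketch below shows symmetric per-tensor int8 quantization in NumPy; it is a simplified illustration of the principle, not the specific scheme used later in the book.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.normal(scale=0.02, size=(256, 256)).astype(np.float32)  # fake weight tensor
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

print(q.nbytes / w.nbytes)          # 0.25: int8 needs a quarter of float32's memory
print(np.abs(w - w_hat).max() <= s) # True: per-element error is bounded by the scale
```

The 4x memory reduction (and the corresponding bandwidth savings) is what lets multi-billion-parameter models fit on consumer GPUs or laptops.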
Some examples of diverse content an LLM can generate.
The timeline of LLMs since 2019 (image taken from paper [3])
Order of magnitude of costs for each phase of LLM implementation from scratch.
Order of magnitude of costs for each phase of LLM implementation when starting from a pretrained model.
Ratios of data source types used to train some popular existing LLMs.
Generic model specialization to a given domain.
An LLM trained for tasks on molecule structures (generation and captioning).
Summary
Small Language Models (SLMs) are compact Transformer-based models, typically with up to a few billion parameters.
Transformers use self-attention mechanisms to process entire text sequences at once instead of word by word.
Self-supervised learning creates training labels automatically from text data without human annotation.
BERT models use only the encoder part of Transformers for classification and prediction tasks.
GPT models use only the decoder part of Transformers for text generation tasks.
Word embeddings convert words into numerical vectors that capture semantic relationships.
RLHF uses reinforcement learning to improve LLM responses based on human feedback.
LLMs can generate any symbolic content including code, math expressions, and structured data.
Open source LLMs reduce development costs by providing pre-trained models as starting points.
Transfer learning adapts pre-trained models to specific domains using domain-specific data.
Generalist LLMs risk data leakage when deployed outside organizational networks.
Closed source models lack transparency about training data and model architecture.
Domain-specific LLMs provide better accuracy for specialized tasks than generalist models.
Smaller specialized models require less computational power than large generalist models.
Fine-tuning costs significantly less than training models from scratch.
Regulatory compliance often requires domain-specific models with known training data.
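The self-attention point in the summary above can be sketched concretely. The following is a minimal, illustrative NumPy implementation of single-head scaled dot-product attention; the variable names and sizes are ours, not from the chapter.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once.

    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projections.
    Every token attends to every other token in parallel -- no recurrence,
    which is what lets Transformers train faster than RNNs.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, embedding dim 8
W = [rng.normal(size=(8, 8)) for _ in range(3)]
out = self_attention(X, *W)
print(out.shape)  # (4, 8): one contextualized vector per token
```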
FAQ
What is a Small Language Model (SLM)?
SLMs are Transformer-based language models designed to perform NLP tasks with far fewer parameters than typical LLMs—ranging from a few hundred million to a few billion (often under 10B). They have a smaller memory footprint, lower compute needs, are optimized for speed and energy efficiency, and can run locally on CPUs, consumer GPUs, mobile/edge devices, or small on-prem clusters—improving privacy and enabling offline and near-real-time use cases.

How do SLMs differ from LLMs?
They use the same core Transformer architecture and training principles; the main difference is scale. SLMs trade some raw capability for lower latency, lower energy use, simpler deployment, and better data locality, while still handling the same kinds of tasks when scoped appropriately.

Why did Transformers replace RNNs for most NLP work?
Transformers use self-attention to process entire sequences in parallel and remove recurrence, enabling much greater parallelism and faster training. Combined with word embeddings that capture semantic and syntactic relationships, they scale better and learn richer language representations than RNN/LSTM/GRU models.

How are modern language models trained?
They primarily use self-supervised learning on large unlabeled text corpora (e.g., predicting the next token or filling masked tokens). This avoids expensive manual labeling and enables scaling, while other paradigms—supervised, semi-supervised, and reinforcement learning—are layered on as needed for specialization or alignment.

What's the difference between encoder-only (BERT) and decoder-only (GPT) Transformers?
Encoder-only models (e.g., BERT) are typically better for understanding tasks such as classification, retrieval, and prediction. Decoder-only models (e.g., GPT) specialize in text generation tasks like drafting, summarization, code completion, and dialogue.

What can LLMs/SLMs do beyond plain text generation?
Beyond translation and dialogue, they can handle language understanding, classification, text and code generation, question answering, summarization, semantic parsing, pattern recognition, basic math, and logical chains. They also work with other symbolic text formats (e.g., programming languages, configuration files, and scientific notations such as SMILES for molecules).

What is RLHF and why is it used?
Reinforcement Learning from Human Feedback (RLHF) fine-tunes a model to optimize a reward aligned with human preferences, improving helpfulness, truthfulness, and safety. It has been used to refine commercial LLMs (e.g., the ChatGPT lineage) after pretraining.

What are the risks of using closed-source generalist LLMs?
Key risks include data leaving your network, potential data leakage, lack of transparency (training data, model internals, versioning), unaddressed biases, hallucinations (especially hard to assess without training-data visibility), and code-generation misuse. These issues can be prohibitive in regulated or privacy-sensitive contexts.

When do domain-specific models provide greater value, and how are they built?
They shine in regulated, high-accuracy, privacy-critical settings where generalist training data fall short (e.g., healthcare, pharma/biotech, manufacturing, finance). Specialization is achieved via transfer learning on domain data (often with parameter-efficient fine-tuning techniques), enabling higher accuracy and safer outputs—illustrated by models trained on SMILES to generate or caption molecular structures. SLMs are also argued to be a strong fit for agentic AI (per the NVIDIA paper cited in the chapter).

Why consider open-source models and SLMs from a cost and deployment perspective?
Starting from a pretrained open-source model significantly reduces development costs versus training from scratch; fine-tuning remains far cheaper than full pretraining. While deployment and inference still require careful engineering, the book focuses on SLM-oriented optimization and quantization to run efficiently on modest hardware, reducing costs and environmental impact compared to very large models.
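The self-supervised training described in the FAQ—labels created automatically from the text itself—can be shown in a few lines. This toy sketch builds next-token (causal LM) training pairs from a token list; real pipelines work the same way, just over tokenized corpora at scale.

```python
def causal_lm_pairs(tokens):
    """Next-token prediction: each prefix predicts the token that follows it.

    The labels come from the raw text itself -- no human annotation needed,
    which is what allows training to scale to web-sized corpora.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = ["small", "models", "run", "locally"]
for context, target in causal_lm_pairs(tokens):
    print(context, "->", target)
# ['small'] -> models
# ['small', 'models'] -> run
# ['small', 'models', 'run'] -> locally
```

A masked-token (BERT-style) objective is built the same way, except a random subset of tokens is hidden and predicted from both sides of the context.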