This chapter cuts through the hype around large language models and provides a grounded overview of how today’s language technologies work and where small language models fit. It traces the roots of modern capabilities to the Transformer architecture and the shift to self-supervised training, which enabled broad skills such as understanding, generation, summarization, question answering, and basic reasoning. It explains key evolutions—encoder-only and decoder-only variants (BERT and GPT families) and the use of human feedback–driven reinforcement learning to shape conversational behavior—while emphasizing that SLMs and LLMs share the same foundations and differ primarily in scale and operational focus.
Small Language Models are defined as models with far fewer parameters than LLMs (typically up to around ten billion), optimized for speed, memory, and energy efficiency so they can run on CPUs, consumer GPUs, edge devices, and on-prem servers with data kept local. Their deployment flexibility enables offline, embedded, and near-real-time applications where privacy and latency matter. Crucially, SLMs can be specialized cost-effectively for domain tasks using transfer learning and parameter-efficient fine-tuning on public and private corpora—an approach well suited to regulated industries like healthcare, finance, manufacturing, chemistry, and biotech. Recent work also argues that agentic systems often benefit from SLMs for economy and responsiveness, sometimes alongside larger models in heterogeneous pipelines. A vibrant open-source ecosystem further lowers barriers and costs versus building from scratch, while improving sustainability due to smaller training and inference footprints.
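To make the specialization path concrete, here is a minimal sketch of parameter-efficient fine-tuning with LoRA using the Hugging Face transformers and peft libraries; the model identifier and hyperparameters are illustrative placeholders, not recommendations from the chapter.

# Minimal LoRA setup sketch; the checkpoint name is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "your-org/small-7b-model"  # hypothetical SLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA adds small trainable adapter matrices while the base weights stay frozen,
# so only a fraction of the parameters are updated during fine-tuning.
# target_modules depends on the model architecture; these names are illustrative.
lora_cfg = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total weights

From here, the adapted model can be trained on a domain corpus with a standard training loop or trainer, which is what keeps domain specialization affordable compared to full retraining.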
The chapter also details the risks of closed, generalist LLMs—external deployment and potential data exposure, security and leakage concerns, opacity around models and training data, hallucinations, and code-related misuse—making them ill-suited for sensitive or highly regulated settings. Domain-specific models, privately deployed, can deliver higher accuracy, better intent alignment, and stronger governance. The chapter closes by outlining decision factors (task specificity, data sensitivity, compute budget) and previews the book’s practical focus: optimizing and quantizing SLMs for efficient inference, serving and deployment across diverse hardware, and integration with retrieval and agentic workflows—supported by hands-on code and minimal prerequisites.
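As a taste of the optimization theme previewed here, the following is a hedged sketch of loading a small model with 4-bit quantization through transformers and bitsandbytes; the checkpoint name is a hypothetical placeholder, and the exact configuration varies by model and hardware (this path assumes a CUDA-capable consumer GPU).

# Sketch of 4-bit quantized loading and a single generation call.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/small-7b-model"  # hypothetical SLM checkpoint
bnb_cfg = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_cfg,
                                             device_map="auto")

inputs = tokenizer("Summarize the following clinical note:",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))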
Some examples of diverse content an LLM can generate.
The timeline of LLMs since 2019 (image taken from paper [3])
Order of magnitude of costs for each phase of LLM implementation from scratch.
Order of magnitude of costs for each phase of LLM implementation when starting from a pretrained model.
Ratios of data source types used to train some popular existing LLMs.
Generic model specialization to a given domain.
An LLM trained for tasks on molecule structures (generation and captioning).
Summary
SLMs are defined as language models with relatively few parameters, typically under ten billion.
Transformers use self-attention mechanisms to process entire text sequences at once instead of word by word (see the attention sketch after this list).
Self-supervised learning creates training labels automatically from text data without human annotation (see the next-token sketch after this list).
BERT models use only the encoder part of Transformers for classification and prediction tasks.
GPT models use only the decoder part of Transformers for text generation tasks.
Word embeddings convert words into numerical vectors that capture semantic relationships.
RLHF uses reinforcement learning to improve LLM responses based on human feedback.
LLMs can generate any symbolic content including code, math expressions, and structured data.
Open source LLMs reduce development costs by providing pre-trained models as starting points.
Transfer learning adapts pre-trained models to specific domains using domain-specific data.
Generalist LLMs risk data leakage when deployed outside organizational networks.
Closed source models lack transparency about training data and model architecture.
Domain-specific LLMs provide better accuracy for specialized tasks than generalist models.
Smaller specialized models require less computational power than large generalist models.
Fine-tuning costs significantly less than training models from scratch.
Regulatory compliance often requires domain-specific models with known training data.
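For the self-attention point above, a minimal numpy sketch of scaled dot-product attention over a toy sequence; the shapes and random values are purely illustrative.

# Scaled dot-product self-attention over a toy sequence (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 4, 8                      # 4 tokens, 8-dimensional embeddings
x = np.random.randn(seq_len, d_model)        # stand-in token embeddings
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv             # queries, keys, values
scores = Q @ K.T / np.sqrt(d_model)          # every token attends to every token
weights = softmax(scores, axis=-1)           # attention weights, rows sum to 1
out = weights @ V                            # weighted mix of value vectors
print(weights.shape, out.shape)              # (4, 4) (4, 8)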
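And for the self-supervised learning point, a tiny illustration of how next-token labels fall out of the text itself, with no human annotation; the whitespace split stands in for a real tokenizer.

# Building (input, target) pairs for next-token prediction from raw text.
text = "small language models run well on local hardware"
tokens = text.split()                      # naive whitespace "tokenizer"

inputs = tokens[:-1]                       # the model sees these tokens...
targets = tokens[1:]                       # ...and must predict each next token
for ctx, nxt in zip(inputs, targets):
    print(f"{ctx!r} -> {nxt!r}")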
FAQ
What is a Small Language Model (SLM)?
SLMs are Transformer-based language models with relatively few parameters (hundreds of millions to a few billion, typically under 10B). They have a smaller memory footprint and lower compute needs, are optimized for speed and energy efficiency, and can run on CPUs, consumer GPUs, mobile/edge devices, or on-prem servers—keeping data local for privacy.

How do SLMs differ from Large Language Models (LLMs)?
They share the same core Transformer architecture; the key difference is scale. LLMs trade cost and compute for breadth and raw capability, while SLMs emphasize efficiency, latency, deployment flexibility, and privacy. Both can tackle similar task types; SLMs are often better fits for constrained or sensitive environments.

Why did Transformers revolutionize NLP?
Transformers use self-attention to process entire sequences in parallel (no recurrence), enabling efficient training at scale and capturing long-range dependencies. With word embeddings and self-supervised pretraining on vast unlabeled text, they unlocked strong generalization across many tasks.

What is the difference between encoder-only (BERT) and decoder-only (GPT) models?
Encoder-only models (e.g., BERT) excel at understanding tasks like classification or prediction. Decoder-only models (e.g., GPT) excel at generation tasks such as writing text or code. Choose the architecture based on whether your task is primarily understanding or generation.

How are modern language models trained, and what is RLHF?
They are largely trained with self-supervised learning (e.g., next-token prediction), where labels are generated programmatically from text. RLHF (Reinforcement Learning from Human Feedback) further fine-tunes models to optimize for helpfulness, truthfulness, and harmlessness by learning from human preference signals.

What kinds of tasks can SLMs/LLMs perform?
Typical tasks include language understanding, classification, text generation, question answering, summarization, semantic parsing, pattern recognition, basic math, code generation, dialogue, general knowledge, and logical inference. They can also work with symbolic text beyond natural language (e.g., code or domain formats).

What are the risks of using closed-source generalist LLMs?
Key risks include data leaving your network, potential data leakage at the provider, limited transparency into training data and model changes, alignment and bias concerns, hallucinations (especially extrinsic when sources are unknown), and the possibility of unsafe code generation without robust guardrails.

When do domain-specific models provide greater business value?
In regulated or high-stakes domains, or when accuracy and privacy are critical. By applying transfer learning or parameter-efficient fine-tuning on domain and private data (and optionally RAG), domain-specific models improve relevance, reduce hallucinations, support compliance, and can run privately on constrained hardware.

How does the open-source ecosystem change costs and strategy?
Starting from a pretrained open-source model cuts development and full-from-scratch training costs. You still invest in data preparation and fine-tuning, and deployment/inference remain cost drivers. Optimization and quantization let SLMs (often ≤7B parameters) run efficiently on CPUs or consumer GPUs, improving sustainability (lower energy/CO₂).

What role do SLMs play in Agentic AI systems?
The chapter cites research arguing SLMs are powerful enough, more suitable, and more economical for many agent invocations. Heterogeneous agent systems can combine SLMs and LLMs, using SLMs for frequent, fast, or on-device steps and LLMs when broader reasoning is required.
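A hedged sketch of that heterogeneous pattern: a trivial router keeps routine steps on a local SLM and escalates to a larger model only when broader reasoning is needed. The call_slm and call_llm functions are hypothetical placeholders, not a real API.

# Illustrative routing between a local SLM and a remote LLM (placeholder functions).
def call_slm(prompt: str) -> str:          # hypothetical: local, fast, cheap
    return f"[SLM answer to: {prompt}]"

def call_llm(prompt: str) -> str:          # hypothetical: remote, slower, broader reasoning
    return f"[LLM answer to: {prompt}]"

def route(prompt: str, needs_broad_reasoning: bool) -> str:
    # Frequent, well-scoped agent steps stay on the SLM; escalate only when needed.
    return call_llm(prompt) if needs_broad_reasoning else call_slm(prompt)

print(route("Extract the invoice total from this text.", needs_broad_reasoning=False))
print(route("Plan a multi-step research strategy.", needs_broad_reasoning=True))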