This chapter cuts through the hype around large language models and provides a grounded overview of how today’s language technologies work and where small language models fit. It traces the roots of modern capabilities to the Transformer architecture and the shift to self-supervised training, which enabled broad skills such as understanding, generation, summarization, question answering, and basic reasoning. It explains key evolutions—encoder-only and decoder-only variants (BERT and GPT families) and the use of human feedback–driven reinforcement learning to shape conversational behavior—while emphasizing that SLMs and LLMs share the same foundations and differ primarily in scale and operational focus.
Small Language Models are defined as models with far fewer parameters than LLMs (typically up to around ten billion), optimized for speed, memory, and energy efficiency so they can run on CPUs, consumer GPUs, edge devices, and on-prem servers with data kept local. Their deployment flexibility enables offline, embedded, and near-real-time applications where privacy and latency matter. Crucially, SLMs can be specialized cost-effectively for domain tasks using transfer learning and parameter-efficient fine-tuning on public and private corpora—an approach well suited to regulated industries like healthcare, finance, manufacturing, chemistry, and biotech. Recent work also argues that agentic systems often benefit from SLMs for economy and responsiveness, sometimes alongside larger models in heterogeneous pipelines. A vibrant open-source ecosystem further lowers barriers and costs versus building from scratch, while improving sustainability due to smaller training and inference footprints.
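To make the specialization path concrete, here is a minimal sketch of parameter-efficient fine-tuning with LoRA using the Hugging Face transformers and peft libraries; the model identifier and hyperparameters are illustrative placeholders, not recommendations from the chapter.

# Minimal LoRA setup sketch; the checkpoint name is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "your-org/small-7b-model"  # hypothetical SLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA adds small trainable adapter matrices while the base weights stay frozen,
# so only a fraction of the parameters are updated during fine-tuning.
# target_modules depends on the model architecture; these names are illustrative.
lora_cfg = LoraConfig(r=8, lora_alpha=16,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total weights

From here, the adapted model can be trained on a domain corpus with a standard training loop or trainer, which is what keeps domain specialization affordable compared to full retraining.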
The chapter also details the risks of closed, generalist LLMs—external deployment and potential data exposure, security and leakage concerns, opacity around models and training data, hallucinations, and code-related misuse—making them ill-suited for sensitive or highly regulated settings. Domain-specific models, privately deployed, can deliver higher accuracy, better intent alignment, and stronger governance. The chapter closes by outlining decision factors (task specificity, data sensitivity, compute budget) and previews the book’s practical focus: optimizing and quantizing SLMs for efficient inference, serving and deployment across diverse hardware, and integration with retrieval and agentic workflows—supported by hands-on code and minimal prerequisites.
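As a taste of the optimization theme previewed here, the following is a hedged sketch of loading a small model with 4-bit quantization through transformers and bitsandbytes; the checkpoint name is a hypothetical placeholder, and the exact configuration varies by model and hardware (this path assumes a CUDA-capable consumer GPU).

# Sketch of 4-bit quantized loading and a single generation call.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/small-7b-model"  # hypothetical SLM checkpoint
bnb_cfg = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_cfg,
                                             device_map="auto")

inputs = tokenizer("Summarize the following clinical note:",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))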
Some examples of diverse content an LLM can generate.
The timeline of LLMs since 2019 (image taken from paper [3])
Order of magnitude of costs for each phase of LLM implementation from scratch.
Order of magnitude of costs for each phase of LLM implementation when starting from a pretrained model.
Ratios of data source types used to train some popular existing LLMs.
Generic model specialization to a given domain.
An LLM trained for tasks on molecule structures (generation and captioning).
Summary
SLMs are defined as language models with relatively few parameters, typically under ten billion.
Transformers use self-attention mechanisms to process entire text sequences at once instead of word by word (see the attention sketch after this list).
Self-supervised learning creates training labels automatically from text data without human annotation (see the next-token sketch after this list).
BERT models use only the encoder part of Transformers for classification and prediction tasks.
GPT models use only the decoder part of Transformers for text generation tasks.
Word embeddings convert words into numerical vectors that capture semantic relationships.
RLHF uses reinforcement learning to improve LLM responses based on human feedback.
LLMs can generate any symbolic content including code, math expressions, and structured data.
Open source LLMs reduce development costs by providing pre-trained models as starting points.
Transfer learning adapts pre-trained models to specific domains using domain-specific data.
Generalist LLMs risk data leakage when deployed outside organizational networks.
Closed source models lack transparency about training data and model architecture.
Domain-specific LLMs provide better accuracy for specialized tasks than generalist models.
Smaller specialized models require less computational power than large generalist models.
Fine-tuning costs significantly less than training models from scratch.
Regulatory compliance often requires domain-specific models with known training data.
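For the self-attention point above, a minimal numpy sketch of scaled dot-product attention over a toy sequence; the shapes and random values are purely illustrative.

# Scaled dot-product self-attention over a toy sequence (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 4, 8                      # 4 tokens, 8-dimensional embeddings
x = np.random.randn(seq_len, d_model)        # stand-in token embeddings
Wq, Wk, Wv = (np.random.randn(d_model, d_model) for _ in range(3))

Q, K, V = x @ Wq, x @ Wk, x @ Wv             # queries, keys, values
scores = Q @ K.T / np.sqrt(d_model)          # every token attends to every token
weights = softmax(scores, axis=-1)           # attention weights, rows sum to 1
out = weights @ V                            # weighted mix of value vectors
print(weights.shape, out.shape)              # (4, 4) (4, 8)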
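And for the self-supervised learning point, a tiny illustration of how next-token labels fall out of the text itself, with no human annotation; the whitespace split stands in for a real tokenizer.

# Building (input, target) pairs for next-token prediction from raw text.
text = "small language models run well on local hardware"
tokens = text.split()                      # naive whitespace "tokenizer"

inputs = tokens[:-1]                       # the model sees these tokens...
targets = tokens[1:]                       # ...and must predict each next token
for ctx, nxt in zip(inputs, targets):
    print(f"{ctx!r} -> {nxt!r}")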
FAQ
What is a Small Language Model (SLM)?
SLMs are Transformer-based language models with relatively few parameters (hundreds of millions to a few billion, typically under 10B). They have a smaller memory footprint and lower compute needs, are optimized for speed and energy efficiency, and can run on CPUs, consumer GPUs, mobile/edge devices, or on-prem servers—keeping data local for privacy.

How do SLMs differ from Large Language Models (LLMs)?
They share the same core Transformer architecture; the key difference is scale. LLMs trade cost and compute for breadth and raw capability, while SLMs emphasize efficiency, latency, deployment flexibility, and privacy. Both can tackle similar task types; SLMs are often better fits for constrained or sensitive environments.

Why did Transformers revolutionize NLP?
Transformers use self-attention to process entire sequences in parallel (no recurrence), enabling efficient training at scale and capturing long-range dependencies. With word embeddings and self-supervised pretraining on vast unlabeled text, they unlocked strong generalization across many tasks.

What is the difference between encoder-only (BERT) and decoder-only (GPT) models?
Encoder-only models (e.g., BERT) excel at understanding tasks like classification or prediction. Decoder-only models (e.g., GPT) excel at generation tasks such as writing text or code. Choose the architecture based on whether your task is primarily understanding or generation.

How are modern language models trained, and what is RLHF?
They are largely trained with self-supervised learning (e.g., next-token prediction), where labels are generated programmatically from text. RLHF (Reinforcement Learning from Human Feedback) further fine-tunes models to optimize for helpfulness, truthfulness, and harmlessness by learning from human preference signals.

What kinds of tasks can SLMs/LLMs perform?
Typical tasks include language understanding, classification, text generation, question answering, summarization, semantic parsing, pattern recognition, basic math, code generation, dialogue, general knowledge, and logical inference. They can also work with symbolic text beyond natural language (e.g., code or domain formats).

What are the risks of using closed-source generalist LLMs?
Key risks include data leaving your network, potential data leakage at the provider, limited transparency into training data and model changes, alignment and bias concerns, hallucinations (especially extrinsic when sources are unknown), and the possibility of unsafe code generation without robust guardrails.

When do domain-specific models provide greater business value?
In regulated or high-stakes domains, or when accuracy and privacy are critical. By applying transfer learning or parameter-efficient fine-tuning on domain and private data (and optionally RAG), domain-specific models improve relevance, reduce hallucinations, support compliance, and can run privately on constrained hardware.

How does the open-source ecosystem change costs and strategy?
Starting from a pretrained open-source model cuts development and full-from-scratch training costs. You still invest in data preparation and fine-tuning, and deployment/inference remain cost drivers. Optimization and quantization let SLMs (often ≤7B parameters) run efficiently on CPUs or consumer GPUs, improving sustainability (lower energy/CO₂).

What role do SLMs play in Agentic AI systems?
The chapter cites research arguing SLMs are powerful enough, more suitable, and more economical for many agent invocations. Heterogeneous agent systems can combine SLMs and LLMs, using SLMs for frequent, fast, or on-device steps and LLMs when broader reasoning is required.
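A hedged sketch of that heterogeneous pattern: a trivial router keeps routine steps on a local SLM and escalates to a larger model only when broader reasoning is needed. The call_slm and call_llm functions are hypothetical placeholders, not a real API.

# Illustrative routing between a local SLM and a remote LLM (placeholder functions).
def call_slm(prompt: str) -> str:          # hypothetical: local, fast, cheap
    return f"[SLM answer to: {prompt}]"

def call_llm(prompt: str) -> str:          # hypothetical: remote, slower, broader reasoning
    return f"[LLM answer to: {prompt}]"

def route(prompt: str, needs_broad_reasoning: bool) -> str:
    # Frequent, well-scoped agent steps stay on the SLM; escalate only when needed.
    return call_llm(prompt) if needs_broad_reasoning else call_slm(prompt)

print(route("Extract the invoice total from this text.", needs_broad_reasoning=False))
print(route("Plan a multi-step research strategy.", needs_broad_reasoning=True))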