1 Introduction
This chapter introduces the book’s mission: to equip practitioners to build low-latency applications with clear definitions, practical techniques, and end-to-end mental models. It defines latency as the time between a cause and its observed effect, a lens that applies from user interactions to kernel-level packet handling. The text motivates why latency matters: slowdowns degrade user experience, compound across layers, and can determine business outcomes. It sets the stage by contrasting latency with related concepts and by positioning the work as a practical, comprehensive guide that balances theory with hands-on diagnosis and remediation.
Latency is measured in time units and manifests across the stack, from nanoseconds in CPU caches and DRAM to microseconds for SSD/NVMe access and milliseconds across networks (e.g., intercontinental round trips). Physical limits, notably the speed of light, constrain how low some latencies can go. Real examples illustrate compounding delays: a web page load spans DNS, network transit, server work, dependent services, and client rendering; even a light switch reveals variability and user-perceived lag. Human perception anchors targets: roughly 100 ms feels instantaneous, ~1 s can still feel responsive with feedback, and operations beyond ~10 s need progress cues or streaming to keep users engaged.
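As a back-of-the-envelope illustration of the speed-of-light limit mentioned above, the sketch below computes a lower bound for an intercontinental round trip; the distance and the roughly two-thirds-of-c propagation speed in optical fiber are assumed figures for illustration, not values taken from the text.

```rust
fn main() {
    // Assumed values: a rough transatlantic great-circle distance and the
    // propagation speed of light in optical fiber (about two-thirds of c).
    let distance_km: f64 = 5_600.0;
    let speed_in_fiber_km_per_s: f64 = 200_000.0;

    // One-way propagation delay and the round-trip time (RTT) lower bound.
    let one_way_ms = distance_km / speed_in_fiber_km_per_s * 1_000.0;
    let rtt_ms = 2.0 * one_way_ms;

    println!("one-way >= {:.1} ms, RTT >= {:.1} ms", one_way_ms, rtt_ms);
    // No amount of server-side optimization can push the RTT below this bound;
    // only moving the endpoints closer together (e.g., edge caching) can.
}
```

With these assumptions the round trip cannot drop below roughly 56 ms, no matter how fast the endpoints themselves are.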
The chapter distinguishes latency from bandwidth and throughput: bandwidth is capacity, throughput is achieved delivery rate, while latency is the delay per request—and improving one can trade off another. It highlights the principle that bandwidth can often be added, but high latency is harder to mask, and it illustrates latency–throughput trade-offs via pipelining: higher overall throughput can increase per-item latency. Beyond user experience and real-time requirements (hard vs. soft), efficiency drives latency work as free hardware speedups have waned with the end of Dennard scaling and the shift to multicore; software efficiency matters more than ever. Finally, it addresses latency–energy trade-offs: techniques like busy polling can minimize delay but raise power draw, while sleep/wake strategies save energy at the cost of responsiveness—the optimal choice depends on workload patterns and goals.
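To make the latency–energy trade-off at the end of this paragraph concrete, the sketch below contrasts a busy-polling wait with a blocking (sleep/wake) wait on a shared flag; the flag-based structure and timings are illustrative assumptions rather than code from the chapter.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    // Busy polling: the waiter spins on an atomic flag. Wake-up latency is
    // minimal (no scheduler involvement), but the core runs flat out while waiting.
    let ready = Arc::new(AtomicBool::new(false));
    let waiter = {
        let ready = Arc::clone(&ready);
        thread::spawn(move || {
            while !ready.load(Ordering::Acquire) {
                std::hint::spin_loop(); // burn cycles instead of sleeping
            }
            println!("busy-poll waiter woke up");
        })
    };
    thread::sleep(Duration::from_millis(10));
    ready.store(true, Ordering::Release);
    waiter.join().unwrap();

    // Sleep/wake: the waiter blocks on a condition variable. The core can
    // idle or sleep (saving energy), but waking it back up adds scheduling latency.
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let waiter = {
        let pair = Arc::clone(&pair);
        thread::spawn(move || {
            let (lock, cvar) = &*pair;
            let mut ready = lock.lock().unwrap();
            while !*ready {
                ready = cvar.wait(ready).unwrap(); // yields the CPU until notified
            }
            println!("blocking waiter woke up");
        })
    };
    thread::sleep(Duration::from_millis(10));
    let (lock, cvar) = &*pair;
    *lock.lock().unwrap() = true;
    cvar.notify_one();
    waiter.join().unwrap();
}
```

Busy polling avoids the scheduler wake-up path and so reacts fastest, but it keeps a core fully busy while waiting; the condition-variable version lets the core idle at the cost of extra wake-up latency. Which one is appropriate depends on the workload patterns and goals, as the chapter notes.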
[Figure: Length of a nanosecond. Source: https://americanhistory.si.edu/collections/search/object/nmah_692464]

Processing without pipelining. We first perform step W (washing) in full and only then perform step D (drying). As step W takes 30 minutes and step D takes 60 minutes, completing both steps takes 90 minutes in total. Therefore, we say that the latency to wash and dry a load of clothes is 90 minutes and the throughput is 1/90 loads of laundry per minute.

Processing with pipelining. We perform step W (washing) in full, but as soon as it completes, we start step W for the next load while, in parallel, step D (drying) runs on the load we just washed. Ignoring the start-up phase, where there is no completed step W yet, each load now takes 120 minutes from start to finish: W and D run in parallel, but the pipeline is bottlenecked by D, so each load effectively spends 60 minutes in each stage. Latency is therefore worse than without pipelining. However, throughput increases to 1/60 loads of laundry per minute, which means we can complete four loads of laundry in the same time it takes the non-pipelined approach to complete three.
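To make the arithmetic concrete, here is a minimal sketch that replays both scenarios with threads, treating one simulated "minute" as one millisecond of wall-clock time; the channel-based washer/dryer split and the time scaling are illustrative assumptions, not code from the chapter.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

const WASH: u64 = 30; // simulated "minutes", run as milliseconds
const DRY: u64 = 60;
const LOADS: u64 = 4;

fn main() {
    // Without pipelining: wash and dry each load back to back.
    let start = Instant::now();
    for _ in 0..LOADS {
        thread::sleep(Duration::from_millis(WASH)); // step W
        thread::sleep(Duration::from_millis(DRY));  // step D
    }
    println!("sequential: {} loads in ~{} 'minutes'", LOADS, start.elapsed().as_millis());

    // With pipelining: a washer thread feeds a dryer through a channel, so
    // the next load is washed while the previous one is drying. The dryer
    // (60 "minutes") is the bottleneck stage that sets throughput.
    let start = Instant::now();
    let (tx, rx) = mpsc::channel();
    let washer = thread::spawn(move || {
        for load in 0..LOADS {
            thread::sleep(Duration::from_millis(WASH)); // step W
            tx.send(load).unwrap();
        }
    });
    for _ in rx {
        thread::sleep(Duration::from_millis(DRY)); // step D
    }
    washer.join().unwrap();
    println!("pipelined:  {} loads in ~{} 'minutes'", LOADS, start.elapsed().as_millis());
}
```

On a typical run the sequential version needs about 360 simulated minutes for four loads, while the pipelined version finishes them in about 270, the time the sequential version needs for three.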

Summary
- Latency is the time delay between a cause and its observed effect.
- Latency is measured in units of time.
- You need to understand the latency constants of your system when designing for low latency.
- Latency matters because people expect a real-time experience.
- When optimizing for latency, there are sometimes throughput and energy efficiency trade-offs.