1 Introduction
This chapter introduces the book’s mission: to equip practitioners to build low-latency applications with clear definitions, practical techniques, and end-to-end mental models. It defines latency as the time between a cause and its observed effect, a lens that applies from user interactions to kernel-level packet handling. The text motivates why latency matters: slowdowns degrade user experience, compound across layers, and can determine business outcomes. It sets the stage by contrasting latency with related concepts and by positioning the work as a practical, comprehensive guide that balances theory with hands-on diagnosis and remediation.
Latency is measured in time units and manifests across the stack, from nanoseconds in CPU caches and DRAM to microseconds for SSD/NVMe access and milliseconds across networks (e.g., intercontinental round trips). Physical limits, notably the speed of light, constrain how low some latencies can go. Real examples illustrate compounding delays: a web page load spans DNS, network transit, server work, dependent services, and client rendering; even a light switch reveals variability and user-perceived lag. Human perception anchors targets: roughly 100 ms feels instantaneous, ~1 s can still feel responsive with feedback, and operations beyond ~10 s need progress cues or streaming to keep users engaged.
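As a back-of-the-envelope illustration of the speed-of-light limit mentioned above, the sketch below computes a lower bound for an intercontinental round trip; the distance and the roughly two-thirds-of-c propagation speed in optical fiber are assumed figures for illustration, not values taken from the text.

```rust
fn main() {
    // Assumed values: a rough transatlantic great-circle distance and the
    // propagation speed of light in optical fiber (about two-thirds of c).
    let distance_km: f64 = 5_600.0;
    let speed_in_fiber_km_per_s: f64 = 200_000.0;

    // One-way propagation delay and the round-trip time (RTT) lower bound.
    let one_way_ms = distance_km / speed_in_fiber_km_per_s * 1_000.0;
    let rtt_ms = 2.0 * one_way_ms;

    println!("one-way >= {:.1} ms, RTT >= {:.1} ms", one_way_ms, rtt_ms);
    // No amount of server-side optimization can push the RTT below this bound;
    // only moving the endpoints closer together (e.g., edge caching) can.
}
```

With these assumptions the round trip cannot drop below roughly 56 ms, no matter how fast the endpoints themselves are.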
The chapter distinguishes latency from bandwidth and throughput: bandwidth is capacity, throughput is achieved delivery rate, while latency is the delay per request—and improving one can trade off another. It highlights the principle that bandwidth can often be added, but high latency is harder to mask, and it illustrates latency–throughput trade-offs via pipelining: higher overall throughput can increase per-item latency. Beyond user experience and real-time requirements (hard vs. soft), efficiency drives latency work as free hardware speedups have waned with the end of Dennard scaling and the shift to multicore; software efficiency matters more than ever. Finally, it addresses latency–energy trade-offs: techniques like busy polling can minimize delay but raise power draw, while sleep/wake strategies save energy at the cost of responsiveness—the optimal choice depends on workload patterns and goals.
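To make the latency–energy trade-off at the end of this paragraph concrete, the sketch below contrasts a busy-polling wait with a blocking (sleep/wake) wait on a shared flag; the flag-based structure and timings are illustrative assumptions rather than code from the chapter.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    // Busy polling: the waiter spins on an atomic flag. Wake-up latency is
    // minimal (no scheduler involvement), but the core runs flat out while waiting.
    let ready = Arc::new(AtomicBool::new(false));
    let waiter = {
        let ready = Arc::clone(&ready);
        thread::spawn(move || {
            while !ready.load(Ordering::Acquire) {
                std::hint::spin_loop(); // burn cycles instead of sleeping
            }
            println!("busy-poll waiter woke up");
        })
    };
    thread::sleep(Duration::from_millis(10));
    ready.store(true, Ordering::Release);
    waiter.join().unwrap();

    // Sleep/wake: the waiter blocks on a condition variable. The core can
    // idle or sleep (saving energy), but waking it back up adds scheduling latency.
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let waiter = {
        let pair = Arc::clone(&pair);
        thread::spawn(move || {
            let (lock, cvar) = &*pair;
            let mut ready = lock.lock().unwrap();
            while !*ready {
                ready = cvar.wait(ready).unwrap(); // yields the CPU until notified
            }
            println!("blocking waiter woke up");
        })
    };
    thread::sleep(Duration::from_millis(10));
    let (lock, cvar) = &*pair;
    *lock.lock().unwrap() = true;
    cvar.notify_one();
    waiter.join().unwrap();
}
```

Busy polling avoids the scheduler wake-up path and so reacts fastest, but it keeps a core fully busy while waiting; the condition-variable version lets the core idle at the cost of extra wake-up latency. Which one is appropriate depends on the workload patterns and goals, as the chapter notes.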
[Figure: Length of a nanosecond. Source: https://americanhistory.si.edu/collections/search/object/nmah_692464]

Processing without pipelining. We first perform step W (washing) in full and only then perform step D (drying). As step W takes 30 minutes and step D takes 60 minutes, completing both steps takes 90 minutes in total. Therefore, we say that the latency to wash and dry a load of clothes is 90 minutes and the throughput is 1/90 loads of laundry per minute.

Processing with pipelining. We perform step W (washing) in full, but as soon as it completes, we start step W for the next load while, in parallel, step D (drying) runs on the load we just washed. Ignoring the start-up phase, where there is no completed step W yet, each load now takes 120 minutes from start to finish: W and D run in parallel, but the pipeline is bottlenecked by D, so each load effectively spends 60 minutes in each stage. Latency is therefore worse than without pipelining. However, throughput increases to 1/60 loads of laundry per minute, which means we can complete four loads of laundry in the same time it takes the non-pipelined approach to complete three.
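To make the arithmetic concrete, here is a minimal sketch that replays both scenarios with threads, treating one simulated "minute" as one millisecond of wall-clock time; the channel-based washer/dryer split and the time scaling are illustrative assumptions, not code from the chapter.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

const WASH: u64 = 30; // simulated "minutes", run as milliseconds
const DRY: u64 = 60;
const LOADS: u64 = 4;

fn main() {
    // Without pipelining: wash and dry each load back to back.
    let start = Instant::now();
    for _ in 0..LOADS {
        thread::sleep(Duration::from_millis(WASH)); // step W
        thread::sleep(Duration::from_millis(DRY));  // step D
    }
    println!("sequential: {} loads in ~{} 'minutes'", LOADS, start.elapsed().as_millis());

    // With pipelining: a washer thread feeds a dryer through a channel, so
    // the next load is washed while the previous one is drying. The dryer
    // (60 "minutes") is the bottleneck stage that sets throughput.
    let start = Instant::now();
    let (tx, rx) = mpsc::channel();
    let washer = thread::spawn(move || {
        for load in 0..LOADS {
            thread::sleep(Duration::from_millis(WASH)); // step W
            tx.send(load).unwrap();
        }
    });
    for _ in rx {
        thread::sleep(Duration::from_millis(DRY)); // step D
    }
    washer.join().unwrap();
    println!("pipelined:  {} loads in ~{} 'minutes'", LOADS, start.elapsed().as_millis());
}
```

On a typical run the sequential version needs about 360 simulated minutes for four loads, while the pipelined version finishes them in about 270, the time the sequential version needs for three.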

Summary
- Latency is the time delay between a cause and its observed effect.
- Latency is measured in units of time.
- You need to understand the latency constants of your system when designing for low latency.
- Latency matters because people expect a real-time experience.
- When optimizing for latency, there are sometimes throughput and energy efficiency trade-offs.