This chapter introduces the book’s practical mission: to help you build low‑latency applications with clear techniques, tools, and mental models. It defines latency as the time delay between a cause and its observed effect, sets expectations for how to think about and measure it across different contexts, and contrasts it with related concepts like throughput and bandwidth. At a high level, the chapter explains why latency matters for user experience, real‑time requirements, and efficiency, and previews the trade‑offs you’ll encounter—including those between latency and throughput, and between latency and energy.
Grounding the definition, the chapter walks through concrete examples that reveal how latency compounds and varies: flipping a (smart) light switch, serving an HTTP request from DNS lookup to page rendering, and moving a network packet from the NIC through the kernel to user space. These scenarios show that end‑to‑end delay is the sum of many small delays across layers, that variance matters as much as averages, and that context determines what you measure. The chapter also frames latency in units that span nanoseconds to milliseconds across the memory, storage, and networking hierarchy, noting that physical limits (including the speed of light) bound what’s achievable and that intuition at microsecond and nanosecond scales often needs deliberate calibration.
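To make that compounding concrete, here is a minimal sketch (not from the chapter) in Rust, using only the standard library and an illustrative plain-HTTP host (example.com on port 80): it times DNS resolution, TCP connection establishment, and the request/response transfer separately, so the end-to-end delay is visibly the sum of per-stage delays.

use std::io::{Read, Write};
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let host = "example.com"; // illustrative target; any plain-HTTP host works

    // Stage 1: DNS resolution (uses the system resolver, which may hit or miss its cache).
    let t0 = Instant::now();
    let addr = (host, 80).to_socket_addrs()?.next().expect("no address found");
    let dns = t0.elapsed();

    // Stage 2: TCP connection establishment (at least one network round trip).
    let t1 = Instant::now();
    let mut stream = TcpStream::connect(addr)?;
    let connect = t1.elapsed();

    // Stage 3: send the request and read the full response.
    let t2 = Instant::now();
    write!(stream, "GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n")?;
    let mut body = Vec::new();
    stream.read_to_end(&mut body)?;
    let transfer = t2.elapsed();

    println!("DNS: {dns:?}, connect: {connect:?}, request+response: {transfer:?}");
    println!("end-to-end: {:?}", t0.elapsed());
    Ok(())
}

Running it a few times also shows variance: a warm DNS cache can shrink the first stage dramatically, which is why variance matters as much as averages.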
The motivations for low latency are threefold. For user experience, people perceive sub-100 ms interactions as instant, tolerate about a second with minimal friction, and need feedback mechanisms for multi-second operations; latency reductions correlate with higher engagement and conversion. For real-time requirements, hard real-time systems must never miss their deadlines, while soft real-time systems can occasionally miss them without catastrophic failure. For efficiency, the end of free speedups from frequency scaling shifts focus to smarter software (echoing observations about multicore trends and software bloat). Finally, the chapter clarifies latency versus bandwidth and throughput, illustrates the classic latency–throughput trade-off via pipelining, and highlights latency–energy tensions (for example, busy polling versus sleep–wake), emphasizing that the right choices depend on workload patterns and goals.
Length of a nanosecond. Source: https://americanhistory.si.edu/collections/search/object/nmah_692464
Processing without pipelining. We first perform step W (washing) in full and only then perform step D (drying). Since W takes 30 minutes and D takes 60 minutes, the two steps together take 90 minutes. Therefore, we say that the latency to wash and dry a load of laundry is 90 minutes and the throughput is 1/90 loads of laundry per minute.
Processing with pipelining. We perform step W (washing) in full, but as soon as it completes, we start washing the next load while step D (drying) runs on the previous one. Once the pipeline is full, the time to complete a single load grows to 120 minutes: washing takes 30 minutes, the load then waits roughly 30 minutes for the dryer, the slower stage, and drying takes another 60 minutes, so per-load latency is worse than without pipelining. However, pipelining increases throughput to 1/60 loads of laundry per minute, which means we can complete four loads of laundry in the time the non-pipelined approach completes three.
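The trade-off is easy to reproduce in code. The following sketch (not from the chapter) simulates the two-stage pipeline with Rust threads and a rendezvous channel, scaling one simulated minute down to ten milliseconds: the washer hands each load to the dryer only when the dryer is free, so the printed per-load latency settles around 120 simulated minutes and four loads finish in roughly 270, matching the numbers above.

use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Scale one simulated minute down to ten milliseconds so the example runs quickly.
const MINUTE: Duration = Duration::from_millis(10);

fn main() {
    let loads = 4;
    let run = Instant::now();
    // Rendezvous channel (capacity 0): a washed load is handed over only when the dryer is free.
    let (to_dryer, from_washer) = mpsc::sync_channel(0);

    // Stage W: washing takes 30 minutes per load.
    let washer = thread::spawn(move || {
        for load in 0..loads {
            let start = Instant::now();
            thread::sleep(30 * MINUTE);
            to_dryer.send((load, start)).unwrap();
        }
    });

    // Stage D: drying takes 60 minutes per load and is the bottleneck.
    for (load, start) in from_washer {
        thread::sleep(60 * MINUTE);
        println!("load {load}: latency {:?}", start.elapsed());
    }
    washer.join().unwrap();
    println!("{loads} loads finished in {:?}", run.elapsed());
}

The rendezvous channel is what keeps per-load latency bounded at 120 minutes; with an unbounded buffer, washed loads would queue up in front of the dryer and latency would grow with every load.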
Summary
Latency is the time delay between a cause and its observed effect.
Latency is measured in units of time.
You need to understand the latency constants of your system when designing for low latency (a rough calibration sketch follows this summary).
Latency matters because people expect a real-time experience.
When optimizing for latency, there are sometimes throughput and energy efficiency trade-offs.
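As a rough illustration of knowing your latency constants (a sketch of my own, not from the chapter), the loop below times a few operations with Rust's standard library; a real calibration would use a proper benchmarking harness, but even this crude measurement makes the orders of magnitude visible.

use std::hint::black_box;
use std::time::Instant;

// Time a closure over many iterations and report the rough per-operation cost.
fn measure(label: &str, iters: u32, mut op: impl FnMut()) {
    let start = Instant::now();
    for _ in 0..iters {
        op();
    }
    println!("{label}: ~{:?} per operation", start.elapsed() / iters);
}

fn main() {
    // Reading the monotonic clock.
    measure("clock read (Instant::now)", 1_000_000, || {
        black_box(Instant::now());
    });
    // Allocating and freeing a small heap object.
    measure("small heap allocation", 1_000_000, || {
        black_box(Box::new([0u8; 64]));
    });
    // One filesystem metadata lookup (a system call).
    measure("fs::metadata on \".\"", 100_000, || {
        black_box(std::fs::metadata(".").ok());
    });
}

The exact numbers vary by machine and load, but the gap between a clock read, a heap allocation, and a syscall is usually orders of magnitude, which is the intuition the chapter asks you to calibrate.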
FAQ
What is latency?
Latency is the time delay between a cause and its observed effect. In practice, that “cause” and “effect” depend on context: for example, pressing Enter and seeing a page, or a packet arriving on a NIC and the data reaching a userspace thread.

How is latency different from response time, service time, and wait time?
Some define response time as service time plus wait time, and call wait time “latency.” This book uses a broader definition: latency is any time delay between cause and effect. Practically, service time is request processing latency, wait time is network/queuing latency, and response time is the overall request latency.

How do we measure latency, and what are typical scales?
Latency is measured in time units. Computer systems span nanoseconds (CPU caches, DRAM), microseconds (NVMe/SSD access), and milliseconds or more (wide-area networking). Example constants: ~100 ns DRAM access, ~10 μs NVMe access, ~60 ms round trip from New York to London. Physics sets hard limits: information cannot travel faster than the speed of light, so distance imposes a minimum latency.

Why does latency matter?
Three main reasons: user experience (users abandon slow services; perception thresholds matter, such as ~100 ms feeling instantaneous), real-time requirements (meeting deadlines in hard/soft real-time systems), and efficiency (lower latency often means less wasted work and better resource use, especially now that “free” hardware speedups have plateaued).

What contributes to the end-to-end latency of loading a web page?
Many stages compound: client work (e.g., DNS lookup, which may miss cache), network transit, server processing (and calls to external stores/services), and client-side rendering and follow-on requests (e.g., executing JavaScript). Each adds to the user-perceived delay.

What does the light-switch example teach about latency?
With smart bulbs (or even some LEDs) there’s a noticeable delay between issuing the command and the light turning on due to wireless hops, hubs, and protocols. It highlights that latency varies across devices and that even small delays affect user experience.

How do OS and NIC internals affect latency?
Packets arrive at the NIC, land in receive queues, get polled by the kernel network stack, and are handed off to userspace (e.g., via recvmsg). CPU core handoffs, interrupts, RX queue contention, and scheduling all introduce delays. Affinity, queueing, and concurrency choices can meaningfully change packet processing latency.

How is latency different from bandwidth and throughput?
Latency is how long a request takes to complete. Bandwidth is the maximum data capacity of a link. Throughput is the actual achieved data or request rate. You can often add bandwidth (more links), but high latency can be hard to reduce without changing the system or network. When discussing data rates, throughput is typically the practical metric.

What is the trade-off between latency and throughput?
Pipelining can raise throughput while sometimes increasing per-item latency. In the laundry analogy, serial wash+dry takes 90 minutes per load (two-thirds of a load per hour). Pipelining lifts throughput to one load per hour but stretches per-load latency to 120 minutes, illustrating the latency–throughput trade-off.

How does optimizing for latency interact with energy efficiency?
They can conflict. For example, busy polling reduces scheduling delays (good for latency) but consumes more power when idle. Sleep–wake strategies save energy but add wakeup latency.
Depending on traffic patterns, busy polling can sometimes be both lower-latency and more energy-efficient (steady, frequent requests), whereas sporadic workloads may favor sleep–wake approaches.
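To see the latency–energy tension in miniature, here is a sketch (not from the chapter) using Rust's standard library: one consumer busy-polls an atomic flag while another blocks on a channel, and the program prints how long each takes to notice an event. The numbers are machine- and scheduler-dependent, and the standard channel may itself spin briefly before blocking, so treat this as an illustration rather than a benchmark.

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Busy polling: the consumer spins on a flag, burning CPU but reacting quickly.
    let ready = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&ready);
    let spinner = thread::spawn(move || {
        while !flag.load(Ordering::Acquire) {
            std::hint::spin_loop(); // hint to the CPU that we are in a spin-wait
        }
        Instant::now()
    });
    thread::sleep(Duration::from_millis(100)); // let the consumer settle into its loop
    let sent = Instant::now();
    ready.store(true, Ordering::Release);
    let seen = spinner.join().unwrap();
    println!("busy-poll wakeup: {:?}", seen - sent);

    // Sleep–wake: the consumer blocks on a channel; the OS must wake and reschedule it.
    let (tx, rx) = mpsc::channel();
    let sleeper = thread::spawn(move || {
        rx.recv().unwrap(); // blocks (using no CPU) until a message arrives
        Instant::now()
    });
    thread::sleep(Duration::from_millis(100));
    let sent = Instant::now();
    tx.send(()).unwrap();
    let seen = sleeper.join().unwrap();
    println!("blocking-recv wakeup: {:?}", seen - sent);
}

On most machines the spin-wait reacts within nanoseconds to a few hundred nanoseconds while the blocking receive typically takes on the order of microseconds, but the spinning core sits at full utilization the whole time, which is exactly the energy cost described above.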