Overview

1 Thinking in distributed systems: Models, mindsets, and mechanics

Modern software is unavoidably distributed: multiple concurrent components communicate over a network to deliver functionality that a single machine cannot provide alone. This chapter motivates why we distribute—so systems can remain correct, scalable, and reliable as load grows and failures occur—and frames the core challenge: complexity emerges from many parts and their interactions. It argues that progress comes from strong mental models, not just terminology or tool familiarity, and defines good models as both correct (no falsehoods) and complete (no relevant omissions). With an emphasis on moving from “knowing” to “understanding,” it sets the goal of reasoning about distributed systems confidently through precise, shared abstractions.

The chapter builds a baseline model of a distributed system as a state machine that advances in discrete steps, each taken by a component or the network, through internal work and message exchanges. It clarifies global versus local viewpoints—observers can see the whole system; components see only their local state and their network channel—and shows how correctness can be specified via safety (nothing bad happens) and liveness (something good eventually happens). Scalability and reliability are treated as responsiveness under load and failures, respectively, formalized with service-level indicators, objectives, and error budgets. The text highlights that multiple, even contrasting, models can be valid depending on focus, warns about overreliance on analogies, and introduces “systems of systems” thinking (holons/holarchies) to flexibly zoom in and out of complex architectures.
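The baseline model can be made concrete with a small simulation. What follows is a minimal sketch, not the chapter's own code: the component names `C1` and `C2`, the counter-valued local state, and the FIFO network are all assumptions chosen for illustration. Each send or delivery is one discrete step, and the recorded trace of global states is one behavior of the system.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class System:
    # Local state per component (hypothetical: one integer counter each).
    local: dict = field(default_factory=lambda: {"C1": 0, "C2": 0})
    # Network state: in-flight messages as (destination, payload) pairs.
    network: deque = field(default_factory=deque)


def send(system, dst, payload):
    """External action: a component hands a message to the network."""
    system.network.append((dst, payload))


def deliver(system):
    """Network step: deliver the oldest in-flight message, if any."""
    if not system.network:
        return False
    dst, payload = system.network.popleft()
    system.local[dst] += payload  # the receiver updates its local state
    return True


def run():
    """Execute a few steps and return the behavior as a list of global states."""
    system = System()
    trace = []

    def snapshot():
        # The global view: every local state plus the network contents.
        trace.append((dict(system.local), list(system.network)))

    snapshot()
    send(system, "C2", 5)
    snapshot()
    send(system, "C1", 2)
    snapshot()
    while deliver(system):
        snapshot()
    return trace
```

Calling `run()` returns the global states in order. Only an outside observer can assemble such a trace; each component in the sketch ever touches only its own entry in `local`.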

To make abstractions tangible, the chapter presents an office-building metaphor for components, network, and external interfaces, which cleanly captures crash and message-delivery semantics such as loss, duplication, and reordering. It advocates “thinking above the code,” modeling concurrency through interleavings to generalize race conditions and connect to database serializability, and distills the central design problem: think globally, act locally—craft global guarantees via only local observations and actions. Throughout, short “AHA!” moments reinforce that guarantees are application-specific and emergent, encouraging readers to juggle multiple mental models, embrace changing resolutions of view, and build the disciplined reasoning needed to design functional, scalable, and reliable distributed systems.
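Reasoning through interleavings can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the chapter's method: two hypothetical processes, `P1` and `P2`, each read a shared counter and then write back the value they read plus one. Enumerating every interleaving that preserves each process's own step order shows which final values are possible.

```python
from collections import Counter
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        chosen = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in chosen else next(ib) for i in range(n)]


def execute(schedule):
    """Run one interleaving of read/write steps against a shared counter."""
    shared = 0
    registers = {}  # value each process last read
    for process, operation in schedule:
        if operation == "read":
            registers[process] = shared
        else:  # "write": store the previously read value plus one
            shared = registers[process] + 1
    return shared


def final_values():
    """Count how many interleavings end in each final counter value."""
    p1 = [("P1", "read"), ("P1", "write")]
    p2 = [("P2", "read"), ("P2", "write")]
    return Counter(execute(s) for s in interleavings(p1, p2))
```

Of the six interleavings, only the two serial ones leave the counter at 2. The other four lose an update, which is the race condition, and the serial schedules are the ones a serializable database would permit.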

Mental model and system
Different models describing the same aspects of a system (the set of facts of each model totally overlaps)
The network as the buffer of inflight messages
The components as the buffer for inflight messages
Different models describing different aspects of a system (the set of facts of each model partially overlaps)
A distributed system as a set of concurrent, communicating components (local state of network not shown)
Behavior of a system as a sequence of states
Safety and liveness
Behavior space of a distributed transaction with two participants
A distributed system as a set of concurrent, communicating subsystems
Holons and holarchies
Two different holarchies, representing the same system
Global point of view
C1’s point of view
Distributed Systems Incorporated
Black box versus white box, a global point of view
Local point of view
Splitbrain
Reasoning about race conditions
Reasoning about serializability

Summary

  • A mental model is the internal representation of the target system and is the basis of comprehension and communication.
  • Striving for a deep understanding of distributed systems is better than merely knowing about their concepts.
  • A distributed system is a set of concurrent components that communicate by sending and receiving messages over a network.
  • The core challenge in designing distributed systems is creating a coherent system that functions as a whole despite each component having only local knowledge.
  • Ultimately, we are interested in the guarantees a system provides. We reason about these guarantees in terms of correctness (safety and liveness) as well as scalability and reliability.
  • Distributed systems can be visualized as a corporation, where rooms represent concurrent components, pneumatic tubes represent the network, and a mailbox represents the external interface.

FAQ

What is a distributed system, and how should we reason about its behavior?
A distributed system is a set of collaborating, concurrent components that communicate by exchanging messages over a network. A clear way to reason about its behavior is as a state machine: the system advances in discrete steps, each taken by exactly one component or the network, via external actions (send/receive) or internal actions (local compute/state access).
Why build distributed systems if they are notoriously hard to design?
To meet fitness goals under real-world conditions. A single component can be functional, but cannot handle unbounded load or survive inevitable failures. Distributing enables correctness at scale: scalability (responsive under load) and reliability (responsive under failure) emerge from multiple components and their interactions.
What’s the difference between knowing and understanding in this context?
Knowing captures facts (e.g., definitions and rules). Understanding comes from dependable mental models that let you predict and explain behavior. Like learning chess rules versus mastering strategy, the chapter emphasizes building accurate, concise models so you can reason with confidence.
What makes a good mental model of a system?
A good model is both correct and complete:
  • Correctness: It contains no falsehoods (every fact in the model is true in the system).
  • Completeness: It omits no relevant facts (every fact needed for the purpose at hand is included).
Relevance is application-specific.
Can different models of the same system all be valid?
Yes. Multiple models can be equivalent (express the same facts differently), such as modeling in-flight messages as part of the network versus part of each component. Other models deliberately focus on different aspects (e.g., an abstract transaction model that ignores messages versus a protocol model that includes message loss/duplication/reordering). Choose the model that best fits the point you need to make.
How do safety and liveness define correctness?
Correctness = every possible behavior satisfies safety and liveness:
  • Safety: nothing bad ever happens (e.g., no two participants make conflicting decisions).
  • Liveness: something good eventually happens (e.g., every participant eventually decides).
Together, they prevent both inconsistency and getting stuck.
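The two-participant transaction example can be expressed as predicates over a behavior. This is a rough sketch under assumptions that are not from the chapter: a behavior is a finite list of states, each state maps the hypothetical participants `P1` and `P2` to `None`, `"commit"`, or `"abort"`, and liveness is approximated by inspecting the last recorded state. Strictly, liveness is a property of unbounded behaviors, so a finite check like this one can only suggest that a behavior got stuck.

```python
def is_safe(behavior):
    """Safety: in no state do participants hold conflicting decisions."""
    for state in behavior:
        decisions = {d for d in state.values() if d is not None}
        if len(decisions) > 1:
            return False
    return True


def is_live(behavior):
    """Liveness (finite approximation): every participant has decided by the end."""
    return all(d is not None for d in behavior[-1].values())


# Three hypothetical behaviors of a transaction with two participants.
GOOD = [
    {"P1": None, "P2": None},
    {"P1": "commit", "P2": None},
    {"P1": "commit", "P2": "commit"},
]
CONFLICTING = [
    {"P1": None, "P2": None},
    {"P1": "commit", "P2": None},
    {"P1": "commit", "P2": "abort"},  # the "bad thing" has happened
]
STUCK = [
    {"P1": None, "P2": None},
    {"P1": "commit", "P2": None},  # P2 never decides
]
```

Here `GOOD` satisfies both predicates, `CONFLICTING` violates safety even though every participant decided, and `STUCK` is safe but not live.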
How are scalability and reliability defined in this chapter?
Through responsiveness to Service Level Objectives (SLOs):
  • Scalability: ability to meet SLOs under increasing load.
  • Reliability: ability to meet SLOs in the presence of failures.
The chapter formalizes responsiveness using SLIs (measurements), SLOs (predicates), error rate, and error budget (upper bound on tolerated errors).
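One way to read these four terms together is sketched below. The concrete numbers are assumptions for illustration only: the SLI is taken to be request latency in milliseconds, the SLO is the predicate "latency is at most 100 ms", and the error rate is the fraction of measurements that violate the SLO. The chapter's exact formalization may differ.

```python
def error_rate(slis, slo):
    """Fraction of measurements (SLIs) for which the SLO predicate is false."""
    if not slis:
        return 0.0
    violations = sum(1 for measurement in slis if not slo(measurement))
    return violations / len(slis)


def within_budget(slis, slo, error_budget):
    """Responsive if the observed error rate does not exceed the error budget."""
    return error_rate(slis, slo) <= error_budget


def latency_slo(latency_ms):
    """Hypothetical SLO: a request should complete within 100 ms."""
    return latency_ms <= 100


# Hypothetical latency measurements in milliseconds; two of ten exceed 100 ms.
LATENCIES_MS = [20, 35, 120, 40, 50, 60, 70, 80, 90, 300]
```

With these measurements the error rate is 0.2, so the system counts as responsive under an error budget of 0.25 but not under a budget of 0.1. Scalability and reliability then ask whether the same check still passes as load grows or as failures occur.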
Why is the global versus local viewpoint important?
Analysts often reason with a global, omniscient view of system state, but each component only sees its local state and its channel to the network. Designing distributed algorithms is therefore “think globally, act locally”: local steps, limited knowledge, and unreliable communication must still yield global guarantees (e.g., avoid split-brain leadership).
What are holons and holarchies, and how do they help model systems of systems?
A holon is an entity that is both a whole and a part; a holarchy is a hierarchy of such entities. This lens lets you zoom in/out: a “database” can be treated as an atomic service or as a higher-order composition of nodes; a “cluster” can be one unit or many subsystems. It matches how real distributed systems are organized and reasoned about at different resolutions.
What is the “Distributed Systems Inc.” analogy, and what concerns does it capture?
The system is an office building: rooms = components with local state; pneumatic tubes/mailroom = network; mailbox = external interface. It vividly captures:
  • Message delivery semantics: loss, duplication, reordering.
  • Crash semantics: transient (short absence), intermittent (vacation), permanent (departure).
The analogy helps explore consequences and countermeasures (e.g., retries, idempotency, coordination).
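Idempotency as a countermeasure can be illustrated with a receiver that remembers which messages it has already applied. This is a simplified sketch, not the book's design: the `Account` class, the message IDs, and the deposit amounts are hypothetical. It also assumes the operation is commutative (deposits), so reordering is harmless once duplicates are filtered out; loss would still need retries on the sender side, which the sketch does not model.

```python
class Account:
    """A component whose receive step is idempotent per message ID."""

    def __init__(self):
        self.balance = 0
        self.seen = set()  # IDs of messages already applied

    def receive(self, msg_id, amount):
        """Apply a deposit once; return False if it was a duplicate."""
        if msg_id in self.seen:
            return False
        self.seen.add(msg_id)
        self.balance += amount
        return True


def deliver_all(deliveries):
    """Feed a sequence of (msg_id, amount) deliveries to a fresh account."""
    account = Account()
    for msg_id, amount in deliveries:
        account.receive(msg_id, amount)
    return account.balance


# Three messages were sent; the network duplicated and reordered them.
SENT = [(1, 10), (2, 20), (3, 30)]
DELIVERED = [(2, 20), (1, 10), (2, 20), (3, 30), (1, 10)]
```

Both `deliver_all(SENT)` and `deliver_all(DELIVERED)` end with a balance of 60, so the duplicated and reordered delivery is indistinguishable from the clean one as far as the final state is concerned.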
