Overview

1 Introduction

Modern applications routinely outgrow the assumptions of classical algorithms that expect all data to fit in main memory. In data-intensive systems, the dominant cost is moving and accessing data rather than computing on it, so scalability hinges on minimizing transfers and space. This chapter motivates that perspective, explains why “massive” is relative to resources and requirements, and sets the book’s goal: practical, hardware-aware techniques—grounded in theory yet implementation-friendly—that help engineers design scalable solutions for large, evolving datasets across real-world pipelines.

A concrete example with billions of news comments shows how straightforward hash-table approaches quickly demand tens to hundreds of gigabytes, motivating more compact, approximate alternatives. The chapter previews succinct data structures for core questions: Bloom filters for membership with small false-positive rates, Count-Min Sketch for frequency estimation and heavy hitters in a fraction of the space, and HyperLogLog for cardinality estimates with small relative error using only kilobytes of memory. It then broadens to streaming scenarios, where one-pass constraints favor Bernoulli and reservoir sampling and quantile sketches (such as q-digest and t-digest) for real-time analytics, and to persistent storage, where external-memory algorithms and database-backed structures (e.g., B-trees, Bε-trees, LSM-trees) are matched to read-, write-, or mixed-optimized workloads, including efficient external sorting.

The chapter grounds these choices in hardware realities: the widening CPU–memory performance gap, a memory hierarchy trading speed for capacity, and the “latency lags bandwidth” principle, all of which reward sequential access, caching, and compact representations—summarized by the mantra that reducing space saves time. It also notes added delays in distributed systems and cloud environments, reinforcing designs that minimize random I/O and network hops. The book is organized into three parts: hash-based sketches, streaming and sampling methods with quantile estimation, and external-memory models with storage-engine data structures. It is intended for practitioners with foundational algorithmic knowledge who seek system-agnostic, scalable techniques they can apply at work.

In this example, we build a (comment-id, frequency) hash table to store distinct comment IDs with their frequency counts. When an incoming comment ID, say 384793, is already in the table, its frequency count is incremented. We also build topic-related hash tables: for each article, we count the number of times associated keywords appear in its comments (e.g., for the sports theme, keywords might be “soccer,” “player,” “goal,” etc.). For a large dataset of 3 billion comments, these data structures may require tens to a hundred gigabytes of RAM.
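A minimal sketch of this naive approach, assuming the comments arrive as (comment_id, text) pairs; the function and variable names are illustrative, not from the book:

```python
from collections import defaultdict

def count_comment_frequencies(comments):
    """Naive exact counting: one hash-table entry per distinct comment id.
    For billions of comments this single table can already need tens of GB of RAM."""
    freq = defaultdict(int)
    for comment_id, _text in comments:
        freq[comment_id] += 1            # e.g., seeing id 384793 again bumps its count
    return freq

def count_topic_keywords(comments, keywords):
    """Per-article topic table: how often each keyword (e.g., "soccer", "goal")
    appears in that article's comments."""
    counts = {kw: 0 for kw in keywords}
    for _comment_id, text in comments:
        for kw in keywords:
            if kw in text.lower():
                counts[kw] += 1
    return counts
```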
Figure 1.1 Most common data structures, including hash tables, become difficult to store and manage with large amounts of data.

Figure 1.2 CPU-memory performance gap, adapted from Hennessy & Patterson’s Computer Architecture. The graph shows the widening gap between processor speed and main-memory access speed, measured as the average number of memory accesses per second over time; the vertical axis is on a log scale. Processors improved by about 1.5 times per year up to 2005, while access to main memory improved by only about 1.1 times per year. Processor speedup has somewhat flattened since 2005, but this is being alleviated by using multiple cores and parallelism.

Figure 1.3 Different types of memory in a computer. Starting from registers in the bottom-left corner, which are blindingly fast but also very small, we move up (getting slower) and to the right (getting larger) through level 1, level 2, and level 3 caches and main memory, all the way to SSD and/or HDD. Mixing different memories in the same computer gives the illusion of having both speed and storage capacity, with each level serving as a cache for the next, larger one.

Figure 1.4 Cloud access times can be high due to network load and complex infrastructure; accessing the cloud can take hundreds of milliseconds or even seconds. We can view the cloud as yet another level of memory, even larger and slower than the hard disk. Improving performance in cloud applications is also hard because the times to access or write data in the cloud are unpredictable.

Figure 1.5 An efficient data structure with bells and whistles


Summary

  • Applications today generate and process large amounts of data at a rapid rate. Traditional data structures, such as key-value dictionaries, can grow too big to fit in RAM, which can lead to an application choking on the I/O bottleneck.
  • To process large datasets efficiently, we can design space-efficient hash-based sketches, do real-time analytics with the help of random sampling and approximate statistics, or deal with data on disk and other remote storage more efficiently.
  • This book serves as a natural continuation of a basic algorithms and data structures course or book: it teaches you how to transform the fundamental algorithms and data structures into ones that scale well to large datasets.
  • Large data is a major issue for today’s computers and systems for two key reasons. First, CPU (and multiprocessor) speeds improve at a much faster rate than memory speeds. Second, the speed-size tradeoff among the different types of memory in a computer, together with the latency-versus-bandwidth phenomenon, means applications access data more slowly than they can compute on it. These trends are not likely to change soon, so algorithms and data structures that address I/O cost and space will only grow in importance.
  • In data-intensive applications, optimizing for space means optimizing for time.

FAQ

What topics does Chapter 1 introduce and how is the book structured?
Chapter 1 outlines three themes that define the book:
  • Part 1 (Ch. 2–5): Hash-based succinct sketches (review of hashing, Bloom filters, Count-Min Sketch, HyperLogLog).
  • Part 2 (Ch. 6–8): Data streams (Bernoulli and reservoir sampling, sliding-window sampling, quantile sketches such as q-digest and t-digest).
  • Part 3 (Ch. 9–11): External-memory algorithms and storage (I/O-efficient searching/sorting, B-trees, Bε-trees, LSM-trees).
How are algorithms for massive datasets different from “classical” algorithms?
Classical analyses assume data fits in RAM and focus on CPU steps (Big-O). With massive data, the bottleneck is moving and accessing data, not computing on it. This shifts emphasis to space efficiency, approximate answers (sketches), cache-aware access patterns, and external-memory models that minimize costly transfers.
What qualifies as a “massive dataset” for this book’s techniques?
It’s relative. Size alone doesn’t define it; hardware budget, workload, and performance requirements matter. Teams with modest datasets but tight memory or stringent latency can benefit, and even resource-rich organizations adopt space-efficient structures to stretch RAM.
Why do common in-memory data structures become problematic at scale?
Per-item overhead accumulates. For example, a hash-based dictionary over billions of items can consume tens of gigabytes once you include keys, counts, and table overhead. Multiple such structures (e.g., per-topic indices) quickly exceed RAM, complicating performance and operations.
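A rough back-of-envelope calculation shows how quickly this adds up; the per-entry overhead below is an assumed, typical figure rather than a measured one:

```python
# Assumed sizes: 8-byte key, 8-byte count, ~16 bytes of hash-table overhead per entry.
distinct_ids = 3_000_000_000          # roughly one entry per comment in the worst case
bytes_per_entry = 8 + 8 + 16
total_gb = distinct_ids * bytes_per_entry / 1e9
print(f"~{total_gb:.0f} GB for a single (id, count) table")   # ~96 GB, before any per-topic tables
```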
Which succinct data structures can replace large hash tables and what trade-offs do they make?
  • Bloom filter: answers membership with a small false-positive rate, using far less space (e.g., an order-of-magnitude reduction) because it stores bit patterns, not keys (see the sketch after this list).
  • Count-Min Sketch: estimates frequencies with one-sided (over)error, using dramatically less space than key-count hash maps; also useful for per-topic keyword counts.
  • HyperLogLog: estimates cardinality with small relative error using kilobytes of memory.
They trade exactness for bounded, controllable error and huge space savings.
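A minimal Bloom filter sketch, using the standard sizing formulas and double hashing to derive k bit positions per item; the class name and hashing choice are illustrative assumptions, not the book’s reference implementation:

```python
import hashlib
from math import ceil, log

class BloomFilter:
    """Minimal Bloom filter: k hash functions each set one bit in an m-bit array.
    Queries may return false positives but never false negatives."""

    def __init__(self, capacity, error_rate=0.01):
        # Standard sizing: m = -n*ln(p) / (ln 2)^2 bits, k = (m/n)*ln 2 hash functions.
        self.m = ceil(-capacity * log(error_rate) / (log(2) ** 2))
        self.k = max(1, round((self.m / capacity) * log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k indices from two halves of a single SHA-256 digest.
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # make h2 odd
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(item))
```

With these formulas, holding 100 million ids at a 1% false-positive rate takes roughly 120 MB of bits and 7 hash functions, versus several gigabytes for an exact hash set of the keys.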
In the comments example, why does a naive dictionary approach hit memory limits?
Counting distinct comment IDs and maintaining per-topic keyword tables for billions of comments requires many key-value entries plus hash-table overhead. Summed across tasks (deduplication, topic counts), this can approach or exceed available RAM, making updates and queries slow or infeasible.
How does the streaming setting change solution design?
With high-velocity events (e.g., new comments, likes), you often can’t store everything or rescan later. One-pass, low-memory techniques are needed: Bernoulli/reservoir sampling for aggregates; sketches for heavy hitters and quantiles (e.g., t-digest). Results are approximate but timely and resource-efficient.
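A one-pass reservoir sampling sketch (Vitter’s Algorithm R); the function name is illustrative:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length,
    using O(k) memory and a single pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir with the first k items
        else:
            j = random.randrange(i + 1)     # pick a slot among the i+1 items seen so far
            if j < k:
                reservoir[j] = item         # replace an old item with probability k/(i+1)
    return reservoir
```

Every item seen so far ends up in the sample with equal probability k/n, no matter how long the stream turns out to be.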
What hardware realities make massive data hard?
  • CPU–memory speed gap: computation outpaces data access.
  • Memory hierarchy: fast memories are small; larger memories are slower (caches → RAM → SSD/HDD).
  • Latency vs. bandwidth: bandwidth improves faster than latency; many small random accesses are costly.
  • Distributed systems: network hops add unpredictable delays on top of local storage costs.
What are external-memory algorithms and when should I use them?
They target data on SSD/HDD/cloud, minimizing slow transfers by batching and sequential access. Use them for large persistent datasets (e.g., databases, indexes) and workloads that need accurate queries but tolerate disk latency. Structures include B-trees, Bε-trees, and LSM-trees, tuned for read-, write-, or mixed-optimized workloads.
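A compact external merge sort sketch illustrating the batching-plus-sequential-access idea; the chunk size and the assumption that every input line is newline-terminated are illustrative choices, not prescriptions from the book:

```python
import heapq
import itertools
import os
import tempfile

def external_sort(input_path, output_path, chunk_lines=1_000_000):
    """Sort a huge text file (one newline-terminated record per line)
    without loading it all into RAM."""
    # Phase 1: read fixed-size chunks, sort each in memory, write sorted runs to disk.
    run_paths = []
    with open(input_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_lines))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    # Phase 2: stream a k-way merge of the sorted runs straight to the output file,
    # so every run and the output are read and written sequentially.
    run_files = [open(p) for p in run_paths]
    try:
        with open(output_path, "w") as dst:
            dst.writelines(heapq.merge(*run_files))
    finally:
        for f in run_files:
            f.close()
        for p in run_paths:
            os.remove(p)
```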
How should I design algorithms with hardware in mind?
  • Reduce space to save time (fit summaries in fast memory, avoid random I/O).
  • Favor sequential over random access; exploit cache lines/pages/blocks and spatial locality (see the sketch after this list).
  • Lay out data to minimize transfers and leverage bandwidth.
  • Balance scalability with other system concerns (security, availability, maintainability).
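A small illustration of the sequential-versus-random point mentioned in the list above. The names and sizes are illustrative, and absolute numbers depend entirely on the machine and the Python runtime, so treat it as a demonstration of the access pattern rather than a benchmark:

```python
import random
import time

# Traverse the same list of integers in sequential order and then in shuffled order.
# The work per element is identical; only the access pattern changes, so most of the
# time gap comes from cache behavior.
N = 10_000_000
data = list(range(N))

sequential_order = list(range(N))
random_order = sequential_order[:]
random.shuffle(random_order)

def total_in_order(indices):
    # Sum data[] following the given visiting order.
    total = 0
    for i in indices:
        total += data[i]
    return total

for name, order in (("sequential", sequential_order), ("random", random_order)):
    start = time.perf_counter()
    total_in_order(order)
    elapsed = time.perf_counter() - start
    print(f"{name:>10} access: {elapsed:.2f}s")
```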
