Overview

The chapter opens by acknowledging the growing pains of modern data work: as collecting and storing data becomes cheaper, practitioners increasingly collide with the limits of single-machine tools, experiencing slow runs, instability, and unwieldy workflows. It situates Python’s data stack (NumPy, SciPy, Pandas, Scikit-learn) as the on-ramp that democratized data science, while clarifying that these libraries excel primarily for small, in-memory datasets. To reason about scale, the text introduces a practical tiering—small, medium, and large data—highlighting when paging, parallelism, or distribution become necessary and when conventional tools start to fail. The chapter’s goal is to equip readers—especially beginner to intermediate data scientists and engineers—with the foundational concepts behind scalable computing and to motivate why Dask belongs in a modern toolkit.

Dask is presented as a Python-native bridge from familiar single-machine workflows to parallel and distributed computation. Its layered design centers on a task scheduler that manages computations expressed through low-level primitives (Delayed and Futures) and exposed via high-level collections that mirror NumPy arrays, Pandas DataFrames, and more. By chunking data into partitions and orchestrating many small tasks, Dask preserves familiar APIs while enabling work to scale from a laptop to a cluster with minimal code changes and low operational overhead. Beyond collections, its low-level APIs generalize to custom Python workloads, making it a flexible parallel framework. A brief comparison notes that while other systems are powerful, Dask’s tight alignment with Python, short learning curve, and versatility make it especially appealing to users steeped in the Python data stack.
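
For instance, here is a minimal sketch (assuming Dask and Pandas are installed; the column names and values are illustrative) of how a Dask DataFrame partitions an ordinary Pandas DataFrame while keeping the same API:

    import pandas as pd
    import dask.dataframe as dd

    # An ordinary in-memory Pandas DataFrame (columns are illustrative)
    df = pd.DataFrame({"borough": ["BK", "MN", "BK", "QN"] * 250_000,
                       "fine": [115, 65, 50, 95] * 250_000})

    # Split it into 4 partitions; each partition is itself a Pandas DataFrame
    ddf = dd.from_pandas(df, npartitions=4)

    # Same API as Pandas, but lazy: this builds a task graph instead of running
    mean_fine = ddf.groupby("borough")["fine"].mean()

    # Nothing executes until compute() hands the graph to the scheduler
    print(mean_fine.compute())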

The conceptual backbone of Dask’s execution model is the directed acyclic graph (DAG), which encodes computations as nodes with explicit dependencies, making order, parallelism, and optimization transparent. Using a cooking analogy, the chapter illustrates how DAGs prevent cycles, capture transitive dependencies, and help expose bottlenecks and opportunities for reordering. With this model, the scheduler can prioritize tasks near outputs, stream results early, and manage memory efficiently, while also handling real-world concerns: deciding when to scale up versus scale out, enforcing concurrency limits and resource locks, and recovering from worker failures or data loss by replaying lineage. The chapter closes by introducing a real, messy, medium-scale companion dataset of NYC parking tickets that will anchor hands-on examples throughout the book.
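
A small sketch of how such a graph arises in practice: wrapping plain Python functions in dask.delayed records dependencies instead of executing immediately. The function names below echo the chapter's cooking analogy and are purely illustrative:

    from dask import delayed

    @delayed
    def boil_water():          # no dependencies: can start immediately
        return "boiling water"

    @delayed
    def cook_pasta(water):     # depends on boil_water
        return f"pasta cooked in {water}"

    @delayed
    def make_sauce():          # independent of the pasta: can run in parallel
        return "amatriciana sauce"

    @delayed
    def plate(pasta, sauce):   # terminal node: depends on both branches
        return f"{pasta} + {sauce}"

    dish = plate(cook_pasta(boil_water()), make_sauce())
    dish.visualize("recipe.png")   # render the DAG (requires graphviz)
    print(dish.compute())          # scheduler walks the graph in dependency order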

Figures

  • The components and layers that make up Dask
  • My favorite recipe for bucatini all’Amatriciana
  • A graph displaying nodes with dependencies
  • An example of a cyclic graph demonstrating an infinite feedback loop
  • The graph represented in figure 1.3 redrawn without transitive reduction
  • The full directed acyclic graph representation of the bucatini all’Amatriciana recipe
  • Scaling up replaces existing equipment with larger/faster/more efficient equipment, while scaling out divides the work between many workers in parallel
  • A graph with nodes distributed to many workers depicting dynamic redistribution of work as tasks complete at different times
  • An example of resource starvation

Summary

In this chapter you learned

  • Dask can be used to scale popular data analysis libraries such as Pandas and NumPy, allowing you to analyze medium and large datasets with ease.
  • Dask uses directed acyclic graphs (DAGs) to coordinate execution of parallelized code across CPU cores and machines.
  • Directed acyclic graphs are composed of nodes connected by directed edges; work flows in one direction from a clearly defined start to a clearly defined end, with no looping.
  • Upstream nodes must be completed before work can begin on any dependent downstream nodes.
  • Scaling out can generally improve the performance of complex workloads, but it creates additional coordination overhead that can substantially reduce those performance gains.
  • In the event of a failure, the steps to reach a node can be repeated from the beginning without disturbing the rest of the process.

FAQ

What problem does this chapter address, and why does scalable computing matter?
It tackles the common pain of working with datasets that outgrow a single machine: long runtimes, out-of-memory errors, fragile code, and clunky workflows. As data volumes grow, traditional single-machine tools struggle, so scalable computing frameworks like Dask help you analyze and model larger datasets efficiently.
How does the chapter define small, medium, and large datasets?
  • Small: roughly under 2–4 GB; fits in RAM and on local disk, so tools like Pandas, NumPy, and Scikit-learn work well without paging.
  • Medium: up to roughly 2 TB; fits on local disk but not in RAM, so paging (spilling to disk) and the lack of parallelism become bottlenecks.
  • Large: greater than roughly 2 TB; fits neither in RAM nor on a single machine’s disk, so distributed computing is required.
Why choose Dask for data science?
Dask brings native scalability to the Python data stack and offers four key advantages: it’s fully implemented in Python and scales NumPy, Pandas, and Scikit-learn; it works on both a single machine (medium data) and clusters (large data); it can parallelize general Python workloads via its low-level APIs; and it has low setup and maintenance overhead.
How is Dask structured under the hood?
At its core is a task scheduler that executes computations across cores and machines. Low-level APIs (Delayed for lazy evaluation, Futures for eager evaluation) build task graphs. High-level collections (Dask DataFrame, Array, Bag) sit atop these, translating familiar Pandas/NumPy-style operations into many parallel tasks managed by the scheduler.
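
A minimal sketch contrasting the two low-level APIs (a local Client is assumed, and the function is illustrative):

    from dask import delayed
    from dask.distributed import Client

    client = Client()  # starts a local scheduler and workers

    def inc(x):
        return x + 1

    # Delayed: lazy -- builds a task graph, nothing runs yet
    lazy = delayed(inc)(41)
    print(lazy.compute())     # 42, executed only when compute() is called

    # Futures: eager -- work starts as soon as it is submitted
    future = client.submit(inc, 41)
    print(future.result())    # 42, blocks until the running task finishes
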
What are DAGs and why do they matter in Dask?
A directed acyclic graph (DAG) represents work as nodes (tasks) connected by directed edges (dependencies) with no cycles. Dask composes your computation as a DAG so the scheduler can run independent tasks in parallel, respect prerequisites, optimize execution order, reduce memory pressure, and monitor progress.
When should I scale up versus scale out?
  • Scale up: choose a bigger/faster machine when problems sit near the upper end of “medium” and parallelism is limited; it’s simpler operationally, especially in the cloud.
  • Scale out: add workers when tasks parallelize well or data is large; it offers better long-term headroom but introduces coordination costs that Dask’s scheduler helps manage (see the sketch after this list).
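
One reason the trade-off is low-stakes with Dask: the same code runs either way, and only the Client’s target changes. A sketch, where the scheduler address is hypothetical:

    from dask.distributed import Client, LocalCluster

    # Scale up: one bigger machine -- use all of its cores locally
    client = Client(LocalCluster(n_workers=8, threads_per_worker=2))

    # Scale out: the same workload, pointed at a remote scheduler instead
    # client = Client("tcp://scheduler.example.com:8786")  # hypothetical address
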
How does Dask handle concurrency and shared resources?
Dask’s scheduler accounts for resource constraints (like limited memory, I/O, or GPUs). It prevents resource starvation via locking and smart task placement, keeps workers busy with available tasks, and prioritizes work to minimize idle time and memory footprints.
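
As an illustration of the locking point, dask.distributed exposes a Lock for cases where tasks must serialize access to a shared resource. This sketch assumes a running local Client; the lock and file names are illustrative:

    from dask.distributed import Client, Lock

    client = Client()

    def append_line(x):
        # Only one worker at a time may hold the "results-file" lock,
        # so concurrent tasks cannot interleave their writes
        with Lock("results-file"):
            with open("results.txt", "a") as f:
                f.write(f"{x}\n")
        return x

    futures = client.map(append_line, range(10))
    client.gather(futures)  # wait for all tasks to finish
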
What happens when things fail? How does Dask recover?
Dask tolerates worker failures by rescheduling remaining tasks elsewhere and reusing completed results if their data isn’t lost. If intermediate data is lost, it can recompute it by replaying the task lineage from dependencies. If the scheduler fails, the graph must be rebuilt and restarted, since only the scheduler knows the full plan.
How does Dask compare to Apache Spark for Python users?
Spark is powerful for large-scale collection operations but is JVM-centric; PySpark routes Python calls through the JVM (via Py4J), which can complicate debugging and tuning. Dask is Python-native, closely mirrors the Pandas/NumPy APIs, is more flexible for custom Python logic, and is lighter to set up, often making it a better fit for Python-first teams.
What companion dataset does the book use, and why?
The book uses NYC parking citations (2013–mid-2017), about 8 GB uncompressed. It’s a realistic, messy, medium-sized dataset, large enough to demonstrate Dask’s advantages without requiring multi-terabyte downloads. You can find it on Kaggle under “NYC Parking Tickets.”
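
As a preview, loading the citations with Dask looks just like Pandas; the glob path below is hypothetical and depends on where you unpack the Kaggle download:

    import dask.dataframe as dd

    # Lazily reads all of the yearly CSVs as one partitioned DataFrame
    ddf = dd.read_csv("nyc-parking-tickets/*.csv", dtype=str)

    print(ddf.npartitions)   # number of partitions Dask created
    print(ddf.head())        # only the first partition is actually read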
