This chapter introduces Dask through a pragmatic, data-science-focused tour of its core ideas: computations are represented as directed acyclic graphs (DAGs), executed lazily, and scaled from a laptop to clusters with the same code. Using an NYC Parking Tickets dataset as a running example, it shows how high-level collections (like Dask DataFrames) map familiar Pandas-like syntax into parallel task graphs, and how to inspect or diagnose those graphs. Along the way, it previews practical tools you’ll use repeatedly—graph visualization, progress diagnostics, and persistence—to make large workloads both understandable and efficient.
The walkthrough begins with Dask DataFrames: reading CSV files produces metadata rather than eager data samples, highlighting how Dask infers dtypes via sampling and partitions large files into manageable chunks processed in parallel. Simple operations such as counting missing values build progressively larger DAGs without executing until compute() is called; results can be monitored with a ProgressBar. The example converts missing counts to percentages, uses the output to drop sparse columns, and demonstrates how Pandas and Dask objects interoperate because each partition is a Pandas DataFrame. To avoid recomputing long transformation chains, persist() caches intermediate results across partitions, dramatically speeding up iterative, exploratory work.
Stepping down to the Delayed API, the chapter shows how small Python functions become DAG nodes, how list comprehensions create parallel branches, and how chaining transformations (add, multiply, sum) grows graphs that can be visualized with graphviz. Persisting intermediate results simplifies downstream graphs and reduces recomputation. Finally, it explains how the central scheduler executes these DAGs: tasks are assigned dynamically, data locality is favored to minimize network traffic, and distributed filesystems (such as S3 or HDFS) help keep data close to workers. Understanding these mechanics—lazy evaluation, partitioned execution, persistence, and locality—equips you to reason about performance and diagnose bottlenecks as your analyses scale.
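A small Delayed example makes the graph-building mechanics concrete. The function names and input values here are illustrative, not the chapter's listings; the arithmetic mirrors the add/multiply/sum chain described above.

```python
from dask import delayed

@delayed
def add_two(x):
    return x + 2

@delayed
def multiply_four(x):
    return x * 4

data = [1, 2, 3]

# Each comprehension element becomes an independent, parallelizable branch
stepped = [add_two(n) for n in data]
scaled = [multiply_four(s) for s in stepped]

# A single reduction node that depends on every branch
total = delayed(sum)(scaled)

result = total.compute()   # (1+2)*4 + (2+2)*4 + (3+2)*4 = 48
# total.visualize()        # renders the DAG as a diagram if graphviz is installed
```

Calling `.persist()` on an intermediate Delayed result would collapse its upstream tasks in later graphs, which is the simplification effect the chapter visualizes with graphviz.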
Figure captions from this chapter:
Our workflow, at a glance, in Data Science with Python and Dask
Inspecting the Dask DataFrame
Dask splits large data files into multiple partitions and works on one partition at a time
A visual representation of the DAG produced in Listing 2.6
The directed acyclic graph representing the computation in Listing 2.7
The DAG from Figure 2.5 with the values superimposed over the computation
The DAG including the multiply-four step
The DAG generated by Listing 2.9
The DAG generated by Listing 2.10
The DAG generated by Listing 2.11
Reading data from local disk is much faster than reading data stored remotely
Summary
In this chapter you learned
Computations on Dask DataFrames are structured by the task scheduler using DAGs.
Computations are constructed lazily, and the compute method is called to execute the computation and retrieve the results.
You can call the visualize method on any Dask object to see a visual representation of the underlying DAG.
Computations can be streamlined by using the persist method to store and reuse intermediate results of complex computations.
Data locality brings the computation to the data in order to minimize network and IO latency.
FAQ
How does Dask’s DataFrame API differ from Pandas in practice?
Dask mirrors much of Pandas’ syntax but executes lazily and on partitioned data. Instead of loading a whole dataset into memory, Dask splits it into many small Pandas DataFrames (partitions) and builds a task graph to process them in parallel or sequentially.

Why do I see only metadata when I print a Dask DataFrame?
Dask defers computation. Inspecting a Dask DataFrame shows schema-like metadata (column names, inferred dtypes, number of partitions, and task count) rather than actual rows, because the data isn’t loaded or computed until you explicitly request it.

How does Dask infer column dtypes, and what are best practices to avoid surprises?
Dask samples the data to infer dtypes, which can miss rare anomalies. To avoid downstream errors, explicitly set dtypes when reading data, or use a format like Parquet that stores a schema, improving reliability and performance.

What are partitions and tasks in Dask, and how are they created when reading files?
Partitions are chunked subsets of your data (each is a small Pandas DataFrame). Reading a large CSV creates a task graph in which each partition typically gets tasks to read bytes, split them into blocks, and construct the DataFrame. For example, ~33 partitions might produce ~99 tasks (about 3 per partition).

What does “lazy computation” mean, and when should I call compute()?
Lazy computation means operations build a DAG describing the work but don’t execute immediately. Call .compute() to materialize results (e.g., returning a Pandas object) after composing your transformations.

When should I use persist() instead of compute()?
Use .persist() to execute and cache an intermediate Dask collection in memory across partitions, keeping it as a Dask object for reuse. Use .compute() to get a final in-memory result (like a Pandas Series or DataFrame). Persisting reduces repeated work and can shrink subsequent DAGs.

How can I visualize the task graph (DAG) of a Dask computation?
Call .visualize() on Dask collections (DataFrame, Series, Array, Bag) or Delayed objects. With graphviz installed, Dask renders a diagram showing tasks (circles), intermediate results (squares), and dependencies, which helps you debug and understand execution.

What are Dask Delayed objects, and when are they useful?
Delayed wraps arbitrary Python functions and their arguments to build custom task graphs. Delayed objects are ideal for parallelizing pure-Python workflows, composing fine-grained pipelines, and learning how DAGs form before scaling up to collections.

Can I mix Pandas objects with Dask DataFrames?
Yes. Because each Dask partition is a Pandas DataFrame, you can pass Pandas objects (e.g., a Pandas Series of column names to drop) into Dask operations. In distributed runs, Dask broadcasts such small objects to the workers.

How does Dask’s scheduler assign work, and why does data locality matter?
A centralized scheduler dynamically assigns tasks to workers based on load, dependencies, and where data lives, aiming to minimize data movement. Storing data in a distributed filesystem (e.g., S3 or HDFS) lets workers read locally, avoiding network bottlenecks and improving performance.