Overview

4 Loading Data into DataFrames

This chapter moves from core ideas to hands-on ingestion, showing how to bring real-world datasets into Dask DataFrames for analysis of NYC Parking Tickets. It acknowledges the messiness of “data at rest” and reviews how Dask structures data as many partitioned pandas DataFrames executed via a task graph and scheduler. The chapter surveys practical entry points you’ll encounter on the job: delimited text files, relational databases, distributed filesystems such as HDFS and S3, and the Parquet columnar format.

The text begins with delimited files (especially CSVs), explaining row and column delimiters, headers, and text qualifiers, then demonstrates reading multiple years of ticket data. Because schemas changed across years, it extracts the common columns before combining. A central lesson is that automatic dtype inference can misclassify columns—especially with many missing values—so you should define explicit schemas using NumPy dtypes. The chapter outlines strategies to infer or confirm types (guess-and-check, manual sampling), cautions that NaN forces columns to float, and urges choosing the smallest safe types for efficiency. With a finalized schema, it reads many files at once using wildcards and limits columns with the usecols argument, noting memory considerations when peeking at rows.

Next, it covers loading from SQL systems via SQLAlchemy/ODBC with read_sql_table, highlighting assumptions around dtype inference, partitioning by a numeric or datetime index (or specifying partitions/boundaries), selecting subsets of columns, and choosing database schemas—plus the ability to pass pandas options through keyword arguments. It then explains reading from distributed filesystems, why data locality matters, and how simple path prefixes enable HDFS and S3 access once the appropriate connectors are installed. Finally, it introduces Parquet: a typed, columnar format that improves IO and compression and is typically read from a directory of files, with optional column and index selection. Together, these patterns equip you to reliably load large, heterogeneous datasets into Dask as a foundation for downstream cleaning and analysis.

The Data Science with Python and Dask workflow
The structure of a delimited text file
The metadata of the fy17 DataFrame
The first five rows of the fy17 DataFrame using the common column set
NumPy datatypes used by Dask
A Dask error showing mismatched datatypes
Running a distributed computation without a distributed filesystem
Running a distributed computation on a distributed filesystem
Shipping computations to the data
The structure of Parquet compared with delimited text files

Summary

In this chapter you learned

  • Inspecting the columns of a DataFrame can be done with the columns attribute.
  • Dask’s data type inference shouldn’t be relied on for large datasets. Instead, you should define your own schemas based on common NumPy datatypes.
  • Parquet format offers good performance because it’s column-oriented and highly compressible. Whenever possible, try to get your dataset into Parquet format.

FAQ

How do I load one or many CSV files into a single Dask DataFrame?
You can use dd.read_csv with a file path or a wildcard. Dask assumes a header row by default and splits files into partitions with a default 64 MB blocksize. Example: data = dd.read_csv('nyc-parking-tickets/*.csv'). Use dtype to set column types and usecols to select columns. The wildcard pattern is especially useful when a dataset is split across many files or stored on distributed filesystems.
How can I avoid dtype inference errors when reading CSVs?
Provide an explicit schema via the dtype argument (a dict mapping column names to NumPy dtypes). Inference can fail when columns have many missing values or mixed content; blanks are parsed as NaN (a float), which can mislead sampling. A practical workflow is to first load all columns as strings, profile a few columns (for example with .unique().head()), then refine types using the smallest appropriate NumPy dtypes. Remember that standard NumPy integer dtypes cannot store NaN, so columns with missing values must use a float dtype (or stay as strings) unless you adopt pandas’ nullable dtypes.
What’s the best way to find columns common to multiple CSV files before combining them?
Convert each DataFrame’s columns to sets and compute their intersection. For example, build a list of set(df.columns) and reduce with set.intersection to get common_columns. Then select only those columns from each DataFrame (df[common_columns]) before concatenating, or load once with usecols=common_columns in dd.read_csv. This minimizes missing values created by schema drift across years.
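A sketch of the intersection step; fy16 and fy17 here are small hypothetical DataFrames standing in for the yearly ticket files:

```python
from functools import reduce

import pandas as pd

# Hypothetical schemas that drifted between years.
fy16 = pd.DataFrame(columns=['Summons Number', 'Plate ID', 'Issue Date'])
fy17 = pd.DataFrame(columns=['Summons Number', 'Plate ID', 'Violation Code'])

# Intersect the column sets, then keep only the shared columns.
column_sets = [set(df.columns) for df in (fy16, fy17)]
common_columns = list(reduce(set.intersection, column_sets))

trimmed = [df[common_columns] for df in (fy16, fy17)]
print(sorted(common_columns))  # ['Plate ID', 'Summons Number']
```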
Why did Dask classify my text column as float?
Blank fields are parsed as NaN, and NaN is a floating-point value. If Dask’s sampling mostly sees NaN in a column, it may infer a float dtype even if the column really contains text. Fix this by explicitly setting dtype={'Your Column': 'object'} when calling dd.read_csv.
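The mechanism is easy to see at the pandas level, which Dask's sampling relies on; the file and column names here are hypothetical:

```python
import os
import tempfile

import pandas as pd

# House Number is left blank in every row of this made-up file.
path = os.path.join(tempfile.mkdtemp(), 'blanks.csv')
with open(path, 'w') as f:
    f.write('Plate ID,House Number\nABC123,\nXYZ789,\n')

inferred = pd.read_csv(path)
print(inferred['House Number'].dtype)  # float64: blanks became NaN

# Forcing an explicit dtype keeps the column as text.
fixed = pd.read_csv(path, dtype={'House Number': 'object'})
print(fixed['House Number'].dtype)     # object
```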
How do I read tables from a relational database into Dask?
Use dd.read_sql_table with a SQLAlchemy connection string and an index column: data = dd.read_sql_table('table', conn_str, index_col='id'). Dtypes are inferred from a small sample, so you may still need to adjust types later. If index_col is numeric or datetime, Dask infers partition boundaries (default target ~256 MB per partition). For non-numeric indexes, specify npartitions or explicit divisions. You can also restrict columns with the columns argument and choose a non-default database schema via schema='your_schema'.
How should I partition a SQL read when the index is a string?
Provide either npartitions for evenly sized splits or divisions for custom boundaries. A divisions list gives one partition per adjacent pair of boundaries, sorted by the index type (for strings, lexicographically); each partition covers the half-open range [boundary, next boundary), with the final partition also including its right endpoint. Example: divisions=['Blue','Red','Yellow'] yields two partitions covering ['Blue','Red') and ['Red','Yellow']. Choose boundaries that produce balanced partitions.
What are the pros and cons of local files vs HDFS/S3 for distributed computing?
Local files can bottleneck a cluster because one node must ship partitions over the network to all workers. Distributed filesystems (HDFS, S3) spread data across machines, improving throughput and resilience; with HDFS, computation can often be scheduled where the data resides (data locality). S3 does not execute code on storage nodes, so data always moves to workers. Use hdfs:// or s3:// paths and install the relevant clients (hdfs3 for HDFS; s3fs and boto config for S3) on every worker.
Why and how should I use Parquet with Dask?
Parquet is a columnar, typed, and compressed format that speeds up IO, aggregations, and shuffles by reading only the needed columns and reducing data movement. Install pyarrow (recommended) or fastparquet, plus optional compression libraries (e.g., snappy). Read with dd.read_parquet('path/to/dataset_dir'); Parquet datasets are directories of files. You can filter columns and select an index at read time (columns=[...], index='col').
What’s the risk of calling head() or fetching many unique values?
head() materializes rows into the client’s RAM, so large requests can cause out-of-memory errors on the client. When profiling types or categories, sample a small number of unique values with .unique().head(n) rather than computing all uniques. Use .compute() cautiously on large or high-cardinality columns.
What cluster-wide dependencies and credentials do I need for databases, HDFS, and S3?
Every Dask worker must have the same connectivity and libraries: for SQL, install SQLAlchemy, pyodbc, and the appropriate ODBC driver on each node, and ensure network access to the database. For HDFS, install hdfs3. For S3, install s3fs and configure AWS credentials (via environment variables or config files recognized by boto) on each worker. Avoid hard-coding secrets in your code; rely on environment/config-based discovery.
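A hypothetical per-worker setup sketch; the package names come from the text, and the credential values are deliberately left as placeholders:

```shell
# Run on every worker node so the cluster is uniform.
pip install sqlalchemy pyodbc   # SQL access (plus your ODBC driver)
pip install hdfs3               # HDFS access
pip install s3fs                # S3 access

# S3 credentials via the environment, never hard-coded in your code:
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
```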
