This chapter introduces Dask-ML, the parallel machine learning layer of Dask that mirrors the Scikit-learn API, letting practitioners scale familiar workflows from a laptop to a cluster. Building on earlier data preparation with Dask Bags and DataFrames, it frames a practical goal: train a sentiment classifier on the Amazon Fine Foods Reviews dataset to predict positive vs. negative sentiment without relying on the numeric review score. The narrative echoes the 80/20 reality of data science (most effort goes into preparation), then shows how Dask-ML fits alongside Dask Arrays and DataFrames to make model training and evaluation efficient and scalable.
The end-to-end pipeline starts by labeling reviews via star ratings, then tokenizing and removing stopwords to build features. To keep the problem tractable, the text is vectorized using a binary presence/absence scheme over the top 100 most frequent tokens. These vectors live in a Dask Bag and are transformed into Dask Arrays by concatenating partitioned NumPy arrays, then rechunked and persisted to Zarr for efficient I/O. With the data ready, a train/test split is created and a Dask-ML logistic regression model is fit and scored, achieving roughly 80% accuracy on held-out data. As a comparison, a Bernoulli Naive Bayes model from Scikit-learn is trained in parallel using Dask-ML’s Incremental wrapper (which leverages partial_fit); it performs slightly worse than logistic regression on this task.
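The vectorization step described above can be sketched in plain Python. This is a toy, self-contained version: the chapter does this over Dask Bags with NLTK's tokenizer and stopword list, and uses the top 100 tokens; the reviews, stopword set, and N below are illustrative stand-ins.

```python
import re
from collections import Counter

# Toy reviews standing in for the Amazon Fine Foods data
reviews = [
    "great coffee great price",
    "stale coffee terrible price",
    "great tea",
]

stopwords = {"the", "a", "and"}  # the chapter also adds domain terms like "amazon"

def tokenize(text):
    # Lowercase, split into word tokens, drop stopwords
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in stopwords]

# Build a top-N vocabulary from token frequencies (the chapter uses N=100)
N = 4
counts = Counter(t for review in reviews for t in tokenize(review))
vocab = [word for word, _ in counts.most_common(N)]

# Binary vectorization: 1 if the vocab word appears in the review, else 0
def vectorize(text):
    tokens = set(tokenize(text))
    return [1 if word in tokens else 0 for word in vocab]

vectors = [vectorize(r) for r in reviews]
```

Each review becomes a fixed-width row of 0s and 1s, which is exactly the shape a downstream classifier expects.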
Model assessment and improvement follow a champion–challenger approach, using built-in scoring for objective comparisons and hyperparameter tuning via GridSearchCV. The grid search explores penalty types (L1/L2) and regularization strength (C), parallelizing combinations across workers; in this case, the best settings roughly match the defaults, but the method generalizes to broader searches and other algorithms. Finally, the chapter shows how to persist artifacts: store large arrays with Zarr and serialize trained models with dill/pickle for deployment on lightweight systems. The overall takeaway is a reproducible, scalable pattern for feature construction, model training, validation, tuning, and persistence using Dask and Dask-ML with minimal changes to familiar Scikit-learn code.
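The Zarr persistence step mentioned above can be sketched with dask.array. The array contents and output path here are illustrative; writing with to_zarr additionally requires the zarr package, so that call is left commented.

```python
import dask.array as da

# Hypothetical feature array: 500,000 rows by 100 binary features,
# initially in small 1,000-row chunks
features = da.zeros((500_000, 100), chunks=(1_000, 100), dtype="u1")

# Rechunk to 5,000 rows per chunk before writing, as in the chapter,
# so Zarr produces about 100 chunk files instead of one per tiny chunk
features = features.rechunk((5_000, 100))

# features.to_zarr("sentiment_feature_array.zarr")  # path is illustrative
```

Rechunking first is the key move: Zarr writes one file per chunk, so chunk size directly controls file count and I/O overhead.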
Having thoroughly covered data preparation, it’s time to move on to model building.
A review of the API components of Dask
An example of binary vectorization
Vectorizing the raw data into a bag of arrays, then concatenating to a single array
The shape of the feature array
The GridSearchCV results
Summary
In this chapter you learned
- Binary vectorization relates the presence or absence of a word in a chunk of text to a target variable (e.g., sentiment).
- Machine learning uses statistical and mathematical methods to find patterns that relate features (inputs) to targets (outputs).
- Data should be split into training and testing sets to avoid overfitting.
- When deciding which model to use, select some error metrics and use the champion-challenger approach to objectively find the best model based on those metrics.
- GridSearchCV can be used to automate the selection and tuning of your machine learning models.
- Trained machine learning models can be saved with the dill library and reloaded later to generate predictions.
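The GridSearchCV tuning described in this chapter can be sketched as follows. This uses plain Scikit-learn on toy data so it runs locally; dask_ml.model_selection.GridSearchCV takes the same parameter grid and exposes the same cv_results_ and best_params_. The liblinear solver is specified because Scikit-learn's default solver does not support the l1 penalty.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy binary features and labels standing in for the review vectors
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 10))
y = (X[:, 0] | X[:, 1]).astype(int)

# 2 penalties x 3 C values = 6 candidate models
param_grid = {"penalty": ["l1", "l2"], "C": [0.5, 1, 2]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=3)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)  # compare scores and fit times
best = search.best_params_
```

With Dask-ML, the six fits are distributed across workers instead of run serially, but the interface is unchanged.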
FAQ
What is Dask-ML and how does it relate to Scikit-learn?
Dask-ML brings Scikit-learn’s APIs to distributed and parallel computing. If you know Scikit-learn, Dask-ML will feel familiar: you get estimators like LogisticRegression, model selection tools (train_test_split, GridSearchCV), and wrappers that let you scale many Scikit-learn models across Dask workers. In short, it parallelizes common ML workflows while keeping the Scikit-learn interface.

How do I prepare text data for modeling with Dask?
Typical steps shown in the chapter:
- Tokenize review text (e.g., with NLTK’s RegexpTokenizer) and lowercase it.
- Remove stopwords (NLTK list plus domain-specific terms such as “amazon”, “http”).
- Build a vocabulary (corpus) and represent each review with binary vectorization: for each vocab word, set 1 if present in the review, else 0.
- Create targets (e.g., sentiment: positive=1, negative=0). This yields feature vectors and labels suitable for Dask-ML.

Why limit the vocabulary to the top-N words for binary vectorization?
Using the entire corpus can create extremely wide arrays (hundreds of thousands of columns), which:
- Increases memory and storage requirements.
- Slows training and I/O.
Choosing the top-N most frequent tokens (e.g., 100 or 1,000) keeps arrays compact, speeds up experimentation, and is easy to adjust later. You can always scale N up when resources allow or if accuracy gains justify it.

How do I convert a Dask Bag of vectors into a single Dask Array efficiently?
Instead of going Bag → DataFrame → Array, the chapter reduces directly to an Array:
- Map each feature vector (NumPy array) to a 1-by-N Dask Array.
- Use a custom reduction that concatenates arrays within each partition, then concatenates the partition-level arrays into a final (rows, features) Dask Array.
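A sketch of that custom reduction, using toy 3-feature vectors in place of the real review vectors (the chapter's arrays are top-N wide):

```python
import numpy as np
import dask.array as da
import dask.bag as db

# Toy stand-ins for per-review binary feature vectors
vectors = [np.array([1, 0, 1]), np.array([0, 1, 1]),
           np.array([1, 1, 0]), np.array([0, 0, 1])]
bag = db.from_sequence(vectors, npartitions=2)

def stack_partition(partition):
    # Per-partition step: stack that partition's vectors into one (rows, N) array
    return np.stack(list(partition))

def concat_partitions(partition_arrays):
    # Aggregate step: wrap each partition array as a Dask Array and concatenate
    return da.concatenate(
        [da.from_array(a, chunks=a.shape) for a in partition_arrays], axis=0
    )

feature_array = bag.reduction(stack_partition, concat_partitions).compute()
```

The result is a single (rows, features) Dask Array whose chunks line up with the original Bag partitions.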
This avoids DataFrame overhead with many columns and leverages Dask’s lazy evaluation for efficiency.

Why write arrays to Zarr, and how should I pick chunk sizes?
Zarr is a chunked, on-disk array format that Dask reads and writes efficiently. Rechunk before writing to avoid producing one file per tiny chunk. Guidelines:
- Aim for chunk sizes that yield 10 MB–1 GB per chunk to reduce file overhead.
- In the chapter, the feature array was rechunked to 5,000 rows per chunk, producing about 100 files rather than one file per row.

How do I train a logistic regression model with Dask-ML and ensure reproducibility?
Steps:
- Split data with train_test_split and set random_state for a repeatable split.
- Fit LogisticRegression on the training set.
- Score on the test set. Because many Dask-ML methods are lazy, wrap operations in ProgressBar and use compute() where needed.
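The steps above can be sketched like this. Dask-ML mirrors this interface exactly (dask_ml.model_selection.train_test_split, dask_ml.linear_model.LogisticRegression); the sketch uses Scikit-learn on toy data so it runs without a cluster, and the features and labels are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(200, 10))   # toy binary feature vectors
y = (X[:, 0] | X[:, 1]).astype(int)      # toy sentiment labels

# random_state makes the split, and thus model comparisons, repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)   # fraction correct, between 0 and 1
```

With the Dask-ML versions you would additionally wrap fitting and scoring in a ProgressBar context and call compute() where results are lazy.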
Setting random_state ensures consistent comparisons across model runs.

How is model accuracy measured here, and do I need compute() when scoring?
For classifiers, score returns accuracy (the fraction of correct predictions, between 0 and 1). In Dask-ML estimators (e.g., LogisticRegression), many operations are lazy, so scoring typically requires compute(). For wrapped Scikit-learn estimators via Incremental, score is eager and returns a Python float directly.

How can I use Scikit-learn estimators (like Naive Bayes) with Dask?
Use dask_ml.wrappers.Incremental with estimators that implement partial_fit:
- Create the Scikit-learn estimator (e.g., BernoulliNB).
- Wrap it with Incremental and call fit, providing the list of classes (e.g., [0, 1]) if required.
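Under the hood, Incremental drives the estimator's partial_fit over successive chunks. The equivalent loop in plain Scikit-learn, with two toy batches standing in for Dask partitions, looks like:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
batches = [rng.integers(0, 2, size=(50, 10)) for _ in range(2)]  # toy feature chunks
labels = [b[:, 0] for b in batches]                              # toy targets

nb = BernoulliNB()
for X_batch, y_batch in zip(batches, labels):
    # The full set of classes must be declared up front, just as
    # Incremental's fit requires classes=[0, 1]
    nb.partial_fit(X_batch, y_batch, classes=[0, 1])

score = nb.score(batches[0], labels[0])
```

Incremental simply runs this loop for you across Dask workers and partitions, so any partial_fit-capable estimator can be scaled without rewriting it.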
This streams data in batches across workers, enabling parallel training without re-implementing the algorithm.

How do I tune hyperparameters at scale with GridSearchCV in Dask-ML?
Wrap an estimator with GridSearchCV and pass a parameter grid dict (e.g., penalty ∈ {l1, l2}, C ∈ {0.5, 1, 2}). Dask distributes the model fits for each parameter combination across workers. Tips:
- Keep the grid manageable—GridSearchCV tries every combination.
- Inspect cv_results_ (e.g., in a Pandas DataFrame) to compare scores and timings.
- Use the champion–challenger approach: establish a baseline, then try challengers with tuned hyperparameters and/or different algorithms.

How do I persist and load trained models, and what should I watch out for?
Serialize with dill (or pickle) and save to disk; later, load the model and use predict without retraining. Considerations:
- The loading environment must have the same libraries and versions used to create the model (e.g., Dask, Scikit-learn).
- Open the model files in binary mode ('wb' for writing, 'rb' for reading).
- This workflow supports training on powerful clusters, saving the model (e.g., to S3), and serving predictions on lighter-weight machines.
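The save/load round trip can be sketched with the standard-library pickle module; dill is a drop-in replacement with the same dump/load calls. The model, data, and filename here are illustrative.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a tiny model to stand in for the sentiment classifier
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
y = np.array([1, 0, 1, 0])
model = LogisticRegression().fit(X, y)

# Write and read in binary mode ('wb'/'rb')
with open("sentiment_model.pkl", "wb") as f:
    pickle.dump(model, f)

with open("sentiment_model.pkl", "rb") as f:
    restored = pickle.load(f)

predictions = restored.predict(X)  # no retraining needed
```

The loading machine still needs compatible versions of the libraries the model depends on, which is the main caveat when serving on lighter-weight systems.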