Overview

6 AI & Data Quality

Data quality in traditional data engineering has been a fragmented, rules-heavy endeavor—spanning SQL, pandas, regexes, and vendor tools—where every new edge case demands bespoke logic and maintenance. This chapter introduces an AI-centered alternative: expressing data quality expectations in natural language and letting a single model reason across missing values, invalid formats, anomalies, duplicates, and reference mapping within existing pipelines. While AI can be slower and incur costs, it unlocks flexibility, enables domain experts to contribute without coding, and expands coverage beyond what hard-coded checks anticipate.

The chapter contrasts Python/pandas approaches for identifying problems (nulls, negatives, malformed emails) with an OpenAI Chat Completions workflow that detects inconsistencies via prompts. It then evolves toward reliable automation using structured outputs: response_format and Pydantic data classes define contracts the model must satisfy, turning free-form text into deterministic, parseable objects. By separating system instructions from user data in multi-message prompts, the chapter shows how to remove duplicates, drop null-heavy columns, and standardize datasets with predictable outputs that are easier to debug and integrate into production data pipelines.

Beyond detection, the chapter tackles structural and formatting fixes common in real pipelines—standardizing dates, validating SKUs, truncating descriptions, concatenating names, and mapping products to categories—first with explicit pandas logic, then with a single AI call that returns aligned, row-ordered lists for every transformation. A hands-on lab reinforces the pattern across a richer dataset, emphasizing clear prompts, schema-enforced responses, and a practical row-by-row calling strategy to avoid uneven results. The overarching takeaway is a reusable, scalable framework: keep Python for speed and control where rules are clear, but let AI provide broad, adaptable coverage and structured outputs where requirements are ambiguous or evolving.

Traditional data quality pipelines often require multiple tools—like SQL, pandas, and scikit-learn—to handle tasks such as null detection, format enforcement, and rule-based validation. As shown in figure 6.1, each tool typically addresses a narrow part of the workflow. With AI, however, many of these tasks can be consolidated into a single interface. A well-prompted model can reason across diverse data quality issues, from detecting duplicates to mapping references, enabling a unified, intelligent path to clean data.
The data class acts as a contract between the prompt and the model’s response. The prompt defines what kind of output we want, the data class formalizes that structure, and response_format ensures the model returns data in a predictable, usable format. This makes it easy to extract structured results directly from the model without extra parsing.
DataFrame before transformations (top) and after transformations (bottom). This is the output of listing 6.5 and shows the kind of cleaning needed to make our data legible and useful to downstream consumers.
Cleaned dataset after applying AI-generated standardization rules. The AI model returned structured outputs for date normalization, SKU validation, description truncation, name concatenation, and category mapping. Each transformation was handled in a single API call using a custom response schema, producing a clean, ready-to-use dataset with minimal parsing logic in the code.

Lab Answers

Refer to the Chapter 6 Lab Jupyter Notebook for full answers.

  1. You should follow the same pattern we used in Listing 6.5. To clean date formats, a custom function like this is used:
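One minimal sketch of such a date-cleaning function, assuming the column is named order_date (adjust to your dataset):

```python
import pandas as pd

def clean_dates(df: pd.DataFrame, column: str = "order_date") -> pd.DataFrame:
    # Parse whatever the column holds into datetimes; unparseable values
    # become NaT, then everything is reformatted as ISO YYYY-MM-DD strings.
    df[column] = pd.to_datetime(df[column], errors="coerce").dt.strftime("%Y-%m-%d")
    return df
```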

SKU format is enforced with regex:
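A sketch of the regex check, assuming SKUs look like "SKU-" followed by four digits (adjust the pattern to your actual format):

```python
import pandas as pd

SKU_PATTERN = r"^SKU-\d{4}$"  # assumed shape, e.g. SKU-1042

def enforce_sku_format(df: pd.DataFrame, column: str = "sku") -> pd.DataFrame:
    # Keep only rows whose SKU matches the expected pattern.
    mask = df[column].astype(str).str.match(SKU_PATTERN)
    return df[mask]
```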

To validate emails, use this regex and filter the DataFrame:
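For example, with a deliberately simple pattern (something@domain.tld; production email validation has many more edge cases):

```python
import pandas as pd

EMAIL_PATTERN = r"^[\w.+-]+@[\w-]+\.[\w.-]+$"  # permissive sketch, not RFC-complete

def filter_valid_emails(df: pd.DataFrame, column: str = "email") -> pd.DataFrame:
    # Keep rows whose email matches the pattern; nulls and junk are dropped.
    mask = df[column].astype(str).str.match(EMAIL_PATTERN)
    return df[mask]
```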

Create the full_name column using:
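A minimal version, assuming first_name and last_name columns:

```python
import pandas as pd

df = pd.DataFrame({"first_name": ["Ada ", "Grace"], "last_name": ["Lovelace", " Hopper"]})

# Strip stray whitespace from each part, then join with a single space.
df["full_name"] = df["first_name"].str.strip() + " " + df["last_name"].str.strip()
# → ["Ada Lovelace", "Grace Hopper"]
```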

Description-based category mapping is done with a dictionary:
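A sketch of the dictionary lookup; the keywords and categories here are hypothetical, so substitute your own mapping:

```python
import pandas as pd

# Hypothetical keyword-to-category dictionary.
CATEGORY_MAP = {"laptop": "Electronics", "desk": "Furniture", "shirt": "Apparel"}

def map_category(description) -> str:
    # Return the category of the first keyword found in the description,
    # falling back to "Other" when nothing matches.
    desc = str(description).lower()
    for keyword, category in CATEGORY_MAP.items():
        if keyword in desc:
            return category
    return "Other"

df = pd.DataFrame({"description": ["17-inch gaming laptop", "standing desk", "mystery item"]})
df["category"] = df["description"].apply(map_category)
```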

These techniques match the patterns introduced earlier in the chapter using pandas, map(), apply(), and filtering logic.

  2. This step uses the same techniques introduced earlier in the chapter—structured prompting, a BaseModel schema, and the response_format parameter. But once we move from toy data to real-world messiness, we hit a problem: the model doesn’t always return one row for every input.

In earlier listings (like 6.4 and 6.6), we passed the full dataset in a single API call:

This works fine if the model returns a perfectly aligned list for each field. But if even one row is skipped, the returned lists no longer match the DataFrame's length, and you'll get the dreaded ValueError when assigning them back.

Instead of processing the whole DataFrame at once, we clean each record individually and rebuild a new dataset:

This avoids the entire row-alignment issue by guaranteeing each response matches exactly one input. Plus, using tqdm adds helpful feedback during processing.

The logic and structure of the API call haven't changed, but applying it one row at a time makes the system far more reliable, especially when cleaning dirty data in production environments.

FAQ

Why is traditional data quality work so time-consuming and fragmented?
It typically requires stitching together SQL, pandas, regex, and sometimes third-party tools. Each new rule (nulls, negative numbers, formats, duplicates) adds bespoke code, edge cases, and maintenance overhead. Many teams either under-invest or outsource because of this complexity.

How does AI change the way we validate data quality?
Instead of hand-coding every rule, you describe expectations in a single prompt. One model can flag missing values, invalid formats, contextual anomalies, duplicates, and more, while fitting into existing pipelines. Subject matter experts can author rules without writing pandas or regex.

When should I prefer pandas-only checks vs. AI-assisted checks?
Use pandas when rules are well-defined, deterministic, need to run locally, and cost must be zero. Use AI when rules are ambiguous or evolving, you want broader coverage (including issues you didn't hard-code), or you need faster iteration with domain experts.
How do I detect common issues with pandas (nulls, negatives, duplicates, null-heavy columns)?
Use df.isnull().sum() for null counts, boolean masks like df[df["purchase_amount"] < 0] for negatives, df.drop_duplicates() to remove duplicate rows, and a threshold (for example, more than 50% nulls) to drop null-heavy columns.
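Those four checks, as a runnable sketch on a toy DataFrame (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@b.com", None, "a@b.com"],
    "purchase_amount": [10.0, -5.0, 10.0],
    "notes": ["x", None, "x"],
    "legacy_id": [None, None, None],  # entirely null: a drop candidate
})

null_counts = df.isnull().sum()            # nulls per column
negatives = df[df["purchase_amount"] < 0]  # invalid amounts
deduped = df.drop_duplicates()             # rows 0 and 2 are identical, so one is dropped
null_heavy = [c for c in df.columns if df[c].isnull().mean() > 0.5]
trimmed = df.drop(columns=null_heavy)      # removes legacy_id
```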
How does the Chat Completions API detect inconsistencies across columns?
Loop over columns, send a prompt describing expectations (e.g., "purchase_amount must not be negative"), and pass the column's values. The model returns a human-readable summary of issues (invalid emails, missing ages, negative amounts). It's flexible but slower and incurs cost.

What are Pydantic data classes and why use them here?
They define a schema (e.g., CleanedData with duplicates and drop_columns) that acts as a contract for the model's output. The API response is parsed directly into a typed Python object, eliminating fragile string parsing and improving reliability in pipelines.

What is response_format and how does it improve reliability?
response_format tells the model how to structure outputs. Options include free-form text (the default), JSON mode (json_object), and schema-enforced responses using a Pydantic BaseModel. Schema enforcement reduces non-determinism and parsing errors in production.

Why send multiple messages (system + user) instead of one big prompt?
The system message sets the role and instructions ("You are a data cleaning assistant…"). The user message supplies the data. This separation makes instructions clearer, improves accuracy, and scales better as data gets more complex.

How can AI fix structural/format issues in one pass?
Define a response schema (e.g., StandardizationInstructions) with lists like normalized_dates, cleaned_skus, truncated_descriptions, full_names, and mapped_categories. Send the records, get structured outputs in row order, and apply them back to the DataFrame, all in a single call.

What best practices and pitfalls should I watch for in the lab?
Don't send the whole dataset at once: iterate row by row to avoid uneven list outputs, and monitor with a progress bar (tqdm). Always use a BaseModel to enforce structure, convert data via df.to_dict(orient="records"), validate critical fields, drop null-heavy columns, and expect some latency and cost.
