Overview

6 AI & Data Quality

Data quality in traditional data engineering has been a fragmented, rules-heavy endeavor—spanning SQL, pandas, regexes, and vendor tools—where every new edge case demands bespoke logic and maintenance. This chapter introduces an AI-centered alternative: expressing data quality expectations in natural language and letting a single model reason across missing values, invalid formats, anomalies, duplicates, and reference mapping within existing pipelines. While AI can be slower and incur costs, it unlocks flexibility, enables domain experts to contribute without coding, and expands coverage beyond what hard-coded checks anticipate.

The chapter contrasts Python/pandas approaches for identifying problems (nulls, negatives, malformed emails) with an OpenAI Chat Completions workflow that detects inconsistencies via prompts. It then evolves toward reliable automation using structured outputs: response_format and Pydantic data classes define contracts the model must satisfy, turning free-form text into deterministic, parseable objects. By separating system instructions from user data in multi-message prompts, the chapter shows how to remove duplicates, drop null-heavy columns, and standardize datasets with predictable outputs that are easier to debug and integrate into production data pipelines.

Beyond detection, the chapter tackles structural and formatting fixes common in real pipelines—standardizing dates, validating SKUs, truncating descriptions, concatenating names, and mapping products to categories—first with explicit pandas logic, then with a single AI call that returns aligned, row-ordered lists for every transformation. A hands-on lab reinforces the pattern across a richer dataset, emphasizing clear prompts, schema-enforced responses, and a practical row-by-row calling strategy to avoid uneven results. The overarching takeaway is a reusable, scalable framework: keep Python for speed and control where rules are clear, but let AI provide broad, adaptable coverage and structured outputs where requirements are ambiguous or evolving.

Traditional data quality pipelines often require multiple tools—like SQL, pandas, and scikit-learn—to handle tasks such as null detection, format enforcement, and rule-based validation. As shown in figure 6.1, each tool typically addresses a narrow part of the workflow. With AI, however, many of these tasks can be consolidated into a single interface. A well-prompted model can reason across diverse data quality issues, from detecting duplicates to mapping references, enabling a unified, intelligent path to clean data.
The data class acts as a contract between the prompt and the model’s response. The prompt defines what kind of output we want, the data class formalizes that structure, and response_format ensures the model returns data in a predictable, usable format. This makes it easy to extract structured results directly from the model without extra parsing.
DataFrame before transformations (top) and after transformations (bottom). This is the output of listing 6.5 and shows the kind of cleaning needed to make our data legible and useful to downstream consumers.
Cleaned dataset after applying AI-generated standardization rules. The AI model returned structured outputs for date normalization, SKU validation, description truncation, name concatenation, and category mapping. Each transformation was handled in a single API call using a custom response schema, producing a clean, ready-to-use dataset with minimal parsing logic in the code.

Lab Answers

Refer to the Chapter 6 Lab Jupyter Notebook for full answers.

  1. You should follow the same pattern we used in Listing 6.5. To clean date formats, a custom function like this is used:
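One minimal sketch of such a date-cleaning function, assuming the column is named order_date (adjust to your dataset):

```python
import pandas as pd

def clean_dates(df: pd.DataFrame, column: str = "order_date") -> pd.DataFrame:
    # Parse whatever the column holds into datetimes; unparseable values
    # become NaT, then everything is reformatted as ISO YYYY-MM-DD strings.
    df[column] = pd.to_datetime(df[column], errors="coerce").dt.strftime("%Y-%m-%d")
    return df
```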

SKU format is enforced with regex:
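A sketch of the regex check, assuming SKUs look like "SKU-" followed by four digits (adjust the pattern to your actual format):

```python
import pandas as pd

SKU_PATTERN = r"^SKU-\d{4}$"  # assumed shape, e.g. SKU-1042

def enforce_sku_format(df: pd.DataFrame, column: str = "sku") -> pd.DataFrame:
    # Keep only rows whose SKU matches the expected pattern.
    mask = df[column].astype(str).str.match(SKU_PATTERN)
    return df[mask]
```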

To validate emails, use this regex and filter the DataFrame:
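For example, with a deliberately simple pattern (something@domain.tld; production email validation has many more edge cases):

```python
import pandas as pd

EMAIL_PATTERN = r"^[\w.+-]+@[\w-]+\.[\w.-]+$"  # permissive sketch, not RFC-complete

def filter_valid_emails(df: pd.DataFrame, column: str = "email") -> pd.DataFrame:
    # Keep rows whose email matches the pattern; nulls and junk are dropped.
    mask = df[column].astype(str).str.match(EMAIL_PATTERN)
    return df[mask]
```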

Create the full_name column using:
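A minimal version, assuming first_name and last_name columns:

```python
import pandas as pd

df = pd.DataFrame({"first_name": ["Ada ", "Grace"], "last_name": ["Lovelace", " Hopper"]})

# Strip stray whitespace from each part, then join with a single space.
df["full_name"] = df["first_name"].str.strip() + " " + df["last_name"].str.strip()
# → ["Ada Lovelace", "Grace Hopper"]
```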

Description-based category mapping is done with a dictionary:
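A sketch of the dictionary lookup; the keywords and categories here are hypothetical, so substitute your own mapping:

```python
import pandas as pd

# Hypothetical keyword-to-category dictionary.
CATEGORY_MAP = {"laptop": "Electronics", "desk": "Furniture", "shirt": "Apparel"}

def map_category(description) -> str:
    # Return the category of the first keyword found in the description,
    # falling back to "Other" when nothing matches.
    desc = str(description).lower()
    for keyword, category in CATEGORY_MAP.items():
        if keyword in desc:
            return category
    return "Other"

df = pd.DataFrame({"description": ["17-inch gaming laptop", "standing desk", "mystery item"]})
df["category"] = df["description"].apply(map_category)
```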

These techniques match the patterns introduced earlier in the chapter using pandas, map(), apply(), and filtering logic.

  2. This step uses the same techniques introduced earlier in the chapter—structured prompting, a BaseModel schema, and the response_format parameter. But once we move from toy data to real-world messiness, we hit a problem: the model doesn’t always return one row for every input.

In earlier listings (like 6.4 and 6.6), we passed the full dataset in a single API call:

This works fine if the model returns a perfectly aligned list for each field. But if even one row is skipped, the returned lists no longer match the DataFrame's length, and you'll get the dreaded ValueError when assigning them back.

Instead of processing the whole DataFrame at once, we clean each record individually and rebuild a new dataset:

This avoids the entire row-alignment issue by guaranteeing each response matches exactly one input. Plus, using tqdm adds helpful feedback during processing.

The logic and structure of the API call haven't changed, but applying it one row at a time makes the system far more reliable, especially when cleaning dirty data in production environments.

FAQ

Why is traditional data quality work so time-consuming and fragmented?
It typically requires stitching together SQL, pandas, regex, and sometimes third-party tools. Each new rule (nulls, negative numbers, formats, duplicates) adds bespoke code, edge cases, and maintenance overhead. Many teams either under-invest or outsource because of this complexity.

How does AI change the way we validate data quality?
Instead of hand-coding every rule, you describe expectations in a single prompt. One model can flag missing values, invalid formats, contextual anomalies, duplicates, and more, while fitting into existing pipelines. Subject matter experts can author rules without writing pandas or regex.

When should I prefer pandas-only checks vs. AI-assisted checks?
Use pandas when rules are well-defined, deterministic, need to run locally, and cost must be zero. Use AI when rules are ambiguous or evolving, you want broader coverage (including issues you didn't hard-code), or you need faster iteration with domain experts.
How do I detect common issues with pandas (nulls, negatives, duplicates, null-heavy columns)?
Use df.isnull().sum() for null counts, boolean masks like df[df["purchase_amount"] < 0] for negatives, df.drop_duplicates() to remove duplicate rows, and a threshold (for example, more than 50% nulls) to drop null-heavy columns.
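Those four checks, as a runnable sketch on a toy DataFrame (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@b.com", None, "a@b.com"],
    "purchase_amount": [10.0, -5.0, 10.0],
    "notes": ["x", None, "x"],
    "legacy_id": [None, None, None],  # entirely null: a drop candidate
})

null_counts = df.isnull().sum()            # nulls per column
negatives = df[df["purchase_amount"] < 0]  # invalid amounts
deduped = df.drop_duplicates()             # rows 0 and 2 are identical, so one is dropped
null_heavy = [c for c in df.columns if df[c].isnull().mean() > 0.5]
trimmed = df.drop(columns=null_heavy)      # removes legacy_id
```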
How does the Chat Completions API detect inconsistencies across columns?
Loop over columns, send a prompt describing expectations (e.g., "purchase_amount must not be negative"), and pass the column's values. The model returns a human-readable summary of issues (invalid emails, missing ages, negative amounts). It's flexible but slower and incurs cost.

What are Pydantic data classes and why use them here?
They define a schema (e.g., CleanedData with duplicates and drop_columns) that acts as a contract for the model's output. The API response is parsed directly into a typed Python object, eliminating fragile string parsing and improving reliability in pipelines.

What is response_format and how does it improve reliability?
response_format tells the model how to structure outputs. Options include free-form text (the default), JSON mode (json_object), and schema-enforced responses using a Pydantic BaseModel. Schema enforcement reduces non-determinism and parsing errors in production.

Why send multiple messages (system + user) instead of one big prompt?
The system message sets the role and instructions ("You are a data cleaning assistant…"). The user message supplies the data. This separation makes instructions clearer, improves accuracy, and scales better as data gets more complex.

How can AI fix structural/format issues in one pass?
Define a response schema (e.g., StandardizationInstructions) with lists like normalized_dates, cleaned_skus, truncated_descriptions, full_names, and mapped_categories. Send the records, get structured outputs in row order, and apply them back to the DataFrame, all in a single call.

What best practices and pitfalls should I watch for in the lab?
Don't send the whole dataset at once: iterate row by row to avoid uneven list outputs, and monitor with a progress bar (tqdm). Always use a BaseModel to enforce structure, convert data via df.to_dict(orient="records"), validate critical fields, drop null-heavy columns, and expect some latency and cost.
