Overview

1 Meet Apache Airflow

Modern organizations are increasingly data-driven, moving vast amounts of information through pipelines that must be reliable, observable, and scalable. This chapter introduces Apache Airflow as an orchestration tool—more like a conductor than a musician—that coordinates work across many systems without doing the heavy data processing itself. It frames pipelines as directed acyclic graphs (DAGs) of tasks, explains why this abstraction is powerful for managing dependencies and execution order, and sets expectations for a hands-on, example-led journey that helps readers decide whether Airflow fits their use cases and how to get started.

Representing pipelines as DAGs turns implicit task order into explicit dependencies, enabling safe, repeatable execution without circular deadlocks. Compared with monolithic scripts, DAGs allow parallelism where tasks are independent, faster recovery by rerunning only failed pieces, and clearer reasoning about complex workflows. The chapter situates Airflow within the broader ecosystem of workflow managers, noting common capabilities and key differences in how tools define workflows (code vs. static files) and in features such as scheduling and monitoring. Airflow’s code-first approach—primarily in Python—offers strong flexibility, dynamic DAG construction, and a large ecosystem of integrations for databases, big data platforms, and cloud services.
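
To make the DAG abstraction concrete, here is a minimal sketch of the umbrella demand pipeline as an Airflow DAG, written against the Airflow 2.x Python API (imports and argument names may differ in other versions); the task callables are placeholders for the real fetch, clean, and train logic.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _placeholder(**context):
    # Stand-in for the real fetch/clean/train logic.
    print(f"Running task {context['task'].task_id}")


with DAG(
    dag_id="umbrella_demand",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # run the pipeline once per day
    catchup=False,
) as dag:
    fetch_weather = PythonOperator(task_id="fetch_weather", python_callable=_placeholder)
    clean_weather = PythonOperator(task_id="clean_weather", python_callable=_placeholder)
    fetch_sales = PythonOperator(task_id="fetch_sales", python_callable=_placeholder)
    clean_sales = PythonOperator(task_id="clean_sales", python_callable=_placeholder)
    train_model = PythonOperator(task_id="train_model", python_callable=_placeholder)

    # The weather and sales branches are independent, so Airflow can run them in
    # parallel; both must finish before the model is trained.
    fetch_weather >> clean_weather >> train_model
    fetch_sales >> clean_sales >> train_model

Because the dependencies are explicit, a failed task (and its downstream tasks) can be rerun without re-executing the rest of the graph.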

Airflow lets you schedule DAGs on time-based or event-like schedules, then uses core components (DAG processing, scheduling, workers, and asynchronous triggers) to queue and run tasks in the right order, capturing results and logs for inspection in a rich web interface. Built-in retries, easy task reruns, and powerful time semantics support incremental processing and backfilling, making it efficient to build and maintain production pipelines. The chapter closes with guidance on fit: Airflow excels at batch and regularly scheduled or event-triggered workflows that integrate many systems and benefit from software engineering best practices; it is less suited to real-time streaming, highly volatile DAG structures, or teams without Python expertise. It also previews the rest of the book, from foundational concepts to advanced patterns and deployment guidance.
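
As a small illustration of the failure handling mentioned above, retries can be configured through default arguments that apply to every task in a DAG; the sketch below uses Airflow 2.x names, and the values are chosen for illustration only.

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "retries": 3,                         # retry each failed task up to three times
    "retry_delay": timedelta(minutes=5),  # wait five minutes between attempts
}

with DAG(
    dag_id="resilient_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    default_args=default_args,  # applied to every task in this DAG
) as dag:
    ...  # tasks defined here inherit the retry settings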

In the weather dashboard use case, weather data is fetched from an external API and fed into a dynamic dashboard.
Graph representation of the data pipeline for the weather dashboard. Nodes represent tasks and directed edges represent dependencies between tasks (an edge pointing from task 1 to task 2 indicates that task 1 must run before task 2).
Cycles in graphs prevent task execution due to circular dependencies. In acyclic graphs (top), there is a clear path to execute the three different tasks. However, in cyclic graphs (bottom), there is no longer a clear execution path due to the interdependency between tasks 2 and 3.
Using the DAG structure to execute tasks in the data pipeline in the correct order: each task’s state is shown during each loop through the algorithm, demonstrating how this leads to the completed execution of the pipeline (end state).
Overview of the umbrella demand use case, in which historical weather and sales data are used to train a model that predicts future sales demand based on weather forecasts.
Independence between sales and weather tasks in the graph representation of the data pipeline for the umbrella demand forecast model. The two sets of fetch/cleaning tasks are independent as they involve two different data sets (the weather and sales data sets). This independence is indicated by the lack of edges between the two sets of tasks.
Airflow pipelines are defined as DAGs using Python code in DAG files. Each DAG file typically defines one DAG, which describes the different tasks and their dependencies. Besides this, the DAG also defines a schedule interval that determines when the DAG is executed by Airflow.
The main components involved in Airflow are the Airflow API server, scheduler, DAG processor, triggerer and workers.
Developing and executing pipelines as DAGs using Airflow. Once the user has written the DAG, the DAG processor and scheduler ensure that the DAG is run at the right moment. The user can monitor progress and output at any time while the DAG is running.
The login page for the Airflow web interface. In the code examples accompanying this book, a default user “airflow” is provided with the password “airflow”.
The main page of Airflow’s web interface, showing a high-level overview of all DAGs and their recent results.
The DAGs page of Airflow’s web interface, showing a high-level overview of all DAGs and their recent results.
The graph view in Airflow’s web interface, showing an overview of the tasks in an individual DAG and the dependencies between these tasks
Airflow’s grid view, showing the results of multiple runs of the umbrella sales model DAG (most recent and historical runs). The columns show the status of one execution of the DAG, and the rows show the status of all executions of a single task. Colors (which you can see in the e-book version) indicate the result of the corresponding task. Users can also click on the task “squares” for more details about a given task instance, or to manage the state of a task so that it can be rerun by Airflow, if desired.

Summary

  • Directed Acyclic Graphs (DAGs) are used to represent the workflows in data processing pipelines. A node in a DAG denotes a task to be performed, and edges define the dependencies between tasks. Compared with a single monolithic script, this representation is not only easier to understand visually but also enables easier debugging, selective rerunning of failed tasks, and parallel execution of independent tasks.
  • In Airflow, DAGs are defined in Python files. Airflow 3.0 introduced the option of using other languages, but in this book we focus on Python. These files outline the order of task execution and the interdependencies between tasks. Airflow parses these files to construct the DAG's structure, enabling task orchestration and scheduling.
  • Although many workflow managers have been developed over the years for executing graphs of tasks, Airflow has several key features that make it uniquely suited for implementing efficient, batch-oriented data pipelines.
  • Airflow excels as a workflow orchestration tool due to its intuitive design, scheduling capabilities, and extensible framework. It provides a rich user interface for monitoring and managing tasks in data processing workflows.
  • Airflow consists of five key components:
    1. DAG processor: Reads and parses the DAG files and stores the resulting serialized DAGs in the metastore for use by (among others) the scheduler.
    2. Scheduler: Reads the DAGs parsed by the DAG processor, determines whether their schedule intervals have elapsed, and queues their tasks for execution.
    3. Workers: Execute the tasks assigned to them by the scheduler.
    4. Triggerer: Handles the execution of deferred tasks, which are waiting for external events or conditions.
    5. API server: Among other things, presents a user interface for visualizing and monitoring the DAGs and their execution status. The API server also acts as the interface between all Airflow components.
  • Airflow lets you set a schedule for each DAG, specifying when the pipeline should be executed. In addition, Airflow’s built-in mechanisms can handle task failures automatically, for example by retrying failed tasks.
  • Airflow is well suited for batch-oriented data pipelines, offering sophisticated scheduling options that enable regular, incremental data processing jobs. On the other hand, Airflow is not the right choice for streaming workloads or for implementing highly dynamic pipelines whose DAG structure changes from one run to the next.

FAQ

What is Apache Airflow and what problem does it solve?
Apache Airflow is an open source workflow orchestrator. It coordinates and schedules work across many systems by running pipelines as graphs of tasks. Airflow itself doesn’t process data; it orchestrates the tools and systems that do, making it well suited for batch and time-driven workflows.

How are data pipelines represented in Airflow?
Airflow represents pipelines as Directed Acyclic Graphs (DAGs). Each node is a task and each directed edge is a dependency showing which tasks must finish before others can start. The “acyclic” property prevents circular dependencies that would deadlock execution.

Why use a graph-based pipeline instead of a single sequential script?
A graph-based approach makes task dependencies explicit, enables parallel execution of independent branches to reduce runtime, and allows rerunning only failed tasks (and their downstream dependents) instead of re-executing an entire monolithic script.

How does Airflow execute a DAG under the hood?
Airflow parses DAG files into a metastore, then the scheduler checks each DAG’s schedule. When due, it evaluates task dependencies, queues runnable tasks, and workers execute them in parallel. For tasks that wait on external conditions, Airflow can defer them to the triggerer to free workers. This loop repeats until all tasks finish.
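
As an illustration of deferral, the asynchronous sensor below hands its wait over to the triggerer instead of occupying a worker slot until the target time is reached; the class name and template variables are taken from Airflow 2.x and should be treated as assumptions for other versions.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.date_time import DateTimeSensorAsync

with DAG(
    dag_id="wait_for_partner_data",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    # While waiting, this task is deferred to the triggerer, freeing its worker slot.
    wait_until_6am = DateTimeSensorAsync(
        task_id="wait_until_6am",
        target_time="{{ data_interval_end.add(hours=6) }}",
    )
    process = EmptyOperator(task_id="process")

    wait_until_6am >> process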

What are Airflow’s main components and what do they do?
  • DAG Processor: parses DAG files and serializes them into the metastore via the API server.
  • Scheduler: decides when DAGs should run and queues tasks whose dependencies are met.
  • Workers: execute queued tasks and report results.
  • Triggerer: monitors asynchronous/deferred tasks and resumes them when ready.
  • API server: provides the web UI and is the gateway for reading/writing DAG and run state in the metastore.

How do you define pipelines in Airflow?
You write DAG files in Python that declare tasks, dependencies, and metadata such as scheduling. Airflow 3.0 also allows other languages, but Python remains primary. Because DAGs are code, you can dynamically create tasks or even generate whole DAGs from configuration or metadata.
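
Because DAG files are ordinary Python, tasks (or even whole DAGs) can be generated from configuration; in the sketch below, one export task is created per table, with hypothetical table names and a placeholder callable.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

TABLES = ["customers", "orders", "umbrella_sales"]  # hypothetical source tables


def _export_table(table_name):
    print(f"Exporting {table_name}")


with DAG(dag_id="export_tables", start_date=datetime(2024, 1, 1), schedule="@daily") as dag:
    for table in TABLES:
        # One task per table, created dynamically from the configuration list.
        PythonOperator(
            task_id=f"export_{table}",
            python_callable=_export_table,
            op_kwargs={"table_name": table},
        )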

How does Airflow integrate with external systems?
Tasks can run any operation you can implement in Python. Airflow’s rich provider ecosystem offers ready-made integrations with databases, big data engines, and cloud services, letting you orchestrate complex, cross-system data workflows.
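
For example, provider packages ship ready-made operators such as a generic SQL operator; the import path, connection ID, and query below are assumptions for illustration and will differ per provider and version.

from datetime import datetime

from airflow import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

with DAG(
    dag_id="load_sales_summary",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
) as dag:
    # Runs a query against whatever database the "sales_db" connection points to.
    build_summary = SQLExecuteQueryOperator(
        task_id="build_summary",
        conn_id="sales_db",  # connection configured in Airflow
        sql="SELECT date, SUM(amount) FROM sales GROUP BY date",
    )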

How does scheduling work, and what are schedule intervals?
You can schedule DAGs using simple intervals (hourly, daily, weekly) or Cron-like expressions. Airflow also tracks previous and next intervals, enabling incremental processing where each run handles just that interval’s data. You can backfill past intervals to build or reprocess historical datasets.
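
As a sketch of interval-aware, incremental processing: the DAG below runs on a Cron schedule, and each run passes its own interval boundaries to a hypothetical export script via Airflow’s templating (template variable names as in Airflow 2.x).

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="incremental_export",
    start_date=datetime(2024, 1, 1),
    schedule="0 3 * * *",  # Cron expression: every day at 03:00
) as dag:
    # Each run exports only the data that belongs to its own schedule interval.
    export_day = BashOperator(
        task_id="export_day",
        bash_command=(
            "python export.py "
            "--start '{{ data_interval_start }}' "
            "--end '{{ data_interval_end }}'"
        ),
    )

Past intervals can then be (re)processed with a backfill, for example airflow dags backfill --start-date 2024-01-01 --end-date 2024-01-31 incremental_export in the Airflow 2.x CLI.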

How do you monitor pipelines and handle failures in Airflow?
Airflow’s web UI shows all DAGs, their runs, and component health. The graph view visualizes tasks and dependencies, while the grid view shows historical runs per task and per execution. Tasks can be retried automatically, logs are accessible for debugging, and you can clear task instances to rerun them (including downstream tasks).
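
As a concrete example of rerunning work, clearing a task instance marks it for re-execution by the scheduler. In the Airflow 2.x CLI this can be done with a command along the lines of airflow tasks clear umbrella_demand --task-regex clean_sales --downstream --start-date 2024-01-15 --end-date 2024-01-15, which reruns the matching task and everything downstream of it for that date (treat the exact flags as assumptions to verify against your Airflow version); the same can be done by clearing task instances from the grid view in the web interface.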

When is Airflow a good fit, and when is it not?
Good fit:
  • Batch or time-based workflows (regular or event-triggered).
  • Pipelines that benefit from clear time “buckets” (intervals), backfills, and incremental loads.
  • Orchestrations that span many systems via providers.
  • Teams applying software engineering practices to workflow code.
Not ideal:
  • Real-time streaming event processing.
  • Highly dynamic DAGs that change structure every run (the UI shows only the latest structure).
  • Teams without Python experience who prefer purely graphical or static workflow definitions.
  • Use cases needing built-in lineage or data versioning (requires additional tools).
