Airflow
Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows, commonly used to manage complex data pipelines.
A workflow in Airflow is a DAG (Directed Acyclic Graph), which defines a set of tasks and their execution order, dependencies, and scheduling.
A DAG (Directed Acyclic Graph) represents a workflow as a collection of tasks with dependencies between them.
In Apache Airflow, a task is the smallest unit of work within a workflow (DAG). Each task represents a single operation or action, such as running a Python function, executing a SQL query, or triggering a bash command. Each task is referred to by its task_id, and tasks are defined using operators.
Apache Airflow Scheduler is a core component responsible for triggering task instances to run in accordance with the defined Directed Acyclic Graphs (DAGs) and their schedules.
In Apache Airflow, an Executor is the component responsible for actually running the tasks defined in your workflows (DAGs). It takes task instances that the Scheduler determines are ready and orchestrates their execution either locally or on remote workers.
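For illustration, here is a minimal sketch of a DAG with two tasks and a dependency between them (the DAG id, schedule, and task logic are assumptions for the example, not taken from any real pipeline):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def extract():
    # placeholder task logic
    print("extracting data")

# hypothetical DAG id and schedule, shown only to illustrate DAG/task/operator concepts
with DAG(
    dag_id="example_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    notify_task = BashOperator(task_id="notify", bash_command="echo 'done'")

    # dependency: extract runs before notify
    extract_task >> notify_task
```

The scheduler triggers a DAG run for each schedule interval, and the executor then runs each task instance.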
=======================================================================
What is Docker?
Docker is a platform used to:
- Package an application and its dependencies into a container
- Ensure the application runs the same across all environments
A Docker container is a lightweight, standalone, and executable package that includes everything needed to run a piece of software: code, libraries, environment variables, and config files.
=======================================================================
Operators are Python classes that define a template for a specific unit of work (task) in a workflow. When you instantiate an operator in a DAG, it becomes a task that Airflow executes. Operators encapsulate the logic required to perform a defined action or job.
Each operator represents a single task in a workflow, like running a script, moving data, or checking if a file exists.
Types of operators
In Apache Airflow, Operators are the building blocks of your workflows (DAGs). Each operator defines a single task to be executed. There are different types of operators based on the type of work they perform.
Operators fall into three broad categories:
Action Operators:
Perform an action like running code or sending an email.
Examples:
- PythonOperator: runs a Python function
- BashOperator: runs shell commands
- EmailOperator: sends emails
- SimpleHttpOperator: interacts with HTTP APIs
Transfer Operators:
Move data between systems or different storage locations.
Examples:
- S3ToRedshiftOperator
- MySqlToGoogleCloudStorageOperator
Sensor Operators:
Wait for a certain event or external condition before proceeding.
Examples:
- FileSensor: waits for a file to appear
- ExternalTaskSensor: waits for another task to complete
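As a rough sketch of how operator classes from these categories become tasks (the connection id, endpoint, and external DAG/task ids below are assumptions; SimpleHttpOperator requires the HTTP provider package to be installed):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="operator_types_example",      # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # sensor operator: wait for a task in another DAG to finish
    wait_upstream = ExternalTaskSensor(
        task_id="wait_upstream",
        external_dag_id="upstream_dag",   # hypothetical DAG id
        external_task_id="final_task",    # hypothetical task id
    )

    # action operator: call an HTTP API ("my_api" is a hypothetical Airflow connection)
    call_api = SimpleHttpOperator(
        task_id="call_api",
        http_conn_id="my_api",
        endpoint="refresh",
        method="POST",
    )

    # transfer operators such as S3ToRedshiftOperator are instantiated the same way
    wait_upstream >> call_api
```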
=======================================================================
Apache Airflow commands
chain() Function
The chain() function lives in airflow.models.baseoperator (previously in airflow.utils.helpers) and helps you connect multiple tasks or groups in a sequence without writing task_1 >> task_2 >> task_3 manually.
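A minimal sketch, assuming three Bash tasks in an otherwise ordinary DAG (ids are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="chain_example",               # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    t1 = BashOperator(task_id="t1", bash_command="echo 1")
    t2 = BashOperator(task_id="t2", bash_command="echo 2")
    t3 = BashOperator(task_id="t3", bash_command="echo 3")

    # equivalent to t1 >> t2 >> t3
    chain(t1, t2, t3)
```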
=======================================================================
Cron
cron expressions are used in the schedule_interval parameter of a DAG to define when the DAG should run.
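For example, a minimal sketch (the DAG id is hypothetical; the cron string "0 6 * * *" means every day at 06:00):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_6am_dag",               # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",        # run every day at 06:00
    catchup=False,
) as dag:
    BashOperator(task_id="run_job", bash_command="echo 'daily run'")
```
=======================================================================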
What Are Hooks in Apache Airflow?
Hooks in Airflow are interfaces to external platforms, like databases, cloud storage, APIs, and more. They abstract the connection and authentication logic, allowing operators to use these services easily.
Hooks are mostly used behind the scenes by Operators, but you can also call them directly in Python functions.
Why Use Hooks?
- Reusable connection logic
- Securely use Airflow's connection system (Airflow Connections UI)
- Simplifies integrating with external systems (e.g., MySQL, S3, BigQuery, Snowflake)
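A minimal sketch of calling a hook directly from a Python function (this assumes the Postgres provider is installed and a connection with id my_postgres exists in the Airflow Connections UI; the table name is hypothetical):

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

def fetch_row_count():
    # "my_postgres" is a hypothetical connection id configured in the Airflow Connections UI
    hook = PostgresHook(postgres_conn_id="my_postgres")
    records = hook.get_records("SELECT COUNT(*) FROM my_table")
    print(records[0][0])
```

Such a function would typically be wrapped in a PythonOperator, which is how hooks usually end up being used inside tasks.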
=======================================================================
What Does an Executor Do?
It communicates with the Scheduler and runs the tasks defined in your DAGs, either locally, in parallel, or on distributed systems like Celery or Kubernetes.
Running the airflow info command shows which executor is currently configured.
=======================================================================
Why Use SLAs?
An SLA (Service Level Agreement) in Airflow defines the maximum time a task is expected to take. SLAs are used to ensure:
- Timely data availability
- Reliable pipeline performance
- Alerting for delays or failures
How SLA Works in Airflow
- SLA is defined per task, not per DAG.
- If a task takes longer than the SLA, it's marked as an SLA miss.
- Airflow triggers an SLA miss callback and logs the event.
- Email alerts can be sent if configured.
Monitoring SLA Misses
- Go to the Airflow UI: Browse > SLA Misses
- Or check the Task Instance Details
⚠️ Notes
- SLAs are checked after the DAG run completes.
- SLAs are about runtime, not start time.
- An SLA miss doesn't retry or fail the task; it just logs the violation.
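A minimal sketch of attaching an SLA and an SLA miss callback (the DAG id, one-hour threshold, and callback body are illustrative assumptions):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

def on_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # called for DAG runs that contain SLA misses; here we just log them
    print(f"SLA missed for tasks: {task_list}")

with DAG(
    dag_id="sla_example",                 # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    sla_miss_callback=on_sla_miss,
    catchup=False,
) as dag:
    # if this task runs longer than 1 hour, Airflow records an SLA miss
    long_task = BashOperator(
        task_id="long_task",
        bash_command="sleep 10",
        sla=timedelta(hours=1),
    )
```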
=======================================================================
What is a Template?
A template is a string that contains placeholders which are evaluated at runtime using the Jinja2 engine.
In Apache Airflow, Templates allow you to dynamically generate values at runtime using Jinja templating (similar to templating in Flask or Django). They are useful when you want task parameters to depend on execution context, such as the date, DAG ID, or other dynamic values.
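For instance, a minimal sketch using the built-in {{ ds }} macro in a templated field (the DAG id and command are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="template_example",            # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # {{ ds }} is rendered at runtime to the run's logical date, e.g. 2024-01-01
    print_date = BashOperator(
        task_id="print_date",
        bash_command="echo 'Processing data for {{ ds }}'",
    )
```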
=======================================================================
Jinja is a templating engine for Python used heavily in Apache Airflow to dynamically render strings using runtime context. It lets you inject variables, logic, and macros into your task parameters.
What is Jinja?
Jinja is a mini-language similar to Django or Liquid templates. In Airflow, it's used for:
- Creating dynamic file paths
- Modifying behavior based on execution date
- Using control structures like loops and conditions
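A minimal sketch combining macros, a dynamic file path, and a Jinja conditional in a templated field (the DAG id, path, and weekday logic are illustrative; logical_date assumes Airflow 2.2+):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="jinja_example",               # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # dynamic path from the logical date, plus an if/else on the weekday
    process = BashOperator(
        task_id="process_file",
        bash_command=(
            "echo 'input: /data/{{ ds_nodash }}/input.csv'; "
            "echo 'yesterday was {{ macros.ds_add(ds, -1) }}'; "
            "{% if logical_date.weekday() == 0 %}"
            "echo 'Monday: also run the weekly rollup'"
            "{% else %}"
            "echo 'regular daily load'"
            "{% endif %}"
        ),
    )
```
=======================================================================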
✅ What is catchup?
When catchup=True (default), Airflow will "catch up" by running all the DAG runs from the start_date to the current date.
When catchup=False, it only runs the latest scheduled DAG run from the time it is triggered.
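A minimal sketch (DAG id and dates are illustrative): with this configuration only the most recent interval runs, even though start_date is in the past.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="catchup_example",             # hypothetical DAG id
    start_date=datetime(2023, 1, 1),      # well in the past
    schedule_interval="@daily",
    catchup=False,                        # skip all the missed intervals
) as dag:
    BashOperator(task_id="noop", bash_command="echo 'latest interval only'")
```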
=======================================================================
What is Backfill?
Backfill is the process of running a DAG for past scheduled intervals that have not yet been run.
When a DAG is created or modified with a start_date in the past, Airflow can "backfill" to ensure that all scheduled intervals between the start_date and now are executed. Backfills can also be triggered manually from the CLI with airflow dags backfill, passing a start date, an end date, and the DAG id.
=======================================================================
Components of Apache Airflow
Apache Airflow is made up of several core components that work together to orchestrate workflows:
| Component | Description |
|---|---|
| Scheduler | The brain of Airflow that monitors DAGs and tasks, triggers DAG runs based on schedules or events, and submits tasks to the executor for execution. It continuously checks dependencies and task states to decide what to run next. By default it can take up to 5 minutes for Airflow to detect a new DAG file in the DAGs folder, and the scheduler scans for new tasks about every 4 seconds. |
| Executor | Executes task instances assigned by the scheduler. It can run tasks locally, via distributed workers, or on containerized environments depending on the executor type (LocalExecutor, CeleryExecutor, KubernetesExecutor, etc.). |
| Workers | Machines or processes (depending on executor) that actually run the task code. For distributed executors like Celery or Kubernetes, multiple workers run tasks in parallel, scaling out capacity. |
| Metadata Database | A relational database (e.g., PostgreSQL, MySQL) that stores all Airflow metadata: DAG definitions, task states, execution history, logs, connection info, and more. The scheduler, workers, and webserver interact with it constantly. |
| Webserver (UI) | Provides a user interface to monitor DAG runs, task status, logs, and overall workflow health. Built on a FastAPI server with APIs for workers, UI, and external clients. |
| DAGs Folder | Directory or location where DAG definition Python files live. These files describe the workflows and are parsed by the scheduler or DAG processor. |
=======================================================================
DAG View
=======================================================================
XCom (Cross-Communication)
XCom (short for “Cross-communication”) allows tasks to exchange small amounts of data between each other in a DAG.
How XCom Works
- Push → Send data to XCom from one task
- Pull → Retrieve that data in another task
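A minimal sketch of push and pull with PythonOperator (the DAG id, task ids, and value are illustrative; the first callable's return value is pushed automatically):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def push_value():
    # the return value is automatically pushed to XCom under the key "return_value"
    return 42

def pull_value(ti):
    # pull the value pushed by the "push_task" task
    value = ti.xcom_pull(task_ids="push_task")
    print(f"pulled {value}")

with DAG(
    dag_id="xcom_example",                # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    push_task = PythonOperator(task_id="push_task", python_callable=push_value)
    pull_task = PythonOperator(task_id="pull_task", python_callable=pull_value)
    push_task >> pull_task
```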
⚠️ Limitations
- Not for large data (use external storage like S3, GCS)
- Automatically pushed by return values of PythonOperator (Airflow ≥2.0)
- Can get cluttered in the DB if not managed
=======================================================================
Airflow Variables
Airflow Variables are key-value pairs used to store and retrieve dynamic configurations in your DAGs and tasks.
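A minimal sketch of reading a Variable in Python and in a template (env_name is a hypothetical Variable key that would be created in the Airflow UI or CLI):

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator

# read in Python code, with a default if the key is missing
env_name = Variable.get("env_name", default_var="dev")

with DAG(
    dag_id="variables_example",           # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # read at runtime through Jinja templating
    echo_env = BashOperator(
        task_id="echo_env",
        bash_command="echo 'Environment: {{ var.value.env_name }}'",
    )
```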
=======================================================================
Apache Airflow Sensors
Sensors are special types of operators in Airflow that wait for a condition to be true before allowing downstream tasks to proceed.
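A minimal sketch using FileSensor (the file path and DAG id are illustrative; here the sensor checks every 60 seconds and gives up after an hour):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="sensor_example",              # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # wait until the file appears before letting the downstream task run
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/report.csv",   # hypothetical path
        poke_interval=60,
        timeout=60 * 60,
        mode="reschedule",                      # free the worker slot between checks
    )
    process = BashOperator(task_id="process", bash_command="echo 'processing'")
    wait_for_file >> process
```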
=======================================================================
Branching in Apache Airflow
Branching allows you to dynamically choose one (or more) downstream paths from a set of tasks based on logic. This is done using the BranchPythonOperator (a code sketch follows the notes below).
Why Use Branching?
Branching is useful when:
- You want to run different tasks based on a condition
- You need to skip certain tasks
- You want "if/else" logic in your DAG
✅ Notes:
- Tasks not returned by BranchPythonOperator will be skipped.
- You can return a single task ID or a list of task IDs.
- Ensure your downstream tasks can handle being skipped, or use an appropriate trigger_rule.
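A minimal sketch (the DAG id, task ids, and weekday condition are illustrative assumptions):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import BranchPythonOperator

def choose_path(logical_date):
    # return the task_id (or list of task_ids) to follow; the others are skipped
    if logical_date.weekday() < 5:
        return "weekday_task"
    return "weekend_task"

with DAG(
    dag_id="branching_example",           # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_path)
    weekday_task = BashOperator(task_id="weekday_task", bash_command="echo weekday")
    weekend_task = BashOperator(task_id="weekend_task", bash_command="echo weekend")
    # trigger_rule lets this task run even though one branch was skipped
    join = BashOperator(
        task_id="join",
        bash_command="echo done",
        trigger_rule="none_failed",
    )
    branch >> [weekday_task, weekend_task] >> join
```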
=======================================================================
SubDAG
What is a SubDAG in Apache Airflow?
A SubDAG is a DAG within a DAG — essentially, a child DAG defined inside a parent DAG. It's used to logically group related tasks together and reuse workflow patterns, making complex DAGs easier to manage.
Think of a SubDAG as a modular block that can be reused or organized separately.
TaskGroup in Apache Airflow
A TaskGroup in Airflow is a way to visually and logically group tasks together in the UI without creating a separate DAG like SubDagOperator does. It's lightweight, easier to use, and the recommended approach in Airflow 2.x+.
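A minimal sketch of a TaskGroup (the DAG id, group id, and task ids are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG(
    dag_id="taskgroup_example",           # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    start = BashOperator(task_id="start", bash_command="echo start")

    # tasks inside the group appear collapsed under "extract" in the UI
    with TaskGroup(group_id="extract") as extract_group:
        pull_orders = BashOperator(task_id="pull_orders", bash_command="echo orders")
        pull_users = BashOperator(task_id="pull_users", bash_command="echo users")

    end = BashOperator(task_id="end", bash_command="echo end")

    start >> extract_group >> end
```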
=======================================================================
Edge Labels in Apache Airflow
Edge labels in Airflow are annotations you can add to the edges (arrows) between tasks in the DAG graph view. They help clarify why one task depends on another, especially when using complex branching, conditionals, or TriggerRules.
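A minimal sketch using Label from airflow.utils.edgemodifier (the DAG and task ids are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.edgemodifier import Label

with DAG(
    dag_id="edge_label_example",          # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    check = BashOperator(task_id="check", bash_command="echo check")
    load = BashOperator(task_id="load", bash_command="echo load")

    # the label "data ok" is shown on the arrow between the two tasks in the Graph view
    check >> Label("data ok") >> load
```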