Airflow
Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows, commonly used to manage complex data pipelines.
A workflow in Airflow is a DAG (Directed Acyclic Graph), which defines a set of tasks along with their execution order, dependencies, and schedule.
A DAG (Directed Acyclic Graph) represents a workflow as a collection of tasks with dependencies between them.
In Apache Airflow, a task is the smallest unit of work within a workflow (DAG). Each task represents a single operation or action, such as running a Python function, executing a SQL query, or triggering a bash command. A task is referenced by its task ID, and an operator defines what the task does.
Apache Airflow Scheduler is a core component responsible for triggering task instances to run in accordance with the defined Directed Acyclic Graphs (DAGs) and their schedules.
In Apache Airflow, an Executor is the component responsible for actually running the tasks defined in your workflows (DAGs). It takes task instances that the Scheduler determines are ready and orchestrates their execution either locally or on remote workers.
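A minimal sketch of how these pieces fit together in a DAG file (DAG id, task ids, and commands are illustrative, not from any particular project):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def transform():
    # Placeholder Python callable for the transform step
    print("transforming data")

with DAG(
    dag_id="example_etl",              # illustrative DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load = BashOperator(task_id="load", bash_command="echo load")

    # Execution order / dependencies
    extract >> transform_task >> load
```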
=======================================================================
What is Docker?
Docker is a platform used to:
- Package an application and its dependencies into a container
- Ensure the application runs the same across all environments
A Docker container is a lightweight, standalone, and executable package that includes everything needed to run a piece of software: code, libraries, environment variables, and config files.
What is a Dockerfile?
- A Dockerfile is a text file that contains all the instructions to build a Docker image.
- It defines the environment, dependencies, and commands your application needs to run consistently on any machine.
- Think of it as a recipe for your container.
Step-by-Step Explanation
- FROM python:3.11-slim
  Base image with Python installed. Slim version = smaller image.
- WORKDIR /app
  Sets the working directory inside the container.
- COPY requirements.txt .
  Copies the dependency file into the container.
- RUN pip install ...
  Installs Python packages inside the container.
- COPY . .
  Copies your ETL or Airflow scripts into the container.
- ENV PYTHONUNBUFFERED=1
  Makes Python logs visible immediately (useful for debugging).
- CMD ["python", "main.py"]
  Default command when the container starts. Can be your ETL job or Airflow task script.
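Putting the steps together, a minimal sketch of such a Dockerfile (the entry point main.py is a placeholder):

```dockerfile
# Base image with Python installed; slim variant keeps the image small
FROM python:3.11-slim

# Working directory inside the container
WORKDIR /app

# Install dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the ETL / Airflow scripts into the image
COPY . .

# Flush Python logs immediately (helpful for debugging)
ENV PYTHONUNBUFFERED=1

# Default command when the container starts (placeholder entry point)
CMD ["python", "main.py"]
```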
Useful Commands
- Build Docker image
- Run container
- Run with mounted volume (edit locally, reflect in container)
- Push to Docker Hub / registry
The commands themselves are sketched below.
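A sketch of those commands, assuming an image called my-etl and a Docker Hub account named myuser (both placeholders):

```bash
# Build a Docker image from the Dockerfile in the current directory
docker build -t my-etl:latest .

# Run a container from the image
docker run --rm my-etl:latest

# Run with a mounted volume so local edits are reflected inside the container
docker run --rm -v "$(pwd)":/app my-etl:latest

# Push to Docker Hub / a registry
docker tag my-etl:latest myuser/my-etl:latest
docker push myuser/my-etl:latest
```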
What is Docker Compose?
- Docker Compose is a tool to define and run multi-container Docker applications.
- Instead of running each container individually, you define all services in a single docker-compose.yml file.
- You can spin up the whole environment with one command:
docker-compose up
Run Commands
- Build & start all services
- Run in detached mode (background)
- Stop all containers
- View logs of a service
The commands themselves are sketched below.
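A sketch of the corresponding Compose commands (the service name webserver is a placeholder):

```bash
# Build images and start all services
docker-compose up --build

# Run in detached mode (background)
docker-compose up -d

# Stop and remove all containers defined in the compose file
docker-compose down

# View (and follow) logs of a single service
docker-compose logs -f webserver
```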
----------------------------------------------------------------------------
What is requirements.txt?
- It's a text file listing all the Python packages your project needs.
- Used by pip to install dependencies: pip install -r requirements.txt
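For example, a minimal requirements.txt for an Airflow ETL image might look like this (package choices are illustrative; pin versions in real projects):

```text
apache-airflow
apache-airflow-providers-amazon
pandas
```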
=======================================================================
⭐ 1. Airflow Connections
Connections = saved credentials for external systems.
Examples:
- AWS
- Snowflake
- Postgres
- MySQL
- BigQuery
- S3
- Kafka
- Redshift
How to Set Connections
A) Using the Airflow UI
- Go to Admin → Connections
- Click + Add
- Fill in:
  - Conn ID → aws_default
  - Conn Type → Amazon Web Services
  - Extra → JSON (keys, region, endpoint)
- Save
B) Using the CLI
C) Using Environment Variables
Format: AIRFLOW_CONN_<CONN_ID> (connection ID uppercased)
Example: see the sketch below, which covers both B) and C).
This is very common in Docker/Kubernetes.
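A sketch of options B) and C), using an illustrative aws_default connection (all credential values are placeholders):

```bash
# B) CLI: create a connection
airflow connections add aws_default \
    --conn-type aws \
    --conn-extra '{"region_name": "us-east-1"}'

# C) Environment variable: AIRFLOW_CONN_<CONN_ID> holds a connection URI
export AIRFLOW_CONN_AWS_DEFAULT='aws://AKIAEXAMPLE:REPLACE_ME@/?region_name=us-east-1'
```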
⭐ 2. Airflow Variables
Variables = key–value store for configuration.
Examples:
- file_path
- S3 bucket name
- threshold value
- list of emails
How to Set Variables
A) Using the UI: Admin → Variables → Add
B) Using the CLI
C) Using JSON import
D) Using Environment Variables
Usage inside a DAG: see the sketch below, which covers B)–D) and DAG usage.
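A sketch of options B)–D) and of reading a variable in a DAG (variable names and values are illustrative):

```bash
# B) CLI: set and get a variable
airflow variables set s3_bucket my-data-bucket
airflow variables get s3_bucket

# C) Import variables from a JSON file
airflow variables import variables.json

# D) Environment variable: AIRFLOW_VAR_<KEY>
export AIRFLOW_VAR_S3_BUCKET=my-data-bucket
```

```python
# Usage inside a DAG file
from airflow.models import Variable

bucket = Variable.get("s3_bucket")
emails = Variable.get("alert_emails", deserialize_json=True, default_var=[])
```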
⭐ 3. Airflow Secret Backends (Very Important for Data Engineers)
Airflow supports managing secrets securely using external systems.
Supported Secret Backends:
- AWS Secrets Manager
- GCP Secret Manager
- HashiCorp Vault
- Azure Key Vault
- Custom secret backends
Why use secret backends?
- Secrets are not stored in the Airflow DB
- Rotated automatically
- Secure & centralized
- Avoid plaintext passwords in the Airflow UI
Example: Using AWS Secrets Manager
Add the backend to airflow.cfg (sketch below).
AWS secret format: secrets are stored under the configured prefixes, e.g. airflow/connections/<conn_id> and airflow/variables/<key>.
Used automatically in DAGs: any hook or operator that references the conn_id resolves it from Secrets Manager instead of the metadata DB.
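A sketch of the airflow.cfg entry, assuming the Amazon provider package is installed (prefix and region values are illustrative):

```ini
[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables", "region_name": "us-east-1"}
```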
⭐ 4. Best Practices for Storing Credentials (MOST IMPORTANT)
1. NEVER store passwords in code
❌ Hardcoded credentials in DAG files or scripts
✔ Use: Airflow Connections, environment variables, or a secret backend
2. Avoid storing secrets in Airflow Variables
Variables are NOT encrypted by default.
3. Use Secret Backends for all production credentials
- AWS Secrets Manager
- GCP Secret Manager
- HashiCorp Vault
4. Use environment variables for local development
Safe and temporary.
5. Do not store credentials in GitHub / repos
Always use:
- .env files
- Kubernetes Secrets
- Docker Secrets
6. Use different connection IDs for dev/stage/prod
Example:
- aws_dev
- aws_stage
- aws_prod
7. Use the JSON "extra" field for complex configs
Example Extra field in the UI: see the sketch below.
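A sketch of what the Extra JSON might contain for an AWS connection (all values are placeholders; in practice keep access keys in the login/password fields or a secret backend):

```json
{
  "region_name": "us-east-1",
  "endpoint_url": "https://s3.us-east-1.amazonaws.com"
}
```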
Operators
Operators are Python classes that define a template for a specific unit of work (task) in a workflow. When you instantiate an operator in a DAG, it becomes a task that Airflow executes. Operators encapsulate the logic required to perform a defined action or job.
Each operator represents a single task in the workflow — like running a script, moving data, or checking if a file exists.
Operators = do something
Sensors = wait for something
Hooks = connection to systems (S3Hook, PostgresHook, etc.)
Executors = how tasks run (Local, Celery, Kubernetes)
Scheduler = creates DAG Runs + task instances
Types of operators
In Apache Airflow, Operators are the building blocks of your workflows (DAGs). Each operator defines a single task to be executed. There are different types of operators based on the type of work they perform.
Operators fall into three broad categories:
Action Operators:
Perform an action like running code or sending an email.
Examples:
- PythonOperator to run a Python function
- BashOperator to run shell commands
- EmailOperator to send emails
- SimpleHttpOperator to interact with APIs
Transfer Operators:
Move data between systems or different storage locations.
Examples:
- S3ToRedshiftOperator
- MySqlToGoogleCloudStorageOperator
Sensor Operators:
Wait for a certain event or external condition before proceeding.
Examples:
- FileSensor waits for a file to appear
- ExternalTaskSensor waits for another task to complete
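A sketch combining an action operator and a sensor in one DAG (file path and commands are placeholders; transfer operators such as S3ToRedshiftOperator need the relevant provider package installed):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

def say_hello():
    print("hello from PythonOperator")

with DAG("operator_examples", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    # Sensor operator: wait for something (placeholder path)
    wait_for_file = FileSensor(task_id="wait_for_file",
                               filepath="/tmp/input.csv", poke_interval=60)

    # Action operators: do something
    run_script = BashOperator(task_id="run_script", bash_command="echo running")
    hello = PythonOperator(task_id="hello", python_callable=say_hello)

    wait_for_file >> run_script >> hello
```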
=======================================================================
Apache Airflow commands
chain() Function
The chain() function (importable from airflow.models.baseoperator in Airflow 2.x, previously from airflow.utils.helpers) helps you connect multiple tasks or groups in a sequence without writing task_1 >> task_2 >> task_3 manually.
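A minimal chain() sketch (assumes Airflow 2.3+ where EmptyOperator is available; task ids are placeholders):

```python
from datetime import datetime
from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.empty import EmptyOperator

with DAG("chain_example", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    t1, t2, t3, t4 = [EmptyOperator(task_id=f"task_{i}") for i in range(1, 5)]

    # Equivalent to: t1 >> t2 >> t3 >> t4
    chain(t1, t2, t3, t4)
```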
=======================================================================
⭐ 1. Airflow Scheduling Basics
Airflow schedules based on:
- cron expressions
- timetables
- logical date
- catchup
- backfill
A DAG run does NOT start at the exact cron time—it starts after the logical interval finishes.
2. Cron Expressions in Airflow
Cron = when to run the DAG.
Examples:
| Cron | Meaning |
|---|---|
| 0 0 * * * | Every midnight |
| 0 */2 * * * | Every 2 hours |
| 0 6 * * 1 | Every Monday at 6 AM |
| */5 * * * * | Every 5 minutes |
Airflow uses cron to define the start of the schedule interval, but the DAG runs after the interval finishes.
3. Timetables (Airflow 2.2+)
Timetables = new, flexible scheduling system.
Useful when cron is not enough.
Examples:
- Run a DAG every business day except holidays
- Run every 3 hours between 9–5
- Run based on dataset dependencies
- Run after an upstream dataset is updated
Timetables replace schedule_interval for advanced cases.
4. Catchup vs No Catchup
| Setting | What it Means |
|---|---|
| catchup=True | Airflow creates DAG Runs for all past dates since the start date |
| catchup=False | Airflow only runs the latest DAG run, skipping historical dates |
Example:
DAG start date = Jan 1
Today = Jan 5
Schedule = daily
| catchup setting | Runs created |
|---|---|
| True | Jan 1, 2, 3, 4, 5 (5 runs) |
| False | Only Jan 5 (latest run) |
5. Backfill (Manual Catchup)
Backfill = you manually run past dates even if catchup=False.
Command: see the sketch after this list.
Purpose:
- Re-run historical data
- Fix missed data loads
- Reprocess partitions
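A sketch of the backfill command (DAG id and dates are illustrative):

```bash
# Re-run my_dag for the intervals between Jan 1 and Jan 5, 2024
airflow dags backfill -s 2024-01-01 -e 2024-01-05 my_dag
```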
6. Logical Date (MOST IMPORTANT)
Logical date = the data interval the DAG run is processing.
It is not the actual time the run starts.
Example:
Schedule: daily
Cron: 0 0 * * * (midnight, i.e., the start of the interval)
Logical date of the DAG run: 2024-10-10 00:00
Run actually starts at: 2024-10-11 00:01 or later (once that day's interval has finished)
Why is this important?
All tasks use logical_date for:
- file paths
- S3 partitions
- SQL date parameters
- templated variables ({{ ds }} etc.)
Think of it like:
✔ Logical date = data date
✔ Execution date = same as logical date (Airflow 2.2+)
✖ NOT the real time the task runs
Logical Date Example (Simple)
Schedule = daily
Interval = 2024-01-01 00:00 → 2024-01-02 00:00
The DAG run for this interval actually starts at 2024-01-02 00:01, but its logical date is 2024-01-01 00:00, because that is the interval start.
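A sketch of using the logical date in a templated task (DAG id, bucket, and path are placeholders):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("partition_copy", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    # {{ ds }} renders the logical date (YYYY-MM-DD) at runtime, so each run
    # processes the partition for its own data interval
    copy_partition = BashOperator(
        task_id="copy_partition",
        bash_command="echo processing s3://my-bucket/events/dt={{ ds }}/",
    )
```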
=======================================================
Cron
Cron expressions are used in the schedule_interval parameter of a DAG to define when the DAG should run.
=======================================================================
What Are Hooks in Apache Airflow?
Hooks in Airflow are interfaces to external platforms, like databases, cloud storage, APIs, and more. They abstract the connection and authentication logic, allowing operators to use these services easily.
Hooks are mostly used behind the scenes by Operators, but you can also call them directly in Python functions.
Why Use Hooks?
- Reusable connection logic
- Securely use Airflow's connection system (Airflow Connections UI)
- Simplifies integrating with external systems (e.g., MySQL, S3, BigQuery, Snowflake)
=======================================================================
☁️ What is S3Hook?
- S3Hook is a helper class in Airflow to interact with Amazon S3.
- It abstracts boto3 (AWS SDK for Python) operations so you can read/write files, list buckets, check if objects exist, etc., directly in your DAGs.
- Comes from: from airflow.providers.amazon.aws.hooks.s3 import S3Hook (Airflow 2+)
When to Use S3Hook
- You want to upload a file to S3 from Airflow.
- You want to download a file from S3 for processing.
- You want to check if a key/object exists before running a task.
- You want to list files in a bucket dynamically.
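A sketch of calling S3Hook inside a Python callable, assuming an aws_default connection and a placeholder bucket/paths:

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def upload_report(ds, **_):
    # ds is the logical date (YYYY-MM-DD) injected by Airflow
    hook = S3Hook(aws_conn_id="aws_default")

    # Upload a local file (placeholder paths / bucket)
    hook.load_file(filename="/tmp/report.csv",
                   key=f"reports/{ds}/report.csv",
                   bucket_name="my-data-bucket",
                   replace=True)

    # Check whether an object exists and list keys under a prefix
    if hook.check_for_key(key=f"reports/{ds}/report.csv", bucket_name="my-data-bucket"):
        print(hook.list_keys(bucket_name="my-data-bucket", prefix=f"reports/{ds}/"))
```

The function would be wired up as the python_callable of a PythonOperator task.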
=======================================================================
What Does an Executor Do?
It communicates with the Scheduler and runs the tasks defined in your DAGs—either locally, in parallel, or on distributed systems like Celery or Kubernetes.
Running airflow info shows which executor is currently configured.
=======================================================================
Why Use SLAs?
To ensure:
- Timely data availability
- Reliable pipeline performance
- Alerting for delays or failures
How SLA Works in Airflow
- An SLA is defined per task, not per DAG.
- If a task takes longer than the SLA, it's marked as an SLA miss.
- Airflow triggers an SLA miss callback and logs the event.
- Email alerts can be sent if configured.
Monitoring SLA Misses
- Go to Airflow UI → Browse → SLA Misses
- Or check the Task Instance Details
⚠️ Notes
- SLAs are checked after the DAG run completes.
- SLAs are about runtime, not start time.
- An SLA miss doesn't retry or fail the task—it just logs the violation.
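A sketch of setting an SLA and a miss callback (DAG id, task, and callback body are illustrative):

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

def on_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called when an SLA miss is detected; here we just log the affected tasks
    print(f"SLA missed in {dag.dag_id}: {task_list}")

with DAG(
    "sla_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    sla_miss_callback=on_sla_miss,
) as dag:
    # Marked as an SLA miss if it has not finished within 30 minutes of the expected time
    load = BashOperator(task_id="load", bash_command="echo load",
                        sla=timedelta(minutes=30))
```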
=======================================================================
What is a Template?
A template is a string that contains placeholders which are evaluated at runtime using the Jinja2 engine.
In Apache Airflow, Templates allow you to dynamically generate values at runtime using Jinja templating (similar to templating in Flask or Django). They are useful when you want task parameters to depend on execution context, such as the date, DAG ID, or other dynamic values.
=======================================================================
Jinja is a templating engine for Python used heavily in Apache Airflow to dynamically render strings using runtime context. It lets you inject variables, logic, and macros into your task parameters.
What is Jinja?
Jinja is a mini-language similar to Django or Liquid templates. In Airflow, it's used for:
- Creating dynamic file paths
- Modifying behavior based on the execution date
- Using control structures like loops and conditions
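A sketch of Jinja in a templated field, including a control structure (DAG id, command, and params are placeholders):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("jinja_demo", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    # bash_command is a templated field: {{ ds }} is the logical date and
    # the {% if %} block is a Jinja control structure driven by params
    export_data = BashOperator(
        task_id="export_data",
        bash_command=(
            "echo exporting for {{ ds }} "
            "{% if params.env == 'prod' %}--full{% else %}--sample{% endif %}"
        ),
        params={"env": "dev"},
    )
```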
✅ What is catchup?
When catchup=True (default), Airflow will "catch up" by running all the DAG runs from the start_date to the current date.
When catchup=False, it only runs the latest scheduled DAG run from the time it is triggered.
=======================================================================
What is Backfill?
Backfill is the process of running a DAG for past scheduled intervals that have not yet been run.
When a DAG is created or modified with a start_date in the past, Airflow can "backfill" to ensure that all scheduled intervals between the start_date and now are executed.
=======================================================================
Components of Apache Airflow
Apache Airflow is made up of several core components that work together to orchestrate workflows:
| Component | Description |
|---|---|
| Scheduler | The brain of Airflow that monitors DAGs and tasks, triggers DAG runs based on schedules or events, and submits tasks to the executor for execution. It continuously checks dependencies and task states to decide what to run next. By default a new DAG file can take up to about 5 minutes to be detected in the DAGs folder (dag_dir_list_interval), while the scheduling loop itself runs every few seconds. |
| Executor | Executes task instances assigned by the scheduler. It can run tasks locally, via distributed workers, or on containerized environments depending on the executor type (LocalExecutor, CeleryExecutor, KubernetesExecutor, etc.). |
| Workers | Machines or processes (depending on executor) that actually run the task code. For distributed executors like Celery or Kubernetes, multiple workers run tasks in parallel, scaling out capacity. |
| Metadata Database | A relational database (e.g., PostgreSQL, MySQL) that stores all Airflow metadata: DAG definitions, task states, execution history, logs, connection info, and more. The scheduler, workers, and webserver interact with it constantly. |
| Webserver (UI) | Provides a user interface to monitor DAG runs, task status, logs, and overall workflow health. Implemented with Flask in Airflow 2.x; Airflow 3 moves to a FastAPI-based API server with APIs for workers, the UI, and external clients. |
| DAGs Folder | Directory or location where DAG definition Python files live. These files describe the workflows and are parsed by the scheduler or DAG processor. |
What is the Airflow Scheduler?
The Airflow Scheduler is the component responsible for triggering DAG runs and executing tasks at the right time based on the DAG’s schedule, dependencies, and state.
It is the “brain” of Airflow.
What does the Airflow Scheduler do?
The scheduler continuously:
| Function | Explanation |
|---|---|
| Monitors DAGs | Watches all DAG files for new/updated DAGs. |
| Creates DAG Runs | Starts DAG runs at the scheduled intervals. |
| Checks Dependencies | Ensures upstream tasks are finished before running next task. |
| Queues Tasks | Decides which tasks are ready to run. |
| Sends tasks to Executor | Hands tasks to workers (Local/Celery/K8s). |
| Handles retries | If a task fails, scheduler triggers retries. |
| Manages SLA | Detects SLA misses. |
How the Scheduler Works (Simple Flow)
The scheduler loops continuously, making decisions every few seconds.
Important Concepts for Interviews
1. Scheduling interval
The scheduler respects:
- schedule_interval
- start_date
- end_date
- catchup
2. Logical Date (Very important!)
Scheduler runs DAGs based on logical execution date, not current time.
3. Executor
Scheduler just queues tasks, but does NOT execute them.
Executor runs the task.
Example executors:
- LocalExecutor
- CeleryExecutor
- KubernetesExecutor
4. Concurrency Controls
The scheduler respects:
- DAG concurrency
- Task concurrency
- Pools
- Parallelism
These prevent overload.
5. Heartbeats
The scheduler sends a “heartbeat” every few seconds.
If the heartbeat stops → the scheduler is down.
Example: Scheduler in Action
If a DAG has a daily schedule and a start date of 2024-01-01 (see the sketch after this list), the scheduler will create DAG runs with logical dates:
- 2024-01-01
- 2024-01-02
- 2024-01-03
…
Each run → the scheduler checks tasks → queues the ready ones.
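The DAG settings assumed in this example, as a sketch:

```python
from datetime import datetime
from airflow import DAG

with DAG(
    "daily_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=True,   # create a run for every interval since the start date
) as dag:
    ...
```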
======================================================
What is an Executor in Airflow?
An Executor is the Airflow component responsible for actually running the tasks.
While the Scheduler decides what to run,
the Executor decides how and where to run it.
Executor = task runner
Scheduler = task coordinator
Why is the Executor Important?
Executors decide:
- How many tasks run in parallel
- Where tasks get executed
- Whether tasks run locally, on workers, or on Kubernetes pods
The choice of executor determines Airflow’s scalability.
Types of Executors (Must Know)
✅ 1. SequentialExecutor
✔ What it is:
- Runs ONE task at a time
- Single-threaded
- No parallelism
- Default for quick testing
✔ Use cases:
- Local testing
- Development / laptop
- Very small DAGs
❌ Not for production.
✅ 2. LocalExecutor
✔ What it is:
- Runs tasks in parallel on the same machine
- Uses multiple processes/threads
- Good performance for small pipelines
✔ Use cases:
- Small teams
- Single-server Airflow deployments
- Around 10–20 parallel tasks
❌ Not suitable for distributed workloads
❌ Cannot scale beyond one machine
✅ 3. CeleryExecutor
✔ What it is:
- Distributed task execution
- Multiple worker machines
- Uses a message broker:
  - Redis
  - RabbitMQ
✔ Use cases:
- Medium to large teams
- Many DAGs running at the same time
- Need dozens or hundreds of parallel tasks
- On-prem or AWS EC2 deployments
Pros:
- Highly scalable
- Fault-tolerant
- Good for data engineering teams
Cons:
- Complex setup (workers + broker + DB)
- Higher maintenance
✅ 4. KubernetesExecutor (most modern)
✔ What it is:
- Each task runs in its own Kubernetes pod
- True elastic scaling
- Perfect isolation of tasks
- Clean environment per task
✔ Use cases:
- Cloud-native setups
- Very large workloads
- Need per-task compute scaling
- Mixed workloads (Python, Spark, Java, Bash, etc.)
Pros:
- Auto-scaling
- GPU/high-memory pods
- Per-task Docker image support
Cons:
- Requires Kubernetes knowledge
- Complex to manage for small teams
(Bonus) LocalKubernetesExecutor (hybrid):
- LocalExecutor for small tasks
- KubernetesExecutor for heavy tasks
How the Scheduler and Executor Work Together
Comparison Table
| Executor | Parallel? | Distributed? | Use Case |
|---|---|---|---|
| SequentialExecutor | ❌ No | ❌ No | Testing only |
| LocalExecutor | ✔ Yes | ❌ No | Medium workloads |
| CeleryExecutor | ✔ Yes | ✔ Yes | Large-scale pipelines |
| KubernetesExecutor | ✔ Yes | ✔ Yes | Cloud-native, scalable workloads |
Best Executor for Data Engineering?
| Use Case | Best Executor |
|---|---|
| Small team, single VM | LocalExecutor |
| Distributed on-prem cluster | CeleryExecutor |
| Cloud environments (AWS/GCP/Azure) | KubernetesExecutor |
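Switching executors is a configuration change; a sketch of the airflow.cfg entry and its environment-variable equivalent:

```ini
[core]
executor = LocalExecutor
```

```bash
# or via environment variable (common in Docker/Kubernetes deployments)
export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
```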
======================================================
What is the Airflow Webserver?
The Airflow Webserver is the component that provides the UI (User Interface) for Airflow.
It lets you view, monitor, trigger, pause, and manage DAGs through a browser.
Webserver = Airflow UI
It shows everything happening inside Airflow.
What the Webserver Does
| Function | Explanation |
|---|---|
| Displays DAGs | Shows all DAGs in the UI |
| Trigger DAGs | You can manually run a DAG |
| Pause/Unpause DAGs | Enable or disable scheduling |
| View Graph View | DAG structure (dependencies) |
| Task Logs | View task execution logs |
| Monitor status | Success / Failed / Queued / Running |
| View XCom | See data passed between tasks |
| Manage Connections | Add/edit database or API credentials |
| Variables | Store global values for DAGs |
| Admin Panel | DAG runs, task instances, users, roles |
How the Webserver Works (Simple Explanation)
- Reads DAG files
- Displays DAGs in the UI
- Shows scheduler and executor status
- Allows user actions (trigger, clear, rerun tasks)
In Airflow 2.x it runs on Flask (a Python web framework) behind the scenes; Airflow 3 moves to a FastAPI-based API server.
Important Ports
Default port: 8080
In production you may put Nginx/HTTPS in front of it.
Webserver vs Scheduler
| Component | Purpose |
|---|---|
| Webserver | UI to view/manage pipelines |
| Scheduler | Decides when tasks should run |
| Executor | Actually runs the tasks |
How to Start the Webserver
Locally and in Docker: see the sketch below.
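A sketch of starting the webserver, assuming a standard Airflow 2.x install and a docker-compose.yml with a webserver service (the service name is a placeholder):

```bash
# Locally
airflow webserver --port 8080

# In Docker (compose-based deployment)
docker-compose up -d webserver
```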
=======================================================================
1. max_active_runs (at DAG level)
✅ Definition
max_active_runs = maximum number of DAG Runs that can run at the same time for a specific DAG.
Think: "How many full pipeline runs can run in parallel?"
Example (see the sketch after these two sections): with max_active_runs=1,
✔ Only one DAG run will run at a time
✖ A new scheduled run will wait until the previous run finishes
Why is it important?
- Prevents overlapping runs
- Useful for pipelines that update the same tables
- Avoids data corruption
2. concurrency (at DAG level)
✅ Definition
concurrency = maximum number of task instances from the SAME DAG that can run in parallel (renamed to max_active_tasks in Airflow 2.2+).
Think: "How many tasks inside this DAG can run at the same time?"
Example (see the sketch below): with concurrency=5,
✔ A maximum of 5 tasks from this DAG can run at once
✖ The 6th task waits in the queue
Why is it important?
- Controls the load on your system
- Prevents overwhelming the database, Spark cluster, APIs, etc.
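A sketch showing both DAG-level settings (DAG id and values are illustrative; max_active_tasks is the Airflow 2.2+ name for concurrency):

```python
from datetime import datetime
from airflow import DAG

with DAG(
    "load_warehouse",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    max_active_runs=1,    # only one DAG run at a time (no overlapping runs)
    max_active_tasks=5,   # at most 5 task instances of this DAG in parallel
) as dag:
    ...
```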
⭐ 1. Dynamic DAGs (Airflow)
Dynamic DAGs = DAGs that are generated programmatically instead of hardcoding tasks.
Example: see the sketch after this list.
Why use Dynamic DAGs?
- Automatically create tasks for multiple tables/files
- Avoid writing duplicate code
- Perfect for pipelines with 20–500 tables
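A sketch of generating one task per table in a loop (the table list and callable are placeholders):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

TABLES = ["orders", "customers", "payments"]   # illustrative table list

def load_table(table_name):
    print(f"loading {table_name}")

with DAG("dynamic_tables", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    for table in TABLES:
        # One task per table, generated at DAG-parse time
        PythonOperator(
            task_id=f"load_{table}",
            python_callable=load_table,
            op_kwargs={"table_name": table},
        )
```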
⭐ 2. Dynamic Tasks (Task Mapping in Airflow 2.3+)
Task mapping = Airflow automatically creates multiple task instances at runtime.
Example (best interview answer): see the sketch after this list.
Why Task Mapping is powerful:
- Dynamically generates tasks at runtime
- No DAG parsing overhead (unlike old dynamic DAGs)
- Much cleaner & more scalable
Example use cases:
- Load 100 S3 files
- Process N partitions
- Trigger N API calls
- Run ML jobs for each model
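A minimal task-mapping sketch using the TaskFlow API (requires Airflow 2.3+; file names are placeholders):

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False)
def mapped_pipeline():
    @task
    def list_files():
        # In practice this could list S3 keys; here a static placeholder list
        return ["file_a.csv", "file_b.csv", "file_c.csv"]

    @task
    def process(file_name: str):
        print(f"processing {file_name}")

    # expand() creates one task instance per element at runtime
    process.expand(file_name=list_files())

mapped_pipeline()
```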
⭐ 3. Avoiding DAG Explosion
DAG Explosion = too many tasks or too many DAGs, causing:
- Slow UI
- Scheduler overload
- Metadata DB pressure
- DAG parsing delays
Causes:
- Generating thousands of tasks in the DAG file
- Creating DAGs dynamically for each table (e.g., 100 tables → 100 DAGs)
Solutions:
- Use task mapping
- Use TaskGroup
- Batch tasks
- Push dynamic behavior to runtime, not DAG-file parse time
⭐ 4. TaskGroup (Organizing Large DAGs)
TaskGroup = visual and logical grouping of tasks.
Example: see the sketch after this list.
Why TaskGroup is used:
- Organize DAGs with 50+ tasks
- Avoid clutter in the Airflow UI
- Easier debugging
- Logical grouping like:
  - extract group
  - transform group
  - load group
- Not for isolation — only for visual and logical grouping.
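A TaskGroup sketch with placeholder tasks (EmptyOperator requires Airflow 2.3+):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

with DAG("grouped_etl", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    with TaskGroup("extract") as extract:
        EmptyOperator(task_id="pull_orders")
        EmptyOperator(task_id="pull_customers")

    with TaskGroup("transform") as transform:
        EmptyOperator(task_id="clean")
        EmptyOperator(task_id="join")

    load = EmptyOperator(task_id="load")

    # Groups can be wired up just like individual tasks
    extract >> transform >> load
```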
=======================================================================
DAG View
=======================================================================
XCom (Cross-Communication)
XCom (short for “cross-communication”) allows tasks to exchange small amounts of data between each other in a DAG.
How XCom Works
- Push → send data to XCom from one task
- Pull → retrieve that data in another task
What XCom Should NOT Be Used For
Very important for interviews:
❌ Do NOT pass large datasets
❌ Not meant for files
❌ Not used for DataFrames
❌ Not used for binary data
Use XCom only for small metadata, like:
- file paths
- S3 keys
- table names
- row counts
Types of XCom in Airflow
There are 3 main types of XCom you must know:
✅ 1. Default / Implicit XCom (PythonOperator return value)
- When a PythonOperator function returns a value, Airflow automatically pushes it to XCom.
- No need to write xcom_push() manually.
Example: see the sketch below.
✔ Automatically becomes an XCom value
✔ Most commonly used type
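A sketch of the implicit (return value) pattern, with placeholder task ids:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_row_count():
    # The return value is automatically pushed to XCom (key "return_value")
    return 42

def report(ti):
    count = ti.xcom_pull(task_ids="extract_row_count")
    print(f"row count: {count}")

with DAG("implicit_xcom", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    extract = PythonOperator(task_id="extract_row_count", python_callable=extract_row_count)
    notify = PythonOperator(task_id="report", python_callable=report)
    extract >> notify
```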
✅ 2. Manual XCom (Explicit push & pull)
Used when you want full control.
Push and pull: see the sketch below.
✔ Used when you need custom key names
✔ Useful when returning multiple values
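A sketch of explicit xcom_push / xcom_pull with a custom key (key, value, and task ids are illustrative; these functions would be used as python_callable in PythonOperator tasks named push_path and pull_path):

```python
def push_path(ti):
    # Explicit push with a custom key
    ti.xcom_push(key="s3_key", value="landing/2024-01-01/data.csv")

def pull_path(ti):
    # Explicit pull by upstream task id and key
    path = ti.xcom_pull(task_ids="push_path", key="s3_key")
    print(f"pulled path: {path}")
```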
✅ 3. TaskFlow API XCom (@task decorator)
This works like implicit XCom but with cleaner syntax using the TaskFlow API.
Example: see the sketch below.
✔ Return values automatically become XComs
✔ Passing function outputs becomes easier
✔ Preferred in modern Airflow (2.x)
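A TaskFlow sketch where the return value flows between tasks via XCom automatically (DAG id and payload are placeholders):

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2024, 1, 1), schedule_interval=None, catchup=False)
def taskflow_xcom():
    @task
    def extract():
        return {"rows": 42}          # pushed to XCom automatically

    @task
    def report(payload: dict):
        print(f"rows extracted: {payload['rows']}")

    report(extract())                # the XCom pull happens behind the scenes

taskflow_xcom()
```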
=======================================================================
Airflow Variables
Airflow Variables are key-value pairs used to store and retrieve dynamic configurations in your DAGs and tasks.
=======================================================================
Apache Airflow Sensors
Sensors are special types of operators in Airflow that wait for a condition to be true before allowing downstream tasks to proceed.
=======================================================================
Branching in Apache Airflow
Branching allows you to dynamically choose one (or more) downstream paths from a set of tasks based on logic. This is done using the BranchPythonOperator.
Why Use Branching?
Branching is useful when:
- You want to run different tasks based on a condition
- You need to skip certain tasks
- You want "if/else" logic in your DAG
✅ Notes:
- Tasks not returned by BranchPythonOperator will be skipped.
- You can return a single task ID or a list of task IDs.
- Ensure your downstream tasks can handle being skipped, or use an appropriate trigger_rule.
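A branching sketch (the condition and task names are placeholders; EmptyOperator requires Airflow 2.3+):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator

def choose_path(ds, **_):
    # Pick a downstream task id based on the logical date (illustrative condition)
    return "full_load" if ds.endswith("-01") else "incremental_load"

with DAG("branch_example", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_path)
    full_load = EmptyOperator(task_id="full_load")
    incremental_load = EmptyOperator(task_id="incremental_load")
    # Runs even though one upstream branch is skipped
    done = EmptyOperator(task_id="done", trigger_rule="none_failed_min_one_success")

    branch >> [full_load, incremental_load] >> done
```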
=======================================================================
SubDAG
What is a SubDAG in Apache Airflow?
A SubDAG is a DAG within a DAG — essentially, a child DAG defined inside a parent DAG. It's used to logically group related tasks together and reuse workflow patterns, making complex DAGs easier to manage. (The SubDagOperator is deprecated in Airflow 2.x in favor of TaskGroups.)
Think of a SubDAG as a modular block that can be reused or organized separately.
TaskGroup in Apache Airflow
A TaskGroup in Airflow is a way to visually and logically group tasks together in the UI without creating a separate DAG like SubDagOperator. It's lightweight, easier to use, and the recommended approach in Airflow 2.x+.
=======================================================================
Edge Labels in Apache Airflow
Edge labels in Airflow are annotations you can add to the edges (arrows) between tasks in the DAG graph view. They help clarify why one task depends on another, especially when using complex branching, conditionals, or trigger rules.
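A minimal edge-label sketch (task names are placeholders; EmptyOperator requires Airflow 2.3+):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.edgemodifier import Label

with DAG("edge_labels", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    check = EmptyOperator(task_id="check")
    load = EmptyOperator(task_id="load")

    # The label appears on the arrow between the two tasks in the Graph view
    check >> Label("file found") >> load
```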
==========================================================
⭐ 1. Catchup
Definition:
Airflow automatically creates DAG Runs for all missed schedule intervals since the DAG’s start_date.
| Feature | Details |
|---|---|
| Parameter | catchup=True/False |
| Default | True |
| Behavior | Creates runs for all past intervals until today |
| Use Case | When you want to process historical data automatically |
Example:
DAG start date = Jan 1, today = Jan 5, daily DAG, catchup=True → DAG runs for Jan 1,2,3,4,5
⭐ 2. Backfill
Definition:
Manually run DAG runs for specific past dates, regardless of catchup setting.
| Feature | Details |
|---|---|
| Command | airflow dags backfill -s 2024-01-01 -e 2024-01-05 my_dag |
| Behavior | Forces DAG to run historical intervals |
| Use Case | Missed runs, reprocessing, fixing failed jobs |
✅ Backfill is manual and selective, unlike catchup which is automatic.
⭐ 3. Manual Run
Definition:
Trigger a DAG run manually at any time, usually for testing or ad-hoc runs.
| Feature | Details |
|---|---|
| Method | Airflow UI → Trigger DAG, or CLI → airflow dags trigger my_dag |
| Behavior | Creates a single DAG run immediately |
| Use Case | Test DAG, ad-hoc execution, debugging |
⭐ Comparison Table
| Feature | Automatic/Manual | Purpose | Example |
|---|---|---|---|
| Catchup | Automatic | Run all missed DAG runs | catchup=True → process Jan 1–5 automatically |
| Backfill | Manual | Run specific historical DAG runs | airflow dags backfill -s Jan1 -e Jan5 |
| Manual Run | Manual | Trigger DAG on demand | Click “Trigger DAG” in UI or CLI command |
⭐ Logical Date vs These Runs (Important!)
- Catchup → generates DAG runs using logical dates for past intervals
- Backfill → same, but for manually specified dates
- Manual Run → the logical date can be specified manually or defaults to the current timestamp