Airflow


Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows, commonly used to manage complex data pipelines.

A workflow in Airflow is a DAG (Directed Acyclic Graph), which defines a set of tasks and their execution order, dependencies, and scheduling.

A DAG (Directed Acyclic Graph) represents a workflow as a collection of tasks together with the dependencies between them.

In Apache Airflow, a task is the smallest unit of work within a workflow (DAG). Each task represents a single operation or action, such as running a Python function, executing a SQL query, or triggering a bash command. Each task is identified by a task ID, and its behavior is defined by an operator.

Apache Airflow Scheduler is a core component responsible for triggering task instances to run in accordance with the defined Directed Acyclic Graphs (DAGs) and their schedules.

In Apache Airflow, an Executor is the component responsible for actually running the tasks defined in your workflows (DAGs). It takes task instances that the Scheduler determines are ready and orchestrates their execution either locally or on remote workers.
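
A minimal sketch of a DAG with two tasks and a dependency (the DAG ID, schedule, and callable are illustrative):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def say_hello():
    # Simple Python callable run by the PythonOperator task
    print("hello from Airflow")


with DAG(
    dag_id="example_etl",              # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = PythonOperator(task_id="transform", python_callable=say_hello)

    # extract must finish before transform starts
    extract >> transform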




=======================================================================

πŸ”Ή What is Docker?

Docker is a platform used to:

  • Package an application and its dependencies into a container

  • Ensure the application runs the same across all environments

A Docker container is a lightweight, standalone, and executable package that includes everything needed to run a piece of software: code, libraries, environment variables, and config files.

🐳 What is a Dockerfile?

  • A Dockerfile is a text file that contains all the instructions to build a Docker image.

  • It defines the environment, dependencies, and commands your application needs to run consistently on any machine.

  • Think of it as a recipe for your container.

# Step 1: Base image
FROM python:3.11-slim

# Step 2: Set working directory inside container
WORKDIR /app

# Step 3: Copy your project files into container
COPY requirements.txt .

# Step 4: Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Step 5: Copy all code
COPY . .

# Step 6: Set environment variables (optional)
ENV PYTHONUNBUFFERED=1

# Step 7: Define default command to run
CMD ["python", "main.py"]

πŸ”Ή Step-by-Step Explanation

  1. FROM python:3.11-slim

    • Base image with Python installed. Slim version = smaller image.

  2. WORKDIR /app

    • Sets working directory inside container.

  3. COPY requirements.txt .

    • Copies dependency file into container.

  4. RUN pip install ...

    • Installs Python packages inside container.

  5. COPY . .

    • Copies your ETL or Airflow scripts into container.

  6. ENV PYTHONUNBUFFERED=1

    • Makes Python logs visible immediately (useful for debugging).

  7. CMD ["python", "main.py"]

    • Default command when container starts. Can be your ETL job or Airflow task script.


πŸ”Ή Useful Commands

  1. Build Docker image

docker build -t my-data-engineer-image .

  2. Run a container

docker run -it --rm my-data-engineer-image

  3. Run with a mounted volume (edit locally, changes reflect inside the container)

docker run -v $(pwd):/app -it my-data-engineer-image

  4. Push to Docker Hub / registry

docker tag my-data-engineer-image username/my-image:latest
docker push username/my-image:latest

---------------------------------------------------

🐳 What is Docker Compose?

  • Docker Compose is a tool to define and run multi-container Docker applications.

  • Instead of running each container individually, you define all services in a single docker-compose.yml file.

  • You can spin up the whole environment with one command:

    docker-compose up

version: '3.8'

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  airflow-webserver:
    image: apache/airflow:2.7.1-python3.11
    depends_on:
      - postgres
    environment:
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins
    ports:
      - "8080:8080"
    command: webserver

volumes:
  postgres_data:

πŸ”Ή Run Commands

  1. Build & start all services:

docker-compose up --build

  2. Run in detached mode (background):

docker-compose up -d

  3. Stop all containers:

docker-compose down

  4. View logs of a service:

docker-compose logs airflow-webserver
----------------------------------------------------------------------------

πŸ“„ What is requirements.txt?

  • It’s a text file listing all the Python packages your project needs.

  • Used by pip to install dependencies: pip install -r requirements.txt



=======================================================================

⭐ 1. Airflow Connections

Connections = saved credentials for external systems.

Examples:

  • AWS

  • Snowflake

  • Postgres

  • MySQL

  • BigQuery

  • S3

  • Kafka

  • Redshift

πŸ”Ή How to Set Connections

A) Using Airflow UI

  1. Go to Admin → Connections

  2. Click + Add

  3. Fill:

    • Conn ID → aws_default

    • Conn Type → Amazon Web Services

    • Extra → JSON (keys, region, endpoint)

  4. Save


B) Using CLI

airflow connections add my_postgres \
    --conn-type postgres \
    --conn-host localhost \
    --conn-login user \
    --conn-password pass \
    --conn-schema mydb \
    --conn-port 5432

C) Using Environment Variables

Format:

AIRFLOW_CONN_<CONN_ID>=<connection_uri>

Example:

export AIRFLOW_CONN_MY_PG="postgresql://user:pass@host:5432/db"

This is very common in Docker/Kubernetes.
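
However the connection is created (UI, CLI, or environment variable), task code can look it up through a hook or BaseHook. A minimal sketch, assuming a connection with ID my_pg exists:

from airflow.hooks.base import BaseHook

# Fetch the connection stored under conn_id "my_pg" (hypothetical ID)
conn = BaseHook.get_connection("my_pg")
print(conn.host, conn.port, conn.schema)  # credentials stay out of DAG code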


⭐ 2. Airflow Variables

Variables = key–value store for configuration.

Example:

  • file_path

  • S3 bucket name

  • threshold value

  • list of emails


πŸ”Ή How to Set Variables

A) Using UI

Admin → Variables → Add

B) Using CLI

airflow variables set file_path /data/raw/
airflow variables get file_path

C) Using JSON IMPORT

airflow variables import variables.json

D) Using Environment Variables

export AIRFLOW_VAR_BUCKET="my_bucket"

Usage inside DAG:

from airflow.models import Variable

bucket = Variable.get("bucket")
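
Variable.get also supports default values and JSON deserialization; a short sketch (the variable names are illustrative):

from airflow.models import Variable

# Fall back to a default if the variable is not set
file_path = Variable.get("file_path", default_var="/data/raw/")

# Store a JSON list in one variable and parse it on read
emails = Variable.get("alert_emails", default_var=["a@example.com"], deserialize_json=True)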

⭐ 3. Airflow Secret Backends (Very Important for Data Engineers)

Airflow supports managing secrets securely using external systems.

πŸ”Ή Supported Secret Backends:

  1. AWS Secrets Manager

  2. GCP Secret Manager

  3. Hashicorp Vault

  4. Azure Key Vault

  5. Custom secret backends


Why use secret backends?

  • Secrets are not stored in Airflow DB

  • Rotated automatically

  • Secure & centralized

  • Avoid plaintext passwords in Airflow UI


πŸ”Ή Example: Using AWS Secrets Manager

Add to airflow.cfg:

[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}

AWS Secret Format:

airflow/connections/my_postgres

Used automatically in DAG:

conn = BaseHook.get_connection("my_postgres")

⭐ 4. Best Practices for Storing Credentials (MOST IMPORTANT)

πŸ” 1. NEVER store passwords in code

password="abcd1234"

✔ Use:

conn = BaseHook.get_connection("my_db")

πŸ” 2. Avoid storing secrets in Airflow Variables

Variables are NOT encrypted by default.


πŸ” 3. Use Secret Backends for all production credentials

  • AWS Secrets Manager

  • GCP Secret Manager

  • Hashicorp Vault


πŸ” 4. Use environment variables for local development

Safe and temporary.


πŸ” 5. Do not store credentials in GitHub / repo

Always use:

  • .env

  • Kubernetes Secrets

  • Docker Secrets


πŸ” 6. Use different connection IDs for dev/stage/prod

Example:

  • aws_dev

  • aws_stage

  • aws_prod


πŸ” 7. Use JSON "extra" field for complex configs

Example Extra field in UI:

{
  "region_name": "ap-south-1",
  "role_arn": "arn:aws:iam::12345:role/my-role"
}

=========================================================================================================

Operators

Operators are Python classes that define a template for a specific unit of work (task) in a workflow. When you instantiate an operator in a DAG, it becomes a task that Airflow executes. Operators encapsulate the logic required to perform a defined action or job.

Each operator represents a single task in the workflow, such as running a script, moving data, or checking if a file exists.


Operators = do something
Sensors = wait for something
Hooks = connection to systems (S3Hook, PostgresHook, etc.)
Executors = how tasks run (Local, Celery, Kubernetes)
Scheduler = creates DAG Runs + task instances

Types of Operators

In Apache Airflow, Operators are the building blocks of your workflows (DAGs). Each operator defines a single task to be executed. There are different types of operators based on the type of work they perform.

Operators fall into three broad categories:

  1. Action Operators:
    Perform an action like running code or sending an email.

    Examples:

  • PythonOperator to run a Python function
  • BashOperator to run shell commands
  • EmailOperator to send emails
  • SimpleHttpOperator to interact with APIs


  2. Transfer Operators:
    Move data between systems or different storage locations.

    Examples:

  • S3ToRedshiftOperator
  • MySqlToGoogleCloudStorageOperator

  3. Sensor Operators:
    Wait for a certain event or external condition before proceeding.

    Examples:

  • FileSensor waits for a file to appear
  • ExternalTaskSensor waits for another task to complete
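
A quick sketch of how instantiating operators creates tasks (it assumes a dag object already exists; IDs and the callable are placeholders):

from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def clean_data():
    print("cleaning")


# Each operator instance becomes one task, identified by its task_id
download = BashOperator(task_id="download_file", bash_command="echo downloading", dag=dag)
clean = PythonOperator(task_id="clean_data", python_callable=clean_data, dag=dag)

download >> clean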






=======================================================================

Apache Airflow commands


==================================================================
Dependencies

πŸ”— chain() Function

The chain() function lives in airflow.models.baseoperator (previously in airflow.utils.helpers) and helps you connect multiple tasks or task groups in a sequence without writing task_1 >> task_2 >> task_3 manually.
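
A minimal sketch of chain() (the task objects are assumed to already exist):

from airflow.models.baseoperator import chain

# Equivalent to: extract >> transform >> load >> notify
chain(extract, transform, load, notify)

# Lists create parallel branches: t1 >> [t2, t3] >> t4
chain(t1, [t2, t3], t4)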


=======================================================================

⭐ 1. Airflow Scheduling Basics

Airflow schedules based on:

  • cron expressions

  • timetables

  • logical date

  • catchup

  • backfill

A DAG run does NOT start at the exact cron time—it starts after the logical interval finishes.


🟦 2. Cron Expressions in Airflow

Cron = when to run the DAG.

Examples:

  • 0 0 * * * → Every midnight
  • 0 */2 * * * → Every 2 hours
  • 0 6 * * 1 → Every Monday at 6 AM
  • */5 * * * * → Every 5 minutes

Airflow uses cron to define the start of the schedule interval, but the DAG runs after the interval finishes.


🟦 3. Timetables (Airflow 2.2+)

Timetables = new, flexible scheduling system.

Useful when cron is not enough.

Examples:

  • Run DAG every business day except holidays

  • Run every 3 hours between 9–5

  • Run based on dataset dependencies

  • Run after an upstream dataset is updated

Timetables replace schedule_interval for advanced cases.


🟦 4. Catchup vs No Catchup

  • catchup=True → Airflow creates DAG Runs for all past dates since the start date
  • catchup=False → Airflow only runs the latest DAG run and skips historical dates

Example:

DAG start date = Jan 1
Today = Jan 5
Schedule = daily

  • catchup=True → runs created for Jan 1, 2, 3, 4, 5 (5 runs)
  • catchup=False → only Jan 5 (latest run)

🟦 5. Backfill (Manual Catchup)

Backfill = you manually run past dates even if catchup=False.

Command:

airflow dags backfill -s 2024-01-01 -e 2024-01-05 my_dag

Purpose:

  • Re-run historical data

  • Fix missed data loads

  • Reprocess partitions


🟦 6. Logical Date (MOST IMPORTANT)

Logical date = the data interval the DAG run is processing.

It is not the actual time the run starts.

Example:

Schedule: Daily
Cron: 0 0 * * * (midnight, i.e., start of interval)

DAG Run with logical date: 2024-10-10 00:00
Run actually starts: after the data interval ends, i.e., around 2024-10-11 00:00

Why this is important?

  • All tasks use logical_date for:

    • file paths

    • S3 partitions

    • SQL date parameters

    • templated variables ({{ ds }} etc.)

Think of it like:

Logical date = data date
Execution date = same as logical date (Airflow 2.2+)
NOT the real-time the task runs


🟣 Logical Date Example (Simple)

Schedule = daily
Interval = 2024-01-01 00:00 → 2024-01-02 00:00

The DAG run for this interval actually starts around 2024-01-02 00:01, but:

{{ ds }} = 2024-01-01

Because that’s the interval start (logical date).
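
For example, templated fields receive the logical date; a hedged sketch (the bucket path is made up and the task is assumed to live inside a DAG):

from airflow.operators.bash import BashOperator

# {{ ds }} renders to the logical date (interval start), e.g. 2024-01-01
copy_partition = BashOperator(
    task_id="copy_partition",
    bash_command="echo processing s3://my-bucket/raw/dt={{ ds }}/",
)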

=======================================================

Cron

cron expressions are used in the schedule_interval parameter of a DAG to define when the DAG should run.
A cron expression is a string representing a schedule; it tells Airflow how often to run a DAG.

Other schedule values:

None → never scheduled; the DAG only runs when triggered manually
@once → run the DAG only once





=======================================================================

πŸ”Ή What Are Hooks in Apache Airflow?

Hooks in Airflow are interfaces to external platforms, like databases, cloud storage, APIs, and more. They abstract the connection and authentication logic, allowing operators to use these services easily.

Hooks are mostly used behind the scenes by Operators, but you can also call them directly in Python functions.

πŸ”Έ Why Use Hooks?

  • Reusable connection logic

  • Securely use Airflow's connection system (Airflow Connections UI)

  • Simplifies integrating with external systems (e.g., MySQL, S3, BigQuery, Snowflake)

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.utils.dates import days_ago

def get_data():
    hook = PostgresHook(postgres_conn_id='my_postgres')
    connection = hook.get_conn()
    cursor = connection.cursor()
    cursor.execute("SELECT COUNT(*) FROM my_table;")
    result = cursor.fetchone()
    print(f"Row count: {result[0]}")

with DAG('postgres_hook_example',
         start_date=days_ago(1),
         schedule_interval=None,
         catchup=False) as dag:

    t1 = PythonOperator(
        task_id='count_rows',
        python_callable=get_data
    )

=======================================================================

☁️ What is S3Hook?

  • S3Hook is a helper class in Airflow to interact with Amazon S3.

  • It abstracts the boto3 (AWS SDK for Python) operations so you can read/write files, list buckets, check if objects exist, etc., directly in your DAGs.

  • Comes from: from airflow.providers.amazon.aws.hooks.s3 import S3Hook (Airflow 2+)

πŸ”Ή When to Use S3Hook

  1. You want to upload a file to S3 from Airflow.

  2. You want to download a file from S3 for processing.

  3. You want to check if a key/object exists before running a task.

  4. You want to list files in a bucket dynamically.
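
A rough sketch of those operations, assuming an aws_default connection and a bucket called my-bucket exist:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def move_report():
    hook = S3Hook(aws_conn_id="aws_default")

    # Check whether an object exists before doing anything else
    if not hook.check_for_key(key="raw/input.csv", bucket_name="my-bucket"):
        raise ValueError("input file missing")

    # Upload a local file, then list what is under the prefix
    hook.load_file(
        filename="/tmp/report.csv",
        key="reports/report.csv",
        bucket_name="my-bucket",
        replace=True,
    )
    print(hook.list_keys(bucket_name="my-bucket", prefix="reports/"))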

=======================================================================


Defining dags





=======================================================================

🧠 What Does an Executor Do?

It communicates with the Scheduler and runs the tasks defined in your DAGs—either locally, in parallel, or on distributed systems like Celery or Kubernetes.

→ airflow info shows which executor is currently configured.




=======================================================================

In Apache Airflow, an SLA (Service Level Agreement) is a time-based constraint that you can apply to a task to ensure it finishes within a defined timeframe. If it misses the deadline, Airflow can alert or log the SLA miss.

🧠 Why Use SLAs?

To ensure:

  • Timely data availability

  • Reliable pipeline performance

  • Alerting for delays or failures


🧩 How SLA Works in Airflow

  • SLA is defined per task, not per DAG.

  • If a task has not finished within the SLA window, it's marked as an SLA miss.

  • Airflow triggers an SLA miss callback and logs the event.

  • Email alerts can be sent if configured.
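
A hedged sketch of attaching an SLA to a task (the 30-minute window and names are illustrative; the task is assumed to sit inside a DAG):

from datetime import timedelta

from airflow.operators.bash import BashOperator

# The task should finish within 30 minutes of the DAG run's scheduled start
load_warehouse = BashOperator(
    task_id="load_warehouse",
    bash_command="echo loading",
    sla=timedelta(minutes=30),
)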


πŸ“Š Monitoring SLA Misses

  • Go to Airflow UI > DAGs> Browse > SLA Misses

  • Or check the Task Instance Details


⚠️ Notes

  • SLA misses are evaluated by the scheduler, so they may only be reported after the affected tasks have finished.

  • The SLA window is measured from the DAG run's scheduled start, not from when the task actually starts.

  • An SLA miss doesn't retry or fail the task; it only logs the violation and triggers any configured callback or email alert.

=======================================================================

πŸ”§ What is a Template?

A template is a string that contains placeholders which are evaluated at runtime using the Jinja2 engine.

In Apache Airflow, templates allow you to dynamically generate values at runtime using Jinja templating (similar to templating in Flask or Django). They are useful when you want task parameters to depend on the execution context, such as the date, DAG ID, or other dynamic values.



=======================================================================

Jinja is a templating engine for Python used heavily in Apache Airflow to dynamically render strings using runtime context. It lets you inject variables, logic, and macros into your task parameters.

πŸ” What is Jinja?

Jinja is a mini-language similar to Django or Liquid templates. In Airflow, it's used for:

  • Creating dynamic file paths

  • Modifying behavior based on execution date

  • Using control structures like loops and conditions

🧾 DAG Example (Manual filename via params)
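
A minimal sketch of such a DAG (the filename, IDs, and command are assumptions):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="jinja_params_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    params={"filename": "input.csv"},  # default, can be overridden when triggering
) as dag:
    # {{ params.filename }} and {{ ds }} are rendered by Jinja at runtime
    process = BashOperator(
        task_id="process_file",
        bash_command="echo processing {{ params.filename }} for {{ ds }}",
    )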

=======================================================================

✅ What is catchup?

When catchup=True (default), Airflow will "catch up" by running all the DAG runs from the start_date to the current date.

When catchup=False, it only runs the latest scheduled DAG run from the time it is triggered.


=======================================================================

πŸ” What is Backfill?

Backfill is the process of running a DAG for past scheduled intervals that have not yet been run.

When a DAG is created or modified with a start_date in the past, Airflow can "backfill" to ensure that all scheduled intervals between the start_date and now are executed.




=======================================================================

πŸ”§ Components of Apache Airflow

Apache Airflow is made up of several core components that work together to orchestrate workflows:

  • Scheduler: The brain of Airflow that monitors DAGs and tasks, triggers DAG runs based on schedules or events, and submits tasks to the executor for execution. It continuously checks dependencies and task states to decide what to run next. By default it can take up to 5 minutes for a new DAG file in the DAGs folder to be detected (dag_dir_list_interval), and the scheduler loop looks for runnable tasks every few seconds.
  • Executor: Executes task instances assigned by the scheduler. It can run tasks locally, via distributed workers, or on containerized environments depending on the executor type (LocalExecutor, CeleryExecutor, KubernetesExecutor, etc.).
  • Workers: Machines or processes (depending on executor) that actually run the task code. For distributed executors like Celery or Kubernetes, multiple workers run tasks in parallel, scaling out capacity.
  • Metadata Database: A relational database (e.g., PostgreSQL, MySQL) that stores all Airflow metadata: DAG definitions, task states, execution history, logs, connection info, and more. The scheduler, workers, and webserver interact with it constantly.
  • Webserver (UI): Provides a user interface to monitor DAG runs, task status, logs, and overall workflow health. In Airflow 2.x it is a Flask-based application that also serves the REST API for external clients.
  • DAGs Folder: Directory or location where DAG definition Python files live. These files describe the workflows and are parsed by the scheduler or DAG processor.

🟦 What is Airflow Scheduler?

The Airflow Scheduler is the component responsible for triggering DAG runs and executing tasks at the right time based on the DAG’s schedule, dependencies, and state.

πŸ“Œ It is the “brain” of Airflow.


🟣 What does the Airflow Scheduler do?

The scheduler continuously:

  • Monitors DAGs: Watches all DAG files for new/updated DAGs.
  • Creates DAG Runs: Starts DAG runs at the scheduled intervals.
  • Checks Dependencies: Ensures upstream tasks are finished before running the next task.
  • Queues Tasks: Decides which tasks are ready to run.
  • Sends tasks to Executor: Hands tasks to workers (Local/Celery/K8s).
  • Handles retries: If a task fails, the scheduler triggers retries.
  • Manages SLAs: Detects SLA misses.

🟦 How the Scheduler Works (Simple Flow)

Parse DAG → Check schedule → Create DAG Run → Check dependencies → Queue tasks → Executor runs tasks

The scheduler loops continuously, making decisions every few seconds.


🟣 Important Concepts for Interviews

1. Scheduling interval

Scheduler respects:

  • schedule_interval

  • start_date

  • end_date

  • catchup


2. Logical Date (Very important!)

Scheduler runs DAGs based on logical execution date, not current time.


3. Executor

Scheduler just queues tasks, but does NOT execute them.
Executor runs the task.

Example executors:

  • LocalExecutor

  • CeleryExecutor

  • KubernetesExecutor


4. Concurrency Controls

Scheduler respects:

  • DAG concurrency

  • Task concurrency

  • Pools

  • Parallelism

These prevent overload.


5. Heartbeats

Scheduler sends a “heartbeat” every few seconds.
If heartbeat stops → scheduler is down.


🟦 Example: Scheduler in Action

If a DAG has:

schedule_interval='@daily'
start_date=datetime(2024, 1, 1)

The scheduler will create DAG runs:

  • 2024-01-01 (logical date)

  • 2024-01-02

  • 2024-01-03

Each run → scheduler checks tasks → queues ready ones.

======================================================

🟦 What is an Executor in Airflow?

An Executor is the Airflow component responsible for actually running the tasks.
While the Scheduler decides what to run,
the Executor decides how and where to run it.

πŸ“Œ Executor = Task runner
πŸ“Œ Scheduler = Task coordinator


🟣 Why Executor is Important?

Executors decide:

  • How many tasks run in parallel

  • Where tasks get executed

  • Whether tasks run locally or on workers or on Kubernetes pods

The choice of executor determines Airflow’s scalability.


🟦 Types of Executors (Must Know)


1. SequentialExecutor

  • ✔ What it is:

    • Runs ONE task at a time

    • Single-threaded

    • No parallelism

    • Default for quick testing

    ✔ Use Cases:

    • Local testing

    • Development / laptop

    • Very small DAGs

    ❌ Not for production.


2. LocalExecutor

  • ✔ What it is:

    • Runs tasks in parallel on the same machine

    • Uses multiple processes/threads

    • Good performance for small pipelines

    ✔ Use Cases:

    • Small teams

    • Single-server Airflow deployments

    • Use case: 10–20 parallel tasks

    ❌ Not suitable for distributed workloads

    ❌ Cannot scale beyond one machine


3. CeleryExecutor

  • ✔ What it is:

    • Distributed task execution

    • Multiple worker machines

    • Uses a message broker:

      • Redis

      • RabbitMQ

    ✔ Use Cases:

    • Medium to large teams

    • Many DAGs running at same time

    • Need dozens or hundreds of parallel tasks

    • On-prem or AWS EC2 deployments

    πŸ‘ Pros

    • Highly scalable

    • Fault-tolerant

    • Good for data engineering teams

    πŸ‘Ž Cons

    • Complex setup (workers + broker + DB)

    • Higher maintenance


4. KubernetesExecutor (Most modern)

  • ✔ What it is:

    • Each task runs in its own Kubernetes pod

    • True elastic scaling

    • Perfect isolation of tasks

    • Clean environment per task

    ✔ Use Cases:

    • Cloud-native setups

    • Very large workloads

    • Need per-task compute scaling

    • Mixed workloads (Python, Spark, Java, Bash, etc.)

    πŸ‘ Pros:

    • Auto-scaling

    • GPU/High-memory pods

    • Per-task docker image support

    πŸ‘Ž Cons:

    • Requires Kubernetes knowledge

    • Complex to manage for small teams


(Bonus) — LocalKubernetesExecutor (Hybrid)

  • LocalExecutor for small tasks

  • KubernetesExecutor for heavy tasks


🟦 How Scheduler and Executor Work Together

Scheduler → Sends task to Executor → Executor launches worker/pod → Task runs → Updates state

🟣 Comparison Table

  • SequentialExecutor: no parallelism, not distributed → testing only
  • LocalExecutor: parallel, not distributed → medium workloads on one machine
  • CeleryExecutor: parallel, distributed → large-scale pipelines
  • KubernetesExecutor: parallel, distributed → cloud-native, scalable workloads

🟦 Best Executor for Data Engineering?

  • Small team, single VM → LocalExecutor
  • Distributed on-prem cluster → CeleryExecutor
  • Cloud environments (AWS/GCP/Azure) → KubernetesExecutor

======================================================

🟦 What is the Airflow Webserver?

The Airflow Webserver is the component that provides the UI (User Interface) for Airflow.
It lets you view, monitor, trigger, pause, and manage DAGs through a browser.

πŸ“Œ Webserver = Airflow UI
πŸ“Œ It shows everything happening inside Airflow.


🟣 What Webserver Does

  • Displays DAGs: Shows all DAGs in the UI
  • Trigger DAGs: You can manually run a DAG
  • Pause/Unpause DAGs: Enable or disable scheduling
  • Graph View: Shows DAG structure (dependencies)
  • Task Logs: View task execution logs
  • Monitor status: Success / Failed / Queued / Running
  • View XCom: See data passed between tasks
  • Manage Connections: Add/edit database or API credentials
  • Variables: Store global values for DAGs
  • Admin Panel: DAG runs, task instances, users, roles

🟦 How Webserver Works (Simple Explanation)

  1. Webserver reads DAG files

  2. Displays DAGs in the UI

  3. Shows scheduler and executor status

  4. Allows user actions (trigger, clear, rerun tasks)

It runs using Flask (Python web framework) behind the scenes.


🟦 Important Ports

Default port:

http://localhost:8080

In production you may use Nginx/HTTPS.


🟣 Webserver vs Scheduler

  • Webserver: UI to view/manage pipelines
  • Scheduler: Decides when tasks should run
  • Executor: Actually runs the tasks

🟦 How to Start the Webserver

airflow webserver

In Docker:

docker-compose up airflow-webserver

=======================================================================


🟦 1. max_active_runs (at DAG level)

Definition

max_active_runs = maximum number of DAG Runs that can run at the same time for a specific DAG.

πŸ“Œ Think:

"How many full pipeline runs can run in parallel?"

Example:

DAG(
    dag_id='etl_pipeline',
    max_active_runs=1,
)

✔ Only one DAG run will run at a time
✖ A new scheduled run will wait until the previous run finishes

πŸ“˜ Why important?

  • Prevents overlapping runs

  • Useful for pipelines that update the same tables

  • Avoids data corruption


🟦 2. concurrency (at DAG level)

Definition

concurrency = maximum number of task instances from the SAME DAG that can run in parallel.

πŸ“Œ Think:

"How many tasks inside this DAG can run at the same time?"

Example:

DAG(
    dag_id='etl_pipeline',
    concurrency=5,
)

✔ Maximum 5 tasks from this DAG can run at once
✖ The 6th task waits in the queue

πŸ“˜ Why important?

  • Controls the load on your system

  • Prevents overwhelming the database, Spark cluster, APIs, etc.
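
Both settings together, as a sketch (the values are arbitrary):

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    max_active_runs=1,  # only one pipeline run at a time
    concurrency=5,      # at most 5 tasks of this DAG in parallel (newer Airflow calls this max_active_tasks)
) as dag:
    ...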


=======================================================================

⭐ 1. Dynamic DAGs (Airflow)

Dynamic DAGs = DAGs that are generated programmatically instead of hardcoding tasks.

Example:

for table in ["customers", "orders", "sales"]:
    PythonOperator(
        task_id=f"load_{table}",
        python_callable=load_table,
        op_kwargs={"table": table},
        dag=dag,
    )

✔ Why use Dynamic DAGs?

  • Automatically create tasks for multiple tables/files

  • Avoid writing duplicate code

  • Perfect for pipelines with 20–500 tables


⭐ 2. Dynamic Tasks (Task Mapping in Airflow 2.3+)

Task mapping = Airflow automatically creates multiple task instances at runtime.

Example (Best Interview Answer):

@task
def load_table(table):
    ...

tables = ["customers", "orders", "sales"]
load_table.expand(table=tables)

✔ Why Task Mapping is powerful:

  • Dynamically generates tasks at runtime

  • No DAG parsing overhead (unlike old dynamic DAGs)

  • Much cleaner & more scalable

✔ Example Use Cases:

  • Load 100 S3 files

  • Process N partitions

  • Trigger N API calls

  • Run ML jobs for each model


⭐ 3. Avoiding DAG Explosion

DAG Explosion = too many tasks or too many DAGs, causing:

  • Slow UI

  • Scheduler overload

  • Metadata DB pressure

  • DAG parsing delays

Causes:

  • Generating thousands of tasks in the DAG file

  • Creating DAGs dynamically for each table (e.g., 100 tables → 100 DAGs)

Solution:

  • Use Task Mapping

  • Use TaskGroup

  • Batch tasks

  • Push dynamic behavior to runtime, not DAG file parse time


⭐ 4. TaskGroup (Organizing Large DAGs)

TaskGroup = visual and logical grouping of tasks.

Example:

with TaskGroup("load_all_tables") as tg:
    for table in tables:
        PythonOperator(
            task_id=f"load_{table}",
            python_callable=load,
            op_kwargs={"table": table},
        )

✔ Why TaskGroup is used:

  • Organize DAGs with 50+ tasks

  • Avoid clutter in Airflow UI

  • Easier debugging

  • Logical grouping like:

    • extract group

    • transform group

    • load group

Not for isolation — only for visual and logical grouping.


=======================================================================

DAG View


=======================================================================

πŸ”„ XCom (Cross-Communication)

XCom (short for “Cross-communication”) allows tasks to exchange small amounts of data between each other in a DAG.

πŸ”§ How XCom Works

  1. Push → Send data to XCom from one task

  2. Pull → Retrieve that data in another task


πŸ”₯ What XCom Should NOT be Used For

Very important for interviews:

❌ Do NOT pass large datasets

❌ Not meant for files
❌ Not used for DataFrames
❌ Not used for binary data

Use XCom only for small metadata, like:

  • file paths

  • S3 keys

  • table names

  • row counts


🟦 Types of XCom in Airflow

There are 3 main types of XCom you must know:


1. Default / Implicit XCom (PythonOperator return value)

  • When a PythonOperator function returns a value, Airflow automatically pushes it to XCom.

  • No need to write xcom_push() manually.

Example:

def task1():
    return "hello"

✔ Automatically becomes an XCom value
✔ Most commonly used type


2. Manual XCom (Explicit push & pull)

Used when you want full control.

Push:

def push_func(**context):
    context['ti'].xcom_push(key='mykey', value='mydata')

Pull:

def pull_func(**context):
    value = context['ti'].xcom_pull(key='mykey', task_ids='push_task')

✔ Used when you need custom key names
✔ Useful when returning multiple values


3. TaskFlow API XCom (@task decorator)

This works like implicit XCom but with cleaner syntax, using the TaskFlow API.

Example:

from airflow.decorators import task

@task
def t1():
    return "hi"

@task
def t2(msg):
    print(msg)

t2(t1())

✔ Return values automatically become XCom
✔ Passing function outputs becomes easier
✔ Preferred in modern Airflow (2.x)

=======================================================================

🧩 Airflow Variables

Airflow Variables are key-value pairs used to store and retrieve dynamic configurations in your DAGs and tasks.


=======================================================================

πŸ›°️ Apache Airflow Sensors

Sensors are special types of operators in Airflow that wait for a condition to be true before allowing downstream tasks to proceed.
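
A small sketch using FileSensor (the path and timings are placeholders; FileSensor uses the fs_default connection unless told otherwise):

from airflow.sensors.filesystem import FileSensor

# Poke every 60 seconds, give up after 1 hour if the file never appears
wait_for_file = FileSensor(
    task_id="wait_for_input_file",
    filepath="/data/incoming/input.csv",
    poke_interval=60,
    timeout=60 * 60,
)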



=======================================================================

🌿 Branching in Apache Airflow

Branching allows you to dynamically choose one (or more) downstream paths from a set of tasks based on logic. This is done using the BranchPythonOperator.

🧠 Why Use Branching?

Branching is useful when:

  • You want to run different tasks based on a condition

  • You need to skip certain tasks

  • You want "if/else" logic in your DAG




✅ Notes:

  • Tasks not returned by BranchPythonOperator will be skipped.

  • You can return a single task ID or a list of task IDs.

  • Ensure your downstream tasks can handle being skipped, or use appropriate trigger_rule.
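
A minimal branching sketch (the condition and task IDs are made up; full_load and incremental_load are assumed to be tasks defined elsewhere in the same DAG):

from airflow.operators.python import BranchPythonOperator


def choose_path(**context):
    # Return the task_id (or list of task_ids) to follow; everything else is skipped
    if context["ds"] == "2024-01-01":
        return "full_load"
    return "incremental_load"


branch = BranchPythonOperator(
    task_id="branch_on_date",
    python_callable=choose_path,
)

branch >> [full_load, incremental_load]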

=======================================================================


Subdag

πŸ”„ What is a SubDAG in Apache Airflow?

A SubDAG is a DAG within a DAG — essentially, a child DAG defined inside a parent DAG. It's used to logically group related tasks together and reuse workflow patterns, making complex DAGs easier to manage.

πŸ“Œ Think of a SubDAG as a modular block that can be reused or organized separately.



🧩 TaskGroup in Apache Airflow

A TaskGroup in Airflow is a way to visually and logically group tasks together in the UI without creating a separate DAG like SubDagOperator. It's lightweight, easier to use, and the recommended approach in Airflow 2.x+.



=======================================================================

πŸ”— Edge Labels in Apache Airflow

Edge labels in Airflow are annotations you can add to the edges (arrows) between tasks in the DAG graph view. They help clarify why one task depends on another, especially when using complex branching, conditionals, or TriggerRules.
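
A small sketch using Label (the tasks are assumed to already exist):

from airflow.utils.edgemodifier import Label

# The label text appears on the arrow between tasks in the Graph view
validate >> Label("rows OK") >> load
validate >> Label("rows bad") >> alert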



==========================================================

1. Catchup

Definition:
Airflow automatically creates DAG Runs for all missed schedule intervals since the DAG’s start_date.

  • Parameter: catchup=True/False
  • Default: True
  • Behavior: Creates runs for all past intervals until today
  • Use case: When you want to process historical data automatically

Example:

DAG start date = Jan 1, today = Jan 5, daily DAG, catchup=True → DAG runs for Jan 1,2,3,4,5


2. Backfill

Definition:
Manually run DAG runs for specific past dates, regardless of catchup setting.

  • Command: airflow dags backfill -s 2024-01-01 -e 2024-01-05 my_dag
  • Behavior: Forces the DAG to run historical intervals
  • Use case: Missed runs, reprocessing, fixing failed jobs

✅ Backfill is manual and selective, unlike catchup which is automatic.


3. Manual Run

Definition:
Trigger a DAG run manually at any time, usually for testing or ad-hoc runs.

  • Method: Airflow UI → Trigger DAG, or CLI → airflow dags trigger my_dag
  • Behavior: Creates a single DAG run immediately
  • Use case: Test DAG, ad-hoc execution, debugging

Comparison Table

  • Catchup: automatic; runs all missed DAG runs (catchup=True → process Jan 1–5 automatically)
  • Backfill: manual; runs specific historical DAG runs (airflow dags backfill -s Jan 1 -e Jan 5)
  • Manual Run: manual; triggers a DAG on demand (click “Trigger DAG” in the UI or use the CLI command)

Logical Date vs These Runs (Important!)

  • Catchup → generates DAG runs using logical dates for past intervals

  • Backfill → same, but manually specified dates

  • Manual Run → logical date can be specified manually or default to current timestamp


=======================================================








