Airflow


Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows, commonly used to manage complex data pipelines.

A workflow in Airflow is a DAG (Directed Acyclic Graph), which defines a set of tasks and their execution order, dependencies, and scheduling.

A DAG (Directed Acyclic Graph) represents a workflow as a collection of tasks together with their dependencies.

In Apache Airflow, a task is the smallest unit of work within a workflow (DAG). Each task represents a single operation or action, such as running a Python function, executing a SQL query, or triggering a bash command. A task is referred to by its task_id, and it is defined by instantiating an operator.
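
A minimal sketch of these pieces together; the DAG id, schedule, and command below are made up for illustration:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# A DAG groups tasks and defines when they should run
with DAG(
    dag_id="example_hello_dag",        # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Instantiating an operator creates a task, identified by its task_id
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'hello from Airflow'",
    )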

Apache Airflow Scheduler is a core component responsible for triggering task instances to run in accordance with the defined Directed Acyclic Graphs (DAGs) and their schedules.

In Apache Airflow, an Executor is the component responsible for actually running the tasks defined in your workflows (DAGs). It takes task instances that the Scheduler determines are ready and orchestrates their execution either locally or on remote workers.

=======================================================================

🔹 What is Docker?

Docker is a platform used to:

  • Package an application and its dependencies into a container

  • Ensure the application runs the same across all environments

A Docker container is a lightweight, standalone, and executable package that includes everything needed to run a piece of software: code, libraries, environment variables, and config files.



=======================================================================

Operators are Python classes that define a template for a specific unit of work (task) in a workflow. When you instantiate an operator in a DAG, it becomes a task that Airflow executes. Operators encapsulate the logic required to perform a defined action or job.

Each operator represents a single task in the workflow, such as running a script, moving data, or checking whether a file exists.

Types of operators

In Apache Airflow, Operators are the building blocks of your workflows (DAGs). Each operator defines a single task to be executed. There are different types of operators based on the type of work they perform.

Operators fall into three broad categories (a combined sketch follows this list):

  1. Action Operators:
    Perform an action like running code or sending an email.

    Examples:

  • PythonOperator to run a Python function
  • BashOperator to run shell commands
  • EmailOperator to send emails
  • SimpleHttpOperator to interact with APIs

  2. Transfer Operators:
    Move data between systems or different storage locations.

    Examples:

  • S3ToRedshiftOperator
  • MySqlToGoogleCloudStorageOperator

  3. Sensor Operators:
    Wait for a certain event or external condition before proceeding.

    Examples:

  • FileSensor waits for a file to appear
  • ExternalTaskSensor waits for another task to complete
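
A small sketch of two action operators wired together; it assumes they sit inside a DAG definition like the earlier example, and the task ids, function, and command are made up:

from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def extract():
    print("extracting data...")

# Action operator: run a Python function
extract_data = PythonOperator(
    task_id="extract_data",
    python_callable=extract,
)

# Action operator: run a shell command
notify = BashOperator(
    task_id="notify",
    bash_command="echo 'extract finished'",
)

# Dependency: notify runs only after extract_data succeeds
extract_data >> notify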


=======================================================================

Apache Airflow commands


==================================================================
Dependencies

🔗 chain() Function

The chain() function lives in airflow.models.baseoperator (previously in airflow.utils.helpers) and helps you connect multiple tasks or groups in a sequence without writing task_1 >> task_2 >> task_3 manually.
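
A minimal sketch; the DAG id and bash commands are made up:

from datetime import datetime
from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.bash import BashOperator

with DAG("chain_example", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:

    t1, t2, t3 = [
        BashOperator(task_id=f"task_{i}", bash_command=f"echo step {i}")
        for i in range(1, 4)
    ]

    # Same as writing t1 >> t2 >> t3 by hand
    chain(t1, t2, t3)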


=======================================================================

Cron

Cron expressions are used in the schedule_interval parameter of a DAG to define when the DAG should run.
A cron expression is a string representing a schedule; it tells Airflow how often to run a DAG.

None - don't schedule at all; typically used for DAGs that are only triggered manually
@once - schedule only once
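
A sketch of a few schedule_interval values; the cron string below means 6:00 AM every day, and the DAG id is made up:

from datetime import datetime
from airflow import DAG

# Cron format: minute hour day-of-month month day-of-week
# "0 6 * * *" = every day at 06:00
with DAG("daily_6am_dag", start_date=datetime(2024, 1, 1),
         schedule_interval="0 6 * * *", catchup=False) as dag:
    ...

# Presets also work, e.g. schedule_interval="@daily" or "@once",
# and schedule_interval=None means the DAG only runs when triggered manually.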





=======================================================================

🔹 What Are Hooks in Apache Airflow?

Hooks in Airflow are interfaces to external platforms, like databases, cloud storage, APIs, and more. They abstract the connection and authentication logic, allowing operators to use these services easily.

Hooks are mostly used behind the scenes by Operators, but you can also call them directly in Python functions.

🔸 Why Use Hooks?

  • Reusable connection logic

  • Securely use Airflow's connection system (Airflow Connections UI)

  • Simplifies integrating with external systems (e.g., MySQL, S3, BigQuery, Snowflake)

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.utils.dates import days_ago

def get_data():
    # The hook reuses the connection details stored under the Airflow
    # connection id 'my_postgres' (Admin > Connections in the UI)
    hook = PostgresHook(postgres_conn_id='my_postgres')
    connection = hook.get_conn()
    cursor = connection.cursor()
    cursor.execute("SELECT COUNT(*) FROM my_table;")
    result = cursor.fetchone()
    print(f"Row count: {result[0]}")

with DAG('postgres_hook_example',
         start_date=days_ago(1),
         schedule_interval=None,   # run only when triggered manually
         catchup=False) as dag:

    t1 = PythonOperator(
        task_id='count_rows',
        python_callable=get_data
    )

=======================================================================

Defining DAGs
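
Two common ways to define a DAG in Airflow 2.x, sketched with made-up ids: the classic context-manager style and the TaskFlow @dag decorator.

from datetime import datetime
from airflow import DAG
from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator

# 1. Classic style: the DAG object as a context manager
with DAG("classic_dag", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as classic_dag:
    hello = BashOperator(task_id="hello", bash_command="echo hello")

# 2. TaskFlow style: functions decorated with @dag and @task
@dag(start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False)
def taskflow_dag():
    @task
    def hello():
        print("hello")

    hello()

taskflow_example = taskflow_dag()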





=======================================================================

🧠 What Does an Executor Do?

It communicates with the Scheduler and runs the tasks defined in your DAGs—either locally, in parallel, or on distributed systems like Celery or Kubernetes.

The airflow info command shows, among other details, which executor is currently configured.




=======================================================================

In Apache Airflow, an SLA (Service Level Agreement) is a time-based constraint that you can apply to a task to ensure it finishes within a defined timeframe. If it misses the deadline, Airflow can alert or log the SLA miss.

🧠 Why Use SLAs?

To ensure:

  • Timely data availability

  • Reliable pipeline performance

  • Alerting for delays or failures


🧩 How SLA Works in Airflow

  • SLA is defined per task, not per DAG.

  • If a task takes longer than the SLA, it's marked as an SLA miss.

  • Airflow triggers an SLA miss callback and logs the event.

  • Email alerts can be sent if configured.
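
A sketch of attaching an SLA to a task; the 30-minute window, ids, and callback name are illustrative:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

def on_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Called by Airflow when one or more SLAs are missed
    print(f"SLA missed for tasks: {task_list}")

with DAG("sla_example",
         start_date=datetime(2024, 1, 1),
         schedule_interval="@daily",
         catchup=False,
         sla_miss_callback=on_sla_miss) as dag:

    load = BashOperator(
        task_id="load_data",
        bash_command="sleep 5",
        sla=timedelta(minutes=30),   # SLA window of 30 minutes for this task
    )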


📊 Monitoring SLA Misses

  • Go to Airflow UI > DAGs> Browse > SLA Misses

  • Or check the Task Instance Details


⚠️ Notes

  • SLAs are checked after the DAG run completes.

  • SLAs are about runtime, not start time.

  • SLA doesn’t retry or fail the task—it just logs the violation.

=======================================================================

🔧 What is a Template?

A template is a string that contains placeholders which are evaluated at runtime using the Jinja2 engine.

In Apache Airflow, templates allow you to dynamically generate values at runtime using Jinja templating (similar to templating in Flask or Django). They are useful when you want task parameters to depend on the execution context, such as the date, DAG ID, or other dynamic values.
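
A small sketch of a templated parameter; {{ ds }} is the built-in macro for the execution date, and the DAG and task ids are made up:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("template_example", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    # bash_command is a templated field, so {{ ds }} is rendered at runtime
    # to the execution date, e.g. "2024-01-01"
    print_date = BashOperator(
        task_id="print_date",
        bash_command="echo 'processing data for {{ ds }}'",
    )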



=======================================================================

Jinja is a templating engine for Python used heavily in Apache Airflow to dynamically render strings using runtime context. It lets you inject variables, logic, and macros into your task parameters.

🔍 What is Jinja?

Jinja is a mini-language similar to Django or Liquid templates. In Airflow, it's used for:

  • Creating dynamic file paths

  • Modifying behavior based on execution date

  • Using control structures like loops and conditions

🧾 DAG Example (Manual filename via params)
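
A sketch of what that example could look like; the default filename and ids are made up, and params can be overridden when triggering the DAG manually:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("params_example",
         start_date=datetime(2024, 1, 1),
         schedule_interval=None,
         params={"filename": "data.csv"},   # default value, can be overridden at trigger time
         catchup=False) as dag:

    # {{ params.filename }} is rendered by Jinja before the command runs
    process_file = BashOperator(
        task_id="process_file",
        bash_command="echo 'processing {{ params.filename }}'",
    )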

=======================================================================

✅ What is catchup?

When catchup=True (default), Airflow will "catch up" by running all the DAG runs from the start_date to the current date.

When catchup=False, Airflow skips the missed intervals and only runs the most recent scheduled DAG run.
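
For example, a sketch with a made-up id and a start_date in the past:

from datetime import datetime
from airflow import DAG

# With catchup=True, Airflow would create a run for every missed daily
# interval between start_date and now; with catchup=False only the most
# recent interval is run.
with DAG("catchup_example",
         start_date=datetime(2024, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    ...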


=======================================================================

🔍 What is Backfill?

Backfill is the process of running a DAG for past scheduled intervals that have not yet been run.

When a DAG is created or modified with a start_date in the past, Airflow can "backfill" to ensure that all scheduled intervals between the start_date and now are executed.




=======================================================================

🔧 Components of Apache Airflow

Apache Airflow is made up of several core components that work together to orchestrate workflows:

  • Scheduler: The brain of Airflow that monitors DAGs and tasks, triggers DAG runs based on schedules or events, and submits tasks to the executor for execution. It continuously checks dependencies and task states to decide what to run next. By default it can take up to 5 minutes for Airflow to detect a new DAG in the DAGs folder, and the scheduler scans for new tasks roughly every 4 seconds.

  • Executor: Executes task instances assigned by the scheduler. It can run tasks locally, via distributed workers, or on containerized environments depending on the executor type (LocalExecutor, CeleryExecutor, KubernetesExecutor, etc.).

  • Workers: Machines or processes (depending on the executor) that actually run the task code. For distributed executors like Celery or Kubernetes, multiple workers run tasks in parallel, scaling out capacity.

  • Metadata Database: A relational database (e.g., PostgreSQL, MySQL) that stores all Airflow metadata: DAG definitions, task states, execution history, logs, connection info, and more. The scheduler, workers, and webserver interact with it constantly.

  • Webserver (UI): Provides a user interface to monitor DAG runs, task status, logs, and overall workflow health. Built on a FastAPI server with APIs for workers, UI, and external clients.

  • DAGs Folder: Directory or location where DAG definition Python files live. These files describe the workflows and are parsed by the scheduler or DAG processor.

=======================================================================

DAG View


=======================================================================

🔄 XCom (Cross-Communication)

XCom (short for “Cross-communication”) allows tasks to exchange small amounts of data between each other in a DAG.

🔧 How XCom Works

  1. Push → Send data to XCom from one task

  2. Pull → Retrieve that data in another task
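
A small sketch of push and pull between two PythonOperator tasks (ids are made up): the first task's return value is pushed automatically, and the second pulls it through the task instance:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def push_value():
    # The return value is pushed to XCom automatically (key "return_value")
    return 42

def pull_value(ti):
    # Pull the value pushed by the upstream task
    value = ti.xcom_pull(task_ids="push_task")
    print(f"Pulled value: {value}")

with DAG("xcom_example", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:

    push_task = PythonOperator(task_id="push_task", python_callable=push_value)
    pull_task = PythonOperator(task_id="pull_task", python_callable=pull_value)

    push_task >> pull_task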


⚠️ Limitations

  • Not for large data (use external storage like S3, GCS)

  • Automatically pushed by return values of PythonOperator (Airflow ≥2.0)

  • Can get cluttered in DB if not managed

=======================================================================

🧩 Airflow Variables

Airflow Variables are key-value pairs used to store and retrieve dynamic configurations in your DAGs and tasks.
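
A sketch of reading Variables inside DAG or task code; it assumes Variables named env and pipeline_config were created beforehand (Admin > Variables in the UI or via the CLI):

from airflow.models import Variable

# Plain string Variable, with a fallback if it does not exist
env = Variable.get("env", default_var="dev")

# JSON Variable, deserialized into a Python dict
config = Variable.get("pipeline_config", deserialize_json=True, default_var={})

print(env, config)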


=======================================================================

🛰️ Apache Airflow Sensors

Sensors are special types of operators in Airflow that wait for a condition to be true before allowing downstream tasks to proceed.
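
A sketch using FileSensor; the path and connection id are made up. poke_interval controls how often the sensor checks, timeout caps how long it waits, and mode="reschedule" frees the worker slot between checks:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG("sensor_example", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    wait_for_file = FileSensor(
        task_id="wait_for_file",
        fs_conn_id="fs_default",             # filesystem connection defined in the UI
        filepath="/data/incoming/report.csv",
        poke_interval=300,                   # check every 5 minutes
        timeout=6 * 60 * 60,                 # give up after 6 hours
        mode="reschedule",                   # release the worker slot between pokes
    )

    load = BashOperator(task_id="load", bash_command="echo 'file arrived'")

    wait_for_file >> load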



=======================================================================

🌿 Branching in Apache Airflow

Branching allows you to dynamically choose one (or more) downstream paths from a set of tasks based on logic. This is done using the BranchPythonOperator.

🧠 Why Use Branching?

Branching is useful when:

  • You want to run different tasks based on a condition

  • You need to skip certain tasks

  • You want "if/else" logic in your DAG




✅ Notes:

  • Tasks not returned by BranchPythonOperator will be skipped.

  • You can return a single task ID or a list of task IDs.

  • Ensure your downstream tasks can handle being skipped, or use appropriate trigger_rule.
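
A sketch of if/else branching; the ids and the weekday condition are made up. The callable returns the task_id (or a list of task_ids) to follow, and the branch that is not returned gets skipped:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import BranchPythonOperator

def choose_branch():
    # Hypothetical condition: weekday vs weekend
    if datetime.now().weekday() < 5:
        return "weekday_task"
    return "weekend_task"

with DAG("branch_example", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    branch = BranchPythonOperator(
        task_id="branch",
        python_callable=choose_branch,
    )

    weekday_task = BashOperator(task_id="weekday_task", bash_command="echo weekday")
    weekend_task = BashOperator(task_id="weekend_task", bash_command="echo weekend")

    branch >> [weekday_task, weekend_task]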

=======================================================================


Subdag

🔄 What is a SubDAG in Apache Airflow?

A SubDAG is a DAG within a DAG — essentially, a child DAG defined inside a parent DAG. It's used to logically group related tasks together and reuse workflow patterns, making complex DAGs easier to manage.

📌 Think of a SubDAG as a modular block that can be reused or organized separately.



🧩 TaskGroup in Apache Airflow

A TaskGroup in Airflow is a way to visually and logically group tasks together in the UI without creating a separate DAG like SubDagOperator. It's lightweight, easier to use, and the recommended approach in Airflow 2.x+.
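
A sketch of a TaskGroup that groups two tasks under one collapsible box in the Graph view (ids are made up):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG("taskgroup_example", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    start = BashOperator(task_id="start", bash_command="echo start")

    # Tasks inside the group get prefixed ids like "extract_load.extract"
    with TaskGroup(group_id="extract_load") as extract_load:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        load = BashOperator(task_id="load", bash_command="echo load")
        extract >> load

    end = BashOperator(task_id="end", bash_command="echo end")

    start >> extract_load >> end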



=======================================================================

🔗 Edge Labels in Apache Airflow

Edge labels in Airflow are annotations you can add to the edges (arrows) between tasks in the DAG graph view. They help clarify why one task depends on another, especially when using complex branching, conditionals, or TriggerRules.
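
A small sketch using Label on the arrows between tasks; the ids, commands, and label text are made up:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.edgemodifier import Label

with DAG("edge_label_example", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:

    check = BashOperator(task_id="check", bash_command="echo check")
    clean = BashOperator(task_id="clean", bash_command="echo clean")
    alert = BashOperator(task_id="alert", bash_command="echo alert")

    # The labels appear on the corresponding arrows in the Graph view
    check >> Label("data ok") >> clean
    check >> Label("data invalid") >> alert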











