Docker/Airflow


An open-source platform is a set of software tools and technologies that anyone is free to use, modify, and distribute, because its source code is publicly accessible.

🐳 What is Docker?

  • Docker is a platform to package applications into containers.

  • Docker is an open-source platform that enables developers to build, deploy, run, update, and manage applications using containers, which are lightweight, portable, and self-sufficient units that package an application and its dependencies together.

  • A container is a lightweight, portable, isolated environment that includes your app + dependencies + OS libraries.

  • Think of it as “ship your code with everything it needs” so it runs the same anywhere — your laptop, cloud, or production server.

  • Written in the Go programming language.



🚀 Why Docker Containers Are Lightweight

✅ 1. They Don’t Need a Full Operating System Kernel

A virtual machine (VM) needs:

  • Its own full OS kernel

  • System libraries

  • Drivers

  • Boot process

A Docker container only needs:

  • Your application

  • Dependencies (libraries, Python packages, etc.)

  • A very small OS userland (Ubuntu base, Alpine, etc.)

  • The host’s OS kernel, which it shares instead of bundling its own

So instead of 2–5 GB (VM), a container may be 10–100 MB.


✅ 2. Containers Share Kernel With the Host

No container contains a kernel.

  • Kernel is the heaviest part of an OS

  • All containers use the same Linux kernel

This reduces:

  • Memory usage

  • Startup time

  • CPU overhead


✅ 3. Copy-on-Write File System

Docker uses layered images and UnionFS.

👉 Meaning:

  • If 10 containers use the same base image (like Python 3.10), the base layer is stored once

  • Only the top writable layer is unique per container

So storage & memory are reused efficiently.
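
You can see this layer sharing yourself by inspecting an image's layers; a command sketch, assuming Docker is installed locally:

```shell
docker pull python:3.10        # base layers are downloaded and stored once
docker history python:3.10     # lists the layers that make up the image
```

Containers started from this image reuse those stored layers; only each container's thin writable layer is new.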


✅ 4. Low Overhead for Startup

VM: Boots a full OS → can take minutes
Docker: Just starts a process → starts in <1 second

Because:

  • No BIOS/bootloader

  • No OS boot

  • Only starts your application process


✅ 5. Namespaces & Cgroups

Linux gives Docker:

  • Isolation (namespaces)

  • Resource control (cgroups)

These are kernel features, not heavy virtualization technology.

No hypervisor → less overhead.


✅ 6. Smaller Images (Especially Alpine)

Example:

  • Ubuntu Base Image → 70–100 MB

  • Alpine Linux → 4–6 MB (!!)

So applications are small and fast to ship.
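
As a sketch of how the base image choice shows up in a Dockerfile (tags and steps are illustrative):

```dockerfile
FROM python:3.10-alpine        # Alpine-based Python: much smaller than Debian/Ubuntu bases
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt   # --no-cache-dir keeps the layer small
COPY . .
CMD ["python", "main.py"]
```

Note that Alpine uses musl libc and `apk` instead of `apt`, so some Python packages may need extra build tools.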

=========================================================================

🟢 Docker Benefits (Simple Points)

  1. Consistent environment everywhere

    • Same code runs the same way on any machine (dev, QA, prod).

  2. Lightweight (compared to VMs)

    • Starts in seconds

    • Uses less CPU and RAM

  3. Easy deployment

    • Build once → run anywhere

    • Faster releases

  4. Isolation

    • Each container has its own dependencies

    • No version conflicts

  5. Easy scaling

    • Run multiple containers from one image

    • Good for stream jobs, ETL parallelism

  6. Multi-service setup using Compose

    • Run Airflow + Postgres + Kafka + Redis + Spark together

    • One command: docker-compose up

  7. Cleaner development

    • No need to install databases, Spark, Kafka manually

    • Everything runs inside containers

  8. Better CI/CD

    • Code + dependencies packaged into one image

    • Consistent builds

  9. Secure

    • Apps isolated from host system

  10. Cloud-native

    • Works with Kubernetes, AWS ECS, EKS, GCP GKE, Azure AKS

    • Industry standard

=========================================================================

🧱 Components of Docker

Docker has 5 major components:

  1. Docker Client

  2. Docker Daemon (dockerd)

  3. Docker Images

  4. Docker Containers

  5. Docker Registry

Below is a deep but simple breakdown 👇


1️⃣ Docker Client (CLI)

The client is what you interact with.

The Docker Client (docker CLI) communicates with the daemon using a REST API; the daemon is what actually creates and runs containers from images.

When you run:

docker run nginx
docker build -t myapp .
docker pull ubuntu

You are using the Docker Client.

👉 It sends commands to the Docker Daemon.

The interface through which users interact with Docker. Users issue commands like docker run or docker build via the Docker CLI, which translates these into API calls to the Docker daemon.
docker build
docker run
docker pull


2️⃣ Docker Daemon (dockerd)

Daemon = background service that does the heavy work.

A background service that runs on the host machine and manages Docker objects such as images, containers, networks, and volumes. It listens for API requests from the Docker client and executes container lifecycle operations like starting, stopping, and monitoring containers.

The Docker Engine Daemon (dockerd) runs in the background, listening to API requests and managing objects like images, containers, networks, and volumes.

It is responsible for:

  • Building images

  • Running containers

  • Managing images

  • Managing networks

  • Managing storage

The Docker Client talks to the Daemon using a REST API.


3️⃣ Docker Images

Immutable, read-only templates composed of a series of layers that provide the filesystem and application environment. Images serve as blueprints for creating Docker containers.

python:3.10-slim is an image.

A Docker image is a:

  • Blueprint

  • Read-only template

  • Layered package

It contains:

  • Application code

  • Dependencies

  • Runtime

  • OS libraries

  • Configurations

Images are created using docker build.

A Docker Image is a file made up of multiple layers that contains the instructions to build and run a Docker container. It acts as an executable package that includes everything needed to run an application — code, runtime, libraries, environment variables, and configurations.

How it Works:

  • The image defines how a container should be created.
  • Specifies which software components will run and how they are configured.
  • Once an image is run, it becomes a Docker Container.

4️⃣ Docker Containers

A container is a running instance of an image.

Running instances of Docker images with a writable layer on top, enabling users to execute applications within isolated environments. Containers are lightweight and start quickly compared to traditional virtual machines.
When you run:
docker run python:3.10-slim

Container =

  • Lightweight

  • Portable

  • Isolated process

Created using:

docker run image_name

Multiple containers can run from the same image.
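
For example, a sketch of running two containers from the same image (names and ports are illustrative):

```shell
docker run -d --name web1 -p 8081:80 nginx
docker run -d --name web2 -p 8082:80 nginx
docker ps    # both containers appear, created from the same nginx image
```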


5️⃣ Docker Registry

A registry stores images (Docker Hub / ECR / GCR).

A repository or storage system where Docker images are stored and distributed. Docker Hub is the most popular public registry, while private registries can also be used. The Docker daemon pulls images from registries to create containers or pushes locally built images to registries.

Examples:

  • Docker Hub

  • AWS ECR

  • GitHub Container Registry

  • GCR

  • Azure ACR

Inside a registry we have repositories, and inside repositories we have tags.
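
A full image reference makes this hierarchy visible (shown here for the official nginx image on Docker Hub):

```shell
docker pull docker.io/library/nginx:1.23
#           └───────┘ └───────────┘ └──┘
#            registry   repository   tag
```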


🧩 Additional Components (Advanced)

🔹 6️⃣ Dockerfile

A file containing instructions to build an image.

The Dockerfile is written in a DSL (Domain-Specific Language) and contains the instructions for generating a Docker image, defining the steps needed to produce it quickly and reproducibly. Write the instructions in the correct order, because the Docker daemon executes them from top to bottom.



🔹 7️⃣ Docker Engine

The Docker Engine is the core component that enables Docker to run containers on a system. It follows a client-server architecture and is responsible for building, running, and managing Docker containers.

Core part of Docker containing:

  • Client

  • REST API

  • Daemon

🔹 8️⃣ Docker Compose

Tool to run multi-container apps.

A tool for defining and running multi-container Docker applications using a declarative YAML file.

Example:

  • app container

  • db container

  • redis container

All defined in docker-compose.yml.
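
A minimal docker-compose.yml for that layout might look like the following (image tags and the app's build context are assumptions):

```yaml
version: "3.8"
services:
  app:
    build: .            # app container, built from a local Dockerfile
    depends_on:
      - db
      - redis
  db:
    image: postgres:15  # db container
  redis:
    image: redis:7      # redis container
```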

🔹 9️⃣ Docker Network

Allows communication between containers and with external networks. Docker provides various networking drivers like bridge, overlay, and host to control connectivity for containers.

Provides:

  • Bridge network

  • Host network

  • Overlay network (for Swarm)

  • Container-to-container communication

🔹 🔟 Docker Volumes

A Docker volume is a persistent storage mechanism managed by Docker to store data outside of a container's writable layer. Volumes exist independently of containers, enabling data to persist even after a container is stopped, removed, or recreated. 
volumes:
- dbdata:/var/lib/postgresql/data

Used for persistent storage.

Examples:

  • Databases

  • Logs

  • App data
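
Volumes can also be managed directly from the CLI; a command sketch (the volume name is illustrative):

```shell
docker volume create dbdata      # create a named volume
docker volume ls                 # list all volumes
docker volume inspect dbdata     # show where the data lives on the host
docker run -d -v dbdata:/var/lib/postgresql/data postgres:15
```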

=========================================================================

Containerization vs Virtual Machines

Docker containers share the host operating system's kernel, running isolated applications with just the necessary libraries and binaries. This makes containers lightweight and efficient, using fewer system resources since they do not run a full OS for each instance.
Docker virtualizes only the application layer, while a VM virtualizes the application together with its own OS kernel.

Virtual machines run a full guest operating system on top of a hypervisor, which creates and manages virtualized hardware. Each VM includes a complete OS, consuming significantly more CPU, memory, and storage resources.


🖥️ 1. What is a VM (Virtual Machine)?

Virtual Machine (VM) is a computer inside a computer.

It behaves like a real machine:

  • It has its own Operating System (Windows / Linux / macOS)

  • Its own virtual CPU, RAM, disk, network

Example:
You install Ubuntu Linux on your Windows laptop using VirtualBox.
That Ubuntu runs as a VM.

✔ How it works

A VM includes:

  • BIOS

  • Bootloader

  • Kernel

  • User space

  • Applications

So VMs are heavy and use more resources.


👑 2. What is a Hypervisor?

Hypervisor is the manager that creates and runs Virtual Machines.

It lies between:

  • Hardware (CPU, RAM)

  • VMs

It gives resources to each VM.

Two Types of Hypervisors

Type-1 (Bare Metal)

Runs directly on hardware → faster
Examples:

  • VMware ESXi

  • Microsoft Hyper-V

  • Xen

  • KVM

Type-2 (Hosted)

Runs on top of an operating system → slower
Examples:

  • VirtualBox

  • VMware Workstation


🧠 3. What is a Kernel?

The kernel is the core part of an operating system.

It controls:

  • CPU

  • RAM

  • Disk

  • Network

  • Processes

  • Services

Every OS has a kernel:

  • Linux kernel

  • Windows NT kernel

  • macOS XNU kernel

✔ What kernel does

Kernel manages:

Process management → Runs programs
Memory management → Allocates RAM
Device drivers → Talks to hardware
Networking → Manages internet connections
File systems → Reads/writes files

The kernel is what makes an operating system an operating system.

=========================================================================

🟢 What is a Docker Image?

  • Docker Image is a read-only, immutable file that contains everything your application needs to run:

    • Code (Python scripts, ETL jobs, DAGs)

    • Libraries / dependencies (pandas, PySpark, boto3, Airflow)

    • OS-level tools and environment variables

    • Configuration files

  • It acts as a blueprint for creating Docker containers.

  • Think of it as a blueprint or snapshot of your environment.


🔹 Key Features of a Docker Image

  1. Immutable: Once built, the image doesn’t change.

  2. Versioned: Can tag different versions (my-etl:1.0, my-etl:2.0).

  3. Portable: Can be run anywhere with Docker installed (local, cloud, CI/CD).

  4. Layered: Each command in the Dockerfile creates a new layer, allowing caching and faster builds.


🔹 Analogy

  • Image = Cake Recipe → contains instructions and ingredients.

  • Container = Baked Cake → running instance you can interact with.


🔹 How to Create a Docker Image

Step 1: Create a Dockerfile

# 1. Base image — defines the starting environment
FROM python:3.11-slim

# 2. Set working directory — the folder inside the container where commands will run
WORKDIR /app

# 3. Copy dependency file — moves files from your machine into the image
COPY requirements.txt .

# 4. Install dependencies — runs when building the image (not when the container runs)
RUN pip install --no-cache-dir -r requirements.txt

# 5. Copy your application code
COPY . .

# 6. Default command to run when container starts
CMD ["python", "etl_script.py"]

Step 2: Build the image. Building an image means generating a complete packaged environment for your application, based on the instructions in a Dockerfile.

docker build -t my-etl-image:1.0 .
  • -t my-etl-image:1.0 → gives a name and version tag to the image

  • The image now contains Python + dependencies + your ETL code

When we say “building an image” in Docker, we mean:
🧱 Creating a packaged blueprint of your application
This blueprint (called a Docker Image) contains everything required to run your app:
  • Your code

  • Dependencies (Python packages, JARs, Node modules, etc.)

  • Runtime (Python, Java, Node.js, etc.)

  • OS libraries (Ubuntu, Alpine, etc.)

  • Environment setup

  • Configurations
So building an image = constructing this package from a Dockerfile.

🛠️ What Happens When You Build an Image?
You run (this builds an image from the Dockerfile in the current directory):

docker build -t myapp .

Docker then:
1️⃣ Reads the Dockerfile line by line
Example Dockerfile:
FROM python:3.10
COPY app.py /app/
RUN pip install flask
CMD ["python", "/app/app.py"]

2️⃣ Executes each command and creates layers
Dockerfile step → What Docker does
FROM → Pull base image
COPY → Add your code
RUN → Install dependencies
CMD → Set start command
Each step becomes a layer.

3️⃣ Saves the final layered result as an image
This image can now be:
  • Run as a container

  • Pushed to a registry

  • Shared with others

  • Deployed to Kubernetes

Step 3: Verify the image

docker images
  • Lists all images on your machine

🔥 Simple Analogy

Dockerfile = Recipe

docker build = Cooking the dish

Docker image = Finished food

Container = Serving & eating the food


🔹 Practical Use Case for Data Engineers

  • ETL pipelines: Package Python / Spark scripts and dependencies → run anywhere

  • Airflow DAGs: Build an image containing DAGs + plugins → use DockerOperator to run tasks

  • Testing pipelines: Share image with team → exact same environment

=========================================================================

🟦 Dockerfile (Simple Explanation)

Dockerfile is a text file containing step-by-step instructions to build a Docker image.

You tell Docker how to create the image:
what OS to use, what packages to install, what code to copy, what command to run.

 Think of it as a recipe for creating your application's environment. The Docker engine reads this file and executes the commands in order, layer by layer, to assemble a final, runnable image.


🟩 Most Important Instructions

Instruction → Meaning
FROM → Base image
WORKDIR → Set working directory
COPY → Copy files into image
RUN → Execute commands during build
CMD → Default command when container runs
ENTRYPOINT → Fixed command; CMD becomes args
EXPOSE → Document port
ENV → Set environment variables
ARG → Build-time variable
VOLUME → Create mount point

🟨 Basic Dockerfile Example

FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "main.py"]

What it does:

  • Uses Python 3.10 base

  • Sets /app as working folder

  • Installs requirements

  • Copies your code

  • Runs main.py by default


🟧 Build & Run Image

Build

docker build -t myapp .

Run

docker run myapp

🏗️ 1. BUILD = Create the Image

Build means you are constructing the Docker image from a Dockerfile.

Command:

docker build -t myimage .

What happens during build:

  • Docker reads the Dockerfile

  • Downloads base image

  • Installs dependencies (RUN commands)

  • Copies your code (COPY)

  • Creates layers

  • Produces a final image

📌 Output of build = Docker Image (a blueprint)


🚀 2. RUN = Start a Container

Run means you are starting a container from that image.

Command:

docker run myimage

What happens during run:

  • Docker takes the image

  • Creates a live running instance (container)

  • Executes the CMD/ENTRYPOINT

  • Runs your application

📌 Output of run = Container (a running process)

🔥 Simple Analogy

Concept → Analogy
Dockerfile → Recipe
docker build → Cooking the dish using the recipe
Image → Finished, packed food
docker run → Serving/eating the food

=========================================================================

🟢 What is a Docker Container?

  • Docker Container is a running instance of a Docker Image.

  • It is isolated, lightweight, and contains everything defined in the image: your code, libraries, and environment.

  • Unlike an image, a container can run, execute, generate logs, and store temporary data.

Analogy:

  • Image = Recipe

  • Container = Cake baked from that recipe


🔹 Key Features of Containers

  1. Ephemeral / Mutable

    • Containers can run, stop, restart, or be deleted.

    • Changes inside a container don’t affect the original image unless you commit it.

  2. Isolated Environment

    • Each container has its own filesystem, processes, and network stack.

    • Prevents conflicts between different projects or dependencies.

  3. Lightweight & Fast

    • Shares the host OS kernel → much faster than a VM.

    • Starts in seconds.

  4. Multiple Instances

    • You can run multiple containers from the same image → efficient resource usage.


🔹 Practical Commands

  1. Run a container

docker run -it --name my-etl-container my-etl-image:1.0
  • -it → interactive terminal

  • --name → container name

  • my-etl-image:1.0 → image to run

  2. List running containers

docker ps

  3. Stop a container

docker stop my-etl-container

  4. Remove a container

docker rm my-etl-container

  5. Run in detached mode (background)

docker run -d my-etl-image:1.0

🔹 Containers in Data Engineering

  • ETL Jobs: Each pipeline can run in a separate container → isolation and reproducibility.

  • Airflow Tasks: DockerOperator spins up a container per task → consistent environment for Python/Spark jobs.

  • Local Testing: Run full pipeline with dependencies (Spark + Postgres + Minio) without affecting host system.

  • Scalable Pipelines: Multiple containers can run simultaneously, useful for batch jobs or streaming tasks.

Image

  • Read-only template

  • Created from Dockerfile

  • Example: Python + libs + your ETL script

Container

  • Running instance of an image

  • Can be started/stopped

  • Temporary, isolated environment

Dockerfile

  • Instructions to build an image

=========================================================================

🚀 What is Docker Hub? - https://hub.docker.com/

Docker Hub = Online platform where Docker images are stored, shared, and downloaded.

Docker Hub is the most popular public Docker registry, provided by Docker Inc.

A repository is a place where multiple versions (tags) of a Docker image are stored.

You use it to:

  • Pull images

  • Push images

  • Share images

  • Discover official images

  • Host private images

Website: hub.docker.com
(You don’t need to visit it—Docker CLI can interact directly.)


🧱 What You Can Do with Docker Hub

✔ 1. Pull images

Download ready-made images:

docker pull python:3.10
docker pull nginx
docker pull postgres

✔ 2. Push your own images

Upload your images:

docker push username/myapp:1.0
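
If the local image was not built with your Docker Hub username in its name, tag it first; a sketch (names are placeholders):

```shell
docker login                              # authenticate to Docker Hub
docker tag myapp:1.0 username/myapp:1.0   # give the local image a registry-friendly name
docker push username/myapp:1.0            # upload it to your repository
```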

✔ 3. Use official, verified images

Examples:

  • library/nginx

  • library/ubuntu

  • library/mysql

These are secure, maintained by Docker or companies.

✔ 4. Create public or private repositories

  • Public repo → anyone can access

  • Private repo → only you/team can access

✔ 5. Automate builds (CI/CD integration)


----------------------------------------------------------------------------------
Types:

🌍 1. Public Repository (Free-Docker Hub)

✔ Definition

A public repo can be viewed and pulled by anyone.

Anyone can run:

docker pull yourname/myapp

No login required.

✔ Use Cases

  • Open-source images

  • Sharing tools with the community

  • Demo applications

  • Training material

✔ Pros

  • Free

  • Easy to share

  • Good for open-source

✔ Cons

  • Code/image contents are visible to the world

  • Cannot store sensitive applications

🔒 2. Private Repository (Restricted)

✔ Definition

A private repo can be accessed only by you and the people you grant permission to.

A user must log in:

docker login
docker pull yourname/private-app

If they don't have access → they cannot pull.

✔ Use Cases

  • Internal enterprise apps

  • Proprietary code

  • Databases / internal pipelines

  • Anything sensitive or confidential

✔ Pros

  • Secure

  • Access-controlled

  • Good for companies

✔ Cons

  • Limited free private repos on free plan

  • Need Docker Hub account login


----------------------------------------------------------------------------------

 Docker commands to pull an image from a repository and run it.

🚀 1. Pull the image from a repo

docker pull <repository>/<image>:<tag>

Example (public repo):

docker pull nginx:latest

Example (private repo):

docker login
docker pull username/myapp:1.0

🏃 2. Run the container

docker run <image>:<tag>

Example:

docker run nginx:latest

With port mapping:

docker run -p 8080:80 nginx:latest

🔥 Pull + Run in one command (No need to pull manually)

Docker will automatically pull the image if it doesn't exist locally.

docker run -p 8080:80 nginx:latest

📦 Full Example: Private Repo

Step 1: Login

docker login

Step 2: Pull the image

docker pull rk/myapp:2.0

Step 3: Run the container

docker run -d -p 5000:5000 rk/myapp:2.0

🧩 Optional: Run in background

Add -d:

docker run -d -p 8080:80 nginx


----------------------------------------------------------------------------------

🧡 Your Command

docker run -d \
  --name minio \
  -p 9000:9001 \
  -p 9002:9003 \
  -e MINIO_ROOT_USER=admin \
  -e MINIO_ROOT_PASSWORD=password123 \
  minio/minio server /data --console-address ":9001"

docker run

Runs a new container.

-d

Run the container in detached mode (background).

--name minio

Gives the container a name: minio

So instead of container ID, you can use:

docker stop minio
docker logs minio

-p 9000:9001

This is port mapping.

Format:

-p HOST_PORT:CONTAINER_PORT

But here you mapped:

host 9000 → container 9001

👉 Means: requests coming to localhost:9000 go to MinIO’s internal port 9001.

-p 9002:9003

Another port mapping:

host 9002 → container 9003

Useful if container exposes multiple ports.

-e MINIO_ROOT_USER=admin

Sets environment variable inside container.

This is MinIO root username:

username = admin

-e MINIO_ROOT_PASSWORD=password123

Sets root password:

password = password123

This is used to log in to the MinIO console.

minio/minio

This is the image name you want to run.

Docker will:

  • Pull the image (if not present)

  • Run the container from it

server /data

This tells MinIO to start in server mode and store data in the folder:

/data (inside container)

This is where your buckets & objects get stored.

--console-address ":9001"

MinIO UI console will run on port 9001 inside container.

If you didn’t set this, UI may run on random port.

docker ps -a

Lists all containers, including stopped ones.

docker run postgres:10.10


----------------------------------------------------------------------------------

You ran this command:

docker run postgres:10.10

And Docker responded with:

Unable to find image 'postgres:10.10' locally
10.10: Pulling from library/postgres

This means:

👉 Docker did NOT find the image locally,

so it started pulling (downloading) it from Docker Hub.

🧠 What Each Line Means

“Unable to find image 'postgres:10.10' locally”

Your machine doesn't have the image stored.

“Pulling from library/postgres”

Docker Hub official image is under:

library/postgres

“Already exists”

These are layers that your system already has (because you had other postgres images).

Docker uses layer caching, so it doesn’t download them again.

“Downloading” / “Download complete”

These are new layers specific to the 10.10 version of PostgreSQL.

📦 Why It Downloads Layers Even If Other Versions Exist?

Images are made of multiple layers, such as an OS base layer, library layers, and configuration layers.

Different PostgreSQL versions share common layers:

  • OS layer

  • Utility layer

  • Common dependencies

But version-specific layers (for 10.10) must be downloaded.

That's why you see mixed messages:

  • Already exists → shared layers reused

  • Downloading → new layers for this version

🎯 After Downloading

Docker will automatically run the new Postgres 10.10 container.

You can check it using:

docker ps

=========================================================================

Registry
  • A registry is a server where Docker images are stored, uploaded, downloaded, and shared; it can be private or public.

  • Types:

    • Public Registry: Open to anyone (e.g., Docker Hub).

    • Private Registry: Restricted access, can be self-hosted or cloud-hosted (e.g., AWS ECR, Azure Container Registry, GitHub Container Registry).

  • Key Points:

    • You can host your own registry to control access to images.

    • Used in CI/CD pipelines to store images built from your projects.

    • Access can be controlled with authentication and authorization.

  • Examples:

    • Docker Hub (public)

    • Amazon ECR (private)

    • Google Container Registry (GCR)

    • Azure Container Registry (ACR)

    • GitHub Container Registry

    • Harbor (self-hosted)

    • Nexus (self-hosted)

  • 🔥 What You Can Do With a Registry
  • ✔ Push images

    Upload your built image to a registry:

    docker push myrepo/myimage:latest

    ✔ Pull images

    Download an image from a registry:

    docker pull python:3.10

🏦 1. Registry = The Server

A registry is the entire system/server that stores Docker images.

Examples of registries:

  • Docker Hub

  • Amazon ECR

  • GitHub Container Registry

  • Google Container Registry

  • Harbor

Think of registry = big storage platform.

📁 2. Repository = Folder Inside the Registry

A repository is a collection of related images (usually different versions of the same app).

Example repository inside Docker Hub:

library/nginx

This repository contains multiple versions (tags):

  • nginx:1.21

  • nginx:1.23

  • nginx:latest

  • nginx:stable

Think of repository = folder inside registry.

=========================================================================

🟢 What is Docker Compose?

Docker Compose is a tool that lets you run multiple containers together using one YAML file.

Instead of running individual docker run commands, you define everything in:

docker-compose.yml

Then start all services with one command:

docker-compose up
  • Docker Compose is a tool for defining and running multi-container Docker applications.

  • Instead of running each container individually, you define all services in a single docker-compose.yml file.

  • With one command, you can start all services, networks, and volumes together.


🟣 Why do we use Docker Compose? (Very Important)

  • Run multiple services together (e.g., Airflow + Postgres + Redis)

  • Handles networking automatically

  • Creates shared volumes

  • Starts containers in the right order

  • Perfect for data engineering pipelines

Docker Compose Architecture

A Compose file has four main parts:

  1. Version → YAML schema version

  2. Services → Containers to run

  3. Volumes → Persistent storage

  4. Networks → Optional custom networks

Example structure:

version: "3.9"
services:
  service1:
  service2:
volumes:
  vol1:
networks:
  net1:

🟢 Basic Example (docker-compose.yml)

Example for Python app + Postgres DB:

version: '3.8'
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: mydb
    ports:
      - "5432:5432"
  app:
    build: .
    depends_on:
      - db
    ports:
      - "8000:8000"
    environment:
      DATABASE_HOST: db

Highlights

  • Two services: db and app

  • app waits for db (depends_on)

  • Networking is automatic → app connects to db using hostname db

🟣 Most Important Docker Compose Commands



Purpose → Command
Start all services → docker-compose up
Start in background → docker-compose up -d
Stop all services → docker-compose down
View running services → docker-compose ps
View service logs → docker-compose logs app
Rebuild + run → docker-compose up --build
Run a command inside container → docker-compose exec app bash

🟢 Networking in Compose

  • All services automatically join the same network

  • Containers talk using service names

Example:

Host: db
Port: 5432

No need for IP address.
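
For instance, an app container can build its database URL from the service name alone; a minimal Python sketch (credentials are placeholders matching the Compose example above):

```python
# Inside the Compose network, the service name "db" resolves to the
# database container, so no IP address is needed.
host = "db"
port = 5432
user, password, dbname = "user", "pass", "mydb"

url = f"postgresql://{user}:{password}@{host}:{port}/{dbname}"
print(url)  # postgresql://user:pass@db:5432/mydb
```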

🟣 Volumes in Compose

Used for saving persistent data:

volumes:
  pgdata:

Example:

services:
  db:
    volumes:
      - pgdata:/var/lib/postgresql/data

🔹 Why Data Engineers Use Docker Compose

  1. Run Airflow scheduler + webserver + database locally.

  2. Test ETL pipelines with Spark, Postgres, Kafka, or Minio (S3) together.

  3. Manage dependencies, networking, and volumes easily.

  4. Create reproducible environments for interviews and portfolio projects.

🔹 Basic Docker Compose Example (Airflow + Postgres)

version: '3.8'
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
  airflow-webserver:
    image: apache/airflow:2.7.1-python3.11
    depends_on:
      - postgres
    environment:
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins
    ports:
      - "8080:8080"
    command: webserver
volumes:
  postgres_data:

Explanation:

  • Postgres → metadata database for Airflow

  • Airflow-webserver → runs DAGs, connected to Postgres

  • Volumes → persist database and logs

  • Ports → expose Airflow UI locally

🔹 Basic Docker Compose Commands

  1. Build & start services

docker-compose up --build
  2. Run in detached mode (background)

docker-compose up -d

  3. Stop all containers

docker-compose down

  4. View logs

docker-compose logs airflow-webserver

  5. Rebuild after changes

docker-compose up --build

🔹 Advanced Use Cases for Data Engineers

  1. Local ETL testing

    • Spark + Minio (S3) + Kafka + Postgres → run all together.

  2. Airflow development environment

    • Scheduler + Webserver + Worker + Postgres + Redis.

  3. Team collaboration

    • Share docker-compose.yml → everyone runs the same environment.

MinIO is an open-source, high-performance object storage system that works just like Amazon S3.

🚀 Simple Definition

MinIO = Your own S3 storage, but on your local machine or your company's servers.

You can store:

  • Files (images, videos, PDFs)

  • Backups

  • Logs

  • Data lake files (Parquet, CSV, JSON)

  • ML model files

It exposes an S3-compatible API, so tools that work with AWS S3 also work with MinIO.

MinIO is heavily used by:

  • Data engineers

  • Big data pipelines

  • Machine Learning teams

  • Kubernetes ecosystems

  • On-prem companies needing S3-like storage

Common use cases:

  • Storage for Airflow, Spark, Kafka, ML models

  • Data lake storage (like S3)

  • Backup system

  • File storage for microservices

🔹 Tips

  • Use .env file for sensitive credentials (AWS keys, DB passwords).

  • Use depends_on for proper startup order.

  • Combine Dockerfile + Docker Compose to build custom images and run multi-service pipelines.

  • Use networks to let containers communicate (service_name:port).
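
For the .env tip, a sketch of how Compose picks up credentials (variable names are illustrative; keep .env out of version control):

```yaml
# .env contains lines like:
#   POSTGRES_USER=airflow
#   POSTGRES_PASSWORD=supersecret

# docker-compose.yml (fragment)
services:
  db:
    image: postgres:15
    env_file:
      - .env        # injects the variables above into the container
```

Compose also substitutes `${POSTGRES_USER}`-style references in docker-compose.yml from the same .env file.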

---------------------------------------------------------------------

In a Dockerfile, commands are executed in two different phases:

🚀 1. Build-time commands

Executed while building the image using:

docker build .

These commands modify the image, install software, copy files, etc.

⭐ Build-time instructions:

| Instruction | Meaning |
| FROM | Base image |
| COPY | Copies files into image |
| ADD | Similar to COPY with extra features |
| RUN | Executes commands during image build |
| ENV | Sets environment variables for image |
| WORKDIR | Sets working directory |
| EXPOSE | Metadata only |
| USER | Sets default user |
| ENTRYPOINT | Sets startup program |
| CMD | Default arguments to ENTRYPOINT |

✔ Example Build-Time (RUN)

RUN apt-get update && apt-get install -y python3
RUN mkdir /app
COPY . /app

⏩ These run inside the image build, produce new layers.

🚀 2. Runtime commands

Executed when container starts, not during build.

This is when you run:

docker run <image>

⭐ Runtime instructions:

| Instruction | Meaning |
| CMD | Runs when container starts |
| ENTRYPOINT | Main container command |
| ENV | Available at runtime |
| VOLUME | Declares storage |
| EXPOSE | Helps runtime port mapping |
| HEALTHCHECK | Periodically tests container health |

✔ Example Runtime (CMD)

CMD ["python3", "app.py"]

⏩ This runs when the container starts, not during build.

🔥 Major Difference (VERY IMPORTANT)

| Feature | Build Time | Runtime |
| Executed during | docker build | docker run |
| Command used | RUN | CMD, ENTRYPOINT |
| Creates layers? | Yes | No |
| Installs packages | ✔ Allowed | ❌ Not allowed |
| Runs application | ❌ No | ✔ Yes |
| Changes image? | ✔ Yes | ❌ No |

🧠 Most Common Confusion

❗ Why not use RUN to start a server?

Example WRONG:

RUN python app.py

This will start app during build → build will hang forever.

You should use CMD or ENTRYPOINT:

CMD ["python", "app.py"]

🎯 Simple Example Dockerfile (Build-time vs Runtime)

FROM python:3.10

# --- Build time ---
RUN mkdir /app
COPY . /app
RUN pip install -r /app/requirements.txt
WORKDIR /app

# --- Runtime ---
CMD ["python", "main.py"]

=========================================================================

Basic Commands

| Purpose | Command | Meaning |
| Check Docker version | docker --version | Verify installation |
| List images | docker images | Shows all images |
| List running containers | docker ps | Only active containers |
| List all containers | docker ps -a | Active + stopped containers |
| Build image | docker build -t <name> . | Build image from Dockerfile |
| Run container | docker run <image> | Start container |
| Run interactive shell | docker run -it <image> bash | Enter container terminal |
| Run container in background | docker run -d <image> | Detached mode |
| Assign name to container | docker run --name myapp <image> | Run container with name |
| Stop container | docker stop <id> | Gracefully stop |
| Force stop | docker kill <id> | Hard stop |
| Remove container | docker rm <id> | Delete container |
| Remove image | docker rmi <image> | Delete image |
| View container logs | docker logs <id> | Show logs |
| Execute command inside container | docker exec -it <id> bash | Open shell inside running container |
| Copy file from container | docker cp <id>:/path/file . | Copy from container to host |
| Show container stats | docker stats | CPU/RAM usage |
| Pull image from Docker Hub | docker pull <image> | Download image |
| Push image to registry | docker push <image> | Upload image |
| Inspect container details | docker inspect <id> | Low-level info |

---------------------------------------------------------------------

▶docker build -t my-app:1.0 .
▶docker images
▶docker run my-app:1.0
▶docker rm containerid 
▶docker rmi imgid
▶docker ps
▶docker logs id
▶docker exec -it contid /bin/sh
▶exit

If we change the Dockerfile, we need to rebuild the image — stop and remove the old container and image, then rebuild and run:
▶docker ps
▶docker stop id
▶docker rm id
▶docker rmi imgid
▶docker build -t my-app:1.0 .
▶docker run my-app:1.0

---------------------------------------------------------------------

docker exec is used to run a command inside a running container.

Think of it as opening a terminal inside a container.

✅ Syntax

docker exec <options> <container_name_or_id> <command>

🔥 Most Common Usage

⭐ 1️⃣ Open an interactive shell inside container

(Like SSH into the container)

docker exec -it container_name bash

or if bash is not available:

docker exec -it container_name sh

What this does:

  • -i → interactive

  • -t → allocate a terminal (TTY)

  • You get inside the container's environment

  • You can explore filesystem, logs, configs, etc.

⭐ 2️⃣ Run a single command inside container

Example: list files

docker exec container_name ls /

Example: check Redis keys

docker exec redis1 redis-cli keys '*'

⭐ 3️⃣ Check environment variables

docker exec container_name env

⭐ 4️⃣ Verify process running inside container

docker exec container_name ps aux

⭐ 5️⃣ Run SQL client inside PostgreSQL container

docker exec -it postgres1 psql -U postgres

🧠 When to Use docker exec?

✔ To debug inside a container
✔ To explore container file system
✔ To check logs that applications write to files
✔ To run app-specific commands (redis-cli, psql, etc.)
✔ To verify configs
✔ To run admin commands

❗ Important Notes

🔸 The container must be running

If container is stopped:

docker exec -it mycontainer bash

will give an error like:

Error response from daemon: container mycontainer is not running

Use:

docker ps
docker start mycontainer

---------------------------------------------------------------------

▶docker logs <container_id_or_name>

🔥 Useful Flags

1️⃣ Follow logs (live streaming logs)

docker logs -f redis1

This is like tail -f, continuously showing new log lines.

2️⃣ Show last N lines

docker logs --tail 50 redis1

3️⃣ Include timestamps

docker logs -t redis1

4️⃣ Combine flags

docker logs -f -t --tail 100 redis1

Shows last 100 lines + timestamps + live updates.

---------------------------------------------------------------------

▶docker network ls
▶docker network create mynetwork
▶docker run -d \
  --name mongo \
  --network mynetwork \
  -e MONGO_INITDB_ROOT_USERNAME=admin \
  -e MONGO_INITDB_ROOT_PASSWORD=admin123 \
  mongo
▶docker run -d \
  --name mongo-express \
  --network mynetwork \
  -e ME_CONFIG_MONGODB_ADMINUSERNAME=admin \
  -e ME_CONFIG_MONGODB_ADMINPASSWORD=admin123 \
  -e ME_CONFIG_MONGODB_SERVER=mongo \
  -p 8081:8081 \
  mongo-express

✅ 1. docker network ls

This command lists all Docker networks on your system.

docker network ls

You will see something like:

NETWORK ID   NAME     DRIVER   SCOPE
934d...      bridge   bridge   local
a23b...      host     host     local
7dfe...      none     null     local

✅ 2. Create a Docker network

To create your own custom network:

docker network create mynetwork

Why create a custom network?

Containers on the same network can communicate with each other by container name.

Example:
Mongo container can be accessed by name mongo inside Mongo Express.

✅ 3. Run MongoDB container on that network

docker run -d \
  --name mongo \
  --network mynetwork \
  -e MONGO_INITDB_ROOT_USERNAME=admin \
  -e MONGO_INITDB_ROOT_PASSWORD=admin123 \
  mongo

What this does:

  • Starts MongoDB in background (-d)

  • Assigns container name mongo

  • Connects it to mynetwork

  • Sets username/password

✅ 4. Run Mongo Express (UI) on same network

docker run -d \
  --name mongo-express \
  --network mynetwork \
  -e ME_CONFIG_MONGODB_ADMINUSERNAME=admin \
  -e ME_CONFIG_MONGODB_ADMINPASSWORD=admin123 \
  -e ME_CONFIG_MONGODB_SERVER=mongo \
  -p 8081:8081 \
  mongo-express

Important points:

  • Connected to same network → can reach Mongo

  • Mongo server is given as:

ME_CONFIG_MONGODB_SERVER=mongo

Because Docker resolves container names automatically on a shared network.

  • Port mapping -p 8081:8081
    → You can open Mongo Express in browser at:

http://localhost:8081

docker-compose.yml

version: "3.9"

services:
  mongo:
    image: mongo
    container_name: mongo
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: admin123
    networks:
      - mynetwork
    volumes:
      - mongo_data:/data/db

  mongo-express:
    image: mongo-express
    container_name: mongo-express
    depends_on:
      - mongo
    environment:
      ME_CONFIG_MONGODB_ADMINUSERNAME: admin
      ME_CONFIG_MONGODB_ADMINPASSWORD: admin123
      ME_CONFIG_MONGODB_SERVER: mongo
    networks:
      - mynetwork
    ports:
      - "8081:8081"

networks:
  mynetwork:

volumes:
  mongo_data:

Commands to run the Compose

Start containers

docker-compose -f mongo.yml up -d

Check running services

docker-compose -f mongo.yml ps

View logs

docker-compose -f mongo.yml logs -f

Stop everything

docker-compose -f mongo.yml down

---------------------------------------------------------------------

▶docker rm contid
▶docker rmi imgid
▶docker stop id
▶docker start id

▶docker pull redis
▶docker images
▶docker run redis
▶docker ps
▶docker run -d redis
▶docker stop containerid
▶docker start idname
▶docker ps -a
▶docker run redis:4.0

✅ 1. docker pull redis

This command downloads the Redis image from Docker Hub.

docker pull redis

What happens:

  • Docker checks if redis:latest exists locally

  • If not, it downloads all required image layers

  • Stores it in your local image cache


✅ 2. docker images

Shows all images available locally.

docker images

You will see output like:

REPOSITORY   TAG      IMAGE ID   CREATED      SIZE
redis        latest   abc123     2 days ago   110MB

✅ 3. docker run redis

Runs the Redis image in the foreground.

docker run redis

Result:

  • It starts Redis in your terminal

  • You can see logs continuously

  • Your terminal gets “attached” to the container

  • Press CTRL + C to stop

Not recommended for production.


✅ 4. docker ps

Shows running containers.

docker ps

You will see columns:

| CONTAINER ID | IMAGE | STATUS | PORTS | NAMES |


✅ 5. docker run -d redis

Runs Redis in background (detached mode).

docker run -d redis

What happens:

  • Starts Redis container

  • Returns only the container ID

  • Your terminal is free to use

  • Container keeps running in the background


✅ 6. docker stop <container_id>

Stops the running Redis container.

docker stop <container_id>

What happens:

  • Sends graceful shutdown signal

  • Redis safely shuts down

  • Container becomes stopped, but not removed


✅ 7. docker start <container_id / name>

Starts a stopped container again.

docker start redis-container

or

docker start 7f9a3c0192d1

Important:

It starts the same stopped container—not a new one.


✅ 8. docker ps -a

Shows all containers — running + stopped.

docker ps -a

Useful to check old/stopped containers.


✅ 9. docker run redis:4.0

Runs a specific version of Redis.

docker run redis:4.0

What happens:

  • If version 4.0 image does NOT exist locally → Docker pulls it

  • A container is created using Redis v4.0

  • If you use -d, it runs in background

=========================================================================

🔵 What is Docker Caching?

Docker caching means Docker reuses previously built layers instead of rebuilding everything every time.

This makes builds:

  • Faster

  • Cheaper

  • More efficient


🔵 How Docker Caching Works

A Docker image is made of layers.
Each Dockerfile instruction creates one layer.

Example:

FROM python:3.10-slim                 → Layer 1
WORKDIR /app                          → Layer 2
COPY requirements.txt .               → Layer 3
RUN pip install -r requirements.txt   → Layer 4
COPY . .                              → Layer 5
CMD ["python", "main.py"]             → Layer 6

If nothing changes in a layer, Docker reuses it from cache.


🔵 Why Caching Matters (Interview Points)

  • Speeds up builds (5 minutes → 10 seconds)

  • Reduces duplicate work

  • Prevents reinstalling dependencies

  • Saves cloud build costs (GitHub Actions, AWS, GCP)


🔵 What Breaks the Cache?

A cache is invalidated (rebuild happens) if:

  1. The instruction changes (example: change a RUN command)

  2. Any file copied in that layer changes

  3. Any previous layer changes

Example:

If requirements.txt changes, Docker will rebuild:

  • Layer for COPY requirements.txt

  • Layer for RUN pip install

  • All layers after them

But earlier layers (FROM, WORKDIR) are still cached.


🔵 Best Practice: ORDER YOUR DOCKERFILE

To get the maximum caching, put the steps that change least often first.

❌ Bad (slow builds every time):

COPY . .
RUN pip install -r requirements.txt

✔ Good (better caching):

COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

This way:

  • Pip install runs only if requirements.txt changes

  • App code changes won’t break pip install cache


🔵 Cache Example in Real Life

First build:

docker build . → takes 24 minutes

Second build with no code change:

docker build . → 5 seconds

Because all layers are reused.


🔵 Skipping Cache (Forced Rebuild)

Sometimes you want a full rebuild:

docker build --no-cache -t myapp .

🔵 Multi-Stage Build + Caching (Advanced)

Multi-stage builds let you cache dependency installation separately:

FROM python:3.10 AS builder
COPY requirements.txt .
RUN pip install --user -r requirements.txt

FROM python:3.10-slim
COPY --from=builder /root/.local /root/.local
COPY . .

This dramatically speeds up builds.


🔥 Short Summary (One Line Answers)

  • Docker caching = reusing previous build layers

  • Each Dockerfile instruction = one layer

  • Layers only rebuild if something changes

  • Correct ordering = fast builds

  • --no-cache disables caching

=========================================================================

🟦 Variables in Docker

Docker supports two types of variables:


✅ 1. ENV (Environment Variables)

🔹 Available inside the running container
🔹 Used by applications at runtime
🔹 Can be set in Dockerfile, Compose, or at run time

Dockerfile

ENV PORT=8000

docker run

docker run -e PORT=8000 myapp

docker-compose.yml

environment:
  PORT: 8000

📌 Use case:
Database URLs, passwords, app settings.


✅ 2. ARG (Build-time Variables)

🔹 Used only during image build
🔹 NOT available inside running container unless passed to ENV
🔹 Must be defined before use

Dockerfile

ARG VERSION=1.0
RUN echo "Building version $VERSION"

Build:

docker build --build-arg VERSION=2.0 .

📌 Use case:
Build metadata, versioning, optional settings.
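Since an ARG disappears once the build finishes, the usual trick is to copy it into an ENV when the running container also needs the value. A minimal sketch (APP_VERSION is an illustrative name):

```dockerfile
ARG APP_VERSION=1.0
# Re-export the build-time ARG as a runtime ENV so the app can read it
ENV APP_VERSION=${APP_VERSION}
```

Build with docker build --build-arg APP_VERSION=2.0 . — the running container will then see APP_VERSION=2.0 in its environment.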


🟨 ENV vs ARG (Interview Question)

| Feature | ARG | ENV |
| Available at runtime? | ❌ No | ✔ Yes |
| Available during build? | ✔ Yes | ✔ Yes |
| Passed using docker run? | ❌ No | ✔ Yes |
| Stored inside final image? | ❌ No | ✔ Yes |

🟩 3. Variables in docker-compose with .env file

You can store environment variables in a file named .env.

.env:

DB_USER=admin
DB_PASS=pass123

docker-compose.yml:

environment:
  - DB_USER=${DB_USER}
  - DB_PASS=${DB_PASS}


🟧 4. Using variables inside Dockerfile

Example:

ARG APP_DIR=/app
WORKDIR $APP_DIR
ENV LOG_LEVEL=debug

🟥 5. Why variables are important in Docker?

  • Avoid hardcoding secrets

  • Make Dockerfiles reusable

  • Dynamic config (ports, environment, versions)

  • Different environments: dev, test, prod



=========================================================================

🟦 Docker Registry — What It Is & Why It Matters

✅ What Is a Docker Registry?

Docker Registry is a storage + distribution system for Docker images.

A Docker registry is a centralized storage and distribution system for Docker images. It acts as a repository where Docker images—packages containing everything needed to run an application—are stored, managed, versioned, and shared across different environments. 

It is where Docker images are:

  • Stored

  • Versioned

  • Pulled from

  • Pushed to

Similar to GitHub, but for container images instead of code.


🟧 Key Concepts

🟠 1. Registry

The whole server that stores repositories → e.g., Docker Hub, AWS ECR.

🟠 2. Repository

A collection of versions (tags) of an image.
Example:

myapp:latest
myapp:v1
myapp:v2

🟠 3. Image Tag

Label used to version an image.

Example:

python:3.10
node:20-alpine

🟩 Public vs Private Registries

| Type | Examples | Features |
| Public | Docker Hub, GitHub Container Registry | Anyone can pull |
| Private | AWS ECR, Azure ACR, GCP GCR, Harbor | Secure, enterprise use |

🟦 Why Do We Need a Docker Registry?

Because:

  • You build an image locally

  • Push it to a registry

  • Your production server / CI/CD pulls the image and runs it

Without a registry → no easy way to share or deploy images.


🟣 Common Docker Registry Commands

✅ Login

docker login

✅ Tag an Image

docker tag myapp:latest username/myapp:latest

✅ Push to Registry

docker push username/myapp:latest

✅ Pull from Registry

docker pull username/myapp:latest

🟤 Examples of Docker Registries

📌 1. Docker Hub (Most Common)

  • Free public repositories

  • Paid private repos

📌 2. AWS ECR (Enterprise)

  • Most used in production

  • Private registry

  • Integrated with ECS, EKS, Lambda

📌 3. GitHub Container Registry

  • Images stored inside GitHub

  • Good for CI/CD workflows

📌 4. Google GCR / Artifact Registry

📌 5. Self-hosted Registry

Example: Harbor, JFrog Artifactory


🔥 Advanced Concepts (Interview-Level)

🔹 Digest-based pulling

Instead of tag:

docker pull myapp@sha256:abc123...

Guarantees exact version.


🔹 Immutable tags

Some registries enforce that v1 cannot be overwritten.


🔹 Retention Policies

Automatically delete old images in ECR/GCR.


🔹 Scan for vulnerabilities

Registries like:

  • AWS ECR

  • GHCR

  • Docker Hub (Pro)
    can scan images for security issues.

=========================================================================

 What is Docker Networking?

Docker networking allows containers to communicate with:

  • each other

  • the host machine

  • external internet

Each container gets its own virtual network interface + IP address.

🔶 Types of Docker Networks

Docker provides 5 main network types:


🟦 1. Bridge Network (Default)

  • Most commonly used

  • Containers on the same bridge network can talk to each other using container name

Example:

docker network create mynet
docker run -d --name app1 --network=mynet nginx
docker run -d --name app2 --network=mynet alpine ping app1

Use Case:

Local development
Microservices communication


🟩 2. Host Network

Container shares the same network as host.

❌ No isolation
⚡ Fastest network performance
🧠 No port mapping needed

Run:

docker run --network host nginx

Use Case:

  • High-performance applications

  • Network-heavy workloads


🟧 3. None Network

Container has no network.

docker run --network none alpine

Use Case:

Security
Sandbox jobs
Batch processing


🟪 4. Overlay Network (Swarm / Kubernetes)

Used in multi-node swarm clusters.
Allows containers on different machines to communicate.

Use Case:

Distributed apps
Microservices in Docker Swarm


🟫 5. Macvlan Network

Gives container its own IP address in LAN like a real device.

Use Case:

Legacy systems
Need direct connection to network
Running containers like physical machines


🔷 Key Networking Commands

| Command | Description |
| docker network ls | List networks |
| docker network inspect <name> | Inspect network |
| docker network create <name> | Create network |
| docker network rm <name> | Remove network |
| docker network connect <net> <container> | Add container to network |
| docker network disconnect <net> <container> | Remove container from network |

🔷 How Containers Communicate

🟦 1. Same Bridge Network

✔ Can ping each other by container name
✔ DNS built-in

Example:

ping app1

🟥 2. Different Networks

❌ Cannot communicate
➡ Must connect to the same network


🟩 3. With Host Machine

Host can access container via:

localhost:<mapped-port>

Example:

docker run -p 8080:80 nginx

Access: → http://localhost:8080


🟧 4. Container to Internet

Enabled by default via NAT.


🔶 Port Mapping

If container port = 80
Host port = 8080

docker run -p 8080:80 nginx

👉 Host can access container
👉 “Port forwarding”


🟦 Docker DNS

On the same custom network:

  • Container names act like hostnames

  • Docker automatically manages DNS

curl http://app1:5000


🟢 How Containers Talk to Each Other

Within same network → Use service name

Example in docker-compose.yml:

services:
  db:
    image: postgres
  app:
    image: python-app

app can connect to db like this:

host = "db"
port = 5432

✔ No need for IP address
✔ Docker handles DNS automatically


🟣 Important Commands

| Purpose | Command |
| List networks | docker network ls |
| Inspect network details | docker network inspect <network> |
| Create a network | docker network create mynet |
| Connect container to network | docker network connect mynet container1 |
| Disconnect | docker network disconnect mynet container1 |

🟢 Networking in Docker Compose (MOST USED)

services:
  app:
    image: myapp
    networks:
      - mynet
  db:
    image: postgres
    networks:
      - mynet

networks:
  mynet:

Result:

  • app and db talk using: db:5432

=========================================================================

🔵 Docker Volumes 

Docker Volumes are the official way to store data outside a container.

Docker volumes are a dedicated, persistent storage mechanism managed by Docker for storing data generated and used by containers. 

Unlike container writable layers, volumes exist independently of the container lifecycle, meaning data in volumes remains intact even if the container is stopped, removed, or recreated.

They reside outside the container filesystem on the host, typically under Docker's control directories, providing efficient I/O and storage management.

Because containers are ephemeral:
→ When container stops/deletes → data is lost
→ Volumes solve that.

🔶 Why Do We Need Docker Volumes?

✔ Containers are temporary
✔ Data must persist
✔ Multiple containers may need same data
✔ Upgrading/Deleting containers should NOT delete data

🟦 Types of Docker Storage

Docker offers 3 types:

1️⃣ Named Volume (Recommended)

Managed by Docker itself
Stored under:

/var/lib/docker/volumes/

Use Cases:

  • Databases (MySQL, PostgreSQL)

  • Persistent app data

Example:

docker volume create myvol
docker run -v myvol:/data mysql

2️⃣ Bind Mount

Maps specific host directory into container

Uses host machine's folder.

docker run -v /host/path:/container/path nginx

Use Cases:

  • Local development

  • When you want full control of host path

3️⃣ tmpfs (Linux Only)

Data stored in RAM only.

docker run --tmpfs /data redis

Use Cases:

  • Sensitive data

  • Ultra-fast temporary storage

🟩 Volume Commands (Most Important)

| Command | Description |
| docker volume create myvol | Create volume |
| docker volume ls | List volumes |
| docker volume inspect myvol | Inspect volume |
| docker volume rm myvol | Delete volume |
| docker volume prune | Remove unused volumes |

🟧 Using Volumes in Docker Run

Syntax:

docker run -v <volume_name>:<container_path> image

Example:

docker run -d \
  -v dbdata:/var/lib/mysql \
  mysql:8

🟣 Using Bind Mounts

Example:

docker run -d \
  -v /home/user/app:/app \
  node:20

🔵 Volumes in Docker Compose

Very important for real projects.

docker-compose.yml

version: "3.9"

services:
  db:
    image: mysql
    volumes:
      - dbdata:/var/lib/mysql

volumes:
  dbdata:

🔥 Example Use Case (DB Persistence)

If you run:

docker run mysql

Delete container → data gone.

But with volume:

docker run -v dbdata:/var/lib/mysql mysql

Stop container → data still exists (in volume).

🟥 Where Are Volumes Stored?

On Linux:

/var/lib/docker/volumes/<volume-name>/_data

On Windows/Mac → managed internally through Docker Desktop.

-------------------------------------------------------------------------------------------------------------------

1️⃣ Why each DB has a different location

Each database image has its own default directory where it stores its actual data files — for example, MySQL uses /var/lib/mysql, PostgreSQL uses /var/lib/postgresql/data, and MongoDB uses /data/db.

2️⃣ Using Docker volumes for persistence

  • Volumes are Docker-managed storage that lives outside the container filesystem.

  • You can map container paths to host paths or let Docker manage them.

Syntax:

docker run -d \
  -v <host-path>:<container-db-path> \
  --name <container-name> \
  <image-name>

Examples:

MySQL:

docker run -d \
  -v /my/host/mysql-data:/var/lib/mysql \
  -e MYSQL_ROOT_PASSWORD=root123 \
  --name my-mysql \
  mysql:8


3️⃣ Key points

  1. Each DB container has its own default data directory — you must map that path for persistence.

  2. You can use:

    • Host directory mapping (/host/path:/container/path) → data visible on host.

    • Named volumes (-v myvolume:/container/path) → Docker manages storage.

  3. Using different volumes/paths per DB avoids conflicts and keeps data safe.

  4. This also allows backup, restore, and migration easily by copying the volume.
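Points 1–3 look like this in Compose: each database gets its own named volume mapped to its own default data directory (the paths are the images' documented defaults; credentials are illustrative):

```yaml
services:
  mysql:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: root123
    volumes:
      - mysql_data:/var/lib/mysql           # MySQL default data dir

  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: pg123
    volumes:
      - pg_data:/var/lib/postgresql/data    # PostgreSQL default data dir

volumes:
  mysql_data:
  pg_data:
```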


🟨 Interview Questions (Short Answers)

1️⃣ What is a Docker Volume?

A persistent storage mechanism managed by Docker.

2️⃣ Difference: Volume vs Bind Mount?

| Volume | Bind Mount |
| Managed by Docker | Controlled by host user |
| More secure | Direct host access |
| Best for production | Best for local development |

3️⃣ Does deleting container delete volume?

❌ No.
Volumes must be deleted manually.

4️⃣ What happens if volume doesn't exist?

Docker automatically creates it.

5️⃣ Can two containers share one volume?

✔ Yes → used in DB replicas, logs, shared storage.

=========================================================================

🔵 What is ENTRYPOINT in Docker?

ENTRYPOINT defines the main command that will always run when a container starts.

Think of it as the default executable of the container.


🟦 Why ENTRYPOINT is used?

✔ Makes the container behave like a single-purpose program
✔ Forces a command to always run
✔ Can't be easily overridden (compared to CMD)
✔ Best for production containers


🔶 ENTRYPOINT Syntax

Two forms exist:


1️⃣ Exec Form (Recommended)

ENTRYPOINT ["executable", "param1", "param2"]

✔ Doesn’t use shell
✔ More secure
✔ Handles signals properly


2️⃣ Shell Form

ENTRYPOINT command param1 param2

⚠ Runs inside /bin/sh -c
⚠ Harder to handle signals


🟣 Example ENTRYPOINT Dockerfile

Dockerfile

FROM python:3.10
COPY app.py /
ENTRYPOINT ["python3", "app.py"]

Run:

docker run myapp

This will always run:

python3 app.py

🟩 ENTRYPOINT + CMD (Very Important)

ENTRYPOINT = fixed command
CMD = default arguments

Example:

ENTRYPOINT ["python3", "app.py"]
CMD ["--port", "5000"]

Container will run:

python3 app.py --port 5000

You can override CMD:

docker run myapp --port 8000

But ENTRYPOINT cannot be replaced unless you use --entrypoint.


🔥 Override ENTRYPOINT (Rare)

docker run --entrypoint bash myapp

🟥 ENTRYPOINT vs CMD (Very Important Table)

| Feature | ENTRYPOINT | CMD |
| Main purpose | Main command | Default args |
| Overrides allowed? | ❌ Hard (needs --entrypoint) | ✔ Easy |
| Best use | Permanent command | Arguments |
| Runs as | Program | Command/Args |

🔶 Common Interview Questions

1. Why use ENTRYPOINT instead of CMD?

To ensure the main command always runs and cannot be overridden.

2. What happens if both ENTRYPOINT and CMD exist?

CMD becomes arguments to ENTRYPOINT.

3. How do you override ENTRYPOINT?

Using --entrypoint.

=========================================================================

🔵 Docker Daemon & Docker Client

Docker works using a client–server architecture.


🟦 1. Docker Daemon (dockerd)

This is the brain of Docker.

✔ What it Does:

  • Runs in the background

  • Manages containers

  • Manages images

  • Manages networks

  • Manages volumes

  • Executes all Docker operations

✔ It Listens On:

  • Unix socket: /var/run/docker.sock

  • Sometimes TCP port (for remote Docker hosts)

✔ Daemon = Server Side


🟩 2. Docker Client (docker)

This is the command-line tool you use.

When you type:

docker ps
docker run nginx

The client DOES NOT run containers.

Instead, it sends API requests to the Docker Daemon, which performs the real operations.

✔ Client = Frontend

✔ Daemon = Backend


🟧 How They Work Together (Simple Flow)

You run:

docker run nginx

Flow:

  1. Client sends request → Daemon

  2. Daemon pulls image

  3. Daemon creates container

  4. Daemon starts container

  5. You see output on terminal

=========================================================================

🔵 COPY vs ADD in Dockerfile

Both are used to copy files into the image, but COPY is preferred.


🟦 1. COPY (Recommended)

✔ What it does:

Copies local files/folders into the container.

✔ Safe

✔ Predictable

✔ No extra features (simple only)

Example:

COPY app.py /app/app.py

Use COPY when:

  • You want to copy source code

  • You want clean builds

  • You don’t need extraction or downloading


🟧 2. ADD (Avoid unless needed)

✔ What it does:

Does everything COPY does plus two extra features:

Extra Features:

1️⃣ Can download URLs

ADD https://example.com/file.tar.gz /app/

2️⃣ Automatically extracts tar files

ADD app.tar.gz /app/

⚠ Because of these extras → can create security issues

So Docker recommends: use COPY unless ADD is needed.


🟪 COPY vs ADD Table (Interview-Friendly)

| Feature | COPY | ADD |
| Copy local files | ✔ Yes | ✔ Yes |
| Copy remote URL | ❌ No | ✔ Yes |
| Auto extract .tar.gz | ❌ No | ✔ Yes |
| Simpler | ✔ Yes | ❌ No |
| More secure | ✔ Yes | ❌ No |
| Recommended? | ✔ Yes | ❌ Use only when required |

🟩 When to Use ADD? (Rare)

Use ADD only for:

✔ Auto-unpacking tar files into image

ADD app.tar.gz /app/

✔ Downloading files from a URL

ADD https://example.com/setup.sh /scripts/

Otherwise → COPY is always better.

=========================================================================

🔵 What are Multi-Stage Builds?

Multi-stage builds allow you to use multiple FROM statements in a single Dockerfile.

✔ Build in one stage
✔ Copy only the required output into the final stage
✔ Final image becomes much smaller
✔ No build dependencies inside final image


🟦 Why Multi-Stage Builds Are Needed?

Problem (without multi-stage):

  • Build tools (Maven, Go compiler, Node modules, pip, etc.) stay inside the final image

  • Makes image heavy

  • Security issues

  • Slow deployment

Multi-stage solution:

  • Build tools exist only in the build stage

  • Final stage contains just the application

  • Clean, lightweight image


🟩 Simple Example – Python / Node / Java / Go (All follow same logic)

Here is a general multi-stage pattern:

# ----- Stage 1: Build -----
FROM node:20 AS builder
WORKDIR /app
COPY package*.json .
RUN npm install
COPY . .
RUN npm run build

# ----- Stage 2: Final Image -----
FROM nginx:alpine
COPY --from=builder /app/dist /usr/share/nginx/html

What happens?

  • Node image builds the app

  • Only the final compiled output is copied to nginx

  • Result = super small production image


🔶 Another Example – Python App

# Stage 1: Build dependencies
FROM python:3.10 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt

# Stage 2: Clean final image
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
CMD ["python", "app.py"]

🔷 Another Example – Java (Very Popular)

# Stage 1: Build JAR
FROM maven:3.9 AS builder
WORKDIR /app
COPY pom.xml .
COPY src ./src
RUN mvn package -DskipTests

# Stage 2: Run JAR
FROM openjdk:17-jdk-slim
COPY --from=builder /app/target/myapp.jar /myapp.jar
CMD ["java", "-jar", "/myapp.jar"]

✔ No Maven in final image
✔ Final image is tiny


🟧 Key Features of Multi-Stage Builds

✔ Multiple FROM instructions

Each FROM = new stage

✔ You can name stages

FROM golang:1.20 AS builder

✔ Copy artifacts from stage to stage

COPY --from=builder /app/bin /bin

✔ Final image only contains last stage

All previous stages = removed
Image is clean + small


🟪 Benefits (Interview Ready)

| Benefit | Explanation |
| ✔ Smaller images | No build tools in final image |
| ✔ Faster builds | Layer caching for each stage |
| ✔ Better security | No compilers / secrets left behind |
| ✔ Cleaner Dockerfiles | Each stage has a clear job |
| ✔ Reproducible builds | Same environment every time |

=========================================================================

🔵 What is .dockerignore?

.dockerignore is a file that tells Docker which files/folders to EXCLUDE when building an image.

It works similar to .gitignore.


🟦 Why do we use .dockerignore?

✔ Faster Docker builds

(Removes unnecessary files → smaller build context)

✔ Smaller images

(Don’t copy unwanted files)

✔ Better security

(Keep secrets, logs, configs out of image)

✔ Cleaner caching

(Prevents rebuilds when irrelevant files change)


🟩 Common Items in .dockerignore

node_modules/
__pycache__/
*.pyc
*.log
.env
.env.*
.git
.gitignore
Dockerfile
docker-compose.yml
.vscode/
.idea/
dist/
build/
*.zip
*.tar.gz

🟧 How it works?

When you run:

docker build -t myapp .

Docker first copies the “build context” → (current directory)
Without dockerignore, everything is copied.

.dockerignore tells Docker:
🚫 Don’t send these files to the build context.


🟪 Example

.dockerignore

*.log
*.env
secret.txt
cache/

Dockerfile

COPY . /app

Only allowed files will be copied.


🟥 Performance Impact (Very Important)

Without .dockerignore:

  • Docker copies huge directories (node_modules, logs)

  • Slow build

  • Cache invalidates unnecessarily

With .dockerignore:

  • Build context is very small

  • Build is faster

  • Cache stays valid → faster incremental builds

=========================================================================

🔵 Docker Container Lifecycle (Step-by-Step)

A Docker container goes through the following major stages:

Created → Running → Paused → Unpaused → Stopped → Restarted → Removed

🟦 1. Created

The container is created from an image but not started yet.

Command:

docker create image_name

🟩 2. Running

Container is active and executing processes.

Command:

docker start container # or docker run image_name

docker run = create + start


🟧 3. Paused

All processes inside the container are temporarily frozen.

Command:

docker pause container

🟪 4. Unpaused

Resumes the paused container.

Command:

docker unpause container

🟥 5. Stopped / Exited

Container stops running its main process (app has exited or manually stopped).

Command:

docker stop container

🟨 6. Restarted

Container is stopped and then started again.

Command:

docker restart container


🟫 7. Removed (Deleted)

The container is permanently removed from Docker.

Command:

docker rm container

You cannot remove a running container; stop it first (or force-remove it with docker rm -f).


📌 Lifecycle Diagram (Simple)

created → running → stopped → removed
running → paused (pause) → running (unpause)
stopped → running (restart)

📘 Useful Lifecycle Commands

Action | Command Example
Create | docker create nginx
Run (create+start) | docker run nginx
Start | docker start cont_id
Stop | docker stop cont_id
Pause | docker pause cont_id
Unpause | docker unpause cont_id
Restart | docker restart cont_id
Remove | docker rm cont_id
Remove all | docker rm $(docker ps -aq)

=========================================================================

🔵 What is a Docker HEALTHCHECK?

HEALTHCHECK is a way to tell Docker how to test whether a container is healthy.
Docker runs this command periodically and updates the container's status:

  • healthy

  • unhealthy

  • starting

It helps in:

  • auto-restarts

  • load balancers

  • orchestrators (Kubernetes, ECS, Swarm)


🟦 Syntax (Dockerfile)

HEALTHCHECK [OPTIONS] CMD <command>

# or disable health checking entirely
HEALTHCHECK NONE

🟩 Options

Option | Meaning
--interval=30s | Check frequency
--timeout=3s | How long to wait before failing
--start-period=5s | Grace period before checks start
--retries=3 | Fail after X consecutive failed attempts


🟧 Example 1: Simple HTTP Healthcheck

FROM nginx
HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost/ || exit 1

  • If curl -f works → healthy

  • If fails → unhealthy


🟪 Example 2: Healthcheck Script

FROM python:3.9
COPY health.sh /usr/local/bin/health.sh
RUN chmod +x /usr/local/bin/health.sh
HEALTHCHECK --interval=10s --timeout=2s \
  CMD ["sh", "/usr/local/bin/health.sh"]

health.sh:

#!/bin/sh
if curl -f http://localhost:5000/health > /dev/null; then
  exit 0
else
  exit 1
fi

🟥 How to Check Health Status

List containers with health status:

docker ps

Detailed inspection:

docker inspect container_id

You will see:

"Health": {
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [...]
}

🟨 What Docker Does with Healthchecks

Status | Meaning
starting | Startup period (start-period)
healthy | App is functioning
unhealthy | Check failed repeatedly

If you use restart policies:

docker run --health-retries=3 --restart=always ...

Note: --restart=always restarts the container only when its main process exits; plain Docker does not restart a container merely for being unhealthy. Acting on health status is left to orchestrators (Swarm replaces unhealthy tasks; Kubernetes uses its own probes).


📌 Important Notes

  • HEALTHCHECK runs inside the container.

  • Should be lightweight (avoid heavy scripts).

  • Uses exit codes:

    • 0 = success (healthy)

    • 1 = unhealthy

    • 2 = reserved

=========================================================================

🔵 What is docker inspect?

docker inspect is used to view detailed information about Docker containers, images, networks, or volumes in JSON format.

It shows everything about a container:

  • Network info

  • Mounts / volumes

  • IP address

  • Ports

  • Environment variables

  • Health status

  • Entry point, CMD

  • Resource usage config

  • Labels

  • Container state (running, stopped, etc.)

This is the most powerful debugging command.


🟦 Basic Command

docker inspect <container_id_or_name>

🟩 Example Output (Simplified)

You will see JSON fields like:

{
  "Id": "d2f1...",
  "State": {
    "Status": "running",
    "Health": { "Status": "healthy" }
  },
  "Config": {
    "Env": ["APP_ENV=prod", "PORT=8080"],
    "Cmd": ["python", "app.py"]
  },
  "NetworkSettings": { "IPAddress": "172.17.0.2" },
  "Mounts": [
    { "Source": "/data", "Destination": "/var/lib/data" }
  ]
}

🔧 Most Useful Inspect Filters (Important!)

📍 1. Get container IP address

docker inspect -f '{{ .NetworkSettings.IPAddress }}' <container>

📍 2. Get just the environment variables

docker inspect -f '{{ .Config.Env }}' <container>

📍 3. Get container’s running status

docker inspect -f '{{ .State.Status }}' <container>

📍 4. Get container entrypoint

docker inspect -f '{{ .Config.Entrypoint }}' <container>

📍 5. Get exposed ports

docker inspect -f '{{ .NetworkSettings.Ports }}' <container>
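These `-f` Go-template filters are plain lookups into the inspect JSON. As a hedged illustration (the sample document below is made up and heavily trimmed), the same extraction in Python:

```python
import json

# Made-up, trimmed-down sample of `docker inspect` output.
sample = '''
[{
  "Id": "d2f1",
  "State": {"Status": "running", "Health": {"Status": "healthy"}},
  "Config": {"Env": ["APP_ENV=prod", "PORT=8080"], "Entrypoint": null},
  "NetworkSettings": {"IPAddress": "172.17.0.2",
                      "Ports": {"80/tcp": [{"HostIp": "0.0.0.0", "HostPort": "8080"}]}}
}]
'''

info = json.loads(sample)[0]        # docker inspect returns a JSON array

ip     = info["NetworkSettings"]["IPAddress"]   # like '{{ .NetworkSettings.IPAddress }}'
status = info["State"]["Status"]                # like '{{ .State.Status }}'
health = info["State"]["Health"]["Status"]      # like '{{ .State.Health.Status }}'
env    = info["Config"]["Env"]                  # like '{{ .Config.Env }}'

print(ip, status, health)
```

In practice you would feed `docker inspect <container>` output into `json.loads` the same way.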

🟧 Inspecting Images

docker inspect <image_name>

Useful to see:

  • layers

  • build parameters

  • environment variables

  • entrypoint


🟪 Inspecting Networks

docker inspect <network_name>

You can find:

  • connected containers

  • IP ranges (subnet)

  • gateway

  • driver type


🟫 Inspecting Volumes

docker inspect <volume_name>

Shows:

  • mount point

  • driver

  • usage


✨ Real Use Cases (Important for Interviews)

Use Case | Command
Debug network issues | Get IP, ports
Debug ENV variables | extract .Config.Env
Verify mounted volumes | check .Mounts
Check health status | check .State.Health.Status
Know why a container exited | check .State.ExitCode


🟩 Check Container Logs (Related Command)

docker logs <container>

=========================================================================

🔵 What is Port Mapping in Docker?

Port mapping connects a container’s internal port to a port on your host machine so that applications inside the container can be accessed from outside.

Every container has its own internal ports.

Many containers can run on the same host, but the host has a limited set of ports.

If two containers expose the same internal port, you must map them to different host ports to avoid a conflict.

Example:
A container running a web server on port 80 → accessible on host via port 8080

host:8080 → container:80

This is called port forwarding.


🟦 Syntax

docker run -p <host_port>:<container_port> image_name

Example:

docker run -p 8080:80 nginx
docker run -p 8087:80 nginx

Meaning: host 8080 → container 80 for the first nginx, host 8087 → container 80 for the second. Both listen on port 80 inside their containers but are reachable on different host ports.


🟩 Types of Port Mapping

1. Host → Container (most common)

-p 5000:5000

2. Bind to specific IP (e.g., localhost only)

-p 127.0.0.1:8080:80

Meaning:
Only local machine can access it.

3. Automatic host port assignment

-P

Docker assigns random free ports.

🟧 Check Mapped Ports

docker ps

You will see:

0.0.0.0:8080->80/tcp
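That PORTS string has a fixed shape and can be decomposed mechanically. A small sketch parsing the example above:

```python
def parse_port_mapping(mapping: str):
    """Split a `docker ps` PORTS entry like '0.0.0.0:8080->80/tcp'
    into (host_ip, host_port, container_port, protocol)."""
    host_part, container_part = mapping.split("->")
    host_ip, host_port = host_part.rsplit(":", 1)   # rsplit: host_ip may contain ':'
    container_port, protocol = container_part.split("/")
    return host_ip, int(host_port), int(container_port), protocol

result = parse_port_mapping("0.0.0.0:8080->80/tcp")
print(result)  # ('0.0.0.0', 8080, 80, 'tcp')
```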

🟪 Why Port Mapping Is Needed (Interview Points)

  • Containers run in isolated networks

  • Container ports aren’t accessible from host by default

  • Port mapping exposes them

  • Allows multiple instances to run on different host ports

  • Helps in local development and testing


🟫 Real Examples

1️⃣ Expose Postgres

docker run -p 5432:5432 postgres

2️⃣ Expose Airflow Webserver

docker run -p 8080:8080 apache/airflow

3️⃣ Expose FastAPI on 8000

docker run -p 8000:8000 myapp


🔥 Port Mapping in Docker Compose

services:
  web:
    image: nginx
    ports:
      - "8080:80"

Same meaning: host 8080 → container 80

🔥 Docker Pull vs Docker Run — Simple Difference

Docker pull

Only downloads the image from Docker Hub into your system.
It does NOT create or start a container.

Example:

docker pull redis

Result:

  • Redis image is downloaded

  • No container is created

  • No process runs


✅ docker run

Creates a container and runs it.
If the image does NOT exist locally, it will automatically pull it first.

Example:

docker run redis

Result:

  1. Docker checks if image exists

  2. If missing → pulls automatically

  3. Creates a new container

  4. Starts the container (runs Redis)

🤔 Common Uses

You use port mapping when:

  • Running web apps
  • Running APIs
  • Running databases
  • Running UIs like Minio Console, Airflow UI, Grafana UI, etc.

 =========================================================================

=========================================================================

=========================================================================

AIRFLOW  

Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows, commonly used to manage complex data pipelines.

A workflow in Airflow is a DAG (Directed Acyclic Graph), which defines a set of tasks and their execution order, dependencies, and scheduling.

A DAG (Directed Acyclic Graph) represents a workflow: a collection of tasks with dependencies.

In Apache Airflow, a task is the smallest unit of work within a workflow (DAG). Each task represents a single operation or action, such as running a Python function, executing a SQL query, or triggering a bash command. A task is referred to by its task_id, and an operator defines what the task does.

Apache Airflow Scheduler is a core component responsible for triggering task instances to run in accordance with the defined Directed Acyclic Graphs (DAGs) and their schedules.

In Apache Airflow, an Executor is the component responsible for actually running the tasks defined in your workflows (DAGs). It takes task instances that the Scheduler determines are ready and orchestrates their execution either locally or on remote workers.




=======================================================================

🔹 What is Docker?

Docker is a platform used to:

  • Package an application and its dependencies into a container

  • Ensure the application runs the same across all environments

A Docker container is a lightweight, standalone, and executable package that includes everything needed to run a piece of software: code, libraries, environment variables, and config files.

🐳 What is a Dockerfile?

  • A Dockerfile is a text file that contains all the instructions to build a Docker image.

  • It defines the environment, dependencies, and commands your application needs to run consistently on any machine.

  • Think of it as a recipe for your container.

# Step 1: Base image
FROM python:3.11-slim

# Step 2: Set working directory inside container
WORKDIR /app

# Step 3: Copy your project files into container
COPY requirements.txt .

# Step 4: Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Step 5: Copy all code
COPY . .

# Step 6: Set environment variables (optional)
ENV PYTHONUNBUFFERED=1

# Step 7: Define default command to run
CMD ["python", "main.py"]

🔹 Step-by-Step Explanation

  1. FROM python:3.11-slim

    • Base image with Python installed. Slim version = smaller image.

  2. WORKDIR /app

    • Sets working directory inside container.

  3. COPY requirements.txt .

    • Copies dependency file into container.

  4. RUN pip install ...

    • Installs Python packages inside container.

  5. COPY . .

    • Copies your ETL or Airflow scripts into container.

  6. ENV PYTHONUNBUFFERED=1

    • Makes Python logs visible immediately (useful for debugging).

  7. CMD ["python", "main.py"]

    • Default command when container starts. Can be your ETL job or Airflow task script.


🔹 Useful Commands

  1. Build Docker Image

docker build -t my-data-engineer-image .

  2. Run Container

docker run -it --rm my-data-engineer-image

  3. Run with mounted volume (edit locally, reflect in container)

docker run -v $(pwd):/app -it my-data-engineer-image

  4. Push to Docker Hub / Registry

docker tag my-data-engineer-image username/my-image:latest
docker push username/my-image:latest

---------------------------------------------------

🐳 What is Docker Compose?

  • Docker Compose is a tool to define and run multi-container Docker applications.

  • Instead of running each container individually, you define all services in a single docker-compose.yml file.

  • You can spin up the whole environment with one command:

    docker-compose up

version: '3.8'

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  airflow-webserver:
    image: apache/airflow:2.7.1-python3.11
    depends_on:
      - postgres
    environment:
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins
    ports:
      - "8080:8080"
    command: webserver

volumes:
  postgres_data:

🔹 Run Commands

  1. Build & start all services:

docker-compose up --build

  2. Run in detached mode (background):

docker-compose up -d

  3. Stop all containers:

docker-compose down

  4. View logs of a service:

docker-compose logs airflow-webserver
----------------------------------------------------------------------------

📄 What is requirements.txt?

  • It’s a text file listing all the Python packages your project needs.

  • Used by pip to install dependencies: pip install -r requirements.txt
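For the Dockerfile shown earlier, a minimal requirements.txt might look like this (the packages and pinned versions are illustrative, not from the original notes):

```text
pandas==2.1.4
requests==2.31.0
sqlalchemy==2.0.25
```

Pinning exact versions keeps image builds reproducible: the same Dockerfile produces the same environment months later.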



=======================================================================

⭐ 1. Airflow Connections

Connections = saved credentials for external systems.

Examples:

  • AWS

  • Snowflake

  • Postgres

  • MySQL

  • BigQuery

  • S3

  • Kafka

  • Redshift

🔹 How to Set Connections

A) Using Airflow UI

  1. Go to Admin → Connections

  2. Click + Add

  3. Fill:

    • Conn ID → aws_default

    • Conn Type → Amazon Web Services

    • Extra → JSON (keys, region, endpoint)

  4. Save


B) Using CLI

airflow connections add my_postgres \
  --conn-type postgres \
  --conn-host localhost \
  --conn-login user \
  --conn-password pass \
  --conn-schema mydb \
  --conn-port 5432

C) Using Environment Variables

Format:

AIRFLOW_CONN_<CONN_ID>=<connection_uri>

Example:

export AIRFLOW_CONN_MY_PG="postgresql://user:pass@host:5432/db"

This is very common in Docker/Kubernetes.
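One gotcha with AIRFLOW_CONN_* URIs: special characters in the credentials must be percent-encoded or the URI won't parse. A hedged sketch building the URI safely (the credentials are made up):

```python
from urllib.parse import quote

def build_conn_uri(scheme, user, password, host, port, db):
    """Assemble a connection URI, percent-encoding the credentials
    so characters like '@' or '/' don't break URI parsing."""
    return (f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{host}:{port}/{db}")

uri = build_conn_uri("postgresql", "user", "p@ss/word", "host", 5432, "db")
print(uri)  # postgresql://user:p%40ss%2Fword@host:5432/db
# then: export AIRFLOW_CONN_MY_PG="<uri>"
```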


⭐ 2. Airflow Variables

Variables = key–value store for configuration.

Example:

  • file_path

  • S3 bucket name

  • threshold value

  • list of emails


🔹 How to Set Variables

A) Using UI

Admin → Variables → Add

B) Using CLI

airflow variables set file_path /data/raw/
airflow variables get file_path

C) Using JSON IMPORT

airflow variables import variables.json

D) Using Environment Variables

AIRFLOW_VAR_BUCKET="my_bucket"

Usage inside DAG:

from airflow.models import Variable

bucket = Variable.get("bucket")

⭐ 3. Airflow Secret Backends (Very Important for Data Engineers)

Airflow supports managing secrets securely using external systems.

🔹 Supported Secret Backends:

  1. AWS Secrets Manager

  2. GCP Secret Manager

  3. Hashicorp Vault

  4. Azure Key Vault

  5. Custom secret backends


Why use secret backends?

  • Secrets are not stored in Airflow DB

  • Rotated automatically

  • Secure & centralized

  • Avoid plaintext passwords in Airflow UI


🔹 Example: Using AWS Secrets Manager

Add to airflow.cfg:

[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}

AWS Secret Format:

airflow/connections/my_postgres

Used automatically in DAG:

conn = BaseHook.get_connection("my_postgres")

⭐ 4. Best Practices for Storing Credentials (MOST IMPORTANT)

🔐 1. NEVER store passwords in code

password="abcd1234"

✔ Use:

conn = BaseHook.get_connection("my_db")

🔐 2. Avoid storing secrets in Airflow Variables

Variables are NOT encrypted by default.


🔐 3. Use Secret Backends for all production credentials

  • AWS Secrets Manager

  • GCP Secret Manager

  • Hashicorp Vault


🔐 4. Use environment variables for local development

Safe and temporary.


🔐 5. Do not store credentials in GitHub / repo

Always use:

  • .env

  • Kubernetes Secrets

  • Docker Secrets


🔐 6. Use different connection IDs for dev/stage/prod

Example:

  • aws_dev

  • aws_stage

  • aws_prod


🔐 7. Use JSON "extra" field for complex configs

Example Extra field in UI:

{
  "region_name": "ap-south-1",
  "role_arn": "arn:aws:iam::12345:role/my-role"
}

=========================================================================================================

Operators 

are Python classes that define a template for a specific unit of work (task) in a workflow. When you instantiate an operator in a DAG, it becomes a task that Airflow executes. Operators encapsulate the logic required to perform a defined action or job.

Each operator represents a single task in a workflow, like running a script, moving data, or checking whether a file exists.


Operators = do something
Sensors = wait for something
Hooks = connection to systems (S3Hook, PostgresHook, etc.)
Executors = how tasks run (Local, Celery, Kubernetes)
Scheduler = creates DAG Runs + task instances

Types of operators

In Apache Airflow, Operators are the building blocks of your workflows (DAGs). Each operator defines a single task to be executed. There are different types of operators based on the type of work they perform.

Operators fall into three broad categories:

  1. Action Operators:
    Perform an action like running code or sending an email.

    Examples:

  • PythonOperator to run a Python function
  • BashOperator to run shell commands
  • EmailOperator to send emails
  • SimpleHttpOperator to interact with APIs


  2. Transfer Operators:
    Move data between systems or different storage locations.

    Examples:

  • S3ToRedshiftOperator
  • MySqlToGoogleCloudStorageOperator

  3. Sensor Operators:
    Wait for a certain event or external condition before proceeding.

    Examples:

  • FileSensor waits for a file to appear
  • ExternalTaskSensor waits for another task to complete






=======================================================================

Apache Airflow commands


==================================================================
Dependencies

🔗 chain() Function

The chain() function (importable from airflow.models.baseoperator; historically it lived in airflow.utils.helpers) helps you connect multiple tasks or task groups in a sequence without writing task_1 >> task_2 >> task_3 manually.


=======================================================================

⭐ 1. Airflow Scheduling Basics

Airflow schedules based on:

  • cron expressions

  • timetables

  • logical date

  • catchup

  • backfill

A DAG run does NOT start at the exact cron time—it starts after the logical interval finishes.


🟦 2. Cron Expressions in Airflow

Cron = when to run the DAG.

Examples:

Cron | Meaning
0 0 * * * | Every midnight
0 */2 * * * | Every 2 hours
0 6 * * 1 | Every Monday at 6 AM
*/5 * * * * | Every 5 minutes
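A five-field cron string can be split into named fields; a small sketch decoding the examples above (field names follow the standard cron layout):

```python
def cron_fields(expr: str) -> dict:
    """Map a 5-field cron expression onto named fields."""
    names = ["minute", "hour", "day_of_month", "month", "day_of_week"]
    values = expr.split()
    if len(values) != 5:
        raise ValueError(f"expected 5 fields, got {len(values)}")
    return dict(zip(names, values))

fields = cron_fields("0 6 * * 1")   # "Every Monday at 6 AM"
print(fields)
# {'minute': '0', 'hour': '6', 'day_of_month': '*', 'month': '*', 'day_of_week': '1'}
```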

Airflow uses cron to define the start of the schedule interval, but the DAG runs after the interval finishes.


🟦 3. Timetables (Airflow 2.2+)

Timetables = new, flexible scheduling system.

Useful when cron is not enough.

Examples:

  • Run DAG every business day except holidays

  • Run every 3 hours between 9–5

  • Run based on dataset dependencies

  • Run after an upstream dataset is updated

Timetables replace schedule_interval for advanced cases.


🟦 4. Catchup vs No Catchup

Setting | What it Means
catchup=True | Airflow creates DAG Runs for all past dates since the start date
catchup=False | Airflow only runs the latest DAG run, skips historical dates

Example:

DAG start date = Jan 1
Today = Jan 5
Schedule = daily

catchup setting | Runs created
True | Jan 1, 2, 3, 4, 5 (5 runs)
False | Only Jan 5 (latest run)
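Under simplifying assumptions (daily schedule, a run fires only once its data interval has fully elapsed), the catch-up behaviour can be sketched in plain Python; here "now" is set to Jan 6 so the five January runs have all become due:

```python
from datetime import date, timedelta

def runs_to_create(start_date: date, now: date, catchup: bool):
    """Logical dates of daily DAG runs whose interval has elapsed by `now`.

    A run with logical date D covers D .. D+1 day, so it becomes
    runnable only once D + 1 day <= now.
    """
    last = now - timedelta(days=1)          # newest fully-elapsed interval
    if last < start_date:
        return []
    if not catchup:
        return [last]                       # only the latest run
    days = (last - start_date).days
    return [start_date + timedelta(days=i) for i in range(days + 1)]

start, now = date(2024, 1, 1), date(2024, 1, 6)
print(runs_to_create(start, now, catchup=True))    # Jan 1 .. Jan 5 (5 runs)
print(runs_to_create(start, now, catchup=False))   # [Jan 5] only
```

This is a simulation for intuition, not Airflow's scheduler code: real scheduling also involves timetables, data intervals, and the DAG's own parameters.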

🟦 5. Backfill (Manual Catchup)

Backfill = you manually run past dates even if catchup=False.

Command:

airflow dags backfill -s 2024-01-01 -e 2024-01-05 my_dag

Purpose:

  • Re-run historical data

  • Fix missed data loads

  • Reprocess partitions


🟦 6. Logical Date (MOST IMPORTANT)

Logical date = the data interval the DAG run is processing.

It is not the actual time the run starts.

Example:

Schedule: Daily
Cron: 0 0 * * * (midnight, i.e., start of interval)

Logical date of the run: 2024-10-10 00:00
Run actually starts: after the interval ends, i.e., at 2024-10-11 00:00 (or slightly later)

Why this is important?

  • All tasks use logical_date for:

    • file paths

    • S3 partitions

    • SQL date parameters

    • templated variables ({{ ds }} etc.)

Think of it like:

Logical date = data date
Execution date = same as logical date (Airflow 2.2+)
NOT the real-time the task runs


🟣 Logical Date Example (Simple)

Schedule = daily
Interval = 2024-01-01 00:00 → 2024-01-02 00:00

DAG Run for 2024-01-02 actually runs at 2024-01-02 00:01, but:

{{ ds }} = 2024-01-01

Because that’s the interval start (logical date).
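The relationship above can be checked with plain datetime arithmetic; a sketch for a daily schedule only, assuming the run fires shortly after midnight:

```python
from datetime import datetime, timedelta

def logical_date_for_daily_run(run_start: datetime) -> datetime:
    """For a daily schedule, a run starting at time T processes the interval
    [T - 1 day, T), so its logical date ({{ ds }}) is T minus one day."""
    interval_end = run_start.replace(hour=0, minute=0, second=0, microsecond=0)
    return interval_end - timedelta(days=1)

run_start = datetime(2024, 1, 2, 0, 1)            # fires just after midnight
ds = logical_date_for_daily_run(run_start).strftime("%Y-%m-%d")
print(ds)  # 2024-01-01
```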

=======================================================

Cron

cron expressions are used in the schedule_interval parameter of a DAG to define when the DAG should run.
A cron expression is a string representing a schedule — it tells Airflow how often to run a DAG

None: never scheduled; the DAG is only triggered manually
@once: schedule the DAG exactly once





=======================================================================

🔹 What Are Hooks in Apache Airflow?

Hooks in Airflow are interfaces to external platforms, like databases, cloud storage, APIs, and more. They abstract the connection and authentication logic, allowing operators to use these services easily.

Hooks are mostly used behind the scenes by Operators, but you can also call them directly in Python functions.

🔸 Why Use Hooks?

  • Reusable connection logic

  • Securely use Airflow's connection system (Airflow Connections UI)

  • Simplifies integrating with external systems (e.g., MySQL, S3, BigQuery, Snowflake)

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.utils.dates import days_ago

def get_data():
    hook = PostgresHook(postgres_conn_id='my_postgres')
    connection = hook.get_conn()
    cursor = connection.cursor()
    cursor.execute("SELECT COUNT(*) FROM my_table;")
    result = cursor.fetchone()
    print(f"Row count: {result[0]}")

with DAG('postgres_hook_example',
         start_date=days_ago(1),
         schedule_interval=None,
         catchup=False) as dag:

    t1 = PythonOperator(
        task_id='count_rows',
        python_callable=get_data
    )

=======================================================================

☁️ What is S3Hook?

  • S3Hook is a helper class in Airflow to interact with Amazon S3.

  • It abstracts the boto3 (AWS SDK for Python) operations so you can read/write files, list buckets, check if objects exist, etc., directly in your DAGs.

  • Comes from: from airflow.providers.amazon.aws.hooks.s3 import S3Hook (Airflow 2+)

🔹 When to Use S3Hook

  1. You want to upload a file to S3 from Airflow.

  2. You want to download a file from S3 for processing.

  3. You want to check if a key/object exists before running a task.

  4. You want to list files in a bucket dynamically.

=======================================================================


Defining dags





=======================================================================

🧠 What Does an Executor Do?

It communicates with the Scheduler and runs the tasks defined in your DAGs—either locally, in parallel, or on distributed systems like Celery or Kubernetes.

-> airflow info shows (among other things) which executor is configured




=======================================================================

In Apache Airflow, an SLA (Service Level Agreement) is a time-based constraint that you can apply to a task to ensure it finishes within a defined timeframe. If it misses the deadline, Airflow can alert or log the SLA miss.

🧠 Why Use SLAs?

To ensure:

  • Timely data availability

  • Reliable pipeline performance

  • Alerting for delays or failures


🧩 How SLA Works in Airflow

  • SLA is defined per task, not per DAG.

  • If a task takes longer than the SLA, it's marked as an SLA miss.

  • Airflow triggers an SLA miss callback and logs the event.

  • Email alerts can be sent if configured.


📊 Monitoring SLA Misses

  • Go to Airflow UI > DAGs> Browse > SLA Misses

  • Or check the Task Instance Details


⚠️ Notes

  • SLAs are checked after the DAG run completes.

  • SLAs are about runtime, not start time.

  • SLA doesn’t retry or fail the task—it just logs the violation.
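An SLA is attached per task via the sla argument, often inherited through default_args. A hedged sketch of just the configuration dict, without importing Airflow (the owner and email values are illustrative):

```python
from datetime import timedelta

# Passed to DAG(..., default_args=default_args); every task in the DAG
# then inherits a 30-minute SLA unless it sets its own `sla`.
default_args = {
    "owner": "data-eng",                 # illustrative value
    "sla": timedelta(minutes=30),        # task must finish within 30 minutes
    "email": ["alerts@example.com"],     # illustrative address
    "email_on_failure": True,
}

print(default_args["sla"])  # 0:30:00
```

Remember the note above: missing the SLA only logs the violation (and can trigger a callback/email); it does not fail or retry the task.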

=======================================================================

🔧 What is a Template?

A template is a string that contains placeholders which are evaluated at runtime using the Jinja2 engine.

In Apache Airflow, templates allow you to dynamically generate values at runtime using Jinja templating (similar to templating in Flask or Django). They are useful when you want task parameters to depend on the execution context, such as the date, DAG ID, or other dynamic values.



=======================================================================

Jinja is a templating engine for Python used heavily in Apache Airflow to dynamically render strings using runtime context. It lets you inject variables, logic, and macros into your task parameters.

🔍 What is Jinja?

Jinja is a mini-language similar to Django or Liquid templates. In Airflow, it's used for:

  • Creating dynamic file paths

  • Modifying behavior based on execution date

  • Using control structures like loops and conditions

🧾 DAG Example (Manual filename via params)

=======================================================================

✅ What is catchup?

When catchup=True (default), Airflow will "catch up" by running all the DAG runs from the start_date to the current date.

When catchup=False, it only runs the latest scheduled DAG run from the time it is triggered.


=======================================================================

🔁 What is Backfill?

Backfill is the process of running a DAG for past scheduled intervals that have not yet been run.

When a DAG is created or modified with a start_date in the past, Airflow can "backfill" to ensure that all scheduled intervals between the start_date and now are executed.




=======================================================================

🔧 Components of Apache Airflow

Apache Airflow is made up of several core components that work together to orchestrate workflows:

Component | Description
Scheduler | The brain of Airflow: monitors DAGs and tasks, triggers DAG runs based on schedules or events, and submits tasks to the executor for execution. It continuously checks dependencies and task states to decide what to run next. By default it can take up to ~5 minutes for Airflow to detect a new DAG file in the DAGs folder, and the scheduler loop re-checks for runnable tasks every few seconds.
Executor | Executes task instances assigned by the scheduler. It can run tasks locally, via distributed workers, or on containerized environments depending on the executor type (LocalExecutor, CeleryExecutor, KubernetesExecutor, etc.).
Workers | Machines or processes (depending on executor) that actually run the task code. For distributed executors like Celery or Kubernetes, multiple workers run tasks in parallel, scaling out capacity.
Metadata Database | A relational database (e.g., PostgreSQL, MySQL) that stores all Airflow metadata: DAG definitions, task states, execution history, logs, connection info, and more. The scheduler, workers, and webserver interact with it constantly.
Webserver (UI) | Provides a user interface to monitor DAG runs, task status, logs, and overall workflow health. Flask-based in Airflow 2; Airflow 3 moves to a FastAPI server with APIs for workers, UI, and external clients.
DAGs Folder | Directory or location where DAG definition Python files live. These files describe the workflows and are parsed by the scheduler or DAG processor.

🟦 What is Airflow Scheduler?

The Airflow Scheduler is the component responsible for triggering DAG runs and executing tasks at the right time based on the DAG’s schedule, dependencies, and state.

📌 It is the “brain” of Airflow.


🟣 What does the Airflow Scheduler do?

The scheduler continuously:

Function | Explanation
Monitors DAGs | Watches all DAG files for new/updated DAGs.
Creates DAG Runs | Starts DAG runs at the scheduled intervals.
Checks Dependencies | Ensures upstream tasks are finished before running the next task.
Queues Tasks | Decides which tasks are ready to run.
Sends tasks to Executor | Hands tasks to workers (Local/Celery/K8s).
Handles retries | If a task fails, the scheduler triggers retries.
Manages SLA | Detects SLA misses.

🟦 How the Scheduler Works (Simple Flow)

Parse DAG → Check schedule → Create DAG Run → Check dependencies → Queue tasks → Executor runs tasks

The scheduler loops continuously, making decisions every few seconds.


🟣 Important Concepts for Interviews

1. Scheduling interval

Scheduler respects:

  • schedule_interval

  • start_date

  • end_date

  • catchup


2. Logical Date (Very important!)

Scheduler runs DAGs based on logical execution date, not current time.


3. Executor

Scheduler just queues tasks, but does NOT execute them.
Executor runs the task.

Example executors:

  • LocalExecutor

  • CeleryExecutor

  • KubernetesExecutor


4. Concurrency Controls

Scheduler respects:

  • DAG concurrency

  • Task concurrency

  • Pools

  • Parallelism

These prevent overload.


5. Heartbeats

Scheduler sends a “heartbeat” every few seconds.
If heartbeat stops → scheduler is down.


🟦 Example: Scheduler in Action

If a DAG has:

schedule_interval='@daily'
start_date=2024-01-01

The scheduler will create DAG runs:

  • 2024-01-01 (logical date)

  • 2024-01-02

  • 2024-01-03

Each run → scheduler checks tasks → queues ready ones.

======================================================

🟦 What is an Executor in Airflow?

An Executor is the Airflow component responsible for actually running the tasks.
While the Scheduler decides what to run,
the Executor decides how and where to run it.

📌 Executor = Task runner
📌 Scheduler = Task coordinator


🟣 Why Executor is Important?

Executors decide:

  • How many tasks run in parallel

  • Where tasks get executed

  • Whether tasks run locally or on workers or on Kubernetes pods

The choice of executor determines Airflow’s scalability.


🟦 Types of Executors (Must Know)


1. SequentialExecutor

  • ✔ What it is:

    • Runs ONE task at a time

    • Single-threaded

    • No parallelism

    • Default for quick testing

    ✔ Use Cases:

    • Local testing

    • Development / laptop

    • Very small DAGs

    ❌ Not for production.


2. LocalExecutor

  • ✔ What it is:

    • Runs tasks in parallel on the same machine

    • Uses multiple processes/threads

    • Good performance for small pipelines

    ✔ Use Cases:

    • Small teams

    • Single-server Airflow deployments

    • Use case: 10–20 parallel tasks

    ❌ Not suitable for distributed workloads

    ❌ Cannot scale beyond one machine


3. CeleryExecutor

  • ✔ What it is:

    • Distributed task execution

    • Multiple worker machines

    • Uses a message broker:

      • Redis

      • RabbitMQ

    ✔ Use Cases:

    • Medium to large teams

    • Many DAGs running at same time

    • Need dozens or hundreds of parallel tasks

    • On-prem or AWS EC2 deployments

    👍 Pros

    • Highly scalable

    • Fault-tolerant

    • Good for data engineering teams

    👎 Cons

    • Complex setup (workers + broker + DB)

    • Higher maintenance


4. KubernetesExecutor (Most modern)

  • ✔ What it is:

    • Each task runs in its own Kubernetes pod

    • True elastic scaling

    • Perfect isolation of tasks

    • Clean environment per task

    ✔ Use Cases:

    • Cloud-native setups

    • Very large workloads

    • Need per-task compute scaling

    • Mixed workloads (Python, Spark, Java, Bash, etc.)

    👍 Pros:

    • Auto-scaling

    • GPU/High-memory pods

    • Per-task docker image support

    👎 Cons:

    • Requires Kubernetes knowledge

    • Complex to manage for small teams


(Bonus) — LocalKubernetesExecutor (Hybrid)

  • LocalExecutor for small tasks

  • KubernetesExecutor for heavy tasks


🟦 How Scheduler and Executor Work Together

Scheduler → Sends task to Executor → Executor launches worker/pod → Task runs → Updates state

🟣 Comparison Table

| Executor | Parallel? | Distributed? | Use Case |
| --- | --- | --- | --- |
| SequentialExecutor | ❌ No | ❌ No | Testing only |
| LocalExecutor | ✔ Yes | ❌ No | Medium workloads |
| CeleryExecutor | ✔ Yes | ✔ Yes | Large-scale pipelines |
| KubernetesExecutor | ✔ Yes | ✔ Yes | Cloud-native, scalable workloads |

🟦 Best Executor for Data Engineering?

| Use Case | Best Executor |
| --- | --- |
| Small team, single VM | LocalExecutor |
| Distributed on-prem cluster | CeleryExecutor |
| Cloud environments (AWS/GCP/Azure) | KubernetesExecutor |
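The executor is selected through Airflow's configuration. A minimal sketch (the `LocalExecutor` value here is just an example choice):

```shell
# In airflow.cfg the executor is set under the [core] section:
#   [core]
#   executor = LocalExecutor
#
# The same setting via environment variable (overrides airflow.cfg),
# following Airflow's AIRFLOW__<SECTION>__<KEY> convention:
export AIRFLOW__CORE__EXECUTOR=LocalExecutor
```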

======================================================

🟦 What is the Airflow Webserver?

The Airflow Webserver is the component that provides the UI (User Interface) for Airflow.
It lets you view, monitor, trigger, pause, and manage DAGs through a browser.

📌 Webserver = Airflow UI
📌 It shows everything happening inside Airflow.


🟣 What Webserver Does

| Function | Explanation |
| --- | --- |
| Displays DAGs | Shows all DAGs in the UI |
| Trigger DAGs | You can manually run a DAG |
| Pause/Unpause DAGs | Enable or disable scheduling |
| View Graph View | DAG structure (dependencies) |
| Task Logs | View task execution logs |
| Monitor status | Success / Failed / Queued / Running |
| View XCom | See data passed between tasks |
| Manage Connections | Add/edit database or API credentials |
| Variables | Store global values for DAGs |
| Admin Panel | DAG runs, task instances, users, roles |

🟦 How Webserver Works (Simple Explanation)

  1. Webserver reads DAG definitions (in Airflow 2.x, from serialized DAGs in the metadata database)

  2. Displays DAGs in the UI

  3. Shows scheduler and executor status

  4. Allows user actions (trigger, clear, rerun tasks)

It runs using Flask (Python web framework) behind the scenes.


🟦 Important Ports

Default port:

http://localhost:8080

In production you may use Nginx/HTTPS.


🟣 Webserver vs Scheduler

| Component | Purpose |
| --- | --- |
| Webserver | UI to view/manage pipelines |
| Scheduler | Decides when tasks should run |
| Executor | Actually runs the tasks |

🟦 How to Start the Webserver

airflow webserver

In Docker:

docker-compose up airflow-webserver

=================================================================

🟦 1. max_active_runs (at DAG level)

Definition

max_active_runs = maximum number of DAG Runs that can run at the same time for a specific DAG.

📌 Think:

"How many full pipeline runs can run in parallel?"

Example:

DAG(
    dag_id='etl_pipeline',
    max_active_runs=1,
)

✔ Only one DAG run will run at a time
✖ A new scheduled run will wait until the previous run finishes

📘 Why important?

  • Prevents overlapping runs

  • Useful for pipelines that update the same tables

  • Avoids data corruption


🟦 2. concurrency (at DAG level)

Definition

concurrency = maximum number of task instances from the SAME DAG that can run in parallel. (In Airflow 2.2+ this parameter was renamed to max_active_tasks.)

📌 Think:

"How many tasks inside this DAG can run at the same time?"

Example:

DAG(
    dag_id='etl_pipeline',
    concurrency=5,
)

✔ Maximum 5 tasks from this DAG can run at once
✖ The 6th task waits in the queue

📘 Why important?

  • Controls the load on your system

  • Prevents overwhelming the database, Spark cluster, APIs, etc.


=================================================

⭐ 1. Dynamic DAGs (Airflow)

Dynamic DAGs = DAGs that are generated programmatically instead of hardcoding tasks.

Example:

for table in ["customers", "orders", "sales"]:
    PythonOperator(
        task_id=f"load_{table}",
        python_callable=load_table,
        op_kwargs={"table": table},
        dag=dag,
    )

✔ Why use Dynamic DAGs?

  • Automatically create tasks for multiple tables/files

  • Avoid writing duplicate code

  • Perfect for pipelines with 20–500 tables


⭐ 2. Dynamic Tasks (Task Mapping in Airflow 2.3+)

Task mapping = Airflow automatically creates multiple task instances at runtime.

Example (Best Interview Answer):

@task
def load_table(table):
    ...

tables = ["customers", "orders", "sales"]
load_table.expand(table=tables)

✔ Why Task Mapping is powerful:

  • Dynamically generates tasks at runtime

  • No DAG parsing overhead (unlike old dynamic DAGs)

  • Much cleaner & more scalable

✔ Example Use Cases:

  • Load 100 S3 files

  • Process N partitions

  • Trigger N API calls

  • Run ML jobs for each model


⭐ 3. Avoiding DAG Explosion

DAG Explosion = too many tasks or too many DAGs, causing:

  • Slow UI

  • Scheduler overload

  • Metadata DB pressure

  • DAG parsing delays

Causes:

  • Generating thousands of tasks in the DAG file

  • Creating DAGs dynamically for each table (e.g., 100 tables → 100 DAGs)

Solution:

  • Use Task Mapping

  • Use TaskGroup

  • Batch tasks

  • Push dynamic behavior to runtime, not DAG file parse time


⭐ 4. TaskGroup (Organizing Large DAGs)

TaskGroup = visual and logical grouping of tasks.

Example:

with TaskGroup("load_all_tables") as tg:
    for table in tables:
        PythonOperator(
            task_id=f"load_{table}",
            python_callable=load,
            op_kwargs={"table": table},
        )

✔ Why TaskGroup is used:

  • Organize DAGs with 50+ tasks

  • Avoid clutter in Airflow UI

  • Easier debugging

  • Logical grouping like:

    • extract group

    • transform group

    • load group

Not for isolation — only for visual and logical grouping.


=======================================================================

DAG View


=======================================================================

🔄 XCom(Cross-Communication)

XCom (short for “Cross-communication”) allows tasks to exchange small amounts of data between each other in a DAG.

🔧 How XCom Works

  1. Push → Send data to XCom from one task

  2. Pull → Retrieve that data in another task


🔥 What XCom Should NOT be Used For

Very important for interviews:

❌ Do NOT pass large datasets

❌ Not meant for files
❌ Not used for DataFrames
❌ Not used for binary data

Use XCom only for small metadata, like:

  • file paths

  • S3 keys

  • table names

  • row counts

🟦 Types of XCom in Airflow

There are 3 main types of XCom you must know:


1. Default / Implicit XCom (PythonOperator return value)

  • When a PythonOperator function returns a value, Airflow automatically pushes it to XCom.

  • No need to write xcom_push() manually.

Example:

def task1():
    return "hello"

✔ Automatically becomes an XCom value
✔ Most commonly used type


2. Manual XCom (Explicit push & pull)

Used when you want full control.

Push:

def push_func(**context):
    context['ti'].xcom_push(key='mykey', value='mydata')

Pull:

def pull_func(**context):
    value = context['ti'].xcom_pull(key='mykey', task_ids='push_task')

✔ Used when you need custom key names
✔ Useful when returning multiple values


3. TaskFlow API XCom (@task decorator)

This works like implicit XCom but with cleaner syntax, using the TaskFlow API.

Example:

from airflow.decorators import task

@task
def t1():
    return "hi"

@task
def t2(msg):
    print(msg)

t2(t1())

✔ Return values automatically become XCom
✔ Passing function outputs becomes easier
✔ Preferred in modern Airflow (2.x)

=======================================================================

🧩 Airflow Variables

Airflow Variables are key-value pairs used to store and retrieve dynamic configurations in your DAGs and tasks.


=======================================================================

🛰️ Apache Airflow Sensors

Sensors are special types of operators in Airflow that wait for a condition to be true before allowing downstream tasks to proceed.



=======================================================================

🌿 Branching in Apache Airflow

Branching allows you to dynamically choose one (or more) downstream paths from a set of tasks based on logic. This is done using the BranchPythonOperator.

🧠 Why Use Branching?

Branching is useful when:

  • You want to run different tasks based on a condition

  • You need to skip certain tasks

  • You want "if/else" logic in your DAG


✅ Notes:

  • Tasks not returned by BranchPythonOperator will be skipped.

  • You can return a single task ID or a list of task IDs.

  • Ensure your downstream tasks can handle being skipped, or use appropriate trigger_rule.

=======================================================================

Subdag

🔄 What is a SubDAG in Apache Airflow?

A SubDAG is a DAG within a DAG — essentially, a child DAG defined inside a parent DAG. It's used to logically group related tasks together and reuse workflow patterns, making complex DAGs easier to manage. Note: SubDAGs (SubDagOperator) are deprecated in Airflow 2.x, and TaskGroups are the recommended replacement.

📌 Think of a SubDAG as a modular block that can be reused or organized separately.



🧩 TaskGroup in Apache Airflow

A TaskGroup in Airflow is a way to visually and logically group tasks together in the UI without creating a separate DAG like SubDagOperator. It's lightweight, easier to use, and the recommended approach in Airflow 2.x+.



=======================================================================

🔗 Edge Labels in Apache Airflow

Edge labels in Airflow are annotations you can add to the edges (arrows) between tasks in the DAG graph view. They help clarify why one task depends on another, especially when using complex branching, conditionals, or TriggerRules.



==========================================================

1. Catchup

Definition:
Airflow automatically creates DAG Runs for all missed schedule intervals since the DAG’s start_date.

| Feature | Details |
| --- | --- |
| Parameter | catchup=True/False |
| Default | True |
| Behavior | Creates runs for all past intervals until today |
| Use Case | When you want to process historical data automatically |

Example:

DAG start date = Jan 1, today = Jan 5, daily DAG, catchup=True → DAG runs are created for the Jan 1–4 intervals (each interval's run executes after that interval ends)


2. Backfill

Definition:
Manually run DAG runs for specific past dates, regardless of catchup setting.

| Feature | Details |
| --- | --- |
| Command | airflow dags backfill -s 2024-01-01 -e 2024-01-05 my_dag |
| Behavior | Forces DAG to run historical intervals |
| Use Case | Missed runs, reprocessing, fixing failed jobs |

✅ Backfill is manual and selective, unlike catchup which is automatic.


3. Manual Run

Definition:
Trigger a DAG run manually at any time, usually for testing or ad-hoc runs.

| Feature | Details |
| --- | --- |
| Method | Airflow UI → Trigger DAG, or CLI → airflow dags trigger my_dag |
| Behavior | Creates a single DAG run immediately |
| Use Case | Test DAG, ad-hoc execution, debugging |

Comparison Table

| Feature | Automatic/Manual | Purpose | Example |
| --- | --- | --- | --- |
| Catchup | Automatic | Run all missed DAG runs | catchup=True → process Jan 1–5 automatically |
| Backfill | Manual | Run specific historical DAG runs | airflow dags backfill -s Jan1 -e Jan5 |
| Manual Run | Manual | Trigger DAG on demand | Click “Trigger DAG” in UI or CLI command |

Logical Date vs These Runs (Important!)

  • Catchup → generates DAG runs using logical dates for past intervals

  • Backfill → same, but manually specified dates

  • Manual Run → logical date can be specified manually or default to current timestamp


=======================================================








