Docker/Airflow
🐳 What is Docker?
Docker is a platform to package applications into containers.
Docker is an open-source platform that enables developers to build, deploy, run, update, and manage applications using containers, which are lightweight, portable, and self-sufficient units that package an application and its dependencies together.
A container is a lightweight, portable, isolated environment that includes your app + dependencies + OS libraries.
Think of it as “ship your code with everything it needs” so it runs the same anywhere — your laptop, cloud, or production server.
Written in the Go programming language.
🚀 Why Docker Containers Are Lightweight
✅ 1. They Don’t Need a Full Operating System Kernel
A virtual machine (VM) needs:
Its own full OS kernel
System libraries
Drivers
Boot process
A Docker container only needs:
Your application
Dependencies (libraries, Python packages, etc.)
A very small OS userland (Ubuntu-base, Alpine, etc.)
Uses host’s OS kernel instead of its own
So instead of 2–5 GB (VM), a container may be 10–100 MB.
✅ 2. Containers Share Kernel With the Host
No container contains a kernel.
Kernel is the heaviest part of an OS
All containers use the same Linux kernel
This reduces:
Memory usage
Startup time
CPU overhead
✅ 3. Copy-on-Write File System
Docker uses layered images and UnionFS.
👉 Meaning:
If 10 containers use the same base image (like Python 3.10), the base layer is stored once
Only the top writable layer is unique per container
So storage & memory are reused efficiently.
✅ 4. Low Overhead for Startup
VM: Boots a full OS → can take minutes
Docker: Just starts a process → starts in <1 second
Because:
No BIOS/bootloader
No OS boot
Only starts your application process
✅ 5. Namespaces & Cgroups
Linux gives Docker:
Isolation (namespaces)
Resource control (cgroups)
These are kernel features, not heavy virtualization technology.
No hypervisor → less overhead.
✅ 6. Smaller Images (Especially Alpine)
Example:
Ubuntu Base Image → 70–100 MB
Alpine Linux → 4–6 MB (!!)
So applications are small and fast to ship.
=========================================================================
🟢 Docker Benefits (Simple Points)
1. Consistent environment everywhere
Same code runs the same way on any machine (dev, QA, prod).
2. Lightweight (compared to VMs)
Starts in seconds
Uses less CPU and RAM
3. Easy deployment
Build once → run anywhere
Faster releases
4. Isolation
Each container has its own dependencies
No version conflicts
5. Easy scaling
Run multiple containers from one image
Good for stream jobs, ETL parallelism
6. Multi-service setup using Compose
Run Airflow + Postgres + Kafka + Redis + Spark together
One command:
docker-compose up
7. Cleaner development
No need to install databases, Spark, Kafka manually
Everything runs inside containers
8. Better CI/CD
Code + dependencies packaged into one image
Consistent builds
9. Secure
Apps isolated from host system
10. Cloud-native
Works with Kubernetes, AWS ECS, EKS, GCP GKE, Azure AKS
Industry standard
=========================================================================
🧱 Components of Docker
Docker has 5 major components:
Docker Client
Docker Daemon (dockerd)
Docker Images
Docker Containers
Docker Registry
Below is a deep but simple breakdown 👇
1️⃣ Docker Client (CLI)
The client is what you interact with.
The Docker Client (docker CLI) communicates with the daemon using a REST API; the daemon then does the actual work of turning images into running containers.
When you run commands such as docker build, docker run, or docker pull, you are using the Docker Client.
👉 The CLI translates these commands into API calls and sends them to the Docker Daemon, which carries out the actual work.
2️⃣ Docker Daemon (dockerd)
Daemon = background service that does the heavy work.
A background service that runs on the host machine and manages Docker objects such as images, containers, networks, and volumes. It listens for API requests from the Docker client and executes container lifecycle operations like starting, stopping, and monitoring containers.
The Docker Engine Daemon (dockerd) runs in the background, listening to API requests and managing objects like images, containers, networks, and volumes.
It is responsible for:
Building images
Running containers
Managing images
Managing networks
Managing storage
The Docker Client talks to the Daemon using a REST API.
3️⃣ Docker Images
python:3.10-slim is an example of an image. A Docker image is a:
Blueprint
Read-only template
Layered package
It contains:
Application code
Dependencies
Runtime
OS libraries
Configurations
Images are created using docker build.
A Docker Image is a file made up of multiple layers that contains the instructions to build and run a Docker container. It acts as an executable package that includes everything needed to run an application — code, runtime, libraries, environment variables, and configuration.
How it Works:
- The image defines how a container should be created.
- Specifies which software components will run and how they are configured.
- Once an image is run, it becomes a Docker Container.
4️⃣ Docker Containers
A container is a running instance of an image.
Running instances of Docker images with a writable layer on top, enabling users to execute applications within isolated environments. Containers are lightweight and start quickly compared to traditional virtual machines.
When you run: docker run python:3.10-slim
Container =
Lightweight
Portable
Isolated process
Created using docker run on an image.
Multiple containers can run from the same image.
5️⃣ Docker Registry
A registry stores images (Docker Hub / ECR / GCR).
Examples:
Docker Hub
AWS ECR
GitHub Container Registry
GCR
Azure ACR
Inside a registry we have repositories, and inside repositories we have tags.
🧩 Additional Components (Advanced)
🔹 6️⃣ Dockerfile
A file containing instructions to build an image.
The Dockerfile uses a DSL (Domain-Specific Language) and contains the instructions for generating a Docker image. Write the instructions in order, because the Docker daemon executes them from top to bottom when building the image.
🔹 7️⃣ Docker Engine
The Docker Engine is the core component that enables Docker to run containers on a system. It follows a client-server architecture and is responsible for building, running, and managing Docker containers.
Core part of Docker containing:
Client
REST API
Daemon
🔹 8️⃣ Docker Compose
Tool to run multi-container apps.
Example:
app container
db container
redis container
All defined in docker-compose.yml.
🔹 9️⃣ Docker Network
Provides:
Bridge network
Host network
Overlay network (for Swarm)
Container-to-container communication
🔹 🔟 Docker Volumes
Used for persistent storage, e.g. dbdata:/var/lib/postgresql/data under a volumes: section.
Examples:
Databases
Logs
App data
=========================================================================
Containerization vs Virtual Machines
🖥️ 1. What is a VM (Virtual Machine)?
A Virtual Machine (VM) is a computer inside a computer.
It behaves like a real machine:
It has its own Operating System (Windows / Linux / macOS)
Its own virtual CPU, RAM, disk, network
Example:
You install Ubuntu Linux on your Windows laptop using VirtualBox.
That Ubuntu runs as a VM.
✔ How it works
A VM includes:
BIOS
Bootloader
Kernel
User space
Applications
So VMs are heavy and use more resources.
👑 2. What is a Hypervisor?
A Hypervisor is the manager that creates and runs Virtual Machines.
It lies between:
Hardware (CPU, RAM)
VMs
It gives resources to each VM.
Two Types of Hypervisors
Type-1 (Bare Metal)
Runs directly on hardware → faster
Examples:
VMware ESXi
Microsoft Hyper-V
Xen
KVM
Type-2 (Hosted)
Runs on top of an operating system → slower
Examples:
VirtualBox
VMware Workstation
🧠 3. What is a Kernel?
The kernel is the core part of an operating system.
It controls:
CPU
RAM
Disk
Network
Processes
Services
Every OS has a kernel:
Linux kernel
Windows NT kernel
macOS XNU kernel
✔ What kernel does
Kernel manages:
| Kernel Function | Meaning |
|---|---|
| Process management | Runs programs |
| Memory management | Allocates RAM |
| Device drivers | Talks to hardware |
| Networking | Manages internet connections |
| File systems | Reads/writes files |
The kernel is what makes an operating system an operating system.
=========================================================================
🟢 What is a Docker Image?
A Docker Image is a read-only, immutable file that contains everything your application needs to run:
Code (Python scripts, ETL jobs, DAGs)
Libraries / dependencies (pandas, PySpark, boto3, Airflow)
OS-level tools and environment variables
configuration files
It acts as a blueprint for creating Docker containers.
Think of it as a blueprint or snapshot of your environment.
🔹 Key Features of a Docker Image
Immutable: Once built, the image doesn’t change.
Versioned: Can tag different versions (my-etl:1.0, my-etl:2.0).
Portable: Can be run anywhere with Docker installed (local, cloud, CI/CD).
Layered: Each command in the Dockerfile creates a new layer, allowing caching and faster builds.
🔹 Analogy
Image = Cake Recipe → contains instructions and ingredients.
Container = Baked Cake → running instance you can interact with.
🔹 How to Create a Docker Image
Step 1: Create a Dockerfile
Step 2: Build the image: Building an image means generating a complete packaged environment for your application, based on the instructions in a Dockerfile.
-t my-etl-image:1.0 → gives the image a name and version tag. The image now contains Python + dependencies + your ETL code.
Step 3: Verify the image
Lists all images on your machine
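A minimal sketch of steps 2 and 3, assuming the image name my-etl-image used elsewhere in these notes:

```bash
# Step 2: build the image from the Dockerfile in the current directory
docker build -t my-etl-image:1.0 .

# Step 3: verify the image
docker images
```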
🔥 Simple Analogy
Dockerfile = Recipe
docker build = Cooking the dish
Docker image = Finished food
Container = Serving & eating the food
🔹 Practical Use Case for Data Engineers
ETL pipelines: Package Python / Spark scripts and dependencies → run anywhere
Airflow DAGs: Build an image containing DAGs + plugins → use DockerOperator to run tasks
Testing pipelines: Share image with team → exact same environment
=========================================================================
🟦 Dockerfile (Simple Explanation)
A Dockerfile is a text file containing step-by-step instructions to build a Docker image.
You tell Docker how to create the image:
what OS to use, what packages to install, what code to copy, what command to run.
Think of it as a recipe for creating your application's environment. The Docker engine reads this file and executes the commands in order, layer by layer, to assemble a final, runnable image.
🟩 Most Important Instructions
| Instruction | Meaning |
|---|---|
| FROM | Base image |
| WORKDIR | Set working directory |
| COPY | Copy files into image |
| RUN | Execute commands during build |
| CMD | Default command when container runs |
| ENTRYPOINT | Fixed command; CMD becomes args |
| EXPOSE | Document port |
| ENV | Set environment variables |
| ARG | Build-time variable |
| VOLUME | Create mount point |
🟨 Basic Dockerfile Example
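A sketch of the kind of Dockerfile the list below describes (main.py stands in for your entry script):

```dockerfile
FROM python:3.10
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "main.py"]
```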
What it does:
Uses Python 3.10 base
Sets /app as the working folder
Installs requirements
Copies your code
Runs main.py by default
🟧 Build & Run Image
Build
Run
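For example (image name is a placeholder):

```bash
# Build
docker build -t myapp:1.0 .

# Run
docker run myapp:1.0
```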
🏗️ 1. BUILD = Create the Image
Build means you are constructing the Docker image from a Dockerfile.
Command: docker build -t <name> .
What happens during build:
Docker reads the Dockerfile
Downloads base image
Installs dependencies (RUN commands)
Copies your code (COPY)
Creates layers
Produces a final image
📌 Output of build = Docker Image (a blueprint)
🚀 2. RUN = Start a Container
Run means you are starting a container from that image.
Command: docker run <image>
What happens during run:
Docker takes the image
Creates a live running instance (container)
Executes the CMD/ENTRYPOINT
Runs your application
📌 Output of run = Container (a running process)
🔥 Simple Analogy
| Concept | Analogy |
|---|---|
| Dockerfile | Recipe |
| docker build | Cooking the dish using the recipe |
| Image | Finished, packed food |
| docker run | Serving/eating the food |
=========================================================================
🟢 What is a Docker Container?
A Docker Container is a running instance of a Docker Image.
It is isolated, lightweight, and contains everything defined in the image: your code, libraries, and environment.
Unlike an image, a container can run, execute, generate logs, and store temporary data.
Analogy:
Image = Recipe
Container = Cake baked from that recipe
🔹 Key Features of Containers
Ephemeral / Mutable
Containers can run, stop, restart, or be deleted.
Changes inside a container don’t affect the original image unless you commit it.
Isolated Environment
Each container has its own filesystem, processes, and network stack.
Prevents conflicts between different projects or dependencies.
Lightweight & Fast
Shares the host OS kernel → much faster than a VM.
Starts in seconds.
Multiple Instances
You can run multiple containers from the same image → efficient resource usage.
🔹 Practical Commands
Run a container
-it → interactive terminal
--name → container name
my-etl-image:1.0 → image to run
List running containers
Stop a container
Remove a container
Run in detached mode (background)
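The commands for the steps above might look like this (container name is a placeholder):

```bash
# Run a container
docker run -it --name my-etl-container my-etl-image:1.0

# List running containers
docker ps

# Stop a container
docker stop my-etl-container

# Remove a container
docker rm my-etl-container

# Run in detached mode (background)
docker run -d --name my-etl-container my-etl-image:1.0
```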
🔹 Containers in Data Engineering
ETL Jobs: Each pipeline can run in a separate container → isolation and reproducibility.
Airflow Tasks: DockerOperator spins up a container per task → consistent environment for Python/Spark jobs.
Local Testing: Run full pipeline with dependencies (Spark + Postgres + Minio) without affecting host system.
Scalable Pipelines: Multiple containers can run simultaneously, useful for batch jobs or streaming tasks.
Image
Read-only template
Created from Dockerfile
Example: Python + libs + your ETL script
Container
Running instance of an image
Can be started/stopped
Temporary, isolated environment
Dockerfile
Instructions to build an image
=========================================================================
Docker Hub = Online platform where Docker images are stored, shared, and downloaded.
Docker Hub is the most popular public Docker registry, provided by Docker Inc.
A repository is a place where multiple versions (tags) of a Docker image are stored.
You use it to:
Pull images
Push images
Share images
Discover official images
Host private images
Website: hub.docker.com
(You don’t need to visit it—Docker CLI can interact directly.)
🧱 What You Can Do with Docker Hub
✔ 1. Pull images
Download ready-made images:
✔ 2. Push your own images
Upload your images:
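For example (username and image names are placeholders):

```bash
# Pull a ready-made image
docker pull nginx

# Push your own image (after logging in and tagging it with your namespace)
docker login
docker tag myapp:1.0 <your-username>/myapp:1.0
docker push <your-username>/myapp:1.0
```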
✔ 3. Use official, verified images
Examples:
library/nginx, library/ubuntu, library/mysql
These are secure, maintained by Docker or companies.
✔ 4. Create public or private repositories
Public repo → anyone can access
Private repo → only you/team can access
✔ 5. Automate builds (CI/CD integration)
🌍 1. Public Repository (Free-Docker Hub)
✔ Definition
A public repo can be viewed and pulled by anyone.
Anyone can pull it with docker pull; no login required.
✔ Use Cases
Open-source images
Sharing tools with the community
Demo applications
Training material
✔ Pros
Free
Easy to share
Good for open-source
✔ Cons
Code/image contents are visible to the world
Cannot store sensitive applications
🔒 2. Private Repository (Restricted)
✔ Definition
A private repo can be accessed only by you and people you give permission to.
A user must first log in with docker login.
If they don't have access → they cannot pull.
✔ Use Cases
Internal enterprise apps
Proprietary code
Databases / internal pipelines
Anything sensitive or confidential
✔ Pros
Secure
Access-controlled
Good for companies
✔ Cons
Limited free private repos on free plan
Need Docker Hub account login
Docker commands to pull an image from a repository and run it.
🚀 1. Pull the image from a repo
Example (public repo):
Example (private repo):
🏃 2. Run the container
Example:
With port mapping:
🔥 Pull + Run in one command (No need to pull manually)
Docker will automatically pull the image if it doesn't exist locally.
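A sketch of sections 1–3 above, using nginx as the example image:

```bash
# 1. Pull the image (public repo)
docker pull nginx

# 2. Run the container
docker run nginx

#    ...with port mapping
docker run -p 8080:80 nginx

# Pull + run in one step: docker run pulls the image automatically
# if it is not present locally, so the explicit pull is optional
docker run -d -p 8080:80 nginx
```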
📦 Full Example: Private Repo
Step 1: Login
Step 2: Pull the image
Step 3: Run the container
🧩 Optional: Run in background
Add -d:
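A sketch of the private-repo flow (repository name is a placeholder):

```bash
# Step 1: Login
docker login

# Step 2: Pull the image
docker pull <your-username>/private-app:1.0

# Step 3: Run the container
docker run <your-username>/private-app:1.0

# Optional: run in the background with -d
docker run -d <your-username>/private-app:1.0
```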
=========================================================================
A registry is a server where Docker images are stored, uploaded, downloaded, shared , it can be private or public.
Types:
Public Registry: Open to anyone (e.g., Docker Hub).
Private Registry: Restricted access, can be self-hosted or cloud-hosted (e.g., AWS ECR, Azure Container Registry, GitHub Container Registry).
Key Points:
You can host your own registry to control access to images.
Used in CI/CD pipelines to store images built from your projects.
Access can be controlled with authentication and authorization.
Examples:
Docker Hub (public)
Amazon ECR (private)
Google Container Registry (GCR)
Azure Container Registry (ACR)
GitHub Container Registry
Harbor (self-hosted)
Nexus (self-hosted)
- 🔥 What You Can Do With a Registry
✔ Push images
Upload your built image to a registry:
✔ Pull images
Download an image from a registry:
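For example (registry, repository, and tag names are placeholders):

```bash
# Push a built image to a registry
docker push <registry>/<repository>:<tag>

# Pull an image from a registry
docker pull <registry>/<repository>:<tag>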
A registry is the entire system/server that stores Docker images.
Examples of registries:
Docker Hub
Amazon ECR
GitHub Container Registry
Google Container Registry
Harbor
Think of registry = big storage platform.
A repository is a collection of related images (usually different versions of the same app).
Example repository inside Docker Hub:
This repository contains multiple versions (tags):
nginx:1.21
nginx:1.23
nginx:latest
nginx:stable
Think of repository = folder inside registry.
=========================================================================
Docker Compose is a tool that lets you run multiple containers together using one YAML file.
Instead of running individual docker run commands, you define everything in a docker-compose.yml file.
Then start all services with one command: docker-compose up
🟣 Why do we use Docker Compose? (Very Important)
Run multiple services together (e.g., Airflow + Postgres + Redis)
Handles networking automatically
Creates shared volumes
Starts containers in the right order
Perfect for data engineering pipelines
Docker Compose Architecture
A Compose file has 3 main parts:
Version → YAML schema version
Services → Containers to run
Volumes → Persistent storage
Networks → Optional custom networks
An example structure is shown below.
🟢 Basic Example (docker-compose.yml)
Example for Python app + Postgres DB:
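A minimal sketch (service names db and app match the highlights below; credentials are placeholders):

```yaml
version: "3"
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app
      POSTGRES_DB: appdb
    volumes:
      - dbdata:/var/lib/postgresql/data

  app:
    build: .
    depends_on:
      - db
    environment:
      DB_HOST: db   # reach Postgres via its service name

volumes:
  dbdata:
```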
Highlights
Two services: db and app
app waits for db (depends_on)
Networking is automatic → app connects to db using the hostname db
🟣 Most Important Docker Compose Commands
| Purpose | Command |
|---|---|
| Start all services | docker-compose up |
| Start in background | docker-compose up -d |
| Stop all services | docker-compose down |
| View running services | docker-compose ps |
| View service logs | docker-compose logs app |
| Rebuild + run | docker-compose up --build |
| Run a command inside container | docker-compose exec app bash |
🟢 Networking in Compose
All services automatically join the same network
Containers talk using service names
Example: the app service can reach the database simply by using the hostname db. No need for an IP address.
🟣 Volumes in Compose
Used for saving persistent data:
Example:
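A sketch of a named volume in Compose:

```yaml
services:
  db:
    image: postgres:15
    volumes:
      - dbdata:/var/lib/postgresql/data

volumes:
  dbdata:
```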
🔹 Why Data Engineers Use Docker Compose
Run Airflow scheduler + webserver + database locally.
Test ETL pipelines with Spark, Postgres, Kafka, or Minio (S3) together.
Manage dependencies, networking, and volumes easily.
Create reproducible environments for interviews and portfolio projects.
🔹 Basic Docker Compose Example (Airflow + Postgres)
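A minimal sketch only; a real setup also needs database initialization (airflow db migrate), an admin user, and a scheduler service. Image tag and credentials are assumptions:

```yaml
version: "3"
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - pgdata:/var/lib/postgresql/data

  airflow-webserver:
    image: apache/airflow:2.9.0
    depends_on:
      - postgres
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    volumes:
      - ./dags:/opt/airflow/dags
      - airflow-logs:/opt/airflow/logs
    ports:
      - "8080:8080"
    command: webserver

volumes:
  pgdata:
  airflow-logs:
```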
Explanation:
Postgres → metadata database for Airflow
airflow-webserver → serves the Airflow UI, connected to Postgres
Volumes → persist database and logs
Ports → expose Airflow UI locally
🔹 Basic Docker Compose Commands
Build & start services
Run in detached mode (background)
Stop all containers
View logs
Rebuild after changes
🔹 Advanced Use Cases for Data Engineers
Local ETL testing
Spark + Minio (S3) + Kafka + Postgres → run all together.
Airflow development environment
Scheduler + Webserver + Worker + Postgres + Redis.
Team collaboration
Share docker-compose.yml → everyone runs the same environment.
MinIO is an open-source, high-performance object storage system that works just like Amazon S3.
🚀 Simple Definition
MinIO = Your own S3 storage, but on your local machine or your company's servers.
You can store:
Files (images, videos, PDFs)
Backups
Logs
Data lake files (Parquet, CSV, JSON)
ML model files
It exposes an S3-compatible API, so tools that work with AWS S3 also work with MinIO.
MinIO is heavily used by:
Data engineers
Big data pipelines
Machine Learning teams
Kubernetes ecosystems
On-prem companies needing S3-like storage
Common use cases:
Storage for Airflow, Spark, Kafka, ML models
Data lake storage (like S3)
Backup system
File storage for microservices
🔹 Tips
Use a .env file for sensitive credentials (AWS keys, DB passwords).
Use depends_on for proper startup order.
Combine Dockerfile + Docker Compose to build custom images and run multi-service pipelines.
Use networks to let containers communicate (service_name:port).
---------------------------------------------------------------------
In a Dockerfile, commands are executed in two different phases:
🚀 1. Build-time commands
Executed while building the image using docker build.
These commands modify the image, install software, copy files, etc.
⭐ Build-time instructions:
| Instruction | Meaning |
|---|---|
| FROM | Base image |
| COPY | Copies files into image |
| ADD | Similar to COPY with extra features |
| RUN | Executes commands during image build |
| ENV | Sets environment variables for image |
| WORKDIR | Sets working directory |
| EXPOSE | Metadata only |
| USER | Sets default user |
| ENTRYPOINT | Sets startup program |
| CMD | Default arguments to ENTRYPOINT |
✔ Example Build-Time (RUN): e.g. RUN pip install -r requirements.txt
⏩ These run inside the image build, produce new layers.
🚀 2. Runtime commands
Executed when container starts, not during build.
This is when you run docker run <image>.
⭐ Runtime instructions:
| Instruction | Meaning |
|---|---|
| CMD | Runs when container starts |
| ENTRYPOINT | Main container command |
| ENV | Available at runtime |
| VOLUME | Declares storage |
| EXPOSE | Helps runtime port mapping |
✔ Example Runtime (CMD): e.g. CMD ["python", "app.py"]
⏩ This runs when the container starts, not during build.
🔥 Major Difference (VERY IMPORTANT)
| Feature | Build Time | Runtime |
|---|---|---|
| Executed during | docker build | docker run |
| Command used | RUN | CMD, ENTRYPOINT |
| Creates layers? | Yes | No |
| Installs packages | ✔ Allowed | ❌ Not allowed |
| Runs application | ❌ No | ✔ Yes |
| Changes image? | ✔ Yes | ❌ No |
🧠 Most Common Confusion
❗ Why not use RUN to start a server?
Example WRONG: RUN python app.py
This would start the app during the build → the build will hang forever.
You should use CMD or ENTRYPOINT instead: CMD ["python", "app.py"]
🎯 Simple Example Dockerfile (Build-time vs Runtime)
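A sketch (app.py and APP_ENV are placeholders):

```dockerfile
# ---- build time: executed during docker build ----
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
# RUN executes during the build and creates a new layer
RUN pip install -r requirements.txt
COPY . .
# ENV is baked into the image and also visible at runtime
ENV APP_ENV=production

# ---- runtime: executed when the container starts (docker run) ----
CMD ["python", "app.py"]
```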
=========================================================================
Basic Commands
| Purpose | Command | Meaning |
|---|---|---|
| Check Docker version | docker --version | Verify installation |
| List images | docker images | Shows all images |
| List running containers | docker ps | Only active containers |
| List all containers | docker ps -a | Active + stopped containers |
| Build image | docker build -t <name> . | Build image from Dockerfile |
| Run container | docker run <image> | Start container |
| Run interactive shell | docker run -it <image> bash | Enter container terminal |
| Run container in background | docker run -d <image> | Detached mode |
| Assign name to container | docker run --name myapp <image> | Run container with name |
| Stop container | docker stop <id> | Gracefully stop |
| Force stop | docker kill <id> | Hard stop |
| Remove container | docker rm <id> | Delete container |
| Remove image | docker rmi <image> | Delete image |
| View container logs | docker logs <id> | Show logs |
| Execute command inside container | docker exec -it <id> bash | Open shell inside running container |
| Copy file from container | docker cp <id>:/path/file . | Copy from container to host |
| Show container stats | docker stats | CPU/RAM usage |
| Pull image from Docker Hub | docker pull <image> | Download image |
| Push image to registry | docker push <image> | Upload image |
| Inspect container details | docker inspect <id> | Low-level info |
---------------------------------------------------------------------
docker exec is used to run a command inside a running container.
Think of it as opening a terminal inside a container.
✅ Syntax
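```bash
docker exec [OPTIONS] <container_name_or_id> <command>
```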
🔥 Most Common Usage
⭐ 1️⃣ Open an interactive shell inside container
(Like SSH into the container)
or if bash is not available:
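```bash
docker exec -it <container_name> bash
# or, if bash is not available (e.g. Alpine-based images):
docker exec -it <container_name> sh
```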
What this does:
-i → interactive
-t → allocate a terminal (TTY)
You get inside the container's environment
You can explore filesystem, logs, configs, etc.
⭐ 2️⃣ Run a single command inside container
Example: list files
Example: check Redis keys
⭐ 3️⃣ Check environment variables
⭐ 4️⃣ Verify process running inside container
⭐ 5️⃣ Run SQL client inside PostgreSQL container
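Sketches for items 2️⃣–5️⃣ above (container names and paths are placeholders; exact binaries depend on the image):

```bash
# List files
docker exec <container_name> ls /app

# Check Redis keys (inside a Redis container)
docker exec -it my-redis redis-cli

# Check environment variables
docker exec <container_name> env

# Verify processes running inside the container
docker exec <container_name> ps aux

# Run the SQL client inside a PostgreSQL container
docker exec -it my-postgres psql -U postgres
```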
🧠 When to Use docker exec?
✔ To debug inside a container
✔ To explore container file system
✔ To check logs that applications write to files
✔ To run app-specific commands (redis-cli, psql, etc.)
✔ To verify configs
✔ To run admin commands
❗ Important Notes
🔸 The container must be running
If the container is stopped, docker exec will fail with an error saying the container is not running. Start it first with docker start <container> (or create a new one with docker run).
---------------------------------------------------------------------
🔥 Useful Flags
1️⃣ Follow logs (live streaming logs)
This is like tail -f, continuously showing new log lines.
2️⃣ Show last N lines
3️⃣ Include timestamps
4️⃣ Combine flags
Shows last 100 lines + timestamps + live updates.
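The flags described above:

```bash
# 1. Follow logs (live streaming)
docker logs -f <container_id>

# 2. Show last N lines
docker logs --tail 100 <container_id>

# 3. Include timestamps
docker logs -t <container_id>

# 4. Combine flags
docker logs -f -t --tail 100 <container_id>
```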
---------------------------------------------------------------------
✅ 1. docker network ls
This command lists all Docker networks on your system.
You will see something like:
| NETWORK ID | NAME | DRIVER | SCOPE |
|---|---|---|---|
| 934d... | bridge | bridge | local |
| a23b... | host | host | local |
| 7dfe... | none | null | local |
✅ 2. Create a Docker network
To create your own custom network:
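```bash
docker network create mynetwork
```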
Why create a custom network?
Containers on the same network can communicate with each other by container name.
Example:
Mongo container can be accessed by name mongo inside Mongo Express.
✅ 3. Run MongoDB container on that network
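A sketch using the official mongo image (credentials are placeholders):

```bash
docker run -d \
  --name mongo \
  --network mynetwork \
  -e MONGO_INITDB_ROOT_USERNAME=admin \
  -e MONGO_INITDB_ROOT_PASSWORD=secret \
  mongo
```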
What this does:
Starts MongoDB in background (-d)
Assigns container name mongo
Connects it to mynetwork
Sets username/password
✅ 4. Run Mongo Express (UI) on same network
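A sketch using the official mongo-express image (credentials must match the Mongo container above):

```bash
docker run -d \
  --name mongo-express \
  --network mynetwork \
  -p 8081:8081 \
  -e ME_CONFIG_MONGODB_SERVER=mongo \
  -e ME_CONFIG_MONGODB_ADMINUSERNAME=admin \
  -e ME_CONFIG_MONGODB_ADMINPASSWORD=secret \
  mongo-express
```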
Important points:
Connected to same network → can reach Mongo
The Mongo server hostname is given as mongo (the container name).
Because Docker resolves container names automatically on a shared network.
Port mapping
-p 8081:8081
→ You can open Mongo Express in the browser at http://localhost:8081
---------------------------------------------------------------------
✅ 1. docker pull redis
This command downloads the Redis image from Docker Hub.
What happens:
Docker checks if redis:latest exists locally
If not, it downloads all required image layers
Stores it in your local image cache
✅ 2. docker images
Shows all images available locally.
You will see output like:
| REPOSITORY | TAG | IMAGE ID | CREATED | SIZE |
|---|---|---|---|---|
| redis | latest | abc123 | 2 days ago | 110MB |
✅ 3. docker run redis
Runs the Redis image in the foreground.
Result:
It starts Redis in your terminal
You can see logs continuously
Your terminal gets “attached” to the container
Press CTRL + C to stop
Not recommended for production.
✅ 4. docker ps
Shows running containers.
You will see columns:
| CONTAINER ID | IMAGE | STATUS | PORTS | NAMES |
✅ 5. docker run -d redis
Runs Redis in background (detached mode).
What happens:
Starts Redis container
Returns only the container ID
Your terminal is free to use
Container keeps running in the background
✅ 6. docker stop <container_id>
Stops the running Redis container.
What happens:
Sends graceful shutdown signal
Redis safely shuts down
Container becomes stopped, but not removed
✅ 7. docker start <container_id / name>
Starts a stopped container again.
or
Important:
It starts the same stopped container—not a new one.
✅ 8. docker ps -a
Shows all containers — running + stopped.
Useful to check old/stopped containers.
✅ 9. docker run redis:4.0
Runs a specific version of Redis.
What happens:
If version 4.0 image does NOT exist locally → Docker pulls it
A container is created using Redis v4.0
If you use -d, it runs in the background
=========================================================================
🔵 What is Docker Caching?
Docker caching means Docker reuses previously built layers instead of rebuilding everything every time.
This makes builds:
Faster
Cheaper
More efficient
🔵 How Docker Caching Works
A Docker image is made of layers.
Each Dockerfile instruction creates one layer.
Example: each instruction (FROM, WORKDIR, COPY, RUN, CMD) becomes its own layer.
If nothing changes in a layer, Docker reuses it from cache.
🔵 Why Caching Matters (Interview Points)
Speeds up builds (5 minutes → 10 seconds)
Reduces duplicate work
Prevents reinstalling dependencies
Saves cloud build costs (GitHub Actions, AWS, GCP)
🔵 What Breaks the Cache?
A cache is invalidated (rebuild happens) if:
The instruction changes (example: change a RUN command)
Any file copied in that layer changes
Any previous layer changes
Example:
If requirements.txt changes, Docker will rebuild:
Layer for COPY requirements.txt
Layer for RUN pip install
All layers after them
But earlier layers (FROM, WORKDIR) are still cached.
🔵 Best Practice: ORDER YOUR DOCKERFILE
To get the maximum caching, put the steps that change least often first.
❌ Bad (slow builds every time):
✔ Good (better caching):
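Sketches of the two orderings above:

```dockerfile
# ❌ Bad: copying all code first means any code change
# invalidates the pip install layer
FROM python:3.10-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "main.py"]
```

```dockerfile
# ✔ Good: copy requirements first so the pip install layer stays cached
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "main.py"]
```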
This way:
Pip install runs only if requirements.txt changes
App code changes won't break the pip install cache
🔵 Cache Example in Real Life
First build: every layer is built from scratch, so it is slow.
Second build with no code change: finishes almost instantly, because all layers are reused from cache.
🔵 Skipping Cache (Forced Rebuild)
Sometimes you want a full rebuild: docker build --no-cache -t <image> .
🔵 Multi-Stage Build + Caching (Advanced)
Multi-stage builds let you cache dependency installation separately (see the multi-stage build section later in these notes).
This dramatically speeds up builds.
🔥 Short Summary (One Line Answers)
Docker caching = reusing previous build layers
Each Dockerfile instruction = one layer
Layers only rebuild if something changes
Correct ordering = fast builds
--no-cache disables caching
=========================================================================
🟦 Variables in Docker
Docker supports two types of variables:
✅ 1. ENV (Environment Variables)
🔹 Available inside the running container
🔹 Used by applications at runtime
🔹 Can be set in Dockerfile, Compose, or at run time
Dockerfile
docker run
docker-compose.yml
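Sketches of the three places to set ENV (names and values are placeholders):

```dockerfile
# Dockerfile
ENV DB_HOST=postgres
```

```bash
# docker run
docker run -e DB_HOST=postgres myapp:1.0
```

```yaml
# docker-compose.yml
services:
  app:
    environment:
      DB_HOST: postgres
```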
📌 Use case:
Database URLs, passwords, app settings.
✅ 2. ARG (Build-time Variables)
🔹 Used only during image build
🔹 NOT available inside running container unless passed to ENV
🔹 Must be defined before use
Dockerfile
Build:
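A sketch (APP_VERSION is an illustrative build argument):

```dockerfile
# build-time variable with a default value
ARG APP_VERSION=1.0
# optionally copy it into ENV so it is also available at runtime
ENV APP_VERSION=${APP_VERSION}
```

```bash
docker build --build-arg APP_VERSION=2.0 -t myapp:2.0 .
```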
📌 Use case:
Build metadata, versioning, optional settings.
🟨 ENV vs ARG (Interview Question)
| Feature | ARG | ENV |
|---|---|---|
| Available at runtime? | ❌ No | ✔ Yes |
| Available during build? | ✔ Yes | ✔ Yes |
| Passed using docker run? | ❌ No | ✔ Yes |
| Stored inside final image? | ❌ No | ✔ Yes |
🟩 3. Variables in docker-compose with .env file
You can store environment variables in a file named .env.
.env:
docker-compose.yml:
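A sketch (variable name is a placeholder):

```bash
# .env
POSTGRES_PASSWORD=supersecret
```

```yaml
# docker-compose.yml
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
```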
🟧 4. Using variables inside Dockerfile
Example:
🟥 5. Why variables are important in Docker?
Avoid hardcoding secrets
Make Dockerfiles reusable
Dynamic config (ports, environment, versions)
Different environments: dev, test, prod
🟦 Docker Registry — What It Is & Why It Matters
✅ What Is a Docker Registry?
A Docker Registry is a storage + distribution system for Docker images.
A Docker registry is a centralized storage and distribution system for Docker images. It acts as a repository where Docker images—packages containing everything needed to run an application—are stored, managed, versioned, and shared across different environments.
It is where Docker images are:
Stored
Versioned
Pulled from
Pushed to
Similar to GitHub, but for container images instead of code.
🟧 Key Concepts
🟠 1. Registry
The whole server that stores repositories → e.g., Docker Hub, AWS ECR.
🟠 2. Repository
A collection of versions (tags) of an image.
Example: the nginx repository, which holds tags like nginx:1.25 and nginx:latest.
🟠 3. Image Tag
Label used to version an image.
Example: myapp:1.0, myapp:latest.
🟩 Public vs Private Registries
| Type | Examples | Features |
|---|---|---|
| Public | Docker Hub, GitHub Container Registry | Anyone can pull |
| Private | AWS ECR, Azure ACR, GCP GCR, Harbor | Secure, enterprise use |
🟦 Why Do We Need a Docker Registry?
Because:
You build an image locally
Push it to a registry
Your production server / CI/CD pulls the image and runs it
Without a registry → no easy way to share or deploy images.
🟣 Common Docker Registry Commands
✅ Login
✅ Tag an Image
✅ Push to Registry
✅ Pull from Registry
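For example (registry and namespace are placeholders):

```bash
# Login
docker login

# Tag an image for the registry
docker tag myapp:1.0 <registry>/<namespace>/myapp:1.0

# Push to the registry
docker push <registry>/<namespace>/myapp:1.0

# Pull from the registry
docker pull <registry>/<namespace>/myapp:1.0
```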
🟤 Examples of Docker Registries
📌 1. Docker Hub (Most Common)
Free public repositories
Paid private repos
📌 2. AWS ECR (Enterprise)
Most used in production
Private registry
Integrated with ECS, EKS, Lambda
📌 3. GitHub Container Registry
Images stored inside GitHub
Good for CI/CD workflows
📌 4. Google GCR / Artifact Registry
📌 5. Self-hosted Registry
Example: Harbor, JFrog Artifactory
🔥 Advanced Concepts (Interview-Level)
🔹 Digest-based pulling
Instead of a tag, you can pull by digest, e.g. docker pull nginx@sha256:<digest>. This guarantees the exact version.
🔹 Immutable tags
Some registries enforce that v1 cannot be overwritten.
🔹 Retention Policies
Automatically delete old images in ECR/GCR.
🔹 Scan for vulnerabilities
Registries like:
AWS ECR
GHCR
Docker Hub (Pro)
can scan images for security issues.
=========================================================================
Docker networking allows containers to communicate with:
each other
the host machine
external internet
Each container gets its own virtual network interface + IP address.
🔶 Types of Docker Networks
Docker provides 5 main network types:
🟦 1. Bridge Network (Default)
Most commonly used
Containers on the same bridge network can talk to each other using container name
Example:
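A sketch (network, container, and image names are placeholders):

```bash
docker network create mynet
docker run -d --name db --network mynet -e POSTGRES_PASSWORD=secret postgres:15
docker run -d --name app --network mynet myapp:1.0
# "app" can now reach the database using the hostname "db"
```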
Use Case:
Local development
Microservices communication
🟩 2. Host Network
Container shares the same network as host.
❌ No isolation
⚡ Fastest network performance
🧠 No port mapping needed
Run:
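```bash
docker run --network host nginx
# no -p mapping needed: the container uses the host's ports directly
```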
Use Case:
High-performance applications
Network-heavy workloads
🟧 3. None Network
Container has no network.
Use Case:
Security
Sandbox jobs
Batch processing
🟪 4. Overlay Network (Swarm / Kubernetes)
Used in multi-node swarm clusters.
Allows containers on different machines to communicate.
Use Case:
Distributed apps
Microservices in Docker Swarm
🟫 5. Macvlan Network
Gives container its own IP address in LAN like a real device.
Use Case:
Legacy systems
Need direct connection to network
Running containers like physical machines
🔷 Key Networking Commands
| Command | Description |
|---|---|
| docker network ls | List networks |
| docker network inspect <name> | Inspect network |
| docker network create <name> | Create network |
| docker network rm <name> | Remove network |
| docker network connect <net> <container> | Add container to network |
| docker network disconnect <net> <container> | Remove container |
🔷 How Containers Communicate
🟦 1. Same Bridge Network
✔ Can ping each other by container name
✔ DNS built-in
Example: a container named app can reach a container named db just by using the hostname db.
🟥 2. Different Networks
❌ Cannot communicate
➡ Must connect to the same network
🟩 3. With Host Machine
Host can access a container via published ports (-p).
Example: docker run -p 8080:80 nginx
Access: → http://localhost:8080
🟧 4. Container to Internet
Enabled by default via NAT.
🔶 Port Mapping
If container port = 80
Host port = 8080
👉 Host can access container
👉 “Port forwarding”
🟦 Docker DNS
On the same custom network:
Container names act like hostnames
Docker automatically manages DNS
=========================================================================
Docker Volumes are the official way to store data outside a container.
Docker volumes are a dedicated, persistent storage mechanism managed by Docker for storing data generated and used by containers.
Unlike container writable layers, volumes exist independently of the container lifecycle, meaning data in volumes remains intact even if the container is stopped, removed, or recreated.
They reside outside the container filesystem on the host, typically under Docker's control directories, providing efficient I/O and storage management.
Because containers are ephemeral:
→ When container stops/deletes → data is lost
→ Volumes solve that.
🔶 Why Do We Need Docker Volumes?
✔ Containers are temporary
✔ Data must persist
✔ Multiple containers may need same data
✔ Upgrading/Deleting containers should NOT delete data
🟦 Types of Docker Storage
Docker offers 3 types:
1️⃣ Named Volume (Recommended)
Managed by Docker itself
Stored under /var/lib/docker/volumes/ on the host
Use Cases:
Databases (MySQL, PostgreSQL)
Persistent app data
Example: -v pgdata:/var/lib/postgresql/data (a named volume pgdata mounted at the Postgres data directory).
2️⃣ Bind Mount
Maps specific host directory into container
Uses host machine's folder.
Use Cases:
Local development
When you want full control of host path
3️⃣ tmpfs (Linux Only)
Data stored in RAM only.
Use Cases:
Sensitive data
Ultra-fast temporary storage
🟩 Volume Commands (Most Important)
| Command | Description |
|---|---|
| docker volume create myvol | Create volume |
| docker volume ls | List volumes |
| docker volume inspect myvol | Inspect volume |
| docker volume rm myvol | Delete volume |
| docker volume prune | Remove unused volumes |
🟧 Using Volumes in Docker Run
Syntax:
Example:
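A sketch using a named volume:

```bash
# Syntax: -v <volume_name>:<container_path>
docker run -d -v pgdata:/var/lib/postgresql/data -e POSTGRES_PASSWORD=secret postgres:15
```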
🟣 Using Bind Mounts
Example:
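A sketch using a bind mount (host path is a placeholder):

```bash
# Syntax: -v <host_path>:<container_path>
docker run -d -v $(pwd)/data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=secret postgres:15
```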
🔵 Volumes in Docker Compose
Very important for real projects.
docker-compose.yml
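A sketch of a persistent volume declared in Compose:

```yaml
services:
  db:
    image: postgres:15
    volumes:
      - dbdata:/var/lib/postgresql/data

volumes:
  dbdata:
```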
🔥 Example Use Case (DB Persistence)
If you run a database container without a volume and later delete the container → the data is gone.
But with a volume mounted, stopping or removing the container → the data still exists (in the volume).
🟥 Where Are Volumes Stored?
On Linux: /var/lib/docker/volumes/
On Windows/Mac → managed internally through Docker Desktop.
-------------------------------------------------------------------------------------------------------------------
1️⃣ Why each DB has a different location: every database image has its own default data directory where it stores its actual files (for example /var/lib/postgresql/data for Postgres and /var/lib/mysql for MySQL).
2️⃣ Using Docker volumes for persistence
Volumes are Docker-managed storage that lives outside the container filesystem.
You can map container paths to host paths or let Docker manage them.
Syntax:
Examples:
MySQL:
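Sketches for common databases (volume names and passwords are placeholders):

```bash
# MySQL (default data dir: /var/lib/mysql)
docker run -d -v mysql_data:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=secret mysql:8

# PostgreSQL (default data dir: /var/lib/postgresql/data)
docker run -d -v pg_data:/var/lib/postgresql/data -e POSTGRES_PASSWORD=secret postgres:15

# MongoDB (default data dir: /data/db)
docker run -d -v mongo_data:/data/db mongo
```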
3️⃣ Key points
Each DB container has its own default data directory — you must map that path for persistence.
You can use:
Host directory mapping (
/host/path:/container/path) → data visible on host.Named volumes (
-v myvolume:/container/path) → Docker manages storage.
Using different volumes/paths per DB avoids conflicts and keeps data safe.
This also allows backup, restore, and migration easily by copying the volume.
🟨 Interview Questions (Short Answers)
1️⃣ What is a Docker Volume?
A persistent storage mechanism managed by Docker.
2️⃣ Difference: Volume vs Bind Mount?
| Volume | Bind Mount |
|---|---|
| Managed by Docker | Controlled by host user |
| More secure | Direct host access |
| Best for production | Best for local development |
3️⃣ Does deleting container delete volume?
❌ No.
Volumes must be deleted manually.
4️⃣ What happens if volume doesn't exist?
Docker automatically creates it.
5️⃣ Can two containers share one volume?
✔ Yes → used in DB replicas, logs, shared storage.
=========================================================================
ENTRYPOINT defines the main command that will always run when a container starts.
Think of it as the default executable of the container.
🟦 Why ENTRYPOINT is used?
✔ Makes the container behave like a single-purpose program
✔ Forces a command to always run
✔ Can't be easily overridden (compared to CMD)
✔ Best for production containers
🔶 ENTRYPOINT Syntax
Two forms exist:
1️⃣ Exec Form (Recommended)
✔ Doesn’t use shell
✔ More secure
✔ Handles signals properly
2️⃣ Shell Form
⚠ Runs inside /bin/sh -c
⚠ Harder to handle signals
🟣 Example ENTRYPOINT Dockerfile
For example, a Dockerfile whose ENTRYPOINT runs your application will always execute that program when the container starts, no matter what extra arguments you pass to docker run:
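A sketch (app.py and the image name are placeholders):

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
ENTRYPOINT ["python", "app.py"]
```

```bash
docker run myapp:1.0            # always runs: python app.py
docker run myapp:1.0 --debug    # still python app.py, now with --debug as an argument
```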
🟩 ENTRYPOINT + CMD (Very Important)
ENTRYPOINT = fixed command, CMD = default arguments
Example:
Container will run:
You can override CMD:
But ENTRYPOINT cannot be replaced unless you use --entrypoint.
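A sketch of the interplay described above:

```dockerfile
ENTRYPOINT ["python", "app.py"]
CMD ["--mode", "daily"]
```

```bash
# Container runs: python app.py --mode daily
docker run myapp:1.0

# Override CMD (arguments only): python app.py --mode hourly
docker run myapp:1.0 --mode hourly
```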
🔥 Override ENTRYPOINT (Rare)
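```bash
docker run --entrypoint bash myapp:1.0
```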
🟥 ENTRYPOINT vs CMD (Very Important Table)
| Feature | ENTRYPOINT | CMD |
|---|---|---|
| Main purpose | Main command | Default args |
| Overrides allowed? | ❌ Hard | ✔ Easy |
| Best use | Permanent command | Arguments |
| Runs as | Program | Command/Args |
🔶 Common Interview Questions
1. Why use ENTRYPOINT instead of CMD?
To ensure the main command always runs and cannot be overridden.
2. What happens if both ENTRYPOINT and CMD exist?
CMD becomes arguments to ENTRYPOINT.
3. How do you override ENTRYPOINT?
Using --entrypoint.
=========================================================================
🔵 Docker Daemon & Docker Client
Docker works using a client–server architecture.
🟦 1. Docker Daemon (dockerd)
This is the brain of Docker.
✔ What it Does:
Runs in the background
Manages containers
Manages images
Manages networks
Manages volumes
Executes all Docker operations
✔ It Listens On:
Unix socket: /var/run/docker.sock
Sometimes a TCP port (for remote Docker hosts)
✔ Daemon = Server Side
🟩 2. Docker Client (docker)
This is the command-line tool you use.
When you type a command such as docker run or docker ps:
The client DOES NOT run containers.
Instead, it sends API requests to the Docker Daemon, which performs the real operations.
✔ Client = Frontend
✔ Daemon = Backend
🟧 How They Work Together (Simple Flow)
You run, for example, docker run nginx.
Flow:
Client sends request → Daemon
Daemon pulls image
Daemon creates container
Daemon starts container
You see output on terminal
🔵 COPY vs ADD in Dockerfile
Both are used to copy files into the image, but COPY is preferred.
🟦 1. COPY (Recommended)
✔ What it does:
Copies local files/folders into the container.
✔ Safe
✔ Predictable
✔ No extra features (simple only)
Example:
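```dockerfile
COPY requirements.txt /app/requirements.txt
COPY src/ /app/src/
```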
Use COPY when:
You want to copy source code
You want clean builds
You don’t need extraction or downloading
🟧 2. ADD (Avoid unless needed)
✔ What it does:
Does everything COPY does plus two extra features:
Extra Features:
1️⃣ Can download URLs
2️⃣ Automatically extracts tar files
⚠ Because of these extras → can create security issues
So Docker recommends: use COPY unless ADD is needed.
🟪 COPY vs ADD Table (Interview-Friendly)
| Feature | COPY | ADD |
|---|---|---|
| Copy local files | ✔ Yes | ✔ Yes |
| Copy remote URL | ❌ No | ✔ Yes |
| Auto extract .tar.gz | ❌ No | ✔ Yes |
| Simpler | ✔ Yes | ❌ No |
| More secure | ✔ Yes | ❌ No |
| Recommended? | ✔ Yes | ❌ Use only when required |
🟩 When to Use ADD? (Rare)
Use ADD only for:
✔ Auto-unpacking tar files into image
✔ Downloading files from a URL
Otherwise → COPY is always better.
=========================================================================
🔵 What are Multi-Stage Builds?
Multi-stage builds allow you to use multiple FROM statements in a single Dockerfile.
✔ Build in one stage
✔ Copy only the required output into the final stage
✔ Final image becomes much smaller
✔ No build dependencies inside final image
🟦 Why Multi-Stage Builds Are Needed?
Problem (without multi-stage):
Build tools (Maven, Go compiler, Node modules, pip, etc.) stay inside the final image
Makes image heavy
Security issues
Slow deployment
Multi-stage solution:
Build tools exist only in the build stage
Final stage contains just the application
Clean, lightweight image
🟩 Simple Example – Python / Node / Java / Go (All follow same logic)
Here is a general multi-stage pattern:
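A sketch of the Node → nginx pattern described below (the dist output folder is an assumption; some projects use build):

```dockerfile
# Stage 1: build the app
FROM node:18 AS build
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build

# Stage 2: serve only the build output with nginx
FROM nginx:alpine
COPY --from=build /app/dist /usr/share/nginx/html
```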
What happens?
Node image builds the app
Only the final compiled output is copied to nginx
Result = super small production image
🔶 Another Example – Python App
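A sketch for Python, installing dependencies in a builder stage and copying only the result:

```dockerfile
# Stage 1: install dependencies
FROM python:3.10-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# Stage 2: copy only the installed packages + code
FROM python:3.10-slim
COPY --from=builder /install /usr/local
WORKDIR /app
COPY . .
CMD ["python", "main.py"]
```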
🔷 Another Example – Java (Very Popular): build the JAR with a Maven image in the first stage, then copy only the JAR into a slim JRE image.
✔ No Maven in final image
✔ Final image is tiny
🟧 Key Features of Multi-Stage Builds
✔ Multiple FROM instructions
Each FROM = new stage
✔ You can name stages (e.g. FROM node:18 AS build) and reference them with COPY --from=build
✔ Copy artifacts from stage to stage
✔ Final image only contains last stage
All previous stages = removed
Image is clean + small
🟪 Benefits (Interview Ready)
| Benefit | Explanation |
|---|---|
| ✔ Smaller images | No build tools in final image |
| ✔ Faster builds | Layer caching for each stage |
| ✔ Better security | No compilers / secrets left behind |
| ✔ Cleaner Dockerfiles | Each stage has a clear job |
| ✔ Reproducible builds | Same environment every time |
=========================================================================
🔵 What is .dockerignore?
.dockerignore is a file that tells Docker which files/folders to EXCLUDE when building an image.
It works similar to .gitignore.
🟦 Why do we use .dockerignore?
✔ Faster Docker builds
(Removes unnecessary files → smaller build context)
✔ Smaller images
(Don’t copy unwanted files)
✔ Better security
(Keep secrets, logs, configs out of image)
✔ Cleaner caching
(Prevents rebuilds when irrelevant files change)
🟩 Common Items in .dockerignore
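A sketch of typical entries (adjust to your project):

```text
.git
__pycache__/
*.pyc
venv/
node_modules/
logs/
.env
*.md
Dockerfile
docker-compose.yml
```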
🟧 How it works?
When you run docker build:
Docker first copies the “build context” → (current directory)
Without dockerignore, everything is copied.
.dockerignore tells Docker:
🚫 Don’t send these files to the build context.
🟪 Example: with a .dockerignore like the one above, COPY . . in the Dockerfile copies only the allowed files into the image.
🟥 Performance Impact (Very Important)
Without .dockerignore:
Docker copies huge directories (node_modules, logs)
Slow build
Cache invalidates unnecessarily
With .dockerignore:
Build context is very small
Build is faster
Cache stays valid → faster incremental builds
=========================================================================
🔵 Docker Container Lifecycle (Step-by-Step)
A Docker container goes through the following major stages:
🟦 1. Created
The container is created from an image but not started yet.
Command: docker create <image>
🟩 2. Running
Container is active and executing processes.
Command: docker run <image>
docker run = create + start
🟧 3. Paused
All processes inside the container are temporarily frozen.
Command: docker pause <container_id>
🟪 4. Unpaused
Resumes the paused container.
Command: docker unpause <container_id>
🟥 5. Stopped / Exited
Container stops running its main process (app has exited or manually stopped).
Command: docker stop <container_id>
🟨 6. Restarted
Container is stopped and then started again.
Command: docker restart <container_id>
🟫 7. Removed (Deleted)
The container is permanently removed from Docker.
Command: docker rm <container_id>
You cannot remove a running container—must stop it first.
📌 Lifecycle Diagram (Simple)
| Action | Command Example |
|---|---|
| Create | docker create nginx |
| Run (create+start) | docker run nginx |
| Start | docker start cont_id |
| Stop | docker stop cont_id |
| Pause | docker pause cont_id |
| Unpause | docker unpause cont_id |
| Restart | docker restart cont_id |
| Remove | docker rm cont_id |
| Remove all | docker rm $(docker ps -aq) |
=========================================================================
🔵 What is a Docker HEALTHCHECK?
A HEALTHCHECK is a way to tell Docker how to test whether a container is healthy.
Docker runs this command periodically and updates the container's status:
healthy
unhealthy
starting
It helps in:
auto-restarts
load balancers
orchestrators (Kubernetes, ECS, Swarm)
🟦 Syntax (Dockerfile)
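The general form (the health command itself depends on your app):

```dockerfile
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD <command-to-test-health> || exit 1
```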
🟩 Options
| Option | Meaning |
|---|---|
| --interval=30s | Check frequency |
| --timeout=3s | How long to wait before failing |
| --start-period=5s | Grace period before checks start |
| --retries=3 | Fail after X failed attempts |
🟧 Example 1: Simple HTTP Healthcheck
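A sketch (the /health endpoint and port are assumptions; curl must exist in the image):

```dockerfile
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
```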
If curl -f succeeds → healthy
If it fails → unhealthy
🟪 Example 2: you can also point the HEALTHCHECK CMD at a custom script (e.g. health.sh) copied into the image.
List containers with health status:
Detailed inspection:
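For example:

```bash
# List containers with health status (STATUS column shows healthy/unhealthy)
docker ps

# Detailed inspection of the health state
docker inspect --format '{{.State.Health.Status}}' <container_id>
```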
You will see:
| Status | Meaning |
|---|---|
| starting | Startup period (start-period) |
| healthy | App is functioning |
| unhealthy | Check failed repeatedly |
Note: Docker restart policies react to a container exiting, not to the health status itself; orchestrators such as Swarm, Kubernetes, or ECS use the health status to restart or replace unhealthy containers.
📌 Important Notes
HEALTHCHECK runs inside the container.
Should be lightweight (avoid heavy scripts).
Uses exit codes:
0 = success (healthy)
1 = unhealthy
2 = reserved
=========================================================================
🔵 What is docker inspect?
docker inspect is used to view detailed information about Docker containers, images, networks, or volumes in JSON format.
It shows everything about a container:
Network info
Mounts / volumes
IP address
Ports
Environment variables
Health status
Entry point, CMD
Resource usage config
Labels
Container state (running, stopped, etc.)
This is the most powerful debugging command.
🟦 Basic Command: docker inspect <container_id>
🟩 Example Output (Simplified)
You will see JSON fields like Id, Created, State, Config (Env, Cmd, Entrypoint), NetworkSettings (IPAddress, Ports), and Mounts.
🔧 Most Useful Inspect Filters (Important!)
📍 1. Get container IP address
📍 2. Get just the environment variables
📍 3. Get container’s running status
📍 4. Get container entrypoint
📍 5. Get exposed ports
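Sketches for the filters above (note: for user-defined networks the IP lives under .NetworkSettings.Networks):

```bash
# 1. Container IP address
docker inspect --format '{{.NetworkSettings.IPAddress}}' <container_id>

# 2. Environment variables
docker inspect --format '{{.Config.Env}}' <container_id>

# 3. Running status
docker inspect --format '{{.State.Status}}' <container_id>

# 4. Entrypoint
docker inspect --format '{{.Config.Entrypoint}}' <container_id>

# 5. Exposed / mapped ports
docker inspect --format '{{.NetworkSettings.Ports}}' <container_id>
```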
🟧 Inspecting Images
Useful to see:
layers
build parameters
environment variables
entrypoint
🟪 Inspecting Networks
You can find:
connected containers
IP ranges (subnet)
gateway
driver type
🟫 Inspecting Volumes
Shows:
mount point
driver
usage
✨ Real Use Cases (Important for Interviews)
| Use Case | Command |
|---|---|
| Debug network issues | Get IP, ports |
| Debug ENV variables | extract .Config.Env |
| Verify mounted volumes | check .Mounts |
| Check health status | check .State.Health.Status |
| Know why a container exited | check .State.ExitCode |
🟩 Check Container Logs (Related Command): docker logs <container_id>
=========================================================================
🔵 What is Port Mapping in Docker?
Port mapping connects a container’s internal port to a port on your host machine so that applications inside the container can be accessed from outside.
Every container has its own internal ports.
Multiple containers run on the same host, but the host has a limited set of ports.
If two containers expose the same internal port, you must map them to different host ports to avoid conflicts.
Example:
A container running a web server on port 80 → accessible on host via port 8080
This is called port forwarding.
🟦 Syntax
Example:
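```bash
# Syntax: -p <host_port>:<container_port>
docker run -p 8080:80 nginx
```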
Meaning:
Inside container, Nginx listens on 80
On your laptop/server, you hit http://localhost:8080
🟩 Types of Port Mapping
1. Host → Container (most common)
2. Bind to specific IP (e.g., localhost only)
Meaning:
Only local machine can access it.
3. Automatic host port assignment
Docker assigns random free ports.
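Sketches of the three mapping styles above:

```bash
# 1. Host → Container
docker run -p 8080:80 nginx

# 2. Bind to a specific IP (localhost only)
docker run -p 127.0.0.1:8080:80 nginx

# 3. Automatic host port assignment (-P publishes all exposed ports to random free ports)
docker run -P nginx
```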
🟧 Check Mapped Ports: run docker ps (see the PORTS column) or docker port <container_id>.
You will see a mapping such as 0.0.0.0:8080->80/tcp.
🟪 Why Port Mapping Is Needed (Interview Points)
Containers run in isolated networks
Container ports aren’t accessible from host by default
Port mapping exposes them
Allows multiple instances to run on different host ports
Helps in local development and testing
🟫 Real Examples
1️⃣ Expose Postgres
2️⃣ Expose Airflow Webserver
3️⃣ Expose FastAPI on 8000
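Sketches for the three cases above (image names and credentials are placeholders):

```bash
# 1. Expose Postgres
docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=secret postgres:15

# 2. Expose the Airflow webserver
docker run -d -p 8080:8080 apache/airflow:2.9.0 webserver

# 3. Expose FastAPI on 8000
docker run -d -p 8000:8000 my-fastapi-app
```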
🔥 Port Mapping in Docker Compose
Same meaning as -p on the CLI: a ports: entry like "8080:80" maps host 8080 → container 80.
🔥 Docker Pull vs Docker Run — Simple Difference
✅ docker pull
Only downloads the image from Docker Hub onto your system.
It does NOT create or start a container.
Example: docker pull redis
Result:
Redis image is downloaded
No container is created
No process runs
✅ docker run
Creates a container and runs it.
If the image does NOT exist locally, it will automatically pull it first.
Example: docker run redis
Result:
Docker checks if image exists
If missing → pulls automatically
Creates a new container
Starts the container (runs Redis)
=========================================================================
AIRFLOW
Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows, commonly used to manage complex data pipelines.
A workflow in Airflow is a DAG (Directed Acyclic Graph), which defines a set of tasks and their execution order, dependencies, and scheduling.
A DAG (Directed Acyclic Graph) represents a workflow which has collection of tasks with dependencies.
In Apache Airflow, a task is the smallest unit of work within a workflow (DAG). Each task represents a single operation or action, such as running a Python function, executing a SQL query, or triggering a bash command. A task is referred to by its task_id, and an operator defines what the task does.
Apache Airflow Scheduler is a core component responsible for triggering task instances to run in accordance with the defined Directed Acyclic Graphs (DAGs) and their schedules.
In Apache Airflow, an Executor is the component responsible for actually running the tasks defined in your workflows (DAGs). It takes task instances that the Scheduler determines are ready and orchestrates their execution either locally or on remote workers.
=======================================================================
🔹 What is Docker?
Docker is a platform used to:
- Package an application and its dependencies into a container
- Ensure the application runs the same across all environments
A Docker container is a lightweight, standalone, and executable package that includes everything needed to run a piece of software: code, libraries, environment variables, and config files.
🐳 What is a Dockerfile?
- A Dockerfile is a text file that contains all the instructions to build a Docker image.
- It defines the environment, dependencies, and commands your application needs to run consistently on any machine.
- Think of it as a recipe for your container.
🔹 Step-by-Step Explanation
- FROM python:3.11-slim → Base image with Python installed. Slim version = smaller image.
- WORKDIR /app → Sets working directory inside container.
- COPY requirements.txt . → Copies dependency file into container.
- RUN pip install ... → Installs Python packages inside container.
- COPY . . → Copies your ETL or Airflow scripts into container.
- ENV PYTHONUNBUFFERED=1 → Makes Python logs visible immediately (useful for debugging).
- CMD ["python", "main.py"] → Default command when container starts. Can be your ETL job or Airflow task script.
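Putting the steps above together (main.py is a placeholder for your entry script):

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PYTHONUNBUFFERED=1
CMD ["python", "main.py"]
```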
🔹 Useful Commands
- Build Docker Image
- Run Container
- Run with mounted volume (edit locally, reflect in container)
- Push to Docker Hub / Registry
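Sketches for the commands above (image and username are placeholders):

```bash
# Build Docker Image
docker build -t my-etl-image:1.0 .

# Run Container
docker run --name my-etl my-etl-image:1.0

# Run with mounted volume (edit locally, reflect in container)
docker run -v $(pwd):/app my-etl-image:1.0

# Push to Docker Hub / Registry
docker tag my-etl-image:1.0 <username>/my-etl-image:1.0
docker push <username>/my-etl-image:1.0
```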
🐳 What is Docker Compose?
- Docker Compose is a tool to define and run multi-container Docker applications.
- Instead of running each container individually, you define all services in a single docker-compose.yml file.
- You can spin up the whole environment with one command: docker-compose up
🔹 Run Commands
- Build & start all services: docker-compose up
- Run in detached mode (background): docker-compose up -d
- Stop all containers: docker-compose down
- View logs of a service: docker-compose logs <service>
----------------------------------------------------------------------------
📄 What is requirements.txt?
- It’s a text file listing all the Python packages your project needs.
- Used by pip to install dependencies: pip install -r requirements.txt
=======================================================================
⭐ 1. Airflow Connections
Connections = saved credentials for external systems.
Examples:
- AWS
- Snowflake
- Postgres
- MySQL
- BigQuery
- S3
- Kafka
- Redshift
🔹 How to Set Connections
A) Using Airflow UI
- Go to Admin → Connections
- Click + Add
- Fill in:
  - Conn ID → aws_default
  - Conn Type → Amazon Web Services
  - Extra → JSON (keys, region, endpoint)
- Save
B) Using CLI
C) Using Environment Variables
Format:
Example:
This is very common in Docker/Kubernetes.
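Sketches of the CLI (B) and environment-variable (C) approaches; the connection ID and region are illustrative:

```bash
# B) Using the CLI
airflow connections add aws_default \
    --conn-type aws \
    --conn-extra '{"region_name": "us-east-1"}'

# C) Using an environment variable
# Format: AIRFLOW_CONN_<CONN_ID_IN_UPPERCASE>
export AIRFLOW_CONN_AWS_DEFAULT='aws://?region_name=us-east-1'
```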
⭐ 2. Airflow Variables
Variables = key–value store for configuration.
Example:
- file_path
- S3 bucket name
- threshold value
- list of emails
🔹 How to Set Variables
A) Using UI
Admin → Variables → Add
B) Using CLI
C) Using JSON IMPORT
D) Using Environment Variables
Usage inside DAG:
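Sketches for B–D and for usage inside a DAG (key and value names are placeholders):

```bash
# B) CLI
airflow variables set s3_bucket my-data-bucket

# C) JSON import
airflow variables import variables.json

# D) Environment variable (prefix AIRFLOW_VAR_)
export AIRFLOW_VAR_S3_BUCKET=my-data-bucket
```

```python
# Usage inside a DAG
from airflow.models import Variable

bucket = Variable.get("s3_bucket")
```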
⭐ 3. Airflow Secret Backends (Very Important for Data Engineers)
Airflow supports managing secrets securely using external systems.
🔹 Supported Secret Backends:
- AWS Secrets Manager
- GCP Secret Manager
- Hashicorp Vault
- Azure Key Vault
- Custom secret backends
Why use secret backends?
- Secrets are not stored in Airflow DB
- Rotated automatically
- Secure & centralized
- Avoid plaintext passwords in Airflow UI
🔹 Example: Using AWS Secrets Manager
Add to airflow.cfg:
AWS Secret Format:
Used automatically in DAG:
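A sketch, assuming the Amazon provider package is installed; prefixes are conventional examples:

```ini
# airflow.cfg
[secrets]
backend = airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
backend_kwargs = {"connections_prefix": "airflow/connections", "variables_prefix": "airflow/variables"}
```

With this configuration, a secret stored under a name like airflow/connections/aws_default is resolved automatically whenever a task uses conn_id="aws_default" — no extra code in the DAG.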
⭐ 4. Best Practices for Storing Credentials (MOST IMPORTANT)
🔐 1. NEVER store passwords in code
❌ Don't hardcode passwords, keys, or tokens in DAG files.
✔ Use Airflow Connections, environment variables, or a secret backend instead.
🔐 2. Avoid storing secrets in Airflow Variables
Variables are NOT encrypted by default.
🔐 3. Use Secret Backends for all production credentials
- AWS Secrets Manager
- GCP Secret Manager
- Hashicorp Vault
🔐 4. Use environment variables for local development
Safe and temporary.
🔐 5. Do not store credentials in GitHub / repo
Always use:
- .env
- Kubernetes Secrets
- Docker Secrets
🔐 6. Use different connection IDs for dev/stage/prod
Example:
- aws_dev
- aws_stage
- aws_prod
🔐 7. Use JSON "extra" field for complex configs
Example: the Extra field in the UI can hold a JSON object with region, endpoint, and other provider-specific settings.
Operators are Python classes that define a template for a specific unit of work (task) in a workflow. When you instantiate an operator in a DAG, it becomes a task that Airflow executes. Operators encapsulate the logic required to perform a defined action or job.
Each operator represent a single task in workflow — like running a script, moving data, or checking if a file exists.
Operators = do something
Sensors = wait for something
Hooks = connection to systems (S3Hook, PostgresHook, etc.)
Executors = how tasks run (Local, Celery, Kubernetes)
Scheduler = creates DAG Runs + task instances
Type of operator
In Apache Airflow, Operators are the building blocks of your workflows (DAGs). Each operator defines a single task to be executed. There are different types of operators based on the type of work they perform.
Operators fall into three broad categories:
Action Operators:
Perform an action like running code or sending an email.
Examples:
- PythonOperator to run a Python function
- BashOperator to run shell commands
- EmailOperator to send emails
- SimpleHttpOperator to interact with APIs
Transfer Operators:
Move data between systems or different storage locations.
Examples:
- S3ToRedshiftOperator
- MySqlToGoogleCloudStorageOperator
Sensor Operators:
Wait for a certain event or external condition before proceeding.
Examples:
- FileSensor waits for a file to appear
- ExternalTaskSensor waits for another task to complete
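A minimal DAG sketch showing action operators in use (the DAG id, schedule, and function bodies are illustrative assumptions):
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def extract():
    print("extracting data...")

with DAG(
    dag_id="example_operators",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = BashOperator(task_id="load", bash_command="echo loading")
    extract_task >> load_task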
=======================================================================
Apache Airflow commands
🔗 chain() Function
The chain() function lives in airflow.models.baseoperator (it was previously in airflow.utils.helpers) and helps you connect multiple tasks or groups in a sequence without writing task_1 >> task_2 >> task_3 manually.
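A short sketch (the task names are placeholders):
from airflow.models.baseoperator import chain

# Equivalent to: task_1 >> task_2 >> task_3
chain(task_1, task_2, task_3)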
=======================================================================
⭐ 1. Airflow Scheduling Basics
Airflow schedules based on:
- cron expressions
- timetables
- logical date
- catchup
- backfill
A DAG run does NOT start at the exact cron time—it starts after the logical interval finishes.
🟦 2. Cron Expressions in Airflow
Cron = when to run the DAG.
Examples:
| Cron | Meaning |
|---|---|
| 0 0 * * * | Every midnight |
| 0 */2 * * * | Every 2 hours |
| 0 6 * * 1 | Every Monday at 6 AM |
| */5 * * * * | Every 5 minutes |
Airflow uses cron to define the start of the schedule interval, but the DAG runs after the interval finishes.
🟦 3. Timetables (Airflow 2.2+)
Timetables = new, flexible scheduling system.
Useful when cron is not enough.
Examples:
- Run DAG every business day except holidays
- Run every 3 hours between 9–5
- Run based on dataset dependencies
- Run after an upstream dataset is updated
Timetables replace schedule_interval for advanced cases.
🟦 4. Catchup vs No Catchup
| Setting | What it Means |
|---|---|
| catchup=True | Airflow creates DAG Runs for all past dates since the start date |
| catchup=False | Airflow only runs the latest DAG run, skips historical dates |
Example:
DAG start date = Jan 1
Today = Jan 5
Schedule = daily
| catchup setting | Runs created |
|---|---|
| True | 1,2,3,4,5 Jan (5 runs) |
| False | Only Jan 5 (latest run) |
🟦 5. Backfill (Manual Catchup)
Backfill = you manually run past dates even if catchup=False.
Command:
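For example (the DAG id my_dag is a placeholder; the same command is shown later in these notes):
airflow dags backfill -s 2024-01-01 -e 2024-01-05 my_dag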
Purpose:
- Re-run historical data
- Fix missed data loads
- Reprocess partitions
🟦 6. Logical Date (MOST IMPORTANT)
Logical date = the data interval the DAG run is processing.
It is not the actual time the run starts.
Example:
Schedule: Daily
Cron: 0 0 * * * (midnight, i.e., start of interval)
DAG run logical date: 2024-10-10 00:00 (the start of the interval 2024-10-10 → 2024-10-11)
Run actually starts at: 2024-10-11 00:00 or later, after that interval finishes
Why is this important?
All tasks use logical_date for:
- file paths
- S3 partitions
- SQL date parameters
- templated variables ({{ ds }} etc.)
Think of it like:
✔ Logical date = data date
✔ Execution date = same as logical date (Airflow 2.2+)
✖ NOT the real-time the task runs
🟣 Logical Date Example (Simple)
Schedule = daily
Interval = 2024-01-01 00:00 → 2024-01-02 00:00
The DAG run for this interval actually starts at 2024-01-02 00:01 (or later), but its logical date is 2024-01-01, because that’s the interval start (logical date).
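A small sketch of how the logical date typically shows up in task parameters (the bucket and key pattern are placeholders):
from airflow.operators.bash import BashOperator

# {{ ds }} renders to the run's logical date, e.g. s3://my-bucket/raw/2024-01-01/data.csv
copy_task = BashOperator(
    task_id="copy_partition",
    bash_command="aws s3 cp s3://my-bucket/raw/{{ ds }}/data.csv /tmp/data.csv",
)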
=======================================================
Cron
Cron expressions are used in the schedule_interval parameter of a DAG to define when the DAG should run.
=======================================================================
🔹 What Are Hooks in Apache Airflow?
Hooks in Airflow are interfaces to external platforms, like databases, cloud storage, APIs, and more. They abstract the connection and authentication logic, allowing operators to use these services easily.
Hooks are mostly used behind the scenes by Operators, but you can also call them directly in Python functions.
🔸 Why Use Hooks?
- Reusable connection logic
- Securely use Airflow's connection system (Airflow Connections UI)
- Simplifies integrating with external systems (e.g., MySQL, S3, BigQuery, Snowflake)
=======================================================================
☁️ What is S3Hook?
- S3Hook is a helper class in Airflow to interact with Amazon S3.
- It abstracts the boto3 (AWS SDK for Python) operations so you can read/write files, list buckets, check if objects exist, etc., directly in your DAGs.
- Comes from: from airflow.providers.amazon.aws.hooks.s3 import S3Hook (Airflow 2+)
🔹 When to Use S3Hook
- You want to upload a file to S3 from Airflow.
- You want to download a file from S3 for processing.
- You want to check if a key/object exists before running a task.
- You want to list files in a bucket dynamically.
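A hedged sketch of S3Hook used inside a Python task (the conn id, bucket, and key are placeholders):
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def upload_report():
    hook = S3Hook(aws_conn_id="aws_default")
    # Upload a local file to S3
    hook.load_file(
        filename="/tmp/report.csv",
        key="reports/2024-01-01/report.csv",
        bucket_name="my-data-bucket",
        replace=True,
    )
    # Check whether an object exists before downstream processing
    exists = hook.check_for_key("reports/2024-01-01/report.csv", bucket_name="my-data-bucket")
    print("key exists:", exists)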
=======================================================================
🧠 What Does an Executor Do?
It communicates with the Scheduler and runs the tasks defined in your DAGs—either locally, in parallel, or on distributed systems like Celery or Kubernetes.
-> airflow info shows which executor the environment is currently running.
=======================================================================
🧠 Why Use SLAs?
To ensure:
- Timely data availability
- Reliable pipeline performance
- Alerting for delays or failures
🧩 How SLA Works in Airflow
- SLA is defined per task, not per DAG.
- If a task takes longer than the SLA, it's marked as an SLA miss.
- Airflow triggers an SLA miss callback and logs the event.
- Email alerts can be sent if configured.
📊 Monitoring SLA Misses
- Go to Airflow UI → Browse → SLA Misses
- Or check the Task Instance Details
⚠️ Notes
- SLAs are checked after the DAG run completes.
- SLAs are about runtime, not start time.
- SLA doesn’t retry or fail the task—it just logs the violation.
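A hedged sketch of defining an SLA on a task plus a miss callback on the DAG (the one-hour threshold and function bodies are illustrative):
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    print("SLA missed for:", task_list)

with DAG(
    dag_id="sla_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    sla_miss_callback=notify_sla_miss,
) as dag:
    load = PythonOperator(
        task_id="load",
        python_callable=lambda: print("loading"),
        # task should finish within 1 hour of the DAG run's scheduled start
        sla=timedelta(hours=1),
    )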
=======================================================================
🔧 What is a Template?
A template is a string that contains placeholders which are evaluated at runtime using the Jinja2 engine.
In Apache Airflow, Templates allow you to dynamically generate values at runtime using Jinja templating (similar to templating in Flask or Django). They are useful when you want task parameters to depend on execution context, such as the date, DAG ID, or other dynamic values.
=======================================================================
Jinja is a templating engine for Python used heavily in Apache Airflow to dynamically render strings using runtime context. It lets you inject variables, logic, and macros into your task parameters.
🔍 What is Jinja?
Jinja is a mini-language similar to Django or Liquid templates. In Airflow, it's used for:
- Creating dynamic file paths
- Modifying behavior based on execution date
- Using control structures like loops and conditions
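A hedged sketch of Jinja in a templated field, including a simple loop (the table names are placeholders):
from airflow.operators.bash import BashOperator

# bash_command is a templated field, so Jinja variables and control structures render at runtime
report = BashOperator(
    task_id="report",
    bash_command=(
        "echo 'run date: {{ ds }}'; "
        "{% for t in params.tables %} echo 'processing {{ t }}'; {% endfor %}"
    ),
    params={"tables": ["orders", "customers"]},
)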
✅ What is catchup?
When catchup=True (default), Airflow will "catch up" by running all the DAG runs from the start_date to the current date.
When catchup=False, it only runs the latest scheduled DAG run from the time it is triggered.
=======================================================================
🔁 What is Backfill?
Backfill is the process of running a DAG for past scheduled intervals that have not yet been run.
When a DAG is created or modified with a start_date in the past, Airflow can "backfill" to ensure that all scheduled intervals between the start_date and now are executed.
=======================================================================
🔧 Components of Apache Airflow
Apache Airflow is made up of several core components that work together to orchestrate workflows:
| Component | Description |
|---|---|
| Scheduler | The brain of Airflow that monitors DAGs and tasks, triggers DAG runs based on schedules or events, and submits tasks to the executor for execution. It continuously checks dependencies and task states to decide what to run next. By default, new DAG files in the DAGs folder are picked up within about 5 minutes, and the scheduler loop re-checks for runnable tasks every few seconds. |
| Executor | Executes task instances assigned by the scheduler. It can run tasks locally, via distributed workers, or on containerized environments depending on the executor type (LocalExecutor, CeleryExecutor, KubernetesExecutor, etc.). |
| Workers | Machines or processes (depending on executor) that actually run the task code. For distributed executors like Celery or Kubernetes, multiple workers run tasks in parallel, scaling out capacity. |
| Metadata Database | A relational database (e.g., PostgreSQL, MySQL) that stores all Airflow metadata: DAG definitions, task states, execution history, logs, connection info, and more. The scheduler, workers, and webserver interact with it constantly. |
| Webserver (UI) | Provides a user interface to monitor DAG runs, task status, logs, and overall workflow health. Built on Flask/Gunicorn in Airflow 2.x; Airflow 3 moves to a FastAPI-based API server serving the UI and external clients. |
| DAGs Folder | Directory or location where DAG definition Python files live. These files describe the workflows and are parsed by the scheduler or DAG processor. |
🟦 What is Airflow Scheduler?
The Airflow Scheduler is the component responsible for triggering DAG runs and executing tasks at the right time based on the DAG’s schedule, dependencies, and state.
📌 It is the “brain” of Airflow.
🟣 What does the Airflow Scheduler do?
The scheduler continuously:
| Function | Explanation |
|---|---|
| Monitors DAGs | Watches all DAG files for new/updated DAGs. |
| Creates DAG Runs | Starts DAG runs at the scheduled intervals. |
| Checks Dependencies | Ensures upstream tasks are finished before running next task. |
| Queues Tasks | Decides which tasks are ready to run. |
| Sends tasks to Executor | Hands tasks to workers (Local/Celery/K8s). |
| Handles retries | If a task fails, scheduler triggers retries. |
| Manages SLA | Detects SLA misses. |
🟦 How the Scheduler Works (Simple Flow)
The scheduler loops continuously, making decisions every few seconds.
🟣 Important Concepts for Interviews
1. Scheduling interval
Scheduler respects:
- schedule_interval
- start_date
- end_date
- catchup
2. Logical Date (Very important!)
Scheduler runs DAGs based on logical execution date, not current time.
3. Executor
Scheduler just queues tasks, but does NOT execute them.
Executor runs the task.
Example executors:
- LocalExecutor
- CeleryExecutor
- KubernetesExecutor
4. Concurrency Controls
Scheduler respects:
- DAG concurrency
- Task concurrency
- Pools
- Parallelism
These prevent overload.
5. Heartbeats
Scheduler sends a “heartbeat” every few seconds.
If heartbeat stops → scheduler is down.
🟦 Example: Scheduler in Action
If a DAG has:
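For instance, a hedged sketch (the DAG id and dates are illustrative):
from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=True,
) as dag:
    ...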
The scheduler will create DAG runs:
- 2024-01-01 (logical date)
- 2024-01-02
- 2024-01-03
- …
Each run → scheduler checks tasks → queues ready ones.
======================================================
🟦 What is an Executor in Airflow?
An Executor is the Airflow component responsible for actually running the tasks.
While the Scheduler decides what to run,
the Executor decides how and where to run it.
📌 Executor = Task runner
📌 Scheduler = Task coordinator
🟣 Why Executor is Important?
Executors decide:
- How many tasks run in parallel
- Where tasks get executed
- Whether tasks run locally, on workers, or on Kubernetes pods
The choice of executor determines Airflow’s scalability.
🟦 Types of Executors (Must Know)
✅ 1. SequentialExecutor
✔ What it is:
- Runs ONE task at a time
- Single-threaded
- No parallelism
- Default for quick testing
✔ Use Cases:
- Local testing
- Development / laptop
- Very small DAGs
❌ Not for production.
✅ 2. LocalExecutor
✔ What it is:
- Runs tasks in parallel on the same machine
- Uses multiple processes/threads
- Good performance for small pipelines
✔ Use Cases:
- Small teams
- Single-server Airflow deployments
- Around 10–20 parallel tasks
❌ Not suitable for distributed workloads
❌ Cannot scale beyond one machine
✅ 3. CeleryExecutor
✔ What it is:
- Distributed task execution
- Multiple worker machines
- Uses a message broker:
  - Redis
  - RabbitMQ
✔ Use Cases:
- Medium to large teams
- Many DAGs running at the same time
- Need dozens or hundreds of parallel tasks
- On-prem or AWS EC2 deployments
👍 Pros
- Highly scalable
- Fault-tolerant
- Good for data engineering teams
👎 Cons
- Complex setup (workers + broker + DB)
- Higher maintenance
✅ 4. KubernetesExecutor (Most modern)
✔ What it is:
- Each task runs in its own Kubernetes pod
- True elastic scaling
- Perfect isolation of tasks
- Clean environment per task
✔ Use Cases:
- Cloud-native setups
- Very large workloads
- Need per-task compute scaling
- Mixed workloads (Python, Spark, Java, Bash, etc.)
👍 Pros:
- Auto-scaling
- GPU/High-memory pods
- Per-task docker image support
👎 Cons:
- Requires Kubernetes knowledge
- Complex to manage for small teams
(Bonus) — LocalKubernetesExecutor (Hybrid)
- LocalExecutor for small tasks
- KubernetesExecutor for heavy tasks
🟦 How Scheduler and Executor Work Together
🟣 Comparison Table
| Executor | Parallel? | Distributed? | Use Case |
|---|---|---|---|
| SequentialExecutor | ❌ No | ❌ No | Testing only |
| LocalExecutor | ✔ Yes | ❌ No | Medium workloads |
| CeleryExecutor | ✔ Yes | ✔ Yes | Large-scale pipelines |
| KubernetesExecutor | ✔ Yes | ✔ Yes | Cloud-native, scalable workloads |
🟦 Best Executor for Data Engineering?
| Use Case | Best Executor |
|---|---|
| Small team, single VM | LocalExecutor |
| Distributed on-prem cluster | CeleryExecutor |
| Cloud environments (AWS/GCP/Azure) | KubernetesExecutor |
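The executor is chosen in the Airflow configuration; a hedged sketch of both styles (the executor values are illustrative):
# airflow.cfg
[core]
executor = LocalExecutor

# or as an environment variable (common in Docker/Kubernetes deployments)
AIRFLOW__CORE__EXECUTOR=CeleryExecutor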
======================================================
🟦 What is the Airflow Webserver?
The Airflow Webserver is the component that provides the UI (User Interface) for Airflow.
It lets you view, monitor, trigger, pause, and manage DAGs through a browser.
📌 Webserver = Airflow UI
📌 It shows everything happening inside Airflow.
🟣 What Webserver Does
| Function | Explanation |
|---|---|
| Displays DAGs | Shows all DAGs in the UI |
| Trigger DAGs | You can manually run a DAG |
| Pause/Unpause DAGs | Enable or disable scheduling |
| View Graph View | DAG structure (dependencies) |
| Task Logs | View task execution logs |
| Monitor status | Success / Failed / Queued / Running |
| View XCom | See data passed between tasks |
| Manage Connections | Add/edit database or API credentials |
| Variables | Store global values for DAGs |
| Admin Panel | DAG runs, task instances, users, roles |
🟦 How Webserver Works (Simple Explanation)
- Webserver reads DAG files
- Displays DAGs in the UI
- Shows scheduler and executor status
- Allows user actions (trigger, clear, rerun tasks)
It runs using Flask (Python web framework) behind the scenes in Airflow 2.x; Airflow 3 replaces it with a FastAPI-based server.
🟦 Important Ports
Default port: 8080
In production you may use Nginx/HTTPS.
🟣 Webserver vs Scheduler
| Component | Purpose |
|---|---|
| Webserver | UI to view/manage pipelines |
| Scheduler | Decides when tasks should run |
| Executor | Actually runs the tasks |
🟦 How to Start the Webserver
In Docker:
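A hedged sketch (the Compose service name airflow-webserver follows the official docker-compose file, but may differ in your setup):
# Standalone / local install
airflow webserver --port 8080
# With Docker Compose
docker-compose up -d airflow-webserver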
=================================================================
🟦 1. max_active_runs (at DAG level)
✅ Definition
max_active_runs = maximum number of DAG Runs that can run at the same time for a specific DAG.
📌 Think:
"How many full pipeline runs can run in parallel?"
Example:
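A minimal sketch (the DAG id is a placeholder):
from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="daily_load",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    max_active_runs=1,   # only one DAG run at a time
) as dag:
    ...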
✔ Only one DAG run will run at a time
✖ A new scheduled run will wait until the previous run finishes
📘 Why important?
- Prevents overlapping runs
- Useful for pipelines that update the same tables
- Avoids data corruption
🟦 2. concurrency (at DAG level)
✅ Definition
concurrency = maximum number of task instances from the SAME DAG that can run in parallel.
📌 Think:
"How many tasks inside this DAG can run at the same time?"
Example:
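A minimal sketch; note that in newer Airflow versions this parameter is named max_active_tasks (the DAG id is a placeholder):
from datetime import datetime
from airflow import DAG

with DAG(
    dag_id="heavy_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    concurrency=5,   # max 5 task instances of this DAG at once (max_active_tasks in newer versions)
) as dag:
    ...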
✔ Maximum 5 tasks from this DAG can run at once
✖ The 6th task waits in the queue
📘 Why important?
- Controls the load on your system
- Prevents overwhelming the database, Spark cluster, APIs, etc.
⭐ 1. Dynamic DAGs (Airflow)
Dynamic DAGs = DAGs that are generated programmatically instead of hardcoding tasks.
Example:
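A hedged sketch that generates one task per table inside a single DAG (the table names and load function are placeholders):
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

TABLES = ["orders", "customers", "payments"]

def load_table(table_name):
    print(f"loading {table_name}")

with DAG("dynamic_tables", start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    for table in TABLES:
        PythonOperator(
            task_id=f"load_{table}",
            python_callable=load_table,
            op_kwargs={"table_name": table},
        )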
✔ Why use Dynamic DAGs?
- Automatically create tasks for multiple tables/files
- Avoid writing duplicate code
- Perfect for pipelines with 20–500 tables
⭐ 2. Dynamic Tasks (Task Mapping in Airflow 2.3+)
Task mapping = Airflow automatically creates multiple task instances at runtime.
Example (Best Interview Answer):
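A hedged sketch of dynamic task mapping with the TaskFlow API (the file list and processing logic are placeholders):
from datetime import datetime
from airflow import DAG
from airflow.decorators import task

with DAG("mapped_files", start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False) as dag:

    @task
    def list_files():
        return ["file_1.csv", "file_2.csv", "file_3.csv"]

    @task
    def process(file_name):
        print(f"processing {file_name}")

    # One task instance is created per file at runtime
    process.expand(file_name=list_files())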
✔ Why Task Mapping is powerful:
- Dynamically generates tasks at runtime
- No DAG parsing overhead (unlike old dynamic DAGs)
- Much cleaner & more scalable
✔ Example Use Cases:
- Load 100 S3 files
- Process N partitions
- Trigger N API calls
- Run ML jobs for each model
⭐ 3. Avoiding DAG Explosion
DAG Explosion = too many tasks or too many DAGs, causing:
- Slow UI
- Scheduler overload
- Metadata DB pressure
- DAG parsing delays
Causes:
- Generating thousands of tasks in the DAG file
- Creating DAGs dynamically for each table (e.g., 100 tables → 100 DAGs)
Solution:
- Use Task Mapping
- Use TaskGroup
- Batch tasks
- Push dynamic behavior to runtime, not DAG file parse time
⭐ 4. TaskGroup (Organizing Large DAGs)
TaskGroup = visual and logical grouping of tasks.
Example:
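A hedged sketch grouping extract tasks under a TaskGroup (the task bodies are placeholders):
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.task_group import TaskGroup

with DAG("grouped_dag", start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False) as dag:

    with TaskGroup("extract") as extract_group:
        PythonOperator(task_id="extract_orders", python_callable=lambda: print("orders"))
        PythonOperator(task_id="extract_customers", python_callable=lambda: print("customers"))

    transform = PythonOperator(task_id="transform", python_callable=lambda: print("transform"))

    extract_group >> transform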
✔ Why TaskGroup is used:
- Organize DAGs with 50+ tasks
- Avoid clutter in Airflow UI
- Easier debugging
- Logical grouping like:
  - extract group
  - transform group
  - load group
- Not for isolation — only for visual and logical grouping.
=======================================================================
DAG View
=======================================================================
🔄 XCom(Cross-Communication)
XCom (short for “Cross-communication”) allows tasks to exchange small amounts of data between each other in a DAG.
🔧 How XCom Works
- Push → Send data to XCom from one task
- Pull → Retrieve that data in another task
🔥 What XCom Should NOT be Used For
Very important for interviews:
❌ Do NOT pass large datasets
❌ Not meant for files
❌ Not used for DataFrames
❌ Not used for binary data
Use XCom only for small metadata, like:
- file paths
- S3 keys
- table names
- row counts
🟦 Types of XCom in Airflow
There are 3 main types of XCom you must know:
✅ 1. Default / Implicit XCom (PythonOperator return value)
- When a PythonOperator function returns a value, Airflow automatically pushes it to XCom.
- No need to write xcom_push() manually.
Example:
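A minimal sketch (the function and task names are placeholders):
from airflow.operators.python import PythonOperator

def extract_count():
    # The return value is pushed to XCom automatically under the key "return_value"
    return 42

extract = PythonOperator(task_id="extract_count", python_callable=extract_count)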
✔ Automatically becomes an XCom value
✔ Most commonly used type
✅ 2. Manual XCom (Explicit push & pull)
Used when you want full control.
Push:
Pull:
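A hedged sketch using the task instance (ti) from the context; the key, value, and upstream task id are placeholders:
def push_path(ti):
    # Push: store a value under a custom key
    ti.xcom_push(key="s3_key", value="reports/2024-01-01/report.csv")

def pull_path(ti):
    # Pull: read the value pushed by the upstream task (used as python_callable of another task)
    s3_key = ti.xcom_pull(task_ids="push_task", key="s3_key")
    print(s3_key)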
✔ Used when you need custom key names
✔ Useful when returning multiple values
✅ 3. TaskFlow API XCom (@task decorator)
This works like implicit XCom but with cleaner syntax using the TaskFlow API.
Example:
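A minimal sketch with the @task decorator (the DAG id and function names are placeholders):
from datetime import datetime
from airflow.decorators import dag, task

@dag(start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False)
def taskflow_xcom():
    @task
    def extract():
        return {"rows": 100}          # the return value becomes an XCom automatically

    @task
    def load(stats):
        print("rows extracted:", stats["rows"])

    load(extract())                    # the XCom is passed between tasks implicitly

taskflow_xcom()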
✔ Return values automatically become XCom
✔ Passing function outputs becomes easier
✔ Preferred in modern Airflow (2.x)
=======================================================================
🧩 Airflow Variables
Airflow Variables are key-value pairs used to store and retrieve dynamic configurations in your DAGs and tasks.
=======================================================================
🛰️ Apache Airflow Sensors
Sensors are special types of operators in Airflow that wait for a condition to be true before allowing downstream tasks to proceed.
=======================================================================
🌿 Branching in Apache Airflow
Branching allows you to dynamically choose one (or more) downstream paths from a set of tasks based on logic. This is done using the BranchPythonOperator.
🧠 Why Use Branching?
Branching is useful when:
- You want to run different tasks based on a condition
- You need to skip certain tasks
- You want "if/else" logic in your DAG
✅ Notes:
- Tasks not returned by BranchPythonOperator will be skipped.
- You can return a single task ID or a list of task IDs.
- Ensure your downstream tasks can handle being skipped, or use an appropriate trigger_rule.
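A hedged sketch of branching (the condition and task ids are placeholders; EmptyOperator requires Airflow 2.3+):
from datetime import datetime
from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.operators.empty import EmptyOperator

def choose_path(**context):
    # Return the task_id (or list of task_ids) to follow
    if context["logical_date"].weekday() < 5:
        return "weekday_load"
    return "weekend_load"

with DAG("branch_example", start_date=datetime(2024, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    branch = BranchPythonOperator(task_id="branch", python_callable=choose_path)
    weekday = EmptyOperator(task_id="weekday_load")
    weekend = EmptyOperator(task_id="weekend_load")
    branch >> [weekday, weekend]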
=======================================================================
Subdag
🔄 What is a SubDAG in Apache Airflow?
A SubDAG is a DAG within a DAG — essentially, a child DAG defined inside a parent DAG. It's used to logically group related tasks together and reuse workflow patterns, making complex DAGs easier to manage.
📌 Think of a SubDAG as a modular block that can be reused or organized separately.
🧩 TaskGroup in Apache Airflow
A TaskGroup in Airflow is a way to visually and logically group tasks together in the UI without creating a separate DAG like SubDagOperator. It's lightweight, easier to use, and the recommended approach in Airflow 2.x+.
=======================================================================
🔗 Edge Labels in Apache Airflow
Edge labels in Airflow are annotations you can add to the edges (arrows) between tasks in the DAG graph view. They help clarify why one task depends on another, especially when using complex branching, conditionals, or TriggerRules.
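A short sketch using Label on the dependency arrows (the task names are placeholders, reusing the branching example above):
from airflow.utils.edgemodifier import Label

branch >> Label("weekday path") >> weekday
branch >> Label("weekend path") >> weekend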
==========================================================
⭐ 1. Catchup
Definition:
Airflow automatically creates DAG Runs for all missed schedule intervals since the DAG’s start_date.
| Feature | Details |
|---|---|
| Parameter | catchup=True/False |
| Default | True |
| Behavior | Creates runs for all past intervals until today |
| Use Case | When you want to process historical data automatically |
Example:
DAG start date = Jan 1, today = Jan 5, daily DAG, catchup=True → DAG runs for Jan 1,2,3,4,5
⭐ 2. Backfill
Definition:
Manually run DAG runs for specific past dates, regardless of catchup setting.
| Feature | Details |
|---|---|
| Command | airflow dags backfill -s 2024-01-01 -e 2024-01-05 my_dag |
| Behavior | Forces DAG to run historical intervals |
| Use Case | Missed runs, reprocessing, fixing failed jobs |
✅ Backfill is manual and selective, unlike catchup which is automatic.
⭐ 3. Manual Run
Definition:
Trigger a DAG run manually at any time, usually for testing or ad-hoc runs.
| Feature | Details |
|---|---|
| Method | Airflow UI → Trigger DAG, or CLI → airflow dags trigger my_dag |
| Behavior | Creates a single DAG run immediately |
| Use Case | Test DAG, ad-hoc execution, debugging |
⭐ Comparison Table
| Feature | Automatic/Manual | Purpose | Example |
|---|---|---|---|
| Catchup | Automatic | Run all missed DAG runs | catchup=True → process Jan 1–5 automatically |
| Backfill | Manual | Run specific historical DAG runs | airflow dags backfill -s Jan1 -e Jan5 |
| Manual Run | Manual | Trigger DAG on demand | Click “Trigger DAG” in UI or CLI command |
⭐ Logical Date vs These Runs (Important!)
- Catchup → generates DAG runs using logical dates for past intervals
- Backfill → same, but for manually specified dates
- Manual Run → logical date can be specified manually or defaults to the current timestamp