Docker

 

🐳 What is Docker?

  • Docker is a platform to package applications into containers.

  • Docker is an open-source platform that enables developers to build, deploy, run, update, and manage applications using containers, which are lightweight, portable, and self-sufficient units that package an application and its dependencies together.

  • A container is a lightweight, portable, isolated environment that includes your app + dependencies + OS libraries.

  • Think of it as “ship your code with everything it needs” so it runs the same anywhere — your laptop, cloud, or production server.

🟢 Docker Benefits (Simple Points)

    1. Consistent environment everywhere

    • Same code runs the same way on any machine (dev, QA, prod).

    2. Lightweight (compared to VMs)

    • Starts in seconds

    • Uses less CPU and RAM

    3. Easy deployment

    • Build once → run anywhere

    • Faster releases

    4. Isolation

    • Each container has its own dependencies

    • No version conflicts

    5. Easy scaling

    • Run multiple containers from one image

    • Good for streaming jobs and ETL parallelism

    6. Multi-service setup using Compose

    • Run Airflow + Postgres + Kafka + Redis + Spark together

    • One command: docker-compose up

    7. Cleaner development

    • No need to install databases, Spark, Kafka manually

    • Everything runs inside containers

    8. Better CI/CD

    • Code + dependencies packaged into one image

    • Consistent builds

    9. Secure

    • Apps isolated from host system

    10. Cloud-native

    • Works with Kubernetes, AWS ECS, EKS, GCP GKE, Azure AKS

    • Industry standard


=============================================================
=============================================================

Containerization vs Virtual Machines

Docker containers share the host operating system's kernel, running isolated applications with just the necessary libraries and binaries. This makes containers lightweight and efficient, using fewer system resources since they do not run a full OS for each instance.

Virtual machines run a full guest operating system on top of a hypervisor, which creates and manages virtualized hardware. Each VM includes a complete OS, consuming significantly more CPU, memory, and storage resources.
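A quick way to see this kernel sharing (assuming a Linux host; Docker Desktop on Mac/Windows runs a small Linux VM, so you would see that VM's kernel instead):

docker run --rm alpine uname -r

The container prints the host's kernel version, because it has no kernel of its own.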



==============================================================
==============================================================

🟢 What is a Docker Image?

  • A Docker Image is a read-only template that contains everything your application needs to run:

    • Code (Python scripts, ETL jobs, DAGs)

    • Libraries / dependencies (pandas, PySpark, boto3, Airflow)

    • OS-level tools and environment variables

  • A Docker image is a read-only, immutable file that contains everything needed to run an application—a packaged bundle including the application code, binaries, libraries, dependencies, and configuration files. It acts as a blueprint for creating Docker containers.

  • Think of it as a blueprint or snapshot of your environment.


🔹 Key Features of a Docker Image

  1. Immutable: Once built, the image doesn’t change.

  2. Versioned: Can tag different versions (my-etl:1.0, my-etl:2.0).

  3. Portable: Can be run anywhere with Docker installed (local, cloud, CI/CD).

  4. Layered: Each command in the Dockerfile creates a new layer, allowing caching and faster builds.
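For point 4, you can actually list an image's layers with docker history (the image tag here is just a placeholder; any built image works):

docker history my-etl-image:1.0

Each row roughly corresponds to one Dockerfile instruction.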


🔹 Analogy

  • Image = Cake Recipe → contains instructions and ingredients.

  • Container = Baked Cake → running instance you can interact with.


🔹 How to Create a Docker Image

Step 1: Create a Dockerfile

# 1. Base image: defines the starting environment
FROM python:3.11-slim

# 2. Set the working directory inside the container; later commands run here
WORKDIR /app

# 3. Copy the dependency file from your machine into the image
COPY requirements.txt .

# 4. Install dependencies (runs while building the image, not when the container runs)
RUN pip install --no-cache-dir -r requirements.txt

# 5. Copy your application code
COPY . .

# 6. Default command to run when the container starts
CMD ["python", "etl_script.py"]

Step 2: Build the image

docker build -t my-etl-image:1.0 .
  • -t my-etl-image:1.0 → gives a name and version tag to the image

  • The image now contains Python + dependencies + your ETL code

Step 3: Verify the image

docker images
  • Lists all images on your machine


🔹 Practical Use Case for Data Engineers

  • ETL pipelines: Package Python / Spark scripts and dependencies → run anywhere

  • Airflow DAGs: Build an image containing DAGs + plugins → use DockerOperator to run tasks

  • Testing pipelines: Share image with team → exact same environment


==============================================================
==============================================================

🟦 Dockerfile (Simple Explanation)

A Dockerfile is a text file containing step-by-step instructions to build a Docker image.

You tell Docker how to create the image:
what OS to use, what packages to install, what code to copy, what command to run.


🟩 Most Important Instructions

Instruction | Meaning
FROM | Base image
WORKDIR | Set working directory
COPY | Copy files into image
RUN | Execute commands during build
CMD | Default command when container runs
ENTRYPOINT | Fixed command; CMD becomes args
EXPOSE | Document port
ENV | Set environment variables
ARG | Build-time variable
VOLUME | Create mount point

🟨 Basic Dockerfile Example

FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "main.py"]

What it does:

  • Uses Python 3.10 base

  • Sets /app as working folder

  • Installs requirements

  • Copies your code

  • Runs main.py by default


🟧 Build & Run Image

Build

docker build -t myapp .

Run

docker run myapp

==============================================================
==============================================================

🟢 What is a Docker Container?

  • A Docker Container is a running instance of a Docker Image.

  • It is isolated, lightweight, and contains everything defined in the image: your code, libraries, and environment.

  • Unlike an image, a container can run, execute, generate logs, and store temporary data.

Analogy:

  • Image = Recipe

  • Container = Cake baked from that recipe


🔹 Key Features of Containers

  1. Ephemeral / Mutable

    • Containers can run, stop, restart, or be deleted.

    • Changes inside a container don’t affect the original image unless you commit it.

  2. Isolated Environment

    • Each container has its own filesystem, processes, and network stack.

    • Prevents conflicts between different projects or dependencies.

  3. Lightweight & Fast

    • Shares the host OS kernel → much faster than a VM.

    • Starts in seconds.

  4. Multiple Instances

    • You can run multiple containers from the same image → efficient resource usage.


🔹 Practical Commands

  1. Run a container

docker run -it --name my-etl-container my-etl-image:1.0
  • -it → interactive terminal

  • --name → container name

  • my-etl-image:1.0 → image to run

  2. List running containers

docker ps
  3. Stop a container

docker stop my-etl-container
  4. Remove a container

docker rm my-etl-container
  5. Run in detached mode (background)

docker run -d my-etl-image:1.0

🔹 Containers in Data Engineering

  • ETL Jobs: Each pipeline can run in a separate container → isolation and reproducibility.

  • Airflow Tasks: DockerOperator spins up a container per task → consistent environment for Python/Spark jobs.

  • Local Testing: Run full pipeline with dependencies (Spark + Postgres + Minio) without affecting host system.

  • Scalable Pipelines: Multiple containers can run simultaneously, useful for batch jobs or streaming tasks.


🔹 Quick Recap

Image

  • Read-only template

  • Created from Dockerfile

  • Example: Python + libs + your ETL script

Container

  • Running instance of an image

  • Can be started/stopped

  • Temporary, isolated environment

Dockerfile

  • Instructions to build an image

Registry

  • Stores images (Docker Hub, AWS ECR, GCP Artifact Registry)



==============================================================
==============================================================

🟢 What is Docker Compose?

  • Docker Compose is a tool for defining and running multi-container Docker applications.

  • Instead of running each container individually, you define all services in a single docker-compose.yml file.

  • With one command, you can start all services, networks, and volumes together.

docker-compose up

🔹 Why Data Engineers Use Docker Compose

  1. Run Airflow scheduler + webserver + database locally.

  2. Test ETL pipelines with Spark, Postgres, Kafka, or Minio (S3) together.

  3. Manage dependencies, networking, and volumes easily.

  4. Create reproducible environments for interviews and portfolio projects.


🔹 Basic Docker Compose Example (Airflow + Postgres)

version: '3.8'

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  airflow-webserver:
    image: apache/airflow:2.7.1-python3.11
    depends_on:
      - postgres
    environment:
      AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./plugins:/opt/airflow/plugins
    ports:
      - "8080:8080"
    command: webserver

volumes:
  postgres_data:

Explanation:

  • Postgres → metadata database for Airflow

  • Airflow-webserver → runs DAGs, connected to Postgres

  • Volumes → persist database and logs

  • Ports → expose Airflow UI locally


🔹 Basic Docker Compose Commands

  1. Build & start services

docker-compose up --build
  2. Run in detached mode (background)

docker-compose up -d
  3. Stop all containers

docker-compose down
  4. View logs

docker-compose logs airflow-webserver
  5. Rebuild after changes

docker-compose up --build

🔹 Advanced Use Cases for Data Engineers

  1. Local ETL testing

    • Spark + Minio (S3) + Kafka + Postgres → run all together.

  2. Airflow development environment

    • Scheduler + Webserver + Worker + Postgres + Redis.

  3. Team collaboration

    • Share docker-compose.yml → everyone runs the same environment.


🔹 Tips

  • Use .env file for sensitive credentials (AWS keys, DB passwords).

  • Use depends_on for proper startup order.

  • Combine Dockerfile + Docker Compose to build custom images and run multi-service pipelines.

  • Use networks to let containers communicate (service_name:port).


==============================================================
==============================================================

Basic Commands

Purpose | Command | Meaning
Check Docker version | docker --version | Verify installation
List images | docker images | Shows all images
List running containers | docker ps | Only active containers
List all containers | docker ps -a | Active + stopped containers
Build image | docker build -t <name> . | Build image from Dockerfile
Run container | docker run <image> | Start container
Run interactive shell | docker run -it <image> bash | Enter container terminal
Run container in background | docker run -d <image> | Detached mode
Assign name to container | docker run --name myapp <image> | Run container with name
Stop container | docker stop <id> | Gracefully stop
Force stop | docker kill <id> | Hard stop
Remove container | docker rm <id> | Delete container
Remove image | docker rmi <image> | Delete image
View container logs | docker logs <id> | Show logs
Execute command inside container | docker exec -it <id> bash | Open shell inside running container
Copy file from container | docker cp <id>:/path/file . | Copy from container to host
Show container stats | docker stats | CPU/RAM usage
Pull image from Docker Hub | docker pull <image> | Download image
Push image to registry | docker push <image> | Upload image
Inspect container details | docker inspect <id> | Low-level info

==============================================================
==============================================================

🟢 What is Docker Networking?

Docker networking allows containers to communicate with:

  • each other

  • the host machine

  • external internet

Each container gets its own virtual network interface + IP address.


🟣 Types of Docker Networks

Below are the most commonly used:
1️⃣ Bridge Network (DEFAULT)

Most Important Type — used in 90% of projects

  • Default network when you run docker run

  • Containers can communicate with each other if they are on the same bridge network

  • Used for multi-container apps

Example:
Postgres container + Python ETL container → talk to each other using service names.

Command:

docker network create mynet
docker run --network=mynet myimage

2️⃣ Host Network (FASTEST)

Container directly uses the host machine’s network.

  • No isolation

  • Fastest performance

  • Suitable for monitoring agents, log shippers, etc.

Command:

docker run --network host myapp

3️⃣ None Network (ISOLATED)

No network at all.

  • Container cannot communicate with anything

  • Used for high-security workloads

Command:

docker run --network none myapp

🟢 How Containers Talk to Each Other

Within same network → Use service name

Example in docker-compose.yml:

services:
  db:
    image: postgres
  app:
    image: python-app

app can connect to db like this:

host = "db"
port = 5432

✔ No need for IP address
✔ Docker handles DNS automatically


🟣 Important Commands

Purpose | Command
List networks | docker network ls
Inspect network details | docker network inspect <network>
Create a network | docker network create mynet
Connect container to network | docker network connect mynet container1
Disconnect | docker network disconnect mynet container1

🟢 Networking in Docker Compose (MOST USED)

services:
  app:
    image: myapp
    networks:
      - mynet
  db:
    image: postgres
    networks:
      - mynet

networks:
  mynet:

Result:

  • app and db talk using: db:5432


==============================================================
==============================================================

🟢 What is Docker Compose?

Docker Compose is a tool that lets you run multiple containers together using one YAML file.

Instead of running individual docker run commands, you define everything in:

docker-compose.yml

Then start all services with one command:

docker-compose up

🟣 Why do we use Docker Compose? (Very Important)

  • Run multiple services together (e.g., Airflow + Postgres + Redis)

  • Handles networking automatically

  • Creates shared volumes

  • Starts containers in the right order

  • Perfect for data engineering pipelines

Docker Compose Architecture

A Compose file has four main parts:

  1. Version → YAML schema version

  2. Services → Containers to run

  3. Volumes → Persistent storage

  4. Networks → Optional custom networks

Example structure:

version: "3.9" services: service1: service2: volumes: vol1: networks: net1:

🟢 Basic Example (docker-compose.yml)

Example for Python app + Postgres DB:

version: '3.8'

services:
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: mydb
    ports:
      - "5432:5432"

  app:
    build: .
    depends_on:
      - db
    ports:
      - "8000:8000"
    environment:
      DATABASE_HOST: db

Highlights

  • Two services: db and app

  • app waits for db (depends_on)

  • Networking is automatic → app connects to db using hostname db


🟣 Most Important Docker Compose Commands



Purpose | Command
Start all services | docker-compose up
Start in background | docker-compose up -d
Stop all services | docker-compose down
View running services | docker-compose ps
View service logs | docker-compose logs app
Rebuild + run | docker-compose up --build
Run a command inside container | docker-compose exec app bash

🟢 Networking in Compose

  • All services automatically join the same network

  • Containers talk using service names

Example:

Host: db
Port: 5432

No need for IP address.


🟣 Volumes in Compose

Used for saving persistent data:

volumes:
  pgdata:

Example:

services:
  db:
    volumes:
      - pgdata:/var/lib/postgresql/data

==============================================================
==============================================================

Docker Components (Very Important for Interviews)

Docker has eight main components:


1️⃣ Docker Client

The interface through which users interact with Docker. Users issue commands like docker run or docker build via the Docker CLI, which translates these into API calls to the Docker daemon.
docker build
docker run
docker pull


2️⃣ Docker Daemon (dockerd)

A background service that runs on the host machine and manages Docker objects such as images, containers, networks, and volumes. It listens for API requests from the Docker client and executes container lifecycle operations like starting, stopping, and monitoring containers.
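You can see this client-server split for yourself: docker version prints a Client section and a Server (Engine/daemon) section.

docker version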


3️⃣ Docker Images

Immutable, read-only templates composed of a series of layers that provide the filesystem and application environment. Images serve as blueprints for creating Docker containers.

Example: python:3.10-slim is an image.


4️⃣ Docker Containers

Running instances of Docker images with a writable layer on top, enabling users to execute applications within isolated environments. Containers are lightweight and start quickly compared to traditional virtual machines.

When you run:

docker run python:3.10-slim


5️⃣ Docker Registry (Docker Hub / ECR / GCR)

A repository or storage system where Docker images are stored and distributed. Docker Hub is the most popular public registry, while private registries can also be used. The Docker daemon pulls images from registries to create containers or pushes locally built images to registries.


6️⃣ Docker Storage / Volumes

A Docker volume is a persistent storage mechanism managed by Docker to store data outside of a container's writable layer. Volumes exist independently of containers, enabling data to persist even after a container is stopped, removed, or recreated.

volumes:
  - dbdata:/var/lib/postgresql/data


7️⃣ Docker Networking

Allows communication between containers and with external networks. Docker provides various networking drivers like bridge, overlay, and host to control connectivity for containers.


8️⃣ Docker Compose

A tool for defining and running multi-container Docker applications using a declarative YAML file.




==============================================================
==============================================================

🔵 What is Docker Caching?

Docker caching means Docker reuses previously built layers instead of rebuilding everything every time.

This makes builds:

  • Faster

  • Cheaper

  • More efficient


🔵 How Docker Caching Works

A Docker image is made of layers.
Each Dockerfile instruction creates one layer.

Example:

FROM python:3.10-slim                  → Layer 1
WORKDIR /app                           → Layer 2
COPY requirements.txt .                → Layer 3
RUN pip install -r requirements.txt    → Layer 4
COPY . .                               → Layer 5
CMD ["python", "main.py"]              → Layer 6

If nothing changes in a layer, Docker reuses it from cache.


🔵 Why Caching Matters (Interview Points)

  • Speeds up builds (5 minutes → 10 seconds)

  • Reduces duplicate work

  • Prevents reinstalling dependencies

  • Saves cloud build costs (GitHub Actions, AWS, GCP)


🔵 What Breaks the Cache?

A cache is invalidated (rebuild happens) if:

  1. The instruction changes (example: change a RUN command)

  2. Any file copied in that layer changes

  3. Any previous layer changes

Example:

If requirements.txt changes, Docker will rebuild:

  • Layer for COPY requirements.txt

  • Layer for RUN pip install

  • All layers after them

But earlier layers (FROM, WORKDIR) are still cached.


🔵 Best Practice: ORDER YOUR DOCKERFILE

To get the maximum caching, put the steps that change least often first.

Bad (slow builds every time):

COPY . .
RUN pip install -r requirements.txt

Good (better caching):

COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

This way:

  • Pip install runs only if requirements.txt changes

  • App code changes won’t break pip install cache


🔵 Cache Example in Real Life

First build:

docker build . → takes 24 minutes

Second build with no code change:

docker build . → 5 seconds

Because all layers are reused.


🔵 Skipping Cache (Forced Rebuild)

Sometimes you want a full rebuild:

docker build --no-cache -t myapp .

🔵 Multi-Stage Build + Caching (Advanced)

Multi-stage builds let you cache dependency installation separately:

# Stage 1: install dependencies (--user is needed so they land in /root/.local)
FROM python:3.10 AS builder
COPY requirements.txt .
RUN pip install --user -r requirements.txt

# Stage 2: copy only the installed packages into a slim image
FROM python:3.10-slim
COPY --from=builder /root/.local /root/.local
COPY . .

This dramatically speeds up builds.


🔥 Short Summary (One Line Answers)

  • Docker caching = reusing previous build layers

  • Each Dockerfile instruction = one layer

  • Layers only rebuild if something changes

  • Correct ordering = fast builds

  • --no-cache disables caching


==============================================================
==============================================================

🟦 Variables in Docker

Docker supports two types of variables:


1. ENV (Environment Variables)

🔹 Available inside the running container
🔹 Used by applications at runtime
🔹 Can be set in Dockerfile, Compose, or at run time

Dockerfile

ENV PORT=8000

docker run

docker run -e PORT=8000 myapp

docker-compose.yml

environment:
  PORT: 8000

📌 Use case:
Database URLs, passwords, app settings.


2. ARG (Build-time Variables)

🔹 Used only during image build
🔹 NOT available inside running container unless passed to ENV
🔹 Must be defined before use

Dockerfile

ARG VERSION=1.0
RUN echo "Building version $VERSION"

Build:

docker build --build-arg VERSION=2.0 .

📌 Use case:
Build metadata, versioning, optional settings.


🟨 ENV vs ARG (Interview Question)

Feature | ARG | ENV
Available at runtime? | ❌ No | ✔ Yes
Available during build? | ✔ Yes | ✔ Yes
Passed using docker run? | ❌ No | ✔ Yes
Stored inside final image? | ❌ No | ✔ Yes

🟩 3. Variables in docker-compose with .env file

You can store environment variables in a file named .env.

.env:

DB_USER=admin
DB_PASS=pass123

docker-compose.yml:

environment:
  - DB_USER=${DB_USER}
  - DB_PASS=${DB_PASS}
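The same file also works for a single container at run time (myapp is a placeholder image name):

docker run --env-file .env myapp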


🟧 4. Using variables inside Dockerfile

Example:

ARG APP_DIR=/app
WORKDIR $APP_DIR
ENV LOG_LEVEL=debug

🟥 5. Why variables are important in Docker?

  • Avoid hardcoding secrets

  • Make Dockerfiles reusable

  • Dynamic config (ports, environment, versions)

  • Different environments: dev, test, prod





=========================================================================
=========================================================================

🟦 Docker Registry — What It Is & Why It Matters

What Is a Docker Registry?

A Docker Registry is a storage + distribution system for Docker images.

A Docker registry is a centralized storage and distribution system for Docker images. It acts as a repository where Docker images—packages containing everything needed to run an application—are stored, managed, versioned, and shared across different environments. 

It is where Docker images are:

  • Stored

  • Versioned

  • Pulled from

  • Pushed to

Similar to GitHub, but for container images instead of code.


🟧 Key Concepts

🟠 1. Registry

The whole server that stores repositories → e.g., Docker Hub, AWS ECR.

🟠 2. Repository

A collection of versions (tags) of an image.
Example:

myapp:latest
myapp:v1
myapp:v2

🟠 3. Image Tag

Label used to version an image.

Example:

python:3.10
node:20-alpine

🟩 Public vs Private Registries

Type | Examples | Features
Public | Docker Hub, GitHub Container Registry | Anyone can pull
Private | AWS ECR, Azure ACR, GCP GCR, Harbor | Secure, enterprise use

🟦 Why Do We Need a Docker Registry?

Because:

  • You build an image locally

  • Push it to a registry

  • Your production server / CI/CD pulls the image and runs it

Without a registry → no easy way to share or deploy images.


🟣 Common Docker Registry Commands

Login

docker login

Tag an Image

docker tag myapp:latest username/myapp:latest

Push to Registry

docker push username/myapp:latest

Pull from Registry

docker pull username/myapp:latest

🟤 Examples of Docker Registries

📌 1. Docker Hub (Most Common)

  • Free public repositories

  • Paid private repos

📌 2. AWS ECR (Enterprise)

  • Most used in production

  • Private registry

  • Integrated with ECS, EKS, Lambda
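A typical ECR push flow looks roughly like this (a sketch; the account ID 123456789012, region, and repo name are placeholder values):

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag myapp:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest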

📌 3. GitHub Container Registry

  • Images stored inside GitHub

  • Good for CI/CD workflows

📌 4. Google GCR / Artifact Registry

📌 5. Self-hosted Registry

Example: Harbor, JFrog Artifactory


🔥 Advanced Concepts (Interview-Level)

🔹 Digest-based pulling

Instead of tag:

docker pull myapp@sha256:abc123...

Guarantees exact version.


🔹 Immutable tags

Some registries enforce that v1 cannot be overwritten.


🔹 Retention Policies

Automatically delete old images in ECR/GCR.


🔹 Scan for vulnerabilities

Registries like:

  • AWS ECR

  • GHCR

  • Docker Hub (Pro)

can scan images for security issues.


=========================================================================
=========================================================================



🔷 Docker Networking 

Docker networking allows containers to communicate:

  • with each other

  • with the host machine

  • with the outside world


🔶 Types of Docker Networks

Docker provides 5 main network types:


🟦 1. Bridge Network (Default)

  • Most commonly used

  • Containers on the same bridge network can talk to each other using container name

Example:

docker network create mynet
docker run -d --name app1 --network=mynet nginx
docker run -d --name app2 --network=mynet alpine ping app1

Use Case:

Local development
Microservices communication


🟩 2. Host Network

Container shares the same network as host.

❌ No isolation
⚡ Fastest network performance
🧠 No port mapping needed

Run:

docker run --network host nginx

Use Case:

  • High-performance applications

  • Network-heavy workloads


🟧 3. None Network

Container has no network.

docker run --network none alpine

Use Case:

Security
Sandbox jobs
Batch processing


🟪 4. Overlay Network (Swarm / Kubernetes)

Used in multi-node swarm clusters.
Allows containers on different machines to communicate.

Use Case:

Distributed apps
Microservices in Docker Swarm
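A minimal sketch (overlay networks need swarm mode, so the host must first become a swarm manager; --attachable lets standalone containers join too):

docker swarm init
docker network create -d overlay --attachable my_overlay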


🟫 5. Macvlan Network

Gives container its own IP address in LAN like a real device.

Use Case:

Legacy systems
Need direct connection to network
Running containers like physical machines
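A hedged example (the subnet, gateway, and parent interface eth0 are assumptions about your LAN):

docker network create -d macvlan \
  --subnet=192.168.1.0/24 \
  --gateway=192.168.1.1 \
  -o parent=eth0 my_macvlan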


🔷 Key Networking Commands

Command | Description
docker network ls | List networks
docker network inspect <name> | Inspect network
docker network create <name> | Create network
docker network rm <name> | Remove network
docker network connect <net> <container> | Add container to network
docker network disconnect <net> <container> | Remove container from network

🔷 How Containers Communicate

🟦 1. Same Bridge Network

✔ Can ping each other by container name
✔ DNS built-in

Example:

ping app1

🟥 2. Different Networks

❌ Cannot communicate
➡ Must connect to the same network


🟩 3. With Host Machine

Host can access container via:

localhost:<mapped-port>

Example:

docker run -p 8080:80 nginx

Access: → http://localhost:8080


🟧 4. Container to Internet

Enabled by default via NAT.


🔶 Port Mapping

If container port = 80
Host port = 8080

docker run -p 8080:80 nginx

👉 Host can access container
👉 “Port forwarding”


🟦 Docker DNS

On the same custom network:

  • Container names act like hostnames

  • Docker automatically manages DNS

curl http://app1:5000

🔥 Real Interview Questions (with short answers)

1. What is Docker Bridge Network?

Default network; containers can communicate using container name.

2. Difference between Port Mapping and Exposing Port?

  • EXPOSE = documentation

  • -p = actual port forwarding

3. How do containers talk to each other?

By joining the same network.

4. What is Host Network?

Shares host’s IP; no port mapping; fastest.

5. What is Overlay Network?

Connects containers across multiple machines in Docker Swarm.



=========================================================================
=========================================================================

🔵 Docker Volumes 

Docker Volumes are the official way to store data outside a container.

Docker volumes are a dedicated, persistent storage mechanism managed by Docker for storing data generated and used by containers. Unlike container writable layers, volumes exist independently of the container lifecycle, meaning data in volumes remains intact even if the container is stopped, removed, or recreated. They reside outside the container filesystem on the host, typically under Docker's control directories, providing efficient I/O and storage management.

Because containers are ephemeral:
→ When container stops/deletes → data is lost
→ Volumes solve that.


🔶 Why Do We Need Docker Volumes?

✔ Containers are temporary
✔ Data must persist
✔ Multiple containers may need same data
✔ Upgrading/Deleting containers should NOT delete data


🟦 Types of Docker Storage

Docker offers 3 types:


1️⃣ Volume (Recommended)

Managed by Docker itself
Stored under:

/var/lib/docker/volumes/

Use Cases:

  • Databases (MySQL, PostgreSQL)

  • Persistent app data

Example:

docker volume create myvol
docker run -v myvol:/data mysql

2️⃣ Bind Mount

Uses host machine's folder.

docker run -v /host/path:/container/path nginx

Use Cases:

  • Local development

  • When you want full control of host path


3️⃣ tmpfs (Linux Only)

Data stored in RAM.

docker run --tmpfs /data redis

Use Cases:

  • Sensitive data

  • Ultra-fast temporary storage


🟩 Volume Commands (Most Important)

Command | Description
docker volume create myvol | Create volume
docker volume ls | List volumes
docker volume inspect myvol | Inspect volume
docker volume rm myvol | Delete volume
docker volume prune | Remove unused volumes

🟧 Using Volumes in Docker Run

Syntax:

docker run -v <volume_name>:<container_path> image

Example:

docker run -d \
  -v dbdata:/var/lib/mysql \
  mysql:8

🟣 Using Bind Mounts

Example:

docker run -d \
  -v /home/user/app:/app \
  node:20

🔵 Volumes in Docker Compose

Very important for real projects.

docker-compose.yml

version: "3.9" services: db: image: mysql volumes: - dbdata:/var/lib/mysql volumes: dbdata:

🔥 Example Use Case (DB Persistence)

If you run:

docker run mysql

Delete container → data gone.

But with volume:

docker run -v dbdata:/var/lib/mysql mysql

Stop container → data still exists (in volume).


🟥 Where Are Volumes Stored?

On Linux:

/var/lib/docker/volumes/<volume-name>/_data

On Windows/Mac → managed internally through Docker Desktop.


🟨 Interview Questions (Short Answers)

1️⃣ What is a Docker Volume?

A persistent storage mechanism managed by Docker.

2️⃣ Difference: Volume vs Bind Mount?

Volume | Bind Mount
Managed by Docker | Controlled by host user
More secure | Direct host access
Best for production | Best for local development

3️⃣ Does deleting container delete volume?

❌ No.
Volumes must be deleted manually.

4️⃣ What happens if volume doesn't exist?

Docker automatically creates it.

5️⃣ Can two containers share one volume?

✔ Yes → used in DB replicas, logs, shared storage.
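A small sketch of two containers sharing one volume (images and paths are illustrative):

docker volume create shared
docker run -d --name writer -v shared:/data alpine sh -c 'echo hello > /data/msg && sleep 300'
docker run --rm -v shared:/data alpine cat /data/msg   # prints: hello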


=========================================================================
=========================================================================


🔵 What is ENTRYPOINT in Docker?

ENTRYPOINT defines the main command that will always run when a container starts.

Think of it as the default executable of the container.


🟦 Why ENTRYPOINT is used?

✔ Makes the container behave like a single-purpose program
✔ Forces a command to always run
✔ Can't be easily overridden (compared to CMD)
✔ Best for production containers


🔶 ENTRYPOINT Syntax

Two forms exist:


1️⃣ Exec Form (Recommended)

ENTRYPOINT ["executable", "param1", "param2"]

✔ Doesn’t use shell
✔ More secure
✔ Handles signals properly


2️⃣ Shell Form

ENTRYPOINT command param1 param2

⚠ Runs inside /bin/sh -c
⚠ Harder to handle signals


🟣 Example ENTRYPOINT Dockerfile

Dockerfile

FROM python:3.10
COPY app.py /
ENTRYPOINT ["python3", "app.py"]

Run:

docker run myapp

This will always run:

python3 app.py

🟩 ENTRYPOINT + CMD (Very Important)

ENTRYPOINT = fixed command
CMD = default arguments

Example:

ENTRYPOINT ["python3", "app.py"] CMD ["--port", "5000"]

Container will run:

python3 app.py --port 5000

You can override CMD:

docker run myapp --port 8000

But ENTRYPOINT cannot be replaced unless you use --entrypoint.


🔥 Override ENTRYPOINT (Rare)

docker run --entrypoint bash myapp

🟥 ENTRYPOINT vs CMD (Very Important Table)

Feature | ENTRYPOINT | CMD
Main purpose | Main command | Default args
Overrides allowed? | ❌ Hard | ✔ Easy
Best use | Permanent command | Arguments
Runs as | Program | Command/Args

🔶 Common Interview Questions

1. Why use ENTRYPOINT instead of CMD?

To ensure the main command always runs and cannot be overridden.

2. What happens if both ENTRYPOINT and CMD exist?

CMD becomes arguments to ENTRYPOINT.

3. How do you override ENTRYPOINT?

Using --entrypoint.


=========================================================================
=========================================================================


🔵 Docker Daemon & Docker Client

Docker works using a client–server architecture.


🟦 1. Docker Daemon (dockerd)

This is the brain of Docker.

✔ What it Does:

  • Runs in the background

  • Manages containers

  • Manages images

  • Manages networks

  • Manages volumes

  • Executes all Docker operations

✔ It Listens On:

  • Unix socket: /var/run/docker.sock

  • Sometimes TCP port (for remote Docker hosts)

✔ Daemon = Server Side


🟩 2. Docker Client (docker)

This is the command-line tool you use.

When you type:

docker ps
docker run nginx

The client DOES NOT run containers.

Instead, it sends API requests to the Docker Daemon, which performs the real operations.

✔ Client = Frontend

✔ Daemon = Backend


🟧 How They Work Together (Simple Flow)

You run:

docker run nginx

Flow:

  1. Client sends request → Daemon

  2. Daemon pulls image

  3. Daemon creates container

  4. Daemon starts container

  5. You see output on terminal


=========================================================================
=========================================================================

🔵 COPY vs ADD in Dockerfile

Both are used to copy files into the image, but COPY is preferred.


🟦 1. COPY (Recommended)

✔ What it does:

Copies local files/folders into the container.

✔ Safe

✔ Predictable

✔ No extra features (simple only)

Example:

COPY app.py /app/app.py

Use COPY when:

  • You want to copy source code

  • You want clean builds

  • You don’t need extraction or downloading


🟧 2. ADD (Avoid unless needed)

✔ What it does:

Does everything COPY does plus two extra features:

Extra Features:

1️⃣ Can download URLs

ADD https://example.com/file.tar.gz /app/

2️⃣ Automatically extracts tar files

ADD app.tar.gz /app/

⚠ Because of these extras → can create security issues

So Docker recommends: use COPY unless ADD is needed.


🟪 COPY vs ADD Table (Interview-Friendly)

Feature | COPY | ADD
Copy local files | ✔ Yes | ✔ Yes
Copy remote URL | ❌ No | ✔ Yes
Auto extract .tar.gz | ❌ No | ✔ Yes
Simpler | ✔ Yes | ❌ No
More secure | ✔ Yes | ❌ No
Recommended? | ✔ Yes | ❌ Use only when required

🟩 When to Use ADD? (Rare)

Use ADD only for:

✔ Auto-unpacking tar files into image

ADD app.tar.gz /app/

✔ Downloading files from a URL

ADD https://example.com/setup.sh /scripts/

Otherwise → COPY is always better.


🟥 🔥 Interview Answer (Short)

COPY is used to copy files/folders into the image and is preferred because it is simpler and more secure.
ADD has extra features like downloading files from URLs and extracting tar archives, so use it only when those features are needed.

=========================================================================
=========================================================================

🔵 What are Multi-Stage Builds?

Multi-stage builds allow you to use multiple FROM statements in a single Dockerfile.

✔ Build in one stage
✔ Copy only the required output into the final stage
✔ Final image becomes much smaller
✔ No build dependencies inside final image


🟦 Why Multi-Stage Builds Are Needed?

Problem (without multi-stage):

  • Build tools (Maven, Go compiler, Node modules, pip, etc.) stay inside the final image

  • Makes image heavy

  • Security issues

  • Slow deployment

Multi-stage solution:

  • Build tools exist only in the build stage

  • Final stage contains just the application

  • Clean, lightweight image


🟩 Simple Example – Python / Node / Java / Go (All follow same logic)

Here is a general multi-stage pattern:

# ----- Stage 1: Build -----
FROM node:20 AS builder
WORKDIR /app
COPY package*.json .
RUN npm install
COPY . .
RUN npm run build

# ----- Stage 2: Final Image -----
FROM nginx:alpine
COPY --from=builder /app/dist /usr/share/nginx/html

What happens?

  • Node image builds the app

  • Only the final compiled output is copied to nginx

  • Result = super small production image


🔶 Another Example – Python App

# Stage 1: Build dependencies
FROM python:3.10 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt

# Stage 2: Clean final image
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
CMD ["python", "app.py"]

🔷 Another Example – Java (Very Popular)

# Stage 1: Build JAR
FROM maven:3.9 AS builder
WORKDIR /app
COPY pom.xml .
COPY src ./src
RUN mvn package -DskipTests

# Stage 2: Run JAR
FROM openjdk:17-jdk-slim
COPY --from=builder /app/target/myapp.jar /myapp.jar
CMD ["java", "-jar", "/myapp.jar"]

✔ No Maven in final image
✔ Final image is tiny


🟧 Key Features of Multi-Stage Builds

✔ Multiple FROM instructions

Each FROM = new stage

✔ You can name stages

FROM golang:1.20 AS builder

✔ Copy artifacts from stage to stage

COPY --from=builder /app/bin /bin

✔ Final image only contains last stage

All previous stages = removed
Image is clean + small


🟪 Benefits (Interview Ready)

Benefit | Explanation
✔ Smaller images | No build tools in final image
✔ Faster builds | Layer caching for each stage
✔ Better security | No compilers / secrets left behind
✔ Cleaner Dockerfiles | Each stage has a clear job
✔ Reproducible builds | Same environment every time

=========================================================================
=========================================================================

🔵 What is .dockerignore?

.dockerignore is a file that tells Docker which files/folders to EXCLUDE when building an image.

It works similar to .gitignore.


🟦 Why do we use .dockerignore?

✔ Faster Docker builds

(Removes unnecessary files → smaller build context)

✔ Smaller images

(Don’t copy unwanted files)

✔ Better security

(Keep secrets, logs, configs out of image)

✔ Cleaner caching

(Prevents rebuilds when irrelevant files change)


🟩 Common Items in .dockerignore

node_modules/
__pycache__/
*.pyc
*.log
.env
.env.*
.git
.gitignore
Dockerfile
docker-compose.yml
.vscode/
.idea/
dist/
build/
*.zip
*.tar.gz

🟧 How it works?

When you run:

docker build -t myapp .

Docker first sends the “build context” (your current directory) to the daemon.
Without a .dockerignore, everything in it is sent.

.dockerignore tells Docker:
🚫 Don’t send these files to the build context.


🟪 Example

.dockerignore

*.log
*.env
secret.txt
cache/

Dockerfile

COPY . /app

Only the files not excluded by .dockerignore are copied.


🟥 Performance Impact (Very Important)

Without .dockerignore:

  • Docker copies huge directories (node_modules, logs)

  • Slow build

  • Cache invalidates unnecessarily

With .dockerignore:

  • Build context is very small

  • Build is faster

  • Cache stays valid → faster incremental builds


🟨 Interview Questions (Short Answers)

1. What is the purpose of .dockerignore?

To exclude unnecessary files from the Docker build context.

2. What happens if .dockerignore is missing?

Docker sends all files to the build context → slow builds, large images.

3. Does .dockerignore reduce image size?

Yes, because it prevents unnecessary files from being copied.

4. Does .dockerignore improve caching?

Yes → fewer files → fewer cache invalidations.

5. Is .dockerignore mandatory?

No, but highly recommended.




=========================================================================
=========================================================================

🔵 Docker Container Lifecycle (Step-by-Step)

A Docker container goes through the following major stages:

Created → Running → Paused → Unpaused → Stopped → Restarted → Removed

🟦 1. Created

The container is created from an image but not started yet.

Command:

docker create image_name

🟩 2. Running

Container is active and executing processes.

Command:

docker start container
# or
docker run image_name

docker run = create + start


🟧 3. Paused

All processes inside the container are temporarily frozen.

Command:

docker pause container

🟪 4. Unpaused

Resumes the paused container.

Command:

docker unpause container

🟥 5. Stopped / Exited

Container stops running its main process (app has exited or manually stopped).

Command:

docker stop container

🟨 6. Restarted

Container is stopped and then started again.

Command:

docker restart container


🟫 7. Removed (Deleted)

The container is permanently removed from Docker.

Command:

docker rm container

You cannot remove a running container; stop it first (or force-remove with docker rm -f).


📌 Lifecycle Diagram (Simple)

created → running → stopped → removed
             ↑ ↓ pause / unpause
            paused

restart = stopped → running


📘 Useful Lifecycle Commands

Action | Command Example
Create | docker create nginx
Run (create+start) | docker run nginx
Start | docker start cont_id
Stop | docker stop cont_id
Pause | docker pause cont_id
Unpause | docker unpause cont_id
Restart | docker restart cont_id
Remove | docker rm cont_id
Remove all | docker rm $(docker ps -aq)



=========================================================================
=========================================================================

🔵 What is a Docker HEALTHCHECK?

A HEALTHCHECK is a way to tell Docker how to test whether a container is healthy.
Docker runs this command periodically and updates the container's status:

  • healthy

  • unhealthy

  • starting

It helps in:

  • auto-restarts

  • load balancers

  • orchestrators (Kubernetes, ECS, Swarm)


🟦 Syntax (Dockerfile)

HEALTHCHECK [OPTIONS] CMD <command>

# or disable health checks entirely
HEALTHCHECK NONE

🟩 Options

Option | Meaning
--interval=30s | Check frequency
--timeout=3s | How long to wait before failing
--start-period=5s | Grace period before checks start
--retries=3 | Fail after X failed attempts


🟧 Example 1: Simple HTTP Healthcheck

FROM nginx
HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost/ || exit 1
  • If curl -f works → healthy

  • If fails → unhealthy


🟪 Example 2: Healthcheck Script

FROM python:3.9
COPY health.sh /usr/local/bin/health.sh
RUN chmod +x /usr/local/bin/health.sh
HEALTHCHECK --interval=10s --timeout=2s \
  CMD ["sh", "/usr/local/bin/health.sh"]

health.sh:

#!/bin/sh
if curl -f http://localhost:5000/health > /dev/null; then
  exit 0
else
  exit 1
fi
🟥 How to Check Health Status

List containers with health status:

docker ps

Detailed inspection:

docker inspect container_id

You will see:

"Health": { "Status": "healthy", "FailingStreak": 0,

"Log": [...]
}
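
To pull out just the health status (handy in scripts; works only for containers that define a HEALTHCHECK):

docker inspect -f '{{ .State.Health.Status }}' container_id
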
🟨 What Docker Does with Healthchecks

Status | Meaning
starting | Startup period (start-period)
healthy | App is functioning
unhealthy | Check failed repeatedly

Note that restart policies react to the main process exiting, not to the health status itself:

docker run --health-retries=3 --restart=always ...

Plain Docker will not restart a container just because it is unhealthy; orchestrators (Swarm, Kubernetes, ECS) use the health status to replace or reschedule containers.


📌 Important Notes

  • HEALTHCHECK runs inside the container.

  • Should be lightweight (avoid heavy scripts).

  • Uses exit codes:

    • 0 = success (healthy)

    • 1 = unhealthy

    • 2 = reserved



=========================================================================
=========================================================================

🔵 What is docker inspect?

docker inspect is used to view detailed information about Docker containers, images, networks, or volumes in JSON format.

It shows everything about a container:

  • Network info

  • Mounts / volumes

  • IP address

  • Ports

  • Environment variables

  • Health status

  • Entry point, CMD

  • Resource usage config

  • Labels

  • Container state (running, stopped, etc.)

This is the most powerful debugging command.


🟦 Basic Command

docker inspect <container_id_or_name>

🟩 Example Output (Simplified)

You will see JSON fields like:

{ "Id": "d2f1...", "State": { "Status": "running", "Health": { "Status": "healthy" } }, "Config": { "Env": ["APP_ENV=prod", "PORT=8080"], "Cmd": ["python", "app.py"] }, "NetworkSettings": { "IPAddress": "172.17.0.2" }, "Mounts": [ { "Source": "/data", "Destination": "/var/lib/data" } ] }

🔧 Most Useful Inspect Filters (Important!)

📍 1. Get container IP address

docker inspect -f '{{ .NetworkSettings.IPAddress }}' <container>

📍 2. Get just the environment variables

docker inspect -f '{{ .Config.Env }}' <container>

📍 3. Get container’s running status

docker inspect -f '{{ .State.Status }}' <container>

📍 4. Get container entrypoint

docker inspect -f '{{ .Config.Entrypoint }}' <container>

📍 5. Get exposed ports

docker inspect -f '{{ .NetworkSettings.Ports }}' <container>

🟧 Inspecting Images

docker inspect <image_name>

Useful to see:

  • layers

  • build parameters

  • environment variables

  • entrypoint


🟪 Inspecting Networks

docker inspect <network_name>

You can find:

  • connected containers

  • IP ranges (subnet)

  • gateway

  • driver type


🟫 Inspecting Volumes

docker inspect <volume_name>

Shows:

  • mount point

  • driver

  • usage


Real Use Cases (Important for Interviews)

Use Case | What to Check
Debug network issues | Get IP, ports
Debug ENV variables | extract .Config.Env
Verify mounted volumes | check .Mounts
Check health status | check .State.Health.Status
Know why a container exited | check .State.ExitCode
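
For example, for the last row:

docker inspect -f '{{ .State.ExitCode }}' <container>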


🟩 Check Container Logs (Related Command)

docker logs <container>



=========================================================================
=========================================================================

🔵 What is Port Mapping in Docker?

Port mapping connects a container’s internal port to a port on your host machine so that applications inside the container can be accessed from outside.

Example:
A container running a web server on port 80 → accessible on host via port 8080

host:8080 → container:80

This is called port forwarding.


🟦 Syntax

docker run -p <host_port>:<container_port> image_name

Example:

docker run -p 8080:80 nginx

Meaning: host port 8080 forwards to container port 80, so nginx is reachable at http://localhost:8080.


🟩 Types of Port Mapping

1. Host → Container (most common)

-p 5000:5000

2. Bind to specific IP (e.g., localhost only)

-p 127.0.0.1:8080:80

Meaning:
Only local machine can access it.

3. Automatic host port assignment

-P

Docker assigns random free ports.
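
To see which host ports -P picked (web is a placeholder container name):

docker run -d -P --name web nginx
docker port web
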
🟧 Check Mapped Ports

docker ps

You will see:

0.0.0.0:8080->80/tcp

🟪 Why Port Mapping Is Needed (Interview Points)

  • Containers run in isolated networks

  • Container ports aren’t accessible from host by default

  • Port mapping exposes them

  • Allows multiple instances to run on different host ports

  • Helps in local development and testing


🟫 Real Examples

1️⃣ Expose Postgres

docker run -p 5432:5432 postgres

2️⃣ Expose Airflow Webserver

docker run -p 8080:8080 apache/airflow

3️⃣ Expose FastAPI on 8000

docker run -p 8000:8000 myapp


🔥 Port Mapping in Docker Compose

services:
  web:
    image: nginx
    ports:
      - "8080:80"

Same meaning: host 8080 → container 80


🎯 One-Line Interview Summary

Port mapping (-p host:container) allows access to applications running inside a container by exposing container ports to the host machine.



=========================================================================
=========================================================================


