Docker
🐳 What is Docker?
- Docker is an open-source platform that enables developers to build, deploy, run, update, and manage applications using containers: lightweight, portable, self-sufficient units that package an application and its dependencies together.
- A container is a lightweight, portable, isolated environment that includes your app + dependencies + OS libraries.
- Think of it as "ship your code with everything it needs" so it runs the same anywhere: your laptop, the cloud, or a production server.
🟢 Docker Benefits (Simple Points)
1. Consistent environment everywhere
   - Same code runs the same way on any machine (dev, QA, prod).
2. Lightweight (compared to VMs)
   - Starts in seconds
   - Uses less CPU and RAM
3. Easy deployment
   - Build once → run anywhere
   - Faster releases
4. Isolation
   - Each container has its own dependencies
   - No version conflicts
5. Easy scaling
   - Run multiple containers from one image
   - Good for streaming jobs and ETL parallelism
6. Multi-service setup using Compose
   - Run Airflow + Postgres + Kafka + Redis + Spark together
   - One command: `docker-compose up`
7. Cleaner development
   - No need to install databases, Spark, or Kafka manually
   - Everything runs inside containers
8. Better CI/CD
   - Code + dependencies packaged into one image
   - Consistent builds
9. Secure
   - Apps isolated from the host system
10. Cloud-native
    - Works with Kubernetes, AWS ECS/EKS, GCP GKE, Azure AKS
    - Industry standard
Containerization vs Virtual Machines: containers share the host OS kernel, so they start in seconds and use little CPU/RAM, while each VM boots a full guest OS, making it heavier and slower to start.
🟢 What is a Docker Image?
- A Docker Image is a read-only template that contains everything your application needs to run:
  - Code (Python scripts, ETL jobs, DAGs)
  - Libraries / dependencies (pandas, PySpark, boto3, Airflow)
  - OS-level tools and environment variables
- In other words, an image is an immutable, packaged bundle of application code, binaries, libraries, dependencies, and configuration files. It acts as a blueprint for creating Docker containers.
- Think of it as a blueprint or snapshot of your environment.
🔹 Key Features of a Docker Image
- Immutable: Once built, the image doesn't change.
- Versioned: You can tag different versions (`my-etl:1.0`, `my-etl:2.0`).
- Portable: Runs anywhere Docker is installed (local, cloud, CI/CD).
- Layered: Each instruction in the Dockerfile creates a new layer, allowing caching and faster builds.
🔹 Analogy
- Image = Cake Recipe → contains instructions and ingredients.
- Container = Baked Cake → a running instance you can interact with.
🔹 How to Create a Docker Image
Step 1: Create a Dockerfile
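A minimal sketch of such a Dockerfile, assuming a Python ETL script named `etl.py` (file names illustrative):

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "etl.py"]
```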
Step 2: Build the image
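Using the tag described just below:

```bash
docker build -t my-etl-image:1.0 .
```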
- `-t my-etl-image:1.0` → gives a name and version tag to the image
- The image now contains Python + dependencies + your ETL code
Step 3: Verify the image
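```bash
docker images
```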
- Lists all images on your machine
🔹 Practical Use Case for Data Engineers
- ETL pipelines: Package Python / Spark scripts and dependencies → run anywhere
- Airflow DAGs: Build an image containing DAGs + plugins → use DockerOperator to run tasks
- Testing pipelines: Share the image with your team → exact same environment
🟦 Dockerfile (Simple Explanation)
A Dockerfile is a text file containing step-by-step instructions to build a Docker image.
You tell Docker how to create the image:
what OS to use, what packages to install, what code to copy, what command to run.
🟩 Most Important Instructions
| Instruction | Meaning |
|---|---|
| FROM | Base image |
| WORKDIR | Set working directory |
| COPY | Copy files into image |
| RUN | Execute commands during build |
| CMD | Default command when container runs |
| ENTRYPOINT | Fixed command; CMD becomes args |
| EXPOSE | Document port |
| ENV | Set environment variables |
| ARG | Build-time variable |
| VOLUME | Create mount point |
🟨 Basic Dockerfile Example
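A minimal example matching the description below (`main.py` and `requirements.txt` are illustrative names):

```dockerfile
FROM python:3.10
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "main.py"]
```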
What it does:
- Uses Python 3.10 base
- Sets `/app` as the working folder
- Installs requirements
- Copies your code
- Runs `main.py` by default
🟧 Build & Run Image
Build
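```bash
docker build -t my-app:1.0 .
```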
Run
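```bash
docker run my-app:1.0
```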
🟢 What is a Docker Container?
- A Docker Container is a running instance of a Docker Image.
- It is isolated, lightweight, and contains everything defined in the image: your code, libraries, and environment.
- Unlike an image, a container can run, execute, generate logs, and store temporary data.
Analogy:
Image = Recipe
Container = Cake baked from that recipe
🔹 Key Features of Containers
1. Ephemeral / Mutable
   - Containers can run, stop, restart, or be deleted.
   - Changes inside a container don't affect the original image unless you commit it.
2. Isolated Environment
   - Each container has its own filesystem, processes, and network stack.
   - Prevents conflicts between different projects or dependencies.
3. Lightweight & Fast
   - Shares the host OS kernel → much faster than a VM.
   - Starts in seconds.
4. Multiple Instances
   - You can run multiple containers from the same image → efficient resource usage.
🔹 Practical Commands
- Run a container (see the command sketch below)
  - `-it` → interactive terminal
  - `--name` → container name
  - `my-etl-image:1.0` → image to run
- List running containers
- Stop a container
- Remove a container
- Run in detached mode (background)
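A sketch of those commands (container name `my-etl` is illustrative):

```bash
docker run -it --name my-etl my-etl-image:1.0   # run interactively with a name
docker ps                                       # list running containers
docker stop my-etl                              # stop a container
docker rm my-etl                                # remove a container
docker run -d --name my-etl my-etl-image:1.0    # run in detached mode (background)
```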
🔹 Containers in Data Engineering
- ETL Jobs: Each pipeline can run in a separate container → isolation and reproducibility.
- Airflow Tasks: DockerOperator spins up a container per task → consistent environment for Python/Spark jobs.
- Local Testing: Run the full pipeline with dependencies (Spark + Postgres + Minio) without affecting the host system.
- Scalable Pipelines: Multiple containers can run simultaneously, useful for batch jobs or streaming tasks.
Image
- Read-only template
- Created from a Dockerfile
- Example: Python + libs + your ETL script

Container
- Running instance of an image
- Can be started/stopped
- Temporary, isolated environment

Dockerfile
- Instructions to build an image

Registry
- Stores images (Docker Hub, AWS ECR, GCP Artifact Registry)
🟢 What is Docker Compose?
- Docker Compose is a tool for defining and running multi-container Docker applications.
- Instead of running each container individually, you define all services in a single `docker-compose.yml` file.
- With one command, you can start all services, networks, and volumes together.
🔹 Why Data Engineers Use Docker Compose
- Run the Airflow scheduler + webserver + database locally.
- Test ETL pipelines with Spark, Postgres, Kafka, or Minio (S3) together.
- Manage dependencies, networking, and volumes easily.
- Create reproducible environments for interviews and portfolio projects.
🔹 Basic Docker Compose Example (Airflow + Postgres)
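A minimal sketch of such a file (image tags, credentials, and the Airflow connection variable are illustrative):

```yaml
version: "3.8"
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - pgdata:/var/lib/postgresql/data

  airflow-webserver:
    image: apache/airflow:2.7.3
    command: webserver
    depends_on:
      - postgres
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    ports:
      - "8080:8080"
    volumes:
      - ./dags:/opt/airflow/dags
      - airflow-logs:/opt/airflow/logs

volumes:
  pgdata:
  airflow-logs:
```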
Explanation:
- postgres → metadata database for Airflow
- airflow-webserver → runs DAGs, connected to Postgres
- Volumes → persist the database and logs
- Ports → expose the Airflow UI locally
🔹 Basic Docker Compose Commands
The command for each task is sketched below:
- Build & start services
- Run in detached mode (background)
- Stop all containers
- View logs
- Rebuild after changes
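```bash
docker-compose up            # build & start services
docker-compose up -d         # run in detached mode
docker-compose down          # stop and remove containers
docker-compose logs          # view logs
docker-compose up --build    # rebuild after changes
```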
🔹 Advanced Use Cases for Data Engineers
1. Local ETL testing
   - Spark + Minio (S3) + Kafka + Postgres → run all together.
2. Airflow development environment
   - Scheduler + Webserver + Worker + Postgres + Redis.
3. Team collaboration
   - Share `docker-compose.yml` → everyone runs the same environment.
🔹 Tips
- Use a `.env` file for sensitive credentials (AWS keys, DB passwords).
- Use `depends_on` for proper startup order.
- Combine a Dockerfile + Docker Compose to build custom images and run multi-service pipelines.
- Use networks to let containers communicate (`service_name:port`).
Basic Commands
| Purpose | Command | Meaning |
|---|---|---|
| Check Docker version | docker --version | Verify installation |
| List images | docker images | Shows all images |
| List running containers | docker ps | Only active containers |
| List all containers | docker ps -a | Active + stopped containers |
| Build image | docker build -t <name> . | Build image from Dockerfile |
| Run container | docker run <image> | Start container |
| Run interactive shell | docker run -it <image> bash | Enter container terminal |
| Run container in background | docker run -d <image> | Detached mode |
| Assign name to container | docker run --name myapp <image> | Run container with name |
| Stop container | docker stop <id> | Gracefully stop |
| Force stop | docker kill <id> | Hard stop |
| Remove container | docker rm <id> | Delete container |
| Remove image | docker rmi <image> | Delete image |
| View container logs | docker logs <id> | Show logs |
| Execute command inside container | docker exec -it <id> bash | Open shell inside running container |
| Copy file from container | docker cp <id>:/path/file . | Copy from container to host |
| Show container stats | docker stats | CPU/RAM usage |
| Pull image from Docker Hub | docker pull <image> | Download image |
| Push image to registry | docker push <image> | Upload image |
| Inspect container details | docker inspect <id> | Low-level info |
🟢 What is Docker Networking?
Docker networking allows containers to communicate with:
- each other
- the host machine
- the external internet

Each container gets its own virtual network interface + IP address.
🟣 Types of Docker Networks
Below are the most commonly used:
1️⃣ Bridge Network (DEFAULT)
Most Important Type — used in 90% of projects
- Default network when you run `docker run`
- Containers can communicate with each other if they are on the same bridge network
- Used for multi-container apps

Example: a Postgres container + a Python ETL container talk to each other using container/service names.
Command:
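A sketch using a user-defined bridge network (names illustrative):

```bash
docker network create my-bridge
docker run -d --name db --network my-bridge postgres:15
docker run -d --name etl --network my-bridge my-etl-image:1.0
# "etl" can now reach the database at hostname "db"
```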
2️⃣ Host Network (FASTEST)
Container directly uses the host machine’s network.
- No network isolation
- Fastest performance
- Suitable for monitoring agents, log shippers, etc.
Command:
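For example (host networking is fully supported on Linux hosts):

```bash
docker run -d --network host nginx
```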
3️⃣ None Network (ISOLATED)
No network at all.
- The container cannot communicate with anything
- Used for high-security workloads
Command:
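```bash
docker run --network none my-etl-image:1.0
```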
🟢 How Containers Talk to Each Other
Within same network → Use service name
Example in docker-compose.yml:
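A minimal sketch (image names illustrative):

```yaml
services:
  db:
    image: postgres:15
  app:
    image: my-etl-image:1.0
    depends_on:
      - db
```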
app can connect to db like this:
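Using the service name as the hostname (credentials illustrative):

```
postgresql://user:password@db:5432/mydb
```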
✔ No need for IP address
✔ Docker handles DNS automatically
🟣 Important Commands
| Purpose | Command |
|---|---|
| List networks | docker network ls |
| Inspect network details | docker network inspect <network> |
| Create a network | docker network create mynet |
| Connect container to network | docker network connect mynet container1 |
| Disconnect | docker network disconnect mynet container1 |
🟢 Networking in Docker Compose (MOST USED)
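A minimal sketch with an explicit custom network:

```yaml
services:
  db:
    image: postgres:15
    networks: [backend]
  app:
    build: .
    networks: [backend]
networks:
  backend:
```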
Result:
- `app` and `db` talk using `db:5432`
🟢 What is Docker Compose?
Docker Compose is a tool that lets you run multiple containers together using one YAML file.
Instead of running individual `docker run` commands, you define everything in `docker-compose.yml`, then start all services with one command: `docker-compose up`.
🟣 Why do we use Docker Compose? (Very Important)
- Run multiple services together (e.g., Airflow + Postgres + Redis)
- Handles networking automatically
- Creates shared volumes
- Starts containers in the right order
- Perfect for data engineering pipelines
Docker Compose Architecture
A Compose file has three main parts, plus optional networks:
- Version → YAML schema version
- Services → containers to run
- Volumes → persistent storage
- Networks → optional custom networks
Example structure:
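A skeleton of that structure (names illustrative):

```yaml
version: "3.8"          # YAML schema version
services:               # containers to run
  app:
    image: python:3.10-slim
volumes:                # persistent storage
  appdata:
networks:               # optional custom networks
  backend:
```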
🟢 Basic Example (docker-compose.yml)
Example for Python app + Postgres DB:
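A minimal sketch (image tags and credentials illustrative):

```yaml
version: "3.8"
services:
  db:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: example
    volumes:
      - dbdata:/var/lib/postgresql/data
  app:
    build: .
    depends_on:
      - db
    environment:
      DATABASE_URL: postgresql://postgres:example@db:5432/postgres
volumes:
  dbdata:
```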
Highlights
- Two services: `db` and `app`
- `app` waits for `db` (`depends_on`)
- Networking is automatic → `app` connects to `db` using hostname `db`
🟣 Most Important Docker Compose Commands
| Purpose | Command |
|---|---|
| Start all services | docker-compose up |
| Start in background | docker-compose up -d |
| Stop all services | docker-compose down |
| View running services | docker-compose ps |
| View service logs | docker-compose logs app |
| Rebuild + run | docker-compose up --build |
| Run a command inside container | docker-compose exec app bash |
🟢 Networking in Compose
- All services automatically join the same network
- Containers talk using service names
Example:
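For instance, from inside the `app` container:

```bash
ping db    # the service name resolves via Docker's built-in DNS
```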
No need for an IP address.
🟣 Volumes in Compose
Used for saving persistent data:
Example:
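A sketch persisting Postgres data:

```yaml
services:
  db:
    image: postgres:15
    volumes:
      - dbdata:/var/lib/postgresql/data
volumes:
  dbdata:
```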
✅ Docker Components (Very Important for Interviews)
Docker has 7 main components:
1️⃣ Docker Client
The interface through which users interact with Docker. Users issue commands like `docker build`, `docker run`, or `docker pull` via the Docker CLI, which translates them into API calls to the Docker daemon.
2️⃣ Docker Daemon (`dockerd`)
A background service that runs on the host machine and manages Docker objects such as images, containers, networks, and volumes. It listens for API requests from the Docker client and executes container lifecycle operations like starting, stopping, and monitoring containers.
3️⃣ Docker Images
Read-only templates that containers are created from; for example, `python:3.10-slim` is an image.
4️⃣ Docker Containers
Running instances of images. When you run `docker run python:3.10-slim`, the daemon creates and starts a container from that image.
5️⃣ Docker Registry (Docker Hub / ECR / GCR)
Stores, versions, and distributes images so they can be pushed from one environment and pulled in another.
6️⃣ Docker Storage / Volumes
Persistent storage managed by Docker, e.g. `volumes: - dbdata:/var/lib/postgresql/data` in a Compose file.
7️⃣ Docker Networking
Lets containers communicate with each other, the host machine, and the outside world.
🔵 What is Docker Caching?
Docker caching means Docker reuses previously built layers instead of rebuilding everything every time.
This makes builds:
- Faster
- Cheaper
- More efficient
🔵 How Docker Caching Works
A Docker image is made of layers.
Each Dockerfile instruction creates one layer.
Example:
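A sketch, one layer per instruction:

```dockerfile
FROM python:3.10-slim                  # layer 1: base image
WORKDIR /app                           # layer 2
COPY requirements.txt .                # layer 3
RUN pip install -r requirements.txt    # layer 4
COPY . .                               # layer 5: app code
```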
If nothing changes in a layer, Docker reuses it from cache.
🔵 Why Caching Matters (Interview Points)
- Speeds up builds (5 minutes → 10 seconds)
- Reduces duplicate work
- Prevents reinstalling dependencies
- Saves cloud build costs (GitHub Actions, AWS, GCP)
🔵 What Breaks the Cache?
A cache is invalidated (rebuild happens) if:
- The instruction changes (example: you change a RUN command)
- Any file copied in that layer changes
- Any previous layer changes
Example: using the Dockerfile above, if `requirements.txt` changes, Docker will rebuild:
- the layer for `COPY requirements.txt`
- the layer for `RUN pip install`
- all layers after them

But earlier layers (`FROM`, `WORKDIR`) are still cached.
🔵 Best Practice: ORDER YOUR DOCKERFILE
To get the maximum caching, put the steps that change least often first.
❌ Bad (slow builds every time):
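A sketch of the anti-pattern (copying all code before installing dependencies):

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .                               # any code change invalidates this layer...
RUN pip install -r requirements.txt    # ...so dependencies reinstall every build
CMD ["python", "main.py"]
```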
✔ Good (better caching):
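The same image with cache-friendly ordering:

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .                # changes rarely
RUN pip install -r requirements.txt    # cached unless requirements.txt changes
COPY . .                               # code changes only invalidate this layer
CMD ["python", "main.py"]
```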
This way:
- `pip install` runs only if `requirements.txt` changes
- App code changes won't break the pip install cache
🔵 Cache Example in Real Life
First build: every layer is built from scratch, so it can take minutes.
Second build with no code change: finishes in seconds, with each step reported as cached, because all layers are reused.
🔵 Skipping Cache (Forced Rebuild)
Sometimes you want a full rebuild:
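Pass `--no-cache` to ignore all cached layers:

```bash
docker build --no-cache -t my-app .
```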
🔵 Multi-Stage Build + Caching (Advanced)
Multi-stage builds let you cache dependency installation separately:
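A sketch of the pattern, installing dependencies into a prefix and copying only the result forward:

```dockerfile
# Stage 1: install dependencies
FROM python:3.10-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# Stage 2: slim final image, dependencies copied from the builder
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
CMD ["python", "main.py"]
```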
This dramatically speeds up builds.
🔥 Short Summary (One Line Answers)
- Docker caching = reusing previous build layers
- Each Dockerfile instruction = one layer
- Layers only rebuild if something changes
- Correct ordering = fast builds
- `--no-cache` disables caching
🟦 Variables in Docker
Docker supports two types of variables:
✅ 1. ENV (Environment Variables)
🔹 Available inside the running container
🔹 Used by applications at runtime
🔹 Can be set in Dockerfile, Compose, or at run time
Dockerfile
docker run
docker-compose.yml
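Sketches of all three (variable names and values illustrative):

```dockerfile
# Dockerfile
ENV DB_HOST=db
```

```bash
# docker run
docker run -e DB_HOST=db my-app
```

```yaml
# docker-compose.yml
services:
  app:
    environment:
      - DB_HOST=db
```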
📌 Use case:
Database URLs, passwords, app settings.
✅ 2. ARG (Build-time Variables)
🔹 Used only during image build
🔹 NOT available inside running container unless passed to ENV
🔹 Must be defined before use
Dockerfile
Build:
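A sketch (the `APP_VERSION` arg is illustrative):

```dockerfile
# Dockerfile
ARG APP_VERSION=1.0
RUN echo "Building version $APP_VERSION"
```

```bash
# Build:
docker build --build-arg APP_VERSION=2.0 -t my-app .
```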
📌 Use case:
Build metadata, versioning, optional settings.
🟨 ENV vs ARG (Interview Question)
| Feature | ARG | ENV |
|---|---|---|
| Available at runtime? | ❌ No | ✔ Yes |
| Available during build? | ✔ Yes | ✔ Yes |
| Passed using docker run? | ❌ No | ✔ Yes |
| Stored inside final image? | ❌ No | ✔ Yes |
🟩 3. Variables in docker-compose with .env file
You can store environment variables in a file named .env.
.env:
docker-compose.yml:
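A sketch (password illustrative):

```bash
# .env
POSTGRES_PASSWORD=secret123
```

```yaml
# docker-compose.yml
services:
  db:
    image: postgres:15
    environment:
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
```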
🟧 4. Using variables inside Dockerfile
Example:
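A sketch combining ARG and ENV:

```dockerfile
ARG PY_VERSION=3.10
FROM python:${PY_VERSION}-slim
ENV APP_ENV=production
RUN echo "Environment: $APP_ENV"
```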
🟥 5. Why variables are important in Docker?
- Avoid hardcoding secrets
- Make Dockerfiles reusable
- Dynamic config (ports, environment, versions)
- Different environments: dev, test, prod
🟦 Docker Registry — What It Is & Why It Matters
✅ What Is a Docker Registry?
A Docker Registry is a centralized storage + distribution system for Docker images: a repository where images are stored, managed, versioned, and shared across environments.
It is where Docker images are:
- Stored
- Versioned
- Pulled from
- Pushed to

Similar to GitHub, but for container images instead of code.
🟧 Key Concepts
🟠 1. Registry
The whole server that stores repositories → e.g., Docker Hub, AWS ECR.
🟠 2. Repository
A collection of versions (tags) of an image.
Example: `my-etl`, holding tags like `my-etl:1.0` and `my-etl:2.0`.
🟠 3. Image Tag
Label used to version an image.
Example: `my-etl:1.0`, `my-etl:latest`.
🟩 Public vs Private Registries
| Type | Examples | Features |
|---|---|---|
| Public | Docker Hub, GitHub Container Registry | Anyone can pull |
| Private | AWS ECR, Azure ACR, GCP GCR, Harbor | Secure, enterprise use |
🟦 Why Do We Need a Docker Registry?
Because:
- You build an image locally
- Push it to a registry
- Your production server / CI/CD pulls the image and runs it

Without a registry → no easy way to share or deploy images.
🟣 Common Docker Registry Commands
✅ Login
✅ Tag an Image
✅ Push to Registry
✅ Pull from Registry
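Sketches of each (username and repository illustrative):

```bash
# Login
docker login

# Tag an image for the registry
docker tag my-etl:1.0 myuser/my-etl:1.0

# Push to the registry
docker push myuser/my-etl:1.0

# Pull from the registry
docker pull myuser/my-etl:1.0
```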
🟤 Examples of Docker Registries
📌 1. Docker Hub (Most Common)
- Free public repositories
- Paid private repos
📌 2. AWS ECR (Enterprise)
- Most used in production
- Private registry
- Integrated with ECS, EKS, Lambda
📌 3. GitHub Container Registry
- Images stored inside GitHub
- Good for CI/CD workflows
📌 4. Google GCR / Artifact Registry
📌 5. Self-hosted Registry
Example: Harbor, JFrog Artifactory
🔥 Advanced Concepts (Interview-Level)
🔹 Digest-based pulling
Instead of tag:
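For example (the digest itself is elided):

```bash
docker pull python@sha256:<digest>
```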
Guarantees exact version.
🔹 Immutable tags
Some registries enforce that v1 cannot be overwritten.
🔹 Retention Policies
Automatically delete old images in ECR/GCR.
🔹 Scan for vulnerabilities
Registries like:
- AWS ECR
- GHCR
- Docker Hub (Pro)
can scan images for security issues.
🔷 Docker Networking
Docker networking allows containers to communicate:
- with each other
- with the host machine
- with the outside world
🔶 Types of Docker Networks
Docker provides 5 main network types:
🟦 1. Bridge Network (Default)
- Most commonly used
- Containers on the same bridge network can talk to each other using container names
Example:
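A sketch (names illustrative):

```bash
docker network create appnet
docker run -d --name db --network appnet postgres:15
docker run -d --name web --network appnet nginx
# "web" can reach "db" by container name
```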
Use Case:
Local development
Microservices communication
🟩 2. Host Network
Container shares the same network as host.
❌ No isolation
⚡ Fastest network performance
🧠 No port mapping needed
Run:
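```bash
docker run -d --network host nginx
```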
Use Case:
- High-performance applications
- Network-heavy workloads
🟧 3. None Network
Container has no network.
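For example:

```bash
docker run --network none alpine sleep 3600
```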
Use Case:
Security
Sandbox jobs
Batch processing
🟪 4. Overlay Network (Swarm / Kubernetes)
Used in multi-node swarm clusters.
Allows containers on different machines to communicate.
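Creating one requires swarm mode (`docker swarm init`); a sketch:

```bash
docker network create -d overlay my-overlay
```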
Use Case:
Distributed apps
Microservices in Docker Swarm
🟫 5. Macvlan Network
Gives container its own IP address in LAN like a real device.
Use Case:
Legacy systems
Need direct connection to network
Running containers like physical machines
🔷 Key Networking Commands
| Command | Description |
|---|---|
| docker network ls | List networks |
| docker network inspect <name> | Inspect network |
| docker network create <name> | Create network |
| docker network rm <name> | Remove network |
| docker network connect <net> <container> | Add container to network |
| docker network disconnect <net> <container> | Remove container from network |
🔷 How Containers Communicate
🟦 1. Same Bridge Network
✔ Can ping each other by container name
✔ DNS built-in
Example:
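For example, with `web` and `db` on the same network:

```bash
docker exec -it web ping db    # name resolves via Docker's built-in DNS
```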
🟥 2. Different Networks
❌ Cannot communicate
➡ Must connect to the same network
🟩 3. With Host Machine
The host can access a container via mapped ports.
Example: run `docker run -d -p 8080:80 nginx`, then access → http://localhost:8080
🟧 4. Container to Internet
Enabled by default via NAT.
🔶 Port Mapping
If the container port is 80 and the host port is 8080 (`-p 8080:80`):
👉 The host can access the container on port 8080
👉 This is "port forwarding"
🟦 Docker DNS
On the same custom network:
- Container names act like hostnames
- Docker automatically manages DNS
🔥 Real Interview Questions (with short answers)
1. What is Docker Bridge Network?
Default network; containers can communicate using container name.
2. Difference between Port Mapping and Exposing Port?
- `EXPOSE` = documentation
- `-p` = actual port forwarding
3. How do containers talk to each other?
By joining the same network.
4. What is Host Network?
Shares host’s IP; no port mapping; fastest.
5. What is Overlay Network?
Connects containers across multiple machines in Docker Swarm.
🔵 Docker Volumes
Docker Volumes are the official way to store data outside a container.
Volumes are a dedicated, persistent storage mechanism managed by Docker. Unlike a container's writable layer, volumes exist independently of the container lifecycle: data in a volume remains intact even if the container is stopped, removed, or recreated. They live outside the container filesystem on the host, typically under Docker's control directories, providing efficient I/O and storage management.
Because containers are ephemeral:
→ When a container stops or is deleted, its data is lost
→ Volumes solve that.
🔶 Why Do We Need Docker Volumes?
✔ Containers are temporary
✔ Data must persist
✔ Multiple containers may need same data
✔ Upgrading/Deleting containers should NOT delete data
🟦 Types of Docker Storage
Docker offers 3 types:
1️⃣ Volume (Recommended)
Managed by Docker itself
Stored under `/var/lib/docker/volumes/` on Linux.
Use Cases:
- Databases (MySQL, PostgreSQL)
- Persistent app data
Example:
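A sketch:

```bash
docker volume create pgdata
docker run -d -v pgdata:/var/lib/postgresql/data postgres:15
```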
2️⃣ Bind Mount
Uses host machine's folder.
Use Cases:
- Local development
- When you want full control of a host path
3️⃣ tmpfs (Linux Only)
Data stored in RAM.
Use Cases:
- Sensitive data
- Ultra-fast temporary storage
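A sketch using the `--tmpfs` flag (paths illustrative):

```bash
docker run -d --tmpfs /app/cache my-image
```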
🟩 Volume Commands (Most Important)
| Command | Description |
|---|---|
| docker volume create myvol | Create volume |
| docker volume ls | List volumes |
| docker volume inspect myvol | Inspect volume |
| docker volume rm myvol | Delete volume |
| docker volume prune | Remove unused volumes |
🟧 Using Volumes in Docker Run
Syntax:
Example:
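(volume and image names illustrative):

```bash
# Syntax
docker run -v <volume_name>:<container_path> <image>

# Example
docker run -d -v pgdata:/var/lib/postgresql/data postgres:15
```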
🟣 Using Bind Mounts
Example:
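A sketch (host path illustrative):

```bash
docker run -d -v /home/user/data:/app/data my-image
```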
🔵 Volumes in Docker Compose
Very important for real projects.
docker-compose.yml
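```yaml
services:
  db:
    image: postgres:15
    volumes:
      - dbdata:/var/lib/postgresql/data
volumes:
  dbdata:
```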
🔥 Example Use Case (DB Persistence)
If you run a database container without a volume:
Delete the container → data gone.
But with a volume mounted:
Stop or delete the container → data still exists (in the volume).
🟥 Where Are Volumes Stored?
On Linux: `/var/lib/docker/volumes/<volume_name>/_data`
On Windows/Mac → managed internally through Docker Desktop.
🟨 Interview Questions (Short Answers)
1️⃣ What is a Docker Volume?
A persistent storage mechanism managed by Docker.
2️⃣ Difference: Volume vs Bind Mount?
| Volume | Bind Mount |
|---|---|
| Managed by Docker | Controlled by host user |
| More secure | Direct host access |
| Best for production | Best for local development |
3️⃣ Does deleting container delete volume?
❌ No.
Volumes must be deleted manually.
4️⃣ What happens if volume doesn't exist?
Docker automatically creates it.
5️⃣ Can two containers share one volume?
✔ Yes → used in DB replicas, logs, shared storage.
🔵 What is ENTRYPOINT in Docker?
ENTRYPOINT defines the main command that will always run when a container starts.
Think of it as the default executable of the container.
🟦 Why ENTRYPOINT is used?
✔ Makes the container behave like a single-purpose program
✔ Forces a command to always run
✔ Can't be easily overridden (compared to CMD)
✔ Best for production containers
🔶 ENTRYPOINT Syntax
Two forms exist:
1️⃣ Exec Form (Recommended)
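```dockerfile
ENTRYPOINT ["python", "app.py"]
```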
✔ Doesn’t use shell
✔ More secure
✔ Handles signals properly
2️⃣ Shell Form
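```dockerfile
ENTRYPOINT python app.py
```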
⚠ Runs inside /bin/sh -c
⚠ Harder to handle signals
🟣 Example ENTRYPOINT Dockerfile
Dockerfile
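A sketch (`app.py` illustrative):

```dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY app.py .
ENTRYPOINT ["python", "app.py"]
```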
Run:
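```bash
docker run my-image
```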
This will always run `python app.py`, regardless of extra arguments.
🟩 ENTRYPOINT + CMD (Very Important)
ENTRYPOINT = fixed command
CMD = default arguments
Example:
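A sketch (the `--mode` flag is illustrative):

```dockerfile
ENTRYPOINT ["python", "app.py"]
CMD ["--mode", "batch"]
```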
The container will run: `python app.py --mode batch`
You can override CMD at run time: `docker run my-image --mode stream`
But ENTRYPOINT cannot be replaced unless you use `--entrypoint`.
🔥 Override ENTRYPOINT (Rare)
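```bash
docker run --entrypoint bash my-image
```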
🟥 ENTRYPOINT vs CMD (Very Important Table)
| Feature | ENTRYPOINT | CMD |
|---|---|---|
| Main purpose | Main command | Default args |
| Overrides allowed? | ❌ Hard | ✔ Easy |
| Best use | Permanent command | Arguments |
| Runs as | Program | Command/Args |
🔶 Common Interview Questions
1. Why use ENTRYPOINT instead of CMD?
To ensure the main command always runs and cannot be overridden.
2. What happens if both ENTRYPOINT and CMD exist?
CMD becomes arguments to ENTRYPOINT.
3. How do you override ENTRYPOINT?
Using --entrypoint.
🔵 Docker Daemon & Docker Client
Docker works using a client–server architecture.
🟦 1. Docker Daemon (dockerd)
This is the brain of Docker.
✔ What it Does:
- Runs in the background
- Manages containers
- Manages images
- Manages networks
- Manages volumes
- Executes all Docker operations
✔ It Listens On:
- Unix socket: `/var/run/docker.sock`
- Sometimes a TCP port (for remote Docker hosts)
✔ Daemon = Server Side
🟩 2. Docker Client (docker)
This is the command-line tool you use.
When you type a command such as `docker run nginx`, the client DOES NOT run the container itself.
Instead, it sends API requests to the Docker Daemon, which performs the real operations.
✔ Client = Frontend
✔ Daemon = Backend
🟧 How They Work Together (Simple Flow)
You run: `docker run nginx`
Flow:
1. Client sends the request → Daemon
2. Daemon pulls the image (if not already present)
3. Daemon creates the container
4. Daemon starts the container
5. You see the output in your terminal
🔵 COPY vs ADD in Dockerfile
Both are used to copy files into the image, but COPY is preferred.
🟦 1. COPY (Recommended)
✔ What it does:
Copies local files/folders into the container.
✔ Safe
✔ Predictable
✔ No extra features (simple only)
Example:
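A sketch:

```dockerfile
COPY requirements.txt /app/
COPY src/ /app/src/
```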
Use COPY when:
- You want to copy source code
- You want clean builds
- You don't need extraction or downloading
🟧 2. ADD (Avoid unless needed)
✔ What it does:
Does everything COPY does plus two extra features:
Extra Features:
1️⃣ Can download files from URLs
2️⃣ Automatically extracts local tar archives
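Sketches of both (URL and file names illustrative; note that a tar fetched from a URL is not auto-extracted, only local tars are):

```dockerfile
ADD https://example.com/data.csv /data/data.csv   # downloads at build time
ADD app.tar.gz /app/                              # local tar is auto-extracted
```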
⚠ Because of these extras → can create security issues
So Docker recommends: use COPY unless ADD is needed.
🟪 COPY vs ADD Table (Interview-Friendly)
| Feature | COPY | ADD |
|---|---|---|
| Copy local files | ✔ Yes | ✔ Yes |
| Copy remote URL | ❌ No | ✔ Yes |
| Auto extract .tar.gz | ❌ No | ✔ Yes |
| Simpler | ✔ Yes | ❌ No |
| More secure | ✔ Yes | ❌ No |
| Recommended? | ✔ Yes | ❌ Use only when required |
🟩 When to Use ADD? (Rare)
Use ADD only for:
✔ Auto-unpacking tar files into image
✔ Downloading files from a URL
Otherwise → COPY is always better.
🟥 🔥 Interview Answer (Short)
COPY is used to copy files/folders into the image and is preferred because it is simpler and more secure.
ADD has extra features like downloading files from URLs and extracting tar archives, so use it only when those features are needed.
🔵 What are Multi-Stage Builds?
Multi-stage builds allow you to use multiple FROM statements in a single Dockerfile.
✔ Build in one stage
✔ Copy only the required output into the final stage
✔ Final image becomes much smaller
✔ No build dependencies inside final image
🟦 Why Multi-Stage Builds Are Needed?
Problem (without multi-stage):
- Build tools (Maven, Go compiler, Node modules, pip, etc.) stay inside the final image
- Makes the image heavy
- Security issues
- Slow deployment
Multi-stage solution:
- Build tools exist only in the build stage
- The final stage contains just the application
- Clean, lightweight image
🟩 Simple Example – Python / Node / Java / Go (All follow same logic)
Here is a general multi-stage pattern:
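A sketch of the Node → nginx pattern described below (output path depends on your build tool):

```dockerfile
# Stage 1: build the app
FROM node:18 AS build
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build

# Stage 2: serve only the compiled output
FROM nginx:alpine
COPY --from=build /app/dist /usr/share/nginx/html
```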
What happens?
- The Node image builds the app
- Only the final compiled output is copied into nginx
- Result = a very small production image
🔶 Another Example – Python App
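A sketch (file names illustrative):

```dockerfile
# Stage 1: install dependencies
FROM python:3.10 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --prefix=/install -r requirements.txt

# Stage 2: runtime-only image
FROM python:3.10-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
CMD ["python", "main.py"]
```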
🔷 Another Example – Java (Very Popular)
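A sketch (the artifact name depends on your pom.xml):

```dockerfile
# Stage 1: build with Maven
FROM maven:3.9-eclipse-temurin-17 AS build
WORKDIR /app
COPY pom.xml .
COPY src ./src
RUN mvn -q package -DskipTests

# Stage 2: JRE-only runtime
FROM eclipse-temurin:17-jre
COPY --from=build /app/target/app.jar /app.jar
ENTRYPOINT ["java", "-jar", "/app.jar"]
```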
✔ No Maven in final image
✔ Final image is tiny
🟧 Key Features of Multi-Stage Builds
✔ Multiple FROM instructions: each FROM starts a new stage
✔ You can name stages (`FROM node:18 AS build`)
✔ Copy artifacts from stage to stage (`COPY --from=build ...`)
✔ The final image only contains the last stage; all previous stages are discarded, so the image stays clean and small
🟪 Benefits (Interview Ready)
| Benefit | Explanation |
|---|---|
| ✔ Smaller images | No build tools in final image |
| ✔ Faster builds | Layer caching for each stage |
| ✔ Better security | No compilers / secrets left behind |
| ✔ Cleaner Dockerfiles | Each stage has a clear job |
| ✔ Reproducible builds | Same environment every time |
🔵 What is .dockerignore?
.dockerignore is a file that tells Docker which files/folders to EXCLUDE when building an image.
It works similar to .gitignore.
🟦 Why do we use .dockerignore?
✔ Faster Docker builds
(Removes unnecessary files → smaller build context)
✔ Smaller images
(Don’t copy unwanted files)
✔ Better security
(Keep secrets, logs, configs out of image)
✔ Cleaner caching
(Prevents rebuilds when irrelevant files change)
🟩 Common Items in .dockerignore
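Typical entries (adapt to your project):

```
.git
.env
node_modules/
__pycache__/
*.log
venv/
```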
🟧 How it works?
When you run `docker build .`, Docker first sends the "build context" (the current directory) to the daemon.
Without a .dockerignore, everything is sent.
.dockerignore tells Docker:
🚫 Don't send these files in the build context.
🟪 Example
.dockerignore
Dockerfile
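A sketch:

```
# .dockerignore
.git
*.log
node_modules/
```

```dockerfile
# Dockerfile
FROM node:18
WORKDIR /app
COPY . .    # ignored files are never sent to the daemon, so they can't be copied
```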
Only allowed files will be copied.
🟥 Performance Impact (Very Important)
Without .dockerignore:
- Docker copies huge directories (node_modules, logs)
- Slow build
- Cache invalidates unnecessarily

With .dockerignore:
- Build context is very small
- Build is faster
- Cache stays valid → faster incremental builds
🟨 Interview Questions (Short Answers)
1. What is the purpose of .dockerignore?
To exclude unnecessary files from the Docker build context.
2. What happens if .dockerignore is missing?
Docker sends all files to the build context → slow builds, large images.
3. Does .dockerignore reduce image size?
Yes, because it prevents unnecessary files from being copied.
4. Does .dockerignore improve caching?
Yes → fewer files → fewer cache invalidations.
5. Is .dockerignore mandatory?
No, but highly recommended.
🔵 Docker Container Lifecycle (Step-by-Step)
A Docker container goes through the following major stages:
🟦 1. Created
The container is created from an image but not started yet.
Command: `docker create nginx`
🟩 2. Running
Container is active and executing processes.
Command: `docker start <container>`
(`docker run` = create + start)
🟧 3. Paused
All processes inside the container are temporarily frozen.
Command: `docker pause <container>`
🟪 4. Unpaused
Resumes the paused container.
Command: `docker unpause <container>`
🟥 5. Stopped / Exited
Container stops running its main process (app has exited or manually stopped).
Command: `docker stop <container>`
🟨 6. Restarted
Container is stopped and then started again.
Command: `docker restart <container>`
🟫 7. Removed (Deleted)
The container is permanently removed from Docker.
Command: `docker rm <container>`
You cannot remove a running container without stopping it first (or forcing removal with `docker rm -f`).
📌 Lifecycle Diagram (Simple)
| Action | Command Example |
|---|---|
| Create | docker create nginx |
| Run (create+start) | docker run nginx |
| Start | docker start cont_id |
| Stop | docker stop cont_id |
| Pause | docker pause cont_id |
| Unpause | docker unpause cont_id |
| Restart | docker restart cont_id |
| Remove | docker rm cont_id |
| Remove all | docker rm $(docker ps -aq) |
🔵 What is a Docker HEALTHCHECK?
A HEALTHCHECK is a way to tell Docker how to test whether a container is healthy.
Docker runs this command periodically and updates the container's status:
- healthy
- unhealthy
- starting
It helps in:
- auto-restarts
- load balancers
- orchestrators (Kubernetes, ECS, Swarm)
🟦 Syntax (Dockerfile)
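The general form with common options (the health endpoint is illustrative, and `curl` must exist in the image):

```dockerfile
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1
```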
🟩 Options
| Option | Meaning |
|---|---|
| --interval=30s | Check frequency |
| --timeout=3s | How long to wait before failing |
| --start-period=5s | Grace period before checks start |
| --retries=3 | Fail after X failed attempts |
🟧 Example 1: Simple HTTP Healthcheck
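A sketch (port and endpoint illustrative; assumes `curl` is installed in the image):

```dockerfile
HEALTHCHECK --interval=30s CMD curl -f http://localhost:8000/ || exit 1
```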
- If `curl -f` succeeds → healthy
- If it fails → unhealthy
🟪 Example 2: Healthcheck Script
health.sh:
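A sketch of the script and how the Dockerfile would use it (endpoint illustrative):

```bash
#!/bin/sh
# health.sh: exit 0 when healthy, non-zero otherwise
curl -f http://localhost:8000/health || exit 1
```

```dockerfile
COPY health.sh /health.sh
RUN chmod +x /health.sh
HEALTHCHECK CMD /health.sh
```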
List containers with health status:
Detailed inspection:
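```bash
docker ps                                                    # STATUS column shows (healthy) / (unhealthy)
docker inspect --format '{{.State.Health.Status}}' <container>
```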
You will see:
| Status | Meaning |
|---|---|
| starting | Startup period (start-period) |
| healthy | App is functioning |
| unhealthy | Check failed repeatedly |
Note: plain Docker restart policies react to the main process exiting, not to health status; orchestrators (Swarm, Kubernetes, ECS) use the health status to restart or replace unhealthy containers automatically.
📌 Important Notes
- HEALTHCHECK runs inside the container.
- It should be lightweight (avoid heavy scripts).
- It uses exit codes:
  - 0 = success (healthy)
  - 1 = unhealthy
  - 2 = reserved
🔵 What is docker inspect?
docker inspect is used to view detailed information about Docker containers, images, networks, or volumes in JSON format.
It shows everything about a container:
- Network info
- Mounts / volumes
- IP address
- Ports
- Environment variables
- Health status
- Entrypoint, CMD
- Resource usage config
- Labels
- Container state (running, stopped, etc.)
This is the most powerful debugging command.
🟦 Basic Command
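```bash
docker inspect <container_id_or_name>
```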
🟩 Example Output (Simplified)
You will see JSON fields like:
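An illustrative, heavily trimmed excerpt:

```json
{
  "State":  { "Status": "running", "ExitCode": 0 },
  "Config": { "Env": ["PATH=/usr/local/bin:..."], "Cmd": ["python", "app.py"] },
  "NetworkSettings": { "IPAddress": "172.17.0.2" },
  "Mounts": []
}
```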
🔧 Most Useful Inspect Filters (Important!)
📍 1. Get container IP address
📍 2. Get just the environment variables
📍 3. Get container’s running status
📍 4. Get container entrypoint
📍 5. Get exposed ports
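Sketches for each, using `--format`/`-f` Go templates over the inspect JSON:

```bash
docker inspect -f '{{.NetworkSettings.IPAddress}}' <container>   # 1. IP address
docker inspect -f '{{.Config.Env}}' <container>                  # 2. environment variables
docker inspect -f '{{.State.Status}}' <container>                # 3. running status
docker inspect -f '{{.Config.Entrypoint}}' <container>           # 4. entrypoint
docker inspect -f '{{.Config.ExposedPorts}}' <container>         # 5. exposed ports
```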
🟧 Inspecting Images
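```bash
docker image inspect python:3.10-slim
```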
Useful to see:
- layers
- build parameters
- environment variables
- entrypoint
🟪 Inspecting Networks
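```bash
docker network inspect bridge
```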
You can find:
- connected containers
- IP ranges (subnet)
- gateway
- driver type
🟫 Inspecting Volumes
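```bash
docker volume inspect myvol
```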
Shows:
- mount point
- driver
- usage
✨ Real Use Cases (Important for Interviews)
| Use Case | Command |
|---|---|
| Debug network issues | Get IP, ports |
| Debug ENV variables | extract .Config.Env |
| Verify mounted volumes | check .Mounts |
| Check health status | check .State.Health.Status |
| Know why a container exited | check .State.ExitCode |
🟩 Check Container Logs (Related Command)
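```bash
docker logs <container>       # stdout/stderr of the main process
docker logs -f <container>    # follow live
```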
🔵 What is Port Mapping in Docker?
Port mapping connects a container’s internal port to a port on your host machine so that applications inside the container can be accessed from outside.
Example:
A container running a web server on port 80 → accessible on host via port 8080
This is called port forwarding.
🟦 Syntax
Example:
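```bash
# Syntax
docker run -p <host_port>:<container_port> <image>

# Example
docker run -d -p 8080:80 nginx
```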
Meaning:
- Inside the container, Nginx listens on 80
- On your laptop/server, you hit http://localhost:8080
🟩 Types of Port Mapping
1. Host → Container (most common)
2. Bind to specific IP (e.g., localhost only)
Meaning:
Only local machine can access it.
3. Automatic host port assignment
Docker assigns random free ports.
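Sketches of all three:

```bash
docker run -d -p 8080:80 nginx              # 1. host → container
docker run -d -p 127.0.0.1:8080:80 nginx    # 2. bind to localhost only
docker run -d -P nginx                      # 3. auto-assign free host ports
```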
🟧 Check Mapped Ports
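```bash
docker ps                 # PORTS column
docker port <container>   # mappings for one container
```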
You will see something like `0.0.0.0:8080->80/tcp`.
🟪 Why Port Mapping Is Needed (Interview Points)
- Containers run in isolated networks
- Container ports aren't accessible from the host by default
- Port mapping exposes them
- Allows multiple instances to run on different host ports
- Helps in local development and testing
🟫 Real Examples
1️⃣ Expose Postgres
2️⃣ Expose Airflow Webserver
3️⃣ Expose FastAPI on 8000
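Sketches (image names and versions illustrative):

```bash
docker run -d -p 5432:5432 postgres:15                       # 1. Postgres
docker run -d -p 8080:8080 apache/airflow:2.7.3 webserver    # 2. Airflow webserver
docker run -d -p 8000:8000 my-fastapi-image                  # 3. FastAPI
```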
🔥 Port Mapping in Docker Compose
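```yaml
services:
  web:
    image: nginx
    ports:
      - "8080:80"
```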
Same meaning: host 8080 → container 80
🎯 One-Line Interview Summary
Port mapping (-p host:container) allows access to applications running inside a container by exposing container ports to the host machine.