PySpark
https://www.youtube.com/watch?v=FNJze2Ea780
What is Apache Spark?
Apache Spark is an open-source distributed data processing engine designed for big data analytics. It allows you to process large datasets across multiple machines (clusters) quickly and efficiently.
Think of it as a supercharged, scalable version of Python’s Pandas or SQL that works on massive data distributed across many servers.
What is PySpark?
PySpark is the Python API for Apache Spark. It allows you to write Spark applications in Python.
In other words:
- Apache Spark = the big data processing engine (written in Scala, runs on the JVM).
- PySpark = a way to use Spark with Python.
Key Features of Spark
- Distributed Computing:
  - Data is split into chunks and processed in parallel across a cluster.
  - Can handle petabytes of data.
- Fast Processing:
  - Uses in-memory computation, which is faster than disk-based systems like Hadoop MapReduce.
  - Optimized DAG (Directed Acyclic Graph) execution for tasks.
- Multi-Language Support:
  - Scala (native)
  - Java
  - Python (PySpark)
  - R
  - SQL
- Unified Engine: Spark can handle multiple workloads:
  - Batch processing → big files (CSV, Parquet, JSON)
  - Stream processing → real-time data (Kafka, sockets)
  - Machine Learning → MLlib
  - Graph Processing → GraphX
- Fault-Tolerant:
  - Uses RDDs (Resilient Distributed Datasets) that can recover lost data automatically.
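To make this concrete, here is a minimal PySpark sketch that reads a CSV file and runs a simple aggregation. The file path and column names (amount, region) are hypothetical.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession — the entry point to PySpark
spark = SparkSession.builder.appName("intro-example").getOrCreate()

# Read a CSV file into a distributed DataFrame (path/columns are placeholders)
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Filter and aggregate — executed in parallel across the cluster
df.filter(df.amount > 100).groupBy("region").count().show()

spark.stop()
```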
=============================================
What is Distributed Computing?
Distributed computing is a computing model where large computational tasks are divided and executed across multiple machines (nodes) that work in parallel. Think of it as breaking a huge job into smaller parts and assigning each part to a different worker. Its key features include:
- Speed: Multiple nodes work simultaneously.
- Scalability: Add more nodes to handle more data.
- Fault tolerance: If one node fails, others continue.
=============================================
Spark is a distributed computing system, meaning it runs across a cluster of machines. Its architecture is designed for speed, fault tolerance, and scalability.
1️⃣ Main Components of Spark Architecture
| Component | Description |
|---|---|
| Driver | The main program that runs your Spark application. It creates SparkContext or SparkSession, converts code into tasks, and schedules them on executors. |
| Executor | Worker process running on cluster nodes. Responsible for executing tasks, storing data in memory or disk, and reporting results back to the driver. |
| Cluster Manager | Manages resources across the cluster. Allocates CPU, memory to Spark applications. Types: Standalone, YARN, Mesos, Kubernetes. |
2️⃣ How Spark Works (High-Level Flow)
- Spark Application: You write your program (Python, Scala, Java, R).
- Driver Program: SparkSession/SparkContext runs on the driver. It:
  - Divides the work into tasks
  - Builds a DAG (Directed Acyclic Graph) of transformations
  - Sends tasks to executors
- Executors: Run on cluster nodes and execute the tasks in parallel.
- Cluster Manager: Assigns resources to the driver and executors.
- Result: Executors return results to the driver; the driver aggregates them and produces the output (see the sketch below).
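A minimal sketch of this flow. The numbers and the partition count are arbitrary, and the actual parallelism depends on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flow-example").getOrCreate()
sc = spark.sparkContext

# The driver splits this collection into 8 partitions;
# each partition becomes a task that an executor can run in parallel.
rdd = sc.parallelize(range(1_000_000), numSlices=8)

# The driver builds the plan (map + sum), ships tasks to executors,
# and combines the partial results into the final answer.
total = rdd.map(lambda x: x * 2).sum()
print(total)
```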
3️⃣ Spark Components in the Architecture
| Layer | Purpose |
|---|---|
| Spark Core | Handles task scheduling, memory management, fault recovery, RDD API (resilient distributed dataset). |
| Spark SQL | Query structured data using SQL, DataFrames, Datasets. |
| Spark Streaming / Structured Streaming | Real-time streaming processing. |
| MLlib | Distributed Machine Learning library. |
| GraphX | Graph processing API (e.g., social network analysis). |
4️⃣ Spark Execution Model
- Transformations (map, filter, flatMap) are lazy. Spark builds a DAG of operations.
- Actions (collect, count, save) trigger execution.
- Parallel Execution: Tasks are distributed to multiple executors for parallel processing (see the sketch below).
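A small sketch of this model using the RDD operations named above; the data is made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Transformations: nothing runs yet, Spark only records the plan
words = sc.parallelize(["spark is fast", "spark is distributed"])
lengths = words.flatMap(lambda line: line.split()).map(len).filter(lambda n: n > 2)

# Action: triggers the whole DAG in parallel and returns results to the driver
print(lengths.collect())   # [5, 4, 5, 11]
```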
5️⃣ Cluster Manager Options
| Cluster Manager | Description |
|---|---|
| Standalone | Built-in, simple Spark cluster manager. |
| YARN | Hadoop’s cluster manager, widely used in enterprises. |
| Mesos | General-purpose cluster manager. |
| Kubernetes | Container-based cluster deployment. |
6️⃣ Diagram (Text Version)
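A minimal text rendering of the architecture described above (driver, cluster manager, executors):

```
        +------------------+
        |      Driver      |
        |  (SparkSession)  |
        +--------+---------+
                 |
        +--------v---------+
        |  Cluster Manager |
        +----+--------+----+
             |        |
      +------v--+  +--v------+
      | Executor|  | Executor|
      |  tasks  |  |  tasks  |
      +---------+  +---------+
```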
Apache Spark Components
1️⃣ Spark Core
- What it is: The heart of Spark. All other components (SQL, Streaming, MLlib, GraphX) run on top of Spark Core.
- Responsibilities:
  - RDDs: Distributed collections of data
  - Task scheduling: Decides which worker (executor) will run which task
  - Memory management: Keeps data in memory for fast processing
  - Fault tolerance: Automatically recovers lost data
- Example Use: You have a big CSV file, and you want to filter and map data in parallel:
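A minimal sketch with the RDD API; the file path and the "ERROR" keyword are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Filter and map the file in parallel across the cluster
lines = sc.textFile("big_file.csv")                   # hypothetical path
errors = lines.filter(lambda line: "ERROR" in line)   # keep matching lines
fields = errors.map(lambda line: line.split(","))     # parse each line

print(fields.count())   # action: triggers the distributed job
```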
2️⃣ Spark SQL
- What it is: Module for structured data. Works with DataFrames and SQL queries.
- Why it’s useful:
  - Easier than working with raw RDDs
  - Optimized using the Catalyst Optimizer → faster queries
- Example Use: You want to query a Parquet file like a SQL database:
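A sketch of this; the path, view name, and columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a Parquet file into a DataFrame and query it with SQL
df = spark.read.parquet("users.parquet")   # hypothetical path
df.createOrReplaceTempView("users")        # register as a temporary SQL view

result = spark.sql("SELECT country, COUNT(*) AS n FROM users GROUP BY country")
result.show()
```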
3️⃣ Spark Streaming / Structured Streaming
- What it is: Module for real-time data processing.
- Why it’s useful: Handles continuous streams of data (Kafka, sockets, logs).
- Example Use: Monitor real-time server logs:
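A minimal Structured Streaming sketch that reads lines from a socket; the host and port are placeholders, and a Kafka or file source would follow the same pattern.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a continuous stream of text lines from a socket (placeholder host/port)
logs = (spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load())

# Keep only lines that look like errors and print them as they arrive
errors = logs.filter(logs.value.contains("ERROR"))

query = errors.writeStream.format("console").start()
query.awaitTermination()
```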
4️⃣ MLlib (Machine Learning Library)
- What it is: Spark’s distributed machine learning library.
- Why it’s useful: Can train models on very large datasets across multiple nodes.
- Supported algorithms: Classification, regression, clustering, recommendation, etc.
- Example Use: Train a linear regression model on a big dataset:
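A sketch with the DataFrame-based pyspark.ml API; the file path and column names are made up.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# Load training data (hypothetical path and columns)
data = spark.read.parquet("house_prices.parquet")

# Combine raw feature columns into the single vector column MLlib expects
assembler = VectorAssembler(inputCols=["size", "rooms", "age"], outputCol="features")
train = assembler.transform(data)

# Train a distributed linear regression model
lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(train)

print(model.coefficients, model.intercept)
```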
5️⃣ GraphX
- What it is: Module for graph and network analysis.
- Why it’s useful: Analyze relationships in data, like social networks.
- Example Use: Find the shortest path between users or recommend friends (see the sketch below).
- In Python, GraphX has no direct API, so the GraphFrames package is usually used instead.
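A sketch using GraphFrames; it must be installed separately (e.g. as the graphframes Spark package), and the vertex/edge data here is made up.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame   # requires the external graphframes package

spark = SparkSession.builder.getOrCreate()

# Vertices (users) and edges (friendships) — made-up data
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Cara")],
                                 ["id", "name"])
edges = spark.createDataFrame([("a", "b", "friend"), ("b", "c", "friend")],
                              ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)

# Shortest path (in hops) from every user to user "c"
g.shortestPaths(landmarks=["c"]).show()
```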
=============================================
1️⃣ What is Lazy Evaluation?
👉 Lazy evaluation means PySpark does NOT execute operations immediately.
Instead, it records the operations, builds an optimized logical plan, and executes them only when a result is required (i.e., when an action is performed).
In simple words:
Transformations are lazy, Actions are eager
2️⃣ Transformations vs Actions (Core idea)
🔹 Transformations (Lazy)
These do not trigger execution:
- select()
- filter()
- map()
- withColumn()
- join()
- groupBy()
They only build a logical execution plan (DAG).
🔹 Actions (Eager)
These trigger execution, for example:
- collect()
- count()
- show()
- save() / write()
3️⃣ What happens internally? (Spark flow)
When you write PySpark code:
Step 1: Build Logical Plan
Spark just creates a DAG (Directed Acyclic Graph):
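For example (hypothetical file and columns), these lines only record the plan:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Transformations only — Spark just records them in the DAG
df = spark.read.schema("user_id INT, amount DOUBLE").csv("transactions.csv")
high = df.filter(df.amount > 1000).select("user_id", "amount")
```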
❌ No execution yet
Step 2: Action triggers execution
Now Spark:
- Optimizes the plan (Catalyst Optimizer)
- Creates a physical execution plan
- Submits jobs to executors
- Reads data + performs transformations
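Continuing the sketch from Step 1, a single action kicks all of this off:

```python
# Action — triggers reading, filtering, and selecting in one optimized job
print(high.count())
```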
4️⃣ Why does Spark use Lazy Evaluation?
✅ 1. Performance Optimization
Spark can:
- Reorder operations
- Push filters early (predicate pushdown)
- Remove unused columns
- Combine transformations
Example:
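For instance (hypothetical Parquet file and columns):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("events.parquet")          # hypothetical file
result = (df.select("user_id", "country", "amount")
            .filter(df.country == "IN")
            .select("user_id", "amount"))
result.show()
```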
Spark internally optimizes this so that the country filter is pushed down into the Parquet scan (predicate pushdown) and only the columns that are actually used are read (column pruning), instead of loading the whole file first.
✅ 2. Avoids Unnecessary Computation
If no action is called, Spark does nothing.
✅ 3. Fault Tolerance
Spark tracks lineage, not data.
If a partition fails, Spark recomputes it using the DAG.
5️⃣ Example Showing Lazy Evaluation
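A minimal sketch (hypothetical file and columns); only the last line triggers any work.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Transformations — recorded in the DAG, nothing executes yet
df = spark.read.schema("id INT, status STRING").csv("orders.csv")
failed = df.filter(df.status == "FAILED")

# Action — now Spark reads the file, applies the filter, and returns a number
print(failed.count())
```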
💡 The actual reading and filtering happen only when count() is called.
6️⃣ DAG Visualization
You can see lazy evaluation in action using:
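For example, on the DataFrame from the sketch above:

```python
failed.explain(True)   # prints the parsed, analyzed, optimized, and physical plans
```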
This shows:
- Parsed Logical Plan
- Optimized Logical Plan
- Physical Plan
7️⃣ Caching & Lazy Evaluation (Very Important)
⚠️ cache() itself is lazy
Data is cached only when an action runs
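A small sketch (hypothetical file):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("events.parquet")   # hypothetical file
df.cache()    # lazy: only marks the DataFrame for caching
df.count()    # first action: computes the data AND fills the cache
df.count()    # subsequent actions read from the cached data
```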
8️⃣ Real-world Data Engineering Example
In ETL pipelines:
- Spark waits
- Optimizes
- Executes everything once
- Writes the final output (see the sketch below)
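A minimal ETL-style sketch; the paths, format, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-example").getOrCreate()

# Extract
orders = spark.read.parquet("raw/orders.parquet")   # hypothetical input

# Transform — all of this is lazy, only the plan is built
cleaned = (orders
           .filter(F.col("amount") > 0)
           .withColumn("order_date", F.to_date("created_at"))
           .groupBy("order_date")
           .agg(F.sum("amount").alias("daily_revenue")))

# Load — the write is the action that triggers one optimized job
cleaned.write.mode("overwrite").parquet("curated/daily_revenue")
```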
🚀 This is why Spark scales so well.
9️⃣ Interview One-Liner (Must Remember)
Lazy evaluation in PySpark means transformations are not executed immediately; Spark builds an execution plan and runs it only when an action is called, enabling optimization and efficient execution.