PySpark
What is Apache Spark?
Apache Spark is an open-source distributed data processing engine designed for big data analytics. It allows you to process large datasets across multiple machines (clusters) quickly and efficiently.
Think of it as a supercharged, scalable version of Python’s Pandas or SQL that works on massive data distributed across many servers.
What is PySpark?
PySpark is the Python API for Apache Spark. It allows you to write Spark applications in Python.
In other words:
- Apache Spark = the big data processing engine (written in Scala, runs on the JVM).
- PySpark = a way to use Spark with Python.
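To see what that looks like in practice, here is a minimal PySpark sketch: it starts a SparkSession (the entry point of a PySpark program) and runs a small DataFrame query. The app name and the sample rows are placeholders, just for illustration.

```python
from pyspark.sql import SparkSession

# Entry point of every PySpark program
spark = SparkSession.builder.appName("HelloPySpark").getOrCreate()

# A tiny DataFrame, just to show that the API is plain Python
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.filter(df.age > 40).show()

spark.stop()
```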
Key Features of Spark
- Distributed Computing:
  - Data is split into chunks and processed in parallel across a cluster.
  - Can handle petabytes of data.
- Fast Processing:
  - Uses in-memory computation, which is faster than disk-based systems like Hadoop MapReduce.
  - Optimized DAG (Directed Acyclic Graph) execution for tasks.
- Multi-Language Support:
  - Scala (native)
  - Java
  - Python (PySpark)
  - R
  - SQL
- Unified Engine: Spark can handle multiple workloads:
  - Batch processing → Big files (CSV, Parquet, JSON)
  - Stream processing → Real-time data (Kafka, sockets)
  - Machine Learning → MLlib
  - Graph Processing → GraphX
- Fault-Tolerant: Uses RDDs (Resilient Distributed Datasets) that can recover lost data automatically.
=============================================
Apache Spark Architecture
Spark is a distributed computing system, meaning it runs across a cluster of machines. Its architecture is designed for speed, fault tolerance, and scalability.
1️⃣ Main Components of Spark Architecture
| Component | Description |
|---|---|
| Driver | The main program that runs your Spark application. It creates SparkContext or SparkSession, converts code into tasks, and schedules them on executors. |
| Executor | Worker process running on cluster nodes. Responsible for executing tasks, storing data in memory or disk, and reporting results back to the driver. |
| Cluster Manager | Manages resources across the cluster. Allocates CPU and memory to Spark applications. Types: Standalone, YARN, Mesos, Kubernetes. |
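To connect these components to configuration, here is a hedged sketch of how driver and executor resources can be set from PySpark. The memory and core values below are arbitrary placeholders; in practice they are often passed through spark-submit or cluster defaults instead of being hard-coded.

```python
from pyspark.sql import SparkSession

# Placeholder resource settings for the driver and executors
spark = (
    SparkSession.builder
    .appName("ResourceDemo")
    .config("spark.executor.memory", "4g")   # memory per executor process
    .config("spark.executor.cores", "2")     # CPU cores per executor
    .config("spark.driver.memory", "2g")     # memory for the driver program
    .getOrCreate()
)

print(spark.sparkContext.master)  # which cluster manager / master URL is in use
spark.stop()
```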
2️⃣ How Spark Works (High-Level Flow)
- Spark Application: You write your program (Python, Scala, Java, R).
- Driver Program: SparkSession/SparkContext runs on the driver. It:
  - Divides work into tasks
  - Builds a DAG (Directed Acyclic Graph) of transformations
  - Sends tasks to executors
- Executors: Run on cluster nodes and execute the tasks in parallel.
- Cluster Manager: Assigns resources to the driver and executors.
- Result: Executors return results to the driver, which aggregates them and produces the output.
3️⃣ Spark Components in the Architecture
| Layer | Purpose |
|---|---|
| Spark Core | Handles task scheduling, memory management, fault recovery, RDD API (resilient distributed dataset). |
| Spark SQL | Query structured data using SQL, DataFrames, Datasets. |
| Spark Streaming / Structured Streaming | Real-time streaming processing. |
| MLlib | Distributed Machine Learning library. |
| GraphX | Graph processing API (e.g., social network analysis). |
4️⃣ Spark Execution Model
- Transformations (`map`, `filter`, `flatMap`) are lazy. Spark builds a DAG of operations.
- Actions (`collect`, `count`, `save`) trigger execution.
- Parallel Execution: Tasks are distributed to multiple executors for parallel processing.
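The short sketch below illustrates this model: the `filter` and `map` calls only build up the DAG, and nothing actually runs until the `count()` action is called. The numbers are just sample data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyDemo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 1_000_001))

# Transformations: nothing executes yet, Spark only records them in the DAG
evens = rdd.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Action: triggers execution of the whole DAG across the executors
print(squares.count())

spark.stop()
```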
5️⃣ Cluster Manager Options
| Cluster Manager | Description |
|---|---|
| Standalone | Built-in, simple Spark cluster manager. |
| YARN | Hadoop’s cluster manager, widely used in enterprises. |
| Mesos | General-purpose cluster manager. |
| Kubernetes | Container-based cluster deployment. |
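In code, the choice of cluster manager is expressed through the master URL. The sketch below shows the common forms; the host names and ports are placeholders, and in real deployments the master is usually supplied via spark-submit rather than hard-coded.

```python
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("ClusterManagerDemo")

# Pick one master URL depending on the deployment (placeholders below):
spark = builder.master("local[*]").getOrCreate()       # local machine, all cores
# builder.master("spark://master-host:7077")           # Standalone cluster
# builder.master("yarn")                                # Hadoop YARN
# builder.master("k8s://https://k8s-apiserver:6443")    # Kubernetes

print(spark.sparkContext.master)
spark.stop()
```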
6️⃣ Diagram (Text Version)

    Driver (SparkSession / SparkContext)
                   |
            Cluster Manager
    (Standalone / YARN / Mesos / Kubernetes)
         /         |         \
    Executor    Executor    Executor
     (tasks)     (tasks)     (tasks)
Apache Spark Components
1️⃣ Spark Core
- What it is: The heart of Spark. All other components (SQL, Streaming, MLlib, GraphX) run on top of Spark Core.
- Responsibilities:
  - RDDs: Distributed collections of data
  - Task scheduling: Decides which worker (executor) will run which task
  - Memory management: Keeps data in memory for fast processing
  - Fault tolerance: Automatically recovers lost data
- Example Use: You have a big CSV file, and you want to filter and map data in parallel.
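A minimal Spark Core sketch of that use case is shown below. The file path `/data/sales.csv` and the column position used in the filter are assumptions made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CoreRDDDemo").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("/data/sales.csv")              # RDD of raw text lines
rows = lines.map(lambda line: line.split(","))      # transformation: parse each line
big = rows.filter(lambda r: float(r[2]) > 100.0)    # transformation: keep large amounts

print(big.count())                                  # action: runs the work in parallel
spark.stop()
```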
2️⃣ Spark SQL
- What it is: Module for structured data. Works with DataFrames and SQL queries.
- Why it’s useful:
  - Easier than working with raw RDDs
  - Optimized using the Catalyst Optimizer → faster queries
- Example Use: You want to query a Parquet file like a SQL database:
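A hedged sketch of that query, assuming a Parquet file at `/data/users.parquet` with a `country` column (both made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLDemo").getOrCreate()

df = spark.read.parquet("/data/users.parquet")   # read structured data
df.createOrReplaceTempView("users")              # register it as a SQL table

# Query it like a SQL database
spark.sql("SELECT country, COUNT(*) AS cnt FROM users GROUP BY country").show()

spark.stop()
```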
3️⃣ Spark Streaming / Structured Streaming
- What it is: Module for real-time data processing.
- Why it’s useful: Handles continuous streams of data (Kafka, sockets, logs).
- Example Use: Monitor real-time server logs:
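Below is a hedged Structured Streaming sketch that reads log lines from a socket and keeps a running word count. The host and port are placeholders (for a quick local test you can feed data with `nc -lk 9999`).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingDemo").getOrCreate()

# Read a continuous stream of lines from a socket (placeholder host/port)
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each log line into words and keep a running count
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the updated counts to the console as new data arrives
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()
```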
4️⃣ MLlib (Machine Learning Library)
- What it is: Spark’s distributed machine learning library.
- Why it’s useful: Can train models on very large datasets across multiple nodes.
- Supported algorithms: Classification, regression, clustering, recommendation, etc.
- Example Use: Train a linear regression model on a big dataset:
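A hedged MLlib sketch for that example is shown below. The input path, the feature and label column names, and the schema are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("MLlibDemo").getOrCreate()

# Placeholder dataset: house sizes/rooms and their prices
data = spark.read.csv("/data/houses.csv", header=True, inferSchema=True)

# MLlib expects all features combined into a single vector column
assembler = VectorAssembler(inputCols=["size", "rooms"], outputCol="features")
train_df = assembler.transform(data).select("features", "price")

lr = LinearRegression(featuresCol="features", labelCol="price")
model = lr.fit(train_df)

print(model.coefficients, model.intercept)
spark.stop()
```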
5️⃣ GraphX
- What it is: Module for graph and network analysis.
- Why it’s useful: Analyze relationships in data, like social networks.
- Example Use: Find the shortest path between users or recommend friends.
- Note: GraphX itself has no Python API, so in PySpark the GraphFrames package is usually preferred.
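For completeness, here is a hedged GraphFrames sketch of the shortest-path example. GraphFrames is a separate package that has to be added to the Spark session (e.g. via `--packages`); the vertex and edge data below are made up for illustration.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame   # external package, not part of core PySpark

spark = SparkSession.builder.appName("GraphDemo").getOrCreate()

# Vertices need an "id" column, edges need "src" and "dst" columns
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "friend"), ("b", "c", "friend")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)

# Shortest path (in hops) from every user to user "a"
g.shortestPaths(landmarks=["a"]).show()

spark.stop()
```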