Pyspark

What is Apache Spark?

Apache Spark is an open-source distributed data processing engine designed for big data analytics.

It allows you to process large datasets across multiple machines (clusters) quickly and efficiently.

Think of it as a supercharged, scalable version of Python’s Pandas or SQL that works on massive data distributed across many servers.

Spark is written in Scala.


What is PySpark?

PySpark is the Python API for Apache Spark. It allows you to write Spark applications in Python.

In other words:

  • Apache Spark = the big data processing engine (written in Scala, runs on JVM).

  • PySpark = a way to use Spark with Python.


Key Features of Spark

  1. Distributed Computing:

    • Data is split into chunks and processed in parallel across a cluster.

    • Can handle petabytes of data.

  2. Fast Processing:

    • Uses in-memory computation, which is faster than disk-based systems like Hadoop MapReduce.

    • Optimized DAG (Directed Acyclic Graph) execution for tasks.

  3. Multi-Language Support:

    • Scala (native)

    • Java

    • Python (PySpark)

    • R

    • SQL

  4. Unified Engine:
    Spark can handle multiple workloads:

    • Batch processing → Big files (CSV, Parquet, JSON)

    • Stream processing → Real-time data (Kafka, sockets)

    • Machine Learning → MLlib

    • Graph Processing → GraphX

  5. Fault-Tolerant:

    • Uses RDDs (Resilient Distributed Datasets) that can recover lost data automatically.

=============================================

🔹 1. Monolithic System (Single-node processing / vertical scaling)

A monolithic system means everything runs on one machine (one CPU, one memory space).

All components (UI, DB, API, etc.) are tightly connected and run as one application.

Characteristics

  • Runs on a single server (everything depends on that one machine)
  • Tight coupling (all components together)
  • Processes data sequentially, or with limited parallelism
  • Limited scalability
  • RAM/CPU can be increased, but only up to a hardware limit

📌 Example

  • Traditional Python script using Pandas
  • Single-node database queries

Problems

  • Cannot handle big data
  • Memory becomes bottleneck
  • Slow for large-scale processing

🔹 2. Distributed Processing (Spark)

Apache Spark is a distributed system, meaning it runs across multiple machines (cluster).

The application is split into many small components, each running independently on different machines/servers.

Distributed computing is a computing model where large computational tasks are divided and executed across multiple machines (nodes) that work in parallel. Think of it as breaking a huge job into smaller parts and assigning each part to a different worker. Its key features include:

  • Speed: Multiple nodes work simultaneously.
  • Scalability: Add more nodes to handle more data.
  • Fault tolerance: If one node fails, others continue.

Characteristics

  • Data is split across multiple nodes
  • Parallel processing
  • Fault-tolerant (if one node fails, others continue)
  • Scales horizontally (add more machines)
  • High availability (no single point of failure)

📌 Core Idea

Instead of:
👉 “1 machine doing all work”

Spark does:
👉 “Many machines doing parts of work together”

=============================================

Why Spark Replaced Hadoop MapReduce

What is MapReduce?

MapReduce is a distributed data processing model introduced by
Google, and widely used in Apache Hadoop.

👉 It processes large datasets in parallel across clusters.

🔹1. Core Idea

MapReduce works in 2 main phases:

1️⃣ Map Phase

  • Takes input data
  • Breaks it into smaller chunks
  • Processes each chunk
  • Outputs key-value pairs

Example:

Input:

Hello World Hello

Map Output:

(Hello, 1)
(World, 1)
(Hello, 1)

2️⃣ Reduce Phase

  • Groups same keys
  • Aggregates values

Output:

(Hello, 2)
(World, 1)

🔹 Simple Flow

Input Data → Map → Shuffle & Sort → Reduce → Final Output
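The flow above can be sketched in plain Python (no Hadoop required) — `map_phase`, `shuffle_sort`, and `reduce_phase` are illustrative names for this sketch, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in the input chunk
    return [(word, 1) for word in chunk.split()]

def shuffle_sort(pairs):
    # Shuffle & Sort: group identical keys together
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key
    return {key: sum(values) for key, values in groups.items()}

pairs = map_phase("Hello World Hello")
result = reduce_phase(shuffle_sort(pairs))
print(result)  # {'Hello': 2, 'World': 1}
```

Same input and output as the word-count example above — the real framework only adds distribution and fault tolerance around these three steps.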


🏗️ 2. Internal Components

🧠 Master Node

  • Job scheduling
  • Resource allocation

In modern Hadoop:

  • YARN handles resources

⚙️ Worker Nodes

  • Execute map & reduce tasks
  • Store data blocks 

🧠 Important Step: Shuffle & Sort

  • Groups same keys together
  • Happens between Map and Reduce
  • Most expensive step ⚠️


🔥 3. Key Concepts 

✅ Data Locality

👉 “Move compute to data, not data to compute”

  • Mapper runs where data is stored
  • Reduces network cost

✅ Fault Tolerance

  • If a node fails:
    • Task is re-executed on another node
  • Based on replication in HDFS

✅ Parallelism

  • Multiple mappers + reducers run simultaneously
  • Massive scalability

✅ Partitioning

  • Decides which reducer gets which key

Default:

hash(key) % number_of_reducers
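The default partitioner can be sketched in plain Python (Hadoop actually uses Java's `hashCode`, so the exact bucket differs, but the principle is the same):

```python
def default_partitioner(key, num_reducers):
    # hash(key) % number_of_reducers decides which reducer gets this key
    return hash(key) % num_reducers

# The same key always maps to the same reducer within a run,
# so all values for one key end up on one reducer
assert default_partitioner("Hello", 4) == default_partitioner("Hello", 4)
# The result is always a valid reducer index
assert 0 <= default_partitioner("World", 4) < 4
```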


⚡ 4. Types of MapReduce Jobs

1. Map-only Job

  • No reducer
  • Used for filtering, transformations

2. Map + Reduce Job

  • Full processing (aggregation, joins)


🚫 5. Limitations (Why Spark Replaced It)

Compared to Apache Spark:

❌ Disk I/O heavy

  • Writes to disk after every step

❌ High latency

  • Not real-time

❌ Not iterative-friendly

  • Poor for ML, graph processing


🔹 Example Use Cases

  • Word count (most famous)
  • Log analysis
  • ETL jobs
  • Indexing (search engines)

🔹 MapReduce vs Spark

Feature    | MapReduce (Hadoop) | Spark
Processing | Disk-based         | In-memory
Speed      | Slow               | Fast ⚡
Iterations | Poor               | Excellent
Complexity | More code          | Easier APIs
Real-time  | Not suitable       | Supports real-time

👉 That’s why modern systems prefer Apache Spark over MapReduce.

🔹 Limitations of MapReduce

  • Writes intermediate data to disk (slow ❌)
  • Not good for iterative algorithms (ML, graph)
  • Complex coding (Java-heavy)

=============================================

Vertical vs Horizontal Scaling

🔥 1. Core Idea (Simple)

  • Vertical Scaling → Increase power of one machine
  • Horizontal Scaling → Increase number of machines

🧠 2. Vertical Scaling (Scale Up ⬆️)

👉 Upgrade existing system

📌 Example:

  • Add more RAM
  • Add more CPU
  • Upgrade disk

🖥️ Visualization

1 Server → Bigger Server

Advantages

  • Simple to implement
  • No code changes required

Disadvantages

  • Limited (hardware limit)
  • Expensive
  • Single point of failure

🧠 3. Horizontal Scaling (Scale Out ➡️)

👉 Add more machines to cluster

🖥️ Visualization

1 Server → 5 Servers → 100 Servers

✅ Advantages

  • Highly scalable
  • Fault tolerant
  • Cost-effective

❌ Disadvantages

  • Complex setup
  • Requires distributed system design

📊 4. Comparison Table

Feature         | Vertical Scaling ⬆️     | Horizontal Scaling ➡️
Approach        | Increase machine power | Increase number of machines
Scalability     | Limited                | Unlimited
Cost            | Expensive              | Cost-effective
Fault Tolerance | ❌ Low                  | ✅ High
Complexity      | Simple                 | Complex
Example         | Upgrade server RAM     | Add more nodes

👉 Spark uses Horizontal Scaling

  • Data distributed across nodes
  • Tasks run in parallel
  • Cluster grows as needed

=============================================

Apache Spark Architecture (master - slave)

Spark is a distributed computing system, meaning it runs across a cluster of machines. 

Its architecture is designed for speed, fault tolerance, and scalability.

Apache Spark architecture is a master–worker system where a central driver coordinates multiple worker nodes to process data in parallel.

🧠 1. High-Level Architecture

            Driver Program
                  |
      +-----------+-----------+
      |           |           |
  Executor    Executor    Executor
  (Node 1)    (Node 2)    (Node 3)

🔹 2. Core Components


🧭 1. Driver Program (Master Brain)

👉 The Driver is the main controller / brain of the Spark application.

Responsibilities:

  • Creates the SparkSession
  • Reads the user code
  • Converts code → DAG (Directed Acyclic Graph)
  • Requests resources from the cluster manager
  • Schedules tasks across executors
  • Tracks job progress and task status
  • Communicates with executors
  • Handles failures

👉 Think:
“Driver = brain of Spark” which decides what to be done and who will do it.


⚙️ 2. Worker Node (executor + tasks)

👉 Every worker node runs one or more executor processes

Think of an executor as an employee; tasks are the to-do items assigned by the driver.

Executors do the actual work:

  • Execute tasks
  • Store data (in-memory or on disk)
  • Store shuffle data
  • Return results to the driver

👉 Each executor has:

  • CPU cores
  • Memory


📦 3. Cluster Manager

👉 Manages resources across cluster
👉 Allocates CPU and memory to Spark jobs.

Types:

  • YARN
  • Kubernetes
  • Standalone manager (Spark's own)

Role:

  • Starts executors on worker machines
  • Allocates executors
  • Manages resources
  • Replaces crashed executors

🔄 3. Execution Flow (Step-by-Step)


                           

Explain how a Spark job runs internally?

👉User code → DAG → Job → Stages → Tasks → Execution on cluster → Result

User Code

Lazy Evaluation

DAG Creation

Optimization (Catalyst)

Action → Job

Stages (split by shuffle)

Tasks (per partition)

Executors run tasks

Result to Driver

Step 1: Code Submission

Spark Application: You write your program (Python, Scala, Java, R).

df.filter().groupBy().count()

👉 You define transformations + action


Step 2: DAG Creation / Driver Program Starts / Lazy evaluation

2.1. Lazy Evaluation

  • Transformations are not executed immediately
  • Spark builds a logical plan (DAG)

2.2. Driver converts logic → DAG

👉 DAG = sequence of transformations (logical plan)

👉 Driver = brain

2.3. Driver Program: SparkSession/SparkContext runs on the driver. It:

    • Connects to cluster manager (like YARN)

    • Divides work into tasks

    • Builds a DAG (Directed Acyclic Graph) of transformations

    • Spark optimizes DAG (removes unnecessary steps)
    • Uses internal optimizations (Catalyst for DataFrames)
    • Sends tasks to executors

2.4. Catalyst Optimizer (for DataFrames)

👉 Optimizes the plan:

  • Predicate pushdown
  • Column pruning
  • Join optimization

Step 3: 

3.1. Job Creation — an action triggers execution, e.g. .show()
     ✅ Creates 1 Job

3.2. Stage Division (DAG → Stages)

  • Split DAG into stages
  • Whenever a shuffle happens → Spark creates a new stage
  • Based on shuffle boundaries
  • stage is a set of transformations that can run without data movement
  • Shuffle = data moves across partitions/nodes

Example:

  • filter → narrow transformation → no shuffle → same stage
  • groupBy → wide transformation → shuffle → new stage
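The stage-splitting rule can be sketched as a toy stage counter (illustrative only — Spark's planner is far more involved, and the set of wide operations here is an assumption for the sketch):

```python
# Wide (shuffle) transformations assumed for this sketch
WIDE = {"groupBy", "join", "repartition", "reduceByKey"}

def count_stages(plan):
    # Each shuffle boundary starts a new stage, so:
    # stages = 1 + number of wide transformations in the plan
    return 1 + sum(1 for op in plan if op in WIDE)

print(count_stages(["filter", "select"]))           # 1 (all narrow → one stage)
print(count_stages(["filter", "groupBy"]))          # 2 (one shuffle boundary)
print(count_stages(["filter", "groupBy", "join"]))  # 3 (two shuffle boundaries)
```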

Step 4: Stage → Task 

4.1. Task Creation

👉 Each stage is divided into tasks

  • 1 partition = 1 task
  • 4 partitions → 4 tasks

4.2. Task Scheduling

👉 Components:

  • Driver → coordinates execution and asks the cluster manager for executors
  • Cluster manager (YARN / Standalone) →
    1. Starts executors on worker nodes
    2. Allocates executors, CPU, and memory
  • Executors →
    1. Run tasks in parallel
    2. Store data in memory

👉 Parallel execution starts here ⚡


Step 5: 

Task Execution (on Executors)

  • Tasks sent to executors
  • Executors:

    • Process data
    • Perform transformations
  • Shuffle (if needed)

    • Happens in wide transformations (groupBy, join)
    • Data moves across nodes

    👉 Most expensive step ⚠️

  • Execution happens only when action is called
    - count()
    - collect()
    - show()

    👉 Lazy evaluation ends here


Step 6: Result Return : 

Executors return results to driver, driver aggregates and produces output.

  • Executors → Driver

🔥 4. Important Concepts

✅ DAG (Directed Acyclic Graph)

  • Logical execution plan
  • Optimizes execution

✅ Lazy Evaluation

👉 Spark does NOT execute immediately

  • Builds plan first
  • Executes only on action

Example:

df.filter() # no execution
df.show() # execution starts

✅ Transformations vs Actions

Transformations (lazy)

  • map, filter, groupBy

Actions (trigger execution)

  • show(), collect(), count()
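The lazy pattern can be mimicked in plain Python: transformations only record a plan, and an action executes it. This is a toy sketch (the `LazyDataset` class is invented for illustration, not Spark's implementation):

```python
class LazyDataset:
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []  # recorded transformations (the "DAG")

    def filter(self, predicate):
        # Transformation: record the step, execute nothing
        return LazyDataset(self.data, self.plan + [("filter", predicate)])

    def map(self, fn):
        # Transformation: also lazy, also returns a NEW dataset (immutability)
        return LazyDataset(self.data, self.plan + [("map", fn)])

    def collect(self):
        # Action: run the whole recorded plan now
        out = self.data
        for op, fn in self.plan:
            out = [x for x in out if fn(x)] if op == "filter" else [fn(x) for x in out]
        return out

ds = LazyDataset([1, 2, 3, 4]).filter(lambda x: x > 1).map(lambda x: x * 10)
# Nothing has executed yet; collect() triggers the plan
print(ds.collect())  # [20, 30, 40]
```

Spark does the same, except that before running the plan it also optimizes it (Catalyst) and distributes it across executors.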

⚡ 5. Types of Transformations

🔹 Narrow Transformation

  • No shuffle
  • Fast

Example:

  • map()
  • filter()

🔹 Wide Transformation

  • Requires shuffle
  • Expensive

Example:

  • groupBy()
  • join()

🔀 6. Shuffle (Critical Concept)

👉 Data moves across nodes

  • Happens in wide transformations
  • Costly (network + disk I/O)

💾 7. Storage (Memory Management)

Spark uses:

  • In-memory storage (fast)
  • Disk (fallback)

👉 Supports:

  • Cache / Persist

⚡ 8. Fault Tolerance

Spark uses lineage

👉 If data lost:

  • Recompute from original transformations

(No need for replication like Hadoop)

=============================================

👉 Why is Apache Spark fast?

🔥 1. In-Memory Processing (Biggest Reason 🚀)

👉 Spark processes data in RAM instead of disk

Disk (Hadoop) ❌ → Slow
Memory (Spark) ✅ → Fast

✅ Benefit:

  • Avoids repeated disk I/O
  • Much faster computation

🔥 2. Lazy Evaluation + DAG Optimization

👉 Spark doesn’t execute immediately
👉 Builds a DAG (execution plan)

  • Combines operations
  • Removes unnecessary steps

✅ Example:

filter → select → groupBy

👉 Executed together (optimized)


🔥 3. Catalyst Optimizer (Smart Query Engine)

👉 Automatically optimizes queries:

  • Predicate pushdown
  • Column pruning
  • Join reordering

✅ Result:

  • Less data processed
  • Faster execution

🔥 4. Tungsten Execution Engine

👉 Low-level optimization:

  • Efficient memory management
  • Binary processing (less serialization)
  • Code generation

✅ Result:

  • CPU-efficient execution

🔥 5. Parallel Processing

👉 Data split into partitions

100 partitions → 100 tasks → parallel execution

✅ Result:

  • Massive speed-up

🔥 6. Reduced Disk I/O

👉 Compared to MapReduce:

  • MapReduce → writes to disk after each step ❌
  • Spark → keeps data in memory ✅

🔥 7. Efficient Shuffle Handling

👉 Optimized data movement:

  • Combines data before shuffle
  • Uses compression

🔥 8. Caching / Persistence

df.cache()

👉 Reuses data instead of recomputing

📊 Summary Table

Reason               | Impact
In-memory processing | Faster than disk
Lazy evaluation      | Optimized execution
Catalyst optimizer   | Smart query planning
Tungsten engine      | CPU + memory efficiency
Parallel processing  | Faster computation
Caching              | Avoid recomputation

=============================================

👉 ApplicationMaster vs Driver in Apache Spark (especially on YARN)

🔥 1. Core Idea (Simple)

  • Driver → Brain of Spark application
  • ApplicationMaster (AM) → Resource manager coordinator (in YARN)

👉 Driver = Executes logic
ApplicationMaster = Manages resources

👉 In YARN, the ApplicationsManager (a part of the ResourceManager) accepts, tracks, and manages applications (jobs) submitted to the cluster — not to be confused with the per-application ApplicationMaster.


📊 2. Comparison Table

Feature             | Driver Program                 | ApplicationMaster (YARN)
Role                | Main controller of Spark app   | Manages resources for the app
Responsibility      | DAG creation, scheduling tasks | Requests containers from YARN
Runs user code?     | ✅ Yes                          | ❌ No
Part of Spark?      | ✅ Yes                          | ❌ No (part of YARN)
Location            | Client / cluster node          | Always inside cluster
Controls executors? | ✅ Yes                          | Indirectly via YARN

🧠 3. Driver Program (Spark)

👉 Core responsibilities:

  • Creates DAG
  • Splits into stages & tasks
  • Sends tasks to executors
  • Collects results

👉 Think:
“Driver = brain + scheduler”


⚙️ 4. ApplicationMaster (YARN)

👉 Exists only when using YARN

Responsibilities:

  • Negotiates resources with YARN
  • Requests containers
  • Launches executors
  • Monitors application

👉 Think:
“AM = resource manager for your app”


🔄 5. How They Work Together

Flow:

User submits Spark job

ApplicationMaster starts (via YARN)

AM allocates resources (containers)

Driver runs (inside AM OR separately)

Driver requests executors

Executors launched on worker nodes

Driver sends tasks → Executors

📍 6. Deployment Modes (VERY IMPORTANT)

🖥️ Client Mode

  • Driver runs on client machine
  • AM only manages resources

👉 If client dies → job fails ❌


☁️ Cluster Mode

  • Driver runs inside ApplicationMaster

👉 More reliable ✅

=============================================

Py4J 



👉 Py4J is a library that allows Python programs to interact with Java objects running in the driver's JVM.

🧠 Why Py4J is Needed in Spark?

👉 Because Apache Spark is written in Scala/Java, not Python.

So when you write:

df.show()

👉 Python cannot directly execute Spark code ❌
👉 It uses Py4J to communicate with the JVM ✅

PySpark slower than Scala Spark?

👉 Yes (slightly), due to Py4J overhead


📦 Example

df = spark.read.csv("file.csv")
df.show()

👉 Internally:

  • Python → Py4J → JVM → Spark execution → Result → back to Python

🔥 Key Responsibilities of Py4J

  • Enables Python ↔ Java communication
  • Converts Python calls → JVM calls
  • Transfers data between Python & Spark engine
=============================================

👉 SparkContext vs SparkSession

🔥 1. Core Difference (Simple)

  • SparkContext (SC) → Old entry point (low-level)
  • SparkSession → New unified entry point (high-level)

👉SparkSession = wrapper over SparkContext + SQLContext + HiveContext

🧠 Why Do We Need It?

Before running any Spark code, you must:

  • Initialize Spark
  • Set configurations
  • Connect to the cluster

👉 That’s exactly what the builder does

⚡ Important Points

✅ Creates SparkContext internally

  • You don’t need to create it manually

📊 2. Comparison Table

Feature            | SparkContext             | SparkSession
Introduced in      | Spark 1.x                | Spark 2.x+
Level              | Low-level API            | High-level unified API
Purpose            | Connect to cluster       | Entry point for all Spark features
Supports SQL       | ❌ No                     | ✅ Yes
Supports DataFrame | ❌ No (needs SQLContext)  | ✅ Yes
Ease of use        | Complex                  | Simple
Usage today        | Rare                     | Standard

🧠 3. SparkContext (SC)

👉 Old way to start Spark

Responsibilities:

  • Connect to cluster
  • Create RDDs
  • Manage execution

Example:

from pyspark import SparkContext
sc = SparkContext("local", "App")
rdd = sc.parallelize([1,2,3])

👉 Limitation:

  • No direct SQL/DataFrame support

🚀 4. SparkSession (Modern Way)

👉 Introduced to simplify everything

Responsibilities:

  • Entry point for:
    • RDD
    • DataFrame
    • SQL
    • Hive

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName("App") \
.getOrCreate()

df = spark.read.csv("file.csv")

🔗 5. Relationship

Inside SparkSession:

SparkSession
├── SparkContext
├── SQLContext
└── HiveContext

👉 You can still access:

spark.sparkContext

🎯 6. When to Use What?

Use SparkSession ✅

  • Always (modern Spark)
  • DataFrame / SQL work
  • Production pipelines

Use SparkContext ❗

  • Rare cases
  • Low-level RDD operations
  • Legacy code

👉“SparkContext is the original low-level entry point for RDD-based processing, while SparkSession is the modern unified entry point that provides access to RDDs, DataFrames, SQL, and Hive functionalities.”

=============================================

1️⃣ What is Lazy Evaluation?

👉 Lazy evaluation means PySpark does NOT execute transformations immediately.

Instead, it builds a logical plan (DAG) and executes it only when an action is called.

In simple words:

Transformations are lazy, Actions are eager


2️⃣ Transformations vs Actions (Core idea)

🔹 Transformations (Lazy)

These do not trigger execution:

  • select()

  • filter()

  • map()

  • withColumn()

  • join()

  • groupBy()

They only build a logical execution plan (DAG).

df2 = df.filter(df.age > 25)
df3 = df2.select("name", "age")
👉 Nothing runs yet ❌
👉 Spark is just planning the execution

🔹 Actions (Trigger execution)

These force Spark to execute the plan:

  • show()
  • count()
  • collect()
  • take()
  • write()

df3.show()   # 🔥 Execution happens here

👉 Starts the actual computation


3️⃣ What happens internally? (Spark flow)

When you write PySpark code:

Step 1: Build Logical Plan

df = spark.read.csv("employees.csv")
df2 = df.filter(df.salary > 50000)
df3 = df2.groupBy("dept").count()

Spark just creates a DAG (Directed Acyclic Graph):

Read CSV → Filter → GroupBy → Count

❌ No execution yet


Step 2: Action triggers execution

df3.show()

Now Spark:

  1. Optimizes the plan (Catalyst Optimizer)

  2. Creates a physical execution plan

  3. Submits jobs to executors

  4. Reads data + performs transformations


4️⃣ Why does Spark use Lazy Evaluation?

✅ 1. Performance Optimization

Spark can:

  • Reorder operations

  • Push filters early (predicate pushdown)

  • Remove unused columns

  • Combine transformations

Example:

df.select("name").filter(df.age > 30)

Spark internally optimizes to:

Filter → Select

✅ 2. Avoids Unnecessary Computation

If no action is called, Spark does nothing.

df.filter(df.salary > 100000) # No action → No job → No compute

✅ 3. Fault Tolerance

Spark tracks lineage, not data.Uses lineage (can recompute lost data)
If a partition fails, Spark recomputes it using the DAG.


5️⃣🔥 Example: Without Thinking vs Spark Optimization

df = spark.read.csv("data.csv")
df1 = df.filter(df.age > 25)
df2 = df1.filter(df.salary > 50000)
df3 = df2.select("name", "salary")
df3.show()

❌ What You Might Think Happens

Step 1: Read data
Step 2: Apply filter (age > 25)
Step 3: Store result
Step 4: Apply filter (salary > 50000)
Step 5: Store result
Step 6: Select columns
Step 7: Show result

👉 Multiple steps, multiple intermediate outputs (slow ❌)


✅ What Spark Actually Does (Lazy Optimization)

Because of lazy evaluation, Spark builds a DAG and optimizes it:

Read → Combined Filter → Select → Show

👉 Both filters are merged into ONE operation 🚀


⚡ Optimized Execution

Instead of:

filter(age) → filter(salary)

Spark does:

filter(age AND salary)


🔥 Internal Optimization (Catalyst)

Spark’s optimizer:

  • Combines filters ✅
  • Pushes filters down to data source (Predicate Pushdown) ✅
  • Removes unnecessary columns (Column Pruning) ✅

📊 Another Example (Even Better)

🧾 Code

df = spark.read.csv("data.csv")

df.select("name", "age") \
.filter(df.age > 30) \
.filter(df.name.startswith("A")) \
.show()


🚀 Spark Optimization

Read only required columns → Apply combined filter → Output

👉 Optimizations:

  • Reads only name, age (not full table)
  • Combines filters
  • Executes in one pass 


6️⃣ DAG Visualization

You can see lazy evaluation in action using:

df3.explain(True)

This shows:

  • Parsed Logical Plan

  • Optimized Logical Plan

  • Physical Plan


7️⃣ Caching & Lazy Evaluation (Very Important)

df2.cache()
df2.count()  # First action → computation + cache
df2.show()   # Second action → reused from cache

⚠️ cache() itself is lazy
Data is cached only when an action runs


8️⃣ Real-world Data Engineering Example

In ETL pipelines:

clean_df = raw_df.filter("status = 'ACTIVE'") \
    .withColumn("load_date", current_date())

clean_df.write.mode("overwrite").parquet("/curated")

  • Spark waits

  • Optimizes

  • Executes everything once

  • Writes final output

🚀 This is why Spark scales so well.


================================================

🔥 What is a Partition?

👉 A partition in Spark is a logical division of data that enables parallel processing; each partition is processed by a separate task, and proper partitioning is critical for performance optimization.

👉 A partition is a chunk of distributed data.

  • Spark splits data into partitions
  • Each partition is processed in parallel
  • Each partition is processed by exactly one task
  • Each task runs on one executor core

A core = a CPU unit inside an executor
1 core = 1 task at a time
1 executor with 4 cores → can run 4 tasks in parallel
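That arithmetic can be captured in a back-of-the-envelope helper (plain Python, illustrative only — `task_waves` is not a Spark API): tasks run in "waves" of `ceil(partitions / total cores)`:

```python
import math

def task_waves(num_partitions, executors, cores_per_executor):
    # Each core runs one task at a time,
    # so total parallel slots = executors * cores per executor
    slots = executors * cores_per_executor
    return math.ceil(num_partitions / slots)

# 4 partitions on 1 executor with 4 cores → everything runs in one wave
print(task_waves(4, 1, 4))    # 1
# 200 partitions on 5 executors with 4 cores each → 10 sequential waves
print(task_waves(200, 5, 4))  # 10
```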

🧠 Simple Idea

Big Data → Split → Partition1 | Partition2 | Partition3

👉 Each partition = one task


⚡ Why Partitions Matter?

  • Enable parallel processing
  • Improve performance
  • Control resource utilization

👉 More partitions → more parallelism 🚀


🔥 1. What is Partition Size?

👉Partition size = amount of data inside one partition

🧠 Example

Total data = 100 MB
Partitions = 5

👉 Each partition ≈ 20 MB

⚡ Why Important?

  • Too large → slow task ⚠️
  • Too small → too many tasks ⚠️

✅ Ideal Size (Rule of Thumb)

👉100 MB – 200 MB per partition
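Applying the rule of thumb, a target partition count can be estimated with a tiny helper (illustrative sketch, not a Spark API; 128 MB is an assumed middle of the 100–200 MB range):

```python
def estimate_partitions(total_size_mb, target_partition_mb=128):
    # Aim for roughly 100–200 MB per partition; never go below 1 partition
    return max(1, round(total_size_mb / target_partition_mb))

print(estimate_partitions(100))     # 1  (100 MB fits in a single partition)
print(estimate_partitions(10_000))  # 78 (~128 MB per partition)
```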


🔥 2. What is getNumPartitions()?

👉Returns number of partitions

📌 Example

rdd.getNumPartitions()

👉 Output: 5

🧠 Meaning

👉RDD is divided into 5 partitions → 5 tasks

Example Together

rdd = sc.parallelize(range(100), 4)

👉 Results:

  • getNumPartitions() → 4
  • Partition size → ~25 elements each

⚠️ Important Point

👉Spark does NOT directly give partition size

You estimate it based on:

  • Total data size
  • Number of partitions


🔹 Example

rdd = spark.sparkContext.parallelize([1,2,3,4,5,6], 3)

👉 Output:

  • Partition 1 → [1,2]
  • Partition 2 → [3,4]
  • Partition 3 → [5,6]

👉# Tasks = # Partitions
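How `parallelize` splits a list can be approximated in plain Python (a sketch — Spark's actual slicing logic differs slightly, and `split_into_partitions` is an invented helper name):

```python
def split_into_partitions(data, num_partitions):
    # Split the list into roughly even chunks, like parallelize([...], n)
    size, rem = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        # Spread any remainder one element at a time over the first partitions
        end = start + size + (1 if i < rem else 0)
        parts.append(data[start:end])
        start = end
    return parts

print(split_into_partitions([1, 2, 3, 4, 5, 6], 3))  # [[1, 2], [3, 4], [5, 6]]
```

This matches the example above: 6 elements over 3 partitions → one task per 2-element chunk.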


🔥 Types of Partitioning

1️⃣ Hash Partitioning (Default)

  • Uses: hash(key) % number_of_partitions

👉 Used in:

  • groupBy
  • reduceByKey


2️⃣ Range Partitioning

  • Data divided based on value ranges

👉 Useful for:

  • Sorted data
  • Range queries

🔄 Partition Operations

🔹 Repartition (Increase/Decrease)

increases OR decreases partitions (with shuffle)

👉 Used when:

  • You want more partitions
  • You want even data distribution

df.repartition(10)

[Data] → Shuffle → Equal partitions (balanced)

  • Full shuffle ⚠️
  • Expensive, but balanced


🔹 Coalesce (Reduce Only)

mainly decreases partitions (without full shuffle)

👉 Used when:

  • You want to reduce partitions
  • You want to avoid a heavy shuffle

df.coalesce(2)

[Data] → Merge partitions → Fewer partitions (not evenly balanced)

  • No full shuffle → faster
  • May cause data imbalance ⚠️

🚫 Common Mistakes

❌ Using repartition() unnecessarily → slow job
❌ Using coalesce() for increasing partitions → not possible


🔀 Partitioning & Shuffle

👉 Shuffle happens when:

  • Data needs to move across partitions

Example:

df.groupBy("dept").count()

👉 Spark reshuffles data so same keys go to same partition


💾 Partition Storage

  • In memory (fast)
  • Disk (fallback)

🚫 Problems with Bad Partitioning

❌ Too Few Partitions

  • Less parallelism
  • Slow execution

❌ Too Many Partitions

  • Overhead of task scheduling

❌ Data Skew ⚠️

  • One partition has more data
  • Causes slow jobs
================================================

🔥 What is Bucketing?

👉Bucketing is a technique to divide data into a fixed number of files (buckets) using a hash function on a column.

👉 Bucketing in Spark distributes data into a fixed number of files using a hash function on a column, which helps optimize joins and aggregations by reducing shuffle.

🧠 Core Idea

bucket_number = hash(column) % number_of_buckets

👉 Same values → always go to same bucket
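The formula in plain Python — the same value always hashes to the same bucket (a toy sketch; Spark uses its own hash function, not Python's, so actual bucket numbers differ):

```python
from collections import defaultdict

def bucket_for(value, num_buckets=4):
    # bucket_number = hash(column_value) % number_of_buckets
    return hash(value) % num_buckets

buckets = defaultdict(list)
for user_id in [101, 202, 303, 101, 202, 101]:
    buckets[bucket_for(user_id)].append(user_id)

# Every row for a given user_id lands in the same bucket file,
# so a bucketed join on user_id can match rows without a shuffle
assert buckets[bucket_for(101)].count(101) == 3
```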


⚙️ How It Works

Example:

df.write \
.bucketBy(4, "user_id") \
.saveAsTable("users_bucketed")

👉 Result:

  • Data is split into 4 bucket files
  • Based on user_id

📊 Visualization

Data → hash(user_id) → Bucket 0
                       Bucket 1
                       Bucket 2
                       Bucket 3

🔥 Why Bucketing is Important?

✅ 1. Faster Joins

If two tables:

  • Use same bucket column
  • Same number of buckets

👉 Spark can avoid shuffle 🚀

✅ 2. Better Aggregations

  • groupBy becomes faster
  • Data already distributed

✅ 3. Controlled File Count

  • Unlike partitioning (dynamic)
  • Bucketing = fixed number of files

⚡ Example (Join Optimization)

Without Bucketing ❌

Join → Shuffle → Slow

With Bucketing ✅

Same bucket → Direct join → Fast

🧾 SQL Example

CREATE TABLE users (
id INT,
name STRING
)
USING parquet
CLUSTERED BY (id) INTO 4 BUCKETS;

📊 Key Characteristics

Feature           | Bucketing
Based on          | Hash function
Number of buckets | Fixed
Storage           | Files (not folders)
Use case          | Joins, aggregations
Shuffle reduction | Yes

🔄 Bucketing vs Partitioning (Quick View)

🔥 Core Idea (1-liner)

  • Partitioning → splits data into folders based on column values
  • Bucketing → splits data into fixed number of files using hash

🧠 Partitioning 

🔹 How it Works

Data → Split by column value → Stored in folders

Example:

df.write.partitionBy("country").save("data/")

👉 Storage:

data/country=India/
data/country=USA/

✅ Best For

  • Filtering queries
  • Large datasets with distinct values

❌ Problem

  • Too many unique values → too many small files ⚠️


⚙️ Bucketing (Detail)

🔹 How it Works

bucket = hash(column) % num_buckets

Example:

df.write.bucketBy(10, "user_id").saveAsTable("table")

👉 Creates:

10 bucket files

✅ Best For

  • Joins
  • GroupBy operations
  • Large tables

🚀 Advantage

👉 Reduces shuffle during joins


📊 Query:

SELECT * FROM orders WHERE country = 'India'

👉 Partitioning:

  • Reads only country=India folder ✅ (fast)

📊 Join:

SELECT * FROM orders o JOIN users u ON o.user_id = u.user_id

👉 Bucketing:

  • Same bucket → no shuffle needed ✅ 

🧠When to Use What?

Use Partitioning:

  • Low cardinality columns (country, date)
  • Filtering queries

Use Bucketing:

  • High cardinality columns (user_id)
  • Frequent joins


🚫 Limitations

❌ Only works with tables

  • Requires saveAsTable()

❌ Needs same config for joins

  • Same column
  • Same number of buckets

❌ Not always automatically used

  • Spark may still shuffle if conditions not met

================================================

🔥 What is RDD?

👉RDD is the fundamental distributed data structure in Spark that represents an immutable collection of data processed in parallel with fault tolerance through lineage.

An RDD is a list of logical partitions that can be distributed among executors.

A partition is a small chunk of an RDD/DataFrame.

🧠 Key Features

✅ 1. Resilient (Fault-tolerant)

  • Can recover lost data using lineage

✅ 2. Distributed

  • Data is split into partitions across nodes

✅ 3. Immutable

  • Cannot be changed after creation
  • New RDD is created after each operation

✅ 4. Lazy Evaluation

  • Transformations are not executed immediately


📊 Simple Example

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([1,2,3,4])
rdd2 = rdd.map(lambda x: x * 2)

print(rdd2.collect())

🔄 How RDD Works Internally

RDD → Transformations → DAG → Execution → Result

🔹 Types of RDD

1️⃣ Parallelized RDD(Collection RDD)

👉 Created from a local collection (list, array, etc.)
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

🧠 Key Points

  • Data comes from driver memory
  • Used for:
    • Testing
    • Small datasets
  • Not used for big data processing ❌


2️⃣ External RDD (Hadoop RDD)

👉 Created from external storage systems

rdd = spark.sparkContext.textFile("hdfs://data/file.txt")

🧠 Data Sources

  • HDFS
  • S3
  • Local file system

🧠 Key Points

  • Used in real-world big data
  • Distributed across the cluster
  • Scalable

🧠 Lineage (VERY IMPORTANT)

👉 RDD keeps track of how it was created

Example:

rdd → map → filter → reduce

👉 If data lost → recomputed using lineage

⚡ Example of Fault Tolerance

  • If one partition fails:
    👉 Spark recomputes only that partition
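Lineage-based recovery can be sketched in plain Python: the "RDD" stores the recipe, not the data, so a lost partition can be rebuilt by replaying the recipe on the source (a toy illustration of the idea, not Spark internals):

```python
# Lineage: the recorded chain of transformations, not the data itself
lineage = [
    ("map", lambda x: x * 2),
    ("filter", lambda x: x > 2),
]

def recompute_partition(source_partition):
    # Replay the recorded transformations on the original input
    data = source_partition
    for op, fn in lineage:
        data = [fn(x) for x in data] if op == "map" else [x for x in data if fn(x)]
    return data

# Partition "lost" from memory → rebuild only that partition from source + lineage
print(recompute_partition([1, 2, 3]))  # [4, 6]
```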

📊 RDD vs DataFrame

Feature      | RDD       | DataFrame
Level        | Low-level | High-level
Performance  | Slower    | Faster
Optimization | Manual    | Automatic (Catalyst)
Type safety  | Yes       | Yes (schema-based)

🔥 When to Use RDD?

✅ Use RDD:

  • Low-level control needed
  • Complex transformations
  • Working with unstructured data

❌ Avoid RDD:

  • Normal ETL → use DataFrame instead
================================================

🔥 What is a Transformation?

👉A transformation is an operation on an RDD/DataFrame that:

  • Creates a new dataset
  • Does NOT execute immediately (lazy)

👉 A transformation in Spark is a lazy operation that creates a new dataset from an existing one, and it is executed only when an action is triggered.

🧠 Simple Idea

rdd2 = rdd.map(lambda x: x * 2)

👉 No execution yet ❌
👉 Just creating a new RDD


⚙️ Key Characteristics

  • Lazy execution
  • Immutable (creates new dataset)
  • ✅ Builds DAG (execution plan)

🔹 Types of Transformations

🧩 1. Narrow Transformations

👉 Data stays in the same partition
👉 No data movement (no shuffle)
👉 Fast ⚡
👉 A transformation in one partition does not need data from other partitions (each partition can produce its result independently)
👉 Example: filter where id < 5, or filter where name like '%kumar%'

Examples:

  • map()
  • filter()
  • flatMap()

📌 Example

rdd.map(lambda x: x * 2)


🔀 2. Wide Transformations

👉 Data moves across partitions
👉 Requires shuffle
👉 Expensive ⚠️
👉 Result needs data from other partitions

Examples:

  • groupByKey()
  • reduceByKey()
  • join()

reduceByKey() is almost always better than groupByKey()

👉reduceByKey() aggregates data BEFORE shuffle
👉groupByKey() shuffles ALL data first, then aggregates
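The difference can be sketched in plain Python (a toy model of two partitions, not real Spark): `reduceByKey` combines within each partition before "sending" records, so fewer records cross the shuffle.

```python
from collections import defaultdict

# Toy sketch (not real Spark): two input partitions of (key, value) pairs.
partitions = [[("A", 1), ("A", 2), ("B", 1)], [("A", 5), ("B", 3), ("B", 4)]]

# groupByKey: every record crosses the shuffle as-is
group_shuffled = [rec for part in partitions for rec in part]

# reduceByKey: map-side combine within each partition first
reduce_shuffled = []
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v                 # pre-aggregate locally
    reduce_shuffled.extend(local.items())

print(len(group_shuffled), len(reduce_shuffled))  # 6 4
```

Both paths produce the same final sums (A=8, B=8), but `reduceByKey` moved 4 records across the shuffle instead of 6; on real data the gap is usually far larger.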


🔥 Why do we get 200 partitions in wide transformations?

👉 By default, Spark uses:

spark.sql.shuffle.partitions = 200

🧠 What it Means

👉 During a wide transformation (shuffle) like:

df.groupBy("dept").count()
df.join(df2, "id")

👉 Spark repartitions the data into 200 partitions

🔥 Why 200?

👉 Default value chosen by Spark for parallelism

  • Enough tasks for distributed processing
  • Works well for medium/large clusters

⚡ Impact

✅ Advantages

  • More parallelism
  • Better resource utilization

❌ Problems (Very Common ⚠️)

1. Too many small tasks

Small data → 200 partitions → tiny tasks → overhead

2. Performance degradation

  • Task scheduling overhead
  • Increased shuffle cost 


⚡ Example 1: join()

rdd1.join(rdd2)

🧠 What Happens

  • Same keys from both RDDs must be together
  • Data moves across nodes

👉 Heavy shuffle ⚠️


⚡ Example 2

🧾 Input RDD

rdd = sc.parallelize([("A",1), ("B",1), ("A",2), ("B",3)])

🔄 Operation

rdd.groupByKey()

🧠 What Happens

  • All values of key "A" must go to same partition
  • All values of key "B" must go to same partition

👉 Requires shuffle

📊 Output

("A", [1,2])
("B", [1,3])
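The routing above can be sketched in plain Python (not Spark's actual hash function, just the idea): a record's target partition is `hash(key) % numPartitions`, which is why all values of one key must end up together.

```python
# Toy sketch of shuffle partitioning: every record with the same key
# maps to the same partition, so a shuffle can bring them together.

def partition_for(key, num_partitions):
    return hash(key) % num_partitions

records = [("A", 1), ("B", 1), ("A", 2), ("B", 3)]

buckets = {}
for k, v in records:
    buckets.setdefault(partition_for(k, 2), []).append((k, v))

# All "A" records share one bucket, all "B" records another.
for part, recs in sorted(buckets.items()):
    print(part, recs)
```

(Python salts `hash()` per process, so the bucket numbers vary between runs, but within one run a key always lands in the same bucket, which is the property the shuffle relies on.)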

📊 Narrow vs Wide

Feature   | Narrow Transformation | Wide Transformation
Shuffle   | ❌ No                 | ✅ Yes
Speed     | Fast                  | Slower
Partition | Same                  | Re-distributed
Example   | map, filter           | groupBy, join

================================================

Job , Stage , Task , Shuffle

🔥 1. What is a Job?

👉A Job is created when you trigger an action

📌 Example

df.show()
df.count()

👉 Each action = 1 Job
👉 To count jobs, count the actions
👉 Every job has at least 1 stage and 1 task


🔥 2. What is a Stage?

👉A Stage is a group of tasks that can run without shuffle

Stages are separated by wide transformations such as groupBy and join.

A new stage is created whenever Spark must shuffle data.

  • Spark divides a job into stages
  • The split happens at shuffle boundaries
    👉 Stage count = number of shuffle operations + 1

Types:

  • ShuffleMapStage (before shuffle)
  • ResultStage (final stage)

🔥 3. What is a Task?

👉 A Task is the smallest unit of work
👉 Tasks per stage = number of partitions
👉 Each task processes one partition
👉 Tasks run in parallel on executor cores

  • Each partition = 1 task

Example:

4 partitions → 4 tasks

🔥 4. What is Shuffle?

👉Shuffle = Data movement across partitions/nodes
👉No Shuffle = Data stays in same partition

Happens in:

  • groupBy()
  • join()
  • reduceByKey()

🧠 What happens:

Node1 →\
→ Data exchange → New partitions
Node2 →/

👉 Expensive because:

  • Network I/O
  • Disk I/O
  • Sorting

Stage 1 → filter

--- SHUFFLE ---

Stage 2 → groupBy

Feature        | Shuffle 🔀   | No Shuffle 🚀
Data movement  | Across nodes | Within same partition
Network I/O    | High         | None
Performance    | Slow         | Fast
Stage split    | Yes          | No
Transformation | Wide         | Narrow

⚡ Putting It All Together

Example 1:

df.filter().groupBy("dept").count().show()

Execution:

🧩 Step 1: Job

  • Triggered by .show()

🔀 Step 2: Stages

  • Stage 1 → filter (no shuffle)
  • Stage 2 → groupBy (shuffle happens)

⚙️ Step 3: Tasks

  • Each stage has tasks based on partitions

🔥 Full Flow

Action → Job → Stages → Tasks → Execution


🔥 Example 2

df1 = spark.read.csv("orders.csv")
df2 = spark.read.csv("customers.csv")

result = df1.filter(df1.amount > 100) \
.join(df2, "customer_id") \
.groupBy("country") \
.count()

result.show()


🧠 Step 1: Job Count

👉 Only one action:

result.show()

Job Count = 1


🧠 Step 2: Identify Transformations

Operation | Type   | Shuffle?
filter()  | Narrow | ❌ No
join()    | Wide   | ✅ Yes
groupBy() | Wide   | ✅ Yes


🧠 Step 3: Stage Breakdown

👉 Stages are split at shuffle boundaries

🔹 Stage 1 (Before 1st Shuffle)

  • Read df1, df2
  • Apply filter()

👉 No shuffle yet

🔹 Stage 2 (Join Shuffle)

  • join() → shuffle required

🔹 Stage 3 (GroupBy Shuffle)

  • groupBy() → another shuffle

Stage Count = 3


🧠 Step 4: Task Count

👉 Depends on partitions

Assume:

spark.sql.shuffle.partitions = 4

🔹 Stage 1

  • Input split into 4 partitions
    👉 4 tasks

🔹 Stage 2 (Join)

  • Shuffle → 4 partitions
    👉 4 tasks

🔹 Stage 3 (GroupBy)

  • Shuffle → 4 partitions
    👉 4 tasks

Total Tasks = 4 + 4 + 4 = 12

Component | Count
Jobs      | 1
Stages    | 3
Tasks     | 12
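The counts above reduce to simple arithmetic, assuming (as in this simplified example) every stage runs with the same number of partitions:

```python
# Stage/task arithmetic for the example above:
# stages = shuffle boundaries + 1, and each stage runs one task
# per partition (here all stages use the same partition count).

shuffle_ops = 2                  # join + groupBy
stages = shuffle_ops + 1         # split at shuffle boundaries
partitions = 4                   # spark.sql.shuffle.partitions
total_tasks = stages * partitions

print(stages, total_tasks)  # 3 12
```

In real jobs the first stage's task count comes from the input splits, not from `spark.sql.shuffle.partitions`, so the per-stage counts can differ.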


🔥 Example 3 (Multiple Actions + Multiple Shuffles)

df = spark.read.csv("sales.csv")

df1 = df.filter(df.amount > 100)

df2 = df1.groupBy("region").sum("amount")

df3 = df2.filter(df2["sum(amount)"] > 1000)

df3.show()
df3.count()

🧠 Step 1: Job Count

👉 Actions:

df3.show()
df3.count()

Job Count = 2
(Each action = separate job)


🧠 Step 2: Identify Transformations

Operation | Type   | Shuffle?
filter()  | Narrow | ❌ No
groupBy() | Wide   | ✅ Yes
filter()  | Narrow | ❌ No


🧠 Step 3: Stage Breakdown (per job)

🔹 Stage 1

  • Read data
  • filter(amount > 100)

👉 No shuffle

🔹 Stage 2

  • groupBy("region") → shuffle happens

🔹 Stage 3

  • filter(sum > 1000)

👉 No shuffle (but runs after shuffle stage)

Stage Count per Job = 3


🧠 Step 4: Task Count

Assume:

spark.sql.shuffle.partitions = 3

🔹 Stage 1

👉 3 partitions → 3 tasks

🔹 Stage 2 (Shuffle)

👉 3 partitions → 3 tasks

🔹 Stage 3

👉 3 partitions → 3 tasks

Tasks per Job = 9
Total Tasks (2 jobs) = 18

Component      | Count
Jobs           | 2
Stages per job | 3
Total Stages   | 6
Tasks per job  | 9
Total Tasks    | 18

================================================

🔥 What is a Join in Spark?

👉A join combines data from two datasets based on a common key

🧠 Simple Example

df1.join(df2, "id")

👉 Combines rows where id matches


📊 Types of Joins in Spark

Join Type       | Description
Inner Join      | Only matching rows
Left Join       | All left + matching right
Right Join      | All right + matching left
Full Outer Join | All rows from both sides
Left Semi Join  | Only matching rows from left (no right columns)
Left Anti Join  | Non-matching rows from left
Cross Join      | Cartesian product

🔹 Examples

✅ Inner Join

df1.join(df2, "id", "inner")

✅ Left Join

df1.join(df2, "id", "left")

✅ Left Anti Join (VERY IMPORTANT 🔥)

df1.join(df2, "id", "left_anti")

👉 Returns rows in df1 not present in df2
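The row-level semantics of inner / left / left_anti can be sketched in plain Python on two tiny "tables" keyed by id (just the matching logic, not Spark):

```python
# Toy sketch of join semantics: which ids survive each join type.

df1 = {1: "a", 2: "b", 3: "c"}   # left table  (id -> value)
df2 = {2: "x", 3: "y", 4: "z"}   # right table (id -> value)

inner = sorted(k for k in df1 if k in df2)          # matching ids only
left = sorted(df1)                                   # all left ids
left_anti = sorted(k for k in df1 if k not in df2)   # left-only ids

print(inner, left, left_anti)  # [2, 3] [1, 2, 3] [1]
```

`left_anti` is the one worth memorizing: it answers "which rows of df1 have no match in df2" without bringing any right-side columns along.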


🔥 Join Types Based on Execution

⚙️ 1. Shuffle Hash Join

👉 Common for medium-sized joins (note: Sort-Merge Join is Spark's default for large data)

  • Data is shuffled across nodes
  • Expensive

🧠 Idea:

  • Shuffle both datasets
  • Build a hash table on the smaller side
  • A hash table is built for each partition

✅ Use When:

  • Medium datasets
  • The per-partition hash table fits in memory
  • Both datasets are too large to broadcast





⚡ 2. Broadcast hash Join (Optimized 🚀)

👉 Small table is sent to all nodes

from pyspark.sql.functions import broadcast

df1.join(broadcast(df2), "id")

✅ Benefits:

  • No shuffle
  • Very fast



⚙️ 3. Sort-Merge Join

👉 Used for large datasets

  • Data sorted before join in partition
  • Efficient for big data

✅ Use When:

  • Both datasets are large

⚠️ Cost:

  • Shuffle + sort (expensive)

🧠 Steps:

  1. Shuffle both datasets
  2. Sort by join key
  3. Merge



⚙️ 4. Bucket Join

👉 Uses bucketing

  • Avoids shuffle if:
    • Same bucket column
    • Same number of buckets

📊 Join Strategy Comparison

Join Type       | Shuffle | Best For
Shuffle Join    | Yes     | Large datasets
Broadcast Join  | No      | Small + large table
Sort-Merge Join | Yes     | Large sorted data
Bucket Join     | No*     | Pre-bucketed tables

🔥 Performance Tips

✅ Use Broadcast Join

  • When one table is small

✅ Avoid Skew

  • Uneven keys → slow joins

✅ Use Proper Partitioning

  • Reduces shuffle

✅ Use Bucketing

  • For repeated joins


🔥1. How Spark Decides Join Strategy (Very Important)

Spark doesn’t always use the same join — it chooses based on:

📊 Factors:

  • Table size
  • Configurations
  • Statistics (if available)
  • Join type

🚀 2.Broadcast Hash Join (Deep)

👉 Fastest join

How it works:

Small table → broadcast to all executors
Large table → streamed

Example:

from pyspark.sql.functions import broadcast

df1.join(broadcast(df2), "id")

🔥 Key Points:

  • No shuffle ❌
  • Memory dependent ⚠️
  • Controlled by:

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")

⚙️ 3. Sort-Merge Join (Default for Big Data)

👉 Most common for large datasets

Steps:

  1. Shuffle both datasets
  2. Sort on join key
  3. Merge rows

🔥 Why used?

  • Scalable for big data
  • Works when data doesn’t fit in memory

⚙️ 4. Shuffle Hash Join

👉 Less common

  • One side hashed
  • Both sides shuffled

⚠️ 5. Data Skew in Joins (VERY IMPORTANT 🔥)

👉 Problem:

  • One key has too many records

Example:

id=1 → millions of rows

👉 Causes:

  • One task becomes slow
  • Job delay


🔧 Solutions:

✅ 1. Salting Technique

df.withColumn("salt", rand())

✅ 2. Broadcast Small Table

  • Avoid shuffle

✅ 3. Skew Join Optimization (Spark 3+)

spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)

⚡ 6. Join Optimization Techniques

✅ Predicate Pushdown

  • Filter before join
df1.filter(...).join(df2)

✅ Column Pruning

  • Select only required columns

✅ Repartition Before Join

df.repartition("id")

👉 Ensures better distribution


🔄 7. Join Reordering (Catalyst Optimizer)

👉 Spark can reorder joins:

(A JOIN B) JOIN C

may become:

B JOIN C → JOIN A

👉 Chooses cheapest path


🔥 8. Broadcast Hint (Manual Control)

df1.join(df2.hint("broadcast"), "id")

👉 Forces broadcast join


⚠️ 9. Common Mistakes

❌ Not filtering before join
❌ Joining huge datasets without partitioning
❌ Ignoring skew
❌ Selecting unnecessary columns


🔍 10. How to Check Join Type

👉 Use:

df.explain(True)

Output shows:

  • BroadcastHashJoin
  • SortMergeJoin
  • etc.

================================================
How to optimize joins?

🔥 1. Use Broadcast Join (BIGGEST WIN 🚀)

👉 If one table is small → broadcast it

from pyspark.sql.functions import broadcast

df_large.join(broadcast(df_small), "id")

✅ Why?

  • No shuffle ❌
  • Very fast ⚡

🔥 2. Filter Data Before Join

👉 Reduce data size early

df1_filtered = df1.filter(df1.status == "ACTIVE")
df1_filtered.join(df2, "id")

✅ Benefit:

  • Less data shuffled
  • Faster execution

🔥 3. Select Only Required Columns (Column Pruning)

df1.select("id", "name").join(df2.select("id", "salary"), "id")

✅ Benefit:

  • Less memory usage
  • Faster shuffle

🔥 4. Handle Data Skew (VERY IMPORTANT ⚠️)

👉 Problem:

  • One key has huge data → slow task

🔧 Solution 1: Salting

from pyspark.sql.functions import rand

df.withColumn("salt", (rand()*10).cast("int"))

🔧 Solution 2: Enable AQE

spark.conf.set("spark.sql.adaptive.enabled", True)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)

🔥 5. Repartition Before Join

df1 = df1.repartition("id")
df2 = df2.repartition("id")

✅ Why?

  • Ensures same key in same partition
  • Reduces shuffle overhead

🔥 6. Use Bucketing (For Repeated Joins)

df.write.bucketBy(8, "id").saveAsTable("table")

✅ Benefit:

  • Avoid shuffle in joins

🔥 7. Tune Shuffle Partitions

spark.conf.set("spark.sql.shuffle.partitions", 200)

Rule:

  • Too high → overhead
  • Too low → less parallelism

🔥 8. Use Correct Join Type

👉 Example:

df1.join(df2, "id", "left_semi")

✅ Benefit:

  • Avoids unnecessary data transfer

🔥 9. Avoid Cross Join ❌

df1.crossJoin(df2)

👉 Very expensive (Cartesian product)


🔥 10. Check Execution Plan

df.explain(True)

👉 Look for:

  • BroadcastHashJoin ✅
  • SortMergeJoin ⚠️

📊 Summary Table

Optimization    | Benefit
Broadcast join  | No shuffle, fastest
Filter early    | Less data processed
Column pruning  | Less memory usage
Repartition     | Better distribution
Bucketing       | Avoid shuffle
Handle skew     | Avoid slow tasks
Tune partitions | Better parallelism

================================================
RDD vs DataFrame vs Dataset

Feature          | RDD 🧱                   | DataFrame 📊           | Dataset 🧠
Level            | Low-level                | High-level             | High-level
Data Structure   | Distributed objects      | Table (rows & columns) | Typed objects
Schema           | ❌ No                    | ✅ Yes                 | ✅ Yes
Type Safety      | ✅ Yes                   | ❌ No                  | ✅ Yes
Performance      | ❌ Slower                | ✅ Faster              | ✅ Faster
Optimization     | ❌ Manual                | ✅ Catalyst optimizer  | ✅ Catalyst optimizer
API              | Functional (map, filter) | SQL / DSL              | Strongly typed API
Language Support | Python, Java, Scala      | Python, Java, Scala    | Scala, Java (not Python ⚠️)
Use Case         | Complex logic            | ETL, analytics         | Type-safe transformations


🔥 1. RDD (Resilient Distributed Dataset)

🧠 What it is

  • Low-level distributed collection of objects
  • Core abstraction of Spark

⚙️ How it Works

Data → RDD → Transformations → DAG → Execution

  • No optimization layer
  • Executes exactly what you write

📌 Example

rdd = sc.parallelize([1,2,3,4])
rdd.map(lambda x: x * 2).collect()

✅ Advantages

  • Full control over execution
  • Fine-grained transformations
  • Works with unstructured data

❌ Disadvantages

  • No optimization → slower
  • No schema
  • Harder to write & maintain

🧠 Internal Concept (Important 🔥)

  • Uses lineage for fault tolerance
  • No Catalyst optimizer
  • More JVM object overhead

📍 When to Use

  • Complex logic not supported in DataFrame
  • Low-level transformations

📊 2. DataFrame

🧠 What it is

  • Distributed table-like structure
  • Similar to SQL table (rows + columns)

⚙️ How it Works

Data → Logical Plan → Catalyst Optimizer → Physical Plan → Execution

📌 Example

df.filter(df.age > 25).select("name").show()

🔥 Internal Architecture

1. Logical Plan

  • Represents query

2. Catalyst Optimizer

  • Optimizes query:
    • Predicate pushdown
    • Column pruning
    • Join optimization

3. Physical Plan

  • Chooses execution strategy:
    • Broadcast join
    • Sort merge join

4. Tungsten Engine

  • Memory optimization
  • Code generation

✅ Advantages

  • Very fast 🚀
  • Automatic optimization
  • Easy SQL support

❌ Disadvantages

  • No compile-time type safety
  • Less control than RDD

📍 When to Use

  • ETL pipelines
  • Data analytics
  • Almost all production workloads

🧠 3. Dataset (Advanced)

🧠 What it is

  • Combination of RDD + DataFrame
  • Provides type safety + optimization

⚙️ How it Works

Dataset → Logical Plan → Catalyst → Physical Plan → Execution

👉 Same engine as DataFrame, but typed

📌 Example (Scala)

case class Person(name: String, age: Int)

val ds = spark.createDataset(Seq(Person("A", 25)))

🔥 Key Concept: Encoders

👉 Converts JVM objects ↔ Spark internal format

✅ Advantages

  • Compile-time type safety
  • Optimized like DataFrame
  • Less runtime errors

❌ Disadvantages

  • Not available in Python ⚠️
  • Slight overhead vs DataFrame

📍 When to Use

  • Scala/Java projects
  • When type safety is critical

📊 Deep Comparison

Feature           | RDD 🧱     | DataFrame 📊 | Dataset 🧠
Abstraction Level | Low        | High         | High
Schema            | ❌ No      | ✅ Yes       | ✅ Yes
Type Safety       | ✅ Yes     | ❌ No        | ✅ Yes
Optimization      | ❌ No      | ✅ Catalyst  | ✅ Catalyst
Performance       | ❌ Slow    | ✅ Fast      | ✅ Fast
API               | Functional | SQL + DSL    | Typed API
Language Support  | All        | All          | Scala/Java only

================================================
Exchange/Shuffle

🔥 1. What is “Exchange” in Spark?

👉Exchange = Shuffle boundary in execution plan

👉 It represents:

  • Data redistribution across partitions
  • Movement of data between stages

🧠 Simple

Exchange = Shuffle = Stage Break

🔥 2. Read Exchange & Write Exchange

🔹 Write Exchange

👉 Happens before shuffle
👉 Spark writing shuffled output to disk so that downstream tasks can read it.

What it does:

  • Writes data from current stage
  • Partitions data based on key

🔹 Read Exchange

👉 Happens after shuffle
👉 Spark reads the data that was shuffled in the previous step

What it does:

  • Reads shuffled data
  • Feeds into next stage

🔄 Flow

Stage 1

Write Exchange (shuffle write)

--- Data moves across nodes ---

Read Exchange (shuffle read)

Stage 2

🔥 3. Example

df.groupBy("dept").count().explain(True)

🧠 Execution Plan (Simplified)

== Physical Plan ==
HashAggregate
Exchange hashpartitioning(dept, 200)
HashAggregate

👉 That Exchange means:

  • Data is shuffled based on dept
  • New stage starts

🔥 4. Real Meaning

Term           | Meaning
Write Exchange | Writing shuffle data
Read Exchange  | Reading shuffled data
Exchange       | Full shuffle operation

⚡ 5. When Exchange Happens?

👉 During wide transformations

  • groupBy
  • join
  • distinct

🔥 6. Why Important?

👉 Exchange is:

  • Expensive ⚠️
  • Network heavy
  • Disk heavy

👉 More exchanges = slower job

================================================

AQE (Adaptive Query Execution)

🔥 1. What is AQE?

👉AQE = Adaptive Query Execution

👉 Spark dynamically optimizes query at runtime (not just at planning time)

👉Static Plan ❌ → AQE makes it Dynamic ✅

⚙️ Enable AQE

spark.conf.set("spark.sql.adaptive.enabled", True)

🔥 2. Why AQE is Needed?

👉 Without AQE:

  • Spark fixes plan before execution
  • Doesn’t know actual data size

👉 With AQE:

  • Spark adjusts based on real data at runtime

🔥 3. How AQE Reduces Shuffle Read

👉 Main idea:Reduce unnecessary data read during shuffle

🔹 1. Coalescing Shuffle Partitions (BIGGEST 🔥)

👉 Problem:

Default = 200 partitions → many small files

👉 AQE Solution:

Merge small partitions → fewer partitions

Example

Before AQE:
200 partitions → 200 shuffle reads ❌

After AQE:
200 → 20 partitions → 20 reads ✅

Config:

spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True)
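The coalescing rule can be sketched in plain Python (a simplification: the real target comes from `spark.sql.adaptive.advisoryPartitionSizeInBytes`, and Spark merges by measured shuffle statistics):

```python
# Toy sketch of AQE partition coalescing: merge consecutive small
# shuffle partitions until each merged partition reaches a target size.

def coalesce(sizes_mb, target_mb):
    merged, current = [], 0
    for size in sizes_mb:
        current += size
        if current >= target_mb:     # target reached -> close this partition
            merged.append(current)
            current = 0
    if current:                      # leftover tail becomes the last partition
        merged.append(current)
    return merged

small = [8] * 20                     # 20 tiny 8MB shuffle partitions
print(coalesce(small, 64))           # [64, 64, 32]
```

The total data volume is unchanged; only the number of shuffle reads drops (here from 20 to 3), which is exactly the win AQE delivers.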

🔹 2. Skew Join Optimization

👉 Problem:

One partition = huge → slow task

👉 AQE Solution:

  • Splits large partitions
  • Balances workload

Result:

  • Faster execution
  • Less heavy shuffle read

Config:

spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)

🔹 3. Dynamic Join Strategy (Huge 🚀)

👉 AQE can change join type at runtime

Example:

Planned → SortMergeJoin ❌
Runtime → BroadcastJoin ✅

👉 Result:

  • Avoids shuffle completely
  • Reduces shuffle read to zero

🔹 4. Local Shuffle Read

👉 AQE tries to read data locally

Instead of remote fetch → local read

Benefit:

  • Less network I/O
  • Faster reads

📊 Summary Table

AQE Feature            | How it Reduces Shuffle Read
Coalesce partitions    | Fewer partitions to read
Skew join optimization | Balanced partition sizes
Dynamic join selection | Avoid shuffle (broadcast)
Local shuffle read     | Reduce network data transfer

🔥 Before vs After AQE

Without AQE:
200 partitions → heavy shuffle → slow ❌

With AQE:
20 partitions → optimized → fast ✅
================================================

Driver Memory Management

🔥 1. What is Driver?

👉Driver = Main process of Spark application

  • Runs your code
  • Creates the SparkSession
  • Builds logical/physical plans
  • Creates the DAG
  • Schedules jobs, stages, and tasks
  • Tracks task status and retries
  • Collects results

🔥 2. What is Driver Memory?

  • Memory allocated to the driver process
  • Driver memory is not used for heavy data processing (executors do that)

--driver-memory 4G

🧠 What Uses Driver Memory?

🔹 1. Application Code

  • Variables
  • Objects

🔹 2. Metadata (Very Important 🔥)

  • DAG and query plans
  • Job/stage/task metadata
  • Spark configs
  • Scheduler state

🔹 3. Result Data

👉 When you do:

df.collect()
df.take()
df.head()

  • Data comes into driver memory
  • Accumulators
  • Broadcast variable metadata

⚠️ BIG Reason of OOM

🔹 4. Broadcast Variables

👉 Stored on driver before sending to executors

🔹 5. Accumulators

👉 Maintained on driver


🔥 3. Memory Flow

Executors → process data

Driver ← receives results (collect/show)

⚠️ 4. Common Driver OOM (VERY IMPORTANT)

Driver OOM = Driver process runs out of memory and crashes
Executors → send data → Driver memory overloaded → 💥 crash

❌ 1. Using collect() on large data

df.collect()

👉 Loads entire dataset into driver → crash ❌

❌ 2. Using toPandas()

df.toPandas()

👉 Converts full dataset → stored in driver memory

❌ 3. Large show()

df.show(1000000)

👉 Loads huge data into driver

❌ 4. Large Broadcast Variables

broadcast(large_df)

👉 Driver must hold it first

❌ 5. Too Many Partitions (Metadata overload)

Millions of tasks → driver stores metadata → memory issue

❌ 6. Collecting in loops

for x in df.collect():

👉 Repeated memory usage


🔥 5. How to Manage Driver Memory

✅ 1. Avoid collect()

Use:

df.show()
df.take(10)

✅ 2. Use write instead of collect

df.write.parquet("output/")

✅ 3. Limit data

df.limit(1000).collect()

✅ 4. Increase driver memory

--driver-memory 8G

✅ 5. Avoid large broadcast

👉 Only broadcast small tables

✅ 6. Reduce partitions metadata

df.coalesce(50)

✅ 7. Use streaming approach

👉 Process in chunks instead of full load


🔥 6. Driver vs Executor Memory

Feature  | Driver                 | Executor
Role     | Control + coordination | Data processing
Stores   | Metadata + results     | Actual data
OOM risk | collect()              | Large partitions
================================================
Driver has 2 Memory:
JVM Heap vs Overhead Memory

🔥 1. Core Idea (Simple)

👉JVM Heap = Memory for Spark data & execution
👉Overhead Memory = Extra memory outside heap (system + native usage)


📊 2. Comparison Table



Feature    | JVM Heap 🧠            | Overhead Memory ⚙️
Type       | Inside JVM             | Outside JVM
Used for   | Data, execution, cache | Native memory, Python, JVM meta
Config     | spark.executor.memory  | spark.executor.memoryOverhead
Managed by | JVM                    | OS / Spark
OOM Type   | Java heap OOM          | Container killed (YARN/K8s)

🔥 3. JVM Heap Memory

  • This is the main memory used by Spark
  • Managed by the Java garbage collector

📌 Used for:

  • Spark execution and storage memory (on executors)
  • Spark plans, metadata, objects
  • RDD / DataFrame data
  • Cached data
  • Shuffle data
  • Execution (joins, aggregations)
  • On the driver: DAG, plans, scheduling metadata

⚙️ Config

--executor-memory 8G
spark.driver.memory=10GB

🧠 Example

Executor Memory = 8GB
→ JVM Heap = ~8GB

❌ Heap OOM Error

java.lang.OutOfMemoryError: Java heap space

🔥 4. Memory Overhead (non heap / off heap)

👉 Memory used outside the JVM heap but inside the Spark container

Not managed by GC

📌 Used for:

  • Python processes (PySpark)
  • JVM metadata
  • Native libraries
  • Off-heap memory
  • Shuffle buffers
  • thread stacks

⚙️ Config

--conf spark.executor.memoryOverhead=2g

Default: memoryOverhead = max(384MB, 10% of executor heap)

🧠 Default

Usually:

Max(10% of executor memory, 384MB)

❌ Overhead OOM Error

Container killed by YARN for exceeding memory limits

🔥 5. Total Memory

Total Container Memory =
JVM Heap + Overhead Memory

📌 Example

Executor Memory = 8GB
Overhead = 2GB
Total = 10GB
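The arithmetic behind this example, including the default overhead rule above:

```python
# Container memory = JVM heap + overhead, where overhead defaults
# to max(384MB, 10% of the heap) when not set explicitly.

def container_memory_mb(heap_mb, overhead_mb=None):
    if overhead_mb is None:
        overhead_mb = max(384, int(0.10 * heap_mb))
    return heap_mb + overhead_mb

print(container_memory_mb(8192, 2048))  # 10240 (8G heap + 2G overhead)
print(container_memory_mb(8192))        # 9011  (8G heap + default ~10%)
```

This is the number the cluster manager (YARN/K8s) enforces: exceed it, and the container is killed even if the JVM heap itself is fine.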

🔥 6. Why Overhead is Important (VERY IMPORTANT ⚠️)

👉 Especially in PySpark

  • Python runs outside JVM
  • Needs extra memory

⚠️ Problem

👉 If overhead too low:

Container killed ❌

🔥 7. Real Example

--executor-memory 8G
--conf spark.executor.memoryOverhead=512m ❌ too low

👉 💥 Crash


✅ Fix

--executor-memory 8G
--conf spark.executor.memoryOverhead=2g ✅

🔥 8. When to Increase What?

Problem Type          | Increase What?
Java heap OOM         | Heap memory
Container killed      | Overhead memory
PySpark job           | Overhead memory
Heavy joins / caching | Heap memory

================================================

On-Heap vs Off-Heap memory

🔥 1. Core Difference (1-liner)

👉On-Heap = Memory inside JVM (GC managed)
👉Off-Heap = Memory outside JVM (no GC)


📊 2. Comparison Table

Feature     | On-Heap 🧠            | Off-Heap ⚙️
Location    | Inside JVM heap       | Outside JVM (native memory)
Managed by  | JVM Garbage Collector | Spark / OS
Performance | Slower (GC overhead)  | Faster (no GC pauses)
Stability   | Safer (auto managed)  | Risky (manual handling)
Default     | Enabled               | Disabled
Config      | --executor-memory     | spark.memory.offHeap.*

🔥 3. On-Heap Memory

👉 Default memory used by Spark

📌 Used for:

  • RDD / DataFrame data
  • Cached data
  • Execution (joins, aggregations)

⚙️ Config

--executor-memory 8G

⚠️ Problem

Large data → frequent Garbage Collection → slow performance

🔥 4. Off-Heap Memory

👉 Memory allocated outside JVM

📌 Used for:

  • Tungsten engine (optimized execution)
  • Binary data processing
  • Shuffle buffers
  • Serialized data

⚙️ Enable

spark.conf.set("spark.memory.offHeap.enabled", True)
spark.conf.set("spark.memory.offHeap.size", "4g")

🚀 Benefit

  • Less GC
  • Faster execution
  • Better memory efficiency

🔄 5. Memory Layout

Total Memory
├── On-Heap (JVM)
└── Off-Heap (Native)

⚠️ 6. Risks of Off-Heap

  • Memory leaks possible
  • Harder to debug
  • Needs careful tuning

🔥 7. When to Use What?

✅ Use On-Heap (default)

  • Most workloads
  • Small/medium data

✅ Use Off-Heap

  • Large datasets
  • Heavy joins / aggregations
  • GC becoming bottleneck

📊 8. Example

Executor Memory = 8GB (on-heap)
Off-Heap = 4GB
Total usable ≈ 12GB
================================================
Executor Memory Management

🔥 1. What is Executor Memory?

👉Memory allocated to each executor process

--executor-memory 8G

The executor is responsible for:

  • Running tasks
  • Holding data partitions
  • Performing shuffles
  • Caching data
  • Writing output

🔥 2. Total Executor Memory Layout

Executor Memory
├── JVM Heap (On-Heap)
│   ├── Execution Memory (join, sort, shuffle, aggregation)
│   ├── Storage Memory (RDD, DF caching)
│   └── User Memory (UDFs, user objects, metadata)
└── Overhead Memory (Off-Heap + system)

🧠 3. Inside JVM Heap (MOST IMPORTANT 🔥)

Spark divides heap into:

🔹 1. Execution Memory ⚙️

If execution memory is insufficient, Spark spills to disk (slow but safe)

👉 Used for:

  • Joins
  • Aggregations
  • Shuffle operations
  • Sorting

🔹 2. Storage Memory 📦 (caching)

If storage memory fills:
  • Spark evicts cached blocks
  • Evicted blocks are recomputed later (via lineage)

👉 Used for:

  • Cached data (cache, persist)
  • Broadcast variables
                                      
Both pools live inside spark.executor.memory (the JVM heap), within Spark's unified memory pool.


🔄 Unified Memory Model/management

Spark dynamically shares memory between execution and storage
Execution ↔ Storage (can borrow from each other)

JVM Heap (Executor)
├── Reserved Memory (~300MB)
└── Unified Memory
    ├── Execution Memory ⚙️
    └── Storage Memory 📦

🔥 Important Rules

👉Execution can evict storage
👉Storage cannot evict execution

⚡ Example

  • If no caching → execution uses more memory
  • If heavy caching → storage uses more

👉 Solves:

  • Memory wastage ❌
  • Improves utilization ✅
  • Better performance 🚀
Case 1: Execution can use free storage memory
Case 2: Execution can evict the least-used cached blocks from storage, because execution memory is needed for transformations
Case 3: Storage can use free execution memory
Case 4: Storage cannot evict execution memory, so it frees space by evicting its own least-used cached blocks

🔥 4. Overhead Memory

👉 Outside the JVM heap, not managed by GC

Overhead memory cannot spill to disk — if exceeded, the container gets killed

Used for:

  • Python processes (PySpark)
  • Native libraries
  • JVM metadata
  • Pandas UDFs
  • Arrow buffers

⚙️ Config

--conf spark.executor.memoryOverhead=2g

Default: memoryOverhead = max(384MB, 10% of executor heap)



🔥 5. Memory Flow Example

Task starts

Execution memory used (join/groupBy)

Data cached → storage memory

Shuffle → uses execution memory

⚠️ 6. Common Executor OOM Issues

❌ 1. Large shuffle / join

👉 Not enough execution memory

❌ 2. Too much caching

df.cache()

👉 Storage memory full

❌ 3. Data skew

👉 One partition too large

❌ 4. Large partitions

👉 Task can't fit in memory

❌ 5. Low overhead memory (PySpark)

👉 Container killed


🔥 7. How to Fix (Production Tips 🚀)

✅ 1. Increase executor memory

--executor-memory 16G

✅ 2. Tune partitions

df.repartition(200)

👉 Smaller partitions → less memory per task

✅ 3. Avoid over-caching

👉 Cache only needed data

✅ 4. Use broadcast join

👉 Avoid heavy shuffle

✅ 5. Enable AQE

spark.conf.set("spark.sql.adaptive.enabled", True)

✅ 6. Increase overhead (PySpark)

--conf spark.executor.memoryOverhead=2g

📊 8. Key Configs

Config                        | Purpose
spark.executor.memory         | JVM heap
spark.executor.memoryOverhead | Extra memory
spark.memory.fraction         | Heap usage split
spark.sql.shuffle.partitions  | Shuffle parallelism

🔥 9. Real Example

Executor Memory = 8GB
→ 60% usable by Spark
→ Split between execution & storage

================================================

🔥 What is Data Spill?

Data spill = Memory overflow → Disk usage

When an executor is doing heavy operations (like joins, aggregations, sorts), it tries to keep data in memory.
If memory is insufficient, Spark spills data from RAM to disk.


🧠 Why Spill Happens

A spill by itself won't cause OOM —
Spark spills to disk instead of crashing, whenever spilling is possible.

1. Not enough executor memory

  • spark.executor.memory too low
  • Large dataset or skewed data

2. Wide transformations

Operations like:

  • groupBy
  • reduceByKey
  • join
  • orderBy

👉 These require shuffle + buffering, which consumes a lot of memory

3. Large partitions

  • Few partitions → each partition is huge
  • One executor gets too much data

4. Data skew

  • Some keys have huge data → one partition overloaded

5. Inefficient joins

  • Shuffle join instead of broadcast join
  • No partitioning strategy

⚙️ Types of Spill

1. Memory Spill (Shuffle Spill)

  • Happens during shuffle operations
  • Data is written to disk as temporary files

2. Disk Spill

  • When memory completely fills
  • Spark writes serialized data to disk

📊 Where It Happens (Internally)

Inside executor:

  • Execution Memory (for shuffle, join, aggregation)
  • If exceeded → spill triggered

👉 Comes under Unified Memory Management


⚠️ Impact of Spill

Spilling is not a failure, but it slows down jobs:

  • Disk I/O is much slower than RAM
  • Increased job time
  • More GC overhead
  • Possible disk space issues

🚀 How to Avoid / Reduce Spill

✅ 1. Increase memory

spark.executor.memory=8g
spark.executor.memoryOverhead=2g

✅ 2. Increase partitions

df.repartition(200)

✔ Smaller partitions = less memory per task

✅ 3. Use Broadcast Join

from pyspark.sql.functions import broadcast
df1.join(broadcast(df2), "id")

✔ Avoids shuffle → less memory pressure

✅ 4. Handle data skew

  • Salting technique
  • Skew join optimization (AQE)

✅ 5. Enable AQE (Adaptive Query Execution)

spark.sql.adaptive.enabled=true

✔ Automatically:

  • Reduces shuffle partitions
  • Handles skew
  • Optimizes joins

✅ 6. Use efficient transformations

  • Prefer reduceByKey over groupByKey
  • Filter early

✅ 7. Persist wisely

df.persist()

✔ Avoid recomputation (but don’t overuse)

================================================


Unlike a plain spill, skew can cause OOM: one partition is far larger than the rest, and Spark cannot spill everything to disk because the data must sit in executor memory to perform the transformation.

 1. What is Data Skew (Skewness)?

👉 Data skew = uneven data distribution across partitions

Instead of data being evenly spread:

Partition 1 → 10 rows
Partition 2 → 12 rows
Partition 3 → 1,000,000 rows ❌ (skewed)

🧠 Why Skew Happens

  • Some keys appear very frequently
  • Example:
customer_id:
A → 1M records
B → 10 records
C → 5 records

👉 During shuffle:

  • One executor gets A → huge data
  • Others finish fast → one task becomes bottleneck

⚠️ Problems Caused by Skew

  • Slow job (one task runs forever)
  • Executor memory issues
  • Data spill to disk
  • CPU underutilization

📍 Where You See Skew

  • Joins
  • groupBy / aggregations
  • reduceByKey
  • orderBy

🔍 How to Detect Skew

In Spark UI:

  • One task takes much longer than others
  • Uneven input sizes across tasks
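The same signal you look for in the Spark UI can be computed directly from partition (or task input) sizes — a large max-to-median ratio means skew. A plain-Python sketch:

```python
import statistics

# Skew detection sketch: compare the biggest partition to the median.
# A ratio near 1 is balanced; a large ratio means one hot partition.

def skew_ratio(partition_sizes):
    return max(partition_sizes) / statistics.median(partition_sizes)

balanced = [100, 110, 95, 105]
skewed = [100, 110, 95, 5000]

print(round(skew_ratio(balanced), 2))  # 1.07
print(round(skew_ratio(skewed), 2))    # 47.62
```

The median is used rather than the mean so the hot partition itself doesn't mask the skew.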

🧂 2. What is Salting?

👉 Salting = technique to break skewed keys into smaller chunks

Instead of sending all "A" records to one partition → distribute them.

🧠 Idea

Original skewed key:

A → 1M rows

After salting:

A_1 → 200k
A_2 → 200k
A_3 → 200k
A_4 → 200k
A_5 → 200k

👉 Now data is evenly distributed across partitions
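The redistribution above can be sketched in plain Python (just the key-spreading effect, not a real Spark job):

```python
import random
from collections import Counter

# Salting sketch: append a random salt (0-4) to one hot key so its
# rows spread across 5 salted keys instead of piling on one.

random.seed(42)                                  # deterministic demo
hot_rows = ["A"] * 1000                          # one dominant key
salted = [f"{k}_{random.randrange(5)}" for k in hot_rows]

counts = Counter(salted)
print(sorted(counts))        # ['A_0', 'A_1', 'A_2', 'A_3', 'A_4']
print(max(counts.values()))  # roughly 200 per salted key, not 1000
```

Since each salted key hashes to its own partition during the shuffle, the heaviest partition now carries about a fifth of the hot key's rows.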



⚙️ How Salting Works (Step-by-Step)

Step 1: Add random salt to skewed key

from pyspark.sql.functions import col, rand, floor, concat_ws

df1 = df1.withColumn("salt", floor(rand() * 5))
df1 = df1.withColumn("new_key", concat_ws("_", col("key"), col("salt")))

Step 2: Expand smaller dataset (important!)

from pyspark.sql.functions import explode, array, lit, col, concat_ws

df2 = df2.withColumn("salt", explode(array([lit(i) for i in range(5)])))
df2 = df2.withColumn("new_key", concat_ws("_", col("key"), col("salt")))

Step 3: Join on salted key

df1.join(df2, "new_key")

📊 Before vs After Salting

Without Salting          | With Salting
One partition overloaded | Even distribution
Slow execution           | Faster
Spill to disk            | Reduced spill

⚠️ When to Use Salting

✔ Use when:

  • Severe skew (one key dominating)
  • Large joins causing bottleneck

❌ Avoid when:

  • Data is already balanced
  • Small dataset (use broadcast instead)

🚀 Alternatives to Salting

✅ 1. Broadcast Join

broadcast(df_small)

✅ 2. AQE (Adaptive Query Execution)

spark.sql.adaptive.enabled=true
spark.sql.adaptive.skewJoin.enabled=true

👉 Spark automatically handles skew

✅ 3. Repartition by key

df.repartition("key")

👉 Caveat: hash-repartitioning on the skewed key still routes every hot-key row to one partition, so this helps only when skew comes from unevenly sized input partitions — not from a single dominant key.

================================================

🔥 What is Caching?

👉 Caching = storing data in memory to reuse it without recomputation

By default, Spark:

  • Recomputes data every time (due to lazy evaluation)

Caching avoids this.


🧠 Why Caching is Needed

Without cache:

df = spark.read.csv("data.csv")

df.filter("age > 30").count()
df.filter("age > 30").show()

👉 Spark will:

  • Read file twice ❌
  • Apply filter twice ❌

✅ With Cache

df = spark.read.csv("data.csv")

df.cache()

df.filter("age > 30").count()
df.filter("age > 30").show()

👉 Spark will:

  • Read once ✅
  • Store in memory ✅
  • Reuse data ✅
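The difference can be mimicked in plain Python (an analogy, not the Spark API): count how often the "source" is read with and without materializing the result once.

```python
# Plain-Python analogy: lazy recomputation vs. caching
reads = {"count": 0}

def load_data():
    reads["count"] += 1          # stands in for the expensive file scan
    return [25, 31, 40, 18, 35]

# Without cache: every "action" recomputes from the source
over_30 = [x for x in load_data() if x > 30]
over_30_again = [x for x in load_data() if x > 30]
assert reads["count"] == 2       # source read twice

# With cache: materialize once, reuse the result
reads["count"] = 0
cached = load_data()             # like df.cache() followed by a first action
over_30 = [x for x in cached if x > 30]
over_30_again = [x for x in cached if x > 30]
assert reads["count"] == 1       # source read once
```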

⚙️ How Caching Works Internally

  1. You call:
df.cache()
  2. Nothing happens immediately (lazy)
  3. On first action:
df.count()

👉 Spark:

  • Computes the DAG
  • Stores partitions in executor memory

  4. Next action:
    👉 Uses cached data (skips recomputation)

📦 Where Data is Stored

  • Executor memory (RAM)
  • If memory full → spill to disk

🧱 Storage Levels

Default:

MEMORY_AND_DISK

Other options:

Level               Meaning
MEMORY_ONLY         Fastest; partitions that don't fit are recomputed
MEMORY_AND_DISK     Tries memory first, spills to disk if needed
DISK_ONLY           Stored only on disk; slow
MEMORY_ONLY_SER     Serialized in memory (less memory, more CPU)

Example:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)

🔁 Cache vs Persist

Cache                        Persist
Shortcut                     Flexible
Default = MEMORY_AND_DISK    You choose the storage level

👉 cache() = persist(MEMORY_AND_DISK)


⚠️ When to Use Caching

✅ Use when:

  • Same DataFrame used multiple times
  • Iterative algorithms (ML)
  • Expensive transformations

❌ Avoid when:

  • Used only once
  • Data is too large (won’t fit memory)
  • Causes memory pressure → spill

🚨 Common Mistakes

❌ 1. Forgetting action

df.cache() # nothing happens yet

👉 Need action:

df.count()

❌ 2. Over-caching

  • Caching too many DataFrames → OOM / spill

❌ 3. Not unpersisting

df.unpersist()

👉 Frees memory


🔍 How to Check Cache

In Spark UI:

  • Storage tab → shows cached RDD/DataFrame

🚀 Cache vs Checkpoint

Cache               Checkpoint
Stored in memory    Stored in HDFS/disk
Fast                Slower
Can be lost         Reliable

📊 Real Use Case

df = spark.read.parquet("big_data")

df = df.filter("country = 'IN'").cache()

df.count()
df.groupBy("city").count().show()
df.select("name").show()

👉 Filter runs once → reused everywhere

================================================

🔥 What is persist()?

👉 persist() = explicitly store a DataFrame/RDD with a chosen storage level

It gives you control over how and where data is stored (memory, disk, serialized, etc.)


🧠 Why persist()?

Because cache() is limited:

  • Always uses default: MEMORY_AND_DISK

👉 persist() lets you choose:

  • Memory only
  • Disk only
  • Serialized formats
  • Combination of memory + disk

⚙️ Syntax

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)

📦 Storage Levels Explained

1. MEMORY_ONLY

df.persist(StorageLevel.MEMORY_ONLY)

✔ Fastest
❌ If memory is insufficient → data lost → recomputed


2. MEMORY_AND_DISK (default)

df.persist(StorageLevel.MEMORY_AND_DISK)

✔ Safe (spills to disk if memory full)
✔ Balanced performance


3. DISK_ONLY

df.persist(StorageLevel.DISK_ONLY)

✔ Saves memory
❌ Slower (disk I/O)


4. MEMORY_ONLY_SER

df.persist(StorageLevel.MEMORY_ONLY_SER)

✔ Uses less memory (serialized)
❌ More CPU (serialization/deserialization)


5. MEMORY_AND_DISK_SER

df.persist(StorageLevel.MEMORY_AND_DISK_SER)

✔ Memory efficient + safe
✔ Common in large datasets

👉 Note: the _SER levels are Scala/Java options. PySpark always stores data serialized, so these names are generally not available in pyspark.StorageLevel — use MEMORY_ONLY / MEMORY_AND_DISK instead.


🔁 Cache vs Persist

Feature          cache()            persist()
Control          ❌ No              ✅ Yes
Default level    MEMORY_AND_DISK    User-defined
Ease             Simple             Flexible

👉 cache() = persist(MEMORY_AND_DISK)


⚙️ Internal Working

  1. Call:
df.persist(...)
  2. Spark marks the dataset for persistence (lazy)
  3. First action:
df.count()

👉 Spark:

  • Executes the DAG
  • Stores partitions in executor memory/disk

  4. Next actions:
    👉 Reuse stored data (no recomputation)

⚠️ When to Use persist()

✅ When:

  • Dataset reused multiple times
  • You want control over storage
  • Data is large → need serialization

🚨 When NOT to Use

❌ Single use dataset
❌ Very large data that doesn’t fit even with disk
❌ Memory already under pressure (can cause spill)


🧹 Important: Unpersist

df.unpersist()

👉 Always clean up:

  • Frees executor memory
  • Avoids OOM

🔍 Real Example

from pyspark import StorageLevel

df = spark.read.parquet("big_data")

df = df.filter("country = 'IN'")

df.persist(StorageLevel.MEMORY_AND_DISK)  # PySpark data is already serialized

df.count()
df.groupBy("city").count().show()
df.select("name").show()

👉 Filter runs once → reused efficiently

================================================

🔥 What is Partition Pruning?

👉 Partition pruning = reading only required partitions instead of full data

Instead of scanning entire dataset:

year=2022
year=2023
year=2024

If your query is:

WHERE year = 2024

👉 Spark reads only year=2024 partition


🧠 Why It Matters

Without pruning:

  • Full data scan ❌
  • High I/O ❌
  • Slow performance ❌

With pruning:

  • Less data read ✅
  • Faster queries ✅
  • Lower cost ✅

📦 How Data is Stored (Partitioned)

Example:

/data/
year=2023/
year=2024/
year=2025/

👉 These folders are partitions


⚙️ How Partition Pruning Works

Spark:

  1. Reads metadata (folder structure)
  2. Applies filter on partition column
  3. Skips unnecessary partitions

✅ Example

df = spark.read.parquet("/data/sales")

df.filter("year = 2024").show()

👉 Spark:

  • Skips other folders
  • Reads only year=2024

🚨 Important Rule

👉 Filter must be on partition column

✔ Works:

df.filter("year = 2024")

❌ Won’t prune:

df.filter("year + 1 = 2025") # transformation breaks pruning
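Conceptually, pruning is just a directory listing plus a string match on the folder names — a plain-Python sketch (not Spark internals):

```python
# Pruning keeps only the partition directories whose folder name
# matches the filter value; the other folders are never opened.
partitions = [
    "/data/sales/year=2022",
    "/data/sales/year=2023",
    "/data/sales/year=2024",
]

def prune(paths, column, value):
    return [p for p in paths if p.rsplit("/", 1)[-1] == f"{column}={value}"]

print(prune(partitions, "year", 2024))
# → ['/data/sales/year=2024']
```

This also shows why `year + 1 = 2025` defeats pruning: the folder name is `year=2024`, and Spark cannot match a transformed value against it directly.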

🔍 Types of Partition Pruning

1. Static Partition Pruning

  • Happens at query compile time
  • Known filter value
WHERE year = 2024

2. Dynamic Partition Pruning (DPP)

👉 Happens during joins

SELECT *
FROM fact f
JOIN dim d
ON f.id = d.id
WHERE d.country = 'IN'

👉 Spark:

  • Filters dim table
  • Uses it to prune fact table partitions dynamically

⚡ Example (Join Optimization)

Without DPP:

  • Full fact table scan ❌

With DPP:

  • Only relevant partitions scanned ✅
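The idea behind DPP can be modeled in a few lines of plain Python (toy data, not Spark internals): the dim-side filter runs first, and only fact partitions whose join key survived it get scanned.

```python
# Toy model of dynamic partition pruning
dim = [{"id": 1, "country": "IN"}, {"id": 2, "country": "US"}]
fact_partitions = {1: ["row_a", "row_b"], 2: ["row_c"], 3: ["row_d"]}  # keyed by id

# Step 1: filter the dimension table
wanted_ids = {d["id"] for d in dim if d["country"] == "IN"}

# Step 2: scan only fact partitions whose key survived the filter
scanned = {pid: rows for pid, rows in fact_partitions.items() if pid in wanted_ids}

print(sorted(scanned))  # → [1]
```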

🔧 Enable DPP

spark.sql.optimizer.dynamicPartitionPruning.enabled=true

⚠️ When Partition Pruning Fails

❌ Using functions on partition column
❌ Data not partitioned properly
❌ Filtering on non-partition column
❌ Too many small partitions (over-partitioning)

================================================

🔥 What does “writing into partition” mean?

👉 Partitioning while writing = storing data physically in folders based on column values

Instead of one big file:

/data/sales/
part-0001.parquet

👉 With partitioning:

/data/sales/
country=IN/
country=US/
country=UK/

⚙️ How to Write Partitioned Data

✅ Basic Syntax

df.write \
.partitionBy("country") \
.mode("overwrite") \
.parquet("/data/sales")

📦 Output Structure

If data is:

country | year | amount
IN | 2024 | 100
US | 2024 | 200
IN | 2023 | 150

👉 Output:

/data/sales/
country=IN/
part-0001.parquet
country=US/
part-0002.parquet

🧠 Multiple Partition Columns

df.write \
.partitionBy("country", "year") \
.parquet("/data/sales")

👉 Structure:

country=IN/year=2024/
country=IN/year=2023/
country=US/year=2024/

🔁 Write Modes

Mode         Meaning
overwrite    Replace existing data
append       Add new data
ignore       Skip the write if data exists
error        Fail if data exists

⚠️ Important Concepts

1. Partition Columns Not Stored Inside File

👉 They are stored as folder names, not inside data


2. High Cardinality Problem

❌ Bad:

.partitionBy("user_id")

👉 Creates millions of folders → performance issue

✔ Good:

  • Partition by low-cardinality columns (country, date)

3. Small File Problem

Too many partitions → too many small files ❌

👉 Fix:

df.repartition(10).write.partitionBy("country").parquet("/data/sales")

4. Repartition Before Writing

df.repartition("country") \
.write.partitionBy("country") \
.parquet("/data/sales")

✔ Ensures better distribution
✔ Reduces shuffle issues


🔍 Static vs Dynamic Partition Writing

Static Partition

df.write.partitionBy("country")

👉 Partition values are derived automatically from the data being written


Dynamic Partition (Hive-style)

INSERT INTO table PARTITION (country)
SELECT ...

🚀 Best Practices

✔ Partition on frequently filtered columns
✔ Avoid too many partitions
✔ Combine with partition pruning
✔ Use coalesce() to reduce small files


📊 Real Example

df = spark.read.csv("sales.csv")

df.write \
.mode("overwrite") \
.partitionBy("year", "month") \
.parquet("/data/sales")

👉 Query:

df.filter("year = 2024")

👉 Only year=2024 folder is read (partition pruning) 🚀

================================================





