PySpark

 https://www.youtube.com/watch?v=FNJze2Ea780

What is Apache Spark?

Apache Spark is an open-source distributed data processing engine designed for big data analytics. It allows you to process large datasets across multiple machines (clusters) quickly and efficiently.

Think of it as a supercharged, scalable version of Python’s Pandas or SQL that works on massive data distributed across many servers.

What is PySpark?

PySpark is the Python API for Apache Spark. It allows you to write Spark applications in Python.

In other words:

  • Apache Spark = the big data processing engine (written in Scala, runs on JVM).

  • PySpark = a way to use Spark with Python.
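
To make this concrete, here is a minimal PySpark program (a sketch, assuming pyspark is installed locally, e.g. via pip install pyspark):

from pyspark.sql import SparkSession

# Start a local Spark session: the entry point for the DataFrame and SQL APIs
spark = SparkSession.builder.appName("hello-pyspark").master("local[*]").getOrCreate()

# Build a tiny DataFrame and run a Pandas/SQL-like operation on it
df = spark.createDataFrame([("Alice", 30), ("Bob", 22)], ["name", "age"])
df.filter(df.age > 25).show()

spark.stop()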


Key Features of Spark

  1. Distributed Computing:

    • Data is split into chunks and processed in parallel across a cluster.

    • Can handle petabytes of data.

  2. Fast Processing:

    • Uses in-memory computation, which is faster than disk-based systems like Hadoop MapReduce.

    • Optimized DAG (Directed Acyclic Graph) execution for tasks.

  3. Multi-Language Support:

    • Scala (native)

    • Java

    • Python (PySpark)

    • R

    • SQL

  4. Unified Engine:
    Spark can handle multiple workloads:

    • Batch processing → Big files (CSV, Parquet, JSON)

    • Stream processing → Real-time data (Kafka, sockets)

    • Machine Learning → MLlib

    • Graph Processing → GraphX

  5. Fault-Tolerant:

    • Uses RDDs (Resilient Distributed Datasets) that can recover lost data automatically.

=============================================

What is Distributed Computing?

Distributed computing is a computing model where large computational tasks are divided and executed across multiple machines (nodes) that work in parallel. Think of it as breaking a huge job into smaller parts and assigning each part to a different worker. Its key features include:

  • Speed: Multiple nodes work simultaneously.
  • Scalability: Add more nodes to handle more data.
  • Fault tolerance: If one node fails, others continue.

=============================================

Apache Spark Architecture

Spark is a distributed computing system, meaning it runs across a cluster of machines. Its architecture is designed for speed, fault tolerance, and scalability.


1️⃣ Main Components of Spark Architecture

  • Driver: The main program that runs your Spark application. It creates the SparkContext or SparkSession, converts code into tasks, and schedules them on executors.

  • Executor: Worker process running on cluster nodes. Responsible for executing tasks, storing data in memory or on disk, and reporting results back to the driver.

  • Cluster Manager: Manages resources across the cluster and allocates CPU and memory to Spark applications. Types: Standalone, YARN, Mesos, Kubernetes.

2️⃣ How Spark Works (High-Level Flow)

  1. Spark Application: You write your program (Python, Scala, Java, R).

  2. Driver Program: SparkSession/SparkContext runs on the driver. It:

    • Divides work into tasks

    • Builds a DAG (Directed Acyclic Graph) of transformations

    • Sends tasks to executors

  3. Executors: Run on cluster nodes and execute the tasks in parallel.

  4. Cluster Manager: Assigns resources to driver and executors.

  5. Result: Executors return results to the driver, which aggregates them and produces the final output.
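
To see where the driver sits in code, here is a small sketch; the resource values are illustrative assumptions, not tuning advice, and they only take effect on a real cluster:

from pyspark.sql import SparkSession

# The driver creates the SparkSession; master() selects the cluster manager / deployment target
spark = (
    SparkSession.builder
        .appName("architecture-demo")
        .master("local[4]")                       # local mode for this sketch
        .config("spark.executor.memory", "2g")    # memory per executor (illustrative)
        .config("spark.executor.cores", "2")      # cores per executor (illustrative)
        .getOrCreate()
)

# The driver only builds the plan; executors do the work when an action runs
df = spark.range(1_000_000)
print(df.selectExpr("sum(id) AS total").collect())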


3️⃣ Spark Components in the Architecture

  • Spark Core: Handles task scheduling, memory management, fault recovery, and the RDD (Resilient Distributed Dataset) API.

  • Spark SQL: Query structured data using SQL, DataFrames, and Datasets.

  • Spark Streaming / Structured Streaming: Real-time stream processing.

  • MLlib: Distributed machine learning library.

  • GraphX: Graph processing API (e.g., social network analysis).

4️⃣ Spark Execution Model

  • Transformations (map, filter, flatMap) are lazy. Spark builds a DAG of operations.

  • Actions (collect, count, save) trigger execution.

  • Parallel Execution: Tasks are distributed to multiple executors for parallel processing.
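
A short sketch of this model (the events.csv file and its status column are hypothetical); the number of partitions determines how many tasks run in parallel for each stage:

df = spark.read.csv("events.csv", header=True)     # hypothetical input file

filtered = df.filter(df.status == "OK")            # transformation: lazy, only extends the DAG

print(filtered.rdd.getNumPartitions())             # partitions = parallel tasks for the next stage
print(filtered.count())                            # action: triggers the actual distributed job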


5️⃣ Cluster Manager Options

  • Standalone: Built-in, simple Spark cluster manager.

  • YARN: Hadoop’s cluster manager, widely used in enterprises.

  • Mesos: General-purpose cluster manager.

  • Kubernetes: Container-based cluster deployment.
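
In code, the cluster manager is chosen through the master URL passed to Spark; the hostnames below are placeholders, so treat this as a rough sketch:

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("cluster-manager-demo")

# Pick exactly one master URL depending on where Spark should run:
# builder = builder.master("local[*]")                          # no cluster, single machine
# builder = builder.master("spark://master-host:7077")          # Standalone
# builder = builder.master("yarn")                              # YARN (needs HADOOP_CONF_DIR / YARN_CONF_DIR)
# builder = builder.master("mesos://mesos-master:5050")         # Mesos
# builder = builder.master("k8s://https://k8s-apiserver:443")   # Kubernetes

spark = builder.master("local[*]").getOrCreate()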

6️⃣ Diagram (Text Version)

Driver (SparkSession / SparkContext: builds the DAG, schedules tasks)
        |
        v
Cluster Manager (Standalone / YARN / Mesos / Kubernetes)
        |
   +------------------+------------------+
   |                  |                  |
Executor 1        Executor 2        Executor 3
(run tasks, cache data, send results back to the driver)

=============================================

Apache Spark Components

1️⃣ Spark Core

  • What it is: The heart of Spark. All other components (SQL, Streaming, MLlib, GraphX) run on top of Spark Core.

  • Responsibilities:

    • RDDs: Distributed collections of data

    • Task scheduling: Decides which worker (executor) will run which task

    • Memory management: Keeps data in memory for fast processing

    • Fault tolerance: Automatically recovers lost data

  • Example Use: You have a big CSV file, and you want to filter and map data in parallel.

rdd = sc.textFile("data.csv")                      # load the file as an RDD of lines
rdd_filtered = rdd.filter(lambda x: "USA" in x)    # keep only lines containing "USA"

2️⃣ Spark SQL

  • What it is: Module for structured data. Works with DataFrames and SQL queries.

  • Why it’s useful:

    • Easier than working with raw RDDs

    • Optimized using Catalyst Optimizer → faster queries

  • Example Use: You want to query a Parquet file like a SQL database:

df = spark.read.parquet("data.parquet")
df.createOrReplaceTempView("people")
spark.sql("SELECT Name, Age FROM people WHERE Age > 25").show()

3️⃣ Spark Streaming / Structured Streaming

  • What it is: Module for real-time data processing.

  • Why it’s useful: Handles continuous streams of data (Kafka, sockets, logs).

  • Example Use: Monitor real-time server logs:

df_stream = (
    spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load()
)

df_stream.writeStream.format("console").start().awaitTermination()

4️⃣ MLlib (Machine Learning Library)

  • What it is: Spark’s distributed machine learning library.

  • Why it’s useful: Can train models on very large datasets across multiple nodes.

  • Supported algorithms: Classification, regression, clustering, recommendation, etc.

  • Example Use: Train a linear regression model on a big dataset:

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(training_data)
predictions = model.transform(test_data)

5️⃣ GraphX

  • What it is: Module for graph and network analysis.

  • Why it’s useful: Analyze relationships in data, like social networks.

  • Example Use: Find shortest path between users or recommend friends.

  • In Python, GraphX itself is rarely used; the GraphFrames package is usually preferred (see the sketch below).
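
For orientation, here is a small GraphFrames sketch; it assumes the separate graphframes package is installed and available to Spark, and uses GraphFrames' required id / src / dst column names:

from graphframes import GraphFrame

# Vertices need an "id" column; edges need "src" and "dst" columns
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
    ["id", "name"],
)
edges = spark.createDataFrame(
    [("a", "b", "friend"), ("b", "c", "friend")],
    ["src", "dst", "relationship"],
)

g = GraphFrame(vertices, edges)
g.shortestPaths(landmarks=["c"]).show()   # shortest-path length from every user to "c"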

=============================================

1️⃣ What is Lazy Evaluation?

👉 Lazy evaluation means PySpark does NOT execute operations immediately.
Instead, it records the operations, builds an optimized logical plan, and executes them only when a result is required (i.e., when an action is performed).

In simple words:

Transformations are lazy, Actions are eager


2️⃣ Transformations vs Actions (Core idea)

🔹 Transformations (Lazy)

These do not trigger execution:

  • select()

  • filter()

  • map()

  • withColumn()

  • join()

  • groupBy()

They only build a logical execution plan (DAG).

df2 = df.filter(df.age > 25)
df3 = df2.select("name", "age")
⚠️ No computation happens yet

🔹 Actions (Trigger execution)

These force Spark to execute the plan:

  • show()

  • count()

  • collect()

  • take()

  • write()

df3.show()   # 🔥 Execution happens here

3️⃣ What happens internally? (Spark flow)

When you write PySpark code:

Step 1: Build Logical Plan

df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df2 = df.filter(df.salary > 50000)
df3 = df2.groupBy("dept").count()

Spark just creates a DAG (Directed Acyclic Graph):

Read CSV → Filter → GroupBy → Count

❌ No execution yet


Step 2: Action triggers execution

df3.show()

Now Spark:

  1. Optimizes the plan (Catalyst Optimizer)

  2. Creates a physical execution plan

  3. Submits jobs to executors

  4. Reads data + performs transformations


4️⃣ Why does Spark use Lazy Evaluation?

✅ 1. Performance Optimization

Spark can:

  • Reorder operations

  • Push filters early (predicate pushdown)

  • Remove unused columns

  • Combine transformations

Example:

df.select("name").filter(df.age > 30)

Spark internally optimizes to:

Filter → Select

✅ 2. Avoids Unnecessary Computation

If no action is called, Spark does nothing.

df.filter(df.salary > 100000) # No action → No job → No compute

✅ 3. Fault Tolerance

Spark tracks lineage, not data.
If a partition fails, Spark recomputes it using the DAG.
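
You can actually inspect the lineage Spark keeps at the RDD level with toDebugString(); a quick sketch:

rdd = spark.sparkContext.parallelize(range(100))
doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 50)

# Prints the chain of transformations Spark would replay to rebuild a lost partition
print(doubled.toDebugString().decode())   # toDebugString() returns bytes in PySpark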


5️⃣ Example Showing Lazy Evaluation

df = spark.read.parquet("sales")
df2 = df.filter(df.amount > 1000)
df3 = df2.select("order_id", "amount")

print("Before action")
df3.count()
print("After action")

Output:

Before action
After action

💡 Actual reading + filtering happens only at count()


6️⃣ DAG Visualization

You can see lazy evaluation in action using:

df3.explain(True)

This shows:

  • Parsed Logical Plan

  • Optimized Logical Plan

  • Physical Plan


7️⃣ Caching & Lazy Evaluation (Very Important)

df2.cache()
df2.count()   # First action → computation + cache
df2.show()    # Second action → reused from cache

⚠️ cache() itself is lazy
Data is cached only when an action runs


8️⃣ Real-world Data Engineering Example

In ETL pipelines:

from pyspark.sql.functions import current_date

clean_df = (
    raw_df.filter("status = 'ACTIVE'")
          .withColumn("load_date", current_date())
)

clean_df.write.mode("overwrite").parquet("/curated")

  • Spark waits

  • Optimizes

  • Executes everything once

  • Writes final output

🚀 This is why Spark scales so well.


9️⃣ Interview One-Liner (Must Remember)

Lazy evaluation in PySpark means transformations are not executed immediately; Spark builds an execution plan and runs it only when an action is called, enabling optimization and efficient execution.


================================================























