Spark - Apache Spark Architecture

By Prasad Bonam | Last updated: 2023-10-01 00:53:31


Apache Spark is an open-source, distributed computing framework that provides a high-level interface for processing large-scale data across a cluster of computers. It is designed for speed, ease of use, and flexibility. The architecture of Apache Spark is a fundamental aspect of its design, enabling it to efficiently process data and provide fault tolerance. Here is an explanation of the key components of the Apache Spark architecture:


  1. Cluster Manager:

    • The cluster manager is an external service responsible for allocating and managing cluster resources for Spark applications.
    • Apache Spark supports multiple cluster managers, including Apache Mesos, Hadoop YARN, Kubernetes, and its built-in standalone cluster manager.
  2. Driver Program:

    • The driver program is the entry point for the Spark application. It runs the main function and creates a SparkContext.
    • The driver program defines the high-level control flow of the application, specifying transformations, actions, and data sources.
    • It communicates with the cluster manager to request resources for the application.
  3. SparkContext:

    • The SparkContext is the entry point to Spark functionality in application code. It represents the connection to the cluster.
    • It lives inside the driver program and acts as a gateway to all Spark functionality, much like a database connection: just as every command you run against a database goes through the connection, everything you do in Spark goes through the SparkContext.
    • SparkContext is responsible for coordinating tasks, scheduling, and distributing the code and data across the cluster.
    • It manages the execution of Spark jobs, which are composed of stages and tasks.
  4. Cluster Worker Nodes:

    • Worker nodes are the individual machines in the cluster that execute Spark tasks.
    • Each worker node hosts one or more executors, which are separate JVM processes that run the tasks assigned by the driver program.
    • Worker nodes can be scaled horizontally to accommodate larger workloads.
  5. Resilient Distributed Dataset (RDD):

    • RDD is the fundamental data structure in Spark, representing a distributed collection of data.
    • RDDs are immutable and partitioned across the cluster, allowing for parallel processing and fault tolerance.
    • RDDs can be created from data in Hadoop Distributed File System (HDFS), local file systems, or other data sources.
  6. Directed Acyclic Graph (DAG) Scheduler:

    • Spark's DAG scheduler splits each job into stages, with stage boundaries at shuffle operations.
    • It constructs a directed acyclic graph of stages representing the execution plan of the job.
    • The scheduler then submits each stage, as a set of tasks, to the task scheduler for execution.
  7. Task Scheduler:

    • The task scheduler is responsible for scheduling tasks on worker nodes.
    • It takes into account data locality, task dependencies, and resource availability.
    • Tasks are launched through a scheduler backend that is specific to the cluster manager in use.
  8. Executor:

    • Executors run on worker nodes and are responsible for executing tasks assigned by the driver program.
    • Each executor has its own JVM and can cache data in memory for reuse across tasks.
    • Executors communicate with the driver program and store data in their memory for efficient data processing.
  9. Block Manager:

    • The block manager manages data storage and caching for RDDs in memory or on disk.
    • It runs on the driver and on every executor, serving blocks locally or fetching them from other nodes, and it reports block locations so that tasks can be scheduled close to their data.
  10. Storage Level:

    • Spark allows users to specify different storage levels for RDDs, such as MEMORY_ONLY, MEMORY_AND_DISK, and DISK_ONLY.
    • This lets users trade memory usage against the cost of reading from disk or recomputing partitions, based on their application's requirements. Replicated levels such as MEMORY_ONLY_2 also speed up recovery after a node failure.
  11. Cluster Mode Execution:

    • In cluster mode, the application is submitted to the cluster manager, which launches the driver inside the cluster, so the application runs independently of the client that submitted it.
    • This mode is suitable for long-running, production-grade applications.
  12. Client Mode Execution:

    • In client mode, the driver program runs on the client machine that submitted the application, and it communicates with the cluster manager to obtain executors and schedule tasks on them.
    • This mode is suitable for interactive or development use cases.

The Apache Spark architecture is designed to provide fault tolerance, data parallelism, and ease of use for distributed data processing. It allows users to build scalable and high-performance data processing applications with flexibility in data storage, execution modes, and cluster management integration.
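To make the RDD, DAG scheduler, and storage level components more concrete, here is a minimal sketch in Scala. It assumes Spark is on the classpath and runs in local mode; the object name RddLineageSketch, the sample words, and the choice of two partitions are illustrative assumptions rather than anything prescribed by Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddLineageSketch {
  // Returns (number of partitions, lineage description, word counts sorted by word).
  def run(): (Int, String, Seq[(String, Int)]) = {
    // local[2] runs the driver and two worker threads in one JVM (an assumption for this sketch).
    val conf = new SparkConf().setAppName("RddLineageSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Create an RDD split into two partitions.
      val words = sc.parallelize(Seq("spark", "driver", "executor", "spark", "task"), numSlices = 2)

      // Transformations return new RDDs; `words` itself is never modified.
      val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

      // Request that computed partitions be kept in memory, falling back to disk if they do not fit.
      counts.persist(StorageLevel.MEMORY_AND_DISK)

      // The lineage is what Spark uses to plan stages and to recompute lost partitions.
      val lineage = counts.toDebugString

      // collect() is an action: it triggers a job and returns the results to the driver.
      val result = counts.collect().sortBy(_._1).toSeq
      (words.getNumPartitions, lineage, result)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (partitions, lineage, result) = run()
    println(s"Partitions: $partitions")
    println(lineage)
    println(result.mkString(", "))
  }
}
```

The lineage printed by toDebugString lists the chain of RDDs behind counts. The shuffle introduced by reduceByKey is the point where the DAG scheduler would split the job into two stages.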


A Cluster Manager in Apache Spark is a crucial component responsible for managing cluster resources and allocating them to the Spark applications running on a cluster. Apache Spark supports several cluster managers, including Apache Mesos, Hadoop YARN, Kubernetes, and its built-in standalone cluster manager. This section describes the Cluster Manager in more detail and shows how to submit an application to each one.

Cluster Manager Responsibilities:

  1. Resource Allocation: The Cluster Manager is responsible for allocating resources, such as CPU and memory, to Spark applications. It manages the available resources across worker nodes in the cluster.

  2. Application Scheduling: It decides which applications receive executors, and on which worker nodes, based on resource availability. Scheduling of jobs, stages, and tasks within an application is handled by the driver, not by the Cluster Manager.

  3. Fault Tolerance: The Cluster Manager monitors the health of worker nodes and makes replacement resources available when a node fails. Re-running the failed tasks themselves is handled by the driver.

  4. Dynamic Resource Allocation: When dynamic allocation is enabled, a Spark application can scale its number of executors up and down with the workload. This is available with the standalone, YARN, Mesos, and Kubernetes cluster managers.

  5. Cluster Coordination: It communicates with the driver program and worker nodes to coordinate the execution of Spark applications, making sure they run smoothly.
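As a rough illustration of how an application states the resources it wants from the cluster manager, the sketch below builds a SparkConf with a few resource-related settings. The property keys are standard Spark configuration names; the values and the object name ResourceRequestSketch are assumptions chosen for the example, and suitable values depend on your cluster.

```scala
import org.apache.spark.SparkConf

object ResourceRequestSketch {
  // Builds a configuration only; nothing is submitted to a cluster here.
  def buildConf(): SparkConf =
    new SparkConf(false) // false: do not load spark.* system properties, keeps the sketch deterministic
      .setAppName("ResourceRequestSketch")
      // Memory and CPU cores requested for each executor (illustrative values).
      .set("spark.executor.memory", "2g")
      .set("spark.executor.cores", "2")
      // Let Spark grow and shrink the number of executors with the workload.
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "4")
      // Dynamic allocation also needs shuffle tracking or an external shuffle service.
      .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      // How many times the driver retries a failing task before giving up on the job.
      .set("spark.task.maxFailures", "4")

  def main(args: Array[String]): Unit =
    buildConf().getAll.sortBy(_._1).foreach { case (key, value) => println(s"$key=$value") }
}
```

The same settings can be passed on the command line with spark-submit --conf key=value, which keeps resource choices out of the application code.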

Examples of Cluster Managers:

  1. Standalone Cluster Manager:

    Apache Spark includes its own standalone cluster manager, which is the simplest option for deploying Spark applications. To start a standalone cluster, you can use the sbin/start-master.sh and sbin/start-worker.sh scripts included with Spark.

    Example:

    bash
    ./sbin/start-master.sh
    ./sbin/start-worker.sh <master-url>
  2. Apache Mesos:

    Mesos is a general-purpose cluster manager that can also be used with Spark. It provides resource isolation and efficient resource sharing among multiple Spark applications and other distributed systems.

    Example:

    bash
    spark-submit --master mesos://<mesos-master-url> --class com.example.MyApp myapp.jar
  3. Hadoop YARN:

    YARN (Yet Another Resource Negotiator) is a resource management and job scheduling framework used in Hadoop clusters. Spark can be run on YARN as an application.

    Example:

    bash
    spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp myapp.jar
  4. Kubernetes:

    Apache Spark can also be run on Kubernetes clusters. Kubernetes provides container orchestration, and Spark can be packaged into containers for deployment.

    Example:

    bash
    spark-submit --master k8s://<kubernetes-master-url> --deploy-mode cluster --conf spark.kubernetes.container.image=<spark-image> --class com.example.MyApp myapp.jar

Each of these examples demonstrates how Spark can be configured to work with different cluster managers. Depending on your cluster infrastructure and requirements, you can choose the most suitable cluster manager for your Spark applications.
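In all of these commands, the value passed to --master is what selects the cluster manager. The small helper below summarizes that mapping. It is a hypothetical function written for this article, not part of Spark's API, and it only recognizes the URL forms mentioned here plus local mode.

```scala
object MasterUrlSketch {
  // Hypothetical helper (not a Spark API): names the cluster manager a --master value selects.
  def clusterManagerFor(master: String): String = master match {
    case m if m == "local" || m.startsWith("local[") => "local mode (no cluster manager)"
    case m if m.startsWith("spark://")               => "standalone"
    case m if m.startsWith("mesos://")               => "Apache Mesos"
    case "yarn"                                      => "Hadoop YARN"
    case m if m.startsWith("k8s://")                 => "Kubernetes"
    case other                                       => s"unrecognized master URL: $other"
  }

  def main(args: Array[String]): Unit = {
    // Host names and ports below are placeholders.
    val examples = Seq("local[*]", "spark://host:7077", "mesos://host:5050", "yarn", "k8s://https://host:6443")
    examples.foreach(m => println(s"$m -> ${clusterManagerFor(m)}"))
  }
}
```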

In Apache Spark, the Driver Program is a crucial component that plays a central role in the execution of Spark applications. It is responsible for orchestrating the entire application, defining the high-level control flow, and managing the interaction with the cluster. Below, I will explain the Driver Program's responsibilities and provide examples to illustrate its role.

Driver Program Responsibilities:

  1. Application Entry Point: The Driver Program is the entry point for your Spark application. It is where the application's main function is executed.

  2. Job Coordination: It divides each Spark job into stages and tasks and schedules them for execution. It communicates with the Cluster Manager to obtain the resources on which those tasks run.

  3. Defining Transformations and Actions: The Driver Program defines the high-level logic of the application by specifying transformations and actions on RDDs (Resilient Distributed Datasets).

  4. Monitoring and Logging: It monitors the progress of Spark jobs and collects logs and statistics from worker nodes. It can be used for debugging and performance tuning.

  5. Fault Tolerance: The Driver Program is responsible for detecting task failures and re-scheduling failed tasks on other worker nodes to ensure fault tolerance.

  6. Data Serialization and Distribution: It serializes task code and shared variables, such as broadcast variables, and ships them to the executors so that each task has what it needs to run.
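The job coordination and monitoring responsibilities can be observed from inside the driver with a SparkListener. The following is a minimal sketch, assuming Spark is on the classpath and local mode is acceptable; the object name DriverMonitoringSketch and the sample computation are assumptions made for illustration.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}

object DriverMonitoringSketch {
  // Returns (number of finished jobs, number of completed stages) seen by the listener.
  def run(): (Int, Int) = {
    val conf = new SparkConf().setAppName("DriverMonitoringSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val jobs = new AtomicInteger(0)
    val stages = new AtomicInteger(0)

    // Listener callbacks run in the driver, on Spark's listener thread.
    sc.addSparkListener(new SparkListener {
      override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
        stages.incrementAndGet()
        println(s"Stage ${event.stageInfo.stageId} completed with ${event.stageInfo.numTasks} tasks")
      }
      override def onJobEnd(event: SparkListenerJobEnd): Unit = {
        jobs.incrementAndGet()
        println(s"Job ${event.jobId} ended: ${event.jobResult}")
      }
    })

    try {
      // One action (collect) submits one job; reduceByKey adds a shuffle between two stages.
      sc.parallelize(1 to 100, numSlices = 4)
        .map(n => (n % 3, n))
        .reduceByKey(_ + _)
        .collect()
    } finally {
      // Stopping the context also shuts down the listener bus, so the counters are read afterwards.
      sc.stop()
    }
    (jobs.get, stages.get)
  }

  def main(args: Array[String]): Unit = {
    val (jobCount, stageCount) = run()
    println(s"Jobs: $jobCount, stages: $stageCount")
  }
}
```

Listener events are delivered asynchronously, which is why the sketch reads the counters only after the SparkContext has been stopped.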

Example of a Driver Program:

Here is a simple example of a Spark Driver Program written in Scala:

scala
import org.apache.spark.{SparkConf, SparkContext}

object SparkDriverProgram {
  def main(args: Array[String]): Unit = {
    // Create a SparkConf and SparkContext
    val conf = new SparkConf().setAppName("SparkDriverProgram").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Create an RDD from a collection
    val data = Seq(1, 2, 3, 4, 5)
    val rdd = sc.parallelize(data)

    // Define a transformation (map) and an action (count)
    val mappedRDD = rdd.map(_ * 2)
    val count = mappedRDD.count()

    // Print the result
    println(s"Transformed data: ${mappedRDD.collect().mkString(", ")}")
    println(s"Count: $count")

    // Stop the SparkContext
    sc.stop()
  }
}

In this example:

  • We create a SparkConf and SparkContext to configure and initialize Spark.
  • We create an RDD rdd from a collection of integers.
  • We define a transformation by using map to double each element in the RDD.
  • We define an action by using count to count the number of elements in the RDD.
  • Finally, we print the results and stop the SparkContext.

The Driver Program (SparkDriverProgram) is responsible for orchestrating the execution of these Spark operations.

To run this example, you would typically use the spark-submit script, providing the path to your application's JAR file as an argument.

bash
spark-submit --class SparkDriverProgram --master "local[*]" your-app.jar

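This will execute the Driver Program and perform the Spark operations defined within it. Note that a master set in code with setMaster takes precedence over the --master flag, so applications meant to run on a real cluster usually leave it out of the SparkConf.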
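One detail worth seeing in code is that transformations such as map are lazy: the driver only records them, and nothing runs on the executors until an action is called. The sketch below uses an accumulator to show this. It assumes local mode with no task retries; the object name LazyEvaluationSketch and the accumulator name are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (elements processed before the action, elements processed after it, sum of doubled values).
  def run(): (Long, Long, Int) = {
    val conf = new SparkConf().setAppName("LazyEvaluationSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Accumulator updated on the executors and read back on the driver.
      val processed = sc.longAccumulator("processed")

      // Transformation only: this line builds the plan but does not run the function.
      val doubled = sc.parallelize(Seq(1, 2, 3, 4, 5)).map { x =>
        processed.add(1L)
        x * 2
      }
      val before = processed.sum

      // Action: the driver now submits a job and the map function runs once per element.
      val total = doubled.reduce(_ + _)
      val after = processed.sum

      (before, after, total)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (before, after, total) = run()
    println(s"Processed before the action: $before")
    println(s"Processed after the action: $after")
    println(s"Sum of doubled values: $total")
  }
}
```

Accumulator updates made inside a transformation can be applied more than once if a task is re-executed, so a counter like this is best treated as a debugging aid rather than an exact measure.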
