Overview of the Hadoop architecture
By Prasad Bonam | Last updated: 2023-07-12
Hadoop is an open-source framework that provides a distributed computing system for processing and storing large datasets across clusters of commodity hardware. It follows a master-slave architecture and consists of several core components. Here is an overview of the Hadoop architecture:
Hadoop Distributed File System (HDFS):
- HDFS is the primary storage system in Hadoop. It is designed to store and manage large datasets across multiple machines in a distributed manner. HDFS breaks data into blocks, replicates them across the cluster for fault tolerance, and provides high-throughput access to data.
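The block-splitting and replication described above can be sketched in a few lines of Python. This is a conceptual simulation, not the HDFS implementation: the node names and round-robin placement are illustrative (real HDFS placement is rack-aware), though the defaults mentioned in the comments (128 MB blocks, replication factor 3) are Hadoop's actual defaults.

```python
import itertools

def split_into_blocks(data: bytes, block_size: int):
    """Split a byte payload into fixed-size blocks, as HDFS does with files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list, replication: int = 3):
    """Assign each block to `replication` distinct nodes.
    Round-robin here; real HDFS placement is rack-aware."""
    placement = {}
    cycle = itertools.cycle(nodes)
    for block_id in range(num_blocks):
        placement[block_id] = [next(cycle) for _ in range(replication)]
    return placement

data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)  # HDFS default is 128 MB
print(len(blocks))        # 4 blocks (256 + 256 + 256 + 232 bytes)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(placement[0])       # ['node1', 'node2', 'node3']
```

Because each block lives on three different nodes, losing any single node leaves at least two readable copies — this is the fault tolerance HDFS provides.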
Yet Another Resource Negotiator (YARN):
- YARN is the resource management framework in Hadoop. It manages resources and schedules tasks across the nodes in a Hadoop cluster. YARN consists of a ResourceManager, which manages cluster resources, and NodeManagers, which manage resources on individual nodes. YARN allows multiple workloads to run on the same cluster, including MapReduce, Spark, and other distributed processing frameworks.
MapReduce:
- MapReduce is a programming model and processing engine in Hadoop for distributed data processing. It allows developers to write parallelizable computations that can be executed across the nodes in a Hadoop cluster. MapReduce processes data in two stages - Map and Reduce - to perform distributed data processing and aggregation.
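The classic way to see the Map and Reduce stages is word count. The sketch below runs the whole model in-process in plain Python — in Hadoop, the mapped pairs would be partitioned and shipped across the network between the two stages (the "shuffle"), and mappers and reducers would run on different nodes.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input record
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by key across mapper outputs
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values seen for one key
    return key, sum(values)

lines = ["big data big cluster", "big data"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

Because each map call touches only one record and each reduce call touches only one key, both stages parallelize naturally across a cluster.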
Hadoop Common:
- Hadoop Common provides the necessary libraries and utilities that are required by other Hadoop components. It includes common utilities, configuration files, and libraries used by Hadoop modules.
Hadoop Clients:
- Hadoop Clients are the applications or tools that interact with the Hadoop cluster. They can submit jobs, access data stored in HDFS, and manage cluster resources. Examples of Hadoop clients include the Hadoop command-line interface (CLI), Hadoop Streaming, Hive, Pig, and Sqoop.
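Hadoop Streaming, mentioned above, lets you write mappers and reducers in any language that reads stdin and writes tab-separated `key<TAB>value` lines to stdout. The sketch below shows the shape of such scripts in Python; the driver at the bottom simulates what the framework does (map, sort by key, reduce) so the example runs standalone without a cluster.

```python
from itertools import groupby

def mapper(lines):
    """Streaming-style mapper: emit 'word<TAB>1' for each word.
    In real Hadoop Streaming the records arrive on stdin."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Streaming-style reducer: input is sorted by key, so consecutive
    lines with the same word can be summed with groupby."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the framework: map, sort by key, then reduce
mapped = sorted(mapper(["hive pig hive"]))
for out in reducer(mapped):
    print(out)   # hive<TAB>2, then pig<TAB>1
```

On a real cluster you would submit the two scripts with the `hadoop jar hadoop-streaming.jar` command, and the framework would handle the sorting and data movement between them.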
Hadoop Ecosystem:
- The Hadoop ecosystem consists of various additional components that extend the functionality of Hadoop, such as:
- Apache Hive: A data warehousing and SQL-like querying tool for Hadoop.
- Apache Pig: A high-level data flow scripting language for executing data transformations on Hadoop.
- Apache HBase: A distributed NoSQL database built on Hadoop.
- Apache Spark: A fast and general-purpose distributed computing engine.
- Apache Sqoop: A tool for transferring data between Hadoop and structured databases.
- Apache Kafka: A distributed streaming platform for handling real-time data feeds.
The Hadoop architecture provides distributed storage and processing of large datasets, fault tolerance through block replication, horizontal scalability, and parallel processing across a cluster of commodity machines, enabling organizations to analyze massive amounts of data cost-effectively.
Each component in the Hadoop architecture has its own responsibilities and can be configured and scaled independently based on the requirements of the use case or application.