Apache Hadoop - Distributed Big Data Framework
Apache Hadoop is an open-source framework for distributed storage and processing of large data sets across clusters of commodity hardware.
Apache Hadoop is a foundational open-source framework that enables the distributed processing of massive data sets across clusters of computers using simple programming models. The framework is designed to scale from single servers to thousands of machines, each offering local computation and storage. Hadoop’s architecture assumes hardware failures are common and handles them automatically at the application layer, providing a highly available service on top of unreliable infrastructure.
The Hadoop ecosystem consists of four core modules: Hadoop Common (shared utilities and libraries), HDFS (Hadoop Distributed File System for scalable, fault-tolerant storage), YARN (Yet Another Resource Negotiator for cluster resource management and job scheduling), and MapReduce (a programming model for parallel data processing). Beyond these core modules, a rich ecosystem of related projects has emerged, including Apache Hive for SQL-like querying, Apache HBase for real-time random read/write access, Apache Pig for data flow scripting, and Apache Oozie for workflow scheduling.
Hadoop revolutionized the way organizations approach data at scale and became the backbone of the first generation of “big data” platforms. While newer technologies like Apache Spark have supplanted MapReduce for many workloads, HDFS and YARN remain critical infrastructure components in many enterprise data platforms. Major technology companies including Yahoo, Facebook (Meta), and LinkedIn built their early data infrastructure on Hadoop, and it continues to underpin large-scale data processing at thousands of organizations worldwide.
Key Capabilities
- HDFS - Distributed file system that stores data across clusters with configurable replication for fault tolerance
- YARN - Resource management layer that schedules and monitors distributed applications across the cluster
- MapReduce - Programming paradigm for parallel batch processing of large data sets with automatic data locality optimization
- Ecosystem Integration - Foundation for a broad ecosystem including Hive, HBase, Pig, Sqoop, Flume, and many other projects
Cloud Equivalents
Apache Hadoop is the open-source foundation behind AWS EMR, Azure HDInsight, and Google Cloud Dataproc. Managed services eliminate operational overhead of cluster management and patching, but self-hosted Hadoop offers greater control over configuration, versioning, and cost optimization at scale.
Origins and History
Apache Hadoop was created by Doug Cutting and Mike Cafarella in 2006, inspired by Google’s MapReduce and Google File System papers published in 2003-2004. The project was named after Cutting’s son’s toy elephant. It became an Apache top-level project in 2008. Hadoop is licensed under the Apache License 2.0. Key milestones include the introduction of YARN in Hadoop 2.0 (2013), which decoupled resource management from MapReduce, and Hadoop 3.0 (2017), which introduced erasure coding and support for more than two NameNodes.
Sources
- https://hadoop.apache.org/
- Dean, J. and Ghemawat, S. “MapReduce: Simplified Data Processing on Large Clusters.” OSDI, 2004.
Need help implementing this?
Turn this knowledge into a working prototype. Our structured workshop methodology takes you from idea to deployed AI solution in three sessions.
Explore AI Workshops