Apache Hadoop (2006)
An open-source framework for storing and processing huge datasets across clusters of cheap machines, the system that brought distributed big-data computing to the mainstream.
Apache Hadoop is an open-source framework for storing and processing very large datasets across clusters of ordinary machines. Doug Cutting and Mike Cafarella split it out of the Nutch search project in early 2006, drawing on Google’s published designs. It made big-data processing affordable and put it within reach of any company, not only web giants.

What it was
Before Hadoop, processing terabytes of data meant buying one very large, very expensive machine. That approach hit a ceiling. A single server can hold only so many disks, and scaling it up cost more each step. Web companies were generating data faster than any one machine could handle.
Hadoop took a different path. Instead of one big computer, it used many small, cheap ones working together as a cluster. The framework handled the hard parts: splitting data across machines, running work in parallel, and recovering when a machine failed. Failure was treated as normal, not as an emergency.
Two parts did the core work. HDFS, the distributed file system, chopped a huge file into blocks and scattered them across the cluster. It kept several copies of each block, so a dead disk lost no data. MapReduce was the processing model. A map step ran in parallel across every machine that held the data, then a reduce step combined the partial results into a final answer.
Think of counting every word in a vast library. One person reading every book would take years. Hadoop hands each reader a few shelves, asks them to tally words on their own shelves, then merges the tallies. The work moves to where the books sit, not the books to one desk.
Why it mattered
Hadoop made big data cheap. A cluster of commodity servers cost a fraction of one high-end machine with the same capacity. To grow, you added more nodes rather than replacing the whole system. This linear, low-cost scaling changed what companies could afford to analyse.
It democratised a capability that had belonged to a few. Google described its methods in papers but kept the code private. Hadoop gave everyone a working, open implementation under the Apache licence. Yahoo invested heavily and ran some of the largest clusters of the era, then contributed back. By 2008 a Hadoop cluster sorted a terabyte of data in record time, proving the model at scale.
A whole ecosystem grew around it. Apache Hive added a SQL-like layer so analysts could query data without writing MapReduce code. Apache Pig offered a scripting language. HBase added a database for fast lookups. Apache ZooKeeper coordinated the moving parts. Companies like Cloudera and Hortonworks built businesses packaging and supporting these tools. The phrase “data lake”, a single store holding all of an organisation’s raw data, took hold in this period.
How it connects to AI today
Hadoop’s deepest legacy is the idea that you process data where it lives, across many machines, and design for failure from the start. Modern AI lives on this idea. Training a large model means moving huge datasets through clusters of machines in parallel, the same shape of problem Hadoop solved first.
The direct technical successor is Apache Spark. Spark kept Hadoop’s distributed model but ran computation in memory, which made it far faster for the repeated passes that machine learning needs. Many teams still run Spark on a Hadoop cluster, reading from HDFS and scheduling work through YARN, Hadoop’s resource manager. So Hadoop often sits underneath the AI data pipeline even when nobody writes MapReduce by hand.
YARN itself is a turning point. Introduced in Hadoop 2 around 2013, it separated resource management from MapReduce. That let one cluster run many kinds of workloads, including Spark and early machine-learning jobs. The pattern of a shared scheduler handing GPUs and CPUs to competing jobs echoes through today’s AI training platforms.
A builder meets Hadoop’s descendants constantly. Cloud data lakes on Amazon S3, Azure Data Lake Storage, and Google Cloud Storage borrowed the architecture, then swapped HDFS for cheaper object storage. Platforms like Databricks grew straight out of the Spark and Hadoop world. When you assemble a training set, clean event logs, or build features for a model, you are working in a pipeline that Hadoop’s design made normal.
Still in use today
Hadoop is in maintenance. The Apache project is alive and releases updates, but the focus is stability, security, and compatibility rather than fast new features. The excitement and the new workloads have moved elsewhere. The two companies that drove its commercial peak, Cloudera and Hortonworks, merged in 2019, a clear sign the gold-rush phase had ended.
Two forces pulled work away from it. Apache Spark replaced MapReduce for most processing because it is faster and friendlier to write. Cloud storage and warehouses, separating compute from storage, replaced HDFS for many new projects because they scale and cost less without a fixed cluster.
Yet Hadoop persists. Large on-premise data lakes built through the 2010s still run on HDFS and YARN, and migrating petabytes is slow and costly. Organisations with strict data-residency or regulatory rules keep running their own clusters. So Hadoop endures as durable infrastructure, while its core concepts now shape almost every distributed data and AI system in use.
Further reading
- IT History Timeline : see where Hadoop sits in the broader story of computing.
- AI Learning Galaxy : explore how big-data foundations connect to modern AI topics.
- SQL (1974) : the query language that Hive brought to Hadoop and that still rules data work.
- Java (1995) : the language Hadoop is written in and runs on.
- Apache Hadoop project site : official documentation, releases, and component guides.
- MapReduce paper, Dean and Ghemawat, 2004 : the Google paper that inspired Hadoop’s processing model.
- Apache Hadoop on Wikipedia : history, architecture, and ecosystem overview with sources.
Frequently asked questions