Apache Hive Data Warehouse on Hadoop

Apache Hive is a data warehouse infrastructure built on top of Apache Hadoop that provides SQL-like querying capabilities for large-scale data summarization, analysis, and ETL.

apache-hiveopen-sourcehadoopdata-warehousesqlbig-data

Connected Apache Hadoop - Distributed Big Data Framework Apache Spark - Unified Big Data Processing Engine

At a glance

OpennessOpen source

Self-hostYes

Learn this your way

Read Guided course

Apache Hive is an open-source data warehouse system built on top of Apache Hadoop that enables reading, writing, and managing large datasets stored in distributed storage using SQL-like syntax called HiveQL. Hive translates SQL queries into MapReduce, Tez, or Spark execution plans, allowing analysts and data engineers familiar with SQL to query petabyte-scale datasets without writing low-level MapReduce code. For AI workloads, Hive serves as a data preparation and feature extraction layer, enabling SQL-based transformations over large historical datasets that feed into machine learning training pipelines.

Official documentation: https://hive.apache.org/

Key Capabilities

SQL-Like Query Language (HiveQL) - Provides a familiar SQL interface for querying data stored in HDFS, S3, and other Hadoop-compatible file systems, lowering the barrier for analysts working with big data
Schema-on-Read - Data is stored in raw form and schema is applied at query time, allowing flexible exploration of semi-structured and evolving datasets without upfront schema migration
Partitioning and Bucketing - Table partitioning by date, region, or other columns enables query pruning that dramatically reduces scan times on large datasets
Multiple Execution Engines - Queries can execute on MapReduce, Apache Tez (optimized DAG execution), or Apache Spark, with Tez being the default for interactive query performance

AWS/Cloud Equivalent

Apache Hive runs as a component on Amazon EMR clusters and is available on other managed Hadoop services. Amazon Athena provides a serverless Hive-compatible query experience over S3 data using the Presto/Trino engine, eliminating the need to manage Hive clusters. Google BigQuery and Azure Synapse Analytics provide fully managed cloud-native alternatives for SQL analytics at scale. For most new projects, serverless SQL engines (Athena, BigQuery) have largely replaced Hive for ad-hoc querying, while Hive remains relevant for existing Hadoop-based data pipelines and ETL workflows.

Origins and History

Apache Hive was created at Facebook in 2007 by Jeff Hammerbacher and a team of engineers who needed to enable SQL-based analysis over the company’s rapidly growing Hadoop data warehouse. Facebook open-sourced Hive in August 2008, and it became a top-level Apache project in September 2010. Hive was instrumental in making Hadoop accessible to a broader audience beyond Java developers. The introduction of Tez as an execution engine in Hive 0.13 (2014) significantly improved query performance over MapReduce. Hive LLAP (Live Long and Process), introduced in Hive 2.0 (2016), added persistent query execution daemons and in-memory caching for interactive query latency. While Spark SQL and cloud-native query engines have reduced Hive’s dominance, it remains widely deployed in on-premises Hadoop environments and as a metastore standard (Hive Metastore) used by Spark, Presto, Trino, and other engines.

Sources

Apache Software Foundation. “Apache Hive.” https://hive.apache.org/
Thusoo, A. et al. “Hive: A Warehousing Solution Over a Map-Reduce Framework.” VLDB, 2009.

Open source projects

Freelancer Templates Contracts, proposals, SOWs

Freelancer Automation Workflow recipes, AI playbooks

Work with Linda

Workshop Series €2,000/mo x 3

1:1 Consulting 60 min session