Relational Algebra
The formal mathematical foundation for querying relational databases, defining operations like select, project, and join that underpin SQL and query optimization.
Relational algebra is a procedural query language that operates on relations (tables) and produces relations as output. It provides the theoretical foundation for SQL and serves as the internal representation that database query optimizers use to evaluate and transform queries before execution.
Fundamental Operations
Selection (sigma) filters rows from a relation based on a predicate. It takes a relation and a condition and returns only the rows that satisfy that condition. In SQL terms, selection corresponds to the WHERE clause.
Projection (pi) filters columns, returning a relation with only the specified attributes. Duplicate rows are eliminated in the formal definition. This corresponds to the SELECT column list in SQL.
Union combines two union-compatible relations (same number and types of columns) and returns all rows from both, removing duplicates.
Set Difference returns rows that appear in the first relation but not in the second. Both relations must be union-compatible.
Cartesian Product (cross product) combines every row of one relation with every row of another, producing a relation with all columns from both. This is rarely used alone but serves as the building block for joins.
Rename (rho) changes the name of a relation or its attributes, enabling self-joins and disambiguation.
Derived Operations
Join is the most heavily used derived operation. A natural join combines a Cartesian product with a selection on matching column names. Theta joins allow arbitrary comparison predicates. Equi-joins restrict the predicate to equality. Left, right, and full outer joins preserve unmatched rows from one or both relations.
Intersection returns rows common to two union-compatible relations and can be expressed as a combination of set difference operations.
Division answers queries of the form “find all X that are related to every Y” and is useful for universal quantification problems.
Role in Query Optimization
Database engines parse SQL into a relational algebra expression tree, then apply equivalence rules to transform it into more efficient forms. Pushing selections down the tree to reduce intermediate result sizes, reordering joins to minimize cost, and combining projections are all algebraic transformations that improve query performance without changing the result.
Origins and History
E.F. Codd defined relational algebra alongside the relational model in his 1970 paper [1]. He established that relational algebra and relational calculus (a nonprocedural alternative) are equivalent in expressive power, a property known as relational completeness. This work provided the mathematical rigor that distinguished relational databases from earlier hierarchical and network models.
Sources
- Codd, E.F. (1970). “A Relational Model of Data for Large Shared Data Banks.” Communications of the ACM, 13(6), 377-387.
- Codd, E.F. (1972). “Relational Completeness of Data Base Sublanguages.” IBM Research Report RJ987.
Need help implementing this?
Turn this knowledge into a working prototype. Our structured workshop methodology takes you from idea to deployed AI solution in three sessions.
Explore AI Workshops