Hash Join
Synonyms
Definition
The hash join is a common join algorithm in database systems using hashing. The join predicate needs to be an equality join predicate. The classic algorithm consists of two phases: the “build” phase and the “probe” phase. In the “build” phase, the algorithm builds a hash table on the smaller relation, say R, by applying a hash function to the join attribute of each tuple. In the “probe” phase, the algorithm probes the hash table using tuples of the larger relation, say S, to find matches.
Key Points
The classic algorithm is simple, but it requires that the smaller join relation fits into memory. If there is no enough memory to hold all the tuples in R, an additional “partition” phase is required. There are several variants of the classic hash join algorithm. They differ in terms of utilizing memory and handling overflow.
Grace Hash Join The idea behind grace hash join is to hash partition both relations on the join attribute, using the same hash function. As the result, each relation is hashed into k partitions, and these partitions are written to disk. The key observation is that R tuples in partition i can join only with S tuples in the same partition i. If any given partition of R can be hold in memory, the algorithm can read in and build a hash table on the partition of R, and then probe the hash table using tuples of the corresponding partition of S for matches.
If one or more of the partitions still does not fit into the available memory (for instance, due to data skewness), the algorithm is recursively applied. An additional orthogonal hash function is chosen to hash the large partition into sub-partitions, which are then processed as before.
| Algorithm 1: Grace Hash Join: R ⋈ r=s S | |
| // partition R into k partitions | |
| foreach R ∈ R do | |
| read R and add it to buffer page h 1 (R); | |
| flush the page to disk if full; | |
| end | |
| // partition S into k partitions | |
| foreach S ∈ S do | |
| read S and add it to buffer page h 1 (S); | |
| flush the page to disk if full; | |
| end | |
| // “build” and “probe” phases | |
| for i ← 1 to k do | |
| foreach R ∈ partition R i do | |
| read R and insert into the hash table using h 2 (R); | |
| end | |
| foreach S ∈ partition S i do | |
| read S and probe the hash table using h 2(S); | |
| for matching R tuples, add {R,S} to result; | |
| end | |
| clear the hash table and release the memory; | |
| end |
Suppose there are enough extra memory to hold an in-memory hash table for one partition, say the first partition, of R, the hybrid hash join does not write the partition to disk. Instead, it builds an in-memory hash table for the first partition of R during the “partition” phase. Similarly, when partitioning S, for the tuples in the first partition of S, the algorithm directly probes the in-memory hash table and writes out the results. At the end of the “partition” phase, the algorithm completes the join between the first partitions of R and S while partitioning the two relations. The algorithm then joins the remaining partitions as the grace hash join algorithm does.
Compared with the grace hash join algorithm, the hybrid hash join algorithm avoids writing the first partitions of R and S to disk during the “partition” phase and reading them in again during the “build” and the “probe” phases.
Cross-references
Recommended Reading
- 1.Mishra P. and Eich M.H. Join processing in relational databases. ACM Comput. Surv., 24(1):63–113, 1992.CrossRefGoogle Scholar