Encyclopedia of Database Systems

2009 Edition
| Editors: LING LIU, M. TAMER ÖZSU

Hash Join

  • Jingren Zhou
Reference work entry
DOI: https://doi.org/10.1007/978-0-387-39940-9_869

Synonyms

Definition

The hash join is a common join algorithm in database systems using hashing. The join predicate needs to be an equality join predicate. The classic algorithm consists of two phases: the “build” phase and the “probe” phase. In the “build” phase, the algorithm builds a hash table on the smaller relation, say R, by applying a hash function to the join attribute of each tuple. In the “probe” phase, the algorithm probes the hash table using tuples of the larger relation, say S, to find matches.

Key Points

The classic algorithm is simple, but it requires that the smaller join relation fits into memory. If there is no enough memory to hold all the tuples in R, an additional “partition” phase is required. There are several variants of the classic hash join algorithm. They differ in terms of utilizing memory and handling overflow.

Grace Hash Join The idea behind grace hash join is to hash partition both relations on the join attribute, using the same hash function. As the result, each relation is hashed into k partitions, and these partitions are written to disk. The key observation is that R tuples in partition i can join only with S tuples in the same partition i. If any given partition of R can be hold in memory, the algorithm can read in and build a hash table on the partition of R, and then probe the hash table using tuples of the corresponding partition of S for matches.

If one or more of the partitions still does not fit into the available memory (for instance, due to data skewness), the algorithm is recursively applied. An additional orthogonal hash function is chosen to hash the large partition into sub-partitions, which are then processed as before.

Hybrid Hash Join The hybrid hash join algorithm is a refinement of the grace hash join algorithm which takes advantage of more available memory. To partition R (S) into k partitions, the grace hash join uses one input buffer for reading in the relation and k output buffers, one for each partitions.

Algorithm 1: Grace Hash Join: R ⋈ r=s S

 

// partition R into k partitions

 

foreach RR do

 

  read R and add it to buffer page h 1 (R);

 

  flush the page to disk if full;

 

end

 

// partition S into k partitions

 

foreach SS do

 

  read S and add it to buffer page h 1 (S);

 

  flush the page to disk if full;

 

end

 

// “build” and “probe” phases

 

for i ← 1 to k do

 

  foreach Rpartition R i do

 

    read R and insert into the hash table using h 2 (R);

 

  end

 

  foreach Spartition S i do

 

    read S and probe the hash table using h 2(S);

 

    for matching R tuples, add {R,S} to result;

 

  end

 

  clear the hash table and release the memory;

 

end

 

Suppose there are enough extra memory to hold an in-memory hash table for one partition, say the first partition, of R, the hybrid hash join does not write the partition to disk. Instead, it builds an in-memory hash table for the first partition of R during the “partition” phase. Similarly, when partitioning S, for the tuples in the first partition of S, the algorithm directly probes the in-memory hash table and writes out the results. At the end of the “partition” phase, the algorithm completes the join between the first partitions of R and S while partitioning the two relations. The algorithm then joins the remaining partitions as the grace hash join algorithm does.

Compared with the grace hash join algorithm, the hybrid hash join algorithm avoids writing the first partitions of R and S to disk during the “partition” phase and reading them in again during the “build” and the “probe” phases.

Cross-references

Recommended Reading

  1. 1.
    Mishra P. and Eich M.H. Join processing in relational databases. ACM Comput. Surv., 24(1):63–113, 1992.CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Jingren Zhou
    • 1
  1. 1.Microsoft ResearchRedmondUSA