New algorithms for join and grouping operations

Graefe, Goetz

doi:10.1007/s00450-011-0186-9

New algorithms for join and grouping operations

Special Issue Paper
Published: 30 June 2011

Volume 27, pages 3–27, (2012)
Cite this article

Computer Science - Research and Development

Goetz Graefe¹

478 Accesses
13 Citations
4 Altmetric
Explore all metrics

Abstract

Traditional database query processing relies on three types of algorithms for join and for grouping operations. For joins, index nested loops join exploits an index on its inner input, merge join exploits sorted inputs, and hash join exploits differences in the sizes of the join inputs. For grouping, an index-based algorithm has been used in the past whereas today sort- and hash-based algorithms prevail. Cost-based query optimization chooses the most appropriate algorithm for each query and for each operation. Unfortunately, mistaken algorithm choices during compile-time query optimization are common yet expensive to investigate and to resolve.

Our goal is to end mistaken choices among join algorithms and among grouping algorithms by replacing the three traditional types of algorithms with a single one. Like merge join, this new join algorithm exploits sorted inputs. Like hash join, it exploits different input sizes for unsorted inputs. In fact, for unsorted inputs, the cost functions for recursive hash join and for hybrid hash join have guided our search for the new join algorithm. In consequence, the new join algorithm can replace both merge join and hash join in a database management system.

The in-memory components of the new join algorithm employ indexes. If the database contains indexes for one (or both) of the inputs, the new join can exploit persistent indexes instead of temporary in-memory indexes. Using database indexes to find matching input records, the new join algorithm can also replace index nested loops join.

In addition to join operations, a very similar algorithm supports grouping (“group by” queries in SQL) and duplicate elimination. For unsorted inputs, candidate output records take on the role of one of the inputs in a join operation. Our goal is to define a single grouping algorithm that can replace grouping by repeated index searches, by sorting, and by hashing. In other words, our goal is to end mistaken algorithm choices not only for joins and other binary matching operations but also for grouping and other unary matching operations in database query processing.

Finally, these new algorithms can be instrumental for efficient and robust data processing in a map-reduce environment, because ‘map’ and ‘reduce’ operations are similar in essentials to join and grouping operations.

Results from an implementation of the core algorithm are reported.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Query Processing of Pre-partitioned Data Using Sandwich Operators

Efficient generation of query plans containing group-by, join, and groupjoin

Article 17 August 2017

Block Nested Join and Sort Merge Join Algorithms: An Empirical Evaluation

References

Antoshenkov G, Ziauddin M (1996) Query processing and optimization in Oracle Rdb. VLDB J 5(4):229–237
Article Google Scholar
Avnur R, Hellerstein JM (2000) Eddies: continuously adaptive query processing. In: SIGMOD, pp 261–272
Chapter Google Scholar
Ballinger C, Fryer R (1997) Born to be parallel: why parallel origins give Teradata an enduring performance edge. IEEE Data Eng Bull 20(2):3–12
Google Scholar
Bayer R, Unterauer K (1977) Prefix B-Trees. ACM Trans Database Syst 2(1):11–26
Article Google Scholar
Bitton D, DeWitt DJ (1983) Duplicate record elimination in large data files. ACM Trans Database Syst 8(2):255–265
Article MATH Google Scholar
Bizarro P, Babu S, DeWitt DJ, Widom J (2005) Content-based routing: different plans for different data. In: VLDB, pp 757–768
Google Scholar
Bratbergsengen K (1984) Hashing methods and relational algebra operations. In: VLDB, pp 323–333
Google Scholar
Carey MJ, Kossmann D (1997) On saying “enough already!” in SQL. In: SIGMOD, pp 219–230
Chapter Google Scholar
Carey MJ, Kossmann D (1997) Processing top n and bottom n queries. IEEE Data Eng Bull 20(3):12–19
Google Scholar
Chandrasekaran S, Franklin MJ (2002) Streaming queries over streaming data. In: VLDB, pp 203–214
Google Scholar
Chaudhuri S, Shim K (1994) Including group-by in query optimization. In: VLDB, pp 354–366
Google Scholar
Cole RL, Graefe G (1994) Optimization of dynamic query evaluation plans. In: SIGMOD, pp 150–160
Google Scholar
Dean J, Ghemawat S (2004) MapReduce—simplified data processing on large clusters. In: OSDI, pp 137–150
Google Scholar
DeWitt DJ, Gerber RH (1985) Multiprocessor hash-based join algorithms. In: VLDB, pp 151–164
Google Scholar
DeWitt DJ, Katz RH, Olken F, Shapiro LD, Stonebraker M, Wood DA (1984) Implementation techniques for main memory database systems. In: SIGMOD, pp 1–8
Google Scholar
DeWitt DJ, Gerber RH, Graefe G, Heytens ML, Kumar KB, Muralikrishna M (1986) GAMMA—a high performance dataflow database machine. In: VLDB, pp 228–237
Google Scholar
DeWitt DJ, Naughton JF, Burger J (1993) Nested loops revisited. In: PDIS, pp 230–242
Google Scholar
Dittrich J-P, Seeger B, Taylor DS, Widmayer P (2002) Progressive merge join: a generic and non-blocking sort-based join algorithm. In: VLDB, pp 299–310
Google Scholar
Freytag JC, Goodman N (1989) On the translation of relational queries into iterative programs. ACM Trans Database Syst 14(1):1–27
Article Google Scholar
Fushimi S, Kitsuregawa M, Tanaka H (1986) An overview of the system software of a parallel relational database machine GRACE. In: VLDB, pp 209–219
Google Scholar
Gassner P, Lohman GM, Schiefer KB, Wang Y (1993) Query optimization in the IBM DB2 family. IEEE Data Eng Bull 16(4):4–18
Google Scholar
Graefe G (1993) Query evaluation techniques for large databases. ACM Comput Surv 25(2):73–170
Article Google Scholar
Graefe G (2000) Dynamic query evaluation plans: some course corrections? IEEE Data Eng Bull 23(2):3–6
Google Scholar
Graefe G (2003) Sorting and indexing with partitioned B-trees. In: CIDR
Google Scholar
Graefe G (2003) Executing nested queries. In: BTW, pp 58–77
Google Scholar
Graefe G (2006) Implementing sorting in database systems. ACM Comput. Surv. 38(3)
Graefe G (2006) B-tree indexes, interpolation search, and skew. In: DaMoN, p 5
Chapter Google Scholar
Graefe G (2007) Master-detail clustering using merged indexes. Inform Forsch Entwickl 21(3–4):127–145
Article Google Scholar
Graefe G (2010) A survey of B-tree locking techniques. ACM Trans Database Syst
Graefe G (2010) Robust sorting (in preparation)
Graefe G, Cole RL (1995) Fast algorithms for universal quantification in large databases. ACM Trans Database Syst 20(2):187–236
Article Google Scholar
Graefe G, DeWitt DJ (1987) The Exodus optimizer generator. In: SIGMOD, pp 160–172
Google Scholar
Graefe G, Kuno HA (2010) Self-selecting, self-tuning, incrementally optimized indexes. In: EDBT, pp 371–381
Chapter Google Scholar
Graefe G, Ward K (1989) Dynamic query evaluation plans. In: SIGMOD, pp 358–366
Google Scholar
Graefe G, Linville A, Shapiro LD (1994) Sort versus hash revisited. IEEE Trans Knowl Data Eng 6(6):934–944
Article Google Scholar
Graefe G, Bunker R, Cooper S (1998) Hash joins and hash teams in Microsoft SQL Server. In: VLDB, pp 86–97
Google Scholar
Graefe G, Kuno HA, Wiener JL (2009) Visualizing the robustness of query execution. In: CIDR
Google Scholar
Haas LM, Carey MJ, Livny M, Shukla A (1997) Seeking the truth about ad-hoc join costs. VLDB J 6(3):241–256
Article Google Scholar
Hagmann RB (1986) An observation on database buffering performance metrics. In: VLDB, pp 289–293
Google Scholar
Hellerstein JM (1998) Optimization techniques for queries with expensive methods. ACM Trans Database Syst 23(2):113–157
Article MathSciNet Google Scholar
Helmer S, Westmann T, Moerkotte G (1998) Diag-join: an opportunistic join algorithm for 1:N relationships. In: VLDB, pp 98–109
Google Scholar
Hoare CAR (1962) Quicksort. Comput J 5(1):10–15
Article MathSciNet MATH Google Scholar
Idreos S, Kersten ML, Manegold S (2007) Database cracking. In: CIDR, pp 68–78
Google Scholar
Ioannidis YE, Kang YC (1990) Randomized algorithms for optimizing large join queries. In: SIGMOD, pp 312–321
Google Scholar
Keller T, Graefe G, Maier D (1991) Efficient assembly of complex objects. In: SIGMOD, pp 148–157
Google Scholar
Kemper A, Kossmann D, Wiesner C (1999) Generalised hash teams for join and group-by. In: VLDB, pp 30–41
Google Scholar
Kim W (1980) A new way to compute the product and join of relations. In: SIGMOD, pp 179–187
Google Scholar
Kitsuregawa M, Nakayama M, Takagi M (1989) The effect of bucket size tuning in the dynamic hybrid GRACE hash join method. In: VLDB, pp 257–266
Google Scholar
Knuth DE (1973) The art of computer programming, vol III: sorting and searching. Addison-Wesley, Reading
Google Scholar
Larson P-Å (2003) External sorting: run formation revisited. IEEE Trans Knowl Data Eng 15(4):961–972
Article Google Scholar
Larson P-Å, Graefe G (1998) Memory management during run generation in external sorting. In: SIGMOD, pp 472–483
Google Scholar
Li G (2010) On the design and evaluation of a new order-based join algorithm. MS-CS thesis, UC Irvine
Li Q, Shao M, Markl V, Beyer KS, Colby LS, Lohman GM (2007) Adaptively reordering joins during query execution. In: ICDE, pp 26–35
Google Scholar
Lohman GM (1988) Grammar-like functional rules for representing query optimization alternatives. In: SIGMOD, pp 18–27
Google Scholar
Markl V, Lohman GM, Raman V (2003) LEO: An autonomic query optimizer for DB2. IBM Syst J 42(1):98–106
Article Google Scholar
Mohan C, Narang I (1992) Algorithms for creating indexes for very large tables without quiescing updates. In: SIGMOD, pp 361–370
Google Scholar
Mohan C, Haderle DJ, Wang Y, Cheng JM (1990) Single table access using multiple indexes: optimization, execution, and concurrency control techniques. In: EDBT, pp 29–43
Google Scholar
Moss JEB (1992) Working with persistent objects: to swizzle or not to swizzle. IEEE Trans Softw Eng 18(8):657–673
Article Google Scholar
Muralikrishna M, DeWitt DJ (1988) Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In: SIGMOD, pp 28–36
Google Scholar
Nakayama M, Kitsuregawa M, Takagi M (1988) Hash-partitioned join method using dynamic destaging strategy. In: VLDB, pp 468–478
Google Scholar
Pang H, Carey MJ, Livny M (1993) Memory-adaptive external sorting. In: VLDB, pp 618–629
Google Scholar
Salzberg B (1989) Merging sorted runs using large main memory. Acta Inform 27(3):195–215
Article MathSciNet MATH Google Scholar
Selinger PG, Astrahan MM, Chamberlin DD, Lorie RA, Price TG (1979) Access path selection in a relational database management system. In: SIGMOD, pp 23–34
Google Scholar
Shapiro LD (1986) Join processing in database systems with large main memories. ACM Trans Database Syst 11(3):239–264
Article Google Scholar
Shekita EJ, Carey MJ (1990) A performance evaluation of pointer-based joins. In: SIGMOD, pp 300–311
Google Scholar
Simmen DE, Shekita EJ, Malkemus T (1996) Fundamental techniques for order optimization. In: SIGMOD, pp 57–67
Google Scholar
Youssefi K, Wong E (1979) Query processing in a relational database management system. In: VLDB, pp 409–417
Google Scholar
Zeller H, Gray J (1990) An adaptive hash join algorithm for multiuser environments. In: VLDB, pp 186–197
Google Scholar
Zhang W, Larson P-Å (1998) Buffering and read-ahead strategies for external mergesort. In: VLDB, pp 523–533
Google Scholar

Download references

Author information

Authors and Affiliations

Hewlett-Packard Laboratories, Madison, WI, USA
Goetz Graefe

Authors

Goetz Graefe
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Goetz Graefe.

Additional information

This paper largely overlaps with a paper presented at the BTW 2011 conference. Newly added sections cover prior work in much more detail and, more importantly, a similar generalized algorithm for grouping and aggregation.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Graefe, G. New algorithms for join and grouping operations. Comput Sci Res Dev 27, 3–27 (2012). https://doi.org/10.1007/s00450-011-0186-9

Download citation

Published: 30 June 2011
Issue Date: February 2012
DOI: https://doi.org/10.1007/s00450-011-0186-9

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

New algorithms for join and grouping operations

Abstract

Access this article

Similar content being viewed by others

Query Processing of Pre-partitioned Data Using Sandwich Operators

Efficient generation of query plans containing group-by, join, and groupjoin

Block Nested Join and Sort Merge Join Algorithms: An Empirical Evaluation

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

New algorithms for join and grouping operations

Abstract

Access this article

Similar content being viewed by others

Query Processing of Pre-partitioned Data Using Sandwich Operators

Efficient generation of query plans containing group-by, join, and groupjoin

Block Nested Join and Sort Merge Join Algorithms: An Empirical Evaluation

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation