Scatter-Gather-Merge: An efficient star-join query processing algorithm for data-parallel frameworks
Abstract
Data-parallel frameworks are attractive for large-scale data processing because they let applications process huge amounts of data on commodity machines. MapReduce, a popular data-parallel framework, is used in fields such as web search, data mining, and data warehousing, and it has proven practical for such data-parallel applications. Star-join queries are common in data warehouses, which are a current target domain of data-parallel frameworks. This article proposes a new algorithm that efficiently processes star-join queries in data-parallel frameworks such as MapReduce and Dryad. Our star-join algorithm for general data-parallel frameworks is called Scatter-Gather-Merge, and it processes star-join queries in a constant number of computation steps even as the number of participating dimension tables increases. By adopting Bloom filters, Scatter-Gather-Merge avoids a non-trivial amount of I/O. We also show that Scatter-Gather-Merge can be easily applied to MapReduce. Our experimental results in both cluster and cloud environments show that Scatter-Gather-Merge outperforms existing approaches.
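The role of Bloom filters in a star join can be illustrated with a small single-process sketch. This is not the Scatter-Gather-Merge algorithm itself, whose steps the abstract does not spell out; it only shows the general idea of building one Bloom filter per (already filtered) dimension table and using it to discard fact rows early. The names `BloomFilter`, `star_join`, and the sample columns are hypothetical, and the filter sizes are arbitrary assumptions.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative assumptions, not tuned values."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes positions by salting a SHA-256 hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def star_join(fact_rows, dimensions):
    """Join fact rows with every dimension table.

    dimensions maps a foreign-key column name to a dict of
    {key: attributes} that has already been filtered by the query predicate.
    """
    # Build one Bloom filter per dimension table from its surviving keys.
    filters = {}
    for fk, dim in dimensions.items():
        bloom = BloomFilter()
        for key in dim:
            bloom.add(key)
        filters[fk] = bloom

    results = []
    for row in fact_rows:
        # Cheap pre-filter: drop rows that cannot match some dimension.
        if not all(filters[fk].might_contain(row[fk]) for fk in dimensions):
            continue
        # Exact lookup removes any Bloom-filter false positives.
        if all(row[fk] in dim for fk, dim in dimensions.items()):
            joined = dict(row)
            for fk, dim in dimensions.items():
                joined.update(dim[row[fk]])
            results.append(joined)
    return results


fact = [
    {"date_id": 1, "cust_id": 10, "revenue": 100},
    {"date_id": 2, "cust_id": 10, "revenue": 50},
    {"date_id": 1, "cust_id": 11, "revenue": 70},
]
dims = {
    "date_id": {1: {"year": 1997}},
    "cust_id": {10: {"region": "ASIA"}},
}
joined_rows = star_join(fact, dims)
```

In a data-parallel setting the pre-filter step would run where the fact table is scanned, so rows that fail a filter are never written out or shuffled; the exact lookup afterwards keeps the result correct despite false positives.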
Keywords
Data-parallel framework, MapReduce, Hadoop, Star-join query