Cluster Computing, Volume 14, Issue 2, pp 183–197

Scatter-Gather-Merge: An efficient star-join query processing algorithm for data-parallel frameworks

  • Hyuck Han
  • Hyungsoo Jung
  • Hyeonsang Eom
  • Heon Y. Yeom
Article

Abstract

Data-parallel frameworks are attractive for large-scale data processing because they let applications process huge amounts of data on commodity machines with little effort. MapReduce, a popular data-parallel framework, is used in fields such as web search, data mining, and data warehousing, and it has proven practical for such applications. Star-join queries are common in data warehouses, which are a current target domain of data-parallel frameworks. This article proposes a new algorithm that efficiently processes star-join queries in data-parallel frameworks such as MapReduce and Dryad. The algorithm, called Scatter-Gather-Merge, processes star-join queries in a constant number of computation steps even as the number of participating dimension tables increases. By adopting Bloom filters, Scatter-Gather-Merge eliminates a non-trivial amount of I/O. We also show that Scatter-Gather-Merge can be easily applied to MapReduce. Our experimental results in both cluster and cloud environments show that Scatter-Gather-Merge outperforms existing approaches.
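The abstract does not spell out the steps of Scatter-Gather-Merge, so the following is not the authors' algorithm. It is a minimal single-process sketch of the general idea the abstract mentions: building a Bloom filter over each (already filtered) dimension table and using it to discard fact rows before the exact join. The names `BloomFilter`, `star_join`, and the table layout are assumptions made for illustration only.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: k hash positions over an m-slot array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per slot, for clarity

    def _positions(self, key):
        # Derive k positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def star_join(fact_rows, dimensions):
    """Join fact rows against several dimension tables.

    dimensions maps a foreign-key column name to a dict
    {key: attributes}, assumed to hold only the dimension rows
    that already satisfy the query predicates (hypothetical layout).
    """
    # One Bloom filter per dimension table.
    filters = {}
    for fk, table in dimensions.items():
        bloom = BloomFilter()
        for key in table:
            bloom.add(key)
        filters[fk] = bloom

    results = []
    for row in fact_rows:
        # Cheap pre-filter: drop rows that cannot match some dimension.
        if not all(filters[fk].might_contain(row[fk]) for fk in dimensions):
            continue
        # Exact check removes any Bloom filter false positives.
        if not all(row[fk] in dimensions[fk] for fk in dimensions):
            continue
        joined = dict(row)
        for fk, table in dimensions.items():
            joined.update(table[row[fk]])
        results.append(joined)
    return results
```

In a data-parallel setting the filters would typically be small enough to ship to every worker, so that non-matching fact rows are dropped before they are written out or shuffled; how the paper distributes this work is not stated in the abstract.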

Keywords

Data parallel framework, MapReduce, Hadoop, Star-join query

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Hyuck Han (1)
  • Hyungsoo Jung (2)
  • Hyeonsang Eom (1)
  • Heon Y. Yeom (1)

  1. School of Computer Science and Engineering, Seoul National University, Seoul, Korea
  2. School of Information Technologies, University of Sydney, Sydney, Australia
