Robust and Efficient Large-Large Table Outer Joins on Distributed Infrastructures

  • Long Cheng
  • Spyros Kotoulas
  • Tomas E Ward
  • Georgios Theodoropoulos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8632)

Abstract

Outer joins are ubiquitous in many workloads but are sensitive to load-balancing problems. Current approaches mitigate such problems, when they are caused by data skew, by using (partial) replication. However, contemporary replication-based approaches (1) introduce overhead, since they usually result in redundant data movement, (2) are sensitive to parameter tuning and to the degree of data skew, and (3) typically require that one side of the join is small. In this paper, we propose a novel parallel algorithm, Redistribution and Efficient Query with Counters (REQC), aimed at robustness with respect to the size of the join sides, variation in skew, and parameter tuning. Experimental results demonstrate that our algorithm is faster, more robust, and less demanding in terms of network bandwidth than the state of the art.

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Long Cheng (1, 2, 3)
  • Spyros Kotoulas (2)
  • Tomas E Ward (1)
  • Georgios Theodoropoulos (4)

  1. National University of Ireland Maynooth, Ireland
  2. IBM Research, Ireland
  3. Technische Universität Dresden, Germany
  4. Durham University, UK
