An Optimal Skew-insensitive Join and Multi-join Algorithm for Distributed Architectures

  • Mostafa Bamha
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3588)

Abstract

The development of scalable parallel database systems requires the design of efficient algorithms for the join operation, which is the most frequent and expensive operation in relational database systems. The join is also the operation most vulnerable to data skew and to the high cost of communication in distributed architectures.

In this paper, we present a new parallel algorithm for join and multi-join operations on distributed architectures, based on an efficient semi-join computation technique. The algorithm is proven to have optimal complexity and deterministic perfect load balancing. Its tradeoff between balancing overhead and speedup is analyzed using the BSP cost model, which predicts negligible join product skew and linear speed-up. The algorithm improves on our fa_join and sfa_join algorithms by reducing their communication and synchronization costs to a minimum while offering the same load-balancing properties, even for highly skewed data.
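
To illustrate the general idea of using semi-joins to cut communication before a distributed join, here is a minimal single-process sketch. It is not the algorithm of the paper: the function names (`semi_join`, `partition`, `semi_join_reduced_join`), the `(key, payload)` tuple layout, and the simulated processor count `p` are all assumptions made for illustration, and the plain hash partitioning used here does not provide the skew-insensitive load balancing the abstract describes.

```python
from collections import defaultdict


def semi_join(left, right_keys):
    """Keep only the tuples of `left` whose join key occurs in `right_keys`."""
    return [t for t in left if t[0] in right_keys]


def partition(tuples, p):
    """Hash-partition (key, payload) tuples over p simulated processors.

    Assumption: plain hashing, which is NOT skew-insensitive.
    """
    buckets = [[] for _ in range(p)]
    for t in tuples:
        buckets[hash(t[0]) % p].append(t)
    return buckets


def semi_join_reduced_join(r, s, p=4):
    """Join r and s on their first field after semi-join filtering.

    Returns the join result and the number of tuples that had to be
    redistributed (a stand-in for communication volume).
    """
    # Step 1: only the distinct join keys are exchanged, not full tuples.
    r_keys = {k for k, _ in r}
    s_keys = {k for k, _ in s}

    # Step 2: drop tuples that cannot contribute to the join result.
    r_reduced = semi_join(r, s_keys)
    s_reduced = semi_join(s, r_keys)

    # Step 3: redistribute the surviving tuples and join locally.
    result = []
    for r_part, s_part in zip(partition(r_reduced, p), partition(s_reduced, p)):
        index = defaultdict(list)
        for k, v in r_part:
            index[k].append(v)
        for k, w in s_part:
            for v in index.get(k, []):
                result.append((k, v, w))
    return result, len(r_reduced) + len(s_reduced)


if __name__ == "__main__":
    r = [(1, "a"), (2, "b"), (3, "c")]
    s = [(2, "x"), (2, "y"), (4, "z")]
    joined, shipped = semi_join_reduced_join(r, s)
    print(sorted(joined), shipped)
```

In this toy input only three of the six tuples survive the semi-join step, so only those three are redistributed before the local joins.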

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Mostafa Bamha
    LIFO – CNRS, Université d’Orléans, Orléans Cedex 2, France
