Abstract
Pipelined parallelism has been widely studied and successfully implemented in several join algorithms on Shared Nothing machines, but only under ideal conditions: balanced load across processors and no data skew. The aim of pipelining is to allow flexible resource allocation while avoiding unnecessary disk input/output for intermediate join results when processing multi-join queries.
The main drawback of pipelining in existing algorithms is that communication and load balancing remain limited to static approaches, generated during the query optimization phase and based on hashing to redistribute data over the network. Such approaches therefore cannot solve the problems of data skew and load imbalance on heterogeneous multi-processor architectures, where the load of each processor may vary dynamically and unpredictably.
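To illustrate why static hash-based redistribution is sensitive to data skew, here is a minimal sketch in Python. It is not taken from the paper: the function name `hash_partition`, the modulo hash, and the sample data are all assumptions chosen for illustration. It counts how many tuples each processor receives when join-attribute values are assigned by a fixed hash function.

```python
from collections import Counter


def hash_partition(keys, num_procs):
    """Return the number of tuples sent to each processor when every
    join-attribute value is routed by a fixed hash (here: key modulo
    num_procs, an illustrative choice)."""
    loads = Counter()
    for key in keys:
        loads[key % num_procs] += 1
    return [loads[p] for p in range(num_procs)]


# Uniform data: 1000 distinct keys spread evenly over 4 processors.
uniform = list(range(1000))

# Skewed data: one value (7) accounts for 700 of the 1000 tuples.
skewed = [7] * 700 + list(range(300))

uniform_loads = hash_partition(uniform, 4)
skewed_loads = hash_partition(skewed, 4)

print(uniform_loads)  # [250, 250, 250, 250]
print(skewed_loads)   # [75, 75, 75, 775]
```

Because the mapping from value to processor is fixed in advance, every tuple carrying the frequent value lands on the same processor, whatever that processor's current load is.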
In this paper, we present a pipelined parallel algorithm for multi-join queries that solves the data skew problem while guaranteeing perfect balancing properties on heterogeneous multi-processor Shared Nothing architectures. The performance of this algorithm is analyzed using the scalable and portable BSP (Bulk Synchronous Parallel) cost model.
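For readers unfamiliar with BSP, the following sketch computes the cost of a single superstep in the standard BSP model: the maximum local computation, plus the largest number of words any processor sends or receives multiplied by the communication gap g, plus the synchronization latency l. This is the generic textbook formula, not the paper's own analysis; the function name `superstep_cost` and all numeric values are assumptions made for illustration.

```python
def superstep_cost(work, h_relations, g, l):
    """Cost of one BSP superstep.

    work        -- local computation time of each processor
    h_relations -- max words sent or received by each processor
    g           -- communication gap (time per word), machine parameter
    l           -- barrier synchronization latency, machine parameter
    """
    return max(work) + max(h_relations) * g + l


# Illustrative values only: same total work (700), different balance.
imbalanced = superstep_cost([100, 100, 100, 400], [50, 50, 50, 50], g=2, l=10)
balanced = superstep_cost([175, 175, 175, 175], [50, 50, 50, 50], g=2, l=10)

print(imbalanced)  # 510
print(balanced)    # 285
```

Since each term is a maximum over processors, the superstep is only as fast as its most loaded processor, which is why load balance matters directly in a BSP cost analysis.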
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hassan, M.A.H., Bamha, M. (2009). An Efficient Pipelined Parallel Join Algorithm on Heterogeneous Distributed Architectures. In: Cordeiro, J., Shishkov, B., Ranchordas, A., Helfert, M. (eds) Software and Data Technologies. ICSOFT 2008. Communications in Computer and Information Science, vol 47. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-05201-9_10
DOI: https://doi.org/10.1007/978-3-642-05201-9_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-05200-2
Online ISBN: 978-3-642-05201-9
eBook Packages: Computer Science, Computer Science (R0)