Abstract
Data mining techniques focus on finding novel and useful patterns or models in large datasets. Because of the volume of data to be analyzed, the amount of computation involved, and the need for rapid or even interactive analysis, data mining applications require the use of parallel machines. We have been developing compiler and runtime support for building scalable implementations of data mining algorithms. Our work encompasses shared memory parallelization, distributed memory parallelization, and optimizations for processing disk-resident datasets.
In this paper, we focus on compiler and runtime support for shared memory parallelization of data mining algorithms. We have developed a set of parallelization techniques that apply across algorithms for a variety of mining tasks. We describe the interface of the middleware where these techniques are implemented. Then, we present compiler techniques for translating data parallel code to the middleware specification. Finally, we present a brief evaluation of our compiler using Apriori association mining and k-means clustering.
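The abstract does not spell out the parallelization techniques or the middleware interface, so the following is only an illustrative sketch and not the authors' method. It assumes one common shared memory strategy for reduction-style mining loops: each thread accumulates into a private copy of the reduction object, and the copies are merged afterwards. It applies that idea to a single k-means iteration. The names `nearest`, `local_reduction`, and `kmeans_step` are hypothetical and do not come from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centers):
    """Return the index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])),
    )


def local_reduction(chunk, centers):
    """Accumulate per-cluster coordinate sums and counts into a private copy."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in chunk:
        c = nearest(point, centers)
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += point[d]
    return sums, counts


def kmeans_step(points, centers, num_threads=4):
    """One k-means iteration: parallel local reductions, then a serial merge."""
    # Assumption: a simple strided split of the data across threads.
    chunks = [points[i::num_threads] for i in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(lambda ch: local_reduction(ch, centers), chunks))

    # Merge the private copies into one global reduction object.
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for part_sums, part_counts in partials:
        for c in range(len(centers)):
            counts[c] += part_counts[c]
            for d in range(dim):
                sums[c][d] += part_sums[c][d]

    # Keep the old center when a cluster received no points.
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else list(centers[c])
        for c in range(len(centers))
    ]
```

Because no thread writes to shared state during the accumulation phase, this sketch needs no locks; the cost is one extra copy of the reduction object per thread plus the merge. In CPython the global interpreter lock limits the actual speedup from threads, so the code is meant to show the structure of the computation rather than its performance.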
This work was supported by NSF grant ACR-9982087, NSF CAREER award ACR-9733520, and NSF grant ACR-0130437.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, X., Jin, R., Agrawal, G. (2005). Compiler and Runtime Support for Shared Memory Parallelization of Data Mining Algorithms. In: Pugh, B., Tseng, CW. (eds) Languages and Compilers for Parallel Computing. LCPC 2002. Lecture Notes in Computer Science, vol 2481. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11596110_18
DOI: https://doi.org/10.1007/11596110_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30781-5
Online ISBN: 978-3-540-31612-1
eBook Packages: Computer Science; Computer Science (R0)