Parallel Formulations of Decision-Tree Classification Algorithms

Srivastava, Anurag; Han, Eui-Hong; Kumar, Vipin; Singh, Vineet

doi:10.1007/0-306-47011-X_2

Abstract

Classification decision tree algorithms are used extensively for data mining in many domains, such as retail target marketing and fraud detection. Highly parallel algorithms for constructing classification decision trees are desirable for dealing with large data sets in a reasonable amount of time. Algorithms for building classification decision trees have a natural concurrency, but they are difficult to parallelize because of the inherently dynamic nature of the computation. In this paper, we present parallel formulations of a classification decision tree learning algorithm based on induction. We describe two basic parallel formulations: one is based on the Synchronous Tree Construction Approach and the other on the Partitioned Tree Construction Approach. We discuss the advantages and disadvantages of these methods and propose a hybrid method that combines their good features. We also provide an analysis of the computation and communication costs of the proposed hybrid method. Moreover, experimental results on an IBM SP-2 demonstrate excellent speedups and scalability.
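
The abstract does not spell out how either formulation works, so the following is only a minimal sketch of one plausible reading of synchronous tree construction: each processor tabulates class counts over its own share of the records, the counts are summed across processors, and every processor then picks the same split from the global counts. The function names, the toy records, and the use of entropy as the splitting criterion are illustrative assumptions, and the reduction is simulated in a single process rather than performed with a message-passing library.

```python
from collections import Counter
from math import log2


def local_counts(records, n_attrs):
    """Class counts per (attribute value, label) for one processor's records."""
    counts = [Counter() for _ in range(n_attrs)]
    for *attrs, label in records:
        for a, value in enumerate(attrs):
            counts[a][(value, label)] += 1
    return counts


def reduce_counts(per_processor):
    """Simulated global reduction: sum the counters of all processors."""
    total = [Counter() for _ in per_processor[0]]
    for counts in per_processor:
        for a, counter in enumerate(counts):
            total[a].update(counter)  # Counter.update adds counts
    return total


def entropy(label_counts):
    n = sum(label_counts.values())
    return -sum(c / n * log2(c / n) for c in label_counts.values() if c)


def split_entropy(attr_counts):
    """Weighted entropy of the children obtained by splitting on one attribute."""
    by_value = {}
    for (value, label), c in attr_counts.items():
        by_value.setdefault(value, Counter())[label] += c
    n = sum(attr_counts.values())
    return sum(sum(lc.values()) / n * entropy(lc) for lc in by_value.values())


def best_attribute(partitions, n_attrs):
    """Choose the split attribute from globally reduced class counts."""
    global_counts = reduce_counts([local_counts(p, n_attrs) for p in partitions])
    return min(range(n_attrs), key=lambda a: split_entropy(global_counts[a]))


# Hypothetical toy data: (outlook, windy, label), split across two "processors".
partitions = [
    [("sunny", "no", "play"), ("rain", "yes", "stay")],
    [("sunny", "yes", "stay"), ("rain", "no", "play")],
]
chosen = best_attribute(partitions, n_attrs=2)
```

Because only the count tables are exchanged, the chosen attribute is the same as the one a serial pass over all records would select; the cost of that exchange at every tree node is the kind of communication overhead the abstract's cost analysis refers to.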

Author information

Authors and Affiliations

Digital Impact, USA
Anurag Srivastava
Department of Computer Science & Engineering, Army HPC Research Center, University of Minnesota, USA
Eui-Hong Han & Vipin Kumar
Information Technology Lab, Hitachi America, Ltd., USA
Vineet Singh

Editor information

Editors and Affiliations

Imperial College, UK
Yike Guo
University of Illinois at Chicago, USA
Robert Grossman

About this chapter

Cite this chapter

Srivastava, A., Han, EH., Kumar, V., Singh, V. (1999). Parallel Formulations of Decision-Tree Classification Algorithms. In: Guo, Y., Grossman, R. (eds) High Performance Data Mining. Springer, Boston, MA. https://doi.org/10.1007/0-306-47011-X_2

DOI: https://doi.org/10.1007/0-306-47011-X_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7745-0
Online ISBN: 978-0-306-47011-0
eBook Packages: Springer Book Archive