Parallel Formulations of Decision-Tree Classification Algorithms

  • Chapter
High Performance Data Mining

Abstract

Classification decision-tree algorithms are used extensively for data mining in many domains, such as retail target marketing and fraud detection. Highly parallel algorithms for constructing classification decision trees are desirable for handling large data sets in a reasonable amount of time. Algorithms for building classification decision trees have natural concurrency but are difficult to parallelize because of the inherently dynamic nature of the computation. In this paper, we present parallel formulations of a classification decision-tree learning algorithm based on induction. We describe two basic parallel formulations: one based on the Synchronous Tree Construction Approach and the other on the Partitioned Tree Construction Approach. We discuss the advantages and disadvantages of these methods and propose a hybrid method that combines their good features. We also analyze the computation and communication costs of the proposed hybrid method. Experimental results on an IBM SP-2 demonstrate excellent speedups and scalability.
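
The abstract does not spell out how either formulation works, so the following is only an illustrative sketch of the idea usually associated with a synchronous construction scheme: each processor holds a slice of the training records, computes class-distribution counts locally, and a global reduction combines the counts so that every processor can evaluate the same split. It is not the chapter's implementation. The function names (`local_counts`, `all_reduce`, `best_attribute`), the toy data, and the entropy-based split criterion are all assumptions, and the reduction is simulated in a single process rather than performed with message passing.

```python
from collections import Counter
from math import log2


def local_counts(rows, n_attrs):
    """Class histogram for one partition: counts[a][(value, label)]."""
    counts = [Counter() for _ in range(n_attrs)]
    for *attrs, label in rows:
        for a, value in enumerate(attrs):
            counts[a][(value, label)] += 1
    return counts


def all_reduce(per_partition):
    """Stand-in for a global reduction: element-wise sum of the histograms."""
    n_attrs = len(per_partition[0])
    total = [Counter() for _ in range(n_attrs)]
    for counts in per_partition:
        for a in range(n_attrs):
            total[a].update(counts[a])
    return total


def entropy(class_counts):
    """Shannon entropy of a class-count histogram."""
    n = sum(class_counts.values())
    return -sum(c / n * log2(c / n) for c in class_counts.values() if c)


def split_entropy(attr_counts):
    """Weighted class entropy after splitting on one categorical attribute."""
    by_value = {}
    for (value, label), c in attr_counts.items():
        by_value.setdefault(value, Counter())[label] += c
    n = sum(attr_counts.values())
    return sum(sum(cc.values()) / n * entropy(cc) for cc in by_value.values())


def best_attribute(global_counts):
    """Index of the attribute whose split leaves the least class entropy."""
    return min(range(len(global_counts)),
               key=lambda a: split_entropy(global_counts[a]))


# Hypothetical toy data: two categorical attributes followed by a class label.
rows = [
    ("sunny", "high", "no"),
    ("sunny", "low", "no"),
    ("rain", "high", "yes"),
    ("rain", "low", "yes"),
    ("sunny", "high", "no"),
    ("rain", "low", "yes"),
]

# Three simulated processors, each holding a third of the records.
partitions = [rows[0:2], rows[2:4], rows[4:6]]
global_counts = all_reduce([local_counts(p, 2) for p in partitions])
choice = best_attribute(global_counts)
print(choice)
```

Because the reduced histograms equal those a serial pass over all records would produce, every simulated processor reaches the same split decision while only count tables, not training records, are exchanged. A partitioned scheme would instead hand different subtrees to different processor groups; that variant is not sketched here.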

Copyright information

© 1999 Kluwer Academic Publishers

About this chapter

Cite this chapter

Srivastava, A., Han, EH., Kumar, V., Singh, V. (1999). Parallel Formulations of Decision-Tree Classification Algorithms. In: Guo, Y., Grossman, R. (eds) High Performance Data Mining. Springer, Boston, MA. https://doi.org/10.1007/0-306-47011-X_2

  • DOI: https://doi.org/10.1007/0-306-47011-X_2

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-7923-7745-0

  • Online ISBN: 978-0-306-47011-0

  • eBook Packages: Springer Book Archive
