An Incremental Fuzzy Decision Tree Classification Method for Mining Data Streams

  • Tao Wang
  • Zhoujun Li
  • Yuejin Yan
  • Huowang Chen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4571)

Abstract

One of the most important algorithms for mining data streams is VFDT. It uses the Hoeffding inequality to achieve a probabilistic bound on the accuracy of the constructed tree. Gama et al. have extended VFDT in two directions: their system, VFDTc, can deal with continuous data and uses more powerful classification techniques at the tree leaves. In this paper, we revisit this problem and implement a system, fVFDT, on top of VFDT and VFDTc. We make the following four contributions: 1) We present a threaded binary search tree (TBST) approach for efficiently handling continuous attributes. It builds a threaded binary search tree, and its processing time for inserting values is O(n log n), while VFDT's processing time is O(n²). When a new example arrives, VFDTc needs to update O(log n) attribute tree nodes, but fVFDT needs to update only the one necessary node. 2) We improve the method of finding the best split-test point of a given continuous attribute. Compared to the method used in VFDTc, the processing time improves from O(n log n) to O(n). 3) Compared to VFDTc, the number of candidate split tests in fVFDT decreases from O(n) to O(log n). 4) We improve the soft discretization method so that it can be used in data stream mining; it overcomes the problem of noisy data and improves classification accuracy.
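As a rough illustration of the threaded binary search tree idea in contributions 1) and 2), the following minimal Python sketch stores per-class counts only at the node for each distinct attribute value and links nodes in sorted order through a `next` thread, so that an arriving example touches the counts of one node and the candidate split points can be scanned in a single linear pass. It is not the authors' implementation: the names `Node`, `ThreadedBST`, `entropy`, and `best_split`, the use of information gain, and the midpoint thresholds are all assumptions made for the example.

```python
import math
from collections import Counter


class Node:
    """BST node for one distinct attribute value (hypothetical layout)."""

    def __init__(self, value):
        self.value = value
        self.counts = Counter()  # class label -> count for this value only
        self.left = None
        self.right = None
        self.next = None         # thread to the in-order successor


class ThreadedBST:
    def __init__(self):
        self.root = None
        self.head = None         # node with the smallest value

    def insert(self, value, label):
        """Add one example; only the node for `value` has its counts changed."""
        if self.root is None:
            self.root = self.head = Node(value)
            self.root.counts[label] += 1
            return
        node, pred, succ = self.root, None, None
        while True:
            if value == node.value:
                node.counts[label] += 1
                return
            if value < node.value:
                succ = node      # last node at which we turned left
                if node.left is None:
                    new = node.left = Node(value)
                    break
                node = node.left
            else:
                pred = node      # last node at which we turned right
                if node.right is None:
                    new = node.right = Node(value)
                    break
                node = node.right
        new.counts[label] += 1
        new.next = succ          # splice the new node into the thread
        if pred is None:
            self.head = new
        else:
            pred.next = new


def entropy(counts, total):
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)


def best_split(tree):
    """Scan the thread once; return (gain, threshold) or None if no split exists."""
    total = Counter()
    node = tree.head
    while node:
        total.update(node.counts)
        node = node.next
    n = sum(total.values())
    base = entropy(total, n)
    left, best = Counter(), None
    node = tree.head
    while node and node.next:
        left.update(node.counts)          # running counts for values <= node.value
        right = total - left
        n_left, n_right = sum(left.values()), sum(right.values())
        gain = (base - n_left / n * entropy(left, n_left)
                - n_right / n * entropy(right, n_right))
        threshold = (node.value + node.next.value) / 2
        if best is None or gain > best[0]:
            best = (gain, threshold)
        node = node.next
    return best


tree = ThreadedBST()
for value, label in [(3, "b"), (1, "a"), (4, "b"), (2, "a")]:
    tree.insert(value, label)
gain, threshold = best_split(tree)
```

The tree in this sketch is not rebalanced, so a single insertion costs O(log n) only when values arrive in a reasonably mixed order. The sketch also evaluates every boundary between adjacent values, so it does not reflect the reduction in the number of candidate split tests described in contribution 3).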

Keywords

Data Streams, Incremental, Fuzzy, Continuous Attribute, Threaded Binary Search Tree

References

  1. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and Issues in Data Stream Systems. In: PODS (2002)
  2. Domingos, P., Hulten, G.: Mining High-Speed Data Streams. In: Proceedings of the Association for Computing Machinery Sixth International Conference on Knowledge Discovery and Data Mining, pp. 71–80 (2000)
  3. Mehta, M., Agrawal, R., Rissanen, J.: SLIQ: A Fast Scalable Classifier for Data Mining. In: Proceedings of the Fifth International Conference on Extending Database Technology, Avignon, France, pp. 18–32 (1996)
  4. Fan, W.: StreamMiner: A Classifier Ensemble-based Engine to Mine Concept Drifting Data Streams. In: VLDB 2004 (2004)
  5. Gama, J., Rocha, R., Medas, P.: Accurate Decision Trees for Mining High-Speed Data Streams. In: Domingos, P., Faloutsos, C. (eds.) Proceedings of the Ninth International Conference on Knowledge Discovery and Data Mining. ACM Press, New York (2003)
  6. Hulten, G., Spencer, L., Domingos, P.: Mining Time-Changing Data Streams. ACM SIGKDD (2001)
  7. Jin, R., Agrawal, G.: Efficient Decision Tree Construction on Streaming Data. In: Proceedings of ACM SIGKDD (2003)
  8. Last, M.: Online Classification of Nonstationary Data Streams. Intelligent Data Analysis 6(2), 129–147 (2002)
  9. Muthukrishnan, S.: Data Streams: Algorithms and Applications. In: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2003)
  10. Wang, H., Fan, W., Yu, P., Han, J.: Mining Concept-Drifting Data Streams using Ensemble Classifiers. In: The 9th ACM International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA. SIGKDD (2003)
  11. Arasu, A., Babcock, B., Babu, S., Datar, M., Ito, K., Nishizawa, I., Rosenstein, J., Widom, J.: STREAM: The Stanford Stream Data Manager Demonstration Description – Short Overview of System Status and Plans. In: SIGMOD 2003. Proc. of the ACM Intl Conf. on Management of Data (June 2003)
  12. Aggarwal, C., Han, J., Wang, J., Yu, P.S.: On Demand Classification of Data Streams. In: KDD 2004. Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining, Seattle, WA (2004)
  13. Guetova, M., Hölldobler, S., Storr, H.-P.: Incremental Fuzzy Decision Trees. In: 25th German Conference on Artificial Intelligence (2002)
  14. Ben-David, S., Gehrke, J., Kifer, D.: Detecting Change in Data Streams. In: Proceedings of VLDB (2004)
  15. Aggarwal, C.: A Framework for Diagnosing Changes in Evolving Data Streams. In: Proceedings of the ACM SIGMOD Conference (2003)
  16. Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: Mining Data Streams: A Review. SIGMOD Record 34(2) (June 2005)
  17. Janikow, C.Z.: Fuzzy Decision Trees: Issues and Methods. IEEE Transactions on Systems, Man, and Cybernetics 28(1), 1–14 (1998)
  18. Utgoff, P.E.: Incremental Induction of Decision Trees. Machine Learning 4(2), 161–186 (1989)
  19. Xie, Q.H.: An Efficient Approach for Mining Concept-Drifting Data Streams. Master Thesis
  20. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)
  21. Hoeffding, W.: Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association 58, 13–30 (1963)
  22. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth, Belmont, CA (1984)
  23. Maron, O., Moore, A.: Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation. In: Cowan, J.D., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information Processing Systems (1994)
  24. Kelly, M.G., Hand, D.J., Adams, N.M.: The Impact of Changing Populations on Classifier Performance. In: Proc. of KDD 1999, pp. 367–371 (1999)
  25. Black, M., Hickey, R.J.: Maintaining the Performance of a Learned Classifier under Concept Drift. Intelligent Data Analysis 3, 453–474 (1999)
  26. Maimon, O., Last, M.: Knowledge Discovery and Data Mining, the Info-Fuzzy Network (IFN) Methodology. Kluwer Academic Publishers, Boston (2000)
  27. Fayyad, U.M., Irani, K.B.: On the Handling of Continuous-valued Attributes in Decision Tree Generation. Machine Learning 8, 87–102 (1992)
  28. Wang, T., Li, Z., Yan, Y., Chen, H.: An Efficient Classification System Based on Binary Search Trees for Data Streams Mining. ICONS (2007)
  29. Wang, T., Li, Z., Hu, X., Yan, Y., Chen, H.: A New Decision Tree Classification Method for Mining High-Speed Data Streams Based on Threaded Binary Search Trees. In: PAKDD 2007 Workshop (2007)
  30. Peng, Y.H., Flach, P.A.: Soft Discretization to Enhance the Continuous Decision Tree Induction. In: Proceedings of ECML/PKDD-2001 Workshop IDDM-2001, Freiburg, Germany (2001)

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Tao Wang (1)
  • Zhoujun Li (2)
  • Yuejin Yan (1)
  • Huowang Chen (1)

  1. Computer School, National University of Defense Technology, Changsha, 410073, China
  2. School of Computer Science & Engineering, Beihang University, Beijing, 100083, China
