Online ChiMerge Algorithm

  • Petri Lehtinen
  • Matti Saarela
  • Tapio Elomaa
Part of the Intelligent Systems Reference Library book series (ISRL, volume 24)

Abstract

We show that ChiMerge, a commonly used sampling-theoretic attribute discretization algorithm, can be implemented efficiently in the online setting. ChiMerge is efficient, statistically justified, and robust to noise; it can be made to produce low-arity partitions, and it has been observed empirically to work well in practice.

The worst-case time requirement of bottom-up interval merging in the batch version of ChiMerge is \(O(n\lg n)\) per attribute. We show that ChiMerge can be implemented in the online setting so that only logarithmic time is required to update the relevant data structures upon an insertion. Hence, discretizing a data stream in which the examples fall into \(n\) bins takes the same \(O(n\lg n)\) total time as in the batch setting. However, maintaining just one binary search tree is not enough; other data structures are also needed. Moreover, to guarantee discretization results equal to those of the batch version, an up-to-date discretization cannot always be kept available; instead, the updates must be delayed to happen at periodic intervals. We also provide a comparative evaluation of the proposed algorithm.
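
To make the bottom-up interval merging concrete, the following is a minimal Python sketch of batch ChiMerge as it is commonly described, not the online algorithm of this chapter. It starts with one bin per distinct attribute value and repeatedly merges the adjacent pair with the lowest chi-squared statistic until every remaining pair reaches a threshold. The names `chi2` and `chimerge`, the `threshold` parameter, and the choice to skip classes with zero expected count are assumptions made for illustration. For brevity the sketch recomputes all statistics after every merge, so it does not attain the \(O(n\lg n)\) bound; that bound presumably relies on keeping the pairs in a priority queue.

```python
from collections import Counter


def chi2(a, b, classes):
    """Pearson chi-squared statistic for the class counts of two adjacent intervals."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            col_total = a[c] + b[c]
            expected = row_total * col_total / total
            # Assumption: a class absent from both intervals contributes nothing.
            if expected > 0:
                stat += (row[c] - expected) ** 2 / expected
    return stat


def chimerge(values, labels, threshold):
    """Naive batch ChiMerge; returns the lower bounds of the final intervals."""
    classes = sorted(set(labels))
    # One initial interval (bin) per distinct attribute value.
    bins = {}
    for v, y in zip(values, labels):
        bins.setdefault(v, Counter())[y] += 1
    bounds = sorted(bins)
    counts = [bins[v] for v in bounds]
    while len(counts) > 1:
        # Recompute the statistic of every adjacent pair (quadratic overall).
        stats = [chi2(counts[i], counts[i + 1], classes)
                 for i in range(len(counts) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break  # every adjacent pair differs significantly
        # Merge interval i + 1 into interval i.
        counts[i] = counts[i] + counts[i + 1]
        del counts[i + 1], bounds[i + 1]
    return bounds
```

For example, `chimerge([1, 2, 3, 4, 5, 6], list("aaabbb"), 2.706)` returns `[1, 4]`, that is, the two intervals starting at 1 and at 4. The value 2.706 is the 90% critical value of the chi-squared distribution with one degree of freedom, which applies here because there are two classes.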

Keywords

Data Stream · Priority Queue · Concept Drift · Initial Interval · Binary Search Tree
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Petri Lehtinen
    • 1
  • Matti Saarela
    • 1
  • Tapio Elomaa
    • 1
  1. Department of Software Systems, Tampere University of Technology, Tampere, Finland