Handling Numeric Attributes in Hoeffding Trees

  • Bernhard Pfahringer
  • Geoffrey Holmes
  • Richard Kirkby
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5012)

Abstract

For conventional machine learning classification algorithms, handling numeric attributes is relatively straightforward. Unsupervised and supervised solutions exist that either segment the data into pre-defined bins or sort the data and search for the best split points. Unfortunately, none of these solutions carries over particularly well to a data stream environment. Solutions for data streams have been proposed by several authors, but as yet none have been compared empirically. In this paper we investigate a range of methods for multi-class tree-based classification where the handling of numeric attributes takes place as the tree is constructed. To this end, we extend an existing approach based on simple Gaussian approximation. We then compare this method with four approaches from the literature, arriving at eight final algorithm configurations for testing. The solutions cover a range of options from perfectly accurate and memory intensive to highly approximate. All methods are tested using the Hoeffding tree classification algorithm. Surprisingly, the experimental comparison shows that the most approximate methods produce the most accurate trees by allowing for faster tree growth.
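To make the idea of Gaussian approximation concrete, the following is a minimal sketch, not the authors' implementation. It assumes one normal distribution is fitted per class from running statistics, and that the class counts on either side of a candidate split are estimated from the normal CDF. All class and function names (`GaussianEstimator`, `NumericAttributeObserver`, `info_gain`) are hypothetical, and the use of information gain as the split criterion is likewise an assumption.

```python
import math
from collections import defaultdict


class GaussianEstimator:
    """Running mean/variance for one class (incremental update, no stored values)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def weight_below(self, split):
        """Estimated number of seen values <= split, assuming a normal fit."""
        if self.n == 0:
            return 0.0
        sd = self.std()
        if sd == 0.0:
            return float(self.n) if self.mean <= split else 0.0
        z = (split - self.mean) / (sd * math.sqrt(2.0))
        return self.n * 0.5 * (1.0 + math.erf(z))


class NumericAttributeObserver:
    """Hypothetical per-attribute summary: one Gaussian per class label."""

    def __init__(self):
        self.per_class = defaultdict(GaussianEstimator)

    def observe(self, value, label):
        self.per_class[label].add(value)

    def class_counts(self, split):
        left = {c: g.weight_below(split) for c, g in self.per_class.items()}
        right = {c: g.n - left[c] for c, g in self.per_class.items()}
        return left, right


def entropy(counts):
    total = sum(counts.values())
    if total <= 0:
        return 0.0
    return -sum((v / total) * math.log2(v / total)
                for v in counts.values() if v > 0)


def info_gain(observer, split):
    """Information gain of a binary split, computed from estimated counts."""
    left, right = observer.class_counts(split)
    parent = {c: g.n for c, g in observer.per_class.items()}
    n = sum(parent.values())
    n_left, n_right = sum(left.values()), sum(right.values())
    return (entropy(parent)
            - (n_left / n) * entropy(left)
            - (n_right / n) * entropy(right))


obs = NumericAttributeObserver()
for v in (1.0, 2.0, 3.0):
    obs.observe(v, "a")
for v in (11.0, 12.0, 13.0):
    obs.observe(v, "b")
print(info_gain(obs, 7.0), info_gain(obs, 2.0))
```

The memory cost of this sketch is three numbers per class per attribute, regardless of how many examples have been seen, which is what places it at the highly approximate end of the range described above.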

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Bernhard Pfahringer (1)
  • Geoffrey Holmes (1)
  • Richard Kirkby (1)

  1. University of Waikato, Hamilton, New Zealand