
Classifying Streaming Data


Part of the book series: Undergraduate Topics in Computer Science (UTICS)

Abstract

This chapter is concerned with the classification of streaming data, i.e. data which arrives (generally in large quantities) from some automatic process over a period of days, months, years or potentially forever.

Generating a classification tree for streaming data requires a different approach from the TDIDT algorithm described earlier in this book. The algorithm given here, H-Tree, is a variant of the popular VFDT algorithm, which generates a type of decision tree called a Hoeffding Tree. The algorithm is described and explained in detail, with accompanying pseudocode for the benefit of readers who may be interested in developing their own implementations. An example is given to illustrate a way of comparing the rules generated by H-Tree with those from TDIDT.
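The chapter's own pseudocode is not reproduced in this preview. As a rough illustration of the idea behind a Hoeffding Tree, the sketch below computes the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), and applies the usual VFDT-style test of splitting a leaf once the gap between the two best attributes exceeds that bound. The function names `hoeffding_bound` and `should_split` and their parameters are assumptions made for this example; they are not taken from the H-Tree pseudocode.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the true mean of a
    variable with range `value_range` is within epsilon of the mean of n samples."""
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Hypothetical helper: split when the observed advantage of the best
    attribute over the runner-up exceeds the Hoeffding bound."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # The bound shrinks as more records are seen at a leaf.
    for n in (100, 1000, 10000):
        print(n, round(hoeffding_bound(1.0, 0.05, n), 4))
```

Because the bound falls as n grows, a leaf that cannot yet separate its two best attributes simply waits for more records rather than committing to a split early.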


Notes

  1. We distinguish between nodes which have or have not previously been split on an attribute. The former are called internal nodes; the latter are called leaf nodes. We will consider the root node not as a third type of node but as an internal node after it has been split on an attribute and a leaf node before that.

  2. A note on notation. In this chapter array elements are generally shown enclosed in square brackets, e.g. \(\textit{currentAtts}[2]\). However an array containing a number of constant values will generally be denoted by those values separated by commas and enclosed in braces. So \(\textit{currentAtts}[2]\) is \(\{\textit{att1}, \textit{att2}, \textit{att3}, \textit{att5}, \textit{att6}, \textit{att7}\}\).

  3. The row and column headings are provided to assist the reader only. The table itself has 3 rows and 3 columns.

  4. Pseudocode fragments are provided for the benefit of readers who may be interested in developing their own implementations of the H-Tree algorithm. Other readers can safely ignore them.

  5. As initially there are no other nodes, all incoming records will be sorted there.

  6. In Figures 21.6, 21.8 and 21.9 we depart from our usual notation for trees and show the values that are in the classtotals array for each node.

  7. Confusion matrices were described in Chapter 7.

  8. For some practical applications, a tree with fewer leaf nodes that predicts the same or almost the same classifications as the complete TDIDT decision tree might be considered preferable, but we will not pursue that issue here.
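Notes 1, 5 and 6 together suggest a simple node representation: a node is a leaf until it is split on an attribute, the root starts as a leaf that receives every incoming record, and each node carries per-class counts. The sketch below is one possible way to model this under those assumptions. The names `Node`, `sort_to_leaf` and `update`, the use of a dictionary for the class totals, and the choice to create a child on first sight of an attribute value are all hypothetical and are not taken from the chapter's pseudocode.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    # None means the node has not been split yet, i.e. it is a leaf (note 1).
    split_attribute: Optional[str] = None
    # Child nodes keyed by the value of the splitting attribute.
    children: Dict[str, "Node"] = field(default_factory=dict)
    # Per-class record counts, standing in for the classtotals array (note 6).
    class_totals: Dict[str, int] = field(default_factory=dict)

    @property
    def is_leaf(self) -> bool:
        return self.split_attribute is None


def sort_to_leaf(root: Node, record: Dict[str, str]) -> Node:
    """Follow the record down the tree until a leaf node is reached."""
    node = root
    while not node.is_leaf:
        value = record[node.split_attribute]
        # Assumption: an unseen attribute value gets a fresh leaf child.
        node = node.children.setdefault(value, Node())
    return node


def update(root: Node, record: Dict[str, str], label: str) -> Node:
    """Sort one labelled record to its leaf and increment that leaf's class count."""
    leaf = sort_to_leaf(root, record)
    leaf.class_totals[label] = leaf.class_totals.get(label, 0) + 1
    return leaf
```

With an unsplit root, `update` always returns the root itself, which matches note 5: until the first split there is nowhere else for a record to go.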

References

  1. Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 71–80). New York: ACM.


  2. Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58 (301), 13–30.



Copyright information

© 2016 Springer-Verlag London Ltd.

About this chapter

Cite this chapter

Bramer, M. (2016). Classifying Streaming Data. In: Principles of Data Mining. Undergraduate Topics in Computer Science. Springer, London. https://doi.org/10.1007/978-1-4471-7307-6_21


  • DOI: https://doi.org/10.1007/978-1-4471-7307-6_21


  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-7306-9

  • Online ISBN: 978-1-4471-7307-6

  • eBook Packages: Computer Science (R0)
