Abstract
This chapter is concerned with the classification of streaming data, i.e. data which arrives (generally in large quantities) from some automatic process over a period of days, months, years or potentially forever.
Generating a classification tree for streaming data requires a different approach from the TDIDT algorithm described earlier in this book. The algorithm given here, H-Tree, is a variant of the popular VFDT algorithm, which generates a type of decision tree called a Hoeffding Tree. The algorithm is described and explained in detail, with accompanying pseudocode for the benefit of readers who may be interested in developing their own implementations. An example is given to illustrate a way of comparing the rules generated by H-Tree with those from TDIDT.
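The "Hoeffding" in Hoeffding Tree refers to the Hoeffding bound, which VFDT (Domingos & Hulten, 2000) uses to decide when enough records have been seen at a leaf to commit to a split. A minimal sketch of that bound in Python (the function name and the example parameter values are illustrative, not taken from the chapter):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound epsilon: with probability at least 1 - delta, the
    true mean of a random variable with range value_range lies within
    epsilon of the mean observed over n independent observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Illustrative use: information gain for a two-class problem has range 1,
# so a leaf may be split once the gap between the best and second-best
# attribute's gain exceeds epsilon for the n records seen so far.
eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=1000)
```

Note that epsilon shrinks as n grows, so a decision that cannot be made confidently after 1,000 records may become safe after a few thousand more; this is what lets the tree grow from a stream without storing the records themselves.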
Notes
- 1.
We distinguish between nodes which have or have not previously been split on an attribute. The former are called internal nodes; the latter are called leaf nodes. We will consider the root node not as a third type of node but as an internal node after it has been split on an attribute and a leaf node before that.
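The life cycle just described can be pictured with a minimal node structure (a Python sketch; the class and field names are illustrative and are not the chapter's pseudocode):

```python
class Node:
    """A node is a leaf until it is split on an attribute, after which it
    becomes an internal node with one child per attribute value."""

    def __init__(self):
        self.split_attribute = None   # None while the node is still a leaf
        self.children = {}            # attribute value -> child Node

    @property
    def is_leaf(self) -> bool:
        return self.split_attribute is None

    def split(self, attribute, values):
        """Turn this leaf into an internal node with one leaf per value."""
        self.split_attribute = attribute
        self.children = {v: Node() for v in values}

root = Node()                        # the root starts life as a leaf ...
root.split("att1", ["yes", "no"])    # ... and is internal once it is split
```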
- 2.
A note on notation. In this chapter array elements are generally shown enclosed in square brackets, e.g. \(\textit{currentAtts}[2]\). However, an array containing a number of constant values will generally be denoted by those values separated by commas and enclosed in braces. So \(\textit{currentAtts}[2]\), which is itself an array, is written as \(\{\textit{att1}, \textit{att2}, \textit{att3}, \textit{att5}, \textit{att6}, \textit{att7}\}\).
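In Python terms this notation can be read as a list whose elements are themselves lists, assuming (as the example suggests) that currentAtts[n] holds the attributes still available for splitting at node n; the attribute values below are illustrative:

```python
# Assumed reading of the notation: currentAtts[n] is the array of
# attributes still available at node n. Values are illustrative only.
currentAtts = [
    ["att1", "att2", "att3", "att4", "att5", "att6", "att7"],  # node 0
    ["att2", "att3", "att5", "att6", "att7"],                  # node 1
    ["att1", "att2", "att3", "att5", "att6", "att7"],          # node 2
]
# currentAtts[2] is then the array {att1, att2, att3, att5, att6, att7}.
```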
- 3.
The row and column headings are provided to assist the reader only. The table itself has 3 rows and 3 columns.
- 4.
Pseudocode fragments are provided for the benefit of readers who may be interested in developing their own implementations of the H-Tree algorithm. Other readers can safely ignore them.
- 5.
As initially there are no other nodes, all incoming records will be sorted there.
- 6.
- 7.
Confusion matrices were described in Chapter 7.
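As a reminder of the idea, a confusion matrix tabulates actual against predicted classes, with one row per actual class and one column per predicted class. A minimal sketch (not the chapter's code; the helper name and data are illustrative):

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Return a matrix where entry [i][j] counts records whose actual
    class is classes[i] and whose predicted class is classes[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

matrix = confusion_matrix(
    actual=["yes", "yes", "no", "no"],
    predicted=["yes", "no", "no", "no"],
    classes=["yes", "no"],
)
# Correct classifications lie on the main diagonal.
```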
- 8.
For some practical applications, to have a tree with a smaller number of leaf nodes which predicts the same or almost the same classifications as the complete TDIDT decision tree might be considered preferable, but we will not pursue that issue here.
References
Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 71–80). New York: ACM.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58 (301), 13–30.
© 2016 Springer-Verlag London Ltd.
Bramer, M. (2016). Classifying Streaming Data. In: Principles of Data Mining. Undergraduate Topics in Computer Science. Springer, London. https://doi.org/10.1007/978-1-4471-7307-6_21