Abstract
Mining data streams has become an important and challenging task for a wide range of applications. In these scenarios, data tend to arrive in multiple, rapid, and time-varying streams, which constrains data mining algorithms to examine each data item only once. Maintaining an accurate model, e.g. a classifier, while the stream goes by requires a smart way of keeping track of the data that have already passed. Such a synthetic structure has to serve two purposes: distilling as much information as possible out of past data, and allowing a fast reaction to concept drift, i.e. to a change in the data trend that necessarily affects the model. The paper outlines novel data structures and algorithms to tackle this problem when the model mined out of the data is a classifier. The introduced model and the overall ensemble architecture are presented in detail, including how the approach can be extended to handle numerical attributes. A large part of the paper discusses the experiments and the comparisons with several existing systems. The comparisons show that the performance of our system, in general and in particular with respect to its reaction to concept drift, ranks among the best of the compared systems.
Cite this article
Grossi, V., Turini, F. Stream mining: a novel architecture for ensemble-based classification. Knowl Inf Syst 30, 247–281 (2012). https://doi.org/10.1007/s10115-011-0378-4