Stream mining: a novel architecture for ensemble-based classification

  • Regular Paper
Knowledge and Information Systems

Abstract

Mining data streams has become an important and challenging task for a wide range of applications. In these scenarios, data tend to arrive in multiple, rapid and time-varying streams, which constrains data mining algorithms to look at each data item only once. Maintaining an accurate model, e.g. a classifier, while the stream goes by requires a smart way of keeping track of the data that have already passed. Such a summary structure has to serve two purposes: distilling as much information as possible from past data, and allowing a fast reaction to concept drift, i.e. to a change in the data trend that necessarily affects the model. The paper outlines novel data structures and algorithms to tackle this problem when the model mined from the data is a classifier. The introduced model and the overall ensemble architecture are presented in detail, including how the approach can be extended to handle numerical attributes. A large part of the paper discusses the experiments and the comparisons with several existing systems. The comparisons show that the performance of our system, in general and in particular with respect to its reaction to concept drift, is at the top level.
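The abstract does not spell out the paper's data structures, so the following is only a generic sketch of the two requirements it names, not the authors' architecture. It assumes categorical attributes and a stream consumed in fixed-size chunks. The names `CountSummary`, `ChunkEnsemble` and `make_chunk` are hypothetical: the first folds each record into per-class counts in a single pass (a naive Bayes summary), the second weights ensemble members by their accuracy on the newest chunk so that outdated members lose influence after a drift.

```python
from collections import Counter, defaultdict
import math


class CountSummary:
    """One-pass summary: class counts and per-class attribute-value counts."""

    def __init__(self):
        self.class_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts

    def update(self, x, y):
        # Each record is seen once, folded into the counts, then discarded.
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.value_counts[(i, y)][v] += 1

    def predict(self, x):
        # Naive Bayes over the stored counts, with add-one smoothing.
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for c, n in self.class_counts.items():
            score = math.log(n / total)
            for i, v in enumerate(x):
                counts = self.value_counts[(i, c)]
                score += math.log((counts[v] + 1) / (n + len(counts) + 1))
            if score > best_score:
                best, best_score = c, score
        return best


class ChunkEnsemble:
    """Illustrative chunk-based ensemble (an assumption, not the paper's design)."""

    def __init__(self, max_members=5):
        self.max_members = max_members
        self.members = []  # each entry is [summary, weight]

    def predict(self, x):
        # Weighted vote: members that fit recent data count for more.
        votes = Counter()
        for summary, weight in self.members:
            votes[summary.predict(x)] += weight
        return votes.most_common(1)[0][0] if votes else None

    def learn_chunk(self, chunk):
        # Re-weight existing members by their accuracy on the newest chunk.
        for member in self.members:
            hits = sum(member[0].predict(x) == y for x, y in chunk)
            member[1] = hits / len(chunk)
        # Summarise the new chunk in one pass and add it with full weight.
        new = CountSummary()
        for x, y in chunk:
            new.update(x, y)
        self.members.append([new, 1.0])
        # Keep the ensemble bounded by dropping the lowest-weight member.
        if len(self.members) > self.max_members:
            self.members.remove(min(self.members, key=lambda m: m[1]))


def make_chunk(flip):
    """Toy data: the label copies the first attribute, or its opposite after a drift."""
    data = []
    for a in ("0", "1"):
        for b in ("0", "1"):
            label = a if not flip else ("1" if a == "0" else "0")
            data.extend([((a, b), label)] * 5)
    return data
```

In this toy setting, feeding `make_chunk(False)` and then `make_chunk(True)` to `learn_chunk` simulates an abrupt drift: the member trained before the drift scores zero accuracy on the new chunk, so its vote no longer influences predictions.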

Author information

Corresponding author

Correspondence to Valerio Grossi.

About this article

Cite this article

Grossi, V., Turini, F. Stream mining: a novel architecture for ensemble-based classification. Knowl Inf Syst 30, 247–281 (2012). https://doi.org/10.1007/s10115-011-0378-4

  • DOI: https://doi.org/10.1007/s10115-011-0378-4
