Facing the reality of data stream classification: coping with scarcity of labeled data

  • Regular Paper
Knowledge and Information Systems

Abstract

Recent approaches for classifying data streams are mostly based on supervised learning algorithms, which can only be trained with labeled data. Manual labeling of data is both costly and time-consuming. Therefore, in a real streaming environment, where large volumes of data arrive at high speed, only a small fraction of the data can be labeled. As a result, only a limited number of instances are available for training and updating the classification models, which leads to poorly trained classifiers. We apply a novel technique to overcome this problem by utilizing both unlabeled and labeled instances to train and update the classification model. Each classification model is built as a collection of micro-clusters using semi-supervised clustering, and an ensemble of these models is used to classify unlabeled data. Empirical evaluation on both synthetic and real data reveals that our approach outperforms state-of-the-art stream classification algorithms that use ten times more labeled data.
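
The idea in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a simple seeded k-means (initial centroids taken from labeled points) as a stand-in for the semi-supervised clustering step, a majority label per micro-cluster, and an unweighted vote across models. The names `build_model` and `classify`, the parameter `k`, and the toy data are hypothetical.

```python
import math
from collections import Counter

def _dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def _mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def build_model(chunk, k, iters=10):
    """Build one model from a data chunk of (point, label_or_None) pairs.

    Assumption: seeded k-means stands in for semi-supervised clustering.
    Labeled points are preferred as initial centroids, and each resulting
    micro-cluster is summarized as (centroid, majority label).
    """
    labeled = [p for p, y in chunk if y is not None]
    unlabeled = [p for p, y in chunk if y is None]
    centroids = (labeled + unlabeled)[:k]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p, y in chunk:
            j = min(range(len(centroids)), key=lambda c: _dist(p, centroids[c]))
            groups[j].append((p, y))
        # Keep the old centroid when a cluster receives no points.
        centroids = [_mean([p for p, _ in g]) if g else centroids[j]
                     for j, g in enumerate(groups)]
    model = []
    for centroid, group in zip(centroids, groups):
        labels = Counter(y for _, y in group if y is not None)
        if labels:  # clusters without any labeled point are dropped here
            model.append((centroid, labels.most_common(1)[0][0]))
    return model

def classify(ensemble, point):
    """Each model votes with the label of its nearest micro-cluster."""
    votes = Counter()
    for model in ensemble:
        if not model:
            continue
        _, label = min(model, key=lambda mc: _dist(point, mc[0]))
        votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None

# Toy stream: two chunks, each with only two labeled instances.
chunk1 = [((0.0, 0.0), "a"), ((0.5, 0.2), None), ((0.1, 0.4), None),
          ((10.0, 10.0), "b"), ((9.5, 10.2), None), ((10.3, 9.8), None)]
chunk2 = [((0.2, 0.1), "a"), ((0.4, 0.6), None),
          ((9.8, 10.1), None), ((10.2, 9.9), "b")]
ensemble = [build_model(chunk1, k=2), build_model(chunk2, k=2)]
print(classify(ensemble, (0.3, 0.3)))  # prints "a"
print(classify(ensemble, (9.0, 9.0)))  # prints "b"
```

In this sketch the unlabeled points still shape the centroids, which is the sense in which both labeled and unlabeled instances contribute to each model. How the ensemble is refreshed as new chunks arrive is left out.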

Author information

Corresponding author

Correspondence to Mohammad M. Masud.

About this article

Cite this article

Masud, M.M., Woolam, C., Gao, J. et al. Facing the reality of data stream classification: coping with scarcity of labeled data. Knowl Inf Syst 33, 213–244 (2012). https://doi.org/10.1007/s10115-011-0447-8

  • DOI: https://doi.org/10.1007/s10115-011-0447-8