Knowledge and Information Systems, Volume 33, Issue 1, pp 213–244

Facing the reality of data stream classification: coping with scarcity of labeled data

  • Mohammad M. Masud
  • Clay Woolam
  • Jing Gao
  • Latifur Khan
  • Jiawei Han
  • Kevin W. Hamlen
  • Nikunj C. Oza
Regular Paper


Recent approaches for classifying data streams are mostly based on supervised learning algorithms, which can only be trained with labeled data. Manual labeling of data is both costly and time-consuming. Therefore, in a real streaming environment where large volumes of data arrive at high speed, only a small fraction of the data can be labeled. As a result, only a limited number of instances are available for training and updating the classification models, which leads to poorly trained classifiers. We apply a novel technique to overcome this problem by utilizing both unlabeled and labeled instances to train and update the classification model. Each classification model is built as a collection of micro-clusters using semi-supervised clustering, and an ensemble of these models is used to classify unlabeled data. Empirical evaluation on both synthetic and real data shows that our approach outperforms state-of-the-art stream classification algorithms that use ten times more labeled data.
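The idea of a model as a collection of micro-clusters, combined in an ensemble, can be illustrated with a minimal sketch. The abstract does not specify the clustering procedure, so the code below is an assumption-laden stand-in rather than the authors' algorithm: it uses plain k-means with farthest-first seeding on all instances of a chunk (labeled and unlabeled alike), then attaches the few available labels to the resulting clusters. The names `build_model`, `classify`, and `make_chunk`, as well as the synthetic two-class data, are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

UNLABELED = None  # marker for instances whose label is unknown


def farthest_first(points, k):
    """Pick k seed points, each as far as possible from those already chosen."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))


def build_model(points, labels, k, iters=10):
    """Cluster one chunk (labeled and unlabeled points together) and return
    micro-clusters as (centroid, Counter of labels observed in the cluster).
    ASSUMPTION: plain k-means stands in for semi-supervised clustering."""
    centroids = farthest_first(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    counts = [Counter() for _ in range(k)]
    for p, y in zip(points, labels):
        if y is not UNLABELED:
            counts[nearest(p, centroids)][y] += 1
    # Keep only micro-clusters that captured at least one labeled instance.
    return [(c, n) for c, n in zip(centroids, counts) if n]


def classify(ensemble, x):
    """Each model votes with the majority label of its nearest micro-cluster;
    the ensemble returns the most common vote (None if no model can vote)."""
    votes = Counter()
    for model in ensemble:
        if model:
            _, label_counts = min(model, key=lambda mc: math.dist(x, mc[0]))
            votes[label_counts.most_common(1)[0][0]] += 1
    return votes.most_common(1)[0][0] if votes else None


def make_chunk(seed, n=40):
    """Hypothetical synthetic chunk: two Gaussian classes, only 4 of n labeled."""
    import random
    rng = random.Random(seed)
    points, labels = [], []
    for i in range(n):
        cls = i % 2
        cx, cy = (0.0, 0.0) if cls == 0 else (10.0, 10.0)
        points.append((cx + rng.gauss(0, 1), cy + rng.gauss(0, 1)))
        labels.append(cls if i < 4 else UNLABELED)
    return points, labels
```

A possible use is to build one model per chunk, for example `ensemble = [build_model(*make_chunk(s), k=2) for s in (1, 2, 3)]`, and then call `classify(ensemble, x)` on a new instance. Because the unlabeled instances shape the clusters, a handful of labels per chunk is enough to name each micro-cluster in this toy setting; how the paper handles drift, ensemble updates, and label-aware clustering is not covered by this sketch.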


Data stream classification; Semi-supervised clustering; Ensemble classification; Concept drift





Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  • Mohammad M. Masud (1)
  • Clay Woolam (1)
  • Jing Gao (2)
  • Latifur Khan (1)
  • Jiawei Han (2)
  • Kevin W. Hamlen (1)
  • Nikunj C. Oza (3)
  1. Department of Computer Science, University of Texas at Dallas, Richardson, USA
  2. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA
  3. Intelligent Systems Division, NASA Ames Research Center, Moffett Field, USA
