Knowledge and Information Systems, Volume 22, Issue 3, pp 371–391

Tracking recurring contexts using ensemble classifiers: an application to email filtering

  • Ioannis Katakis
  • Grigorios Tsoumakas
  • Ioannis Vlahavas
Regular Paper

Abstract

Concept drift is a challenging problem for the machine learning and data mining community and appears frequently in real-world stream classification problems. It is usually defined as an unforeseeable change in the concept of the target variable in a prediction task. In this paper, we focus on the problem of recurring contexts, a special subtype of concept drift that has not yet received proper attention from the research community. In the case of recurring contexts, concepts may reappear in the future, so older classification models may be beneficial for future classifications. We propose a general framework for classifying data streams that exploits stream clustering to dynamically build and update an ensemble of incremental classifiers. To achieve this, we propose a transformation function that maps batches of examples into a new conceptual representation model. The clustering algorithm is then applied to group batches of examples into concepts and identify recurring contexts. The ensemble is produced by creating and maintaining an incremental classifier for every concept discovered in the data stream. An experimental study is performed using (a) two new real-world concept-drifting datasets from the email domain, (b) an instantiation of the proposed framework and (c) five methods for dealing with drifting concepts. Results indicate the effectiveness of the proposed representation and the suitability of the concept-specific classifiers for problems with recurring contexts.
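
The pipeline described in the abstract (map each batch to a conceptual vector, cluster the vectors into concepts, keep one incremental classifier per concept) can be illustrated with a minimal sketch. The abstract does not specify the transformation function, the clustering algorithm or the base classifier, so everything concrete here is an assumption made for illustration: binary features, a conceptual vector built from per-class feature frequencies, a simple distance-threshold clustering rule, and a count-based Bernoulli naive Bayes. All names (`conceptual_vector`, `IncrementalNB`, `ConceptEnsemble`, `threshold`) are hypothetical. The sketch also assumes that true labels for a batch become available after its examples have been classified.

```python
import math

CLASSES = (0, 1)  # assumed labels, e.g. 0 = legitimate, 1 = spam


def conceptual_vector(batch, n_features):
    """Map a labelled batch of binary examples to a single vector.

    Each entry is a Laplace-smoothed estimate of P(feature_i = 1 | class_j)
    inside the batch (an assumed transformation, not taken from the abstract).
    """
    vec = []
    for c in CLASSES:
        rows = [x for x, y in batch if y == c]
        for i in range(n_features):
            ones = sum(x[i] for x in rows)
            vec.append((ones + 1) / (len(rows) + 2))
    return vec


class IncrementalNB:
    """Bernoulli naive Bayes stored as running counts, updatable per batch."""

    def __init__(self, n_features):
        self.class_count = {c: 0 for c in CLASSES}
        self.ones = {c: [0] * n_features for c in CLASSES}

    def update(self, batch):
        for x, y in batch:
            self.class_count[y] += 1
            for i, v in enumerate(x):
                self.ones[y][i] += v

    def predict(self, x):
        total = sum(self.class_count.values())
        best, best_score = None, -math.inf
        for c in CLASSES:
            n = self.class_count[c]
            score = math.log((n + 1) / (total + len(CLASSES)))
            for i, v in enumerate(x):
                p = (self.ones[c][i] + 1) / (n + 2)
                score += math.log(p if v else 1 - p)
            if score > best_score:
                best, best_score = c, score
        return best


class ConceptEnsemble:
    """One incremental classifier per discovered concept (cluster of batches)."""

    def __init__(self, n_features, threshold):
        self.n_features = n_features
        self.threshold = threshold  # assumed max distance to join a cluster
        self.centroids, self.sizes, self.models = [], [], []
        self.active = None  # concept assigned to the latest labelled batch

    def predict(self, x):
        # Classify with the model of the most recently identified concept.
        if self.active is None:
            return CLASSES[0]
        return self.models[self.active].predict(x)

    def learn_batch(self, batch):
        z = conceptual_vector(batch, self.n_features)
        dists = [math.dist(z, c) for c in self.centroids]
        if dists and min(dists) <= self.threshold:
            # Recurring (or continuing) concept: move its centroid toward z.
            k = dists.index(min(dists))
            self.sizes[k] += 1
            rate = 1 / self.sizes[k]
            self.centroids[k] = [c + rate * (zi - c)
                                 for c, zi in zip(self.centroids[k], z)]
        else:
            # Unseen concept: open a new cluster with a fresh classifier.
            k = len(self.centroids)
            self.centroids.append(z)
            self.sizes.append(1)
            self.models.append(IncrementalNB(self.n_features))
        self.models[k].update(batch)
        self.active = k
        return k
```

In this sketch, a batch whose conceptual vector falls within `threshold` of an existing centroid reuses and further trains that concept's classifier, which is how a recurring context would benefit from an older model. The threshold value and the use of Euclidean distance are tuning choices of the sketch, not results from the paper.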

Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  • Ioannis Katakis (1)
  • Grigorios Tsoumakas (1)
  • Ioannis Vlahavas (1)
  1. Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
