A Case-Based Technique for Tracking Concept Drift in Spam Filtering

  • Conference paper
Applications and Innovations in Intelligent Systems XII (SGAI 2004)

Abstract

Machine learning techniques can clearly play an important role in filtering spam email because ample training data is available to build a robust classifier. However, spam filtering is a particularly challenging task because the data distribution and the concept being learned change over time. This is a particularly awkward form of concept drift, as the change is driven by spammers wishing to circumvent the spam filters. In this paper we show that lazy learning techniques are appropriate for such dynamically changing contexts. We present a case-based system for spam filtering, called ECUE, that can learn dynamically. We evaluate its performance as the case-base is updated with new cases. We also explore the benefit of periodically redoing the feature selection process to bring new features into play. Our evaluation shows that these two levels of model update are effective in tracking concept drift.
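
The two levels of model update named in the abstract can be illustrated with a minimal sketch. This is not the ECUE implementation: the class name `CaseBaseFilter`, the whitespace tokenizer, the overlap-count similarity, and the use of information gain to rank features are all assumptions made for illustration, since the abstract does not specify any of them.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption); returns distinct tokens."""
    return set(text.lower().split())


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(cases, token):
    """Information gain of the binary feature 'token is present'."""
    labels = [label for _, label in cases]
    present = [label for toks, label in cases if token in toks]
    absent = [label for toks, label in cases if token not in toks]
    n = len(cases)
    return (entropy(labels)
            - len(present) / n * entropy(present)
            - len(absent) / n * entropy(absent))


class CaseBaseFilter:
    """Hypothetical lazy (k-NN) filter with two separate update levels."""

    def __init__(self, k=1, n_features=50):
        self.k = k
        self.n_features = n_features
        self.cases = []        # list of (token set, label) pairs
        self.features = set()  # tokens currently used for matching

    def add_case(self, text, label):
        """Level 1: extend the case-base; the feature set is left unchanged."""
        self.cases.append((tokenize(text), label))

    def reselect_features(self):
        """Level 2: re-rank the whole vocabulary and keep the top tokens."""
        vocab = set().union(*(toks for toks, _ in self.cases))
        ranked = sorted(
            vocab, key=lambda t: (-information_gain(self.cases, t), t))
        self.features = set(ranked[:self.n_features])

    def classify(self, text):
        """Majority vote over the k cases sharing most selected features."""
        if not self.cases:
            raise ValueError("case-base is empty")
        query = tokenize(text) & self.features
        ranked = sorted(self.cases, key=lambda case: -len(query & case[0]))
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0]
```

In this sketch, calling `add_case` alone makes a new message available as a neighbour, but any vocabulary it introduces is ignored during matching until `reselect_features` is run again. That separation is the reason for treating the two updates as distinct levels.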

Copyright information

© 2005 Springer-Verlag London Limited

About this paper

Cite this paper

Delany, S.J., Cunningham, P., Tsymbal, A., Coyle, L. (2005). A Case-Based Technique for Tracking Concept Drift in Spam Filtering. In: Macintosh, A., Ellis, R., Allen, T. (eds) Applications and Innovations in Intelligent Systems XII. SGAI 2004. Springer, London. https://doi.org/10.1007/1-84628-103-2_1

  • DOI: https://doi.org/10.1007/1-84628-103-2_1

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-85233-908-1

  • Online ISBN: 978-1-84628-103-7

  • eBook Packages: Computer Science (R0)
