A Case-Based Technique for Tracking Concept Drift in Spam Filtering

Delany, Sarah Jane; Cunningham, Pádraig; Tsymbal, Alexey; Coyle, Lorcan

doi:10.1007/1-84628-103-2_1

Included in the following conference series:

International Conference on Innovative Techniques and Applications of Artificial Intelligence

Abstract

Clearly, machine learning techniques can play an important role in filtering spam email because ample training data is available to build a robust classifier. However, spam filtering is a particularly challenging task as the data distribution and the concept being learned change over time. This is a particularly awkward form of concept drift as the change is driven by spammers wishing to circumvent the spam filters. In this paper we show that lazy learning techniques are appropriate for such dynamically changing contexts. We present a case-based system for spam filtering called ECUE that can learn dynamically. We evaluate its performance as the case-base is updated with new cases. We also explore the benefit of periodically redoing the feature selection process to bring new features into play. Our evaluation shows that these two levels of model update are effective in tracking concept drift.
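
To illustrate the two levels of model update described in the abstract, here is a minimal Python sketch. It is not ECUE. The class name CaseBase, the whitespace tokenizer, the overlap similarity, the majority vote over k neighbours, and the frequency-difference feature score are all assumptions made for this example, since the abstract does not specify those details.

```python
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenizer (a simplifying assumption)."""
    return set(text.lower().split())


class CaseBase:
    """Toy lazy (k-NN) spam filter with two levels of update. Hypothetical."""

    def __init__(self, k=3, n_features=10):
        self.k = k
        self.n_features = n_features
        self.cases = []        # list of (token_set, label) pairs
        self.features = set()  # tokens currently used for matching

    def add_case(self, text, label):
        """Level 1 update: store a newly labelled email as a case."""
        self.cases.append((tokenize(text), label))

    def reselect_features(self):
        """Level 2 update: redo feature selection over the current cases."""
        spam, ham = Counter(), Counter()
        n_spam = n_ham = 0
        for tokens, label in self.cases:
            if label == "spam":
                spam.update(tokens)
                n_spam += 1
            else:
                ham.update(tokens)
                n_ham += 1

        def score(tok):
            # Assumed score: gap between per-class document frequencies.
            return abs(spam[tok] / max(n_spam, 1) - ham[tok] / max(n_ham, 1))

        vocab = set(spam) | set(ham)
        ranked = sorted(vocab, key=lambda t: (-score(t), t))
        self.features = set(ranked[: self.n_features])

    def classify(self, text):
        """Majority vote among the k cases sharing most selected tokens."""
        if not self.cases:
            raise ValueError("case base is empty")
        query = tokenize(text) & self.features
        nearest = sorted(
            self.cases, key=lambda case: len(query & case[0]), reverse=True
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]
```

In this toy setup, tokens outside the selected feature set are ignored during matching, so new spam vocabulary only influences classification after reselect_features() is called again. That is the role the sketch assigns to the second level of update.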

Author information

Authors and Affiliations

Dublin Institute of Technology, Kevin St., Dublin 8, Ireland
Sarah Jane Delany
College Green, Trinity College Dublin, Dublin 2, Ireland
Pádraig Cunningham, Alexey Tsymbal & Lorcan Coyle

Editor information

Editors and Affiliations

Napier University, Edinburgh, EH10 5DT, UK
Ann Macintosh BSc, CEng
Stratum Management Ltd, UK
Richard Ellis BSc, MSc
Nottingham Trent University, Nottingham
Tony Allen

About this paper

Cite this paper

Delany, S.J., Cunningham, P., Tsymbal, A., Coyle, L. (2005). A Case-Based Technique for Tracking Concept Drift in Spam Filtering. In: Macintosh, A., Ellis, R., Allen, T. (eds) Applications and Innovations in Intelligent Systems XII. SGAI 2004. Springer, London. https://doi.org/10.1007/1-84628-103-2_1

DOI: https://doi.org/10.1007/1-84628-103-2_1
Publisher Name: Springer, London
Print ISBN: 978-1-85233-908-1
Online ISBN: 978-1-84628-103-7
eBook Packages: Computer Science, Computer Science (R0)
