
Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering

  • Conference paper
Case-Based Reasoning Research and Development (ICCBR 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4626)


Abstract

In this paper, we compare case-based spam filters, focusing on their resilience to concept drift. In particular, we evaluate how to track concept drift using a case-based spam filter that uses a feature-free distance measure based on text compression. In our experiments, we compare two ways to normalise such a distance measure, finding that the one proposed in [1] performs better. We show that a policy as simple as retaining misclassified examples has a hugely beneficial effect on handling concept drift in spam but, on its own, it results in the case base growing by over 30%. We then compare two different retention policies and two different forgetting policies (one a form of instance selection, the other a form of instance weighting) and find that they perform roughly as well as each other while keeping the case base size constant. Finally, we compare a feature-based textual case-based spam filter with our feature-free approach. In the face of concept drift, the feature-based approach requires the case base to be rebuilt periodically so that we can select a new feature set that better predicts the target concept. We find feature-free approaches to have lower error rates than their feature-based equivalents.
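The ideas in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes zlib as the compressor and 1-nearest-neighbour classification, neither of which the abstract specifies. It shows two well-known normalisations of a compression-based distance: the normalised compression distance (NCD) associated with the line of work in [1], and the compression-based dissimilarity measure (CDM) of Keogh et al. The abstract does not name the second normalisation, so CDM is an assumption here. The `CaseBase` class and its method names are hypothetical; its `process` method applies the simple policy of retaining misclassified examples.

```python
import zlib


def csize(text: str) -> int:
    """Compressed size in bytes. zlib is an assumed stand-in compressor."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalised compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def cdm(x: str, y: str) -> float:
    """Compression-based dissimilarity measure: C(xy) / (C(x) + C(y))."""
    return csize(x + y) / (csize(x) + csize(y))


class CaseBase:
    """Hypothetical feature-free case base: cases are raw (text, label) pairs."""

    def __init__(self, cases, distance=ncd):
        self.cases = list(cases)
        self.distance = distance

    def classify(self, text: str) -> str:
        # 1-NN for brevity: return the label of the closest stored case.
        nearest = min(self.cases, key=lambda case: self.distance(text, case[0]))
        return nearest[1]

    def process(self, text: str, true_label: str) -> str:
        """Classify, then retain the example only if it was misclassified."""
        predicted = self.classify(text)
        if predicted != true_label:
            self.cases.append((text, true_label))
        return predicted


ham = "Hi Sarah, the project meeting has moved to Thursday at 10am in room 204."
spam = "Cheap meds online, buy now, limited offer, click here to win a prize today!"

case_base = CaseBase([(ham, "ham")])
first = case_base.process(spam, "spam")   # only ham is known, so this is misclassified
second = case_base.process(spam, "spam")  # the retained case is now the nearest neighbour
```

Retention on its own only ever adds cases, which is why the abstract pairs it with forgetting policies to keep the case base size constant. A forgetting step is deliberately left out of this sketch.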


References

  1. Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.: The similarity metric. In: Procs. of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, Baltimore, Maryland, pp. 863–872 (2003)


  2. Carreras, X., Marquez, L.: Boosting trees for anti-spam filtering. In: Procs. of the 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, pp. 58–64 (2001)


  3. Drucker, H., Wu, D., Vapnik, V.: Support vector machines for spam categorization. IEEE Trans. on Neural Networks 10, 1048–1054 (1999)


  4. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian approach to filtering junk email. In: Procs. of the AAAI-1998 Workshop for Text Categorisation, Madison, Wisconsin, pp. 55–62 (1998)


  5. Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C., Stamatopoulos, P.: Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach. In: Procs. of the PKDD-2000 Workshop on Machine Learning and Textual Information Access, pp. 1–13 (2000)


  6. Delany, S.J., Cunningham, P.: An analysis of case-based editing in a spam filtering system. In: Funk, P., González Calero, P.A. (eds.) ECCBR 2004. LNCS (LNAI), vol. 3155, pp. 128–141. Springer, Heidelberg (2004)


  7. Delany, S.J., Cunningham, P., Coyle, L.: An assessment of case-based reasoning for spam filtering. Artificial Intelligence Review 24, 359–378 (2005)


  8. Méndez, J.R., Fdez-Riverola, F., Iglesias, E.L., Díaz, F., Corchado, J.M.: Tracking concept drift at feature selection stage in SpamHunting: An anti-spam instance-based reasoning system. In: Roth-Berghofer, T.R., Göker, M.H., Güvenir, H.A. (eds.) ECCBR 2006. LNCS (LNAI), vol. 4106, pp. 504–518. Springer, Heidelberg (2006)


  9. Gray, A., Haahr, M.: Personalised, collaborative spam filtering. In: Procs. of 1st Conference on Email and Anti-Spam, Mountain View, CA (2004)


  10. Delany, S.J., Bridge, D.: Feature-based and feature-free textual CBR: A comparison in spam filtering. In: Procs. of the 17th Irish Conference on Artificial Intelligence and Cognitive Science, Belfast, Northern Ireland, pp. 244–253 (2006)


  11. Aha, D.W.: Generalizing from case studies: A case study. In: Procs. of the 9th International Conference on Machine Learning, Aberdeen, Scotland, pp. 1–10 (1992)


  12. Delany, S.J., Cunningham, P., Smyth, B.: ECUE: A spam filter that uses machine learning to track concept drift. In: Procs. of the 17th European Conference on Artificial Intelligence (PAIS stream), Riva del Garda, Italy, pp. 627–631 (2006)


  13. Delany, S.J., Bridge, D.: Textual case-based reasoning for spam filtering: A comparison of feature-based and feature-free approaches. Artificial Intelligence Review (Forthcoming)


  14. Delany, S.J., Cunningham, P., Tsymbal, A., Coyle, L.: A case-based technique for tracking concept drift in spam filtering. Knowledge-Based Systems 18, 187–195 (2005)


  15. Lenz, M., Auriol, E., Manago, M.: Diagnosis and decision support. In: Lenz, M., Bartsch-Spörl, B., Burkhard, H.-D., Wess, S. (eds.) Case-Based Reasoning Technology. LNCS (LNAI), vol. 1400, pp. 51–90. Springer, Heidelberg (1998)


  16. McKenna, E., Smyth, B.: Competence-guided case-base editing techniques. In: Blanzieri, E., Portinale, L. (eds.) EWCBR 2000. LNCS (LNAI), vol. 1898, pp. 186–197. Springer, Heidelberg (2000)


  17. Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-based learning algorithms. Machine Learning 38, 257–286 (2000)


  18. Brighton, H., Mellish, C.: Advances in instance selection for instance-based learning algorithms. Data Mining and Knowledge Discovery 6, 153–172 (2002)


  19. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1997)


  20. Loewenstern, D., Hirsh, H., Yianilos, P., Noordewier, M.: DNA sequence classification using compression-based induction. Technical Report 95-04, Rutgers University, Computer Science Department (1995)


  21. Keogh, E., Lonardi, S., Ratanamahatana, C.: Towards parameter-free data mining. In: Procs. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, USA, pp. 206–215 (2004)


  22. Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Physical Review Letters 88, 048702/1–048702/4 (2002)


  23. Cilibrasi, R., Vitanyi, P.: Clustering by compression. IEEE Transactions on Information Theory 51, 1523–1545 (2005)


  24. Frank, E., Chui, C., Witten, I.H.: Text categorization using compression models. In: Procs. of the IEEE Data Compression Conference, Utah, USA, pp. 200–209 (2000)


  25. Teahan, W.J.: Text classification and segmentation using minimum cross-entropy. In: Procs. of the 6th International Conference on Recherche d’Information Assistee par Ordinateur, Paris, France, pp. 943–961 (2000)


  26. Teahan, W.J., Harper, D.J.: Using compression-based language models for text categorization. In: Procs. of the Workshop on Language Modeling for Information Retrieval, Carnegie Mellon University, pp. 83–88 (2001)


  27. Bratko, A., Filipič, B.: Spam filtering using character-level Markov models: Experiments for the TREC 2005 spam track. In: Procs. of the 14th Text REtrieval Conference, Gaithersburg, MD (2005)


  28. Bratko, A., Cormack, G.V., Filipič, B., Lynam, T.R., Zupan, B.: Spam filtering using statistical data compression models. Journal of Machine Learning Research 7, 2673–2698 (2006)


  29. Rennie, J.D.M., Jaakkola, T.: Automatic feature induction for text classification. In: MIT Artificial Intelligence Laboratory Abstract Book, Cambridge, MA (2002)


  30. Wess, S., Althoff, K.D., Derwand, G.: Using k-d trees to improve the retrieval step in case-based reasoning. In: Haton, J.-P., Manago, M., Keane, M.A. (eds.) Advances in Case-Based Reasoning. LNCS, vol. 984, pp. 167–181. Springer, Heidelberg (1995)


  31. Schaaf, J.W.: Fish and shrink. A next step towards efficient case retrieval in large-scale case bases. In: Smith, I., Faltings, B.V. (eds.) Advances in Case-Based Reasoning. LNCS, vol. 1168, pp. 362–376. Springer, Heidelberg (1996)


  32. Widmer, G., Kubat, M.: Learning in the presence of concept drift and hidden contexts. Machine Learning 23, 69–101 (1996)


  33. Kubat, M., Widmer, G.: Adapting to drift in continuous domains. In: Procs. of the 8th European Conference on Machine Learning, Heraclion, Crete, pp. 307–310 (1995)


  34. Salganicoff, M.: Tolerating concept and sampling shift in lazy learning using prediction error context switching. Artificial Intelligence Review 11, 133–155 (1997)


  35. Klinkenberg, R., Joachims, T.: Detecting concept drift with support vector machines. In: Procs. of the 17th International Conference on Machine Learning, San Francisco, CA, pp. 487–494 (2000)


  36. Klinkenberg, R.: Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis 8, 281–300 (2004)


  37. Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learning 6, 37–66 (1991)


  38. Kuncheva, L.I.: Classifier ensembles for changing environments. In: Procs. of the 5th International Workshop on Multiple Classifier Systems, Italy, pp. 1–15 (2004)



Editor information

Rosina O. Weber, Michael M. Richter


Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Delany, S.J., Bridge, D. (2007). Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering. In: Weber, R.O., Richter, M.M. (eds) Case-Based Reasoning Research and Development. ICCBR 2007. Lecture Notes in Computer Science, vol 4626. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74141-1_22


  • DOI: https://doi.org/10.1007/978-3-540-74141-1_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74138-1

  • Online ISBN: 978-3-540-74141-1

  • eBook Packages: Computer Science, Computer Science (R0)
