Advertisement

Information Retrieval

, Volume 6, Issue 1, pp 49–73 | Cite as

A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists

  • Georgios Sakkis
  • Ion Androutsopoulos
  • Georgios Paliouras
  • Vangelis Karkaletsis
  • Constantine D. Spyropoulos
  • Panagiotis Stamatopoulos
Article

Abstract

This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to identify automatically unsolicited commercial messages that flood mailboxes. Focusing on anti-spam filtering for mailing lists, a thorough investigation of the effectiveness of a memory-based anti-spam filter is performed using a publicly available corpus. The investigation includes different attribute and distance-weighting schemes, and studies on the effect of the neighborhood size, the size of the attribute set, and the size of the training corpus. Three different cost scenarios are identified, and suitable cost-sensitive evaluation functions are employed. We conclude that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets. Compared to a previously tested Naive Bayes filter, the memory-based filter performs on average better, particularly when the misclassification cost for non-spam messages is high.

text categorization machine learning unsolicited commercial e-mail spam 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Aha WD(1992)Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies, 36:267-287.Google Scholar
  2. Aha WD, Kibler D and Albert MK (1991) Instance-based learning algorithms. Machine Learning, 6:37-66.Google Scholar
  3. Androutsopoulos I, Koutsias J, Chandrinos KV, Paliouras G and Spyropoulos CD (2000a) An evaluation of Naive Bayesian anti-spam filtering. In: Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), Barcelona, Spain, pp. 9-17.Google Scholar
  4. Androutsopoulos I, Koutsias J, Chandrinos KV and Spyropoulos CD(2000b)An experimental comparison of Naive Bayesian and keyword-based anti-spam filtering with encrypted personal e-mail messages. In: Proceedings of the 23rd Annual International ACMSIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), Athens, Greece, pp. 160-167.Google Scholar
  5. Androutsopoulos I, Paliouras G, Karkaletsis V, Sakkis G, Spyropoulos CD and Stamatopoulos P (2000c) Learning to filter spam e-mail: A comparison of a Naive Bayesian and a memory-based approach. In: Proceedings of the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000), Lyon, France, pp. 1-3.Google Scholar
  6. Bailey T and Jain AK(1978)A note on distance-weighted k-nearest neighbor rules, IEEE Transactions on Systems, Man, and Cybernetics, 8(4):311-313.Google Scholar
  7. Cohen WW (1996) Learning rules that classify e-mail. In: Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, Palo Alto, US, pp. 18 -25.Google Scholar
  8. Cohen WW and Singer Y (1999) Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems, 17(2):141-173.Google Scholar
  9. Cover T and Hart P (1967) Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21-27.Google Scholar
  10. Cranor LF and LaMacchia BA (1998) Spam!, Communications of ACM, 41(8):74-83.Google Scholar
  11. Cristianini N and Shawe-Taylor J (2000) An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press.Google Scholar
  12. Daelemans W, Van den Bosch A and Weijters A (1997) IGTREE: Using trees for compression and classification in lazy learning algorithms. Artificial Intelligence Review, 11:407-423.Google Scholar
  13. Daelemans W, Zavrel J, van der Sloot K and van den Bosch A (2000) TiMBL: Tilburg Memory Based Learner, version 3.0, Reference Guide. ILK, Computational Linguistics, Tilburg University. http:/ilk.kub.nl/∼ilk/papers.Google Scholar
  14. Drucker HD, Wu D and Vapnik V (1999) Support vector machines for spam categorization. IEEE Transactions On Neural Networks, 10(5):1048-1054.Google Scholar
  15. Duda RO and Hart PE (1973) Bayes decision theory. Chapter 2 in Pattern Classification and Scene Analysis, John Wiley, pp. 10-43.Google Scholar
  16. Dudani AS (1976) The distance-weighted k-nearest neighbor rule. IEEE Transactions on Systems, Man and Cybernetics, 6(4):325-327.Google Scholar
  17. Giraud-Carrier C and Martinez RT (1995) An efficient metric for heterogeneous inductive learning applications in the attribute-value language. Intelligent Systems, pp. 341-350.Google Scholar
  18. Gómez Hidalgo JM, Maña López M and Puertas Sanz E (2000) Combining text and heuristics for costsensitive spam filtering. In: Fourth Computational Natural Language LearningWorkshop, CoNLL-2000, Lisbon, Portugal, pp. 99-102.Google Scholar
  19. Hall RJ (1998) How to avoid unwanted email. Communications of ACM, 41(3):88-95.Google Scholar
  20. Hull D and Robertson S (2000) The TREC-8 filtering track final report. In: Proceedings of the Eighth Text Retrieval Conference (TREC-8), NIST Special Publication 500-246, pp. 35-56.Google Scholar
  21. Joachims T (1997) A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville, US, pp. 143-151.Google Scholar
  22. Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), Morgan Kaufmann, pp. 1137-1143.Google Scholar
  23. Lang K (1995) Newsweeder: Learning to filter netnews. In: Proceedings of the 12th International Conference on Machine Learning, Stanford, CA, pp. 331-339.Google Scholar
  24. Lewis D (1995) Evaluating and optimizing autonomous text classification systems. In: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '95), New York, pp. 246-254.Google Scholar
  25. MacLeod ESJ, Luk A and Titterington DM (1987) A re-examination of the distance-weighted k-nearest neighbor classification rule. IEEE Transactions on Systems, Man, and Cybernetics, 17(4):689-696.Google Scholar
  26. Mitchell TM (1997) Machine Learning. McGraw-Hill.Google Scholar
  27. Pantel P and Lin D (1998) SpamCop: A spam classification and organization program. Learning for Text Categorization-Papers from the AAAI Workshop, Madison Wisconsin, pp. 95-98. AAAI Technical Report WS-98-05.Google Scholar
  28. Payne TR and Edwards P (1997) Interface agents that learn: An investigation of learning issues in a mail agent interface. Applied Artificial Intelligence, 11(1):1-32.Google Scholar
  29. Quinlan JR (1986) Induction of Decision Trees. Machine learning, 1(1):81-106.Google Scholar
  30. Quinlan JR (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California.Google Scholar
  31. Rocchio, J. (1971). Relevance feedback information retrieval. In: Salton G, ed., The Smart Retrieval System- Experiments in Automatic Document Processing, Prentice-Hall, Englewood Cliffs, NJ, pp. 313-323.Google Scholar
  32. Sahami M, Dumais S, Heckerman D and Horvitz E (1998) A Bayesian approach to filtering junk e-mail. Learning for Text Categorization-Papers from the AAAI Workshop, Madison Wisconsin, pp. 55-62. AAAI Technical Report WS-98-05.Google Scholar
  33. Sakkis G, Androutsopoulos I, Paliouras G, Karkaletsis V, Spyropoulos CD and Stamatopoulos P (2001) Stacking classifiers for anti-spam filtering of e-mail. In: Proceedings of the Sixth Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), Carnegie Mellon University, Pittsburgh, PA, USA, pp. 44-50.Google Scholar
  34. Salton G and Buckley C (1988) Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(3):513-523.Google Scholar
  35. Salton G and McGill MJ (1983) Introduction to Modern Information Retrieval. McGraw-Hill.Google Scholar
  36. Schapire RE and Singer Y(2000) BoosTexter:Aboosting-based system for text categorization. Machine Learning, 39(2/3):135-168.Google Scholar
  37. Sebastiani F (2001) Machine Learning in Automated Text Categorization. Revised version of Technical Report IEI-B4-31-1999, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy.Google Scholar
  38. Wettschereck D (1994) A Study of Distance-Based Machine Learning Algorithms. PhD thesis, Oregon State University.Google Scholar
  39. Wettschereck D, Aha WD and Mohri T (1995) A Review and Comparative Evaluation of Feature Weighting Methods for Lazy Learning Algorithms, Technical Report AIC-95-012, Washington, DC: Naval Research Laboratory, Navy Center for Applied Research in Artificial Intelligence.Google Scholar
  40. Wilson DR (1997) Advances in Instance-Based Learning Algorithms. PhD thesis, Brigham Young University.Google Scholar
  41. Wilson DR and Martinez RT (1997) Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6(1):1-34.Google Scholar
  42. Wolpert D (1992) Stacked generalization. Neural Networks, 5(2):241-260.Google Scholar
  43. Yang Y and Pedersen JO (1997) A comparative study on feature selection in text categorization. In: Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville, US, pp. 412-420.Google Scholar
  44. Zavrel J (1997) An empirical re-examination of weighted voting for k-NN. In: Proceedings of the 7th Belgian-Dutch Conference on Machine Learning (BENELEARN-97), Tilburg, The Netherlands.Google Scholar

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Georgios Sakkis
    • 1
  • Ion Androutsopoulos
    • 2
  • Georgios Paliouras
    • 1
  • Vangelis Karkaletsis
    • 1
  • Constantine D. Spyropoulos
    • 1
  • Panagiotis Stamatopoulos
    • 2
  1. 1.National Centre for Scientific Research (NCSR) “Demokritos”Institute of Informatics and TelecommunicationsAthensGreece
  2. 2.Department of InformaticsAthens University of Economics and BusinessAthensGreece

Personalised recommendations