An Assessment of Case-Based Reasoning for Spam Filtering

Abstract

Because of the changing nature of spam, a spam filtering system that uses machine learning will need to be dynamic. This suggests that a case-based (memory-based) approach may work well. Case-Based Reasoning (CBR) is a lazy approach to machine learning where induction is delayed to run time. This means that the case base can be updated continuously and new training data is immediately available to the induction process. In this paper we present a detailed description of such a system called ECUE and evaluate design decisions concerning the case representation. We compare its performance with an alternative system that uses Naïve Bayes. We find that there is little to choose between the two alternatives in cross-validation tests on data sets. However, ECUE does appear to have some advantages in tracking concept drift over time.
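The "lazy" property described above can be illustrated with a minimal sketch. This is not the ECUE implementation: the class name `CaseBase`, the whitespace tokenizer, the Jaccard similarity measure, and the majority vote are all assumptions chosen for brevity. The sketch only shows that a newly added case influences the very next classification, with no retraining step in between.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into a set of word tokens."""
    return set(text.lower().split())


class CaseBase:
    """Toy lazy learner (hypothetical): cases are stored and compared at query time."""

    def __init__(self, k=3):
        self.k = k
        self.cases = []  # list of (token_set, label) pairs

    def add_case(self, text, label):
        # No model is rebuilt here; the case is usable by the next query.
        self.cases.append((tokenize(text), label))

    @staticmethod
    def similarity(a, b):
        # Jaccard overlap between two token sets (an illustrative choice).
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def classify(self, text):
        if not self.cases:
            raise ValueError("case base is empty")
        query = tokenize(text)
        # Induction happens here, at run time: rank stored cases by similarity.
        ranked = sorted(
            self.cases,
            key=lambda case: self.similarity(query, case[0]),
            reverse=True,
        )
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


cb = CaseBase(k=1)
cb.add_case("free time for a meeting on monday", "ham")
cb.add_case("cheap pills buy now", "spam")

message = "win a free cruise today"
before = cb.classify(message)  # nearest stored case is the ham message

# A user flags a new kind of spam; it is added directly to the case base.
cb.add_case("win a free holiday cruise", "spam")
after = cb.classify(message)   # the new case is now the nearest neighbour

print(before, after)
```

In this toy run the same message is labelled `ham` before the update and `spam` after it, which is the behaviour the abstract appeals to when it argues that a case-based filter can follow changes in spam over time.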

Author information

Corresponding author

Correspondence to Sarah Jane Delany.

Additional information

★ This research was supported by funding from Enterprise Ireland under grant no. CFTD/03/219 and funding from Science Foundation Ireland under grant no. SFI-02IN.1I111

About this article

Cite this article

Delany, S.J., Cunningham, P. & Coyle, L. An Assessment of Case-Based Reasoning for Spam Filtering. Artif Intell Rev 24, 359–378 (2005). https://doi.org/10.1007/s10462-005-9006-6

  • DOI: https://doi.org/10.1007/s10462-005-9006-6

Keywords

  • case-based reasoning
  • spam filtering