Abstract
In this paper, we compare case-based spam filters, focusing on their resilience to concept drift. In particular, we evaluate how to track concept drift using a case-based spam filter that uses a feature-free distance measure based on text compression. In our experiments, we compare two ways to normalise such a distance measure, finding that the one proposed in [1] performs better. We show that a policy as simple as retaining misclassified examples has a hugely beneficial effect on handling concept drift in spam but, on its own, it results in the case base growing by over 30%. We then compare two different retention policies and two different forgetting policies (one a form of instance selection, the other a form of instance weighting) and find that they perform roughly as well as each other while keeping the case base size constant. Finally, we compare a feature-based textual case-based spam filter with our feature-free approach. In the face of concept drift, the feature-based approach requires the case base to be rebuilt periodically so that we can select a new feature set that better predicts the target concept. We find feature-free approaches to have lower error rates than their feature-based equivalents.
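To make the feature-free idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the normalised compression distance (NCD), one common way to normalise a compression-based distance; the abstract does not say which formula the paper uses. It also assumes zlib as the compressor and a 1-nearest-neighbour classifier. The function names (`compressed_size`, `ncd`, `classify`, `classify_and_retain`) are hypothetical. The last function illustrates the simple retention policy described above: a message is added to the case base only when it is misclassified.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalised compression distance between two texts (assumed measure)."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(case_base, text):
    """Return the label of the nearest case under NCD (1-NN, an assumption)."""
    _, nearest_label = min(case_base, key=lambda case: ncd(case[0], text))
    return nearest_label


def classify_and_retain(case_base, text, true_label):
    """Classify `text`; retain it in the case base only if misclassified."""
    predicted = classify(case_base, text)
    if predicted != true_label:
        case_base.append((text, true_label))
    return predicted


if __name__ == "__main__":
    # Toy case base of (text, label) pairs; the messages are invented.
    cases = [
        ("buy cheap pills now " * 20, "spam"),
        ("the quarterly meeting is moved to thursday afternoon " * 20, "ham"),
    ]
    message = "buy cheap pills today " * 20
    print(classify_and_retain(cases, message, "spam"), len(cases))
```

No features are extracted at any point: the raw message text is the case representation. Note that retention on its own only ever grows the case base, which is why the abstract pairs it with forgetting policies.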
References
Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P.: The similarity metric. In: Procs. of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, Baltimore, Maryland, pp. 863–872 (2003)
Carreras, X., Marquez, L.: Boosting trees for anti-spam filtering. In: Procs. of the 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, pp. 58–64 (2001)
Drucker, H., Wu, D., Vapnik, V.: Support vector machines for spam categorization. IEEE Trans. on Neural Networks 10, 1048–1054 (1999)
Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian approach to filtering junk email. In: Procs. of the AAAI-1998 Workshop for Text Categorisation, Madison, Wisconsin, pp. 55–62 (1998)
Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C., Stamatopoulos, P.: Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach. In: Procs. of the PKDD-2000 Workshop on Machine Learning and Textual Information Access, pp. 1–13 (2000)
Delany, S.J., Cunningham, P.: An analysis of case-based editing in a spam filtering system. In: Funk, P., González Calero, P.A. (eds.) ECCBR 2004. LNCS (LNAI), vol. 3155, pp. 128–141. Springer, Heidelberg (2004)
Delany, S.J., Cunningham, P., Coyle, L.: An assessment of case-based reasoning for spam filtering. Artificial Intelligence Review 24, 359–378 (2005)
Méndez, J.R., Fdez-Roverola, F., Iglesias, E.L., Díaz, F., Corchado, J.M.: Tracking concept drift at feature selection stage in spamhunting: An anti-spam instance-based reasoning system. In: Roth-Berghofer, T.R., Göker, M.H., Güvenir, H.A. (eds.) ECCBR 2006. LNCS (LNAI), vol. 4106, pp. 504–518. Springer, Heidelberg (2006)
Gray, A., Haahr, M.: Personalised, collaborative spam filtering. In: Procs. of 1st Conference on Email and Anti-Spam, Mountain View, CA (2004)
Delany, S.J., Bridge, D.: Feature-based and feature-free textual CBR: A comparison in spam filtering. In: Procs. of the 17th Irish Conference on Artificial Intelligence and Cognitive Science, Belfast, Northern Ireland, pp. 244–253 (2006)
Aha, D.W.: Generalizing from case studies: A case study. In: Procs. of the 9th International Conference on Machine Learning, Aberdeen, Scotland, pp. 1–10 (1992)
Delany, S.J., Cunningham, P., Smyth, B.: ECUE: A spam filter that uses machine learning to track concept drift. In: Procs. of the 17th European Conference on Artificial Intelligence (PAIS stream), Riva del Garda, Italy, pp. 627–631 (2006)
Delany, S.J., Bridge, D.: Textual case-based reasoning for spam filtering: A comparison of feature-based and feature-free approaches. Artificial Intelligence Review (Forthcoming)
Delany, S.J., Cunningham, P., Tsymbal, A., Coyle, L.: A case-based technique for tracking concept drift in spam filtering. Knowledge-Based Systems 18, 187–195 (2005)
Lenz, M., Auriol, E., Manago, M.: Diagnosis and decision support. In: Lenz, M., Bartsch-Spörl, B., Burkhard, H.-D., Wess, S. (eds.) Case-Based Reasoning Technology. LNCS (LNAI), vol. 1400, pp. 51–90. Springer, Heidelberg (1998)
McKenna, E., Smyth, B.: Competence-guided case-base editing techniques. In: Blanzieri, E., Portinale, L. (eds.) EWCBR 2000. LNCS (LNAI), vol. 1898, pp. 186–197. Springer, Heidelberg (2000)
Wilson, D.R., Martinez, T.R.: Reduction techniques for instance-based learning algorithms. Machine Learning 38, 257–286 (2000)
Brighton, H., Mellish, C.: Advances in instance selection for instance-based learning algorithms. Data Mining and Knowledge Discovery 6, 153–172 (2002)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1997)
Loewenstern, D., Hirsh, H., Yianilos, P., Noordewier, M.: DNA sequence classification using compression-based induction. Technical Report 95-04, Rutgers University, Computer Science Department (1995)
Keogh, E., Lonardi, S., Ratanamahatana, C.: Towards parameter-free data mining. In: Procs. of the 10th ACM SIGKDD, International Conference on Knowledge Discovery and Data Mining, New York, USA, pp. 206–215 (2004)
Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Physical Review Letters 88, 048702/1–048702/4 (2002)
Cilibrasi, R., Vitanyi, P.: Clustering by compression. IEEE Transactions on Information Theory 51, 1523–1545 (2005)
Frank, E., Chui, C., Witten, I.H.: Text categorization using compression models. In: Procs. of the IEEE Data Compression Conference, Utah, USA, pp. 200–209 (2000)
Teahan, W.J.: Text classification and segmentation using minimum cross-entropy. In: Procs. of the 6th International Conference on Recherche d’Information Assistee par Ordinateur, Paris, France, pp. 943–961 (2000)
Teahan, W.J., Harper, D.J.: Using compression-based language models for text categorization. In: Procs. of the Workshop on Language Modeling for Information Retrieval, Carnegie Mellon University, pp. 83–88 (2001)
Bratko, A., Filipič, B.: Spam filtering using character-level Markov models: Experiments for the TREC 2005 spam track. In: Procs. of the 14th Text REtrieval Conference, Gaithersburg, MD (2005)
Bratko, A., Cormack, G.V., Filipič, B., Lynam, T.R., Zupan, B.: Spam filtering using statistical data compression models. Journal of Machine Learning Research 7, 2673–2698 (2006)
Rennie, J.D.M., Jaakkola, T.: Automatic feature induction for text classification. In: MIT Artificial Intelligence Laboratory Abstract Book, Cambridge, MA (2002)
Wess, S., Althoff, K.D., Derwand, G.: Using k-d trees to improve the retrieval step in case-based reasoning. In: Haton, J.-P., Manago, M., Keane, M.A. (eds.) Advances in Case-Based Reasoning. LNCS, vol. 984, pp. 167–181. Springer, Heidelberg (1995)
Schaaf, J.W.: Fish and shrink. A next step towards efficient case retrieval in large-scale case bases. In: Smith, I., Faltings, B.V. (eds.) Advances in Case-Based Reasoning. LNCS, vol. 1168, pp. 362–376. Springer, Heidelberg (1996)
Widmer, G., Kubat, M.: Learning in the presence of concept drift and hidden contexts. Machine Learning 23, 69–101 (1996)
Kubat, M., Widmer, G.: Adapting to drift in continuous domains. In: Procs. of the 8th European Conference on Machine Learning, Heraclion, Crete, pp. 307–310 (1995)
Salganicoff, M.: Tolerating concept and sampling shift in lazy learning using prediction error context switching. Artificial Intelligence Review 11, 133–155 (1997)
Klinkenberg, R., Joachims, T.: Detecting concept drift with support vector machines. In: Procs. of the 17th International Conference on Machine Learning, San Francisco, CA, pp. 487–494 (2000)
Klinkenberg, R.: Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis 8, 281–300 (2004)
Aha, D.W., Kibler, D., Albert, M.K.: Instance-based learning algorithms. Machine Learning 6, 37–66 (1991)
Kuncheva, L.I.: Classifier ensembles for changing environments. In: Procs. of the 5th International Workshop on Multiple Classifier Systems, Italy, pp. 1–15 (2004)
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Delany, S.J., Bridge, D. (2007). Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering. In: Weber, R.O., Richter, M.M. (eds) Case-Based Reasoning Research and Development. ICCBR 2007. Lecture Notes in Computer Science, vol 4626. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74141-1_22
DOI: https://doi.org/10.1007/978-3-540-74141-1_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74138-1
Online ISBN: 978-3-540-74141-1
eBook Packages: Computer Science, Computer Science (R0)