A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain

Méndez, J. R.; Fdez-Riverola, F.; Díaz, F.; Iglesias, E. L.; Corchado, J. M.

doi:10.1007/11790853_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4065)

Included in the following conference series:

Industrial Conference on Data Mining

Abstract

In this paper we analyse the strengths and weaknesses of the most widely used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and content-based filtering techniques are carried out and discussed. The Information Gain, χ²-test, Mutual Information and Document Frequency feature selection methods are analysed in conjunction with Naïve Bayes, boosting trees, Support Vector Machines and ECUE models in different scenarios. From these experiments, the underlying ideas behind the feature selection methods are identified and applied to improve the feature selection process of SpamHunting, a novel anti-spam filtering software able to accurately classify suspicious e-mails.
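
As a rough illustration of the four feature selection scores named in the abstract, the sketch below computes them for a single term over a tiny two-class corpus. It is not the authors' implementation. It assumes the textbook text-categorization definitions (binary term presence, a 2x2 term/class contingency table, Mutual Information taken as the pointwise score for the spam class), and the function name `term_scores`, the labels and the toy corpus are all hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def term_scores(docs, labels, term):
    """Score one term with DF, IG, chi-squared and MI (hypothetical helper).

    docs   -- list of token sets, one per message (binary term presence)
    labels -- list of "spam" / "legit" strings, aligned with docs
    """
    if not docs:
        raise ValueError("corpus must not be empty")

    # 2x2 contingency table: term present/absent vs. spam/legit.
    a = sum(1 for d, y in zip(docs, labels) if term in d and y == "spam")
    b = sum(1 for d, y in zip(docs, labels) if term in d and y != "spam")
    c = sum(1 for d, y in zip(docs, labels) if term not in d and y == "spam")
    d = sum(1 for d_, y in zip(docs, labels) if term not in d_ and y != "spam")
    n = a + b + c + d

    # Document Frequency: number of messages containing the term.
    df = a + b

    # Information Gain: class entropy minus entropy conditioned on the term.
    ig = (entropy([a + c, b + d])
          - (df / n) * entropy([a, b])
          - ((c + d) / n) * entropy([c, d]))

    # Chi-squared statistic for the 2x2 table.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi2 = n * (a * d - c * b) ** 2 / denom if denom else 0.0

    # Pointwise Mutual Information between the term and the spam class.
    mi = math.log2(a * n / ((a + c) * (a + b))) if a > 0 else float("-inf")

    return {"df": df, "ig": ig, "chi2": chi2, "mi": mi}


# Toy corpus, invented purely for illustration.
docs = [{"free", "viagra"}, {"free", "offer"},
        {"meeting", "agenda"}, {"free", "agenda"}]
labels = ["spam", "spam", "legit", "legit"]

for term in ("free", "viagra", "meeting"):
    print(term, term_scores(docs, labels, term))
```

A selector built on such scores would typically rank the whole vocabulary by one of them and keep the top-ranked terms; how many terms to keep, and which score to rank by, are the kinds of choices the experiments in the paper compare.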

Author information

Authors and Affiliations

Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
J. R. Méndez, F. Fdez-Riverola & E. L. Iglesias
Dept. Informática, University of Valladolid, Escuela Universitaria de Informática, Plaza Santa Eulalia, 9-11, 40005, Segovia, Spain
F. Díaz
Dept. Informática y Automática, University of Salamanca, Plaza de la Merced s/n, 37008, Salamanca, Spain
J. M. Corchado

Editor information

Editors and Affiliations

Institute of Computer Vision and applied Computer Sciences, IBaI, Germany
Petra Perner

About this paper

Cite this paper

Méndez, J.R., Fdez-Riverola, F., Díaz, F., Iglesias, E.L., Corchado, J.M. (2006). A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain. In: Perner, P. (ed.) Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining. ICDM 2006. Lecture Notes in Computer Science, vol 4065. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11790853_9

DOI: https://doi.org/10.1007/11790853_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36036-0
Online ISBN: 978-3-540-36037-7
eBook Packages: Computer Science, Computer Science (R0)
