Tokenising, Stemming and Stopword Removal on Anti-spam Filtering Domain

Méndez, J. R.; Iglesias, E. L.; Fdez-Riverola, F.; Díaz, F.; Corchado, J. M.

doi:10.1007/11881216_47

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4177)

Included in the following conference series:

Conference of the Spanish Association for Artificial Intelligence

Abstract

Junk e-mail detection and filtering can be considered a cost-sensitive classification problem. However, the preprocessing methods and noise reduction strategies used to improve computational efficiency in text classification are not necessarily as effective in e-mail filtering. This paper demonstrates the point through a comparative study of stopword removal, stemming and different tokenising schemes. The goal is to preprocess the training e-mail corpora used by several content-based spam filtering techniques (machine learning approaches and case-based systems). Sound conclusions are drawn from experiments that cover different scenarios.
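
To make the three preprocessing steps named in the abstract concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the stopword list, the two tokenising schemes and the `naive_stem` suffix stripper are illustrative assumptions, and a real study would more likely use a full stopword list and a proper stemmer such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption, not the list used in the paper).
STOPWORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "for", "you", "your"}


def tokenise_alnum(text):
    """Hypothetical scheme 1: split on any run of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def tokenise_whitespace(text):
    """Hypothetical scheme 2: split on whitespace only, keeping tokens like '$$$'."""
    return text.lower().split()


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Crude suffix stripper standing in for a real stemmer (not Porter's algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, tokeniser=tokenise_alnum, use_stopwords=True, use_stemming=True):
    """Apply tokenising, then optional stopword removal and stemming."""
    tokens = tokeniser(text)
    if use_stopwords:
        tokens = remove_stopwords(tokens)
    if use_stemming:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


if __name__ == "__main__":
    print(preprocess("Claim your FREE prizes, winning is easy!"))
    print(tokenise_whitespace("Get $$$ now"))
```

Switching the `tokeniser` argument or the two flags gives the kind of scenario comparison the abstract describes. For instance, the whitespace scheme keeps a token such as `$$$` that the alphanumeric scheme discards, which may matter for spam detection.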

Author information

Authors and Affiliations

Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
J. R. Méndez, E. L. Iglesias & F. Fdez-Riverola
Dept. Informática, University of Valladolid, Escuela Universitaria de Informática, Plaza Santa Eulalia, 9-11, 40005, Segovia, Spain
F. Díaz
Dept. Informática y Automática, University of Salamanca, Plaza de la Merced s/n, 37008, Salamanca, Spain
J. M. Corchado

Editor information

Editors and Affiliations

Dpto. de Ingeniería de la Información y las Comunicaciones, Universidad de Murcia, Campus de Espinardo, s/n, 30071, Murcia, Spain
Roque Marín
Universidad Politécnica de Valencia, Spain
Eva Onaindía
Dept. Electronics and Computer Science, University of Santiago de Compostela
Alberto Bugarín
Department of Life Sciences, Imperial College London, SW7 2AZ, London, UK
José Santos

Cite this paper

Méndez, J.R., Iglesias, E.L., Fdez-Riverola, F., Díaz, F., Corchado, J.M. (2006). Tokenising, Stemming and Stopword Removal on Anti-spam Filtering Domain. In: Marín, R., Onaindía, E., Bugarín, A., Santos, J. (eds) Current Topics in Artificial Intelligence. CAEPIA 2005. Lecture Notes in Computer Science, vol 4177. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11881216_47

DOI: https://doi.org/10.1007/11881216_47
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45914-9
Online ISBN: 978-3-540-45915-6
eBook Packages: Computer Science (R0)