Abstract
This paper presents a spam detection system based on machine-learning content analysis. The system comprises six units: Tokenization and Word Cleaning, Lemmatization, Stop-Word Removal and Synonym Replacement, Term Selection, Bag-of-Words Representer, and Classifier. Experiments on two datasets, SpamAssassin and TREC 2007, show satisfactory results, comparable with the state of the art.
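The six-unit pipeline named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how each unit is implemented, so everything below is an assumption made for illustration: the suffix-stripping `lemmatize`, the tiny `STOP_WORDS` and `SYNONYMS` tables, the document-frequency score in `select_terms`, and the perceptron classifier are simple stand-ins, not the authors' method, and the four training messages are invented toy data.

```python
import re
from collections import Counter

# Toy resources: hypothetical placeholders, not the lists used in the paper.
STOP_WORDS = {"the", "a", "an", "is", "to", "you", "your", "at", "for", "and", "of"}
SYNONYMS = {"cash": "money", "prize": "money"}  # map synonyms to one canonical term


def tokenize(text):
    """Unit 1: lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def lemmatize(token):
    """Unit 2: crude suffix stripping, a stand-in for a real lemmatizer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Units 1-3: tokenize, lemmatize, drop stop words, replace synonyms."""
    lemmas = (lemmatize(t) for t in tokenize(text))
    return [SYNONYMS.get(t, t) for t in lemmas if t not in STOP_WORDS]


def select_terms(docs, labels, k):
    """Unit 4: keep the k terms whose document frequency differs most by class."""
    spam_df, ham_df = Counter(), Counter()
    for doc, label in zip(docs, labels):
        (spam_df if label == 1 else ham_df).update(set(doc))
    n_spam = max(1, sum(labels))
    n_ham = max(1, len(labels) - sum(labels))
    score = {t: abs(spam_df[t] / n_spam - ham_df[t] / n_ham)
             for t in set(spam_df) | set(ham_df)}
    # Ties are broken alphabetically so the vocabulary is deterministic.
    return sorted(score, key=lambda t: (-score[t], t))[:k]


def bag_of_words(doc, vocabulary):
    """Unit 5: represent a document as term counts over the selected terms."""
    counts = Counter(doc)
    return [counts[t] for t in vocabulary]


def train_perceptron(vectors, labels, epochs=20):
    """Unit 6: a plain perceptron, used here only as a simple linear classifier."""
    weights, bias = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            target = 1 if y == 1 else -1
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if target * activation <= 0:  # misclassified: move the boundary
                weights = [w + target * xi for w, xi in zip(weights, x)]
                bias += target
    return weights, bias


def predict(text, vocabulary, weights, bias):
    """Run a raw message through all six units; 1 means spam, 0 means ham."""
    x = bag_of_words(normalize(text), vocabulary)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


# Invented toy training set (label 1 = spam, 0 = ham).
TRAIN = [
    ("Win cash prizes now", 1),
    ("Claim your free money prize", 1),
    ("Meeting agenda for Monday", 0),
    ("Please review the project report", 0),
]
docs = [normalize(text) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
vocabulary = select_terms(docs, labels, k=6)
weights, bias = train_perceptron([bag_of_words(d, vocabulary) for d in docs], labels)

print(predict("Free cash prizes", vocabulary, weights, bias))
print(predict("Agenda for the Monday meeting", vocabulary, weights, bias))
```

Each unit is kept as a separate function so that any one of them can be swapped for a stronger component (for example a dictionary-based lemmatizer or a different classifier) without changing the rest of the pipeline.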
Acknowledgments
Daniele Davino developed part of this work as an exam project for Multimedia Systems and Laboratory during his M.Sc. in Computer Science at the University of Naples Parthenope, under the supervision of Francesco Camastra and Angelo Ciaramella. The research of Francesco Camastra, Angelo Ciaramella, and Antonino Staiano was funded by the "Sostegno alla ricerca individuale per il triennio 2015–17" project of the University of Naples Parthenope.
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Davino, D., Camastra, F., Ciaramella, A., Staiano, A. (2021). Spam Detection by Machine Learning-Based Content Analysis. In: Esposito, A., Faundez-Zanuy, M., Morabito, F., Pasero, E. (eds) Progresses in Artificial Intelligence and Neural Systems. Smart Innovation, Systems and Technologies, vol 184. Springer, Singapore. https://doi.org/10.1007/978-981-15-5093-5_37
DOI: https://doi.org/10.1007/978-981-15-5093-5_37
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-5092-8
Online ISBN: 978-981-15-5093-5
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)