Spam Detection by Machine Learning-Based Content Analysis

Progresses in Artificial Intelligence and Neural Systems

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 184)

Abstract

This paper presents a spam detection system based on machine-learning content analysis. The system is composed of six units: Tokenization and Word Cleaning, Lemmatization, Stop-Word Removal and Synonym Replacement, Term Selection, Bag-of-Words Representer, and Classifier. Experiments on two datasets, SpamAssassin and TREC 2007, show satisfactory results, comparable with the state of the art.
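
The six units named in the abstract can be pictured as a chain of small functions. The Python sketch below is illustrative only and is not the authors' implementation: the stop-word list, the synonym table, the plural-stripping "lemmatizer", the document-frequency term selection, and the naive Bayes classifier are all stand-ins chosen to keep the example dependency-free, since the abstract does not specify the concrete methods used in each unit. All function names and the toy training messages are hypothetical.

```python
import math
import re
from collections import Counter

# Toy resources, all hypothetical; a real system would use full lexicons.
STOP_WORDS = {"the", "a", "an", "to", "is", "you", "for", "at", "of", "and", "your", "now"}
SYNONYMS = {"cash": "money", "prize": "money", "complimentary": "free"}


def tokenize(text):
    """Unit 1: lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def lemmatize(token):
    """Unit 2 (stand-in): crude plural stripping, not a real lemmatizer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(tokens):
    """Unit 3: drop stop words, then map synonyms to one canonical term."""
    return [SYNONYMS.get(t, t) for t in tokens if t not in STOP_WORDS]


def preprocess(text):
    """Run units 1 to 3 on a raw message."""
    return normalize([lemmatize(t) for t in tokenize(text)])


def select_terms(docs, k):
    """Unit 4 (stand-in): keep the k terms with the highest document frequency."""
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(df, key=lambda t: (-df[t], t))[:k]


def bag_of_words(doc, vocabulary):
    """Unit 5: term-count vector over the selected vocabulary."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]


class NaiveBayes:
    """Unit 6 (stand-in): multinomial naive Bayes with Laplace smoothing."""

    def fit(self, vectors, labels):
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(vectors))
            totals = [sum(col) + 1 for col in zip(*rows)]  # add-one smoothing
            denom = sum(totals)
            self.log_likelihood[c] = [math.log(t / denom) for t in totals]
        return self

    def predict(self, vector):
        def score(c):
            return self.log_prior[c] + sum(
                x * w for x, w in zip(vector, self.log_likelihood[c]))
        return max(self.classes, key=score)


# Four made-up training messages, only to exercise the pipeline.
train = [
    ("Win free cash prizes now", "spam"),
    ("Claim your complimentary prize", "spam"),
    ("Meeting agenda for the project", "ham"),
    ("Project report attached for review", "ham"),
]
docs = [preprocess(text) for text, _ in train]
labels = [label for _, label in train]
vocabulary = select_terms(docs, k=6)
model = NaiveBayes().fit([bag_of_words(d, vocabulary) for d in docs], labels)


def classify(text):
    """Apply all six units to a new message."""
    return model.predict(bag_of_words(preprocess(text), vocabulary))
```

With these toy inputs, `classify("Free money for you")` yields `"spam"` and `classify("Project meeting agenda")` yields `"ham"`. Each stand-in could be swapped for a proper component (for example, a dictionary-based lemmatizer or a different classifier) without changing the overall flow.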



Acknowledgments

Daniele Davino developed part of this work as an exam project for Multimedia Systems and Laboratory during his M.Sc. in Computer Science at the University of Naples Parthenope, under the supervision of Francesco Camastra and Angelo Ciaramella. The research of Francesco Camastra, Angelo Ciaramella, and Antonino Staiano was funded by the "Sostegno alla ricerca individuale per il triennio 2015–17" project of the University of Naples Parthenope.

Corresponding author

Correspondence to Antonino Staiano.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this chapter

Davino, D., Camastra, F., Ciaramella, A., Staiano, A. (2021). Spam Detection by Machine Learning-Based Content Analysis. In: Esposito, A., Faundez-Zanuy, M., Morabito, F., Pasero, E. (eds) Progresses in Artificial Intelligence and Neural Systems. Smart Innovation, Systems and Technologies, vol 184. Springer, Singapore. https://doi.org/10.1007/978-981-15-5093-5_37
