SMS Spam Filtering Using Probabilistic Topic Modelling and Stacked Denoising Autoencoder

  • Noura Al Moubayed
  • Toby Breckon
  • Peter Matthews
  • A. Stephen McGough
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9887)


In this paper we present a novel approach to spam filtering and demonstrate its applicability to SMS messages. Our approach requires minimal feature engineering and only a small set of labelled data samples. Features are extracted using topic modelling based on latent Dirichlet allocation, and a comprehensive data model is then created using a Stacked Denoising Autoencoder (SDA). Topic modelling summarises the data, providing ease of use and high interpretability through word-cloud visualisations of the topics. Given that SMS messages can be regarded as either spam (unwanted) or ham (wanted), the SDA is able to model the messages and accurately discriminate between the two classes without the need for a pre-labelled training set. The results are compared against state-of-the-art spam detection algorithms; our proposed approach achieves over 97% accuracy, which compares favourably with the best algorithms reported in the literature.
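The denoising-autoencoder idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tied weights, masking noise, squared-error loss, layer sizes, learning rate and the class name `DenoisingAutoencoder` are all assumptions chosen for brevity, and the four `topic_vectors` are made-up stand-ins for the per-message topic proportions that latent Dirichlet allocation would produce. Each layer is trained to reconstruct the clean input from a corrupted copy, and a second layer is then trained on the codes of the first to form a small stack.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class DenoisingAutoencoder:
    """One denoising autoencoder layer with tied weights (illustrative sketch)."""

    def __init__(self, n_visible, n_hidden, corruption=0.3, seed=0):
        self.rng = random.Random(seed)
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        self.corruption = corruption  # probability of zeroing each input value
        self.W = [[self.rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_hidden = [0.0] * n_hidden
        self.b_visible = [0.0] * n_visible

    def encode(self, x):
        return [sigmoid(self.b_hidden[j]
                        + sum(x[i] * self.W[i][j] for i in range(self.n_visible)))
                for j in range(self.n_hidden)]

    def decode(self, h):
        return [sigmoid(self.b_visible[i]
                        + sum(h[j] * self.W[i][j] for j in range(self.n_hidden)))
                for i in range(self.n_visible)]

    def reconstruction_error(self, x):
        z = self.decode(self.encode(x))
        return sum((zi - xi) ** 2 for zi, xi in zip(z, x))

    def train_step(self, x, lr=0.5):
        # Corrupt the input with masking noise, then reconstruct the CLEAN input.
        x_tilde = [0.0 if self.rng.random() < self.corruption else xi for xi in x]
        h = self.encode(x_tilde)
        z = self.decode(h)
        # Backpropagate the squared error through both sigmoid layers.
        d_out = [(z[i] - x[i]) * z[i] * (1.0 - z[i]) for i in range(self.n_visible)]
        d_hid = [sum(d_out[i] * self.W[i][j] for i in range(self.n_visible))
                 * h[j] * (1.0 - h[j]) for j in range(self.n_hidden)]
        for i in range(self.n_visible):
            for j in range(self.n_hidden):
                # Tied weights: W receives gradient from the decoder and the encoder.
                self.W[i][j] -= lr * (d_out[i] * h[j] + x_tilde[i] * d_hid[j])
            self.b_visible[i] -= lr * d_out[i]
        for j in range(self.n_hidden):
            self.b_hidden[j] -= lr * d_hid[j]


def train(layer, data, epochs=200):
    for _ in range(epochs):
        for x in data:
            layer.train_step(x)


def mean_error(layer, data):
    return sum(layer.reconstruction_error(x) for x in data) / len(data)


# Made-up topic proportions for four messages over four topics (assumption).
topic_vectors = [
    [0.70, 0.20, 0.05, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.05, 0.05, 0.80, 0.10],
    [0.10, 0.05, 0.60, 0.25],
]

layer1 = DenoisingAutoencoder(n_visible=4, n_hidden=3, seed=0)
error_before = mean_error(layer1, topic_vectors)
train(layer1, topic_vectors)
error_after = mean_error(layer1, topic_vectors)

# Greedy stacking: the second layer is trained on the codes of the first.
codes = [layer1.encode(x) for x in topic_vectors]
layer2 = DenoisingAutoencoder(n_visible=3, n_hidden=2, seed=1)
train(layer2, codes)
deep_codes = [layer2.encode(c) for c in codes]
```

In a fuller pipeline the `deep_codes` (or the per-message reconstruction errors) would feed whatever decision rule separates spam from ham; the abstract does not specify that rule, so none is sketched here.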



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Noura Al Moubayed (1)
  • Toby Breckon (1)
  • Peter Matthews (1)
  • A. Stephen McGough (1)

  1. School of Engineering and Computing Sciences, Durham University, Durham, UK
