Evaluation of Content Based Spam Filtering Using Data Mining Approach Applied on Text and Image Corpus

  • Amit Kumar Sharma
  • Prabhjeet Kaur
  • Sanjay Kumar Anand
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 258)

Abstract

With the continuous growth in the number of email users, the volume of unsolicited email, also known as spam, has increased considerably. Current server-side and client-side anti-spam filters are designed to detect various features of spam emails. Recently, however, spammers have introduced new tricks that embed spam content in digital images, PDF files, and DOC files sent as attachments, which can render ineffective all current techniques based on analysing the digital text in the body and subject fields of an email. In this paper we propose an anti-spam filtering approach based on data mining techniques that classifies emails as spam or ham. The effectiveness of the proposed approach is evaluated experimentally on a large corpus of plain-text datasets as well as datasets of images with embedded text, and classifiers such as Random Forest and Naive Bayes are compared.
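The abstract does not give implementation details, so the following is only a minimal sketch of one step it names: weighting email text with tf-idf and classifying it as spam or ham with a Naive Bayes model. The toy corpus, the function names (`tokenize`, `idf_table`, `tfidf`, `train`, `predict`), the smoothed idf formula, and the use of tf-idf weights as fractional counts are all assumptions made for illustration; OCR, stemming, PCA, and Random Forest are left out.

```python
import math
from collections import Counter, defaultdict

# Toy training corpus (hypothetical; not taken from the paper's datasets).
TRAIN = [
    ("win cash prize now claim free offer", "spam"),
    ("free cash offer click now", "spam"),
    ("meeting agenda attached for monday", "ham"),
    ("please review the project report before the meeting", "ham"),
]


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or stop-word removal)."""
    return text.lower().split()


def idf_table(docs):
    """Smoothed inverse document frequency for every term seen in training."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """Map a token list to {term: tf * idf}, dropping terms unseen in training."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * idf[t] for t, c in counts.items() if t in idf}


def train(samples):
    """Accumulate per-class tf-idf mass for each term, plus class frequencies."""
    docs = [tokenize(text) for text, _ in samples]
    idf = idf_table(docs)
    weights = defaultdict(lambda: defaultdict(float))
    class_counts = Counter()
    for doc, (_, label) in zip(docs, samples):
        class_counts[label] += 1
        for term, weight in tfidf(doc, idf).items():
            weights[label][term] += weight
    return {"idf": idf, "weights": weights, "class_counts": class_counts}


def predict(model, text):
    """Return the class with the highest log-posterior under multinomial NB."""
    idf = model["idf"]
    vocab_size = len(idf)
    total_docs = sum(model["class_counts"].values())
    vector = tfidf(tokenize(text), idf)
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        class_weights = model["weights"][label]
        class_total = sum(class_weights.values())
        score = math.log(n_docs / total_docs)  # log prior
        for term, x in vector.items():
            # Laplace-smoothed term likelihood, weighted by the tf-idf value.
            likelihood = (class_weights.get(term, 0.0) + 1.0) / (class_total + vocab_size)
            score += x * math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(TRAIN)
    print(predict(model, "claim your free cash prize"))
    print(predict(model, "project meeting on monday"))
```

For text embedded in images, the same functions would be applied to whatever text an OCR step extracts from the attachment; that step is not sketched here.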

Keywords

Spam filtering; Image spam; OCR; Stemming; Features; VSM; tf-idf; PCA; Naive Bayes; Random Forest

Copyright information

© Springer India 2014

Authors and Affiliations

  • Amit Kumar Sharma (1)
  • Prabhjeet Kaur (1)
  • Sanjay Kumar Anand (1)

  1. Central University of Rajasthan, Kishangarh, Ajmer, India
