Application of Learning Algorithms to Image Spam Evolution

  • Shruti Wakade
  • Kathy J. Liszka
  • Chien-Chung Chan
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 13)


Spam filters have become very proficient at identifying text spam, so spammers have developed different techniques to bypass them. One such method is image spam, which first appeared in 2005 and quickly grew in popularity. KnujOn is a website that collects and sorts spam for investigations and data analysis of email-based threats. We have been collecting image spam from KnujOn on a daily basis since April 2008, culminating in a large corpus of real data. In this chapter, we identify eight features for distinguishing computer-generated image spam from ham (non-spam). We use J48 decision trees, both with and without reduced error pruning, to classify the images. Finally, we validate the approach by feature analysis on thirteen months of our corpus and observe that our classification scheme is not affected by changes made to images for the purpose of avoiding OCR detection.
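To make the pruning idea concrete, the following is a minimal sketch of reduced error pruning on a small decision tree, not the chapter's actual pipeline. J48 is Weka's implementation of C4.5; the sketch below is a simplified stand-in that splits on plain information gain with binary thresholds, and the single numeric feature and all data values are made up for illustration (the abstract does not name the eight features).

```python
from collections import Counter
import math

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def best_split(rows, labels):
    """Return (gain, feature_index, threshold) of the best binary split, or None."""
    base, best = entropy(labels), None
    for f in range(len(rows[0])):
        # Skip the largest value so both sides of the split are non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def grow(rows, labels):
    """Grow an unpruned tree; each internal node remembers its training majority."""
    split = best_split(rows, labels) if len(set(labels)) > 1 else None
    if split is None or split[0] <= 0:
        return {"leaf": majority(labels)}
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {"feature": f, "threshold": t, "majority": majority(labels),
            "left": grow(*zip(*left)), "right": grow(*zip(*right))}

def predict(node, row):
    while "leaf" not in node:
        node = node["left"] if row[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

def errors(node, rows, labels):
    return sum(predict(node, r) != y for r, y in zip(rows, labels))

def prune(node, rows, labels):
    """Reduced error pruning, bottom-up, using the held-out rows that reach each node.

    A subtree is replaced by a majority-class leaf whenever the leaf makes no
    more held-out errors than the subtree. The tree is modified in place.
    """
    if "leaf" in node:
        return node
    f, t = node["feature"], node["threshold"]
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    node["left"] = prune(node["left"], [r for r, _ in left], [y for _, y in left])
    node["right"] = prune(node["right"], [r for r, _ in right], [y for _, y in right])
    leaf = {"leaf": node["majority"]}
    return leaf if errors(leaf, rows, labels) <= errors(node, rows, labels) else node

# Toy data: one hypothetical numeric feature per image; the value 4 is a noisy "ham".
train_rows = [(2,), (3,), (4,), (5,), (6,), (20,), (25,), (30,)]
train_labels = ["spam", "spam", "ham", "spam", "spam", "ham", "ham", "ham"]
prune_rows = [(3,), (4,), (5,), (22,), (28,)]
prune_labels = ["spam", "spam", "spam", "ham", "ham"]

tree = grow(train_rows, train_labels)
before = errors(tree, prune_rows, prune_labels)   # measured before in-place pruning
pruned = prune(tree, prune_rows, prune_labels)
after = errors(pruned, prune_rows, prune_labels)
print("held-out errors before/after pruning:", before, after)
```

The unpruned tree carves out a small branch to fit the noisy training point; pruning against held-out data collapses that branch because a single leaf does at least as well there. The trade-off is that the held-out rows are not used to grow the tree.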


Keywords: spam, image spam, decision trees, classification, feature analysis





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Shruti Wakade
  • Kathy J. Liszka
  • Chien-Chung Chan

Department of Computer Science, University of Akron, Akron, USA
