World Wide Web, Volume 21, Issue 2, pp 557–572

Spam query detection using stream clustering



Search engines serve as the gateway through which users reach the information they need on the Web. However, malicious users can also exploit them to facilitate attacks by submitting excessive numbers of bot-generated queries, called spam queries. In this paper, we propose a novel semi-supervised method that detects spam queries effectively and in a practical manner. We first train a model to characterize normal and malicious users, using the linguistic properties of queries as well as the behavioral characteristics of users and IP addresses. We then use the trained model to predict the label of arriving requests with a fast and efficient algorithm based on stream clustering. Our evaluation on the real log of a local search engine shows that the proposed algorithm achieves an accuracy of about 94%, while incurring low response-time and memory overhead.
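The abstract does not give the algorithm itself, so the following is only a generic illustration of the two-phase idea it describes: labeled seed data initializes clusters, and each arriving request is then labeled by its nearest cluster while the clusters are updated online. Everything here is an assumption made for illustration, not the authors' method: the names `MicroCluster` and `StreamLabeler`, the two-dimensional feature vectors, the Euclidean distance, and the fixed `radius` threshold are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MicroCluster:
    """Hypothetical summary of nearby requests: running centroid, size, label."""
    centroid: List[float]
    count: int
    label: str  # "normal" or "spam" (assumed label names)

    def absorb(self, point: Sequence[float]) -> None:
        # Incremental mean update, so no past points need to be stored.
        self.count += 1
        self.centroid = [c + (x - c) / self.count
                         for c, x in zip(self.centroid, point)]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Euclidean distance; the choice of metric is an assumption.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class StreamLabeler:
    """Labels arriving feature vectors by their nearest micro-cluster."""

    def __init__(self, seeds: Sequence[Tuple[Sequence[float], str]],
                 radius: float) -> None:
        # Each labeled seed point starts its own micro-cluster.
        self.clusters = [MicroCluster(list(p), 1, lab) for p, lab in seeds]
        self.radius = radius  # assumed fixed absorption threshold

    def predict(self, point: Sequence[float]) -> str:
        nearest = min(self.clusters,
                      key=lambda mc: distance(mc.centroid, point))
        if distance(nearest.centroid, point) <= self.radius:
            nearest.absorb(point)
        else:
            # Too far from every cluster: open a new one that inherits
            # the nearest cluster's label.
            self.clusters.append(MicroCluster(list(point), 1, nearest.label))
        return nearest.label


# Made-up features: (query length, requests per minute).
labeler = StreamLabeler([([5.0, 1.0], "normal"), ([40.0, 60.0], "spam")],
                        radius=10.0)
print(labeler.predict([6.0, 2.0]))    # close to the "normal" seed
print(labeler.predict([38.0, 55.0]))  # close to the "spam" seed
```

Because each cluster keeps only a centroid and a count, memory grows with the number of clusters rather than the number of requests, which is the general property that makes stream clustering attractive for this kind of workload.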


Keywords: Bot; Spam query; Search engine; Clustering; Stream data; Semi-supervised learning



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Tahere Shakiba (1)
  • Sajjad Zarifzadeh (1)
  • Vali Derhami (1)
  1. Yazd University, Yazd, Iran
