Journal in Computer Virology, Volume 7, Issue 1, pp 51–62

Removing web spam links from search engine results

  • Manuel Egele
  • Clemens Kolbitsch
  • Christian Platzer
Original Paper


Web spam denotes the manipulation of web pages with the sole intent of raising their position in search engine rankings. Since a better position in the rankings directly and positively affects the number of visits to a site, attackers use different techniques to boost their pages to higher ranks. In the best case, web spam pages are a nuisance that provides undeserved advertisement revenues to the page owners. In the worst case, these pages pose a threat to Internet users by hosting malicious content and launching drive-by attacks against unsuspecting victims. When successful, these drive-by attacks install malware on the victims’ machines. In this paper, we introduce an approach to detect web spam pages in the list of results returned by a search engine. In a first step, we determine the importance of different page features to the ranking in search engine results. Based on this information, we develop a classification technique that uses the important features to successfully distinguish spam sites from legitimate entries. By removing spam sites from the results, more slots become available for links that point to pages with useful content. Additionally, and more importantly, the threat posed by malicious web sites can be mitigated, reducing the risk that users are infected by malicious code spreading via drive-by attacks.
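The two steps named in the abstract (find the page features that matter, then classify and filter results) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the feature names `keyword_density` and `inbound_links`, the toy training samples, and the choice of a single-threshold decision stump selected by Gini impurity. It is not the feature set or the classifier used in the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Features = Dict[str, float]  # hypothetical page features, keyed by name


@dataclass
class Result:
    url: str
    features: Features


def gini(labels: List[bool]) -> float:
    """Gini impurity of a list of spam (True) / legitimate (False) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def train(samples: List[Tuple[Features, bool]]) -> Tuple[str, float, bool]:
    """Pick the single (feature, threshold) split with the lowest weighted
    Gini impurity; the chosen feature is the most 'important' one here."""
    best = (float("inf"), "", 0.0)
    for name in sorted(samples[0][0]):
        values = sorted({f[name] for f, _ in samples})
        # Candidate thresholds are midpoints between distinct observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for f, y in samples if f[name] <= t]
            right = [y for f, y in samples if f[name] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(samples)
            if score < best[0]:
                best = (score, name, t)
    _, name, t = best
    if not name:
        raise ValueError("no feature varies across the samples")
    above = [y for f, y in samples if f[name] > t]
    spam_above = 2 * sum(above) > len(above)  # majority label above the threshold
    return name, t, spam_above


def is_spam(model: Tuple[str, float, bool], features: Features) -> bool:
    name, t, spam_above = model
    return (features[name] > t) == spam_above


def filter_results(model: Tuple[str, float, bool], results: List[Result]) -> List[Result]:
    """Drop results classified as spam, preserving the original ranking order."""
    return [r for r in results if not is_spam(model, r.features)]


# Toy labelled data (invented for this sketch): True means spam.
SAMPLES = [
    ({"keyword_density": 0.30, "inbound_links": 5.0}, True),
    ({"keyword_density": 0.25, "inbound_links": 40.0}, True),
    ({"keyword_density": 0.05, "inbound_links": 30.0}, False),
    ({"keyword_density": 0.08, "inbound_links": 8.0}, False),
]
MODEL = train(SAMPLES)
RESULTS = [
    Result("https://example.org/a", {"keyword_density": 0.02, "inbound_links": 12.0}),
    Result("https://example.org/b", {"keyword_density": 0.40, "inbound_links": 3.0}),
]
CLEAN = filter_results(MODEL, RESULTS)
```

A stump keeps the sketch short and makes the selected feature easy to read off the model; a realistic classifier would combine several features and be trained on far more data.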


Search Engine, Query Term, Ranking Algorithm, Anchor Text, Test Page
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag France 2009

Authors and Affiliations

  • Manuel Egele (1)
  • Clemens Kolbitsch (1)
  • Christian Platzer (1)
  1. Vienna University of Technology, Vienna, Austria
