Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam

  • Isabel Drost
  • Tobias Scheffer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3720)

Abstract

The page rank of a commercial web site has an enormous economic impact because it directly influences the number of potential customers that find the site as a highly ranked search engine result. Link spamming – inflating the page rank of a target page by artificially creating many referring pages – has therefore become a common practice. In order to maintain the quality of their search results, search engine providers try to oppose efforts that decorrelate page rank and relevance and maintain blacklists of spamming pages while spammers, at the same time, try to camouflage their spam pages. We formulate the problem of identifying link spam and discuss a methodology for generating training data. Experiments reveal the effectiveness of classes of intrinsic and relational attributes and shed light on the robustness of classifiers against obfuscation of attributes by an adversarial spammer. We identify open research problems related to web spam.
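The link-spamming mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' method or data: it assumes the standard power-iteration form of PageRank with a damping factor of 0.85, and the toy graph, the page names (`news`, `blog`, `shop`, `target`, `farm0`…) and the `pagerank` helper are all hypothetical. It compares the rank of a target page before and after a farm of referring pages is attached to it.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the uniform "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's damped rank evenly over its out-links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy web: nobody links to "target".
honest_web = {
    "news": ["blog", "shop"],
    "blog": ["news"],
    "shop": ["news"],
    "target": ["shop"],
}

# Hypothetical link farm: 20 artificial pages that all refer to "target".
farm = {f"farm{i}": ["target"] for i in range(20)}
spammed_web = {**honest_web, **farm}

before = pagerank(honest_web)["target"]
after = pagerank(spammed_web)["target"]
print(f"target rank without farm: {before:.4f}")
print(f"target rank with farm:    {after:.4f}")
```

In this toy setting the target's score rises once the farm pages point at it, even though the farm pages themselves have no incoming links. The sketch only shows the inflation effect; it says nothing about how such pages would be detected.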

Keywords

Search Engine; Recursive Feature Elimination; Page Rank; Open Research Problem; Target Page
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Isabel Drost (1)
  • Tobias Scheffer (1)
  1. Department of Computer Science, Humboldt-Universität zu Berlin, Berlin, Germany
