Machine Learning, Volume 81, Issue 2, pp 207–225

Graph regularization methods for Web spam detection

  • Jacob Abernethy
  • Olivier Chapelle
  • Carlos Castillo
Open Access
Article

Abstract

We present an algorithm, WITCH, that learns to detect spam hosts or pages on the Web. Unlike most other approaches, it simultaneously exploits the structure of the Web graph as well as page contents and features. The method is efficient, scalable, and provides state-of-the-art accuracy on a standard Web spam benchmark.
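
As a rough illustration of what combining page features with the link graph can look like, the following is a minimal sketch and not the authors' algorithm. It assumes a linear scorer trained with squared loss on the labeled hosts, plus a penalty that pulls the scores of linked hosts toward each other. The loss, the penalty form, the label convention (+1 spam, -1 normal), the function names, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def score(w: List[float], x: List[float]) -> float:
    """Linear score of one host; positive means 'spam' in this sketch."""
    return sum(wk * xk for wk, xk in zip(w, x))


def train_graph_regularized(
    features: List[List[float]],
    labels: Dict[int, float],       # host index -> +1 (spam) or -1 (normal)
    edges: List[Tuple[int, int]],   # hyperlinks between host indices
    gamma: float = 1.0,             # weight of the graph penalty (assumed)
    lam: float = 0.01,              # weight of the L2 penalty (assumed)
    lr: float = 0.05,
    epochs: int = 500,
) -> List[float]:
    """Gradient descent on: squared loss on labeled hosts
    + lam * ||w||^2 + gamma * sum over edges of (f_i - f_j)^2."""
    n_feat = len(features[0])
    w = [0.0] * n_feat
    for _ in range(epochs):
        scores = [score(w, x) for x in features]
        grad = [2.0 * lam * wk for wk in w]
        # Supervised term: only labeled hosts contribute.
        for i, y in labels.items():
            err = scores[i] - y
            for k in range(n_feat):
                grad[k] += 2.0 * err * features[i][k]
        # Graph term: linked hosts are pushed toward similar scores,
        # whether or not they are labeled.
        for i, j in edges:
            diff = scores[i] - scores[j]
            for k in range(n_feat):
                grad[k] += 2.0 * gamma * diff * (features[i][k] - features[j][k])
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w


# Toy data (hypothetical): hosts 0 and 1 are labeled, hosts 2 and 3 are not.
features = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [0.1, 0.9]]
labels = {0: 1.0, 1: -1.0}
edges = [(0, 2), (1, 3)]

w = train_graph_regularized(features, labels, edges)
predictions = ["spam" if score(w, x) > 0 else "normal" for x in features]
print(predictions)
```

In this toy setup the unlabeled hosts influence training only through the edge penalty, which is the general idea behind using the graph alongside the features. Setting `gamma=0.0` reduces the sketch to a plain regularized least-squares classifier.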

Keywords

Adversarial information retrieval, Spam detection, Web spam, Graph regularization

Copyright information

© The Author(s) 2010

Authors and Affiliations

  • Jacob Abernethy (1)
  • Olivier Chapelle (2)
  • Carlos Castillo (3)
  1. University of California, Berkeley, USA
  2. Yahoo! Research, Santa Clara, USA
  3. Yahoo! Research, Barcelona, Spain
