Abstract
We present an algorithm, WITCH, that learns to detect spam hosts or pages on the Web. Unlike most other approaches, it simultaneously exploits both the structure of the Web graph and the contents and features of pages. The method is efficient and scalable, and it provides state-of-the-art accuracy on a standard Web spam benchmark.
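As a rough illustration of how link structure and content features might be combined in a single learning problem, the sketch below fits a linear classifier on per-host features while penalizing score differences between hyperlinked hosts. It is not the algorithm presented in the article: the squared loss, the closed-form solve, the function name `fit_graph_regularized`, the parameters `lam` and `gamma`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def fit_graph_regularized(X, y, labeled, edges, lam=0.1, gamma=1.0):
    """Hypothetical graph-regularized least squares (illustrative only).

    Minimizes  sum_{i in labeled} (w.x_i - y_i)^2
             + lam * ||w||^2
             + gamma * sum_{(i,j) in edges} (w.x_i - w.x_j)^2
    so that hosts joined by a hyperlink are pushed toward similar scores.
    """
    n, d = X.shape
    # One row per hyperlink: (D @ X @ w)[k] is the score gap across edge k.
    D = np.zeros((len(edges), n))
    for k, (i, j) in enumerate(edges):
        D[k, i] = 1.0
        D[k, j] = -1.0
    XL = X[labeled]   # features of the labeled hosts only
    DX = D @ X        # feature differences across each edge
    A = XL.T @ XL + lam * np.eye(d) + gamma * (DX.T @ DX)
    b = XL.T @ y[labeled]
    return np.linalg.solve(A, b)

# Toy data (assumed): hosts 0-2 form one linked group, hosts 3-5 another.
X = np.array([[1.0, 0.2], [0.9, 0.1], [0.8, 0.3],
              [-1.0, 0.1], [-0.9, 0.2], [-0.8, 0.0]])
y = np.array([1.0, 0.0, 0.0, -1.0, 0.0, 0.0])  # +1 spam, -1 non-spam, 0 unlabeled
labeled = [0, 3]                                # only two hosts carry labels
edges = [(0, 1), (1, 2), (3, 4), (4, 5)]        # hyperlinks between hosts

w = fit_graph_regularized(X, y, labeled, edges)
scores = X @ w                                  # positive score = predicted spam
predicted_spam = scores > 0
```

Setting `gamma=0` reduces the sketch to ordinary ridge regression on the labeled hosts, which makes it easy to compare predictions with and without the link-based term.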
Additional information
Editor: Pavel Laskov.
Rights and permissions
Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
About this article
Cite this article
Abernethy, J., Chapelle, O. & Castillo, C. Graph regularization methods for Web spam detection. Mach Learn 81, 207–225 (2010). https://doi.org/10.1007/s10994-010-5171-1
DOI: https://doi.org/10.1007/s10994-010-5171-1