Graph regularization methods for Web spam detection

Abernethy, Jacob; Chapelle, Olivier; Castillo, Carlos

doi:10.1007/s10994-010-5171-1

Open access
Published: 25 March 2010

Machine Learning, Volume 81, pages 207–225 (2010)

Abstract

We present an algorithm, WITCH, that learns to detect spam hosts or pages on the Web. Unlike most other approaches, it simultaneously exploits the structure of the Web graph as well as page contents and features. The method is efficient, scalable, and provides state-of-the-art accuracy on a standard Web spam benchmark.

Author information

Authors and Affiliations

University of California, Berkeley, USA
Jacob Abernethy
Yahoo! Research, Santa Clara, USA
Olivier Chapelle
Yahoo! Research, Barcelona, Spain
Carlos Castillo

Corresponding author

Correspondence to Jacob Abernethy.

Additional information

Editor: Pavel Laskov.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Abernethy, J., Chapelle, O. & Castillo, C. Graph regularization methods for Web spam detection. Mach Learn 81, 207–225 (2010). https://doi.org/10.1007/s10994-010-5171-1

Received: 09 July 2008
Revised: 21 October 2009
Accepted: 17 February 2010
Published: 25 March 2010
Issue Date: November 2010
DOI: https://doi.org/10.1007/s10994-010-5171-1
