Combining Textual Content and Hyperlinks in Web Spam Detection

  • F. Javier Ortega
  • Craig Macdonald
  • José A. Troyano
  • Fermín L. Cruz
  • Fernando Enríquez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6716)

Abstract

In this work, we tackle the problem of spam detection on the Web. Spam web pages have become a problem for Web search engines because they degrade the quality of retrieval results. Our approach is based on a random-walk algorithm that ranks pages according to their relevance and their spam likelihood. The novelty of our approach is that it uses the content of the web pages both to characterize the web graph and to obtain an a priori estimate of each page's spam likelihood. Our graph-based algorithm computes two scores for each node in the graph. Intuitively, these values represent how bad or good (spam-like or not) a web page is, according to its textual content and its relations in the graph. Our experiments show that the proposed technique outperforms other link-based techniques for spam detection.
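
The abstract does not give the update rule, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a PageRank-style power iteration whose teleport step is biased by a per-page prior, run once with a content-based spam prior and once with its complement, so that every page ends up with a "good" and a "bad" score. The function names, the damping factor of 0.85, the toy graph, and the prior values are all hypothetical; following links forward in both walks is also a simplification of this sketch rather than something the abstract specifies.

```python
from typing import Dict, List, Tuple


def biased_random_walk(
    links: Dict[str, List[str]],
    prior: Dict[str, float],
    damping: float = 0.85,      # assumed value, not taken from the paper
    iterations: int = 50,
) -> Dict[str, float]:
    """PageRank-style power iteration with a prior-biased teleport step."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    total = sum(prior[n] for n in nodes)
    if total <= 0:
        raise ValueError("prior must have positive total weight")
    # Normalise the prior into a teleport distribution that sums to 1.
    teleport = {n: prior[n] / total for n in nodes}
    score = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = links.get(n, [])
            if out:
                share = damping * score[n] / len(out)
                for target in out:
                    nxt[target] += share
            else:
                # Dangling page: spread its mass according to the teleport prior.
                for m in nodes:
                    nxt[m] += damping * score[n] * teleport[m]
        score = nxt
    return score


def two_scores(
    links: Dict[str, List[str]],
    spam_prior: Dict[str, float],
) -> Dict[str, Tuple[float, float]]:
    """Return (good_score, bad_score) per page from two biased walks."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    # Pages without a content-based estimate get a neutral 0.5 (an assumption).
    bad_prior = {n: spam_prior.get(n, 0.5) for n in nodes}
    good_prior = {n: 1.0 - p for n, p in bad_prior.items()}
    good = biased_random_walk(links, good_prior)
    bad = biased_random_walk(links, bad_prior)
    return {n: (good[n], bad[n]) for n in nodes}


# Hypothetical toy graph: two ordinary pages and two spam-like pages.
links = {
    "home": ["news"],
    "news": ["home"],
    "pills": ["casino", "home"],
    "casino": ["pills"],
}
# Hypothetical a priori spam likelihoods, standing in for a content classifier.
spam_prior = {"home": 0.1, "news": 0.1, "pills": 0.9, "casino": 0.9}

scores = two_scores(links, spam_prior)
for page, (good, bad) in sorted(scores.items()):
    print(f"{page:7s} good={good:.3f} bad={bad:.3f}")
```

In this toy setting the spam-like pages receive a higher bad score than good score, and the ordinary pages the reverse. How the two scores are combined into a final ranking is left open here, since the abstract does not say.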

Keywords

Information retrieval, Web spam detection, Graph algorithms, PageRank, web search

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • F. Javier Ortega (1)
  • Craig Macdonald (2)
  • José A. Troyano (1)
  • Fermín L. Cruz (1)
  • Fernando Enríquez (1)
  1. Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, Sevilla, Spain
  2. Department of Computing Science, University of Glasgow, Glasgow, UK
