EviRank: An Evidence Based Content Trust Model for Web Spam Detection

  • Wei Wang
  • Guosun Zeng
  • Mingjun Sun
  • Huanan Gu
  • Quan Zhang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4537)

Abstract

Creating an effective spam detection method is a challenging task. Traditional approaches usually treat it as a binary classification problem. In this paper, however, we argue that it is more appropriate to use the notion of content trust and to regard spam detection as a ranking or ordinal regression problem. Evidence is used to define the features of spam web pages, and machine learning techniques are employed to combine the evidence into a highly efficient and reasonably accurate detection algorithm. Experiments on real web data show that the proposed method performs very well in practice.
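
As an illustration of the ranking formulation described above, the following is a minimal sketch and not the authors' implementation. It assumes two hypothetical evidence features per page (a keyword repetition rate and a fraction of visible text) and three hypothetical ordinal trust levels (0 = spam, 1 = borderline, 2 = normal). In place of a full Ranking SVM solver, it trains a linear pairwise ranker with hinge loss by subgradient descent, which follows the same pairwise idea: a more trusted page should score higher than a less trusted one by a margin.

```python
import random


def pairwise_differences(features, trust_levels):
    """Return x_i - x_j for every pair where page i is more trusted than page j."""
    diffs = []
    for xi, yi in zip(features, trust_levels):
        for xj, yj in zip(features, trust_levels):
            if yi > yj:
                diffs.append([a - b for a, b in zip(xi, xj)])
    return diffs


def train_ranker(features, trust_levels, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Fit linear weights w so that w . (x_i - x_j) >= 1 for ordered pairs."""
    rng = random.Random(seed)
    diffs = pairwise_differences(features, trust_levels)
    w = [0.0] * len(features[0])
    for _ in range(epochs):
        rng.shuffle(diffs)
        for d in diffs:
            margin = sum(wk * dk for wk, dk in zip(w, d))
            # L2 shrinkage on every step (regularisation term).
            w = [wk * (1.0 - lr * reg) for wk in w]
            # Hinge-loss subgradient step only when the margin is violated.
            if margin < 1.0:
                w = [wk + lr * dk for wk, dk in zip(w, d)]
    return w


def trust_score(w, x):
    """Higher score means the page is ranked as more trustworthy."""
    return sum(wk * xk for wk, xk in zip(w, x))


# Toy data (hypothetical): [keyword_repetition_rate, fraction_of_visible_text]
pages = [
    [0.9, 0.1],  # labelled spam
    [0.8, 0.2],  # labelled spam
    [0.5, 0.5],  # labelled borderline
    [0.2, 0.8],  # labelled normal
    [0.1, 0.9],  # labelled normal
]
levels = [0, 0, 1, 2, 2]

weights = train_ranker(pages, levels)
ranked = sorted(
    range(len(pages)),
    key=lambda i: trust_score(weights, pages[i]),
    reverse=True,
)
print(ranked)  # page indices, most trusted first
```

Unlike a binary classifier, the ranker yields a continuous trust score, so a cut-off between spam and non-spam can be chosen after training.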

Keywords

web spam evidence content trust ranking SVM learning 

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Wei Wang (1)
  • Guosun Zeng (1)
  • Mingjun Sun (1)
  • Huanan Gu (1)
  • Quan Zhang (1)

  1. Department of Computer Science and Technology, Tongji University, Shanghai 201804, China; Tongji Branch, National Engineering & Technology Center of High Performance Computer, Shanghai 201804, China; The Key Laboratory of Embedded System and Service Computing, Ministry of Education. Email: willtongji@gmail.com
