Adaptive Learning Ant Colony Optimization for Web Spam Detection

  • Bundit Manaskasemsak
  • Jirayus Jiarpakdee
  • Arnon Rungsawang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8584)

Abstract

Web spamming is now a serious problem for search engines. It not only degrades the quality of search results by intentionally promoting undesirable web pages to users, but also causes the search engine to waste a significant amount of computational and storage resources on processing useless information. In this paper, we present a machine learning approach to spam detection that adopts the ant colony optimization algorithm. We first construct a directed graph corresponding to web hosts and their aggregated hyperlinks. Then, we train a classifier by employing ants to walk along paths in the graph. Each ant starts from an individual non-spam host and then decides to follow a link to the next host with a probability based on both a heuristic function and a pheromone trail. Relying on the approximate isolation principle of a good set, we reward an ant that discovers a good path, i.e., a sequence of non-spam hosts, by giving it additional energy so that it can walk farther. In contrast, if the ant discovers a spam host, it is penalized by reducing its remaining walking steps. Finally, the classification rules are constructed by choosing the common overlapping characteristic features of all non-spam hosts along the discovered paths. Experiments on the WEBSPAM-UK2007 dataset show that our approach classifies spam and non-spam hosts more accurately than several rule-based classification baselines.
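
The walk described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so this sketch assumes the standard Ant System transition rule, in which a link is chosen with probability proportional to pheromone^alpha times heuristic^beta. All names here (`choose_next`, `ant_walk`, `steps`, `reward`, `penalty`, `max_len`) are hypothetical, as is the choice that an ant does not move onto a spam host; none of this should be read as the authors' implementation.

```python
import random


def choose_next(host, graph, pheromone, heuristic, alpha=1.0, beta=2.0, rng=random):
    """Pick an out-neighbor of `host` with probability proportional to
    pheromone[(host, n)]**alpha * heuristic[n]**beta (assumed Ant System rule)."""
    neighbors = graph.get(host, [])
    if not neighbors:
        return None
    weights = [
        (pheromone[(host, n)] ** alpha) * (heuristic[n] ** beta) for n in neighbors
    ]
    if sum(weights) == 0:
        # No preference information at all: fall back to a uniform choice.
        return rng.choice(neighbors)
    return rng.choices(neighbors, weights=weights, k=1)[0]


def ant_walk(start, graph, labels, pheromone, heuristic,
             steps=3, reward=1, penalty=1, max_len=20, rng=random):
    """Walk from a non-spam `start` host and return the non-spam path found.

    Assumptions (not from the paper): every move costs one step; reaching a
    non-spam host refunds `reward` steps; hitting a spam host costs `penalty`
    extra steps and the ant stays where it was; `max_len` caps the path so
    that walks on cyclic graphs terminate.
    """
    path = [start]
    current = start
    budget = steps
    while budget > 0 and len(path) < max_len:
        nxt = choose_next(current, graph, pheromone, heuristic, rng=rng)
        if nxt is None:  # dead end: no outgoing links
            break
        budget -= 1
        if labels[nxt] == "spam":
            budget -= penalty  # penalty: shorten the remaining walk
        else:
            path.append(nxt)
            budget += reward  # reward: allow a longer walk
            current = nxt
    return path


if __name__ == "__main__":
    # Toy host graph: a -> b -> c, all labeled non-spam.
    graph = {"a": ["b"], "b": ["c"], "c": []}
    labels = {"a": "nonspam", "b": "nonspam", "c": "nonspam"}
    pheromone = {("a", "b"): 1.0, ("b", "c"): 1.0}
    heuristic = {"a": 1.0, "b": 1.0, "c": 1.0}
    print(ant_walk("a", graph, labels, pheromone, heuristic, steps=1))
```

In a full system the hosts on the returned paths would then supply the features from which classification rules are built; that step, and the pheromone update between iterations, are left out of this sketch.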

Keywords

web spam detection, adaptive learning paths, reward distance, penalty distance, ant colony optimization

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Bundit Manaskasemsak (1)
  • Jirayus Jiarpakdee (1)
  • Arnon Rungsawang (1)
  1. Massive Information & Knowledge Engineering Laboratory, Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand