Scalable Online Incremental Learning for Web Spam Detection

  • Liangxiu Han
  • Abby Levenberg
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 124)


In this paper, we propose an online incremental learning framework for identifying web spam. The framework updates its learning model incrementally from newly arrived samples, without recourse to the original data. We evaluated a prototype of the framework on a real, large-scale web spam dataset. The results show that the proposed online detector learns quickly and predicts web spam accurately.
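To make the idea of updating a model from each new sample without keeping the original data concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not name the learning algorithm, so a passive-aggressive (PA-I) style linear update is swapped in purely for illustration. The class name `OnlineSpamDetector`, the `aggressiveness` parameter, and the feature names `link_farm_ratio` and `content_ratio` are all hypothetical.

```python
class OnlineSpamDetector:
    """Illustrative online linear classifier using a PA-I style update.

    Assumption: this stands in for the paper's learner, which the
    abstract does not specify.
    """

    def __init__(self, aggressiveness=1.0):
        self.weights = {}                     # sparse weight vector
        self.aggressiveness = aggressiveness  # cap on the per-sample step size

    def score(self, features):
        # Dot product between the weights and a sparse feature dict.
        return sum(self.weights.get(name, 0.0) * value
                   for name, value in features.items())

    def predict(self, features):
        # +1 means spam, -1 means non-spam.
        return 1 if self.score(features) >= 0 else -1

    def update(self, features, label):
        # Hinge loss on the single incoming sample; no past data is needed.
        loss = max(0.0, 1.0 - label * self.score(features))
        norm_sq = sum(value * value for value in features.values())
        if loss == 0.0 or norm_sq == 0.0:
            return  # already classified with margin, or nothing to learn from
        step = min(self.aggressiveness, loss / norm_sq)
        for name, value in features.items():
            self.weights[name] = self.weights.get(name, 0.0) + step * label * value


# Hypothetical stream of (features, label) pairs arriving one at a time.
stream = [
    ({"link_farm_ratio": 0.9, "content_ratio": 0.1}, 1),
    ({"link_farm_ratio": 0.1, "content_ratio": 0.9}, -1),
]

detector = OnlineSpamDetector()
for features, label in stream:
    detector.update(features, label)  # each sample can be discarded afterwards
```

Because each call to `update` touches only the current sample and the weight vector, memory use does not grow with the number of samples seen, which is the property the abstract describes.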


Keywords (added by machine, not by the authors): Online Learner; Online Method; Edge Reciprocity; Spam Page; Link Spam





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Liangxiu Han (1)
  • Abby Levenberg (2)
  1. School of Computing, Mathematics and Digital Technology, Manchester Metropolitan University, Manchester, UK
  2. School of Informatics, University of Edinburgh, Edinburgh, UK
