Towards Intelligent Web Crawling – A Theme Weight and Bayesian Page Rank Based Approach

  • Yan Tang
  • Lei Wei
  • Wangsong Wang
  • Pengcheng Xuan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10570)


With the rapid development of the Internet, the web crawler has become a key technology for automatically obtaining information from designated sites. Traditional web crawlers suffer from two main problems: low content accuracy, caused by simplistic filtering with respect to the crawling theme, and low efficiency, caused by content duplication and long webpage update times. To address these problems, we propose TBPR (Theme weight and Bayesian Page Rank based crawler), which adopts a multi-queue model to achieve high efficiency and reduce content redundancy. TBPR further introduces a theme weight model that classifies web pages according to the user's crawl concept, and a Bayesian Page Rank model containing two novel factors to increase content accuracy. Our experiments apply TBPR to real-world web content and demonstrate its accuracy and efficiency.
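The abstract does not give the formulas behind the theme weight model or the multi-queue design, so the following is only a minimal illustrative sketch of the general idea: score a page by weighted theme terms and keep a crawl frontier that drops duplicate URLs. The names `THEME_WEIGHTS`, `theme_score`, and `Frontier`, the weights themselves, and the use of a single priority queue in place of the multi-queue model are all assumptions made for illustration, not the TBPR method.

```python
import heapq
import re
from collections import Counter

# Hypothetical theme vocabulary with hand-picked weights (not from the paper).
THEME_WEIGHTS = {"crawler": 3.0, "pagerank": 2.0, "index": 1.0}


def theme_score(text, weights=THEME_WEIGHTS):
    """Return the weighted count of theme terms divided by the token count."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)  # missing terms count as 0
    hits = sum(weight * counts[term] for term, weight in weights.items())
    return hits / len(tokens)


class Frontier:
    """Priority queue of URLs that ignores URLs it has already seen."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        """Queue `url` with `score`; return False if it was seen before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-score, url))
        return True

    def pop(self):
        """Remove and return the (url, score) pair with the highest score."""
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In a crawler loop, each fetched page would be scored with `theme_score` and its outgoing links pushed with that score, so that on-theme regions are visited first and repeated URLs are skipped. A link-based rank such as PageRank could be combined with this score, but how TBPR does so is not stated in the abstract.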


Keywords: Web crawler, Multithread, Theme weight, Bayesian Page Rank



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Yan Tang (1, corresponding author)
  • Lei Wei (1)
  • Wangsong Wang (1)
  • Pengcheng Xuan (1)
  1. College of Computer and Information, Hohai University, Nanjing, China
