Intelligent and Adaptive Crawling of Web Applications for Web Archiving

  • Muhammad Faheem
  • Pierre Senellart
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7977)

Abstract

Web sites are dynamic in nature, with content and structure changing over time. Many pages on the Web are produced by content management systems (CMSs) such as WordPress, vBulletin, or phpBB. Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on (leading to suboptimal crawling strategies) and whatever structured content is contained in Web pages (resulting in page-level archives whose content is hard to exploit). We present in this paper an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications (e.g., the pages served by a CMS). Because the AAH is aware of the Web application currently crawled, it is able to refine the list of URLs to process and to extend the archive with semantic information about extracted content. To deal with possible changes in the structure of Web applications, our AAH includes an adaptation module that makes crawling resilient to small changes in the structure of a Web site. We show the value of our approach by comparing the output and efficiency of the AAH with those of regular Web crawlers, including in the presence of structural change.
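The idea of refining the URL list once the Web application is known can be illustrated with a minimal sketch. This is not the authors' AAH implementation: the rule table, the detection markers, the URL patterns, and the function names `detect_application` and `refine_urls` are all hypothetical, chosen only to show the general shape of application-aware filtering.

```python
import re
from urllib.parse import urlparse

# Hypothetical rules (assumptions for illustration, not taken from the paper):
# a marker that hints at the application, and URL path patterns worth archiving.
APP_RULES = {
    "wordpress": {
        "marker": re.compile(r'<meta name="generator" content="WordPress', re.I),
        "keep": [re.compile(r"^/\d{4}/\d{2}/"), re.compile(r"^/category/")],
    },
    "phpbb": {
        "marker": re.compile(r"Powered by .*phpBB", re.I),
        "keep": [re.compile(r"^/viewtopic\.php"), re.compile(r"^/viewforum\.php")],
    },
}


def detect_application(html):
    """Return the name of the first application whose marker appears in the page, else None."""
    for name, rule in APP_RULES.items():
        if rule["marker"].search(html):
            return name
    return None


def refine_urls(app, urls):
    """Keep only URLs whose path matches a pattern for the detected application.

    When no application is detected, fall back to the unfiltered list,
    which corresponds to the behaviour of a regular crawler.
    """
    if app is None:
        return list(urls)
    keep = APP_RULES[app]["keep"]
    return [u for u in urls if any(p.search(urlparse(u).path) for p in keep)]
```

For example, calling `refine_urls("wordpress", urls)` on a list that mixes post URLs with a login page would retain only the URLs matching the post and category patterns assumed above.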

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Muhammad Faheem (1)
  • Pierre Senellart (1, 2)
  1. Institut Mines-Télécom, Télécom ParisTech, CNRS LTCI, Paris, France
  2. The University of Hong Kong, Hong Kong
