Turn the Page: Automated Traversal of Paginated Websites

  • Tim Furche
  • Giovanni Grasso
  • Andrey Kravchenko
  • Christian Schallhart
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7387)

Abstract

Content-intensive web sites, such as Google or Amazon, paginate their results to accommodate limited screen sizes. Thus, human users and automatic tools alike have to traverse the pagination links when they crawl the site, extract data, or automate common tasks, whenever these applications require access to the entire result set. Previous approaches, as well as existing crawlers and automation tools, rely on simple heuristics (e.g., considering only the link text), falling back to an exhaustive exploration of the site where those heuristics fail. In particular, focused crawlers and data extraction systems target only fractions of the individual pages of a given site, which makes highly accurate identification of pagination links essential to avoid the exhaustive exploration of irrelevant pages.

We identify pagination links in a wide range of domains and sites with near-perfect accuracy (99%). We obtain these results with a novel framework for web block classification, BERyL, that combines rule-based reasoning for feature extraction and machine learning for feature selection and classification. Through this combination, BERyL is applicable in a wide range of settings and can be adjusted to maximise either precision, recall, or speed. We illustrate how BERyL minimises the effort for feature extraction and evaluate the impact of a broad range of features (content, structural, and visual).
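
To make the baseline concrete, the following is a minimal sketch of the kind of link-text heuristic the abstract describes as the common approach in existing tools. It is not the classifier proposed in the paper: the token list, the regular expressions, and the names `LinkCollector` and `pagination_candidates` are hypothetical choices made for illustration only.

```python
import re
from html.parser import HTMLParser

# Assumed, illustrative token list; not taken from the paper.
NEXT_PATTERN = re.compile(
    r"^\s*(next|next page|more|older|>|>>|»|›)\s*$", re.IGNORECASE
)
# Links whose text is a bare page number, e.g. "2".
NUMERIC_PATTERN = re.compile(r"^\s*\d+\s*$")


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only accumulate text while inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def pagination_candidates(html):
    """Return links whose text alone looks like a pagination control."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [
        (href, text)
        for href, text in parser.links
        if NEXT_PATTERN.match(text) or NUMERIC_PATTERN.match(text)
    ]
```

Because this sketch looks only at the anchor text, it would also accept a link such as `<a href="/reviews/5">5</a>` that has nothing to do with pagination. That is the failure mode that motivates adding structural and visual features on top of content features.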

Keywords

Block Type; Extraction Rule; Annotation Type; Perfect Accuracy; Feature Template
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Tim Furche (1)
  • Giovanni Grasso (1)
  • Andrey Kravchenko (1)
  • Christian Schallhart (1)
  1. Department of Computer Science, Oxford University, Oxford, UK
