World Wide Web, Volume 22, Issue 4, pp 1577–1610

Deep Web crawling: a survey

  • Inma Hernández
  • Carlos R. Rivero
  • David Ruiz


Deep Web crawling refers to the problem of traversing the collection of pages in a deep Web site, which are dynamically generated in response to a particular query that is submitted using a search form. To achieve this, crawlers need features that go beyond merely following links, such as the ability to automatically discover search forms that are entry points to the deep Web, fill in such forms, and follow certain paths to reach the deep Web pages with relevant information. Current surveys of the state of the art in deep Web crawling do not provide a framework for comparing the most up-to-date proposals across all the aspects involved in the deep Web crawling process. In this article, we propose a framework that analyses the main features of existing deep Web crawling techniques, including the most recent proposals, and provides an overall picture of deep Web crawling, including novel features that previous surveys had not analysed. Our main conclusion is that crawler evaluation is an immature research area because it lacks a standard set of performance measures and a benchmark or publicly available dataset with which to evaluate crawlers. In addition, we conclude that future work in this area should focus on devising crawlers that deal with ever-evolving Web technologies and on improving crawling efficiency and scalability, in order to create effective crawlers that can operate in real-world contexts.
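As a rough illustration of the first feature mentioned above, discovering search forms that act as entry points, the following minimal sketch collects the forms in a page and keeps those that look like search forms. It is not taken from any of the surveyed proposals: the names `FormFinder`, `looks_like_search_form`, `find_search_forms`, and `PAGE`, and the heuristic itself (a free-text input and no password field), are assumptions made only for this example. It uses only the Python standard library.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collects every <form> in a page together with its input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []        # finished forms, in document order
        self._current = None   # form being parsed, or None outside a form

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
        elif self._current is not None and tag in ("input", "select", "textarea"):
            # <select> and <textarea> have no type attribute; use the tag name.
            kind = (attrs.get("type") or "text").lower() if tag == "input" else tag
            self._current["fields"].append({"name": attrs.get("name"), "type": kind})

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def looks_like_search_form(form):
    """Hypothetical heuristic: has a free-text input and no password field."""
    kinds = {field["type"] for field in form["fields"]}
    return bool(kinds & {"text", "search"}) and "password" not in kinds


def find_search_forms(html):
    """Return the forms in `html` that pass the heuristic above."""
    finder = FormFinder()
    finder.feed(html)
    return [form for form in finder.forms if looks_like_search_form(form)]


# Made-up page with one login form and one search form.
PAGE = """
<form action="/login" method="post">
  <input type="text" name="user"><input type="password" name="pw">
</form>
<form action="/search">
  <input type="search" name="q"><input type="submit" value="Go">
</form>
"""

for form in find_search_forms(PAGE):
    print(form["action"], form["method"])
```

A real crawler would need a far more robust classifier than this heuristic, and would still have to fill in the form and follow the result pages, which this sketch does not attempt.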


Keywords: Deep Web, Web crawling, Form filling, Query selection, Survey



The authors would like to thank Dr. Rafael Corchuelo for his support and assistance throughout the entire research process that led to this article, and for his helpful and constructive comments, which greatly contributed to improving the article. They would also like to thank the anonymous reviewers of this and past submissions, whose comments have helped shape the current version. Supported by the European Commission (FEDER) and the Spanish and Andalusian R&D&I programmes (grants TIN2016-75394-R and TIN2013-40848-R).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Languages and Computer Systems, University of Seville, Seville, Spain
  2. Department of Computer Science, Rochester Institute of Technology, Rochester, USA
