World Wide Web, Volume 20, Issue 1, pp 89–110

Crawling ranked deep Web data sources

  • Yan Wang
  • Jianguo Lu
  • Jessica Chen
  • Yaxin Li


In the era of big data, the vast majority of data does not come from the surface Web, the Web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, the trove of valuable data often resides in the deep Web, the Web that is hidden behind query interfaces. Since numerous applications, such as data integration and vertical portals, require deep Web data, various crawling methods have been developed for exhaustively harvesting a deep Web data source at minimal (or near-minimal) cost. Most existing crawling methods assume that all the documents matched by a query are returned. In practice, data sources often return only the top k matches. This makes exhaustive data harvesting difficult: highly ranked documents are returned multiple times, while low-ranked documents have only a small chance of being returned. In this paper, we decompose this problem into two orthogonal sub-problems, the query bias problem and the ranking bias problem, and propose a document-frequency-based crawling method to overcome the ranking bias problem. The rationale of our method is to use queries whose document frequencies fall within a specified range, which avoids the combined effect of search ranking and the return limit and significantly reduces the difficulty of crawling a ranked data source. The method is extensively tested on a variety of datasets and compared with two existing methods. The experimental results demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
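The effect of the return limit can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it assumes a toy corpus, exact document frequencies (a real crawler would have to estimate them, for example from a sample), and a static ranking in which a lower document id means a higher rank. The names `build_index`, `top_k`, `select_queries`, and `crawl` are hypothetical helpers introduced only for this illustration.

```python
from typing import Dict, List, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def top_k(matches: Set[int], k: int) -> List[int]:
    """Simulate a ranked source that returns only the k best matches.

    Assumption: a lower document id means a higher static rank.
    """
    return sorted(matches)[:k]


def select_queries(index: Dict[str, Set[int]], low: int, high: int) -> List[str]:
    """Keep only terms whose document frequency lies in [low, high]."""
    return sorted(t for t, ids in index.items() if low <= len(ids) <= high)


def crawl(index: Dict[str, Set[int]], queries: List[str], k: int) -> Set[int]:
    """Issue each query and collect the documents the source returns."""
    harvested: Set[int] = set()
    for query in queries:
        harvested.update(top_k(index[query], k))
    return harvested


# Toy corpus (hypothetical data) behind a source with return limit k = 2.
docs = {
    1: "deep web crawling",
    2: "deep web data",
    3: "deep query selection",
    4: "deep ranking bias",
    5: "web return limit",
    6: "deep document frequency",
}
k = 2
index = build_index(docs)

# Popular terms match more than k documents, so the same top-ranked
# documents come back for both queries and the rest stay hidden.
popular = crawl(index, ["deep", "web"], k)

# Terms with document frequency at most k are returned in full,
# so ranking no longer decides which documents are visible.
in_range = crawl(index, select_queries(index, 1, k), k)

print(sorted(popular))
print(sorted(in_range))
```

In this toy setting the two popular queries retrieve only documents 1 and 2, whereas the queries whose document frequency does not exceed k together cover all six documents. The price is a larger number of queries, which is why the choice of the frequency range matters for crawling cost.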


Deep Web crawling, Query selection, Estimation, Document frequency, Return limit



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. School of Information, Central University of Finance and Economics, Beijing, China
  2. School of Computer Science, University of Windsor, Windsor, Canada
