Crawling ranked deep Web data sources

Abstract

In the era of big data, the vast majority of data does not come from the surface Web, the part of the Web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, much of the valuable data resides in the deep Web, the part of the Web hidden behind query interfaces. Because numerous applications, such as data integration and vertical portals, require deep Web data, various crawling methods have been developed to exhaustively harvest a deep Web data source at minimal (or near-minimal) cost. Most existing crawling methods assume that all documents matched by a query are returned. In practice, data sources often return only the top k matches. This makes exhaustive harvesting difficult: highly ranked documents are returned many times, while low-ranked documents have little chance of being returned. In this paper, we decompose this problem into two orthogonal sub-problems, the query bias problem and the ranking bias problem, and propose a document-frequency-based crawling method to overcome the ranking bias problem. The rationale of our method is to use queries whose document frequencies fall within a specified range, which avoids the combined effect of search ranking and the return limit and significantly reduces the difficulty of crawling a ranked data source. The method is extensively tested on a variety of datasets and compared with two existing methods. The experimental results demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
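
The effect of the return limit described above can be illustrated with a toy simulation. The following is a minimal sketch, not the authors' implementation: the corpus, the ranking (`rank_of`), the helper names, and the choice of the return limit k as the upper bound of the frequency range are all assumptions made for illustration. It also assumes that document frequencies are known exactly, whereas a real crawler would have to estimate them.

```python
from collections import defaultdict

def build_postings(docs):
    """Map each term to the set of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.split()):
            postings[term].add(doc_id)
    return postings

def top_k_search(postings, rank_of, term, k):
    """Simulate a ranked source: return only the k best-ranked matches."""
    matches = sorted(postings.get(term, ()), key=lambda d: rank_of[d])
    return matches[:k]

def select_queries(postings, low, high):
    """Keep the terms whose document frequency lies in [low, high]."""
    return sorted(t for t, ids in postings.items() if low <= len(ids) <= high)

def crawl(postings, rank_of, queries, k):
    """Issue every query and collect the distinct documents returned."""
    harvested = set()
    for query in queries:
        harvested.update(top_k_search(postings, rank_of, query, k))
    return harvested

# Hypothetical six-document source; a lower rank value means a higher rank.
docs = {
    1: "web data crawl",
    2: "web data query",
    3: "web data rank",
    4: "web deep query",
    5: "web deep rank",
    6: "web deep crawl",
}
rank_of = {doc_id: doc_id for doc_id in docs}
postings = build_postings(docs)
k = 2  # assumed return limit of the source

# Queries whose document frequency does not exceed k are never truncated.
full = crawl(postings, rank_of, select_queries(postings, 1, k), k)

# Popular queries overflow the limit and keep returning the same top documents.
biased = crawl(postings, rank_of, ["web", "data", "deep"], k)

print(sorted(full))
print(sorted(biased))
```

In this toy source, the three queries with document frequency 2 together return every document, while the three more popular queries repeatedly return the same highly ranked documents and miss the low-ranked ones.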

Notes

  1. In this paper, we use the two words ‘term’ and ‘query’ interchangeably; the minor difference is that a query is an issued term.

Author information

Corresponding author

Correspondence to Yan Wang.

Additional information

This work has been partially supported by the National Key Research Program of China (2016YFB1001101), an NSERC Discovery grant (RGPIN-2014-04463), NSFC (No. 61440020, No. 61272398 and No. 61309030), and the Programs for Innovation Research and 121 Project in CUFE.

About this article

Cite this article

Wang, Y., Lu, J., Chen, J. et al. Crawling ranked deep Web data sources. World Wide Web 20, 89–110 (2017). https://doi.org/10.1007/s11280-016-0410-4

Keywords

  • Deep Web crawling
  • Query selection
  • Estimation
  • Document frequency
  • Return limit