Crawling Ranked Deep Web Data Sources

  • Conference paper
Web Information Systems Engineering – WISE 2015 (WISE 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9418)

Abstract

In the era of big data, the vast majority of the data are not from the surface web, that is, the web that is interconnected by hyperlinks and indexed by most general-purpose search engines. Instead, the trove of valuable data often resides in the deep web, the web that is hidden behind query interfaces. Because the data in the deep web are often of high value, a line of research over the past decade has addressed crawling deep web data sources. However, most existing crawling methods assume that all matched documents are returned. In practice, many data sources rank the matched documents and return only the top k matches. When conventional methods are applied to such ranked data sources, popular queries that match more than k documents cause large redundancy. This paper proposes a document frequency (df) based algorithm that exploits queries whose document frequencies fall within a specified range. The algorithm is extensively tested on a variety of datasets and compared with two existing algorithms. We demonstrate that our method outperforms the two algorithms by 58% and 90% on average, respectively.
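The core idea in the abstract, issuing only queries whose document frequency lies in a chosen range against a source that returns at most k ranked matches, can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the whitespace tokenization, the use of a local sample to estimate df, and the sort-by-id stand-in for the source's ranking are all assumptions made for illustration.

```python
from collections import defaultdict


def document_frequencies(docs):
    """Map each term to the set of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            postings[term].add(doc_id)
    return postings


def select_queries(postings, df_min, df_max):
    """Keep the terms whose document frequency lies in [df_min, df_max]."""
    return sorted(
        term for term, ids in postings.items() if df_min <= len(ids) <= df_max
    )


def crawl(postings, queries, k):
    """Simulate a ranked source that returns at most k matches per query.

    Returns the set of distinct documents harvested and the total number
    of documents returned, so the two can be compared to gauge redundancy.
    """
    harvested = set()
    returned = 0
    for query in queries:
        # Assumption: sorting by id stands in for the source's real ranking.
        top_k = sorted(postings[query])[:k]
        returned += len(top_k)
        harvested.update(top_k)
    return harvested, returned
```

Setting `df_max` no larger than k keeps every selected query below the return limit, so no matched document is cut off by the ranking; the ratio of documents returned to distinct documents harvested then gives a rough measure of the redundancy the abstract refers to.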

Notes

  1. In this paper, we use the words ‘term’ and ‘query’ interchangeably; the minor difference is that a query is an issued term.

Acknowledgements

This work is supported by NSFC (No. 61440020 and No. 61309029), NSERC, Programs for Innovation Research and 121 Project in Central University of Finance and Economics.

Author information

Corresponding author

Correspondence to Yan Wang.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, Y., Li, Y., Pi, N., Lu, J. (2015). Crawling Ranked Deep Web Data Sources. In: Wang, J., et al. Web Information Systems Engineering – WISE 2015. WISE 2015. Lecture Notes in Computer Science, vol 9418. Springer, Cham. https://doi.org/10.1007/978-3-319-26190-4_26

  • DOI: https://doi.org/10.1007/978-3-319-26190-4_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26189-8

  • Online ISBN: 978-3-319-26190-4

  • eBook Packages: Computer Science, Computer Science (R0)
