Efficient Deep Web Crawling Using Reinforcement Learning

  • Lu Jiang
  • Zhaohui Wu
  • Qian Feng
  • Jun Liu
  • Qinghua Zheng
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6118)

Abstract

The deep web is the hidden part of the Web that standard Web crawlers cannot reach. Obtaining deep web content is challenging, and the deep web has been acknowledged as a significant gap in the coverage of search engines. To address this, the paper proposes a novel deep web crawling framework based on reinforcement learning, in which the crawler is regarded as an agent and the deep web database as the environment. The agent perceives its current state and selects an action (a query) to submit to the environment according to its Q-value. The framework not only enables crawlers to learn a promising crawling strategy from their own experience, but also allows them to exploit diverse features of query keywords. Experimental results show that the method outperforms state-of-the-art methods in crawling capability and removes the assumption of full-text search implied by existing methods.
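
The agent/environment loop described in the abstract can be illustrated with a minimal tabular Q-learning sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy database `TOY_DB`, the state (the set of queries issued so far), the reward (the number of newly retrieved records), the helper names `step`, `train`, and `greedy_crawl`, and the values of `alpha` and `gamma`. The abstract does not specify how the paper defines states, rewards, or Q-value estimation, so this should be read as a generic example of Q-value-driven query selection rather than the authors' method.

```python
import random
from collections import defaultdict

# Hypothetical toy "deep web database": keyword -> ids of matching records.
TOY_DB = {
    "web": {1, 2, 3, 4},
    "crawl": {3, 4, 5},
    "agent": {6},
    "query": {1, 2, 6, 7, 8},
}


def step(db, issued, keyword):
    """Submit one query; the reward is the number of records not seen before."""
    seen = set().union(*(db[k] for k in issued))
    reward = len(db[keyword] - seen)
    return issued | {keyword}, reward


def train(db, budget, episodes=2000, alpha=0.5, gamma=0.9, seed=0):
    """Learn Q(state, query) by repeatedly crawling the toy database."""
    rng = random.Random(seed)
    q = defaultdict(float)  # (state, action) -> Q-value, 0.0 by default
    for _ in range(episodes):
        state = frozenset()  # no queries issued yet
        for t in range(budget):
            actions = sorted(k for k in db if k not in state)
            if not actions:
                break
            # Pure random exploration; Q-learning is off-policy, so the
            # learned values still describe the greedy strategy.
            action = rng.choice(actions)
            next_state, reward = step(db, state, action)
            remaining = [k for k in db if k not in next_state]
            is_last = t == budget - 1 or not remaining
            future = 0.0 if is_last else max(q[(next_state, a)] for a in remaining)
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = next_state
    return q


def greedy_crawl(db, q, budget):
    """Issue up to `budget` queries, always picking the highest Q-value."""
    state, order = frozenset(), []
    for _ in range(budget):
        actions = sorted(k for k in db if k not in state)
        if not actions:
            break
        action = max(actions, key=lambda a: q[(state, a)])
        order.append(action)
        state, _reward = step(db, state, action)
    records = set().union(*(db[k] for k in order))
    return order, records


if __name__ == "__main__":
    q_table = train(TOY_DB, budget=2)
    order, records = greedy_crawl(TOY_DB, q_table, budget=2)
    print("queries issued:", order)
    print("records retrieved:", len(records))
```

In this toy setting the Q-value of a query accounts for both its immediate harvest and what remains reachable afterwards, which is the sense in which the agent selects queries "according to Q-value". A table indexed by query sets does not scale to real keyword vocabularies, so a practical crawler would need some way to estimate Q-values for queries it has never issued.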

Keywords

Hidden Web, Deep Web Crawling, Reinforcement Learning

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Lu Jiang (1)
  • Zhaohui Wu (1)
  • Qian Feng (1)
  • Jun Liu (1)
  • Qinghua Zheng (1)
  1. MOE KLINNS Lab and SKLMS Lab, Xi’an Jiaotong University, Xi’an, P.R. China