Efficient Deep Web Crawling Using Reinforcement Learning
Abstract
The deep web is the hidden part of the Web that standard Web crawlers cannot reach. Obtaining deep web content is challenging, and its absence has been acknowledged as a significant gap in search engine coverage. To address this, the paper proposes a novel deep web crawling framework based on reinforcement learning, in which the crawler is regarded as an agent and the deep web database as the environment. The agent perceives its current state and selects an action (a query) to submit to the environment according to its Q-value. The framework not only enables crawlers to learn a promising crawling strategy from their own experience, but also allows them to exploit diverse features of query keywords. Experimental results show that the method outperforms state-of-the-art methods in crawling capability and removes the assumption of full-text search implicit in existing methods.
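The agent and environment loop described in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not specify the state representation, the reward, or how Q-values are estimated, so the code below assumes generic tabular Q-learning, a state equal to the set of keywords already issued, and a reward equal to the number of new records a query returns. The toy `DATABASE` and the names `retrieved`, `step`, `candidates`, `train`, and `greedy_crawl` are hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy "deep web database": keyword -> ids of matching records.
DATABASE = {
    "learning": {1, 2, 3, 4},
    "crawler": {3, 4, 5},
    "query": {5, 6},
    "agent": {1, 2},
    "web": {6, 7, 8},
}
ALL_RECORDS = set().union(*DATABASE.values())


def retrieved(state):
    """Records obtained so far; a state is the frozenset of issued keywords."""
    records = set()
    for keyword in state:
        records |= DATABASE[keyword]
    return records


def step(state, keyword):
    """Submit one query. Assumed reward: number of new records returned."""
    reward = len(DATABASE[keyword] - retrieved(state))
    return state | {keyword}, reward


def candidates(state):
    """Keywords not yet issued, in a fixed order for reproducibility."""
    return sorted(k for k in DATABASE if k not in state)


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (an assumption)."""
    rng = random.Random(seed)
    q = defaultdict(float)  # (state, keyword) -> estimated Q-value
    for _ in range(episodes):
        state = frozenset()
        while retrieved(state) != ALL_RECORDS:
            actions = candidates(state)
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                action = max(actions, key=lambda a: q[(state, a)])  # exploit
            next_state, reward = step(state, action)
            done = retrieved(next_state) == ALL_RECORDS
            future = 0.0 if done else max(
                q[(next_state, a)] for a in candidates(next_state)
            )
            q[(state, action)] += alpha * (
                reward + gamma * future - q[(state, action)]
            )
            state = next_state
    return q


def greedy_crawl(q):
    """Crawl by always issuing the unissued keyword with the highest Q-value."""
    state, issued = frozenset(), []
    while retrieved(state) != ALL_RECORDS:
        action = max(candidates(state), key=lambda a: q[(state, a)])
        state, _ = step(state, action)
        issued.append(action)
    return issued, retrieved(state)


if __name__ == "__main__":
    issued, records = greedy_crawl(train())
    print("queries issued:", issued)
    print("records retrieved:", len(records), "of", len(ALL_RECORDS))
```

Because the discount factor is below 1, records obtained earlier are worth more, so the learned policy tends to prefer keywords that return many new records first. A real crawler cannot enumerate the database in advance, which is presumably why the abstract emphasizes estimating Q-values from keyword features rather than from a lookup table.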
Keywords
Hidden Web, Deep Web Crawling, Reinforcement Learning