Deep Web adaptive crawling based on minimum executable pattern
The key to Deep Web crawling is to submit valid input values to a query form and retrieve Deep Web content efficiently. In the literature, related work focuses only on generic text boxes or on entire query forms, which causes the problem of “data islands” or low validity of query submission. This paper proposes the concept of a Minimum Executable Pattern (MEP), a minimal combination of elements in a query form that can conduct a successful query, and then presents a MEP generation method and a MEP-based Deep Web adaptive crawling method. The query form is parsed and partitioned into a MEP set, and locally optimal queries are then generated by choosing a MEP from the MEP set and a keyword vector for that MEP. Furthermore, the crawler can decide when to terminate, balancing high content coverage against resource consumption. The adoption of MEPs is expected to improve the validity of query submission, and adaptive selection among multiple MEPs is effective at overcoming the problem of “data islands”. We present a set of experiments to validate the effectiveness of the proposed method. Experimental results show that our method outperforms state-of-the-art methods in terms of query capability and applicability and, on average, achieves good coverage by issuing only a few hundred queries.
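As a rough illustration of the loop described in the abstract (choose a MEP and a keyword vector, submit, decide whether to stop), the following minimal sketch runs against a toy in-memory database. It is not the paper's algorithm: the function `crawl`, the parameters `window` and `min_new`, the frequency-based scoring, and the sample `records` are all assumptions made for this example. A MEP is modelled simply as a tuple of field names, and a keyword vector as one value per field.

```python
from collections import Counter

# Toy stand-in for a hidden database (hypothetical data).
records = [
    {"author": "ann", "year": "2001"},
    {"author": "ann", "year": "2002"},
    {"author": "bob", "year": "2002"},
    {"author": "cat", "year": "2003"},
    {"author": "bob", "year": "2003"},
]

def crawl(records, meps, seed, window=3, min_new=1, max_queries=50):
    """Greedy crawl sketch. `meps` is a list of field-name tuples,
    `seed` is the first query as (mep, values)."""
    def submit(mep, values):
        # Simulates a form submission: exact match on every field of the MEP.
        return {i for i, r in enumerate(records)
                if all(r[f] == v for f, v in zip(mep, values))}

    seen, issued, recent, log = set(), set(), [], []
    query = seed
    while query is not None and len(issued) < max_queries:
        issued.add(query)
        new = submit(*query) - seen
        seen |= new
        log.append((query, len(new)))

        # Assumed termination rule: stop once the last `window` queries
        # together returned fewer than `min_new` new records.
        recent = (recent + [len(new)])[-window:]
        if len(recent) == window and sum(recent) < min_new:
            break

        # Assumed scoring: build candidate (MEP, keyword vector) pairs from
        # records already retrieved and pick the most frequent unissued one.
        counts = Counter()
        for i in sorted(seen):
            for mep in meps:
                cand = (mep, tuple(records[i][f] for f in mep))
                if cand not in issued:
                    counts[cand] += 1
        query = max(counts, key=counts.get) if counts else None
    return seen, log

seed = (("author",), ("ann",))
covered, history = crawl(records, [("author",), ("year",)], seed)
island, _ = crawl(records, [("author",)], seed)
```

With both MEPs available, the toy crawl reaches every record, because values found under one MEP lead to queries under the other. Restricted to the single `("author",)` MEP, it stops at the two records for `ann`, which is a small-scale picture of the “data islands” problem the abstract mentions.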
Keywords: Deep Web, Minimum executable pattern, Adaptive crawling, Deep Web surfacing
The research was supported in part by the National High-Tech R&D Program of China under Grant No. 2008AA01Z131, the National Science Foundation of China under Grant Nos. 60825202, 60803079, and 60921003, the National Key Technologies R&D Program of China under Grant Nos. 2006BAK11B02 and 2006BAJ07B06, the Program for New Century Excellent Talents in University of China under Grant No. NECT-08-0433, the Doctoral Fund of Ministry of Education of China under Grant No. 20090201110060, and the Cheung Kong Scholar’s Program. The authors are grateful to the anonymous reviewers for their comments, which greatly improved the quality of the paper.