Abstract
Web database crawling is one of the major solutions for Deep Web data integration. To the best of our knowledge, existing work has focused only on crawling all records in a web database at one time. Because web databases are highly dynamic, it is impractical to re-crawl the whole database just to harvest a small proportion of new records. To this end, this paper studies the problem of incremental web database crawling, which aims to crawl as many new records from a web database as possible while minimizing the communication cost. In our approach, a new graph model is proposed, under which an incremental crawling task is transformed into a graph traversal process. Based on this graph, appropriate queries are generated for crawling by analyzing the history versions of the web database. Extensive experimental evaluations over real Web databases validate the effectiveness of our techniques and provide insights for future efforts in this direction.
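The abstract does not specify the graph model or the query-generation rule, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes a graph whose nodes are attribute values and whose edges link values that co-occur in a record, and it assumes that two stored history versions can be compared to estimate how often a query on a given value returns new records. All names (`new_record_rate`, `incremental_crawl`, `query_fn`, `default_rate`) and the cost measure (number of records returned) are hypothetical.

```python
import heapq
from collections import defaultdict


def new_record_rate(old_version, new_version):
    """Estimate, per attribute value, the fraction of records matching that
    value in new_version that were absent from old_version.

    Each version is an iterable of (record_id, set_of_attribute_values).
    """
    old_ids = {rid for rid, _ in old_version}
    total = defaultdict(int)
    fresh = defaultdict(int)
    for rid, values in new_version:
        for v in values:
            total[v] += 1
            if rid not in old_ids:
                fresh[v] += 1
    return {v: fresh[v] / total[v] for v in total}


def incremental_crawl(query_fn, seeds, known_ids, rate, budget, default_rate=1.0):
    """Best-first traversal over attribute values (assumed graph model).

    query_fn(value) returns a list of (record_id, set_of_attribute_values).
    Values never seen in the history get default_rate, on the assumption
    that an unseen value is likely to be attached to new records.
    The budget is soft: it is checked before each query is issued.
    """
    frontier = [(-rate.get(s, default_rate), s) for s in seeds]
    heapq.heapify(frontier)
    visited = set(seeds)
    harvested = {}
    cost = 0
    while frontier and cost < budget:
        _, value = heapq.heappop(frontier)
        results = query_fn(value)
        cost += len(results)  # assumed cost: one unit per returned record
        for rid, values in results:
            if rid not in known_ids and rid not in harvested:
                harvested[rid] = values
            # Follow co-occurrence edges to values not yet queried.
            for v in values:
                if v not in visited:
                    visited.add(v)
                    heapq.heappush(frontier, (-rate.get(v, default_rate), v))
    return harvested, cost
```

In this sketch, values that mostly returned already-known records between the two history versions are pushed to the back of the queue, so the budget is spent first on queries that are expected to surface new records.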
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, W., Xiao, J. (2010). Incremental Structured Web Database Crawling via History Versions. In: Chen, L., Triantafillou, P., Suel, T. (eds) Web Information Systems Engineering – WISE 2010. WISE 2010. Lecture Notes in Computer Science, vol 6488. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17616-6_46
DOI: https://doi.org/10.1007/978-3-642-17616-6_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17615-9
Online ISBN: 978-3-642-17616-6
eBook Packages: Computer Science, Computer Science (R0)