Abstract
This paper presents Nautilus, a generic framework for crawling the deep Web. We provide an abstraction of the deep-Web crawling process and a mechanism for integrating heterogeneous business modules. A Federal Decentralized Architecture is proposed that combines the advantages of existing P2P networking architectures. We also present effective policies for scheduling crawling tasks. Experimental results show that our scheduling policies achieve good load balance and overall throughput.
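The paper's concrete scheduling policies are not reproduced on this page. As a purely hypothetical illustration of how a decentralized crawler can assign crawl tasks to peers with good load balance, the sketch below uses Chord-style consistent hashing (one of the P2P techniques the paper builds on); the class name and API are assumptions of this sketch, not the paper's actual design.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashScheduler:
    """Illustrative sketch: assign crawl tasks (URLs) to peer nodes via
    consistent hashing, so that adding or removing a node remaps only a
    small fraction of tasks. Not the paper's actual scheduler."""

    def __init__(self, nodes, replicas=100):
        # Each physical node gets `replicas` virtual points on the ring,
        # which smooths out load imbalance between nodes.
        self.replicas = replicas
        self.ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        # Map an arbitrary string to a point on the hash ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def assign(self, task_url):
        # A task is owned by the first virtual node clockwise from its hash.
        h = self._hash(task_url)
        keys = [k for k, _ in self.ring]
        idx = bisect_right(keys, h) % len(self.ring)
        return self.ring[idx][1]
```

Because assignment depends only on the hash ring, every peer can compute the owner of a task locally, without a central coordinator.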
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhao, J., Wang, P. (2012). Nautilus: A Generic Framework for Crawling Deep Web. In: Xiang, Y., Pathan, M., Tao, X., Wang, H. (eds) Data and Knowledge Engineering. ICDKE 2012. Lecture Notes in Computer Science, vol 7696. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34679-8_14
DOI: https://doi.org/10.1007/978-3-642-34679-8_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34678-1
Online ISBN: 978-3-642-34679-8