Abstract
This paper presents Nautilus, a generic framework for crawling the deep Web. We provide an abstraction of the deep-Web crawling process and a mechanism for integrating heterogeneous business modules. A Federal Decentralized Architecture is proposed that combines the advantages of existing P2P networking architectures. We also present effective policies for scheduling crawling tasks. Experimental results show that our scheduling policies achieve good load balance and overall throughput.
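The paper's concrete scheduling policies are not reproduced on this page. As a purely hypothetical illustration of how a decentralized crawler can assign crawl tasks to peers with good load balance, the sketch below uses Chord-style consistent hashing (one of the P2P techniques the paper builds on); the class name and API are assumptions of this sketch, not the paper's actual design.

```python
import hashlib
from bisect import bisect_right

class ConsistentHashScheduler:
    """Illustrative sketch: assign crawl tasks (URLs) to peer nodes via
    consistent hashing, so that adding or removing a node remaps only a
    small fraction of tasks. Not the paper's actual scheduler."""

    def __init__(self, nodes, replicas=100):
        # Each physical node gets `replicas` virtual points on the ring,
        # which smooths out load imbalance between nodes.
        self.replicas = replicas
        self.ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        # Map an arbitrary string to a point on the hash ring.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def assign(self, task_url):
        # A task is owned by the first virtual node clockwise from its hash.
        h = self._hash(task_url)
        keys = [k for k, _ in self.ring]
        idx = bisect_right(keys, h) % len(self.ring)
        return self.ring[idx][1]
```

Because assignment depends only on the hash ring, every peer can compute the owner of a task locally, without a central coordinator.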
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhao, J., Wang, P. (2012). Nautilus: A Generic Framework for Crawling Deep Web. In: Xiang, Y., Pathan, M., Tao, X., Wang, H. (eds) Data and Knowledge Engineering. ICDKE 2012. Lecture Notes in Computer Science, vol 7696. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34679-8_14
DOI: https://doi.org/10.1007/978-3-642-34679-8_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34678-1
Online ISBN: 978-3-642-34679-8