E-FFC: an enhanced form-focused crawler for domain-specific deep web databases

Li, Yanni; Wang, Yuping; Du, Jintao

doi:10.1007/s10844-012-0221-8

Published: 07 September 2012

Volume 40, pages 159–184, (2013)

Journal of Intelligent Information Systems

Yanni Li¹,
Yuping Wang¹ &
Jintao Du²

Abstract

A key problem in retrieving, integrating and mining rich, high-quality information from the massive number of Deep Web Databases (WDBs) online is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., their forms, on the Web. The task is challenging because these forms are dynamic and heterogeneous, and they are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more effective solutions are still needed to achieve a satisfactory harvest rate and coverage rate of domain-specific WDBs’ forms simultaneously. This paper proposes an Enhanced Form-Focused Crawler for domain-specific WDBs (E-FFC), a novel framework that addresses the limitations of existing solutions. Based on a divide-and-conquer strategy, the E-FFC employs a series of novel and effective strategies and algorithms, including a two-step page classifier, a link scoring strategy, classifiers for advanced searchable and domain-specific forms, and crawling stopping criteria, with the aim of optimizing the harvest rate and coverage rate of domain-specific WDBs’ forms simultaneously. Experiments with the E-FFC on a number of real Web pages in a set of representative domains show that it outperforms existing domain-specific Deep Web form-focused crawlers in terms of harvest rate, coverage rate and crawling robustness.
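The abstract does not define its two evaluation measures, so the following is only a minimal sketch under commonly used definitions from the focused-crawling literature, which are assumed here and may differ from the paper's own: harvest rate as relevant forms retrieved per page crawled, and coverage rate as the fraction of all known relevant forms that were retrieved. The `CrawlStats` type, its field names, and the numbers are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class CrawlStats:
    """Hypothetical counters collected during one crawl (illustrative only)."""
    pages_crawled: int          # total pages fetched by the crawler
    relevant_forms_found: int   # domain-specific searchable forms retrieved
    relevant_forms_total: int   # relevant forms known to exist in the test set


def harvest_rate(stats: CrawlStats) -> float:
    """Relevant forms retrieved per page crawled (assumed definition)."""
    if stats.pages_crawled == 0:
        return 0.0
    return stats.relevant_forms_found / stats.pages_crawled


def coverage_rate(stats: CrawlStats) -> float:
    """Fraction of known relevant forms that were retrieved (assumed definition)."""
    if stats.relevant_forms_total == 0:
        return 0.0
    return stats.relevant_forms_found / stats.relevant_forms_total


# Made-up numbers, purely to show the arithmetic.
stats = CrawlStats(pages_crawled=10_000, relevant_forms_found=250,
                   relevant_forms_total=400)
print(harvest_rate(stats))   # 0.025
print(coverage_rate(stats))  # 0.625
```

Under these assumed definitions the two measures can pull in opposite directions: a crawler that stops early on a few productive sites may score a high harvest rate with low coverage, which is one way to read the abstract's emphasis on achieving both simultaneously.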

Acknowledgements

We are grateful to the anonymous reviewers and the Editor for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China (No. 61272119).

Author information

Authors and Affiliations

School of Computer Science and Technology, Xidian University, Xi’an, 710071, People’s Republic of China
Yanni Li & Yuping Wang
School of Software, Xidian University, Xi’an, 710071, People’s Republic of China
Jintao Du

Corresponding author

Correspondence to Yanni Li.

About this article

Cite this article

Li, Y., Wang, Y. & Du, J. E-FFC: an enhanced form-focused crawler for domain-specific deep web databases. J Intell Inf Syst 40, 159–184 (2013). https://doi.org/10.1007/s10844-012-0221-8

Received: 17 March 2012
Revised: 08 August 2012
Accepted: 14 August 2012
Published: 07 September 2012
Issue Date: February 2013
DOI: https://doi.org/10.1007/s10844-012-0221-8
