
E-FFC: an enhanced form-focused crawler for domain-specific deep web databases

Journal of Intelligent Information Systems

Abstract

A key problem in retrieving, integrating and mining rich, high-quality information from the massive number of online Deep Web Databases (WDBs) is how to automatically and effectively discover and recognize the entry points of domain-specific WDBs, i.e., their forms, on the Web. This is a challenging task because the forms of domain-specific WDBs are dynamic and heterogeneous, and they are very sparsely distributed over several trillion Web pages. Although significant efforts have been made to address the problem and its special cases, more effective solutions are still needed that achieve a satisfactory harvest rate and coverage rate of domain-specific WDB forms simultaneously. In this paper, an Enhanced Form-Focused Crawler for domain-specific WDBs (E-FFC) is proposed as a novel framework that addresses the limitations of existing solutions. Based on a divide-and-conquer strategy, the E-FFC employs a series of novel and effective strategies and algorithms, including a two-step page classifier, a link scoring strategy, classifiers for advanced searchable and domain-specific forms, and crawling stopping criteria, with the aim of optimizing the harvest rate and coverage rate of domain-specific WDB forms simultaneously. Experiments with the E-FFC were conducted over a number of real Web pages in a set of representative domains, and the results show that the E-FFC outperforms existing domain-specific Deep Web form-focused crawlers in terms of harvest rate, coverage rate and crawling robustness.
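To make the two evaluation measures concrete, the following is a minimal sketch of a score-ordered (best-first) crawl over a toy in-memory link graph. It is not the E-FFC algorithm: the abstract does not define the metrics or the link scoring formula, so this sketch assumes the definitions commonly used for focused crawlers (harvest rate as relevant forms found per page crawled, coverage rate as relevant forms found out of all relevant forms). The graph, the scores, the page budget used as a stopping criterion, and every function name are hypothetical.

```python
import heapq

def harvest_rate(forms_found: int, pages_crawled: int) -> float:
    # Assumed definition: relevant forms found per page crawled.
    return forms_found / pages_crawled if pages_crawled else 0.0

def coverage_rate(forms_found: int, total_forms: int) -> float:
    # Assumed definition: share of all relevant forms that were found.
    return forms_found / total_forms if total_forms else 0.0

def best_first_crawl(graph, form_pages, scores, seed, max_pages):
    """Visit pages in descending link-score order until the page budget is used."""
    frontier = [(-scores[seed], seed)]  # max-heap via negated scores
    seen = {seed}
    visited, found = [], []
    while frontier and len(visited) < max_pages:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        if page in form_pages:  # stand-in for a form classifier
            found.append(page)
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-scores[nxt], nxt))
    return visited, found

# Hypothetical toy data: page -> outgoing links, pages holding relevant forms,
# and a made-up score per link target.
GRAPH = {"home": ["about", "search", "news"],
         "search": ["advanced"],
         "news": ["archive"]}
FORM_PAGES = {"search", "advanced", "archive"}
SCORES = {"home": 1.0, "search": 0.9, "advanced": 0.8,
          "news": 0.3, "archive": 0.2, "about": 0.1}

visited, found = best_first_crawl(GRAPH, FORM_PAGES, SCORES, "home", max_pages=3)
print(visited, found)
print(harvest_rate(len(found), len(visited)))
print(coverage_rate(len(found), len(FORM_PAGES)))
```

With a budget of three pages, the toy crawl trades coverage for harvest rate: it follows the highest-scored links first and never reaches the form behind the low-scored `news` page. Raising `max_pages` increases coverage but lowers the harvest rate, which is the tension the abstract refers to when it aims to optimize both simultaneously.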



Acknowledgements

We are grateful to the anonymous reviewers and the Editor for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China (No. 61272119).

Corresponding author

Correspondence to Yanni Li.


About this article

Cite this article

Li, Y., Wang, Y. & Du, J. E-FFC: an enhanced form-focused crawler for domain-specific deep web databases. J Intell Inf Syst 40, 159–184 (2013). https://doi.org/10.1007/s10844-012-0221-8


  • DOI: https://doi.org/10.1007/s10844-012-0221-8
