Focused Deep Web Entrance Crawling by Form Feature Classification

  • Conference paper
Big Data Computing and Communications (BigCom 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9196)

Abstract

Currently, most back-end web databases cannot be indexed by traditional hyperlink-based search engines, because their content is reachable only through interactive user queries submitted via page forms. To make hidden-Web information more easily accessible, this paper proposes a hierarchical classifier that locates domain-specific hidden-Web entries at large scale. The classifier is trained on appropriately selected page form features to filter out non-relevant domains and non-searchable forms. Experiments conducted on eight different topics demonstrate that the technique discovers deep web interfaces accurately and efficiently.
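
The two filtering steps named in the abstract can be illustrated with a minimal sketch. It is not the paper's classifier: the feature set (counts of text, password, and select fields plus the words inside the form), the hand-written rules standing in for trained models, the order of the two stages, and all names (`FormFeatureParser`, `is_on_topic`, `is_searchable`, `find_entries`) are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collect simple per-form features (an illustrative, assumed feature set)."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one feature dict per <form> on the page
        self._current = None  # features of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"text_inputs": 0, "password_inputs": 0,
                             "selects": 0, "words": []}
        elif self._current is not None:
            if tag == "input":
                kind = (attrs.get("type") or "text").lower()
                if kind in ("text", "search"):
                    self._current["text_inputs"] += 1
                elif kind == "password":
                    self._current["password_inputs"] += 1
            elif tag == "select":
                self._current["selects"] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

    def handle_data(self, data):
        # Visible text inside a form (labels, captions) becomes word features.
        if self._current is not None:
            self._current["words"].extend(data.lower().split())


def is_on_topic(form, topic_words):
    # Stage 1 stand-in: keep forms whose text overlaps the topic vocabulary.
    return bool(set(form["words"]) & topic_words)


def is_searchable(form):
    # Stage 2 stand-in: drop login-style forms, require at least one query field.
    return form["password_inputs"] == 0 and form["text_inputs"] + form["selects"] > 0


def find_entries(html, topic_words):
    """Return the features of forms that pass both stages, in page order."""
    parser = FormFeatureParser()
    parser.feed(html)
    return [f for f in parser.forms if is_on_topic(f, topic_words) and is_searchable(f)]


# Hypothetical page: an on-topic login form, an on-topic search form, an off-topic form.
PAGE = """
<form action="/login">Book club login <input type="text" name="u">
  Password <input type="password" name="p"></form>
<form action="/search">Book title <input type="text" name="title">
  Author <input type="text" name="a"></form>
<form action="/jobs">Job keyword <input type="text" name="q"></form>
"""
entries = find_entries(PAGE, {"book", "author", "title"})
```

In this sketch the login form passes the topic stage but is rejected as non-searchable, and the job form is rejected as off-topic, so only the book search form remains. In a trained system each rule would be replaced by a learned model over the same kind of features.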

L. Wang: Supported in part by the National Science Foundation under grants 61472382, 61272472, and 61232018.

Author information

Corresponding author

Correspondence to Lin Wang.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, L., Hawbani, A., Wang, X. (2015). Focused Deep Web Entrance Crawling by Form Feature Classification. In: Wang, Y., Xiong, H., Argamon, S., Li, X., Li, J. (eds) Big Data Computing and Communications. BigCom 2015. Lecture Notes in Computer Science, vol 9196. Springer, Cham. https://doi.org/10.1007/978-3-319-22047-5_7

  • DOI: https://doi.org/10.1007/978-3-319-22047-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22046-8

  • Online ISBN: 978-3-319-22047-5

  • eBook Packages: Computer Science, Computer Science (R0)
