Focused Deep Web Entrance Crawling by Form Feature Classification

Wang, Lin; Hawbani, Ammar; Wang, Xingfu

doi:10.1007/978-3-319-22047-5_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9196)

Included in the following conference series:

International Conference on Big Data Computing and Communications

Abstract

Most back-end web databases cannot be indexed by traditional hyperlink-based search engines, because their content is reachable only through interactive queries submitted via page forms. To make hidden-Web information more easily accessible, this paper proposes a hierarchical classifier that locates domain-specific hidden-Web entries at large scale. The classifier is trained on appropriately selected page form features to filter out non-relevant domains and non-searchable forms. Experiments conducted on eight different topics demonstrate that the technique discovers deep web interfaces accurately and efficiently.
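
The two-stage filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' classifier: the paper trains its classifier on form features, whereas the hand-written rules below, the feature names, and the helper functions `is_relevant_domain`, `is_searchable`, and `find_entries` are all hypothetical stand-ins chosen only to show the hierarchy (domain filter first, searchable-form filter second).

```python
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collect simple structural features for each <form> on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one feature dict per completed form
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "method": (attrs.get("method") or "get").lower(),
                "text_inputs": 0,
                "password_inputs": 0,
                "selects": 0,
                "words": [],
            }
        elif self._current is not None:
            if tag == "input":
                kind = (attrs.get("type") or "text").lower()
                if kind in ("text", "search"):
                    self._current["text_inputs"] += 1
                elif kind == "password":
                    self._current["password_inputs"] += 1
            elif tag == "select":
                self._current["selects"] += 1

    def handle_data(self, data):
        # Keep the visible text inside the form as lowercase tokens.
        if self._current is not None:
            self._current["words"].extend(data.lower().split())

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def is_relevant_domain(form, topic_terms):
    # Stage 1 (assumed rule): the form's visible text mentions the topic.
    return any(w.strip(":,.") in topic_terms for w in form["words"])


def is_searchable(form):
    # Stage 2 (assumed rule): a password field suggests a login form;
    # otherwise a text input or select suggests a query interface.
    if form["password_inputs"] > 0:
        return False
    return form["text_inputs"] + form["selects"] > 0


def find_entries(html, topic_terms):
    """Return feature dicts of forms that pass both stages."""
    parser = FormFeatureParser()
    parser.feed(html)
    return [
        f for f in parser.forms
        if is_relevant_domain(f, topic_terms) and is_searchable(f)
    ]
```

In a trained setting, each stage's rule would be replaced by a learned model over the same kind of feature dictionary; the stage order means the second model only sees forms that already passed the domain filter.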

L. Wang: Supported in part by the National Science Foundation under grants 61472382, 61272472 and 61232018.

Author information

Authors and Affiliations

Computer Science and Technology, University of Science and Technology of China, Hefei, 230022, Anhui, China
Lin Wang, Ammar Hawbani & Xingfu Wang

Corresponding author

Correspondence to Lin Wang.

Editor information

Editors and Affiliations

University of North Carolina at Charlotte, Charlotte, North Carolina, USA
Yu Wang
Rutgers Business School, Newark, New Jersey, USA
Hui Xiong
Illinois Institute of Technology, Chicago, Illinois, USA
Shlomo Argamon
Illinois Institute of Technology, Chicago, Illinois, USA
XiangYang Li
Harbin Institute of Technology, Harbin, China
JianZhong Li

About this paper

Cite this paper

Wang, L., Hawbani, A., Wang, X. (2015). Focused Deep Web Entrance Crawling by Form Feature Classification. In: Wang, Y., Xiong, H., Argamon, S., Li, X., Li, J. (eds) Big Data Computing and Communications. BigCom 2015. Lecture Notes in Computer Science, vol 9196. Springer, Cham. https://doi.org/10.1007/978-3-319-22047-5_7

DOI: https://doi.org/10.1007/978-3-319-22047-5_7
Published: 24 July 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22046-8
Online ISBN: 978-3-319-22047-5
eBook Packages: Computer Science, Computer Science (R0)