Automatic discovery of Web Query Interfaces using machine learning techniques

Marin-Castro, Heidy M.; Sosa-Sosa, Victor J.; Martinez-Trinidad, Jose F.; Lopez-Arevalo, Ivan

doi:10.1007/s10844-012-0217-4

Published: 23 August 2012

Volume 40, pages 85–108, (2013)

Journal of Intelligent Information Systems

Heidy M. Marin-Castro¹,
Victor J. Sosa-Sosa¹,
Jose F. Martinez-Trinidad² &
…
Ivan Lopez-Arevalo¹

Abstract

The amount of information contained in databases available on the Web has grown explosively in recent years. This information, known as the Deep Web, is heterogeneous and dynamically generated by querying back-end (relational) databases through Web Query Interfaces (WQIs), a special type of HTML form. Accessing Deep Web information is a great challenge because it is usually not indexed by general-purpose search engines. It is therefore necessary to create efficient mechanisms to access, extract and integrate the information contained in the Deep Web. Since WQIs are the only means of accessing the Deep Web, their automatic identification plays an important role: it helps traditional search engines increase their coverage and reach interesting information that is not available on the indexable Web. The accurate identification of Deep Web data sources is a key issue in the information retrieval process. In this paper we propose a new strategy for the automatic discovery of WQIs. The proposal selects HTML elements extracted from HTML forms, which are used in a set of heuristic rules that help to identify WQIs. The strategy uses machine learning algorithms to classify HTML forms as searchable (WQIs) or non-searchable (non-WQIs), together with a prototype selection algorithm that removes irrelevant or redundant data from the training set. The internal content of Web Query Interfaces was analyzed in order to identify only those HTML elements that appear frequently and provide relevant information for WQI identification. For testing, we use three groups of datasets: two available at the UIUC repository and a new dataset, including advanced and simple query interfaces, that we created using a generic crawler supported by human experts. The experimental results show that the proposed strategy outperforms previously reported works.
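
To make the idea of form-level features concrete, the following is a minimal sketch of how HTML elements might be counted per form and fed to a simple rule. It is an illustration only, not the authors' implementation: the feature set, the names `FormFeatureExtractor`, `extract_form_features` and `looks_searchable`, and the rule itself are all assumptions made for this example, and the abstract does not state which elements or rules the paper actually uses.

```python
from html.parser import HTMLParser

# Hypothetical feature set: <input> types counted per form (an assumption,
# not the element selection reported in the paper).
INPUT_TYPES = {"text", "password", "checkbox", "radio", "submit"}


class FormFeatureExtractor(HTMLParser):
    """Counts selected HTML elements inside each <form> of a page."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one feature dict per form, in document order
        self._current = None  # features of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {name: 0 for name in INPUT_TYPES}
            self._current["select"] = 0
            self._current["method"] = (attrs.get("method") or "get").lower()
        elif self._current is not None:
            if tag == "select":
                self._current["select"] += 1
            elif tag == "input":
                # An <input> without a type attribute defaults to "text".
                kind = (attrs.get("type") or "text").lower()
                if kind in INPUT_TYPES:
                    self._current[kind] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_form_features(html):
    """Returns a list with one feature dict per form found in `html`."""
    parser = FormFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms


def looks_searchable(features):
    """Illustrative rule only: no password field and at least one
    text box or selection list. Real WQI rules would be richer."""
    has_query_field = features["text"] + features["select"] >= 1
    return features["password"] == 0 and has_query_field


PAGE = """
<form action="/search" method="get">
  <input type="text" name="q">
  <select name="category"><option>Books</option></select>
  <input type="submit" value="Search">
</form>
<form action="/login" method="post">
  <input type="text" name="user">
  <input type="password" name="pw">
  <input type="submit" value="Log in">
</form>
"""

if __name__ == "__main__":
    for features in extract_form_features(PAGE):
        print(features["method"], looks_searchable(features))
```

In a learning-based setting such as the one the abstract describes, feature dictionaries of this kind would be turned into training vectors for a classifier rather than passed to a single hand-written rule.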

Author information

Authors and Affiliations

Center of Research and Advanced Studies of the National Polytechnic Institute, Information Technology Laboratory, Victoria City, Tamaulipas, Mexico
Heidy M. Marin-Castro, Victor J. Sosa-Sosa & Ivan Lopez-Arevalo
National Institute for Astrophysics, Optics and Electronics, Tonantzintla, Puebla, San Andrés Cholula, Mexico
Jose F. Martinez-Trinidad

Corresponding author

Correspondence to Heidy M. Marin-Castro.

About this article

Cite this article

Marin-Castro, H.M., Sosa-Sosa, V.J., Martinez-Trinidad, J.F. et al. Automatic discovery of Web Query Interfaces using machine learning techniques. J Intell Inf Syst 40, 85–108 (2013). https://doi.org/10.1007/s10844-012-0217-4

Received: 21 December 2011
Revised: 08 May 2012
Accepted: 01 August 2012
Published: 23 August 2012
Issue Date: February 2013
DOI: https://doi.org/10.1007/s10844-012-0217-4
