Automatic discovery of Web Query Interfaces using machine learning techniques

Journal of Intelligent Information Systems

Abstract

The amount of information contained in databases available on the Web has grown explosively in recent years. This information, known as the Deep Web, is heterogeneous and dynamically generated by querying back-end (relational) databases through Web Query Interfaces (WQIs), a special type of HTML form. Accessing Deep Web information is a major challenge because it is usually not indexed by general-purpose search engines. It is therefore necessary to create efficient mechanisms to access, extract and integrate the information contained in the Deep Web. Since WQIs are the only means of accessing the Deep Web, their automatic identification plays an important role: it helps traditional search engines increase their coverage and reach interesting information that is not available on the indexable Web. The accurate identification of Deep Web data sources is a key issue in the information retrieval process. In this paper we propose a new strategy for the automatic discovery of WQIs. The proposal selects a suitable set of HTML elements extracted from HTML forms, which are used in a set of heuristic rules that help to identify WQIs. The strategy uses machine learning algorithms to classify HTML forms as searchable (WQI) or non-searchable (non-WQI), together with a prototype selection algorithm that removes irrelevant or redundant data from the training set. The internal content of Web Query Interfaces was analyzed in order to identify only those HTML elements that appear frequently and provide relevant information for WQI identification. For testing, we use three groups of datasets: two available at the UIUC repository and a new dataset, including advanced and simple query interfaces, that we created using a generic crawler supported by human experts. The experimental results show that the proposed strategy outperforms other previously reported approaches.
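
As a rough illustration of the idea of turning HTML form elements into features and applying a heuristic rule, the following Python sketch counts a few element types per form and flags forms that look searchable. It is not the authors' feature set, rules, or classifier: the chosen element types, the names `FormFeatureExtractor`, `extract_forms` and `looks_searchable`, and the rule itself are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Hypothetical selection of <input> types to count; not taken from the paper.
INPUT_TYPES = {"text", "password", "checkbox", "radio", "submit"}


class FormFeatureExtractor(HTMLParser):
    """Counts selected elements inside each <form> of an HTML page."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one feature dict per completed form
        self._current = None   # features of the form currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {name: 0 for name in INPUT_TYPES}
            self._current["select"] = 0
            self._current["method"] = (attrs.get("method") or "get").lower()
        elif self._current is not None:
            if tag == "input":
                # An <input> without a type attribute defaults to a text box.
                kind = (attrs.get("type") or "text").lower()
                if kind in INPUT_TYPES:
                    self._current[kind] += 1
            elif tag == "select":
                self._current["select"] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_forms(html):
    """Return a list of feature dicts, one per form found in the HTML string."""
    parser = FormFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms


def looks_searchable(features):
    """Toy heuristic (an assumption): a form with a password field is treated
    as a login form; otherwise at least one text box or select menu is needed."""
    if features["password"] > 0:
        return False
    return features["text"] + features["select"] >= 1
```

In a fuller pipeline, feature dictionaries like these would typically be converted to numeric vectors and passed to a trained classifier rather than to a single hand-written rule.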



Author information

Corresponding author

Correspondence to Heidy M. Marin-Castro.

About this article

Cite this article

Marin-Castro, H.M., Sosa-Sosa, V.J., Martinez-Trinidad, J.F. et al. Automatic discovery of Web Query Interfaces using machine learning techniques. J Intell Inf Syst 40, 85–108 (2013). https://doi.org/10.1007/s10844-012-0217-4

  • DOI: https://doi.org/10.1007/s10844-012-0217-4
