Automatic discovery of Web Query Interfaces using machine learning techniques
The amount of information contained in databases available on the Web has grown explosively in recent years. This information, known as the Deep Web, is heterogeneous and dynamically generated by querying back-end (relational) databases through Web Query Interfaces (WQIs), which are a special type of HTML form. Accessing Deep Web information is a major challenge because it is usually not indexed by general-purpose search engines. It is therefore necessary to create efficient mechanisms to access, extract and integrate the information contained in the Deep Web. Since WQIs are the only means of accessing the Deep Web, their automatic identification plays an important role: it helps traditional search engines increase their coverage and reach interesting information not available on the indexable Web. The accurate identification of Deep Web data sources is a key issue in the information retrieval process. In this paper we propose a new strategy for the automatic discovery of WQIs. The proposal selects HTML elements extracted from HTML forms and uses them in a set of heuristic rules that help to identify WQIs. The strategy uses machine learning algorithms to classify HTML forms as searchable (WQI) or non-searchable (non-WQI), together with a prototype selection algorithm that removes irrelevant or redundant data from the training set. The internal content of Web Query Interfaces was analyzed in order to identify only those HTML elements that appear frequently and provide relevant information for WQI identification. For testing, we use three groups of datasets: two available at the UIUC repository and a new dataset, created using a generic crawler supported by human experts, that includes advanced and simple query interfaces. The experimental results show that the proposed strategy outperforms previously reported works.
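As a rough illustration of the idea described above (counting HTML elements inside each form and feeding them to a rule), the following Python sketch may help. It is not the authors' method: the feature names, the `FormFeatureExtractor` class, and the `looks_searchable` rule are all hypothetical, chosen only to show the general shape of such a pipeline. In the strategy summarized in the abstract, feature vectors of this kind would be passed to trained classifiers rather than to a single hand-written rule.

```python
from collections import Counter
from html.parser import HTMLParser


class FormFeatureExtractor(HTMLParser):
    """Collects one Counter of element counts per <form> in a page.

    The feature set here is an assumption for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.forms = []        # one Counter per completed form
        self._current = None   # Counter of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = Counter()
            # HTML forms default to GET when no method is given.
            method = (attrs.get("method") or "get").lower()
            self._current["method_get"] = int(method == "get")
        elif self._current is not None:
            if tag == "input":
                # Inputs default to type="text" when no type is given.
                kind = (attrs.get("type") or "text").lower()
                self._current["input_" + kind] += 1
            elif tag in ("select", "textarea"):
                self._current[tag] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_form_features(html):
    """Return a list of per-form feature Counters for an HTML string."""
    parser = FormFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.forms


def looks_searchable(features):
    """Hypothetical heuristic: password fields suggest a login form;
    otherwise require at least one text, search, or select field."""
    if features["input_password"] > 0:
        return False
    fields = features["input_text"] + features["input_search"] + features["select"]
    return fields >= 1


if __name__ == "__main__":
    page = (
        '<form action="/search" method="get">'
        '<input type="text" name="q"><input type="submit">'
        "</form>"
    )
    for form in extract_form_features(page):
        print(dict(form), looks_searchable(form))
```

A single rule like this will misclassify many real forms (for example, newsletter sign-ups with one text field), which is one reason a learned classifier over a larger set of form features is attractive.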
Journal of Intelligent Information Systems
Volume 40, Issue 1 , pp 85-108
- Springer US
- Deep Web
- Hidden-Web databases
- Web Query Interfaces
- supervised classification
- Author Affiliations
- 1. Center of Research and Advanced Studies of the National Polytechnic Institute, Information Technology Laboratory, Victoria City, Tamaulipas, Mexico
- 2. National Institute for Astrophysics, Optics and Electronics Tonantzintla, Puebla, San Andrés Cholula, Mexico