The ontological key: automatically understanding and integrating forms to access the deep Web
Abstract
Forms are our gates to the Web. They enable us to access the deep content of Web sites. Automatic form understanding provides applications, ranging from crawlers and meta-search engines to service integrators, with a key to this content. Yet it has received little attention other than as a component in specific applications such as crawlers or meta-search engines. No comprehensive approach to form understanding exists, let alone one that produces rich models for semantic services or integration with linked open data. In this paper, we present OPAL, the first comprehensive approach to form understanding and integration. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems, OPAL advances the state of the art. For form labeling, it combines features from the text, structure, and visual rendering of a Web page. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern Web forms, OPAL outperforms previous approaches for form labeling by a significant margin. For form interpretation, OPAL uses a schema (or ontology) of forms in a given domain. Thanks to this domain schema, it is able to produce nearly perfect form interpretations (>97% accuracy in the evaluation domains). Yet the effort to produce a domain schema is very low, as we provide a Datalog-based template language that eases the specification of such schemata and a methodology for deriving a domain schema largely automatically from an existing domain ontology. We demonstrate the value of OPAL's form interpretations through a lightweight form integration system that successfully translates and distributes master queries to hundreds of forms without error, yet is implemented with only a handful of translation rules.
Keywords
Form understanding, Web interfaces, Deep Web
Acknowledgments
The research leading to these results has received funding from the European Research Council under the European Community’s FP7/2007-2013/ERC Grant agreement DIADEM, No. 246858, and the Oxford Martin School.