Abstract
Humans require automated support to profit from the wealth of data now available on the web. To that end, the linked open data initiative and others have been asking data providers to publish structured, semantically annotated data. Small data providers, such as most UK real-estate agencies, are overburdened by this task, however: many are only just starting to move from simple, table- or list-like directories to web applications with rich interfaces.
We argue that fully automated extraction of structured data can help resolve this dilemma. Ironically, automated data extraction has seen a recent revival thanks to ontologies and linked open data, which can be used to guide the extraction. First results from the DIADEM project illustrate that high-quality, fully automated data extraction at web scale is possible if we combine domain ontologies with a phenomenology that describes how domain concepts are represented. We briefly summarise the DIADEM project and discuss a few preliminary results.
The research leading to these results has received funding from the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement no. 246858 (DIADEM).
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Furche, T., Gottlob, G., Guo, X., Schallhart, C., Sellers, A., Wang, C. (2011). How the Minotaur Turned into Ariadne: Ontologies in Web Data Extraction. In: Auer, S., Díaz, O., Papadopoulos, G.A. (eds) Web Engineering. ICWE 2011. Lecture Notes in Computer Science, vol 6757. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22233-7_2
DOI: https://doi.org/10.1007/978-3-642-22233-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22232-0
Online ISBN: 978-3-642-22233-7
eBook Packages: Computer Science (R0)