How the Minotaur Turned into Ariadne: Ontologies in Web Data Extraction

  • Tim Furche
  • Georg Gottlob
  • Xiaonan Guo
  • Christian Schallhart
  • Andrew Sellers
  • Cheng Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6757)


Humans require automated support to profit from the wealth of data now available on the web. To that end, the linked open data initiative and others have been asking data providers to publish structured, semantically annotated data. Small data providers, such as most UK real-estate agencies, are however overburdened with this task, as they are often only just starting to move from simple, table- or list-like directories to web applications with rich interfaces.

We argue that fully automated extraction of structured data can help resolve this dilemma. Ironically, automated data extraction has seen a recent revival thanks to the use of ontologies and linked open data to guide the extraction itself. First results from the DIADEM project illustrate that high-quality, fully automated data extraction at web scale is possible if we combine domain ontologies with a phenomenology describing the representation of domain concepts. We briefly summarise the DIADEM project and discuss a few preliminary results.
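The idea of pairing a domain ontology with a phenomenology can be illustrated with a minimal sketch. It is not DIADEM's implementation: the `Concept` class, the `CONCEPTS` list, the `annotate` function, and the regular expressions are all hypothetical and chosen only for illustration. Each entry names a domain concept (the ontology side) and gives a pattern for how that concept typically appears in page text (the phenomenology side), using the UK real-estate domain mentioned above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A toy pairing of an ontology concept with its textual appearance."""
    name: str     # ontology side: the domain concept (hypothetical name)
    pattern: str  # phenomenology side: regex for how it shows up in text


# Hypothetical toy ontology for UK real-estate listings (an assumption,
# not taken from the DIADEM project).
CONCEPTS = [
    Concept("price", r"£\s?\d{1,3}(?:,\d{3})*"),
    Concept("bedrooms", r"\b\d+\s+bed(?:room)?s?\b"),
    Concept("postcode", r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b"),
]


def annotate(text: str) -> dict:
    """Return the first textual match for each concept found in `text`."""
    record = {}
    for concept in CONCEPTS:
        match = re.search(concept.pattern, text, flags=re.IGNORECASE)
        if match:
            record[concept.name] = match.group(0)
    return record
```

For example, `annotate("3 bedroom house, Oxford OX1 3QD, £450,000")` yields `{"price": "£450,000", "bedrooms": "3 bedroom", "postcode": "OX1 3QD"}`. A real system would need far richer descriptions of page structure than flat regular expressions; the sketch only shows how concepts and their representations can be kept as separate, combinable pieces.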


Keywords (machine-generated, not supplied by the authors): Domain Ontology; Result Page; Page Model; Very Large Data Base; Kleene Star



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Tim Furche (1)
  • Georg Gottlob (1)
  • Xiaonan Guo (1)
  • Christian Schallhart (1)
  • Andrew Sellers (1)
  • Cheng Wang (1)

  1. Department of Computer Science, University of Oxford, UK
