Reasoning and Ontologies in Data Extraction

  • Sergio Flesca
  • Tim Furche
  • Linda Oro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7487)


The web has become a pig sty: everyone dumps information at random places and in random shapes. Try to find the cheapest apartment in Oxford considering rent, travel, tax and heating costs, or a cheap, reasonably well-reviewed 11-inch laptop with an SSD.

Data extraction flushes structured information out of this sty: it turns mostly unstructured web pages into highly structured knowledge. In this chapter, we give a gentle introduction to data extraction, including pointers to existing systems. We start with an overview and classification of data extraction systems along two primary dimensions: the level of supervision and the scale considered. The rest of the chapter is organized along the major division of these approaches into site-specific, supervised approaches versus domain-specific, unsupervised ones. We first discuss supervised data extraction, where a human user identifies examples of the relevant data for each site and the system generalizes these examples into extraction programs. We focus particularly on declarative and rule-based paradigms. In the second part, we turn to fully automated (or unsupervised) approaches, where the system identifies the relevant data by itself and extracts it from many websites without human input. Ontologies or schemata have proven invaluable for guiding unsupervised data extraction, and we present an overview of the existing approaches and the different ways in which they use ontologies.


Keywords: Data Extraction, Information Extraction, Extraction Rule, Pattern Instance, XPath Expression





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Sergio Flesca (1)
  • Tim Furche (2)
  • Linda Oro (3)

  1. DEIS, University of Calabria, Rende, Italy
  2. Department of Computer Science, Oxford University, Oxford, UK
  3. ICAR-CNR, University of Calabria, Rende, Italy
