Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 7100)

Abstract

The explosive growth and popularity of the Web have resulted in a huge number of digital information sources on the Internet. Unfortunately, such sources manage only data, not the knowledge that the data carries. Recognizing, extracting, and structuring relevant information according to its semantics is therefore a crucial task. Several approaches in the field of Information Extraction (IE) have been proposed to support the translation of semi-structured and unstructured documents into structured data or knowledge. Most of them have high precision but, because they are mainly syntactic, they often have low recall, depend on the document format, and ignore the semantics of the information they extract. In this paper, we describe a new approach to semantic information extraction that could serve as the basis for automatically extracting highly structured data from unstructured web sources without any undesirable trade-off between precision and recall. In short, the approach (i) is ontology driven, (ii) is based on a unified representation of documents, (iii) integrates existing IE techniques, (iv) implements semantic regular expressions, (v) has been implemented through Answer Set Programming, (vi) is employed in real-world applications, and (vii) has received positive feedback from business customers.

Editor information

Abdelkader Hameurlain, Josef Küng, Roland Wagner

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Manna, M., Oro, E., Ruffolo, M., Alviano, M., Leone, N. (2012). The HıLεX System for Semantic Information Extraction. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems V. Lecture Notes in Computer Science, vol 7100. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28148-8_5

  • DOI: https://doi.org/10.1007/978-3-642-28148-8_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28147-1

  • Online ISBN: 978-3-642-28148-8

  • eBook Packages: Computer Science, Computer Science (R0)
