The HıLεX System for Semantic Information Extraction

  • Marco Manna
  • Ermelinda Oro
  • Massimo Ruffolo
  • Mario Alviano
  • Nicola Leone
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7100)

Abstract

The explosive growth and popularity of the Web have resulted in a huge number of digital information sources on the Internet. Unfortunately, such sources only manage data, rather than the knowledge they carry. Recognizing, extracting, and structuring relevant information according to its semantics is a crucial task. Several approaches in the field of Information Extraction (IE) have been proposed to support the translation of semi-structured or unstructured documents into structured data or knowledge. Most of them have high precision but, since they are mainly syntactic, they often have low recall, depend on the document format, and ignore the semantics of the information they extract. In this paper, we describe a new approach to semantic information extraction that could represent the basis for automatically extracting highly structured data from unstructured web sources without any undesirable trade-off between precision and recall. In short, the approach (i) is ontology driven, (ii) is based on a unified representation of documents, (iii) integrates existing IE techniques, (iv) implements semantic regular expressions, (v) has been implemented through Answer Set Programming, (vi) is employed in real-world applications, and (vii) is receiving positive feedback from business customers.
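For readers less familiar with the two measures the abstract contrasts, the following is a minimal sketch of the standard definitions of precision and recall for an extraction task. It is not taken from the paper: the function name `precision_recall` and the sample city names are hypothetical and serve only to illustrate how a strict syntactic pattern can be fully precise while missing half of the relevant items.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) of extracted items against a gold set."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    # Precision: fraction of extracted items that are actually relevant.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: fraction of relevant items that were actually extracted.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: a strict pattern finds few items, all of them correct.
gold = {"Rome", "Milan", "Naples", "Turin"}
syntactic_output = {"Rome", "Milan"}

p, r = precision_recall(syntactic_output, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this illustrative case every extracted item is correct, yet two of the four relevant items are missed, which is the kind of imbalance the abstract attributes to mainly syntactic approaches.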

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Marco Manna (1)
  • Ermelinda Oro (2)
  • Massimo Ruffolo (3)
  • Mario Alviano (1)
  • Nicola Leone (1)

  1. Department of Mathematics, University of Calabria, Italy
  2. DEIS, University of Calabria, Italy
  3. ICAR-CNR, University of Calabria, Italy
