The HıLεX System for Semantic Information Extraction

Manna, Marco; Oro, Ermelinda; Ruffolo, Massimo; Alviano, Mario; Leone, Nicola

doi:10.1007/978-3-642-28148-8_5

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 7100)

Abstract

The explosive growth and popularity of the Web have produced a huge number of digital information sources on the Internet. Unfortunately, such sources manage only data, not the knowledge the data carry. Recognizing, extracting, and structuring relevant information according to its semantics is therefore a crucial task. Several approaches in the field of Information Extraction (IE) have been proposed to support the translation of semi-structured and unstructured documents into structured data or knowledge. Most of them achieve high precision but, because they are mainly syntactic, they often have low recall, depend on the document format, and ignore the semantics of the information they extract. In this paper, we describe a new approach to semantic information extraction that could serve as the basis for automatically extracting highly structured data from unstructured web sources without any undesirable trade-off between precision and recall. In short, the approach (i) is ontology driven, (ii) is based on a unified representation of documents, (iii) integrates existing IE techniques, (iv) implements semantic regular expressions, (v) has been implemented through Answer Set Programming, (vi) is employed in real-world applications, and (vii) is receiving positive feedback from business customers.

Author information

Authors and Affiliations

Department of Mathematics, University of Calabria, Italy
Marco Manna, Mario Alviano & Nicola Leone
DEIS, University of Calabria, Italy
Ermelinda Oro
ICAR-CNR, University of Calabria, Italy
Massimo Ruffolo

Editor information

Abdelkader Hameurlain, Josef Küng, Roland Wagner

About this chapter

Cite this chapter

Manna, M., Oro, E., Ruffolo, M., Alviano, M., Leone, N. (2012). The HıLεX System for Semantic Information Extraction. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems V. Lecture Notes in Computer Science, vol 7100. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28148-8_5

DOI: https://doi.org/10.1007/978-3-642-28148-8_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28147-1
Online ISBN: 978-3-642-28148-8
eBook Packages: Computer Science, Computer Science (R0)
