Abstract
The explosive growth and popularity of the Web have resulted in a huge number of digital information sources on the Internet. Unfortunately, such sources manage only data, rather than the knowledge that the data carries. Recognizing, extracting, and structuring relevant information according to its semantics is a crucial task. Several approaches in the field of Information Extraction (IE) have been proposed to support the translation of semi-structured or unstructured documents into structured data or knowledge. Most of them have high precision but, since they are mainly syntactic, they often have low recall, depend on the document format, and ignore the semantics of the information they extract. In this paper, we describe a new approach to semantic information extraction that could serve as the basis for automatically extracting highly structured data from unstructured web sources without any undesirable trade-off between precision and recall. In short, the approach (i) is ontology driven, (ii) is based on a unified representation of documents, (iii) integrates existing IE techniques, (iv) implements semantic regular expressions, (v) has been implemented through Answer Set Programming, (vi) is employed in real-world applications, and (vii) has received positive feedback from business customers.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Manna, M., Oro, E., Ruffolo, M., Alviano, M., Leone, N. (2012). The HıLεX System for Semantic Information Extraction. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems V. Lecture Notes in Computer Science, vol 7100. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28148-8_5
DOI: https://doi.org/10.1007/978-3-642-28148-8_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28147-1
Online ISBN: 978-3-642-28148-8
eBook Packages: Computer Science (R0)