Using the Web to Reduce Data Sparseness in Pattern-Based Information Extraction

Blohm, Sebastian; Cimiano, Philipp

doi:10.1007/978-3-540-74976-9_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4702)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

Textual patterns have been used effectively to extract information from large text collections. However, they rely heavily on textual redundancy, in the sense that facts have to be mentioned in a similar manner in order to be generalized to a textual pattern. Data sparseness thus becomes a problem when trying to extract information from sources with little redundancy, such as corporate intranets, encyclopedic works or scientific databases.

We present results on applying a weakly supervised pattern induction algorithm to Wikipedia to extract instances of arbitrary relations. In particular, we apply different configurations of a basic pattern induction algorithm to seven different datasets. We show that the lack of redundancy creates a need for a large amount of training data, but that integrating Web extraction into the process significantly reduces the amount of training data required while maintaining the accuracy of Wikipedia. In particular, we show that although using the Web can have effects similar to those of increasing the number of seeds, it leads to better results overall. Our approach thus makes it possible to combine the advantages of two sources: the high reliability of a closed corpus and the high redundancy of the Web.
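
The role of redundancy in seed-based pattern induction can be illustrated with a toy sketch. This is not the algorithm evaluated in the paper: the function names, the `min_support` threshold, the single-word argument slot and the example sentences are all assumptions made for illustration. A pattern here is simply the text found between the two arguments of a seed pair, and it is kept only if enough sentences support it.

```python
import re
from collections import Counter

# Assumption: relation arguments are single capitalized words.
SLOT = r"([A-Z][a-z]+)"


def induce_patterns(sentences, seeds, min_support=2):
    """Collect the text between seed arguments; keep infixes seen often enough."""
    support = Counter()
    for sentence in sentences:
        for x, y in seeds:
            i, j = sentence.find(x), sentence.find(y)
            # Only count mentions where x occurs before y in the sentence.
            if i != -1 and j != -1 and i + len(x) <= j:
                support[sentence[i + len(x):j]] += 1
    return [infix for infix, n in support.items() if n >= min_support]


def extract(sentences, infixes):
    """Apply each induced infix as a regular expression and collect (x, y) pairs."""
    found = set()
    for infix in infixes:
        regex = re.compile(SLOT + re.escape(infix) + SLOT)
        for sentence in sentences:
            found.update(regex.findall(sentence))
    return found


# Hypothetical closed corpus: each fact is mentioned only once.
corpus = [
    "Paris is the capital of France.",
    "Rome is the capital of Italy.",
    "Madrid is the capital of Spain.",
    "Berlin lies on the Spree.",
]
# Hypothetical extra mention of a seed fact, standing in for Web text.
web = ["Paris is the capital of France, as many travel guides note."]

seeds = [("Paris", "France"), ("Rome", "Italy")]
patterns = induce_patterns(corpus, seeds)
instances = extract(corpus, patterns)
```

With two seeds, the closed corpus alone supports the infix pattern, and applying it also yields the unseeded pair for Madrid and Spain. With only the first seed, `induce_patterns(corpus, seeds[:1])` finds no pattern because the fact is mentioned once, whereas `induce_patterns(corpus + web, seeds[:1])` recovers the pattern from the additional mention. Under these assumptions, extra redundant text substitutes for extra seeds.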

Author information

Authors and Affiliations

Institute AIFB, University of Karlsruhe, Germany
Sebastian Blohm & Philipp Cimiano

Editor information

Joost N. Kok Jacek Koronacki Ramon Lopez de Mantaras Stan Matwin Dunja Mladenič Andrzej Skowron

About this paper

Cite this paper

Blohm, S., Cimiano, P. (2007). Using the Web to Reduce Data Sparseness in Pattern-Based Information Extraction. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds) Knowledge Discovery in Databases: PKDD 2007. Lecture Notes in Computer Science, vol 4702. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74976-9_6

DOI: https://doi.org/10.1007/978-3-540-74976-9_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74975-2
Online ISBN: 978-3-540-74976-9
eBook Packages: Computer Science, Computer Science (R0)