Abstract
Textual patterns have been used effectively to extract information from large text collections. However they rely heavily on textual redundancy in the sense that facts have to be mentioned in a similar manner in order to be generalized to a textual pattern. Data sparseness thus becomes a problem when trying to extract information from hardly redundant sources like corporate intranets, encyclopedic works or scientific databases.
We present results on applying a weakly supervised pattern induction algorithm to Wikipedia to extract instances of arbitrary relations. In particular, we apply different configurations of a basic algorithm for pattern induction on seven different datasets. We show that the lack of redundancy leads to the need of a large amount of training data but that integrating Web extraction into the process leads to a significant reduction of required training data while maintaining the accuracy of Wikipedia. In particular we show that, though the use of the Web can have similar effects as produced by increasing the number of seeds, it leads overall to better results. Our approach thus allows to combine advantages of two sources: The high reliability of a closed corpus and the high redundancy of the Web.
Chapter PDF
Similar content being viewed by others
References
Agichtein, E., Gravano, L.: Snowball: extracting relations from large plain-text collections. In: Proceedings of the fifth ACM conference on Digital Libraries (DL), pp. 85–94. ACM Press, New York (2000)
Blohm, S., Cimiano, P., Stemle, E.: Harvesting relations from the web -quantifiying the impact of filtering functions. In: Proceedings of the 22nd International Conference of the Association for the Advancement of Artificial Intelligence (AAAI) (to appear, 2007)
Brin, S.: Extracting patterns and relations from the world wide web. In: Schek, H.-J., Saltor, F., Ramos, I., Alonso, G. (eds.) EDBT 1998. LNCS, vol. 1377, Springer, Heidelberg (1998)
Chen, J., Ji, D., Tan, C.L., Niu, Z.: Relation extraction using label propagation based semi-supervised learning. In: Proceedings of the 21st International Conference on Computational Linguistics (COLING) and the 44th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 129–136 (2006)
Ciravegna, F.: Adaptive information extraction from text by rule induction and generalisation. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1251–1256 (2001)
Culotta, A., Sorensen, J.: Dependency tree kernels for relation extraction. In: Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL), pp. 423–429 (2004)
Downey, D., Etzioni, O., Soderland, S., Weld, D.: Learning text patterns for web information extraction and assessment. In: Proceedings of the AAAI Workshop on Adaptive Text Extraction and Mining (2004)
Fagin, R., Kumar, R., McCurley, K.S., Novak, J., Sivakumar, D., Tomlin, J.A., Williamson, D.P.: Searching the workplace web. In: Proceedings of the 12th International Conference on World Wide Web (WWW), pp. 366–375. ACM Press, New York (2003)
Kilgariff, A., Grefenstette, G.: Special Issue on the Web as a Corpus. Journal of Computational Linguistics 29 (2003)
Muggleton, S., Feng, C.: Efficient induction of logic programs. In: Proceedings of the 1st Conference on Algorithmic Learning Theory, pp. 368–381 (1990)
Pantel, P., Pennacchiotti, M.: Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In: Proceedings of the 21st International Conference on Computational Linguistics (COLING) and the 44th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 113–120 (2006)
Ruiz-Casado, M., Alfonseca, E., Castells, P.: Automatic extraction of semantic relationships for wordnet by means of pattern learning from wikipedia. In: Natural Language Processing and Information Systems, Springer, Berlin (2005)
Saric, J., Jensen, L., Ouzounova, R., Rojas, I., Bork, P.: Extraction of regulatory gene expression networks from pubmed. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 191–198 (2004)
Soderland, S.: Learning information extraction rules for semi-structured and free text. Machine Learning 34(1-3), 233–272 (1999)
Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: A Core of Semantic Knowledge. In: Proceedings of the 16th International Conference on World Wide Web (WWW), pp. 697–706. ACM Press, New York (2007)
Uren, V., Cimiano, P., Iria, J., Handschuh, S., Vargas-Vera, M., Motta, E., Ciravegna, F.: Semantic annotation for knowledge management: Requirements and a survey of the state of the art. Journal of Web Semantics: Science, Services and Agents on the World Wide Web 4, 14–28 (2006)
Völkel, M., Krötzsch, M., Vrandecic, D., Haller, H., Studer, R.: Semantic wikipedia. In: Proceedings of the 15th International Conference on World Wide Web (WWW), pp. 585–594 (2006)
Zelenko, D., Aone, C., Richardella, A.: Kernel methods for relation extraction. Journal of Machine Learning Research 3, 1083–1106 (2003)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Blohm, S., Cimiano, P. (2007). Using the Web to Reduce Data Sparseness in Pattern-Based Information Extraction. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds) Knowledge Discovery in Databases: PKDD 2007. PKDD 2007. Lecture Notes in Computer Science(), vol 4702. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74976-9_6
Download citation
DOI: https://doi.org/10.1007/978-3-540-74976-9_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74975-2
Online ISBN: 978-3-540-74976-9
eBook Packages: Computer ScienceComputer Science (R0)