Abstract
Information extraction applies natural language processing to automatically extract the essential details from text documents. A major disadvantage of current approaches is their intrinsic dependence on the application domain and the target language. Several machine learning techniques have been applied to make information extraction systems more portable. This paper describes a general method for building an information extraction system using regular expressions together with supervised learning algorithms. In this method, extraction decisions are led by a set of classifiers instead of sophisticated linguistic analyses. The paper also presents a system called TOPO that extracts information related to natural disasters from newspaper articles in Spanish. Experimental results for this system indicate that the proposed method can be a practical solution for building information extraction systems, reaching an F-measure as high as 72%.
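The general idea of pairing regular expressions with classifiers can be illustrated with a minimal sketch. It is not the TOPO implementation: the digit pattern, the context-window features, the hand-written Naive Bayes classifier, the labels (`DEAD`, `HOUSES`, `OTHER`), and the training sentences are all assumptions made up for illustration. A regular expression proposes candidate text fragments, and a classifier trained on the words around each candidate decides whether the candidate is extracted and into which slot.

```python
import math
import re
from collections import Counter

# Hypothetical candidate pattern: any run of digits (counts, years, ...).
CANDIDATE = re.compile(r"\d+")

def candidates(text, window=2):
    """Return (match, context features) for each regex match in text."""
    found = []
    for m in CANDIDATE.finditer(text):
        left = text[:m.start()].lower().split()[-window:]
        right = text[m.end():].lower().split()[:window]
        feats = ["L:" + w for w in left] + ["R:" + w for w in right]
        found.append((m.group(), feats))
    return found

class ContextClassifier:
    """Tiny multinomial Naive Bayes over context words (illustrative)."""

    def __init__(self):
        self.feature_counts = {}       # label -> Counter of features
        self.label_counts = Counter()  # label -> number of examples
        self.vocab = set()

    def fit(self, examples):
        for feats, label in examples:
            self.label_counts[label] += 1
            self.feature_counts.setdefault(label, Counter()).update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best, best_score = None, float("-inf")
        for label, n in self.label_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for f in feats:
                if f in self.vocab:  # ignore unseen context words
                    score += math.log((counts[f] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best, best_score = label, score
        return best

def train(labeled_texts):
    """Each item is (text, labels), one label per regex match, in order."""
    examples = []
    for text, labels in labeled_texts:
        for (_, feats), label in zip(candidates(text), labels):
            examples.append((feats, label))
    return ContextClassifier().fit(examples)

def extract(model, text):
    """Keep only the candidates that the classifier does not reject."""
    slots = {}
    for value, feats in candidates(text):
        label = model.predict(feats)
        if label != "OTHER":
            slots.setdefault(label, []).append(value)
    return slots

# Invented training sentences; not taken from the paper's corpus.
TRAIN = [
    ("El sismo dejó 12 muertos en la región", ["DEAD"]),
    ("La inundación causó 30 muertos y destruyó 200 viviendas", ["DEAD", "HOUSES"]),
    ("El huracán destruyó 45 viviendas en 1998", ["HOUSES", "OTHER"]),
    ("En 2003 el incendio dejó 5 muertos", ["OTHER", "DEAD"]),
]

model = train(TRAIN)
print(extract(model, "En 2004 el terremoto dejó 8 muertos y dañó 60 viviendas"))
```

In this sketch the year 2004 matches the regular expression but is rejected by the classifier, while 8 and 60 are assigned to different slots based only on neighbouring words. No parsing or other linguistic analysis is involved, which is the property the abstract attributes to the proposed method.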
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Téllez-Valero, A., Montes-y-Gómez, M., Villaseñor-Pineda, L. (2005). A Machine Learning Approach to Information Extraction. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_58
DOI: https://doi.org/10.1007/978-3-540-30586-6_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)