Abstract
In some domains, Information Extraction (IE) from texts requires syntactic and semantic parsing. This analysis is computationally expensive, and IE is potentially noisy when it is applied to the whole set of documents while the relevant information is sparse. A preprocessing phase that selects the potentially relevant fragments increases the efficiency of the IE process. This phase has to be fast and based on a shallow description of the texts. We applied various classification methods (IVI, a Naive Bayes learner, and C4.5) to this fragment filtering task in the domain of functional genomics. This paper describes the results of this study. We show that the IVI and Naive Bayes methods with feature selection give the best results, compared both with the same methods without feature selection and with C4.5.
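To make the filtering step concrete, here is a minimal sketch of a sentence filter built from feature selection followed by a Naive Bayes classifier. It is an illustration under assumptions, not the method evaluated in the paper: the bag-of-words representation, the document-frequency-difference score used to select features, the Laplace smoothing, the toy sentences with placeholder gene names, and every function and class name are hypothetical choices made for this example.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Shallow description of a sentence: lowercase whitespace tokens, no parsing."""
    return sentence.lower().split()


def select_features(sentences, labels, k):
    """Keep the k words whose document frequency differs most between classes.

    Hypothetical selection criterion chosen for simplicity; assumes both
    classes occur at least once in the training data.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for sent, lab in zip(sentences, labels):
        df = pos_df if lab else neg_df
        for w in set(tokenize(sent)):
            df[w] += 1
    vocab = set(pos_df) | set(neg_df)

    def score(w):
        return abs(pos_df[w] / n_pos - neg_df[w] / n_neg)

    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(vocab, key=lambda w: (-score(w), w))
    return set(ranked[:k])


class NaiveBayesFilter:
    """Multinomial Naive Bayes restricted to the selected features."""

    def fit(self, sentences, labels, features):
        self.features = features
        self.log_prior, self.log_lik = {}, {}
        for c in (0, 1):
            docs = [s for s, l in zip(sentences, labels) if l == c]
            self.log_prior[c] = math.log(len(docs) / len(sentences))
            counts = Counter(
                w for s in docs for w in tokenize(s) if w in features
            )
            total = sum(counts.values()) + len(features)  # Laplace smoothing
            self.log_lik[c] = {
                w: math.log((counts[w] + 1) / total) for w in features
            }
        return self

    def predict(self, sentence):
        """Return 1 (keep for IE) or 0 (discard); ties go to 0."""
        scores = {}
        for c in (0, 1):
            scores[c] = self.log_prior[c] + sum(
                self.log_lik[c][w]
                for w in tokenize(sentence)
                if w in self.features
            )
        return 1 if scores[1] > scores[0] else 0


# Invented toy data: 1 = sentence may describe a gene interaction, 0 = not.
train = [
    ("geneA activates transcription of geneB", 1),
    ("geneC represses expression of geneD", 1),
    ("geneB activates transcription of geneE", 1),
    ("cells were grown at 37 degrees", 0),
    ("samples were collected after two hours", 0),
    ("the strains used are listed in table 1", 0),
]
sentences = [s for s, _ in train]
labels = [l for _, l in train]

features = select_features(sentences, labels, k=5)
clf = NaiveBayesFilter().fit(sentences, labels, features)

keep = clf.predict("geneF activates expression of geneG")
drop = clf.predict("plates were incubated overnight")
```

Only sentences for which `predict` returns 1 would be passed on to the expensive syntactic and semantic analysis. In this sketch a sentence containing none of the selected words is discarded; a filter tuned for recall might prefer to keep such sentences instead.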
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nédellec, C., Abdel Vetah, M.O., Bessières, P. (2001). Sentence Filtering for Information Extraction in Genomics, a Classification Problem. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_27
DOI: https://doi.org/10.1007/3-540-44794-6_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive