A Grid Infrastructure for Text Mining of Full Text Articles and Creation of a Knowledge Base of Gene Relations

Natarajan, Jeyakumar; Mulay, Niranjan; DeSesa, Catherine; Hack, Catherine J.; Dubitzky, Werner; Bremer, Eric G.

doi:10.1007/11573067_11

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3745)

Included in the following conference series:

International Symposium on Biological and Medical Data Analysis

Abstract

We demonstrate the application of a grid infrastructure for conducting text mining over distributed data and computational resources. The approach is based on using LexiQuest Mine, a text mining workbench, in a grid computing environment. We describe our architecture and approach and provide an illustrative example of mining full-text journal articles to create a knowledge base of gene relations. The number of patterns found per full-text article increased from 0.74 for a corpus of 1000 articles to 0.83 for a corpus of 5000 articles. However, mining the corpus of 5000 full-text articles took 26 hours on a single computer, whereas the process was completed in less than 2.5 hours on a grid comprising 20 computers. Thus, whilst increasing the size of the corpus improved the efficiency of the text-mining process, a grid infrastructure was required to complete the task in a timely manner.
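
The figures quoted in the abstract can be turned into a rough speedup estimate. The sketch below is illustrative arithmetic only, not part of the authors' method: the function names are hypothetical, the grid run time is treated as an upper bound (the abstract says "less than 2.5 hours"), and the pattern totals are merely implied by the rounded per-article rates.

```python
# Illustrative arithmetic on the numbers reported in the abstract.
# All function and variable names are hypothetical helpers for this sketch.

def speedup(serial_hours: float, parallel_hours: float) -> float:
    """Ratio of single-machine run time to grid run time."""
    return serial_hours / parallel_hours


def parallel_efficiency(serial_hours: float, parallel_hours: float, nodes: int) -> float:
    """Speedup divided by the number of machines (1.0 would be ideal scaling)."""
    return speedup(serial_hours, parallel_hours) / nodes


def implied_patterns(rate_per_article: float, corpus_size: int) -> float:
    """Approximate pattern count implied by a per-article rate."""
    return rate_per_article * corpus_size


SERIAL_HOURS = 26.0      # single computer, 5000 articles (from the abstract)
GRID_HOURS_UPPER = 2.5   # "less than 2.5 hours", so this is an upper bound
GRID_NODES = 20          # grid size (from the abstract)

# Because the grid time is an upper bound, these are lower bounds.
min_speedup = speedup(SERIAL_HOURS, GRID_HOURS_UPPER)
min_efficiency = parallel_efficiency(SERIAL_HOURS, GRID_HOURS_UPPER, GRID_NODES)

print(f"speedup greater than {min_speedup:.1f}x")
print(f"parallel efficiency greater than {min_efficiency:.2f}")
print(f"about {implied_patterns(0.74, 1000):.0f} patterns in 1000 articles")
print(f"about {implied_patterns(0.83, 5000):.0f} patterns in 5000 articles")
```

On these assumptions the grid delivers at least a tenfold speedup on 20 machines, which is well short of linear scaling but enough to bring the run from over a day to a few hours.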

Author information

Authors and Affiliations

Bioinformatics Research Group, University of Ulster, UK
Jeyakumar Natarajan, Catherine J. Hack & Werner Dubitzky
United Devices Inc, Austin, TX, USA
Niranjan Mulay
Brain Tumor Research Program, Children’s Memorial Hospital, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
Catherine DeSesa & Eric G. Bremer

Editor information

Editors and Affiliations

University of Aveiro, DETI/IEETA, 3810-193, Aveiro, Portugal
José Luís Oliveira
Biomedical Informatics Group, Dep. Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Spain
Víctor Maojo
Medical Bioinformatics Department, Institute of Health ‘Carlos III’, Ctra. Majadahonda-Pozuelo, km 2. 28220 Majadahonda, Madrid
Fernando Martín-Sánchez
Department of Electronics and Telecommunications (DET/IEETA), University of Aveiro, 3810 193, Aveiro, Portugal
António Sousa Pereira

About this paper

Cite this paper

Natarajan, J., Mulay, N., DeSesa, C., Hack, C.J., Dubitzky, W., Bremer, E.G. (2005). A Grid Infrastructure for Text Mining of Full Text Articles and Creation of a Knowledge Base of Gene Relations. In: Oliveira, J.L., Maojo, V., Martín-Sánchez, F., Pereira, A.S. (eds) Biological and Medical Data Analysis. ISBMDA 2005. Lecture Notes in Computer Science, vol 3745. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573067_11

DOI: https://doi.org/10.1007/11573067_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29674-4
Online ISBN: 978-3-540-31658-9
eBook Packages: Computer Science, Computer Science (R0)
