Abstract
This chapter gives a brief overview of text-mining techniques for extracting knowledge from large text collections. It describes the basic pipeline that leads from text to relationships between biological concepts, and the problems encountered at each step. First, we explain how words in text are recognized as concepts. Second, we show how concepts are associated with each other using 2×2 contingency tables and test statistics. Third, we explain how indirect links between concepts can be extracted from the direct links obtained in the 2×2 table analyses; we call this implicit information extraction. Fourth, we discuss techniques for validating a text-mining system, such as ROC curves and retrospective studies. We conclude by examining how text information can be combined with non-textual data sources, such as microarray expression data, and by outlining future directions for text mining on the Internet.
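As an illustration of the second step, the following minimal sketch builds a 2×2 contingency table for two concepts from document-level co-occurrence and computes a Pearson chi-square statistic from it. The abstract does not say which test statistic the chapter uses, so the chi-square choice is an assumption, as are the function names `contingency_table` and `chi_square` and the toy corpus of concept-annotated documents.

```python
from typing import Dict, Set, Tuple

Table = Tuple[int, int, int, int]  # (n11, n10, n01, n00)


def contingency_table(docs: Dict[int, Set[str]], a: str, b: str) -> Table:
    """Count documents by presence/absence of concepts a and b."""
    n11 = n10 = n01 = n00 = 0
    for concepts in docs.values():
        has_a, has_b = a in concepts, b in concepts
        if has_a and has_b:
            n11 += 1  # both concepts occur in the document
        elif has_a:
            n10 += 1  # only a occurs
        elif has_b:
            n01 += 1  # only b occurs
        else:
            n00 += 1  # neither occurs
    return n11, n10, n01, n00


def chi_square(table: Table) -> float:
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    n11, n10, n01, n00 = table
    n = n11 + n10 + n01 + n00
    # Product of the four marginal totals; zero means a degenerate table.
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


# Hypothetical toy corpus: document id -> set of recognized concepts.
docs = {
    1: {"BRCA1", "breast cancer"},
    2: {"BRCA1", "breast cancer"},
    3: {"BRCA1"},
    4: {"breast cancer"},
    5: {"insulin"},
    6: {"insulin"},
    7: {"insulin", "breast cancer"},
    8: {"BRCA1", "breast cancer"},
}

table = contingency_table(docs, "BRCA1", "breast cancer")
score = chi_square(table)
print(table, round(score, 4))
```

A larger statistic indicates that the two concepts co-occur more (or less) often than expected under independence. With a corpus this small the value is only illustrative; real collections contain millions of abstracts.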
Copyright information
© 2011 Springer Science+Business Media, LLC
About this protocol
Cite this protocol
van Haagen, H., Mons, B. (2011). In Silico Knowledge and Content Tracking. In: Yu, B., Hinchcliffe, M. (eds) In Silico Tools for Gene Discovery. Methods in Molecular Biology, vol 760. Humana Press. https://doi.org/10.1007/978-1-61779-176-5_8
DOI: https://doi.org/10.1007/978-1-61779-176-5_8
Publisher Name: Humana Press
Print ISBN: 978-1-61779-175-8
Online ISBN: 978-1-61779-176-5
eBook Packages: Springer Protocols