In Silico Tools for Gene Discovery, pp 129-140
In Silico Knowledge and Content Tracking
Abstract
This chapter gives a brief overview of text-mining techniques for extracting knowledge from large text collections. It describes the basic pipeline for getting from text to relationships between biological concepts, and the problems encountered at each step. First, we explain how words in text are recognized as concepts. Second, we describe how concepts are associated with each other using 2×2 contingency tables and test statistics. Third, we explain how indirect links between concepts can be extracted from the direct links found in the 2×2 table analyses; we call this implicit information extraction. Fourth, we discuss techniques for validating a text-mining system, such as ROC curves and retrospective studies. We conclude by examining how text information can be combined with non-textual data sources such as microarray expression data, and by considering future directions for text-mining within the Internet.
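As a rough illustration of the second step, the sketch below builds a 2×2 contingency table for two concepts from document-level co-occurrence and scores it with a Pearson chi-square statistic. The abstract does not say which test statistic the chapter uses, so chi-square is an assumption here, and the function names and toy documents are hypothetical.

```python
def contingency_table(docs, a, b):
    """Count documents by presence/absence of concepts a and b.

    Returns (n11, n10, n01, n00): both present, only a, only b, neither.
    """
    n11 = n10 = n01 = n00 = 0
    for concepts in docs:
        has_a, has_b = a in concepts, b in concepts
        if has_a and has_b:
            n11 += 1
        elif has_a:
            n10 += 1
        elif has_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def chi_square(n11, n10, n01, n00):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    denom = row1 * row0 * col1 * col0
    if denom == 0:
        # A marginal total is zero, so the statistic is undefined; report no association.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


# Hypothetical toy corpus: each document is the set of concepts recognized in it.
docs = [
    {"geneA", "diseaseX"},
    {"geneA", "diseaseX"},
    {"geneA"},
    {"diseaseX"},
    {"geneB"},
    {"geneB"},
]

table = contingency_table(docs, "geneA", "diseaseX")
score = chi_square(*table)
print(table, score)
```

A higher score suggests the two concepts co-occur more (or less) often than expected under independence; a real system would also need to handle small counts and multiple testing.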
Key words
Text-mining, data mining, information retrieval, disambiguation, retrospective analysis, ROC curve, prioritizer, ontology, semantic web