In Silico Knowledge and Content Tracking

Protocol
Part of the Methods in Molecular Biology book series (MIMB, volume 760)

Abstract

This chapter gives a brief overview of text-mining techniques for extracting knowledge from large text collections. It describes the basic pipeline that leads from text to relationships between biological concepts, and the problems encountered at each step of that pipeline. First, we explain how words in text are recognized as concepts. Second, we describe how concepts are associated with each other using 2×2 contingency tables and test statistics. Third, we explain how indirect links between concepts can be extracted from the direct links obtained in the 2×2 table analyses; we call this implicit information extraction. Fourth, we discuss techniques for validating a text-mining system, such as ROC curves and retrospective studies. We conclude by examining how text information can be combined with non-textual data sources such as microarray expression data, and by outlining future directions for text-mining on the Internet.
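The association step mentioned above can be illustrated with a minimal sketch. It assumes that each document has already been reduced to a set of recognized concepts, and it uses Pearson's chi-square as the test statistic; the abstract does not say which statistic the chapter uses, so that choice is an assumption, as are the function names and the toy concept labels.

```python
from typing import Iterable, Set, Tuple


def contingency_table(docs: Iterable[Set[str]], a: str, b: str) -> Tuple[int, int, int, int]:
    """Count documents by presence of concepts a and b.

    Returns (n11, n10, n01, n00):
      n11 = both a and b, n10 = a only, n01 = b only, n00 = neither.
    """
    n11 = n10 = n01 = n00 = 0
    for concepts in docs:
        has_a, has_b = a in concepts, b in concepts
        if has_a and has_b:
            n11 += 1
        elif has_a:
            n10 += 1
        elif has_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def chi_square(n11: int, n10: int, n01: int, n00: int) -> float:
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    n = n11 + n10 + n01 + n00
    # Product of the four marginal totals (two rows, two columns).
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0  # a concept that is always or never present carries no signal
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


# Hypothetical toy corpus: one set of concept labels per abstract.
docs = [
    {"geneA", "diseaseX"},
    {"geneA", "diseaseX"},
    {"geneA"},
    {"diseaseX"},
    {"geneB"},
    {"geneB"},
    set(),
    {"geneB", "diseaseX"},
]

table = contingency_table(docs, "geneA", "diseaseX")
score = chi_square(*table)
print(table, round(score, 3))
```

A higher statistic indicates that the two concepts co-occur more (or less) often than expected if they were independent. On a corpus this small the value is not meaningful; the sketch only shows how the table is filled and scored.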

Key words

Text-mining, data mining, information retrieval, disambiguation, retrospective analysis, ROC curve, prioritizer, ontology, semantic web

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. Department of Human Genetics, University Medical Center, Leiden, The Netherlands
  2. Netherlands Bioinformatics Centre (NBIC), Nijmegen, The Netherlands
