In Silico Knowledge and Content Tracking

In Silico Tools for Gene Discovery

Part of the book series: Methods in Molecular Biology (MIMB, volume 760)

Abstract

This chapter gives a brief overview of text-mining techniques for extracting knowledge from large text collections. It describes the basic pipeline for getting from text to relationships between biological concepts, and the problems encountered at each step. First, we explain how words in text are recognized as concepts. Second, we describe how concepts are associated with each other using 2×2 contingency tables and test statistics. Third, we explain how indirect links between concepts can be extracted from the direct links obtained in the 2×2 table analyses; we call this implicit information extraction. Fourth, we discuss techniques for validating a text-mining system, such as ROC curves and retrospective studies. We conclude by examining how text information can be combined with non-textual data sources, such as microarray expression data, and what the future directions are for text mining on the Internet.
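As a rough illustration of the second step, the sketch below builds a 2×2 contingency table for two concepts from document-level co-occurrence and computes a Pearson chi-square statistic from it. This is a minimal sketch, not the chapter's own method: the abstract does not say which test statistic is used, and the function names, the representation of a document as a set of concept identifiers, and the toy data are all assumptions made for the example.

```python
from typing import Iterable, Set, Tuple


def contingency_table(docs: Iterable[Set[str]], a: str, b: str) -> Tuple[int, int, int, int]:
    """Count documents by presence/absence of concepts a and b.

    Returns (n11, n10, n01, n00):
      n11 = both a and b, n10 = a only, n01 = b only, n00 = neither.
    Each document is assumed to be a set of concept identifiers.
    """
    n11 = n10 = n01 = n00 = 0
    for concepts in docs:
        has_a, has_b = a in concepts, b in concepts
        if has_a and has_b:
            n11 += 1
        elif has_a:
            n10 += 1
        elif has_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def chi_square(n11: int, n10: int, n01: int, n00: int) -> float:
    """Pearson chi-square statistic for a 2x2 table (no continuity correction).

    Chosen here only as one common test statistic; it is an assumption,
    not necessarily the statistic used in the chapter.
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        # A row or column total is zero, so association is undefined; report 0.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


# Toy corpus (hypothetical): each set holds the concepts found in one abstract.
docs = [
    {"A", "B"}, {"A", "B"}, {"A"}, {"B"},
    {"C"}, {"C"}, {"A", "B", "C"}, {"C"},
]

table = contingency_table(docs, "A", "B")
print(table)               # (3, 1, 1, 3)
print(chi_square(*table))  # 2.0
```

A larger statistic indicates that the two concepts co-occur more (or less) often than expected if they were independent; in practice the value would be compared against a chi-square distribution with one degree of freedom.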

Author information

Corresponding author

Correspondence to Barend Mons.

Copyright information

© 2011 Springer Science+Business Media, LLC

About this protocol

Cite this protocol

van Haagen, H., Mons, B. (2011). In Silico Knowledge and Content Tracking. In: Yu, B., Hinchcliffe, M. (eds) In Silico Tools for Gene Discovery. Methods in Molecular Biology, vol 760. Humana Press. https://doi.org/10.1007/978-1-61779-176-5_8

  • DOI: https://doi.org/10.1007/978-1-61779-176-5_8

  • Publisher Name: Humana Press

  • Print ISBN: 978-1-61779-175-8

  • Online ISBN: 978-1-61779-176-5

  • eBook Packages: Springer Protocols
