In Silico Knowledge and Content Tracking

van Haagen, Herman; Mons, Barend

doi:10.1007/978-1-61779-176-5_8

Part of the book series: Methods in Molecular Biology (MIMB, volume 760)

Abstract

This chapter gives a brief overview of text-mining techniques for extracting knowledge from large text collections. It describes the basic pipeline that leads from text to relationships between biological concepts, and the problems encountered at each step of that pipeline. First, we explain how words in text are recognized as concepts. Second, concepts are associated with each other using 2×2 contingency tables and test statistics. Third, we explain how indirect links between concepts can be extracted from the direct links obtained in the 2×2 table analyses; we call this implicit information extraction. Fourth, we discuss validation techniques for evaluating a text-mining system, such as ROC curves and retrospective studies. We conclude by examining how text information can be combined with non-textual data sources such as microarray expression data, and by outlining future directions for text mining within the Internet.
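
As an illustration of the second and third steps, the following is a minimal sketch and not the chapter's own implementation. It assumes each document has already been reduced to a set of recognized concepts, uses Pearson's chi-square as one possible test statistic for the 2×2 table, and treats any shared co-occurring concept as a candidate indirect link. The toy corpus, the concept labels, and all function names are hypothetical.

```python
def contingency_table(docs, a, b):
    """Count documents by presence/absence of concepts a and b.

    Returns (n11, n10, n01, n00): both present, only a, only b, neither.
    """
    n11 = sum(1 for d in docs if a in d and b in d)
    n10 = sum(1 for d in docs if a in d and b not in d)
    n01 = sum(1 for d in docs if a not in d and b in d)
    n00 = len(docs) - n11 - n10 - n01
    return n11, n10, n01, n00


def chi_square(n11, n10, n01, n00):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        # A concept that occurs in all or none of the documents carries no signal.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def cooccurring(docs, x):
    """All concepts that appear in at least one document together with x."""
    found = set()
    for d in docs:
        if x in d:
            found |= d - {x}
    return found


def implicit_intermediates(docs, a, c):
    """Concepts directly linked to both a and c (candidate indirect links)."""
    return cooccurring(docs, a) & cooccurring(docs, c)


# Hypothetical toy corpus: each document is the set of concepts found in it.
docs = [
    {"A", "B"}, {"A", "B"},
    {"B", "C"}, {"B", "C"},
    {"D"}, {"D"}, {"D"}, {"D"},
]

table_ab = contingency_table(docs, "A", "B")
stat_ab = chi_square(*table_ab)
bridge = implicit_intermediates(docs, "A", "C")
```

In this toy corpus A and C never share a document, yet both co-occur with B, so B is returned as the intermediate concept. A real system would weight such links by the strength of the direct associations rather than by simple set overlap.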

Author information

Authors and Affiliations

Department of Human Genetics, University Medical Center, Leiden, The Netherlands
Herman van Haagen
Netherlands Bioinformatics Centre (NBIC), Nijmegen, The Netherlands
Barend Mons

Corresponding author

Correspondence to Barend Mons.

Editor information

Editors and Affiliations

Royal Prince Alfred Hospital, Dept. of Molecular & Clinical Genetics, University of Sydney, Missenden Road, Camperdown, 2050, New South Wales, Australia
Bing Yu
Royal Prince Alfred Hospital, Dept. of Molecular & Clinical Genetics, University of Sydney, Missenden Road, Camperdown, 2050, New South Wales, Australia
Marcus Hinchcliffe

About this protocol

Cite this protocol

van Haagen, H., Mons, B. (2011). In Silico Knowledge and Content Tracking. In: Yu, B., Hinchcliffe, M. (eds) In Silico Tools for Gene Discovery. Methods in Molecular Biology, vol 760. Humana Press. https://doi.org/10.1007/978-1-61779-176-5_8

DOI: https://doi.org/10.1007/978-1-61779-176-5_8
Published: 30 June 2011
Publisher Name: Humana Press
Print ISBN: 978-1-61779-175-8
Online ISBN: 978-1-61779-176-5
eBook Packages: Springer Protocols
