Knowledge Acquisition from the Biomedical Literature

  • Chapter
Semantic Web

Abstract

This article focuses on knowledge acquisition from the biomedical literature and on the infrastructure, specifically text mining, needed to access, extract, and integrate that information. The biomedical literature is the major repository of biomedical knowledge. It serves as the source of the structured information that populates biological databases through expert distillation (curation) of the literature. The literature has now grown to the point where an individual scientist cannot read everything relevant, and curators of the major biological databases have trouble keeping up with newly published articles. Furthermore, important biomedical applications, such as drug discovery and the analysis of high-throughput data sets, depend on integrating all available information from both biological databases and the literature. The article reviews these applications, focusing on the role of text mining in providing semantic indices into the literature and on the importance of interactive tools that augment the human expert's ability to extract information from it. These tools are critical for supporting expert curation, finding relationships among biological entities, and creating content for a Semantic Web.

Copyright information

© 2007 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Hirschman, L., Hayes, W.S., Valencia, A. (2007). Knowledge Acquisition from the Biomedical Literature. In: Baker, C.J.O., Cheung, KH. (eds) Semantic Web. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-48438-9_4
