Knowledge Acquisition from the Biomedical Literature

  • Lynette Hirschman
  • William S. Hayes
  • Alfonso Valencia


This article focuses on knowledge acquisition from the biomedical literature, and on the infrastructure, specifically text mining, needed to access, extract and integrate the information. The biomedical literature is the major repository of biomedical knowledge. It serves as the source for structured information that populates biological databases, via the process of expert distillation (or curation) of the literature. Today, the literature has grown to the point where an individual scientist cannot read all the relevant literature, and curators of the major biological databases have trouble keeping up to date with newly published articles. Furthermore, important biomedical applications, such as drug discovery and analysis of high-throughput data sets, are dependent on integration of all available information from both biological databases and the literature. The article reviews these applications, focusing on the role of text mining in providing semantic indices into the literature, as well as the importance of interactive tools to augment the power of the human expert to extract information from the literature. These tools are critical in supporting expert curation, finding relationships among biological entities, and creating content for a Semantic Web.

Key words

text mining, natural language processing, information extraction, indexing, document retrieval, entity tagging, entity identification, adaptation, drug discovery, high-throughput experiments, curation, annotation




Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  • Lynette Hirschman (1)
  • William S. Hayes (2)
  • Alfonso Valencia (3)

  1. The MITRE Corporation, USA
  2. Biogen Idec, USA
  3. Spanish National Cancer Research Centre, Spain
