Journal of Computer Science and Technology, Volume 25, Issue 1, pp 169–179

New Challenges for Biological Text-Mining in the Next Decade

  • Hong-Jie Dai
  • Yen-Ching Chang
  • Richard Tzong-Han Tsai
  • Wen-Lian Hsu
Regular Paper


The massive flow of scholarly publications from traditional paper journals to online outlets has benefited biologists because online articles are easy to access. However, owing to the sheer volume of available biological literature, researchers are finding it increasingly difficult to locate the information they need. As a result, recent biology contests, notably JNLPBA and BioCreAtIvE, have focused on evaluating various methods by which the literature may be navigated. Among these methods, text-mining technology has shown the most promise. With recent advances in text-mining technology and the fact that publishers are now making the full texts of articles available in XML format, text-mining systems (TMSs) can be adapted to accelerate literature curation, maintain the integrity of information, and ensure proper linkage of data to other resources. Even so, several new challenges have emerged in relation to full-text analysis, life-science terminology, complex relation extraction, and information fusion. These challenges must be overcome for text mining to become more effective. In this paper, we identify the challenges, discuss how they might be overcome, and consider the resources that may be helpful in achieving that goal.


bioinformatics database, mining method and algorithm, text mining





Copyright information

© Springer 2010

Authors and Affiliations

  • Hong-Jie Dai (1, 2)
  • Yen-Ching Chang (1)
  • Richard Tzong-Han Tsai (3)
  • Wen-Lian Hsu (1, 2)

  1. Institute of Information Science, “Academia Sinica”, Taiwan, China
  2. Department of Computer Science, “National Tsing-Hua University”, Taiwan, China
  3. Department of Computer Science and Engineering, Yuan Ze University, Taiwan, China
