Chemical Named Entity Recognition: Improving Recall Using a Comprehensive List of Lexical Features

  • Andre Lamurias
  • João Ferreira
  • Francisco M. Couto
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 294)


As the number of published scientific papers grows every day, there is an increasing need for automated named entity recognition (NER) systems capable of identifying relevant entities mentioned in a given text, such as chemical entities. Since high precision values are crucial to deliver useful results, we developed a NER method, Identifying Chemical Entities (ICE), which was tuned for precision. As a result, ICE achieved the second highest precision value in the BioCreative IV CHEMDNER task, but with significantly low recall values. This paper shows how the use of simple lexical features improved the recall of ICE while maintaining high levels of precision. Using a selection of the best features tested, ICE obtained a best recall of 27.2% for a precision of 92.4%.
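To give a sense of what "simple lexical features" can look like for a token-level tagger such as a CRF, the following is a minimal sketch. The abstract does not list the features ICE uses, so every feature here (prefix, suffix, digit and hyphen flags, word shape) and the function names `word_shape`, `lexical_features` and `sentence_features` are illustrative assumptions, not the authors' feature set.

```python
import re


def word_shape(token: str) -> str:
    """Map a token to a coarse shape (hypothetical feature).

    Uppercase letters become 'A', lowercase 'a', digits '0';
    other characters are kept. Repeated symbols are collapsed.
    """
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("A")
        elif ch.islower():
            shape.append("a")
        elif ch.isdigit():
            shape.append("0")
        else:
            shape.append(ch)
    # Collapse runs of the same symbol, e.g. "Aaaaa" -> "Aa"
    return re.sub(r"(.)\1+", r"\1", "".join(shape))


def lexical_features(token: str) -> dict:
    """Return an assumed set of lexical features for one token."""
    return {
        "lower": token.lower(),
        "prefix3": token[:3].lower(),
        "suffix3": token[-3:].lower(),
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "is_capitalized": token[:1].isupper(),
        "shape": word_shape(token),
        "length": len(token),
    }


def sentence_features(tokens: list) -> list:
    """One feature dictionary per token, in sentence order."""
    return [lexical_features(t) for t in tokens]


if __name__ == "__main__":
    for feats in sentence_features(["Aspirin", "inhibits", "COX-2"]):
        print(feats)
```

Feature dictionaries of this kind are a common input format for CRF toolkits. Which features actually help recall without hurting precision is the empirical question the paper addresses.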


Text mining, Conditional Random Fields, Named Entity Recognition, Chemical Compounds, ChEBI




Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Andre Lamurias (1)
  • João Ferreira (1)
  • Francisco M. Couto (1)
  1. Dep. de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
