Identification of Chemical Entities in Patent Documents

  • Tiago Grego
  • Piotr Pęzik
  • Francisco M. Couto
  • Dietrich Rebholz-Schuhmann
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5518)


Biomedical literature is an important source of information on chemical compounds. However, different representations and nomenclatures for chemical entities exist, which makes references to chemical entities ambiguous. Many systems already exist for gene and protein entity recognition, but very few exist for chemical entities. The main reason is the lack of corpora with which to train and evaluate named entity recognition systems.

In this paper we present a chemical entity recognizer that uses a machine learning approach based on conditional random fields (CRF), and we compare its performance with dictionary-based approaches that use several terminological resources. A gold standard of manually curated patent documents was used for training and evaluation. While the dictionary-based systems perform well in partial identification of chemical entities, the machine learning approach performs better when identifying complete entities (a 10% increase in F-score over the best dictionary-based system).
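The difference between partial and complete identification can be illustrated with a minimal sketch. It assumes that entities are represented as character-offset spans, that "complete" means identical boundaries, and that "partial" means any overlap with a gold-standard span; the abstract does not state the exact matching criteria, so these definitions, the helper names (`dictionary_spans`, `f_score`), and the example sentence are all hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def dictionary_spans(text: str, terms: List[str]) -> List[Span]:
    """Toy dictionary lookup: every occurrence of every term, as a span."""
    spans = []
    for term in terms:
        start = text.find(term)
        while start != -1:
            spans.append((start, start + len(term)))
            start = text.find(term, start + 1)
    return sorted(spans)


def overlaps(a: Span, b: Span) -> bool:
    """True when two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def f_score(gold: List[Span], predicted: List[Span], partial: bool = False) -> float:
    """Balanced F-score of predicted spans against gold spans.

    partial=False (assumed "complete" match): boundaries must be identical.
    partial=True  (assumed "partial" match): any overlap counts as a hit.
    """
    match = overlaps if partial else (lambda a, b: a == b)
    # Predictions that hit some gold span (used for precision).
    correct_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    # Gold spans hit by some prediction (used for recall).
    found_gold = sum(any(match(p, g) for p in predicted) for g in gold)
    precision = correct_pred / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence with two gold-standard entities.
text = "2-chloroethyl ethyl sulfide was added to acetone"
gold = dictionary_spans(text, ["2-chloroethyl ethyl sulfide", "acetone"])
# A small dictionary that knows only a fragment of the first name.
predicted = dictionary_spans(text, ["sulfide", "acetone"])

exact_f = f_score(gold, predicted)                  # complete-entity matching
partial_f = f_score(gold, predicted, partial=True)  # overlap matching
print(exact_f, partial_f)
```

In this toy case the dictionary finds only a fragment of the longer systematic name, so it scores well under overlap matching but loses credit under complete matching, which is the kind of gap the abstract reports between the two families of systems.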


Chemical Named Entity Recognition; Conditional Random Fields; Text Mining





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Tiago Grego (1)
  • Piotr Pęzik (2)
  • Francisco M. Couto (1)
  • Dietrich Rebholz-Schuhmann (2)
  1. Faculty of Sciences, University of Lisbon, Lisboa, Portugal
  2. EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
