Protein Complex Mention Recognition with Web-Based Knowledge Learning

  • Ruoyao Ding
  • Xiaoyi Pan
  • Yingying Qu
  • Cathy H. Wu
  • K. Vijay-Shanker
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11284)


Protein complexes play an essential role in cellular functions and are an important class of named entity in the biomedical field. Since experimental results on protein complexes are usually published in scientific articles, recognizing protein complex mentions in the literature is a crucial step in discovering protein complex-related information from existing research. In this paper, we propose a method for protein complex mention recognition that applies knowledge automatically learned from PubMed. Evaluation shows that our method achieves an F1-score of 81%, demonstrating its effectiveness in the protein complex recognition task.


Keywords: Named entity recognition; Protein complex; Conditional Random Field



This work was supported by grants from the National Key R&D Program of China (2016YFF0204205, 2018YFF0213901) and the China National Institute of Standardization (522016Y-4681, 522018Y-5948, 522018Y-5941).



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Ruoyao Ding (1)
  • Xiaoyi Pan (1)
  • Yingying Qu (2)
  • Cathy H. Wu (3)
  • K. Vijay-Shanker (3)

  1. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China
  2. School of Business, Guangdong University of Foreign Studies, Guangzhou, China
  3. Department of Computer and Information Science, University of Delaware, Newark, USA
