Automating the Extraction of Essential Genes from Literature

  • Ruben Rodrigues
  • Hugo Costa
  • Miguel Rocha
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10933)


The construction of repositories with curated information about gene essentiality for organisms of interest in Biotechnology is a highly relevant task, particularly for the design of cell factories for the enhanced production of added-value products. However, it requires retrieving and extracting relevant information from the literature, which makes manual curation costly. Text mining tools that implement methods for tasks such as information retrieval, named entity recognition and event extraction have been developed to automate this process and reduce the time required to obtain relevant information from the literature in many biomedical fields. However, current tools are not designed or optimized for identifying mentions of essential genes in scientific texts.

In this work, we propose a pipeline to automatically extract mentions of genes and to classify them according to their essentiality for a specific organism. This pipeline implements a machine learning approach that is trained on a manually curated set of documents related to gene essentiality in yeast. This corpus is provided as a resource for the community, as a benchmark for the development of new methods. Our pipeline was evaluated using resampling and cross-validation over this curated dataset, achieving an accuracy of over 80% and an F1-score of over 75%.
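To make the evaluation protocol concrete, the following is a minimal sketch of k-fold cross-validation that reports mean accuracy and F1-score for a binary essential/non-essential classifier. It is not the authors' implementation: the function names, the label strings, the number of folds and the `train_fn` callback are all hypothetical, and the resampling step mentioned above is not shown.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def accuracy_and_f1(gold: Sequence[str], predicted: Sequence[str],
                    positive: str = "essential") -> Tuple[float, float]:
    """Return (accuracy, F1 of the positive class) for two label sequences."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


def k_fold_indices(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validate(examples: Sequence, labels: Sequence[str],
                   train_fn: Callable[[list, list], Callable],
                   k: int = 5) -> Tuple[float, float]:
    """Train on k-1 folds, score the held-out fold, and average over folds.

    train_fn is a hypothetical callback: it receives training examples and
    labels and returns a function mapping one example to a predicted label.
    """
    scores = []
    for train, test in k_fold_indices(len(examples), k):
        model = train_fn([examples[i] for i in train],
                         [labels[i] for i in train])
        predicted = [model(examples[i]) for i in test]
        scores.append(accuracy_and_f1([labels[i] for i in test], predicted))
    mean_accuracy = sum(a for a, _ in scores) / len(scores)
    mean_f1 = sum(f for _, f in scores) / len(scores)
    return mean_accuracy, mean_f1
```

In practice the examples would be feature vectors derived from annotated gene mentions, and the data would normally be shuffled or stratified before being split into folds.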



This work is co-funded by the North Portugal Regional Operational Programme, under “Portugal 2020”, through the European Regional Development Fund (ERDF), within project SISBI (Ref. NORTE-01-0247-FEDER-003381).

The Centre of Biological Engineering (CEB), University of Minho, sponsored all computational hardware and software required for this work.

Conflict of Interest

The authors declare that they have no conflict of interest regarding this article.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. CEB - Centre of Biological Engineering, University of Minho, Braga, Portugal
  2. Silicolife Lda, Braga, Portugal
