
Evaluating Named-Entity Recognition Approaches in Plant Molecular Biology

  • Huy Do
  • Khoat Than
  • Pierre Larmande
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11248)

Abstract

Text mining is becoming an important topic in biology, where it is used to extract biological entities from scientific papers and thereby extend biological knowledge. However, few thorough studies have addressed plant molecular biology data, especially for rice, which has resulted in a lack of datasets for training advanced machine learning methods that detect entities such as genes and proteins. In this article, we first developed a dataset from Oryzabase, a database of rice genes, and used it as a benchmark. We then evaluated two Named-Entity Recognition (NER) methods for sequence tagging: a Long Short-Term Memory (LSTM) model combined with Conditional Random Fields (CRFs), and a hybrid method that combines dictionary lookup with machine learning systems to improve its results. We analyzed the performance of these methods when applied to the Oryzabase dataset and improved their results. On average, LSTM-CRF reached 86% in \(F_{1}\), making its results the more exploitable of the two.
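The \(F_{1}\) figure quoted above is the harmonic mean of precision and recall. The abstract does not say how matches were counted, so the following is only a minimal sketch of one common convention for sequence tagging: exact-match, entity-level scoring over BIO tags. The function names and the `GENE` label are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); assumed representation
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence, e.g. ["B-GENE", "I-GENE", "O"].

    An "I-" tag that does not continue a span of the same type is ignored.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), counting only exact span-and-type matches."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy sentence: two gold entities, of which the tagger finds one.
    gold = [["B-GENE", "I-GENE", "O", "B-GENE"]]
    pred = [["B-GENE", "I-GENE", "O", "O"]]
    print(entity_f1(gold, pred))
```

In this toy case one of the two gold entities is recovered and no spurious entity is predicted, so precision is 1.0 and recall is 0.5. Partial overlaps count as errors under this convention; a more lenient scheme would yield higher scores.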

Keywords

Text mining, LSTM-CRF, NER, Bioinformatics, Plant genomics

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. University of Science and Technology of Hanoi (USTH), ICT Lab, Hanoi, Vietnam
  2. Institute of Research for Development (IRD), LMI RICE, DIADE, Montpellier, France
  3. Hanoi University of Science and Technology (HUST), Hanoi, Vietnam
