CRFs based parallel biomedical named entity recognition algorithm employing MapReduce framework

Tang, Zhuo; Jiang, Lingang; Yang, Li; Li, Kenli; Li, Keqin

doi:10.1007/s10586-015-0426-z

Published: 22 January 2015

Volume 18, pages 493–505, (2015)

Cluster Computing

Zhuo Tang¹,
Lingang Jiang¹,
Li Yang²,
Kenli Li¹ &
…
Keqin Li³

669 Accesses
21 Citations

Abstract

With the rapid growth of the biomedical literature, model training time in biomedical named entity recognition increases sharply when dealing with large-scale training samples. Improving the efficiency of named entity recognition on biomedical big data has therefore become one of the key problems in biomedical text mining. To improve recognition performance and reduce training time, this paper proposes an optimization method for two-phase recognition using conditional random fields (CRFs). In the first phase, the boundary of each named entity is detected to distinguish all real entities. In the second phase, the semantic class of each detected entity is labeled. To speed up training in both phases, we implement the model training process on a parallel optimization framework based on MapReduce. The training set is divided into several parts, and the iterations of the training algorithm are designed as map tasks that can be executed simultaneously in a cluster, where each map function computes a gradient vector component for one part of the training set. Our experiments show that the proposed method achieves high performance with short training time, which has important implications for current biological big data processing.
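The decomposition described in the abstract, where each map task computes the gradient contribution of one part of the training set and the results are combined, can be illustrated with a minimal sketch. The abstract does not give the CRF gradient formulas, so this example swaps in binary logistic regression, a much simpler log-linear model whose gradient is likewise a sum over training samples. All function names, the toy data, and the shard count are hypothetical, and the map step runs sequentially here rather than on a cluster.

```python
import math
from functools import reduce

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def map_gradient(weights, shard):
    """Map task (hypothetical): gradient of the negative log-likelihood
    restricted to one shard of the training set.
    NOTE: logistic regression stands in for the CRF model here."""
    grad = [0.0] * len(weights)
    for features, label in shard:
        score = sum(w * x for w, x in zip(weights, features))
        error = sigmoid(score) - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad

def reduce_gradients(a, b):
    """Reduce step: element-wise sum of two partial gradient vectors."""
    return [x + y for x, y in zip(a, b)]

def split(samples, n_parts):
    """Divide the training set into n_parts shards (round-robin)."""
    return [samples[i::n_parts] for i in range(n_parts)]

def sharded_gradient(weights, samples, n_parts):
    shards = split(samples, n_parts)
    # In a real cluster each shard would be handled by a separate map task;
    # here the built-in map() evaluates them one after another.
    partials = map(lambda shard: map_gradient(weights, shard), shards)
    return reduce(reduce_gradients, partials)

# Toy data: (feature vector, binary label). Values are made up for illustration.
samples = [
    ([1.0, 0.5], 1),
    ([1.0, -1.5], 0),
    ([1.0, 2.0], 1),
    ([1.0, -0.3], 0),
    ([1.0, 0.1], 1),
]
weights = [0.1, -0.2]

full = map_gradient(weights, samples)            # single-machine gradient
sharded = sharded_gradient(weights, samples, 3)  # sum of per-shard gradients
```

Because the objective is a sum over training samples, the sharded result matches the single-machine gradient up to floating-point rounding. This is the property that lets the gradient calculation in each training iteration be spread across map tasks.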

Acknowledgments

The authors are grateful to the three anonymous reviewers for their criticism and comments, which have helped to improve the presentation and quality of the paper. This work is supported by the Key Program of the National Natural Science Foundation of China (Grant No. 61133005) and the National Natural Science Foundation of China (Grant Nos. 61370095, 61432005).

Author information

Authors and Affiliations

College of Information Science and Engineering, Hunan University, Changsha, 410082, China
Zhuo Tang, Lingang Jiang & Kenli Li
College of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, 410004, China
Li Yang
Department of Computer Science, State University of New York, New Paltz, NY, 12561, USA
Keqin Li

Corresponding author

Correspondence to Zhuo Tang.

About this article

Cite this article

Tang, Z., Jiang, L., Yang, L. et al. CRFs based parallel biomedical named entity recognition algorithm employing MapReduce framework. Cluster Comput 18, 493–505 (2015). https://doi.org/10.1007/s10586-015-0426-z

Received: 06 October 2014
Revised: 04 January 2015
Accepted: 10 January 2015
Published: 22 January 2015
Issue Date: June 2015
DOI: https://doi.org/10.1007/s10586-015-0426-z
