Balanced undersampling: a novel sentence-based undersampling method to improve recognition of named entities in chemical and biomedical text

Akkasi, Abbas; Varoğlu, Ekrem; Dimililer, Nazife

doi:10.1007/s10489-017-0920-5

Published: 17 April 2017

Volume 48, pages 1965–1978 (2018)

Applied Intelligence

Abstract

The class imbalance problem is a key factor affecting the performance of many classification tasks that use machine learning methods. It arises when the number of samples in certain classes is much greater than in others. Such imbalance biases classifiers toward the majority class or classes, which often results in high-precision, low-recall classifiers. Named entity recognition (NER) in free text suffers from this problem to a large extent because, in any given free text, many samples do not belong to any entity. Furthermore, the data used in this type of classification is sequential, unlike the data used in other common classification tasks such as image classification, spam detection, and text classification, in which no semantic or syntactic relation exists between samples. In this study, we propose an undersampling approach for sequential data that preserves the existing correlations between the sequenced samples that make up sentences and thus improves the performance of classifiers. We call this method balanced undersampling (BUS). Considering the recent increase in interest in NER in the chemical and biomedical domains, the proposed method is developed and tested on four recent state-of-the-art corpora in these domains: BioCreative IV ChemDNER, the Bio-entity Recognition Challenge of JNLPBA (JNLPBA), SemEval2013 DDI DrugBank, and SemEval2013 DDI Medline. The performance of the proposed method is evaluated against two other common undersampling methods: random undersampling and stop-word filtering. Our method is shown to outperform both methods with respect to F-score on all datasets used.
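
The contrast between token-level and sentence-level undersampling can be illustrated with a minimal sketch. This is not the BUS algorithm, whose details are not given in the abstract; it is a generic illustration under stated assumptions: sentences are lists of (token, label) pairs, the label "O" marks non-entity tokens, and the function names, the `keep_prob` parameter, and the toy corpus are all hypothetical.

```python
import random
from typing import List, Tuple

# Assumed representation: a sentence is a list of (token, label) pairs,
# and the label "O" marks tokens outside any named entity.
Sentence = List[Tuple[str, str]]


def token_level_undersample(sentences: List[Sentence], keep_prob: float,
                            seed: int = 0) -> List[Sentence]:
    """Randomly drop individual "O" tokens. This reduces the majority class
    but breaks the sequence context around the remaining tokens."""
    rng = random.Random(seed)
    return [
        [(tok, lab) for tok, lab in sent if lab != "O" or rng.random() < keep_prob]
        for sent in sentences
    ]


def sentence_level_undersample(sentences: List[Sentence], keep_prob: float,
                               seed: int = 0) -> List[Sentence]:
    """Randomly drop whole sentences that contain no entity tokens.
    Every sentence that is kept stays intact, so token order is preserved."""
    rng = random.Random(seed)
    kept = []
    for sent in sentences:
        has_entity = any(lab != "O" for _, lab in sent)
        if has_entity or rng.random() < keep_prob:
            kept.append(sent)
    return kept


def outside_ratio(sentences: List[Sentence]) -> float:
    """Fraction of tokens labeled "O" (the majority class)."""
    labels = [lab for sent in sentences for _, lab in sent]
    return labels.count("O") / len(labels) if labels else 0.0


# Hypothetical toy corpus, for illustration only.
corpus: List[Sentence] = [
    [("Aspirin", "B-CHEM"), ("inhibits", "O"), ("COX-1", "B-CHEM"), (".", "O")],
    [("The", "O"), ("results", "O"), ("were", "O"), ("significant", "O"), (".", "O")],
    [("Patients", "O"), ("received", "O"), ("ibuprofen", "B-CHEM"), (".", "O")],
]

reduced = sentence_level_undersample(corpus, keep_prob=0.0)
print(outside_ratio(corpus), outside_ratio(reduced))
```

With `keep_prob=0.0`, the sentence-level variant removes every entity-free sentence, which lowers the share of "O" tokens while leaving each remaining sentence exactly as it appeared in the corpus. The token-level variant reaches a similar class ratio only by deleting tokens from inside sentences.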

Author information

Authors and Affiliations

Department of Computer Engineering, Bandar Abbas Branch, Islamic Azad University, Bandar Abbas, Iran
Abbas Akkasi
Computer Engineering Department, Eastern Mediterranean University, Famagusta, North Cyprus, via Mersin 10, Turkey
Ekrem Varoğlu
Information Technology Department, Eastern Mediterranean University, Famagusta, North Cyprus, via Mersin 10, Turkey
Nazife Dimililer

Corresponding author

Correspondence to Abbas Akkasi.

About this article

Cite this article

Akkasi, A., Varoğlu, E. & Dimililer, N. Balanced undersampling: a novel sentence-based undersampling method to improve recognition of named entities in chemical and biomedical text. Appl Intell 48, 1965–1978 (2018). https://doi.org/10.1007/s10489-017-0920-5

Published: 17 April 2017
Issue Date: August 2018
DOI: https://doi.org/10.1007/s10489-017-0920-5
