Balanced undersampling: a novel sentence-based undersampling method to improve recognition of named entities in chemical and biomedical text

Applied Intelligence

Abstract

The class imbalance problem is a key factor affecting the performance of many machine learning classification tasks. It arises when the number of samples in certain classes is much greater than in others. Such imbalance biases classifiers toward the majority class or classes, which typically results in high-precision/low-recall classifiers. Named entity recognition (NER) in free text suffers from this problem to a large extent because, in any given free text, many samples do not belong to a specific entity. Furthermore, the data used in this type of classification are sequential and thus differ from the data used in other common classification tasks such as image classification, spam detection, and text classification, in which no semantic or syntactic relation exists between samples. In this study, we propose an undersampling approach for sequential data that preserves the existing correlations between the sequenced samples that make up sentences and thus improves classifier performance. We call this method balanced undersampling (BUS). Given the recent increase in interest in NER in the chemical and biomedical domains, the proposed method is developed and tested on four recent state-of-the-art corpora from these domains: BioCreative IV ChemDNER, the Bio-entity Recognition Challenge of JNLPBA (JNLPBA), SemEval2013 DDI DrugBank, and SemEval2013 DDI Medline. The performance of the proposed method is evaluated against two other common undersampling methods: random undersampling and stop-word filtering. Our method outperforms both with respect to F-score on all datasets used.
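
The abstract does not spell out the BUS procedure, so the following is not the authors' algorithm. It is only a minimal sketch of the general idea of undersampling at the sentence level, so that the tokens of every retained sentence stay together in their original order. The tag scheme (`"O"` for non-entity tokens), the function names `outside_ratio` and `undersample_sentences`, and the `keep_fraction` parameter are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

# A sentence is a list of (token, tag) pairs; "O" marks non-entity tokens (assumed tag scheme).
Sentence = List[Tuple[str, str]]


def outside_ratio(sentences: List[Sentence]) -> float:
    """Fraction of all tokens that carry the majority tag "O"."""
    tags = [tag for sent in sentences for _, tag in sent]
    return sum(tag == "O" for tag in tags) / len(tags) if tags else 0.0


def undersample_sentences(
    sentences: List[Sentence], keep_fraction: float = 0.5, seed: int = 0
) -> List[Sentence]:
    """Hypothetical sentence-level undersampler (not the paper's BUS).

    Every sentence that contains at least one entity token is kept whole.
    Of the sentences made up only of "O" tokens, a random `keep_fraction`
    is kept and the rest are dropped. Sentence order is preserved.
    """
    rng = random.Random(seed)
    entity_free = [
        i for i, sent in enumerate(sentences) if all(tag == "O" for _, tag in sent)
    ]
    n_drop = len(entity_free) - round(len(entity_free) * keep_fraction)
    dropped = set(rng.sample(entity_free, n_drop))
    return [sent for i, sent in enumerate(sentences) if i not in dropped]


# Toy corpus, invented for this example only.
corpus: List[Sentence] = [
    [("Aspirin", "B-CHEM"), ("inhibits", "O"), ("COX", "O")],
    [("We", "O"), ("thank", "O"), ("the", "O"), ("reviewers", "O")],
    [("Results", "O"), ("are", "O"), ("shown", "O")],
    [("Sodium", "B-CHEM"), ("chloride", "I-CHEM"), ("was", "O"), ("added", "O")],
]

reduced = undersample_sentences(corpus, keep_fraction=0.0)
print(outside_ratio(corpus), outside_ratio(reduced))
```

Unlike token-level random undersampling, this sketch never removes individual tokens from a sentence, so a sequence labeler still sees each retained sentence intact. How BUS itself selects what to remove is described in the full article, not in the abstract.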

Author information

Corresponding author

Correspondence to Abbas Akkasi.

Cite this article

Akkasi, A., Varoğlu, E. & Dimililer, N. Balanced undersampling: a novel sentence-based undersampling method to improve recognition of named entities in chemical and biomedical text. Appl Intell 48, 1965–1978 (2018). https://doi.org/10.1007/s10489-017-0920-5

  • DOI: https://doi.org/10.1007/s10489-017-0920-5
