Abstract
A variety of data-level, algorithm-level, and hybrid methods have been used to address the challenges of training predictive models with class-imbalanced data. While many of these techniques have been extended to deep neural network (DNN) models, relatively few studies emphasize the significance of output thresholding. In this chapter, we relate DNN outputs to Bayesian a posteriori probabilities and suggest that the Default threshold of 0.5 is almost never optimal when training data is imbalanced. We simulate a wide range of class imbalance levels using three real-world data sets, i.e., positive class sizes of 0.03%–90%, and compare Default threshold results to two alternative thresholding strategies. The Optimal threshold strategy searches validation data (or training data) for the classification threshold that maximizes the geometric mean. The Prior threshold strategy requires no optimization; it simply sets the classification threshold to the prior probability of the positive class. Multiple deep architectures are explored, and all experiments are repeated 30 times to account for random error. Linear models and visualizations show that the Optimal threshold is strongly correlated with the positive class prior. Confidence intervals show that the Default threshold performs well only when training data is balanced, while Optimal thresholds perform significantly better when training data is skewed. Surprisingly, statistical results show that the Prior threshold performs as well as the Optimal threshold across all distributions. The contributions of this chapter are twofold: (1) illustrating the side effects of training deep models with highly imbalanced big data and (2) comparing multiple thresholding strategies for maximizing class-wise performance with imbalanced training data.
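To make the two alternative strategies concrete, the sketch below (an illustration under stated assumptions, not the authors' implementation) derives both thresholds from 0/1 labels and model-predicted positive-class probabilities. The function names, the 0.01-step candidate grid, and the use of NumPy are assumptions introduced for this example.

```python
# Illustrative sketch of the Prior and Optimal thresholding strategies
# described in the abstract; names and the candidate grid are assumptions.
import numpy as np

def prior_threshold(y_train):
    # Prior strategy: the threshold equals the positive class prior,
    # i.e., the fraction of positive (1) labels in the training data.
    return float(np.mean(y_train))

def optimal_threshold(y_val, p_val, candidates=np.linspace(0.01, 0.99, 99)):
    # Optimal strategy: grid-search the threshold that maximizes the
    # geometric mean of true positive rate and true negative rate on
    # validation (or training) data.
    best_t, best_gmean = 0.5, -1.0
    for t in candidates:
        y_pred = (p_val >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_val == 1))
        tn = np.sum((y_pred == 0) & (y_val == 0))
        fp = np.sum((y_pred == 1) & (y_val == 0))
        fn = np.sum((y_pred == 0) & (y_val == 1))
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
        tnr = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
        gmean = np.sqrt(tpr * tnr)
        if gmean > best_gmean:
            best_t, best_gmean = float(t), gmean
    return best_t
```

For instance, with a 1% positive class, prior_threshold returns roughly 0.01, so any instance whose predicted probability exceeds the class prior is labeled positive, which is far more permissive than the Default threshold of 0.5.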
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Johnson, J.M., Khoshgoftaar, T.M. (2021). Thresholding Strategies for Deep Learning with Highly Imbalanced Big Data. In: Wani, M.A., Khoshgoftaar, T.M., Palade, V. (eds) Deep Learning Applications, Volume 2. Advances in Intelligent Systems and Computing, vol 1232. Springer, Singapore. https://doi.org/10.1007/978-981-15-6759-9_9
DOI: https://doi.org/10.1007/978-981-15-6759-9_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-6758-2
Online ISBN: 978-981-15-6759-9
eBook Packages: Intelligent Technologies and Robotics (R0)