Abstract
A variety of data-level, algorithm-level, and hybrid methods have been used to address the challenges of training predictive models with class-imbalanced data. While many of these techniques have been extended to deep neural network (DNN) models, relatively few studies emphasize the significance of output thresholding. In this chapter, we relate DNN outputs to Bayesian a posteriori probabilities and suggest that the Default threshold of 0.5 is almost never optimal when training data is imbalanced. We simulate a wide range of class imbalance levels using three real-world data sets, i.e., positive class sizes of 0.03%–90%, and compare Default threshold results to two alternative thresholding strategies. The Optimal threshold strategy searches validation data (or training data) for the classification threshold that maximizes the geometric mean. The Prior threshold strategy requires no optimization; it simply sets the classification threshold to the prior probability of the positive class. Multiple deep architectures are explored, and all experiments are repeated 30 times to account for random error. Linear models and visualizations show that the Optimal threshold is strongly correlated with the positive class prior. Confidence intervals show that the Default threshold performs well only when training data is balanced, while Optimal thresholds perform significantly better when training data is skewed. Surprisingly, statistical results show that the Prior threshold performs as well as the Optimal threshold across all distributions. The contributions of this chapter are twofold: (1) illustrating the side effects of training deep models with highly imbalanced big data and (2) comparing multiple thresholding strategies for maximizing class-wise performance with imbalanced training data.
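To make the two alternative strategies concrete, the sketch below (an illustration under stated assumptions, not the authors' implementation) derives both thresholds from 0/1 labels and model-predicted positive-class probabilities. The function names, the 0.01-step candidate grid, and the use of NumPy are assumptions introduced for this example.

```python
# Illustrative sketch of the Prior and Optimal thresholding strategies
# described in the abstract; names and the candidate grid are assumptions.
import numpy as np

def prior_threshold(y_train):
    # Prior strategy: the threshold equals the positive class prior,
    # i.e., the fraction of positive (1) labels in the training data.
    return float(np.mean(y_train))

def optimal_threshold(y_val, p_val, candidates=np.linspace(0.01, 0.99, 99)):
    # Optimal strategy: grid-search the threshold that maximizes the
    # geometric mean of true positive rate and true negative rate on
    # validation (or training) data.
    best_t, best_gmean = 0.5, -1.0
    for t in candidates:
        y_pred = (p_val >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_val == 1))
        tn = np.sum((y_pred == 0) & (y_val == 0))
        fp = np.sum((y_pred == 1) & (y_val == 0))
        fn = np.sum((y_pred == 0) & (y_val == 1))
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
        tnr = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
        gmean = np.sqrt(tpr * tnr)
        if gmean > best_gmean:
            best_t, best_gmean = float(t), gmean
    return best_t
```

For instance, with a 1% positive class, prior_threshold returns roughly 0.01, so any instance whose predicted probability exceeds the class prior is labeled positive, which is far more permissive than the Default threshold of 0.5.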
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Johnson, J.M., Khoshgoftaar, T.M. (2021). Thresholding Strategies for Deep Learning with Highly Imbalanced Big Data. In: Wani, M.A., Khoshgoftaar, T.M., Palade, V. (eds) Deep Learning Applications, Volume 2. Advances in Intelligent Systems and Computing, vol 1232. Springer, Singapore. https://doi.org/10.1007/978-981-15-6759-9_9
DOI: https://doi.org/10.1007/978-981-15-6759-9_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-6758-2
Online ISBN: 978-981-15-6759-9
eBook Packages: Intelligent Technologies and Robotics (R0)