Thresholding Strategies for Deep Learning with Highly Imbalanced Big Data

  • Chapter
  • First Online:
Deep Learning Applications, Volume 2

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1232))

Abstract

A variety of data-level, algorithm-level, and hybrid methods have been used to address the challenges associated with training predictive models on class-imbalanced data. While many of these techniques have been extended to deep neural network (DNN) models, relatively few studies emphasize the significance of output thresholding. In this chapter, we relate DNN outputs to Bayesian a posteriori probabilities and suggest that the Default threshold of 0.5 is almost never optimal when training data is imbalanced. We simulate a wide range of class imbalance levels using three real-world data sets, i.e., positive class sizes of 0.03%–90%, and we compare Default threshold results to two alternative thresholding strategies. The Optimal threshold strategy uses validation data or training data to search for the classification threshold that maximizes the geometric mean. The Prior threshold strategy requires no optimization and instead sets the classification threshold to the prior probability of the positive class. Multiple deep architectures are explored, and all experiments are repeated 30 times to account for random error. Linear models and visualizations show that the Optimal threshold is strongly correlated with the positive class prior. Confidence intervals show that the Default threshold only performs well when training data is balanced and that Optimal thresholds perform significantly better when training data is skewed. Surprisingly, statistical results show that the Prior threshold performs consistently as well as the Optimal threshold across all distributions. The contributions of this chapter are twofold: (1) illustrating the side effects of training deep models with highly imbalanced big data and (2) comparing multiple thresholding strategies for maximizing class-wise performance with imbalanced training data.
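The three strategies compared in the abstract can be sketched in a few lines. The following is a minimal illustration, not the chapter's actual experimental code: the function names (`g_mean`, `optimal_threshold`, `prior_threshold`) and the 0.01-step search grid are assumptions for this example. It shows how the Optimal threshold is found by maximizing the geometric mean of true positive rate and true negative rate over candidate thresholds, while the Prior threshold is simply the positive class rate of the training labels.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of true positive rate and true negative rate."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return float(np.sqrt(tpr * tnr))

def optimal_threshold(y_true, scores, grid=None):
    """Optimal strategy: search a grid of thresholds on held-out
    (validation or training) scores for the one maximizing the g-mean."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)  # assumed 0.01-step grid
    return max(grid, key=lambda t: g_mean(y_true, (scores >= t).astype(int)))

def prior_threshold(y_train):
    """Prior strategy: no search; the threshold is the positive class prior."""
    return float(np.mean(y_train))

# Toy example: 20% positive class, scores separable around 0.35.
y = np.array([0, 0, 0, 0, 1])
s = np.array([0.1, 0.2, 0.1, 0.3, 0.4])
t_prior = prior_threshold(y)          # 0.2, far below the Default 0.5
t_opt = optimal_threshold(y, s)       # lands between 0.3 and 0.4
```

Note that with a Default threshold of 0.5, the lone positive example (score 0.4) would be misclassified; both alternative strategies move the threshold below the positive score, which is the behavior the chapter's experiments quantify at scale.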



Author information


Correspondence to Justin M. Johnson.


Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Johnson, J.M., Khoshgoftaar, T.M. (2021). Thresholding Strategies for Deep Learning with Highly Imbalanced Big Data. In: Wani, M.A., Khoshgoftaar, T.M., Palade, V. (eds) Deep Learning Applications, Volume 2. Advances in Intelligent Systems and Computing, vol 1232. Springer, Singapore. https://doi.org/10.1007/978-981-15-6759-9_9
