
A New Classifier for Imbalanced Data Based on a Generalized Density Ratio Model

Published in Communications in Mathematics and Statistics

Abstract

Achieving a higher true positive rate while simultaneously lowering the false positive rate is a long-standing challenge in the imbalanced learning community. This work combines the penalized empirical likelihood method, the lower bound algorithm, and the Nyström method, and applies these techniques, together with the kernel method, to the density ratio model. The resulting classifier, the density ratio classifier (DRC), combines kernelization, regularization, efficient implementation, and threshold moving, all of which are critical to making DRC an effective and powerful method for difficult imbalance problems. Compared with other methods, DRC is competitive in that it is widely applicable, simple, and easy to use without additional imbalance-handling techniques. In addition, the convergence rate of the estimate of the log density ratio is discussed, and the numerical results show that DRC outperforms other methods in AUC and G-mean score.


References

1. Barreno, M., Cárdenas, A.A., Tygar, J.D.: Optimal ROC curve for a combination of classifiers. In: Advances in Neural Information Processing Systems (NIPS) (2007)
2. Berlinet, A.: Reproducing kernels in probability and statistics. More Progresses in Analysis (2014)
3. Böhning, D.: Multinomial logistic regression algorithm. Ann. Inst. Stat. Math. 44(1), 197–200 (1992)
4. Böhning, D., Lindsay, B.: Monotonicity of quadratic-approximation algorithms. Ann. Inst. Stat. Math. 40(4), 641–663 (1988)
5. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
6. Cai, S., Chen, J.: Empirical likelihood inference for multiple censored samples. Canadian J. Stat. 46(2), 212–232 (2018)
7. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)
8. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 785–794. Association for Computing Machinery (2016)
9. Chen, J., Liu, Y.: Quantile and quantile-function estimations under density ratio model. Ann. Stat. 41(3), 1669–1692 (2013)
10. Chen, J., Liu, Y.: Small area quantile estimation. Int. Stat. Rev. 87(S1), S219–S238 (2019)
11. Chen, B., Li, P., Qin, J., Yu, T.: Using a monotonic density ratio model to find the asymptotically optimal combination of multiple diagnostic tests. J. Am. Stat. Assoc. 111(514), 861–874 (2016)
12. Cheng, K.F., Chu, C.K.: Semiparametric density estimation under a two-sample density ratio model. Bernoulli 10(4), 583–604 (2004)
13. Collell, G., Prelec, D., Patil, K.R.: A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data. Neurocomputing 275, 330–340 (2018)
14. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
15. de Oliveira, V., Kedem, B.: Bayesian analysis of a density ratio model. Canadian J. Stat. 45, 274–289 (2017)
16. Denil, M., Trappenberg, T.: Overlap versus imbalance. In: Farzindar, A., Kešelj, V. (eds.) Advances in Artificial Intelligence, pp. 220–231. Springer, Berlin Heidelberg (2010)
17. Diao, G., Ning, J., Qin, J.: Maximum likelihood estimation for semiparametric density ratio model. Int. J. Biostat. 8(1), 1–29 (2012)
18. Dua, D., Graff, C.: UCI machine learning repository (2017)
19. Eguchi, S., Copas, J.: A class of logistic type discriminant functions. Biometrika 89(1), 1–22 (2002)
20. Fokianos, K., Kaimi, I.: On the effect of misspecifying the density ratio model. Ann. Inst. Stat. Math. 58(3), 475–497 (2006)
21. Gu, C.: Smoothing Spline ANOVA Models, vol. 297. Springer, New York (2013)
22. Härdle, W.: Nonparametric and Semiparametric Models. Springer, Berlin (2004)
23. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
24. Ho, T., Basu, M.: Complexity measures of supervised classification problems. IEEE Trans. Pattern Anal. Mach. Intell. 24, 289–300 (2002)
25. Qin, J.: Inferences for case-control and semiparametric two-sample density ratio models. Biometrika 85(3), 619–630 (1998)
26. Qin, J., Zhang, B.: Best combination of multiple diagnostic tests for screening purposes. Stat. Med. 29(28), 2905–2919 (2010)
27. Kanamori, T., Suzuki, T., Sugiyama, M.: Theoretical analysis of density ratio estimation. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E93A(4), 787–798 (2010)
28. Karsmakers, P., Pelckmans, K., Suykens, J.A.K.: Multi-class kernel logistic regression: a fixed-size implementation. In: 2007 International Joint Conference on Neural Networks, pp. 1756–1761 (2007)
29. Katzoff, M., Zhou, W., Khan, D., Lu, G., Kedem, B.: Out-of-sample fusion in risk prediction. J. Stat. Theory Practice 8(3), 444–459 (2014)
30. Kedem, B., Lu, G., Wei, R., Williams, P.D.: Forecasting mortality rates via density ratio modeling. Canadian J. Stat. 36(2), 193–206 (2010)
31. Kedem, B., Pan, L., Zhou, W., Coelho, C.A.: Interval estimation of small tail probabilities - applications in food safety. Stat. Med. 35(18), 3229–3240 (2016)
32. Kedem, B., Pan, L., Smith, P., Wang, C.: Repeated out of sample fusion in the estimation of small tail probabilities (2019)
33. Kernels and Reproducing Kernel Hilbert Spaces, pp. 110–163. Springer, New York (2008)
34. Liu, X.-Y., Wu, J., Zhou, Z.-H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B Cybern. 39(2), 539–550 (2009)
35. Liu, X.-Y., Wu, J., Zhou, Z.-H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B Cybern. 39(2), 539–550 (2009)
36. López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)
37. Luengo, J., Fernández, A., García, S., Herrera, F.: Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Comput. 15(10), 1909–1936 (2011)
38. Luo, X., Tsai, W.: A proportional likelihood ratio model. Biometrika 99(1), 1 (2011)
39. Maalouf, M., Homouz, D., Trafalis, T.B.: Logistic regression in large rare events and imbalanced data: a performance comparison of prior correction and weighting methods. Comput. Intell. 34(1), 161–174 (2018)
40. Prati, R., Batista, G., Monard, M.-C.: Learning with class skews and small disjuncts. In: Bazzan, A.L.C., Labidi, S. (eds.) Advances in Artificial Intelligence - SBIA 2004, pp. 296–306. Springer, Berlin Heidelberg (2004)
41. Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. The MIT Press, Cambridge (2009)
42. Rayhan, F., Ahmed, S., Mahbub, A., Jani, R., Shatabda, S., Farid, D.Md.: CUSBoost: cluster-based under-sampling with boosting for imbalanced classification. In: 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS) (2017)
43. Schölkopf, B., Herbrich, R., Smola, A.J.: A generalized representer theorem. In: Helmbold, D., Williamson, B. (eds.) Computational Learning Theory, pp. 416–426. Springer, Berlin Heidelberg (2001)
44. Seiffert, C., Khoshgoftaar, T., Hulse, J., Napolitano, A.: RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Humans 40, 185–197 (2010)
45. Shen, Y., Ning, J., Qin, J.: Likelihood approaches for the invariant density ratio model with biased sampling data. Biometrika 99(2), 363–378 (2012)
46. Tang, Y., Zhang, Y.Q., Chawla, N.V., Krasser, S.: SVMs modeling for highly imbalanced classification. IEEE Trans. Syst. Man Cybern. Part B Cybern. 39(1), 281–288 (2009)
47. Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874 (2006)
48. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2009)
49. Voulgaraki, A., Kedem, B., Graubard, B.I.: Semiparametric regression in testicular germ cell data. Ann. Appl. Stat. 6(3), 1185–1208 (2012)
50. Vuttipittayamongkol, P., Elyan, E., Petrovski, A., Jayne, C.: Overlap-based undersampling for improving imbalanced data classification. In: International Conference on Intelligent Data Engineering and Automated Learning, pp. 689–697 (2018)
51. Wang, H.: Logistic regression for massive data with rare events (2020)
52. Wang, Y.: Smoothing Splines: Methods and Applications, 1st edn. CRC Press, Boca Raton, FL (2011)
53. Wang, S., Yao, X.: Diversity analysis on imbalanced data sets by using ensemble models. In: 2009 IEEE Symposium on Computational Intelligence and Data Mining, pp. 324–331 (2009)
54. Wang, D., Tian, L., Zhao, Y.: Smoothed empirical likelihood for the Youden index. Comput. Stat. Data Anal. 115, 1–10 (2017)
55. Weiss, G.M.: The Impact of Small Disjuncts on Classifier Learning, vol. 8, pp. 193–226. Springer, US (2010)
56. Williams, C., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems 13 (NIPS 2000), Denver, CO, USA, pp. 682–688 (2000)
57. Xia, S.Y., Xiong, Z.Y., He, Y., Li, K., Dong, L.M., Zhang, M.: Relative density-based classification noise detection. Optik - Int. J. Light Electron Optics 125, 6829–6834 (2014)
58. Li, Y., Guo, H., Liu, X., Li, Y., Li, J.: Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data. Knowl.-Based Syst. 94, 88–104 (2016)
59. Zhou, Z.-H., Liu, X.-Y.: On multi-class cost-sensitive learning. Comput. Intell. 26, 232–257 (2010)
60. Zhuang, W.W., Hu, B.Y., Chen, J.: Semiparametric inference for the dominance index under the density ratio model. Biometrika 106, 229–241 (2019)


Author information


Corresponding author

Correspondence to Wenquan Cui.

Additional information

This research was supported by the National Natural Science Foundation of China (Grant No. 71873128).

Appendices

A Proof of Theorems

1.1 A.1 Proof of Theorem 2.1

Consider the quadratic functional

$$\begin{aligned} \frac{1}{n}\sum _{i = 1}^{n}u(h^{*}({\varvec{z}}_{i});y_{i})h({\varvec{z}}_{i}) + \frac{1}{2}V(h-h^{*}) + \frac{\lambda }{2}J(h), \end{aligned}$$
(4.1)

Here we add a factor of \(\frac{1}{2}\) to the last term for convenience; this has no effect on the technical development. Since V is completely continuous with respect to J and \(J(h)<\infty \), h can be expressed as the Fourier series expansion \(h = \sum _{\mu }h_{\mu }\psi _{\mu } \), where the \( \psi _{\mu } \) are eigenfunctions of J with respect to V and \( h_{\mu } = V(h,\psi _{\mu }) = \int _{{\mathcal {X}}}h(z)\psi _{\mu }(z)w(h^{*}(z))f(z)dz \) are the Fourier coefficients. Plugging the expansions \(h = \sum _{\mu }h_{\mu }\psi _{\mu } \) and \(h^{*} = \sum _{\mu }h_{\mu }^{*}\psi _{\mu }\) into (4.1), it is easy to show that the minimizer \(\tilde{h}\) of (4.1) has Fourier coefficients

$$\begin{aligned} \tilde{h}_{\mu } = (\beta _{\mu }+h_{\mu }^{*})/(1+\lambda \xi _{\mu }), \end{aligned}$$

where \(\beta _{\mu } = n^{-1}\sum _{i= 1}^{n}u(h^{*}({\varvec{z}}_{i});y_{i})\psi _{\mu }({\varvec{z}}_{i})\).

Since \(E(\beta _{\mu }) = 0\) and \(E(\beta _{\mu }^{2}) = \frac{1}{n}\), we have

$$\begin{aligned} E(V(\tilde{h}-h^{*}))&= E\left( \sum _{\mu }(\tilde{h}_{\mu }-h_{\mu }^{*})^{2}\right) = E\left( \sum _{\mu }\frac{\beta _{\mu }^{2}-2\beta _{\mu }\lambda \xi _{\mu }h_{\mu }^{*}+\lambda ^{2}\xi _{\mu }^{2}h_{\mu }^{*2}}{(1+\lambda \xi _{\mu })^{2}}\right) \nonumber \\&= \frac{1}{n}\sum _{\mu }\frac{1}{(1+\lambda \xi _{\mu })^{2}}+\lambda ^{p}\sum _{\mu }\frac{(\lambda \xi _{\mu })^{2-p}}{(1+\lambda \xi _{\mu })^{2}}\xi _{\mu }^{p}h_{\mu }^{*2}, \end{aligned}$$
(4.2)
$$\begin{aligned} E(\lambda J(\tilde{h}-h^{*}))&= E\left( \sum _{\mu }\lambda \xi _{\mu }(\tilde{h}_{\mu }-h_{\mu }^{*})^{2}\right) \nonumber \\&= E\left( \sum _{\mu }\lambda \xi _{\mu }\frac{\beta _{\mu }^{2}-2\beta _{\mu }\lambda \xi _{\mu }h_{\mu }^{*}+\lambda ^{2}\xi _{\mu }^{2}h_{\mu }^{*2}}{(1+\lambda \xi _{\mu })^{2}}\right) \nonumber \\&= \frac{1}{n}\sum _{\mu }\frac{\lambda \xi _{\mu }}{(1+\lambda \xi _{\mu })^{2}}+\lambda ^{p}\sum _{\mu }\frac{(\lambda \xi _{\mu })^{3-p}}{(1+\lambda \xi _{\mu })^{2}}\xi _{\mu }^{p}h_{\mu }^{*2}. \end{aligned}$$
(4.3)

When \(\lambda \rightarrow 0\) and condition 2 holds, Lemma 8.1 in [21] says that

$$\begin{aligned} \sum _{\mu }\frac{\lambda \xi _{\mu }}{(1+\lambda \xi _{\mu })^{2}}&= O(\lambda ^{-1/r}),\nonumber \\ \sum _{\mu }\frac{1}{(1+\lambda \xi _{\mu })^{2}}&= O(\lambda ^{-1/r}),\nonumber \\ \sum _{\mu }\frac{1}{1+\lambda \xi _{\mu }}&= O(\lambda ^{-1/r}). \end{aligned}$$
(4.4)

So, combining (4.2)–(4.4) and assuming condition 3, we immediately have

$$\begin{aligned} V(\tilde{h}-h^{*}) = O_{p}(n^{-1}\lambda ^{-1/r}+\lambda ^{p}) \end{aligned}$$

and

$$\begin{aligned} \lambda J(\tilde{h}-h^{*}) = O_{p}(n^{-1}\lambda ^{-1/r}+\lambda ^{p}) . \end{aligned}$$

Therefore,

$$\begin{aligned} (V+\lambda J)(\tilde{h}-h^{*}) = O_{p}(n^{-1}\lambda ^{-1/r}+\lambda ^{p}). \end{aligned}$$
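
As a brief aside (a standard consequence of the bound above, in the spirit of [21], and not part of the original proof): balancing the two terms \(n^{-1}\lambda ^{-1/r}\) and \(\lambda ^{p}\) suggests the usual optimal order of the smoothing parameter,

$$\begin{aligned} n^{-1}\lambda ^{-1/r} \asymp \lambda ^{p} \quad \Longrightarrow \quad \lambda \asymp n^{-r/(pr+1)}, \qquad (V+\lambda J)(\tilde{h}-h^{*}) = O_{p}\left( n^{-pr/(pr+1)}\right) . \end{aligned}$$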

1.2 A.2 Proof of Theorem 2.2

We first characterize the rate of \({\hat{h}}-\tilde{h}\) and then turn to \({\hat{h}}-h^{*}\). Define

$$\begin{aligned} A_{g,h}(\eta ) = \frac{1}{n}\sum _{i=1}^{n}l((g+\eta h)({\varvec{z}}_{i});y_{i})+\frac{\lambda }{2}J(g+\eta h) \end{aligned}$$
(4.5)

and

$$\begin{aligned} B_{g,h}(\eta ) = \frac{1}{n}\sum _{i=1}^{n}u(h^{*}({\varvec{z}}_{i});y_{i})(g+\eta h)({\varvec{z}}_{i})+\frac{1}{2}V(g+\eta h-h^{*})+\frac{\lambda }{2}J(g+\eta h). \end{aligned}$$
(4.6)

Then, it can be easily shown that

$$\begin{aligned} \frac{dA_{g,h}}{d\eta }|_{\eta = 0} = \frac{1}{n}\sum _{i=1}^{n}u(g({\varvec{z}}_{i});y_{i})h({\varvec{z}}_{i})+ \lambda J(g,h) \end{aligned}$$
(4.7)

and

$$\begin{aligned} \frac{dB_{g,h}}{d\eta }|_{\eta = 0} = \frac{1}{n}\sum _{i=1}^{n}u(h^{*}({\varvec{z}}_{i});y_{i})h({\varvec{z}}_{i})+V(g-h^{*},h) + \lambda J(g,h). \end{aligned}$$
(4.8)

Since \({\hat{h}}\) minimizes the penalized likelihood functional, setting \(g = {\hat{h}}\), \(h = {\hat{h}} - \tilde{h}\) in (4.7) gives

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}u({\hat{h}}({\varvec{z}}_{i});y_{i})({\hat{h}} - \tilde{h})({\varvec{z}}_{i})+ \lambda J({\hat{h}},{\hat{h}} - \tilde{h}) = 0. \end{aligned}$$
(4.9)

Similarly, since \(\tilde{h}\) minimizes (4.1), setting \(g = \tilde{h}\), \(h = {\hat{h}} - \tilde{h}\) in (4.8) gives

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}u(h^{*}({\varvec{z}}_{i});y_{i})({\hat{h}} - \tilde{h})({\varvec{z}}_{i})+V(\tilde{h}-h^{*},{\hat{h}}-\tilde{h})+ \lambda J(\tilde{h},{\hat{h}} - \tilde{h}) = 0. \end{aligned}$$
(4.10)

Subtracting (4.10) from (4.9), we can immediately get

$$\begin{aligned}&\frac{1}{n}\sum _{i=1}^{n}(u({\hat{h}}({\varvec{z}}_{i});y_{i})-u(\tilde{h}({\varvec{z}}_{i});y_{i}))({\hat{h}}({\varvec{z}}_{i})-\tilde{h}({\varvec{z}}_{i})) + \lambda J({\hat{h}} - \tilde{h})\nonumber \\&\quad =V(\tilde{h}-h^{*},{\hat{h}}-\tilde{h})-\frac{1}{n}\sum _{i=1}^{n}(u(\tilde{h}({\varvec{z}}_{i});y_{i})-u(h^{*}({\varvec{z}}_{i});y_{i}))({\hat{h}}({\varvec{z}}_{i})-\tilde{h}({\varvec{z}}_{i})). \end{aligned}$$
(4.11)

Using the mean value theorem, we can get

$$\begin{aligned} u({\hat{h}}({\varvec{z}}_{i});y_{i})-u(\tilde{h}({\varvec{z}}_{i});y_{i})= & {} w(h_{1}({\varvec{z}}_{i}))({\hat{h}}({\varvec{z}}_{i})-\tilde{h}({\varvec{z}}_{i})), \end{aligned}$$
(4.12)
$$\begin{aligned} u(\tilde{h}({\varvec{z}}_{i});y_{i})-u(h^{*}({\varvec{z}}_{i});y_{i})= & {} w(h_{2}({\varvec{z}}_{i}))(\tilde{h}({\varvec{z}}_{i})-h^{*}({\varvec{z}}_{i})), \end{aligned}$$
(4.13)

where \(h_{1}\) is a convex combination of \({\hat{h}}\) and \(\tilde{h}\), and \(h_{2}\) is that of \(\tilde{h}\) and \(h^{*}\). Substituting (4.12) and (4.13) back to (4.11), we have

$$\begin{aligned}&\frac{1}{n}\sum _{i=1}^{n}w(h_{1}({\varvec{z}}_{i}))({\hat{h}}({\varvec{z}}_{i})-\tilde{h}({\varvec{z}}_{i}))^{2}+\lambda J({\hat{h}}-\tilde{h}) \nonumber \\&\quad = V(\tilde{h}-h^{*},{\hat{h}}-\tilde{h})-\frac{1}{n}\sum _{i=1}^{n}w(h_{2}({\varvec{z}}_{i}))(\tilde{h}({\varvec{z}}_{i})-h^{*}({\varvec{z}}_{i}))({\hat{h}}({\varvec{z}}_{i})-\tilde{h}({\varvec{z}}_{i})). \end{aligned}$$
(4.14)

Under conditions 1, 2 and 5, and assuming \(\lambda \rightarrow 0\), \(n\lambda ^{r/2}\rightarrow \infty \), Lemma 8.17 in [21] indicates that

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^{n}w(h^{*}({\varvec{z}}_{i}))({\hat{h}}({\varvec{z}}_{i})-\tilde{h}({\varvec{z}}_{i}))^{2}= V({\hat{h}}-\tilde{h})+o_{p}((V+\lambda J)({\hat{h}}-\tilde{h})) \end{aligned}$$
(4.15)

and

$$\begin{aligned}&\frac{1}{n}\sum _{i=1}^{n}w(h^{*}({\varvec{z}}_{i}))(\tilde{h}({\varvec{z}}_{i})-h^{*}({\varvec{z}}_{i}))({\hat{h}}({\varvec{z}}_{i})-\tilde{h}({\varvec{z}}_{i}))\nonumber \\&\quad =V({\hat{h}}-\tilde{h},\tilde{h}-h^{*})+o_{p}(\{(V+\lambda J)({\hat{h}}-\tilde{h})(V+\lambda J)(\tilde{h}-h^{*})\}^{1/2}). \end{aligned}$$
(4.16)

Combining (4.14)–(4.16) and assuming condition 4, for some \(c\in [c_{1},c_{2}]\), we have

$$\begin{aligned}&(c_{1}V+\lambda J)({\hat{h}}-\tilde{h})(1+o_{p}(1)) \nonumber \\&\quad \le \{(|1-c|V+\lambda J)({\hat{h}}-\tilde{h})\}^{1/2}O_{p}(\{(|1-c|V+\lambda J)(\tilde{h}-h^{*})\}^{1/2}), \end{aligned}$$
(4.17)

which together with Theorem 2.1 implies that

$$\begin{aligned} (V+\lambda J)({\hat{h}}-\tilde{h}) = O_{p}(n^{-1}\lambda ^{-1/r}+\lambda ^{p}), \end{aligned}$$

and

$$\begin{aligned} (V+\lambda J)({\hat{h}}-h^{*}) = O_{p}(n^{-1}\lambda ^{-1/r}+\lambda ^{p}). \end{aligned}$$

B Details for Creation of Objective Function

Let us first recall the log empirical likelihood

$$\begin{aligned} \ell _{n} = \sum _{i=1}^{n}\log (p_{i}) + \sum _{j=1}^{n_{1}}h({\varvec{x}}_{1j}), \end{aligned}$$
(5.1)

where

$$\begin{aligned} \sum _{i=1}^{n}p_{i} = 1, \quad p_{i}\ge 0, \quad \sum _{i=1}^{n}p_{i}\mathrm {exp}(h({\varvec{z}}_{i})) = 1. \end{aligned}$$
(5.2)

The maximization employs the method of Lagrange multipliers, i.e.,

$$\begin{aligned} \ell _{n} + \gamma _{1}\left( \sum _{i=1}^{n}p_{i}\mathrm {exp}(h({\varvec{z}}_{i}))-1\right) +\gamma _{0}(\sum _{i=1}^{n}p_{i}-1), \end{aligned}$$
(5.3)

where \(\gamma _{0},\gamma _{1}\) are Lagrange multipliers. Maximizing over the \(p_{i}\) subject to (5.2) for fixed h, we can express each \(p_{i}\) in terms of h as

$$\begin{aligned} p_{i} = \{n- \gamma _{1}[\mathrm {exp}(h({\varvec{z}}_{i}))-1]\}^{-1}, \end{aligned}$$
(5.4)

where \(\gamma _{0},\gamma _{1}\) satisfy \(\gamma _{0}+\gamma _{1}+n = 0\) and they solve

$$\begin{aligned} \sum _{i=1}^{n}p_{i}[\mathrm {exp}(h({\varvec{z}}_{i}))-1] = 0. \end{aligned}$$
(5.5)

Substituting \(p_{i}\)s back into \(\ell _{n}\), the profile log empirical likelihood can be obtained as

$$\begin{aligned} \tilde{\ell }_{n} = -\sum _{i=1}^{n}\log \{n-\gamma _{1}[\mathrm {exp}(h({\varvec{z}}_{i}))-1]\}+\sum _{j=1}^{n_{1}}h({\varvec{x}}_{1j}). \end{aligned}$$
(5.6)

In effect, we are solving a constrained maximum likelihood problem, so we define the objective function to be minimized as

$$\begin{aligned} -\frac{1}{n}\tilde{\ell }_{n} + \lambda \Vert P_{1}h \Vert ^{2}. \end{aligned}$$
(5.7)

With the help of Theorem 2 in [43], we know that the minimizer of this penalized negative log empirical likelihood has the form

$$\begin{aligned} h(\cdot )=\sum _{v=1}^{M}d_{v}\phi _{v}(\cdot )+\sum _{i=1}^{n}\alpha _{i}R(\cdot ,{\varvec{z}}_{i}). \end{aligned}$$
(5.8)

Suppose \(\phi _{1}(\cdot ) = 1 \). To determine the \(\gamma \)'s, we differentiate \(\tilde{\ell }_{n}\) with respect to \(d_{1}\) and apply the constraints (5.2):

$$\begin{aligned} \frac{\partial \tilde{\ell }_{n}}{\partial d_{1}} = \sum _{i=1}^{n}\frac{\gamma _{1}\mathrm {exp}(h({\varvec{z}}_{i}))}{n-\gamma _{1}[\mathrm {exp}(h({\varvec{z}}_{i}))-1]}+n_{1} =\sum _{i = 1}^{n}\gamma _{1}p_{i}\mathrm {exp}(h({\varvec{z}}_{i})) + n_{1} = 0. \end{aligned}$$
(5.9)

We have \(\gamma _{1} = -n_{1}\). So,

$$\begin{aligned} p_{i} = n^{-1}\{1-{\hat{\rho }}+{\hat{\rho }}\mathrm {exp}(h({\varvec{z}}_{i}))\}^{-1}, \end{aligned}$$
(5.10)

and

$$\begin{aligned} \tilde{\ell }_{n} = -\sum _{i=1}^{n}\log \{n(1-{\hat{\rho }}+{\hat{\rho }}\mathrm {exp}(h({\varvec{z}}_{i})))\}+\sum _{j=1}^{n_{1}}h({\varvec{x}}_{1j}). \end{aligned}$$
(5.11)
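
To make (5.8)–(5.11) concrete, here is a minimal numerical sketch (in Python; not the authors' code) that evaluates \(h\) in the representer form, the implied weights \(p_{i}\) of (5.10), and the profile log empirical likelihood (5.11). The Gaussian kernel, the function names, and the treatment of \(\phi _{1}=1\) as the only unpenalized basis function are illustrative assumptions.

import numpy as np

def gaussian_kernel(Z1, Z2, sigma=1.0):
    # Illustrative kernel choice: R(z, z') = exp(-||z - z'||^2 / (2 sigma^2))
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def profile_log_el(Z, y, d, alpha, sigma=1.0):
    """Evaluate h in the representer form (5.8) with phi_1 = 1, the implied
    weights p_i of (5.10), and the profile log empirical likelihood (5.11).

    Z : (n, p) pooled sample, y : (n,) array with 1 marking the x_{1j} observations,
    d : coefficient of the constant basis, alpha : (n,) kernel coefficients."""
    n = len(y)
    rho = y.mean()                                # \hat{rho} = n_1 / n
    R = gaussian_kernel(Z, Z, sigma)              # R(z_i, z_j)
    h = d + R @ alpha                             # h(z_i), see (5.8)
    denom = n * (1.0 - rho + rho * np.exp(h))
    p = 1.0 / denom                               # p_i, see (5.10)
    ell = -np.log(denom).sum() + h[y == 1].sum()  # profile likelihood (5.11)
    return h, p, ell

With \(d=0\) and \(\varvec{\alpha }={\varvec{0}}\), the weights reduce to \(p_{i}=1/n\), which serves as a quick sanity check of the sketch.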

C Details for Lower Bound Optimization Method

Let

$$\begin{aligned} r_{i} = \frac{{\hat{\rho }}\mathrm {exp}({\varvec{T}}_{i}^{'}{\varvec{d}}+{\varvec{R}}_{i}^{'}\varvec{\alpha })}{1-{\hat{\rho }}+{\hat{\rho }}\mathrm {exp}({\varvec{T}}_{i}^{'}{\varvec{d}}+{\varvec{R}}_{i}^{'}\varvec{\alpha })}, \end{aligned}$$
(6.1)

\({\varvec{r}} = (r_{1},\ldots ,r_{n})^{'}\), \({\varvec{y}} = (y_{1},\ldots ,y_{n})^{'}\), \({\varvec{G}} = -{\varvec{y}}+{\varvec{r}}\), \({\varvec{Q}} = ({\varvec{T}},{\varvec{R}})\) and

$$\begin{aligned} {\varvec{K}}^{*} = \begin{pmatrix} {\varvec{O}}_{M\times M} &{} {\varvec{O}}_{M\times n}\\ {\varvec{O}}_{n\times M} &{} {\varvec{R}} \end{pmatrix}, \end{aligned}$$
(6.2)

where \({\varvec{O}}_{M\times n}\) is the \(M\times n\) zero matrix. Define \({\varvec{W}} = ({\varvec{0}}^{'}, \varvec{\alpha }^{'}{\varvec{R}})^{'} \), where \({\varvec{0}} \) is the M-dimensional zero vector. The gradient and the Hessian matrix of the objective function with respect to \(\varvec{\theta }\) are

$$\begin{aligned} {\varvec{G}}^{*} = {\varvec{Q}}^{'}{\varvec{G}}+ {\varvec{W}} \end{aligned}$$
(6.3)

and

$$\begin{aligned} {\varvec{H}}^{*} = \sum _{i=1}^{n}{\varvec{Q}}_{i}{\varvec{Q}}_{i}^{'}r_{i}(1-r_{i})+2\lambda {\varvec{K}}^{*}, \end{aligned}$$
(6.4)

where \({\varvec{Q}}_{i}\) is the ith row of \({\varvec{Q}}\). So, the iterative equation of Newton’s method is

$$\begin{aligned} \varvec{\theta }^{(t+1)}= \varvec{\theta }^{(t)}- ({\varvec{H}}^{*(t)})^{-1}{\varvec{G}}^{*(t)}, \end{aligned}$$
(6.5)

where (t) indicates quantities evaluated at the tth Newton–Raphson iteration. According to (6.5), the inverse of the Hessian matrix must be computed at every iteration, which is computationally expensive. The core idea of the lower bound (LB) method is to replace \({\varvec{H}}^{*}\) by a fixed upper bound \({\varvec{B}}\) and iterate

$$\begin{aligned} \varvec{\theta }^{(t+1)}= \varvec{\theta }^{(t)}- {\varvec{B}}^{-1}{\varvec{G}}^{*(t)}, \end{aligned}$$
(6.6)

where \({\varvec{B}}\) is a symmetric, positive definite matrix and does not depend on \(\varvec{\theta }\). Using \({\varvec{C}}(\varvec{\theta })\) for the objective function, this corresponds to a quadratic approximation to \({\varvec{C}}(\varvec{\theta })-{\varvec{C}}(\varvec{\theta }_{0}) \) of the following form

$$\begin{aligned} (\varvec{\theta }-\varvec{\theta }_{0})^{'}\nabla {\varvec{C}}(\varvec{\theta }_{0})+(\varvec{\theta }-\varvec{\theta }_{0})^{'}{\varvec{B}}(\varvec{\theta }-\varvec{\theta }_{0})/2. \end{aligned}$$
(6.7)
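
Spelling out a step left implicit here: minimizing this quadratic approximation in \(\varvec{\theta }\) gives

$$\begin{aligned} \nabla {\varvec{C}}(\varvec{\theta }_{0})+{\varvec{B}}(\varvec{\theta }-\varvec{\theta }_{0}) = {\varvec{0}} \quad \Longrightarrow \quad \varvec{\theta } = \varvec{\theta }_{0} - {\varvec{B}}^{-1}\nabla {\varvec{C}}(\varvec{\theta }_{0}), \end{aligned}$$

which, with \(\varvec{\theta }_{0} = \varvec{\theta }^{(t)}\) and \(\nabla {\varvec{C}}(\varvec{\theta }^{(t)}) = {\varvec{G}}^{*(t)}\), is exactly the update (6.6).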

The properties of the LB method can be found in Theorem 4.1 of [4]. As stated in [4], replacing every \(r_{i}\) by 1/2 in \({\varvec{H}}^{*}\) does the job. Indeed, we have

$$\begin{aligned} {\varvec{H}}^{*} \preccurlyeq \frac{1}{4}\sum _{i=1}^{n}{\varvec{Q}}_{i}{\varvec{Q}}_{i}^{'}+2\lambda {\varvec{K}}^{*} := {\varvec{B}}. \end{aligned}$$
(6.8)

Here, \(C \preccurlyeq D \) means that \(D-C\) is nonnegative definite.
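
To complement the description above, the following Python sketch (not the authors' implementation) carries out the LB iteration exactly as written in (6.1)–(6.8): the bound \({\varvec{B}}\) is formed and factored once and reused at every step. The function name, the Cholesky solver, and the small diagonal jitter added for numerical stability are assumptions made for the sketch.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lb_newton(T, R, y, lam, n_iter=100):
    """Lower-bound iteration (6.6), following (6.1)-(6.8) as stated above.

    T : (n, M) unpenalized basis matrix, R : (n, n) kernel matrix,
    y : (n,) array of labels with 1 for the minority class, lam : penalty parameter."""
    n, M = T.shape
    rho = y.mean()                                   # \hat{rho} = n_1 / n
    Q = np.hstack([T, R])                            # Q = (T, R)
    K_star = np.zeros((M + n, M + n))
    K_star[M:, M:] = R                               # K* as in (6.2)
    B = 0.25 * Q.T @ Q + 2.0 * lam * K_star          # fixed upper bound (6.8)
    B_chol = cho_factor(B + 1e-10 * np.eye(M + n))   # factor once, reuse every step
    theta = np.zeros(M + n)
    for _ in range(n_iter):
        alpha = theta[M:]
        h = Q @ theta                                # h(z_i) = T_i'd + R_i'alpha
        r = rho * np.exp(h) / (1.0 - rho + rho * np.exp(h))   # r_i as in (6.1)
        W = np.concatenate([np.zeros(M), R @ alpha])          # W = (0', alpha'R)'
        G_star = Q.T @ (r - y) + W                   # gradient (6.3), with G = -y + r
        theta = theta - cho_solve(B_chol, G_star)    # LB update (6.6)
    return theta

Because \({\varvec{B}}\) does not change across iterations, the per-step cost drops from a fresh matrix inversion to a single triangular solve, which is the practical payoff of the LB method.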


Cite this article

Li, J., Cui, W. A New Classifier for Imbalanced Data Based on a Generalized Density Ratio Model. Commun. Math. Stat. 11, 369–401 (2023). https://doi.org/10.1007/s40304-021-00254-7
