Abstract
Raising the true positive rate while lowering the false positive rate remains a central challenge in imbalanced learning. This work combines the penalized empirical likelihood method, the lower bound algorithm, and the Nyström method, and applies these techniques, together with the kernel method, to the density ratio model. The resulting classifier, the density ratio classifier (DRC), integrates kernelization, regularization, efficient implementation, and threshold moving, all of which are critical to making DRC an effective and powerful method for difficult imbalance problems. Compared with other methods, DRC is competitive in that it is widely applicable, simple, and easy to use without additional imbalance-handling techniques. In addition, the convergence rate of the estimated log density ratio is established. Numerical results show that DRC outperforms competing methods in AUC and G-mean.
References
Barreno, M., Cárdenas, A.A., Tygar, J.D.: Optimal ROC curve for a combination of classifiers. In: Advances in Neural Information Processing Systems (NIPS) (2007)
Berlinet, A.: Reproducing kernels in probability and statistics. More Progresses in Analysis (2014)
Böhning, D.: Multinomial logistic regression algorithm. Ann. Inst. Stat. Math. 44(1), 197–200 (1992)
Böhning, D., Lindsay, B.: Monotonicity of quadratic-approximation algorithms. Ann. Inst. Stat. Math. 40(4), 641–663 (1988)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Cai, S., Chen, J.: Empirical likelihood inference for multiple censored samples. Canadian J. Stat. 46(2), 212–232 (2018)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)
Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 785–794. Association for Computing Machinery (2016)
Chen, J., Liu, Y.: Quantile and quantile-function estimations under density ratio model. Ann. Stat. 41(3), 1669–1692 (2013)
Chen, J., Liu, Y.: Small area quantile estimation. Int. Stat. Rev. 87(S1), S219–S238 (2019)
Chen, B., Li, P., Qin, J., Yu, T.: Using a monotonic density ratio model to find the asymptotically optimal combination of multiple diagnostic tests. J. Am. Stat. Assoc. 111(514), 861–874 (2016)
Cheng, K.F., Chu, C.K.: Semiparametric density estimation under a two-sample density ratio model. Bernoulli 10(4), 583–604 (2004)
Collell, G., Prelec, D., Patil, K.R.: A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data. Neurocomputing 275, 330–340 (2018)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
de Oliveira, V., Kedem, B.: Bayesian analysis of a density ratio model. Canadian J. Stat. 45, 274–289 (2017)
Denil, M., Trappenberg, T.: Overlap versus imbalance. In: Farzindar, A., Kešelj, V. (eds.) Advances in Artificial Intelligence, pp. 220–231. Springer, Berlin (2010)
Diao, G., Ning, J., Qin, J.: Maximum likelihood estimation for semiparametric density ratio model. Int. J. Biostat. 8(1), 1–29 (2012)
Dua, D., Graff, C.: UCI machine learning repository (2017)
Eguchi, S., Copas, J.: A class of logistic type discriminant functions. Biometrika 89(1), 1–22 (2002)
Fokianos, K., Kaimi, I.: On the effect of misspecifying the density ratio model. Ann. Inst. Stat. Math. 58(3), 475–497 (2006)
Gu, C.: Smoothing Spline ANOVA Models, vol. 297. Springer, New York (2013)
Härdle, W.: Nonparametric and Semiparametric Models. Springer, Berlin (2004)
He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
Ho, T., Basu, M.: Complexity measures of supervised classification problems. IEEE Trans. Pattern Anal. Mach. Intell. 24, 289–300 (2002)
Qin, J.: Inferences for case-control and semiparametric two-sample density ratio models. Biometrika 85(3), 619–630 (1998)
Qin, J., Zhang, B.: Best combination of multiple diagnostic tests for screening purposes. Stat. Med. 29(28), 2905–2919 (2010)
Kanamori, T., Suzuki, T., Sugiyama, M.: Theoretical analysis of density ratio estimation. IEICE Trans. Fundament. Electron. Commun. Comput. Sci. E93-A(4), 787–798 (2010)
Karsmakers, P., Pelckmans, K., Suykens, J.A.K.: Multi-class kernel logistic regression: a fixed-size implementation. In: 2007 International Joint Conference on Neural Networks, pp. 1756–1761 (2007)
Katzoff, M., Zhou, W., Khan, D., Lu, G., Kedem, B.: Out-of-sample fusion in risk prediction. J. Stat. Theory Practice 8(3), 444–459 (2014)
Kedem, B., Lu, G., Wei, R., Williams, P.D.: Forecasting mortality rates via density ratio modeling. Canadian J. Stat. 36(2), 193–206 (2010)
Kedem, B., Pan, L., Zhou, W., Coelho, C.A.: Interval estimation of small tail probabilities: applications in food safety. Stat. Med. 35(18), 3229–3240 (2016)
Kedem, B., Pan, L., Smith, P., Wang, C.: Repeated out of sample fusion in the estimation of small tail probabilities (2019)
Steinwart, I., Christmann, A.: Kernels and reproducing kernel Hilbert spaces. In: Support Vector Machines, pp. 110–163. Springer, New York (2008)
Liu, X.-Y., Wu, J., Zhou, Z.-H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B Cybern. 39(2), 539–550 (2009)
López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)
Luengo, J., Fernández, A., García, S., Herrera, F.: Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Comput. 15(10), 1909–1936 (2011)
Luo, X., Tsai, W.Y.: A proportional likelihood ratio model. Biometrika 99(1), 211–222 (2012)
Maalouf, M., Homouz, D., Trafalis, T.B.: Logistic regression in large rare events and imbalanced data: a performance comparison of prior correction and weighting methods. Comput. Intell. 34(1), 161–174 (2018)
Prati, R., Batista, G., Monard, M.-C.: Learning with class skews and small disjuncts. In: Bazzan, A.L.C., Labidi, S. (eds.) Advances in Artificial Intelligence-SBIA 2004, pp. 296–306. Springer, Berlin (2004)
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. The MIT Press, Cambridge (2009)
Rayhan, F., Ahmed, S., Mahbub, A., Jani, R., Shatabda, S., Farid, D.M.: CUSBoost: cluster-based under-sampling with boosting for imbalanced classification. In: 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS) (2017)
Schölkopf, B., Herbrich, R., Smola, A.J.: A generalized representer theorem. In: Helmbold, D., Williamson, B. (eds.) Computational Learning Theory, pp. 416–426. Springer, Berlin (2001)
Seiffert, C., Khoshgoftaar, T., Hulse, J., Napolitano, A.: RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Humans 40(1), 185–197 (2010)
Shen, Y., Ning, J., Qin, J.: Likelihood approaches for the invariant density ratio model with biased sampling data. Biometrika 99(2), 363–378 (2012)
Tang, Y., Zhang, Y.Q., Chawla, N.V., Krasser, S.: SVMs modeling for highly imbalanced classification. IEEE Trans. Syst. Man Cybern. Part B Cybern. 39(1), 281–288 (2009)
Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874 (2006)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2009)
Voulgaraki, A., Kedem, B., Graubard, B.I.: Semiparametric regression in testicular germ cell data. Ann. Appl. Stat. 6(3), 1185–1208 (2012)
Vuttipittayamongkol, P., Elyan, E., Petrovski, A., Jayne, C.: Overlap-based undersampling for improving imbalanced data classification. In: International Conference on Intelligent Data Engineering and Automated Learning, pp. 689–697 (2018)
Wang, H.: Logistic regression for massive data with rare events (2020)
Wang, Y.: Smoothing Splines: Methods and Applications, 1st edn. CRC Press, Boca Raton (2011)
Wang, S., Yao, X.: Diversity analysis on imbalanced data sets by using ensemble models. In: 2009 IEEE Symposium on Computational Intelligence and Data Mining, pp. 324–331 (2009)
Wang, D., Tian, L., Zhao, Y.: Smoothed empirical likelihood for the Youden index. Comput. Stat. Data Anal. 115, 1–10 (2017)
Weiss, G.M.: The Impact of Small Disjuncts on Classifier Learning, vol. 8, pp. 193–226. Springer US (2010)
Williams, C., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems 13 (NIPS 2000), Denver, CO, USA, pp. 682–688 (2000)
Xia, S.Y., Xiong, Z.Y., He, Y., Li, K., Dong, L.M., Zhang, M.: Relative density-based classification noise detection. Optik Int. J. Light Electron Optics 125, 6829–6834 (2014)
Li, Y., Guo, H., Liu, X., Li, Y., Li, J.: Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data. Knowl.-Based Syst. 94, 88–104 (2016)
Zhou, Z.-H., Liu, X.-Y.: On multi-class cost-sensitive learning. Comput. Intell. 26, 232–257 (2010)
Zhuang, W.W., Hu, B.Y., Chen, J.: Semiparametric inference for the dominance index under the density ratio model. Biometrika 106, 229–241 (2019)
Additional information
This research was supported by the National Natural Science Foundation of China (Grant No. 71873128).
Appendices
A Proof of Theorems
A.1 Proof of Theorem 2.1
Consider the quadratic functional
and we add a multiplier \(\frac{1}{2}\) in the last term for convenience, which has no effect on our technical development. Since V is completely continuous with respect to J and \(J(h)<\infty \), h can be expressed as a Fourier series expansion \(h = \sum _{\mu }h_{\mu }\psi _{\mu } \), where \( \psi _{\mu } \) are eigenfunctions of J with respect to V and \( h_{\mu } = V(h,\psi _{\mu }) = \int _{{\mathcal {X}}}h(z)\psi _{\mu }(z)w(h^{*}(z))f(z)dz \) are Fourier coefficients. Plugging Fourier series expansions \(h = \sum _{\mu }h_{\mu }\psi _{\mu } \) and \(h^{*} = \sum _{\mu }h_{\mu }^{*}\psi _{\mu }\) into (4.1), it is easy to show that the minimizer \(\tilde{h}\) of (4.1) has Fourier coefficients
where \(\beta _{\mu } = n^{-1}\sum _{i= 1}^{n}u(h^{*}({\varvec{z}}_{i});y_{i})\psi _{\mu }({\varvec{z}}_{i})\).
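For orientation, the termwise minimization in this eigenbasis takes the generic smoothing-spline form (cf. [21]); writing \(\rho _{\mu }\) for the eigenvalue of J associated with \(\psi _{\mu }\), a sketch of the resulting coefficients under these standard assumptions is
\[ \tilde{h}_{\mu } = \frac{h_{\mu }^{*}+\beta _{\mu }}{1+\lambda \rho _{\mu }}, \qquad \tilde{h}_{\mu }-h_{\mu }^{*} = \frac{\beta _{\mu }-\lambda \rho _{\mu }h_{\mu }^{*}}{1+\lambda \rho _{\mu }}, \]
so that \(E(\tilde{h}_{\mu }-h_{\mu }^{*})^{2} = \{(\lambda \rho _{\mu }h_{\mu }^{*})^{2}+n^{-1}\}/(1+\lambda \rho _{\mu })^{2}\), separating the bias and variance contributions.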
Since \(E(\beta _{\mu }) = 0\) and \(E(\beta _{\mu }^{2}) = \frac{1}{n}\), we have
When \(\lambda \rightarrow 0\) and condition 2 holds, Lemma 8.1 in [21] says that
So, combining (4.2)–(4.4) and assuming condition 3, we immediately have
and
Therefore,
A.2 Proof of Theorem 2.2
We first characterize the rate of \({\hat{h}}-\tilde{h}\) and then turn to \({\hat{h}}-h^{*}\). Define
and
Then, it can be easily shown that
and
If we set \(g = {\hat{h}}, h = {\hat{h}} - \tilde{h}\) in (4.7), we have
And setting \(g = \tilde{h}, h = {\hat{h}} - \tilde{h}\) in (4.8), we have
Subtracting (4.10) from (4.9), we can immediately get
Using the mean value theorem, we can get
where \(h_{1}\) is a convex combination of \({\hat{h}}\) and \(\tilde{h}\), and \(h_{2}\) is that of \(\tilde{h}\) and \(h^{*}\). Substituting (4.12) and (4.13) back to (4.11), we have
Under conditions 1, 2, 5 and assuming \(\lambda \rightarrow 0, n\lambda ^{r/2}\rightarrow \infty \), Lemma 8.17 in [21] indicates that
and
Combining (4.14)–(4.16) and assuming condition 4, for some \(c\in [c_{1},c_{2}]\), we have
which together with Theorem 2.1 implies that
and
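For reference, convergence rates delivered by this style of argument in the smoothing-spline literature [21] take the generic form
\[ (V+\lambda J)(\hat{h}-h^{*}) = O_{p}\left( n^{-1}\lambda ^{-1/r}+\lambda \right) , \]
a sketch under the standard eigenvalue-growth condition (\(\rho _{\mu }\) of order \(\mu ^{r}\) for some \(r>1\)); balancing the two terms at \(\lambda \asymp n^{-r/(r+1)}\) gives the rate \(O_{p}(n^{-r/(r+1)})\). The exact statements of Theorems 2.1 and 2.2 may differ in constants and conditions.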
B Details for Creation of Objective Function
Let us first recall the log empirical likelihood
where
The maximization employs the method of Lagrange multipliers, i.e.,
where \(\gamma _{0},\gamma _{1}\) are Lagrange multipliers. By profiling \( \ell _{n}\) subject to (5.2) with respect to h first, we can express each \(p_{i}\) in terms of h as
where \(\gamma _{0},\gamma _{1}\) satisfy \(\gamma _{0}+\gamma _{1}+n = 0\) and they solve
Substituting \(p_{i}\)s back into \(\ell _{n}\), the profile log empirical likelihood can be obtained as
Actually, we are solving a constrained maximum likelihood problem, so we can define our optimization problem as
With the help of Theorem 2 in [43], we know that the minimizer of the penalized negative log-likelihood has the form
Suppose \(\phi _{1}(\cdot ) = 1 \). To obtain the \(\gamma \)s, we differentiate \(\ell _{n}\) with respect to \(d_{1}\) and apply the constraints (5.2).
We have \(\gamma _{1} = -n_{1}\). So,
and
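For concreteness, under the two-sample density ratio formulation of Qin [25], with \(n_{0}\) and \(n_{1}\) the two class sizes, \(n = n_{0}+n_{1}\) and \(\rho = n_{1}/n_{0}\), these steps lead to the familiar expressions (a sketch consistent with \(\gamma _{1} = -n_{1}\) above, not necessarily the exact display of this paper):
\[ p_{i} = \frac{1}{n_{0}}\cdot \frac{1}{1+\rho e^{h({\varvec{z}}_{i})}}, \qquad \ell _{n}(h) = -\sum _{i=1}^{n}\log \left( 1+\rho e^{h({\varvec{z}}_{i})}\right) +\sum _{i:y_{i}=1}h({\varvec{z}}_{i})-n\log n_{0}. \]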
C Details for Lower Bound Optimization Method
Let
\({\varvec{r}} = (r_{1},\ldots ,r_{n})^{'}\), \({\varvec{y}} = (y_{1},\ldots ,y_{n})^{'}\), \({\varvec{G}} = -{\varvec{y}}+{\varvec{r}}\), \({\varvec{Q}} = ({\varvec{T}},{\varvec{R}})\) and
where \({\varvec{O}}_{M\times n}\) is the zero matrix of dimension \(M\times n\). Define \({\varvec{W}} = ({\varvec{0}}^{'}, \varvec{\alpha }^{'}{\varvec{R}})^{'} \), where \({\varvec{0}} \) is the M-dimensional zero vector. The first-order partial derivative and Hessian matrix of the objective function with respect to \(\varvec{\theta }\) are
and
where \({\varvec{Q}}_{i}\) is the ith row of \({\varvec{Q}}\). So, the iterative equation of Newton’s method is
where (t) indicates quantities evaluated at the tth Newton–Raphson iteration. According to (6.5), we would need to compute the inverse of the Hessian matrix at every iteration, which is extremely inefficient. The core idea of LB is to find an upper bound \({\varvec{B}}\) of \({\varvec{H}}^{*}\) such that
where \({\varvec{B}}\) is a symmetric, positive definite matrix and does not depend on \(\varvec{\theta }\). Using \({\varvec{C}}(\varvec{\theta })\) for the objective function, this corresponds to a quadratic approximation to \({\varvec{C}}(\varvec{\theta })-{\varvec{C}}(\varvec{\theta }_{0}) \) of the following form
The properties of LB can be found in Theorem 4.1 in [4]. As stated in [4], substituting 1/2 for all \(r_{i}\)s in \({\varvec{H}}^{*}\) does the job. Actually, we have
Here, \(C \preccurlyeq D \) means that \(D-C\) is nonnegative definite.
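To make the computational saving concrete, the following is a minimal numerical sketch (in Python) of an LB iteration for a logistic-type objective, with a plain ridge penalty standing in for the \(\varvec{\alpha }^{'}{\varvec{R}}\)-type term; the names, the simplified penalty, and the toy data are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def lb_fit(Q, y, lam=1e-3, n_iter=200):
        """Lower-bound (LB) iteration in the spirit of [4]: replace the
        iteration-dependent Hessian H* with a fixed bound B (H* <= B in the
        nonnegative-definite order), so B is factorized only once.
        Sketch for a logistic-type loss with ridge penalty lam."""
        n, d = Q.shape
        # r_i(1 - r_i) <= 1/4, attained at r_i = 1/2, hence the fixed bound:
        B = Q.T @ Q / 4.0 + lam * np.eye(d)
        L = np.linalg.cholesky(B)  # factorize once, reuse at every step
        theta = np.zeros(d)
        for _ in range(n_iter):
            r = 1.0 / (1.0 + np.exp(-Q @ theta))  # fitted probabilities
            g = Q.T @ (r - y) + lam * theta       # gradient G = Q'(-y + r) + penalty
            theta -= np.linalg.solve(L.T, np.linalg.solve(L, g))  # B replaces H*
        return theta

    # Toy usage on synthetic data:
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((200, 5))
    y = (Q @ rng.standard_normal(5) + rng.standard_normal(200) > 0).astype(float)
    theta_hat = lb_fit(Q, y)

Because \({\varvec{B}}\succcurlyeq {\varvec{H}}^{*}\) uniformly in \(\varvec{\theta }\), each step decreases the objective monotonically (Theorem 4.1 in [4]), and the single Cholesky factorization of \({\varvec{B}}\) replaces a fresh Hessian inversion at every iteration.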
About this article
Cite this article
Li, J., Cui, W. A New Classifier for Imbalanced Data Based on a Generalized Density Ratio Model. Commun. Math. Stat. 11, 369–401 (2023). https://doi.org/10.1007/s40304-021-00254-7