Abstract
Raising the true positive rate while lowering the false positive rate remains a central challenge in imbalanced learning. This work combines the penalized empirical likelihood method, the lower bound algorithm, and the Nyström method, and applies these techniques, together with the kernel method, to the density ratio model. The resulting classifier, the density ratio classifier (DRC), integrates kernelization, regularization, efficient implementation, and threshold moving, all of which are critical to making DRC an effective and powerful method for difficult imbalance problems. Compared with other methods, DRC is competitive in that it is widely applicable, simple, and easy to use without additional imbalance-handling techniques. In addition, the convergence rate of the estimated log density ratio is established. Numerical results show that DRC outperforms competing methods in AUC and G-mean.
References
Barreno, M., Cárdenas, A.A., Tygar, J.D.: Optimal ROC curve for a combination of classifiers. In: Advances in Neural Information Processing Systems (NIPS) (2007)
Berlinet, A.: Reproducing kernels in probability and statistics. More Progresses in Analysis (2014)
Böhning, D.: Multinomial logistic regression algorithm. Ann. Inst. Stat. Math. 44(1), 197–200 (1992)
Böhning, D., Lindsay, B.: Monotonicity of quadratic-approximation algorithms. Ann. Inst. Stat. Math. 40(4), 641–663 (1988)
Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001)
Cai, S., Chen, J.: Empirical likelihood inference for multiple censored samples. Canadian J. Stat. 46(2), 212–232 (2018)
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)
Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 785–794. Association for Computing Machinery (2016)
Chen, J., Liu, Y.: Quantile and quantile-function estimations under density ratio model. Ann. Stat. 41(3), 1669–1692 (2013)
Chen, J., Liu, Y.: Small area quantile estimation. Int. Stat. Rev. 87(S1), S219–S238 (2019)
Chen, B., Li, P., Qin, J., Yu, T.: Using a monotonic density ratio model to find the asymptotically optimal combination of multiple diagnostic tests. J. Am. Stat. Assoc. 111(514), 861–874 (2016)
Cheng, K.F., Chu, C.K.: Semiparametric density estimation under a two-sample density ratio model. Bernoulli 10(4), 583–604 (2004)
Collell, G., Prelec, D., Patil, K.R.: A simple plug-in bagging ensemble based on threshold-moving for classifying binary and multiclass imbalanced data. Neurocomputing 275, 330–340 (2018)
Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
de Oliveira, V., Kedem, B.: Bayesian analysis of a density ratio model. Canadian J. Stat. 45, 274–289 (2017)
Denil, M., Trappenberg, T.: Overlap versus imbalance. In: Farzindar, A., Kešelj, V. (eds.) Advances in Artificial Intelligence, pp. 220–231. Springer, Berlin (2010)
Diao, G., Ning, J., Qin, J.: Maximum likelihood estimation for semiparametric density ratio model. Int. J. Biostat. 8(1), 1–29 (2012)
Dua, D., Graff, C.: UCI machine learning repository (2017)
Eguchi, S., Copas, J.: A class of logistic type discriminant functions. Biometrika 89(1), 1–22 (2002)
Fokianos, K., Kaimi, I.: On the effect of misspecifying the density ratio model. Ann. Inst. Stat. Math. 58(3), 475–497 (2006)
Gu, C.: Smoothing Spline ANOVA Models, vol. 297. Springer, New York (2013)
Härdle, W.: Nonparametric and Semiparametric Models. Springer, Berlin (2004)
He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
Ho, T., Basu, M.: Complexity measures of supervised classification problems. IEEE Trans. Pattern Anal. Mach. Intell. 24, 289–300 (2002)
Qin, J.: Inferences for case-control and semiparametric two-sample density ratio models. Biometrika 85(3), 619–630 (1998)
Qin, J., Zhang, B.: Best combination of multiple diagnostic tests for screening purposes. Stat. Med. 29(28), 2905–2919 (2010)
Kanamori, T., Suzuki, T., Sugiyama, M.: Theoretical analysis of density ratio estimation. IEICE Trans. Fundament. Electron. Commun. Comput. Sci. E93-A(4), 787–798 (2010)
Karsmakers, P., Pelckmans, K., Suykens, J.A.K.: Multi-class kernel logistic regression: a fixed-size implementation. In: 2007 International Joint Conference on Neural Networks, pp. 1756–1761 (2007)
Katzoff, M., Zhou, W., Khan, D., Lu, G., Kedem, B.: Out-of-sample fusion in risk prediction. J. Stat. Theory Practice 8(3), 444–459 (2014)
Kedem, B., Lu, G., Wei, R., Williams, P.D.: Forecasting mortality rates via density ratio modeling. Canadian J. Stat. 36(2), 193–206 (2010)
Kedem, B., Pan, L., Zhou, W., Coelho, C.A.: Interval estimation of small tail probabilities: applications in food safety. Stat. Med. 35(18), 3229–3240 (2016)
Kedem, B., Pan, L., Smith, P., Wang, C.: Repeated out of sample fusion in the estimation of small tail probabilities (2019)
Steinwart, I., Christmann, A.: Kernels and reproducing kernel Hilbert spaces. In: Support Vector Machines, pp. 110–163. Springer, New York (2008)
Liu, X.-Y., Wu, J., Zhou, Z.-H.: Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B Cybern. 39(2), 539–550 (2009)
López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)
Luengo, J., Fernández, A., García, S., Herrera, F.: Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Comput. 15(10), 1909–1936 (2011)
Luo, X., Tsai, W.Y.: A proportional likelihood ratio model. Biometrika 99(1), 211–222 (2012)
Maalouf, M., Homouz, D., Trafalis, T.B.: Logistic regression in large rare events and imbalanced data: a performance comparison of prior correction and weighting methods. Comput. Intell. 34(1), 161–174 (2018)
Prati, R., Batista, G., Monard, M.-C.: Learning with class skews and small disjuncts. In: Bazzan, A.L.C., Labidi, S. (eds.) Advances in Artificial Intelligence-SBIA 2004, pp. 296–306. Springer, Berlin (2004)
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., Lawrence, N.D.: Dataset Shift in Machine Learning. The MIT Press, Cambridge (2009)
Rayhan, F., Ahmed, S., Mahbub, A., Jani, R., Shatabda, S., Farid, D.M.: CUSBoost: cluster-based under-sampling with boosting for imbalanced classification. In: 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution (CSITSS) (2017)
Schölkopf, B., Herbrich, R., Smola, A.J.: A generalized representer theorem. In: Helmbold, D., Williamson, B. (eds.) Computational Learning Theory, pp. 416–426. Springer, Berlin (2001)
Seiffert, C., Khoshgoftaar, T., Hulse, J., Napolitano, A.: RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Trans. Syst. Man Cybern. Part A Syst. Humans 40(1), 185–197 (2010)
Shen, Y., Ning, J., Qin, J.: Likelihood approaches for the invariant density ratio model with biased sampling data. Biometrika 99(2), 363–378 (2012)
Tang, Y., Zhang, Y.Q., Chawla, N.V., Krasser, S.: SVMs modeling for highly imbalanced classification. IEEE Trans. Syst. Man Cybern. Part B Cybern. 39(1), 281–288 (2009)
Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874 (2006)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2009)
Voulgaraki, A., Kedem, B., Graubard, B.I.: Semiparametric regression in testicular germ cell data. Ann. Appl. Stat. 6(3), 1185–1208 (2012)
Vuttipittayamongkol, P., Elyan, E., Petrovski, A., Jayne, C.: Overlap-based undersampling for improving imbalanced data classification. In: International Conference on Intelligent Data Engineering and Automated Learning, pp. 689–697 (2018)
Wang, H.: Logistic regression for massive data with rare events (2020)
Wang, Y.: Smoothing Splines: Methods and Applications, 1st edn. CRC Press, Boca Raton (2011)
Wang, S., Yao, X.: Diversity analysis on imbalanced data sets by using ensemble models. In: 2009 IEEE Symposium on Computational Intelligence and Data Mining, pp. 324–331 (2009)
Wang, D., Tian, L., Zhao, Y.: Smoothed empirical likelihood for the Youden index. Comput. Stat. Data Anal. 115, 1–10 (2017)
Weiss, G.M.: The Impact of Small Disjuncts on Classifier Learning, vol. 8, pp. 193–226. Springer US (2010)
Williams, C., Seeger, M.: Using the Nyström method to speed up kernel machines. In: Advances in Neural Information Processing Systems 13 (NIPS 2000), Denver, CO, USA, pp. 682–688 (2000)
Xia, S.Y., Xiong, Z.Y., He, Y., Li, K., Dong, L.M., Zhang, M.: Relative density-based classification noise detection. Optik Int. J. Light Electron Optics 125, 6829–6834 (2014)
Li, Y., Guo, H., Liu, X., Li, Y., Li, J.: Adapted ensemble classification algorithm based on multiple classifier system and feature selection for classifying multi-class imbalanced data. Knowl.-Based Syst. 94, 88–104 (2016)
Zhou, Z.-H., Liu, X.-Y.: On multi-class cost-sensitive learning. Comput. Intell. 26, 232–257 (2010)
Zhuang, W.W., Hu, B.Y., Chen, J.: Semiparametric inference for the dominance index under the density ratio model. Biometrika 106, 229–241 (2019)
Additional information
This research was supported by the National Natural Science Foundation of China (Grant No. 71873128).
Appendices
A Proof of Theorems
A.1 Proof of Theorem 2.1
Consider the quadratic functional
and we add a multiplier \(\frac{1}{2}\) in the last term for convenience, which has no effect on our technical development. Since V is completely continuous with respect to J and \(J(h)<\infty \), h can be expressed as a Fourier series expansion \(h = \sum _{\mu }h_{\mu }\psi _{\mu } \), where \( \psi _{\mu } \) are eigenfunctions of J with respect to V and \( h_{\mu } = V(h,\psi _{\mu }) = \int _{{\mathcal {X}}}h(z)\psi _{\mu }(z)w(h^{*}(z))f(z)dz \) are Fourier coefficients. Plugging Fourier series expansions \(h = \sum _{\mu }h_{\mu }\psi _{\mu } \) and \(h^{*} = \sum _{\mu }h_{\mu }^{*}\psi _{\mu }\) into (4.1), it is easy to show that the minimizer \(\tilde{h}\) of (4.1) has Fourier coefficients
where \(\beta _{\mu } = n^{-1}\sum _{i= 1}^{n}u(h^{*}({\varvec{z}}_{i});y_{i})\psi _{\mu }({\varvec{z}}_{i})\).
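For orientation, the termwise minimization in this eigenbasis takes the generic smoothing-spline form (cf. [21]); writing \(\rho _{\mu }\) for the eigenvalue of J associated with \(\psi _{\mu }\), a sketch of the resulting coefficients under these standard assumptions is
\[ \tilde{h}_{\mu } = \frac{h_{\mu }^{*}+\beta _{\mu }}{1+\lambda \rho _{\mu }}, \qquad \tilde{h}_{\mu }-h_{\mu }^{*} = \frac{\beta _{\mu }-\lambda \rho _{\mu }h_{\mu }^{*}}{1+\lambda \rho _{\mu }}, \]
so that \(E(\tilde{h}_{\mu }-h_{\mu }^{*})^{2} = \{(\lambda \rho _{\mu }h_{\mu }^{*})^{2}+n^{-1}\}/(1+\lambda \rho _{\mu })^{2}\), separating the bias and variance contributions.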
Since \(E(\beta _{\mu }) = 0\) and \(E(\beta _{\mu }^{2}) = \frac{1}{n}\), we have
When \(\lambda \rightarrow 0\) and condition 2 holds, Lemma 8.1 in [21] says that
So, combining (4.2)–(4.4) and assuming condition 3, we immediately have
and
Therefore,
A.2 Proof of Theorem 2.2
We first characterize the rate of \({\hat{h}}-\tilde{h}\) and then turn to \({\hat{h}}-h^{*}\). Define
and
Then, it can be easily shown that
and
If we set \(g = {\hat{h}}, h = {\hat{h}} - \tilde{h}\) in (4.7), we have
And setting \(g = \tilde{h}, h = {\hat{h}} - \tilde{h}\) in (4.8), we have
Subtracting (4.10) from (4.9), we can immediately get
Using the mean value theorem, we can get
where \(h_{1}\) is a convex combination of \({\hat{h}}\) and \(\tilde{h}\), and \(h_{2}\) is that of \(\tilde{h}\) and \(h^{*}\). Substituting (4.12) and (4.13) back to (4.11), we have
Under conditions 1, 2, 5 and assuming \(\lambda \rightarrow 0, n\lambda ^{r/2}\rightarrow \infty \), Lemma 8.17 in [21] indicates that
and
Combining (4.14)–(4.16) and assuming condition 4, for some \(c\in [c_{1},c_{2}]\), we have
which together with Theorem 2.1 implies that
and
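For reference, convergence rates delivered by this style of argument in the smoothing-spline literature [21] take the generic form
\[ (V+\lambda J)(\hat{h}-h^{*}) = O_{p}\left( n^{-1}\lambda ^{-1/r}+\lambda \right) , \]
a sketch under the standard eigenvalue-growth condition (\(\rho _{\mu }\) of order \(\mu ^{r}\) for some \(r>1\)); balancing the two terms at \(\lambda \asymp n^{-r/(r+1)}\) gives the rate \(O_{p}(n^{-r/(r+1)})\). The exact statements of Theorems 2.1 and 2.2 may differ in constants and conditions.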
B Details for Creation of Objective Function
Let us first recall the log empirical likelihood
where
The maximization employs the method of Lagrange multipliers, i.e.,
where \(\gamma _{0},\gamma _{1}\) are Lagrange multipliers. By profiling \( \ell _{n}\) subject to (5.2) with respect to h first, we can express each \(p_{i}\) in terms of h as
where \(\gamma _{0},\gamma _{1}\) satisfy \(\gamma _{0}+\gamma _{1}+n = 0\) and they solve
Substituting \(p_{i}\)s back into \(\ell _{n}\), the profile log empirical likelihood can be obtained as
Actually, we are solving a constrained maximum likelihood problem, so we can define our optimization problem as
With the help of Theorem 2 in [43], we know that the minimizer of the penalized negative log-likelihood has the form
Suppose \(\phi _{1}(\cdot ) = 1 \). To obtain the \(\gamma \)s, we differentiate \(\ell _{n}\) with respect to \(d_{1}\) and apply the constraints (5.2).
We have \(\gamma _{1} = -n_{1}\). So,
and
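For concreteness, under the two-sample density ratio formulation of Qin [25], with \(n_{0}\) and \(n_{1}\) the two class sizes, \(n = n_{0}+n_{1}\) and \(\rho = n_{1}/n_{0}\), these steps lead to the familiar expressions (a sketch consistent with \(\gamma _{1} = -n_{1}\) above, not necessarily the exact display of this paper):
\[ p_{i} = \frac{1}{n_{0}}\cdot \frac{1}{1+\rho e^{h({\varvec{z}}_{i})}}, \qquad \ell _{n}(h) = -\sum _{i=1}^{n}\log \left( 1+\rho e^{h({\varvec{z}}_{i})}\right) +\sum _{i:y_{i}=1}h({\varvec{z}}_{i})-n\log n_{0}. \]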
C Details for Lower Bound Optimization Method
Let
\({\varvec{r}} = (r_{1},\ldots ,r_{n})^{'}\), \({\varvec{y}} = (y_{1},\ldots ,y_{n})^{'}\), \({\varvec{G}} = -{\varvec{y}}+{\varvec{r}}\), \({\varvec{Q}} = ({\varvec{T}},{\varvec{R}})\) and
where \({\varvec{O}}_{M\times n}\) is the zero matrix of dimension \(M\times n\). Define \({\varvec{W}} = ({\varvec{0}}^{'}, \varvec{\alpha }^{'}{\varvec{R}})^{'} \), where \({\varvec{0}} \) is the M-dimensional zero vector. The first-order partial derivative and Hessian matrix of the objective function with respect to \(\varvec{\theta }\) are
and
where \({\varvec{Q}}_{i}\) is the ith row of \({\varvec{Q}}\). So, the iterative equation of Newton’s method is
where (t) indicates quantities evaluated at the tth Newton–Raphson iteration. According to (6.5), we would need to compute the inverse of the Hessian matrix at every iteration, which is extremely inefficient. The core idea of LB is to find an upper bound \({\varvec{B}}\) of \({\varvec{H}}^{*}\) such that
where \({\varvec{B}}\) is a symmetric, positive definite matrix and does not depend on \(\varvec{\theta }\). Using \({\varvec{C}}(\varvec{\theta })\) for the objective function, this corresponds to a quadratic approximation to \({\varvec{C}}(\varvec{\theta })-{\varvec{C}}(\varvec{\theta }_{0}) \) of the following form
The properties of LB can be found in Theorem 4.1 in [4]. As stated in [4], substituting 1/2 for all \(r_{i}\)s in \({\varvec{H}}^{*}\) does the job. Actually, we have
Here, \(C \preccurlyeq D \) means that \(D-C\) is nonnegative definite.
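To make the computational saving concrete, the following is a minimal numerical sketch (in Python) of an LB iteration for a logistic-type objective, with a plain ridge penalty standing in for the \(\varvec{\alpha }^{'}{\varvec{R}}\)-type term; the names, the simplified penalty, and the toy data are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def lb_fit(Q, y, lam=1e-3, n_iter=200):
        """Lower-bound (LB) iteration in the spirit of [4]: replace the
        iteration-dependent Hessian H* with a fixed bound B (H* <= B in the
        nonnegative-definite order), so B is factorized only once.
        Sketch for a logistic-type loss with ridge penalty lam."""
        n, d = Q.shape
        # r_i(1 - r_i) <= 1/4, attained at r_i = 1/2, hence the fixed bound:
        B = Q.T @ Q / 4.0 + lam * np.eye(d)
        L = np.linalg.cholesky(B)  # factorize once, reuse at every step
        theta = np.zeros(d)
        for _ in range(n_iter):
            r = 1.0 / (1.0 + np.exp(-Q @ theta))  # fitted probabilities
            g = Q.T @ (r - y) + lam * theta       # gradient G = Q'(-y + r) + penalty
            theta -= np.linalg.solve(L.T, np.linalg.solve(L, g))  # B replaces H*
        return theta

    # Toy usage on synthetic data:
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((200, 5))
    y = (Q @ rng.standard_normal(5) + rng.standard_normal(200) > 0).astype(float)
    theta_hat = lb_fit(Q, y)

Because \({\varvec{B}}\succcurlyeq {\varvec{H}}^{*}\) uniformly in \(\varvec{\theta }\), each step decreases the objective monotonically (Theorem 4.1 in [4]), and the single Cholesky factorization of \({\varvec{B}}\) replaces a fresh Hessian inversion at every iteration.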
About this article
Cite this article
Li, J., Cui, W. A New Classifier for Imbalanced Data Based on a Generalized Density Ratio Model. Commun. Math. Stat. 11, 369–401 (2023). https://doi.org/10.1007/s40304-021-00254-7