Robust regression against heavy heterogeneous contamination

The γ-divergence is well known for its strong robustness against heavy contamination. By virtue of this property, many applications of the γ-divergence have been proposed. There are two types of γ-divergence for the regression problem, in which the base measures are handled differently. In this study, these two γ-divergences are compared, and a large difference is found between them under heterogeneous contamination, where the outlier ratio depends on the explanatory variable. One γ-divergence has the strong robustness even under heterogeneous contamination. The other does not in general; however, it does under homogeneous contamination, where the outlier ratio does not depend on the explanatory variable, or when the parametric model of the response variable belongs to a location-scale family in which the scale does not depend on the explanatory variables. Hung et al. (Biometrics 74(1):145–154, 2018) discussed the strong robustness in a logistic regression model under the additional assumption that the tuning parameter γ is sufficiently large. The results obtained in this study hold for any parametric model without such an additional assumption.


Introduction
Maximum likelihood estimation and the least squares method have been widely used. However, these approaches are not robust against outliers. To overcome this problem, several robust methods have been proposed, primarily using M-estimation (Hampel et al. 2005; Maronna et al. 2006; Huber and Ronchetti 2009). Maximum likelihood estimation can be regarded as the minimization of the empirical estimator of the Kullback-Leibler divergence. As an extension of this concept, some robust estimators have been proposed as minimizers of empirical estimators of modified divergences, e.g., the density power divergence (Basu et al. 1998), the L2-divergence (Scott 2001), and the γ-divergence (or the Type-0 divergence) (Jones et al. 2001; Fujisawa and Eguchi 2008).
Recently, some robust regression methods have been proposed based on divergences, using the L2-divergence (Chi and Scott 2014; Lozano et al. 2016), the density power divergence (Ghosh and Basu 2016; Riani et al. 2020; Ghosh and Majumdar 2020), and the γ-divergence (Kawashima and Fujisawa 2017; Hung et al. 2018; Ren et al. 2020). In these methods, the robustness properties are generally investigated under a contaminated model. The difference between the i.i.d. problem and the regression problem lies in whether or not the outlier ratio in the contaminated model depends on the explanatory variable. These cases are called heterogeneous and homogeneous contamination, respectively. Recently, Hung et al. (2018) showed that a logistic regression model with mislabeled data can be regarded as a logistic regression model with heterogeneous contamination. They then applied the γ-divergence to a usual logistic regression model, which enables estimation of the parameter of the model without modeling the mislabeling scheme, even if mislabeled data exist. They discussed the strong robustness, namely that the latent bias can be sufficiently small even under heavy contamination. However, this was under the assumption that the tuning parameter γ is sufficiently large.
There are two types of γ-divergence for regression problems, in which the treatments of the base measure are different (Fujisawa and Eguchi 2008; Kawashima and Fujisawa 2017). Kawashima and Fujisawa (2017) proposed another type of γ-divergence, modifying the one proposed by Fujisawa and Eguchi (2008), and showed a Pythagorean relation, which does not hold for the earlier γ-divergence. Hung et al. (2018) adopted the γ-divergence for the regression problem proposed by Fujisawa and Eguchi (2008) and investigated robustness properties of the logistic regression model, as mentioned above. In addition, Ren et al. (2020) also adopted it and investigated theoretical properties of variable selection consistency and estimation bounds under a high-dimensional regression setting. In particular, its application to the generalized linear model (McCullagh and Nelder 1989) has been well studied. However, these studies focused on only the one divergence proposed by Fujisawa and Eguchi (2008), and no comparison between the two types has been made.
In this study, the two types of γ-divergence are compared in detail. Their differences in terms of the strong robustness are illustrated through numerical experiments. In contrast to the results of Hung et al. (2018), ours hold for any parametric model, including the logistic regression model, without the assumption that γ is sufficiently large, although Hung et al. (2018) makes this assumption.
The remainder of this paper is organized as follows. In Sect. 2, we show that existing robust regression methods may not work well under heavy heterogeneous contamination in the simple case of a univariate logistic regression model. In Sect. 3, the two types of γ-divergence for the regression problem are reviewed. In Sect. 4, we elucidate a large difference between the two types of γ-divergence from the viewpoint of robustness. In Sect. 5, the parameter estimation algorithm is proposed. In Sect. 6, numerical experiments are presented to verify the differences discussed in Sect. 4.

Illustrative example
Before getting into details, we present an illustrative example showing that existing robust regression methods may not work well under heavy heterogeneous contamination, where the outlier ratio depends on the explanatory variable.
Here, a univariate logistic regression model was used as a simulation model. Outliers were incorporated in a similar setting to that described in Sect. 6. A weighted maximum likelihood estimator (WMLE), M-estimator (Mest), redescending weighted M-estimator (WMest), conditional unbiased bounded influence estimator (CUBIF), and robust quasi-likelihood estimator (MQLE) were adopted as existing robust logistic regression methods (see Chapter 7 in Maronna et al. (2018) for details). Figure 1 shows the mean squared errors (MSEs) of the existing methods, type I, and type II at each outlier ratio. All existing methods showed almost identical results at each outlier ratio, and their performance worsened as the contamination became heavier. Type I, which is a robust regression method based on the γ-divergence proposed by Fujisawa and Eguchi (2008), performed well. In contrast, a similar method, type II, which is based on the γ-divergence proposed by Kawashima and Fujisawa (2017), behaved differently from type I. In the subsequent sections, we discuss the difference in robustness between the two types of γ-divergence for the regression problem. Specifically, why type I is better than type II is explored.
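To make the setting concrete, heterogeneous contamination can be simulated by flipping labels with an x-dependent probability. The sketch below is illustrative only: the function name, coefficient, and contamination region are our assumptions, not the settings used in Sect. 6.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_contaminated_logistic(n, beta, eps_fn):
    """Draw (x, y) from a univariate logistic model whose labels are
    flipped with probability eps_fn(x): heterogeneous contamination."""
    x = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-beta * x))
    y = rng.binomial(1, p)
    flip = rng.random(n) < eps_fn(x)  # outlier indicator depends on x
    y = np.where(flip, 1 - y, y)      # mislabel the selected points
    return x, y

# Outlier ratio grows at the edges of the explanatory variable
# (the cutoff 1.5 and ratio 0.4 are hypothetical choices).
x, y = sample_contaminated_logistic(
    5000, beta=2.0, eps_fn=lambda x: 0.4 * (np.abs(x) > 1.5))
```

Under homogeneous contamination, `eps_fn` would simply return a constant.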

Regression based on γ-divergence
The γ-divergence for regression was first proposed by Fujisawa and Eguchi (2008). It measures the difference between two conditional probability density functions. The other type of γ-divergence for regression was proposed by Kawashima and Fujisawa (2017), in which the treatment of the base measure on the explanatory variable was changed. For simplicity, the former is referred to as type I and the latter as type II. This section presents a brief review of both types of γ-divergence for regression alongside the corresponding parameter estimation.

Two types of γ-divergence for regression
First, the γ-divergence for the i.i.d. problem is reviewed. Let g(u) and f(u) be two probability density functions. The γ-cross entropy and γ-divergence are defined by

d_γ(g, f) = -\frac{1}{γ} \log \int g(u) f(u)^γ \, du + \frac{1}{1+γ} \log \int f(u)^{1+γ} \, du,
D_γ(g, f) = d_γ(g, f) - d_γ(g, g).

This satisfies the following two basic properties of divergence: D_γ(g, f) ≥ 0, and D_γ(g, f) = 0 if and only if g = f.

Let us consider the γ-divergence for the regression problem. Suppose that g(x, y), g(y|x), and g(x) are the underlying probability density functions of (x, y), y given x, and x, respectively. Let f(y|x) be another conditional probability density function of y given x. Let γ be the positive tuning parameter which controls the trade-off between efficiency and robustness. For the regression problem, Fujisawa and Eguchi (2008) proposed the following cross entropy and divergence:

Type I γ-cross entropy for regression:

d_{γ,1}(g(y|x), f(y|x); g(x)) = -\frac{1}{γ} \log \iint \frac{f(y|x)^γ}{\left( \int f(y|x)^{1+γ} \, dy \right)^{γ/(1+γ)}} \, g(x, y) \, dx \, dy. (3.1)

Type I γ-divergence for regression:

D_{γ,1}(g(y|x), f(y|x); g(x)) = d_{γ,1}(g(y|x), f(y|x); g(x)) - d_{γ,1}(g(y|x), g(y|x); g(x)).

The cross entropy is empirically estimable, as will be seen in Sect. 3.2, and the parameter estimation is easily defined. On the other hand, Kawashima and Fujisawa (2017) proposed the following cross entropy and divergence:

Type II γ-cross entropy for regression:

d_{γ,2}(g(y|x), f(y|x); g(x)) = -\frac{1}{γ} \log \iint g(y|x) f(y|x)^γ g(x) \, dx \, dy + \frac{1}{1+γ} \log \iint f(y|x)^{1+γ} g(x) \, dx \, dy. (3.2)

Type II γ-divergence for regression:

D_{γ,2}(g(y|x), f(y|x); g(x)) = d_{γ,2}(g(y|x), f(y|x); g(x)) - d_{γ,2}(g(y|x), g(y|x); g(x)).

In (3.2), the base measure on the explanatory variable is taken in each of the two terms of the γ-cross entropy for the i.i.d. problem. This extension from the i.i.d. problem to the regression problem appears more natural than (3.1). The cross entropy is also empirically estimable. Both types of γ-divergence satisfy the two basic properties of divergence for j = 1, 2: D_{γ,j}(g(y|x), f(y|x); g(x)) ≥ 0, with equality if and only if g(y|x) = f(y|x). The equality holds for the conditional probability density function instead of the usual probability density function. Theoretical properties of the γ-divergence for the i.i.d. problem were deeply investigated by Fujisawa and Eguchi (2008). There have been several studies on theoretical properties for the regression problem (Kanamori and Fujisawa 2015; Kawashima and Fujisawa 2017; Hung et al. 2018; Ren et al. 2020). However, there is a lack of comprehensive studies, such as a comparison of their properties under heterogeneous contamination. Heterogeneous contamination appears as a specific case in the regression problem and does not appear in the i.i.d. problem. Hung et al. (2018) pointed out that a logistic regression model with mislabeled data can be regarded as a logistic regression model with heterogeneous contamination, and they then applied type I to a usual logistic regression model, which enables us to estimate the parameter of the logistic regression model without modeling the mislabeling scheme even if mislabeled data exist. They also investigated theoretical properties on robustness, but they assumed that γ is sufficiently large. In Sect. 4 of this paper, it is shown that type I is superior to type II under heterogeneous contamination in the sense of the strong robustness, without assuming that γ is sufficiently large.
Finally, we note that the density power divergence (Basu et al. 1998) is another candidate divergence which provides robustness; however, it does not have the strong robustness (Hung et al. 2018). For the completeness of this paper, details of the robustness of the density power divergence under homogeneous and heterogeneous contamination are provided, although some parts have been investigated by Hung et al. (2018). See Appendix G for details.

Estimation for γ-regression
Let f(y|x; θ) be a conditional probability density function of y given x with parameter θ. Let (x_i, y_i) (i = 1, ..., n) be the observations randomly drawn from the underlying distribution g(x, y). Using (3.1) and (3.2), both types of γ-cross entropy for regression can be empirically estimated by

\bar{d}_{γ,1}(θ) = -\frac{1}{γ} \log \frac{1}{n} \sum_{i=1}^{n} \frac{f(y_i|x_i; θ)^γ}{\left( \int f(y|x_i; θ)^{1+γ} \, dy \right)^{γ/(1+γ)}},
\bar{d}_{γ,2}(θ) = -\frac{1}{γ} \log \frac{1}{n} \sum_{i=1}^{n} f(y_i|x_i; θ)^γ + \frac{1}{1+γ} \log \frac{1}{n} \sum_{i=1}^{n} \int f(y|x_i; θ)^{1+γ} \, dy.

The estimator is defined as the minimizer:

\hat{θ}_{γ,j} = \operatorname{argmin}_θ \, \bar{d}_{γ,j}(θ) \quad (j = 1, 2).

Using a similar approach to that in Fujisawa and Eguchi (2008), we can show that \hat{θ}_{γ,j} converges to θ^*_{γ,j} = \operatorname{argmin}_θ \, d_{γ,j}(g(y|x), f(y|x; θ); g(x)). Suppose that f(y|x; θ^*) is the target conditional probability density function. The latent bias is expressed as θ^*_{γ,j} - θ^*. This is zero when the underlying model belongs to the parametric model, in other words, g(y|x) = f(y|x; θ^*), but is not always zero when the underlying model is contaminated by outliers. This issue is discussed in Sect. 4.
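As a hypothetical numerical sketch (not the paper's code), the two empirical cross entropies can be computed for a heteroscedastic Gaussian model, for which ∫ f(y|x; θ)^{1+γ} dy has the closed form (2πσ(x)²)^{−γ/2}(1+γ)^{−1/2}; with an x-dependent scale the two estimators generally differ. All model choices below are illustrative assumptions.

```python
import numpy as np

def gauss_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def empirical_gamma_cross_entropies(x, y, beta, c, gamma):
    """Empirical type I / type II gamma-cross entropies for the
    heteroscedastic model y|x ~ N(beta*x, sigma(x)^2), sigma(x) = exp(c*x)."""
    sigma = np.exp(c * x)
    f = gauss_pdf(y, beta * x, sigma)
    # Closed form: \int N(y; mu, s^2)^{1+gamma} dy = (2 pi s^2)^{-gamma/2} / sqrt(1+gamma)
    kappa = (2 * np.pi * sigma ** 2) ** (-gamma / 2) / np.sqrt(1 + gamma)
    d1 = -np.log(np.mean(f ** gamma / kappa ** (gamma / (1 + gamma)))) / gamma
    d2 = -np.log(np.mean(f ** gamma)) / gamma + np.log(np.mean(kappa)) / (1 + gamma)
    return d1, d2

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = 1.5 * x + np.exp(0.5 * x) * rng.normal(size=2000)
d1, d2 = empirical_gamma_cross_entropies(x, y, beta=1.5, c=0.5, gamma=0.5)
# With an x-dependent scale the two estimators generally take different values.
```

When the scale is constant (c = 0), `kappa` no longer depends on i and the two values coincide, which previews Sect. 3.3.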

Case of location-scale family
Here, it is shown that both types of γ-divergence give the same parameter estimation when the parametric conditional probability density function f(y|x; θ) belongs to a location-scale family in which the scale does not depend on the explanatory variables. This is given by

f(y|x; θ) = \frac{1}{σ} s\left( \frac{y - q(x; ζ)}{σ} \right), (3.3)

where s(y) is a probability density function, σ is a scale parameter, and q(x; ζ) is a location function with a regression parameter ζ, e.g., q(x; ζ) = x^T ζ. Then, we can obtain

\int f(y|x; θ)^{1+γ} \, dy = σ^{-γ} \int s(z)^{1+γ} \, dz.

This does not depend on the explanatory variables x. Using this property, we can show that both types of γ-cross entropy are the same.

Proposition 1 Consider the location-scale family (3.3). We see d_{γ,1}(g(y|x), f(y|x; θ); g(x)) = d_{γ,2}(g(y|x), f(y|x; θ); g(x)).
The proof is in Appendix A. As a result, both types of γ-divergence give the same parameter estimation, because the estimator is defined by the empirical estimation of the γ-cross entropy. However, it should be noted that the two γ-divergences themselves are not the same, because d_{γ,1}(g(y|x), g(y|x); g(x)) ≠ d_{γ,2}(g(y|x), g(y|x); g(x)).
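Proposition 1 can be checked numerically. In the sketch below (an illustrative Gaussian location model with constant scale, not the paper's code), the two empirical cross entropies agree to machine precision.

```python
import numpy as np

def gauss_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Homoscedastic location family: y|x ~ N(beta*x, sigma^2), sigma constant in x.
rng = np.random.default_rng(2)
gamma, beta, sigma = 0.5, 1.0, 1.3
x = rng.normal(size=500)
y = beta * x + sigma * rng.normal(size=500)

f = gauss_pdf(y, beta * x, sigma)
# \int f(y|x)^{1+gamma} dy is the same constant for every x:
kappa = (2 * np.pi * sigma ** 2) ** (-gamma / 2) / np.sqrt(1 + gamma)

d1 = -np.log(np.mean(f ** gamma / kappa ** (gamma / (1 + gamma)))) / gamma
d2 = -np.log(np.mean(f ** gamma)) / gamma + np.log(kappa) / (1 + gamma)
# d1 and d2 coincide, as Proposition 1 states.
```

Pulling the constant kappa out of the log shows the identity algebraically: d1 = −γ⁻¹ log mean(f^γ) + (1+γ)⁻¹ log κ = d2.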

Robust properties
In this section, we show a large difference between the two types of γ -divergence.

Contamination model and basic condition
Let δ(y|x) be the contaminated conditional probability density function related to outliers. Let ε(x) and ε denote the outlier ratios which do and do not depend on x, respectively. Suppose that the underlying conditional probability density functions under heterogeneous and homogeneous contamination are given by

Heterogeneous contamination: g(y|x) = (1 - ε(x)) f(y|x; θ^*) + ε(x) δ(y|x),
Homogeneous contamination: g(y|x) = (1 - ε) f(y|x; θ^*) + ε δ(y|x).

Here we assume that

ν_{f_{θ^*},γ} = \left\{ \int δ(y|x) f(y|x; θ^*)^γ \, dy \right\}^{1/γ} ≈ 0.

This extends an assumption used for the i.i.d. problem (Fujisawa and Eguchi 2008) to the regression problem, and it illustrates that the contaminated conditional probability density function δ(y|x) lies on the tail of the target conditional probability density function f(y|x; θ^*). For example, if δ(y|x) is the Dirac delta function at the outlier y†(x) given x, then we have ν_{f_{θ^*},γ} = f(y†(x)|x; θ^*). Here we also consider the condition ν_{f_θ,γ} ≈ 0, as used in Hung et al. (2018). This will be true in the neighborhood of θ = θ^*. In addition, even when θ is not close to θ^*, if δ(y|x) lies on the tail of f(y|x; θ), we can see ν_{f_θ,γ} ≈ 0.
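For intuition, when δ(y|x) is a point mass at an outlier y†(x), ν reduces to the model density evaluated at the outlier, f(y†(x)|x; θ*). Evaluating a standard normal density far in the tail versus near the bulk shows why ν ≈ 0 for genuine outliers (the numbers below are hypothetical, for illustration only).

```python
import numpy as np

def gauss_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# nu_{f,gamma} = ( \int delta(y|x) f(y|x)^gamma dy )^{1/gamma};
# for a point mass at y_dagger this is simply f(y_dagger | x).
tail_nu = gauss_pdf(8.0, 0.0, 1.0)    # outlier far in the tail: essentially zero
center_nu = gauss_pdf(0.5, 0.0, 1.0)  # a point near the bulk: not small
```

Only the first case satisfies the basic condition ν ≈ 0, so the robustness results apply to tail outliers, not to contamination placed inside the bulk of the model.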
To simplify the discussion, the monotone transformation of both types of γ -cross entropy for regression is prepared as follows:

Robustness of type I
Following some calculations, we have an approximate expression of the type I cross entropy under heterogeneous contamination; a detailed derivation is in Appendix B. From this relation, we can easily show the following theorem.

Theorem 1 Consider the case of heterogeneous contamination. Under the condition
Using this theorem, we can expect the strong robustness that the latent bias θ^*_{γ,1} - θ^* is close to zero even when ε(x) is not small. The last equality holds even when g(x) is replaced by g̃(x). In addition, we can obtain the modified Pythagorean relation approximately.
The modified Pythagorean relation also implies the strong robustness, in a similar manner to the discussion following Theorem 1. Finally, the case of homogeneous contamination is discussed. Under homogeneous contamination, we have the following relation.

Robustness of type II
First, it is illustrated that in the case of type II, the strong robustness does not generally hold under heterogeneous contamination, unlike for type I. An approximate expression of the type II cross entropy is derived in Appendix D. Unlike for type I, it cannot be expressed using an appropriate base measure b(x), because the base measure on the explanatory variables in the numerator is different from that in the denominator. As mentioned in Sect. 3.3, when the parametric conditional probability density function belongs to a location-scale family (3.3), the cross entropy for type II is the same as that for type I, and thus type II can have the strong robustness.
Under homogeneous contamination, we have an analogous approximate expression, and then the latent bias θ^*_{γ,2} - θ^* can be expected to be sufficiently small, i.e., type II can have the strong robustness.

Parameter estimation algorithm
In this section, a parameter estimation algorithm for type I is proposed. Kawashima and Fujisawa (2017) proposed an iterative estimation algorithm for type II via the Majorization-Minimization (MM) algorithm (Hunter and Lange 2004). This algorithm has a monotone decreasing property, i.e., the objective function monotonically decreases at each iterative step, which provides numerical stability and efficiency. The present study also utilizes the MM algorithm.

MM algorithm
Here, the principle of the MM algorithm is explained in brief. Let h_MM(η|η^{(m)}) be a majorization function of the objective h(η) at the current iterate η^{(m)}, i.e., h_MM(η|η) = h(η) and h_MM(η|η^{(m)}) ≥ h(η) for all η. With the update η^{(m+1)} = argmin_η h_MM(η|η^{(m)}), it can be shown that the objective function h(η) monotonically decreases at each iterative step, because

h(η^{(m+1)}) ≤ h_MM(η^{(m+1)}|η^{(m)}) ≤ h_MM(η^{(m)}|η^{(m)}) = h(η^{(m)}).

Note that η^{(m+1)} need not be the minimizer of h_MM(η|η^{(m)}); we only need h_MM(η^{(m+1)}|η^{(m)}) ≤ h_MM(η^{(m)}|η^{(m)}). The problem with the MM algorithm is how to construct a majorization function h_MM.
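The principle can be illustrated on a toy problem unrelated to the paper's objective. The sketch below minimizes h(η) = Σ_i |η − a_i| via the standard quadratic majorizer |u| ≤ u²/(2|u₀|) + |u₀|/2, with a small eps guard added for numerical safety; the objective value is non-increasing across iterations.

```python
import numpy as np

def mm_median(a, eta0, iters=200, eps=1e-9):
    """MM for h(eta) = sum_i |eta - a_i| using the quadratic majorizer
    |u| <= u^2 / (2|u0|) + |u0|/2, which touches |u| at u = u0."""
    eta = eta0
    history = [float(np.sum(np.abs(eta - a)))]
    for _ in range(iters):
        w = 1.0 / (np.abs(eta - a) + eps)  # curvatures of the majorizer at eta
        eta = np.sum(w * a) / np.sum(w)    # closed-form minimizer of the majorizer
        history.append(float(np.sum(np.abs(eta - a))))
    return eta, history

a = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
eta, hist = mm_median(a, eta0=0.0)  # converges toward the median, 3.0
```

Each step solves an easy weighted least squares surrogate instead of the nonsmooth objective, which is exactly the structure exploited for the γ-divergence objective below.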

Let us recall the objective function h(θ ) in type I:
The majorization function can be derived in a similar manner to that in Kawashima and Fujisawa (2017). We show the following theorem.

Theorem 4 Consider the function
, where const is a term which does not depend on the parameter θ. This is a majorization function of h(θ). Consequently, the sequence of iterates given by θ^{(m+1)} = argmin_θ h_MM(θ|θ^{(m)}) decreases the objective function h(θ) at each iterative step.
The proof is in Appendix E. As mentioned in Sect. 3.3, when the parametric conditional probability density function f (y|x; θ) belongs to a location-scale family (3.3), the cross entropy for type I is the same as that for type II, and the above majorization function is also the same as that for type II.

MM algorithm for logistic regression model
Here, the case of the logistic regression model is thoroughly investigated. This model does not belong to a location-scale family as the parametric conditional probability density function f(y|x; θ). Let f(y|x; β) be the Bernoulli distribution given by

f(y|x; β) = π(x; β)^y (1 - π(x; β))^{1-y}, y ∈ {0, 1},

where π(x; β) = {1 + exp(-x^T β)}^{-1}. For simplicity, the intercept term β_0 is omitted in the linear predictor. Following a simple calculation, we obtain the explicit form of h_MM(β|β^{(m)}), where the constant term is ignored. By applying the idea of the quadratic approximation (the second-order Taylor polynomial) (Böhning and Lindsay 1988) to h_MM, a new majorization function h̃_MM(β|β^{(m)}) is obtained. Then, we provide the following theorem.

Theorem 5 Consider the function h̃_MM(β|β^{(m)}). It can be seen that this is a majorization function of h(β). Consequently, the sequence of iterates given by β^{(m+1)} = (X^T X)^{-1} X^T z^{(m)} decreases the objective function h(β) at each iterative step.
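For reference, the same Böhning and Lindsay (1988) curvature bound applied to the ordinary logistic log-likelihood (not the γ-divergence objective of this paper) gives the following well-known monotone update; this sketch is illustrative only.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def neg_loglik(beta, X, y):
    p = sigmoid(X @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def mm_logistic(X, y, iters=100):
    """Boehning-Lindsay MM for ordinary logistic MLE: the Hessian
    X^T W X with W = diag(p(1-p)) is dominated by X^T X / 4, so the
    update beta <- beta + (X^T X / 4)^{-1} X^T (y - p) never increases
    the negative log-likelihood."""
    beta = np.zeros(X.shape[1])
    H = X.T @ X / 4.0  # fixed curvature bound, factor it once in practice
    hist = [neg_loglik(beta, X, y)]
    for _ in range(iters):
        p = sigmoid(X @ beta)
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
        hist.append(neg_loglik(beta, X, y))
    return beta, hist

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = rng.binomial(1, sigmoid(X @ np.array([1.0, -1.0])))
beta, hist = mm_logistic(X, y)
```

Because the bound X^T X / 4 is fixed, each iteration is a single linear solve with a precomputable matrix, which is the same design choice behind the weighted-least-squares form β^{(m+1)} = (X^T X)^{-1} X^T z^{(m)} in Theorem 5.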
The proof is in Appendix F. Therefore, to guarantee the monotone decreasing property, the proposed algorithm does not require a line search method (Armijo 1966; Wolfe 1969).

Numerical experiments

As shown in Sect. 4, the large difference occurs under heterogeneous contamination when the parametric conditional probability density function f(y|x; θ) does not belong to a location-scale family. Therefore, the logistic regression model is used as a simulation model. All experiments were performed using R (http://www.r-project.org/index.html). The code used to implement the proposed method is available at https://sites.google.com/site/takayukikawashimaspage/software.
Outliers were incorporated into the simulations. The outliers were generated around the edge of the explanatory variables, where the explanatory variables were generated from N(μ_out, 0.5² I) with μ_out = (2, 0, 2, 0, 2, 0_{p-5})^T, and the response variable y was set to 0. The outlier ratio was set to ε = 0.1, 0.2, 0.3, 0.4.
To verify the fitness of the regression coefficient, the mean squared error (MSE) was used as the performance measure, given by

MSE = \frac{1}{p} \sum_{j=1}^{p} (\hat{β}_j - β^*_j)^2,

where the β^*_j's are the true coefficients. The tuning parameter γ in the γ-divergence was set to 1.0, 2.0, and 3.0. The tuning parameter α in the DPD was set to 1.0, 2.0, and 3.0. Figure 2 shows the MSE in the cases ε = 0.1, 0.2, 0.3, and 0.4. Type I presented smaller MSEs than type II. The difference between the two types became more significant as the outlier ratio ε increased. For types I and II, results similar to those presented in Sect. 2 can be seen, even in the multivariate case. The DPD presented worse MSEs as the outlier ratio ε or the dimension p became larger.
The DPD seems to be better than the γ-divergence at the same values of γ and α. However, different divergences should not be compared at the same tuning parameter value, because the parameters have different meanings.

Application to real data
We compare type I with type II and the DPD on real data. The following datasets, available from the UCI Machine Learning Repository (Dua and Graff 2017), were used: Absenteeism at work (Work) (Martiniano et al. 2012), banknote authentication (Bank), Heart, Haberman's Survival (Survival), and Libras Movement (Libras).
First, we applied ordinary logistic regression to the real data and regarded the estimated regression coefficients as the true coefficients β^*_0 and β^*. Then, outliers were incorporated into the data as follows. The magnitudes of the explanatory variables were sorted in descending order x_(1) ≥ x_(2) ≥ ··· ≥ x_(n), where (i) denotes the i-th order. We selected ε × n samples with the largest magnitudes, and each corresponding response variable was flipped, e.g., y = 0 → y = 1. The outlier ratio ε was set to 0.4.
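The label-flipping scheme can be sketched as follows; note that using the Euclidean norm ‖x_i‖ to rank the "magnitudes of the explanatory variables" is our assumption for illustration, as is the helper name.

```python
import numpy as np

def flip_top_labels(X, y, eps):
    """Flip the labels of the eps*n samples whose explanatory variables
    have the largest magnitude (here: largest Euclidean norm)."""
    n = len(y)
    k = int(np.ceil(eps * n))
    idx = np.argsort(-np.linalg.norm(X, axis=1))[:k]  # largest ||x|| first
    y_out = y.copy()
    y_out[idx] = 1 - y_out[idx]                       # y = 0 -> 1, y = 1 -> 0
    return y_out

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 3))
y = rng.binomial(1, 0.5, size=10)
y_out = flip_top_labels(X, y, eps=0.4)  # flips 4 of the 10 labels
```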
In order to verify the fitness of the regression coefficient, we used the MSE as the performance measure. The tuning parameter γ in the γ -divergence was set to 1.0, 2.0, and 3.0. The tuning parameter α in the DPD was set to 1.0, 2.0, and 3.0. Table 1 shows the MSE in real data sets with outliers. Type I presented smaller MSEs than type II and showed better results than the DPD in most cases.

Conclusion
In this study, the difference between the two types of γ-divergence for the regression problem, referred to as type I and type II, was investigated. We showed that, unlike type II, type I has the strong robustness under heterogeneous contamination even when the parametric conditional probability density function does not belong to a location-scale family. In addition, we described the difference in robustness relative to an existing robust divergence, the density power divergence for the regression problem. Further, an efficient estimation algorithm was proposed based on the principle of the MM algorithm. Numerical experiments supported the obtained theoretical results under various settings, as did the application to real data. In the experiments performed herein, the robustness tuning parameter γ was fixed. However, in some cases, it may be better to select an appropriate tuning parameter from among γ candidates than to fix the value. The monitoring approach (Riani et al. 2020) and robust cross-validation (Kawashima and Fujisawa 2017) can be utilized as methods of selecting γ.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Proof of Proposition 1
We use the following relation:

\int f(y|x; θ)^{1+γ} \, dy = σ^{-γ} \int s(z)^{1+γ} \, dz.

This does not depend on the explanatory variables x. Then, the result follows by a direct calculation.

D Detailed derivation on robustness of type II in Sect. 4.3
The formulation of type II under heterogeneous contamination is provided to show that the strong robustness does not hold in general.
This relation implies that the target conditional probability density function is affected by the outlier ratio ε. Thus, we cannot expect that the latent bias is close to zero.
Next, we consider the heterogeneous contamination. We show the non-strong robustness for the DPD as follows: the contaminated DPD cross entropy is approximately d_dpd(f(y|x; θ^*), f(y|x; θ); g̃(x)) plus a remainder term, and this remainder term is not negligible in general. Let us consider the following linear regression model as a specific example: y = x^T β + e with e ~ N(0, σ²). We have

\int f(y|x; β, σ^2)^{1+α} \, dy = \frac{(2πσ^2)^{-α/2}}{\sqrt{1+α}}.
This depends on the value of σ², and it generally cannot be close to zero, even as α → ∞. Therefore, the latent bias cannot be expected to be close to zero under heterogeneous contamination.
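The closed form above can be verified numerically (an illustrative check with arbitrary values of α and σ).

```python
import numpy as np

def gauss_pdf(y, mu, sigma):
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Check: \int N(y; mu, sigma^2)^{1+alpha} dy = (2 pi sigma^2)^{-alpha/2} / sqrt(1+alpha)
alpha, sigma = 2.0, 1.5
grid = np.linspace(-20.0, 20.0, 400001)
vals = gauss_pdf(grid, 0.0, sigma) ** (1 + alpha)
numeric = float(np.sum((vals[1:] + vals[:-1]) / 2) * (grid[1] - grid[0]))  # trapezoid rule
closed = (2 * np.pi * sigma ** 2) ** (-alpha / 2) / np.sqrt(1 + alpha)
```

The integral scales like (2πσ²)^{−α/2}, so it stays bounded away from zero whenever 2πσ² ≤ 1, which is the source of the non-negligible remainder term above.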