Introduction

Maximum likelihood estimation and the least squares method have been widely used; however, these approaches are not robust against outliers. To overcome this problem, several robust methods have been proposed, primarily using M-estimation (Hampel et al. 2005; Maronna et al. 2006; Huber and Ronchetti 2009). Maximum likelihood estimation can be regarded as minimizing the empirical estimator of the Kullback-Leibler divergence. As an extension of this idea, several robust estimators have been proposed that minimize empirical estimators of modified divergences, e.g., the density power divergence (Basu et al. 1998), the \(L_2\)-divergence (Scott 2001), and the \(\gamma \)-divergence (or Type-0 divergence) (Jones et al. 2001; Fujisawa and Eguchi 2008).

Recently, several robust regression methods have been proposed based on divergences, using the \(L_2\)-divergence (Chi and Scott 2014; Lozano et al. 2016), the density power divergence (Ghosh and Basu 2016; Riani et al. 2020; Ghosh and Majumdar 2020), and the \(\gamma \)-divergence (Kawashima and Fujisawa 2017; Hung et al. 2018; Ren et al. 2020). In these methods, robustness properties are generally investigated under a contaminated model. The difference between the i.i.d. problem and the regression problem lies in whether or not the outlier ratio in the contaminated model depends on the explanatory variable; these cases are called heterogeneous and homogeneous contamination, respectively. Hung et al. (2018) showed that a logistic regression model with mislabeled data can be regarded as a logistic regression model with heterogeneous contamination. They then applied the \(\gamma \)-divergence to a usual logistic regression model, which enables estimation of the model parameter without modeling the mislabeling scheme, even if mislabeled data exist. They discussed the strong robustness, namely that the latent bias can be sufficiently small even against heavy contamination; however, this was under the assumption that the tuning parameter \(\gamma \) is sufficiently large.

There are two types of \(\gamma \)-divergence for regression problems, which differ in their treatment of the base measure (Fujisawa and Eguchi 2008; Kawashima and Fujisawa 2017). Kawashima and Fujisawa (2017) proposed a variant of the \(\gamma \)-divergence of Fujisawa and Eguchi (2008) and showed a Pythagorean relation that does not hold for the earlier \(\gamma \)-divergence. Hung et al. (2018) adopted the \(\gamma \)-divergence for the regression problem proposed by Fujisawa and Eguchi (2008) and investigated robustness properties of the logistic regression model, as mentioned above. In addition, Ren et al. (2020) also adopted it and investigated theoretical properties of variable-selection consistency and estimation bounds in a high-dimensional regression setting. In particular, its application to the generalized linear model (McCullagh and Nelder 1989) has been well studied. However, these studies focused only on the divergence proposed by Fujisawa and Eguchi (2008), and no comparison between the two types has been made.

In this study, the two types of \(\gamma \)-divergence are compared in detail, and their differences in terms of strong robustness are illustrated through numerical experiments. In contrast to Hung et al. (2018), our results hold for any parametric model, including the logistic regression model, without assuming that \(\gamma \) is sufficiently large.

The remainder of this paper is organized as follows. In Sect. 2, we show that existing robust regression methods may not work well under heavy heterogeneous contamination in the simple case of a univariate logistic regression model. In Sect. 3, the two types of \(\gamma \)-divergence for the regression problem are reviewed. In Sect. 4, we elucidate a large difference between the two types of \(\gamma \)-divergence from the viewpoint of robustness. In Sect. 5, the parameter estimation algorithm is proposed. In Sect. 6, numerical experiments are presented to verify the differences discussed in Sect. 4.

Illustrative example

Before getting into details, we present an illustrative example showing that existing robust regression methods may not work well under heavy heterogeneous contamination, where the outlier ratio depends on the explanatory variable.

Here, a univariate logistic regression model was used as a simulation model. Outliers were incorporated in a similar setting to that described in Sect. 6. A weighted maximum likelihood estimator (WMLE), M-estimator (Mest), redescending weighted M-estimator (WMest), conditional unbiased bounded influence estimator (CUBIF), and robust quasi-likelihood estimator (MQLE) were adopted as existing robust logistic regression methods (see Chapter 7 in Maronna et al. (2018) for details).

Figure 1 shows the mean squared errors (MSEs) of the existing methods, type I, and type II at each outlier ratio. All existing methods showed almost identical results at each outlier ratio, and their performance worsened as the contamination became heavier. Type I, a robust regression method based on the \(\gamma \)-divergence proposed by Fujisawa and Eguchi (2008), performed well. In contrast, the similar method type II, based on the \(\gamma \)-divergence proposed by Kawashima and Fujisawa (2017), behaved quite differently from type I. In the subsequent sections, we discuss the difference in robustness between the two types of \(\gamma \)-divergence for the regression problem; in particular, we explore why type I outperforms type II.

Fig. 1

MSEs of existing robust regression methods, type I, and type II at various outlier ratios. The existing methods presented almost identical MSEs

Regression based on \(\gamma \)-Divergence

The \(\gamma \)-divergence for regression was first proposed by Fujisawa and Eguchi (2008). It measures the difference between two conditional probability density functions. The other type of \(\gamma \)-divergence for regression was proposed by Kawashima and Fujisawa (2017), in which the treatment of the base measure on the explanatory variable was changed. For simplicity, the former is referred to as type I and the latter as type II. This section presents a brief review of both types of \(\gamma \)-divergence for regression alongside the corresponding parameter estimation.

Two types of \(\gamma \)-Divergence for regression

First, the \(\gamma \)-divergence for the i.i.d. problem is reviewed. Let g(u) and f(u) be two probability density functions. The \(\gamma \)-cross entropy and \(\gamma \)-divergence are defined by

$$\begin{aligned}&d_\gamma (g(u),f(u)) = -\frac{1}{\gamma } \log \int g(u) f(u)^\gamma du + \frac{1}{1+\gamma } \log \int f(u)^{1+\gamma } du, \\&D_\gamma (g(u),f(u)) = - d_\gamma (g(u),g(u)) + d_\gamma (g(u),f(u)) . \end{aligned}$$

This satisfies the following two basic properties of divergence:

$$\begin{aligned} \begin{array}{cl} \hbox {(i)} &{} D_\gamma (g(u),f(u)) \ge 0. \\ \hbox {(ii)} &{} D_\gamma (g(u),f(u)) = 0 \ \Leftrightarrow \ g(u)=f(u) \ {(a.e.)}. \end{array} \end{aligned}$$
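As a quick numerical sanity check of properties (i) and (ii), both quantities can be evaluated on a grid; the helper names `d_gamma`/`D_gamma` and the trapezoidal integration below are our own illustration, not part of the paper:

```python
import numpy as np

# d_gamma(g, f) = -(1/gamma) log ∫ g f^gamma du + (1/(1+gamma)) log ∫ f^(1+gamma) du
def d_gamma(g, f, u, gamma):
    term1 = -np.log(np.trapz(g * f**gamma, u)) / gamma
    term2 = np.log(np.trapz(f**(1 + gamma), u)) / (1 + gamma)
    return term1 + term2

# D_gamma(g, f) = -d_gamma(g, g) + d_gamma(g, f)
def D_gamma(g, f, u, gamma):
    return -d_gamma(g, g, u, gamma) + d_gamma(g, f, u, gamma)

u = np.linspace(-12.0, 12.0, 4001)
pdf = lambda m, s: np.exp(-(u - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
g, f = pdf(0.0, 1.0), pdf(1.0, 1.5)   # two normal densities on the grid

D_gf = D_gamma(g, f, u, 0.5)          # property (i): strictly positive here
D_gg = D_gamma(g, g, u, 0.5)          # property (ii): zero when g = f
```

With `g` and `f` distinct normals, `D_gf` is positive and `D_gg` vanishes identically, matching (i) and (ii).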

Let us consider the \(\gamma \)-divergence for the regression problem. Suppose that g(x,y), g(y|x), and g(x) are the underlying probability density functions of (x,y), y given x, and x, respectively. Let f(y|x) be another conditional probability density function of y given x. Let \(\gamma \) be a positive tuning parameter that controls the trade-off between efficiency and robustness.

For the regression problem, Fujisawa and Eguchi (2008) proposed the following cross entropy and divergence:

  • Type I \(\gamma \)-cross entropy for regression:

    $$\begin{aligned} d_{\gamma ,1} (g(y|x),f(y|x);g(x)) = -\frac{1}{\gamma } \log \int \int \left\{ \frac{ f(y|x)^{\gamma } }{ \left( \int f(y|x)^{1+\gamma }dy\right) ^{\frac{\gamma }{1+\gamma }} } \right\} g(x,y)dxdy . \end{aligned}$$
    (3.1)
  • Type I \(\gamma \)-divergence for regression:

    $$\begin{aligned}&D_{\gamma ,1} (g(y|x),f(y|x);g(x)) \\&\quad = - d_{\gamma ,1}(g(y|x),g(y|x);g(x)) + d_{\gamma ,1}(g(y|x),f(y|x);g(x)). \end{aligned}$$

The cross entropy is empirically estimable, as will be seen in Sect. 3.2, and the parameter estimation is easily defined. On the other hand, Kawashima and Fujisawa (2017) proposed the following cross entropy and divergence:

  • Type II \(\gamma \)-cross entropy for regression:

    $$\begin{aligned}&d_{\gamma ,2} (g(y|x),f(y|x);g(x)) \nonumber \\&\quad = -\frac{1}{\gamma } \log \int \int f(y|x)^{\gamma } g(x,y) dxdy + \frac{1}{1+\gamma } \log \int \left( \int f(y|x)^{1+\gamma } dy \right) g(x) dx. \end{aligned}$$
    (3.2)
  • Type II \(\gamma \)-divergence for regression:

    $$\begin{aligned}&D_{\gamma ,2} (g(y|x),f(y|x);g(x)) \\&\quad = -d_{\gamma , 2}(g(y|x),g(y|x);g(x)) +d_{\gamma , 2}(g(y|x),f(y|x);g(x)). \end{aligned}$$

In type II, the base measure on the explanatory variable is applied separately to each of the two terms of the \(\gamma \)-divergence for the i.i.d. problem. This extension from the i.i.d. problem to the regression problem appears more natural than (3.1). The cross entropy is also empirically estimable. Both types of \(\gamma \)-divergence satisfy the following two basic properties of divergence for \(j=1,2\):

$$\begin{aligned} \begin{array}{cl} \hbox {(i)} &{} D_{\gamma ,j}(g(y|x),f(y|x);g(x)) \ge 0. \\ \hbox {(ii)} &{} D_{\gamma ,j}(g(y|x),f(y|x);g(x)) = 0 \qquad \Leftrightarrow \ g(y|x)=f(y|x) \ {(a.e.)}. \end{array} \end{aligned}$$

The equality in (ii) holds between conditional probability density functions rather than between usual (unconditional) probability density functions.

Theoretical properties of the \(\gamma \)-divergence for the i.i.d. problem were deeply investigated by Fujisawa and Eguchi (2008). There have been several studies on theoretical properties for the regression problem (Kanamori and Fujisawa 2015; Kawashima and Fujisawa 2017; Hung et al. 2018; Ren et al. 2020). However, there is a lack of comprehensive studies, such as a comparison of properties under heterogeneous contamination. Heterogeneous contamination appears as a specific case in the regression problem and does not appear in the i.i.d. problem. Hung et al. (2018) pointed out that a logistic regression model with mislabeled data can be regarded as a logistic regression model with heterogeneous contamination. They then applied type I to a usual logistic regression model, which enables us to estimate the parameter of the logistic regression model without modeling the mislabeling scheme, even if mislabeled data exist. They also investigated theoretical properties on robustness, but they assumed that \(\gamma \) is sufficiently large. In Sect. 4 of this paper, it is shown that type I is superior to type II under heterogeneous contamination in the sense of the strong robustness, without assuming that \(\gamma \) is sufficiently large.

Finally, we note that the density power divergence (Basu et al. 1998) is another divergence that provides robustness; however, it does not have the strong robustness (Hung et al. 2018). For the completeness of this paper, details of the robustness of the density power divergence under homogeneous and heterogeneous contamination are provided, although some parts have been investigated by Hung et al. (2018). See Appendix G for details.

Estimation for \(\gamma \)-regression

Let \(f(y|x;\theta )\) be a conditional probability density function of y given x with parameter \(\theta \). Let \((x_i,y_i) \ (i=1 , \ldots , n)\) be observations randomly drawn from the underlying distribution g(x,y). Using (3.1) and (3.2), both types of \(\gamma \)-cross entropy for regression can be empirically estimated by

$$\begin{aligned}&{\bar{d}}_{\gamma ,1} (f(y|x;\theta )) = -\frac{1}{\gamma } \log \frac{1}{n} \sum _{i=1}^n \frac{ f(y_i|x_i ; \theta )^{\gamma } }{ \left( \int f(y|x_i ;\theta )^{1+\gamma }dy\right) ^\frac{\gamma }{1+\gamma }}, \\&{\bar{d}}_{\gamma ,2} (f(y|x;\theta )) \\&\quad = - \frac{1}{\gamma } \log \left\{ \frac{1}{n} \sum _{i=1}^n f(y_i | x_i ;\theta )^{\gamma } \right\} + \frac{1}{1+\gamma } \log \left\{ \frac{1}{n} \sum _{i=1}^n \int f(y | x_i ;\theta )^{1+\gamma } dy \right\} . \end{aligned}$$
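For a logistic (Bernoulli) model, as considered later in Sect. 5.3, the inner integral over y reduces to a two-term sum over \(y \in \{0,1\}\), so both empirical cross entropies are easy to evaluate; a sketch, with function names of our own choosing:

```python
import numpy as np

def _pi(X, beta):
    # π(x; β) = {1 + exp(-x^T β)}^{-1}
    return 1.0 / (1.0 + np.exp(-X @ beta))

def d_bar_type1(X, y, beta, gamma):
    p = _pi(X, beta)
    f = p**y * (1 - p)**(1 - y)                    # f(y_i | x_i; β)
    norm = p**(1 + gamma) + (1 - p)**(1 + gamma)   # ∫ f^{1+γ} dy as a sum over y
    return -np.log(np.mean(f**gamma / norm**(gamma / (1 + gamma)))) / gamma

def d_bar_type2(X, y, beta, gamma):
    p = _pi(X, beta)
    f = p**y * (1 - p)**(1 - y)
    norm = p**(1 + gamma) + (1 - p)**(1 + gamma)
    return (-np.log(np.mean(f**gamma)) / gamma
            + np.log(np.mean(norm)) / (1 + gamma))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
beta = np.array([1.0, -1.0])
y = (rng.uniform(size=50) < _pi(X, beta)).astype(float)
v1 = d_bar_type1(X, y, beta, 0.5)
v2 = d_bar_type2(X, y, beta, 0.5)
```

Note that `v1` and `v2` generally differ, since the Bernoulli model is not a location-scale family.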

The estimator is defined as the minimizer as follows:

$$\begin{aligned} {\hat{\theta }}_{\gamma ,j} =\mathop {{\mathrm{argmin}}}\limits _{\theta } {\bar{d}}_{\gamma ,j}(f(y|x;\theta )) \ \ \hbox { for}\ j=1,2. \end{aligned}$$

Using a similar approach to that in Fujisawa and Eguchi (2008), we can show that \({\hat{\theta }}_{\gamma ,j}\) converges to \(\theta ^*_{\gamma ,j}\), where

$$\begin{aligned} \theta ^*_{\gamma ,j}&= \mathop {{\mathrm{argmin}}}\limits _{\theta } D_{\gamma ,j}(g(y|x),f(y|x;\theta );g(x)) \\&= \mathop {{\mathrm{argmin}}}\limits _{\theta } d_{\gamma ,j}(g(y|x),f(y|x;\theta );g(x)) \ \ \text{ for } j=1,2. \end{aligned}$$

Suppose that \(f(y|x;\theta ^*)\) is the target conditional probability density function. The latent bias is expressed as \(\theta ^*_{\gamma ,j}-\theta ^*\). This is zero when the underlying model belongs to the parametric family, i.e., \(g(y|x)=f(y|x;\theta ^*)\), but is not always zero when the underlying model is contaminated by outliers. This issue is discussed in Sect. 4.

Case of location-scale family

Here, it is shown that both types of \(\gamma \)-divergence give the same parameter estimation when the parametric conditional probability density function \(f(y|x;\theta )\) belongs to a location-scale family in which the scale does not depend on the explanatory variables. This is given by

$$\begin{aligned} f(y|x;\theta ) = \frac{1}{\sigma } s \left( \frac{y-q(x;\zeta )}{\sigma } \right) , \end{aligned}$$
(3.3)

where s(y) is a probability density function, \(\sigma \) is a scale parameter, and \(q(x;\zeta )\) is a location function with a regression parameter \(\zeta \), e.g., \(q(x;\zeta )=x^T \zeta \). Then, we can obtain

$$\begin{aligned} \int f(y|x;\theta )^{1+\gamma } dy&= \int \frac{1}{\sigma ^{1+\gamma }} s \left( \frac{y-q(x;\zeta )}{\sigma } \right) ^{1+\gamma } dy \nonumber \\&= \sigma ^{-\gamma } \int s(z)^{1+\gamma } dz. \end{aligned}$$

This does not depend on the explanatory variables x. Using this property, we can show that both types of \(\gamma \)-cross entropy are the same.
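The change-of-variables step above can be verified numerically; a sketch for a Gaussian s, using grid integration and arbitrarily chosen locations q standing in for \(q(x;\zeta )\):

```python
import numpy as np

# Check that ∫ f(y|x;θ)^{1+γ} dy = σ^{-γ} ∫ s(z)^{1+γ} dz for several locations q,
# i.e., the integral does not depend on the explanatory variables.
y = np.linspace(-40.0, 40.0, 20001)
s = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
gamma, sigma = 0.5, 2.0

rhs = sigma**(-gamma) * np.trapz(s(y)**(1 + gamma), y)
lhs = [np.trapz((s((y - q) / sigma) / sigma)**(1 + gamma), y)
       for q in (-3.0, 0.0, 5.0)]                      # location varies with x
```

All three left-hand values agree with the right-hand side up to the quadrature error.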

Proposition 1

Consider the location-scale family (3.3). We see

$$\begin{aligned} d_{\gamma ,1} (g(y|x),f(y|x;\theta );g(x)) =d_{\gamma ,2} (g(y|x),f(y|x;\theta );g(x)). \end{aligned}$$

The proof is in Appendix A. As a result, both types of \(\gamma \)-divergence give the same parameter estimate, because the estimator is defined via the empirical estimate of the \(\gamma \)-cross entropy. However, it should be noted that the two divergences themselves are not the same, because \(d_{\gamma ,1} (g(y|x),g(y|x);g(x)) \ne d_{\gamma ,2} (g(y|x),g(y|x);g(x)) \).

Robust properties

In this section, we show a large difference between the two types of \(\gamma \)-divergence.

Contamination model and basic condition

Let \(\delta (y|x)\) be the contaminated conditional probability density function related to outliers. Let \(\varepsilon (x)\) and \(\varepsilon \) denote the outlier ratios that do and do not depend on x, respectively. Suppose that the underlying conditional probability density functions under heterogeneous and homogeneous contamination are given by

  • Heterogeneous contamination:

    $$\begin{aligned} g(y|x)&= (1-\varepsilon (x))f(y|x;\theta ^*) + \varepsilon (x) \delta (y|x), \end{aligned}$$
  • Homogeneous contamination:

    $$\begin{aligned} g(y|x)&= (1-\varepsilon )f(y|x;\theta ^*) + \varepsilon \delta (y|x). \end{aligned}$$

Let

$$\begin{aligned} {\nu }_{f,\gamma }(x) = \left\{ \int \delta (y|x) f(y|x)^{ \gamma } dy \right\} ^{ \frac{1}{ \gamma } }, \ {\nu }_{f, \gamma } = \left\{ \int \nu _{f,\gamma }(x)^{\gamma } g(x) dx \right\} ^{ \frac{1}{ \gamma }} . \end{aligned}$$

Here we assume that

$$\begin{aligned} \nu _{f_{\theta ^*},\gamma } \approx 0. \end{aligned}$$

This extends to the regression problem the assumption used for the i.i.d. problem (Fujisawa and Eguchi 2008). This assumption implies that

$$\begin{aligned} \nu _{f_{\theta ^*},\gamma }(x) \approx 0 \text{ for } \text{ any } x \text{(a.e.), } \end{aligned}$$

and illustrates that the contaminated conditional probability density function \(\delta (y|x)\) lies on the tail of the target conditional probability density function \(f(y|x;\theta ^*)\). For example, if \(\delta (y|x)\) is the Dirac delta function at the outlier \(y_{\dag }(x)\) given x, then we have \(\nu _{f_{\theta ^*},\gamma }(x) = f(y_{\dag }(x)|x;\theta ^*) \approx 0\), which is reasonable because \(y_{\dag }(x)\) is an outlier.
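This point-mass example is easy to check numerically; for a standard normal f and an outlier at \(y_{\dag }=8\) (an arbitrary illustrative value of our own), \(\nu _{f,\gamma }(x)=f(y_{\dag }|x)\) is indeed negligible:

```python
import numpy as np

# For a point mass δ(y|x) at y_out, ∫ δ f^γ dy = f(y_out)^γ, so
# ν_{f,γ}(x) = {f(y_out)^γ}^{1/γ} = f(y_out), whatever γ > 0 is.
f = lambda y: np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
nu = f(8.0)   # ν_{f,γ}(x) for an outlier at y = 8, far in the tail
```

Here `nu` is of order 1e-15, consistent with the assumption \(\nu _{f_{\theta ^*},\gamma }(x) \approx 0\).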

Here we also consider the condition \(\nu _{f_{\theta },\gamma } \approx 0\), as used in Hung et al. (2018). This will be true in the neighborhood of \(\theta =\theta ^*\). In addition, even when \(\theta \) is not close to \(\theta ^*\), if \(\delta (y|x)\) lies on the tail of \(f(y|x;\theta )\), we can see \(\nu _{f_{\theta },\gamma } \approx 0\).

To simplify the discussion, the monotone transformation of both types of \(\gamma \)-cross entropy for regression is prepared as follows:

$$\begin{aligned} {\tilde{d}}_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x))&= - \exp \left\{ -\gamma d_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \right\} \\&= - \int \int \frac{ f(y|x;\theta )^{\gamma } }{ \left( \int f(y|x;\theta )^{1+\gamma } dy \right) ^{\frac{\gamma }{1+\gamma } } } g(y|x) g(x)dx dy , \\ {\tilde{d}}_{\gamma ,2}(g(y|x),f(y|x;\theta );g(x))&= - \exp \left\{ -\gamma d_{\gamma ,2}(g(y|x),f(y|x;\theta );g(x)) \right\} \\&= - \frac{ \int \left( \int g(y|x) f(y|x;\theta )^{\gamma } dy \right) g(x) dx }{ \left\{ \int \left( \int f(y|x;\theta )^{1+\gamma } dy \right) g(x) dx \right\} ^{\frac{\gamma }{1+\gamma } } } . \end{aligned}$$

Robustness of type I

Following some calculations, we have

$$\begin{aligned}&{\tilde{d}}_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \\&\quad = {\tilde{d}}_{\gamma ,1}(f(y|x;\theta ^*),f(y|x;\theta ); {\tilde{g}}(x)) - \int \frac{ \nu _{f_{\theta },\gamma }(x)^{\gamma } }{ \left( \int f(y|x;\theta )^{1+\gamma } dy \right) ^{\frac{\gamma }{1+\gamma } } } \varepsilon (x) g(x) dx, \end{aligned}$$

where \({\tilde{g}}(x) = (1-\varepsilon (x))g(x)\). A detailed derivation is in Appendix B. From this relation, we can easily show the following theorem.

Theorem 1

Consider the case of heterogeneous contamination. Under the condition \(\nu _{f_{\theta },\gamma } \approx 0\), we have

$$\begin{aligned} {\tilde{d}}_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \approx {\tilde{d}}_{\gamma ,1} (f(y|x;\theta ^*),f(y|x;\theta ); {\tilde{g}}(x)) . \end{aligned}$$

Using this theorem, we can expect the strong robustness that the latent bias \(\theta ^*_{\gamma ,1}-\theta ^*\) is close to zero even when \(\varepsilon (x)\) is not small, because

$$\begin{aligned} \theta ^*_{\gamma ,1}&= \mathop {{\mathrm{argmin}}}\limits _{\theta } d_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \\&= \mathop {{\mathrm{argmin}}}\limits _{\theta } {\tilde{d}}_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \\&\approx \mathop {{\mathrm{argmin}}}\limits _{\theta } {\tilde{d}}_{\gamma ,1} (f(y|x;\theta ^*),f(y|x;\theta ); {\tilde{g}}(x)) \quad \text{(by } \text{ Theorem }~1) \\&= \mathop {{\mathrm{argmin}}}\limits _{\theta } d_{\gamma ,1} (f(y|x;\theta ^*),f(y|x;\theta ); {\tilde{g}}(x)) = \theta ^*. \end{aligned}$$

The last equality holds even when g(x) is replaced by \({\tilde{g}}(x) = (1-\varepsilon (x))g(x)\).

In addition, we can obtain an approximate modified Pythagorean relation.

Theorem 2

Consider the case of heterogeneous contamination. Under the condition \(\nu _{f_{\theta },\gamma } \approx 0\), the modified Pythagorean relation among g(y|x), \(f(y|x;\theta ^*)\), \(f(y|x;\theta )\) approximately holds:

$$\begin{aligned}&D_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \\&\quad \approx D_{\gamma ,1}(g(y|x),f(y|x;\theta ^*);g(x)) + D_{\gamma ,1}(f(y|x;\theta ^*),f(y|x;\theta );{\tilde{g}}(x)). \end{aligned}$$

The modified Pythagorean relation also implies strong robustness in a similar manner to that in the subsequent discussion of Theorem 1.

Finally, the case of homogeneous contamination is discussed. Under homogeneous contamination, we have the following relation.

Theorem 3

Consider the case of homogeneous contamination. We see

$$\begin{aligned} D_{\gamma ,1}(f(y|x;\theta ^*),f(y|x;\theta );{\tilde{g}}(x)) = D_{\gamma ,1}(f(y|x;\theta ^*),f(y|x;\theta );g(x)). \end{aligned}$$

The proof is in Appendix C. Then, the modified Pythagorean relation in Theorem 2 is changed to the usual Pythagorean relation as follows:

$$\begin{aligned}&D_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \\&\quad \approx D_{\gamma ,1}(g(y|x),f(y|x;\theta ^*);g(x)) + D_{\gamma ,1}(f(y|x;\theta ^*),f(y|x;\theta );g(x)). \end{aligned}$$

Robustness of type II

First, it is illustrated that, unlike for type I, the strong robustness does not generally hold for type II under heterogeneous contamination. We see

$$\begin{aligned} {\tilde{d}}_{\gamma ,2}(g(y|x),f(y|x;\theta );g(x)) \approx - \frac{ \int \int f(y|x;\theta ^*) f(y|x;\theta )^{\gamma } dy (1-\varepsilon (x)) g(x) dx }{ \left\{ \int \left( \int f(y|x;\theta )^{1+\gamma } dy \right) g(x) dx \right\} ^{\frac{\gamma }{1+\gamma } } }. \end{aligned}$$

A detailed derivation is in Appendix D. This cannot be expressed using

$$\begin{aligned} d_{\gamma ,2}(f(y|x;\theta ^*),f(y|x;\theta );b(x)), \end{aligned}$$

with an appropriate base measure b(x), unlike for type I, because the base measure on the explanatory variables in the numerator differs from that in the denominator. As mentioned in Sect. 3.3, when the parametric conditional probability density function belongs to a location-scale family (3.3), the cross entropy for type II coincides with that for type I, and thus type II can have the strong robustness.

Under homogeneous contamination, we have

$$\begin{aligned} {\tilde{d}}_{\gamma ,2}(g(y|x),f(y|x;\theta );g(x)) \approx (1-\varepsilon ) {\tilde{d}}_{\gamma ,2}(f(y|x;\theta ^*),f(y|x;\theta );g(x)), \end{aligned}$$

and then the latent bias \(\theta ^*_{\gamma ,2}-\theta ^*\) can be expected to be sufficiently small, i.e., type II can have the strong robustness.

Parameter estimation algorithm

In this section, the parameter estimation algorithm for type I is proposed. Kawashima and Fujisawa (2017) proposed an iterative estimation algorithm for type II based on the Majorization-Minimization (MM) algorithm (Hunter and Lange 2004). This algorithm has the monotone decreasing property, i.e., the objective function monotonically decreases at each iterative step, which ensures numerical stability and efficiency. The present study also utilizes the MM algorithm.

MM algorithm

Here, the principle of the MM algorithm is explained in brief. Let \(h(\eta )\) be the objective function. Let us prepare the majorization function \(h_{MM}\) satisfying

$$\begin{aligned} h_{MM}(\eta ^{(m)}|\eta ^{(m)})&= h(\eta ^{(m)}), \\ h_{MM}(\eta |\eta ^{(m)})&\ge h(\eta ) \ \ \text{ for } \text{ all } \eta , \end{aligned}$$

where \(\eta ^{(m)}\) is the parameter of the m-th iterative step for \(m=0,1,2,\ldots \). The MM algorithm optimizes the majorization function instead of the objective function as follows:

$$\begin{aligned} \eta ^{(m+1)} = \mathop {{\mathrm{argmin}}}\limits _{\eta } h_{MM}(\eta |\eta ^{(m)}). \end{aligned}$$

Then, it can be shown that the objective function \(h(\eta )\) monotonically decreases at each iterative step, because

$$\begin{aligned} h(\eta ^{(m)}) = h_{MM}(\eta ^{(m)}|\eta ^{(m)}) \ge h_{MM}(\eta ^{(m+1)}|\eta ^{(m)}) \ge h(\eta ^{(m+1)}). \end{aligned}$$

Note that \(\eta ^{(m+1)}\) need not be the minimizer of \(h_{MM}(\eta |\eta ^{(m)})\). We only need

$$\begin{aligned} h_{MM}(\eta ^{(m)}|\eta ^{(m)}) \ge h_{MM}(\eta ^{(m+1)}|\eta ^{(m)}). \end{aligned}$$

The key issue in applying the MM algorithm is how to construct a majorization function \(h_{MM}\).
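The monotone decreasing property can be illustrated on a toy problem; the sketch below minimizes the absolute-deviation objective for the sample median using the standard quadratic majorizer \(|r| \le r^2/(2|r^{(m)}|) + |r^{(m)}|/2\) (our own example, unrelated to the \(\gamma \)-divergence objective):

```python
import numpy as np

# MM for the sample median: h(eta) = sum_i |y_i - eta|, with the majorizer
# |r| <= r^2 / (2 c) + c / 2 at c = |y_i - eta^(m)| (guarded away from zero).
y = np.array([1.0, 2.0, 3.0, 10.0, 2.5])
h = lambda eta: np.abs(y - eta).sum()

eta = y.mean()                                   # eta^(0)
for _ in range(200):
    c = np.maximum(np.abs(y - eta), 1e-12)       # current residual magnitudes
    eta_new = (y / c).sum() / (1.0 / c).sum()    # minimizer of the majorizer
    assert h(eta_new) <= h(eta) + 1e-9           # monotone decrease at each step
    eta = eta_new
```

Each update solves a weighted least squares problem, and the iterates converge to the sample median.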

MM algorithm for type I

Let us recall the objective function \(h(\theta )\) in type I:

$$\begin{aligned} h(\theta ) = -\frac{1}{\gamma } \log \frac{1}{n} \sum _{i=1}^n \frac{ f(y_i|x_i ; \theta )^{\gamma } }{ \left( \int f(y|x_i ;\theta )^{1+\gamma }dy\right) ^\frac{\gamma }{1+\gamma }}. \end{aligned}$$

The majorization function can be derived in a similar manner to that in Kawashima and Fujisawa (2017). We show the following theorem.

Theorem 4

Consider the function

$$\begin{aligned} h_{MM}(\theta | \theta ^{(m)}) = - \frac{1}{\gamma } \sum _{i=1}^n w_{i}\left( \theta ^{(m)} \right) \log W_i(\theta ) + const, \end{aligned}$$

where

$$\begin{aligned} W_i(\theta ) =\frac{ f(y_i|x_i;\theta )^{\gamma } }{ \left( \int f(y|x_i;\theta )^{1+\gamma }dy\right) ^{\frac{\gamma }{1+\gamma }} }, w_{i}\left( \theta \right) = \frac{ W_i(\theta ) }{ \sum _{l=1}^n W_l(\theta ) }, \end{aligned}$$

and const is a term that does not depend on the parameter \(\theta \). Then \(h_{MM}(\theta | \theta ^{(m)})\) is a majorization function of \(h(\theta )\). Consequently, the sequence of iterates given by \(\theta ^{(m+1)} = \mathop {{\mathrm{argmin}}}\limits \nolimits _{\theta } h_{MM}(\theta |\theta ^{(m)})\) decreases the objective function \(h(\theta )\) at each iterative step.

The proof is in Appendix E. As mentioned in Sect. 3.3, when the parametric conditional probability density function \(f(y|x;\theta )\) belongs to a location-scale family (3.3), the cross entropy for type I is the same as that for type II, and the above majorization function is also the same as that for type II.
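The touching condition at \(\theta ^{(m)}\) and the majorization inequality of Theorem 4 can be checked numerically with a toy positive \(W_i(\theta )\); the exponential form below is a hypothetical stand-in, not the paper's \(W_i\):

```python
import numpy as np

# h(θ) = -(1/γ) log( (1/n) Σ W_i(θ) );  h_MM follows Theorem 4 with the
# Jensen's-gap constant worked out explicitly.
rng = np.random.default_rng(0)
gamma, n = 0.5, 8
a = rng.uniform(0.5, 2.0, size=n)
W = lambda th: np.exp(-a * th**2) + 0.1          # toy positive W_i(θ)
h = lambda th: -np.log(W(th).mean()) / gamma

th_m = 0.7
w = W(th_m) / W(th_m).sum()                      # weights w_i(θ^(m))
const = (w * np.log(n * w)).sum() / gamma        # the "const" term of Theorem 4
h_mm = lambda th: -(w * np.log(W(th))).sum() / gamma + const

touch = abs(h_mm(th_m) - h(th_m))                # equality at θ^(m)
gap = min(h_mm(th) - h(th) for th in np.linspace(-2.0, 2.0, 81))
```

The gap is non-negative everywhere (majorization) and zero at \(\theta ^{(m)}\) (touching), as Jensen's inequality guarantees.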

MM algorithm for logistic regression model

Here, the case of the logistic regression model is thoroughly investigated. This model does not belong to a location-scale family as the parametric conditional probability density function \(f(y|x;\theta )\). Let \(f(y|x;\beta )\) be the Bernoulli distribution given by

$$\begin{aligned} f(y|x;\beta ) = \pi (x;\beta )^y (1- \pi (x;\beta ))^{(1-y)}, \end{aligned}$$

where \(\pi (x;\beta )= \{1+\exp (- x^{\top } \beta )\}^{-1}\). For simplicity, the intercept term \(\beta _0\) is omitted from the linear predictor. After a simple calculation, we have

$$\begin{aligned}&h_{MM}(\beta | \beta ^{(m)}) \\&\quad = - \sum _{i=1}^n w_{i}\left( \beta ^{(m)} \right) y_i x_i^{\top } \beta - \frac{1}{1+\gamma } \sum _{i=1}^n w_{i}\left( \beta ^{(m)} \right) \log \left[ 1 - \pi (x_i ; (1+\gamma )\beta ) \right] . \end{aligned}$$

Here, the constant term is ignored. By applying the idea of the quadratic approximation (the second-order Taylor polynomial) (Böhning and Lindsay 1988) to \(h_{MM}\), the new majorization function is obtained as follows:

$$\begin{aligned} {\tilde{h}}_{MM}(\beta | \beta ^{(m)}) = \frac{1+\gamma }{2} \sum _{i=1}^n \left( z_i^{(m)} - x_i^{\top } \beta \right) ^2 + const, \end{aligned}$$

where

$$\begin{aligned} z_i^{(m)} = x_i^{\top } \beta ^{(m)} + \frac{1}{1+\gamma } w_{i} \left( \beta ^{(m)} \right) \left( y_i - \pi (x_i ; (1+\gamma )\beta ^{ (m) }) \right) . \end{aligned}$$

Then, we provide the following theorem.

Theorem 5

Consider the function \({\tilde{h}}_{MM}(\beta | \beta ^{(m)})\). It is a majorization function of \(h(\beta )\). Consequently, the sequence of iterates given by \( \beta ^{(m+1)} = (X^{\top } X)^{-1} X^{\top } z^{(m)}\) decreases the objective function \(h(\beta )\) at each iterative step.

The proof is in Appendix F. Therefore, the proposed algorithm does not require a line search method (Armijo 1966; Wolfe 1969) to guarantee the monotone decreasing property. In contrast, existing estimation algorithms for logistic regression based on type I (Hung et al. 2018; Ren et al. 2020) require a line search method to ensure monotone decrease.

Finally, the pseudo-code of the proposed parameter estimation algorithm for the logistic regression model is presented.

[Algorithm: pseudo-code of the proposed parameter estimation algorithm]
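The updates above can be sketched as follows, assuming the Bernoulli model of Sect. 5.3; the function and variable names are our own, and this is a minimal illustration rather than the authors' R code:

```python
import numpy as np

def type1_objective(X, y, beta, gamma):
    # Empirical type I γ-cross entropy h(β) for the logistic model.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    f = p**y * (1 - p)**(1 - y)
    norm = (p**(1 + gamma) + (1 - p)**(1 + gamma))**(gamma / (1 + gamma))
    return -np.log(np.mean(f**gamma / norm)) / gamma

def gamma_logistic_mm(X, y, gamma=1.0, n_iter=100):
    """MM iterations of Theorem 5: β^(m+1) = (X^T X)^{-1} X^T z^(m)."""
    n, p_dim = X.shape
    beta = np.zeros(p_dim)
    XtX = X.T @ X
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        f = p**y * (1 - p)**(1 - y)
        norm = (p**(1 + gamma) + (1 - p)**(1 + gamma))**(gamma / (1 + gamma))
        W = f**gamma / norm                          # W_i(β^(m))
        w = W / W.sum()                              # w_i(β^(m))
        p_g = 1.0 / (1.0 + np.exp(-(1 + gamma) * (X @ beta)))  # π(x;(1+γ)β^(m))
        z = X @ beta + w * (y - p_g) / (1 + gamma)   # working response z^(m)
        beta = np.linalg.solve(XtX, X.T @ z)
    return beta

# Toy data for a quick run (our own simulation, not the paper's setting).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -1.5, 2.0])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_1 = gamma_logistic_mm(X, y, gamma=1.0, n_iter=1)
beta_20 = gamma_logistic_mm(X, y, gamma=1.0, n_iter=20)
```

By Theorem 5, the objective is non-increasing along the iterations, which can be checked by comparing `type1_objective` after 1 and 20 iterations.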

Numerical experiments

In this section, using a simulation model, we compare type I with type II and the density power divergence (DPD).

As shown in Sect. 4, a large difference between the two types arises under heterogeneous contamination when the parametric conditional probability density function \(f(y|x;\theta )\) does not belong to a location-scale family. Therefore, the logistic regression model is used as a simulation model, given by

$$\begin{aligned} \mathrm {Pr}(y=1|x) = \pi (x;\beta ) , \ \mathrm {Pr}(y=0|x) = 1-\pi (x;\beta ), \end{aligned}$$

where \(\pi (x;\beta )= \{1+\exp (- \beta _{0} - x_1 \beta _1- \cdots - x_p \beta _p )\}^{-1}\). All experiments were performed using R (http://www.r-project.org/index.html). The code used to implement the proposed method is available at https://sites.google.com/site/takayukikawashimaspage/software.

Synthetic data

We consider the following data-generating scheme. The sample size and the number of explanatory variables were set to \(n=200\) and \(p=10,20,40\), respectively. The true regression coefficients were given by \(\beta _{0}^*=0, \ \varvec{\beta }^* =(1,-1.5,2,-2.5,3, {\mathbf {0}}_{p-5}^{\top })^{\top }\). The explanatory variables were generated from a normal distribution \(N(0,{\varSigma })\) with \( {\varSigma }=(0.2^{ | i-j | })_{1 \le i,j \le p }\). We generated 30 random datasets.

Outliers were incorporated into simulations. The outliers were generated around the edge of the explanatory variables, where the explanatory variables were generated from \(N(\varvec{\mu }_{\mathrm{out}}, 0.5^{2} {\mathbf {I}})\) where \({\varvec{\mu }_{\mathrm{out}}}=(2,0,2,0,2, {\mathbf {0}}_{p-5}^{\top })^{\top }\), and the response variable y was set to 0. The outlier ratio was set to \(\varepsilon =0.1, 0.2 , 0.3 , 0.4\).
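The scheme above can be sketched as follows; the function name, seed handling, and the rounding of \(\varepsilon n\) are our own choices:

```python
import numpy as np

def make_contaminated_data(n=200, p=10, eps=0.2, seed=0):
    # Sketch of the Sect. 6.1 scheme: logistic data plus heterogeneous outliers.
    rng = np.random.default_rng(seed)
    beta0 = 0.0
    beta = np.r_[1.0, -1.5, 2.0, -2.5, 3.0, np.zeros(p - 5)]
    Sigma = 0.2 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    n_out = int(round(eps * n))
    n_in = n - n_out
    # Clean part: x ~ N(0, Sigma), y ~ Bernoulli(π(x; β)).
    X_in = rng.multivariate_normal(np.zeros(p), Sigma, size=n_in)
    pi = 1.0 / (1.0 + np.exp(-(beta0 + X_in @ beta)))
    y_in = (rng.uniform(size=n_in) < pi).astype(float)
    # Outliers: x ~ N(mu_out, 0.5^2 I) near the edge of the design, y fixed at 0.
    mu_out = np.r_[2.0, 0.0, 2.0, 0.0, 2.0, np.zeros(p - 5)]
    X_out = rng.normal(mu_out, 0.5, size=(n_out, p))
    y_out = np.zeros(n_out)
    return np.vstack([X_in, X_out]), np.r_[y_in, y_out]

X_sim, y_sim = make_contaminated_data()
```

Varying `eps` over 0.1 to 0.4 and `p` over 10, 20, 40 reproduces the grid of settings studied in this section.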

To verify the fitness of the regression coefficient, the mean squared error (MSE) was used as the performance measure, given by

$$\begin{aligned} \text {MSE}&= \frac{1}{p+1} \sum _{j=0}^p ({\hat{\beta }}_j - \beta _j^* )^2, \end{aligned}$$

where \(\beta _j^*\)’s are the true coefficients. The tuning parameter \(\gamma \) in the \(\gamma \)-divergence was set to 1.0, 2.0, and 3.0. The tuning parameter \(\alpha \) in the DPD was set to 1.0, 2.0, and 3.0.

Figure 2 shows the MSE in the cases \(\varepsilon =0.1\), 0.2, 0.3, and 0.4. Type I presented smaller MSEs than type II, and the difference between the two types became more significant as the outlier ratio \(\varepsilon \) increased. For types I and II, results similar to those presented in Sect. 2 can be seen, even in the multivariate case. The DPD presented worse MSEs as the outlier ratio \(\varepsilon \) or p became larger.

The DPD seems to perform better than the \(\gamma \)-divergence at the same numerical value of \(\gamma \) and \(\alpha \). However, different divergences should not be compared at the same tuning parameter value, because the parameters have different meanings.

Fig. 2

Under heterogeneous contamination with various outlier ratios

Application to real data

We compare type I with type II and the DPD on real data. The following datasets, which are available at the UCI Machine Learning Repository (Dua and Graff 2017), were used: Absenteeism at work (Work) (Martiniano et al. 2012), banknote authentication (Bank), Heart, Haberman's Survival (Survival), and Libras Movement (Libras).

First, we applied ordinary logistic regression to each dataset and regarded the estimated regression coefficients as the true coefficients \(\beta _0^*\) and \(\varvec{\beta }^*\). Then, outliers were incorporated into the data as follows. The magnitudes of the explanatory variables were sorted in descending order \(\Vert x \Vert _{(1)} \ge \Vert x \Vert _{(2)} \ge \cdots \ge \Vert x \Vert _{(n)} \), where (i) denotes the i-th order statistic. We selected the \(\lceil \varepsilon \times n \rceil \) samples with the largest magnitudes, and each corresponding response variable was flipped, e.g., \(y=0 \rightarrow y=1\). The outlier ratio \(\varepsilon \) was set to 0.4.

Table 1 MSE in application to real data

In order to verify the fitness of the regression coefficient, we used the MSE as the performance measure. The tuning parameter \(\gamma \) in the \(\gamma \)-divergence was set to 1.0, 2.0, and 3.0. The tuning parameter \(\alpha \) in the DPD was set to 1.0, 2.0, and 3.0.

Table 1 shows the MSE in real data sets with outliers. Type I presented smaller MSEs than type II and showed better results than the DPD in most cases.

Conclusion

In this study, the difference between two types of \(\gamma \)-divergence for the regression problem, referred to as type I and type II, was investigated. We showed that, under heterogeneous contamination, type I has the strong robustness, unlike type II, even when the parametric conditional probability density function does not belong to a location-scale family. In addition, we described the difference in robustness from an existing robust divergence, the density power divergence, for the regression problem. Further, an efficient estimation algorithm was proposed based on the principle of the MM algorithm. Numerical experiments under various settings and an application to real data supported the theoretical results. In the experiments performed herein, the robustness tuning parameter \(\gamma \) was fixed. However, in some cases, it may be better to select an appropriate tuning parameter from among candidate values of \(\gamma \) than to fix it. The monitoring approach (Riani et al. 2020) and robust cross-validation (Kawashima and Fujisawa 2017) can be utilized for selecting \(\gamma \).