Introduction

Maximum likelihood estimation and the least squares method have been widely used; however, these approaches are not robust against outliers. To overcome this problem, several robust methods have been proposed, primarily using M-estimation (Hampel et al. 2005; Maronna et al. 2006; Huber and Ronchetti 2009). Maximum likelihood estimation can be regarded as minimizing the empirical estimator of the Kullback-Leibler divergence. As an extension of this idea, several robust estimators have been proposed that minimize empirical estimators of modified divergences, e.g., the density power divergence (Basu et al. 1998), the \(L_2\)-divergence (Scott 2001), and the \(\gamma \)-divergence (or Type-0 divergence) (Jones et al. 2001; Fujisawa and Eguchi 2008).

Recently, several robust regression methods have been proposed based on divergences, using the \(L_2\)-divergence (Chi and Scott 2014; Lozano et al. 2016), the density power divergence (Ghosh and Basu 2016; Riani et al. 2020; Ghosh and Majumdar 2020), and the \(\gamma \)-divergence (Kawashima and Fujisawa 2017; Hung et al. 2018; Ren et al. 2020). In these methods, robustness properties are generally investigated under a contaminated model. The difference between the i.i.d. problem and the regression problem lies in whether or not the outlier ratio in the contaminated model depends on the explanatory variable; these cases are called heterogeneous and homogeneous contamination, respectively. Hung et al. (2018) showed that a logistic regression model with mislabeled data can be regarded as a logistic regression model with heterogeneous contamination. They then applied the \(\gamma \)-divergence to a usual logistic regression model, which enables estimation of the model parameter without modeling the mislabeling scheme, even if mislabeled data exist. They discussed the strong robustness, namely that the latent bias can be sufficiently small even against heavy contamination; however, this was under the assumption that the tuning parameter \(\gamma \) is sufficiently large.

There are two types of \(\gamma \)-divergence for regression problems, which differ in their treatment of the base measure (Fujisawa and Eguchi 2008; Kawashima and Fujisawa 2017). Kawashima and Fujisawa (2017) proposed a variant of the \(\gamma \)-divergence of Fujisawa and Eguchi (2008) and showed a Pythagorean relation that does not hold for the earlier \(\gamma \)-divergence. Hung et al. (2018) adopted the \(\gamma \)-divergence for the regression problem proposed by Fujisawa and Eguchi (2008) and investigated robustness properties of the logistic regression model, as mentioned above. In addition, Ren et al. (2020) also adopted it and investigated theoretical properties of variable-selection consistency and estimation bounds in a high-dimensional regression setting. In particular, its application to the generalized linear model (McCullagh and Nelder 1989) has been well studied. However, these studies focused only on the divergence proposed by Fujisawa and Eguchi (2008), and no comparison between the two types has been made.

In this study, the two types of \(\gamma \)-divergence are compared in detail, and their differences in terms of strong robustness are illustrated through numerical experiments. In contrast to Hung et al. (2018), our results hold for any parametric model, including the logistic regression model, without assuming that \(\gamma \) is sufficiently large.

The remainder of this paper is organized as follows. In Sect. 2, we show that existing robust regression methods may not work well under heavy heterogeneous contamination in the simple case of a univariate logistic regression model. In Sect. 3, the two types of \(\gamma \)-divergence for the regression problem are reviewed. In Sect. 4, we elucidate a large difference between the two types of \(\gamma \)-divergence from the viewpoint of robustness. In Sect. 5, the parameter estimation algorithm is proposed. In Sect. 6, numerical experiments are presented to verify the differences discussed in Sect. 4.

Illustrative example

Before getting into details, we present an illustrative example showing that existing robust regression methods may not work well under heavy heterogeneous contamination, where the outlier ratio depends on the explanatory variable.

Here, a univariate logistic regression model was used as a simulation model. Outliers were incorporated in a similar setting to that described in Sect. 6. A weighted maximum likelihood estimator (WMLE), M-estimator (Mest), redescending weighted M-estimator (WMest), conditional unbiased bounded influence estimator (CUBIF), and robust quasi-likelihood estimator (MQLE) were adopted as existing robust logistic regression methods (see Chapter 7 in Maronna et al. (2018) for details).

Figure 1 shows the mean squared errors (MSEs) of the existing methods, type I, and type II at each outlier ratio. All existing methods showed almost identical results at each outlier ratio, and their performance worsened as the contamination became heavier. Type I, a robust regression method based on the \(\gamma \)-divergence proposed by Fujisawa and Eguchi (2008), performed well. In contrast, the similar method type II, based on the \(\gamma \)-divergence proposed by Kawashima and Fujisawa (2017), behaved quite differently from type I. In the subsequent sections, we discuss the difference in robustness between the two types of \(\gamma \)-divergence for the regression problem; in particular, we explore why type I outperforms type II.

Fig. 1

MSEs of existing robust regression methods, type I, and type II at various outlier ratios. The existing methods presented almost identical MSEs

Regression based on \(\gamma \)-Divergence

The \(\gamma \)-divergence for regression was first proposed by Fujisawa and Eguchi (2008). It measures the difference between two conditional probability density functions. The other type of \(\gamma \)-divergence for regression was proposed by Kawashima and Fujisawa (2017), in which the treatment of the base measure on the explanatory variable was changed. For simplicity, the former is referred to as type I and the latter as type II. This section presents a brief review of both types of \(\gamma \)-divergence for regression alongside the corresponding parameter estimation.

Two types of \(\gamma \)-Divergence for regression

First, the \(\gamma \)-divergence for the i.i.d. problem is reviewed. Let g(u) and f(u) be two probability density functions. The \(\gamma \)-cross entropy and \(\gamma \)-divergence are defined by

$$\begin{aligned}&d_\gamma (g(u),f(u)) = -\frac{1}{\gamma } \log \int g(u) f(u)^\gamma du + \frac{1}{1+\gamma } \log \int f(u)^{1+\gamma } du, \\&D_\gamma (g(u),f(u)) = - d_\gamma (g(u),g(u)) + d_\gamma (g(u),f(u)) . \end{aligned}$$

This satisfies the following two basic properties of divergence:

$$\begin{aligned} \begin{array}{cl} \hbox {(i)} &{} D_\gamma (g(u),f(u)) \ge 0. \\ \hbox {(ii)} &{} D_\gamma (g(u),f(u)) = 0 \ \Leftrightarrow \ g(u)=f(u) \ {(a.e.)}. \end{array} \end{aligned}$$
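As a quick numerical sanity check of properties (i) and (ii), both quantities can be evaluated on a grid; the helper names `d_gamma`/`D_gamma` and the trapezoidal integration below are our own illustration, not part of the paper:

```python
import numpy as np

# d_gamma(g, f) = -(1/gamma) log ∫ g f^gamma du + (1/(1+gamma)) log ∫ f^(1+gamma) du
def d_gamma(g, f, u, gamma):
    term1 = -np.log(np.trapz(g * f**gamma, u)) / gamma
    term2 = np.log(np.trapz(f**(1 + gamma), u)) / (1 + gamma)
    return term1 + term2

# D_gamma(g, f) = -d_gamma(g, g) + d_gamma(g, f)
def D_gamma(g, f, u, gamma):
    return -d_gamma(g, g, u, gamma) + d_gamma(g, f, u, gamma)

u = np.linspace(-12.0, 12.0, 4001)
pdf = lambda m, s: np.exp(-(u - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
g, f = pdf(0.0, 1.0), pdf(1.0, 1.5)   # two normal densities on the grid

D_gf = D_gamma(g, f, u, 0.5)          # property (i): strictly positive here
D_gg = D_gamma(g, g, u, 0.5)          # property (ii): zero when g = f
```

With `g` and `f` distinct normals, `D_gf` is positive and `D_gg` vanishes identically, matching (i) and (ii).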

Let us consider the \(\gamma \)-divergence for the regression problem. Suppose that g(x,y), g(y|x), and g(x) are the underlying probability density functions of (x,y), y given x, and x, respectively. Let f(y|x) be another conditional probability density function of y given x. Let \(\gamma \) be a positive tuning parameter that controls the trade-off between efficiency and robustness.

For the regression problem, Fujisawa and Eguchi (2008) proposed the following cross entropy and divergence:

  • Type I \(\gamma \)-cross entropy for regression:

    $$\begin{aligned} d_{\gamma ,1} (g(y|x),f(y|x);g(x)) = -\frac{1}{\gamma } \log \int \int \left\{ \frac{ f(y|x)^{\gamma } }{ \left( \int f(y|x)^{1+\gamma }dy\right) ^{\frac{\gamma }{1+\gamma }} } \right\} g(x,y)dxdy . \end{aligned}$$
    (3.1)
  • Type I \(\gamma \)-divergence for regression:

    $$\begin{aligned}&D_{\gamma ,1} (g(y|x),f(y|x);g(x)) \\&\quad = - d_{\gamma ,1}(g(y|x),g(y|x);g(x)) + d_{\gamma ,1}(g(y|x),f(y|x);g(x)). \end{aligned}$$

The cross entropy is empirically estimable, as will be seen in Sect. 3.2, and the parameter estimation is easily defined. On the other hand, Kawashima and Fujisawa (2017) proposed the following cross entropy and divergence:

  • Type II \(\gamma \)-cross entropy for regression:

    $$\begin{aligned}&d_{\gamma ,2} (g(y|x),f(y|x);g(x)) \nonumber \\&\quad = -\frac{1}{\gamma } \log \int \int f(y|x)^{\gamma } g(x,y) dxdy + \frac{1}{1+\gamma } \log \int \left( \int f(y|x)^{1+\gamma } dy \right) g(x) dx. \end{aligned}$$
    (3.2)
  • Type II \(\gamma \)-divergence for regression:

    $$\begin{aligned}&D_{\gamma ,2} (g(y|x),f(y|x);g(x)) \\&\quad = -d_{\gamma , 2}(g(y|x),g(y|x);g(x)) +d_{\gamma , 2}(g(y|x),f(y|x);g(x)). \end{aligned}$$

In type II, the base measure on the explanatory variable is applied separately to each of the two terms of the \(\gamma \)-divergence for the i.i.d. problem. This extension from the i.i.d. problem to the regression problem appears more natural than (3.1). The cross entropy is also empirically estimable. Both types of \(\gamma \)-divergence satisfy the following two basic properties of divergence for \(j=1,2\):

$$\begin{aligned} \begin{array}{cl} \hbox {(i)} &{} D_{\gamma ,j}(g(y|x),f(y|x);g(x)) \ge 0. \\ \hbox {(ii)} &{} D_{\gamma ,j}(g(y|x),f(y|x);g(x)) = 0 \qquad \Leftrightarrow \ g(y|x)=f(y|x) \ {(a.e.)}. \end{array} \end{aligned}$$

The equality in (ii) holds between conditional probability density functions rather than between usual (unconditional) probability density functions.

Theoretical properties of the \(\gamma \)-divergence for the i.i.d. problem were deeply investigated by Fujisawa and Eguchi (2008). There have been several studies on theoretical properties for the regression problem (Kanamori and Fujisawa 2015; Kawashima and Fujisawa 2017; Hung et al. 2018; Ren et al. 2020). However, there is a lack of comprehensive studies, such as a comparison of properties under heterogeneous contamination. Heterogeneous contamination appears as a specific case in the regression problem and does not appear in the i.i.d. problem. Hung et al. (2018) pointed out that a logistic regression model with mislabeled data can be regarded as a logistic regression model with heterogeneous contamination. They then applied type I to a usual logistic regression model, which enables us to estimate the parameter of the logistic regression model without modeling the mislabeling scheme, even if mislabeled data exist. They also investigated theoretical properties on robustness, but they assumed that \(\gamma \) is sufficiently large. In Sect. 4 of this paper, it is shown that type I is superior to type II under heterogeneous contamination in the sense of the strong robustness, without assuming that \(\gamma \) is sufficiently large.

Finally, we note that the density power divergence (Basu et al. 1998) is another divergence that provides robustness; however, it does not have the strong robustness (Hung et al. 2018). For the completeness of this paper, details of the robustness of the density power divergence under homogeneous and heterogeneous contamination are provided, although some parts have been investigated by Hung et al. (2018). See Appendix G for details.

Estimation for \(\gamma \)-regression

Let \(f(y|x;\theta )\) be a conditional probability density function of y given x with parameter \(\theta \). Let \((x_i,y_i) \ (i=1 , \ldots , n)\) be observations randomly drawn from the underlying distribution g(x,y). Using (3.1) and (3.2), both types of \(\gamma \)-cross entropy for regression can be empirically estimated by

$$\begin{aligned}&{\bar{d}}_{\gamma ,1} (f(y|x;\theta )) = -\frac{1}{\gamma } \log \frac{1}{n} \sum _{i=1}^n \frac{ f(y_i|x_i ; \theta )^{\gamma } }{ \left( \int f(y|x_i ;\theta )^{1+\gamma }dy\right) ^\frac{\gamma }{1+\gamma }}, \\&{\bar{d}}_{\gamma ,2} (f(y|x;\theta )) \\&\quad = - \frac{1}{\gamma } \log \left\{ \frac{1}{n} \sum _{i=1}^n f(y_i | x_i ;\theta )^{\gamma } \right\} + \frac{1}{1+\gamma } \log \left\{ \frac{1}{n} \sum _{i=1}^n \int f(y | x_i ;\theta )^{1+\gamma } dy \right\} . \end{aligned}$$
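For a logistic (Bernoulli) model, as considered later in Sect. 5.3, the inner integral over y reduces to a two-term sum over \(y \in \{0,1\}\), so both empirical cross entropies are easy to evaluate; a sketch, with function names of our own choosing:

```python
import numpy as np

def _pi(X, beta):
    # π(x; β) = {1 + exp(-x^T β)}^{-1}
    return 1.0 / (1.0 + np.exp(-X @ beta))

def d_bar_type1(X, y, beta, gamma):
    p = _pi(X, beta)
    f = p**y * (1 - p)**(1 - y)                    # f(y_i | x_i; β)
    norm = p**(1 + gamma) + (1 - p)**(1 + gamma)   # ∫ f^{1+γ} dy as a sum over y
    return -np.log(np.mean(f**gamma / norm**(gamma / (1 + gamma)))) / gamma

def d_bar_type2(X, y, beta, gamma):
    p = _pi(X, beta)
    f = p**y * (1 - p)**(1 - y)
    norm = p**(1 + gamma) + (1 - p)**(1 + gamma)
    return (-np.log(np.mean(f**gamma)) / gamma
            + np.log(np.mean(norm)) / (1 + gamma))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
beta = np.array([1.0, -1.0])
y = (rng.uniform(size=50) < _pi(X, beta)).astype(float)
v1 = d_bar_type1(X, y, beta, 0.5)
v2 = d_bar_type2(X, y, beta, 0.5)
```

Note that `v1` and `v2` generally differ, since the Bernoulli model is not a location-scale family.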

The estimator is defined as the minimizer as follows:

$$\begin{aligned} {\hat{\theta }}_{\gamma ,j} =\mathop {{\mathrm{argmin}}}\limits _{\theta } {\bar{d}}_{\gamma ,j}(f(y|x;\theta )) \ \ \hbox { for}\ j=1,2. \end{aligned}$$

Using a similar approach to that in Fujisawa and Eguchi (2008), we can show that \({\hat{\theta }}_{\gamma ,j}\) converges to \(\theta ^*_{\gamma ,j}\), where

$$\begin{aligned} \theta ^*_{\gamma ,j}&= \mathop {{\mathrm{argmin}}}\limits _{\theta } D_{\gamma ,j}(g(y|x),f(y|x;\theta );g(x)) \\&= \mathop {{\mathrm{argmin}}}\limits _{\theta } d_{\gamma ,j}(g(y|x),f(y|x;\theta );g(x)) \ \ \text{ for } j=1,2. \end{aligned}$$

Suppose that \(f(y|x;\theta ^*)\) is the target conditional probability density function. The latent bias is expressed as \(\theta ^*_{\gamma ,j}-\theta ^*\). This is zero when the underlying model belongs to the parametric family, i.e., \(g(y|x)=f(y|x;\theta ^*)\), but is not always zero when the underlying model is contaminated by outliers. This issue is discussed in Sect. 4.

Case of location-scale family

Here, it is shown that both types of \(\gamma \)-divergence give the same parameter estimation when the parametric conditional probability density function \(f(y|x;\theta )\) belongs to a location-scale family in which the scale does not depend on the explanatory variables. This is given by

$$\begin{aligned} f(y|x;\theta ) = \frac{1}{\sigma } s \left( \frac{y-q(x;\zeta )}{\sigma } \right) , \end{aligned}$$
(3.3)

where s(y) is a probability density function, \(\sigma \) is a scale parameter, and \(q(x;\zeta )\) is a location function with a regression parameter \(\zeta \), e.g., \(q(x;\zeta )=x^T \zeta \). Then, we can obtain

$$\begin{aligned} \int f(y|x;\theta )^{1+\gamma } dy&= \int \frac{1}{\sigma ^{1+\gamma }} s \left( \frac{y-q(x;\zeta )}{\sigma } \right) ^{1+\gamma } dy \nonumber \\&= \sigma ^{-\gamma } \int s(z)^{1+\gamma } dz. \end{aligned}$$

This does not depend on the explanatory variables x. Using this property, we can show that both types of \(\gamma \)-cross entropy are the same.
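The change-of-variables step above can be verified numerically; a sketch for a Gaussian s, using grid integration and arbitrarily chosen locations q standing in for \(q(x;\zeta )\):

```python
import numpy as np

# Check that ∫ f(y|x;θ)^{1+γ} dy = σ^{-γ} ∫ s(z)^{1+γ} dz for several locations q,
# i.e., the integral does not depend on the explanatory variables.
y = np.linspace(-40.0, 40.0, 20001)
s = lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
gamma, sigma = 0.5, 2.0

rhs = sigma**(-gamma) * np.trapz(s(y)**(1 + gamma), y)
lhs = [np.trapz((s((y - q) / sigma) / sigma)**(1 + gamma), y)
       for q in (-3.0, 0.0, 5.0)]                      # location varies with x
```

All three left-hand values agree with the right-hand side up to the quadrature error.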

Proposition 1

Consider the location-scale family (3.3). We see

$$\begin{aligned} d_{\gamma ,1} (g(y|x),f(y|x;\theta );g(x)) =d_{\gamma ,2} (g(y|x),f(y|x;\theta );g(x)). \end{aligned}$$

The proof is in Appendix A. As a result, both types of \(\gamma \)-divergence give the same parameter estimate, because the estimator is defined via the empirical estimate of the \(\gamma \)-cross entropy. However, it should be noted that the two divergences themselves are not the same, because \(d_{\gamma ,1} (g(y|x),g(y|x);g(x)) \ne d_{\gamma ,2} (g(y|x),g(y|x);g(x)) \).

Robust properties

In this section, we show a large difference between the two types of \(\gamma \)-divergence.

Contamination model and basic condition

Let \(\delta (y|x)\) be the contaminated conditional probability density function related to outliers. Let \(\varepsilon (x)\) and \(\varepsilon \) denote the outlier ratios that do and do not depend on x, respectively. Suppose that the underlying conditional probability density functions under heterogeneous and homogeneous contamination are given by

  • Heterogeneous contamination:

    $$\begin{aligned} g(y|x)&= (1-\varepsilon (x))f(y|x;\theta ^*) + \varepsilon (x) \delta (y|x), \end{aligned}$$
  • Homogeneous contamination:

    $$\begin{aligned} g(y|x)&= (1-\varepsilon )f(y|x;\theta ^*) + \varepsilon \delta (y|x). \end{aligned}$$

Let

$$\begin{aligned} {\nu }_{f,\gamma }(x) = \left\{ \int \delta (y|x) f(y|x)^{ \gamma } dy \right\} ^{ \frac{1}{ \gamma } }, \ {\nu }_{f, \gamma } = \left\{ \int \nu _{f,\gamma }(x)^{\gamma } g(x) dx \right\} ^{ \frac{1}{ \gamma }} . \end{aligned}$$

Here we assume that

$$\begin{aligned} \nu _{f_{\theta ^*},\gamma } \approx 0. \end{aligned}$$

This extends to the regression problem the assumption used for the i.i.d. problem (Fujisawa and Eguchi 2008). This assumption implies that

$$\begin{aligned} \nu _{f_{\theta ^*},\gamma }(x) \approx 0 \text{ for } \text{ any } x \text{(a.e.), } \end{aligned}$$

and illustrates that the contaminated conditional probability density function \(\delta (y|x)\) lies on the tail of the target conditional probability density function \(f(y|x;\theta ^*)\). For example, if \(\delta (y|x)\) is the Dirac delta function at the outlier \(y_{\dag }(x)\) given x, then we have \(\nu _{f_{\theta ^*},\gamma }(x) = f(y_{\dag }(x)|x;\theta ^*) \approx 0\), which is reasonable because \(y_{\dag }(x)\) is an outlier.
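This point-mass example is easy to check numerically; for a standard normal f and an outlier at \(y_{\dag }=8\) (an arbitrary illustrative value of our own), \(\nu _{f,\gamma }(x)=f(y_{\dag }|x)\) is indeed negligible:

```python
import numpy as np

# For a point mass δ(y|x) at y_out, ∫ δ f^γ dy = f(y_out)^γ, so
# ν_{f,γ}(x) = {f(y_out)^γ}^{1/γ} = f(y_out), whatever γ > 0 is.
f = lambda y: np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
nu = f(8.0)   # ν_{f,γ}(x) for an outlier at y = 8, far in the tail
```

Here `nu` is of order 1e-15, consistent with the assumption \(\nu _{f_{\theta ^*},\gamma }(x) \approx 0\).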

Here we also consider the condition \(\nu _{f_{\theta },\gamma } \approx 0\), as used in Hung et al. (2018). This will be true in the neighborhood of \(\theta =\theta ^*\). In addition, even when \(\theta \) is not close to \(\theta ^*\), if \(\delta (y|x)\) lies on the tail of \(f(y|x;\theta )\), we can see \(\nu _{f_{\theta },\gamma } \approx 0\).

To simplify the discussion, the monotone transformation of both types of \(\gamma \)-cross entropy for regression is prepared as follows:

$$\begin{aligned} {\tilde{d}}_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x))&= - \exp \left\{ -\gamma d_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \right\} \\&= - \int \int \frac{ f(y|x;\theta )^{\gamma } }{ \left( \int f(y|x;\theta )^{1+\gamma } dy \right) ^{\frac{\gamma }{1+\gamma } } } g(y|x) g(x)dx dy , \\ {\tilde{d}}_{\gamma ,2}(g(y|x),f(y|x;\theta );g(x))&= - \exp \left\{ -\gamma d_{\gamma ,2}(g(y|x),f(y|x;\theta );g(x)) \right\} \\&= - \frac{ \int \left( \int g(y|x) f(y|x;\theta )^{\gamma } dy \right) g(x) dx }{ \left\{ \int \left( \int f(y|x;\theta )^{1+\gamma } dy \right) g(x) dx \right\} ^{\frac{\gamma }{1+\gamma } } } . \end{aligned}$$

Robustness of type I

Following some calculations, we have

$$\begin{aligned}&{\tilde{d}}_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \\&\quad = {\tilde{d}}_{\gamma ,1}(f(y|x;\theta ^*),f(y|x;\theta ); {\tilde{g}}(x)) - \int \frac{ \nu _{f_{\theta },\gamma }(x)^{\gamma } }{ \left( \int f(y|x;\theta )^{1+\gamma } dy \right) ^{\frac{\gamma }{1+\gamma } } } \varepsilon (x) g(x) dx, \end{aligned}$$

where \({\tilde{g}}(x) = (1-\varepsilon (x))g(x)\). A detailed derivation is in Appendix B. From this relation, we can easily show the following theorem.

Theorem 1

Consider the case of heterogeneous contamination. Under the condition \(\nu _{f_{\theta },\gamma } \approx 0\), we have

$$\begin{aligned} {\tilde{d}}_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \approx {\tilde{d}}_{\gamma ,1} (f(y|x;\theta ^*),f(y|x;\theta ); {\tilde{g}}(x)) . \end{aligned}$$

Using this theorem, we can expect the strong robustness that the latent bias \(\theta ^*_{\gamma ,1}-\theta ^*\) is close to zero even when \(\varepsilon (x)\) is not small, because

$$\begin{aligned} \theta ^*_{\gamma ,1}&= \mathop {{\mathrm{argmin}}}\limits _{\theta } d_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \\&= \mathop {{\mathrm{argmin}}}\limits _{\theta } {\tilde{d}}_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \\&\approx \mathop {{\mathrm{argmin}}}\limits _{\theta } {\tilde{d}}_{\gamma ,1} (f(y|x;\theta ^*),f(y|x;\theta ); {\tilde{g}}(x)) \quad \text{(by } \text{ Theorem }~1) \\&= \mathop {{\mathrm{argmin}}}\limits _{\theta } d_{\gamma ,1} (f(y|x;\theta ^*),f(y|x;\theta ); {\tilde{g}}(x)) = \theta ^*. \end{aligned}$$

The last equality holds even when g(x) is replaced by \({\tilde{g}}(x) = (1-\varepsilon (x))g(x)\).

In addition, we can obtain an approximate modified Pythagorean relation.

Theorem 2

Consider the case of heterogeneous contamination. Under the condition \(\nu _{f_{\theta },\gamma } \approx 0\), the modified Pythagorean relation among g(y|x), \(f(y|x;\theta ^*)\), \(f(y|x;\theta )\) approximately holds:

$$\begin{aligned}&D_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \\&\quad \approx D_{\gamma ,1}(g(y|x),f(y|x;\theta ^*);g(x)) + D_{\gamma ,1}(f(y|x;\theta ^*),f(y|x;\theta );{\tilde{g}}(x)). \end{aligned}$$

The modified Pythagorean relation also implies strong robustness in a similar manner to that in the subsequent discussion of Theorem 1.

Finally, the case of homogeneous contamination is discussed. Under homogeneous contamination, we have the following relation.

Theorem 3

Consider the case of homogeneous contamination. We see

$$\begin{aligned} D_{\gamma ,1}(f(y|x;\theta ^*),f(y|x;\theta );{\tilde{g}}(x)) = D_{\gamma ,1}(f(y|x;\theta ^*),f(y|x;\theta );g(x)). \end{aligned}$$

The proof is in Appendix C. Then, the modified Pythagorean relation in Theorem 2 is changed to the usual Pythagorean relation as follows:

$$\begin{aligned}&D_{\gamma ,1}(g(y|x),f(y|x;\theta );g(x)) \\&\quad \approx D_{\gamma ,1}(g(y|x),f(y|x;\theta ^*);g(x)) + D_{\gamma ,1}(f(y|x;\theta ^*),f(y|x;\theta );g(x)). \end{aligned}$$

Robustness of type II

First, it is illustrated that, unlike for type I, the strong robustness does not generally hold for type II under heterogeneous contamination. We see

$$\begin{aligned} {\tilde{d}}_{\gamma ,2}(g(y|x),f(y|x;\theta );g(x)) \approx - \frac{ \int \int f(y|x;\theta ^*) f(y|x;\theta )^{\gamma } dy (1-\varepsilon (x)) g(x) dx }{ \left\{ \int \left( \int f(y|x;\theta )^{1+\gamma } dy \right) g(x) dx \right\} ^{\frac{\gamma }{1+\gamma } } }. \end{aligned}$$

A detailed derivation is in Appendix D. This cannot be expressed using

$$\begin{aligned} d_{\gamma ,2}(f(y|x;\theta ^*),f(y|x;\theta );b(x)), \end{aligned}$$

with an appropriate base measure b(x), unlike for type I, because the base measure on the explanatory variables in the numerator differs from that in the denominator. As mentioned in Sect. 3.3, when the parametric conditional probability density function belongs to a location-scale family (3.3), the cross entropy for type II coincides with that for type I, and thus type II can have the strong robustness.

Under homogeneous contamination, we have

$$\begin{aligned} {\tilde{d}}_{\gamma ,2}(g(y|x),f(y|x;\theta );g(x)) \approx (1-\varepsilon ) {\tilde{d}}_{\gamma ,2}(f(y|x;\theta ^*),f(y|x;\theta );g(x)), \end{aligned}$$

and then the latent bias \(\theta ^*_{\gamma ,2}-\theta ^*\) can be expected to be sufficiently small, i.e., type II can have the strong robustness.

Parameter estimation algorithm

In this section, the parameter estimation algorithm for type I is proposed. Kawashima and Fujisawa (2017) proposed an iterative estimation algorithm for type II based on the Majorization-Minimization (MM) algorithm (Hunter and Lange 2004). This algorithm has the monotone decreasing property, i.e., the objective function monotonically decreases at each iterative step, which ensures numerical stability and efficiency. The present study also utilizes the MM algorithm.

MM algorithm

Here, the principle of the MM algorithm is explained in brief. Let \(h(\eta )\) be the objective function. Let us prepare the majorization function \(h_{MM}\) satisfying

$$\begin{aligned} h_{MM}(\eta ^{(m)}|\eta ^{(m)})&= h(\eta ^{(m)}), \\ h_{MM}(\eta |\eta ^{(m)})&\ge h(\eta ) \ \ \text{ for } \text{ all } \eta , \end{aligned}$$

where \(\eta ^{(m)}\) is the parameter of the m-th iterative step for \(m=0,1,2,\ldots \). The MM algorithm optimizes the majorization function instead of the objective function as follows:

$$\begin{aligned} \eta ^{(m+1)} = \mathop {{\mathrm{argmin}}}\limits _{\eta } h_{MM}(\eta |\eta ^{(m)}). \end{aligned}$$

Then, it can be shown that the objective function \(h(\eta )\) monotonically decreases at each iterative step, because

$$\begin{aligned} h(\eta ^{(m)}) = h_{MM}(\eta ^{(m)}|\eta ^{(m)}) \ge h_{MM}(\eta ^{(m+1)}|\eta ^{(m)}) \ge h(\eta ^{(m+1)}). \end{aligned}$$

Note that \(\eta ^{(m+1)}\) need not be the minimizer of \(h_{MM}(\eta |\eta ^{(m)})\). We only need

$$\begin{aligned} h_{MM}(\eta ^{(m)}|\eta ^{(m)}) \ge h_{MM}(\eta ^{(m+1)}|\eta ^{(m)}). \end{aligned}$$

The key issue in applying the MM algorithm is how to construct a majorization function \(h_{MM}\).
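The monotone decreasing property can be illustrated on a toy problem; the sketch below minimizes the absolute-deviation objective for the sample median using the standard quadratic majorizer \(|r| \le r^2/(2|r^{(m)}|) + |r^{(m)}|/2\) (our own example, unrelated to the \(\gamma \)-divergence objective):

```python
import numpy as np

# MM for the sample median: h(eta) = sum_i |y_i - eta|, with the majorizer
# |r| <= r^2 / (2 c) + c / 2 at c = |y_i - eta^(m)| (guarded away from zero).
y = np.array([1.0, 2.0, 3.0, 10.0, 2.5])
h = lambda eta: np.abs(y - eta).sum()

eta = y.mean()                                   # eta^(0)
for _ in range(200):
    c = np.maximum(np.abs(y - eta), 1e-12)       # current residual magnitudes
    eta_new = (y / c).sum() / (1.0 / c).sum()    # minimizer of the majorizer
    assert h(eta_new) <= h(eta) + 1e-9           # monotone decrease at each step
    eta = eta_new
```

Each update solves a weighted least squares problem, and the iterates converge to the sample median.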

MM algorithm for type I

Let us recall the objective function \(h(\theta )\) in type I:

$$\begin{aligned} h(\theta ) = -\frac{1}{\gamma } \log \frac{1}{n} \sum _{i=1}^n \frac{ f(y_i|x_i ; \theta )^{\gamma } }{ \left( \int f(y|x_i ;\theta )^{1+\gamma }dy\right) ^\frac{\gamma }{1+\gamma }}. \end{aligned}$$

The majorization function can be derived in a similar manner to that in Kawashima and Fujisawa (2017). We show the following theorem.

Theorem 4

Consider the function

$$\begin{aligned} h_{MM}(\theta | \theta ^{(m)}) = - \frac{1}{\gamma } \sum _{i=1}^n w_{i}\left( \theta ^{(m)} \right) \log W_i(\theta ) + const, \end{aligned}$$

where

$$\begin{aligned} W_i(\theta ) =\frac{ f(y_i|x_i;\theta )^{\gamma } }{ \left( \int f(y|x_i;\theta )^{1+\gamma }dy\right) ^{\frac{\gamma }{1+\gamma }} }, w_{i}\left( \theta \right) = \frac{ W_i(\theta ) }{ \sum _{l=1}^n W_l(\theta ) }, \end{aligned}$$

and const is a term that does not depend on the parameter \(\theta \). Then \(h_{MM}(\theta | \theta ^{(m)})\) is a majorization function of \(h(\theta )\). Consequently, the sequence of iterates given by \(\theta ^{(m+1)} = \mathop {{\mathrm{argmin}}}\limits \nolimits _{\theta } h_{MM}(\theta |\theta ^{(m)})\) decreases the objective function \(h(\theta )\) at each iterative step.

The proof is in Appendix E. As mentioned in Sect. 3.3, when the parametric conditional probability density function \(f(y|x;\theta )\) belongs to a location-scale family (3.3), the cross entropy for type I is the same as that for type II, and the above majorization function is also the same as that for type II.
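The touching condition at \(\theta ^{(m)}\) and the majorization inequality of Theorem 4 can be checked numerically with a toy positive \(W_i(\theta )\); the exponential form below is a hypothetical stand-in, not the paper's \(W_i\):

```python
import numpy as np

# h(θ) = -(1/γ) log( (1/n) Σ W_i(θ) );  h_MM follows Theorem 4 with the
# Jensen's-gap constant worked out explicitly.
rng = np.random.default_rng(0)
gamma, n = 0.5, 8
a = rng.uniform(0.5, 2.0, size=n)
W = lambda th: np.exp(-a * th**2) + 0.1          # toy positive W_i(θ)
h = lambda th: -np.log(W(th).mean()) / gamma

th_m = 0.7
w = W(th_m) / W(th_m).sum()                      # weights w_i(θ^(m))
const = (w * np.log(n * w)).sum() / gamma        # the "const" term of Theorem 4
h_mm = lambda th: -(w * np.log(W(th))).sum() / gamma + const

touch = abs(h_mm(th_m) - h(th_m))                # equality at θ^(m)
gap = min(h_mm(th) - h(th) for th in np.linspace(-2.0, 2.0, 81))
```

The gap is non-negative everywhere (majorization) and zero at \(\theta ^{(m)}\) (touching), as Jensen's inequality guarantees.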

MM algorithm for logistic regression model

Here, the case of the logistic regression model is thoroughly investigated. This model does not belong to a location-scale family as the parametric conditional probability density function \(f(y|x;\theta )\). Let \(f(y|x;\beta )\) be the Bernoulli distribution given by

$$\begin{aligned} f(y|x;\beta ) = \pi (x;\beta )^y (1- \pi (x;\beta ))^{(1-y)}, \end{aligned}$$

where \(\pi (x;\beta )= \{1+\exp (- x^{\top } \beta )\}^{-1}\). For simplicity, the intercept term \(\beta _0\) is omitted from the linear predictor. After a simple calculation, we have

$$\begin{aligned}&h_{MM}(\beta | \beta ^{(m)}) \\&\quad = - \sum _{i=1}^n w_{i}\left( \beta ^{(m)} \right) y_i x_i^{\top } \beta - \frac{1}{1+\gamma } \sum _{i=1}^n w_{i}\left( \beta ^{(m)} \right) \log \left[ 1 - \pi (x_i ; (1+\gamma )\beta ) \right] . \end{aligned}$$

Here, the constant term is ignored. By applying the idea of the quadratic approximation (the second-order Taylor polynomial) (Böhning and Lindsay 1988) to \(h_{MM}\), the new majorization function is obtained as follows:

$$\begin{aligned} {\tilde{h}}_{MM}(\beta | \beta ^{(m)}) = \frac{1+\gamma }{2} \sum _{i=1}^n \left( z_i^{(m)} - x_i^{\top } \beta \right) ^2 + const, \end{aligned}$$

where

$$\begin{aligned} z_i^{(m)} = x_i^{\top } \beta ^{(m)} + \frac{1}{1+\gamma } w_{i} \left( \beta ^{(m)} \right) \left( y_i - \pi (x_i ; (1+\gamma )\beta ^{ (m) }) \right) . \end{aligned}$$

Then, we provide the following theorem.

Theorem 5

Consider the function \({\tilde{h}}_{MM}(\beta | \beta ^{(m)})\). It is a majorization function of \(h(\beta )\). Consequently, the sequence of iterates given by \( \beta ^{(m+1)} = (X^{\top } X)^{-1} X^{\top } z^{(m)}\) decreases the objective function \(h(\beta )\) at each iterative step.

The proof is in Appendix F. Therefore, the proposed algorithm does not require a line search method (Armijo 1966; Wolfe 1969) to guarantee the monotone decreasing property. In contrast, existing estimation algorithms for logistic regression based on type I (Hung et al. 2018; Ren et al. 2020) require a line search method to ensure monotone decrease.

Finally, the pseudo-code of the proposed parameter estimation algorithm for the logistic regression model is presented.

[Algorithm: pseudo-code of the proposed parameter estimation algorithm]
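The updates above can be sketched as follows, assuming the Bernoulli model of Sect. 5.3; the function and variable names are our own, and this is a minimal illustration rather than the authors' R code:

```python
import numpy as np

def type1_objective(X, y, beta, gamma):
    # Empirical type I γ-cross entropy h(β) for the logistic model.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    f = p**y * (1 - p)**(1 - y)
    norm = (p**(1 + gamma) + (1 - p)**(1 + gamma))**(gamma / (1 + gamma))
    return -np.log(np.mean(f**gamma / norm)) / gamma

def gamma_logistic_mm(X, y, gamma=1.0, n_iter=100):
    """MM iterations of Theorem 5: β^(m+1) = (X^T X)^{-1} X^T z^(m)."""
    n, p_dim = X.shape
    beta = np.zeros(p_dim)
    XtX = X.T @ X
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        f = p**y * (1 - p)**(1 - y)
        norm = (p**(1 + gamma) + (1 - p)**(1 + gamma))**(gamma / (1 + gamma))
        W = f**gamma / norm                          # W_i(β^(m))
        w = W / W.sum()                              # w_i(β^(m))
        p_g = 1.0 / (1.0 + np.exp(-(1 + gamma) * (X @ beta)))  # π(x;(1+γ)β^(m))
        z = X @ beta + w * (y - p_g) / (1 + gamma)   # working response z^(m)
        beta = np.linalg.solve(XtX, X.T @ z)
    return beta

# Toy data for a quick run (our own simulation, not the paper's setting).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -1.5, 2.0])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_1 = gamma_logistic_mm(X, y, gamma=1.0, n_iter=1)
beta_20 = gamma_logistic_mm(X, y, gamma=1.0, n_iter=20)
```

By Theorem 5, the objective is non-increasing along the iterations, which can be checked by comparing `type1_objective` after 1 and 20 iterations.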

Numerical experiments

In this section, using a simulation model, we compare type I with type II and the density power divergence (DPD).

As shown in Sect. 4, a large difference between the two types arises under heterogeneous contamination when the parametric conditional probability density function \(f(y|x;\theta )\) does not belong to a location-scale family. Therefore, the logistic regression model is used as a simulation model, given by

$$\begin{aligned} \mathrm {Pr}(y=1|x) = \pi (x;\beta ) , \ \mathrm {Pr}(y=0|x) = 1-\pi (x;\beta ), \end{aligned}$$

where \(\pi (x;\beta )= \{1+\exp (- \beta _{0} - x_1 \beta _1- \cdots - x_p \beta _p )\}^{-1}\). All experiments were performed using R (http://www.r-project.org/index.html). The code used to implement the proposed method is available at https://sites.google.com/site/takayukikawashimaspage/software.

Synthetic data

We consider the following data-generating scheme. The sample size and the number of explanatory variables were set to \(n=200\) and \(p=10,20,40\), respectively. The true regression coefficients were given by \(\beta _{0}^*=0, \ \varvec{\beta }^* =(1,-1.5,2,-2.5,3, {\mathbf {0}}_{p-5}^{\top })^{\top }\). The explanatory variables were generated from a normal distribution \(N(0,{\varSigma })\) with \( {\varSigma }=(0.2^{ | i-j | })_{1 \le i,j \le p }\). We generated 30 random datasets.

Outliers were incorporated into simulations. The outliers were generated around the edge of the explanatory variables, where the explanatory variables were generated from \(N(\varvec{\mu }_{\mathrm{out}}, 0.5^{2} {\mathbf {I}})\) where \({\varvec{\mu }_{\mathrm{out}}}=(2,0,2,0,2, {\mathbf {0}}_{p-5}^{\top })^{\top }\), and the response variable y was set to 0. The outlier ratio was set to \(\varepsilon =0.1, 0.2 , 0.3 , 0.4\).
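The scheme above can be sketched as follows; the function name, seed handling, and the rounding of \(\varepsilon n\) are our own choices:

```python
import numpy as np

def make_contaminated_data(n=200, p=10, eps=0.2, seed=0):
    # Sketch of the Sect. 6.1 scheme: logistic data plus heterogeneous outliers.
    rng = np.random.default_rng(seed)
    beta0 = 0.0
    beta = np.r_[1.0, -1.5, 2.0, -2.5, 3.0, np.zeros(p - 5)]
    Sigma = 0.2 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    n_out = int(round(eps * n))
    n_in = n - n_out
    # Clean part: x ~ N(0, Sigma), y ~ Bernoulli(π(x; β)).
    X_in = rng.multivariate_normal(np.zeros(p), Sigma, size=n_in)
    pi = 1.0 / (1.0 + np.exp(-(beta0 + X_in @ beta)))
    y_in = (rng.uniform(size=n_in) < pi).astype(float)
    # Outliers: x ~ N(mu_out, 0.5^2 I) near the edge of the design, y fixed at 0.
    mu_out = np.r_[2.0, 0.0, 2.0, 0.0, 2.0, np.zeros(p - 5)]
    X_out = rng.normal(mu_out, 0.5, size=(n_out, p))
    y_out = np.zeros(n_out)
    return np.vstack([X_in, X_out]), np.r_[y_in, y_out]

X_sim, y_sim = make_contaminated_data()
```

Varying `eps` over 0.1 to 0.4 and `p` over 10, 20, 40 reproduces the grid of settings studied in this section.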

To verify the fitness of the regression coefficient, the mean squared error (MSE) was used as the performance measure, given by

$$\begin{aligned} \text {MSE}&= \frac{1}{p+1} \sum _{j=0}^p ({\hat{\beta }}_j - \beta _j^* )^2, \end{aligned}$$

where \(\beta _j^*\)’s are the true coefficients. The tuning parameter \(\gamma \) in the \(\gamma \)-divergence was set to 1.0, 2.0, and 3.0. The tuning parameter \(\alpha \) in the DPD was set to 1.0, 2.0, and 3.0.

Figure 2 shows the MSE in the cases \(\varepsilon =0.1\), 0.2, 0.3, and 0.4. Type I presented smaller MSEs than type II, and the difference between the two types became more significant as the outlier ratio \(\varepsilon \) increased. For types I and II, results similar to those presented in Sect. 2 can be seen, even in the multivariate case. The DPD presented worse MSEs as the outlier ratio \(\varepsilon \) or p became larger.

The DPD seems to perform better than the \(\gamma \)-divergence at the same numerical value of \(\gamma \) and \(\alpha \). However, different divergences should not be compared at the same tuning parameter value, because the parameters have different meanings.

Fig. 2

Under heterogeneous contamination with various outlier ratios

Application to real data

We compare type I with type II and the DPD on real data. The following datasets, which are available at the UCI Machine Learning Repository (Dua and Graff 2017), were used: Absenteeism at work (Work) (Martiniano et al. 2012), banknote authentication (Bank), Heart, Haberman's Survival (Survival), and Libras Movement (Libras).

First, we applied ordinary logistic regression to each dataset and regarded the estimated regression coefficients as the true coefficients \(\beta _0^*\) and \(\varvec{\beta }^*\). Then, outliers were incorporated into the data as follows. The magnitudes of the explanatory variables were sorted in descending order \(\Vert x \Vert _{(1)} \ge \Vert x \Vert _{(2)} \ge \cdots \ge \Vert x \Vert _{(n)} \), where (i) denotes the i-th order statistic. We selected the \(\lceil \varepsilon \times n \rceil \) samples with the largest magnitudes, and each corresponding response variable was flipped, e.g., \(y=0 \rightarrow y=1\). The outlier ratio \(\varepsilon \) was set to 0.4.

Table 1 MSE in application to real data

In order to verify the fitness of the regression coefficient, we used the MSE as the performance measure. The tuning parameter \(\gamma \) in the \(\gamma \)-divergence was set to 1.0, 2.0, and 3.0. The tuning parameter \(\alpha \) in the DPD was set to 1.0, 2.0, and 3.0.

Table 1 shows the MSE in real data sets with outliers. Type I presented smaller MSEs than type II and showed better results than the DPD in most cases.

Conclusion

In this study, the difference between two types of \(\gamma \)-divergence for the regression problem, referred to as type I and type II, was investigated. We showed that, under heterogeneous contamination, type I has the strong robustness, unlike type II, even when the parametric conditional probability density function does not belong to a location-scale family. In addition, we described the difference in robustness from an existing robust divergence, the density power divergence, for the regression problem. Further, an efficient estimation algorithm was proposed based on the principle of the MM algorithm. Numerical experiments under various settings and an application to real data supported the theoretical results. In the experiments performed herein, the robustness tuning parameter \(\gamma \) was fixed. However, in some cases, it may be better to select an appropriate tuning parameter from among candidate values of \(\gamma \) than to fix it. The monitoring approach (Riani et al. 2020) and robust cross-validation (Kawashima and Fujisawa 2017) can be utilized for selecting \(\gamma \).