1 Introduction

Linear regression models the conditional mean of a response variable y given the predictor variable x. The well-known least-squares estimator of the linear regression coefficients is highly sensitive to outliers. To alleviate this problem, many estimators have been developed, such as robust M-estimators [12, 13]. However, the consistency of robust M-estimators requires the conditional error distribution to be homoscedastic and symmetric given the predictor. In reality, many types of data do not satisfy these assumptions (e.g., wages, prices, and expenditures). Baldauf and Silva [3] pointed out that the estimation cannot be consistent unless the data satisfy the required assumptions, an issue that also arises in real-world applications [18].

Modal linear regression (MLR) models the conditional mode of y given x by a linear predictor function of x. MLR relaxes the distributional assumptions required by the M-estimators of linear regression and is more robust against outliers than the least-squares estimation of the linear regression coefficients. It is also robust against violations of the standard assumptions of ordinary mean regression, such as heavy-tailed noise and skewed conditional and noise distributions. Kemp and Silva [14] and Yao and Li [23] proved that their estimators for the MLR model are consistent even when the error distribution is asymmetric. For these reasons, improving methods for mode estimation has been an important research topic for many years.

In information geometry [2], a manifold that consists of statistical models is called a model manifold. An information geometric formulation of an estimation algorithm is useful for understanding the behavior and characteristics of the algorithm. For example, the procedure of parameter estimation can be regarded as a projection from a point in the statistical manifold onto a point in the model manifold. Modal linear regression is known to be robust to outliers, and we aim to elucidate the source of this robustness by formulating the estimation procedure as geometric operations. Identifying and understanding the source of the robustness of the estimation algorithm and the statistical model associated with modal linear regression would be helpful for developing other algorithms and models that are robust to outliers. In information geometry, a model manifold is often constructed from a parametric distribution. Because the MLR model lacks a parametric distribution, constructing a corresponding manifold is difficult with conventional approaches. The contribution of this paper is to provide, for the first time, an information geometric perspective on MLR.

1.1 Related works

The modal regression model is related to kernel density estimation (KDE) [22], a nonparametric method for estimating the probability density function of observed data. Parzen [19] established sufficient conditions for the \(L^2\)-consistency and asymptotic normality of KDE and derived conditions for the consistency and asymptotic normality of a mode estimator constructed from KDE. Epanechnikov [7] found the optimal kernel function for KDE under certain conditions. In general, if the probability density function of a random variable X has a unique mode and is symmetric with respect to the mode, then \(\text {Pr} \left( p-w \le X \le p+w \right) \) with fixed w is maximized when p is the mode. Based on this property, Lee [15] proposed an estimator for the coefficients of MLR. The MLR model is formulated as follows:

$$\begin{aligned} y&= x^{\top }\beta + \epsilon ,\quad \text {where} \quad \text {Mode} \left[ \epsilon ; x \right] = 0. \end{aligned}$$

The estimator of Lee [15] is consistent when there exists \(w>0\) such that the probability density function of \(\epsilon \) is symmetric in the range \(0\pm w\). Lee [16] made the objective function for the MLR model more tractable by using a quadratic kernel. Both Lee [15, 16] require the conditional probability density function of \(\epsilon \) given x to be symmetric around 0 in the range of \(\pm w\). Kemp and Silva [14] proved that the mode estimator for the coefficients of MLR is consistent even if this symmetry is not satisfied. Yao and Li [23] proposed an expectation–maximization (EM) algorithm to estimate the coefficients of MLR.

Besides linear modeling, other studies have taken semiparametric or nonparametric approaches to modal regression. Gannoun et al. [9] proposed a semiparametric modal regression model that assumes a linear relation among the mean, median, and mode. Yao et al. [24] developed a local modal regression that estimates the conditional mode of the response as a polynomial function of the predictors. Chen et al. [4] defined the modal manifold as a union of sets in which the first derivative of the conditional density is zero and the second derivative is negative; the modal manifold is then estimated by kernel estimates of the derivatives of the density.

In information geometry, a model manifold is often constructed by using a parametric distribution, and estimation is regarded as the projection of an empirical distribution onto the model manifold. In the case of linear regression, the model manifold is constructed under the assumption that the error variable has a normal distribution. Because the MLR model lacks a parametric distribution, constructing a corresponding model manifold is difficult with conventional approaches. Some studies have considered nonparametric models in information geometry. Pistone and Sempi [20] constructed a well-defined Banach manifold of probability measures. Grasselli [10] addressed the Fisher information and \(\alpha \)-connections for the Banach manifold. Zhang [25] discussed the relationship among divergence functions, the Fisher information, \(\alpha \)-connections, and fundamental issues in information geometry. Takano et al. [21] proposed a framework for nonparametric e-mixture estimation. In contrast to these nonparametric approaches to information geometry, in this paper we consider the information geometry associated with the semiparametric MLR model. We propose to construct a model manifold by using observations, as is done when constructing an empirical distribution with conventional approaches. Our proposal provides a geometric viewpoint on the MLR model.

2 Modal linear regression

Let \(x \in {\mathbb {R}}^p\) and \(y \in {\mathbb {R}}\) be a set of predictor variables and a response variable, respectively. Ordinary least-squares linear regression estimates the conditional mean of y given x, whereas MLR estimates the conditional mode of y given x. In this section, we briefly explain the EM algorithm for MLR introduced by Yao and Li [23].

2.1 Formulation

Suppose that \(\left\{ x_i, y_i \right\} _{i=1}^{N}\) are i.i.d. observations, where the i-th predictor variable is denoted by \(x_i\in {\mathbb {R}}^p\) and the corresponding response is denoted by \(y_i \in {\mathbb {R}}\). With MLR, we model a conditional mode of y given x by a linear function of x:

$$\begin{aligned} \text {Mode} \left[ y;x \right]&= x^{\top }\beta , \end{aligned}$$

where \(\text {Mode} \left[ y;x \right] = \mathop {\text {argmax}}\limits _{y} f(y|x)\) for the conditional density function f(y|x). Namely, y and x are related as

$$\begin{aligned} y = x^{\top }\beta + \epsilon , \quad \text {where} \quad \text {Mode} \left[ \epsilon ;x \right] = 0. \end{aligned}$$
(1)

To estimate \(\beta \), Lee [15] introduced a loss function of the form

$$\begin{aligned} l(\beta ; y, x) = - \phi _{h} \left( y - x^{\top }\beta \right) , \end{aligned}$$
(2)

where \(\phi _h(x) = \frac{1}{h}\phi \left( \frac{x}{h} \right) \), \(\phi (\cdot )\) is a kernel function, and h is a bandwidth parameter. Minimizing the empirical loss leads to the estimate \({\hat{\beta }}\) of the linear coefficient:

$$\begin{aligned} {\hat{\beta }}= \mathop {\text {argmax}}\limits _{\beta } \frac{1}{N} \sum _{i=1}^{N} \phi _h(y_i - x_i^{\top }\beta ). \end{aligned}$$
(3)

In this paper, \(\phi (\cdot )\) denotes a standard normal density function. The consistency and asymptotic normality of the estimate \({\hat{\beta }}\) obtained by (3) have been established under certain regularity conditions on the samples, kernel function, parameter space, and vanishing rate of the bandwidth parameter [14].
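
For concreteness, the following is a minimal numerical sketch of the objective in (3) with a Gaussian kernel. The data-generating model, bandwidth value, and grid search are illustrative choices of ours and are not part of the original estimator; in practice the maximization is carried out by the EM algorithm described next.

```python
import numpy as np

def gaussian_kernel(z, h):
    # phi_h(z) = (1/h) * phi(z/h), with phi the standard normal density
    return np.exp(-0.5 * (z / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def mlr_objective(beta, X, y, h):
    # (1/N) * sum_i phi_h(y_i - x_i^T beta), the quantity maximized in (3)
    return gaussian_kernel(y - X @ beta, h).mean()

# Toy data: the conditional mode is x^T beta_true because the exponential noise has mode 0
rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.uniform(-5, 5, N)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.exponential(scale=2.0, size=N)

# Crude grid search over (intercept, slope), only to visualize where the objective peaks
h = 1.0
grid = np.linspace(-1.0, 4.0, 51)
best = max(((mlr_objective(np.array([b0, b1]), X, y, h), b0, b1)
            for b0 in grid for b1 in grid), key=lambda t: t[0])
print("grid maximizer:", best[1:], "objective value:", best[0])
```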

2.2 EM algorithm for MLR

Here, we introduce the EM algorithm for MLR parameter estimation proposed by Yao and Li [23]. The algorithm iterates the following two steps, starting from an initial estimate \(\beta ^{(1)}\):

E-Step

Consider the surrogate function

$$\begin{aligned} \gamma (\beta ;\beta ^{(k)}) = \sum _{i=1}^{N} \pi _i^{(k)} \log \left[ \frac{ \frac{1}{N}\phi _h \left( y_i-x_i^{\top }\beta \right) }{\pi _i^{(k)}} \right] , \end{aligned}$$
(4)

where

$$\begin{aligned} \pi _i^{(k)} = \frac{\phi _h(y_i - x_i^{\top }\beta ^{(k)}) }{\sum _{j=1}^{N} \phi _h(y_j - x_j^{\top }\beta ^{(k)}) },\quad i = 1 \dots N. \end{aligned}$$
(5)

This function satisfies

$$\begin{aligned} \gamma (\beta ^{(k)};\beta ^{(k)}) =\log \left[ \frac{1}{N}\sum _{i=1}^{N}\phi _h\left( y_i - x_i^{\top }\beta ^{(k)} \right) \right] \end{aligned}$$
(6)

and

$$\begin{aligned}&\log \left[ \frac{1}{N}\sum _{i=1}^{N} \phi _h \left( y_i-x_i^{\top }\beta \right) \right] \nonumber \\&\quad = \log \left[ \sum _{i=1}^{N} \pi _i^{(k)} \frac{\frac{1}{N}\phi _h \left( y_i-x_i^{\top }\beta \right) }{\pi _i^{(k)}} \right] ,\quad \text {by Jensen's inequality} \nonumber \\&\quad \ge \sum _{i=1}^{N} \pi _i^{(k)} \log \left[ \frac{ \frac{1}{N}\phi _h \left( y_i-x_i^{\top }\beta \right) }{\pi _i^{(k)}} \right] = \gamma (\beta ;\beta ^{(k)}). \end{aligned}$$
(7)

M-Step

In this step, the parameter \(\beta \) is updated to increase the value of \(\frac{1}{N}\sum _{i=1}^{N}\phi _h \left( y_i-x_i^{\top }\beta \right) \). The updated parameter \(\beta ^{(k+1)}\) is given as

$$\begin{aligned} \beta ^{(k+1)}&= \mathop {\text {argmax}}\limits _{\beta } \gamma (\beta ;\beta ^{(k)}) . \end{aligned}$$
(8)

The following inequality holds:

$$\begin{aligned} \log \left[ \frac{1}{N}\sum _{i=1}^{N}\phi _h \left( y_i-x_i^{\top }\beta ^{(k+1)} \right) \right]&\ge \gamma (\beta ^{(k+1)};\beta ^{(k)}) \\&\ge \gamma (\beta ^{(k)};\beta ^{(k)}) = \log \left[ \frac{1}{N}\sum _{i=1}^{N}\phi _h \left( y_i-x_i^{\top }\beta ^{(k)} \right) \right] . \end{aligned}$$

Equation (8) is equivalent to (9).

$$\begin{aligned} \beta ^{(k+1)}&= \mathop {\text {argmax}}\limits _{\beta } \sum _{i=1}^{N} \pi _i^{(k)} \log \phi _h(y_i - x_i^{\top }\beta ). \end{aligned}$$
(9)

When \(\phi (\cdot )\) is a standard normal density function,

$$\begin{aligned} \beta ^{(k+1)} = \left( X^{\top } W_{k} X \right) ^{-1} X^{\top } W_{k} y, \quad W_{k} = \text {diag} \begin{pmatrix} \pi _1^{(k)}&\cdots&\pi _N^{(k)} \end{pmatrix}. \end{aligned}$$

The properties of the estimate \({\hat{\beta }}\) are discussed in [23].
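
The iteration above can be summarized in a few lines. The sketch below is our own illustration with a Gaussian kernel: it implements the E-step (5) and the weighted least-squares form of the M-step, and on data with skewed noise the resulting modal line separates clearly from the ordinary least-squares line.

```python
import numpy as np

def phi_h(z, h):
    # Gaussian kernel density with bandwidth h
    return np.exp(-0.5 * (z / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def mlr_em(X, y, h, beta_init, n_iter=200, tol=1e-8):
    """EM iteration of Yao and Li for MLR (sketch with a Gaussian kernel)."""
    beta = np.asarray(beta_init, dtype=float)
    for _ in range(n_iter):
        # E-step, Eq. (5): responsibilities pi_i^(k)
        w = phi_h(y - X @ beta, h)
        pi = w / w.sum()
        # M-step, Eq. (9): weighted least squares with weights pi_i^(k)
        W = np.diag(pi)
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

# Skewed (exponential) noise with mode 0: the modal line and the mean line differ
rng = np.random.default_rng(1)
N = 1000
X = np.column_stack([np.ones(N), rng.uniform(-5, 5, N)])
y = X @ np.array([1.0, 2.0]) + rng.exponential(scale=2.0, size=N)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]      # estimates the mean line
beta_mlr = mlr_em(X, y, h=1.0, beta_init=beta_ols)   # estimates the modal line
print("OLS:", beta_ols, "MLR (EM):", beta_mlr)
```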

3 Information geometry

In this section, we briefly explain information geometry, statistical inference, and the em algorithm. Information geometry [2] is a framework for formulating spaces of probability density functions by means of differential geometry. Let S be a statistical manifold composed of probability distributions.

A parametric family of probability distributions of the form

$$\begin{aligned} f(x;\theta ) =&\exp \left\{ C(x) + \sum _{i=1}^{n} \theta _i F_i(x) - \psi (\theta ) \right\} ,\nonumber \\&\text {where} \quad \psi (\theta ) = \log \int \exp \left\{ C(x) + \sum _{i=1}^{n} \theta _i F_i(x) \right\} dx \end{aligned}$$
(10)

is called an exponential family and plays a critical role in information geometry. When \(\left\{ F_1 \dots F_n, {\mathbf {1}} \right\} \) is linearly independent, \(\theta \) and \(f(x;\theta )\) are in one-to-one correspondence, and \(\left( \theta _i \right) _{i=1}^{n}\) is an affine coordinate system of the statistical manifold. There is another useful coordinate system:

$$\begin{aligned} \eta _{i} = \text {E}_{f(x;\theta )} \left[ F_i(x) \right] = \int f(x;\theta ) F_i(x) dx, \quad i = 1 \dots n. \end{aligned}$$
(11)

\((\theta _i)_{i=1}^{n}\) and \((\eta _i)_{i=1}^{n}\) are called the natural parameters and expectation parameters, respectively, of the exponential family.

For an exponential family equipped with the coordinate systems \(\{ (\theta _{i})_{i=1}^{n}, (\eta _{i})_{i=1}^{n}\}\), there exist potential functions \(\psi (\theta )\) and \(\varphi (\eta )\) satisfying the following relations [2]:

$$\begin{aligned} ^{\forall }i&= 1 \dots n,&\eta _i&= \frac{\partial \psi }{\partial \theta _i},&\theta _i&= \frac{\partial \varphi }{\partial \eta _i},&\psi (\theta ) + \varphi (\eta ) - \sum _{i=1}^{n}\theta _{i} \eta _{i}&= 0. \end{aligned}$$
(12)
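
As a concrete example (added here for illustration), the univariate Gaussian \({\mathscr {N}}(\mu , \sigma ^2)\) is of the form (10) with \(C(x)=0\), \(F_1(x)=x\), and \(F_2(x)=x^2\), and its dual coordinates and potential are

$$\begin{aligned} \theta _1&= \frac{\mu }{\sigma ^2},\quad \theta _2 = -\frac{1}{2\sigma ^2},\quad \psi (\theta ) = -\frac{\theta _1^2}{4\theta _2} + \frac{1}{2}\log \left( -\frac{\pi }{\theta _2} \right) ,\\ \eta _1&= \text {E}\left[ x \right] = \mu ,\quad \eta _2 = \text {E}\left[ x^2 \right] = \mu ^2 + \sigma ^2, \end{aligned}$$

and a direct calculation confirms that \(\eta _i = \partial \psi / \partial \theta _i\) for \(i = 1, 2\), in agreement with (12).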

In information geometry, statistical inference is often regarded as the projection of an empirical distribution onto a model manifold, which is a submanifold of S. Projection is the procedure of finding the point in a submanifold that minimizes the discrepancy from a given point; hence, it is important to measure the discrepancy between two probability distributions. We can define a divergence from a point (probability distribution) \(p \in S\) to \(q \in S\) by using the coordinate systems and potential functions as follows:

$$\begin{aligned} D(p||q) = \psi \left( \theta (p) \right) + \varphi \left( \eta (q) \right) - \sum _{i=1}^{n}\theta _{i}(p) \eta _{i}(q), \quad ^{\forall }p,q \in S. \end{aligned}$$
(13)

For an exponential family, there are two natural divergences, the e-divergence and the m-divergence. The m-divergence is defined as

$$\begin{aligned} D^{(m)}(p||q)&= \varphi \left( \eta (p) \right) + \psi \left( \theta (q) \right) - \sum _{i=1}^{n} \eta _{i}(p) \theta _{i}(q), \quad \text {from Eq.}~(12) \\&= \sum _{i=1}^{n} \theta _{i}(p) \eta _{i}(p) - \psi \left( \theta (p) \right) \\&\quad - \left\{ \sum _{i=1}^{n} \theta _{i}(q) \eta _{i}(p) - \psi \left( \theta (q) \right) \right\} , \quad \text {from Eq.}~(11) \\&= \text {E}_{\theta (p)} \left[ \sum _{i=1}^{n} \theta _{i}(p) F_i(x) - \psi \left( \theta (p) \right) \right] - \text {E}_{\theta (p)} \left[ \sum _{i=1}^{n} \theta _{i}(q) F_i(x) - \psi \left( \theta (q) \right) \right] \\&= \text {E}_{\theta (p)} \left[ \log p(x) \right] - \text {E}_{\theta (p)} \left[ \log q(x) \right] = \int p(x) \log \frac{p(x)}{q(x)} dx, \end{aligned}$$

which is equal to the KL-divergence. The e-divergence is defined as follows:

$$\begin{aligned} D^{(e)}(p||q)&= \psi \left( \theta (p) \right) + \varphi \left( \eta (q) \right) - \sum _{i=1}^{n}\theta _{i}(p) \eta _{i}(q) \\&= \varphi \left( \eta (q) \right) + \psi \left( \theta (p) \right) - \sum _{i=1}^{n} \eta _{i}(q) \theta _{i}(p) \\&= D^{(m)}(q||p) = \int q(x) \log \frac{q(x)}{p(x)} dx. \end{aligned}$$
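
The identity \(D^{(m)}(p||q) = \int p(x) \log \frac{p(x)}{q(x)} dx\) can also be checked numerically. The snippet below is a small verification of ours for two univariate Gaussians, using the natural and expectation coordinates given above; \(\varphi (\eta )\) is obtained from the Legendre relation (12), and the parameter values are arbitrary.

```python
import numpy as np

def theta(mu, sigma2):
    # natural parameters of N(mu, sigma2)
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def eta(mu, sigma2):
    # expectation parameters: E[x], E[x^2]
    return np.array([mu, mu ** 2 + sigma2])

def psi(th):
    # log-partition (potential) function psi(theta) of the Gaussian
    return -th[0] ** 2 / (4.0 * th[1]) + 0.5 * np.log(-np.pi / th[1])

def varphi(mu, sigma2):
    # dual potential from (12): varphi(eta) = sum_i theta_i eta_i - psi(theta)
    return theta(mu, sigma2) @ eta(mu, sigma2) - psi(theta(mu, sigma2))

def D_m(p, q):
    # m-divergence in dual coordinates: varphi(eta(p)) + psi(theta(q)) - <eta(p), theta(q)>
    return varphi(*p) + psi(theta(*q)) - eta(*p) @ theta(*q)

def kl_gauss(p, q):
    # closed-form KL-divergence between univariate Gaussians, given as (mean, variance)
    (mp, vp), (mq, vq) = p, q
    return 0.5 * (np.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)

p, q = (0.5, 1.5), (-1.0, 2.0)
print(D_m(p, q), kl_gauss(p, q))  # the two values agree
```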

The EM algorithm [5] is a method for maximum likelihood estimation of the parameters of a latent variable model. In information geometry, the exponential–mixture (em) algorithm [1] corresponds to the EM algorithm. For a latent variable model, the empirical distribution based on observations is not unique. A manifold that consists of empirical joint probability distributions of the observable and latent variables is called a data manifold \({\mathscr {D}}\). The em algorithm finds the point \(p^{*}\) in the model manifold \({\mathscr {M}}\) and the point \(q^{*}\in {\mathscr {D}}\) that minimize the KL-divergence from \(q^{*}\) to \(p^{*}\) by iterating the following two steps, starting from an initial guess \(p^{(1)}\). Figure 1 is a conceptual diagram of the em algorithm.

e-step

e-projection of \(p^{(k)}\in {\mathscr {M}}\) onto the data manifold \({\mathscr {D}}\).

$$\begin{aligned} q^{(k)}&= \mathop {\text {argmin}}\limits _{q\in {\mathscr {D}}} D^{(e)}(p^{(k)}||q). \end{aligned}$$

m-step

m-projection of \(q^{(k)}\in {\mathscr {D}}\) onto the model manifold \({\mathscr {M}}\).

$$\begin{aligned} p^{(k+1)}&= \mathop {\text {argmin}}\limits _{p\in {\mathscr {M}}} D^{(m)}(q^{(k)}||p). \end{aligned}$$

Fig. 1 Update process of the em algorithm

4 MEM algorithm and its information geometry

In this section, we introduce the modal EM (MEM) algorithm [17] for estimating the mode of a given probability density function and provide an information geometric perspective on it. The MEM algorithm is an iterative procedure similar to the EM algorithm, but there are no explicit observations, so the construction of an empirical density function is nontrivial. In this paper, we propose constructing an empirical density function under the assumption that a single pseudo-observation is obtained. This assumption plays a critical role in constructing the empirical density functions for the MEM algorithm and for the MLR model, in which the same difficulty arises because of the absence of explicit observations.

4.1 MEM algorithm

Consider a Gaussian mixture model whose parameters are known. Even in this case, it is generally difficult to express the mode of the mixture in closed form, and a numerical optimization is needed to obtain it. The MEM algorithm [17] is an iterative method for finding a local mode of a mixture distribution of the form

$$\begin{aligned} f(x)&= \sum _{i=1}^{K} \pi _i f_i(x) ,\quad x \in {\mathbb {R}}^{p} ,\quad \text{ where } \quad \left\{ \begin{aligned}&\pi _i \ge 0, \quad \sum _{i=1}^{K} \pi _i = 1 ,\\ {}&f_i:{\mathbb {R}}^{p} \rightarrow {\mathbb {R}} \text{ is } \text{ a } \text{ probability } \text{ density } \text{ function, } \end{aligned} \right. \end{aligned}$$

where all of the parameters in this model are known. The purpose of the MEM algorithm is to find the mode of f(x): \(\displaystyle x^{*} = \mathop {\text {argmax}}\limits _{x} f(x)\). Li et al. [17] proposed to iterate the following two steps starting from an initial estimate \(x^{(1)}\) (a small numerical sketch follows the two steps):

  1. E-step
    $$\begin{aligned} p_i^{(k)} = \frac{ \pi _i f_i(x^{(k)}) }{ f(x^{(k)}) }, \quad i = 1 \dots K. \end{aligned}$$
    (14)
  2. M-step
    $$\begin{aligned} x^{(k+1)} = \mathop {\text {argmax}}\limits _{x} \sum _{i=1}^{K} p_i^{(k)} \log f_i(x). \end{aligned}$$
    (15)
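
Below is a small numerical sketch of the two steps for a one-dimensional Gaussian mixture with known parameters. For Gaussian components, the M-step (15) has a closed form, namely a precision-weighted average of the component means, which the sketch uses; the mixture parameters and starting point are arbitrary, and the iteration converges to a local mode that depends on the initialization.

```python
import numpy as np
from scipy.stats import norm

# Gaussian mixture with known parameters
pi_ = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.0])
sd = np.array([1.0, 0.5])

def f(x):
    # mixture density f(x) = sum_i pi_i f_i(x)
    return np.sum(pi_ * norm.pdf(x, mu, sd))

def mem(x0, n_iter=200, tol=1e-10):
    """Modal EM: alternate (14) and (15); the M-step is a weighted mean for Gaussians."""
    x = x0
    for _ in range(n_iter):
        p = pi_ * norm.pdf(x, mu, sd)                            # E-step, Eq. (14)
        p /= p.sum()
        x_new = np.sum(p * mu / sd ** 2) / np.sum(p / sd ** 2)   # M-step, Eq. (15)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

local_mode = mem(x0=0.5)
print(local_mode, f(local_mode))   # a local mode of f; the starting point matters
```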

4.2 Pseudo-observation

The purpose of the MEM algorithm is to find the point that maximizes a probability density function f(x). In information geometry, projection is the procedure of finding, within a set of density functions, the one that minimizes the discrepancy from the empirical density function. In this section, we show how the MEM algorithm can be interpreted as the problem of finding an optimal function within a set of probability density functions, which leads us to the information geometric perspective on the MEM algorithm.

Let a probability density function \(s(x\;;m)\) be

$$\begin{aligned} s(x\;;m) = f(x+m), \end{aligned}$$

where \(m \in {\mathbb {R}}^p\) is a parameter of \(s(x\;;m)\). From this definition, the mode of \(s(x\;;m^{*})\) is 0 when the mode of f(x) is \(m^{*}\). Conversely, if \(m^{*}\) is the optimal solution of \(\max _{m} s(0\;;m)\), then the mode of f(x) is \(m^{*}\). The two problems \(\max _{x} f(x)\) and \(\max _{m} s(0\;;m)\) have the same solution, but the ideas behind them are different. The former finds an optimal point in the domain of f(x), whereas the latter considers a set of probability density functions shifted by m and finds the function in this set that maximizes the value at \(x=0\). The search spaces of the two problems are therefore different, and the latter interpretation is convenient when formulating the MEM algorithm from an information geometric perspective because its search space is a set of functions, as in the procedure of projection.

To derive the information geometric formulation of the MEM algorithm, we introduce the latent variable \(Z\in \left\{ 1 \dots K \right\} \) into the mixture model \(f(x)=\sum _{i=1}^{K} \pi _i f_i(x)\). The latent variable specifies the mixture component from which an observation x is obtained. The joint probability density function g(x, z) is expressed as follows [1]:

$$\begin{aligned} g(x,z)&= \prod _{i=1}^{K} \left[ \pi _i f_i(x) \right] ^{ \delta _i(z)} ,\quad \text {where} \quad \left\{ \begin{aligned}&\pi _i \ge 0, \quad \sum _{i=1}^{K} \pi _i = 1, \\&f_i\text { is a probability density function}, \\&\delta _i(z) = \left\{ \begin{aligned}&1 \quad i=z, \\&0 \quad i\ne z. \end{aligned} \right. \end{aligned} \right. \end{aligned}$$
(16)

In the above discussion, we considered finding \(\mathop {\text {argmax}}\nolimits _{m} s(0\;;m)\) from a set of probability density functions parameterized by m. In the same way, we define a joint probability density function l(x, z; m) and the corresponding model manifold \({\mathscr {M}}\) as follows:

$$\begin{aligned} l(x,z;m)&= g(x+m,z), \\ {\mathscr {M}}&= \left\{ l(x,z;m) \mid m\in {\mathbb {R}}^p \right\} . \end{aligned}$$

We then consider an empirical density function and a data manifold. In general, an empirical density function is constructed based on observations. For example, when observations \(\left\{ x_i \in {\mathbb {R}}^{p} \right\} _{i=1}^{N}\) are i.i.d., the empirical density function is defined as \(\frac{1}{N} \sum _{i=1}^{N} \delta (x-x_i)\), where \(\delta (\cdot )\) denotes the Dirac delta function. In the formulation of the MEM algorithm, the construction of the empirical density function is nontrivial because there are no explicit observations. In the interpretation of \(\mathop {\text {argmax}}\nolimits _{m} s(0\;;m)\), we fixed \(x=0\) and treated \(s(0\; ;m)\) as a function of m. This is equivalent to assuming that “one observation \(x=0\) is obtained,” and we interpret the procedure of the MEM algorithm as maximum likelihood estimation, namely, finding the parameter value \(m\in {\mathbb {R}}^p\) that maximizes the likelihood \(s(0\;;m)\). This assumption leads to the following definition of the empirical density function p(x):

$$\begin{aligned} p(x)&= \delta (x-0) = \delta (x). \end{aligned}$$

By introducing the latent variable \(Z\in \left\{ 1\dots K \right\} \), we extend p(x) to an empirical joint density function of X and Z:

$$\begin{aligned} h(x,z)&= p(x)q(z\mid x), \end{aligned}$$

and by introducing the parameters \(\left\{ q_i \right\} _{i=1}^{K}\), we model the conditional density function \(q(z\mid x)\) as

$$\begin{aligned} q(z\mid x) = \sum _{i=1}^{K} q_i \delta _i(z), \quad \text {where} \quad q_i \ge 0, \; \sum _{i=1}^{K} q_i = 1. \end{aligned}$$

Then, the empirical joint density function \(h(x,z\ ;q_1\dots q_K)\) is expressed as

$$\begin{aligned} h(x,z\;;q_1\dots q_K)&= \sum _{i=1}^{K} q_i \delta (x) \delta _i(z) ,\quad \text {where} \quad q_i \ge 0, \; \sum _{i=1}^{K} q_i = 1. \end{aligned}$$
(17)

The data manifold \({\mathscr {D}}\) is defined as follows:

$$\begin{aligned} {\mathscr {D}}&= \left\{ h(x,z\;;q_1\dots q_K) \mid q_i \ge 0, \quad \sum _{i=1}^{K}q_i = 1 \right\} . \end{aligned}$$
(18)

It is shown in Appendix A that \({\mathscr {D}}\) is in a mixture family.

We regard the MEM algorithm as a maximum likelihood estimation problem. Thus, we can consider the e- and m-projections between the model manifold \({\mathscr {M}}\) and the data manifold \({\mathscr {D}}\) as follows:

$$\begin{aligned} h\left( x,z;q_{1}^{(k)}\dots q_{K}^{(k)}\right)&= \mathop {\text {argmin}}\limits _{h\in {\mathscr {D}}} D^{(e)}\left( l(\cdot ,\cdot ;m^{(k)}) || h \right) , \end{aligned}$$
(19)
$$\begin{aligned} l\left( x,z;m^{(k+1)}\right)&= \mathop {\text {argmin}}\limits _{l\in {\mathscr {M}}} D^{(m)}\left( h(\cdot ,\cdot ;q_{1}^{(k)}\dots q_{K}^{(k)}) || l \right) . \end{aligned}$$
(20)

The detailed calculation of the e-projection in (19) is provided in Appendix B, and the optimal parameters for \(h\in {\mathscr {D}}\) are given as

$$\begin{aligned} q_i^{(k)} = \frac{ \pi _i f_i(m^{(k)}) }{f(m^{(k)}) }, \quad i = 1 \dots K. \end{aligned}$$
(21)

The m-projection in (20) is equivalent to

$$\begin{aligned} \max _{m \in {\mathbb {R}}^p} \sum _{i=1}^{K} q_i^{(k)} \log f_i(m). \end{aligned}$$
(22)

The detailed derivation is shown in Appendix C.

4.3 Summary: information geometry of MEM

To provide an information geometric perspective on the MEM algorithm, we interpret the algorithm as a problem of finding an optimal probability density function that maximizes the value at \(x=0\). This enables us to cast the MEM algorithm as maximum likelihood estimation. The e-projection of the model distribution \(l(x,z;m^{(k)})\) onto \({\mathscr {D}}\) gives the optimal \(q_i^{(k)},\ i=1\dots K\), which is equivalent to (14) in the original MEM algorithm. The m-projection of \(h(x,z;q_{1}^{(k)}\dots q_{K}^{(k)})\) onto \({\mathscr {M}}\) gives the optimal \(m^{(k+1)}\), which is consistent with (15) in the original MEM algorithm.

5 Information geometry of MLR

In this section, we analyze MLR from the viewpoint of information geometry. We elucidate the source of the difficulty in constructing a model manifold and a data manifold for the MLR model and propose a framework for formulating the MLR model geometrically.

5.1 Problem of constructing manifolds

In order to elucidate the source of the difficulty in constructing manifolds for the MLR model, we consider the parameter estimation of a Gaussian mixture model as a specific example of statistical inference in information geometry. Suppose that observations \(\left\{ x_i\in {\mathbb {R}}^{p} \right\} _{i=1}^{N}\) are i.i.d. according to a Gaussian mixture distribution expressed as

$$\begin{aligned} f(x; \mu ,\varSigma )&= \sum _{i=1}^{K} \pi _i g(x;\mu _i, \varSigma _i) ,\\&\text {where} \quad \left\{ \begin{aligned}&\pi _i \ge 0, \; \sum _{i=1}^{K} \pi _i = 1 ,\\&g(x;\mu _i, \varSigma _i) = \frac{1}{\sqrt{2\pi }^p\sqrt{\mathrm {det}(\varSigma _i) }} \exp \left\{ -\frac{1}{2} (x-\mu _i)^{\top } \varSigma _i^{-1} (x-\mu _i) \right\} . \end{aligned} \right. \end{aligned}$$

Then, the model manifold consists of Gaussian mixture density functions whose parameters are the means and covariance matrices. The data manifold is constructed based on the empirical density function \(\frac{1}{N}\sum _{i=1}^{N}\delta (x-x_i)\).

In the parameter estimation of the Gaussian mixture model, the model manifold is constructed based on the parametric distribution. In contrast, MLR makes no parametric distributional assumption, which makes it nontrivial to construct a model manifold and a data manifold.

To construct the model manifold for the MLR model, we consider (i) the assumption that \(\text {Mode}\left[ \epsilon ;x \right] = 0\) and (ii) the form of the objective function of \(\beta \) for the MLR model, \(\frac{1}{N} \sum _{i=1}^{N} \phi _h \left( y_i - x_i^{\top }\beta \right) \). In view of this assumption and this form, the optimization problem in (3) can be regarded as maximizing, at \(\epsilon = 0\), the kernel density estimate of the probability density function of \(\epsilon \). Based on the given observations, we propose constructing the following model for MLR:

$$\begin{aligned} f(\epsilon ;\beta )&= \frac{1}{N} \sum _{i=1}^{N} \phi _h \left( \epsilon - \epsilon _i(\beta ) \right) , \end{aligned}$$
(23)

where \(\epsilon _i(\beta ) = y_i - x_i^{\top }\beta ,\ i=1\dots N\), and \(\epsilon \) denotes the error variable. We introduce the latent variable \(Z\in \left\{ 1\dots N \right\} \), which specifies the mixture component from which an observation is obtained. The joint density function of \(\epsilon \) and Z is

$$\begin{aligned} g(\epsilon ,z;\beta )&= \prod _{i=1}^{N} \left[ \frac{1}{N} \phi _h \left( \epsilon - \epsilon _i(\beta ) \right) \right] ^{\delta _i(z)} . \end{aligned}$$
(24)

The model manifold \({\mathscr {M}}\) is defined as

$$\begin{aligned} {\mathscr {M}}&= \left\{ g(\epsilon ,z;\beta ) \mid \beta \in {\mathbb {R}}^{p} \right\} . \end{aligned}$$
(25)

It is shown in Appendix D that \({\mathscr {M}}\) is in a curved exponential family.
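
To make the construction concrete: the model density (23) is the kernel density estimate of the residuals \(\epsilon _i(\beta )\), and, for a symmetric kernel, the MLR objective in (3) is exactly this density evaluated at \(\epsilon = 0\). The short sketch below (ours, with a Gaussian kernel and illustrative data) verifies this numerically.

```python
import numpy as np

def phi_h(z, h):
    # Gaussian kernel with bandwidth h
    return np.exp(-0.5 * (z / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def model_density(eps, beta, X, y, h):
    # f(eps; beta) in Eq. (23): KDE of the residuals eps_i(beta) = y_i - x_i^T beta
    return phi_h(eps - (y - X @ beta), h).mean()

rng = np.random.default_rng(2)
N = 200
X = np.column_stack([np.ones(N), rng.uniform(-5, 5, N)])
y = X @ np.array([1.0, 2.0]) + rng.exponential(scale=2.0, size=N)

beta, h = np.array([0.5, 1.8]), 1.0
objective = phi_h(y - X @ beta, h).mean()             # objective in Eq. (3)
print(model_density(0.0, beta, X, y, h), objective)   # identical (the kernel is symmetric)
```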

We next propose constructing a data manifold for the MLR model. The empirical density function is usually constructed based on observations. Taking into account (i) the construction proposed in Sect. 4.2 and (ii) the assumption that \(\text {Mode}\left[ \epsilon ;x \right] = 0\), we propose constructing the empirical density function as follows:

$$\begin{aligned} p(\epsilon )&= \delta (\epsilon - 0) = \delta (\epsilon ) . \end{aligned}$$
(26)

By introducing the latent variable \(Z\in \left\{ 1\dots N \right\} \) to (26), we extend \(p(\epsilon )\) to the empirical joint density function of \(\epsilon \) and Z:

$$\begin{aligned} h(\epsilon , z)&= p(\epsilon ) q(z \mid \epsilon ) . \end{aligned}$$

By introducing the parameters \(\left\{ q_i \right\} _{i=1}^{N}\), we model the conditional density function \(q(z\mid \epsilon )\) as

$$\begin{aligned} q(z \mid \epsilon ) = \sum _{i=1}^{N} q_i \delta _i(z) ,\quad \text {where} \quad q_i \ge 0, \; \sum _{i=1}^{N} q_i = 1 \end{aligned}$$

Then, the empirical joint density function \(h(\epsilon ,z\ ;q_1\dots q_N)\) is expressed as

$$\begin{aligned} h(\epsilon , z\ ; q_1 \dots q_N)&= \sum _{i=1}^{N} q_i \delta (\epsilon ) \delta _i(z) ,\quad \text {where} \quad q_i \ge 0, \;\sum _{i=1}^{N} q_i = 1 \end{aligned}$$
(27)

The data manifold \({\mathscr {D}}\) is defined as follows:

$$\begin{aligned} {\mathscr {D}} = \left\{ h(\epsilon ,z\ ;q_1\dots q_N) \mid q_i \ge 0, \quad \sum _{i=1}^{N} q_i = 1 \right\} . \end{aligned}$$
(28)

It is shown in Appendix E that \({\mathscr {D}}\) is in a mixture family.

Here, we consider the e-projection of the model with parameters \(\beta ^{(k)}\) onto the data manifold:

$$\begin{aligned}&\min _{h \in {\mathscr {D}}} D^{(e)}(g(\cdot ,\cdot ;\beta ^{(k)})||h) .\nonumber \\&\quad \rightarrow \quad \left| \begin{aligned} \min _{q_1 \dots q_N}&\quad D^{(e)} \left( g(\cdot ,\cdot ;\beta ^{(k)})||h(\cdot ,\cdot \ ;q_1\dots q_N) \right) ,\\ \text {s.t.}&\quad q_i \ge 0, \; \;\sum _{i=1}^{N} q_i = 1 \end{aligned} \right. \end{aligned}$$
(29)

The detailed calculation is shown in Appendix F. An optimal solution for (29) is

$$\begin{aligned} q_i^{(k)} = \frac{\phi _h \left( y_i - x_i^{\top }\beta ^{(k)} \right) }{\sum _{j=1}^{N} \phi _h \left( y_j - x_j^{\top }\beta ^{(k)} \right) }, \quad i = 1 \dots N, \end{aligned}$$
(30)

which is equivalent to the E-step in (5). Then, we consider the m-projection of the empirical joint density function with the parameters \(q_i=q_i^{(k)},\ i=1\dots N\) onto the model manifold:

$$\begin{aligned} \min _{g \in {\mathscr {M}}} D^{(m)}(h(\cdot ,\cdot \ ;q_1=q_1^{(k)}\dots q_N=q_N^{(k)})||g) . \end{aligned}$$
(31)

The detailed calculation is shown in Appendix G. The optimization problem expressed as (31) is equivalent to

$$\begin{aligned} \max _{\beta } \sum _{i=1}^{N} q_i^{(k)} \log \phi _h \left( y_i - x_i^{\top }\beta \right) , \end{aligned}$$
(32)

which is equivalent to the M-step (9).

Fig. 2 Conceptual diagram of the em algorithm corresponding to the MLR model

Figure 2 shows the update process of the em algorithm corresponding to the MLR model parameter estimation.

In this section, we proposed constructing a model manifold and a data manifold for the MLR model. Although a model manifold is often constructed based on a parametric distribution assumption, we constructed it based on observations. Concerning the empirical distribution, our construction is based on (i) the assumption that \(\text {Mode}\left[ \epsilon ;x \right] = 0\) and (ii) the construction proposed in Sect. 4.2. We applied the framework of the em algorithm to the proposed manifolds and showed that the e-projection of the model onto the data manifold yields (5), while the m-projection of the empirical distribution onto the model manifold yields (9).

6 On the influence function for MLR

An influence function quantifies the effect of a single observation on an estimate. Let T be a functional defined on a set of probability measures and F be a probability measure. The influence function of T at F, denoted by \(\text {IF}(x;T,F)\), is defined as follows:

$$\begin{aligned} \text {IF}(x;T,F) = \lim _{\epsilon \rightarrow 0} \frac{ T \left( (1-\epsilon )F+ \epsilon \varDelta _{x} \right) - T \left( F \right) }{ \epsilon }. \end{aligned}$$
(33)

Yao and Li [23] argued that MLR is robust against outliers based on an investigation of its breakdown point. Elucidating the reason for this robustness is important because it can provide a clue to developing novel robust estimators. There are various approaches in the literature for making an estimator robust. One representative approach is to change the manifold onto which the m-projection is performed. For example, the median is often used as a robust estimate of the mean; this approach corresponds to adopting a model manifold composed of Laplace distributions instead of Gaussian distributions. On the other hand, a robust estimator can also be obtained by changing the projection method, for example, by using robust divergences instead of the KL-divergence [8].

So far, we have established the equivalence of the em and EM algorithms for MLR. The em algorithm is based on the KL-divergence, which is sensitive to outliers. Thus, we conjecture that the m-projection onto the model manifold, which is composed of the error distributions given by KDE, is the source of the robustness of MLR. Here, we focus on the influence function [11] of the estimator for the MLR model.

Unfortunately, it is difficult to derive the influence functions corresponding to the updates of the coefficient estimate in the e- and m-steps of the MLR problem, because it is not obvious how to deal with the effect of an outlier on the projection onto the data-dependent manifold.

In the finite-sample robustness analysis, the rescaled version of the influence function [11] is used to evaluate the effect of an outlier \((u,v)\in {\mathbb {R}}^p \times {\mathbb {R}}\) on each projection. Suppose that the current estimate is \(\beta ^{(k)}\). Without the outlier, the joint density function in Eq. (24) is projected onto the data manifold \({\mathscr {D}}\). When the dataset is contaminated with an outlier (u, v), the joint density is modified as

$$\begin{aligned} g_{N+1}(\epsilon ,z;\beta ^{(k)}) = \prod _{i=1}^{N+1} \left[ \frac{1}{N+1} \phi _h(\epsilon - \epsilon _i(\beta ^{(k)})) \right] ^{\delta _{i}(z)} \end{aligned}$$
(34)

where \(\epsilon _{N+1}(\beta ^{(k)}) = v-u^{\top }\beta ^{(k)}\), and this contaminated joint density is projected onto the outlier-contaminated data manifold \({\mathscr {D}}_{N+1}\), which is parameterized by \(q_1\dots q_{N+1}\). Let us express the result of the e-projection without the outlier as \(q^{(k)} = \left( q_{1}^{(k)} \dots q_{N}^{(k)} \right) \) and the result of the e-projection with the outlier as \(q^{'(k)} = \left( q_{1}^{'(k)} \dots q_{N+1}^{'(k)} \right) \). Then the rescaled version of the influence function on the e-projection is expressed as follows:

$$\begin{aligned}&\frac{q_{i}^{'(k)}-q_{i}^{(k)}}{\frac{1}{N+1}} \\&\quad = -\frac{ (N+1) \phi _h(v-u^{\top }\beta ^{(k)}) \phi _h(\epsilon _i(\beta ^{(k)})) }{\left( \sum _{j=1}^{N}\phi _h(\epsilon _j(\beta ^{(k)})) \right) \left\{ \phi _h(v-u^{\top }\beta ^{(k)}) + \sum _{j=1}^{N}\phi _h(\epsilon _j(\beta ^{(k)})) \right\} } ,\quad i=1\dots N . \end{aligned}$$
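
This expression follows directly from the form of the e-projection weights, and the following small check (ours, with synthetic data and an arbitrarily placed outlier) confirms it numerically.

```python
import numpy as np

def phi_h(z, h):
    return np.exp(-0.5 * (z / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(3)
N, h = 50, 1.0
X = np.column_stack([np.ones(N), rng.uniform(-5, 5, N)])
y = X @ np.array([1.0, 2.0]) + rng.exponential(scale=2.0, size=N)
beta_k = np.array([0.8, 1.9])
u, v = np.array([1.0, 4.0]), 30.0            # outlier (u, v)

w = phi_h(y - X @ beta_k, h)                 # phi_h(eps_i(beta^(k))), i = 1 ... N
w_out = phi_h(v - u @ beta_k, h)             # phi_h(v - u^T beta^(k))

q = w / w.sum()                              # e-projection weights without the outlier
q_prime = w / (w.sum() + w_out)              # first N weights with the outlier

lhs = (q_prime - q) / (1.0 / (N + 1))        # rescaled influence on the e-projection
rhs = -(N + 1) * w_out * w / (w.sum() * (w_out + w.sum()))
print(np.allclose(lhs, rhs))                 # True
```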

In the same manner, we can calculate the rescaled version of the influence function on the m-projection by comparing the results of the m-projections with and without the outlier. The regression coefficients obtained by the m-projection without and with the outlier are denoted by \(\beta ^{(k+1)}\) and \(\beta ^{'(k+1)}\), respectively. When the sample is contaminated by the outlier (u, v), the m-projection projects \(h(\epsilon ,z; q_1^{'(k)}\dots q_{N+1}^{'(k)})\) onto \({\mathscr {M}}_{N+1}\), which is the set of densities \(g_{N+1}(\epsilon ,z;\beta )\) of the form in Eq. (34) parameterized by \(\beta \). The influence function on the m-projection is then written as

$$\begin{aligned}&\frac{\beta ^{'(k+1)} - \beta ^{(k+1)}}{\frac{1}{N+1}} \\&\quad = (N+1) \left[ \begin{aligned}&\left( q_{N+1}^{'(k+1)}uu^{\top } + \sum _{i=1}^{N} q_{i}^{'(k+1)}x_{i}x_{i}^{\top } \right) ^{-1}\left( q_{N+1}^{'(k+1)}vu + \sum _{i=1}^{N}q_{i}^{'(k+1)}y_i x_i \right) \\&- \left( \sum _{i=1}^{N}q_{i}^{(k+1)} x_i x_i^{\top } \right) ^{-1} \left( \sum _{i=1}^{N} q_{i}^{(k+1)}y_i x_i \right) \end{aligned} \right] \end{aligned}$$

Although we were able to derive the influence functions for the individual e- and m-projections, there are problems with the above treatment of the effect of an outlier. In the m-projection, we did not take into account the indirect effect of the outlier that is inherited from the e-projection in the outlier-contaminated case. Moreover, the standard definition of the influence function is associated with the probability measure F of (X, Y). The nonparametric nature of the proposed information geometric formulation of modal linear regression makes the conventional robustness analysis difficult.

Toward an understanding of the source of the robustness of MLR, we consider the influence function of \({\hat{\beta }}\) defined in Eq. (3), which is a standard \(\psi \)-type M-estimator, and the influence function of the estimate obtained by the em algorithm, viewed as an iteratively reweighted least-squares (IRLS) estimator.

The influence function can be derived independently of the algorithm used to compute the estimate. For example, the regression coefficient estimate \({\hat{\beta }}\) of the MLR model defined by Eq. (3) is a \(\psi \)-type M-estimator, and its influence function is expressed as

$$\begin{aligned}&\text {IF}\left( u,v; F \right) = \left( \int \frac{d^2}{dz^2}\phi _h(z)_{z=y-x^{\top }{\hat{\beta }}(F)} xx^{\top } dF(x,y) \right) ^{-1} \psi \left( u,v,{\hat{\beta }}(F) \right) , \end{aligned}$$
(35)
$$\begin{aligned}&\text {where} \quad \psi \left( u,v,{\hat{\beta }}(F) \right) = \frac{d}{dz}\phi _h(z)_{z=v-u^{\top }{\hat{\beta }}(F)} u. \end{aligned}$$
(36)

We note that Eq. (35) does not depend on whether the EM algorithm or the steepest descent method is used. Details of the derivation are given in Proposition 1 of Appendix H. To see the effect of an outlier, let us consider a very simple case in which \({\hat{\beta }}(F) = \beta ^{*}\) and the predictor variable X and the error variable \(\epsilon \) are independent. We also assume that the Gaussian kernel with bandwidth h is adopted as the kernel function. Then, Eq. (35) simplifies to

$$\begin{aligned} \text {IF}(u, v; F) = -\frac{ h^2 (v-u^{\top }\beta ^{*}) \phi _h(v-u^{\top }\beta ^{*}) }{\int (\epsilon ^2-h^2) \phi _h(\epsilon ) dF_{\epsilon }(\epsilon )} \left[ \int xx^{\top } dF_X(x) \right] ^{-1} u , \end{aligned}$$
(37)

where \(F_X\) and \(F_{\epsilon }\) are the probability measures of X and \(\epsilon \), respectively. Equation (37) shows that, for any given \(u_0\in {\mathbb {R}}^p\), \(\lim _{v\rightarrow \infty } \left| \text {IF}(u_0, v; F) \right| = 0\) holds because \(x\phi _{h}(x) \underset{x\rightarrow \infty }{\rightarrow } 0\). The behavior of Eq. (37) is illustrated in Fig. 3, in which the contour of Eq. (37) is computed under the model \(Y=X\beta ^{*} + \epsilon \) with \(\beta ^{*} =1\), \(X,Y \in {\mathbb {R}}\), \(\epsilon \sim {\mathscr {N}}(0,1^2)\), \(X\sim \text {Uniform}(-10,10)\), and the kernel bandwidth set to \(h=3\). In the figure, the dashed line denotes the ground-truth regression line. The figure shows that outliers around the ground-truth regression line have only a minor impact on the estimate of the regression coefficient.

Fig. 3 Contour plot of the influence function in a simple setting
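
The sketch below (ours) evaluates Eq. (37) in the setting of Fig. 3; the denominator is computed by numerical quadrature, and plotting the values on a grid of (u, v) with a contour routine reproduces the qualitative behavior of the figure: the influence vanishes on the true regression line and far away from it, and peaks at a residual of about \(\pm h\).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

h, beta_star = 3.0, 1.0                    # bandwidth and true coefficient, as in Fig. 3

def phi_h(z):
    # Gaussian kernel density with bandwidth h
    return norm.pdf(z, scale=h)

# Denominator of Eq. (37): integral of (eps^2 - h^2) phi_h(eps) under eps ~ N(0, 1)
denom, _ = quad(lambda e: (e ** 2 - h ** 2) * phi_h(e) * norm.pdf(e), -np.inf, np.inf)
EX2 = 10.0 ** 2 / 3.0                      # E[X^2] for X ~ Uniform(-10, 10)

def influence(u, v):
    # Eq. (37) in the one-dimensional setting
    r = v - u * beta_star
    return -h ** 2 * r * phi_h(r) / denom / EX2 * u

# Zero on the true line, maximal near |residual| = h, vanishing for a gross outlier
print(influence(5.0, 5.0), influence(5.0, 5.0 + h), influence(5.0, 100.0))
# A contour plot over a grid of (u, v) reproduces Fig. 3 qualitatively, e.g.
# U, V = np.meshgrid(np.linspace(-10, 10, 201), np.linspace(-15, 15, 201))
# Z = np.vectorize(influence)(U, V)   # then matplotlib.pyplot.contour(U, V, Z)
```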

On the other hand, the estimate \({\hat{\beta }}\) can also be defined as the limit of the EM algorithm derived in [23]. In particular, when the Gaussian kernel is adopted for the density estimation, the resulting estimator can be regarded as an IRLS estimator. Dollinger and Staudte [6] addressed a related problem when they derived the influence function of an IRLS estimator for the linear regression model. They revealed the relationship between the influence functions of the \((k+1)\)-step and k-step estimates and derived the influence function of the estimator as its limit.

Following the approach of Dollinger and Staudte [6], we regard the em algorithm for MLR as an IRLS procedure in which one IRLS iteration corresponds to a pair of e- and m-projections. Let the initial estimate be \(\beta ^{(1)}\) and the k-th estimate be \(\beta ^{(k)}\). Then, it can be shown that the influence functions of the estimates \(\beta ^{(k)}\) and \(\beta ^{(k+1)}\) satisfy the following recurrence relation:

$$\begin{aligned}&\text {IF}(u,v; \beta ^{(k+1)}, F) = z_{k} + A_{k}\; \text {IF}(u,v;\beta ^{(k)},F) , \end{aligned}$$
(38)
$$\begin{aligned}&\text {where} \left\{ \begin{aligned}&z_{k} = \varSigma _{k}^{-1} c_{k} ,\\&\varSigma _{k} = \int \phi _h \left( y-x^{\top }\beta ^{(k)}(F)\right) xx^{\top } dF(x,y) ,\\&c_{k} = \phi _h \left( v-u^{\top }\beta ^{(k)}(F) \right) \left( v-u^{\top }\beta ^{(k+1)}(F) \right) u\\ \quad&\quad - \int \phi _h \left( y-x^{\top }\beta ^{(k)}(F) \right) \left( y-x^{\top }\beta ^{(k+1)}(F) \right) x dF(x,y) ,\\&C_{k} = -\int \frac{d}{dz} \phi _h(z)_{z=y-x^{\top }\beta ^{(k)}(F)} \left( y-x^{\top }\beta ^{(k+1)}(F)\right) xx^{\top } dF(x,y) ,\\&A_{k} = \varSigma _{k}^{-1} C_{k} . \end{aligned} \right. \end{aligned}$$
(39)

Details of the derivation are given in Proposition 2 of Appendix H. In the original least-squares regression problem treated in [6], the weighted least-squares (WLS) estimator is proven to be Fisher consistent by using the symmetry of the noise distribution, and \(\varSigma _k, c_k, C_k\) in the recurrence relation are shown to be independent of k. In our MLR problem, however, the noise distribution cannot be assumed to be symmetric, so we consider the converged value \(\beta ^{(\infty )}\) to remove the dependence on k. In this limit, \(\varSigma _k, c_k, C_k\) do not depend on k, and Eq. (38) becomes

$$\begin{aligned}&\text {IF}(u,v;\beta ^{(\infty )}, F) = z + A \; \text {IF}(u,v;\beta ^{(\infty )}, F). \end{aligned}$$
(40)

Now we assume that \(\left| A\right| <1\); moreover, the use of the Gaussian kernel implies \(\frac{d}{dz}\phi _h(z) = - \frac{1}{h^2}z\phi _h(z)\). Then, we obtain

$$\begin{aligned} \text {IF}(u,v; \beta ^{(\infty )},F) =&\left( \varSigma - C \right) ^{-1} c , \end{aligned}$$
(41)
$$\begin{aligned} =&\left( \int \frac{d^2}{dz^2}\phi _h(z)_{z=y-x^{\top }\beta ^{(\infty )}} xx^{\top } dF(x,y) \right) ^{-1} \nonumber \\&\times \left[ \psi (u,v,\beta ^{(\infty )}) + \frac{\partial }{\partial \beta } \left\{ \int \phi _h(y-x^{\top }\beta ) dF(x,y) \right\} _{\beta ^{(\infty )}} \right] . \end{aligned}$$
(42)

If \(\beta ^{(\infty )}\) is equal to \({\hat{\beta }}(F)\), then \(\frac{\partial }{\partial \beta } \left\{ \int \phi _h(y-x^{\top }\beta ) dF(x,y) \right\} _{{\hat{\beta }}} = 0\) holds by the definition of \({\hat{\beta }}(F)\), and Eq. (42) is consistent with Eq. (35).
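
As a complement, the fixed point (41) can be approximated numerically by replacing the integrals in (39) with sample averages. The sketch below (ours) does so in the simple setting of Fig. 3, taking \(\beta ^{(\infty )}\) equal to the true coefficient; the resulting value is comparable to Eq. (37) evaluated at the same (u, v). Everything here is an illustrative Monte Carlo approximation rather than part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(4)
h, beta_inf = 3.0, np.array([1.0])        # beta^(infinity) approximated by beta* = 1

def phi_h(z):
    return np.exp(-0.5 * (z / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

# Monte Carlo sample standing in for F (the setting of Fig. 3)
M = 200_000
X = rng.uniform(-10.0, 10.0, size=(M, 1))
y = X @ beta_inf + rng.normal(0.0, 1.0, size=M)
z = y - X @ beta_inf                      # residuals at beta^(infinity)

# Sample-average versions of Sigma, C and c in Eq. (39), with beta^(k) = beta^(k+1) = beta^(inf);
# for the Gaussian kernel, -phi_h'(z) * z = (z^2 / h^2) * phi_h(z)
Sigma = (phi_h(z)[:, None, None] * X[:, :, None] * X[:, None, :]).mean(axis=0)
C = ((z ** 2 / h ** 2 * phi_h(z))[:, None, None] * X[:, :, None] * X[:, None, :]).mean(axis=0)
u, v = np.array([5.0]), 8.0               # outlier (u, v)
r = v - u @ beta_inf
c = phi_h(r) * r * u - ((phi_h(z) * z)[:, None] * X).mean(axis=0)

IF = np.linalg.solve(Sigma - C, c)        # Eq. (41): IF = (Sigma - C)^{-1} c
print(IF)                                 # comparable to Eq. (37) evaluated at (u, v) = (5, 8)
```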

In this section, we conjectured that the robustness of MLR is due to the particular structure of the model manifold. To support this conjecture, we attempted to derive the influence functions with respect to the e- and m-steps of the estimation procedure for the model coefficients. It is currently difficult to derive these influence functions exactly, but we showed the difference between the influence function derived from the viewpoint of the M-estimator and that derived from the IRLS estimator. The IRLS estimator is composed of a sequence of e- and m-steps. Our future work is to identify the contribution of each of the e- and m-steps to the influence function.

7 Conclusions

In this paper, we provide an information geometric perspective on the MLR model, which is a semiparametric method. First, we discuss the MEM algorithm and investigate the relationship between the algorithm and information geometry. We cast the MEM algorithm as a maximum likelihood method by assuming a pseudo-observation; this gives us an empirical density function based on the assumption that a single pseudo-observation is obtained. Second, we address the relationship between the MLR model and information geometry. Because the MLR model does not assume a parametric distribution, we cannot construct a corresponding model manifold with conventional approaches; we therefore propose constructing the model manifold based on observations. The empirical density function introduced through the discussion of the MEM algorithm is applied to construct the data manifold. We clarify the relationship between the EM algorithm developed by Yao and Li [23] and the em algorithm corresponding to the MLR model.

Elucidating the factors or geometric operations that make the estimator for the MLR model robust remains future work. We will further investigate the influence functions corresponding to the e- and m-steps for estimating the coefficients of the MLR model.