1 Introduction

Linear regression models the conditional mean of a response variable y given the predictor variable x. The well-known least-squares estimator of the linear regression coefficients is highly sensitive to outliers. To alleviate this problem, many estimators have been developed, such as robust M-estimators [12, 13]. However, the consistency of robust M-estimators requires the conditional error distribution to be homoscedastic and symmetric given the predictor. In reality, many types of data do not satisfy these assumptions (e.g., wages, prices, and expenditures). Baldauf and Silva [3] pointed out that the estimation cannot be consistent unless the data satisfy the required assumptions, an issue that also arises in real-world applications [18].

Modal linear regression (MLR) models the conditional mode of y given x by a linear predictor function of x. MLR relaxes the distributional assumptions required by the M-estimators of linear regression and is more robust against outliers than the least-squares estimation of the linear regression coefficients. It is also robust against violations of the standard assumptions of ordinary mean regression, such as heavy-tailed noise and skewed conditional and noise distributions. Kemp and Silva [14] and Yao and Li [23] proved that their estimators for the MLR model are consistent even when the error distribution is asymmetric. For these reasons, improving methods for mode estimation has been an important research topic for many years.

In information geometry [2], a manifold that consists of statistical models is called a model manifold. An information geometric formulation of an estimation algorithm is useful for understanding the behavior and characteristics of the algorithm. For example, the procedure of parameter estimation can be regarded as a projection from a point in the statistical manifold onto a point in the model manifold. Modal linear regression is known to be robust to outliers, and we aim to elucidate the source of this robustness by formulating the estimation procedure as geometric operations. Identifying and understanding the source of the robustness of the estimation algorithm and the statistical model associated with modal linear regression would be helpful for developing other algorithms and models that are robust to outliers. In information geometry, a model manifold is often constructed from a parametric distribution. Because the MLR model lacks a parametric distribution, constructing a corresponding manifold is difficult with conventional approaches. The contribution of this paper is to provide, for the first time, an information geometric perspective on MLR.

1.1 Related works

The modal regression model is related to kernel density estimation (KDE) [22], a nonparametric method for estimating the probability density function of observed data. Parzen [19] established sufficient conditions for the \(L^2\)-consistency and asymptotic normality of KDE and derived conditions for the consistency and asymptotic normality of a mode estimator constructed from KDE. Epanechnikov [7] found the optimal kernel function for KDE under certain conditions. In general, if the probability density function of a random variable X has a unique mode and is symmetric with respect to the mode, then \(\text {Pr} \left( p-w \le X \le p+w \right) \) with fixed w is maximized when p is the mode. Based on this property, Lee [15] proposed an estimator for the coefficients of MLR. The MLR model is formulated as follows:

$$\begin{aligned} y&= x^{\top }\beta + \epsilon ,\quad \text {where} \quad \text {Mode} \left[ \epsilon ; x \right] = 0. \end{aligned}$$

The estimator of Lee [15] is consistent when there exists \(w>0\) such that the probability density function of \(\epsilon \) is symmetric in the range \(0\pm w\). Lee [16] made the objective function for the MLR model more tractable by using a quadratic kernel. Both Lee [15, 16] require the conditional probability density function of \(\epsilon \) given x to be symmetric around 0 in the range of \(\pm w\). Kemp and Silva [14] proved that the mode estimator for the coefficients of MLR is consistent even if this symmetry is not satisfied. Yao and Li [23] proposed an expectation–maximization (EM) algorithm to estimate the coefficients of MLR.

Besides linear modeling, other studies have taken semiparametric or nonparametric approaches to modal regression. Gannoun et al. [9] proposed a semiparametric modal regression model that assumes a linear relation among the mean, median, and mode. Yao et al. [24] developed a local modal regression that estimates the conditional mode of the response as a polynomial function of the predictors. Chen et al. [4] defined the modal manifold as a union of sets in which the first derivative of the conditional density is zero and the second derivative is negative; the modal manifold is then estimated by kernel estimates of the derivatives of the density.

In information geometry, a model manifold is often constructed by using a parametric distribution, and estimation is regarded as the projection of an empirical distribution onto the model manifold. In the case of linear regression, the model manifold is constructed under the assumption that the error variable has a normal distribution. Because the MLR model lacks a parametric distribution, constructing a corresponding model manifold is difficult with conventional approaches. Some studies have considered nonparametric models in information geometry. Pistone and Sempi [20] constructed a well-defined Banach manifold of probability measures. Grasselli [10] addressed the Fisher information and \(\alpha \)-connections for the Banach manifold. Zhang [25] discussed the relationship among divergence functions, the Fisher information, \(\alpha \)-connections, and fundamental issues in information geometry. Takano et al. [21] proposed a framework for nonparametric e-mixture estimation. In contrast to these nonparametric approaches to information geometry, in this paper we consider the information geometry associated with the semiparametric MLR model. We propose to construct a model manifold by using observations, as is done when constructing an empirical distribution with conventional approaches. Our proposal provides a geometric viewpoint on the MLR model.

2 Modal linear regression

Let \(x \in {\mathbb {R}}^p\) and \(y \in {\mathbb {R}}\) be a set of predictor variables and a response variable, respectively. Ordinary least-squares linear regression estimates the conditional mean of y given x, whereas MLR estimates the conditional mode of y given x. In this section, we briefly explain the EM algorithm for MLR introduced by Yao and Li [23].

2.1 Formulation

Suppose that \(\left\{ x_i, y_i \right\} _{i=1}^{N}\) are i.i.d. observations, where the i-th predictor variable is denoted by \(x_i\in {\mathbb {R}}^p\) and the corresponding response is denoted by \(y_i \in {\mathbb {R}}\). With MLR, we model a conditional mode of y given x by a linear function of x:

$$\begin{aligned} \text {Mode} \left[ y;x \right]&= x^{\top }\beta , \end{aligned}$$

where \(\text {Mode} \left[ y;x \right] = \mathop {\text {argmax}}\limits _{y} f(y|x)\) for the conditional density function f(y|x). Namely, y and x are related as

$$\begin{aligned} y = x^{\top }\beta + \epsilon , \quad \text {where} \quad \text {Mode} \left[ \epsilon ;x \right] = 0. \end{aligned}$$
(1)

To estimate \(\beta \), Lee [15] introduced a loss function of the form

$$\begin{aligned} l(\beta ; y, x) = - \phi _{h} \left( y - x^{\top }\beta \right) , \end{aligned}$$
(2)

where \(\phi _h(x) = \frac{1}{h}\phi \left( \frac{x}{h} \right) \), \(\phi (\cdot )\) is a kernel function, and h is a bandwidth parameter. Minimizing the empirical loss leads to the estimate \({\hat{\beta }}\) of the linear coefficient:

$$\begin{aligned} {\hat{\beta }}= \mathop {\text {argmax}}\limits _{\beta } \frac{1}{N} \sum _{i=1}^{N} \phi _h(y_i - x_i^{\top }\beta ). \end{aligned}$$
(3)

In this paper, \(\phi (\cdot )\) denotes a standard normal density function. The consistency and asymptotic normality of the estimate \({\hat{\beta }}\) obtained by (3) have been established under certain regularity conditions on the samples, kernel function, parameter space, and vanishing rate of the bandwidth parameter [14].
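
For concreteness, the following is a minimal numerical sketch of the objective in (3) with a Gaussian kernel. The data-generating model, bandwidth value, and grid search are illustrative choices of ours and are not part of the original estimator; in practice the maximization is carried out by the EM algorithm described next.

```python
import numpy as np

def gaussian_kernel(z, h):
    # phi_h(z) = (1/h) * phi(z/h), with phi the standard normal density
    return np.exp(-0.5 * (z / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def mlr_objective(beta, X, y, h):
    # (1/N) * sum_i phi_h(y_i - x_i^T beta), the quantity maximized in (3)
    return gaussian_kernel(y - X @ beta, h).mean()

# Toy data: the conditional mode is x^T beta_true because the exponential noise has mode 0
rng = np.random.default_rng(0)
N = 500
X = np.column_stack([np.ones(N), rng.uniform(-5, 5, N)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.exponential(scale=2.0, size=N)

# Crude grid search over (intercept, slope), only to visualize where the objective peaks
h = 1.0
grid = np.linspace(-1.0, 4.0, 51)
best = max(((mlr_objective(np.array([b0, b1]), X, y, h), b0, b1)
            for b0 in grid for b1 in grid), key=lambda t: t[0])
print("grid maximizer:", best[1:], "objective value:", best[0])
```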

2.2 EM algorithm for MLR

Here, we introduce the EM algorithm for MLR parameter estimation proposed by Yao and Li [23]. The algorithm iterates the following two steps, starting from an initial estimate \(\beta ^{(1)}\):

E-Step

Consider the surrogate function

$$\begin{aligned} \gamma (\beta ;\beta ^{(k)}) = \sum _{i=1}^{N} \pi _i^{(k)} \log \left[ \frac{ \frac{1}{N}\phi _h \left( y_i-x_i^{\top }\beta \right) }{\pi _i^{(k)}} \right] , \end{aligned}$$
(4)

where

$$\begin{aligned} \pi _i^{(k)} = \frac{\phi _h(y_i - x_i^{\top }\beta ^{(k)}) }{\sum _{j=1}^{N} \phi _h(y_j - x_j^{\top }\beta ^{(k)}) },\quad i = 1 \dots N. \end{aligned}$$
(5)

This function satisfies

$$\begin{aligned} \gamma (\beta ^{(k)};\beta ^{(k)}) =\log \left[ \frac{1}{N}\sum _{i=1}^{N}\phi _h\left( y_i - x_i^{\top }\beta ^{(k)} \right) \right] \end{aligned}$$
(6)

and

$$\begin{aligned}&\log \left[ \frac{1}{N}\sum _{i=1}^{N} \phi _h \left( y_i-x_i^{\top }\beta \right) \right] \nonumber \\&\quad = \log \left[ \sum _{i=1}^{N} \pi _i^{(k)} \frac{\frac{1}{N}\phi _h \left( y_i-x_i^{\top }\beta \right) }{\pi _i^{(k)}} \right] ,\quad \text {by Jensen's inequality} \nonumber \\&\quad \ge \sum _{i=1}^{N} \pi _i^{(k)} \log \left[ \frac{ \frac{1}{N}\phi _h \left( y_i-x_i^{\top }\beta \right) }{\pi _i^{(k)}} \right] = \gamma (\beta ;\beta ^{(k)}). \end{aligned}$$
(7)

M-Step

In this step, the parameter \(\beta \) is updated to increase the value of \(\frac{1}{N}\sum _{i=1}^{N}\phi _h \left( y_i-x_i^{\top }\beta \right) \). The updated parameter \(\beta ^{(k+1)}\) is given as

$$\begin{aligned} \beta ^{(k+1)}&= \mathop {\text {argmax}}\limits _{\beta } \gamma (\beta ;\beta ^{(k)}) . \end{aligned}$$
(8)

The following inequality holds:

$$\begin{aligned} \log \left[ \frac{1}{N}\sum _{i=1}^{N}\phi _h \left( y_i-x_i^{\top }\beta ^{(k+1)} \right) \right]&\ge \gamma (\beta ^{(k+1)};\beta ^{(k)}) \\&\ge \gamma (\beta ^{(k)};\beta ^{(k)}) = \log \left[ \frac{1}{N}\sum _{i=1}^{N}\phi _h \left( y_i-x_i^{\top }\beta ^{(k)} \right) \right] . \end{aligned}$$

Equation (8) is equivalent to (9).

$$\begin{aligned} \beta ^{(k+1)}&= \mathop {\text {argmax}}\limits _{\beta } \sum _{i=1}^{N} \pi _i^{(k)} \log \phi _h(y_i - x_i^{\top }\beta ). \end{aligned}$$
(9)

When \(\phi (\cdot )\) is a standard normal density function,

$$\begin{aligned} \beta ^{(k+1)} = \left( X^{\top } W_{k} X \right) ^{-1} X^{\top } W_{k} y, \quad W_{k} = \text {diag} \begin{pmatrix} \pi _1^{(k)}&\cdots&\pi _N^{(k)} \end{pmatrix}. \end{aligned}$$

The properties of the estimate \({\hat{\beta }}\) are discussed in [23].
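
The iteration above can be summarized in a few lines. The sketch below is our own illustration with a Gaussian kernel: it implements the E-step (5) and the weighted least-squares form of the M-step, and on data with skewed noise the resulting modal line separates clearly from the ordinary least-squares line.

```python
import numpy as np

def phi_h(z, h):
    # Gaussian kernel density with bandwidth h
    return np.exp(-0.5 * (z / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def mlr_em(X, y, h, beta_init, n_iter=200, tol=1e-8):
    """EM iteration of Yao and Li for MLR (sketch with a Gaussian kernel)."""
    beta = np.asarray(beta_init, dtype=float)
    for _ in range(n_iter):
        # E-step, Eq. (5): responsibilities pi_i^(k)
        w = phi_h(y - X @ beta, h)
        pi = w / w.sum()
        # M-step, Eq. (9): weighted least squares with weights pi_i^(k)
        W = np.diag(pi)
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

# Skewed (exponential) noise with mode 0: the modal line and the mean line differ
rng = np.random.default_rng(1)
N = 1000
X = np.column_stack([np.ones(N), rng.uniform(-5, 5, N)])
y = X @ np.array([1.0, 2.0]) + rng.exponential(scale=2.0, size=N)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]      # estimates the mean line
beta_mlr = mlr_em(X, y, h=1.0, beta_init=beta_ols)   # estimates the modal line
print("OLS:", beta_ols, "MLR (EM):", beta_mlr)
```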

3 Information geometry

In this section, we briefly explain information geometry, statistical inference, and the em algorithm. Information geometry [2] is a framework for formulating spaces of probability density functions by means of differential geometry. Let S be a statistical manifold composed of probability distributions.

A parametric family of probability distributions of the form

$$\begin{aligned} f(x;\theta ) =&\exp \left\{ C(x) + \sum _{i=1}^{n} \theta _i F_i(x) - \psi (\theta ) \right\} ,\nonumber \\&\text {where} \quad \psi (\theta ) = \log \int \exp \left\{ C(x) + \sum _{i=1}^{n} \theta _i F_i(x) \right\} dx \end{aligned}$$
(10)

is called an exponential family and plays a critical role in information geometry. When \(\left\{ F_1 \dots F_n, {\mathbf {1}} \right\} \) is linearly independent, \(\theta \) and \(f(x;\theta )\) are in one-to-one correspondence, and \(\left( \theta _i \right) _{i=1}^{n}\) is an affine coordinate system of the statistical manifold. There is another useful coordinate system:

$$\begin{aligned} \eta _{i} = \text {E}_{f(x;\theta )} \left[ F_i(x) \right] = \int f(x;\theta ) F_i(x) dx, \quad i = 1 \dots n. \end{aligned}$$
(11)

\((\theta _i)_{i=1}^{n}\) and \((\eta _i)_{i=1}^{n}\) are called the natural parameters and expectation parameters, respectively, of the exponential family.

For an exponential family equipped with the coordinate systems \(\{ (\theta _{i})_{i=1}^{n}, (\eta _{i})_{i=1}^{n}\}\), there exist potential functions \(\psi (\theta )\) and \(\varphi (\eta )\) satisfying the following relations [2]:

$$\begin{aligned} ^{\forall }i&= 1 \dots n,&\eta _i&= \frac{\partial \psi }{\partial \theta _i},&\theta _i&= \frac{\partial \varphi }{\partial \eta _i},&\psi (\theta ) + \varphi (\eta ) - \sum _{i=1}^{n}\theta _{i} \eta _{i}&= 0. \end{aligned}$$
(12)
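
As a concrete example (added here for illustration), the univariate Gaussian \({\mathscr {N}}(\mu , \sigma ^2)\) is of the form (10) with \(C(x)=0\), \(F_1(x)=x\), and \(F_2(x)=x^2\), and its dual coordinates and potential are

$$\begin{aligned} \theta _1&= \frac{\mu }{\sigma ^2},\quad \theta _2 = -\frac{1}{2\sigma ^2},\quad \psi (\theta ) = -\frac{\theta _1^2}{4\theta _2} + \frac{1}{2}\log \left( -\frac{\pi }{\theta _2} \right) ,\\ \eta _1&= \text {E}\left[ x \right] = \mu ,\quad \eta _2 = \text {E}\left[ x^2 \right] = \mu ^2 + \sigma ^2, \end{aligned}$$

and a direct calculation confirms that \(\eta _i = \partial \psi / \partial \theta _i\) for \(i = 1, 2\), in agreement with (12).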

In information geometry, statistical inference is often regarded as the projection of an empirical distribution onto a model manifold, which is a submanifold of S. Projection is the procedure of finding the point in a submanifold that minimizes the discrepancy from a given point; hence, it is important to measure the discrepancy between two probability distributions. We can define a divergence from a point (probability distribution) \(p \in S\) to \(q \in S\) by using the coordinate systems and potential functions as follows:

$$\begin{aligned} D(p||q) = \psi \left( \theta (p) \right) + \varphi \left( \eta (q) \right) - \sum _{i=1}^{n}\theta _{i}(p) \eta _{i}(q), \quad ^{\forall }p,q \in S. \end{aligned}$$
(13)

For an exponential family, there are two natural divergences, the e-divergence and the m-divergence. The m-divergence is defined as

$$\begin{aligned} D^{(m)}(p||q)&= \varphi \left( \eta (p) \right) + \psi \left( \theta (q) \right) - \sum _{i=1}^{n} \eta _{i}(p) \theta _{i}(q), \quad \text {from Eq.}~(12) \\&= \sum _{i=1}^{n} \theta _{i}(p) \eta _{i}(p) - \psi \left( \theta (p) \right) \\&\quad - \left\{ \sum _{i=1}^{n} \theta _{i}(q) \eta _{i}(p) - \psi \left( \theta (q) \right) \right\} , \quad \text {from Eq.}~(11) \\&= \text {E}_{\theta (p)} \left[ \sum _{i=1}^{n} \theta _{i}(p) F_i(x) - \psi \left( \theta (p) \right) \right] - \text {E}_{\theta (p)} \left[ \sum _{i=1}^{n} \theta _{i}(q) F_i(x) - \psi \left( \theta (q) \right) \right] \\&= \text {E}_{\theta (p)} \left[ \log p(x) \right] - \text {E}_{\theta (p)} \left[ \log q(x) \right] = \int p(x) \log \frac{p(x)}{q(x)} dx, \end{aligned}$$

which is equal to the KL-divergence. The e-divergence is defined as follows:

$$\begin{aligned} D^{(e)}(p||q)&= \psi \left( \theta (p) \right) + \varphi \left( \eta (q) \right) - \sum _{i=1}^{n}\theta _{i}(p) \eta _{i}(q) \\&= \varphi \left( \eta (q) \right) + \psi \left( \theta (p) \right) - \sum _{i=1}^{n} \eta _{i}(q) \theta _{i}(p) \\&= D^{(m)}(q||p) = \int q(x) \log \frac{q(x)}{p(x)} dx. \end{aligned}$$
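
The identity \(D^{(m)}(p||q) = \int p(x) \log \frac{p(x)}{q(x)} dx\) can also be checked numerically. The snippet below is a small verification of ours for two univariate Gaussians, using the natural and expectation coordinates given above; \(\varphi (\eta )\) is obtained from the Legendre relation (12), and the parameter values are arbitrary.

```python
import numpy as np

def theta(mu, sigma2):
    # natural parameters of N(mu, sigma2)
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def eta(mu, sigma2):
    # expectation parameters: E[x], E[x^2]
    return np.array([mu, mu ** 2 + sigma2])

def psi(th):
    # log-partition (potential) function psi(theta) of the Gaussian
    return -th[0] ** 2 / (4.0 * th[1]) + 0.5 * np.log(-np.pi / th[1])

def varphi(mu, sigma2):
    # dual potential from (12): varphi(eta) = sum_i theta_i eta_i - psi(theta)
    return theta(mu, sigma2) @ eta(mu, sigma2) - psi(theta(mu, sigma2))

def D_m(p, q):
    # m-divergence in dual coordinates: varphi(eta(p)) + psi(theta(q)) - <eta(p), theta(q)>
    return varphi(*p) + psi(theta(*q)) - eta(*p) @ theta(*q)

def kl_gauss(p, q):
    # closed-form KL-divergence between univariate Gaussians, given as (mean, variance)
    (mp, vp), (mq, vq) = p, q
    return 0.5 * (np.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)

p, q = (0.5, 1.5), (-1.0, 2.0)
print(D_m(p, q), kl_gauss(p, q))  # the two values agree
```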

The EM algorithm [5] is a method for maximum likelihood estimation of the parameters of a latent variable model. In information geometry, the exponential–mixture (em) algorithm [1] corresponds to the EM algorithm. For a latent variable model, the empirical distribution based on observations is not unique. A manifold that consists of empirical joint probability distributions of the observable and latent variables is called a data manifold \({\mathscr {D}}\). The em algorithm finds the point \(p^{*}\) in the model manifold \({\mathscr {M}}\) and the point \(q^{*}\in {\mathscr {D}}\) that minimize the KL-divergence from \(q^{*}\) to \(p^{*}\) by iterating the following two steps, starting from an initial guess \(p^{(1)}\). Figure 1 is a conceptual diagram of the em algorithm.

e-step

e-projection of \(p^{(k)}\in {\mathscr {M}}\) onto the data manifold \({\mathscr {D}}\).

$$\begin{aligned} q^{(k)}&= \mathop {\text {argmin}}\limits _{q\in {\mathscr {D}}} D^{(e)}(p^{(k)}||q). \end{aligned}$$

m-step

m-projection of \(q^{(k)}\in {\mathscr {D}}\) onto the model manifold \({\mathscr {M}}\).

$$\begin{aligned} p^{(k+1)}&= \mathop {\text {argmin}}\limits _{p\in {\mathscr {M}}} D^{(m)}(q^{(k)}||p). \end{aligned}$$

Fig. 1 Update process of the em algorithm

4 MEM algorithm and its information geometry

In this section, we introduce the modal EM (MEM) algorithm [17] for estimating the mode of a given probability density function and provide an information geometric perspective on it. The MEM algorithm is an iterative procedure similar to the EM algorithm, but there are no explicit observations, so the construction of an empirical density function is nontrivial. In this paper, we propose constructing an empirical density function under the assumption that a single pseudo-observation is obtained. This assumption plays a critical role in constructing the empirical density functions for the MEM algorithm and for the MLR model, in which the same difficulty arises because of the absence of explicit observations.

4.1 MEM algorithm

Consider a Gaussian mixture model whose parameters are known. Even in this case, it is generally difficult to express the mode of the mixture in closed form, and a numerical optimization is needed to obtain it. The MEM algorithm [17] is an iterative method for finding a local mode of a mixture distribution of the form

$$\begin{aligned} f(x)&= \sum _{i=1}^{K} \pi _i f_i(x) ,\quad x \in {\mathbb {R}}^{p} ,\quad \text{ where } \quad \left\{ \begin{aligned}&\pi _i \ge 0, \quad \sum _{i=1}^{K} \pi _i = 1 ,\\ {}&f_i:{\mathbb {R}}^{p} \rightarrow {\mathbb {R}} \text{ is } \text{ a } \text{ probability } \text{ density } \text{ function, } \end{aligned} \right. \end{aligned}$$

where all of the parameters in this model are known. The purpose of the MEM algorithm is to find the mode of f(x): \(\displaystyle x^{*} = \mathop {\text {argmax}}\limits _{x} f(x)\). Li et al. [17] proposed to iterate the following two steps starting from an initial estimate \(x^{(1)}\) (a small numerical sketch follows the two steps):

  1. E-step
    $$\begin{aligned} p_i^{(k)} = \frac{ \pi _i f_i(x^{(k)}) }{ f(x^{(k)}) }, \quad i = 1 \dots K. \end{aligned}$$
    (14)
  2. M-step
    $$\begin{aligned} x^{(k+1)} = \mathop {\text {argmax}}\limits _{x} \sum _{i=1}^{K} p_i^{(k)} \log f_i(x). \end{aligned}$$
    (15)
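
Below is a small numerical sketch of the two steps for a one-dimensional Gaussian mixture with known parameters. For Gaussian components, the M-step (15) has a closed form, namely a precision-weighted average of the component means, which the sketch uses; the mixture parameters and starting point are arbitrary, and the iteration converges to a local mode that depends on the initialization.

```python
import numpy as np
from scipy.stats import norm

# Gaussian mixture with known parameters
pi_ = np.array([0.3, 0.7])
mu = np.array([-2.0, 1.0])
sd = np.array([1.0, 0.5])

def f(x):
    # mixture density f(x) = sum_i pi_i f_i(x)
    return np.sum(pi_ * norm.pdf(x, mu, sd))

def mem(x0, n_iter=200, tol=1e-10):
    """Modal EM: alternate (14) and (15); the M-step is a weighted mean for Gaussians."""
    x = x0
    for _ in range(n_iter):
        p = pi_ * norm.pdf(x, mu, sd)                            # E-step, Eq. (14)
        p /= p.sum()
        x_new = np.sum(p * mu / sd ** 2) / np.sum(p / sd ** 2)   # M-step, Eq. (15)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

local_mode = mem(x0=0.5)
print(local_mode, f(local_mode))   # a local mode of f; the starting point matters
```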

4.2 Pseudo-observation

The purpose of the MEM algorithm is to find the point that maximizes a probability density function f(x). In information geometry, projection is the procedure of finding, within a set of density functions, the one that minimizes the discrepancy from the empirical density function. In this section, we show how the MEM algorithm can be interpreted as the problem of finding an optimal function within a set of probability density functions, which leads us to the information geometric perspective on the MEM algorithm.

Let a probability density function \(s(x\;;m)\) be

$$\begin{aligned} s(x\;;m) = f(x+m), \end{aligned}$$

where \(m \in {\mathbb {R}}^p\) is a parameter of \(s(x\;;m)\). From this definition, the mode of \(s(x\;;m^{*})\) is 0 when the mode of f(x) is \(m^{*}\). Conversely, if \(m^{*}\) is the optimal solution of \(\max _{m} s(0\;;m)\), then the mode of f(x) is \(m^{*}\). The two problems \(\max _{x} f(x)\) and \(\max _{m} s(0\;;m)\) have the same solution, but the ideas behind them are different. The former finds an optimal point in the domain of f(x), whereas the latter considers a set of probability density functions shifted by m and finds the function in this set that maximizes the value at \(x=0\). The search spaces of the two problems are therefore different, and the latter interpretation is convenient when formulating the MEM algorithm from an information geometric perspective because its search space is a set of functions, as in the procedure of projection.

To derive the information geometric formulation of the MEM algorithm, we introduce the latent variable \(Z\in \left\{ 1 \dots K \right\} \) into the mixture model \(f(x)=\sum _{i=1}^{K} \pi _i f_i(x)\). The latent variable specifies the mixture component from which an observation x is obtained. The joint probability density function g(x, z) is expressed as follows [1]:

$$\begin{aligned} g(x,z)&= \prod _{i=1}^{K} \left[ \pi _i f_i(x) \right] ^{ \delta _i(z)} ,\quad \text {where} \quad \left\{ \begin{aligned}&\pi _i \ge 0, \quad \sum _{i=1}^{K} \pi _i = 1, \\&f_i\text { is a probability density function}, \\&\delta _i(z) = \left\{ \begin{aligned}&1 \quad i=z, \\&0 \quad i\ne z. \end{aligned} \right. \end{aligned} \right. \end{aligned}$$
(16)

In the above discussion, we considered finding \(\mathop {\text {argmax}}\nolimits _{m} s(0\;;m)\) from a set of probability density functions parameterized by m. In the same way, we define a joint probability density function l(x, z; m) and the corresponding model manifold \({\mathscr {M}}\) as follows:

$$\begin{aligned} l(x,z;m)&= g(x+m,z), \\ {\mathscr {M}}&= \left\{ l(x,z;m) \mid m\in {\mathbb {R}}^p \right\} . \end{aligned}$$

We then consider an empirical density function and a data manifold. In general, an empirical density function is constructed based on observations. For example, when observations \(\left\{ x_i \in {\mathbb {R}}^{p} \right\} _{i=1}^{N}\) are i.i.d., the empirical density function is defined as \(\frac{1}{N} \sum _{i=1}^{N} \delta (x-x_i)\), where \(\delta (\cdot )\) denotes the Dirac delta function. In the formulation of the MEM algorithm, the construction of the empirical density function is nontrivial because there are no explicit observations. In the interpretation of \(\mathop {\text {argmax}}\nolimits _{m} s(0\;;m)\), we fixed \(x=0\) and treated \(s(0\; ;m)\) as a function of m. This is equivalent to assuming that “one observation \(x=0\) is obtained,” and we interpret the procedure of the MEM algorithm as maximum likelihood estimation, namely, finding the parameter value \(m\in {\mathbb {R}}^p\) that maximizes the likelihood \(s(0\;;m)\). This assumption leads to the following definition of the empirical density function p(x):

$$\begin{aligned} p(x)&= \delta (x-0) = \delta (x). \end{aligned}$$

By introducing the latent variable \(Z\in \left\{ 1\dots K \right\} \), we extend p(x) to an empirical joint density function of X and Z:

$$\begin{aligned} h(x,z)&= p(x)q(z\mid x), \end{aligned}$$

and by introducing the parameters \(\left\{ q_i \right\} _{i=1}^{K}\), we model the conditional density function \(q(z\mid x)\) as

$$\begin{aligned} q(z\mid x) = \sum _{i=1}^{K} q_i \delta _i(z), \quad \text {where} \quad q_i \ge 0, \; \sum _{i=1}^{K} q_i = 1. \end{aligned}$$

Then, the empirical joint density function \(h(x,z\ ;q_1\dots q_K)\) is expressed as

$$\begin{aligned} h(x,z\;;q_1\dots q_K)&= \sum _{i=1}^{K} q_i \delta (x) \delta _i(z) ,\quad \text {where} \quad q_i \ge 0, \; \sum _{i=1}^{K} q_i = 1. \end{aligned}$$
(17)

The data manifold \({\mathscr {D}}\) is defined as follows:

$$\begin{aligned} {\mathscr {D}}&= \left\{ h(x,z\;;q_1\dots q_K) \mid q_i \ge 0, \quad \sum _{i=1}^{K}q_i = 1 \right\} . \end{aligned}$$
(18)

It is shown in Appendix A that \({\mathscr {D}}\) is in a mixture family.

We regard the MEM algorithm as a maximum likelihood estimation problem. Thus, we can consider the e- and m-projections between the model manifold \({\mathscr {M}}\) and the data manifold \({\mathscr {D}}\) as follows:

$$\begin{aligned} h\left( x,z;q_{1}^{(k)}\dots q_{K}^{(k)}\right)&= \mathop {\text {argmin}}\limits _{h\in {\mathscr {D}}} D^{(e)}\left( l(\cdot ,\cdot ;m^{(k)}) || h \right) , \end{aligned}$$
(19)
$$\begin{aligned} l\left( x,z;m^{(k+1)}\right)&= \mathop {\text {argmin}}\limits _{l\in {\mathscr {M}}} D^{(m)}\left( h(\cdot ,\cdot ;q_{1}^{(k)}\dots q_{K}^{(k)}) || l \right) . \end{aligned}$$
(20)

The detailed calculation of the e-projection in (19) is provided in Appendix B, and the optimal parameters for \(h\in {\mathscr {D}}\) are given as

$$\begin{aligned} q_i^{(k)} = \frac{ \pi _i f_i(m^{(k)}) }{f(m^{(k)}) }, \quad i = 1 \dots K. \end{aligned}$$
(21)

The m-projection in (20) is equivalent to

$$\begin{aligned} \max _{m \in {\mathbb {R}}^p} \sum _{i=1}^{K} q_i^{(k)} \log f_i(m). \end{aligned}$$
(22)

The detailed derivation is shown in Appendix C.

4.3 Summary: information geometry of MEM

To provide an information geometric perspective on the MEM algorithm, we interpret the algorithm as a problem of finding an optimal probability density function that maximizes the value at \(x=0\). This enables us to cast the MEM algorithm as maximum likelihood estimation. The e-projection of the model distribution \(l(x,z;m^{(k)})\) onto \({\mathscr {D}}\) gives the optimal \(q_i^{(k)},\ i=1\dots K\), which is equivalent to (14) in the original MEM algorithm. The m-projection of \(h(x,z;q_{1}^{(k)}\dots q_{K}^{(k)})\) onto \({\mathscr {M}}\) gives the optimal \(m^{(k+1)}\), which is consistent with (15) in the original MEM algorithm.

5 Information geometry of MLR

In this section, we analyze MLR from the viewpoint of information geometry. We elucidate the source of the difficulty in constructing a model manifold and a data manifold for the MLR model and propose a framework for formulating the MLR model geometrically.

5.1 Problem of constructing manifolds

In order to elucidate the source of the difficulty in constructing manifolds for the MLR model, we consider the parameter estimation of a Gaussian mixture model as a specific example of statistical inference in information geometry. Suppose that observations \(\left\{ x_i\in {\mathbb {R}}^{p} \right\} _{i=1}^{N}\) are i.i.d. according to a Gaussian mixture distribution expressed as

$$\begin{aligned} f(x; \mu ,\varSigma )&= \sum _{i=1}^{K} \pi _i g(x;\mu _i, \varSigma _i) ,\\&\text {where} \quad \left\{ \begin{aligned}&\pi _i \ge 0, \; \sum _{i=1}^{K} \pi _i = 1 ,\\&g(x;\mu _i, \varSigma _i) = \frac{1}{\sqrt{2\pi }^p\sqrt{\mathrm {det}(\varSigma _i) }} \exp \left\{ -\frac{1}{2} (x-\mu _i)^{\top } \varSigma _i^{-1} (x-\mu _i) \right\} . \end{aligned} \right. \end{aligned}$$

Then, the model manifold consists of Gaussian mixture density functions whose parameters are the means and covariance matrices. The data manifold is constructed based on the empirical density function \(\frac{1}{N}\sum _{i=1}^{N}\delta (x-x_i)\).

In the parameter estimation of the Gaussian mixture model, the model manifold is constructed based on the parametric distribution. In contrast, MLR makes no parametric distributional assumption, which makes it nontrivial to construct a model manifold and a data manifold.

To construct the model manifold for the MLR model, we consider (i) the assumption that \(\text {Mode}\left[ \epsilon ;x \right] = 0\) and (ii) the form of the objective function of \(\beta \) for the MLR model, \(\frac{1}{N} \sum _{i=1}^{N} \phi _h \left( y_i - x_i^{\top }\beta \right) \). In view of this assumption and this form, the optimization problem in (3) can be regarded as maximizing, at \(\epsilon = 0\), the kernel density estimate of the probability density function of \(\epsilon \). Based on the given observations, we propose constructing the following model for MLR:

$$\begin{aligned} f(\epsilon ;\beta )&= \frac{1}{N} \sum _{i=1}^{N} \phi _h \left( \epsilon - \epsilon _i(\beta ) \right) , \end{aligned}$$
(23)

where \(\epsilon _i(\beta ) = y_i - x_i^{\top }\beta ,\ i=1\dots N\), and \(\epsilon \) denotes the error variable. We introduce the latent variable \(Z\in \left\{ 1\dots N \right\} \), which specifies the mixture component from which an observation is obtained. The joint density function of \(\epsilon \) and Z is

$$\begin{aligned} g(\epsilon ,z;\beta )&= \prod _{i=1}^{N} \left[ \frac{1}{N} \phi _h \left( \epsilon - \epsilon _i(\beta ) \right) \right] ^{\delta _i(z)} . \end{aligned}$$
(24)

The model manifold \({\mathscr {M}}\) is defined as

$$\begin{aligned} {\mathscr {M}}&= \left\{ g(\epsilon ,z;\beta ) \mid \beta \in {\mathbb {R}}^{p} \right\} . \end{aligned}$$
(25)

It is shown in Appendix D that \({\mathscr {M}}\) is in a curved exponential family.
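
To make the construction concrete: the model density (23) is the kernel density estimate of the residuals \(\epsilon _i(\beta )\), and, for a symmetric kernel, the MLR objective in (3) is exactly this density evaluated at \(\epsilon = 0\). The short sketch below (ours, with a Gaussian kernel and illustrative data) verifies this numerically.

```python
import numpy as np

def phi_h(z, h):
    # Gaussian kernel with bandwidth h
    return np.exp(-0.5 * (z / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def model_density(eps, beta, X, y, h):
    # f(eps; beta) in Eq. (23): KDE of the residuals eps_i(beta) = y_i - x_i^T beta
    return phi_h(eps - (y - X @ beta), h).mean()

rng = np.random.default_rng(2)
N = 200
X = np.column_stack([np.ones(N), rng.uniform(-5, 5, N)])
y = X @ np.array([1.0, 2.0]) + rng.exponential(scale=2.0, size=N)

beta, h = np.array([0.5, 1.8]), 1.0
objective = phi_h(y - X @ beta, h).mean()             # objective in Eq. (3)
print(model_density(0.0, beta, X, y, h), objective)   # identical (the kernel is symmetric)
```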

We next propose constructing a data manifold for the MLR model. The empirical density function is usually constructed based on observations. Taking into account (i) the construction proposed in Sect. 4.2 and (ii) the assumption that \(\text {Mode}\left[ \epsilon ;x \right] = 0\), we propose constructing the empirical density function as follows:

$$\begin{aligned} p(\epsilon )&= \delta (\epsilon - 0) = \delta (\epsilon ) . \end{aligned}$$
(26)

By introducing the latent variable \(Z\in \left\{ 1\dots N \right\} \) to (26), we extend \(p(\epsilon )\) to the empirical joint density function of \(\epsilon \) and Z:

$$\begin{aligned} h(\epsilon , z)&= p(\epsilon ) q(z \mid \epsilon ) . \end{aligned}$$

By introducing the parameters \(\left\{ q_i \right\} _{i=1}^{N}\), we model the conditional density function \(q(z\mid \epsilon )\) as

$$\begin{aligned} q(z \mid \epsilon ) = \sum _{i=1}^{N} q_i \delta _i(z) ,\quad \text {where} \quad q_i \ge 0, \; \sum _{i=1}^{N} q_i = 1 \end{aligned}$$

Then, the empirical joint density function \(h(\epsilon ,z\ ;q_1\dots q_N)\) is expressed as

$$\begin{aligned} h(\epsilon , z\ ; q_1 \dots q_N)&= \sum _{i=1}^{N} q_i \delta (\epsilon ) \delta _i(z) ,\quad \text {where} \quad q_i \ge 0, \;\sum _{i=1}^{N} q_i = 1 \end{aligned}$$
(27)

The data manifold \({\mathscr {D}}\) is defined as follows:

$$\begin{aligned} {\mathscr {D}} = \left\{ h(\epsilon ,z\ ;q_1\dots q_N) \mid q_i \ge 0, \quad \sum _{i=1}^{N} q_i = 1 \right\} . \end{aligned}$$
(28)

It is shown in Appendix E that \({\mathscr {D}}\) is in a mixture family.

Here, we consider the e-projection of the model with parameters \(\beta ^{(k)}\) onto the data manifold:

$$\begin{aligned}&\min _{h \in {\mathscr {D}}} D^{(e)}(g(\cdot ,\cdot ;\beta ^{(k)})||h) .\nonumber \\&\quad \rightarrow \quad \left| \begin{aligned} \min _{q_1 \dots q_N}&\quad D^{(e)} \left( g(\cdot ,\cdot ;\beta ^{(k)})||h(\cdot ,\cdot \ ;q_1\dots q_N) \right) ,\\ \text {s.t.}&\quad q_i \ge 0, \; \;\sum _{i=1}^{N} q_i = 1 \end{aligned} \right. \end{aligned}$$
(29)

The detailed calculation is shown in Appendix F. An optimal solution for (29) is

$$\begin{aligned} q_i^{(k)} = \frac{\phi _h \left( y_i - x_i^{\top }\beta ^{(k)} \right) }{\sum _{j=1}^{N} \phi _h \left( y_j - x_j^{\top }\beta ^{(k)} \right) }, \quad i = 1 \dots N, \end{aligned}$$
(30)

which is equivalent to the E-step in (5). Then, we consider the m-projection of the empirical joint density function with the parameters \(q_i=q_i^{(k)},\ i=1\dots N\) onto the model manifold:

$$\begin{aligned} \min _{g \in {\mathscr {M}}} D^{(m)}(h(\cdot ,\cdot \ ;q_1=q_1^{(k)}\dots q_N=q_N^{(k)})||g) . \end{aligned}$$
(31)

The detailed calculation is shown in Appendix G. The optimization problem expressed as (31) is equivalent to

$$\begin{aligned} \max _{\beta } \sum _{i=1}^{N} q_i^{(k)} \log \phi _h \left( y_i - x_i^{\top }\beta \right) , \end{aligned}$$
(32)

which is equivalent to the M-step (9).

Fig. 2 Conceptual diagram of the em algorithm corresponding to the MLR model

Figure 2 shows the update process of the em algorithm corresponding to the MLR model parameter estimation.

In this section, we proposed constructing a model manifold and a data manifold for the MLR model. Although a model manifold is often constructed based on a parametric distribution assumption, we constructed it based on observations. Concerning the empirical distribution, our construction is based on (i) the assumption that \(\text {Mode}\left[ \epsilon ;x \right] = 0\) and (ii) the construction proposed in Sect. 4.2. We applied the framework of the em algorithm to the proposed manifolds and showed that the e-projection of the model onto the data manifold yields (5), while the m-projection of the empirical distribution onto the model manifold yields (9).

6 On the influence function for MLR

An influence function quantifies the effect of a single observation on an estimate. Let T be a functional defined on a set of probability measures and F be a probability measure. The influence function of T at F, denoted by \(\text {IF}(x;T,F)\), is defined as follows:

$$\begin{aligned} \text {IF}(x;T,F) = \lim _{\epsilon \rightarrow 0} \frac{ T \left( (1-\epsilon )F+ \epsilon \varDelta _{x} \right) - T \left( F \right) }{ \epsilon }. \end{aligned}$$
(33)

Yao and Li [23] argued that MLR is robust against outliers based on an investigation of its breakdown point. Elucidating the reason for this robustness is important because it can provide a clue to developing novel robust estimators. There are various approaches in the literature for making an estimator robust. One representative approach is to change the manifold onto which the m-projection is performed. For example, the median is often used as a robust estimate of the mean; this approach corresponds to adopting a model manifold composed of Laplace distributions instead of Gaussian distributions. On the other hand, a robust estimator can also be obtained by changing the projection method, for example, by using robust divergences instead of the KL-divergence [8].

So far, we have established the equivalence of the em and EM algorithms for MLR. The em algorithm is based on the KL-divergence, which is sensitive to outliers. Thus, we conjecture that the m-projection onto the model manifold, which is composed of the error distributions given by KDE, is the source of the robustness of MLR. Here, we focus on the influence function [11] of the estimator for the MLR model.

Unfortunately, it is difficult to derive the influence functions corresponding to the updates of the coefficient estimate in the e- and m-steps of the MLR problem, because it is not obvious how to deal with the effect of an outlier on the projection onto the data-dependent manifold.

In the finite-sample robustness analysis, the rescaled version of the influence function [11] is used to evaluate the effect of an outlier \((u,v)\in {\mathbb {R}}^p \times {\mathbb {R}}\) on each projection. Suppose that the current estimate is \(\beta ^{(k)}\). Without the outlier, the joint density function in Eq. (24) is projected onto the data manifold \({\mathscr {D}}\). When the dataset is contaminated with an outlier (u, v), the joint density is modified as

$$\begin{aligned} g_{N+1}(\epsilon ,z;\beta ^{(k)}) = \prod _{i=1}^{N+1} \left[ \frac{1}{N+1} \phi _h(\epsilon - \epsilon _i(\beta ^{(k)})) \right] ^{\delta _{i}(z)} \end{aligned}$$
(34)

where \(\epsilon _{N+1}(\beta ^{(k)}) = v-u^{\top }\beta ^{(k)}\), and this contaminated joint density is projected onto the outlier-contaminated data manifold \({\mathscr {D}}_{N+1}\), which is parameterized by \(q_1\dots q_{N+1}\). Let us express the result of the e-projection without the outlier as \(q^{(k)} = \left( q_{1}^{(k)} \dots q_{N}^{(k)} \right) \) and the result of the e-projection with the outlier as \(q^{'(k)} = \left( q_{1}^{'(k)} \dots q_{N+1}^{'(k)} \right) \). Then the rescaled version of the influence function on the e-projection is expressed as follows:

$$\begin{aligned}&\frac{q_{i}^{'(k)}-q_{i}^{(k)}}{\frac{1}{N+1}} \\&\quad = -\frac{ (N+1) \phi _h(v-u^{\top }\beta ^{(k)}) \phi _h(\epsilon _i(\beta ^{(k)})) }{\left( \sum _{j=1}^{N}\phi _h(\epsilon _j(\beta ^{(k)})) \right) \left\{ \phi _h(v-u^{\top }\beta ^{(k)}) + \sum _{j=1}^{N}\phi _h(\epsilon _j(\beta ^{(k)})) \right\} } ,\quad i=1\dots N . \end{aligned}$$
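
This expression follows directly from the form of the e-projection weights, and the following small check (ours, with synthetic data and an arbitrarily placed outlier) confirms it numerically.

```python
import numpy as np

def phi_h(z, h):
    return np.exp(-0.5 * (z / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(3)
N, h = 50, 1.0
X = np.column_stack([np.ones(N), rng.uniform(-5, 5, N)])
y = X @ np.array([1.0, 2.0]) + rng.exponential(scale=2.0, size=N)
beta_k = np.array([0.8, 1.9])
u, v = np.array([1.0, 4.0]), 30.0            # outlier (u, v)

w = phi_h(y - X @ beta_k, h)                 # phi_h(eps_i(beta^(k))), i = 1 ... N
w_out = phi_h(v - u @ beta_k, h)             # phi_h(v - u^T beta^(k))

q = w / w.sum()                              # e-projection weights without the outlier
q_prime = w / (w.sum() + w_out)              # first N weights with the outlier

lhs = (q_prime - q) / (1.0 / (N + 1))        # rescaled influence on the e-projection
rhs = -(N + 1) * w_out * w / (w.sum() * (w_out + w.sum()))
print(np.allclose(lhs, rhs))                 # True
```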

In the same manner, we can calculate the rescaled version of the influence function on the m-projection by comparing the results of the m-projections with and without the outlier. The regression coefficients obtained by the m-projection without and with the outlier are denoted by \(\beta ^{(k+1)}\) and \(\beta ^{'(k+1)}\), respectively. When the sample is contaminated by the outlier (u, v), the m-projection projects \(h(\epsilon ,z; q_1^{'(k)}\dots q_{N+1}^{'(k)})\) onto \({\mathscr {M}}_{N+1}\), which is the set of densities \(g_{N+1}(\epsilon ,z;\beta )\) of the form in Eq. (34) parameterized by \(\beta \). The influence function on the m-projection is then written as

$$\begin{aligned}&\frac{\beta ^{'(k+1)} - \beta ^{(k+1)}}{\frac{1}{N+1}} \\&\quad = (N+1) \left[ \begin{aligned}&\left( q_{N+1}^{'(k+1)}uu^{\top } + \sum _{i=1}^{N} q_{i}^{'(k+1)}x_{i}x_{i}^{\top } \right) ^{-1}\left( q_{N+1}^{'(k+1)}vu + \sum _{i=1}^{N}q_{i}^{'(k+1)}y_i x_i \right) \\&- \left( \sum _{i=1}^{N}q_{i}^{(k+1)} x_i x_i^{\top } \right) ^{-1} \left( \sum _{i=1}^{N} q_{i}^{(k+1)}y_i x_i \right) \end{aligned} \right] \end{aligned}$$

Although we were able to derive the influence functions for the individual e- and m-projections, there are problems with the above treatment of the effect of an outlier. In the m-projection, we did not take into account the indirect effect of the outlier that is inherited from the e-projection in the outlier-contaminated case. Moreover, the standard definition of the influence function is associated with the probability measure F of (X, Y). The nonparametric nature of the proposed information geometric formulation of modal linear regression makes the conventional robustness analysis difficult.

Toward an understanding of the source of the robustness of MLR, we consider the influence function of \({\hat{\beta }}\) defined in Eq. (3), which is a standard \(\psi \)-type M-estimator, and the influence function of the estimate obtained by the em algorithm, viewed as an iteratively reweighted least-squares (IRLS) estimator.

The influence function can be derived independently of the algorithm used to compute the estimate. For example, the regression coefficient estimate \({\hat{\beta }}\) of the MLR model defined by Eq. (3) is a \(\psi \)-type M-estimator, and its influence function is expressed as

$$\begin{aligned}&\text {IF}\left( u,v; F \right) = \left( \int \frac{d^2}{dz^2}\phi _h(z)_{z=y-x^{\top }{\hat{\beta }}(F)} xx^{\top } dF(x,y) \right) ^{-1} \psi \left( u,v,{\hat{\beta }}(F) \right) , \end{aligned}$$
(35)
$$\begin{aligned}&\text {where} \quad \psi \left( u,v,{\hat{\beta }}(F) \right) = \frac{d}{dz}\phi _h(z)_{z=v-u^{\top }{\hat{\beta }}(F)} u. \end{aligned}$$
(36)

We note that Eq. (35) does not depend on whether the EM algorithm or the steepest descent method is used. Details of the derivation are given in Proposition 1 of Appendix H. To see the effect of an outlier, let us consider a very simple case in which \({\hat{\beta }}(F) = \beta ^{*}\) and the predictor variable X and the error variable \(\epsilon \) are independent. We also assume that the Gaussian kernel with bandwidth h is adopted as the kernel function. Then, Eq. (35) simplifies to

$$\begin{aligned} \text {IF}(u, v; F) = -\frac{ h^2 (v-u^{\top }\beta ^{*}) \phi _h(v-u^{\top }\beta ^{*}) }{\int (\epsilon ^2-h^2) \phi _h(\epsilon ) dF_{\epsilon }(\epsilon )} \left[ \int xx^{\top } dF_X(x) \right] ^{-1} u , \end{aligned}$$
(37)

where \(F_X\) and \(F_{\epsilon }\) are the probability measures of X and \(\epsilon \), respectively. Equation (37) shows that, for any given \(u_0\in {\mathbb {R}}^p\), \(\lim _{v\rightarrow \infty } \left| \text {IF}(u_0, v; F) \right| = 0\) holds because \(x\phi _{h}(x) \underset{x\rightarrow \infty }{\rightarrow } 0\). The behavior of Eq. (37) is illustrated in Fig. 3, in which the contour of Eq. (37) is computed under the model \(Y=X\beta ^{*} + \epsilon \) with \(\beta ^{*} =1\), \(X,Y \in {\mathbb {R}}\), \(\epsilon \sim {\mathscr {N}}(0,1^2)\), \(X\sim \text {Uniform}(-10,10)\), and the kernel bandwidth set to \(h=3\). In the figure, the dashed line denotes the ground-truth regression line. The figure shows that outliers around the ground-truth regression line have only a minor impact on the estimate of the regression coefficient.

Fig. 3 Contour plot of the influence function in a simple setting
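
The sketch below (ours) evaluates Eq. (37) in the setting of Fig. 3; the denominator is computed by numerical quadrature, and plotting the values on a grid of (u, v) with a contour routine reproduces the qualitative behavior of the figure: the influence vanishes on the true regression line and far away from it, and peaks at a residual of about \(\pm h\).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

h, beta_star = 3.0, 1.0                    # bandwidth and true coefficient, as in Fig. 3

def phi_h(z):
    # Gaussian kernel density with bandwidth h
    return norm.pdf(z, scale=h)

# Denominator of Eq. (37): integral of (eps^2 - h^2) phi_h(eps) under eps ~ N(0, 1)
denom, _ = quad(lambda e: (e ** 2 - h ** 2) * phi_h(e) * norm.pdf(e), -np.inf, np.inf)
EX2 = 10.0 ** 2 / 3.0                      # E[X^2] for X ~ Uniform(-10, 10)

def influence(u, v):
    # Eq. (37) in the one-dimensional setting
    r = v - u * beta_star
    return -h ** 2 * r * phi_h(r) / denom / EX2 * u

# Zero on the true line, maximal near |residual| = h, vanishing for a gross outlier
print(influence(5.0, 5.0), influence(5.0, 5.0 + h), influence(5.0, 100.0))
# A contour plot over a grid of (u, v) reproduces Fig. 3 qualitatively, e.g.
# U, V = np.meshgrid(np.linspace(-10, 10, 201), np.linspace(-15, 15, 201))
# Z = np.vectorize(influence)(U, V)   # then matplotlib.pyplot.contour(U, V, Z)
```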

On the other hand, the estimate \({\hat{\beta }}\) can also be defined as the limit of the EM algorithm derived in [23]. In particular, when the Gaussian kernel is adopted for the density estimation, the resulting estimator can be regarded as an IRLS estimator. Dollinger and Staudte [6] addressed a related problem when they derived the influence function of an IRLS estimator for the linear regression model. They revealed the relationship between the influence functions of the \((k+1)\)-step and k-step estimates and derived the influence function of the estimator as its limit.

Following the approach of Dollinger and Staudte [6], we regard the em algorithm for MLR as an IRLS procedure in which one IRLS iteration corresponds to a pair of e- and m-projections. Let the initial estimate be \(\beta ^{(1)}\) and the k-th estimate be \(\beta ^{(k)}\). Then, it can be shown that the influence functions of the estimates \(\beta ^{(k)}\) and \(\beta ^{(k+1)}\) satisfy the following recurrence relation:

$$\begin{aligned}&\text {IF}(u,v; \beta ^{(k+1)}, F) = z_{k} + A_{k}\; \text {IF}(u,v;\beta ^{(k)},F) , \end{aligned}$$
(38)
$$\begin{aligned}&\text {where} \left\{ \begin{aligned}&z_{k} = \varSigma _{k}^{-1} c_{k} ,\\&\varSigma _{k} = \int \phi _h \left( y-x^{\top }\beta ^{(k)}(F)\right) xx^{\top } dF(x,y) ,\\&c_{k} = \phi _h \left( v-u^{\top }\beta ^{(k)}(F) \right) \left( v-u^{\top }\beta ^{(k+1)}(F) \right) u\\ \quad&\quad - \int \phi _h \left( y-x^{\top }\beta ^{(k)}(F) \right) \left( y-x^{\top }\beta ^{(k+1)}(F) \right) x dF(x,y) ,\\&C_{k} = -\int \frac{d}{dz} \phi _h(z)_{z=y-x^{\top }\beta ^{(k)}(F)} \left( y-x^{\top }\beta ^{(k+1)}(F)\right) xx^{\top } dF(x,y) ,\\&A_{k} = \varSigma _{k}^{-1} C_{k} . \end{aligned} \right. \end{aligned}$$
(39)

Details of the derivation are given in Proposition 2 of Appendix H. In the original least-squares regression problem treated in [6], the weighted least-squares (WLS) estimator is proven to be Fisher consistent by using the symmetry of the noise distribution, and \(\varSigma _k, c_k, C_k\) in the recurrence relation are shown to be independent of k. In our MLR problem, however, the noise distribution cannot be assumed to be symmetric, so we consider the converged value \(\beta ^{(\infty )}\) to remove the dependence on k. In this limit, \(\varSigma _k, c_k, C_k\) do not depend on k, and Eq. (38) becomes

$$\begin{aligned}&\text {IF}(u,v;\beta ^{(\infty )}, F) = z + A \; \text {IF}(u,v;\beta ^{(\infty )}, F). \end{aligned}$$
(40)

Now we assume that \(\left| A\right| <1\); moreover, the use of the Gaussian kernel implies \(\frac{d}{dz}\phi _h(z) = - \frac{1}{h^2}z\phi _h(z)\). Then, we obtain

$$\begin{aligned} \text {IF}(u,v; \beta ^{(\infty )},F) =&\left( \varSigma - C \right) ^{-1} c , \end{aligned}$$
(41)
$$\begin{aligned} =&\left( \int \frac{d^2}{dz^2}\phi _h(z)_{z=y-x^{\top }\beta ^{(\infty )}} xx^{\top } dF(x,y) \right) ^{-1} \nonumber \\&\times \left[ \psi (u,v,\beta ^{(\infty )}) + \frac{\partial }{\partial \beta } \left\{ \int \phi _h(y-x^{\top }\beta ) dF(x,y) \right\} _{\beta ^{(\infty )}} \right] . \end{aligned}$$
(42)

If \(\beta ^{(\infty )}\) is equal to \({\hat{\beta }}(F)\), then \(\frac{\partial }{\partial \beta } \left\{ \int \phi _h(y-x^{\top }\beta ) dF(x,y) \right\} _{{\hat{\beta }}} = 0\) holds by the definition of \({\hat{\beta }}(F)\), and Eq. (42) is consistent with Eq. (35).
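
As a complement, the fixed point (41) can be approximated numerically by replacing the integrals in (39) with sample averages. The sketch below (ours) does so in the simple setting of Fig. 3, taking \(\beta ^{(\infty )}\) equal to the true coefficient; the resulting value is comparable to Eq. (37) evaluated at the same (u, v). Everything here is an illustrative Monte Carlo approximation rather than part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(4)
h, beta_inf = 3.0, np.array([1.0])        # beta^(infinity) approximated by beta* = 1

def phi_h(z):
    return np.exp(-0.5 * (z / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

# Monte Carlo sample standing in for F (the setting of Fig. 3)
M = 200_000
X = rng.uniform(-10.0, 10.0, size=(M, 1))
y = X @ beta_inf + rng.normal(0.0, 1.0, size=M)
z = y - X @ beta_inf                      # residuals at beta^(infinity)

# Sample-average versions of Sigma, C and c in Eq. (39), with beta^(k) = beta^(k+1) = beta^(inf);
# for the Gaussian kernel, -phi_h'(z) * z = (z^2 / h^2) * phi_h(z)
Sigma = (phi_h(z)[:, None, None] * X[:, :, None] * X[:, None, :]).mean(axis=0)
C = ((z ** 2 / h ** 2 * phi_h(z))[:, None, None] * X[:, :, None] * X[:, None, :]).mean(axis=0)
u, v = np.array([5.0]), 8.0               # outlier (u, v)
r = v - u @ beta_inf
c = phi_h(r) * r * u - ((phi_h(z) * z)[:, None] * X).mean(axis=0)

IF = np.linalg.solve(Sigma - C, c)        # Eq. (41): IF = (Sigma - C)^{-1} c
print(IF)                                 # comparable to Eq. (37) evaluated at (u, v) = (5, 8)
```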

In this section, we conjectured that the robustness of MLR is due to the particular structure of the model manifold. To support this conjecture, we attempted to derive the influence functions with respect to the e- and m-steps of the estimation procedure for the model coefficients. It is currently difficult to derive these influence functions exactly, but we showed the difference between the influence function derived from the viewpoint of the M-estimator and that derived from the IRLS estimator. The IRLS estimator is composed of a sequence of e- and m-steps. Our future work is to identify the contribution of each of the e- and m-steps to the influence function.

7 Conclusions

In this paper, we provide an information geometric perspective on the MLR model, which is a semiparametric method. First, we discuss the MEM algorithm and investigate the relationship between the algorithm and information geometry. We cast the MEM algorithm as a maximum likelihood method by assuming a pseudo-observation; this gives us an empirical density function based on the assumption that a single pseudo-observation is obtained. Second, we address the relationship between the MLR model and information geometry. Because the MLR model does not assume a parametric distribution, we cannot construct a corresponding model manifold with conventional approaches; we therefore propose constructing the model manifold based on observations. The empirical density function introduced through the discussion of the MEM algorithm is applied to construct the data manifold. We clarify the relationship between the EM algorithm developed by Yao and Li [23] and the em algorithm corresponding to the MLR model.

Elucidating the factors or geometric operations that make the estimator for the MLR model robust remains future work. We will further investigate the influence functions corresponding to the e- and m-steps for estimating the coefficients of the MLR model.