Information geometry of modal linear regression

Modal linear regression (MLR) models the conditional mode of a response as a linear function of explanatory variables. It is an effective approach for dealing with response variables having a multimodal distribution or those contaminated by outliers. Because of the semiparametric nature of MLR, constructing a statistical model manifold is difficult with the conventional approach. To overcome this difficulty, we first consider the information geometric perspective of the modal expectation–maximization (EM) algorithm. Based on this perspective, model manifolds for MLR are constructed according to observations, and a data manifold is constructed based on the empirical distribution. In this paper, the em algorithm for MLR, which is a geometric formulation of the EM algorithm, is shown to be equivalent to the conventional EM algorithm for MLR. The robustness of the MLR model is also discussed in terms of the influence function and information geometry.


Introduction
Linear regression is used to model the conditional mean of a response variable y given predictor variables x. The well-known least-squares estimator for linear regression coefficients is highly sensitive to outliers. To alleviate this problem, many estimators have been developed, such as robust M-estimators [12,13]. However, the consistency of robust M-estimators requires the conditional error distribution to be homoscedastic and symmetric given a predictor. In reality, various types of data do not satisfy these assumptions (e.g., wages, prices, and expenditures). Baldauf and Silva [3] pointed out that the estimation cannot be consistent unless the data satisfy proper assumptions, which is also the case in some real-world applications [18].
Modal linear regression (MLR) is used to model the conditional mode of y given x by using a linear predictor function of x. MLR relaxes the distribution assumptions for the M-estimators of linear regression and is robust against outliers compared to the least-squares estimation of linear regression coefficients. It is also robust against violations of standard assumptions on the usual mean regression, such as heavy-tailed noise and skewed conditional and noise distributions. Kemp and Silva [14] and Yao and Li [23] proved that their estimators for the MLR model are consistent even when the error distribution is asymmetric. For the above reasons, improving methods for mode estimation has been an important topic of research for many years.
In information geometry [2], a manifold that consists of statistical models is called a model manifold. An information geometric formulation of an estimation algorithm is useful for understanding its behavior and characteristics. For example, the procedure of parameter estimation is regarded as a projection from a point in the statistical manifold to a point in the model manifold. Modal linear regression is known to be robust to outliers, and we aim to elucidate the source of this robustness by formulating the estimation procedure as geometric operations. Identifying and understanding the source of robustness of the estimation algorithm and statistical model associated with modal linear regression would be helpful for developing other algorithms and models robust to outliers. In information geometry, a model manifold is often constructed by using a parametric distribution. Because of the lack of a parametric distribution, constructing a manifold that corresponds to the MLR model is difficult with conventional approaches. The contribution of this paper is to provide the first information geometric perspective on MLR.

Related works
The modal regression model is related to kernel density estimation (KDE) [22], which is a nonparametric method for estimating the probability density function of observed data. Parzen [19] revealed sufficient conditions for the $L_2$-consistency and asymptotic normality of KDE. He also derived conditions for the consistency and asymptotic normality of a mode estimator constructed on the basis of KDE. Epanechnikov [7] found the optimal kernel function for KDE under some conditions. In general, if the probability density function of a random variable $X$ has a unique mode and is symmetric with respect to the mode, $\Pr(p - w \leq X \leq p + w)$ with fixed $w$ is maximized when $p$ is the mode. Based on this property, Lee [15] proposed an estimator for the coefficients of MLR, which models the conditional mode of the response as a linear function of the predictors. The estimator of Lee [15] is consistent when there exists $w > 0$ such that the probability density function of the error $\varepsilon$ is symmetric in the range $0 \pm w$. Lee [16] made the objective function for the MLR model more tractable by using a quadratic kernel. Lee [15,16] required the conditional probability density function of $\varepsilon$ given $x$ to be symmetric around 0 in the range $\pm w$. Kemp and Silva [14] proved that the mode estimator for the coefficients of MLR is consistent even if symmetry is not satisfied. Yao and Li [23] proposed an expectation–maximization (EM) algorithm to estimate the coefficients of MLR.
Besides linear modeling, other studies have used semiparametric or nonparametric approaches to modal regression. Gannoun et al. [9] proposed a semiparametric modal regression model that assumes a linear relation between the mean, median, and mode. Yao et al. [24] developed a local modal regression that estimates the conditional mode of response as a polynomial function of predictors. Chen et al. [4] defined the modal manifold as a union of sets in which the first derivative of the conditional density is zero and the second derivative is negative. Then, the modal manifold is estimated by kernel estimates of derivatives of densities.
In information geometry, a model manifold is often constructed by using a parametric distribution. Estimates are regarded as the projection of an empirical distribution onto the model manifold. In the case of linear regression, we construct a model manifold based on the assumption that an error variable has a normal distribution. Because of the lack of a parametric distribution, constructing a model manifold that corresponds to the MLR model is difficult with conventional approaches. Some studies have considered nonparametric models for information geometry. Pistone and Sempi [20] showed a well-defined Banach manifold for probability measures. Grasselli [10] addressed the Fisher information and α-connections for the Banach manifold. Zhang [25] discussed the relationship between divergence functions, the Fisher information, α-connections, and fundamental issues in information geometry. Takano et al. [21] proposed a framework for a nonparametric e-mixture estimation. In contrast to these nonparametric approaches to information geometry, in this paper we consider information geometry associated with a semiparametric MLR model. We propose to construct a model manifold by using observations, as is done when constructing an empirical distribution with conventional approaches. Our proposal gives a geometric viewpoint to the MLR model.

Modal linear regression
Let $x \in \mathbb{R}^p$ and $y \in \mathbb{R}$ be predictor variables and a response variable, respectively. Ordinary least squares for linear regression estimates the conditional mean of $y$ given $x$, while MLR estimates the conditional mode of $y$ given $x$. In this section, we briefly explain the EM algorithm for MLR introduced by Yao and Li [23].

Formulation
Suppose that $N$ observations $\{(x_i, y_i)\}_{i=1}^{N}$ are given, where the $i$-th predictor variable is denoted by $x_i \in \mathbb{R}^p$ and the corresponding response is denoted by $y_i \in \mathbb{R}$. With MLR, we model the conditional mode of $y$ given $x$ by a linear function of $x$:
$$\mathrm{Mode}[y \mid x] = x^\top \beta.$$
Namely, $y$ and $x$ are related as
$$y = x^\top \beta + \varepsilon, \qquad \mathrm{Mode}[\varepsilon \mid x] = 0,$$
where $\varepsilon$ denotes an error variable. To estimate $\beta$, Lee [15] introduced a loss function of the form
$$-\frac{1}{N} \sum_{i=1}^{N} \phi_h\bigl(y_i - x_i^\top \beta\bigr),$$
where $\phi_h(t) = \phi(t/h)/h$ is a kernel function and $h$ is a bandwidth parameter. Minimizing the empirical loss leads to the estimate $\hat\beta$ of the linear coefficients:
$$\hat\beta = \operatorname*{argmax}_{\beta} \frac{1}{N} \sum_{i=1}^{N} \phi_h\bigl(y_i - x_i^\top \beta\bigr). \tag{3}$$
In this paper, $\phi(\cdot)$ denotes the standard normal density function. The consistency and asymptotic normality of the estimate $\hat\beta$ obtained by (3) have been established under certain regularity conditions on the samples, kernel function, parameter space, and vanishing rate of the bandwidth parameter [14].
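As a concrete illustration, the kernel objective in (3) can be written directly in code. The following sketch is my own illustration (function names such as `mlr_objective` are not from the paper) and assumes the Gaussian kernel:

```python
import numpy as np

def phi_h(t, h):
    """Gaussian kernel with bandwidth h: (1/h) * phi(t/h)."""
    return np.exp(-0.5 * (t / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def mlr_objective(beta, X, y, h):
    """Empirical kernel objective of (3): (1/N) * sum_i phi_h(y_i - x_i^T beta).

    MLR estimates beta by maximizing this quantity, i.e., by maximizing the
    kernel density estimate of the residuals evaluated at zero.
    """
    residuals = y - X @ beta
    return np.mean(phi_h(residuals, h))
```

With noiseless data lying exactly on a line, the objective is maximized at the true coefficients, where it equals $\phi_h(0)$.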

EM algorithm for MLR
Here, we introduce the EM algorithm for MLR parameter estimation proposed by Yao and Li [23]. The algorithm consists of two steps starting from an initial estimate $\beta^{(1)}$:

E-Step Compute the weights
$$\pi\bigl(i \mid \beta^{(k)}\bigr) = \frac{\phi_h\bigl(y_i - x_i^\top \beta^{(k)}\bigr)}{\sum_{j=1}^{N} \phi_h\bigl(y_j - x_j^\top \beta^{(k)}\bigr)}, \quad i = 1, \ldots, N, \tag{5}$$
and consider the surrogate function
$$Q\bigl(\beta \mid \beta^{(k)}\bigr) = \sum_{i=1}^{N} \pi\bigl(i \mid \beta^{(k)}\bigr) \log \phi_h\bigl(y_i - x_i^\top \beta\bigr).$$
This function satisfies the standard minorization properties of the EM algorithm, so increasing it does not decrease the objective in (3).

M-Step In this step, the parameter $\beta$ is updated to increase the value of $Q(\beta \mid \beta^{(k)})$. The updated parameter $\beta^{(k+1)}$ is given as
$$\beta^{(k+1)} = \operatorname*{argmax}_{\beta} Q\bigl(\beta \mid \beta^{(k)}\bigr) \tag{8}$$
$$= \operatorname*{argmax}_{\beta} \sum_{i=1}^{N} \pi\bigl(i \mid \beta^{(k)}\bigr) \log \phi_h\bigl(y_i - x_i^\top \beta\bigr). \tag{9}$$
Equation (8) is equivalent to (9).
When $\phi(\cdot)$ is the standard normal density function, the M-step reduces to a weighted least-squares update:
$$\beta^{(k+1)} = \bigl(X^\top W_k X\bigr)^{-1} X^\top W_k y, \qquad W_k = \operatorname{diag}\Bigl(\pi\bigl(1 \mid \beta^{(k)}\bigr), \ldots, \pi\bigl(N \mid \beta^{(k)}\bigr)\Bigr).$$
The properties of the estimate $\hat\beta$ were discussed in [23].
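The two steps above can be sketched as an iteratively reweighted least-squares loop. This is an illustrative implementation of the Yao–Li update under the Gaussian kernel; the function name `mlr_em` and the fixed iteration count are my own choices:

```python
import numpy as np

def mlr_em(X, y, h, beta0, n_iter=100):
    """EM algorithm of Yao and Li for MLR with a Gaussian kernel.

    E-step: weights proportional to phi_h(y_i - x_i^T beta^(k)).
    M-step: weighted least squares with those weights, which maximizes
            sum_i pi_i log phi_h(y_i - x_i^T beta).
    """
    beta = np.asarray(beta0, dtype=float)
    for _ in range(n_iter):
        r = y - X @ beta
        w = np.exp(-0.5 * (r / h) ** 2)   # unnormalized kernel weights
        w = w / w.sum()                   # E-step: pi(i | beta^(k))
        W = np.diag(w)
        # M-step: beta^(k+1) = (X^T W X)^{-1} X^T W y
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta
```

On a sample whose majority lies exactly on a line plus a few gross outliers, the iteration started near the modal line stays on it: the outliers receive vanishing weights.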

Information geometry
In this section, we briefly explain information geometry, statistical inference, and the em algorithm. Information geometry [2] is a framework for formulating spaces consisting of probability density functions by means of differential geometry. Let $S$ be a statistical manifold composed of probability distributions. A parametric family of probability distributions of the form
$$f(x; \theta) = \exp\left( \sum_{i=1}^{n} \theta^i F_i(x) - \psi(\theta) \right)$$
is called an exponential family, and it plays a critical role in information geometry. When $\{F_1, \ldots, F_n, 1\}$ is linearly independent, $\theta$ and $f(x; \theta)$ have a one-to-one correspondence, and $(\theta^i)_{i=1}^{n}$ is an affine coordinate system of the statistical manifold. There is another useful coordinate system:
$$\eta_i = \mathbb{E}_\theta[F_i(x)], \quad i = 1, \ldots, n.$$
The parameters $(\theta^i)_{i=1}^{n}$ and $(\eta_i)_{i=1}^{n}$ are called the natural parameters and expectation parameters, respectively, of the exponential family.
For an exponential family equipped with coordinate systems $\{(\theta^i)_{i=1}^{n}, (\eta_i)_{i=1}^{n}\}$, there exist potential functions $\psi(\theta)$ and $\varphi(\eta)$ satisfying the following relation [2]:
$$\psi(\theta) + \varphi(\eta) - \sum_{i=1}^{n} \theta^i \eta_i = 0, \qquad \eta_i = \frac{\partial \psi(\theta)}{\partial \theta^i}, \qquad \theta^i = \frac{\partial \varphi(\eta)}{\partial \eta_i}. \tag{12}$$
In information geometry, statistical inference is often regarded as the projection of an empirical distribution onto a model manifold, which is a submanifold of $S$. Projection is a procedure for finding the point that minimizes the discrepancy between a point and a submanifold; hence it is important to measure the discrepancy between two probability distributions. We can define a divergence from a point (probability distribution) $p \in S$ to $q \in S$ by using the coordinate systems and potential functions as follows:
$$D(p \,\|\, q) = \psi(\theta_p) + \varphi(\eta_q) - \sum_{i=1}^{n} \theta_p^i \eta_i^q.$$
For an exponential family, there are two natural divergences, the e- and m-divergences. The m-divergence is defined as
$$D^{(m)}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx,$$
which is equal to the KL-divergence. The e-divergence is defined as follows:
$$D^{(e)}(p \,\|\, q) = D^{(m)}(q \,\|\, p).$$
The EM algorithm [5] is a method for estimating maximum likelihood parameters in a latent variable model. In information geometry, the exponential-mixture (em) algorithm [1] corresponds to the EM algorithm. For a latent variable model, an empirical distribution based on observations is not unique. A manifold that consists of empirical joint probability distributions of observable variables and latent variables is called a data manifold $D$. The em algorithm finds the points $p^* \in M$ and $q^* \in D$ that minimize the KL-divergence from $q^*$ to $p^*$ by iterating the following two steps starting from an initial guess $p^{(1)}$. Figure 1 is a conceptual diagram of the em algorithm.

e-step: e-projection of $p^{(k)} \in M$ onto the data manifold $D$: $q^{(k)} = \operatorname*{argmin}_{q \in D} D^{(e)}(p^{(k)} \,\|\, q)$.
m-step: m-projection of $q^{(k)} \in D$ onto the model manifold $M$: $p^{(k+1)} = \operatorname*{argmin}_{p \in M} D^{(m)}(q^{(k)} \,\|\, p)$.
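For discrete distributions, the e- and m-divergences are the two orientations of the same KL integral. A minimal sketch (the helper names are mine, for illustration only):

```python
import numpy as np

def kl(p, q):
    """KL divergence sum_i p_i log(p_i / q_i) for strictly positive
    discrete distributions p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def m_divergence(p, q):
    """m-divergence D^(m)(p || q) = KL(p || q)."""
    return kl(p, q)

def e_divergence(p, q):
    """e-divergence D^(e)(p || q) = KL(q || p): the arguments reversed."""
    return kl(q, p)
```

The asymmetry of the KL-divergence is exactly what distinguishes the two projections: e-projection minimizes KL with the model in the second argument, m-projection with the model in the first.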

MEM algorithm and its information geometry
In this section, we introduce the modal EM (MEM) algorithm [17] for estimating the mode given a probability density function and provide an information geometric perspective. The MEM algorithm is an iterative procedure similar to the EM algorithm, but there are no explicit observations, hence the construction of an empirical density function is nontrivial. In this paper, we propose to construct an empirical density function based on the assumption that one pseudo-observation is obtained. This assumption plays a critical role in constructing an empirical density function for the MEM algorithm and the MLR model, in which the same difficulty exists because of the absence of explicit observations.

MEM algorithm
Consider the Gaussian mixture model whose parameters are known. In general, it is difficult to express the mode of the Gaussian mixture model in a closed form, even if the parameters are known. In order to obtain the mode, we need to execute a numerical optimization. The MEM algorithm [17] is an iterative method for finding a local mode of a mixture distribution of the form
$$f(x) = \sum_{i=1}^{K} \pi_i f_i(x),$$
where all of the parameters in this model are known. The purpose of the MEM algorithm is to find the mode of $f(x)$:
$$\operatorname*{argmax}_{x} f(x).$$
Li et al. [17] proposed to iterate the following two steps starting from the initial estimate $x^{(1)}$:

E-step
$$p_i^{(k)} = \frac{\pi_i f_i\bigl(x^{(k)}\bigr)}{f\bigl(x^{(k)}\bigr)}, \quad i = 1, \ldots, K.$$
M-step
$$x^{(k+1)} = \operatorname*{argmax}_{x} \sum_{i=1}^{K} p_i^{(k)} \log f_i(x). \tag{15}$$
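For a univariate Gaussian mixture, the M-step has a closed form: a precision-weighted mean of the component means. The MEM iteration can therefore be sketched compactly. This is an illustration under the Gaussian-component assumption; the name `mem_mode` is mine:

```python
import numpy as np

def mem_mode(pi, mu, sigma, x0, n_iter=200):
    """Modal EM for a univariate Gaussian mixture with known parameters.

    E-step: responsibilities p_i = pi_i N(x^(k); mu_i, sigma_i^2) / f(x^(k)).
    M-step: x^(k+1) = argmax_x sum_i p_i log N(x; mu_i, sigma_i^2),
            which for Gaussian components is the precision-weighted mean
            (sum_i p_i mu_i / sigma_i^2) / (sum_i p_i / sigma_i^2).
    """
    pi, mu, sigma = map(np.asarray, (pi, mu, sigma))
    x = float(x0)
    for _ in range(n_iter):
        dens = pi * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / sigma
        p = dens / dens.sum()                                         # E-step
        x = float(np.sum(p * mu / sigma ** 2) / np.sum(p / sigma ** 2))  # M-step
    return x
```

Started inside a basin of attraction, the iteration converges to the local mode of that basin, which is why the initial estimate matters for multimodal mixtures.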

Pseudo-observation
The purpose of the MEM algorithm is to find an optimal point that maximizes a probability density function $f(x)$. In information geometry, projection is a procedure of finding the probability density function in a set of density functions that minimizes the discrepancy from the empirical density function. In this section, we show how the MEM algorithm can be interpreted as a problem of finding an optimal function from a set of probability density functions, which leads us to the information geometric perspective of the MEM algorithm. Let a probability density function $s(x; m)$ be
$$s(x; m) = f(x + m),$$
where $m \in \mathbb{R}^p$ is a parameter for $s(x; m)$. From this definition, the mode of $s(x; m^*)$ is $0$ when the mode of $f(x)$ is $m^*$. On the other hand, if $m^*$ is the optimal solution of $\max_m s(0; m)$, then the mode of $f(x)$ is $m^*$. The two problems $\max_x f(x)$ and $\max_m s(0; m)$ have the same solution, but the ideas behind them are different. The former finds an optimal point in the domain of $f(x)$, while the latter considers a set of probability density functions shifted by $m$ and finds the function in that set that maximizes the value at $x = 0$. The search spaces of the two problems are different, and that of the latter interpretation is convenient when we formulate the MEM algorithm from an information geometric perspective, because its search space is a set of functions, as in the procedure of projection.
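The equivalence of $\max_x f(x)$ and $\max_m s(0; m)$ can be checked numerically. The sketch below is my own illustration (the mixture parameters are arbitrary): both problems are evaluated on the same grid and return the same solution.

```python
import numpy as np

def f(x):
    """A fixed two-component mixture density (up to normalization) whose
    mode we want; all parameters are known."""
    return 0.6 * np.exp(-0.5 * (x - 1.0) ** 2) + 0.4 * np.exp(-0.5 * (x - 4.0) ** 2)

def s(x, m):
    """Shifted family s(x; m) = f(x + m): each m indexes one density."""
    return f(x + m)

# Problem 1: maximize f over points x.
# Problem 2: maximize s(0; m) over the family index m.
grid = np.linspace(-2.0, 8.0, 100001)
x_star = grid[np.argmax(f(grid))]
m_star = grid[np.argmax(s(0.0, grid))]
```

Since $s(0; m) = f(m)$, the two searches coincide point by point; the difference is purely one of interpretation (a point in the domain versus a member of a set of densities).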
To derive the information geometric formulation of the MEM algorithm, we introduce a latent variable $Z \in \{1, \ldots, K\}$. The latent variable specifies the mixture component from which an observation $x$ is obtained. The joint probability density function $g(x, z)$ is expressed as follows [1]:
$$g(x, z) = \prod_{i=1}^{K} \bigl( \pi_i f_i(x) \bigr)^{z_i},$$
where $f_i$ is a probability density function and $z = (z_1, \ldots, z_K)$ is the indicator vector of the component. In the above discussion, we considered finding $\operatorname{argmax}_m s(0; m)$ from a set of probability density functions parameterized by $m$. In the same way, we define a joint probability density function $l(x, z; m)$ and a corresponding model manifold $M$ as follows:
$$l(x, z; m) = \prod_{i=1}^{K} \bigl( \pi_i f_i(x + m) \bigr)^{z_i}, \qquad M = \bigl\{ l(x, z; m) \bigm| m \in \mathbb{R}^p \bigr\}.$$
We then consider an empirical density function and data manifold. In general, an empirical density function is constructed based on observations. For example, given observations $\{x_i\}_{i=1}^{N}$, the empirical density function is defined as
$$p(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i),$$
where $\delta(\cdot)$ denotes the Dirac delta function. In the formulation of the MEM algorithm, the construction of the empirical density function is nontrivial because there are no explicit observations. In the interpretation of $\operatorname{argmax}_m s(0; m)$, we considered the case of $x = 0$, and $s(0; m)$ is treated as a function of $m$. This is equivalent to assuming that "one observation $x = 0$ is obtained," and we interpret the procedure of the MEM algorithm as maximum likelihood estimation, namely, finding the parameter value $m \in \mathbb{R}^p$ that maximizes the likelihood of $s(0; m)$. This assumption leads to the following definition of the empirical density function $p(x)$:
$$p(x) = \delta(x).$$
By introducing the latent variable $Z \in \{1, \ldots, K\}$, we extend $p(x)$ to an empirical joint density function of $X$ and $Z$, and by introducing the parameters $\{q_i\}_{i=1}^{K}$ we model the conditional distribution of $Z$. Then, the empirical joint density function $h(x, z; q_1, \ldots, q_K)$ is expressed as
$$h(x, z; q_1, \ldots, q_K) = \sum_{i=1}^{K} q_i \, \delta(x) \, \delta_i(z).$$
The data manifold $D$ is defined as follows:
$$D = \Bigl\{ h(x, z; q_1, \ldots, q_K) \:\Bigm|\: q_i \geq 0, \; \textstyle\sum_{i=1}^{K} q_i = 1 \Bigr\}.$$
It is shown in Appendix A that $D$ is in a mixture family.
We consider the MEM algorithm to be a maximum likelihood estimation problem. Thus, we can consider the e- and m-projections between the model manifold $M$ and the data manifold $D$ as follows:
$$\min_{h \in D} D^{(e)}\bigl(l(\cdot, \cdot; m^{(k)}) \,\|\, h\bigr), \tag{19}$$
$$\min_{m} D^{(m)}\bigl(h(\cdot, \cdot; q_1^{(k)}, \ldots, q_K^{(k)}) \,\|\, l(\cdot, \cdot; m)\bigr). \tag{20}$$
The detailed calculation of the e-projection in (19) is provided in Appendix B, and the optimal parameter for $h \in D$ is given as
$$q_i^{(k)} = \frac{\pi_i f_i\bigl(m^{(k)}\bigr)}{\sum_{j=1}^{K} \pi_j f_j\bigl(m^{(k)}\bigr)}.$$
The m-projection in (20) is equivalent to
$$m^{(k+1)} = \operatorname*{argmax}_{m} \sum_{i=1}^{K} q_i^{(k)} \log f_i(m). \tag{22}$$
The detailed derivation is shown in Appendix C.

Summary: information geometry of MEM
To provide an information geometric perspective of the MEM algorithm, we interpret the algorithm as a problem of finding an optimal probability density function that maximizes the value at $x = 0$. This enables us to cast the MEM algorithm as maximum likelihood estimation. The e-projection of the model distribution $l(x, z; m^{(k)})$ onto $D$ gives the optimal $q^{(k)} = (q_1^{(k)}, \ldots, q_K^{(k)})$, and the m-projection of $h(x, z; q_1^{(k)}, \ldots, q_K^{(k)})$ onto $M$ derives the optimal $m^{(k+1)}$, which is consistent with (15) in the original MEM algorithm.

Information geometry of MLR
In this section, we analyze MLR from the viewpoint of information geometry. We elucidate the source of the difficulty with constructing a model manifold and data manifold for the MLR model and propose a framework for geometrically formulating the MLR model.

Problem of constructing manifolds
In order to elucidate the source of the difficulty with constructing manifolds for the MLR model, we consider the parameter estimation of a Gaussian mixture model as a specific example of statistical inference in information geometry. Suppose that observations $\{x_i\}_{i=1}^{N}$ are obtained from a Gaussian mixture model. Then, the model manifold consists of Gaussian mixture density functions whose parameters are the means and covariance matrices. The data manifold is constructed based on the empirical density function
$$p(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i).$$
In the parameter estimation of the Gaussian mixture model, the model manifold is constructed based on the parametric distribution. On the other hand, there is no assumption of parametric distributions in MLR. This makes it nontrivial to construct a model manifold and data manifold.
To construct the model manifold for the MLR model, we consider (i) the assumption that $\mathrm{Mode}[\varepsilon; x] = 0$ and (ii) the form of the objective function of $\beta$ for the MLR model:
$$\frac{1}{N} \sum_{i=1}^{N} \phi_h\bigl(y_i - x_i^\top \beta\bigr).$$
From this assumption and fact, the optimization problem expressed in (3) is regarded as a maximization problem of the KDE at $\varepsilon = 0$ for the probability density function of $\varepsilon$. Based on the given observations, we propose constructing the following model for MLR:
$$f(\varepsilon; \beta) = \frac{1}{N} \sum_{i=1}^{N} \phi_h\bigl(\varepsilon - \varepsilon_i(\beta)\bigr),$$
where $\varepsilon_i(\beta) = y_i - x_i^\top \beta$, $i = 1, \ldots, N$, and the variable $\varepsilon$ denotes an error variable. We introduce the latent variable $Z \in \{1, \ldots, N\}$, which specifies the mixture component from which an observation is obtained. The joint density function of $\varepsilon$ and $Z$ is
$$g(\varepsilon, z; \beta) = \prod_{i=1}^{N} \left( \frac{1}{N} \phi_h\bigl(\varepsilon - \varepsilon_i(\beta)\bigr) \right)^{z_i}. \tag{24}$$
The model manifold $M$ is denoted by
$$M = \bigl\{ g(\varepsilon, z; \beta) \bigm| \beta \in \mathbb{R}^p \bigr\}. \tag{25}$$
It is shown in Appendix D that $M$ is in a curved exponential family. We next propose constructing a data manifold for the MLR model. The empirical density function is often constructed based on observations. Consider (i) the construction proposed in Sect. 4.2 and (ii) the assumption that $\mathrm{Mode}[\varepsilon; x] = 0$. We propose constructing the empirical density function as follows:
$$p(\varepsilon) = \delta(\varepsilon). \tag{26}$$
By introducing the latent variable $Z \in \{1, \ldots, N\}$ to (26), we extend $p(\varepsilon)$ to the empirical joint density function of $\varepsilon$ and $Z$. By introducing the parameters $\{q_i\}_{i=1}^{N}$, we model the conditional density function $q(z \mid \varepsilon)$. Then, the empirical joint density function $h(\varepsilon, z; q_1, \ldots, q_N)$ is expressed as
$$h(\varepsilon, z; q_1, \ldots, q_N) = \sum_{i=1}^{N} q_i \, \delta(\varepsilon) \, \delta_i(z).$$
The data manifold $D$ is defined as follows:
$$D = \Bigl\{ h(\varepsilon, z; q_1, \ldots, q_N) \:\Bigm|\: q_i \geq 0, \; \textstyle\sum_{i=1}^{N} q_i = 1 \Bigr\}.$$
It is shown in Appendix E that $D$ is in a mixture family.
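The observation-based model density can be sketched directly. Evaluating it at $\varepsilon = 0$ recovers the objective in (3), because the Gaussian kernel is symmetric: $\phi_h(0 - \varepsilon_i) = \phi_h(\varepsilon_i)$. A minimal illustration (the function names are mine):

```python
import numpy as np

def phi_h(t, h):
    """Gaussian kernel with bandwidth h."""
    return np.exp(-0.5 * (t / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def model_density(eps, beta, X, y, h):
    """Observation-based model for MLR: a kernel mixture over residuals,
    f(eps; beta) = (1/N) sum_i phi_h(eps - eps_i(beta)),
    with eps_i(beta) = y_i - x_i^T beta."""
    r = y - X @ beta
    return np.mean(phi_h(eps - r, h))
```

In other words, the MLR objective is the height of this model density at the assumed mode of the error, $\varepsilon = 0$, so maximizing (3) over $\beta$ moves the model density so that it is as tall as possible at the pseudo-observation.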
Here, we consider the e-projection of the model with parameters $\beta^{(k)}$ onto the data manifold:
$$\min_{h \in D} D^{(e)}\bigl(g(\cdot, \cdot; \beta^{(k)}) \,\|\, h\bigr) = \min_{q_1, \ldots, q_N} D^{(e)}\bigl(g(\cdot, \cdot; \beta^{(k)}) \,\|\, h(\cdot, \cdot; q_1, \ldots, q_N)\bigr). \tag{29}$$
The detailed calculation is shown in Appendix F. The optimal solution for (29) is
$$q_i^{(k)} = \frac{\phi_h\bigl(\varepsilon_i(\beta^{(k)})\bigr)}{\sum_{j=1}^{N} \phi_h\bigl(\varepsilon_j(\beta^{(k)})\bigr)}, \tag{30}$$
which is equivalent to the E-step in (5). Then, we consider the m-projection of the empirical joint density function with the parameters $q_i = q_i^{(k)}$ onto the model manifold:
$$\min_{\beta} D^{(m)}\bigl(h(\cdot, \cdot; q_1^{(k)}, \ldots, q_N^{(k)}) \,\|\, g(\cdot, \cdot; \beta)\bigr). \tag{31}$$
The detailed calculation is shown in Appendix G. The optimization problem expressed as (31) is equivalent to
$$\max_{\beta} \sum_{i=1}^{N} q_i^{(k)} \log \phi_h\bigl(\varepsilon_i(\beta)\bigr), \tag{32}$$
which is equivalent to the M-step (9). Figure 2 shows the update process of the em algorithm corresponding to the MLR model parameter estimation.
In this section, we proposed constructing a model manifold and data manifold for the MLR model. Although a model manifold is often constructed based on a parametric distribution assumption, we constructed it based on observations. Concerning the empirical distribution, our construction is based on (i) the assumption that $\mathrm{Mode}[\varepsilon; x] = 0$ and (ii) the construction proposed in Sect. 4.2. We applied the framework of the em algorithm to the proposed manifolds and demonstrated that the e-projection of the model onto the data manifold derives (5), and the m-projection of the empirical distribution onto the model manifold leads to (9).

On the influence function for MLR
An influence function quantifies the influence of one observation on an estimate. Let $T$ be a functional defined on a set of probability measures and $F$ be a probability measure. The influence function of $T$ at $F$ is denoted by $\mathrm{IF}(x; T, F)$ and is defined as follows:
$$\mathrm{IF}(x; T, F) = \lim_{t \to 0} \frac{T\bigl((1 - t)F + t\Delta_x\bigr) - T(F)}{t},$$
where $\Delta_x$ denotes the Dirac measure at $x$. Yao and Li [23] argued that MLR is robust against outliers after investigating its breakdown point. Elucidating the reason for this robustness is important because it can be a clue to developing novel robust estimators. There are various approaches in the literature for making an estimation robust. One representative approach is to change the manifold for the m-projection. For example, the median is often used as a robust estimate of the mean; this approach corresponds to adopting a model manifold composed of Laplace distributions instead of Gaussian distributions. On the other hand, a robust estimator can also be obtained by changing the projection method, such as using robust divergences instead of the KL-divergence [8].
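The influence function above has a finite-sample counterpart, the rescaled sensitivity $(n+1)\bigl(T(F_{n+1}) - T(F_n)\bigr)$, which can be computed directly to contrast MLR with least squares. The following is an illustrative experiment (the helper names and the toy data are mine, not from the paper):

```python
import numpy as np

def ols(X, y):
    """Ordinary least-squares coefficients."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def mlr(X, y, h=1.0, n_iter=50):
    """MLR coefficients via the EM (IRLS) iteration with a Gaussian kernel,
    started from the least-squares fit."""
    beta = ols(X, y)
    for _ in range(n_iter):
        w = np.exp(-0.5 * ((y - X @ beta) / h) ** 2)
        W = np.diag(w / w.sum())
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta

def sensitivity(estimator, X, y, u, v):
    """Rescaled one-point sensitivity (n+1) * (T(F_{n+1}) - T(F_n)):
    the finite-sample counterpart of the influence function at (u, v)."""
    n = len(y)
    Xc = np.vstack([X, u])
    yc = np.append(y, v)
    return (n + 1) * (estimator(Xc, yc) - estimator(X, y))
```

On data lying exactly on a line, a gross outlier moves the least-squares fit substantially, while the MLR fit is essentially unaffected: the redescending kernel weight of the outlier vanishes.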
So far, we have established the equivalence of the em and EM algorithms for MLR. The em algorithm is based on the KL-divergence, which is sensitive to outliers. Thus, we conjecture that the m-projection onto the model manifold, which is composed of the error distributions given by KDE, is the source of the robustness of MLR. Here, we focus on the influence function [11] of the estimator for the MLR model.
Unfortunately, it is difficult to derive the influence functions corresponding to the updates of the coefficient estimate in the e- and m-steps for the MLR problem, because it is not trivial how to deal with the effect of an outlier on the projection onto the data-dependent manifold.
In the finite-sample robustness analysis, the rescaled version of the influence function [11] is used to evaluate the effect of an outlier $(u, v) \in \mathbb{R}^p \times \mathbb{R}$ on each projection. Suppose that the current estimate is $\beta^{(k)}$. In the case without an outlier, the joint density function in Eq. (24) is projected onto the data manifold $D$. When the dataset is contaminated with an outlier $(u, v)$, the joint density is modified by adding the component associated with the residual $\varepsilon_{N+1}(\beta^{(k)}) = v - u^\top \beta^{(k)}$, and it is projected onto the outlier-contaminated data manifold $D_{N+1}$, which is composed of the parameters $q_1, \ldots, q_{N+1}$.
In the same manner, we can calculate the rescaled version of the influence function on the m-projection by comparing the results of the m-projections with and without an outlier. The regression coefficients obtained by the m-projection with and without the outlier are denoted as $\tilde\beta^{(k+1)}$ and $\beta^{(k+1)}$, respectively. When the sample is contaminated by an outlier $(u, v)$, the m-projection projects $h(\varepsilon, z; q_1^{(k)}, \ldots, q_{N+1}^{(k)})$ onto the contaminated model manifold $M_{N+1}$, which is the set of $g_{N+1}(\varepsilon, z; \beta)$ defined by Eq. (34) parameterized by $\beta$. The influence function on the m-projection is then written in terms of the difference between $\tilde\beta^{(k+1)}$ and $\beta^{(k+1)}$. We could derive the influence functions on each of the e- and m-projections; however, there are problems with the above treatment of the effect of an outlier. In the m-projection, we did not take into account the indirect effect of the outlier, which is inherited from the e-projection in the outlier-contaminated case. Moreover, a standard definition of the influence function is associated with the probability measure $F$ for $X, Y$, and the nonparametric nature of the proposed information geometric formulation of modal linear regression makes the conventional robustness analysis difficult. Toward understanding the source of robustness of MLR, we consider the influence function of $\hat\beta$ defined in Eq. (3), which is a standard $\psi$-type M-estimator, and the influence function of the estimate given by the em algorithm from the viewpoint of the iteratively reweighted least-squares (IRLS) estimator.
It is possible to derive the influence function independently of the solving algorithm. For example, the regression coefficient estimate $\hat\beta$ of the MLR model defined by Eq. (3) is a $\psi$-type M-estimator, and its influence function is expressed as
$$\mathrm{IF}\bigl((u, v); \hat\beta, F\bigr) = \Bigl( \mathbb{E}\bigl[\phi_h''\bigl(y - x^\top \beta^*\bigr)\, x x^\top \bigr] \Bigr)^{-1} \phi_h'\bigl(v - u^\top \beta^*\bigr)\, u, \tag{35}$$
where $\beta^* = \hat\beta(F)$. We note that Eq. (35) does not depend on whether the EM algorithm or the steepest descent method is used. Details of the derivation are given in Proposition 1 of Appendix H. To see the effect of an outlier, let us consider a very simple case with $\hat\beta(F) = \beta^*$, in which a predictor variable $X$ and an error variable $\varepsilon$ are independent. We also assume that the Gaussian kernel with bandwidth $h$ is adopted as the kernel function, so that $\phi_h'(t) = -\frac{t}{h^2}\phi_h(t)$. Then, Eq. (35) simplifies: the expectation factorizes, and because $\phi_h'(t) \to 0$ as $|t| \to \infty$, the influence of a gross outlier with large residual $v - u^\top \beta^*$ is negligible. On the other hand, it is also possible to define the estimate $\hat\beta$ as the limit of the EM algorithm derived in [23]. In particular, when the Gaussian kernel is adopted for density estimation, the resulting estimator is regarded as an IRLS estimator. Dollinger and Staudte [6] addressed a related problem when they derived the influence function of an IRLS estimator for the linear regression model. They revealed the relationship between the influence functions of the $(k+1)$-step and $k$-step estimates, and derived the influence function for the estimator as its limit.
Following the approach of Dollinger and Staudte [6], we consider the em algorithm for MLR as an IRLS procedure, in which one iteration of IRLS corresponds to a pair of e- and m-projections. Let the initial estimate be $\beta^{(1)}$ and the $k$-th estimate be $\beta^{(k)}$. Then, it is shown that the influence functions of the estimates $\beta^{(k)}$ and $\beta^{(k+1)}$ satisfy a recurrence relation involving quantities $\Sigma_k$, $c_k$, and $C_k$ determined by the distribution $F(x, y)$. Details of the derivation are given in Proposition 2 of Appendix H. In the original least-squares regression problem dealt with in [6], the weighted least-squares (WLS) estimator is proven to be Fisher consistent by using the symmetry of the noise distribution, and $\Sigma_k$, $c_k$, $C_k$ in the recurrence relation are shown to be independent of $k$. In our MLR problem, though, the noise distribution cannot be assumed to be symmetric. We therefore consider the converged value $\beta^{(\infty)}$ to remove the dependency on $k$. In this limit, $\Sigma_k$, $c_k$, $C_k$ do not depend on $k$, and Eq. (38) yields a closed-form expression for the limiting influence function. Now we assume that $|A| < 1$; the use of the Gaussian kernel implies $\frac{d}{dz}\phi_h(z) = -\frac{1}{h^2} z \phi_h(z)$, and we obtain the influence function of the IRLS estimator. In this section, we conjectured that the robustness of MLR is due to the particular structure of the model manifold. To support this conjecture, we attempted to derive the influence functions with respect to the e- and m-steps of the estimation procedure for the model coefficients. Currently it is difficult to derive these influence functions, but we showed the difference between the influence function derived from the viewpoint of the M-estimator and that derived from the IRLS estimator. The IRLS estimator is composed of a sequence of e- and m-steps. Our future work is to identify the contribution of each of the e- and m-steps to the influence function.

Conclusions
In this paper, we provide an information geometric perspective on the MLR model, which is a semiparametric method. First, we discuss the MEM algorithm and investigate the relationship between the algorithm and information geometry. We cast the MEM algorithm as a maximum likelihood method by assuming a pseudo-observation. This gives us an empirical density function based on the assumption that one pseudo-observation is obtained. Second, we address the relationship between the MLR model and information geometry. Because the MLR model does not assume a parametric distribution, we cannot construct a corresponding model manifold with conventional approaches. We propose constructing the model manifold based on observations. The empirical density function introduced through the discussion of the MEM algorithm is applied to construct the data manifold. We clarify the relationship between the EM algorithm developed by Yao and Li [23] and the em algorithm corresponding to the MLR model.
Elucidating the factors or geometric operation that make the estimator for the MLR model robust remains for future work. We will further investigate the influence function corresponding to the e-and m-steps for estimating the coefficient of the MLR model.

Acknowledgements
The authors are grateful to the anonymous reviewers whose comments led to valuable improvements in the manuscript.

Compliance with ethical standards
Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

A D of MEM is in a mixture family
This section shows that the data manifold $D$ of MEM is in a mixture family. An element of $D$ is expressed as
$$h(x, z; q_1, \ldots, q_K) = \sum_{i=1}^{K} q_i \, \delta(x) \, \delta_i(z).$$
Because $h(x, z; q_1, \ldots, q_K)$ is expressed as a convex combination of the $K$ probability density functions $\{\delta(x)\delta_i(z)\}_{i=1}^{K}$, the subset $D$ is a $(K - 1)$-dimensional mixture family.

B The e-projection onto D for the MEM algorithm
This appendix shows that the optimization problem expressed by (19) derives the optimal solution for $q^{(k)}$. The divergence $D^{(e)}\bigl(l(\cdot, \cdot; m^{(k)}) \,\|\, h\bigr)$ can be decomposed into two terms, as expressed in (43). Because the variables $\{q_i\}_{i=1}^{K}$ appear only in the first term, the optimization problem expressed by (19) is equivalent to minimizing that term under the constraint $\sum_{i=1}^{K} q_i = 1$. Solving this constrained problem with a Lagrange multiplier $\lambda$, whose value is derived from the constraint $\sum_{i=1}^{K} q_i^{(k)} = 1$, the optimal solution is expressed as
$$q_i^{(k)} = \frac{\pi_i f_i\bigl(m^{(k)}\bigr)}{\sum_{j=1}^{K} \pi_j f_j\bigl(m^{(k)}\bigr)}.$$
This solution satisfies the constraints $q_i \geq 0$ for all $i = 1, \ldots, K$ and $\sum_{i=1}^{K} q_i = 1$.

C The m-projection onto M for the MEM algorithm
This appendix shows that the optimization problem expressed by (20) is equivalent to the one expressed by (22).
The divergence $D^{(m)}\bigl(h(\cdot, \cdot; q_1^{(k)}, \ldots, q_K^{(k)}) \,\|\, l\bigr)$ can be decomposed into four terms, as expressed in (45). Because the variable $m$ appears only in the fourth term, the optimization problem expressed by (20) is equivalent to the one expressed by (22).

D Proposed model manifold for the MLR model is in a curved exponential family
This appendix shows that the proposed model manifold expressed by (25) is in a curved exponential family. In this paper, we adopt $\phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)$ as the kernel function.

We can expand $\log \phi_h\bigl(\varepsilon - (y_i - x_i^\top \beta)\bigr)$ as
$$\log \phi_h\bigl(\varepsilon - (y_i - x_i^\top \beta)\bigr) = -\frac{\varepsilon^2}{2h^2} + \frac{(y_i - x_i^\top \beta)\,\varepsilon}{h^2} - \frac{(y_i - x_i^\top \beta)^2}{2h^2} - \log\bigl(\sqrt{2\pi}\,h\bigr).$$
Then, $\log g(\varepsilon, z; \beta)$ can be transformed into the standard form of an exponential family. Regarding the coefficients of the sufficient statistics as natural parameters, $y_i - x_i^\top \beta$ is expressed in terms of the natural parameters, and the last term in (47) is replaced with $\psi(\theta)$. Thus, $\log g(\varepsilon, z)$ can be expressed by the natural parameters and $\psi(\theta)$. This shows that $M$ expressed by (25) is embedded in an exponential family; since $\beta \in \mathbb{R}^p$ parameterizes a smooth submanifold of this family, $M$ is a curved exponential family.
Here, we derive the expectation parameters: In order to obtain the dual potential ϕ(η), we express log h as a function of expectation parameters: Finally, the dual potential ϕ(η) is expressed as follows: The derived natural parameters, expectation parameters, potential function, and dual potential are shown to satisfy the relationship in (12):

E Proposed data manifold for the MLR model is in a mixture family
This appendix shows that the data manifold $D$ for the MLR model introduced in Sect. 5 is in a mixture family. An element of $D$ is expressed as
$$h(\varepsilon, z; q_1, \ldots, q_N) = \sum_{i=1}^{N} q_i \, \delta(\varepsilon) \, \delta_i(z).$$
Because $h(\varepsilon, z; q_1, \ldots, q_N)$ is expressed as a convex combination of the $N$ probability density functions $\{\delta(\varepsilon)\delta_i(z)\}_{i=1}^{N}$, the subset $D$ is an $(N - 1)$-dimensional mixture family.
F The e-projection onto D for the MLR model

This appendix provides the detailed calculation of the e-projection in (29). In the same manner as Appendix B, solving the constrained minimization with a Lagrange multiplier yields the optimal solution (30). This solution satisfies the conditions $q_i \geq 0$ for all $i = 1, \ldots, N$ and $\sum_{i=1}^{N} q_i = 1$.

G m-projection onto M for the MLR model
We show that the optimization problem expressed by (31) is equivalent to the one expressed by (32). We can expand D (m) (h||g) to The divergence D (m) (h||g) can be decomposed into four terms. Because only the fourth term in (50) includes β, the optimization problem expressed by (31) is equivalent to

H Influence function of β̂

We formally derive the influence function of $\hat\beta$. We first define the $\psi$-type M-estimator:

Definition 1 ($\psi$-type M-estimator) The $\psi$-type M-estimator $T(F)$ is defined as a solution of the following equation with respect to $\theta$:
$$\int \psi(z; \theta) \, dF(z) = 0.$$
From (54), we obtain the following.

Proposition 2 (Influence function of the MLR coefficient estimator defined by IRLS) Let us consider the estimator $\hat\beta$ defined by the EM algorithm in Sect. 2.2. This estimator can be expressed as a functional of $F_{X,Y}$, where $F_{X,Y}$ is a probability measure for the random variables $X, Y$. With the Dirac measure $\Delta_{u,v}$, the empirical measure $F_{X,Y} = \frac{1}{n} \sum_{i=1}^{n} \Delta_{x^{(i)}, y^{(i)}}$ leads to the original form of (58).