Holonomic extended least angle regression

One of the main problems studied in statistics is the fitting of models. Ideally, we would like to explain a large dataset with as few parameters as possible. There have been numerous attempts at automatizing this process. Most notably, the Least Angle Regression algorithm, or LARS, is a computationally efficient algorithm that ranks the covariates of a linear model. The algorithm is further extended to a class of distributions in the generalized linear model by using properties of the manifold of exponential families as dually flat manifolds. However this extension assumes that the normalizing constant of the joint distribution of observations is easy to compute. This is often not the case, for example the normalizing constant may contain a complicated integral. We circumvent this issue if the normalizing constant satisfies a holonomic system, a system of linear partial differential equations with a finite-dimensional space of solutions. In this paper we present a modification of the holonomic gradient method and add it to the extended LARS algorithm. We call this the holonomic extended least angle regression algorithm, or HELARS. The algorithm was implemented using the statistical software R, and was tested with real and simulated datasets.


Introduction
In model selection, one would ideally want to choose a statistical model that fits the data well, while still being simple enough to allow meaningful interpretations and explanatory power. In this paper we consider model simplification of linear and generalized linear models, where we want to choose a subset of the covariates to include in the model.
In two decades, there have been many advances in sparse modeling. One of the most famous methods is L1-regularization: Least Absolute Shrinkage and Selection Operator (LASSO [19]). LASSO is defined only for the normal linear regression problem. However, the idea of LASSO has been applied to many other problems. For example, Park & Hastie [16] considered the generalized linear models and Yuan & Lin [20] treated the Gaussian graphical models. Least Angle Regression (LARS, [5]) is an efficient algorithm for computing the LASSO solution paths. The LARS algorithm is described based on Euclidean geometry because LARS considers the normal linear regression problem. Hirose & Komaki [9] proposed the ELARS algorithm based on the information geometry of dually flat spaces. ELARS is an algorithm for estimating and selecting parameters in the generalized linear models. The idea of ELARS was applied to edge selection in the Gaussian graphical models [10] and the contingency table models [11]. Another version of geometrical extensions of LARS was given by Augugliaro, Mineo & Wit [3], and a geometrical approach to sparse modeling was also proposed in [21].
The ELARS algorithm by Hirose and Komaki [9] has a computational drawback in that it assumes that the potential function (i.e. the normalizing constant) of the underlying probability distribution function is easy to compute. This is often not the case, which motivates us to use the holonomic gradient method, a computationally efficient method for computing potential functions and their gradients, introduced by Nakayama et al. [14]. A system of linear partial differential equations is called a holonomic system if it has a finitedimensional space of solutions. Refer to Section 3 for a more precise description. If the potential function satisfies a holonomic system, we can use a modification of the holonomic gradient method to keep track of its value at each step of the algorithm and update it when needed in a computationally efficient way. We call the combined algorithm the holonomic extended LARS algorithm, or HELARS.
The main result of the paper is an implementation in R of the HELARS. We choose the truncated normal distribution as the underlying distribution, as it is simple enough to handle due to its similarities with the well-known normal distribution. Despite the truncated normal having no closed from potential function, our implementation of the algorithm does not use numerical integration. Of course, the potential function of the truncated normal distribution is nothing but the Gaussian cumulative distribution function, that is implemented as a built-in function in almost all software packages. However, since the truncated normal model is a special case of more complicated models such as the exponential-polynomial distributions [7] and the multivariate truncated normal distributions [12], our result will become a prototype of the overall method.
The paper is organized as follows. In Section 2 we review basic definitions and results concerning generalized linear models. In Section 3 we present the holonomic gradient method. Section 4 discusses the extended LARS algorithm [9] by Hirose and Komaki, and we look at what necessary changes and additions are needed for the HELARS algorithm. In Section 5 we use the truncated normal distribution as the underlying distribution, and implement the HELARS algorithm. We validate the algorithm using both real and simulated datasets. Finally, we end with a discussion of the results in Section 6.

Generalized linear models
In this section we will review some foundations of generalized linear models. We will follow [1] in our exposition.
ψ(ξ) = log e C(y)+ξ·F (y) dy Example 2.2. The normal distribution is a member of the exponential family. It has the probability density function Note that here the natural parameter is ξ = − 1 2σ 2 µ σ 2 T and F (y) = y 2 y T .
Definition 2.3. The Fisher information matrix of a distribution p(y | ξ) at a point ξ is an d × d matrix G(ξ) = (g i,j ) with entries given by Equivalently, we may write the elements of the Fisher information matrix as Next we introduce generalized linear models. Assume we have n independent observations y 1 , y 2 , . . . , y n . Each observation y i is sampled from an exponential family with scalar parameter ξ i , which will depend on a covariate vector is called the design matrix. We assume that the covariate vector x i influences the distribution of y i only via the linear predictor η i , defined as or using matrices η := Xθ, for some vector θ ∈ R d . We can also add an intercept term by definining a new design matrixX := 1 n×1 X so that η :=Xθ, for some vector θ = (θ 0 , θ 1 , . . . , θ d ) T . We will always use an intercept term throughout this paper. The final piece of a generalized linear model, the link function g(µ), determines in which way the linear predictor influences the distribution by setting where µ i is the expectaion of y i . The canonical link is the link function g for which ξ i = η i for all i = 1, 2, . . . , n.
The combination of an exponential family, design matrix and link function define a generalized linear model (GLM). The model has d + 1 parameters θ 0 , θ 1 , . . . , θ d , that we want to estimate given the response y and design matrix X. Fitting a GLM is usually more delicate than fitting a linear model. Again, the standard goal is to find the parameters θ 0 , θ 1 , . . . , θ d which maximize the (log-)likelihood. The log-likelihood of the joint distribution of n observations is L = n i=1 L i , where L i = log p(y i | ξ i ). This is maximized when all the partial derivatives ∂L ∂θ k vanish. In general, the partial derivatives will not be linear functions of θ, so we have to resort to numerical methods to compute the MLE.
A common iterative method to find the estimateθ is the Newton-Raphson method. Note that we will use this method extensively along with the holonomic gradient method in our implementation (see Section 5) for both maximum likelihood estimation and other optimization tasks. We start with an initial guess θ (0) . For each k ≥ 0, we approximate the function at the point θ (k) with a polynomial of degree 2. Finding the extremum of the approximation is easy, and we set the point reaching the extremum as the next estimate θ (k+1) .
More precisely, let θ (k) ∈ R d be the current estimate. The Taylor expansion up to the second order term at this point is where u (k) and H (k) are respectively the gradient and Hessian evaluated at Setting the derivative ofL to zero yields the value of the next estimate Given a good initial guess, the method will converge to the maximum likelihood estimate as k → ∞.

Holonomic gradient method
In this section we will describe the holonomic gradient method (HGM), first proposed by Nakayama et al. [14]. Consider first the "classical" gradient descent algorithm, which is used to find a local minimum of a function F : R n → R. Given a starting point (or initial guess) x (0) , we know that the value of the function decreases the fastest in the direction opposite to the gradient. In other words, we should choose the next point for some stepsize γ 1 > 0. Now, given a suitably chosen stepsize γ 1 , we have F (x (1) ) < F (x (0) ). We then iterate while choosing a suitable stepsize γ k > 0 at each iteration. We can terminate the algorithm when the gradient is small enough (i.e. when we are close to a local minimum), or when a certain number of iterations have elapsed. Details concerning the choice of step size and efficiency of this method will not be discussed here; see for example [2]. The issue with this method is that it requires computing the gradient ∇F (x (k) ) at each step. In many statistical applications, the function we want to optimize will be a likelihood function, which will in some cases contain an integral that does not have a closed form expression, and has to be computed using numerical methods. As discussed previously, one such example is the 1-dimensional truncated normal distribution.
The holonomic gradient method takes a different approach to function minimization. The main idea is still the same: we use the same iterative step as in the classical gradient descent The difference is how we compute the gradient. We will construct a vector Q and a set of matrices P i to form a Pfaffian system The vector Q will be chosen so that the gradient ∇F (x (k) ) is easily recoverable, typically ∇F (x (k) ) = A(x (k) )Q(x (k) ) for some matrix A with entries in C(x 1 , . . . , x n ). With the gradient, we can determine the next point x (k+1) using (4). Given Q(x (k) ), the value of Q in the previous step, we can compute its value in the next step Q(x (k+1) ) by solving the Pfaffian system (5) using standard numerical ODE solvers.
Observe that we can also implement the Newton-Raphson method using the holonomic gradient framework. The update step will be and we can also recover the Hessian easily from the Pfaffian system, since there is a matrix B(x (k) ) with elements in C(x 1 , . . . , x n ) such that Hess(F )(x (k) ) = B(x (k) )Q(x (k) ).

Rings of differential operators
Let C(x 1 , . . . , x n ) denote the ring of rational functions.
Definition 3.1. The ring of differential operators with rational function coefficients, denoted R n , is where the operator ∂ i corresponds to differentiation with relation to x i , i.e.
and the operators x i just multiply by x i .
Note that the "multiplication" operation inside the ring is actually a composition of operators. Since every element in R n is an operator, there is a natural action • on the set C ∞ of smooth functions. If f (x 1 , . . . , x n ) ∈ C ∞ , then R n is not a commutative, since because of the chain rule. When i = j, everything commutes: Because of these commutation rules, any element in R n can be written as a sum of terms with the ∂ i on the right of each term: where α is a multi-index, ∂ α = ∂ α1 1 ∂ α2 2 · · · ∂ αn n and c α ∈ C(x 1 , . . . , x n ) with only finitely many nonzero c α . Most standard theorems and algorithms from algebraic geometry in the regular polynomial ring C[x 1 , . . . , x n ] carry over to R n with minor modifications.
In particular, Macaulay's theorem will be useful in the next section. Let ≺ be a term order on the differential operators ∂ i . Let f = α c α ∂ α ∈ R n . The leading term of f is the term LT(f ) = c α ∂ α such that ∂ α ≺ ∂ α for all α = α such that c α = 0. For an ideal I ⊂ R n , we define LT(I) as the set of all leading terms in I.
is a basis of R n /I as a vector space over R n .
We say an ideal I ⊆ R n is 0-dimensional if there are finitely many standard monomials. A necessary and sufficient condition for I to be 0-dimensional is that for all i the following holds For more details on computations on rings of differential operators, see [8].

Pfaffian systems
Let f (x 1 , . . . , x n ) ∈ C ∞ be a function, and ∈ R n . When • f = 0, we say that f is annihilated by . We say that f is annihilated by an ideal I ⊂ R n if f is annihilated by all ∈ I. Observe that if I = 1 , . . . , s , f is annihilated by I if and only if f is annihilated by all i . Assume that I is 0-dimensional, and let s 1 = 1, s 2 , . . . , s r be the standard monomials, which are the generators of R n /I. For all 1 ≤ i ≤ n and 1 ≤ j ≤ r, we can look at the image of the operator ∂ i s j in the quotient R n /I (under the canonical map p → p + I), and write it as a C(x 1 , . . . , x n ) linear combination of the basis elements where p i jk ∈ C(x 1 , . . . , x n ) for all 1 ≤ i ≤ n and 1 ≤ j, k ≤ r. Thus, if we define the vector S = (s 1 , s 2 , . . . , s r ) T , then for each i, there is a matrix P i such that Since f is annihilated by all elements in I, Equation (7) is true when we replace S by Q. We get the following system of differential equations the Pfaffian system. Because we chose s 1 = 1, the gradient of f can be recovered from the first elements of each equation in the Pfaffian system By following the procedure above, one can construct a Pfaffian system given a 0-dimensional ideal annihilating our function f . This is indeed desirable, since finding a 0-dimensional annihilating ideal is often easier than to find a Pfaffian system from scratch. One noteworthy fact is that if f has a holonomic annihilating ideal, then its integral over one variable f dx i also has a holonomic annihilating ideal. This is extremely useful when using the holonomic gradient method in maximum likelihood estimation, since the normalizing constant will usually contain an integral. Oaku [15] describes an algorithm for computing the (holonomic) annihilating ideal of an integral.

Holonomic Extended LARS
In this section, we will describe the holonomic extended least angle regression algorithm. We will also compute explicit forms for coordinate conversion functions between the e-affine and m-affine coordinates, Fisher information matrix, and divergence for the manifold used in Hirose and Komaki [9].
Consider a set of observed data {y a , x a = (x a 1 , . . . , x a d ) | a = 1, 2, . . . , n}, where y = (y 1 , . . . , y n ) T is the response vector, and X = (x a j ) is the design matrix, which has dimensions (n × d). We will also add an intercept term to the model, so the design matrix becomesX = 1 n×1 X . We will consider exponential families of the form We will define some notation. Let ξ be the natural parameter, a n + r sized vector containing elements ξ i . We can split ξ into two subvectors: we call ξ the subvector containing the first n elements, i.e. ξ = (ξ a ) n a=1 , and we call ξ the subvector containing the last r elements, i.e. ξ = (ξ b ) n+r b=n+1 . Hence ξ = (ξ , ξ ) T . The function ψ * (ξ) is the potential function of ξ, and it is equal to the logarithm of the normalizing constant of the distribution In a generalized linear model with canonical link function, the natural parameter is related to linear predictor by ξ =Xθ , where θ = (θ 0 , θ 1 , . . . , θ d ) T is a parameter vector. In addition, as in [9], we require r additional parameters that are equal to ξ . Hence we can write θ = ξ , and define θ = (θ , θ ). Equation (8) thus becomes where the potential function of θ is ψ(θ) = ψ * (Xθ , θ ). Alternatively, define the sufficient statistic Y = (y 1 , . . . , y n , u 1 (y), . . . , u r (y)) T , and an (n + r) × (d + r + 1) block-diagonal matrix where 0 n×m and I n×n are respectively the (n × m) zero matrix and the (n × n) identity matrix. Then we have the identities and the probability density function becomes The ξ coordinate, being the natural parameter of an exponential family, is the e-affine coordinate of the model manifold. The corresponding m-affine coordinate µ is the expectation parameter µ = E[Y ] and its potential function is defined as φ * (µ) = ξ ·µ−ψ * (ξ). The model manifold defined by the coordinates θ is a submanifold of the model defined by the ξ coordinates. The e-affine coordinate of this submanifold is θ, and there is a dual m-affine coordinate η which is related to µ by Note that the Fisher information matrix of the model in (8) is equal to the Hessian of the potential function ψ * (ξ), denoted G * = (g * i,j ). Similarly, denote the Hessian of the potential function of the m-affine coordinates φ * (µ) as the matrix G * = (g i,j * ). Since ξ and µ are dual coordinates, the matrices G * and G * are inverses of each other.
The Hessian G(θ) = (g i,j ) of the potential function ψ(θ)(= ψ * (Xθ , θ )) can be recovered using the chain rule: Using the identities in (9) and (10), we see that Thus (11) becomes We denote elements of its inverse with superscripts: G −1 = (g i,j ). Similar to the previous case, we have Remark. In subsequent sections we will make extensive use of matrix and vector differentiation. The convention in [13] will be used: the shape of ∂f ∂x depends either on the shape of f or the shape of x T . For example, differentiating a scalar by a length n column vector yields a length n row vector Differentiating a length m column vector by a scalar yields a length m column vector Finally, differentiating a length m column vector by a length n column vector yields an (m × n) matrix, the Jacobian.
This notation allows the natural use of the chain rule for derivatives ∂f (g(x)) with the usual matrix multiplication between the two terms on the right hand side. Furthermore, we can express the Hessian of a scalar valued function f (x) as Hess(f ) = ∂ 2 f ∂x∂x T
We will use Newton's method to find the root of the function Proposition 4.1. The Jacobian Jac(F ) of F in (13), has columns (12)), and the vector Since the ith element ofη(ηJ ) is simply η i and none of the other elements depend on η i , we have ∂F ∂ηi = e i .
Next let i ∈ J. Using the chain rule we get Newton's method will iteratively output a vector (θ J , ηJ ) with the following update step where the Jacobian and F are evaluated at (θ J , ηJ ) (k) . Given a suitable initial guess, the method converges very quickly.

Extended least angle regression algorithm
We describe shortly the algorithm by Hirose and Komaki [9]. Let I be the set containing the indices of covariates present in the model. We first start with the model containing all covariates, that is I = {1, 2, . . . , d}, and compute the maximum likelihood estimateθ MLE . In addition, we also compute the maximum likelihood estimateθ ∅ of the empty model, i.e. the model where θ 1 , . . . , θ d = 0. We will work in the d dimensional submanifold of S and setθ (0) :=θ MLE and k = 1. For each i ∈ I, letθ i be the m-projection of the current pointθ (k) to the e-flat submanifold corresponding to θ i = 0. Let i * be the coordinate which has smallest divergence between the pointθ (k) and its m-projectionθ i , and let this divergence be t * . Now for each i ∈ I, look at the m-geodesic connectingθ (k) andθ i , and find the point θ * i along that geodesic that has divergence t * from θ (k) . The estimate for the next stepθ (k+1) is constructed as follows: for all i ∈ I, set the ith coordinate ofθ (k+1) to the ith coordinate of θ * i , and for all j ∈ I, set the jth coordinate ofθ (k+1) to 0. Notice that the i * th coordinate will also be 0. We now remove i * from the list of "active" covariates I, and restart at the beginning of the paragraph, this time in the submanifold We quit the algorithm after d steps, when no covariates are left. The divergence funtion used in the submanifold M I , which we will denote by D [I] , is the restriction of the KL-divergence on M onto the submanifold. We compute it as follows: denote the coordinates in M I as θ I = (θ i ) i∈I and η I = (η i ) i∈I . The potential functions become ψ I (θ I ) = ψ(θ I , 0J ) and φ I (η I ) = η I · θ I − ψ I (θ I ).
The divergence is then The algorithm starts from the full model and proceeds step by step towards the empty model, which is the opposite direction compared to the LARS. Other than that, the geometric idea of the algorithm is the same as in LARS: at each step k, we move the current estimateθ (k) towards the origin, in a direction that bisects the m-geodesics corresponding to each m-projection. We hit the next estimateθ (k+1) exactly when the first of the coordinates i ∈ I of the vectorθ (k) hits 0.
The following pseudocode describes the algorithm more precisely. The algorithm is described in the submanifold M of (14), so we do not write down coordinates 0, d+1, . . . , d+r explicitly. We input the data (observations and design matrix) and an underlying distribution (essentially the functions u 1 (y), . . . , u r (y) in (8)), and we get as an output a sequence of estimators (θ (0) , . . . ,θ (d) ), where the estimator obtained in the kth step corresponds to a model with k covariates removed.
We can now rank the covariates in order of importance by looking at the output of the algorithm. The zeroth estimatorθ (0) was defined as the maximum likelihood estimate of the full model containing every covariate, and at each subsequent estimator, one of the components will vanish, i.e. the kth estimator θ (k) will have exactly k of its elements equal to zero. The element that vanishes corresponds to the covariate that is deemed the least impactful at that particular step. Thus by looking at the order in which the covariates vanish in the sequencê θ (0) , . . . ,θ (d) , we can order the covariates from least to most important.

Adding holonomicity
We will focus our attention to the potential function ψ(ξ) in (8), which can be written as Whether or not ψ * (ξ) has a closed form representation depends on the underlying distribution. We could use numerical integration if no closed form expression for ψ * (ξ) exists, but such an approach is computationally inefficient.
Instead, we assume that the potential function can be written as where G ψ is an elementary function with easily computable derivatives, and L(ξ) is a scalar or vector valued function with a set of Pfaffian systems ∂Li ∂ξj = H(ξ, L(ξ)) i,j , or using matrix notation
Now we obtain the gradient of ψ * (ξ) as a function of ξ and L(ξ) The derivatives ∂G ψ ∂ξ and ∂G ψ ∂L are easily computed since we are assuming that G ψ is an elementary function. We can also write the Fisher information matrix, which is equal to the Hessian of ψ * (ξ), by differentiating (16).

Holonomic update of the vector L
Nearly every step of the algorithm requires the knowledge of the vector L at some point P with coordinates ξ. For example in the case of the truncated normal distribution in Example 4.2, computing the vector L requires n separate numerical integrations. Using numerical methods to compute L(ξ) at every step is computationally costly. Fortunately we have a Pfaffian system for every element L a (ξ) in (24). Given another point ξ old and L(ξ old ), we can use standard ODE solvers such as Runge-Kutta to find the value of L(ξ) at some other point ξ. In the implementation we use the R package hgm [18] by Takayama et al. , which uses the RK4(5)7 method from Dormand and Prince [4].
We can also find the value of L after a change in θ coordinates. Since Again, if both θ old and L(θ old ) are known, then we can use numerical ODE solvers to obtain L(θ).
There are also cases where we need to conduct the holonomic update step in terms of mixed coordinates. Let J ⊂ {0, 1, . . . , d + r} and assume P old = (θJ old , η J old ) and L(P old ) are known. In order to obtain L(P ) for some other P = (θJ , η J ), we will need to find a Pfaffian system for L in terms of the mixed coordinates Theorem 4.3. Let ∅ = J {0, 1, . . . , d+r} be a nonempty, strict subset and let ρ = (θJ , η J ) denote a mixed coordinate. For a vector ρ with d + r + 1 elements, let ρ J denote the subvector (ρ j ) j∈J , and similarly ρJ = (ρ j ) j ∈J . Let be the function that maps the mixed coordinates ρ to the θ coordinate. Then where the derivative. By the chain rule, we obtain the first part of the theorem By definition, ρ J = η J and ρJ = θJ , and the function θ * satisfies the following identities Let i ∈J and j ∈ J. Then clearly since the components of ρ do not depend on each other. Likewise, if both i, j ∈J, then Hence we get ∂θ * J ∂ρ J = 0 |J|×|J| and ∂θ * J ∂ρJ = IJ ×J . Differentiating both sides of (18) by ρ J , we have where the right-hand side becomes ∂η J ∂η J = I, and the left-hand side becomes ∂η(θ * (ρ)) J ∂ρ J = ∂η(θ * (ρ)) J ∂θ * (ρ) Finally, differentiate both sides of (18) by ρJ to get The right hand side is equal to ∂η J ∂θJ = 0, since once again the elements of ρ do not depend on each other. The left-hand side becomes Finally, ∂L ∂ρ is indeed a function of ρ and L(ρ), since ∂L ∂θ , ∂η ∂θ = ∂ 2 ψ(θ) ∂θ∂θ T is a function of ρ and L(ρ) 1 based on the discussion in the beginning of Subsection 4.3.

Holonomic m-projections
Using the results of Theorem 4.3 we can now carry out m-projections and recover the vector L at the projected point given the value of L at the previous point. Let ∅ = I ⊂ {0, 1, . . . , d + r}, i ∈ I and α ∈ R. In the algorithm, all of the m-projections will be to the space M (i, α, I) = {θ | θ i = α, θ j = 0 (j ∈ I)}. Let a point P have the dual coordinates θ and η. The m-projection of P onto M (i, α, I) will have the mixed coordinates ρ = (θ I , θ i = α, ηĪ \{i} ). In other words, we first convert the point P to mixed coordinates according to the set I ∪ {i} to get ρ 0 = (θ I∪{i} , η I∪{i} ), and then send the element θ i to α to get ρ. Given L(ρ 0 ) (= L(P )) and the Pfaffian system, we may now use Theorem 4.3 to obtain L(ρ), and thus recover the full θ coordinates from the mixed coordinates.

Holonomic extended LARS algorithm
The holonomic extended LARS algorithm is our main result. The algorithm works exactly as the extended LARS algorithm described in Subsection 4.2, but now we have also to specify a Pfaffian system for L(ξ) as an input in addition to the data (response y and design matrix X) and the underlying distribution (u 1 (y), . . . , u r (y)). Again, we describe the algorithm in the submanifold M ⊂ S (see (14)), so we will mostly ignore the coordinates indexed by 0, d + 1, . . . , d + r in the vectors µ and θ. We will only compute them at the end of step 5, because they are needed for the initial guesses of the numerical solvers.

For all
and calculate the holonomic m-projection θ(i, I) ofθ (k) on M (i, 0, I) and obtain the vector L(θ(i, I)).
As in Subsection 4.2, by looking at the order in which the covariates vanish in the sequenceθ (0) , . . . ,θ (d) , we can determine the order of importance of the covariates.

A worked out example: the truncated normal distribution
In this section we will work out the implementation of the Holonomic Extended Least Angle Regression algorithm with the truncated normal distribution. The algorithm was implemented in the R programming language [17], and the code can be found in [6].

Introduction
The truncated normal distribution is defined as the restriction of the normal distribution to the positive real axis. Its probability density function is where y ∈ (0, ∞), µ ∈ R and σ ∈ (0, ∞). By expanding, we can write the probability density function as a function of natural parameters where ξ 1 = µ σ 2 and ξ 2 = − 1 2σ 2 . From the form above we see that the truncated normal distribution belongs to the exponential family. Note also that the normalizing constant A(ξ 1 , ξ 2 ) = ∞ 0 e ξ1y+ξ2y 2 dy does not in general have a closed form, and converges if and only if ξ 2 < 0. A generalization of the truncated normal distribution are the exponential-polynomial distributions, of the form f (y | ξ 1 , . . . , ξ d ) = exp(ξ n y n + · · · + ξ 1 y) ∞ 0 exp(ξ n y n + · · · + ξ 1 y) dy for y > 0 and ξ n < 0. This family of distributions and their usage with the holonomic gradient method has been studied in Hayakawa and Takemura [7].
We can naturally construct a generalized linear model using the canonical link where each observation is distributed according to the truncated normal distribution. Given a sample y = (y 1 , . . . , y n ), assume that each y i is independent and distributed according to a truncated normal distribution with a unique mean parameter µ i and a common variance parameter σ 2 . Hence, using the notation in equation (19) each observation has their own ξ 1 parameter, and ξ 2 is the same in each observation. To make the notation consistent with [9], for each i = 1, . . . , n, the "ξ 1 parameter" of observation i will be called ξ i and the common "ξ 2 parameter" will be called ξ n+1 . With this notation, each observation will have the distribution and since every observation is independent, the joint distribution of y is p(y 1 , . . . , y n | ξ 1 , . . . , ξ n , ξ n+1 ) = e n a=1 ξ a ya+( n a=1 y 2 a )ξ n+1 n a=1 A(ξ a , ξ n+1 ) In the generalized linear model, each observation y i is explained by a set of d explanatory variables x i 1 , . . . , x i d . With the canonical link function in particular, the natural parameter is simply an affine combination of the explanatory variables, i.e. for some real numbers θ 0 , θ 1 , . . . , θ d , we have ξ i = θ 0 +θ 1 x i 1 +· · ·+θ d x i d . We will now define some notation. Let X = (x i j ) be the (n × d) design matrix, θ = (θ 0 , θ 1 , . . . , θ d ) T and ξ = (ξ 1 , . . . , ξ n ) T . IfX = 1 n X , where 1 n is a column vector of size n where each element is 1, then we have ξ =Xθ . We can also define a block-diagonal matrix X B and vector Y as As we defined in Section 4, we have θ n+1 = ξ n+1 . If we set θ = (θ 0 , . . . , θ d+1 ) T and ξ = (ξ 1 , . . . , ξ n+1 ) T we have ξ = X B θ, and we can write the pdf of the model as where A a (ξ) = A(ξ a , ξ n+1 ) = ∞ 0 exp(ξ a y + ξ n+1 y 2 ) dy is the normalizing constant of the ath observation, and ψ(θ) = ψ * (X B θ) = n a=1 log A a (X B θ) is the potential function.

Normalizing constant as a holonomic system
Next we construct a holonomic system for the normalizing constant. We denote the differential operators by the symbol ∂ with the appropriate subscript. For example, we denote ∂ ∂ξ n+1 by ∂ ξ n+1 . We will also omit the symbol • used to denote the application of an operator to a function when its usage is clear from context. In addition, any subscript or superscript a will take integer values in [1, n].
We start by looking at the function which is defined when ξ n+1 < 0. Any partial derivative of A a can be expressed as a partial derivative in terms of ξ a , since Furthermore, we can use integration by parts on A a to get and hence the following partial differential equation holds From equations (21) and (22) we can derive the gradient A a Let L a (ξ) = log A a (ξ) for all a = 1, 2, . . . , n. Since ∂La ∂ξ = 1 Aa ∂Aa ∂ξ we can derive a Pfaffian system for L a , and hence we can obtain the gradient of the potential function ψ * (ξ) = n a=1 L a (ξ) In addition to the gradient of ψ * (ξ), we will also need its Hessian, i.e. the matrix of second derivatives, once again as a function of L a (ξ).
Now clearly for a, b = 1, . . . , n and a = b, we have ∂ ξ a ∂ ξ b A a = 0. By (21), the second derivative of A a by ξ a is equal to the derivative by ξ n+1 . Similarly, ∂ ξ a ∂ ξ n+1 A a = ∂ 3 ξ a A a and ∂ 2 ξ n+1 A a = ∂ 4 ξ a A a . Using these we derive the Hessian of ψ * . Again, let a, b ∈ 1, . . . , n and a = b. Then The Hessian of ψ * is indeed a function of ξ and L(ξ) = (L 1 (ξ), L 2 (ξ), . . . , L n (ξ)) T , since ∂ ξ a Aa Aa = ∂ ξ a L a is a function of ξ and L a (ξ) by equation (24)

Maximum likelihood estimation
Next we will discuss details regarding maximum likelihood estimation of the model in equation (20). The log-likelihood is easily obtained from equation (20) We will use the Holonomic Gradient Method to find the maximum likelihood estimate. Since the Hessian matrix of (θ | y) is easily obtained, we will use the Newton-Raphson method. Since ψ(θ) = ψ * (X B θ), we can use matrix calculus to obtain the Hessian and gradient of the log-likelihood function. Indeed, since the gradient is ∂ψ ∂θ = ∂ψ * ∂ξ · X B and the Hessian is ∂ 2 ψ ∂θ∂θ T = X T B · ∂ 2 ψ * ∂ξ∂ξ T · X B , we get the gradient and Hessian of the log-likelihood function as follows There are some numerical issues to consider when using the method outlined above for maximum likelihood estimation. Let θ (k) be approximation of the maximum likelihood estimate at the kth iteration of the Newton-Raphson method. The next estimate is expressed as θ (k+1) = θ (k) +∆, and the difference ∆ is obtained by solving the linear system H (θ (k) )∆ = −∇ (θ (k) ).
However, there are times where the Newton-Raphson method is "too violent", and yields a ∆ of large magnitude, meaning that θ (k) and θ (k+1) are relatively far apart. This in turn increases the error in the holonomic update. Furthermore, there are cases where the Newton-Raphson method yields an iterate which does not belong to the model, i.e. when θ (k+1) d+1 ≥ 0. In our implementation we solve the problem by introducing a small step γ when the Newton-Raphson method yields an estimate that is either too far from the previous estimate, or an estimate not belonging to the model.

Coordinate conversions
As described in Section 4, we have two sets of e-affine coordinates, ξ and θ, and m-affine coordinates, µ and η, along with their potential functions, respectively ψ * (ξ), ψ(θ), φ * (µ), and φ(η). The two sets of coordinates are related with Let P be a point on the manifold (20), and assume the vector L(P ) (the length n vector of the logarithm of normalizing constants of each observation) is known. Given the ξ coordinates of P , we can recover its µ coordinates from equations (24) and (25) since µ i = ∂ψ * ∂ξ i . Hence We can also invert (28) to get the coordinate conversion from µ to ξ The conversion θ to η is also simple, since we can just compose the transformations in (27) and (28), i.e. η(θ) = X T B µ(X B θ). Next we will tackle mixed coordinate conversions. As in Subsection 4.1, let J ⊆ {0, 1, 2, . . . , d + r},J = {0, 1, 2, . . . , d + r} \ J and let P = (η J , θJ ) denote a mixed coordinate. Additionally, assume that the value of the vector L is known at point P .
Newton's method applied to the function F in (13) will output ηJ and θ J at the same time, thus allowing us to recover the full θ and η simultaneously. With the truncated normal distribution, using Newton's method to convert mixed coordinates converges very quickly given a suitable initial guess. Fortunately, there are a few convenient initial guesses that work well. Mixed coordinate conversion is needed in three different situations in the algorithm described in Subsection 4.4: 1. m-projections (steps 2, 4). Use the point before the projection as an initial guess.
2. updating L (steps 2, 4, 5). Use the point before the update as the initial guess.
3. the "wrap-up step" (step 5). Use the estimateθ (k) of the current iteration as the initial guess.
We note again that there are cases where Newton's method outputs a point (θ J , ηJ ) (k+1) = (θ J , ηJ ) (k) + ∆ (k) that does not belong to the model. 2 In our implementation, we simply iteratively half the step ∆ (k) until the resulting point (θ J , ηJ ) (k+1) is satisfactory. Such a scaling of the Newton step is required if d + 1 ∈ J and the element (θ d+1 ) (k+1) in (θ J , ηJ ) (k+1) becomes positive. More precisely, in this case the next iterate becomes

Computational details
In order to not end up with nearly singular matrices in the algorithm, we will sometimes have to rescale both the design matrix and the response vector. We center and rescale each covariate such that the mean becomes 0 and the standard deviation becomes 1. In other words, if x i is the ith column of the design matrix X, the scaling maps . Note that as in [9], scaling and centering and scaling the design matrix will not affect the result of the algorithm. In addition, we will scale the response vector y such that the sample standard deviation equals 1 .
These scaling operations allow us to keep the orders of magnitude of the elements in the ξ and µ coordinates roughly equal, which in turn make the orders of magnitude of the elements in the θ and η coordinates roughly similar. This is needed when doing actual computations, since otherwise many operations involving mixed coordinates (for example the matrix Jac(F ) in Proposition 4.1) will end up nearly singular, with certain columns several orders of magnitude larger than others.

Results
First, we use a simulated dataset to test the algorithm. We will use d = 3 covariates X 1 , X 2 , X 3 , and n = 1000 observations. As a first test, we will simulate three uncorrelated covariates. For each observation, each covariate is independently sampled from a uniform distribution between [0, 1], and the response is sampled from a truncated normal distribution with mean parameter X 1 + X 2 + X 3 , and variance σ 2 = 1. The result of the HELARS algorithm applied to the simulated data is depicted in Figure 1. The algorithm starts on the right, where the value of each parameter is equal to the maximum likelihood estimate of the full model. At each iteration, we compute the divergence of the current parameters compared to the empty model, and we plot the value of each parameter. The result is as expected: the algorithm sees each covariate as roughly equally important, since they go to zero very close to each other and their value decreases at roughly the same rate. The order in which the covariates go to zero is fully determined by the value of the MLE estimator in the full model. For example since X 3 has the smallest coefficient in the full model and it is uncorrelated with the other covariates, it is deemed the least important.
Next, we will introduce correlation between X 1 and X 2 , and leave X 3 uncorrelated. The covariates X 1 and X 3 will once again be sampled from a uniform distribution between [0, 1], but X 2 = X 1 + ε, where ε ∼ N (0, 1/4). Again, the response will be sampled from a truncated normal distribution, with mean parameter X 1 + X 2 + X 3 and variance parameter 1. The path of the covariates is in Figure 2. We see that X 2 , one of the two correlated covariates, goes relatively quickly to zero relative to the others, whereas X 1 and X 3 are deemed to be equally important. One possible interpretation is that X 2 is redundant since X 1 already carries the same information, so it is quickly eliminated. Once X 2 is eliminated, the information of both covariates X 1 and X 3 is needed, since they are independent. This is also visible when looking at the sum of squared  Table 1. Since we know that X 1 and X 2 are heavily correlated, one of them is redundant and should be removed first. We see that X 2 should be removed first, since {X 1 , X 3 } has less error than {X 2 , X 3 }. The difference of SSE in the model {X 1 } and {X 3 } is due to the fact that the effect of X 1 is essentially seen as doubled in the response: recall that the response Y ≈ X 1 + X 2 + X 3 , and since there is a strong positive correlation between X 1 and X 2 , we have Y ≈ X 1 + X 1 + X 3 . Next, we used the Diabetes dataset used in the original LARS paper [5] and the extended LARS paper [9]. Assuming the truncated normal distribution as the underlying distribution of each observation, the values ofθ obtained from the holonomic extended LARS algorithm are plotted in Figure 3. The algorithm ordered the covariates in the following order, from least to most important: θ 1 , θ 7 , θ 8 , θ 10 , θ 6 , θ 2 , θ 4 , θ 5 , θ 3 , θ 9 . We can compare the output of the HELARS algorithm to the output of the ELARS algorithm, depicted in Figure 4. In the ELARS algorithm we assume that the underlying distribution is the normal distribution, which is why the output looks slightly different. The ELARS algorithm ordered the covariates in the following order: θ 1 , θ 7 , θ 10 , θ 8 , θ 6 , θ 2 , θ 4 , θ 5 , θ 3 , θ 9 . While the path is different to the truncated normal case, the ordering of variables is almost exactly the same, with the exception of θ 8 and θ 10 being flipped. Table 1: Sum of square errors (SSE) using models consisting of every possible subset of covariates. We use a simulation of 1000 observations, with covariates X 1 and X 2 correlated, and X 3 independent of the rest. Subset

Discussion
In this manuscript, we presented the holonomic extended LARS algorithm, and successfully implemented in in R. the dually flat structure is still useful even when the potential function is not easy to compute. The HELARS implementation is slower than the ELARS implementation due to the overhead caused by keeping track of L and constantly updating it using the holonomic gradient method. The benefits of using holonomicity are most visible when the potential function does not have a closed form expression. Then we can either find a Pfaffian system for the potential function by hand, as we did in our truncated normal distribution example, or use the theory of D-modules to construct the Pfaffian system from a holonomic ideal annihilating the potential function. Since in exponential families the potential function is the integral of an exponential function, finding the annihilating ideal is relatively easy in many cases. We can then use the integration algorithm [15] to get the annihilating ideal of the integral.
Finding the function G ψ in Equation (15) satisfying the necessary conditions can also be problematic. At the moment, we have to find it from scratch for every distribution considered. Because finding an elementary enough G ψ is a very non-trivial task, an algorithm that could automatically output such a function would improve the usability of the HELARS algorithm. Also since the algorithm can only handle a certain class of generalized linear models using the canonical link function, a natural next step would be to extend it to an arbitrary generalized linear model.