On the exact maximum likelihood inference of Fisher–Bingham distributions using an adjusted holonomic gradient method

Holonomic function theory has been success-fully implemented in a series of recent papers to efﬁciently calculate the normalizing constant and perform likelihood estimation for the Fisher–Bingham distributions. A key ingredient for establishing the standard holonomic gradient algorithms is the calculation of the Pfafﬁan equations. So far, these papers either calculate these symbolically or apply certain methods to simplify this process. Here we show the explicit form of the Pfafﬁan equations using the expressions fromLaplaceinversionmethods.Thisimprovesontheimple-mentation of the holonomic algorithms for these problems and enables their adjustments for the degenerate cases. As a result, an exact and more dimensionally efﬁcient ODE is implemented for likelihood inference.


Introduction
The Fisher-Bingham distribution is defined as the conditional distribution of a general multivariate normal distribu- Department of Mathematical Informatics, The University of Tokyo, Tokyo, Japan tion on a unit sphere. In particular, for a p-dimensional multivariate normal distribution with parameters μ and Σ, the corresponding density function with respect to d S p−1 (x), the uniform measure in the p − 1-dimensional sphere S p−1 , is is the normalizing constant. Since multiplication by any orthogonal transformation induces isometry in S p−1 , where Δ = diag(δ 2 1 , δ 2 2 , . . ., δ 2 p ) and the orthogonal matrix O ∈ O( p) are obtained from the singular value decomposition of Σ = O ΔO. Similarly, we can also choose the particular O such that entries of Δ −1 Oμ are non-negative. Hence, without loss of generality, we can assume that the covariance parameter is diagonal, and therefore, a more efficient parametrization of dimension 2 p can be used for the normalizing constant with θ = (θ 1 , θ 2 , . . . , θ p ) = diag( Δ −1 2 ), i.e. θ i = 1 2δ 2 i and γ = (γ 1 , γ 2 , . . . , γ p ) = Δ −1 Oμ. Note the slight inconsistency in notation as we write C(diag(θ ), γ ) = C(θ, γ ). The special case of γ = 0 corresponds to the Bingham distributions studied separately in Wood (1993); Kume and Wood (2007); Sei and Kume (2015). Despite the fact that these distributions are part of the exponential family (see e.g. Mardia and Jupp 2000), maximum likelihood estimation ultimately involves numerical routines for approximating the normalizing constant term C(θ, γ ). The method of Kume and Wood (2005) relies on the saddlepoint approximation which is known to be not only very close to the exact value with little computational cost but also numerically stable. However, recently there has been a renewed interest in this problem with the implementation of the holonomic gradient method (HGM), which in theory is exact since the problem of calculating C is mathematically characterized via a solution of an ODE (see e.g. Nakayama et al. 2011;Hashiguchi et al. 2013;Sei et al. 2013;Koyama 2011;Koyama and Takemura 2016;Koyama et al. 2014 andKoyama et al. 2012).
In particular, the HGM approach generates exact solutions if the corresponding ODE is numerically stable and the dimensionality of the parameters is not extremely large. Please note that, in the relevant literature, numerically unstable ODE's are called stiff (see eg 10.6 in Zarowsky 2004). Koyama et al. (2014) focus on the numerical efficiency of HGM implementation by expressing the corresponding Pfaffian equations (see Sect. 4) in terms of some elementary matrices R i and Q i . Note that HGM is applicable not only to C but also any holonomic function. See Chapter 6 of Hibi (2013) for more details.
The contribution in this paper is threefold. Firstly, by expanding the Laplace transform in Eq. (1) in partial fractions, we obtain the Pfaffian equations explicitly in terms of only two vector parameters θ and γ , each of length p. This makes the differential structure of these functions more transparent and the implementation of the holonomic algorithm more dimensionally and computationally efficient, since at most 2 p parameters are needed for the normalizing constant and there is no need to use symbolic algebra packages for generating the Pfaffians explicitly.
Secondly, by imposing some constraints on θ i and γ i , our approach is easily applied to many important sub-classes within the Fisher-Bingham family (such as the Bingham, Watson and Kent distributions). In fact, the general methodology of HGM algorithms does not automatically apply to these situations because the Pfaffian equations become degenerate. In particular, the corresponding ODE is stiff if some eigenvalues of Δ coalesce. Therefore, special attention for cases with various multiplicities in parameters is practically useful in the model selection process. Our explicit Pfaffian expressions, however, require minimal adjustments for these degenerate cases. The special case of Bingham distribution appearing when all γ i 's are zero is considered separately by Sei and Kume (2015). However, in this paper our approach is more general and accommodates all possible variations in the parameter space. Therefore, we can easily perform model selection within the Fisher-Bingham family based on the standard likelihood ratio tests. If we only need to evaluate the normalizing constant and the first-order derivatives at these degenerate points, the HGM with respect to the radius parameter can be applied as Koyama et al. (2014) and Koyama and Takemura (2016) suggested. However, if we also have to evaluate higher-order derivatives (e.g. standard errors for MLE) or apply the ODE along any general curve and not just as radial rescaling of parameters, the Pfaffian system in our paper is necessary.
Finally, while many papers focus on the normalizing constant, there has not been much interest in the estimation of the orthogonal component O from the real data. For p = 3, this problem is tackled in Kent (1982) where a closed form solution is shown for a very useful family of spherical distributions. However, for general p such a solution is not available. We combine the holonomic gradient method for the normalizing constant with that of a particular solution on orthogonal matrices O so that a maximum likelihood estimator is evaluated. This method is shown to work well in both simulated and real data examples, but special care is needed in the general setting for the Fisher-Bingham distributions due to multimodality of the likelihood function for these members of the curved exponential family.
The paper is organized as follows. We start with general remarks about the Fisher-Bingham normalizing constant where we provide a simple univariate integral representation. We then give a brief introduction to the holonomic gradient method which characterizes the the evaluation of C(θ, γ ) as a solution of an ordinary differential system of equations. The explicit expressions for the Pfaffian equations needed for such ODE in the case of the Fisher-Bingham integral are given in the next section where degenerate cases with multiplicities on the parameters are specifically addressed. We then focus on the implementation of the proposed MLE approach for both degenerate and non-degenerate cases of Fisher-Bingham distributions so that some log-likelihood ratio test can be used for choosing the appropriate model.

General case
Based on the key result in Proposition 1 from Kume and Wood (2005), one can easily derive that where f r (1) is the density at point 1 of r = p i=1 y 2 i , while y i are independent normal random variables as Since the random variable r takes non-negative values, the Laplace transform of its density is the same as its moment generating function (with a sign switch in its argument) which in our parametrization is: Applying the inverse Laplace transform for any −t 0 < min(θ ), implies where Equation (1) establishes the general Fisher-Bingham normalizing constant in terms of an univariate complex integration. In particular, it is easily seen from (1) and the definition of A(γ , θ ) that for any c ∈ R where |γ | is the vector of absolute values of γ . Therefore, without loss of generality we can assume that both vector parameters θ and γ have non-negative entries.

Degenerate cases
Constraints on the parameter values θ and γ could lead to degeneracy in the corresponding ODE. For statistical inference however, some model constraints in the Fisher-Bingham distributions are necessary for practical use. Such models induce constraints on θ and γ for the corresponding normalizing constants as follows (c.f. Mardia and Jupp 2000, Table 9.2): -Bingham distribution is generated if γ is set to zero.
In order to accommodate scenario (a), let us assume that we have l distinct values such that each θ i has multiplicity n i , i.e. n 1 +n 2 +. . .+n l = p. Let us index the corresponding n i entries of γ as γ 1,i , . . ., γ n i ,i . From the integral representation of it is clear that its value depends on only the summation terms n i r =1 γ 2 r,i and not on the particular values γ 2 r,i . This implies that for scenario (b), we can work with n i r =1 γ 2 r,i = γ 2 i and perform the required differentiation only with respect to this particular γ 1,i = γ i = n i r =1 γ 2 r,i , while the other γ 2 r,i remain zero. As a result, and without loss of generality, we can focus on evaluating (3) with l distinct θ i , while (1) is derived from above if n i = 1 for all i. In the remainder of the paper, we will focus on evaluating C(θ, γ ) as in (3) where θ has l distinct values. Hashiguchi et al. (2013), Sei et al. (2013), Koyama (2011), Koyama et al. (2014) and Koyama et al. (2012) for details and further information. Let Θ be an open subset of the d-dimensional Euclidean space. Denote the partial derivative ∂/∂α i by ∂ i . A function c(α) of α ∈ Θ is called holonomic if there exists a finitedimensional (say r -dimensional) column vector g = g(α) consisting of (possibly c(α) and higher-order) partial derivatives of c(α) such that g satisfies where where ∂ = ∂/∂α, d = 1 and r = 2. It is known that the normalizing constants of the von Mises-Fisher, Bingham and Fisher-Bingham distributions are holonomic. The Eq. (4) is called the Pfaffian equation of g. This equation essentially states that higher-order derivatives of g(α) are linear combinations of its entries while involving the Pfaffian matrices as rescaling constants. For example, the second-order derivative is Assume that a numerical value of the vector g(α (0) ) at some point α (0) ∈ Θ is given. The holonomic gradient algorithm evaluates g(α (1) ) at any other point α (1) . Here the term gradient refers to the gradient of g(α).
Note that the standard numerical routines for solving (5) are highly accurate and available in most computer packages. More specifically, the rk function in the deSolve package of R provides the required solution for a given accuracy.
As shown later in Sect. 5, the holonomic gradient method is used for maximum likelihood estimation via some gradient descent scheme, where the orthogonal matrix O can somehow be treated independently from the normalizing constant. As a result, we only need the Pfaffian equations for diagonal covariance matrices when the corresponding ODE has dimension 2l.

Explicit Pfaffians and HGM for Fisher Bingham
The parameters of Sect. 2.2 for the most general Fisher-Bingham case are α = (θ , γ ), i.e. dim(Θ) = 2l where l is the number of distinct values of θ i . Using properties (2), we can assume here that θ i and γ i are allowed to vary freely as positive values, while the smallest entry of θ can be fixed to 0. As a direct consequence of differentiating (1) and the fact that This equation implies that partial derivatives ∂C(θ ,γ ) ∂θ i are sufficient for evaluating C(θ, γ ) where the vector g has length r = 2l and is defined as where the first-order partial derivatives above are easily seen from (1) or (3), to depend on γ and θ as In this case, the corresponding ODE as in (5) is seeking the solution of some vector curve g(α) of dimension 2l and the required normalizing constant is simply minus the sum of the components of this vector as in (6). The left side of the Pfaffian equations (4) is clearly ∂ g ∂θ i and ∂ g ∂γ i , which are actually the second-order derivatives ∂ 2 C(θ,γ ) ∂θ i ∂γ j . In other words, the Pfaffian equations (4) are stating identities such that these second-order derivatives of C(θ , γ ) are linearly dependent on the first-order ones and Pfaffian entries. Therefore, in order to establish explicitly the Pfaffian equations for g we need to consider such particular relationships between the first-and second-order derivatives of C(θ, γ ). They are stated in the following theorem: Theorem 1 If θ i = θ j and γ i = 0 = γ j , the Pfaffian equations (4) for the general Fisher-Bingham distribution are generated by (13) and (14) can be given in terms of first-order derivatives using (11) and (10).
The proofs of these identities which are in "Appendix" rely on results from partial fractions.
Note also that as the Pfaffian matrices are defined in terms of the pairwise differences θ i − θ j , one can easily see that the ODE solution for C(θ, γ ) satisfies properties (2).
As a corollary to the theorem, the differential equation for the well-known von Mises-Fisher distribution is derived from (6) and (15) as: where l = 1, n 1 = p and θ = 0. The expression γ p 2 −1 1 C(0, γ 1 ) satisfies equation 9.6.1 in Abramowitz and Stegun (1972) for the modified Bessel functions and is consistent with the known expression for these cases (see 9.3.4 in Mardia and Jupp 2000).

Two types of Pfaffians
The Pfaffian matrices will be of two types: P i and P i+l for i = 1, 2, . . . , l since the vector g in (7) ∂γ i . Each P i or P i+l will be of dimension 2l × 2l, with all but two rows having at most 4 nonzero entries. The explicit expressions are found in "Appendix" For a curveᾱ with constant derivative, the matrix function K of (5) will be a linear combination of the 2l Pfaffian matrices.
This implies that for situations where some θ i and θ j coalesce, the matrix K will have intolerably large entries due to the presence of 1 (θ j −θ i ) r for r = 1, 2, 3 in the Pfaffian matrices. In these cases, stiffness in the corresponding ODE could appear. These situations are generally addressed by reparametrizing or changing the integrating curvē α(τ ) along which K remains manageable. For example, the choose of integrating path along some radial direction as suggested in Koyama et al. (2014) and Koyama and Takemura (2016) seems to work well. The default setting in our implementation is based on the same path so that starting from a small τ 0 so that g(α(τ 0 )) is accurately evaluated as a starting point for the ODE. For example, using the curve above for a choice of close entries for θ = (1, 2.9999, 3, 3.0001) and γ = (1, 1, 1, 1) the method works well by providing within 0.53 seconds a value for C(θ, γ ) = 2.9753553. Note that the saddlepoint approximation provides the value 2.942742 in 0.001 seconds. This is not surprising since, while SPA is very fast, the proposed method relies on a potentially computationally expensive step of evaluating the starting value of g at a sufficiently small τ 0 so that to guarantee the required accuracy at the target value τ = 1. In general, our method could require a careful choice of both the starting values and the integration path so that the ODE is does not have numerical problems. However, our implementation with the radial curve described as not failed in the examples that we have considered.
One can easily see that the Pfaffian values do not become degenerate even if all γ i become zero (except cases when n i > 1); therefore, a possible starting value for carrying out the numerical evaluation for general C(θ , γ ) or g(θ , γ ) could be the corresponding derivatives of the Bingham normalizing constant at g(θ , γ = 0) (evaluated as in Sei and Kume 2015), and then stemming from this point in R 2l , a second integration curve can be defined ending at the required g(θ, γ ). In cases of n i > 1, we can use the power series derived by Kume and Walker (2009) and Koyama et al. (2014) as a starting value of g(θ , γ ).

MLE optimization using the gradient approach
If the observed data are collected in a matrix X = Therefore, maximizing log L( Since values of θ can be shifted so that its smallest value becomes 0 and O can allow for γ to have non-negative entries, we can optimize (16) on θ ≥ 0 with min(θ ) = 0 and γ ≥ 0 by iteratively updating the parameters which increase the likelihood value such that: 1. for a fixed O, consider the optimization problem on θ and γ which is performed in 2l dimensions including the ODE for the HGM implementation. 2. by keeping these values θ and γ fixed, we can then find the optimal O by minimizing or decreasing only the quadratic part of the likelihood tr(AO diag(θ)O +OBγ ).
In order to establish a gradient descent approach for the first step, we only need the partial derivatives of log L(θ, γ , O) as follows: where ∂C(θ ,γ ) ∂θ 1 C(θ,γ ) and ∂C(θ ,γ ) ∂γ 1 C(θ,γ ) are the output of our holonomic gradient algorithm implementation for the Pfaffian equations shown earlier. In its general form, the second optimization needs special care as it is a non standard optimization problem in O( p). We show below an adopted gradient method which addresses this problem and therefore completes the MLE optimization. In fact, two special cases that do not require our optimization in O( p) are: • Bingham distribution, i.e. γ = 0, here the orthogonal component of the SVD decomposition of A is optimal • Kent distributions for p = 3 where approximate MLE is used and the problem is conveniently reduced to an optimization in O(2) after the third column vector of O is chosen independently such that the 3-dimensional vector B coincides with a fixed axis (see Kent 1982, Sect. 4).

Optimization in O
In particular, we need to find the optimalÔ such that In fact, this problem is equivalent tô This is the weighted Procrustes optimization problem considered in Chu and Trendafilov (1998). The authors there adopt an ODE approach to this problem as a simple adaption of continuous gradient optimization. We show in the following the gradient descent version in discrete time which can be immediately implemented within a unified MLE optimization procedure for the Fisher-Bingham family of distributions. Note that, provided we allow in the likelihood optimization the sign of one of the components in γ to vary, the optimal matrix O can be allowed to be a rotation matrix, i.e. O = e v where v is skew symmetric, i.e. v+v = 0.

Proposition 1 A necessary condition for O to be an optimal orthogonal matrix is that
Proof (see "Appendix").
In our case however, we can implement the gradient approach in the orthogonal group by taking as a possible new update for O some rotation along the curve Clearly, this curve reduces to a single point only if A is symmetric, i.e. O is a critical point. We can use this fact as a stopping criterion in our gradient optimization. We proceed in a similar way to obtain the second derivative, and it can be shown that a necessary condition that a particular critical O is a local minimum is

Algorithm for finding the MLE
We now have all the ingredients to establish our gradient approach for the MLE of the Fisher-Bingham distributions.

Algorithm
For a given initial set of estimates θ , γ and O, we perform the updates as follows: is as in (17) and δ θ is a real number such that log L(θ, γ , O) > log L (θ , γ , O).
Note, however, that if we wanted to fit the Bingham distribution, i.e. γ is assumed to be zero, then there is no need to implement step 3 above as the optimal O in this case is simply the one for which A = 0 that is OAO is diagonal.

Numerical evidence
In Nakayama et al. (2011), the authors illustrate the general methodology of holonomic gradient method by focusing on two data sets: one from the area of astronomy and the other one from the magnetism. We revisit the first data set in order to confirm that our method gives the same MLE results. We also want to make use of our different parametrization which deals with the sub-classes of Fisher-Bingham family to perform statistical inference to choose the most appropriate model. The second data set considered is previously used in the paper of Arnold and Jupp (2013) where a statistical model of orthogonal frames is introduced. Particular recordings of three orthogonal axis related to individual earthquake events in New Zealand are grouped in three data sets. Each triplet of orthogonal axes in R 3 related to a particular earthquake event gives rise to a direction orthogonal to the horizontal plane. Observations of these directions can allow modelling by Bingham distributions c.f Arnold and Jupp (2013). So we have three classes of directional data where Bingham distributions are considered appropriate. In particular, a Bayesian modelling approach to fitting Bingham distributions to such data is also considered in Fallaize and Kypraios (2014). We will show below that in fact the best modelling choice among the sub-classes of Fisher-Bingham family is indeed the Bingham distribution.

Astronomy data
For this data set in our parametrization, the components A and B are as follows: where the values in brackets are the MLE estimates using the saddlepoint approximation for the normalizing constant.
The optimal likelihood value rescaled by −n as defined in (16) is 2.457746 = log 11.67846 which is same as the value reported in Nakayama et al. (2011). The corresponding quantity for the saddlepoint approximation is 2.463414. We fitted to this data set the Kent distribution and the MLE of the corresponding quantities using HGM (and saddlepoint approximation) arê  (16) at these optimal points. Since the difference in the number of parameters between these models is 8−5 = 3, we can apply the log-likelihood ratio test, under the null hypothesis of the Kent model where log L is (the rescaled log-likelihood by −n) defined in (16) and Θ FB and Θ K represent the MLE estimates for the full Fisher-Bingham and Kent, respectively. The sample size is n = 168, and the value of likelihood ratio statistic is therefore 2 * 168 * (2.465478 − 2.457746) = 2.597952 which suggests that there is not enough evidence supporting the full Fisher-Bingham distribution model here. The same conclusion holds for the saddlepoint approximation quantities.

Earthquake data
The three axes of interest for an earthquake event are the directions of compressional axis P; tensional axis T, while the null axis A is defined as A = P × T. For these data sets, the first two axes tend to be horizontal, and therefore, the third axis points vertically. These axial data are shown in Fig. 1 and are split into three groups of particular interest. The We also fitted Fisher-Bingham distributions to these three clusters of directions, and the corresponding values of the corresponding χ 2 3 test statistics and their p values are 0.4889717(0.9213075), 3.885764(0.2740667) and 1.630983 (0.6523852). These results suggest that the Bingham distribution assumption is reasonable for these data sets. One of the referees mentioned correctly that the Bingham distribution is in fact appropriate for axial data. This means that if the data points undergo independently some axial rearrangement, (namely independent sign changes to each individual coordinates), the likelihood will not change under Bingham but will do so for the Fisher-Bingham case. Therefore, the model choice that we perform here is to only illustrate numerically that our HGM implementation here works for a given axial arrangement of these data points, and alternative models like the matrix Fisher distributions as suggested by the referee could be better modelling strategies for these orthogonal frames.

Concluding remarks
In this paper, we provide explicitly the Pfaffian system for the normalizing constant of the Fisher-Bingham distributions including the degenerate cases. Such explicit expressions have not only theoretical interest but also improve on the implementation of the current methods used for the MLE of these models. We reduce the dimensionality of the ODE equation as we need to operate at a dimension not more than twice the number of distinct values of θ i . The standard HGM so far does not account for multiplicities among θ i 's or γ i = 0. We can also perform exact MLE inference by using gradient optimization methods for the optimal orthogonal component O as in weighted Procrustes optimization. Note, however, that optimization in O shown in Sect. 5.1 is only local and rank(B γ ) = 1 might imply many optimal solutions. For the Bingham distribution, namely γ = 0 case, the optimal matrix O does not depend on θ as that is defined such that n i=1 Ox i (Ox i ) is diagonal. The numerical examples indicate that when carefully implemented, the method is highly accurate and performs well in real applications. Its implementation can fail sometimes since the corresponding ODE does not perform well numerically. This can be addressed by changing the ODE, namely, by altering either the starting point and/or the integrating path as discussed in the last paragraph of Sect. 4. The default choice of the curve which is used in our implementation in R works well in many tests. As indicated in our first real data example, the MLE using the saddle point approximation is with some exceptions, not far from the our MLE. One can start the HGM from this solution. This hybrid approach could in principle reduce the regions of the numerical search and could be seen as a way of calibrating the saddle point approximation. Our proposed method clearly generalizes that given in Koyama et al. (2014) since it offers explicit expressions for the Pfaffian equations for all Fisher-Bingham distributions including those with degeneracies in the parameters. Finally, since the saddle point approximation method is numerically stable, practically accurate and immediately available, the HGM could be used as a refinement to this approximation.
Acknowledgements The authors are very grateful to Richard Arnold and Peter Jupp for providing the earthquake data and the two anonymous referees for their helpful comments. This work was partially supported by JSPS KAKENHI Grant No. JP26108003 and JP26540013. Our special thanks go to Andy Wood for general discussions and encouragement.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecomm ons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix
The results of Theorem 1 rely heavily on Lemma 1 which is stated after some initial remarks.
Remark 1 One can easily notice that and therefore, these elementary functions iR+t 0 1 (θ i + t) r A(γ , θ )e t dt r = 1, 2 are actually representing the first-order derivatives ∂C(θ ,γ ) ∂θ i and ∂C (θ ,γ ) ∂γ i . In what follows, we will show that based on the theory of partial fractions, ∂C(θ ,γ ) ∂θ i and ∂C (θ ,γ ) ∂γ i can be used to express the integrals above even for r = 3, 4 which will then derive the expressions for the second-order derivatives. This is the basis of the following methodology for obtaining the Pfaffian equations.
For example, using (8) and (9) the second-order derivatives generate these expressions: and for i = j, Remark 2 If i = j it is clear, however, that nonzero terms γ i give rise to iR+t 0 1 (θ i + t) r A(γ , θ )e t dt r = 3, 4 in the second-order derivatives, while for γ i = 0 such terms vanish.
Remark 3 The second-order derivatives ∂ 2 C(θ,γ ) ∂θ i ∂θ j , ∂ 2 C(θ,γ ) ∂γ i ∂γ j and ∂ 2 C(θ,γ ) ∂γ i ∂θ j for i = j can be given in terms of only this pair of basis functions iR+t 0 1 (θ i + t) r A(γ , θ )e t dt r = 1, 2 which from Remark 1 are obtained in terms of ∂C(θ ,γ ) ∂θ i and ∂C(θ ,γ ) ∂γ i . This is easily seen if applying the integration with respect to A(γ , θ )e t dt and by using the following basis decomposition: and i.e. a = −c Note that expressions on the right-hand side for a and b in the statement of Lemma are valid if B = 0 = D.
Proof of Lemma 1 The identity (26) is a direct consequence of the theory of partial fractions. From (26), we see that and, applying this equation which establish the explicit expressions for b and d. After differentiating with respect to t both sides of (29) and then substituting t = −θ i , t = −θ j consecutively, we have the following pair of equations  (21) and (22), we obtain the following three identities: since the corresponding terms are a = n i n j 4(θ j − θ i ) The corresponding cases of i = j, are obtained after applying ∂ ∂γ i on both sides of (6) and separating the term ∂θ i ∂γ i while using (32) Similarly, after applying ∂ ∂θ j on both sides of (6) and using (31) we have Finally, the equation for ∂ 2 C(θ,γ ) with singularity at γ i = 0 if n i ≥ 2.
Explicit expressions for the Pfaffians For P i only the rows i and i + l will have 2l nonzero entries, while the remaining 2(l − 1) rows indexed by j ∈ {1, 2, . . ., l| j = i} and j + l will have at most 4 nonzero entries as indicated in (10): and for the l − 1 rows j + l using (11) (with i and j interchanged) we have only 3 nonzero entries: The ith row of P i can be obtained by rewriting (14): P i ( j, j + l)g j+l , and therefore, the nonzero entries of P i (i, :) are: j ∈ {1, 2, . . ., l| j = i} and for the i + lth row. Please note that Eq. (13) implies that P j (i + l, j + l)g j+l and P i (i + l, j) = − l i = j=1 P j (i + l, j) P i (i + l, j + l) = −P j (i + l, j + l) Similarly, one can show that for the second type P i+l the only nonzero elements in the rows j and j + l, for all j = i are P i+l ( j + l, i + l) = γ j 2(θ j − θ i ) P i+l ( j + l, j + l) = − γ i 2(θ j − θ i ) as seen from (11). For the ith row P i+l (i, i + l) = −1 − l i = j=1 P i+l ( j, i + l) P i+l (i, j) = −P i+l ( j, j) P i+l (i, j +l) = −P i+l ( j, j + l)) and for the (i + l)th row