Spherical regression models with general covariates and anisotropic errors

Existing parametric regression models in the literature for response data on the unit sphere assume that the covariates have particularly simple structure, for example that they are either scalar or are themselves on the unit sphere, and/or that the error distribution is isotropic. In many practical situations, such models are too inflexible. Here, we develop richer parametric spherical regression models in which the covariates can have quite general structure (for example, they may be on the unit sphere, in Euclidean space, categorical or some combination of these) and in which the errors are anisotropic. We consider two anisotropic error distributions—the Kent distribution and the elliptically symmetric angular Gaussian distribution—and two parametrisations of each which enable distinct ways to model how the response depends on the covariates. Various hypotheses of interest, such as the significance of particular covariates, or anisotropy of the errors, are easy to test, for example by classical likelihood ratio tests. We also introduce new model-based residuals for evaluating the fitted models. In the examples we consider, the hypothesis tests indicate strong evidence to favour the novel models over simpler existing ones.

Typical parametric regression models currently in use for spherical responses in dimension p ≥ 3 are fairly restrictive in the sense that (i) the covariates are assumed to have special structure, e.g. that the covariate is a scalar (such as time) or is itself on the sphere (i.e. a direction); and/or (ii) the models assume isotropic error distributions. Examples of (i) and (ii) in the literature are Chang (1986), Rivest (1989) and Rosenthal et al. (2014), see also Di Marzio et al. (2014) in a nonparametric context. Recent work in regression modelling on general Riemannian manifolds, for which the unit sphere is a special case, includes the nonparametric approach of Lin et al. (2017), who develop local regression models assuming Euclidean covariates, and the semi-parametric approach of Cornea et al. (2017), who use parametric link functions mapping from a general covariate space to the manifold, with a nonparametric error distribution; though in neither is the possibility of anisotropic errors explicitly considered.
The principal contribution of this paper is to introduce parametric regression models for spherical response data that relax both (i) and (ii). The motivation for doing so is that in many applications the covariates do not have the simple structure described in (i), and that there is rarely any basis for assuming a priori that the error distribution is isotropic.
There are two main ingredients of the spherical regression models we develop: a distribution on the sphere, to play the role of an error distribution, and a structural model linking the parameters of this error distribution to the covariates. Our approach is similar in spirit to generalised linear models in the sense that we express parameters of the distribution of y i in terms of Bx i , where B is a matrix of parameters. Two simple distributions on the sphere, each broadly analogous to the isotropic normal distribution in R 2 , are the von Mises-Fisher and isotropic angular Gaussian (IAG) distributions. Both are "isotropic" (or equivalently "rotationally symmetric") on the sphere at the mean direction,μ ∈ S 2 , meaning that their contours are small circles centred onμ. The von Mises-Fisher distribution arises from conditioning an isotropic multivariate normal random variable z ∈ R p to have unit norm. On S 2 , to which we specialise henceforth, it is often called the Fisher distribution. It has three free parameters: two to define the mean direction,μ ∈ S 2 , and another scalar parameter, κ > 0, that controls concentration. Its density function on S 2 is The IAG distribution arises from projecting (as opposed to conditioning) z to lie on S p−1 . On S 2 its density function is where M(α) = αφ(α) + (1 + α 2 ) (α), and where φ(·) and (·) are the standard normal probability density function and cumulative distribution function, respectively. It is parametrised by the vector μ ∈ R 3 . In terms of μ, the mean direction is μ/ μ and the concentration is determined by μ . Note that (2) could equally be re-parametrised in terms ofμ = μ/ μ and κ = μ , analogous to the parametrisation of (1), and likewise (1) could be re-parametrised in terms of a parameter κμ ∈ R 3 ; the distinction between parametrisations is a matter of modelling convenience and in the following we shall make use of both. Because they are isotropic, the 3-parameter Fisher and IAG distributions are too restrictive for many applications. Each, however, has a 5-parameter anisotropic generalisation: the Kent (1982) distribution, and the elliptically symmetric angular Gaussian (ESAG) distribution (Paine et al. 2017), respectively. Both the Kent and ESAG distributions have elliptical symmetry about the mean direction, that is, they have ellipse-like contours centred on the mean direction. The two extra parameters over their isotropic counterparts control the orientation and eccentricity of the elliptical contours. We describe the Kent and ESAG distributions in more detail in Sect. 2, but here introduce two parametrisations we shall use for each. The first parametrisation we shall consider is in terms of (κ, β, ), in which κ > 0 is a concentration parameter, β ≥ 0 is an eccentricity parameter, and = (μ ξ 1 ξ 2 ) ∈ O(3) is an orthogonal matrix (i.e. = I, where I is the identity matrix), in whichμ is the mean direction (having 2 degrees of freedom) and (ξ 1 , ξ 2 ) are the major and minor axes that identify the orientation of the elliptical contours (together having 1 remaining degree of freedom). This parametrisation generalises that of (1).
The second parametrisation we consider, generalising (2), is in terms of a pair of vectors, μ ∈ R 3 and γ ∈ R 2 , in which, as in (2), μ controls the mean direction and concentration; then γ ∈ R 2 controls eccentricity and orientation of the elliptical contours.
These two parametrisations lend themselves to different ways of modelling how the response variable depends on covariates. We consider models with the following structures (3) where H(·) is one of Kent(·) or ESAG(·) and Q is an orthogonal matrix, discussed later in Sect. 3 and the "Appendix", which is needed so that inference does not depend in undesirable ways on the particular, and possibly arbitrary, coordinate system in which the y i are defined. A primary difference between (3) and (4) is that in Structure 1 we allow the mean direction and orientation of the dispersion to depend on the covariate vector, but the magnitude of dispersion and anisotropy is fixed. For Structure 2, all of these depend on the covariate. We specify in Sect. 3 particular functional forms for the (·), μ(·) and γ (·), but for now note that we will consider four models that result from the different combinations of these two structures and two error distributions. We will call these Kent1, ESAG1, Kent2, and ESAG2, where, for example, ESAG1 means using H(·) = ESAG(·) as the error distribution and Structure 1 to model dependence on covariates. This modelling approach, in which we assume that the parameters of the error distribution depend on the covariates in particular functional ways, closely parallels generalised linear modelling, although rather than having a single linear predictor, here we have several.
Before giving more details about the parametrisations and models, we briefly discuss some earlier papers on spherical regression. Rivest (1989) considered the case with covariates themselves on the sphere, x i ∈ S 2 , and a Fisher error distribution with the mean direction modelled asμ(x i ) = Rx i , where R ∈ SO(3) is a rotation. Rosenthal et al. (2014) replaced the rotation with the "projective linear transformation" (PLT),μ(x i ) = Ax i / Ax i , with A ∈ SL(3) where SL(3) = {A ∈ R 3×3 : det(A) = 1} is the special linear group. This is a generalisation of Rivest's model since SL(3) contains SO(3). We consider the PLT later, using it to benchmark performance of the new models we introduce.
Besides regression models on the unit sphere, S 2 , there are several models for regression on the unit circle, S 1 . Presnell et al. (1998) considers regression on S 1 for a general covariate x i , assuming IAG errors. We mention this model in particular because it is a close analogue on S 1 of our ESAG2 model on S 2 in the isotropic case (which corresponds to γ = 0), as discussed later. Related work includes the S 1 regression model of Fisher and Lee (1992), but this is less relevant to the present paper because it does not generalise conveniently to S 2 or higher dimensional spheres; see Mardia and Jupp (2000) for a discussion of this and of the wider context of regression on S 1 . We also mention a regression model for data on the simplex introduced by Scealy and Welsh (2011). Their approach is to use a "square-root transformation" to map the data from the simplex to the positive orthant of the sphere, then to develop regression models for the transformed data using the Kent distribution. On the sphere, as opposed to the simplex, however, we believe it is especially important to allow what Scealy and Welsh (2011) refer to as K * to depend on regression variables, something that they do not consider due to the fact they focus on transformed compositional data; see the discussion in the concluding section of their paper.
The main goals of this paper are: to explore and compare the modelling Structures 1 and 2; to investigate in the regression context the advantages and disadvantages of the Kent and ESAG distributions as error distributions; and to develop hypothesis tests for the significance of particular covariates, and of anisotropy.
In the following section, we introduce the Kent and ESAG distributions in each of the two parametrisations, then in Sect. 3 we develop the two modelling structures and hypothesis testing procedures. In Sect. 4, we introduce some novel residuals for model fitting diagnostics; then in Sect. 5, we implement the models and methods on various examples involving both synthetic and real data. Code for fitting the models in this paper is available on the second author's web page.

Elliptically symmetric distributions on S 2
Here, we give details of the (μ, γ ) and (κ, β, ) parametrisations of the Kent and ESAG distributions. Kent (1982) introduced this distribution using a (κ, β, ) parametrisation, in terms of which the density is

Kent distribution
where C(κ, β) is the normalising constant.
The proof of Lemma 1 is in the "Appendix".

Elliptically symmetric angular Gaussian (ESAG) distribution
The general angular Gaussian distribution is the marginal distribution of the directional component of the multivariate normal distribution; that is, for a general mean μ ∈ R p and covariance matrix V, let z ∼ N (μ, V) ∈ R p , then z/ z has a general angular Gaussian distribution. The elliptically symmetric angular Gaussian (ESAG) distribution, developed in Paine et al. (2017), is a subfamily of the general angular Gaussian distribution. It is defined by the two conditions and on S 2 has density where M(·) is defined as in (2). Distribution (8) has 5 free parameters, which can be seen by first fixing the 3 free parameters of μ = (μ 1 , μ 2 , μ 3 ) then observing that conditions (7) leave 2 degrees of freedom in V. Let ρ 1 , ρ 2 , ρ 3 be the eigenvalues of V, with corresponding orthonormal eigenvectors ξ ξ ξ 1 , ξ ξ ξ 2 , ξ ξ ξ 3 , respectively. Then, by the spectral decomposition theorem, whereμ = μ/ μ . The final term in (9) is a consequence of the constraint Vμ = μ, and ρ −1 2 = ρ 1 then follows from det(V) = 1. Once μ is fixed, then in V −1 there is one degree of freedom from ρ 1 , and one degree of freedom from fixing the orientation of ξ 1 and ξ 2 .

Practical differences between Kent and ESAG distributions
Both the Kent and ESAG distributions have similar characteristics from a modelling perspective: each typically has ellipse-like contours of constant probability density centred on the mean direction in the unimodal case and for different parameter values each has unimodal and bimodal cases. On practical grounds, the two distributions have different advantages and disadvantages. The Kent distribution belongs to the exponential family, and hence, its density, (5), has a simple mathematical form. In comparison, the ESAG density, (8), is rather cumbersome. On the other hand, the ESAG density and likelihood can be computed exactly, whereas the Kent density and likelihood involves a normalising constant, C(κ, β) in (5), which is not known in closed form and hence needs to be approximated, by truncating an infinite series (Kent 1982), or else by saddlepoint or holonomic gradient methods (Kume and Sei 2017;Kume et al. 2013). In the present context, we maximise the likelihood for the regression models numerically, so the ESAG likelihood having a cumbersome form is no drawback, and the fact that it can be computed exactly is an advantage. For simulation, the Kent distribution requires a rejection algorithm (Kent et al. 2018), whereas ESAG can be simulated quickly and easily. Fast simulation is especially helpful in simulation-heavy inference procedures, e.g. the parametric bootstrap.

Regression model structures
In this section, we specify the two model structures in (3) and (4) and then discuss the advantages and disadvantages of each. It is assumed throughout the paper that the first element of x i is 1, which is analogous to the inclusion in linear modelling of an "intercept term". For Structure 2 models, see (4), this means that the simpler model of {y i } being IID, i.e. not depending on the covariates, is nested in the general regression model and this is helpful for testing the significance of regression. The motivation for including the intercept term is less clear-cut a priori for Structure 1 models, see (3), though empirical results, for example in Table 2 later, suggest there is sometimes a benefit from doing so. Each model structure is defined in terms of a preliminary orthogonal transformation, Q. For Structure 1 models, Q is assumed to be a population quantity, defined explicitly in the "Appendix", and estimated by a sample versionQ. For Structure 2 models, Q is treated as a tuning parameter and optimised with respect to. These preliminary transformations are needed so that desirable invariance and equivariance properties, discussed in the "Appendix", hold when an arbitrary orthogonal transformation is applied to the y i .

Structure 1: Q y i ∼ H(Ä,ˇ, 0(x i ))
In this structure, Q is a population quantity, as defined in the "Appendix", and in all calculations involving data it is replaced by a sample versionQ = [ξ ξ ξξ ξ ξ 1ξ ξ ξ 2 ] withξ ξ ξ = n i=1 y i /|| n i=1 y i || andξ ξ ξ 1 andξ ξ ξ 2 unit eigenvectors corresponding to the larger and smaller positive eigenvalues of Here,Q is the moment estimator defined in Kent (1982, p. 74) of in defined in Kent (1982, p. 74) (5) under the assumption of IID y i .
We consider for (x i ) viewed as a function of x i : where the R(x i ) are orthogonal 3-by-3 matrices, the S(x i ) are orthogonal 2-by-2 matrices, diag[., .] is a 3-by-3 block diagonal matrix with 1 × 1 and 2 × 2 blocks, and x i is q × 1. The dependence of R(x i ) and S(x i ) on the covariate vector x i needs to be prescribed. We choose to do so using the Cayley transform: for any skew-symmetric matrix A, i.e. A = −A , the matrix (I − A)(I + A) −1 is a 3-by-3 rotation matrix (i.e. an orthogonal matrix with determinant +1). The Cayley transform maps the skew-symmetric matrices onto the set of rotations minus a set of lower dimension (see the "Appendix"). This is an injective mapping, which is the reason we favour it over, e.g., the exponential of A. Define where are skew-symmetric. Here, R(·) and S(·), and hence (·), are playing a role analogous to link functions in generalised linear models, linking linear predictors to the parameters of the distribution of the response variable. The nature of the link functions means that interpreting the influence of individual β j s is somewhat harder for this model than for Structure 2 models described below. This model is fitted by maximising the likelihood function of observed data with respect to the 4-by-q parameter matrix B = β 1 , β 2 , β 3 , β 4 .

Structure 2: Q y i ∼ H( (x i ), (x i ))
In this parametrisation, μ ∈ R 3 and γ ∈ R 2 are unrestricted, and μ(x i ) and γ (x i ) are easy to specify as functions mapping from the q-dimensional domain of the {x i } to R 3 and R 2 , respectively. Here, we limit attention to linear functions, where B 1 = (β 1 , β 2 , β 3 ) and B 2 = (β 4 , β 5 ) . In keeping with the notation of the preceding model, we collect these parameters together into a 5-by-q matrix B = B 1 , B 2 , where the influence of the subsets of parameters can be clearly distinguished: B 1 controls the influence of the covariates, via μ, on the concentration and mean direction; and B 2 controls influence, via γ , on the degree and orientation of anisotropy. This leads to natural tests, e.g. for anisotropy, discussed below. Unlike in Structure 1, in which model (16) is naturally tied to the particularly defined Q, for Structure 2 and model (19) there is no a priori reason to select a particular Q ∈ O(3); hence, we treat Q as a tuning parameter, seeking to maximise the likelihood of the data Q y i over {Q, B}. A practical way to do so at least approximately is via a bruteforce search for Q over O(3), for each value of Q on a grid over O(3) computing the maximum likelihood estimatorB of B, then selecting the pair Q,B corresponding to the largest maximised likelihood. In this paper, when comparing models for a particular data set, we compute Q for the most general ESAG2 model and keep this Q fixed for submodels and Kent2 models.
Model (19) with ESAG errors, is close in spirit to the circular p = 2 regression models of Presnell et al. (1998) and Wang and Gelfand (2013), particularly in the isotropic special case, with B 2 = 0, in which this model is a direct analogue for p = 3 of Presnell et al.'s regression model on the circle. A helpful property proved by Presnell et al. in the circular case is that the log-likelihood function is a concave function of the regression parameters-in our notation B 1 -that determine μ; this guarantees that the MLE of B 1 is unique and easily determined by numerical optimisation. The corresponding result holds for ESAG2 (20) in the p = 3 case with isotropic errors, i.e. B 2 = 0, as follows [in which vec is the standard vectorisation operator; see e.g. Mardia et al. (1979)]. (20), let B 2 = 0, and let l(B 1 ) denote the log of the likelihood function for parameter B 1 given observations (x 1 , y 1 ), . . . , (x n , y n ). Provided (x 1 , . . . , x n ) has full rank, the negative Hessian

Proposition 1 Consider model
is positive definite, hence l(B 1 ) is a concave function of B 1 , and the MLE of B 1 is unique.
The proof of this Proposition is given in the "Appendix".

Tests for the significance of anisotropy and regression
In this section, we discuss procedures for performing hypothesis tests required for model selection and inference. To do so, we introduce the notation for the parameters B = β (1) , β (2) , . . . , β (q) , i.e. such that β ( j) is the jth column of B and corresponds to the covariate appearing as the jth element of x i . A test of the significance of this particular covariate corresponds to a test with null and alternative hypotheses Since the null hypothesis is nested in the alternative then by Wilks' theorem, subject to the usual regularity conditions, under H 0 , asymptotically as n → ∞, where L 0 and L 1 are the maximised likelihood functions under H 0 and H 1 , respectively; and ν is the difference in the number of free parameters between H 0 and H 1 , here equal to 4 or 5 for the Structure 1 and 2 models, respectively. The significance of the parameter can be assessed by referring the observed test statistic, T , to the χ 2 ν distribution. An alternative possibility, preferable when n is insufficiently large for the null asymptotic distribution (21) to be reasonable, is to approximate the null distribution using a bootstrap procedure.
Within Structure 2 models, it may be relevant to consider whether particular covariates are significant in μ or γ distinctly. For example, for the covariate corresponding to the jth element of x i , a test that the covariate is significant in γ corresponds to the hypotheses H 0 :(β ( j) ) 4 = (β ( j) ) 5 = 0 versus H 1 :(β ( j) ) 4 , (β ( j) ) 5 free, for which the degrees of freedom in (21) is ν = 2. Having isotropic errors corresponds to γ = 0, so for a test of the significance of anisotropy the hypotheses are H 0 :B 2 = 0 versus H 1 :B 2 free, where B 2 is as defined in (19), and ν = 2q.

Residuals for model diagnostics
For spherical regression models, there are many possible ways to define a residual. Here, we describe some general spherical residuals defined by Jupp (1988) before defining some particular model-based residuals for regression models with ESAG and Kent errors. For observations y 1 , . . . , y n denote the fitted values bŷ y 1 , . . . ,ŷ n . Jupp defined "crude residuals" as i.e. as projections of each observation, y i , into the tangent plane at its fitted value,ŷ i . Since the r 1 , . . . , r n lie in different tangent planes, Jupp defined the "rotated residuals" where y 0 is an arbitrary point on the sphere which is not dependent on i, and R(ŷ i , y 0 ) is a rotation fromŷ i to y 0 , where R(·, ·) does not depend on i. Then, the s 1 , . . . , s n lie in the plane tangent to the sphere at y 0 . Let ζ 1 , ζ 2 be an arbitrary pair of unit vectors orthogonal to each other and to y 0 , then a plot of the projected residuals can be inspected to identify structure amongst residuals that could indicate a shortcoming of the model. For parametric regression models with ESAG or Kent errors, Jupp's residuals are potentially limited in that they are not model-based and hence do not take into account the dispersion of errors in the fitted model, i.e. (23) is a function of the fitted valueŷ i but not of the parameters that determine dispersion.
We define model-based residuals for ESAG and Kent error models as follows, in each case motivated by highconcentration Gaussian limits of each distribution, although we expect the residuals to be useful for detecting model inadequacy even in non high-concentration settings. For a random variable y ∼ ESAG(μ, γ ) consider the corresponding random variable where ρ 1 , ξ 1 , ξ 2 are as defined in (9). From Proposition 2 in Paine et al. (2017), provided μ is large then, approximately, η ESAG ∼ N 2 (0, I). Hence for regression models with ESAG errors, we define residuals Then, a scatterplot of η 1 , . . . ,η n can be compared with random N 2 (0, I) scatter; see Fig. 1 for examples. Similarly, for a random variable y ∼ Kent(κ, β, ) then η Kent (y; κ, β approximately, for large κ; see property (e) in Kent (1982). For models with Kent errors, writingˆ i =ˆ (x i ), we hence define the residuals η i = η Kent (y;κ,β,ˆ i

Applications
Here, we consider three applications, in each investigating the spherical regression models towards different statistical goals. The first involves a simulated data set with a scalar covariate, t ∈ R. We exploit having a simple data-generating model to illustrate the flexibility within this regression framework for the mean direction and dispersion to depend on the covariate; to investigate the performance of hypothesis tests in detecting anisotropy and regression; and to compare Jupp and η-residuals in the special setting where the model being fitted is the true one.
The second data set concerns the movement of clouds between two consecutive days. The cloud shapes are repre- sented by landmarks spaced around the cloud outline, and the position of these landmarks is regressed on their positions the previous day. This data set has been considered previously in the context of spherical-spherical regression models with isotropic errors (Rosenthal et al. 2014); hence, it makes for an interesting comparison with the more general framework developed in this paper. The third data set is derived from vectorcardiogram measurement of heart activity in children. These data too have been studied in the context of spherical-spherical isotropic regression (Chang 1986), but with the non-spherical covariates disregarded. The primary goal is inference, to understand which covariates are significantly related to the response. The framework of the present paper enables us to incorporate easily the additional non-spherical covariates, as well as anisotropic errors, and furthermore then to test formally whether such generalisations are warranted by the data.

Simulated data set (involving a scalar covariate)
Denote by M 1 the ESAG2 regression model with (1) , and t i = (i − 1)/(n − 1) for i = 1, . . . , n. In the notation of (20), Figure 1a-c is plot for a synthetic data set generated from M 1 using μ (0) = (5, 10, 2) , μ (1) = (−5, 10, 2) , γ (0) = (2, 3) , γ (1) = (−2, 5) , and n = 41. As a visual aid, plot markers corresponding to the subset of points with indices i = 1, 11, 21, 31, 41 are coloured red. Figure 1a shows the data, together with contours of constant probability density with 90% coverage for the true and fitted parameters, for the covariates corresponding to the red-marked points. The data-generating parameters are deliberately chosen here to produce highly anisotropic dispersion, as can be seen from the highly eccentric contours. These contours are well matched by corresponding contours of the fitted model, indicating that the parameters have been estimated well. Figure 1b shows the same data projected onto the tangent plane at the sample mean, with the index used as the point marker so that the ordering of the points can be seen. Figure 1c is a plot of η-residuals, which seem consistent with IID bivariate normal scatter, indicating that the model is reasonable. This is expected since the data-generating model is a special case of the model being fitted. Exploring residuals further, Fig. 1d shows residuals analogous to those in c but this time for a larger sample size of n = 401, with the corresponding Jupp projecting residuals (24) shown in e. The Jupp residuals appear non-Gaussian and anisotropic, even though the fitted model is appropriate to the data, making these residuals harder in general to interpret for model diagnostics.
We can use the inference procedures described in Sect. 3.3 to test for significance of anisotropy and regression. Table 1 shows the maximised log-likelihood for the true model, M 1 , and some different models involving various combinations of the two model structures and two error distributions. Using Wilks' statistic and the null asymptotic χ 2 approximation (21) to compare M 1 with each of models M 2 , M 3 , M 4 with errors assumed to be ESAG results in p values < 10 −5 in each case, indicating very strong evidence to favour the data-generating model over the simpler alternatives, which include the isotropic (M 3 ) and IID (M 4 ) models. When Kent errors are assumed for the fitted model, i.e. in contrast to the ESAG errors used in generating the data, the statistical conclusions (and even to some extent the numerical values of the maximised log-likelihoods) are very robust to this misspecification. This is probably a consequence of how similar the ESAG and Kent densities are in practise, especially if the concentration is reasonably high. The table also shows the results of fitting Structure 2 models M 5 -M 8 to the Structure 1-generated data. Here, model M 5 is not favoured strongly over M 6 , in contrast to how M 1 is strongly favoured over M 2 . The explanation is that models M 2 and M 6 are only loosely analogous as submodels of M 1 and M 5 , respectively. A major difference is that M 2 cannot capture the way the orientation of the anisotropy substantially depends on the covariate, because γ does not depend on the covariate, whereas M 6 can still do so via R(x i ) even when S(x i ) is fixed to be the identity matrix. The conclusion to reject isotropy (M 7 ) and the assumption of IID data (M 8 ) in favour of M 5 are both robust to the model misspecification.

Cloud formation data (involving a spherical covariate)
These data involve 29 landmarks spaced around the outline of a cloud to represent its shape on each of two consecutive days, 4th and 5th Sept 2012. The data, see Fig. 2, are from NASA's Visible Earth project [with original cloud images from XPlanet (2018)] and were used as an application by Rosenthal et al. (2014) in assessing accuracy of their PLT model, albeit with a focus on prediction rather than inference. The goal is to regress the landmarks {y i } 29 i=1 for the second day on those {u i } 29 i=1 of the first. We hence define a covariate vector, including "intercept", as The models we fitted to these data and the corresponding values of the maximised log-likelihood, are shown in Fig. 2c. The maximised log-likelihood values show that each of the models with anisotropic errors is very strongly favoured over its isotropic counterpart. The non-nestedness of the models otherwise makes them hard to select between formally. Model ESAG2 has substantially the largest log-likelihood value, although recall that the transformation Q used for the Structure 2 models is chosen specifically to maximise the ESAG2 log-likelihood.
The residuals of the fitted ESAG2 model, shown in Fig. 2b, show a small amount of serial correlation (points 21-27), but otherwise little to suggest the model is poorly fitting.

Vectorcardiogram data (involving a mixed-type covariate)
This data set was considered by Chang (1986) in the context of his spherical-spherical regression models. Here, our more general model enables incorporation of other covariates, and of anisotropic errors. The data themselves are derived from vectorcardiogram measurements of the electrical activity of the heart of children of different ages and genders. The vectorcardiogram involves three leads being connected to the torso produce a time-dependent vector that traces approximately closed curves, each representing a heartbeat cycle, in R 3 . Sometimes used as a summary for clinical diagnosis is a unit vector defined as the directional component of the vector at a particular extremum across the cycles. The data comprise such unit vectors derived from data for two different lead placement systems, the Frank system (y i ∈ S 2 ) and for the McFee system (u i ∈ S 2 ), for each of 98 children of different ages and gender. Age is represented by a binary variable A i ∈ {0, 1} (0 meaning aged 2-10 years, and 1 meaning aged 11-18 years) and gender by a variable G i ∈ {0, 1} (0 for a boy, and 1 for a girl). We aim to regress y i on the other variables, and hence take the covariate to be To identify the meaning of the parameters in the parameter matrix B, we write it in the block structure where β β β 0 1 , β β β A 1 and β β β G 1 are 3 × 1 and B u 1 is 3 × 3; and β β β 0 2 , β β β A 2 and β β β G 2 are s × 1 and B u 2 is s × 3, where s = 1 for Structure 1 models and s = 2 for Structure 2 models. Setting any of  these blocks equal to zero amounts to removing the influence of a particular covariate on particular parameters in the error model. For example, in Structure 2 models, setting β β β A 1 = 0 means that the covariate A i does not influence μ.
Tables 2 and 3, respectively, show the results of fitting Structure 1 and 2 models, and several submodels, to the vectorcardiogram data. Within each table, each of the submodels is nested within M 1 , and some of the submodels are further nested within each other. Pairwise comparisons of relevant nested models using likelihood ratio tests described in Sect. 3.3 at 5% level suggest that for both ESAG1 and Kent1 the preferred model is M 15 . This suggests that Structure 1 is poor for characterising how the response depends on the covariates for this application, to the extent that there is little benefit to retaining the covariates in the model. In contrast, for both ESAG2 and Kent2 the preferred model is M 3 , which retains all of the covariates. assumes isotropic errors and neglects the age and gender covariates, the scatter appears less isotropic, and there are slight differences in the scatter according to age and gender, consistent with there being residual variation due to these factors not being incorporated in the model.

Conclusions
The regression models we have introduced are rather more general than existing regression models in the literature, allowing covariates with general structure, and errors that are nonisotropic. We have also introduced novel model-based residuals that enable simple visual diagnostics to check fitted models, to identify for example any residual structure dependent on a covariate, any serial correlation or any outliers, and to explore adequacy of the error models.
For the anisotropic error model, there is little to choose on statistical grounds between using Kent or ESAG, though we have found occasions for models based on the Kent that the likelihood function is harder to maximise numerically (perhaps owing to roughness in the likelihood approximation arising from approximating the normalising constant).
Models based on ESAG are free from this issue, and the computation of the ESAG likelihood is much faster. Of the two model structures we considered, models with Structure 2 tended to perform better; such models are also simpler and enable the influence of particular covariates to be related more directly to the response variable. On the foregoing grounds, ESAG2 models are our preferred ones.
The likelihood framework in which we have developed the models makes it very easy to use classical methods to compare nested models of different complexity, in particular to test hypotheses about significance of regression or the anisotropy of errors. Indeed, applying such tests for the examples considered provides strong support that the regression modelling generalisations we have developed are warranted.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecomm ons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. Q y → (UQ) Uy = Q U Uy = Q y, so that structures (3) and (4) are invariant with respect to orthogonal transformations of the y i .
When fitting models with structures (3) and (4) we estimate Q using its sample analogue defined in Sect. 3.1. Providing the y i have absolutely continuous distributions and the sample size n is at least 3,Q is well defined with probability 1 even if the population version Q is not well defined, i.e. even if the denominator of (27) is 0 and/or the positive eigenvalues of (28) are equal. If Q is well defined thenQ estimates Q consistently.
Finally, we consider what happens if we start with the original coordinate system for the y i , in which case the models concerned are equivariant rather than invariant with respect to orthogonal transformations of the y i . In the case of Structure 1 in (3), the orthogonal matrices (x i ) transform according to (x i ) → Q (x i ), i = 1, . . . , n, with κ and β unchanged.
For Structure 2, in which Q ∈ O(3) is considered a tuning parameter and optimised with respect to, the μ μ μ(x i ) and γ γ γ (x i ) are invariant to orthogonal transformations of the y i .