Bayesian Linear Size-and-Shape Regression with Applications to Face Data

Regression models for size-and-shape analysis are developed, where the model is specified in the Euclidean space of the landmark coordinates. Statistical models in this space (which is known as the top space or ambient space) are often easier for practitioners to understand than alternative models in the quotient space of size-and-shapes. We consider a Bayesian linear size-and-shape regression model in which the response variable is given by a labelled configuration matrix, and the covariates represent quantities such as gender and age. It is important to parameterize the model so that it is identifiable, and we use an LQ decomposition of the intercept term in the model for this purpose. Gamma priors for the inverse variance of the error term, matrix Fisher priors for the random rotation matrices, and flat priors for the regression coefficients are used. Markov chain Monte Carlo algorithms are used for sampling from the posterior distribution, in particular combinations of Metropolis-Hastings updates and a Gibbs sampler. The proposed Bayesian methodology is illustrated with an application to forensic facial data in three dimensions, where we investigate the main changes in growth by describing relative movements of landmarks for each gender over time.


Introduction
Bayesian linear regression analysis has been extensively studied for various types of response variables and covariates, where prior distributions are specified for the parameters in the classical regression model and statistical inference is carried out using the joint posterior distribution of the parameters (Gelman et al., 2013).
We wish to explore regression models for landmark data, where the location and orientation of the objects can be ignored. Such objects can be represented as points in the size-and-shape space (Dryden and Mardia, 2016, Chapter 5), which is defined as the space of landmark co-ordinates after rotation and translation information has been removed (Kendall, 1989). The shape space on the other hand (Dryden and Mardia, 2016, Chapter 4) is the space of landmark co-ordinates after rotation, translation and scale information has been removed (Kendall, 1986). Throughout this paper we will concentrate on size-and-shape rather than shape.
The size-and-shape space is a quotient space, where location and rotation are quotiented out by using least squares optimization. However, the geometry of the size-and-shape quotient space is complicated (Kendall et al., 1999), and it can be difficult for a practitioner to understand the meaning of statistical models formulated in the quotient space.
An alternative approach is to specify a statistical model in the Euclidean space of the landmark co-ordinates and then integrate out the unwanted location and rotation information by considering the marginal distribution of size-and-shape. In this case, the space in which the statistical model is specified is called the top space in differential geometry, and also known as the ambient space by some authors (Cheng et al., 2016). A top space modelling approach has the advantage that the model is often easier to understand than a quotient space model, and relatively standard inference methods can be used. We shall develop a Bayesian linear model in the space of the Euclidean landmark co-ordinates, and carry out statistical inference using Markov chain Monte Carlo (MCMC) algorithms. Care needs to be taken with identifiability of parameters in the model, and this issue often arises in high-dimensional object data (Dryden, 2014).
The remainder of this paper is organized as follows. In Section 2 we describe the Bayesian linear size-and-shape regression model, including the prior and posterior distributions. In Section 3, methods for Bayesian inference for the coefficients and model selection are presented. Finally an application to forensic facial data is given in Section 4.
Bayesian Linear Size-and-Shape Regression Model

Linear Model

Consider a random sample of n configurations of k labelled landmarks in m dimensions, where each configuration is represented by a k × m matrix Y_i ∈ R^{k×m}, k > m, i = 1, . . . , n. We are interested only in the size-and-shapes of Y_i after removing translation and rotation, but preserving scale information (Dryden and Mardia, 2016, Chapter 5). In addition we have real-valued covariates x_ij, j = 1, . . . , p, corresponding to each configuration, and without loss of generality we assume that each covariate is centred, i.e. ∑_i x_ij = 0. Categorical variables with g levels can be represented by g − 1 binary indicator variables in the standard way. We write x_i = (1, x_i1, . . . , x_ip)^⊤ as a (p+1)-dimensional column vector containing the p covariates and 1 for the intercept. We aim to predict the size-and-shape of Y_i using the covariates, and to explore the relationship between Y_i and x_i, i = 1, . . . , n.
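As a concrete illustration of the covariate coding above, the following NumPy sketch centres a real-valued covariate and a binary indicator and assembles the design vectors x_i as rows of a matrix. The ages and genders here are hypothetical toy values, not the face data.

```python
import numpy as np

# Hypothetical toy data: n = 5 subjects with age (real) and gender (2 levels).
age = np.array([25.0, 40.0, 31.0, 58.0, 19.0])
gender = np.array(["F", "M", "M", "F", "M"])

# Centre each real-valued covariate so that sum_i x_ij = 0.
age_c = age - age.mean()

# A g-level categorical covariate becomes g - 1 binary indicators, also centred.
male = (gender == "M").astype(float)
male_c = male - male.mean()

# Each row is x_i = (1, x_i1, ..., x_ip): a 1 for the intercept, then p covariates.
X = np.column_stack([np.ones(len(age)), age_c, male_c])
print(X.shape)                 # (5, 3): n rows, p + 1 = 3 columns
print(X[:, 1:].sum(axis=0))    # centred columns sum to (numerically) zero
```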
Suppose that the Y_i are modelled with a probability distribution with conditional mean function μ(x_i) given covariates x_i and subject to an arbitrary unknown rotation Λ_i ∈ SO(m), where SO(m) is the group of special orthogonal matrices that satisfy Λ_i^⊤ Λ_i = Λ_i Λ_i^⊤ = I_m and det(Λ_i) = 1, and where I_m is the m × m identity matrix. So we have the conditional mean

E(Y_i | x_i) = μ(x_i) Λ_i.

Including a noise term we have the model

Y_i = μ(x_i) Λ_i + ε_i,

where the ε_i are assumed to be i.i.d. random matrix normal variables of dimension k × m (Gupta and Nagar, 1999, Chapter 2). In this paper we consider the conditional mean function μ(x_i) to be linear, so that the following linear regression model is of interest:

Y_i = (α_0 + ∑_{j=1}^p x_ij α_j) Λ_i + ε_i,   (2.1)

where α_0, α_j ∈ R^{k×m}, j = 1, . . . , p, are k × m regression parameter matrices, and the errors are matrix normal with vec(ε_i) ~ N(0, σ² I_m ⊗ I_k), where vec(A) denotes the vectorization of the matrix A (i.e. stacking columns) and ⊗ denotes the Kronecker product. The model (2.1) is not identifiable, since the rotation effect from Λ_i is confounded with the coefficients {α_0, α_j}. We can make the model identifiable using an LQ decomposition of α_0. In particular we write α_0 = β_0 Q_0, where Q_0 ∈ SO(m) and β_0 is lower triangular (i.e. has zero entries above the leading diagonal). Therefore the model (2.1) can be rewritten as

Y_i = X_i β Γ_i + ε_i,   (2.2)

where X_i = x_i^⊤ ⊗ I_k, β = (β_0^⊤, β_1^⊤, . . . , β_p^⊤)^⊤ ∈ R^{k(p+1)×m} is a k(p + 1) × m matrix of regression parameters with β_j = α_j Q_0^⊤, j = 1, . . . , p, and Γ_i = Q_0 Λ_i ∈ SO(m). In the following we describe a Bayesian approach to estimate μ(x_i) given deterministic covariates x_i.
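The LQ identification step can be checked numerically. Below is a minimal NumPy sketch (the helper name lq_identify and the random intercept matrix are our own, not from the paper) that factors a k × m matrix as beta0 @ Q0 with beta0 lower triangular and Q0 a proper rotation, using a QR decomposition of the transpose.

```python
import numpy as np

def lq_identify(alpha0):
    """Sketch: factor alpha0 = beta0 @ Q0 with beta0 lower triangular
    (zeros above the leading diagonal) and Q0 in SO(m)."""
    Q, R = np.linalg.qr(alpha0.T)          # alpha0.T = Q R, with R upper triangular
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    Q, R = Q * s, (R.T * s).T              # fix signs so diag(R) >= 0
    beta0, Q0 = R.T, Q.T                   # then alpha0 = beta0 @ Q0
    if np.linalg.det(Q0) < 0:              # force a proper rotation, det(Q0) = +1
        beta0[:, -1] *= -1
        Q0[-1, :] *= -1
    return beta0, Q0

rng = np.random.default_rng(1)
alpha0 = rng.standard_normal((30, 3))      # hypothetical k = 30 intercept matrix
beta0, Q0 = lq_identify(alpha0)
print(np.allclose(beta0 @ Q0, alpha0))     # reconstruction holds
print(np.allclose(np.triu(beta0, 1), 0))   # zero entries above the diagonal
print(np.isclose(np.linalg.det(Q0), 1.0))  # Q0 is a rotation
```

Flipping the sign of one column of beta0 and the corresponding row of Q0 leaves the product unchanged, which is how the determinant is forced to +1 without breaking the triangular structure.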

Likelihood
It follows from the matrix normality of ε_i that Y_i ~ MN_{k×m}(X_i β Γ_i, σ² I_m, I_k), and therefore the probability density function of Y_i is given by

f(Y_i | β, Γ_i, σ²) = (2πσ²)^{−km/2} exp{ −‖Y_i − X_i β Γ_i‖² / (2σ²) },

where ‖A‖² = trace(A^⊤A), and the likelihood is given by

L(β, Γ_1, . . . , Γ_n, σ²) = ∏_{i=1}^n f(Y_i | β, Γ_i, σ²).
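Because the errors are isotropic, the log-likelihood is a sum of Gaussian terms over all coordinates. The following sketch evaluates it for a batch of configurations; the function name and array layout are our own choices, not the paper's code.

```python
import numpy as np

def loglik(Y, Xb, Gammas, sigma2):
    """Sketch of the log-likelihood under Y[i] ~ MN(X_i beta Gamma_i, sigma^2 I, I),
    i.e. independent N(0, sigma^2) errors on every landmark coordinate.
    Y:      (n, k, m) observed configurations
    Xb:     (n, k, m) with Xb[i] = X_i @ beta
    Gammas: (n, m, m) rotation matrices."""
    n, k, m = Y.shape
    resid = Y - np.einsum("nkm,nmj->nkj", Xb, Gammas)  # Y_i - X_i beta Gamma_i
    rss = np.sum(resid ** 2)                           # residual sum of squares
    return -0.5 * n * k * m * np.log(2 * np.pi * sigma2) - rss / (2 * sigma2)

# With zero residuals and sigma^2 = 1, the value is -nkm/2 * log(2 pi).
Y = np.zeros((2, 4, 3))
Xb = np.zeros((2, 4, 3))
G = np.tile(np.eye(3), (2, 1, 1))
print(loglik(Y, Xb, G, 1.0))
```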

Prior and Posterior
We shall concentrate on the m = 3 dimensional case, and it is then helpful to adopt a particular parameterization of the rotation matrices. We can represent a three-dimensional rotation matrix using the ZXZ-convention of Euler angles (Landau and Lifschitz, 1976),

Γ = R_z(θ1) R_x(θ2) R_z(θ3),   θ1, θ3 ∈ [0, 2π), θ2 ∈ [0, π],

where R_z and R_x denote rotations about the z- and x-axes. Using these co-ordinates, the density of the uniform distribution on SO(3) is sin(θ2)/(8π²).

We consider the following priors for the parameters (κ, Γ_i, β), where κ = 1/σ². Assume that κ follows a Gamma distribution with shape parameter a and scale parameter b. We consider the prior for the rotation matrix Γ_i to be the matrix Fisher distribution (Mardia and Jupp, 2000, p.89), with density proportional to exp{trace(F_0^⊤ Γ_i)} sin θ_i2, where the sin θ_i2 term is due to the uniform measure. The regression parameters β are taken to be uniform, and all the parameters are taken to be independent a priori. Then the joint posterior density for (β, Γ_1, . . . , Γ_n, κ) is proportional to the likelihood multiplied by these priors.

The full conditional distribution of κ is a Gamma distribution, and the full conditional distribution of the free elements of β is multivariate normal with covariance obtained from Ω ⊗ Σ, a 3k(p+1) × 3k(p+1) covariance matrix, where (−0) stands for removing the 2nd, 3rd and 6th elements of vec(β^⊤) and vec(ξ^⊤) (the entries constrained to zero above the diagonal of β_0), and also removing the corresponding three rows and columns of Ω ⊗ Σ. Hence for each of vec(β_0)_{(−0)}, vec(β_1), . . . , vec(β_p), of lengths 3k − 3, 3k, . . . , 3k, we can use the conditional distribution of a partitioned multivariate normal distribution.

The conditional posterior for (Γ_1, . . . , Γ_n) factorizes over observations, and hence for a specific ith observation, using independence, the conditional posterior for Γ_i is proportional to exp{trace(F_0^⊤ Γ_i) + κ trace(Γ_i^⊤ β^⊤ X_i^⊤ Y_i)} sin θ_i2. Dropping the observation index i for a moment, the log conditional density is linear in (cos θ1, sin θ1) and in (cos θ3, sin θ3), up to remainder terms R_1, R_2, R_3 that are independent of each respective Euler angle. Hence for the ith observation the conditional distributions for θ_i1 and θ_i3 are von Mises distributions (Green and Mardia, 2006). Since the conditional density of θ_i2 is not of standard form, we use a Metropolis-Hastings update for θ_i2. Hence for posterior sampling by MCMC we use a Gibbs sampler for (κ, β, θ_i1, θ_i3), and a Metropolis-Hastings algorithm for θ_i2.
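The ZXZ parameterisation can be checked numerically. The sketch below (our own helper, not the paper's code) builds Γ = R_z(θ1) R_x(θ2) R_z(θ3) and verifies that the result lies in SO(3).

```python
import numpy as np

def rot_zxz(t1, t2, t3):
    """Sketch: ZXZ Euler-angle parameterisation of SO(3),
    Gamma = Rz(t1) @ Rx(t2) @ Rz(t3). Under the uniform (Haar) measure,
    the density in these angles is proportional to sin(t2) on [0, pi]."""
    def rz(t):
        c, s = np.cos(t), np.sin(t)
        return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    def rx(t):
        c, s = np.cos(t), np.sin(t)
        return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    return rz(t1) @ rx(t2) @ rz(t3)

G = rot_zxz(0.4, 1.1, -0.7)
print(np.allclose(G @ G.T, np.eye(3)))      # orthogonal
print(np.isclose(np.linalg.det(G), 1.0))    # determinant +1, so G is in SO(3)
```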
Remark. (Helmertized size-and-shape). Let Y H i = HY i be Helmertized size-and-shapes, where H is the Helmert sub-matrix (Dryden and Mardia, 2016, p.49-50). It is often useful to work with Y H i as this takes care of the location invariance for size-and-shapes, and reduces the number of parameters appropriately.
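A minimal sketch of the Helmert sub-matrix, following the standard construction in Dryden and Mardia (2016), illustrates how pre-multiplying by H removes location: its rows are orthonormal and orthogonal to the vector of ones.

```python
import numpy as np

def helmert_sub(k):
    """Sketch of the (k-1) x k Helmert sub-matrix H. Row j has j entries
    equal to h_j = -1/sqrt(j(j+1)), then -j*h_j, then zeros."""
    H = np.zeros((k - 1, k))
    for j in range(1, k):
        hj = -1.0 / np.sqrt(j * (j + 1))
        H[j - 1, :j] = hj
        H[j - 1, j] = -j * hj
    return H

H = helmert_sub(5)
print(np.allclose(H @ H.T, np.eye(4)))       # rows are orthonormal
print(np.allclose(H @ np.ones(5), 0.0))      # rows are orthogonal to 1_k
Y = np.arange(15.0).reshape(5, 3)            # hypothetical 5-landmark configuration
print(np.allclose(H @ (Y + 7.0), H @ Y))     # HY is invariant to translation
```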

Condition. (Identifiability). Consider the Helmertized model

Y_i^H = X_i^H β Γ_i + ε_i^H,

where the β_j, j = 0, 1, . . . , p, are (k − 1) × 3 matrices. Let G be the number of distinct covariate tuples in (x_1, . . . , x_n), and for k ≥ 4 let p_1 be the number of regression parameters that can be identified from the G distinct mean size-and-shapes. Let p_2 be the number of parameters in the regression model. Then all p_2 parameters in the regression model are identifiable if p_1 ≥ p_2. This identifiability condition indicates that the stability of estimation depends on how many distinct tuples of covariates are used. If p_1 < p_2 then MCMC draws of the parameters can be far from the true values due to non-identifiability, and a large number of iterations may be needed if p_1 = p_2.

Inference and Model Selection
A 100(1 − α)% credible interval for β_j can be obtained from the MCMC output as (β_{j,α/2}, β_{j,1−α/2}), where β_{j,P} denotes the quantile at probability P based on order statistics from the sample β_j^{(t)} after burn-in. Since β_j^{(t)} is a matrix, we define the matrix quantile as an element-wise quantile.
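Element-wise matrix quantiles are straightforward to compute from stored MCMC draws. In this sketch the draws are simulated stand-ins for sampler output, with hypothetical dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)
T, k1, m = 1000, 29, 3                        # hypothetical post-burn-in draws
beta_draws = rng.standard_normal((T, k1, m))  # stand-in for MCMC output of beta_j

# Element-wise matrix quantiles: apply the empirical quantile to each entry
# of beta_j across the T retained draws, giving matrix credible limits.
lo = np.quantile(beta_draws, 0.025, axis=0)   # (k1, m) lower limits
hi = np.quantile(beta_draws, 0.975, axis=0)   # (k1, m) upper limits
contains_zero = (lo <= 0.0) & (0.0 <= hi)     # which entries' 95% CI contains 0
print(lo.shape, hi.shape)
```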
An alternative approach is based on marginal Gaussian approximations to the posterior distribution of each element of β_j.

The Watanabe-Akaike information criterion, or widely applicable information criterion (WAIC) (Watanabe, 2010; Gelman et al., 2013, p.173), is a fully Bayesian criterion based on the log pointwise posterior predictive density, computed from the probability density function f(Y_i) of each Y_i, adjusted by the effective number of parameters, p_WAIC, to avoid overfitting; again we have two possible estimates of the effective number of parameters. The Akaike information criterion (AIC) (Akaike, 1973) and the Bayesian information criterion (BIC) (Schwarz, 1978) are defined by

AIC = −2 log L(Θ_mle) + 2K,   BIC = −2 log L(Θ_mle) + K log n,

where Θ_mle is the sample point at which the log-likelihood function is maximised after burn-in and K is the number of parameters, so that K = 3(k − 1)p − 3 + n + 1. It is well known that AIC is minimax-rate optimal in estimating the regression function (Barron et al., 1999; Yang, 2005), and BIC is consistent in model selection (Shao, 1997; Yang, 2005). Note that the model with the smaller DIC, WAIC, AIC or BIC is preferred.

Application to Forensic Facial Data

Data Description

Facial features play an important role in forensic science, including in criminal investigations where CCTV evidence is commonly used. A study was carried out into using face landmarks for identification, and was reported by Evison and Bruegge (2010). Clearly age and gender are expected to be important covariates when describing the size and shape of the face landmark configurations, and so we develop some Bayesian regression models to explore the relationship. A set of 3D facial images was captured by a Geometrix FaceVision FV802 Series Biometric camera, and then 30 anthropometric landmarks in 3D were selected by trained observers. The volunteers in the study were primarily scanned at the Magna Science Adventure Centre, Rotherham, UK. Evison and Bruegge (2010, Chapter 3) give full details of the project and provide discussion about the selection of the 30 landmarks.
Many of the face landmark sets were recorded twice, either by different observers or by the same observer. In total we have 3248 face landmark configurations from 1964 volunteers, in particular 956 faces from 627 females and 2292 faces from 1337 males. The landmark positions and descriptions are given in Table 1, following Evison and Bruegge (2010). Our main interest here involves investigating the relation between age and the size and shape of the faces for each gender. As would be expected, on average male faces are larger and wider than female faces, as shown in Fig. 1a, where the main growth direction corresponds to the first shape principal component's direction, indicated by black lines in Fig. 1b,c. See Dryden and Mardia (2016, Sections 7.7-7.8) for a summary of principal components analysis in shape and size-and-shape analysis, which has been implemented in the R functions procGPA() and shapepca() in the package shapes (Dryden, 2017). In order to measure the size of a face landmark configuration, we use the centroid size of a configuration X, given by

S(X) = ‖HX‖,

where H is the Helmert sub-matrix (Dryden and Mardia, 2016, p.49) and ‖X‖ = √trace(X^⊤X). We see that the centroid size is closely related to the first size-and-shape principal component (PC), as seen in Fig. 2: the correlation coefficient between the centroid size and the first size-and-shape principal component score is -0.959 and 0.954 for females and males, respectively. Note that the signs of the PC loadings are arbitrary, and here PC1 and PC3 have different signs for females and males.
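The centroid size computation can be sketched directly; it equals the root sum of squared distances of the landmarks from their centroid. The square configuration below is a hypothetical example, not face data.

```python
import numpy as np

def centroid_size(X):
    """Centroid size S(X) = ||HX|| = ||CX||: the square root of the sum of
    squared distances of the k landmarks from their centroid."""
    Xc = X - X.mean(axis=0)                 # subtract the centroid
    return np.sqrt(np.trace(Xc.T @ Xc))

X = np.array([[0.0, 0.0, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [2.0, 2.0, 0.0]])             # hypothetical 4-landmark square
print(centroid_size(X))                     # sqrt(8) for this square
print(centroid_size(X + 5.0))               # unchanged by translation
```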

Models and Implementation
Recall from Section 2.1 that the intercept matrix β_0 is lower triangular using an LQ decomposition for model identifiability, and the procedure is more stable if the landmarks in the first three positions are well separated. Hence in this application we re-ordered the landmarks as (1, 3, 30, 2, 4, 5, . . . , 28, 29). Note that the inference is invariant to a re-ordering of the landmarks, and so in theory such a re-ordering should make no difference to our modelling. However, in the computational implementation it is best to avoid having the first three landmarks too close together, as otherwise numerical instabilities can appear due to the standardisation via the LQ decomposition. The proposed re-ordering leads to stable results, and would in practice be equivalent to any other re-ordering with well-separated landmarks in the first three positions. For each gender we use three Helmertized models, M1, M2 and M3, where the Helmertizing takes care of the location invariance. We consider a weakly informative conjugate prior for κ, with a = 0.001 and b = 1000 (Spiegelhalter et al., 1994, 2003), and the hyperparameter F_0 in the prior distribution for the rotation parameters is taken as a 3 × 3 matrix of zeroes. For the MCMC algorithm we set the initial value of β to 0, Γ_i, i = 1, . . . , n, to 3 × 3 identity matrices, and κ to a random draw from Gamma(a, b). The Gibbs samplers of Section 2.3 are used for updating (κ, β, θ_1, θ_3), and in order to update θ_2 via the Metropolis-Hastings algorithm we use a normal distribution with standard deviation σ_θ2 = 0.3 as the proposal distribution. To obtain a centred predicted face configuration we pre-multiply each fitted value Ŷ_i by C, for example for M2, where C = I_k − (1/k) 1_k 1_k^⊤, I_k is the k × k identity matrix, 1_k is the column vector of k ones, β̄_j and Γ̄_i are the arithmetic means of the MCMC sample of T iterations for β_j and Γ_i after burn-in, and in our faces application we have k = 30 landmarks.
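The centring step with C = I_k − (1/k) 1_k 1_k^⊤ can be sketched as follows; the fitted configuration here is a random stand-in for a model prediction, not actual output of the sampler.

```python
import numpy as np

k = 30                                       # number of face landmarks
C = np.eye(k) - np.ones((k, k)) / k          # centring matrix I_k - (1/k) 1 1^T

rng = np.random.default_rng(3)
Y_fit = rng.standard_normal((k, 3))          # stand-in for a fitted configuration
Y_cent = C @ Y_fit                           # centred predicted face
print(np.allclose(Y_cent.mean(axis=0), 0))   # each coordinate now averages to zero
```

Pre-multiplying by C subtracts the centroid from every landmark, so the centred prediction can be compared directly with location-removed data.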

Results
We run the MCMC chain for 200,000 iterations with 100,000 iterations of burn-in. The Metropolis-Hastings acceptance rate for θ_2 is around 3.72% for the female data and 3.83% for the male data. The posterior error variance for females is smaller than that for males, as the posterior mean estimates for κ are larger for females than for males in Table 2.
For the three models considered, model M2 is generally the best model for both female and male groups since M2 outperforms the others in terms of the model selection statistics DIC, WAIC and AIC in Table 3 (except that M1 has the smallest BIC for females). We now investigate the structure of the models of the fitted size-and-shapes of the face landmarks versus age and gender.
We display the results from the fitted regression models M1, M2, M3 for each gender in Fig. 3, which shows the individual face landmark data registered by generalized Procrustes size-and-shape analysis (Dryden and Mardia, 2016, p.143) in light grey, viewed from the front projection of the face. The fitted configurations can be arbitrarily rotated, and in order to compare the fitted configurations over age with the size-and-shapes of the Procrustes registered data, we apply ordinary Procrustes analysis to translate and rotate the fitted faces from the model (in red) onto the Procrustes mean size-and-shape of the data. The fitted faces are indicated by a red curved line from the fitted face at age 15 through to the fitted face at age 80 (which is identified with a black dot). The fitted models for the females and males show important differences in Fig. 3. In particular, it is noticeable that the amount and direction of facial growth as age increases differ between females and males. For model M1, the males' fitted face grows linearly as age increases, but the change for females is different as age increases. In some areas such as the eyes and ears, the face grows more quickly later in life for females. The growth direction of the ears of females is relatively wider than that for males. The lengths of the credible intervals for age for the eight landmarks on the ears (landmarks 23-30) are shown in Fig. 5. Figure 4 shows magnified ears and lips. The main features that are apparent from Fig. 4 are that as the faces become older the ears become larger and the lips become less full (i.e. thinner). Some of the curves have turning points where the behaviour is different before and after the turning point. The age at the turning point of the predicted curves can differ depending on the landmark position. We mark blue points to indicate age 37 for females and age 52 for males, and a black dot for age 80. Note that the predicted red points were obtained at equal age intervals.
[Figure 5: Length of credible interval (M2), plotted against landmark number, for the age and age² coefficients.]

The speed of facial growth varies over age. For example, for the top of the ears of females (a and b), the upper parts of the top of the ears grow slowly for young women but rapidly for older women. For the lower parts of the top of the ears, the speed of growth starts slowly and then becomes faster after age 37. For men (c and d), a similar pattern to the lower parts of the top of the ears appears, with turning age 52. For the bottom of the ears and the lips (e-j), females and males show opposite results in growing speed: females grow more rapidly as age increases, whereas males grow more slowly as age increases. From Fig. 5a and b, the outer parts of the face have more posterior variability, as shown by the lengths of the credible intervals for both ears (landmarks 23 to 30). In contrast, near the eyes (landmarks 4-11), the lengths of the credible intervals are short. As the faces grow with increasing age for both females and males, the variability near both ears is larger, as shown in c, d, e and f.
In Fig. 6 we see that the predicted centroid size for females is monotonically increasing. On the other hand, male faces grow in size until age 40 and then this stops, so we can see important differences here between the genders. Of course, Fig. 6 also illustrates the large amount of individual variability in face data, and our model is just a first step in modelling average face shape. Considerably more work is required in modelling individual or subgroup face data, although our methodology provides a useful framework in which to develop these ideas.