1 Introduction

It is difficult to overstate the value of graphical displays that accompany formal statistical classification procedures (see e.g., Tukey 1975). Indeed, it is an open question whether a complete assessment of group structure, including the overlap and separation of groups and the within-group sample behaviour, is possible without the aid of suitable graphics, relying exclusively on computed statistical measures and tables. In particular, graphical procedures capable of displaying simultaneously the means and the individuals of different groups, together with information on the relative contributions of a chosen set of classification variables, are in high demand.

Gabriel (1971) proposed the biplot for displaying simultaneously the rows and columns of a data matrix \({\varvec{X}}: n\times p\) in a single graph. A year later Gabriel showed how to construct a canonical variate analysis (CVA) biplot (Gabriel 1972) that provides a two-dimensional graphical approximation of various groups optimally separated according to a multidimensional CVA criterion. This CVA biplot became popular among statisticians and practitioners applying linear discriminant analysis (LDA) and CVA in various fields. Gittins (1985) can be consulted for an overview of CVA. Distances in this classical Gabriel CVA biplot are interpreted in terms of inner products between vectors. Such interpretations are not as straightforward as distances in an ordinary scatter plot. The unified biplot methodology proposed by Gower (1995) and extensively discussed in the monograph by Gower and Hand (1996) allows the CVA biplot to be regarded as a multivariate extension of an ordinary scatter plot. The concept of inter-sample distance is central to this approach, while information on the classifier variables is added to the graph of means and individual sample points in the form of biplot axes: an axis calibrated in its original units for each (classifier) variable. The perspective of Gower and Hand (1996) inspires the biplot methodology discussed in this paper. Gower et al. (2011) discuss the one-dimensional CVA biplot for use in two-group studies in some detail. Although the latter authors provide several enhancements to this one-dimensional CVA biplot, they did not address the challenge of constructing an optimal two-dimensional biplot for the two-group case.

Classification procedures in practice are often confronted with the problem of optimally distinguishing between two groups. The aim of several procedures used in multidimensional classification is to separate the group means optimally. This is the aim of CVA and the closely related procedure of LDA. These techniques involve the transformation of the original observations into a so-called canonical space. Flury (1997), among others, proves that in the case of J groups, only the first \(K = min(p,J-1)\) elements of the group mean vectors differ in the canonical space. This induces some pitfalls when routinely constructing CVA biplots in the case where \(p > J-1\) with J equal to two or three. Since in the case of two groups the group means can be exactly represented on a line, the associated CVA biplot becomes a one-dimensional plot with all p biplot axes representing the different variables on top of each other on the line extending through the representations of the two group means. All n sample points also fall on this line. The theoretical basis for constructing this line is provided by the eigenanalysis of a two-sided eigenvalue problem (see e.g., Gower and Hand (1996) and Gower et al. (2011)). In the two-sample case this eigenanalysis involves only one non-zero (positive) eigenvalue together with \(p-1\) zero eigenvalues. Therefore, only the eigenvector associated with the single non-zero eigenvalue is uniquely defined. This eigenvector provides the scaffolding for constructing a one-dimensional CVA biplot. Although a two-dimensional biplot can be constructed using two eigenvectors, the second eigenvector is not uniquely defined, as it would be when \(J > 2\), unless some precautions are taken. A similar situation arises in the case of three groups: the first two eigenvectors are then uniquely defined but not the third. Why do we want extra scaffolding dimensions for constructing CVA biplots when \(p > J-1\)? There is a real advantage when we have a two-group or three-group classification problem: not only are the group means then represented exactly, but an improvement in the approximations of the individual sample points is also achieved. Therefore, in this paper, we first define a measure of how well any mean or individual sample point is represented in a CVA biplot. Then we show, in the case of two groups, how to construct a uniquely defined two-dimensional CVA biplot such that both the means and individual samples are optimally represented together with the variables in the form of calibrated biplot axes. In addition, we will show how this process can be extended directly for constructing a uniquely defined optimal three-dimensional biplot when three groups are considered.

The paper is organized as follows: in the next section, we begin with some historical background of discriminant analysis. After that, we briefly review the basic concepts and theory of CVA biplot methodology according to the perspective of Gower and Hand (1996). This is followed by a section describing the geometry of the graphical representation of the class means and the sample points about them. Section 4 discusses why the two-group CVA biplot deserves special attention. We then put forward proposals for one-dimensional CVA biplots as well as a two-dimensional CVA biplot such that the group means in a two-group classification problem are exactly represented together with an optimal representation of the individual sample points. We also show how to generalize this procedure to construct a three-dimensional biplot with similar properties for use when \(J = 3 < p\). In section 6 we briefly discuss an alternative unique two-dimensional biplot, which is based on the Bhattacharyya distance. The theoretical results are illustrated in section 7 where we provide examples covering one- and two-dimensional CVA biplots. Some conclusions are given in section 8.

2 A brief review of canonical variate analysis

2.1 Two-group discriminant analysis: Fisher’s LDA

Two-group discriminant analysis considers two populations (groups) \(G_{1}\) and \(G_{2}\). An observation \({\varvec{x}}\) of \({\varvec{X}}^{T} = (X_{1}, X_{2}, \ldots , X_{p})\) is to be allocated to one of these populations. It is assumed that the density \(f_{i}({\varvec{x}})\) of \({\varvec{X}}\) in group \(G_{i}\), \(i = 1; 2\), is known, with expected value \({\varvec{\mu }}_{i}:p\times 1\) and covariance matrix \({\varvec{\varSigma }}_{i}: p \times p\), respectively. Let the prior probability of an unknown \({\varvec{x}}\) belonging to \(G_{i}\) be given by

\(p_{1} = P(G_{1})\) and \(p_{2} = P(G_{2})\), respectively, with \(p_{i} > 0\) and \(p_{1}+p_{2} = 1\).

Define

$$\begin{aligned} {\varvec{\mu }}= & {} E\left( {\varvec{X}}\right) = p_{1}{\varvec{\mu }}_{1} + p_{2}{\varvec{\mu }}_{2},\\ {\varvec{T}}= & {} E\left[ \left( {\varvec{X}}-{\varvec{\mu }}\right) \left( {\varvec{X}}-{\varvec{\mu }}\right) ^{T}\right] ,\\ {\varvec{B}}= & {} p_{1}\left( {\varvec{\mu }}_{1}-{\varvec{\mu }}\right) \left( {\varvec{\mu }}_{1}-{\varvec{\mu }}\right) ^{T} + p_{2}\left( {\varvec{\mu }}_{2}-{\varvec{\mu }}\right) \left( {\varvec{\mu }}_{2}-{\varvec{\mu }}\right) ^{T}, \; \textrm{and}\\ {\varvec{W}}= & {} p_{1}{\varvec{\varSigma }}_{1}+p_{2}{\varvec{\varSigma }}_{2}.\end{aligned}$$

Under the assumption that \({\varvec{\varSigma }}_{1} = {\varvec{\varSigma }}_{2} = {\varvec{\varSigma }}\) (say), it follows that \({\varvec{W}} = {\varvec{\varSigma }}\). Fisher's LDA (Fisher 1936) searches for the linear function

$$\begin{aligned} Y= {\varvec{m}}^{T}{\varvec{X}} \end{aligned}$$

with \(E(Y) = {\varvec{m}}^{T}E({\varvec{X}}) = {\varvec{m}}^{T}{\varvec{\mu }}\), \(E(Y|G_{i}) = {\varvec{m}}^{T}{\varvec{\mu }}_{i}\) and \(var(Y) = {\varvec{m}}^{T}{\varvec{W}}{\varvec{m}}\) to maximize

$$\begin{aligned} \frac{p_{1}\left( {\varvec{m}}^{T}{\varvec{\mu }}_{1} -{\varvec{m}}^{T}{\varvec{\mu }}\right) ^{2} + p_{2}\left( {\varvec{m}}^{T}{\varvec{\mu }}_{2} -{\varvec{m}}^{T}{\varvec{\mu }}\right) ^{2} }{{\varvec{m}}^{T}{\varvec{W}}{\varvec{m}}}=\frac{{\varvec{m}}^{T}{\varvec{B}}{\varvec{m}}}{{\varvec{m}}^{T}{\varvec{W}}{\varvec{m}}}. \end{aligned}$$
(1)

The maximum is obtained from the eigenequation

$$\begin{aligned} ({\varvec{W}}^{-1}{\varvec{B}}){\varvec{m}} = {\varvec{m}}{\varvec{\varLambda }}. \end{aligned}$$

Pre-multiplying the above equation with \({\varvec{W}}^{1/2}\) leads to

$$\begin{aligned} \left( {\varvec{W}}^{-1/2}{\varvec{B}}{\varvec{W}}^{-1/2}\right) \left( {\varvec{W}}^{1/2}{\varvec{m}}\right) = \left( {\varvec{W}}^{1/2}{\varvec{m}}\right) {\varvec{\varLambda }}, \end{aligned}$$

so that \({\varvec{l}} = {\varvec{W}}^{1/2}{\varvec{m}}\) is the eigenvector of \({\varvec{W}}^{-1/2}{\varvec{B}}{\varvec{W}}^{-1/2}\) associated with the largest eigenvalue \(\lambda _{1}\) and \({\varvec{m}}={\varvec{W}}^{-1/2}{\varvec{l}}.\)

Since \(rank({\varvec{B}})=1=rank({\varvec{W}}^{-1}{\varvec{B}})=rank({\varvec{W}}^{-1/2}{\varvec{B}}{\varvec{W}}^{-1/2})\), it follows that

$$\begin{aligned} {\varvec{BM}} ={\varvec{WM}}\left[ \begin{array}{cccc} \lambda _{1} &{} 0 &{} \ldots &{} 0 \\ 0 &{} 0 &{} \ldots &{} 0 \\ \vdots &{} \vdots &{} \vdots &{} \vdots \\ 0 &{} 0 &{} \ldots &{} 0 \end{array}\right] . \end{aligned}$$
(2)

If we choose \({\varvec{L}}\) to be an orthogonal matrix in \(({\varvec{W}}^{-1/2}{\varvec{B}}{\varvec{W}}^{-1/2}){\varvec{L}}={\varvec{L\varLambda }}\), then

$$\begin{aligned} {\varvec{L}}^{T}{\varvec{L}} = {\varvec{I}} = {\varvec{M}}^{T}{\varvec{WM}} \end{aligned}$$

and

$$\begin{aligned} {\varvec{M}}^{T}{\varvec{BM}} = {\varvec{M}}^{T}{\varvec{WM\varLambda }} = {\varvec{\varLambda }}. \end{aligned}$$

The transformation

$$\begin{aligned} Y_{k} = {\varvec{m}}^{T}_{k}{\varvec{X}} \end{aligned}$$

for \(k = 1, 2, \ldots , p\) is termed a transformation into the canonical space with \(Y_{1}\) the first canonical variate, where

\(E(Y_{1}|G_{i}) = {\varvec{m}}_{1}^{T}{\varvec{\mu }}_{i} \; \textrm{and} \; Var(Y_{1}|G_{i}) = 1 \; \textrm{for} \; i = 1; \; 2\).

After proper scaling, the solution \({\varvec{m}}_{1}\) maximizing (1) can be written as

$$\begin{aligned} {\varvec{W}}^{-1}\left( {\varvec{\mu }}_{1}-{\varvec{\mu }}_{2}\right) , \end{aligned}$$
(3)

which is known as Fisher’s linear discriminant function (LDF). The maximum of (1) is given by the squared Mahalanobis distance, namely

$$\begin{aligned} \left( {\varvec{\mu }}_{1} - {\varvec{\mu }}_{2}\right) ^{T}{\varvec{W}}^{-1}\left( {\varvec{\mu }}_{1} - {\varvec{\mu }}_{2}\right) . \end{aligned}$$

For \(k=2, 3, \ldots , p\) we have \(Var(Y_{k} | G_{i}) = 1\) and

$$\begin{aligned} {\varvec{m}}^{T}_{k} {\varvec{Bm}}_{k} = 0. \end{aligned}$$
(4)

Since

$$\begin{aligned} {\varvec{B}}= & {} p_{1}\left( {\varvec{\mu }}_{1}-{\varvec{\mu }}\right) \left( {\varvec{\mu }}_{1}-{\varvec{\mu }}\right) ^{T} + p_{2}\left( {\varvec{\mu }}_{2}-{\varvec{\mu }}\right) \left( {\varvec{\mu }}_{2}-{\varvec{\mu }}\right) ^{T} \\= & {} \left( p_{1}p_{2}^{2} + p_{2}p_{1}^{2}\right) \left( {\varvec{\mu }}_{1}-{\varvec{\mu }}_{2}\right) \left( {\varvec{\mu }}_{1}-{\varvec{\mu }}_{2}\right) ^{T}, \end{aligned}$$

it follows from (4) that \({\varvec{m}}_{k}^{T}\left( {\varvec{\mu }}_{1}-{\varvec{\mu }}_{2}\right) =0\) and all differences between the groups vanish for the second and further canonical variates.

Rao (1948) extends Fisher's LDA procedure to \(J>2\) groups by deriving \(J-1\) linear discriminant functions of the form (3), but in this paper we are primarily interested in the case \(J=2\).
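
As a small numerical illustration of the above (ours, not drawn from the cited sources), the Python sketch below evaluates Fisher's LDF (3) for arbitrary population parameters and verifies that it attains the maximum of the ratio (1), namely the single non-zero eigenvalue of \({\varvec{W}}^{-1}{\varvec{B}}\).

```python
import numpy as np

p = 4
mu1 = np.zeros(p)
mu2 = np.array([1.5, 0.5, -1.0, 0.8])
W = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))   # common within-group covariance (illustrative)
p1, p2 = 0.4, 0.6                              # prior probabilities (illustrative)

mu = p1 * mu1 + p2 * mu2
B = p1 * np.outer(mu1 - mu, mu1 - mu) + p2 * np.outer(mu2 - mu, mu2 - mu)

# Fisher's LDF (3): m proportional to W^{-1}(mu1 - mu2)
m = np.linalg.solve(W, mu1 - mu2)

# W^{-1}B has a single non-zero eigenvalue lambda_1; the ratio (1)
# evaluated at m equals lambda_1, i.e., m attains the maximum.
lam1 = np.linalg.eigvals(np.linalg.solve(W, B)).real.max()
ratio_at_m = (m @ B @ m) / (m @ W @ m)
print(np.isclose(ratio_at_m, lam1))   # True
```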

2.2 Bayes linear and quadratic classifiers

Under the assumption of multivariate normal distributions with different covariance matrices \({\varvec{\varSigma }}_{1}\) and \({\varvec{\varSigma }}_{2}\), the Bayes quadratic classifier (see e.g., Hastie et al. 2001) is given by

$$\begin{aligned}{} & {} \frac{1}{2}\left( {\varvec{X}}-{\varvec{\mu }}_{1}\right) ^{T}{\varvec{\varSigma }}_{1}^{-1}\left( {\varvec{X}}-{\varvec{\mu }}_{1}\right) - \frac{1}{2}\left( {\varvec{X}}-{\varvec{\mu }}_{2}\right) ^{T}{\varvec{\varSigma }}_{2}^{-1}\left( {\varvec{X}}-{\varvec{\mu }}_{2}\right) \\{} & {} \quad + \frac{1}{2} log\left( \frac{|{\varvec{\varSigma }}_{1}|}{|{\varvec{\varSigma }}_{2}|}\right) {\mathop {<}\limits ^{\textstyle {>}}} log \left( \frac{p_{1}}{p_{2}}\right) . \end{aligned}$$

If \({\varvec{\varSigma }}_{1} = {\varvec{\varSigma }}_{2} = {\varvec{\varSigma }}\) (say), we have the Bayes linear classifier

$$\begin{aligned} \left( {\varvec{\mu }}_{2}-{\varvec{\mu }}_{1}\right) ^{T}{\varvec{\varSigma }}^{-1}{\varvec{X}} + \frac{1}{2}\left( {\varvec{\mu }}_{1}^{T}{\varvec{\varSigma }}^{-1}{\varvec{\mu }}_{1}-{\varvec{\mu }}_{2}^{T}{\varvec{\varSigma }}^{-1}{\varvec{\mu }}_{2}\right) {\mathop {<}\limits ^{\textstyle {>}}} log \left( \frac{p_{1}}{p_{2}}\right) . \end{aligned}$$

It is clear that if \(p_{1} =p_{2}\), the Bayes linear classifier is equivalent to Fisher’s LDF.
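The equivalence can be checked numerically. The sketch below (illustrative only; the parameter values are made up) allocates a point by comparing posterior log-densities, which is the Bayes rule irrespective of the direction of the inequality above, and compares the outcome with the LDA allocation based on Fisher's LDF applied at the midpoint of the two means.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_allocate(x, mu1, mu2, S1, S2, p1, p2):
    """Allocate x to group 1 or 2 by comparing posterior log-densities.
    With S1 == S2 the resulting boundary is linear in x."""
    d1 = multivariate_normal.logpdf(x, mu1, S1) + np.log(p1)
    d2 = multivariate_normal.logpdf(x, mu2, S2) + np.log(p2)
    return 1 if d1 > d2 else 2

# Illustrative parameters (not from the paper)
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

# With equal covariance matrices and p1 = p2 the linear boundary is
# (mu1 - mu2)^T Sigma^{-1} (x - (mu1 + mu2)/2) = 0, i.e., Fisher's LDF
# applied at the midpoint of the two means.
m = np.linalg.solve(Sigma, mu1 - mu2)
midpoint = 0.5 * (mu1 + mu2)
x = np.array([1.2, 0.4])
lda_group = 1 if m @ (x - midpoint) > 0 else 2
print(bayes_allocate(x, mu1, mu2, Sigma, Sigma, 0.5, 0.5), lda_group)  # agree
```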

In general, the Bhattacharyya distance measures the dissimilarity between two probability distributions. When the distributions concerned are \(N_{p}\left( {\varvec{\mu }}_{1}, {\varvec{\varSigma }}_{1}\right)\) and \(N_{p}\left( {\varvec{\mu }}_{2}, {\varvec{\varSigma }}_{2}\right)\) it is given by (see e.g., Fukunaga 1990; Hennig 2004):

$$\begin{aligned} D_{Bhat} = \frac{1}{8}\left( {\varvec{\mu }}_{2}-{\varvec{\mu }}_{1}\right) ^{T}\left( \frac{{\varvec{\varSigma }}_{1}+{\varvec{\varSigma }}_{2}}{2}\right) ^{-1}\left( {\varvec{\mu }}_{2}-{\varvec{\mu }}_{1}\right) + \frac{1}{2}log\left( \frac{\left| \frac{{\varvec{\varSigma }}_{1}+{\varvec{\varSigma }}_{2}}{2}\right| }{\sqrt{|{\varvec{\varSigma }}_{1}||{\varvec{\varSigma }}_{2}|}}\right) . \end{aligned}$$
(5)

It can be shown that the sample form of (5) provides an upper bound for the Bayes error (see e.g., McLachlan 1992). Furthermore, when \({\varvec{\varSigma }}_{1} = {\varvec{\varSigma }}_{2} = {\varvec{\varSigma }}\) then (5) is proportional to a squared Mahalanobis distance. For this case, Fukunaga (1990) shows that \(D_{Bhat}\) is maximized by \(Y= {\varvec{m}}^{T}{\varvec{X}}\) where \({\varvec{Bm}} = \lambda _{1}{\varvec{Wm}}\), so that maximization is achieved when \({\varvec{m}}\) is taken as (3).
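For completeness, a direct implementation of (5) is sketched below (with arbitrary parameter values); it also confirms that, with equal covariance matrices, \(D_{Bhat}\) reduces to one eighth of the squared Mahalanobis distance between the means.

```python
import numpy as np

def bhattacharyya(mu1, S1, mu2, S2):
    """Bhattacharyya distance (5) between N_p(mu1, S1) and N_p(mu2, S2)."""
    S = 0.5 * (S1 + S2)
    d = mu2 - mu1
    term1 = 0.125 * d @ np.linalg.solve(S, d)
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_S1 = np.linalg.slogdet(S1)
    _, logdet_S2 = np.linalg.slogdet(S2)
    term2 = 0.5 * (logdet_S - 0.5 * (logdet_S1 + logdet_S2))
    return term1 + term2

# With equal covariance matrices the second term vanishes and D_Bhat equals
# one eighth of the squared Mahalanobis distance between the means.
mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
S = np.array([[1.0, 0.4], [0.4, 2.0]])
maha2 = (mu1 - mu2) @ np.linalg.solve(S, mu1 - mu2)
print(np.isclose(bhattacharyya(mu1, S, mu2, S), maha2 / 8))  # True
```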

2.3 Two-group discrimination where group means and/or group covariance matrices may differ

While the LDA discussed above allows groups to differ only with respect to their means, Fukunaga (1990) looks for a linear transformation \({\varvec{Y}} = {\varvec{XM}}\) that separates groups with respect to their means or their covariance matrices. With the notation of the subsection above, write

$$\begin{aligned} {\varvec{W}}= & {} p_{1}{\varvec{\varSigma }}_{1}+p_{2}{\varvec{\varSigma }}_{2}= p_{1}E\left[ \left( {\varvec{X}}-{\varvec{\mu }}_{1}\right) \left( {\varvec{X}}-{\varvec{\mu }}_{1}\right) ^{T}|G_{1}\right] \\{} & {} + p_{2}E\left[ \left( {\varvec{X}}-{\varvec{\mu }}_{2}\right) \left( {\varvec{X}}-{\varvec{\mu }}_{2}\right) ^{T}|G_{2}\right] . \end{aligned}$$

Fukunaga (1990) provides four criteria for class separability:

\(J_{1} = tr({\varvec{S}}_{2}^{-1}{\varvec{S}}_{1})\); \(J_{2} = log|{\varvec{S}}_{2}^{-1}{\varvec{S}}_{1}|\); \(J_{3} = tr({\varvec{S}}_{1}) - \lambda (tr({\varvec{S}}_{2})-c)\), where \(\lambda\) is a Lagrange multiplier and c is a constant; and \(J_{4} = tr({\varvec{S}}_{1})/tr({\varvec{S}}_{2})\). Here \({\varvec{S}}_{1}\) and \({\varvec{S}}_{2}\) are each one of \({\varvec{T}}\), \({\varvec{B}}\), or \({\varvec{W}}\).

Let \(J_{i}(k)\) indicate the criterion with \({\varvec{S}}_{1}= {\varvec{B}}\) and \({\varvec{S}}_{2}= {\varvec{W}}\), and where \({\varvec{Y}} = {\varvec{XM}}\) with \({\varvec{M}}:p\times k\). Since we are restricted to linear transformations, optimization of \(J_{1}(1)\) is equivalent to Fisher discriminant analysis based on (1), with \(J_{1}(1)= \lambda _{1}\) and no other dimension contributing to the value of \(J_{1}\). Fukunaga (1990) further shows that criterion \(J_{1}\) gives the same optimum transformation for other combinations of \({\varvec{B}}\), \({\varvec{W}}\), and \({\varvec{T}}\) for \({\varvec{S}}_{1}\) and \({\varvec{S}}_{2}\), and also for optimizing \(J_{2}\).

When \({\varvec{\mu }}_{1}={\varvec{\mu }}_{2}={\varvec{\mu }}\), (5) becomes

$$\begin{aligned} D_{Bhat}= & {} \frac{1}{2}log\left( \frac{\left| \frac{{\varvec{\varSigma }}_{1}+{\varvec{\varSigma }}_{2}}{2}\right| }{\sqrt{|{\varvec{\varSigma }}_{1}||{\varvec{\varSigma }}_{2}|}}\right) \nonumber \\= & {} \frac{1}{4}\left[ log\left( |{\varvec{\varSigma }}_{2}^{-1}{\varvec{\varSigma }}_{1} +{\varvec{\varSigma }}_{1}^{-1}{\varvec{\varSigma }}_{2} +2{\varvec{I}}_{p}|\right) -plog(4)\right] . \end{aligned}$$
(6)

If more than a single dimension is needed we have that

$$\begin{aligned} J_{1}=log|({\varvec{M}}^{T}{\varvec{\varSigma }}_{2}{\varvec{M}})^{-1}({\varvec{M}}^{T}{\varvec{\varSigma }}_{1}{\varvec{M}})+ ({\varvec{M}}^{T}{\varvec{\varSigma }}_{1}{\varvec{M}})^{-1}({\varvec{M}}^{T}{\varvec{\varSigma }}_{2}{\varvec{M}}) +2{\varvec{I}}_{k}| \end{aligned}$$

is maximized by \({\varvec{Y}} = {\varvec{XM}}\) with \({\varvec{M}}: p \times k\), where

$$\begin{aligned} ({\varvec{\varSigma }}_{2}^{-1}{\varvec{\varSigma }}_{1}){\varvec{M}} ={\varvec{M}}\left( \left( {\varvec{M}}^{T}{\varvec{\varSigma }}_{2}{\varvec{M}}\right) ^{-1}\left( {\varvec{M}}^{T}{\varvec{\varSigma }}_{1}{\varvec{M}}\right) \right) \end{aligned}$$

and

$$\begin{aligned} ({\varvec{\varSigma }}_{1}^{-1}{\varvec{\varSigma }}_{2}){\varvec{M}} ={\varvec{M}}\left( ({\varvec{M}}^{T}{\varvec{\varSigma }}_{1}{\varvec{M}})^{-1}({\varvec{M}}^{T}{\varvec{\varSigma }}_{2}{\varvec{M}})\right) . \end{aligned}$$

Thus, \({\varvec{M}}\) must contain the eigenvectors of both \({\varvec{\varSigma }}_{2}^{-1}{\varvec{\varSigma }}_{1}\) and \({\varvec{\varSigma }}_{1}^{-1}{\varvec{\varSigma }}_{2}\). However, they have the same eigenvectors since \({\varvec{\varSigma }}_{2}^{-1}{\varvec{\varSigma }}_{1} = ({\varvec{\varSigma }}_{1}^{-1}{\varvec{\varSigma }}_{2})^{-1}\) and we have that

$$\begin{aligned} ({\varvec{M}}^{T}{\varvec{\varSigma }}_{2}{\varvec{M}})^{-1}({\varvec{M}}^{T}{\varvec{\varSigma }}_{1}{\varvec{M}})={\varvec{\varLambda }} \end{aligned}$$

and

$$\begin{aligned} ({\varvec{M}}^{T}{\varvec{\varSigma }}_{1}{\varvec{M}})^{-1}({\varvec{M}}^{T}{\varvec{\varSigma }}_{2}{\varvec{M}})={\varvec{\varLambda }}^{-1}, \end{aligned}$$

so that \(({\varvec{\varSigma }}_{2}^{-1}{\varvec{\varSigma }}_{1}){\varvec{M}} ={\varvec{M\varLambda }}\) and \(({\varvec{\varSigma }}_{1}^{-1}{\varvec{\varSigma }}_{2}){\varvec{M}} ={\varvec{M\varLambda }}^{-1}\).

Therefore, if \(k > 1\) dimensions are needed, the k eigenvectors chosen are those corresponding to the k largest contributions to the criterion, i.e., to the k largest values of \(\lambda _{i}+\frac{1}{\lambda _{i}} + 2\).
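
A minimal sketch of this selection rule is given below; the covariance matrices are simulated and the variable names are our own. The generalized eigenproblem \({\varvec{\varSigma }}_{1}{\varvec{v}} = \lambda {\varvec{\varSigma }}_{2}{\varvec{v}}\) yields the common eigenvectors of \({\varvec{\varSigma }}_{2}^{-1}{\varvec{\varSigma }}_{1}\) and \({\varvec{\varSigma }}_{1}^{-1}{\varvec{\varSigma }}_{2}\), which are then ranked by \(\lambda _{i}+1/\lambda _{i}+2\).

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative covariance matrices (equal means are assumed in this subsection)
rng = np.random.default_rng(1)
p = 5
A1 = rng.standard_normal((p, p))
A2 = rng.standard_normal((p, p))
Sigma1 = A1 @ A1.T + p * np.eye(p)
Sigma2 = A2 @ A2.T + p * np.eye(p)

# Generalized eigenproblem Sigma1 v = lambda Sigma2 v gives the common
# eigenvectors of Sigma2^{-1} Sigma1 (eigenvalues lambda) and
# Sigma1^{-1} Sigma2 (eigenvalues 1/lambda).
lam, M = eigh(Sigma1, Sigma2)

# Rank the columns of M by lambda + 1/lambda + 2 and keep the k best.
score = lam + 1.0 / lam + 2.0
order = np.argsort(score)[::-1]
k = 2
M_k = M[:, order[:k]]      # transformation Y = X M_k separating the covariance structures
print(np.round(score[order[:k]], 3))
```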

3 Discriminant analysis with sampled data

In practice, discriminant analysis is usually performed using sampled data. This necessitates the substitution of population parameters with sample estimates in the formulas introduced in section 2. The plug-in principle and the method of maximum likelihood are popular approaches for this.

It is well known that LDA often outperforms quadratic discriminant analysis (QDA) even when the assumption of equal covariance matrices is violated (see e.g., Flury et al. 1997; McLachlan 1992). This can be attributed to the large number of parameters that have to be estimated in QDA with over-parameterization inducing a loss of power. This contributes to the popularity of LDA among practitioners and so in the rest of the paper, our focus will be on LDA.

Consider the data matrix \({\varvec{X}}:n\times p\) centered such that \({\varvec{1}}^{T}{\varvec{X}} = {\varvec{0}}^{T}\). The data contained in \({\varvec{X}}\) consist of p measurements made on each of the n samples from the J groups. The group sizes are \(n_{1},n_{2},\ldots ,n_{J}\), respectively, such that \(\sum _{i=1}^{J} n_{i} =n\). Let \({\varvec{N}}_{g} = diag(n_{1},n_{2},\ldots ,n_{J})\), so that a matrix of group means can be calculated as

$$\begin{aligned} \overline{{\varvec{X}}}:J\times p={\varvec{N}}_{g}^{-1}{\varvec{G}}^{T}{\varvec{X}}=({\varvec{G}}^{T}{\varvec{G}})^{-1} {\varvec{G}}^{T}{\varvec{X}}, \end{aligned}$$
(7)

where \({\varvec{G}}: n\times J\) denotes an indicator matrix defining the J groups.

Let \({\mathcal {V}}({\varvec{X}}^{T})\) denote the vector space generated by the columns of \({\varvec{X}}^{T}\). We assume this vector space of p-vectors to be of dimension p. Since each row of \(\overline{{\varvec{X}}}\) is a linear combination of the rows of \({\varvec{X}}\), the columns of \(\overline{{\varvec{X}}}^{T}\) belong to \({\mathcal {V}}({\varvec{X}}^{T})\).

Define:

  1. \({\varvec{S}}_{B}:p\times p\), the between-group matrix of squares and products: \({\varvec{S}}_{B} =\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}}= {\varvec{X}}^{T}{\varvec{G}}({\varvec{G}}^{T}{\varvec{G}})^{-1}{\varvec{G}}^{T}{\varvec{X}}\), and

  2. \({\varvec{S}}_{W}:p\times p\), the within-group matrix of squares and products: \({\varvec{S}}_{W} = {\varvec{X}}^{T}{\varvec{X}} -\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}}={\varvec{X}}^{T}({\varvec{I}}-{\varvec{G}}({\varvec{G}}^{T}{\varvec{G}})^{-1}{\varvec{G}}^{T}){\varvec{X}}\).

Generally, \(rank({\varvec{S}}_{W}) = p\) while \(rank({\varvec{S}}_{B}) = min(J-1,p)\).
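The quantities defined above are straightforward to compute; the sketch below (using a small simulated two-group data set with arbitrary values) builds \({\varvec{G}}\), \({\varvec{N}}_{g}\) and \(\overline{{\varvec{X}}}\) as in (7), forms \({\varvec{S}}_{B}\) and \({\varvec{S}}_{W}\), and checks the stated ranks.

```python
import numpy as np

# Toy data: n samples on p variables from J = 2 groups (illustrative only)
rng = np.random.default_rng(2)
n1, n2, p = 30, 20, 4
X = np.vstack([rng.standard_normal((n1, p)),
               rng.standard_normal((n2, p)) + np.array([2.0, 1.0, 0.0, -1.0])])
X = X - X.mean(axis=0)                       # centre so that 1^T X = 0^T
g = np.repeat([0, 1], [n1, n2])              # group labels

G = np.eye(2)[g]                             # indicator matrix G: n x J
Ng = G.T @ G                                 # diag(n_1, ..., n_J)
Xbar = np.linalg.solve(Ng, G.T @ X)          # group means matrix, eq. (7)

S_B = Xbar.T @ Ng @ Xbar                     # between-group SSP matrix
S_W = X.T @ X - S_B                          # within-group SSP matrix
print(np.linalg.matrix_rank(S_W), np.linalg.matrix_rank(S_B))  # p and min(J-1, p)
```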

The two-sided eigenvalue problem

$$\begin{aligned} ({\varvec{S}}_{B}){\varvec{B}}=({\varvec{S}}_{W}){\varvec{B}}{\varvec{\varLambda }} \end{aligned}$$
(8)

provides the solution \({\varvec{b}}_{1}\) to the CVA criterion

$$\begin{aligned} \mathop{{\,\textrm{maximize}\,}}\limits_{{\varvec{b}}}\left( \frac{{\varvec{b}}^{T}({\varvec{S}}_{B}){\varvec{b}}}{{\varvec{b}}^{T}({\varvec{S}}_{W}){\varvec{b}}}\right) , \text { subject to } {\varvec{b}}^{T}({\varvec{S}}_{W}){\varvec{b}} = 1. \end{aligned}$$
(9)

In the above, the diagonal matrix \({\varvec{\varLambda }}:p\times p\) contains the eigenvalues \(\lambda _{1} \ge \lambda _{2} \ge \ldots \ge \lambda _{p} \ge 0\), where \(\lambda _{J} = \lambda _{J+1} = \ldots \ = \lambda _{p} = 0\) if \(J<p+1\).

The matrix \({\varvec{B}} = \left[ {\varvec{b}}_{1},{\varvec{b}}_{2},\ldots ,{\varvec{b}}_{p}\right]\) contains all p solutions to the two-sided eigenvalue problem. Only the solution \({\varvec{b}}_{1}\) is optimal for the CVA criterion (9). The matrix \({\varvec{B}}: p \times p\) is non-singular with \({\varvec{B}}^{-1}={\varvec{B}}^{T}({\varvec{S}}_{W})\), while the columns of \({\varvec{B}}\) are orthogonal in the metric \({\varvec{S}}_{W}\) because of the constraints \({\varvec{B}}^{T}({\varvec{S}}_{W}){\varvec{B}}={\varvec{I}}\).
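Numerically, the two-sided eigenvalue problem (8) can be solved as a generalized symmetric-definite eigenproblem; the sketch below (reusing a small simulated two-group example with arbitrary values) does so and verifies the constraint \({\varvec{B}}^{T}{\varvec{S}}_{W}{\varvec{B}}={\varvec{I}}\) and the single non-zero eigenvalue for \(J=2\).

```python
import numpy as np
from scipy.linalg import eigh

# Small illustrative two-group data set (J = 2, p = 4); all values are arbitrary.
rng = np.random.default_rng(3)
X = np.vstack([rng.standard_normal((25, 4)),
               rng.standard_normal((15, 4)) + [1.5, -0.5, 1.0, 0.0]])
X -= X.mean(axis=0)
G = np.eye(2)[np.repeat([0, 1], [25, 15])]
Ng = G.T @ G
Xbar = np.linalg.solve(Ng, G.T @ X)
S_B = Xbar.T @ Ng @ Xbar
S_W = X.T @ X - S_B

# Two-sided eigenvalue problem (8): S_B b = lambda S_W b. scipy's eigh handles
# the generalized symmetric-definite problem and returns eigenvalues in
# ascending order, so reverse to obtain lambda_1 >= ... >= lambda_p.
lam, B = eigh(S_B, S_W)
lam, B = lam[::-1], B[:, ::-1]

print(np.allclose(B.T @ S_W @ B, np.eye(4)))  # columns are S_W-orthonormal
print(np.round(lam, 6))                       # one non-zero eigenvalue when J = 2
```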

Canonical variates are defined by the transformation \({\varvec{y}}^{T}={\varvec{x}}^{T}{\varvec{B}}\), where \({\varvec{x}}\) is any p-vector belonging to \({\mathcal {V}}({\varvec{X}}^{T})\). The centered data matrix itself is transformed to the canonical variate values matrix \({\varvec{Y}}: n \times p\) through the one-to-one (canonical) transformation

$$\begin{aligned} {\varvec{Y}}: n \times p = {\varvec{XB}}. \end{aligned}$$
(10)

The transformation (10) implies a transformation of \({\mathcal {V}}({\varvec{X}}^{T})\) to \({\mathcal {V}}({\varvec{Y}}^{T})\), the canonical space, of dimension p since \(rank({\varvec{Y}}) = rank({\varvec{X}})\). Furthermore, (10) implies that

$$\begin{aligned} \overline{{\varvec{X}}}{\varvec{B}} = \overline{{\varvec{Y}}}: J \times p. \end{aligned}$$
(11)

We will call \(\overline{{\varvec{Y}}}\) the canonical means matrix. It follows that the columns of \(\overline{{\varvec{Y}}}^{T}\) generate a subspace of dimension \(min(J-1, p)\) of \({\mathcal {V}}({\varvec{Y}}^{T})\). This subspace is denoted by \({\mathcal {V}}(\overline{{\varvec{Y}}}^{T})\).

4 Biplot display of the canonical means matrix \(\overline{{\varvec{Y}}}\) and the canonical variates values matrix \({\varvec{Y}}\)

An r-dimensional canonical variate analysis (CVA) plot is constructed by taking the first r canonical variates, associated with

$$\begin{aligned} {\varvec{B}}_{r} = \left[ {\varvec{b}}_{1},{\varvec{b}}_{2},\ldots ,{\varvec{b}}_{r}\right] , \text { where } {\varvec{B}} = \left[ {\varvec{B}}_{r},{\varvec{b}}_{r+1},\ldots ,{\varvec{b}}_{p}\right] \end{aligned}$$
(12)

to provide the coordinates (or scaffolding) for representing the J canonical group means contained in (11) as points in r dimensions. If this plot is equipped with p linear axes to represent the original p variables, a CVA biplot is obtained. Each of these biplot axes is determined by a vector, which also induces a graduation on it. Gower and Hand (1996) consider two types of CVA biplots, each one characterized by its system of p linear axes, its aim and its corresponding geometry:

  • The interpolation biplot, which has the aim of placing on the plot the image \((y_{1},y_{2},...,y_{r})\) of any new point \({\varvec{x}} \in {\mathcal {V}}({\varvec{X}}^{T})\).

  • The prediction biplot, which has the aim of estimating the point \({\varvec{x}}\) (i.e., the set of variable values) having as an image a given point \((y_{1},y_{2},...,y_{r})\) in the plot.

Once the CVA biplot for representing the group means is constructed, all transformed samples contained in \({\varvec{Y}}\) can be interpolated into the biplot. Thus both the canonical means and the transformed samples \({\varvec{Y}}\) can be displayed in a CVA biplot in an r-dimensional subspace of \({\mathcal {V}}(\overline{{\varvec{Y}}}^{T})\), (with \(r \le min(J-1, p)\)). Typically, an r of two or three will be chosen to construct this subspace that we will call the biplot space.

Gower and Hand (1996) show the above processes of prediction and interpolation to be based on the following: A sample \({\varvec{x}}: p \times 1\) can be interpolated into \({\mathcal {V}}({\varvec{Y}}^{T})\) by

$$\begin{aligned} {\varvec{y}}: p \times 1 ={\varvec{B}}^{T}{\varvec{x}}, \,\,\,\mathrm {i.e.,} \,\,\,{\varvec{y}}^{T}= {\varvec{x}}^{T}{\varvec{B}}=\sum _{k=1}^{p}(x_{k}{\varvec{e}}_{k}^{T}){\varvec{B}}. \end{aligned}$$

The representation of \({\varvec{x}}\) in the biplot space is given by

$$\begin{aligned} {\varvec{z}}^{T}: 1\times r = {\varvec{x}}^{T}{\varvec{B}}_{r}=\sum _{k=1}^{p}(x_{k}{\varvec{e}}_{k}^{T}){\varvec{B}}_{r}, \end{aligned}$$
(13)

where \({\varvec{B}}_{r}\) is defined in (12).

Prediction is the inverse of interpolation and since \({\varvec{B}}\) is non-singular it follows by inverting the above formula for interpolation that

\({\varvec{x}}^{T}= {\varvec{y}}^{T}{\varvec{B}}^{-1}\). The matrix \({\varvec{B}}^{-1}\) can be partitioned into

$$\begin{aligned} {\varvec{B}}^{-1}=\begin{bmatrix} {\varvec{B}}^{(r)}: r \times p \hspace{0.5cm} \\ {\varvec{B}}^{(2)}: (p-r) \times p\end{bmatrix}. \end{aligned}$$
(14)

The predicted value for the kth variable can be written as \({\hat{x}}_{k} = {\varvec{z}}^{T}{\varvec{B}}^{(r)}{\varvec{e}}_{k}\) and therefore, the predicted value for \({\varvec{x}}^{T}\) is

$$\begin{aligned} \hat{{\varvec{x}}}^{T}= & {} {\varvec{z}}^{T}{\varvec{B}}^{(r)}\left[ {\varvec{e}}_{1}, \ldots ,{\varvec{e}}_{p}\right] \nonumber \\= & {} {\varvec{z}}^{T}{\varvec{B}}^{(r)} \nonumber \\= & {} {\varvec{x}}^{T}{\varvec{B}}_{r}{\varvec{B}}^{(r)}. \end{aligned}$$
(15)

It follows from (15) that

$$\begin{aligned} \hat{{\varvec{X}}} = {\varvec{X}}{\varvec{B}}_{r}{\varvec{B}}^{(r)}, \end{aligned}$$
(16)

and in addition

$$\begin{aligned} \hat{\overline{{\varvec{X}}}} = \overline{{\varvec{X}}}{\varvec{B}}_{r}{\varvec{B}}^{(r)}. \end{aligned}$$
(17)
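
The interpolation and prediction formulas (13) and (15)-(17) translate directly into matrix operations; the sketch below (same simulated two-group data as before, with \(r = J-1 = 1\)) computes the interpolated samples and the predicted samples and group means, confirming that the means are recovered exactly while the samples are only approximated.

```python
import numpy as np
from scipy.linalg import eigh

# Reuse the small two-group example; all data are illustrative.
rng = np.random.default_rng(3)
X = np.vstack([rng.standard_normal((25, 4)),
               rng.standard_normal((15, 4)) + [1.5, -0.5, 1.0, 0.0]])
X -= X.mean(axis=0)
G = np.eye(2)[np.repeat([0, 1], [25, 15])]
Ng = G.T @ G
Xbar = np.linalg.solve(Ng, G.T @ X)
S_B = Xbar.T @ Ng @ Xbar
S_W = X.T @ X - S_B
lam, B = eigh(S_B, S_W)
lam, B = lam[::-1], B[:, ::-1]

r = 1                                   # J - 1 = 1 scaffolding dimension
B_r = B[:, :r]                          # B_r in (12)
B_inv = np.linalg.inv(B)
B_sup_r = B_inv[:r, :]                  # B^{(r)} in (14)

Z = X @ B_r                             # interpolated samples, eq. (13)
X_hat = X @ B_r @ B_sup_r               # predicted samples, eq. (16)
Xbar_hat = Xbar @ B_r @ B_sup_r         # predicted group means, eq. (17)

# With r = J - 1 the two group means are recovered exactly,
# while the individual samples are only approximated.
print(np.allclose(Xbar_hat, Xbar), np.allclose(X_hat, X))
```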

Since in the CVA biplot described above, the samples are interpolated into the biplot constructed for the canonical means, it is expected that the class means will be better represented than the canonical variate values, i.e., the rows of \({\varvec{Y}}\). What is needed, then, are measures of fit for use in CVA.

4.1 Measures of fit for use in CVA

From the identity

$$\begin{aligned} {\varvec{X}}= {\varvec{G}}({\varvec{G}}^{T}{\varvec{G}})^{-1} {\varvec{G}}^{T}{\varvec{X}} +\left( {\varvec{I}} -{\varvec{G}}({\varvec{G}}^{T}{\varvec{G}})^{-1} {\varvec{G}}^{T}\right) {\varvec{X}}, \end{aligned}$$

we have that

$$\begin{aligned} {\varvec{X}}^{T}{\varvec{X}} = {\varvec{X}}^{T}{\varvec{Q}}{\varvec{X}} +{\varvec{X}}^{T}({\varvec{I}}-{\varvec{Q}}){\varvec{X}}, \end{aligned}$$

where the matrix \({\varvec{Q}}:n \times n = {\varvec{G}}({\varvec{G}}^{T}{\varvec{G}})^{-1}{\varvec{G}}^{T}\) is positive semi-definite, symmetric and idempotent.

This between-and-within group structure is of interest for:

  • Assigning a given sample to its most appropriate group.

  • Relating the groups to one another.

A study of the relationships among the groups encourages the use of low-dimensional approximations for visualization, including CVA biplots.

4.2 Measures of fit for CVA biplots: recovering the canonical group means

The orthogonal partitioning

$$\begin{aligned} {\varvec{B}}^{T}\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}}{\varvec{B}} = {\varvec{B}}^{T}(\hat{\overline{{\varvec{X}}}})^{T}{\varvec{N}}_{g}\hat{\overline{{\varvec{X}}}}{\varvec{B}} +{\varvec{B}}^{T}\left( {\varvec{\overline{{\varvec{X}}}}}-\hat{\overline{{\varvec{X}}}}\right) ^{T}{\varvec{N}}_{g}\left( {\varvec{\overline{{\varvec{X}}}}}-\hat{\overline{{\varvec{X}}}}\right) {\varvec{B}}, \end{aligned}$$

(see Gardner-Lubbe et al. 2008) allows an overall measure of how well the group means are represented in the CVA biplot, namely

$$\begin{aligned} Overall \;\; quality = \frac{tr\left( {\varvec{B}}^{T}(\hat{\overline{{\varvec{X}}}})^{T}{\varvec{N}}_{g}\hat{\overline{{\varvec{X}}}}{\varvec{B}} \right) }{ tr\left( {\varvec{B}}^{T}\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}}{\varvec{B}} \right) } = \frac{\sum _{k=1}^{r} \lambda _{k}}{\sum _{k=1}^{p} \lambda _{k}}.\end{aligned}$$

In the orthogonal partitioning above, the matrix \({\varvec{B}}\) can be eliminated to define axis predictivities as the diagonal elements of the matrix

$$\begin{aligned} {\varvec{\varPi }}:p\times p = diag\left( (\hat{\overline{{\varvec{X}}}})^{T}{\varvec{N}}_{g}\hat{\overline{{\varvec{X}}}}\right) \left[ diag(\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}})\right] ^{-1}. \end{aligned}$$

Each axis predictivity is a measure of how well the values for the group means can be determined from the CVA biplot for the variable associated with that particular biplot axis. We note that the overall quality is a weighted mean of the individual axis predictivities.
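These measures are easily computed; the sketch below uses an illustrative three-group example with \(r=1<J-1\) so that the overall quality and the axis predictivities are non-trivial. All data and names are our own.

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative three-group example so that the fit measures are non-trivial.
rng = np.random.default_rng(4)
p, sizes = 4, [20, 20, 20]
shifts = np.array([[0, 0, 0, 0], [2, 1, 0, -1], [-1, 2, 1, 0]], dtype=float)
X = np.vstack([rng.standard_normal((n, p)) + s for n, s in zip(sizes, shifts)])
X -= X.mean(axis=0)
G = np.eye(3)[np.repeat([0, 1, 2], sizes)]
Ng = G.T @ G
Xbar = np.linalg.solve(Ng, G.T @ X)
S_B = Xbar.T @ Ng @ Xbar
S_W = X.T @ X - S_B
lam, B = eigh(S_B, S_W)
lam, B = lam[::-1], B[:, ::-1]

r = 1                                          # deliberately less than J - 1 = 2
Xbar_hat = Xbar @ B[:, :r] @ np.linalg.inv(B)[:r, :]

# Overall quality = sum of the first r eigenvalues over the sum of all eigenvalues;
# axis predictivities are the diagonal elements of the matrix Pi defined above.
overall_quality = lam[:r].sum() / lam.sum()
axis_predictivity = np.diag(Xbar_hat.T @ Ng @ Xbar_hat) / np.diag(Xbar.T @ Ng @ Xbar)
print(round(overall_quality, 3), np.round(axis_predictivity, 3))
```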

4.3 Measures of fit for CVA biplots: recovering the individual samples

We also need predictivities for individual samples corrected for class means, i.e., for \(({\varvec{I}}-{\varvec{Q}}){\varvec{X}}\). So, our starting point becomes the decomposition

$$\begin{aligned} ({\varvec{I}}-{\varvec{Q}}){\varvec{X}}{\varvec{B}}=({\varvec{I}}-{\varvec{Q}})\hat{{\varvec{X}}}{\varvec{B}}+({\varvec{I}}-{\varvec{Q}})\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) {\varvec{B}}, \end{aligned}$$

where \(\hat{{\varvec{X}}}\) is defined in (16). Then the following orthogonal decompositions (see Gower et al. 2011) hold:

  1. Type A

    $$\begin{aligned} {\varvec{B}}^{T}{\varvec{X}}^{T}({\varvec{I}}-{\varvec{Q}}){\varvec{X}}{\varvec{B}}= & {} {\varvec{B}}^{T}\hat{{\varvec{X}}}^{T}({\varvec{I}}-{\varvec{Q}})\hat{{\varvec{X}}}{\varvec{B}}\\{} & {} +{\varvec{B}}^{T}\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) ^{T}({\varvec{I}}-{\varvec{Q}})\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) {\varvec{B}}. \end{aligned}$$

  2. Type B

    $$\begin{aligned} ({\varvec{I}}-{\varvec{Q}}){\varvec{X}}{\varvec{B}}{\varvec{B}}^{T}{\varvec{X}}^{T}({\varvec{I}}-{\varvec{Q}})= & {} ({\varvec{I}}-{\varvec{Q}})\hat{{\varvec{X}}}{\varvec{B}}{\varvec{B}}^{T}\hat{{\varvec{X}}}^{T}({\varvec{I}}-{\varvec{Q}})\\{} & {} + ({\varvec{I}}-{\varvec{Q}})\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) {\varvec{B}}{\varvec{B}}^{T}\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) ^{T}({\varvec{I}}-{\varvec{Q}}). \end{aligned}$$

While the class means are exactly represented in a subspace of dimension \(min(J-1,p)\) of the canonical space, this is generally not true for the individual samples. From Type B orthogonality within-group sample predictivities can be defined as the diagonal elements of

$$\begin{aligned} {\varvec{\varPsi }}_{W}: n \times n = diag\left( ({\varvec{I}}-{\varvec{Q}})\hat{{\varvec{X}}}{\varvec{S}}_{W}^{-1}\hat{{\varvec{X}}}^{T}({\varvec{I}}-{\varvec{Q}})\right) \left[ diag\left( ({\varvec{I}}-{\varvec{Q}}){\varvec{X}}{\varvec{S}}_{W}^{-1}{\varvec{X}}^{T}({\varvec{I}}-{\varvec{Q}})\right) \right] ^{-1}. \end{aligned}$$
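
A direct computation of these within-group sample predictivities is sketched below for the simulated two-group example used earlier; all names and values are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

# Within-group sample predictivities for the two-group toy example.
rng = np.random.default_rng(3)
X = np.vstack([rng.standard_normal((25, 4)),
               rng.standard_normal((15, 4)) + [1.5, -0.5, 1.0, 0.0]])
X -= X.mean(axis=0)
G = np.eye(2)[np.repeat([0, 1], [25, 15])]
Ng = G.T @ G
Q = G @ np.linalg.solve(Ng, G.T)                  # Q = G (G^T G)^{-1} G^T
Xbar = np.linalg.solve(Ng, G.T @ X)
S_B = Xbar.T @ Ng @ Xbar
S_W = X.T @ X - S_B
lam, B = eigh(S_B, S_W)
lam, B = lam[::-1], B[:, ::-1]

r = 1
X_hat = X @ B[:, :r] @ np.linalg.inv(B)[:r, :]     # eq. (16)

I_Q = np.eye(X.shape[0]) - Q
num = np.diag(I_Q @ X_hat @ np.linalg.solve(S_W, X_hat.T) @ I_Q)
den = np.diag(I_Q @ X @ np.linalg.solve(S_W, X.T) @ I_Q)
within_group_predictivity = num / den              # diagonal of Psi_W
print(np.round(within_group_predictivity[:5], 3))
```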

4.4 CVA biplots for \(J=2\) groups

When \(J=2\) it now follows that:

  • The underlying two-sided eigenequation has one non-zero eigenvalue and \(p-1\) zero eigenvalues.

  • All p class means are exactly represented in a single dimension.

  • This single dimension contains the p biplot axes (each with predictivity \(100\%\) for recovering the group means) on top of each other.

  • Overall quality of representing group means is \(100\%\).

  • The one-dimensional CVA biplot is optimal for representing groups irrespective of the number of variables p.

  • The samples are not exactly represented in the one-dimensional CVA biplot.

Our challenge is now to add another dimension to improve the recovery of sample information without changing the optimality properties already available for the group means. We address this challenge by considering the orthogonal complement in the canonical space of the subspace containing the two group means. Therefore, we consider eigenvectors associated with the zero eigenvalues. These eigenvectors have no natural ordering associated with them. So, any one of these eigenvectors, or indeed any linear combination of them, has the same claim to be used as a second scaffolding axis. Hence, our aim is to find the linear combination that satisfies some optimality criterion, with Type A and Type B orthogonality as natural candidates for defining such a criterion.

4.5 Optimal two-dimensional CVA biplot for \(J=2\) groups: optimality criterion based on Type B orthogonality

Consider minimizing

$$\begin{aligned} sum\left\{ diag\left( ({\varvec{I}}-{\varvec{Q}})\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) {\varvec{B}}{\varvec{B}}^{T}\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) ^{T}({\varvec{I}}-{\varvec{Q}})\right) \right\} . \end{aligned}$$

Now,

$$\begin{aligned} ({\varvec{I}}-{\varvec{Q}})\left( {\varvec{X}}-\hat{{\varvec{X}}}\right)= & {} \left( {\varvec{I}}-{\varvec{G}}({\varvec{G}}^{T}{\varvec{G}})^{-1}{\varvec{G}}^{T}\right) \left( {\varvec{X}}-\hat{{\varvec{X}}}\right) \\= & {} \left( {\varvec{X}}-\hat{{\varvec{X}}}\right) -{\varvec{G}}({\varvec{G}}^{T}{\varvec{G}})^{-1}{\varvec{G}}^{T}\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) \\= & {} \left( {\varvec{X}}-\hat{{\varvec{X}}}\right) -{\varvec{G}}{\varvec{{\overline{X}}}} + {\varvec{G}}{\varvec{\hat{{\overline{X}}}}}. \end{aligned}$$

We have \({\varvec{G}}{\varvec{\hat{{\overline{X}}}}}= {\varvec{G}}{\varvec{{\overline{X}}}}\) if the eigenvector associated with \(\lambda >0\) is chosen. Therefore, maximizing the sum of within-group sample predictivities becomes equivalent to minimizing \(sum\left\{ diag\left( \left( {\varvec{X}}-\hat{{\varvec{X}}}\right) {\varvec{B}}{\varvec{B}}^{T}\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) ^{T}\right) \right\}\). However, this sum remains constant for any additional eigenvector. This is not surprising: in the canonical space the constraint \({\varvec{B}}^{T}{\varvec{S}}_{W}{\varvec{B}} = {\varvec{I}}\) implies constant variation in all dimensions, so Type B orthogonality is not useful for defining a criterion for an optimal two-dimensional biplot.

4.6 Optimal two-dimensional CVA biplot for \(J=2\) groups: optimality criterion based on Type A orthogonality

Since the matrix \({\varvec{B}}\) is non-singular it can be eliminated from the equation defining Type A orthogonality. As before, \(({\varvec{I}}-{\varvec{Q}})\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) =\left( {\varvec{X}}-\hat{{\varvec{X}}}\right)\). Therefore, the proposed optimality criterion for constructing an optimal two-dimensional CVA biplot for two groups is the total squared reconstruction error for samples:

$$\begin{aligned} TSRES = tr\left\{ \left( {\varvec{X}}-\hat{{\varvec{X}}}\right) \left( {\varvec{X}}-\hat{{\varvec{X}}}\right) ^{T}\right\} . \end{aligned}$$
(18)

A similar measure of the goodness of approximations of the means can be defined as the total squared reconstruction error for means:

$$\begin{aligned} TSREM = tr\left\{ \left( \overline{{\varvec{X}}}-\hat{\overline{{\varvec{X}}}}\right) \left( \overline{{\varvec{X}}}-\hat{\overline{{\varvec{X}}}}\right) ^{T}\right\} . \end{aligned}$$
(19)
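
For later use, (18) and (19) can be coded as two small helper functions (the function names below are ours); the only assumption is that the reconstructions \(\hat{{\varvec{X}}}\) and \(\hat{\overline{{\varvec{X}}}}\) have already been obtained as in (16) and (17). Note that each criterion is simply the squared Frobenius norm of the corresponding reconstruction error.

```python
import numpy as np

def tsres(X, X_hat):
    """Total squared reconstruction error for samples, eq. (18);
    equal to the squared Frobenius norm of X - X_hat."""
    R = np.asarray(X) - np.asarray(X_hat)
    return np.trace(R @ R.T)

def tsrem(Xbar, Xbar_hat):
    """Total squared reconstruction error for means, eq. (19)."""
    R = np.asarray(Xbar) - np.asarray(Xbar_hat)
    return np.trace(R @ R.T)
```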

5 Optimal CVA biplots when the number of groups is less than or equal to the number of variables

From now on we consider the case where \(J < p + 1\). Then it follows that the canonical means matrix \(\overline{{\varvec{Y}}}\) is of the form

$$\begin{aligned} \begin{bmatrix} {\overline{y}}^{T}_{1} \\ \vdots \\ {\overline{y}}^{T}_{J} \end{bmatrix} = \begin{bmatrix} k_{11} &{} \cdots &{} k_{1(J-1)} &{} 0 &{} \cdots &{} 0 \\ \vdots &{} &{} \vdots &{} \vdots &{} &{} \vdots \\ k_{J1} &{} \cdots &{} k_{J(J-1)} &{} 0 &{} \cdots &{} 0 \end{bmatrix}, \end{aligned}$$
(20)

(see e.g., Flury, 1997, p. 491).

Hence, the canonical transformation optimally separates the p-vectors

\({\overline{y}}^{T}_{1},{\overline{y}}^{T}_{2}, \ldots , {\overline{y}}^{T}_{J}\) in \(J-1\) dimensions while all differences among them vanish in dimensions \(J,J+1, \ldots ,p\). It follows that \({\mathcal {V}}(\overline{{\varvec{Y}}}^{T})\) is now of dimension \(J-1\) and it is thus possible to consider a biplot space of dimension \(r>J-1\). The resulting r-dimensional CVA biplot is not uniquely defined because the first \(J-1\) columns of \({\varvec{B}}: p \times p\) appended with any set of \(r-J+1\) of its remaining columns will result in a biplot where the canonical means are exactly represented. Therefore, \(\overline{{\varvec{x}}}^{T}_{j}{\varvec{B}}=\left[ k_{j1} \ldots k_{j(J-1)} 0 \ldots 0 \right]\) for \(j=1, 2, \ldots ,J\) with the predicted value for \(\overline{{\varvec{x}}}_{j}\) given by the jth row, \(\hat{\overline{{\varvec{x}}}}_{j}^{T}\), of (17). It follows that

$$\begin{aligned} \hat{\overline{{\varvec{x}}}}^{T}_{j}= & {} \overline{{\varvec{x}}}^{T}_{j}{\varvec{B}}_{r} {\varvec{B}}^{(r)} \\= & {} \overline{{\varvec{x}}}^{T}_{j} \begin{bmatrix} {\varvec{b}}_{1}&{\varvec{b}}_{2}&\ldots&{\varvec{b}}_{J-1}&{\varvec{b}}_{J}&\ldots {\varvec{b}}_{r} \end{bmatrix} \begin{bmatrix} {\varvec{b}}^{(1)} \\ {\varvec{b}}^{(2)}\\ \cdots \\ {\varvec{b}}^{(r)} \end{bmatrix}, \mathrm {\ where \ } \begin{bmatrix} {\varvec{b}}^{(1)} \\ {\varvec{b}}^{(2)}\\ \cdots \\ {\varvec{b}}^{(r)} \end{bmatrix} = {\varvec{B}}^{(r)} \\= & {} \overline{{\varvec{x}}}^{T}_{j} \begin{bmatrix} {\varvec{b}}_{1}&{\varvec{b}}_{2}&\ldots&{\varvec{b}}_{J-1}&{\varvec{b}}_{J}&\ldots {\varvec{0}} \end{bmatrix} \begin{bmatrix} {\varvec{b}}^{(1)} \\ {\varvec{b}}^{(2)}\\ \cdots \\ {\varvec{b}}^{(p)} \end{bmatrix}, \mathrm {\ with \ } \begin{bmatrix} {\varvec{b}}^{(1)} \\ {\varvec{b}}^{(2)}\\ \cdots \\ {\varvec{b}}^{(p)} \end{bmatrix} = {\varvec{B}}^{-1} \\= & {} \overline{{\varvec{x}}}^{T}_{j} \mathrm {\ for \ } j=1,2,\ldots , J. \end{aligned}$$

Therefore, in the above r-dimensional CVA biplot the canonical means are exactly represented, resulting in \(TSREM = 0\). For a sample \({\varvec{x}}^{T}_{i}\), the ith row of \({\varvec{X}}\), we have

$$\begin{aligned} \hat{{\varvec{x}}}^{T}_{i} = {\varvec{x}}^{T}_{i}{\varvec{B}}_{r} {\varvec{B}}^{(r)} \ne {\varvec{x}}^{T}_{i}{\varvec{B}} {\varvec{B}}^{-1} = {\varvec{x}}^{T}_{i}, \end{aligned}$$

since in general \({\varvec{x}}^{T}_{i}{\varvec{b}}_{j}\) is not zero for all \(j=J,J+1, \ldots , p\) so that \(TSRES>0\).

This leaves us with the challenge of constructing scaffolding axes \(j=J,J+1, \ldots , r\), in addition to those contained in \({\varvec{B}}_{J-1}\), so that TSRES is minimized without sacrificing what we already have for the class means in \(J-1\) dimensions.

Possible candidates for the additional scaffolding axes are any \(r-J+1\) of the vectors \({\varvec{b}}_{J}, {\varvec{b}}_{J+1}, \ldots , {\varvec{b}}_{p}\). All these vectors are associated with the zero eigenvalues (diagonal elements of \({\varvec{\varLambda }}\)). Therefore, there is no natural ordering of them as is the case with the \(J-1\) eigenvectors associated with the non-zero eigenvalues. Furthermore, since \(\overline{{\varvec{X}}}{\varvec{b}}_{i}={\varvec{0}}\) for \(i=J, J+1, \ldots , p\) it follows that \(\overline{{\varvec{X}}}{\varvec{d}}={\varvec{0}}\) where \({\varvec{d}}\) is any linear combination of the vectors \({\varvec{b}}_{J}, {\varvec{b}}_{J+1}, \ldots , {\varvec{b}}_{p}\). A similar result will hold for any set of basis vectors of the vector space generated by the columns of the matrix

$$\begin{aligned} {\varvec{B}}^{*} = \left[ {\varvec{b}}_{J}, {\varvec{b}}_{J+1}, \ldots , {\varvec{b}}_{p} \right] , \end{aligned}$$
(21)

so that \(\overline{{\varvec{X}}}{\varvec{B}}^{*}={\varvec{0}}\).

Therefore, a set of \(r-J+1\) linearly independent vectors of the form \({\varvec{d}}\), where \({\varvec{d}}\) is a linear combination of any basis of \({\mathcal {V}}({\varvec{B}}^{*})\), is needed such that the scaffolding vectors consisting of the first \(J-1\) columns of \({\varvec{B}}\) together with the \(r-J+1\) \({\varvec{d}}\) vectors minimize TSRES over all legitimate choices of the \(\{{\varvec{d}}\}\). Write these r scaffolding vectors as the columns of the matrix \({\varvec{D}}_{r}\) i.e.,

$$\begin{aligned} {\varvec{D}}_{r} = \left[ {\varvec{b}}_{1}, {\varvec{b}}_{2}, \ldots , {\varvec{b}}_{J-1}, {\varvec{d}}_{J}, {\varvec{d}}_{J+1}, \ldots , {\varvec{d}}_{r} \right] \end{aligned}$$
(22)

and let the columns of \({\varvec{D}}: p\times p =\left[ {\varvec{D}}_{r}, {\varvec{d}}_{r+1}, \ldots , {\varvec{d}}_{p}\right]\) represent a basis of \({\mathcal {V}}({\varvec{B}})\). It follows that any column of \({\varvec{B}}\) can be written as a linear combination of the columns of \({\varvec{D}}\). Therefore, there exists a non-singular matrix \({\varvec{C}}: p\times p\) such that \({\varvec{B}}={\varvec{DC}}\) i.e., \({\varvec{D}}={\varvec{BF}}\) with \({\varvec{F}}={\varvec{C}}^{-1}\).

Straightforward algebraic manipulation shows that \({\varvec{F}}\) is of the form

$$\begin{aligned} {\varvec{F}}=\begin{bmatrix} {\varvec{I}}_{J-1} &{} {\varvec{0}} \\ {\varvec{0}} &{} {\varvec{F}}^{*} \end{bmatrix}, \end{aligned}$$
(23)

where \({\varvec{F}}^{*}\) is a \((p-J+1)\times (p-J+1)\) orthogonal matrix. We provide a detailed derivation as supplementary material. Write

$$\begin{aligned} {\varvec{F}}^{*}= \left[ {\varvec{f}}^{*}_{1},{\varvec{f}}^{*}_{2}, \ldots ,{\varvec{f}}^{*}_{p-J+1}\right] . \end{aligned}$$
(24)

Then, \({\varvec{F}}^{-1}=\begin{bmatrix} {\varvec{I}} &{} {\varvec{0}} \\ {\varvec{0}} &{} ({\varvec{F}}^{*})^{T} \end{bmatrix}\) and our scaffolding vectors for constructing the r-dimensional CVA biplot are the columns of

$$\begin{aligned} {\varvec{D}}_{r}= & {} \left[ {\varvec{b}}_{1},{\varvec{b}}_{2}, \ldots ,{\varvec{b}}_{J-1}, {\varvec{d}}_{J},{\varvec{d}}_{J+1}, \ldots ,{\varvec{d}}_{r}\right] \nonumber \\= & {} {\varvec{B}}\left[ \begin{bmatrix} {\varvec{I}}_{J-1} \\ {\varvec{0}}\end{bmatrix} \begin{bmatrix} {\varvec{0}} \\ {\varvec{f}}^{*}_{1}\end{bmatrix} \cdots \begin{bmatrix} {\varvec{0}} \\ {\varvec{f}}^{*}_{r-J+1} \end{bmatrix}\right] . \end{aligned}$$
(25)

Furthermore,

$$\begin{aligned} {\varvec{D}}^{(r)}=\begin{bmatrix} {\varvec{I}}_{J-1} &{} {\varvec{0}} \\ {\varvec{0}}^{T} &{} ({\varvec{f}}^{*}_{1})^{T} \\ \cdots &{} \cdots \\ {\varvec{0}}^{T} &{} ({\varvec{f}}^{*}_{r-J+1})^{T} \end{bmatrix} {\varvec{B}}^{-1}. \end{aligned}$$
(26)

The approximations of the rows of \({\varvec{X}}\), i.e., the original samples, in the biplot constructed on the scaffolding provided by the columns of \({\varvec{D}}_{r}\) follow from (16), using (25) and (26), as

$$\begin{aligned} \hat{{\varvec{X}}}= & {} {\varvec{X}}{\varvec{D}}_{r}{\varvec{D}}^{(r)} \nonumber \\= & {} {\varvec{XB}} \left[ \begin{bmatrix} {\varvec{I}}_{J-1} \\ {\varvec{0}}\end{bmatrix} \begin{bmatrix} {\varvec{0}} \\ {\varvec{f}}^{*}_{1}\end{bmatrix} \cdots \begin{bmatrix} {\varvec{0}} \\ {\varvec{f}}^{*}_{r-J+1} \end{bmatrix}\right] \begin{bmatrix} {\varvec{I}}_{J-1} &{} {\varvec{0}} \\ {\varvec{0}}^{T} &{} ({\varvec{f}}^{*}_{1})^{T} \\ \cdots &{} \cdots \\ {\varvec{0}}^{T} &{} ({\varvec{f}}^{*}_{r-J+1})^{T} \end{bmatrix}{\varvec{B}}^{-1} \nonumber \\= & {} {\varvec{XB}} \begin{bmatrix} {\varvec{I}}_{J-1} &{} {\varvec{0}} \\ {\varvec{0}}^{T} &{} {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T}+ \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}\end{bmatrix} {\varvec{B}}^{-1}, \end{aligned}$$
(27)

where, from the orthogonality of \({\varvec{F}}^{*}\), it follows that \(({\varvec{f}}^{*}_{1})^{T}{\varvec{f}}^{*}_{1}=({\varvec{f}}^{*}_{2})^{T}{\varvec{f}}^{*}_{2} = \cdots =({\varvec{f}}^{*}_{r-J+1})^{T}{\varvec{f}}^{*}_{r-J+1}=1\).

The criterion TSRES now becomes

$$\begin{aligned}{} & {} \left\| {\varvec{X}} -\hat{{\varvec{X}}}\right\| ^{2} \nonumber \\{} & {} \quad = tr\left\{ ({\varvec{X}} -\hat{{\varvec{X}}})({\varvec{X}} -\hat{{\varvec{X}}})^{T}\right\} \nonumber \\{} & {} \quad =\left\| {\varvec{X}}\left( {\varvec{I}}_{p}-{\varvec{B}} \begin{bmatrix} {\varvec{I}}_{J-1} &{} {\varvec{0}} \\ {\varvec{0}}^{T} &{} {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T} + \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}\end{bmatrix} {\varvec{B}}^{-1}\right) \right\| ^{2}. \end{aligned}$$
(28)

To construct an r-dimensional CVA biplot satisfying our aim of minimizing TSRES while simultaneously providing \(100\%\) accurate predictions for the J sample means when \(J < p\), we propose the following:

  • Find the solution of

    $$\begin{aligned} argmin\left\| {\varvec{X}}\left( {\varvec{I}}_{p}-{\varvec{B}} {\varvec{L}} {\varvec{B}}^{-1}\right) \right\| ^{2}, \end{aligned}$$
    (29)

    where \({\varvec{L}} = \begin{bmatrix} {\varvec{I}}_{J-1} &{} {\varvec{0}} \\ {\varvec{0}}^{T} &{} {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T}+ \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}\end{bmatrix}\) and the minimum is taken with respect to the \({\varvec{f}}^{*}_{j}\) such that \(({\varvec{f}}^{*}_{j})^{T}{\varvec{f}}^{*}_{j}=1\) for \(j=1, 2, \ldots , r-J+1\).

  • Use the optimum \(\{{\varvec{f}}^{*}_{1}, {\varvec{f}}^{*}_{2}, \ldots , {\varvec{f}}^{*}_{r-J+1}\}\) to construct \(({\varvec{B}}_{opt})_{r} = \left[ {\varvec{b}}_{1}, \ldots ,{\varvec{b}}_{J-1}, {\varvec{d}}_{J}, \ldots ,{\varvec{d}}_{r}\right]\) where

    $$\begin{aligned} {\varvec{d}}_{j}= & {} {\varvec{B}}^{*}({\varvec{f}}_{j-J+1}^{*})_{opt} \\= & {} {\varvec{B}}\begin{bmatrix} {\varvec{0}} \\ ({\varvec{f}}_{j-J+1}^{*})_{opt} \end{bmatrix} \\= & {} f^{opt}_{J(j-J+1)}{\varvec{b}}_{J}+f^{opt}_{(J+1)(j-J+1)}{\varvec{b}}_{J+1} + \cdots + f^{opt}_{p(j-J+1)}{\varvec{b}}_{p}, \end{aligned}$$

    for \(j=J, J+1, \ldots , r\).

  • Next, \(({\varvec{B}}_{opt})_{r}\) is used for constructing the r-dimensional CVA biplot with calibrated prediction (or interpolation) axes.

  • Finally, calculate a standardised form of min(TSRES): \(\frac{min(TSRES({\varvec{X}}, \hat{{\varvec{X}}}))}{tr({\varvec{X}}{\varvec{X}}^{T})}\), as a measure of the accuracy of the approximations of the individual sample points in the r-dimensional biplot.

The solution for (29) can be found from (28) as follows: Since \({\varvec{B}}^{T}{\varvec{S}}_{W}{\varvec{B}} = {\varvec{I}}_{p}\) it follows that

$$\begin{aligned} {\varvec{I}}_{p}= & {} {\varvec{B}}^{T}{\varvec{X}}^{T}{\varvec{X}}{\varvec{B}} -{\varvec{B}}^{T}\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}}{\varvec{B}} \\= & {} \begin{bmatrix} ({\varvec{B}}_{J-1})^{T}{\varvec{X}}^{T}{\varvec{X}}{\varvec{B}}_{J-1} &{} ({\varvec{B}}_{J-1})^{T}{\varvec{X}}^{T}{\varvec{X}}{\varvec{B}}^{*}\\ ({\varvec{B}}^{*})^{T}{\varvec{X}}^{T}{\varvec{X}}{\varvec{B}}_{J-1} &{} ({\varvec{B}}^{*})^{T}{\varvec{X}}^{T}{\varvec{X}}{\varvec{B}}^{*} \end{bmatrix} \\{} & {} - \begin{bmatrix} ({\varvec{B}}_{J-1})^{T}\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}}{\varvec{B}}_{J-1} &{} ({\varvec{B}}_{J-1})^{T}\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}}{\varvec{B}}^{*} \\ ({\varvec{B}}^{*})^{T}\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}}{\varvec{B}}_{J-1} &{} ({\varvec{B}}^{*})^{T}\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}}{\varvec{B}}^{*}\end{bmatrix} \\= & {} \begin{bmatrix} ({\varvec{B}}_{J-1})^{T}{\varvec{X}}^{T}{\varvec{X}}{\varvec{B}}_{J-1} &{} ({\varvec{B}}_{J-1})^{T}{\varvec{X}}^{T}{\varvec{X}}{\varvec{B}}^{*}\\ ({\varvec{B}}^{*})^{T}{\varvec{X}}^{T}{\varvec{X}}{\varvec{B}}_{J-1} &{} ({\varvec{B}}^{*})^{T}{\varvec{X}}^{T}{\varvec{X}}{\varvec{B}}^{*} \end{bmatrix} - \begin{bmatrix} ({\varvec{B}}_{J-1})^{T}\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}}{\varvec{B}}_{J-1} &{} {\varvec{0}} \\ {\varvec{0}} &{} {\varvec{0}}\end{bmatrix}, \end{aligned}$$

establishing that

$$\begin{aligned} ({\varvec{B}}^{*})^{T}{\varvec{X}}^{T}{\varvec{X}}{\varvec{B}}^{*} = {\varvec{I}}_{p-J+1}. \end{aligned}$$
(30)

Since \({\varvec{B}}= \begin{bmatrix} {\varvec{B}}_{J-1}&{\varvec{B}}^{*} \end{bmatrix}\) is non-singular, we set

$$\begin{aligned}{\varvec{B}}^{-1} =\begin{bmatrix} {\varvec{B}}^{(J-1)}: (J-1)\times p \\ {\varvec{B}}^{(2)}: (p-J+1)\times p \end{bmatrix}.\end{aligned}$$

Write \({\varvec{U}}: (p-J+1)\times (p-J+1) = {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T}+ \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}\); then it follows that

$$\begin{aligned}{} & {} {\varvec{X}}\left( {\varvec{I}}_{p}-{\varvec{B}} \begin{bmatrix} {\varvec{I}}_{J-1} &{} {\varvec{0}} \\ {\varvec{0}}^{T} &{} {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T}+ \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}\end{bmatrix} {\varvec{B}}^{-1}\right) \nonumber \\{} & {} \quad = {\varvec{X}}\left( {\varvec{I}}_{p}-\left( {\varvec{B}}_{J-1}{\varvec{B}}^{(J-1)} + {\varvec{B}}^{*}{\varvec{U}}{\varvec{B}}^{(2)}\right) \right) . \end{aligned}$$
(31)

From \(\begin{bmatrix} {\varvec{B}}_{J-1}&{\varvec{B}}^{*} \end{bmatrix} \begin{bmatrix} {\varvec{B}}^{(J-1)} \\ {\varvec{B}}^{(2)}\end{bmatrix} = {\varvec{I}}_{p}\) it follows that \({\varvec{B}}_{J-1}{\varvec{B}}^{(J-1)}={\varvec{I}}_{p}-{\varvec{B}}^{*}{\varvec{B}}^{(2)}\) so that (31) can be written as

$$\begin{aligned} {\varvec{X}}\left( {\varvec{I}}_{p}-\left( {\varvec{B}}_{J-1}{\varvec{B}}^{(J-1)} + {\varvec{B}}^{*}{\varvec{U}}{\varvec{B}}^{(2)}\right) \right) = {\varvec{X}}\left( {\varvec{B}}^{*}{\varvec{B}}^{(2)} - {\varvec{B}}^{*}{\varvec{U}}{\varvec{B}}^{(2)}\right) . \end{aligned}$$
(32)

Therefore,

$$\begin{aligned}{} & {} \left\| {\varvec{X}}\left( {\varvec{I}}_{p}-{\varvec{B}} \begin{bmatrix} {\varvec{I}}_{J-1} &{} {\varvec{0}} \\ {\varvec{0}}^{T} &{} {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T}+ \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}\end{bmatrix} {\varvec{B}}^{-1}\right) \right\| ^{2} \nonumber \\{} & {} \quad =\left\| {\varvec{X}}{\varvec{B}}^{*}({\varvec{I}}_{p-J+1}-{\varvec{U}}){\varvec{B}}^{(2)} \right\| ^{2} \nonumber \\{} & {} \quad =tr\left\{ {\varvec{X}}{\varvec{B}}^{*}({\varvec{I}}_{p-J+1}-{\varvec{U}}){\varvec{B}}^{(2)}({\varvec{B}}^{(2)})^{T}({\varvec{I}}_{p-J+1}-{\varvec{U}})({\varvec{B}}^{*})^{T}{\varvec{X}}^{T}\right\} \nonumber \\{} & {} \quad = tr\left\{ {\varvec{X}}{\varvec{B}}^{*}{\varvec{B}}^{(2)}({\varvec{B}}^{(2)})^{T}({\varvec{B}}^{*})^{T} {\varvec{X}}^{T}\right\} -2tr\left\{ {\varvec{X}}{\varvec{B}}^{*}{\varvec{U}}{\varvec{B}}^{(2)}({\varvec{B}}^{(2)})^{T}({\varvec{B}}^{*})^{T} {\varvec{X}}^{T}\right\} \nonumber \\{} & {} \qquad +tr\left\{ {\varvec{X}}{\varvec{B}}^{*}{\varvec{U}}{\varvec{B}}^{(2)}({\varvec{B}}^{(2)})^{T}{\varvec{U}}({\varvec{B}}^{*})^{T} {\varvec{X}}^{T}\right\}\end{aligned}$$
(33)
$$\begin{aligned} = tr\left\{ {\varvec{B}}^{(2)}({\varvec{B}}^{(2)})^{T}\right\} -2tr\left\{ {\varvec{U}}{\varvec{B}}^{(2)}({\varvec{B}}^{(2)})^{T}\right\} +tr\left\{ {\varvec{U}}{\varvec{B}}^{(2)}({\varvec{B}}^{(2)})^{T}{\varvec{U}}\right\}\end{aligned}$$
(34)

by substituting (30) in (33). Write \({\varvec{H}} = {\varvec{B}}^{(2)}({\varvec{B}}^{(2)})^{T}: (p-J+1)\times (p-J+1)\); then it follows that \({\varvec{H}}\) has rank \(p-J+1\) and is thus positive definite. Therefore,

$$\begin{aligned}{} & {} \left\| {\varvec{X}}\left( {\varvec{I}}_{p}-{\varvec{B}} \begin{bmatrix} {\varvec{I}}_{J-1} &{} {\varvec{0}} \\ {\varvec{0}}^{T} &{} {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T}+ \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}\end{bmatrix} {\varvec{B}}^{-1}\right) \right\| ^{2} \nonumber \\{} & {} \quad = tr\left\{ {\varvec{H}}\right\} -2tr\left\{ {\varvec{U}}{\varvec{H}}\right\} +tr\left\{ {\varvec{U}}{\varvec{H}}{\varvec{U}}\right\} , \end{aligned}$$
(35)

where

$$\begin{aligned} tr\left\{ {\varvec{U}}{\varvec{H}}\right\} = ({\varvec{f}}^{*}_{1})^{T}{\varvec{H}}{\varvec{f}}^{*}_{1}+({\varvec{f}}^{*}_{2})^{T}{\varvec{H}}{\varvec{f}}^{*}_{2}+ \cdots + ({\varvec{f}}^{*}_{r-J+1})^{T}{\varvec{H}}{\varvec{f}}^{*}_{r-J+1} \end{aligned}$$
(36)

and

$$\begin{aligned} tr\{{\varvec{U}}{\varvec{H}}{\varvec{U}}\} &= tr\{{\varvec{U}}{\varvec{U}}{\varvec{H}}\} \\ & = tr\big \{\left( {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T}+ \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}\right) \\ & \quad \times \left( {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T}+ \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}\right) {\varvec{H}}\big \}. \end{aligned}$$

Since \(({\varvec{f}}^{*}_{i})^{T}{\varvec{f}}^{*}_{j}\; =\; \left\{ \begin{array}{ll} 1 &{} \text{ if } i=j \\ 0 &{} \text{ if } i\ne j \end{array} \right.\) it follows that

$$\begin{aligned} tr\{{\varvec{U}}{\varvec{H}}{\varvec{U}}\} &= tr\big \{\left( {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}{\varvec{H}}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T}{\varvec{H}}+ \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}{\varvec{H}}\right) \big \} \\ &= ({\varvec{f}}^{*}_{1})^{T}{\varvec{H}}{\varvec{f}}^{*}_{1}+({\varvec{f}}^{*}_{2})^{T}{\varvec{H}}{\varvec{f}}^{*}_{2}+ \cdots + ({\varvec{f}}^{*}_{r-J+1})^{T}{\varvec{H}}{\varvec{f}}^{*}_{r-J+1}. \end{aligned}$$
(37)

Substituting (36) and (37) into (35) leads to

$$\begin{aligned}{} & {} \left\| {\varvec{X}}\left( {\varvec{I}}_{p}-{\varvec{B}} \begin{bmatrix} {\varvec{I}}_{J-1} &{} {\varvec{0}} \\ {\varvec{0}}^{T} &{} {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T}+ \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}\end{bmatrix} {\varvec{B}}^{-1}\right) \right\| ^{2} \nonumber \\{} & {} \quad = tr\left\{ {\varvec{H}}\right\} -\big (({\varvec{f}}^{*}_{1})^{T}{\varvec{H}}{\varvec{f}}^{*}_{1}+({\varvec{f}}^{*}_{2})^{T}{\varvec{H}}{\varvec{f}}^{*}_{2}+ \cdots + ({\varvec{f}}^{*}_{r-J+1})^{T}{\varvec{H}}{\varvec{f}}^{*}_{r-J+1}\big ). \end{aligned}$$
(38)

Remembering that \({\varvec{H}}\) is positive definite, the criterion (38) can be minimized by maximizing each of the terms \(({\varvec{f}}^{*}_{j})^{T}{\varvec{H}}{\varvec{f}}^{*}_{j}\) with respect to \({\varvec{f}}^{*}_{j}\) for \(j=1, 2, \ldots , r-J+1\) under the constraint \(({\varvec{f}}^{*}_{j})^{T}{\varvec{f}}^{*}_{j}=1\). This is readily accomplished by introducing the Lagrange multiplier \(\lambda _{j}\) to form

$$\begin{aligned} ({\varvec{f}}^{*}_{j})^{T}{\varvec{H}}{\varvec{f}}^{*}_{j}-\lambda _{j}\big (({\varvec{f}}^{*}_{j})^{T}{\varvec{f}}^{*}_{j}-1\big ). \end{aligned}$$
(39)

Differentiating (39) with respect to \({\varvec{f}}^{*}_{j}\) and equating to zero leads to \({\varvec{H}}{\varvec{f}}^{*}_{j}=\lambda _{j}{\varvec{f}}^{*}_{j}\) i.e., \(\lambda _{j}=({\varvec{f}}^{*}_{j})^{T}{\varvec{H}}{\varvec{f}}^{*}_{j}\). Thus \(({\varvec{f}}^{*}_{j})^{T}{\varvec{H}}{\varvec{f}}^{*}_{j}\) is maximized when \({\varvec{f}}^{*}_{j}\) is a normalized eigenvector associated with the jth largest eigenvalue of \({\varvec{H}}\), and hence the criterion (38) attains its minimum when the \(\big \{{\varvec{f}}^{*}_{j}\big \}\) are set to the normalized eigenvectors associated with the largest \(r-J+1\) eigenvalues of \({\varvec{H}} = {\varvec{B}}^{(2)}({\varvec{B}}^{(2)})^{T}\), respectively. Denote these eigenvectors by \({\varvec{f}}_{1}^{opt}, {\varvec{f}}_{2}^{opt}, \cdots , {\varvec{f}}_{r-J+1}^{opt}\), respectively, where \(r\le p\); then a \(p\times r\) matrix \(({\varvec{B}}_{opt})_{r}\) can be constructed as

$$\begin{aligned} ({\varvec{B}}_{opt})_{r}=\left[ {\varvec{B}}_{J-1}, {\varvec{B}}^{*}\left[ {\varvec{f}}_{1}^{opt}, {\varvec{f}}_{2}^{opt}, \cdots , {\varvec{f}}_{r-J+1}^{opt}\right] \right] . \end{aligned}$$
(40)

Setting \(r = p\) in the above leads to a matrix \({\varvec{B}}_{opt}\) of size \(p \times p\) which is non-singular, allowing for the computation of the matrices \(({\varvec{B}}_{opt})_{r}\), consisting of the first r columns of \({\varvec{B}}_{opt}\), and \(({\varvec{B}}_{opt})^{(r)}\), consisting of the first r rows of \(({\varvec{B}}_{opt})^{-1}\). Therefore

$$\begin{aligned} \hat{{\varvec{X}}}={\varvec{X}}{\varvec{D}}_{r}{\varvec{D}}^{(r)}={\varvec{X}}({\varvec{B}}_{opt})_{r}({\varvec{B}}_{opt})^{(r)} \end{aligned}$$
(41)

will minimize TSRES.
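The construction in (36)–(41) is straightforward to implement. Below is a minimal R sketch, assuming that the matrix \({\varvec{B}}\) of section 3 (with its first \(J-1\) columns equal to \({\varvec{B}}_{J-1}\) and the remaining columns equal to \({\varvec{B}}^{*}\)) and the matrix \({\varvec{H}}={\varvec{B}}^{(2)}({\varvec{B}}^{(2)})^{T}\) of section 5 have already been computed; the function name and arguments are illustrative and are not the published code of Lubbe et al. (2023).

# Sketch: optimal r-dimensional CVA scaffolding via (38)-(41).
# X: n x p (centred) data matrix; B: p x p matrix of section 3
#    (first J-1 columns = B_{J-1}, remaining columns = B*);
# H: the matrix B^(2) (B^(2))^T of section 5; J: number of groups;
# r: dimension of the biplot (r <= p).
cva_opt_scaffolding <- function(X, B, H, J, r) {
  p <- ncol(B)
  B_J1   <- B[, seq_len(J - 1), drop = FALSE]              # B_{J-1}
  B_star <- B[, J:p, drop = FALSE]                         # B^*
  F_all  <- eigen(H, symmetric = TRUE)$vectors             # f_1^opt, f_2^opt, ... (decreasing eigenvalues)
  B_opt   <- cbind(B_J1, B_star %*% F_all)                 # r = p case: non-singular B_opt
  B_opt_r <- B_opt[, seq_len(r), drop = FALSE]             # equation (40): first r columns
  B_opt_inv_r <- solve(B_opt)[seq_len(r), , drop = FALSE]  # first r rows of B_opt^{-1}
  X_hat <- X %*% B_opt_r %*% B_opt_inv_r                   # equation (41)
  list(B_opt_r = B_opt_r, B_opt_inv_r = B_opt_inv_r, X_hat = X_hat,
       TSRES = sum((X - X_hat)^2))                         # ||X - X_hat||^2
}

For the two-group case of primary interest here, \(J = 2\) and \(r = 2\), so only the leading eigenvector of \({\varvec{H}}\) is needed to supply the second scaffolding column.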

6 An alternative biplot based on the Bhattacharyya distance for the two sample case

If, analogously to section 3, the population parameters in (5) are replaced with their sample estimates, the sample version of the Bhattacharyya distance consists of two terms. The first measures the dissimilarity between the two sample means and the second the dissimilarity between the two sample covariance matrices. Fukunaga (1990) uses this property to construct a second dimension for a visual display of the rows of a data matrix \({\varvec{X}}: n\times p\) in two dimensions. Furthermore, Hennig (2004) utilizes the Bhattacharyya distance, among other methods, to construct two-dimensional visualizations of the dispersions of two asymmetric samples – asymmetric in the sense that one is known to be more homogeneous and the other more heterogeneous. However, it should be noted that none of the visualizations proposed by Fukunaga (1990) and Hennig (2004) is a biplot, because no information regarding the columns of \({\varvec{X}}\) is displayed. The optimal two-group CVA biplot proposed in section 5 assumes equality of covariance matrices, as is usual for CVA. This implies that the second term of the sample version of (5) vanishes and optimization of (5) becomes equivalent to maximizing (9).

Denote the transformation to the canonical space \({\mathcal {C}}\) by \({\varvec{Y}} ={\varvec{XM}}\) with \({\varvec{M}}:p \times p\). We can write

$$\begin{aligned}{\mathcal {C}}={\mathcal {C}}_{1} \cup {\mathcal {V}},\end{aligned}$$

where \({\mathcal {C}}_{1}\) is one-dimensional based on \({\varvec{m}}_{1}\) and \({\mathcal {V}}\) is \((p-1)\)-dimensional with basis \({\varvec{m}}_{2}, {\varvec{m}}_{3}, \ldots , {\varvec{m}}_{p}\). In \({\mathcal {V}}\) we have

$$\begin{aligned}{} & {} {\varvec{Y}}^{*T}=[Y_{2}, Y_{3}, \ldots , Y_{p}],\\{} & {} {\varvec{Y}}^{*}| G_{i}:(p-1)\times 1 \sim ({\varvec{\mu }}^{*}, {\varvec{\varSigma }}_{i}^{*}), \quad i = 1, 2, \end{aligned}$$

with \({\varvec{\mu }}^{*T}=[\mu _{2}^{(Y)}, \mu _{3}^{(Y)}, \ldots , \mu _{p}^{(Y)}], {\varvec{\varSigma }}_{1}^{*}={\varvec{M}}^{*T}{\varvec{\varSigma }}_{1}{\varvec{M}}^{*}, {\varvec{\varSigma }}_{2}^{*}={\varvec{M}}^{*T}{\varvec{\varSigma }}_{2}{\varvec{M}}^{*}\) and \({\varvec{M}}^{*}:p \times (p-1) =[{\varvec{m}}_{2},{\varvec{m}}_{3}, \ldots , {\varvec{m}}_{p} ]\).

Since \({\varvec{\mu }}_{1}^{*} ={\varvec{\mu }}_{2}^{*}={\varvec{\mu }}^{*}\), say, we have from (5)

$$\begin{aligned} D_{Bhat} = \frac{1}{2}log\left( \frac{\left| \frac{{\varvec{\varSigma }}_{1}^{*}+{\varvec{\varSigma }}_{2}^{*}}{2}\right| }{\sqrt{|{\varvec{\varSigma }}_{1}^{*}||{\varvec{\varSigma }}_{2}^{*}|}}\right) . \end{aligned}$$
(42)

Maximization of the sample version of (42) proceeds parallel to the process described in section 2.3. The outcome is the matrix \({\varvec{A}}: (p-1) \times (p-1)\) containing as columns the required eigenvectors arranged in decreasing order of the values of \(\lambda _{i}^{*}+\frac{1}{\lambda _{i}^{*}}+2\).
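The ordering criterion can be motivated directly from (42): if the \(\lambda_{i}^{*}\) are the eigenvalues of \(({\varvec{\varSigma }}_{2}^{*})^{-1}{\varvec{\varSigma }}_{1}^{*}\), then \(\left| \frac{{\varvec{\varSigma }}_{1}^{*}+{\varvec{\varSigma }}_{2}^{*}}{2}\right| \big/ \sqrt{|{\varvec{\varSigma }}_{1}^{*}||{\varvec{\varSigma }}_{2}^{*}|}=\prod _{i}(\lambda _{i}^{*}+1)/(2\sqrt{\lambda _{i}^{*}})\), so that the contribution of the ith direction to (42) equals \(\frac{1}{4}\log \{(\lambda _{i}^{*}+\frac{1}{\lambda _{i}^{*}}+2)/4\}\), which is increasing in \(\lambda _{i}^{*}+\frac{1}{\lambda _{i}^{*}}+2\). A minimal R sketch of the sample computation follows; it assumes (our reading of section 2.3) that the required eigenvectors solve the two-sided problem \(\hat{{\varvec{\varSigma }}}_{1}^{*}{\varvec{a}}=\lambda ^{*}\hat{{\varvec{\varSigma }}}_{2}^{*}{\varvec{a}}\), and all names are illustrative.

# Sketch: sample version of (42) and the matrix A of ordered eigenvectors.
# Ystar: n x (p-1) matrix of canonical-complement scores Y* = X M*;
# g: factor with two levels giving group membership.
bhat_directions <- function(Ystar, g) {
  S1 <- cov(Ystar[g == levels(g)[1], , drop = FALSE])    # Sigma-hat_1^*
  S2 <- cov(Ystar[g == levels(g)[2], , drop = FALSE])    # Sigma-hat_2^*
  D_bhat <- 0.5 * log(det((S1 + S2) / 2) / sqrt(det(S1) * det(S2)))  # sample (42)
  ev <- eigen(solve(S2) %*% S1)           # two-sided problem S1 a = lambda S2 a
  lambda <- Re(ev$values)
  crit <- lambda + 1 / lambda + 2         # ordering criterion of the text
  ord <- order(crit, decreasing = TRUE)
  list(D_bhat = D_bhat,
       A = Re(ev$vectors)[, ord, drop = FALSE],
       criterion = crit[ord])
}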

A 2D biplot can now be constructed using the methods described in sections 3 and 4 by first noting that the matrix \({\varvec{M}}\) is available as the matrix \({\varvec{B}}\) of section 3. Next, we calculate the matrix

$$\begin{aligned} {\varvec{K}}:p\times p= [{\varvec{m}}_{1} \;\;\; {\varvec{M}}^{*}{\varvec{A}}]\end{aligned}$$

and its inverse \({\varvec{K}}^{-1}\). Let \({\varvec{a}}: (p-1) \times 1\) denote the first column of \({\varvec{A}}\); then the 2D biplot can be constructed as described in section 4 using the rows of \({\varvec{Z}}:n\times 2\), where

$$\begin{aligned}{\varvec{Z}}= [{\varvec{Xm}}_{1} \;\;\; {\varvec{Y}}^{*}{\varvec{a}}]={\varvec{X}}[{\varvec{m}}_{1} \;\;\; {\varvec{M}}^{*}{\varvec{a}}]\end{aligned}$$

for plotting the samples. The variables are represented by p calibrated biplot axes constructed from

$$\begin{aligned} \frac{\text{marker}}{{\varvec{e}}_{k}^{T}{\varvec{K}}^{(2)T}{\varvec{K}}^{(2)}{\varvec{e}}_{k}}\,{\varvec{K}}^{(2)}{\varvec{e}}_{k},\end{aligned}$$

where \({\varvec{K}}^{(2)}\) denotes the first 2 rows of \({\varvec{K}}^{-1}\).
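The construction of \({\varvec{Z}}\) and of the p axis directions can be sketched in R as follows, again with illustrative names and assuming that \({\varvec{M}}\) (the matrix \({\varvec{B}}\) of section 3) and the matrix \({\varvec{A}}\) above are available; this is a sketch, not the published code of Lubbe et al. (2023).

# Sketch: plotting coordinates Z and axis directions for the
# Bhattacharyya-based 2D biplot.
# X: n x p (centred) data; M: p x p canonical transformation (the B of section 3);
# A: (p-1) x (p-1) matrix returned by bhat_directions() above.
bhat_biplot <- function(X, M, A) {
  m1    <- M[, 1, drop = FALSE]                 # basis of C_1
  Mstar <- M[, -1, drop = FALSE]                # M*, basis of V
  a     <- A[, 1, drop = FALSE]                 # first column of A
  Z     <- X %*% cbind(m1, Mstar %*% a)         # n x 2 sample coordinates
  K     <- cbind(m1, Mstar %*% A)               # K: p x p
  K2    <- solve(K)[1:2, , drop = FALSE]        # K^(2): first 2 rows of K^{-1}
  # direction of the k-th calibrated axis: K^(2) e_k / (e_k' K^(2)' K^(2) e_k)
  axis_dir <- sweep(t(K2), 1, colSums(K2^2), "/")
  list(Z = Z, K2 = K2, axis_dir = axis_dir)
}

The samples are then plotted from the rows of Z, and for each variable k the tick mark for a marker value mu is placed at mu * axis_dir[k, ], reproducing the calibration formula above.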

7 Examples

We consider the Vertebral Column Data Set from the UCI Machine Learning Repository (Barreto et al. 2011), discussed in detail by da Rocha Neto et al. (2011). The full data set contains measurements on six numeric variables relating to the shape and orientation of the pelvis and lumbar spine for each of the 310 individuals (samples). These samples are classified as normal, disk hernia, or spondylolisthesis patients. As an example of a two-group CVA, we study the subset of 60 disk hernia and 150 spondylolisthesis patients. The six numeric variables are Pelvic incidence, Pelvic tilt, Lumbar lordosis angle, Sacral slope, Pelvic radius, and Degree spondylolisthesis.

Table 1 Group means for the six measurements V1 = Pelvic incidence, V2 = Pelvic tilt, V3 = Lumbar lordosis angle, V4 = Sacral slope, V5 = Pelvic radius and V6 = Degree spondylolisthesis

Table 1 contains the group means for each of the six variables, both with the outlier (see Fig. 2) included and with it excluded. The presence of the outlier is hard to detect from the table but, as is evident from Figs. 1 and 2, it is clearly revealed in a CVA biplot.

The one-dimensional biplot of this two-group data set is shown in Fig. 1. In this biplot, the two group means as well as all six variables (in the form of six calibrated axes) lie on the one scaffolding line defined by the singular vector associated with the single non-zero singular value of the underlying two-sided eigenequation. The six calibrated axes representing the variables have been vertically translated to aid the interpretation of the biplot. The values of the group means for all six variables can easily be read from the biplot axes, and it can be verified that these values coincide exactly with the corresponding values in Table 1. As expected, all variables have predictivities of 100% for determining the mean values, with \(TSREM = 0\). All the individual sample points have been interpolated onto the biplot and therefore they also lie on the single scaffolding axis defining the biplot. However, \(TSRES > 0\), with a standardized version \(TSRES/tr({\varvec{X}}{\varvec{X}}^{T}) = 0.4702\). To visualize the within-group variation, the biplot has been enhanced by the addition of density estimates of the two sets of sample points interpolated onto the one-dimensional CVA biplot. These density estimates show graphically the separation/overlap of the two groups. Inspection of the six biplot axes suggests V5 to be negatively correlated with the other five variables; the latter are all pairwise positively correlated.
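The standardized residual measure quoted here is simply \(TSRES/tr({\varvec{X}}{\varvec{X}}^{T})\). A minimal sketch, reusing the hypothetical cva_opt_scaffolding() function given earlier; the numerical values in the comments are those reported in the text, not outputs guaranteed by this sketch.

# Standardized TSRES = ||X - X_hat||^2 / tr(X X^T)
tsres_std <- function(X, X_hat) sum((X - X_hat)^2) / sum(X^2)

# fit1 <- cva_opt_scaffolding(X, B, H, J = 2, r = 1)  # one-dimensional CVA biplot
# tsres_std(X, fit1$X_hat)                            # 0.4702 reported in the text
# fit2 <- cva_opt_scaffolding(X, B, H, J = 2, r = 2)  # optimal 2D CVA biplot
# tsres_std(X, fit2$X_hat)                            # 0.1799 reported in the text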

Fig. 1

1D biplot of the two-group Vertebral Column data. V1 = Pelvic incidence, V2 = Pelvic tilt, V3 = Lumbar lordosis angle, V4 = Sacral slope, V5 = Pelvic radius, V6 = Degree spondylolisthesis. The one-dimensional biplot is enhanced by superimposing density estimates for the interpolated sample points in the two groups

It is clear that much can be learned from the one-dimensional CVA biplot, but several issues call for considering a second scaffolding axis:

  • Is the conspicuous outlier an outlier on all variables?

  • Can the approximation of the sample points be improved without sacrificing what has been achieved with the mean vectors?

  • Is it possible to construct a more detailed visualization of the separation/overlap of the two groups?

  • Is it possible to construct a more accurate and detailed visualization of the correlation structure of the six variables?

Fig. 2

Optimal 2D CVA biplot of the two-group Vertebral Column data. V1 = Pelvic incidence, V2 = Pelvic tilt, V3 = Lumbar lordosis angle, V4 = Sacral slope, V5 = Pelvic radius, V6 = Degree spondylolisthesis. The optimal two-dimensional biplot is enhanced by superimposing bags containing the innermost 95% samples of the two groups respectively

Figure 2 provides an answer to the above issues. The optimal two-dimensional CVA biplot in Fig. 2 demonstrates the following:

  • The biplot is uniquely defined.

  • Each of the six biplot axes has a predictivity of \(100\%\) for determining the mean values of the two groups on all variables. As a result, TSREM remains zero. Figure 2 provides a clearer picture of how these predictivities are determined.

  • Introducing the second scaffolding axis improves the approximation of the samples appreciably: the standardized TSRES decreases to 0.1799 (a decrease of more than 60% relative to the corresponding value obtained in Fig. 1).

  • The second scaffolding axis is optimal in the sense that no other choice of second axis can be found that further improves TSRES while keeping TSREM at zero.

  • It is clear that Sample 116 is less of an outlier on V3 and V4 than on V2 and V6.

  • There is a suggestion that, while V5 appears to be negatively correlated with V3 and V4, it is positively correlated with V2 and V6. We note that the addition of a second scaffolding axis introduces angles between the biplot axes, which allow the approximate correlational structure between the variables to be visualized.

Fig. 3

Optimal 2D CVA biplot of the two-group Vertebral Column data excluding outlier sample 116. V1 = Pelvic incidence, V2 = Pelvic tilt, V3 = Lumbar lordosis angle, V4 = Sacral slope, V5 = Pelvic radius, V6 = Degree spondylolisthesis. The optimal two-dimensional biplot is enhanced by superimposing bags containing the innermost 95% samples of the two groups respectively

The biplot in Fig. 2 has been enhanced by superimposing 95%-bags. Alpha-bags are discussed in detail by Gower et al. (2011). The 95%-bag used here contains the innermost 95% of the bivariate sample points, where 'innermost' is defined relative to the Tukey median (Ruts and Rousseeuw 1996); a rough code sketch of the bag idea is given after the list below. We are now ready for a detailed appraisal of the overlap/separation of the two groups based on the two-dimensional clouds of points visualizing the within-group sample variation. However, we first exclude Sample 116 from the analysis and consider the optimal two-dimensional CVA biplot given in Fig. 3, overlaid with 95%-bags. This figure shows:

  • Clearly how the biplot axes are used to determine the group means exactly for each variable.

  • The angles between the biplot axes allow for an approximate visual appraisal of the correlation structure.

  • The two 95%-bags almost touch each other but do not overlap, giving an overall indication of the degree of separation between the two groups.

  • The standardized TSRES value is 0.2155.

  • Although the two clouds of points show a high degree of separation, it can also be seen that

    • there is a high degree of overlap between the two groups concerning V2 and V5;

    • there is almost no overlap on V1, V3 and V6;

    • on V4, large Disk Hernia values overlap with small Spondylolisthesis measurements, while small V4 values occur almost exclusively in the Disk Hernia group.
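The bags themselves are based on Tukey depth, for which we refer to Gower et al. (2011) and the code in Lubbe et al. (2023). Purely as an illustration of the idea, a crude convex-hull-peeling approximation of a 95%-bag can be sketched in a few lines of R; this is not the construction used for Figs. 2-4.

# Crude 95%-bag approximation by convex hull peeling: strip outer hulls until
# just before fewer than 95% of the points would remain, then return the hull
# of the retained points. (The paper's bags use Tukey depth instead.)
peel_bag <- function(Z, coverage = 0.95) {
  n <- nrow(Z)
  keep <- seq_len(n)
  repeat {
    hull <- chull(Z[keep, 1], Z[keep, 2])
    if (length(keep) - length(hull) < coverage * n) break
    keep <- keep[-hull]
  }
  keep[chull(Z[keep, 1], Z[keep, 2])]      # row indices of the bag boundary
}
# usage (per group): plot(Z); polygon(Z[peel_bag(Z), ], border = "grey40")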

The CVA biplots in Figs. 2 and 3 assume equal covariance matrices. We can now relax this assumption and construct, in Fig. 4, the biplot based on the Bhattacharyya distance as discussed in section 6.

Fig. 4

The 2D biplot based on the Bhattacharyya distance of the two-group Vertebral Column data excluding outlier sample 116. V1 = Pelvic incidence, V2 = Pelvic tilt, V3 = Lumbar lordosis angle, V4 = Sacral slope, V5 = Pelvic radius, V6 = Degree spondylolisthesis. The optimal two-dimensional biplot is enhanced by superimposing bags containing the innermost 95% samples of the two groups respectively

Although the appearance of this biplot is quite similar to that of the corresponding optimal CVA biplot shown in Fig. 3, its standardized TSRES value is approximately 10% higher, namely 0.2367.

8 Conclusions

CVA biplots are constructed to show in a single plot the group means as points and all the variables as calibrated linear biplot axes. In the case of two groups the CVA biplot becomes a line containing all these points and biplot axes. Furthermore, there is no approximation in the positions of the group means and their respective values, which can be exactly determined from the biplot axes. It is common practice to interpolate the individual sample points onto the CVA biplot as well, but then they appear as approximations in the one-dimensional CVA biplot space. Since all the biplot axes lie on top of each other it is difficult to use them for determining the values of the two means for the different variables. However, the vertical translation of one of these axes does not affect the values of the two means for that particular variable. Therefore, as has been shown in Fig. 1, vertical translation of the biplot axes does not change the dimensionality of the CVA biplot but increases the usefulness of the different biplot axes for reading off values of the respective variables.

The fundamental question that is addressed in this paper is: What can be gained, if anything, by increasing the dimensionality of the above one-dimensional CVA biplot to two dimensions? This question can be rephrased as: Can we add a second dimension to our one-dimensional CVA biplot to improve the approximation of the individual sample points, leading to a better understanding of the within-group variability, while leaving unchanged the optimal representation of the two group means? As it turns out, the addition of a second dimension is not a straightforward process, since there are infinitely many ways in which this can be done. Furthermore, if existing software is used to construct a two-dimensional CVA biplot in the two-group case, the result can be highly misleading. This is because the two-sided eigenequation underlying the CVA procedure has only a single non-zero singular value, with no natural ordering of the zero singular values, resulting in the indeterminacy of the singular vectors associated with them. Therefore, to guarantee a unique solution for the second dimension, we had to consider a criterion that optimizes the approximation of the sample points while leaving the optimal representation of the group means unchanged. We suggested the TSRES criterion for meeting this goal. Minimizing TSRES results in a uniquely defined two-dimensional CVA biplot for the two-group case: it optimizes the approximation of the within-group samples, while the two group means remain exactly represented and the linear biplot axes provide exact values for both groups on all variables.
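The indeterminacy referred to above is easy to demonstrate numerically: the eigenvectors that a generic solver returns for the zero eigenvalues of a rank-one matrix are merely one of infinitely many orthonormal bases of the null space, so a "second dimension" read off from them is arbitrary. A small illustrative R snippet, not tied to any particular biplot implementation:

# A rank-one symmetric matrix, mimicking the rank-one structure of the two-group case
set.seed(1)
b  <- rnorm(5)
H1 <- tcrossprod(b)                  # one positive eigenvalue, four zero eigenvalues
e  <- eigen(H1, symmetric = TRUE)
round(e$values, 10)                  # only the first eigenvalue is non-zero
# Any rotation of the null-space eigenvectors is an equally valid answer:
Q  <- qr.Q(qr(matrix(rnorm(16), 4))) # arbitrary 4 x 4 orthogonal matrix
V2 <- e$vectors[, 2:5] %*% Q         # a different, equally valid set of eigenvectors
max(abs(H1 %*% V2))                  # numerically zero: still eigenvectors for eigenvalue 0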

Figure 3 shows that our proposed optimal two-dimensional CVA biplot for two groups has the potential to provide the researcher with a tool that not only separates the group means optimally but also depicts the within-group sample variation graphically, giving deeper insight into the separation/overlap of the two groups. Moreover, it is unique and thus prevents any ambiguity when routinely using existing software. Thus we have attained our primary objective, as illustrated in the example discussed above.

The algebra underlying the optimal two-dimensional CVA biplot extends directly to an optimal three-dimensional biplot when dealing with a three-group case, whose CVA biplot space has dimension two.

An alternative, based on the Bhattacharyya distance, is available when the equal within-group covariance matrix assumption is relaxed. As can be seen from Fig. 4, the resulting biplot is slightly different, but the overall conclusions regarding overlap and separation in terms of the individual variables remain unchanged. However, the biplot based on the Bhattacharyya distance optimizes a different objective than the sum of the squared approximation errors of the data matrix. Therefore, our preferred biplot in the two-group case remains the proposed optimal 2D CVA biplot.

Finally, we note that our R code for constructing the biplots discussed in this paper is available in Lubbe et al. (2023).