Abstract
Canonical variate analysis (CVA) entails a two-sided eigenvalue decomposition. When the number of groups, J, is less than the number of variables, p, at most \(J-1\) eigenvalues are not exactly zero. A CVA biplot is the simultaneous display of the two entities: group means as points and variables as calibrated biplot axes. It follows that with two groups the group means can be exactly represented in a one-dimensional biplot but the individual samples are approximated. We define a criterion to measure the quality of representing the individual samples in a CVA biplot. Then, for the two-group case we propose an additional dimension for constructing an optimal two-dimensional CVA biplot. The proposed novel CVA biplot maintains the exact display of group means and biplot axes, but the individual sample points satisfy the optimality criterion in a unique simultaneous display of group means, calibrated biplot axes for the variables, and within group samples. Although our primary aim is to address two-group CVA, our proposal extends immediately to an optimal three-dimensional biplot when encountering the equally important case of comparing three groups in practice.
1 Introduction
It is difficult to overstate the value of graphical displays to accompany formal statistical classification procedures (see e.g., Tukey 1975). Indeed, it is an open question whether a complete assessment of group structure, including the overlap and separation of groups and the within-group sample behaviour, is possible without the aid of suitable graphics, relying exclusively on computed statistical measures and tables. In particular, graphical procedures that can display simultaneously the means and the individuals of different groups, together with information on the relative contributions of a chosen set of classification variables, are highly in demand.
Gabriel (1971) proposed the biplot for displaying simultaneously the rows and columns of a data matrix \({\varvec{X}}: n\times p\) in a single graph. A year later Gabriel showed how to construct a canonical variate analysis (CVA) biplot (Gabriel 1972) that provides a two-dimensional graphical approximation of various groups optimally separated according to a multidimensional CVA criterion. This CVA biplot became popular among statisticians and practitioners applying linear discriminant analysis (LDA) and CVA in various fields. Gittins (1985) can be consulted for an overview of CVA. Distances in this classical Gabriel CVA biplot are interpreted in terms of inner products between vectors. Such interpretations are not as straightforward as distances in an ordinary scatter plot. The unified biplot methodology proposed by Gower (1995) and extensively discussed in the monograph by Gower and Hand (1996) allows the CVA biplot to be regarded as a multivariate extension of an ordinary scatter plot. The concept of inter-sample distance is central in this approach, while information on the classifier variables is added to the graph of means and individual sample points in the form of biplot axes: an axis, calibrated in its original units, for each (classifier) variable. The perspective of Gower and Hand (1996) inspires the biplot methodology discussed in this paper. Gower et al. (2011) discuss the one-dimensional CVA biplot for use in two-group studies in some detail. Although the latter authors provide several enhancements to this one-dimensional CVA biplot, they did not address the challenge of constructing an optimal two-dimensional biplot for the two-group case.
Classification procedures in practice are often confronted with the problem of optimally distinguishing between two groups. The aim of several procedures used in multidimensional classification is to separate the group means optimally. This is the aim of CVA and the closely related procedure of LDA. These techniques involve the transformation of the original observations into a so-called canonical space. Flury (1997), among others, proves that in the case of J groups, only the first \(K = min(p,J-1)\) elements of the group mean vectors differ in the canonical space. This induces some pitfalls when routinely constructing CVA biplots in the case where \(p > J-1\) with J equaling two or three. Since in the case of two groups the group means can be exactly represented on a line, the associated CVA biplot becomes a one-dimensional plot with all p biplot axes representing the different variables on top of each other on the line extending through the representations of the two group means. All n sample points also fall on this line. The theoretical basis for constructing this line is provided by the eigenanalysis of a two-sided eigenvalue problem (see e.g., Gower and Hand (1996) and Gower et al. (2011)). In the two-group case this eigenanalysis involves only one non-zero (positive) eigenvalue together with \(p-1\) zero eigenvalues. Therefore, only the eigenvector associated with the single non-zero eigenvalue is uniquely defined. This eigenvector provides the scaffolding for constructing a one-dimensional CVA biplot. Although a two-dimensional biplot can be constructed using two eigenvectors, the second eigenvector will not be uniquely defined, as it is when \(J > 2\), unless some precautions are taken. A similar situation arises in the case of three groups: the first two eigenvectors are then uniquely defined but not the third. Why do we want extra scaffolding dimensions for constructing CVA biplots when \(p > J-1\)?
There is a real advantage when we have a two-group or three-group classification problem: not only are the group means then represented exactly but also an improvement in the approximations of the individual sample points is accomplished. Therefore, in this paper, we first define a measure of how well any mean or individual sample point is represented in a CVA biplot. Then we show, in the case of two groups, how to construct a uniquely defined two-dimensional CVA biplot such that both the means and individual samples are optimally represented together with the variables in the form of calibrated biplot axes. In addition, we will show how this process can be extended directly for constructing a uniquely defined optimal three-dimensional biplot when three groups are considered.
The paper is organized as follows: in the next section, we begin with some historical background of discriminant analysis. After that, we briefly review the basic concepts and theory of CVA biplot methodology according to the perspective of Gower and Hand (1996). This is followed by a section describing the geometry of the graphical representation of the class means and the sample points about them. In section 4 we discuss why the two-group CVA biplot deserves special attention. We then put forward proposals for one-dimensional CVA biplots as well as a two-dimensional CVA biplot such that the group means in a two-group classification problem are exactly represented together with an optimal representation of the individual sample points. We also show how to generalize this procedure to construct a three-dimensional biplot with similar properties for use when \(J = 3 < p\). In section 6 we briefly discuss an alternative unique two-dimensional biplot, which is based on the Bhattacharyya distance. The theoretical results are illustrated in section 7, where we provide examples covering one- and two-dimensional CVA biplots. Conclusions are presented in section 8.
2 A brief review of canonical variate analysis
2.1 Two-group discriminant analysis: Fisher’s LDA
Two-group discriminant analysis considers two populations (groups) \(G_{1}\) and \(G_{2}\). An observation \({\varvec{x}}\) of \({\varvec{X}}^{T} = (X_{1}, X_{2}, \ldots , X_{p})\) is to be allocated to one of these populations. It is assumed that the density \(f_{i}({\varvec{x}})\) of \({\varvec{X}}\) for \(i = 1, 2\) is known, with expected value \({\varvec{\mu }}_{i}:p\times 1\) and covariance matrix \({\varvec{\varSigma }}_{i}: p \times p\), respectively. Let the prior probability of an unknown \({\varvec{x}}\) belonging to \(G_{i}\) be given by
\(p_{1} = P(G_{1})\) and \(p_{2} = P(G_{2})\), respectively, with \(p_{i} > 0\) and \(p_{1}+p_{2} = 1\).
Define
Under the assumption that \({\varvec{\varSigma }}_{1} = {\varvec{\varSigma }}_{2} = {\varvec{\varSigma }}\) (say), it follows that \({\varvec{W}} = {\varvec{\varSigma }}\). Fisher's LDA (Fisher 1936) searches for the linear function
with \(E(Y) = {\varvec{m}}^{T}E({\varvec{X}}) = {\varvec{m}}^{T}{\varvec{\mu }}\), \(E(Y|G_{i}) = {\varvec{m}}^{T}{\varvec{\mu }}_{i}\) and \(var(Y) = {\varvec{m}}^{T}{\varvec{W}}{\varvec{m}}\) to maximize
The maximum is obtained from the eigenequation
Pre-multiplying the above equation with \({\varvec{W}}^{1/2}\) leads to
so that \({\varvec{l}} = {\varvec{W}}^{1/2}{\varvec{m}}\) is the eigenvector of \({\varvec{W}}^{-1/2}{\varvec{B}}{\varvec{W}}^{-1/2}\) associated with the largest eigenvalue \(\lambda _{1}\) and \({\varvec{m}}={\varvec{W}}^{-1/2}{\varvec{l}}.\)
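The route through the symmetric eigenproblem can be sketched numerically. The following toy illustration uses hypothetical parameter values (not from the paper) and assumes the usual definitions whose displays are elided above, namely the pooled covariance matrix \({\varvec{W}}\) and, for two groups, \({\varvec{B}} = p_{1}p_{2}({\varvec{\mu }}_{1}-{\varvec{\mu }}_{2})({\varvec{\mu }}_{1}-{\varvec{\mu }}_{2})^{T}\) of rank one. It verifies that \({\varvec{m}}={\varvec{W}}^{-1/2}{\varvec{l}}\) is proportional to \({\varvec{W}}^{-1}({\varvec{\mu }}_{1}-{\varvec{\mu }}_{2})\) and that only one eigenvalue is non-zero:

```python
import numpy as np

# Hypothetical two-group parameters (illustrative values only).
mu1 = np.array([1.0, 2.0, 0.0])
mu2 = np.array([-1.0, 0.0, 1.0])
W = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])
p1 = p2 = 0.5

# Between-group matrix; for two groups B = p1*p2*(mu1-mu2)(mu1-mu2)^T (rank 1).
d = mu1 - mu2
B = p1 * p2 * np.outer(d, d)

# Symmetric eigenproblem for W^{-1/2} B W^{-1/2}.
evalW, evecW = np.linalg.eigh(W)
W_inv_sqrt = evecW @ np.diag(evalW ** -0.5) @ evecW.T
lam, L = np.linalg.eigh(W_inv_sqrt @ B @ W_inv_sqrt)
# eigh returns ascending order; the single non-zero eigenvalue comes last.
lam1, l1 = lam[-1], L[:, -1]
m1 = W_inv_sqrt @ l1                    # m = W^{-1/2} l

# m1 is proportional to Fisher's direction W^{-1}(mu1 - mu2).
fisher = np.linalg.solve(W, d)
cos = abs(m1 @ fisher) / (np.linalg.norm(m1) * np.linalg.norm(fisher))
print(round(cos, 6), int(np.sum(lam > 1e-10)))
```

Since the rank of \({\varvec{B}}\) is one, the symmetric matrix has a single non-zero eigenvalue, mirroring the rank argument that follows.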
Since \(rank({\varvec{B}})=1=rank({\varvec{W}}^{-1}{\varvec{B}})=rank({\varvec{W}}^{-1/2}{\varvec{B}}{\varvec{W}}^{-1/2})\), it follows that
If we choose \({\varvec{L}}\) to be an orthogonal matrix in \(({\varvec{W}}^{-1/2}{\varvec{B}}{\varvec{W}}^{-1/2}){\varvec{L}}={\varvec{L\varLambda }}\), then
and
The transformation
for \(k = 1, 2, \ldots , p\) is termed a transformation into the canonical space with \(Y_{1}\) the first canonical variate, where
\(E(Y_{1}|G_{i}) = {\varvec{m}}_{1}^{T}{\varvec{\mu }}_{i} \; \textrm{and} \; Var(Y_{1}|G_{i}) = 1 \; \textrm{for} \; i = 1, \; 2\).
Once properly scaled, the solution \({\varvec{m}}_{1}\) maximizing (1) can be written as
which is known as Fisher’s linear discriminant function (LDF). The maximum of (1) is given by the squared Mahalanobis distance, namely
For \(k=2, 3, \ldots , p\) we have \(Var(Y_{k} | G_{i}) = 1\) and
Since
it follows from (4) that \({\varvec{m}}_{k}^{T}\left( {\varvec{\mu }}_{1}-{\varvec{\mu }}_{2}\right) =0\), so that all differences between the groups vanish for the second and further canonical variates.
Rao (1948) extends Fisher's LDA procedure to \(J>2\) groups by deriving \(J-1\) linear discriminant functions of the form (3); in this paper, however, we are primarily interested in the case \(J=2\).
2.2 Bayes linear and quadratic classifiers
Under the assumption of multivariate normal distributions with different covariance matrices \({\varvec{\varSigma }}_{1}\) and \({\varvec{\varSigma }}_{2}\), the Bayes quadratic classifier (see e.g., Hastie et al. 2001) is given by
If \({\varvec{\varSigma }}_{1} = {\varvec{\varSigma }}_{2} = {\varvec{\varSigma }}\) (say), we have the Bayes linear classifier
It is clear that if \(p_{1} =p_{2}\), the Bayes linear classifier is equivalent to Fisher’s LDF.
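This equivalence is easy to check numerically. The sketch below uses hypothetical parameters and assumes the standard forms of the two rules (the displayed classifiers are elided above): the Bayes linear discriminant with equal priors and Fisher's midpoint rule based on \({\varvec{\varSigma }}^{-1}({\varvec{\mu }}_{1}-{\varvec{\mu }}_{2})\) allocate every point identically:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative parameters (not taken from the paper's examples).
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
Sinv = np.linalg.inv(Sigma)

def bayes_linear(x, p1=0.5, p2=0.5):
    """Allocate to group 1 iff the Bayes linear discriminant is positive."""
    score = (mu1 - mu2) @ Sinv @ x \
        - 0.5 * (mu1 + mu2) @ Sinv @ (mu1 - mu2) + np.log(p1 / p2)
    return score > 0

def fisher_ldf(x):
    """Fisher's rule: project on m = Sigma^{-1}(mu1-mu2), cut at the midpoint."""
    m = Sinv @ (mu1 - mu2)
    return m @ (x - 0.5 * (mu1 + mu2)) > 0

X = rng.normal(size=(200, 2))
agree = np.mean([bayes_linear(x) == fisher_ldf(x) for x in X])
print(agree)
```

With \(p_{1}=p_{2}\) the log-prior term vanishes and the two scores are algebraically identical, so the agreement rate is one.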
In general, the Bhattacharyya distance measures the similarity of two probability distributions. When the distributions concerned are \(N_{p}\left( {\varvec{\mu }}_{1}, {\varvec{\varSigma }}_{1}\right)\) and \(N_{p}\left( {\varvec{\mu }}_{2}, {\varvec{\varSigma }}_{2}\right)\) it is given by (see e.g., Fukunaga 1990; Hennig 2004):
It can be shown that the sample form of (5) provides an upper bound for the Bayes error (see e.g., McLachlan 1992). Furthermore, when \({\varvec{\varSigma }}_{1} = {\varvec{\varSigma }}_{2} = {\varvec{\varSigma }}\) then (5) is proportional to a squared Mahalanobis distance. For this case, Fukunaga (1990) shows that \(D_{Bhat}\) is maximized by \(Y= {\varvec{m}}^{T}{\varvec{X}}\) where \({\varvec{Bm}} = \lambda _{1}{\varvec{Wm}}\), so that maximization is achieved when \({\varvec{m}}\) is taken as (3).
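A sketch of the Bhattacharyya distance in its standard form for two normal distributions (we assume the elided display (5) takes this usual form), confirming that with equal covariance matrices only a term proportional to the squared Mahalanobis distance remains:

```python
import numpy as np

def bhattacharyya(mu1, S1, mu2, S2):
    """Bhattacharyya distance between N(mu1, S1) and N(mu2, S2), standard form."""
    Sbar = 0.5 * (S1 + S2)
    diff = mu1 - mu2
    term_mean = 0.125 * diff @ np.linalg.solve(Sbar, diff)
    term_cov = 0.5 * np.log(np.linalg.det(Sbar)
                            / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term_mean + term_cov

# Equal covariance matrices: the covariance term vanishes and only the
# mean term, one-eighth of the squared Mahalanobis distance, remains.
S = np.array([[1.0, 0.2], [0.2, 1.5]])
mu1, mu2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
d_equal = bhattacharyya(mu1, S, mu2, S)
mahal_sq = (mu1 - mu2) @ np.linalg.solve(S, mu1 - mu2)
print(np.isclose(d_equal, mahal_sq / 8))
```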
2.3 Two-group discrimination where group means and or group covariance matrices may differ
While LDA as discussed above allows groups to differ only with respect to their means, Fukunaga (1990) looks for a linear transformation \({\varvec{Y}} = {\varvec{XM}}\) that separates groups with respect to their means or their covariance matrices. With the notation of the subsection above, write
Fukunaga (1990) provides four criteria for class separability:
\(J_{1} = tr({\varvec{S}}_{2}^{-1}{\varvec{S}}_{1})\); \(J_{2} = log|{\varvec{S}}_{2}^{-1}{\varvec{S}}_{1}|\); \(J_{3} = tr({\varvec{S}}_{1}) - \lambda (tr({\varvec{S}}_{2})-c)\), where \(\lambda\) is a Lagrange multiplier and c is a constant; and \(J_{4} = tr({\varvec{S}}_{1})/tr({\varvec{S}}_{2})\). Here \({\varvec{S}}_{1}\) and \({\varvec{S}}_{2}\) are each one of \({\varvec{T}}\), \({\varvec{B}}\), or \({\varvec{W}}\).
Let \(J_{i}(k)\) indicate the criterion with \({\varvec{S}}_{1}= {\varvec{B}}\) and \({\varvec{S}}_{2}= {\varvec{W}}\), and where \({\varvec{Y}} = {\varvec{XM}}\) with \({\varvec{M}}:p\times k\). Since we are restricted to linear transformations, optimization of \(J_{1}(1)\) is equivalent to Fisher discriminant analysis with (1) and \(J_{1}(1)= \lambda _{1}\) with no other dimension contributing to the value of \(J_{1}\). Fukunaga (1990) further shows that criterion \(J_{1}\) gives the same optimum transformation for other combinations of \({\varvec{B}}\), \({\varvec{W}}\), and \({\varvec{T}}\) for \({\varvec{S}}_{1}\) and \({\varvec{S}}_{2}\) and also for optimizing \(J_{2}\).
When \({\varvec{\mu }}_{1}={\varvec{\mu }}_{2}={\varvec{\mu }}\), (5) becomes
If more than a single dimension is needed we have that
is maximized by \({\varvec{Y}} = {\varvec{XM}}\) with \({\varvec{M}}: p \times k\), where
and
Thus, \({\varvec{M}}\) must contain the eigenvectors of both \({\varvec{\varSigma }}_{2}^{-1}{\varvec{\varSigma }}_{1}\) and \({\varvec{\varSigma }}_{1}^{-1}{\varvec{\varSigma }}_{2}\). However, they have the same eigenvectors since \({\varvec{\varSigma }}_{2}^{-1}{\varvec{\varSigma }}_{1} = ({\varvec{\varSigma }}_{1}^{-1}{\varvec{\varSigma }}_{2})^{-1}\) and we have that
and
so that \(({\varvec{\varSigma }}_{2}^{-1}{\varvec{\varSigma }}_{1}){\varvec{M}} ={\varvec{M\varLambda }}\) and \(({\varvec{\varSigma }}_{1}^{-1}{\varvec{\varSigma }}_{2}){\varvec{M}} ={\varvec{M\varLambda }}^{-1}\).
Therefore, if \(k > 1\) dimensions are needed, the k eigenvectors corresponding to the k largest values of J are chosen, i.e., those corresponding to the largest values \(\lambda _{i}+\frac{1}{\lambda _{i}} + 2\).
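This selection rule can be sketched as follows (toy covariance matrices, assumed purely for illustration): eigenvalues of \({\varvec{\varSigma }}_{2}^{-1}{\varvec{\varSigma }}_{1}\) far from one, whether large or small, score highest on \(\lambda + 1/\lambda\) and are ranked first:

```python
import numpy as np

# Illustrative covariance matrices for the equal-means case.
S1 = np.diag([4.0, 1.0, 1.0, 0.25])
S2 = np.eye(4)

# Eigenvalues/vectors of S2^{-1} S1 (here simply S1, since S2 = I).
lam, M = np.linalg.eig(np.linalg.solve(S2, S1))
lam = lam.real

# Rank dimensions by lambda + 1/lambda: both very large and very small
# eigenvalues signal covariance differences useful for separation.
score = lam + 1.0 / lam
order = np.argsort(-score)
print(lam[order])   # the extremes 4.0 and 0.25 come before the two 1.0's
```

Note that the additive constant 2 does not affect the ordering, so it is omitted from the score.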
3 Discriminant analysis with sampled data
In practice, discriminant analysis is usually performed using sampled data. This necessitates substituting the population parameters in the formulas of section 2 with sample estimates; the plug-in principle and maximum likelihood estimation are popular approaches.
It is well known that LDA often outperforms quadratic discriminant analysis (QDA) even when the assumption of equal covariance matrices is violated (see e.g., Flury et al. 1997; McLachlan 1992). This can be attributed to the large number of parameters that have to be estimated in QDA with over-parameterization inducing a loss of power. This contributes to the popularity of LDA among practitioners and so in the rest of the paper, our focus will be on LDA.
Consider the data matrix \({\varvec{X}}:n\times p\) centered such that \({\varvec{1}}^{T}{\varvec{X}} = {\varvec{0}}^{T}\). The data contained in \({\varvec{X}}\) consists of p measurements made for each of the J groups. The group sizes are \(n_{1},n_{2},\ldots ,n_{J}\), respectively, such that \(\sum _{i=1}^{J} n_{i} =n\). Let \({\varvec{N}}_{g} = diag(n_{1},n_{2},\ldots ,n_{J})\), so that a matrix of group means can be calculated as
where \({\varvec{G}}: n\times J\) denotes an indicator matrix defining the J groups.
Let \({\mathcal {V}}({\varvec{X}}^{T})\) denote the vector space generated by the columns of \({\varvec{X}}^{T}\). We assume this vector space of p-vectors to be of dimension p. Since each row of \(\overline{{\varvec{X}}}\) is a linear combination of the rows of \({\varvec{X}}\) it follows that \(\overline{{\varvec{X}}}^{T} \in {\mathcal {V}}({\varvec{X}}^{T})\).
Define:
-
1.
\({\varvec{S}}_{B}:p\times p\), as the between-group matrix of squares and products: \({\varvec{S}}_{B} =\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}}= {\varvec{X}}^{T}{\varvec{G}}({\varvec{G}}^{T}{\varvec{G}})^{-1}{\varvec{G}}^{T}{\varvec{X}}\) and
-
2.
\({\varvec{S}}_{W}:p\times p\) as the within-group matrix of squares and products: \({\varvec{S}}_{W} = {\varvec{X}}^{T}{\varvec{X}} -\overline{{\varvec{X}}}^{T}{\varvec{N}}_{g}\overline{{\varvec{X}}}={\varvec{X}}^{T}({\varvec{I}}-{\varvec{G}}({\varvec{G}}^{T}{\varvec{G}})^{-1}{\varvec{G}}^{T}){\varvec{X}}\).
Generally, \(rank({\varvec{S}}_{W}) = p\) while \(rank({\varvec{S}}_{B}) = min(J-1,p)\).
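These definitions translate directly into matrix computations. A minimal sketch on random toy data, verifying the partitioning \({\varvec{S}}_{B}+{\varvec{S}}_{W}={\varvec{X}}^{T}{\varvec{X}}\) and that \(rank({\varvec{S}}_{B})=J-1\) for centred data with \(J=2\):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, p = 10, 15, 4           # two groups, p variables (toy sizes)
X = rng.normal(size=(n1 + n2, p))
X = X - X.mean(axis=0)          # centre so that 1^T X = 0^T

# Indicator matrix G: n x J, one column per group.
G = np.zeros((n1 + n2, 2))
G[:n1, 0] = 1.0
G[n1:, 1] = 1.0

Q = G @ np.linalg.inv(G.T @ G) @ G.T       # projector onto group means
S_B = X.T @ Q @ X                           # between-group SSP matrix
S_W = X.T @ (np.eye(n1 + n2) - Q) @ X       # within-group SSP matrix

# S_B + S_W recovers the total SSP matrix X^T X, and rank(S_B) = J - 1 = 1.
print(np.allclose(S_B + S_W, X.T @ X), np.linalg.matrix_rank(S_B))
```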
The two-sided eigenvalue problem
provides the solution \({\varvec{b}}_{1}\) to the CVA criterion
In the above, the diagonal matrix \({\varvec{\varLambda }}:p\times p\) contains the eigenvalues \(\lambda _{1} \ge \lambda _{2} \ge \ldots \ge \lambda _{p} \ge 0\), where \(\lambda _{J} = \lambda _{J+1} = \ldots \ = \lambda _{p} = 0\) if \(J<p+1\).
The matrix \({\varvec{B}} = \left[ {\varvec{b}}_{1},{\varvec{b}}_{2},\ldots ,{\varvec{b}}_{p}\right]\) contains all p solutions to the two-sided eigenvalue problem. Only the solution \({\varvec{b}}_{1}\) is optimal for the CVA criterion (9). The matrix \({\varvec{B}}: p \times p\) is non-singular with \({\varvec{B}}^{-1}={\varvec{B}}^{T}{\varvec{S}}_{W}\), while the columns of \({\varvec{B}}\) are orthogonal in the metric \({\varvec{S}}_{W}\) because of the constraint \({\varvec{B}}^{T}{\varvec{S}}_{W}{\varvec{B}}={\varvec{I}}\).
Canonical variates are defined by the transformation \({\varvec{y}}^{T}={\varvec{x}}^{T}{\varvec{B}}\), where \({\varvec{x}}\) is any p-vector belonging to \({\mathcal {V}}({\varvec{X}}^{T})\). The centered data matrix itself is transformed to the canonical variate values matrix \({\varvec{Y}}: n \times p\) through the one-to-one (canonical) transformation
The transformation (10) implies a transformation of \({\mathcal {V}}({\varvec{X}}^{T})\) to \({\mathcal {V}}({\varvec{Y}}^{T})\), the canonical space, of dimension p since \(rank({\varvec{Y}}) = rank({\varvec{X}})\). Furthermore, (10) implies that
We will call \(\overline{{\varvec{Y}}}\) the canonical means matrix. It follows that the columns of \(\overline{{\varvec{Y}}}^{T}\) generate a subspace of dimension \(min(J-1, p)\) of \({\mathcal {V}}({\varvec{Y}}^{T})\). This subspace is denoted by \({\mathcal {V}}(\overline{{\varvec{Y}}}^{T})\).
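The canonical transformation can be sketched numerically. The example below (toy data with \(J=3\) groups) solves the two-sided eigenvalue problem via Cholesky whitening, which enforces \({\varvec{B}}^{T}{\varvec{S}}_{W}{\varvec{B}}={\varvec{I}}\), and confirms that the canonical means matrix has at most \(J-1\) non-zero columns:

```python
import numpy as np

rng = np.random.default_rng(2)
p, sizes = 5, [12, 18, 20]                 # J = 3 groups, toy data
n = sum(sizes)
X = rng.normal(size=(n, p))
X[:12] += 2.0                               # shift group 1 so the means differ
X = X - X.mean(axis=0)

G = np.zeros((n, 3))
G[:12, 0] = G[12:30, 1] = G[30:, 2] = 1.0
Ng = G.T @ G
Xbar = np.linalg.inv(Ng) @ G.T @ X          # J x p group means

S_B = Xbar.T @ Ng @ Xbar
S_W = X.T @ X - S_B

# Solve S_B b = lambda S_W b with B^T S_W B = I via Cholesky whitening:
# with S_W = L L^T, the problem becomes a symmetric eigenproblem for
# L^{-1} S_B L^{-T}; mapping back gives the canonical scaffolding B.
Linv = np.linalg.inv(np.linalg.cholesky(S_W))
lam, V = np.linalg.eigh(Linv @ S_B @ Linv.T)
lam, V = lam[::-1], V[:, ::-1]              # descending: lambda_1 >= ... >= 0
B = Linv.T @ V

Ybar = Xbar @ B                             # canonical means matrix
# Only the first J-1 = 2 canonical coordinates of the means are non-zero.
print(np.allclose(B.T @ S_W @ B, np.eye(p)),
      int(np.sum(lam > 1e-8)),
      np.allclose(Ybar[:, 2:], 0.0, atol=1e-6))
```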
4 Biplot display of the canonical means matrix \(\overline{{\varvec{Y}}}\) and the canonical variates values matrix \({\varvec{Y}}\)
An r-dimensional canonical variate analysis (CVA) plot is constructed by taking the first r canonical variates, associated with
to provide the coordinates (or scaffolding) for representing the J canonical group means contained in (11) as points in r dimensions. If this plot is equipped with p linear axes to represent the original p variables, a CVA biplot is obtained. Each of these biplot axes is determined by a vector, which induces also a graduation on it. Gower and Hand (1996) consider two types of CVA biplots, each one characterized by its system of p linear axes, its aim and its corresponding geometry:
-
The interpolation biplot, which has the aim of placing on the plot the image \((y_{1},y_{2},...,y_{r})\) of any new point \({\varvec{x}} \in {\mathcal {V}}({\varvec{X}}^{T})\).
-
The prediction biplot, which has the aim of estimating the point \({\varvec{x}}\) (i.e., the set of variable values) having as an image a given point \((y_{1},y_{2},...,y_{r})\) in the plot.
Once the CVA biplot for representing the group means is constructed, all transformed samples contained in \({\varvec{Y}}\) can be interpolated into the biplot. Thus both the canonical means and the transformed samples \({\varvec{Y}}\) can be displayed in a CVA biplot in an r-dimensional subspace of \({\mathcal {V}}(\overline{{\varvec{Y}}}^{T})\) (with \(r \le min(J-1, p)\)). Typically, an r of two or three will be chosen to construct this subspace, which we will call the biplot space.
Gower and Hand (1996) show the above processes of prediction and interpolation to be based on the following: A sample \({\varvec{x}}: p \times 1\) can be interpolated into \({\mathcal {V}}({\varvec{Y}}^{T})\) by
The representation of \({\varvec{x}}\) in the biplot space is given by
where \({\varvec{B}}_{r}\) is defined in (12).
Prediction is the inverse of interpolation and, since \({\varvec{B}}\) is non-singular, it follows by inverting the above formula for interpolation that \({\varvec{x}}^{T}= {\varvec{y}}^{T}{\varvec{B}}^{-1}\). The matrix \({\varvec{B}}^{-1}\) can be partitioned into
The predicted value for the kth variable can be written as \({\hat{x}}_{k} = {\varvec{z}}^{T}{\varvec{B}}^{(r)}{\varvec{e}}_{k}\) and therefore, the predicted value for \({\varvec{x}}^{T}\) is
It follows from (15) that
and in addition
Since, in the CVA biplot described above, the samples are interpolated into the biplot constructed for the canonical means, it is expected that the class means will be better represented than the canonical variate values, i.e., the rows of \({\varvec{Y}}\). What is needed, then, are measures of fit for use in CVA.
4.1 Measures of fit for use in CVA
From the identity
we have that
where the matrix \({\varvec{Q}}:n \times n = {\varvec{G}}({\varvec{G}}^{T}{\varvec{G}})^{-1}{\varvec{G}}^{T}\) is positive semi-definite, symmetric and idempotent.
This between-and-within group structure is of interest for:
-
Assigning a given sample to its most appropriate group.
-
Relating the groups to one another.
A study of the relationships among the groups encourages the use of low-dimensional approximations for visualization, including CVA biplots.
4.2 Measures of fit for CVA biplots: recovering the canonical group means
The orthogonal partitioning
(see Gardner-Lubbe et al. 2008) allows an overall measure of how well the group means are represented in the CVA biplot, namely
In the orthogonal partitioning above, the matrix \({\varvec{B}}\) can be eliminated to define axis predictivities as the diagonal elements of the matrix
Each axis predictivity is a measure of how well the values for the group means can be determined from the CVA biplot for the variable associated with that particular biplot axis. We note that the overall quality is a weighted mean of the individual axis predictivities.
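Assuming the standard definitions of Gardner-Lubbe et al. (2008) for the quantities whose displays are elided above, both the overall quality and the axis predictivities are ratios of sums of squares of the displayed means \(\hat{\overline{{\varvec{X}}}}=\overline{{\varvec{X}}}{\varvec{B}}_{r}{\varvec{B}}^{(r)}\) to those of the exact means, in the \({\varvec{N}}_{g}\) metric. A toy sketch for an \(r=1\) biplot of \(J=3\) groups:

```python
import numpy as np

rng = np.random.default_rng(3)
p, sizes = 4, [15, 15, 20]                  # J = 3 groups, toy data
n = sum(sizes)
X = rng.normal(size=(n, p))
X[:15, 0] += 3.0                             # separate the group means
X[30:, 1] -= 2.0
X = X - X.mean(axis=0)

G = np.zeros((n, 3))
G[:15, 0] = G[15:30, 1] = G[30:, 2] = 1.0
Ng = G.T @ G
Xbar = np.linalg.inv(Ng) @ G.T @ X

S_B = Xbar.T @ Ng @ Xbar
S_W = X.T @ X - S_B
Linv = np.linalg.inv(np.linalg.cholesky(S_W))   # Cholesky whitening of S_W
lam, V = np.linalg.eigh(Linv @ S_B @ Linv.T)
B = Linv.T @ V[:, ::-1]                      # descending eigenvalue order
Binv = np.linalg.inv(B)

r = 1                                        # deliberately fewer than J-1 = 2
Xbar_hat = Xbar @ B[:, :r] @ Binv[:r, :]     # means as displayed in the biplot

# Overall quality: weighted fit of the displayed means (1 when r >= J-1).
quality = np.trace(Xbar_hat.T @ Ng @ Xbar_hat) / np.trace(Xbar.T @ Ng @ Xbar)
# Axis predictivities: the per-variable version of the same ratio.
axis_pred = np.diag(Xbar_hat.T @ Ng @ Xbar_hat) / np.diag(Xbar.T @ Ng @ Xbar)
print(0.0 < quality < 1.0, np.all(axis_pred <= 1 + 1e-9))
```

The orthogonal partitioning guarantees that each predictivity lies between zero and one, and the overall quality is their weighted mean.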
4.3 Measures of fit for CVA biplots: recovering the individual samples
We also need predictivities for individual samples corrected for class means i.e., for \(({\varvec{I}}-{\varvec{Q}}){\varvec{X}}\). So, our starting point becomes the decomposition
where \(\hat{{\varvec{X}}}\) is defined in (16). Then the following orthogonal decompositions (see Gower et al. 2011) hold:
-
1.
Type A
$$\begin{aligned} {\varvec{B}}^{T}{\varvec{X}}^{T}({\varvec{I}}-{\varvec{Q}}){\varvec{X}}{\varvec{B}}= & {} {\varvec{B}}^{T}\hat{{\varvec{X}}}^{T}({\varvec{I}}-{\varvec{Q}})\hat{{\varvec{X}}}{\varvec{B}}\\{} & {} +{\varvec{B}}^{T}\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) ^{T}({\varvec{I}}-{\varvec{Q}})\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) {\varvec{B}}. \end{aligned}$$ -
2.
Type B
$$\begin{aligned} ({\varvec{I}}-{\varvec{Q}}){\varvec{X}}{\varvec{B}}{\varvec{B}}^{T}{\varvec{X}}^{T}({\varvec{I}}-{\varvec{Q}})= & {} ({\varvec{I}}-{\varvec{Q}})\hat{{\varvec{X}}}{\varvec{B}}{\varvec{B}}^{T}\hat{{\varvec{X}}}^{T}({\varvec{I}}-{\varvec{Q}})\\{} & {} + ({\varvec{I}}-{\varvec{Q}})\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) {\varvec{B}}{\varvec{B}}^{T}\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) ^{T}({\varvec{I}}-{\varvec{Q}}). \end{aligned}$$
While the class means are exactly represented in a subspace of dimension \(min(J-1,p)\) of the canonical space, this is generally not true for the individual samples. From Type B orthogonality within-group sample predictivities can be defined as the diagonal elements of
4.4 CVA biplots for \(J=2\) groups
When \(J=2\) it now follows that:
-
The underlying two-sided eigenequation has one non-zero eigenvalue and \(p-1\) zero eigenvalues.
-
All p class means are exactly represented in a single dimension.
-
This single dimension contains the p biplot axes (each with predictivity \(100\%\) for recovering the group means) on top of each other.
-
Overall quality of representing group means is \(100\%\).
-
The one-dimensional CVA biplot is optimal for representing groups irrespective of the number of variables p.
-
The samples are not exactly represented in the one-dimensional CVA biplot.
Our challenge is now to add another dimension to improve the recovery of sample information without changing the optimality properties already available for the group means. We address this challenge by considering the orthogonal complement, in the canonical space, of the subspace containing the two group means. Therefore, we consider the eigenvectors associated with the zero eigenvalues. These eigenvectors have no natural ordering associated with them, so any one of them, or indeed any linear combination of them, has an equal claim to be used as a second scaffolding axis. Hence, our aim is to find the linear combination that satisfies some optimality criterion, with Type A and Type B orthogonality as natural candidates.
4.5 Optimal two-dimensional CVA biplot for \(J=2\) groups: optimality criterion based on Type B orthogonality
Consider minimizing
Now,
We have \({\varvec{G}}{\varvec{\hat{{\overline{X}}}}}= {\varvec{G}}{\varvec{{\overline{X}}}}\) if the eigenvector associated with \(\lambda >0\) is chosen. Therefore, maximizing the sum of within-group sample predictivities becomes equivalent to minimizing \(sum\left\{ diag\left( \left( {\varvec{X}}-\hat{{\varvec{X}}}\right) {\varvec{B}}{\varvec{B}}^{T}\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) ^{T}\right) \right\}\). However, this sum remains constant for any additional eigenvector. This is not surprising: in the canonical space the constraint \({\varvec{B}}^{T}{\varvec{S}}_{W}{\varvec{B}} = {\varvec{I}}\) implies constant variation in all dimensions, so Type B orthogonality is not useful for defining a criterion for an optimal two-dimensional biplot.
4.6 Optimal two-dimensional CVA biplot for \(J=2\) groups: optimality criterion based on Type A orthogonality
Since the matrix \({\varvec{B}}\) is non-singular it can be eliminated from the equation defining Type A orthogonality. As before, \(({\varvec{I}}-{\varvec{Q}})\left( {\varvec{X}}-\hat{{\varvec{X}}}\right) =\left( {\varvec{X}}-\hat{{\varvec{X}}}\right)\). Therefore, the proposed optimality criterion for constructing an optimal two-dimensional CVA biplot for two groups is the total squared reconstruction error for samples:
A similar measure of the goodness of approximations of the means can be defined as the total squared reconstruction error for means:
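Both criteria are simple functions of the reconstruction \(\hat{{\varvec{X}}}={\varvec{X}}{\varvec{B}}_{r}{\varvec{B}}^{(r)}\) of (16). A toy sketch for \(J=2\), confirming that the means are exact in one dimension (TSREM of zero) while the individual samples are not (positive TSRES):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n1, n2 = 4, 20, 25                       # J = 2 groups, toy data
n = n1 + n2
X = rng.normal(size=(n, p))
X[:n1] += 1.5
X = X - X.mean(axis=0)

G = np.zeros((n, 2)); G[:n1, 0] = 1.0; G[n1:, 1] = 1.0
Ng = G.T @ G
Xbar = np.linalg.inv(Ng) @ G.T @ X
S_B = Xbar.T @ Ng @ Xbar
S_W = X.T @ X - S_B
Linv = np.linalg.inv(np.linalg.cholesky(S_W))   # Cholesky whitening of S_W
lam, V = np.linalg.eigh(Linv @ S_B @ Linv.T)
B = Linv.T @ V[:, ::-1]                      # descending eigenvalue order
Binv = np.linalg.inv(B)

def tsres(r):
    """Total squared reconstruction error for the samples in an r-dim biplot."""
    Xhat = X @ B[:, :r] @ Binv[:r, :]
    return np.sum((X - Xhat) ** 2)

def tsrem(r):
    """Total squared reconstruction error for the group means."""
    Xbar_hat = Xbar @ B[:, :r] @ Binv[:r, :]
    return np.sum((Xbar - Xbar_hat) ** 2)

# With J = 2 the means are exact in one dimension (TSREM = 0),
# while the individual samples generally are not (TSRES > 0).
print(np.isclose(tsrem(1), 0.0), tsres(1) > 0.0, tsres(p) < 1e-8)
```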
5 Optimal CVA biplots when the number of groups is less than or equal to the number of variables
From now on we consider the case where \(J < p + 1\). Then it follows that the canonical means matrix \(\overline{{\varvec{Y}}}\) is of the form
(see e.g., Flury, 1997, p. 491).
Hence, the canonical transformation optimally separates the p-vectors
\({\overline{y}}^{T}_{1},{\overline{y}}^{T}_{2}, \ldots , {\overline{y}}^{T}_{J}\) in \(J-1\) dimensions while all differences among them vanish in dimensions \(J,J+1, \ldots ,p\). It follows that \({\mathcal {V}}(\overline{{\varvec{Y}}}^{T})\) is now of dimension \(J-1\) and it is thus possible to consider a biplot space of dimension \(r>J-1\). The resulting r-dimensional CVA biplot is not uniquely defined because the first \(J-1\) columns of \({\varvec{B}}: p \times p\) appended with any set of \(r-J+1\) of its remaining columns will result in a biplot where the canonical means are exactly represented. Therefore, \(\overline{{\varvec{x}}}^{T}_{j}{\varvec{B}}=\left[ k_{j1} \ldots k_{j(J-1)} 0 \ldots 0 \right]\) for \(j=1, 2, \ldots ,J\) with the predicted value for \(\overline{{\varvec{x}}}_{j}\) given by the jth row, \(\hat{\overline{{\varvec{x}}}}_{j}^{T}\), of (17). It follows that
Therefore, in the above r-dimensional CVA biplot the canonical means are exactly represented, resulting in \(TSREM = 0\). For a sample \({\varvec{x}}^{T}_{i}\), we have the ith row of \({\varvec{X}}\),
since in general \({\varvec{x}}^{T}_{i}{\varvec{b}}_{j}\) is not zero for all \(j=J,J+1, \ldots , p\) so that \(TSRES>0\).
This leaves us with the challenge to construct scaffolding axes \(j=J,J+1, \ldots , r\) in addition to those contained in \({\varvec{B}}_{J-1}\) so that TSRES is minimized without sacrificing what we already have for the class means in \(J-1\) dimensions.
Possible candidates for the additional scaffolding axes are any \(r-J+1\) of the vectors \({\varvec{b}}_{J}, {\varvec{b}}_{J+1}, \ldots , {\varvec{b}}_{p}\). All these vectors are associated with the zero eigenvalues (diagonal elements of \({\varvec{\varLambda }}\)). Therefore, there is no natural ordering of them as is the case with the \(J-1\) eigenvectors associated with the non-zero eigenvalues. Furthermore, since \(\overline{{\varvec{X}}}{\varvec{b}}_{i}={\varvec{0}}\) for \(i=J, J+1, \ldots , p\) it follows that \(\overline{{\varvec{X}}}{\varvec{d}}={\varvec{0}}\) where \({\varvec{d}}\) is any linear combination of the vectors \({\varvec{b}}_{J}, {\varvec{b}}_{J+1}, \ldots , {\varvec{b}}_{p}\). A similar result will hold for any set of basis vectors of the vector space generated by the columns of the matrix
so that \(\overline{{\varvec{X}}}{\varvec{B}}^{*}={\varvec{0}}\).
Therefore, a set of \(r-J+1\) linear independent vectors of the form \({\varvec{d}}\) where \({\varvec{d}}\) is a linear combination of any basis of \({\mathcal {V}}({\varvec{B}}^{*})\) is needed such that the scaffolding vectors consisting of the first \(J-1\) columns of \({\varvec{B}}\) together with the \(r-J+1\) \({\varvec{d}}\) vectors minimize TSRES for all legitimate choices of the \(\{{\varvec{d}}\}\). Write these r scaffolding vectors as the columns of the matrix \({\varvec{D}}_{r}\) i.e.,
and let the columns of \({\varvec{D}}: p\times p =\left[ {\varvec{D}}_{r}, {\varvec{d}}_{r+1}, \ldots , {\varvec{d}}_{p}\right]\) represent a basis of \({\mathcal {V}}({\varvec{B}})\). It follows that any column of \({\varvec{B}}\) can be written as a linear combination of the columns of \({\varvec{D}}\). Therefore, there exists a non-singular matrix \({\varvec{C}}: p\times p\) such that \({\varvec{B}}={\varvec{DC}}\) i.e., \({\varvec{D}}={\varvec{BF}}\) with \({\varvec{F}}={\varvec{C}}^{-1}\).
Straightforward algebraic manipulation shows that \({\varvec{F}}\) is of the form
where \({\varvec{F}}^{*}\) is an \((p-J+1)\times (p-J+1)\) orthogonal matrix. We provide a detailed derivation as supplementary material. Write
Then, \({\varvec{F}}^{-1}=\begin{bmatrix} {\varvec{I}} &{} {\varvec{0}} \\ {\varvec{0}} &{} ({\varvec{F}}^{*})^{T} \end{bmatrix}\) and our scaffolding vectors for constructing the r-dimensional CVA biplot are the columns of
Furthermore,
The approximations of the rows of \({\varvec{X}}\), i.e., the original samples, in the biplot constructed on the scaffolding provided by the columns of \({\varvec{D}}_{r}\) follow from (16) and using (25) and (26) as
where, from the orthogonality of \({\varvec{F}}^{*}\), it follows that \(({\varvec{f}}^{*}_{1})^{T}{\varvec{f}}^{*}_{1}=({\varvec{f}}^{*}_{2})^{T}{\varvec{f}}^{*}_{2} = \cdots =({\varvec{f}}^{*}_{r-J+1})^{T}{\varvec{f}}^{*}_{r-J+1}=1\).
The criterion TSRES now becomes
To construct an r-dimensional CVA biplot satisfying our aim of minimizing TSRES while simultaneously providing \(100\%\) accurate predictions for the J sample means when \(J < p\), we propose the following:
-
Find the solution of
$$\begin{aligned} argmin\left\| {\varvec{X}}\left( {\varvec{I}}_{p}-{\varvec{B}} {\varvec{L}} {\varvec{B}}^{-1}\right) \right\| ^{2}, \end{aligned}$$ (29)
where \({\varvec{L}} = \begin{bmatrix} {\varvec{I}}_{J-1} &{} {\varvec{0}} \\ {\varvec{0}}^{T} &{} {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T}+ \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}\end{bmatrix}\) and the minimum is taken with respect to the \({\varvec{f}}^{*}_{j}\) such that \(({\varvec{f}}^{*}_{j})^{T}{\varvec{f}}^{*}_{j}=1\) for \(j=1, 2, \ldots , r-J+1\).
-
Use the optimum \(\{{\varvec{f}}^{*}_{1}, {\varvec{f}}^{*}_{2}, \ldots , {\varvec{f}}^{*}_{r-J+1}\}\) to construct \(({\varvec{B}}_{opt})_{r} = \left[ {\varvec{b}}_{1}, \ldots ,{\varvec{b}}_{J-1}, {\varvec{d}}_{J}, \ldots ,{\varvec{d}}_{r}\right]\) where
$$\begin{aligned} {\varvec{d}}_{j}= & {} {\varvec{B}}^{*}({\varvec{f}}_{j-J+1}^{*})_{opt} \\= & {} {\varvec{B}}\begin{bmatrix} {\varvec{0}} \\ ({\varvec{f}}_{j-J+1}^{*})_{opt} \end{bmatrix} \\= & {} f^{opt}_{J(j-J+1)}{\varvec{b}}_{J}+f^{opt}_{(J+1)(j-J+1)}{\varvec{b}}_{J+1} + \cdots + f^{opt}_{p(j-J+1)}{\varvec{b}}_{p}, \end{aligned}$$for \(j=J, J+1, \ldots , r\).
-
Next, \(({\varvec{B}}_{opt})_{r}\) is used for constructing the r-dimensional CVA biplot with calibrated prediction (or interpolation) axes.
-
Finally, calculate a standardised form of min(TSRES): \(\frac{min(TSRES({\varvec{X}}, \hat{{\varvec{X}}}))}{tr({\varvec{X}}{\varvec{X}}^{T})}\), as a measure of the accuracy of the approximations of the individual sample points in the r-dimensional biplot.
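The standardised criterion in the final step is simple to compute directly. The following is a minimal sketch in Python/NumPy, under the assumption that TSRES is the total sum of squared residuals \(\Vert {\varvec{X}}-\hat{{\varvec{X}}}\Vert ^{2}\), and noting that \(tr({\varvec{X}}{\varvec{X}}^{T})\) equals the squared Frobenius norm of \({\varvec{X}}\); the function name is ours:

```python
import numpy as np

def standardized_tsres(X, X_hat):
    """Standardised TSRES: ||X - X_hat||^2 / tr(X X^T).

    tr(X X^T) is the squared Frobenius norm of X, so both numerator and
    denominator reduce to elementwise sums of squares.
    """
    return np.sum((X - X_hat) ** 2) / np.sum(X ** 2)
```

A value of 0 indicates an exact representation of the samples, while a value of 1 corresponds to approximating every sample by the origin.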
The solution for (29) can be found from (28) as follows: since \({\varvec{B}}^{T}{\varvec{S}}_{W}{\varvec{B}} = {\varvec{I}}_{p}\), it follows that
establishing that
Since \({\varvec{B}}= \begin{bmatrix} {\varvec{B}}_{J-1}&{\varvec{B}}^{*} \end{bmatrix}\) is non-singular, we set
Write \({\varvec{U}}: (p-J+1)\times (p-J+1) = {\varvec{f}}^{*}_{1}({\varvec{f}}^{*}_{1})^{T}+{\varvec{f}}^{*}_{2}({\varvec{f}}^{*}_{2})^{T}+ \cdots + {\varvec{f}}^{*}_{r-J+1}({\varvec{f}}^{*}_{r-J+1})^{T}\); then it follows that
From \(\begin{bmatrix} {\varvec{B}}_{J-1}&{\varvec{B}}^{*} \end{bmatrix} \begin{bmatrix} {\varvec{B}}^{(J-1)} \\ {\varvec{B}}^{(2)}\end{bmatrix} = {\varvec{I}}_{p}\) it follows that \({\varvec{B}}_{J-1}{\varvec{B}}^{(J-1)}={\varvec{I}}_{p}-{\varvec{B}}^{*}{\varvec{B}}^{(2)}\) so that (31) can be written as
Therefore,
by substituting (30) in (33). Write \({\varvec{H}} = {\varvec{B}}^{(2)}({\varvec{B}}^{(2)})^{T}: (p-J+1)\times (p-J+1)\); then \({\varvec{H}}\) has rank \(p-J+1\) and is thus positive definite. Therefore,
where
and
Since \(({\varvec{f}}^{*}_{i})^{T}{\varvec{f}}^{*}_{j}\; =\; \left\{ \begin{array}{ll} 1 &{} \text{ if } i=j \\ 0 &{} \text{ if } i\ne j \end{array} \right.\) it follows that
Substituting (36) and (37) into (35) leads to
Remembering that \({\varvec{H}}\) is positive definite, the criterion (38) can be minimized by maximizing each of the terms \(({\varvec{f}}^{*}_{j})^{T}{\varvec{H}}{\varvec{f}}^{*}_{j}\) with respect to \({\varvec{f}}^{*}_{j}\) for \(j=1, 2, \ldots , r-J+1\) under the constraint \(({\varvec{f}}^{*}_{j})^{T}{\varvec{f}}^{*}_{j}=1\). This is readily accomplished by introducing the Lagrange multiplier \(\lambda _{j}\) to form
$$\begin{aligned} ({\varvec{f}}^{*}_{j})^{T}{\varvec{H}}{\varvec{f}}^{*}_{j}-\lambda _{j}\left( ({\varvec{f}}^{*}_{j})^{T}{\varvec{f}}^{*}_{j}-1\right) . \end{aligned}$$(39)
Differentiating (39) with respect to \({\varvec{f}}^{*}_{j}\) and equating to zero leads to \({\varvec{H}}{\varvec{f}}^{*}_{j}=\lambda _{j}{\varvec{f}}^{*}_{j}\), i.e., \(\lambda _{j}=({\varvec{f}}^{*}_{j})^{T}{\varvec{H}}{\varvec{f}}^{*}_{j}\). Thus \(({\varvec{f}}^{*}_{j})^{T}{\varvec{H}}{\varvec{f}}^{*}_{j}\) is maximized when \({\varvec{f}}^{*}_{j}\) is a normalized eigenvector associated with the jth largest eigenvalue of \({\varvec{H}}\), and hence the criterion (38) attains its minimum when the \(\big \{{\varvec{f}}^{*}_{j}\big \}\) are set to the normalized eigenvectors associated with the largest \(r-J+1\) eigenvalues of \({\varvec{H}} = {\varvec{B}}^{(2)}({\varvec{B}}^{(2)})^{T}\), respectively. Denote these eigenvectors by \({\varvec{f}}_{1}^{opt}, {\varvec{f}}_{2}^{opt}, \ldots , {\varvec{f}}_{r-J+1}^{opt}\), where \(r\le p\); then a \(p\times r\) matrix \(({\varvec{B}}_{opt})_{r}\) can be constructed as
Setting \(r = p\) in the above leads to a matrix \({\varvec{B}}_{opt}\) of size \(p \times p\) which is non-singular, allowing for the computation of the matrices \(({\varvec{B}}_{opt})_{r}\), consisting of the first r columns of \({\varvec{B}}_{opt}\), and \(({\varvec{B}}_{opt})^{(r)}\), consisting of the first r rows of \(({\varvec{B}}_{opt})^{-1}\). Therefore
will minimize TSRES.
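The entire construction above thus reduces to a single symmetric eigendecomposition. A hypothetical Python/NumPy sketch follows, assuming \({\varvec{B}}\) is the non-singular \(p\times p\) canonical transformation and \({\varvec{B}}^{(2)}\) denotes the last \(p-J+1\) rows of \({\varvec{B}}^{-1}\); all function and variable names are ours:

```python
import numpy as np

def optimal_scaffolding(B, J, r):
    """Columns of (B_opt)_r: the first J-1 canonical columns of B, padded
    with d_j = B* f_j^opt, where the f_j^opt are the leading normalized
    eigenvectors of H = B^(2) (B^(2))^T."""
    B_inv = np.linalg.inv(B)
    B2 = B_inv[J - 1:, :]                    # B^(2): last p-J+1 rows of B^{-1}
    H = B2 @ B2.T                            # (p-J+1) x (p-J+1), positive definite
    eigval, eigvec = np.linalg.eigh(H)       # symmetric decomposition, ascending
    F_opt = eigvec[:, ::-1][:, :r - J + 1]   # top r-J+1 normalized eigenvectors
    D = B[:, J - 1:] @ F_opt                 # d_J, ..., d_r = B* f_j^opt
    return np.hstack([B[:, :J - 1], D])      # (B_opt)_r : p x r
```

For the two-group case (\(J=2\), \(r=2\)) this returns the single canonical direction together with the one optimal additional scaffolding column.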
6 An alternative biplot based on the Bhattacharyya distance for the two sample case
If, analogously to section 3, the population parameters in (5) are replaced with their sample estimates, the sample version of the Bhattacharyya distance consists of two terms. The first measures the dissimilarity between the two sample means and the second measures the dissimilarity between the two sample covariance matrices. Fukunaga (1990) uses this property to construct a second dimension for a visual display of the rows of a data matrix \({\varvec{X}}: n\times p\) in two dimensions. Furthermore, Hennig (2004) utilizes the Bhattacharyya distance, among other methods, to construct two-dimensional visualizations of the dispersions of two asymmetric samples, asymmetric in the sense that one is known to be more homogeneous and the other more heterogeneous. However, it should be noted that none of the visualizations proposed by Fukunaga (1990) or Hennig (2004) are biplots, because no information regarding the columns of \({\varvec{X}}\) is displayed. The optimal two-group CVA biplot proposed in section 5 assumes equality of covariance matrices, as is usual for CVA. This implies that the second term of the sample version of (5) vanishes and optimization of (5) becomes equivalent to maximizing (9).
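For reference, the standard form of the Bhattacharyya distance between two p-variate normal populations \(N_{p}({\varvec{\mu }}_{1},{\varvec{\varSigma }}_{1})\) and \(N_{p}({\varvec{\mu }}_{2},{\varvec{\varSigma }}_{2})\) (presumably the form of (5), which is not reproduced in this excerpt) is
$$\begin{aligned} D_{B} = \tfrac{1}{8}({\varvec{\mu }}_{1}-{\varvec{\mu }}_{2})^{T} \left( \frac{{\varvec{\varSigma }}_{1}+{\varvec{\varSigma }}_{2}}{2}\right) ^{-1} ({\varvec{\mu }}_{1}-{\varvec{\mu }}_{2}) + \tfrac{1}{2}\ln \frac{\det \left( \frac{{\varvec{\varSigma }}_{1}+{\varvec{\varSigma }}_{2}}{2}\right) }{\sqrt{\det {\varvec{\varSigma }}_{1}\det {\varvec{\varSigma }}_{2}}}. \end{aligned}$$
The first term vanishes when the two means coincide and the second when the two covariance matrices coincide, matching the two-term decomposition described above.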
Denote the transformation to the canonical space \({\mathcal {C}}\) by \({\varvec{Y}} ={\varvec{XM}}\) with \({\varvec{M}}:p \times p\). We can write
where \({\mathcal {C}}_{1}\) is one-dimensional based on \({\varvec{m}}_{1}\) and \({\mathcal {V}}\) is \((p-1)\)-dimensional with basis \({\varvec{m}}_{2}, {\varvec{m}}_{3}, \ldots , {\varvec{m}}_{p}\). In \({\mathcal {V}}\) we have
with \({\varvec{\mu }}^{*T}=[\mu _{2}^{(Y)}, \mu _{3}^{(Y)}, \ldots , \mu _{p}^{(Y)}], {\varvec{\varSigma }}_{1}^{*}={\varvec{M}}^{*T}{\varvec{\varSigma }}_{1}{\varvec{M}}^{*}, {\varvec{\varSigma }}_{2}^{*}={\varvec{M}}^{*T}{\varvec{\varSigma }}_{2}{\varvec{M}}^{*}\) and \({\varvec{M}}^{*}:p \times (p-1) =[{\varvec{m}}_{2},{\varvec{m}}_{3}, \ldots , {\varvec{m}}_{p} ]\).
Since \({\varvec{\mu }}_{1}^{*} ={\varvec{\mu }}_{2}^{*}={\varvec{\mu }}^{*}\) (say), we have from (5)
Maximization of the sample version of (42) proceeds parallel to the process described in section 2.3. The outcome is the matrix \({\varvec{A}}: (p-1) \times (p-1)\) containing as columns the required eigenvectors arranged in decreasing order of the values of \(\lambda _{i}^{*}+\frac{1}{\lambda _{i}^{*}}+2\).
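The ordering above can be sketched as follows, assuming (as in the simultaneous diagonalization underlying (42)) that the \(\lambda _{i}^{*}\) are the eigenvalues of \(({\varvec{\varSigma }}_{1}^{*})^{-1}{\varvec{\varSigma }}_{2}^{*}\); a hypothetical Python/NumPy sketch with names of our choosing:

```python
import numpy as np

def bhattacharyya_directions(S1, S2):
    """Eigenvectors of S1^{-1} S2 arranged in decreasing order of
    lambda + 1/lambda + 2, i.e. the columns of the matrix A."""
    eigval, eigvec = np.linalg.eig(np.linalg.solve(S1, S2))
    # For positive definite S1 and S2 the spectrum is real and positive.
    eigval, eigvec = eigval.real, eigvec.real
    order = np.argsort(eigval + 1.0 / eigval + 2.0)[::-1]
    return eigvec[:, order], eigval[order]
```

Note that the score \(\lambda +1/\lambda +2\) is symmetric in \(\lambda\) and \(1/\lambda\), so directions along which one group is much more dispersed than the other rank highly regardless of which group is the more dispersed.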
A 2D biplot can now be constructed using the methods described in sections 3 and 4 by first noting that the matrix \({\varvec{M}}\) is available as the matrix \({\varvec{B}}\) of section 3. Next, we calculate the matrix
and its inverse \({\varvec{K}}^{-1}\). Let \({\varvec{a}}: (p-1) \times 1\) denote the first column of \({\varvec{A}}\), then the 2D biplot can be constructed as described in section 4 using the rows of \({\varvec{Z}}:n\times 2\), where
for plotting the samples. The variables are represented by p calibrated biplot axes constructed from
where \({\varvec{K}}^{(2)}\) denotes the first 2 rows of \({\varvec{K}}^{-1}\).
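The sample coordinates for this display can be sketched as follows. The defining equation for \({\varvec{Z}}\) is not reproduced in this excerpt, so we assume, as the construction suggests, that the two plotting directions are \({\varvec{m}}_{1}\) and \({\varvec{M}}^{*}{\varvec{a}}\); the function name is ours:

```python
import numpy as np

def bhattacharyya_scores(X, M, a):
    """Z: n x 2 plotting coordinates -- the first canonical variate X m_1
    together with the leading Bhattacharyya direction X M* a taken in the
    (p-1)-dimensional complement space."""
    z1 = X @ M[:, 0]          # scores on m_1
    z2 = X @ (M[:, 1:] @ a)   # scores on the direction M* a
    return np.column_stack([z1, z2])
```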
7 Examples
As an example, we consider the Vertebral Column Data Set from the UCI Machine Learning Repository (Barreto et al. 2011) and discussed in detail by da Rocha Neto et al. (2011). The full data set contains measurements on six continuous/numeric variables relating to the shape and orientation of the pelvis and lumbar spine for each of the 310 individuals (samples). These samples are classified as normal, disk hernia, or spondylolisthesis patients. As an example of a two-group CVA, we study the subset of 60 disk hernia and 150 spondylolisthesis patients. The six numeric variables are Pelvic incidence, Pelvic tilt, Lumbar lordosis angle, Sacral slope, Pelvic radius, and Degree spondylolisthesis.
Table 1 contains the group means for each of the six variables, with the outlier (see Fig. 2) included and excluded. The presence of the outlier is hard to detect from the table but, as is evident from Figs. 1 and 2, it is clearly revealed in a CVA biplot.
The one-dimensional biplot of this two-group data set is shown in Fig. 1. In this biplot, the two group means as well as all six variables (in the form of six calibrated axes) lie on the single scaffolding line defined by the singular vector associated with the single non-zero singular value of the underlying two-sided eigenequation. The six calibrated axes representing the variables have been vertically translated to aid the interpretation of the biplot. The values of the group means for all six variables can easily be read from the biplot axes, and it can be verified that these values coincide exactly with the corresponding values in Table 1. As expected, all variables have predictivities of 100% for determining the mean values, with \(TSREM = 0\). All the individual sample points have been interpolated onto the biplot and therefore they also lie on the single scaffolding axis defining the biplot. However, \(TSRES > 0\), with a standardized version \(TSRES/tr({\varvec{X}}{\varvec{X}}^{T}) = 0.4702\). To visualize the within groups variation, the biplot has been enhanced by the addition of density estimates of the two sets of sample points interpolated onto the one-dimensional CVA biplot. These density estimates show graphically the separation/overlap of the two groups. Inspection of the six biplot axes suggests V5 to be negatively correlated with the other five variables; the latter are all pairwise positively correlated.
It is clear that much can be learned from the one-dimensional CVA biplot, but several serious issues call for considering a second scaffolding axis:
-
Is the conspicuous outlier an outlier on all variables?
-
Can the approximation of the sample points be improved without sacrificing what has been achieved with the mean vectors?
-
Is it possible to construct a more detailed visualization of the separation/overlap of the two groups?
-
Is it possible to construct a more accurate and detailed visualization of the correlation structure of the six variables?
Figure 2 provides an answer to the above issues. The optimal two-dimensional CVA biplot in Fig. 2 demonstrates the following:
-
The biplot is uniquely defined.
-
Each of the six biplot axes has a predictivity of \(100\%\) for determining the values of the two group means for all variables. This results in TSREM remaining zero. Figure 2 provides a clearer picture of how the predictivities are determined.
-
Introducing the second scaffolding axis improves the approximation of the samples appreciably: the standardized TSRES decreases to 0.1799 (a decrease of more than 60% from the corresponding value obtained in Fig. 1).
-
The second scaffolding axis is optimal in the sense that no other scaffolding axis can be found that improves TSRES while restricting TSREM to zero.
-
It is clear that Sample 116 is less of an outlier on V3 and V4 than on V2 and V6.
-
There is a suggestion that, while V5 appears to be negatively correlated with V3 and V4, it appears to be positively correlated with V2 and V6. We note that the addition of a second scaffolding axis provides angles between the biplot axes, which allow for visualizing the approximate correlational structure between the variables.
The biplot in Fig. 2 has been enhanced by superimposing 95%-bags. Alpha-bags are discussed in detail by Gower et al. (2011). The 95%-bag used here contains the innermost 95% of the bivariate sample points, where innermost is defined relative to the Tukey median (Ruts and Rousseeuw 1996). Now we are ready for a detailed appraisal of the overlap/separation of the two groups based on the two-dimensional clouds of points visualizing the within groups sample variation. However, we first exclude Sample 116 from the analysis and consider the optimal two-dimensional CVA biplot given in Fig. 3, overlaid with 95%-bags. This figure shows:
-
Clearly how the biplot axes are used to determine the group means exactly for each variable.
-
The angles between the biplot axes allow for an approximate visual appraisal of the correlation structure.
-
It is seen that the two 95%-bags almost touch each other but do not overlap, giving an overall quantitative measure of the degree of separation between the two groups.
-
The standardized TSRES value is 0.2155.
-
Although the two clouds of points have a high degree of separation, it can also be seen that
-
there is a high degree of overlap between the two groups concerning V2 and V5;
-
there is almost no overlap on V1, V3 and V6;
-
on V4 large Disk Hernia values overlap with small Spondylolisthesis measurements, while small measurements of V4 almost exclusively occur in Disk Hernia.
The CVA biplots in Figs. 2 and 3 assume equal covariance matrices. We can now relax this assumption and construct, in Fig. 4, the biplot based on the Bhattacharyya distance as discussed in section 6.
Although the appearance of this biplot is quite similar to that of the corresponding optimal CVA biplot shown in Fig. 3 its standardized TSRES value is approximately 10\(\%\) higher, namely 0.2367.
8 Conclusions
CVA biplots are constructed to show in a single plot the group means as points and all the variables as calibrated linear biplot axes. In the case of two groups the CVA biplot becomes a line containing all these points and biplot axes. Furthermore, there is no approximation in the positions of the group means and their respective values, which can be determined exactly from the biplot axes. It is common practice to interpolate the individual sample points onto the CVA biplot as well, but then they appear as approximations in the one-dimensional CVA biplot space. Since all the biplot axes lie on top of each other, it is difficult to use them for determining the values of the two means for the different variables. However, the vertical translation of one of these axes does not affect the values of the two means for that particular variable. Therefore, as has been shown in Fig. 1, vertical translation of the biplot axes does not change the dimensionality of the CVA biplot but increases the usefulness of the different biplot axes for reading off values of the respective variables.
The fundamental question that is addressed in this paper is: What can be gained, if anything, by increasing the dimensionality of the above one-dimensional CVA biplot to two dimensions? This question can be rephrased as: Can we add a second dimension to our one-dimensional CVA biplot to improve the approximation of the individual sample points, leading to a better understanding of the within groups variability, while leaving unchanged the optimal representation of the two group means? As it turned out, the addition of a second dimension is not a straightforward process, since there are infinitely many ways in which this can be done. Furthermore, if existing software is used for constructing a two-dimensional CVA biplot in the two-group case, the result can be highly misleading. This is because the two-sided eigenequation underlying the CVA procedure has only a single non-zero singular value, with no natural ordering of the zero singular values, resulting in the indeterminacy of singular vectors associated with zero singular values. Therefore, to guarantee a unique solution for the second dimension, we had to consider a criterion that optimizes the approximation of the sample points while leaving the optimal representation of the group means unchanged. We suggested the TSRES criterion for meeting this goal. Minimizing TSRES results in a uniquely defined two-dimensional CVA biplot for the two-group case. It optimizes the approximations of the within-group samples while the two group means are exactly represented, with the linear biplot axes providing exact values for both groups on all variables.
Figure 3 shows that our proposed optimal two-dimensional CVA biplot for two groups has the potential to provide the researcher with a tool that not only distinguishes the group means optimally but also where the within sample variation can be depicted graphically to gain deeper insight into the separation/overlap of the two groups. Moreover, it is unique and thus prevents any possibility of ambiguity when routinely using existing software. Thus we have attained our primary objective as is illustrated in the example discussed above.
The algebra underlying the optimal two-dimensional CVA biplot extends directly to an optimal three-dimensional biplot when dealing with a three-group case having a CVA biplot space of dimension two.
An alternative suggestion based on the Bhattacharyya distance is available when relaxing the equal within-group covariance matrix assumption. As can be seen from Fig. 4, the biplot is slightly different, but the overall conclusion regarding overlap and separation in terms of the individual variables remains unchanged. However, the biplot based on the Bhattacharyya distance is designed to optimize a different objective than the sum of the squared approximation errors of the data matrix. Therefore, our preferred 2D biplot to construct in the two-group case is the proposed optimal CVA biplot.
Finally, we note that our R code for constructing the biplots discussed in this paper is available in Lubbe et al. (2023).
References
Barreto GdA, da Rocha Neto AR, da Mota Filho HAF (2011) Vertebral column data set. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7:179–188
Flury B (1997) A first course in multivariate statistics. Springer-Verlag, New York
Flury L, Boukai B, Flury BD (1997) The discrimination subspace model. J American Stat Assoc 92:758–766
Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd edn. Academic Press, Boston
Gabriel KR (1971) The biplot graphical display of matrices with application to principal component analysis. Biometrika 58:451–467
Gabriel KR (1972) Analysis of meteorological data by means of canonical decomposition and biplots. J Appl Meteorol 11:1071–1077
Gardner-Lubbe S, Le Roux NJ, Gower JC (2008) Measures of fit in principal component and canonical variate analyses. J Appl Stat 35:947–965
Gittins R (1985) Canonical analysis. Springer-Verlag, Berlin
Gower JC (1995) A general theory of biplots. In: Krzanowski WJ (ed) Recent advances in descriptive multivariate analysis. Clarendon Press, Oxford, pp 283–303
Gower JC, Hand DJ (1996) Biplots. Chapman & Hall, London
Gower JC, Le Roux NJ, Lubbe S (2011) Understanding biplots. Wiley, Chichester
Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning. Springer-Verlag, New York
Hennig C (2004) Asymmetric linear dimension reduction for classification. J Comput Graph Stat 13:930–945
Lubbe S, Le Roux N, Nienkemper-Swanepoel J, Ganey R, Van der Merwe C (2023) biplotEZ: EZ-to-Use Biplots. https://CRAN.R-project.org/package=biplotEZ, R package version 1.2.0
McLachlan GJ (1992) Discriminant analysis and statistical pattern recognition. John Wiley, New York
Rao CR (1948) The utilization of multiple measurements in problems of biological classification. J Royal Stat Soc, Series B 10:159–193
da Rocha Neto AR, Sousa R, Barreto GdA, Cardoso JS (2011) Diagnostic of pathology on the vertebral column with embedded reject option. In: Iberian Conference on Pattern Recognition and Image Analysis, Springer, pp 588–595
Ruts I, Rousseeuw PJ (1996) Computing depth contours of bivariate point clouds. Comput Stat Data Anal 23:153–168
Tukey JW (1975) Mathematics and the picturing of data. Proc Int Congress Math 2:523–531
Acknowledgements
We are grateful to the comments made by an anonymous reviewer and the review editor that have improved the quality of this manuscript.
Funding
Open access funding provided by Stellenbosch University. This work is based upon research supported in part by the National Research Foundation (NRF) of South Africa (Grant Number 103310). Any opinions, findings and conclusions, or recommendations expressed in this material are those of the authors, and therefore the NRF does not accept any liability in regard thereof.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Supplementary Information
Cite this article
le Roux, N., Gardner-Lubbe, S. A two-group canonical variate analysis biplot for an optimal display of both means and cases. Adv Data Anal Classif (2024). https://doi.org/10.1007/s11634-024-00593-7