15.1. Introduction

We will employ the same notations as in the previous chapters. Lower-case letters x, y, … will denote real scalar variables, whether mathematical or random. Capital letters X, Y, … will be used to denote real matrix-variate mathematical or random variables, whether square or rectangular matrices are involved. A tilde will be placed on top of letters such as \(\tilde {x},\tilde {y},\tilde {X},\tilde {Y}\) to denote variables in the complex domain. Constant matrices will for instance be denoted by A, B, C. A tilde will not be used on constant matrices unless the point is to be stressed that the matrix is in the complex domain. The determinant of a square matrix A will be denoted by |A| or det(A) and, in the complex case, the absolute value or modulus of the determinant of A will be denoted as |det(A)|. When matrices are square, their order will be taken as p × p, unless specified otherwise. When A is a full rank matrix in the complex domain, then AA* is Hermitian positive definite, where an asterisk designates the complex conjugate transpose of a matrix. Additionally, dX will indicate the wedge product of all the distinct differentials of the elements of the matrix X. Thus, letting the p × q matrix X = (x ij) where the x ij’s are distinct real scalar variables, \(\text{d}X=\wedge _{i=1}^p\wedge _{j=1}^q\text{d}x_{ij}\). For the complex matrix \(\tilde {X}=X_1+iX_2,\ i=\sqrt {-1}\), where X 1 and X 2 are real, \(\text{d}\tilde {X}=\text{d}X_1\wedge \text{d}X_2\).

15.1.1. Clusters

A cluster means a group or a cloud of items close together with reference to one or more characteristics. For instance, in a countryside, there are villages which are clusters of houses. In a city, there are clusters of high-rise buildings or clusters of apartment blocks. If we have 2-dimensional data points marked on a sheet of paper, then there may be several places where the points are grouped together in large crowds, at other places the points may be bunched together in smaller clumps and somewhere else, there may be singleton points. In a classification problem, we have a number of preassigned populations and we want to assign a point at hand to one of those populations. This cannot be achieved in the context of cluster analysis as we do not know beforehand how many clusters there are in the data at hand or which data point belongs to which cluster. Cluster analysis is akin to pattern recognition whereas classification is a sort of taxonomy. Suppose that a new plant is to be classified as belonging to one of the known species of plants; if it does not fall into any of the known species, then we have a member from a new species. In cluster analysis, we are, in a manner of speaking, going to create various ‘species’. To start with, we have only a cloud of items and we do not know how many categories or clusters there exist.

Cluster analysis techniques are widely utilized in many fields such as psychiatry, sociology, anthropology, archeology, medicine, criminology, engineering and geology, to mention only a few areas. If real scalar variables are to be classified as belonging to a certain category, one way of achieving this is to ascertain their joint dispersion or joint variation as measured in terms of scale-free covariance or correlation. Those variables that are similarly correlated may be grouped together.

We will consider the problem of cluster analysis involving n points X 1, …, X n where each X j is a real p-dimensional vector, that is, we have a p × n data matrix

$$\displaystyle \begin{aligned} \mathbf{X}=[X_1,X_2,\ldots,X_n]=\left[\begin{array}{cccc}x_{11}&x_{12}&\cdots&x_{1n}\\ x_{21}&x_{22}&\cdots&x_{2n}\\ \vdots&\vdots&\ddots&\vdots\\ x_{p1}&x_{p2}&\cdots&x_{pn}\end{array}\right].{} \end{aligned} $$
(15.1.1)

15.1.2. Distance measures

Two real p-vectors are close together if the “distance” between them is small. Many types of distance measures can be defined. Let X r and X s be two real p-vectors. These are the r-th and s-th members or columns in the data matrix (15.1.1). Then, the following are some distance measures:

$$\displaystyle \begin{aligned}d_m(X_r,X_s)=\Big[\sum_{i=1}^p|x_{ir}-x_{is}|{}^m\Big]^{\frac{1}{m}}\,;\end{aligned}$$

for m = 2, we have the Euclidean distance \(d_2(X_r,X_s)=[\sum _{i=1}^p|x_{ir}-x_{is}|{ }^2]^{\frac {1}{2}}\), or, denoting \(d_2^2\) as d 2, we have

$$\displaystyle \begin{aligned} d^2(X_r,X_s)=\sum_{i=1}^p(x_{ir}-x_{is})^2,{} \end{aligned} $$
(15.1.2)

where the absolute value sign can be replaced by parentheses since we are dealing with real elements. We will utilize this convenient quantity d 2 for comparing observation vectors. There may be joint variation or covariances among the coordinates in each of the vectors, in which case Cov(X r) = Σ > O. If all the X j’s, j = 1, …, n, have the same covariance matrix, then Cov(X j) = Σ, j = 1, …, n, and a statistician might wish to consider the generalized distance between X r and X s, or its square, \(d^2_{(g)}(X_r,X_s)=(X_r-X_s)'\,\varSigma ^{-1}(X_r-X_s)\), the subscript g designating the generalized distance. Since Σ is unknown, we may wish to estimate it. However, if there are clusters, it may not be appropriate to make use of the entire data set of all n points, since the joint variation or the covariance within each cluster is likely to be different. And since we do not know beforehand whether clusters are present, securing a proper estimate of Σ proves problematic. As a result, this difficulty is usually circumvented by resorting to the ordinary Euclidean distance instead of the generalized distance.
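As a quick illustration of the two measures just discussed, the following minimal Python sketch computes the squared Euclidean distance of (15.1.2) and the squared generalized distance for a given Σ. The vectors and the covariance matrix used below are hypothetical placeholders, not data from this chapter.

```python
import numpy as np

def squared_euclidean(xr, xs):
    """Squared Euclidean distance d^2(X_r, X_s) as in (15.1.2)."""
    diff = np.asarray(xr, dtype=float) - np.asarray(xs, dtype=float)
    return float(diff @ diff)

def squared_generalized(xr, xs, sigma):
    """Squared generalized distance (X_r - X_s)' Sigma^{-1} (X_r - X_s),
    assuming a known (or separately estimated) common covariance matrix Sigma."""
    diff = np.asarray(xr, dtype=float) - np.asarray(xs, dtype=float)
    return float(diff @ np.linalg.solve(np.asarray(sigma, dtype=float), diff))

# Hypothetical 3-dimensional vectors and covariance matrix (illustration only)
Xr = [1.0, 0.0, -1.0]
Xs = [2.0, 1.0, 3.0]
Sigma = np.diag([1.0, 2.0, 4.0])
print(squared_euclidean(Xr, Xs))          # 18.0
print(squared_generalized(Xr, Xs, Sigma)) # 5.5 for this hypothetical Sigma
```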

Let us examine the effect of scaling a vector. If the unit of measurement in one vector is changed, what will be the effect on the squared distance? Consider the following vectors:

$$\displaystyle \begin{aligned}X_1=\left[\begin{array}{r}-1\\ 0\\ -2\end{array}\right],\ \ X_2=\left[\begin{array}{r}-3\\ 2\\ 4\end{array}\right],\ \ d^2(X_1,X_2)=(X_1-X_2)'(X_1-X_2)=44.\end{aligned}$$

The squared distances between the vectors when (1) X 1 is multiplied by 2; (2) X 2 is multiplied by 2; (3) X 1 and X 2 are each multiplied by 2, are

$$\displaystyle \begin{aligned} d^2(2X_1,X_2)&=(-2+3)^2+(0-2)^2+(-4-4)^2=69\\ d^2(X_1,2X_2)&=(-1+6)^2+(0-4)^2+(-2-8)^2=141\\ d^2(2X_1,2X_2)&=4[(X_1-X_2)'(X_1-X_2)]=4\times 44=176.\end{aligned} $$

Note that these squared distances are distorted since 69≠4(44) and 141≠4(44), whereas multiplying both vectors by 2 simply multiplies the squared distance by 4. Thus, the scaling of individual vectors can completely alter the nature of the clusters when there are clusters in the original data. As well, members of the original clusters need not be members of the same clusters in the scaled data and the number of clusters may also change. Accordingly, it is indeed inadvisable to make use of the generalized distance. Nor is re-scaling the individual vectors a good idea if we are seeking clusters. The recommended procedure thus consists of utilizing the original data without modifying them. It may also happen that the components in each p-vector are recorded in different units of measurement. How, then, can the location and scale effects on the components of each vector be eliminated? This can be achieved by standardizing them individually, that is, by subtracting the average value of the components from the components of each vector and dividing the result by the sample standard deviation. Let us see what happens in the case of our numerical example. Letting \(\bar {x}_1\) and \(\bar {x}_2\) be the averages of the components in X 1 and X 2, and \(s_1^2\) and \(s_2^2\) be the associated sums of products, we have

$$\displaystyle \begin{aligned} \bar{x}_1&=\frac{1}{3}[(-1)+(0)+(-2)]=-1, \ \bar{x}_2=\frac{1}{3}[(-3)+(2)+(4)]=1,\\ s^2_1&=\sum_{i=1}^p(x_{i1}-\bar{x}_1)^2=[(-1)-(-1)]^2+[(0)-(-1)]^2+[(-2)-(-1)]^2=2,\\ s^2_2&=\sum_{i=1}^p(x_{i2}-\bar{x}_2)^2=26.\end{aligned} $$

Thus, the standardized vectors X 1 and X 2, denoted by Y 1 and Y 2, are the following:

and d 2(Y 1, Y 2) = (Y 1 − Y 2)′(Y 1 − Y 2) = 7.6641. However, Y 1 and Y 2 are quite distorted versions of the original vectors, and the distance between X 1 and X 2 is also modified. Hence, such procedures will change the clustering configuration as well, with new clusters possibly differing from the original clusters.

Let us consider the matrix of squared distances, denoted by D, which is also referred to as the dissimilarity matrix:

$$\displaystyle \begin{aligned} D=(d^2_{ij})=\left[\begin{array}{cccc}0&d^2_{12}&\cdots&d^2_{1n}\\ d^2_{21}&0&\cdots&d^2_{2n}\\ \vdots&\vdots&\ddots&\vdots\\ d^2_{n1}&d^2_{n2}&\cdots&0\end{array}\right],\ \ d^2_{ij}=d^2_{ji},\ \ d^2_{ii}=0.{} \end{aligned} $$
(15.1.3)

For example, letting

we have \(\,d^2_{12}=(1-2)^2+(0-1)^2+(-1-3)^2=18\), \(d^2_{13}=11,\ d^2_{14}=14\), \(d^2_{23}=5,\ d^2_{24}=2\), \(d^2_{34}=9\), so that

$$\displaystyle \begin{aligned}D=\left[\begin{array}{rrrr}0&18&11&14\\ 18&0&5&2\\ 11&5&0&9\\ 14&2&9&0\end{array}\right].\end{aligned}$$
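For larger data sets, D can be computed mechanically, as in the following minimal Python sketch for squared Euclidean distances. The 3 × 4 data matrix below is a hypothetical choice whose pairwise squared distances happen to reproduce the values quoted above; the original vectors of this example are not displayed here.

```python
import numpy as np

def dissimilarity_matrix(X):
    """Matrix of squared Euclidean distances D = (d^2_{ij}) as in (15.1.3),
    the columns of X being the observation vectors X_1, ..., X_n."""
    X = np.asarray(X, dtype=float)
    diff = X[:, :, None] - X[:, None, :]          # p x n x n array of differences
    return np.einsum('pij,pij->ij', diff, diff)   # sum of squares over components

# Hypothetical 3 x 4 data matrix (columns are the vectors)
X = np.array([[ 1, 2, 0, 3],
              [ 0, 1, 1, 1],
              [-1, 3, 2, 2]], dtype=float)
print(dissimilarity_matrix(X).astype(int))
# [[ 0 18 11 14]
#  [18  0  5  2]
#  [11  5  0  9]
#  [14  2  9  0]]
```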

The question of interest is the following: Given a set of n vectors of order p, how can one determine the number of clusters and then, classify them into these clusters?

15.2. Different Methods of Clustering

The main methods are hierarchical in nature; the other ones are non-hierarchical. We will begin with non-hierarchical techniques. In this category, the most popular one involves optimization or partitioning.

15.2.1. Optimization or partitioning

With this approach, we have to come up with two numbers: k, a probable number of clusters, and r, the maximum separation between the members of each prospective cluster. Based on the distances or on the dissimilarity matrix, D, one should be able to determine the likely number of clusters, that is, k. Then, one has to find a set of k vectors among the n given vectors, which will be taken as seed members or starting members within the k potential clusters. Several methods have been proposed for determining this k, including the following:

1. Examine the closeness of the original vectors as indicated by the dissimilarity matrix D and, to start with, decide on initial values for k and for the likely distance between members within a cluster, denoted by r.

2. Examine the original data points or original p-vectors and, based on the comparative magnitudes of the components of the observed p-vectors, ascertain whether any grouping is possible and predict a value for each of k and r.

3. Evaluate the sample sum of products matrix S from the original data matrix. Compute the two main principal components associated with this S. Substitute X j, the j-th observation vector, in the two principal components. This provides a pair of numbers or one point in a two-dimensional space. Compute n such points for j = 1, …, n. Plot these points. From the graph, assess the clustering pattern, the number k of possible clusters, and estimates for r, the maximum distance between two members within a cluster, as well as the minimum distance between the clusters.

4. Choose any number k, select k vectors at random from the set of n vectors; then, preselect a number r and use it as a measure of maximum separation between vectors.

5. Take any number k and select as seed vectors the first k vectors whose separation is at least two units among the set of n vectors.

6. Look at the farthest points. Select k of them that are separated by at least r units for preselected values of k and r.

If the dissimilarity matrix D is utilized, then the separation number r must be measured in \(d^2_{ij}\) units, whereas r should be in d ij units if the actual distances d ij are used. After the seed vectors are selected, the remaining n − k points are to be associated to these seed points to form clusters. Assign the vectors closest to each of the seed vectors and form the initial k clusters of two or more vectors. For example, if there are three closest members at equal distance to a seed vector, then that cluster comprises 4 members, including the seed vector. Then, compute the centroids of all initial clusters. The centroid of a cluster is the simple average of the vectors included in that cluster; thus, the centroid is a p-vector. Then, measure the distances of all the points belonging to the same cluster from each centroid, and incorporate all points within the distance of r from a centroid to that cluster. This process will create the second stage of k clusters. Now, evaluate the centroid of each of these k clusters. Again, repeat the process of computing the distances of all points from each centroid. If a member in a cluster is found to be closer to the centroid of another cluster than to its own cluster’s centroid, then redirect that vector to the cluster to which it belongs. Rearrange all vectors in such a manner, assigning each one to a cluster whose centroid is the closest. Note that the number k can increase or decrease in the course of this process. Continue the procedure until no more improvement is possible. At this stage, that final k is the number of clusters in the data and the final members in each cluster are set. This procedure is also called the k-means approach.
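A bare-bones version of this iterative reassignment can be sketched in Python as follows. This is only an illustration of the basic k-means loop with fixed seeds; it omits the separation threshold r, the possibility of k changing, and the singleton handling described above.

```python
import numpy as np

def k_means(X, seed_idx, max_iter=100):
    """Minimal sketch of the partitioning (k-means) procedure described above.
    X: p x n data matrix; seed_idx: indices of the k seed vectors."""
    X = np.asarray(X, dtype=float)
    centroids = X[:, list(seed_idx)].copy()                 # p x k seed centroids
    for _ in range(max_iter):
        # squared distances of every point to every centroid (n x k)
        d2 = ((X[:, :, None] - centroids[:, None, :]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=1)                           # nearest-centroid assignment
        new_centroids = np.column_stack([
            X[:, labels == j].mean(axis=1) if np.any(labels == j) else centroids[:, j]
            for j in range(centroids.shape[1])])
        if np.allclose(new_centroids, centroids):            # no more transfers of points
            break
        centroids = new_centroids
    return labels, centroids
```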

This k-means approach has a serious shortcoming: if one starts with a different set of seed vectors, then it is possible to end up with a different set of final clusters. On the other hand, this method has the appreciable advantage that it allows a member provisionally assigned to a cluster to be moved to another cluster where it really belongs, that is, it allows the transfer of points. The following example should clarify the procedure.

Example 15.2.1

Ten volunteers are given an exercise routine in an experiment that monitors systolic pressure, diastolic pressure and heart beat. These are measured after adhering to the exercise routine for four weeks. The data entries are systolic pressure minus 120 (SP), diastolic pressure minus 80 (DP) and heart beat minus 60 (HB), where 120, 80 and 60 are taken as the standard readings of systolic pressure, diastolic pressure and heart beat, respectively. Carry out a cluster analysis of the data. The data matrix is the following where (1), …, (10) represent the data vectors A 1, …, A 10 for the 10 volunteers, the first row represents SP, the second row, DP, and the third, HB:

Solution 15.2.1

Let us compute the dissimilarity matrix D:

The data matrix suggests the possibility of three clusters. Accordingly, we may begin with the vectors A 2, A 5 and A 8 as seed vectors and take the separation width as r = 15 units. From D, we find \(d^2_{23}=1\), the smallest number, and hence A 2 and A 3 form the cluster: {A 2, A 3}. Note that \(d^2_{56}=11\), so that A 6 and A 5 form the cluster: {A 5, A 6}. Since \(d^2_{87}=9\), A 7 and A 8 form a cluster: {A 7, A 8}. Now, consider the centroids. Letting C 11, C 21 and C 31 denote the centroids, \(C_{11}=\frac {1}{2}(A_2+A_3),\ C_{21}=\frac {1}{2}(A_5+A_6)\) and \(C_{31}=\frac {1}{2}(A_7+A_8)\), that is,

Let us calculate the distances of A 1, …, A 10 from C 11, C 21, C 31:

$$\displaystyle \begin{aligned} d^2(C_{11},A_1)&=\frac{13}{4},~d^2(C_{11},A_2)=\frac{1}{4},~d^2(C_{11},A_3)=\frac{1}{4},~d^2(C_{11},A_4)=\frac{57}{4},\\ d^2(C_{11},A_5)&=\frac{185}{2},~d^2(C_{11},A_6)=\frac{121}{4},~d^2(C_{11},A_7)=\frac{645}{4},~d^2(C_{11},A_8)=\frac{961}{4},\\ d^2(C_{11},A_9)&=\frac{633}{4},~d^2(C_{11},A_{10})=\frac{713}{4},~d^2(C_{21},A_1)=\frac{139}{4},~d^2(C_{21},A_2)=\frac{131}{4},\\ d^2(C_{21},A_3)&=\frac{155}{4},~d^2(C_{21},A_4)=\frac{131}{4},~d^2(C_{21},A_5)=\frac{11}{4},~d^2(C_{21},A_6)=\frac{11}{4},\\ d^2(C_{21},A_7)&=\frac{195}{4},~d^2(C_{21},A_8)=\frac{387}{4},~d^2(C_{21},A_9)=\frac{179}{4},~d^2(C_{21},A_{10})=\frac{291}{4},\\ d^2(C_{31},A_1)&=\frac{741}{4},~d^2(C_{31},A_2)=\frac{757}{4},~d^2(C_{31},A_3)=\frac{833}{4},~d^2(C_{31},A_4)=\frac{605}{4},\\ d^2(C_{31},A_5)&=\frac{285}{4},~d^2(C_{31},A_6)=\frac{301}{4},~d^2(C_{31},A_7)=\frac{9}{4},~d^2(C_{31},A_8)=\frac{9}{4},\\ d^2(C_{31},A_9)&=\frac{61}{4},~d^2(C_{31},A_{10})=\frac{89}{4}.\end{aligned} $$

We include all the points located within 15 units of distance to the nearest cluster. Then, the second set of clusters is the following: Cluster 1: {A 1, A 2, A 3, A 4}, Cluster 2: {A 5, A 6}, Cluster 3: {A 7, A 8}. Note that A 9 is quite close to Cluster 3. We may either include it in Cluster 3 or treat it as a singleton. Since the next-stage calculations do not change the composition of the clusters, we may take the final clusters as {A 1, A 2, A 3, A 4}, {A 5, A 6}, {A 7, A 8, A 9} and {A 10}, where Cluster 4 consists of a single element. This completes the computations.

Let us examine the principal components of the sample sum of products matrix and plot the points to see whether any cluster can be detected. The sample matrix denoted by X and the sample average, denoted by \(\bar {X}\), are the following:

Let the matrix of sample averages be \({\bar {\mathbf {X}}}=[\bar {X},\bar {X},\ldots ,\bar {X}]\) and the deviation matrix be \({\mathbf {X}}_d=\mathbf {X}-{\bar {\mathbf {X}}}\). Then,

and the sample sum of products matrix is \(S={\mathbf {X}}_d{\mathbf {X}}_d^{\prime }\), that is,

The eigenvalues of S are λ 1 = 330.440, λ 2 = 40.522 and λ 3 = 9.039. An eigenvector corresponding to λ 1 = 330.440 and an eigenvector corresponding to λ 2 = 40.522, respectively denoted by U 1 and U 2 are the following:

Then the first two principal components are \(U_1^{\prime }Y\) and \(U_2^{\prime }Y \) with Y′ = [y 1, y 2, y 3]. We substitute our sample points A 1, …, A 10 to obtain 10 pairs of numbers. For example,

and hence the first pair of numbers or the first point is P 1 : (−0.057, −1.500). Similar calculations yield the remaining 9 points as: P 2 : (−0.218, −1.676), P 3 : (−1.161, −1.176), P 4 : (2.393, −4.852), P 5 : (9.232, 1.972), P 6 : (7.957, −2.204), P 7 : (19.236, −1.056), P 8 : (23.686, −2.408), P 9 : (18.568, 2.620), P 10 : (19.364, −6.760). It is seen that these points which are plotted in Fig. 15.2.2 form the same clusters as the original points shown in Fig. 15.2.1, that is, Cluster 1: {A 1, A 2, A 3, A 4}; Cluster 2: {A 5, A 6}; Cluster 3: {A 7, A 8, A 9}; Cluster 4: {A 10}.

Figure 15.2.1: The original 10 data points

Figure 15.2.2: Second versus first principal component evaluated at the A i’s
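The principal-component scores just described can be generated along the following lines; this is a minimal Python sketch under the assumption that the full p × n data matrix is available as a NumPy array. Eigenvector signs are arbitrary, so the resulting scores may differ in sign from the pairs quoted above.

```python
import numpy as np

def principal_component_scores(X, n_components=2):
    """Project the columns of the p x n data matrix X onto the leading
    eigenvectors of the sample sum of products matrix S = X_d X_d'."""
    X = np.asarray(X, dtype=float)
    Xd = X - X.mean(axis=1, keepdims=True)        # deviation matrix X_d
    S = Xd @ Xd.T                                 # sample sum of products matrix
    eigval, eigvec = np.linalg.eigh(S)            # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:n_components]
    U = eigvec[:, order]                          # leading eigenvectors U_1, U_2, ...
    return U.T @ X                                # n_components x n matrix of scores
```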

Other non-hierarchical methods are currently in use. We will mention these procedures later, after discussing the main hierarchical technique known as single linkage or nearest neighbor method.

15.3. Hierarchical Methods of Clustering

Hierarchical procedures are of two categories. In one of them, we begin with all the n data points as n different clusters of one element each. Then, by applying certain rules, we start combining these single-member clusters into larger clusters, the process being halted when a desired number of clusters is obtained. If the process is continued, we ultimately end up with a single cluster containing all of the n points. In the second category, we initially consider one cluster that comprises the n elements. We then start splitting this cluster into two clusters by making use of some criteria. Next, one or both of these sub-clusters are divided again by applying the same criteria. If the process is continued, we finally end up with n clusters of one element each; the process is halted when a desired number of clusters is obtained. In all these procedures, one cannot objectively determine when to stop the process or how many distinct clusters are present. We have to specify some stopping rules as a means of selecting a suitable number of clusters.

15.3.1. Single linkage or nearest neighbor method

In this single linkage procedure, we begin by assuming that there are n clusters consisting of one item each. We then combine these clusters by applying a minimum distance rule. At the initial stage, we have only one element in each ‘cluster’, but at the following steps, each cluster will potentially contain several items and hence, the rule is stated for general clusters. Consider two clusters A and B whose elements are denoted by X j and Y j, that is, X j ∈ A and Y j ∈ B, the X j’s and Y j’s being p-vectors belonging to the data set at hand. In the minimum distance rule, we define the distance between two clusters, denoted by d(A, B), as follows:

$$\displaystyle \begin{aligned} d(A,B)=\min\{d(X_i,Y_j),\mbox{ for all }X_i\in A,Y_j\in B\}.{}\end{aligned} $$
(15.3.1)

This distance is expressed in the units of whichever distance measure is being utilized. We will illustrate the single linkage hierarchical procedure by making use of the data set provided in Example 15.2.1 and its associated dissimilarity matrix D. We will utilize the dissimilarity matrix D to represent various “distances”. Since the matrix D will be repeatedly referred to at every stage, it is duplicated next for ready reference:

To start with, we have 10 clusters {A j}, j = 1, …, 10. At the initial stage, each cluster has one element. Then d(A, B) as defined in (15.3.1) is the smallest distance (dissimilarity) appearing in D, that is, 1, which occurs between the elements corresponding to A 2 and A 3. These two clusters of one vector each are combined and replaced by B 1 by taking the smaller entries in each column of the combined representation of the dissimilarity measures corresponding to A 2 and A 3. For illustration, we now list the dissimilarity measures corresponding to the original A 2 and A 3 and the new B 1 as rows:

The rows representing A 2 and A 3 are combined and replaced by B 1 as shown above. The second and third columns in D are combined into one column, namely, the B 1 column. The elements in B 1 are the smaller elements in each column of A 2 and A 3. The bracketed elements in A 2 and A 3, namely [0, 1] and [1, 0], are combined into one element [0] in B 1, the updated dissimilarity matrix having one fewer row and one fewer column. These are the intersections of the two rows and columns. Other smaller elements in the two original columns, which make up B 1, are displayed in parentheses. This process will be repeated at each stage. At the first stage of the procedure, we end up with 9 clusters: C 1 = {A 2, A 3}, {A j}, j = 1, 4, …, 10, the resulting configuration of the dissimilarity matrix, denoted by D 1, being

Now, the next smallest dissimilarity is 2 which occurs at (A 1, B 1). Thus, the rows (columns) corresponding to A 1 and B 1 are combined into one row (column) B 2. The original rows corresponding to A 1 and B 1 and the new row corresponding to B 2 are the following:

The new configuration, denoted by D 2, is the following:

the resulting clusters being C 2 = {A 1, A 2, A 3}, {A j}, j = 4, …, 10. The next smallest dissimilarity is 9, which occurs at (B 2, A 4). Hence these are combined, that is, the first two columns (rows) are merged as explained. The combined row, denoted by B 3, is the following, its transpose becoming the first column:

$$\displaystyle \begin{aligned} B_3 =[0,44,20,122,185,139,125],\end{aligned}$$

and the new configuration is the following:

At this stage, the clusters are C 2 = {A 1, A 2, A 3, A 4}, {A j}, j = 5, …, 10. The next smallest number is 9, which occurs at (A 7, A 8) and (A 7, A 9). Accordingly, we combine A 7, A 8 and A 9, and the resulting configuration is the following where the resultant of the replacement rows (columns) is denoted by B 4:

the clusters being C 2 = {A 1, A 2, A 3, A 4}, C 3 = {A 7, A 8, A 9}, {A i}, i = 5, 6, 10. The next smallest dissimilarity measure is 11 at (A 5, A 6). Combining these, the replacement row is B 5 = [20, 0, 36, 65], and the new configuration, denoted by D 5 is as follows:

the resulting clusters being C 2 = {A 1, A 2, A 3, A 4}, C 3 = {A 7, A 8, A 9}, C 4 = {A 5, A 6}, C 5 = {A 10}.

We may stop at this stage since the clusters obtained from the other methods coincide with C 2, C 3, C 4, C 5. At the following step of the procedure, C 4 would combine with C 3, and at the final stage, a single cluster would encompass all 10 points.
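The merging rule just illustrated, replacing the two closest rows and columns of D by their elementwise minima, can be coded along the following lines. This is a simplified Python sketch of the procedure, not a reproduction of the tables above.

```python
import numpy as np

def single_linkage(D, n_clusters):
    """Single linkage (nearest neighbor) agglomeration on a dissimilarity
    matrix D, merging until n_clusters clusters remain.  The rows/columns of
    the merged pair are replaced by their elementwise minima, as described above."""
    D = np.asarray(D, dtype=float).copy()
    clusters = [[i] for i in range(D.shape[0])]
    while len(clusters) > n_clusters:
        M = D + np.diag([np.inf] * D.shape[0])       # ignore the zero diagonal
        i, j = np.unravel_index(np.argmin(M), M.shape)
        i, j = min(i, j), max(i, j)
        merged = np.minimum(D[i], D[j])              # nearest-neighbor dissimilarities
        D[i] = merged
        D[:, i] = merged
        D[i, i] = 0.0
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)
        clusters[i] += clusters[j]                   # record the merged membership
        del clusters[j]
    return clusters
```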

15.3.2. Average linking as a modified distance measure

An alternative distance measure involving all the items in pairs of clusters is considered in this subsection. As one proceeds from any stage to the next one in a hierarchical procedure, a decision is based on the next smallest distance between two clusters. At the initial stage, this does not pose any problem since the dissimilarity matrix D is available and each cluster contains only a single element. However, further on in the process, as there are several elements in the clusters, a more suitable definition of “distance” is required in order to proceed to the next stage. Several types of methods have been proposed in the literature. One such procedure is the average linkage method under which the distance between two clusters A and B, denoted again by d(A, B), is defined as follows:

$$\displaystyle \begin{aligned} d(A,B)=\frac{1}{n_1n_2}\sum_{j=1}^{n_2}\sum_{i=1}^{n_1}d(X_i,Y_j)\mbox{ for all }X_i\in A, Y_j\in B{} \end{aligned} $$
(15.3.2)

where the X i’s and Y j’s are all p-vectors from the given set of data points. In this case, the rule being applied is that two clusters having the smallest distance, as measured in terms of (15.3.2), are combined before initiating the next stage.
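A minimal sketch of the average linkage distance (15.3.2), assuming the cluster members are stored as the columns of two NumPy arrays; Euclidean distances are used here, and squared distances could be substituted if the dissimilarity matrix D is the working measure.

```python
import numpy as np

def average_linkage_distance(A, B):
    """Average linkage distance (15.3.2) between two clusters whose members
    are the columns of A (p x n1) and B (p x n2)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    diff = A[:, :, None] - B[:, None, :]        # p x n1 x n2 pairwise differences
    d = np.sqrt((diff ** 2).sum(axis=0))        # pairwise distances d(X_i, Y_j)
    return d.mean()                             # (1/(n1 n2)) * sum of all d(X_i, Y_j)
```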

15.3.3. The centroid method

In a hierarchical single linkage procedure, another way of determining the distance between two clusters before proceeding to the next stage is referred to as the centroid method under which the Euclidean distance between the centroids of clusters A and B is defined as follows:

$$\displaystyle \begin{aligned} d(A,B)=d(\bar{X},\bar{Y})\, \ \text{with} \, \ \bar{X}=\frac{1}{n_1}\sum_{j=1}^{n_1}X_j\ \, \text{and} \, \ \bar{Y}=\frac{1}{n_2}\sum_{j=1}^{n_2}Y_j,{} \end{aligned} $$
(15.3.3)

where \(\bar {X}\) is the centroid of the cluster A and \(\bar {Y}\) is the centroid of the cluster B, X i ∈ A, i = 1, …, n 1, Y j ∈ B, j = 1, …, n 2. In this case, the process involves combining two clusters with the smallest d(A, B) as specified in (15.3.3) into a single cluster. After combining them, or equivalently, after taking the union of A and B, the centroid of the combined cluster, denoted by \(\bar {Z}\), is

$$\displaystyle \begin{aligned}\bar{Z}=\frac{n_1\bar{X}+n_2\bar{Y}}{n_1+n_2}=\frac{1}{n_1+n_2}\sum_{j=1}^{n_1+n_2}Z_j,\ Z_j\in A\cup B, \end{aligned}$$

where the Z j’s are the original vectors that were included in A or B.

15.3.4. The median method

A main shortcoming of the centroid method of joining two clusters is that if n 1 is very large compared to n 2, then \(\bar {Z}\) is likely to be closer to \(\bar {X}\), and vice versa. In order to avoid this type of imbalance, a method based on the median is suggested, under which the median of the combined clusters A and B is defined as

$$\displaystyle \begin{aligned} \mbox{Median}_{A\cup B}=\frac{1}{2}(\bar{X}+\bar{Y})\,\ \text{with} \, \ X_i\in A\,\ \text{and} \, \ Y_j\in B,{} \end{aligned} $$
(15.3.4)

for all i and j. In this process, the clusters A and B for which \(\mbox{Median}_{A\cup B}\) is the smallest are combined to form the next cluster whose elements are the Z r’s, Z r ∈ A ∪ B.

15.3.5. The residual sum of products method

From the one-way MANOVA layout, the residual or within-group (within-cluster) sums of products for clusters A, B and A ∪ B, denoted by R A, R B and R A∪B, are the following:

$$\displaystyle \begin{aligned} R_A&=\sum_{i=1}^{n_1}(X_i-\bar{X})'(X_i-\bar{X}),\ R_B=\sum_{j=1}^{n_2}(Y_j-\bar{Y})'(Y_j-\bar{Y})\\ R_{A\cup B}&=\sum_{r=1}^{n_1+n_2}(Z_r-\bar{Z})'(Z_r-\bar{Z}), \ \,Z_j\in A\cup B,\ \bar{Z}=\frac{n_1\bar{X}+n_2\bar{Y}}{n_1+n_2}.\end{aligned} $$

Once those sums of squares have been evaluated, we compute the quantity

$$\displaystyle \begin{aligned} T_{A\cup B}=R_{A\cup B}-(R_A+R_B),{} \end{aligned} $$
(15.3.5)

which can be interpreted as the increase in the residual sum of products due to the process of merging the clusters A and B. Then, the procedure consists of combining those clusters A and B for which T A∪B as defined in (15.3.5) is the minimum. This method is also known as Ward’s method.
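As a small illustration, the quantities R_A, R_B, R_{A∪B} and the increase T_{A∪B} of (15.3.5) can be computed as follows; a minimal Python sketch assuming the cluster members are stored as the columns of NumPy arrays.

```python
import numpy as np

def within_ss(X):
    """Within-cluster sum of squares: sum_j (X_j - Xbar)'(X_j - Xbar),
    the columns of X being the cluster members."""
    X = np.asarray(X, dtype=float)
    d = X - X.mean(axis=1, keepdims=True)
    return float((d ** 2).sum())

def ward_increase(A, B):
    """T_{A union B} = R_{A union B} - (R_A + R_B) as in (15.3.5): the increase
    in the residual sum of products caused by merging clusters A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return within_ss(np.hstack([A, B])) - (within_ss(A) + within_ss(B))
```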

There exist other methods for combining clusters such as the flexible beta method, and several comparative studies point out the merits and drawbacks of the various methods.

In the hierarchical procedures considered in Sect. 15.3, we begin with the n data points as n distinct clusters of one element each. Then, by applying certain “minimum distance” methods, “distance” being defined in different ways, we combined the clusters one by one. We may also consider a hierarchical procedure wherein the n data points are treated as one cluster of n elements. At this stage, by making use of some rules, we break up this cluster into two clusters. Then, one of these or both are split again into two clusters by applying the same rule. We continue the process and stop it when it is determined that there is a sufficient number of clusters. If the process is not halted at some stage, we will end up with n clusters of one element or point each. We will not elaborate further on such procedures.

15.3.6. Other criteria for partitioning or optimization

In Sect. 15.2, we considered a non-hierarchical procedure known as the k-means method, which is the most popular in this area. We then described the most widely utilized hierarchical procedure, single linkage, in Sect. 15.3. We will now examine other non-hierarchical procedures in common use. Some of these are connected with the MANOVA or multivariate analysis of variance of a one-way classification. In a multivariate one-way layout, let X ij be the j-th vector in the i-th group or i-th cluster, all vectors being p-vectors or p × 1 real vectors. Let there be k groups (k clusters) of sizes n 1, …, n k with n 1 + n 2 + ⋯ + n k = n . = n, that is, the cluster sizes are n 1, …, n k, respectively. Let the residual sum of products or sum of squares and cross products matrix be denoted by U, which is p × p. This matrix U is also called the within-group or within-cluster variation matrix. Let the between-groups or between-clusters variation matrix be V . In this setup, U and V  are the following:

$$\displaystyle \begin{aligned} U&=\sum_{i=1}^k\sum_{j=1}^{n_i}(X_{ij}-\bar{X}_i)(X_{ij}-\bar{X}_i)',\ \,\bar{X}_i=\frac{1}{n_i}\sum_{j=1}^{n_i}X_{ij},\ {} \end{aligned} $$
(15.3.6)
$$\displaystyle \begin{aligned} V&\!=\!\sum_{i\, j}(\bar{X}_i-\bar{X})(\bar{X}_i-\bar{X})'\!=\!\sum_{i=1}^kn_i(\bar{X}_i-\bar{X})(\bar{X}_i-\bar{X})',\ \,\bar{X}=\frac{1}{n_.}\sum_{i\, j}X_{ij}\ .{} \end{aligned} $$
(15.3.7)

Then, under the hypothesis that the group effects or cluster effects are the same, and under the normality assumption on the X ij’s, U and V  are independently distributed Wishart matrices with n . − k and k − 1 degrees of freedom, respectively, where Σ > O is the parameter matrix in the Wishart densities as well as the common covariance matrix of the X ij’s, referring to Chap. 5. Thus, \(W_1=(U+V)^{-\frac {1}{2}}U(U+V)^{-\frac {1}{2}}\) is a real matrix-variate type-1 beta random variable with the parameters \((\frac {n_.-k}{2},\frac {k-1}{2})\), \(W_2=U^{-\frac {1}{2}}VU^{-\frac {1}{2}}\) is a real matrix-variate type-2 beta random variable with the parameters \((\frac {k-1}{2},\frac {n_.-k}{2})\) and W 3 = U + V  follows a real Wishart distribution having n . − 1 degrees of freedom and parameter matrix Σ > O, again referring to Chap. 5. Observe that both U and V  are real positive definite matrices, so that all of their eigenvalues are positive. The likelihood ratio criterion λ for testing the hypothesis that the group effects are the same is the following:

$$\displaystyle \begin{aligned} \lambda^{\frac{2}{n_.}}=\frac{|U|}{|U+V|}=|W_1|=\frac{1}{|I+U^{-\frac{1}{2}}VU^{-\frac{1}{2}}|}=\frac{1}{|I+W_2|}.{} \end{aligned} $$
(15.3.8)

We are aiming to have the within cluster variation small and the between cluster variation large, which means, in some sense, that U will be small and V  will be large, in which case λ as given in (15.3.8) will be small. This also means that the trace of U must be small and trace of W 2 must be large. Accordingly, a few criteria for merging clusters are based on tr(U), |U| and tr(W 2). The following are some commonly utilized criteria for combining clusters: (1) Minimizing tr(U); (2) Minimizing |U|; (3) Maximizing tr(W 2). These criteria are applied as follows: One of the n observation vectors is moved to a selected cluster if tr(U) is a minimum (|U| is a minimum and tr(W 2) is a maximum for the other criteria). Then, tr(U) is evaluated after moving the observation vectors one by one to the selected cluster and, each time, tr(U) is noted; the vector for which tr(U) attains a minimum value belongs to the selected cluster, that is, it is combined with the selected cluster. Observe that

$$\displaystyle \begin{aligned} \text{tr}(U)&=\text{tr}\Big(\sum_{i\,j}(X_{ij}-\bar{X}_i)(X_{ij}-\bar{X}_i)'\Big)\\ &=\text{tr}\Big(\sum_{j=1}^{n_1}(X_{1j}-\bar{X}_1)(X_{1j}-\bar{X}_1)'\Big)+\cdots+\text{tr}\Big(\sum_{j=1}^{n_k}(X_{kj}-\bar{X}_k)(X_{kj}-\bar{X}_k)'\Big)\\ &=\sum_{j=1}^{n_1}(X_{1j}-\bar{X}_1)'(X_{1j}-\bar{X}_1)+\cdots+\sum_{j=1}^{n_k}(X_{kj}-\bar{X}_k)'(X_{kj}-\bar{X}_k),{} \end{aligned} $$
(15.3.9)

owing to the property that, for two matrices P and Q, tr(PQ) = tr(QP) as long as PQ and QP are defined. As well, observe that since \((X_{ij}-\bar {X}_i)'(X_{ij}-\bar {X}_i)\) is a scalar quantity for every i and j, it is equal to its trace. How does this criterion work in practice? Consider moving a member from the s-th cluster to the selected cluster, namely, the r-th cluster. The original centroids are \(\bar {X}_r\) and \(\bar {X}_s\), and when one element is added to the r-th cluster from the s-th cluster, both centroids will respectively change to, say, \(\bar {X}_{r+1}\) and \(\bar {X}_{s-1}\). Compute the updated sums of squares in the new r-th and s-th clusters. Then, add up all the sums of squares in all the clusters and obtain a new tr(U). Carry out this process for every member in every other cluster and compute tr(U) each time. Take the smallest value of tr(U) thus calculated, including the original value of tr(U), before considering transferring any point. That vector for which tr(U) is minimum really belongs to the r-th cluster and so, is included in it. Repeat the process until no more improvement can be made, at which point no more transfer of points is necessary.

Simplification of the computations of tr(U)

As will be explained, computing tr(U) can be simplified. Consider the new sum of squares in the r-th cluster. Let the new and old sums of squares be denoted by (New)r, (New)s, and (Old)r, (Old)s, respectively. Let the vector transferred from the s-th cluster to the r-th cluster be denoted by Y . Then,

$$\displaystyle \begin{aligned} \mbox{(New)}_r&=\sum_{j=1}^r(X_{rj}-\bar{X}_{r+1})'(X_{rj}-\bar{X}_{r+1})+(Y-\bar{X}_{r+1})'(Y-\bar{X}_{r+1})\\ &=\sum_{j=1}^r(X_{rj}-\bar{X}_r+(\bar{X}_r-\bar{X}_{r+1}))'(X_{rj}-\bar{X}_r +(\bar{X}_r-\bar{X}_{r+1}))\\ &\ \ \ \ \ +(Y-\bar{X}_{r+1})'(Y-\bar{X}_{r+1})\\ &=\sum_{j=1}^r(X_{rj}-\bar{X}_r)'(X_{rj}-\bar{X}_r)+r(\bar{X}_r-\bar{X}_{r+1})'(\bar{X}_r-\bar{X}_{r+1})\\ &\ \ \ \ \ +(Y-\bar{X}_{r+1})'(Y-\bar{X}_{r+1})\\ &=\mbox{(Old)}_r+r(\bar{X}_r-\bar{X}_{r+1})'(\bar{X}_r-\bar{X}_{r+1})+(Y-\bar{X}_{r+1})'(Y-\bar{X}_{r+1}).\end{aligned} $$

The difference between the new sum of squares and the old one is

$$\displaystyle \begin{aligned}\delta_1=r(\bar{X}_r-\bar{X}_{r+1})'(\bar{X}_r-\bar{X}_{r+1})+(Y-\bar{X}_{r+1})'(Y-\bar{X}_{r+1}).\end{aligned}$$

Noting that

$$\displaystyle \begin{aligned}\bar{X}_r-\bar{X}_{r+1}=\bar{X}_r-\frac{r\bar{X}_r+Y}{r+1}=\frac{1}{r+1}[\bar{X}_r-Y]\mbox{ and }Y-\bar{X}_{r+1}=\frac{r}{r+1}[Y-\bar{X}_r], \end{aligned}$$

δ 1 simplifies to

$$\displaystyle \begin{aligned}\delta_1=\frac{r}{r+1}(Y-\bar{X}_r)'(Y-\bar{X}_r).\end{aligned}$$

A similar procedure can be used for the s-th cluster. In that case, the new sum of squares can be written as

$$\displaystyle \begin{aligned} \mbox{(New)}_s&=\sum_{j=1}^{s-1}(X_{sj}-\bar{X}_{s-1})'(X_{sj}-\bar{X}_{s-1})\\ &=\sum_{j=1}^s(X_{sj}-\bar{X}_{s-1})'(X_{sj}-\bar{X}_{s-1})-(Y-\bar{X}_{s-1})'(Y-\bar{X}_{s-1}).\end{aligned} $$

Then, proceeding as in the case of the r-th cluster and denoting the difference between the new and the old sums of squares as δ 2, we have

$$\displaystyle \begin{aligned}\delta_2=-\frac{s}{s-1}(Y-\bar{X}_s)'(Y-\bar{X}_s), \ s>1,\end{aligned}$$

so that the sum of the differences between the new and old sums of squares, denoted by δ, is the following:

$$\displaystyle \begin{aligned} \delta=\delta_1+\delta_2=\frac{r}{r+1}(Y-\bar{X}_r)'(Y-\bar{X}_r)-\frac{s}{s-1}(Y-\bar{X}_s)'(Y-\bar{X}_s){} \end{aligned} $$
(15.3.10)

for s > 1, where \(\bar {X}_r\) and \(\bar {X}_s\) are the original centroids of the r-th and s-th clusters, respectively. As such, computing δ is very simple. Evaluate the quantity specified in (15.3.10) for all the points outside the r-th cluster and look for the minimum of δ, including the original value of δ = 0. If the minimum occurs at a point Y 1 outside of the r-th cluster, then transfer that point to the r-th cluster. Continue the process for every vector in the s-th cluster and then, for r = 1, …, k, assuming there are k clusters, until δ = 0. In the end, all the clusters are stabilized, and k may take on another value.
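The quantity δ of (15.3.10) is easily coded. Here is a minimal Python sketch of the transfer criterion, where the two centroids and the cluster sizes are assumed to be available from the current partition.

```python
import numpy as np

def transfer_gain(Y, Xbar_r, r, Xbar_s, s):
    """delta of (15.3.10): change in tr(U) when the point Y is moved from the
    s-th cluster (size s, centroid Xbar_s) to the r-th cluster (size r,
    centroid Xbar_r).  A negative value means the transfer reduces tr(U)."""
    Y = np.asarray(Y, dtype=float)
    Xbar_r = np.asarray(Xbar_r, dtype=float)
    Xbar_s = np.asarray(Xbar_s, dtype=float)
    gain_r = (r / (r + 1)) * float((Y - Xbar_r) @ (Y - Xbar_r))
    loss_s = (s / (s - 1)) * float((Y - Xbar_s) @ (Y - Xbar_s))  # requires s > 1
    return gain_r - loss_s
```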

Among the three statistics tr(U), |U| and tr(W 2), tr(U) is the easiest to compute, as was just explained. However, if we consider a non-singular transformation, other than an orthonormal transformation, then |U| and tr(W 2) are invariant, but tr(U) is not.

We have discussed one hierarchical methodology, the single linkage or nearest neighbor method, and one non-hierarchical procedure, the k-means method. These seem to be the most widely utilized. We also mentioned other hierarchical and non-hierarchical methods without going into the details. None of these procedures is a precisely defined mathematical procedure: none of them can uniquely determine the clusters if some clusters are present in the multivariate data at hand, and none of them can uniquely determine the number of clusters. The advantages and shortcomings of the various methods will not be discussed so as not to confound the reader.

Exercises 15

15.1

For the p × 1 vectors X 1, …, X n, let the dissimilarity measures be (1) \(d_{ij}^{(1)}=\sum _{k=1}^p|x_{ki}-x_{kj}|\), (2) \(d_{ij}^{(2)}=\sum _{k=1}^p(x_{ki}-x_{kj})^2\), with \(X_i^{\prime }=[x_{1i},x_{2i},\ldots ,x_{pi}]\). Compute the matrices (1) \((d_{ij}^{(1)})\); (2) \((d_{ij}^{(2)})\), for the following vectors:

15.2

Nine test runs T-1, …, T-9 are carried out to test the breaking strengths of three alloys. The following data are the deviations from the respective expected strengths:

Carry out a cluster analysis by applying the following methods: (1) The single linkage or nearest neighbor method; (2) The average linkage method; (3) The centroid method; (4) The residual sum of products method.

15.3

Using the data provided in Exercise 15.2, carry out a cluster analysis by utilizing the following methods: (1) Partitioning or optimization; (2) Minimization of tr(U); (3) Minimization of |U|; (4) Maximization of tr(W 2), where U and W 2 are defined in Sect. 15.3.6.

15.4

Compare the results from the different methods in (1) Exercise 15.2; (2) Exercise 15.3, and make your observations.

15.5

Compare the results from the different methods in Exercises 15.2 and 15.3, and comment on the similarities and differences.

15.4. Correspondence Analysis

If the data at hand are classified according to two attributes, these characteristics may be of the same type, that is, both quantitative or both qualitative, or of different types, and whatever the types may be, we may construct a two-way contingency table. In a contingency table, the entries in the cells are frequencies or the number of times various combinations of the attributes appear. Correspondence Analysis is a process of identifying, quantifying, separating and plotting associations among the characteristics and relationships among the various levels. In a two-way contingency table, we identify, separate and plot associations between the two characteristics and attempt to identify relationships between row and column labels.

15.4.1. Two-way contingency table

Consider the following example. A random sample of 100 persons from a certain township are classified according to their educational level and their liberal disposition. In the frequency Table 15.4.1, the A j’s represent their dispositions and the B j’s, their educational levels, with A 1 ≡ tolerant, A 2 ≡ indifferent, A 3 ≡ intolerant, B 1 ≡ primary school education level, B 2 ≡ high school education level, B 3 ≡ bachelor’s degree education level, B 4 ≡ master’s and higher degree education level.

Table 15.4.1 A two-way contingency table

        B 1   B 2   B 3   B 4   Total
A 1       6    14    16     4      40
A 2      17     5     8    10      40
A 3       7     6     6     1      20
Total    30    25    30    15     100

There are 6 persons having a tolerant disposition and primary school level of education. There is one person with an intolerant disposition and a master’s degree or a higher level of education, and so on. The marginal sums are also provided in the table. For example, the total number of persons having a primary school level of education is 30, the total number of persons having an intolerant disposition is 20, and so on. The corresponding relative frequencies (a given frequency divided by 100, the total frequency) are as follows (Table 15.4.2):

Table 15.4.2 Relative frequencies f ij in the two-way contingency table

The relative frequencies are denoted in parentheses by f ij where the summation with respect to a subscript is designated by a dot, that is, f i. =∑j f ij, f .j =∑i f ij and f .. =∑ij f ij. Note that f .. = 1. In a general notation, a two-way contingency table and the corresponding relative frequencies are displayed as follows (Table 15.4.3):

Table 15.4.3 A two-way contingency table and a table of relative frequencies

Letting the true probability of the occurrence of an observation in the (i, j)-th cell be p ij, the following is the table of true probabilities:

These are multinomial probabilities and, in this case, the n ij’s become multinomial variables. An estimate of p ij, denoted by \(\hat {p}_{ij}\), is \(\hat {p}_{ij}=f_{ij},\) the corresponding relative frequency. The marginal sums in Table 15.4.4 can be interpreted as follows: p 1. is the probability of finding an item in the first row, that is, the probability that an observation will have the attribute A 1; p .j is the probability that an observation will have the characteristic B j, and so on. Thus,

$$\displaystyle \begin{aligned}\hat{p}_{ij}=f_{ij}=\frac{n_{ij}}{n},~ \,\hat{p}_{i.}=\frac{n_{i.}}{n},~\, \hat{p}_{.j}=\frac{n_{.j}}{n},\,~i=1,\ldots,r,~j=1,\ldots,s. \end{aligned}$$
Table 15.4.4 True probabilities p ij in a two-way contingency table

If A i and B j are respectively interpreted as the event that an observation will belong to the i-th row or the event of the occurrence of the characteristic A i, and the event that an observation will belong to the j-th column or the event of the occurrence of the attribute B j, and if we let p i. = P(A i) and p .j = P(B j), then p ij = P(A i ∩ B j), where P(A i) is the probability of the event A i, P(B j) is the probability of the event B j, and (A i ∩ B j) is the intersection or joint occurrence of the events A i and B j. If A i and B j are independent events, P(A i ∩ B j) = P(A i)P(B j) or p ij = p i. p .j, the product of the marginal probabilities or the marginal totals in the table of probabilities. That is,

$$\displaystyle \begin{aligned} P(A_i\cap B_j)=P(A_i)P(B_j)\ \Rightarrow \ p_{ij}=p_{i.}p_{.j},\ \hat{p}_{ij}=\Big(\frac{n_{i.}}{n}\Big)\Big(\frac{n_{.j}}{n}\Big)=\frac{n_{i.}n_{.j}}{n^2}{} \end{aligned} $$
(15.4.1)

for all i and j. In a multinomial distribution, the expected frequency in the (i, j)-th cell is np ij where n is the total frequency. Then, the expected frequency, denoted by E[⋅], the maximum likelihood estimate (MLE) of the expected frequency, denoted by \(\hat {E}[\cdot ]\), and the MLE of the expected frequency under the hypothesis H o of independence of events A i and B j, are the following:

$$\displaystyle \begin{aligned} E[n_{ij}]=np_{ij}, \ \hat{E}[n_{ij}]=n\hat{p}_{ij}=n\Big(\frac{n_{ij}}{n}\Big),\ n\hat{p}_{ij}|H_o=n\hat{p}_{i.}\hat{p}_{.j}=n\Big(\frac{n_{i.}}{n}\Big)\Big(\frac{n_{.j}}{n}\Big)=\frac{n_{i.}n_{.j}}{n}.{} \end{aligned} $$
(15.4.2)

Now, referring to our numerical example and the first row of Table 15.4.1, the estimated expected frequencies, under H o are: \(E[n_{11}|H_o]=\frac {n_{1.}n_{.1}}{n}=\frac {40\times 30}{100}=12\), \(E[n_{12}|H_o]=\frac {n_{1.}n_{.2}}{n}=\frac {40\times 25}{100}=10\), \(E[n_{13}|H_o]=\frac {40\times 30}{100}=12\), \(E[n_{14}|H_o]=\frac {40\times 15}{100}=6\). All the estimated expected frequencies are shown in parentheses next to the observed frequencies in Table 15.4.5:

Table 15.4.5 A two-way contingency table

15.4.2. Some general computations

Let J r and J s be respectively r × 1 and s × 1 vectors of unities and P be the true probability matrix, that is,

(15.4.3)

Letting the marginal totals be denoted by R and C , we have

(15.4.4)

Referring to the initial numerical example, we have the following:

Writing the bordered matrix P as

(15.4.5)

in the numerical example, these quantities are

Let D r and D c be the following diagonal matrices corresponding respectively to the row and column marginal probabilities:

(15.4.6)

In the numerical example, these quantities are

$$\displaystyle \begin{aligned}\hat{D}_r=\text{diag}(0.4,0.4,0.2)\ \ \text{and}\ \ \hat{D}_c=\text{diag}(0.30,0.25,0.30,0.15). \end{aligned}$$

Now, consider \(D_r^{-1}P\) and \(PD_c^{-1}\):

(15.4.7)
(15.4.8)

Referring to the numerical example, we have

For computing the test statistics in vector/matrix notation, we need (15.4.7) and (15.4.8).

15.5. Various Representations of Pearson’s χ 2 Statistic

Now, let us consider Pearson’s χ 2 statistic for testing the hypothesis that there is no association between the two characteristics of classification or the hypothesis H o : p ij = p i. p .j. The χ 2 statistic is the following:

$$\displaystyle \begin{aligned} \chi^2&=\sum_{ij}\frac{(\mbox{observed frequency}-\mbox{expected frequency})^2}{(\mbox{expected frequency})}=\sum_{ij}\frac{(n_{ij}-\frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}{} \end{aligned} $$
(15.5.1)
$$\displaystyle \begin{aligned} &=\sum_{ij}n\frac{(\frac{n_{ij}}{n}-\frac{n_{i.}}{n}\frac{n_{.j}}{n})^2}{\frac{n_{i.}}{n}\frac{n_{.j}}{n}} =n\sum_{ij}\frac{(\hat{p}_{ij}-\hat{p}_{i.}\hat{p}_{.j})^2}{\hat{p}_{i.}\hat{p}_{.j}}{} \end{aligned} $$
(15.5.2)
$$\displaystyle \begin{aligned} &=\sum_{i=1}^rn\hat{p}_{i.}\sum_{j=1}^s\left[\Big(\frac{\hat{p}_{ij}}{\hat{p}_{i.}}-\hat{p}_{.j}\Big)^2/\hat{p}_{.j}\right]{} \end{aligned} $$
(15.5.3)
$$\displaystyle \begin{aligned} &=\sum_{j=1}^sn\hat{p}_{.j}\sum_{i=1}^r\left[\Big(\frac{\hat{p}_{ij}}{\hat{p}_{.j}}-\hat{p}_{i.}\Big)^2/\hat{p}_{i.}\right].{} \end{aligned} $$
(15.5.4)

In order to simplify the notation, we shall omit placing a hat on top of the estimates of R i, C j, R, C, D c and D r. We may then express the χ 2 statistic as the following quadratic forms:

$$\displaystyle \begin{aligned} \chi^2&=\sum_{i=1}^rnp_{i.}(R_i-C)'D_c^{-1}(R_i-C){} \end{aligned} $$
(15.5.5)
$$\displaystyle \begin{aligned} &=\sum_{j=1}^snp_{.j}(C_j-R)'D_r^{-1}(C_j-R).{} \end{aligned} $$
(15.5.6)

The forms given in (15.5.5) and (15.5.6) are very convenient for extending the theory to multi-way classifications.

It is well known that, under H o, Pearson’s χ 2 statistic is asymptotically distributed as a chi-square random variable having (r − 1)(s − 1) degrees of freedom as n →∞. One can also express (15.4.8) as a generalized distance between the observed frequencies and the expected frequencies, which is a quadratic form involving the inverse of the true covariance matrix of the multinomial distribution of the n ij’s. Then, on applying the multivariate version of the central limit theorem, it can be established that, as n →∞, Pearson’s χ 2 statistic has a χ 2 distribution with (r − 1)(s − 1) degrees of freedom. For the representation of Pearson’s χ 2 goodness-of-fit statistic as a generalized distance and as a quadratic form, and for the proof of its asymptotic distribution, the reader may refer to ?. There exist other derivations of this result in the literature.

The quadratic forms specified in (15.5.5) and (15.5.6) can also be interpreted as comparing the generalized distance between the vectors R i and C in (15.5.5) and between the vectors C j and R in (15.5.6), respectively. These will also be equivalent to testing the hypothesis H o : p ij = p i. p .j. As well, an interpretation can be provided in terms of profile analysis: then, the test will correspond to testing the hypothesis that the weighted row profiles are similar; analogously, using (15.5.6) corresponds to testing the hypothesis that the column profiles in a two-way contingency table are similar. Now, examine the following item:

Referring to our numerical example, these quantities are the following:

15.5.1. Testing the hypothesis of no association in a two-way contingency table

The observed value of Pearson’s χ 2 statistic is

$$\displaystyle \begin{aligned} \chi^2&=\left[\frac{(-6)^2}{12}+\frac{(5)^2}{12}+\frac{(1)^2}{6}\right]+\left[\frac{(4)^2}{10}+\frac{(-5)^2}{10}+\frac{(1)^2}{5}\right]\\ &\ \ \ +\left[\frac{(4)^2}{12}+\frac{(-4)^2}{12}+\frac{(0)^2}{6}\right]+\left[\frac{(-2)^2}{6}+\frac{(4)^2}{6}+\frac{(-2)^2}{3}\right]\\ &=16.88.\end{aligned} $$

Given our data, (r − 1)(s − 1) = (2)(3) = 6, and the tabulated critical value is \(\chi ^2_{6,0.05}=12.59\) at the 5% significance level. Since 12.59 < 16.88, the hypothesis of no association between the classification attributes is rejected as per the evidence provided by the data. This χ 2 approximation may be questionable since one of the expected cell frequencies is less than 5. For a proper application of this approximation, the expected cell frequencies ought to be at least 5.
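The same computation takes a few lines of Python. The observed frequencies below are those of Table 15.4.1, as implied by the marginal totals and the deviations used above.

```python
import numpy as np

# Observed frequencies (rows A1-A3, columns B1-B4) of Table 15.4.1
observed = np.array([[ 6, 14, 16,  4],
                     [17,  5,  8, 10],
                     [ 7,  6,  6,  1]], dtype=float)
n = observed.sum()
row = observed.sum(axis=1, keepdims=True)        # n_i.
col = observed.sum(axis=0, keepdims=True)        # n_.j
expected = row @ col / n                         # n_i. n_.j / n under H_o
chi_square = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(round(float(chi_square), 2), df)           # 16.88 and 6
```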

15.6. Plot of Row and Column Profiles

Now, \((P-RC')D_c^{-1}\) means that the columns of (P − RC′) are multiplied by \(\frac {1}{p_{.1}},\ldots ,\frac {1}{p_{.s}},\) respectively. Then, \((P-RC')D_c^{-1}(P-RC')^{\prime }\) is a matrix of all square and cross product terms involving p ij − p i. p .j for all i and j, where the s columns are weighted by \(\frac {1}{p_{.j}}\), and if pre-multiplied by \(D_r^{-1}\), the rows are weighted by \(\frac {1}{p_{1.}},\ldots ,\frac {1}{p_{r.}},\) respectively. Looking at the diagonal elements, we note that Pearson’s χ 2 statistic is nothing but

$$\displaystyle \begin{aligned} \chi^2&=n\,\text{tr}[D_r^{-1}(P-RC')D_c^{-1}(P-RC')']{} \end{aligned} $$
(15.6.1)
$$\displaystyle \begin{aligned} &=n\sum_{ij}\frac{(p_{ij}-p_{i.}p_{.j})^2}{p_{i.}p_{.j}}{} \end{aligned} $$
(15.6.2)
$$\displaystyle \begin{aligned} &=n(\lambda_1^2+\cdots+\lambda_k^2){} \end{aligned} $$
(15.6.3)

where \(\lambda _1^2,\ldots ,\lambda _k^2\) are the nonzero eigenvalues of the matrix \(D_r^{-1}(P-RC')D_c^{-1}(P-RC')^{\prime }\) or of the matrix \(D_r^{-\frac {1}{2}}(P-RC')D_c^{-1}(P-RC')'D_r^{-\frac {1}{2}}\) with k being the rank of P − RC′. For the numerical example, the observed value of the matrix Y = (y ij) with \( y_{ij}=\frac {p_{ij}-p_{i.}p_{.j}}{\sqrt {p_{i.}p_{.j}}}\), is obtained as follows, observing that

$$\displaystyle \begin{aligned}\sqrt{n}\,\frac{\hat{p}_{ij}-\hat{p}_{i.}\hat{p}_{.j}}{\sqrt{\hat{p}_{i.}\hat{p}_{.j}}}=\Big[n_{ij}-\frac{n_{i.}n_{.j}}{n}\Big]\Big/\sqrt{n_{i.}n_{.j}/n}.\end{aligned}$$

From the representation of \(\hat {P}-\hat {R}\hat {C}',\) we already have the matrix of the quantities \(n_{ij}-\frac{n_{i.}n_{.j}}{n}\), that is,

Then,

$$\displaystyle \begin{aligned} nYY'=\left[\begin{array}{rrr}\frac{33}{5}\ &-\frac{43}{6}\ &\frac{17\sqrt{2}}{30}\\ -\frac{43}{6}\ &\frac{103}{12}\ &-\frac{17\sqrt{2}}{12}\\ \frac{17\sqrt{2}}{30}&\ \ \ -\frac{17\sqrt{2}}{12}&\frac{17}{10}\ \ \end{array}\right].{} \end{aligned} $$
(15.6.4)

The representation in (15.6.1) has the advantage that

$$\displaystyle \begin{aligned} &\text{tr}[D_r^{-\frac{1}{2}}(P-RC')D_c^{-1}(P-RC')'D_r^{-\frac{1}{2}}]\\ &=\text{tr}[YY'],\ \ Y=D_r^{-\frac{1}{2}}(P-RC')D_c^{-\frac{1}{2}}=(y_{ij}),\\ y_{ij}&=\frac{p_{ij}-p_{i.}p_{.j}}{\sqrt{p_{i.}p_{.j}}},\ \ \sum_{i,j}n\hat{y}_{ij}^2=\chi^2=n\,\text{tr}(\hat{Y}\hat{Y}').{} \end{aligned} $$
(15.6.5)

Note that Y  is r × s and the rank of Y  is equal to the rank of P − RC′, which is k, referring to (15.6.3). Thus, there are k nonzero eigenvalues associated with the r × r matrix YY′ as well as with the s × s matrix Y′Y , which are \(\lambda _1^2,\ldots ,\lambda _k^2\). Since \(\text{ tr}(YY')=\lambda _1^2+\cdots +\lambda _k^2\), we can represent Pearson’s χ 2 statistic as follows, substituting the estimates of p ij, p i. and p .j, etc:

$$\displaystyle \begin{aligned} \frac{\chi^2}{n}&=\text{tr}(YY')=\lambda_1^2+\cdots+\lambda_k^2\\ &=\sum_{i=1}^r\hat{p}_{i.}(\hat{R}_i-\hat{C})'\hat{D}_c^{-1}(\hat{R}_i-\hat{C}){} \end{aligned} $$
(15.6.6)
$$\displaystyle \begin{aligned} &=\sum_{j=1}^s\hat{p}_{.j}(\hat{C}_j-\hat{R})'\hat{D}_r^{-1}(\hat{C}_j-\hat{R}).{} \end{aligned} $$
(15.6.7)

The expressions given in (15.6.6) and (15.6.7) and the sum of the \(\lambda _j^2\)’s are called the total inertia in a two-way contingency table. We can also define the squared distance between two rows as

$$\displaystyle \begin{aligned} d^2_{ij{(r)}}=(R_i-R_j)'D_c^{-1}(R_i-R_j){} \end{aligned} $$
(15.6.8)

and the squared distance between two columns as

$$\displaystyle \begin{aligned} d^2_{ij{(c)}}=(C_i-C_j)'D_r^{-1}(C_i-C_j).{} \end{aligned} $$
(15.6.9)

When the distance specified in (15.6.8) is very small, we may combine the i-th and j-th rows, if necessary. Sometimes, certain cell frequencies are small and we may wish to combine them with other cell frequencies so that the χ 2 approximation to the distribution of Pearson’s χ 2 statistic will be more accurate. Then, one can rely on (15.6.8) and (15.6.9) to determine whether combining certain rows or columns is indicated.

For convenience, let r ≤ s. Let U 1, …, U r be the r × 1 normalized eigenvectors of YY′ and let the r × k matrix U = [U 1, U 2, …, U k], k ≤ r. Let V 1, …, V s be the normalized eigenvectors of Y′Y  and let the s × k matrix V = [V 1, …, V k], k ≤ s. Now, consider the singular value decomposition

$$\displaystyle \begin{aligned} Y=D_r^{-\frac{1}{2}}(P-RC')D_c^{-\frac{1}{2}}=U\varLambda V'{} \end{aligned} $$
(15.6.10)

where U′U = I k = V′V  and Λ = diag(λ 1, …, λ k). Then, we can write

$$\displaystyle \begin{aligned} P-RC'=D_r^{\frac{1}{2}}U\varLambda V'D_c^{\frac{1}{2}}=W\varLambda Z'{} \end{aligned} $$
(15.6.11)

where \(W=D_r^{\frac {1}{2}}U\) and \(Z=D_c^{\frac {1}{2}}V\). Let W j, j = 1, …, k, denote the columns of W = [W 1, W 2, …, W k] and let Z j, j = 1, …, k, denote the columns of Z = [Z 1, Z 2, …, Z k]. Then, we can write

$$\displaystyle \begin{aligned} P-RC'=\sum_{j=1}^k\lambda_jW_jZ_j^{\prime}{} \end{aligned} $$
(15.6.12)

where \(W'D_r^{-1}W=U'U=I_k=V'V=Z'D_c^{-1}Z\). Note that P − RC′ is the deviation matrix under the hypothesis H o : p ij = p i. p .j or

$$\displaystyle \begin{aligned}P-RC'=(p_{ij}-p_{i.}p_{.j})\ \mbox{ and }\ Y=(y_{ij})=D_r^{-\frac{1}{2}}(P-RC')D_c^{-\frac{1}{2}}=\left(\frac{p_{ij}-p_{i.}p_{.j}}{\sqrt{p_{i.}p_{.j}}}\right). \end{aligned}$$

Thus, the procedure is as follows: If r ≤ s, then compute the r eigenvalues of the r × r matrix YY′. If Y  is of rank r, YY′ > O (positive definite); otherwise YY′ is positive semi-definite. Let the nonzero eigenvalues of YY′ be \(\lambda _1^2,\ldots ,\lambda _k^2\), assuming that k is the number of nonzero eigenvalues of YY′. These will also be the nonzero eigenvalues of Y′Y . Compute the normalized eigenvectors from YY′ and denote those corresponding to the nonzero eigenvalues by U = [U 1, …, U k] where U j is the j-th column of U. Letting the normalized eigenvectors obtained from Y′Y , which correspond to the same nonzero eigenvalues, be denoted by V = [V 1, …, V k], we have

$$\displaystyle \begin{aligned} Y=U\varLambda V',\ \,\varLambda=\text{diag}(\lambda_1,\ldots,\lambda_k),\ \,YY'=U\varLambda^2 U'\, \ \text{and}\ \ Y'Y=V\varLambda^2 V'.{} \end{aligned} $$
(15.6.13)
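In practice, the whole decomposition is conveniently obtained from a numerical SVD routine. The following Python sketch, based on the relations (15.6.10) to (15.6.13), computes Y, its singular values λ 1, …, λ k and the profile coordinate matrices G = D_r^{-1/2}UΛ and H = D_c^{-1/2}VΛ from a table of observed frequencies. The function name and arguments are illustrative; note also that the numerical example below works with nŶŶ′, so its coordinates differ from these by a factor of √n, and the signs of the singular vectors are arbitrary.

```python
import numpy as np

def correspondence_coordinates(N, k=2):
    """Row and column profile coordinates G = D_r^{-1/2} U Lambda and
    H = D_c^{-1/2} V Lambda from the SVD of Y = D_r^{-1/2}(P - R C')D_c^{-1/2},
    as in (15.6.10)-(15.6.13).  N is the table of observed frequencies."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                       # relative frequencies p_ij
    R = P.sum(axis=1)                     # row marginals p_i.
    C = P.sum(axis=0)                     # column marginals p_.j
    Y = (P - np.outer(R, C)) / np.sqrt(np.outer(R, C))
    U, lam, Vt = np.linalg.svd(Y)         # singular values lam = (lambda_1, ...)
    G = U[:, :k] * lam[:k] / np.sqrt(R)[:, None]
    H = Vt.T[:, :k] * lam[:k] / np.sqrt(C)[:, None]
    return lam[:k], G, H
```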

Example 15.6.1

Construct a singular value decomposition of the following matrix Q:

$$\displaystyle \begin{aligned}Q=\left[\begin{array}{rrrr}-1&1&-1&0\\ 1&1&0&2\end{array}\right].\end{aligned}$$

Solution 15.6.1

Let us compute QQ′ as well as Q′Q and the eigenvalues of QQ′. Since

the eigenvalues of QQ′ are λ 1 = 3 and λ 2 = 6. Let us determine the normalized eigenvectors of QQ′. Consider the equation [QQ′− λI]X = O for λ = 3 and 6, and let X′ = [x 1, x 2] and O′ = [0, 0]. Then, for λ = 3, we see that x 2 = 0 and for λ = 6, we note that x 1 = 0. Thus, the normalized solutions are

Note that − U 1 or − U 2 or − U 1, −U 2 will also satisfy all the conditions, and we could take any of these forms for convenience. Now, consider the equation (Q′Q − λI)X = O, where X′ = [x 1, x 2, x 3, x 4] and O  = [0, 0, 0, 0] for λ = 3, 6. For λ = 3, the coefficient matrix is

$$\displaystyle \begin{aligned}Q'Q-3 I=\left[\begin{array}{rrrr}-1&0&1&2\\ 0&-1&-1&2\\ 1&-1&-2&0\\ 2&2&0&1\end{array}\right]\to\left[\begin{array}{rrrr}-1&0&1&2\\ 0&-1&-1&2\\ 0&0&0&0\\ 0&0&0&9\end{array}\right]\end{aligned}$$

by elementary transformations. Observe that x 4 = 0, so that − x 1 + x 3 = 0 and − x 2 − x 3 = 0. Thus, an eigenvector corresponding to λ = 3 and its normalized form are

$$\displaystyle \begin{aligned}\left[\begin{array}{r}1\\ -1\\ 1\\ 0\end{array}\right]\Rightarrow V_1=\frac{1}{\sqrt{3}}\left[\begin{array}{r}1\\ -1\\ 1\\ 0\end{array}\right].\end{aligned}$$

Now, take λ = 6 and consider the equation (Q′Q − 6I)X = O; the coefficient matrix and its reduced form obtained through elementary transformations are the following:

$$\displaystyle \begin{aligned}\left[\begin{array}{rrrr}-4&0&1&2\\ 0&-4&-1&2\\ 1&-1&-5&0\\ 2&2&0&-2\end{array}\right]\to \left[\begin{array}{rrrr}1&-1&-5&0\\ 0&-4&-1&2\\ 0&0&-18&0\\ 0&0&9&0\end{array}\right],\end{aligned}$$

which shows that x 3 = 0, so that x 1 − x 2 = 0 and − 4x 2 + 2x 4 = 0. Hence, an eigenvector and its normalized form are

$$\displaystyle \begin{aligned}\left[\begin{array}{r}1\\ 1\\ 0\\ 2\end{array}\right]\Rightarrow V_2=\frac{1}{\sqrt{6}}\left[\begin{array}{r}1\\ 1\\ 0\\ 2\end{array}\right].\end{aligned}$$
Thus, V = [V 1, V 2]. As mentioned earlier, we could have − V 1 or − V 2 or − V 1, −V 2 as the normalized eigenvectors. As per our notation,

$$\displaystyle \begin{aligned}\varLambda=\text{diag}(\sqrt{3},\sqrt{6})\ \mbox{ and }\ Q=U\varLambda V'. \end{aligned}$$

Let us verify this last equality. Since

$$\displaystyle \begin{aligned}U\varLambda V'=\left[\begin{array}{rr}1&0\\ 0&1\end{array}\right]\left[\begin{array}{rr}\sqrt{3}&0\\ 0&\sqrt{6}\end{array}\right]\left[\begin{array}{rrrr}\frac{1}{\sqrt{3}}&-\frac{1}{\sqrt{3}}&\frac{1}{\sqrt{3}}&0\\ \frac{1}{\sqrt{6}}&\frac{1}{\sqrt{6}}&0&\frac{2}{\sqrt{6}}\end{array}\right]=\left[\begin{array}{rrrr}1&-1&1&0\\ 1&1&0&2\end{array}\right]\ne Q,\end{aligned}$$

we should take − V 1 to obtain Q. Then,

$$\displaystyle \begin{aligned}U\varLambda [-V_1,\,V_2]'=\left[\begin{array}{rrrr}-1&1&-1&0\\ 1&1&0&2\end{array}\right]=Q,\end{aligned}$$
which verifies the result and completes the computations.
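
As a quick numerical check of Example 15.6.1, the decomposition can be verified with numpy. The matrix Q used below is the one reconstructed from the computations in the solution and should be viewed as an assumption if it differs from the original display.

```python
import numpy as np

# Q as reconstructed from the solution of Example 15.6.1 (an assumption)
Q = np.array([[-1.0, 1.0, -1.0, 0.0],
              [ 1.0, 1.0,  0.0, 2.0]])

print(Q @ Q.T)   # diag(3, 6): the eigenvalues of QQ' are 3 and 6
print(Q.T @ Q)   # the 4 x 4 matrix whose reduced forms appear above

U   = np.eye(2)                                      # normalized eigenvectors of QQ'
V1  = np.array([1.0, -1.0, 1.0, 0.0]) / np.sqrt(3)   # eigenvector of Q'Q for eigenvalue 3
V2  = np.array([1.0,  1.0, 0.0, 2.0]) / np.sqrt(6)   # eigenvector of Q'Q for eigenvalue 6
Lam = np.diag([np.sqrt(3.0), np.sqrt(6.0)])

V = np.column_stack([-V1, V2])          # the sign of V_1 is flipped, as noted in the text
print(np.allclose(Q, U @ Lam @ V.T))    # True: Q = U Lambda V'
```
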

Now, we shall continue with our row and column profile plots. From (15.6.4), we have

The eigenvalues of nYY′ are \(\lambda_1^2=15.1369\), \(\lambda_2^2=1.7471\) and \(\lambda_3^2=0\), and the normalized eigenvectors of nYY′ corresponding to these eigenvalues are U 1, U 2, U 3, so that U = [U 1, U 2, U 3] where

For these same eigenvalues, the normalized eigenvectors determined from nY′Y, which correspond to the nonzero eigenvalues, are V 1 and V 2, with

$$\displaystyle \begin{aligned}V=[V_1,V_2]=\left[\begin{array}{rr}1.10&-0.84\\ -1.06&-0.16\\ -0.83&0.28\\ 1&1\end{array}\right]. \end{aligned}$$

Since λ 3 = 0, k = 2, and we can take the r × k, that is, 3 × 2 matrix \(G=(g_{ij})=D_r^{-\frac {1}{2}}U\varLambda \) to represent the row deviation profiles and the s × k = 4 × 2 matrix \(H=(h_{ij})=D_c^{-\frac {1}{2}}V\varLambda \) to represent the column deviation profiles. For our numerical example, it follows from (15.4.6) that

$$\displaystyle \begin{aligned} D_r&=\text{diag}(0.4,0.4,0.2)\Rightarrow D_r^{-\frac{1}{2}}=\text{diag}\Big(\frac{1}{0.63},\frac{1}{0.63},\frac{1}{0.45}\Big)\\ D_c&=\text{diag}(0.30,0.25,0.30,0.15)\Rightarrow D_c^{-\frac{1}{2}}=\text{diag}\Big(\frac{1}{0.55},\frac{1}{0.50},\frac{1}{0.55},\frac{1}{0.39}\Big)\\ \varLambda&=\text{diag}(\sqrt{15.1369},\sqrt{1.7471},0)=\text{diag}(3.89,1.32,0).\end{aligned} $$

We only take the first two columns of U and V since λ 3 = 0; besides, only the first two vectors are required for plotting. Let U (1) and V (1) represent the first two columns of U and V, respectively. Forming \(D_r^{-\frac {1}{2}}U_{(1)}\varLambda \) then amounts to multiplying the first and second columns of U (1) by 3.89 and 1.32, respectively, and multiplying the first and second rows by \(\frac {1}{0.63}\) and the third row by \(\frac {1}{0.45}\). We then have

$$\displaystyle \begin{aligned}U_{(1)}=\left[\begin{array}{rr}4.28&-0.49\\ -4.99&-0.22\\ 1&1\end{array}\right],~D_r^{-\frac{1}{2}}U_{(1)}\varLambda=\left[\begin{array}{rr}26.42&-1.03\\ -30.81&-0.46\\ 6.17&2.09\end{array}\right]\equiv G_2 \end{aligned}$$

where G 2 is the matrix consisting of the first two columns of G. Hence, the points required for plotting the row profile are: (26.42, −1.03), (−30.81, −0.46), (6.17, 2.09). These points being far apart, no two rows should be combined. Now, consider the column profiles: the effect of \(D_c^{-\frac {1}{2}}V_{(1)}\varLambda \) is to multiply the columns of V (1) by 3.89 and 1.32, respectively, and to multiply the rows by \(\frac {1}{0.55}\), \(\frac {1}{0.50}\), \(\frac {1}{0.55}\), \(\frac {1}{0.39}\), respectively. Thus,

$$\displaystyle \begin{aligned}V_{(1)}=\left[\begin{array}{rr}1.10&-0.84\\ -1.06&-0.16\\ -0.83&0.28\\ 1&1\end{array}\right],~D_c^{-\frac{1}{2}}V_{(1)}\varLambda=\left[\begin{array}{rr}7.78&-2.02\\ -8.25&-0.42\\ -5.87&-0.67\\ 9.97&3.38\end{array}\right]\equiv H_2 \end{aligned}$$

where H 2 is the matrix consisting of the first two columns of H. The row profile and the column profile points are plotted in Fig. 15.6.1 where r next to a point indicates a row point and c designates a column point. That is, \(i_r\) indicates the i-th row point and \(j_c\), the j-th column point. It can be seen from this plot that the row points are far apart while the second and third column points are somewhat close; accordingly, if necessary, the second and third columns could be combined.
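
The assembling of the coordinate matrices \(G=D_r^{-\frac{1}{2}}U\varLambda\) and \(H=D_c^{-\frac{1}{2}}V\varLambda\) can be sketched as follows in numpy; the diagonals of D r and D c below are the ones given above, while U, V and the λ j's are placeholder values inserted purely to show the mechanics.

```python
import numpy as np

def profile_coordinates(U, V, lam, dr, dc):
    """Row coordinates G = D_r^{-1/2} U Lambda and column coordinates H = D_c^{-1/2} V Lambda.
    U (r x k) and V (s x k) hold the eigenvectors, lam the k diagonal entries of Lambda,
    dr and dc the diagonal elements of D_r and D_c."""
    G = (U / np.sqrt(np.asarray(dr))[:, None]) * lam
    H = (V / np.sqrt(np.asarray(dc))[:, None]) * lam
    return G, H

# Placeholder eigenvectors and singular values (not the values of the numerical example)
U = np.array([[0.8, -0.6], [0.6, 0.8], [0.0, 0.0]])
V = np.array([[0.5, 0.5], [0.5, -0.5], [0.5, 0.5], [0.5, -0.5]])
lam = np.array([2.0, 1.0])

G, H = profile_coordinates(U, V, lam, [0.4, 0.4, 0.2], [0.30, 0.25, 0.30, 0.15])
# The rows of G and of H are then plotted as row-profile and column-profile points.
print(G)
print(H)
```
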

Figure 15.6.1 Row profile and column profile points

15.7. Correspondence Analysis in a Multi-way Contingency Table

When data are classified according to a number of variables, each variable having a number of categories, the resulting frequency table is referred to as a multi-way classification. Correspondence analysis for a multi-way classification involves converting the data into a two-way classification framework and then employing the techniques developed in Sects. 15.5 and 15.6. The first step in this regard consists of creating an indicator matrix C. In order to illustrate the steps, we will first present an example. Suppose that 10 persons, selected at random from a community, are classified according to three variables. Variable 1 is gender; under this variable, we consider the categories male and female. Variable 2 is weight; under this variable, we consider three categories: underweight, normal and overweight. The third variable is education, which is assumed to have four levels: level 1, level 2, level 3 and level 4. Thus, there are three variables and 9 categories. The actual data are provided in Table 15.7.1.

Table 15.7.1 Ten persons classified under three variables

Next, we construct the indicator matrix C—distinct from C as defined in (15.4.4)—of the data displayed in Table 15.7.1. If an item is present, we write 1 in the corresponding location in Table 15.7.2, and if it is absent, we write 0, thus populating this table where M ≡ Male, F ≡ Female, U ≡ underweight, N ≡ Normal, O ≡ overweight, L1 ≡ Level 1, L2 ≡ Level 2, L3 ≡ Level 3 and L4 ≡ Level 4. The resulting indicator matrix C is

Table 15.7.2 Entries of the indicator matrix of the data included in Table 15.7.1

Note that since a person belongs to exactly one category of each variable, every row sum equals the number of variables, which is 3 in this example. The sum of all the column entries under each variable is the number of items classified (10 in the example). We now convert the data into a two-way classification, which is achieved by converting C into a Burt matrix B, where B = C′C. In our example,

Observe that the diagonal blocks in C′C correspond to the variables gender, weight and educational level, that is, gender versus gender, weight versus weight and educational level versus educational level. These blocks are the following:

Various two-way contingency tables, namely gender versus weight, gender versus educational level and weight versus educational level, are thus combined into one two-way table displaying category versus category. The observed Pearson's χ 2 statistic from C′C is seen to be 79.85. In this case, the number of degrees of freedom is 8 × 8 = 64 and, at the 5% level, the tabulated \(\chi ^2_{64,0.05}\approx 84>79.85\); hence, the hypothesis of no association in C′C is not rejected. Note that this χ 2 approximation is unreliable since the expected frequencies are small. The most relevant parts of the Burt matrix C′C are the non-diagonal blocks of frequencies. The two non-diagonal blocks of the first two rows represent the two-way contingency tables for gender versus weight and gender versus educational level. Similarly, the non-diagonal block in the third to fifth rows represents the two-way contingency table for weight versus educational level. These are the following, denoted by A 1, A 2, A 3 respectively, where A 1 is the two-way contingency table of gender versus weight, A 2 is the contingency table of gender versus educational level and A 3 is the table of weight versus educational level:

The corresponding matrices of expected frequencies, under the hypothesis of no association between the characteristics of classification, denoted by E(A i), i = 1, 2, 3 are

The observed values of Pearson's χ 2 statistic under the hypothesis of no association in the contingency table, and the corresponding tabulated χ 2 critical values at the 5% significance level, are the following: \(A_1: ~\chi ^2=0.63,\ \chi ^2_{2,0.05}=5.99>0.63;~ A_2:~\chi ^2=2.36,\ \chi ^2_{3,0.05}=7.81>2.36;~ A_3:~ \chi ^2=5.42, \ \chi ^2_{6,0.05}=12.59>5.42\); hence, the hypothesis would not be rejected in any of the contingency tables if Pearson's statistic were applicable. Actually, the χ 2 approximation is not appropriate in any of these cases since the expected frequencies are quite small. Hence, decisions cannot be made on the basis of Pearson's statistic in these instances.
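
Since the indicator matrix, the Burt matrix and the blocks A 1, A 2, A 3 are not reproduced here, the following sketch indicates how the indicator matrix, the Burt matrix B = C′C and a block-wise χ 2 computation could be carried out; it uses a few hypothetical records with the same variable structure as Table 15.7.1, so none of the numbers it produces are those of the example.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical records with the same structure as Table 15.7.1 (not the actual data)
df = pd.DataFrame({
    "gender":    ["M", "F", "F", "M", "F"],
    "weight":    ["U", "N", "O", "N", "U"],
    "education": ["L1", "L3", "L2", "L4", "L2"],
})

# Indicator matrix C: one 0/1 column per category of every variable; each row sums to the
# number of variables and each variable's block of columns sums to the number of persons.
C = pd.get_dummies(df).astype(int)

# Burt matrix B = C'C: all category-versus-category two-way tables in a single array
B = C.T @ C

# A non-diagonal block, e.g. gender versus weight, is an ordinary two-way contingency table
A1 = pd.crosstab(df["gender"], df["weight"])
chi2, pval, dof, expected = chi2_contingency(A1, correction=False)
print(chi2, dof)      # Pearson's statistic and (r-1)(s-1) degrees of freedom
print(expected)       # expected frequencies under no association (unreliable here: small counts)
```
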

Observe that the first column of the matrix C corresponds to the count on “Male”, the second to the count on “Female”, the third to “Underweight”, the fourth to “Normal”, the fifth to “Overweight”, the sixth to “Level 1”, the seventh to “Level 2”, the eighth to “Level 3” and the ninth to “Level 4”. Thus, the columns represent the various variables and their categories. So, if we were to plot each column as a point in a two-dimensional space, then by looking at the points we could determine which ones are close to each other. For example, if the “Overweight” column point is close to the “Male” column point, then there is a possibility of association between “Overweight” and “Male”. Thus, our aim will be to plot each column of C or each column of C′C as a point in two dimensions. For this purpose, we may make use of the plotting technique described in Sects. 15.5 and 15.6.

Consider a singular value decomposition C = UΛV′, U′U = I k, V′V = I k. If C is r × s, s < r, then U is r × k and V is s × k where k is the number of nonzero eigenvalues of CC′, which are also those of C′C, and Λ = diag(λ 1, …, λ k) where \(\lambda _j^2,\ j=1,\ldots ,k,\) are the nonzero eigenvalues of CC′ and C′C. In the numerical example, r = 10 and s = 9. Consider the eigenvalues of C′C since, in this case, its order is smaller than that of CC′. Let the nonzero eigenvalues of C′C be \(\lambda _1^2\ge \cdots \ge \lambda _k^2\). From C′C, compute the normalized eigenvectors corresponding to these nonzero eigenvalues. This s × k matrix of normalized eigenvectors is V in the singular value decomposition. By using the same nonzero eigenvalues, compute the normalized eigenvectors from CC′. This r × k matrix is U in the singular value decomposition. Since the columns of C and C′C represent the various variables and their subdivisions, only the columns are useful for our geometrical representation, that is, only V will be relevant for plotting the points. Consider H = VΛ and let \(\lambda _1^2\ge \lambda _2^2\ge \cdots \ge \lambda _k^2\). Observe that C = UΛV′⇒ C′ = VΛU′ = HU′. The rows of C′ represent the various variables and their categories. Let h 1, …, h s be the rows of H. Then, we have

$$\displaystyle \begin{aligned} h_1U'&=\mbox{Men-row}\\ h_2U'&=\mbox{Women-row}\\ & \ \,\vdots\\ h_sU'&=\mbox{Level 4-row}.\end{aligned} $$

This shows that the rows h 1, …, h s represent the various variables and their categories. Since the first two eigenvalues are the largest ones and V 1, V 2 are the corresponding eigenvectors, we can take it for granted that most of the information about the various variables and their categories is contained in the first two elements in h 1, …, h s or in the first two columns weighted by λ 1 and λ 2. Accordingly, take the first two columns from H and denote this submatrix by H (2) where

Plot the points (h 11, h 12), (h 21, h 22), …, (h s1, h s2). These s points correspond to the s columns of the r × s matrix C or the s rows of C′.
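
The plotting step just described can be sketched as follows for a generic indicator matrix C; the small 0/1 matrix below is hypothetical and serves only to show the mechanics.

```python
import numpy as np

# Hypothetical indicator matrix C (one row per individual, one 0/1 column per category)
C = np.array([[1, 0, 0, 1, 0, 1, 0],
              [0, 1, 1, 0, 0, 0, 1],
              [1, 0, 0, 0, 1, 0, 1],
              [0, 1, 0, 1, 0, 1, 0]], dtype=float)

# Singular value decomposition C = U Lambda V'; the columns of V correspond to the categories
U, lam, Vt = np.linalg.svd(C, full_matrices=False)
k = int(np.sum(lam > 1e-12))
V, lam = Vt.T[:, :k], lam[:k]      # retain the nonzero singular values only

H = V * lam                        # H = V Lambda; its rows represent the categories
H2 = H[:, :2]                      # the first two columns carry the largest part of the inertia
# Each row of H2 gives a point (h_j1, h_j2) to be plotted, one point per category.
print(H2)
```
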

Referring to our numerical example, the eigenvalues are

$$\displaystyle \begin{aligned} \lambda_1^2&=11.66,\ \lambda_2^2=5.57,\ \lambda_3^2=5.28,\ \lambda_4^2=3.47,\ \lambda_5^2=2.34,\\ \lambda_6^2&=1.14,\ \lambda_7^2=0.54,\ \lambda_8^2=\lambda_9^2=0,\end{aligned} $$

so that k = 7 and the corresponding nonzero \(\lambda _j=\sqrt {\lambda _j^2},\ j=1,\ldots ,7\), are

$$\displaystyle \begin{aligned}\lambda_1=3.41,\ \lambda_2=2.36,\ \lambda_3=2.30,\ \lambda_4=1.86,\ \lambda_5=1.53,\ \lambda_6=1.07,\ \lambda_7=0.73.\end{aligned}$$

Thus, the matrix Λ is

$$\displaystyle \begin{aligned}\varLambda=\text{diag}(3.41,2.36,2.30,1.86,1.53,1.07,0.73), \end{aligned}$$

and the

$$\displaystyle \begin{aligned}\mbox{total inertia} =11.66+5.57+5.28+3.47+2.34+1.14+0.54=30=\text{tr}(C'C). \end{aligned}$$

Noting that \(\frac {11.66}{30}=0.39\) and \(\frac {(11.66+5.57+5.28)}{30}=0.75,\) we can assert that 75% of the inertia is accounted for by the first three eigenvalues of C′C.
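
These proportions are easily checked from the eigenvalues listed above, for instance in Python:

```python
eig = [11.66, 5.57, 5.28, 3.47, 2.34, 1.14, 0.54]   # nonzero eigenvalues of C'C listed above
total = sum(eig)                                    # total inertia = 30
print(total, round(eig[0] / total, 2), round(sum(eig[:3]) / total, 2))   # 30.0 0.39 0.75
```
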

The normalized eigenvectors of C′C, which correspond to the nonzero eigenvalues and are denoted by V = [V 1, …, V 7], are the following:

$$\displaystyle \begin{aligned} V_1\!&=\!\left[ \begin{array}{r} 0.293826 \\ 0.615045 \\ 0.138401 \\ 0.36352 \\ 0.406951 \\ 0.248836 \\ 0.173839 \\ 0.306906 \\ 0.179291 \\ \end{array} \right]\!,~\!V_2\!=\!\left[ \begin{array}{r} -0.711194 \\ 0.512432 \\ -0.0995834 \\ -0.106977 \\ 0.00779839 \\ -0.386116 \\ -0.0833586 \\ 0.041766 \\ 0.228947 \\ \end{array} \right]\!,\!~V_3\!=\!\left[ \begin{array}{r} 0.123362 \\ -0.0941084 \\ -0.0879546 \\ 0.638327 \\ -0.521119 \\ -0.428293 \\ 0.0446188 \\ 0.302598 \\ 0.110329 \\ \end{array} \right]\!,~\!V_4\!=\!\left[ \begin{array}{r} 0.0732966 \\ 0.107206 \\ 0.601988 \\ -0.03252 \\ -0.388966 \\ 0.167134 \\ -0.164402 \\ -0.357001 \\ 0.534772 \\ \end{array} \right],\\ V_5&=\left[ \begin{array}{r} 0.0406003 \\ 0.0238456 \\ -0.124608 \\ 0.110633 \\ 0.0784214 \\ -0.206691 \\ 0.754879 \\ -0.584142 \\ 0.1004 \\ \end{array} \right],~V_6=\left[ \begin{array}{r} -0.115587 \\ -0.0128351 \\ -0.572877 \\ 0.408319 \\ 0.0361355 \\ 0.400243 \\ -0.367304 \\ -0.382451 \\ 0.22109 \\ \end{array} \right],~V_7=\left[ \begin{array}{r} 0.320998 \\ -0.262335 \\ -0.162227 \\ -0.195339 \\ 0.416228 \\ -0.427087 \\ -0.191702 \\ 0.0724589 \\ 0.604993 \\ \end{array} \right].\end{aligned} $$

Then, the first two eigenvectors weighted by λ 1 and λ 2 and the points to be plotted are

$$\displaystyle \begin{aligned} \lambda_1V_1&=3.41472\left[ \begin{array}{r} 0.293826 \\ 0.615045 \\ 0.138401 \\ 0.36352 \\ 0.406951 \\ 0.248836 \\ 0.173839 \\ 0.306906 \\ 0.179291 \\ \end{array} \right]=\left[ \begin{array}{r} 1.00333 \\ 2.10021 \\ 0.4726 \\ 1.24132 \\ 1.38962 \\ 0.849704 \\ 0.593611 \\ 1.048 \\ 0.612228 \\ \end{array} \right],\end{aligned} $$
$$\displaystyle \begin{aligned} \lambda_2V_2=2.36098\left[ \begin{array}{r} -0.711194 \\ 0.512432 \\ -0.0995834 \\ -0.106977 \\ 0.00779839 \\ -0.386116 \\ -0.0833586 \\ 0.041766 \\ 0.228947 \\ \end{array} \right] =\left[ \begin{array}{r} -1.67911 \\ 1.20984 \\ -0.235114 \\ -0.25257 \\ 0.0184118 \\ -0.911613 \\ -0.196808 \\ 0.0986087 \\ 0.540539 \\ \end{array} \right];\end{aligned} $$
$$\displaystyle \begin{aligned} \mbox{Points to be plotted}:\left[\begin{array}{r} (1.00333,-1.67911)\\ (2.10021, 1.20984)\\ (0.4726, -0.235114)\\ (1.24132, -0.25257)\\ (1.38962,0.0184118)\\ (0.849704, -0.911613)\\ (0.593611,-0.196808)\\ (1.048,0.0986087)\\ (0.612228,0.540539) \end{array}\right]\leftrightarrow\left[\begin{array}{c} \mbox{Men}\\ \mbox{Women}\\ \mbox{Underweight}\\ \mbox{Normal}\\ \mbox{Overweight}\\ \mbox{Level 1}\\ \mbox{Level 2}\\ \mbox{Level 3}\\ \mbox{Level 4}\end{array}\right].\end{aligned} $$

The plot of these points is displayed in Fig. 15.7.1.
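
A figure along the lines of Fig. 15.7.1 can be generated, for instance, with matplotlib from the nine points listed above; the coordinates and labels are those of the table, while the styling choices are arbitrary.

```python
import matplotlib.pyplot as plt

# The nine category points computed above: (lambda_1 V_1, lambda_2 V_2) componentwise
points = {
    "Men": (1.00333, -1.67911),          "Women": (2.10021, 1.20984),
    "Underweight": (0.4726, -0.235114),  "Normal": (1.24132, -0.25257),
    "Overweight": (1.38962, 0.0184118),  "Level 1": (0.849704, -0.911613),
    "Level 2": (0.593611, -0.196808),    "Level 3": (1.048, 0.0986087),
    "Level 4": (0.612228, 0.540539),
}

fig, ax = plt.subplots()
for label, (x, y) in points.items():
    ax.scatter(x, y, color="black", s=15)
    ax.annotate(label, (x, y), textcoords="offset points", xytext=(4, 4))
ax.set_xlabel("first weighted component")
ax.set_ylabel("second weighted component")
plt.show()
```
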

Figure 15.7.1 Multiple contingency plot

It is seen from the points plotted in Fig. 15.7.1 that the categories underweight and educational level 2 are somewhat close to each other, which is indicative of a possible association, whereas the categories underweight and women are the farthest apart.

Exercises 15 (continued)

15.6

In the following two-way contingency table, where the entries in the cells are frequencies, (1) calculate Pearson’s χ 2 statistic and give the representations in (15.5.1)–(15.5.6); (2) plot the row profiles; (3) plot the column profiles:

15.7

Repeat Exercise 15.6 for the following two-way contingency table:

15.8

For the data in (1) Exercise 15.6, (2) Exercise 15.7, and by using the notations defined in Sects. 15.5 and 15.6, compute the following items: Estimates of (i) \(A=D_r^{-\frac {1}{2}}(P-RC')D_c^{-1}(P-RC')'D_r^{-\frac {1}{2}}\); (ii) Eigenvalues of A and tr(A); (iii) Total inertia and proportions of inertia accounted for by the eigenvalues; (iv) The matrix of row-profiles; (v) The matrix of column-profiles, and make comments.

15.9

Referring to Exercises 15.6 and 15.7, plot the row profiles and column profiles and make comments.

15.10

In a used car lot, there are high price, average price and low price cars; the cars come in the following colors: red, white, blue and silver, and the paint finish is either mat or shiny. Fourteen customers bought vehicles from this car lot; their preferences are given next. (1) Carry out a multiple correspondence analysis, plot the column profiles and make comments; (2) Create individual two-way contingency tables, analyze these tables and make comments. The following are the data, where the first column indicates the customer's serial number:

1   Low price       white color    mat finish
2   Low price       red color      shiny finish
3   Average price   silver color   shiny finish
4   High price      red color      shiny finish
5   High price      blue color     shiny finish
6   Average price   white color    mat finish
7   Average price   blue color     mat finish
8   High price      blue color     shiny finish
9   High price      red color      mat finish
10  Average price   silver color   mat finish
11  Low price       white color    shiny finish
12  Average price   white color    mat finish
13  Average price   silver color   shiny finish
14  Low price       white color    shiny finish