Special Matrices and Operations Useful in Modeling and Data Analysis

James E. Gentle

Abstract

In previous chapters, we encountered a number of special matrices, such as symmetric matrices, banded matrices, elementary operator matrices, and so on. In this chapter, we will discuss some of these matrices in more detail and also introduce some other special matrices and data structures that are useful in statistics.

There are a number of special kinds of matrices that are useful in statistical applications. In statistical applications in which data analysis is the objective, the initial step is the representation of observational data in some convenient form, which often is a matrix. The matrices for operating on observational data or summarizing the data often have special structures and properties. We discuss the representation of observations using matrices in Sect. 8.1.

In Sect. 8.2, we review and discuss some of the properties of symmetric matrices.

One of the most important properties of many matrices occurring in statistical data analysis is nonnegative or positive definiteness; this is the subject of Sects. 8.3 and 8.4.

Fitted values of a response variable that are associated with given values of covariates in linear models are often projections of the observations onto a subspace determined by the covariates. Projection matrices and Gramian matrices useful in linear models are considered in Sects. 8.5 and 8.6.

Another important property of many matrices occurring in statistical modeling is irreducible nonnegativeness or positiveness; this is the subject of Sect. 8.7.

Many of the defining properties of the special matrices discussed in this chapter are invariant under scalar multiplication; hence, the special matrices are members of cones. Interestingly even further, convex combinations of some types of special matrices yield special matrices of the same type; hence, those special matrices are members of convex cones (see Sect.  2.2.8 beginning on page 43).

8.1 Data Matrices and Association Matrices

There are several ways that data can be organized for representation in the computer. We distinguish logical structures from computer-storage structures. Data structure in computers is an important concern and can greatly affect the efficiency of computer processing. We discuss some simple aspects of the organization for computer storage in Sect.  11.1, beginning on page 523. In the present section, we consider some general issues of logical organization and structure.

There are two important aspects of data in applications that we will not address here. One is metadata; that is, data about the data. Metadata includes names or labels associated with data, information about how and when the data were collected, information about how the data are stored in the computer, and so on. Another important concern in applications is missing data. In real-world applications it is common to have incomplete data. If the data are stored in some structure that naturally contains a cell or a region for the missing data, the computer representation of the dataset must contain some indication that the cell is empty. For numeric data, the convenient way of doing this is by using “not-available”, NA (see page 466), or “not-a-number”, NaN (see page 475). The effect on a statistical analysis when some data are missing varies with the type of analysis. We consider some effects of missing data on the estimation of variance-covariance matrices in Sect.  9.5.6, beginning on page 523.

8.1.1 Flat Files

If several features or attributes are observed on each of several entities, a convenient way of organizing the data is as a two-dimensional array with each column corresponding to a specific feature and each row corresponding to a specific observational entity. In the field of statistics, data for the features are stored in “variables”, the entities are called “observational units”, and a row of the array is called an “observation” (see Fig. 8.1).
Figure 8.1: Data appropriate for representation in a flat file

The data may be various types of objects, such as names, real numbers, numbers with associated measurement units, sets, vectors, and so on. If the data are represented as real numbers, the data array is a matrix. (Note again our use of the word “matrix”; not just any rectangular array is a matrix in the sense used in this book.) Other types of data can often be made equivalent to a matrix in an intuitive manner.

The flat file arrangement emphasizes the relationships of the data both within an observational unit or row and within a variable or column. Simple operations on the data matrix may reveal relationships among observational units or among variables.

Flat files are the appropriate data structure for the analysis of linear models, but statistics is not just about analysis of linear models anymore. (It never was.)

8.1.2 Graphs and Other Data Structures

If the number of measurements on the observational units varies or if the interest is primarily in simple relationships among observational units or among variables, the flat file structure may not be very useful. Sometimes a graph structure can be used advantageously.

A graph is a nonempty set V of points, called vertices, together with a collection E of unordered pairs of elements of V, called edges. (Other definitions of “graph” allow the null set to be a graph.) If we let \(\mathcal{G}\) be a graph, we represent it as (V, E). We often represent the set of vertices as \(V (\mathcal{G})\) and the collection of edges as \(E(\mathcal{G})\). An edge is said to be incident on each vertex in the edge. The number of vertices (that is, the cardinality of V ) is the order of the graph, and the number of edges, the cardinality of E, is the size of the graph.

An edge in which the two vertices are the same is called a loop. If two or more elements of \(E(\mathcal{G})\) contain the same two vertices, those edges are called multiple edges, and the graph itself is called a multigraph. A graph with no loops and with no multiple edges is called a simple graph. In some literature, “graph” means “simple graph”, as I have defined it; and a graph as I have defined it that is not simple is called a “pseudograph”.

A path or walk is a sequence of edges, \(e_1,\ldots,e_n\), such that for \(i\geq 2\) one vertex in \(e_i\) is a vertex in edge \(e_{i-1}\). Alternatively, a path or walk is defined as a sequence of vertices with common edges.

A graph such that there is a path that includes any pair of vertices is said to be connected.

A graph with more than one vertex such that all possible pairs of vertices occur as edges is a complete graph.

A closed path or closed walk is a path such that a vertex in the first edge (or the first vertex in the alternate definition) is in the last edge (or the last vertex).

A cycle is a closed path in which all vertices occur exactly twice (or in the alternate definition, in which all vertices except the first and the last are distinct). A graph with no cycles is said to be acyclic. A connected acyclic graph is called a tree. Trees are used extensively in statistics to represent clusters.

The number of edges that contain a given vertex (that is, the number of edges incident on the vertex v), denoted by d(v), is the degree of the vertex.

A vertex with degree 0 is said to be isolated.

We see immediately that the sum of the degrees of all vertices equals twice the number of edges, that is,
$$\displaystyle{\sum d(v_{i}) = 2\#(E).}$$
The sum of the degrees hence must be an even number.

A regular graph is one for which \(d(v_i)\) is constant for all vertices \(v_i\); more specifically, a graph is k-regular if \(d(v_i) = k\) for all vertices \(v_i\).

The natural data structure for a graph is a pair of lists, but a graph is often represented graphically (no pun!) as in Fig. 8.2, which shows a graph with five vertices and seven edges. While a matrix is usually not an appropriate structure for representing raw data from a graph, there are various types of matrices that are useful for studying the data represented by the graph, which we will discuss in Sect. 8.8.9.
Figure 8.2: A simple graph

If \(\mathcal{G}\) is the graph represented in Fig. 8.2, the vertices are
$$\displaystyle{V (\mathcal{G}) =\{ a,b,c,d,e\}}$$
and the edges are
$$\displaystyle{E(\mathcal{G}) =\{ (a,b),(a,c),(a,d),(a,e),(b,e),(c,d),(d,e)\}.}$$

The presence of an edge between two vertices can indicate the existence of a relationship between the objects represented by the vertices. The graph represented in Fig. 8.2 may represent five observational units for which our primary interest is in their relationships with one another. For example, the observations may be authors of scientific papers, and an edge between two authors may represent the fact that the two have been coauthors on some paper.

The same information represented in the graph of order 5 in Fig. 8.2 may be represented in a 5 × 5 rectangular array, as in Fig. 8.3.
Figure 8.3: An alternate representation

In the graph represented in Fig. 8.2, there are no isolated vertices and the graph is connected. (Note that a graph with no isolated vertices is not necessarily connected.) The graph represented in Fig. 8.2 is not complete because, for example, there is no edge that contains vertices c and e. The graph is cyclic because of the closed path (defined by vertices) (c, d, e, b, a, c). Note that the closed path (c, d, a, e, b, a, c) is not a cycle.

This use of a graph immediately suggests various extensions of a basic graph. For example, E may be a multiset, with multiple instances of edges containing the same two vertices, perhaps, in the example above, representing multiple papers in which the two authors are coauthors. As we stated above, a graph in which E is a multiset is called a multigraph. Instead of just the presence or absence of edges between vertices, a weighted graph may be more useful; that is, one in which a real number is associated with a pair of vertices to represent the strength of the relationship, not just presence or absence, between the two vertices. A degenerate weighted graph (that is, an unweighted graph as discussed above) has weights of 0 or 1 between all vertices. A multigraph is a weighted graph in which the weights are restricted to nonnegative integers. Although the data in a weighted graph carry much more information than a graph with only its edges, or even a multigraph that allows strength to be represented by multiple edges, the simplicity of a graph sometimes recommends its use even when there are varying degrees of strength of relationships. A standard approach in applications is to set a threshold for the strength of relationship and to define an edge only when the threshold is exceeded.

8.1.2.1 Adjacency Matrix: Connectivity Matrix

The connections between vertices in the graphs shown in Fig. 8.2 or in Fig. 8.4 can be represented in an association matrix called an adjacency matrix, a connectivity matrix, or an incidence matrix to represent edges between vertices, as shown in equation (8.1). (The terms “adjacency”, “connectivity”, and “incidence” are synonymous. “Adjacency” is perhaps the most commonly used term, but I will naturally use both that term and “connectivity” because of the connotative value of the latter term.) The graph, \(\mathcal{G}\), represented in Fig. 8.2 has the symmetric adjacency matrix
$$\displaystyle{ A(\mathcal{G}) = \left [\begin{array}{rrrrr} 0&1&1&1&1\\ 1 &0 &0 &0 &1 \\ 1&0&0&1&0\\ 1 &0 &1 &0 &1 \\ 1&1&0&1&0\\ \end{array} \right ]. }$$
(8.1)
As above, we often use this kind of notation; a symbol, such as \(\mathcal{G}\), represents a particular graph, and other objects that relate to the graph make use of that symbol.
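As a concrete illustration of the construction, the following sketch (a minimal example assuming only NumPy is available; the variable names are illustrative, not from the text) builds the adjacency matrix of equation (8.1) from the edge list of the graph in Fig. 8.2 and checks the degree-sum identity given earlier.

```python
import numpy as np

# Vertices and edges of the graph represented in Fig. 8.2
vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"),
         ("b", "e"), ("c", "d"), ("d", "e")]

index = {v: i for i, v in enumerate(vertices)}
A = np.zeros((len(vertices), len(vertices)), dtype=int)
for u, v in edges:
    A[index[u], index[v]] = 1
    A[index[v], index[u]] = 1           # nondirected graph: A is symmetric

assert (A == A.T).all()                 # adjacency matrix of a graph is symmetric
degrees = A.sum(axis=1)                 # d(v_i) for each vertex
assert degrees.sum() == 2 * len(edges)  # sum of the degrees equals 2 #(E)
print(A)                                # reproduces the matrix in equation (8.1)
```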
Figure 8.4: A simple digraph

There is no difference between the connectivity matrix and a table such as that in Fig. 8.3 except for the metadata.

A graph as we have described it is “nondirected”; that is, an edge has no direction. The edge (a, b) is the same as (b, a). An adjacency matrix for a nondirected graph is symmetric.

An interesting property of an adjacency matrix, which we will discuss further on page 393, is that if A is the adjacency matrix of a graph, then \((A^k)_{ij}\) is the number of paths of length k between nodes i and j in that graph. (See Exercise 8.20.)
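A quick numerical check of this property for k = 2 (a sketch assuming NumPy; for k = 2 and distinct vertices of a simple graph, walks and paths coincide):

```python
import numpy as np

# Adjacency matrix A(G) of equation (8.1); rows and columns ordered a, b, c, d, e
A = np.array([[0, 1, 1, 1, 1],
              [1, 0, 0, 0, 1],
              [1, 0, 0, 1, 0],
              [1, 0, 1, 0, 1],
              [1, 1, 0, 1, 0]])

A2 = np.linalg.matrix_power(A, 2)
# The two paths of length 2 from b to d in Fig. 8.2 are b-a-d and b-e-d.
assert A2[1, 3] == 2
# The diagonal of A^2 recovers the vertex degrees (closed walks i-j-i of length 2).
assert (np.diag(A2) == A.sum(axis=1)).all()
```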

The adjacency matrix for a graph with no loops is hollow; that is, all diagonal elements are 0s. Another common way of representing a hollow adjacency matrix is to use −1s in place of the off-diagonal zeroes; that is, the absence of a connection between two different vertices is denoted by −1 instead of by 0. Such a matrix is called a Seidel adjacency matrix. (This matrix has no relationship to the Gauss-Seidel method discussed in Chap.  6.)

The relationship can obviously be defined in the other direction; that is, given an n × n symmetric matrix A, we define the graph of the matrix as the graph with n vertices and edges between vertices i and j if a ij ≠ 0. We often denote the graph of the matrix A by \(\mathcal{G}(A)\).

Generally we restrict the elements of the connectivity matrix to be 1 or 0 to indicate only presence or absence of a connection, but not to indicate strength of the connection. In this case, a connectivity matrix is a nonnegative matrix; that is, all of its elements are nonnegative. We indicate that a matrix A is nonnegative by
$$\displaystyle{A \geq 0.}$$
We discuss the notation and properties of nonnegative (and positive) matrices in Sect. 8.7.

8.1.2.2 Digraphs

Another extension of a basic graph is one in which the relationship may not be the same in both directions. This yields a digraph, or “directed graph”, in which the edges are ordered pairs called directed edges. The vertices in a digraph have two kinds of degree, an indegree and an outdegree, with the obvious meanings.

The simplest applications of digraphs are for representing networks. Consider, for example, the digraph represented by the network in Fig. 8.4. This is a network with five vertices, perhaps representing cities, and directed edges between some of the vertices. The edges could represent airline connections between the cities; for example, there are flights from x to u and from u to x, and from y to z, but not from z to y.

Figure 8.4 represents a digraph with order 5 (there are five vertices) and size 11 (eleven directed edges). A sequence of edges, e1, , e n , constituting a path in a digraph must be such that for i ≥ 2 the first vertex in e i is the second vertex in edge e i−1. For example, the sequence x, y, z, w, u, x in the graph of Fig. 8.4 is a path (in fact, a cycle) but the sequence x, u, w, z, y, x is not a path.

The connectivity matrix for the digraph in Fig. 8.4 with nodes ordered as u, w, x, y, z is
$$\displaystyle{ C = \left [\begin{array}{rrrrr} 0&1&1&1&1\\ 1 &0 &0 &0 &0 \\ 1&0&0&1&0\\ 1 &0 &0 &0 &1 \\ 1&1&0&0&0\\ \end{array} \right ]. }$$
(8.2)
A connectivity matrix for a (nondirected) graph is symmetric, but for a digraph it is not necessarily symmetric. Given an n × n matrix A, we define the digraph of the matrix as the digraph with n vertices and edges from vertex i to j if a ij ≠ 0. We use the same notation for a digraph as we used above for a graph, \(\mathcal{G}(A)\).

In statistical applications, graphs are used for representing symmetric associations. Digraphs are used for representing asymmetric associations or one-way processes such as a stochastic process.

In a simple digraph, the edges only indicate the presence or absence of a relationship, but just as in the case of a simple graph, we can define a weighted digraph by associating nonnegative numbers with each directed edge.

Graphical modeling is useful for analyzing relationships between elements of a collection of sets. For example, in an analysis of internet traffic, profiles of users may be constructed based on the set of web sites each user visits in relation to the sets visited by other users. For this kind of application, an intersection graph may be useful. An intersection graph, for a given collection of sets \(\mathcal{S}\), is a graph whose vertices correspond to the sets in \(\mathcal{S}\) and that has an edge between two vertices if and only if the corresponding sets have a common element.

The word “graph” is often used without qualification to mean any of these types.

8.1.2.3 Connectivity of Digraphs

There are two kinds of connected digraphs. A digraph such that there is a (directed) path that includes any pair of vertices is said to be strongly connected. A digraph such that there is a path without regard to the direction of any edge that includes any pair of vertices is said to be weakly connected. The digraph shown in Fig. 8.4 is strongly connected. The digraph shown in Fig. 8.5 is weakly connected but not strongly connected.
Figure 8.5: A digraph that is not strongly connected

A digraph that is not weakly connected must have two sets of nodes with no edges between any nodes in one set and any nodes in the other set.

The connectivity matrix of the digraph in Fig. 8.5 is
$$\displaystyle{ C = \left [\begin{array}{rrrrr} 0&1&1&0&1\\ 0 &0 &0 &0 &0 \\ 0&0&0&1&0\\ 1 &0 &0 &0 &1 \\ 0&1&0&0&0\\ \end{array} \right ]. }$$
(8.3)
The matrix of a digraph that is not strongly connected can always be reduced to a special block upper triangular form by row and column permutations; that is, if the digraph \(\mathcal{G}\) is not strongly connected, then there exists a permutation matrix E(π) such that
$$\displaystyle{ E_{(\pi )}A(\mathcal{G})E_{(\pi )}^{\mathrm{T}} = \left [\begin{array}{cc} B_{11} & B_{12} \\ 0 &B_{22} \end{array} \right ], }$$
(8.4)
where B11 and B22 are square. Such a transformation is called a symmetric permutation.
Later we will formally prove this relationship between strong connectivity and this reduced form of the matrix, but first we consider the matrix in equation (8.3). If we interchange the second and fourth columns and rows, we get the reduced form
$$\displaystyle{\left [\begin{array}{rrrrr} 0&0&1&1&1\\ 1&0&0&0&1\\ 0&1&0&0&0\\ 0&0&0&0&0\\ 0&0&0&1&0\\ \end{array} \right ],}$$
in which \(B_{11}\) is the leading 3 × 3 block, \(B_{22}\) is the trailing 2 × 2 block, and the lower left 2 × 3 block is 0, as in equation (8.4).
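The same reduction can be carried out mechanically. The following sketch (assuming NumPy; names are illustrative) applies the symmetric permutation that interchanges the second and fourth rows and columns of the matrix in equation (8.3) and verifies that the lower left block of the result is 0.

```python
import numpy as np

# Connectivity matrix of equation (8.3); rows and columns ordered u, w, x, y, z
C = np.array([[0, 1, 1, 0, 1],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 1, 0],
              [1, 0, 0, 0, 1],
              [0, 1, 0, 0, 0]])

perm = [0, 3, 2, 1, 4]                 # interchange the second and fourth indices
E = np.eye(5, dtype=int)[perm]         # permutation matrix E_(pi)
C_reduced = E @ C @ E.T                # symmetric permutation

print(C_reduced)
assert not C_reduced[3:, :3].any()     # the lower left 2 x 3 block is 0, as in (8.4)
```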

8.1.2.4 Irreducible Matrices

Any nonnegative square matrix that can be permuted into the form in equation (8.4) with square diagonal submatrices is said to be reducible; a matrix that cannot be put into that form is irreducible. We also use the terms reducible and irreducible to refer to the graph itself.

Irreducible matrices have many interesting properties, some of which we will discuss in Sect. 8.7.3, beginning on page 375. The implication (8.77) in that section provides a simple characterization of irreducibility.

8.1.2.5 Strong Connectivity of Digraphs and Irreducibility of Matrices

A nonnegative matrix is irreducible if and only if its digraph is strongly connected. Stated another way, a digraph is not strongly connected if and only if its matrix is reducible.

To see this, first consider a reducible matrix. In its reduced form of equation (8.4), none of the nodes corresponding to the last rows have directed edges leading to any of the nodes corresponding to the first rows; hence, the digraph is not strongly connected.

Now, assume that a given digraph \(\mathcal{G}\) is not strongly connected. In that case, there is some node, say the ith node, from which there is no directed path to some other node. Assume that there are m − 1 nodes that can be reached from node i. If m = 1, then we have a trivial partitioning of the n × n connectivity matrix in which B11 of equation (8.4) is (n − 1) × (n − 1) and B22 is the 1 × 1 zero matrix (that is, \(0_1\)). If m > 1, perform symmetric permutations so that the rows corresponding to node i and the other m − 1 nodes are the last m rows of the permuted connectivity matrix. In this case, the first n − m elements in each of those rows must be 0. To see that this must be the case, let k > n − m and j ≤ n − m, and assume that the element in the (k, j)th position is nonzero. Then, since node k is reachable from node i (or is node i itself), there would be a path from node i to node k to node j; but node j is among the nodes that cannot be reached from node i, which is a contradiction; hence, the (k, j)th element (in the permuted matrix) must be 0. The submatrix corresponding to B11 is (n − m) × (n − m), and that corresponding to B22 is m × m. These properties also hold for connectivity matrices with simple loops (with 1s on the diagonal) and for an augmented connectivity matrix (see page 393).
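This equivalence is easy to check computationally. The sketch below (assuming NumPy; the function name is illustrative) uses a standard characterization of irreducibility, of the kind referred to above in connection with Sect. 8.7.3: a nonnegative n × n matrix C is irreducible if and only if \((I + C)^{n-1}\) has no zero entries. It is applied here to the connectivity matrices in equations (8.2) and (8.3).

```python
import numpy as np

def is_irreducible(C):
    """Irreducibility of a nonnegative square matrix C (equivalently, strong
    connectivity of its digraph), checked via (I + C)^(n-1) having no zero entries."""
    n = C.shape[0]
    M = np.linalg.matrix_power(np.eye(n) + C, n - 1)
    return bool((M > 0).all())

# Equation (8.2): the digraph of Fig. 8.4, which is strongly connected
C_strong = np.array([[0, 1, 1, 1, 1],
                     [1, 0, 0, 0, 0],
                     [1, 0, 0, 1, 0],
                     [1, 0, 0, 0, 1],
                     [1, 1, 0, 0, 0]])

# Equation (8.3): the digraph of Fig. 8.5, which is not strongly connected
C_weak = np.array([[0, 1, 1, 0, 1],
                   [0, 0, 0, 0, 0],
                   [0, 0, 0, 1, 0],
                   [1, 0, 0, 0, 1],
                   [0, 1, 0, 0, 0]])

assert is_irreducible(C_strong)        # irreducible matrix, strongly connected digraph
assert not is_irreducible(C_weak)      # reducible matrix, not strongly connected
```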

Reducibility plays an important role in the analysis of Markov chains (see Sect.  9.8.1).

8.1.3 Term-by-Document Matrices

An interesting area of statistical application is in clustering and classifying documents. In the simpler cases, the documents consist of text only, and much recent research has been devoted to “text data-mining”. (The problem is not a new one; Mosteller and Wallace (1963) studied a related problem.)

We have a set of text documents (often called a “corpus”). A basic set of data to use in studying the text documents is the term-document matrix, which is a matrix whose columns correspond to documents and whose rows correspond to the various terms used in the documents. The terms are usually just words. The entries in the term-document matrix are measures of the importance of the term in the document. Importance may be measured simply by the number of times the word occurs in the document, possibly weighted by some measure of the total number of words in the document. In other measures of the importance, the relative frequency of the word in the given document may be adjusted for the relative frequency of the word in the corpus. (Such a measure is called term frequency-inverse document frequency or tf-idf.) Certain common words, such as “the”, called “stop-words”, may be excluded from the data. Also, words may be “stemmed”; that is, they may be associated with a root that ignores modifying letters such as an “s” in a plural or an “ed” in a past-tense verb. Other variations include accounting for sequences of terms (called “collocation”).
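As a small illustration of these ideas, the following sketch (assuming NumPy; the tiny corpus, the stop-word list, and the particular tf-idf weighting are made up for the example) builds a term-document matrix of raw counts and then reweights it by one common form of tf-idf. A real application would also handle stemming, much larger vocabularies, and sparse storage.

```python
import numpy as np

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "stock prices rose as markets rallied"]
stop_words = {"the", "on", "as"}

tokens = [[w for w in d.split() if w not in stop_words] for d in docs]
terms = sorted({w for t in tokens for w in t})   # rows = terms, columns = documents

counts = np.zeros((len(terms), len(docs)))       # raw term-document counts
for j, t in enumerate(tokens):
    for w in t:
        counts[terms.index(w), j] += 1

# One common tf-idf variant: term frequency times log(#documents / document frequency)
df = (counts > 0).sum(axis=1)                    # number of documents containing each term
tfidf = counts * np.log(len(docs) / df)[:, None]

print(terms)
print(tfidf.round(2))
```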

There are several interesting problems that arise in text analysis. Term-document matrices can be quite large. Terms are often misspelled. The documents may be in different languages. Some terms may have multiple meanings. The documents themselves may be stored on different computer systems.

Two types of manipulation of the term-document matrix are commonly employed in the analysis. One is singular-value decomposition (SVD), and the other is nonnegative matrix factorization (NMF).

SVD is used in what is called “latent semantic analysis” (“LSA”), or “latent semantic indexing” (“LSI”). A variation of latent semantic analysis is “probabilistic” latent semantic analysis, in which a probability distribution is assumed for the words. In a specific instance of probabilistic latent semantic analysis, the probability distribution is a Dirichlet distribution. (Most users of this latter method are not statisticians; they call the method “LDA”, for “latent Dirichlet allocation”.)

NMF is often used in determining which documents are similar to each other or, alternatively, which terms are clustered in documents. If A is the term-document matrix and A = WH is a nonnegative factorization, then the element \(w_{ij}\) can be interpreted as the degree to which term i belongs to cluster j, while the element \(h_{ij}\) can be interpreted as the degree to which document j belongs to cluster i. A method of clustering the documents is to assign document j (corresponding to the jth column of A) to the kth cluster if \(h_{kj}\) is the maximum element of the jth column of H.
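The clustering rule just described might be sketched as follows, here using the NMF implementation in scikit-learn (an assumed dependency; any nonnegative factorization A ≈ WH would serve, and the small random matrix merely stands in for a real term-document matrix).

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((30, 8))           # stand-in term-document matrix: 30 terms, 8 documents

k = 3                             # number of clusters (the factorization rank)
model = NMF(n_components=k, init="nndsvd", max_iter=500)
W = model.fit_transform(A)        # 30 x k: w_ij, degree to which term i belongs to cluster j
H = model.components_             # k x 8:  h_ij, degree to which document j belongs to cluster i

# Assign document j to the cluster whose row of H has the largest entry in column j.
doc_cluster = H.argmax(axis=0)
print(doc_cluster)
```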

There is a wealth of literature on text data-mining, but we will not discuss the analysis methods further.

8.1.4 Probability Distribution Models

Probability models in statistical data analysis are often multivariate distributions, and hence, matrices arise in the model. In the analysis itself, matrices that represent associations are computed from the observational data. In this section we mention some matrices in the models, and in the next section we refer to some association matrices that are computed in the analysis.

Data in rows of flat files are often assumed to be realizations of vector random variables, some elements of which may have a degenerate distribution (that is, the elements in some columns of the data matrix may be considered to be fixed rather than random). The data in one row are often considered independent of the data in another row. Statistical data analysis is generally concerned with studying various models of relationships among the elements of the vector random variables. For example, the familiar linear regression model relates one variable (one column) to a linear combination of other variables plus a translation and random noise.

A random graph of fixed order is a discrete probability space over all possible graphs of that order. For a graph of order n, there are \(2^{{n\choose 2}}\) possible graphs. Asymptotic properties of the probability distribution refer to the increase of the order without limit. Occasionally it is useful to consider the order of the graph to be random also. If the order is unrestricted, the sample space for a random graph of random order is infinite but countable. The number of digraphs of order n is \(4^{{n\choose 2}}\).

Random graphs have many uses in the analysis of large systems of interacting objects; for example, a random intersection graph may be used to make inferences about the clustering of internet users based on the web sites they visit.

8.1.5 Derived Association Matrices

In data analysis, the interesting questions usually involve the relationships among the variables or among the observational units. Matrices formed from the original data matrix for the purpose of measuring these relationships are called association matrices. There are basically two types: similarity and dissimilarity matrices. The variance-covariance matrix, which we discuss in Sect. 8.6.3, is an example of an association matrix that measures similarity. We discuss dissimilarity matrices in Sect. 8.6.6 and in Sect. 8.8.9 discuss a type of similarity matrix for data represented in graphs.

In addition to the distinction between similarity and dissimilarity association matrices, we may identify two types of association matrices based on whether the relationships of interest are among the rows (observations) or among the columns (variables or features). In applications, dissimilarity relationships among rows tend to be of more interest, and similarity relationships among columns are usually of more interest. (The applied statistician may think of clustering, multidimensional scaling, or Q factor analysis for the former and correlation analysis, principal components analysis, or factor analysis for the latter.)

8.2 Symmetric Matrices and Other Unitarily Diagonalizable Matrices

Most association matrices encountered in applications are real and symmetric. Because real symmetric matrices occur so frequently in statistical applications and because such matrices have so many interesting properties, it is useful to review some of those properties that we have already encountered and to state some additional properties.

First, perhaps, we should iterate a trivial but important fact: the product of symmetric matrices is not, in general, symmetric. A power of a symmetric matrix, however, is symmetric.

We should also emphasize that some of the special matrices we have discussed are assumed to be symmetric because, if they were not, we could define equivalent symmetric matrices. This includes positive definite matrices and more generally the matrices in quadratic forms.

8.2.1 Some Important Properties of Symmetric Matrices

For convenience, here we list some of the important properties of symmetric matrices, many of which concern their eigenvalues. In the following, let A be a real symmetric matrix with eigenvalues \(c_i\) and corresponding eigenvectors \(v_i\). (Several of these properties are illustrated numerically in the sketch following the list.)
  • If k is any positive integer, \(A^k\) is symmetric.

  • AB is not necessarily symmetric even if B is a symmetric matrix.

  • If B is a symmetric matrix of the same order as A, then AB is symmetric if and only if A and B commute; that is, if and only if AB = BA.

  • If A is nonsingular, then \(A^{-1}\) is also symmetric because \((A^{-1})^{\mathrm{T}} = (A^{\mathrm{T}})^{-1} = A^{-1}\).

  • If A is nonsingular (so that \(A^k\) is defined for nonpositive integers), \(A^k\) is symmetric and nonsingular for any integer k.

  • All eigenvalues of A are real (see page 140).

  • A is diagonalizable (or simple), and in fact A is orthogonally diagonalizable; that is, it has an orthogonally similar canonical factorization, \(A = VCV^{\mathrm{T}}\) (see page 154).

  • A has the spectral decomposition \(A =\sum _{i}c_{i}v_{i}v_{i}^{\mathrm{T}}\), where the \(c_i\) are the eigenvalues and \(v_i\) are the corresponding eigenvectors (see page 155).

  • A power of A has the spectral decomposition \(A^{k} =\sum _{i}c_{i}^{k}v_{i}v_{i}^{\mathrm{T}}\).

  • Any quadratic form \(x^{\mathrm{T}}Ax\) can be expressed as \(\sum _{i}b_{i}^{2}c_{i}\), where the \(b_i\) are the elements of the vector \(V^{-1}x\).

  • We have
    $$\displaystyle{\max _{x\neq 0}\frac{x^{\mathrm{T}}Ax} {x^{\mathrm{T}}x} =\max \{ c_{i}\}}$$
    (see page 156). If A is nonnegative definite, this is the spectral radius ρ(A).
  • For the L2 norm of the symmetric matrix A, we have
    $$\displaystyle{\|A\|_{2} =\rho (A).}$$
  • For the Frobenius norm of the symmetric matrix A, we have
    $$\displaystyle{\|A\|_{F} = \sqrt{\sum c_{i }^{2}}.}$$
    This follows immediately from the fact that A is diagonalizable, as do the following properties.
  • $$\displaystyle{\mathrm{tr}(A) =\sum c_{i}}$$
  • $$\displaystyle{\vert A\vert =\prod c_{i}}$$
    (see equations ( 3.227) and ( 3.228) on page 140).
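The following sketch (assuming NumPy; the random symmetric matrix is purely illustrative) checks several of the properties in the list above numerically.

```python
import numpy as np

rng = np.random.default_rng(42)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                                  # a real symmetric matrix

c, V = np.linalg.eigh(A)                           # real eigenvalues, orthonormal eigenvectors
assert np.allclose(A, V @ np.diag(c) @ V.T)        # A = V C V^T
assert np.allclose(np.trace(A), c.sum())           # tr(A) = sum of eigenvalues
assert np.allclose(np.linalg.det(A), c.prod())     # |A| = product of eigenvalues
assert np.allclose(np.linalg.norm(A, 2), np.abs(c).max())              # ||A||_2 = rho(A)
assert np.allclose(np.linalg.norm(A, "fro"), np.sqrt((c ** 2).sum()))  # Frobenius norm

x = rng.standard_normal(5)
assert x @ A @ x / (x @ x) <= c.max() + 1e-12      # Rayleigh quotient bounded by max eigenvalue
```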

8.2.2 Approximation of Symmetric Matrices and an Important Inequality

In Sect.  3.10, we considered the problem of approximating a given matrix by another matrix of lower rank. There are other situations in statistics in which we need to approximate one matrix by another one. In data analysis, this may be because our given matrix arises from poor observations and we know the “true” matrix has some special properties not possessed by the given matrix computed from the data. A familiar example is a sample variance-covariance matrix computed from incomplete data (see Sect.  9.5.6). Other examples in statistical applications occur in the simulation of random matrices (see Gentle 2003; Section  5.3.3). In most cases of interest, the matrix to be approximated is a symmetric matrix.

Consider the difference of two symmetric n × n matrices, A and \(\widetilde{A}\); that is,
$$\displaystyle{ E = A -\widetilde{ A}. }$$
(8.5)
The matrix of the differences, E, is also symmetric. We measure the “closeness” of A and \(\widetilde{A}\) by some norm of E.
The Hoffman-Wielandt theorem gives a lower bound on the Frobenius norm of E in terms of the differences of the eigenvalues of A and \(\widetilde{A}\): if the eigenvalues of A are \(c_1,\ldots,c_n\) and the eigenvalues of \(\widetilde{A}\) are \(\tilde{c}_{1},\ldots,\tilde{c}_{n}\), each set being arranged in nonincreasing order, we have
$$\displaystyle{ \sum _{i=1}^{n}(c_{ i} -\tilde{ c}_{i})^{2} \leq \| E\|_{\mathrm{ F}}^{2}. }$$
(8.6)
This fact was proved by Hoffman and Wielandt (1953) using techniques from linear programming. Wilkinson (1965) gives a simpler proof (which he attributes to Wallace Givens) along the following lines.
Because A, \(\widetilde{A}\), and E are symmetric, they are all orthogonally diagonalizable. Let the diagonal factorizations of A and E, respectively, be \(VCV^{\mathrm{T}}\) and \(U\mathrm{diag}((e_{1},\ldots,e_{n}))U^{\mathrm{T}}\), where \(e_{1},\ldots,e_{n}\) are the eigenvalues of E in nonincreasing order. Hence, we have
$$\displaystyle\begin{array}{rcl} U\mathrm{diag}((e_{1},\ldots,e_{n}))U^{\mathrm{T}}& =& U\!(A -\widetilde{ A})U^{\mathrm{T}} {}\\ & =& U\!(V \!CV ^{\mathrm{T}} -\widetilde{ A})U^{\mathrm{T}} {}\\ & =& U\!V \!(C - V ^{\mathrm{T}}\widetilde{A}V )V ^{\mathrm{T}}U^{\mathrm{T}}. {}\\ \end{array}$$
Taking norms of both sides, we have
$$\displaystyle{ \sum _{i=1}^{n}e_{ i}^{2} =\| C - V ^{\mathrm{T}}\widetilde{A}V \|^{2}. }$$
(8.7)
(All norms in the remainder of this section will be the Frobenius norm.) Now, let
$$\displaystyle{ f(Q) =\| C - Q^{\mathrm{T}}\widetilde{A}Q\|^{2} }$$
(8.8)
be a function of any n × n orthogonal matrix, Q. (Equation (8.7) yields \(f(V ) =\sum e_{i}^{2}\).) To arrive at inequality (8.6), we show that this function is bounded below by the sum of the squares of the differences between the elements of C (which are the eigenvalues of A) and the eigenvalues of \(Q^{\mathrm{T}}\widetilde{A}Q\) (which are the eigenvalues of the matrix approximating A).
Because the elements of Q are bounded, f(⋅ ) is bounded, and because the set of orthogonal matrices is compact (see page 133) and f(⋅ ) is continuous, f(⋅ ) must attain its lower bound, say l. To simplify the notation, let
$$\displaystyle{X = Q^{\mathrm{T}}\widetilde{A}Q.}$$
Now suppose that there are r distinct eigenvalues of A (that is, the diagonal elements in C):
$$\displaystyle{d_{1}> \cdots> d_{r}.}$$
We can write C as \(\mathrm{diag}(d_{i}I_{m_{i}})\), where m i is the multiplicity of d i . We now partition \(Q^{\mathrm{T}}\widetilde{A}Q\) to correspond to the partitioning of C represented by \(\mathrm{diag}(d_{i}I_{m_{i}})\):
$$\displaystyle{ X = \left [\begin{array}{ccc} X_{11} & \cdots &X_{1r}\\ \vdots & \vdots & \vdots \\ X_{r1} & \cdots &X_{rr} \end{array} \right ]. }$$
(8.9)
In this partitioning, the diagonal blocks, \(X_{ii}\), are \(m_{i}\times m_{i}\) symmetric matrices. The submatrix \(X_{ij}\) is an \(m_{i}\times m_{j}\) matrix.

We now proceed in two steps to show that in order for f(Q) to attain its lower bound l, X must be diagonal. First we will show that when f(Q) = l, the submatrix \(X_{ij}\) in equation (8.9) must be null if \(i\neq j\). To this end, let \(Q_{\!\!\!\ _{\nabla }}\) be such that \(f(Q_{\!\!\!\ _{\nabla }}) = l\), and assume the contrary regarding the corresponding \(X_{\!\!\!\ _{\nabla }} = Q_{\!\!\!\ _{\nabla }}^{\mathrm{T}}\widetilde{A}Q_{\!\!\!\ _{ \nabla }}\); that is, assume that in some submatrix \(X_{ij_{\!\!\!\ _{ \nabla }}}\) where \(i\neq j\), there is a nonzero element, say \(x_{\!\!\!\ _{\nabla }}\). We arrive at a contradiction by showing that in this case there is another X0 of the form \(Q_{0}^{\mathrm{T}}\widetilde{A}Q_{0}\), where Q0 is orthogonal and such that \(f(Q_{0}) <f(Q_{\!\!\!\ _{\nabla }})\).

To establish some useful notation, let p and q be the row and column, respectively, of \(X_{\!\!\!\ _{\nabla }}\) where this nonzero element \(x_{\!\!\!\ _{\nabla }}\) occurs; that is, \(x_{pq} = x_{\!\!\!\ _{\nabla }}\neq 0\) and \(p\neq q\) because \(x_{pq}\) is in \(X_{ij_{\!\!\!\ _{ \nabla }}}\). (Note the distinction between uppercase letters, which represent submatrices, and lowercase letters, which represent elements of matrices.) Also, because \(X_{\!\!\!\ _{\nabla }}\) is symmetric, \(x_{qp} = x_{\!\!\!\ _{\nabla }}\). Now let \(a_{\!\!\!\ _{\nabla }} = x_{pp}\) and \(b_{\!\!\!\ _{\nabla }} = x_{qq}\). We form Q0 as \(Q_{\!\!\!\ _{\nabla }}R\), where R is an orthogonal rotation matrix of the form \(G_{pq}\) in equation ( 5.12) on page 239. We have, therefore, \(\|Q_{0}^{\mathrm{T}}\widetilde{A}Q_{0}\|^{2} =\| R^{\mathrm{T}}Q_{\!\!\!\ _{\nabla }}^{\mathrm{T}}\widetilde{A}Q_{\!\!\!\ _{ \nabla }}R\|^{2} =\| Q_{\!\!\!\ _{ \nabla }}^{\mathrm{T}}\widetilde{A}Q_{\!\!\!\ _{ \nabla }}\|^{2}\). Let a0, b0, and x0 represent the elements of \(Q_{0}^{\mathrm{T}}\widetilde{A}Q_{0}\) that correspond to \(a_{\!\!\!\ _{\nabla }}\), \(b_{\!\!\!\ _{\nabla }}\), and \(x_{\!\!\!\ _{\nabla }}\) in \(Q_{\!\!\!\ _{\nabla }}^{\mathrm{T}}\widetilde{A}Q_{\!\!\!\ _{ \nabla }}\).

From the definition of the Frobenius norm, we have
$$\displaystyle{f(Q_{0}) - f(Q_{\!\!\!\ _{\nabla }}) = 2(a_{\!\!\!\ _{\nabla }}- a_{0})d_{i} + 2(b_{\!\!\!\ _{\nabla }}- b_{0})d_{j}}$$
because all other terms cancel. If the angle of rotation is θ, then
$$\displaystyle\begin{array}{rcl} a_{0}& =& a_{\!\!\!\ _{\nabla }}\cos ^{2}\theta - 2x_{\!\!\!\ _{ \nabla }}\cos \theta \sin \theta + b_{\!\!\!\ _{\nabla }}\sin ^{2}\theta, {}\\ b_{0}& =& a_{\!\!\!\ _{\nabla }}\sin ^{2}\theta + 2x_{\!\!\!\ _{ \nabla }}\cos \theta \sin \theta + b_{\!\!\!\ _{\nabla }}\cos ^{2}\theta, {}\\ \end{array}$$
and so for a function h of θ we can write
$$\displaystyle\begin{array}{rcl} h(\theta )& =& f(Q_{0}) - f(Q_{\!\!\!\ _{\nabla }}) {}\\ & =& 2d_{i}\left ((a_{\!\!\!\ _{\nabla }}- b_{\!\!\!\ _{\nabla }})\sin ^{2}\theta + x_{\!\!\!\ _{ \nabla }}\sin 2\theta \right ) + 2d_{j}\left ((b_{\!\!\!\ _{\nabla }}- a_{\!\!\!\ _{\nabla }})\sin ^{2}\theta - x_{\!\!\!\ _{ \nabla }}\sin 2\theta \right ) {}\\ & =& 2(d_{i} - d_{j})(a_{\!\!\!\ _{\nabla }}- b_{\!\!\!\ _{\nabla }})\sin ^{2}\theta + 2x_{\!\!\!\ _{ \nabla }}(d_{i} - d_{j})\sin 2\theta, {}\\ \end{array}$$
and so
$$\displaystyle{\frac{\mathrm{d}} {\mathrm{d}\theta }h(\theta ) = 2(d_{i} - d_{j})(a_{\!\!\!\ _{\nabla }}- b_{\!\!\!\ _{\nabla }})\sin 2\theta + 4x_{\!\!\!\ _{\nabla }}(d_{i} - d_{j})\cos 2\theta.}$$
The coefficient of cos2θ, \(4x_{\!\!\!\ _{\nabla }}(d_{i} - d_{j})\), is nonzero because \(d_i\) and \(d_j\) are distinct and \(x_{\!\!\!\ _{\nabla }}\) is nonzero by the assumption to be contradicted, and so the derivative at θ = 0 is nonzero. Hence, by the proper choice of a direction of rotation (which effectively interchanges the roles of \(d_i\) and \(d_j\)), we can make \(f(Q_{0}) - f(Q_{\!\!\!\ _{\nabla }})\) positive or negative, showing that \(f(Q_{\!\!\!\ _{\nabla }})\) cannot be a minimum if some \(X_{ij}\) in equation (8.9) with \(i\neq j\) is nonnull; that is, if \(Q_{\!\!\!\ _{\nabla }}\) is a matrix such that \(f(Q_{\!\!\!\ _{\nabla }})\) is the minimum of f(Q), then in the partition of \(Q_{\!\!\!\ _{\nabla }}^{\mathrm{T}}\widetilde{A}Q_{\!\!\!\ _{ \nabla }}\) only the diagonal submatrices \(X_{ii_{\!\!\!\ _{ \nabla }}}\) can be nonnull:
$$\displaystyle{Q_{\!\!\!\ _{\nabla }}^{\mathrm{T}}\widetilde{A}Q_{\!\!\!\ _{ \nabla }} = \mathrm{diag}(X_{11_{\!\!\!\ _{\nabla }}},\ldots,X_{rr_{\!\!\!\ _{\nabla }}}).}$$
The next step is to show that each \(X_{ii_{\!\!\!\ _{ \nabla }}}\) must be diagonal. Because it is symmetric, we can diagonalize it with an orthogonal matrix P i as
$$\displaystyle{P_{i}^{\mathrm{T}}X_{ ii_{\!\!\!\ _{\nabla }}}P_{i} = G_{i}.}$$
Now let P be the direct sum of the P i and form
$$\displaystyle\begin{array}{rcl} P^{\mathrm{T}}CP - P^{\mathrm{T}}Q_{\!\!\!\ _{ \nabla }}^{\mathrm{T}}\widetilde{A}Q_{\!\!\!\ _{ \nabla }}P& =& \mathrm{diag}(d_{1}I,\ldots,d_{r}I) -\mathrm{diag}(G_{1},\ldots,G_{r}) {}\\ & =& C - P^{\mathrm{T}}Q_{\!\!\!\ _{ \nabla }}^{\mathrm{T}}\widetilde{A}Q_{\!\!\!\ _{ \nabla }}P. {}\\ \end{array}$$
Hence,
$$\displaystyle{f(Q_{\!\!\!\ _{\nabla }}P) = f(Q_{\!\!\!\ _{\nabla }}),}$$
and so the minimum occurs for a matrix \(Q_{\!\!\!\ _{\nabla }}P\) that reduces \(\widetilde{A}\) to a diagonal form. The elements of the \(G_i\) must be the \(\tilde{c}_{i}\) in some order, so the minimum of f(Q), which we have denoted by \(f(Q_{\!\!\!\ _{\nabla }})\), is \(\sum (c_{i} -\tilde{ c}_{p_{i}})^{2}\), where the \(p_i\) are a permutation of \(1,\ldots,n\). As the final step, we show \(p_i = i\). We begin with \(p_1\). Suppose \(p_1\neq 1\) but \(p_s = 1\); that is, \(\tilde{c}_{1} \geq \tilde{ c}_{p_{1}}\). Interchange \(p_1\) and \(p_s\) in the permutation. The change in the sum \(\sum (c_{i} -\tilde{ c}_{p_{i}})^{2}\) is
$$\displaystyle\begin{array}{rcl} (c_{1} -\tilde{ c}_{1})^{2} + (c_{ s} -\tilde{ c}_{p_{1}})^{2} - (c_{ 1} -\tilde{ c}_{p_{1}})^{2} - (c_{ s} -\tilde{ c}_{1})^{2}& =& -2(c_{ s} - c_{1})(\tilde{c}_{p_{1}} -\tilde{ c}_{1}) {}\\ & \leq & 0; {}\\ \end{array}$$
that is, the interchange does not increase the value of the sum. Similarly, we proceed through the \(p_i\) to \(p_n\), getting \(p_i = i\).

We have shown, therefore, that the minimum of f(Q) is \(\sum _{i=1}^{n}(c_{i} -\tilde{ c}_{i})^{2}\), where both sets of eigenvalues are ordered in nonincreasing value. From equation (8.7), which is f(V ), we have the inequality (8.6).

While an upper bound may be of more interest in the approximation problem, the lower bound in the Hoffman-Wielandt theorem gives us a measure of the goodness of the approximation of one matrix by another matrix. There are various extensions and other applications of the Hoffman-Wielandt theorem; see Chu (1991).
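A quick numerical check of the Hoffman-Wielandt bound (a sketch assuming NumPy; the random symmetric matrices merely stand in for a matrix and an approximation of it):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = (B + B.T) / 2                          # the "true" symmetric matrix
E0 = 0.1 * rng.standard_normal((6, 6))
E = (E0 + E0.T) / 2                        # a symmetric perturbation
A_tilde = A - E                            # the approximating matrix, so E = A - A_tilde

c = np.sort(np.linalg.eigvalsh(A))[::-1]             # eigenvalues of A, nonincreasing
c_tilde = np.sort(np.linalg.eigvalsh(A_tilde))[::-1]

lhs = ((c - c_tilde) ** 2).sum()
rhs = np.linalg.norm(E, "fro") ** 2
assert lhs <= rhs + 1e-12                  # inequality (8.6)
```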

8.2.3 Normal Matrices

A real square matrix A is said to be normal if \(A^{\mathrm{T}}A = AA^{\mathrm{T}}\). (In general, a square matrix is normal if \(A^{\mathrm{H}}A = AA^{\mathrm{H}}\).) The Gramian matrix formed from a normal matrix is the same as the Gramian formed from the transpose (or conjugate transpose) of the matrix. Normal matrices include symmetric (and Hermitian), skew symmetric (and skew Hermitian), square orthogonal (and unitary) matrices, and circulant matrices. The identity is also obviously a normal matrix.

There are a number of interesting properties possessed by a normal matrix, but the most important property is that it can be diagonalized by a unitary matrix. Recall from page 154 that a matrix can be orthogonally diagonalized if and only if the matrix is symmetric. Not all normal matrices can be orthogonally diagonalized, but all can be diagonalized by a unitary matrix (“unitarily diagonalized”). In fact, a matrix can be unitarily diagonalized if and only if the matrix is normal. (This is the reason the word “normal” is used to describe these matrices; being unitarily diagonalizable is an alternate, and more meaningful, way of defining a normal matrix.)

It is easy to see that a matrix A is unitarily diagonalizable if and only if \(A^{\mathrm{H}}A = AA^{\mathrm{H}}\) (and for real A, \(A^{\mathrm{T}}A = AA^{\mathrm{T}}\) implies \(A^{\mathrm{H}}A = AA^{\mathrm{H}}\)).

First suppose A is unitarily diagonalizable. Let \(A = PDP^{\mathrm{H}}\), where P is unitary and D is diagonal. In that case, the elements of D are the eigenvalues of A, as we see by considering each column in AP. Now consider \(AA^{\mathrm{H}}\):
$$\displaystyle{\begin{array}{c} AA^{\mathrm{H}} = P\,D\,P^{\mathrm{H}}P\,D^{\mathrm{H}}\,P^{\mathrm{H}} = P\,D\,D^{\mathrm{H}}\,P^{\mathrm{H}} = \\ \quad \quad \quad \quad \quad \quad P\,D^{\mathrm{H}}D\,P^{\mathrm{H}} = P\,D^{\mathrm{H}}\,P^{\mathrm{H}}P\,D\,P^{\mathrm{H}} = A^{\mathrm{H}}A.\end{array} }$$
Next, suppose \(A^{\mathrm{H}}A = AA^{\mathrm{H}}\). To see that A is unitarily diagonalizable, form the Schur factorization, \(A = UTU^{\mathrm{H}}\) (see Sect.  3.8.7 on page 147). We have
$$\displaystyle{A^{\mathrm{H}}A = UT^{\mathrm{H}}U^{\mathrm{H}}UTU^{\mathrm{H}} = UT^{\mathrm{H}}TU^{\mathrm{H}}}$$
and
$$\displaystyle{AA^{\mathrm{H}} = UTU^{\mathrm{H}}UT^{\mathrm{H}}U^{\mathrm{H}} = UTT^{\mathrm{H}}U^{\mathrm{H}}.}$$
Now under the assumption that \(A^{\mathrm{H}}A = AA^{\mathrm{H}}\),
$$\displaystyle{UT^{\mathrm{H}}TU^{\mathrm{H}} = UTT^{\mathrm{H}}U^{\mathrm{H}},}$$
which implies \(T^{\mathrm{H}}T = TT^{\mathrm{H}}\). Equating the (i, i) elements of these two products and proceeding by induction on i (using the fact that T is upper triangular), we have
$$\displaystyle{\vert t_{ii}\vert ^{2} =\sum _{ j=1}^{n}\vert t_{ ij}\vert ^{2},}$$
that is, \(t_{ij} = 0\) unless j = i. We conclude that T is diagonal, and hence, the Schur factorization, \(A = UTU^{\mathrm{H}}\), is a unitary diagonalization of A; that is, A is unitarily diagonalizable.

Spectral methods, based on the unitary diagonalization, are useful in many areas of applied mathematics. The spectra of nonnormal matrices, however, are quite different (see Trefethen and Embree (2005)).
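To illustrate, the sketch below (assuming NumPy and SciPy are available) takes a real normal matrix that is not symmetric (a circulant), verifies normality, and confirms that the triangular factor in its complex Schur factorization is numerically diagonal, mirroring the argument above.

```python
import numpy as np
from scipy.linalg import circulant, schur

A = circulant([2.0, 1.0, 0.0, 3.0])        # circulant matrices are normal but not symmetric
assert np.allclose(A @ A.T, A.T @ A)       # A^T A = A A^T, so A is normal
assert not np.allclose(A, A.T)             # and A is not symmetric

T, U = schur(A, output="complex")          # A = U T U^H with U unitary, T upper triangular
assert np.allclose(A, U @ T @ U.conj().T)
assert np.allclose(T - np.diag(np.diag(T)), 0)   # T is diagonal: A is unitarily diagonalizable
```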

8.3 Nonnegative Definite Matrices: Cholesky Factorization

We defined nonnegative definite and positive definite matrices on page 91, and discussed some of their properties, particularly in Sect.  3.8.11. We have seen that these matrices have useful factorizations, in particular, the square root and the Cholesky factorization. In this section, we recall those definitions, properties, and factorizations.

A symmetric matrix A such that any quadratic form involving the matrix is nonnegative is called a nonnegative definite matrix. That is, a symmetric matrix A is a nonnegative definite matrix if, for any (conformable) vector x,
$$\displaystyle{ x^{\mathrm{T}}Ax \geq 0. }$$
(8.10)
(We remind the reader that there is a related term, positive semidefinite matrix, that is not used consistently in the literature. We will generally avoid the term “semidefinite”.)
We denote the fact that A is nonnegative definite by
$$\displaystyle{ A\succeq 0. }$$
(8.11)
(Some people use the notation A ≥ 0 to denote a nonnegative definite matrix, but we have decided to use this notation to indicate that each element of A is nonnegative; see page 64.)
There are several properties that follow immediately from the definition.
  • The sum of two (conformable) nonnegative definite matrices is nonnegative definite.

  • All diagonal elements of a nonnegative definite matrix are nonnegative. Hence, if A is nonnegative definite, tr(A) ≥ 0.

  • Any square submatrix whose principal diagonal is a subset of the principal diagonal of a nonnegative definite matrix is nonnegative definite. In particular, any square principal submatrix of a nonnegative definite matrix is nonnegative definite.

It is easy to show that the latter two facts follow from the definition by considering a vector x with zeros in all positions except those corresponding to the submatrix in question. For example, to see that all diagonal elements of a nonnegative definite matrix are nonnegative, assume the (i, i) element is negative, and then consider the vector x to consist of all zeros except for a 1 in the ith position. It is easy to see that the quadratic form is negative, so the assumption that the (i, i) element is negative leads to a contradiction.

  • A diagonal matrix is nonnegative definite if and only if all of the diagonal elements are nonnegative.

This must be true because a quadratic form in a diagonal matrix is the sum of the diagonal elements times the squares of the elements of the vector.

We can also form other submatrices that are nonnegative definite:
  • If A is nonnegative definite, then \(A_{-(i_{1},\ldots,i_{k})(i_{1},\ldots,i_{k})}\) is nonnegative definite. (See page 599 for notation.)

Again, we can see this by selecting an x in the defining inequality (8.10) consisting of 1s in the positions corresponding to the rows and columns of A that are retained and 0s elsewhere.

By considering xTCTACx and y = Cx, we see that
  • if A is nonnegative definite, and C is conformable for the multiplication, then CTAC is nonnegative definite.

From equation ( 3.252) and the fact that the determinant of a product is the product of the determinants, we have that
  • the determinant of a nonnegative definite matrix is nonnegative.

Finally, for the nonnegative definite matrix A, we have
$$\displaystyle{ a_{ij}^{2} \leq a_{ ii}a_{jj}, }$$
(8.12)
as we see from the definition \(x^{\mathrm{T}}Ax\geq 0\) and choosing the vector x to have a variable y in position i, a 1 in position j, and 0s in all other positions. For a symmetric matrix A, this yields the quadratic \(a_{ii}y^{2} + 2a_{ij}y + a_{jj}\). If this quadratic is to be nonnegative for all y, then the discriminant \(4a_{ij}^{2} - 4a_{ii}a_{jj}\) must be nonpositive; that is, inequality (8.12) must be true.

8.3.1 Eigenvalues of Nonnegative Definite Matrices

We have seen on page 159 that a real symmetric matrix is nonnegative (positive) definite if and only if all of its eigenvalues are nonnegative (positive).

This fact allows a generalization of the statement above: a triangular matrix is nonnegative (positive) definite if and only if all of the diagonal elements are nonnegative (positive).

8.3.2 The Square Root and the Cholesky Factorization

Two important factorizations of nonnegative definite matrices are the square root,
$$\displaystyle{ A = (A^{\frac{1} {2} })^{2}, }$$
(8.13)
discussed in Sect.  5.9.1, and the Cholesky factorization,
$$\displaystyle{ A = T^{\mathrm{T}}T, }$$
(8.14)
discussed in Sect.  5.9.2. If T is as in equation (8.14), the symmetric matrix \(TT^{\mathrm{T}}\) is also nonnegative definite, or positive definite if A is.
The square root matrix is used often in theoretical developments, such as Exercise 4.7b for example, but the Cholesky factor is more useful in practice. The Cholesky factorization also has a prominent role in multivariate analysis, where it appears in the Bartlett decomposition. If W is a Wishart matrix with variance-covariance matrix \(\Sigma\) (see Exercise 4.12 on page 224), then the Bartlett decomposition of W is
$$\displaystyle{W = (TU)^{\mathrm{T}}TU,}$$
where U is the Cholesky factor of \(\Sigma\), and T is an upper triangular matrix with positive diagonal elements. The squares of the diagonal elements of T have independent chi-squared distributions, and the off-diagonal elements of T have independent standard normal distributions.
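Both factorizations are easy to compute. The sketch below (assuming NumPy) forms the square root of a positive definite matrix from its spectral decomposition and obtains its Cholesky factor; note that numpy.linalg.cholesky returns the lower triangular factor L with A = LL^T, so that T = L^T in the notation of equation (8.14).

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((10, 4))
A = X.T @ X                              # a Gramian matrix, positive definite with probability 1

# Square root via the spectral decomposition A = V diag(c) V^T
c, V = np.linalg.eigh(A)
A_half = V @ np.diag(np.sqrt(c)) @ V.T
assert np.allclose(A_half @ A_half, A)   # (A^{1/2})^2 = A, equation (8.13)

# Cholesky factorization, equation (8.14)
L = np.linalg.cholesky(A)                # lower triangular, A = L L^T
T = L.T                                  # upper triangular factor with A = T^T T
assert np.allclose(T.T @ T, A)
```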

8.3.3 The Convex Cone of Nonnegative Definite Matrices

The class of all n × n nonnegative definite matrices is a cone because if X is a nonnegative definite matrix and a > 0, then aX is a nonnegative definite matrix (see page 43). Furthermore, it is a convex cone in \(\mathrm{I\!R}^{n\times n}\), because if X1 and X2 are n × n nonnegative definite matrices and a, b ≥ 0, then aX1 + bX2 is nonnegative definite so long as either a > 0 or b > 0.

This class is not closed under Cayley multiplication (that is, in particular, it is not a group with respect to that operation). The product of two nonnegative definite matrices might not even be symmetric.

The convex cone of nonnegative definite matrices is an important object in a common optimization problem called convex cone programming (“programming” here means “optimization”). A special case of convex cone programming is called “semidefinite programming”, or “SDP” (where “semidefinite” comes from the alternative terminology for nonnegative definite). The canonical SDP problem for given n × n symmetric matrices C and \(A_i\), and real numbers \(b_i\), for \(i = 1,\ldots,m\), is
$$\displaystyle{\begin{array}{rll} \mathrm{minimize}&\langle C,X\rangle \\ \mathrm{subject}\;\mathrm{to}&\langle A_{i},X\rangle = b_{i}&\quad i = 1,\ldots,m \\ &X\succeq 0.\end{array} }$$
The notation 〈C, X〉 here means the matrix inner product (see page 97). SDP includes linear programming as a simple special case, in which C and X are vectors. See Vandenberghe and Boyd (1996) for further discussion of the SDP problem and applications.
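As a sketch of how such a problem is posed in practice (assuming the cvxpy modeling package is available; the data are illustrative, with m = 1, A_1 = I, and b_1 = 1, a choice for which the optimal value is known to equal the smallest eigenvalue of C and so provides an easy check):

```python
import numpy as np
import cvxpy as cp

n = 4
rng = np.random.default_rng(3)
M = rng.standard_normal((n, n))
C = (M + M.T) / 2                        # a symmetric "cost" matrix

# Canonical SDP: minimize <C, X> subject to <I, X> = 1 and X nonnegative definite.
X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0, cp.trace(X) == 1]
problem = cp.Problem(cp.Minimize(cp.trace(C @ X)), constraints)
problem.solve()

print(problem.value, np.linalg.eigvalsh(C).min())   # the two values should agree
```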

8.4 Positive Definite Matrices

An important class of nonnegative definite matrices consists of those that satisfy strict inequalities in the definition involving \(x^{\mathrm{T}}Ax\). These matrices are called positive definite matrices, and they have all of the properties discussed above for nonnegative definite matrices as well as some additional useful properties.

A symmetric matrix A is called a positive definite matrix if, for any (conformable) vector x ≠ 0, the quadratic form is positive; that is,
$$\displaystyle{ x^{\mathrm{T}}Ax> 0. }$$
(8.15)
We denote the fact that A is positive definite by
$$\displaystyle{ A \succ 0. }$$
(8.16)
(Some people use the notation A > 0 to denote a positive definite matrix, but we have decided to use this notation to indicate that each element of A is positive.)
The properties of nonnegative definite matrices noted above hold also for positive definite matrices, generally with strict inequalities. It is obvious that all diagonal elements of a positive definite matrix are positive. Hence, if A is positive definite, tr(A) > 0. Furthermore, as above and for the same reasons, if A is positive definite, then \(A_{-(i_{1},\ldots,i_{k})(i_{1},\ldots,i_{k})}\) is positive definite. In particular,
  • Any square principal submatrix of a positive definite matrix is positive definite.

Because a quadratic form in a diagonal matrix is the sum of the diagonal elements times the squares of the elements of the vector, a diagonal matrix is positive definite if and only if all of the diagonal elements are positive.

From equation ( 3.252) and the fact that the determinant of a product is the product of the determinants, we have
  • The determinant of a positive definite matrix is positive.

  • If A is positive definite,
    $$\displaystyle{ a_{ij}^{2} <a_{ ii}a_{jj}, }$$
    (8.17)
    which we see using the same argument as for inequality (8.12).
We have a slightly stronger statement regarding sums involving positive definite matrices than what we could conclude about nonnegative definite matrices:
  • The sum of a positive definite matrix and a (conformable) nonnegative definite matrix is positive definite.

That is,
$$\displaystyle{ x^{\mathrm{T}}Ax> 0\;\forall x\neq 0\quad \mathrm{and}\quad y^{\mathrm{T}}By \geq 0\;\forall y\Longrightarrow z^{\mathrm{T}}(A + B)z> 0\;\forall z\neq 0. }$$
(8.18)
  • A positive definite matrix is necessarily nonsingular. (We see this from the fact that no nonzero combination of the columns, or rows, can be 0.) Furthermore, if A is positive definite, then \(A^{-1}\) is positive definite. (We showed this in Sect.  3.8.11, but we can see it in another way: because for any y ≠ 0 and \(x = A^{-1}y\), we have \(y^{\mathrm{T}}A^{-1}y = x^{\mathrm{T}}y = x^{\mathrm{T}}Ax > 0\).)

  • A (strictly) diagonally dominant symmetric matrix with positive diagonals is positive definite. The proof of this is Exercise 8.3.

  • A positive definite matrix is orthogonally diagonalizable.

  • A positive definite matrix has a square root.

  • A positive definite matrix has a Cholesky factorization.

We cannot conclude that the product of two positive definite matrices is positive definite, but we do have the useful fact:
  • If A is positive definite, and C is of full rank and conformable for the multiplication AC, then CTAC is positive definite (see page 114).

We have seen from the definition of positive definiteness and the distribution of multiplication over addition that the sum of a positive definite matrix and a nonnegative definite matrix is positive definite. We can define an ordinal relationship between positive definite and nonnegative definite matrices of the same size. If A is positive definite and B is nonnegative definite of the same size, we say A is strictly greater than B and write
$$\displaystyle{ A \succ B }$$
(8.19)
if A − B is positive definite; that is, if \(A - B\succ 0\).
We can form a partial ordering of nonnegative definite matrices of the same order based on this additive property. We say A is greater than B and write
$$\displaystyle{ A\succeq B }$$
(8.20)
if A − B is either the 0 matrix or is nonnegative definite; that is, if \(A - B\succeq 0\) (see Exercise 8.2a). The “strictly greater than” relation implies the “greater than” relation. These relations are partial in the sense that they do not apply to all pairs of nonnegative definite matrices; that is, there are pairs of matrices A and B for which neither A⪰B nor B⪰A.

If A ≻ B, we also write B ≺ A; and if A⪰B, we may write B⪯A.

8.4.1 Leading Principal Submatrices of Positive Definite Matrices

A sufficient condition for a symmetric matrix to be positive definite is that the determinant of each of the leading principal submatrices be positive. To see this, first let the n × n symmetric matrix A be partitioned as
$$\displaystyle\begin{array}{rcl} A& =& \left [\begin{array}{cc} A_{n-1} & a \\ a^{\mathrm{T}} & a_{nn} \end{array} \right ],{}\\ \end{array}$$
and assume that A n−1 is positive definite and that | A | > 0. (This is not the same notation that we have used for these submatrices, but the notation is convenient in this context.) From equation ( 3.192) on page 123,
$$\displaystyle{\vert A\vert = \vert A_{n-1}\vert (a_{nn} - a^{\mathrm{T}}A_{ n-1}^{-1}a).}$$
and assume that \(A_{n-1}\) is positive definite and that | A | > 0. (This is not the same notation that we have used for these submatrices, but the notation is convenient in this context.) From equation ( 3.192) on page 123,
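The condition is easy to check computationally. Here is a sketch (assuming NumPy; the function name is illustrative) that tests a symmetric matrix by the determinants of its leading principal submatrices and compares the verdict with the eigenvalue test of Sect. 8.3.1.

```python
import numpy as np

def leading_minors_positive(A):
    """True if every leading principal submatrix of A has a positive determinant."""
    n = A.shape[0]
    return all(np.linalg.det(A[:k, :k]) > 0 for k in range(1, n + 1))

rng = np.random.default_rng(5)
X = rng.standard_normal((8, 4))
A_pd = X.T @ X + 0.1 * np.eye(4)                 # positive definite
A_indef = np.array([[1.0, 2.0], [2.0, 1.0]])     # symmetric but indefinite (eigenvalues 3, -1)

for A in (A_pd, A_indef):
    by_minors = leading_minors_positive(A)
    by_eigenvalues = (np.linalg.eigvalsh(A) > 0).all()
    assert by_minors == by_eigenvalues
```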

8.4.2 The Convex Cone of Positive Definite Matrices

The class of all n × n positive definite matrices is a cone because if X is a positive definite matrix and a > 0, then aX is a positive definite matrix (see page 43). Furthermore, this class is a convex cone in \(\mathrm{I\!R}^{n\times n}\) because if X1 and X2 are n × n positive definite matrices and a, b ≥ 0, then aX1 + bX2 is positive definite so long as either a ≠ 0 or b ≠ 0.

As with the cone of nonnegative definite matrices, this class is not closed under Cayley multiplication.

8.4.3 Inequalities Involving Positive Definite Matrices

Quadratic forms of positive definite matrices and nonnegative matrices occur often in data analysis. There are several useful inequalities involving such quadratic forms.

On page 156, we showed that if x ≠ 0, for any symmetric matrix A with eigenvalues \(c_i\),
$$\displaystyle{ \frac{x^{\mathrm{T}}Ax} {x^{\mathrm{T}}x} \leq \max \{ c_{i}\}. }$$
(8.21)
If A is nonnegative definite, by our convention of labeling the eigenvalues, we have \(\max \{c_i\} = c_1\). If the rank of A is r, the minimum nonzero eigenvalue is denoted \(c_r\). Letting the eigenvectors associated with \(c_1,\ldots,c_r\) be \(v_1,\ldots,v_r\) (and recalling that these choices may be arbitrary in the case where some eigenvalues are not simple), by an argument similar to that used on page 156, we have that if A is nonnegative definite of rank r,
$$\displaystyle{ \frac{v_{i}^{\mathrm{T}}Av_{i}} {v_{i}^{\mathrm{T}}v_{i}} \geq c_{r}, }$$
(8.22)
for 1 ≤ i ≤ r.
If A is positive definite and x and y are conformable nonzero vectors, we see that
$$\displaystyle{ x^{\mathrm{T}}A^{-1}x \geq \frac{(y^{\mathrm{T}}x)^{2}} {y^{\mathrm{T}}Ay} }$$
(8.23)
by using the same argument as used in establishing the Cauchy-Schwarz inequality ( 2.26). We first obtain the Cholesky factor T of A (which is, of course, of full rank) and then observe that for every real number t
$$\displaystyle{\left (tTy + T^{-\mathrm{T}}x\right )^{\mathrm{T}}\left (tTy + T^{-\mathrm{T}}x\right ) \geq 0,}$$
and hence the discriminant of the quadratic equation in t must be nonnegative:
$$\displaystyle{4\left ((Ty)^{\mathrm{T}}T^{-\mathrm{T}}x\right )^{2}\, -\, 4\left (T^{-\mathrm{T}}x\right )^{\mathrm{T}}\left (T^{-\mathrm{T}}x\right )(Ty)^{\mathrm{T}}Ty \leq 0.}$$
The inequality (8.23) is used in constructing Scheffé simultaneous confidence intervals in linear models.
The Kantorovich inequality for positive numbers has an immediate extension to an inequality that involves positive definite matrices. The Kantorovich inequality, which finds many uses in optimization problems, states, for positive numbers c1 ≥ c2 ≥ ⋯ ≥ c n and nonnegative numbers y1, …, y n such that ∑y i = 1, that
$$\displaystyle{\left (\sum _{i=1}^{n}y_{i}c_{i}\right )\left (\sum _{i=1}^{n}y_{i}c_{i}^{-1}\right ) \leq \frac{(c_{1} + c_{n})^{2}} {4c_{1}c_{n}}.}$$
Now let A be an n × n positive definite matrix with eigenvalues c1 ≥ c2 ≥ ⋯ ≥ c n > 0. We substitute \(x_{i}^{2}\) for \(y_{i}\), thus removing the nonnegativity restriction, and incorporate the restriction on the sum directly into the inequality. Then, using the similar canonical factorization of A and A−1, we have
$$\displaystyle{ \frac{\left (x^{\mathrm{T}}Ax\right )\left (x^{\mathrm{T}}A^{-1}x\right )} {(x^{\mathrm{T}}x)^{2}} \leq \frac{(c_{1} + c_{n})^{2}} {4c_{1}c_{n}}. }$$
(8.24)
This Kantorovich matrix inequality likewise has applications in optimization; in particular, for assessing convergence of iterative algorithms.
The left-hand side of the Kantorovich matrix inequality also has a lower bound,
$$\displaystyle{ \frac{\left (x^{\mathrm{T}}Ax\right )\left (x^{\mathrm{T}}A^{-1}x\right )} {(x^{\mathrm{T}}x)^{2}} \geq 1, }$$
(8.25)
which can be seen in a variety of ways, perhaps most easily by using the inequality (8.23). (You were asked to prove this directly in Exercise 3.31.)
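
As a quick numerical check (simulated data, not from the text), the R lines below verify the upper bound (8.24) and the lower bound (8.25) for a randomly generated positive definite matrix and vector:

 set.seed(2)
 A <- crossprod(matrix(rnorm(25), 5, 5)) + diag(5)   # positive definite
 x <- rnorm(5)
 cc <- eigen(A, symmetric = TRUE)$values
 ratio <- (t(x) %*% A %*% x) * (t(x) %*% solve(A) %*% x) / sum(x^2)^2
 ratio >= 1                                                 # inequality (8.25)
 ratio <= (max(cc) + min(cc))^2 / (4 * max(cc) * min(cc))   # inequality (8.24)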

All of the inequalities (8.21) through (8.25) are sharp. We know that (8.21) and (8.22) are sharp by using the appropriate eigenvectors. We can see the others are sharp by using A = I.

There are several variations on these inequalities and other similar inequalities that are reviewed by Marshall and Olkin (1990) and Liu and Neudecker (1996).

8.5 Idempotent and Projection Matrices

An important class of matrices consists of those that, like the identity, have the property that raising them to a power leaves them unchanged. A matrix A such that
$$\displaystyle{ AA = A }$$
(8.26)
is called an idempotent matrix. An idempotent matrix is square, and it is either singular or the identity matrix. (It must be square in order to be conformable for the indicated multiplication. If it is not singular, we have A = (A−1A)A = A−1(AA) = A−1A = I; hence, an idempotent matrix is either singular or the identity matrix.)

From the definition, it is clear that an idempotent matrix is its own Drazin inverse: AD = A (see page 129).

An idempotent matrix that is symmetric is called a projection matrix.

8.5.1 Idempotent Matrices

Many matrices encountered in the statistical analysis of linear models are idempotent. One such matrix is \(X^{-}X\), where \(X^{-}\) is any generalized inverse of X (see page 124 and Sect.  9.3.2). This matrix exists for any n × m matrix X, and it is square. (It is m × m.)

Because the eigenvalues of A2 are the squares of the eigenvalues of A, all eigenvalues of an idempotent matrix must be either 0 or 1.

Any vector in the column space of an idempotent matrix A is an eigenvector of A. (This follows immediately from AA = A.) More generally, if x and y are vectors in span(A) and a is a scalar, then
$$\displaystyle{ A(ax + y) = ax + y. }$$
(8.27)
(To see this, we merely represent x and y as linear combinations of columns (or rows) of A and substitute in the equation.)
The number of eigenvalues that are 1 is the rank of an idempotent matrix. (Exercise 8.4 asks why this is the case.) We therefore have, for an idempotent matrix A,
$$\displaystyle{ \mathrm{tr}(A) = \mathrm{rank}(A). }$$
(8.28)
Because the eigenvalues of an idempotent matrix are either 0 or 1, a symmetric idempotent matrix is nonnegative definite.
If A is idempotent and n × n, then
$$\displaystyle{ \mathrm{rank}(I - A) = n -\mathrm{rank}(A). }$$
(8.29)
We showed this in equation ( 3.200) on page 125. (Although there we were considering the special matrix \(A^{-}A\), the only properties used were the idempotency of \(A^{-}A\) and the fact that rank(\(A^{-}A\)) = rank(A).)

Equation (8.29) together with the diagonalizability theorem (equation ( 3.248)) implies that an idempotent matrix is diagonalizable.

If A is idempotent and V is an orthogonal matrix of the same size, then VTAV is idempotent (whether or not V is a matrix that diagonalizes A) because
$$\displaystyle{ (V ^{\mathrm{T}}AV )(V ^{\mathrm{T}}AV ) = V ^{\mathrm{T}}AAV = V ^{\mathrm{T}}AV. }$$
(8.30)

If A is idempotent, then (I − A) is also idempotent, as we see by multiplication. This fact and equation (8.29) have generalizations for sums of idempotent matrices that are parts of Cochran’s theorem, which we consider below.

Although (I − A) is idempotent whenever A is, and hence is not of full rank (unless A = 0), for any scalar a ≠ − 1, (I + aA) is of full rank, and
$$\displaystyle{ (I + aA)^{-1} = I - \frac{a} {a + 1}A, }$$
(8.31)
as we see by multiplication.
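
The identity (8.31) is easy to check numerically; the minimal R sketch below (simulated X, an arbitrary a) uses a projection matrix as the idempotent A:

 set.seed(3)
 X <- matrix(rnorm(30), 10, 3)
 A <- X %*% solve(crossprod(X)) %*% t(X)   # symmetric idempotent (a projection)
 a <- 2.5
 max(abs(solve(diag(10) + a * A) - (diag(10) - a / (a + 1) * A)))   # ~ 0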

On page 146, we saw that similar matrices are equivalent (have the same rank). For idempotent matrices, we have the converse: idempotent matrices of the same rank (and size) are similar (see Exercise 8.5).

If A1 and A2 are idempotent matrices conformable for addition, then A1 + A2 is idempotent if and only if A1A2 = A2A1 = 0. It is easy to see that this condition is sufficient by multiplication:
$$\displaystyle{(A_{1} + A_{2})(A_{1} + A_{2}) = A_{1}A_{1} + A_{1}A_{2} + A_{2}A_{1} + A_{2}A_{2} = A_{1} + A_{2}.}$$
To see that it is necessary, we first observe from the expansion above that A1 + A2 is idempotent only if A1A2 + A2A1 = 0. Multiplying this necessary condition on the left by A1 yields
$$\displaystyle{A_{1}A_{1}A_{2} + A_{1}A_{2}A_{1} = A_{1}A_{2} + A_{1}A_{2}A_{1} = 0,}$$
and multiplying on the right by A1 yields
$$\displaystyle{A_{1}A_{2}A_{1} + A_{2}A_{1}A_{1} = A_{1}A_{2}A_{1} + A_{2}A_{1} = 0.}$$
Subtracting these two equations yields
$$\displaystyle{A_{1}A_{2} = A_{2}A_{1},}$$
and since A1A2 + A2A1 = 0, we must have A1A2 = A2A1 = 0.

8.5.1.1 Symmetric Idempotent Matrices

Many of the idempotent matrices in statistical applications are symmetric, and such matrices have some useful properties.

Because the eigenvalues of an idempotent matrix are either 0 or 1, the spectral decomposition of a symmetric idempotent matrix A can be written as
$$\displaystyle{ V ^{\mathrm{T}}AV = \mathrm{diag}(I_{ r},0), }$$
(8.32)
where V is a square orthogonal matrix and r = rank(A). (This is from equation ( 3.253) on page 154.)
For symmetric matrices, there is a converse to the fact that all eigenvalues of an idempotent matrix are either 0 or 1. If A is a symmetric matrix all of whose eigenvalues are either 0 or 1, then A is idempotent. We see this from the spectral decomposition of A, A = V diag(I r , 0)VT, and, with C = diag(I r , 0), by observing
$$\displaystyle{AA = V CV ^{\mathrm{T}}V CV ^{\mathrm{T}} = V CCV ^{\mathrm{T}} = V CV ^{\mathrm{T}} = A,}$$
because the diagonal matrix of eigenvalues C contains only 0s and 1s.
If A is symmetric and p is any positive integer,
$$\displaystyle{ A^{p+1} = A^{p}\Longrightarrow A\;\mbox{ is idempotent}. }$$
(8.33)
This follows by considering the eigenvalues of A, \(c_{1},\ldots,c_{n}\). The eigenvalues of \(A^{p+1}\) are \(c_{1}^{p+1},\ldots,c_{n}^{p+1}\) and the eigenvalues of \(A^{p}\) are \(c_{1}^{p},\ldots,c_{n}^{p}\), but since \(A^{p+1} = A^{p}\), it must be the case that \(c_{i}^{p+1} = c_{i}^{p}\) for each \(i = 1,\ldots,n\). The only way this is possible is for each eigenvalue to be 0 or 1, and in this case the symmetric matrix must be idempotent.
There are bounds on the elements of a symmetric idempotent matrix. Because A is symmetric and ATA = A,
$$\displaystyle{ a_{ii} =\sum _{ j=1}^{n}a_{ ij}^{2}; }$$
(8.34)
hence, 0 ≤ a ii . Rearranging equation (8.34), we have
$$\displaystyle{ a_{ii} = a_{ii}^{2} +\sum _{ j\neq i}a_{ij}^{2}, }$$
(8.35)
so \(a_{ii}^{2} \leq a_{ii}\), or 0 ≤ a ii (1 − a ii ); that is, a ii ≤ 1. Now, if a ii = 0 or a ii = 1, then equation (8.35) implies
$$\displaystyle{\sum _{j\neq i}a_{ij}^{2} = 0,}$$
and the only way this can happen is if a ij = 0 for all ji. So, in summary, if A is an n × n symmetric idempotent matrix, then
$$\displaystyle{ 0 \leq a_{ii} \leq 1\;\mbox{ for}\;i = 1,\ldots,n, }$$
(8.36)
and
$$\displaystyle{ \mbox{ if}\;a_{ii} = 0\;\mbox{ or}\;a_{ii} = 1,\;\mbox{ then}\;a_{ij} = a_{ji} = 0\;\mbox{ for all}\;j\neq i. }$$
(8.37)

8.5.1.2 Cochran’s Theorem

There are various facts that are sometimes called Cochran’s theorem. The simplest one concerns k symmetric idempotent n × n matrices, A1, …, A k , such that
$$\displaystyle{ I_{n} = A_{1} + \cdots + A_{k}. }$$
(8.38)
Under these conditions, we have
$$\displaystyle{ A_{i}A_{j} = 0\;\mbox{ for all}\;i\neq j. }$$
(8.39)
We see this by the following argument. For an arbitrary j, as in equation (8.32), for some matrix V, we have
$$\displaystyle{V ^{\mathrm{T}}A_{ j}V = \mathrm{diag}(I_{r},0),}$$
where r = rank(A j ). Now
$$\displaystyle\begin{array}{rcl} I_{n}& =& V ^{\mathrm{T}}I_{ n}V {}\\ & =& \sum _{i=1}^{k}V ^{\mathrm{T}}A_{ i}V {}\\ & =& \mathrm{diag}(I_{r},0) +\sum _{i\neq j}V ^{\mathrm{T}}A_{ i}V, {}\\ & & {}\\ \end{array}$$
which implies
$$\displaystyle{ \sum _{i\neq j}V ^{\mathrm{T}}A_{ i}V = \mathrm{diag}(0,I_{n-r}). }$$
(8.40)
Now, from equation (8.30), for each i, VTA i V is idempotent, and so from equation (8.36) the diagonal elements are all nonnegative, and hence equation (8.40) implies that for each i ≠ j, the first r diagonal elements are 0. Furthermore, since these diagonal elements are 0, equation (8.37) implies that all elements in the first r rows and columns are 0. We have, therefore, for each i ≠ j,
$$\displaystyle{V ^{\mathrm{T}}A_{ i}V = \mathrm{diag}(0,B_{i})}$$
for some (n − r) × (n − r) symmetric idempotent matrix B i . Now, for any i ≠ j, consider A i A j and form VTA i A j V. We have
$$\displaystyle\begin{array}{rcl} V ^{\mathrm{T}}A_{ i}A_{j}V & =& (V ^{\mathrm{T}}A_{ i}V )(V ^{\mathrm{T}}A_{ j}V ) {}\\ & =& \mathrm{diag}(0,B_{i})\mathrm{diag}(I_{r},0) {}\\ & =& 0. {}\\ \end{array}$$
Because V is nonsingular, this implies the desired conclusion; that is, that A i A j = 0 for any i ≠ j.

We can now extend this result to an idempotent matrix in place of I; that is, for an idempotent matrix A with A = A1 + ⋯ + A k . Rather than stating it simply as in equation (8.39), however, we will state the implications differently.

Let A1, …, A k be n × n symmetric matrices and let
$$\displaystyle{ A = A_{1} + \cdots + A_{k}. }$$
(8.41)

Then any two of the following conditions imply the third one:

  1. (a).

    A is idempotent.

     
  2. (b).

    A i is idempotent for i = 1, …, k.

     
  3. (c).

    A i A j = 0 for all ij.

     

This is also called Cochran’s theorem. (The theorem also applies to nonsymmetric matrices if condition (c) is augmented with the requirement that rank(A i 2) = rank(A i ) for all i. We will restrict our attention to symmetric matrices, however, because in most applications of these results, the matrices are symmetric.)

First, if we assume properties (a) and (b), we can show that property (c) follows using an argument similar to that used to establish equation (8.39) for the special case A = I. The formal steps are left as an exercise.

Now, let us assume properties (b) and (c) and show that property (a) holds. With properties (b) and (c), we have
$$\displaystyle\begin{array}{rcl} AA& =& \left (A_{1} + \cdots + A_{k}\right )\left (A_{1} + \cdots + A_{k}\right ) {}\\ & =& \sum _{i=1}^{k}A_{i}A_{i} +\sum _{i=1}^{k}\sum _{j\neq i}A_{i}A_{j} {}\\ & =& \sum _{i=1}^{k}A_{i} {}\\ & =& A. {}\\ \end{array}$$
Hence, we have property (a); that is, A is idempotent.
Finally, let us assume properties (a) and (c). Property (b) follows immediately from
$$\displaystyle{A_{i}^{2} = A_{ i}A_{i} = A_{i}A = A_{i}AA = A_{i}^{2}A = A_{ i}^{3}}$$
and the implication (8.33).

Any two of the properties (a) through (c) also imply a fourth property for A = A1 + ⋯ + A k when the A i are symmetric:

  1. (d).

    rank(A) = rank(A1) + ⋯ + rank(A k ).

     
We first note that any two of properties (a) through (c) imply the third one, so we will just use properties (a) and (b). Property (a) gives
$$\displaystyle{\mathrm{rank}(A) = \mathrm{tr}(A) = \mathrm{tr}(A_{1} + \cdots + A_{k}) = \mathrm{tr}(A_{1}) + \cdots + \mathrm{tr}(A_{k}),}$$
and property (b) states that the latter expression is rank(A1) + ⋯ + rank(A k ), thus yielding property (d).

There is also a partial converse: properties (a) and (d) imply the other properties.

One of the most important special cases of Cochran’s theorem is when A = I in the sum (8.41):
$$\displaystyle{I_{n} = A_{1} + \cdots + A_{k}.}$$
The identity matrix is idempotent, so if rank(A1) + ⋯ + rank(A k ) = n, all the properties above hold.
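
A small R illustration (a simulated orthogonal matrix, not from the text) of the case A = I: two symmetric idempotent matrices that sum to I must satisfy equation (8.39), and their ranks add to n:

 set.seed(4)
 V <- qr.Q(qr(matrix(rnorm(25), 5, 5)))   # a 5 x 5 orthogonal matrix
 A1 <- V[, 1:2] %*% t(V[, 1:2])           # projection onto a 2-dimensional subspace
 A2 <- V[, 3:5] %*% t(V[, 3:5])           # projection onto its orthogonal complement
 max(abs(A1 + A2 - diag(5)))              # ~ 0: the matrices sum to I
 max(abs(A1 %*% A2))                      # ~ 0: equation (8.39)
 sum(diag(A1)) + sum(diag(A2))            # 5: the ranks add to n, property (d)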

The most important statistical application of Cochran’s theorem is for the distribution of quadratic forms of normally distributed random vectors. These distribution results are also called Cochran’s theorem. We briefly discuss it in Sect.  9.2.3.

8.5.2 Projection Matrices: Symmetric Idempotent Matrices

For a given vector space \(\mathcal{V}\), a symmetric idempotent matrix A whose columns span \(\mathcal{V}\) is said to be a projection matrix onto \(\mathcal{V}\); in other words, a matrix A is a projection matrix onto span(A) if and only if A is symmetric and idempotent. (Some authors do not require a projection matrix to be symmetric. In that case, the terms “idempotent” and “projection” are synonymous.)

It is easy to see that, for any vector x, if A is a projection matrix onto \(\mathcal{V}\), the vector Ax is in \(\mathcal{V}\), and the vector x − Ax is in \(\mathcal{V}^{\perp }\) (the vectors Ax and x − Ax are orthogonal). For this reason, a projection matrix is sometimes called an “orthogonal projection matrix”. Note that an orthogonal projection matrix is not an orthogonal matrix, however, unless it is the identity matrix. Stating this in alternative notation, if A is a projection matrix and \(A \in \mathrm{I\!R}^{n\times n}\), then A maps \(\mathrm{I\!R}^{n}\) onto \(\mathcal{V}(A)\) and I − A is also a projection matrix (called the complementary projection matrix of A), and it maps \(\mathrm{I\!R}^{n}\) onto the orthogonal complement, \(\mathcal{N}(A)\). These spaces are such that \(\mathcal{V}(A) \oplus \mathcal{N}(A) = \mathrm{I\!R}^{n}\).

In this text, we use the term “projection” to mean “orthogonal projection”, but we should note that in some literature “projection” can include “oblique projection”. In the less restrictive definition, for vector spaces \(\mathcal{V}\), \(\mathcal{X}\), and \(\mathcal{Y}\), if \(\mathcal{V} = \mathcal{X}\oplus \mathcal{Y}\) and v = x + y with \(x \in \mathcal{X}\) and \(y \in \mathcal{Y}\), then the vector x is called the projection of v onto \(\mathcal{X}\) along \(\mathcal{Y}\). In this text, to use the unqualified term “projection”, we require that \(\mathcal{X}\) and \(\mathcal{Y}\) be orthogonal; if they are not, then we call x the oblique projection of v onto \(\mathcal{X}\) along \(\mathcal{Y}\). The choice of the more restrictive definition is because of the overwhelming importance of orthogonal projections in statistical applications. The restriction is also consistent with the definition in equation ( 2.51) of the projection of a vector onto another vector (as opposed to the projection onto a vector space).

Because a projection matrix is idempotent, the matrix projects any of its columns onto itself, and of course it projects the full matrix onto itself: AA = A (see equation (8.27)).

If x is a general vector in \(\mathrm{I\!R}^{n}\), that is, if x has order n and belongs to an n-dimensional space, and A is a projection matrix of rank r ≤ n, then Ax has order n and belongs to span(A), which is an r-dimensional space. Thus, we say projections are dimension reductions.

Useful projection matrices often encountered in statistical linear models are X+X and XX+. (Recall that for any generalized inverse \(X^{-}\), \(X^{-}X\) is an idempotent matrix.) The matrix X+ exists for any n × m matrix X, and X+X is square (m × m) and symmetric.

8.5.2.1 Projections onto Linear Combinations of Vectors

On page 36, we gave the projection of a vector y onto a vector x as
$$\displaystyle{\frac{x^{\mathrm{T}}y} {x^{\mathrm{T}}x}x.}$$
The projection matrix to accomplish this is the “outer/inner products matrix”,
$$\displaystyle{ \frac{1} {x^{\mathrm{T}}x}xx^{\mathrm{T}}. }$$
(8.42)
The outer/inner products matrix has rank 1. It is useful in a variety of matrix transformations. If x is normalized, the projection matrix for projecting a vector on x is just xxT. The projection matrix for projecting a vector onto a unit vector e i is e i e i T, and e i e i Ty = (0, …, y i , …, 0).

This idea can be used to project y onto the plane formed by two vectors, x1 and x2, by forming a projection matrix in a similar manner and replacing x in equation (8.42) with the matrix X = [x1 | x2]. On page 409, we will view linear regression fitting as a projection onto the space spanned by the independent variables.
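
The following R sketch (arbitrary simulated vectors, not from the text) forms the outer/inner products matrix of equation (8.42) and the analogous projection onto the plane spanned by two vectors:

 set.seed(5)
 x <- rnorm(6); y <- rnorm(6)
 Px <- x %*% t(x) / sum(x * x)                       # equation (8.42), rank 1
 max(abs(Px %*% y - sum(x * y) / sum(x * x) * x))    # ~ 0: agrees with the formula above
 X2 <- cbind(rnorm(6), rnorm(6))                     # X = [x1 | x2]
 P2 <- X2 %*% solve(crossprod(X2)) %*% t(X2)         # projects onto span(x1, x2)
 max(abs(P2 %*% P2 - P2))                            # ~ 0: symmetric and idempotent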

The angle between vectors we defined on page 37 can be generalized to the angle between a vector and a plane or any linear subspace by defining it as the angle between the vector and the projection of the vector onto the subspace. By applying the definition ( 2.54) to the projection, we see that the angle θ between the vector y and the subspace spanned by the columns of a projection matrix A is determined by the cosine
$$\displaystyle{ \cos (\theta ) = \left (\frac{y^{\mathrm{T}}Ay} {y^{\mathrm{T}}y}\right )^{1/2}. }$$
(8.43)

8.6 Special Matrices Occurring in Data Analysis

Some of the most useful applications of matrices are in the representation of observational data, as in Fig. 8.1 on page 330. If the data are represented as real numbers, the array is a matrix, say X. The rows of the n × m data matrix X are “observations”; each row corresponds to a vector of measurements on a single observational unit, and each column of X corresponds to the n measurements of a single variable or feature. In data analysis we may form various association matrices that measure relationships among the variables or the observations that correspond to the columns or the rows of X. Many summary statistics arise from a matrix of the form XTX. (If the data in X are incomplete—that is, if some elements are missing—problems may arise in the analysis. We discuss some of these issues in Sect.  9.5.6.)

8.6.1 Gramian Matrices

A (real) matrix A such that for some (real) matrix B, A = BTB, is called a Gramian matrix. Any nonnegative definite matrix is Gramian (from equation (8.14) and Sect.  5.9.2 on page 256). (This is not a definition of “Gramian” or “Gram” matrix; these terms have more general meanings, but they do include any matrix expressible as BTB.)

8.6.1.1 Sums of Squares and Cross Products

Although the properties of Gramian matrices are of interest, our starting point is usually the data matrix X, which we may analyze by forming a Gramian matrix XTX or XXT (or a related matrix). These Gramian matrices are also called sums of squares and cross products matrices. (The term “cross product” does not refer to the cross product of vectors defined on page 47, but rather to the presence of sums over i of the products x ij x ik along with sums of squares x ij 2.) These matrices and other similar ones are useful association matrices in statistical applications.

8.6.1.2 Some Immediate Properties of Gramian Matrices

Some interesting properties of a Gramian matrix XTX that we have discussed are:
  • XTX is symmetric.

  • rank(XTX) = rank(X).

  • XTX is of full rank if and only if X is of full column rank.

  • XTX is nonnegative definite.

  • XTX is positive definite if and only if X is of full column rank.

  • XTX = 0 ⇔ X = 0.

  • BXTX = CXTX ⟹ BXT = CXT.

  • XTXB = XTXC ⟹ XB = XC.

  • If d is a singular value of X, then c = d2 is an eigenvalue of XTX; or, expressed another way, if c is a nonzero eigenvalue of XTX, then there is a singular value d of X such that d2 = c.

These properties were shown in Sect.  3.3.10, beginning on page 115, except for the last one, which was shown on page 162.

Each element of a Gramian matrix is the dot product of columns of the constituent matrix. If xi and xj are the ith and jth columns of the matrix X, then
$$\displaystyle{ (X^{\mathrm{T}}X)_{ ij} = x_{{\ast}i}^{\mathrm{T}}x_{ {\ast}j}. }$$
(8.44)
A Gramian matrix is also the sum of the outer products of the rows of the constituent matrix. If x i is the ith row of the n × m matrix X, then
$$\displaystyle{ X^{\mathrm{T}}X =\sum _{ i=1}^{n}x_{ i{\ast}}x_{i{\ast}}^{\mathrm{T}}. }$$
(8.45)
This is generally the way a Gramian matrix is computed.
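
A quick check in R (simulated data, not from the text) that XTX equals the sum of the outer products of the rows, as in equation (8.45):

 set.seed(6)
 X <- matrix(rnorm(12), 4, 3)
 G1 <- crossprod(X)                                               # X^T X
 G2 <- Reduce(`+`, lapply(1:nrow(X), function(i) tcrossprod(X[i, ])))
 max(abs(G1 - G2))   # ~ 0
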
By equation (8.14), we see that any Gramian matrix formed from a general matrix X is the same as a Gramian matrix formed from a square upper triangular matrix T:
$$\displaystyle{X^{\mathrm{T}}X = T^{\mathrm{T}}T.}$$

8.6.1.3 Generalized Inverses of Gramian Matrices

The generalized inverses of XTX have useful properties. First, we see from the definition, for any generalized inverse \((X^{\mathrm{T}}X)^{-}\), that \(((X^{\mathrm{T}}X)^{-})^{\mathrm{T}}\) is also a generalized inverse of XTX. (Note that \((X^{\mathrm{T}}X)^{-}\) is not necessarily symmetric.) Also, we have, from equation ( 3.162),
$$\displaystyle{ X(X^{\mathrm{T}}X)^{-}X^{\mathrm{T}}X = X. }$$
(8.46)
This means that \((X^{\mathrm{T}}X)^{-}X^{\mathrm{T}}\) is a generalized inverse of X.
The Moore-Penrose inverse of X has an interesting relationship with a generalized inverse of XTX:
$$\displaystyle{ XX^{+} = X(X^{\mathrm{T}}X)^{-}X^{\mathrm{T}}. }$$
(8.47)
This can be established directly from the definition of the Moore-Penrose inverse.
An important property of \(X(X^{\mathrm{T}}X)^{-}X^{\mathrm{T}}\) is its invariance to the choice of the generalized inverse of XTX. Suppose G is any generalized inverse of XTX. Then, from equation (8.46), we have \(X(X^{\mathrm{T}}X)^{-}X^{\mathrm{T}}X = XGX^{\mathrm{T}}X\), and from the implication ( 3.162), we have
$$\displaystyle{ XGX^{\mathrm{T}} = X(X^{\mathrm{T}}X)^{-}X^{\mathrm{T}}; }$$
(8.48)
that is, \(X(X^{\mathrm{T}}X)^{-}X^{\mathrm{T}}\) is invariant to the choice of generalized inverse.
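
The invariance in equation (8.48) can be seen numerically. In the R sketch below (simulated data with a deliberately rank-deficient X, not from the text), one generalized inverse is the Moore-Penrose inverse from MASS::ginv and another is built by inverting a full-rank block and padding with zeros:

 library(MASS)                     # for ginv(), the Moore-Penrose inverse
 set.seed(7)
 x1 <- rnorm(8); x2 <- rnorm(8)
 X <- cbind(x1, x2, x1 + x2)       # rank 2, so X^T X is singular
 A <- crossprod(X)
 G1 <- ginv(A)                     # one generalized inverse of X^T X
 G2 <- matrix(0, 3, 3)             # another: invert a full-rank 2 x 2 block
 G2[1:2, 1:2] <- solve(A[1:2, 1:2])
 max(abs(A %*% G2 %*% A - A))                      # ~ 0: G2 is a generalized inverse
 max(abs(X %*% G1 %*% t(X) - X %*% G2 %*% t(X)))   # ~ 0: equation (8.48)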

8.6.1.4 Eigenvalues of Gramian Matrices

The nonzero eigenvalues of XTX are the same as the nonzero eigenvalues of XXT (property 14 on page 140).

If the singular value decomposition of X is UDVT (page 161), then the similar canonical factorization of XTX (equation ( 3.252)) is V DTDVT. Hence, we see that the nonzero singular values of X are the square roots of the nonzero eigenvalues of the symmetric matrix XTX. By using DDT similarly, we see that they are also the square roots of the nonzero eigenvalues of XXT.

8.6.2 Projection and Smoothing Matrices

It is often of interest to approximate an arbitrary n-vector in a given m-dimensional vector space, where m < n. An n × n projection matrix of rank m clearly does this.

8.6.2.1 A Projection Matrix Formed from a Gramian Matrix

An important matrix that arises in analysis of a linear model of the form
$$\displaystyle{ y = X\beta +\epsilon }$$
(8.49)
is
$$\displaystyle{ H = X(X^{\mathrm{T}}X)^{-}X^{\mathrm{T}}, }$$
(8.50)
where \((X^{\mathrm{T}}X)^{-}\) is any generalized inverse. From equation (8.48), H is invariant to the choice of generalized inverse. By equation (8.47), this matrix can be obtained from the pseudoinverse and so
$$\displaystyle{ H = XX^{+}. }$$
(8.51)
In the full rank case, this is uniquely
$$\displaystyle{ H = X(X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}. }$$
(8.52)
Whether or not X is of full rank, H is a projection matrix onto span(X). It is called the “hat matrix” because it projects the observed response vector, often denoted by y, onto a predicted response vector, often denoted by \(\widehat{y}\) in span(X):
$$\displaystyle{ \widehat{y} = Hy. }$$
(8.53)
Because H is invariant, this projection is invariant to the choice of generalized inverse. (In the nonfull rank case, however, we generally refrain from referring to the vector Hy as the “predicted response”; rather, we may call it the “fitted response”.)
The rank of H is the same as the rank of X, and its trace is the same as its rank (because it is idempotent). When X is of full column rank, we have
$$\displaystyle{ \mathrm{tr}(H) = \mathrm{number}\;\mathrm{of}\;\mathrm{columns}\;\mathrm{of}\;X. }$$
(8.54)
(This can also be seen by using the invariance of the trace to permutations of the factors in a product as in equation ( 3.79).)

In linear models, tr(H) is the model degrees of freedom, and the sum of squares due to the model is just yTHy.
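
For a full-column-rank X, the hat matrix and its trace are easy to compute directly; the R sketch below (simulated data, not from the text) also checks that Hy agrees with the fitted values from lm():

 set.seed(8)
 n <- 20
 X <- cbind(1, rnorm(n), rnorm(n))
 y <- rnorm(n)
 H <- X %*% solve(crossprod(X)) %*% t(X)
 sum(diag(H))                                 # 3, the model degrees of freedom
 max(abs(H %*% y - fitted(lm(y ~ X - 1))))    # ~ 0: H projects y onto span(X)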

The complementary projection matrix,
$$\displaystyle{ I - H, }$$
(8.55)
also has interesting properties that relate to linear regression analysis. In geometrical terms, this matrix projects a vector onto \(\mathcal{N}(X^{\mathrm{T}})\), the orthogonal complement of span(X). We have
$$\displaystyle\begin{array}{rcl} y& =& Hy + (I - H)y \\ & =& \widehat{y} + r, {}\end{array}$$
(8.56)
where \(r = (I - H)y \in \mathcal{N}(X^{\mathrm{T}})\). The orthogonal complement is called the residual vector space, and r is called the residual vector. Both the rank and the trace of the orthogonal complement are the number of rows in X (that is, the number of observations) minus the regression degrees of freedom. This quantity is the “residual degrees of freedom” (unadjusted).
The two projection matrices, H of equation (8.50) or (8.52) and I − H of equation (8.55), partition the total sum of squares:
$$\displaystyle{ y^{\mathrm{T}}y = y^{\mathrm{T}}Hy\; +\; y^{\mathrm{T}}(I - H)y. }$$
(8.57)
This partitioning decomposes the total sum of squares into a sum of squares due to the fitted relationship between y and X and a “residual” sum of squares. The analysis of these two sums of squares is one of the most fundamental and important techniques in statistics. Note that the second term in this partitioning is the Schur complement of XTX in [X y]T [X y] (see equation ( 3.191) on page 122).

8.6.2.2 Smoothing Matrices

The hat matrix, either from a full rank X as in equation (8.52) or formed by a generalized inverse as in equation (8.50), smoothes the vector y onto the hyperplane defined by the column space of X. It is therefore a smoothing matrix. (Note that the rank of the column space of X is the same as the rank of XTX.)

A useful variation of the cross products matrix XTX is the matrix formed by adding a nonnegative (positive) definite matrix A to it. Because XTX is nonnegative (positive) definite, XTX + A is nonnegative definite, as we have seen (page 349), and hence XTX + A is a Gramian matrix.

Because the square root of the nonnegative definite A exists, we can express the sum of the matrices as
$$\displaystyle{ X^{\mathrm{T}}X+A = \left [\begin{array}{c} X \\ A^{\frac{1} {2} } \end{array} \right ]^{\mathrm{T}}\left [\begin{array}{c} X \\ A^{\frac{1} {2} } \end{array} \right ]. }$$
(8.58)
In a common application, a positive definite matrix λI, with λ > 0, is added to XTX, and this new matrix is used as a smoothing matrix. The analogue of the hat matrix (8.52) is
$$\displaystyle{ H_{\lambda } = X(X^{\mathrm{T}}X +\lambda I)^{-1}X^{\mathrm{T}}, }$$
(8.59)
and the analogue of the fitted response is
$$\displaystyle{ \widehat{y}_{\lambda } = H_{\lambda }y. }$$
(8.60)
This has the effect of shrinking the \(\widehat{y}\) of equation (8.53) toward 0. (In regression analysis, this is called “ridge regression”; see page 364 or 431.)

Any matrix such as H λ that is used to transform the observed vector y onto a given subspace is called a smoothing matrix.

8.6.2.3 Effective Degrees of Freedom

Because of the shrinkage in ridge regression (that is, because the fitted model is less dependent just on the data in X) we say the “effective” degrees of freedom of a ridge regression model decreases with increasing λ. We can formally define the effective model degrees of freedom of any linear fit \(\widehat{y} = H_{\lambda }y\) as
$$\displaystyle{ \mathrm{tr}(H_{\lambda }), }$$
(8.61)
analogous to the model degrees of freedom in linear regression above. This definition of effective degrees of freedom applies generally in data smoothing. In fact, many smoothing matrices used in applications depend on a single smoothing parameter such as the λ in ridge regression, and so the same notation H λ is often used for a general smoothing matrix.
To evaluate the effective degrees of freedom in the ridge regression model for a given λ and X, for example, using the singular value decomposition of X, X = UDVT, we have
$$\displaystyle{ \begin{array}{rl} \mathrm{tr}(X(X^{\mathrm{T}}X +\lambda I)^{-1}X^{\mathrm{T}}) \\ =&\mathrm{tr}\left (UDV ^{\mathrm{T}}(V D^{2}V ^{\mathrm{T}} +\lambda V V ^{\mathrm{T}})^{-1}V DU^{\mathrm{T}}\right ) \\ =&\mathrm{tr}\left (UDV ^{\mathrm{T}}(V (D^{2} +\lambda I)V ^{\mathrm{T}})^{-1}V DU^{\mathrm{T}}\right ) \\ =&\mathrm{tr}\left (UD(D^{2} +\lambda I)^{-1}DU^{\mathrm{T}}\right ) \\ =&\mathrm{tr}\left (D^{2}(D^{2} +\lambda I)^{-1}\right ) \\ =&\sum \frac{d_{i}^{2}} {d_{i}^{2}+\lambda }, \end{array} }$$
(8.62)
where the d i are the singular values of X.

When λ = 0, this is the same as the ordinary model degrees of freedom, and when λ is positive, this quantity is smaller, as we would want it to be by the argument above. The d i 2∕(d i 2 + λ) are called shrinkage factors.
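
A numerical check (simulated X and an arbitrary λ, not from the text) that the trace of H λ equals the sum of the shrinkage factors in equation (8.62):

 set.seed(9)
 X <- matrix(rnorm(50 * 4), 50, 4)
 lambda <- 2
 H_lambda <- X %*% solve(crossprod(X) + lambda * diag(4)) %*% t(X)
 d <- svd(X)$d
 sum(diag(H_lambda))         # the effective degrees of freedom, tr(H_lambda)
 sum(d^2 / (d^2 + lambda))   # the same value, via the singular values of X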

If XTX is not of full rank, the addition of λI to it also has the effect of yielding a full rank matrix, if λ > 0, and so the inverse of XTX + λI exists even when that of XTX does not. In any event, the addition of λI to XTX yields a matrix with a better condition number, which we discussed in Sect.  6.1. (On page 272, we showed that the condition number of XTX + λI is better than that of XTX.)

8.6.2.4 Residuals from Smoothed Data

Just as in equation (8.56), we can write
$$\displaystyle{ y =\widehat{ y}_{\lambda } + r_{\lambda }. }$$
(8.63)
Notice, however, that in \(\widehat{y}_{\lambda } = H_{\lambda }y\), H λ is not in general a projection matrix. Unless H λ is a projection matrix, \(\widehat{y}_{\lambda }\) and r λ are not orthogonal in general as are \(\widehat{y}\) and r, and we do not have the additive partitioning of the sum of squares as in equation (8.57).

The rank of H λ is the same as the rank of X (which, in the full column rank case, is the number of columns of X), but the trace, and hence the effective model degrees of freedom, is less than this number when λ > 0.

8.6.3 Centered Matrices and Variance-Covariance Matrices

In Sect.  2.3, we defined the variance of a vector and the covariance of two vectors. These are the same as the “sample variance” and “sample covariance” in statistical data analysis and are related to the variance and covariance of random variables in probability theory. We now consider the variance-covariance matrix associated with a data matrix. We occasionally refer to the variance-covariance matrix simply as the “variance matrix” or just as the “variance”.

First, we consider centering and scaling data matrices.

8.6.3.1 Centering and Scaling of Data Matrices

When the elements in a vector represent similar measurements or observational data on a given phenomenon, summing or averaging the elements in the vector may yield meaningful statistics. In statistical applications, the columns in a matrix often represent measurements on the same feature or on the same variable over different observational units as in Fig. 8.1, and so the mean of a column may be of interest.

We may center the column by subtracting its mean from each element in the same manner as we centered vectors on page 48. The matrix formed by centering all of the columns of a given matrix is called a centered matrix, and if the original matrix is X, we represent the centered matrix as Xc in a notation analogous to what we introduced for centered vectors. If we represent the matrix whose ith column is the constant mean of the ith column of X as \(\overline{X}\),
$$\displaystyle{ X_{\mathrm{c}} = X -\overline{X}. }$$
(8.64)
Here is an R statement to compute this:

 n <- nrow(X)                                    # number of observations (rows of X)
 Xc <- X - rep(1,n) %*% t(apply(X, 2, mean))     # subtract each column mean

If the unit of a measurement is changed, all elements in a column of the data matrix in which the measurement is used will change. The amount of variation of elements within a column or the relative variation among different columns ideally should not be measured in terms of the basic units of measurement, which can differ irreconcilably from one column to another. (One column could represent scores on an exam and another column could represent weight, for example.)

In analyzing data, it is usually important to scale the variables so that their variations are comparable. We do this by using the standard deviation of the column. If we have also centered the columns, the column vectors are the centered and scaled vectors of the form of those in equation ( 2.74),
$$\displaystyle{x_{\mathrm{cs}} = \frac{x_{\mathrm{c}}} {s_{x}},}$$
where s x is the standard deviation of x,
$$\displaystyle{s_{x} = \frac{\|x_{\mathrm{c}}\|} {\sqrt{n - 1}}.}$$
If all columns of the data matrix X are centered and scaled, we denote the resulting matrix as Xcs. If s i represents the standard deviation of the ith column, this matrix is formed as
$$\displaystyle{ X_{\mathrm{cs}} = X_{\mathrm{c}}\mathrm{diag}(1/s_{i}). }$$
(8.65)
Here is an R statement to compute this:

 Xcs <- Xc %*% diag(1/apply(X, 2, sd))   # divide each centered column by its standard deviation

If the rows of X are taken as representative of a population of similar vectors, it is often useful to center and scale any vector from that population in the manner of equation (8.65):
$$\displaystyle{ \tilde{x} = \mathrm{diag}(1/s_{i})x_{\mathrm{c}}. }$$
(8.66)
(Note that xc is a vector of the same order as a row of Xc.)

8.6.3.2 Gramian Matrices Formed from Centered Matrices: Covariance Matrices

An important Gramian matrix is formed as the sums of squares and cross products matrix from a centered matrix and scaled by (n − 1), where n is the number of rows of the original matrix:
$$\displaystyle\begin{array}{rcl} S_{X}& =& \frac{1} {n - 1}X_{\mathrm{c}}^{\mathrm{T}}X_{\mathrm{ c}} \\ & =& (s_{ij}). {}\end{array}$$
(8.67)
This matrix is called the variance-covariance matrix associated with the given matrix X, or the sample variance-covariance matrix, to distinguish it from the variance-covariance matrix of the distribution given in equation ( 4.71) on page 218. We denote it by S X or just S. If xi and xj are the vectors corresponding to the ith and jth columns of X, then s ij = Cov(xi , xj ); that is, the off-diagonal elements are the covariances between the column vectors, as in equation ( 2.78), and the diagonal elements are variances of the column vectors.

This matrix and others formed from it, such as R X in equation (8.69) below, are called association matrices because they are based on measures of association (covariance or correlation) among the columns of X. We could likewise define a Gramian association matrix based on measures of association among the rows of X.

A transformation using the Cholesky factor of S X or the square root of S X (assuming S X is full rank) results in a matrix whose associated variance-covariance is the identity. We call this a sphered matrix:
$$\displaystyle{ X_{\mathrm{sphered}} = X_{\mathrm{c}}S_{X}^{-\frac{1} {2} }. }$$
(8.68)
The matrix S X is a measure of the anisometry of the space of vectors represented by the rows of X as mentioned in Sect.  3.2.9. The inverse, S X −1, in some sense evens out the anisometry. Properties of vectors in the space represented by the rows of X are best assessed following a transformation as in equation (8.66). For example, rather than orthogonality of two vectors u and v, a more interesting relationship would be S X −1-conjugacy (see equation ( 3.93)):
$$\displaystyle{u^{\mathrm{T}}S_{ X}^{-1}v = 0.}$$
Also, the Mahalanobis distance, \(\sqrt{(u - v)^{\mathrm{T} } S_{X }^{-1 }(u - v)}\), may be more relevant for measuring the difference in two vectors than the standard Euclidean distance.
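
The following R sketch (simulated data, not from the text) computes S X as in equation (8.67), forms the sphered matrix of equation (8.68) using the spectral square root of S X, and evaluates a Mahalanobis distance between two rows; the particular data and square-root construction are ours:

 set.seed(10)
 X <- matrix(rnorm(90), 30, 3) %*% matrix(c(2, 1, 0, 0, 1, 1, 0, 0, 3), 3, 3)
 Xc <- scale(X, center = TRUE, scale = FALSE)        # centered matrix
 S  <- crossprod(Xc) / (nrow(X) - 1)                 # same as var(X)
 e  <- eigen(S, symmetric = TRUE)
 S_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
 Xsph <- Xc %*% S_inv_sqrt                           # sphered matrix, equation (8.68)
 round(var(Xsph), 10)                                # the identity, up to rounding
 u <- X[1, ]; v <- X[2, ]
 sqrt(t(u - v) %*% solve(S) %*% (u - v))             # Mahalanobis distance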

8.6.3.3 Gramian Matrices Formed from Scaled Centered Matrices: Correlation Matrices

If the columns of a centered matrix are standardized (that is, divided by their standard deviations, assuming that each is nonconstant, so that the standard deviation is positive), the scaled cross products matrix is the correlation matrix, often denoted by R X or just R,
$$\displaystyle\begin{array}{rcl} R_{X}& =& \frac{1} {n - 1}X_{\mathrm{cs}}^{\mathrm{T}}X_{\mathrm{ cs}} \\ & =& (r_{ij}), {}\end{array}$$
(8.69)
where if xi and xj are the vectors corresponding to the ith and jth columns of X, r ij = Cor(xi , xj ). The correlation matrix can also be expressed as \(R_{X} = \frac{1}{n-1}DX_{\mathrm{c}}^{\mathrm{T}}X_{\mathrm{c}}D\), where D is the diagonal matrix whose kth diagonal element is \(1/\sqrt{\mathrm{V} (x_{{\ast}k } )}\), where V(xk ) is the sample variance of the kth column; that is, \(\mathrm{V}(x_{{\ast}k}) =\sum _{i}(x_{ik} -\bar{ x}_{{\ast}k})^{2}/(n - 1)\). This Gramian matrix R X is based on measures of association among the columns of X.

The elements along the diagonal of the correlation matrix are all 1, and the off-diagonals are between − 1 and 1, each being the correlation between a pair of column vectors, as in equation ( 2.80). The correlation matrix is nonnegative definite because it is a Gramian matrix.

The trace of an n × n correlation matrix is n, and therefore the eigenvalues, which are all nonnegative, sum to n.

Without reference to a data matrix, any nonnegative definite matrix with 1s on the diagonal and with all elements less than or equal to 1 in absolute value is called a correlation matrix.

8.6.4 The Generalized Variance

The diagonal elements of the variance-covariance matrix S associated with the n × m data matrix X are the second moments of the centered columns of X, and the off-diagonal elements are pairwise second central moments of the columns. Each element of the matrix provides some information about the spread of a single column or of two columns of X. The determinant of S provides a single overall measure of the spread of the columns of X. This measure, | S |, is called the generalized variance, or generalized sample variance, to distinguish it from an analogous measure of a distributional model.

On page 74, we discussed the equivalence of a determinant and the volume of a parallelotope. The generalized variance captures this, and when the columns or rows of S are more orthogonal to each other, the volume of the parallelotope determined by the columns or rows of S is greater, as shown in Fig. 8.6 for m = 3.
Figure 8.6: Generalized variances in terms of the columns of S

The columns or rows of S are generally not of much interest in themselves. Our interest is in the relationship of the centered columns of the n × m data matrix X. Let us consider the case of m = 2. Let z∗1 and z∗2 represent the centered column vectors of X; that is, for z∗1, we have \(z_{{\ast}1_{i}} = x_{{\ast}1_{i}} -\bar{ x_{1}}\). Now, as in equation ( 3.42), consider the parallelogram formed by z∗1 and z∗2. For computing the area, consider z∗1 as forming the base. The length of the base is
$$\displaystyle{\|z_{{\ast}1}\| = \sqrt{(n - 1)s_{11}},}$$
and the height is
$$\displaystyle{\|z_{{\ast}2}\|\vert \sin (\theta )\vert = \sqrt{(n - 1)s_{22 } (1 - r_{12 }^{2 })}.}$$
(Recall the relationship between the angle and the correlation from equation ( 2.81).)
The area of the parallelogram therefore is
$$\displaystyle{\mathrm{area} = (n - 1)\sqrt{s_{11 } s_{22 } (1 - r_{12 }^{2 })}.}$$
Now, consider S:
$$\displaystyle\begin{array}{rcl} S& =& \left [\begin{array}{cc} s_{11} & s_{21} \\ s_{12} & s_{22} \end{array} \right ] {}\\ & & {}\\ & =& \left [\begin{array}{cc} s_{11} & \sqrt{s_{11 } s_{22}}r_{12} \\ \sqrt{s_{11 } s_{22}}r_{12} & s_{22} \end{array} \right ]. {}\\ & & {}\\ \end{array}$$
The determinant of S is therefore
$$\displaystyle{s_{11}s_{22}(1 - r_{12}^{2}),}$$
that is,
$$\displaystyle{ \vert S\vert = \frac{1} {(n - 1)^{m}}\mathrm{volume}^{2}. }$$
(8.70)

Although we considered only the case m = 2, equation (8.70) holds generally, as can be seen by induction on m (see Anderson, 2003).
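
A quick check (simulated bivariate data, not from the text) of the m = 2 case, in which the generalized variance |S| equals s11 s22 (1 − r12²):

 set.seed(11)
 X <- cbind(rnorm(25), rnorm(25))
 S <- var(X); R <- cor(X)
 det(S)                                # the generalized variance
 S[1, 1] * S[2, 2] * (1 - R[1, 2]^2)   # the same value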

8.6.4.1 Comparing Variance-Covariance Matrices

Many standard statistical procedures for comparing groups of data rely on the assumption that the population variance-covariance matrices of the groups are all the same. (The simplest example of this is the two-sample t-test, in which the concern is just that the population variances of the two groups be equal.) Occasionally, the data analyst wishes to test this assumption of homogeneity of variances.

On page 175, we considered the problem of comparing two matrices of the same size. There we defined a metric based on a matrix norm. For the problem of comparing variance-covariance matrices, a measure based on the generalized variances is more commonly used.

In the typical situation, we have an n × m data matrix X in which the first n1 rows represent observations from one group, the next n2 rows represent observations from another group, and so on, with the last n g rows representing observations from the gth group. For each group, we form a sample variance-covariance matrix S i as in equation (8.67) using the ith submatrix of X. Whenever we have individual variance-covariance matrices in situations similar to this, we define the pooled variance-covariance matrix:
$$\displaystyle{ S_{\mathrm{p}} = \frac{1} {(n - g)}\sum _{i=1}^{g}(n_{ i} - 1)S_{i}. }$$
(8.71)
Bartlett suggested a test based on the determinants of (n i − 1)S i . (From equation ( 3.36), | (n i − 1)S i | = (n i − 1) m | S i |.) A similar test suggested by Box also uses the generalized variances. One form of the Box M statistic for testing for homogeneity of variances is
$$\displaystyle{ M = (n - g)\log (\vert S_{\mathrm{p}}\vert ) -\sum _{i=1}^{g}(n_{i} - 1)\log (\vert S_{i}\vert ). }$$
(8.72)
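
A rough R sketch (simulated groups, not from the text) of the pooled variance-covariance matrix (8.71) and the Box M statistic (8.72):

 set.seed(12)
 ni <- c(15, 20, 25); g <- length(ni); m <- 2
 groups <- lapply(ni, function(k) matrix(rnorm(k * m), k, m))
 Si <- lapply(groups, var)            # within-group variance-covariance matrices
 n  <- sum(ni)
 Sp <- Reduce(`+`, Map(function(S, k) (k - 1) * S, Si, ni)) / (n - g)
 M  <- (n - g) * log(det(Sp)) -
       sum(mapply(function(S, k) (k - 1) * log(det(S)), Si, ni))
 M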

8.6.5 Similarity Matrices

Covariance and correlation matrices are examples of similarity association matrices: they measure the similarity or closeness of the columns of the matrices from which they are formed.

The cosine of the angle between two vectors is related to the correlation between the vectors, so a matrix of the cosines of the angles between the columns of a given matrix would also be a similarity matrix. (The cosine is exactly the same as the correlation if the vectors are centered; see equation ( 2.80).)

Similarity matrices can be formed in many ways, and some are more useful in particular applications than in others. They may not even arise from a standard dataset in the familiar form in statistical applications. For example, we may be interested in comparing text documents. We might form a vector-like object whose elements are the words in the document. The similarity between two such ordered tuples, generally of different lengths, may be a count of the number of words, word pairs, or more general phrases in common between the two documents.

It is generally reasonable that similarity between two objects be symmetric; that is, the first object is as close to the second as the second is to the first. We reserve the term similarity matrix for matrices formed from such measures and, hence, that themselves are symmetric. Occasionally, for example in psychometric applications, the similarities are measured relative to rank order closeness within a set. In such a case, the measure of closeness may not be symmetric. A matrix formed from such measures is called a directed similarity matrix.

8.6.6 Dissimilarity Matrices

The elements of similarity generally increase with increasing “closeness”. We may also be interested in the dissimilarity. Clearly, any decreasing function of similarity could be taken as a reasonable measure of dissimilarity. There are, however, other measures of dissimilarity that often seem more appropriate. In particular, the properties of a metric (see page 32) may suggest that a dissimilarity be defined in terms of a metric.

In considering either similarity or dissimilarity for a data matrix, we could work with either rows or columns, but the common applications make one or the other more natural for the specific application. Because of the types of the common applications, we will discuss dissimilarities and distances in terms of rows instead of columns; thus, in this section, we will consider the dissimilarity of x i and x j, the vectors corresponding to the ith and jth rows of X.

Dissimilarity matrices are also association matrices in the general sense of Sect. 8.1.5.

Dissimilarity or distance can be measured in various ways. A metric is the most obvious measure, although in certain applications other measures are appropriate. The measures may be based on some kind of ranking, for example. If the dissimilarity is based on a metric, the association matrix is often called a distance matrix. In most applications, the Euclidean distance, \(\|x_{i{\ast}} - x_{j{\ast}}\|_{2}\), is the most commonly used metric, but others, especially ones based on the L1 or L∞ norms, are often useful.

Any hollow matrix with nonnegative elements is a directed dissimilarity matrix. A directed dissimilarity matrix is also called a cost matrix. If the matrix is symmetric, it is a dissimilarity matrix.

An n × n matrix D = (d ij ) is an m-dimensional distance matrix if there exist m-vectors x1, …, x n such that, for some metric \(\Delta\), \(d_{ij} = \Delta (x_{i},x_{j})\). A distance matrix is necessarily a dissimilarity matrix. If the metric is the Euclidean distance, the matrix D is called a Euclidean distance matrix.

The matrix whose rows are the vectors x1T, …, x n T is called an associated configuration matrix of the given distance matrix. In metric multidimensional scaling, we are given a dissimilarity matrix and seek to find a configuration matrix whose associated distance matrix is closest to the dissimilarity matrix, usually in terms of the Frobenius norm of the difference of the matrices (see Trosset 2002 for basic definitions and extensions).
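
In R, a Euclidean distance matrix for the rows of a configuration matrix can be formed with dist(); the small example below uses simulated coordinates:

 set.seed(13)
 X <- matrix(rnorm(10), 5, 2)       # 5 points in 2 dimensions (the configuration)
 D <- as.matrix(dist(X))            # Euclidean distance matrix
 all(diag(D) == 0)                  # hollow
 isSymmetric(D)                     # symmetric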

8.7 Nonnegative and Positive Matrices

A nonnegative matrix, as the name suggests, is a real matrix all of whose elements are nonnegative, and a positive matrix is a real matrix all of whose elements are positive. In some other literature, the latter type of matrix is called strictly positive, and a nonnegative matrix with a positive element is called positive.

Many useful matrices are nonnegative, and we have already considered various kinds of nonnegative matrices. The adjacency or connectivity matrix of a graph is nonnegative. Dissimilarity matrices, including distance matrices, are nonnegative. Matrices used in modeling stochastic processes are nonnegative. While many of these matrices are square, we do not restrict the definitions to square matrices.

If A is nonnegative, we write
$$\displaystyle{ A \geq 0, }$$
(8.73)
and if it is positive, we write
$$\displaystyle{ A> 0. }$$
(8.74)
Notice that A ≥ 0 and A ≠ 0 together do not imply A > 0.
We write
$$\displaystyle{A \geq B}$$
to mean (A − B) ≥ 0 and
$$\displaystyle{A> B}$$
to mean (A − B) > 0. (Recall the definitions of nonnegative definite and positive definite matrices, and, from equations (8.11) and (8.16), the notation used to indicate those properties, A⪰0 and A ≻ 0. Furthermore, notice that these definitions and this notation for nonnegative and positive matrices are consistent with analogous definitions and notation involving vectors on page 32. Some authors, however, use the notation of equations (8.73) and (8.74) to mean “nonnegative definite” and “positive definite”. We should also note that some authors use somewhat different terms for these and related properties. “Positive” for these authors means nonnegative with at least one positive element, and “strictly positive” means positive as we have defined it.)

Notice that positiveness (nonnegativeness) has nothing to do with positive (nonnegative) definiteness. A positive or nonnegative matrix need not be symmetric or even square, although most such matrices useful in applications are square. A square positive matrix, unlike a positive definite matrix, need not be of full rank.

The following properties are easily verified.
  1. 1.

    If A ≥ 0 and u ≥ v ≥ 0, then Au ≥ Av.

     
  2. 2.

    If A ≥ 0, A ≠ 0, and u > v > 0, then Au ≥ Av and Au ≠ Av.

     
  3. 3.

    If A > 0 and v ≥ 0, then Av ≥ 0.

     
  4. 4.

    If A > 0 and A is square, then ρ(A) > 0.

     
Whereas most of the important matrices arising in the analysis of linear models are symmetric, and thus have the properties listed in Sect. 8.2.1 beginning on page 340, many important nonnegative matrices, such as those used in studying stochastic processes, are not necessarily symmetric. The eigenvalues of real symmetric matrices are real, but the eigenvalues of real nonsymmetric matrices may have imaginary components. In the following discussion, we must be careful to remember the meaning of the spectral radius. The definition in equation ( 3.233) for the spectral radius of the matrix A with eigenvalues c i ,
$$\displaystyle{\rho (A) =\max \vert c_{i}\vert,}$$
is still correct, but the operator “ | ⋅ | ” must be interpreted as the modulus of a complex number.

8.7.1 The Convex Cones of Nonnegative and Positive Matrices

The class of all n × m nonnegative (positive) matrices is a cone because if X is a nonnegative (positive) matrix and a > 0, then aX is a nonnegative (positive) matrix (see page 43). Furthermore, it is a convex cone in \(\mathrm{I\!R}^{n\times m}\), because if X1 and X2 are n × m nonnegative (positive) matrices and a, b ≥ 0, then aX1 + bX2 is nonnegative (positive) so long as either a ≠ 0 or b ≠ 0.

Both of these classes are closed under Cayley multiplication, but they may not have inverses in the class (that is, in particular, they are not groups with respect to that operation).

8.7.2 Properties of Square Positive Matrices

We have the following important properties for square positive matrices. These properties collectively are the conclusions of the Perron theorem.

Let A be a square positive matrix and let r = ρ(A). Then:

  1. 1.

    r is an eigenvalue of A. The eigenvalue r is called the Perron root. Note that the Perron root is real and positive (although other eigenvalues of A may not be).

     
  2. 2.

    There is an eigenvector v associated with r such that v > 0.

     
  3. 3.

    The Perron root is simple. (That is, the algebraic multiplicity of the Perron root is 1.)

     
  4. 4.

    The dimension of the eigenspace of the Perron root is 1. (That is, the geometric multiplicity of ρ(A) is 1.) Hence, if v is an eigenvector associated with r, it is unique except for scaling. This associated eigenvector is called the Perron vector. Note that the Perron vector is real (although other eigenvectors of A may not be). The elements of the Perron vector all have the same sign, which we usually take to be positive; that is, v > 0.

     
  5. 5.

    If c i is any other eigenvalue of A, then | c i | < r. (That is, r is the only eigenvalue on the spectral circle of A.)

     

We will give proofs only of properties 1 and 2 as examples. Proofs of all of these facts are available in Bapat and Raghavan (1997).

To see properties 1 and 2, first observe that a positive matrix must have at least one nonzero eigenvalue because the coefficients and the constant in the characteristic equation must all be positive. Now scale the matrix so that its spectral radius is 1 (see page 142). So without loss of generality, let A be a scaled positive matrix with ρ(A) = 1. Now let (c, x) be some eigenpair of A such that | c | = 1. First, we want to show, for some such c, that c = ρ(A).

Because all elements of A are positive,
$$\displaystyle{\vert x\vert = \vert Ax\vert \leq A\vert x\vert,}$$
and so
$$\displaystyle{ A\vert x\vert -\vert x\vert \geq 0. }$$
(8.75)
An eigenvector must be nonzero, so we also have
$$\displaystyle{A\vert x\vert> 0.}$$
Now we want to show that A | x | − | x | = 0. To that end, suppose the contrary; that is, suppose A | x | − | x | ≠ 0. In that case, A(A | x | − | x | ) > 0 from equation (8.75), and so there must be a positive number ε such that
$$\displaystyle{ \frac{A} {1+\epsilon }A\vert x\vert> A\vert x\vert }$$
or
$$\displaystyle{By> y,}$$
where B = A∕(1 + ε) and y = A | x |. Now successively multiplying both sides of this inequality by the positive matrix B, we have
$$\displaystyle{B^{k}y> y\quad \mbox{ for all}\;k = 1,2,\ldots.}$$
Because ρ(B) = ρ(A)∕(1 + ε) < 1, from equation ( 3.314) on page 173, we have lim k B k = 0; that is, lim k B k y = 0 > y. This contradicts the fact that y > 0. Because the supposition A | x | − | x | ≠ 0 led to this contradiction, we must have A | x | − | x | = 0. Therefore 1 = ρ(A) must be an eigenvalue of A, and | x | must be an associated eigenvector; hence, with v = | x |, (ρ(A), v) is an eigenpair of A and v > 0, and this is the statement made in properties 1 and 2.
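
A numerical illustration (a random positive matrix, not from the text) of the Perron theorem; in R, eigen() returns the eigenvalues in order of decreasing modulus, so the first one is the Perron root:

 set.seed(14)
 A <- matrix(runif(16, min = 0.1, max = 1), 4, 4)   # a (strictly) positive matrix
 e <- eigen(A)
 e$values[1]       # the Perron root: real and positive (possibly printed with a 0i part)
 e$vectors[, 1]    # the Perron vector: real up to a zero imaginary part, all of one sign
 Mod(e$values)     # all other moduli are strictly smaller (property 5)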

The Perron-Frobenius theorem, which we consider below, extends these results to a special class of square nonnegative matrices. (This class includes all positive matrices, so the Perron-Frobenius theorem is an extension of the Perron theorem.)

8.7.3 Irreducible Square Nonnegative Matrices

Nonnegativity of a matrix is not a very strong property. First of all, note that it includes the zero matrix; hence, clearly none of the properties of the Perron theorem can hold. Even a nondegenerate, full rank nonnegative matrix does not necessarily possess those properties. A small full rank nonnegative matrix provides a counterexample for properties 2, 3, and 5:
$$\displaystyle{A = \left [\begin{array}{rr} 1&1\\ 0 &1 \end{array} \right ].}$$
The eigenvalues are 1 and 1; that is, 1 with an algebraic multiplicity of 2 (so property 3 does not hold). There is only one nonnull eigenvector (up to scaling), (1, 0) (so property 4 holds), but the eigenvector is not positive, so property 2 does not hold. Of course property 5 cannot hold if property 3 does not hold.

We now consider irreducible square nonnegative matrices. This class includes positive matrices, and irreducibility yields some of the properties of positivity to matrices with some zero elements. (Heuristically, irreducibility puts some restrictions on the nonpositive elements of the matrix.) On page 337, we defined reducibility of a nonnegative square matrix and we saw that a matrix is irreducible if and only if its digraph is strongly connected.

To recall the definition, a nonnegative matrix is said to be reducible if by symmetric permutations it can be put into a block upper triangular matrix with square blocks along the diagonal; that is, the nonnegative matrix A is reducible if and only if there is a permutation matrix E(π) such that
$$\displaystyle{ E_{(\pi )}AE_{(\pi )} = \left [\begin{array}{cc} B_{11} & B_{12} \\ 0 &B_{22} \end{array} \right ], }$$
(8.76)
where B11 and B22 are square. A matrix that cannot be put into that form is irreducible. An alternate term for reducible is decomposable, with the associated term indecomposable. (There is an alternate meaning for the term “reducible” applied to a matrix. This alternate use of the term means that the matrix is capable of being expressed by a similarity transformation as the sum of two matrices whose columns are mutually orthogonal.)

We see from the definition in equation (8.76) that a positive matrix is irreducible.

Irreducible matrices have several interesting properties. An n × n nonnegative matrix A is irreducible if and only if (I + A) n−1 is a positive matrix; that is,
$$\displaystyle{ A\;\mbox{ is irreducible}\Longleftrightarrow(I + A)^{n-1}> 0. }$$
(8.77)
To see this, first assume (I + A) n−1 > 0; thus, (I + A) n−1 clearly is irreducible. If A is reducible, then there exists a permutation matrix E(π) such that
$$\displaystyle{E_{(\pi )}AE_{(\pi )} = \left [\begin{array}{cc} B_{11} & B_{12} \\ 0 &B_{22} \end{array} \right ],}$$
and so
$$\displaystyle\begin{array}{rcl} E_{(\pi )}(I + A)^{n-1}E_{ (\pi )}& =& \left (E_{(\pi )}(I + A)E_{(\pi )}\right )^{n-1} {}\\ & =& \left (I + E_{(\pi )}AE_{(\pi )}\right )^{n-1} {}\\ & =& \left [\begin{array}{cc} I_{n_{1}} + B_{11} & B_{12} \\ 0 &I_{n_{2}} + B_{22} \end{array} \right ].{}\\ \end{array}$$
This decomposition of (I + A) n−1 cannot exist because it is irreducible; hence we conclude A is irreducible if (I + A) n−1 > 0.

Now, if A is irreducible, we can see that (I + A) n−1 must be a positive matrix either by a strictly linear-algebraic approach or by couching the argument in terms of the digraph \(\mathcal{G}(A)\) formed by the matrix, as in the discussion on page 337 that showed that a digraph is strongly connected if (and only if) it is irreducible. We will use the latter approach in the spirit of applications of irreducibility in stochastic processes.

For either approach, we first observe that the (i, j)th element of (I + A)^{n−1} can be expressed as
$$\displaystyle{ \left ((I + A)^{n-1}\right )_{ ij} = \left (\sum _{k=0}^{n-1}{n - 1\choose k}A^{k}\right )_{ ij}. }$$
(8.78)
Hence, for k = 1, …, n − 1, we consider the (i, j)th entry of A^k; let a_{ij}^{(k)} represent this quantity.
Given any pair (i, j), summing over the indices l_1, l_2, …, l_{k−1}, we have
$$\displaystyle{a_{ij}^{(k)} =\sum _{ l_{1},l_{2},\ldots,l_{k-1}}a_{il_{1}}a_{l_{1}l_{2}}\cdots a_{l_{k-1}j}.}$$
Now a_{ij}^{(k)} > 0 if and only if, for some l_1, …, l_{k−1}, the elements \(a_{il_{1}},a_{l_{1}l_{2}},\ldots,a_{l_{k-1}j}\) are all positive; that is, if there is a path \(v_{i},v_{l_{1}},\ldots,v_{l_{k-1}},v_{j}\) in \(\mathcal{G}(A)\). If A is irreducible, then \(\mathcal{G}(A)\) is strongly connected, and hence, for i ≠ j, such a path of some length k ≤ n − 1 exists; for i = j, the term I in equation (8.78) already contributes a positive entry. So, for any pair (i, j), we have from equation (8.78) \(\left ((I + A)^{n-1}\right )_{ij}> 0\); that is, (I + A)^{n−1} > 0.

The positivity of (I + A)^{n−1} for an irreducible nonnegative matrix A is a very useful property because it allows us to extend some conclusions of the Perron theorem to irreducible nonnegative matrices.
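
As a small numerical illustration of the characterization (8.77), the following R code (a sketch only; matpow is an ad hoc helper, not a function from any package) tests the positivity of (I + A)^{n−1} for a cyclic permutation pattern, which is irreducible, and for the 2 × 2 counterexample matrix given earlier, which is already in block upper triangular form and hence reducible.

 # check irreducibility of a nonnegative square matrix via (I + A)^(n-1) > 0
 matpow <- function(M, k) Reduce("%*%", rep(list(M), k))
 A <- matrix(c(0, 1, 0,
               0, 0, 1,
               1, 0, 0), nrow = 3, byrow = TRUE)   # cyclic pattern; strongly connected digraph
 all(matpow(diag(3) + A, 2) > 0)                    # TRUE: A is irreducible
 B <- matrix(c(1, 1,
               0, 1), nrow = 2, byrow = TRUE)       # already block upper triangular
 all(matpow(diag(2) + B, 1) > 0)                    # FALSE: B is reducible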

8.7.3.1 Properties of Square Irreducible Nonnegative Matrices: The Perron-Frobenius Theorem

If A is a square irreducible nonnegative matrix, then we have the following properties, which are similar to properties 1 through 4 on page 374 for positive matrices. These properties are the conclusions of the Perron-Frobenius theorem.

  1. ρ(A) is an eigenvalue of A. This eigenvalue is called the Perron root, as before (and, as before, it is real and positive).

  2. The Perron root ρ(A) is simple. (That is, the algebraic multiplicity of the Perron root is 1.)

  3. The dimension of the eigenspace of the Perron root is 1. (That is, the geometric multiplicity of ρ(A) is 1.)

  4. The eigenvector associated with ρ(A) is positive. This eigenvector is called the Perron vector, as before.

The relationship (8.77) allows us to prove properties 1 and 4 in a method similar to the proofs of properties 1 and 2 for positive matrices. (This is Exercise 8.9.) Complete proofs of all of these facts are available in Bapat and Raghavan (1997). See also the solution to Exercise 8.10b on page 611 for a special case.

The one property of square positive matrices that does not carry over to square irreducible nonnegative matrices is property 5: r = ρ(A) is the only eigenvalue on the spectral circle of A. For example, the small irreducible nonnegative matrix
$$\displaystyle{A = \left [\begin{array}{rr} 0&1\\ 1 &0 \end{array} \right ]}$$
has eigenvalues 1 and − 1, and so both are on the spectral circle.

8.7.3.2 Primitive Matrices

Square irreducible nonnegative matrices that have only one eigenvalue on the spectral circle, that is, that do satisfy property 5 on page 374, also have other interesting properties that are important, for example, in Markov chains. We therefore give a name to matrices with that property:

A square irreducible nonnegative matrix is said to be primitive if it has only one eigenvalue on the spectral circle.

8.7.3.3 Limiting Behavior of Primitive Matrices

In modeling with Markov chains and other applications, the limiting behavior of A^k is an important property.

On page 172, we saw that lim_{k→∞} A^k = 0 if ρ(A) < 1. For a primitive matrix, we also have a limit for A^k if ρ(A) = 1. (As we have done above, we can scale any matrix with a nonzero eigenvalue to a matrix with a spectral radius of 1.)

If A is a primitive matrix, then we have the useful result
$$\displaystyle{ \lim _{k\rightarrow \infty }\left ( \frac{A} {\rho (A)}\right )^{k} = vw^{\mathrm{T}}, }$$
(8.79)
where v is an eigenvector of A associated with ρ(A) and w is an eigenvector of A^T associated with ρ(A), and w and v are scaled so that w^Tv = 1. (As we mentioned on page 158, such eigenvectors exist because ρ(A) is a simple eigenvalue. By property 4, they are both positive. Note that A is not necessarily symmetric, and so its eigenvectors may include imaginary components; however, the eigenvectors associated with ρ(A) are real, and so we can write w^T instead of w^H.)
To see equation (8.79), we consider \(\left (A -\rho (A)vw^{\mathrm{T}}\right )\). First, if (c_i, v_i) is an eigenpair of \(\left (A -\rho (A)vw^{\mathrm{T}}\right )\) and c_i ≠ 0, then (c_i, v_i) is an eigenpair of A. We can see this by multiplying both sides of the eigen-equation by vw^T:
$$\displaystyle\begin{array}{rcl} c_{i}vw^{\mathrm{T}}v_{ i}& =& vw^{\mathrm{T}}\left (A -\rho (A)vw^{\mathrm{T}}\right )v_{ i} {}\\ & =& \left (vw^{\mathrm{T}}A -\rho (A)vw^{\mathrm{T}}vw^{\mathrm{T}}\right )v_{ i} {}\\ & =& \left (\rho (A)vw^{\mathrm{T}} -\rho (A)vw^{\mathrm{T}}\right )v_{ i} {}\\ & =& 0; {}\\ \end{array}$$
hence, because c_i ≠ 0, we have vw^Tv_i = 0, and so
$$\displaystyle\begin{array}{rcl} Av_{i}& =& \left (A -\rho (A)vw^{\mathrm{T}}\right )v_{ i} {}\\ & =& c_{i}v_{i}. {}\\ \end{array}$$
Next, we show that
$$\displaystyle{ \rho \left (A -\rho (A)vw^{\mathrm{T}}\right ) <\rho (A). }$$
(8.80)
If ρ(A) were an eigenvalue of \(\left (A -\rho (A)vw^{\mathrm{T}}\right )\), then its associated eigenvector, say u, would also have to be an eigenvector of A, as we saw above. But because the geometric multiplicity of ρ(A) as an eigenvalue of A is 1, we would have u = sv for some scalar s. But this is impossible because that would yield
$$\displaystyle\begin{array}{rcl} \rho (A)sv& =& \left (A -\rho (A)vw^{\mathrm{T}}\right )sv {}\\ & =& sAv - s\rho (A)v {}\\ & =& 0, {}\\ \end{array}$$
and neither ρ(A) nor sv is zero; hence, ρ(A) is not an eigenvalue of \(\left (A -\rho (A)vw^{\mathrm{T}}\right )\). But as we saw above, any nonzero eigenvalue of \(\left (A -\rho (A)vw^{\mathrm{T}}\right )\) is an eigenvalue of A, and because A is primitive, every eigenvalue of A other than ρ(A) has modulus strictly less than ρ(A); therefore we have inequality (8.80).
Finally, we recall equation ( 3.269), with w and v as defined above, and with the eigenvalue ρ(A),
$$\displaystyle{ \left (A -\rho (A)vw^{\mathrm{T}}\right )^{k} = A^{k} - (\rho (A))^{k}vw^{\mathrm{T}}, }$$
(8.81)
for k = 1, 2, ….
Dividing both sides of equation (8.81) by (ρ(A)) k and rearranging terms, we have
$$\displaystyle{ \left ( \frac{A} {\rho (A)}\right )^{k} = vw^{\mathrm{T}} + \left (\frac{A -\rho (A)vw^{\mathrm{T}}} {\rho (A)}\right )^{k}. }$$
(8.82)
Now
$$\displaystyle{\rho \left (\frac{\left (A -\rho (A)vw^{\mathrm{T}}\right )} {\rho (A)} \right ) = \frac{\rho \left (A -\rho (A)vw^{\mathrm{T}}\right )} {\rho (A)},}$$
which is less than 1; hence, from equation ( 3.312) on page 172, we have
$$\displaystyle{\lim _{k\rightarrow \infty }\left (\frac{\left (A -\rho (A)vw^{\mathrm{T}}\right )} {\rho (A)} \right )^{k} = 0;}$$
so, taking the limit in equation (8.82), we have equation (8.79).
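
The following R code is a small numerical check of equation (8.79); it is only a sketch, using an arbitrary 2 × 2 positive (hence primitive) matrix, and the rescaling of v handles the arbitrary signs and lengths of the eigenvectors returned by eigen.

 # numerical check of equation (8.79) for a small positive (hence primitive) matrix
 A <- matrix(c(2, 1,
               3, 2), nrow = 2, byrow = TRUE)
 rho <- max(Mod(eigen(A)$values))             # the Perron root
 v <- Re(eigen(A)$vectors[, 1])               # right eigenvector for rho
 w <- Re(eigen(t(A))$vectors[, 1])            # left eigenvector (eigenvector of A^T)
 v <- v / sum(w * v)                          # scale so that w^T v = 1
 Ak <- Reduce("%*%", rep(list(A / rho), 50))  # (A/rho(A))^50
 max(abs(Ak - v %*% t(w)))                    # essentially zero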

Applications of the Perron-Frobenius theorem are far-ranging. It has implications for the convergence of some iterative algorithms, such as the power method discussed in Sect.  7.2. The most important applications in statistics are in the analysis of Markov chains, which we discuss in Sect.  9.8.1.

8.7.4 Stochastic Matrices

A nonnegative matrix P such that
$$\displaystyle{ P1 = 1 }$$
(8.83)
is called a stochastic matrix. The definition means that (1, 1) is an eigenpair of any stochastic matrix. It is also clear that if P is a stochastic matrix, then ∥P∥_∞ = 1 (see page 166), and because ρ(P) ≤ ∥P∥ for any norm (see page 171) and 1 is an eigenvalue of P, we have ρ(P) = 1.

A stochastic matrix may not be positive, and it may be reducible or irreducible. (Hence, (1, 1) may not be the Perron root and Perron eigenvector.)

If P is a stochastic matrix such that
$$\displaystyle{ 1^{\mathrm{T}}P = 1^{\mathrm{T}}, }$$
(8.84)
it is called a doubly stochastic matrix. If P is a doubly stochastic matrix, ∥P∥_1 = 1, and, of course, ∥P∥_∞ = 1 and ρ(P) = 1.

A permutation matrix is a doubly stochastic matrix; in fact, it is the simplest and one of the most commonly encountered doubly stochastic matrices. A permutation matrix may be either reducible (the identity, for example) or irreducible (a full-cycle permutation, for example).

Stochastic matrices are particularly interesting because of their use in defining a discrete homogeneous Markov chain. In that application, a stochastic matrix and distribution vectors play key roles. A distribution vector is a nonnegative vector whose elements sum to 1; that is, a vector v such that 1^Tv = 1. In Markov chain models, the stochastic matrix is a probability transition matrix from the distribution at time t, π_t, to the distribution at time t + 1,
$$\displaystyle{\pi _{t+1} = P^{\mathrm{T}}\pi _{t}.}$$
In Sect.  9.8.1, we define some basic properties of Markov chains. Those properties depend in large measure on whether the transition matrix is reducible or not.
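
As a sketch of how these quantities might be examined numerically (the transition matrix P below is hypothetical), the following R code verifies that ρ(P) = 1 and iterates the relation above to approximate the stationary distribution of the chain.

 # a small (hypothetical) transition matrix P with row sums 1
 P <- matrix(c(0.9, 0.1, 0.0,
               0.2, 0.5, 0.3,
               0.1, 0.4, 0.5), nrow = 3, byrow = TRUE)
 rowSums(P)                               # all 1: P is stochastic
 max(Mod(eigen(P)$values))                # 1, so rho(P) = 1
 pi_t <- c(1, 0, 0)                       # an initial distribution vector
 for (t in 1:200) pi_t <- t(P) %*% pi_t   # pi_{t+1} = P^T pi_t
 drop(pi_t)                               # approximately the stationary distribution
 # for this (irreducible) P, the limit is the stationary distribution of the chain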

8.7.5 Leslie Matrices

Another type of nonnegative transition matrix, often used in population studies, is a Leslie matrix, after P. H. Leslie, who used it in models in demography. A Leslie matrix is a matrix of the form
$$\displaystyle{ \left [\begin{array}{ccccc} \alpha _{1} & \alpha _{2} & \cdots &\alpha _{m-1} & \alpha _{m} \\ \sigma _{1} & 0&\cdots & 0 & 0 \\ 0& \sigma _{2} & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0&0&\cdots &\sigma _{m-1} & 0 \end{array} \right ], }$$
(8.85)
where all elements are nonnegative, and additionally σ_i ≤ 1.

Whether a Leslie matrix is reducible or irreducible depends on which of its elements are positive; if α_m and all of the σ_i are positive, it is irreducible. Furthermore, a Leslie matrix has a single, unique positive eigenvalue (see Exercise 8.10), which leads to some interesting properties (see Sect.  9.8.2).
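
For example, the following R sketch builds a 3 × 3 Leslie matrix from hypothetical fertility rates α and survival rates σ and extracts its single positive eigenvalue (the Perron root, interpreted in demographic models as the asymptotic growth rate).

 # a 3 x 3 Leslie matrix with hypothetical fertility (alpha) and survival (sigma) rates
 alpha <- c(0.0, 1.5, 1.0)
 sigma <- c(0.8, 0.5)
 L <- rbind(alpha, cbind(diag(sigma), 0))
 ev <- eigen(L)$values
 Re(ev[abs(Im(ev)) < 1e-8 & Re(ev) > 0])   # the single positive eigenvalue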

8.8 Other Matrices with Special Structures

Matrices of a variety of special forms arise in statistical analyses and other applications. For some matrices with special structure, specialized algorithms can increase the speed of performing a given task considerably. Many tasks involving matrices require a number of computations of the order of n^3, where n is the number of rows or columns of the matrix. For some of the matrices discussed in this section, because of their special structure, the order of computations may be n^2. The improvement from O(n^3) to O(n^2) is enough to make some tasks feasible that would otherwise be infeasible because of the time required to complete them. The collection of papers in Olshevsky (2003) describes various specialized algorithms for the kinds of matrices discussed in this section.

8.8.1 Helmert Matrices

A Helmert matrix is a square orthogonal matrix that partitions sums of squares. Its main use in statistics is in defining contrasts in general linear models to compare the second level of a factor with the first level, the third level with the average of the first two, and so on. (There is another meaning of “Helmert matrix” that arises from so-called Helmert transformations used in geodesy.)

For example, a partition of the sum \(\sum _{i=1}^{n}y_{i}^{2}\) into orthogonal sums each involving \(\bar{y}_{k}^{2}\) and \(\sum _{i=1}^{k}(y_{i} -\bar{ y}_{k})^{2}\) is
$$\displaystyle{ \begin{array}{rcl} \tilde{y}_{i}& =&(i(i + 1))^{-1/2}\left (\sum _{j=1}^{i+1}y_{ j} - (i + 1)y_{i+1}\right )\quad \mathrm{for}\;i = 1,\ldots n - 1,\\ \\ \tilde{y}_{n}& =&n^{-1/2}\sum _{ j=1}^{n}y_{ j}. \end{array} }$$
(8.86)
These expressions lead to a computationally stable one-pass algorithm for computing the sample variance (see equation ( 10.8) on page 504).
The Helmert matrix that corresponds to this partitioning has the form
$$\displaystyle\begin{array}{rcl} H_{n}& =& \left [\begin{array}{ccccc} 1/\sqrt{n} & 1/\sqrt{n} & 1/\sqrt{n} &\cdots & 1/\sqrt{n} \\ 1/\sqrt{2} & - 1/\sqrt{2} & 0 &\cdots & 0 \\ 1/\sqrt{6} & 1/\sqrt{6} & - 2/\sqrt{6} &\cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \frac{1} {\sqrt{n(n-1)}} & \frac{1} {\sqrt{n(n-1)}} & \frac{1} {\sqrt{n(n-1)}} & \cdots & - \frac{(n-1)} {\sqrt{n(n-1)}} \end{array} \right ] \\ & =& \left [\begin{array}{c} 1/\sqrt{n}\,1_{n}^{\mathrm{T}} \\ K_{n-1} \end{array} \right ],{}\end{array}$$
(8.87)
where K_{n−1} is the (n − 1) × n matrix below the first row. For the full n-vector y, we have
$$\displaystyle\begin{array}{rcl} y^{\mathrm{T}}K_{ n-1}^{\mathrm{T}}K_{ n-1}y& =& \sum _{i=1}^{n}(y_{i} -\bar{ y})^{2} {}\\ & =& (n - 1)s_{y}^{2}. {}\\ \end{array}$$

The rows of the matrix in equation (8.87) correspond to orthogonal contrasts in the analysis of linear models (see Sect.  9.3.2).

Obviously, the sums of squares are never computed by forming the Helmert matrix explicitly and then computing the quadratic form, but the computations in partitioned Helmert matrices are performed indirectly in analysis of variance, and representation of the computations in terms of the matrix is often useful in the analysis of the computations.
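
Nevertheless, for illustration, the following R sketch forms the Helmert matrix of equation (8.87) explicitly, confirms its orthogonality, and verifies the partition y^T K_{n−1}^T K_{n−1} y = (n − 1)s_y^2 for an arbitrary small data vector.

 # form the n x n Helmert matrix of equation (8.87) (for illustration only)
 n <- 5
 H <- matrix(0, n, n)
 H[1, ] <- 1 / sqrt(n)
 for (i in 1:(n - 1))
   H[i + 1, ] <- c(rep(1, i), -i, rep(0, n - i - 1)) / sqrt(i * (i + 1))
 max(abs(crossprod(H) - diag(n)))        # essentially zero: H is orthogonal
 y <- c(3, 1, 4, 1, 5)
 K <- H[-1, , drop = FALSE]              # K_{n-1}: the rows below the first
 c(sum((K %*% y)^2), (n - 1) * var(y))   # the two values agree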

8.8.2 Vandermonde Matrices

A Vandermonde matrix is an n × m matrix with columns that are defined by monomials,
$$\displaystyle{ V _{n\times m} = \left [\begin{array}{ccccc} 1& x_{1} & x_{1}^{2} & \cdots & x_{1}^{m-1} \\ 1& x_{2} & x_{2}^{2} & \cdots & x_{2}^{m-1}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1&x_{n}&x_{n}^{2} & \cdots &x_{n}^{m-1}\\ \end{array} \right ], }$$
(8.88)
where x_i ≠ x_j if i ≠ j. The Vandermonde matrix arises in polynomial regression analysis. For the model equation y_i = β_0 + β_1 x_i + ⋯ + β_p x_i^p + ε_i, given observations on y and x, a Vandermonde matrix is the matrix X in the standard representation y = Xβ + ε.

Because of the relationships among the columns of a Vandermonde matrix, computations for polynomial regression analysis can be subject to numerical errors, and so sometimes we make transformations based on orthogonal polynomials. The condition number (see Sect.  6.1, page 266) for a Vandermonde matrix is large. A Vandermonde matrix, however, can be used to form simple orthogonal vectors that correspond to orthogonal polynomials. For example, if the xs are chosen over a grid on [−1, 1], a QR factorization (see Sect.  5.8 on page 248) yields orthogonal vectors that correspond to Legendre polynomials. These vectors are called discrete Legendre polynomials. Although not used in regression analysis so often now, orthogonal vectors are useful in selecting settings in designed experiments.
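
A short R sketch of this idea (the grid and the degree are arbitrary choices) forms a Vandermonde matrix over a grid on [−1, 1], notes its large condition number, and extracts orthonormal columns with a QR factorization.

 # Vandermonde matrix on a grid over [-1, 1]; discrete orthogonal polynomials via QR
 x <- seq(-1, 1, length.out = 21)
 V <- outer(x, 0:4, "^")               # columns 1, x, x^2, x^3, x^4
 kappa(V)                              # a large condition number
 Q <- qr.Q(qr(V))                      # orthonormal columns
 max(abs(crossprod(Q) - diag(5)))      # essentially zero
 # the columns of Q correspond (up to scaling) to discrete analogues of the Legendre polynomials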

Vandermonde matrices also arise in the representation or approximation of a probability distribution in terms of its moments.

The determinant of a square Vandermonde matrix has a particularly simple form (see Exercise 8.11).

8.8.3 Hadamard Matrices and Orthogonal Arrays

In a wide range of applications, including experimental design, cryptology, and other areas of combinatorics, we often encounter matrices whose elements are chosen from a set of only a few different elements. In experimental design, the elements may correspond to the levels of the factors; in cryptology, they may represent the letters of an alphabet. In two-level factorial designs, the entries may be either 0 or 1. Matrices all of whose entries are either 1 or − 1 can represent the same layouts, and such matrices may have interesting mathematical properties.

An n × n matrix with −1, 1 entries whose determinant has absolute value n^{n/2} is called a Hadamard matrix. Hadamard's name is associated with this matrix because of the bound derived by Hadamard for the determinant of any matrix A with |a_{ij}| ≤ 1 for all i, j: |det(A)| ≤ n^{n/2}. A Hadamard matrix achieves this upper bound. A maximal determinant is often used as a criterion for a good experimental design.

We often denote an n × n Hadamard matrix by H_n, which is the same notation often used for a Helmert matrix, but in the case of Hadamard matrices, the matrix is not unique. All rows are orthogonal and so are all columns. The Euclidean norm of each row or column is √n, so
$$\displaystyle{ H_{n}^{\mathrm{T}}H_{ n} = H_{n}H_{n}^{\mathrm{T}} = nI. }$$
(8.89)

It is clear that if H_n is a Hadamard matrix then so is H_n^T, and a Hadamard matrix is a normal matrix (see page 345). Symmetric Hadamard matrices are often of special interest. In a special type of n × n Hadamard matrix, one row and one column consist of all 1s; each of the other n − 1 rows and columns consists of n/2 entries equal to 1 and n/2 entries equal to −1. Such a matrix is called a normalized Hadamard matrix. Most Hadamard matrices occurring in statistical applications are normalized, and also most have all 1s on the diagonal.

A Hadamard matrix is often represented as a mosaic of black and white squares, as in Fig. 8.7.
Figure 8.7. A 4 × 4 Hadamard matrix.

Hadamard matrices do not exist for all n. Clearly, n must be even because |det(H_n)| = n^{n/2}, but some experimentation (or an exhaustive search) quickly shows that there is no Hadamard matrix for n = 6. It has been conjectured, but not proven, that Hadamard matrices exist for any n divisible by 4. Given any n × n Hadamard matrix, H_n, and any m × m Hadamard matrix, H_m, an nm × nm Hadamard matrix can be formed as a partitioned matrix in which each 1 in H_n is replaced by the block submatrix H_m and each −1 is replaced by the block submatrix −H_m. For example, the 4 × 4 Hadamard matrix shown in Fig. 8.7 is formed using the 2 × 2 Hadamard matrix
$$\displaystyle{\left [\begin{array}{rr} 1& - 1\\ 1 & 1 \end{array} \right ]}$$
as both H_n and H_m. Since an H_2 exists, this means that for any n = 2^k, an n × n Hadamard matrix exists. Not all Hadamard matrices can be formed from other Hadamard matrices in this way, however; that is, they are not necessarily 2^k × 2^k. A Hadamard matrix exists for n = 12, for example.
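
The doubling construction just described is easy to carry out with the Kronecker product; the following R sketch builds H_4 and H_8 from H_2 and checks equation (8.89).

 # build Hadamard matrices by the Kronecker-product construction described above
 H2 <- matrix(c(1, -1,
                1,  1), nrow = 2, byrow = TRUE)
 H4 <- kronecker(H2, H2)                 # each 1 in H2 replaced by H2, each -1 by -H2
 H8 <- kronecker(H2, H4)
 max(abs(crossprod(H8) - 8 * diag(8)))   # zero: H8^T H8 = 8 I
 abs(det(H8))                            # 4096 = 8^(8/2)
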
A related type of orthogonal matrix, called a conference matrix, is a hollow matrix C_n with −1, 1 entries off the diagonal, and such that
$$\displaystyle{ C_{n}^{\mathrm{T}}C_{ n} = (n - 1)I. }$$
(8.90)
A conference matrix is said to be normalized if the first row and the first column consist of all 1s, except for the (1, 1) element. Conference matrices arise in circuit design and other applications of graph theory. They are related to the adjacency matrices occurring in such applications. The (n − 1) × (n − 1) matrix formed by removing the first row and first column of a symmetric conference matrix is a Seidel adjacency matrix (page 334).

A somewhat more general type of matrix corresponds to an n × m array with the elements in the jth column being members of a set of k_j elements and such that, for some fixed p ≤ m, in every n × p submatrix all possible combinations of the elements of the corresponding p sets occur equally often as a row. (I make a distinction between the matrix and the array because often in applications the elements in the array are treated merely as symbols without the assumptions of an algebra of a field. A terminology for orthogonal arrays has evolved that is different from the terminology for matrices; for example, a symmetric orthogonal array is one in which k_1 = ⋯ = k_m. On the other hand, treating the orthogonal arrays as matrices with real elements may provide solutions to combinatorial problems such as may arise in optimal design.)

The 4 × 4 Hadamard matrix shown in Fig. 8.7 is a symmetric orthogonal array with k1 = ⋯ = k4 = 2 and p = 4, so in the array each of the possible combinations of elements occurs exactly once. This array is a member of a simple class of symmetric orthogonal arrays that has the property that in any two rows each ordered pair of elements occurs exactly once.

Orthogonal arrays are particularly useful in developing fractional factorial plans. (The robust designs of Taguchi correspond to orthogonal arrays.) Dey and Mukerjee (1999) discuss orthogonal arrays with an emphasis on the applications in experimental design, and Hedayat et al. (1999) provide an extensive discussion of the properties of orthogonal arrays.

8.8.4 Toeplitz Matrices

A Toeplitz matrix is a square matrix with constant codiagonals:
$$\displaystyle{ \left [\begin{array}{ccccc} d & u_{1} & u_{2} & \cdots &u_{n-1} \\ l_{1} & d & u_{1} & \cdots &u_{n-2}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ l_{n-2} & l_{n-3} & l_{n-4} & \ddots & u_{1} \\ l_{n-1} & l_{n-2} & l_{n-3} & \cdots & d\\ \end{array} \right ]. }$$
(8.91)
An n × n Toeplitz matrix is characterized by a diagonal element d and two (n − 1)-vectors, u and l, as indicated in the expression above. If u = 0, the Toeplitz matrix is lower triangular; if l = 0, the matrix is upper triangular; and if u = l, the matrix is symmetric. A Toeplitz matrix is a banded matrix, but it may or may not be a "band matrix", that is, one with many 0 codiagonals, which we discuss in Chaps.  3 and  12.

Banded Toeplitz matrices arise frequently in time series studies. The covariance matrix in an ARMA(p, q) process, for example, is a symmetric Toeplitz matrix with 2max(p, q) nonzero off-diagonal bands. See page 451 for an example and further discussion.

8.8.4.1 Inverses of Certain Toeplitz Matrices and Other Banded Matrices

A Toeplitz matrix that occurs often in stationary time series is the n × n variance-covariance matrix of the form
$$\displaystyle{V =\sigma ^{2}\left [\begin{array}{ccccc} 1 & \rho & \rho ^{2} & \cdots &\rho ^{n-1} \\ \rho & 1 & \rho &\cdots &\rho ^{n-2}\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ \rho ^{n-1} & \rho ^{n-2} & \rho ^{n-3} & \cdots & 1 \end{array} \right ].}$$
It is easy to see that V^{−1} exists if σ ≠ 0 and |ρ| < 1, and that it is the type 2 matrix
$$\displaystyle{ V ^{-1} = \frac{1} {(1 -\rho ^{2})\sigma ^{2}}\left [\begin{array}{ccccc} 1 & -\rho & 0 &\cdots &0 \\ -\rho &1 +\rho ^{2} & -\rho &\cdots &0 \\ 0 & -\rho &1 +\rho ^{2} & \cdots &0\\ \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 &\cdots &1 \end{array} \right ]. }$$
(8.92)

Type 2 matrices also occur as the inverses of other matrices with special patterns that arise in other common statistical applications (see Graybill 1983 for examples).
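
The following R sketch (with σ = 1 and an arbitrary ρ) checks equation (8.92) numerically, using the function toeplitz from the stats package to build V.

 # numerical check of equation (8.92) for the AR(1)-type Toeplitz matrix, with sigma = 1
 n <- 5; rho <- 0.6
 V <- toeplitz(rho^(0:(n - 1)))
 round((1 - rho^2) * solve(V), 10)   # tridiagonal: 1 and 1 + rho^2 on the diagonal,
                                     # -rho on the first off-diagonals, 0 elsewhere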

The inverses of all banded invertible matrices have some off-diagonal submatrices that are zero or have low rank, depending on the bandwidth of the original matrix (see Strang and Nguyen 2004, for further discussion and examples).

8.8.5 Circulant Matrices

A Toeplitz matrix having the form
$$\displaystyle{ \left [\begin{array}{cccccc} c_{1} & c_{2} & c_{3} & \cdots &c_{n-1} & c_{n} \\ c_{n}&c_{1} & c_{2} & \cdots &c_{n-2} & c_{n-1}\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ c_{3} & c_{4} & c_{5} & \cdots & c_{1} & c_{2} \\ c_{2} & c_{3} & c_{4} & \cdots & c_{n} & c_{1} \end{array} \right ] }$$
(8.93)
is called a circulant matrix. Beginning with a fixed first row, each subsequent row is obtained by a right circular shift of the row just above it.
A useful n × n circulant matrix is the permutation matrix
$$\displaystyle{ E_{(n,1,2,\ldots,n-1)} =\prod _{ k=2}^{n}E_{ k,k-1}, }$$
(8.94)
which is the identity transformed by moving each row downward into the row below it and the last row into the first row. (See page 82.) Let us denote this permutation as πc; hence, we denote the elementary circulant matrix in (8.94) as \(E_{(\pi _{\mathrm{c}})}\).
By first recalling that, for any permutation matrix, E_{(π)}^{−1} = E_{(π)}^T, and then considering the effects of the multiplications AE_{(π)} and E_{(π)}A, it is easy to see that A is circulant if and only if
$$\displaystyle{ A = E_{(\pi _{\mathrm{c}})}AE_{(\pi _{\mathrm{c}})}^{\mathrm{T}}. }$$
(8.95)
Circulant matrices have several straightforward properties. If A is circulant, then
  • A^T is circulant

  • A^2 is circulant

  • if A is nonsingular, then A^{−1} is circulant

You are asked to prove these simple results in Exercise 8.13.

Any linear combination of two circulant matrices of the same order is circulant (that is, they form a vector space, see Exercise 8.14).

If A and B are circulant matrices of the same order, then AB is circulant (Exercise 8.15).

Another important property of a circulant matrix is that it is normal, as we can see by writing the (i, j) element of AAT and of ATA as a sum of products of elements of A (Exercise 8.16). This has an important implication: a circulant matrix is unitarily diagonalizable.
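
These properties are easy to examine numerically. The following R sketch (circulant is an ad hoc helper that builds a circulant matrix from its first row) checks the characterization (8.95) and the normality of a small circulant matrix.

 # build a circulant matrix from its first row; check (8.95) and normality
 circulant <- function(r) {
   n <- length(r)
   t(sapply(0:(n - 1), function(s) r[((0:(n - 1) - s) %% n) + 1]))
 }
 A <- circulant(c(1, 2, 3, 4))
 E <- circulant(c(0, 0, 0, 1))          # the elementary circulant matrix E_(pi_c) for n = 4
 all(E %*% A %*% t(E) == A)             # TRUE: equation (8.95)
 all(crossprod(A) == tcrossprod(A))     # TRUE: A^T A = A A^T, so A is normal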

8.8.6 Fourier Matrices and the Discrete Fourier Transform

A special Vandermonde matrix (equation (8.88)) is an n × n matrix whose entries are the nth roots of unity, that is, {1, ω, ω^2, …, ω^{n−1}}, where
$$\displaystyle{ \omega = \mathrm{e}^{2\pi \mathrm{i}/n} =\cos \left ( \frac{2\pi } {n}\right ) + \mathrm{i}\sin \left ( \frac{2\pi } {n}\right ). }$$
(8.96)
The matrix is called a Fourier matrix. The (j, k) entry of a Fourier matrix is \(\omega ^{(j-1)(k-1)}/\sqrt{n}\):
$$\displaystyle{ F_{n} = \frac{1} {\sqrt{n}}\left [\begin{array}{cccccc} 1& 1 & 1 & 1 &\cdots & 1 \\ 1& \omega ^{1} & \omega ^{2} & \omega ^{3} & \cdots & \omega ^{(n-1)} \\ 1& \omega ^{2} & \omega ^{4} & \omega ^{6} & \cdots & \omega ^{2(n-1)}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1&\omega ^{n-1} & \omega ^{2(n-1)} & \omega ^{3(n-1)} & \cdots &\omega ^{(n-1)(n-1)}\\ \end{array} \right ], }$$
(8.97)
(The Fourier matrix is sometimes defined with a negative sign in the exponents; that is, such that the (j, k)th entry is \(\omega ^{-(j-1)(k-1)}/\sqrt{n}\). The normalizing factor \(1/\sqrt{n}\) is also sometimes omitted. In fact, in many applications of Fourier matrices and various Fourier forms, there is inconsequential, but possibly annoying, variation in the notation.)

Notice that the Fourier matrix is symmetric, and, apart from the normalizing factor 1/√n, any entry raised to the nth power is 1. Although the Fourier matrix is symmetric, its eigenvalues are not necessarily real, because it itself is not a real matrix.

Fourier matrices whose order is a power of 2 have many elements that (apart from the normalizing factor) are either ±1 or ±i. For example, the 4 × 4 Fourier matrix is
$$\displaystyle{F_{4} = \frac{1} {2}\left [\begin{array}{rrrr} 1& 1& 1& 1\\ 1 & \mathrm{i} & - 1 & - \mathrm{i} \\ 1& - 1& 1& - 1 \\ 1& -\mathrm{i}& - 1& \mathrm{i} \end{array} \right ].}$$
(Recall that there are different definitions of the elements of a Fourier matrix; for the 4 × 4, they all are as shown above, but the patterns of positive and negative values may be different. Most of the elements of the 8 × 8 Fourier matrix are either ± 1 or ±i. The 16 that are not are \(\pm 1/\sqrt{2} \pm \mathrm{i}/\sqrt{2}\).)

The Fourier matrix has many useful properties, such as being unitary; that is, its rows and columns are orthonormal: f_{*j}^H f_{*k} = f_{j*}^H f_{k*} = 0 for j ≠ k, and f_{*j}^H f_{*j} = 1 (Exercise 8.17).

The most interesting feature of the Fourier matrix is its relationship to the Fourier transform. For an integrable function f(x), the Fourier transform is
$$\displaystyle{ \mathcal{F}f(s) =\int _{ -\infty }^{\infty }\mathrm{e}^{-2\pi \mathrm{i}sx}f(x)\,\mathrm{d}x. }$$
(8.98)
(Note that the characteristic function in probability theory is this same transform applied to a probability density function, with argument t = −2πs.)

8.8.6.1 Fourier Matrices and Elementary Circulant Matrices

The Fourier matrix and the elementary circulant matrix \(E_{(\pi _{\mathrm{c}})}\) of corresponding order are closely related. Being a normal matrix, \(E_{(\pi _{\mathrm{c}})}\) is unitarily diagonalizable; the Fourier matrix and its conjugate transpose are the diagonalizing matrices, and the entries of the diagonal matrix, that is, the eigenvalues, are (up to the normalizing factor) elements of the Fourier matrix:
$$\displaystyle{ E_{(\pi _{\mathrm{c}})} = F_{n}^{\mathrm{H}}\mathrm{diag}((1,\omega,\omega ^{2},\ldots,\omega ^{n-1}))F_{ n}. }$$
(8.99)
This is easy to see by performing the multiplications on the right side of the equation, and you are asked to do this in Exercise 8.18.

We see that the eigenvalues of \(E_{(\pi _{\mathrm{c}})}\) are what we would expect if we continued the development from page 137 where we determined the eigenvalues of an elementary permutation matrix. (An elementary permutation matrix of order 2 has the two eigenvalues \(\sqrt{ 1}\) and \(-\sqrt{1}\).)

Notice that the modulus of all eigenvalues of \(E_{(\pi _{\mathrm{c}})}\) is the same, 1. Hence, all eigenvalues of \(E_{(\pi _{\mathrm{c}})}\) lie on its spectral circle.

8.8.6.2 The Discrete Fourier Transform

Fourier transforms are invertible transformations of functions that often allow operations on those functions to be performed more easily. The Fourier transform shown in equation (8.98) may allow for simpler expressions of convolutions, for example. An n-vector is equivalent to a function whose domain is {1, …, n}, and Fourier transforms of vectors are useful in various operations on the vectors. More importantly, if the vector represents observational data, certain properties of the data may become immediately apparent in the Fourier transform of the data vector.

The Fourier transform, at a given value s, of the function as shown in equation (8.98) is an integral. The argument s of the transform may or may not range over the same domain as x, the argument of the function. For a vector, the analogue of the Fourier transform is a sum, again at some given value, which is effectively the argument of the transform. For an n-vector x, the discrete Fourier transform at n points is
$$\displaystyle{ \mathcal{F}x = F_{n}x. }$$
(8.100)
Transformations of this form are widely used in time series, where the vector x contains observations at equally spaced points in time. While the elements of x represent measurements at distinct points in time, the elements of the vector \(\mathcal{F}x\) represent values at different points in the period of a sine and/or a cosine curve, as we see from the relation of the roots of unity given in equation (8.96). Many time series, especially those relating to measurement of waves propagating through matter or of vibrating objects, exhibit periodicities, which can be used to distinguish different kinds of waves and possibly to locate their source. Many waves and other time series have multiple periodicities, at different frequencies. The Fourier transform is sometimes called a frequency filter.

Our purpose here is just to indicate the relation of Fourier transforms to matrices, and to suggest how this might be useful in applications. There is a wealth of literature on Fourier transforms and their applications, but we will not pursue those topics here.

As often when writing expressions involving matrices, we must emphasize that the form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different. This is particularly true in the case of the discrete Fourier transform. There would never be a reason to form the Fourier matrix for any computations, but more importantly, rather than evaluating the elements in the Fourier transform \(\mathcal{F}x\) using the right side of equation (8.100), we take advantage of properties of the powers of roots of unity to arrive at a faster method of computing them—so much faster, in fact, the method is called the “fast” Fourier transform, or FFT. The FFT is one of the most important computational methods in all of applied mathematics. I will not discuss it further here, however. Descriptions of it can be found throughout the literature on computational science (including other books that I have written).
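
Even so, it is instructive to check the matrix form (8.100) against R's fft function on a small vector. The sketch below assumes the definition of F_n in equation (8.97); because fft uses the negative-exponent, unnormalized convention, the correspondence involves the inverse flag and the factor 1/√n.

 # relate the discrete Fourier transform F_n x of equation (8.100) to R's fft()
 n <- 8
 omega <- exp(2i * pi / n)
 Fn <- outer(0:(n - 1), 0:(n - 1), function(j, k) omega^(j * k)) / sqrt(n)
 x <- rnorm(n)
 # with the definition (8.97), F_n x agrees with fft(x, inverse = TRUE)/sqrt(n)
 max(Mod(Fn %*% x - fft(x, inverse = TRUE) / sqrt(n)))   # essentially zero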

An Aside: Complex Matrices

The Fourier matrix, I believe, is the only matrix I discuss in this book that has complex entries. The purpose is just to relate the discrete Fourier transform to matrix multiplication. We have already indicated various differences in operations on vectors and matrices over the complex plane. We often form the conjugate of a complex number, which we denote by an overbar: \(\bar{z}\). The conjugate of an object such as a vector or a matrix is the result of taking the conjugate of each element individually. Instead of a transpose, we usually work with a conjugate transpose, \(A^{\mathrm{H}} = \overline{A} ^{\mathrm{T}}\). (See the discussion on pages 33 and 60.) We define the inner product of vectors x and y as \(\langle x,y\rangle =\bar{ x}^{\mathrm{T}}y\); instead of symmetric matrices, we focus on Hermitian matrices, that is, ones such that the conjugate transpose is the same as the original matrices, and instead of orthogonal matrices, we focus on unitary matrices, that is, ones whose product with its conjugate transpose is the identity. All of the general results that we have stated for real matrices of course hold for complex matrices. Some of the analogous operations, however, have different properties. For example, the property of a matrix being unitarily diagonalizable is an analogue of the property of being orthogonally diagonalizable, but, as we have seen, a wider class of matrices (normal matrices) are unitarily diagonalizable than those that are orthogonally diagonalizable (symmetric matrices).

Most scientific software systems support computations with complex numbers. Both R and Matlab, for example, provide full capabilities for working with complex numbers. The imaginary unit in both is denoted by “i”, which of course must be distinguished from “i” denoting a variable. In R, the function complex can be used to initialize a complex number; for example,

 z<-complex(re=3,im=2)

assigns the value 3 + 2i to the variable z. (Another simple way of doing this is to juxtapose a numeric literal in front of the symbol i; in R, for example, z<-3+2i assigns the same value to z. I do not recommend the latter construct because of possible confusion with other expressions.)

For j = 1, …, n − 1, the row of an nth order Fourier matrix indexed by j (counting the first row as row 0, and omitting the normalizing factor 1/√n) can be generated by the R expression

 c(1,exp(2*pi*complex(im=seq(j,j*(n-1),j)/n)))

As mentioned above, there would almost never be a reason to form the Fourier matrix for any computations. It is instructive, however, to note its form and to do some simple manipulations with the Fourier matrix in R or some other software system. The Matlab function dftmtx generates a Fourier matrix (with a slightly different definition, resulting in a different pattern of positives and negatives than what I have shown above).

Some additional R code for manipulating complex matrices is given in the hints for Exercise 8.18 on page 611. (Note that there I do form a Fourier matrix and use it in multiplications; but it is just for illustration.)

8.8.7 Hankel Matrices

A Hankel matrix is a square matrix with constant “anti”-codiagonals:
$$\displaystyle{ \left [\begin{array}{ccccc} u_{n-1} & u_{n-2} & \cdots &u_{1} & d \\ u_{n-2} & u_{n-3} & \cdots & d & l_{1}\\ \vdots & \vdots & \vdots & \vdots & \vdots \\ u_{1} & d &l_{1} & \ddots & l_{n-2} \\ d & l_{1} & l_{2} & \cdots &l_{n-1}\\ \end{array} \right ]. }$$
(8.101)
Notice that a Hankel matrix is similar to a Toeplitz matrix, except that the axial diagonal is not the principal diagonal; rather, for an n × n Hankel matrix H, it is the “anti-diagonal”, consisting of the elements h1,n , h2,n−1, , h n, 1. Although this “diagonal” is visually apparent, and symmetries about it are intuitively obvious, there is no commonly-used term for this diagonal or for symmetries about it. We can also observe analogues of lower triangular, upper triangular, and symmetric matrices based on u = 0, l = 0, and u = l; but these are not the kinds of matrices that we have identified with these terms. Words such as “anti” and “skew” are sometimes used to qualify names of these properties or objects. A triangular counterpart is sometimes called “skew …diagonal”. There is no standard term for symmetry about the anti-diagonal, however. The term “skew symmetric” has already been taken. (It is a matrix A with a ij = −a ji .)

As with a Toeplitz matrix, a Hankel matrix is characterized by a "diagonal" element and two vectors, and can be generalized to n × m matrices based on vectors of different orders. As in the expression (8.101) above, an n × n Hankel matrix can be defined by a scalar d and two (n − 1)-vectors, u and l. There are other, perhaps more common, ways of putting the elements of a Hankel matrix into two vectors. In Matlab, the function hankel produces a Hankel matrix, but the elements are specified in a different way from that in expression (8.101).

A common form of Hankel matrix is an n × n skew upper triangular matrix, and it is formed from the u vector only. The simplest form of the square skew upper triangular Hankel matrix is formed from the vector u = (n − 1, n − 2, …, 1) and d = n:
$$\displaystyle{ \left [\begin{array}{ccccc} 1 &2&3&\cdots &n\\ 2 &3 &4 &\cdots & 0 \\ \vdots & \vdots & \vdots &\cdots & \vdots\\ n &0 &0 &\cdots & 0\\ \end{array} \right ]. }$$
(8.102)
A skew upper triangular Hankel matrix occurs in the spectral analysis of time series. If x(t) is a (discrete) time series, for t = 0, 1, 2, …, the Hankel matrix of the time series has as the (i, j) element
$$\displaystyle{\begin{array}{lll} x(i + j - 2)&\ &\mathrm{if}\;i + j - 1 \leq n,\\ 0&&\mathrm{otherwise.}\end{array} }$$
The L2 norm of the Hankel matrix of the time series is called the Hankel norm of the frequency filter response (the Fourier transform).
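
For a small illustration in R (hankel_ts is an ad hoc helper, and the time series is taken to be x(t) = t + 1 simply to reproduce the pattern of (8.102)), the matrix can be formed directly from the definition of its (i, j) element given above.

 # the n x n skew upper triangular Hankel matrix of a time series x(0), x(1), ...
 # (in R the series is stored with x[1] = x(0), so x(i + j - 2) is x[i + j - 1])
 hankel_ts <- function(x, n)
   outer(1:n, 1:n, function(i, j) ifelse(i + j - 1 <= n, x[i + j - 1], 0))
 hankel_ts(1:9, 5)   # with x(t) = t + 1, this reproduces the matrix in (8.102) for n = 5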

8.8.8 Cauchy Matrices

Another type of special n × m matrix whose elements are determined by a few n-vectors and m-vectors is a Cauchy-type matrix. The standard Cauchy matrix is built from two vectors, x and y. The more general form defined below uses two additional vectors.

A Cauchy matrix is an n × m matrix C(x, y, v, w) generated by n-vectors x and v and m-vectors y and w of the form
$$\displaystyle{ C(x,y,v,w) = \left [\begin{array}{ccc} \frac{v_{1}w_{1}} {x_{1} - y_{1}} & \cdots & \frac{v_{1}w_{m}} {x_{1} - y_{m}} \\ \vdots &\cdots & \vdots \\ \frac{v_{n}w_{1}} {x_{n} - y_{1}} & \cdots & \frac{v_{n}w_{m}} {x_{n} - y_{m}}\\ \end{array} \right ]. }$$
(8.103)

Cauchy-type matrices often arise in the numerical solution of partial differential equations (PDEs). For Cauchy matrices, the order of the number of computations for factorization or for solutions of linear systems can be reduced from a power of three to a power of two. This is a very significant improvement for large matrices. In the PDE applications, the matrices are generally not large, but nevertheless, even in those applications, it is worthwhile to use algorithms that take advantage of the special structure. Fasino and Gemignani (2003) describe such an algorithm.
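
A Cauchy matrix itself is trivial to generate; the following R sketch (cauchy is an ad hoc helper, and the vectors are arbitrary) forms C(x, y, v, w) directly from equation (8.103).

 # generate the Cauchy matrix C(x, y, v, w) of equation (8.103)
 cauchy <- function(x, y, v, w)
   (v %o% w) / outer(x, y, "-")          # (i, j) element: v_i w_j / (x_i - y_j)
 x <- c(1, 2, 3); v <- c(1, 1, 1)        # n-vectors
 y <- c(0.5, 1.5); w <- c(2, 3)          # m-vectors (x_i != y_j required)
 cauchy(x, y, v, w)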

8.8.9 Matrices Useful in Graph Theory

Many problems in statistics and applied mathematics can be posed as graphs, and various methods of graph theory can be used in their solution.

Graph theory is particularly useful in cluster analysis or classification. These involve the analysis of relationships of objects for the purpose of identifying similar groups of objects. The objects are associated with vertices of the graph, and an edge is generated if the relationship (measured somehow) between two objects is sufficiently great. For example, suppose the question of interest is the authorship of some text documents. Each document is a vertex, and an edge between two vertices exists if there are enough words in common between the two documents. A similar application could be the determination of which computer user is associated with a given computer session. The vertices would correspond to login sessions, and the edges would be established based on the commonality of programs invoked or files accessed. In applications such as these, there would typically be a training dataset consisting of text documents with known authors or consisting of session logs with known users. In both of these types of applications, decisions would have to be made about the extent of commonality of words, phrases, programs invoked, or files accessed in order to establish an edge between two documents or sessions.

Unfortunately, as is often the case for an area of mathematics or statistics that developed from applications in diverse areas or through the efforts of applied mathematicians somewhat outside of the mainstream of mathematics, there are major inconsistencies in the notation and terminology employed in graph theory. Thus, we often find different terms for the same object; for example, adjacency matrix and connectivity matrix. This unpleasant situation, however, is not so disagreeable as a one-to-many inconsistency, such as the designation of the eigenvalues of a graph to be the eigenvalues of one type of matrix in some of the literature and the eigenvalues of different types of matrices in other literature.

Refer to Sect. 8.1.2 beginning on page 331 for terms and notation that we will use in the following discussion.

8.8.9.1 Adjacency Matrix: Connectivity Matrix

We discussed adjacency or connectivity matrices on page 334. A matrix, such as an adjacency matrix, that consists of only 1s and 0s is called a Boolean matrix.

Two vertices that are not connected and hence correspond to a 0 in a connectivity matrix are said to be independent.

If no edges connect a vertex with itself, the adjacency matrix is a hollow matrix.

Because the 1s in a connectivity matrix indicate a strong association, and we would naturally think of a vertex as having a strong association with itself, we sometimes modify the connectivity matrix so as to have 1s along the diagonal. Such a matrix is sometimes called an augmented connectivity matrix or augmented adjacency matrix.

The eigenvalues of the adjacency matrix reveal some interesting properties of the graph and are sometimes called the eigenvalues of the graph. The eigenvalues of another matrix, which we discuss below, are more useful, however, and we will refer to those eigenvalues as the eigenvalues of the graph.

8.8.9.2 Digraphs

The digraph represented in Fig. 8.4 on page 335 is a network with five vertices, perhaps representing cities, and directed edges between some of the vertices. The edges could represent airline connections between the cities; for example, there are flights from x to u and from u to x, and from y to z, but not from z to y.

In a digraph, the relationships are directional. (An example of a directional relationship that might be of interest is when each observational unit has a different number of measured features, and a relationship exists from v i to v j if a majority of the features of v i are identical to measured features of v j .)

8.8.9.3 Use of the Connectivity Matrix

The analysis of a network may begin by identifying which vertices are connected with others; that is, by construction of the connectivity matrix.

The connectivity matrix can then be used to analyze other levels of association among the data represented by the graph or digraph. For example, from the connectivity matrix in equation (8.2) on page 335, we have
$$\displaystyle{A^{2} = \left [\begin{array}{rrrrr} 4&1&0&0&1\\ 0 &1 &1 &1 &1 \\ 1&1&1&1&2\\ 1 &2 &1 &1 &1 \\ 1&1&1&1&1\\ \end{array} \right ].}$$
In terms of the application suggested on page 335 for airline connections, the matrix A^2 represents the number of connections between the cities that consist of exactly two flights. From A^2 we see that there are two ways to go from city y to city w in just two flights but only one way to go from w to y in two flights.

This property extends to multiple connections. If A is the adjacency matrix of a graph, then (A^k)_{ij} is the number of paths of length k between nodes i and j in that graph. (See Exercise 8.20.) Important areas of application of this fact are in DNA sequence comparisons and measuring "centrality" of a node within a complex social network.
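
The computation is direct; the following R sketch uses a small hypothetical digraph (the adjacency matrix A below is not the one of Fig. 8.4) and counts two-step and three-step connections by forming powers of A.

 # count multi-step connections in a small hypothetical digraph
 A <- matrix(c(0, 1, 1, 0,
               0, 0, 1, 0,
               0, 0, 0, 1,
               1, 0, 0, 0), nrow = 4, byrow = TRUE)
 A2 <- A %*% A    # (A2)[i, j] = number of paths of length 2 from i to j
 A3 <- A2 %*% A   # paths of length 3
 A2
 A3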

8.8.9.4 The Laplacian Matrix of a Graph

Spectral graph theory is concerned with the analysis of the eigenvalues of a graph. As mentioned above, there are two different definitions of the eigenvalues of a graph. The more useful definition, and the one we use here, takes the eigenvalues of a graph to be the eigenvalues of a matrix, called the Laplacian matrix, formed from the adjacency matrix and a diagonal matrix consisting of the degrees of the vertices.

Given the graph \(\mathcal{G}\), let \(D(\mathcal{G})\) be a diagonal matrix consisting of the degrees of the vertices of \(\mathcal{G}\) (that is, \(D(\mathcal{G}) = \mathrm{diag}(d(\mathcal{G}))\)) and let \(C(\mathcal{G})\) be the adjacency matrix of \(\mathcal{G}\). If there are no isolated vertices (that is if \(d(\mathcal{G})> 0\)), then the Laplacian matrix of the graph, \(L(\mathcal{G})\) is given by
$$\displaystyle{ L(\mathcal{G}) = I - D(\mathcal{G})^{-\frac{1} {2} }C(\mathcal{G})D(\mathcal{G})^{-\frac{1} {2} }. }$$
(8.104)
Some authors define the Laplacian in other ways:
$$\displaystyle{ L_{a}(\mathcal{G}) = I - D(\mathcal{G})^{-1}C(\mathcal{G}) }$$
(8.105)
or
$$\displaystyle{ L_{b}(\mathcal{G}) = D(\mathcal{G}) - C(\mathcal{G}). }$$
(8.106)
The eigenvalues of the Laplacian matrix are the eigenvalues of a graph. The definition of the Laplacian matrix given in equation (8.104) seems to be more useful in terms of bounds on the eigenvalues of the graph. The set of unique eigenvalues (the spectrum of the matrix L) is called the spectrum of the graph.

So long as \(d(\mathcal{G})> 0\), \(L(\mathcal{G}) = D(\mathcal{G})^{-\frac{1} {2} }L_{b}(\mathcal{G})D(\mathcal{G})^{-\frac{1} {2} }\). Unless the graph is regular, the matrix \(L_{a}(\mathcal{G})\) is not symmetric. Note that if \(\mathcal{G}\) is k-regular, \(L(\mathcal{G}) = I - C(\mathcal{G})/k\), and \(L_{a}(\mathcal{G}) = L(\mathcal{G})\).

For a digraph, the degrees are replaced by either the indegrees or the outdegrees. (Some authors define it one way and others the other way. The essential properties hold either way.)

The Laplacian can be viewed as an operator on the space of functions \(f\,:\, V (\mathcal{G}) \rightarrow \mathrm{I\!R}\) such that for the vertex v
$$\displaystyle{L(f(v)) = \frac{1} {\sqrt{d_{v}}}\sum _{w,w\sim v}\left (\frac{f(v)} {\sqrt{d_{v}}} -\frac{f(w)} {\sqrt{d_{w}}} \right ),}$$
where w ∼ v means that the vertices w and v are adjacent, and d_u is the degree of the vertex u.
For a symmetric graph, the Laplacian matrix is symmetric, so its eigenvalues are all real. We can see that the eigenvalues are all nonnegative by forming the Rayleigh quotient (equation ( 3.266)) using an arbitrary vector g, which can be viewed as a real-valued function over the vertices,
$$\displaystyle\begin{array}{rcl} R_{L}(g)& =& \frac{\langle g,\,Lg\rangle } {\langle g,\,g\rangle } \\ & =& \frac{\langle g,\,D^{-\frac{1} {2} }L_{b}D^{-\frac{1} {2} }g\rangle } {\langle g,\,g\rangle } \\ & =& \frac{\langle f,L_{b}f\rangle } {\langle D^{\frac{1} {2} }f,\,D^{\frac{1} {2} }f\rangle } \\ & =& \frac{\sum _{v\sim w}(f(v) - f(w))^{2}} {f^{\mathrm{T}}Df},{}\end{array}$$
(8.107)
where \(f = D^{-\frac{1} {2} }g\), and f(u) is the element of the vector corresponding to vertex u. Because the Rayleigh quotient is nonnegative, all eigenvalues are nonnegative, and because there is an f ≠ 0 for which the Rayleigh quotient is 0, we see that 0 is an eigenvalue of a graph. Furthermore, using the Cauchy-Schwarz inequality, we see that the spectral radius is less than or equal to 2.

The eigenvalues of the Laplacian matrix are the basic objects in spectral graph theory. They provide information about the properties of networks and other systems modeled by graphs. We will not explore them further here, and the interested reader is referred to Bollobás (2013) or other general texts on the subject.

If \(\mathcal{G}\) is the graph represented in Fig. 8.2 on page 332, with \(V (\mathcal{G}) =\{ a,b,c,d,e\}\), the degrees of the vertices of the graph are \(d(\mathcal{G}) = (4,2,2,3,3)\). Using the adjacency matrix given in equation (8.1), we have
$$\displaystyle{ L(\mathcal{G}) = \left [\begin{array}{rrrrr} 1& -\frac{\sqrt{2}} {4} & -\frac{\sqrt{2}} {4} & -\frac{\sqrt{3}} {6} & -\frac{\sqrt{3}} {6}\\ \\ -\frac{\sqrt{2}} {4} & 1& 0& 0& -\frac{\sqrt{6}} {6}\\ \\ -\frac{\sqrt{2}} {4} & 0& 1& -\frac{\sqrt{6}} {6} & 0\\ \\ -\frac{\sqrt{3}} {6} & 0& -\frac{\sqrt{6}} {6} & 1& -\frac{1} {3}\\ \\ -\frac{\sqrt{3}} {6} & -\frac{\sqrt{6}} {6} & 0& -\frac{1} {3} & 1 \end{array} \right ]. }$$
(8.108)
This matrix is singular, and the unnormalized eigenvector corresponding to the 0 eigenvalue is \((2\sqrt{14},2\sqrt{7},2\sqrt{7},\sqrt{42},\sqrt{42})\).
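
The following R sketch reproduces this computation; the adjacency matrix below is the one implied by the degrees d(G) = (4, 2, 2, 3, 3) and by the Laplacian in equation (8.108), and the eigenvector associated with the 0 eigenvalue is, as expected, proportional to (√d_1, …, √d_5).

 # the adjacency matrix consistent with d(G) = (4,2,2,3,3) and equation (8.108)
 C <- matrix(c(0, 1, 1, 1, 1,
               1, 0, 0, 0, 1,
               1, 0, 0, 1, 0,
               1, 0, 1, 0, 1,
               1, 1, 0, 1, 0), nrow = 5, byrow = TRUE)
 d <- rowSums(C)                          # the degrees (4, 2, 2, 3, 3)
 Dhalf <- diag(1 / sqrt(d))
 L <- diag(5) - Dhalf %*% C %*% Dhalf     # equation (8.104)
 ev <- eigen(L)
 round(ev$values, 10)                     # all in [0, 2]; the smallest is 0
 cbind(ev$vectors[, 5] / ev$vectors[1, 5],
       sqrt(d) / sqrt(d[1]))              # the two columns agree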

8.8.10 Z-Matrices and M-Matrices

In certain applications in physics and in the solution of systems of nonlinear differential equations, a class of matrices called M-matrices is important.

The matrices in these applications have nonpositive off-diagonal elements. A square matrix all of whose off-diagonal elements are nonpositive is called a Z-matrix.

A Z-matrix that is positive stable (see page 159) is called an M-matrix. A real symmetric M-matrix is positive definite.

In addition to the properties that constitute the definition, M-matrices have a number of remarkable properties, which we state here without proof. If A is a real M-matrix, then
  • all principal minors of A are positive;

  • all diagonal elements of A are positive;

  • all diagonal elements of L and U in the LU decomposition of A are positive;

  • for some i and j, a_{ij} ≥ 0; and

  • A is nonsingular and A^{−1} ≥ 0.

Proofs of some of these facts can be found in Horn and Johnson (1991).
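
As a small numerical illustration (the matrix A below is an arbitrary example), the following R sketch checks the defining properties and a few of the consequences listed above for a symmetric tridiagonal Z-matrix.

 # a small Z-matrix that is positive stable, and hence an M-matrix
 A <- matrix(c( 3, -1,  0,
               -1,  3, -1,
                0, -1,  3), nrow = 3, byrow = TRUE)
 all(A[row(A) != col(A)] <= 0)   # TRUE: off-diagonal elements nonpositive (a Z-matrix)
 eigen(A)$values                 # all positive: A is positive stable (and positive definite)
 diag(A)                         # positive diagonal elements
 min(solve(A))                   # nonnegative: A^{-1} >= 0 elementwise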

Exercises

  1. 8.1.
    Normal matrices.
    1. a)

      Show that a skew symmetric matrix is normal.

       
    2. b)

      Show that a skew Hermitian matrix is normal.

       
     
  2. 8.2.
    Ordering of nonnegative definite matrices.
    1. a)
      A relation ⋈ on a set is a partial ordering if, for elements a, b, and c,
      • it is reflexive: a ⋈ a;

      • it is antisymmetric: a ⋈ b and b ⋈ a ⇒ a = b; and

      • it is transitive: a ⋈ b and b ⋈ c ⇒ a ⋈ c.

      Show that the relation ⪰ (equation (8.19)) is a partial ordering.

       
    2. b)

      Show that the relation ≻ (equation (8.20)) is transitive.

       
     
  3. 8.3.
    1. a)

      Show that a (strictly) diagonally dominant symmetric matrix is positive definite.

       
    2. b)
      Show that if the real n × n symmetric matrix A is such that
      $$\displaystyle{a_{ii} \geq \sum _{j\neq i}^{n}\vert a_{ ij}\vert \quad \mbox{ for each}\;i = 1,\ldots,n}$$
      then A⪰0.
       
     
  4. 8.4.

    Show that the number of positive eigenvalues of an idempotent matrix is the rank of the matrix.

     
  5. 8.5.

    Show that two idempotent matrices of the same rank are similar.

     
  6. 8.6.

    Under the given conditions, show that properties (a) and (b) on page 357 imply property (c).

     
  7. 8.7.
    Projections.
    1. a)

      Show that the matrix given in equation (8.42) (page 359) is a projection matrix.

       
    2. b)

      Write out the projection matrix for projecting a vector onto the plane formed by two vectors, x1 and x2, as indicated on page 359, and show that it is the same as the hat matrix of equation (8.52).

       
     
  8. 8.8.

    Correlation matrices. A correlation matrix can be defined in terms of a Gramian matrix formed by a centered and scaled matrix, as in equation (8.69). Sometimes in the development of statistical theory, we are interested in the properties of correlation matrices with given eigenvalues or with given ratios of the largest eigenvalue to other eigenvalues.

    Write a program to generate n × n random correlation matrices R with specified eigenvalues, c1, , c n . The only requirements on R are that its diagonals be 1, that it be symmetric, and that its eigenvalues all be positive and sum to n. Use the following method due to Davies and Higham (2000) that uses random orthogonal matrices with the Haar uniform distribution generated using the method described in Exercise 4.10.
    0.
    Generate a random orthogonal matrix Q; set k = 0, and form
    $$\displaystyle{R^{(0)} = Q\mathrm{diag}((c_{1},\ldots,c_{n}))Q^{\mathrm{T}}.}$$
    1.

    If r_{ii}^{(k)} = 1 for all i in {1, …, n}, go to step 3.

    2.

    Otherwise, choose p and q with p < q, such that r_{pp}^{(k)} < 1 < r_{qq}^{(k)} or r_{pp}^{(k)} > 1 > r_{qq}^{(k)}, and form G^{(k)} as in equation ( 5.13), where c and s are as in equations ( 5.17) and ( 5.18), with a = 1. Form R^{(k+1)} = (G^{(k)})^T R^{(k)} G^{(k)}. Set k = k + 1, and go to step 1.

    3.

    Deliver R = R(k).

     
  9. 8.9.

    Use the relationship (8.77) to prove properties 1 and 4 on page 377.

     
  10. 8.10.
    Leslie matrices.
    1. a)

      Write the characteristic polynomial of the Leslie matrix, equation (8.85).

       
    2. b)

      Show that the Leslie matrix has a single, unique positive eigenvalue.

       
     
  11. 8.11.

    Write out the determinant for an n × n Vandermonde matrix.

    Hint: The determinant of an n × n Vandermonde matrix as in equation (8.88) is (x_n − x_1)(x_n − x_2)⋯(x_n − x_{n−1}) times the determinant of the (n − 1) × (n − 1) Vandermonde matrix formed by removing the last row and column. Show this by multiplying the original Vandermonde matrix by B = I + D, where D is the matrix with 0s in all positions except for the first supradiagonal, which consists of −x_n, replicated n − 1 times. Clearly, the determinant of B is 1.

     
  12. 8.12.
    Consider the 3 × 3 symmetric Toeplitz matrix with 1s on the diagonal and off-diagonal elements b and c; that is, the matrix that looks like this:
    $$\displaystyle{\left [\begin{array}{ccc} 1& b&c\\ b &1 & b \\ c&b&1\\ \end{array} \right ].}$$
    1. a)

      Invert this matrix. See page 385.

       
    2. b)

      Determine conditions for which the matrix would be singular.

       
     
  13. 8.13.
    If A is circulant, show that
    • A^T is circulant

    • A^2 is circulant

    • if A is nonsingular, then A^{−1} is circulant

    Hint: Use equation (8.95).

     
  14. 8.14.

    Show that the set of all n × n circulant matrices is a vector space along with the axpy operation. (Just show that it is closed with respect to that operation.)

     
  15. 8.15.

    If A and B are circulant matrices of the same order, show AB is circulant.

     
  16. 8.16.

    Show that a circulant matrix is normal.

     
  17. 8.17.

    Show that a Fourier matrix, as in equation (8.97), is unitary by showing f_{*j}^H f_{*k} = f_{j*}^H f_{k*} = 0 for j ≠ k, and f_{*j}^H f_{*j} = 1.

     
  18. 8.18.

    Show that equation (8.99) is correct by performing the multiplications on the right side of the equation.

     
  19. 8.19.

    Write out the determinant for the n × n skew upper triangular Hankel matrix in (8.102).

     
  20. 8.20.
    Graphs. Let A be the adjacency matrix of an undirected graph.
    1. a)

      Show that (A^2)_{ij} is the number of paths of length 2 between nodes i and j.

      Hint: Construct a general diagram similar to Fig. 8.2 on page 332, and count the paths between two arbitrary nodes.

       
    2. b)

      Show that (A^k)_{ij} is the number of paths of length k between nodes i and j.

      Hint: Use Exercise 8.20a and mathematical induction on k.

       
     

References

  1. Bapat, R. B., and T. E. S. Raghavan. 1997. Nonnegative Matrices and Applications. Cambridge, United Kingdom: Cambridge University Press.
  2. Bollobás, Béla. 2013. Modern Graph Theory. New York: Springer-Verlag.
  3. Chu, Moody T. 1991. Least squares approximation by real normal matrices with specified spectrum. SIAM Journal on Matrix Analysis and Applications 12:115–127.
  4. Davies, Philip I., and Nicholas J. Higham. 2000. Numerically stable generation of correlation matrices and their factors. BIT 40:640–651.
  5. Dey, Aloke, and Rahul Mukerjee. 1999. Fractional Factorial Plans. New York: John Wiley and Sons.
  6. Fasino, Dario, and Luca Gemignani. 2003. A Lanczos-type algorithm for the QR factorization of Cauchy-like matrices. In Fast Algorithms for Structured Matrices: Theory and Applications, ed. Vadim Olshevsky, 91–104. Providence, Rhode Island: American Mathematical Society.
  7. Gentle, James E. 2003. Random Number Generation and Monte Carlo Methods, 2nd ed. New York: Springer-Verlag.
  8. Graybill, Franklin A. 1983. Introduction to Matrices with Applications in Statistics, 2nd ed. Belmont, California: Wadsworth Publishing Company.
  9. Hedayat, A. S., N. J. A. Sloane, and John Stufken. 1999. Orthogonal Arrays: Theory and Applications. New York: Springer-Verlag.
  10. Hoffman, A. J., and H. W. Wielandt. 1953. The variation of the spectrum of a normal matrix. Duke Mathematical Journal 20:37–39.
  11. Horn, Roger A., and Charles R. Johnson. 1991. Topics in Matrix Analysis. Cambridge, United Kingdom: Cambridge University Press.
  12. Liu, Shuangzhe, and Heinz Neudecker. 1996. Several matrix Kantorovich-type inequalities. Journal of Mathematical Analysis and Applications 197:23–26.
  13. Marshall, A. W., and I. Olkin. 1990. Matrix versions of the Cauchy and Kantorovich inequalities. Aequationes Mathematicae 40:89–93.
  14. Mosteller, Frederick, and David L. Wallace. 1963. Inference in an authorship problem. Journal of the American Statistical Association 58:275–309.
  15. Olshevsky, Vadim (Editor). 2003. Fast Algorithms for Structured Matrices: Theory and Applications. Providence, Rhode Island: American Mathematical Society.
  16. Strang, Gilbert, and Tri Nguyen. 2004. The interplay of ranks of submatrices. SIAM Review 46:637–646.
  17. Trefethen, Lloyd N., and Mark Embree. 2005. Spectra and Pseudospectra: The Behavior of Nonnormal Matrices and Operators. Princeton: Princeton University Press.
  18. Trosset, Michael W. 2002. Extensions of classical multidimensional scaling via variable reduction. Computational Statistics 17:147–163.
  19. Vandenberghe, Lieven, and Stephen Boyd. 1996. Semidefinite programming. SIAM Review 38:49–95.
  20. Wilkinson, J. H. 1965. The Algebraic Eigenvalue Problem. New York: Oxford University Press.
