Matrix Algebra, pp. 329–398
Special Matrices and Operations Useful in Modeling and Data Analysis
Abstract
In previous chapters, we encountered a number of special matrices, such as symmetric matrices, banded matrices, elementary operator matrices, and so on. In this chapter, we will discuss some of these matrices in more detail and also introduce some other special matrices and data structures that are useful in statistics.
There are a number of special kinds of matrices that are useful in statistical applications. In statistical applications in which data analysis is the objective, the initial step is the representation of observational data in some convenient form, which often is a matrix. The matrices for operating on observational data or summarizing the data often have special structures and properties. We discuss the representation of observations using matrices in Sect. 8.1.
In Sect. 8.2, we review and discuss some of the properties of symmetric matrices.
One of the most important properties of many matrices occurring in statistical data analysis is nonnegative or positive definiteness; this is the subject of Sects. 8.3 and 8.4.
Fitted values of a response variable that are associated with given values of covariates in linear models are often projections of the observations onto a subspace determined by the covariates. Projection matrices and Gramian matrices useful in linear models are considered in Sects. 8.5 and 8.6.
Another important property of many matrices occurring in statistical modeling is irreducible nonnegativeness or positiveness; this is the subject of Sect. 8.7.
Many of the defining properties of the special matrices discussed in this chapter are invariant under scalar multiplication; hence, the special matrices are members of cones. Interestingly even further, convex combinations of some types of special matrices yield special matrices of the same type; hence, those special matrices are members of convex cones (see Sect. 2.2.8 beginning on page 43).
8.1 Data Matrices and Association Matrices
There are several ways that data can be organized for representation in the computer. We distinguish logical structures from computer-storage structures. Data structure in the computer is an important concern and can greatly affect the efficiency of computer processing. We discuss some simple aspects of the organization for computer storage in Sect. 11.1, beginning on page 523. In the present section, we consider some general issues of logical organization and structure.
There are two important aspects of data in applications that we will not address here. One is metadata; that is, data about the data. Metadata includes names or labels associated with data, information about how and when the data were collected, information about how the data are stored in the computer, and so on. Another important concern in applications is missing data. In real-world applications it is common to have incomplete data. If the data are stored in some structure that naturally contains a cell or a region for the missing data, the computer representation of the dataset must contain some indication that the cell is empty. For numeric data, the convenient way of doing this is by using “not-available”, NA (see page 466), or “not-a-number”, NaN (see page 475). The effect on a statistical analysis when some data are missing varies with the type of analysis. We consider some effects of missing data on the estimation of variance-covariance matrices in Sect. 9.5.6, beginning on page 523.
8.1.1 Flat Files
The data may be various types of objects, such as names, real numbers, numbers with associated measurement units, sets, vectors, and so on. If the data are represented as real numbers, the data array is a matrix. (Note again our use of the word “matrix”; not just any rectangular array is a matrix in the sense used in this book.) Other types of data can often be made equivalent to a matrix in an intuitive manner.
The flat file arrangement emphasizes the relationships of the data both within an observational unit or row and within a variable or column. Simple operations on the data matrix may reveal relationships among observational units or among variables.
Flat files are the appropriate data structure for the analysis of linear models, but statistics is not just about analysis of linear models anymore. (It never was.)
8.1.2 Graphs and Other Data Structures
If the number of measurements on the observational units varies, or if the interest is primarily in simple relationships among observational units or among variables, the flat file structure may not be very useful. Sometimes a graph structure can be used advantageously.
A graph is a nonempty set V of points, called vertices, together with a collection E of unordered pairs of elements of V, called edges. (Other definitions of “graph” allow the null set to be a graph.) If we let \(\mathcal{G}\) be a graph, we represent it as (V, E). We often represent the set of vertices as \(V (\mathcal{G})\) and the collection of edges as \(E(\mathcal{G})\). An edge is said to be incident on each vertex in the edge. The number of vertices (that is, the cardinality of V ) is the order of the graph, and the number of edges, the cardinality of E, is the size of the graph.
An edge in which the two vertices are the same is called a loop. If two or more elements of \(E(\mathcal{G})\) contain the same two vertices, those edges are called multiple edges, and the graph itself is called a multigraph. A graph with no loops and with no multiple edges is called a simple graph. In some literature, “graph” means “simple graph” as I have defined it, and a graph as I have defined it that is not simple is called a “pseudograph”.
A path or walk is a sequence of edges, e_{1}, …, e_{ n }, such that for i ≥ 2 one vertex in e_{ i } is a vertex in edge e_{ i−1}. Alternatively, a path or walk is defined as a sequence of vertices with common edges.
A graph such that there is a path that includes any pair of vertices is said to be connected.
A graph with more than one vertex such that all possible pairs of vertices occur as edges is a complete graph.
A closed path or closed walk is a path such that a vertex in the first edge (or the first vertex in the alternate definition) is in the last edge (or the last vertex).
A cycle is a closed path in which all vertices occur exactly twice (or in the alternate definition, in which all vertices except the first and the last are distinct). A graph with no cycles is said to be acyclic. An acyclic graph is also called a tree. Trees are used extensively in statistics to represent clusters.
The number of edges that contain a given vertex (that is, the number of edges incident on the vertex v), denoted by d(v), is the degree of the vertex.
A vertex with degree 0 is said to be isolated.
A regular graph is one for which d(v_{ i }) is constant for all vertices v_{ i }; more specifically, a graph is k-regular if d(v_{ i }) = k for all vertices v_{ i }.
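These definitions translate directly into code. The following is a minimal sketch, in which the vertex labels and the particular edges are hypothetical illustrations: a simple (nondirected) graph is stored as a set of two-element frozensets, and the degree and k-regularity notions above are computed from it.

```python
# A simple (nondirected) graph: a set of vertices and a set of
# two-element edges.  The labels and edges here are hypothetical.
V = {"a", "b", "c", "d", "e"}
E = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "e"),
                            ("c", "d"), ("d", "e")]}

def degree(v, E):
    """d(v): the number of edges incident on vertex v."""
    return sum(1 for e in E if v in e)

def is_k_regular(V, E, k):
    """True if d(v) = k for every vertex v."""
    return all(degree(v, E) == k for v in V)

# Every vertex above has degree 2, so this graph is 2-regular
# (it is the cycle a-b-e-d-c-a).
```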
The presence of an edge between two vertices can indicate the existence of a relationship between the objects represented by the vertices. The graph represented in Fig. 8.2 may represent five observational units for which our primary interest is in their relationships with one another. For example, the observations may be authors of scientific papers, and an edge between two authors may represent the fact that the two have been coauthors on some paper.
In the graph represented in Fig. 8.2, there are no isolated vertices and the graph is connected. (Note that a graph with no isolated vertices is not necessarily connected.) The graph represented in Fig. 8.2 is not complete because, for example, there is no edge that contains vertices c and e. The graph is cyclic because of the closed path (defined by vertices) (c, d, e, b, a, c). Note that the closed path (c, d, a, e, b, a, c) is not a cycle.
This use of a graph immediately suggests various extensions of a basic graph. For example, E may be a multiset, with multiple instances of edges containing the same two vertices, perhaps, in the example above, representing multiple papers in which the two authors are coauthors. As we stated above, a graph in which E is a multiset is called a multigraph. Instead of just the presence or absence of edges between vertices, a weighted graph may be more useful; that is, one in which a real number is associated with a pair of vertices to represent the strength of the relationship, not just presence or absence, between the two vertices. A degenerate weighted graph (that is, an unweighted graph as discussed above) has weights of 0 or 1 between all vertices. A multigraph is a weighted graph in which the weights are restricted to nonnegative integers. Although the data in a weighted graph carry much more information than a graph with only its edges, or even a multigraph that allows strength to be represented by multiple edges, the simplicity of a graph sometimes recommends its use even when there are varying degrees of strength of relationships. A standard approach in applications is to set a threshold for the strength of relationship and to define an edge only when the threshold is exceeded.
8.1.2.1 Adjacency Matrix: Connectivity Matrix
There is no difference between the connectivity matrix and a table such as that in Fig. 8.3 except for the metadata.
A graph as we have described it is “nondirected”; that is, an edge has no direction. The edge (a, b) is the same as (b, a). An adjacency matrix for a nondirected graph is symmetric.
An interesting property of an adjacency matrix, which we will discuss further on page 393, is that if A is the adjacency matrix of a graph, then the (i, j)^{th} element of A^{ k } is the number of paths of length k between vertices i and j in that graph. (See Exercise 8.20.)
The adjacency matrix for a graph with no loops is hollow; that is, all diagonal elements are 0s. Another common way of representing a hollow adjacency matrix is to use − 1s in place of the off-diagonal zeroes; that is, the absence of a connection between two different vertices is denoted by − 1 instead of by 0. Such a matrix is called a Seidel adjacency matrix. (This matrix has no relationship to the Gauss-Seidel method discussed in Chap. 6.)
The relationship can obviously be defined in the other direction; that is, given an n × n symmetric matrix A, we define the graph of the matrix as the graph with n vertices and edges between vertices i and j if a_{ ij } ≠ 0. We often denote the graph of the matrix A by \(\mathcal{G}(A)\).
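The path-counting property is easy to check numerically. In the sketch below (using a hypothetical five-vertex graph), the (i, j) element of A^k counts the paths of length k between vertices i and j; in particular, the diagonal of A² recovers the vertex degrees.

```python
import numpy as np

# Adjacency matrix of a hypothetical nondirected graph on five vertices.
labels = ["a", "b", "c", "d", "e"]
idx = {v: i for i, v in enumerate(labels)}
A = np.zeros((5, 5), dtype=int)
for u, v in [("a", "b"), ("a", "c"), ("b", "e"), ("c", "d"), ("d", "e")]:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1   # nondirected, so symmetric

A2 = np.linalg.matrix_power(A, 2)
# (A^2)_{ii} = sum_j a_{ij}^2 is the degree of vertex i,
# and (A^2)_{ij} counts the paths of length 2 from i to j.
```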
8.1.2.2 Digraphs
Another extension of a basic graph is one in which the relationship may not be the same in both directions. This yields a digraph, or “directed graph”, in which the edges are ordered pairs called directed edges. The vertices in a digraph have two kinds of degree, an indegree and an outdegree, with the obvious meanings.
The simplest applications of digraphs are for representing networks. Consider, for example, the digraph represented by the network in Fig. 8.4. This is a network with five vertices, perhaps representing cities, and directed edges between some of the vertices. The edges could represent airline connections between the cities; for example, there are flights from x to u and from u to x, and from y to z, but not from z to y.
Figure 8.4 represents a digraph with order 5 (there are five vertices) and size 11 (eleven directed edges). A sequence of edges, e_{1}, …, e_{ n }, constituting a path in a digraph must be such that for i ≥ 2 the first vertex in e_{ i } is the second vertex in edge e_{ i−1}. For example, the sequence x, y, z, w, u, x in the graph of Fig. 8.4 is a path (in fact, a cycle) but the sequence x, u, w, z, y, x is not a path.
In statistical applications, graphs are used for representing symmetric associations. Digraphs are used for representing asymmetric associations or oneway processes such as a stochastic process.
In a simple digraph, the edges only indicate the presence or absence of a relationship, but just as in the case of a simple graph, we can define a weighted digraph by associating nonnegative numbers with each directed edge.
Graphical modeling is useful for analyzing relationships between elements of a collection of sets. For example, in an analysis of internet traffic, profiles of users may be constructed based on the set of web sites each user visits in relation to the sets visited by other users. For this kind of application, an intersection graph may be useful. An intersection graph, for a given collection of sets \(\mathcal{S}\), is a graph whose vertices correspond to the sets in \(\mathcal{S}\) and that has an edge between two vertices if and only if the corresponding sets have a common element.
The word “graph” is often used without qualification to mean any of these types.
8.1.2.3 Connectivity of Digraphs
A digraph that is not weakly connected must have two sets of nodes with no edges between any nodes in one set and any nodes in the other set.
8.1.2.4 Irreducible Matrices
Any nonnegative square matrix that can be permuted into the form in equation (8.4) with square diagonal submatrices is said to be reducible; a matrix that cannot be put into that form is irreducible. We also use the terms reducible and irreducible to refer to the graph itself.
Irreducible matrices have many interesting properties, some of which we will discuss in Sect. 8.7.3, beginning on page 375. The implication (8.77) in that section provides a simple characterization of irreducibility.
8.1.2.5 Strong Connectivity of Digraphs and Irreducibility of Matrices
A nonnegative matrix is irreducible if and only if its digraph is strongly connected. Stated another way, a digraph is not strongly connected if and only if its matrix is reducible.
To see this, first consider a reducible matrix. In its reduced form of equation (8.4), none of the nodes corresponding to the last rows have directed edges leading to any of the nodes corresponding to the first rows; hence, the digraph is not strongly connected.
Now, assume that a given digraph \(\mathcal{G}\) is not strongly connected. In that case, there is some node, say the i^{th} node, from which there is no directed path to some other node. Assume that there are m − 1 nodes that can be reached from node i. If m = 1, then we have a trivial partitioning of the n × n connectivity matrix in which B_{11} of equation (8.4) is (n − 1) × (n − 1) and B_{22} is a 1 × 1 zero matrix (that is, 0_{1}). If m > 1, perform symmetric permutations so that the row corresponding to node i and the rows corresponding to the other m − 1 nodes are the last m rows of the permuted connectivity matrix. In this case, the first n − m elements in each of those rows must be 0. To see that this must be the case, let k > n − m and j ≤ n − m and assume that the element in the (k, j)^{th} position is nonzero. In that case, there is a path from node i through node k to node j; but node j is in the set of nodes not reachable from node i, which is a contradiction, and hence the (k, j)^{th} element (in the permuted matrix) must be 0. The submatrix corresponding to B_{11} is (n − m) × (n − m), and that corresponding to B_{22} is m × m. These properties also hold for connectivity matrices with simple loops (with 1s on the diagonal) and for an augmented connectivity matrix (see page 393).
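As an aside, strong connectivity (equivalently, irreducibility) is easy to test computationally. The sketch below uses a standard characterization: a nonnegative n × n matrix with 0-1 pattern B is irreducible if and only if (I + B)^{n−1} has no zero elements. This is offered as one illustrative test, not necessarily the characterization referred to in Sect. 8.7.3.

```python
import numpy as np

def is_irreducible(A):
    """Test irreducibility of a nonnegative square matrix A by checking
    that the reachability matrix (I + B)^(n-1) is everywhere positive,
    where B is the 0-1 nonzero pattern of A."""
    n = A.shape[0]
    B = (np.abs(A) > 0).astype(int) + np.eye(n, dtype=int)
    R = np.linalg.matrix_power(B, n - 1)
    return bool((R > 0).all())

# A directed 3-cycle is strongly connected, hence irreducible ...
C = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])
# ... while an upper triangular matrix is already in reduced form.
T = np.array([[1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])
```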
Reducibility plays an important role in the analysis of Markov chains (see Sect. 9.8.1).
8.1.3 Term-by-Document Matrices
An interesting area of statistical application is in clustering and classifying documents. In the simpler cases, the documents consist of text only, and much recent research has been devoted to “text data mining”. (The problem is not a new one; Mosteller and Wallace, 1963, studied a related problem.)
We have a set of text documents (often called a “corpus”). A basic set of data to use in studying the text documents is the term-document matrix, which is a matrix whose columns correspond to documents and whose rows correspond to the various terms used in the documents. The terms are usually just words. The entries in the term-document matrix are measures of the importance of the term in the document. Importance may be measured simply by the number of times the word occurs in the document, possibly weighted by some measure of the total number of words in the document. In other measures of importance, the relative frequency of the word in the given document may be adjusted for the relative frequency of the word in the corpus. (Such a measure is called term frequency-inverse document frequency, or tf-idf.) Certain common words, such as “the”, called “stopwords”, may be excluded from the data. Also, words may be “stemmed”; that is, they may be associated with a root that ignores modifying letters such as an “s” in a plural or an “ed” in a past-tense verb. Other variations include accounting for sequences of terms (called “collocation”).
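The construction of a term-document matrix with tf-idf weights can be sketched in a few lines. The toy corpus, the tokenization, and the particular tf-idf formula below are illustrative assumptions; several weighting variants are in use.

```python
import math

# A hypothetical toy corpus; rows of A will be terms, columns documents.
docs = ["the cat sat", "the cat ran", "dogs ran fast"]
corpus = [d.split() for d in docs]
terms = sorted({w for d in corpus for w in d if w != "the"})  # "the" is a stopword

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)        # relative frequency in the document
    df = sum(term in d for d in corpus)    # number of documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# The term-document matrix: one row per term, one column per document.
A = [[tf_idf(t, d, corpus) for d in corpus] for t in terms]
```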
There are several interesting problems that arise in text analysis. Term-document matrices can be quite large. Terms are often misspelled. The documents may be in different languages. Some terms may have multiple meanings. The documents themselves may be stored on different computer systems.
Two types of manipulation of the termdocument matrix are commonly employed in the analysis. One is singularvalue decomposition (SVD), and the other is nonnegative matrix factorization (NMF).
SVD is used in what is called “latent semantic analysis” (“LSA”), or “latent semantic indexing” (“LSI”). A variation of latent semantic analysis is “probabilistic” latent semantic analysis, in which a probability distribution is assumed for the words. In a specific instance of probabilistic latent semantic analysis, the probability distribution is a Dirichlet distribution. (Most users of this latter method are not statisticians; they call the method “LDA”, for “latent Dirichlet allocation”.)
NMF is often used in determining which documents are similar to each other or, alternatively, which terms are clustered in documents. If A is the termdocument matrix, and A = WH is a nonnegative factorization, then the element w_{ ij } can be interpreted as the degree to which term i belongs to cluster j, while element h_{ ij } can be interpreted as the degree to which document j belongs to cluster i. A method of clustering the documents is to assign document j (corresponding to the j^{th} column of A) to the k^{th} cluster if h_{ kj } is the maximum element in the column h_{∗j }.
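The clustering rule just described can be sketched with a small nonnegative factorization. The term-document matrix below is a hypothetical toy example, and the factorization uses the classical multiplicative-update rules, which are only one of several NMF algorithms.

```python
import numpy as np

# Hypothetical term-document matrix A (rows = terms, columns = documents)
# with two evident topics: terms 0-1 vs. terms 2-3.
A = np.array([[3., 2., 0., 0.],
              [2., 3., 0., 1.],
              [0., 0., 4., 3.],
              [0., 1., 3., 4.]])
k = 2
rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, (A.shape[0], k))
H = rng.uniform(0.1, 1.0, (k, A.shape[1]))
for _ in range(200):  # multiplicative updates keep W and H nonnegative
    H *= (W.T @ A) / (W.T @ W @ H + 1e-12)
    W *= (A @ H.T) / (W @ H @ H.T + 1e-12)

# Assign document j to the cluster with the largest element in column j of H.
clusters = H.argmax(axis=0)
```

With this block-structured example, documents 0 and 1 should fall in one cluster and documents 2 and 3 in the other.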
There is a wealth of literature on text datamining, but we will not discuss the analysis methods further.
8.1.4 Probability Distribution Models
Probability models in statistical data analysis are often multivariate distributions, and hence, matrices arise in the model. In the analysis itself, matrices that represent associations are computed from the observational data. In this section we mention some matrices in the models, and in the next section we refer to some association matrices that are computed in the analysis.
Data in rows of flat files are often assumed to be realizations of vector random variables, some elements of which may have a degenerate distribution (that is, the elements in some columns of the data matrix may be considered to be fixed rather than random). The data in one row are often considered independent of the data in another row. Statistical data analysis is generally concerned with studying various models of relationships among the elements of the vector random variables. For example, the familiar linear regression model relates one variable (one column) to a linear combination of other variables plus a translation and random noise.
A random graph of fixed order is a discrete probability space over all possible graphs of that order. For a graph of order n, there are \(2^{{n\choose 2}}\) possible graphs. Asymptotic properties of the probability distribution refer to the increase of the order without limit. Occasionally it is useful to consider the order of the graph to be random also. If the order is unrestricted, the sample space for a random graph of random order is infinite but countable. The number of digraphs of order n is \(4^{{n\choose 2}}\).
Random graphs have many uses in the analysis of large systems of interacting objects; for example, a random intersection graph may be used to make inferences about the clustering of internet users based on the web sites they visit.
8.1.5 Derived Association Matrices
In data analysis, the interesting questions usually involve the relationships among the variables or among the observational units. Matrices formed from the original data matrix for the purpose of measuring these relationships are called association matrices. There are basically two types: similarity and dissimilarity matrices. The variance-covariance matrix, which we discuss in Sect. 8.6.3, is an example of an association matrix that measures similarity. We discuss dissimilarity matrices in Sect. 8.6.6, and in Sect. 8.8.9 we discuss a type of similarity matrix for data represented in graphs.
In addition to the distinction between similarity and dissimilarity association matrices, we may identify two types of association matrices based on whether the relationships of interest are among the rows (observations) or among the columns (variables or features). In applications, dissimilarity relationships among rows tend to be of more interest, and similarity relationships among columns are usually of more interest. (The applied statistician may think of clustering, multidimensional scaling, or Q factor analysis for the former and correlation analysis, principal components analysis, or factor analysis for the latter.)
8.2 Symmetric Matrices and Other Unitarily Diagonalizable Matrices
Most association matrices encountered in applications are real and symmetric. Because real symmetric matrices occur so frequently in statistical applications and because such matrices have so many interesting properties, it is useful to review some of those properties that we have already encountered and to state some additional properties.
First, perhaps, we should reiterate a trivial but important fact: the product of symmetric matrices is not, in general, symmetric. A power of a symmetric matrix, however, is symmetric.
We should also emphasize that some of the special matrices we have discussed are assumed to be symmetric because, if they were not, we could define equivalent symmetric matrices. This includes positive definite matrices and more generally the matrices in quadratic forms.
8.2.1 Some Important Properties of Symmetric Matrices

If k is any positive integer, A^{ k } is symmetric.

AB is not necessarily symmetric even if B is a symmetric matrix.

A ⊗ B is symmetric if B is symmetric.

If A is nonsingular, then A^{−1} is also symmetric because (A^{−1})^{T} = (A^{T})^{−1} = A^{−1}.

If A is nonsingular (so that A^{ k } is defined for nonpositive integers), A^{ k } is symmetric and nonsingular for any integer k.

All eigenvalues of A are real (see page 140).

A is diagonalizable (or simple), and in fact A is orthogonally diagonalizable; that is, it has an orthogonally similar canonical factorization, A = V CV^{T} (see page 154).

A has the spectral decomposition A = ∑_{ i }c_{ i }v_{ i }v_{ i }^{T}, where the c_{ i } are the eigenvalues and v_{ i } are the corresponding eigenvectors (see page 155).

A power of A has the spectral decomposition A^{ k } = ∑_{ i }c_{ i }^{ k }v_{ i }v_{ i }^{T}.

Any quadratic form x^{T}Ax can be expressed as ∑_{ i }b_{ i }^{2}c_{ i }, where the b_{ i } are elements in the vector V^{−1}x.
We have
$$\displaystyle{\max _{x\neq 0}\frac{x^{\mathrm{T}}Ax} {x^{\mathrm{T}}x} =\max \{ c_{i}\}}$$
(see page 156). If A is nonnegative definite, this is the spectral radius ρ(A).
For the L_{2} norm of the symmetric matrix A, we have
$$\displaystyle{\|A\|_{2} =\rho (A).}$$
For the Frobenius norm of the symmetric matrix A, we have
$$\displaystyle{\|A\|_{F} = \sqrt{\sum c_{i }^{2}}.}$$
This follows immediately from the fact that A is diagonalizable, as do the following properties.
 $$\displaystyle{\mathrm{tr}(A) =\sum c_{i}}$$
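Several of the properties listed above can be verified numerically for any particular symmetric matrix. The following sketch, with an arbitrary 3 × 3 example, checks the orthogonal diagonalization, the two norm identities, and the trace identity.

```python
import numpy as np

A = np.array([[2., 1., 0.],    # an arbitrary symmetric matrix
              [1., 3., 1.],
              [0., 1., 2.]])
c, V = np.linalg.eigh(A)       # real eigenvalues c, orthogonal V

assert np.allclose(V @ np.diag(c) @ V.T, A)           # A = V C V^T
assert np.isclose(np.linalg.norm(A, 2), max(abs(c)))  # ||A||_2 = rho(A)
assert np.isclose(np.linalg.norm(A, "fro"), np.sqrt((c ** 2).sum()))
assert np.isclose(np.trace(A), c.sum())               # tr(A) = sum of c_i
```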
8.2.2 Approximation of Symmetric Matrices and an Important Inequality
In Sect. 3.10, we considered the problem of approximating a given matrix by another matrix of lower rank. There are other situations in statistics in which we need to approximate one matrix by another one. In data analysis, this may be because our given matrix arises from poor observations and we know the “true” matrix has some special properties not possessed by the given matrix computed from the data. A familiar example is a sample variance-covariance matrix computed from incomplete data (see Sect. 9.5.6). Other examples in statistical applications occur in the simulation of random matrices (see Gentle, 2003, Section 5.3.3). In most cases of interest, the matrix to be approximated is a symmetric matrix.
We now proceed in two steps to show that in order for f(Q) to attain its lower bound l, X must be diagonal. First we will show that when f(Q) = l, the submatrix X_{ ij } in equation (8.9) must be null if i ≠ j. To this end, let \(Q_{\nabla }\) be such that \(f(Q_{\nabla }) = l\), and assume the contrary regarding the corresponding \(X_{\nabla } = Q_{\nabla }^{\mathrm{T}}\widetilde{A}Q_{\nabla }\); that is, assume that in some submatrix \(X_{ij\nabla }\) where i ≠ j, there is a nonzero element, say \(x_{\nabla }\). We arrive at a contradiction by showing that in this case there is another X_{0} of the form \(Q_{0}^{\mathrm{T}}\widetilde{A}Q_{0}\), where Q_{0} is orthogonal and such that \(f(Q_{0}) <f(Q_{\nabla })\).
To establish some useful notation, let p and q be the row and column, respectively, of \(X_{\nabla }\) where this nonzero element \(x_{\nabla }\) occurs; that is, \(x_{pq} = x_{\nabla }\neq 0\) and p ≠ q because x_{ pq } is in \(X_{ij\nabla }\). (Note the distinction between uppercase letters, which represent submatrices, and lowercase letters, which represent elements of matrices.) Also, because \(X_{\nabla }\) is symmetric, \(x_{qp} = x_{\nabla }\). Now let \(a_{\nabla } = x_{pp}\) and \(b_{\nabla } = x_{qq}\). We form Q_{0} as \(Q_{\nabla }R\), where R is an orthogonal rotation matrix of the form G_{ pq } in equation (5.12) on page 239. We have, therefore, \(\|Q_{0}^{\mathrm{T}}\widetilde{A}Q_{0}\|^{2} =\| R^{\mathrm{T}}Q_{\nabla }^{\mathrm{T}}\widetilde{A}Q_{\nabla }R\|^{2} =\| Q_{\nabla }^{\mathrm{T}}\widetilde{A}Q_{\nabla }\|^{2}\). Let a_{0}, b_{0}, and x_{0} represent the elements of \(Q_{0}^{\mathrm{T}}\widetilde{A}Q_{0}\) that correspond to \(a_{\nabla }\), \(b_{\nabla }\), and \(x_{\nabla }\) in \(Q_{\nabla }^{\mathrm{T}}\widetilde{A}Q_{\nabla }\).
We have shown, therefore, that the minimum of f(Q) is \(\sum _{i=1}^{n}(c_{i} -\tilde{ c}_{i})^{2}\), where both sets of eigenvalues are arranged in nonincreasing order. From equation (8.7), which is f(V ), we have the inequality (8.6).
While an upper bound may be of more interest in the approximation problem, the lower bound in the Hoffman-Wielandt theorem gives us a measure of the goodness of the approximation of one matrix by another matrix. There are various extensions and other applications of the Hoffman-Wielandt theorem; see Chu (1991).
8.2.3 Normal Matrices
A real square matrix A is said to be normal if A^{T}A = AA^{T}. (In general, a square matrix is normal if A^{H}A = AA^{H}.) The Gramian matrix formed from a normal matrix is the same as the Gramian formed from the transpose (or conjugate transpose) of the matrix. Normal matrices include symmetric (and Hermitian), skew symmetric (and skew Hermitian), square orthogonal (and unitary) matrices, and circulant matrices. The identity is also obviously a normal matrix.
There are a number of interesting properties possessed by a normal matrix, but the most important property is that it can be diagonalized by a unitary matrix. Recall from page 154 that a matrix can be orthogonally diagonalized if and only if the matrix is symmetric. Not all normal matrices can be orthogonally diagonalized, but all can be diagonalized by a unitary matrix (“unitarily diagonalized”). In fact, a matrix can be unitarily diagonalized if and only if the matrix is normal. (This is the reason the word “normal” is used to describe these matrices; unitary diagonalizability is an alternative, and more meaningful, way of defining a normal matrix.)
It is easy to see that a matrix A is unitarily diagonalizable if and only if A^{H}A = AA^{H} (and for real A, A^{T}A = AA^{T} implies A^{H}A = AA^{H}).
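A rotation matrix gives a concrete nonsymmetric example: it is orthogonal, hence normal, and so unitarily (but not orthogonally) diagonalizable. The sketch below checks this numerically; for a normal matrix with distinct eigenvalues, the normalized eigenvector matrix returned by a standard eigensolver is unitary up to rounding.

```python
import numpy as np

theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],    # a rotation: orthogonal,
              [np.sin(theta),  np.cos(theta)]])   # normal, but not symmetric

assert np.allclose(A.T @ A, A @ A.T)                 # A is normal
c, U = np.linalg.eig(A)                              # eigenvalues e^{+-i theta}
assert np.allclose(U @ np.diag(c) @ U.conj().T, A)   # A = U C U^H
assert np.allclose(U.conj().T @ U, np.eye(2))        # U is unitary
```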
Spectral methods, based on the unitary diagonalization, are useful in many areas of applied mathematics. The spectra of nonnormal matrices, however, are quite different (see Trefethen and Embree (2005)).
8.3 Nonnegative Definite Matrices: Cholesky Factorization
We defined nonnegative definite and positive definite matrices on page 91, and discussed some of their properties, particularly in Sect. 3.8.11. We have seen that these matrices have useful factorizations, in particular, the square root and the Cholesky factorization. In this section, we recall those definitions, properties, and factorizations.

The sum of two (conformable) nonnegative definite matrices is nonnegative definite.

All diagonal elements of a nonnegative definite matrix are nonnegative. Hence, if A is nonnegative definite, tr(A) ≥ 0.

Any square submatrix whose principal diagonal is a subset of the principal diagonal of a nonnegative definite matrix is nonnegative definite. In particular, any square principal submatrix of a nonnegative definite matrix is nonnegative definite.
It is easy to show that the latter two facts follow from the definition by considering a vector x with zeros in all positions except those corresponding to the submatrix in question. For example, to see that all diagonal elements of a nonnegative definite matrix are nonnegative, assume the (i, i) element is negative, and then consider the vector x to consist of all zeros except for a 1 in the i^{th} position. It is easy to see that the quadratic form is negative, so the assumption that the (i, i) element is negative leads to a contradiction.

A diagonal matrix is nonnegative definite if and only if all of the diagonal elements are nonnegative.
This must be true because a quadratic form in a diagonal matrix is the sum of the diagonal elements times the squares of the elements of the vector.

If A is nonnegative definite, then \(A_{(i_{1},\ldots,i_{k})(i_{1},\ldots,i_{k})}\) is nonnegative definite. (See page 599 for notation.)
Again, we can see this by selecting an x in the defining inequality (8.10) consisting of 1s in the positions corresponding to the rows and columns of A that are retained and 0s elsewhere.

If A is nonnegative definite and C is conformable for the multiplication, then C^{T}AC is nonnegative definite.

The determinant of a nonnegative definite matrix is nonnegative.
8.3.1 Eigenvalues of Nonnegative Definite Matrices
We have seen on page 159 that a real symmetric matrix is nonnegative (positive) definite if and only if all of its eigenvalues are nonnegative (positive).
This fact allows a generalization of the statement above: a triangular matrix is nonnegative (positive) definite if and only if all of the diagonal elements are nonnegative (positive).
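The eigenvalue characterization gives the usual computational test of definiteness. The sketch below, with a small tolerance for rounding (an implementation choice) and two arbitrary 2 × 2 examples, applies it.

```python
import numpy as np

def is_nonneg_definite(A, tol=1e-10):
    """Symmetric A is nonnegative definite iff all eigenvalues are >= 0."""
    return bool((np.linalg.eigvalsh(A) >= -tol).all())

def is_pos_definite(A, tol=1e-10):
    """Symmetric A is positive definite iff all eigenvalues are > 0."""
    return bool((np.linalg.eigvalsh(A) > tol).all())

A = np.array([[2., 1.], [1., 2.]])  # eigenvalues 1 and 3: positive definite
B = np.array([[1., 1.], [1., 1.]])  # eigenvalues 0 and 2: nonnegative definite only
```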
8.3.2 The Square Root and the Cholesky Factorization
8.3.3 The Convex Cone of Nonnegative Definite Matrices
The class of all n × n nonnegative definite matrices is a cone because if X is a nonnegative definite matrix and a > 0, then aX is a nonnegative definite matrix (see page 43). Furthermore, it is a convex cone in \(\mathrm{I\!R}^{n\times n}\), because if X_{1} and X_{2} are n × n nonnegative definite matrices and a, b ≥ 0, then aX_{1} + bX_{2} is nonnegative definite.
This class is not closed under Cayley multiplication (that is, in particular, it is not a group with respect to that operation). The product of two nonnegative definite matrices might not even be symmetric.
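A small plain-Python illustration (matrix values are ours) of this failure of closure under Cayley multiplication: the product of two symmetric positive definite matrices need not even be symmetric.

```python
# Sketch: two symmetric positive definite matrices whose ordinary (Cayley)
# product is not symmetric, so the class is not closed under multiplication.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[2.0, 1.0], [1.0, 1.0]]   # positive definite (positive diagonal, det = 1)
B = [[1.0, 0.0], [0.0, 2.0]]   # positive definite diagonal matrix
AB = matmul(A, B)
assert AB == [[2.0, 2.0], [1.0, 2.0]]
assert AB[0][1] != AB[1][0]    # AB is not symmetric
```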
8.4 Positive Definite Matrices
An important class of nonnegative definite matrices consists of those that satisfy a strict inequality in the definition involving x^{T}Ax. These matrices are called positive definite matrices, and they have all of the properties discussed above for nonnegative definite matrices as well as some additional useful properties.

Any square principal submatrix of a positive definite matrix is positive definite.
Because a quadratic form in a diagonal matrix is the sum of products of the diagonal elements and the squares of the corresponding elements of the vector, a diagonal matrix is positive definite if and only if all of the diagonal elements are positive.

The determinant of a positive definite matrix is positive.
If A is positive definite, then for i ≠ j,
$$\displaystyle{ a_{ij}^{2} < a_{ii}a_{jj}, }$$(8.17)
which we see using the same argument as for inequality (8.12).

The sum of a positive definite matrix and a (conformable) nonnegative definite matrix is positive definite.

A positive definite matrix is necessarily nonsingular. (We see this from the fact that no nonzero combination of the columns, or rows, can be 0.) Furthermore, if A is positive definite, then A^{−1} is positive definite. (We showed this in Sect. 3.8.11, but we can see it in another way: for any y ≠ 0 and x = A^{−1}y, we have y^{T}A^{−1}y = x^{T}y = x^{T}Ax > 0.)

A (strictly) diagonally dominant symmetric matrix with positive diagonals is positive definite. The proof of this is Exercise 8.3.

A positive definite matrix is orthogonally diagonalizable.

A positive definite matrix has a square root.

A positive definite matrix has a Cholesky factorization.
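The Cholesky factorization can be sketched directly from its defining recursion. The following plain-Python function (ours, for illustration only) factors a small symmetric positive definite matrix and checks that L L^T reproduces it.

```python
import math

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(A[i][i] - s)   # positive for a PD matrix
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L = cholesky(A)
# Verify that L L^T reproduces A
for i in range(3):
    for j in range(3):
        assert abs(sum(L[i][k] * L[j][k] for k in range(3)) - A[i][j]) < 1e-12
```

For this particular A the factor works out exactly: L has rows (2, 0, 0), (1, 2, 0), and (1, 1, 2).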

If A is positive definite, and C is of full rank and conformable for the multiplication AC, then C^{T}AC is positive definite (see page 114).
If A ≻ B, we also write B ≺ A; and if A⪰B, we may write B⪯A.
8.4.1 Leading Principal Submatrices of Positive Definite Matrices
8.4.2 The Convex Cone of Positive Definite Matrices
The class of all n × n positive definite matrices is a cone because if X is a positive definite matrix and a > 0, then aX is a positive definite matrix (see page 43). Furthermore, this class is a convex cone in \(\mathrm{I\!R}^{n\times n}\) because if X_{1} and X_{2} are n × n positive definite matrices and a, b ≥ 0, then aX_{1} + bX_{2} is positive definite so long as either a ≠ 0 or b ≠ 0.
As with the cone of nonnegative definite matrices this class is not closed under Cayley multiplication.
8.4.3 Inequalities Involving Positive Definite Matrices
Quadratic forms of positive definite matrices and nonnegative matrices occur often in data analysis. There are several useful inequalities involving such quadratic forms.
All of the inequalities (8.21) through (8.25) are sharp. We know that (8.21) and (8.22) are sharp by using the appropriate eigenvectors. We can see the others are sharp by using A = I.
There are several variations on these inequalities and other similar inequalities that are reviewed by Marshall and Olkin (1990) and Liu and Neudecker (1996).
8.5 Idempotent and Projection Matrices
From the definition, it is clear that an idempotent matrix is its own Drazin inverse: A^{D} = A (see page 129).
An idempotent matrix that is symmetric is called a projection matrix.
8.5.1 Idempotent Matrices
Many matrices encountered in the statistical analysis of linear models are idempotent. One such matrix is X^{−}X (see page 124 and Sect. 9.3.2). This matrix exists for any n × m matrix X, and it is square. (It is m × m.)
Because the eigenvalues of A^{2} are the squares of the eigenvalues of A, all eigenvalues of an idempotent matrix must be either 0 or 1.
Equation (8.29) together with the diagonalizability theorem (equation ( 3.248)) implies that an idempotent matrix is diagonalizable.
If A is idempotent, then (I − A) is also idempotent, as we see by multiplication. This fact and equation (8.29) have generalizations for sums of idempotent matrices that are parts of Cochran’s theorem, which we consider below.
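A quick numerical illustration (plain Python; the matrix is our own example, the rank-one averaging projector): for an idempotent A, the complement I − A is idempotent as well, and the two annihilate each other.

```python
# Sketch: A = (1/2) * ones(2,2) is idempotent, and so is I - A.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

A = [[0.5, 0.5], [0.5, 0.5]]
I = [[1.0, 0.0], [0.0, 1.0]]
IA = [[I[i][j] - A[i][j] for j in range(2)] for i in range(2)]

assert matmul(A, A) == A            # A is idempotent
assert matmul(IA, IA) == IA         # so is its complement I - A
assert matmul(A, IA) == [[0.0, 0.0], [0.0, 0.0]]   # A(I - A) = 0
```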
On page 146, we saw that similar matrices are equivalent (have the same rank). For idempotent matrices, we have the converse: idempotent matrices of the same rank (and size) are similar (see Exercise 8.5).
8.5.1.1 Symmetric Idempotent Matrices
Many of the idempotent matrices in statistical applications are symmetric, and such matrices have some useful properties.
8.5.1.2 Cochran’s Theorem
We can now extend this result to an idempotent matrix in place of I; that is, for an idempotent matrix A with A = A_{1} + ⋯ + A_{ k }. Rather than stating it simply as in equation (8.39), however, we will state the implications differently.
Then any two of the following conditions imply the third one:
 (a).
A is idempotent.
 (b).
A_{ i } is idempotent for i = 1, …, k.
 (c).
A_{ i }A_{ j } = 0 for all i ≠ j.
This is also called Cochran’s theorem. (The theorem also applies to nonsymmetric matrices if condition (c) is augmented with the requirement that rank(A_{ i }^{2}) = rank(A_{ i }) for all i. We will restrict our attention to symmetric matrices, however, because in most applications of these results, the matrices are symmetric.)
First, if we assume properties (a) and (b), we can show that property (c) follows using an argument similar to that used to establish equation (8.39) for the special case A = I. The formal steps are left as an exercise.
Any two of the properties (a) through (c) also imply a fourth property for A = A_{1} + ⋯ + A_{ k } when the A_{ i } are symmetric:
 (d).
rank(A) = rank(A_{1}) + ⋯ + rank(A_{ k }).
There is also a partial converse: properties (a) and (d) imply the other properties.
The most important statistical application of Cochran’s theorem is for the distribution of quadratic forms of normally distributed random vectors. These distribution results are also called Cochran’s theorem. We briefly discuss it in Sect. 9.2.3.
8.5.2 Projection Matrices: Symmetric Idempotent Matrices
For a given vector space \(\mathcal{V}\), a symmetric idempotent matrix A whose columns span \(\mathcal{V}\) is said to be a projection matrix onto \(\mathcal{V}\); in other words, a matrix A is a projection matrix onto span(A) if and only if A is symmetric and idempotent. (Some authors do not require a projection matrix to be symmetric. In that case, the terms “idempotent” and “projection” are synonymous.)
It is easy to see that, for any vector x, if A is a projection matrix onto \(\mathcal{V}\), the vector Ax is in \(\mathcal{V}\), and the vector x − Ax is in \(\mathcal{V}^{\perp }\) (the vectors Ax and x − Ax are orthogonal). For this reason, a projection matrix is sometimes called an “orthogonal projection matrix”. Note that an orthogonal projection matrix is not an orthogonal matrix, however, unless it is the identity matrix. Stating this in alternative notation, if A is a projection matrix and \(A \in \mathrm{I\!R}^{n\times n}\), then A maps \(\mathrm{I\!R}^{n}\) onto \(\mathcal{V}(A)\) and I − A is also a projection matrix (called the complementary projection matrix of A), and it maps \(\mathrm{I\!R}^{n}\) onto the orthogonal complement, \(\mathcal{N}(A)\). These spaces are such that \(\mathcal{V}(A) \oplus \mathcal{N}(A) = \mathrm{I\!R}^{n}\).
In this text, we use the term “projection” to mean “orthogonal projection”, but we should note that in some literature “projection” can include “oblique projection”. In the less restrictive definition, for vector spaces \(\mathcal{V}\), \(\mathcal{X}\), and \(\mathcal{Y}\), if \(\mathcal{V} = \mathcal{X}\oplus \mathcal{Y}\) and v = x + y with \(x \in \mathcal{X}\) and \(y \in \mathcal{Y}\), then the vector x is called the projection of v onto \(\mathcal{X}\) along \(\mathcal{Y}\). In this text, to use the unqualified term “projection”, we require that \(\mathcal{X}\) and \(\mathcal{Y}\) be orthogonal; if they are not, then we call x the oblique projection of v onto \(\mathcal{X}\) along \(\mathcal{Y}\). The choice of the more restrictive definition is because of the overwhelming importance of orthogonal projections in statistical applications. The restriction is also consistent with the definition in equation ( 2.51) of the projection of a vector onto another vector (as opposed to the projection onto a vector space).
Because a projection matrix is idempotent, the matrix projects any of its columns onto itself, and of course it projects the full matrix onto itself: AA = A (see equation (8.27)).
If x is a general vector in \(\mathrm{I\!R}^{n}\), that is, if x has order n and belongs to an n-dimensional space, and A is a projection matrix of rank r ≤ n, then Ax has order n and belongs to span(A), which is an r-dimensional space. Thus, we say projections are dimension reductions.
Useful projection matrices often encountered in statistical linear models are X^{+}X and XX^{+}. (Recall that for any generalized inverse X^{−}, the matrix X^{−}X is idempotent.) The matrix X^{+} exists for any n × m matrix X, and X^{+}X is square (m × m) and symmetric.
8.5.2.1 Projections onto Linear Combinations of Vectors
This idea can be used to project y onto the plane formed by two vectors, x_{1} and x_{2}, by forming a projection matrix in a similar manner and replacing x in equation (8.42) with the matrix X = [x_{1}  x_{2}]. On page 409, we will view linear regression fitting as a projection onto the space spanned by the independent variables.
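A minimal sketch of such a projection (plain Python; the vectors are our own illustrative values): we project y onto the plane spanned by x_1 and x_2 by solving the 2 × 2 normal equations, and then verify that the residual is orthogonal to both vectors.

```python
# Sketch: project y onto span{x1, x2} by solving the normal equations
# (X^T X) c = X^T y for X = [x1 x2]; the projection is Xc, and the
# residual y - Xc is orthogonal to both columns.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x1, x2 = [1.0, 1.0, 1.0], [1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0]

a11, a12, a22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
b1, b2 = dot(x1, y), dot(x2, y)
det = a11 * a22 - a12 * a12
c1 = (a22 * b1 - a12 * b2) / det      # Cramer's rule for the 2x2 system
c2 = (a11 * b2 - a12 * b1) / det

yhat = [c1 * u + c2 * v for u, v in zip(x1, x2)]   # the projection of y
resid = [yi - yh for yi, yh in zip(y, yhat)]
assert abs(dot(resid, x1)) < 1e-12 and abs(dot(resid, x2)) < 1e-12
```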
8.6 Special Matrices Occurring in Data Analysis
Some of the most useful applications of matrices are in the representation of observational data, as in Fig. 8.1 on page 330. If the data are represented as real numbers, the array is a matrix, say X. The rows of the n × m data matrix X are “observations” and correspond to a vector of measurements on a single observational unit, and the columns of X correspond to n measurements of a single variable or feature. In data analysis we may form various association matrices that measure relationships among the variables or the observations that correspond to the columns or the rows of X. Many summary statistics arise from a matrix of the form X^{T}X. (If the data in X are incomplete—that is, if some elements are missing—problems may arise in the analysis. We discuss some of these issues in Sect. 9.5.6.)
8.6.1 Gramian Matrices
A (real) matrix A such that for some (real) matrix B, A = B^{T}B, is called a Gramian matrix. Any nonnegative definite matrix is Gramian (from equation (8.14) and Sect. 5.9.2 on page 256). (This is not a definition of “Gramian” or “Gram” matrix; these terms have more general meanings, but they do include any matrix expressible as B^{T}B.)
8.6.1.1 Sums of Squares and Cross Products
Although the properties of Gramian matrices are of interest, our starting point is usually the data matrix X, which we may analyze by forming a Gramian matrix X^{T}X or XX^{T} (or a related matrix). These Gramian matrices are also called sums of squares and cross products matrices. (The term “cross product” does not refer to the cross product of vectors defined on page 47, but rather to the presence of sums over i of the products x_{ ij }x_{ ik } along with sums of squares x_{ ij }^{2}.) These matrices and other similar ones are useful association matrices in statistical applications.
8.6.1.2 Some Immediate Properties of Gramian Matrices

X^{T}X is symmetric.

rank(X^{T}X) = rank(X).

X^{T}X is of full rank if and only if X is of full column rank.

X^{T}X is nonnegative definite.

X^{T}X is positive definite if and only if X is of full column rank.

X^{T}X = 0 ⇔ X = 0.

BX^{T}X = CX^{T}X ⇔ BX^{T} = CX^{T}.

X^{T}XB = X^{T}XC ⇔ XB = XC.

If d is a singular value of X, then c = d^{2} is an eigenvalue of X^{T}X; or, expressed another way, if c is a nonzero eigenvalue of X^{T}X, then there is a singular value d of X such that d^{2} = c.
These properties were shown in Sect. 3.3.10, beginning on page 115, except for the last one, which was shown on page 162.
8.6.1.3 Generalized Inverses of Gramian Matrices
8.6.1.4 Eigenvalues of Gramian Matrices
The nonzero eigenvalues of X^{T}X are the same as the nonzero eigenvalues of XX^{T} (property 14 on page 140).
If the singular value decomposition of X is UDV^{T} (page 161), then the similar canonical factorization of X^{T}X (equation ( 3.252)) is V D^{T}DV^{T}. Hence, we see that the nonzero singular values of X are the square roots of the nonzero eigenvalues of the symmetric matrix X^{T}X. By using DD^{T} similarly, we see that they are also the square roots of the nonzero eigenvalues of XX^{T}.
8.6.2 Projection and Smoothing Matrices
It is often of interest to approximate an arbitrary n-vector by a vector in a given m-dimensional vector space, where m < n. An n × n projection matrix of rank m clearly does this.
8.6.2.1 A Projection Matrix Formed from a Gramian Matrix
In linear models, tr(H) is the model degrees of freedom, and the sum of squares due to the model is just y^{T}Hy.
8.6.2.2 Smoothing Matrices
The hat matrix, either from a full rank X as in equation (8.52) or formed by a generalized inverse as in equation (8.50), smoothes the vector y onto the hyperplane defined by the column space of X. It is therefore a smoothing matrix. (Note that the rank of the column space of X is the same as the rank of X^{T}X.)
A useful variation of the cross products matrix X^{T}X is the matrix formed by adding a nonnegative (positive) definite matrix A to it. Because X^{T}X is nonnegative (positive) definite, X^{T}X + A is nonnegative definite, as we have seen (page 349), and hence X^{T}X + A is a Gramian matrix.
Any matrix such as H_{ λ } that is used to transform the observed vector y onto a given subspace is called a smoothing matrix.
8.6.2.3 Effective Degrees of Freedom
When λ = 0, this is the same as the ordinary model degrees of freedom, and when λ is positive, this quantity is smaller, as we would want it to be by the argument above. The d_{ i }^{2}∕(d_{ i }^{2} + λ) are called shrinkage factors.
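The shrinkage factors can be illustrated directly (plain Python; the singular values below are hypothetical, chosen only for the example). The effective degrees of freedom of the smoother is the sum of the factors d_i^2/(d_i^2 + λ).

```python
# Sketch: effective degrees of freedom of the smoothing matrix H_lambda,
# computed from illustrative singular values d_i of X as
#   sum_i d_i^2 / (d_i^2 + lambda).
def effective_df(singular_values, lam):
    return sum(d * d / (d * d + lam) for d in singular_values)

d = [3.0, 1.0, 0.1]                 # hypothetical singular values of X
assert effective_df(d, 0.0) == 3.0  # lambda = 0: the ordinary model df
assert effective_df(d, 1.0) < 3.0   # positive lambda shrinks the df
# The effective df decreases monotonically as lambda grows
assert effective_df(d, 10.0) < effective_df(d, 1.0)
```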
If X^{T}X is not of full rank, the addition of λI to it also has the effect of yielding a full rank matrix, if λ > 0, and so the inverse of X^{T}X + λI exists even when that of X^{T}X does not. In any event, the addition of λI to X^{T}X yields a matrix with a better condition number, which we discussed in Sect. 6.1. (On page 272, we showed that the condition number of X^{T}X + λI is better than that of X^{T}X.)
8.6.2.4 Residuals from Smoothed Data
The rank of H_{ λ } is the same as the number of columns of X, but the trace, and hence the model degrees of freedom, is less than this number.
8.6.3 Centered Matrices and Variance-Covariance Matrices
In Sect. 2.3, we defined the variance of a vector and the covariance of two vectors. These are the same as the “sample variance” and “sample covariance” in statistical data analysis and are related to the variance and covariance of random variables in probability theory. We now consider the variance-covariance matrix associated with a data matrix. We occasionally refer to the variance-covariance matrix simply as the “variance matrix” or just as the “variance”.
First, we consider centering and scaling data matrices.
8.6.3.1 Centering and Scaling of Data Matrices
When the elements in a vector represent similar measurements or observational data on a given phenomenon, summing or averaging the elements in the vector may yield meaningful statistics. In statistical applications, the columns in a matrix often represent measurements on the same feature or on the same variable over different observational units as in Fig. 8.1, and so the mean of a column may be of interest.
If the unit of a measurement is changed, all elements in a column of the data matrix in which the measurement is used will change. The amount of variation of elements within a column or the relative variation among different columns ideally should not be measured in terms of the basic units of measurement, which can differ irreconcilably from one column to another. (One column could represent scores on an exam and another column could represent weight, for example.)
8.6.3.2 Gramian Matrices Formed from Centered Matrices: Covariance Matrices
This matrix and others formed from it, such as R_{ X } in equation (8.69) below, are called association matrices because they are based on measures of association (covariance or correlation) among the columns of X. We could likewise define a Gramian association matrix based on measures of association among the rows of X.
8.6.3.3 Gramian Matrices Formed from Scaled Centered Matrices: Correlation Matrices
The elements along the diagonal of the correlation matrix are all 1, and the off-diagonal elements are between −1 and 1, each being the correlation between a pair of column vectors, as in equation ( 2.80). The correlation matrix is nonnegative definite because it is a Gramian matrix.
The trace of an n × n correlation matrix is n, and therefore the eigenvalues, which are all nonnegative, sum to n.
Without reference to a data matrix, any nonnegative definite matrix with 1s on the diagonal and with all elements less than or equal to 1 in absolute value is called a correlation matrix.
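A correlation matrix can be computed from a data matrix by centering the columns and scaling by the column norms. The following plain-Python sketch (the helper is ours, and the data are illustrative) exhibits the unit diagonal and symmetry.

```python
import math

def correlation_matrix(X):
    """Correlation matrix of the columns of the n x m data matrix X."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]   # centered
    norms = [math.sqrt(sum(Xc[i][j] ** 2 for i in range(n))) for j in range(m)]
    return [[sum(Xc[i][j] * Xc[i][k] for i in range(n)) / (norms[j] * norms[k])
             for k in range(m)] for j in range(m)]

X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 7.0]]
R = correlation_matrix(X)
assert abs(R[0][0] - 1.0) < 1e-12 and abs(R[1][1] - 1.0) < 1e-12  # unit diagonal
assert R[0][1] == R[1][0]                                          # symmetric
assert -1.0 <= R[0][1] <= 1.0
```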
8.6.4 The Generalized Variance
The diagonal elements of the variance-covariance matrix S associated with the n × m data matrix X are the second moments of the centered columns of X, and the off-diagonal elements are pairwise second central moments of the columns. Each element of the matrix provides some information about the spread of a single column or of two columns of X. The determinant of S provides a single overall measure of the spread of the columns of X. This measure, |S|, is called the generalized variance, or generalized sample variance, to distinguish it from an analogous measure of a distributional model.
Although we considered only the case m = 2, equation (8.70) holds generally, as can be seen by induction on m (see Anderson, 2003).
8.6.4.1 Comparing Variance-Covariance Matrices
Many standard statistical procedures for comparing groups of data rely on the assumption that the population variance-covariance matrices of the groups are all the same. (The simplest example of this is the two-sample t-test, in which the concern is just that the population variances of the two groups be equal.) Occasionally, the data analyst wishes to test this assumption of homogeneity of variances.
On page 175, we considered the problem of comparing two matrices of the same size. There we defined a metric based on a matrix norm. For the problem of comparing variancecovariance matrices, a measure based on the generalized variances is more commonly used.
8.6.5 Similarity Matrices
Covariance and correlation matrices are examples of similarity association matrices: they measure the similarity or closeness of the columns of the matrices from which they are formed.
The cosine of the angle between two vectors is related to the correlation between the vectors, so a matrix of the cosine of the angle between the columns of a given matrix would also be a similarity matrix. (The angle is exactly the same as the correlation if the vectors are centered; see equation ( 2.80).)
Similarity matrices can be formed in many ways, and some are more useful in particular applications than in others. They may not even arise from a standard dataset in the familiar form in statistical applications. For example, we may be interested in comparing text documents. We might form a vector-like object whose elements are the words in the document. The similarity between two such ordered tuples, generally of different lengths, may be a count of the number of words, word pairs, or more general phrases in common between the two documents.
It is generally reasonable that similarity between two objects be symmetric; that is, the first object is as close to the second as the second is to the first. We reserve the term similarity matrix for matrices formed from such measures and, hence, that themselves are symmetric. Occasionally, for example in psychometric applications, the similarities are measured relative to rank order closeness within a set. In such a case, the measure of closeness may not be symmetric. A matrix formed from such measures is called a directed similarity matrix.
8.6.6 Dissimilarity Matrices
The elements of similarity generally increase with increasing “closeness”. We may also be interested in the dissimilarity. Clearly, any decreasing function of similarity could be taken as a reasonable measure of dissimilarity. There are, however, other measures of dissimilarity that often seem more appropriate. In particular, the properties of a metric (see page 32) may suggest that a dissimilarity be defined in terms of a metric.
In considering either similarity or dissimilarity for a data matrix, we could work with either rows or columns, but the common applications make one or the other more natural for the specific application. Because of the types of the common applications, we will discuss dissimilarities and distances in terms of rows instead of columns; thus, in this section, we will consider the dissimilarity of x_{ i∗} and x_{ j∗}, the vectors corresponding to the i^{th} and j^{th} rows of X.
Dissimilarity matrices are also association matrices in the general sense of Sect. 8.1.5.
Dissimilarity or distance can be measured in various ways. A metric is the most obvious measure, although in certain applications other measures are appropriate. The measures may be based on some kind of ranking, for example. If the dissimilarity is based on a metric, the association matrix is often called a distance matrix. In most applications, the Euclidean distance, ∥x_{ i∗}− x_{ j∗}∥_{2}, is the most commonly used metric, but others, especially ones based on L_{1} or L_{ ∞ } norms, are often useful.
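A Euclidean distance matrix is easily formed from the rows of a data matrix. The following plain-Python sketch (illustrative; the points form a 3-4-5 right triangle) shows the hollow, symmetric structure.

```python
import math

def distance_matrix(rows):
    """Euclidean distance matrix D with d_ij = ||x_i - x_j||_2."""
    n = len(rows)
    return [[math.sqrt(sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
             for j in range(n)] for i in range(n)]

X = [[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]]
D = distance_matrix(X)
assert all(D[i][i] == 0.0 for i in range(3))      # hollow
assert D[0][1] == 5.0 and D[1][0] == 5.0          # symmetric; 3-4-5 triangle
assert D[0][2] == 3.0 and D[1][2] == 4.0
```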
Any hollow matrix with nonnegative elements is a directed dissimilarity matrix. A directed dissimilarity matrix is also called a cost matrix. If the matrix is symmetric, it is a dissimilarity matrix.
An n × n matrix D = (d_{ ij }) is an m-dimensional distance matrix if there exist m-vectors x_{1}, …, x_{ n } such that, for some metric \(\Delta\), \(d_{ij} = \Delta (x_{i},x_{j})\). A distance matrix is necessarily a dissimilarity matrix. If the metric is the Euclidean distance, the matrix D is called a Euclidean distance matrix.
The matrix whose rows are the vectors x_{1}^{T}, …, x_{ n }^{T} is called an associated configuration matrix of the given distance matrix. In metric multidimensional scaling, we are given a dissimilarity matrix and seek to find a configuration matrix whose associated distance matrix is closest to the dissimilarity matrix, usually in terms of the Frobenius norm of the difference of the matrices (see Trosset 2002 for basic definitions and extensions).
8.7 Nonnegative and Positive Matrices
A nonnegative matrix, as the name suggests, is a real matrix all of whose elements are nonnegative, and a positive matrix is a real matrix all of whose elements are positive. In some other literature, the latter type of matrix is called strictly positive, and a nonnegative matrix with a positive element is called positive.
Many useful matrices are nonnegative, and we have already considered various kinds of nonnegative matrices. The adjacency or connectivity matrix of a graph is nonnegative. Dissimilarity matrices, including distance matrices, are nonnegative. Matrices used in modeling stochastic processes are nonnegative. While many of these matrices are square, we do not restrict the definitions to square matrices.
Notice that positiveness (nonnegativeness) has nothing to do with positive (nonnegative) definiteness. A positive or nonnegative matrix need not be symmetric or even square, although most such matrices useful in applications are square. A square positive matrix, unlike a positive definite matrix, need not be of full rank.
 1.
If A ≥ 0 and u ≥ v ≥ 0, then Au ≥ Av.
 2.
If A ≥ 0, A ≠ 0, and u > v > 0, then Au > Av.
 3.
If A > 0 and v ≥ 0, then Av ≥ 0.
 4.
If A > 0 and A is square, then ρ(A) > 0.
8.7.1 The Convex Cones of Nonnegative and Positive Matrices
The class of all n × m nonnegative (positive) matrices is a cone because if X is a nonnegative (positive) matrix and a > 0, then aX is a nonnegative (positive) matrix (see page 43). Furthermore, it is a convex cone in \(\mathrm{I\!R}^{n\times m}\), because if X_{1} and X_{2} are n × m nonnegative (positive) matrices and a, b ≥ 0, then aX_{1} + bX_{2} is nonnegative (positive) so long as either a ≠ 0 or b ≠ 0.
Both of these classes are closed under Cayley multiplication, but they may not have inverses in the class (that is, in particular, they are not groups with respect to that operation).
8.7.2 Properties of Square Positive Matrices
We have the following important properties for square positive matrices. These properties collectively are the conclusions of the Perron theorem.
Let A be a square positive matrix and let r = ρ(A). Then:
 1.
r is an eigenvalue of A. The eigenvalue r is called the Perron root. Note that the Perron root is real and positive (although other eigenvalues of A may not be).
 2.
There is an eigenvector v associated with r such that v > 0.
 3.
The Perron root is simple. (That is, the algebraic multiplicity of the Perron root is 1.)
 4.
The dimension of the eigenspace of the Perron root is 1. (That is, the geometric multiplicity of ρ(A) is 1.) Hence, if v is an eigenvector associated with r, it is unique except for scaling. This associated eigenvector is called the Perron vector. Note that the Perron vector is real (although other eigenvectors of A may not be). The elements of the Perron vector all have the same sign, which we usually take to be positive; that is, v > 0.
 5.
If c_{ i } is any other eigenvalue of A, then |c_{ i }| < r. (That is, r is the only eigenvalue on the spectral circle of A.)
We will give proofs only of properties 1 and 2 as examples. Proofs of all of these facts are available in Bapat and Raghavan (1997).
To see properties 1 and 2, first observe that a positive matrix must have at least one nonzero eigenvalue because the coefficients and the constant in the characteristic equation must all be positive. Now scale the matrix so that its spectral radius is 1 (see page 142). So without loss of generality, let A be a scaled positive matrix with ρ(A) = 1. Now let (c, x) be some eigenpair of A such that |c| = 1. First, we want to show, for some such c, that c = ρ(A).
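The Perron root and Perron vector of a positive matrix can be approximated by power iteration (compare Sect. 7.2). The following plain-Python sketch (our own example) uses a 2 × 2 positive matrix whose eigenvalues, 4 and −1, are known in closed form, so the limit can be checked.

```python
# Sketch: power iteration on the positive matrix A = [[1,2],[3,2]]
# (eigenvalues 4 and -1) converges to the Perron root 4 and a positive
# Perron vector proportional to (2, 3).
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[1.0, 2.0], [3.0, 2.0]]
v = [1.0, 1.0]
for _ in range(100):
    w = matvec(A, v)
    r = max(w)                     # scale by the largest element
    v = [x / r for x in w]

assert abs(r - 4.0) < 1e-9         # Perron root rho(A) = 4
assert all(x > 0 for x in v)       # Perron vector is positive
assert abs(v[0] / v[1] - 2.0 / 3.0) < 1e-9   # proportional to (2, 3)
```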
The Perron-Frobenius theorem, which we consider below, extends these results to a special class of square nonnegative matrices. (This class includes all positive matrices, so the Perron-Frobenius theorem is an extension of the Perron theorem.)
8.7.3 Irreducible Square Nonnegative Matrices
We now consider irreducible square nonnegative matrices. This class includes positive matrices, and irreducibility yields some of the properties of positivity to matrices with some zero elements. (Heuristically, irreducibility puts some restrictions on the nonpositive elements of the matrix.) On page 337, we defined reducibility of a nonnegative square matrix and we saw that a matrix is irreducible if and only if its digraph is strongly connected.
We see from the definition in equation (8.76) that a positive matrix is irreducible.
Now, if A is irreducible, we can see that (I + A)^{ n−1} must be a positive matrix either by a strictly linear-algebraic approach or by couching the argument in terms of the digraph \(\mathcal{G}(A)\) formed by the matrix, as in the discussion on page 337 showing that a matrix is irreducible if (and only if) its digraph is strongly connected. We will use the latter approach in the spirit of applications of irreducibility in stochastic processes.
The positivity of (I + A)^{ n−1} for an irreducible nonnegative matrix A is a very useful property because it allows us to extend some conclusions of the Perron theorem to irreducible nonnegative matrices.
8.7.3.1 Properties of Square Irreducible Nonnegative Matrices: The Perron-Frobenius Theorem
If A is a square irreducible nonnegative matrix, then we have the following properties, which are similar to properties 1 through 4 on page 374 for positive matrices. These properties are the conclusions of the Perron-Frobenius theorem.
 1.
ρ(A) is an eigenvalue of A. This eigenvalue is called the Perron root, as before (and, as before, it is real and positive).
 2.
The Perron root ρ(A) is simple. (That is, the algebraic multiplicity of the Perron root is 1.)
 3.
The dimension of the eigenspace of the Perron root is 1. (That is, the geometric multiplicity of ρ(A) is 1.)
 4.
The eigenvector associated with ρ(A) is positive. This eigenvector is called the Perron vector, as before.
The relationship (8.77) allows us to prove properties 1 and 4 in a method similar to the proofs of properties 1 and 2 for positive matrices. (This is Exercise 8.9.) Complete proofs of all of these facts are available in Bapat and Raghavan (1997). See also the solution to Exercise 8.10b on page 611 for a special case.
8.7.3.2 Primitive Matrices
Square irreducible nonnegative matrices that have only one eigenvalue on the spectral circle, that is, that do satisfy property 5 on page 374, also have other interesting properties that are important, for example, in Markov chains. We therefore give a name to matrices with that property:
A square irreducible nonnegative matrix is said to be primitive if it has only one eigenvalue on the spectral circle.
8.7.3.3 Limiting Behavior of Primitive Matrices
In modeling with Markov chains and other applications, the limiting behavior of A^{ k } is an important property.
On page 172, we saw that lim_{ k → ∞ }A^{ k } = 0 if ρ(A) < 1. For a primitive matrix, we also have a limit for A^{ k } if ρ(A) = 1. (As we have done above, we can scale any matrix with a nonzero eigenvalue to a matrix with a spectral radius of 1.)
Applications of the Perron-Frobenius theorem are far-ranging. It has implications for the convergence of some iterative algorithms, such as the power method discussed in Sect. 7.2. The most important applications in statistics are in the analysis of Markov chains, which we discuss in Sect. 9.8.1.
8.7.4 Stochastic Matrices
A stochastic matrix may not be positive, and it may be reducible or irreducible. (Hence, the eigenpair (1, 1), that is, the eigenvalue 1 together with the vector of 1s, may not be the Perron root and Perron vector.)
A permutation matrix is a doubly stochastic matrix; in fact, it is the simplest and one of the most commonly encountered doubly stochastic matrices. A permutation matrix may be either reducible or irreducible; the identity matrix is reducible, for example, while the permutation matrix corresponding to a single n-cycle is irreducible because its digraph is strongly connected.
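Two basic facts about stochastic matrices can be checked directly (plain Python; the matrices are our own illustrative values): the row sums of a stochastic matrix are 1, and the product of two stochastic matrices is again stochastic.

```python
# Sketch: row sums of a (row) stochastic matrix are 1, and the product of
# two stochastic matrices is again stochastic.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

P = [[0.9, 0.1], [0.4, 0.6]]
Q = [[0.5, 0.5], [0.25, 0.75]]
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)

PQ = matmul(P, Q)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in PQ)   # rows still sum to 1
assert all(x >= 0 for row in PQ for x in row)           # entries still nonnegative
```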
8.7.5 Leslie Matrices
A Leslie matrix is clearly reducible. Furthermore, a Leslie matrix has a single unique positive eigenvalue (see Exercise 8.10), which leads to some interesting properties (see Sect. 9.8.2).
8.8 Other Matrices with Special Structures
Matrices of a variety of special forms arise in statistical analyses and other applications. For some matrices with special structure, specialized algorithms can increase the speed of performing a given task considerably. Many tasks involving matrices require a number of computations of the order of n^{3}, where n is the number of rows or columns of the matrix. For some of the matrices discussed in this section, because of their special structure, the order of computations may be n^{2}. The improvement from O(n^{3}) to O(n^{2}) is enough to make some tasks feasible that would otherwise be infeasible because of the time required to complete them. The collection of papers in Olshevsky (2003) describes various specialized algorithms for the kinds of matrices discussed in this section.
8.8.1 Helmert Matrices
A Helmert matrix is a square orthogonal matrix that partitions sums of squares. Its main use in statistics is in defining contrasts in general linear models to compare the second level of a factor with the first level, the third level with the average of the first two, and so on. (There is another meaning of “Helmert matrix” that arises from so-called Helmert transformations used in geodesy.)
The rows of the matrix in equation (8.87) correspond to orthogonal contrasts in the analysis of linear models (see Sect. 9.3.2).
Obviously, the sums of squares are never computed by forming the Helmert matrix explicitly and then computing the quadratic form, but the computations in partitioned Helmert matrices are performed indirectly in analysis of variance, and representation of the computations in terms of the matrix is often useful in the analysis of the computations.
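Although the computations are performed indirectly in practice, it is instructive to verify the partitioning numerically. The following sketch uses Python with NumPy (the book's snippets are in R and Matlab; the helper name helmert and the row-ordering convention here are illustrative assumptions, as conventions for the Helmert matrix vary). It checks that the matrix is orthogonal, that the first component of Hx carries n times the squared mean, and that the remaining components partition the corrected sum of squares:

```python
import numpy as np

def helmert(n):
    """n x n Helmert matrix (one common convention)."""
    H = np.zeros((n, n))
    H[0, :] = 1.0 / np.sqrt(n)            # first row: the overall-mean contrast
    for k in range(1, n):
        # row k compares level k+1 with the average of the first k levels
        H[k, :k] = 1.0 / np.sqrt(k * (k + 1))
        H[k, k] = -k / np.sqrt(k * (k + 1))
    return H

n = 5
H = helmert(n)
assert np.allclose(H @ H.T, np.eye(n))                 # orthogonal
x = np.array([2.0, 3.0, 5.0, 7.0, 11.0])
y = H @ x
assert np.isclose(y[0]**2, n * x.mean()**2)            # n * xbar^2
assert np.isclose((y[1:]**2).sum(), ((x - x.mean())**2).sum())  # corrected SS
```

Because H is orthogonal, the total sum of squares x^T x equals y^T y, and the last two assertions exhibit the partition into the mean square and the corrected sum of squares.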
8.8.2 Vandermonde Matrices
Because of the relationships among the columns of a Vandermonde matrix, computations for polynomial regression analysis can be subject to numerical errors, and so we sometimes make transformations based on orthogonal polynomials. The condition number (see Sect. 6.1, page 266) of a Vandermonde matrix is large. A Vandermonde matrix, however, can be used to form simple orthogonal vectors that correspond to orthogonal polynomials. For example, if the x's are chosen on a grid over [−1, 1], a QR factorization (see Sect. 5.8 on page 248) yields orthogonal vectors that correspond to Legendre polynomials. These vectors are called discrete Legendre polynomials. Although not used as often in regression analysis now, orthogonal vectors are useful in selecting settings in designed experiments.
Vandermonde matrices also arise in the representation or approximation of a probability distribution in terms of its moments.
The determinant of a square Vandermonde matrix has a particularly simple form (see Exercise 8.11).
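The simple form referred to in Exercise 8.11 is the product of all pairwise differences, det(V) = ∏_{i<j}(x_j − x_i). A quick numerical check (a Python/NumPy sketch, not from the text; np.vander with increasing=True gives rows (1, x_i, …, x_i^{n−1})):

```python
import numpy as np
from itertools import combinations

x = np.array([2.0, 3.0, 5.0, 7.0])
n = len(x)
V = np.vander(x, increasing=True)   # row i is (1, x_i, x_i^2, ..., x_i^{n-1})
# Vandermonde determinant: product of all pairwise differences x_j - x_i, i < j
det_formula = np.prod([x[j] - x[i] for i, j in combinations(range(n), 2)])
assert np.isclose(np.linalg.det(V), det_formula)
```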
8.8.3 Hadamard Matrices and Orthogonal Arrays
In a wide range of applications, including experimental design, cryptology, and other areas of combinatorics, we often encounter matrices whose elements are chosen from a set of only a few different elements. In experimental design, the elements may correspond to the levels of the factors; in cryptology, they may represent the letters of an alphabet. In twolevel factorial designs, the entries may be either 0 or 1. Matrices all of whose entries are either 1 or − 1 can represent the same layouts, and such matrices may have interesting mathematical properties.
An n × n matrix with −1, 1 entries whose determinant is n^{n/2} is called a Hadamard matrix. Hadamard’s name is associated with this matrix because of the bound derived by Hadamard for the determinant of any matrix A with |a_{ij}| ≤ 1 for all i, j: |det(A)| ≤ n^{n/2}. A Hadamard matrix achieves this upper bound. A maximal determinant is often used as a criterion for a good experimental design.
It is clear that if H_{n} is a Hadamard matrix then so is H_{n}^{T}, and a Hadamard matrix is a normal matrix (see page 345). Symmetric Hadamard matrices are often of special interest. In a special type of n × n Hadamard matrix, one row and one column consist of all 1s; all n − 1 other rows and columns consist of n/2 1s and n/2 −1s. Such a matrix is called a normalized Hadamard matrix. Most Hadamard matrices occurring in statistical applications are normalized, and also most have all 1s on the diagonal.
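One way to construct Hadamard matrices of order a power of 2 is Sylvester's recursive doubling, which is not described in the text but is a standard construction; the sketch below (Python/NumPy, illustrative function name) verifies that the result attains Hadamard's determinant bound and is normalized:

```python
import numpy as np

def hadamard_sylvester(n):
    """Sylvester's construction; n must be a power of 2."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n = 8
H = hadamard_sylvester(n)
assert np.allclose(H @ H.T, n * np.eye(n))            # rows are orthogonal
assert np.isclose(abs(np.linalg.det(H)), n**(n / 2))  # attains |det| = n^{n/2}
assert (H[0] == 1).all() and (H[:, 0] == 1).all()     # normalized form
```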
A somewhat more general type of matrix corresponds to an n × m array with the elements in the j^{th} column being members of a set of k_{j} elements and such that, for some fixed p ≤ m, in every n × p submatrix all possible combinations of the elements of the corresponding p sets occur equally often as a row. Such an array is called an orthogonal array. (I make a distinction between the matrix and the array because often in applications the elements in the array are treated merely as symbols without the assumptions of an algebra of a field. A terminology for orthogonal arrays has evolved that is different from the terminology for matrices; for example, a symmetric orthogonal array is one in which k_{1} = ⋯ = k_{m}. On the other hand, treating the orthogonal arrays as matrices with real elements may provide solutions to combinatorial problems such as may arise in optimal design.)
The 4 × 4 Hadamard matrix shown in Fig. 8.7 is a symmetric orthogonal array with k_{1} = ⋯ = k_{4} = 2 and p = 2, so in any n × 2 submatrix each of the possible combinations of elements occurs exactly once. This array is a member of a simple class of symmetric orthogonal arrays that has the property that in any two rows each ordered pair of elements occurs exactly once.
Orthogonal arrays are particularly useful in developing fractional factorial plans. (The robust designs of Taguchi correspond to orthogonal arrays.) Dey and Mukerjee (1999) discuss orthogonal arrays with an emphasis on the applications in experimental design, and Hedayat et al. (1999) provide an extensive discussion of the properties of orthogonal arrays.
8.8.4 Toeplitz Matrices
Banded Toeplitz matrices arise frequently in time series studies. The covariance matrix in an ARMA(p, q) process, for example, is a symmetric Toeplitz matrix with 2 max(p, q) nonzero off-diagonal bands. See page 451 for an example and further discussion.
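As a concrete instance of the banded case, consider an MA(1) process, that is, ARMA(0, 1), whose covariance matrix has 2 max(p, q) = 2 nonzero off-diagonal bands. The sketch below (Python/NumPy, with illustrative parameter values; the book's examples are in R and Matlab) builds the matrix from the autocovariances and checks the Toeplitz, banded, and positive definite properties:

```python
import numpy as np

# Autocovariances of an MA(1) process x_t = e_t + theta * e_{t-1}, Var(e_t) = sigma2
theta, sigma2, n = 0.6, 2.0, 6
gamma = np.zeros(n)
gamma[0] = sigma2 * (1 + theta**2)   # variance
gamma[1] = sigma2 * theta            # lag-1 autocovariance; higher lags are 0
i, j = np.indices((n, n))
Sigma = gamma[np.abs(i - j)]         # symmetric banded Toeplitz covariance matrix
assert np.allclose(Sigma, Sigma.T)
assert np.allclose(Sigma[:-1, :-1], Sigma[1:, 1:])     # constant diagonals
assert np.count_nonzero(Sigma) == n + 2 * (n - 1)      # one band each side
assert np.linalg.eigvalsh(Sigma).min() > 0             # positive definite
```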
8.8.4.1 Inverses of Certain Toeplitz Matrices and Other Banded Matrices
Type 2 matrices also occur as the inverses of other matrices with special patterns that arise in common statistical applications (see Graybill 1983, for examples).
The inverses of all banded invertible matrices have some off-diagonal submatrices that are zero or have low rank, depending on the bandwidth of the original matrix (see Strang and Nguyen 2004, for further discussion and examples).
8.8.5 Circulant Matrices

If A is a circulant matrix, then
- A^{T} is circulant;
- A^{2} is circulant;
- if A is nonsingular, then A^{−1} is circulant.
You are asked to prove these simple results in Exercise 8.13.
Any linear combination of two circulant matrices of the same order is circulant (that is, they form a vector space, see Exercise 8.14).
If A and B are circulant matrices of the same order, then AB is circulant (Exercise 8.15).
Another important property of a circulant matrix is that it is normal, as we can see by writing the (i, j) element of AA^{T} and of A^{T}A as a sum of products of elements of A (Exercise 8.16). This has an important implication: a circulant matrix is unitarily diagonalizable.
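The closure and normality properties above are easy to check numerically; a Python/NumPy sketch (not from the text; the helper names and the particular first row are illustrative, chosen so that the matrix is nonsingular):

```python
import numpy as np

def circulant(c):
    """Circulant matrix with first row c; row k is c rotated k places."""
    return np.array([np.roll(c, k) for k in range(len(c))])

def is_circulant(A):
    return all(np.allclose(A[k], np.roll(A[0], k)) for k in range(len(A)))

A = circulant(np.array([4.0, 1.0, 0.0, 2.0]))   # nonsingular for these values
assert is_circulant(A.T)                 # transpose is circulant
assert is_circulant(A @ A)               # square is circulant
assert is_circulant(np.linalg.inv(A))    # inverse is circulant
assert np.allclose(A @ A.T, A.T @ A)     # circulant matrices are normal
```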
8.8.6 Fourier Matrices and the Discrete Fourier Transform
Notice that the Fourier matrix is symmetric, and any entry raised to the n^{th} power is 1. Although the Fourier matrix is symmetric, its eigenvalues are not necessarily real, because it itself is not a real matrix.
The Fourier matrix has many useful properties, such as being unitary; that is, its rows and columns are orthonormal: f_{∗j }^{H}f_{∗k } = f_{ j∗}^{H}f_{ k∗} = 0 for j ≠ k, and f_{∗j }^{H}f_{∗j } = 1 (Exercise 8.17).
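These properties can be checked directly; the Python/NumPy sketch below is illustrative (sign and normalization conventions for the Fourier matrix vary among authors, so this may differ from equation (8.97) in the pattern of signs). The unnormalized matrix W has entries that are nth roots of unity; dividing by √n makes it unitary:

```python
import numpy as np

n = 8
r, c = np.indices((n, n))
omega = np.exp(-2j * np.pi / n)   # a primitive nth root of unity
W = omega ** (r * c)              # unnormalized Fourier (DFT) matrix
F = W / np.sqrt(n)                # scaled so that F is unitary
assert np.allclose(F, F.T)                       # symmetric
assert np.allclose(F.conj().T @ F, np.eye(n))    # unitary: F^H F = I
assert np.allclose(W ** n, np.ones((n, n)))      # entries are nth roots of unity
```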
8.8.6.1 Fourier Matrices and Elementary Circulant Matrices
We see that the eigenvalues of \(E_{(\pi _{\mathrm{c}})}\) are what we would expect if we continued the development from page 137, where we determined the eigenvalues of an elementary permutation matrix. (An elementary permutation matrix of order 2 has the two eigenvalues 1 and −1, the two square roots of 1.)
Notice that the modulus of all eigenvalues of \(E_{(\pi _{\mathrm{c}})}\) is the same, 1. Hence, all eigenvalues of \(E_{(\pi _{\mathrm{c}})}\) lie on its spectral circle.
8.8.6.2 The Discrete Fourier Transform
Fourier transforms are invertible transformations of functions that often allow operations on those functions to be performed more easily. The Fourier transform shown in equation (8.98) may allow for simpler expressions of convolutions, for example. An nvector is equivalent to a function whose domain is {1, …, n}, and Fourier transforms of vectors are useful in various operations on the vectors. More importantly, if the vector represents observational data, certain properties of the data may become immediately apparent in the Fourier transform of the data vector.
Our purpose here is just to indicate the relation of Fourier transforms to matrices, and to suggest how this might be useful in applications. There is a wealth of literature on Fourier transforms and their applications, but we will not pursue those topics here.
As is often the case when writing expressions involving matrices, we must emphasize that the form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different. This is particularly true in the case of the discrete Fourier transform. There would never be a reason to form the Fourier matrix for any computations; more importantly, rather than evaluating the elements of the Fourier transform \(\mathcal{F}x\) using the right side of equation (8.100), we take advantage of properties of the powers of roots of unity to arrive at a faster method of computing them. The method is so much faster, in fact, that it is called the “fast” Fourier transform, or FFT. The FFT is one of the most important computational methods in all of applied mathematics. I will not discuss it further here, however. Descriptions of it can be found throughout the literature on computational science (including other books that I have written).
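The relation between the matrix form and the fast algorithm can be seen numerically: multiplying by the unnormalized Fourier matrix costs O(n^2) operations, while the FFT computes the same transform in O(n log n). A Python/NumPy sketch (illustrative; np.fft.fft uses the sign convention exp(−2πi jk/n) with no normalization, which may differ from equation (8.100)):

```python
import numpy as np

n = 16
r, c = np.indices((n, n))
W = np.exp(-2j * np.pi * r * c / n)   # DFT matrix in NumPy's sign convention
x = np.arange(n, dtype=float)
# the O(n^2) matrix product equals the O(n log n) FFT of the same vector
assert np.allclose(W @ x, np.fft.fft(x))
```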
An Aside: Complex Matrices
The Fourier matrix, I believe, is the only matrix I discuss in this book that has complex entries. The purpose is just to relate the discrete Fourier transform to matrix multiplication. We have already indicated various differences in operations on vectors and matrices over the complex plane. We often form the conjugate of a complex number, which we denote by an overbar: \(\bar{z}\). The conjugate of an object such as a vector or a matrix is the result of taking the conjugate of each element individually. Instead of a transpose, we usually work with a conjugate transpose, \(A^{\mathrm{H}} = \overline{A} ^{\mathrm{T}}\). (See the discussion on pages 33 and 60.) We define the inner product of vectors x and y as \(\langle x,y\rangle =\bar{ x}^{\mathrm{T}}y\); instead of symmetric matrices, we focus on Hermitian matrices, that is, ones such that the conjugate transpose is the same as the original matrix; and instead of orthogonal matrices, we focus on unitary matrices, that is, ones whose product with its conjugate transpose is the identity. All of the general results that we have stated for real matrices of course hold for complex matrices. Some of the analogous operations, however, have different properties. For example, the property of a matrix being unitarily diagonalizable is an analogue of the property of being orthogonally diagonalizable, but, as we have seen, a wider class of matrices (normal matrices) are unitarily diagonalizable than are orthogonally diagonalizable (symmetric matrices).
Most scientific software systems support computations with complex numbers. Both R and Matlab, for example, provide full capabilities for working with complex numbers. The imaginary unit in both is denoted by “i”, which of course must be distinguished from “i” denoting a variable. In R, the function complex can be used to initialize a complex number; for example,
   z <- complex(real=3, imaginary=2)
assigns the value 3 + 2i to the variable z. (Another simple way of doing this is to juxtapose a numeric literal in front of the symbol i; in R, for example, z <- 3+2i assigns the same value to z. I do not recommend the latter construct because of possible confusion with other expressions.)
The j^{th} row of an n^{th} order Fourier matrix can be generated by the R expression
As mentioned above, there would almost never be a reason to form the Fourier matrix for any computations. It is instructive, however, to note its form and to do some simple manipulations with the Fourier matrix in R or some other software system. The Matlab function dftmtx generates a Fourier matrix (with a slightly different definition, resulting in a different pattern of positives and negatives than what I have shown above).
Some additional R code for manipulating complex matrices is given in the hints for Exercise 8.18 on page 611. (Note that there I do form a Fourier matrix and use it in multiplications; but it is just for illustration.)
8.8.7 Hankel Matrices
As with a Toeplitz matrix, a Hankel matrix is characterized by a “diagonal” element and two vectors, and it can be generalized to n × m matrices based on vectors of different orders. As in expression (8.101) above, an n × n Hankel matrix can be defined by a scalar d and two (n − 1)-vectors, u and l. There are other, perhaps more common, ways of putting the elements of a Hankel matrix into two vectors. In Matlab, the function hankel produces a Hankel matrix, but the elements are specified in a different way from that in expression (8.101).
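One common parametrization packs the anti-diagonal values into a single vector; the sketch below (Python/NumPy, illustrative helper name; this parametrization is one of the "other ways" mentioned above, not expression (8.101)) also checks the close kinship with Toeplitz matrices, since reversing the rows of a Hankel matrix yields a Toeplitz matrix:

```python
import numpy as np

def hankel(h):
    """n x n Hankel matrix: H[i, j] = h[i + j], constant anti-diagonals."""
    h = np.asarray(h)
    n = (len(h) + 1) // 2
    i, j = np.indices((n, n))
    return h[i + j]

h = np.arange(1.0, 8.0)   # 7 values determine a 4 x 4 Hankel matrix
H = hankel(h)
# each anti-diagonal is constant
assert H[0, 3] == H[1, 2] == H[2, 1] == H[3, 0] == h[3]
T = H[::-1]               # reversing the rows gives a Toeplitz matrix
assert np.allclose(T[:-1, :-1], T[1:, 1:])
```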
8.8.8 Cauchy Matrices
Another type of special n × m matrix whose elements are determined by a few n-vectors and m-vectors is a Cauchy-type matrix. The standard Cauchy matrix is built from two vectors, x and y. The more general form defined below uses two additional vectors.
Cauchy-type matrices often arise in the numerical solution of partial differential equations (PDEs). For Cauchy matrices, the order of the number of computations for factorization or for solutions of linear systems can be reduced from a power of three to a power of two. This is a very significant improvement for large matrices. In the PDE applications, the matrices are generally not large, but nevertheless, even in those applications, it is worthwhile to use algorithms that take advantage of the special structure. Fasino and Gemignani (2003) describe such an algorithm.
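A useful classical fact about the standard Cauchy matrix is its closed-form determinant. The sketch below (Python/NumPy, illustrative; one standard form has elements a_ij = 1/(x_i + y_j), though the sign convention for y varies among authors) checks Cauchy's determinant formula:

```python
import numpy as np
from itertools import combinations

x = np.array([1.0, 2.0, 4.0])
y = np.array([0.5, 3.0, 5.0])
n = len(x)
A = 1.0 / (x[:, None] + y[None, :])   # a_ij = 1/(x_i + y_j)
# Cauchy's determinant formula:
#   det A = prod_{i<j} (x_j - x_i)(y_j - y_i) / prod_{i,j} (x_i + y_j)
num = np.prod([(x[j] - x[i]) * (y[j] - y[i])
               for i, j in combinations(range(n), 2)])
den = np.prod([xi + yj for xi in x for yj in y])
assert np.isclose(np.linalg.det(A), num / den)
```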
8.8.9 Matrices Useful in Graph Theory
Many problems in statistics and applied mathematics can be posed as graphs, and various methods of graph theory can be used in their solution.
Graph theory is particularly useful in cluster analysis or classification. These involve the analysis of relationships of objects for the purpose of identifying similar groups of objects. The objects are associated with vertices of the graph, and an edge is generated if the relationship (measured somehow) between two objects is sufficiently great. For example, suppose the question of interest is the authorship of some text documents. Each document is a vertex, and an edge between two vertices exists if there are enough words in common between the two documents. A similar application could be the determination of which computer user is associated with a given computer session. The vertices would correspond to login sessions, and the edges would be established based on the commonality of programs invoked or files accessed. In applications such as these, there would typically be a training dataset consisting of text documents with known authors or consisting of session logs with known users. In both of these types of applications, decisions would have to be made about the extent of commonality of words, phrases, programs invoked, or files accessed in order to establish an edge between two documents or sessions.
Unfortunately, as is often the case for an area of mathematics or statistics that developed from applications in diverse areas or through the efforts of applied mathematicians somewhat outside of the mainstream of mathematics, there are major inconsistencies in the notation and terminology employed in graph theory. Thus, we often find different terms for the same object; for example, adjacency matrix and connectivity matrix. This unpleasant situation, however, is not so disagreeable as a one-to-many inconsistency, such as the designation of the eigenvalues of a graph to be the eigenvalues of one type of matrix in some of the literature and the eigenvalues of different types of matrices in other literature.
Refer to Sect. 8.1.2 beginning on page 331 for terms and notation that we will use in the following discussion.
8.8.9.1 Adjacency Matrix: Connectivity Matrix
We discussed adjacency or connectivity matrices on page 334. A matrix, such as an adjacency matrix, that consists of only 1s and 0s is called a Boolean matrix.
Two vertices that are not connected and hence correspond to a 0 in a connectivity matrix are said to be independent.
If no edges connect a vertex with itself, the adjacency matrix is a hollow matrix.
Because the 1s in a connectivity matrix indicate a strong association, and we would naturally think of a vertex as having a strong association with itself, we sometimes modify the connectivity matrix so as to have 1s along the diagonal. Such a matrix is sometimes called an augmented connectivity matrix or augmented adjacency matrix.
The eigenvalues of the adjacency matrix reveal some interesting properties of the graph and are sometimes called the eigenvalues of the graph. The eigenvalues of another matrix, which we discuss below, are more useful, however, and we will refer to those eigenvalues as the eigenvalues of the graph.
8.8.9.2 Digraphs
The digraph represented in Fig. 8.4 on page 335 is a network with five vertices, perhaps representing cities, and directed edges between some of the vertices. The edges could represent airline connections between the cities; for example, there are flights from x to u and from u to x, and from y to z, but not from z to y.
In a digraph, the relationships are directional. (An example of a directional relationship that might be of interest is when each observational unit has a different number of measured features, and a relationship exists from v_{ i } to v_{ j } if a majority of the features of v_{ i } are identical to measured features of v_{ j }.)
8.8.9.3 Use of the Connectivity Matrix
The analysis of a network may begin by identifying which vertices are connected with others; that is, by construction of the connectivity matrix.
This property extends to multiple connections. If A is the adjacency matrix of a graph, then (A^{k})_{ij} is the number of paths of length k between nodes i and j in that graph. (See Exercise 8.20.) Important areas of application of this fact are DNA sequence comparisons and the measurement of the “centrality” of a node within a complex social network.
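The path-counting property is easy to illustrate on a small graph; a Python/NumPy sketch (not from the text; note that the counts include walks that revisit vertices):

```python
import numpy as np

# undirected path graph on four vertices: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
A2, A3 = A @ A, A @ A @ A
assert A2[0, 2] == 1   # one path of length 2 from 0 to 2 (namely 0-1-2)
assert A2[1, 1] == 2   # two closed walks of length 2 at vertex 1
assert A3[0, 1] == 2   # the length-3 walks 0-1-0-1 and 0-1-2-1
```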
8.8.9.4 The Laplacian Matrix of a Graph
Spectral graph theory is concerned with the analysis of the eigenvalues of a graph. As mentioned above, there are two different definitions of the eigenvalues of a graph. The more useful definition, and the one we use here, takes the eigenvalues of a graph to be the eigenvalues of a matrix, called the Laplacian matrix, formed from the adjacency matrix and a diagonal matrix consisting of the degrees of the vertices.
So long as \(d(\mathcal{G})> 0\), \(L(\mathcal{G}) = D(\mathcal{G})^{-\frac{1} {2} }L_{a}(\mathcal{G})D(\mathcal{G})^{-\frac{1} {2} }\). Unless the graph is regular, the matrix \(L_{b}(\mathcal{G})\) is not symmetric. Note that if \(\mathcal{G}\) is k-regular, \(L(\mathcal{G}) = I - C(\mathcal{G})/k\), and \(L_{b}(\mathcal{G}) = L(\mathcal{G})\).
For a digraph, the degrees are replaced by either the indegrees or the outdegrees. (Some authors define it one way and others the other way. The essential properties hold either way.)
The eigenvalues of a graph are the basic objects in spectral graph theory. They provide information about the properties of networks and other systems modeled by graphs. We will not explore them further here, and the interested reader is referred to Bollobás (2013) or other general texts on the subject.
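As a small illustration of why these eigenvalues are informative, the sketch below (Python/NumPy, not from the text; it uses the unnormalized Laplacian D − A, one of the definitions in use) verifies two standard facts: the Laplacian annihilates the constant vector, and the multiplicity of the zero eigenvalue equals the number of connected components:

```python
import numpy as np

# a graph with two components: a triangle on {0,1,2} and the edge {3,4}
A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    A[u, v] = A[v, u] = 1.0
D = np.diag(A.sum(axis=1))     # diagonal matrix of vertex degrees
L = D - A                      # the (unnormalized) Laplacian
assert np.allclose(L @ np.ones(5), 0)    # rows sum to zero
eigvals = np.linalg.eigvalsh(L)
assert np.sum(np.isclose(eigvals, 0.0)) == 2   # two connected components
assert (eigvals > -1e-10).all()          # L is nonnegative definite
```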
8.8.10 Z-Matrices and M-Matrices
In certain applications in physics and in the solution of systems of nonlinear differential equations, a class of matrices called M-matrices is important.
The matrices in these applications have nonpositive off-diagonal elements. A square matrix all of whose off-diagonal elements are nonpositive is called a Z-matrix.
A Z-matrix that is positive stable (see page 159) is called an M-matrix. A real symmetric M-matrix is positive definite.

If A is an M-matrix, then
- all principal minors of A are positive;
- all diagonal elements of A are positive;
- all diagonal elements of L and U in the LU decomposition of A are positive;
- for some i, ∑_{ j }a_{ ij } ≥ 0; and
- A is nonsingular and A^{−1} ≥ 0.
Proofs of some of these facts can be found in Horn and Johnson (1991).
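Several of these facts can be checked numerically on a small example; a Python/NumPy sketch (illustrative; the tridiagonal matrix below is a standard example of a symmetric M-matrix):

```python
import numpy as np

# a Z-matrix: all off-diagonal elements are nonpositive
A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
# positive stable (all eigenvalues have positive real part), hence an M-matrix
assert (np.linalg.eigvals(A).real > 0).all()
# leading principal minors are positive
assert all(np.linalg.det(A[:k, :k]) > 0 for k in range(1, 4))
# nonsingular, with elementwise nonnegative inverse
assert (np.linalg.inv(A) >= 0).all()
```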
Exercises
 8.1.Normal matrices.
 a)
Show that a skew symmetric matrix is normal.
 b)
Show that a skew Hermitian matrix is normal.
 8.2.Ordering of nonnegative definite matrices.
 8.3.
 a)
Show that a (strictly) diagonally dominant symmetric matrix is positive definite.
 b) Show that if the real n × n symmetric matrix A is such that
$$\displaystyle{a_{ii} \geq \sum _{j\neq i}^{n}\vert a_{ ij}\vert \quad \mbox{ for each}\;i = 1,\ldots,n,}$$
then A ⪰ 0.
 8.4.
Show that the number of positive eigenvalues of an idempotent matrix is the rank of the matrix.
 8.5.
Show that two idempotent matrices of the same rank are similar.
 8.6.
 8.7.Projections.
 8.8.
Correlation matrices. A correlation matrix can be defined in terms of a Gramian matrix formed by a centered and scaled matrix, as in equation (8.69). Sometimes in the development of statistical theory, we are interested in the properties of correlation matrices with given eigenvalues or with given ratios of the largest eigenvalue to other eigenvalues.
Write a program to generate n × n random correlation matrices R with specified eigenvalues, c_{1}, …, c_{ n }. The only requirements on R are that its diagonals be 1, that it be symmetric, and that its eigenvalues all be positive and sum to n. Use the following method due to Davies and Higham (2000) that uses random orthogonal matrices with the Haar uniform distribution generated using the method described in Exercise 4.10.
 0. Generate a random orthogonal matrix Q; set k = 0, and form$$\displaystyle{R^{(0)} = Q\,\mathrm{diag}((c_{1},\ldots,c_{ n}))Q^{\mathrm{T}}.}$$
 1. If r_{ ii }^{(k)} = 1 for all i in {1, …, n}, go to step 3.
 2. Otherwise, choose p and q with p < q, such that r_{ pp }^{(k)} < 1 < r_{ qq }^{(k)} or r_{ pp }^{(k)} > 1 > r_{ qq }^{(k)}, and form G^{(k)} as in equation (5.13), where c and s are as in equations (5.17) and (5.18), with a = 1. Form R^{(k+1)} = (G^{(k)})^{T}R^{(k)}G^{(k)}. Set k = k + 1, and go to step 1.
 3. Deliver R = R^{(k)}.
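The steps above can be sketched as follows (Python/NumPy; an illustrative implementation under my own conventions rather than the book's equations (5.13), (5.17), and (5.18): the Givens angle is obtained here by solving the quadratic that makes the new r_pp equal 1, and the QR-based Haar sampler is a standard device):

```python
import numpy as np

def random_correlation(c, seed=None):
    """Random correlation matrix with eigenvalues c (Davies-Higham sketch)."""
    rng = np.random.default_rng(seed)
    c = np.asarray(c, dtype=float)
    n = c.size
    assert c.min() > 0 and np.isclose(c.sum(), n)
    # step 0: Haar-distributed orthogonal Q via QR of a Gaussian matrix
    Z = rng.standard_normal((n, n))
    Q, Rq = np.linalg.qr(Z)
    Q = Q * np.sign(np.diag(Rq))
    R = Q @ np.diag(c) @ Q.T
    for _ in range(5 * n):
        d = np.diag(R)
        if np.allclose(d, 1.0):               # step 1: diagonal is all 1s
            break
        p, q = np.argmin(d), np.argmax(d)     # then r_pp < 1 < r_qq
        # step 2: Givens rotation in the (p,q) plane making the new r_pp = 1;
        # with s = t*c, t solves (r_qq - 1) t^2 + 2 r_pq t + (r_pp - 1) = 0
        a, b = R[q, q] - 1.0, R[p, q]
        disc = np.sqrt(b * b - a * (R[p, p] - 1.0))
        t = min((-b + disc) / a, (-b - disc) / a, key=abs)
        cth = 1.0 / np.sqrt(1.0 + t * t)
        sth = t * cth
        G = np.eye(n)
        G[p, p] = G[q, q] = cth
        G[q, p], G[p, q] = sth, -sth
        R = G.T @ R @ G                       # eigenvalues are preserved
        R[p, p] = 1.0                         # clean up roundoff
    return R                                   # step 3

R = random_correlation([2.5, 1.0, 0.3, 0.2], seed=42)
assert np.allclose(np.diag(R), 1.0)
assert np.allclose(np.sort(np.linalg.eigvalsh(R)), [0.2, 0.3, 1.0, 2.5])
```

Because each rotation is an orthogonal similarity, the spectrum is unchanged while the diagonal entries are driven to 1 one at a time.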
 8.9.
 8.10.Leslie matrices.
 a)
Write the characteristic polynomial of the Leslie matrix, equation (8.85).
 b)
Show that the Leslie matrix has a single, unique positive eigenvalue.
 8.11.
Write out the determinant for an n × n Vandermonde matrix.
Hint: The determinant of an n × n Vandermonde matrix as in equation (8.88) is (x_{ n } − x_{1})(x_{ n } − x_{2})⋯(x_{ n } − x_{ n−1}) times the determinant of the (n − 1) × (n − 1) Vandermonde matrix formed by removing the last row and column. Show this by multiplying the original Vandermonde matrix by B = I + D, where D is the matrix with 0s in all positions except for the first supradiagonal, which consists of −x_{ n }, replicated n − 1 times. Clearly, the determinant of B is 1.
 8.12.Consider the 3 × 3 symmetric Toeplitz matrix with elements 1, b, and c; that is, the matrix that looks like this:$$\displaystyle{\left [\begin{array}{ccc} 1& b&c\\ b &1 & b \\ c&b&1\\ \end{array} \right ].}$$
 a)
Invert this matrix. See page 385.
 b)
Determine conditions for which the matrix would be singular.
 8.13.If A is circulant, show that
- A^{T} is circulant;
- A^{2} is circulant;
- if A is nonsingular, then A^{−1} is circulant.
Hint: Use equation (8.95).
 8.14.
Show that the set of all n × n circulant matrices is a vector space along with the axpy operation. (Just show that it is closed with respect to that operation.)
 8.15.
If A and B are circulant matrices of the same order, show AB is circulant.
 8.16.
Show that a circulant matrix is normal.
 8.17.
Show that a Fourier matrix, as in equation (8.97), is unitary by showing f_{∗j }^{H}f_{∗k } = f_{ j∗}^{H}f_{ k∗} = 0 for j ≠ k, and f_{∗j }^{H}f_{∗j } = 1.
 8.18.
Show that equation (8.99) is correct by performing the multiplications on the right side of the equation.
 8.19.
Write out the determinant for the n × n skew upper triangular Hankel matrix in (8.102).
 8.20.Graphs. Let A be the adjacency matrix of an undirected graph.
 a)
Show that (A^{2})_{ ij } is the number of paths of length 2 between nodes i and j.
Hint: Construct a general diagram similar to Fig. 8.2 on page 332, and count the paths between two arbitrary nodes.
 b)
Show that (A^{k})_{ ij } is the number of paths of length k between nodes i and j.
Hint: Use Exercise 8.20a and mathematical induction on k.
References
 Bapat, R. B., and T. E. S. Raghavan. 1997. Nonnegative Matrices and Applications. Cambridge, United Kingdom: Cambridge University Press.
 Bollobás, Béla. 2013. Modern Graph Theory. New York: Springer-Verlag.
 Chu, Moody T. 1991. Least squares approximation by real normal matrices with specified spectrum. SIAM Journal on Matrix Analysis and Applications 12:115–127.
 Davies, Philip I., and Nicholas J. Higham. 2000. Numerically stable generation of correlation matrices and their factors. BIT 40:640–651.
 Dey, Aloke, and Rahul Mukerjee. 1999. Fractional Factorial Plans. New York: John Wiley and Sons.
 Fasino, Dario, and Luca Gemignani. 2003. A Lanczos-type algorithm for the QR factorization of Cauchy-like matrices. In Fast Algorithms for Structured Matrices: Theory and Applications, ed. Vadim Olshevsky, 91–104. Providence, Rhode Island: American Mathematical Society.
 Gentle, James E. 2003. Random Number Generation and Monte Carlo Methods, 2nd ed. New York: Springer-Verlag.
 Graybill, Franklin A. 1983. Introduction to Matrices with Applications in Statistics, 2nd ed. Belmont, California: Wadsworth Publishing Company.
 Hedayat, A. S., N. J. A. Sloane, and John Stufken. 1999. Orthogonal Arrays: Theory and Applications. New York: Springer-Verlag.
 Hoffman, A. J., and H. W. Wielandt. 1953. The variation of the spectrum of a normal matrix. Duke Mathematical Journal 20:37–39.
 Horn, Roger A., and Charles R. Johnson. 1991. Topics in Matrix Analysis. Cambridge, United Kingdom: Cambridge University Press.
 Liu, Shuangzhe, and Heinz Neudecker. 1996. Several matrix Kantorovich-type inequalities. Journal of Mathematical Analysis and Applications 197:23–26.
 Marshall, A. W., and I. Olkin. 1990. Matrix versions of the Cauchy and Kantorovich inequalities. Aequationes Mathematicae 40:89–93.
 Mosteller, Frederick, and David L. Wallace. 1963. Inference in an authorship problem. Journal of the American Statistical Association 58:275–309.
 Olshevsky, Vadim (ed.). 2003. Fast Algorithms for Structured Matrices: Theory and Applications. Providence, Rhode Island: American Mathematical Society.
 Strang, Gilbert, and Tri Nguyen. 2004. The interplay of ranks of submatrices. SIAM Review 46:637–646.
 Trefethen, Lloyd N., and Mark Embree. 2005. Spectra and Pseudospectra: The Behavior of Nonnormal Matrices and Operators. Princeton: Princeton University Press.
 Trosset, Michael W. 2002. Extensions of classical multidimensional scaling via variable reduction. Computational Statistics 17:147–163.
 Vandenberghe, Lieven, and Stephen Boyd. 1996. Semidefinite programming. SIAM Review 38:49–95.
 Wilkinson, J. H. 1965. The Algebraic Eigenvalue Problem. New York: Oxford University Press.