
Part of the book series: Mathematics for Industry (MFI, volume 22)


Abstract

The normalized least-mean-squares (NLMS) algorithm suffers from slow convergence for correlated input signals. The reason for this phenomenon is explained by looking at the algorithm from a geometrical point of view. This observation motivates the affine projection algorithm (APA) as a natural generalization of the NLMS algorithm. The APA exploits the most recent multiple regressors, whereas the NLMS algorithm uses only the current, single regressor. In the APA, the current coefficient vector is orthogonally projected onto the affine subspace defined by the regressors to update the coefficient vector. By increasing the number of regressors, called the projection order, the convergence rate of the APA is improved, especially for correlated input signals. The role of the step-size is made clear. Investigations from the affine projection point of view give us deep insight into the properties of the APA. We also see that alternative approaches are possible for deriving the update equation of the APA. To stabilize the numerical inversion of a matrix in the update equation, a regularization term is often added. This variant of the APA is called the regularized APA (R-APA), whereas the original APA is called the basic APA (B-APA). This chapter also explains that the B-APA with unity step-size has a decorrelating property, and that there are formal similarities between the recursive least-squares (RLS) algorithm and the R-APA.
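As a rough illustration of the update described above, the following NumPy sketch shows a single coefficient update of the regularized APA; the function and variable names are illustrative and do not follow the chapter's notation. With projection order \(p= 1\) and \(\delta = 0\) the update reduces to the NLMS update, and \(\delta = 0\) in general corresponds to the B-APA.

```python
import numpy as np

def apa_update(w, X, y, mu=1.0, delta=0.0):
    """One update of the (regularized) affine projection algorithm (illustrative sketch).

    w     : current coefficient vector, shape (n,)
    X     : the p most recent regressors stacked as rows, shape (p, n); p is the projection order
    y     : corresponding desired outputs, shape (p,)
    mu    : step-size
    delta : regularization constant (delta = 0 corresponds to the B-APA, delta > 0 to the R-APA)
    """
    e = y - X @ w                                # a priori errors for the p regressors
    G = X @ X.T + delta * np.eye(X.shape[0])     # p x p matrix to be inverted (regularized)
    return w + mu * X.T @ np.linalg.solve(G, e)  # correction along the subspace spanned by the regressors
```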


Notes

1. Numerical inversion of a matrix A is based on solving a linear equation of the type \(Ax= b\). If \(\mathrm{cond}\,(A)\) is large, a small error in A or b can result in a large error in the solution.

2. There are some variants of RLS. The present type is called the “prewindowed RLS” algorithm [9].

3. This is an abbreviation for the phrase “if and only if,” which is used very often in mathematics.

4. We sometimes use a formulation in which the sum of \((x(j) - \hat{x}(j))^{2}\) is minimized. In that formulation, the sign of \(f_{i}\) is reversed.

References

1. Ozeki, K., Umeda, T.: An adaptive filtering algorithm using an orthogonal projection to an affine subspace and its properties. IEICE Trans. J67-A(2), 126–132 (1984). (Also in Electron. Commun. Jpn. 67-A(5), 19–27 (1984))

2. Haykin, S.: Adaptive Filter Theory. Prentice-Hall, Upper Saddle River (2002)

3. Sayed, A.H.: Adaptive Filters. Wiley, Hoboken (2008)

4. Haykin, S., Widrow, B. (eds.): Least-Mean-Square Adaptive Filters. Wiley, Hoboken (2003)

5. Werner, S., Diniz, P.S.R.: Set-membership affine projection algorithm. IEEE Signal Process. Lett. 8(8), 231–235 (2001)

6. Morgan, D.R., Kratzer, S.G.: On a class of computationally efficient, rapidly converging, generalized NLMS algorithms. IEEE Signal Process. Lett. 3(8), 245–247 (1996)

7. Rupp, M.: A family of adaptive filter algorithms with decorrelating properties. IEEE Trans. Signal Process. 46(3), 771–775 (1998)

8. Hinamoto, T., Maekawa, S.: Extended theory of learning identification. J. IEEJ-C 95(10), 227–234 (1975)

9. Cioffi, J.M., Kailath, T.: Windowed fast transversal filters adaptive algorithms with normalization. IEEE Trans. Acoust. Speech Signal Process. ASSP-33(3), 607–625 (1985)

10. Satake, I.: Linear Algebra. Marcel Dekker, New York (1975)

11. Luenberger, D.G.: Linear and Nonlinear Programming. Addison-Wesley, Menlo Park (1989)

12. Markel, J.D., Gray, A.H.: Linear Prediction of Speech. Springer, Berlin (1976)


Author information

Correspondence to Kazuhiko Ozeki.

Appendices

Appendix 1: Affine Projection

3.1.1 Orthogonal Projection

The inner product \(\langle x, \,y \rangle \) of vectors \(x, y \in \mathbb {R}^{n}\) is defined by \( \langle x, \,y \rangle \mathop {=}\limits ^{\triangle }x^{t}y\). The Euclidean norm \(\Vert x\Vert \) of \(x \in \mathbb {R}^{n}\) is defined by \(\Vert x\Vert \mathop {=}\limits ^{\triangle }\sqrt{\langle x,\,x \rangle }\). Vectors x and y are said to be orthogonal, denoted by \(x \perp y\), iff (see Note 3) \(\langle x, \,y \rangle = 0\). A vector x and a linear subspace \(\mathbb {V}\) of \(\mathbb {R}^{n}\) are said to be orthogonal and denoted by \(x \perp \mathbb {V}\) iff \(x \perp y\) for every element y in \(\mathbb {V}\). Linear subspaces \(\mathbb {V}\) and \(\mathbb {W}\) are said to be orthogonal and denoted by \(\mathbb {V} \perp \mathbb {W}\) iff any element of \(\mathbb {V}\) and any element of \(\mathbb {W}\) are orthogonal.

For a linear subspace \(\mathbb {V}\) of \(\mathbb {R}^{n}\), \(\mathbb {V}^{\perp }\mathop {=}\limits ^{\triangle }\{x \in \mathbb {R}^{n} \,;\, x \perp \mathbb {V}\}\) is called the orthogonal complement of \(\mathbb {V}\). The set \(\mathbb {V}^{\perp }\) is a linear subspace of \(\mathbb {R}^{n}\). It is easy to verify that \(\mathbb {V} \perp \mathbb {V}^{\perp }\) and that \(\mathbb {V} \cap \mathbb {V}^{\perp }= \{0\}\).

Let \(\mathbb {V}\) be a linear subspace of \(\mathbb {R}^{n}\) and x an element of \(\mathbb {R}^{n}\). An element \(x_{0} \in \mathbb {V}\) is called the orthogonal projection of x onto \(\mathbb {V}\) iff \((x-x_{0}) \perp \mathbb {V}\). Such \(x_{0}\) is uniquely determined for x. In fact, suppose that there are two orthogonal projections \(x_{0}\) and \(x_{0}^{\prime }\) of x. Then, for any \(y \in \mathbb {V}\), \(\langle x-x_{0},\,y \rangle = 0\) and \(\langle x-x_{0}^{\prime },\,y \rangle = 0\). From these equations, we have \(\langle x_{0}-x_{0}^{\prime },\,y \rangle = 0\). Let \(y\mathop {=}\limits ^{\triangle }x_{0}-x_{0}^{\prime } \in \mathbb {V}\). Then \(\langle x_{0}-x_{0}^{\prime },\,x_{0}-x_{0}^{\prime } \rangle = 0\), which leads to \(x_{0}=x_{0}^{\prime }\). The mapping that maps x to \(x_{0}\) is also called the orthogonal projection, and denoted by \(P_{\mathbb {V}}\). This is a linear mapping from \(\mathbb {R}^{n}\) onto \(\mathbb {V}\).

As illustrated in Fig. 3.6, the orthogonal projection has the minimum distance property. That is, if y is an element of \(\mathbb {V}\), then \(\Vert P_{\mathbb {V}}x- x\Vert \le \Vert y-x\Vert \) with equality iff \(y= P_{\mathbb {V}}x\). This is a direct consequence of the Pythagorean theorem:

$$\begin{aligned} \Vert y-x\Vert ^{2}&=\Vert y- P_{\mathbb {V}}x\Vert ^{2}+ \Vert P_{\mathbb {V}}x-x\Vert ^{2}\\&\ge \Vert P_{\mathbb {V}}x-x\Vert ^{2}. \end{aligned}$$
Fig. 3.6 Minimum distance property of orthogonal projection

Fig. 3.7 Orthogonal decomposition of \(\mathbb {R}^{n}\)

Any element \(x \in \mathbb {R}^{n}\) is decomposed as

$$\begin{aligned} x= P_{\mathbb {V}}x + P_{\mathbb {V}^{\perp }}x. \end{aligned}$$
(3.37)

In fact, let \(y\mathop {=}\limits ^{\triangle }x - P_{\mathbb {V}}x\). Then y is an element of \(\mathbb {V}^{\perp }\). Since \(x-y= P_{\mathbb {V}}x \in \mathbb {V}\), \((x-y) \perp \mathbb {V}^{\perp }\). Hence, \(y= P_{\mathbb {V}^{\perp }}x\).

The decomposition (3.37) is unique in the sense that if \(x= y_{1}+y_{2},\ \ y_{1} \in \mathbb {V}, y_{2} \in \mathbb {V}^{\perp }\), then \(y_{1}= P_{\mathbb {V}} x\) and \(y_{2}= P_{\mathbb {V}^{\perp }} x\). Therefore, as in Fig. 3.7, \(\mathbb {R}^{n}\) is represented as the (orthogonal) direct sum of \(\mathbb {V}\) and \(\mathbb {V}^{\perp }\):

$$\begin{aligned} \mathbb {R}^{n}= \mathbb {V} \oplus \mathbb {V}^{\perp }. \end{aligned}$$

Since \(P_{\mathbb {V}}\) is a linear mapping from \(\mathbb {R}^{n}\) to \(\mathbb {R}^{n}\), it can be represented by a real matrix. The following theorem is well known [10, p. 149].

Theorem 3.3

A matrix A is an orthogonal projection iff \( A= A^{t}\) (symmetric) and \(A^{2}= A\) (idempotent). When these conditions are met, \(A=P_{\mathcal {R}(A)}\) and \(I- A= P_{\mathcal {R}(A)^{\perp }}\), where \(\mathcal {R}(A)\) is the range space of A, i.e., the linear subspace spanned by the columns of A.

Theorem 3.4

If A is an orthogonal projection, then its eigenvalues are 0 or 1.

Proof

If \(\lambda \) is an eigenvalue of A, and x the corresponding eigenvector (Appendix A), then \(Ax= \lambda x\). The eigenvector x can be uniquely decomposed as \(x= x_{1} + x_{2},\ x_{1} \in \mathcal {R}(A),\ x_{2} \in \mathcal {R}(A)^{\perp }\). Therefore, \(Ax= x_{1}\) and \(Ax= \lambda x_{1} + \lambda x_{2}\), from which we have \((\lambda -1)x_{1} + \lambda x_{2}= 0\). By the uniqueness of orthogonal decomposition, \((\lambda -1)x_{1}= 0\) and \(\lambda x_{2}= 0\). Thus, if \(x_{1}\ne 0\), then \(\lambda =1\) (and \(x_{2}=0\)). If \(x_{1}= 0\) then \(x_{2}\ne 0\) since \(x\ne 0\), which implies \(\lambda =0\).\(\square \)
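Theorems 3.3 and 3.4 are easy to check numerically. The following NumPy sketch (with an arbitrarily chosen subspace; the variable names are illustrative) builds an orthogonal projection from an orthonormal basis and verifies that it is symmetric, idempotent, and has only the eigenvalues 0 and 1:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 3))            # columns span a 3-dimensional subspace V of R^6
Q, _ = np.linalg.qr(B)                     # orthonormal basis of V
P = Q @ Q.T                                # orthogonal projection onto V

print(np.allclose(P, P.T))                 # symmetric
print(np.allclose(P @ P, P))               # idempotent
print(np.round(np.linalg.eigvalsh(P), 8))  # eigenvalues: three 0's and three 1's
```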

3.1.2 Gram–Schmidt Orthogonalization Procedure

Theorem 3.5

Let \(\{x_{1}, x_{2}, \ldots , x_{m}\}\ (m\le n)\) be a set of linearly independent vectors in \(\mathbb {R}^{n}\). From this set, construct a new set of vectors \(\{x^{\prime }_{1}, x^{\prime }_{2}, \ldots , x^{\prime }_{m}\}\) as follows:

Define \(x^{\prime }_{1}\) by

$$\begin{aligned} x^{\prime }_{1}\mathop {=}\limits ^{\triangle }x_{1}/\Vert x_{1}\Vert , \end{aligned}$$
(3.38)

and define \(x^{\prime }_{l}\) for \(l= 2, 3, \ldots , m\) recursively as

$$\begin{aligned} y_{l}&\mathop {=}\limits ^{\triangle }x_{l} - \sum _{k=1}^{l-1}\langle x_{l},\,x^{\prime }_{k}\rangle x^{\prime }_{k},\nonumber \\ x^{\prime }_{l}&\mathop {=}\limits ^{\triangle }y_{l}/\Vert y_{l}\Vert . \end{aligned}$$
(3.39)

Then,

$$\begin{aligned} \langle x^{\prime }_{j},\,x^{\prime }_{k} \rangle = 0\ \ \ (1 \le j< k \le m), \end{aligned}$$

and the sets of vectors \(\{x_{1}, x_{2}, \ldots , x_{l}\}\) and \(\{x^{\prime }_{1}, x^{\prime }_{2}, \ldots , x^{\prime }_{l}\}\) span the same linear subspace for any \(l\ (1 \le l \le m)\).

Proof

First note that each \(x^{\prime }_{k}\) is normalized so that \(\Vert x^{\prime }_{k}\Vert =1\). Let us show, by mathematical induction on l, that for \(l=2, 3, \ldots ,m\), the vector \(x^{\prime }_{l}\) and each of \(x^{\prime }_{1}, x^{\prime }_{2},\ldots , x^{\prime }_{l-1}\) are orthogonal. For \(l=2\), this is true. In fact,

$$\begin{aligned} \langle x^{\prime }_{2},\,x^{\prime }_{1} \rangle&= \langle (x_{2}- \langle x_{2},\, x^{\prime }_{1}\rangle x^{\prime }_{1})/\Vert y_{2}\Vert ,\,x^{\prime }_{1} \rangle \\&= (\langle x_{2},\, x^{\prime }_{1}\rangle - \langle x_{2}, x^{\prime }_{1}\rangle \langle x^{\prime }_{1}, x^{\prime }_{1}\rangle )/\Vert y_{2}\Vert \\&= (\langle x_{2},\, x^{\prime }_{1}\rangle - \langle x_{2}, x^{\prime }_{1}\rangle )/\Vert y_{2}\Vert \\&= 0. \end{aligned}$$

Next, assume that for \(l\ge 3\),

$$\begin{aligned}&x^{\prime }_{2}\ \mathrm{and}\ x^{\prime }_{1}\ \mathrm{are\ orthogonal}, \\&x^{\prime }_{3}\ \mathrm{and \ each\ element\ of}\ \{x^{\prime }_{1}, x^{\prime }_{2}\}\ \mathrm{are\ orthogonal}, \\&\qquad \vdots \\&x^{\prime }_{l-1}\ \mathrm{and\ each\ element\ of}\ \{x^{\prime }_{1}, x^{\prime }_{2},\ldots , x^{\prime }_{l-2}\}\ \mathrm{are\ orthogonal}. \end{aligned}$$

Then, for \(1 \le j \le l-1\),

$$\begin{aligned} \langle x^{\prime }_{l},\,x^{\prime }_{j} \rangle&= \left\langle \frac{1}{\Vert y_{l}\Vert }\left( x_{l}- \sum _{k=1}^{l-1}\langle x_{l},\, x^{\prime }_{k}\rangle x^{\prime }_{k}\right) ,\,x^{\prime }_{j} \right\rangle \\&= \frac{1}{\Vert y_{l}\Vert }\left( \langle x_{l},\, x^{\prime }_{j}\rangle - \sum _{k=1}^{l-1}\langle x_{l}, x^{\prime }_{k}\rangle \langle x^{\prime }_{k}, x^{\prime }_{j}\rangle \right) \\&= \frac{1}{\Vert y_{l}\Vert }(\langle x_{l},\, x^{\prime }_{j}\rangle - \langle x_{l}, x^{\prime }_{j}\rangle ) \\&= 0. \end{aligned}$$

This shows that \(x^{\prime }_{l}\) and each element of \(\{x^{\prime }_{1}, x^{\prime }_{2},\ldots , x^{\prime }_{l-1}\}\) are orthogonal.

From the recursion (3.38) and (3.39), it is obvious that each \(x^{\prime }_{l}\) is a linear combination of \(x_{1}, x_{2}, \ldots , x_{l}\). Also from the recursion,

$$\begin{aligned} x_{1}&= \Vert x_{1}\Vert x^{\prime }_{1},\\ x_{l}&= \sum _{k=1}^{l-1}\langle x_{l},\,x^{\prime }_{k}\rangle x^{\prime }_{k} + \Vert y_{l}\Vert x^{\prime }_{l}, \end{aligned}$$

which shows that each \(x_{l}\) is a linear combination of \(x^{\prime }_{1}, x^{\prime }_{2}, \ldots , x^{\prime }_{l}\). Therefore, \(\{x_{1}, x_{2}, \ldots , x_{l}\}\) and \(\{x^{\prime }_{1}, x^{\prime }_{2}, \ldots , x^{\prime }_{l}\}\) span the same linear subspace.\(\square \)

The recursion in (3.38) and (3.39) is referred to as the Gram–Schmidt orthogonalization procedure. Note that \(\sum _{k=1}^{l-1}\langle x_{l},\,x^{\prime }_{k}\rangle x^{\prime }_{k}\) in (3.39) is the orthogonal projection of \(x_{l}\) onto the linear subspace spanned by \(\{x^{\prime }_{1}, x^{\prime }_{2}, \ldots , x^{\prime }_{l-1}\}\).

If normalization is not performed, the Gram–Schmidt orthogonalization procedure is simply written as

$$\begin{aligned} \begin{aligned} x^{\prime }_{1}&\mathop {=}\limits ^{\triangle }x_{1},\\ x^{\prime }_{l}&\mathop {=}\limits ^{\triangle }x_{l} - \sum _{k=1}^{l-1}(\langle x_{l},\,x^{\prime }_{k}\rangle /\Vert x^{\prime }_{k}\Vert ^{2}) x^{\prime }_{k} \ \ \ \ (l= 2, 3, \ldots , m). \end{aligned} \end{aligned}$$
(3.40)
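The normalized recursion (3.38)–(3.39) translates directly into code. The following NumPy sketch (illustrative names; the input columns are assumed linearly independent) orthonormalizes the columns of a matrix and checks the result:

```python
import numpy as np

def gram_schmidt(X):
    """Orthonormalize the linearly independent columns of X via (3.38)-(3.39)."""
    Q = np.zeros_like(X, dtype=float)
    for l in range(X.shape[1]):
        y = X[:, l].copy()
        for k in range(l):                       # subtract the projection onto span{x'_1,...,x'_{l-1}}
            y -= (X[:, l] @ Q[:, k]) * Q[:, k]
        Q[:, l] = y / np.linalg.norm(y)          # normalize
    return Q

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))                  # 3 linearly independent vectors in R^5 (a.s.)
Q = gram_schmidt(X)
print(np.allclose(Q.T @ Q, np.eye(3)))           # the new vectors are orthonormal
```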

3.1.3 Moore–Penrose Pseudoinverse

An \(m\times n\) real matrix X can be considered as a mapping from \(\mathbb {R}^{n}\) to \(\mathbb {R}^{m}\). The range space of X, denoted by \(\mathcal {R}(X)\), is defined by

$$\begin{aligned} \mathcal {R}(X)\mathop {=}\limits ^{\triangle }\{Xv \,;\, v \in \mathbb {R}^{n}\}, \end{aligned}$$
(3.41)

which is a linear subspace of \(\mathbb {R}^{m}\) spanned by the columns of X. The null space of X, denoted by \(\mathcal {N}(X)\), is defined by

$$\begin{aligned} \mathcal {N}(X)\mathop {=}\limits ^{\triangle }\{v \in \mathbb {R}^{n} \,;\, Xv= 0\}, \end{aligned}$$
(3.42)

which is also a linear subspace of \(\mathbb {R}^{n}\).

Let \(\mathcal {N}(X)^{\perp }\) be the orthogonal complement of \(\mathcal {N}(X)\) in \(\mathbb {R}^{n}\):

$$\begin{aligned} \mathcal {N}(X)^{\perp }\mathop {=}\limits ^{\triangle }\{v \in \mathbb {R}^{n} \,;\, \langle v, \, w \rangle = 0\ \ (w \in \mathcal {N}(X))\}. \end{aligned}$$

Likewise, let \(\mathcal {R}(X)^{\perp }\) be the orthogonal complement of \(\mathcal {R}(X)\) in \(\mathbb {R}^{m}\):

$$\begin{aligned} \mathcal {R}(X)^{\perp }\mathop {=}\limits ^{\triangle }\{v \in \mathbb {R}^{m} \,;\, \langle v, \, w \rangle = 0\ \ (w \in \mathcal {R}(X))\}. \end{aligned}$$

Then, \(\mathbb {R}^{n}\) and \(\mathbb {R}^{m}\) are represented respectively as the direct sum

$$\begin{aligned} \mathbb {R}^{n}&= \mathcal {N}(X)^{\perp } \oplus \mathcal {N}(X),\\ \mathbb {R}^{m}&= \mathcal {R}(X) \oplus \mathcal {R}(X)^{\perp }. \end{aligned}$$

Denote by \(X|_{\mathcal {N}(X)^{\perp }}\) the restriction of X, as a mapping, to \(\mathcal {N}(X)^{\perp }\). The mapping \(X|_{\mathcal {N}(X)^{\perp }}\) is a one-to-one mapping from \(\mathcal {N}(X)^{\perp }\) onto \(\mathcal {R}(X)\). In fact, if \(X|_{\mathcal {N}(X)^{\perp }} v= X|_{\mathcal {N}(X)^{\perp }} w\) for \(v, w \in \mathcal {N}(X)^{\perp }\), then \(X|_{\mathcal {N}(X)^{\perp }}(v-w)= X(v-w)= 0\). Therefore, \(v-w \in \mathcal {N}(X)^{\perp } \cap \mathcal {N}(X) =\{0\}\). Hence \(v= w\), i.e., \(X|_{\mathcal {N}(X)^{\perp }}\) is a one-to-one mapping. Furthermore, for \(w \in \mathcal {R}(X)\), there exists \(v \in \mathbb {R}^{n}\) such that \(Xv= w\). The element v can be decomposed as \(v= v_{1}+ v_{2}, v_{1} \in \mathcal {N}(X)^{\perp }, v_{2} \in \mathcal {N}(X)\). Since \(Xv_{2}=0\),

$$\begin{aligned} X|_{\mathcal {N}(X)^{\perp }}v_{1} = Xv_{1}= Xv_{1}+ Xv_{2}= X(v_{1}+v_{2})= Xv= w. \end{aligned}$$

This shows that \(X|_{\mathcal {N}(X)^{\perp }}\) is a mapping from \(\mathcal {N}(X)^{\perp }\) onto \(\mathcal {R}(X)\). Thus, we see that the mapping \(X|_{\mathcal {N}(X)^{\perp }}\) has the inverse

$$\begin{aligned} (X|_{\mathcal {N}(X)^{\perp }})^{-1}:\ \mathcal {R}(X) \longrightarrow \ \mathcal {N}(X)^{\perp }. \end{aligned}$$

Now, \(w \in \mathbb {R}^{m}\) can be uniquely decomposed as \(w= w_{1} + w_{2},\ \ w_{1} \in \mathcal {R}(X),\ w_{2} \in \mathcal {R}(X)^{\perp }\). Let \(P_{\mathcal {R}(X)}\) be the orthogonal projection from \(\mathbb {R}^{m}\) onto \(\mathcal {R}(X)\) : \(P_{\mathcal {R}(X)} w \mathop {=}\limits ^{\triangle }w_{1}\). The composite mapping \(X^{+}\mathop {=}\limits ^{\triangle }(X|_{\mathcal {N}(X)^{\perp }})^{-1}\, P_{\mathcal {R}(X)}\) is called the Moore–Penrose pseudoinverse of X. This is a linear mapping from \(\mathbb {R}^{m}\) to \(\mathbb {R}^{n}\). The formation of \(X^{+}\) is illustrated in Fig. 3.8.

Fig. 3.8 Formation of the Moore–Penrose pseudoinverse \(X^{+}\)

Lemma 3.1

(1) \(X^{+}X = P_{\mathcal {N}(X)^{\perp }}\).

(2) \(XX^{+} =P_{\mathcal {R}(X)}\).

(3) If X is nonsingular, then \(X^{+}= X^{-1}\).

Proof

(1) Let us decompose \(v \in \mathbb {R}^{n}\) as \(v= v_{1}+v_{2}\), \(v_{1} \in \mathcal {N}(X)^{\perp }, v_{2} \in \mathcal {N}(X)\). Then,

$$\begin{aligned} X^{+}X v&= X^{+}X(v_{1}+ v_{2})\\&= X^{+}(Xv_{1})\\&= (X|_{\mathcal {N}(X)^{\perp }})^{-1}\, P_{\mathcal {R}(X)}(Xv_{1})\\&= (X|_{\mathcal {N}(X)^{\perp }})^{-1}\, P_{\mathcal {R}(X)}(X|_{\mathcal {N}(X)^{\perp }}v_{1})\\&= (X|_{\mathcal {N}(X)^{\perp }})^{-1}(X|_{\mathcal {N}(X)^{\perp }}v_{1})\\&= v_{1}\\&= P_{\mathcal {N}(X)^{\perp }}v, \end{aligned}$$

which shows \(X^{+}X = P_{\mathcal {N}(X)^{\perp }}\).

(2) Let us decompose \(w \in \mathbb {R}^{m}\) as \(w= w_{1}+w_{2}\), \(w_{1} \in \mathcal {R}(X), w_{2} \in \mathcal {R}(X)^{\perp }\). Then,

$$\begin{aligned} XX^{+}w&= X(X|_{\mathcal {N}(X)^{\perp }})^{-1} P_{\mathcal {R}(X)}(w_{1}+w_{2})\\&= X(X|_{\mathcal {N}(X)^{\perp }})^{-1}w_{1}\\&= X|_{\mathcal {N}(X)^{\perp }}(X|_{\mathcal {N}(X)^{\perp }})^{-1}w_{1}\\&= w_{1}\\&= P_{\mathcal {R}(X)}w, \end{aligned}$$

which shows \(XX^{+} =P_{\mathcal {R}(X)}\).

(3) If X is an \(n\times n\) nonsingular matrix, then \(\mathcal {N}(X)^{\perp }=\mathbb {R}^{n}\), and \(\mathcal {R}(X)=\mathbb {R}^{n}\). Therefore, \(P_{\mathcal {R}(X)}\) equals the identity mapping \(I_{n}\) on \(\mathbb {R}^{n}\). Hence,

$$\begin{aligned} X^{+}= (X|_{\mathcal {N}(X)^{\perp }})^{-1}P_{\mathcal {R}(X)}= X^{-1}I_{n}= X^{-1}. \end{aligned}$$

\(\square \)

Lemma 3.2

If \(\mathrm{rank}\,X = m\), then \(XX^{+}= I_{m}\).

Proof

Because \(\dim \mathcal {R}(X)= \mathrm{rank}\,X= m\), we have \(\mathcal {R}(X)= \mathbb {R}^{m}\). Thus, in view of Lemma 3.1(2),

$$\begin{aligned} XX^{+}= P_{\mathcal {R}(X)}= P_{\mathbb {R}^{m}}= I_{m}. \end{aligned}$$

\(\square \)

Lemma 3.3

\(\mathcal {R}(X^{t})= \mathcal {N}(X)^{\perp }\).

Proof

Note that \(X^{t}\) is a mapping from \(\mathbb {R}^{m}\) to \(\mathbb {R}^{n}\). Because there exists a one-to-one linear mapping \(X|_{\mathcal {N}(X)^{\perp }}\) from \(\mathcal {N}(X)^{\perp }\) onto \(\mathcal {R}(X)\),

$$\begin{aligned} \dim \mathcal {N}(X)^{\perp } = \dim \mathcal {R}(X)= \mathrm{rank}\,X= \mathrm{rank}\,X^{t}. \end{aligned}$$
(3.43)

For arbitrary \(w \in \mathbb {R}^{m}\) and \(v \in \mathcal {N}(X)\),

$$\begin{aligned} \langle X^{t}w,\,v\rangle = \langle w,\,Xv\rangle = 0. \end{aligned}$$

Therefore, \(X^{t}w \in \mathcal {N}(X)^{\perp }\), which shows that

$$\begin{aligned} \mathcal {R}(X^{t})\subseteq \mathcal {N}(X)^{\perp }. \end{aligned}$$
(3.44)

On the other hand, by (3.43),

$$\begin{aligned} \dim \mathcal {R}(X^{t})= \mathrm{rank}\,X^{t}= \dim \mathcal {N}(X)^{\perp }. \end{aligned}$$
(3.45)

From (3.44) and (3.45), we conclude that \(\mathcal {R}(X^{t})= \mathcal {N}(X)^{\perp }\).\(\square \)

Theorem 3.6

(1) \((X^{+})^{+}= X\).

(2) \((X^{t})^{+}= (X^{+})^{t}\).

(3) \(X(X^{t}X)^{+}X^{t}X = X\).

(4) \(X(X^{t}X)^{+}X^{t}= P_{\mathcal {R}(X)}\).

Proof

(1) By definition of the Moore–Penrose pseudoinverse,

$$\begin{aligned} (X^{+})^{+}= (X^{+}|_{\mathcal {N}(X^{+})^{\perp }})^{-1} P_{\mathcal {R}(X^{+})}. \end{aligned}$$

Since \(\mathcal {R}(X^{+})= \mathcal {N}(X)^{\perp }\) and \(\mathcal {N}(X^{+})^{\perp }= (\mathcal {R}(X)^{\perp })^{\perp }= \mathcal {R}(X)\), we have

$$\begin{aligned} (X^{+})^{+}&= (X^{+}|_{\mathcal {R}(X)})^{-1} P_{\mathcal {N}(X)^{\perp }}\\&= ((X|_{\mathcal {N}(X)^{\perp }})^{-1})^{-1} P_{\mathcal {N}(X)^{\perp }}\\&= X|_{\mathcal {N}(X)^{\perp }} P_{\mathcal {N}(X)^{\perp }}. \end{aligned}$$

Let v be an arbitrary element in \(\mathbb {R}^{n}\), and decompose it as \(v= v_{1} + v_{2}\), where \(v_{1} \in \mathcal {N}(X)^{\perp }\) and \(v_{2} \in \mathcal {N}(X)\). Then,

$$\begin{aligned} (X^{+})^{+} v&= X|_{\mathcal {N}(X)^{\perp }} P_{\mathcal {N}(X)^{\perp }} v\\&= X|_{\mathcal {N}(X)^{\perp }} v_{1}\\&= X v_{1}. \end{aligned}$$

On the other hand,

$$\begin{aligned} Xv= X(v_{1}+ v_{2})= Xv_{1}. \end{aligned}$$

Therefore, \((X^{+})^{+} v = Xv\) for any \(v \in \mathbb {R}^{n}\), i.e., \((X^{+})^{+} = X\).

(2) The matrix \(X|_{\mathcal {N}(X)^{\perp }}\) is a one-to-one mapping from \(\mathcal {N}(X)^{\perp }\) onto \(\mathcal {R}(X)\), and \(X^{t}|_{\mathcal {N}(X^{t})^{\perp }}\) is a one-to-one mapping from \(\mathcal {N}(X^{t})^{\perp }\) onto \(\mathcal {R}(X^{t})\). In view of Lemma 3.3, \(\mathcal {N}(X)^{\perp }= \mathcal {R}(X^{t})\) and \(\mathcal {N}(X^{t})^{\perp }= \mathcal {R}(X)\). That is, the domain of \(X|_{\mathcal {N}(X)^{\perp }}\) coincides with the range of \(X^{t}|_{\mathcal {N}(X^{t})^{\perp }}\), and the domain of \(X^{t}|_{\mathcal {N}(X^{t})^{\perp }}\) coincides with the range of \(X|_{\mathcal {N}(X)^{\perp }}\) as shown in the following diagram:

(Diagram: \(X|_{\mathcal {S}}\) maps \(\mathcal {S}= \mathcal {N}(X)^{\perp }= \mathcal {R}(X^{t})\) onto \(\mathcal {T}= \mathcal {R}(X)= \mathcal {N}(X^{t})^{\perp }\), and \(X^{t}|_{\mathcal {T}}\) maps \(\mathcal {T}\) back onto \(\mathcal {S}\).)

Let us define

$$\begin{aligned} \mathcal {S}\mathop {=}\limits ^{\triangle }\mathcal {N}(X)^{\perp }= \mathcal {R}(X^{t}), \end{aligned}$$

and

$$\begin{aligned} \mathcal {T}\mathop {=}\limits ^{\triangle }\mathcal {R}(X) = \mathcal {N}(X^{t})^{\perp }. \end{aligned}$$

Then, we have

$$\begin{aligned} (X|_{\mathcal {S}})^{t}= X^{t}|_{\mathcal {T}}. \end{aligned}$$
(3.46)

In fact, let v and w be arbitrary elements in \(\mathcal {S}\) and in \(\mathcal {T}\), respectively. Then,

$$\begin{aligned} \langle (X|_{\mathcal {S}})^{t} w, v \rangle&= \langle w, X|_{\mathcal {S}} v \rangle \\&= \langle w, X v \rangle , \end{aligned}$$

and

$$\begin{aligned} \langle X^{t}|_{\mathcal {T}} w, v \rangle&= \langle X^{t} w, v \rangle \\&= \langle w, X v \rangle . \end{aligned}$$

Therefore, for arbitrary \(v \in \mathcal {S}\) and \(w \in \mathcal {T}\),

$$\begin{aligned} \langle (X|_{\mathcal {S}})^{t} w, v \rangle = \langle X^{t}|_{\mathcal {T}} w, v \rangle , \end{aligned}$$

which is equivalent to (3.46).

We also have

$$\begin{aligned} ((X|_{\mathcal {S}})^{t})^{-1} = ((X|_{\mathcal {S}})^{-1})^{t}. \end{aligned}$$
(3.47)

In fact, let v and w be arbitrary elements in \(\mathcal {S}\) and \(\mathcal {T}\), respectively, and let \(u \mathop {=}\limits ^{\triangle }((X|_{\mathcal {S}})^{t})^{-1} v\). Then,

$$\begin{aligned} \langle ((X|_{\mathcal {S}})^{t})^{-1} v, w \rangle = \langle u, w \rangle , \end{aligned}$$
(3.48)

and

$$\begin{aligned} \langle ((X|_{\mathcal {S}})^{-1})^{t} v, w \rangle&= \langle v, (X|_{\mathcal {S}})^{-1} w \rangle \nonumber \\&= \langle (X|_{\mathcal {S}})^{t} u, (X|_{\mathcal {S}})^{-1} w \rangle \nonumber \\&= \langle u, X|_{\mathcal {S}} (X|_{\mathcal {S}})^{-1} w \rangle \nonumber \\&= \langle u, w \rangle . \end{aligned}$$
(3.49)

Combining (3.48) and (3.49),

$$\begin{aligned} \langle ((X|_{\mathcal {S}})^{t})^{-1} v, w \rangle = \langle ((X|_{\mathcal {S}})^{-1})^{t} v, w \rangle , \end{aligned}$$

which is equivalent to (3.47).

Now, to prove \((X^{t})^{+}= (X^{+})^{t}\), it suffices to show that for arbitrary \(v \in \mathbb {R}^{n}\) and \(w \in \mathbb {R}^{m}\),

$$\begin{aligned} \langle (X^{t})^{+} v, w \rangle = \langle (X^{+})^{t} v, w \rangle . \end{aligned}$$

This equation is equivalent to

$$\begin{aligned} \langle (X^{t})^{+} v, w \rangle = \langle v, X^{+}w \rangle . \end{aligned}$$
(3.50)

Let v be decomposed as \(v= v_{1}+ v_{2}\), where \(v_{1} \in \mathcal {S}\) and \(v_{2} \in \mathcal {S}^{\perp }\). Then, using (3.46) and (3.47), we have

$$\begin{aligned} (X^{t})^{+} v&= (X^{t}|_{\mathcal {T}})^{-1} P_{\mathcal {S}} v\\&= (X^{t}|_{\mathcal {T}})^{-1}v_{1}\\&= ((X|_{\mathcal {S}})^{t})^{-1} v_{1}\\&= ((X|_{\mathcal {S}})^{-1})^{t} v_{1}. \end{aligned}$$

Therefore, if we decompose w as \(w= w_{1}+w_{2}\), where \(w_{1} \in \mathcal {T}\) and \(w_{2} \in \mathcal {T}^{\perp }\), we obtain (3.50) in the following way:

$$\begin{aligned} \langle (X^{t})^{+} v, w \rangle&= \langle ((X|_{\mathcal {S}})^{-1})^{t} v_{1}, w \rangle \\&= \langle ((X|_{\mathcal {S}})^{-1})^{t} v_{1}, w_{1} \rangle \\&= \langle v_{1}, (X|_{\mathcal {S}})^{-1}w_{1} \rangle \\&= \langle v_{1}, X^{+} w \rangle \\&= \langle v, X^{+} w \rangle . \end{aligned}$$

(3) Let \(Y \mathop {=}\limits ^{\triangle }X(X^{t}X)^{+}X^{t}X - X\). Since \(((X^{t}X)^{+})^{t}= (X^{t}X)^{+}\) by (2) above, we have

$$\begin{aligned} Y^{t}Y&= (X(X^{t}X)^{+}X^{t}X - X)^{t}(X(X^{t}X)^{+}X^{t}X - X)\\&= (X^{t}X(X^{t}X)^{+}X^{t} - X^{t})(X(X^{t}X)^{+}X^{t}X - X)\\&= X^{t}X(X^{t}X)^{+}X^{t}X(X^{t}X)^{+}X^{t}X \\&\quad - X^{t}X(X^{t}X)^{+}X^{t}X - X^{t}X(X^{t}X)^{+}X^{t}X + X^{t}X. \end{aligned}$$

By Lemma 3.1(2), \(X^{t}X(X^{t}X)^{+}= P_{\mathcal {R}(X^{t}X)}\). Therefore,

$$\begin{aligned} X^{t}X(X^{t}X)^{+}X^{t}X&= P_{\mathcal {R}(X^{t}X)}X^{t}X\\&= X^{t}X, \end{aligned}$$

and

$$\begin{aligned} X^{t}X(X^{t}X)^{+}X^{t}X(X^{t}X)^{+}X^{t}X&= P_{\mathcal {R}(X^{t}X)} P_{\mathcal {R}(X^{t}X)}X^{t}X\\&= X^{t}X. \end{aligned}$$

Hence, \(Y^{t}Y=0\), from which \(Y=0\) is concluded.

(4) By (3) above, \(X(X^{t}X)^{+}X^{t}u = u\) for any \(u \in \mathcal {R}(X)\). Moreover, \(X(X^{t}X)^{+}X^{t}v= 0\) for any \(v \in \mathcal {R}(X)^{\perp }\), since \(X^{t}v= 0\). Therefore, \(X(X^{t}X)^{+}X^{t}= P_{\mathcal {R}(X)}\).\(\square \)

Theorem 3.7

If \(\mathrm{rank}\,X = m\), then \(X^{+}= X^{t}(XX^{t})^{-1}\).

Proof

Let \(x_{k}\) be the kth column of \(X^{t}\). Then,

$$\begin{aligned} XX^{t}= \begin{bmatrix} \langle x_{1},\,x_{1} \rangle&\langle x_{1},\,x_{2} \rangle&\cdots&\langle x_{1},\,x_{m} \rangle \\ \langle x_{2},\,x_{1} \rangle&\langle x_{2},\,x_{2} \rangle&\cdots&\langle x_{2},\,x_{m} \rangle \\ \vdots&\vdots&\ddots&\vdots \\ \langle x_{m},\,x_{1} \rangle&\langle x_{m},\,x_{2} \rangle&\cdots&\langle x_{m},\,x_{m} \rangle \end{bmatrix}. \end{aligned}$$

If \(\mathrm{rank}\,X = \mathrm{rank}\,X^{t}= m\), the vectors \(x_{1}, x_{2}, \ldots , x_{m}\) are linearly independent. Therefore, by Theorem A.8 (Appendix A), \(\det (XX^{t}) \ne 0\), so that the Gramian matrix \(XX^{t}\) has the inverse \((XX^{t})^{-1}\). For arbitrary \(w \in \mathbb {R}^{m}\), let

$$\begin{aligned} v \mathop {=}\limits ^{\triangle }X^{+}w. \end{aligned}$$
(3.51)

Because \(v \in \mathcal {N}(X)^{\perp }\), Lemma 3.3 guarantees the existence of \(z \in \mathbb {R}^{m}\) such that

$$\begin{aligned} v= X^{t}z. \end{aligned}$$
(3.52)

Combining (3.51) and (3.52), and using Lemma 3.2, we have

$$\begin{aligned} XX^{t}z= Xv= XX^{+}w= w. \end{aligned}$$

Since the Gramian matrix \(XX^{t}\) has the inverse \((XX^{t})^{-1}\),

$$\begin{aligned} z= (XX^{t})^{-1}w. \end{aligned}$$
(3.53)

Substitution of (3.53) into (3.52) yields

$$\begin{aligned} v= X^{t}(XX^{t})^{-1}w. \end{aligned}$$
(3.54)

Comparison of (3.51) and (3.54) leads to

$$\begin{aligned} X^{+}= X^{t}(XX^{t})^{-1}. \end{aligned}$$

\(\square \)
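For a matrix of full row rank, Theorem 3.7 and Lemma 3.2 can be confirmed numerically; the following NumPy sketch uses a randomly generated X, which has full row rank almost surely:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 7))                  # rank X = m = 3 almost surely

X_plus = X.T @ np.linalg.inv(X @ X.T)            # Theorem 3.7
print(np.allclose(X_plus, np.linalg.pinv(X)))    # agrees with the Moore-Penrose pseudoinverse
print(np.allclose(X @ X_plus, np.eye(3)))        # Lemma 3.2: X X^+ = I_m
```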

Theorem 3.8

Let X be an \(m \times n\) matrix, and y an element of \(\mathbb {R}^{m}\). The Moore–Penrose pseudoinverse gives a least-squares solution of the linear equation

$$\begin{aligned} Xv= y. \end{aligned}$$
(3.55)

That is, \(v\mathop {=}\limits ^{\triangle }X^{+}y\) minimizes \(\Vert Xv - y\Vert ^{2}\). Moreover, \(X^+y{\,}+{\,}\mathcal {N}(X)\mathop {=}\limits ^{\triangle }\{X^+y{\,}+{\,}w \,;\, w \in \mathcal {N}(X)\}\) is the set of least-squares solutions of (3.55). Therefore, \(X^{+}y\) gives the minimum norm least-squares solution of (3.55).

Proof

Because \(\{Xv \,;\, v \in \mathbb {R}^{n}\}= \mathcal {R}(X)\), the quantity \(\Vert Xv - y\Vert ^{2}\) is minimized iff \(Xv = P_{\mathcal {R}(X)}y\). This is attained for \(v= X^{+}y\), since \(XX^{+}= P_{\mathcal {R}(X)}\). If \(v= X^{+}y+ w, w \in \mathcal {N}(X)\), then

$$\begin{aligned} Xv&= X(X^{+}y + w) \\&= XX^{+}y + Xw \\&= XX^{+}y\\&= P_{\mathcal {R}(X)} y. \end{aligned}$$

Therefore, any element in \(X^{+}y+\mathcal {N}(X)\) is a least-squares solution of (3.55). Conversely, suppose \(v= X^{+}y+ w,\ w \in \mathbb {R}^{n}\), is a least-squares solution of (3.55). Decomposition of w as \(w= w_{1}+w_{2},\,w_{1} \in \mathcal {N}(X)^{\perp }, w_{2} \in \mathcal {N}(X)\) leads to

$$\begin{aligned} Xv&= XX^{+}y + Xw_{1}+Xw_{2} \\&= XX^{+}y + Xw_{1}. \end{aligned}$$

Because \(Xv= P_{\mathcal {R}(X)}y= XX^{+}y\), \(Xw_{1}=0\). Therefore, \(w_{1} \in \mathcal {N}(X)^{\perp } \cap \mathcal {N}(X)= \{0\}\). This shows that \(v= X^{+}y+ w_{2} \in X^{+}y+\mathcal {N}(X)\).

Since \(X^{+}y \perp \mathcal {N}(X)\), \(X^{+}y\) is the minimum norm element in \(X^{+}y+\mathcal {N}(X)\). \(\square \)

Suppose a linear equation has at least one solution. Then, a least-squares solution is a solution, and vice versa. Thus, we have the following corollary:

Corollary 3.1

If (3.55) has a solution, \(X^{+}y+\mathcal {N}(X)\) gives the set of solutions, and \(X^{+}y\) gives the minimum norm solution.
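The following NumPy sketch illustrates Corollary 3.1 for an underdetermined, consistent system: \(X^{+}y\) solves the equation, so does any element of \(X^{+}y + \mathcal {N}(X)\), and \(X^{+}y\) has the smallest norm. Constructing a null-space basis from the SVD is an implementation choice, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 7))                  # full row rank a.s., so Xv = y is consistent
y = rng.standard_normal(3)

v0 = np.linalg.pinv(X) @ y                       # X^+ y
_, _, Vt = np.linalg.svd(X)
N = Vt[3:].T                                     # orthonormal basis of N(X), shape (7, 4)

v1 = v0 + N @ rng.standard_normal(4)             # another element of X^+ y + N(X)
print(np.allclose(X @ v0, y), np.allclose(X @ v1, y))  # both solve Xv = y
print(np.linalg.norm(v0) <= np.linalg.norm(v1))        # X^+ y is the minimum norm solution
```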

Corollary 3.2

For any \(m\times n\) matrix X,

$$\begin{aligned} X^{+}&= (X^{t}X)^{+}X^{t} \end{aligned}$$
(3.56)
$$\begin{aligned}&= X^{t}(XX^{t})^{+}. \end{aligned}$$
(3.57)

Proof

Let us prove (3.56) first. By Theorem 3.8, \(v= X^{+}y\) gives the minimum norm least-squares solution of (3.55). On the other hand, v minimizes \(\Vert Xv - y\Vert ^{2}\) iff \((Xv - y)\perp \mathcal {R}(X)\), that is,

$$\begin{aligned} X^{t}(Xv - y)= X^{t}Xv - X^{t}y = 0. \end{aligned}$$

The minimum norm solution \(v^{\prime }\) of this equation is given by \(v^{\prime }= (X^{t}X)^{+}X^{t}y\). Because the minimum norm least-squares solution is unique, \(v= v^{\prime }\). Thus, \(X^{+}y = (X^{t}X)^{+}X^{t}y\) for any \(y \in \mathbb {R}^{m}\), which shows (3.56).

Replace X with \(X^{t}\) in (3.56):

$$\begin{aligned} (X^{t})^{+}= (XX^{t})^{+}X. \end{aligned}$$
(3.58)

By Theorem 3.6(2), \(((X^{t})^{+})^{t}= X^{+}\), and \(((XX^{t})^{+})^{t}= (XX^{t})^{+}\). Therefore, taking the transpose of both sides of (3.58), we immediately obtain (3.57).\(\square \)

Affine Projection

A subset \(\Pi \) of \(\mathbb {R}^{n}\) is called an affine subspace of \(\mathbb {R}^{n}\) iff there exists an element \(a \in \Pi \) such that

$$\begin{aligned} \Pi - a \mathop {=}\limits ^{\triangle }\{x-a\, ;\, x \in \Pi \} \end{aligned}$$

is a linear subspace of \(\mathbb {R}^{n}\). The element a is called the origin of \(\Pi \), and \(\Pi -a\) the linear subspace associated with \(\Pi \). The dimension of \(\Pi \) is defined as \(\dim \Pi \mathop {=}\limits ^{\triangle }\dim (\Pi -a)\).

Theorem 3.9

Let \(\Pi \) be an affine subspace of \(\mathbb {R}^{n}\), and a its origin. Then, for any \(b \in \Pi \),

$$\begin{aligned} \Pi - a = \Pi - b. \end{aligned}$$

That is, any element of \(\Pi \) can be chosen as its origin, and the linear subspace associated with \(\Pi \) is independent of the choice of the origin.

Proof

If we denote \(\mathbb {V}\mathop {=}\limits ^{\triangle }\Pi -a\), \(\Pi \) is represented as \(\Pi = \mathbb {V} + a\). Then, noting \(b\,-\,a \in \mathbb {V}\), we have \(\Pi = \mathbb {V}\,-\,(b\,-\,a)\,+\,b = \mathbb {V}\,+\,b\). This shows \(\Pi \,-\,a = \mathbb {V}= \Pi \,-\,b\).    \(\square \)

Theorem 3.10

If \(\Pi _{1}\) and \(\Pi _{2}\) are affine subspaces of \(\mathbb {R}^{n}\) satisfying \(\Pi _{1}\cap \Pi _{2}\) \(\ne \emptyset \), then \(\Pi _{1} \cap \Pi _{2}\) is an affine subspace of \(\mathbb {R}^{n}\).

Proof

For arbitrarily chosen \(a \in \Pi _{1}\cap \Pi _{2}\), let

$$\begin{aligned} \mathbb {V}_{1}\mathop {=}\limits ^{\triangle }\Pi _{1} -a,\ \ \ \ \mathbb {V}_{2}\mathop {=}\limits ^{\triangle }\Pi _{2}-a. \end{aligned}$$

Then, it is shown that

$$\begin{aligned} \Pi _{1} \cap \Pi _{2} - a = \mathbb {V}_{1}\cap \mathbb {V}_{2}. \end{aligned}$$
(3.59)

In fact, if \(v \in \Pi _{1} \cap \Pi _{2} - a\), there exists \(w \in \Pi _{1} \cap \Pi _{2}\) such that \(v=w-a\). Because \(w \in \Pi _{1}\), we have \(v \in \mathbb {V}_{1}\). Also, because \(w \in \Pi _{2}\), we have \(v \in \mathbb {V}_{2}\). Therefore, \(v \in \mathbb {V}_{1}\cap \mathbb {V}_{2}\). Conversely, if \(v \in \mathbb {V}_{1}\cap \mathbb {V}_{2}\), then \(v \in \mathbb {V}_{1}\). Therefore, there exists \(w_{1} \in \Pi _{1}\) such that

$$\begin{aligned} v= w_{1}-a. \end{aligned}$$
(3.60)

Also, because \(v \in \mathbb {V}_{2}\), there exists \(w_{2} \in \Pi _{2}\) such that

$$\begin{aligned} v= w_{2}-a. \end{aligned}$$
(3.61)

From (3.60) and (3.61), \(w_{1}= w_{2}\). Let \(w\mathop {=}\limits ^{\triangle }w_{1}= w_{2}\). Then, \(w\in \Pi _{1}\cap \Pi _{2}\) and \(v= w-a\). Therefore, \(v \in \Pi _{1}\cap \Pi _{2} - a\).

Because \(\mathbb {V}_{1}\cap \mathbb {V}_{2}\) is a linear subspace of \(\mathbb {R}^{n}\), (3.59) shows that \(\Pi _{1}\cap \Pi _{2}\) is an affine subspace.\(\square \)

We can immediately generalize the above theorem.

Corollary 3.3

If \(\Pi _{1}\), \(\Pi _{2}\), \(\ldots \) , \(\Pi _{p}\) are affine subspaces of \(\mathbb {R}^{n}\) satisfying \(\Pi _{1} \cap \Pi _{2} \cap \cdots \cap \Pi _{p} \ne \emptyset \), then \(\Pi _{1} \cap \Pi _{2} \cap \cdots \cap \Pi _{p}\) is an affine subspace of \(\mathbb {R}^{n}\).

Let \(\Pi \) be an affine subspace of \(\mathbb {R}^{n}\), and \(\mathbb {V}\) the associated linear subspace. An element \(v \in \mathbb {R}^{n}\) and \(\Pi \) are said to be orthogonal and denoted by \(v \perp \Pi \) iff \(v \perp \mathbb {V}\). An element \(v^{\prime } \in \Pi \) is called the affine projection of \(v \in \mathbb {R}^{n}\) onto \(\Pi \) iff \((v-v^{\prime }) \perp \Pi \). The affine projection \(v^{\prime } \in \Pi \) is uniquely determined for v. In fact, suppose \(v^{\prime }\) and \(v^{\prime \prime }\) are both affine projections of v. Then, by definition, \(v^{\prime }- v \in \mathbb {V}^{\perp }\). Also, \(v^{\prime \prime }- v \in \mathbb {V}^{\perp }\). Since \(\mathbb {V}^{\perp }\) is a linear subspace of \(\mathbb {R}^{n}\), \(v^{\prime }- v^{\prime \prime } \in \mathbb {V}^{\perp }\). On the other hand, there exists \(w^{\prime } \in \mathbb {V}\) such that \(v^{\prime }= w^{\prime } + a\), where a is the origin of \(\Pi \). In the same way, there exists \(w^{\prime \prime } \in \mathbb {V}\) such that \(v^{\prime \prime }= w^{\prime \prime } + a\). Hence, \(v^{\prime }- v^{\prime \prime }= w^{\prime }- w^{\prime \prime } \in \mathbb {V}\). Thus, we have \(v^{\prime }- v^{\prime \prime }\in \mathbb {V}\cap \mathbb {V}^{\perp }= \{0\}\), from which \(v^{\prime }= v^{\prime \prime }\) is concluded.

The mapping that maps \(v \in \mathbb {R}^{n}\) to its affine projection \(v^{\prime } \in \Pi \) is also called the affine projection onto \(\Pi \), and denoted by \(P_{\Pi }\). Just as in the case of orthogonal projection onto a linear subspace, \(P_{\Pi }v\) is characterized as the unique element \(v^{\prime } \in \Pi \) that minimizes \(\Vert v^{\prime }- v\Vert \).

Now, given \(x_{1}, x_{2}, \ldots , x_{m} \in \mathbb {R}^{n}\) and \(y_{1}, y_{2}, \ldots , y_{m} \in \mathbb {R}\ (m \le n)\), let us consider a system of linear equations

$$\begin{aligned} \langle x_{k},\,v \rangle = y_{k}\ \ \ (k=1, 2, \ldots , m), \end{aligned}$$
(3.62)

where \(v \in \mathbb {R}^{n}\) is the unknown vector. Using the matrix and vector notation

$$\begin{aligned} X&\mathop {=}\limits ^{\triangle }\begin{bmatrix} x_{1}&x_{2}&\cdots&x_{m} \end{bmatrix}^{t},\\ y&\mathop {=}\limits ^{\triangle }(y_{1}, y_{2}, \ldots , y_{m})^{t}, \end{aligned}$$

we can rewrite (3.62) as

$$\begin{aligned} Xv = y. \end{aligned}$$
(3.63)

Note that X is an \(m \times n\) matrix, and y an m-dimensional vector.

By Theorem 3.8, \(\Pi \mathop {=}\limits ^{\triangle }X^{+}y + \mathcal {N}(X)\) gives the set of least-squares solutions of (3.63). If (3.63) has at least one solution, then, by Corollary 3.1, \(\Pi \) gives the set of solutions. Note that \(\Pi \) is an affine subspace of \(\mathbb {R}^{n}\), with the origin \(X^{+}y\), and the associated linear subspace \(\mathcal {N}(X)\).

Theorem 3.11

\(\mathrm{rank}\,X + \dim \Pi = n\).

Proof

As stated in “Moore-Penrose Pseudoinverse” above, there is a one-to-one linear mapping from \(\mathcal {N}(X)^{\perp }\) onto \(\mathcal {R}(X)\). Therefore, \(\dim \mathcal {R}(X) = \dim \mathcal {N}(X)^{\perp }\), so that

$$\begin{aligned} \mathrm{rank}\,X + \dim \Pi&= \dim \mathcal {R}(X) + \dim \mathcal {N}(X) \\&= \dim \mathcal {N}(X)^{\perp } + \dim \mathcal {N}(X) \\&= \dim \mathbb {R}^{n} \\&= n. \end{aligned}$$

\(\square \)

If X is composed of a single, nonzero row, then \(\mathrm{rank}\,X = 1\). Therefore, by Theorem 3.11, \(\dim \Pi = n-1\). Such an affine subspace is called a hyperplane. If \(x_{k} \ne 0\), each equation \(\langle x_{k},\,v \rangle = y_{k}\) in (3.62) determines a hyperplane. Let it be denoted by \(\Pi _{k}= \{v \in \mathbb {R}^{n} \,;\, \langle x_{k},\,v \rangle = y_{k}\}\). As illustrated in Fig. 3.9, \(\Pi _{k}\) is a hyperplane that is orthogonal to \(x_{k}\) and passes through the point \(p= y_{k}x_{k}/\Vert x_{k}\Vert ^{2}\). If (3.63) has a solution, the set of solutions \(\Pi \) is the intersection of such hyperplanes: \(\Pi = \Pi _{1}\cap \Pi _{2}\cap \cdots \cap \Pi _{m}\).

Fig. 3.9 Hyperplane determined by the equation \(\langle x_{k},\,v \rangle = y_{k}\); \(p= y_{k}x_{k}/\Vert x_{k}\Vert ^{2}\)

The angle between two hyperplanes \(\Pi _{i}\) and \(\Pi _{j}\) is defined to be the angle between \(x_{i}\) and \(x_{j}\).

Theorem 3.12

A vector \(v \in \mathbb {R}^{n}\) and \(\Pi =X^{+}y + \mathcal {N}(X)\) are orthogonal iff v is a linear combination of \(x_{1}\), \(x_{2}\), \(\ldots \) , \(x_{m}\).

Proof

Let \(\mathbb {V}\) be the linear subspace spanned by \(\{x_{1}, x_{2}, \ldots , x_{m}\}\). Then, by Lemma 3.3, \(\mathbb {V}= \mathcal {R}(X^{t})= \mathcal {N}(X)^{\perp }\), from which the theorem is obvious. \(\square \)

Theorem 3.13

For \(\Pi = X^{+}y + \mathcal {N}(X)\), the affine projection \(P_{\Pi }\) is given by

$$\begin{aligned} P_{\Pi }v = v + X^{+}(y-Xv)\ \ (v \in \mathbb {R}^{n}). \end{aligned}$$
(3.64)

Proof

Let \(w \mathop {=}\limits ^{\triangle }v + X^{+}(y-Xv)\). We first show that w is an element of \(\Pi \). Note that w can be rewritten as \(w = X^{+}y +(v-X^{+}Xv)\). By Lemma 3.1, \(XX^{+}= P_{\mathcal {R}(X)}\). Since \(Xv \in \mathcal {R}(X)\),

$$\begin{aligned} X(v-X^{+}Xv)&= Xv- XX^{+}Xv \\&= Xv - P_{\mathcal {R}(X)}Xv \\&= Xv -Xv \\&= 0. \end{aligned}$$

Therefore, we have \(v-X^{+}Xv \in \mathcal {N}(X)\). Hence, \(w = X^{+}y +(v-X^{+}Xv) \in X^{+}y + \mathcal {N}(X) = \Pi \). Furthermore, because \((w - v) = X^{+}(y-Xv) \in \mathcal {R}(X^{+})= \mathcal {N}(X)^{\perp }\), we have \((w - v) \perp \mathcal {N}(X)\). Hence, by definition, \((w - v) \perp \Pi \). Thus we have shown \(w= P_{\Pi }v\).\(\square \)

Fig. 3.10 Geometrical meaning of Theorem 3.13

Figure 3.10 illustrates the geometrical meaning of Theorem 3.13. In the figure, we see two interpretations of \(P_{\Pi }v\):

$$\begin{aligned} P_{\Pi }v= v+ (X^{+}y - P_{\mathcal {N}(X)^{\perp }}v)\ \ \ \mathrm{and} \ \ \ P_{\Pi }v= X^{+}y + P_{\mathcal {N}(X)}v. \end{aligned}$$
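Formula (3.64) is easy to verify numerically. In the following NumPy sketch (illustrative data), the projected vector lies in \(\Pi \), and the correction \(P_{\Pi }v - v\) is orthogonal to \(\mathcal {N}(X)\), hence to \(\Pi \):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 7))                  # rows x_1, x_2, x_3 define Pi through Xv = y
y = rng.standard_normal(3)
v = rng.standard_normal(7)

Pv = v + np.linalg.pinv(X) @ (y - X @ v)         # affine projection (3.64)
print(np.allclose(X @ Pv, y))                    # the projected vector lies in Pi

_, _, Vt = np.linalg.svd(X)
N = Vt[3:].T                                     # orthonormal basis of N(X)
print(np.allclose(N.T @ (Pv - v), 0))            # the correction is orthogonal to N(X)
```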

Appendix 2: Condition Number

3.1.1 Natural Norm of a Matrix

The natural norm \(\Vert A\Vert \) of an \(n\times n\) matrix A is defined by

$$\begin{aligned} \Vert A\Vert \mathop {=}\limits ^{\triangle }\sup _{x\ne 0} \frac{\Vert Ax\Vert }{\Vert x\Vert }. \end{aligned}$$

The natural norm is also called the induced norm, induced from the vector norm used in the definition.

Lemma 3.4

The natural norm has the following properties:

(1) \(\Vert A\Vert \ge 0\), and \(\Vert A\Vert =0\) implies \(A= 0\).

(2) For a scalar c, \(\Vert cA\Vert =|c|\,\Vert A\Vert \).

(3) \(\Vert A+B\Vert \le \Vert A\Vert + \Vert B\Vert \).

(4) \(\Vert AB\Vert \le \Vert A\Vert \cdot \Vert B\Vert \).

(5) \(\Vert I\Vert = 1\).

Proof

(1) Since \(\Vert Ax\Vert /\Vert x\Vert \ge 0\), \(\Vert A\Vert \ge 0\) is obvious. If \(A\ne 0\), there exists \(\tilde{x} \ne 0\) such that \(A\tilde{x}\ne 0\). Since \(\Vert A\tilde{x}\Vert /\Vert \tilde{x}\Vert > 0\), \(\Vert A\Vert > 0\).

(2) Since \(\Vert cAx\Vert =|c|\,\Vert Ax\Vert \),

$$\begin{aligned} \Vert cA\Vert&=\sup _{x\ne 0}\Vert cAx\Vert /\Vert x\Vert \\&=|c|\sup _{x\ne 0}\Vert Ax\Vert /\Vert x\Vert \\&= |c|\,\Vert A\Vert . \end{aligned}$$

(3) By the triangle inequality for the vector norm, we have

$$\begin{aligned} \Vert A+B\Vert&= \sup _{x\ne 0} \frac{\Vert (A+B)x\Vert }{\Vert x\Vert }\\&\le \sup _{x\ne 0} \frac{\Vert Ax\Vert +\Vert Bx\Vert }{\Vert x\Vert }\\&\le \sup _{x\ne 0} \frac{\Vert Ax\Vert }{\Vert x\Vert }+\sup _{x\ne 0}\frac{\Vert Bx\Vert }{\Vert x\Vert }\\&= \Vert A\Vert + \Vert B\Vert . \end{aligned}$$

(4) First note that for any vector y,

$$\begin{aligned} \Vert Ay\Vert \le \Vert A\Vert \cdot \Vert y\Vert . \end{aligned}$$
(3.65)

In fact, if \(y= 0\), this is obvious. If \(y\ne 0\), then

$$\begin{aligned} \Vert Ay\Vert /\Vert y\Vert \le \sup _{x\ne 0} \Vert Ax\Vert /\Vert x\Vert = \Vert A\Vert , \end{aligned}$$

which leads to (3.65). Using this inequality, we have

$$\begin{aligned} \Vert AB\Vert&= \sup _{x\ne 0}\frac{\Vert ABx\Vert }{\Vert x\Vert }\\&\le \sup _{x\ne 0}\frac{\Vert A\Vert \cdot \Vert Bx\Vert }{\Vert x\Vert }\\&= \Vert A\Vert \sup _{x\ne 0}\frac{\Vert Bx\Vert }{\Vert x\Vert }\\&= \Vert A\Vert \cdot \Vert B\Vert . \end{aligned}$$

(5) is obvious from the definition of the natural norm. \(\square \)

Lemma 3.5

If A is an \(n\times n\) Hermitian matrix with eigenvalues \(\lambda _{1}, \lambda _{2}, \ldots , \lambda _{n}\) (Appendix A), then

$$\begin{aligned} \Vert A\Vert = \max _{k} |\lambda _{k}|. \end{aligned}$$

Proof

Let \(\{x_{1}, x_{2}, \ldots , x_{n}\}\) be an orthonormal basis of \(\mathbb {C}^{n}\) consisting of eigenvectors corresponding to the eigenvalues \(\lambda _{1}, \lambda _{2}, \ldots , \lambda _{n}\). Then, any vector \(x \in \mathbb {C}^{n}\) is represented as \(x= \sum _{k=1}^{n} c_{k}x_{k}\) with complex coefficients \(c_{k}\). Using this representation, we have

$$\begin{aligned} \Vert Ax\Vert ^{2}&= \left\| \sum _{k=1}^{n} c_{k}\lambda _{k}x_{k}\right\| ^{2}\\&= \sum _{k=1}^{n} |\lambda _{k}|^{2} |c_{k}|^{2}\\&\le \max _{k}|\lambda _{k}|^{2} \sum _{i=1}^{n}|c_{i}|^{2}. \end{aligned}$$

We also have

$$\begin{aligned} \Vert x\Vert ^{2}&= \left\| \sum _{k=1}^{n} c_{k}x_{k}\right\| ^{2}\\&= \sum _{i=1}^{n} |c_{i}|^{2}. \end{aligned}$$

Thus, if \(x \ne 0\),

$$\begin{aligned} \frac{\Vert Ax\Vert }{\Vert x\Vert }\le \max _{k}|\lambda _{k}|. \end{aligned}$$
(3.66)

On the other hand, let \(m\mathop {=}\limits ^{\triangle }\mathop {\mathrm{argmax}}\nolimits _{k} |\lambda _{k}|\). Then,

$$\begin{aligned} \frac{\Vert Ax_{m}\Vert }{\Vert x_{m}\Vert }= \frac{|\lambda _{m}|\cdot \Vert x_{m}\Vert }{\Vert x_{m}\Vert }= |\lambda _{m}|= \max _{k}|\lambda _{k}|. \end{aligned}$$
(3.67)

From (3.66) and (3.67),

$$\begin{aligned} \Vert A\Vert = \sup _{x \ne 0}\frac{\Vert Ax\Vert }{\Vert x\Vert }= \max _{k}|\lambda _{k}|. \end{aligned}$$

\(\square \)
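For a real symmetric (hence Hermitian) matrix, Lemma 3.5 says that the natural norm induced by the Euclidean norm equals the largest eigenvalue magnitude. A quick NumPy check, using the library's spectral norm as the induced norm:

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                                # real symmetric test matrix

induced_norm = np.linalg.norm(A, 2)              # natural norm induced by the Euclidean norm
print(np.allclose(induced_norm, np.max(np.abs(np.linalg.eigvalsh(A)))))
```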

3.1.2 Condition Number

The condition number \(\kappa (A)\) of a nonsingular matrix A is defined by

$$\begin{aligned} \kappa (A)\mathop {=}\limits ^{\triangle }\Vert A\Vert \cdot \Vert A^{-1}\Vert . \end{aligned}$$

By (4) and (5) of Lemma 3.4, it is easily verified that \(\kappa (A)\ge 1\).

If A is a Hermitian matrix with eigenvalues \(\lambda _{1}, \lambda _{2}, \ldots , \lambda _{n}\), then

$$\begin{aligned} \kappa (A)= \frac{\max _{k}|\lambda _{k}|}{\min _{k}|\lambda _{k}|}. \end{aligned}$$
(3.68)

This is immediately proved by Lemma 3.5 if we note that the eigenvalues of \(A^{-1}\) are \(1/\lambda _{1}, 1/\lambda _{2}, \ldots , 1/\lambda _{n}\).
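Equation (3.68) can be checked with NumPy for a symmetric positive definite test matrix; the 2-norm condition number computed by the library gives the same ratio:

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((5, 5))
A = B @ B.T + np.eye(5)                          # symmetric positive definite, hence nonsingular

lam = np.linalg.eigvalsh(A)
kappa_eig = np.max(np.abs(lam)) / np.min(np.abs(lam))    # (3.68)
print(np.allclose(kappa_eig, np.linalg.cond(A, 2)))      # same as the 2-norm condition number
```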

Theorem 3.14

Let us consider a linear equation

$$\begin{aligned} Ax= b, \end{aligned}$$
(3.69)

where A is an \(n\times n\) nonsingular matrix, and \(b \in \mathbb {C}^{n}\).

(1) If the solution x changes to \(x+ \Delta x\) when b changes to \(b+\Delta b\), the following inequality holds:

$$\begin{aligned} \frac{\Vert \Delta x\Vert }{\Vert x\Vert }\le \kappa (A) \frac{\Vert \Delta b\Vert }{\Vert b\Vert }. \end{aligned}$$

(2) Let \(\Delta A\) be an \(n \times n\) matrix satisfying \(\Vert A^{-1} \Delta A\Vert < 1\). If the solution x changes to \(x+ \Delta x\) when A changes to \(A+\Delta A\), the following inequality holds:

$$\begin{aligned} \frac{\Vert \Delta x\Vert }{\Vert x\Vert }\le \frac{\kappa (A)}{1-\Vert A^{-1} \Delta A\Vert }\, \frac{\Vert \Delta A\Vert }{\Vert A\Vert }. \end{aligned}$$

Proof

(1) From (3.69) and \(A(x+\Delta x)= b+\Delta b\), we have \(A \Delta x = \Delta b\). Therefore,

$$\begin{aligned} \Vert \Delta x \Vert = \Vert A^{-1} \Delta b\Vert \le \Vert A^{-1}\Vert \cdot \Vert \Delta b\Vert . \end{aligned}$$

On the other hand, \(\Vert b\Vert =\Vert Ax\Vert \le \Vert A\Vert \cdot \Vert x\Vert \). Thus,

$$\begin{aligned} \frac{\Vert \Delta x\Vert }{\Vert x\Vert }&\le \Vert A^{-1}\Vert \frac{\Vert \Delta b\Vert }{\Vert x\Vert }\\&\le \Vert A^{-1}\Vert \cdot \Vert A\Vert \frac{\Vert \Delta b\Vert }{\Vert b\Vert }\\&= \kappa (A)\frac{\Vert \Delta b\Vert }{\Vert b\Vert }. \end{aligned}$$

(2) Since \((A+ \Delta A)(x+ \Delta x)= b= Ax\), we have

$$\begin{aligned} (A+\Delta A)\Delta x= -(\Delta A) x. \end{aligned}$$

From the assumption that \(\Vert A^{-1}\Delta A\Vert < 1\), we have for an arbitrary nonzero vector y,

$$\begin{aligned} \Vert A^{-1}(\Delta A) y\Vert < \Vert y\Vert . \end{aligned}$$

Therefore,

$$\begin{aligned} \Vert (I+A^{-1}\Delta A) y\Vert&\ge \Vert Iy\Vert -\Vert A^{-1}(\Delta A) y\Vert \\&> \Vert y\Vert -\Vert y\Vert \\&= 0. \end{aligned}$$

This shows \((I\,+\,A^{-1}\Delta A) y \ne 0\), so that \(I+A^{-1}\Delta A\) is nonsingular. Hence, \(A\,+\, \Delta A= A(I+A^{-1}\Delta A)\) is also nonsingular. Now, let \(B\mathop {=}\limits ^{\triangle }A^{-1}\Delta A\) and \(C\mathop {=}\limits ^{\triangle }(I+ B)^{-1}\). Then, since \(I=C+BC\),

$$\begin{aligned} \Vert C\Vert (1-\Vert B\Vert )&= \Vert C\Vert -\Vert C\Vert \cdot \Vert B\Vert \\&\le \Vert C\Vert -\Vert BC\Vert \\&\le \Vert C+BC\Vert \\&= \Vert I\Vert \\&= 1. \end{aligned}$$

From these facts,

$$\begin{aligned} \frac{\Vert \Delta x\Vert }{\Vert x\Vert }&= \frac{\Vert -(A+\Delta A)^{-1}(\Delta A) x\Vert }{\Vert x\Vert }\\&= \frac{\Vert -CA^{-1}(\Delta A) x \Vert }{\Vert x\Vert }\\&\le \Vert C\Vert \cdot \Vert A^{-1}\Vert \cdot \Vert \Delta A\Vert \\&=\Vert A\Vert \cdot \Vert A^{-1}\Vert \cdot \Vert C\Vert \frac{\Vert \Delta A\Vert }{\Vert A\Vert }\\&\le \frac{\kappa (A)}{1-\Vert B\Vert }\,\frac{\Vert \Delta A\Vert }{\Vert A\Vert }\\&= \frac{\kappa (A)}{1-\Vert A^{-1}\Delta A\Vert }\,\frac{\Vert \Delta A\Vert }{\Vert A\Vert }. \end{aligned}$$

\(\square \)

Theorem 3.14 shows that if \(\kappa (A)\) is close to 1, then the relative change in the solution x is comparable to the relative change in A or b. If \(\kappa (A)\) is large, however, this is not guaranteed: a small relative change in A or b can then cause a very large relative change in the solution x. For this reason, when the condition number \(\kappa (A)\) is large, the linear equation (3.69) is said to be ill-conditioned.
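The following small example (constructed for illustration, not taken from the text) shows the effect for a nearly singular \(2\times 2\) system: a relative perturbation of b of the order of \(10^{-5}\) changes the solution completely, while the bound of Theorem 3.14(1) is respected:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])                    # nearly singular; kappa(A) is about 4 * 10^4
b = np.array([2.0, 2.0001])
x = np.linalg.solve(A, b)                        # exact solution (1, 1)

db = np.array([0.0, 1e-4])                       # tiny perturbation of b
dx = np.linalg.solve(A, b + db) - x              # resulting change of the solution

rel_x = np.linalg.norm(dx) / np.linalg.norm(x)
rel_b = np.linalg.norm(db) / np.linalg.norm(b)
print(np.linalg.cond(A), rel_x / rel_b)          # large condition number, large amplification
print(rel_x <= np.linalg.cond(A) * rel_b)        # bound of Theorem 3.14(1)
```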

Appendix 3: The Method of Lagrange Multipliers

The method of Lagrange multipliers is a method of solving constrained optimization problems. Let \(f(x), g_{0}(x), g_{1}(x), \ldots , g_{m-1}(x)\) be real differentiable functions defined on a domain D in \(\mathbb {R}^{n}\), and let us consider the problem of minimizing f(x) subject to constraints \(g_{k}(x)= 0\ \ (k=0, 1, \ldots , m-1)\). From these functions we construct the Lagrangian function as

$$\begin{aligned} L(x, \lambda _{0}, \lambda _{1}, \ldots , \lambda _{m-1}) \mathop {=}\limits ^{\triangle }f(x) + \sum _{k=0}^{m-1} \lambda _{k} g_{k}(x), \end{aligned}$$

where \(\lambda _{k} \in \mathbb {R}\ \ (k=0, 1, \ldots , m-1)\) are variables called the Lagrange multipliers. We denote the Jacobian matrix of the constraining functions at x by J(x):

$$\begin{aligned} J(x) \mathop {=}\limits ^{\triangle }\left[ \frac{\partial g_{i}}{\partial x_{j}} (x) \right] . \end{aligned}$$

A point \(x^{*} \in D\) is called a regular point iff it satisfies the constraints and \(\mathrm{rank}\,J(x^{*}) = m\). The latter condition is equivalent to the statement that the vectors \(\nabla _{x} g_{k}(x^{*})\ (k=0, 1, \ldots , m-1)\) are linearly independent.

The following theorem is well known in the field of nonlinear programming [11, p. 300].

Theorem 3.15

Let \(x^{*}\) be a regular point. If \(x^{*}\) minimizes f(x), then there exist \(\lambda _{k}^{*}\ (k=0, 1, \ldots , m-1)\) such that

$$\begin{aligned} \frac{\partial L}{\partial x}(x^{*}, \lambda _{0}^{*}, \lambda _{1}^{*}, \ldots , \lambda _{m-1}^{*})= 0. \end{aligned}$$
(3.70)

Solving (3.70) with the use of the constraining equations, we can find \(x^{*}\) that is a candidate for \(\mathop {\mathrm{argmin}}\nolimits _{x} f(x)\) under the constraints.
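As a worked example (constructed for illustration, and chosen to connect with Appendix 1), consider minimizing \(f(v)= \frac{1}{2}\Vert v- v_{0}\Vert ^{2}\) subject to \(Xv= y\), with X of full row rank. The stationarity condition \(\nabla _{v}L= v- v_{0}+ X^{t}\lambda = 0\) together with the constraint gives \(\lambda = (XX^{t})^{-1}(Xv_{0}- y)\) and \(v^{*}= v_{0}+ X^{t}(XX^{t})^{-1}(y- Xv_{0})\), which is exactly the affine projection (3.64) of \(v_{0}\) onto \(\Pi \). The following NumPy sketch confirms this:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((3, 7))                  # constraints g(v) = Xv - y = 0, full row rank a.s.
y = rng.standard_normal(3)
v0 = rng.standard_normal(7)

lam = np.linalg.solve(X @ X.T, X @ v0 - y)       # Lagrange multipliers
v_star = v0 - X.T @ lam                          # stationary point of the Lagrangian

print(np.allclose(X @ v_star, y))                                   # constraints satisfied
print(np.allclose(v_star, v0 + np.linalg.pinv(X) @ (y - X @ v0)))   # equals the affine projection (3.64)
```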

Appendix 4: Newton’s Method

A twice continuously differentiable function f defined on \(\mathbb {R}^{n}\) can be approximated by a quadratic function on a neighborhood of each point \(x^{o} \in \mathbb {R}^{n}\) as

$$\begin{aligned} f(x) \approx f(x^{o}) + (\nabla _{x}f(x^{o}))^{t}(x-x^{o}) + \frac{1}{2}(x-x^{o})^{t} \nabla _{x}^{2}f(x^{o}) (x-x^{o}). \end{aligned}$$

If the Hessian matrix \(\nabla _{x}^{2}f(x^{o})\) is positive definite, the right-hand side of this equation is minimized by solving the equation

$$\begin{aligned} \nabla _{x} \left\{ f(x^{o}) + (\nabla _{x}f(x^{o}))^{t}(x-x^{o}) + \frac{1}{2}(x-x^{o})^{t} \nabla _{x}^{2}f(x^{o}) (x-x^{o})\right\} = 0. \end{aligned}$$

By using the formula (A.17) (Appendix A), this equation can be rewritten as

$$\begin{aligned} \nabla _{x}f(x^{o}) + \nabla _{x}^{2}f(x^{o}) (x-x^{o})= 0, \end{aligned}$$

from which the minimum point \(x^{m}\) is obtained as

$$\begin{aligned} x^{m}= x^{o} - (\nabla _{x}^{2}f(x^{o}))^{-1}\nabla _{x}f(x^{o}). \end{aligned}$$

Motivated by this fact, Newton’s method searches for \(\mathop {\mathrm{argmin}}\nolimits _{x}f(x)\) by iterating the following computation starting from an initial point \(x_{0}\):

$$\begin{aligned} x_{k+1}= x_{k} - \mu (\delta I+ \nabla _{x}^{2}f(x_{k}))^{-1}\nabla _{x}f(x_{k}), \end{aligned}$$
(3.71)

where \(\delta > 0\), and I is the \(n \times n\) identity matrix. The regularization term \(\delta I\) is added so that the algorithm does not fail even if \(\nabla _{x}^{2}f(x_{k})\) is not invertible. Just as in the case of the steepest descent algorithm, an index-dependent step-size \(\mu (k)\) may be used. Under a certain condition, the vector sequence \(x_{0}, x_{1}, x_{2}, \ldots \) converges to a local minimum point. If the initial point \(x_{0}\) and the step-size \(\mu \) are appropriately chosen, the sequence converges to \(\mathop {\mathrm{argmin}}\nolimits _{x} f(x)\).
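A minimal NumPy sketch of the iteration (3.71), applied to the illustrative strictly convex function \(f(x)= \frac{1}{4}\sum _{i}x_{i}^{4}+ \frac{1}{2}x^{t}Qx- b^{t}x\), whose gradient is \(x^{3}+ Qx- b\) (the cube taken componentwise) and whose Hessian is \(\mathrm{diag}(3x_{i}^{2})+ Q\):

```python
import numpy as np

Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])                       # positive definite
b = np.array([1.0, -2.0])

def grad(x):
    return x**3 + Q @ x - b                      # gradient of f

def hess(x):
    return np.diag(3 * x**2) + Q                 # Hessian of f (positive definite everywhere)

x = np.array([5.0, -5.0])                        # initial point x_0
mu, delta = 1.0, 1e-8                            # step-size and regularization as in (3.71)
for _ in range(20):
    x = x - mu * np.linalg.solve(delta * np.eye(2) + hess(x), grad(x))

print(x, np.linalg.norm(grad(x)))                # the gradient is (numerically) zero at the minimizer
```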

Appendix 5: Linear Prediction

Let \(\{x(k)\}\) be a real signal. We make a linear combination \(\hat{x}(j)\mathop {=}\limits ^{\triangle }\sum _{i=1}^{m}f_{i}x(j-i)\) of \(x(j-1)\), \(x(j-2)\), \(\ldots \) , \(x(j-m)\), where \(f_{1}\), \(f_{2}\), \(\ldots \) , \(f_{m}\) are real coefficients. Let us determine the coefficients so that the function

$$\begin{aligned} L_{f}(f_{1},f_{2}, \ldots , f_{m})\mathop {=}\limits ^{\triangle }\sum _{j=k}^{k-(n-1)} (x(j) + \hat{x}(j))^{2} \end{aligned}$$

is minimized. That is, we want to determine the coefficients so that \(-\hat{x}(j)\) becomes the least-mean-squares estimate of x(j) on the interval \(j=k, k-1, \ldots , k-(n-1)\) [12]. In order to minimize \(L_{f}\), we differentiate it with respect to each \(f_{l}\), and set it equal to zero. Because

$$\begin{aligned} \frac{\partial L_{f}}{\partial f_{l}}(f_{1}, f_{2}, \ldots , f_{m}) = 2 \left( \sum _{j=k}^{k-(n-1)} x(j)x(j-l) + \sum _{i=1}^{m} f_{i} \sum _{j=k}^{k-(n-1)} x(j-i)x(j-l) \right) , \end{aligned}$$

we obtain a linear equation

$$\begin{aligned} \begin{bmatrix} r_{1,1}&r_{1,2}&\cdots&r_{1,m}\\ r_{2,1}&r_{2,2}&\cdots&r_{2,m}\\ \vdots&\vdots&\ddots&\vdots \\ r_{m,1}&r_{m,2}&\cdots&r_{m,m} \end{bmatrix} \begin{bmatrix} f_{1}\\ f_{2}\\ \vdots \\ f_{m} \end{bmatrix} + \begin{bmatrix} r_{0,1}\\ r_{0,2}\\ \vdots \\ r_{0,m} \end{bmatrix} = 0, \end{aligned}$$
(3.72)

where

$$\begin{aligned} r_{s,t}\mathop {=}\limits ^{\triangle }\sum _{j=k}^{k-(n-1)}x(j-s)x(j-t). \end{aligned}$$

Note that \(r_{s,t}\) is symmetric: \(r_{s,t}= r_{t,s}\). Equation (3.72) is called the normal equation. We assume that the matrix

$$\begin{aligned} R_{l}\mathop {=}\limits ^{\triangle }\begin{bmatrix} r_{1,1}&r_{1,2}&\cdots&r_{1,m}\\ r_{2,1}&r_{2,2}&\cdots&r_{2,m}\\ \vdots&\vdots&\ddots&\vdots \\ r_{m,1}&r_{m,2}&\cdots&r_{m,m} \end{bmatrix} \end{aligned}$$

is nonsingular. The solution of (3.72) is called the forward linear predictor. We denote it by the same symbol as the unknown variable:

$$\begin{aligned} (f_{1}, f_{2}, \ldots , f_{m})^{t} = -R_{l}^{-1} (r_{0,1}, r_{0,2}, \ldots , r_{0,m})^{t}. \end{aligned}$$
(3.73)

Each \(f_{i}\) is referred to as a forward linear predictor coefficient, or simply a linear predictor coefficient (see Note 4).
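The normal equation (3.72) and its solution (3.73) can be formed directly from a finite data window. The following NumPy sketch (arbitrary AR(2) test signal, illustrative variable names) computes the forward linear predictor; with the sign convention used here, the coefficients come out close to \((-1.5,\ 0.8,\ 0,\ 0)\) for this signal:

```python
import numpy as np

rng = np.random.default_rng(8)
m, n, k = 4, 200, 300                            # predictor order, window length, current index
x = np.zeros(k + 1)
for j in range(2, k + 1):                        # AR(2) test signal
    x[j] = 1.5 * x[j - 1] - 0.8 * x[j - 2] + 0.1 * rng.standard_normal()

J = np.arange(k - (n - 1), k + 1)                # the window j = k-(n-1), ..., k

def r(s, t):                                     # r_{s,t} as defined above
    return np.sum(x[J - s] * x[J - t])

R_l = np.array([[r(s, t) for t in range(1, m + 1)] for s in range(1, m + 1)])
r0  = np.array([r(0, t) for t in range(1, m + 1)])
f   = -np.linalg.solve(R_l, r0)                  # forward predictor coefficients (3.73)
print(f)                                         # approximately (-1.5, 0.8, 0, 0)
```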

Now let us calculate the forward residual power \(E_{f}\) defined by

$$\begin{aligned} E_{f}\mathop {=}\limits ^{\triangle }\min L_{f}(f_{1},f_{2}, \ldots , f_{m}). \end{aligned}$$

The function \(L_{f}(f_{1},f_{2}, \ldots , f_{m})\) is expressed as

$$\begin{aligned} L_{f}(f_{1},f_{2}, \ldots , f_{m})&= \sum _{j=k}^{k-(n-1)} \left( x(j) + \sum _{i=1}^{m} f_{i}x(j-i)\right) ^{2}\nonumber \\&= r_{0,0} + 2(r_{0,1}, r_{0,2}, \ldots , r_{0,m}) (f_{1}, f_{2}, \ldots , f_{m})^{t}\nonumber \\&\quad + (f_{1}, f_{2}, \ldots , f_{m})R_{l}(f_{1}, f_{2}, \ldots , f_{m})^{t}. \end{aligned}$$
(3.74)

Substitution of (3.73) into (3.74) yields

$$\begin{aligned} E_{f} = r_{0,0} + (r_{0,1}, r_{0,2}, \ldots , r_{0,m}) (f_{1}, f_{2}, \ldots , f_{m})^{t}. \end{aligned}$$

Thus, we see that the forward residual filter

$$\begin{aligned} f\mathop {=}\limits ^{\triangle }(f_{0}, f_{1}, f_{2}, \ldots , f_{m})^{t}\ \ (f_{0}=1) \end{aligned}$$

satisfies the augmented normal equation

$$\begin{aligned} R f = (E_{f}, 0, \ldots , 0)^{t}, \end{aligned}$$
(3.75)

where

$$\begin{aligned} R\mathop {=}\limits ^{\triangle }\begin{bmatrix} r_{0,0}&r_{0,1}&\cdots&r_{0,m}\\ r_{1,0}&r_{1,1}&\cdots&r_{1,m}\\ \vdots&\vdots&\ddots&\vdots \\ r_{m,0}&r_{m,1}&\cdots&r_{m,m} \end{bmatrix}. \end{aligned}$$
(3.76)
Fig. 3.11 Forward prediction and backward prediction. In the forward prediction, x(j) is predicted using the past data; in the backward prediction, \(x(j-m)\) is predicted, or estimated, using the future data

We can also consider backward prediction as in Fig. 3.11. For a linear combination \(\tilde{x}(j-m)\mathop {=}\limits ^{\triangle }\sum _{i=1}^{m}b_{i}x(j-m+i)\) of \(x(j-m+1), x(j-m+2), \ldots , x(j)\), we determine the coefficients \(b_{1},b_{2}, \ldots , b_{m}\) so that the function

$$\begin{aligned} L_{b}(b_{1},b_{2}, \ldots , b_{m})\mathop {=}\limits ^{\triangle }\sum _{j=k}^{k-(n-1)} (x(j-m) + \tilde{x}(j-m))^{2} \end{aligned}$$

is minimized. Differentiating \(L_{b}\) with respect to each \(b_{l}\), and setting it equal to zero, we obtain the normal equation

$$\begin{aligned} \begin{bmatrix} r_{0,0}&r_{0,1}&\cdots&r_{0,m-1}\\ r_{1,0}&r_{1,1}&\cdots&r_{1,m-1}\\ \vdots&\vdots&\ddots&\vdots \\ r_{m-1,0}&r_{m-1,1}&\cdots&r_{m-1,m-1} \end{bmatrix} \begin{bmatrix} b_{m}\\ b_{m-1}\\ \vdots \\ b_{1} \end{bmatrix} + \begin{bmatrix} r_{0,m}\\ r_{1,m}\\ \vdots \\ r_{m-1,m} \end{bmatrix} = 0. \end{aligned}$$
(3.77)

The solution of (3.77), denoted by the same symbol \((b_{m}, b_{m-1}, \ldots , b_{1})^{t}\) as the unknown variable, is called the backward linear predictor. Just as in the case of the forward residual power, the backward residual power

$$\begin{aligned} E_{b}\mathop {=}\limits ^{\triangle }\min L_{b}(b_{1},b_{2}, \ldots , b_{m}) \end{aligned}$$

is calculated as

$$\begin{aligned} E_{b}= r_{m,m} + (r_{m,0}, r_{m,1}, \ldots , r_{m,m-1}) (b_{m}, b_{m-1}, \ldots , b_{1})^{t}. \end{aligned}$$

Therefore, the backward residual filter

$$\begin{aligned} b\mathop {=}\limits ^{\triangle }(b_{m}, b_{m-1}, \ldots , b_{0})^{t}\ \ (b_{0}=1) \end{aligned}$$

satisfies the augmented normal equation in the backward case

$$\begin{aligned} R b = (0, \ldots , 0 , E_{b})^{t}. \end{aligned}$$
(3.78)

Theorem 3.16

Assume that the matrix R in (3.76) is nonsingular. Then,

$$\begin{aligned} R^{-1} = \begin{bmatrix} 0&\varvec{0}^{t}\\ \varvec{0}&R_{l}^{-1} \end{bmatrix} + \frac{ff^{t}}{E_{f}}, \end{aligned}$$
(3.79)

where \(\varvec{0}\) is the m-dimensional zero vector.

Similarly,

$$\begin{aligned} R^{-1} = \begin{bmatrix} R_{u}^{-1}&\varvec{0}\\ \varvec{0}^{t}&0 \end{bmatrix} + \frac{bb^{t}}{E_{b}}, \end{aligned}$$
(3.80)

where \(R_{u}\) is the \(m\times m\) sub-matrix of R consisting of the upper left corner of R:

$$\begin{aligned} R_{u} \mathop {=}\limits ^{\triangle }\begin{bmatrix} r_{0,0}&r_{0,1}&\cdots&r_{0,m-1}\\ r_{1,0}&r_{1,1}&\cdots&r_{1,m-1}\\ \vdots&\vdots&\ddots&\vdots \\ r_{m-1,0}&r_{m-1,1}&\cdots&r_{m-1,m-1} \end{bmatrix}. \end{aligned}$$

Proof

First note that \(R_{l}\) and \(R_{u}\) are nonsingular: R is a Gramian matrix and hence positive semidefinite, so the assumption that R is nonsingular makes it positive definite, and its principal submatrices \(R_{l}\) and \(R_{u}\) are then positive definite as well. Let us show (3.79). Defining \(r\mathop {=}\limits ^{\triangle }(r_{1,0}, r_{2,0}, \ldots , r_{m,0})^{t}\), we can express R as

$$\begin{aligned} R= \begin{bmatrix} r_{0,0}&r^{t}\\ r&R_{l} \end{bmatrix}. \end{aligned}$$

By the formula (A.1) (Appendix A) and (3.75),

$$\begin{aligned} R \left( \begin{bmatrix} 0&\varvec{0}^{t}\\ \varvec{0}&R_{l}^{-1} \end{bmatrix} + \frac{ff^{t}}{E_{f}} \right)&= \begin{bmatrix} r_{0,0}&r^{t}\\ r&R_{l} \end{bmatrix} \begin{bmatrix} 0&\varvec{0}^{t}\\ \varvec{0}&R_{l}^{-1} \end{bmatrix} + R\frac{ff^{t}}{E_{f}}\\&= \begin{bmatrix} r_{0,0}\cdot 0 + r^{t}\cdot \varvec{0}&r_{0,0}\cdot \varvec{0}^{t} + r^{t} R_{l}^{-1}\\ r\cdot 0 + R_{l}\cdot \varvec{0}&r\cdot \varvec{0}^{t} + R_{l} R_{l}^{-1} \end{bmatrix} + \frac{1}{E_{f}} Rff^{t}\\&= \begin{bmatrix} 0&r^{t}R_{l}^{-1}\\ \varvec{0}&I \end{bmatrix} + \frac{1}{E_{f}} (E_{f}, 0, \ldots , 0)^{t} f^{t}\\&= \begin{bmatrix} 0&r^{t}R_{l}^{-1}\\ \varvec{0}&I \end{bmatrix} + (1, 0, \ldots , 0)^{t} f^{t}\\&= \left[ \begin{array}{@{\,}c|cccc@{\,}} 0 &{} -f_{1} &{} -f_{2} &{} \cdots &{} -f_{m}\\ \hline 0 &{} &{} &{} &{} \\ \vdots &{} &{} I &{} &{} \\ 0 &{} &{} &{} &{} \end{array} \right] + \begin{bmatrix} 1&f_{1}&f_{2}&\cdots&f_{m}\\ 0&0&0&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots \\ 0&0&0&\cdots&0 \end{bmatrix}\\&= \begin{bmatrix} 1&\varvec{0}^{t}\\ \varvec{0}&I \end{bmatrix}\\&= I. \end{aligned}$$

This shows that

$$\begin{aligned} \begin{bmatrix} 0&\varvec{0}^{t}\\ \varvec{0}&R_{l}^{-1} \end{bmatrix} + \frac{ff^{t}}{E_{f}}= R^{-1}. \end{aligned}$$

Equation (3.80) can be shown in a similar way.\(\square \)
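Once R, f, and \(E_{f}\) have been formed from a data window, the decomposition (3.79) is easy to confirm numerically, as in the following NumPy sketch (arbitrary correlated test signal, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(9)
m, n, k = 3, 150, 200
x = rng.standard_normal(k + 1).cumsum()          # a correlated (random-walk) test signal

J = np.arange(k - (n - 1), k + 1)
R = np.array([[np.sum(x[J - s] * x[J - t]) for t in range(m + 1)] for s in range(m + 1)])

R_l = R[1:, 1:]
f_pred = -np.linalg.solve(R_l, R[1:, 0])         # forward predictor (3.73); note r_{s,0} = r_{0,s}
f = np.concatenate(([1.0], f_pred))              # forward residual filter, f_0 = 1
E_f = R[0, 0] + R[0, 1:] @ f_pred                # forward residual power

rhs = np.zeros_like(R)
rhs[1:, 1:] = np.linalg.inv(R_l)
rhs += np.outer(f, f) / E_f                      # right-hand side of (3.79)
print(np.allclose(np.linalg.inv(R), rhs))
```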


Copyright information

© 2016 Springer Japan

About this chapter

Cite this chapter

Ozeki, K. (2016). Affine Projection Algorithm. In: Theory of Affine Projection Algorithms for Adaptive Filtering. Mathematics for Industry, vol 22. Springer, Tokyo. https://doi.org/10.1007/978-4-431-55738-8_3

  • DOI: https://doi.org/10.1007/978-4-431-55738-8_3

  • Publisher Name: Springer, Tokyo

  • Print ISBN: 978-4-431-55737-1

  • Online ISBN: 978-4-431-55738-8
