Abstract
The normalized least-mean-squares (NLMS) algorithm suffers from slow convergence when the input signal is correlated. The reason for this phenomenon is explained by viewing the algorithm from a geometrical point of view. This observation motivates the affine projection algorithm (APA) as a natural generalization of the NLMS algorithm. The APA exploits multiple recent regressors, whereas the NLMS algorithm uses only the current, single regressor. In the APA, the coefficient vector is updated by orthogonally projecting the current coefficient vector onto the affine subspace defined by the regressors. Increasing the number of regressors, called the projection order, improves the convergence rate of the APA, especially for correlated input signals. The role of the step-size is made clear. Investigation from the affine projection point of view gives deep insight into the properties of the APA. We also see that alternative approaches are possible for deriving the update equation of the APA. To stabilize the numerical inversion of a matrix in the update equation, a regularization term is often added. This variant of the APA is called the regularized APA (R-APA), whereas the original APA is called the basic APA (B-APA). This chapter also explains that the B-APA with unity step-size has a decorrelating property, and that there are formal similarities between the recursive least-squares (RLS) algorithm and the R-APA.
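As a concrete illustration of the update described above, the following NumPy sketch implements one common form of the R-APA coefficient update: the p most recent regressors are stacked as the rows of a matrix, and the coefficient vector is moved toward the affine subspace they define, with step-size mu and regularization constant delta. The function name, variable names, and exact formulation are illustrative assumptions, not taken from this chapter.

```python
import numpy as np

def rapa_update(w, X, d, mu=1.0, delta=1e-6):
    """One regularized-APA (R-APA) step (illustrative sketch).

    w     : current coefficient vector, shape (n,)
    X     : p most recent regressors stacked as rows, shape (p, n)
    d     : corresponding desired responses, shape (p,)
    mu    : step-size
    delta : regularization constant stabilizing the matrix inversion
    """
    e = d - X @ w                                   # a priori error vector
    # Solve (X X^t + delta*I) g = e instead of forming an explicit inverse.
    g = np.linalg.solve(X @ X.T + delta * np.eye(len(d)), e)
    return w + mu * X.T @ g                         # updated coefficient vector
```

With p = 1 and delta = 0 this reduces to the NLMS update; increasing p raises the projection order, which is what accelerates convergence for correlated inputs.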
Notes
1. Numerical inversion of a matrix A is based on solving a linear equation of the type \(Ax= b\). If \(\mathrm{cond}\,(A)\) is large, a small error in A or b can result in a large error in the solution.
2. There are several variants of RLS. The type considered here is called the “prewindowed RLS” algorithm [9].
3. An abbreviation of the phrase “if and only if,” which is used very often in mathematics.
4. We sometimes use a formulation in which the sum of \((x(j) - \hat{x}(j))^{2}\) is to be minimized. In that formulation the sign of \(f_{i}\) is reversed.
References
Ozeki, K., Umeda, T.: An adaptive filtering algorithm using an orthogonal projection to an affine subspace and its properties. IEICE Trans. J67-A(2), 126–132 (1984) (Also in Electron. Commun. Jpn. 67-A(5), 19–27 (1984))
Haykin, S.: Adaptive Filter Theory. Prentice-Hall, Upper Saddle River (2002)
Sayed, A.H.: Adaptive Filters. Wiley, Hoboken (2008)
Haykin, S., Widrow, B. (eds.): Least-Mean-Square Adaptive Filters. Wiley, Hoboken (2003)
Werner, S., Diniz, P.S.R.: Set-membership affine projection algorithm. IEEE Signal Process. Lett. 8(8), 231–235 (2001)
Morgan, D.R., Kratzer, S.G.: On a class of computationally efficient, rapidly converging, generalized NLMS algorithms. IEEE Signal Process. Lett. 3(8), 245–247 (1996)
Rupp, M.: A family of adaptive filter algorithms with decorrelating properties. IEEE Trans. Signal Process. 46(3), 771–775 (1998)
Hinamoto, T., Maekawa, S.: Extended theory of learning identification. J. IEEJ-C 95(10), 227–234 (1975)
Cioffi, J.M., Kailath, T.: Windowed fast transversal filters adaptive algorithms with normalization. IEEE Trans. Acoust. Speech Signal Process. ASSP–33(3), 607–625 (1985)
Satake, I.: Linear Algebra. Marcel Dekker, New York (1975)
Luenberger, D.G.: Linear and Nonlinear Programming. Addison-Wesley, Menlo Park (1989)
Markel, J.D., Gray, A.H.: Linear Prediction of Speech. Springer, Berlin (1976)
Appendices
Appendix 1: Affine Projection
3.1.1 Orthogonal Projection
The inner product \(\langle x, \,y \rangle \) of vectors \(x, y \in \mathbb {R}^{n}\) is defined by \( \langle x, \,y \rangle \mathop {=}\limits ^{\triangle }x^{t}y\). The Euclidean norm \(\Vert x\Vert \) of \(x \in \mathbb {R}^{n}\) is defined by \(\Vert x\Vert \mathop {=}\limits ^{\triangle }\sqrt{\langle x,\,x \rangle }\). Vectors x and y are said to be orthogonal, denoted by \(x \perp y\), iff (see Note 3) \( \langle x, \,y \rangle = 0\). A vector x and a linear subspace \(\mathbb {V}\) of \(\mathbb {R}^{n}\) are said to be orthogonal and denoted by \(x \perp \mathbb {V}\) iff \(x \perp y\) for every element y in \(\mathbb {V}\). Linear subspaces \(\mathbb {V}\) and \(\mathbb {W}\) are said to be orthogonal and denoted by \(\mathbb {V} \perp \mathbb {W}\) iff any element of \(\mathbb {V}\) and any element of \(\mathbb {W}\) are orthogonal.
For a linear subspace \(\mathbb {V}\) of \(\mathbb {R}^{n}\), \(\mathbb {V}^{\perp }\mathop {=}\limits ^{\triangle }\{x \in \mathbb {R}^{n} \,;\, x \perp \mathbb {V}\}\) is called the orthogonal complement of \(\mathbb {V}\). The set \(\mathbb {V}^{\perp }\) is a linear subspace of \(\mathbb {R}^{n}\). It is easy to verify that \(\mathbb {V} \perp \mathbb {V}^{\perp }\) and that \(\mathbb {V} \cap \mathbb {V}^{\perp }= \{0\}\).
Let \(\mathbb {V}\) be a linear subspace of \(\mathbb {R}^{n}\) and x an element of \(\mathbb {R}^{n}\). An element \(x_{0} \in \mathbb {V}\) is called the orthogonal projection of x onto \(\mathbb {V}\) iff \((x-x_{0}) \perp \mathbb {V}\). Such \(x_{0}\) is uniquely determined for x. In fact, suppose that there are two orthogonal projections \(x_{0}\) and \(x_{0}^{\prime }\) of x. Then, for any \(y \in \mathbb {V}\), \(\langle x-x_{0},\,y \rangle = 0\) and \(\langle x-x_{0}^{\prime },\,y \rangle = 0\). From these equations, we have \(\langle x_{0}-x_{0}^{\prime },\,y \rangle = 0\). Let \(y\mathop {=}\limits ^{\triangle }x_{0}-x_{0}^{\prime } \in \mathbb {V}\). Then \(\langle x_{0}-x_{0}^{\prime },\,x_{0}-x_{0}^{\prime } \rangle = 0\), which leads to \(x_{0}=x_{0}^{\prime }\). The mapping that maps x to \(x_{0}\) is also called the orthogonal projection, and denoted by \(P_{\mathbb {V}}\). This is a linear mapping from \(\mathbb {R}^{n}\) onto \(\mathbb {V}\).
As illustrated in Fig. 3.6, the orthogonal projection has the minimum distance property. That is, if y is an element of \(\mathbb {V}\), then \(\Vert P_{\mathbb {V}}x- x\Vert \le \Vert y-x\Vert \) with equality iff \(y= P_{\mathbb {V}}(x)\). This is a direct consequence of the Pythagorean theorem:
Any element \(x \in \mathbb {R}^{n}\) is decomposed as
In fact, let \(y\mathop {=}\limits ^{\triangle }x - P_{\mathbb {V}}x\). Then y is an element of \(\mathbb {V}^{\perp }\). Since \(x-y= P_{\mathbb {V}}x \in \mathbb {V}\), \((x-y) \perp \mathbb {V}^{\perp }\). Hence, \(y= P_{\mathbb {V}^{\perp }}x\).
The decomposition (3.37) is unique in the sense that if \(x= y_{1}+y_{2},\ \ y_{1} \in \mathbb {V}, y_{2} \in \mathbb {V}^{\perp }\), then \(y_{1}= P_{\mathbb {V}} x\) and \(y_{2}= P_{\mathbb {V}^{\perp }} x\). Therefore, as in Fig. 3.7, \(\mathbb {R}^{n}\) is represented as the (orthogonal) direct sum of \(\mathbb {V}\) and \(\mathbb {V}^{\perp }\):
Since \(P_{\mathbb {V}}\) is a linear mapping from \(\mathbb {R}^{n}\) to \(\mathbb {R}^{n}\), it can be represented by a real matrix. The following theorem is well known [10, p. 149].
Theorem 3.3
A matrix A is an orthogonal projection iff \( A= A^{t}\) (symmetric) and \(A^{2}= A\) (idempotent). When these conditions are met, \(A=P_{\mathcal {R}(A)}\) and \(I- A= P_{\mathcal {R}(A)^{\perp }}\), where \(\mathcal {R}(A)\) is the range space of A, i.e., the linear subspace spanned by the columns of A.
Theorem 3.4
If A is an orthogonal projection, then its eigenvalues are 0 or 1.
Proof
If \(\lambda \) is an eigenvalue of A, and x the corresponding eigenvector (Appendix A), then \(Ax= \lambda x\). The eigenvector x can be uniquely decomposed as \(x= x_{1} + x_{2},\ x_{1} \in \mathcal {R}(A),\ x_{2} \in \mathcal {R}(A)^{\perp }\). Therefore, \(Ax= x_{1}\) and \(Ax= \lambda x_{1} + \lambda x_{2}\), from which we have \((\lambda -1)x_{1} + \lambda x_{2}= 0\). By the uniqueness of orthogonal decomposition, \((\lambda -1)x_{1}= 0\) and \(\lambda x_{2}= 0\). Thus, if \(x_{1}\ne 0\), then \(\lambda =1\) (and \(x_{2}=0\)). If \(x_{1}= 0\) then \(x_{2}\ne 0\) since \(x\ne 0\), which implies \(\lambda =0\).\(\square \)
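As a quick numerical check of Theorems 3.3 and 3.4 (the matrix below is an arbitrary example, not taken from the text): for a full-column-rank X, the matrix \(X(X^{t}X)^{-1}X^{t}\) is symmetric and idempotent, fixes \(\mathcal {R}(X)\), and has eigenvalues 0 and 1 only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))          # columns span a 2-D subspace of R^5
P = X @ np.linalg.inv(X.T @ X) @ X.T     # candidate projection onto R(X)

print(np.allclose(P, P.T))                 # symmetric       (Theorem 3.3)
print(np.allclose(P @ P, P))               # idempotent      (Theorem 3.3)
print(np.allclose(P @ X, X))               # fixes R(X), so P = P_{R(X)}
print(np.round(np.linalg.eigvalsh(P), 6))  # eigenvalues are 0 or 1 (Theorem 3.4)
```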
3.1.2 Gram–Schmidt Orthogonalization Procedure
Theorem 3.5
Let \(\{x_{1}, x_{2}, \ldots , x_{m}\}\ (m\le n)\) be a set of linearly independent vectors in \(\mathbb {R}^{n}\). From this set, construct a new set of vectors \(\{x^{\prime }_{1}, x^{\prime }_{2}, \ldots , x^{\prime }_{m}\}\) as follows:
Define \(x^{\prime }_{1}\) by
and define \(x^{\prime }_{l}\) for \(l= 2, 3, \ldots , m\) recursively as
Then,
and the sets of vectors \(\{x_{1}, x_{2}, \ldots , x_{l}\}\) and \(\{x^{\prime }_{1}, x^{\prime }_{2}, \ldots , x^{\prime }_{l}\}\) span the same linear subspace for any \(l\ (1 \le l \le m)\).
Proof
First note that each \(x^{\prime }_{k}\) is normalized so that \(\Vert x^{\prime }_{k}\Vert =1\). Let us show, by mathematical induction on l, that for \(l=2, 3, \ldots ,m\), the vector \(x^{\prime }_{l}\) and each of \(x^{\prime }_{1}, x^{\prime }_{2},\ldots , x^{\prime }_{l-1}\) are orthogonal. For \(l=2\), this is true. In fact,
Next, assume that for \(l\ge 3\),
Then, for \(1 \le j \le l-1\),
This shows that \(x^{\prime }_{l}\) and each element of \(\{x^{\prime }_{1}, x^{\prime }_{2},\ldots , x^{\prime }_{l-1}\}\) are orthogonal.
From the recursion (3.38) and (3.39), it is obvious that each \(x^{\prime }_{l}\) is a linear combination of \(x_{1}, x_{2}, \ldots , x_{l}\). Also from the recursion,
which shows that each \(x_{l}\) is a linear combination of \(x^{\prime }_{1}, x^{\prime }_{2}, \ldots , x^{\prime }_{l}\). Therefore, \(\{x_{1}, x_{2}, \ldots , x_{l}\}\) and \(\{x^{\prime }_{1}, x^{\prime }_{2}, \ldots , x^{\prime }_{l}\}\) span the same linear subspace.\(\square \)
The recursion in (3.38) and (3.39) is referred to as the Gram–Schmidt orthogonalization procedure. Note that \(\sum _{k=1}^{l-1}\langle x_{l},\,x^{\prime }_{k}\rangle x^{\prime }_{k}\) in (3.39) is the orthogonal projection of \(x_{l}\) onto the linear subspace spanned by \(\{x^{\prime }_{1}, x^{\prime }_{2}, \ldots , x^{\prime }_{l-1}\}\).
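A minimal NumPy sketch of the normalized procedure, assuming the input vectors are linearly independent; the function name and interface are chosen for illustration only.

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a list of linearly independent vectors in R^n."""
    ortho = []
    for x in vectors:
        # Subtract the orthogonal projection of x onto the span of the previous vectors.
        v = x - sum(np.dot(x, q) * q for q in ortho)
        ortho.append(v / np.linalg.norm(v))        # normalize to unit length
    return ortho
```

In numerical practice, np.linalg.qr applied to the matrix whose columns are these vectors yields the same orthonormal set (up to signs) in a more robust way.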
If normalization is not performed, the Gram–Schmidt orthogonalization procedure is simply written as
3.1.3 Moore–Penrose Pseudoinverse
An \(m\times n\) real matrix X can be considered as a mapping from \(\mathbb {R}^{n}\) to \(\mathbb {R}^{m}\). The range space of X, denoted by \(\mathcal {R}(X)\), is defined by
which is a linear subspace of \(\mathbb {R}^{m}\) spanned by the columns of X. The null space of X, denoted by \(\mathcal {N}(X)\), is defined by
which is also a linear subspace of \(\mathbb {R}^{n}\).
Let \(\mathcal {N}(X)^{\perp }\) be the orthogonal complement of \(\mathcal {N}(X)\) in \(\mathbb {R}^{n}\):
Likewise, let \(\mathcal {R}(X)^{\perp }\) be the orthogonal complement of \(\mathcal {R}(X)\) in \(\mathbb {R}^{m}\):
Then, \(\mathbb {R}^{n}\) and \(\mathbb {R}^{m}\) are represented respectively as the direct sum
Denote by \(X|_{\mathcal {N}(X)^{\perp }}\) the restriction of X, as a mapping, on \(\mathcal {N}(X)^{\perp }\). \(X|_{\mathcal {N}(X)^{\perp }}\) is a one-to-one mapping from \(\mathcal {N}(X)^{\perp }\) onto \(\mathcal {R}(X)\). In fact, if \(X|_{\mathcal {N}(X)^{\perp }} v= X|_{\mathcal {N}(X)^{\perp }} w\) for \(v, w \in \mathcal {N}(X)^{\perp }\), then \(X|_{\mathcal {N}(X)^{\perp }}(v-w)= X(v-w)= 0\). Therefore, \(v-w \in \mathcal {N}(X)^{\perp } \cap \mathcal {N}(X) =\{0\}\). Hence \(v= w\), i.e., \(X|_{\mathcal {N}(X)^{\perp }}\) is a one-to-one mapping. Furthermore, for \(w \in \mathcal {R}(X)\), there exists \(v \in \mathbb {R}^{n}\) such that \(Xv= w\). The element v can be decomposed as \(v= v_{1}+ v_{2}, v_{1} \in \mathcal {N}(X)^{\perp }, v_{2} \in \mathcal {N}(X)\). Since \(Xv_{2}=0\),
This shows that \(X|_{\mathcal {N}(X)^{\perp }}\) is a mapping from \(\mathcal {N}(X)^{\perp }\) onto \(\mathcal {R}(X)\). Thus, we see that the mapping \(X|_{\mathcal {N}(X)^{\perp }}\) has the inverse
Now, \(w \in \mathbb {R}^{m}\) can be uniquely decomposed as \(w= w_{1} + w_{2},\ \ w_{1} \in \mathcal {R}(X),\ w_{2} \in \mathcal {R}(X)^{\perp }\). Let \(P_{\mathcal {R}(X)}\) be the orthogonal projection from \(\mathbb {R}^{m}\) onto \(\mathcal {R}(X)\) : \(P_{\mathcal {R}(X)} w \mathop {=}\limits ^{\triangle }w_{1}\). The composite mapping \(X^{+}\mathop {=}\limits ^{\triangle }(X|_{\mathcal {N}(X)^{\perp }})^{-1}\, P_{\mathcal {R}(X)}\) is called the Moore–Penrose pseudoinverse of X. This is a linear mapping from \(\mathbb {R}^{m}\) to \(\mathbb {R}^{n}\). The formation of \(X^{+}\) is illustrated in Fig. 3.8.
Lemma 3.1
(1) \(X^{+}X = P_{\mathcal {N}(X)^{\perp }}\).
(2) \(XX^{+} = P_{\mathcal {R}(X)}\).
(3) If X is nonsingular, then \(X^{+}= X^{-1}\).
Proof
(1) Let us decompose \(v \in \mathbb {R}^{n}\) as \(v= v_{1}+v_{2}\), \(v_{1} \in \mathcal {N}(X)^{\perp }, v_{2} \in \mathcal {N}(X)\). Then,
which shows \(X^{+}X = P_{\mathcal {N}(X)^{\perp }}\).
(2) Let us decompose \(w \in \mathbb {R}^{m}\) as \(w= w_{1}+w_{2}\), \(w_{1} \in \mathcal {R}(X), w_{2} \in \mathcal {R}(X)^{\perp }\). Then,
which shows \(XX^{+} =P_{\mathcal {R}(X)}\).
(3) If X is an \(n\times n\) nonsingular matrix, then \(\mathcal {N}(X)^{\perp }=\mathbb {R}^{n}\), and \(\mathcal {R}(X)=\mathbb {R}^{n}\). Therefore, \(P_{\mathcal {R}(X)}\) equals the identity mapping \(I_{n}\) on \(\mathbb {R}^{n}\). Hence,
\(\square \)
Lemma 3.2
If \(\mathrm{rank}\,X = m\), then \(XX^{+}= I_{m}\).
Proof
Because \(\dim \mathcal {R}(X)= \mathrm{rank}\,X= m\), we have \(\mathcal {R}(X)= \mathbb {R}^{m}\). Thus, in view of Lemma 3.1(2),
\(\square \)
Lemma 3.3
\(\mathcal {R}(X^{t})= \mathcal {N}(X)^{\perp }\).
Proof
Note that \(X^{t}\) is a mapping from \(\mathbb {R}^{m}\) to \(\mathbb {R}^{n}\). Because there exists a one-to-one linear mapping \(X|_{\mathcal {N}(X)^{\perp }}\) from \(\mathcal {N}(X)^{\perp }\) onto \(\mathcal {R}(X)\),
For arbitrary \(w \in \mathbb {R}^{m}\) and \(v \in \mathcal {N}(X)\),
Therefore, \(X^{t}w \in \mathcal {N}(X)^{\perp }\), which shows that
On the other hand, by (3.43),
From (3.44) and (3.45), we conclude that \(\mathcal {R}(X^{t})= \mathcal {N}(X)^{\perp }\).\(\square \)
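The statements of Lemmas 3.1–3.3 are easy to verify numerically with np.linalg.pinv, which computes the Moore–Penrose pseudoinverse; the matrix below is an arbitrary full-row-rank example, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 5))          # full row rank (rank 3) with probability 1
Xp = np.linalg.pinv(X)                   # Moore-Penrose pseudoinverse, shape (5, 3)

P_range = X @ Xp                         # Lemma 3.1(2): projection onto R(X)
P_rowsp = Xp @ X                         # Lemma 3.1(1): projection onto N(X)^perp

print(np.allclose(P_range, np.eye(3)))   # Lemma 3.2: rank X = m implies X X^+ = I_m
print(np.allclose(P_rowsp @ X.T, X.T))   # Lemma 3.3: R(X^t) coincides with N(X)^perp
print(np.allclose(P_rowsp, P_rowsp.T) and np.allclose(P_rowsp @ P_rowsp, P_rowsp))
```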
Theorem 3.6
(1) \((X^{+})^{+}= X\).
(2) \((X^{t})^{+}= (X^{+})^{t}\).
(3) \(X(X^{t}X)^{+}X^{t}X = X\).
(4) \(X(X^{t}X)^{+}X^{t}= P_{\mathcal {R}(X)}\).
Proof
(1) By definition of the Moore–Penrose pseudoinverse,
Since \(\mathcal {R}(X^{+})= \mathcal {N}(X)^{\perp }\) and \(\mathcal {N}(X^{+})^{\perp }= (\mathcal {R}(X)^{\perp })^{\perp }= \mathcal {R}(X)\), we have
Let v be an arbitrary element in \(\mathbb {R}^{n}\), and decompose it as \(v= v_{1} + v_{2}\), where \(v_{1} \in \mathcal {N}(X)^{\perp }\) and \(v_{2} \in \mathcal {N}(X)\). Then,
On the other hand,
Therefore, \((X^{+})^{+} v = Xv\) for any \(v \in \mathbb {R}^{n}\), i.e., \((X^{+})^{+} = X\).
(2) The matrix \(X|_{\mathcal {N}(X)^{\perp }}\) is a one-to-one mapping from \(\mathcal {N}(X)^{\perp }\) onto \(\mathcal {R}(X)\), and \(X^{t}|_{\mathcal {N}(X^{t})^{\perp }}\) is a one-to-one mapping from \(\mathcal {N}(X^{t})^{\perp }\) onto \(\mathcal {R}(X^{t})\). In view of Lemma 3.3, \(\mathcal {N}(X)^{\perp }= \mathcal {R}(X^{t})\) and \(\mathcal {N}(X^{t})^{\perp }= \mathcal {R}(X)\). That is, the domain of \(X|_{\mathcal {N}(X)^{\perp }}\) coincides with the range of \(X^{t}|_{\mathcal {N}(X^{t})^{\perp }}\), and the domain of \(X^{t}|_{\mathcal {N}(X^{t})^{\perp }}\) coincides with the range of \(X|_{\mathcal {N}(X)^{\perp }}\) as shown in the following diagram:
Let us define
and
Then, we have
In fact, let v and w be arbitrary elements in \(\mathcal {S}\) and in \(\mathcal {T}\), respectively. Then,
and
Therefore, for arbitrary \(v \in \mathcal {S}\) and \(w \in \mathcal {T}\),
which is equivalent to (3.46).
We also have
In fact, let v and w be arbitrary elements in \(\mathcal {S}\) and \(\mathcal {T}\), respectively, and let \(u \mathop {=}\limits ^{\triangle }((X|_{\mathcal {S}})^{t})^{-1} v\). Then,
and
which is equivalent to (3.47).
Now, to prove \((X^{t})^{+}= (X^{+})^{t}\), it suffices to show that for arbitrary \(v \in \mathbb {R}^{n}\) and \(w \in \mathbb {R}^{m}\),
This equation is equivalent to
Let v be decomposed as \(v= v_{1}+ v_{2}\), where \(v_{1} \in \mathcal {S}\) and \(v_{2} \in \mathcal {S}^{\perp }\). Then, using (3.46) and (3.47), we have
Therefore, if we decompose w as \(w= w_{1}+w_{2}\), where \(w_{1} \in \mathcal {T}\) and \(w_{2} \in \mathcal {T}^{\perp }\), we obtain (3.50) in the following way:
(3) Let \(Y \mathop {=}\limits ^{\triangle }X(X^{t}X)^{+}X^{t}X - X\). Since \(((X^{t}X)^{+})^{t}= (X^{t}X)^{+}\) by (2) above, we have
By Lemma 3.1(2), \(X^{t}X(X^{t}X)^{+}= P_{\mathcal {R}(X^{t}X)}\). Therefore,
and
Hence, \(Y^{t}Y=0\), from which \(Y=0\) is concluded.
(4) By (3) above, \(X(X^{t}X)^{+}X^{t}u = u\) for any \(u \in \mathcal {R}(X)\). Moreover, \(X(X^{t}X)^{+}X^{t}v= 0\) for any \(v \in \mathcal {R}(X)^{\perp }\), since \(X^{t}v= 0\). Therefore, \(X(X^{t}X)^{+}X^{t}= P_{\mathcal {R}(X)}\).\(\square \)
Theorem 3.7
If \(\mathrm{rank}\,X = m\), then \(X^{+}= X^{t}(XX^{t})^{-1}\).
Proof
Let \(x_{k}\) be the kth column of \(X^{t}\). Then,
If \(\mathrm{rank}\,X = \mathrm{rank}\,X^{t}= m\), the vectors \(x_{1}, x_{2}, \ldots , x_{m}\) are linearly independent. Therefore, by Theorem A.8 (Appendix A), \(\det (XX^{t}) \ne 0\), so that the Gramian matrix \(XX^{t}\) has the inverse \((XX^{t})^{-1}\). For arbitrary \(w \in \mathbb {R}^{m}\), let
Because \(v \in \mathcal {N}(X)^{\perp }\), Lemma 3.3 guarantees the existence of \(z \in \mathbb {R}^{m}\) such that
Combining (3.51) and (3.52), and using Lemma 3.2, we have
Since the Gramian matrix \(XX^{t}\) has the inverse \((XX^{t})^{-1}\),
Substitution of (3.53) into (3.52) yields
Comparison of (3.51) and (3.54) leads to
\(\square \)
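For a full-row-rank X, the closed form of Theorem 3.7 can be checked directly against np.linalg.pinv (arbitrary illustrative data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 6))                    # rank X = m = 3
Xp_formula = X.T @ np.linalg.inv(X @ X.T)          # Theorem 3.7: X^t (X X^t)^{-1}
print(np.allclose(Xp_formula, np.linalg.pinv(X)))  # agrees with the pseudoinverse
```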
Theorem 3.8
Let X be an \(m \times n\) matrix, and y an element of \(\mathbb {R}^{m}\). The Moore–Penrose pseudoinverse gives a least-squares solution of the linear equation
That is, \(v\mathop {=}\limits ^{\triangle }X^{+}y\) minimizes \(\Vert Xv - y\Vert ^{2}\). Moreover, \(X^+y{\,}+{\,}\mathcal {N}(X)\mathop {=}\limits ^{\triangle }\{X^+y{\,}+{\,}w \,;\, w \in \mathcal {N}(X)\}\) is the set of least-squares solutions of (3.55). Therefore, \(X^{+}y\) gives the minimum norm least-squares solution of (3.55).
Proof
Because \(\{Xv \,;\, v \in \mathbb {R}^{n}\}= \mathcal {R}(X)\), the quantity \(\Vert Xv - y\Vert ^{2}\) is minimized iff \(Xv = P_{\mathcal {R}(X)}y\). This is attained for \(v= X^{+}y\), since \(XX^{+}= P_{\mathcal {R}(X)}\). If \(v= X^{+}y+ w, w \in \mathcal {N}(X)\), then
Therefore, any element in \(X^{+}y+\mathcal {N}(X)\) is a least-squares solution of (3.55). Conversely, suppose \(v= X^{+}y+ w,\ w \in \mathbb {R}^{n}\), is a least-squares solution of (3.55). Decomposition of w as \(w= w_{1}+w_{2},\,w_{1} \in \mathcal {N}(X)^{\perp }, w_{2} \in \mathcal {N}(X)\) leads to
Because \(Xv= P_{\mathcal {R}(X)}y= XX^{+}y\), \(Xw_{1}=0\). Therefore, \(w_{1} \in \mathcal {N}(X)^{\perp } \cap \mathcal {N}(X)= \{0\}\). This shows that \(v= X^{+}y+ w_{2} \in X^{+}y+\mathcal {N}(X)\).
Since \(X^{+}y \perp \mathcal {N}(X)\), \(X^{+}y\) is the minimum norm element in \(X^{+}y+\mathcal {N}(X)\). \(\square \)
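A small numerical illustration of Theorem 3.8 (the data are arbitrary, not from the text): for an underdetermined system, \(X^{+}y\) is a least-squares solution, adding any element of \(\mathcal {N}(X)\) leaves the residual unchanged, and \(X^{+}y\) has the smallest norm among all such solutions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 6))       # underdetermined: infinitely many exact solutions
y = rng.standard_normal(3)

v = np.linalg.pinv(X) @ y             # minimum-norm least-squares solution X^+ y
z = rng.standard_normal(6)
w = z - np.linalg.pinv(X) @ (X @ z)   # a vector in N(X):  X w = 0

print(np.allclose(X @ (v + w), X @ v))                        # same residual, so also a solution
print(np.linalg.norm(v) <= np.linalg.norm(v + w))             # v has minimum norm
print(np.allclose(v, np.linalg.lstsq(X, y, rcond=None)[0]))   # lstsq returns the same solution
```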
Suppose a linear equation has at least one solution. Then, a least-squares solution is a solution, and vice versa. Thus, we have the following corollary:
Corollary 3.1
If (3.55) has a solution, \(X^{+}y+\mathcal {N}(X)\) gives the set of solutions, and \(X^{+}y\) gives the minimum norm solution.
Corollary 3.2
For any \(m\times n\) matrix X,
Proof
Let us prove (3.56) first. By Theorem 3.8, \(v= X^{+}y\) gives the minimum norm least-squares solution of (3.55). On the other hand, v minimizes \(\Vert Xv - y\Vert ^{2}\) iff \((Xv - y)\perp \mathcal {R}(X)\), that is,
The minimum norm solution \(v^{\prime }\) of this equation is given by \(v^{\prime }= (X^{t}X)^{+}X^{t}y\). Because the minimum norm least-squares solution is unique, \(v= v^{\prime }\). Thus, \(X^{+}y = (X^{t}X)^{+}X^{t}y\) for any \(y \in \mathbb {R}^{m}\), which shows (3.56).
Replace X with \(X^{t}\) in (3.56):
By Theorem 3.6(2), \(((X^{t})^{+})^{t}= X^{+}\), and \(((XX^{t})^{+})^{t}= (XX^{t})^{+}\). Therefore, taking the transpose of both sides of (3.58), we immediately obtain (3.57).\(\square \)
3.1.4 Affine Projection
A subset \(\Pi \) of \(\mathbb {R}^{n}\) is called an affine subspace of \(\mathbb {R}^{n}\) iff there exists an element \(a \in \Pi \) such that
is a linear subspace of \(\mathbb {R}^{n}\). The element a is called the origin of \(\Pi \), and \(\Pi -a\) the linear subspace associated with \(\Pi \). The dimension of \(\Pi \) is defined as \(\dim \Pi \mathop {=}\limits ^{\triangle }\dim (\Pi -a)\).
Theorem 3.9
Let \(\Pi \) be an affine subspace of \(\mathbb {R}^{n}\), and a its origin. Then, for any \(b \in \Pi \),
That is, any element of \(\Pi \) can be chosen as its origin, and the linear subspace associated with \(\Pi \) is independent of the choice of the origin.
Proof
If we denote \(\mathbb {V}\mathop {=}\limits ^{\triangle }\Pi -a\), \(\Pi \) is represented as \(\Pi = \mathbb {V} + a\). Then, noting \(b\,-\,a \in \mathbb {V}\), we have \(\Pi = \mathbb {V}\,-\,(b\,-\,a)\,+\,b = \mathbb {V}\,+\,b\). This shows \(\Pi \,-\,a = \mathbb {V}= \Pi \,-\,b\). \(\square \)
Theorem 3.10
If \(\Pi _{1}\) and \(\Pi _{2}\) are affine subspaces of \(\mathbb {R}^{n}\) satisfying \(\Pi _{1}\cap \Pi _{2}\) \(\ne \emptyset \), then \(\Pi _{1} \cap \Pi _{2}\) is an affine subspace of \(\mathbb {R}^{n}\).
Proof
For arbitrarily chosen \(a \in \Pi _{1}\cap \Pi _{2}\), let
Then, it is shown that
In fact, if \(v \in \Pi _{1} \cap \Pi _{2} - a\), there exists \(w \in \Pi _{1} \cap \Pi _{2}\) such that \(v=w-a\). Because \(w \in \Pi _{1}\), we have \(v \in \mathbb {V}_{1}\). Also, because \(w \in \Pi _{2}\), we have \(v \in \mathbb {V}_{2}\). Therefore, \(v \in \mathbb {V}_{1}\cap \mathbb {V}_{2}\). Conversely, if \(v \in \mathbb {V}_{1}\cap \mathbb {V}_{2}\), then \(v \in \mathbb {V}_{1}\). Therefore, there exists \(w_{1} \in \Pi _{1}\) such that
Also, because \(v \in \mathbb {V}_{2}\), there exists \(w_{2} \in \Pi _{2}\) such that
From (3.60) and (3.61), \(w_{1}= w_{2}\). Let \(w\mathop {=}\limits ^{\triangle }w_{1}= w_{2}\). Then, \(w\in \Pi _{1}\cap \Pi _{2}\) and \(v= w-a\). Therefore, \(v \in \Pi _{1}\cap \Pi _{2} - a\).
Because \(\mathbb {V}_{1}\cap \mathbb {V}_{2}\) is a linear subspace of \(\mathbb {R}^{n}\), (3.59) shows that \(\Pi _{1}\cap \Pi _{2}\) is an affine subspace.\(\square \)
We can immediately generalize the above theorem.
Corollary 3.3
If \(\Pi _{1}\), \(\Pi _{2}\), \(\ldots \) , \(\Pi _{p}\) are affine subspaces of \(\mathbb {R}^{n}\) satisfying \(\Pi _{1} \cap \Pi _{2} \cap \cdots \cap \Pi _{p} \ne \emptyset \), then \(\Pi _{1} \cap \Pi _{2} \cap \cdots \cap \Pi _{p}\) is an affine subspace of \(\mathbb {R}^{n}\).
Let \(\Pi \) be an affine subspace of \(\mathbb {R}^{n}\), and \(\mathbb {V}\) the associated linear subspace. An element \(v \in \mathbb {R}^{n}\) and \(\Pi \) are said to be orthogonal and denoted by \(v \perp \Pi \) iff \(v \perp \mathbb {V}\). An element \(v^{\prime } \in \Pi \) is called the affine projection of \(v \in \mathbb {R}^{n}\) onto \(\Pi \) iff \((v-v^{\prime }) \perp \Pi \). The affine projection \(v^{\prime } \in \Pi \) is uniquely determined for v. In fact, suppose \(v^{\prime }\) and \(v^{\prime \prime }\) are both affine projections of v. Then, by definition, \(v^{\prime }- v \in \mathbb {V}^{\perp }\). Also, \(v^{\prime \prime }- v \in \mathbb {V}^{\perp }\). Since \(\mathbb {V}^{\perp }\) is a linear subspace of \(\mathbb {R}^{n}\), \(v^{\prime }- v^{\prime \prime } \in \mathbb {V}^{\perp }\). On the other hand, there exists \(w^{\prime } \in \mathbb {V}\) such that \(v^{\prime }= w^{\prime } + a\), where a is the origin of \(\Pi \). In the same way, there exists \(w^{\prime \prime } \in \mathbb {V}\) such that \(v^{\prime \prime }= w^{\prime \prime } + a\). Hence, \(v^{\prime }- v^{\prime \prime }= w^{\prime }- w^{\prime \prime } \in \mathbb {V}\). Thus, we have \(v^{\prime }- v^{\prime \prime }\in \mathbb {V}\cap \mathbb {V}^{\perp }= \{0\}\), from which \(v^{\prime }= v^{\prime \prime }\) is concluded.
The mapping that maps \(v \in \mathbb {R}^{n}\) to its affine projection \(v^{\prime } \in \Pi \) is also called the affine projection onto \(\Pi \), and denoted by \(P_{\Pi }\). Just as in the case of orthogonal projection onto a linear subspace, \(P_{\Pi }v\) is characterized as the unique element \(v^{\prime } \in \Pi \) that minimizes \(\Vert v^{\prime }- v\Vert \).
Now, given \(x_{1}, x_{2}, \ldots , x_{m} \in \mathbb {R}^{n}\) and \(y_{1}, y_{2}, \ldots , y_{m} \in \mathbb {R}\ (m \le n)\), let us consider a system of linear equations
where \(v \in \mathbb {R}^{n}\) is the unknown vector. Using the matrix and vector notation
we can rewrite (3.62) as
Note that X is an \(m \times n\) matrix, and y an m-dimensional vector.
By Theorem 3.8, \(\Pi \mathop {=}\limits ^{\triangle }X^{+}y + \mathcal {N}(X)\) gives the set of least-squares solutions of (3.63). If (3.63) has at least one solution, then, by Corollary 3.1, \(\Pi \) gives the set of solutions. Note that \(\Pi \) is an affine subspace of \(\mathbb {R}^{n}\), with the origin \(X^{+}y\), and the associated linear subspace \(\mathcal {N}(X)\).
Theorem 3.11
\(\mathrm{rank}\,X + \dim \Pi = n\).
Proof
As stated in “Moore–Penrose Pseudoinverse” above, there is a one-to-one linear mapping from \(\mathcal {N}(X)^{\perp }\) onto \(\mathcal {R}(X)\). Therefore, \(\dim \mathcal {R}(X) = \dim \mathcal {N}(X)^{\perp }\), so that
\(\square \)
If X is composed of a single, nonzero row, then \(\mathrm{rank}\,X = 1\). Therefore, by Theorem 3.11, \(\dim \Pi = n-1\). Such an affine subspace is called a hyperplane. If \(x_{k} \ne 0\), each equation \(\langle x_{k},\,v \rangle = y_{k}\) in (3.62) determines a hyperplane. Let it be denoted by \(\Pi _{k}= \{v \in \mathbb {R}^{n} \,;\, \langle x_{k},\,v \rangle = y_{k}\}\). As illustrated in Fig. 3.9, \(\Pi _{k}\) is a hyperplane that is orthogonal to \(x_{k}\) and passes through the point \(p= y_{k}x_{k}/\Vert x_{k}\Vert ^{2}\). If (3.63) has a solution, the set of solutions \(\Pi \) is the intersection of such hyperplanes: \(\Pi = \Pi _{1}\cap \Pi _{2}\cap \cdots \cap \Pi _{m}\).
The angle between two hyperplanes \(\Pi _{i}\) and \(\Pi _{j}\) is defined to be the angle between \(x_{i}\) and \(x_{j}\).
Theorem 3.12
A vector \(v \in \mathbb {R}^{n}\) and \(\Pi =X^{+}y + \mathcal {N}(X)\) are orthogonal iff v is a linear combination of \(x_{1}\), \(x_{2}\), \(\ldots \) , \(x_{m}\).
Proof
Let \(\mathbb {V}\) be the linear subspace spanned by \(\{x_{1}, x_{2}, \ldots , x_{m}\}\). Then, by Lemma 3.3, \(\mathbb {V}= \mathcal {R}(X^{t})= \mathcal {N}(X)^{\perp }\), from which the theorem is obvious. \(\square \)
Theorem 3.13
For \(\Pi = X^{+}y + \mathcal {N}(X)\), the affine projection \(P_{\Pi }\) is given by
Proof
Let \(w \mathop {=}\limits ^{\triangle }v + X^{+}(y-Xv)\). We first show that w is an element of \(\Pi \). Note that w can be rewritten as \(w = X^{+}y +(v-X^{+}Xv)\). By Lemma 3.1, \(XX^{+}= P_{\mathcal {R}(X)}\). Since \(Xv \in \mathcal {R}(X)\),
Therefore, we have \(v-X^{+}Xv \in \mathcal {N}(X)\). Hence, \(w = X^{+}y +(v-X^{+}Xv) \in X^{+}y + \mathcal {N}(X) = \Pi \). Furthermore, because \((w - v) = X^{+}(y-Xv) \in \mathcal {R}(X^{+})= \mathcal {N}(X)^{\perp }\), we have \((w - v) \perp \mathcal {N}(X)\). Hence, by definition, \((w - v) \perp \Pi \). Thus we have shown \(w= P_{\Pi }v\).\(\square \)
Figure 3.10 illustrates the geometrical meaning of Theorem 3.13. In the figure, we see two interpretations of \(P_{\Pi }v\):
Appendix 2: Condition Number
3.2.1 Natural Norm of a Matrix
The natural norm \(\Vert A\Vert \) of an \(n\times n\) matrix A is defined by
The natural norm is also called the induced norm, induced from the vector norm used in the definition.
Lemma 3.4
The natural norm has the following properties:
(1) \(\Vert A\Vert \ge 0\), and \(\Vert A\Vert =0\) implies \(A= 0\).
(2) For a scalar c, \(\Vert cA\Vert =|c|\,\Vert A\Vert \).
(3) \(\Vert A+B\Vert \le \Vert A\Vert + \Vert B\Vert \).
(4) \(\Vert AB\Vert \le \Vert A\Vert \cdot \Vert B\Vert \).
(5) \(\Vert I\Vert = 1\).
Proof
(1) Since \(\Vert Ax\Vert /\Vert x\Vert \ge 0\), \(\Vert A\Vert \ge 0\) is obvious. If \(A\ne 0\), there exists \(\tilde{x} \ne 0\) such that \(A\tilde{x}\ne 0\). Since \(\Vert A\tilde{x}\Vert /\Vert \tilde{x}\Vert > 0\), \(\Vert A\Vert > 0\).
(2) Since \(\Vert cAx\Vert =|c|\,\Vert Ax\Vert \),
(3) By the triangle inequality for the vector norm, we have
(4) First note that for any vector y,
In fact, if \(y= 0\), this is obvious. If \(y\ne 0\), then
which leads to (3.65). Using this inequality, we have
(5) is obvious from the definition of the natural norm. \(\square \)
Lemma 3.5
If A is an \(n\times n\) Hermitian matrix with eigenvalues \(\lambda _{1}, \lambda _{2}, \ldots , \lambda _{n}\) (Appendix A), then
Proof
Let \(\{x_{1}, x_{2}, \ldots , x_{n}\}\) be the orthonormal basis of \(\mathbb {C}^{n}\) comprised of eigenvectors corresponding to the eigenvalues \(\lambda _{1}, \lambda _{2}, \ldots , \lambda _{n}\). Then, any vector \(x \in \mathbb {C}^{n}\) is represented as \(x= \sum _{k=1}^{n} c_{k}x_{k}\) with some complex coefficients \(c_{k}\)’s. Using this representation, we have
We also have
Thus, if \(x \ne 0\),
On the other hand, let \(m\mathop {=}\limits ^{\triangle }\mathop {\mathrm{argmax}}\nolimits _{k} |\lambda _{k}|\). Then,
\(\square \)
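Lemma 3.5 can be checked with NumPy's induced 2-norm; the symmetric matrix below is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                          # real symmetric (hence Hermitian) matrix

spectral_norm = np.linalg.norm(A, 2)       # induced 2-norm: sup ||Ax|| / ||x||
eigenvalues = np.linalg.eigvalsh(A)
print(np.isclose(spectral_norm, np.max(np.abs(eigenvalues))))   # Lemma 3.5
```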
3.2.2 Condition Number
The condition number \(\kappa (A)\) of a nonsingular matrix A is defined by
By (4) and (5) of Lemma 3.4, it is easily verified that \(\kappa (A)\ge 1\).
If A is a Hermitian matrix with eigenvalues \(\lambda _{1}, \lambda _{2}, \ldots , \lambda _{n}\), then
This is immediately proved by Lemma 3.5 if we note that the eigenvalues of \(A^{-1}\) are \(1/\lambda _{1}, 1/\lambda _{2}, \ldots , 1/\lambda _{n}\).
Theorem 3.14
Let us consider a linear equation
where A is an \(n\times n\) nonsingular matrix, and \(b \in \mathbb {C}^{n}\).
(1) If the solution x changes to \(x+ \Delta x\) when b changes to \(b+\Delta b\), the following inequality holds:
$$\begin{aligned} \frac{\Vert \Delta x\Vert }{\Vert x\Vert }\le \kappa (A) \frac{\Vert \Delta b\Vert }{\Vert b\Vert }. \end{aligned}$$
(2) Let \(\Delta A\) be an \(n \times n\) matrix satisfying \(\Vert A^{-1} \Delta A\Vert < 1\). If the solution x changes to \(x+ \Delta x\) when A changes to \(A+\Delta A\), the following inequality holds:
$$\begin{aligned} \frac{\Vert \Delta x\Vert }{\Vert x\Vert }\le \frac{\kappa (A)}{1-\Vert A^{-1} \Delta A\Vert }\, \frac{\Vert \Delta A\Vert }{\Vert A\Vert }. \end{aligned}$$
Proof
(1) From (3.69) and \(A(x+\Delta x)= b+\Delta b\), we have \(A \Delta x = \Delta b\). Therefore,
On the other hand, \(\Vert b\Vert =\Vert Ax\Vert \le \Vert A\Vert \cdot \Vert x\Vert \). Thus,
(2) Since \((A+ \Delta A)(x+ \Delta x)= b= Ax\), we have
From the assumption that \(\Vert A^{-1}\Delta A\Vert < 1\), we have for an arbitrary nonzero vector y,
Therefore,
This shows \((I\,+\,A^{-1}\Delta A) y \ne 0\), so that \(I+A^{-1}\Delta A\) is nonsingular. Hence, \(A\,+\, \Delta A= A(I+A^{-1}\Delta A)\) is also nonsingular. Now, let \(B\mathop {=}\limits ^{\triangle }A^{-1}\Delta A\) and \(C\mathop {=}\limits ^{\triangle }(I+ B)^{-1}\). Then, since \(I=C+BC\),
From these facts,
\(\square \)
Theorem 3.14 shows that if \(\kappa (A)\) is close to 1, then the relative change in the solution x is comparable to the relative changes in A and b. However, if \(\kappa (A)\) is large, this is not guaranteed. In fact, if \(\kappa (A)\) is large, there are cases where a small relative change in A or b causes quite a large relative change in the solution x. For this reason, if the condition number \(\kappa (A)\) is large, the linear equation (3.69) is said to be ill-conditioned.
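The effect described in Theorem 3.14 is easy to reproduce numerically. In the hypothetical example below, a Hilbert matrix has a very large condition number, and a tiny relative perturbation of b produces a much larger relative change in the computed solution.

```python
import numpy as np

n = 8
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])  # Hilbert matrix
x_true = np.ones(n)
b = A @ x_true

print(f"cond(A) = {np.linalg.cond(A):.2e}")      # very large, so the system is ill-conditioned

db = 1e-10 * np.linalg.norm(b) * np.random.default_rng(6).standard_normal(n)
x_pert = np.linalg.solve(A, b + db)

rel_change_b = np.linalg.norm(db) / np.linalg.norm(b)
rel_change_x = np.linalg.norm(x_pert - x_true) / np.linalg.norm(x_true)
print(rel_change_b, rel_change_x)                # amplification by up to a factor of cond(A)
```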
Appendix 3: The Method of Lagrange Multipliers
The method of Lagrange multipliers is a technique for solving constrained optimization problems. Let \(f(x), g_{0}(x), g_{1}(x), \ldots , g_{m-1}(x)\) be real differentiable functions defined on a domain D in \(\mathbb {R}^{n}\), and consider the problem of minimizing f(x) subject to the constraints \(g_{k}(x)= 0\ \ (k=0, 1, \ldots , m-1)\). From these functions we construct the Lagrangian function as
where \(\lambda _{k} \in \mathbb {R}\ \ (k=0, 1, \ldots , m-1)\) are variables called the Lagrange multipliers. We denote the Jacobian matrix of the constraining functions at x by J(x):
A point \(x^{*} \in D\) is called a regular point iff it satisfies the constraints and \(\mathrm{rank}\,J(x^{*}) = m\). The latter condition is equivalent to the statement that the vectors \(\nabla _{x} g_{k}(x^{*})\ (k=0, 1, \ldots , m-1)\) are linearly independent.
The following theorem is well known in the field of nonlinear programming [11, p. 300].
Theorem 3.15
Let \(x^{*}\) be a regular point. If \(x^{*}\) minimizes f(x), then there exist \(\lambda _{k}^{*}\ (k=0, 1, \ldots , m-1)\) such that
Solving (3.70) with the use of the constraining equations, we can find \(x^{*}\) that is a candidate for \(\mathop {\mathrm{argmin}}\nolimits _{x} f(x)\) under the constraints.
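As a worked example of how the stationarity condition (3.70) is used (the example and the sign convention for the Lagrangian are assumptions for illustration, not taken from the text), consider minimizing \(f(x)= \Vert x\Vert ^{2}/2\) subject to the single constraint \(\langle a,\,x \rangle = c\). Setting the gradient of \(L(x,\lambda )= \Vert x\Vert ^{2}/2 + \lambda (\langle a,\,x \rangle - c)\) to zero gives \(x= -\lambda a\), and the constraint then yields \(x^{*}= c\,a/\Vert a\Vert ^{2}\).

```python
import numpy as np

# Minimize f(x) = 0.5*||x||^2 subject to <a, x> = c  (hypothetical example).
# Stationarity of the Lagrangian gives x = -lam*a; the constraint fixes lam,
# and the candidate minimizer is x* = c * a / ||a||^2.
a = np.array([1.0, 2.0, -1.0])
c = 3.0
x_star = c * a / (a @ a)

print(np.isclose(a @ x_star, c))                  # constraint satisfied
# Any other feasible point x* + w (with <a, w> = 0) has a larger norm, since x* is
# parallel to a and therefore orthogonal to w.
w = np.array([2.0, -1.0, 0.0])                    # satisfies <a, w> = 0
print(np.linalg.norm(x_star) < np.linalg.norm(x_star + w))
```

This is the kind of constrained minimization that appears when adaptive-filter updates are derived as minimum-norm corrections satisfying linear constraints.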
Appendix 4: Newton’s Method
A twice continuously differentiable function f defined on \(\mathbb {R}^{n}\) can be approximated by a quadratic function on a neighborhood of each point \(x^{o} \in \mathbb {R}^{n}\) as
If the Hessian matrix \(\nabla _{x}^{2}f(x^{o})\) is positive definite, the right-hand side of this equation is minimized by solving the equation
By using the formula (A.17) (Appendix A), this equation can be rewritten as
from which the minimum point \(x^{m}\) is obtained as
Motivated by this fact, Newton’s method searches for \(\mathop {\mathrm{argmin}}\nolimits _{x}f(x)\) by iterating the following computation starting from an initial point \(x_{0}\):
where \(\delta > 0\), and I is the \(n \times n\) identity matrix. The regularization term \(\delta I\) is added so that the algorithm does not fail even if \(\nabla _{x}^{2}f(x_{k})\) is not invertible. Just as in the case of the steepest descent algorithm, an index-dependent step-size \(\mu (k)\) may be used. Under certain conditions, the vector sequence \(x_{0}, x_{1}, x_{2}, \ldots \) converges to a local minimum point. If the initial point \(x_{0}\) and the step-size \(\mu \) are appropriately chosen, the sequence converges to \(\mathop {\mathrm{argmin}}\nolimits _{x} f(x)\).
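A minimal sketch of the iteration just described, assuming the gradient and Hessian are available as functions; the function name, the quadratic test problem, and the default parameter values are illustrative assumptions.

```python
import numpy as np

def newton_minimize(grad, hess, x0, mu=1.0, delta=1e-8, n_iter=50):
    """Regularized Newton iteration: x <- x - mu * (hess(x) + delta*I)^{-1} grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        H = hess(x) + delta * np.eye(len(x))      # regularization keeps H invertible
        x = x - mu * np.linalg.solve(H, grad(x))
    return x

# Example: f(x) = 0.5*x^t A x - b^t x with A positive definite; the minimizer is A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_min = newton_minimize(lambda x: A @ x - b, lambda x: A, np.zeros(2))
print(np.allclose(x_min, np.linalg.solve(A, b)))
```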
Appendix 5: Linear Prediction
Let \(\{x(k)\}\) be a real signal. We make a linear combination \(\hat{x}(j)\mathop {=}\limits ^{\triangle }\sum _{i=1}^{m}f_{i}x(j-i)\) of \(x(j-1)\), \(x(j-2)\), \(\ldots \) , \(x(j-m)\), where \(f_{1}\), \(f_{2}\), \(\ldots \) , \(f_{m}\) are real coefficients. Let us determine the coefficients so that the function
is minimized. That is, we want to determine the coefficients so that \(-\hat{x}(j)\) becomes the least-mean-squares estimate of x(j) on the interval \(j=k, k-1, \ldots , k-(n-1)\) [12]. To minimize \(L_{f}\), we differentiate it with respect to each \(f_{l}\) and set the result equal to zero. Because
we obtain a linear equation
where
Note that \(r_{s,t}\) is symmetric: \(r_{s,t}= r_{t,s}\). Equation (3.72) is called the normal equation. We assume that the matrix
is nonsingular. The solution of (3.72) is called the forward linear predictor. We denote it by the same symbol as the unknown variable:
Each \(f_{i}\) is referred to as a forward linear predictor coefficient, or simply a linear predictor coefficient (see Note 4).
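A sketch of the computation under the sign convention above (the residual is \(x(j)+\sum _{i}f_{i}x(j-i)\)). The autocorrelation estimates and window handling are simplified relative to the text, and the AR(2) test signal is an invented example, so this is an illustration rather than the exact procedure.

```python
import numpy as np

def forward_predictor(x, m):
    """Solve the normal equations for the forward linear predictor coefficients f_1..f_m.

    Uses autocorrelation estimates r[s]; with the convention x_hat(j) = sum_i f_i x(j-i)
    and residual x(j) + x_hat(j), the normal equations read  R f = -r,
    where R[l-1, i-1] = r[|l-i|].
    """
    x = np.asarray(x, dtype=float)
    r = np.array([np.dot(x[: len(x) - s], x[s:]) for s in range(m + 1)])
    R = np.array([[r[abs(l - i)] for i in range(1, m + 1)] for l in range(1, m + 1)])
    f = np.linalg.solve(R, -r[1:])
    return f, np.concatenate(([1.0], f))           # predictor and residual filter

# Example: an AR(2) signal x(k) = 0.6 x(k-1) - 0.2 x(k-2) + noise.
rng = np.random.default_rng(7)
x = np.zeros(5000)
for k in range(2, len(x)):
    x[k] = 0.6 * x[k - 1] - 0.2 * x[k - 2] + rng.standard_normal()

f, _ = forward_predictor(x, 2)
print(f)   # approximately [-0.6, 0.2] under this sign convention
```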
Now let us calculate the forward residual power \(E_{f}\) defined by
The function \(L_{f}(f_{1},f_{2}, \ldots , f_{m})\) is expressed as
Substitution of (3.73) into (3.74) yields
Thus, we see that the forward residual filter
satisfies the augmented normal equation
where
We can also consider backward prediction as in Fig. 3.11. For a linear combination \(\tilde{x}(j-m)\mathop {=}\limits ^{\triangle }\sum _{i=1}^{m}b_{i}x(j-m+i)\) of \(x(j-m+1), x(j-m+2), \ldots , x(j)\), we determine the coefficients \(b_{1},b_{2}, \ldots , b_{m}\) so that the function
is minimized. Differentiating \(L_{b}\) with respect to each \(b_{l}\) and setting the result equal to zero, we obtain the normal equation
The solution of (3.77), denoted by the same symbol \((b_{m}, b_{m-1}, \ldots , b_{1})^{t}\) as the unknown variable, is called the backward linear predictor. Just as in the case of the forward residual power, the backward residual power
is calculated as
Therefore, the backward residual filter
satisfies the augmented normal equation in the backward case
Theorem 3.16
Assume that the matrix R in (3.76) is nonsingular. Then,
where \(\varvec{0}\) is the m-dimensional zero vector.
Similarly,
where \(R_{u}\) is the \(m\times m\) sub-matrix in the upper left corner of R:
Proof
First note \(R_{l}\) and \(R_{u}\) are nonsingular by the assumption that R is nonsingular. Let us show (3.79). Defining \(r\mathop {=}\limits ^{\triangle }(r_{1,0}, r_{2,0}, \ldots , r_{m,0})^{t}\), we can express R as
By the formula (A.1) (Appendix A) and (3.75),
This shows that
Equation (3.80) can be shown in a similar way.\(\square \)