Journal of Signal Processing Systems, Volume 65, Issue 3, pp 403–412

Regularized Pre-image Estimation for Kernel PCA De-noising

Input Space Regularization and Sparse Reconstruction


  • T.J. Abrahamsen
    • DTU Informatics, Technical University of Denmark
  • Lars Kai Hansen
    • DTU Informatics, Technical University of Denmark

DOI: 10.1007/s11265-010-0515-4

Cite this article as:
Abrahamsen, T.J. & Hansen, L.K. J Sign Process Syst (2011) 65: 403. doi:10.1007/s11265-010-0515-4


The main challenge in de-noising by kernel Principal Component Analysis (PCA) is the mapping of de-noised feature space points back into input space, also referred to as “the pre-image problem”. Since the feature space mapping is typically not bijective, pre-image estimation is inherently ill-posed. As a consequence the most widely used estimation schemes lack stability. A common way to stabilize such estimates is to augment the cost function with a suitable constraint on the solution values. For de-noising applications we here propose Tikhonov input space distance regularization as a stabilizer for pre-image estimation, or sparse reconstruction by Lasso regularization in cases where the main objective is to improve visual simplicity. We perform extensive experiments on the USPS digit modeling problem to evaluate the stability of three widely used pre-image estimators. We show that the previous methods lack stability in the non-linear regime; however, by applying our proposed input space distance regularizer the estimates are stabilized with a limited sacrifice in terms of de-noising efficiency. Furthermore, we show how sparse reconstruction can lead to improved visual quality of the estimated pre-image.


Keywords: Kernel PCA · Pre-image · Regularization · De-noising · Sparsity

1 Introduction

We are interested in unsupervised learning methods for de-noising. If necessary we will use non-linear maps to project noisy data onto a clean signal manifold. Kernel PCA and similar methods are widely used candidates for such projection beyond conventional linear unsupervised learning schemes like principal component analysis (PCA), independent component analysis (ICA), and non-negative matrix factorization (NMF). The basic idea is to implement the projection in three steps. In the first step we map the original input space data into a feature space. The second step then consists of using a conventional linear algorithm, like PCA, to identify the signal manifold by linear projection in feature space. Finally, in the third step we estimate the de-noised input space points that best correspond to the projected feature space points. The latter step is referred to as the pre-image problem. Unfortunately, finding a reliable pre-image is entirely non-trivial and has given rise to several algorithms [2, 4, 8, 9, 14]. In this work we experimentally analyze the stability of the pre-images estimated by the most widely used of these algorithms, and we suggest introducing regularization in order to improve the performance and stability relative to the existing approaches. If the aim is stabilization, Tikhonov input space regularization is recommended, whereas sparse reconstruction by Lasso regularization is found superior for sparse data when the aim is improved visual quality.

Let us recapitulate some basic aspects of de-noising with kernel PCA. Let \({\mathcal{F}}\) be the Reproducing Kernel Hilbert Space (RKHS) associated with the kernel function k(x,x′) = φ(x)Tφ(x′), where \({\varphi}\ : {\mathcal{X}}\mapsto{\mathcal{F}}\) is a possibly nonlinear map from the D-dimensional input space \({\mathcal{X}}\) to the high (possibly infinite) dimensional feature space \({\mathcal{F}}\) (see notation1). In de-noising and a number of other applications it is of interest to reconstruct a data point in input space from a point in feature space, that is, to apply the inverse map of φ. Given a point, Ψ, in feature space the pre-image problem thus consists of finding a point \({\mathbf z}\in{\mathcal{X}}\) in the input space such that φ(z) = Ψ; z is then called the pre-image of Ψ. For many non-linear kernels \(\rm{dim}({\mathcal{F}})\gg \rm{dim}({\mathcal{X}})\) and φ is not surjective. Furthermore, whether φ is injective depends on the choice of kernel function. As a function f : X → Y has an inverse iff it is bijective, we do not expect φ to have an inverse. When φ is not surjective, it follows that not every point in \({\mathcal{F}}\), or even in the span of \(\{{\varphi}({\mathcal{X}})\}\), is the image of some \({\mathbf x}\in{\mathcal{X}}\). Finally, when φ is not injective, any recovered pre-image might not be unique. Thus the pre-image problem is ill-posed [1, 3, 4, 8, 9, 12, 14]. As we cannot expect an exact pre-image, we follow [9] and relax the quest to find an approximate pre-image, i.e., a point in input space which maps into a point in feature space ‘as close as possible’ to Ψ (Fig. 1).
Figure 1

The pre-image problem in kernel PCA de-noising concerns estimating z from x0, through the projection of the image onto the principal subspace. Presently available methods for pre-image estimation lead to unstable pre-images because the inverse is ill-posed. We show that simple input space regularization, with a penalty based on the distance || z − x0|| leads to a stable pre-image.

2 Kernel PCA

Kernel Principal Component Analysis is a nonlinear generalization of linear PCA, in which PCA is carried out on the data mapped into the feature space \({\mathcal{F}}\) [13]. However, as \({\mathcal{F}}\) can be infinite dimensional we cannot work directly with the feature space covariance matrix. Fortunately, the so-called kernel trick allows us to formulate nonlinear extensions of linear algorithms when these are expressed in terms of inner-products.

Let {x1,..., xN} be N training data points in \({\mathcal{X}}\) and {φ(x1),..., φ(xN)} be the corresponding images in \({\mathcal{F}}\). The mean of the φ-mapped data points is denoted \(\bar{{\varphi}}=\frac{1}{N}\sum_{n=1}^N{\varphi}({\mathbf x}_n)\) and the ‘centered’ images are given by \({\tilde{\varphi}}({\mathbf x}) = {{\varphi}({\mathbf x})}-{\bar{\varphi}}\). Now, let K denote the kernel matrix with element Kij = k(xi, xj), then kernel PCA can be performed by solving the eigenvalue problem
$${\widetilde{{\mathbf K}}} \boldsymbol{\alpha}_i=\lambda_i \boldsymbol{\alpha}_i $$
where \({\widetilde{{\mathbf K}}}\) is the centered kernel matrix defined as \({\widetilde{{{\mathbf K}}}}={\mathbf K}-\frac{1}{N}{\mathbf 1}_{NN} {\mathbf K}- \frac{1}{N}{\mathbf K}{\mathbf 1}_{NN}+\frac{1}{N^2}{\mathbf 1}_{NN}{\mathbf K}{\mathbf 1}_{NN}\).
The projection of a φ-mapped test point onto the i’th principal component is
$$ \beta_i={\tilde{\varphi}}({\mathbf x})^T{\mathbf v}_i=\sum\limits_{n=1}^N \alpha_{in}{\tilde{\varphi}}({\mathbf x})^T{\tilde{\varphi}}({\mathbf x}_n)=\sum\limits_{n=1}^N \alpha_{in}{\tilde{k}}({\mathbf x},{\mathbf x}_n) $$
where vi is the i’th eigenvector of the feature space covariance matrix and the αi’s have been normalized. The centered kernel function can be found as \({\tilde{k}}({\mathbf x},{\mathbf x}')\!=\!k({\mathbf x},{\mathbf x'})\!-\!\frac{1}{N}{\mathbf 1}_{1N}{\mathbf k}_{\mathbf x}\!-\!\frac{1}{N}{\mathbf 1}_{1N}{\mathbf k}_{\mathbf x'}\!+\!\frac{1}{N^2}{\mathbf 1}_{1N}{\mathbf K}{\mathbf 1}_{N1}\), where \({\mathbf k}_{{\mathbf x}}\!=\![k({\mathbf x},{\mathbf x}_1),\ldots, k({\mathbf x},{\mathbf x}_N)]^T\). The projection of φ(x) onto the subspace spanned by the first q eigenvectors will be denoted Pqφ(x) and can be found as
$$\begin{array}{lll} P_q{\varphi}({\mathbf x})&= \sum\limits_{i=1}^{q} \beta_i {\mathbf v}_i+\bar{{\varphi}}= \sum\limits_{i=1}^{q}\beta_i \sum\limits_{n=1}^N \alpha_{in}{\tilde{\varphi}}({\mathbf x}_n) + \bar{{\varphi}} \\ &= \sum\limits_{n=1}^N {\tilde{\gamma}}_n {\tilde{\varphi}}({\mathbf x}_n) + \bar{{\varphi}} \label{eq.projn} \end{array} $$
where \({\tilde{\gamma}}_n = \sum_{i=1}^q\beta_i\alpha_{in}\). Kernel PCA satisfies properties similar to those for linear PCA, namely that the squared reconstruction error is minimal and the retained variance is maximal. However, these properties hold in \({\mathcal{F}}\) not in \({\mathcal{X}}\). For a more thorough derivation of kernel PCA the reader is referred to, e.g., [13].
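As a minimal NumPy sketch (not the authors' code), the centering and eigendecomposition above can be written as follows; the rescaling of αi by \(1/\sqrt{\lambda_i}\) is one common convention for obtaining unit-norm feature space eigenvectors:

```python
import numpy as np

def gaussian_kernel(X, Y, c):
    """k(x, x') = exp(-||x - x'||^2 / c) for all pairs of rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / c)

def kernel_pca(K, q):
    """Leading q eigenpairs of the centered kernel matrix (Eqs. 1-2).

    Columns of A are rescaled by 1/sqrt(lambda_i) so that the feature
    space eigenvectors v_i = sum_n alpha_in phi~(x_n) have unit norm.
    """
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)                # (1/N) 1_NN
    Kc = K - one @ K - K @ one + one @ K @ one    # centered kernel matrix
    lam, A = np.linalg.eigh(Kc)                   # ascending eigenvalues
    lam, A = lam[::-1][:q], A[:, ::-1][:, :q]     # keep the top q
    return lam, A / np.sqrt(lam), Kc

def project(kx, K, A):
    """beta_i: projections of a test image onto the principal directions,
    where kx[n] = k(x, x_n) is the test kernel vector."""
    kxc = kx - kx.mean() - K.mean(axis=0) + K.mean()  # centered k~(x, x_n)
    return A.T @ kxc
```

For a training point the centered test kernel vector reduces to the corresponding row of the centered kernel matrix, which gives a cheap consistency check on the implementation.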

3 Approximate Pre-images

Several optimality criteria can be used for the pre-image approximation, see e.g., [1],
$$\rm{Distance: } \quad\quad {\mathbf z} = {\underset{{\mathbf z}\in{\mathcal{X}}}{\operatorname{argmin}}\;}||{\varphi}({\mathbf z})-\Psi||^2 \label{eq.error}\\ $$
$$\rm{Co-linearity: } \quad {\mathbf z} = {\underset{{{\mathbf z}\in{\mathcal{X}}}}{\operatorname{argmax}}\;}{\left\langle {\frac{{\varphi}({\mathbf z})}{||{{\varphi}({\mathbf z})}||},\frac{\Psi}{||\Psi||}} \right\rangle} \label{eq.critcol} $$
For RBF kernels of the form \(k({\mathbf x}_i,{\mathbf x}_j) = \kappa(||{\mathbf x}_i-{\mathbf x}_j||^2)\) the co-linearity criterion and the distance criterion coincide:
$$\begin{array}{lll} ||{\varphi}({\mathbf z})-\Psi||^2 &= {\left\langle {{{\varphi}({\mathbf z})},{{\varphi}({\mathbf z})}} \right\rangle} + {\left\langle {\Psi,\Psi} \right\rangle}- 2{\left\langle {{{\varphi}({\mathbf z})},\Psi} \right\rangle}\\ &=k({\mathbf z},{\mathbf z}) + ||\Psi||^2-2{\left\langle {{{\varphi}({\mathbf z})},\Psi} \right\rangle} \end{array} $$
As k(z,z) is constant for RBF kernels and ||Ψ||2 is independent of z, minimizing ||φ(z) − Ψ||2 is equivalent to maximizing the co-linearity. As \({\mathcal{F}}\) is a RKHS, the distance will be the same before and after centering. However, the expression gets a bit more tedious when using explicit centering as will be shown later.
Thus we seek to minimize the distance between φ(z) and Ψ with respect to z. By assuming that Ψ lies in (or close to) the span of {φ(xi)}, Ψ can without loss of generality be represented as a linear combination of the training images, i.e., as Pqφ(x). When q = N this amounts to projecting Ψ onto the span of {φ(xi)}. Thus, we are interested in an expression for
$$\begin{array}{lll} ||{\varphi}({\mathbf z})-P_q{\varphi}({\mathbf x})||^2 &=& ||{\varphi}({\mathbf z}) ||^2+||P_q{\varphi}({\mathbf x})||^2\\ &&-2{\varphi}({\mathbf z})^TP_q{\varphi}({\mathbf x}). \end{array} $$
The terms will in the following be expanded separately, starting with the first term
$$ ||{\varphi}({\mathbf z}) ||^2 = {\varphi}({\mathbf z})^T{\varphi}({\mathbf z})=k({\mathbf z},{\mathbf z}) $$
From Eq. 3 and the definition of centering and mean in feature space, we have
$$\begin{array}{lll} ||P_q{\varphi}({\mathbf x})||^2 &\!=\!&\left( \sum\limits_{i=1}^{q} \beta_i {\mathbf v}_i\!+\!\bar{{\varphi}} \right)^T\left( \sum\limits_{i=1}^{q} \beta_i {\mathbf v}_i\!+\!\bar{{\varphi}} \right)\\ &\!=\!&\sum\limits_{i=1}^q \beta_i^2 \!+\! {\bar{\varphi}}^T{\bar{\varphi}} \!+\!2{\bar{\varphi}}^T\sum\limits_{n=1}^{N}{\tilde{\gamma}}_n{\tilde{\varphi}}({\mathbf x}_n)\\ &\!=\!& \sum\limits_{i=1}^q\left(\sum\limits_{n=1}^N \alpha_{in}{\tilde{k}}({\mathbf x},{\mathbf x}_n)\right)^2\!\!+\! \frac{1}{N^2}\sum\limits_{n,m=1}^N k({\mathbf x}_n,{\mathbf x}_m)\\ &&\!+ \frac{2}{N}\sum\limits_{n=1}^N\!\left(\!\!{\tilde{\gamma}}_n\!\sum\limits_{m=1}^Nk(\!{\mathbf x}_m,\!{\mathbf x}_n\!)\!-\!\frac{\!{\tilde{\gamma}}_n}{N}\!\sum\limits_{m,l=1}^N k({\mathbf x}_m,\!{\mathbf x}_l) \!\right)\\ \end{array} $$
Finally the last term can be expanded using the same properties as above
$$\begin{array}{lll} {\varphi}({\mathbf z})^T P_q{\varphi}({\mathbf x}) &=&{\varphi}({\mathbf z})^T\left(\sum\limits_{n=1}^N{\tilde{\gamma}}_n({\varphi}({\mathbf x}_n)-{\bar{\varphi}})+{\bar{\varphi}} \right)\\ &=&\sum\limits_{n=1}^N\gamma_n k({\mathbf z},{\mathbf x}_n)\label{eq.errcp} \end{array} $$
where the last equality follows from letting \(\gamma_n = {\tilde{\gamma}}_n+\frac{1}{N}(1-\sum_{j=1}^N {\tilde{\gamma}}_j)\), and where \({\tilde{\gamma}}_n = \sum_{i=1}^q \beta_i\alpha_{in}\) as defined in Eq. 3. Now combining the expressions gives the following cost function
$$\begin{array}{lll} R({\mathbf z})&=&||{\varphi}({\mathbf z})-P_q{\varphi}({\mathbf x})||^2 \\ &=& k({\mathbf z},{\mathbf z}) -2\sum\limits_{n=1}^N\gamma_n k({\mathbf z},{\mathbf x}_n) + \Omega \label{eq.featuredist} \end{array} $$
where all the z-independent terms originating from \(||P_q{\varphi}({\mathbf x})||^2\) have been collected in Ω.
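To make the bookkeeping concrete, the weights γn and the z-dependent part of R(z) for a Gaussian kernel can be sketched as follows (helper names are ours, not the paper's):

```python
import numpy as np

def gammas(beta, A):
    """gamma_n = gamma~_n + (1 - sum_j gamma~_j) / N,
    with gamma~_n = sum_i beta_i alpha_in (rows of A index n)."""
    g_tilde = A @ beta
    N = A.shape[0]
    return g_tilde + (1.0 - g_tilde.sum()) / N

def preimage_cost(z, X, gamma, c):
    """z-dependent part of R(z) in Eq. 11 for a Gaussian kernel,
    where k(z, z) = 1 and Omega are constants and are dropped."""
    kz = np.exp(-((X - z) ** 2).sum(axis=1) / c)
    return -2.0 * gamma @ kz
```

Note that by construction the γn always sum to one.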

3.1 Overview of Existing Algorithms

The non-linear optimization problem associated with finding the pre-image has been approached in a variety of ways. In the original work by Mika et al. [9, 14] a fixed-point iteration method was proposed. It is a noted drawback of this method that it can be numerically unstable, sensitive to the initial starting point, and converge to a local extremum. To overcome this problem a more ‘direct’ approach was taken by Kwok and Tsang [8]. They combined the idea of multidimensional scaling (MDS) with the relationship between distance measures in feature space and input space, thereby deriving a non-iterative solution. These are the two approaches most widely used in applications. However, several modifications have already been proposed. In order to overcome possible numerical instabilities of the fixed-point approach, various ways of initialization have been suggested. The algorithm can be started in a ‘random’ input space point, but this can lead to slow convergence in real-life problems, since the cost function can be very flat in regions away from data. Alternatively, for de-noising applications, it can be initialized in the input space point which we seek to de-noise. However, according to Takahashi and Kurita [15] this strategy will only work if the signal-to-noise ratio (SNR) is high. Instead Kim et al. [7] suggested initializing the fixed-point iteration scheme in the solution found by Kwok and Tsang’s direct method. Later it was claimed that a more efficient starting point would be the mean of a certain number of neighbors of the point to be de-noised [16]. Dambreville et al. [4] proposed a modification of the method developed by Mika et al. utilizing feature space distances. This method also minimizes the distance criterion in Eq. 4, but does so with a non-iterative approximation, thereby avoiding numerical instabilities. Bakir et al. [2] used kernel ridge regression to learn an inverse mapping of φ.
While this formulation is in very general terms, the actual implementation is similar to that of Kwok and Tsang [8]. The main issue is that we typically only have indirect access to feature space points, thus a learned pre-image needs to be formulated in terms of distances as in Kwok and Tsang’s method, rather than explicit input-output examples. It should be noted that with its relatively general formulation the method of Bakir et al. can in some cases be applied beyond Kwok and Tsang’s method, e.g., to non-Euclidean input spaces. In light of the recognized ill-posed nature of the inverse problem, more robust estimators have been pursued. Nguyen and De la Torre Frade [10] introduced regularization that penalized the projection in feature space, while Zheng and Lai [20] used a ridge regression regularizer for the weights of a learned pre-image estimator as originally proposed by Bakir et al. [2].

Returning to the iterative scheme of Mika et al., we work, as in most applications, with RBF kernels for which k(z,z) is constant for all z, hence minimizing the squared distance in Eq. 11 is identical to
$$\max\limits_{{\mathbf z}}\, 2\sum\limits_{n=1}^N\gamma_n k({\mathbf z},{\mathbf x}_n) \label{eq.mikaopt} $$
Now in extrema of Eq. 12 the derivative with respect to z is zero, which leads to the following fixed-point iteration for a Gaussian kernel of the form \(k({\mathbf x},{\mathbf x}')=\exp\left(-\frac{1}{c}||{\mathbf x}-{\mathbf x}' ||^2\right)\), where c controls the width of the kernel and thereby the non-linearity of the associated feature space map [9]
$${\mathbf z}_{t+1}=\frac{\sum_{n=1}^N \gamma_{n}\exp(-||{\mathbf z}_t-{\mathbf x}_n ||^2/c){\mathbf x}_n}{\sum_{n=1}^N \gamma_{n}\exp(-||{\mathbf z}_t-{\mathbf x}_n ||^2/c)}$$
As mentioned, maximizing Eq. 12 is a non-linear optimization problem, and hence suffers from convergence to local optima and strong sensitivity to the initial point z. As we shall see, this implies that the solutions are at times highly unstable.
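The fixed-point update in Eq. 13 is compact enough to sketch directly (assumed helper names; γ is computed as in Section 3):

```python
import numpy as np

def mika_fixed_point(x0, X, gamma, c, n_iter=200, tol=1e-8):
    """Fixed-point pre-image iteration (Eq. 13) for a Gaussian kernel,
    initialized in the noisy observation x0. The denominator can become
    arbitrarily small for very non-linear kernels (small c), which is
    precisely the instability discussed in the text."""
    z = x0.astype(float).copy()
    for _ in range(n_iter):
        w = gamma * np.exp(-((X - z) ** 2).sum(axis=1) / c)
        z_new = (w @ X) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z_new
```

Each iterate is a weighted average of the training points, so the pre-image always lies in their convex hull when the weights are non-negative.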

4 Instability Issues

Some of the most recent publications (e.g., [1, 17]) argue that the methods of Mika et al. [9], Kwok and Tsang [8], and Dambreville et al. [4] are the most reliable. In this section we show that these current approaches suffer from different weaknesses.

A distinctive feature of all the algorithms is that they seek to determine the pre-image as a weighted average of the training points. In the method proposed by Kwok and Tsang only k of the training points are used for the estimation, and their weights are based on a distance relation between feature space and input space and the persistence of this distance across the φ-mapping. In Mika et al.’s approach all training points contribute to estimating the pre-image, and the individual weights are found using the pre-image itself, hence the method becomes iterative. Furthermore, the weight of a given training point decays exponentially with input space distance, so only points close to the pre-image contribute significantly to the pre-image estimate. Dambreville et al. substituted the iterative approach by a direct formula, where the weights decrease linearly with feature space distance, giving high weight to training points for which φ(xi) is close to Ψ.

Thus, in different ways, both Kwok and Tsang’s and Dambreville et al.’s methods are based on the assumption that points which are close in feature space are also close in input space. For very non-linear kernels this assumption fails. In fact, the more non-linear the kernel, the more “creased” the associated feature space will be. This is illustrated with a simple 2-dimensional example, where 500 data points are drawn from two rings with Gaussian noise. Kernel PCA is performed using a Gaussian kernel. In order to illustrate how distances are skewed by the kernel transformation, all pairwise feature space distances are determined, and for the 0.5% closest relations in feature space the corresponding observation pairs are marked in input space. In Fig. 2a and b these pairs are indicated with green lines. Figure 2b indicates that the distortion is affected not only by the non-linearity of the kernel but also by the dimensionality of the principal subspace (i.e., the number of principal components used).
Figure 2

Kernel PCA with a Gaussian kernel is performed on a 2-dimensional two-class example in order to illustrate the distance distortions occurring between input and feature space when either the non-linearity is increased or the principal subspace dimensionality is decreased. The green lines indicate the 0.5% closest relations in feature space. The combination of c = 0.20 and # PCs = 5 is shown in both figures as the top right plot for easy comparison. Panel a shows how increasing the non-linearity gives rise to unexpected relations in feature space, and hence distortions of the distance relation across the φ-map, while panel b illustrates that the same effect occurs when decreasing the dimensionality of the subspace.

The instability of the fixed-point iteration method is also evident when using a very non-linear kernel. Just like in the previous example, this is illustrated by drawing 500 data points from two partial rings with Gaussian noise. A “noisy” observation, which we seek to de-noise, is placed in the center of the rings. Kernel PCA is performed using a Gaussian kernel with varying scale parameter. The number of principal components is fixed to 50. For every scale, Mika et al.’s algorithm is initialized in each training point in turn and the resulting pre-image of the de-noised observation is shown in the top two rows of Fig. 3. Clearly, when using a very non-linear kernel, the reconstructed pre-image heavily depends on the initialization. As the cost function has many local minima, the algorithm converges to the nearest one. When c increases the number of distinct pre-images is seen to decrease, until c reaches a certain level where kernel PCA approaches linear PCA. When this happens the pre-image is drawn towards the noisy observation, as linear PCA fails to capture the non-linear trends clearly visible in the data. In the lower part of Fig. 3 the value of the cost function, R(z) in Eq. 11, is shown for all the found pre-images as a function of the scale. From this figure it is clear that using a very non-linear kernel results in a cost function with not only many local minima, but minima that practically all have the same value, thereby making iterative algorithms very sensitive to the point of initialization.
Figure 3

We seek to de-noise the green point in (0,0) using kernel PCA with varying scale and 50 PCs. Mika et al.’s algorithm is initialized in all 500 training points resulting in 500 pre-image estimates indicated by the black crosses in the two top rows. The bottom row shows the value of the cost function, R(z) in Eq. 11, for each pre-image. The color indicates which class the initialization point belongs to. It is clearly seen how a very non-linear kernel results in many local minima with almost the same cost function value, while the more linear case fails to describe the non-linear signal manifold, leading to a unique but not de-noised pre-image.

Based on the simple examples shown here, it seems reasonable to try to improve the stability of the current approaches. We suggest this is done by introducing regularization in Eq. 11 as further described in the following sections.

5 Regularization

Regularization is commonly used to stabilize estimates of high variability. Thus, if the unregularized criterion is the risk function, R(z), the regularized version is obtained by adding a penalty, T(z), so that the solution can be formulated as:
$${\mathbf z} = {\underset{{\mathbf z}\in{\mathcal{X}}}{\operatorname{argmin}}\;} R({\mathbf z}) + \lambda T({\mathbf z})\label{eq.pen1} $$
where λ > 0 is the regularization parameter controlling the strength of the penalty term. Hence, Eq. 14 can lead to various estimates depending not only on the chosen penalty term but also on the value of λ.

If both R(z) and T(z) are differentiable a fixed-point iteration scheme similar to that of Mika et al. can easily be derived.

In this paper we focus on two special cases from the power penalty family, namely Tikhonov regularization [19]
$${\mathbf z} = {\underset{{\mathbf z}\in{\mathcal{X}}}{\operatorname{argmin}}\;} R({\mathbf z}) + \lambda ||{\mathbf z}-{\mathbf x}_0||^2_{\ell_2}\label{eq.tikhonov} $$
and the Lasso [18] where
$${\mathbf z} = {\underset{{\mathbf z}\in{\mathcal{X}}}{\operatorname{argmin}}\;} R({\mathbf z}) + \lambda ||{\mathbf z} ||_{\ell_1}\label{eq.lasso} $$

While Tikhonov regularization stabilizes the estimate, Lasso regularization will for sufficiently large values of λ force some of the zj’s to zero, leading to a sparse pre-image. Notice also that the Lasso problem can be interpreted as a MAP estimate with a Laplacian prior on the zj’s.

The type of penalty should be chosen according to the given problem, and hence, prior knowledge of the expected pre-image will often work as the base for choosing the penalty term.

5.1 Tikhonov Input Space Regularization

In order to provide a more stable estimate of the pre-image we propose to augment the cost function with an input space distance penalty term (see Fig. 1)
$$\begin{array}{lll} \rho_1({\mathbf z})&=&R({\mathbf z}) + \lambda T_2({\mathbf z})\\ &=&||{\varphi}({\mathbf z})-P_q{\varphi}({\mathbf x})||^2 + \lambda||{\mathbf z}-{\mathbf x}_0||^2_{\ell_2}\\ &=&k({\mathbf z},{\mathbf z})-2\sum\limits_{n=1}^N \gamma_n k({\mathbf z},{\mathbf x}_n) + \Omega \\ &&+ \lambda({\mathbf z}^T{\mathbf z} + {\mathbf x}_0^T{\mathbf x}_0-2{\mathbf z}^T{\mathbf x}_0)\label{eq.regcost} \end{array} $$
where x0 is the noisy observation in \({\mathcal{X}}\). The main rationale is that among the solutions to the non-linear optimization problem we want the pre-image which is closest to the noisy input point, hence hopefully reducing possible distortions of the signal. Thus we seek to minimize
$$\rho_2({\mathbf z})=k({\mathbf z},{\mathbf z})-2\sum\limits_{n=1}^N \gamma_n k({\mathbf z},{\mathbf x}_n) +\lambda({\mathbf z}^T{\mathbf z} -2{\mathbf z}^T{\mathbf x}_0) $$
ignoring all z-independent terms. This expression can be minimized for any kernel using a non-linear optimizer.
For RBF kernels the fixed-point iteration scheme can be regularized similarly, which typically leads to a faster evaluation than using a general-purpose optimizer. Introducing regularization in the maximization problem given in Eq. 12 leads to the following objective function
$$\rho_3({\mathbf z}) = 2\sum\limits_{n=1}^N \gamma_n k({\mathbf z},{\mathbf x}_n) - \lambda||{\mathbf z}-{\mathbf x}_0||^2 \label{eq.isr} $$
which we seek to maximize with respect to z. With straightforward algebra we get the regularized fixed-point iteration
$${\mathbf z}_{t+1} = \frac{\frac{2}{c}\sum_{n=1}^N \gamma_n \exp\left(-\frac{1}{c}||{\mathbf z}_t-{\mathbf x}_n||^2\right){\mathbf x}_n+\lambda{\mathbf x}_0}{\frac{2}{c}\sum_{n=1}^N \gamma_n \exp\left(-\frac{1}{c}||{\mathbf z}_t-{\mathbf x}_n||^2\right)+\lambda} $$
In this expression the denominator is given by \(\frac{2}{c}{\left\langle {{\varphi}({\mathbf z}_t),\Psi} \right\rangle}+\lambda\). As λ is a non-negative parameter, the denominator will always be non-zero in the neighborhood of a maximum because the inner-product will be positive in that same neighborhood.
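Under the same assumptions as before, the regularized update in Eq. 20 differs from the unregularized one only through the two λ terms; a sketch:

```python
import numpy as np

def regularized_fixed_point(x0, X, gamma, c, lam, n_iter=200, tol=1e-8):
    """Tikhonov input space regularized fixed-point iteration (Eq. 20).
    lam = 0 recovers the unregularized update; large lam pins z to x0."""
    z = x0.astype(float).copy()
    for _ in range(n_iter):
        w = (2.0 / c) * gamma * np.exp(-((X - z) ** 2).sum(axis=1) / c)
        z_new = (w @ X + lam * x0) / (w.sum() + lam)
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z_new
```

The λ term in the denominator also guards against the vanishing denominator that plagues the unregularized iteration for very non-linear kernels.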

5.2 Sparse Reconstruction by Lasso Regularization

In many applications, introducing other types of regularization seems appealing. We will in the following show how ℓ1-norm regularization improves the performance when the sought pre-image is expected to be sparse, i.e., when only a fraction of the input dimensions are nonzero. Hence, we now seek
$${\mathbf z} = {\underset{{\mathbf z}\in{\mathcal{X}}}{\operatorname{argmin}}\;}||{\varphi}({\mathbf z})-P_q{\varphi}({\mathbf x}_0)||^2 + \lambda ||{\mathbf z} ||_{\ell_1}\label{eq.L1reg} $$
which can be reformulated as minimizing the following cost function
$$\begin{array}{lll} \rho_{4}({\mathbf z})&= R({\mathbf z}) + \lambda T_1({\mathbf z})\\ &=||{\varphi}({\mathbf z})-P_q{\varphi}({\mathbf x})||^2 + \lambda||{\mathbf z}||_{\ell_1}\\ &= -2\sum\limits_{n=1}^N \gamma_n k({\mathbf z},{\mathbf x}_n) + \lambda\sum\limits_{j=1}^D | z_j |\end{array} $$
where, for RBF kernels, the last equality holds up to z-independent terms.

Since T1(z) is not differentiable, implementing a fixed-point iteration scheme is not feasible. Instead we will apply the generalized path seeking (GPS) framework as introduced by Friedman in [5] to estimate both the pre-image and the degree of regularization simultaneously.

In order to calculate the regularization parameter path, λ, and drive the GPS algorithm, the following first order derivatives are needed:
$$\frac{dR}{dz_j} = 2\sum\limits_{n=1}^N \gamma_n \exp\left(-\frac{1}{c}|| {\mathbf z}-{\mathbf x}_n||^2\right) \cdot \frac{2(z_j-x_{jn})}{c} $$
$$\frac{dT_1}{d | z_j |} = 1 $$

Applying the GPS algorithm is now straightforward. Friedman suggests using an adaptive step length when exploring the solution space; however, for simplicity we chose a fixed step length of 1e−2 max(|X|) in the following experiments. This is a very conservative step length, and may be tuned for faster convergence. The algorithm is stopped when the cost function stabilizes.

It is noted that the GPS algorithm could also be used to get an indication of the magnitude of the regularization needed in the fixed-point iteration scheme. However, as the GPS framework needs many more iterations than the fixed-point scheme, this approach is not attractive for general estimation.

For further background on smoothing by ℓp norms we refer the reader to [11].
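The GPS procedure of Friedman [5] traces the entire regularization path; as a simpler illustration (not the algorithm used in the paper), the cost ρ4 can be minimized for a fixed λ by proximal gradient descent (ISTA), combining the gradient in Eq. 23 with soft-thresholding for the ℓ1 term:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_preimage_ista(z0, X, gamma, c, lam, step=1e-2, n_iter=500):
    """Illustrative ISTA sketch (not Friedman's GPS algorithm) for
    rho_4(z) = -2 sum_n gamma_n k(z, x_n) + lam * ||z||_1, Gaussian kernel."""
    z = z0.astype(float).copy()
    for _ in range(n_iter):
        w = gamma * np.exp(-((X - z) ** 2).sum(axis=1) / c)
        grad = (4.0 / c) * (w.sum() * z - w @ X)   # gradient of the smooth part
        z = soft_threshold(z - step * grad, step * lam)
    return z
```

As λ grows, more coordinates of the estimate are thresholded to exactly zero, which is the sparsity mechanism exploited in Section 6.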

6 Experiments

In this section we compare the new regularization approaches to the existing methods proposed by: (a) Kwok-Tsang [8], (b) Dambreville et al. [4], and (c) Mika et al. [9]. The experiments are done on a subset of the USPS data consisting of 16 × 16 pixel handwritten digits. For each of the digits 0, 2, 4, and 9 we chose 100 examples for training and another 100 examples for testing. We added Gaussian noise \(\mathcal{N}(0,0.25)\) and set the regularization parameter in Eq. 19 to λ = 3e−4.

In order to illustrate the stability and performance of the methods, we vary both the number of principal components used to define the signal manifold and the scale parameter c of the Gaussian kernel. For each combination and pre-image estimator, the mean squared error (MSE) of the de-noised result for the 400 test examples is calculated. The iterative approaches are initialized in the noisy test point, and for Kwok and Tsang’s approach 10 neighbors were used for the approximation.

The results are summarized in Fig. 4 where we show the lower 5th and upper 95th percentile confidence intervals for the MSE. In order to ease the comparison and adjust for potential bias in the estimation, all pre-images are re-normalized to the range of the original image before the squared error is calculated. As seen, the confidence intervals blow up for the previous methods—panels (a–c)—in the non-linear regime in which the kernel has a relatively small scale parameter. At the same time the confidence interval points to a much more stable de-noised solution for the new Tikhonov input space regularized approach—as seen in panel (d).
Figure 4

Experiment to illustrate the stability of pre-image based de-noising of USPS digits. A training set of 400 digits (100 each of 0, 2, 4, 9) is used to define the signal manifold. We show the confidence intervals (5th and the 95th percentile) for the mean square error (MSE) in different combinations of kPCA subspace dimension and non-linearity. MSE is computed for 400 de-noised test samples for (a) Kwok-Tsang, (b) Dambreville et al., (c) Mika et al., (d) using Tikhonov input space regularization, and (e) using Lasso regularization. The previous schemes are seen to deteriorate in the non-linear regime (small c) compared to the input space regularization approach.

To better understand the nature of the instability of the previous algorithms we have investigated the diversity of the solutions obtained when starting the fixed-point iterative algorithms in different initial points. Specifically we compare the standard iterative solution of Mika et al. and the new input space regularized version. For each of the 400 test examples the two algorithms are initialized in 40 randomly chosen training examples. This leads to 40 (potentially different) pre-image solutions for each test sample. We measure the stability of these solution sets as the mean pairwise distance between the 40 pre-images, and report the mean across the 400 test examples. This mean and its confidence intervals are presented in Fig. 5 as a function of the non-linearity scale parameter c. As seen, the Tikhonov input space regularization approach produces a stable pre-image even for very non-linear models (small c), where the un-regularized iterative scheme fails.
Figure 5

Illustration of the instability. The mean pairwise distances between solutions obtained after initializing in 40 randomly chosen training set input points (mean, 5th and the 95th percentiles) for Mika et al. (red) and the new Tikhonov input space regularization approach (blue). We use 300 principal components in this study. The previous approach fails to provide a stable pre-image in the non-linear regime (small c). The right panel is a close-up of the box indicated on the left panel. Arrow ‘B’ indicates the scale used in Fig. 6.
Figure 6

Top: Example of de-noised digits using a very non-linear kernel (c = 50) and 100 principal components. The colormap has been adjusted for better visualization. Bottom: The image intensity along the 16 pixel segments indicated by the red and the green lines in the upper panels. Panel a shows the original digits, b shows the digits after Gaussian noise has been added, c shows the digits de-noised by Mika et al.’s algorithm, d de-noising using Tikhonov regularization, and e using sparse reconstruction by Lasso regularization. Note the improved SNR in the results of the new methods.

Finally, Fig. 6 shows visual examples of the de-noised images obtained with Mika et al.’s method and the two new regularized approaches. For the images which are successfully de-noised by Mika et al.’s method, e.g., some of the ’zeros’ or ’nines’, adding regularization has very little effect. However, a clear improvement can be seen for the images for which Mika et al.’s algorithm fails to recover a good visual solution. For these digits the input space regularization method does reconstruct the correct digit, albeit at the price of a slightly less de-noised result. Furthermore, the image intensity, as shown in the lower part of Fig. 6, illustrates the increased SNR achieved by the input space regularization.

The last panel in Fig. 6 illustrates the effect of regularizing pre-image estimation by the sparsity promoting Lasso penalty. The majority of the digits are clearly identifiable and only a minimum of background noise is present. Again, the high SNR is reflected in the lower part of the figure. However, as noted in Fig. 4, sparse reconstruction does not lead to a stable estimate in terms of the MSE measure.

7 Conclusion

In this contribution we addressed the problem of pre-image instability for kernel PCA de-noising. The recognized concerns with current methods, e.g., the sensitivity to local minima and large variability, were demonstrated for the most widely used methods, including Mika et al.’s iterative scheme, Kwok and Tsang’s local linear approximation, and the method of Dambreville et al. By introducing a simple input space distance regularizer in the existing pre-image approximation cost function, we achieved a more stable pre-image with very little sacrifice of de-noising ability. Experimental results on the USPS data illustrated how input space regularization provides a more stable pre-image in the sense of variability between test points, reduces the sensitivity to starting conditions, and provides a better visual result. Furthermore, we introduced ℓ1-norm Lasso regularization and demonstrated that it leads to an improved estimate in terms of visual quality. This regularizer, however, incurs a relatively large mean squared error on the data set investigated here. The trade-off between the quantitative and qualitative performance of the two regularizers needs further investigation.

We thus recommend to augment the cost function for pre-image estimation in Eq. 11 with a task specific penalty term. When the aim is superior visual quality, and the data is known to be sparse, sparse reconstruction by Lasso regularization should be employed. In cases where the objective is both visual impression and stability of the estimate, we suggest the use of Tikhonov input space distance regularization as it provides a reliable pre-image in cases where current methods fail to recover a meaningful result.
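The augmented estimation can be sketched as plain gradient descent on a penalized objective. The snippet below is a minimal illustrative stand-in, not the paper's fixed-point scheme: it assumes a Gaussian kernel k(z, x) = exp(−‖z − x‖²/c) and treats the projection coefficients γ_i as given, minimizing −Σᵢ γᵢ k(z, xᵢ) + λ‖z − x‖², a simplified surrogate for the augmented cost discussed around Eq. 11.

```python
import numpy as np

def rbf(a, b, c):
    """Gaussian kernel k(a, b) = exp(-||a - b||^2 / c)."""
    return np.exp(-np.sum((a - b) ** 2) / c)

def preimage_tikhonov(x_noisy, X_train, gamma, c, lam, n_iter=200, lr=0.1):
    """Gradient descent on a simplified regularized pre-image objective:
        f(z) = -sum_i gamma_i k(z, x_i) + lam * ||z - x_noisy||^2
    The gamma_i stand in for the kernel PCA projection coefficients
    (assumed precomputed); this is an illustrative sketch only.
    """
    z = np.array(x_noisy, dtype=float)
    for _ in range(n_iter):
        grad = 2.0 * lam * (z - x_noisy)  # Tikhonov input space penalty
        for g, xi in zip(gamma, X_train):
            # gradient of -g * k(z, xi) for the Gaussian kernel
            grad += g * (2.0 / c) * rbf(z, xi, c) * (z - xi)
        z = z - lr * grad
    return z
```

With λ = 0 the iterate drifts toward the kernel-weighted training points (the unstable regime discussed above), while a larger λ anchors the solution near the noisy input. Replacing the quadratic penalty with an ℓ1 term λ‖z‖₁ (and a subgradient or proximal step) would give the corresponding Lasso-regularized variant.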

In future work we aim to combine input space regularization with sparse reconstruction in order to achieve both highly stable and visually attractive results, and to extend the concept of sparse reconstruction to other kernel methods. Furthermore, the choice of the amount of regularization is to be investigated further for both methods presented in this paper.


Bold uppercase letters denote matrices, bold lowercase letters represent column vectors, and non-bold letters denote scalars. a_j denotes the j’th column of A, while a_ij denotes the scalar in the i’th row and j’th column of A. Finally, 1_{NN} is an N×N matrix of ones.


The USPS data set is described in [6] and can be downloaded from



This work is supported in part by the Lundbeckfonden through the Center for Integrated Molecular Brain Imaging (Cimbi),

Copyright information

© Springer Science+Business Media, LLC 2010