# Regularized Pre-image Estimation for Kernel PCA De-noising


Abrahamsen, T.J. & Hansen, L.K. *Journal of Signal Processing Systems* (2011) 65: 403. doi:10.1007/s11265-010-0515-4


## Abstract

The main challenge in de-noising by kernel Principal Component Analysis (PCA) is the mapping of de-noised feature space points back into input space, also referred to as “the pre-image problem”. Since the feature space mapping is typically not bijective, pre-image estimation is inherently ill-posed. As a consequence, the most widely used estimation schemes lack stability. A common way to stabilize such estimates is to augment the cost function with a suitable constraint on the solution values. For de-noising applications we here propose Tikhonov input space distance regularization as a stabilizer for pre-image estimation, or sparse reconstruction by Lasso regularization in cases where the main objective is to improve the visual simplicity. We perform extensive experiments on the USPS digit modeling problem to evaluate the stability of three widely used pre-image estimators. We show that the previous methods lack stability in the non-linear regime; however, applying our proposed input space distance regularizer stabilizes the estimates with a limited sacrifice in terms of de-noising efficiency. Furthermore, we show how sparse reconstruction can lead to improved visual quality of the estimated pre-image.

### Keywords

Kernel PCA · Pre-image · Regularization · De-noising · Sparsity

## 1 Introduction

We are interested in unsupervised learning methods for de-noising. If necessary we will use non-linear maps to project noisy data onto a clean signal manifold. Kernel PCA and similar methods are widely used candidates for such projection beyond conventional linear unsupervised learning schemes like principal component analysis (PCA), independent component analysis (ICA), and non-negative matrix factorization (NMF). The basic idea is to implement the projection in three steps. In the first step we map the original input space data into a feature space. The second step then consists of using a conventional linear algorithm, like PCA, to identify the signal manifold by linear projection in feature space. Finally, in the third step we estimate the de-noised input space points that best correspond to the projected feature space points. The latter step is referred to as the *pre-image problem*. Unfortunately, finding a reliable pre-image is entirely non-trivial and has given rise to several algorithms [2, 4, 8, 9, 14]. *In this work we experimentally analyze the stability of the estimated pre-images from the most used of these algorithms, and we suggest introducing regularization in order to improve performance and stability relative to the existing approaches. If the aim is stabilization, Tikhonov input space regularization is recommended, whereas sparse reconstruction by Lasso regularization is found superior for sparse data when the aim is improved visual quality.*

Kernel methods are based on a kernel function *k*(**x**, **x**′) = *φ*(**x**)^{T}*φ*(**x**′), where \({\varphi}\ : {\mathcal{X}}\mapsto{\mathcal{F}}\) is a possibly nonlinear map from the *D*-dimensional input space \({\mathcal{X}}\) to the high (possibly infinite) dimensional feature space \({\mathcal{F}}\) (see notation^{1}). In de-noising and a number of other applications it is of interest to reconstruct a data point in input space from a point in feature space, i.e., to apply the inverse map of *φ*. Given a point Ψ in feature space, the pre-image problem thus consists of finding a point \({\mathbf z}\in{\mathcal{X}}\) in the input space such that *φ*(**z**) = Ψ; **z** is then called the pre-image of Ψ.

For many non-linear kernels \(\rm{dim}({\mathcal{F}})\gg \rm{dim}({\mathcal{X}})\) and *φ* is not surjective. Furthermore, whether *φ* is injective depends on the choice of kernel function. As a function *f*: *X*↦*Y* has an inverse iff it is bijective, we do not expect *φ* to have an inverse. When *φ* is not surjective, it follows that not every point in \({\mathcal{F}}\), or even in the span of \(\{{\varphi}({\mathcal{X}})\}\), is the image of some \({\mathbf x}\in{\mathcal{X}}\). Finally, when *φ* is not injective, a recovered pre-image need not be unique. Thus the pre-image problem is ill-posed [1, 3, 4, 8, 9, 12, 14]. As we cannot expect an exact pre-image, we follow [9] and relax the quest to finding an *approximate pre-image*, i.e., a point in input space which maps into a point in feature space ‘as close as possible’ to Ψ (Fig. 1).

## 2 Kernel PCA

Kernel Principal Component Analysis is a nonlinear generalization of linear PCA, in which PCA is carried out on the data mapped into the feature space \({\mathcal{F}}\) [13]. However, as \({\mathcal{F}}\) can be infinite dimensional, we cannot work directly with the feature space covariance matrix. Fortunately, the so-called kernel trick allows us to formulate nonlinear extensions of linear algorithms when these are expressed in terms of inner products.

Let {**x**_{1},..., **x**_{N}} be *N* training data points in \({\mathcal{X}}\) and {*φ*(**x**_{1}),..., *φ*(**x**_{N})} be the corresponding images in \({\mathcal{F}}\). The mean of the *φ*-mapped data points is denoted \(\bar{{\varphi}}=\frac{1}{N}\sum_{n=1}^N{\varphi}({\mathbf x}_n)\) and the ‘centered’ images are given by \({\tilde{\varphi}}({\mathbf x}) = {{\varphi}({\mathbf x})}-{\bar{\varphi}}\). Now, let **K** denote the kernel matrix with elements *K*_{ij} = *k*(**x**_{i}, **x**_{j}); then kernel PCA can be performed by solving the eigenvalue problem

$$\tilde{{\mathbf K}}\,{\boldsymbol\alpha}_i = N\lambda_i\,{\boldsymbol\alpha}_i,$$

where \(\tilde{{\mathbf K}}\) is the centered kernel matrix. The projection of a *φ*-mapped test point onto the *i*’th principal component is

$$\beta_i({\mathbf x}) = {\mathbf v}_i^T\,{\tilde{\varphi}}({\mathbf x}) = \sum_{n=1}^N \alpha_{in}\,{\tilde{k}}({\mathbf x}_n,{\mathbf x}),$$

where **v**_{i} is the *i*’th eigenvector of the feature space covariance matrix and the **α**_{i}’s have been normalized. The centered kernel function can be found as \({\tilde{k}}({\mathbf x},{\mathbf x}')\!=\!k({\mathbf x},{\mathbf x'})\!-\!\frac{1}{N}{\mathbf 1}_{1N}{\mathbf k}_{\mathbf x}\!-\!\frac{1}{N}{\mathbf 1}_{1N}{\mathbf k}_{\mathbf x'}\!+\!\frac{1}{N^2}{\mathbf 1}_{1N}{\mathbf K}{\mathbf 1}_{N1}\), where \({\mathbf k}_{{\mathbf x}}\!=\![k({\mathbf x},{\mathbf x}_1),\ldots, k({\mathbf x},{\mathbf x}_N)]^T\). The projection of *φ*(**x**) onto the subspace spanned by the first *q* eigenvectors will be denoted *P*_{q}*φ*(**x**) and can be found as

$$P_q{\varphi}({\mathbf x}) = \bar{\varphi} + \sum_{i=1}^q \beta_i({\mathbf x})\,{\mathbf v}_i.$$
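The centering, eigendecomposition, and projection steps above can be sketched in NumPy. This is an illustrative implementation under our own naming (the paper does not give code), assuming a Gaussian kernel:

```python
import numpy as np

def gaussian_kernel(A, B, c):
    """Pairwise Gaussian kernel k(x, x') = exp(-||x - x'||^2 / c)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / c)

def kernel_pca(X, c, q):
    """Eigendecompose the centered kernel matrix; keep the q leading
    eigenvectors, normalized so the feature space axes have unit norm."""
    N = X.shape[0]
    K = gaussian_kernel(X, X, c)
    H = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    K_tilde = H @ K @ H                      # centered kernel matrix
    eigval, eigvec = np.linalg.eigh(K_tilde)
    idx = np.argsort(eigval)[::-1][:q]       # q largest eigenvalues
    alpha = eigvec[:, idx] / np.sqrt(eigval[idx])
    return K, alpha

def project(X, K, alpha, x, c):
    """Coordinates beta_i of a test point on the first q principal axes,
    computed via the centered kernel vector."""
    k_x = gaussian_kernel(x[None, :], X, c).ravel()
    k_tilde = k_x - K.mean(axis=0) - k_x.mean() + K.mean()
    return alpha.T @ k_tilde
```

For a training point the centered kernel vector coincides with the corresponding column of the centered kernel matrix, which gives a quick sanity check of the implementation.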

## 3 Approximate Pre-images

The feature space distance can be expanded as \(||{\varphi}({\mathbf z})-\Psi||^2 = k({\mathbf z},{\mathbf z}) - 2\,\Psi^T{\varphi}({\mathbf z}) + ||\Psi||^2\). Since *k*(**z**, **z**) is constant for RBF kernels and ||Ψ||^{2} is independent of **z**, minimizing ||*φ*(**z**) − Ψ||^{2} is equivalent to maximizing the co-linearity \(\Psi^T{\varphi}({\mathbf z})\). As \({\mathcal{F}}\) is a RKHS, the distance will be the same before and after centering. However, the expression gets a bit more tedious when using explicit centering, as will be shown later.

An approximate pre-image can be found by minimizing the distance between *φ*(**z**) and Ψ with respect to **z**. By assuming that Ψ lies in (or close to) the span of {*φ*(**x**_{i})}, Ψ can be represented as a linear combination of the training images, i.e., we can take Ψ = *P*_{q}*φ*(**x**) without loss of generality. When *q* = *N* this translates to projecting Ψ onto the span of {*φ*(**x**_{i})}. Thus, we are interested in an expression for the cost \(R({\mathbf z}) = ||{\varphi}({\mathbf z})-P_q{\varphi}({\mathbf x})||^2\), where the **z**-independent terms originating from \(||P_q{\varphi}({\mathbf x})||^2\) have been collected in Ω.

### 3.1 Overview of Existing Algorithms

The non-linear optimization problem associated with finding the pre-image has been approached in a variety of ways. In the original work by Mika et al. [9, 14] a fixed-point iteration method was proposed. It is a noted drawback of this method that it can be numerically unstable, sensitive to the initial starting point, and converge to a local extremum. To overcome this problem a more ‘direct’ approach was taken by Kwok and Tsang [8]. They combined the idea of multidimensional scaling (MDS) with the relationship of distance measures in feature space and input space, thereby deriving a non-iterative solution. These are the two approaches most widely used in applications. However, several modifications have already been proposed. In order to overcome possible numerical instabilities of the fixed-point approach, various ways of initialization have been suggested. The algorithm can be started in a ‘random’ input space point, but this can lead to slow convergence in real-life problems, since the cost function can be very flat in regions away from data. Alternatively, for de-noising applications, it can be initialized in the input space point which we seek to de-noise. However, according to Takahashi and Kurita [15] this strategy will only work if the signal-to-noise ratio (SNR) is high. Instead Kim et al. [7] suggested initializing the fixed-point iteration scheme in the solution found by Kwok and Tsang’s direct method. Later it was claimed that a more efficient starting point would be the mean of a certain number of neighbors of the point to be de-noised [16]. Dambreville et al. [4] proposed a modification of the method developed by Mika et al. utilizing feature space distances. This method also minimizes the distance constraint in Eq. 4, but does so in a non-iterative approximation, thereby avoiding numerical instabilities. Bakir et al. [2] used kernel ridge regression to learn an inverse mapping of *φ*.
While this formulation is in very general terms, the actual implementation is similar to that of Kwok and Tsang [8]. The main issue is that we typically only have indirect access to feature space points, thus a learned pre-image needs to be formulated in terms of distances as in Kwok and Tsang’s method, rather than explicit input-output examples. It should be noted that with the relatively general formulation, the method of Bakir et al. can in some cases be applied beyond Kwok and Tsang’s method, e.g., to non-Euclidean input spaces. In light of the recognized ill-posed nature of the inverse problem, more robust estimators have been pursued. Nguyen and De la Torre Frade [10] introduced regularization that penalized the projection in feature space, while Zheng and Lai [20] used a ridge regression regularizer for the weights of a learned pre-image estimator as originally proposed by Bakir et al. [2].

For RBF kernels *k*(**z**, **z**) is constant for all **z**; hence minimizing the squared distance in Eq. 11 is identical to maximizing the co-linearity term. Requiring that the derivative with respect to **z** is zero leads to the following fixed-point iteration for a Gaussian kernel of the form \(k({\mathbf x},{\mathbf x}')=\exp\left(-\frac{1}{c}||{\mathbf x}-{\mathbf x}' ||^2\right)\), where *c* controls the width of the kernel and thereby the non-linearity of the associated feature space map [9]:

$${\mathbf z}_{t+1} = \frac{\sum_{i=1}^N \gamma_i \exp\left(-\frac{1}{c}||{\mathbf z}_t-{\mathbf x}_i||^2\right){\mathbf x}_i}{\sum_{i=1}^N \gamma_i \exp\left(-\frac{1}{c}||{\mathbf z}_t-{\mathbf x}_i||^2\right)},$$

where the *γ*_{i}’s are the expansion coefficients of Ψ in terms of the training images. The denominator can become arbitrarily small for certain **z**. As we shall see, this implies that the solutions are at times highly unstable.
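The iteration above can be sketched directly in NumPy. This is a minimal illustration under our own naming (`gamma` holds the expansion coefficients of the projected feature space point), with an explicit guard on the near-zero denominator that causes the instability discussed below:

```python
import numpy as np

def mika_fixed_point(X, gamma, z0, c, n_iter=100, tol=1e-8):
    """Un-regularized fixed-point iteration for the Gaussian kernel:
        z <- sum_i gamma_i k(z, x_i) x_i / sum_i gamma_i k(z, x_i)
    X: (N, D) training points; gamma: (N,) expansion coefficients;
    z0: starting point (e.g., the noisy observation)."""
    z = z0.astype(float).copy()
    for _ in range(n_iter):
        w = gamma * np.exp(-((X - z) ** 2).sum(axis=1) / c)
        denom = w.sum()
        if abs(denom) < 1e-12:          # near-zero denominator: unstable region
            break
        z_new = (w[:, None] * X).sum(axis=0) / denom
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z
```

With non-negative weights the update is a convex combination of training points, so the iterate stays inside the bounding box of the data; instability arises when the weights nearly cancel.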

## 4 Instability Issues

Some of the most recent publications (e.g., [1, 17]) argue that the methods of Mika et al. [9], Kwok and Tsang [8], and Dambreville et al. [4] are the most reliable. In this section we show that these current approaches suffer from different weaknesses.

A distinctive feature of all the algorithms is that they seek to determine the pre-image as a weighted average of the training points. In the method proposed by Kwok and Tsang only *k* of the training points are used for the estimation, and their weights are based on a distance relation between feature space and input space and the persistence of this distance across the *φ*-mapping. In Mika et al.’s approach all training points contribute to estimating the pre-image, and the individual weights are found using the pre-image itself, hence the method becomes iterative. Furthermore, the weight of a given training point decays exponentially with input space distance, so only points close to the pre-image contribute significantly to the pre-image estimate. Dambreville et al. substituted the iterative approach by a direct formula, where the weights decrease linearly with feature space distance, giving high weight to training points for which *φ*(**x**_{i}) is close to Ψ.

As the scale *c* increases, the number of distinct pre-images is seen to decrease, until *c* reaches a certain level where kernel PCA approaches linear PCA. When this happens the pre-image is drawn towards the noisy observation, as linear PCA fails to capture the non-linear trends clearly visible in the data. In the lower part of Fig. 3 the value of the cost function, *R*(**z**) in Eq. 11, is shown for all the found pre-images as a function of the scale. From this figure it is clear that using a very non-linear kernel results in a cost function with not only many local minima, but minima that practically all have the same value, making iterative algorithms very sensitive to the point of initialization.

Based on the simple examples shown here, it seems reasonable to try to improve the stability of the current approaches. We suggest this is done by introducing regularization in Eq. 11 as further described in the following sections.

## 5 Regularization

Given the cost function *R*(**z**), the regularized version is obtained by adding a penalty, *T*(**z**), so that the solution can be formulated as

$${\mathbf z}^* = \mathop{\rm argmin}_{{\mathbf z}}\; R({\mathbf z}) + \lambda\, T({\mathbf z}),$$

where *λ* > 0 is the regularization parameter controlling the strength of the penalty term. Hence, Eq. 14 can lead to various estimates depending not only on the chosen penalty term but also on the value of *λ*.

If both *R*(**z)** and *T*(**z)** are differentiable a fixed-point iteration scheme similar to that of Mika et al. can easily be derived.

While Tikhonov regularization stabilizes the estimate, Lasso regularization will, for sufficiently large values of *λ*, force some of the *z*_{j}’s to zero, leading to a sparse pre-image. Notice also that the Lasso problem can be interpreted as a MAP estimate with a Laplacian prior on the *z*_{j}’s.

The type of penalty should be chosen according to the given problem, and hence, prior knowledge of the expected pre-image will often work as the base for choosing the penalty term.

### 5.1 Tikhonov Input Space Regularization

For Tikhonov input space regularization we choose \(T({\mathbf z}) = ||{\mathbf z}-{\mathbf x}_0||^2\), where **x**_{0} is the noisy observation in \({\mathcal{X}}\). The main rationale is that among the solutions to the non-linear optimization problem we want the pre-image which is closest to the noisy input point, and hence hopefully reduce possible distortions of the signal. Thus we seek to minimize \(R({\mathbf z}) + \lambda||{\mathbf z}-{\mathbf x}_0||^2\) up to **z**-independent terms. This expression can be minimized for any kernel using a non-linear optimizer.

For the Gaussian kernel we can again require that the derivative with respect to **z** is zero. With straightforward algebra we get the regularized fixed-point iteration

$${\mathbf z}_{t+1} = \frac{\sum_{i=1}^N \gamma_i \exp\left(-\frac{1}{c}||{\mathbf z}_t-{\mathbf x}_i||^2\right){\mathbf x}_i + c\lambda\,{\mathbf x}_0}{\sum_{i=1}^N \gamma_i \exp\left(-\frac{1}{c}||{\mathbf z}_t-{\mathbf x}_i||^2\right) + c\lambda}.$$

As *λ* is a non-negative parameter, the denominator will always be non-zero in the neighborhood of a maximum, because the inner-product term will be positive in that same neighborhood.
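A sketch of this regularized iteration in NumPy follows (our own naming, not the paper's code). The factor *c* multiplying *λ* in the update comes from the gradient of the Gaussian kernel carrying a 2/*c* while the quadratic penalty carries a plain 2; absorbing constants differently only rescales *λ*:

```python
import numpy as np

def tikhonov_fixed_point(X, gamma, x0, c, lam, n_iter=200, tol=1e-8):
    """Input-space-regularized fixed-point iteration (sketch):
        z <- (sum_i gamma_i k(z, x_i) x_i + c*lam*x0)
             / (sum_i gamma_i k(z, x_i) + c*lam)
    x0 is the noisy observation; lam >= 0 keeps the denominator
    bounded away from zero, stabilizing the update."""
    z = x0.astype(float).copy()
    for _ in range(n_iter):
        w = gamma * np.exp(-((X - z) ** 2).sum(axis=1) / c)
        z_new = ((w[:, None] * X).sum(axis=0) + c * lam * x0) / (w.sum() + c * lam)
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z
```

For very large *λ* the update collapses onto **x**_{0}, i.e., no de-noising; for *λ* → 0 it reduces to the un-regularized iteration, which matches the trade-off discussed in the experiments.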

### 5.2 Sparse Reconstruction by Lasso Regularization

We expect ℓ_{1}-norm regularization to improve performance when the sought pre-image is expected to be sparse, i.e., when only a fraction of the input dimensions are nonzero. Hence, we now seek to minimize \(R({\mathbf z}) + \lambda\, T_1({\mathbf z})\) with the Lasso penalty \(T_1({\mathbf z})=||{\mathbf z}||_1\).

Since *T*_{1}(**z)** is not differentiable, implementing a fixed-point iteration scheme is not feasible. Instead we will apply the generalized path seeking (GPS) framework as introduced by Friedman in [5] to estimate both the pre-image and the degree of regularization simultaneously.

To trace the solution path in *λ* and drive the GPS algorithm, the first order derivatives of *R*(**z)** and *T*_{1}(**z)** with respect to each *z*_{j} are needed.

Applying the GPS algorithm is now straightforward. Friedman suggests using an adaptive step length when exploring the solution space; however, for simplicity we chose a fixed step length of 10^{ − 2} max (|**X**|) in the following experiments. This is a very conservative step length and may be tuned for faster convergence. The algorithm is stopped when the cost function stabilizes.

It is noted that the GPS algorithm could also be used to get an indication of the magnitude of regularization needed in the fixed-point iteration scheme. However, as the GPS framework needs many more iterations than the fixed-point scheme, this way of estimating the regularization level is not attractive in general.

For further background on smoothing by ℓ_{p} norms we refer the reader to [11].
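To make the sparse objective concrete, here is a sketch that minimizes the same ℓ_{1}-penalized cost with proximal gradient descent (ISTA), a simpler stand-in for the GPS framework used in the paper; the function and parameter names are our own:

```python
import numpy as np

def lasso_preimage_ista(X, gamma, c, lam, step=1e-2, n_iter=2000):
    """Sparse pre-image via proximal gradient (ISTA), NOT the paper's GPS.
    Minimizes the z-dependent part of the cost for a Gaussian kernel,
        -2 * sum_i gamma_i exp(-||z - x_i||^2 / c) + lam * ||z||_1,
    alternating a gradient step with l1 soft-thresholding."""
    z = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = gamma * np.exp(-((X - z) ** 2).sum(axis=1) / c)
        # gradient of the smooth part: (4/c) * sum_i w_i * (z - x_i)
        grad = (4.0 / c) * (w.sum() * z - (w[:, None] * X).sum(axis=0))
        z = z - step * grad
        # l1 proximal operator (soft-threshold) induces exact zeros
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return z
```

The soft-threshold step is what drives individual coordinates exactly to zero, mirroring the sparsity behavior of the Lasso penalty described above.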

## 6 Experiments

In this section we compare the new regularization approaches to the existing methods proposed by: (a) Kwok and Tsang [8], (b) Dambreville et al. [4], and (c) Mika et al. [9]. The experiments are done on a subset of the USPS data consisting of 16 × 16 pixel handwritten digits.^{2} For each of the digits 0, 2, 4, and 9 we chose 100 examples for training and another 100 examples for testing. We added Gaussian noise \(\mathcal{N}(0,0.25)\) and set the regularization parameter in Eq. 19 to *λ* = 3·10^{ − 4}.

In order to illustrate the stability and performance of the methods, we vary both the number of principal components used to define the signal manifold and the scale parameter *c* of the Gaussian kernel. For each combination and pre-image estimator, the mean squared error (MSE) of the de-noised result for the 400 *test* examples is calculated. The iterative approaches are initialized in the noisy test point, and for Kwok and Tsang's approach 10 neighbors were used for the approximation.
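The noise model and error measure of the experimental setup can be sketched as follows (an illustrative reproduction of the setup, not the authors' code; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(X, sigma2=0.25):
    """Add zero-mean Gaussian noise with variance sigma2 (0.25 in the paper)."""
    return X + rng.normal(0.0, np.sqrt(sigma2), size=X.shape)

def mse(denoised, clean):
    """Mean squared error of the de-noised estimates over all pixels."""
    return float(np.mean((denoised - clean) ** 2))
```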

The MSE is evaluated as a function of the scale parameter *c*. As seen, the Tikhonov input space regularization approach produces a stable pre-image even for very non-linear models (small *c*), where the un-regularized iterative scheme fails.

Finally, Fig. 6 shows visual examples of the de-noised images obtained with Mika et al.’s method and the two new regularized approaches. For the images which are successfully de-noised by Mika et al.’s method, e.g., some of the ’zeros’ or ’nines’, adding regularization has very little effect. However, a clear improvement can be seen for the images for which Mika et al.’s algorithm fails to recover a good visual solution. For these digits the input space regularization method does reconstruct the correct digit, albeit at the price of a slightly less de-noised result. Furthermore, the image intensity, as shown in the lower part of Fig. 6, illustrates the increased SNR achieved by the input space regularization.

The last panel in Fig. 6 illustrates the effect of regularizing pre-image estimation by the sparsity promoting Lasso penalty. The majority of the digits are clearly identifiable and only a minimum of background noise is present. Again the high SNR is reflected in the lower part of the figure. However, as noted in Fig. 4 sparse reconstruction does not lead to a stable estimation in terms of the MSE measure.

## 7 Conclusion

In this contribution we addressed the problem of pre-image instability for kernel PCA de-noising. The recognized concerns of current methods, e.g., the sensitivity to local minima and large variability were demonstrated for the most widely used methods including Mika et al.’s iterative scheme, Kwok-Tsang’s local linear approximation and the method of Dambreville et al. By introducing simple input space distance regularization in the existing pre-image approximation cost function, we achieved a more stable pre-image, with very little sacrifice of the de-noising ability. Experimental results on the USPS data illustrated how input space regularization provides a more stable pre-image in the sense of variability between test points and reduced the sensitivity to starting conditions as well as provided a better visual result. Furthermore, we introduced ℓ_{1}-norm Lasso regularization and demonstrated that it leads to an improved estimate in terms of visual quality. This regularizer however incurs a relatively large mean squared error in the data set investigated here. The trade-off between quantitative and qualitative performances of the two regularizers needs further investigation.

We thus recommend to augment the cost function for pre-image estimation in Eq. 11 with a task specific penalty term. When the aim is superior visual quality, and the data is known to be sparse, sparse reconstruction by Lasso regularization should be employed. In cases where the objective is both visual impression and stability of the estimate, we suggest the use of Tikhonov input space distance regularization as it provides a reliable pre-image in cases where current methods fail to recover a meaningful result.

In future work we aim to combine input space regularization with sparse reconstruction in order to achieve both highly stable and visually attractive results, as well as extend the concept of sparse reconstruction in relation to kernel methods. Furthermore, the amount of regularization is to be investigated further for both methods presented in this paper.

^{1} Notation: Bold uppercase letters denote matrices, bold lowercase letters represent column vectors, and non-bold letters denote scalars. **a**_{j} denotes the *j*’th column of **A**, while *a*_{ij} denotes the scalar in the *i*’th row and *j*’th column of **A**. Finally, **1**_{NN} is an *N*×*N* matrix of ones.

## Acknowledgement

This work is supported in part by the Lundbeckfonden through the Center for Integrated Molecular Brain Imaging (Cimbi), www.cimbi.org.