In this section, the proposed one-class kernel PCA model ensemble is introduced. The theory of kernel PCA and of pattern reconstruction via the preimage is reviewed first, and the proposed KPCA ensemble is then described.
3.1 KPCA and pattern reconstruction via preimage
Traditional (linear) PCA preserves the greatest variations of the data by approximating the data in a principal component subspace spanned by the leading eigenvectors; noise and less important variations are discarded. Kernel PCA inherits this scheme but performs linear PCA in the kernel feature space $\mathcal{F}$. Suppose $\mathcal{X}$ is the original input data space and $\mathcal{F}$ is a reproducing kernel Hilbert space (RKHS), also called the feature space, associated with a kernel function $\kappa(x,y)=\langle\varphi(x),\varphi(y)\rangle$, where $\varphi:\mathcal{X}\rightarrow\mathcal{F}$ is the mapping induced by $\kappa$. Given a set of patterns $\{x_1,\ldots,x_N\}\subset\mathcal{X}$, kernel PCA performs the traditional linear PCA in $\mathcal{F}$. Similar to linear PCA, KPCA has the eigendecomposition

$$HKH=U\Lambda U^{\top},$$

where $K$ is the kernel matrix with $K_{ij}=\kappa(x_i,x_j)$, $H=I-\frac{1}{N}\mathbf{1}\mathbf{1}^{\top}$ is the centering matrix (with $I$ the $N\times N$ identity matrix and $\mathbf{1}=[1,1,\ldots,1]^{\top}$ an $N\times 1$ vector), $U=[\alpha_1,\ldots,\alpha_N]$ is the matrix containing the eigenvectors $\alpha_i=[\alpha_{i1},\ldots,\alpha_{iN}]^{\top}$, and $\Lambda=\operatorname{diag}(\lambda_1,\ldots,\lambda_N)$ contains the corresponding eigenvalues.
Denote the mean of the $\varphi$-mapped patterns by $\bar{\varphi}=\frac{1}{N}\sum_{i=1}^{N}\varphi(x_i)$. Then for a mapped pattern $\varphi(x_i)$, the centered map can be defined as follows:

$$\tilde{\varphi}(x_i)=\varphi(x_i)-\bar{\varphi}. \qquad (3)$$

The $k$th eigenvector $V_k$ of the covariance matrix in the feature space is a linear combination of the centered maps $\tilde{\varphi}(x_i)$:

$$V_k=\sum_{i=1}^{N}\tilde{\alpha}_{ki}\,\tilde{\varphi}(x_i), \qquad (4)$$
where $\tilde{\alpha}_{ki}=\alpha_{ki}/\sqrt{\lambda_k}$. If we use $\beta_k$ to denote the projection of the $\varphi$-image of a pattern $x$ onto the $k$th component $V_k$, then:

$$\beta_k=\langle V_k,\tilde{\varphi}(x)\rangle=\sum_{i=1}^{N}\tilde{\alpha}_{ki}\,\tilde{\kappa}(x,x_i), \qquad (5)$$

where

$$\tilde{\kappa}(x,x_i)=\langle\tilde{\varphi}(x),\tilde{\varphi}(x_i)\rangle=\kappa(x,x_i)-\frac{1}{N}\mathbf{1}^{\top}k_x-\frac{1}{N}\mathbf{1}^{\top}k_{x_i}+\frac{1}{N^2}\mathbf{1}^{\top}K\mathbf{1}, \qquad (6)$$

where $k_x=[\kappa(x,x_1),\ldots,\kappa(x,x_N)]^{\top}$. Denote

$$\tilde{k}_x=[\tilde{\kappa}(x,x_1),\ldots,\tilde{\kappa}(x,x_N)]^{\top}=H\Big(k_x-\frac{1}{N}K\mathbf{1}\Big), \qquad (7)$$

then $\beta_k$ in Equation 5 can be rewritten as $\beta_k=\tilde{\alpha}_k^{\top}\tilde{k}_x$, with $\tilde{\alpha}_k=[\tilde{\alpha}_{k1},\ldots,\tilde{\alpha}_{kN}]^{\top}$.

Therefore, the projection $P(\varphi(x))$ of $\varphi(x)$ onto the subspace spanned by the first $M$ eigenvectors can be obtained by:

$$P(\varphi(x))=\sum_{k=1}^{M}\beta_k V_k+\bar{\varphi}, \qquad (8)$$

where $M\leq N$ is the number of retained components.
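To make the centering and projection steps above concrete, the following NumPy sketch computes the eigendecomposition $HKH=U\Lambda U^{\top}$ and the projection coefficients $\beta_k$ of Equations 5 to 7. The Gaussian kernel, the function names, and the choice of $M$ are illustrative assumptions, not the exact implementation used in the paper.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel values kappa(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def fit_kpca(X, sigma=1.0, M=5):
    # Eigendecomposition of the centered kernel matrix HKH = U Lambda U^T.
    # Assumes the M leading eigenvalues are positive.
    N = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    H = np.eye(N) - np.ones((N, N)) / N
    lam, U = np.linalg.eigh(H @ K @ H)            # ascending eigenvalues
    lam, U = lam[::-1][:M], U[:, ::-1][:, :M]     # keep the M leading components
    alpha_tilde = U / np.sqrt(lam)                # alpha_k / sqrt(lambda_k), cf. Eq. (4)
    return {"X": X, "K": K, "H": H, "alpha_tilde": alpha_tilde, "sigma": sigma}

def project(model, x):
    # beta_k = alpha_tilde_k^T k_tilde_x, cf. Eqs. (5)-(7).
    X, K, H, sigma = model["X"], model["K"], model["H"], model["sigma"]
    N = X.shape[0]
    k_x = gaussian_kernel(x[None, :], X, sigma).ravel()
    k_x_tilde = H @ (k_x - K @ np.ones(N) / N)
    return model["alpha_tilde"].T @ k_x_tilde     # projection coefficients beta
```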
PCA is a simple method whereby a model for the distribution of training data can be generated. For linear distributions, PCA can be used; however, many real-world problems are nonlinear. Methods like Gaussian mixture models and auto-associative neural networks have been used for nonlinear problems. These methods, however, need to solve a nonlinear optimization problem and are thus prone to local minima and sensitive to the initialization [29]. KPCA runs PCA in the high-dimensional feature space through the nonlinearity of the kernel, and this allows for a refinement in the description of the patterns of interest. Therefore, kernel PCA was chosen to model the nonlinear distribution of the training samples here.
Kernel PCA has been widely used for classification tasks. A straightforward way to use kernel PCA for classification is to derive the classification boundaries directly from the distances between the mapped patterns in the feature space [29, 39]. However, as the experimental results in [29] show, the classification performance depends strongly on the parameters chosen for the kernel function, and there is no guideline for parameter selection in real classification tasks. A more recent work also demonstrated that the kernel-space distance is not sufficient for unsupervised learning algorithms and that distances in the input space are more appropriate for classification [40].
In this paper, we focus on the distances between a pattern $x$ and its reconstructions produced by the kernel PCA models trained on the different classes. Kernel PCA is used as a one-class classifier here, which means that at least one KPCA model is trained for each class. For an $m$-class classification task, there will be $m$ KPCA models, one per class. Given an unlabeled pattern $x$, every KPCA model produces a projection $P(\varphi(x))_i$, $i=1,\ldots,m$. During classification, $x$ is reconstructed in the input space from every $P(\varphi(x))_i$, yielding $m$ reconstruction results $\hat{x}_i$. The distance between $x$ and each $\hat{x}_i$ (also called the reconstruction error) is calculated, and $x$ is assigned to the class whose KPCA model produces the minimum reconstruction error. Ideally, the KPCA model trained on the class to which $x$ belongs will always give the minimum reconstruction error. In the proposed classification scheme, multiple KPCA models are trained for each class, and the reconstruction errors of the KPCA models from different classes are combined for classification, as described in Sections 3.2 and 3.3.
In order to obtain the input-space distance between $x$ and its reconstruction, it is necessary to map $P(\varphi(x))$ back into the input space. This reverse mapping from the feature space back to the input space is called the preimage problem (Figure 1). However, the preimage problem is ill-posed, and the exact preimage $x'$ of $P(\varphi(x))$ generally does not exist in the input space [41]; instead, one can only find an approximation $\hat{x}$ in the input space such that

$$\varphi(\hat{x})\approx P(\varphi(x)). \qquad (9)$$
Several algorithms have been proposed to address the preimage learning problem. Mika et al. [41] proposed an iterative method that determines the preimage by minimizing the least-squares distance error. Kwok and Tsang proposed a distance constraint learning (DCL) method that finds the preimage using a technique similar to multi-dimensional scaling (MDS) [42]. In a more recent work, Zheng et al. [43] proposed a weakly supervised penalty strategy for preimage learning in KPCA; however, their method needs information from both positive and negative classes. As we are only interested in one-class scenarios, the distance constraint method of [42] was adopted for the work described in this paper. We briefly review the method here.
For any two patterns $x_i$ and $x_j$ in the input space, the Euclidean distance $d(x_i,x_j)$ can be easily obtained. Similarly, the feature-space distance between their $\varphi$-mapped images can also be obtained. For many commonly used kernels, such as the Gaussian kernels of the form $\kappa(x_i,x_j)=\kappa(d^2(x_i,x_j))$, there is a simple relationship between the feature-space distance and the input-space distance [44]:

$$\tilde{d}_{ij}^2=\tilde{d}^2\big(\varphi(x_i),\varphi(x_j)\big)=K_{ii}+K_{jj}-2K_{ij}=2\big(\kappa(0)-\kappa(d_{ij}^2)\big). \qquad (10)$$

Therefore,

$$\kappa(d_{ij}^2)=\tfrac{1}{2}\big(K_{ii}+K_{jj}-\tilde{d}_{ij}^2\big). \qquad (11)$$

As $\kappa$ is invertible, $d_{ij}^2$ can be obtained if $\tilde{d}_{ij}^2$ is known.
A given training set has $n$ patterns $X=\{x_1,\ldots,x_n\}$. For a pattern $x$ in the input space, the corresponding $\varphi(x)$ is projected to $P(\varphi(x))$; then, for each training pattern $x_i$ in $X$, $P(\varphi(x))$ lies at a certain distance from $\varphi(x_i)$ in the feature space. This feature-space distance can be obtained by:

$$\tilde{d}^2\big(P(\varphi(x)),\varphi(x_i)\big)=\|P(\varphi(x))\|^2+\kappa(x_i,x_i)-2\big\langle P(\varphi(x)),\varphi(x_i)\big\rangle. \qquad (12)$$
Equation 12 can be evaluated using Equations 5 and 8. Therefore, the kernel-space distances between $P(\varphi(x))$ and each $x_i$ can now be obtained. Denote the kernel-space distance between $P(\varphi(x))$ and $x_i$ as:

$$\tilde{d}_i^2=\tilde{d}^2\big(P(\varphi(x)),\varphi(x_i)\big),\quad i=1,\ldots,n. \qquad (13)$$

Through Equation 11, each $\tilde{d}_i^2$ determines a corresponding input-space distance $d_i^2$. The location of the preimage $\hat{x}$ is obtained by requiring $d^2(\hat{x},x_i)$ to be as close to these values as possible, i.e.,

$$\hat{x}=\arg\min_{x'}\sum_{i=1}^{n}\big(d^2(x',x_i)-d_i^2\big)^2. \qquad (14)$$

To this end, in DCL, the training set $X$ is restricted to the $n$ nearest neighbors of $x$, and least-squares optimization is used to obtain $\hat{x}$.
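As an illustration of how these distance constraints can be turned into a preimage, the sketch below inverts Equation 11 for a Gaussian kernel and then solves Equation 14 with a generic nonlinear least-squares routine. The closed-form, MDS-style solution of [42] is replaced here by scipy.optimize.least_squares purely for clarity, and the function names are assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def feature_to_input_sqdist(d2_tilde, sigma=1.0):
    # Invert Equation 11 for a Gaussian kernel kappa(d^2) = exp(-d^2 / (2 sigma^2)),
    # where kappa(0) = 1, giving d^2 = -2 sigma^2 * ln(1 - d_tilde^2 / 2).
    return -2.0 * sigma**2 * np.log(np.clip(1.0 - d2_tilde / 2.0, 1e-12, None))

def preimage_from_distances(neighbors, d2_targets, x_init):
    # Locate x_hat whose squared input-space distances to the n nearest
    # neighbors of x match the target values d_i^2 as closely as possible (Eq. 14).
    def residuals(x):
        return np.sum((neighbors - x) ** 2, axis=1) - d2_targets
    return least_squares(residuals, x_init).x
```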
3.2 Construction of one-class KPCA ensemble for image classification
Given an image set of $m$ classes, the proposed one-class KPCA ensemble is built as follows: (i) for each image category, $n$ types of image features are extracted; (ii) a KPCA model is trained for each individual type of extracted feature; and (iii) therefore, $n$ KPCA models are constructed for each image class. For an $m$-class problem, there will be $m\times n$ KPCA models in the ensemble. The construction of the proposed one-class KPCA ensemble is illustrated in Figure 2, where $\Omega_i^j$ denotes the model trained on the type-$j$ feature from class $i$; a small sketch of this construction is given below.
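The following sketch shows one way to organize the $m\times n$ models, reusing the hypothetical fit_kpca helper from the sketch in Section 3.1. The data layout (a nested list indexed by class and feature type) is an assumption for illustration.

```python
def build_ensemble(class_features, sigma=1.0, M=5):
    # class_features[i][j]: matrix of type-j feature vectors extracted from
    # the training images of class i (i = 0..m-1, j = 0..n-1).
    ensemble = {}
    for i, feature_types in enumerate(class_features):       # m classes
        for j, feat_matrix in enumerate(feature_types):      # n feature types
            ensemble[(i, j)] = fit_kpca(feat_matrix, sigma=sigma, M=M)  # model Omega_i^j
    return ensemble
```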
3.3 Multi-class prediction using an ensemble of one-class KPCA models
The classification confidence score is used to describe the probability that an image belongs to each class. The confidence score provides a quantitative measure of the predictions produced by the KPCA models.
Given an unlabeled image $x$ with $n$ extracted features $F=\{f_1,f_2,\ldots,f_n\}$, let $\Omega_i^j$ represent the KPCA model belonging to class $i$ and trained on the $j$th feature type, where $i\in\{1,\ldots,m\}$ is the class label and $j\in\{1,\ldots,n\}$ is the feature label. For classification, the preimages of each image feature $f_j\in F$ are obtained by all the KPCA models trained on the $j$th feature type, using the DCL method introduced in Section 3.1. For example, the preimages of $f_1$ are obtained by the models $\Omega_1^1,\Omega_2^1,\ldots,\Omega_m^1$. Denote the preimage of $f_j$ produced by the class-$i$ model as $f'_{ji}$; the squared distance $D_{ji}$ between $f_j$ and $f'_{ji}$ is used as the reconstruction error, therefore:

$$D_{ji}=\|f_j-f'_{ji}\|^2,\quad i=1,\ldots,m. \qquad (15)$$

In the same way, the preimages of all the features in $F$ are obtained, forming a distance matrix $D$ of dimensions $n\times m$, where $n$ is the number of extracted feature types (and thus the number of KPCA models per class used for the preimage learning) and $m$ is the number of image classes. Each row of $D$ contains the reconstruction errors of one feature in $F$ produced by the $m$ KPCA models, one from each class:

$$D=\begin{bmatrix}D_{11}&D_{12}&\cdots&D_{1m}\\D_{21}&D_{22}&\cdots&D_{2m}\\\vdots&\vdots&\ddots&\vdots\\D_{n1}&D_{n2}&\cdots&D_{nm}\end{bmatrix}. \qquad (16)$$
Note that the values in each column of $D$ represent the reconstruction errors of $F$ obtained with the KPCA models of a single class; these values measure how well an image $x$ is described by the KPCA models of that class. Since we seek the class whose KPCA models give the minimum reconstruction error, this is essentially a 1-nearest-neighbor search: we wish to find the best preimage of $x$ among the $m$ preimages. Such a distance measure can improve the speed of the classification. Moreover, it is also in line with the ideas of metric multi-dimensional scaling, in which smaller dissimilarities are given more weight, and of locally linear embedding, where only the local neighborhood structure needs to be preserved [42].
In order to combine the reconstruction errors from the KPCA models, the reconstruction errors in $D$ are first normalized using Equation 17:

$$\tilde{D}_{ji}=\exp\!\Big(-\frac{D_{ji}}{s^2}\Big), \qquad (17)$$

which models a Gaussian distribution over the squared distances. The scale parameter $s$ can be fitted to the distribution of the reconstruction errors. Moreover, Equation 17 has the property that the scaled value is always bounded between 0 and 1. The normalized distance matrix is denoted by $\tilde{D}$.
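A brief sketch of these two steps, building $D$ as in Equation 15 and normalizing it as in Equation 17, is given below. The interface and the single global scale $s$ are assumptions, and the preimages are taken as already computed by the DCL sketch above.

```python
import numpy as np

def reconstruction_error_matrix(features, preimages):
    # features: list of the n feature vectors f_j of the test image.
    # preimages[j][i]: preimage of f_j produced by the class-i KPCA model.
    n, m = len(features), len(preimages[0])
    D = np.empty((n, m))
    for j in range(n):
        for i in range(m):
            D[j, i] = np.sum((features[j] - preimages[j][i]) ** 2)  # Eq. (15)
    return D

def normalize_errors(D, s=1.0):
    # Gaussian-style mapping of squared distances into (0, 1], cf. Eq. (17).
    return np.exp(-D / s**2)
```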
The normalized reconstruction errors in $\tilde{D}$ are produced by different one-class KPCA models and can be combined to produce the confidence scores (CS) for classifying $x$ into each class. Let $Cs=\{cs_1,cs_2,\ldots,cs_m\}$ denote the confidence scores of $x$ with respect to each image class. The confidence scores can be computed from the distance matrix $\tilde{D}$ by using an appropriate combination rule. A product rule was proposed in [45] for combining one-class classifiers:

$$cs_k=\frac{P_k(x\,|\,\omega_T)}{P_k(x\,|\,\omega_T)+P_k(x\,|\,\omega_O)}, \qquad (18)$$

where $k$ is the label of the target class. $P_k(x\,|\,\omega_T)$ is the probability of classifying $x$ into the target class obtained from the classifiers of class $k$, which can be calculated from the values in one column of the distance matrix $\tilde{D}$ as:

$$P_k(x\,|\,\omega_T)=\prod_{j=1}^{n}\tilde{D}_{jk}. \qquad (19)$$

$P_k(x\,|\,\omega_O)$ represents the probability of $x$ belonging to the outlier class, which is obtained by multiplying all the values in $\tilde{D}$ except those from the 'target' class $k$:

$$P_k(x\,|\,\omega_O)=\prod_{j=1}^{n}\prod_{i\neq k}\tilde{D}_{ji}. \qquad (20)$$

In [30], the authors investigated different mechanisms for combining one-class classifiers, and their results showed that the 'product rule' in Equation 18 outperforms the other combining mechanisms for one-class classifiers. As noted in [30, 45], when the product combining rule is used, $P_k(x\,|\,\omega_T)$ should be available, and a distance should be transformed into a 'resemblance' by some heuristic mapping such as Equation 17.
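For concreteness, the following sketch applies this combining rule to the normalized matrix $\tilde{D}$, following the forms of Equations 18 to 20 as reconstructed above; the exact formula in [45] may additionally include class priors, which are omitted here for simplicity.

```python
import numpy as np

def product_rule_scores(D_tilde):
    # D_tilde: n x m matrix of normalized reconstruction errors ("resemblances").
    n, m = D_tilde.shape
    scores = np.empty(m)
    for k in range(m):
        p_target = np.prod(D_tilde[:, k])              # Eq. (19)
        outliers = np.delete(D_tilde, k, axis=1)
        p_outlier = np.prod(outliers)                  # Eq. (20): n*(m-1) factors
        scores[k] = p_target / (p_target + p_outlier)  # Eq. (18)
    return scores
```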
However, when one-class classifiers are used for multi-class classification tasks, the product rule in Equation 18 may not perform well. The number of one-class classifiers constructed for the outlier classes exceeds the number of classifiers for the target class; a problem of 'imbalance' thus occurs in Equation 18, where far more terms are used for producing $P_k(x\,|\,\omega_O)$ than for $P_k(x\,|\,\omega_T)$. During classification, some classifiers from the outlier classes may give small classification probabilities when they estimate that the pattern is not an outlier. In Equation 18, these small probabilities are still used to calculate $P_k(x\,|\,\omega_O)$, even if more classifiers reach a different judgement. In this imbalanced situation, those relatively small probabilities drive $P_k(x\,|\,\omega_O)$ towards 0, which makes the classification confidence scores rather close to each other.
Here, a variant of the product combining rule of Equation 18 is proposed to address the imbalance problem. Instead of using the mapped values from all the outlier classes' KPCA models, among the models trained on the same type of image feature, only the model that gives the largest mapped value is chosen to contribute to $P_k(x\,|\,\omega_O)$. The proposed product combining rule can be described as:
$$cs_k=\frac{P_k(x\,|\,\omega_T)}{P_k(x\,|\,\omega_T)+\prod_{j=1}^{n}P_j(x\,|\,\omega_O)}, \qquad (21)$$

where $j$ is the image feature label and $j=1,\ldots,n$. $P_k(x\,|\,\omega_T)$ can be obtained using Equation 19. Each $P_j(x\,|\,\omega_O)$ in the product is the probability that $x$ belongs to the outlier classes according to the $j$th image feature, which can be obtained by:

$$P_j(x\,|\,\omega_O)=\max_{i\neq k}\tilde{D}_{ji}. \qquad (22)$$
The maximum value selection procedure in Equation 21 is illustrated by a simple example in Figure 3. In Figure 3, there is a four-class classification task (I, II, III, and IV in the figure), in which four types of features are extracted from image $x$. For each type of image feature, there are four trained KPCA models, one from each class, giving four reconstruction results for the same feature of $x$ (one row in matrix $\tilde{D}$). If we consider class I as the 'target' class (first column in the figure), the four values in the first column are used to produce the term $P_k(x\,|\,\omega_T)$ in Equation 21. The other three columns of values are deemed the outlier probabilities produced by the KPCA models from the other three classes. The proposed combining rule selects the maximum mapped value from each row to produce the outlier probability product $\prod_{j}P_j(x\,|\,\omega_O)$.
The selection scheme in Equation 21 ensures that the numbers of terms used for calculating $P_k(x\,|\,\omega_T)$ and the outlier product are the same. Moreover, the negative effect on the confidence scores brought by the imbalance is also removed. The proposed combining rule is in line with the basic idea of one-class classification: in the one-class scenario, one only needs to know whether a pattern should be assigned to the target class or to the outlier class. If one or more of the outlier models produces a high outlier probability, the current target class should be doubted. Moreover, by combining the outlier values from models derived from different features, the diversity of the ensemble is improved, which is an important factor in making an ensemble learning method successful [46].
Note that since the 'target class' is unknown for an unlabeled image, during classification each class is deemed the target class in turn to calculate its confidence score, i.e., each column of $\tilde{D}$ is used in turn to obtain $P_k(x\,|\,\omega_T)$ for the corresponding class. In this way, for an $m$-class classification task, each class is treated as the target class, one by one, producing $m$ confidence scores, and the image is assigned to the class with the maximum classification confidence score. A sketch of this decision procedure with the proposed combining rule is given below.
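The sketch below implements the modified rule of Equations 21 and 22 on the normalized matrix and returns the predicted class. As with the earlier sketches, it follows the equation forms reconstructed above and is not the authors' reference implementation.

```python
import numpy as np

def proposed_rule_scores(D_tilde):
    # D_tilde: n x m matrix of normalized reconstruction errors.
    n, m = D_tilde.shape
    scores = np.empty(m)
    for k in range(m):
        p_target = np.prod(D_tilde[:, k])              # Eq. (19)
        outliers = np.delete(D_tilde, k, axis=1)
        p_outlier = np.prod(outliers.max(axis=1))      # Eqs. (21)-(22): one max per feature row
        scores[k] = p_target / (p_target + p_outlier)  # Eq. (21)
    return scores

def classify(D_tilde):
    # Assign the image to the class with the maximum confidence score.
    return int(np.argmax(proposed_rule_scores(D_tilde)))
```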