1 Introduction

Medical imaging is one of the most important tools in modern medicine; imaging technologies such as X-ray imaging, ultrasonography, biopsy imaging, computed tomography, and optical coherence tomography are widely used in the clinical diagnosis of many kinds of diseases. However, in clinical applications, examining an image manually is usually time-consuming. Moreover, because the pathological examination of an image by a human physician always involves a subjective element, an automated technique can provide valuable assistance to physicians. A large part of medical image analysis research has focused on automated image classification. Many recent studies have shown that medical images can be properly classified if suitable image feature descriptions are chosen [1–3]. These studies demonstrated that combining different image description features can improve medical image classification performance.

Although classifiers that support multi-class classification, such as support vector machines (SVM) and neural networks, are usually selected for medical image classification [4], one-class classifiers (OCC) [5], which can work with only the samples seen so far, are more appropriate for the medical image classification task. One-class classification is also often called outlier (or novelty) detection, as the learning algorithms are used to differentiate between data that appear normal and data that appear abnormal with respect to the distribution of the training data. This principle makes one-class classification appropriate for medical diagnosis and for disease versus no-disease problems.

In many real classification tasks, a single classifier often fails to capture all aspects of the data. A combination of classifiers (an ensemble) is therefore often considered an appropriate mechanism to address this shortcoming. The main idea behind the ensemble methodology is to use several classifiers and combine their individual results in order to produce a classification that outperforms the outcome the classifiers would have produced in isolation [6]. Ensembles of one-class classifiers have also been shown to perform better than individual classifiers [7–9].

There are many strategies for constructing a classifier ensemble, including the use of different training data sets, different feature subsets, various types of individual classifiers, and different fusion rules. Among these, the feature subset strategy has shown better performance when the dimensionality of the feature vector is high compared to the number of data samples [10–13]. The feature subset ensemble strategy is therefore well suited to medical image classification problems: various types of image features are generally extracted for such tasks, so the dimensionality of the feature space typically exceeds the number of image samples and the ‘curse of dimensionality’ occurs. Using the feature subset strategy can avoid this problem.

In this paper, we propose and evaluate a novel classification scheme for medical images. The proposed scheme utilizes an ensemble of one-class classifiers built with the feature subset strategy; each one-class classifier is trained with one type of feature extracted from the training images. The kernel principal component analysis (KPCA) model was chosen as the base classifier of the ensemble. Given an m-class classification task and n different kinds of image features, the ensemble consists of m×n KPCA models. For an unlabeled image, its n types of features are first mapped into the kernel space by the corresponding n trained KPCA models from each class. The mapped features are then reconstructed from the high-dimensional kernel space into the original space by preimage learning, and the distances between the original features and the reconstructed features are measured. The distances given by the KPCA models are combined into a confidence score describing the probability that the sample belongs to a class. For an m-class classification task, m confidence scores are obtained, one for each class, and the image is assigned to the class with the maximum confidence score. Promising classification performance was obtained using the proposed classification scheme on two medical image sets.

2 Related works

In this section, we will first introduce some related works on one-class classification. Then one-class classifier ensembles will be discussed.

2.1 One-class classification

Moya et al. originated the term one-class classification [14]. Many approaches to one-class classification have been presented in the literature [5]. Following the taxonomy in the survey papers [15–17], the algorithms used in OCC can be categorized as follows: (i) boundary methods, (ii) density estimation, and (iii) reconstruction methods.

Tax and Duin [18, 19] sought to solve the OCC problem by distinguishing the positive class from all other patterns in the pattern space; the positive class data were surrounded by a hyper-sphere that encompassed almost all positive patterns within the minimum radius. This method, support vector data description (SVDD), differs from that proposed by Schölkopf et al. [20], who used a separating hyper-plane instead of a hyper-sphere to separate the region of the pattern space containing data from the region containing no data. Manevitz and Yousef [21] proposed another version of the one-class SVM based on identifying outlier data as representative of the second class; they applied their method to the standard Reuters [22] dataset and noted that it was quite sensitive to the choice of representation and kernel. Although one-class classifiers such as the one-class SVM (OCSVM) have been widely used, the estimated boundary can be sensitive to the nature of the data [23]. This can be highly problematic for many applications, especially for medical diagnosis, where the number of diseased cases misclassified as healthy must be kept to a minimum, since an accidental diagnosis of a cancer patient as healthy may result in death.

Density estimation methods estimate the density of the target class to form a model with which to represent the data. The generally used models include Parzen, Gaussian, and Gaussian mixture models. The test point is classified by the maximum posterior probability. Generally, this approach works well when the sample size is sufficiently high and a flexible model is used. However, when the model does not fit the data very well, a large bias may result. Details and some comparisons of these methods can be found in [24, 25].

As density estimation and support-vector-based methods require large training sets, when such sets are not available, one can approximate the target class by a simpler reconstruction model. This type of model tries to capture the data structure; new objects are projected onto the model. The reconstruction error, i.e., the difference between the original object x and the projected object p(x), indicates how closely a new object resembles the original target distribution. When the training data have a very high dimensionality, nearest neighbor methods tend to perform badly [26]. In such cases, it can often be assumed that the target data are distributed in subspaces of much lower dimensionality. Principal component analysis [27] is a linear model that projects the original data onto an orthogonal subspace that captures the variance in the data. Many nonlinear subspace models have also been proposed, such as the self-organizing map (SOM), auto-encoders, auto-associative networks [28], and kernel PCA [29].

2.2 Ensemble of one-class classifiers

Ensemble learning is concerned with mechanisms for combining the results of a number of weak learning systems to produce better learning performance. Several methodologies exist for creating an ensemble classifier from individual classifiers; a survey on the design of multiple classifier systems can be found in [6]. It has been demonstrated that combining classifiers can also be effective for one-class classifiers, and the existing classifier combination strategies can be applied to them. However, because only information concerning one class is available to a one-class classifier, combining one-class classifiers is more difficult. Tax and Duin investigated the influence of feature sets and the types of one-class classifiers on the best choice of combination rule [30]. A bagging-based one-class support vector machine ensemble method was proposed in [31]. A dynamic ensemble strategy based on structural risk minimization [32] was proposed by Goh et al. for multi-class image annotation [7]. Recently, some research results have shown that creating a one-class classifier ensemble from different feature subsets can provide better performance. Perdisci et al. [33] used an ensemble of one-class SVMs to create a ‘high-speed payload-based’ anomaly detection system, in which the features were first extracted and clustered and the OCSVM ensemble was then constructed from the clustered feature subsets. A biometric classification system combining different biometric features was proposed by Bergamini et al. [8], in which the one-class SVMs in the ensemble were trained on data from different people. The feature subset strategy provides diversity among the base classifiers, and some researchers emphasize the importance of measuring diversity in ensembles so as to improve classification performance [9, 34].

Combining one-class classifiers has also shown promising performance in medicine and biology [35]. Peng Li et al. [36] proposed a multi-size patch-based classifier ensemble, which provides a multiple-level representation of image content; this method was evaluated on colonoscopy images and ECG beat detection [37]. The k-nearest neighbor classifier was selected as the base classifier in the work of Okun and Priisalu [38], in which majority voting was chosen as the combination rule for the ensemble and the method was evaluated on gene expression cancer data.

3 One-class kernel subspace ensemble

In this section, the one-class kernel PCA model ensemble will be introduced. The theory of kernel PCA and pattern reconstruction via preimage will first be introduced, then the proposed KPCA ensemble will be described.

3.1 KPCA and pattern reconstruction via preimage

Traditional (linear) PCA tries to preserve the greatest variations of the data by approximating the data in a principal component subspace spanned by the leading eigenvectors; noise and less important data variations are removed. Kernel PCA inherits this scheme; however, it performs linear PCA in the kernel feature space $H_\kappa$. Suppose $X \subset \mathbb{R}^n$ is the original input data space and $H_\kappa$ is a reproducing kernel Hilbert space (RKHS) (also called feature space) associated with a kernel function $\kappa(x,y)=\langle\varphi(x),\varphi(y)\rangle$, where $x,y\in X$ and $\varphi(\cdot)$ is the mapping induced by $\kappa$, $\varphi: X \to H_\kappa$. Given a set of patterns $\{x_1,x_2,\dots,x_N\}\subset X$, kernel PCA performs the traditional linear PCA in $H_\kappa$. Similar to linear PCA, KPCA also has the eigendecomposition:

$$HKH = U \Lambda U^{\top}$$
(1)

where $K$ is the kernel matrix such that $K_{ij}=\kappa(x_i,x_j)$, and

$$H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^{\top}$$
(2)

is the centering matrix, where $I$ is the $N\times N$ identity matrix, $\mathbf{1}=[1,1,\dots,1]^{\top}$ is an $N\times 1$ vector, $U=[\alpha_1,\dots,\alpha_N]$ is the matrix containing the eigenvectors $\alpha_i=[\alpha_{i1},\dots,\alpha_{iN}]^{\top}$, and $\Lambda=\mathrm{diag}(\lambda_1,\dots,\lambda_N)$ contains the corresponding eigenvalues.

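Denote the mean of the φ-mapped patterns by $\bar{\varphi}=\frac{1}{N}\sum_{j=1}^{N}\varphi(x_j)$. Then for a mapped pattern $\varphi(x_i)$, the centered map $\tilde{\varphi}(x_i)$ can be defined as follows:

$$\tilde{\varphi}(x_i)=\varphi(x_i)-\bar{\varphi}.$$
(3)

The $k$th eigenvector $V_k$ of the covariance matrix in the feature space is a linear combination of the $\tilde{\varphi}(x_i)$:

$$V_k=\sum_{i=1}^{N}\alpha_{ki}\,\tilde{\varphi}(x_i)=\tilde{\varphi}\,\alpha_k,$$
(4)

where $\tilde{\varphi}=[\tilde{\varphi}(x_1),\tilde{\varphi}(x_2),\dots,\tilde{\varphi}(x_N)]$. If we use $\beta_k$ to denote the projection of the φ-image of a pattern $x$ onto the $k$th component $V_k$, then:

$$\beta_k=\tilde{\varphi}(x)^{\top}V_k=\sum_{i=1}^{N}\alpha_{ki}\,\tilde{\varphi}(x)^{\top}\tilde{\varphi}(x_i)=\sum_{i=1}^{N}\alpha_{ki}\,\tilde{\kappa}(x,x_i),$$
(5)

where

$$\tilde{\kappa}(x,y)=\tilde{\varphi}(x)^{\top}\tilde{\varphi}(y)=(\varphi(x)-\bar{\varphi})^{\top}(\varphi(y)-\bar{\varphi})=\kappa(x,y)-\frac{1}{N}\mathbf{1}^{\top}k_x-\frac{1}{N}\mathbf{1}^{\top}k_y+\frac{1}{N^2}\mathbf{1}^{\top}K\mathbf{1}$$
(6)

where $k_x=[\kappa(x,x_1),\dots,\kappa(x,x_N)]^{\top}$. Denote

$$\tilde{\kappa}_x=[\tilde{\kappa}(x,x_1),\dots,\tilde{\kappa}(x,x_N)]^{\top}=k_x-\frac{1}{N}\mathbf{1}\mathbf{1}^{\top}k_x-\frac{1}{N}K\mathbf{1}+\frac{1}{N^2}\mathbf{1}\mathbf{1}^{\top}K\mathbf{1}=H\left(k_x-\frac{1}{N}K\mathbf{1}\right),$$
(7)

then $\beta_k$ in Equation 5 can be rewritten as $\beta_k=\alpha_k^{\top}\tilde{\kappa}_x$.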
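The centering step in Equations 6 and 7 can be checked numerically. The following is a minimal sketch, assuming a Gaussian kernel and small dense lists; the function names (`gaussian_kernel`, `centered_kernel_vector`, `centered_kernel_entry`) are hypothetical and not taken from the original implementation. It computes the centered kernel vector of Equation 7 without forming $H$ explicitly and provides an entry-wise evaluation of Equation 6 for cross-checking.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def centered_kernel_vector(x, train, sigma=1.0):
    """Equation 7: H (k_x - (1/N) K 1), computed without building H."""
    n = len(train)
    K = [[gaussian_kernel(a, b, sigma) for b in train] for a in train]
    k_x = [gaussian_kernel(x, a, sigma) for a in train]
    # v = k_x - (1/N) K 1, i.e. subtract the row means of K.
    v = [k_x[i] - sum(K[i]) / n for i in range(n)]
    # Multiplying by H = I - (1/N) 1 1^T subtracts the mean of v from each entry.
    mean_v = sum(v) / n
    return [vi - mean_v for vi in v]


def centered_kernel_entry(x, i, train, sigma=1.0):
    """Equation 6 evaluated for y = train[i], used only as a cross-check."""
    n = len(train)
    K = [[gaussian_kernel(a, b, sigma) for b in train] for a in train]
    k_x = [gaussian_kernel(x, a, sigma) for a in train]
    k_y = K[i]  # kernel vector of the training pattern x_i
    total = sum(sum(row) for row in K)
    return (gaussian_kernel(x, train[i], sigma)
            - sum(k_x) / n - sum(k_y) / n + total / n ** 2)
```

Because $H$ removes the mean, the entries of the returned vector sum to zero, which is a convenient sanity check when implementing the projection in Equation 5.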

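Therefore, the projection $P(\varphi(x))$ of $\varphi(x)$ onto the subspace spanned by the first $M$ eigenvectors can be obtained by:

$$P(\varphi(x))=\sum_{k=1}^{M}\beta_k V_k+\bar{\varphi}=\sum_{k=1}^{M}\left(\alpha_k^{\top}\tilde{\kappa}_x\right)\left(\tilde{\varphi}\,\alpha_k\right)+\bar{\varphi}=\tilde{\varphi}\,L\,\tilde{\kappa}_x+\bar{\varphi},$$
(8)

where $L=\sum_{k=1}^{M}\alpha_k\alpha_k^{\top}$.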

PCA is a simple method whereby a model for the distribution of training data can be generated. For linear distributions, PCA can be used; however, many real-world problems are nonlinear. Methods like Gaussian mixture models and auto-associative neural networks have been used for nonlinear problems. These methods, however, need to solve a nonlinear optimization problem and are thus prone to local minima and sensitive to the initialization [29]. KPCA runs PCA in the high-dimensional feature space through the nonlinearity of the kernel, and this allows for a refinement in the description of the patterns of interest. Therefore, kernel PCA was chosen to model the nonlinear distribution of the training samples here.

Kernel PCA has been widely used for classification tasks. A straightforward way of using kernel PCA for classification is to use the distances between the mapped patterns in the feature space $H_\kappa$ directly to obtain the classification boundaries [29, 39]. However, the experimental results in [29] showed that the classification performance depends highly on the parameters selected for the kernel function, and there is no guideline for parameter selection in real classification tasks. A more recent work also demonstrated that kernel space distance is not sufficient for unsupervised learning algorithms and that distances in the input space are more appropriate for classification [40].

In this paper, we focus on the distances between a pattern $x$ and its reconstructions by the kernel PCA models trained from different classes. Kernel PCA is used here as a one-class classifier, which means that at least one KPCA model is trained for each class. Suppose there is an m-class classification task with one KPCA model per class, giving m KPCA models. Given an unlabeled pattern $x$, every KPCA model will produce a projection $P(\varphi(x))_i$, $i=1,\dots,m$. During classification, $x$ is reconstructed in the input space from every $P(\varphi(x))_i$, so that m reconstruction results $x'_1,\dots,x'_m$ are obtained; the distance between $x$ and each $x'_i$ (also called the reconstruction error) is calculated, and $x$ is assigned to the class whose KPCA model produces the minimum reconstruction error. Ideally, the KPCA model trained from the class to which $x$ belongs will always give the minimum reconstruction error. In our proposed classification scheme, multiple KPCA models are trained for each class and the reconstruction errors of KPCA models from different classes are combined for classification, as described in Sections 3.2 and 3.3.

In order to obtain the input-space distance between $x$ and its reconstruction result, it is necessary to map $P(\varphi(x))$ back into the input space. The reverse mapping from feature space back to input space is called the preimage problem (Figure 1). However, the preimage problem is ill-posed and the exact preimage of $P(\varphi(x))$ in the input space does not exist [41]; instead, one can only find an approximation $\hat{x}$ in the input space such that

$$\varphi(\hat{x})\approx P(\varphi(x)).$$
(9)
Figure 1

Illustration of KPCA preimage learning. The sample $x$ in the original space is first mapped into the kernel space by the kernel mapping $\varphi(\cdot)$, then PCA is used to project $\varphi(x)$ into $P(\varphi(x))$, which is a point in a PCA subspace. Preimage learning is used to find the preimage $\hat{x}$ of $x$ in the original input space from $P(\varphi(x))$.

Several algorithms have been proposed to address the preimage learning problem. Mika et al. [41] proposed an iterative method that determines the preimage by minimizing the least squares distance error. Kwok and Tsang proposed a distance constraint learning (DCL) method that finds the preimage by using a technique similar to multi-dimensional scaling (MDS) [42]. In a more recent work, Zheng et al. [43] proposed a weakly supervised penalty strategy for preimage learning in KPCA; however, their method needs information for both positive and negative classes. As we are only interested in one-class scenarios, the distance constraint method in [42] was selected for the work described in this paper. We briefly review the method here.

For any two patterns $x_i$ and $x_j$ in the input space, the Euclidean distance $d(x_i,x_j)$ can be easily obtained. Similarly, the feature-space distance $\tilde{d}(\varphi(x_i),\varphi(x_j))$ between their φ-mapped images in the feature space can also be obtained. For many commonly used kernels, such as the Gaussian kernel, there is a simple relationship between the feature-space distance and the input-space distance [44]:

$$\tilde{d}_{ij}^{2}=K_{ii}+K_{jj}-2\kappa(d_{ij}^{2}).$$
(10)

Therefore,

$$\kappa(d_{ij}^{2})=\frac{1}{2}\left(K_{ii}+K_{jj}-\tilde{d}_{ij}^{2}\right).$$
(11)

As $\kappa$ is invertible, $d_{ij}^{2}$ can be obtained if $\tilde{d}_{ij}^{2}$ is known.
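For the Gaussian kernel used later in the experiments, $K_{ii}=K_{jj}=1$ and $\kappa(d^2)=\exp(-d^2/2\sigma^2)$, so the inversion in Equation 11 has a closed form. The sketch below illustrates this special case only; the function names are hypothetical, and other kernels would need their own inverse.

```python
import math


def feature_sq_dist(d2, sigma):
    """Equation 10 for a Gaussian kernel, where K_ii = K_jj = 1."""
    return 2.0 - 2.0 * math.exp(-d2 / (2.0 * sigma ** 2))


def input_sq_dist(dt2, sigma):
    """Equation 11 followed by inversion of the Gaussian kernel.

    kappa(d^2) = (2 - dt2) / 2, hence d^2 = -2 sigma^2 ln(kappa(d^2)).
    """
    kappa = 0.5 * (2.0 - dt2)
    if kappa <= 0.0:
        # A Gaussian kernel value must be positive, so dt2 must be below 2.
        raise ValueError("feature-space squared distance must be < 2")
    return -2.0 * sigma ** 2 * math.log(kappa)
```

Feature-space squared distances under a Gaussian kernel between two mapped patterns lie in [0, 2), which is why the guard rejects values of 2 or more.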

A given training set has n patterns $X=\{x_1,\dots,x_n\}$. For a pattern $x$ in the input space, the corresponding $\varphi(x)$ is projected to $P(\varphi(x))$; then for each training pattern $x_i$ in $X$, $P(\varphi(x))$ will be at a certain distance $\tilde{d}(P(\varphi(x)),\varphi(x_i))$ from $\varphi(x_i)$ in the feature space. This feature-space distance can be obtained by:

$$\tilde{d}^{2}(P(\varphi(x)),\varphi(x_i))=\|P(\varphi(x))\|^{2}+\|\varphi(x_i)\|^{2}-2P(\varphi(x))^{\top}\varphi(x_i).$$
(12)

Equation 12 can be evaluated by using Equations 5 and 8. Equation 11 then converts the feature-space distances between $P(\varphi(x))$ and each $\varphi(x_i)$ into the corresponding input-space distances. Denote these distances as:

$$d^{2}=[d_1^{2},d_2^{2},\dots,d_n^{2}].$$
(13)

The location of $\hat{x}$ will be obtained by requiring $d^{2}(\hat{x},x_i)$ to be as close to the values in (13) as possible, i.e.,

$$d^{2}(\hat{x},x_i)\simeq d_i^{2},\quad i=1,\dots,n.$$
(14)

To this end, in DCL, the training set $X$ is constrained to the n nearest neighbors of $x$, and least squares optimization is used to obtain $\hat{x}$.
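The distance constraint in Equation 14 can be illustrated with a small numerical sketch. Note that this is not the procedure of [42]: plain gradient descent on the squared residuals is swapped in here as the least squares solver purely for illustration, and the function name, step size, and iteration count are assumptions.

```python
def preimage_by_distance_fit(neighbors, sq_dists, lr=0.05, iters=500):
    """Find x_hat such that ||x_hat - x_i||^2 is close to d_i^2 (Equation 14).

    neighbors: list of input-space neighbor patterns x_i (lists of floats).
    sq_dists:  target squared input-space distances d_i^2, one per neighbor.
    Minimizes sum_i (||x_hat - x_i||^2 - d_i^2)^2 by gradient descent.
    """
    dim = len(neighbors[0])
    n = len(neighbors)
    # Start from the centroid of the neighbors.
    x_hat = [sum(p[c] for p in neighbors) / n for c in range(dim)]
    for _ in range(iters):
        grad = [0.0] * dim
        for p, d2 in zip(neighbors, sq_dists):
            diff = [x_hat[c] - p[c] for c in range(dim)]
            residual = sum(v * v for v in diff) - d2
            for c in range(dim):
                # d/dx_hat of residual^2 is 4 * residual * (x_hat - x_i).
                grad[c] += 4.0 * residual * diff[c]
        x_hat = [x_hat[c] - lr * grad[c] for c in range(dim)]
    return x_hat
```

The fixed step size is only suitable for small, well-scaled examples; a practical implementation would more likely use a closed-form or library least squares solver.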

3.2 Construction of one-class KPCA ensemble for image classification

Given an image set of m classes, the proposed one-class KPCA ensemble is built as follows: (i) for each image category, n types of image features are extracted; (ii) a KPCA model is trained for each individual type of extracted feature; and (iii) consequently, n KPCA models are constructed for each image class. For an m-class problem, there will be m×n KPCA models in the ensemble. The construction of the proposed one-class KPCA ensemble is illustrated in Figure 2, where $\mathrm{KPCA}_i^j$ represents the model trained by the type j feature from class i.

Figure 2

Construction of one-class KPCA ensemble from different image feature sets. $\mathrm{KPCA}_i^j$ represents the KPCA model trained from the $j$th image feature of class i.

3.3 Multi-class prediction using an ensemble of one-class KPCA models

The classification confidence score describes the probability that the image belongs to each class. It provides a quantitative measure of the predictions produced by the KPCA models.

Given an unlabeled image $x$ with n extracted features $F=\{f_1,f_2,\dots,f_n\}$, let $\mathrm{KPCA}_i^j$ represent the KPCA model belonging to class i and trained from the feature set $f_j$, where $i\in\{1,\dots,m\}$ is the class label and $j\in\{1,\dots,n\}$ is the feature label. For classification, the preimages of each image feature $f_j\in F$ are obtained by all the KPCA models trained from the $j$th feature. The DCL method introduced in Section 3.1 is used for obtaining the preimages. For example, the preimages of $f_1$ are obtained by the models $\mathrm{KPCA}_i^1$, $i=1,\dots,m$. Denote the preimages of $f_j$ as $f_j'=\{f_j^{\prime 1},f_j^{\prime 2},\dots,f_j^{\prime m}\}$; the squared distances $D_j$ between $f_j$ and $f_j'$ are used as the reconstruction errors, therefore:

$$D_j=[d_j^{1},d_j^{2},\dots,d_j^{m}],$$
(15)

where $d_j^{i}=\|f_j-f_j^{\prime i}\|^{2}$, $i=1,\dots,m$. In the same way, the preimages of all the features in $F$ are obtained, forming a distance matrix $D$ of dimensions n×m, where n is the number of feature types (and hence of KPCA models per class used for the preimage learning) and m is the number of image classes. Each row of $D$ contains the reconstruction errors of one feature in $F$ by the m KPCA models, one from each class:

$$D=\begin{bmatrix}D_1\\D_2\\\vdots\\D_n\end{bmatrix}=\begin{bmatrix}d_1^{1}&d_1^{2}&\cdots&d_1^{m}\\d_2^{1}&d_2^{2}&\cdots&d_2^{m}\\\vdots&\vdots&\ddots&\vdots\\d_n^{1}&d_n^{2}&\cdots&d_n^{m}\end{bmatrix}.$$
(16)
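As a concrete reading of Equations 15 and 16, the sketch below assembles the n×m matrix of squared reconstruction errors from already-computed preimages. The function name and the nested-list layout (`features[j]` for the $j$th feature vector, `preimages[j][i]` for its preimage under the class-i model) are assumptions made for illustration; the preimages themselves would come from the DCL step of Section 3.1.

```python
def reconstruction_error_matrix(features, preimages):
    """Build D with D[j][i] = ||f_j - f_j'^i||^2 (Equations 15 and 16).

    features:  list of n feature vectors, one per feature type.
    preimages: preimages[j][i] is the preimage of features[j] produced by
               the KPCA model of class i trained on feature type j.
    Returns an n x m nested list (rows: feature types, columns: classes).
    """
    return [
        [sum((a - b) ** 2 for a, b in zip(f, p)) for p in row]
        for f, row in zip(features, preimages)
    ]
```

Feature vectors of different types may have different lengths, since each row is compared only against preimages of the same feature type.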

Note that the values in each column of $D$ are the reconstruction errors of $F$ using the KPCA models from the same class; these values measure how well an image $x$ is described by the KPCA models of that class. Since we try to find the class whose KPCA models give the minimum reconstruction error, this is in effect a 1-nearest neighbor search, as we wish to find the best preimage of $x$ among m preimages. Such a distance measure can improve the speed of the classification. Moreover, it is also in line with the ideas in metric multi-dimensional scaling, in which smaller dissimilarities are given more weight, and in locally linear embedding, where only the local neighborhood structure needs to be preserved [42].

In order to combine the reconstruction errors from the KPCA models, the reconstruction errors in $D$ are first normalized using Equation 17:

$$\tilde{d}_j^{i}=\exp(-d_j^{i}/s),$$
(17)

which models a Gaussian distribution from the squared distance. The scale parameter $s$ can be fitted to the distribution of $d_j^{i}$. Moreover, Equation 17 ensures that the scaled value is always bounded between 0 and 1. The normalized distance matrix is denoted by $\tilde{D}$.
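The normalization in Equation 17 can be sketched as follows. How $s$ is fitted is not specified in the text, so the default used here (the mean of all reconstruction errors) is purely an assumption, as is the function name.

```python
import math


def normalize_errors(D, s=None):
    """Map each reconstruction error d to exp(-d / s) (Equation 17).

    D: n x m nested list of non-negative squared reconstruction errors.
    s: scale parameter; if omitted, the mean of all entries is used
       (an illustrative choice, not taken from the paper).
    """
    flat = [d for row in D for d in row]
    if s is None:
        s = sum(flat) / len(flat)
        if s == 0.0:
            s = 1.0  # all errors are zero; any positive scale gives 1.0
    return [[math.exp(-d / s) for d in row] for row in D]
```

A zero reconstruction error maps to 1, and larger errors decay toward 0, so larger normalized values indicate a closer resemblance to the class model.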

The normalized reconstruction errors in $\tilde{D}$ are obtained by different one-class KPCA models and can be combined to produce the confidence scores (CS) for classifying $x$ into each class. Let $CS=\{cs_1,cs_2,\dots,cs_m\}$ denote the confidence scores for $x$ with respect to each image class. The confidence scores can be computed from the distance matrix $\tilde{D}$ by using an appropriate combination rule. A product rule was proposed in [45] for combining one-class classifiers:

$$cs_k(x)=\frac{\prod_k P_k(x|w_T)}{\prod_k P_k(x|w_T)+\prod_k P_k(x|w_O)},$$
(18)

where k is the label of the target class. $\prod_k P_k(x|w_T)$ is the probability of classifying $x$ into the target class obtained from the classifiers of class k, which can be calculated from the values in one column of the distance matrix $\tilde{D}$ as:

$$\prod_k P_k(x|w_T)=\prod_{j=1}^{n}\tilde{d}_j^{k}.$$
(19)

$\prod_k P_k(x|w_O)$ represents the probability of $x$ belonging to the outlier class, which is obtained by multiplying all the values in $\tilde{D}$ except the values from the ‘target’ class k:

$$\prod_k P_k(x|w_O)=\prod\tilde{d}_j^{i},\quad j=1,\dots,n,\ i=1,\dots,m\ \text{and}\ i\neq k.$$
(20)

In [30], the authors investigated different mechanisms for combining one-class classifiers, and their results showed that the ‘product rule’ in Equation 18 outperforms other combining mechanisms for one-class classifiers. As noted in [30, 45], when using the product combining rule, $P_k(x|w_T)$ should be available and a distance should be transformed to a ‘resemblance’ by some heuristic mapping as in Equation 17.

However, when one-class classifiers are used for multi-class classification tasks, the product rule in Equation 18 may not perform well. The number of one-class classifiers constructed for the outlier classes exceeds the number of classifiers for the target class; a problem of ‘imbalance’ thus occurs in Equation 18, where far more terms are used to produce $\prod_k P_k(x|w_O)$ than to produce $\prod_k P_k(x|w_T)$. During classification, some classifiers from the outlier classes may give small classification probabilities when they estimate that the pattern is not an outlier. In Equation 18, these small probabilities are still used to calculate $\prod_k P_k(x|w_O)$, even if more classifiers reach a different judgement. In this imbalanced situation, these relatively small probabilities drive $\prod_k P_k(x|w_O)$ toward 0, which makes the classification confidence scores rather close to each other.

Here, a variant of the product combining rule of Equation 18 is proposed to address the imbalance problem. Instead of using the mapping values from all the outlier classes’ KPCA models, for those models trained by the same type of image feature, only the model that gives the largest mapping value is chosen to produce the outlier term. The proposed product combining rule can be described as:

$$cs_k(x)=\frac{\prod_k P_k(x|w_T)}{\prod_k P_k(x|w_T)+\prod_j P_k^{j}(x|w_O)},$$
(21)

where j is the image feature label and $j=1,\dots,n$. $\prod_k P_k(x|w_T)$ can be obtained using Equation 19. Each $P_k^{j}(x|w_O)$ in $\prod_j P_k^{j}(x|w_O)$ is the probability that $x$ belongs to the outlier classes according to the $j$th image feature, which can be obtained by:

$$P_k^{j}(x|w_O)=\max\{\tilde{d}_j^{i}\},\quad i=1,\dots,m\ \text{and}\ i\neq k.$$
(22)

The maximum value selection procedure in Equation 21 is illustrated by a simple example in Figure 3. In Figure 3, there is a four-class classification task (I, II, III, and IV in the figure), in which four types of features are extracted from image $x$. For one type of image feature, there are four trained KPCA models, each from a different class, giving four reconstruction results for the same feature of $x$ (one row in matrix $\tilde{D}$). If we consider class I as the ‘target’ class (first column in the figure), the four values in the first column are used to produce the term $\prod_k P_k(x|w_T)$ in Equation 21. The other three columns of values are treated as the outlier probabilities produced by the KPCA models from the other three classes. The proposed combining rule selects the maximum mapping value among these outlier columns in each row to produce the outlier probability product $\prod_j P_k^{j}(x|w_O)$.

Figure 3

Illustration of KPCA model selection to produce outlier probability product.

The selection scheme in Equation 21 ensures that the numbers of terms used for calculating $\prod_j P_k^{j}(x|w_O)$ and $\prod_k P_k(x|w_T)$ are the same. Moreover, the negative effect of the imbalance on the confidence scores is removed. The proposed combining rule is in line with the basic idea of one-class classification: in the one-class scenario, one only needs to know whether a pattern should be assigned to the target class or to the outlier class. If one or more outlier models are able to produce a high outlier probability product, the current target class should be doubted. Moreover, combining the outlier values from models derived from different features improves the diversity of the ensemble, which is an important factor in making an ensemble learning method successful [46].

Note that since the ‘target class’ is unknown for an unlabeled image, during classification each class is treated as the target class in turn to calculate the confidence score, i.e., each column in $\tilde{D}$ is used in turn to obtain $\prod_k P_k(x|w_T)$ for each class. In this way, for an m-class classification, m confidence scores are produced, and the image is assigned to the class giving the maximum classification confidence score.
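The two combining rules and the final decision step can be summarized in a short sketch. It assumes the normalized matrix $\tilde{D}$ is given as an n×m nested list (rows: feature types, columns: classes) with zero-based class indices; the function names are hypothetical and the code is an illustration of Equations 18 to 22 rather than the original implementation.

```python
import math


def original_product_rule(Dn, k):
    """Equations 18-20: the outlier term multiplies every non-target entry."""
    n, m = len(Dn), len(Dn[0])
    p_target = math.prod(Dn[j][k] for j in range(n))
    p_outlier = math.prod(
        Dn[j][i] for j in range(n) for i in range(m) if i != k
    )
    return p_target / (p_target + p_outlier)


def proposed_product_rule(Dn, k):
    """Equations 21-22: per feature, keep only the largest non-target entry."""
    n, m = len(Dn), len(Dn[0])
    p_target = math.prod(Dn[j][k] for j in range(n))
    p_outlier = math.prod(
        max(Dn[j][i] for i in range(m) if i != k) for j in range(n)
    )
    return p_target / (p_target + p_outlier)


def classify(Dn, rule=proposed_product_rule):
    """Treat each class as the target in turn and return (label, scores)."""
    m = len(Dn[0])
    scores = [rule(Dn, k) for k in range(m)]
    label = max(range(m), key=scores.__getitem__)
    return label, scores
```

With the proposed rule, the target and outlier products each contain exactly n factors, whereas the original rule uses n factors for the target and n(m−1) for the outliers. For m = 2 the two rules coincide, since each row has only one non-target entry.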

4 Experiments and results

The effectiveness of the proposed method is illustrated using a biopsy breast cancer image set, a 3D OCT retinal image set, and the UCI Wisconsin breast cancer (diagnostic) dataset. The details of the image sets and image feature extractors are given in Section 4.1. Section 4.2 introduces our experimental setup and the evaluation methods used in our experiments. The effectiveness of combining kernel PCAs is illustrated in Section 4.3. Finally, the proposed method is compared with some state-of-the-art ensemble classification methods on the UCI Wisconsin breast cancer dataset.

4.1 Image set and feature extraction

Three medical datasets were used to evaluate the proposed classification method: a breast cancer benchmark biopsy image dataset from the Israel Institute of Technology [47], a 3D OCT retinal image set, and the breast cancer (diagnostic) dataset from the UCI machine learning repository [48].

4.1.1 Breast cancer biopsy image set

The image set consists of 361 samples, of which 119 were classified by a pathologist as normal tissue, 102 as carcinoma in situ, and 140 as invasive ductal or lobular carcinoma. The samples were generated from breast tissue biopsy slides, stained with hematoxylin and eosin. They were photographed using a Nikon Coolpix 995 attached to a Nikon Eclipse E600 (Nikon Corporation, Shinjuku, Tokyo, Japan) at a magnification of ×40 to produce images with a resolution of about 5 μm per pixel. No calibration was made, and the camera was set to automatic exposure. The images were cropped to a region of interest of 760×570 pixels and compressed using lossy JPEG compression. The resulting images were again inspected by a pathologist to ensure that their quality was sufficient for diagnosis. Figure 4 presents three sample images of healthy tissue, tumor in situ, and invasive carcinoma.

Figure 4

Typical image instances. (a) Carcinoma in situ: tumor confined to a well-defined small region, usually a duct (arrow). (b) Invasive: breast tissue completely replaced by the tumor. (c) Normal: normal breast tissue, with ducts and finer structures.

Shape and texture features are critical factors for distinguishing one image from another, and they are also effective for biopsy image discrimination. As can be seen from Figure 4, the three kinds of biopsy images have visible differences in cell appearance and texture distribution. Thus, we use completed local binary patterns (CLBPs) [49] for extracting local textural features, gray level co-occurrence matrix (GLCM) [50] statistics for representing global textures, and the curvelet transform [51] for shape description. These feature descriptors have shown promising results in our previous work on biopsy image classification [52].

Different from traditional LBPs, in CLBPs a local region is represented by three coding operators to represent the central pixel, the difference signs, and the difference magnitudes [49]. According to the authors, CLBP can achieve much better rotation invariant texture classification results than conventional LBP-based schemes. In this paper, we use the 3D joint histogram of these three operators to generate textural features of breast cancer biopsy images, and the joint combination of the three components gives better classification than when using conventional LBPs and provides a smaller feature dimension. The dimension of the CLBP feature is 200.

The co-occurrence probabilities provide a second-order method for generating texture features. The basis for features used here is the gray level co-occurrence matrix [50]. With respect to the work described in this paper, a total of 22 features were extracted from gray level co-occurrence matrix, and they are listed in Table 1. Each of these statistics has a qualitative meaning with respect to the structure within the gray level co-occurrence matrix. The total dimension of the GLCM features is 220.

Table 1 Features extracted from gray level co-occurrence matrix

The fastest curvelet transform currently available is the curvelets via wrapping [51], which was therefore adopted in our work. From the curvelet coefficients, some statistics can be calculated for each curvelet sub-band. In this paper, the mean μ, the standard deviation δ, and the entropy H are used as simple features. If n curvelet sub-bands are produced by the transform, 3n features $G=[G_\mu,G_\delta,H]$ are obtained, where $G_\mu=[\mu_1,\mu_2,\dots,\mu_n]$, $G_\delta=[\delta_1,\delta_2,\dots,\delta_n]$, and $H=[h_1,h_2,\dots,h_n]$. A 3n-dimensional feature vector can thus be used to represent each image in the dataset. Using five levels of the curvelet transform, 82 sub-bands of curvelet coefficients are computed; therefore, a 246-dimensional curvelet feature vector is generated for each image.
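The assembly of the 3n-dimensional vector $G=[G_\mu,G_\delta,H]$ can be sketched as below. The curvelet transform itself is not computed here: each sub-band is assumed to be available as a flat list of real-valued coefficients. The text does not define how the entropy is computed, so the histogram-based Shannon entropy, the bin count, and the function names are all assumptions.

```python
import math
import statistics


def histogram_entropy(values, bins=16):
    """Shannon entropy (bits) of a fixed-width histogram of the values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant sub-band carries no spread
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def curvelet_statistics(subbands, bins=16):
    """Return [means..., standard deviations..., entropies...] over sub-bands."""
    means = [statistics.fmean(sb) for sb in subbands]
    stds = [statistics.pstdev(sb) for sb in subbands]
    entropies = [histogram_entropy(sb, bins) for sb in subbands]
    return means + stds + entropies
```

With 82 sub-bands, the returned list has 3 × 82 = 246 entries, matching the feature dimension stated above.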

4.1.2 3D OCT retinal image set

The 3D OCT retinal image set was collected at the Royal Hospital of University of Liverpool. The image set contains 140 volumetric OCT images, in which 68 images are from normal eyes and the remainder from eyes that have age-related macular degeneration (AMD). Figure 5 shows the example images.

Figure 5

Examples of two 3D OCT images showing the difference between a ‘normal’ (a) and an AMD retina (b).

The OCT images are preprocessed by using the Split Bregman Isotropic Total Variation algorithm with a least squares approach [53]. The preprocessing step has two targets: (i) identification and extraction of a volume of interest (VOI) which also results in noise removal and (ii) flattening of the retina as appropriate. The example images after preprocessing can be seen in Figure 6.

Figure 6

Examples of OCT images. (a) Before preprocessing. (b) After preprocessing.

As the images are three-dimensional, following the work in [53], three types of image features were used for image description: local binary patterns of three orthogonal planes (LBP-TOP), local phase quantization (LPQ), and multi-scale spatial pyramid (MSSP).

4.1.3 UCI breast cancer dataset

The Wisconsin breast cancer image sets were obtained from digitized images of fine needle aspirate (FNA) of breast masses. They describe characteristics of the cell nuclei present in the image. Ten real-valued features are computed for each cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension. The 569 images in the dataset are categorized into two classes: benign and malignant.

4.2 Experimental setup and performance evaluation methods

MATLAB 7 was used to implement the proposed process together with the Gaussian kernel $k(x,y)=\exp\left(-\|x-y\|^{2}/2\sigma^{2}\right)$. Other types of kernels could have been used; however, since the Gaussian kernel is commonly used for the kernel PCA, the SVDD, and the Parzen density, this kernel is the only kernel used with respect to the experiments reported here.
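For reference, the Gaussian kernel can be evaluated for whole sets of feature vectors at once. The sketch below is written in Python with NumPy rather than MATLAB, purely for illustration; the function name is hypothetical.

```python
import numpy as np


def gaussian_kernel(X, Y, sigma):
    """Kernel matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 * sigma^2))."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    Y = np.atleast_2d(np.asarray(Y, dtype=float))
    # Squared distances via ||x||^2 + ||y||^2 - 2 x.y, clipped against rounding error.
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


# Two points at distance 4 with sigma = 4 give exp(-16 / 32) = exp(-0.5).
demo_K = gaussian_kernel([[0.0, 0.0]], [[4.0, 0.0]], sigma=4.0)
```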

Unless otherwise stated, tenfold cross-validation was used, and all the results are averages of ten runs of the tenfold cross-validation. The following measures are used to evaluate the proposed cascade method:

  • Recognition rate (RR) = number of correctly recognized images / number of testing images

  • ROC, receiver operating characteristic graph

  • AUC, area under an ROC curve
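The protocol of ten runs of tenfold cross-validation, together with the recognition rate, can be sketched as below. The splitting here is plain random partitioning; whether the original experiments used stratified folds is not stated, so that choice is an assumption, as are the function names.

```python
import numpy as np


def repeated_kfold_indices(n_samples, n_folds=10, n_runs=10, seed=0):
    """Yield (run, train_idx, test_idx) for n_runs independent k-fold partitions."""
    rng = np.random.default_rng(seed)
    for run in range(n_runs):
        folds = np.array_split(rng.permutation(n_samples), n_folds)
        for k in range(n_folds):
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            yield run, train_idx, folds[k]


def recognition_rate(y_true, y_pred):
    """Number of correctly recognized images divided by number of testing images."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# 140 samples, matching the size of the OCT image set; labels are not needed to split.
splits = list(repeated_kfold_indices(140))
```

Averaging `recognition_rate` over all 100 train/test splits gives one figure per method, in the manner of the averaged results reported in the tables.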

4.3 Evaluation of kernel PCA ensemble

The KPCA ensemble evaluation using the biopsy image data and the 3D OCT retinal image data is reported in this section. For the biopsy images, as introduced in Section 4.1, three types of image features were extracted; therefore, for each image class, three kernel PCAs were built, one for each type of image feature. The recognition rates obtained using these KPCAs individually are listed in columns 2 to 4 of Table 2, where CvletK, GLCMK, and LBPK represent KPCA models trained from curvelets, GLCM, and LBP, respectively. The results of combining all KPCA models are listed in the last column of Table 2; the combining rule is introduced in Equation 21. The parameters of the KPCAs were set to σ=4 and n=40. The combined model gives the best classification performance for each image class; the averaged classification accuracy for these three image classes is 92.28%.
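To make the role of the two parameters concrete, the sketch below shows one common way to use kernel PCA as a one-class model: the model is fitted on samples of a single class, and a test sample is scored by its reconstruction error in feature space. Treating the reconstruction error as the score is an assumption here, since the model definition lies outside this section; σ is read as the Gaussian kernel width and n as the number of retained components, and the class name is hypothetical.

```python
import numpy as np


def gaussian_kernel(X, Y, sigma):
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


class OneClassKPCA:
    """Kernel PCA fitted on one class; scores samples by feature-space reconstruction error."""

    def __init__(self, sigma=4.0, n_components=40):
        self.sigma = sigma
        self.n_components = n_components

    def fit(self, X):
        self.X = np.asarray(X, dtype=float)
        K = gaussian_kernel(self.X, self.X, self.sigma)
        self.col_mean = K.mean(axis=0)
        self.all_mean = K.mean()
        # Centre the kernel matrix in feature space.
        Kc = K - self.col_mean[None, :] - self.col_mean[:, None] + self.all_mean
        vals, vecs = np.linalg.eigh(Kc)
        order = np.argsort(vals)[::-1][: self.n_components]
        vals, vecs = vals[order], vecs[:, order]
        keep = vals > 1e-10                      # drop numerically null directions
        self.alphas = vecs[:, keep] / np.sqrt(vals[keep])
        return self

    def reconstruction_error(self, Z):
        Z = np.asarray(Z, dtype=float)
        Kz = gaussian_kernel(Z, self.X, self.sigma)
        row_mean = Kz.mean(axis=1)
        Kz_c = Kz - row_mean[:, None] - self.col_mean[None, :] + self.all_mean
        # Squared distance to the training mean in feature space; k(z, z) = 1 for this kernel.
        spherical = 1.0 - 2.0 * row_mean + self.all_mean
        proj = Kz_c @ self.alphas                # coordinates on the principal directions
        return spherical - (proj ** 2).sum(axis=1)


rng = np.random.default_rng(0)
train = rng.standard_normal((50, 2))             # placeholder features of one class
model = OneClassKPCA(sigma=4.0, n_components=40).fit(train)
err_train = model.reconstruction_error(train)
err_outlier = model.reconstruction_error(np.array([[20.0, 20.0]]))
```

A small error means the sample is well described by the class model, so a sample far from the training data receives a much larger error than the training samples do.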

Table 2 Recognition rate (percent) for the biopsy image data from individual KPCAs and the combined model

The evaluation results on the 3D OCT retinal images are listed in Table 3. Three types of image features were extracted, namely LPQ, LBP-TOP, and MSSP. Therefore, for each image class, three kernel PCAs were built, one for each type of image feature. The recognition rates obtained using these KPCAs individually are listed in columns 2 to 4 of Table 3, where LPQK, LBPK, and MSSPK represent the KPCA models trained from LPQ, LBP-TOP, and MSSP, respectively. The results of combining all KPCA models are listed in the last column of Table 3. The parameters of the KPCAs were set to σ=4 and n=40. The combined model gives the best classification performance for each image class; the averaged classification accuracy for these two image classes is 92.06%.

Table 3 Recognition rate (percent) for the 3D OCT retinal image data from individual KPCAs and the combined model

From Tables 2 and 3, one can see that the proposed product combining rule improves the classification accuracy of every image class. This illustrates that combining one-class classifiers trained from different features can improve the classification performance, which is in accordance with the observation in [30]. For comparison, other one-class classifiers were also used as the base classifier of the ensemble with the same combining rule. The classification results on the biopsy image set and the 3D OCT retinal image set are listed in Tables 4 and 5, respectively.

Table 4 Recognition rate (percent) for biopsy image data from different one-class classifier ensembles
Table 5 Recognition rate (percent) for 3D OCT retinal image data from different one-class classifier ensembles
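The idea of combining per-feature models by a product can be sketched as below. The rule actually used is the one given in Equation 21, which is not reproduced in this section; the code only illustrates a generic product rule in which each feature-specific model contributes a positive confidence per class, the confidences are multiplied, and the class with the largest product is chosen. The example scores are made up for illustration.

```python
import numpy as np


def product_rule(scores):
    """Combine confidences of shape (n_models, n_samples, n_classes) by multiplication.

    Returns the normalized combined confidences and the predicted class per sample.
    """
    scores = np.asarray(scores, dtype=float)
    combined = np.prod(scores, axis=0)
    combined = combined / combined.sum(axis=1, keepdims=True)
    return combined, combined.argmax(axis=1)


# Hypothetical confidences from three feature-specific models for one image, three classes.
example_scores = [
    [[0.5, 0.3, 0.2]],
    [[0.4, 0.4, 0.2]],
    [[0.6, 0.1, 0.3]],
]
combined, labels = product_rule(example_scores)
```

With a product, a class must be supported by every model to obtain a high combined confidence, because a single low score pulls the product down.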

To compare a variety of one-class classifiers, six were used as the base classifier for the ensemble: Parzen, SVDD, PCA, Kmeans, MoG, and KPCA. The receiver operating characteristic (ROC) curves obtained using the different one-class classifiers on the biopsy image data are shown in Figure 7. The x axis of the ROC curves is the false positive rate (FPr) and the y axis is the true positive rate (TPr). The TPr and FPr are obtained by Equations 23 and 24, respectively. A threshold on the difference between the largest and the second largest confidence score was used to obtain the trade-off between TPr and FPr. The threshold was initially set to 0.05 and then increased in steps of 0.01 up to 0.60; the TPr and FPr were computed at each threshold value. The areas under the ROC curves (AUC) for the compared classifiers are listed in Table 6; the KPCA ensemble gives the best result.

$$\mathrm{TPr}=\frac{\text{True positive}}{\text{True positive}+\text{False negative}} \qquad (23)$$

$$\mathrm{FPr}=\frac{\text{False positive}}{\text{False positive}+\text{True negative}} \qquad (24)$$

Figure 7

Receiver operating characteristics curves of different one-class classifiers. These classifiers were used as the base classifier for the ensemble on the biopsy image data.

Table 6 AUC of different one-class classifiers used as the base classifier for the ensemble on the biopsy image data
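The threshold sweep behind the ROC curves can be sketched as follows. The text does not spell out how the four outcome counts are defined once a margin threshold is applied, so the sketch adopts one possible reading as an explicit assumption: a sample is accepted when the gap between its two largest confidence scores reaches the threshold, accepted samples count as true or false positives according to whether the predicted class is correct, and rejected samples count as false or true negatives in the same way. Closing the curve with the points (0, 0) and (1, 1) before integrating is likewise an assumption, and the confidences are made-up values.

```python
import numpy as np


def margin_roc(conf, y_true, thresholds):
    """Return an array of (TPr, FPr) pairs, one per threshold on the top-two score gap."""
    conf = np.asarray(conf, dtype=float)
    y_true = np.asarray(y_true)
    top2 = np.sort(conf, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]             # largest minus second largest score
    correct = conf.argmax(axis=1) == y_true
    points = []
    for t in thresholds:
        accepted = margin >= t
        tp = np.sum(accepted & correct)
        fp = np.sum(accepted & ~correct)
        fn = np.sum(~accepted & correct)
        tn = np.sum(~accepted & ~correct)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0   # Equation 23
        fpr = fp / (fp + tn) if (fp + tn) else 0.0   # Equation 24
        points.append((tpr, fpr))
    return np.array(points)


def auc(points):
    """Trapezoidal area under the curve, closed with (0, 0) and (1, 1)."""
    fpr = np.concatenate([[0.0], points[:, 1], [1.0]])
    tpr = np.concatenate([[0.0], points[:, 0], [1.0]])
    order = np.lexsort((tpr, fpr))               # sort by FPr, then by TPr
    fpr, tpr = fpr[order], tpr[order]
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Thresholds from 0.05 to 0.60 in steps of 0.01, as described in the text.
thresholds = 0.05 + 0.01 * np.arange(56)

# Hypothetical confidences for four images and three classes.
conf = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.20, 0.70, 0.10], [0.50, 0.45, 0.05]]
y_true = [0, 1, 1, 0]
roc_points = margin_roc(conf, y_true, thresholds)
```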

The proposed method was also compared with some state-of-the-art methods on the biopsy image set. The compared methods are as follows: (i) the level set histogram (LSH) method proposed in [54]; (ii) a cascade classification system (CAS) in [55], which first classifies the images into ‘cancer’ and ‘non-cancer’ categories and then performs further classification within the ‘cancer’ category to discriminate different cancer types; (iii) a hybrid feature (HF) method proposed in [56], which uses higher-order spectra (HOS), local binary patterns (LBP), and Laws texture energy (LTE) for histopathological image classification, with the Takagi-Sugeno fuzzy model as the classifier.

In our experiment, based on the description in [54], for LSH the images were first converted to grayscale images with an intensity range between 0 and 255; then 25 thresholds in steps of 10 were used to convert the images into binary images (0 and 1). For each binary image, level set segmentation was used to generate a 42-bin histogram for the connected components in the image. Thus, each image finally generated a feature vector of size 42×25=1,050. An SVM with an RBF kernel was used for classification; the parameter σ, which defines the spread of the radial function, was set to 4.0, and the parameter C, which defines the trade-off between the classifier accuracy and the margin, was set to 3.0. For CAS, we used the same classifier, decision tree C5.0, and the same image features as stated in [55]. The feature vector for each image is a combination of first-order statistics, co-occurrence matrix, and steerable filters.

Table 7 lists the performance of the compared methods on the biopsy image set; the proposed method achieved better performance than the other methods. The CAS method obtained an accuracy of 91.94%, which is superior to the accuracy of LSH and HF. The LSH method obtained only 87.38% accuracy on the biopsy image set. LSH used only the level set histograms for image description, while the other compared methods all used composite image features, which suggests that using a combination of different image features can improve classification performance. Figure 8 presents the ROC curves of the compared methods; the AUCs of the ROC curves are listed in Table 7.

Table 7 Performance comparison of some state-of-the-art methods and the proposed method on the biopsy image set
Figure 8

Receiver operating characteristics curves of the compared methods on the biopsy image data.

For the 3D OCT retinal images, the method in [53] was compared with the proposed method. The method in [53] used the same image data, and the image features introduced in Section 4.1.2 were combined into a single image feature, with a Bayes classifier used for classification. A classification accuracy of 91.50% was reported by the authors, while our proposed system achieved 92.06%.

The proposed method was also compared with some state-of-the-art methods on the UCI breast cancer dataset. The compared methods are the following: (i) the multi-layer perceptron ensemble (MLPE) method proposed in [57]; (ii) a boosted neural network (BoostNN) classifier in [58]; (iii) a decision tree (DT) and support vector machine sequential minimal optimization (SVM-SMO) based ensemble classifier proposed by Luo and Cheng [59]. The results are listed in Table 8.

Table 8 Comparison of classification accuracy on the UCI breast cancer image set

5 Conclusions

In this paper, a classification scheme based on a one-class KPCA model ensemble has been proposed for the classification of medical images. The ensemble consists of one-class KPCA models trained using different image features from each image class, and a proposed product combining rule was used for combining the kernel PCA models to produce classification confidence scores for assigning an image to each class. The effectiveness of the proposed classification scheme was verified using a breast cancer biopsy image dataset and a 3D OCT retinal image set. The proposed classification scheme obtained high classification accuracy on the tested image sets.

Although the proposed system has shown promising results with respect to the biopsy image classification task, there are still some aspects that need to be further investigated. The benchmark images used in this work were cropped from the original biopsy scans and only cover the important areas of the scans. However, it is often difficult to find regions of interest (ROIs) that contain the most important tissues in biopsy scans; therefore, more effort needs to be put into detecting ROIs from biopsy images. The parameters of the kernel PCA models, such as the number of principal components and the width of the Gaussian kernel, were fixed during the experiments. In future research, optimization methods or adaptive algorithms should be considered for searching for the optimal parameters of the KPCA models.