Background

Motivation

In most supervised classification schemes, a training set of exemplars from each class is used to train a classifier to distinguish between the different object classes. The training exemplars (e.g. images, pixels, regions of interest) usually have a semantic label assigned to them by an expert describing a category of interest or class to which they belong. Each training exemplar serves as an observation of the domain space; as the space is sampled more completely, the resulting classifier should achieve greater accuracy when predicting class labels for new, unlabeled (unseen) data. Thus, typically, the larger the training set, the greater the accuracy of the resulting classifier [1]. In most cases, the training set of labeled data for each of the object categories is generated by a human expert who manually annotates a pool of unlabeled samples by assigning a label to each exemplar.

The use of computers in histopathology analysis, known as digital pathology, is an increasingly common practice that promises to facilitate the detection, diagnosis, and treatment of disease [2]. Supervised classifiers have been applied in this context for a number of problems such as cancer detection and grading [3-8]. If the objective of the classifier is to distinguish normal from cancerous regions of tissue, exemplars corresponding to each class need to be manually labeled by a domain expert (typically a pathologist). Figure 1 shows an image from such an annotation task, where a prostate tissue sample stained with hematoxylin and eosin (H&E) has been digitized at 40× optical magnification using a whole-slide scanner. In this case, the goal of the supervised classifier is to identify regions of carcinoma of the prostate (CaP, the target class). The black contour in Figure 1 indicates the target class and was placed manually by an expert pathologist. We have previously shown [3] that a supervised classifier can accurately distinguish between CaP and non-CaP, but the annotation process required to build a large training set is laborious, time consuming, and expensive. The digitized images can be over 2 gigabytes (several million pixels) in size, making it difficult to quickly identify cancerous regions within the digital slide. In addition, CaP often appears within and around non-CaP areas, and the boundary between these regions is not always clear (even to a trained expert). These factors increase the time, effort, and overall cost associated with training a supervised classifier in the context of digital pathology. To reduce the cost and effort involved in training these classifiers, it is important to utilize an intelligent labeling strategy. In traditional supervised classification, samples are chosen from an unlabeled pool, annotated, and used to train a classification algorithm. This is known as random learning (RL), illustrated by the flowchart in Figure 2 (top row). In RL, no prior knowledge about the nature of the unlabeled samples is used, and it is possible that many non-informative samples (samples that will not have a positive impact on classifier performance) will be annotated; clearly a wasted effort. To improve training efficiency, a strategy known as active learning (AL) was developed to select only "informative" exemplars for annotation [9, 10].

Figure 1

Annotated Prostate Biopsy Tissue Image. Annotation of CaP (black contour) on digital histopathology. CaP tissue often appears near and around non-CaP tissue, making annotation difficult and time-consuming.

Figure 2

Random Learning vs Active Learning Flowchart. Comparison of Random Learning (RL, top row) and Active Learning (AL, bottom row) training processes. In RL, unlabeled data (a) is sent to an expert (b), who assigns a label to each sample in the image (c): red regions indicate cancer, and green indicates non-cancer. These labeled samples are used to train a supervised classifier (d). In AL, unlabeled samples (e) are analyzed to find informative samples (f), and only informative samples (g) are annotated for training (h). The supervised classifier (i) can be re-trained and used to identify new samples that may be informative. In the AL setup, only new samples that will improve classification accuracy are added.

Informative samples are those which, if annotated and added to the training set, would increase the accuracy of the resulting trained classifier. In this setup, illustrated in Figure 2 (bottom row), the AL algorithm identifies informative samples (those which are difficult to classify) in an unlabeled dataset for annotation and addition to the growing training set. AL generates training sets that yield better classifier performance compared with training sets of the same size obtained via RL. The concept of "informative" samples in this context is related to the idea of margin-based classification in support vector machines (SVMs) [11], where labeled samples close to a decision boundary are used to classify unlabeled samples. In the AL context, informative samples are difficult-to-classify unlabeled data points that improve an existing training set.

Several AL algorithms have been proposed to determine whether an unlabeled sample is informative. These methods measure the "informativeness" of a sample as the distance to a support-vector hyperplane [12, 13], the disagreement among bagged weak classifiers [9, 10], variation in feature distributions [14, 15], or model-based predictions [16]. In a bioinformatics context, Lee et al. [17] showed the benefits of using AL in building a naive Bayes classifier to identify disease states for several different datasets. Veeramachaneni et al. [18] implemented an AL training approach to build a classifier identifying patient status from tissue microarray data. Previously [19], we investigated the performance of different AL algorithms in creating training sets for distinguishing diseased from non-diseased tissue samples. Among the results of that study, we found that the particular AL algorithm chosen for learning had no significant effect on the performance of the supervised classifier.

Another major issue in supervised training is the minority class problem, wherein the target class is under-represented in the dataset relative to the non-target classes. A labeled training set comprises two sets of samples: $S^{tr}_{\omega_1}$, the training samples from the target (minority) class, and $S^{tr}_{\omega_2}$, the samples from the non-target (majority) class. In the minority class problem, $|S^{tr}_{\omega_1}| \ll |S^{tr}_{\omega_2}|$, where $|\cdot|$ indicates set cardinality. Several researchers [20-24] have shown that such a training set will likely yield a classifier with lower accuracy and area under the receiver operating characteristic curve (AUC) compared with training sets where $|S^{tr}_{\omega_1}| = |S^{tr}_{\omega_2}|$ or $|S^{tr}_{\omega_1}| \gg |S^{tr}_{\omega_2}|$. Weiss and Provost [20] showed that for several datasets, varying the percentage of the minority class in the training set alters the accuracy and AUC of the resulting classifiers, and that the optimal class ratio was significantly different from the "natural" ratio. Japkowicz and Stephen [21] found that the effect of the minority class problem depends on a number of factors, including the complexity of the target class and the size of the class disparity. Chawla et al. [22] proposed mitigating the problem by over-sampling the minority class using synthetic samples; however, this method may simply introduce noise if the target class is too complex.

While some research has addressed the minority class problem in biomedical data [17, 25], there has been little related work in the realm of digital pathology. Cosatto et al. [26] applied an SVM AL method [12] to train a classifier for grading nuclear pleomorphism on breast tissue histology, while Begelman et al. [27] employed an AL-trained SVM classifier in building a telepathology system for prostate tissue analysis. However, these studies did not account for the minority class problem in the training set, which is particularly relevant in the context of digital pathology, since the target class (cancer) is observed far less often than the non-target class (non-cancer) and occupies only a small percentage of the overall tissue area. Ideally, an intelligent training strategy for this domain would combine AL with a mechanism for addressing the minority class problem by maintaining a user-defined class ratio (class balancing). Zhu and Hovy [23] combined an entropy-based AL technique with over- and under-sampling to overcome the minority class problem for text classification, and found that over-sampling the minority class yielded the highest classifier performance. However, they did not investigate different class ratios and did not discuss the increased cost of the sampling techniques. Bloodgood and Vijay-Shanker [28] focused on an AL and classification method based on SVMs for unbalanced text and protein expression data; their approach involves estimating the class balance in the entire dataset, and then selecting samples to overcome this bias (as opposed to overcoming bias in the growing training set generated by AL).

While additional sampling can help to mitigate the minority class problem, this process requires more annotations compared to a training set with unbalanced classes. Because the cost of obtaining each annotation is high, it would be beneficial to be able to predict the number of annotations required to obtain a class-balanced training set of a pre-defined size. These predictions are critical for determining, a priori, the amount of resources (time, money, manpower) that will be employed in developing a supervised classifier. An analytical cost model will enable us to predict the cost involved in training the supervised classifier. Additionally, such a model will provide some insight into the relationship between (1) the size of a training set, (2) its class balance, and (3) the number of annotations required to achieve a predefined target accuracy.

Contributions and Significance

In this work, we develop an AL-based classifier training strategy that also accounts for the minority class problem. This training strategy is referred to as "Class-Balanced Active Learning" (CBAL). We apply CBAL to the problem of building a supervised classifier to distinguish between CaP and non-CaP regions on images of prostate histopathology. For this particular problem, training samples are difficult and expensive to obtain, and the target class (CaP) is relatively sparse in relation to the non-target class; thus, we expect CBAL to yield large benefits in terms of training cost. Our mathematical model is used to predict the cost of building a training set of a pre-defined size and class ratio. This is, to the best of our knowledge, the first in-depth investigation and modeling of AL-based training for supervised classifiers that also specifically addresses the minority class problem in the context of digital pathology. However, CBAL training can be easily applied to other domains where obtaining annotated training samples is a time-consuming and difficult task, and where the target and non-target class ratios are not balanced. The rest of the paper is organized as follows. In the Methods section we describe the theory behind CBAL, followed by a description of the algorithms and their implementation. We then describe our experimental design, present the results and discussion, and close with concluding remarks.

Methods

Modeling the Annotation Cost of Class Balancing in Training

Notation and Symbols

A table containing commonly used notation and symbols is presented in Table 1. Our data comprises a set of square image regions $r \in R$ on digitized prostate images, represented by the red squares in Figure 2(e). The regions $r \in R$ are divided into an unlabeled training pool, $S^{tr}$, and an independent labeled testing pool, $S^{te}$. Each sample has been identified as belonging either to the minority class $\omega_1$ (in this case the cancer class) or to the majority (non-cancer) class, $\omega_2$. We denote membership of sample $r \in R$ in the minority class $\omega_1$ as $r \hookrightarrow \omega_1$, and these samples are "minority class samples." At iteration $t \in \{0, 1, \cdots, T\}$ of AL, the labeled training set is denoted $S^{tr}_{t,\Phi}$, where $\Phi$ denotes the training methodology and $T$ is the maximum number of iterations. At each iteration $t$, a set of $M$ weak binary classifiers is trained on $S^{tr}_{t,\Phi}$ and used to build a strong classifier, $T_t(r) \in [0, 1]$. The selectivity of the AL algorithm is parameterized by $\tau \in [0, 0.5]$, the confidence margin. We denote by $\hat{k}_1$ and $\hat{k}_2$ the desired number of samples $r \in R$ in the final training set for which $r \hookrightarrow \omega_1$ and $r \hookrightarrow \omega_2$, respectively. The total number of samples annotated at any iteration $t$ is denoted $N_t$.

Table 1 Notation and Symbols

Theory of CBAL

In this subsection, we describe the theoretical foundation of the CBAL approach. Our goal is to precisely define an "informative sample," identify the likelihood of observing a sample of the target class, and predict the number of samples that must be annotated before a specified number of target-class samples has been observed and annotated. This allows us to predict a priori the cost of the system in terms of actively-learned annotations, which in turn represent an expenditure of resources.

Definition 1. The set of informative samples (eligible for annotation), $S^E_t$, at any iteration $t$ is given by the set of samples $r \in R$ for which $0.5 - \tau \le T_t(r) \le 0.5 + \tau$.

The value of $T_t(r)$ denotes the classification confidence, where $T_t(r) = 1$ indicates strong confidence that $r \hookrightarrow \omega_1$, and $T_t(r) = 0$ indicates strong confidence that $r \hookrightarrow \omega_2$. The numbers of samples $r \in S^E_t$ for which $r \hookrightarrow \omega_1$ and $r \hookrightarrow \omega_2$ are denoted $k_{1,t}$ and $k_{2,t}$, respectively. The likelihood of randomly selecting a sample $r \hookrightarrow \omega_1$ from $S^E_t$ is $p_t(r \hookrightarrow \omega_1) = \frac{k_{1,t}}{k_{1,t} + k_{2,t}}$. The number of samples annotated from class $\omega_2$ is $N_t - \hat{k}_1$.

Proposition 1. Given the probability $p_t(r \hookrightarrow \omega_1)$ of observing a sample $r \hookrightarrow \omega_1$ at any iteration $t$, the probability $\hat{P}_t$ of observing $\hat{k}_1$ samples from class $\omega_1$ after annotating $N_t$ samples is:

$$\hat{P}_t = \binom{N_t + \hat{k}_1 - 1}{N_t}\,\bigl[p_t(r \hookrightarrow \omega_1)\bigr]^{N_t}\,\bigl[1 - p_t(r \hookrightarrow \omega_1)\bigr]^{\hat{k}_1} \qquad (1)$$

Proof. Revealing the label of a sample $r \in S^E_t$ is an independent event resulting in observation of either class $\omega_1$ or class $\omega_2$. The probability of success (i.e. observing a minority class sample) is $p_t(r \hookrightarrow \omega_1)$, and the probability of failure is $p_t(r \hookrightarrow \omega_2) = 1 - p_t(r \hookrightarrow \omega_1)$ in the two-class case. We assume that $S^E_t$ is large, so $p_t(r \hookrightarrow \omega_1)$ is fixed. The annotations continue until $\hat{k}_1$ successes are achieved. Because of these properties, the number of annotations $N_t$ is a negative binomial random variable, and the probability of observing $\hat{k}_1$ samples from class $\omega_1$ in $N_t$ annotations is given by the negative binomial distribution.

The consequence of Proposition 1 is that as $N_t$ (i.e. the training cost in annotations) increases, $\hat{P}_t$ also increases, indicating a greater likelihood of observing $\hat{k}_1$ samples $r \hookrightarrow \omega_1$. We denote by $P_\Delta$ the target probability for the model, representing the degree of certainty that, within $N_t$ annotations, we have obtained our $\hat{k}_1$ samples $r \in R$ for which $r \hookrightarrow \omega_1$.
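As a concrete illustration, the right-hand side of Equation 1 can be evaluated directly. The following Python sketch is a minimal transcription of that expression (it is not part of the original implementation); the argument p1 stands for $p_t(r \hookrightarrow \omega_1)$:

from math import comb

def prob_minority_quota_met(N_t, k1_hat, p1):
    # Transcription of Equation 1: estimated probability of having observed
    # k1_hat minority-class samples within N_t annotations, given that a
    # randomly selected eligible sample is a minority sample with probability p1.
    return comb(N_t + k1_hat - 1, N_t) * (p1 ** N_t) * ((1 - p1) ** k1_hat)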

Proposition 2. Given a target probability $P_\Delta$, the number of annotations required before $\hat{k}_1$ minority class samples are observed in $S^E_t$ is:

$$N_t = \operatorname*{argmin}_{\hat{k}_1 \le x \le |S^{tr}|}\;\Bigl|\,P_\Delta - \binom{x + \hat{k}_1 - 1}{x}\bigl[p_t(r \hookrightarrow \omega_1)\bigr]^{x}\bigl[1 - p_t(r \hookrightarrow \omega_1)\bigr]^{\hat{k}_1}\Bigr| \qquad (2)$$

Proof. We wish to find the value of $N_t$ that causes Equation 1 to match our target probability $P_\Delta$. When that happens, $\hat{P}_t = P_\Delta$ and $\hat{P}_t - P_\Delta = 0$. Using the minimization in Equation 2, we obtain the value of $N_t$.

Proposition 2 gives us an analytical formulation for $N_t$. Note that Equation 2 returns the smallest $N_t$ that matches $P_\Delta$. The possible values of $N_t$ range from $\hat{k}_1$, in which case every annotated sample belongs to the minority class, to $|S^{tr}|$, in which case the entire dataset is annotated before $\hat{k}_1$ minority samples are obtained. Note that we are assuming that the unlabeled training set from which we are sampling contains at least $\hat{k}_1$ minority class samples.
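Equation 2 can be evaluated with a simple linear scan over the candidate annotation counts. The sketch below assumes the prob_minority_quota_met helper from the previous sketch and returns the candidate whose Equation-1 probability is closest to the target $P_\Delta$; it is an illustration of the minimization, not the authors' code:

def predicted_annotations(k1_hat, p1, pool_size, P_delta):
    # Equation 2: scan x from k1_hat to |S^tr| and keep the value whose
    # Equation-1 probability lies closest to the target probability P_delta.
    best_x, best_gap = k1_hat, float("inf")
    for x in range(k1_hat, pool_size + 1):
        gap = abs(P_delta - prob_minority_quota_met(x, k1_hat, p1))
        if gap < best_gap:
            best_x, best_gap = x, gap
    return best_x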

Algorithms and Implementation

AL Algorithm for Selecting Informative Samples

The CBAL training strategy consists of two algorithms that work in tandem: ActiveTrainingStrategy, for selecting informative samples, and MinClassQuery, for maintaining class balance. Algorithm ActiveTrainingStrategy, detailed below, requires a pool of unlabeled samples, $S^{tr}$, from which samples will

Algorithm ActiveTrainingStrategy

Input: $S^{tr}$, $T$

Output: $S^{tr}_{T,\Phi}$, $T_T$

begin
  0. initialization: create bootstrap training set $S^{tr}_{0,\Phi}$, set $t = 0$
  1. while $t < T$ do
  2.   Create classifier $T_t$ from training set $S^{tr}_{t,\Phi}$;
  3.   Find eligible sample set $S^E_t$ where $T_t(r) = \frac{1}{2} \pm \tau$;
  4.   Annotate $K$ eligible samples via MinClassQuery() to obtain $\hat{S}^E_t$;
  5.   Remove $\hat{S}^E_t$ from $S^{tr}$ and add to $S^{tr}_{t+1,\Phi}$;
  6.   $t = t + 1$;
  7. endwhile
  8. return $T_T$, $S^{tr}_{T,\Phi}$;
end

be drawn for annotation, as well as a parameter for the maximum number of iterations, $T$. This parameter can be chosen according to the available training budget or through a pre-defined stopping criterion. The output of the algorithm is a fully annotated training set $S^{tr}_{T,\Phi}$ as well as the classifier $T_T$ trained on that set. The identification of the informative samples occurs in Step 3, using the fuzzy classifier $T_t$ generated in Step 2 from a set of $M$ weak binary decision trees [29] combined via bagging [30]. Informative samples are those for which half of the $M$ weak binary decision trees disagree; that is, samples for which $0.5 - \tau \le T_t(r) \le 0.5 + \tau$. This approach is similar to the Query-by-Committee (QBC) AL algorithm [9, 10]. While there are several alternative algorithms available to perform AL-based training [12, 14-16], we chose the QBC algorithm in this work due to its intuitive description of sample informativeness and its straightforward implementation. It is important to note that poor performance of $T_t$ does not degrade the ability of the algorithm to identify informative samples. We expect that at low $t$, the performance of $T_t$ will be low due to the lack of sufficient training, and much of the dataset will be identified as informative.

However, even if $T_t$ identifies the majority of unlabeled samples as informative, the approach is no less efficient than RL. In the worst-case scenario, where all unlabeled samples are considered informative, we are forced to choose training samples at random, which is equivalent to traditional supervised training.
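The sketch below illustrates the flavor of ActiveTrainingStrategy using off-the-shelf components: bagged decision trees act as the committee, and samples whose bagged confidence falls within $[0.5 - \tau, 0.5 + \tau]$ are treated as informative. The oracle_label function is a hypothetical stand-in for the expert annotator, the seed set is assumed to contain both classes, and the simple eligible[:K] selection takes the place of MinClassQuery; this is a minimal sketch under those assumptions, not the authors' implementation:

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def active_training(X_pool, oracle_label, X_seed, y_seed, T=40, tau=0.25, K=2, M=25):
    # Query-by-Committee style loop: retrain the bagged committee each iteration,
    # collect the samples it is least certain about, and send K of them for annotation.
    X_train, y_train = list(X_seed), list(y_seed)
    pool = list(range(len(X_pool)))               # indices of still-unlabeled samples
    clf = None
    for t in range(T):
        clf = BaggingClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=M)
        clf.fit(np.array(X_train), np.array(y_train))
        conf = clf.predict_proba(X_pool[pool])[:, 1]          # committee confidence
        informative = np.where(np.abs(conf - 0.5) <= tau)[0]  # 0.5 - tau <= conf <= 0.5 + tau
        eligible = [pool[i] for i in informative]
        if not eligible:
            break
        for i in eligible[:K]:                    # MinClassQuery would enforce the class ratio here
            X_train.append(X_pool[i])
            y_train.append(oracle_label(i))       # one paid expert annotation
            pool.remove(i)
    return clf, np.array(X_train), np.array(y_train)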

Obtaining Annotations While Maintaining Class Balance

Algorithm MinClassQuery is used by ActiveTrainingStrategy to select samples from the set of eligible samples, $S^E_t$, according to a class ratio specified by $\hat{k}_1$ and $\hat{k}_2$. Recall that $K = \hat{k}_1 + \hat{k}_2$, and so $K > 0$. We expect that there will be many more samples from $\omega_2$ (the majority class) than from $\omega_1$. Because these

Algorithm MinClassQuery

Input: $S^E_t$, $K > 0$, $\hat{k}_1$, $\hat{k}_2$

Output: $\hat{S}^E_t$

begin
  0. initialization: $\hat{S}^E_t = \emptyset$, $k_1 = 0$, $k_2 = 0$
  1. while $|\hat{S}^E_t| < K$ do
  2.   Find class $\omega_i$ of a random sample $r \in S^E_t$, $i \in \{1, 2\}$;
  3.   if $k_i < \hat{k}_i$
  4.     Remove $r$ from $S^E_t$ and add to $\hat{S}^E_t$;
  5.     $k_i = k_i + 1$;
  6.   else
  7.     Remove $r$ from $S^E_t$;
  8.   endif
  9. endwhile
  10. return $\hat{S}^E_t$;
end

samples are being annotated, they are removed from the unlabeled eligible sample pool $S^E_t$ in Step 7; however, since the resources have been expended to annotate them, they can be saved for future iterations.
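A minimal Python sketch of this querying step is given below. Here oracle_label(i) again stands in for the expert (returning 1 for the minority class, 0 for the majority class), and each call to it is counted as one annotation so the empirical cost $N_t$ can be tracked. This illustrates the idea rather than the authors' implementation; in particular, it simply discards over-quota samples rather than caching them for later iterations:

import random

def min_class_query(eligible, oracle_label, k1_hat, k2_hat, seed=None):
    # Annotate randomly chosen eligible samples until the per-class quotas
    # (k1_hat minority, k2_hat majority) are filled or the eligible pool is empty.
    rng = random.Random(seed)
    pool = list(eligible)
    rng.shuffle(pool)
    quota, counts = {1: k1_hat, 0: k2_hat}, {1: 0, 0: 0}
    selected, labels, annotations = [], [], 0
    while pool and (counts[1] < quota[1] or counts[0] < quota[0]):
        i = pool.pop()
        label = oracle_label(i)      # one paid expert annotation
        annotations += 1
        if counts[label] < quota[label]:
            selected.append(i)
            labels.append(label)
            counts[label] += 1
    return selected, labels, annotations   # annotations is the observed N_t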

Updating Cost Model and Stopping Criterion Formulation

At each iteration, we can calculate $N_t$ using Equation 1. We can estimate $p_0(r \hookrightarrow \omega_1)$ based on the size of the target class observed empirically in the initial training set (< 10%); for $t > 0$, we update the probability of observing a minority class sample using the following equation:

$$p_{t+1}(r \hookrightarrow \omega_1) = \frac{k_{1,t} - \hat{k}_1}{k_{1,t} + k_{2,t} - N_t}, \qquad (3)$$

and $N_{t+1}$ is re-calculated via the minimization in Equation 2. If $\{r \in S^{tr} \mid r \hookrightarrow \omega_1\} = \emptyset$, then $k_{1,t} - \hat{k}_1 = 0$ and thus $p_{t+1}(r \hookrightarrow \omega_1) = 0$. If there are no remaining samples in $S^{tr}$, then $k_{1,t} + k_{2,t} = N_t$ and $p_{t+1}(r \hookrightarrow \omega_1)$ is undefined. Essentially, we must assume that (1) there are at least some samples $r \in S^{tr}$ for which $r \hookrightarrow \omega_1$, and (2) $S^{tr} \neq \emptyset$. The cost of the entire training, denoted $C$, is calculated by summing $N_t$ over all $t$:

$$C = \sum_{t=1}^{T} N_t. \qquad (4)$$
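The update in Equation 3 and the running cost in Equation 4 amount to a few lines of bookkeeping. The sketch below uses hypothetical variable names (not the original code) and assumes k1_t and k2_t count the minority and majority samples in the current eligible pool:

def update_minority_probability(k1_t, k2_t, k1_hat, N_t):
    # Equation 3: probability of drawing a minority sample at iteration t+1,
    # after N_t samples (k1_hat of them from the minority class) were annotated
    # and removed from the eligible pool.
    remaining = k1_t + k2_t - N_t
    if remaining <= 0:
        raise ValueError("unlabeled pool exhausted; Equation 3 is undefined")
    return (k1_t - k1_hat) / remaining

def total_training_cost(annotations_per_iteration):
    # Equation 4: the overall cost is the sum of the per-iteration annotation counts.
    return sum(annotations_per_iteration)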

ActiveTrainingStrategy repeats until one of two conditions is met: (1) $S^{tr}$ is empty, or (2) the maximum number of iterations $T$ is reached. A stopping criterion can be trained off-line to determine the value of $T$ as the smallest $t$ that satisfies:

$$|A_t - A_{t-1}| \le \delta, \qquad (5)$$

where $\delta$ is a similarity threshold and $A_t$ is the accuracy of classifier $T_t$ (as evaluated on a held-out set). Thus, when additional training samples no longer increase the resulting classifier's accuracy, training can cease. An assumption in using this stopping criterion is that adding samples to the training set will not decrease classifier accuracy, and that accuracy will rise asymptotically. The total number of iterations $T$ determines the size of the final training set and can be specified manually or found using this stopping criterion. Classifiers that require a large training set will require a large value of $T$, increasing cost.
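In code, the stopping criterion of Equation 5 reduces to a comparison of the two most recent hold-out accuracy values; a minimal sketch:

def should_stop(accuracy_history, delta):
    # Equation 5: stop once the hold-out accuracy changes by at most delta
    # between consecutive iterations of active learning.
    if len(accuracy_history) < 2:
        return False
    return abs(accuracy_history[-1] - accuracy_history[-2]) <= delta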

Selection of Free Parameters

Our methodology contains a few free parameters that must be selected by the user. The training algorithm employs three parameters: the similarity threshold $\delta$ (Equation 5); the confidence margin $\tau$; and the number of samples from each class to add per iteration, $\hat{k}_1$ and $\hat{k}_2$. The choice of $\delta$ will determine the maximum number of iterations, $T$, the algorithm is allowed to run. A small value of $\delta$ will require a larger final training set (i.e. a larger $T$) before the algorithm satisfies the stopping criterion. Additionally, if Equation 5 is never satisfied, then all available training samples will eventually be annotated ($S^{tr}$ will be exhausted).

The confidence margin $\tau$ defines the range of values of $T_t(r)$ for which sample $r$ is considered informative (difficult to classify). Smaller values of $\tau$ define a narrower band within the interval [0, 1], requiring greater classifier uncertainty for a sample to be selected. $\tau = 0.0$ indicates that only samples for which $T_t(r) = 0.5$ (i.e. perfect classifier disagreement) are informative, while $\tau = 0.5$ indicates that all samples are informative (equivalent to random learning). The number of samples to add from each class during an iteration of learning, $\hat{k}_1$ and $\hat{k}_2$, determines how many annotations occur before a new round of learning starts.

Consider the following two cases:

  1. $\hat{k}_1 = \hat{k}_2 = 10$: in this case, 20 samples (10 from each class) are annotated per iteration.

  2. $\hat{k}_1 = \hat{k}_2 = 1$: in this case, 2 samples (1 per class) are annotated per iteration.

In both cases, the learning algorithm for selecting informative samples is only updated after each iteration. In the first case, 20 samples are added to $\hat{S}^E_t$ before new learning occurs, while in the second case, the learning algorithm is updated after each additional pair of samples is annotated. Thus, in case 2, each additional sample is chosen using the maximum amount of available information, while in case 1, several samples are added before the learning algorithm is updated. Although the second case requires ten iterations to reach the same training set size as the first case, each additional annotation is chosen based on an updated AL model, ensuring that all 20 samples are informative.

Experimental Design

Data Description

We apply the CBAL training methodology to the problem of prostate cancer detection from biopsy samples. Glass slides containing prostate biopsy samples are digitized at 40× magnification (0.25 μm per pixel resolution). The original images are reduced in size using a pyramidal decomposition scheme [31] to 6.25% of their original size (4.0 μm per pixel resolution), matching the resolution of the images used in [3]. Each image is divided into square regions $r \in R$, such that each region constitutes a 30-by-30-pixel area (120-by-120 μm). These image regions constitute the dataset used for training and testing. Ground truth annotation is performed manually by an expert pathologist, who places a contour on tissue regions on the original 40× magnification image. Pathologists annotated both cancer and non-cancer regions of tissue, and only annotated regions were included in the dataset. A total of 100 biopsy images were analyzed from 58 patients, yielding over 12,000 annotated image regions. All of the 58 patients exhibited prostate cancer, although cancer was not present in all 100 images. The square regions were assumed to be independently drawn from the images.
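For concreteness, the following sketch shows one way to perform this pre-processing using scikit-image (a recent release supporting channel_axis is assumed). It approximates the pyramidal decomposition with a single Gaussian-smoothed reduction by a factor of 16 (6.25%) and then tiles the reduced image into non-overlapping 30-by-30 regions; the function and parameter choices are illustrative assumptions, not the authors' pipeline:

import numpy as np
from skimage.io import imread
from skimage.transform import pyramid_reduce

def extract_regions(image_path, downscale=16, patch=30):
    # Reduce the 40x image to 6.25% of its original size and cut it into
    # non-overlapping patch x patch regions (the samples r in R).
    img = imread(image_path)
    small = pyramid_reduce(img, downscale=downscale, channel_axis=-1)
    h, w = small.shape[:2]
    regions = [small[y:y + patch, x:x + patch]
               for y in range(0, h - patch + 1, patch)
               for x in range(0, w - patch + 1, patch)]
    return np.stack(regions)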

Feature Extraction

In [3], we built a classifier for discriminating between cancer and non-cancer at the pixel level. We extracted several hundred texture features comprising three classes of texture descriptors: grey-level statistics of image intensities, Haralick texture features based on the co-adjacency of image intensities, and Gabor filter features based on a filter bank utilizing phase and scale parameters. Examples of these feature types are given in Figure 3. We employed the Adaptive Boosting (AdaBoost) algorithm [32], which assigns a weight to each feature based on its discriminating power. Features with a higher weight are better able to capture the differences between classes; a subset of highly informative features can be selected as those with weights above 0.05. In the current study, we employed the resulting 14 features under the assumption that features useful for pixel-wise classification would be similarly useful in patch-wise classification of cancer. The feature values were calculated in a pixel-wise fashion for each 30-by-30 region, and each region r was then represented by the average value of the feature over all of its pixels.
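The feature-weighting step can be approximated with a standard boosting implementation. The sketch below uses scikit-learn's AdaBoost feature importances as a stand-in for the weighting scheme of [32] and keeps features whose weight exceeds the 0.05 threshold; it is an illustrative approximation, not the original feature-selection code:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def select_discriminative_features(X, y, threshold=0.05, n_estimators=100):
    # Weight each feature by its contribution to a boosted ensemble of decision
    # stumps and keep the features whose weight exceeds the threshold.
    booster = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                 n_estimators=n_estimators)
    booster.fit(X, y)
    weights = booster.feature_importances_
    return np.where(weights > threshold)[0], weights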

Figure 3

Examples of Feature Types. Examples of the feature types extracted on two ROIs from a biopsy sample (a), identified by black squares. Shown are (b), (f) the original tissue image, (c), (g) a greylevel texture image (standard deviation value), (d), (h) a Haralick texture image (entropy of the co-adjacency matrix), and (e), (i) a Gabor filter feature image. The top row (b)-(e) indicates a cancerous region, while the bottom row (f)-(i) is a benign region.

First-order Statistical Features

First-order features are statistics calculated directly from the pixel values in the image. These include the mean, median, and standard deviation of the pixels within a local window, as well as Sobel filter and directional gradient responses. Of these features, two were included in the subset: the standard deviation and the range of pixel intensities.
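A sketch of the two retained first-order descriptors, computed per pixel over a small window and then averaged over the region, is shown below; the window size is an arbitrary choice for illustration:

import numpy as np
from scipy import ndimage

def first_order_features(region, window=3):
    # Per-pixel standard deviation and intensity range within a window,
    # each averaged over the 30x30 region.
    g = region.astype(float)
    mean = ndimage.uniform_filter(g, size=window)
    mean_sq = ndimage.uniform_filter(g ** 2, size=window)
    std = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))
    rng = ndimage.maximum_filter(g, size=window) - ndimage.minimum_filter(g, size=window)
    return {"std": float(std.mean()), "range": float(rng.mean())}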

Second-order Co-occurrence Features

Co-occurrence image features are based on the adjacency of pixel values in an image. An adjacency matrix is created in which the entry at the i-th row and j-th column equals the number of times pixel values i and j appear within a fixed distance of one another. A total of thirteen Haralick texture features [33] are calculated from this co-adjacency matrix, of which 5 were found to be highly discriminating: information measure, correlation, energy, contrast variance, and entropy.
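A sketch of the co-occurrence computation using scikit-image is shown below; graycoprops supplies contrast, correlation, and energy directly, while entropy is computed from the normalised matrix. The distance and angle settings are illustrative, and the remaining Haralick descriptors of [33] are omitted:

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def cooccurrence_features(region_uint8, distance=1):
    # Build a normalised grey-level co-occurrence matrix and read off a few
    # Haralick-style statistics from it.
    glcm = graycomatrix(region_uint8, distances=[distance],
                        angles=[0, np.pi / 2], levels=256,
                        symmetric=True, normed=True)
    feats = {prop: float(graycoprops(glcm, prop).mean())
             for prop in ("contrast", "correlation", "energy")}
    p = glcm[glcm > 0]
    feats["entropy"] = float(-np.sum(p * np.log2(p)))
    return feats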

Steerable Filter Features

To quantify spatial and directional textures in the image, we utilize a steerable Gabor filter bank [34]. The Gabor filter is parameterized by frequency and orientation (phase) components; when convolved with an image, the filter gives a high response for textures that match these components. We compute a bank of 40 Gabor filters spanning a range of frequency and orientation values, of which 7 were found to be informative.
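The filter-bank responses can be sketched with scikit-image's Gabor filter; the frequencies and orientations below are placeholders rather than the 40 parameter combinations used in the study, and the mean response magnitude is taken as the per-region feature value:

import numpy as np
from skimage.filters import gabor

def gabor_features(region, frequencies=(0.1, 0.2, 0.3, 0.4),
                   orientations=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    # Convolve the region with Gabor filters over a small grid of frequency and
    # orientation values and record the mean response magnitude of each filter.
    feats = []
    for f in frequencies:
        for theta in orientations:
            real, imag = gabor(region.astype(float), frequency=f, theta=theta)
            feats.append(float(np.hypot(real, imag).mean()))
    return np.array(feats)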

Evaluation of Training Set Performance via Probabilistic Boosting Trees

Evaluation of $S^{tr}_{t,\Phi}$ is performed by testing the accuracy of a classifier trained on it. To avoid biasing the results, we wish to use a different classifier than $T_t$ for evaluation; a probabilistic boosting tree (PBT) [35], denoted $\tilde{T}_t$, is employed. The PBT combines AdaBoost [32] and decision trees [29], recursively generating a decision tree in which each node is boosted with $M$ weak classifiers. The classifier output, $\tilde{T}_t(r)$, is the probability that sample $r$ belongs to the target class. The PBT is used to classify an independent testing set $S^{te}$ (where $S^{te} \cap S^{tr} = \emptyset$), and performance is reported via the area under the receiver operating characteristic curve (AUC) and classifier accuracy. The hard classification for $r \in S^{te}$ is denoted as:

$$\tilde{T}^{\theta}_t(r) = \begin{cases} 1 & \text{if } \tilde{T}_t(r) > \theta \\ 0 & \text{otherwise,} \end{cases} \qquad (6)$$

where $\theta$ is a classifier-dependent threshold. For region $r$, the ground truth label is denoted $G(r) \in \{0, 1\}$, where a value of 1 indicates class $\omega_1$ and 0 indicates class $\omega_2$. The resulting accuracy at iteration $t$ is:

$$A_t = \frac{1}{|S^{te}|} \sum_{r \in S^{te}} \begin{cases} 1 & \text{if } G(r) = \tilde{T}^{\theta}_t(r) \\ 0 & \text{otherwise.} \end{cases} \qquad (7)$$

We generate receiver operating characteristic (ROC) curves by calculating the classifier's sensitivity and specificity at various decision thresholds $\theta \in [0, 1]$. Each value of $\theta$ yields a single point on the ROC curve, and the area under the curve (AUC) measures the discrimination between cancer and non-cancer regions. The accuracy is then calculated by setting $\theta$ to the operating point of the ROC curve. Again, it should be noted that it is possible to evaluate the performance of the training set using any supervised classifier in place of the PBT. A previous study [36] used both decision trees [29] and SVMs [11] as supervised evaluation algorithms in an AL training experiment and found that the trend in performance for both algorithms was similar. In this study we implemented PBTs because the algorithm differs from $T_t$, which avoids biasing the results; however, alternative evaluation algorithms could certainly be employed.

Although the classifier performance values may change, the goal of these experiments is to show that the performance of an actively-learned, class-balanced training set is better than a randomly generated unbalanced set.
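The evaluation described above maps onto a few scikit-learn calls. In the sketch below the operating point is taken as the ROC point closest to the top-left corner, which is one common convention and an assumption on our part rather than the paper's definition:

import numpy as np
from sklearn.metrics import roc_curve, auc

def evaluate_soft_outputs(scores, labels):
    # scores: soft classifier outputs (probability of the target class) on S^te;
    # labels: ground-truth G(r) in {0, 1}. Returns AUC and accuracy at the
    # operating point of the ROC curve.
    fpr, tpr, thresholds = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)
    op = int(np.argmin(fpr ** 2 + (1 - tpr) ** 2))   # point closest to the top-left corner
    hard = (np.asarray(scores) > thresholds[op]).astype(int)
    accuracy = float(np.mean(hard == np.asarray(labels)))
    return roc_auc, accuracy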

List of Experiments

We perform three sets of experiments to analyze different facets of the active learning training methodology.

Experiment 1: Comparison of CBAL Performance with Alternate Training Strategies

We compare the performance of CBAL with four alternative training strategies to show that CBAL yields a classifier with greater performance than a classifier trained on a same-sized training set obtained using an alternative method.

  • Unbalanced Active Learning (UBAL): The class ratio is not controlled; eligible samples $S^E_t$ determined via AL are randomly annotated and added to $\hat{S}^E_t$.

  • Class Balanced Random Learning (CBRL): All unlabeled samples in $S^{tr}$ are eligible for annotation, while the class balance is held constant as described in MinClassQuery.

  • Unbalanced Random Learning (UBRL): All unlabeled samples are queried randomly. This is the classic training scenario, wherein neither class ratio nor informative samples are explicitly controlled.

  • Full Training (Full): All available training samples are used. This represents the performance when the entire training set is annotated and available (an ideal scenario).

In random learning (RL), all samples in the unlabeled pool $S^{tr}$ are "eligible" for annotation; that is, $S^E = S^{tr}$. In the unbalanced-class experiments, the MinClassQuery algorithm is replaced by simply annotating $K$ random samples (ignoring class) and adding them to $\hat{S}^E$. The full training strategy represents the scenario in which all possible training data are used.

The classifier is tested against an independent testing pool, $S^{te}$, which (along with the training pool) is selected at random from the dataset at the start of each trial. In these experiments, $T = 40$, the confidence margin was $\tau = 0.25$, and the number of samples added at each iteration was $K = 2$. In the balanced experiments, $\hat{k}_1 = \hat{k}_2 = 1$. A total of 12,588 image regions were used in the overall dataset, drawn from the 100 images; 1,346 regions were randomly selected for $S^{te}$ and 11,242 for $S^{tr}$ in each of 10 trials. The regions are assumed to be independent samples of the overall image space due to the heterogeneity of the tissue and the appearance of disease. Because the goal of classification is to distinguish between cancer and non-cancer regions of tissue rather than individual patients, the training and testing sets were drawn randomly from the overall pool of available regions. The true ratio of non-cancer to cancer regions in $S^{tr}$ was approximately 25:1 (4% belonged to the cancer class). A total of 10 trials were performed, with random selection of $S^{tr}$ and $S^{te}$ at the beginning of each trial.

Experiment 2: Effect of Training Set Class Ratio on Accuracy of Resulting Classifier

To explore the effect of the training set class ratio on the performance of the resulting classifier, the CBAL methodology was used with $K = 10$, varying $\hat{k}_1$ and $\hat{k}_2$ such that the percentage of the training set consisting of minority samples ranges from 20% ($\hat{k}_1 = 2$, $\hat{k}_2 = 8$) to 80% ($\hat{k}_1 = 8$, $\hat{k}_2 = 2$). Each set of parameters was used to build a training set, which in turn was used to build a classifier that was evaluated on the same independent testing set $S^{te}$.

Experiment 3: Comparison of Cost Model Predictions with Empirical Observations

At each step of the AL algorithm, we estimate the number of annotations $N_t$ required to obtain balanced classes, as described in the Methods section. The goal of this experiment was to evaluate empirically whether our mathematical model can accurately predict the cost of obtaining balanced classes at each iteration, and could thus be used to predict the cost of classifier training for any problem domain. For these calculations, we set the initial class probability $p_0(r \hookrightarrow \omega_1) = 0.04$, based on the labeled data observed at the beginning of the AL process. Additionally, we set the desired sample numbers to correspond to the different class ratios listed in Experiment 2, from 20% minority class samples ($\hat{k}_1 = 2$, $\hat{k}_2 = 8$) to 80% ($\hat{k}_1 = 8$, $\hat{k}_2 = 2$). The aim of this experiment was to investigate the relationship between the cost of a specific class ratio and the performance of the resulting classifier.

Results and Discussion

Experiment 1: Comparison of CBAL performance with Alternate Training Strategies

Examples of confidence or likelihood scenes generated by $\tilde{T}_T$ are shown in Figure 4, obtained at iteration $T = 40$ (since $K = 2$, these images represent the classifier's performance using 80 total training samples). Figures 4(a) and 4(d) show images with benign regions outlined in red and cancerous regions in black. Figures 4(b) and 4(e) show the confidence scenes obtained via the CBAL training strategy, and (c) and (f) those obtained via CBRL training. High-intensity regions represent high classifier confidence that $r \hookrightarrow \omega_1$, while dark regions indicate confidence that $r \hookrightarrow \omega_2$. In both cases, the CBRL training fails to properly find the cancer regions, either returning large numbers of false positives (Figure 4(c)) or failing to fully identify the cancer area (Figure 4(f)). This difference (high false positives in one case, high false negatives in the other) is most likely due to the inability of random learning to accurately define the classes given the small training set size. Thus, given the constraints on training set size, a CBAL-trained classifier can out-perform a randomly-trained classifier.

Figure 4

Qualitative Results of the Probabilistic Boosting Tree Classifier. Qualitative results of the final PBT classifier $\tilde{T}_T$. Shown in (a), (d) are the segmented cancer regions; (b), (e) show the probability scenes obtained through the CBAL-trained classifier; and (c), (f) show the probability scenes obtained via CBRL. The intensity of a region is determined by $\tilde{T}_T(r)$.

Quantitative classification results are plotted in Figure 5 as accuracy (Figure 5(a)) and area under the ROC curve (Figure 5(b)) as a function of the number of training samples in the set $S^{tr}_t$ for $1 \le t \le 40$. In each plot, the "full" training set corresponds to the solid black line, CBAL to the red triangle line, CBRL to the black dashed line, UBAL to the green square line, and UBRL to the blue circle line. Note that the "full" line indicates the maximum achievable classifier accuracy for a given training set; thus, the closer a curve gets to the solid black line, the closer that training strategy is to optimal performance.

Figure 5

Quantitative Results of the Probabilistic Boosting Tree Classifier. Quantitative results of the classifier $\tilde{T}_t$ for $t \in \{1, 2, \cdots, T\}$. Shown are (a) accuracy and (b) AUC values for the PBT classifier, evaluated at each iteration.

The AUC values for CBAL approach those of full training with 60 samples ($t = 30$), while CBRL, UBRL, and UBAL have AUC values approximately 0.05 lower at that sample size. Accuracy for CBAL remains similar to the other methods until $t = 30$, at which point CBAL out-performs them by approximately 3%. CBRL, UBRL, and UBAL do not perform as well as CBAL for the majority of our experiments, requiring a larger number of samples to match the accuracy and AUC of CBAL.

Experiment 2: Effect of Training Set Class Ratio on Accuracy of Resulting Classifier

Figure 6 shows the effects of varying training class ratios on the resulting classifier's performance for the prostate cancer detection problem. Shown is the performance of the PBT classifier at each iteration of the AL algorithm using 20% minority samples (blue line), 40% (green line), 50% (red line), 60% (cyan line), and 80% (magenta line), for both accuracy (Figure 6(a)) and AUC (Figure 6(b)). The AUC curves are similar for all class ratios, although the training set that uses 80% minority class samples tends to perform slightly better. Thus, by over-representing the minority class, we achieve greater performance in terms of accuracy. Note that while changing the class ratio had different effects on accuracy and AUC, a similar trend was reported by Weiss and Provost [20], who found that altering the class ratio of a training set affected AUC and accuracy differently (although there was no specific trend across multiple datasets).

Figure 6

Performance of the PBT Classifier Trained on Different Target Class Ratios. Performance of the PBT classifier trained using training sets with different percentages of samples for which r ↪ ω1. Shown are the (a) accuracy and (b) AUC values for the trained classifier at each iteration, using p0(r ↪ ω1) = 0.04.

Experiment 3: Comparison of Cost Model Predictions with Empirical Observations

Figure 7(a) shows the results of the cost modeling simulations. The predicted cost, found by solving for $N_t$ in Equation 1, is plotted as a function of $t$ (solid black line) with $p_0(r \hookrightarrow \omega_1) = 0.04$, along with the empirically observed costs of CBRL (blue dotted line) and CBAL (red triangle line) with $\hat{k}_1 = \hat{k}_2 = 5$. At each $t$, the plots show how many annotations were required before class balance was achieved. The simulation predicts the number of annotations required to achieve class balance at each iteration to within approximately 10-20 annotations. Additionally, the empirically observed costs vary greatly, particularly for $t < 50$; this is because the number of annotations required to achieve class balance depends greatly on (1) the current training set, (2) the remaining samples in the unlabeled pool, and (3) the order in which eligible samples are chosen for annotation.

Figure 7

Cost of Class Balance. (a) Plot of the annotations $N_t$ required for class balance as a function of $t$; shown are CBAL (blue line), CBRL (red dashed line), and the predicted $N_t$ from Equation 1 (black line). (b) The cost of obtaining a specific class ratio as iterations increase. If a high percentage of minority class samples is desired, the cost increases.

While it may seem from Figure 6 that the best-performing strategy would be to over-sample the minority class as much as possible, we also plotted the empirical cost values $N_t$ for each of the class ratios from Experiment 2 in Figure 7(b). We find that as the percentage of the minority class increases, the cost associated with each iteration of the AL algorithm also increases. This is because, as the minority class is over-sampled, more annotations are required to find additional minority samples. While there is some increase in accuracy from over-sampling the minority class, the annotation cost increases by an order of magnitude. Thus, the optimal strategy must balance the increase in accuracy against the constraints of the overall annotation budget.

Conclusions

In this work we present a strategy for training a supervised classifier when the costs of training are high, and where the minority class problem exists. Our strategy, Class-Balanced Active Learning (CBAL), has the following characteristics: (1) Active Learning (AL) is used to select informative samples for annotation, thus ensuring that each annotation is highly likely to improve classifier performance. (2) Class ratios are specifically addressed in this training strategy to prevent the training set from being biased toward the majority class. (3) A mathematical model is used to predict the number of annotations required before the specified class balance is reached. We applied these techniques to the task of quantitatively analyzing digital prostate tissue samples for presence of cancer, where the CBAL training method yielded a classifier with accuracy and AUC values similar to those obtained with the full training set using fewer samples than the unbalanced AL, class-balanced random learning, or unbalanced random learning methods. Our mathematical cost model was able to predict the number of annotations required to build a class-balanced training set within 20 annotations, despite the large amount of variance in the empirically observed costs. This model is critical in determining, a priori, what the cost of training will be in terms of annotations, which in turn translates into the time and effort expended by the human expert in helping to build the supervised classifier. We found that by specifying class ratios for the training set that favor the minority class (i.e. over-sampling), the resulting classifier performance increased slightly; however, the cost model predicted a large increase in the cost of training, as a high percentage of minority class samples requires more annotations to build. Thus, an optimal training strategy must take into account the overall training budget and the desired accuracy.

Some of the specific findings in this work, such as the observation that over-representing the minority class yields a slightly higher classifier performance, may be specific to the dataset considered here. Additionally, the observation that the AL algorithm has a large amount of variance in the empirically-observed costs (particularly at the beginning of training) indicates that the eligible sample set is unpredictable with respect to class compositions. This behavior may not necessarily be duplicable with different datasets or AL strategies, both of which will yield eligible sample sets with different class compositions. Additionally, we do not claim that our choice of AL algorithm (QBC), our weak classification algorithm (bagged decision trees), or our evaluation classifier (PBT) will out-perform the available alternatives. However, by combining AL and class balancing, we have developed a general training strategy that should be applicable to most supervised classification problems where the dataset is expensive to obtain and which suffers from the minority class problem. These problems are particularly prevalent in medical image analysis and digital pathology, where the costs of classifier training are very high and an intelligent training strategy can help save great amounts of time and money. Training is an essential and difficult part of supervised classification, but the integration of AL and intelligent choice of class ratios, as well as the application of a general cost model, will help researchers to plan the training process more quickly and effectively. Future work will involve extensions of our framework to the multi-class case, where relationships between multiple classes with different distributions must be taken into account.