1 Introduction

It is widely acknowledged in machine learning that the performance of a learning algorithm is dependent on both its parameters and the training data. Yet, the bulk of algorithmic development has focused on adjusting model parameters without fully understanding the data that the learning algorithm is modeling. As such, algorithmic development for classification problems has largely been measured by classification accuracy, precision, or a similar metric on benchmark data sets. These metrics, however, only provide aggregate information about the learning algorithm and the task upon which it operates. They fail to offer any information about which instances are misclassified, let alone why they are misclassified. There is some speculation as to why some instances are misclassified, but, to our knowledge, no thorough investigation (such as the one presented here) has taken place.

Previous work on instance misclassification has focused mainly on isolated causes. For example, it has been observed that outliers are often misclassified and can affect the classification of other instances (Abe et al. 2006). Border points and instances that belong to a minority class have also been found to be more difficult to classify correctly (Brighton and Mellish 2002; van Hulse et al. 2007). As these studies have had a narrow focus on trying to identify and handle outliers, border points, or minority classes, they have not generally produced an agreed-upon definition of what characterizes these instances. At the data set level, previous work has presented measures to characterize the overall complexity of a data set (Ho and Basu 2002). Data set measures have been used in meta learning (Brazdil et al. 2009) as well as to understand under what circumstances a particular learning algorithm will perform well (Mansilla and Ho 2004). As with the performance metrics, the data complexity measures characterize the overall complexity of a data set but do not look at the instance level and thus cannot say anything about why certain instances are misclassified. It is our contention that identifying which instances are misclassified and understanding why they are misclassified can lead to improvements in machine learning algorithm design and application.

The misclassification of an instance depends on the learning algorithm used to model the task it belongs to and on its relationship to other instances in the training set. Hence, any notion of instance hardness, i.e., the likelihood of an instance being misclassified, must be a relative one. However, generalization beyond a single learning algorithm can be achieved by aggregating the results from multiple learning algorithms. We use this fact to propose an empirical definition of instance hardness based on the classification behavior of a set of learning algorithms that have been selected because of (1) their diversity, (2) their utility, and (3) their wide practical applicability. We then present a thorough analysis of instance hardness, and provide insight as to why hard instances are frequently misclassified. To the best of our knowledge, our research is the first to report on a systematic and extensive investigation of this issue.

We analyze instance hardness in over 190,000 instances from 64 classification tasks classified by nine learning algorithms. We find that a considerable number of instances are hard to classify correctly—17.5 % of the investigated instances are misclassified by at least half of the considered learning algorithms and 2.3 % are misclassified by all of the considered learning algorithms. Seeking to improve our understanding of why these instances are misclassified becomes a justifiable quest. To discover why these instances are hard to classify, we introduce a set of measurements (hardness measures). The results suggest that class overlap has the strongest influence on instance hardness and that there may be other factors that affect the hardness of an instance. Although we focus on hardness at the instance level, the measures can also be used at the data set level by averaging the values of the instances in the data set. Further, we incorporate instance hardness into the learning process by modifying the error function of a multilayer perceptron and by filtering instances. These methods place more emphasis on the non-overlapping instances, alleviating the effects of class overlap. We demonstrate that incorporating instance hardness into the learning process can significantly increase classification accuracy.

The remainder of the paper is organized as follows. In Sect. 2, we introduce and define instance hardness as an effective means of identifying instances that are frequently misclassified. The hardness measures are presented in Sect. 3 as a means of providing insight into why an instance is hard to classify correctly. Section 4 presents the experimental methodology. An analysis of hardness at the instance level is provided in Sect. 5, followed by Sect. 6, which demonstrates that improved accuracy can follow from integrating instance hardness into the learning process. Section 7 compares instance hardness at the data set level with previous data set complexity studies. Section 8 reviews related work and Sect. 9 concludes the paper.

2 Instance hardness

Our work posits that each instance in a data set has a hardness property that indicates the likelihood that it will be misclassified. For example, outliers and mislabeled instances are expected to have high instance hardness since a learning algorithm will have to overfit to classify them correctly. Instance hardness seeks to answer an important question: what is the probability that an instance in a particular data set will be misclassified?

As most machine learning research is focused on the data set level, one is concerned with maximizing $p(h|t)$, where $h: X \rightarrow Y$ is a hypothesis or function mapping input feature vectors $X$ to their corresponding label vectors $Y$, and $t=\{\langle x_i, y_i \rangle : x_i \in X \wedge y_i \in Y\}$ is a training set. With the assumption that the pairs in $t$ are drawn i.i.d., the notion of instance hardness is found through a decomposition of $p(h|t)$ using Bayes’ theorem:

$$\begin{aligned} p(h|t) &= \frac{p(t|h)\;p(h)}{p(t)} \\ &= \frac{\prod_{i=1}^{|t|}p(x_i, y_i|h)\;p(h)}{p(t)} \\ &= \frac{\prod_{i=1}^{|t|}p(y_i|x_i, h)\;p(x_i|h)\;p(h)}{p(t)}. \end{aligned}$$

For a training instance $\langle x_i, y_i \rangle$, the quantity $p(y_i|x_i,h)$ measures the probability that $h$ assigns the label $y_i$ to the input feature vector $x_i$. The larger $p(y_i|x_i,h)$ is, the more likely $h$ is to assign the correct label to $x_i$, and the smaller it is, the less likely $h$ is to produce the correct label for $x_i$. Hence, we obtain the following definition of instance hardness, with respect to $h$:

$$\textit {IH}_h\bigl(\langle x_i, y_i \rangle\bigr) = 1 - p(y_i|x_i,h). $$

In practice, $h$ is induced by a learning algorithm $g$ trained on $t$ with hyper-parameters $\alpha$, i.e., $h=g(t,\alpha)$. Explicitly, instance hardness equals $1-p(y_i|x_i,t,h)$, but since $y_i$ is conditionally independent of $t$ given $h$ we can use $p(y_i|x_i,h)$. Thus, the hardness of an instance is dependent on the instances in the training data and the algorithm used to produce $h$. There are many approaches that could be taken to calculate instance hardness (or equivalently $p(y_i|x_i,g(t,\alpha))$) such as an analysis of the distribution of instances in $t$ according to their class. To gain a better understanding of what causes instance hardness in general, the dependence of instance hardness on a specific hypothesis can be lessened by summing instance hardness over the set of hypotheses $\mathcal{H}$ and weighting each $h \in \mathcal{H}$ by $p(h|t)$:

$$\begin{aligned} \textit {IH}\bigl(\langle x_i, y_i \rangle\bigr) &= \sum _\mathcal{H} \bigl(1-p(y_i|x_i, h)\bigr) p(h|t) \\ &= \sum_\mathcal{H} p(h|t)-\sum _\mathcal{H} p(y_i|x_i, h)p(h|t) \\ &= 1-\sum_\mathcal{H} p(y_i|x_i, h)p(h|t). \end{aligned}$$
(1)

Practically, to sum over $\mathcal{H}$, one would have to sum over the complete set of hypotheses, or, since $h=g(t,\alpha)$, over the complete set of learning algorithms and hyper-parameters associated with each algorithm. This, of course, is not feasible. In practice, instance hardness can be estimated by restricting attention to a carefully chosen set of representative algorithms (and parameters). Also, it is important to estimate $p(h|t)$ because if all hypotheses were equally likely, then all instances would have the same instance hardness value under the no free lunch theorem (Wolpert 1996). A natural way to approximate the unknown distribution $p(h|t)$, or equivalently $p(g(t,\alpha))$, is to weight a set of representative learning algorithms, and their associated parameters, $\mathcal{L}$, a priori with a non-zero probability while treating all other learning algorithms as having zero probability. Given such a set $\mathcal{L}$ of learning algorithms, we can then approximate (1) as follows:

$$\begin{aligned} \textit {IH}_\mathcal{L}\bigl(\langle x_i, y_i \rangle \bigr) = 1 - \frac{1}{|\mathcal{L}|} \sum_{j=1}^{|\mathcal{L}|} p\bigl(y_i|x_i, g_j(t,\alpha)\bigr) \end{aligned}$$
(2)

where $p(h|t)$ is approximated as $\frac{1}{|\mathcal{L}|}$ and the distribution $p(y_i|x_i, g_j(t,\alpha))$ is estimated using the indicator function and classifier scores, as described in Sect. 4. For simplicity, we refer to $\textit{IH}_{\mathcal{L}}$ simply as IH in what follows.
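
As a minimal illustration of how (2) combines the per-algorithm scores, consider the following sketch; the NumPy representation and the score values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Rows are instances, columns are the |L| learning algorithms; each entry is
# an estimate of p(y_i | x_i, g_j(t, alpha)). The numbers are made up for
# three instances and four learners.
p_scores = np.array([
    [0.95, 0.90, 0.99, 0.85],   # an easy instance
    [0.55, 0.40, 0.60, 0.45],   # a borderline instance
    [0.05, 0.10, 0.00, 0.15],   # a hard instance (likely outlier or mislabeled)
])

instance_hardness = 1.0 - p_scores.mean(axis=1)   # Eq. (2)
print(instance_hardness)                          # [0.0775, 0.5, 0.925]
```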

In this paper, we estimate instance hardness by biasing the selection of representative learning algorithms to those that (1) have shown utility, and (2) are widely used in practice. We call such classification learning algorithms the empirically successful learning algorithms (ESLAs). To get a good representation of \(\mathcal{H}\), and hence a reasonable estimate of IH, we select a diverse set of ESLAs using unsupervised metalearning (Lee and Giraud-Carrier 2011). Unsupervised metalearning uses Classifier Output Difference (COD) (Peterson and Martinez 2005) to measure the diversity between learning algorithms. COD measures the distance between two learning algorithms as the probability that the learning algorithms make different predictions. Unsupervised metalearning then clusters the learning algorithms based on their COD scores with hierarchical agglomerative clustering. Here, we considered 20 commonly used learning algorithms with their default parameters as set in Weka (Hall et al. 2009). The resulting dendrogram is shown in Fig. 1, where the height of the line connecting two clusters corresponds to the distance (COD value) between them. A cut-point of 0.18 was chosen and a representative algorithm from each cluster was used to create \(\mathcal{L}\) as shown in Table 1.
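
A sketch of how the COD distances and the subsequent clustering could be computed is given below. The use of SciPy, the average-linkage criterion, and the helper names are assumptions for illustration; the text only specifies COD distances, hierarchical agglomerative clustering, and a cut-point of 0.18.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cod_matrix(predictions):
    """predictions maps an algorithm name to its array of predicted labels
    (over the same instances); COD(a, b) is the fraction of instances on
    which the two algorithms make different predictions."""
    names = list(predictions)
    d = np.zeros((len(names), len(names)))
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d[i, j] = d[j, i] = np.mean(predictions[names[i]] != predictions[names[j]])
    return names, d

# names, d = cod_matrix(preds)                          # predictions gathered elsewhere
# Z = linkage(squareform(d), method="average")          # agglomerative clustering
# clusters = fcluster(Z, t=0.18, criterion="distance")  # cut the dendrogram at 0.18
```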

Fig. 1 Dendrogram of the considered learning algorithms clustered using unsupervised metalearning

Table 1 Set \(\mathcal{L}\) of ESLAs used to calculate instance hardness

We recognize that instance hardness could be calculated with either more specific or broader sets of learning algorithms, and each set would yield somewhat different results. We also recognize that the set of ESLAs is constantly evolving and thus no exact solution is possible. As the set of ESLAs grows and evolves, instance hardness can follow this evolution by simply adjusting $\mathcal{L}$. The size and exact make-up of $\mathcal{L}$ are not as critical as getting a fairly representative sample of ESLAs. While more learning algorithms may give a more accurate estimate of instance hardness, we demonstrate that both efficiency and accuracy can be achieved with a relatively small and diverse set of learning algorithms.

With this approach, the instance hardness of an instance is dependent both on the learning algorithm trying to classify it and on its relationship to the other instances in the data set as demonstrated in the hypothetical two-dimensional data set shown in Fig. 2. Instances A, C, and D could be considered outliers, though they vary in how hard they are to classify correctly: instance A would almost always be misclassified while instances C and D would almost always be correctly classified. The instances inside of the dashed oval represent border points, which would have a greater degree of hardness than the non-outlier instances that lie outside the dashed oval. Obviously, some instances are harder for some learning algorithms than for others. For example, some instances (such as instance B) are harder for a linear classifier than for a non-linear classifier because a non-linear classifier is capable of producing more complex decision boundaries.

Fig. 2 Hypothetical 2-dimensional data set

3 Hardness measures

In this section, we present a set of measures that characterize various aspects of the hardness of an individual instance. Instance hardness indicates which instances are misclassified, while the hardness measures are intended to indicate why they are misclassified. Each hardness measure captures one aspect of why an instance may be misclassified (class overlap, class skew, etc.) and, thus, gives key insights into: (1) why particular instances are hard to classify, (2) how we could detect them, and (3) how improved mechanisms could be created to deal with them. In addition, a subset of the measures could be used as a less expensive alternative to estimate instance hardness, although this is not investigated in this paper.

The set of hardness measures was discovered by examining the learning mechanisms of several learning algorithms. In compiling a set of hardness measures, we chose to use those that are relatively fast to compute and are interpretable so as to provide an indication as to why an instance is misclassified.

k-Disagreeing Neighbors (kDN).

kDN measures the local overlap of an instance in the original task space in relation to its nearest neighbors. The kDN of an instance is the percentage of the k nearest neighbors (using Euclidean distance) for an instance that do not share its target class value.

$$ k\mathrm{DN}(x) = \frac{\mid\{y: y \in k\mathrm{NN}(x) \wedge t(y)\neq t(x)\} \mid}{k} $$

where kNN(x) is the set of k nearest neighbors of x and t(x) is the target class for x.
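
A sketch of kDN using a nearest-neighbor search is shown below; the use of scikit-learn, NumPy arrays for X and y, and the default k=5 are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def kdn(X, y, k=5):
    """k-Disagreeing Neighbors: the fraction of an instance's k nearest
    neighbors (Euclidean distance) whose class differs from its own."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: the query point is
    _, idx = nn.kneighbors(X)                         # returned as its own neighbor
    neighbor_labels = y[idx[:, 1:]]                   # drop the self-neighbor
    return np.mean(neighbor_labels != y[:, None], axis=1)
```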

Disjunct Size (DS).

DS measures how tightly a learning algorithm has to divide the task space to correctly classify an instance and the complexity of the decision boundary. Some learning algorithms, such as decision trees and rule-based learning algorithms, can express the learned concept as a disjunctive description. Thus, the DS of an instance is the number of instances in a disjunct divided by the number of instances covered by the largest disjunct in a data set.

$$ \mathrm{DS}(x)=\frac{\mid \mathit{disjunct}(x) \mid - 1}{\max_{y \in D} \mid \mathit{disjunct}(y) \mid - 1} $$

where the function disjunct(x) returns the disjunct that covers instance x, and D is the data set that contains instance x. The disjuncts are formed using a slightly modifiedFootnote 1 C4.5 (Quinlan 1993) decision tree, created without pruning and setting the minimum number of instances per leaf node to 1.Footnote 2
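
The sketch below approximates DS with scikit-learn's CART implementation standing in for the modified unpruned C4.5 tree; this substitution is an assumption made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def disjunct_size(X, y):
    """Approximate DS by treating each leaf of an unpruned tree as a disjunct."""
    tree = DecisionTreeClassifier(min_samples_leaf=1).fit(X, y)
    leaf_of = tree.apply(X)                          # leaf (disjunct) covering each instance
    leaves, counts = np.unique(leaf_of, return_counts=True)
    size = dict(zip(leaves, counts))
    disjunct = np.array([size[l] for l in leaf_of])  # |disjunct(x)| for each x
    # Assumes the largest disjunct covers more than one instance.
    return (disjunct - 1) / (disjunct.max() - 1)
```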

Disjunct Class Percentage (DCP).

DCP measures the overlap of an instance on a subset of the features. Using a pruned C4.5 tree, the DCP of an instance is the number of instances in a disjunct belonging to its class divided by the total number of instances in the disjunct.

$$ \mathrm{DCP}(x)=\frac{\mid\{z: z \in \mathit{disjunct}(x) \wedge t(z)=t(x)\} \mid}{\mid \mathit{disjunct}(x) \mid} $$

Tree Depth (TD).

Decision trees also provide a way to estimate the description length, or Kolmogorov complexity, of an instance. The depth of the leaf node that classifies an instance can give an intuition of the description length required for an instance. For example, an instance that requires 15 attribute splits before arriving at a leaf node is more complex than an instance that only requires 1 attribute split. Therefore, tree depth measures the depth of the leaf node for an instance in an induced C4.5 decision tree (both pruned (TD_P) and unpruned (TD_U)) as an estimate of the minimum description length for an instance.
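
A combined sketch of DCP and tree depth is given below. A CART tree with cost-complexity pruning stands in for the pruned C4.5 tree, and the pruning strength ccp_alpha is an assumed parameter; setting it to 0 yields the unpruned tree used for TD_U.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def dcp_and_tree_depth(X, y, ccp_alpha=0.01):
    """Return DCP and the leaf depth (TD) for every instance."""
    tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha).fit(X, y)
    leaf_of = tree.apply(X)                              # leaf covering each instance

    # DCP: fraction of instances in x's leaf that share x's class.
    dcp = np.array([np.mean(y[leaf_of == leaf_of[i]] == y[i]) for i in range(len(y))])

    # TD: depth of the leaf node that classifies each instance.
    t = tree.tree_
    depth = np.zeros(t.node_count, dtype=int)
    stack = [(0, 0)]                                     # (node id, depth)
    while stack:
        node, d = stack.pop()
        depth[node] = d
        if t.children_left[node] != -1:                  # internal node
            stack.append((t.children_left[node], d + 1))
            stack.append((t.children_right[node], d + 1))
    return dcp, depth[leaf_of]
```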

Class Likelihood (CL).

CL provides a global measure of overlap and the likelihood of an instance belonging to a class. The CL of an instance belonging to a certain class is defined as:

$$ \mathrm{CL}(x)= \mathrm{CL}\bigl(x,t(x)\bigr) = \prod_{i=1}^{|x|}P \bigl(x_i|t(x)\bigr) $$

where $|x|$ is the number of attributes of instance $x$ and $x_i$ is the value of instance $x$'s $i$th attribute.Footnote 3 The prior term is excluded in order to avoid bias against instances that belong to minority classes. CL assumes independence between the data attributes.
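
A sketch of CL is shown below. Modeling continuous attributes with per-class Gaussians is an illustrative assumption; nominal attributes would instead use conditional frequency estimates.

```python
import numpy as np

def class_likelihood(X, y, x_class):
    """Product of per-attribute likelihoods p(x_i | class) under an
    independence (naive-Bayes-style) assumption, for every instance in X."""
    Xc = X[y == x_class]
    mu, sigma = Xc.mean(axis=0), Xc.std(axis=0) + 1e-9   # avoid zero variance
    dens = np.exp(-0.5 * ((X - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return dens.prod(axis=1)

# CL(x) evaluates the likelihood at x's own class; CLD(x) subtracts the
# maximum likelihood over all of the other classes.
```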

Class Likelihood Difference (CLD).

CLD captures the difference in likelihoods and global overlap. It is the difference between the class likelihood of an instance and the maximum likelihood for all of the other classes.

$$ \mathrm{CLD}(x) = \mathrm{CL}(x) - \max_{y \in Y \setminus \{t(x)\}} \mathrm{CL}(x, y) $$

where $Y$ represents the set of possible labels in the data set.

Minority Value (MV).

MV measures the skewness of the class that an instance belongs to. For each instance, its MV is one minus the ratio of the number of instances sharing its target class value to the number of instances in the majority class.

$$ \mathrm{MV}(x) = 1-\frac{\mid\{z: z \in D \wedge t(z)=t(x)\} \mid}{\max_{y \in Y} \mid\{z: z \in D \wedge t(z)=y\} \mid}. $$

Class Balance (CB).

CB also measures the skewness of the class that an instance belongs to and offers an alternative to MV. If there is no class skew, then there is an equal number of instances for all classes. Hence, the CB of an instance is:

$$ \mathrm{CB}(x) = \frac{\mid\{z: z \in D \wedge t(z)=t(x)\} \mid}{\mid D \mid} - \frac{1}{\mid Y \mid}. $$

If the data set is completely balanced the class balance value will be 0.
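
Both skew measures depend only on the class counts, as the following sketch illustrates (the NumPy encoding of the labels is an assumption).

```python
import numpy as np

def minority_value_and_class_balance(y):
    """Compute MV and CB for every instance from the class counts alone."""
    classes, counts = np.unique(y, return_counts=True)
    count_of = dict(zip(classes, counts))
    n, n_classes, n_max = len(y), len(classes), counts.max()
    mv = np.array([1 - count_of[c] / n_max for c in y])          # Minority Value
    cb = np.array([count_of[c] / n - 1 / n_classes for c in y])  # Class Balance
    return mv, cb
```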

For convenience, Table 2 summarizes the hardness measures and what they measure. Although all of the hardness measures are intended to understand why an instance is hard to classify, some of the measures indicate how easy an instance is to classify (they have a negative correlation with instance hardness). For example, class likelihood (CL) measures how likely an instance belongs to a certain class. High values for CL would represent easier instances. In Table 2, the “+” and “−” symbols distinguish which hardness measures are positively and negatively correlated with instance hardness.

Table 2 List of hardness measures and what they measure. The “+” and “−” symbols distinguish which hardness measures are positively and negatively correlated with instance hardness

Class overlap and class skew are two commonly assumed and observed causes of instance hardness that are measured with the hardness measures. Mathematically, the class overlap of an instance for a binary task can be expressed as:

$$ \mathit{classOverlap}\bigl(\langle x_i, y_i \rangle\bigr) = p( \bar{y_i}|x_i, t) - p(y_i|x_i, t). $$
(3)

where \(\bar{y_{i}}\) represents an incorrect class for the input feature vector x i . The class skew of the class of an instance can be expressed as:

$$ \mathit{classSkew}\bigl(\langle x_i, y_i \rangle\bigr) = \frac{p(y_i|t)}{p(\bar{y_i}|t)}. $$
(4)

There is no known method to measure class overlap or to determine when class skew affects instance hardness. The hardness measures allow a user to estimate class overlap and class skew as well as other uncharacterized sources of hardness. Equations (3) and (4) could be extended to multi-class problems with a 1 vs. 1, or a 1 vs. all approach.

4 Experimental methodology

In this section we provide our experimental methodology. Recall that to compute the instance hardness of an instance x, we must compute the probability that x is misclassified when the learner is trained on the other points from the data set. Since this type of leave-one-out procedure is computationally prohibitive, the learning algorithms are evaluated using 5 by 10-fold cross-validation.Footnote 4 We use five repetitions to better measure the instance hardness of each instance and to protect against the dependency on the data used in each fold. We then compare the hardness measures with instance hardness.

We examine instance hardness on a large and varied set of data sets chosen with the intent of being representative of those commonly encountered in machine learning problems. We analyze the instances from 57 UCI data sets (Frank and Asuncion 2010) and 7 non-UCI data sets (Thomson and McQueen 1996; Salojärvi et al. 2005; Sayyad Shirabad and Menzies 2005; Stiglic and Kokol 2009). Table 3 shows the data sets used in this study organized according to the number of instances, number of attributes, and attribute type. The non-UCI data sets are in bold.

Table 3 Datasets used organized by number of instances, number of attributes, and attribute type

We compare calculating instance hardness using all of the learning algorithms in $\mathcal{L}$ with calculating instance hardness using a single learning algorithm. In addition, $p(y_i|x_i,g(t,\alpha))$ is estimated using two methods: (1) the indicator function (IH_ind), which establishes the frequency of an instance being misclassified, and (2) the classifier scores (IH_class). IH_ind and IH_class are calculated using 5 by 10-fold cross-validation. Generally, classification learning algorithms classify an instance into nominal classes. To produce a real-valued score, we calculate classifier scores for the nine investigated learning algorithms. Obviously, the indicator function and the classifier scores do not produce true probabilities. However, the classifier scores can provide the confidence of an inferred model for the class label of an instance. Below, we present how we calculate the classifier scores for the investigated learning algorithms.
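
The following sketch shows how IH_ind and IH_class could be estimated with repeated stratified cross-validation. The scikit-learn classifiers are stand-ins for a few of the Weka learners in $\mathcal{L}$ and are an assumption for illustration; the paper uses the nine algorithms of Table 1 with their Weka defaults and the per-algorithm scores described below.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

learners = [MLPClassifier(max_iter=500), DecisionTreeClassifier(),
            KNeighborsClassifier(n_neighbors=5), GaussianNB(),
            RandomForestClassifier()]

def estimate_instance_hardness(X, y, learners, n_splits=10, n_repeats=5, seed=0):
    """IH_ind averages the 0/1 misclassification indicator; IH_class averages
    one minus the classifier score of the true class, over 5x10-fold
    cross-validation and over the learners."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    ih_ind, ih_class = np.zeros(len(y)), np.zeros(len(y))
    n_evals = n_repeats * len(learners)          # times each instance is scored
    for train, test in cv.split(X, y):
        for learner in learners:
            model = clone(learner).fit(X[train], y[train])
            proba = model.predict_proba(X[test])
            pred = model.classes_[np.argmax(proba, axis=1)]
            true_col = np.searchsorted(model.classes_, y[test])
            ih_ind[test] += (pred != y[test]) / n_evals
            ih_class[test] += (1 - proba[np.arange(len(test)), true_col]) / n_evals
    return ih_ind, ih_class
```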

Multilayer Perceptron.:

For multiple classes, each class from a data set is represented with an output node. The classifier score is the largest value of the output nodes normalized between zero and one (the softmax (Bridle 1989)):

$$\hat{p}(y_i \mid x) = \frac{o_i(x)}{\sum_{j=1}^{|Y|} o_j(x)} $$

where $y_i$ is a class from the set of possible classes $Y$ and $o_i$ is the value from the output node corresponding to class $y_i$.

Decision Tree.:

To calculate a classifier score, an instance first follows the induced set of rules until it reaches a leaf node. The classifier score is the number of training instances reaching that leaf node that have the same class as the examined instance, divided by the total number of training instances that reach the leaf node.

5-NN.:

5-NN returns the percentage of the nearest-neighbors that agree with the class label of an instance as the classifier score.

LWL.:

LWL finds the k-nearest neighbors for an instance from the training data and weights them by their distance from the test instance. The weighted k-nearest neighbors are then used to train a base classifier. Weka uses a decision stump as the base classifier. A decision stump is a decision tree that makes a single binary split on the most informative attribute. A test instance is propagated to a leaf node. The sum of the weights of the training instances in the leaf node that have the same class value as the test instance is divided by the sum of the weights of all of the training instances in the leaf node.

Naïve Bayes.:

Returns the probability of the most probable class by multiplying the probability of the class by the probabilities of the attribute values for an instance given the class:

$$\max_{y_j \in Y} p(y_j) \prod_{i=1}^{|x|} p(x_i|y_j). $$

NNge.:

Since NNge only keeps exemplars of the training data, a class score of 1 is returned if an instance agrees with the class of the nearest exemplar, otherwise a 0 is returned.

Random Forest.:

Random forests return the class counts from the leaf nodes of each tree in the forest. The counts for each class are summed together and then normalized between 0 and 1.

RIDOR.:

RIDOR creates a set of rules, but does not keep track of the number of training instances covered by a rule. A classifier score of 1 is returned if RIDOR predicts the correct class for an instance, otherwise a 0 is returned (same as the indicator function).

RIPPER.:

RIPPER returns the percentage of training instances that are covered by a rule and share the same class as the examined instance.

To our knowledge, instance hardness is the only measurement that seeks to identify instances that are hard to classify. However, there are other methods that could be used to identify hard instances but that have not been examined for this purpose. One such method that we compare against is active learning. Active learning is a semi-supervised technique that uses a model inferred from the labeled instances to choose which unlabeled instances are the most informative to be labeled by an external oracle. The informativeness scores assigned by active learning techniques can be used as a hardness measure. This assumes that the most informative instances are those that the model is least certain about, which would include the border points. We implemented two active learning techniques: uncertainty sampling (US) (Lewis and Gale 1994) and query-by-committee (QBC) (Seung et al. 1992). For uncertainty sampling, we use margin sampling (Scheffer et al. 2001):

$$ x^* = \mathop {\mathrm {argmin}}_x p(\hat{y}_1|x)-p(\hat{y}_2|x) $$

where \(\hat{y}_{1}\) and \(\hat{y}_{2}\) are the first and second most probable class labels for the instance x. We use naïve Bayes to calculate the probability of the classes for an instance. For query-by-committee, we use a committee of five learning algorithms using query by bagging (Abe and Mamitsuka 1998). The level of disagreement is determined using vote entropy (Dagan and Engelson 1995):

$$ x^* = \mathop {\mathrm {argmax}}_x - \sum_i \frac{V(y_i)}{C} \log \frac{V(y_i)}{C} $$

where $y_i$ ranges over all possible class labels, $V(y_i)$ is the number of votes that a class label received from the committee, and $C$ is the size of the committee. We examine QBC using naïve Bayes and decision trees. Active learning requires that some labeled instances are available to the models to produce the scores for the other instances. We divide the data set in half, using one half of the instances to calculate the scores for the other half.
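
The two scoring rules can be computed directly from model outputs, as the sketch below illustrates; the array layouts and the integer label encoding are assumptions for illustration.

```python
import numpy as np

def margin_uncertainty(proba):
    """Margin sampling: p(y_hat_1 | x) - p(y_hat_2 | x) for each instance.
    proba is an (n_instances, n_classes) array of class probabilities, e.g.
    from naive Bayes trained on the labeled half of the data; smaller margins
    mean more uncertainty."""
    s = np.sort(proba, axis=1)
    return s[:, -1] - s[:, -2]

def vote_entropy(votes, n_classes):
    """Query-by-committee disagreement. votes is an (n_instances,
    committee_size) array of predicted labels encoded as 0..n_classes-1."""
    n, C = votes.shape
    ent = np.zeros(n)
    for c in range(n_classes):
        v = (votes == c).sum(axis=1) / C
        mask = v > 0
        ent[mask] -= v[mask] * np.log(v[mask])
    return ent
```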

We emphasize the extensiveness of our analysis. We examine over 190,000 instances individually. A total of 28,750 models are produced from 9 learning algorithms trained with 64 data sets using 5 by 10-fold cross-validation.Footnote 5 With this volume and diversity, our results can provide more useful insight about the extent to which hard instances exist and what contributes to instance hardness.

5 Instance-level analysis

In this section we use instance hardness to identify hard instances and the hardness measures to discover what causes an instance to be misclassified. We use instance hardness with the indicator function (IH_ind) to establish the frequency of an instance being misclassified. Figure 3 shows the cumulative percentage of instances that are misclassified a specified percentage of times by the learning algorithms in $\mathcal{L}$ (Table 1). The first pair of columns shows that all of the instances were classified correctly by zero or more of the considered learning algorithms. The second pair of columns shows the percentage of instances that were misclassified by at least one of the considered learning algorithms. Overall, 2.4 % of the instances from the UCI data sets are misclassified by all of the considered learning algorithms and 16.8 % are misclassified by at least half. For the instances from the non-UCI data sets, 1.7 % are misclassified by all of the considered learning algorithms and 22.7 % are misclassified by at least half. The trend of hardness is similar for the UCI and non-UCI data sets. For the set of instances from the UCI and non-UCI data sets, only 38.3 % of the instances are classified correctly 100 % of the time by the examined learning algorithms. These results show that a considerable number of instances are hard to classify correctly. Seeking to improve our understanding of why these instances are misclassified is the goal of the hardness measures.

Fig. 3 Percentage of instances that are misclassified by at least a percentage of the learning algorithms

We calculate the hardness measures for all of the instances regardless of their instance hardness. We first examine the relationships between the hardness measures themselves. This will provide insight into how similar the measures are to each other and detect possible overlap in what they measure (see Table 2 for the hardness measures and what they measure). Next, we examine the relationship of the hardness measures with instance hardness. Before analyzing the results, we normalize the measures by subtracting the mean and dividing by the standard deviation for each measure.
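
A sketch of this correlation analysis is shown below; the array layout and helper name are assumptions. Note that z-scoring does not change a rank correlation, but it is required by the PCA and linear-regression analyses reported later.

```python
import numpy as np
from scipy.stats import spearmanr

def correlation_analysis(measures, ih):
    """measures: (n_instances, n_measures) array of hardness-measure values;
    ih: the corresponding instance hardness values. Returns the pairwise
    Spearman matrix for the measures and each measure's correlation with IH."""
    z = (measures - measures.mean(axis=0)) / measures.std(axis=0)
    rho_measures, _ = spearmanr(z)                    # measure-vs-measure matrix
    rho_with_ih = np.array([spearmanr(z[:, j], ih).correlation
                            for j in range(z.shape[1])])
    return rho_measures, rho_with_ih
```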

We first examine the correlation between the hardness measures. Table 4 shows a pairwise comparison of the hardness measures using the Spearman correlation. Only class likelihood (CL) and class likelihood difference (CLD) are strongly correlated, with a correlation coefficient of 0.989. This suggests that, besides CL and CLD, the hardness measures measure different properties of the hardness of an instance.

Table 4 Spearman correlation matrix for the hardness measures. The magnitude of only one pair of measures is stronger than 0.95, showing that the measures measure different aspects of instance hardness

The more interesting question to consider is how instance hardness relates to the considered hardness measures. Table 5 shows the Spearman correlation coefficients relating instance hardness to the other considered hardness measures for the UCI and non-UCI data sets. The hardness measure with the strongest correlation with instance hardness (IH_ind and IH_class) is in bold. The first section of the table uses the indicator function to calculate instance hardness, the second section uses the classifier scores to calculate instance hardness, and the third section shows the results for active learning. IH_ind and IH_class use the indicator function and classifier scores respectively from all of the learning algorithms in $\mathcal{L}$ to calculate instance hardness. The following rows use a single learning algorithm to calculate instance hardness. Among the hardness measures, kDN, DCP, CL, and CLD have the strongest correlations with the methods for identifying hard instances. Using PCA on the hardness measures, kDN, CL, and CLD have the largest coefficients for the first principal component (thus accounting for more variance than the other measures). kDN, CL, and CLD measure class overlap using all of the features from the data set. The other measures (which measure overlap on a subset of the features, class skew, and the description length) are not as indicative of an instance being hard to classify. The results from the Spearman correlation coefficients and the PCA analysis suggest that, in general, class overlap is a principal contributor to instance hardness for the considered data sets, whether considering ESLAs in general (IH_ind and IH_class) or a specific learning algorithm. The effect of class overlap on instance hardness can also be seen by examining individual instances and their corresponding hardness measures. The hardness measures and instance hardness values for a sample of instances are provided in Table 6. The first instance is a clear example that exhibits class overlap and should be misclassified, as indicated by the values of the hardness measures (i.e., a high kDN value, a negative CLD, etc.).

Table 5 The Spearman correlation coefficients for the hardness measures relating to the examined methods for identifying hard instances. The correlation coefficients with the strongest correlation with IH_ ind and IH_class are in bold
Table 6 The hardness measures and instance hardness values for an example set of instances

One of the difficulties of identifying hard instances is that hardness may arise from several sources. For example, instances 2 and 3 in Table 6 have multiple possible reasons for why they are misclassified, but no hardness measure strongly indicates that they should be misclassified (i.e., the kDN values are less than 0.5, meaning that the instances agree with the majority of their neighbors). The last column “Lin” in Table 5 shows the correlation coefficients of a linear model of the hardness measures predicting instance hardness. The instance hardness and hardness measures from the UCI and non-UCI data sets for each instance were compiled and linear regression was used to predict instance hardness. Apart from US_NB and QBC_NB, a linear combination of the hardness measures results in a stronger correlation with instance hardness than any of the individual measures, suggesting that there is no one measure that sufficiently captures the hardness of an instance.

Comparing the methods for identifying hard instances, IH_class has the strongest correlation with the linear combination of the hardness measures and with kDN. IH_class also has a strong correlation with CL and CLD. Only US_NB has a stronger correlation with CL and CLD than IH_class. These strong correlations suggest that IH_class may be a good candidate for determining the hardness of an instance. The QBC methods are not as strongly correlated with any of the hardness measures as the other methods are. The active learning approaches select border points as the hardest instances, but do not indicate that the outlier instances are hard. We also observe that using classifier scores produces a stronger correlation with the hardness measures than using the indicator function to calculate instance hardness. For all of the considered learning algorithms, calculating instance hardness with the classifier scores provides a stronger or equal correlation with the hardness measures than the indicator function, suggesting that the classifier scores may provide a better indication of which instances are hard to classify. Also, for our examination of when ESLAs misclassify an instance, using an ensemble of learning algorithms to determine hardness has a stronger correlation with the hardness measures than using a single learning algorithm.

The previous results suggest that, in general, class overlap causes instance hardness. However, in making this point, we realize that all data sets have different levels and causes of hardness. Table 7 shows the correlation between IH_class and the hardness measures for the instances in each data set. The column “DSH” refers to the data set hardness and is the average IH_class value for the instances in the data set. The harder data sets have a higher DSH value. The values in bold represent the hardness measures that have the strongest correlation with IH_class for the instances in the data set. The underlined values are the hardness measures with a correlation magnitude greater than 0.75. The values in Table 7 indicate that the hardness of the majority of the data sets is strongly correlated with the hardness measures that measure class overlap. There are a few data sets that have a strong correlation between IH_class and the measures that measure class skew (MV and CB). The most notable are the post-opPatient and zoo data sets. For those data sets, in addition to having a strong correlation with MV and CB, instance hardness is also strongly correlated with other hardness measures that measure class overlap.

Table 7 The correlation of the hardness measures with IH_class for the instances in each data set. DSH is the average IH_class value of the instances in the data set. The underlined values are the hardness measures with a correlation magnitude greater than 0.75. The bold values represent the hardness measures that have the strongest correlation with IH_class for each data set

It is not surprising that class overlap is observed as a principal contributor to instance hardness since outliers and border points, which exhibit class overlap, have been observed to be more difficult to classify correctly. However, instances that belong to a minority class have also been observed to be more difficult to classify correctly. This is confirmed as the coefficients for the class imbalance measures (MV and CB) in the linear regression models are statistically significant. Also, removing MV and CB from the linear model results in a weaker correlation. To what extent does class skew affect instance hardness? One of the core problems seen with class skew is that of data ambiguity, when multiple instances have the same feature values but different classes. In these cases, the instances that belong to the minority class will be misclassified. There are only 204 such instances, about 0.1 % of all of the instances used in this study. We removed all of the ambiguous instances and then divided the instances into those that have a MV value of 0 (they belong to the majority class) and those that have a value greater than 0. This considers any instance that does not belong to the majority class as belonging to a minority class. There are 97,469 instances that belong to the majority class and 92,669 instances that do not. We observe that instances that belong to a minority class are harder to classify correctly than those that do not. The average IH_class value for the instances that belong to a majority class is 0.16 while the average instance hardness value for the instances not belonging to the majority class is 0.41. Table 8 compares the hardness measures for the instances that belong to a minority class and those that belong to the majority class. The last column (easy) gives the value for the hardness measures for the easiest instances (the instances that are always correctly classified). Not including MV and CB (which are biased since all of the instances that belong to the non-majority classes are separated from the majority class instances), all of the hardness measures except for pruned tree depth (TD_P) indicate that the instances that belong to a minority class are harder to classify correctly as well. Thus, we observe that class skew exacerbates the effects of the underlying causes for instance hardness. This coincides with Batista’s conclusion that class skew alone does not hinder learning algorithm performance, but rather class skew magnifies the hardness already present in the instances (Batista et al. 2004). For example, Table 9 gives the hardness measure values for two instances from the chess data set. The hardness measures are similar for each measure except the first instance (id 22037) belongs to the majority class while instance 26549 does not. The difference in the IH_ind value for the instances is considerable. The difference in IH_class values does not vary considerably since many of the class scores are similar to the hardness measures. This supports the fact that class skew exacerbates the effects of class overlap and also shows that IH_ind may be better able to incorporate the effects of class skew than IH_class. Given that class skew exacerbates the effects of class overlap on instance hardness, the expected instance hardness for an instance is related to the class overlap (3) and class skew (4) of the instance:

$$\mathbb{E}\bigl[\textit {IH}\bigl(\langle x_i, y_i \rangle\bigr) \bigr] \sim f\bigl(\mathit{classOverlap}\bigl(\langle x_i, y_i \rangle\bigr),\; \mathit{classSkew}\bigl(\langle x_i, y_i \rangle \bigr)\bigr). $$

The exact form of f is unknown at this stage. Additionally, other factors not discussed here may affect the hardness of an instance. Discovering the relationship between class overlap, class skew, and instance hardness, as well as identifying other sources of hardness, is left for future work.

Table 8 Various statistics for the hardness measures for instances that belong to the majority class and those that do not. For the instances that belong to the minority class, the values for the measures indicate higher levels of class overlap. The column “easy” gives the expected value for the hardness measure if an instance has low instance hardness
Table 9 The hardness measures and instance hardness values for an example set of instances from the chess data set

6 Integrating instance hardness into the learning process

In this section we examine how to exploit instance hardness during the learning process to alleviate the effects of class overlap and instance hardness. Incorporating instance hardness into the learning process provides significant improvements in accuracy. Note that the improvement requires computing instance hardness for each instance. In the experiments, we opt to use IH_class instead of IH_ind as they are strongly correlated and IH_class produces slightly better results. We also ran the experiments calculating instance hardness with the same single learning algorithm that is inferring the model. This provides the opportunity to compare whether it is more appropriate to use a specific measure of instance hardness rather than a more general one. In addition, we ran the experiments using the active learning hardness measures. The active learning techniques are not designed to identify hard instances and using them as a hardness measure often resulted in poor results. In order to avoid a deluge of data, we do not show their results.

6.1 Informative error

Informative error (IE) is based on the premise of knowing if an instance should be misclassified. We implement IE in multilayer perceptrons (MLPs) trained with backpropagation using instance hardness computed using (1) all of the learning algorithms in $\mathcal{L}$ (IE_ESLA) and (2) only using a MLP (IE_MLP). We use instance hardness to estimate if an instance should be misclassified. A common approach for classification problems with MLPs is to create one output node for every class value. If the data set has a class with three possible values, then three output nodes are created. The target output value for each node is either 1 if the instance belongs to that class or 0 if it does not. The error function of target-output for each of the k output nodes can then be formulated as:

$$\mathit{error}(x) = \begin{cases} 1 - o_k & \text{if }t(x)=k_{class}\\ 0 - o_k & \text{otherwise} \end{cases} $$

where $o_k$ is the output value for node $k$, $t(x)$ is the target value for instance $x$, and $k_{class}$ is the class represented by node $k$.

We modify the error function such that it subtracts the instance hardness value of an instance from the target value of the corresponding output node only:

$$\mathit{error}(x) = \begin{cases} 1 - \textit {IH}(x,t(x)) - o_k & \text{if }t(x)=k_{class}\\ 0 - o_k & \text{otherwise}. \end{cases} $$

The instance hardness value is only subtracted from the output node that corresponds with the target class value of an instance. If the instance hardness value were added to the output value for the output nodes that do not correspond with the target class value of an instance then this could potentially confuse the network as an instance is incorrect for one class value yet correct for all of the others. For example, if an instance has an instance hardness value of 1, then the errors would essentially tell the network that the target value is wrong whereas all of the other classes are correct. Also, if an instance had an instance hardness value of 0.5, all output nodes would have the same target value and no information is gained. IE places more emphasis on the non-overlapping instances by reducing the weight of the error from instances with high instance hardness values.
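
A sketch of the modified per-node error for a single instance is given below; the NumPy representation and the function name are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def informative_error(outputs, target_node, ih):
    """Per-output-node error under informative error (IE). outputs holds the
    activations of the k output nodes, target_node is the index of the node
    for the instance's class, and ih is the instance's hardness value. Only
    the target node's desired value is reduced by IH, as described above."""
    desired = np.zeros_like(outputs, dtype=float)
    desired[target_node] = 1.0 - ih
    return desired - outputs

# For a hard instance (ih = 0.9) the target node's desired value drops to 0.1,
# so backpropagation no longer pushes the network to fit it, shifting emphasis
# toward the easier, non-overlapping instances.
err = informative_error(np.array([0.2, 0.7, 0.1]), target_node=1, ih=0.9)
```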

Table 10 shows the results of using IE to train a MLP on 52 data sets (the data sets that did not have instance hardness greater than 0.5 were not used) compared against two filtering techniques (repeated edited nearest neighbor (RENN) (Tomek 1976) and fast local kernel noise reduction (FaLKNR) (Segata et al. 2009)) and two boosting methods (AdaBoost (Freund and Schapire 1996), and MultiBoost (Webb 2000), using a MLP as the base algorithm). RENN repeatedly removes the instances that are misclassified by a 3-nearest neighbor classifier and has produced good results. FaLKNR removes any instances that disagree with the predicted class from a support vector machine trained on the neighborhood of the selected instance. The average accuracy, the number of times that the accuracy using IE_MLP is better, the same, or worse than the other methods, and the p-value calculated using the Wilcoxon signed-rank test are provided in the bottom three rows as a summary of the table. There are 14 data sets on which IE_MLP increases accuracy by more than 5 %, indicated by an asterisk. On the lung cancer data set, accuracy increases by 21.9 % and is 3 percentage points higher than the next best algorithm (FaLKNR). On the labor data set, IE_MLP increases accuracy by 10.5 % and is 5 percentage points greater than the next best algorithm. On average, IE_MLP increases accuracy by more than 3 % over the original and by 2 % over RENN. The increases in accuracy are statistically significant. In this case, IE_MLP is significantly better than IE_ESLA. Thus, in this case, using a specific bias from a learning algorithm is preferred. This is examined in more detail in the next section.

Table 10 Pairwise comparison of informative error with standard backpropagation, RENN, FaLKNR, AdaBoost, and MultiBoost. An asterisk indicates data sets on which IE_MLP improves accuracy more than 5 %

Although IE is described in the context of MLPs, it can also be applied to other learning algorithms that are incrementally updated based on an error value such as the class of non-closed form regression models (i.e., logistic regression and isotonic regression). Similar to informative error, instance hardness could be used to weight the instances prior to training a model. This weight could then be used in a number of learning algorithms such as nearest-neighbor or naïve Bayes algorithms.

6.2 Filtering the data set

A simple idea to handle hard instances and reduce overlap is to filter or remove them from a data set prior to training. The idea of filtering is to remove the instances that are suspected outliers or noise and thus increase class separation (Smith and Martinez 2011). We use the IH_class values to determine which instances to filter from the data sets. We compare the results to those by RENN and the majority and consensus filters proposed by Brodley and Friedl (1999). The majority and consensus filters remove an instance if it is misclassified respectively by the majority of, or all, three learning algorithms (C4.5, IB1, and thermal linear machine (Brodley and Utgoff 1995)). When using the instance hardness values, we use the classifier scores from the five folds of the nine learning algorithms as our ensemble and remove any instances with an IH_class value greater than a set threshold. We set the threshold at 0.5 (IH_0.5), 0.7 (IH_0.7) and 0.9 (IH_0.9). We also compare using the same learning algorithm to filter the instances as well as to infer a model of the data (IH_LA). For example, IH_LA for MLP uses a MLP to identify which instances to filter prior to training a MLP. Each filtering technique was used on a set of 52 data sets evaluated using five by ten-fold cross-validation on the nine learning algorithms. Testing is done on all of the data, including the instances that were removed.
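
A minimal sketch of threshold-based filtering is shown below, assuming the IH_class values have already been computed (for example, with the cross-validation sketch in Sect. 4); the helper name is hypothetical.

```python
from sklearn.base import clone

def train_with_ih_filter(X, y, ih_class, learner, threshold=0.7):
    """Drop instances whose IH_class exceeds the threshold (0.5, 0.7, or 0.9
    in the text) and train the learner on the remainder. Evaluation is still
    performed on all of the data, including the removed instances."""
    keep = ih_class <= threshold
    return clone(learner).fit(X[keep], y[keep])
```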

For the nine learning algorithms, Table 11 shows the average accuracy, pairwise comparison of the accuracies, and p-values from the Wilcoxon signed-rank statistical significance test comparing the filtering method to the original accuracy. Only the averages are displayed to avoid the overload of tables and much of the aggregate information is present in the pairwise comparison of the algorithms (number of times that a learning algorithm increases-stays the same-decreases the accuracy) and the p-value from the Wilcoxon signed-rank significance test. Filtering significantly increases classification accuracy for most of the filtering techniques and learning algorithms. IH_0.7 achieves the greatest increase in accuracy, being slightly better than the majority filter. One of the advantages of using instance hardness is that various thresholds can be used to filter the instances. However, we note that there is not one filtering approach that is best for all learning algorithms and data sets (as indicated by the counts). For filtering, using the same learning algorithm to infer the model and to determine which instances to filter is only better than using all of the learning algorithms in $\mathcal{L}$ for C4.5 and 5-NN.

Table 11 The average accuracy values for the nine learning algorithms comparing filtering techniques against not filtering the data (Orig). “count” gives the number of times that a filtering algorithm improves, maintains, or reduces classification accuracy. On average, filtering the data sets significantly improves the classification accuracy. The p-values in bold represent the cases where filtering significantly increases the classification accuracy over not filtering. For each learning algorithm, the accuracy for the filtering technique that produces the highest accuracy is in bold

To examine the variability of each data set and learning algorithm combination, we examine an adaptive filtering approach that generates a set of learning algorithms to calculate instance hardness for a specific data set/learning algorithm combination. We call the set of learning algorithms used to calculate instance hardness a filter set. The adaptive approach discovers the filter set through a greedy search of $\mathcal{L}$. The adaptive approach iteratively adds a learning algorithm from $\mathcal{L}$ to a filter set by selecting the learning algorithm that produces the highest classification accuracy when added to the filter set, as shown in Algorithm 1. A constant threshold value is set to filter instances in runLA(F) for all iterations. We examine thresholds of 0.5, 0.7, and 0.9. The baseline accuracy for the greedy approach is the accuracy of the learning algorithm without filtering. The search stops once adding one of the remaining learning algorithms to the filter set does not increase accuracy. The running time for the adaptive approach is $O(N^2)$ where $N$ is the number of learning algorithms to search over. The significant improvement in accuracy makes the increase in computational time reasonable in most cases.

Algorithm 1 Adaptively constructing a filter set
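
The greedy construction can be sketched as follows; evaluate_with_filter is a hypothetical helper that filters the training data using the instance hardness values produced by the current filter set (at the fixed threshold) and returns the cross-validated accuracy of the target learning algorithm.

```python
def adaptive_filter_set(candidate_learners, evaluate_with_filter, baseline_acc):
    """Greedy search over L for a filter set (a sketch of Algorithm 1)."""
    filter_set, best_acc = [], baseline_acc
    remaining = list(candidate_learners)
    improved = True
    while improved and remaining:
        improved = False
        # Try adding each remaining learner and keep the best-scoring addition.
        scored = [(evaluate_with_filter(filter_set + [g]), g) for g in remaining]
        acc, best_g = max(scored, key=lambda s: s[0])
        if acc > best_acc:
            best_acc, improved = acc, True
            filter_set.append(best_g)
            remaining.remove(best_g)
    return filter_set, best_acc
```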

Table 12 gives the results for adaptively filtering for a specific data set/learning algorithm combination. The adaptive approach significantly increases the classification accuracy over IH_0.7 for all of the learning algorithms and thresholds. The accuracy increases for at least 85 % of the data sets regardless of which learning algorithm is being used for classification. A_0.9 achieves the highest classification accuracy for the adaptive approach. Interestingly, there is no one particular learning algorithm that is always included in a filter set for a particular learning algorithm. The frequency for how often a learning algorithm is included in a filter set for each learning algorithm and an aggregate count (overall) is given in Table 13. MLPs and random forests are included in more than 50 % of the constructed filter sets while RIPPER and NB are included in less than 2 % of the filter sets. The remaining learning algorithms are used in a filter set between 13 % and 23 % of the time. It is interesting that some of the learning algorithms include a particular learning algorithm in the filter set for most of the data sets while other learning algorithms never or rarely include it. For example, MLP is always included in the filter set for NB, yet never for 5-NN. Also, only the MLP and 5-NN learning algorithms frequently include themselves in the filter set. Thus, hardness for a learning algorithm is often better detected using a different learning algorithm.

Table 12 The average accuracy values for nine learning algorithms comparing the adaptive filtering approach against IH 0.7. “count” gives the number of times that a filtering algorithm improves, maintains, or reduces classification accuracy. The adaptive filtering approach significantly increases classification accuracy
Table 13 The frequency of selecting a learning algorithm when adaptively constructing a filter set. Each row gives the percentage of cases that the learning algorithm was included in the filter set for the learning algorithm in the column

7 Data set-level analysis

Our work has focused on hardness at the instance-level. However, prior work has been done that examines what causes hardness at the data set level. The hardness measures and instance hardness values can be averaged together to measure hardness at the data set level. The averaged hardness measures can provide insight into a data set’s characteristics and possibly provide direction into which methods are the most appropriate for the data set. Previous studies have primarily looked at only binary classification problems. We compare instance hardness at the data set level with other data set complexity measures. We use a set of complexity measures by Ho and Basu (2002) (implemented with DCoL (Orriols-Puig et al. 2009)). In this study we do not limit our examination to two-class problems. Hence, we do not use the measurements from Ho and Basu that are only for two-class problems. Ho and Basu’s complexity measures that were used are shown in Table 14. Some of the original measures were adapted to handle multi-class problems (Orriols-Puig et al. 2009).

Table 14 List of complexity measures from Ho and Basu (2002)

We first compare our measures to those used by Ho and Basu (2002). The matrix of Spearman correlation coefficients comparing the hardness measures against those measures used by Ho and Basu are shown in Table 15. The measures were normalized by subtracting the mean and dividing by the standard deviation. The values in bold represent correlations with a magnitude greater than 0.75. Only N1 and N3 are strongly correlated with kDN, CL, and CLD. N1 is the percentage of instances with at least one nearest neighbor of a different class. N3 is the leave-one-out error of the one-nearest neighbor classifier. Both N1 and N3 are similar and can be categorized as measuring class separability. N1, N3, kDN, CL, and CLD measure class overlap using all of the features in the data set.

Table 15 Spearman correlation matrix comparing the hardness measures against the complexity measures from Ho and Basu. The strong correlation (bolded values) indicate that there is some overlap between our measures and those by Ho and Basu

We examined each hardness measure and complexity measure individually to determine how well it predicts data set hardness (the average instance hardness of the instances in the data set). The Spearman correlation coefficient for the hardness measures and the measures from Ho and Basu with data set hardness are shown in Table 16. kDN, CL, CLD, N1, and N3 all have a correlation coefficient greater than 0.8. Recall that kDN, CL, CLD are strongly correlated with N1 and N3. Despite diversity in the measures, only these few are strongly correlated with data set hardness and they measure class overlap.

Table 16 The Spearman correlation coefficients for each hardness measure and Ho and Basu’s complexity measures relating to data set hardness. The measures that measure class overlap have a strong correlation with data set hardness

We also apply linear regression to evaluate data set hardness as a combination of the hardness measures and the measures from Ho and Basu. The correlation coefficients are shown in the column “Lin” in Table 16. For the linear model of the hardness measures, only kDN is statistically significant. For Ho and Basu’s complexity measures, only N1 is statistically significant. The correlations of data set hardness with the linear models of the hardness measures and the measures from Ho and Basu are weaker than the correlation of data set hardness with an individual measure. When using both sets of measures, the resulting correlation coefficient is 0.896 with none of the measures being statistically significant. The linear model also has a weaker correlation coefficient than only using kDN.

Based on correlation from a linear regression model, our aggregate hardness measures are competitive with those from Ho and Basu. When the hardness measures are used in combination with those from Ho and Basu, a slightly stronger correlation is achieved. This is somewhat expected as there are many underlying and misunderstood factors that affect complexity. By measuring the complexity from many different angles, more perspective can be found.

The averaged hardness measures at the data set level provide an indication of the source of hardness and could further indicate which learning algorithms and/or methods for integrating instance hardness into the learning process are the most appropriate for a particular data set. A cursory examination of the correlation of the hardness measures with instance hardness at the data set level (Table 16) does not reveal an obvious connection; a more in-depth analysis is left for future work.

8 Related work

There are a number of methods and approaches that can be used to identify instances that are hard to classify correctly. In this section we review previous work for identifying hard instances. Fundamentally, instances that are hard to classify correctly are those for which a learning algorithm has a low probability of predicting the correct class label after having been trained on a training set. To compare the related work with instance hardness, we refer to the hypothetical data set in Fig. 2, which, for convenience, we reproduce in Fig. 4. We also compare instance hardness (IH_ind and IH_class) with the related methods in Table 17 on a subset of the examined instances. The columns under “IH_h – Classifier Scores” are instance hardness values calculated using the classifier score for a specific learning algorithm. Table 17 is divided into three sections: the first contains instances with high instance hardness (IH_ind ∼ 1), the second contains instances with low instance hardness (IH_ind ∼ 0), and the third contains instances with instance hardness around 0.5. We refer to Table 17 throughout this section.

Fig. 4 Hypothetical 2-dimensional data set

Table 17 Comparison summary of the methods that identify hard instances. The first section of the table shows examples of instances with high instance hardness values (IH = 1), the second section shows examples that are easy but whose LoOP value is high (IH ∼ 0 and LOP ∼ 1), and the third section shows instances with medium instance hardness (IH ∼ 0.5). The active learning values are uncertainty scores, where higher values represent more uncertainty. The outlier detection values are “yes” if the instance is an outlier and “no” if it is not; for LOP, higher values indicate a higher likelihood of being an outlier. The values in bold are unexpected given the instance hardness value. For example, an instance with an instance hardness value of 1 is expected to be considered an outlier, whereas it is unexpected for an instance with a low instance hardness value (∼0) to be categorized as an outlier

Machine learning research has long observed that data sets are often noisy and contain outliers, and that noisy instances and outliers are harder to classify correctly. Although we do not explicitly search for outliers, outliers and noisy instances constitute a subset of the hard instances. Much work has been devoted to identifying outliers and noise. Discovering outliers is important in anomaly detection, where an outlier may represent an important instance; for example, an outlier in a database of credit card transactions may represent a fraudulent transaction. Anomaly detection is generally unsupervised and considers the data set as a whole. One of the difficulties with outlier detection is that there is no agreed-upon definition of what constitutes an outlier or noise. Thus, a variety of outlier detection methods exist, such as statistical methods (Barnett and Lewis 1978), distance-based methods (Knorr and Ng 1999), and density-based methods (Breunig et al. 2000). Anomaly detection methods identify anomalous instances as those that lie outside the group(s) formed by the majority of the other instances in the data set. In the hypothetical two-dimensional data set shown in Fig. 4, instances C and D would be identified as anomalous, but not instances A and B.

Most anomaly detection methods are unsupervised and do not produce a continuous output. One anomaly detection method that does output continuous values is the local outlier factor (LOF) (Breunig et al. 2000), which assigns each instance a degree of “outlierness” rather than a binary label. LOF seeks to overcome a problem facing most anomaly detection methods, namely that the sub-spaces within many data sets have different densities, by considering the relative density around an instance rather than the global density of the data set. Instances with a LOF value of 1 or less are not outliers. The values produced by LOF are somewhat hard to interpret, however, as there is no upper bound or fixed threshold that indicates when a value represents an outlier: in one data set a LOF value of 1.1 may represent an outlier, while in another data set a value of 2 may be required.
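As a brief, hedged example, the scikit-learn implementation of LOF can be used to obtain such continuous outlierness scores; the data below is a synthetic stand-in for a real data set:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# A dense cluster plus a few distant points standing in for a real data set.
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(8, 1, size=(5, 2))])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 flags the detected outliers
lof_scores = -lof.negative_outlier_factor_   # larger values = more outlier-like

print(lof_scores[:5], lof_scores[-5:])
```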

There are a number of approaches that aim to overcome the uninterpretability of LOF (Kriegel et al. 2011). One such approach is the local outlier probability (LoOP) (Kriegel et al. 2009). LoOP builds on LOF with statistical methods to produce a probability that an instance is a local density outlier, which allows the values to be compared across data sets. LoOP makes two major assumptions: (1) that the k-nearest neighbors of an instance p are centered around p, and (2) that the distances to these neighbors behave like the positive leg of a normal distribution. Despite being more interpretable, LoOP often fails to identify hard instances as outliers and identifies some easy instances as outliers, as shown in Table 17 (LOP); low values indicate that an instance is unlikely to be an outlier according to LoOP.

Filtering, or removing instances prior to training, is another approach that seeks to identify mislabeled and/or noisy instances with the intent of improving the inferred model of the data. Unlike anomaly detection, filtering is often supervised, removing instances that are misclassified by a learning algorithm. In Fig. 4, filtering would likely identify instances A, B, and some of the border points as hard to classify. A popular filtering technique is repeated-edited nearest-neighbor (RENN) (Tomek 1976), which repeatedly removes the instances that are misclassified by a 3-nearest neighbor classifier and has produced good results. Brodley and Friedl (1999) expanded this idea by removing the instances that were misclassified by all or the majority of the learning algorithms in an ensemble of three learning algorithms. These methods take the class label into account but do not output a continuous value. As shown in Table 17, these methods (maj, con, and REN) rarely identify easy instances as outliers, but they may also fail to identify hard instances as outliers.
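The following is a simplified sketch of a RENN-style filter and is not the exact procedure of Tomek (1976); the choice of k = 3 and the use of the Iris data are illustrative only:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

def renn_filter(X, y, k=3):
    """RENN-style filter: repeatedly drop instances whose class disagrees with
    the majority vote of their k nearest neighbors (excluding themselves),
    until a pass removes nothing."""
    keep = np.arange(len(X))
    while True:
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X[keep])
        _, idx = nn.kneighbors(X[keep])        # idx[:, 0] is the instance itself
        votes = y[keep][idx[:, 1:]]            # class labels of the k neighbors
        majority = np.array([np.bincount(v).argmax() for v in votes])
        misclassified = majority != y[keep]
        if not misclassified.any():
            return keep                        # indices of the retained instances
        keep = keep[~misclassified]

X, y = load_iris(return_X_y=True)
kept = renn_filter(X, y)
print(f"kept {len(kept)} of {len(X)} instances")
```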

Some learning algorithms, such as naïve Bayes, Bayes nets, and Gaussian processes, produce probabilistic output. The output from probabilistic algorithms could naturally answer the question of which instances are hard to classify correctly. However, these algorithms often make assumptions that do not hold for the data distribution (e.g., that the attributes are independent or that the data is normally distributed). Many other machine learning algorithms do not produce a probabilistic output; in those cases, probabilities can be approximated by normalizing the output values or by using some heuristic to produce pseudo-probabilistic values. The posterior classifier probabilities from the learning algorithms in \(\mathcal{L}\) for a subset of the instances are provided in Table 17. The posterior classifier probabilities provide a good approximation of instance hardness but, as discovered in Sect. 5, they have a lower correlation with the hardness measures. This is apparent when examining instances that have an instance hardness value around 0.5 (the last four instances in Table 17).

Probabilistic outputs from a classifier are important when the outputs are combined with other sources of information for making decisions, such as the outputs from other classifiers. Probabilistic outputs are often not well calibrated, such as the output from naïve Bayes (Domingos and Pazzani 1996). As such, a number of methods have been proposed to calibrate classifier scores (Bennett 2000; Platt 2000; Zadrozny and Elkan 2001, 2002). For binary classification problems, calibration is usually done by training the learning algorithm to obtain the classifier scores s(x) and then learning a mapping function from these scores to probability estimates \(\hat{P}(y|x)\). Platt (2000) suggests finding the parameters A and B of a sigmoid function of the form \(\hat{P}(y|x) = \frac{1}{1+e^{As(x) + B}}\) that maps the classifier scores s(x) to probability estimates by maximizing the log-likelihood of the data. Multi-class classification problems are broken down into binary classification problems, such as 1 vs 1 or 1 vs all: 1 vs 1 creates a classifier for each pair of classes, while 1 vs all creates a classifier that discriminates between the instances of a particular class and all instances of the other classes. The calibrated probabilities from the binary classification problems are then recombined. Calibrated classifier scores are supervised and continuous, and could therefore be used to identify hard instances. In Fig. 4, instances A, B, and the border points would be identified as being hard to classify.
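As an illustrative sketch, Platt-style sigmoid calibration is available in scikit-learn via CalibratedClassifierCV; the synthetic data set and the linear SVM base classifier below are placeholders, not the setup used in this paper:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A linear SVM outputs uncalibrated decision scores s(x); method="sigmoid" fits
# Platt's sigmoid 1 / (1 + exp(A*s(x) + B)) to map them to probabilities.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)

probs = calibrated.predict_proba(X_te)   # calibrated P(y|x) estimates
print(probs[:3])
```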

Active learning (Settles 2010) seeks to find the most informative instances in a data set. It assumes that there is a small set of labeled instances, an abundance of unlabeled data, and that labeling data is expensive, so the most informative instances should be labeled first. In active learning, a learning algorithm chooses which instances to use for training by assigning unlabeled instances a degree of how informative they would be to the learning algorithm, according to some criterion. This informativeness measure could be used as a means of identifying hard instances. For example, uncertainty sampling (Lewis and Gale 1994) selects the unlabeled instance x whose labeling the learning algorithm is least certain about:

$$x^* = \mathop{\mathrm{argmax}}_x \bigl(1 - p(\hat{y}|x)\bigr) $$

where \(\hat{y}\) is the class label with the highest posterior probability for the learning algorithm. Other methods, such as query-by-committee (Seung et al. 1992; Freund et al. 1992) and the support vector machine method of Tong and Koller (2001), seek to reduce the size of the version space (Mitchell 1982). Query-by-committee uses a committee of models trained on the labeled instances and selects the instances about which the committee disagrees most. Thus, active learning identifies the border points as being hard to classify. Table 17 shows that the active learning scores vary widely for the same instances. For instances with an instance hardness value near 1, the uncertainty value would be expected to be close to either 1 or 0: since active learning does not use the class label, a hard instance would either appear to belong to the wrong class (i.e., appear mislabeled) or have high uncertainty. For the easy instances, a low uncertainty value would be expected. Overall, the active learning scores do not have a high correlation with instance hardness.
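A minimal sketch of uncertainty sampling, assuming a small labeled pool and a probabilistic base learner (both hypothetical), is shown below:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Hypothetical split: a small labeled pool and a large unlabeled pool.
labeled, unlabeled = np.arange(20), np.arange(20, 300)

clf = LogisticRegression().fit(X[labeled], y[labeled])

# Uncertainty sampling: score each unlabeled instance by 1 - p(y_hat | x)
# and query the one the current model is least certain about.
posteriors = clf.predict_proba(X[unlabeled])
uncertainty = 1.0 - posteriors.max(axis=1)
query_idx = unlabeled[np.argmax(uncertainty)]
print(query_idx, uncertainty.max())
```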

Clearly, none of the previous work was designed to better understand why instances are misclassified, as instance hardness is. For example, filtering aims at removing mislabeled instances from a data set, and classifier scores are intended for applications where a confidence on a prediction is required. By incorporating ideas from this previous work, instance hardness provides a framework for identifying which instances are hard to classify and for understanding why they are hard to classify.

9 Conclusions and future work

In this paper we examined why instances are misclassified and how to characterize them. We presented instance hardness to empirically discover which instances are hard to classify correctly, along with a set of hardness measures to characterize why some instances are difficult to classify correctly. Using a broad set of data sets and learning algorithms, we examined hardness at the instance level and found that class overlap is a principal contributor to instance hardness and data set hardness. The hardness measures kDN, CL, and CLD capture class overlap and are strongly correlated with instance hardness. Class skew has previously been observed to increase instance hardness; we found that class skew alone does not cause instances to be misclassified, but rather exacerbates the other characteristics, such as class overlap, that cause an instance to be misclassified, as demonstrated when the instances and their hardness values were segregated according to their MV value (Table 8). Continued study of instance hardness should lead to additional insights regarding data complexity.

Being able to measure instance hardness and complexity has important ramifications for future machine learning and meta-learning research. We briefly examined integrating instance hardness into the learning process by filtering the data sets prior to training and by using informative error. In both cases, integrating into the learning process the knowledge of which instances are hard to classify correctly resulted in a significant increase in classification accuracy, showing that such knowledge can increase generalization accuracy. As a specific example, informative error significantly increased classification accuracy over various filtering and boosting approaches. Future work includes understanding the circumstances and situations for which each technique is most appropriate, since, as demonstrated with the adaptive filter sets, no one technique for identifying hard instances is best for all data sets.

Calculating instance hardness and the hardness measures can be computationally expensive. Although computing instance hardness requires training N learning algorithms, the values need to be computed only once and can then be used in a wide variety of applications, as was shown in Sect. 6; the hardness measures likewise need to be calculated only once. For many data sets, this additional computational cost is acceptable. For massive data sets, though, it can be a significant concern; in this case, the set of learning algorithms used to calculate instance hardness and the hardness measures can be replaced with algorithms that better handle massive data sets. We also showed that there is no specific set of learning algorithms for calculating instance hardness that is best for all data sets and learning algorithms, and that using the same learning algorithm to calculate instance hardness and to infer the model of the data does not always result in the most accurate model.

Being able to better analyze data would allow a practitioner to select an algorithm more suited to their purposes. The evaluation of a learning algorithm could also be enhanced by knowing which instances are hard and, with high likelihood, will be misclassified; this could, for example, lead to better stopping criteria. We expect that the exploration of instance hardness and data complexity will lead to more in-depth investigations and to applications in new areas of machine learning and data mining. Instance hardness and the hardness measures could be used in combination with techniques from active learning to determine a subset of the most important instances in a data set. Future work also includes meta-learning; for example, the hardness measures could be used to estimate the performance of a learning algorithm on a data set.