
1 Introduction

In recent years, deep learning models have contributed to significant advances in supervised and unsupervised learning tasks on complex data, such as speech and imagery. Convolutional neural networks (CNNs), for example, have achieved state-of-the-art performances on various image classification benchmarks [8, 27]. One of the key features of the CNN is representation learning, i.e., the hidden layers of the convolutional neural network generate an expressive, non-linear mapping of complex data in a deep feature space [12, 19]. Such features are shown to be useful for other classification models or similar classification tasks [2, 13], enabling further means of enhancement such as multi-task learning [7].

The applications of deep learning, meanwhile, encounter many practical challenges, such as the cost of preparing a sufficient amount of labeled samples. The problem of class imbalance arises when the number of samples differs significantly between two or more classes. Such imbalance can affect traditional classification models [1, 11, 17] as well as deep learning models [14, 15, 28], commonly resulting in poor performance on the minority classes. For deep learning models, its influence on representation learning can deteriorate the performance on the majority classes as well.

There is a rich literature on countermeasures against class imbalance for traditional classification models [4]. A popular and intuitive approach among them is re-sampling, which directly adjusts the sample sizes of the respective classes. For example, SMOTE [3] generates synthetic samples, which are interpolations of in-class neighboring samples, to augment the minority class. Its underlying assumption is that the interpolations do not deviate from the original class distribution, as in a locally linear feature space. Similar approaches for deep learning models, e.g., re-sampling, cost-sensitive learning, and their combinations [14, 28], have also been explored, but in a more limited number of studies. Overall, they introduce complex architectures or sampling schemes, which require a significant amount of data- and model-specific configuration and lower the applicability to new problems. It should also be noted that generating synthetic samples of structured, complex data, in order to conduct synthetic over-sampling on such data, is not straightforward.

In this work, we extend the synthetic over-sampling method to the convolutional neural network (CNN) using its deep representation. To our knowledge, over-sampling in the acquired, deep feature space has not been explored prior to this work. Effectively integrating synthetic instances, which are not direct mappings of any raw input sample, into a supervised learning framework is a non-trivial challenge. Our main idea is to use synthetic instances as the supervising targets in the deep feature space, to implement a representation learning that induces better class distinction in the acquired space.

The proposed framework, Deep Over-sampling (DOS), employs a basic CNN architecture in which the lower layers acquire the embedding function and the top layers acquire the classification function. We implement the training of the CNN with explicit supervising information for both functions, i.e., the network parameters are updated by propagation from the output of the lower layers as well as the top layers. Accordingly, the training data presents, with each raw input sample, a class label and a target in the deep feature space. The targets are sampled from a linear subspace of the in-class neighbors around the embedded input. As such targets naturally distribute closer to the class mean, our aim is to induce smaller in-class variance among the embeddings.

DOS provides the framework to address the effect of class imbalance on both the classifier and representation learning. First, the training data is augmented by assigning multiple synthetic targets to one input sample. Second, an iterative process of learning the CNN and updating the targets with the acquired representation enhances the discriminative power of the deep features.

The main contribution of this work is a general re-sampling framework, which enables the deep neural net to learn the deep representation and the classifier jointly in a class-imbalanced setting without substantial modification to its architecture, and is thus applicable to a wide range of deep learning models. We validate the effectiveness of the proposed framework in an empirical study using public image classification benchmarks. Furthermore, we investigate the effect of the proposed framework outside the class imbalance setting. The rest of this paper is organized as follows. Section 2 introduces the related work and the preliminaries. Section 3 describes the details of the proposed framework. The empirical results are shown in Sect. 4, and we present our conclusion in Sect. 5.

2 Background

2.1 Class Imbalance

Class imbalance is a common issue in practical classification problems, where a large imbalance in the number of training samples between classes causes the learning algorithms to over-generalize for the classes in the majority. Its effect is critical as retrieving the minority classes is usually the primary interest in practice [11, 17]. The countermeasures against class imbalance can be generally categorized into three major approaches. The re-sampling approach attempts to directly adjust the sample sizes by over- or under-sampling on the training set. The instance weighting approach exploits a similar intuition by increasing the importance of the minority class samples. Finally, the cost-sensitive learning approach modifies the loss function and/or the learning algorithm to penalize the errors on the minority class predictions.

For traditional classification models, the synthetic over-sampling methods such as SMOTE [3] have been generally successful in countering the effect of imbalance. Typically, synthetic samples are generated by randomly selecting a minority class sample and taking an interpolation between it and one of its in-class neighbors. The re-sampling approach has also been attempted on neural network models, e.g., [28] has combined over-sampling with cost-sensitive learning and [15] has combined under-sampling with synthetic over-sampling.
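For concreteness, the following is a minimal NumPy sketch of this kind of SMOTE-style interpolation on vector data. The function name `smote_like` and the neighbor-selection details are ours for illustration; this is not the exact algorithm of [3].

```python
import numpy as np

def smote_like(X_min, n_synth, k=5, seed=0):
    """Interpolate between a randomly chosen minority sample and one of its
    k nearest in-class neighbors, n_synth times."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                       # exclude the point itself
    nn = np.argsort(d, axis=1)[:, :k]                 # k nearest in-class neighbors
    synth = []
    for _ in range(n_synth):
        i = rng.integers(len(X_min))                  # a random minority sample
        j = rng.choice(nn[i])                         # one of its neighbors
        gap = rng.random()                            # interpolation coefficient in [0, 1)
        synth.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.stack(synth)

X_min = np.random.randn(20, 4)                        # toy minority-class vectors
print(smote_like(X_min, n_synth=5).shape)             # (5, 4)
```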

One limitation of the synthetic over-sampling method is the need for vector-form input data, i.e., it is not applicable to non-vector input domains, e.g., pair-wise distances [1, 16]. Moreover, it implicitly assumes that the interpolations do not deviate from the original distribution, as in a locally linear feature space. Such an assumption is usually not problematic for traditional classification models, many of which are developed with similar assumptions. Synthetic over-sampling is generally successful when the features are pre-selected for such models. Meanwhile, the assumption does not hold for complex, structured data often handled by deep neural nets. Generating samples of complex data is substantially difficult, and simple interpolations can easily deviate from the original distribution. While acquiring a locally-linear representation is a key advantage of deep neural nets, a recent study has reported that class imbalance can affect their representation learning capability as well [14].

In [14], a sophisticated under-sampling scheme called Large Margin Local Embedding (LMLE) was implemented to generate an abridged training set for representation learning. These samples were selected considering the class and cluster structures, such as in-/out-of-class and in-/out-of-cluster neighbors. It also introduced a new loss function based on class-separating margins, inspired by the Large Margin Nearest Neighbor (LMNN) [26].

The potential demerit of under-sampling is the loss of information from discarding a subset of the training data. As such, computationally intensive analysis to retain important samples, in this case the analyses of class and cluster structure, is inevitable in the re-sampling process. Another drawback of the above work is the specificity of the loss function and the network architecture. The effect of class imbalance, as we demonstrate in the next section, differs depending on the classification model. It is thus not clear whether the experimental results of a modified kNN extend generally to other classifiers, especially given that the proposed margin-based loss function is oriented toward the kNN-based classification model. Additionally, its architecture does not support simultaneous learning of the representation and the classifier, which is a key feature of the CNN. Overall, the above implementation is likely to require many task- and model-specific configurations when applied to a new problem.

In this paper, we explore an over-sampling scheme in the deep representation space to avoid computationally intensive analyses. We also attempt to utilize a basic CNN architecture in order to maintain its wide applicability and its advantage of cohesive representation and classifier learning.

2.2 Preliminary Results

To motivate our study, we first present a preliminary result on the effect of class imbalance on deep learning. An artificially imbalanced setting was created with the MNIST-back-rotation images [20] by selecting four digits randomly and removing 90% of their samples. We trained two instances of a basic CNN architecture [22] by back-propagation, with the original and the imbalanced data, respectively.
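A minimal sketch of how such an imbalanced setting can be created is shown below, assuming the samples and integer labels are held in arrays `X` and `y`; this is our illustration, not the authors' preprocessing code.

```python
import numpy as np

def make_imbalanced(X, y, n_minority=4, keep_ratio=0.10, seed=0):
    """Randomly pick n_minority classes and keep only keep_ratio of their samples."""
    rng = np.random.default_rng(seed)
    minority = rng.choice(np.unique(y), size=n_minority, replace=False)
    keep = np.ones(len(y), dtype=bool)
    for c in minority:
        idx = np.flatnonzero(y == c)
        drop = rng.choice(idx, size=int(len(idx) * (1 - keep_ratio)), replace=False)
        keep[drop] = False
    return X[keep], y[keep], minority

X = np.random.randn(1000, 28 * 28)                  # toy stand-in for the image data
y = np.random.randint(0, 10, size=1000)
X_imb, y_imb, dropped = make_imbalanced(X, y)
print(sorted(dropped), np.bincount(y_imb, minlength=10))
```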

The training of the two CNNs from initial parameters was repeated ten times, and the averages of their class-wise retrieval performances, i.e., precision, recall, and F1-score, are reported here. Although the overall accuracy has been reported in prior studies, we preferred the class-wise retrieval measures in order to obtain separate insights for the minority and majority classes.

Tables 1a and b show the class-wise precision, recall, and F1-score of one trial from each experiment. In Table 1b, the minority class digits are indicated by asterisks. Additionally, significant declines (0.1 or more) in precision or recall, compared to Table 1a, are indicated by double underlines and smaller drops (0.05 or more) are indicated by single underlines. Tables 2a and b show the same performance measures for the kNN classifier using the deep representation acquired by the two CNNs. The performance of kNN, which is a non-inductive lazy learning algorithm, is said to substantially reflect the effect of class imbalance on representation learning [14]. The minority classes and the reductions in precision or recall are indicated in the same manner as above.

Table 1. Class-wise performance comparison (CNN)
Table 2. Class-wise performance comparison (deep representation + kNN)

In Table 1b, there is a clear trend of decrease in precision for the minority classes. Over the majority classes, there are reductions in recall which are smaller but still substantial. In Table 2b, both the precision and the recall decline for most of the minority classes. There are declines in precision or recall for many majority classes as well.

Table 3 shows the average measures of the minority and majority classes over ten trials. The digits of the minority classes were chosen randomly in each trial. The trends in Table 3 regarding the precision and the recall are consistent with those of Tables 1 and 2. These preliminary results support our insight that class imbalance has a negative impact on the representation learning of the CNN as well as on its classifier training, and that the influence differs depending on the classifier.

Table 3. Summary of average retrieval measures

3 Deep Over-sampling Framework

This section describes the details of the proposed framework, Deep Over-sampling (DOS). The main idea of DOS is to re-sample the training data in an expressive, nonlinear feature space acquired by the convolutional neural network. While previous over-sampling based methods such as SMOTE have achieved general success for traditional models, their approach of sampling from the linear subspace of the original data has clear limitations for complex, structured data, such as imagery. In contrast, DOS implements re-sampling in the linear subspace of deep feature instances and exploits the re-sampled instances for explicitly supervised representation learning as well as for complementing the minority classes.

3.1 Notations

We employ a basic CNN whose architecture can be divided into two groups of layers: the lower layers embedding the raw input into the deep feature space, and the top layers predicting the class label from the deep features. We denote the embedding function of the CNN by \(f:\varPhi \rightarrow {\mathbb {R}}^d\), where \(\varPhi \) is the raw input domain with a complex data structure. We also denote the discriminative function of the CNN by \(g:{\mathbb {R}}^d\rightarrow [0:1]^n\), whose output represents a vector of posterior class probabilities P(C|x) over n classes.

Let \(\mathcal {X}=\{(x_i,y_i)\}_{i=1}^m\) denote a set of training data, where \(x_i\in \varPhi \) and \(y_i\) takes a class value from \({\mathcal {C}}=\{c_j\}_{j=1}^n\). The network parameters, denoted by \({\mathbf {W}}_f\) and \({\mathbf {W}}_g\) for the embedding layers and the classification layers, respectively, are learned with back-propagation. A class imbalance, such that \(\#\{(x,y):y=c_i\}\gg \#\{(x,y):y=c_j\}\) for some \( \{c_i,c_j\}\subset \mathcal {C}\), may arise from practical issues such as the cost of data collection. A significant imbalance can hinder the performance of the acquired model. The architecture is illustrated in Fig. 1. We further elaborate on its details in Sect. 3.3.
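As an illustration of this notation, the following minimal PyTorch sketch splits a network into the two groups of layers: the lower layers implement f and the top layer implements g. The filter and unit counts loosely follow the architecture described in Sect. 4.1, while the kernel sizes and pooling are our assumptions, and g here returns class scores to which the softmax is applied through the loss.

```python
import torch
import torch.nn as nn

class DOSNet(nn.Module):
    """Lower layers: embedding f (Phi -> R^d). Top layer: classifier g (R^d -> n classes)."""
    def __init__(self, n_classes=10, d=120):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(1, 6, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 400), nn.ReLU(),
            nn.Linear(400, d),
        )
        self.g = nn.Linear(d, n_classes)

    def forward(self, x):
        z = self.f(x)               # deep feature output, supervised by the targets
        return z, self.g(z)         # class-score output, supervised by the labels

net = DOSNet()
z, scores = net(torch.randn(8, 1, 28, 28))     # a batch of 28x28 grayscale inputs
print(z.shape, scores.shape)                   # torch.Size([8, 120]) torch.Size([8, 10])
```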

3.2 Deep Feature Overloading

We employ re-sampling in the deep feature space to assign each raw input sample with multiple deep feature instances. As a result, the supervising targets are provided for both the embedding function f and the classification function g.

Fig. 1. CNN architecture

Fig. 2. Deep feature overloading

Let \(\mathcal {V}(c_j)=\{f(x_i):y_i=c_j\}\) denote the set of embeddings whose raw input has the label \(c_j\). A training instance is defined as a triplet \(z_i=(x_i,y_i,\mathcal {N}(x_i))\), consisting of an input sample \(x_i\), its class label \(y_i\), and a subset of embeddings \({\mathcal {N}}(x_i)\). \({\mathcal {N}}(x_i)\) is a subset of \(\mathcal {V}(y_i)\) that includes \(f(x_i)\) and its k in-class neighbors,

$$\begin{aligned} \mathcal {N}(x_i)=\{f(x_i)\}\cup \mathop {\mathrm {arg\,min}}_{V\subseteq \mathcal {V}(y_i)\setminus \{f(x_i)\},\,|V|=k}\;\sum _{v\in V}\left\| f(x_i)-v\right\| ^2 \end{aligned}$$
(1)

i.e., the k elements of \(\mathcal {V}(y_i)\) closest to \(f(x_i)\) in the deep feature space, together with \(f(x_i)\) itself.

We refer to the process of pairing each \((x_i,y_i)\in \mathcal {X}\) with its deep feature neighbors as deep feature overloading. The process is illustrated in Fig. 2. We refer to k as the overloading parameter and \(\mathcal {Z}=\{z_i\}_{i=1}^m\) as the overloaded training set.

As we describe in the following section, the deep feature neighbors are used to generate synthetic targets for the minority class samples. The value of k may thus be varied between the minority and the majority classes, as the latter does not need to be synthetically complemented. Note that the minimum value for k is 0, in which case \(\mathcal {N}(x_i)=\{f(x_i)\}\).
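A sketch of one overloading pass is given below, assuming `emb` holds the current embeddings \(f(x_i)\) as rows and `y` the class labels; the helper name `overload` and the minority-set argument are ours for illustration.

```python
import torch

def overload(emb, y, minority, k_mnr=5, k_mjr=0):
    """Return, for each sample i, the indices of N(x_i): the point itself
    plus its k in-class nearest neighbors in the deep feature space."""
    neighbors = []
    for i in range(len(y)):
        same = torch.nonzero(y == y[i], as_tuple=True)[0]        # in-class indices
        k = k_mnr if int(y[i]) in minority else k_mjr
        k = min(k, len(same) - 1)                                # cannot exceed the class size
        d = torch.cdist(emb[i:i + 1], emb[same]).squeeze(0)      # distances to in-class points
        idx = torch.topk(d, k + 1, largest=False).indices        # k neighbors + itself (distance 0)
        neighbors.append(same[idx])
    return neighbors

emb, y = torch.randn(100, 120), torch.randint(0, 10, (100,))     # toy embeddings and labels
N = overload(emb, y, minority={0, 1, 2, 3})
print(len(N), N[0])
```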

3.3 Micro-cluster Loss Function

Our CNN architecture, illustrated in Fig. 1, features two outputs, one for the classification function and one for the embedding function. The initial parameters of the network are learned in a single-task training, using only the substructure indicated by the dotted box on the left side of the figure, with the original imbalanced training set.

The classifier output is given by the softmax layer, and the cross-entropy loss [9] \(\mathcal {H}\), based on the predicted class probabilities g(f(x)) of a given input \((x,y)\),

$$\begin{aligned} \ell (x,y)=\mathcal {H}\left( g({f}(x)),y\right) \end{aligned}$$
(2)

is used for single-task learning.

After the initialization, the training is expanded to the multi-task learning architecture in Fig. 1 to use propagation from both f and g. The loss function given an overloaded sample \(z_i\) is defined with regard to \(\mathcal {N}(x_i)\), which can be considered an in-class cluster in the deep feature space. We thus refer to these functions as the micro-cluster loss.

The micro-cluster loss for the embedding function f is defined as a sum of squared errors

$$\begin{aligned} \ell _f(x)=\sum _{v\in \mathcal {N}(x)} \left\| f(x)-v\right\| ^2 \end{aligned}$$
(3)

The minimum of (3) is obtained when f(x) is mapped to the mean of \(\mathcal {N}(x)\). Note that the mean is a synthetic point in the deep feature space, to which no particular original sample is projected.

There are two key intuitions for setting the target representations to the local means. First, the summation of the squared errors can add emphasis to the minority class samples, by overloading them with a larger number of embeddings. Secondly, as the local means distribute closer to the mean of the original distribution, it induces smaller in-class variance in the learned representation. Smaller in-class variance yields better class distinction, which can be induced further by iterating the procedure with the updated embeddings.

The micro-cluster loss for g is defined as the weighted sum of the cross-entropy losses, i.e.,

$$\begin{aligned} \ell _g(x,y)= \sum \limits _{v\in \mathcal {N}(x)}\rho (v)\mathcal {H}(g(v),y) \end{aligned}$$
(4)

where \(\rho (v)\) is the normalized exponential weight given the squared errors in (3),

$$\begin{aligned} \rho (v)=\frac{1}{Z}\exp \left( -\Vert f(x)-v\Vert ^2\right) \end{aligned}$$
(5)

and Z denotes a normalizer such that

$$Z=\sum _{v\in \mathcal {N}(x)}\exp \left( -\Vert f(x)-v\Vert ^2\right) $$

In (5), the largest weight, whose unnormalized value is \(\exp (0)=1\), is assigned to the original loss from f(x), and larger weights are assigned to neighbors within a closer range.
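The following is a sketch of how the micro-cluster losses (3)-(5) could be computed for one training instance, assuming `fx` is the embedding f(x), `V` the members of \(\mathcal {N}(x)\) stacked as rows (including f(x) itself), `y` an integer label, and `g` the classification layers; the names and the stand-in classifier head are ours.

```python
import torch
import torch.nn.functional as F

def micro_cluster_losses(fx, V, y, g):
    sq = ((fx.unsqueeze(0) - V) ** 2).sum(dim=1)        # ||f(x) - v||^2 for each v in N(x)
    loss_f = sq.sum()                                   # Eq. (3): sum of squared errors
    rho = torch.softmax(-sq, dim=0)                     # Eq. (5): normalized exponential weights
    target = torch.full((len(V),), y, dtype=torch.long) # same label for every target v
    ce = F.cross_entropy(g(V), target, reduction="none")
    loss_g = (rho * ce).sum()                           # Eq. (4): weighted cross-entropy
    return loss_f, loss_g

g = torch.nn.Linear(120, 10)                            # stand-in for the classification layers
fx, V, y = torch.randn(120), torch.randn(6, 120), 3     # one embedding and a micro-cluster of 6
print(micro_cluster_losses(fx, V, y, g))
```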

3.4 Deep Over-sampling

Deep Over-sampling uses the overloaded instances to supplement the minority classes. Multiple overloaded instances are generated from each training sample by pairing it with different targets sampled from the linear subspace of its in-class neighbors.

Let \(\mathcal {W}\) denote a domain of non-negative, \(\ell _1\)-normalized vectors of k dimensions, i.e., for all \(\mathbf {w}\in \mathcal {W}\), \(\Vert \mathbf {w}\Vert _1=1\) and \({w}_i\ge 0\) for \(i=1,\ldots ,k\). Note that k is the overloading parameter. For each overloaded instance \(z_i\in \mathcal {Z}\), we sample a set of vectors \(\{\mathbf {w}^{(i,j)}\}_{j=1}^r\) from \(\mathcal {W}\).

We define the weighted overloading instance as a quadruplet

$$\begin{aligned} z_{i}^{(j)}=(x_i,y_i,\mathcal {N}(x_i),\mathbf {w}^{(i,j)}) \end{aligned}$$
(6)

Note that each element of the weight vector corresponds to an element of \(\mathcal {N}(x_i)\).

Sampling r vectors for each \(z_i\), we obtain a weighted training set

$$\begin{aligned} \mathcal {Z}'=\mathop {\cup }\limits _{j=1}^r\left\{ z_i^{(j)}\right\} _{i=1}^m \end{aligned}$$
(7)

We define the following micro-cluster loss functions for the weighted instances. The loss function for f, given a quadruplet of x, y, \(\mathcal {N}(x)=\{v_i\}_{i=1}^k\), and \(\mathbf {w}=(w_1,\ldots ,w_k)\), is written as

$$\begin{aligned} \ell '_f(x,y,\mathbf {w},\mathcal {N}(x))=\sum _{i=1}^k w_i \left\| f(x)-v_i\right\| ^2 \end{aligned}$$
(8)

The minimum of (8) is attained when f(x) is at the weighted mean, \(\sum _iw_iv_i\).

The weighted micro-cluster loss for g is defined, similarly to (4), as

$$\begin{aligned} \ell '_g(x,y,\mathbf {w})=\sum _{i=1}^k\rho '(v_i,w_i)\mathcal {H}(g(v_i),y) \end{aligned}$$
(9)

where \(\rho '\) is the normalized weight

$$\begin{aligned} \rho '(v_i,w_i)=\frac{1}{Z}\exp (-w_i\Vert f(x)-v_i\Vert ^2) \end{aligned}$$
(10)
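A sketch of the weighted micro-cluster losses (8)-(10) is given below. The text does not fix how \(\mathbf {w}\) is drawn from \(\mathcal {W}\); a flat Dirichlet distribution over the simplex is one simple choice and is our assumption here, as are the variable names.

```python
import torch
import torch.nn.functional as F

def weighted_micro_cluster_losses(fx, V, y, w, g):
    sq = ((fx.unsqueeze(0) - V) ** 2).sum(dim=1)        # ||f(x) - v_i||^2
    loss_f = (w * sq).sum()                             # Eq. (8): minimized at the weighted mean
    rho = torch.softmax(-w * sq, dim=0)                 # Eq. (10)
    target = torch.full((len(V),), y, dtype=torch.long)
    ce = F.cross_entropy(g(V), target, reduction="none")
    loss_g = (rho * ce).sum()                           # Eq. (9)
    return loss_f, loss_g

k, r = 5, 3                                             # overloading and over-sampling rates
V, g = torch.randn(k, 120), torch.nn.Linear(120, 10)    # N(x) and a stand-in classifier head
W = torch.distributions.Dirichlet(torch.ones(k)).sample((r,))   # r weight vectors per sample
for w in W:                                             # r weighted overloaded instances
    print(weighted_micro_cluster_losses(torch.randn(120), V, 3, w, g))
```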

To summarize, the augmentative samples for the minority classes are generated by pairing each raw input sample with multiple targets for representation learning. The rationale for learning to map one input onto multiple targets can be explained as promoting robustness under the imbalanced setting. Since there is less than a sufficient number of samples for the minority classes, strong supervised learning may induce the risk of overfitting. Generating multiple targets within the range of local neighbors is similar in effect to adding noise to the target and can prevent the gradient descent from converging to undesired local solutions.

After training the CNN, the targets are recomputed with the updated representation. The iterative process of training the CNN and updating the targets incrementally shifts the targets toward the class mean and improves the class distinction among the embeddings. The pseudo code of the iterative procedure is shown in Algorithm 1.

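The original listing for Algorithm 1 is not reproduced here. The following compact Python sketch illustrates the same iterative procedure, single-task initialization followed by alternating target re-computation and multi-task training, under stated assumptions: a tiny MLP on random data stands in for the CNN and the image sets, the Dirichlet sampling of \(\mathbf {w}\) and all hyper-parameter values are our illustrative choices, and the structure does not reflect the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, n_cls, k, r, T = 16, 4, 3, 2, 3
X = torch.randn(200, 32)                          # toy raw inputs
y = torch.randint(0, n_cls, (200,))
minority = {0}

f = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, d))   # embedding stand-in
g = nn.Linear(d, n_cls)                                             # classifier head
opt = torch.optim.SGD(list(f.parameters()) + list(g.parameters()), lr=0.05)

# (1) single-task initialization with the cross-entropy loss (2)
for _ in range(20):
    opt.zero_grad()
    F.cross_entropy(g(f(X)), y).backward()
    opt.step()

# (2) iterate: re-embed, overload, train with the weighted micro-cluster losses
for t in range(T):
    with torch.no_grad():
        emb = f(X)                                # current deep features (fixed targets)
    for idx in torch.randperm(len(X))[:64]:       # a small random batch of samples
        i = int(idx)
        same = torch.nonzero(y == y[i], as_tuple=True)[0]
        dist = torch.cdist(emb[i:i + 1], emb[same]).squeeze(0)
        kk = k if int(y[i]) in minority else 0
        kk = min(kk, len(same) - 1)
        N = emb[same[torch.topk(dist, kk + 1, largest=False).indices]]   # N(x_i)
        reps = r if int(y[i]) in minority else 1  # over-sample only the minority class
        for _ in range(reps):
            w = torch.distributions.Dirichlet(torch.ones(len(N))).sample()
            fx = f(X[i:i + 1]).squeeze(0)
            sq = ((fx.unsqueeze(0) - N) ** 2).sum(dim=1)
            loss_f = (w * sq).sum()                                      # Eq. (8)
            rho = torch.softmax(-w * sq, dim=0)
            tgt = torch.full((len(N),), int(y[i]))
            loss_g = (rho * F.cross_entropy(g(N), tgt, reduction="none")).sum()  # Eq. (9)
            opt.zero_grad()
            (loss_f + loss_g).backward()
            opt.step()

print("final training loss:", float(F.cross_entropy(g(f(X)), y)))
```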

3.5 Parameter Selection

As mentioned in Sect. 3.2, different values of the overloading parameter k may be selected for the minority and the majority classes to place additional emphasis on the former. Let \(k_{\text {mnr}}\) and \(k_{\text {mjr}}\) denote the overloading values for the minority and the majority classes, respectively. If the latter is set to the minimum, i.e., \(k_{\text {mjr}}=0\), then the loss for the minority class samples, as given by (3), accounts for \((k_\text {mnr}+1)\) times more squared errors.

In essence, however, k should be chosen to reflect the extent of the neighborhood, as the size of \(\mathcal {N}(x)\) can influence the efficiency of the back-propagation learning. As one increases k, the target shifts closer to the global class mean and, in turn, farther away from the tentative embedding f(x). For better convergence of the gradient descent, the target should be maintained within a moderate range of proximity from the tentative embedding.

Our general guideline for the parameter selection is therefore to choose a value of \(k_{\text {mnr}}\) from [3:10] by empirical validation and set \(k_{\text {mjr}}=0\), provided that there is a sufficient number of samples for the majority classes. Furthermore, we suggest choosing the over-sampling rate r from \([\frac{1}{R}:\frac{k_{\text {mnr}}}{R}]\), where R denotes the average ratio of the minority and the majority class samples, i.e.,

$$\begin{aligned} R=\frac{\#\{(x,y):(x,y)\in \mathcal {X}{\wedge }y=c_{\text {mnr}}\}}{\#\{(x,y):(x,y)\in \mathcal {X}{\wedge }y=c_{\text {mjr}}\}} \end{aligned}$$
(11)

For the number of iterations T, we suggest it to be the same as the number of training rounds, i.e., to re-compute the targets after every training round.
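As a small worked example of this guideline, suppose, hypothetically, about 600 samples per minority class, 6,000 per majority class, and \(k_{\text {mnr}}=5\); these numbers are illustrative, not values from the experiments.

```python
# Hypothetical class sizes for illustration only.
n_mnr, n_mjr, k_mnr = 600, 6000, 5
R = n_mnr / n_mjr                 # Eq. (11): average minority-to-majority ratio = 0.1
r_low, r_high = 1 / R, k_mnr / R  # suggested range for the over-sampling rate r
print(R, (r_low, r_high))         # 0.1 (10.0, 50.0)
```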

4 Empirical Results

We conducted an empirical study to evaluate the DOS framework in three experimental settings. The first experiment is a baseline comparison, for which we replicated a setting used in the most recently proposed model to address class imbalance. Secondly, we evaluated the sensitivity of DOS to different levels of imbalance and parameter choices. Finally, we investigated the effect of deep over-sampling in the standard, balanced settings. The imbalanced settings were created from standard benchmarks by deleting samples from selected classes.

4.1 Datasets and CNN Settings

We present the results on five public datasets: MNIST [21], MNIST-back-rotation images [20], SVHN [23], CIFAR-10 [18], and STL-10 [6]. We set up the experiment in the image domain because it is one of the most popular domains in which CNNs have been used extensively, and also because it is difficult to apply the SMOTE algorithm directly. Note that we omit the result of preprocessing the imbalanced image set using SMOTE, as it achieved no improvement in the classifier performances.

The MNIST digit dataset consists of a training set of 60,000 images and a test set of 10,000 images, which include 6,000 and 1,000 images for each digit, respectively. The MNIST-back-rotation-image dataset (MNISTrb) is an extension of the MNIST dataset containing \(28\times 28\) images of rotated digits over randomly inserted backgrounds. Its default training and test sets consist of 12,000 and 50,000 images, respectively. The Street View House Numbers (SVHN) dataset consists of 73,257 digits for training and 26,032 digits for testing in \(32\times 32\) RGB images. The CIFAR-10 dataset consists of \(32\times 32\) RGB images. A total of 60,000 images in 10 categories are split into 50,000 training and 10,000 testing images. The STL-10 dataset contains \(96\times 96\) RGB images in 10 categories. All results are reported on the default test sets.

For MNIST, MNISTrb, and SVHN, we employ a CNN architecture consisting of two convolution layers with 6 and 16 filters, respectively, and two fully-connected layers with 400 and 120 hidden units. ReLU is adopted between the convolutional layers and the fully-connected layers. For CIFAR-10 and STL-10, we use convolutional layers with 20 and 50 filters and fully-connected layers with 500 and 120 hidden units. The summary of the datasets and the architectures is shown in Table 4.

4.2 Experimental Settings and Evaluation Metrics

Our first experiment follows that of [14] using the MNIST-back-rotation images. First, the original dataset was augmented 10 times with mirrored and rotated images. Then, class imbalance was created by deleting samples selected with a Gaussian distribution until a designated overall reduction rate was reached. We compare the average per-class accuracy (average class-wise recall) with those of Triplet re-sampling with cost-sensitive learning and Large Margin Local Embedding (LMLE) reported in [14]. Triplet loss re-sampling with cost-sensitive learning is a hybrid method that implements the triplet loss function used in [5, 24, 25] with re-sampling and cost-sensitive learning.

Table 4. Datasets and CNN architectures

The second experiment analyzes the sensitivity of DOS to the level of imbalance and the choice of k, using the MNIST, MNIST-back-rotation, and SVHN datasets. The value of k is altered over 3, 5, and 10. In [3], 5 was given as the default value of k, and other values have been tested in ensuing studies. The imbalance is created by randomly selecting four classes and removing a portion p of their samples. We report the class-wise retrieval measures: precision, recall, F1-score, and the area under the precision-recall curve (AUPRC), for the minority and the majority classes, respectively. The precision-recall curve is used in a similar scope as the receiver operating characteristic (ROC) curve. [10] has suggested the use of AUPRC over AUROC, as the latter provides an overly optimistic estimate of the retrieval performance in some cases. Note that the precision, recall, and F1-score are computed from a multi-class confusion matrix, while the precision-recall curve is computed from the class-wise posterior probabilities.

The third experiment is conducted using the original SVHN, CIFAR-10, and STL-10 datasets. Since the classes are not imbalanced, the overloading value k is set uniformly for all classes, and the over-sampling rate r, from (11), is set to 1. The results of this experiment thus reflect the effect of deep feature overloading in a well-balanced setting. The evaluation metrics are the same as in the second experiment, but averaged over all classes.

4.3 Results

Comparison with Existing Work. The result from the first experiment is summarized in Table 5. The overall reduction rate is shown in the first column. The performances of the baseline methods (TL-RS-CSL, LMLE) are shown in the second and the third columns. The last three columns show the performances of DOS and of two classifiers, logistic regression (LR) and k-nearest neighbors (kNN), using its deep representation. While all methods show declining trends of accuracy, DOS shows the slowest decline against the reduction rate.

Table 5. Baseline comparison (class-wise recall)

Sensitivity Analysis on Imbalanced Data. Tables 6 and 7 summarize the results of the second experiment. In Table 6, we compare the performances of the basic CNN, traditional classifiers using the deep representation of the basic CNN (CNN-CL), and DOS, over the reduction rates \(p=0.90,0.95,0.99\). Each row shows the four evaluation measures on MNIST, MNISTrb, and SVHN. Note that the AUPRC of CNN-CL is computed from the predicted class probabilities of the logistic regression classifier, while the other performances are those of the kNN classifier. The performances of the minority (mnr) and the majority (mjr) classes are reported separately, as indicated in the third column. We indicate significant increases (0.1 or more) of DOS over CNN and CNN-CL by double underlines and smaller increases (0.05 or more) by single underlines. In Table 6, DOS exhibits a more significant advantage with an increasing level of imbalance, over the basic CNN and the classifiers using its deep representation.

Table 7 summarizes the sensitivity analysis over the overloading parameter values \(k=3,5,10\), with the reduction rate set to \(p=0.01\). Each row shows the four evaluation measures on SVHN, CIFAR-10, and STL-10, respectively. For reference, the performances of the basic CNN and the deep feature classifiers are shown in the top rows. We indicate significant increases over the baselines in the same manner as in Table 6. The minority and majority classes are likewise indicated in the third column. Additionally, we indicate the unique largest values among the DOS settings in bold. These results show that DOS is generally not sensitive to the choice of k. However, there is a marginal trend that the performances on the minority classes are slightly higher with \(k=3,5\) and those of the majority classes are slightly higher with \(k=10\). This suggests a possible decline in performance with an overly large k.

Table 6. Performance comparison on imbalanced data over reduction rate
Table 7. Performance comparison on imbalanced data over k
Table 8. Performance comparison on balanced data over k

Run-Time Analysis. The deep learning in the above experiments was conducted on an NVIDIA GTX 980 graphics card with 704 cores and 6 GB of global memory. The average increases in run-time of DOS compared to the basic, single-task learning CNN architecture were 11%, 12%, and 32% for the MNIST, MNISTrb, and SVHN datasets, respectively.

Evaluation on Balanced Data. Table 8 summarizes the performances in the balanced settings. In the top rows, the performances of the basic CNN and the classifiers using its representations are shown for reference. In the bottom rows, the performances of DOS at \(k=3,5,10\) are shown. The uniquely best values among the three settings are indicated in bold. While the improvements over the basic CNN were smaller (between 0.01 and 0.03) than in the previous experiments, DOS showed consistent improvements across all datasets. This result supports our view that deep feature overloading can improve the discriminative power of the deep representation. We note that the performance of DOS was not sensitive to the values chosen for k.

5 Conclusion

We proposed the Deep Over-sampling framework for the imbalanced classification of complex, structured data, which allows the CNN to learn the deep representation and the classifier jointly without substantial modification to its architecture. The framework extends the synthetic over-sampling technique by using the synthetic instances not only to complement the minority classes for classifier learning, but also to supervise representation learning and enhance its robustness and class distinction. The empirical results showed that the proposed framework can address class imbalance more effectively than the existing countermeasures for deep learning, and that the improvements were more significant under stronger levels of imbalance. Furthermore, its merit for representation learning was verified by the improved performances in the balanced setting.