1 Introduction

With the prosperity of the deep learning research field, visual recognition has witnessed the prominence of powerful representation learning approaches and high-quality, large-scale datasets, e.g., ImageNet ILSVRC (Russakovsky et al., 2015) and Places (Zhou et al., 2018). These datasets are usually carefully balanced, exhibiting roughly uniform distributions of class labels. However, visual phenomena in the real world tend to have skewed data distributions with long-tailed characteristics (Dong et al., 2017; Liu & Tsoumakas, 2018; Xiang & Ding, 2020; Bej et al., 2021), consisting of a few majority classes (head classes) and a large number of minority classes (tail classes). When dealing with such long-tailed data, many standard approaches fail to work well due to the extreme class imbalance, leading to a significant drop in accuracy for tail classes (Mollaysa et al., 2019).

A common way to address the long-tailed problem is re-sampling or re-weighting, which artificially generates class-balanced batches or losses to counteract the extreme long tail (Huang et al., 2016; Buda et al., 2018; Ma et al., 2018; Cao et al., 2019). Motivated by the observation that naive re-weighting or re-sampling inevitably causes under-fitting to the head or over-fitting to the tail, recent studies (Kang et al., 2020; Zhou et al., 2020) separate imbalanced feature learning from balanced classifier learning, leading to a two-stage training paradigm. Each of these strands is intuitive and has proven empirically successful. However, they are not without limitations: there is no explanation of why the data samplers for feature-extractor learning and classifier learning should be different. Moreover, because head classes always dominate the training procedure, the imbalanced feature representation is projected onto head feature directions, which biases the re-trained classifier (Zhou et al., 2020).

In this paper, we propose to investigate long-tailed recognition from a memorization-generalization point of view. A recent study (Jiang et al., 2020) suggests that rare, low-regularity samples can be learned on top of the internal representations first built from the strongest domain regularities. In line with this, we verify the relation between sample regularity and class cardinality during training. We visualize the cumulative learned events and forgetting events (Toneva et al., 2019) of each sample (see Appendix A) and find that the regularity of the same training samples drops sharply as class cardinality is reduced. As shown in Fig. 1, the more the class cardinality is reduced, the more the regularity decreases. Further, the comparison between Figs. 9 and 10 shows that such skewed regularity of the training samples may degrade network generalization, i.e., the fewer training samples a class has, the lower the regularity of the validation samples of that class (see Appendix A).

Based on the notion that the long-tail challenge is essentially a trade-off between representing high-regularity head classes and generalizing to low-regularity tail classes, we explore a simple yet effective joint training strategy, named Switching, which properly shifts the learning focus from high-regularity head classes to low-regularity tail classes. We also give, for the first time, a theoretical generalization bound for changing data samplers during training.

Specifically, we employ the standard training procedure with cross-entropy loss and an instance-balanced sampler w.r.t. the original data distribution to ensure the learning of universal visual patterns. We switch from the instance-balanced sampler to a class-reversed sampler only for the last several epochs of training, so that tail classes become over-exposed. In earlier training, while head classes dominate the training data, the patterns and structures discovered in regular examples are used to build a generalizable representation. In the later training phase, memorizing tail classes does not seriously disrupt the learned representation, since the learning rate is much smaller than in earlier stages. Such a strategy simultaneously boosts both representation and classification under long-tailed distributions, avoiding the risk of the re-trained classifier being excessively dependent on the feature extractor.

We conduct extensive experiments across four benchmark long-tailed datasets: CIFAR10-LT, CIFAR100-LT, iNaturalist 2018 and ImageNet-LT, to evaluate the effectiveness of our proposed method. With such a simple training strategy, we obtain comparable or better results more efficiently than previous state-of-the-art methods.

To summarize, the main contributions are as follows:

  • We empirically identify that the low regularity of tail classes is the primary hurdle in learning an accurate model on long-tailed distributions, and that appropriately memorizing them is essential for better generalization across all classes.

  • We propose a simple yet effective strategy, named Switching, to handle the trade-off between high-regularity head classes and low-regularity tail classes, and give a theoretical generalization error bound proving that class-reversed sampling is better than instance-balanced sampling during the last training stage.

  • We investigate the effectiveness and efficiency of the proposed method through extensive experiments and demonstrate that tackling the long-tail trade-off requires only a few training epochs with a small learning rate and over-exposure of tail samples.

Fig. 1

The visualization of the regularity degradation of selected training samples when the training set changes from standard CIFAR-10 to long-tailed CIFAR-10 with reduced class cardinality. The regularity of a class is higher when more of its samples gather in the lower right corner of the plot. In each subfigure, samples are in one-to-one correspondence between the two plots. It can be observed that the regularity of the same training samples decreases sharply with the reduction of class cardinality: the more the cardinality is reduced, the more the regularity decreases

2 Related work

2.1 Long-tailed visual recognition

Re-sampling strategies Re-sampling strategies can be divided into two classical types: over-sampling the minority classes by repeatedly adding augmented images (Drummond et al., 2003; Han et al., 2005; Buda et al., 2018); or under-sampling the majority classes by removing several images (Japkowicz & Stephen, 2002; He & Garcia, 2009; Bellinger et al., 2018). All these re-sampling methods tend to provide a more balanced data distribution during training to solve the long-tailed problem. However, over-sampling may sometimes cause over-fitting towards minority classes, while under-sampling may weaken the representation ability of networks.

Re-weighting losses Re-weighting methods usually allocate different weights for training samples of each class to re-balance data distribution (Huang et al., 2016; Cao et al., 2019; Wu et al., 2020). Cui et al. (2019) assigns weights to each class based on the effective numbers of samples instead of the proportional frequency. Further, Jamal et al. (2020) utilizes both effective numbers (Cui et al., 2019) and conditional weights to augment the classic class-balanced learning by explicitly estimating the differences between the class-conditioned distributions with a meta-learning approach.

Two-stage fine-tuning Various methods (Ouyang et al., 2016; Cao et al., 2019; Liu et al., 2019; Peng et al., 2020) have been proposed to modify re-balancing for further improvements in long-tailed recognition. These methods usually separate the training process into two stages. In general, they train the networks with instance-balanced sampling in the first stage and exploit re-sampling or re-weighting methods in the second stage to fine-tune the network. More radically, Kang et al. (2020) re-train the classifier from scratch in a class-aware manner in the second stage with the backbone fixed.

Different from these methods, we provide a theoretical analysis of the upper bound on the generalization error when switching data samplers. Based on this analysis, we do not artificially generate class-balanced batches or losses; instead, we simply emphasize the memorization of low-regularity tail-class samples by switching from the instance-balanced sampler to the class-reversed sampler once during the standard training procedure.

2.2 Memorization-generalization mechanism in deep learning

Memorization was once considered a failure of deep networks since it implies a lack of generalization. However, the view that memorization is harmful may be a misunderstanding of deep learning. Zhang et al. (2017) were the first to demonstrate that standard deep learning algorithms can achieve high training accuracy even on large, randomly labeled datasets, leading to a large wave of research interest in the generalization of deep learning. Toneva et al. (2019) introduced the “forgetting event” to describe the learning dynamics of neural networks, where some instances flip-flop between “learned” and “forgotten” states during training. To analyze how individual instances are treated by a model on the memorization-generalization continuum, Jiang et al. (2020) proposed the C-score to measure the consistency of a sample with respect to the rest of the training set. They found that samples with lower C-scores are learned more slowly, indicating the need for a stage-wise learning rate schedule during training.

A recent work by Feldman (2020) proposed a new theoretical explanation for the benefits of memorization. In their abstract model, the algorithm can only estimate the frequency of a subpopulation through the empirical frequency of its representatives, so it can only avoid the risk of missing subpopulations of significant frequency by memorizing examples. Further, Feldman and Zhang (2020) introduced influence estimation to validate the necessity of memorizing useful examples for achieving close-to-optimal generalization error.

3 Method

In long-tailed visual recognition, the training data follow a long-tailed distribution over classes, leading models to under-fit tail classes and over-fit head classes (Tao et al., 2018; Baloch et al., 2019). Since increasing the exposure of tail classes may lead to over-fitting, while under-sampling head classes may weaken the representation ability of the network, the trade-off between representing the head and generalizing to the tail becomes the main dilemma of the long-tailed problem. To address this dilemma, we first introduce cumulative learned and forgetting events (Toneva et al., 2019) to verify the relation between cardinality and regularity. Based on the observation that the regularity of the same training samples decreases sharply with the reduction of class cardinality (see Appendix A), we propose the Switching training strategy, which simply switches the standard instance-balanced sampler to a class-reversed sampler in the final phase of training, in order to learn low-regularity samples (tail classes) without seriously disrupting the representation built first from the strongest domain regularities (head classes).

3.1 Theoretical motivations

Problem setup and notations Let \(f_{\theta }(\cdot )\) denote a feature extractor implemented by a CNN model with parameters \(\theta\). The class prediction is obtained through \({\hat{y}} = \mathop {\arg \max } g(f_{\theta }(\mathbf{x} ))\), where \(\mathbf{x}\) is the input image and \(g(\cdot )\) is a classifier function. Given a training set \({\mathcal {D}} = \{(x_{i}, y_{i})\}, i \in \{1, ..., n\}\) with C classes, let \(n_{j}\) denote the number of samples of class j and \(n = \sum _{i=1}^C n_{i}\) be the total number of samples. Without loss of generality, we assume classes are sorted by cardinality in decreasing order, i.e., if \(i < j\), then \(n_{i} \ge n_{j}\). For most sampling strategies, the probability \(p_{j}\) of sampling a data point from class j is given by:

$$\begin{aligned} p_{j} = \frac{n^{q}_{j}}{\sum _{i=1}^C n^{q}_{i}}, \end{aligned}$$
(1)

where different values of q correspond to different sampling strategies. The sampling of each data point can be decomposed into the following two steps: (1) randomly sample a class j according to \(p_{j}\); (2) uniformly pick a sample from class j. The sampling strategies corresponding to q = 1, q = 0, and q = -1 are introduced below:

Instance-balanced sampling (IB) This is the most common and standard way of sampling data, where each sample of the training dataset is sampled only once with equal probability in a training epoch. For instance-balanced sampling, the probability \(p^{IB}_{j}\) is given by Eq. 1 with q = 1, i.e., a sample from class j will be sampled proportionally to the cardinality \(n_{j}\) of the class.

Class-balanced sampling (CB) To alleviate the extreme data imbalance during training, class-balanced sampling artificially generates class-balanced data. The probability \(p^{CB}_{j}\) is given by Eq. 1 with q = 0, i.e., \(p^{CB}_{j} = 1/C\). In this scenario, every class j has an equal probability of being selected, independent of its cardinality \(n_{j}\).

Class-reversed sampling (CR) Zhou et al. (2020) utilize a reversed sampler to re-balance feature representation and particularly improve the classification accuracy on tail classes. Here we obtain \(p^{CR}_{j}\) from Eq. 1 with q = -1. For class-reversed sampling, a data point from class j is sampled proportionally to the reciprocal of its cardinality \(n_{j}\), i.e., the more samples a class has, the smaller its sampling probability.
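
For illustration, the following minimal sketch (a Python example with hypothetical class counts, not the authors' code) computes the class-sampling probabilities \(p_{j}\) of Eq. 1 for the three values of q discussed above.

```python
import numpy as np

def class_probs(class_counts, q):
    """Class-sampling probabilities p_j = n_j^q / sum_i n_i^q (Eq. 1)."""
    counts = np.asarray(class_counts, dtype=float)
    weights = counts ** q
    return weights / weights.sum()

# Hypothetical long-tailed class counts (head -> tail).
n = [5000, 2000, 500, 100, 20]

p_ib = class_probs(n, q=1)    # instance-balanced: proportional to n_j
p_cb = class_probs(n, q=0)    # class-balanced: uniform, 1/C for every class
p_cr = class_probs(n, q=-1)   # class-reversed: proportional to 1/n_j
```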

Objective function Let \(L_{ji}(\theta )\) denote standard training error on i-th sample of class j:

$$\begin{aligned} L_{ji}(\theta ) = \ell \left( f_{\theta } \left( {x}_{ji}\right) , {y}_{ji}\right) , \end{aligned}$$
(2)

where \(\ell\) is the loss function, e.g., cross-entropy loss.

For standard training process with IB sampling, where each sample is sampled with equal probability, the objective function over the total training set \({\mathcal {D}}\) is given as follows:

$$\begin{aligned} \begin{aligned} L^{s}(\theta ) = \frac{1}{n}\sum _{i=1}^{n}L_{i}(\theta )+R(\theta ), \end{aligned} \end{aligned}$$
(3)

where \(R(\theta )\) is the regularization term.

Now consider a more general setting, where sampling a data point consists of two steps: (1) randomly choose one class according to \(p_{j}\); (2) uniformly pick one sample from its \(n_{j}\) samples. We then have the following objective function:

$$\begin{aligned} \begin{aligned} L(\theta ) = \sum _{j=1}^{C}\sum _{i=1}^{n_{j}}\frac{p_{j}}{n_{j}} L_{ji}(\theta )+R(\theta ). \end{aligned} \end{aligned}$$
(4)
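
In practice, the two-step sampling behind Eq. 4 is equivalent to drawing each example of class j with weight \(p_{j}/n_{j}\). A minimal sketch of such a sampler is given below (PyTorch-style; the helper name build_sampler and its interface are our own illustration, not the authors' code).

```python
import torch
from torch.utils.data import WeightedRandomSampler

def build_sampler(labels, q, num_samples=None):
    """Sampler whose per-example weight is p_j / n_j, matching Eq. 4.

    labels: 1-D tensor of class indices for the whole training set.
    q: 1 for IB, 0 for CB, -1 for CR sampling (cf. Eq. 1).
    """
    labels = torch.as_tensor(labels)
    class_counts = torch.bincount(labels).float()                  # n_j
    class_probs = class_counts ** q
    class_probs = class_probs / class_probs.sum()                  # p_j (Eq. 1)
    sample_weights = class_probs[labels] / class_counts[labels]    # p_j / n_j
    return WeightedRandomSampler(sample_weights,
                                 num_samples=num_samples or len(labels),
                                 replacement=True)
```

Note that for q = 1 the per-example weights all reduce to 1/n, i.e., standard instance-balanced sampling.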

Generalization error upper bound Now we give the generalization analysis for such an objective function by deriving its generalization error upper bound. Let \(\varTheta\) be the function family of our learned neural network; we define \(\mathfrak {R}_{n}(\varTheta )\) as the standard Rademacher complexity (Bartlett & Mendelson, 2002) of the set \(\{(x, y) \mapsto \ell (f(x ; \theta ), y): \theta \in \varTheta \}\):

$$\begin{aligned} \begin{aligned} \mathfrak {R}_{n}(\varTheta )={\mathbb {E}}_{{{\mathcal {D}}}, \xi }\left[ \sup _{\theta \in \varTheta } \frac{1}{n} \sum _{i=1}^{n} \xi _{i} \ell \left( f_{\theta }\left( {x}_{i} \right) , {y}_{i}\right) \right] , \end{aligned} \end{aligned}$$
(5)

where \(\xi _{1}, \ldots , \xi _{n}\) are independent uniform random variables taking values in {−1, 1} (i.e., Rademacher variables).

Let \({\mathcal {M}}\) denote the least upper bound on the difference of individual loss values:

\(\left| \ell (f_{\theta }(x), y)-\ell \left( f_{\theta }\left( x^{\prime } \right) , y^{\prime }\right) \right| \le {\mathcal {M}}\) for all \(\theta \in \varTheta\). For the standard training process with \(L^{s}(\theta )\), for any \(\delta >0,\) with probability at least \(1-\delta\) over the training set \({\mathcal {D}}\), the following error bound holds for all \(\theta \in \varTheta\) (Kawaguchi & Lu, 2020):

$$\begin{aligned} \begin{aligned} {\mathbb {E}}^{s}_{(x, y)}[\ell (f_{\theta }(x), y)] \le L^{s}(\theta )+2 {\mathfrak {R}}_{n}(\varTheta )+ {\mathcal {M}}\sqrt{\frac{\ln (1 / \delta )}{2 n}}. \end{aligned} \end{aligned}$$
(6)

Analogously, for the general objective function \(L(\theta )\), we have the following error bound for all \(\theta \in \varTheta\) (the proof is given in Appendix B.1):

$$\begin{aligned} \begin{aligned} {\mathbb {E}}_{(x, y)}[\ell (f_{\theta }(x), y)] \le L(\theta )+2 \mathfrak {R}_{n}(\varTheta )-{\mathcal {Q}}_{n}(\varTheta ; p, n) \\ +{\mathcal {M}}\sqrt{\sum _{j \in C}\frac{p_{j}^{2}}{n_{j}}}\sqrt{\frac{\ln (1 / \delta )}{2}}, \end{aligned} \end{aligned}$$
(7)

where \({\mathcal {Q}}_{n}(\varTheta ; p, n) = {\mathbb {E}}_{{{\mathcal {D}}}}\left[ \inf _{\theta \in \varTheta } \sum _{j=1}^{C}\sum _{i=1}^{n_{j}}\left( \frac{p_{j}}{n_{j}}-\frac{1}{n}\right) L_{ji}(\theta ) \right]\) is a residual term that measures the expected minimum difference between the empirical training error under the chosen sampling distribution (e.g., the CR sampling used by our Switching method) and that under instance-balanced (IB) sampling, taken over the training set \({\mathcal {D}}\).
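
As a quick sanity check, Eq. 7 recovers the standard bound of Eq. 6 under instance-balanced sampling: with \(p_{j} = n_{j}/n\) we have \(p_{j}/n_{j} = 1/n\), so the residual term \({\mathcal {Q}}_{n}(\varTheta ; p, n)\) vanishes and \(L(\theta ) = L^{s}(\theta )\), while the concentration term collapses to the familiar \(1/\sqrt{n}\) rate:

$$\begin{aligned} \sum _{j=1}^{C}\frac{p_{j}^{2}}{n_{j}} = \sum _{j=1}^{C}\frac{n_{j}^{2}/n^{2}}{n_{j}} = \frac{1}{n^{2}}\sum _{j=1}^{C} n_{j} = \frac{1}{n}, \quad \text {so} \quad {\mathcal {M}}\sqrt{\sum _{j=1}^{C}\frac{p_{j}^{2}}{n_{j}}}\sqrt{\frac{\ln (1/\delta )}{2}} = {\mathcal {M}}\sqrt{\frac{\ln (1/\delta )}{2n}}. \end{aligned}$$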

With the above derivation, we have the following Theorem 1, which serves as theoretical evidence supporting the superior generalization of the proposed method.

Theorem 1

With a small size of \(\varTheta\) and a bounded \({\mathcal {M}}\), the upper bound on the expected error for CR is strictly lower than that for IB if \({\mathcal {Q}}_{n}(\varTheta ; p, n) + L^{s} - L > 0\) or if \(L^{s} - L > 0\) (the proof is given in Appendix B.2).

In our experimental settings, a small learning rate is adopted in the last training stage, which is equivalent to fine-tuning from a previously trained initialization and thus produces a narrow parameter space \(\varTheta\) (see the first assumption in Appendix B.2). Therefore, Theorem 1 can theoretically guarantee that the upper bound of the CR method used in the small-learning-rate stage is strictly lower than that of the IB training method.

3.2 Switching data samplers during training

Algorithm 1
Fig. 2

The illustration of Algorithm 1. \(m_j\) denotes the j-th learning rate decay milestone (the exact epoch at which the learning rate is decayed for the j-th time), \(m_n\) denotes the last learning rate decay milestone, \(t_s\) denotes the exact epoch at which the re-sampling strategy is switched and \(t_e\) denotes the ending epoch

Switching is proposed to shift the learning focus from head classes to tail classes by simply switching the IB sampler to the CR sampler at some epoch during training. Before the switch happens, the uniform IB sampler retains the characteristics of the original distribution and mostly high-regularity samples from head classes are learned; the patterns and structures discovered in those head-class samples can be used to build a generalizable representation. In later stages, the memorization of tail-class samples does not seriously disrupt the learned representation, as the learning rate is much smaller than in the earlier stages.

Concretely, the total number of training epochs is denoted as T and the learning rate milestones are denoted as \([m_{1}, \ldots , m_{n}]\), where \(m_{1}< \cdots < m_{n} \le T\). Let \(\gamma \in (0, 1)\) be the multiplicative factor; the learning rate is decayed by \(\gamma\) once the epoch reaches one of the milestones during training. When the training procedure reaches epoch \(m_{n} + S\), we switch IB sampling to CR sampling and continue training, where S is the hyper-parameter of our method indicating when to switch. The details of our switching strategy are shown in Algorithm 1 and illustrated by Fig. 2.
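
A minimal sketch of the training loop implied by Algorithm 1 is shown below (PyTorch-style, under our own assumptions; build_sampler is the illustrative helper sketched in Sect. 3.1, and epoch-indexing details are glossed over).

```python
import torch
from torch.utils.data import DataLoader

def train_with_switching(model, dataset, labels, total_epochs, milestones, S,
                         gamma=0.1, lr=0.1, batch_size=64):
    """Train with IB sampling, then switch once to CR sampling at epoch m_n + S."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=2e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=milestones,
                                                     gamma=gamma)
    criterion = torch.nn.CrossEntropyLoss()
    switch_epoch = milestones[-1] + S                       # m_n + S

    loader = DataLoader(dataset, batch_size=batch_size,
                        sampler=build_sampler(labels, q=1))       # IB sampler

    for epoch in range(total_epochs):
        if epoch == switch_epoch:                           # one-time switch IB -> CR
            loader = DataLoader(dataset, batch_size=batch_size,
                                sampler=build_sampler(labels, q=-1))
        for images, targets in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```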

Our method is simple and clean: it only switches the data sampler from IB to CR once during training, without changing any structure of the original network or artificially generating class-balanced batches or losses.

4 Experiments

4.1 Datasets and empirical settings

Long-tailed CIFAR-10 and CIFAR-100 Both CIFAR-10 and CIFAR-100 contain 60,000 images, with 50,000 for training and 10,000 for validation, and 10 and 100 categories, respectively. For fair comparisons, we use the same long-tailed versions of the CIFAR datasets as Zhou et al. (2020), with controllable degrees of data imbalance. The imbalance factor \(\beta\) describes the severity of the long-tail problem via the numbers of training samples of the most frequent and the least frequent class, i.e., \(\beta\) = \(\frac{n_{max}}{n_{min}}\). We use \(\beta\) of 10, 50, and 100 in our experiments.
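
For reference, a common construction of such long-tailed CIFAR variants (used, e.g., in Cui et al., 2019; the exact variant followed here is that of Zhou et al., 2020, which may differ in details) decays the per-class counts exponentially between \(n_{max}\) and \(n_{max}/\beta\), as sketched below.

```python
def longtail_class_counts(n_max, num_classes, beta):
    """Per-class counts decaying exponentially from n_max down to n_max / beta."""
    return [int(n_max * beta ** (-i / (num_classes - 1)))
            for i in range(num_classes)]

# e.g., CIFAR-10 (5,000 training images per class originally), imbalance factor 100:
counts = longtail_class_counts(5000, 10, 100)    # [5000, ..., 50]
```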

iNaturalist 2018 The iNaturalist species classification dataset is a large-scale real-world, naturally long-tailed dataset, suffering from extremely imbalanced label distributions. We choose the 2018 version in our experiments, which consists of 437,513 images from 8142 categories. Note that, besides the extreme imbalance, the iNaturalist datasets also face the fine-grained problem. For fair comparisons, we utilize the official splits of training and validation images.

ImageNet-LT ImageNet-LT is artificially truncated from its balanced counterpart so that the labels of the training set follow a long-tailed distribution. ImageNet-LT has 1000 classes and the number of images per class ranges from 5 to 1280. Note that the validation set is balanced over the 1000 classes.

4.2 Implementation details

Implementation details on CIFAR We adopt plain ResNet-32 (He et al., 2016) as our model in all experiments. Standard mini-batch stochastic gradient descent (SGD) with momentum of 0.9 and weight decay of \(2 \times 10^{-4}\) is used to optimize the whole network. We train all models on a single NVIDIA 2080Ti GPU for 200 epochs with a batch size of 64. The initial learning rate is set to 0.1 and the first five epochs are trained with a linear warm-up learning rate schedule (Goyal et al., 2017). The learning rate is decayed by 0.1 at the 100th epoch. S is set to 1, which means we switch from instance-balanced sampling to class-reversed sampling at the 101st epoch.
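
The warm-up-plus-step schedule described above can be expressed, for example, with a single LambdaLR (a sketch of the schedule only; the paper does not detail the exact warm-up implementation beyond citing Goyal et al., 2017).

```python
import torch

def lr_factor(epoch, warmup_epochs=5, milestone=100, gamma=0.1):
    """Linear warm-up over the first 5 epochs, then a single 0.1x decay at epoch 100."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    return gamma if epoch >= milestone else 1.0

# A dummy parameter stands in for the ResNet-32 weights in this sketch.
params = [torch.zeros(1, requires_grad=True)]
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=2e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
```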

Implementation details on iNaturalist For fair comparisons, we utilize plain ResNet-50 (He et al., 2016) as our network in all experiments. We train all models on eight NVIDIA 2080Ti GPUs with a batch size of 512 for 90 and 200 epochs, respectively. The initial learning rate is set to 0.05 and decayed by 0.1 at the 60th and 80th epoch for the 90-epoch schedule, and at the 120th and 160th epoch for the 200-epoch schedule. S is set to 1, similar to the experiments on CIFAR. For fair comparison with Decouple (Kang et al., 2020), we also set S to 10 and 40, respectively, which means training for an additional 10 epochs after the same standard training procedure, for total training lengths of 100 and 210 epochs.

Implementation details on ImageNet-LT We adopt ResNet-50 and ResNeXt-50 as our backbones to analyze the effectiveness of our method. The initial learning rate is set to 0.2 and decayed by 0.1 at the 60th and 80th epoch, for 90 epochs in total. The batch size is 256 and S is set to 1, similar to the experiments on CIFAR. For fair comparison with Decouple (Kang et al., 2020), we also set S to 10, which means switching the sampler and training for an additional 10 epochs after the same standard training procedure, for a total of 100 training epochs.

4.3 Comparison methods

In experiments, we compare our method with four groups of methods:

Baseline methods We employ plain training with cross-entropy loss and focal loss (Lin et al., 2017) as our baselines.

Re-weighting methods For re-weighting methods, we compare with the CB-Focal (Cui et al., 2019) and LDAM (Cao et al., 2019), where effective numbers or margin-based generalization are utilized to alleviate the extreme data imbalance during training.

Two-stage fine-tuning strategies To prove the effectiveness of our switching strategy, we compare it with the two-stage fine-tuning strategies proposed by Cao et al. (2019). Networks are first trained with cross-entropy (CE) on imbalanced data and then trained with a class re-balancing strategy in the second stage. CE-DRW and CE-DRS refer to the two-stage baselines using re-weighting and re-sampling in the second stage. We also compare with Decouple (Kang et al., 2020), which trains the network with instance-balanced sampling and then re-trains the classifier with class-balanced sampling in the second stage with the backbone fixed.

State-of-the-art methods For state-of-the-art methods, we compare with the recently proposed BBN (Zhou et al., 2020), which utilizes class-reversed sampling to re-balance the feature extractor but has a more complicated model structure and neglects the proper combination of different data samplers.

Table 1 Top-1 accuracy of ResNet-32 on long-tailed CIFAR-10 and CIFAR-100

4.4 Overall performance

In this section, we compare the performance of the proposed scheme to other recent works that report state-of-the-art results on four common long-tailed benchmarks: Long-tailed CIFAR-10, Long-tailed CIFAR-100, iNaturalist2018 and ImageNet-LT.

Long-tailed CIFAR We conduct extensive experiments on the long-tailed CIFAR datasets with three different imbalance ratios: 10, 50 and 100. Table 1 reports the accuracy of various methods. On the CIFAR-10 series, our method achieves comparable or better results than other, more complicated methods. On the CIFAR-100 series, our method achieves the best results across all imbalance ratios, compared with two-stage fine-tuning strategies (i.e., CE-DRW/CE-DRS) and previous state-of-the-art methods (i.e., Decouple and BBN). In particular, for long-tailed CIFAR-100 with imbalance ratio 100 (the most extreme case), we obtain 44.7% accuracy, which is 2.1% higher than BBN.

iNaturalist 2018 We further evaluate our method on the iNaturalist 2018 dataset. Similar to Decouple (Kang et al., 2020) and BBN (Zhou et al., 2020), we present results after training for 90 and 200 epochs for fair comparison. As illustrated in Table 2, with an end-to-end trained plain ResNet-50 model, we surpass other, more complicated methods including two-stage fine-tuning (Decouple) and well-designed architectures (BBN). When S = 1, where the total number of training epochs is 10 fewer than Decouple, we obtain 1.5% gains compared with the fully decoupled training strategy cRT. We achieve further improvements with the same number of training epochs as Decouple (see S = 10 for 90 epochs and S = 40 for 200 epochs).

Table 2 Top-1 accuracy of ResNet-50 on iNaturalist 2018
Table 3 Top-1 accuracy of ResNet-50 on the large-scale long-tailed dataset ImageNet-LT

ImageNet-LT Table 3 presents results on the most challenging dataset, ImageNet-LT. The results of BBN are obtained using the authors' open-sourced codebase. From the table we see that our simple method with plain ResNet-50 and fewer training epochs (see S = 1) outperforms the current state of the art, about 0.6% higher than Decouple and 2.0% higher than BBN. With the same number of training epochs (see S = 10), our method obtains further improvements, about 0.9% higher than Decouple and 2.3% higher than BBN.

Fine-grained analysis To better validate our assumption that memorizing low-regularity samples with a small learning rate can avoid seriously damaging the representation of high-regularity samples, we further report accuracy on three splits of classes: Many-shot (more than 100 images), Medium-shot (20–100 images) and Few-shot (less than 20 images). As shown in Table 4, the standard training process (see Cross-Entropy with IB only) always performs best on Many-shot classes, since head-class samples dominate the training batches all the time. Meanwhile, our method improves the performance of tail classes by a large margin due to CR sampling in the last training stage. It is worth noting that, while greatly boosting the recognition of tail classes, our switching method only slightly damages the performance of head classes (compared with Cross-Entropy with CR only), indicating that memorizing tail-class samples with a small learning rate can better handle the trade-off between high-regularity head classes and low-regularity tail classes.
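
A minimal sketch of how such split-wise accuracies can be computed (assuming arrays of per-sample predictions and labels on the balanced test set, plus the per-class training counts; the helper is our own illustration) is shown below.

```python
import numpy as np

def splitwise_accuracy(preds, labels, train_counts):
    """Accuracy on Many- (>100), Medium- (20-100) and Few-shot (<20) class splits,
    where each class is assigned to a split by its training-set count."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    counts = np.asarray(train_counts)              # training images per class
    splits = {"many":   counts > 100,
              "medium": (counts >= 20) & (counts <= 100),
              "few":    counts < 20}
    correct = preds == labels
    return {name: float(correct[mask[labels]].mean())
            for name, mask in splits.items()}
```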

Table 4 Fine-grained results on the most skewed long-tailed CIFAR-100 (imbalance ratio: 100) and the most challenging ImageNet-LT, compared with the previous state-of-the-art

4.5 Ablation studies

4.5.1 Analysis on hyper-parameter S

To find the optimal setting of S, the hyper-parameter controlling when to switch, we investigate different values of S; the corresponding results are shown in Table 5. Interestingly, our method achieves comparable results for different values of S, indicating that S is neither dataset/distribution dependent nor sensitive. This is consistent with our motivation: memorization of tail classes does not seriously disrupt the learned representation when the learning rate is small. Thus, as long as CR sampling is used during the small-learning-rate stage, the model can jointly fine-tune both the feature extractor and the classifier to achieve better generalization, regardless of the specific value of S.

As S grows very large, the only difference between our method and regular SGD training is the late switching action itself. To simulate this situation, we enlarge S to 200 and 500. When S = 200, the classification performance at an imbalance ratio of 50 on long-tailed CIFAR-10 and CIFAR-100 is 82.2 and 48.1, respectively. Even when S = 500, our method still achieves 82.0 and 47.5. Considering that regular SGD only achieves 77.9 and 44.9, as also illustrated in Table 6, the superiority of switching from IB to CR at a small learning rate is evident.

Table 5 Determining the optimal S on long-tailed CIFAR-10 (imbalance ratio: 50) and CIFAR-100 (imbalance ratio: 50)

4.5.2 Combinations of sampling strategies

In order to find the optimal sampler combination before and after switching, we conduct comprehensive experiments on long-tailed CIFAR-10 (imbalance ratio: 50) with combinations of different data samplers used in different stages. As shown in Table 6, our strategy, which switches from instance-balanced sampling to class-reversed sampling in the small-learning-rate stage, achieves the best performance across all experimental settings. We draw the same conclusion as Decouple that instance-balanced sampling gives the most generalizable representations, since using instance-balanced sampling in the first stage always performs better than the alternatives. In addition, switching to class-reversed sampling always brings a significant improvement no matter which sampler is used in the first stage, except for class-reversed sampling on long-tailed CIFAR-100 with imbalance ratios 100 and 50 (see the last row in Table 6). We conjecture this is because class-reversed sampling cannot learn general representations on such extremely imbalanced data, since it mainly samples from the tail classes with low cardinality. Without a generalizable representation and without seeing samples from the other classes, the network cannot generalize well across all classes.

We also investigate progressive Switching in Table 7. For the setting of CB first and then CR, the results are almost the same as using CR only, showing that Switching is robust to the samplers used in earlier stages. However, CR first and then CB leads to a large drop in accuracy, indicating that memorizing low-regularity tail classes should happen in the last training stage, after high-regularity domain knowledge has been learned first; disregarding this order hurts performance.

Table 6 Comprehensive results on long-tailed CIFAR-10 (imbalance ratio: 50) with combinations of different data samplers used in different stages
Table 7 Determining the way of switching strategies on long-tailed CIFAR-10 (imbalance ratio: 50)

4.5.3 Comparing with decoupling paradigm

To further compare our method with Decouple, we investigate the factors of fixing the feature extractor and re-training the classifier for learning long-tailed distributions, which are adopted in Decouple. From the results shown in Table 8, the following observations can be made:

  • Joint training is better. Training the backbone and the classifier jointly always performs better than fixing the backbone. This indicates that although instance-balanced sampling gives the most generalizable representations, they are not good enough. Fine-tuning the backbone with low-regularity tail-class samples in the small-learning-rate stage can significantly improve its representation ability for tail classes.

  • Re-training matters when there is no switching. When training with the switching strategy, the results with and without re-training the classifier are very similar (see rows with CB or CR as the switching sampler). However, interestingly, re-training the classifier brings improvements in the standard training procedure (see rows with IB as the switching sampler). We speculate that a model trained with uniform instance-balanced sampling has a strong bias towards head classes in both the backbone and the classifier. Re-training the classifier based on the learned general representations can alleviate this.

  • Switching and joint training are complementary. Comparing the results of switching only with joint training only, we find that while switching samplers and joint training each bring improvements, their combination improves performance further. Fine-tuning with class-balanced or class-reversed distributions boosts the generalization ability further.

Table 8 Comparisons between the Decouple learning paradigm and our learning paradigm on long-tailed CIFAR-10 (imbalance ratio: 50), where Decouple indicates fixing the backbone and re-training the classifier from scratch while we continue to jointly train both

Further, we validate the quality of features learned by the standard training procedure and by our switching training procedure in Table 10, in the same way as Decouple. Although slightly lower with IB, re-training based on our features brings significant improvements compared with standard features. These results also indicate a disadvantage of Decouple: the performance of the re-trained classifier depends on the performance of the feature extractor. Once the feature representation is sub-optimal, the re-trained classifier is sub-optimal.

To validate that our method reaches a better balance under the bias-variance trade-off, we compute the total error of each method in Table 9. Our method (when S = 1) yields a lower upper bound on the generalization error, and hence higher test accuracy, lower bias, and lower variance, indicating that our switching algorithm handles this challenging trade-off better than the other methods.

Table 9 Total error (bias\(^{2}\)+variance) of different methods on the test set of long-tailed CIFAR-10 (imbalance ratio: 50)
Table 10 Feature quality of Decouple learning paradigm and our switching learning paradigm on long-tailed CIFAR-10 (imbalance ratio: 50)
Fig. 3

Test performance of three methods with different sampling strategies on long-tailed CIFAR-10 (imbalance ratio: 50) with SGD using stage-wise constant learning rate

4.6 Validation and visualization of our proposals

4.6.1 Learning speed

To further validate that our method learns long-tailed distributions more efficiently, we plot the per-epoch test accuracy of three methods with different sampling strategies in Fig. 3. Compared with using IB only, switching to CR immediately improves performance by a large margin. Meanwhile, although BBN achieves comparable performance to ours, it converges more slowly, since it optimizes the two branches of its feature extractor in turn during training.

Fig. 4

Learning speed of examples of 4 selected classes with SGD using a stage-wise constant learning rate. Left: standard training procedure. Right: our switching training procedure

Fig. 5

Learning speed of examples of 4 selected classes with SGD using constant learning rate with standard training strategy. The 4 different learning rates correspond to the constants used in the stage-wise scheduler

Fig. 6

Learning speed of examples of 4 selected classes with SGD using constant learning rate with our training strategy. The 4 different learning rates correspond to the constants used in the stage-wise scheduler

Table 11 Test performance of models trained with various learning rate schedulers on long-tailed CIFAR-10 (imbalance ratio: 50)

4.6.2 Learning rate scheduling

Intuitively, a training example from head classes should be learned quickly, since it is consistent with many others and the gradient steps for all consistent examples should be well aligned. As Jiang et al. (2020) indicate that strong regularities in a dataset are not only better learned at asymptote, leading to better generalization performance, but are also learned sooner in the time course of training, we conjecture that head-class samples are learned sooner than tail-class samples, and we plot the average proportion correct as a function of training epoch for each class to validate this.

Figure 4 shows the learning speed of 4 selected classes with SGD using stage-wise constant learning rate scheduling. In Fig. 5 we show the learning speeds of 4 selected classes trained with SGD using constant learning rate scheduling under the standard training procedure. The 4 panels show results for different values of the constant learning rate used in training. It can be observed that faster convergence is achieved with smaller learning rates (see 0.1, 0.02 and 0.01). When the learning rate is too small, e.g., 0.001, the learning speed of each class is significantly slowed down.

In Fig. 6 we show the learning speeds of our switching training procedure trained with SGD using constant learning rate scheduling. Similar to Fig. 5, a properly small learning rate accelerates convergence, with higher and more stable accuracy. It is worth noting that switching to the class-reversed sampler always improves the accuracy of tail classes, but damages the representative ability of head classes to some extent. Stage-wise constant learning brings the smallest damage to head-class representations, showing the necessity of building generalizable representations first. Quantitative results of both standard training and switching training are shown in Table 11.

Here we attempt to explain why the class-reversed sampler is effective. The reason switching to the class-reversed sampler performs well is that it delays the learning of low-regularity samples (tail-class samples) to the later, small-learning-rate stages. In the first stage, when mostly head-class samples are learned, the patterns and structures discovered in those high-regularity samples can be used to build a generalizable representation. In the later stage, the network is able to learn or memorize low-regularity samples of tail classes based on the representations from a clean subset of high-regularity samples. In addition, learning or memorizing tail-class samples does not seriously disrupt the learned representation, as the learning rate is much smaller than in the earlier stages. In contrast, the standard learning procedure without switching cannot focus on tail-class samples due to the extreme data imbalance, leading to under-representation of tail classes, while SGD with a (small) constant learning rate learns examples across all classes quickly, without first building a generalizable representation from the high-regularity samples of head classes.

5 Conclusion

In this paper, we investigate long-tailed visual recognition from a memorization-generalization point of view, which not only theoretically explains previous methods but also yields a simple yet effective Switching strategy that memorizes tail classes without greatly damaging the head classes. The implementation only involves switching from instance-balanced sampling to class-reversed sampling during the last few training epochs, which is clean and elegant. We further give the generalization error upper bound of different sampling strategies. Our empirical findings show the necessity of handling the trade-off between head-class representation and tail-class memorization in the memorization stage with a small learning rate.