1 Introduction

In the past few years, Convolutional Neural Networks (CNNs) have achieved remarkable success in image classification thanks to their strong feature extraction capability. However, unlike traditional image classification, where categories differ greatly in morphology, Fine-Grained Visual Categorization (FGVC) focuses on distinguishing subordinate-level categories within the same basic-level category, e.g., different kinds of birds [29], cars [17], dogs [15], and aircraft [20]. FGVC is more challenging than traditional image classification because the intra-class variance can be much higher than the inter-class variance. A CNN cannot correctly distinguish extremely similar-looking categories unless it extracts subtle and discriminative features.

As shown in recent works [7, 25, 34], attending to multiple discriminative parts plays a vital role in FGVC. Early works [2, 3, 32] employed extra manual bounding-box or part annotations to extract discriminative features from multiple object parts. Recent efforts [34, 39] use only class labels to automatically localize object parts. Ding et al. [7] and Sun et al. [25] show that, without external guidance, CNNs [13, 27, 31] usually excel at extracting the most discriminative feature but ignore crucial complementary information. Recently, studies of translation invariance [1, 24, 36] in CNNs indicate that a small translation or rescaling of the input image can drastically change the prediction of a deep network, meaning that a fixed network focuses on different parts and extracts different features when the object is panned or zoomed.

In this paper, a novel framework named the "forcing network" (F-Net) is proposed to address the challenges of FGVC. F-Net obtains diverse and enhanced features through a forcing module that consists of an original branch and a forcing branch. The original branch generates class activation maps (CAM) to localize the most discriminative parts. The forcing branch generates a suppressive mask that suppresses the primary discriminative regions and forces the network to attend to secondary discriminative regions, which are usually overlooked because the network concentrates on the primary ones. After gradient back-propagation, enhanced features are extracted for the classifiers. To reduce prediction error, subtle regions are magnified: according to the CAM, the object is cropped and zoomed to form a second input for a second prediction. The first and second prediction probabilities are fused as the final result. In the training phase, the most discriminative region of the cropped image is dropped to force the network to attend to more regions.

Our main contributions can be summarized as follows:

  • We propose a novel "forcing network" structure. The forcing branch is introduced as an auxiliary branch that forces the network to focus on multiple regions and to extract diverse features, including primary discriminative features and confusion features, for fine-grained visual categorization.

  • Based on class activation maps, the object is cropped to the center of the image so that subtle regions are magnified for a second prediction. The sum of the two prediction probabilities serves as the final prediction.

  • Comprehensive experiments were carried out on widely used fine-grained benchmarks, including CUB-200-2011, FGVC-Aircraft, and Stanford-Cars. The results demonstrate that our method outperforms the majority of existing methods and achieves state-of-the-art performance on FGVC-Aircraft.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed method. Section 4 presents the experimental results. Section 5 concludes the paper and discusses future work.

2 Related work

In this section, we briefly review the related works of fine-grained visual categorization.

For FGVC, traditional image classification methods were used in the earliest stage. A common pipeline extracted SIFT [21] features and then clustered them with K-nearest [22] or other clustering methods; such approaches are computationally complex and time-consuming. A variety of methods have since been proposed for FGVC. Early works [2, 3, 32, 38] adopted cumbersome and expensive manual bounding-box or part annotations. Later, these annotations were replaced by extracting features of multiple or discriminative parts in a weakly supervised way. MA-CNN [39] generated multiple parts by clustering, weighting, and pooling spatially correlated channels and then classified an image by each individual part; however, it takes a long time to train and has low accuracy. NTS-Net [34] adopted self-supervision to effectively localize informative regions without bounding-box or part annotations. DCL [5] partitioned the input image into local regions and shuffled them to form a destructed sample; recognizing the destructed image forces the network to pay more attention to discriminative regions, and an adversarial learning module prevents the network from overfitting to the noisy features caused by the random scrambling. S3N [7] collected peaks from class response maps to estimate discriminative and complementary receptive fields and learned a set of sparse attentions to capture subtle, fine-detailed visual evidence while preserving content information. DB [25] finds subtle differences between similar-looking categories by suppressing the most prominent discriminative regions in class activation maps during training; by randomly suppressing the feature responses of different regions, it enables the network to notice multiple regions at inference and achieves higher accuracy with fewer parameters.

The bilinear CNN model [18] is another effective stream for FGVC: the outputs of two CNN branches are multiplied using the outer product at each location of the image and pooled to obtain bilinear vectors as features for the classification layer. Following its impressive performance, several improved bilinear models were proposed. TASN [40] proposed trilinear attention sampling to learn subtle feature representations from hundreds of part proposals. Gao et al. [9] proposed compact bilinear pooling, and low-rank bilinear pooling [16] applied a low-rank bilinear classifier, both reducing computation time and parameter memory. HBP [35] applied bilinear pooling between different layers to enable inter-layer feature interaction. Xiong et al. [33] proposed an efficient framework for RGB-D scene recognition that adaptively selects important local features to capture the large spatial variability of scene images. Wang et al. [30] presented a multiscale representation for scene classification realized by a global-local two-stream architecture.

FGVC has been improved by various other methods as well. MAMC [26] leveraged metric learning to learn multiple relevant parts by pulling positive features closer while pushing negative features away. API-Net [41] recognized a pair of fine-grained images through pairwise interaction. In MC-Loss [4], each class is predicted by a specific number of channels, and each group consists of a discriminative component and a diversity component. GCL [31] proposed a criss-cross graph propagation sub-network to learn region correlations. MGE-CNN [37] trained several experts to classify the image, each expert learning with prior knowledge from the previous one, and finally used a gating network to determine the contribution of each expert. DB [25] also proposed a gradient-boosting loss that seeks to resolve ambiguities among closely related classes.

Our method obtains diverse features, containing both primary discriminative features and confusion features, by enhancing secondary discriminative regions. Compared with random suppression, suppressing the primary discriminative regions in the class activation maps forces the network to pay more attention to confusing regions that are usually overlooked because the network concentrates on the primary discriminative regions. Compared with multi-network frameworks, the first and second predictions in our method share the same network, and we only add an extra convolutional layer on top of a backbone such as ResNet-50 [13]. Since the object is panned and magnified in the second input, the network focuses on parts different from those of the first prediction. In the training phase, we use the average of the first and second losses as the final loss, which reduces the loss oscillation caused by a wrong first prediction.

3 Methodology

In this section, the F-Net and the CAM-based cropping module are described in detail; the overall architectures of the two modules are illustrated in Figs. 1 and 2, respectively. F-Net consists of two components: the feature extracting module and the forcing module. The feature extracting module is the convolutional backbone of ResNet-50 [13]. The forcing module and the CAM-based cropping module are described in detail in Sections 3.1 and 3.2, respectively. To acquire class activation maps conveniently, the fully connected classification layer is replaced with a 1 × 1 convolutional layer whose number of output channels equals the number of classes. Given an input image, the feature maps for classification are produced by the feature extracting module. We denote the extracted feature maps as \(F\epsilon R^{N\times W \times H}\), with height H, width W, and number of channels N.
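To make this concrete, the following PyTorch sketch shows one way such a backbone could be set up; the class name CAMBackbone and the torchvision weights argument are our illustrative choices, not the authors' released code.

```python
import torch.nn as nn
from torchvision import models

class CAMBackbone(nn.Module):
    """Hypothetical sketch: ResNet-50 convolutional backbone whose fully
    connected layer is replaced by a 1x1 convolution with one output
    channel per class, so the output doubles as class activation maps."""
    def __init__(self, num_classes=200):
        super().__init__()
        resnet = models.resnet50(weights="IMAGENET1K_V1")
        # keep all convolutional stages, drop avgpool and fc
        self.features = nn.Sequential(*list(resnet.children())[:-2])
        self.classifier = nn.Conv2d(2048, num_classes, kernel_size=1)

    def forward(self, x):
        feat = self.features(x)        # F in R^{N x W x H}, N = 2048
        cam = self.classifier(feat)    # M' in R^{C x W x H}
        logits = cam.mean(dim=(2, 3))  # global average pooling -> V in R^C
        return feat, cam, logits
```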

Fig. 1

Overview of F-Net. F-Net consists of the feature extracting module and the forcing module. The feature extracting module comprises the convolutional layers that extract features. The forcing module contains the original branch and the forcing branch

Fig. 2

Overview of the CAM-based cropping module. This module crops the object region to the center of the input image for the second prediction. The first and second prediction probabilities are fused as the final result

3.1 Forcing module

The proposed forcing module is inspired by DB [25]. It aims to force the network to extract more diverse features for the classifier and consists of an original branch and a forcing branch. The two branches share the same feature extraction model, but their inputs are different. The goal of the original branch is to generate the class activation maps and localize the primary discriminative regions. After the feature maps F are convolved by the 1 × 1 convolutional layer, class activation maps \(M^{\prime }\epsilon R^{C\times W \times H}\) are obtained, where W, H, and C represent the feature width, feature height, and number of classes, respectively. Global average pooling is then performed to obtain the prediction vector \(V\epsilon R^{C}\), and the predicted class activation map \(M^{\prime }_{p} \epsilon R^{W \times H}\) is the map indexed by p, the position of the maximum of V:

$$ V = g(M^{\prime}), $$
(1)

where g(⋅) denotes global average pooling. For the forcing branch, \(M^{\prime }_{p}\) is used to generate a mask that suppresses the top-k discriminative positions of F. Since the top-k positions are suppressed, the forcing branch has to attend to other, confusing positions. The input of the forcing branch is generated as follows. First, \(M^{\prime }_{p}\) is reshaped to a vector of length WH and sorted in descending order; the k-th value T is then taken as the threshold:

$$ T = Sort(M^{\prime}_{p})[k], $$
(2)

where Sort(⋅) denotes sorting in descending order, [⋅] denotes indexing into the sorted vector, and k is a hyperparameter that specifies the number of suppressed positions. Let B be the suppressive mask derived from \(M^{\prime }_{p}\) such that:

$$ B(i,j)=\left\{ \begin{array}{rcl} \alpha & & {M^{\prime}_{p}(i,j)\geq T}\\ 1 & & {M^{\prime}_{p}(i,j)<T} \end{array},\right. $$
(3)

where i and j index the row and column of the feature position, respectively, and α is a hyperparameter that denotes the suppressing factor. Finally, the input of the forcing branch \(G\epsilon R^{N\times W\times H}\) is obtained as follows:

$$ G= B\odot F, $$
(4)

where ⊙ denotes element-wise multiplication of the two tensors (with B broadcast over the channels of F). After the classification convolution is applied, the output of the forcing branch \(M^{\prime \prime }\epsilon R^{C\times W \times H}\) is obtained. The output M of the forcing module is then:

$$ M = M^{\prime} + M^{\prime\prime} $$
(5)

The confidence scores are obtained by feeding M to global average pooling.
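A minimal sketch of the forcing-module computation in Eqs. (1)–(5) is given below, assuming a batched tensor layout and the shared 1 × 1 convolutional classifier described above; the function name and the default values k = 4 and α = 0.5 (the best settings reported in Section 4.4) are illustrative, not the authors' released code.

```python
import torch

def forcing_module(feat, cam, classifier, k=4, alpha=0.5):
    """Sketch of Eqs. (1)-(5). feat: F with shape [B, N, H, W];
    cam: M' with shape [B, C, H, W]; classifier: the shared 1x1 conv."""
    logits = cam.mean(dim=(2, 3))                  # V = g(M')            (Eq. 1)
    p = logits.argmax(dim=1)                       # predicted class index
    m_p = cam[torch.arange(cam.size(0)), p]        # M'_p, shape [B, H, W]

    flat = m_p.flatten(1)                          # reshape to length W*H
    T = flat.sort(dim=1, descending=True).values[:, k - 1]   # k-th value (Eq. 2)
    mask = torch.where(m_p >= T.view(-1, 1, 1),    # suppressive mask B   (Eq. 3)
                       torch.full_like(m_p, alpha),
                       torch.ones_like(m_p))

    g = feat * mask.unsqueeze(1)                   # G = B ⊙ F            (Eq. 4)
    cam_forcing = classifier(g)                    # M''
    return cam + cam_forcing                       # M = M' + M''         (Eq. 5)
```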

3.2 CAM-based cropping module

The CAM-based cropping module crops the object to the center of the image for a second inference. The first prediction usually focuses on the most obvious regions, while the second prediction can attend to subtle regions that are amplified by the CAM-based cropping module. The summation of the raw prediction and the second prediction serves as the final prediction. Here, we explain the cropping procedure. In the forcing module, we described the generation of the class activation maps \(M\epsilon R^{C\times W\times H}\). Since the top-1 map usually responds strongly on only part of the object, while the high-response regions of the other maps cover other parts of the object, we use the top-8 maps rather than the top-1 map to crop the whole object.

Denote by \(M_{p}\epsilon R^{W\times H}\) the element-wise summation of the top-8 maps. \(M_{p}\) covers both the object and the background, so a threshold t is set to distinguish between them. t is generated as follows:

$$ m = max(M_{p}), $$
(6)
$$ t = m\times d, \quad \text{where } d\sim rand(0.4,0.6), $$
(7)

where m is the maximum of \(M_{p}\). Because of the diversity of samples, d is drawn from the uniform distribution between 0.4 and 0.6 in the training phase. In the test phase, d is fixed to 0.4, the minimum of this range, to ensure the whole object is cropped. The crop mask B2 is then obtained as follows:

$$ B_{2}(i,j)=\left\{ \begin{array}{rcl} 1 & & {M_{p}(i,j) \geq t}\\ 0 & & {M_{p}(i,j) < t} \end{array}, \right. $$
(8)

Response values greater than or equal to t are treated as object, and the rest as background. We generate a bounding box that covers all positions equal to 1 in B2 and crop the corresponding region from the raw image as the second input. In the training phase, the most discriminative parts are additionally dropped from the second input; note that no parts are dropped in the test phase. The drop mask is obtained as follows:

$$ m_{1} = max(M_{1}), $$
(9)
$$ E(i,j)=\left\{ \begin{array}{rcl} 0 & & {M_{1}(i,j) \geq m_{1} \times 0.75}\\ 1 & & {M_{1}(i,j) < m_{1} \times 0.75} \end{array}, \right. $$
(10)

where M1 is the top-1 map of M and m1 is the maximum of M1. As training progresses, the size of the high-response area keeps changing, so the threshold is set to a fixed value of 0.75 instead of a random value. Positions where E equals 0 are dropped from the second input.
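The cropping and dropping procedure of Eqs. (6)–(10) could be sketched for a single image as follows; the bilinear resizing back to the input resolution, the nearest-neighbour upsampling of the drop mask, and the function name cam_crop are our assumptions about details the text leaves open.

```python
import torch
import torch.nn.functional as F

def cam_crop(image, cam, training=True):
    """Sketch of Eqs. (6)-(10). image: [3, 448, 448]; cam: M with shape [C, H, W]."""
    scores = cam.mean(dim=(1, 2))
    m_p = cam[scores.topk(8).indices].sum(dim=0)              # sum of top-8 maps
    d = torch.empty(1).uniform_(0.4, 0.6).item() if training else 0.4
    t = m_p.max() * d                                         # threshold    (Eqs. 6-7)
    obj = m_p >= t                                            # crop mask B2 (Eq. 8)

    # bounding box covering all 1-positions of B2, mapped to image coordinates
    ys, xs = obj.nonzero(as_tuple=True)
    sy, sx = image.shape[1] / cam.shape[1], image.shape[2] / cam.shape[2]
    y0, y1 = int(ys.min() * sy), int((ys.max() + 1) * sy)
    x0, x1 = int(xs.min() * sx), int((xs.max() + 1) * sx)
    out_size = (image.shape[1], image.shape[2])
    crop = F.interpolate(image[:, y0:y1, x0:x1].unsqueeze(0), size=out_size,
                         mode="bilinear", align_corners=False)[0]

    if training:                                              # drop mask E (Eqs. 9-10)
        top1 = cam[scores.argmax()]
        keep = (top1 < top1.max() * 0.75).float()             # 0 on the top-1 peak region
        keep = F.interpolate(keep[None, None], size=out_size, mode="nearest")[0, 0]
        keep = F.interpolate(keep[None, None, y0:y1, x0:x1],  # align the mask with the crop
                             size=out_size, mode="nearest")[0, 0]
        crop = crop * keep
    return crop
```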

3.3 Multi-prediction model

The multi-prediction model makes two predictions for the same image: the first and second predictions share the same network, but their inputs differ. The second input also differs between the training phase and the test phase. The multi-prediction process is shown in Fig. 2, with the training phase indicated by the blue arrows. In the training phase, the original image I is first fed to the network to obtain the prediction Prob1. According to the class activation map from the first prediction, the object in I is cropped and enlarged by linear interpolation, with the cropping factor d drawn uniformly between 0.4 and 0.6; the primary feature region is then dropped to obtain the second input image I2. After passing through the same network, the second prediction Prob2 is obtained. Both Prob1 and Prob2 are probability vectors produced by softmax, and the final prediction P is calculated as follows:

$$ P = (Prob_{1} + Prob_{2}) \div 2 , $$
(11)

For this classification task, the cross-entropy function is used as the loss function; it is calculated as follows:

$$ L = - \sum\limits_{i} y_{i}\log{\hat{y_{i}}} , $$
(12)

where yi is the true label, \(\hat {y_{i}}\) is the predicted probability, and i is the category index, taking values from 0 to C − 1. During training of the multi-prediction model, the final prediction P is not used to compute the loss; instead, the loss is computed from the first prediction Prob1 and the second prediction Prob2 as follows:

$$ Loss = \left( -\sum\limits_{i} y_{i}\log{Prob_{1i}} - \sum\limits_{i} y_{i}\log{Prob_{2i}} \right) \div 2 , $$
(13)

The final loss is the average of the cross-entropy of the first prediction and the cross-entropy of the second prediction. The flow of the test phase is shown by the green arrows in Fig. 2. In the test phase, the original image I is first fed to the network to obtain the first prediction Prob1. According to the class activation map from the first prediction, the object in I is cropped and enlarged by linear interpolation with a fixed cropping factor d of 0.4, no feature dropping is performed, and the second input image I2 is obtained. After passing through the same network, the second prediction Prob2 is obtained. The final prediction P is calculated in the same way as in the training phase.
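Putting the pieces together, one hedged sketch of a training or test step is shown below. It reuses the hypothetical CAMBackbone and cam_crop sketches from earlier in this section, omits the forcing module for brevity (in F-Net the logits of each pass would come from the fused maps M of Section 3.1), and relies on the fact that torch.nn.functional.cross_entropy applies softmax internally, so averaging the two terms matches Eq. (13).

```python
import torch
import torch.nn.functional as F

def multi_prediction_step(model, images, labels, training=True):
    """Sketch of the two-pass scheme in Eqs. (11)-(13); `model` is assumed to
    return (features, class activation maps, logits) as in the earlier sketch."""
    _, cam1, logits1 = model(images)                   # first pass on the raw images
    second = torch.stack([cam_crop(img, c, training)   # crop (and drop, in training)
                          for img, c in zip(images, cam1)])
    _, _, logits2 = model(second)                      # second pass on the cropped images

    P = (logits1.softmax(dim=1) + logits2.softmax(dim=1)) / 2   # Eq. (11)
    loss = None
    if training:                                       # average of the two losses, Eq. (13)
        loss = (F.cross_entropy(logits1, labels) +
                F.cross_entropy(logits2, labels)) / 2
    return P, loss
```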

4 Experiments

In this section, we present comprehensive experiments that verify the effectiveness of F-Net. First, the three datasets used to evaluate our method and the implementation details are described in Sections 4.1 and 4.2. We then compare our model with other methods on three common fine-grained visual classification datasets in Section 4.3. Finally, we analyze the contribution of each component of the proposed framework in Section 4.4.

4.1 Datasets

We comprehensively evaluate our method with three challenging fine-grained datasets, including CUB-200-2011 [29], Stanford Cars [17], and FGVC Aircraft [20]. The detailed statistics with category numbers and data splits are shown in Table 1.

Table 1 Three common fine-grained visual classification datasets

4.2 Implementation details

In the following experiments, ResNet-50 [13] implemented in PyTorch [23] is adopted as the backbone, and the fully connected layer is replaced with a 1 × 1 convolutional layer whose number of output channels equals the number of classes. The feature extracting convolutional layers are initialized with ResNet-50 weights pre-trained on ImageNet [6], and the classification layer is initialized with Xavier initialization [11].

In the training phase, the images are resized to 512 × 512 and then randomly cropped to 448 × 448 with random horizontal flipping. The cropping factor d is randomly selected between 0.4 and 0.6 for every sample, and the dropping threshold is set to 0.75, as described in Section 3.2. We train our network using Stochastic Gradient Descent (SGD) with momentum 0.9, 100 epochs, weight decay 0.0001, and a mini-batch size of 6 on a GTX 2080 Ti (11 GB) GPU. The initial learning rate is 0.001 and is decayed by a factor of 0.1 at the 30th epoch. Source code is released at https://github.com/boxyao/Forcing-Network.
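For reference, the optimizer and augmentation settings above could be reproduced roughly as follows; the ImageNet normalization statistics, the use of MultiStepLR, and the reuse of the hypothetical CAMBackbone sketch from Section 3 are our assumptions.

```python
import torch
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.RandomCrop(448),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # assumed ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

model = CAMBackbone(num_classes=200)                   # hypothetical backbone from Section 3
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,   # decay by 0.1 at epoch 30
                                                 milestones=[30], gamma=0.1)
```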

4.3 Quantitative results

We do not use any manual annotations other than class labels. For fair comparison, our method is compared with methods that do not use human-defined bounding boxes or part annotations. We compare against various recent and top-performing methods on three challenging datasets: CUB-200-2011, FGVC-Aircraft, and Stanford-Cars. Table 2 reports the results on the three datasets.

Table 2 Comparison with the state-of-the-art on the CUB-200-2011, Stanford Cars, and FGVC Aircraft benchmarks

On CUB-200-2011, the ResNet-50 baseline achieves 85.4%, and our method outperforms it by 3.0%. A further improvement of 0.7% is observed when DenseNet-161 [14] is used as the backbone. Compared with MGE-CNN [37] based on ResNet-50, which uses multiple experts, we achieve almost the same accuracy by adding only an auxiliary classifier. Both our approach and DB [25] extract diverse features by feature suppression; although DB outperforms our method by 0.2%, our forcing module outperforms DB without the gradient-boosting loss by 1%. Our method can also be built on more recent backbones such as EfficientNet and ConvNeXt to further improve accuracy. Compared with API-Net based on DenseNet-161, EfficientNet, and AttNet&AffNet, our method with EfficientNet-B3 [28] as the backbone achieves improvements of 0.3%, 1.4%, and 0.5%, respectively.

On FGVC-Aircraft, the proposed F-Net achieves 93.3% with ResNet-50 and 94.4% with DenseNet-161 [14]. Among methods based on ResNet-50, our method outperforms most of them, with the exceptions of DB and AttNet&AffNet. Our method based on DenseNet-161 achieves state-of-the-art performance, outperforming API-Net [41] based on DenseNet-161 by 0.5%.

On Stanford-Cars, our method based on ResNet-50 obtains 94.5%, which is 2.8% better than the 91.7% baseline. A further improvement of 0.7% is observed when EfficientNet is used as the backbone. Compared with API-Net, our proposed method based on EfficientNet-B2 is very competitive.

In Fig. 3, we visualize the results of the forcing module. The high-response regions in the second column are marked with red boxes, and those in the third column with black boxes. In the first row, the highest-response area of the original branch is the bird's head, which is also the most distinguishable area; in the forcing branch, where the head features are partially suppressed, the network shifts its attention to the bird's tail and claws. In the output of the forcing module in the fourth column, the classification evidence therefore includes not only the head, the most distinguishable area, but also the tail and claws, the secondary distinguishable areas. In the second row, the original branch identifies the bird's head as the main distinguishable area, and the forcing branch identifies the feathers as the secondary one because the head is suppressed. In the third row, the original branch identifies the bird's wings and tail as the most distinguishable regions, while the forcing branch identifies the beak and neck as the secondary ones because the wings and tail are suppressed. In the forcing module output, the beak and neck are judged to be more important distinguishable regions than the tail and wings, which indicates that the forcing branch can correct misclassified results with a certain probability.

Fig. 3

Visualization of the forcing module. From left to right, each column shows the original image, the class activation map of the original branch, the class activation map of the forcing branch, and the class activation map of the final forcing module output that fuses the two branches. The high-response regions in the second column are marked with red boxes, and those in the third column with black boxes

In Fig. 4, we visualize the results of our method. The results show that the network attends to different parts of the raw input and the cropped input. Cases where the original prediction is wrong but the cropped prediction is correct indicate that the two-step strategy can reduce the errors of a single prediction.

Fig. 4

Visualization of our method. The examples to the left of the dotted line are ones where the first prediction is wrong while the second and final predictions are right; the examples to the right are ones where the first, second, and final predictions are all right. From left to right, each example shows the original image, the top-1 class activation map of the original image, the prediction on the original image, the cropped image, the top-1 class activation map of the cropped image, the prediction on the cropped image, and the summation of the two predictions

4.4 Ablation study

To analyze the contribution of the different components of our method, we conduct experiments on CUB-200-2011, Stanford-Cars, and FGVC-Aircraft using ResNet-50. Tables 3, 4 and 5 detail the contribution of each key component. Both the forcing branch and the cropped inference improve FGVC performance; across the three datasets, the CAM-based cropping module improves accuracy more significantly.

Table 3 Ablation analysis on the CUB-200-2011
Table 4 Ablation analysis on the Stanford-Cars
Table 5 Ablation analysis on the FGVC-Aircraft

Impact of forcing branch

The basic ResNet-50 with the forcing branch achieves 87.3%, 91.7%, and 88.5% top-1 accuracy on CUB-200-2011, Stanford-Cars, and FGVC-Aircraft, respectively. Since the primary discriminative regions of the features fed to the forcing-branch classifier are suppressed, the network has to focus on other equally important parts rather than only the primary discriminative part; in other words, the weight of the secondary discriminative regions in the extracted features is enhanced. In the inference phase, the CNN thus acquires diverse features, and the classifiers of the two branches attend to different parts. As a result, the forcing branch improves the accuracy of the backbone by 1.9%, 0.8%, and 1.2%, respectively.

CAM-based cropping module

Because the images are panned and rescaled, the prediction for the cropped image differs from that for the raw image. The visualization in Fig. 4 shows that the network attends to different parts when the object is panned or zoomed. On CUB-200-2011, double prediction improves the result from 85.6% to 88.1%; this 2.5% improvement shows that the second prediction can reduce errors. On top of double prediction, adding the forcing module leads to a further improvement of 0.3%. When the object is cropped to the center of the image, it appears clearer to the network, which then attends to more object parts, yet the forcing module still forces the network to focus on additional confusing parts and improves the result from 88.1% to 88.4%. Compared with the forcing module, the CAM-based cropping module improves accuracy more significantly.

Hyperparameters suppressing factor α and the number of suppressing positions k

The accuracy for different settings of k and α is shown in Tables 6 and 7. Because we suppress the top-k positions of the class activation maps, which are probably vital for classification, suppressing too many positions or setting an overly small α lowers accuracy. We first fix α to 0.5 and compare different values of k; k = 4 gives the best performance. We then fix k to 4 and compare different values of α; the experiments indicate that α = 0.5 gives the best performance on CUB-200-2011.

Table 6 Ablation study on the number of suppressing position k
Table 7 Ablation study on suppressing factor α

5 Conclusion

In this paper, we proposed a forcing network that focuses on multiple regions and extracts diverse features for fine-grained visual categorization. The final prediction combines the first prediction with a second prediction whose input is cropped based on the class activation maps of the first prediction, which reduces prediction errors. The forcing network requires no bounding boxes or part annotations and can be trained end-to-end. Our method outperforms the majority of FGVC methods on CUB-200-2011, FGVC-Aircraft, and Stanford-Cars and achieves state-of-the-art performance on FGVC-Aircraft. Although our method improves accuracy considerably, the suppressed region depends heavily on hyperparameters. We also plan to explore hybrid, ensemble-learning-based models [8] to further improve accuracy. Our future work will treat the hyperparameters as trainable model parameters to reduce this dependence while maintaining high accuracy.