1 Introduction

Image classification determines which category an input image belongs to according to a classification algorithm, and it is an important basis for object detection [13] and image segmentation [20] tasks. In recent years, with the development of deep learning, image classification has made remarkable progress and has surpassed human-level performance in many applications [23].

Recent research has shown that convolutional neural networks are not as good at learning shape features as they are at learning texture features. Shape features typically have to be learned through methods such as style transfer, which must be customized for each dataset. To address the difficulty neural networks have in learning shape features, and to simplify the training process, we propose DuFeNet, which improves accuracy and increases the shape bias of neural network models without modifying the structure of the original network, at a small additional parameter cost. This study aims to effectively alleviate the problem of texture bias. Our method does not require complex operations to change the dataset, and our experiments show that it also improves accuracy.

The remainder of this paper is organized as follows: Section 2 reviews the study of neural network structures and the effect of gradient information in image classification networks. Section 3 introduces the overall framework and the components of DuFeNet. Section 4 presents the experimental details, a quantitative analysis of DuFeNet, an evaluation of its effect on shape bias using a dedicated dataset, a visual analysis with the Grad-CAM [26] method, and the application of the method to a detection task. Section 5 summarizes the paper and outlines future work.

2 Related work

2.1 Deep neural network

Since AlexNet [17] won the 2012 ImageNet competition [3], many top teams have worked on computer vision applications based on convolutional neural networks, and different network structures have been designed to improve accuracy. VGGNet [27] increases the depth of convolutional neural networks. Kaiming He et al. propose ResNet [8], which adds shortcut connections so that features from early layers can be reused. Jie Hu et al. put forward SE blocks [11], which adaptively recalibrate channel-wise feature responses. While network accuracy has kept improving, some researchers have worked on enabling neural networks to run on mobile devices, leading to MobileNet [9, 10, 25], ShuffleNet [21, 31], and other architectures. These lightweight network structures greatly reduce parameters and computation without losing much accuracy. Since the design of network structures relies heavily on experience and prior knowledge, some researchers have begun to study algorithms that design network architectures automatically [24], using reinforcement learning, evolutionary algorithms, and related techniques to find the best network structure. In addition, some researchers have applied the transformer, a type of deep network based mainly on the self-attention mechanism, to computer vision and achieved results comparable to CNN methods [7].

2.2 The effect of gradient information

It is still important to understand how convolutional neural networks work; once we find the rules they follow, we can optimize them accordingly. A common explanation is that successive convolution and pooling layers learn to represent edges and shapes in images [15]. Shape is a form of gradient information, and it is well known that, compared with other cues such as size or texture, shape is the most important cue for human object recognition [19]. However, recent studies have shown that convolutional neural networks may exhibit texture and color biases in their representations [18, 28], which contradicts people's intuition about image content and human vision. For example, Robert Geirhos et al. [6] put these conflicting hypotheses to a quantitative test by comparing convolutional neural networks and human observers on images with conflicting texture and shape cues. The outcome shows that convolutional neural networks trained on the ImageNet dataset [3] are strongly biased towards recognizing textures. Figure 1 shows the results of three experiments they conducted. They also attempt to increase shape bias by using a style-transfer method that removes texture cues from the data, in order to train more robust and accurate networks. Brochu [1] uses domain-adversarial training to further increase the shape bias. Table 1 summarizes the characteristics of the different methods.

3 Methods

To solve the problem of neural networks in learning shape features, we propose DuFeNet. In this section, we introduce the components of DuFeNet and explain why the proposed structure improves model performance. First, we introduce the overall framework of DuFeNet. Then we present the details of the Gradient Branch and the Texture Branch, which extract different kinds of features from the gradient image and the original image, respectively. Finally, we present the Fusion Block and the special training method.

Fig. 1 Accuracies and example stimuli for three different experiments without cue conflicts [6]. The numbers on the histogram represent the accuracy rates of the different networks and methods

Table 1 Characteristics of different methods
Fig. 2 The design of DuFeNet, which mainly includes three parts: the Gradient Branch, the Texture Branch, and the Fusion Block

3.1 Overall framework design

As shown in Fig. 2, the proposed framework includes three parts: the Gradient Branch (GB), the Texture Branch (TB), and the Fusion Block (FB). The input of the Gradient Branch is a single-channel image containing only edge features, obtained by an edge detection algorithm such as Sobel [5], Prewitt [29], or Canny [2]. The main function of the Gradient Branch is to teach a convolutional neural network to classify an image given only its edge features. The Texture Branch uses a conventional convolutional neural network to learn from the original image. The Fusion Block combines the Gradient Branch and the Texture Branch: its input is formed by fusing the feature vectors taken just before the fully connected layers of the two branches. The Fusion Block can be regarded as fusing the image gradient information (edge information) with the texture features, enhancing the sensitivity of the convolutional neural network to gradient information.

We denote the input of DuFeNet as X and its output as Y. The process of edge detection can be expressed as

$$\begin{aligned} X_{E}=G(X) \end{aligned}$$
(1)

where \(X_{E}\) and G denote the gradient image generated from the original image and the edge detection algorithm, respectively. In this paper, we use Canny [2] as the edge detection algorithm.
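
The following is a minimal sketch of Eq. (1) using OpenCV's Canny detector, which the paper names as its edge detection algorithm; the threshold values are illustrative assumptions, not values reported in the paper.

```python
import cv2

def gradient_image(path: str):
    """Compute X_E = G(X): a single-channel edge map from the input image."""
    x = cv2.imread(path, cv2.IMREAD_GRAYSCALE)  # read X as one channel
    x_e = cv2.Canny(x, 100, 200)                # thresholds are assumptions
    return x_e
```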

3.2 Gradient Branch and Texture Branch

The goal of the Gradient Branch is to enable a convolutional neural network to classify an image using only its gradient information (edge image). The process of the Gradient Branch can be described as follows:

$$\begin{aligned} F_\mathrm{gradient}=GB(X_{E},\mathrm{Para}_\mathrm{GB}) \end{aligned}$$
(2)

where \(\mathrm{Para}_\mathrm{GB}\) and \(F_\mathrm{gradient}\) denote the parameters of the convolutional neural network in the Gradient Branch and the gradient feature vector it produces, respectively. As shown in Fig. 2, the gradient feature vector is the yellow block on the right of the Gradient Branch. Since the gradient image contains much less information than the original image, a small number of convolutional layers suffices to learn it; in Fig. 2 we design a three-layer convolutional network without residual or other complex structures. As a small branch network, the Gradient Branch can be added to the original network at any time to improve accuracy and increase the shape bias of the model at the cost of a small amount of computation and parameters.
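
Below is a minimal PyTorch sketch of such a three-layer Gradient Branch; the channel widths, kernel sizes, and pooling are our assumptions, not the exact configuration of Fig. 2.

```python
import torch.nn as nn

class GradientBranch(nn.Module):
    """Three-layer convolutional branch for the gradient image X_E."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.BatchNorm2d(feat_dim), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),        # collapse to the vector F_gradient
        )

    def forward(self, x_e):                 # x_e: (N, 1, H, W) gradient image
        return self.features(x_e).flatten(1)  # (N, feat_dim)
```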

On the other hand, compared with the Gradient Branch, the Texture Branch uses a convolutional neural network to learn the richer texture information of the original image. The process of the Texture Branch can be described as follows:

$$\begin{aligned} F_\mathrm{texture}=TB(X,\mathrm{Para}_\mathrm{TB}) \end{aligned}$$
(3)

where \(\mathrm{Para}_\mathrm{TB}\) and \(F_\mathrm{texture}\) denote the parameters of the convolutional neural network in the Texture Branch and the texture feature vector it produces, respectively. The Texture Branch can use any existing convolutional architecture, such as Vgg19 [27], ResNet18 [8], MobileNetv2 [25], or ShuffleNetv2 [21], and thereby inherit the accuracy advantages of that architecture.
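
As a sketch of this idea, a Texture Branch can be built from torchvision's ResNet18 by dropping its final fully connected layer, so that the branch outputs the feature vector \(F_\mathrm{texture}\) rather than class logits; wrapping the backbone this way is our own illustration, not code from the paper.

```python
import torch.nn as nn
import torchvision.models as models

class TextureBranch(nn.Module):
    """ResNet18-based branch that outputs F_texture for the original image."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18()
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc
        self.out_dim = backbone.fc.in_features  # 512 for ResNet18

    def forward(self, x):                    # x: (N, 3, H, W) original image
        return self.features(x).flatten(1)   # (N, 512)
```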

Table 2 The advantage of DuFeNet

3.3 Fusion Block and special training method

The Fusion Block fuses the gradient features and texture features learned by the two branches. Concretely, it concatenates the feature vectors from the Gradient Branch and the Texture Branch and then applies a fully connected (Fc) layer. This process can be formulated as follows:

$$\begin{aligned} Y=FB(F_\mathrm{gradient},F_\mathrm{texture},\mathrm{Para}_\mathrm{FB}) \end{aligned}$$
(4)

where \(\mathrm{Para}_\mathrm{FB}\) denotes the parameters of the neural network in the Fusion Block. The Fusion Block achieves better performance by combining the two features: the Gradient Branch fully exploits the image gradient information that the Texture Branch does not learn sufficiently, and the Fusion Block combines this gradient knowledge with the texture-biased original network to complement its texture knowledge.
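
A minimal sketch of Eq. (4), assembling the two branch sketches above: the Fusion Block concatenates \(F_\mathrm{gradient}\) and \(F_\mathrm{texture}\) and maps them to class logits with a single Fc layer. The feature dimensions (128 and 512) match the assumptions in the earlier sketches.

```python
import torch
import torch.nn as nn

class DuFeNet(nn.Module):
    """Gradient Branch + Texture Branch joined by a Fusion Block (one Fc layer)."""
    def __init__(self, gb, tb, num_classes: int = 100):
        super().__init__()
        self.gb, self.tb = gb, tb
        self.fusion_fc = nn.Linear(128 + 512, num_classes)  # Para_FB

    def forward(self, x, x_e):
        f_gradient = self.gb(x_e)                    # Eq. (2)
        f_texture = self.tb(x)                       # Eq. (3)
        fused = torch.cat([f_gradient, f_texture], dim=1)
        return self.fusion_fc(fused)                 # Y
```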

The DuFeNet structure is clear and simple, but the training method is special. Training is divided into two stages. Stage 1: train the branches individually. Stage 2: freeze the branches and fine-tune the Fusion Block.

Stage 1: Train the two networks separately to obtain \(\mathrm{Para}_\mathrm{GB}\) and \(\mathrm{Para}_\mathrm{TB}\). The Gradient Branch uses only the gradient images as input, while the Texture Branch uses the original images. The expressions are as follows:

$$\begin{aligned} Y_\mathrm{GB} = Fc_\mathrm{GB}(GB(X_{E},\mathrm{Para}_\mathrm{GB})) \end{aligned}$$
(5)
$$\begin{aligned} Y_\mathrm{TB} = Fc_\mathrm{TB}(TB(X,\mathrm{Para}_\mathrm{TB})) \end{aligned}$$
(6)

where \(Y_\mathrm{GB}\) and \(Y_\mathrm{TB}\) represent the outputs of the individual training processes of the Gradient Branch and Texture Branch, respectively. \(Fc_\mathrm{GB}\) and \(Fc_\mathrm{TB}\) are the fully connected layers added after the Gradient Branch and Texture Branch for this stage.

Stage 2: After obtaining \(\mathrm{Para}_\mathrm{GB}\) and \(\mathrm{Para}_\mathrm{TB}\) in Stage 1, we load these parameters into DuFeNet and freeze all parameters before the Fusion Block. Finally, we fine-tune the Fusion Block so that the gradient and texture features are fully utilized and model performance improves.
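
A minimal sketch of Stage 2, reusing the classes from the sketches above: the Stage-1 weights are loaded, everything before the Fusion Block is frozen, and only the fusion layer is fine-tuned. The checkpoint file names and the optimizer choice here are illustrative assumptions.

```python
import torch

model = DuFeNet(GradientBranch(), TextureBranch())
model.gb.load_state_dict(torch.load("gb_stage1.pth"))  # hypothetical checkpoint
model.tb.load_state_dict(torch.load("tb_stage1.pth"))  # hypothetical checkpoint

for p in model.gb.parameters():
    p.requires_grad = False    # freeze Para_GB
for p in model.tb.parameters():
    p.requires_grad = False    # freeze Para_TB

# Only the Fusion Block parameters (Para_FB) are updated during fine-tuning.
optimizer = torch.optim.Adam(model.fusion_fc.parameters(), lr=1e-3)
```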

The structure involves no complication, since each part is distinct and the parts are not mixed together. However, it is worth noting that if we train DuFeNet end to end from scratch, performance does not improve. This may be because, when the two branch networks are trained at the same time, their internal parameters affect each other and the texture and gradient information can no longer be learned separately. The two-stage method preserves the characteristics that each branch learns on its own.

4 Results

In this section, we describe the experimental details, run quantitative experiments to find out which factors most improve model accuracy, and use a dedicated dataset to verify the increase in shape bias. To make the observations more intuitive, we visualize the results with the Grad-CAM method [26]. In addition, we apply the method to a detection experiment.

4.1 Experiments settings

The proposed method is evaluated on cifar100 [16], a common dataset. In the Texture Branch, we use ResNet18 [8], MobileNetv2 [25], Vgg19 [27], and ShuffleNetv2 [21], while we design a network with only a few layers for the Gradient Branch. All experiments are implemented in PyTorch [22] under Ubuntu on one RTX 2080 GPU. Optimization is performed with the Adam algorithm [14] using a learning rate of 0.001 and a batch size of 300. All models are trained for 200 epochs from scratch. The Canny algorithm is implemented with the OpenCV framework. We save the model weights with the highest accuracy on the validation set.
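
For concreteness, the following is a minimal sketch of this training setup for a single branch with its Fc head (here a ResNet18 standing in for the Texture Branch); the data transform is an assumption, and augmentation or evaluation code is omitted.

```python
import torch
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader

transform = T.Compose([T.ToTensor()])  # minimal transform; an assumption
train_set = torchvision.datasets.CIFAR100("./data", train=True,
                                          download=True, transform=transform)
loader = DataLoader(train_set, batch_size=300, shuffle=True)

model = torchvision.models.resnet18(num_classes=100)  # branch + Fc head
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(200):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```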

4.2 The advantage of DuFeNet

This subsection verifies the superiority of DuFeNet and analyzes how the depth and width of the network in the Gradient Branch affect both its own accuracy and the accuracy it brings to DuFeNet. We adopt different networks in the Texture Branch and in DuFeNet to test the advantage of the Gradient Branch. For the Gradient Branch, we use the three-convolutional-layer structure shown in Fig. 2. Table 2 reports the results for the different networks, evaluated with four metrics: Paras, Mult-Adds, PrecisionT1, and PrecisionT5.

From Table 2, DuFeNet improves the accuracy of every network structure: the T1 gain ranges from 0 to 3% and the T5 gain from 0 to 2.2%, while the parameters grow by only about 0.4M and the computational cost by about 39M Mult-Adds. We conclude that the Gradient Branch suits different network structures; its small increase in parameters and computation meets the requirements of a small component that can be added to a network at any time to increase accuracy.

We then analyze the Gradient Branch itself. As shown in Table 3, we set up five different networks: Anet, Bnet, Cnet, A2net, and A4net.

Table 3 Different setups of neural network in Gradient Branch
Table 4 The accuracy of Gradient Branch

First, we analyze how the depth and width of the network in the Gradient Branch affect its own accuracy. Table 4 shows that accuracy keeps increasing as the network widens or deepens. It also shows that with six convolutional layers (Cnet), the Gradient Branch essentially reaches the accuracy of ResNet18, which indicates that the gradient image does not need many parameters to learn: a network with only six convolutional layers can capture most of its internal information.

Table 5 Impact of different combinations

However, our method combines the otherwise ignored edge features with the Texture Branch instead of using only the gradient information. We combine different networks in the Gradient Branch and the Texture Branch to see how the combinations affect the accuracy of DuFeNet. Table 5 reports the results: the first column lists the network selected for the Gradient Branch, the first row the network selected for the Texture Branch, and the second row the results of the Texture Branch alone.

Fig. 3 The increase in shape bias of DuFeNet. On the left is a brief flow chart of DuFeNet, whose output is compared with the results of the Texture Branch and the Gradient Branch. On the right are partial results on the special dataset [6]. The label is below each test picture, and the histogram shows the probability of the true label in the network outputs

Table 5 shows that different Texture Branch networks benefit from the Gradient Branch to different degrees: Vgg19 improves the least and ShuffleNetv2 the most. Comparing network widths with Anet, A2net, and A4net, we find that width has little effect on accuracy. Comparing network depths with Anet, Bnet, Cnet, and ResNet18, depth has a large impact: the deeper the network, the greater the accuracy improvement. Cnet and ResNet18 bring the greatest gains, and the fact that a many-layered network such as ResNet18 improves little over Cnet shows that the returns of additional depth are limited; Cnet, with six convolutional layers, already learns the edge features of the image well. Other researchers can therefore balance model accuracy against inference speed when choosing a Gradient Branch structure for their original network.

4.3 Increase shape bias of neural network

We use three networks trained on the cifar100 dataset: ResNet18 + Cnet (DuFeNet), ResNet18 (single TB), and Cnet (single GB). We then use the dataset provided in [6] (restricted to the categories contained in cifar100) to explore the influence of the Gradient Branch on shape bias by comparing the probability of the correct category, obtained by applying a softmax to \(Y_\mathrm{TB}\), \(Y_\mathrm{GB}\), and Y. \(Y_\mathrm{TB}\) and \(Y_\mathrm{GB}\) denote the outputs of the individually trained Texture Branch and Gradient Branch, respectively, and Y denotes the output of DuFeNet. Figure 3 shows the results for some of the images.
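
The comparison reduces to reading the true label's probability off the softmax of each network's logits, as in this minimal sketch:

```python
import torch
import torch.nn.functional as F

def true_label_probability(logits: torch.Tensor, label: int) -> float:
    """Probability of the true label; logits is (1, num_classes), e.g. Y_TB."""
    probs = F.softmax(logits, dim=1)
    return probs[0, label].item()
```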

Fig. 4 Visual presentation by Grad-CAM. The first row shows the original images. The second row shows Grad-CAM and Guided Grad-CAM pictures generated from the network in the Texture Branch. The third row shows Grad-CAM and Guided Grad-CAM pictures generated from the network in the Gradient Branch, together with binary images generated by Canny edge detection

Figure 3 shows that only a few outputs reach a high probability, because the models are not trained on this special data domain (silhouettes); nevertheless, the differences in the networks' degree of shape bias are clearly visible, and they vary widely across categories. For example, on the Bottle pictures the results of DuFeNet are better than those of the Texture Branch, while for the other labels the Gradient Branch is best, the Fusion Block second, and the Texture Branch worst. We conclude that the accuracy impact of DuFeNet varies across categories. We also find that the true-label probability of the Texture Branch is 0–1% in most cases, which indicates that the Texture Branch indeed learns mostly texture features and ignores the gradient (edge) features in the image.

Table 6 AP for different backbones
Fig. 5 AP for different categories in VOC

To make the role of the Gradient Branch easier to understand, we visualize some pictures from the cifar100 dataset using the Grad-CAM method [26]. The method uses the gradients of any target concept to produce a coarse localization map that highlights the image regions important for predicting that concept. It thereby provides 'visual explanations' for the decisions of convolutional neural network models, making them more transparent and easier to understand. The results are shown in Fig. 4.

In Fig. 4, the Grad-CAM pictures show that the Gradient Branch relies more on global features to identify objects, since its red regions in the heat maps are larger than those of the Texture Branch. This matches the fact that edge features are generally sparse in an image, not dense like textures. The Guided Grad-CAM pictures show that the Gradient Branch is more in line with intuitive human perception, while the Texture Branch favors other cues such as textures. All these results show that the Gradient Branch effectively extracts edge features and can classify images from edge features alone.
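
For readers unfamiliar with Grad-CAM, the following is a minimal, self-contained sketch in the spirit of [26]; it is our own simplified implementation, not the code used for Fig. 4. For a torchvision ResNet18, a sensible target layer would be, e.g., model.layer4[-1].

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Coarse localization map for class_idx; image is (1, C, H, W)."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(v=go[0]))

    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()          # gradient of the target class score
    h1.remove(); h2.remove()

    weights = grads["v"].mean(dim=(2, 3), keepdim=True)   # channel importance
    cam = F.relu((weights * acts["v"]).sum(dim=1)).detach()
    return cam / (cam.max() + 1e-8)          # (1, h, w), normalized to [0, 1]
```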

4.4 Detection experiment

To highlight the breadth and practicality of the method, we apply it to a detection task. We change the backbone of CenterNet [32], a SOTA detection algorithm, and evaluate on Pascal VOC [4], a popular detection dataset. First we use TB (ResNet18) and GB (MobileNetv2) separately as backbones, training on the VOC 2007 and VOC 2012 trainval sets and testing on the VOC 2007 test set at 384\(\times \)384 resolution. We then combine TB and GB to obtain the DuFeNet backbone. Unlike the classification task, which concatenates two feature vectors, the detection variant concatenates the two feature maps from GB and TB before the up-sampling stage. We use a batch size of 32, the Adam algorithm, and a learning rate of 1.25e-4, and train for 70 epochs with the learning rate dropped 10\(\times \) at epochs 45 and 60, based on the project [12]. Table 6 shows AP at an IoU threshold of 0.5 and AP for small, medium, and large objects. GB alone also works as a detection backbone, since its loss function converges and it reaches 42.19 AP50, and DuFeNet improves AP on large, medium, and small objects. Figure 5 shows the AP for the different categories in VOC. For categories with relatively simple internal textures, such as boat and cow, our method greatly increases AP. This also provides a new idea for optimizing object detection algorithms.
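
A minimal sketch of the detection variant follows: the GB and TB feature maps are concatenated along the channel dimension before CenterNet's up-sampling stage. The channel counts, the 1x1 projection, and the assumption that both branches produce maps of the same spatial size are our own illustrative choices.

```python
import torch
import torch.nn as nn

class DuFeBackbone(nn.Module):
    """Fuses GB and TB feature maps for a CenterNet-style detection head."""
    def __init__(self, gb_features, tb_features, gb_ch=128, tb_ch=512):
        super().__init__()
        self.gb, self.tb = gb_features, tb_features
        self.project = nn.Conv2d(gb_ch + tb_ch, tb_ch, kernel_size=1)

    def forward(self, x, x_e):
        f_tb = self.tb(x)                    # (N, tb_ch, h, w)
        f_gb = self.gb(x_e)                  # (N, gb_ch, h, w); same h, w assumed
        fused = torch.cat([f_gb, f_tb], dim=1)
        return self.project(fused)           # fed to the up-sampling stage
```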

5 Conclusion

In this paper, we propose DuFeNet, a novel network consisting of three parts: the Gradient Branch, the Texture Branch, and the Fusion Block. DuFeNet uses dual networks to extract diverse features and boost the representational ability of the learned features for classification. We find that the gradient information of an image not only increases the shape bias of neural networks but also strengthens their learning ability, improving the accuracy of the classification network and enhancing its robustness, and DuFeNet makes good use of this gradient information. The method verifies that the texture bias of a model can also be alleviated by changing the network structure. Extensive experiments demonstrate the high performance and computational efficiency of DuFeNet; an original network can easily be converted to the DuFeNet structure with few extra parameters and little extra computation to obtain better performance. In future work, since gradient information is relatively sparse in an image, convolution methods that rapidly expand the receptive field, such as dilated convolution [30], could be used to quickly capture shape features and further simplify the network structure of the Gradient Branch, achieving the same accuracy with even fewer parameters and computations.