1 Introduction

From the micro-perspective, digital media refers to information carriers that record, transmit and process information in binary form. Information carriers comprise sensory media, logical media and physical media [1]. Text, graphics, photos, sounds, video images and animation all fall under sensory media. The presentation formats used to represent these sensory media are referred to as logical media [2]. Physical media are used to store, transmit and display logical media. Digital media technology has developed rapidly and has penetrated all aspects of people's life, bringing obvious and profound changes to how people live, work and entertain themselves [3]. At present, the development of digital media technology shows clear trends. In terms of user environment, it has shifted from a single-user environment to multi-user and personalized user environments. In terms of operating environment, local environments have given way to distributed and remote environments [4]. In terms of media communication, one-way communication has given way to two-way communication. In addition, digital media technology will continue to develop toward distributed and networked multimedia systems [5].

With the development of the Internet and the arrival of the big data era, massive amounts of varied data are generated every day. Image data, which carries a great deal of information, has gradually become an important carrier of digital media information [6]. Processing image data therefore has great practical significance and a wide range of application scenarios. Fine-grained image recognition can be regarded as a sub-field of image recognition that aims to distinguish images with small inter-class differences [7]. The objects depicted in such images share a high degree of morphological similarity because they all belong to the same coarse category. In the real world, fine-grained image recognition can be applied in a wide variety of contexts. Even when they belong to different subclasses, the species identified in fine-grained images are very similar in appearance [8]. Conversely, instances of the same subclass are easily affected by factors such as different postures, different light intensities, little contrast between foreground and background, and occlusion, so their appearance often varies greatly. Therefore, fine-grained image recognition usually works by locating discriminative parts and extracting key features from them. Its precision is closely related to the expressive power of the image features and the correct selection of discriminative salient part regions. Traditional image features depend largely on human experience, and most of them describe low-level image properties [9]. They do not represent high-level semantic properties and are hard to adapt to individual datasets. Classification with traditional hand-crafted features therefore cannot achieve high recognition accuracy [10].

People have gradually abandoned hand-crafted features for image classification and recognition. A convolutional neural network can learn image features through multiple convolution layers of different sizes [11]. When an image is fed into a convolutional neural network, forward propagation is carried out. As the network deepens, the extracted features become increasingly high-level [12]. The network thus learns not only the concrete low-level properties of the image, but also the more abstract high-level semantic features that are crucial for image recognition, so it can be considered an automatic feature extractor. In the forward propagation process, it extracts shallow-to-deep image features [13]. Finally, the collected features are combined in the fully connected layer and categorized by the softmax layer. Compared with hand-crafted features, which apply the same feature design across different datasets, deep learning can adapt to new datasets and extract the relevant features, which is a clear advantage [14]. Convolutional neural networks learn convolutional features via back-propagation, driven directly by the error between the true label and the recognition result [15]. Because of this, the feature patterns the network learns can shift from one dataset to the next. Deep learning excels at feature representation, and this research explores its potential for fine-grained digital image recognition [16].

This work designs a fine-grained image recognition method based on feature enhancement.

  • First, this paper designs a feature enhancement and suppression module to process image features.

  • Secondly, this paper designs pyramid residual convolution, which uses convolution kernels of different scales to capture different levels of features in the scene.

  • Thirdly, this paper uses the softpool method to allocate information weights rationally during pooling.

  • Fourth, this paper uses a feature focus module to mine more features, focusing on the information shared across multiple local features as discriminative evidence to improve recognition.

The content of this article is arranged as follows: Sect. 2 reviews the relevant literature; Sect. 3 explains the proposed method; Sect. 4 presents the experimental analysis; Sect. 5 concludes.

2 Related work

Deep learning-based fine-grained image recognition can be divided into two main schools of thought: strongly supervised learning and weakly supervised learning. The primary dividing line between them is whether the model is trained using only label information. Much of the early work relied on strong supervision. Fine-grained recognition techniques based on strongly supervised learning required not only image label information but also costly manual annotations, such as bounding boxes or part points, during the training phase. This way of detecting target objects and parts was inspired by models widely used in object detection, such as R-CNN [17], Fast R-CNN [18], Faster R-CNN [19], and so on. Part-based R-CNN, proposed in reference [20], used selective search to generate object region candidate boxes on images annotated with object bounding boxes and part points. It used the label boxes to train detectors for the important foreground parts and the target object in these regions. The final classification result was obtained by extracting features from the detected and refined regions and classifying with these features. However, this method also had serious defects: a large number of useless candidate boxes were generated while detecting the object region. To reduce the poor recognition caused by variation in the target object's pose, reference [21] proposed Part Normalized CNN, an improved version of Part-based R-CNN, which normalized the pose of the cropped local images before extracting features for classification. Reference [22] proposed Part-Stacked CNN, which required less human involvement; it achieved high recognition accuracy and was widely applicable in practical scenes. Another representative work was SPDA-CNN, proposed in reference [23], which detected and extracted parts with the same semantics. The network paid attention to learning mid-level features and the correlations between the learned parts, and used a shared convolution kernel to greatly reduce the complexity of repeated training and testing.

To reduce labor costs and avoid human error, more and more researchers turned to weakly supervised learning. Its characteristic is that only category labels are used during network training, yet the resulting recognition accuracy can still meet industrial requirements, so weakly supervised learning has become a popular way to tackle fine-grained image recognition. Reference [24] proposed a two-level attention model, a typical weakly supervised fine-grained recognition algorithm. The algorithm used the target object and key parts to extract global and local features; like the Part-based R-CNN mentioned earlier, it used selective search to generate candidate regions, which were then sent to a convolutional neural network for feature extraction and classification. Reference [25] proposed a diversified visual attention network, which improved the diversity of feature expression by locating multiple part regions to obtain more powerful discriminative features. The fully convolutional attention localization network proposed in reference [26] was an efficient reinforcement-learning approach that could automatically lock onto multiple discriminative salient regions. Reference [27] proposed the bilinear CNN model, which used two feature extractors simultaneously; the extractors were either the same CNN or different CNNs. After an image passed through the CNNs, their two outputs were combined by an outer product and pooled to obtain the image descriptor. To reduce the large memory footprint of bilinear CNN, a compact bilinear CNN model was proposed in [28]. It kept the recognition accuracy from declining while using the idea of kernel approximation to obtain high-dimensional features at reduced dimensionality. Reference [29] used a Taylor series kernel to capture higher-order feature interactions, improving both the kernel approximation error and the recognition accuracy. Reference [30] proposed a recurrent convolution model that repeatedly detected multi-scale key regions, obtaining features at two different scales by locating the key areas in the image.

3 Proposed method

This work designs a fine-grained image recognition method based on feature enhancement. First, this paper designs a feature enhancement and suppression module to process image features. Secondly, this paper designs pyramid residual convolution, which uses convolution kernels of different scales to capture different levels of features in the scene. Thirdly, this paper uses the softpool method to allocate information weights rationally during pooling. Fourth, this paper uses a feature focus module to mine more features, focusing on the information shared across multiple local features as discriminative evidence to further improve recognition.

3.1 Convolutional neural network

A CNN is a layered feature extraction network. As an input image is processed by successive network layers, the extracted features transition from concrete low-level characteristics to more generalized high-level features. This layer-by-layer feature extraction is called forward propagation. Through the last layer of the network, the target task is expressed as a function. The output of forward propagation is compared with the true value, and the resulting difference is back-propagated: the parameters of each layer are updated according to the gradient of this error, and the process repeats until the difference reaches the expected value.

The backbone of a convolutional neural network is the convolution layer, which consists of numerous convolution kernels. Its primary use is feature extraction for network-based image classification. The input image is convolved with these kernels to produce feature maps, so after the image has been processed by a convolution layer, as many feature maps are generated as there are kernels. From the same input image, different convolution kernels extract different characteristics. The convolution operation applies the two-dimensional kernels of the convolution layer to the input image: the value of each pixel in the image region covered by the kernel is multiplied by the corresponding kernel element, and the products are summed to obtain the value of the corresponding pixel of the output feature map.

$$ F\left( {i,j} \right) = \left( {C*X} \right)\left( {i,j} \right) $$
(1)

where \(C\) is the convolution kernel and \(X\) is the input.
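
For readers implementing this from scratch, the following minimal NumPy sketch performs the multiply-and-sum operation of Eq. (1) in "valid" mode. Note that, as in most deep learning frameworks, the kernel is not flipped, so this is strictly cross-correlation; the averaging kernel and toy input are illustrative assumptions.

```python
import numpy as np

def conv2d_valid(X, C):
    """Eq. (1): slide kernel C over input X; at each position multiply
    element-wise with the covered region and sum to get one output pixel."""
    kh, kw = C.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    F = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            F[i, j] = np.sum(X[i:i + kh, j:j + kw] * C)
    return F

X = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 input image
C = np.ones((3, 3)) / 9.0                     # 3x3 averaging kernel
print(conv2d_valid(X, C).shape)               # (2, 2) feature map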

Compared with a traditional neural network, the advantage of a convolutional neural network lies in its parameter sharing mechanism. Parameter sharing curbs the explosive growth of network weights: the parameters of a convolution kernel are shared by all positions in the same channel, which greatly reduces the number of network parameters while still extracting the feature information of the image effectively. To ensure that each channel extracts different feature information, multiple convolution kernels are set in the network.

The convolution operation reduces trainable parameters and connections through weight sharing and the sparse connections between layers. However, repeated convolutional feature extraction still involves a large amount of computation, so training converges slowly and the probability of over-fitting increases. Therefore, a pooling layer is generally connected behind the convolution layer to compress the extracted feature map and reduce the computational complexity of the model. Common pooling operations include average pooling, maximum pooling and global pooling. The pooling layer has two characteristics. First, it provides a degree of invariance to translation, rotation and scale: it is concerned with describing the characteristics of objects rather than the coordinates at which they appear. Second, different pooling layers can retain the desired features and discard insignificant ones. The resulting features have lower dimensionality, so the amount of computation is greatly reduced, over-fitting is alleviated and the applicability of the model improves.
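
As a concrete illustration of these pooling variants, the short PyTorch snippet below (the framework choice is an assumption; the text only mentions a Python deep learning framework) shows how each halves or collapses the spatial dimensions of a feature map.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)                    # (batch, channels, H, W)
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keeps the strongest response
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)  # keeps the regional mean
gap = nn.AdaptiveAvgPool2d(1)                     # global average pooling

print(max_pool(x).shape)  # torch.Size([1, 64, 28, 28])
print(avg_pool(x).shape)  # torch.Size([1, 64, 28, 28])
print(gap(x).shape)       # torch.Size([1, 64, 1, 1])
```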

The convolution operations that extract features from the input image are linear, and stacking them remains linear, which is not conducive to expressing complex functions. A nonlinear activation function is therefore usually added after the convolution layer to enhance its ability to express image features. Several activation functions are in common use:

$$ {\text{Sigmoid}}\left( x \right) = 1/\left( {1 + \exp \left( { - x} \right)} \right) $$
(2)
$$ {\text{Tanh}} \left( x \right) = \left( {1 - \exp \left( { - 2x} \right)} \right)/\left( {1 + \exp \left( { - 2x} \right)} \right) $$
(3)
$$ {\text{ReLU}}\left( x \right) = {\text{Max}}\left( {0,x} \right) $$
(4)

where \(x\) is the input.
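
The three activation functions of Eqs. (2)-(4) can be verified numerically; this small sketch checks the hand-written forms against PyTorch's built-ins.

```python
import torch

x = torch.linspace(-3, 3, 7)
sigmoid = 1 / (1 + torch.exp(-x))                         # Eq. (2)
tanh = (1 - torch.exp(-2 * x)) / (1 + torch.exp(-2 * x))  # Eq. (3)
relu = torch.clamp(x, min=0)                              # Eq. (4)

# The hand-written forms match the built-in functions:
assert torch.allclose(sigmoid, torch.sigmoid(x))
assert torch.allclose(tanh, torch.tanh(x), atol=1e-6)
assert torch.allclose(relu, torch.relu(x))
```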

The fully connected (FC) layer performs the final classification in a convolutional neural network. The feature maps produced by the convolution and pooling layers are flattened and linearly mapped as they enter the fully connected part of the network, which may consist of one or several layers. The convolution and pooling layers map the raw data into a hidden feature space; the fully connected layer then applies an activation function to the mapped feature information to classify it into the label space.

3.2 FIRFE architecture

The flow of the fine-grained image recognition method for digital media proposed in this work is shown in Fig. 1. PyConvResNet is a convolution backbone combining pyramid convolution with the ResNet network structure. Its layout matches that of ResNet's feature extractor and has five stages; the spatial size of the feature map after each stage is half that of the previous stage. After repeated feature extraction, PyConvResNet carries richer image feature information at higher dimensionality, so this paper places the feature enhancement and suppression module in the third, fourth and fifth stages, i.e., the last three stages. The part features processed by the feature enhancement and suppression module (FESM) are input into the feature focus module (FFM), which makes feature extraction concentrate on the more informative discriminative regions. After the image passes through the feature focus module, the softpool method is used for pooling. The final classification result is obtained by combining the results of the multiple softpool branches and feeding them to the final classifier.

Fig. 1

FIRFE architecture
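
To make the data flow of Fig. 1 concrete, the following PyTorch-style skeleton sketches one possible wiring of the pipeline. All submodule names and interfaces here (stem, stages_3_5, fesms, ffm, pool, classifiers) are hypothetical placeholders consistent with the description above, not the authors' released code.

```python
import torch.nn as nn

class FIRFE(nn.Module):
    """Sketch of the Fig. 1 pipeline; every injected submodule is a placeholder."""
    def __init__(self, stem, stages_3_5, fesms, ffm, pool, classifiers):
        super().__init__()
        self.stem = stem                               # PyConvResNet stages 1-2
        self.stages = nn.ModuleList(stages_3_5)        # PyConvResNet stages 3-5
        self.fesms = nn.ModuleList(fesms)              # one FESM per late stage
        self.ffm = ffm                                 # feature focus module
        self.pool = pool                               # softpool (Sect. 3.5)
        self.classifiers = nn.ModuleList(classifiers)  # one head per part feature

    def forward(self, x):
        feats = self.stem(x)
        parts = []
        for stage, fesm in zip(self.stages, self.fesms):
            feats = stage(feats)
            enhanced, feats = fesm(feats)  # enhanced part branches off; the
            parts.append(enhanced)         # suppressed map feeds the next stage
        parts = self.ffm(parts)            # gather shared discriminative info
        return [clf(self.pool(p).flatten(1))  # softpool, then classify each part
                for clf, p in zip(self.classifiers, parts)]
```

The per-part logits returned here are combined for the final prediction and supervised with the loss below.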

During training, the classification loss of the enhanced feature for each specific part is:

$$ L_{{{\text{cls}}}}^{i} = - y^{T} \log \left( {p_{i} } \right) $$
(5)
$$ p_{i} = {\text{softmax}}\left( {{\text{cls}}_{i} \left( {Z_{{p_{i} }} } \right)} \right) $$
(6)

where \(y\) is the one-hot ground-truth label vector.
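
A minimal sketch of Eqs. (5)-(6) in PyTorch: `F.cross_entropy` fuses the softmax of Eq. (6) with the negative log-likelihood of Eq. (5). Summing the per-part losses is an assumption about how the individual terms are combined.

```python
import torch
import torch.nn.functional as F

def part_classification_loss(part_logits, target):
    # Eqs. (5)-(6): softmax cross-entropy for each enhanced part, summed
    # (the summation across parts is an assumption).
    return sum(F.cross_entropy(logits, target) for logits in part_logits)

# Toy check: 3 part classifiers, batch of 4 images, 200 classes.
logits = [torch.randn(4, 200, requires_grad=True) for _ in range(3)]
target = torch.randint(0, 200, (4,))
part_classification_loss(logits, target).backward()
```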

3.3 FESM module

Many methods have tried to enrich the features that neural networks extract from the input data. Feature enhancement is added to the network for feature fusion and increases the richness of information before the final classification. The feature enhancement and suppression module is one of the more effective approaches: it first enhances the salient features in the fine-grained image while suppressing them in another branch, which encourages the network to learn additional sub-salient features and increases feature richness.

The main idea of feature enhancement and suppression is to mine enough image parts for recognition. From the feature map obtained at the current stage, the method first selects the most salient part to obtain the image representation of a specific component. The selected most salient part is then suppressed, which forces the model to mine more potentially salient image parts in the next stage. The feature enhancement and suppression module is inserted mainly at the middle levels of the convolutional neural network, from which the feature representations of multiple specific parts can be obtained. These feature representations also concentrate on different object parts, as shown in Fig. 2.

Fig. 2

FESM module

The image first passes through the upper part of the convolutional neural network; different base models pass it through different numbers of layers to obtain the feature map. Inspired by the PCB method, the feature map is divided evenly into several parts along the width. A convolution block then scores the importance of each image part, and the average of these scores is taken as the importance factor of the feature map. The importance factors are normalized with the softmax function. The features are then enhanced by strengthening the most salient part of the image, while in the other branch the most salient part is suppressed: the suppressed image features are obtained by attenuating the most salient block of the image.

$$ X_{e} = h\left( {X + \alpha *\left( {B \otimes X} \right)} \right)$$
(7)
$$ X_{s} = S \otimes X$$
(8)

where \(X\) is the input feature map, \(\alpha \) is a control parameter, \(B\) is a mask marking the most salient part, \(S\) is the suppression map, and \(\otimes \) denotes element-wise multiplication.

Given a feature map, the feature enhancement and suppression module outputs the feature values of a specific part. The second output can be considered a map of potential image features. By adjusting the hyper-parameter that controls the degree of suppression in the module, the potential information in the image can be made relatively prominent when passed into the subsequent modules.
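
The following PyTorch sketch implements Eqs. (7)-(8) as described above. The strip count, the 1x1 importance convolution, the choice of ReLU for \(h(\cdot)\) and the suppression factor beta are all assumptions; only the enhance-then-suppress logic is taken from the text.

```python
import torch
import torch.nn as nn

class FESM(nn.Module):
    """Feature enhancement and suppression sketch (Eqs. (7)-(8))."""
    def __init__(self, channels, num_strips=4, alpha=0.5, beta=0.3):
        super().__init__()
        self.num_strips = num_strips  # strips along the width (PCB-style split)
        self.alpha = alpha            # enhancement strength, Eq. (7)
        self.beta = beta              # residual weight kept on the suppressed strip
        self.importance = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):             # assumes W is divisible by num_strips
        b, c, h, w = x.shape
        strips = x.chunk(self.num_strips, dim=3)      # split evenly along width
        scores = torch.stack([self.importance(s).mean(dim=(1, 2, 3))
                              for s in strips], dim=1).softmax(dim=1)
        top = scores.argmax(dim=1)                    # most salient strip per image

        B = torch.zeros(b, 1, 1, w, device=x.device)  # mask of the salient strip
        S = torch.ones(b, 1, 1, w, device=x.device)   # suppression map
        sw = w // self.num_strips
        for i in range(b):
            lo = int(top[i]) * sw
            B[i, ..., lo:lo + sw] = 1.0
            S[i, ..., lo:lo + sw] = self.beta

        x_e = torch.relu(x + self.alpha * (B * x))    # Eq. (7), h(.) taken as ReLU
        x_s = S * x                                   # Eq. (8)
        return x_e, x_s
```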

3.4 PyConvResNet module

The pyramid convolution structure contains a pyramid of kernels. The kernel type differs on each level, in both depth and size; kernels with different sizes and depths can better capture different levels of detail in the scene. Pyramid convolution processes the input at several kernel scales to increase the amount of captured information. The kernel size grows gradually from the bottom level to the top, while the kernel depth decreases correspondingly. The structure of pyramid convolution is shown in Fig. 3.

Fig. 3

Pyramid convolution

This work combines pyramid convolution with the residual bottleneck block of the residual convolutional neural network to form pyramid residual convolution, as shown in Fig. 4. In the pyramid residual convolution structure, a convolution block first reduces the channel dimension of the input feature map to 64. Convolution is then performed with four pyramid kernels of different sizes; each kernel outputs 16 feature maps, so the structure outputs 64 feature maps in total. Batch normalization and the ReLU activation function are added after the convolution block, and, as in ResNet, a shortcut connection is added to strengthen the identity mapping.

Fig. 4

PyConvResNet module

Pyramid residual convolution can process the input at multiple kernel scales, which increases its ability to extract the overall information of the data. At the same time, it has almost the same computational cost and parameter count as standard residual convolution, making it very efficient. While capturing context at different levels from local to global, pyramid residual convolution can be integrated well into network structures equipped with feature enhancement methods. ResNet has shown excellent performance in various computer vision tasks and fuses well with pyramid convolution, so pyramid convolution and the ResNet structure are combined to obtain the PyConvResNet structure.
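
A sketch of the pyramid residual block of Fig. 4 in PyTorch. The text fixes the 64-channel reduction and the four 16-map branches; the kernel sizes 3/5/7/9 and the growing group counts (which keep the larger kernels shallow and cheap) follow the PyConv convention and are assumptions here.

```python
import torch
import torch.nn as nn

class PyConvBlock(nn.Module):
    """Pyramid residual convolution sketch (Fig. 4)."""
    def __init__(self, in_channels):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, 64, kernel_size=1, bias=False)
        # Four pyramid branches: larger kernels use more groups (less depth).
        self.branches = nn.ModuleList([
            nn.Conv2d(64, 16, 3, padding=1, groups=1, bias=False),
            nn.Conv2d(64, 16, 5, padding=2, groups=4, bias=False),
            nn.Conv2d(64, 16, 7, padding=3, groups=8, bias=False),
            nn.Conv2d(64, 16, 9, padding=4, groups=16, bias=False),
        ])
        self.bn = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = (nn.Identity() if in_channels == 64 else
                         nn.Conv2d(in_channels, 64, kernel_size=1, bias=False))

    def forward(self, x):
        out = self.reduce(x)                                     # channels -> 64
        out = torch.cat([b(out) for b in self.branches], dim=1)  # 4 x 16 = 64 maps
        out = self.bn(out)
        return self.relu(out + self.shortcut(x))                 # short connection
```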

3.5 Softpool module

Softpool is a kernel-based pooling method whose core idea is to compute a softmax-weighted sum of the activation values. Softpool preserves descriptive activation features to the maximum extent while keeping the amount of computation and the efficiency essentially unchanged. The method uses the natural exponential function: because the resulting weights are non-negative, large activation values obtain greater weight in the output. At the same time, the exponential is differentiable, so during back-propagation there is a proportional gradient with a lower bound. Softpool is a smooth approximation of the maximum of the activations within the region. The weight corresponding to each activation value is:

$$ w_{i} = e^{{a_{i} }} /\mathop \sum \limits_{j \in R} e^{{a_{j} }} $$
(9)

where \({a}_{i}\) is the activation value within the pooling region \(R\).

Therefore, higher activation values receive higher weights than lower ones. When pooling in a high-dimensional feature space, assigning larger weights to higher activation values is more effective than the uniform weighting of average pooling, whereas maximum pooling ignores too much information and its performance is correspondingly poor. The output of the softpool method is obtained as the weighted sum of all activation values in the kernel region:

$$ \tilde{a} = \mathop \sum \limits_{i \in R} w_{i} *a_{i} $$
(10)

where \({w}_{i}\) is the weight from Eq. (9).

The softpool method contains no trainable parameters, is independent of the training data, and combines the advantages of the commonly used average and maximum pooling methods. Average pooling suppresses the contribution of larger values to the final result and smooths the gap between larger and smaller values; maximum pooling retains only the maximum activation value, so the information of non-maximum elements is completely lost and the overall characteristics are ignored. Softpool balances these two behaviors: all activation values in the region affect the final output, and parts with higher activation values occupy greater weight in the output than parts with lower ones. The design of softpool thus makes up for the defects of average pooling and maximum pooling, combining the advantages of the two methods to better effect.
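
Because avg(x·eˣ)/avg(eˣ) over a region equals the softmax-weighted sum of Eqs. (9)-(10), softpool can be sketched with two average-pooling calls; the 2x2 kernel below is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def soft_pool2d(x, kernel_size=2, stride=2):
    """SoftPool sketch: each activation a_i is weighted by e^{a_i} / sum e^{a_j}
    within the kernel region (Eqs. (9)-(10))."""
    e = torch.exp(x)
    return (F.avg_pool2d(x * e, kernel_size, stride) /
            F.avg_pool2d(e, kernel_size, stride))

x = torch.tensor([[[[1.0, 3.0], [0.0, 2.0]]]])  # one 2x2 region
print(soft_pool2d(x))  # ~2.49: between the average (1.5) and the maximum (3.0)
```

The toy output illustrates the balance described above: every value contributes, but larger activations dominate.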

3.6 FFM module

To strengthen the most important regional features in the image feature representation more pointedly, the feature enhancement and suppression module introduced above already extracts rich image features. To reduce the error that low-resolution image parts may cause and to further strengthen the most important regional features, this paper designs a feature focus module in the feature enhancement network. In essence, the feature focus module operates on the feature representations of image parts: it computes the similarity between the feature maps of different image parts and gathers and enhances the information they share. By modeling the relationships between image parts, the feature focus module expresses their common features more accurately, and it also makes the overall feature capture concentrate on the recognizable parts, so images can be distinguished more clearly. The biggest difference between feature focus and feature diversification is that the former computes the similarity relationships between features and focuses on important features, while the latter exploits their complementary relationships.

The working principle of the feature focus module is to gather the similar information extracted from the feature representations of the other parts in the image and fuse it, which enhances the features of the current part and generalizes them to every image part. First, the features of two image parts are mutually enhanced by the feature focus module, as shown in Fig. 5.

Fig. 5

FFM module

A correlation is established between every pair of pixels: the higher the similarity of two pixels, the greater the importance and contribution of either pixel to the similar information they generate. Within the image features of two image parts, each pixel can thus learn its similar information from these pixel-to-pixel relationships to the maximum extent, which compensates for the problem that the semantic information within an image part is not concentrated in the discriminative region.
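
One plausible reading of this pairwise-similarity aggregation is a non-local-style operation between two part feature maps, sketched below; the exact formulation used by the paper is not specified, so the dot-product affinity and residual addition are assumptions.

```python
import torch

def feature_focus(p, q):
    """Each pixel of part feature p gathers information from the pixels of
    part feature q in proportion to their similarity."""
    b, c, h, w = p.shape
    pf = p.flatten(2)                                   # (b, c, hw)
    qf = q.flatten(2)                                   # (b, c, hw)
    affinity = torch.bmm(pf.transpose(1, 2), qf)        # (b, hw, hw) similarities
    affinity = affinity.softmax(dim=-1)                 # normalize over q pixels
    gathered = torch.bmm(qf, affinity.transpose(1, 2))  # (b, c, hw)
    return p + gathered.view(b, c, h, w)                # enhance p with shared info

p, q = torch.randn(2, 64, 7, 7), torch.randn(2, 64, 7, 7)
print(feature_focus(p, q).shape)  # torch.Size([2, 64, 7, 7])
```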

4 Results and analysis

4.1 Dataset and environment

This work uses crawler technology to collect image data related to digital media from the Internet and builds two datasets, DMA and DMB, from the collected data. The two datasets contain different data; the specific information is shown in Table 1.

Table 1 The dataset information of DMA and DMB

The GPU used in this experiment is a GTX 1080Ti, the CPU is an Intel (R) Core (TM) i7-8700K, and the memory is 8 GB. A Python deep learning framework is used under the Ubuntu system, and ResNet50 serves as the base network of the whole model. The learning rate of the PyConvResNet backbone is set to 0.002, while the learning rate of the other parts is 0.02; the learning rate is adjusted in real time by a cosine annealing schedule. The whole model is optimized by stochastic gradient descent with momentum 0.9. A total of 200 epochs are trained, the weight decay is 0.00001, and the batch size is 20.
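
The reported optimization settings translate directly into the following PyTorch configuration; the framework choice and the stand-in modules are assumptions, since the text names neither.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

backbone = nn.Conv2d(3, 64, 3)  # stand-in for the PyConvResNet backbone
heads = nn.Linear(64, 10)       # stand-in for the remaining modules

optimizer = SGD(
    [{"params": backbone.parameters(), "lr": 0.002},  # PyConvResNet lr
     {"params": heads.parameters(), "lr": 0.02}],     # lr of the other parts
    momentum=0.9, weight_decay=1e-5)
scheduler = CosineAnnealingLR(optimizer, T_max=200)   # cosine annealing

for epoch in range(200):        # 200 epochs, batch size 20 in the paper
    # ... forward/backward passes over mini-batches go here ...
    optimizer.step()
    scheduler.step()
```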

This work uses accuracy, recall and F1 scores to evaluate the image classification performance of FIRFE, and the specific calculations are as follows:

$$ {\text{Accuracy}} = \left( {{\text{TP}} + {\text{TN}}} \right)/\left( {P + N} \right) $$
(11)
$$ {\text{Recall}} = {\text{TP}}/\left( {{\text{TP}} + {\text{FN}}} \right) $$
(12)
$$ F1 \;{\text{score}} = 2{\text{TP}}/\left( {2{\text{TP}} + {\text{FP}} + {\text{FN}}} \right) $$
(13)

where \(\mathrm{TP}\) is the number of true positives, \(\mathrm{TN}\) the number of true negatives, \(\mathrm{FP}\) the number of false positives, \(\mathrm{FN}\) the number of false negatives, \(P\) the number of positive samples, and \(N\) the number of negative samples.
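
Eqs. (11)-(13) reduce to a few lines of Python; here \(P = \mathrm{TP} + \mathrm{FN}\) and \(N = \mathrm{TN} + \mathrm{FP}\), so the accuracy denominator is the total sample count.

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (11)-(13) from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # (TP + TN) / (P + N)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, recall, f1

print(classification_metrics(tp=80, tn=90, fp=10, fn=20))  # (0.85, 0.8, 0.8421...)
```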

4.2 Training analysis of FIRFE

FIRFE is a fine-grained image recognition model based on deep learning, so training on data is necessary. This work first analyzes the training process of FIRFE; the main indicators are the training loss and the training performance metrics. The specific experimental data are shown in Figs. 6 and 7.

Fig. 6

Training loss of FIRFE

Fig. 7

Training performance of FIRFE

The data curves in the two figures show that, as training proceeds, the loss gradually decreases to convergence and the recognition performance gradually increases to convergence. The convergence of both curves preliminarily verifies the feasibility of FIRFE for fine-grained image recognition of digital media.

4.3 Comparison with different methods

To further verify the superiority of FIRFE in fine-grained image recognition of digital media, this paper compares it with other fine-grained image recognition methods. To ensure comparability, the experimental settings are kept as consistent as possible. The specific experimental data are shown in Table 2.

Table 2 Comparison with different methods

FIRFE achieves the highest performance on both DMA and DMB. Specifically, compared with the second-best method, FIRFE achieves improvements of 2.4%, 3.7% and 2.5% on DMA. On DMB, the corresponding three performance indicators improve by 3.0%, 3.2% and 2.3%. These improvements verify the feasibility of FIRFE for fine-grained image recognition of digital media.

4.4 Analysis of FESM module

FIRFE uses FESM to process features. To verify FESM's contribution to model performance, a comparative experiment is conducted that analyzes image recognition performance with and without FESM. The specific experimental data are shown in Fig. 8.

Fig. 8
figure 8

Analysis of FESM module

After introducing FESM, the recognition performance improves accordingly. On DMA, the corresponding indicators improve by 2.2%, 1.7% and 1.5%; on DMB, by 1.8%, 1.5% and 2.0%. These improvements validate the advantages of FESM and show that it enables the network to capture more non-most-salient image features.

4.5 Analysis of PyConvResNet module

FIRFE uses PyConvResNet to extract more discriminative image features. To verify PyConvResNet's contribution to model performance, a comparative experiment is conducted that analyzes image recognition performance with and without PyConvResNet. The specific experimental data are shown in Table 3.

Table 3 Analysis of PyConvResNet module

After introducing PyConvResNet, the recognition performance improves accordingly. On DMA, the corresponding indicators improve by 1.4%, 1.2% and 1.2%; on DMB, by 1.6%, 1.2% and 1.5%. These improvements validate the advantages of PyConvResNet and show that its multiple convolution kernels of different scales improve the network's ability to capture different levels of detail.

4.6 Analysis of softpool module

FIRFE uses softpool to process the feature maps from the last stage. To verify softpool's contribution to model performance, a comparative experiment is conducted that analyzes image recognition performance with and without softpool. The specific experimental data are shown in Fig. 9.

Fig. 9
figure 9

Analysis of softpool module

After introducing softpool, the recognition performance improves accordingly. On DMA, the corresponding indicators improve by 1.4%, 1.1% and 1.3%; on DMB, by 1.3%, 1.2% and 1.3%. These improvements validate the advantages of softpool and show that it combines the strengths of maximum and average pooling: the possible contribution of every value in the region to subsequent tasks is considered and information loss is minimized.

4.7 Analysis of FFM module

FIRFE uses FFM to focus the obtained features. To verify FFM's contribution to model performance, a comparative experiment is conducted that analyzes image recognition performance with and without FFM. The specific experimental data are shown in Table 4.

Table 4 Analysis of FFM module

After introducing FFM, the recognition performance improves accordingly. On DMA, the corresponding indicators improve by 2.1%, 1.8% and 1.7%; on DMB, by 1.8%, 1.5% and 2.0%. These improvements validate the advantages of FFM and show that it can learn the similarity between features and focus on the most important discriminative ones, making the final feature extraction and classification more accurate.

4.8 Analysis of BN module

FIRFE uses a BN layer to normalize the features. To verify BN's contribution to model performance, a comparative experiment is conducted that analyzes image recognition performance with and without BN. The specific experimental data are shown in Fig. 10.

Fig. 10
figure 10

Analysis of BN module

After introducing BN, the recognition performance improves accordingly. On DMA, the corresponding indicators improve by 1.1%, 1.0% and 0.9%; on DMB, by 1.2%, 0.9% and 1.1%. These improvements validate the advantages of BN and show that feature normalization can improve the efficiency of feature mining.

5 Conclusion

Digital media technology encompasses a wide range of disciplines, such as computer graphics-based virtual reality technology, human–computer interface technology, sensor technology and artificial intelligence. It has developed rapidly and penetrated all aspects of people's life. Fine-grained image recognition for digital media identifies sub-classes within a general category; it is a delicate task in computer vision with important research significance and application value. This work designs a fine-grained image recognition method based on feature enhancement. First, this paper designs a feature enhancement and suppression module to process image features. Secondly, this paper designs pyramid residual convolution, which uses convolution kernels of different scales to capture different levels of features in the scene. Thirdly, this paper uses the softpool method to allocate information weights rationally during pooling. Fourth, this paper uses a feature focus module to mine more features, focusing on the information shared across multiple local features as discriminative evidence to improve recognition. Fifthly, this paper carries out systematic experiments on the designed method, and the experimental data verify its superiority for fine-grained image recognition of digital media. Although the proposed method achieves considerable performance, there is still room for improvement compared to the latest methods. In addition, the proposed model has many parameters, which is not conducive to practical deployment. In future research, we will focus on developing fine-grained image recognition models with higher accuracy and smaller size.