Introduction

The term “camouflage” was initially used to describe the behavior of certain species that imitate the appearance, color, and other characteristics of their environment to hide from predators or to hunt their prey [1]. For instance, certain insects and fish can change their bodily appearance to match the colors and patterns of their surroundings. Humans exploit the same mechanism in warfare and art: soldiers and military equipment use camouflage or paint (i.e., artificial camouflage) to blend into the surrounding environment and evade detection by humans and machines [2], and artificial camouflage is also applied in entertainment and art, such as body painting. Figure 1a and b depict camouflaged objects (insects and fish), whereas Fig. 1c and d depict artificial camouflage (soldiers and body paintings).

Fig. 1 Instances of camouflage in the COD10K dataset [3, 4]

Recently, camouflaged object detection (COD), i.e., identifying objects hidden in the background, has gained scholarly attention in the field of computer vision. Beyond its academic significance, COD has diverse applications, such as military target detection, medical diagnosis [3, 4], species discovery, and animal detection [5]. However, COD is highly challenging owing to the nature of camouflage, which produces a high intrinsic similarity between the candidate object and the background and complicates the detection of camouflaged objects by both humans and machines. As shown in Fig. 2, the boundaries of the two butterflies (target objects) blend with the bananas (background), rendering COD considerably more challenging than traditional salient object detection [6,7,8,9,10] or generic object detection [11,12,13].

Fig. 2 Common detection tasks

Early investigations mostly employed basic features such as texture, edges, brightness, and color to differentiate camouflaged objects from their surroundings [36,37,38,39,40,41]. However, camouflage often disrupts precisely these inherent features to deceive the observer, rendering such approaches less effective. To this end, deep-learning-based methods have been proposed for COD; they exhibit significant potential and can be classified into the following three approaches:

(1) Designing targeted network modules or architectures to effectively mine the discriminative features of camouflaged objects and improve detection performance, such as C2FNet [14] and UGTR [15]. This approach requires in-depth network adjustments and optimizations, which increases the design and implementation complexity.

(2) Incorporating auxiliary tasks into joint/multitask learning frameworks, such as classification [2], edge extraction [16], salient object detection [17], and camouflaged object ranking [18]. Valuable additional clues can be mined from the shared features to significantly improve the feature representation of camouflaged targets, thereby enhancing the model's generalization ability and efficiency and alleviating data scarcity; however, the demand for computational and storage resources increases.

(3) Biomimetic approaches, wherein the predatory behavior of animals in nature is simulated in the network design, such as SINet [3, 4] and PFNet [19]. Simulating complex natural behaviors can enhance the model's sensitivity and detection accuracy for camouflaged targets; however, it requires large amounts of data and computational resources.

Significant progress has been made using the aforementioned methods, such as SINet [3, 4] and BGNet [20]. Figure 3 shows the accuracy results on the COD10K test set, wherein the E-measure [22] increased from 0.864 to 0.901, the S-measure [23] from 0.776 to 0.831, and the weighted F-measure [24] from 0.631 to 0.722, whereas the mean absolute error (MAE) [25] decreased from 0.043 to 0.033. Evidently, the accuracy of the models has increased; however, the following two major issues persist:

(1) While the detection accuracy of the models improved, their complexity increased significantly, and detection time was largely neglected; consequently, the fast-detection requirements of real-world applications are not met. For instance, SINet [3, 4], a representative biomimetic method, primarily comprises three modules: a receptive field module (RFM), a partial decoder component (PDC), and a search attention module. During inference, the RFM and PDC are invoked seven and two times, respectively, significantly increasing the computational complexity. The mutual graph learning (MGL) algorithm [16], a representative method that incorporates auxiliary tasks into the learning framework, encodes edge and object features jointly into a graph convolutional network and enhances feature representation using a graph interaction module, which increases model complexity and the associated computational burden. UGTR [15], a representative method that designs targeted network structures, integrates new components such as an uncertainty quantification network (UQN), a prototype transformer (PT), and an uncertainty-guided transformer (UGT); these transform the deterministic mapping of traditional COD models into an uncertainty-guided contextual reasoning process, again at increased computational cost.

(2) ResNet50 [21] has primarily been used as the backbone network when investigating model accuracy; however, comparisons between deepening the feature extraction network for higher accuracy and designing new network structures remain insufficient [3, 4, 14,15,16,17,18,19,20]. Therefore, investigating the impact of various backbone networks on the accuracy and speed of COD models is crucial.

Fig. 3 Accuracy statistics of the camouflage detection models

This study proposes a COD network based on multilevel feature fusion to reduce model complexity, obtain a lightweight network, and rapidly detect camouflaged targets. First, low-, medium-, and high-level features were extracted using a backbone network, and a dense connection strategy [26] was used to fuse features from different layers and preserve more information. Second, the RFM [27] was introduced to extract and fuse features at various receptive field sizes. Third, the multilevel features fused across different feature layers and receptive fields were fed into the decoder to obtain the predicted map. Finally, various backbone networks were compared to test the performance of the lightweight camouflaged target detection model.

Related works

Significant advances have been made with deep-learning-based COD models. A number of biologically inspired approaches [3, 4, 14, 19, 32,33,34,35] have been proposed. Several works devise perceptual systems that mimic how humans observe camouflaged objects. For instance, Rank-Net [33] divides the detection process into three stages: localization, segmentation, and ranking. Inspired by human attention coupled with a coarse-to-fine detection strategy, SegMaR [34] integrates segment, magnify, and reiterate operations in a multistage detection pipeline. Thus, Rank-Net [33] and SegMaR [34] split detection into several stages, refining the results from coarse to fine. In addition, numerous researchers have attempted to improve camouflaged target detection by simulating the predatory behavior of animals. For instance, the SINet framework is based on the search and recognition stages of animal predation and comprises two main modules: a search module (SM) for locating camouflaged objects and a recognition module (RM) for accurate detection. Similarly, PFNet comprises two key modules: a positioning module (PM) that simulates the detection process during predation and a focus module (FM) that refines the initial segmentation by focusing on ambiguous regions. SINet and PFNet thus divide camouflaged target detection into two stages: candidate regions are generated in the first stage, and further localization is performed in the second stage to improve detection accuracy.

In contrast, this study simulates the human visual system and proposes a single-stage camouflaged target detection framework to accelerate detection. The proposed method fuses features from different layers to obtain more distinguishable representations and introduces the RFM [27], which simulates the sizes and eccentricities of receptive fields in the human visual system, to enhance the fused features. Finally, the enhanced features are fed into the decoder to obtain the final results.

Proposed method

Problem description

The COD model is represented by a function \(M_{\Theta}\) parameterized by weights \(\Theta\). \(M_{\Theta}\) accepts an image I as input and generates a camouflage map C ∈ [0, 1]. The objective is to learn \(\Theta\) from a given labeled training dataset \(\{(I_i, C_i)\}_{i=1}^{N}\), where \(I_i\) denotes a training image, \(C_i\) denotes its label, and N denotes the number of training images.

Overall architecture

Based on multilevel feature fusion, this study proposes a fast, single-branch COD framework called the lightweight network (LINet), which comprises feature extraction, receptive field, and decoder modules. The feature extraction module extracts and fuses features from various layers. The RFM simulates the structure of receptive fields in the human visual system [27], thereby enhancing the extracted features. The decoder module receives the multilevel features and outputs the predicted feature maps (see Fig. 4).

Fig. 4 The framework of the lightweight camouflaged object detection algorithm
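To make the data flow concrete, the following sketch shows one plausible five-level split of the backbone in PyTorch. The paper does not specify the exact stage boundaries, so the split (the convolutional stem plus the four residual stages of torchvision's resnet50) and all names below are assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet50

class Backbone(nn.Module):
    """Five-level feature extractor (a sketch; the exact stage split is an
    assumption, since the paper only states that five levels x1..x5 are used)."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)
        # Stem (conv + maxpool) so that x1 and x2 share one resolution (H/4),
        # matching the direct concatenation described in the next subsection.
        self.stage1 = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stage2 = net.layer1   # H/4
        self.stage3 = net.layer2   # H/8
        self.stage4 = net.layer3   # H/16
        self.stage5 = net.layer4   # H/32

    def forward(self, img):
        x1 = self.stage1(img)
        x2 = self.stage2(x1)
        x3 = self.stage3(x2)
        x4 = self.stage4(x3)
        x5 = self.stage5(x4)
        return x1, x2, x3, x4, x5
```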

Feature extraction module

The proposed model was designed based on ResNet50 [21], the most widely used backbone network for deep COD. Given an input image I of size H × W, features were extracted at five levels, denoted as \(\{x_i, i = 1, 2, 3, 4, 5\}\). Low-level features in shallow layers preserve the spatial details needed to construct object boundaries, whereas high-level features in deep layers retain the semantic information needed to locate objects [42]. A dense connection strategy [26] was then used to fuse information from different levels. To preserve spatial details for constructing object boundaries, the extracted low-level features \(\{x_1, x_2\}\) were fused via concatenation, and a max-pooling operation was applied to halve the resolution, yielding feature \(rf_1^{in}\). To preserve the semantic information of the target object, the high-level feature \(x_5\) was upsampled via bilinear interpolation to double its resolution, yielding \(x_5^{up\times 2}\); the features \(\{x_4, x_5^{up\times 2}\}\) were then fused via concatenation to obtain \(rf_3^{in}\). To preserve more characteristic information, the high-level features \(x_4\) and \(x_5\) were upsampled via bilinear interpolation by factors of two and four to obtain \(x_4^{up\times 2}\) and \(x_5^{up\times 4}\), respectively, and the features \(\{x_3, x_4^{up\times 2}, x_5^{up\times 4}\}\) were fused via concatenation to obtain \(rf_2^{in}\). Finally, the fusion features \(\{rf_1^{in}, rf_2^{in}, rf_3^{in}, rf_4^{in} = x_5\}\), which retain more distinguishing information, were obtained.
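A minimal PyTorch sketch of this dense fusion, assuming the resolutions produced by the backbone split sketched above (x1 and x2 at one resolution; x3 to x5 halving it successively); the function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def fuse_multilevel(x1, x2, x3, x4, x5):
    """Dense-connection fusion following the text; assumes x1/x2 share one
    resolution and x3/x4/x5 halve it at each level, as the concatenations imply."""
    # Low level: concatenate x1, x2 and halve the resolution with max pooling.
    rf1_in = F.max_pool2d(torch.cat([x1, x2], dim=1), kernel_size=2)
    # High level: upsample x5 by 2 and fuse it with x4.
    x5_up2 = F.interpolate(x5, scale_factor=2, mode="bilinear", align_corners=False)
    rf3_in = torch.cat([x4, x5_up2], dim=1)
    # Mid level: upsample x4 by 2 and x5 by 4, then fuse them with x3.
    x4_up2 = F.interpolate(x4, scale_factor=2, mode="bilinear", align_corners=False)
    x5_up4 = F.interpolate(x5, scale_factor=4, mode="bilinear", align_corners=False)
    rf2_in = torch.cat([x3, x4_up2, x5_up4], dim=1)
    rf4_in = x5  # the deepest feature passes through unchanged
    return rf1_in, rf2_in, rf3_in, rf4_in
```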

RFM

After obtaining the candidate features \(\{rf_1^{in}, rf_2^{in}, rf_3^{in}, rf_4^{in}\}\) from the feature extraction module, an improved RFM [27] simulating the human visual system was used to fuse features with different receptive fields and generate the output features \(\{rf_1^{out}, rf_2^{out}, rf_3^{out}, rf_4^{out}\}\). The internal structure of the RFM can be divided into two parts: multibranch convolutional layers with different kernel sizes, and trailing dilated pooling or convolutional layers. The former obtains rich hierarchical features, whereas the latter captures contextual information over a larger area while maintaining the same number of parameters. Specifically, the RFM consists of five branches \(\{b_k, k = 1, 2, 3, 4, 5\}\). In each branch, the first convolutional layer is 1 × 1 and reduces the number of channels to 32. Thereafter, branches b3, b4, and b5 are followed by three additional convolutional layers: 1 × (2k − 3), (2k − 3) × 1, and 3 × 3 with a dilation rate of (2k − 3). Branches b3, b4, and b5 were fused via concatenation, and their channel size was reduced to 32 using a 1 × 1 convolution while the resolution remained equal to that of the input. Finally, after adding branch \(b_1\), the output of the module was passed through a ReLU function to obtain the features \(rf_j^{out}\) (j = 1, 2, 3, 4).
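The sketch below renders this description as a PyTorch module. The text leaves branch b2 unspecified; here it is treated analogously with 2k − 3 = 1 (which degenerates to 1 × 1 and plain 3 × 3 convolutions), and the exact fusion order is likewise an assumption.

```python
import torch
import torch.nn as nn

class RFM(nn.Module):
    """Receptive field module sketch. Branch b1 is a plain 1x1 conv; branches
    b2..b5 (b2's handling is an assumption) add 1x(2k-3), (2k-3)x1 and a
    dilated 3x3 convolution with dilation rate 2k-3."""
    def __init__(self, in_ch, out_ch=32):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, out_ch, 1)
        self.branches = nn.ModuleList()
        for k in (2, 3, 4, 5):
            r = 2 * k - 3  # kernel/dilation parameter: 1, 3, 5, 7
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1),
                nn.Conv2d(out_ch, out_ch, (1, r), padding=(0, r // 2)),
                nn.Conv2d(out_ch, out_ch, (r, 1), padding=(r // 2, 0)),
                nn.Conv2d(out_ch, out_ch, 3, padding=r, dilation=r),
            ))
        self.reduce = nn.Conv2d(4 * out_ch, out_ch, 1)  # fuse concatenated branches
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        fused = self.reduce(torch.cat([b(x) for b in self.branches], dim=1))
        return self.relu(fused + self.b1(x))  # residual add of b1, then ReLU
```

Each candidate feature would pass through its own RFM instance, e.g. `rf1_out = RFM(in_ch=320)(rf1_in)` under the channel counts of the sketches above.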

Decoder module

After obtaining the candidate features \(rf_j^{out}\) (j = 1, 2, 3, 4) from the RFM, the camouflage map \(C_s\) can be computed by the decoder module as follows:

$$ C_{s} = D\left( {rf_{j}^{{{\text{out}}}} \left\{ {j = 1,2,3,4} \right\}} \right) $$

The obtained features \(rf_j^{out}\) were fed into the decoder module, and a multiplication operation was used to minimize the gaps between features from different levels. Specifically, \(f_2^c\) was set to \(f_2^c = rf_2^{out}\), and each feature \(\{rf_j^{out}, j > 2\}\) was updated to \(f_j^c\) via element-wise multiplication with all deeper features, as follows:

$$ f_{j}^{c} = rf_{j}^{\text{out}} \otimes \prod_{k = j + 1}^{4} \text{BConv}\left(\text{Up}\left(rf_{k}^{\text{out}}\right)\right), \quad j \in \left[2, 3\right] $$

where BConv(·) is a sequential operation combining a 3 × 3 convolution, batch normalization, and a ReLU activation function, and Up(·) is an upsampling operation with a ratio of \(2^{k-j}\). Finally, these discriminative features were combined using a concatenation operation to obtain the feature map \(C_s\). The cross-entropy loss [9, 10] was used as the loss function, formulated as follows:

$$ L = - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \left[ {G_{i} \ln {\text{Cm}}_{i} + \left( {1 - G_{i} } \right)\ln (1 - {\text{Cm}}_{i} )} \right], $$

where N represents the number of samples, \({\text{Cm}}\) is the object mask obtained by upsampling Cs to a resolution of 352 × 352, and \(G\) is the label.
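The following sketch ties the decoder equation and the loss together, under the reading that the deepest feature passes through unchanged (\(f_4^c = rf_4^{out}\)) and that \(rf_1^{out}\) joins the final concatenation after resolution alignment; the 1 × 1 prediction head is also an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bconv(ch):
    """BConv: 3x3 convolution + batch normalization + ReLU, as defined above."""
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                         nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

def up(t, s):
    return F.interpolate(t, scale_factor=s, mode="bilinear", align_corners=False)

class Decoder(nn.Module):
    """Implements f_j = rf_j * prod_{k>j} BConv(Up(rf_k)) for j in {2, 3}."""
    def __init__(self, ch=32):
        super().__init__()
        self.b32, self.b42, self.b43 = bconv(ch), bconv(ch), bconv(ch)
        self.head = nn.Conv2d(4 * ch, 1, 1)  # assumed prediction head

    def forward(self, rf1, rf2, rf3, rf4):
        f2 = rf2 * self.b32(up(rf3, 2)) * self.b42(up(rf4, 4))  # j = 2
        f3 = rf3 * self.b43(up(rf4, 2))                          # j = 3
        f4 = rf4                                                 # passthrough
        cat = torch.cat([rf1, f2, up(f3, 2), up(f4, 4)], dim=1)
        return self.head(cat)                                    # Cs (logits)

def cod_loss(cs, gt):
    """Upsample Cs to 352 x 352, squash to [0, 1], and apply cross-entropy."""
    cm = torch.sigmoid(F.interpolate(cs, size=(352, 352),
                                     mode="bilinear", align_corners=False))
    return F.binary_cross_entropy(cm, gt)
```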

Benchmark experiments

Experimental setup

Various annotated datasets have been released to promote the development of deep-learning-based COD. The CHAMELEON dataset [28] consists of 76 images collected from the Internet via the Google search engine using “camouflaged animals” as the keyword. The CAMO dataset [2] contains 2500 images (2000 for training, 500 for testing) covering eight camouflage categories. The COD10K dataset [3, 4] was the first large-scale, challenging dataset, consisting of 10,000 images covering 78 camouflage categories. The NC4K dataset [18] contains 4,121 images with additional localization and ranking annotations, facilitating the localization and ranking of camouflaged objects. LINet was trained under the following two settings: (i) the default CAMO training set containing 1000 images, and (ii) the combined CAMO + COD10K default training set containing 4040 camouflaged images. The accuracy and speed of the model were evaluated on the NC4K, CAMO, and COD10K test sets.

The following four widely used evaluation metrics were employed in the experiments: MAE [25], E-measure (\(E_\varphi\)) [22], S-measure (\(S_\alpha\)) [23], and weighted F-measure (\(F_\beta^{W}\)) [24].
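Of the four, MAE has the simplest definition; a minimal sketch is given below, while the S-, E-, and weighted F-measures follow their reference definitions [22,23,24] and are omitted.

```python
import torch

def mae(pred, gt):
    """Mean absolute error between a predicted camouflage map and its
    ground-truth mask, both normalized to [0, 1]."""
    return torch.mean(torch.abs(pred - gt)).item()
```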

LINet was implemented in PyTorch and trained using the Adam optimizer [29]. During training, the batch size and learning rate were set to 15 and 1e-4, respectively. The experiments were performed on a platform with an Intel(R) Xeon(R) Gold 6135 CPU at 3.40 GHz and an RTX 2080 Ti GPU.
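A minimal training-loop sketch with the stated hyper-parameters (Adam, learning rate 1e-4, batch size 15). `LINet` and `CODDataset` are hypothetical names standing in for a model composing the module sketches above and a dataset yielding (image, mask) pairs at 352 × 352; the epoch count is not reported and is an assumption.

```python
import torch
from torch.utils.data import DataLoader

model = LINet().cuda()                         # hypothetical composition of the sketches above
loader = DataLoader(CODDataset("CAMO/train"),  # hypothetical dataset yielding 352x352 pairs
                    batch_size=15, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(40):                        # epoch count: an assumption
    for image, gt in loader:
        image, gt = image.cuda(), gt.cuda()
        loss = cod_loss(model(image), gt)      # cod_loss as sketched in the decoder section
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```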

To demonstrate its effectiveness, the proposed approach was compared with the following six mainstream COD models: SINet [3, 4], PFNet [19], S-MGL [16], R-MGL [16], C2FNet [14], and BGNet [20]. For a fair comparison of accuracy and speed, all baselines were retrained and tested with the author-provided open-source code using a batch size of 15, with all other settings unchanged. LSR [18] re-annotated and sorted the dataset, and JCSOD [17] introduced the DUTS training set [30] and the PASCAL VOC 2007 dataset [31] to extract saliency and camouflage features during training; both markedly increase memory usage and introduce additional datasets. Therefore, to ensure the fairness of the comparative experiments, LSR [18] and JCSOD [17] were excluded from the comparison.

Results and data analysis

Accuracy and speed analysis of the model with setting (i)

To measure the relative accuracy change of LINet compared with the other mainstream COD models, the mean precision change rate metric \(R_a\) was proposed:

$$ R_{a} = \frac{\sum_{i}\sum_{j}\left(1 - \frac{B_{ij}}{A_{ij}}\right) + \sum_{i}\left(1 - \frac{D_{i}}{C_{i}}\right)}{\text{num}_{i} \cdot \text{num}_{j}} $$

where j indexes the precision indicators \(S_\alpha\), \(E_\varphi\), and \(F_\beta^{W}\); i indexes the test datasets; \(A_{ij}\) and \(B_{ij}\) denote the values of indicator j achieved on test dataset i by the other mainstream COD models and by LINet, respectively; \(D_i\) and \(C_i\) denote the MAE values achieved on test dataset i by the other mainstream COD models and by LINet, respectively; and \(\text{num}_i\) and \(\text{num}_j\) denote the numbers of test datasets and of indicators j, respectively (see Table 1).

Table 1 Comparison of the detection accuracy of the proposed method with six mainstream methods trained on the three benchmark datasets with setting (i)

To measure the speed variation of LINet relative to the other mainstream COD models, the following average speed change rate index \(R_v\) was proposed:

$$ R_{v} = \frac{\sum_{i}\left(V_{i} - U_{i}\right)}{\sum_{i} U_{i}} $$

where \({V}_{i}\) and \({U}_{i}\) denote the frames per second (FPS) of LINet and other mainstream COD models on the test set i, respectively (see Table 2).

Table 2 Comparison of detection speeds of the proposed method and six mainstream methods at the training setting (i)
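The two rates can be computed exactly as the formulas above are printed; in the sketch below, the dictionary layout of the per-dataset results is an assumption.

```python
def change_rates(ours, base):
    """R_a and R_v as defined above. `ours` and `base` map each test set to a
    dict of metrics, e.g. {"CAMO": {"S": 0.78, "E": 0.85, "Fw": 0.66,
    "MAE": 0.09, "FPS": 38.0}, ...} (illustrative values)."""
    datasets = list(ours)
    higher = ("S", "E", "Fw")  # higher-is-better indicators indexed by j
    num_i, num_j = len(datasets), len(higher)
    r_a = (sum(1 - ours[i][j] / base[i][j] for i in datasets for j in higher)
           + sum(1 - base[i]["MAE"] / ours[i]["MAE"] for i in datasets)
           ) / (num_i * num_j)  # denominator follows the formula as printed
    r_v = (sum(ours[i]["FPS"] - base[i]["FPS"] for i in datasets)
           / sum(base[i]["FPS"] for i in datasets))
    return r_a, r_v
```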

Figure 5 shows the detection accuracy and speed change rates of LINet relative to the other mainstream models at training setting (i). LINet exhibits its largest accuracy drop rate of 10.33%, together with a detection speed improvement rate of 121.79%, relative to C2FNet. Relative to SINet, LINet exhibits a slight accuracy improvement rate of 0.99% and a detection speed improvement rate of 63.59%. Compared with the six mainstream camouflaged target detection models, LINet significantly reduces inference time with only a modest decrease in accuracy; hence, LINet is a promising solution to the COD problem. Figure 6 shows a qualitative comparison between LINet and the six baselines at training setting (i). In a few challenging cases (such as undefined boundaries, occlusion, and small objects), LINet misses some detections, such as the leg details in Fig. 6b and d and the head details in Fig. 6f. However, for relatively normal camouflaged objects (such as (a), (c), and (e) in Fig. 6), LINet localizes targets more accurately than the best-performing baseline, C2FNet.

Fig. 5 Detection accuracy and speed change rate of LINet (i) compared to mainstream models

Fig. 6 Qualitative comparison of LINet (i) and mainstream models

Accuracy and speed analysis of the model with setting (ii)

Figure 7 shows the detection accuracy and speed change rates of LINet relative to the other mainstream models at training setting (ii). Relative to BGNet, LINet exhibits its largest accuracy drop rate of 17.49% and a speed increase rate of 87.47%. Relative to SINet, LINet exhibits a slight accuracy improvement rate of 0.12% and a speed increase rate of 80.60%. Compared with the model trained with setting (i), the overall accuracy drop rate increased, indicating that LINet is better suited to smaller-data scenarios. Nonetheless, the detection time of LINet was significantly reduced, demonstrating the efficiency of the proposed framework (see Tables 3, 4).

Fig. 7 Detection accuracy and speed change rate of LINet (ii) compared to mainstream models

Table 3 Comparison of the detection accuracy of the proposed method with six mainstream methods on three benchmark datasets at the training setting (ii)
Table 4 Comparison of the detection speeds (FPS) of the proposed method and six mainstream methods at the training setting (ii)

Figure 8 shows a qualitative comparison between LINet and the six baselines at training setting (ii). For relatively normal camouflaged objects (such as (a), (c), and (e) in Fig. 8), LINet performed similarly to BGNet, which exhibited the best accuracy. However, for a few challenging cases (such as (b), (d), and (f) in Fig. 8), the accuracy of LINet decreased noticeably: compared with the other mainstream models, the edges detected by LINet are relatively blurry, leading to false detections. Notably, Fig. 8f shows that LINet clearly delineates the edges of the firearm even though it is not annotated in the ground-truth label.

Fig. 8 Qualitative comparison of LINet (ii) with other mainstream models

Impact of various backbones on the detection accuracy and speed of LINet

The results in Table 5 indicate that with ResNet152 instead of ResNet50 as the feature extraction network, LINet achieved the best accuracy, with an average accuracy improvement rate of 5.16%; however, the average detection speed decreased by 160.86%. With ResNet152, the average accuracy of LINet was slightly lower than that of R-MGL, which has a similar detection speed, with a decrease rate of 4.17%. Therefore, proposing novel model structures, rather than merely deepening the backbone, is crucial for improving COD accuracy while maintaining detection speed.

Table 5 Comparison of detection accuracy and speed of LINet under various feature extraction networks at training setting (i)

The results in Table 6 indicate that with ResNet152 instead of ResNet50 as the feature extraction network, LINet achieved the best accuracy, with an average accuracy improvement rate of 0.9%; however, the average detection speed decreased by 180.22%. With ResNet101, LINet exhibited the worst accuracy. Therefore, in COD, simply increasing the depth of the feature extraction network may not yield higher-precision models.

Table 6 Comparison of the detection accuracy and speed of LINet under different feature extraction networks at the training setting (ii)

Discussion

Inspired by the human visual system, this study proposes a single-stage lightweight camouflaged target detection model that blends various feature layers and receptive field sizes. Unlike previous methods, this proposal offers a novel way to improve detection speed alongside accuracy by introducing biological ideas into the camouflaged target detection model, which is essential for real-time applications. The performance advantages of LINet are summarized in Table 7.

Table 7 Performance advantages of LINet

These findings contribute to the development of object detection algorithms that consider both accuracy and real-time performance. Furthermore, the following observations can be drawn from the experimental results:

(1) Qualitatively, the accuracy of LINet decreases for challenging cases (such as fuzzy boundaries and occlusions), whereas it achieves better accuracy for relatively regular camouflaged objects. This may be because the feature extraction module cannot dynamically attend to boundary information, resulting in the loss of local target information. This problem could be examined quantitatively by partitioning the dataset according to boundary complexity. Meanwhile, we will continue to enhance LINet's ability to extract target boundary information from a biological standpoint and to improve its efficiency in detecting targets with complex boundaries.

(2) LINet achieves higher accuracy with ResNet50 as the feature extraction network than with ResNet101. We believe this arises because ResNet101 has more layers than ResNet50, yet not all of them benefit the detection of camouflaged targets. Hence, increasing the depth of the feature extraction network may not improve detection accuracy in camouflaged target detection. Future work will therefore design a feature extraction network tailored to camouflaged target detection, or add preprocessing operations [43], to improve detection accuracy without increasing the depth.

(3) When the model was trained with setting (ii), the overall accuracy decline rate increased compared with setting (i), suggesting that LINet may be more suitable for smaller-data scenarios. As COD is a relatively young field, available data remain limited, necessitating the collection of more scene data for model generalization tests.

Conclusion

In contrast to existing two-stage detection methods that simulate animal predation, we proposed a simple and effective single-stage detection framework, LINet, which integrates features from various feature layers and receptive field sizes. Considering the time constraints on COD algorithms, we also discussed the influence of various feature extraction networks on the accuracy and speed of LINet. Experimental results indicate that, compared with mainstream algorithms, the detection speed of LINet increases by up to 187.62%, with a maximum reduction of 17.49% in detection accuracy. LINet thus markedly improves the efficiency of camouflaged object detection toward real-time performance and can be applied to scenarios requiring fast detection of camouflaged targets.