Introduction

Camouflaged object detection (COD) aims to find and segment camouflaged objects in the background. It is a newly developing and very challenging task. As shown in Fig. 1, can you find the concealed object in the first image? Apparently, it is very difficult even for human beings. Therefore, although both are binary segmentation tasks, COD is more challenging than salient object detection (SOD), as seen in Fig. 1.

Fig. 1

Visual comparison between COD (left) and SOD (right). As can be seen, it is very difficult to find the camouflaged object in the first image even for human beings. Thus, COD is much more challenging than SOD. From left to right are input images and corresponding ground truth

Creatures in nature have their own unique survival skills, and camouflage is one of them. The long process of evolution has earned many creatures the nickname of “master of camouflage”. Creatures use their own structures and physiological characteristics to reduce the risk of being recognized by their predators. Biological camouflage has emerged through natural selection and survival of the fittest, over a history as long as that of life itself. Camouflage also plays an important role in human scientific research and is widely applied in the military, agricultural, and medical fields. In medical imaging (e.g., lung infection segmentation [49]), during the worldwide outbreak of COVID-19 in February 2020, low-contrast chest images caused considerable difficulty in diagnosis, which can lead to wrong diagnoses, delayed treatment, and serious consequences. Therefore, camouflaged object detection is not only an important contribution to scientific research but also of great significance to human society.

Although both are binary segmentation tasks, SOD [11, 13, 16, 24, 25, 50, 53, 59, 61, 63] has developed rapidly, while COD has received far less attention. Compared with salient objects, camouflaged objects usually have indistinguishable boundaries and low contrast with the background, which makes it difficult for SOD methods to accurately locate them. In this paper, we propose the Guided multi-scale Refinement Network (GRN), which incorporates the multi-scale residual block into both the localization and refinement processes to capture as much useful semantic information as possible. Specifically, it starts with a multi-scale Global Perception (GP) module to find the coarse location of the camouflaged object. In the refinement step, although many multi-scale feature fusion methods have been proposed in existing SOD models, they still face the problem that high-level features are gradually diluted during fusion. To address this, we further propose a Guided multi-scale Refinement (GR) module that merges multi-level features with top-down guidance to alleviate the dilution problem.

In summary, the contributions of this paper can be summarized as follows:

  1) We first propose a multi-scale GP module to capture richer semantic information for precise initial localization of the camouflaged object.

  2) We further propose a GR module to refine the initial prediction in a multi-scale manner with top-down guidance. With its help, the missing object parts and details can be well remedied.

  3) Extensive experiments on three COD datasets and four SOD datasets show the state-of-the-art performance of the proposed model. In addition, our model is also very efficient and compact.

Related work

Salient object detection

Multi-scale fusion

In the past years, many efforts on network architecture design have been made in SOD. Among them, multi-scale feature fusion is a common way to combine the complementary cues between high-level features in deep layers and low-level features in shallow layers. Hou et al. [17] introduced short connections to the skip-layer structures within the HED [52] architecture. Liu et al. [28] leveraged a global and local pixel-wise contextual attention network to capture global and local context information. Deng et al. [5] learned the residual between the intermediate saliency prediction and the ground truth by alternately leveraging the low-level integrated features and the high-level integrated features of a fully convolutional network (FCN) [31]. Wu et al. [51] proposed a cascaded partial decoder framework that directly utilizes the generated saliency map to refine the features of the backbone network; this strategy efficiently suppresses distractors in the features and significantly improves their representation ability. Wei et al. [48] aggregated multi-level features selectively and adaptively selected complementary components from embedded features before fusion, effectively avoiding the introduction of too much redundant information and the destruction of the original features. Pang et al. [37] proposed an aggregate interaction module to integrate features from adjacent levels, using small up-/down-sampling rates and introducing less noise.

Edge-aware

Recently, some edge-aware methods have been proposed to further improve the visual quality of saliency maps near object boundaries. Zhao et al. [64] made use of the rich edge and location information, locating salient targets more accurately by fusing these features. Qin et al. [39] proposed a predict-refine architecture composed of a densely supervised encoder-decoder network and a residual refinement module; they defined a hybrid loss to guide the network to learn the transformation between the input image and the ground truth in a three-level hierarchy. Su et al. [20] proposed a boundary-aware network in which the feature selectivity at boundaries is enhanced by a boundary localization stream, while the feature invariance at interiors is guaranteed by a complex interior perception stream. Wu et al. [50] proposed a Stacked Cross Refinement Network that simultaneously refines the multi-level features of salient object detection and edge detection by stacking cross refinement units.

However, most of the above methods either have high complexity or face the feature dilution problem. In this paper, we propose a simple GR module that lets the deeper-layer prediction guide the extraction of shallower-layer features for more effective and efficient fusion, in which the precise localization of the deep prediction and the rich spatial details of the shallow features are comprehensively integrated.

Camouflaged object detection

Compared to the SOD task, COD has not been fully explored due to its challenging nature. Conventional methods [36, 41, 44] typically use hand-crafted features (e.g., optical flow, color, texture) for camouflaged region detection. In the deep learning era, CNN-based methods have emerged as promising alternatives for COD. In [66], Zheng et al. constructed a large camouflaged people dataset and built a dense deconvolution network for camouflaged people detection. Similar to [66], Le et al. [21] provided a new generic camouflaged object dataset as well as a two-branch network named Anabranch Network for this task. More recently, Fan et al. [10] proposed a large-scale COD dataset called COD10K and introduced a two-stage network called SINet, which consists of a search module and an identification module.

To further advance this field and improve performance, we propose a novel framework called GRN for COD. Specifically, we integrate the multi-scale residual block into both global perception and residual refinement to capture richer semantic information for more accurate detection.

Proposed framework

We first introduce our motivation and the whole architecture, which is built on the multi-scale residual block, in Section 3.1. Then, we describe how this block is integrated into the U-Net-style architecture through the global perception module and the guided refinement module in Sections 3.2 and 3.3, respectively. Finally, we present the loss function in Section 3.4.

Overview

As mentioned above, different from a salient object, which usually differs significantly from the background, a camouflaged object conceals itself well in the background and is thus very difficult to discover. As studied in neuroscience experiments, the various population receptive fields in the human visual system help us find objects at different scales. In computer vision tasks, the receptive field is also very important for enriching context information; thus, various effective structures have been proposed to enlarge the receptive field of the network for better performance, such as ASPP (Atrous Spatial Pyramid Pooling) [4], DenseASPP [55], and RFNet (an end-to-end image matching network based on receptive field) [45].

Recently, Gao et al. [12] proposed the multi-scale residual based Res2Net for image classification, which achieves better performance than the original ResNet [15]. Specifically, it divides the input feature evenly into g feature subsets, which have the same spatial size. One subset is first fed into a convolutional layer for feature learning; the output is then added to the next subset and fed into another convolutional layer. This process is repeated until all the subsets are processed. Finally, all the output subsets are concatenated together and sent to a 1 × 1 convolution for residual learning. Res2Net expands the receptive field by replacing the standard residual block with the above one. Benefiting from this structure, the splits are processed in a multi-scale fashion, which is conducive to the extraction of both global and local information. Let \(f_{k}\) and \(\hat{f}_{k}\) be the input and output features of split group k, and let “Conv” denote a 3 × 3 convolutional operation; the above process can be defined as:

$$ \hat{f}_{k}= \begin{cases} \text{Conv}_{k}(f_{k}), & k=1;\\ \text{Conv}_{k}(\hat{f}_{k-1} + f_{k}), & 1<k<g;\\ f_{k}, & k=g. \end{cases} $$
(1)
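
To make (1) concrete, the following is a minimal PyTorch sketch of such a multi-scale residual block, assuming g = 4 equal-width splits, 3 × 3 convolutions, and a final 1 × 1 fusion with a residual connection; the class and argument names are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MultiScaleResidualBlock(nn.Module):
    """Minimal sketch of the Res2Net-style block in (1); names are illustrative."""
    def __init__(self, channels, g=4):
        super().__init__()
        assert channels % g == 0
        self.g = g
        width = channels // g
        # one 3x3 convolution per split, except the last split which stays an identity
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1) for _ in range(g - 1)
        )
        # 1x1 convolution fusing the concatenated splits for residual learning
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f):
        splits = torch.chunk(f, self.g, dim=1)                 # f_1, ..., f_g
        outs = [self.convs[0](splits[0])]                      # Conv_1(f_1)
        for k in range(1, self.g - 1):
            outs.append(self.convs[k](outs[-1] + splits[k]))   # Conv_k(\hat{f}_{k-1} + f_k)
        outs.append(splits[-1])                                # f_g passes through unchanged
        return f + self.fuse(torch.cat(outs, dim=1))           # residual connection
```

For example, a 64-channel input is split into four 16-channel subsets, and later subsets pass through progressively more 3 × 3 convolutions, which is what enlarges the receptive field in a multi-scale fashion.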

Inspired by this successful structure, we introduce it into our framework to enrich semantic information for COD. Our network adopts a two-stage strategy based on the following observation: in the human visual system, we first search the whole scene for a target and, once it is found, focus on it to recover its details. Accordingly, our network is composed of two parts, a multi-scale Global Perception (GP) module and a Guided multi-scale Refinement (GR) module, which are designed for coarse initial localization and residual refinement, respectively. Starting with GP to locate the camouflaged object roughly, we apply the proposed GR to refine the prediction progressively with multi-scale top-down guidance. The whole architecture is illustrated in Fig. 2. The details of GP and GR are described in the following subsections.

Fig. 2

The whole architecture of the proposed network. The gray blocks mark the trunk of the backbone network. GP takes the deepest feature as input to generate the coarse localization of the camouflaged object. The initial prediction is further refined progressively by the proposed GR module with supervision on each side-output. Convolutional layers are adopted for channel reduction

Multi-scale global perception

We notice that several effective structures have been proposed to capture multi-scale contexts, such as PPM (Pyramid Pooling Module) [29] and ASPP [4]. Different from these parallel-concatenation based methods, we build a multi-scale global perception module by simply stacking the above multi-scale residual block (shown in Fig. 3) on top of the last layer of the backbone network to increase the receptive field. We stack it in a recurrent manner to reduce the number of model parameters while keeping the ability of the module to capture high-level semantic information. A detailed comparison of recurrent block numbers is given in the ablation study (Table 5). The output of GP gives the coarse localization of the camouflaged object, which is further refined by the proposed GR module described in the next subsection.
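
As a rough illustration of this design, the sketch below applies a single shared multi-scale residual block (reusing the MultiScaleResidualBlock sketch from above) n times in a recurrent manner on the deepest backbone feature and then predicts a coarse one-channel map; the weight sharing, channel width, and prediction head are our reading of the text under stated assumptions, not the authors' exact implementation.

```python
import torch.nn as nn

class GlobalPerception(nn.Module):
    """Sketch of GP: apply one shared multi-scale residual block n times on the
    deepest backbone feature, then predict a coarse 1-channel localization map."""
    def __init__(self, channels=64, n=5):
        super().__init__()
        self.n = n
        # a single block applied recurrently keeps the parameter count small;
        # MultiScaleResidualBlock is the sketch defined in Section 3.1
        self.block = MultiScaleResidualBlock(channels)
        self.predict = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, deepest_feature):
        x = deepest_feature
        for _ in range(self.n):
            x = self.block(x)
        return self.predict(x)   # coarse localization of the camouflaged object
```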

Fig. 3

Multi-scale residual block proposed in Res2Net [12]. Here, g is set to 4 for illustration

Guided multi-scale refinement module

After obtaining the coarse prediction, we need to remedy the missing object parts and fix false detections by integrating multi-level convolutional features. As is well known, different layers of deep CNNs learn features at different scales: shallow layers capture low-level structure information, while deep layers capture high-level semantic information. Typically, there are three ways to fuse their complementary information for better feature representation: feature-to-feature fusion, prediction-to-prediction fusion, and prediction-to-feature fusion. The first requires continuously upsampling the deep-layer features for aggregation, as in FPN [26], which is not efficient enough, especially when the features have many channels and a large spatial size. The second, such as the residual learning in R3Net [5], is efficient but fails to preserve some important cues, especially in challenging cases. In general, there are two basic requirements for seamless aggregation: high-level semantic information should be fully preserved, while noisy distractions in shallow layers should be filtered out as much as possible. Following these two criteria, we design an efficient guided refinement module in a prediction-to-feature fusion manner.

As seen in Fig. 4, the input side-output feature with channel C is first equally split into g groups, each of which consists of \(\frac{C}{g}\)-channel feature maps. Then, the deeper side-output prediction is used as a guidance map and concatenated with the \(\frac{C}{g}\)-channel feature maps within each group to obtain (\(\frac{C}{g}+1\))-channel feature maps. A 3 × 3 convolution is performed on each concatenated feature map except for the last one. Let \(x^{i}\) and \(\hat{x}^{i}\) be the input and output features at side-output stage i; the concatenated feature in group k is added to the output of \(\text{Conv}_{k-1}\) and then fed into \(\text{Conv}_{k}\). As suggested in Res2Net [12], g is set to 4 in this paper. Thus, the output feature of group k can be formulated as:

$$ \hat{x}_{k}^{i}= \begin{cases} \text{Conv}_{k}(\text{Cat}(x_{k}^{i},y^{i+1})), & k=1;\\ \text{Conv}_{k}(\hat{x}_{k-1}^{i} + \text{Cat}(x_{k}^{i},y^{i+1})), & 1<k<4; \\ \text{Cat}(x_{k}^{i},y^{i+1}), & k=4, \end{cases} $$
(2)

where “Cat” denotes the concatenation operation. Then, all the \(\hat{x}_{k}^{i}\) are concatenated to obtain a (C + g)-channel feature, which is fed into several convolutional layers for guided refinement to produce a residual; this residual is added to the input prediction to form the refined output at side-output stage i. The above operation can be expressed as:

$$ y^{i}=y^{i+1}+\text{Convs}(\hat{x}^{i}). $$
(3)
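
A minimal sketch of the GR module following (2) and (3) is given below, with g = 4 groups; the bilinear upsampling of the deeper prediction \(y^{i+1}\) and the exact composition of “Convs” (two 3 × 3 convolutions here) are assumptions made for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedRefinement(nn.Module):
    """Sketch of the GR module in (2)-(3): split the side-output feature into g groups,
    guide each group with the deeper prediction, then predict a residual correction."""
    def __init__(self, channels, g=4):
        super().__init__()
        assert channels % g == 0
        self.g = g
        width = channels // g
        # Conv_1 ... Conv_{g-1}; the last group is left unconvolved, as in (2)
        self.convs = nn.ModuleList(
            nn.Conv2d(width + 1, width + 1, kernel_size=3, padding=1) for _ in range(g - 1)
        )
        # "Convs" in (3): maps the (C+g)-channel feature to a 1-channel residual
        self.residual = nn.Sequential(
            nn.Conv2d(channels + g, channels // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, 1, kernel_size=3, padding=1),
        )

    def forward(self, x_i, y_deep):
        # upsample the deeper prediction y^{i+1} to the size of the side-output feature
        y = F.interpolate(y_deep, size=x_i.shape[2:], mode='bilinear', align_corners=False)
        groups = torch.chunk(x_i, self.g, dim=1)
        cats = [torch.cat([grp, y], dim=1) for grp in groups]        # Cat(x_k^i, y^{i+1})
        outs = [self.convs[0](cats[0])]                              # k = 1
        for k in range(1, self.g - 1):
            outs.append(self.convs[k](outs[-1] + cats[k]))           # 1 < k < g
        outs.append(cats[-1])                                        # k = g, no convolution
        return y + self.residual(torch.cat(outs, dim=1))             # y^i = y^{i+1} + Convs(\hat{x}^i)
```
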
Fig. 4

Structure of the GR module. The purple block is the input, which is split equally for subsequent refinement. Each split group is concatenated with the upsampled prediction from the deep layer; the following steps are the same as in the multi-scale residual block

It is worth pointing out that the proposed guided multi-scale refinement differs from R3Net [5], where the deep prediction is directly concatenated with the convolutional feature for residual refinement, which can be classified as a single-scale guidance strategy. Furthermore, R3Net first groups all the side-output features into high-level and low-level features for recurrent refinement, whereas in our model the multi-level features are fed into the GR module progressively for residual refinement. Our module does not need to upsample multi-channel features for concatenation and is thus more efficient.

Loss function

We define our training loss as the weighted summation of the losses over all side-outputs:

$$ \mathcal{L}=\sum\limits_{q=1}^{Q}\alpha_{q}\mathit{l}^{(q)}, $$
(4)

where \(\mathit{l}^{(q)}\) is the loss of the q-th side-output, Q denotes the total number of side-outputs, and \(\alpha_{q}\) is the weight of each loss.

To obtain high-quality region segmentation and clear boundaries, we define each side-output loss as a hybrid loss:

$$ \mathit{l}^{(q)}=\mathit{l}_{bce}^{(q)}+\mathit{l}_{iou}^{(q)}, $$
(5)

where \(\mathit {l}_{bce}^{(q)}\) and \(\mathit {l}_{iou}^{(q)}\) denote BCE loss [1] and IoU loss [34], respectively.

BCE [1] loss is the most widely used loss in binary classification and segmentation. It is defined as:

$$ \mathit{l}_{bce}=-\sum\limits_{(r,c)} \left[ G_{(r,c)}\log(S_{(r,c)})+(1-G_{(r,c)})\log(1-S_{(r,c)}) \right], $$
(6)

where \(G_{(r,c)}\in \{0,1\}\) is the ground truth label of pixel (r,c) and \(S_{(r,c)}\) is the predicted probability of the pixel belonging to the object.

IoU [34] was originally proposed for measuring the similarity of two sets [19] and has since been used as a standard evaluation measure for object detection and segmentation. Recently, it has also been used as a training loss for SOD [34, 40]. It is defined as below:

$$ \mathit{l}_{iou}=1-\frac{\sum\limits_{r=1}^{H}\sum\limits_{c=1}^{W}S_{(r,c)}G_{(r,c)}}{\sum\limits_{r=1}^{H}\sum\limits_{c=1}^{W}\left[S_{(r,c)}+G_{(r,c)}-S_{(r,c)}G_{(r,c)}\right]}. $$
(7)
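
Putting (4)–(7) together, the hybrid loss could be sketched as follows; the logits-based BCE with mean reduction, the small epsilon for numerical stability, and equal side-output weights αq = 1 are assumptions, since the paper does not specify these details.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred_logits, gt):
    """l^(q) = l_bce + l_iou for one side-output; pred_logits is the raw prediction map,
    gt is the binary ground truth in {0, 1} with shape (B, 1, H, W)."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, gt)   # (6), mean reduction assumed
    prob = torch.sigmoid(pred_logits)
    inter = (prob * gt).sum(dim=(2, 3))
    union = (prob + gt - prob * gt).sum(dim=(2, 3))
    iou = 1.0 - (inter / (union + 1e-6)).mean()                 # (7)
    return bce + iou                                            # (5)

def total_loss(side_outputs, gt, weights=None):
    """L = sum_q alpha_q * l^(q) over all side-outputs (4); equal weights assumed here."""
    weights = weights or [1.0] * len(side_outputs)
    return sum(w * hybrid_loss(F.interpolate(p, size=gt.shape[2:], mode='bilinear',
                                             align_corners=False), gt)
               for w, p in zip(weights, side_outputs))
```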

Experiments

Dataset

COD datasets

We perform extensive experiments on the following three widely used camouflaged object detection datasets: CHAMELEON [43] contains 76 images that were collected from the Internet via the Google search engine using “camouflaged animal” as a keyword, with manually annotated object-level ground-truths (GT). Another contemporary dataset is CAMO [22], which has 2,500 images (2,000 for training, 500 for testing) covering eight categories. COD10K [10] is a more challenging, higher quality, and densely annotated dataset. COD10K contains 10,000 images (5,066 camouflaged, 3,000 background, 1,934 non-camouflaged), divided into 10 super-classes, and 78 sub-classes (69 camouflaged, nine non-camouflaged) which are collected from multiple photography websites. To provide a large amount of training data for deep learning models, COD10K is split into 6,000 images for training and 4,000 for testing, randomly selected from each sub-class.

SOD datasets

To evaluate the learning ability of our model, we also conduct experiments on four widely-adopted salient object datasets: ECSSD [54], DUT-OMRON [35], HKU-IS [25] and DUTS [47]. Among the four datasets, DUTS is a large-scale dataset containing 10,553 training images (denoted as DUTS-TR) and 5,019 test images (denoted as DUTS-TE). More details can be found in recently released standard benchmarks [6].

Evaluation metrics

Four metrics are utilized to evaluate our model and existing state-of-the-art approaches. MAE (mean absolute error) evaluates the average pixel-wise difference between the prediction map and the ground truth. Let \(P\) and \(G\) denote the prediction map and the ground truth, both normalized to [0,1]. MAE is computed as \(MAE=\frac{1}{w\times h}\sum_{x=1}^{w}\sum_{y=1}^{h}\left| P(x,y)-G(x,y)\right|\), where \(w\) and \(h\) are the width and height of the image, respectively. F-measure is the harmonic mean of precision and recall, formulated as \(F_{\beta}=\frac{(1+\beta^{2})Precision\times Recall}{\beta^{2}Precision+Recall}\), where \(\beta^{2}\) is set to 0.3 to emphasize precision over recall, as suggested in previous works [3, 61, 64]. Weighted F-measure [33] offers an intuitive generalization of \(F_{\beta}\), defined as \(F_{\beta}^{\omega}=\frac{(1+\beta^{2})Precision^{\omega}\times Recall^{\omega}}{\beta^{2}Precision^{\omega}+Recall^{\omega}}\), and addresses several causes of inaccurate evaluation. S-measure (structural similarity measure) [7] focuses on evaluating the structural similarity, which is much closer to human visual perception. It is computed as \(S_{\alpha}=\alpha s_{o}+(1-\alpha)s_{r}\), where \(s_{o}\) and \(s_{r}\) denote the object-aware and region-aware structural similarity and \(\alpha\) is set to 0.5. E-measure (enhanced-alignment measure) [8] combines local pixel values with the image-level mean value in one term to jointly capture image-level statistics and local pixel matching information. It is computed as \(E_{\xi}=\frac{1}{w\times h}\sum_{x=1}^{w}\sum_{y=1}^{h}\theta(\xi_{FM})\), where \(\theta(\xi_{FM})\) is the enhanced alignment matrix.
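
For reference, the first two metrics can be computed as in the sketch below (NumPy, single image, prediction normalized to [0,1]); in practice, S-measure and E-measure are usually computed with the authors' released evaluation toolboxes, and the maximum F-measure/E-measure are obtained by sweeping the binarization threshold.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a [0,1] prediction map and its binary ground truth."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """F_beta at one binarization threshold, with beta^2 = 0.3 as in the paper."""
    binary = pred >= threshold
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```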

Table 1 Quantitative results on different COD datasets in four metrics: max E-measure (higher better), S-measure (higher better), F-measure (higher better) and MAE (lower better)

Implementation details

All the experiments are implemented in PyTorch and run on a PC with an NVIDIA TITAN Xp GPU and an Intel i7-6700K 4.0 GHz CPU. ResNet-50 [15] pre-trained on ImageNet is used as the backbone network. We apply data augmentation such as random flipping and multi-scale input images to alleviate the risk of over-fitting. All images are resized to 352 × 352 for both training and testing. The training set is COD10K, the same as for SINet [10]. The hyper-parameters are set as follows: the initial learning rate is set to 0.0005 and decreased by a factor of 10 after 25 epochs, the batch size is set to 10, and the maximum number of epochs is 30.
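
A training-loop sketch mirroring the reported hyper-parameters is given below; the optimizer type (Adam), the plain BCE criterion, and the placeholder model and data loader are assumptions introduced only to keep the snippet self-contained.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins so that the snippet runs on its own; in practice `model` is the
# full GRN with a ResNet-50 backbone and `loader` iterates over the COD10K training set.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
loader = [(torch.rand(10, 3, 352, 352), (torch.rand(10, 1, 352, 352) > 0.5).float())]

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)   # optimizer type is an assumption
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.1)  # lr / 10 after 25 epochs
criterion = nn.BCEWithLogitsLoss()                          # stand-in for the hybrid loss above

for epoch in range(30):                                     # maximum number of epochs: 30
    for images, gts in loader:                              # batch size 10, inputs resized to 352x352
        loss = criterion(model(images), gts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```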

Comparisons with SOTAs

In this section, we compare the proposed model with state-of-the-art methods on the COD and SOD datasets. For a fair comparison, we use either the released implementations with pre-trained parameters or the detection results provided by the authors.

Quantitative evaluation on COD

For the COD task, our GRN is compared with 14 state-of-the-art methods [2, 14, 18, 28, 29, 39, 46, 48, 50, 51, 55, 62, 64, 65, 67]. Table 1 shows the detailed experimental results. It can be seen that our proposed network significantly outperforms the other state-of-the-art methods on the COD10K dataset; specifically, we achieve 4% and 6.9% improvements over the second-best model in terms of E-measure and weighted F-measure. On the smaller CHAMELEON and CAMO datasets, we also obtain the best evaluation scores except for S-measure. Although additional edge supervision is applied in SCRN, we are still comparable with it in S-measure.

Quantitative evaluation on SOD

We also compare GRN with seven recent state-of-the-art methods [37, 39, 48, 50, 51, 64, 68] on the SOD datasets. Detailed experimental results are presented in Table 2. From these results, we can observe that the proposed GRN performs favorably against them, especially on the challenging DUT-OMRON dataset.

Table 2 Quantitative results on different SOD datasets

Visual comparison

Figure 5 shows a visual comparison of the results of our method with those of recent state-of-the-art methods [2, 10, 37, 39, 48, 51, 64, 68] on the COD datasets. We choose images from different challenging scenes for comparison, such as object-background interference and low-contrast backgrounds (1st and 2nd rows), objects mimicking camouflage (3rd, 4th and 5th rows), small objects in the scene (6th, 7th and 8th rows), and occluded objects (9th and 10th rows). According to Fig. 5, we can observe that our GRN performs well in all these challenging scenarios. To test the universality of the proposed network, we also selected existing SOD networks and compared them with ours on the SOD datasets. According to Fig. 6, our network also shows excellent detection performance in complex scenarios such as small targets and multiple targets.

Fig. 5

Visual comparison on the COD task in some challenging cases, including objects mimicking camouflage, multiple and small targets, low-contrast backgrounds, and occluded objects. From left to right: input image, ground truth (GT), our method, SINet [10], F3Net [48], SCRN [50], CPD [51], EGNet [64]

Fig. 6

Visual comparison on SOD task. From left to right: input image, ground truth (GT), our method, ITSD [68], MINet [37], F3Net [48], SCRN [50], EGNet [64], BAN [39], CPD [51]

Attributes-based performance on SOC

The challenging SOC [6] dataset is annotated with attributes that reflect common challenges in real-world scenarios. These annotations are helpful for studying the performance of salient object detection models. The structural similarity (S-measure) scores of our proposed model and 10 common algorithms are presented in Table 3.

Table 3 Attributes-based performance on the challenging SOC dataset [6]

As can be seen from the table, our model achieves the highest score on six of the nine attributes and ranks among the top methods on the remaining ones. These results show that the proposed model is applicable to most existing challenging scenarios, especially for the attributes AC (Appearance Change), OC (Occlusion), SC (Shape Complexity) and SO (Small Object), which is consistent with the model characteristics described in the previous sections. Since the network is designed for detecting camouflaged objects, even though it is not outstanding on BO (Big Object), MB (Motion Blur) and some other attributes, it still achieves relatively leading results. Furthermore, our model has a faster testing speed and a smaller model size, as described below.

Model complexity

The experiments show that our proposed GRN is not only superior to most previous networks in performance but also has advantages in model complexity. We compare our network with several classic networks, the details of which are shown in Table 4. We report the input image size, model size, backbone network, and test-time FPS of each network. Clearly, our model runs much faster than all the competing methods with a much more compact model size. Such high efficiency and compactness make it a better choice for real-world applications.

Table 4 Model complexity comparison with representative existing methods

Ablation analysis

In this section, we make an in-depth analysis of the proposed GRN with different design options.

Number of residual blocks

In this variant, we adjust the recurrence number of the residual block in the GP module to investigate its impact on detection performance. As can be seen from Table 5, the performance gradually improves as the number of residual blocks increases up to n = 5, at which the best performance is obtained. This shows that the multi-scale global perception module successfully enlarges the receptive field for better localization of camouflaged objects. The experiment also shows that when the residual block is stacked more than five times, the performance begins to gradually decrease, while the complexity and computational cost of the model increase significantly. Considering both computational complexity and performance, we choose n = 5 as our final setting.

Table 5 Ablation experimental results on COD10K

Guidance strategy

We also conduct an experiment to assess the effectiveness of the proposed multi-scale guidance strategy by comparing it with single-scale guidance, which directly concatenates the deep prediction with the side-output features, as in R3Net [5]. From the results in Table 5, we can conclude that our multi-scale guidance strategy yields a significant improvement over the single-scale manner, especially in E-measure and F-measure. This indicates that more local information and spatial structure information are extracted by the network with the help of the multi-scale guidance, making the output of the network closer to the ground truth.

Conclusion

Camouflaged object detection is a very challenging task, as camouflaged objects are usually concealed in the background. In this paper, we present an effective two-stage network for COD with multi-scale top-down guidance. By stacking several multi-scale residual blocks on top of the backbone network in a recurrent manner, we can locate the camouflaged object more accurately. This coarse prediction is then refined progressively by fusing it with side-output features in a multi-scale guided manner, so that missing object parts and false detections can be well remedied. Experimental results on three COD datasets and four SOD datasets demonstrate the effectiveness of the proposed network compared with existing state-of-the-art methods. In addition, the proposed network is simpler and more efficient than previous ones, which makes it a better choice for real-world applications.