Deep Gradient Learning for Efficient Camouflaged Object Detection

This paper introduces the deep gradient network (DGNet), a novel deep framework that exploits object gradient supervision for camouflaged object detection (COD). It decouples the task into two connected branches, i.e., a context encoder and a texture encoder. The essential connection between them is the gradient-induced transition, representing a soft grouping between context and texture features. Benefiting from this simple but efficient framework, DGNet outperforms existing state-of-the-art COD models by a large margin. Notably, our efficient version, DGNet-S, runs in real time (80 fps) and achieves results comparable to the cutting-edge model JCSOD-CVPR21 with only 6.82% of its parameters. Application results also show that the proposed DGNet performs well on polyp segmentation, defect detection, and transparent object segmentation tasks. The code will be made available at https://github.com/GewelsJI/DGNet.

Recent studies [14−17] present compelling results based on the supervision of the whole object-level ground-truth mask. Later, various sophisticated techniques, e.g., boundary-based [18−20] and uncertainty-guided [21, 22] ones, were developed to augment COD's underlying representations. However, features learned from boundary-supervised or uncertainty-based models usually respond only to the sparse edges of camouflaged objects, thereby introducing noisy features, especially in complex scenes (see Fig. 1(a)). Besides, the boundaries of camouflaged objects are often "indefinable" or "fuzzy"; thus, they do not pop out under a quick visual scan. We notice that despite an object's camouflage, some clues are still left behind, shown in the first column of Fig. 1 (white speckles). Instead of extracting only boundary or uncertainty regions, we are interested in how the network can mine these "discriminative patterns" inside the object.

[Fig. 1 (a) Object boundary; (b) object gradient.]

From this perspective, we present our deep gradient network (DGNet) via the explicit supervision of the object-level gradient map. The underlying hypothesis is that there are some intensity changes inside camouflaged objects. To ease the learning task, we decouple DGNet into two connected branches, i.e., a context encoder and a texture encoder. The former can be viewed as a contextual-semantics learner, while the latter acts as a structural-texture extractor. In this way, we alleviate the feature ambiguity between the high-level and low-level features extracted by each branch. To sufficiently aggregate the two discriminative features generated by the two branches, we further design a gradient-induced transition (GIT) module that collaboratively ensembles the multi-source feature space at different group scales (i.e., soft grouping). Fig. 1(b) shows that our DGNet can detect texture patterns while suppressing background noise via an intensity-sensitive strategy focused on the intra-region of a camouflaged object.

Extensive experiments on three challenging COD benchmarks illustrate that the proposed DGNet achieves state-of-the-art (SOTA) performance without introducing any complicated structures. Furthermore, we implement an efficient version, DGNet-S, with 8.3 M parameters, which achieves the fastest inference speed (80 fps) among COD-related baselines. Notably, it has only 6.82% of the parameters of the cutting-edge model JCSOD-CVPR21 [22] while achieving comparable performance. These results show that DGNet significantly narrows the gap between scientific research and practical application. Three downstream applications (see Section 5) of our DGNet also support this conclusion. The major contributions of this paper are summarized as follows:
1) We introduce a novel deep gradient-based framework, dubbed DGNet, for addressing the camouflaged object detection task.
2) We propose a gradient-induced transition to automatically group features from the context and texture branches according to the soft grouping strategy.
3) We apply DGNet to three downstream applications, including polyp segmentation, defect detection, and transparent object segmentation, and achieve good performance.

Prior works
Traditional methods detect camouflaged objects by extracting various hand-crafted features that distinguish camouflaged areas from their backgrounds, e.g., 3D convexity [23], the co-occurrence matrix [24], expectation-maximization statistics [25], optical flow [26], and the Gaussian mixture model [27]. These methods work well for simple backgrounds, but their performance degrades drastically on complex backgrounds.
CNN-based approaches can be generally categorized into three strategies: 1) Attention-based strategy: Sun et al. [28, 29] introduce a network with an attention-induced cross-level fusion module to integrate multi-scale features and a dual-branch global context module to mine multi-scale contextual information. To mimic the detection process of predators, Mei et al. [14] develop PFNet, which contains a positioning module and a focusing module to conduct the identification. Some works propose delicate structures, such as feature covariance matrices [30] and multivariate calibration components [19], to improve the robustness of the network. Kajiura et al. [31] improve the detection accuracy by exploring the uncertainties of pseudo-edge and pseudo-map labels. Zhuge et al. [32] propose a cube-like architecture for COD, which combines attention fusion and X-shaped connections to sufficiently integrate multi-layer features. 2) Two-stage strategy: The search-and-identification strategy [1] is an early practice for modelling the COD task. In [2], a neighbor connection decoder and group-reversal attention are introduced into SINet [1] to further boost performance. 3) Joint-learning strategy: ANet [33] is an early attempt to utilize a joint classification-and-segmentation scheme for COD. LSR [15] and JCSOD [22] have recently renewed the joint-learning framework by introducing camouflage ranking or transferring knowledge from salient objects to camouflaged objects. ZoomNet [34] is a mixed-scale triplet network that employs a zoom strategy to learn discriminative camouflaged semantics.
Transformer-based & Graph-based models are two recent technology trends. Recently, Mao et al. [35] introduced the concept of difficulty-aware learning based on the Transformer for both camouflaged and salient object detection. UGTR [21] explicitly utilized the probabilistic representational model to learn the uncertainties of the camouflaged object under the Transformer framework. In addition, Cheng et al. [36] are the first to collect a video dataset for COD and utilize the Transformer-based framework to exploit short-term dynamics and long-term temporal consistency to detect dynamic camouflaged objects. Later, Zhai et al. [18] designed the mutual graph learning model, which decouples one input into different features for roughly locating the target and accurately capturing its boundary.
Remarks. In contrast to the above methods, our work excavates texture information by learning the object-level gradient rather than using boundary-aware or uncertainty-aware modelling. The biologically inspired idea behind this is that the abundant gradient cues inside a camouflaged object deserve to be explored, while the sparse boundary cues are insufficient for this purpose. As shown in Fig. 2, we also note that the recent work TINet [37] tries to utilize texture cues but discards excessive object gradient cues due to the threshold settings of its Canny detector. In short, this paper aims to design an elegant framework towards efficient COD with a more concise idea (i.e., object gradient learning). More experimental validations are discussed in Section 4.3.

Deep gradient network
As discussed in [38], low-level and high-level features play equally important roles in scene understanding. However, as suggested by [39], encoding them simultaneously within a single branch is discouraged. As shown in Fig. 3, we therefore propose to model the camouflaged representations with two separate encoders: a context encoder and a texture encoder.

Context encoder
For a camouflaged input image $I \in \mathbb{R}^{3\times H\times W}$, we use the widely used EfficientNet [40] as the context encoder to extract a pyramid of contextual features.

Dimensional reduction. Inspired by [3], we adopt the following two steps to ensure efficient element-wise operations between different levels in the decoding stage: 1) We only pick out the top-three features (i.e., the lateral outputs of the last three extracted stages; see the training settings in Section 3.5), which retain the affluent semantics of the visual scene. 2) We further utilize two stacked $1\times 1$ convolutional layers to reduce the channel dimension of each candidate feature, easing the computational burden of subsequent operations. The final outputs are three context features $f_k^c \in \mathbb{R}^{C\times H_k\times W_k}$, where $C$, $H_k$, and $W_k$ denote the channel, height, and width of the feature maps.
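To make the two-step reduction concrete, below is a minimal PyTorch sketch of the context branch. The timm backbone name, the reduced channel width (64), and the BatchNorm/ReLU between the two 1×1 convolutions are illustrative assumptions, not the paper's verified configuration.

```python
# A minimal sketch of the context branch, assuming a timm EfficientNet backbone.
import torch
import torch.nn as nn
import timm

class ContextEncoder(nn.Module):
    def __init__(self, out_channels=64):  # reduced width is an assumption
        super().__init__()
        # features_only=True returns a pyramid of intermediate feature maps.
        self.backbone = timm.create_model(
            "efficientnet_b1", pretrained=True, features_only=True)
        chs = self.backbone.feature_info.channels()[-3:]  # top-three stages
        # Two stacked 1x1 convolutions per level for dimensional reduction.
        self.reduce = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c, out_channels, 1),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_channels, out_channels, 1),
            ) for c in chs)

    def forward(self, x):
        feats = self.backbone(x)[-3:]        # keep only the top-three features
        return [r(f) for r, f in zip(self.reduce, feats)]

# Usage: f3, f4, f5 = ContextEncoder()(torch.randn(1, 3, 352, 352))
```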

Texture encoder
We also introduce a tailored texture branch supervised by the object-level gradient map, compensating for the pattern degradation caused by the top-three context features′ weak representation of geometric textures.
Object gradient generation. An image gradient describes the directional change in an image's intensity or color between adjacent positions, and is widely applied in edge detection [43] and super-resolution [44]. The right part of Fig. 3 presents four widely used types of supervision labels. The object boundary (c) and the image gradient (e) can be directly generated by calculating the gradient of the object-level ground-truth mask (b) and of the raw image (a), respectively. However, the raw image gradient map (e) contains irrelevant background noise and may mislead the optimization process when serving as the supervision signal for texture learning. To address this problem, we introduce a novel camouflage learning paradigm that uses the object-level gradient map (d) as supervision, which holds the gradient cues of both the object's boundaries and its interior regions. This process can be formulated as
$$ G(x, y) = \nabla\big(I(x, y)\big) \otimes M(x, y), $$
where $\nabla(\cdot)$ represents the standard Canny edge detector [45] applied to the input image $I$ with discrete pixel coordinates $(x, y)$, $M$ is the object-level ground-truth mask, and $\otimes$ means element-wise multiplication.
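For concreteness, this label-generation step can be sketched in a few lines with OpenCV; the Canny thresholds below are illustrative assumptions.

```python
# A minimal sketch of object-level gradient label generation, following the
# formulation above; the Canny thresholds (50, 150) are illustrative.
import cv2
import numpy as np

def object_gradient(image_path: str, mask_path: str) -> np.ndarray:
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    grad = cv2.Canny(image, 50, 150)      # raw image gradient, label (e)
    obj = (mask > 127).astype(np.uint8)   # binary object mask, label (b)
    return grad * obj                      # object-level gradient, label (d)
```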
Texture encoder. Because low-level features with a high resolution introduce a heavy computational burden, we design a tailored lightweight encoder instead of utilizing an out-of-the-box backbone. We obtain the texture feature $f^t$ from an intermediate layer (see Table 1) and supervise the subsequent layer with the object-level gradient map $G$. We keep the texture feature at a larger spatial resolution, since features with smaller resolutions would discard most geometric details.
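Since Table 1's exact layer specification is not fully recoverable here, the following is only a plausible PyTorch sketch of such a tailored lightweight encoder, with an assumed three-layer stem of 32 channels and an auxiliary head supervised by the gradient label.

```python
# A minimal sketch of a tailored lightweight texture encoder in the spirit of
# Table 1; the layer count, kernel sizes, and strides are assumptions.
import torch.nn as nn

class TextureEncoder(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        def block(cin, cout, stride):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride, 1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True))
        self.stem = nn.Sequential(
            block(3, channels, 2), block(channels, channels, 2),
            block(channels, channels, 1))
        # An extra layer maps the texture feature to a single-channel map that
        # is supervised by the object-level gradient label.
        self.grad_head = nn.Conv2d(channels, 1, 1)

    def forward(self, x):
        f_t = self.stem(x)               # texture feature at high resolution
        return f_t, self.grad_head(f_t)  # (feature, gradient prediction)
```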

Gradient-induced transition
The latent correlation between context and texture features offers great potential for adaptive fusion, rather than adopting naive fusion strategies (e.g., concatenation and addition). Here, we design a flexible plug-and-play gradient-induced transition (GIT) module (see Fig. 4), which treats the texture feature as an auxiliary in the multi-source aggregation from a group-wise perspective. Specifically, it comprises the following three steps.
Gradient-induced group learning. Inspired by [2], we first adopt a gradient-induced group learning strategy, which splits the three context features $f_k^c$ and the texture feature $f^t$ into fixed groups along the channel dimension. For each $f_k^c$ and $f^t$ pair, this strategy can be formulated as
$$ \{g_i^c\}_{i=1}^{N} = \mathcal{S}(f_k^c), \quad \{g_i^t\}_{i=1}^{N} = \mathcal{S}(f^t), $$
where $\mathcal{S}(\cdot)$ is the feature grouping operation, $C_g$ denotes the channel number of each feature group, and $N$ is the corresponding number of groups; we set $C_g = 32$ as the default setting. Then, we periodically arrange the groups of context features $\{g_i^c\}$ and texture features $\{g_i^t\}$, which generates the regrouped feature $f^r$ via
$$ f^r = \big[\,g_1^c, g_1^t, g_2^c, g_2^t, \dots, g_N^c, g_N^t\,\big], $$
where $[\cdot]$ means the channel-wise feature concatenation, so that each sub-component of $f^r$ is derived from either a context or a texture group. Beforehand, the spatial resolution of $f^t$ is adjusted to match that of $f_k^c$.

[Fig. 2 Compared to the texture label (c) proposed in TINet [37], our object gradient label (a) keeps more geometric cues inside the camouflaged object. DGNet, under the supervision of the texture label (c), fails to infer attentive regions (d) owing to the imbalanced distribution of sparse pixels (e.g., thin object boundaries). Notably, such supervision makes our DGNet more robust with reliable auxiliaries, e.g., the feature in (b).]

Soft grouping strategy. Naive feature fusion strategies may ignore the correlation or distinctiveness between the context and texture representations owing to the lack of further multi-source interactions. Inspired by the parallel design introduced in [46] for capturing objects at multiple scales, we propose a soft grouping strategy with group numbers $\{N_1, N_2, N_3\}$ that provides parallel nonlinear projections at multiple fine-grained sub-spaces, enabling the network to probe multi-source representations jointly. Specifically, we set three parallel sub-branches (i.e., the gray region of Fig. 4) for the soft grouping in our experiments. For simplicity of illustration, we take one sub-branch as an example and omit its subscript, which can be formulated as
$$ f_n^{s} = \delta\big(\mathcal{F}(f_n^{r}; W_n)\big), \quad n \in \{1, \dots, N\}, $$
where $\delta(\cdot)$ intentionally introduces soft non-linearity at each multi-source sub-space. The projection function $\mathcal{F}(\cdot\,; W_n)$ is implemented by a convolutional layer parameterized by learnable weights $W_n$. Here, $f_n^{r}$ is the $n$-th subset of the regrouped feature $f^{r}$, which is divided into $N$ groups.

Parallel residual learning. We further introduce residual learning [47] in a parallel manner at different group-aware scales. Consequently, we can define the GIT function (see the red block in Fig. 3) as
$$ f_k^{g} = f_k^{c} \oplus \sum_{i=1}^{3} \lambda_i\, f_i^{s}, $$
where $\{\lambda_i\}$ denotes a set of scaling factors for the different groups, which will be discussed in Section 4.3, $\oplus$ means the element-wise addition, and $\sum$ denotes a sum of multiple terms. The final output is $f_k^{g}$.

[Table 1 Details of the tailored texture encoder. k: kernel size, c: output channels, s: stride, p: zero-padding.]

[Fig. 3 Overall pipeline of the proposed DGNet. It consists of two connected learning branches, i.e., the context encoder (Section 3.1) and the texture encoder (Section 3.2). A gradient-induced transition (GIT) (Section 3.3) then collaboratively aggregates the features derived from the two encoders. Finally, a neighbor connection decoder (NCD) [2] is adopted to generate the prediction (Section 3.4).]

[Fig. 4 Illustration of the GIT module. Legend: context feature, texture feature, output feature, residual connection, element-wise addition.]
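Putting the three steps together, a minimal PyTorch sketch of GIT might look as follows. The group numbers (8, 16, 32), the assumption that both inputs share the same channel count after reduction, and the learnable scaling factors initialized to one are illustrative choices, not the paper's verified configuration.

```python
# A minimal sketch of the gradient-induced transition (GIT); group counts and
# initial scaling factors are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GIT(nn.Module):
    def __init__(self, channels=64, groups=(8, 16, 32)):
        super().__init__()
        self.groups = groups
        # One grouped 3x3 convolution per sub-branch acts as the soft
        # projection over the regrouped feature (2*channels after interleave).
        self.proj = nn.ModuleList(
            nn.Conv2d(2 * channels, 2 * channels, 3, padding=1, groups=g)
            for g in groups)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.scale = nn.Parameter(torch.ones(len(groups)))  # learnable lambdas

    @staticmethod
    def regroup(f_c, f_t, n):
        # Split both features into n channel groups and interleave them:
        # [g1_c, g1_t, g2_c, g2_t, ...]. Assumes equal channel counts.
        gc, gt = f_c.chunk(n, dim=1), f_t.chunk(n, dim=1)
        return torch.cat([x for pair in zip(gc, gt) for x in pair], dim=1)

    def forward(self, f_c, f_t):
        f_t = F.interpolate(f_t, size=f_c.shape[-2:], mode="bilinear",
                            align_corners=False)  # match spatial resolution
        out = 0
        for lam, g, proj in zip(self.scale, self.groups, self.proj):
            f_r = self.regroup(f_c, f_t, g)      # gradient-induced regrouping
            out = out + lam * F.relu(proj(f_r))  # soft non-linear projection
        return f_c + self.fuse(out)              # parallel residual learning
```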

Learning details
Decoder. Given the context features $f_k^c$ and the texture feature $f^t$, we first apply the GIT function (Section 3.3) to obtain the gradient-induced features $f_k^g$. To exploit these features more efficiently, we utilize the neighbor connection decoder (NCD) [2] to generate the final prediction, enabling feature propagation from high levels to low levels. Thus, the final prediction $P$ is derived from the NCD over the features $f_k^g$.

Loss function. The overall optimization objective is defined as
$$ \mathcal{L} = \mathcal{L}_{seg} + \mathcal{L}_{grad}, $$
where $\mathcal{L}_{seg}$ and $\mathcal{L}_{grad}$ represent the segmentation and object gradient loss functions, respectively. The former is formulated as $\mathcal{L}_{seg} = \mathcal{L}_{IoU}^{w} + \mathcal{L}_{BCE}^{w}$, where $\mathcal{L}_{IoU}^{w}$ and $\mathcal{L}_{BCE}^{w}$ represent the weighted intersection-over-union loss and the weighted binary cross-entropy loss. They assign an adaptive weight to each pixel according to its difficulty, focusing on the global structure and paying more attention to hard pixels. The definitions of these losses are the same as in [1, 2, 48], and their effectiveness has been proven in binary segmentation. For the latter, we employ the standard mean squared error (MSE) loss.
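As a concrete reference, the weighted BCE + weighted IoU combination used in [1, 2, 48] is commonly implemented as below; the 31×31 averaging window for estimating per-pixel difficulty follows that line of work and is an assumption here.

```python
# A minimal sketch of the overall objective: weighted BCE + weighted IoU for
# segmentation, plus an MSE gradient loss. Masks are expected as float tensors
# of shape (B, 1, H, W) in [0, 1]; predictions are logits.
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    # Pixels whose 31x31 neighborhood disagrees with them are weighted higher.
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(pred_seg, gt_mask, pred_grad, gt_grad):
    return structure_loss(pred_seg, gt_mask) + F.mse_loss(pred_grad, gt_grad)
```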
Training settings. The proposed DGNet is implemented in the PyTorch [49]/Jittor [50] toolboxes and trained/inferred on a single NVIDIA RTX TITAN GPU. The model parameters are initialized with the strategy of [51], and we initialize the backbone with weights pre-trained on ImageNet [52] to prevent over-fitting. We discard the last stage (the 1×1 convolution, pooling, and fully connected layers) of the EfficientNet [40] backbone and extract the features from the top-three lateral outputs, i.e., stage-4, stage-6, and stage-8. Considering the performance-efficiency trade-off, we instantiate two variants to adapt to specific requirements under various computational overheads (refer to Table 2).
We train our model end-to-end using Adam [53]. The cosine annealing part of the SGDR strategy [54] is used to adjust the learning rate between its minimum and maximum values, with the maximum adjusted iteration set to 20. The batch size is set to 12, and the maximum training epoch is 100. During training, we resize each image to 352×352 and feed it into DGNet with four data augmentation techniques: color enhancement, random flipping, random cropping, and random rotation. Finally, our DGNet and DGNet-S take 8.8 and 7.9 hours, respectively, to reach convergence.
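A minimal sketch of this optimization setup follows. `DGNet` and `loader` are assumed to wrap the sketches above and a standard dataset; the learning-rate bounds (1e-4 maximum, 1e-5 minimum) are illustrative, since the exact values are elided in the text, while `T_max=20` matches the stated maximum adjusted iteration.

```python
# A minimal training-loop sketch under the stated schedule assumptions.
import torch

model = DGNet()  # hypothetical wrapper around the encoder/GIT/NCD sketches
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=20, eta_min=1e-5)

for epoch in range(100):                   # maximum training epoch
    for images, masks, grads in loader:    # batch size 12, inputs 352x352
        optimizer.zero_grad()
        pred_seg, pred_grad = model(images)
        loss = total_loss(pred_seg, masks, pred_grad, grads)  # see above
        loss.backward()
        optimizer.step()
    scheduler.step()
```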
Testing settings. Once the network is well trained, we resize the input images to 352×352 and test our DGNet-S and DGNet on three unseen test datasets. We take the final output as the prediction map without any heuristic post-processing techniques, such as DenseCRF [55].

Benchmarking
Datasets. There are three popular datasets in the COD field: 1) CAMO [33] has 1 250 camouflaged images and is divided into CAMO-Tr (1 000 samples) and CAMO-Te (250 samples). 2) COD10K [2] is the largest COD dataset so far, consisting of COD10K-Tr (3 040 images) and COD10K-Te (2 026 images); its images were collected from multiple free photography websites and cover five super-classes and 69 sub-classes. 3) NC4K-Te [15], the largest test set, includes 4 121 samples and is used to evaluate the models' generalization ability. Following the protocol of [2], we train our model on the hybrid dataset (i.e., COD10K-Tr + CAMO-Tr) with 4 040 samples and evaluate it on the above three benchmarks (see Table 3).
Metrics. Following [2], we use five commonly used metrics for evaluation: structure measure ($S_\alpha$) [64], enhanced-alignment measure ($E_\phi$) [65, 66], F-measure ($F_\beta$) [67, 68], weighted F-measure ($F_\beta^w$) [69], and mean absolute error ($M$). Besides, the precision-recall (PR) curves [67] are obtained by varying the threshold over [0, 255]. Using the same thresholding strategy, F-measure and E-measure curves are also reported. Moreover, we adopt three criteria to measure a model's complexity and efficiency: the number of model parameters, measured in millions (M), the number of multiply-accumulate operations (MACs), measured in giga (G), and the inference speed, measured in frames per second (fps).
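For illustration, the two simplest pieces of this protocol, the MAE and the 256-threshold sweep behind the PR curves, can be sketched as follows; the S-measure and E-measure require the fuller protocols of [64-66] and are omitted here.

```python
# A minimal sketch of MAE and the per-threshold precision/recall sweep.
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    # pred in [0, 1], gt binary {0, 1}; both of shape (H, W).
    return float(np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean())

def pr_curve(pred: np.ndarray, gt: np.ndarray):
    # Sweep 256 thresholds over [0, 255] as described above.
    gt = gt > 0.5
    precision, recall = [], []
    for t in range(256):
        binary = pred >= (t / 255.0)
        tp = np.logical_and(binary, gt).sum()
        precision.append(tp / max(binary.sum(), 1))
        recall.append(tp / max(gt.sum(), 1))
    return np.array(precision), np.array(recall)
```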
Competitors. We compare our model with 20 SOTA competitors (see Table 3), including 8 SOD-based and 12 COD-based methods. For a fair comparison, all results were either taken from the public websites or produced by retraining the models on the same training dataset with their default settings.

Results and analysis
Quantitative results. As shown in Table 3, DGNet achieves promising performance in terms of all metrics. In particular, the gradient-based learning strategy helps to improve the completeness of the predictions, providing a 2.6% gain of $F_\beta^w$ on CAMO-Te over the rank@1 model SINetV2 [2].
Quantitative curves. As shown in Fig. 5, we plot the precision-recall (1st row), F-measure (2nd row), and E-measure (3rd row) curves of all COD-related competitors by varying the threshold. All comparisons show that our curves (magenta solid/dotted lines) are significantly better than those of the other methods on the three datasets.

[Fig. 5 PR curves (1st row), F-measure curves (2nd row), and E-measure curves (3rd row) of COD-related competitors on NC4K-Te [15], CAMO-Te [33], and COD10K-Te [2]. The closer the PR curve is to the upper-right corner, and the higher the F-measure/E-measure curve, the better the model works. Best viewed in color.]

Qualitative results. Visual comparisons are shown in Fig. 6. Interestingly, the competitors fail to provide complete segmentation results for camouflaged objects touching the image boundary. By contrast, our approach can precisely locate the target region and provide exact predictions thanks to the gradient learning strategy.
Efficiency analysis. To better unveil the trade-off, Fig. 7 shows that our two instances consistently obtain the best performance-efficiency trade-off among existing competitors. DGNet outperforms the cutting-edge model SINetV2 [2] by a large margin. Notably, our efficient instance DGNet-S performs better than JCSOD [22] with 113.33 M fewer parameters. Besides, we report the runtime comparison of all COD-related competitors in Table 4, tested on an NVIDIA RTX TITAN GPU. It clearly illustrates that DGNet-S and DGNet achieve super real-time inference speeds (i.e., 80 fps and 58 fps).

Ablation study
We further ablate the core modules to verify the effectiveness of each part and configuration. For ecological reasons, we select DGNet-S as the base model in this section.

Contribution of base network. In Table 5(a), we remove the texture encoder and GIT from DGNet-S and term the result the base network (#01). Compared to it, our DGNet-S (#S) significantly improves the performance while increasing the model parameters by only 0.06 M.

[Fig. 7 (Left) Scatter plot of the performance versus the model parameters of all competitors on CAMO-Te [33]; the larger the colored scatter point, the heavier the model. (Right) Parallel histogram comparison of the models' parameters, MACs, and performance. Best viewed in color.]

Contribution of group channel number. We also vary the group channel number $C_g$ (see Table 5) and find that more parameters may lead to performance saturation. To achieve the best trade-off between resource consumption and speed, we choose $C_g = 32$ as the default setting.

Contribution of network decoupling strategy. We explore the necessity of our decoupling strategy. Inspired by [19], we replace the feature extracted from the texture encoder with the low-level feature from the context encoder, which yields a single-stream network. Notably, we only change the extraction manner of the texture features and preserve the gradient-wise supervision for both variants to ensure an unbiased ablation. Table 5(c) demonstrates that decoupling the network into two streams improves the performance (+5.3% on CAMO-Te), benefiting from the modelling of separated branches without feature ambiguity between different hierarchies.
Contribution of gradient supervision. We also train a variant with a different label to supervise the learning process. The improvement brought by our gradient map supervision on CAMO-Te further demonstrates its effectiveness. The first row of Fig. 8 presents the low-level features extracted from the texture learning branch under different supervision types. It shows that our solution enforces the network to capture the gradient-sensitive information inside the camouflaged object's body, where such pixels learn to draw the observer's attention. We further experiment with the supervision of texture labels [37] (see Fig. 2). The results in Table 6 demonstrate that our gradient-supervision manner (i.e., w/ DGNet-grad) is better than the texture-supervision manner (i.e., w/ TINet-text). Besides, our method is simpler and more efficient than TINet, e.g., DGNet-S (8.0 M) vs. TINet (28.6 M) in parameters, and DGNet-S (80 fps) vs. TINet (50 fps) in speed. With such a compact design, we also achieve new SOTA performance on CAMO-Te with both DGNet-S and DGNet.

[Table 6 Training DGNet-S under the supervision of the texture label (TINet-text [37]) and our object gradient label (DGNet-grad).]

Do we need more sub-branches for soft grouping? As shown in Table 5(g), we set three ablative experiments with three, four, and five sub-branches. The comparison results reveal that more sub-branches present unstable performance across all the datasets.

Contribution of gradient-induced transition. We replace the whole GIT in our model with naive channel-wise concatenation (#15: w/o GIT in Table 5(h)) to verify its effectiveness. The comparison shows that our DGNet-S equipped with GIT (#S: w/ GIT) improves the result by 2.3% on the COD10K-Te dataset. Moreover, as shown in the second row of Fig. 8, the model obtains a cleaner and finer representation after GIT (Fig. 8(d)) while suppressing the background noise present before GIT (Fig. 8(c)), a clear benefit of the adaptive aggregation of the context and texture cues in GIT.

Limitations
Efficient backbone vs. lightweight backbone. We further validate the potential value of our method under limited hardware conditions by replacing the efficient backbone, EfficientNet [40], with a lightweight one, MobileNet [70]. The results in Table 7 show that our method achieves unsatisfactory performance with the lightweight backbones, i.e., MobNet-S (2.96 M) and MobNet-L (6.96 M), leaving large room for future exploration.

[Table 7 Our method with different backbones: EfficientNet [40] (EffNet-B1 & EffNet-B4) vs. MobileNet [70] (MobNet-S & MobNet-L).]

Challenging cases. Despite our method's satisfactory performance, it may fail in the following challenging camouflaged scenarios. First, the proposed strategy still struggles to extract enough texture cues from a limited small target region, resulting in false-positive predictions. As shown in Fig. 9, such cases also easily confuse the rank@1 approach SINetV2 [2] and thus deserve further study.

[Fig. 9 Hard sample with a small camouflaged object: image, SINetV2 [2], DGNet (ours).]

Second, we observe that not all camouflaged objects exhibit noticeable gradient changes inside their bodies. As shown in the first row of Fig. 10, our method can still segment a white rabbit with non-distinct gradient changes. However, it fails under extreme conditions, as in the second row of Fig. 10, where gradient cues are rare. Future improvements may require incorporating more heuristic and learnable patterns.

[Fig. 10 Hard samples with rare gradient cues: image, object gradient, ground-truth, prediction.]

Additionally, we noticed a recently released COD method, ZoomNet [34], after the submission. As shown in Table 8, our DGNet surpasses ZoomNet by a margin on NC4K-Te (+1.3%) and CAMO-Te (+2.3%) but fails to outperform it on COD10K-Te. Meanwhile, ZoomNet occupies more computational cost (32.38 M parameters) than our DGNet (21.02 M parameters). This inspires us to incorporate the zooming strategy into our network in a future extension.

Downstream applications
This section assesses the generalization capability of our model on three downstream applications.
Polyp segmentation. In the early diagnosis of colorectal cancer via colonoscopy, the low boundary contrast between a polyp and its highly similar surroundings significantly decreases the detectability of colorectal cancer. To demonstrate the generality of our method in the medical field, we follow the same benchmark protocols as [3] and retrain our DGNet on the training sets of the Kvasir-SEG [71] and CVC-ClinicDB [72] datasets. We use two unseen test datasets: CVC-ColonDB [73] and ETIS-LPDB [74]. Table 9 shows that our DGNet† consistently surpasses four cutting-edge polyp segmentation methods on four metrics, including $S_\alpha$, $E_\phi^{mx}$, $F_\beta^w$, and the maximum Dice score ($D^{mx}$). Notably, DGNet† denotes that we retrain DGNet on the task-specific training dataset. Fig. 11(a) shows the visualization results generated by our DGNet†.

[Table 9 Quantitative results on two popular polyp segmentation test datasets, CVC-ColonDB [73] and ETIS-LPDB [74]; baselines include UNet [75].]

Defect detection. Substandard products (e.g., tiles, wood) inevitably incur unrecoverable economic losses in manufacturing. We further retrain our DGNet on the road crack detection dataset CrackForest [78], using 60% of the samples for training and 40% for testing. Fig. 11(b) presents some visualization cases.

Transparent object segmentation. In daily life, intelligent agents such as robots and drones need to identify unnoticeable transparent objects (e.g., glasses, bottles, and mirrors) to avoid accidents. We also verify the effectiveness of the retrained model DGNet† on the transparent object segmentation task. For convenience, we re-organize the annotations of the Trans10K [79] dataset from instance-level to object-level for training. The visual results shown in Fig. 11(c) further demonstrate the learning ability of DGNet†.

Conclusions
We presented a novel deep gradient learning framework (DGNet) for efficiently segmenting camouflaged objects. To extract camouflaged features, we proposed to decouple the task into two branches, a context encoder and a texture encoder. We designed a novel plug-and-play module, the gradient-induced transition (GIT), acting as a soft grouping module to jointly learn features from these two branches. This simple and flexible architecture showed strong generalization capabilities on three challenging datasets compared with the 20 SOTA competitors. In addition, our efficient version DGNet-S achieved appealing results on three further applications, including polyp segmentation, defect detection, and transparent object segmentation, which validates its practical application value.