Advances in Deep Concealed Scene Understanding

Concealed scene understanding (CSU) is a hot computer vision topic aiming to perceive objects exhibiting camouflage. The current boom in techniques and applications warrants an up-to-date survey, which can help researchers better understand the global CSU field, including both current achievements and remaining challenges. This paper makes four contributions: (1) For the first time, we present a comprehensive survey of deep learning techniques aimed at CSU, including a taxonomy, task-specific challenges, and ongoing developments. (2) To allow for an authoritative quantification of the state-of-the-art, we offer the largest and latest benchmark for concealed object segmentation (COS). (3) To evaluate the generalizability of deep CSU in practical scenarios, we collect the largest concealed defect segmentation dataset, termed CDS2K, with hard cases from diverse industrial scenarios, on which we construct a comprehensive benchmark. (4) We discuss open problems and potential research directions for CSU. Our code and datasets are available at https://github.com/DengPingFan/CSU, which will be updated continuously to watch and summarize the advancements in this rapidly evolving field.

In recent years, thanks to newly available benchmarks (e.g., COD10K [19], [22] and NC4K [23]) and the rapid development of deep learning, this field has made important strides forward. In 2020, Fan et al. [19] released the first large-scale public dataset, COD10K, geared towards the advancement of perception tasks that deal with concealment. This has also inspired other related disciplines. For instance, Mei et al. [24], [25] proposed a distraction-aware framework for the segmentation of camouflaged objects, which can be extended to the identification of transparent materials in natural scenes [26]. In 2023, Ji et al. [27] developed an efficient model that learns textures from object-level gradients, and its generalizability has been verified through diverse downstream applications, e.g., medical polyp segmentation and road crack detection.
Although multiple research teams have addressed tasks concerned with concealed objects, we believe that stronger interactions between the ongoing efforts would be beneficial. Thus, we mainly review the current state and recent deep learning-based advances of CSU. Meanwhile, we contribute a large-scale concealed defect segmentation dataset termed CDS2K. This dataset consists of hard cases from diverse industrial scenarios, thus providing an effective benchmark for CSU.
Previous Surveys and Scope. To the best of our knowledge, only a few survey papers have been published in the CSU community, and they [28], [29] mainly review non-deep techniques. There are also some benchmarks [30], [31] with narrow scopes, such as image-level segmentation, in which only a few deep methods were evaluated. In this paper, we present a comprehensive survey of deep learning CSU techniques, thus widening the scope. We also offer more extensive benchmarks with a more comprehensive comparison and an application-oriented evaluation.
Contributions. Our contributions are summarized as follows: (1) We present the first effort to thoroughly examine deep learning techniques tailored to CSU. This includes an overview of its taxonomy and task-specific challenges, as well as an assessment of its progress during the deep learning era, achieved through an examination of existing datasets and techniques. (2) To provide a quantitative evaluation of the current state-of-the-art, we have created a new benchmark for concealed object segmentation (COS), a crucial and highly successful area within CSU. This benchmark is the most up-to-date and comprehensive available. (3) To assess the applicability of deep CSU in real-world scenarios, we have restructured the CDS2K dataset, the largest dataset for concealed defect segmentation, to include challenging cases from various industrial settings. We have used this updated dataset to create a comprehensive benchmark for evaluation. (4) Our discussion covers the present obstacles, available prospects, and future research directions for the CSU community.

Image-level CSU
In this section, we introduce five commonly used image-level CSU tasks, which can be formulated as a mapping function F : X → Y that converts the input space X into the target space Y.
• Concealed Object Segmentation (COS) [22], [27] is a class-agnostic dense prediction task, segmenting concealed regions or objects with unknown categories. As presented in Fig. 2 (a), the model F_COS : X → Y is supervised by a binary mask Y to predict a probability p ∈ [0, 1] for each pixel x of image X, which is the confidence level with which the model determines whether x belongs to the concealed region.
• Concealed Object Localization (COL) [23], [32] aims to identify the most noticeable region of concealed objects, which is in line with human perception psychology [32]. This task is to learn a dense mapping F_COL : X → Y. The output Y is a non-binary fixation map captured by an eye-tracker device, as illustrated in Fig. 2 (b). Essentially, the probability prediction p ∈ [0, 1] for a pixel x indicates how conspicuous its camouflage is.
• Concealed Instance Ranking (CIR) [23], [32] is to rank different instances in a concealed scene based on their detectability, where the level of camouflage is used as the basis for this ranking. The objective of the CIR task is to learn a dense mapping F_CIR : X → Y between the input space X and the camouflage ranking space Y, where Y represents per-pixel annotations for each instance with corresponding rank levels. For example, in Fig. 2 (c), there are three toads with different camouflage levels, whose ranking labels are from [23]. To perform this task, one can replace the category ID for each instance with rank labels in instance segmentation models like Mask R-CNN [33].
• Concealed Instance Segmentation (CIS) [34], [35] aims to identify instances in concealed scenarios based on their semantic characteristics. Unlike general instance segmentation [36], [37], where each instance is assigned a category label, CIS recognizes the attributes of concealed objects to distinguish between different entities more effectively. To achieve this, CIS employs a mapping function F_CIS : X → Y, where Y is a scalar set comprising various entities used to parse each pixel. This concept is illustrated in Fig. 2 (d).
• Concealed Object Counting (COC) [38] is a newly emerging topic in CSU that aims to estimate the number of instances concealed within their surroundings. As illustrated in Fig. 2 (e), COC estimates the center coordinates of each instance and generates their counts. Its formulation can be defined as F_COC : X → Y, where X is the input image and Y represents the output density map that indicates the concealed instances in the scene.
Overall, the image-level CSU tasks can be categorized into two groups based on their semantics: object-level (COS and COL) and instance-level (CIR, COC, and CIS). Object-level tasks focus on perceiving objects, while instance-level ones aim to recognize semantics to distinguish different entities. Additionally, COC is regarded as a sparse prediction task based on its output form, whereas the others belong to dense prediction tasks. Among the literature reviewed in Table 1, COS has been extensively researched, while research on the other four tasks is gradually increasing.
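The dense vs. sparse distinction above can be made concrete with a small numeric sketch (all numbers are made up for illustration, not drawn from any benchmark): a dense COS output maps per-pixel logits to probabilities p ∈ [0, 1], while a sparse COC output is a density map whose integral yields the instance count.

```python
import numpy as np

def sigmoid(z):
    """Map raw logits to per-pixel probabilities p in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-z))

# Dense prediction (COS): per-pixel confidence that x belongs to the concealed region
logits = np.array([[2.0, -2.0],
                   [0.0,  3.0]])          # hypothetical model outputs
prob = sigmoid(logits)                    # probabilities in [0, 1]
mask = (prob >= 0.5).astype(np.uint8)     # binarized segmentation mask

# Sparse prediction (COC): integrating a density map yields the instance count
density = np.array([[0.2, 0.8],
                    [0.5, 0.5]])          # hypothetical density map
count = float(density.sum())              # ~2 concealed instances
```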

Video-level CSU
Given a video clip {X_t}_{t=1}^T containing T concealed frames, video-level CSU can be formulated as a mapping function F : {X_t}_{t=1}^T → {Y_t}_{t=1}^T for parsing dense spatial-temporal correspondences, where Y_t is the label of frame X_t.
Video Concealed Object Detection (VCOD) [39] is similar to video object detection [40]. This task aims to identify and locate concealed objects within a video by learning a spatial-temporal mapping function F_VCOD : {X_t}_{t=1}^T → {Y_t}_{t=1}^T that predicts the location Y_t of an object for each frame X_t. The location label Y_t is provided as a bounding box (see Fig. 2 (f)) consisting of four numbers (x, y, w, h) indicating the target's location. Here, (x, y) represents its top-left coordinate, while w and h denote its width and height, respectively.
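To make the (x, y, w, h) convention concrete, the sketch below (hypothetical helper names, not code from any cited work) converts such a box to corner form and computes the intersection-over-union used in typical detection evaluation:

```python
def xywh_to_corners(box):
    """(x, y, w, h) with (x, y) the top-left corner -> (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)

def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax0, ay0, ax1, ay1 = xywh_to_corners(box_a)
    bx0, by0, bx1, by1 = xywh_to_corners(box_b)
    iw = max(0, min(ax1, bx1) - max(ax0, bx0))   # intersection width
    ih = max(0, min(ay1, by1) - max(ay0, by0))   # intersection height
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union else 0.0
```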
Video Concealed Object Segmentation (VCOS) [41] originated from the task of camouflaged object discovery [39]. Its goal is to segment concealed objects within a video. This task usually utilizes spatial-temporal cues to drive the models to learn the mapping F_VCOS : {X_t}_{t=1}^T → {Y_t}_{t=1}^T between input frames X_t and corresponding segmentation mask labels Y_t. Fig. 2 (g) shows an example of its segmentation mask.
In general, compared to image-level CSU, video-level CSU is developing relatively slowly, because collecting and annotating video data is labor-intensive and time-consuming. However, with the establishment of the first large-scale VCOS benchmark on MoCA-Mask [41], this field has made fundamental progress while still having ample room for exploration.

Task Relationship
Among image-level CSU tasks, the CIR task requires the highest level of understanding: it involves not only segmenting pixel-level regions (i.e., COS), counting instances (i.e., COC), and distinguishing different instances (i.e., CIS), but also ranking these instances according to their fixation probabilities (i.e., COL) under different difficulty levels. Additionally, regarding the two video-level tasks, VCOS is a downstream task of VCOD, since the segmentation task requires the model to provide pixel-level classification probabilities.

Related Topics
Next, we briefly introduce salient object detection (SOD), which, like COS, requires extracting properties of target objects; however, SOD focuses on the salient attribute while COS focuses on the concealed one.
Video-level SOD. The early development of video salient object detection (VSOD) originated from introducing attention mechanisms into video object segmentation (VOS) tasks. At that stage, the task scenes were relatively simple, containing only one object moving in the video. As moving objects tend to attract visual attention, VOS and VSOD were equivalent tasks. For instance, Wang et al. [95] used a fully convolutional neural network to address the VSOD task. With the development of VOS techniques, researchers introduced more complex scenes (e.g., with complex backgrounds, object movements, and two objects), but the two tasks remained equivalent. Thus, later works exploited semantic-level spatial-temporal features [96], [97], [98], [99], recurrent neural networks [100], [101], or offline motion cues such as optical flow [100], [102], [103], [104]. However, with the introduction of more challenging video scenes (containing three or more objects, with simultaneous camera and object movements), VOS and VSOD were no longer equivalent. Yet, researchers continued to treat the two tasks as equivalent, ignoring the issue of visual attention allocation under multi-object movement in video scenes, which seriously hindered the development of the field. To address this issue, in 2019, Fan et al. introduced eye trackers to mark the changes in visual attention in multi-object movement scenarios, posing for the first time the scientific problem of attention shift [105] in VSOD tasks, and constructed the first large-scale VSOD benchmark, DAVSOD, as well as the baseline model SSAV, which propelled VSOD into a new stage of development.
Remarks. COS and SOD are distinct tasks, but they can mutually benefit via the CamDiff approach [106]. This has been demonstrated through adversarial learning [107], leading to joint research efforts such as the recently proposed dichotomous image segmentation [108]. In §6, we will explore potential directions for future research in these areas.

DEEP CSU MODELS
This section systematically reviews deep CSU approaches based on task definitions and data types. We have also created a GitHub repository as a comprehensive collection providing the latest information in this field.

Image-level CSU Models
We review four existing image-level CSU tasks: concealed object segmentation (COS), concealed instance ranking (CIR), concealed instance segmentation (CIS), and concealed object counting (COC). Table 1 summarizes the key features of the reviewed approaches.

Concealed Object Segmentation
This section discusses previous solutions for concealed object segmentation (COS) from two perspectives: network architecture and learning paradigm.

Concealed Instance Ranking
There has been limited research on this topic. Lv et al. [23] were the first to observe that existing COS approaches cannot quantify the difficulty level of camouflage. To address this issue, they used an eye tracker to create a new dataset, called CAM-LDR [32], that contains instance segmentation masks, fixation labels, and ranking labels. They also proposed two unified frameworks, LSR [23] and its extension LSR+ [32], to simultaneously learn three tasks, i.e., localizing, segmenting, and ranking camouflaged objects. The insight behind this is that discriminative localization regions can guide the segmentation of the full extent of camouflaged objects, after which the detectability of different camouflaged objects can be inferred by the ranking task.

Concealed Instance Segmentation
This task advances the COS task from the region level to the instance level and is a relatively new field compared with COS. Le et al. [35] built a new CIS benchmark, CAMO++, by extending the previous CAMO [20] dataset. They also proposed a camouflage fusion learning strategy to fine-tune existing instance segmentation models (e.g., Mask R-CNN [33]) by learning image contexts. Based on instance-level benchmarks such as COD10K [19] and NC4K [23], the first one-stage transformer framework, OSFormer [34], was proposed for this field, introducing two core designs: a location-sensing transformer and coarse-to-fine fusion. Recently, Luo et al. [145] proposed to segment camouflaged instances with two designs: a pixel-level camouflage decoupling module and an instance-level camouflage suppression module.

Concealed Object Counting
Sun et al. [38] recently introduced a new challenge for the community called indiscernible object counting (IOC), which involves counting objects that are difficult to distinguish from their surroundings. To address the lack of appropriate datasets for this challenge, they created IOCfish5K, a large-scale dataset containing high-resolution images of underwater scenes with many indiscernible objects (mainly fish) and dense annotations. They also proposed a baseline model called IOCFormer that integrates density-based and regression-based methods in a unified framework.
Based on the above summaries, the COS task is experiencing a period of rapid development, resulting in numerous publications each year. However, few solutions have been proposed for the COL, CIR, and CIS tasks. This suggests that these fields remain under-explored and offer significant room for further research. Notably, many previous studies are available as references (such as saliency prediction [83], salient object subitizing [67], and salient instance segmentation [81]), providing a solid foundation for understanding these tasks from a camouflaged perspective.

Video-level CSU Models
There are two branches of the video-level CSU task: detecting and segmenting camouflaged objects in videos. Refer to Table 2 for details.

Video Concealed Object Detection
Most works [155], [157] formulated this topic as a degraded form of the segmentation task owing to the scarcity of pixel-wise annotations. They typically train on segmentation datasets (e.g., DAVIS [160], FBMS [161]) but evaluate generalizability on the video camouflaged object detection dataset MoCA [39]. These methods consistently opt to extract offline optical flow as motion guidance for the segmentation task, while diversifying their learning strategies, such as fully-supervised learning with real [39], [156], [159] or synthetic [154], [157] data and self-supervised learning [155], [158].

Video Concealed Object Segmentation
Xie et al. [153] proposed the first work on camouflaged object discovery in videos. They used a pixel-trajectory recurrent neural network to cluster foreground motion for segmentation. However, this work is limited to a small-scale dataset, CAD [162]. Recently, building upon the localization-level MoCA dataset [39] with bounding box labels, Cheng et al. [41] extended this field by creating a large-scale VCOS benchmark, MoCA-Mask, with pixel-level masks. They also introduced a two-stage baseline, SLTNet, to implicitly utilize motion information.
From what we have reviewed above, the current approaches for VCOS tasks are still in a nascent state of development. Several concurrent works in well-established video segmentation fields (e.g., self-supervised correspondence learning [163], [164], [165], [166], [167] and unified frameworks for different motion-based tasks [168], [169], [170]) point the way to further exploration. Besides, high-level semantic understanding of camouflaged scenes, such as semantic segmentation and instance segmentation, remains a research gap that merits attention.

CSU DATASETS
In recent years, various datasets have been collected for both image- and video-level CSU tasks. In Table 3, we summarize the features of the representative datasets.

Image-level Datasets
• CAMO-COCO [20] is tailor-made for COS tasks with 2,500 image samples across eight categories, divided into two sub-datasets, i.e., CAMO with camouflaged objects and MS-COCO with non-camouflaged objects. Both CAMO and MS-COCO contain 1,250 images, with a split of 1,000 for training and 250 for testing.
NC4K [23] is currently the largest testing set for evaluating COS models. NC4K consists of 4,121 camouflaged images sourced from the Internet and can be divided into two primary categories: natural scenes and artificial scenes. In addition to the images, this dataset also provides localization labels that include both object-level segmentation and instance-level masks, making it a valuable resource for researchers working in this field. In a recent study by Lv et al. [23], an eye tracker was utilized to collect fixation information for each image. As a result, a CAM-FR dataset of 2,280 images was created, with 2,000 images used for training and 280 for testing. The dataset was annotated with three types of labels: localization, ranking, and instance labels.
CAMO++ [35] is a newly released dataset that contains 5,500 samples, all of which have undergone hierarchical pixel-wise annotation. The dataset is divided into two parts: camouflaged samples (1,700 images for training and 1,000 for testing) and non-camouflaged samples (1,800 images for training and 1,000 for testing).
COD10K [19], [22] is currently the largest-scale dataset, featuring a wide range of camouflaged scenes. The dataset contains 10,000 images from multiple open-access photography websites, covering ten super-classes and 78 sub-classes. Of these images, 5,066 are camouflaged, 1,934 are non-camouflaged, and 3,000 are background images. The camouflaged subset of COD10K is annotated with different labels, such as category labels, bounding boxes, object-level masks, and instance-level masks, providing a diverse set of annotations.
CAM-LDR [32] comprises 4,040 training and 2,026 testing samples. These samples were selected from the commonly-used hybrid training datasets (i.e., CAMO with 1,000 training samples and COD10K with 3,040 training samples), along with the testing dataset (i.e., COD10K with 2,026 testing samples). CAM-LDR is an extension of NC4K [23] that includes four types of annotations: localization labels, ranking labels, object-level segmentation masks, and instance-level segmentation masks. The ranking labels are categorized into six difficulty levels: background, easy, medium1, medium2, medium3, and hard.
S-COD [141] is the first dataset designed specifically for the COS task under the weakly-supervised setting. The dataset includes 4,040 training samples, with 3,040 selected from COD10K and 1,000 from CAMO. These samples were re-labeled using scribble annotations that provide a rough outline of the primary structure based on first impressions, without pixel-wise ground-truth information.
IOCfish5K [38] is a distinct dataset that focuses on counting instances of fish in camouflaged scenes. This COC dataset comprises 5,637 high-resolution images collected from YouTube, with 659,024 center points annotated. The dataset is divided into three subsets, with 3,137 images allocated for training, 500 for validation, and 2,000 for testing.
Remarks. In summary, three datasets (CAMO, COD10K, and NC4K) are commonly used as benchmarks to evaluate concealed object segmentation (COS) approaches, with the experimental protocols typically described in §5.2. For the concealed instance segmentation (CIS) task, two datasets (COD10K and NC4K) containing instance-level segmentation masks can be utilized. The CAM-LDR dataset, which provides fixation information and three types of annotations collected from a physical eye-tracker device, is suitable for various brain-inspired explorations in computer vision. Additionally, there are two new CSU datasets: S-COD, designed for weakly-supervised COS, and IOCfish5K, focused on counting objects within camouflaged scenes.

Video-level Datasets
• CAD [162] is a small dataset comprising nine short video clips and 836 frames.The annotation strategy used in this dataset is sparse, with camouflaged objects being annotated every five frames.As a result, there are 191 segmentation masks available in the dataset.
MoCA [39] is a comprehensive video database from YouTube that aims to detect moving camouflaged animals.It consists of 141 video clips featuring 67 categories and comprises 37,250 high-resolution frames with corresponding bounding box labels for 7,617 instances.
MoCA-Mask [41], an extension of the MoCA dataset [39], provides human-annotated segmentation masks every five frames. MoCA-Mask is divided into two parts: a training set consisting of 71 short clips (19,313 frames with 3,946 segmentation masks) and an evaluation set containing 16 short clips (3,626 frames with 745 segmentation masks). For the remaining unlabeled frames, pseudo-segmentation labels were synthesized using a bidirectional optical-flow-based strategy [171].
Remarks. The MoCA dataset is currently the largest collection of videos with concealed objects, but it only offers detection labels. As a result, researchers in the community [155], [157] typically assess the performance of well-trained segmentation models by converting segmentation masks into detection bounding boxes.
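The mask-to-box conversion mentioned above amounts to taking the tight bounding rectangle of the foreground pixels; a minimal sketch (assuming a binary numpy mask, not code from the cited works) is:

```python
import numpy as np

def mask_to_bbox(mask):
    """Convert a binary mask (H, W) to an (x, y, w, h) box; None for an empty mask."""
    ys, xs = np.nonzero(mask)        # row/column indices of foreground pixels
    if xs.size == 0:
        return None
    x_min, x_max = int(xs.min()), int(xs.max())
    y_min, y_max = int(ys.min()), int(ys.max())
    return (x_min, y_min, x_max - x_min + 1, y_max - y_min + 1)
```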
Recently, there has been a shift towards video segmentation in concealed scenes with the introduction of MoCA-Mask.Despite these advancements, the quantity and quality of data annotations remain insufficient for constructing a reliable video model that can effectively handle complex concealed scenarios.

CSU BENCHMARKS
In this investigation, our benchmarking is built on COS tasks, since this topic is relatively well-established and offers a variety of competing approaches. The following sections detail the evaluation metrics (§5.1), benchmarking protocols (§5.2), quantitative analyses (§5.3, §5.4, §5.5), and qualitative comparisons (§5.6).

Evaluation Metrics
As suggested in [22], there are five commonly used metrics for COS evaluation. We compare a prediction mask P with its corresponding ground-truth mask G at the same image resolution.
MAE (mean absolute error, M) is a conventional pixel-wise measure, defined as: M = (1/(W·H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |P(x, y) − G(x, y)|, where W and H are the width and height of G, and (x, y) indexes pixel coordinates in G.
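A minimal numpy sketch of this metric (a hypothetical helper, assuming P holds probabilities in [0, 1] and G is a binary mask at the same resolution):

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error M between prediction P and ground truth G."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    assert pred.shape == gt.shape, "P and G must share the same resolution"
    # Average of |P(x, y) - G(x, y)| over all W * H pixels
    return float(np.abs(pred - gt).mean())
```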

TABLE 4
Quantitative comparison on the CAMO [20] testing set. We classify the competing approaches based on two aspects: those using convolution-based backbones such as ResNet [174], Res2Net [175], EffNet [176], and ConvNeXt [177]; and those using transformer-based backbones such as MiT [178], PVTv2 [179], and Swin [180]. We test two efficiency metrics, model parameters (Para) and multiply-accumulate operations (MACs), in accordance with the preset input resolution in the original paper. Besides, nine evaluation metrics are reported, and the best three scores are highlighted in red, green, and blue, respectively, with ↑/↓ indicating that higher/lower scores are better. If results are unavailable because the code has not been made public, we use a hyphen (-). We will follow these notations in subsequent tables unless otherwise specified.

Enhanced-alignment measure (E_φ) [172], [173] is a recently proposed binary foreground evaluation metric, which considers both the local and global similarity between two binary maps. Its formulation is defined as: E_φ = (1/(W·H)) Σ_{x=1}^{W} Σ_{y=1}^{H} φ(x, y), where φ is the enhanced-alignment matrix. Similar to F_β, this metric also includes three values computed over all thresholds, i.e., maximum (E_φ^mx), mean (E_φ^mn), and adaptive (E_φ^ad) values.
Structure measure (S_α) [181], [182] is used to measure the structural similarity between a non-binary prediction map and a ground-truth mask: S_α = α·S_o + (1 − α)·S_r, where α balances the object-aware similarity S_o and the region-aware similarity S_r. As in the original paper, we use the default setting α = 0.5.
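The maximum/mean/adaptive convention shared by E_φ and F_β can be sketched generically. The snippet below is an illustration only: it uses F_β as the threshold-dependent score with the common β² = 0.3, and twice the prediction's mean as the adaptive threshold; both are conventions from the SOD literature, assumed here rather than taken from this paper's exact protocol.

```python
import numpy as np

def f_beta(pred_bin, gt, beta2=0.3):
    """F-measure between a binarized prediction and a binary ground truth."""
    tp = float(np.logical_and(pred_bin, gt).sum())
    precision = tp / max(float(pred_bin.sum()), 1e-8)
    recall = tp / max(float(gt.sum()), 1e-8)
    return (1 + beta2) * precision * recall / max(beta2 * precision + recall, 1e-8)

def threshold_stats(pred, gt):
    """Return (maximum, mean, adaptive) scores over 256 binarization thresholds."""
    scores = [f_beta(pred >= t, gt) for t in np.linspace(0.0, 1.0, 256)]
    adaptive_t = min(2.0 * float(pred.mean()), 1.0)  # common adaptive-threshold rule
    return max(scores), float(np.mean(scores)), f_beta(pred >= adaptive_t, gt)
```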

Experimental Protocols
As suggested by Fan et al. [22], all competing approaches in the benchmarking were trained on a hybrid dataset comprising the training portions of the COD10K [19] and CAMO [20] datasets, totaling 4,040 samples. The models were then evaluated on three popular benchmarks: COD10K's testing portion with 2,026 samples [19], CAMO with 250 samples [20], and NC4K with 4,121 samples [23].

Quantitative Analysis on CAMO
As reported in Table 4, we evaluated 36 deep-based approaches on the CAMO testing dataset [20] using various metrics. These models were classified into two groups based on their backbones: 32 convolution-based and four transformer-based. For the models using convolution-based backbones, several interesting findings are observed:
• CamoFormer-C [147] achieved the best performance on CAMO with a ConvNeXt-based [177] backbone, even surpassing some metrics produced by transformer-based methods, such as the S_α value: 0.859 (CamoFormer-C) vs. 0.856 (DTINet [132]) vs. 0.849 (HitNet [142]). However, CamoFormer-R [147] with a ResNet-50 backbone was unable to outperform competitors with the same backbone, such as those using multi-scale zooming (ZoomNet [133]) and iterative refinement (SegMaR [135]) strategies.
• Among the Res2Net-based models, FDNet [134] achieves the top performance on CAMO with a high-resolution input of 416². Besides, SINetV2 [22] and FAPNet [126] also achieve satisfactory results using the same backbone but with a smaller input size of 352².
• DGNet [27] is an efficient model that stands out with its top-three performance compared to heavier models like JS-COD [107] (121.63M) and PopNet [148] (181.05M), despite having only 19.22M parameters and 1.20G computation costs. Its performance-efficiency balance makes it a promising architecture for further exploration of its potential capabilities.
• Interestingly, CRNet [141], a weakly-supervised model, competes favorably with the early fully-supervised model SINet [19]. This suggests that there is room for developing models that bridge the gap toward better data-efficient learning, e.g., self-/semi-supervised learning.
Furthermore, transformer-based methods significantly improve performance due to their superior long-range modeling capabilities. We tested four transformer-based models on the CAMO testing dataset, yielding three noteworthy findings:
• CamoFormer-S [147] utilizes a Swin transformer design to enhance the hierarchical modeling ability on concealed content, resulting in superior performance across the entire CAMO benchmark. We also observed that the PVT-based variant CamoFormer-P [147] achieves comparable results with fewer parameters, i.e., 71.40M (CamoFormer-P) vs. 97.27M (CamoFormer-R).
• DTINet [132] is a dual-branch network that utilizes the MiT-B5 semantic segmentation model from SegFormer [178] as its backbone. Despite having 266.33M parameters, it has not delivered impressive performance due to the challenge of balancing two such heavy branches. Nevertheless, this attempt defies our preconceptions and inspires us to investigate the generalizability of semantic segmentation models in concealed scenarios.
• We also investigate the impact of input resolution on model performance. HitNet [142] uses a high-resolution input of 704², which can improve the detection of small targets, but at the expense of increased computation costs. Similarly, convolution-based approaches like ZoomNet [133] achieved impressive results by taking multiple inputs at different resolutions (the largest being 576²) to enhance segmentation performance. However, not all models benefit from this approach. For instance, PopNet [148] with a resolution of 480² fails to outperform SINetV2 [22] with 352² on all metrics. This observation raises two critical questions: should high resolution be used in concealed scenarios, and how can we develop an effective strategy for detecting concealed objects of varying sizes? We will propose potential solutions to these questions and present an interesting analysis on COD10K in §5.5.

Quantitative Analysis on NC4K
Compared to the CAMO dataset, the NC4K [23] dataset has a larger data scale and greater sample diversity, leading to subtle changes in the results:
• CamoFormer-C [147] still outperforms all methods on NC4K. In contrast to the awkward situation observed on CAMO described in §5.3, the ResNet-50-based CamoFormer-R [147] now performs better than two other competitors (i.e., SegMaR [135] and ZoomNet [133]) on NC4K. These results confirm the effectiveness of CamoFormer's decoder design in mapping latent features back to the prediction space, particularly for more complicated scenarios.
• DGNet [27] shows less promise on the challenging NC4K dataset, possibly due to the restricted modeling capability of its small parameter count. Nevertheless, this drawback provides an opening for modification, since the model has a lightweight and simple architecture.
• While PopNet [148] may not perform well on the small-scale CAMO dataset, it has demonstrated potential on the challenging NC4K dataset. This indicates that using an extra network to synthesize depth priors can be helpful for challenging samples. Compared to the Res2Net-50-based SINetV2 [22], PopNet has a heavier design (188.05M vs. 26.98M) and a larger input resolution (512² vs. 352²), but only improves the E_φ^mn value by 0.6%.
• Regarding the CamoFormer [147] model, there is now a noticeable difference in performance between its two variants. Specifically, the Swin-B-based CamoFormer-S variant lags behind, while the PVTv2-B4-based CamoFormer-P variant performs better.

Quantitative Analysis on COD10K
In Table 6, we present a performance comparison of 36 competitors, including 32 convolution-based and four transformer-based models, on the COD10K dataset with its diverse concealed samples. Based on our evaluation, we make the following observations:
• CamoFormer-C [147], which has a robust backbone, remains the best-performing convolution-based method. Similar to its performance on NC4K, CamoFormer-R [147] has once again outperformed strong competitors with identical backbones, such as SegMaR [135] and ZoomNet [133].
• Similar to its performance on the NC4K dataset, PopNet [148] achieves consistently high results on the COD10K dataset, ranking second only to CamoFormer-C [147]. We believe that prior knowledge of scene depth plays a crucial role in enhancing the understanding of concealed environments. This insight motivates us to investigate more intelligent ways to learn structural priors, such as incorporating multi-task learning or heuristic methods into our models.
• Notably, HitNet [142] achieves the highest performance on the COD10K benchmark, outperforming models with stronger backbones such as Swin-B and PVTv2-B4. To understand why, we calculated the average resolution of all samples in the CAMO (W=693.89, H=564.22), NC4K (W=709.19, H=529.61), and COD10K (W=963.34, H=740.54) datasets. The COD10K testing set has the highest overall resolution, which suggests that models utilizing higher resolutions or multi-scale modeling benefit from this characteristic. HitNet is therefore an excellent choice for detecting concealed objects in scenarios where high-resolution images are available.
• The multiple objects (MO) attribute poses a challenge due to the high false-negative rate of current top-performing models. As depicted in the first column of Fig. 4, only two of the ten models could approximately locate the white flying bird, as indicated by the red circle in the GT mask: CamoFormer-S [147], which employs a robust transformer-based encoder, and FDNet [134], which utilizes a frequency-domain learning strategy.
• The models we tested can accurately detect big objects (BO) by precisely locating the target's main part. However, they struggle to identify fine details, such as the toad's claws highlighted by the red circles in the second column of Fig. 4.
• The small object (SO) attribute presents a challenge since such objects occupy only a small area of the image, typically less than 10% of the total pixels as reported by COD10K [19]. As shown in the third column of Fig. 4, only two models (CamoFormer-S and CamoFormer-C [147]) can detect the cat lying on the ground in the distance. This difficulty arises for two main reasons: first, models struggle to differentiate small objects from complex backgrounds or other irrelevant objects in an image; second, detectors may miss small regions due to the down-sampling caused by low-resolution inputs.
• The out-of-view (OV) attribute refers to objects partially outside the image boundaries, leading to incomplete representation. To address this issue, a model should have a more holistic understanding of the concealed scene. As shown in the fourth column of Fig. 4, both CamoFormer-C [147] and FDNet [134] can handle the OV attribute and maintain the object's integrity, whereas two transformer-based models fail to do so. This observation inspires the exploration of more efficient methods, such as local modeling within convolutional frameworks and cross-domain learning strategies.
• The shape complexity (SC) attribute indicates that an object contains thin parts, such as an animal's foot. In the fifth column of Fig. 4, the stick insect's feet are a good example: elongated and slender, and thus difficult to predict accurately. Only HitNet [142], with its high-resolution inputs, can predict the right-bottom foot (indicated by a red circle).
• The occlusion (OC) attribute refers to objects being partially occluded, a common challenge in general scenes [183]. For example, the sixth column of Fig. 4 shows two owls partially occluded by a wire fence, which separates their visible regions. Unfortunately, most of the presented models are unable to handle such cases.
• The indefinable boundary (IB) attribute is hard to address due to the uncertainty between foreground and background, as illustrated by the matting-level sample in the last column of Fig. 4.
• In the last two rows of Fig. 4, we display the predictions generated by SINet [19], our earliest baseline model. Current models have significantly improved upon it in location accuracy, boundary details, and other aspects. Additionally, CRNet [141], a weakly-supervised method trained with only weak labels, can locate target objects to a satisfactory standard.
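The dataset-resolution statistics quoted above can be reproduced with a short script. The sketch below is a minimal, hypothetical illustration (the function name and toy sizes are ours, not part of any released toolkit); in practice the (width, height) pairs would be gathered by opening each test image, e.g. with PIL's `Image.open(path).size`.

```python
from statistics import mean

def avg_resolution(sizes):
    """Average width and height over a list of (W, H) pairs."""
    widths, heights = zip(*sizes)
    return mean(widths), mean(heights)

# Toy sizes only -- not the real CAMO/NC4K/COD10K statistics.
toy_sizes = [(640, 480), (1288, 648)]
w, h = avg_resolution(toy_sizes)
print(f"W={w:.2f} and H={h:.2f}")  # W=964.00 and H=564.00
```

Running this over each dataset's test split yields the per-dataset averages reported above, which is how the resolution gap between COD10K and the other benchmarks can be verified.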

DISCUSSION AND OUTLOOK
Based on our literature review and experimental analyses, we discuss five challenges and potential CSU-related directions in this section.
Annotation-Efficient Learning. Deep learning techniques have significantly advanced the field of CSU. However, conventional supervised deep learning is data-hungry and resource-consuming. In practical scenarios, we hope models can work with limited resources and generalize well. Developing annotation-efficient learning strategies for CSU tasks is thus a promising direction, e.g., the weakly-supervised strategy in CRNet [141].
Domain Adaptation. Camouflaged samples are generally collected from natural scenes, so deploying models to detect concealed objects in, e.g., autonomous-driving scenarios is challenging. Recent practice demonstrates that various techniques can alleviate this problem, e.g., domain adaptation [184], transfer learning [185], few-shot learning [186], and meta-learning [187].
High-Fidelity Synthetic Dataset. To alleviate algorithmic biases, increasing the diversity and scale of data is crucial. The rapid development of AI-generated content (AIGC) [188] and deep generative models, such as generative adversarial networks [189], [190], [191] and diffusion models [192], [193], is making it easier to create synthetic data for general domains. Recently, to address the scarcity of multi-pattern training images, Luo et al. [106] proposed a diffusion-based image generation framework that generates salient objects on a camouflaged sample while preserving its original label. A model should therefore be capable of distinguishing between camouflaged and salient objects to achieve a robust feature representation.
Neural Architecture Search. Neural architecture search (NAS) is a promising research direction that can automatically discover optimal network architectures for superior performance on a given task. In the context of concealment, NAS can identify architectures that better handle complex background scenes, highly variable object appearances, and limited labeled data, leading to more efficient and effective networks with improved accuracy. Combining NAS with other research directions, such as domain adaptation and data-efficient learning, can further enhance concealed scene understanding. These avenues hold significant potential for advancing the state-of-the-art and warrant further investigation in future research.
Large Model and Prompt Engineering. This topic has gained popularity and has become a major direction in the natural language processing community. Recently, the Segment Anything Model (SAM) [194] has revolutionized computer vision algorithms, although it has limitations [195] in unprompted settings on several concealed scenarios. One can leverage the prompt engineering paradigm to simplify workflows using a well-trained robust encoder and task-specific adaptations, such as task-specific prompts and multi-task prediction heads. This approach is expected to become a future trend within the computer vision community. Large language models (LLMs) have brought both new opportunities and challenges to AI, moving it further towards artificial general intelligence. However, training such resource-consuming large models is challenging for academia. A promising paradigm is to use state-of-the-art deep CSU models as domain experts, while large models serve as external components that assist the expert models by providing auxiliary decisions, representations, etc.

DEFECT SEGMENTATION DATASET
Industrial defects usually originate from undesirable production processes, e.g., mechanical impact, workpiece friction, chemical corrosion, and other unavoidable physical factors. Their external visual form usually presents unexpected patterns or outliers, e.g., surface scratches, spots, and holes on industrial devices; color differences and indentation on fabric surfaces; and impurities, breakage, and stains on material surfaces. Previous works have achieved promising advances in identifying visual defects with vision-based techniques, such as classification [196], [197], [198], detection [199], [200], [201], and segmentation [202], [203], [204]. However, these techniques assume that defects are easily detected, ignoring challenging defects that are "seamlessly" embedded in their material surroundings. Motivated by this, we carefully collected a new multi-scene benchmark, named CDS2K, for the concealed defect segmentation task, whose samples are selected from existing industrial defect databases.

Dataset Organisation
To create a dataset of high quality, we established three principles for selecting data: (a) the chosen sample should include at least one defective region, which serves as a positive example; (b) the defective regions should have a pattern similar to the background, making them difficult to identify; and (c) we also select normal cases as negative examples to provide a contrasting perspective to the positive ones. These samples were selected from the following well-known defect segmentation databases.
• MVTecAD 4 [205], [206] contains several positive and negative samples for unsupervised anomaly detection. We manually selected 748 positive and 746 negative samples with concealed patterns from two main categories: (a) the object category, as in the 1st row of Fig. 5: pill, screw, tile, transistor, wood, and zipper; (b) the texture category, as in the 2nd row of Fig. 5: bottle, capsule, carpet, grid, leather, and metal nut. The number of positive/negative samples is shown with yellow circles in Fig. 5.
• NEU 5 provides three different databases: oil pollution defect images [207] (OPDI), spot defect images [208] (SDI), and steel pit defect images [209] (SPDI). As shown in the third row (green

Dataset Description
The CDS2K comprises 2,492 samples: 1,330 positive and 1,162 negative instances. Three human-annotated labels are provided for each sample: category, bounding box, and pixel-wise segmentation mask. Fig. 6 illustrates examples of these annotations. The average ratio of defective regions for each category is presented in Table 7, which indicates that most defective regions are relatively small.
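The per-category statistic in Table 7 can be computed directly from the segmentation masks. The following sketch is a hypothetical illustration (the function and variable names are ours, not part of the CDS2K toolkit); masks are represented as plain 2-D grids of 0/1 values for simplicity.

```python
from collections import defaultdict

def mean_defect_ratio(samples):
    """samples: iterable of (category, mask) pairs, where mask is a
    2-D grid of 0/1 values (nonzero = defective pixel).
    Returns {category: mean fraction of defective pixels}."""
    ratio_sum = defaultdict(float)
    count = defaultdict(int)
    for category, mask in samples:
        total = sum(len(row) for row in mask)
        defective = sum(1 for row in mask for v in row if v)
        ratio_sum[category] += defective / total
        count[category] += 1
    return {c: ratio_sum[c] / count[c] for c in ratio_sum}

# Toy 4x4 mask: one "pill" sample with 2/16 defective pixels.
toy = [("pill", [[0, 0, 0, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])]
print(mean_defect_ratio(toy))  # {'pill': 0.125}
```

A small mean ratio for a category confirms that its defective regions occupy only a tiny fraction of the image, which is what makes them behave like small concealed objects.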

Evaluation on CDS2K
Here, we evaluate the generalizability of current cutting-edge COS models on the positive samples of CDS2K. Based on code availability, we choose four top-performing COS approaches: SINetV2 [22], DGNet [27], CamoFormer-P [147], and HitNet [142]. As reported in Table 8, these models are not effective at handling cross-domain samples, highlighting the need for further exploration of the domain gap between natural scenes and downstream applications.
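For reference, one of the standard COS evaluation metrics, mean absolute error (MAE), is straightforward to implement. The sketch below is a minimal, dependency-free simplification of our own (operating on plain 2-D lists rather than tensors); real benchmark toolkits compute it over full-resolution prediction maps.

```python
def mae(pred, gt):
    """Mean absolute error between a predicted map and a ground-truth
    mask, both 2-D grids with values in [0, 1]."""
    err, total = 0.0, 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            err += abs(p - g)
            total += 1
    return err / total

pred = [[1.0, 0.0],
        [0.5, 0.0]]
gt   = [[1, 0],
        [0, 0]]
print(mae(pred, gt))  # 0.125
```

Lower is better; a model that perfectly reproduces the binary mask scores 0.0, so the metric directly reflects the per-pixel segmentation error averaged over the map.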

CONCLUSION
This paper provides an overview of deep learning techniques tailored for concealed scene understanding (CSU). To help readers view the global landscape of this field, we have made four contributions. First, we provide a detailed survey of CSU, including its background, taxonomy, task-specific challenges, and advances in the deep learning era; to the best of our knowledge, this survey is the most comprehensive to date. Second, we have created the largest and most up-to-date benchmark for concealed object segmentation (COS), a foundational and flourishing direction in CSU, which allows for a quantitative comparison of state-of-the-art techniques. Third, we have collected the largest concealed defect segmentation dataset, CDS2K, including hard cases from diverse industrial scenarios, and constructed a comprehensive benchmark to evaluate the generalizability of deep CSU in practical scenarios. Finally, we discuss open problems and potential directions for this community, aiming to encourage further research and development in this area. We conclude from the following perspectives. (1) Model. The most common practice is a shared UNet-style architecture enhanced by various attention modules. In addition, injecting extra priors and/or introducing auxiliary tasks improves performance, and many potential problems remain to explore. (2) Training. Fully-supervised learning is the mainstream strategy in COS, but few researchers have addressed the challenges caused by insufficient data or labels; CRNet [141] is a good attempt to alleviate this issue. (3) Dataset. Existing datasets are still not large and diverse enough. This community needs more concealed samples from more domains (e.g., autonomous driving and clinical diagnosis). (4) Performance.
Transformer- and ConvNeXt-based models outperform other competitors by a clear margin. The cost-performance trade-off remains understudied, for which DGNet [27] is a good attempt. (5) Metric. There is no well-defined metric that accounts for the varying degrees of camouflage across data to give a comprehensive evaluation, which causes unfair comparisons.
Besides, existing CSU methods focus on the appearance attributes of concealed scenes (e.g., color, texture, boundary) to distinguish concealed objects, without sufficient perception and output from the semantic perspective (e.g., relationships between objects). However, semantics is a good tool for bridging the gap between human and machine intelligence. Therefore, beyond the visual space, semantic-level awareness is key to next-generation concealed visual perception. In the future, CSU models should incorporate various semantic abilities, including integrating high-level semantics, learning vision-language knowledge [213], and modeling interactions across objects.
We hope that this survey provides a detailed overview for new researchers, presents a convenient reference for relevant experts, and encourages future research.

Fig. 3 .
Fig. 3. Network architectures for COS at a glance. We present four types of frameworks from left to right: (a) multi-stream framework, (b) bottom-up/top-down framework and its variant with deep supervision (optional), and (c) branched framework. See §3.1.1 for more details.

Fig. 4. Fig. 5.
Fig. 4. Qualitative results of ten COS approaches. For more descriptions of the visual attributes in each column, refer to §5.6.

Fig. 6 .
Fig. 6. Visualization of different annotations. We select a group of images from the MVTecAD database, including a negative (a) and a positive (b) sample. Corresponding annotations are provided: category (scratches on wood) and defect locations: bounding box (c) and segmentation mask (d).

TABLE 1
Essential characteristics of reviewed image-based methods. This summary outlines the key characteristics, including: Architecture Design (Arc.): the framework used, which can be a multi-stream (MSF), bottom-up & top-down (BTF), or branched (BF) framework.

Task Level (T.L.): the specific tasks addressed by the method, including COS (•), CIS (•), COC ( ), and multi-task learning ( ). N/A indicates that the source code is not available. For more detailed descriptions of these characteristics, please refer to §3.1 on image-level CSU models.

TABLE 3
Essential characteristics of CSU datasets. Train/Test: number of samples for training/testing (e.g., images for an image dataset or frames for a video dataset). Task: data type of the dataset. N.Cam.: whether non-camouflaged samples are collected. Cls.: whether classification labels are provided. B.Box: whether bounding box labels are provided for the detection task. Obj./Ins.: whether object- or instance-level segmentation masks are provided for segmentation tasks. Rank: whether ranking labels are provided for instances. Scr.: whether weak labels in scribble form are provided. Cou.: whether dense object counting labels are provided. See §4.1 and §4.2 for more descriptions.