QCNet: query context network for salient object detection of automatic surface inspection

Building upon fully convolutional networks (FCNs), deep learning-based salient object detection (SOD) methods achieve gratifying performance in many vision tasks, including surface defect detection. However, most existing FCN-based methods still suffer from coarse object edge predictions. State-of-the-art methods employ intricate feature aggregation techniques to refine boundaries, but they are often too computationally expensive to deploy in real applications. This paper proposes a semantics-guided detection paradigm for salient object detection. A guided atrous pyramid module is first applied to the top feature to segment complete salient semantics. Query context modules are then used to build relation maps between saliency and structural information along the top-down pathway. These two modules allow the semantic features to flow throughout the decoder phase, yielding detail-enriched saliency predictions. Experimental results demonstrate that the proposed method performs favorably against state-of-the-art methods on surface defect detection and SOD benchmarks. In addition, the method runs at 27 FPS in a fully convolutional fashion without any post-processing, which gives it the potential for real-time detection.


Introduction
The human visual system has an excellent attention mechanism, which can capture the most important part of a visual scene at first glance. Salient object detection (SOD) is an effective way to imitate this system. Unlike other dense-labeling visual tasks, SOD methods aim to distinguish the most visually prominent areas in a frame. Usually, SOD serves as a first step that benefits downstream visual tasks, including image segmentation [1], visual tracking [2,3], video abstraction [4], and content-aware image editing [5].
Recently, deep learning [6]-based convolutional neural networks (CNNs, e.g., ResNet [7] and VGG [8]) trained for image classification have been adopted for salient object detection via transfer learning. The fully convolutional network (FCN) [11] set a paradigm for dense-labeling tasks, surpassing traditional methods by a large margin [9,10]. However, FCN-like models suffer from coarse edge predictions. The consecutive downsampling operations in CNNs result in the loss of spatial information, which is critical for object reconstruction.
To address this problem, feature aggregation mechanisms have been introduced to refine the high-level features with local information in a recursive way. Feng et al. [12] adopted feature pairs to build ternary attention maps, transmitting multi-level information throughout the whole decoder stage. Xie et al. [13] used an effective feature aggregation mechanism to resolve the object boundaries. Wang et al. [14] learned residual features by integrating deep and shallow features to generate multi-context information. Although these strategies have brought satisfactory improvements, the boundaries of salient objects are still not explicitly modeled, and the relations between salient object information and edge information are not fully explored. Besides, some methods use superpixels [15] or CRF [16][17][18] as post-processing to preserve the object boundaries, and EGNet [55] introduced extra supervision specially designed for edge refinement. The main drawbacks of these approaches are their computational cost and low inference speed.
This paper proposes a saliency-guided object detection paradigm and introduces a query mechanism to explore the latent relations between salient and edge information, without any costly post-processing. To help the semantics locate the target and restore clear boundaries, these two types of information are fully integrated and the relationships between them are fully explored. Our model consists of two primary modules built on an encoder-decoder network: a guided atrous pyramid module (GAPM) and a query context module (QCM). GAPM captures complete and accurate salient objects, which serve as guidance information for reconstruction. It consists of atrous convolution blocks with different atrous rates. The high-level semantic information collected by GAPM is then successively delivered to the feature maps at all pyramid levels, building relation maps between salient and edge information across all stages. The relation map captures the relationships between these two kinds of features, refining predictions from coarse to fine. Without sophisticated edge refinement modules, the proposed model can accurately locate targets with complex edges and make accurate predictions. In addition, extra supervision is introduced at different stages of the decoder to optimize the training process.
Besides that, the proposed method has been transferred to surface defect detection tasks. We construct a carbon fiber-reinforced plastics (CFRP) defect dataset. CFRP is a material widely used in aerospace, transportation, and energy [19][20][21], and it is superior to traditional materials in many aspects, including low weight, high strength, and high-temperature resistance [22]. The target defects of the CFRP dataset, such as break, bridging, disconnect, foreign, gap, puckering and tow defects, are shown in Fig. 1. Compared with natural images, defect images contain scarcer semantic information and the object edges are harder to recognize, which makes restoring fine boundaries more challenging.
In summary, this paper makes three major contributions:
• Query Context Network (QCNet) is proposed to explicitly build relations between salient objects and edge information to make fine edge predictions.
• We further transfer the model to surface defect detection tasks and contribute a CFRP defect dataset to test the generality of the proposed method.
• The proposed model can run at a real-time speed of 27 FPS and achieves state-of-the-art performance on multiple popular salient object detection and surface defect detection benchmarks.

Salient object detection
Salient object detection aims to distinguish the most visually obvious areas. Traditional SOD models make predictions mainly based on various saliency cues, including local contrast [23], global contrast [24], and background prior [40]. Recently, CNN-based SOD models have achieved promising performance. Qin et al. [52] proposed a predict-refine architecture to segment the salient object regions effectively and used a hybrid loss to supervise the training process at three different levels. In [25], Yan et al. proposed a size divide-and-conquer mechanism that separates and learns the features of objects of different sizes, achieving good results on many tasks. Wang et al. [16] combined background prior with the analysis of boundary properties to enhance salient objects and suppress the background in an image. Qin et al. [59] proposed a two-level nested U-structure model that is able to capture more contextual information from different scales. Based on the U-shape architecture, Liu et al. [58] used a global guidance module and feature aggregation modules to fuse the multi-level features in a top-down way. Although these models use multi-scale features from different perspectives, they ignore the implicit relationships between semantic and structural features. Our model proposes an explicit pipeline to model these two kinds of features and efficiently uses the information they carry to make final predictions.

Atrous convolution
Salient object detectors often use classification models (ResNet [7], VGG [8]) as feature extractors in the first stage, and these backbones have proven efficient in many semantic segmentation tasks. However, these models are designed for classification tasks: their consecutive down-sampling operations significantly reduce the spatial resolution of the resulting features, and spatial resolution is essential for segmentation tasks. Atrous convolution is a type of convolution that inflates the kernel by inserting holes between the kernel elements. It can expand the receptive field of the model without introducing additional parameters. Models based on atrous convolution have been actively explored for semantic segmentation. Wu et al. [26] experimented with the effect of modifying atrous rates for capturing long-range information. In [27], Chen et al. proposed atrous spatial pyramid pooling (ASPP) to exploit multi-scale features by employing multiple parallel filters with different rates. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-view, thus capturing objects and image context at various scales. However, this kind of structure often causes grid artifacts [28], harming model performance. At the same time, as the network depth increases, a large rate may cause degradation of the convolution kernel. We propose a guided atrous pyramid module for better global feature extraction. Besides features with different receptive fields, we use a one-to-one guide on each branch to guarantee the continuity of features.
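To make the "holes" concrete, the following minimal sketch computes the span covered by an atrous kernel and applies a 1D atrous convolution in plain Python. The function names are illustrative and not taken from the paper; the rates 12, 24 and 36 are the ones used later for GAPM.

```python
def effective_kernel_size(kernel_size, rate):
    """Span covered by a kernel whose taps are spaced `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1D atrous convolution (cross-correlation, as in deep learning)."""
    span = effective_kernel_size(len(kernel), rate)
    out = []
    for start in range(len(signal) - span + 1):
        # Taps are read every `rate` samples; the skipped samples are the holes.
        out.append(sum(w * signal[start + k * rate] for k, w in enumerate(kernel)))
    return out


if __name__ == "__main__":
    # A 3-tap kernel always has 3 weights, whatever the rate.
    for rate in (1, 12, 24, 36):
        print(rate, effective_kernel_size(3, rate))
    print(atrous_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 0, -1], rate=2))
```

A 3x3 kernel with rate 36 therefore spans 73 pixels per side while still holding only nine weights per channel pair, which is why large rates widen the receptive field at no parameter cost.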

Attention models
Attention models are widely used in recent neural networks. The main idea is that the model should pay more attention to the region of interest to obtain more detailed information about the target while suppressing other useless information. The visual attention mechanism dramatically improves the efficiency and accuracy of visual information processing. Islam et al. [29] applied gate units between each pair of encoder and decoder blocks as attention models. These gate units control the feedforward message passing in order to filter out ambiguous information. However, this message filtering only happens between features of different levels within the encoder, and no feature filtering takes place in the decoder phase. To address this, our query context module builds relationships between features of different levels step by step, based on high-quality semantic predictions, making predictions while filtering ambiguous information.

Automatic surface inspection
Over the past two decades, numerous methods based on computer vision have been introduced to automatic surface inspection (ASI) problems [30][31][32], which can be generally divided into traditional detection approaches and deep learning-based approaches.
Traditional methods have mainly relied on hand-crafted features, such as statistical information [33], texture [17], and distribution patterns [34]. Despite their efficiency, hand-crafted features mostly focus on structural characteristics, and the lack of semantic representations limits their performance. Most of these models also depend heavily on expertise, which is another obstacle to their application.
With the rapid development of deep learning, CNN-based methods have been introduced to the ASI field, significantly surpassing traditional methods [32]. Since Long et al. [11] proposed FCN to predict semantic labels at the pixel level, FCN-based models have been popularized in surface inspection fields, further improving the efficiency and accuracy of detection results. Yang et al. [35] used a feature pyramid to transfer multi-scale context information from deep to shallow features and then used side networks to generate predictions, achieving good performance on pavement crack detection. Yan et al. [36] proposed a task mode that combines defect type classification with defect area segmentation, together with a mixed supervision network architecture, achieving good performance in four ASI tasks. In [37], Yang et al. proposed a multi-scale feature-clustering-based fully convolutional autoencoder method for texture defect detection. Although good results were achieved, most of these methods focus on high-level features but neglect the importance of low-level edge features and the relationships between the two. In contrast, we use a guided atrous pyramid module to extract more effective semantic information and then model the relationships between features of different resolutions with the proposed query context modules, making more refined predictions step by step.
Fig. 2 The pipeline of the proposed approach. Our network follows an encoder-decoder design. We denote the output feature maps of the encoder and decoder as $E^{(l)}$ and $D^{(l)}$, respectively, where $l \in \{1, 2, 3, 4\}$. The input image is first passed through the encoder to extract the multi-level features. Then, the Guided Atrous Pyramid Module is applied to the top feature to extract $E^{(5)}$. The decoder consists of four Query Context Modules that fuse $E^{(5)}$ with the lower features $E^{(1)} \sim E^{(4)}$ to make predictions. Each stage of the decoder has additional supervision to guarantee prediction quality

The proposed method
In this paper, we propose a Query Context Network (QCNet) with a novel Guided Atrous Pyramid Module (GAPM) and Query Context Modules (QCMs) to predict salient objects with complete object regions and fine boundaries. In this section, we begin by describing the complete pipeline of the model and then introduce the guided atrous pyramid module in Sect. 3.2, the query context module in Sect. 3.3, and the segmentation head in Sect. 3.4.

Network overview
Similar to most previous approaches for salient object detection, we choose ResNet50 [7] and VGG16 [8] as our backbone networks and develop the model in an encoder-decoder style. The network is illustrated in Fig. 2. The four pairs of encoder and decoder blocks are denoted as $E^{(l)}$ and $D^{(l)}$, and the corresponding prediction maps are denoted as $S^{(l)}$, where $l \in \{1, 2, 3, 4\}$ represents the stage number. The output feature of GAPM is denoted as $E^{(5)}$.
Encoder network. We modify the backbone network into a fully convolutional network by discarding the last fully connected layers. $E^{(1)} \sim E^{(4)}$ are the features after each stage, with downsample rates of 4, 8, 16 and 32, respectively.
Guided atrous pyramid module. The GAPM, described in Sect. 3.2, takes advantage of atrous convolution, extracting high-quality contextual semantic information from $E^{(4)}$. This is the guide cue for the whole decoder phase.
Decoder network. The decoder network consists of four QCMs and segmentation heads. We discuss the implementation details in Sects. 3.3 and 3.4. When training the network, every $D^{(l)}$ estimates a saliency map $S^{(l)}$, each of which is supervised by the same ground truth $G$. In particular, we use binary cross-entropy loss on $S^{(1)}$ and binary dice loss for the others.
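As a quick worked example of the downsample rates, the sketch below lists the spatial size of each encoder stage for the two input resolutions mentioned in the implementation details (384 for natural images, 512 for ASI images). The helper name is hypothetical, and exact integer division is assumed.

```python
def stage_resolutions(input_size, strides=(4, 8, 16, 32)):
    """Side length of each encoder feature map E1..E4 for a square input."""
    return {f"E{level}": input_size // stride
            for level, stride in enumerate(strides, start=1)}


if __name__ == "__main__":
    print(stage_resolutions(384))
    print(stage_resolutions(512))
```

For a 384x384 input, the top feature is only 12x12, which illustrates how much spatial detail has to be recovered by the decoder.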

Guided atrous pyramid module
In this section, we revisit the structure of ASPP and design a guided atrous pyramid module for better semantic extraction. GAPM has three atrous convolution branches with atrous rates of 12, 24 and 36, respectively, which are larger than those of the ASPP module. Larger atrous rates bring a wider receptive field and help capture global information, but cause degradation of the convolution layer. We therefore add a global average pooling branch, which alleviates the grid effect by providing a basic vector for each feature map. Besides that, a one-to-one guide branch is added after each atrous branch. The guide feature has complete spatial information; combined with the global context from the other branches, it yields a feature map with richer semantic expressions. We add a weight merge mechanism before the output for better feature reuse. The structure of GAPM is shown in Fig. 3.
Fig. 3 Illustration of the guided atrous pyramid module
GAPM is applied on the top feature of the backbone outputs, which is denoted as $E^{(4)}$. The guide feature is first computed as

$$F_{skip} = W_1(E^{(4)}),$$

where $W_1$ denotes a 1x1 convolution layer. Then, we use three convolution layers with different atrous rates and a global pooling layer to get multi-scale feature maps:

$$F_{out} = F_{concat}\big(F_{a_i}(E^{(4)}) + F_{skip},\; F_{gap}(E^{(4)})\big),$$

where (1) $F_{concat}$ represents feature concatenation; (2) $F_{a_i}$ denotes the atrous convolution layers, $a_i \in \{12, 24, 36\}$; (3) $F_{gap}$ denotes a global average pooling layer.
We use a weight merge in the output phase to guarantee better feature fusion. Here, $W_2$ denotes a global average pooling layer followed by a 1x1 convolution layer for weight calculation, and $W_3$ denotes a 1x1 convolution layer for output channel adjustment.
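The data flow described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the channel sizes and random weights are arbitrary, normalization and activation layers are omitted, the 1x1 projection on the pooled branch (`w_gap`) is an added convenience, and the weight merge is assumed to be a sigmoid channel gate, which the text does not specify.

```python
import numpy as np


def conv1x1(x, w):
    """1x1 convolution. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def atrous_conv3x3(x, w, rate):
    """Zero-padded 3x3 atrous convolution. w: (C_out, C_in, 3, 3)."""
    _, height, width = x.shape
    padded = np.pad(x, ((0, 0), (rate, rate), (rate, rate)))
    out = np.zeros((w.shape[0], height, width))
    for ky in range(3):
        for kx in range(3):
            # Each tap reads the input shifted by (ky - 1, kx - 1) * rate pixels.
            patch = padded[:, ky * rate:ky * rate + height, kx * rate:kx * rate + width]
            out += conv1x1(patch, w[:, :, ky, kx])
    return out


def init_params(c_in, c_mid, c_out, rates=(12, 24, 36), seed=0):
    """Random illustrative weights; all channel sizes are arbitrary choices."""
    rng = np.random.default_rng(seed)
    c_cat = (len(rates) + 1) * c_mid
    return {
        "w1": 0.1 * rng.normal(size=(c_mid, c_in)),
        "atrous": {r: 0.1 * rng.normal(size=(c_mid, c_in, 3, 3)) for r in rates},
        "w_gap": 0.1 * rng.normal(size=(c_mid, c_in)),  # assumed projection
        "w2": 0.1 * rng.normal(size=(c_cat, c_cat)),
        "w3": 0.1 * rng.normal(size=(c_out, c_cat)),
    }


def gapm(e4, params):
    f_skip = conv1x1(e4, params["w1"])  # one-to-one guide feature
    branches = [atrous_conv3x3(e4, w, r) + f_skip for r, w in params["atrous"].items()]
    pooled = e4.mean(axis=(1, 2), keepdims=True)  # global average pooling
    f_gap = np.broadcast_to(conv1x1(pooled, params["w_gap"]), f_skip.shape)
    f_out = np.concatenate(branches + [f_gap], axis=0)  # F_concat
    # Assumed weight merge: pooled features -> 1x1 conv (W2) -> sigmoid gate.
    pooled_out = f_out.mean(axis=(1, 2), keepdims=True)
    gate = 1.0 / (1.0 + np.exp(-conv1x1(pooled_out, params["w2"])))
    return conv1x1(f_out * gate, params["w3"])  # W3 adjusts the output channels


if __name__ == "__main__":
    e4 = np.random.default_rng(1).normal(size=(8, 12, 12))
    print(gapm(e4, init_params(c_in=8, c_mid=4, c_out=6)).shape)
```

Note that on a 12x12 top feature, a rate of 12 or more places every off-center tap in the zero padding, which is one way to see why very large rates can degrade a 3x3 kernel toward a 1x1 kernel and why the guide and pooling branches are useful.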

Query context module
We control the message passing between encoder and decoder blocks via the Query Context Module (QCM). QCM is a dedicated information matching and filtering module, which takes the high-order semantic features as the key and the low-order structural features as the value. QCM calculates the relationships between these two types of features in a pixel-wise manner, establishing a more accurate correspondence between high-level semantic features and low-level structural features and thereby smoothing the sharp steps along the object contour. The top semantic feature processed by GAPM (denoted as $E^{(5)}$) has already filtered out most invalid information; thus, the accuracy of the subsequent feature matching calculation can be guaranteed. At the same time, the matching calculation between features of different stages also helps to screen out redundant and incorrect information and to obtain accurate predictions. Figure 4 shows the architecture of the query context module.
The top feature of the backbone (denoted as $E^{(4)}$) is first passed to GAPM, and the output is denoted as $E^{(5)}$. $E^{(5)}$ has more accurate and complete semantic features, which is the basis of the subsequent steps. We denote $x = \{x_i\}_{i=1}^{N}$ as the feature map of one semantic feature (after upsampling by 2) and $y = \{y_i\}_{i=1}^{N}$ as the feature map of one structure feature, where $N$ is the number of positions in the feature map ($N = H \cdot W$), $c$ denotes the global context feature, and $z$ denotes the output of the QCM. A temporary variable $t$ is defined first, the global context feature $c$ is then calculated from it, and the QCM output $z$ is finally expressed in terms of these quantities, where (1) $i$ indexes the semantic feature positions and $j$ indexes the structure feature positions; (2) $W_1$ and $W_2$ denote linear transformation matrices (1x1 convolution layers in the model); (3) $LN$ denotes Layer Normalization; (4) $\delta(\cdot)$ denotes the fusion function that aggregates the spatial context and the channel context.
The decoder phase consists of four QCMs, which restore the target contour step by step and finally produce the prediction result.
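To give a feel for the key/value idea, here is a small NumPy sketch loosely modelled on a generic global-context attention block. It is one plausible reading, not the paper's formulation: the exact definitions of $t$, $c$ and $z$ are assumptions, as is the choice of a softmax over positions for $t$ and a broadcast addition for the fusion function $\delta$.

```python
import numpy as np


def layer_norm(v, eps=1e-5):
    """Normalize a channel vector to zero mean and unit variance."""
    return (v - v.mean()) / np.sqrt(v.var() + eps)


def query_context(x, y, w1, w2):
    """x: semantic features (C, N); y: structural features (C, N).

    w1: (C,) scoring vector applied to the semantic key.
    w2: (C, C) linear transform applied to the normalized context.
    """
    logits = w1 @ x                    # (N,) one score per position
    t = np.exp(logits - logits.max())
    t = t / t.sum()                    # assumed: softmax over the N positions
    c = y @ t                          # (C,) weighted sum of the structural values
    channel_ctx = w2 @ layer_norm(c)   # (C,) channel context
    return y + channel_ctx[:, None]    # assumed fusion: broadcast addition


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 9))        # e.g. a 3x3 map with 4 channels, flattened
    y = rng.normal(size=(4, 9))
    z = query_context(x, y, rng.normal(size=4), rng.normal(size=(4, 4)))
    print(z.shape)
```

In this reading, the semantic feature decides where to look (the weights over positions), while the structural feature supplies what is aggregated, so ambiguous low-level responses outside the salient region contribute little to the context.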

Segmentation head
The downsample rates of $D^{(1)} \sim D^{(4)}$ are 4, 8, 16 and 32, respectively. Consecutive bilinear upsampling layers are applied to the feature maps to match the spatial resolution of $D^{(1)}$. Then, we sum up all feature maps and pass the result through the segmentation head, which consists of a transposed convolution layer and a bilinear upsampling layer, to get the final prediction, denoted as $S$. We also add a segmentation head to each of the four original feature maps to get the auxiliary prediction maps, denoted as $S^{(1)} \sim S^{(4)}$, which further supervise model training.
Throughout, $i = 1, 2, 3, 4$ indexes the decoder stages; $F_T$ denotes a transposed convolution layer, and its exponent represents the number of times it is applied; $F_{head}$ denotes a transposed convolution layer followed by a bilinear upsampling layer.
We use binary cross-entropy loss on the main branch and binary dice loss on the four auxiliary branches for better object completeness. The loss function combines the two, where $K$ denotes the number of auxiliary predictions (four in our model), $L_{bce}$ denotes the binary cross-entropy (BCE) loss, and $L_{dice}$ denotes the binary dice loss. BCE loss is widely used in binary segmentation tasks and measures the difference in probability distribution between the predicted value and the ground truth. Since the BCE loss estimates the overall classification accuracy of all pixels indiscriminately, we further adopt the binary dice loss, computed over the $N = H \cdot W$ pixels of the map, to enhance regional consistency.
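The following sketch shows how such a combined objective could be computed on flattened saliency maps. It uses the common textbook forms of the two losses (mean BCE, and dice as one minus twice the overlap divided by the sum of the two masses); the unweighted sum, the smoothing constant and the clipping value are assumptions rather than the paper's settings.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over all pixels of a flattened map."""
    total = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total / len(pred)


def dice_loss(pred, target, smooth=1.0):
    """1 - dice coefficient; `smooth` avoids division by zero on empty masks."""
    intersection = sum(p * g for p, g in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def total_loss(main_pred, aux_preds, target):
    """Assumed combination: BCE on the main map plus dice on each auxiliary map."""
    return bce_loss(main_pred, target) + sum(dice_loss(s, target) for s in aux_preds)


if __name__ == "__main__":
    gt = [1, 1, 0, 0]
    main = [0.9, 0.8, 0.2, 0.1]
    aux = [[0.7, 0.6, 0.4, 0.3]] * 4   # K = 4 auxiliary predictions
    print(total_loss(main, aux, gt))
```

Because the dice term is a ratio of region overlaps rather than a per-pixel average, it penalizes missing parts of a small defect more strongly than BCE alone would.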

Implementation details
We train our model using the DUTS-TR dataset and the CFRP defect dataset. We choose ResNet50 [7] and VGG16 [8] as the backbone networks, which are commonly used in salient object detection models. Our system is implemented in PyTorch. We train our network on a Titan Xp GPU for 50 epochs, with a base learning rate of 0.0001, momentum of 0.9, and weight decay of 0.000001. The batch size is set to 7. The backbone parameters are pretrained on ImageNet [38]. For the other convolutional layers, we initialize the weights using Kaiming uniform initialization [39]. We choose the Adam optimizer to train our neural networks. During inference, we discard all the auxiliary branches and use the output of the main branch as the final saliency map.
DUT-OMRON [40] consists of 5168 high-quality images manually selected from more than 140000 images. This dataset is quite challenging since images may contain more than one salient object and the backgrounds are relatively complex. DUTS [41] contains 15572 images, 10553 for training and 5019 for testing. DUTS is the largest publicly available salient object detection benchmark; most of its images are challenging in both scale and scene. ECSSD [42] has 1000 images with various complex scenes. HKU-IS [43] contains 4447 images, 2500 for training, 500 for validation and 1447 for testing, many of which have more than one salient object. Disconnected objects bring extra difficulty to detection. PASCAL-S [44] contains 850 images hand-picked from the validation set of the PASCAL VOC segmentation dataset [46].
The CFRP dataset has a total of 460 images; we split the data into a training set and a testing set at a ratio of 7:3. The spatial size of all images is 1000x1000, and the images can be divided into seven categories according to the different defect types.
We apply horizontal flipping as data augmentation; each image is resized to 384 (512 for ASI images) and normalized using the mean and standard deviation values provided with ResNet.
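A small sketch of the arithmetic and preprocessing steps mentioned here, in plain Python. The helper names are hypothetical; rounding the split to the nearest integer is an assumption, and the mean and standard deviation are assumed to be the commonly used ImageNet statistics that accompany pretrained ResNet weights.

```python
# Assumed: the usual ImageNet channel statistics for pretrained ResNet models.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def split_counts(total, train_ratio=0.7):
    """Number of training and testing images for a given ratio."""
    n_train = round(total * train_ratio)
    return n_train, total - n_train


def hflip(image):
    """Horizontal flip of an image stored as a list of rows."""
    return [row[::-1] for row in image]


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((v / 255.0 - m) / s
                 for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


if __name__ == "__main__":
    print(split_counts(460))           # 7:3 split of the CFRP dataset
    print(hflip([[1, 2, 3], [4, 5, 6]]))
    print(normalize_pixel((255, 128, 0)))
```

Under these assumptions the 7:3 split corresponds to 322 training and 138 testing images. For single-channel CFRP frames, the grayscale value would typically be replicated across the three channels before normalization, although the text does not state how this was handled.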
We evaluate the performance of our approach and other state-of-the-art methods with four widely used metrics: F-measure score, mean absolute error (MAE) [47], S-measure score [48] and E-measure score [49]. The F-measure score, denoted as $F_\beta$, is the weighted harmonic mean of average precision and average recall, and is computed as follows:

$$F_\beta = \frac{(1 + \beta^2) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall},$$

where we set $\beta^2$ to 0.3, as suggested in [24], to weight precision more than recall. Following most salient object detection methods [50,51], we report the maximum F-measure from all precision-recall pairs. The MAE [47] score measures the average pixel-wise difference between the saliency map $S$ and the ground truth $G$, formulated as:

$$MAE = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|,$$

where $W$ and $H$ denote the width and height of the saliency map, respectively. S-measure [48] is a structure-based metric, which concentrates on the structural information in the saliency maps. Compared to the above metrics, S-measure is closer to human visual perception, and is computed as:

$$S_m = \gamma S_o + (1 - \gamma) S_r,$$

where $S_o$ and $S_r$ denote the object-aware and region-aware structural similarity, respectively, and $\gamma$ is set to 0.5 by default.
E-measure [49] considers both the global means of the image and local pixel matching, and can be represented as:

$$E_m = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \phi_s(x, y),$$

where $\phi_s$ denotes the enhanced alignment matrix, which reflects the correlation between $S$ and $G$ after subtracting their global means.
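The two simpler metrics can be sketched in plain Python as below. This is a per-image simplification on flattened maps: benchmark toolkits typically average precision and recall over the whole dataset before taking the maximum, and the 256 evenly spaced thresholds used here are an assumption.

```python
def mae(saliency, ground_truth):
    """Mean absolute error between a saliency map and a binary mask."""
    return sum(abs(s - g) for s, g in zip(saliency, ground_truth)) / len(saliency)


def f_beta(saliency, ground_truth, threshold, beta2=0.3):
    """F-measure of the saliency map binarized at `threshold`."""
    pred = [1 if s >= threshold else 0 for s in saliency]
    true_pos = sum(p * g for p, g in zip(pred, ground_truth))
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred)
    recall = true_pos / sum(ground_truth)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)


def max_f_beta(saliency, ground_truth, steps=255):
    """Maximum F-measure over evenly spaced thresholds in [0, 1]."""
    return max(f_beta(saliency, ground_truth, t / steps) for t in range(steps + 1))


if __name__ == "__main__":
    sal = [0.9, 0.8, 0.2, 0.1]
    gt = [1, 1, 0, 0]
    print(mae(sal, gt), max_f_beta(sal, gt))
```

With $\beta^2 = 0.3$, a prediction with high precision and moderate recall scores higher than one with the reverse profile, which matches the stated intent of weighting precision more.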

Results on common salient object detection benchmarks.
We evaluate the performance of the proposed method on five widely used salient object detection datasets in terms of F-measure, MAE, S-measure and E-measure. Tables 1 and 2 show the test results. We can conclude that our model achieves SOTA results on natural images while keeping a fast inference speed.

Results on CFRP defect dataset
Besides common SOD benchmarks, we conduct experiments on the CFRP defect dataset to demonstrate the generality of the proposed method. In this scenario, our model needs to predict the most notable object from single-channel frames, which is slightly different from the previous tasks. Images from the CFRP dataset are primarily grayscale, so detectors mainly rely on changes in gray level and on the morphological characteristics of the defect to make the final prediction. We compare the performance of our model with other SOTA SOD models in terms of F-measure, MAE, S-measure and E-measure, as shown in Table 3. It can be seen that our model outperforms the other methods on all four evaluation metrics by a large margin. The F-measure, MAE, S-measure and E-measure are improved by 8.25%, 2.04%, 4.06% and 3.40%, respectively. Note that this is achieved without any post-processing.

Results on magnetic tile defect dataset
To further prove the effectiveness of the proposed method, we test it on the magnetic tile defect dataset. Our model can be easily applied to many ASI tasks without any specific processing. The result is shown in Table 3.

Ablation experiments
In this section, we explore the effect of proposed components in QCNet.All experiments are based on the DUTS-TE and CFRP datasets, respectively, and use ResNet50 as the backbone network.For comparison, we set a baseline model that replaces the GAPM and QCM with ASPP and FPN, respectively.

Extractor of semantic feature
In this subsection, we explore the effectiveness of the proposed GAPM. In the baseline model, we use ASPP to extract semantic information. The output of the backbone model, $E^{(4)}$, is passed into the ASPP module, and we denote the resulting semantic feature as $\hat{E}^{(5)}$. We then replace ASPP with the proposed GAPM, extract $E^{(5)}$ and $\hat{E}^{(5)}$ from the two models, and compare the completeness and accuracy of their semantic information. The resulting heat maps are shown in Fig. 6. The middle two columns and the last two columns show the heat maps for the DUTS-TE and CFRP datasets, respectively. We can see that the model with GAPM locates the target more accurately and suppresses interfering elements well. We also test the performance of these two strategies, as shown in the second row of Tables 4 and 5. Both results show that the proposed GAPM works better for semantic feature extraction, which provides a better basis for the subsequent processing.

Bottom-up feature propagation
In this subsection, we explore the superiority of the proposed QCM. Compared to the baseline, we remove all FPN blocks and replace them with QCMs in the decoder phase. The test result is shown in the third row of Tables 4 and 5. We find that QCM significantly improves model performance on all four evaluation metrics. Moreover, when QCM is combined with GAPM, the final evaluation result improves greatly over the baseline, as shown in the fourth row of Tables 4 and 5. The full model exceeds the baseline by 2.21% (7.64%) on F-measure, 7.89% (6.25%) on MAE, 1.14% (3.05%) on S-measure and 1.04% (3.17%) on E-measure, respectively.

Visual comparison
To further illustrate the superiority of the proposed method, we show a qualitative comparison with ten other SOTA models. As shown in Fig. 7, our model is able to accurately segment the target in both natural scenes (first four rows) and ASI tasks (last four rows). In addition, the segmentation results of our model have better completeness and sharper boundaries.

Conclusion
This paper proposes a saliency-guided salient object detection paradigm, which achieves excellent segmentation results at a lower computational cost. Based on this idea, we first extract salient object information under multiple different receptive fields. Then, we propose a query context module to build relations between salient and edge information, which gradually restores object boundaries stage by stage. The whole network is capable of capturing complete objects and preserving fine edges. Meanwhile, the proposed model can be easily transferred to the surface defect detection field. The model performs favorably against state-of-the-art methods on salient object detection and surface defect detection benchmarks without any post-processing. Besides that, our model can run at a real-time speed of 27 FPS.
In the future, we will focus on two directions. The first is data augmentation, motivated by the expensive manual annotation required for defect detection datasets. The second is extending our work to 3D detection tasks.

Fig. 4 Architecture of query context module

Fig. 5 Examples of magnetic tile defect dataset

Table 1
Comparison of our model and 10 SOTA models on ECSSD and HKU-IS in terms of FPS, $F_{max}$, MAE, S-measure and E-measure
a ↑ and ↓ denote that larger and smaller is better, respectively
b $FPS_1$ and $FPS_2$ denote GPU and CPU inference speed, respectively
c Bold italic, italic and bold indicate the best, second best and third best performance, respectively

Table 2
Comparison of our model and 10 SOTA models on PASCAL-S, DUT-O and DUTS-TE in terms of $F_{max}$, MAE, S-measure and E-measure

Table 3
Comparison of our model and 10 SOTA models on the CFRP and magnetic tile defect datasets in terms of FPS, $F_{max}$, MAE, S-measure and E-measure
a ↑ and ↓ denote that larger and smaller is better, respectively
b $FPS_1$ and $FPS_2$ denote GPU and CPU inference speed, respectively
c Bold italic, italic and bold indicate the best, second best and third best performance, respectively

Table 4
Ablation analyses on DUTS-TE dataset.B denotes the baseline model

Table 5
Ablation analyses on CFRP dataset.B denotes the baseline model