Learning graph structures with transformer for weakly supervised semantic segmentation

Weakly supervised semantic segmentation (WSSS) is a challenging task in computer vision. State-of-the-art semantic segmentation methods are usually based on the convolutional neural network (CNN), whose main drawbacks are the inability to explore global information correctly and the failure to activate potential object regions. To avoid these drawbacks, the transformer has been explored for the WSSS task, but the transformer cannot determine an effective semantic association between different patch tokens. To address this issue, inspired by the graph convolutional network (GCN), this paper proposes a graph structure that learns the semantic category relationships between different blocks in the vector sequence. To verify the effectiveness of the proposed method, extensive experiments were conducted on the publicly available PASCAL VOC 2012 dataset. The experimental results show that the proposed method achieves a significant performance improvement in the WSSS task and outperforms other state-of-the-art transformer-based methods.


Introduction
In recent years, breakthroughs in computer vision have led to the rapid development of related applications, such as driverless automobiles, drones, and virtual reality. Taking driverless automobiles [1] as an example, conventional iterative learning control methods [2] can solve some driving control problems, but neural networks are now usually used to handle nonlinear system events in driving [3]. Semantic segmentation is a key technique in computer vision and plays a critical role in its development. Because pixel-level labeling makes traditional fully supervised semantic segmentation (FSSS) costly, many researchers have switched to weakly supervised approaches. Image-level labeling is an inexpensive form of weak labeling and is one of the most popular weakly supervised settings.
Most previous WSSS methods rely on the class activation map (CAM) as the initialization seed. Although these methods have made sophisticated extensions to the CAM strategy, the CAM labels only local parts of an object and therefore cannot cover the complete category region. To solve this problem, some studies have tried to find the semantic correlation between images. In [4], a semantic correlation module between images is constructed, so that the method can not only obtain the semantic information of a single image, but also exploit the similarities and differences between different images for complementary supervision. In [5], a new approach based on the multi-head attention mechanism, called cooperative information, is explored to aggregate the contextual relations within the image. When mining the information, the biggest drawback of these methods is that they are limited to the information of a single image, while ignoring the semantic correlation between different images. To address this problem, the GCN is introduced in [6] to construct the node relationships between different images, and the semantic relationships between different image groups are mined.
Although the above methods improve the performance of WSSS, the CNN structure still has certain limitations: the object regions are activated through the loss of a classification model, which makes it difficult to pinpoint the seed. However, very few studies have tried to address this inherent semantic defect between images. Therefore, capturing the semantic correlation between features at different spatial locations is crucial for WSSS. To this end, the transformer model designed in the Vision Transformer (ViT) [7] has recently been used in computer vision, and great breakthroughs in performance have been achieved. For example, TS-CAM [8] shows that semantic segmentation information is explicitly present in the ViT features, and that the self-attention mechanism of the transformer can be employed to construct the relationships between global features, thereby overcoming the limitations of the CNN structure. However, in the ViT model, the image is split into different blocks, which are then converted into vector sequences. Although many recent studies have demonstrated that different vector sequences can focus on the semantic features in different regions of the image, no study has yet investigated the relationships between semantic categories and different blocks.
To address the above problem, this paper constructs a graph structure to learn the semantic category relations between different blocks of the vector sequences, and the CAM initialization seed is generated using the transformer, which avoids the defects of the CNN structure. To this end, this paper proposes learning graph structures with a transformer to discover the object of interest in WSSS. Specifically, as shown in Fig. 1, for the input image nodes, a set of semantically related image groups is constructed, and the attention mechanism of the transformer is used to establish the feature relationships. With an iterative training approach, the semantic information of different categories can be effectively propagated in the graph structure. This design not only resolves the defect of spatial location features in the CNN structure, but also further refines the relationship between blocks and semantic categories.
Our main contributions can be summarized as follows.
(1) We propose the learning graph structure with transformer framework (LGST) for WSSS with image-level labels. To the best of our knowledge, our approach is the first method that combines the graph structure and the transformer for WSSS tasks.
(2) To address the weakness of the transformer in learning local fine-level features, a graph structure is constructed, and the relationship between the semantic categories and the different blocks of the transformer is redefined, which further enhances the local semantic feature information of the network.
(3) Experiments were carried out to evaluate the performance of LGST on the PASCAL VOC 2012 dataset with image-level annotations, and the results show that our method offers a substantial improvement over existing transformer methods. For example, compared with the state-of-the-art method TransCAM [9], the performance of our method is improved by 1.6% on the validation set and 1.5% on the test set.

CAM-based learning method
CAM is a method used in CNN to identify different classes of image regions for segmentation, and with this method, the feature map of the last convolutional layer is multiplied with the object-specific weights of the classification layer.
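The weighting described above can be illustrated with a minimal pure-Python sketch. The function name `class_activation_map`, the toy feature maps, and the ReLU plus max-normalisation post-processing are assumptions made for illustration; they are common practice but are not specified in this paper.

```python
def class_activation_map(features, class_weights):
    """Weight each channel of the last conv feature map by the classifier
    weight of one class and sum over channels.

    features: list of C channel maps, each an H x W nested list.
    class_weights: list of C classifier weights for the chosen class.
    """
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for channel, weight in zip(features, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * channel[y][x]
    # Assumption: keep positive evidence only and scale to [0, 1].
    cam = [[max(value, 0.0) for value in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: two 2 x 2 channels and the weights of a single class.
features = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
weights = [1.0, -1.0]
cam = class_activation_map(features, weights)
```

In this toy case the strongest activation sits at the bottom-right location, which would become the seed for that class.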
In the WSSS approach, a classification network is trained for feature extraction, and a global average pooling (GAP) layer and a linear classification layer are used to generate the pseudo labels. The mainstream approaches, such as SEAM [10], use CAM for heuristic-driven exploration of objects. However, the main drawback of this localization approach is that the semantic information is sparse and cannot provide enough supervision for the segmentation networks.
Recently, some researchers have used subcategories and cross-image semantics to locate more precise object regions, such as EDAM [5], but this process is quite complicated. In addition, dilated convolution has been introduced to address the limited receptive field of the CNN, which encourages the CAM to propagate to its surroundings and expand its area. For example, AffinityNet [11] learns the correlation between neighboring pixels from the reliable seeds generated by the original CAM. The learned affinity is used to predict a correlation matrix that propagates the CAM by random walk. Similarly, SEC [12] uses the confident pixels in the segmentation results to learn near-neighbor affinity relationships. In IRNet [13], affinity is learned directly from the feature maps of the classification network to refine the CAMs. AuxSegNet [14] proposes a cross-task affinity that is learned from the saliency and segmentation representations in a weakly supervised multi-task framework.
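The random-walk propagation mentioned above can be sketched as follows. This is a simplified illustration, not the exact procedure of AffinityNet [11]: the function name `random_walk_propagate`, the row normalisation, the number of steps, and the toy affinity matrix are all assumptions.

```python
def random_walk_propagate(cam_scores, affinity, steps=2):
    """Spread CAM scores between pixels along a pairwise affinity matrix.

    cam_scores: list of n scores, one per pixel (flattened CAM).
    affinity: n x n nested list of non-negative pairwise affinities.
    """
    n = len(cam_scores)
    # Turn affinities into transition probabilities (each row sums to 1).
    transition = []
    for row in affinity:
        total = sum(row)
        transition.append([value / total for value in row])
    scores = list(cam_scores)
    for _ in range(steps):
        scores = [
            sum(transition[i][j] * scores[j] for j in range(n))
            for i in range(n)
        ]
    return scores


# Toy example: pixels 0 and 1 are similar, pixel 2 is unrelated.
affinity = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
seed = [1.0, 0.0, 0.0]
propagated = random_walk_propagate(seed, affinity, steps=2)
```

The activation of pixel 0 leaks to its similar neighbor while the unrelated pixel stays at zero, which is the behaviour used to expand sparse seeds.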

Graph convolutional networks for WSSS
Graph convolution is a relatively special neural network structure with translation invariance and parameter sharing, which is particularly effective in feature extraction of image sequences based on the Euclidean space. The graph structure can be employed to perform convolutional operations on the irregular information of numerous semantic nodes and to propagate features, which is an advantage that the CNN structure does not have. In a study published in 2017 [15], the proposed graph structure provides better performance and is also robust to different label perturbations. In [16], it is shown that the GCN provides an effective solution to the weakly supervised semantic segmentation of images. Some recent works [6,17] address the under-labeling problem in WSSS by mining comprehensive semantic information from graphs through structured modeling and iterative inference. In addition, [18] uses imprecise annotation markers to convert weakly supervised learning into semi-supervised learning, which improves the segmentation performance of the model. Meanwhile, by improving the attention mechanism, the GCN-based affinity attention [19] also demonstrates the research significance of the graph structure.
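To make feature propagation on a graph concrete, the sketch below applies one graph convolution step using the widely used symmetric normalisation of the adjacency matrix with self-loops. It is an illustration under assumptions: the helper names `matmul` and `gcn_layer` and the toy graph are hypothetical, and the cited works may use a different propagation rule.

```python
import math


def matmul(left, right):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(left[i][k] * right[k][j] for k in range(len(right)))
         for j in range(len(right[0]))]
        for i in range(len(left))
    ]


def gcn_layer(adjacency, features, weight):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adjacency)
    # Add self-loops so each node keeps part of its own feature.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in a_hat]
    normalised = [
        [a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]
    projected = matmul(matmul(normalised, features), weight)
    return [[max(value, 0.0) for value in row] for row in projected]


# Toy graph: two connected nodes with one-dimensional features.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
node_features = [[1.0], [3.0]]
layer_weight = [[1.0]]
updated = gcn_layer(adjacency, node_features, layer_weight)
```

After one step each node holds the average of its own feature and its neighbor's, which is how information spreads between semantic nodes.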

Vision transformer attention map generation
The transformer originates from research on natural language processing, and the central part of its structure is the multi-head self-attention module. The transformer relies on this module for global modeling, which enlarges the receptive field and enhances performance. The applications of the ViT model in computer vision demonstrate that the transformer can improve model performance in vision tasks. For this reason, some studies have tried to improve the performance of their methods in vision tasks using the transformer. Among them, TransUnet [20] and SETR [21] make the most audacious attempts to perform segmentation tasks using the transformer, but these approaches still essentially combine the CNN with the traditional underlying framework of the transformer. The Swin transformer [22], on the other hand, truly uses the transformer as the encoder of the backbone structure. Recently, some new transformer-based methods have been proposed for segmentation, such as Segmenter [23], and these studies took the first step towards understanding the image from a global perspective in the encoding stage. However, it is worth considering the high hardware requirements for obtaining the complete global information, and whether so many global semantic features are required for segmentation.
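The global modeling performed by self-attention can be illustrated with a single-head, scaled dot-product sketch. For brevity the learned query, key, and value projections are omitted and the tokens are used directly; this simplification, the function names, and the toy tokens are assumptions for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(value - peak) for value in values]
    total = sum(exps)
    return [value / total for value in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    dim = len(keys[0])
    weights = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[c] for w, value in zip(row, values))
         for c in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


# Toy example: two orthogonal tokens attending to each other.
tokens = [[1.0, 0.0], [0.0, 1.0]]
attended, attention = self_attention(tokens, tokens, tokens)
```

Every token receives a weighted mixture of all tokens, so each output depends on the whole sequence rather than on a local window.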

Methodology

Preliminaries
The encoder of the LGST framework is designed to progressively fuse graph-based local feature learning with the global semantic features of the transformer, as shown on the bottom right of Fig. 2. Specifically, LGST contains a transformer backbone network followed by a CNN-based structural graph layer. Note that the graph structure learning only supplies effective features; the CNN has no other effect on the transformer.

System architecture
As shown in Fig. 2, our semantic encoder mainly uses the transformer as the backbone. For the input image, a transformer variant, the data-efficient image transformer (DeiT) [24], is utilized for telematics preservation. The input image is denoted $I(I_i, c)$ with $I_i \in \mathbb{R}^{D \times W \times H}$, where $D$ represents the transformer embedding dimension, the picture of resolution $W \times H$ is divided into $N^2$ patches, and $c$ is the label class token category for classification, $c \in \{0, 1, \ldots, K - 1\}$. Then, each token patch is flattened and linearized. These token patches are fed into the transformer encoder blocks $T_{att}$, and the encoder includes a graph multi-head self-attention layer to produce the feature attention maps $G_{cam}$. Finally, a multilayer perceptron (MLP) block $\mathrm{mlp}(\cdot)$ is used to obtain the classification probability, which is defined in Eq. (1).

Fig. 2 The framework of the proposed LGST for WSSS
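The patch splitting and flattening step can be sketched in a few lines. The function name `patchify` and the toy image are hypothetical, and the subsequent linear projection to the embedding dimension is deliberately left out; this is an illustration of the tokenisation idea, not the DeiT implementation.

```python
def patchify(image, patches_per_side):
    """Split an H x W x C image into N x N non-overlapping patches and
    flatten each patch into one token vector.

    image: nested list indexed as image[y][x][channel].
    patches_per_side: N, so that N * N tokens are produced.
    """
    height, width = len(image), len(image[0])
    if height % patches_per_side or width % patches_per_side:
        raise ValueError("image size must be divisible by patches_per_side")
    patch_h = height // patches_per_side
    patch_w = width // patches_per_side
    tokens = []
    for patch_y in range(patches_per_side):
        for patch_x in range(patches_per_side):
            token = []
            for y in range(patch_y * patch_h, (patch_y + 1) * patch_h):
                for x in range(patch_x * patch_w, (patch_x + 1) * patch_w):
                    token.extend(image[y][x])
            tokens.append(token)
    return tokens


# Toy example: a 4 x 4 single-channel image whose pixel value is its index.
toy_image = [[[y * 4 + x] for x in range(4)] for y in range(4)]
patch_tokens = patchify(toy_image, patches_per_side=2)
```

With N = 2 the toy image yields four tokens of length four, ordered row by row; in a real model each token would then be linearly projected before entering the encoder blocks.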

Graph structure attention learning design
The proposed graph structure attention learning module can effectively address the loss of local feature details in the transformer network. In the construction process, to fully utilize the structural feature dependency of class tokens in the transformer blocks, the graph structure is employed to focus on the local class relevance, as described in Fig. 3. In this inference scheme, the graph structure attention can better discover the common semantic information present in different images. For the semantic image nodes, the semantic nodes that contain correlation $\delta$ are constructed as in Eq. (2). In Eq. (2), $\mathrm{ran}_{idx}(\cdot)$ constructs the semantic nodes by randomly retrieving the category information of different nodes. At this point, the graph node features satisfy the conditions in Eq. (3). The main purpose of the convolution operation on node features is to expand the field of view and enhance the representation of semantic information. After the node information is flattened, $f_n$ is taken as the center node, and the transpose operation is performed on the semantically related nodes $f_k$, as in Eq. (4). In Eq. (4), the $\mathrm{gan}(\cdot)$ function aims to enhance the class semantic information; to compensate for the resulting dilution of the feature information of $f_n$, more information is added, as shown in Eq. (5).
The process of enhancing the feature information is likely to fuse information from non-critical categories. To filter out this non-essential information, a special convolution function $\mathrm{Conv}(\cdot)$ is applied. Feature cross-fertilization can then be performed on the information-enhanced $f_n$, yielding richer semantics, denoted $\mathrm{Global}(f_n)$. Because of the randomness of $\mathrm{ran}_{idx}(\cdot)$ in Eq. (2), $\mathrm{Global}(f_n)$ undergoes different semantic associations at different training stages, and the dominant factor is $\delta$.
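As a rough illustration of the random node retrieval, the sketch below draws $\delta$ related nodes for a center node. It is only one plausible reading of $\mathrm{ran}_{idx}(\cdot)$: the criterion "shares at least one image-level class", the function name `sample_related_nodes`, and the toy labels are assumptions and are not taken from the paper's equations.

```python
import random


def sample_related_nodes(anchor, image_labels, delta, rng):
    """Randomly pick `delta` images that share a class with the anchor.

    anchor: index of the center node.
    image_labels: list of sets of image-level class ids, one per image.
    delta: number of related nodes to retrieve.
    rng: a random.Random instance, so that sampling is reproducible.
    """
    # Assumption: two nodes are related if their label sets overlap.
    candidates = [
        index
        for index, classes in enumerate(image_labels)
        if index != anchor and image_labels[anchor] & classes
    ]
    if len(candidates) < delta:
        raise ValueError("not enough semantically related nodes")
    return rng.sample(candidates, delta)


# Toy example: image-level labels of six images, three related nodes.
toy_labels = [{1}, {1, 7}, {3}, {1}, {7}, {1, 3}]
related = sample_related_nodes(0, toy_labels, delta=3, rng=random.Random(0))
```

Because the draw is random, a different group is associated with the same center node at each iteration, which matches the observation that the semantic associations change during training.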

Complementarity to refine CAM
The attention weights change continuously with the training iterations of Global( f n ), which indirectly reflects the semantic information.CAM can be further refined based on the fusion of weight information collected in different iterations.
Here, $W^*$ contains all the weight information collected at the different epoch iterations.
$f_{cam\_refine}$ is the refined CAM; $\mathrm{Mean}(W^*; H \times W)$ is the balance scale of semantic weights, which reshapes $f_n$ and further expands the region in which semantic information is generated.

Training to generate the pseudo labels
A more complete object region is obtained for the activation maps generated by the transformer network. A GAP layer is added at the end of the transformer to predict the class. The multi-label soft margin loss is used in the training phase. The probability that an arbitrary location belongs to a class is computed with the sigmoid function $\iota(\cdot)$, and the semantic segmentation class loss is calculated from these probabilities. It should be noted that, in the experiments, the best results were obtained when the final loss function combined $L_g$ and $L_t$, where $L_g$ is based on the graph structure attention learning module, and $L_t$ is the loss calculated using the transformer network.
Algorithm 1 describes the core pseudo code for all the above ideas. For a demonstration of the pseudo mask performance, see Sect. "Comparison with the state-of-the-art approaches".
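The multi-label soft margin loss can be sketched in its standard form, the mean binary cross-entropy on sigmoid outputs over all classes. The function names, the toy logits, and in particular the unweighted sum used to combine the two losses are assumptions; the paper does not state how $L_g$ and $L_t$ are weighted.

```python
import math


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def multilabel_soft_margin_loss(logits, labels):
    """Mean over classes of -[y log s(x) + (1 - y) log(1 - s(x))],
    where s is the sigmoid, log s(x) = -softplus(-x) and
    log(1 - s(x)) = -softplus(x)."""
    total = 0.0
    for logit, label in zip(logits, labels):
        total += label * softplus(-logit) + (1.0 - label) * softplus(logit)
    return total / len(logits)


# Toy example: two classes, the first one present in the image.
labels = [1.0, 0.0]
loss_uninformed = multilabel_soft_margin_loss([0.0, 0.0], labels)
loss_confident = multilabel_soft_margin_loss([4.0, -4.0], labels)

# Assumption: the two branches are combined by an unweighted sum.
loss_graph = multilabel_soft_margin_loss([2.0, -1.0], labels)
loss_transformer = multilabel_soft_margin_loss([1.0, -2.0], labels)
total_loss = loss_graph + loss_transformer
```

Zero logits give a loss of ln 2 per class, and the loss shrinks as the predictions become confident and correct.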

Experimental dataset and metrics
The experiments were carried out on the PASCAL VOC 2012 dataset, which contains 20 object classes and 1 background class. The images in PASCAL VOC 2012 are divided into three subsets: the training set, the validation set and the test set, which contain 1464, 1449 and 1456 images, respectively. Evaluation was conducted on these subsets using our WSSS method. In addition, the Semantic Boundaries Dataset (SBD) was used to augment the PASCAL VOC 2012 training set to 10,582 images; note that only the image-level annotations of these images were used.

Implementation details
In the LGST framework, the transformer is used as the backbone, while a pre-trained residual learning model [25] is utilized to obtain the convolutional feature maps. To enhance the generalization capability of the model, the training images were randomly cropped to 336 × 336. The SGD optimizer was used to update the model parameters during training. For the classification network, the initial learning rate was set to $1 \times 10^{-3}$, and the weight decay was 2 × 10. Finally, for the generated pseudo labels, the DeepLab-ASPP-S [26] model was used as the semantic segmentation model. In the segmentation model, all hyperparameters remained unchanged, except that the batch size was adjusted to 32, the output stride was set to 16, and the learning rate was 0.06. LGST was implemented using the PyTorch 1.17 framework on the Ubuntu 20.04 operating system. All models were trained on the following hardware: an Intel Core (TM) i9-12900KF CPU, 64 GB of RAM, and a single NVIDIA RTX 3090 GPU.

Metrics of performance evaluation
One of the most critical semantic segmentation evaluation metrics is the Intersection over Union (IoU), which is the ratio of the intersection to the union of the segmentation result of a particular class and the ground truth. When there are segmentation results for multiple classes, as in the VOC 2012 dataset used in this paper, the per-class results are summed and averaged, and the resulting mean Intersection over Union (mIoU) is used to evaluate the segmentation performance. In its definition, for the VOC 2012 dataset, the class index ranges over $c \in \{0, 1, \ldots, 20\}$, and $P$ represents the probabilities of true and false positives of the pixel points.
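The metric can be sketched from a pixel-level confusion matrix, where the IoU of a class is TP / (TP + FP + FN) and the mIoU is the average over classes. The function name `mean_iou`, the toy matrix, and the choice to skip classes that never occur are assumptions for illustration.

```python
def mean_iou(confusion):
    """Compute mIoU from a confusion matrix.

    confusion[t][p] is the number of pixels of true class t that were
    predicted as class p.
    """
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_positive = confusion[c][c]
        false_negative = sum(confusion[c]) - true_positive
        false_positive = (
            sum(confusion[row][c] for row in range(num_classes))
            - true_positive
        )
        union = true_positive + false_positive + false_negative
        # Assumption: classes absent from both prediction and ground
        # truth are left out of the average.
        if union > 0:
            ious.append(true_positive / union)
    return sum(ious) / len(ious)


# Toy example with two classes (for VOC 2012 the matrix would be 21 x 21).
toy_confusion = [
    [3, 1],
    [1, 5],
]
miou = mean_iou(toy_confusion)
```

In the toy case the per-class IoUs are 3/5 and 5/7, and the mIoU is their average.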

Comparison with the state-of-the-art approaches
Previous research [27] demonstrated that the CAM is the most effective means to address weakly supervised image-level labeling, as shown in Table 1. In our experiment, we further verified the performance of our method in generating the initialized CAM regions for pseudo label training. According to the results in Table 1, our method does not achieve the best performance in generating the initialized CAM; for example, it is still 0.5% behind the SEAM method. However, the graph structure shows clear advantages in the training phase, and the final pseudo mask improves by 7.4% over that generated by the SEAM method.
Figure 4 shows the pseudo mask generation results of our method and other GCN methods (DGCN [18], A²GNN [19], and WSGCN [16]). According to these results, our method achieves a 2.8% improvement over the current best GCN-based method, WSGCN, in generating pseudo masks.
Moreover, to compare the segmentation performance on different classes, our method is compared in more detail with other state-of-the-art WSSS methods developed in the past five years, which may benefit further study on the applications of the transformer model. The final results of the experiment are shown in Table 2.
In addition, the segmentation performances of the traditional CNN, GCN and Transformer methods developed in recent years are also compared in Table 3, so as to demonstrate the research significance of our method.

Graph node effect
For the patch tokens, we analyze the impact of the number of graph nodes on the semantic segmentation performance. Table 4 shows the segmentation performance of our method with different numbers of graph nodes. Different nodes are associated with different semantic information. The performance of our method was evaluated with the graph node number δ = 2, 3 or 4, and the segmentation performance was best when δ = 3.

Model complexity
In general, the model overhead of a transformer is much higher than that of a traditional CNN, and we use DeiT-S to mitigate this problem in our study. As shown in Table 5, our model is compared with three of the most popular semantic segmentation models: EDAM, TS-CAM and TransCAM. From the perspective of model overhead, our method achieves good results.

Qualitative results
Table 6 shows how the performance of the transformer model changes when the graph structure is added. The loss function further improves the performance of the model, so both components contribute to the performance of the proposed method.
Figure 5 shows the performance of our model under different conditions of multi-label and single-label classification and segmentation.It can be seen that the performance of our method in single-label classification and segmentation can generally meet the application requirements, but its performance in the multi-label classification is less satisfactory, which is where the present method needs to be improved.

Conclusions
This paper proposes a graph-structured LGST for WSSS, which is based on the transformer learning framework. Unlike any other previous transformer-based structures, our framework is able to capture more relevant semantic information and generate more accurate pseudo labels. In

Fig. 1 Semantic category relationship of graph

Fig. 5 Presentation of the qualitative change of the images obtained with our method, and visualization of the generated pseudo masks

Table 1
The pseudo mask segmentation results on the VOC 2012 training images

Table 2
Semantic segmentation performance on the PASCAL VOC 2012 validation set. The best three results are highlighted in red, blue and green
Methods Pub bkg Plane Bike Bird Boat Bottle Bus Car Cat Chair Cow Table Dog Horse Motor Person Plant Sheep Sofa Train Tv

Table 3
Comparison with other state-of-the-art methods on the PASCAL VOC 2012 validation and test datasets. The performances are evaluated based on mIoU (%)

Table 4
Effect of the number of graph nodes on the performance of our model

Table 6
Ablation study of different modulation components