A stereo spatial decoupling network for medical image classification

Deep convolutional neural networks (CNNs) have made great progress in medical image classification. However, they struggle to establish effective spatial associations and tend to extract similar low-level features, resulting in information redundancy. To address these limitations, we propose a stereo spatial decoupling network (TSDNets), which leverages the multi-dimensional spatial details of medical images. We use an attention mechanism to progressively extract the most discriminative features from three directions: horizontal, vertical, and depth. Moreover, a cross feature screening strategy is used to divide the original feature maps into three levels: important, secondary, and redundant. Specifically, we design a cross feature screening module (CFSM) and a semantic guided decoupling module (SGDM) to model multi-dimensional spatial relationships, thereby enhancing the feature representation capability. Extensive experiments on multiple open-source benchmark datasets demonstrate that our TSDNets outperforms previous state-of-the-art models.


Introduction
Deep learning has long been a top research priority in the field of medical image analysis, as it effectively relieves the pressure on medical experts. In recent years, convolutional neural networks (CNNs) have been widely used in many real-world medical image analysis scenarios, such as skin cancer image classification and X-ray classification, owing to their outstanding performance. Recently, several CNN variants (e.g., DCNN, ResNet, DenseNet, and multi-scale CNN [10,18,25]) have attracted increasing attention for capturing and exploiting high-level discriminative image features. To better optimize the learning ability of the network, Khan et al. [9] fused the features of AlexNet and VGG16 in parallel and optimized them simultaneously to obtain the optimal features. To better detect the focus area of breast cancer, Irfan et al. [8] used DenseNet201 as the base model, fused 24 sets of convolutional feature vectors to obtain diverse transferred semantic information, and evaluated the proposed method through 10-fold cross validation. The accuracy of their algorithm on breast cancer reached 98.9%, which demonstrates the feasibility of feature fusion. Although these methods have powerful feature representation capabilities, their efficiency and prediction accuracy are hindered by the large number of redundant features generated during model learning, and they are usually over-parameterized and computationally expensive. Figure 1 shows the parameters and running time of some representative models. Furthermore, the attention mechanism has become an important research hotspot for medical image segmentation tasks, and the concept of feature screening also plays a positive role in other fields [7,29].
Many existing studies [26,30] have shown that the attention mechanism can effectively enhance the representational ability of key features in the feature map by ignoring irrelevant redundant information. However, in medical images, a specific disease object often appears in different imaging directions. When the pixel intensity of similar target objects changes only slightly, it is difficult for an attention mechanism operating in a single direction to distinguish the differences between pixels. Therefore, it is necessary to explore the image from different directions. Some studies [4,6,16] use attention to automatically extract the appropriate features from two directions, thereby better capturing the dependencies between features. However, they ignore the problem of redundant features, which may prevent the model from learning useful information and increase the difficulty of model learning.
In previous studies [21], shallow features are directly introduced into the deep features. Although introducing shallow features can bring more semantic information to the deep features, introducing too much weakly correlated information tends to reduce the quality of the deep feature maps. Therefore, a reasonable selection of feature points can not only improve the sparsity of the feature maps but also avoid feature reuse and improve model efficiency.
In this paper, we propose a stereo spatial decoupling network (TSDNets) for medical image classification that utilizes three sets of attention mechanisms to assign three sets of weights to each feature point. Moreover, we adopt a feature screening strategy to suppress redundant features and build gating strategies to improve the quality of features. Our main contributions are summarized as follows.
- We propose a stereo spatial decoupling network (TSDNets) to explore the spatial guidance relationship of the object from the horizontal, vertical, and depth directions of the medical image. Screening features from multiple perspectives, this attention is more accurate than a traditional one-way attention mechanism. At the same time, the attention mechanism here acts as a feature filter rather than performing direct feature fusion, so it involves fewer parameter operations than traditional attention mechanisms [27,33].
- We develop a cross-feature screening module (CFSM) that uses a dual-gate threshold screening strategy to generate three types of features, namely important, secondary, and redundant features, and targets them for deep feature fusion.
- We construct a semantic guided decoupling module (SGDM), which implements feature selection by setting different gate thresholds for shallow features and deep features respectively, thereby extracting more discriminative features.

Related work
In recent years, deep learning has received significant research attention from academia and industry. A widely used method in medical image classification is the convolutional neural network (CNN) [2,12]. However, traditional medical image datasets are usually small and complex, especially when specific targets or regions in the image are similar to the surrounding background. Directly applying a traditional CNN to medical image classification may therefore fail to establish effective spatial associations between features or similar images. To improve feature representation, various solutions [4,6,21] have recently been proposed to force CNNs to focus on specific regions and to rely on domain knowledge to obtain information [12,20]. For example, in [17,32], the authors obtain important relevant area information according to the gray values of the medical image. However, in most cases domain knowledge is limited (only suitable for specific tasks). Moreover, to describe the targets at multiple levels with rich feature representations, [12,20] designed multi-scale feature extractors that describe the classification objects from multiple aspects using convolution kernels of different scales, and then used a simple linear fusion method to represent the features. Although these methods demonstrate the effectiveness of multi-scale features in medical image classification, the extracted redundant information can easily weaken the expression of key features when multi-scale information flows are transferred between layers.
To reduce irrelevant redundant information, the attention mechanism has been widely used in medical image classification, as it can suppress redundant features through the weights it generates. There have been several attempts to improve the feature representation ability of models using local or global attention mechanisms [6,13,26,30,35], which allow the network to focus on the most important features or areas, thereby suppressing redundant features. However, the disease regions in a medical image are often similar to the surrounding background, which makes it difficult for a single-direction attention mechanism to describe the key regions and establish effective spatial relationships. To address this limitation, various attention-based methods [6,16] have been proposed to locate key features and better establish the relationships between features. Although these methods [24,28] can improve the quality of feature maps, they ignore the relationship between spatial semantic information and feature-map semantic information, and lack an in-depth analysis of redundant information. In this paper, we reconstruct the connections between redundant features to assist the establishment of important semantics and enhance spatial semantic information.
In medical image analysis, introducing shallow features into high-order feature maps is an important fusion method, since shallow features can help higher-order features capture finer semantic information. However, most existing studies ignore the influence of redundant features on key features, which weakens the representation ability of the key features. Feature selection methods [15,19] have been widely applied in many areas and can effectively improve model performance. Some simple linear or filtering methods can also be used to filter features [3,11,15], such as pointwise mutual information (PMI), PCA, and Gaussian filtering; these methods are mainly applied in image pre-processing. In addition, Dropout [31,34] and regularization techniques have been used to prevent the neural network from falling into a local optimum. Moreover, although attention mechanisms can suppress redundant features during feature selection, most attention-based methods do not directly discard these feature points, which weakens the representation ability of key features. In this paper, we propose a feature selection scheme based on spatial decoupling and a cross-feature screening strategy to minimize the influence of redundant information on key features.

TSDNets
The proposed TSDNets primarily consists of a cross-feature screening module (CFSM) and a semantic guided decoupling module (SGDM). TSDNets first captures the shallow features of the objects through convolutional operations, and then uses three different attention mechanisms to decompose the shallow features into three corresponding weight matrices a_H, a_V, and a_D along the horizontal, vertical, and depth directions. Three new matrices a_HV, a_HD, and a_VD are generated by combining the three weight matrices a_H, a_V, and a_D in pairs. Based on the dual-gating threshold, CFSM divides each of the weight matrices a_HV, a_HD, and a_VD into three levels: important, secondary, and redundant. Then, three new feature matrices (f_1, f_2, and f_3) are obtained by fusing the feature information of the corresponding levels. Meanwhile, the SGDM extracts the optimal shallow global semantic features f_cg from f_1, as well as the optimal deep local semantic features f_cg from the secondary features f_2. Considering that redundant features contain only a small amount of useful semantic information, we compress the redundant features into a one-dimensional dense vector to reconstruct the relationships between them. The overall flow of the proposed algorithm is shown in Fig. 2.
In this part, we propose a stereo decoupling attention mechanism, which extracts the key features by assigning a weight to each feature and decomposes the shallow features along the three directions of horizontal, vertical, and depth. As a result, we obtain the horizontal attention weight matrix a_H, the vertical attention weight matrix a_V, and the depth attention weight matrix a_D. The horizontal attention weight matrix a_H can be described by

a_i = exp(e_{i,j}) / Σ_j exp(e_{i,j}),

where e_{i,j} is the weight coefficient assigned by the attention mechanism, h_j represents the hidden layer, and a_i represents the horizontal attention weight coefficient of the i-th feature. Similarly, we can obtain a_V and a_D.
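The directional decomposition above can be sketched as follows. This is a minimal illustration under assumptions: the scoring function for e_{i,j} is a simple mean pooling of the feature map (the text does not fully specify it), and a softmax produces the normalized weights along each direction.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def stereo_attention_weights(f):
    """Decompose a feature map f of shape (H, W, C) into three
    attention weight vectors along the horizontal, vertical and
    depth directions. Mean pooling as the score e_{i,j} is an
    assumption for illustration."""
    a_H = softmax(f.mean(axis=(0, 2)), axis=0)  # horizontal, shape (W,)
    a_V = softmax(f.mean(axis=(1, 2)), axis=0)  # vertical,   shape (H,)
    a_D = softmax(f.mean(axis=(0, 1)), axis=0)  # depth,      shape (C,)
    return a_H, a_V, a_D

f = np.random.rand(5, 5, 8)
a_H, a_V, a_D = stereo_attention_weights(f)
```

Each vector sums to one along its own direction, so the three views can later be broadcast and combined in pairs as described for CFSM.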

Cross feature screening module (CFSM)
The cross-feature screening module (CFSM) is mainly divided into two parts: the dual-gate threshold screening and the feature aggregation. Here f_x ∈ R^{H×W×C} denotes the initial features in the input layer, where H, W, and C represent the height, width, and channel, respectively. In the dual-gate threshold screening, the three weight matrices a_H, a_V, and a_D are divided into three levels (important, secondary, and redundant) according to two thresholds T_1 and T_2. First, we obtain three new weight matrices a_HV, a_HD, and a_VD by reconstructing a_H, a_V, and a_D in pairs. Then, we average these matrices in pairs to obtain three mean matrices, (a_HV + a_HD)/2, (a_HD + a_VD)/2, and (a_VD + a_HV)/2, and compare each weight value in the three mean matrices with T_1 and T_2. Grading the feature points on the mean of any two groups of weights makes the grading more stable. If the value is greater than T_1, the feature i contains important semantic information; analogously, if the value is less than T_2, the feature i contains only redundant information. Next, we divide the weight matrix a_HV into three 0-1 matrices (where 0 means eliminating the feature i and 1 means retaining it), namely the important matrix â_HV1, the secondary matrix â_HV2, and the redundant matrix â_HV3.
Analogously, we obtain the corresponding reconstructed 0-1 matrices â_HD1, â_HD2, â_HD3, â_VD1, â_VD2, and â_VD3. Then, we multiply the feature matrix by each of the three sets of 0-1 matrices to obtain three new sets of feature matrices; for a_HD and a_VD, the corresponding features are obtained in the same way and concatenated, where Cat represents the concatenation operation.
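The dual-gate screening can be sketched as below, assuming the two thresholds act elementwise on a pairwise-averaged weight map; the symbol names are illustrative.

```python
import numpy as np

def dual_gate_screen(a_mean, t1, t2):
    """Split a mean weight map into important / secondary / redundant
    0-1 masks: value > t1 -> important, value < t2 -> redundant,
    everything in between -> secondary."""
    important = (a_mean > t1).astype(np.float32)
    redundant = (a_mean < t2).astype(np.float32)
    secondary = 1.0 - important - redundant
    return important, secondary, redundant

# toy pairwise-averaged weight map, e.g. (a_HV + a_HD) / 2
a_mean = np.array([[0.9, 0.4],
                   [0.2, 0.6]])
imp, sec, red = dual_gate_screen(a_mean, t1=0.5, t2=0.3)

# masking a toy feature map with the three 0-1 matrices
f_x = np.arange(4, dtype=np.float32).reshape(2, 2)
f1, f2, f3 = imp * f_x, sec * f_x, red * f_x
```

By construction the three masks partition the map, so each feature point lands in exactly one of the three levels.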

Semantic guided decoupling module(SGDM)
To improve the quality of features, we develop a semantic guided decoupling module (SGDM), which introduces high-quality shallow features into the deep feature fusion. By taking the square root of the sum of squares of the weight matrices in each direction (a_H, a_V, and a_D), we obtain a new set of weight matrices S_HVD for the shallow global semantic features and a set of weight matrices D_HVD for the deep local semantic features. The intuition is that when the weight of a feature point in any direction is greater than the set threshold, the point is more discriminative, so the model retains richer semantic information.
Then, SGDM performs feature screening on the shallow global features f_x and the deep local features f_y. For the shallow global features f_x (as shown in Fig. 2), we compare each weight value in S_HVD with the threshold T_1 to generate the corresponding shallow global 0-1 matrix. Similarly, for the deep local features f_y (as shown in Fig. 2), the corresponding deep local 0-1 matrix is generated by comparing D_HVD with the threshold T_3. The feature screening itself is the same as in Sect. "Cross feature screening module (CFSM)".
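The SGDM gating can be sketched as follows. The L2 combination of the three directional weight maps follows the text; broadcasting all three maps onto one common grid and applying both thresholds to the same combined map are simplifying assumptions made here for illustration.

```python
import numpy as np

def sgdm_masks(a_H, a_V, a_D, t1, t3):
    """Combine directional weights as sqrt(a_H^2 + a_V^2 + a_D^2),
    then gate with t1 (shallow global) and t3 (deep local)."""
    m = np.sqrt(a_H**2 + a_V**2 + a_D**2)
    s_mask = (m > t1).astype(np.float32)  # shallow global 0-1 matrix
    d_mask = (m > t3).astype(np.float32)  # deep local 0-1 matrix
    return s_mask, d_mask

# toy (2, 2) directional weight maps
a_H = np.array([[0.3, 0.1], [0.4, 0.0]])
a_V = np.array([[0.4, 0.1], [0.3, 0.1]])
a_D = np.array([[0.0, 0.1], [0.0, 0.1]])
s_mask, d_mask = sgdm_masks(a_H, a_V, a_D, t1=0.4, t3=0.15)
```

A point with a strong weight in any single direction survives the gate even if the other directions are weak, which is the stated purpose of the square-root combination.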
Finally, a set of optimal shallow global semantic features f_cg and a set of optimal deep local semantic features f_cg are generated by applying ϕ(·), the matrix multiplication operation, between the 0-1 matrices and the corresponding feature maps.

Feature fusion
Feature fusion can improve the accuracy of medical image classification. To improve the representation ability of the redundant features f_3, we project the two-dimensional features f_3 into a one-dimensional vector space to reconstruct the feature relationships. Then, the shallow global features and the important features are fused to obtain the fused features f_m. After that, we fuse f_cg, f_m, and the high-order features to obtain the deep features, from which we obtain the final output feature O.
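The fusion steps can be sketched as below. The additive fusion and the exact way the flattened redundant vector joins the output are assumptions, since the fusion equations themselves appear only in Fig. 2.

```python
import numpy as np

H, W, C = 5, 5, 8
f1   = np.random.rand(H, W, C)   # important features
f_cg = np.random.rand(H, W, C)   # optimal shallow global semantic features
f3   = np.random.rand(H, W, C)   # redundant features

# project the redundant features into a one-dimensional dense vector
f3_vec = f3.reshape(-1)

# fuse shallow global features with important features (additive, assumed)
f_m = f_cg + f1

# final output feature: concatenate the fused map with the dense vector
O = np.concatenate([f_m.reshape(-1), f3_vec])
```

The flattened redundant vector keeps only coarse information about f_3, which matches the observation that redundant features carry little, but not zero, useful semantics.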
Finally, the SoftMax classifier is used to output the classification probability.

Feature visualization for CFSM and SGDM
In this part, we validate the correctness of the theoretical deduction and draw some important conclusions through a case study and visualization analysis. Figure 3 shows the visual analysis of 5 × 5 deep features with their weight coefficients, where the color denotes the weight in the attention matrix: a red cell indicates that the feature point contains more semantic information and carries a greater weight, while a blue cell indicates a smaller weight. Figure 3a is the original image; Fig. 3b shows the deep features; Fig. 3c-e show the weight coefficients generated by the vertical, horizontal, and depth attention, respectively; Fig. 3f shows the weight coefficients generated by the horizontal and vertical attention in the CFSM module; Fig. 3g shows those generated by the horizontal and depth attention in the CFSM module; Fig. 3h shows those generated by the vertical and depth attention in the CFSM module; and Fig. 3i shows the weight coefficients generated by all three attention mechanisms in the SGDM module. As can be seen from Fig. 3c-e, using only one attention mechanism yields the worst weight assignment. As shown in Fig. 3f-h, using two attention mechanisms in the CFSM module clearly produces an information gain. In Fig. 3i, with three attention mechanisms in the SGDM module, the generated features contain the most semantic information. This shows that the proposed TSDNets is effective at guiding the attention weights to select useful features.

Datasets
We verify TSDNets on three datasets, which are briefly described in the following paragraphs.
Shenzhen dataset (Chest X-ray database) [5]. This dataset was constructed by the National Library of Medicine, Maryland, USA, and the Third People's Hospital of Shenzhen, China. It consists of 662 chest X-ray images, including 336 tuberculosis images (TB) and 326 normal images (Nor). The size of each image is about 3000 × 2900∼3000 pixels.

COVID-19 radiography database [1]. This dataset was jointly produced by researchers from Qatar University and Dhaka University. There are only 219 COVID-19 images in the original dataset; to balance the dataset, the number of COVID-19 images was raised to 1200 through later supplementation. The data consist of three types of chest X-ray images: viral pneumonia (VP, 1345 images), normal (Nor, 1341 images), and COVID (1200 images). The size of each image is 1024 × 1024.

ISIC2018 dataset [22]. This dataset is the largest public dataset of skin diseases, consisting of 10,015 skin disease images acquired with different types of imaging equipment. For training and testing, we split the data into three distinct subsets with a ratio of 7:1:2, i.e., the training set, the validation set, and the test set. All medical images are resized to 256 × 256. We train our model using the Keras deep learning framework on a desktop machine (Tesla V100 with 16 GB of memory). We set the learning rate to 0.0001 and use Adam as the optimizer.
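The 7:1:2 split described above can be sketched as follows; the shuffling seed and the index-based split are illustrative choices, not taken from the paper.

```python
import numpy as np

def split_7_1_2(n_samples, seed=0):
    """Shuffle sample indices and split them 70/10/20 into
    train / validation / test, matching the experimental ratio."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(n_samples * 0.7)
    n_val = int(n_samples * 0.1)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# e.g. the 10,015 images of ISIC2018
tr, va, te = split_7_1_2(10015)
```

With integer truncation the test set absorbs the remainder, so every sample lands in exactly one subset.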

Evaluation metrics
We evaluate each model and perform various ablation experiments using the average accuracy (AA), overall accuracy (OA) and Kappa coefficient (Kappa) metrics.
OA = (n_1 + n_2 + ... + n_S) / (m_1 + m_2 + ... + m_S),    (12)

where n_i represents the number of correctly classified samples of the i-th class, m_i represents the total number of samples of the i-th class, and S represents the number of classes. Figure 4 shows the confusion matrix of TSDNets on the three datasets, COVID-19, ChinaSet, and ISIC, where "TL" represents the ground-truth labels and "PL" represents the predictions.
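Given a confusion matrix, the three metrics can be computed as sketched below: OA matches Eq. (12), AA averages the per-class accuracies n_i/m_i, and Kappa is the standard Cohen's kappa.

```python
import numpy as np

def overall_accuracy(conf):
    """OA: correctly classified samples over all samples (Eq. 12)."""
    return np.trace(conf) / conf.sum()

def average_accuracy(conf):
    """AA: mean of the per-class accuracies n_i / m_i (rows = truth)."""
    return np.mean(np.diag(conf) / conf.sum(axis=1))

def kappa(conf):
    """Cohen's kappa: agreement corrected for chance agreement."""
    total = conf.sum()
    po = np.trace(conf) / total                                   # observed
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2   # expected
    return (po - pe) / (1 - pe)

# toy 2-class confusion matrix (rows: true class, columns: predicted)
conf = np.array([[45, 5],
                 [10, 40]])
```

On this toy matrix OA = 0.85, AA = 0.85, and Kappa = 0.70.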

Comparison to other state-of-the-art methods
For a fair comparison, we compare the proposed TSDNets classification framework with other state-of-the-art medical image classification models. Table 1 lists the experimental results on the different datasets. The proposed TSDNets achieves the best performance on all three benchmark datasets. Compared with SRC-MT, the Kappa value on the three datasets is increased by 4.54%, 0.02%, and 2.13%, respectively. This may be because the proposed TSDNets can enrich the shallow global semantics of the target objects through the semantic guided decoupling module (SGDM), thereby providing effective prior information for feature extraction. Moreover, CFSM can capture more favorable local features, making the network pay more attention to subtle changes in the objects. Overall, the proposed TSDNets is effective for medical image classification.
Due to the large number of categories in ISIC and the small differences between the target categories, all the baseline methods achieve poor classification performance on the ISIC dataset. For example, the Kappa values of AlexNet and ReLSNet are only 0.5835 and 0.5878, which are 1.7% and 1.26% lower than TSDNets, respectively. Transformer [14] and Swin Transformer [23] do not work well on the small ChinaSet dataset, but their performance improves significantly as the number of samples increases. Our proposed TSDNets is better suited to all the datasets.
When the category and data size are small (i.e., the ChinaSet dataset has 662 images in 2 categories), the proposed TSDNets gives the best performance under all metrics. This further demonstrates that TSDNets has good robustness and generalization. Specifically, Fig. 4 shows the confusion matrix of TSDNets on the three datasets.

Effects of features
In this experiment, the selection of features has an important influence on the results. We verified the impact of the different features f_cg, f_1, f_2, and f_3 (see Section III for details) on the classification performance, as shown in Table 2. We can see that Model_no f_cg is closest to our full method in all metrics on the three benchmark datasets. The Kappa of Model_no f_cg is 9.08%, 3.02%, and 7.56% higher than that of the other three variants Model_no f_1, Model_no f_2, and Model_no f_3, respectively. This shows that the shallow global features f_cg bring a smaller advantage than the other features. We also observe that Model_no f_3 performs worse than the other models. This shows that although the redundant features f_3 contain only a small amount of useful semantic information, they still contribute to classification; that is, redundant features in a specific dimensional space can improve the classification performance.

Effects of gated thresholds
To verify the effectiveness of the gated threshold screening strategy, we conducted a large number of experiments. Table 3 shows the results of the proposed TSDNets under different gated thresholds. On the COVID-19 dataset, when T_2 and T_3 are fixed, the classification performance first increases and then gradually decreases as the gated threshold T_1 increases. The performance shows the same trend with increasing T_2 when T_1 and T_3 are fixed. The reason may be that when the gating threshold is low, irrelevant redundant information cannot be effectively filtered, which reduces the overall classification performance. When the gating threshold rises to a certain peak value, it effectively filters irrelevant redundant information while preserving the spatial semantic details. However, when the gating threshold continues to increase, some useful features may be filtered out, resulting in insufficient feature representation. From Table 3, we can see that the proposed TSDNets achieves the best performance when T_1 = 0.5, T_2 = 0.3, and T_3 = 0.5.
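The threshold sweep behind Table 3 can be sketched as a simple grid search. Here `evaluate` is a stand-in for training and validating the model at a given (T_1, T_2, T_3), and the constraint T_2 < T_1 reflects that the redundancy gate must sit below the importance gate; the toy scoring function is purely illustrative.

```python
import itertools

def grid_search(evaluate, grid=(0.1, 0.3, 0.5, 0.7)):
    """Score every (T1, T2, T3) combination with T2 < T1 and
    return the best thresholds with their score."""
    best, best_score = None, -1.0
    for t1, t2, t3 in itertools.product(grid, repeat=3):
        if t2 >= t1:  # redundancy gate must be below the importance gate
            continue
        score = evaluate(t1, t2, t3)
        if score > best_score:
            best, best_score = (t1, t2, t3), score
    return best, best_score

# toy score peaking at (0.5, 0.3, 0.5), mimicking the reported optimum
toy = lambda t1, t2, t3: 1.0 - abs(t1 - 0.5) - abs(t2 - 0.3) - abs(t3 - 0.5)
best, score = grid_search(toy)
```

On the toy score this recovers (0.5, 0.3, 0.5), the same optimum reported in Table 3.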

Conclusion
This paper proposes a stereo spatial decoupling network (TSDNets) for medical image classification. We use the semantic guided decoupling module (SGDM) to obtain effective shallow global features, which provide favorable prior information for feature representation. Moreover, we use the cross-feature screening module (CFSM) with a dual-gate threshold strategy to enhance the interaction between features, which further improves the feature representation. Finally, we evaluate the proposed TSDNets on three benchmark datasets. The experimental results show that our method achieves new state-of-the-art classification performance, with significant improvements over existing approaches, and extracts the semantics of spatial details with high efficiency. In the future, we will consider reducing the number of model parameters to obtain a more concise and efficient spatial decoupling network. We also plan to refine the feature screening, which is currently coarse, by grading features more finely according to their importance: useful features, general features, and redundant features.

Data availability
The data used to support the findings of this study are available from the corresponding author upon request.

Declarations
Conflict of interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.