LET-Net: locally enhanced transformer network for medical image segmentation

Medical image segmentation has attracted increasing attention due to its practical clinical requirements. However, the prevalence of small targets still poses great challenges for accurate segmentation. In this paper, we propose a novel locally enhanced transformer network (LET-Net) that combines the strengths of transformer and convolution to address this issue. LET-Net utilizes a pyramid vision transformer as its encoder and is further equipped with two novel modules to learn more powerful feature representation. Specifically, we design a feature-aligned local enhancement module, which encourages discriminative local feature learning on the condition of adjacent-level feature alignment. Moreover, to effectively recover high-resolution spatial information, we apply a newly designed progressive local-induced decoder. This decoder contains three cascaded local reconstruction and refinement modules that dynamically guide the upsampling of high-level features by their adaptive reconstruction kernels and further enhance feature representation through a split-attention mechanism. Additionally, to address the severe pixel imbalance for small targets, we design a mutual information loss that maximizes task-relevant information while suppressing task-irrelevant noise. Experimental results demonstrate that our LET-Net provides more effective support for small target segmentation and achieves state-of-the-art performance in polyp and breast lesion segmentation tasks.


Introduction
Multimodal medical image segmentation aims to accurately identify and annotate regions of interest from images produced by various medical devices, such as segmenting polyps from colonoscopy images [1], breast lesions from ultrasound images [2], and focal cortical dysplasia lesions from magnetic resonance images [3]. It has become an essential procedure for computer-aided diagnosis [4], assisting clinicians in making accurate diagnoses, planning surgical procedures, and proposing treatment strategies. Hence, the development of automatic, accurate, and robust medical image segmentation methods is of great value to clinical practice.
However, medical image segmentation still encounters several challenges, one of which is the prevalence of small lesions. Figure 1 illustrates small lesion samples and size distribution histograms for several benchmarks, where the ratio of lesion area to the whole image is concentrated in a small range: the 0-0.1 interval is the most frequent, followed by 0.1-0.2. Specifically, the vast majority of polyps and breast lesions occupy only a small proportion of the entire medical image. Meanwhile, some small lesions, e.g., early-stage polyps, exhibit an inconspicuous appearance. These small targets inevitably pose great difficulties for accurate segmentation for several reasons. First, small targets are prone to being lost during repeated downsampling operations and are hard to recover. Second, there is a significant class imbalance in the number of pixels between foreground and background, leading to a biased network and suboptimal performance. Nevertheless, the ability of computer-aided diagnosis to identify small objects is highly desired, as early detection and diagnosis of small lesions are crucial for successful cancer prevention and treatment.
Nowadays, the development of medical image segmentation has greatly advanced owing to the efficient feature extraction ability of convolutional neural networks (CNNs) [5][6][7]. Modern CNN-based methods typically utilize a U-shaped encoder-decoder structure, where the encoder extracts semantic information and the decoder restores resolution to facilitate segmentation. Additionally, skip connections are employed to compensate for lost detail. Advanced U-shaped works focus on designing novel encoding blocks [8][9][10] to enhance feature representation, adopting attention mechanisms to further recalibrate features [11, 12], extracting and fusing reasonable multi-scale context information to improve accuracy [13][14][15], and so on. Despite their promising performance, these methods share a common flaw: they lack the global contexts essential for better recognition of target objects.
Due to their superior ability to model global contexts, Transformer-based architectures have become popular in segmentation tasks and achieve promising performance. Recent works [16][17][18] utilize vision transformers (ViT) as a backbone to incorporate global information. Despite its good performance, ViT produces single-scale low-resolution features and incurs a very high computational cost, which hampers its use in dense prediction. In contrast to ViT, the pyramid vision transformer (PVT) [19] inherits the advantages of both CNN and Transformer and produces hierarchical multi-scale features that are more favorable for segmentation. Unfortunately, Transformer-based methods sacrifice part of the local features when modeling global contexts, which may result in imprecise predictions for small objects.
In the field of small target segmentation, a number of approaches have been devised to improve sensitivity to small objects. They tackle the difficulties brought by small objects from multiple aspects, such as exploiting the complementarity between low-level spatial details and high-level semantics [20], multi-scale feature learning [21, 22], and spatial dimension augmentation strategies [23][24][25]. Although their skip connections can compensate for detail loss to some extent, and can even suppress some irrelevant noise when additionally equipped with attention mechanisms, these methods remain insufficient, as some local contexts may be overwhelmed by dominant semantics due to feature misalignment. In addition, another important but overlooked factor is how to effectively restore the spatial information of downsampled features. Most methods adopt common upsampling operations, such as nearest-neighbor and bilinear interpolation, which lack the local spatial awareness needed to handle small object positions. As a result, they are not well suited to recovering target objects and produce suboptimal segmentation performance.
In this paper, we propose a novel locally enhanced transformer network (LET-Net) for medical image segmentation. By leveraging the merits of both Transformer and CNN, LET-Net can accurately segment small objects and precisely sharpen local details. First, the PVT-based encoder produces hierarchical multi-scale features, where low-level features tend to retain local details while high-level features provide strong global representations. Second, to further emphasize detailed local contexts, we propose a feature-aligned local enhancement (FLE) module, which learns discriminative local cues from adjacent-level features under the condition of feature alignment and then utilizes a local enhancement block equipped with local receptive fields to further recalibrate features. Third, we design a progressive local-induced decoder that contains cascaded local reconstruction and refinement (LRR) modules to achieve effective spatial recovery of high-level features under the adaptive guidance of reconstruction kernels and the optimization of a split-attention mechanism. Moreover, to alleviate the class imbalance between foreground and background, we design a mutual information loss based on an information-theoretic objective, which imposes task-relevant restrictions while reducing task-irrelevant noise.
The contributions of this paper mainly include: (1) We propose LET-Net, a novel locally enhanced transformer network that combines the strengths of Transformer and CNN for medical image segmentation. (2) We design a feature-aligned local enhancement (FLE) module that learns discriminative local features under the condition of adjacent-level feature alignment. (3) We develop a progressive local-induced decoder with cascaded local reconstruction and refinement (LRR) modules for adaptive spatial recovery, together with a mutual information loss that alleviates the foreground-background pixel imbalance of small targets. (4) Extensive experiments on polyp and breast lesion segmentation demonstrate that LET-Net achieves state-of-the-art performance, especially for small targets.

Related work

Medical image segmentation
With the great development of deep learning, especially convolutional neural networks (CNNs), various CNN-based methods, such as U-Net [7], have significantly improved the performance of medical image segmentation. These approaches adopt the popular U-shaped encoder-decoder structure. To further assist precise segmentation, a battery of innovative improvements based on the encoder-decoder architecture has emerged [26][27][28][29][30]. One direction is to design new modules that enhance the encoder or decoder. For instance, Dai et al. [26] designed the Ms RED network, which employs a multi-scale residual encoding fusion module (MsR-EFM) in the encoder and a multi-scale residual decoding fusion module (MsR-DFM) in the decoder to improve skin lesion segmentation. In the work [27], a selective receptive field module (SRFM) was designed to obtain suitable sizes of receptive fields, thereby boosting breast mass segmentation. Another direction is optimizing skip connections to facilitate the recovery of spatial information. UNeXt [28] proposed an encoder-decoder structure involving convolutional stages and tokenized MLP stages, achieving better segmentation performance while also improving inference speed. However, these methods directly fuse unaligned features from different levels, which may hamper accuracy, especially for small objects. In this paper, we propose a powerful feature-aligned local enhancement module, which ensures that feature maps at adjacent levels are well aligned and then explores substantial local cues to enhance discriminative details.

Feature alignment
Feature alignment has drawn much attention and is now an active research topic in computer vision. Numerous researchers have devoted considerable effort to addressing this challenge [6, 31-37]. For instance, SegNet [6] utilized the max-pooling indices computed in the encoder to perform upsampling in the corresponding decoder stage. Mazzini et al. [32] proposed a guided upsampling module (GUM) that generates learnable guidance offsets to enhance the upsampling operation. IndexNet [33] built a novel index-guided encoder-decoder structure in which pooling and upsampling operators are guided by self-learned indices. AlignSeg [34] learned 2D transformation offsets through a simple learnable interpolation strategy to alleviate feature misalignment. Huang et al. [35] designed the FaPN framework, consisting of feature alignment and feature selection modules, achieving substantial and consistent improvements on dense prediction tasks. SFNet [31] presented a flow alignment module that effectively broadcasts high-level semantic features to high-resolution detail features via its semantic flow. Our method shares with [31] the idea of achieving efficient spatial alignment by learning offsets. However, unlike these methods, we further enhance discriminative representations by subtraction once low-resolution and high-resolution features are aligned, which facilitates excavating imperceptible local cues related to small objects.

Attention mechanism
Attention-based algorithms have been developed to assist segmentation. In general, attention mechanisms can be categorized into channel attention, spatial attention, and self-attention according to their focus. Inspired by the success of SENet [38], various networks [39][40][41] have incorporated the squeeze-and-excitation (SE) module to recalibrate features by modeling channel relationships, thereby improving segmentation performance. K. Wang et al. [42] proposed a dual attention network (DANet), which combines a position (spatial) attention module and a channel attention module to capture rich contexts. Additionally, Transformer networks based on self-attention have become popular in medical image segmentation [43][44][45][46][47][48]. For instance, TransUnet [43] inserted Transformer layers between CNN-based encoder and decoder stages to model global contexts, achieving excellent performance in multi-organ and cardiac segmentation. Wu et al. [48] proposed FAT-Net, whose dual encoder is based on CNNs and Transformers, respectively, for skin lesion segmentation. However, the loss of local contexts may still hinder the prediction accuracy of Transformer-based methods. In this paper, we propose a feature-aligned local enhancement module and a progressive local-induced decoder, which respectively emphasize local information and adaptively recover spatial information to improve predictions.

PVT-based encoder
Although CNN-based methods have achieved great success in medical image segmentation, they have inherent limitations in modeling global contexts. In contrast, the pyramid vision transformer (PVT) [19] inherits the advantages of both Transformer and CNN and proves more effective for segmentation. Thus, we choose PVT as the backbone to obtain global receptive fields and learn effective multi-scale features.
As shown in Fig. 2, the PVT-based encoder has four stages with a similar architecture. Each stage contains a patch embedding layer and multiple Transformer layers. Benefiting from its progressive shrinking pyramid and spatial-reduction attention strategy, the PVT-based encoder produces multi-scale feature maps at a lower memory cost. Specifically, given an input image $X \in \mathbb{R}^{H \times W \times 3}$, it produces pyramid features $E_i \in \mathbb{R}^{\frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}} \times C_i}$, $i \in \{1, 2, 3, 4\}$. Therefore, we obtain both high-resolution detail features and low-resolution semantic features, which are beneficial for segmentation.
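To make the pyramid concrete, the following snippet prints the sizes of $E_1$-$E_4$ for a 352 × 352 input; the channel widths are the typical PVT stage widths and are an assumption here, since the paper does not list them:

```python
# Pyramid feature shapes of a PVT-style encoder for a 352x352 input.
# Channel widths [64, 128, 320, 512] are the common PVT stage widths
# (an assumption, not stated in the paper).
H = W = 352
channels = [64, 128, 320, 512]
for i, c in enumerate(channels, start=1):
    h, w = H // 2 ** (i + 1), W // 2 ** (i + 1)
    print(f"E{i}: {c} x {h} x {w}")
# E1: 64 x 88 x 88, E2: 128 x 44 x 44, E3: 320 x 22 x 22, E4: 512 x 11 x 11
```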

Feature-aligned local enhancement module
The powerful global receptive field of the PVT-based encoder makes it challenging for our model to adequately capture critical local details. Although low-level features can provide some local context, directly transmitting them to the decoder via a simple skip connection is problematic, as it may introduce a large amount of irrelevant background information. Leveraging high-level features as guidance is an effective remedy, but one significant issue, i.e., feature misalignment, must be handled first. The architecture of the feature-aligned local enhancement (FLE) module is shown in Fig. 3: feature-aligned discriminative learning first produces a flow field to align adjacent features and then constructs a discriminative representation using subtraction and a residual connection; afterward, local enhancement with a dense connection structure highlights local details.

Feature-aligned discriminative learning Due to the information gap between semantics and resolution, feature representation remains suboptimal when high-level feature maps are directly upsampled to guide low-level features. To obtain strong feature representations, more attention should be paid to the position offset between low-level and high-level features. Inspired by previous work [31], we propose a feature-aligned discriminative learning (FDL) block that aligns adjacent-level features and further excavates discriminative features, leading to high sensitivity to small objects. Within FDL, two 1 × 1 convolution layers are first employed to compress the adjacent-level features (i.e., $E_i$ and $E_{i-1}$) to the same channel depth. Then, a semantic flow field $\Delta_{i-1}$ is predicted by a 3 × 3 convolution:

$$\Delta_{i-1} = f_{3\times 3}\left(f_{1\times 1}(E_{i-1}) \,©\, U\!\left(f_{1\times 1}(E_i)\right)\right), \tag{1}$$

where $f_{s\times s}(\cdot)$ indicates an $s \times s$ convolution layer followed by batch normalization and a ReLU activation function, while © and $U(\cdot)$ represent concatenation and upsampling, respectively. Next, according to the learned semantic flow $\Delta_{i-1}$, we obtain a feature-aligned high-resolution feature $\tilde{E}_i$ with semantic cues:

$$\tilde{E}_i = \mathrm{Warp}\left(E_i, \Delta_{i-1}\right), \tag{2}$$

where $\mathrm{Warp}(\cdot)$ indicates the mapping function and $E_i$ is a $C_i$-dimensional feature map defined on the spatial grid $\Omega_i$ of size $\left(H/2^{i+1}, W/2^{i+1}\right)$. As illustrated in Fig. 4, the warp procedure consists of two steps. In the first step, each point $p_{i-1}$ on the spatial grid $\Omega_{i-1}$ is mapped to a point $p_i$ on the low-resolution feature:

$$p_i = \frac{p_{i-1} + \Delta_{i-1}(p_{i-1})}{2}. \tag{3}$$

It is worth mentioning that, due to the resolution gap between the flow field and the low-resolution features (see Fig. 4), Eq. 3 contains a halving operation. In the second step, we adopt the differentiable bilinear sampling mechanism [49] to approximate the final feature $\tilde{E}_i$ by linearly interpolating the scores of the four neighboring points (top-left, top-right, bottom-left, and bottom-right) of $p_i$.

After that, to enhance the discriminative local context representation, we further utilize subtraction, absolute value, and residual learning procedures. Conclusively, the final optimized feature $\hat{E}_{i-1}$ can be expressed as:

$$\hat{E}_{i-1} = \left| \tilde{E}_i - f_{1\times 1}(E_{i-1}) \right| + f_{1\times 1}(E_{i-1}). \tag{4}$$

Local enhancement In the PVT-based encoder, attention is established between all patches, allowing information to be blended from every other patch even when the correlation is low. Meanwhile, since small targets occupy only a portion of the entire image, the global interaction in the Transformer architecture cannot fully meet the requirements of small target segmentation, where more detailed local contexts are needed. Considering that a convolution operation with a fixed receptive field blends the features of each patch's neighboring patches, we construct a local enhancement (LE) block that uses convolution to increase the weights of adjacent patches relative to the center patch, thereby emphasizing the local features of each patch.

As shown in Fig. 3, LE has a convolution-based structure consisting of four stages. Each stage includes a 3 × 3 convolutional layer followed by batch normalization and a ReLU activation layer (denoted as $f_{3\times 3}(\cdot)$). Additionally, dense connections are added to encourage feature reuse and strengthen local feature propagation, so the feature map produced by LE contains rich local contexts. Let $x_0$ denote the initial input; the output of the $i$th stage within LE can be formulated as

$$x_i = f_{3\times 3}\left(\left[x_0, x_1, \ldots, x_{i-1}\right]\right),$$

where $[\,\cdot\,]$ represents the concatenation operation. In summary, LE utilizes the local receptive field of the convolution operation and dense connections to achieve local enhancement.
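To make the alignment step concrete, below is a minimal PyTorch sketch of the flow prediction and warp of Eqs. 1-3, assuming a compressed channel width of 64; it illustrates the mechanism and is not the authors' reference implementation. Given the returned pair, the discriminative feature of Eq. 4 follows as `torch.abs(aligned - low) + low`.

```python
# Minimal sketch of FDL's flow-based alignment (Eqs. 1-3); widths and
# normalization details are assumptions, not the authors' exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, k):
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class FlowAlign(nn.Module):
    def __init__(self, c_low, c_high, c_mid=64):
        super().__init__()
        self.squeeze_low = conv_bn_relu(c_low, c_mid, 1)    # f_1x1 on E_{i-1}
        self.squeeze_high = conv_bn_relu(c_high, c_mid, 1)  # f_1x1 on E_i
        self.flow = nn.Conv2d(2 * c_mid, 2, 3, padding=1)   # Eq. 1: 2-ch flow

    def forward(self, e_low, e_high):
        low = self.squeeze_low(e_low)                       # (N, C, H, W)
        high = self.squeeze_high(e_high)                    # (N, C, H/2, W/2)
        high_up = F.interpolate(high, size=low.shape[2:],
                                mode='bilinear', align_corners=False)
        delta = self.flow(torch.cat([low, high_up], dim=1))  # semantic flow
        n, _, h, w = low.shape
        # base sampling grid in grid_sample's normalized [-1, 1] coordinates
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=delta.device),
                                torch.linspace(-1, 1, w, device=delta.device),
                                indexing='ij')
        base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        # convert pixel offsets to normalized units; sampling the coarse map
        # `high` with a fine-resolution grid realizes the warp of Eqs. 2-3
        # (the halving of Eq. 3 is absorbed by the normalization)
        offs = torch.stack((delta[:, 0] * 2 / max(w - 1, 1),
                            delta[:, 1] * 2 / max(h - 1, 1)), dim=-1)
        aligned = F.grid_sample(high, base + offs, mode='bilinear',
                                align_corners=True)
        return low, aligned   # `aligned` plays the role of Ẽ_i in Eq. 4
```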

Progressive local-induced decoder
Efficient recovery of spatial information is critical in medical image segmentation, especially for small objects. Inspired by previous works [50, 51], we propose a progressive local-induced decoder to adaptively restore feature resolution and detailed information. As shown in Fig. 2, the decoder consists of three cascaded local reconstruction and refinement (LRR) modules. The internal structure of LRR is illustrated in Fig. 5, where two steps are performed: local-induced reconstruction (LR) and split-attention-based refinement (SAR).
Local-induced reconstruction LR aims to transfer spatial detail information from low-level features into high-level features, thereby facilitating accurate spatial recovery of the latter. As shown in Fig. 5, LR first produces a reconstruction kernel $\mathcal{K} \in \mathbb{R}^{k^2 \times H_{i-1} \times W_{i-1}}$ based on the low-level feature $F_{i-1}$ and the high-level feature $D_i$, where $k$ indicates the neighborhood size for reconstructing local features:

$$\mathcal{K} = \mathrm{Soft}\left(f_{3\times 3}\left(F_{i-1} \,©\, U(D_i)\right)\right), \tag{5}$$

where $f_{s\times s}(\cdot)$ represents an $s \times s$ convolution layer followed by batch normalization and a ReLU activation function, and $U(\cdot)$, ©, and $\mathrm{Soft}(\cdot)$ indicate upsampling, concatenation, and Softmax activation, respectively. Meanwhile, another 3 × 3 convolution and an upsampling operation are applied to $D_i$ to obtain $\bar{D}_i$ with the same resolution as $F_{i-1}$:

$$\bar{D}_i = U\left(f_{3\times 3}(D_i)\right). \tag{6}$$

Note that $D_4 = E_4$ here. Next, we optimize each pixel $\bar{D}_i[u, v]$ under the guidance of the reconstruction kernel $\mathcal{K}[u, v] \in \mathbb{R}^{k \times k}$, producing the refined local feature $\hat{D}_i[u, v]$ with $r = \lfloor k/2 \rfloor$:

$$\hat{D}_i[u, v] = \sum_{m=-r}^{r} \sum_{n=-r}^{r} \mathcal{K}[u, v](m, n) \cdot \bar{D}_i[u + m, v + n]. \tag{7}$$

Subsequently, $\hat{D}_i$ and $F_{i-1}$ are concatenated and passed through two convolutional layers to produce an optimized feature. Conclusively, LR overcomes the limitations of traditional upsampling operations in precisely recovering pixel-wise predictions, since it takes full advantage of low-level features to adaptively predict the reconstruction kernel and then effectively combines semantic contexts with spatial information toward accurate spatial recovery. This strengthens the recognition of small objects.
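A compact sketch of the reassembly in Eqs. 5-7 is given below, implemented with `F.unfold` in the style of content-aware upsampling [50, 51]; the layer widths and kernel-head layout are illustrative assumptions.

```python
# Sketch of local-induced reconstruction (Eqs. 5-7): a content-aware
# kernel reassembles each k x k neighborhood of the upsampled feature.
# Widths and the exact kernel-head layout are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalReconstruction(nn.Module):
    def __init__(self, c_low, c_high, k=3):
        super().__init__()
        self.k = k
        self.proj = nn.Conv2d(c_high, c_high, 3, padding=1)       # f_3x3(D_i)
        self.kernel_head = nn.Conv2d(c_low + c_high, k * k, 3, padding=1)

    def forward(self, f_low, d_high):
        # Eq. 6: upsample the projected high-level feature to F_{i-1}'s size
        d_up = F.interpolate(self.proj(d_high), size=f_low.shape[2:],
                             mode='bilinear', align_corners=False)
        # Eq. 5: predict a softmax-normalized k*k kernel per output pixel
        kernels = F.softmax(self.kernel_head(torch.cat([f_low, d_up], 1)), dim=1)
        n, c, h, w = d_up.shape
        # Eq. 7: weighted sum over each k x k neighborhood of d_up
        patches = F.unfold(d_up, self.k, padding=self.k // 2)     # (N, C*k*k, H*W)
        patches = patches.view(n, c, self.k * self.k, h, w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)        # (N, C, H, W)
```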

Split-attention-based refinement
To enhance feature representation, we implement an SAR block in which grouped sub-features are split and fed into two parallel branches to capture channel dependencies and pixel-level pairwise relationships through two types of attention. As shown in Fig. 5, SAR is composed of two basic components: a spatial attention block and a channel attention block. Given an input feature map $M$, SAR first divides it along the channel dimension into groups $M = [M_1, M_2, \ldots, M_G]$. For each $M_i$, valuable responses are selected by the attention mechanisms. Specifically, $M_i$ is split into two features, denoted $M_i^1$ and $M_i^2$, which are separately fed into the channel attention block and the spatial attention block to reconstruct features. This allows our model to focus on "what" and "where" are valuable through these two blocks.
In the channel attention block, global average pooling (denoted as $\mathrm{GAP}(\cdot)$) is first performed on $M_i^1$ to produce channel-wise statistics:

$$S = \mathrm{GAP}\left(M_i^1\right) = \frac{1}{H \times W} \sum_{u=1}^{H} \sum_{v=1}^{W} M_i^1(u, v). \tag{8}$$

Then, channel-wise dependencies are captured under the guidance of a compact feature generated by a Sigmoid function (i.e., $\mathrm{Sig}(\cdot)$):

$$\hat{M}_i^1 = \mathrm{Sig}\left(W_1 \cdot S + b_1\right) \cdot M_i^1, \tag{9}$$

in which the parameters $W_1$ and $b_1$ are used for scaling and shifting $S$.

In the spatial attention block, spatial-wise statistics are calculated using Group Norm (GN) [52] on $M_i^2$. The pixel-wise representation is then strengthened by another compact feature computed with parameters $W_2$ and $b_2$ and a Sigmoid function:

$$\hat{M}_i^2 = \mathrm{Sig}\left(W_2 \cdot \mathrm{GN}\left(M_i^2\right) + b_2\right) \cdot M_i^2. \tag{10}$$

Next, $\hat{M}_i^1$ and $\hat{M}_i^2$ are optimized by an additional consistency embedding path and then concatenated to form the refined sub-feature. After aggregating all sub-features, a channel shuffle [53] is performed to facilitate cross-group information exchange along the channel dimension.
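The following PyTorch sketch illustrates the split-attention refinement of Eqs. 8-10 in the style of shuffle attention [53]; the group count is an assumption, and the consistency embedding path is omitted because its exact form is not specified above.

```python
# Sketch of split-attention refinement (Eqs. 8-10) plus channel shuffle;
# the group count G is an assumption and the consistency embedding path
# is omitted.
import torch
import torch.nn as nn

class SplitAttention(nn.Module):
    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups
        c = channels // (2 * groups)                     # width of each split
        self.w1 = nn.Parameter(torch.zeros(1, c, 1, 1))  # Eq. 9 scale
        self.b1 = nn.Parameter(torch.zeros(1, c, 1, 1))  # Eq. 9 shift
        self.w2 = nn.Parameter(torch.zeros(1, c, 1, 1))  # Eq. 10 scale
        self.b2 = nn.Parameter(torch.zeros(1, c, 1, 1))  # Eq. 10 shift
        self.gn = nn.GroupNorm(c, c)                     # spatial statistics
        self.sig = nn.Sigmoid()

    def forward(self, m):
        n, c, h, w = m.shape
        m = m.reshape(n * self.g, c // self.g, h, w)     # grouped sub-features
        m1, m2 = m.chunk(2, dim=1)                       # split M_i^1, M_i^2
        s = m1.mean(dim=(2, 3), keepdim=True)            # Eq. 8: GAP
        m1 = m1 * self.sig(self.w1 * s + self.b1)        # Eq. 9: channel branch
        m2 = m2 * self.sig(self.w2 * self.gn(m2) + self.b2)  # Eq. 10: spatial
        out = torch.cat([m1, m2], dim=1).reshape(n, c, h, w)
        # channel shuffle [53]: swap group/channel axes for cross-group mixing
        out = out.reshape(n, 2, c // 2, h, w).transpose(1, 2).reshape(n, c, h, w)
        return out
```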

Mutual information loss
As stated in a previous study [54], training models with only a pixel-wise loss may limit segmentation performance and, in particular, cause prediction errors for small objects. This is due to the class imbalance between foreground and background, such that task-relevant information is overwhelmed by irrelevant noise. Therefore, to preserve task-relevant information, we explore novel supervision at the feature level to further assist accurate segmentation. Let $X$ and $Y$ denote the input medical image and its corresponding ground truth, respectively, and let $Z$ represent the deep feature extracted from input $X$.
Mutual information (MI) Mutual information is a fundamental quantity that measures the amount of information shared between two random variables. Mathematically, the statistical dependency between $Y$ and $Z$ can be quantified by MI as

$$I(Y; Z) = \mathbb{E}_{p(Y, Z)}\left[\log \frac{p(Y, Z)}{p(Y)\, p(Z)}\right], \tag{12}$$

where $p(Y, Z)$ is the joint probability distribution of $Z$ and $Y$, while $p(Z)$ and $p(Y)$ are their marginals.
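As a toy illustration of Eq. 12 (our own example, not from the paper), the snippet below evaluates $I(Y; Z)$ for two correlated binary variables:

```python
# Toy numeric check of Eq. 12 for two binary variables (illustrative only).
import math

p_yz = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # joint p(Y, Z)
p_y = {0: 0.5, 1: 0.5}
p_z = {0: 0.5, 1: 0.5}
mi = sum(p * math.log(p / (p_y[y] * p_z[z]))
         for (y, z), p in p_yz.items())
print(f"I(Y;Z) = {mi:.4f} nats")  # ~0.1927 nats: Y and Z share information
```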
Mutual information loss Our primary objective is to maximize the amount of task-relevant information about $Y$ in the latent feature $Z$ while reducing task-irrelevant information, which is achieved by two mutual information terms [55, 56]. Formally,

$$\max_{Z} \; I(Z; Y) - I(Z; X \mid Y), \tag{13}$$

where the first term preserves the information in $Z$ that is predictive of $Y$, and the second term discards the information about the input $X$ that is irrelevant to the task. Owing to the notorious difficulty of conditional MI computations, these terms are estimated by existing MI estimators [56, 57]. In detail, the first term is realized through the Pixel Position Aware (PPA) loss [57] ($L_{PPA}$). Since the PPA loss assigns different weights to different positions, it better explores task-relevant structure information and pays more attention to important details. The second term is estimated by Variational Self-Distillation (VSD) [56] ($L_{VSD}$), which uses a KL-divergence to compress $Z$ and remove irrelevant noise, thereby addressing the foreground-background pixel imbalance caused by small targets. Thus, our total loss can be expressed as

$$L_{total} = L_{PPA} + L_{VSD}. \tag{14}$$
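For reference, the sketch below follows the widely circulated public implementation of the PPA (structure) loss, which combines a position-weighted BCE with a position-weighted IoU term; it stands in for $L_{PPA}$ only and is not necessarily identical to the authors' code ($L_{VSD}$ requires the variational estimator of [56] and is omitted here).

```python
# Sketch of the pixel-position-aware term L_PPA following the commonly used
# public implementation (weighted BCE + weighted IoU); an assumption, not the
# authors' exact code. `pred` holds logits, `mask` is a {0, 1} ground truth.
import torch
import torch.nn.functional as F

def ppa_loss(pred, mask):
    # position weights: pixels whose 31x31 neighborhood disagrees with their
    # own label (boundaries and small structures) receive larger weights
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()
```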

Implementation details
We implement our experiments on an NVIDIA GeForce RTX 3090 GPU. The AdamW algorithm is chosen to optimize the model parameters, and the initial learning rate is set to 1e-4. During training, a multi-scale training strategy is employed, in which input images are rescaled by ratios of {0.75, 1, 1.25}. The total number of epochs and the batch size are set to 200 and 16, respectively. In the pre-processing step, all images and corresponding ground truths are resized to 352 × 352.
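For concreteness, a minimal sketch of the training loop implied by these settings follows; the model, data loader, and loss function are placeholders, and rounding the rescaled size to a multiple of 32 is our assumption to keep feature maps divisible across the encoder's four stages:

```python
# Sketch of the training configuration described above (AdamW, lr 1e-4,
# 200 epochs, batch size 16, multi-scale ratios {0.75, 1, 1.25}).
# `model`, `loader`, and `loss_fn` are placeholders; snapping the rescaled
# size to a multiple of 32 is our assumption, not stated in the paper.
import random
import torch
import torch.nn.functional as F

def train(model, loader, loss_fn, device="cuda"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    model.to(device).train()
    for epoch in range(200):
        for images, masks in loader:          # images pre-resized to 352x352
            ratio = random.choice([0.75, 1.0, 1.25])
            size = max(32, int(round(352 * ratio / 32)) * 32)
            images = F.interpolate(images, size=size, mode="bilinear",
                                   align_corners=False)
            masks = F.interpolate(masks, size=size, mode="nearest")
            preds = model(images.to(device))
            loss = loss_fn(preds, masks.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```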

Datasets
To verify the capability of our proposed model, we evaluate LET-Net on two medical image segmentation tasks. For polyp segmentation, we utilize five public benchmarks: CVC-ClinicDB [62], Kvasir [63], CVC-ColonDB [64], ETIS-LaribPolypDB [65], and CVC-300 [66]. To ensure a fair comparison, we follow the work [59] and divide the large-scale CVC-ClinicDB and Kvasir datasets into training, validation, and testing sets in an 8:1:1 ratio, while the remaining three datasets are used only for testing to evaluate the model's generalization ability. For the breast lesion segmentation task, we choose the public breast ultrasound dataset (BUSIS) [67] to assess the effectiveness of LET-Net. This dataset includes 133 normal cases, 437 benign cases, and 210 malignant cases. We follow the same settings as the work [2] and conduct experiments separately on benign and malignant samples.

Evaluation metrics
As done in recent related work on polyp segmentation [20], we employ both mean Dice (mDice) and mean IoU (mIoU) to quantitatively evaluate the performance of our model and other state-of-the-art methods on the polyp benchmarks. For breast lesion segmentation, we adopt four widely used metrics, namely Accuracy, Jaccard index, Precision, and Dice, to validate segmentation performance. For all metrics, higher scores indicate better results.

Experimental results
To investigate the effectiveness of our proposed method, we validate LET-Net in two applications: polyp segmentation from colonoscopy images and breast lesion segmentation from ultrasound images.

Polyp segmentation
Quantitative comparison To demonstrate the effectiveness of our LET-Net, we compare it with several state-of-the-art methods on five polyp benchmarks. Table 1 summarizes the quantitative results in detail, from which we can see that LET-Net outperforms the other methods on all datasets. Concretely, on the seen CVC-ClinicDB dataset, it achieves significantly higher mDice and mIoU scores (94.5% and 89.9%, respectively). On the Kvasir dataset, our method exceeds SANet [20] and BLE-Net [61] by 2.2% and 2.1% in mDice, respectively. The underlying reason for their limited performance is that these two methods follow a pure CNN architecture, which lacks global long-range dependencies. By contrast, our method captures global contexts with its PVT-based encoder and further excavates valuable local information using the FLE module, demonstrating superior segmentation ability. Most importantly, LET-Net also exhibits excellent generalization on the unseen datasets (i.e., CVC-ColonDB, ETIS-LaribPolypDB, and CVC-300). Specifically, LET-Net surpasses the CNN-based state-of-the-art CaraNet [22] by 2.2% and 2.8% in mDice and mIoU on CVC-ColonDB. Compared with other Transformer-based approaches, LET-Net likewise presents excellent segmentation and generalization abilities: on the ETIS-LaribPolypDB dataset, it achieves 4.7% and 4.2% higher mDice than SETR-PUP [18] and TransUnet [43], respectively. This improvement can be attributed to two factors: the proposed FLE module compensates for the loss of local details in the Transformer architecture, and the LRR module effectively recovers spatial information.
Visual comparison To further evaluate the proposed LET-Net intuitively, we visualize segmentation maps produced by our model and other methods in Fig. 6. It is apparent that LET-Net can not only clearly highlight polyp regions but also identify small polyps more accurately than its counterparts. This is mainly because our method effectively leverages and combines global and local contexts; in addition, the mutual information loss assists in learning task-relevant representations. Furthermore, LET-Net successfully handles other challenging cases, including cluttered backgrounds (Fig. 6(b), (c), (g), (i)) and low contrast (Fig. 6(a), (h)). For example, as illustrated in Fig. 6(b), (i), ACSNet [39] and PraNet [59] misidentify background tissues as polyps, whereas LET-Net avoids this drawback. By combining the strengths of CNN and Transformer, LET-Net produces good segmentation in these scenarios and achieves leading performance overall.

Breast lesion segmentation
Quantitative comparison To further evaluate the effectiveness of our method, we conduct extensive experiments on breast lesion segmentation and perform a comparative analysis against ten segmentation approaches. Table 2 presents the detailed quantitative comparison on the BUSIS dataset. Our LET-Net exhibits excellent performance in both benign and malignant lesion segmentation. In benign lesion segmentation, LET-Net achieves 97.7% Accuracy, 74.0% Jaccard, 83.5% Precision, and 81.5% Dice, outperforming the other competitors by a clear margin; in detail, it surpasses C-Net [2], CPF-Net [29], and PraNet [59] by 1.6%, 4.1%, and 4.9% in Jaccard, respectively. Meanwhile, in malignant lesion segmentation, we obtain an Accuracy of 93% and a Dice of 72.7%, again demonstrating the superiority of LET-Net; in particular, it improves on C-Net [2] by 1.8% in Jaccard and 2.8% in Dice. The reason is that although C-Net constructs a bidirectional attention guidance network to capture both global and local features, long-range dependencies are not fully modeled due to the limitations of convolution.
Visual comparison To intuitively demonstrate the performance of our model, we present segmentation results of different methods in Fig. 7. We observe that other methods often produce segmentation maps with incomplete lesion structures or false positives, while our prediction maps are superior. This is mainly due to FLE's ability to facilitate discriminative local feature learning and the effectiveness of the proposed LRR module for spatial reconstruction. In addition, it is worth noting that LET-Net handles various lesion shapes [Fig. 7(a)-(h)] and low-contrast images [Fig. 7(d), (h)] well, which can be attributed to its powerful and robust feature learning ability.

Effectiveness of mutual information loss
To validate the effectiveness and necessity of our mutual information loss, we retrain LET-Net with different loss settings. Specifically, we construct three variants, i.e., w/o $L_{PPA}$, w/o $L_{VSD}$, and w/o $L_{PPA}$ & $L_{VSD}$, each of which removes the corresponding loss term. Note that we apply a conventional binary cross-entropy loss to supervise the model when removing $L_{PPA}$. Table 4 reports the quantitative evaluation. Comparing the first and fourth lines of Table 4, we observe that the model performs poorly without PPA loss supervision, obtaining a 1.1% lower mIoU on the CVC-ClinicDB dataset. A similar drop occurs for the variant w/o $L_{VSD}$: without $L_{VSD}$, mIoU decreases by 1.5%, 1%, and 1.3% on the CVC-ColonDB, ETIS-LaribPolypDB, and CVC-300 datasets, respectively. This confirms that each term of the total loss is effective for segmentation. The reasons can be summarized as follows: first, in contrast to binary cross-entropy loss, the PPA loss guides the model to pay more attention to local details by synthesizing the local structure information of each pixel, resulting in superior performance; second, $L_{VSD}$ assists task-relevant feature learning, thereby improving sensitivity to small objects. In addition, our method outperforms w/o $L_{PPA}$ & $L_{VSD}$ by a large margin, achieving 2.3% mDice and 2.5% mIoU gains on the CVC-ColonDB dataset with the help of the mutual information loss. In summary, the experimental results fully demonstrate that the mutual information loss is beneficial for LET-Net.

Conclusion
In this work, we propose a novel locally enhanced transformer network for accurate medical image segmentation.
Our model adopts a PVT-based encoder to extract global contexts, utilizes a feature-aligned local enhancement module to highlight detailed local contexts, and effectively recovers high-resolution spatial information with its progressive local-induced decoder. In addition, we design a mutual information loss to encourage LET-Net to learn powerful representations from a task-relevant perspective. LET-Net is validated on polyp and breast lesion segmentation and achieves state-of-the-art performance, demonstrating particular strength in small target segmentation. In future work, we aim to apply LET-Net to other medical image segmentation tasks with different modalities or anatomies, thereby making our model more robust.

Fig. 1 An illustration of small lesion samples and size distributions for different medical image datasets, including polyp colonoscopy images and breast ultrasound images. The ground truth for each image is outlined in green. In each histogram, the horizontal axis represents the ratio of lesion area to the whole image

Figure 2 illustrates our proposed LET-Net, which combines Transformer and CNN architectures to achieve accurate segmentation. In the encoder stage, we utilize a pre-trained pyramid vision transformer (PVT) [19] as the backbone to extract hierarchical multi-scale features. Then, three feature-aligned local enhancement (FLE) modules are inserted into the skip connections to enhance discriminative local features. Afterward, we employ a novel progressive local-induced decoder composed of cascaded local reconstruction and refinement (LRR) modules to effectively recover spatial resolution and produce the final segmentation maps. In what follows, we elaborate on the key components of our model.

Fig. 5 The structure of the local reconstruction and refinement module. It contains two blocks: local-induced reconstruction and split-attention-based refinement

Fig. 7 Visual comparison among different methods in breast lesion segmentation, where the segmentation results of benign and malignant lesions are separated by a red dashed line

Table 1 Comparisons between different methods in the polyp segmentation task. The best results are highlighted in bold

Impact of FLE and LRR modules

To validate the effectiveness of the FLE and LRR modules, we remove them individually from the full network, resulting in two variants, namely w/o FLE and w/o LRR. As shown in Table 3, the variant without FLE (w/o FLE) achieves a 93.6% mDice score on the CVC-ClinicDB dataset; when the FLE module is applied, the mDice score increases to 94.5%. Moreover, it boosts mDice by 1.6%, 2.2%, and 2.3% on the CVC-ColonDB, ETIS-LaribPolypDB, and CVC-300 datasets, respectively. These results indicate that our FLE module effectively supports accurate segmentation owing to its ability to learn discriminative local features under the feature alignment condition. Furthermore, comparing the second and third lines of Table 3 shows that the LRR module is also conducive to segmentation, with performance gains of 1.6% and 1.7% in mDice and mIoU on the Kvasir dataset. The main reason is that the LRR module achieves effective spatial recovery via its dynamic reconstruction kernels and split-attention mechanism, thereby facilitating segmentation.