Transformers-based architectures for stroke segmentation: a review

Zafari-Ghadim, Yalda; Rashed, Essam A.; Mohamed, Amr; Mabrok, Mohamed

doi:10.1007/s10462-024-10900-5

Open access
Published: 30 September 2024

Volume 57, article number 307, (2024)

Artificial Intelligence Review

Yalda Zafari-Ghadim¹,
Essam A. Rashed²,
Amr Mohamed³ &
…
Mohamed Mabrok¹

Abstract

Stroke remains a significant global health concern, necessitating precise and efficient diagnostic tools for timely intervention and improved patient outcomes. The emergence of deep learning methodologies has transformed the landscape of medical image analysis. Recently, Transformers, initially designed for natural language processing, have exhibited remarkable capabilities in various computer vision applications, including medical image analysis. This comprehensive review aims to provide an in-depth exploration of the cutting-edge Transformer-based architectures applied in the context of stroke segmentation. It commences with an exploration of stroke pathology, imaging modalities, and the challenges associated with accurate diagnosis and segmentation. Subsequently, the review delves into the fundamental ideas of Transformers, offering detailed insights into their architectural intricacies and the underlying mechanisms that empower them to effectively capture complex spatial information within medical images. The existing literature is systematically categorized and analyzed, discussing various approaches that leverage Transformers for stroke segmentation. A critical assessment is provided, highlighting the strengths and limitations of these methods, including considerations of performance and computational efficiency. Additionally, this review explores potential avenues for future research and development.

1 Introduction

Stroke, a cerebrovascular disease, is the second leading cause of morbidity and mortality worldwide, affecting over 100 million people globally (Feigin et al. 2022). It occurs when the blood supply to the brain is abruptly disrupted, resulting in the damage or death of brain cells. This can happen in two main ways: a blockage in the blood vessels, referred to as ischemic stroke, or the rupture of vessels leading to bleeding into surrounding tissues, known as hemorrhagic stroke (Grysiewicz et al. 2008). The consequences of stroke for patients can be profound, often resulting in physical disabilities and cognitive impairments (Meyer et al. 2015; Dimyan and Cohen 2011). This underscores the importance of accurate and timely diagnosis for effective treatment and improved patient outcomes.

Stroke patients typically undergo neuroimaging to distinguish between ischemic and hemorrhagic strokes. This differentiation can be achieved through magnetic resonance imaging (MRI) and computed tomography (CT), each offering a distinctive insight into the condition of the brain (Goldstein and Simel 2005). MRI offers excellent soft tissue contrast for the brain, and when diagnosis is uncertain, it can be more informative than CT (Hwang et al. 2012; Chalela et al. 2007; Fiebach et al. 2002), providing information on stroke location (Flossmann et al. 2008), timing (Aoki et al. 2010), and mechanism (Wessels et al. 2006). Diffusion-weighted imaging (DWI) and perfusion-weighted imaging (PWI) within the MRI protocol offer valuable information on the extent and impact of stroke on brain tissue (Simonsen et al. 2015). Refer to Fig. 1 for an illustration showing a stroke infarct sample in two distinct magnetic resonance modalities along with the corresponding annotation. Additionally, refer to Fig. 2 for CT images accompanied by corresponding annotations.

Stroke segmentation plays an essential role in the diagnostic process as well as treatment planning by providing spatial information about affected areas of the brain and the extent of damage. Traditional methods of stroke diagnosis, often based on manual interpretation of medical images, are time-consuming and susceptible to human error. The inherent variability in the size, shape, and location of strokes, compounded by artifacts and noise in the imaging data, makes automated analysis challenging. Furthermore, the need for real-time or near-real-time diagnosis in stroke cases demands algorithms that are not only accurate but also computationally efficient. As such, the development of accurate and automatic methods for stroke segmentation remains a prominent focus in the research domain.

The field of medical image analysis has witnessed a transformative evolution with the advent of deep learning techniques (Zhou et al. 2019a). Deep learning (DL) models, with their ability to automatically learn intricate patterns from vast amounts of data, have shown promising results in various medical imaging tasks, including stroke segmentation (Zhang et al. 2022). Convolutional Neural Networks (CNNs) (O’Shea and Nash 2015), a class of deep learning models, have demonstrated remarkable success in tasks such as image classification (Huang et al. 2017; Hu et al. 2018), object detection (Wang et al. 2017), segmentation (Chen et al. 2017), and registration (Balakrishnan et al. 2019; Jia et al. 2022). These models, with their hierarchical feature learning capabilities, have significantly improved the accuracy and efficiency of medical image analysis. However, the inherent limitations of CNNs in capturing long-range dependencies and contextual information in images have led to the exploration of alternative architectures, including Transformers (Li et al. 2023a).

Originally proposed for natural language processing tasks (Vaswani et al. 2017), Transformers have gained widespread attention in the computer vision community. Unlike traditional convolutional approaches, Transformers process input data in a parallel and non-sequential manner, allowing them to capture complex spatial relationships and contextual dependencies effectively. The self-attention mechanism in Transformers enables them to weigh different parts of the input data differently, making them particularly suitable for tasks requiring a global understanding of the data, such as medical image analysis.

To gather a comprehensive body of research on the application of deep learning techniques, particularly Transformer-based architectures, in stroke segmentation, we conducted an extensive search across multiple electronic databases. These databases included PubMed, IEEE Xplore, and Google Scholar, which are widely recognized for their extensive collections of scientific literature in the fields of medicine and computer science. Our search strategy involved the use of various queries designed to retrieve a broad range of published works. We combined terms related to deep learning and neural networks, such as “CNN”, “deep learning”, “Transformer”, and “vision Transformer”, with stroke-specific keywords including “stroke”, “ischemic stroke”, “stroke segmentation”, and “stroke detection”. This approach aimed to capture studies that intersected both the technical aspects of deep learning and the clinical focus on stroke. In addition to the automated search, we also employed a manual review process to ensure the comprehensiveness of our literature search. We examined the reference lists of the articles initially identified to uncover additional studies that might have been missed by our initial search criteria. This step is crucial for uncovering relevant research that may not be indexed as prominently in electronic databases. 
To be included, studies had to meet the following criteria: (1) they must have been published in English to ensure accessibility and ease of review; (2) their primary focus should have been on stroke segmentation, reflecting our interest in the application of deep learning to accurately identifying and delineating stroke lesions in medical imaging; (3) they should have utilized deep learning techniques, with a particular emphasis on studies that employed Transformer-based architectures, given the growing interest in these models for their potential to improve performance in complex image analysis tasks; and (4) they must have reported quantitative results on the performance of their models, providing empirical evidence of their effectiveness. By adhering to these criteria, we ensured that the studies included in our review contributed meaningful insights into the state of the art in stroke segmentation using deep learning.

We excluded studies that: (1) were unrelated to stroke segmentation; (2) did not achieve high segmentation performance; or (3) duplicated the methodology of other papers while making only a negligible contribution of their own. However, in cases where the performance of all proposed pipelines was low for certain datasets, we kept the superior ones. Due to the relatively limited number of Transformer-based networks addressing stroke segmentation in the existing literature, we included all available publications of this kind in our study. It is important to acknowledge that our review might have unintentionally omitted some noteworthy papers related to CNN-based architectures. Nevertheless, our primary objective was to offer an overview of the contributions in utilizing vision Transformers for stroke segmentation purposes. See Fig. 3 for the process of publication selection for this review.

In previous reviews on brain stroke segmentation (Zhang et al. 2022; Abbasi et al. 2023), the focus was primarily on CNN-based architectures, with no inclusion of Transformer-based models. Conversely, reviews on the use of Transformers for medical image analysis (Shamshad et al. 2023; He et al. 2023; Li et al. 2023a) mainly discussed Transformer-based or hybrid CNN-Transformer models, excluding advanced CNN-based architectures. Stroke segmentation presents unique challenges, requiring specific architectural considerations for accurate results. Thus, an architecture that performs well for one task may not be generalizable to others. In this review, we aim to address this gap by considering both Transformer-based and CNN-based architectures for a fair and comprehensive comparison.

Furthermore, previous reviews often focused solely on ischemic stroke and corresponding datasets, overlooking the inclusion of hemorrhagic stroke and datasets specific to it (Luo et al. 2024; Abbasi et al. 2023). Our review aims to bridge this gap by considering both ischemic and hemorrhagic stroke, as well as all available datasets for stroke segmentation. Table 1 provides a summary of the information included in our review compared to previous ones.

Table 1 Comparison of this study and previous reviews for stroke segmentation

In this review, we discuss the applications of Transformers in stroke segmentation, exploring innovative methodologies developed to address the challenges posed by stroke diagnosis. We systematically review the existing literature, analyzing different Transformer-based architectures, their integration with traditional deep learning techniques, and their performance in stroke-related tasks. Through this comprehensive review, we aim to provide insight into the current state of the art, highlight the strengths and limitations of Transformers in stroke segmentation, and identify potential avenues for future research and development.

The paper is organized as follows: It begins with an introduction providing an overview of vision Transformers, including key mechanisms such as self-attention and multi-head self-attention, basic architectural designs, and their application in medical imaging. Following this, the datasets available for stroke segmentation are introduced, covering both ischemic and hemorrhagic stroke datasets across MRI and CT modalities. Subsequently, various metrics used for evaluating the performance of proposed methods in stroke segmentation are discussed. The paper then discusses different architectures proposed for stroke segmentation, including both early CNN-based approaches and the primary focus of the study, Transformer-based architectures. Open challenges in stroke segmentation and potential future research directions are discussed before concluding with a summary of the study’s contributions. Figure 4 provides a graphical abstract highlighting the key aspects discussed in this paper, encompassing available datasets for stroke segmentation and the diverse deep architectures employed in this context.

The key contributions of this review paper are:

Comprehensive coverage of both Transformer-based and CNN-based architectures for brain stroke segmentation, unlike previous reviews that focused primarily on one type of architecture.
Inclusion of both ischemic and hemorrhagic stroke types, as well as a wider range of datasets (7 datasets), compared to previous reviews that were limited in their scope.
Providing a fair and comprehensive comparison of the performance of different deep learning architectures (Transformers and CNNs) for the brain stroke segmentation task.
Identifying the strengths and limitations of Transformer-based approaches in the context of brain stroke segmentation, and highlighting potential avenues for future research and development.

2 Fundamentals of transformers

2.1 Architectural components of transformers

Transformers, introduced in the context of natural language processing, consist of fundamental architectural components that distinguish them from traditional CNN-based models. The core elements of Transformers include self-attention mechanisms and position-wise feed-forward networks. Self-attention enables the model to weigh input elements differently, capturing contextual dependencies irrespective of their positions in the sequence. This mechanism allows Transformers to model long-range dependencies efficiently, making them well-suited for tasks requiring a global understanding of the data. Position-wise feed-forward networks introduce non-linear transformations, enhancing the model’s capacity to learn complex patterns within the data.

2.1.1 Self-attention

The self-attention (SA) mechanism used in Transformers is a crucial component that empowers the model to capture the long-range dependencies between various parts of the input data. This is accomplished through a process in which each element (token) in the input sequence attends to every other element, calculating its representation based on the information from all other elements.

To compute the self-attention, the input sequence $X\in {\mathbb {R}}^{N{\times }C}$ is projected into a query $Q\in {\mathbb {R}}^{N{\times }D}$, a key $K\in {\mathbb {R}}^{N{\times }D}$, and a value $V\in {\mathbb {R}}^{N{\times }D_v}$ using three trainable projection layers $W^Q\in {\mathbb {R}}^{C{\times }D}, W^K\in {\mathbb {R}}^{C{\times }D}, W^V\in {\mathbb {R}}^{C{\times }D_v}$, respectively. Then, the corresponding attention matrix $A\in {\mathbb {R}}^{N{\times }N}$, which represents the affinity of the query and the key, can be calculated by:

$$\begin{aligned} A(Q, K) = Softmax \left(\frac{Q{\times }K^T}{\sqrt{D}}\right). \end{aligned}$$

(1)

The attention matrix connects all elements, which allows the handling of long-range dependencies. Subsequently, the calculated attention matrix is applied to the value V, resulting in the output $Z\in {\mathbb {R}}^{N{\times }D_v}$:

$$\begin{aligned} Z = SA(Q, K, V) = A(Q, K){\times }V. \end{aligned}$$

(2)
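
Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than any particular library's implementation: the function names, the random weights, and the toy sizes (N = 6, C = 8, D = 4, D_v = 5) are assumptions chosen only to make the shapes visible.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention following Eqs. (1) and (2)."""
    Q = X @ W_q                                   # (N, D)
    K = X @ W_k                                   # (N, D)
    V = X @ W_v                                   # (N, D_v)
    D = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(D), axis=-1)    # (N, N) attention matrix, Eq. (1)
    Z = A @ V                                     # (N, D_v) output, Eq. (2)
    return Z, A

# Toy sizes and random weights (illustrative assumptions, not trained values).
rng = np.random.default_rng(0)
N, C, D, D_v = 6, 8, 4, 5
X = rng.standard_normal((N, C))
W_q = rng.standard_normal((C, D))
W_k = rng.standard_normal((C, D))
W_v = rng.standard_normal((C, D_v))

Z, A = self_attention(X, W_q, W_k, W_v)
print(Z.shape, A.shape)
```

Each row of A sums to one, so every output token is a weighted average of all N value vectors, which is how a single layer can relate distant positions.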

2.1.2 Multi-head self-attention

In multi-head self-attention (MSA), multiple SA blocks (heads) are computed in parallel to produce multiple output maps. The final output is a concatenation and projection of the outputs of all SA blocks. This enables better modeling of complex dependencies between different elements in the input. For H heads, each head i has its own learnable weight matrices, $ \{W^{(Q_i)}, W^{(K_i)}, W^{(V_i)}; i=1,\ldots , H\}$, which produce the per-head query $Q_i$, key $K_i$, and value $V_i$.

$$\begin{aligned} \begin{aligned} Z_i = SA(Q_i, K_i, V_i) = Softmax \left(\frac{Q_i{\times }K_i^T}{\sqrt{D/H}}\right){\times }V_i, \\ MSA(Q, K, V) = Concat (Z_1, Z_2,\ldots , Z_H) W^0, \end{aligned} \end{aligned}$$

(3)

where $W^0$ is a linear projection that aggregates the outputs of all attention heads. It is noteworthy that a larger number of heads does not necessarily lead to better performance (Dosovitskiy et al. 2020).
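
A multi-head version can be sketched in the same way. As in Eq. (2), each head applies its attention matrix to its own value matrix before the heads are concatenated and projected. The sketch below rests on stated assumptions: the value dimension equals D, the per-head weight matrices are taken as column slices of shared (C, D) matrices, and all names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head self-attention in the spirit of Eq. (3).

    X: (N, C); W_q, W_k, W_v: (C, D); W_o: (D, D).
    D must be divisible by num_heads (assumption of this sketch).
    """
    N = X.shape[0]
    D = W_q.shape[1]
    assert D % num_heads == 0
    d_head = D // num_heads
    # Project once, then split the D channels into num_heads groups of size D/H.
    Q = (X @ W_q).reshape(N, num_heads, d_head).transpose(1, 0, 2)  # (H, N, D/H)
    K = (X @ W_k).reshape(N, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(N, num_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)             # (H, N, N)
    A = softmax(scores, axis=-1)
    Z = A @ V                                                       # (H, N, D/H)
    Z = Z.transpose(1, 0, 2).reshape(N, D)                          # concatenate heads
    return Z @ W_o                                                  # final projection

# Toy sizes and random weights (illustrative assumptions).
rng = np.random.default_rng(1)
N, C, D, H = 5, 6, 8, 4
X = rng.standard_normal((N, C))
W_q, W_k, W_v = (rng.standard_normal((C, D)) for _ in range(3))
W_o = rng.standard_normal((D, D))

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, H)
print(out.shape)
```

Splitting one projection into H groups is a common way to compute all heads at once; it gives the same result as looping over H separate heads whose weights are the corresponding column slices.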

2.2 Vision transformer pipeline

The Vision Transformer (ViT) (Dosovitskiy et al. 2020) is a Transformer-based architecture introduced for image classification tasks. Its central idea is that tokens are created from the flattened patches of the image. Let X be a 3D image volume ($X\in {\mathbb {R}}^{(H{\times }W{\times }L{\times }C)}$), where (H, W, L) represents the image dimensions, and C is the number of channels. The image is divided into N patches, which may or may not overlap, each of size (P, P, P). Then, a sequence is created from the flattened form of these patches, $x_P\in {\mathbb {R}}^{N{\times }P^3C}$, and projected into a D-dimensional space to give $\hat{x}$. To preserve positional information, a positional embedding is added, resulting in the input of the Transformer encoder, denoted as x:

$$\begin{aligned} x = \hat{x} + E_{pos}, E_{pos}\in {\mathbb {R}}^{N{\times }D}, \end{aligned}$$

(4)

The resulting tokens are fed into a Transformer encoder comprising L stacked base blocks (here L denotes the encoder depth, not the image dimension used above). Each base block includes multi-head self-attention and a multilayer perceptron (MLP), with layer normalization (LN) applied before, and a residual connection after, each of these two sub-blocks. A depiction of ViT and the Transformer encoder is shown in Fig. 5.
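
The tokenization step leading to Eq. (4) (patch extraction, flattening, linear projection, and positional embedding) can be sketched as follows. This is an illustrative sketch under stated assumptions: patches do not overlap, the image dimensions are divisible by P, and the random matrices `E` and `E_pos` merely stand in for parameters that would be learned in practice.

```python
import numpy as np

def patchify_3d(volume, P):
    """Split an (H, W, L, C) volume into non-overlapping P x P x P patches.

    Returns an array of shape (N, P^3 * C) with N = (H/P) * (W/P) * (L/P).
    """
    H, W, L, C = volume.shape
    assert H % P == 0 and W % P == 0 and L % P == 0
    x = volume.reshape(H // P, P, W // P, P, L // P, P, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)        # (H/P, W/P, L/P, P, P, P, C)
    return x.reshape(-1, P * P * P * C)

# Toy sizes (illustrative assumptions).
rng = np.random.default_rng(0)
H, W, L, C, P, D = 8, 8, 4, 2, 4, 16
volume = rng.standard_normal((H, W, L, C))

x_p = patchify_3d(volume, P)                     # (N, P^3 C): flattened patches
N = x_p.shape[0]
E = rng.standard_normal((P**3 * C, D)) * 0.02    # stand-in for the learned projection
E_pos = rng.standard_normal((N, D)) * 0.02       # stand-in for the positional embedding
x_hat = x_p @ E                                  # tokens in the D-dimensional space
x = x_hat + E_pos                                # encoder input, Eq. (4)
print(x_p.shape, x.shape)
```

With these toy sizes the volume yields N = 4 tokens of length 128 before projection, which shows how quickly the sequence length grows for realistic 3D scans.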

It is noteworthy that the computational cost of the MSA blocks grows quadratically with the input sequence length, because the attention matrix relates every token to every other token (Dosovitskiy et al. 2020). This limitation could restrict its practical use, especially when dealing with high-resolution medical images. The introduction of the "Shifted Windows" idea in the Swin Transformer (Liu et al. 2021) improved the efficiency of MSA calculations. Unlike ViT (Dosovitskiy et al. 2020), which computes the relationship between one token and all others in every step of the self-attention calculations, Swin Transformer restricts self-attention calculations to non-overlapping local windows. Shifting the window partition between consecutive layers enables cross-window connections while maintaining linear computational complexity relative to the image size. Refer to Fig. 6 for a visual representation of how the Shifted Windows concept partitions an input feature map with dimensions of $4{\times }8$ pixels, using a window size of $2{\times }4$. Additionally, Swin Transformer utilizes a hierarchical structure and generates feature maps at multiple resolutions through the incorporation of patch-merging layers.
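
The window partitioning described above can be illustrated on the same $4{\times }8$ map with $2{\times }4$ windows. The sketch below is a simplification under stated assumptions: the shift is implemented as a cyclic roll by half a window, the attention masking that practical implementations apply to wrapped-around tokens is omitted, and the helper name is hypothetical.

```python
import numpy as np

def window_partition(feat, wh, ww):
    """Split an (H, W) map into non-overlapping (wh, ww) windows.

    Returns an array of shape (num_windows, wh * ww).
    """
    H, W = feat.shape
    assert H % wh == 0 and W % ww == 0
    x = feat.reshape(H // wh, wh, W // ww, ww).transpose(0, 2, 1, 3)
    return x.reshape(-1, wh * ww)

H, W, wh, ww = 4, 8, 2, 4
token_ids = np.arange(H * W).reshape(H, W)       # label each position by its index

# Regular partition: 4 windows of 8 tokens each.
regular = window_partition(token_ids, wh, ww)

# Shifted partition: roll the map by half a window, then partition again.
rolled = np.roll(token_ids, shift=(-(wh // 2), -(ww // 2)), axis=(0, 1))
shifted = window_partition(rolled, wh, ww)

# Number of query-key pairs scored by global versus windowed attention.
global_pairs = (H * W) ** 2
window_pairs = regular.shape[0] * (wh * ww) ** 2

print(regular[0].tolist())
print(shifted[0].tolist())
print(global_pairs, window_pairs)
```

In this toy case the first shifted window gathers tokens that belonged to four different regular windows, which is the cross-window connection, and windowed attention scores a quarter of the pairs that global attention would. Because the window size is fixed, the windowed count grows linearly with the number of tokens.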

2.3 Adaptations for medical image analysis

Adapting Transformers for medical image analysis involves several considerations. One significant challenge is the high dimensionality of medical images, often requiring substantial computational resources for processing, especially for 3D images. Researchers have explored techniques such as patch-based processing (Liu et al. 2021) and efficient attention mechanisms (Xiong et al. 2021; Rao et al. 2021) to mitigate this challenge. Another significant challenge lies in ensuring the generalizability of the proposed network. The pipeline must demonstrate robustness when tested on unseen data acquired from different imaging scanners or centers. The inherent variability among images obtained from different vendors, even when imaging the same subject, can result in a noticeable reduction in performance accuracy. Addressing and managing this variability is essential for handling the challenges associated with generalization. Various applications require specific conditions to be satisfied, and these conditions can vary significantly between different applications. The design of networks, especially Transformer-based architectures, should be approached with careful consideration based on the unique nature and requirements of each application.

Several architectures have been developed utilizing exclusively Transformer models (Cao et al. 2022; Karimi et al. 2021; Wang et al. 2021). DAE-Former (Azad et al. 2023) is a dedicated Transformer architecture proposed for medical image segmentation, featuring dual attention blocks in both the encoder and decoder, along with cross-attention blocks in the skip connections to optimize segmentation results. The key elements of dual attention blocks include efficient attention (Shen et al. 2021), employed to reduce computational complexity from quadratic to linear. Additionally, transpose attention (Ali et al. 2021) is incorporated into these blocks to capture channel attention. This architectural choice is based on empirical evidence suggesting that combining spatial and channel attention enhances the model’s capacity to capture more contextual features (Guo et al. 2022).

Pure Transformer architectures demonstrate certain limitations when compared with hybrid architectures that effectively capture both local and global information. Combining Transformers with CNNs in hybrid architectures leverages the strengths of both models, allowing Transformers to capture global context and CNNs to learn local features (Chen et al. 2021). These adaptations have paved the way for the successful application of Transformers in tasks like stroke segmentation, addressing the unique requirements of medical image analysis. Hybrid Transformer-CNN models offer flexibility in the placement of the Transformer component.

In the Swin UNETR model (Hatamizadeh et al. 2021), the Swin Transformer took on the role of the encoder, and the encoded features from the Transformer were combined with the CNN-based decoder at various levels and resolutions. UNETR (Hatamizadeh et al. 2022) consisted of a Transformer-based encoder and a CNN-based decoder, featuring skip connections composed of convolutional-based blocks. In the TransUNET model (Chen et al. 2021), the Transformer was integrated into the encoder, where it processed tokenized image patches derived from a CNN-generated feature map. TransFuse (Zhang et al. 2021) introduced a unique approach, employing a dual-branch encoder, with one branch based on a CNN and the other solely on the Transformer. A novel technique, called BiFusion, fused multi-level features extracted from both branches. In nnFormer (Zhou et al. 2021), a combination of interleaved convolution and self-attention operations was employed. Additionally, nnFormer utilized skip attention, akin to the traditional concatenation approach seen in skip connections within UNet-like architectures. Transformers have also proven effective when utilized as the upsampling components within the decoder section (Li et al. 2022).

The Fully Convolutional Transformer (FCT) (Tragakis et al. 2023) was conceived to harness the strengths of CNNs for local feature representation and capitalize on Transformers’ proficiency in capturing long-range dependencies. The utilization of depth-wise convolutions in the projection layer obviates the necessity for positional embedding addition. Following the extraction of overlapping patches from an image, patch-based embeddings are incorporated, and MSA is subsequently calculated on these patches. In the FCT framework, a multi-branch convolutional paradigm is embraced to enhance spatial context. In this context, one layer applies a spatial convolution to the MSA output with a small kernel size, while other layers employ dilated convolutions with larger receptive fields. The integration of these outputs is facilitated by a fusion module known as Wide-Focus. Figure 7 illustrates several typical Transformer-based architectures for medical image segmentation, which have served as inspiration for many related models.

3 Datasets

In this section, we introduce the available datasets for stroke infarct segmentation, encompassing both ischemic and hemorrhagic strokes across both CT and MRI modalities. Each dataset comprises various modalities with a different number of cases. Table 2 provides a summary of these datasets.

3.1 ISLES dataset

The Ischemic Stroke Lesion Segmentation (ISLES) challenge offers publicly available datasets, released in 2015, 2017, 2018, and 2022. The objectives of the challenge varied each year, and distinct image modalities were provided in each one.

3.1.1 ISLES 2015

ISLES 2015 (Maier et al. 2017) is a publicly available dataset comprising two distinct sub-challenges: Sub-Acute Ischemic Stroke Lesion Segmentation (SISS) and Stroke Perfusion Estimation (SPES).

ISLES 2015-SISS consisted of 64 sub-acute ischemic cases, with 28 cases allocated for training and 36 cases for testing, and a voxel size of $1 {mm}^3$. These cases were collected from two different medical centers with variations in image resolution. Each case within the dataset was accompanied by four MRI modalities, namely T1-weighted, T2-weighted, DWI, and fluid-attenuated inversion recovery (FLAIR). Preprocessing steps, including skull-stripping and resampling to an isotropic space, were applied, and all modalities were registered to the FLAIR modality as a reference. Both the training and testing datasets included instances of single and multi-focal lesions, as well as large and small lesions.

ISLES 2015-SPES comprised 50 acute ischemic cases, with 30 cases designated for training and 20 cases for testing, with a voxel size of $2 {mm}^3$. Each case was accompanied by seven distinct image modalities, namely T1 contrast-enhanced (T1c), T2, DWI, cerebral blood flow (CBF), cerebral blood volume (CBV), time-to-peak (TTP), and time-to-max (Tmax). To ensure consistency, all modalities were registered on the T1c modality.

3.1.2 ISLES 2017

This dataset (Winzeck et al. 2018) represents an extension of the ISLES 2016 stroke lesion segmentation challenge, notable for its expansion in the number of acute ischemic cases. Specifically, the dataset increased from 35 training and 19 testing cases to 43 training and 32 testing cases. Furthermore, this dataset introduced a new set of MRI modalities, including Apparent diffusion coefficient (ADC), rBF, rBV, mean transit time (MTT), Tmax, TTP, and raw PWI, distinguishing it from the previously utilized modalities in the 2015 version. The preprocessing steps in ISLES 2017 were more concise, primarily encompassing registration and skull-stripping, while the voxel size and image resolutions exhibited variations.

3.1.3 ISLES 2018

The primary objective of the ISLES 2018 dataset (Cereda et al. 2016; Hakim et al. 2021) was to perform the segmentation of stroke lesions using computed tomography perfusion (CTP) images, guided by annotations derived from DWI images, which are considered the standard modality for this purpose. The dataset encompasses information from 103 acute ischemic cases with MRI images acquired within 3 hours of CTP. For training, 63 cases were designated, while the remaining 40 were reserved for testing. The input data for the algorithms consisted of various perfusion maps, including CBV, CBF, MTT, and Tmax.

3.1.4 ISLES 2022

The dataset (Hernandez Petzsche et al. 2022), sourced from three distinct medical centers, comprised information from 400 acute and sub-acute ischemic cases. Notably, this dataset featured a diverse range of infarct patterns, and a high degree of variability in terms of lesion size and location, with a mean number of 9.289 and a maximum of 126 unconnected ischemic regions per scan. The dataset exhibited heterogeneity due to the utilization of three different imaging devices, which serves as a valuable criterion for assessing the generalization of proposed methods. Modalities included in this dataset encompass DWI, ADC, and FLAIR images.

3.2 ATLAS dataset

The Anatomical Tracings of Lesions After Stroke (ATLAS) v2.0 dataset (Liew et al. 2022) contained high-resolution T1-weighted images for the segmentation of acute, sub-acute, and chronic stroke lesions. Significantly, this dataset was more extensive, containing more than four times the data volume of its predecessor, ATLAS v1.2. Aggregated from multiple centers worldwide, the dataset included data from 1,271 cases. Of these, 655 cases were allocated for training, 300 cases offered images only with hidden segmentation masks, and an additional 316 cases were entirely withheld to assess the generalizability of the proposed methods. The preprocessing steps applied to this dataset involved intensity normalization and registration on the MNI-152 template.

3.3 AISD dataset

The Acute ischemic stroke dataset (AISD) (Liang et al. 2021) comprised paired CT-MRI data for 397 acute ischemic stroke cases. The dataset included Non-Contrast-enhanced CT (NCCT) scans and DWI scans, which were acquired within 24 hours of the CT images. The segmentation labels were derived from the MRI scans, which served as the standard for this purpose.

3.4 APIS dataset

The APIS dataset (Gómez et al. 2023) was designed as a paired CT-MRI dataset with the objective of ischemic stroke lesion segmentation, utilizing NCCT images and annotations from ADC scans. The training set comprised 60 pairs of CT-MRI data, while the testing phase involved 36 NCCT scans exclusively. All cases underwent preprocessing steps, including skull-stripping and registration onto the ADC scans, to ensure data consistency and alignment.

3.5 Johns Hopkins University’s dataset

This dataset comprised 2888 MRI datasets from cases involving acute and early subacute stroke patients, along with corresponding annotations (Liu et al. 2023a). The data were collected over 10 years using 11 MRI scanners. For all patients, DWI images, B0, and ADC were provided, and nearly 98.8% of patients had additional MRI modalities, including T1, high-resolution T1 MPRAGE, T2, FLAIR, SWI, and PWI. DWI images were registered onto the standard MNI space, subjected to skull-stripping, and resampled to a voxel size of $1 {mm}^3$. The considerable diversity within this dataset renders it an excellent benchmark for evaluating proposed methods in the context of stroke segmentation. Nevertheless, it is important to note that access to this dataset is subject to certain restrictions.

3.6 Intracranial Hemorrhage Segmentation (IHS) dataset

The dataset comprised non-contrast CT scans from 36 patients diagnosed with various types of intracranial hemorrhage, including intraventricular, intraparenchymal, subarachnoid, epidural, and subdural hemorrhage (Hssayeni et al. 2020). Each scan was characterized by an average of 30 slices with a thickness of 5 mm. Annotations for the dataset were provided by two radiologists. Collected in 2018, this dataset is accessible from PhysioNet (Goldberger et al. 2000) with certain restrictions.

3.7 INSTANCE dataset

For the Intracranial Hemorrhage Segmentation on Non-Contrast Head CT (INSTANCE) challenge (Li et al. 2023b), a dataset comprising non-contrast CT scans from 200 patients diagnosed with various types of intracranial hemorrhage was assembled. The dataset allocation for different phases was as follows: 100 scans were designated for the training phase, 30 cases without ground truth labels were set aside for validation, and the remaining 70 cases were reserved for the final evaluation. The image size for each slice was $512 {\times } 512$, with the number of slices varying from 20 to 70 for each case. While the pixel size in each slice was $0.42 {mm}^2$, providing good intra-slice (in-plane) resolution, the slice thickness was 5 mm, resulting in lower inter-slice resolution.

Table 2 Summary of available datasets for stroke segmentation


4 Performance evaluation for stroke segmentation

Quantitative analysis of a segmentation process, evaluating its effectiveness in categorizing pixels or voxels into desired classes, is a crucial component of model evaluation. Many commonly used metrics rely on pixel-wise or voxel-wise calculations. The simplest method for assessing performance is through overall accuracy, defined as:

$$\begin{aligned} Overall \, Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$

(5)

where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively. However, overall accuracy may not provide sufficient insight, especially in imbalanced tasks such as stroke segmentation, where the background class dominates the TN count. To address this limitation, alternative metrics are widely used for a more nuanced evaluation of imbalanced semantic segmentation performance. One prominent metric is the Dice Similarity Coefficient (DSC), which measures the overlap between the predicted segmentation and the ground truth, ranging from 0 (no overlap) to 1 (perfect overlap). It is calculated as:

$$\begin{aligned} DSC= \frac{2{\times }TP}{2{\times }TP + FP + FN} \end{aligned}$$

(6)

Another valuable metric is the Intersection over Union (IoU), also known as the Jaccard Index, which assesses the ratio of the overlap to the total combined area, ranging from 0 (no similarity) to 1 (perfect similarity). The IoU is computed as:

$$\begin{aligned} IoU= \frac{TP}{TP + FP + FN} \end{aligned}$$

(7)
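To make the contrast between Eq. (5) and Eqs. (6) and (7) concrete, the following minimal sketch computes all three metrics from voxel-wise confusion counts. The function name `overlap_metrics` and the counts are hypothetical and chosen only for illustration; the sketch assumes at least one positive voxel in the prediction or the ground truth, otherwise the DSC and IoU denominators are zero.

```python
def overlap_metrics(tp, tn, fp, fn):
    """Return overall accuracy, DSC, and IoU from voxel-wise confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # Eq. (5)
    dsc = 2 * tp / (2 * tp + fp + fn)           # Eq. (6)
    iou = tp / (tp + fp + fn)                   # Eq. (7)
    return accuracy, dsc, iou


# Hypothetical slice of 10,000 voxels containing a 100-voxel lesion.
# The prediction recovers 50 lesion voxels and adds 50 false positives.
accuracy, dsc, iou = overlap_metrics(tp=50, tn=9850, fp=50, fn=50)
print(f"accuracy={accuracy:.3f}  DSC={dsc:.3f}  IoU={iou:.3f}")

# The two overlap metrics are monotonically related: DSC = 2 * IoU / (1 + IoU).
assert abs(dsc - 2 * iou / (1 + iou)) < 1e-12
```

In this example the overall accuracy is 0.990 although only half of the lesion is found, whereas DSC (0.500) and IoU (0.333) reflect the poor overlap because neither depends on the TN count.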

Both DSC and IoU provide comprehensive insights into the agreement between the predicted segmentation and the ground truth, making them particularly useful in scenarios with an imbalanced class distribution. There are also other commonly used metrics to measure stroke segmentation performance, including Precision, which assesses the accuracy of the positive predictions, Recall/Sensitivity, which gauges the ability to capture positive instances, and F1-score, a metric that strikes a balance between precision and recall by calculating their harmonic mean.

$$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$

(8)

$$\begin{aligned} Recall = \frac{TP}{TP + FN} \end{aligned}$$

(9)

$$\begin{aligned} F1-score = \frac{2{\times }Recall\,{\times }\,Precision}{Recall + Precision} \end{aligned}$$

(10)

F1-score is a specific form of the general $F_{\beta }-score$, in which the parameter $\beta $ controls the trade-off between recall and precision, particularly useful when there is uneven importance assigned to precision and recall. The formula for the $F_{\beta }-score$ is as follows:

$$\begin{aligned} F_{\beta }-score = \frac{(1 + {\beta }^2)\,{\times }\,Recall\,{\times }\,Precision}{Recall \,+\, {\beta }^2 \,{\times }\,Precision} \end{aligned}$$

(11)
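As a hedged illustration of Eqs. (8) to (11), the sketch below computes precision, recall, and the $F_{\beta }-score$ from confusion counts. The helper names and the counts are hypothetical. For binary segmentation, setting $\beta = 1$ reproduces the DSC of Eq. (6), which the final assertion checks.

```python
def precision_recall(tp, fp, fn):
    """Precision (Eq. 8) and recall (Eq. 9) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-beta score (Eq. 11); beta > 1 weights recall more, beta < 1 precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (recall + b2 * precision)


# Hypothetical counts: the model is more precise (0.75) than sensitive (0.60).
tp, fp, fn = 60, 20, 40
precision, recall = precision_recall(tp, fp, fn)
f1 = f_beta(precision, recall)              # Eq. (10)
f2 = f_beta(precision, recall, beta=2.0)    # emphasizes recall
f_half = f_beta(precision, recall, beta=0.5)  # emphasizes precision

# With recall below precision, the recall-weighted score is the lowest.
assert f2 < f1 < f_half

# For a binary mask, F1 coincides with the Dice Similarity Coefficient.
dsc = 2 * tp / (2 * tp + fp + fn)
assert abs(f1 - dsc) < 1e-12
```

Choosing $\beta > 1$ may be appropriate when missed lesion voxels are considered more costly than false alarms; the appropriate value is task dependent.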

Another metric worth considering is the Hausdorff Distance (HD), which measures the maximum distance between two segmentation sets. Utilizing the HD metric reflects the level of dissimilarity between predicted and ground truth boundaries. The HD is computed as follows:

$$\begin{aligned} \text {HD}(A, B) = \max \left( \sup _{a \in A} \inf _{b \in B} d(a, b), \sup _{b \in B} \inf _{a \in A} d(a, b)\right) \end{aligned}$$

(12)

Here, A and B represent two sets, and d(a, b) is the distance function between points a in set A and b in set B. For each point in A, the formula takes the distance to its nearest point in B and keeps the largest such value (the directed distance from A to B); it then does the same from B to A and returns the larger of the two directed distances. This measurement can be expressed in mm or in voxel/pixel units. Some papers also utilized additional measurements, including Simple Lesion Count (SLC), Volume Difference (VD), Average Volume Difference (AVD), Volumetric Overlap Error (VOE), Relative Volume Difference (RVD), and Average Symmetric Surface Distance (ASSD).
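A brute-force sketch of Eq. (12) for small, finite point sets is given below. It assumes both sets are non-empty lists of coordinate tuples (for example, boundary voxel indices) and uses the Euclidean distance for d(a, b); the function names and the toy coordinates are hypothetical.

```python
import math


def directed_hausdorff(a_points, b_points):
    """Largest distance from any point in A to its nearest point in B."""
    return max(min(math.dist(a, b) for b in b_points) for a in a_points)


def hausdorff(a_points, b_points):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a_points, b_points),
               directed_hausdorff(b_points, a_points))


# Toy 2D boundaries that agree on three points; the reference has one outlier.
predicted = [(0, 0), (0, 1), (1, 0), (1, 1)]
reference = [(0, 0), (0, 1), (1, 0), (4, 5)]

d_pred_to_ref = directed_hausdorff(predicted, reference)  # 1.0
d_ref_to_pred = directed_hausdorff(reference, predicted)  # 5.0
hd = hausdorff(predicted, reference)                      # 5.0
```

The two directed distances differ (1.0 versus 5.0), and a single distant point determines the final value, which illustrates why HD is sensitive to isolated false positives or missed fragments. To report the result in mm rather than voxel units, the coordinates would first be scaled by the voxel spacing.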

5 Stroke segmentation using transformers

In this section, we summarize the deep models introduced for stroke segmentation, with a focus on Transformer-based architectures. Several architectural designs, both CNN-based and Transformer-based, have been proposed or utilized for this purpose. There are several variations of these architectures, some examples of which are summarized in Fig. 8.

5.1 Earlier approaches for stroke segmentation

The majority of prior studies of stroke lesion segmentation have primarily focused on CNN models, with an emphasis on U-Net-based architectures (Clerigues et al. 2019; Basak and Rana 2020; Khezrpour et al. 2022; Liu et al. 2020; Kadry et al. 2021). Notably, the U-Net (Ronneberger et al. 2015) architecture, specifically designed for biomedical image segmentation, exhibits a distinctive U-shaped configuration, comprising an encoder segment dedicated to contextual feature extraction and a decoder segment tailored for accurate localization. The skip connections within this architecture pass fine-grained, high-resolution feature maps from the encoder path to the decoder path, where they are combined with the upsampled high-level features, enhancing segmentation performance. Some research studies have drawn inspiration from the DenseNet architecture for their neural network design (Zhang et al. 2018a). Table 3 summarizes the performance of selected CNN-based methods for stroke segmentation across various datasets. The inclusion of papers in the table is based on their demonstrated superior performance.

The bilateral quasi-symmetry property of the brain has been utilized in some research studies (Liang et al. 2021; Wang et al. 2016; Clèrigues et al. 2020; Vupputuri et al. 2018). In (Clèrigues et al. 2020) a patch-based deep learning pipeline was proposed, wherein the extraction of patches was carried out with a significant degree of overlap. These patches were subsequently fed into the neural network using a well-balanced sampling strategy, to mitigate issues associated with class imbalance. In (Praveen et al. 2018) a stacked sparse autoencoder network for unsupervised feature learning was employed, which was subsequently coupled with a Support Vector Machine (SVM) classifier to classify patches into normal and lesion categories.

The D-UNet (Zhou et al. 2019b) model was introduced for chronic stroke segmentation, and it incorporated a fusion of 2D and 3D convolutions in the encoder stage via a dimension transform block. This combination of 2D and 3D information facilitated more effective lesion identification. Additionally, a novel loss function called Enhance Mixing Loss (EML) was employed, which is a composite of the Focal loss (Lin et al. 2017) and Dice coefficient loss. By evaluating their proposed method on the ATLAS dataset, they achieved a mean Dice coefficient of 0.5349 for pixel-wise calculation and 0.7231 for voxel-wise calculation, respectively. In (Kumar et al. 2020) a hybrid Classifier-Segmenter network (CSNet) was introduced. Initially, the images were input into a classifier that distinguished slices with lesions. The chosen slices were then fed into a segmenter network, which utilized a fractal U-Net model for segmentation, and a final voting mechanism was employed to improve segmentation performance.
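The idea of combining a Focal loss with a Dice coefficient loss, mentioned above for D-UNet, can be sketched generically as follows. This is not the EML formulation of the D-UNet authors: the equal weighting, the `gamma` and `alpha` values, and the smoothing constant are assumptions made for illustration, and the inputs are flat lists of per-voxel lesion probabilities and binary labels.

```python
import math


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss; gamma down-weights well-classified voxels."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
        p_t = p if t == 1 else 1.0 - p            # probability of the true class
        alpha_t = alpha if t == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def soft_dice_loss(probs, targets, smooth=1.0):
    """One minus a smoothed Dice coefficient computed on probabilities."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(probs, targets, focal_weight=1.0, dice_weight=1.0):
    """Weighted sum of the two terms (weights are hypothetical)."""
    return (focal_weight * focal_loss(probs, targets)
            + dice_weight * soft_dice_loss(probs, targets))


targets = [1, 0, 0, 1]
good = combined_loss([0.9, 0.1, 0.1, 0.9], targets)
bad = combined_loss([0.1, 0.9, 0.9, 0.1], targets)
assert good < bad
```

The Dice term responds to region-level overlap, while the focal term concentrates the voxel-wise penalty on hard examples; the relative weighting would normally be tuned on validation data.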

In (Abulnaga and Rubin 2019) a CNN-based model was introduced that incorporated the pyramid pooling module, as introduced in PSPNet (Zhao et al. 2017). This module was utilized to extract global and local contextual information, enhancing the accuracy of stroke lesion segmentation. It achieves this by capturing global information through the use of varying kernel sizes and aggregating multi-scale region-based context. The best results were achieved by incorporating pretraining and utilizing the Focal Loss on the ISLES 2018 dataset.

In X-Net (Qi et al. 2019), a Feature Similarity Module (FSM) was implemented to capture long-range dependencies, thereby enhancing the segmentation process. This module was employed at the bottleneck between the encoder and decoder to investigate dense long-range contextual information. To reduce network size and control the number of parameters, mitigating the risk of overfitting, depthwise convolutions were also integrated. Their proposed network demonstrated superior performance compared to other architectures, including U-Net, SegNet (Badrinarayanan et al. 2017), PSPNet (Zhao et al. 2017), ResUNet (Zhang et al. 2018b), 2D Dense-UNet (Li et al. 2018), and DeepLabv3+ (Chen et al. 2018), as evaluated on the ATLAS dataset.

In (Liu et al. 2019a) a CNN-based pipeline was proposed that divides the network into two subnetworks. This approach involved using multi-kernels of various sizes to extract feature maps across different receptive fields. Post-processing techniques were also applied to retain edge details in the images and reduce noise. The optimal performance for their proposed pipeline was achieved by employing a dropout rate of 0.1. Bal et al. (2023) followed a similar approach and incorporated a local pathway and a global pathway within their proposed model, with larger kernel sizes employed in the global pathway to expand the receptive field for extracting long-range dependencies and global information. The best results in their proposed pipeline were obtained through the inclusion of preprocessing and data augmentation for the ISLES 2015 dataset.

Zhang et al. (2020) devised a pipeline centered around a Detection and Segmentation Network (DSN). They utilized a triple-branch architecture to extract predictions for slices in the axial, sagittal, and coronal planes separately. Subsequently, the predicted labels from different slices within each plane were fine-tuned and passed through a fusion module to obtain the final segmentation label. Their approach outperformed architectures such as U-Net, V-Net (Milletari et al. 2016), and DeepMedic (Kamnitsas et al. 2016) in validation using the ISLES-SISS dataset.

In (Huo et al. 2022) a model within the nnU-Net framework (Isensee et al. 2021) was introduced. Their approach incorporated four schemes: a generic U-Net, utilizing a TopK10 loss to improve performance in small lesion segmentation, a residual U-Net, and a self-training U-Net to enhance model diversity. An ensemble method was employed to combine the predicted results from these four networks, followed by post-processing techniques to enhance the results.
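The intuition behind a top-k loss can be sketched as follows, assuming that TopK10 denotes averaging the voxel-wise cross-entropy over only the hardest 10% of voxels. The function name, the flat-list inputs, and the example values are hypothetical, and the sketch is not taken from the nnU-Net code base.

```python
import math


def topk_cross_entropy(probs, targets, k_percent=10.0, eps=1e-7):
    """Mean binary cross-entropy over the k percent of voxels with highest loss."""
    losses = []
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        losses.append(-math.log(p if t == 1 else 1.0 - p))
    n_keep = max(1, int(len(losses) * k_percent / 100.0))
    hardest = sorted(losses, reverse=True)[:n_keep]
    return sum(hardest) / n_keep


# 18 easy background voxels and a 2-voxel lesion that the model under-predicts.
probs = [0.05] * 18 + [0.2] * 2
targets = [0] * 18 + [1] * 2

topk = topk_cross_entropy(probs, targets, k_percent=10.0)
mean_ce = topk_cross_entropy(probs, targets, k_percent=100.0)  # plain mean
assert topk > mean_ce
```

In this toy case the plain mean is dominated by the many easy background voxels, whereas the top-k variant is determined entirely by the two poorly predicted lesion voxels, which is consistent with its use for small lesions.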

In (Abramova et al. 2021) a 3D U-Net-based network for hemorrhagic stroke segmentation in CT scans was utilized. To enhance the representation of informative features, they incorporated Squeeze and Excitation (SE) blocks (Hu et al. 2018) in the bottleneck and last layers of their network. Through symmetric data augmentation and the implementation of a restrictive patch sampling approach, their proposed architecture achieved a mean Dice coefficient of 0.86 on a clinical dataset consisting of 76 cases.

Pool-UNet (Liu et al. 2022c) incorporated SE blocks in a novel module called DSE-ResNet placed in the bottleneck. This module captures interdependencies between channels to provide the most informative features for the decoder. Additionally, they combined the Poolformer structure (Yu et al. 2022), a transformer-like structure utilizing pooling operations, with CNNs to capture both local and global information. Evaluations on the ISLES 2018 dataset demonstrated the superior performance of their proposed architecture compared to architectures such as U-Net, R2UNet (Alom et al. 2018), and TransUNet. Chalcroft et al. (2023) employed Large Kernel Attention (Guo et al. 2023) to capture long-range dependencies, capitalizing on the inherent biases of convolutions. The Large Kernel Attention mechanism comprises a sequence of depth-wise convolutions, dilated depth-wise convolutions, and pointwise convolutions.

PerfU-Net (de Vries et al. 2023) was introduced for stroke segmentation from CT images. This architecture incorporated an attention module placed in the skip connections, with two variations tested: one featuring channel attention and the other incorporating both channel attention and temporal attention. The training process utilized the generalized Dice loss (Sudre et al. 2017) as the loss function. PerfU-Net achieved a mean Dice coefficient of 0.564 when evaluated on the ISLES 2018 dataset, utilizing 32 frames as the input to the model and employing channel attention as the attention module. Refer to Fig. 9 for an illustration of the proposed pipeline in PerfU-Net, and to Fig. 10 for a qualitative analysis of performance using the ISLES 2018 dataset under two conditions: with and without flipped scans. The authors observed that the use of flipped scans reduced the number of false positives.
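The generalized Dice loss referred to above weights each class by the inverse of its squared reference volume, so that a small lesion class contributes as much as the large background class. The sketch below implements this weighting for the binary case on flat lists of lesion probabilities and labels; the function name and the `eps` stabilizer are assumptions, and it is not the PerfU-Net training code.

```python
def generalized_dice_loss(probs, targets, eps=1e-7):
    """Two-class generalized Dice loss with inverse squared-volume class weights."""
    numerator = 0.0
    denominator = 0.0
    for cls in (0, 1):  # 0 = background, 1 = lesion
        ref = [1.0 if t == cls else 0.0 for t in targets]
        pred = [p if cls == 1 else 1.0 - p for p in probs]
        weight = 1.0 / (sum(ref) ** 2 + eps)
        numerator += weight * sum(r * q for r, q in zip(ref, pred))
        denominator += weight * sum(r + q for r, q in zip(ref, pred))
    return 1.0 - 2.0 * numerator / (denominator + eps)


targets = [1, 0, 0, 0]                    # one lesion voxel, three background
perfect = generalized_dice_loss([1.0, 0.0, 0.0, 0.0], targets)
missed = generalized_dice_loss([0.0, 0.0, 0.0, 0.0], targets)
assert perfect < missed
```

Predicting only background in this toy case yields a loss of about 0.625 even though three of four voxels are labeled correctly, because the single lesion voxel carries a much larger class weight.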

Table 3 Performance analysis of CNN-based approaches for stroke segmentation. ${}^{*}$ indicates the median


5.2 Transformer-based architectures for stroke segmentation

Vision Transformers have been employed in recent years for stroke segmentation, leveraging their capabilities, especially when combined with CNNs to capture local and global information from the input data. Table 4 summarizes the performance of selected Transformer-based methods for stroke segmentation across various datasets, and Table 5 summarizes the highlights and limitations of proposed Transformer-based architectures.

de Vries et al. (2021) proposed a hybrid Transformer-CNN pipeline for ischemic stroke infarct segmentation from CT perfusion scans. They considered the axial slices of 3D images as temporal information and incorporated the flipped and registered form of each slice as an additional channel in the input to exploit the brain’s bilateral quasi-symmetry property. The spatio-temporal data were fed into a Transformer block consisting of the Linformer (Wang et al. 2020) backbone to generate an attention map representing the probability of infarction. Subsequently, this attention map, along with the source data, was input into a traditional U-Net to produce the final segmentation. The Transformer part was trained using cross-entropy loss, while the U-Net was trained using the generalized Dice loss function.
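The use of a mirrored slice as an extra input channel can be sketched as follows. The sketch covers only the left-right flip and assumes the brain midline coincides with the vertical image center; the registration step described above is omitted, and the function names and toy intensities are hypothetical.

```python
def flip_left_right(slice_2d):
    """Mirror a 2D slice (list of rows) about its vertical center line."""
    return [row[::-1] for row in slice_2d]


def with_mirrored_channel(slice_2d):
    """Stack the slice and its mirror image as a two-channel input."""
    return [slice_2d, flip_left_right(slice_2d)]


# Toy slice with an abnormal intensity in the right half only.
slice_2d = [
    [0, 0, 7, 0],
    [0, 0, 9, 0],
    [0, 0, 0, 0],
]
stacked = with_mirrored_channel(slice_2d)

# Voxel-wise difference between the two channels highlights the asymmetry.
difference = [[a - b for a, b in zip(row, mirrored_row)]
              for row, mirrored_row in zip(*stacked)]
```

Symmetric tissue cancels out in `difference`, while a unilateral abnormality leaves non-zero values on both sides of the midline; a network receiving both channels can in principle learn a similar comparison.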

UCATR (Luo et al. 2021) was proposed for acute ischemic stroke segmentation from non-contrast CT images. A Transformer-based block succeeded the CNN encoder in the bottleneck, and irrelevant information was filtered by employing Multi-Head Cross-Attention modules in the skip connections. The proposed network was evaluated on a clinical dataset containing information from 11 patients, averaging 95 slices per patient, and it outperformed U-Net, Attention U-Net, and TransUNet by achieving a mean Dice coefficient of 0.7358. UTransNet (Feng et al. 2022) employed a novel module (CT block) consisting of two convolutional layers and a Transformer module to leverage the advantages of both CNNs and Transformers. To mitigate computational complexity within the self-attention mechanism, the PVT v2 Transformer (Wang et al. 2022c) was incorporated. In evaluations conducted on the ATLAS dataset, UTransNet achieved superior results compared to other Transformer-based methods, including TransUNet, SwinUNet (Cao et al. 2022), and UCTransNet (Wang et al. 2022a).

The Multi-Encoder Transformer (METrans) (Wang et al. 2022b) was proposed as a novel architecture with additional encoding modules. These modules extracted abstract features at distinct stages of the primary encoder path, and the multi-scale extracted features were subsequently fused. Following each convolutional module within the encoder, Convolutional Block Attention Modules (CBAM) (Woo et al. 2018) were employed to harness both channel attention and spatial attention. Furthermore, a Transformer-based block was integrated into the bottleneck to facilitate the extraction of global features.

Within the architecture of STHarDNet (Gu et al. 2022), HarDNet blocks (Chao et al. 2019) were employed in both the encoder and decoder paths. Notably, the Swin Transformer was incorporated exclusively in the initial layer of the skip connection, while the subsequent layers adhered to a conventional, plain structure. The performance of this network surpassed that of numerous CNN-based and Transformer-based counterparts when evaluated on the ATLAS dataset. LLRHNet (Liu et al. 2022b) implemented a dual-path approach for feature encoding, wherein the initial path utilized CNN layers to extract local information, and the subsequent path employed a Transformer-based block for encoding global features. The features extracted from these two paths were concatenated, and a final prediction was generated through a CNN decoder. To enhance information transfer from the CNN encoder to the decoder, the model integrated multi-level feature fusion skip connections, a departure from conventional skip connection methods. Evaluation of the LLRHNet on a clinical dataset for ischemic stroke segmentation demonstrated its superior performance by achieving a mean Dice coefficient of 0.791.

Wu et al. (2022) proposed an architecture consisting of three main elements. First, the Patch Partition Block (PPB) was employed to encode the image as a patch sequence, simultaneously reducing the number of parameters. Second, the Multi-scale Long-Range Interactive and Regional Attention (MLiRA) mechanism served as the encoder, comprising multiple subsampling Transformers (STR) followed by convolutional blocks. Within STR, subsampling Multi-head Interactive Self-Attention mechanisms were utilized to capture dimensional interactive attention. Moreover, STR exhibited flexibility in adjusting input resolution to attain global information at various spatial resolutions. Third, the Feature Interpolation Path (FIP) was utilized as the decoder, facilitating the recovery of encoded features to the original image resolution.

Marcus et al. (2023) developed a multi-task Transformer-based network for age estimation and segmentation of ischemic lesions from CT images. Their architecture was rooted in the DETR architecture (Carion et al. 2020) with certain modifications. The primary components of the proposed network included: 1) a CNN encoder consisting of four ResNeXt (Xie et al. 2017) blocks to generate an activation map; 2) a Transformer encoder-decoder, commencing with a pyramid pooling module (Zhao et al. 2017) to augment the receptive field, followed by a Transformer block using the gated positional self-attention mechanism (d’Ascoli et al. 2021); 3) heads for age estimation and bounding box predictions; and 4) a CNN decoder serving as the segmentation head. Based on evaluations of their proposed pipeline on a large clinical dataset consisting of 776 CT images collected from two medical centers, they reached a mean Dice coefficient of 0.382. To evaluate the generalizability of the trained network on unseen data, they also utilized the ISLES 2018 dataset as the test dataset and reached a mean Dice coefficient of 0.203.

A pipeline designed for the efficient processing of 3D image data while preserving volumetric information was introduced in (Zhang and Chen 2023). Their proposed architecture, named PDSwin, leverages the Swin Transformer with a pyramidal downsampling approach, spatially downsampling 2D slices. Additionally, the authors addressed the domain shift issue arising from diverse image acquisition centers by proposing a cluster-based domain adversarial algorithm. In (Xu and Ding 2023) a U-shaped network was employed to segment stroke infarct from CT scans. In the CNN-based encoder path, multiple CBAM blocks were incorporated. The features extracted from two scales of the encoder path were flattened and inputted into a Transformer module, encompassing several deformable Transformer layers utilizing deformable self-attention (Zhu et al. 2020).

Soh et al. (2023) introduced a Hybrid UNet and Transformer (HUT) network, comprising two parallel stages: a UNet and a Transformer block. The input to the Transformer block consisted of intermediate features extracted by the CNN encoder of the UNet. The output of the Transformer block was then fused with the extracted features from the encoder path in the skip connections at two different scales. In SAMIHS (Wang et al. 2023), a parameter-efficient fine-tuning strategy was applied to the Segment Anything Model (SAM) (Kirillov et al. 2023) to segment hemorrhagic stroke. To improve segmentation results, they utilized a combination of the binary cross-entropy loss and a boundary-sensitive loss.

Some studies have sought to make the boundaries of predicted stroke segmentation masks more realistic. One such approach was TransRender (Wu et al. 2023b), which was proposed to address the issue of overly smooth boundaries. It achieved this by adaptively selecting specific points for computing the boundary features in a point-based rendering manner, intending to enhance the fidelity of the boundary estimation. The hierarchical Transformer-based encoder path captured global information across multiple scales, with additional parallel CNN blocks employed to capture local information. Both local and global features were then provided as input to multiple render modules. These modules, by selecting specific uncertain points and extracting feature representations for those points, facilitated the re-prediction of these uncertain points as boundary points.

Another approach to enhance boundary estimation accuracy was W-Net (Wu et al. 2023a), which introduced a Boundary Deformation Module (BDM) and a Boundary Constraint Module (BCM) to address fuzzy boundaries. W-Net integrated both a CNN network and a Transformer-based network as backbone networks. Initially, a U-shaped CNN network was employed for coarse segmentation, leveraging the advantages of CNNs to extract local features. In the decoder path, features of different scales were inputted into proposed BDM blocks for further optimization through iterative boundary deformation, correcting the initially predicted boundaries using circular convolutions. The second stage of W-Net consisted of a Transformer-based U-shaped network. The output of the BDM blocks was fused with the encoder features at multiple levels. The decoder utilized BCM blocks to refine the encoded global features by constraining the boundary curves, employing several parallel dilated convolution layers. Refer to Fig. 11 for an illustration of the W-Net architecture, and Figs. 12 and 13 for qualitative analyses on the ATLAS and ISLES 2022 datasets, respectively.

Table 4 Performance analysis of Transformer-based approaches for stroke segmentation


Table 5 A summary of Transformer-based approaches for stroke segmentation


6 Open challenges and future directions

Current solutions for stroke segmentation, whether they employ CNN networks or Transformer-based architectures, have shown less satisfactory results compared to tasks such as tumor segmentation (Ranjbarzadeh et al. 2023; Liu et al. 2023b). The underlying cause of this suboptimal performance is attributed to various factors, including high variability in the location, number, size, and pattern of the infarct. Furthermore, the intensity differences resulting from the varied imaging vendors and stroke ages pose a significant challenge for automated algorithms. The proposed methods must effectively distinguish between healthy and infarcted regions of the brain while accommodating the diverse variability introduced by different medical imaging systems and inherent stroke features.

An additional crucial feature of proposed methods in stroke segmentation is their generalizability to unseen data from different vendors, which is a prerequisite for practical use. Current methods often exhibit a lack of generalization, with testing on unseen data acquired from diverse centers leading to significantly lower performance. It is imperative to investigate and improve the robustness of these methods to handle unseen data effectively. Exploring domain adaptation methods could prove beneficial in achieving this objective.

Another inherent characteristic of stroke infarcts is their ability to affect various parts of the brain at different stages, exhibiting different sizes in different locations. It is crucial for segmentation methods to accurately handle multi-instance infarcts of varying sizes. Despite the high values of the Dice coefficient suggesting good overall performance, instance-wise measurements often yield lower results (Kofler et al. 2023). This discrepancy is primarily due to the neglect of small infarcts in the presence of larger ones when calculating commonly used metrics. The currently proposed architectures encounter challenges in detecting and segmenting small infarcts, resulting from information loss within the deeper layers of the networks designed to represent abstract features. Thoughtful improvements are necessary to adapt proposed pipelines for the segmentation of small infarcts.
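The gap between a high global Dice coefficient and poor instance-wise behavior can be reproduced with a deliberately simplified one-dimensional example, in which lesions are runs of consecutive positive voxels. The lesion-wise detection rate used here is only an illustrative stand-in and is not necessarily the instance-wise measure used by Kofler et al. (2023); real 3D data would require connected-component labeling, and all names and values are hypothetical.

```python
def dice(pred, truth):
    """Global Dice coefficient over all voxels of two binary masks."""
    tp = sum(p * t for p, t in zip(pred, truth))
    return 2 * tp / (sum(pred) + sum(truth))


def lesions_1d(mask):
    """Return (start, end) index pairs for each run of consecutive 1s."""
    runs, start = [], None
    for i, value in enumerate(mask):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(mask)))
    return runs


def detected_fraction(pred, truth):
    """Fraction of reference lesions overlapped by at least one predicted voxel."""
    lesions = lesions_1d(truth)
    hits = sum(any(pred[s:e]) for s, e in lesions)
    return hits / len(lesions)


# One large lesion (40 voxels) and one small lesion (2 voxels).
truth = [0] * 5 + [1] * 40 + [0] * 10 + [1] * 2 + [0] * 3
# The prediction segments the large lesion perfectly and misses the small one.
pred = [0] * 5 + [1] * 40 + [0] * 15

global_dice = dice(pred, truth)
lesion_recall = detected_fraction(pred, truth)
```

Here the global Dice coefficient is about 0.976 although only one of the two lesions is detected, because the two missed voxels contribute little to the voxel-wise counts.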

Transformers are widely employed in medical image segmentation due to their ability to represent global information and capture long-range dependencies, providing a robust representation of shape-based features for segmentation. Given the high variability in stroke infarct shape, location, and pattern, the incorporation of texture information derived from CNNs appears to be more beneficial. Hybrid CNN-Transformer architectures address this challenge by combining texture-based and global information. However, a careful selection of how to employ Transformer blocks is necessary to fully leverage their advantages for the integration of texture information alongside global features to improve the effectiveness of the stroke segmentation methodologies.

A notable challenge in the effective application of Transformers stems from their dependence on large datasets, which becomes especially pronounced in the medical field. The limited availability of labeled data, coupled with difficulties in acquiring annotations and privacy considerations, imposes constraints on the accessibility of medical data. Although pre-training on alternative datasets, including natural images, might be considered, it is suboptimal due to the inherent domain shift. Medical images often have high dimensions, resulting in a large number of parameters that demand significant computational resources. The increased parameter count, combined with a limited dataset, can lead to overfitting issues. Various strategies, such as slicing 3D data from different anatomical planes or adopting patch-based data inputting, have been attempted previously. However, these approaches can result in the loss of valuable information. Exploring realistic data augmentation techniques and optimizing data input methods are crucial avenues to investigate in order to mitigate these challenges.

To establish effective pipelines for stroke segmentation, it is crucial to incorporate additional characteristics, such as privacy-preserving algorithms (Sheller et al. 2020; Li et al. 2021, 2019), and embrace explainable artificial intelligence (XAI) (Alicioglu and Sun 2022; Mondal et al. 2021; Singh et al. 2020) as integral components of trustworthy AI (Liu et al. 2022a). In recent years, researchers have dedicated their efforts to interpret deep learning-based models, which were previously considered black boxes. Understanding the decision-making process of Transformers can enhance predictions and facilitate their use in aiding decision-making for medical diagnoses. Additionally, exploring privacy-preserving algorithms is imperative. This investigation aims to provide a platform that can be utilized in different centers, allowing the sharing of knowledge derived from diverse datasets without compromising the privacy of medical information. This approach aims to continually enhance the performance of the underlying pipeline.

7 Discussion and conclusion

In this paper, we present a comprehensive review of Transformer-based architectures for segmenting stroke infarcts from MRI and CT images. We begin by offering preliminary information on the concepts of self-attention, vision Transformers, and several benchmark Transformer-based networks designed for medical image segmentation. Following this, we delve into details about available datasets for stroke segmentation, encompassing both ischemic and hemorrhagic strokes for both MRI and CT modalities. Subsequently, we discuss commonly used metrics for evaluating segmentation performance and review the literature on stroke segmentation using deep learning methods. Given that a significant portion of previous research has been conducted using CNNs, we selectively extract and summarize the key ideas with superior performance. Specifically focusing on Transformer-based architectures, we review 14 papers, all employing hybrid CNN-Transformer architectures. For each paper, we offer a high-level abstraction of the core techniques utilized in these networks. Additionally, we present comparison tables for quantitative evaluations of performance, considering both CNN-based and Transformer-based networks. Finally, we outline the unsolved challenges associated with stroke segmentation and suggest potential directions for future research.

Data availability

No datasets were generated or analysed during the current study.

Abbreviations

MRI: Magnetic resonance imaging
CT: Computed tomography
DWI: Diffusion-weighted imaging
PWI: Perfusion-weighted imaging
FLAIR: Fluid attenuated inversion recovery
T1c: T1 contrast-enhanced
CBF: Cerebral blood flow
CBV: Cerebral blood volume
CTP: Computed tomography perfusion
NCCT: Non-contrast-enhanced CT
ADC: Apparent diffusion coefficient
TTP: Time-to-peak
Tmax: Time-to-max
MTT: Mean transit time
ISLES: Ischemic stroke lesion segmentation
SISS: Sub-acute ischemic stroke lesion segmentation
SPES: Stroke perfusion estimation
ATLAS: Anatomical tracings of lesions after stroke
AISD: Acute Ischemic Stroke Dataset
IHS: Intracranial Hemorrhage Segmentation
INSTANCE: Intracranial Hemorrhage Segmentation on Non-Contrast Head CT
DL: Deep learning
CNN: Convolutional Neural Networks
SA: Self-attention
MSA: Multi-head self-attention
MLP: Multi-layer perceptron
LN: Layer normalization
SE: Squeeze and Excitation
2D: Two dimensional
3D: Three dimensional
SVM: Support Vector Machine
TP: True positives
TN: True negatives
FP: False positives
FN: False negatives
DSC: Dice Similarity Coefficient
IoU: Intersection over Union
HD: Hausdorff Distance
SLC: Simple lesion count
VD: Volume difference
AVD: Average volume difference
VOE: Volumetric overlap error
RVD: Relative volumetric difference
ASSD: Average symmetric surface distance

References

Abbasi H, Orouskhani M, Asgari S et al. (2023) Automatic brain ischemic stroke segmentation with deep learning: a review. Neurosci Inform 3(4):100145
Abramova V, Clerigues A, Quiles A et al. (2021) Hemorrhagic stroke lesion segmentation using a 3D U-Net with squeeze-and-excitation blocks. Comput Med Imaging Graph 90:101908
Abulnaga SM, Rubin J (2019) Ischemic stroke lesion segmentation in CT perfusion scans using pyramid pooling and focal loss. In: Brainlesion: Glioma, multiple sclerosis, stroke and traumatic brain injuries: 4th international workshop, BrainLes 2018, Held in conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Revised Selected Papers, Part I 4, Springer, pp 352–363
Ali A, Touvron H, Caron M et al. (2021) Xcit: Cross-covariance image transformers. Adv Neural Inf Process Syst 34:20014–20027
Alicioglu G, Sun B (2022) A survey of visual analytics for explainable artificial intelligence methods. Comput & Graph 102:502–520
Alom MZ, Hasan M, Yakopcic C, et al. (2018) Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation. arXiv preprint arXiv:1802.06955
Aoki J, Kimura K, Iguchi Y et al. (2010) Flair can estimate the onset time in acute ischemic stroke patients. J Neurol Sci 293(1–2):39–44
Article Google Scholar
Azad R, Arimond R, Aghdam EK, et al. (2023) Dae-former: Dual attention-guided efficient transformer for medical image segmentation. In: International Workshop on PRedictive Intelligence In MEdicine, Springer, pp 83–95
Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 39(12):2481–2495
Article Google Scholar
Balakrishnan G, Zhao A, Sabuncu MR et al. (2019) Voxelmorph: a learning framework for deformable medical image registration. IEEE Trans Med Imaging 38(8):1788–1800
Article Google Scholar
Bal A, Banerjee M, Chaki R, et al. (2023) A robust ischemic stroke lesion segmentation technique using two-pathway 3d deep neural network in mr images. Multimedia Tools Appl pp. 1–40
Basak H, Rana A (2020) F-unet: A modified u-net architecture for segmentation of stroke lesion. In: International Conference on Computer Vision and Image Processing, Springer, pp 32–43
Cao H, Wang Y, Chen J, et al. (2022) Swin-unet: Unet-like pure transformer for medical image segmentation. In: European conference on computer vision, Springer, pp 205–218
Carion N, Massa F, Synnaeve G, et al. (2020) End-to-end object detection with transformers. In: European conference on computer vision, Springer, pp 213–229
Cereda CW, Christensen S, Campbell BC et al. (2016) A benchmarking tool to evaluate computer tomography perfusion infarct core predictions against a dwi standard. J Cereb Blood Flow Metab 36(10):1780–1789
Article Google Scholar
Chalcroft L, Pereira RL, Brudfors M, et al. (2023) Large-kernel attention for efficient and robust brain lesion segmentation. arXiv preprint arXiv:2308.07251
Chalela JA, Kidwell CS, Nentwich LM et al. (2007) Magnetic resonance imaging and computed tomography in emergency assessment of patients with suspected acute stroke: a prospective comparison. Lancet 369(9558):293–298
Article Google Scholar
Chao P, Kao CY, Ruan YS, et al. (2019) Hardnet: A low memory traffic network. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 3552–3561
Chen LC, Zhu Y, Papandreou G, et al. (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV), pp 801–818
Chen LC, Papandreou G, Kokkinos I et al. (2017) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 40(4):834–848
Article Google Scholar
Chen J, Lu Y, Yu Q, et al. (2021) Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306
Clerigues A, Valverde S, Bernal J et al. (2019) Acute ischemic stroke lesion core segmentation in ct perfusion images using fully convolutional neural networks. Comput Biol Med 115:103487
Article Google Scholar
Clèrigues A, Valverde S, Bernal J et al. (2020) Acute and sub-acute stroke lesion segmentation from multimodal mri. Comput Methods Programs Biomed 194:105521
Article Google Scholar
d’Ascoli S, Touvron H, Leavitt ML, et al. (2021) Convit: Improving vision transformers with soft convolutional inductive biases. In: International Conference on Machine Learning, PMLR, pp 2286–2296
de Vries L, Emmer BJ, Majoie CB et al. (2023) Perfu-net: Baseline infarct estimation from ct perfusion source data for acute ischemic stroke. Med Image Anal 85:102749
Article Google Scholar
de Vries L, Emmer B, Majoie C, et al. (2021) Transformers for ischemic stroke infarct core segmentation from spatio-temporal ct perfusion scans. In: Medical Imaging with Deep Learning
Dimyan MA, Cohen LG (2011) Neuroplasticity in the context of motor rehabilitation after stroke. Nat Rev Neurol 7(2):76–85
Article Google Scholar
Dosovitskiy A, Beyer L, Kolesnikov A, et al. (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Feigin VL, Brainin M, Norrving B et al. (2022) World stroke organization (wso): global stroke fact sheet 2022. Int J Stroke 17(1):18–29
Article Google Scholar
Feng P, Ni B, Cai X, et al. (2022) Utransnet: Transformer within u-net for stroke lesion segmentation. In: 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD), IEEE, pp 359–364
Fiebach J, Schellinger P, Jansen O et al. (2002) Ct and diffusion-weighted mr imaging in randomized order: diffusion-weighted imaging results in higher accuracy and lower interrater variability in the diagnosis of hyperacute ischemic stroke. Stroke 33(9):2206–2210
Article Google Scholar
Flossmann E, Redgrave JN, Briley D et al. (2008) Reliability of clinical diagnosis of the symptomatic vascular territory in patients with recent transient ischemic attack or minor stroke. Stroke 39(9):2457–2460
Article Google Scholar
Goldberger AL, Amaral LA, Glass L et al. (2000) Physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals. Circulation 101(23):e215–e220
Article Google Scholar
Goldstein LB, Simel DL (2005) Is this patient having a stroke? JAMA 293(19):2391–2402
Article Google Scholar
Gómez S, Mantilla D, Garzón G, et al. (2023) Apis: A paired ct-mri dataset for ischemic stroke segmentation challenge. arXiv preprint arXiv:2309.15243
Grysiewicz RA, Thomas K, Pandey DK (2008) Epidemiology of ischemic and hemorrhagic stroke: incidence, prevalence, mortality, and risk factors. Neurol Clin 26(4):871–895
Article Google Scholar
Gu Y, Piao Z, Yoo SJ (2022) Sthardnet: Swin transformer with hardnet for mri segmentation. Appl Sci 12(1):468
Article Google Scholar
Guo MH, Xu TX, Liu JJ et al. (2022) Attention mechanisms in computer vision: A survey. Comput Visual Media 8(3):331–368
Article Google Scholar
Guo MH, Lu CZ, Liu ZN et al. (2023) Visual attention network. Comput Visual Media 9(4):733–752
Article Google Scholar
Hakim A, Christensen S, Winzeck S et al. (2021) Predicting infarct core from computed tomography perfusion in acute ischemia with machine learning: Lessons from the isles challenge. Stroke 52(7):2328–2337
Article Google Scholar
Hatamizadeh A, Nath V, Tang Y, et al. (2021) Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In: International MICCAI Brainlesion Workshop, Springer, pp 272–284
Hatamizadeh A, Tang Y, Nath V, et al. (2022) Unetr: Transformers for 3d medical image segmentation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 574–584
He K, Gan C, Li Z et al. (2023) Transformers in medical image analysis. Intell Med 3(1):59–78
Article Google Scholar
Hernandez Petzsche MR, de la Rosa E, Hanning U et al. (2022) Isles 2022: A multi-center magnetic resonance imaging stroke lesion segmentation dataset. Sci Data 9(1):762
Article Google Scholar
Hssayeni MD, Croock MS, Salman AD et al. (2020) Intracranial hemorrhage segmentation using a deep convolutional model. Data 5(1):14
Article Google Scholar
Hu X, Luo W, Hu J et al. (2020) Brain segnet: 3d local refinement network for brain lesion segmentation. BMC Med Imaging 20:1–10
Article Google Scholar
Huang G, Liu Z, Van Der Maaten L, et al. (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4708
Hui H, Zhang X, Li F et al. (2020) A partitioning-stacking prediction fusion network based on an improved attention u-net for stroke lesion segmentation. IEEE Access 8:47419–47432
Article Google Scholar
Huo J, Chen L, Liu Y, et al. (2022) Mapping: Model average with post-processing for stroke lesion segmentation. arXiv preprint arXiv:2211.15486
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
Hwang DY, Silva GS, Furie KL et al. (2012) Comparative sensitivity of computed tomography vs. magnetic resonance imaging for detecting acute posterior fossa infarct. J Emerg Med 42(5):559–565
Article Google Scholar
Isensee F, Jaeger PF, Kohl SA et al. (2021) nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 18(2):203–211
Article Google Scholar
Islam M, Ren H (2018) Class balanced pixelnet for neurological image segmentation. In: Proceedings of the 2018 6th International Conference on Bioinformatics and Computational Biology, pp 83–87
Jia X, Bartlett J, Zhang T, et al. (2022) U-net vs transformer: Is u-net outdated in medical image registration? In: International Workshop on Machine Learning in Medical Imaging, Springer, pp 151–160
Kadry S, Damaševičius R, Taniar D, et al. (2021) U-net supported segmentation of ischemic-stroke-lesion from brain mri slices. In: 2021 Seventh International conference on Bio Signals, Images, and Instrumentation (ICBSII), IEEE, pp 1–5
Kamnitsas K, Ferrante E, Parisot S, et al. (2016) Deepmedic for brain tumor segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: Second International Workshop, BrainLes 2016, with the Challenges on BRATS, ISLES and mTOP 2016, Held in Conjunction with MICCAI 2016, Athens, Greece, October 17, 2016, Revised Selected Papers 2, Springer, pp 138–149
Karimi D, Vasylechko SD, Gholipour A (2021) Convolution-free medical image segmentation using transformers. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, Springer, pp 78–88
Khezrpour S, Seyedarabi H, Razavi SN et al. (2022) Automatic segmentation of the brain stroke lesions from mr flair scans using improved u-net framework. Biomed Signal Process Control 78:103978
Article Google Scholar
Kirillov A, Mintun E, Ravi N, et al. (2023) Segment anything. arXiv preprint arXiv:2304.02643
Kofler F, Möller H, Buchner JA, et al. (2023) Panoptica–instance-wise evaluation of 3d semantic and instance segmentation maps. arXiv preprint arXiv:2312.02608
Kumar A, Upadhyay N, Ghosal P et al. (2020) Csnet: A new deepnet framework for ischemic stroke lesion segmentation. Comput Methods Programs Biomed 193:105524
Article Google Scholar
Li X, Chen H, Qi X et al. (2018) H-denseunet: hybrid densely connected unet for liver and tumor segmentation from ct volumes. IEEE Trans Med Imaging 37(12):2663–2674
Article Google Scholar
Li J, Chen J, Tang Y et al. (2023) Transforming medical imaging with transformers? a comparative review of key properties, current progresses, and future perspectives. Med Image Anal 85:102762
Article Google Scholar
Liang K, Han K, Li X, et al. (2021) Symmetry-enhanced attention network for acute ischemic infarct segmentation with non-contrast ct images. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VII 24, Springer, pp 432–441
Li Y, Cai W, Gao Y, et al. (2022) More than encoder: Introducing transformer decoder to upsample. In: 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, pp 1597–1602
Liew SL, Lo BP, Donnelly MR et al. (2022) A large, curated, open-source stroke neuroimaging dataset to improve lesion segmentation algorithms. Sci Data 9(1):320
Article Google Scholar
Li X, Huang K, Yang W, et al. (2019) On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189
Li X, Jiang M, Zhang X, et al. (2021) Fedbn: Federated learning on non-iid features via local batch normalization. arXiv preprint arXiv:2102.07623
Li X, Luo G, Wang K, et al. (2023b) The state-of-the-art 3d anisotropic intracranial hemorrhage segmentation on non-contrast head ct: The instance challenge. arXiv preprint arXiv:2301.03281
Lin TY, Goyal P, Girshick R, et al. (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp 2980–2988
Liu L, Wu FX, Wang J (2019) Efficient multi-kernel dcnn with pixel dropout for stroke mri segmentation. Neurocomputing 350:117–127
Article Google Scholar
Liu X, Yang H, Qi K et al. (2019) Msdf-net: Multi-scale deep fusion network for stroke lesion segmentation. IEEE Access 7:178486–178495
Article Google Scholar
Liu L, Kurgan L, Wu FX et al. (2020) Attention convolutional neural network for accurate segmentation and quantification of lesions in ischemic stroke disease. Med Image Anal 65:101791
Article Google Scholar
Liu H, Wang Y, Fan W et al. (2022) Trustworthy ai: A computational perspective. ACM Trans Intell Syst Technol 14(1):1–59
Article Google Scholar
Liu L, Wang Y, Chang J et al. (2022) Llrhnet: Multiple lesions segmentation using local-long range features. Front Neuroinform 16:859973
Article Google Scholar
Liu CF, Leigh R, Johnson B et al. (2023) A large public dataset of annotated clinical mris and metadata of patients with acute stroke. Sci Data 10(1):548
Article Google Scholar
Liu Z, Tong L, Chen L et al. (2023) Deep learning based brain tumor segmentation: a survey. Complex & Intell Syst 9(1):1001–1026
Article Google Scholar
Liu Z, Lin Y, Cao Y, et al. (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
Liu R, Pu W, Zou Y, et al. (2022c) Pool-unet: Ischemic stroke segmentation from ct perfusion scans using poolformer unet. In: 2022 6th Asian Conference on Artificial Intelligence Technology (ACAIT), IEEE, pp 1–6
Lucas C, Kemmling A, Mamlouk AM, et al. (2018) Multi-scale neural network for automatic segmentation of ischemic strokes on acute perfusion images. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE, pp 1118–1121
Luo J, Dai P, He Z, et al. (2024) Deep learning models for ischemic stroke lesion segmentation in medical images: A survey. Comput Biol Med p 108509
Luo C, Zhang J, Chen X, et al. (2021) Ucatr: Based on cnn and transformer encoding and cross-attention decoding for lesion segmentation of acute ischemic stroke in non-contrast computed tomography images. In: 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), IEEE, pp 3565–3568
Maier O, Menze BH, Von der Gablentz J et al. (2017) Isles 2015-a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral mri. Med Image Anal 35:250–269
Article Google Scholar
Marcus A, Bentley P, Rueckert D (2023) Concurrent ischemic lesion age estimation and segmentation of ct brain using a transformer-based network. IEEE Trans Med Imaging
Meyer MJ, Pereira S, McClure A et al. (2015) A systematic review of studies reporting multivariable models to predict functional outcomes after post-stroke inpatient rehabilitation. Disabil Rehabil 37(15):1316–1323
Article Google Scholar
Milletari F, Navab N, Ahmadi SA (2016) V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 2016 fourth international conference on 3D vision (3DV), Ieee, pp 565–571
Mondal AK, Bhattacharjee A, Singla P et al. (2021) xvitcos: explainable vision transformer based covid-19 screening using radiography. IEEE J Transl Eng Health Med 10:1–10
Article Google Scholar
Ni H, Xue Y, Wong K, et al. (2022) Asymmetry disentanglement network for interpretable acute ischemic stroke infarct segmentation in non-contrast ct scans. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp 416–426
O’Shea K, Nash R (2015) An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458
Pereira S, Pinto A, Amorim J et al. (2019) Adaptive feature recombination and recalibration for semantic segmentation with fully convolutional networks. IEEE Trans Med Imaging 38(12):2914–2925
Article Google Scholar
Praveen G, Agrawal A, Sundaram P et al. (2018) Ischemic stroke lesion segmentation using stacked sparse autoencoder. Comput Biol Med 99:38–52
Article Google Scholar
Qi K, Yang H, Li C, et al. (2019) X-net: Brain stroke lesion segmentation based on depthwise separable convolution and long-range dependencies. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part III 22, Springer, pp 247–255
Ranjbarzadeh R, Caputo A, Tirkolaee EB et al. (2023) Brain tumor segmentation of mri images: A comprehensive review on the application of artificial intelligence tools. Comput Biol Med 152:106405
Article Google Scholar
Rao Y, Zhao W, Liu B et al. (2021) Dynamicvit: Efficient vision transformers with dynamic token sparsification. Adv Neural Inf Process Syst 34:13937–13949
Google Scholar
Ronneberger O, Fischer P, Brox T (2015) U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, Springer, pp 234–241
Rubin J, Abulnaga SM (2019) Ct-to-mr conditional generative adversarial networks for ischemic stroke lesion segmentation. In: 2019 IEEE International Conference on Healthcare Informatics (ICHI), IEEE, pp 1–7
Shamshad F, Khan S, Zamir SW, et al. (2023) Transformers in medical imaging: A survey. Med Image Anal p 102802
Sheller MJ, Edwards B, Reina GA et al. (2020) Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci Rep 10(1):12598
Article Google Scholar
Shen Z, Zhang M, Zhao H, et al. (2021) Efficient attention: Attention with linear complexities. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3531–3539
Simonsen CZ, Madsen MH, Schmitz ML, et al. (2015) Sensitivity of diffusion-and perfusion-weighted imaging for diagnosing acute ischemic stroke is 97.5%. Stroke 46(1):98–101
Singh A, Sengupta S, Lakshminarayanan V (2020) Explainable deep learning models in medical image analysis. J Imaging 6(6):52
Article Google Scholar
Soh WK, Yuen HY, Rajapakse JC (2023) Hut: Hybrid unet transformer for brain lesion and tumour segmentation. Heliyon
Sudre CH, Li W, Vercauteren T, et al. (2017) Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, Proceedings 3, Springer, pp 240–248
Thiyagarajan SK, Murugan K (2021) A systematic review on techniques adapted for segmentation and classification of ischemic stroke lesions from brain mr images. Wireless Pers Commun 118(2):1225–1244
Article Google Scholar
Tomita N, Jiang S, Maeder ME, et al. (2020) Automatic post-stroke lesion segmentation on mr images using 3d residual convolutional neural network. NeuroImage: Clin 27:102276
Tragakis A, Kaul C, Murray-Smith R, et al. (2023) The fully convolutional transformer for medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 3660–3669
Vaswani A, Shazeer N, Parmar N, et al. (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Vupputuri A, Dighade S, Prasanth P, et al. (2018) Symmetry determined superpixels for efficient lesion segmentation of ischemic stroke from mri. In: 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), IEEE, pp 742–745
Wang J, Wang S, Liang W (2022) Metrans: Multi-encoder transformer for ischemic stroke segmentation. Electron Lett 58(9):340–342
Article Google Scholar
Wang W, Xie E, Li X et al. (2022) Pvt v2: Improved baselines with pyramid vision transformer. Comput Visual Media 8(3):415–424
Article Google Scholar
Wang H, Cao P, Wang J, et al. (2022a) Uctransnet: rethinking the skip connections in u-net from a channel-wise perspective with transformer. In: Proceedings of the AAAI conference on artificial intelligence, pp 2441–2449
Wang Y, Chen K, Yuan W, et al. (2023) Samihs: Adaptation of segment anything model for intracranial hemorrhage segmentation. arXiv preprint arXiv:2311.08190
Wang Y, Katsaggelos AK, Wang X, et al. (2016) A deep symmetry convnet for stroke lesion segmentation. In: 2016 IEEE International Conference on Image Processing (ICIP), IEEE, pp 111–115
Wang S, Li BZ, Khabsa M, et al. (2020) Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768
Wang X, Shrivastava A, Gupta A (2017) A-fast-rcnn: Hard positive generation via adversary for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2606–2615
Wang D, Wu Z, Yu H (2021) Ted-net: Convolution-free t2t vision transformer-based encoder-decoder dilation network for low-dose ct denoising. In: Machine Learning in Medical Imaging: 12th International Workshop, MLMI 2021, Held in Conjunction with MICCAI 2021, Strasbourg, France, September 27, 2021, Proceedings 12, Springer, pp 416–425
Wessels T, Wessels C, Ellsiepen A et al. (2006) Contribution of diffusion-weighted imaging in determination of stroke etiology. Am J Neuroradiol 27(1):35–39
Google Scholar
Winzeck S, Hakim A, McKinley R et al. (2018) Isles 2016 and 2017-benchmarking ischemic stroke lesion outcome prediction based on multispectral mri. Front Neurol 9:679
Article Google Scholar
Woo S, Park J, Lee JY, et al. (2018) Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19
Wu Z, Zhang X, Li F et al. (2022) Multi-scale long-range interactive and regional attention network for stroke lesion segmentation. Comput Electr Eng 103:108345
Article Google Scholar
Wu Z, Zhang X, Li F, et al. (2023a) W-net: A boundary-enhanced segmentation network for stroke lesions. Expert Syst Appl p 120637
Wu Z, Zhang X, Li F, et al. (2023b) Transrender: a transformer-based boundary rendering segmentation network for stroke lesions. Front Neurosci 17
Xie S, Girshick R, Dollár P, et al. (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1492–1500
Xiong Y, Zeng Z, Chakraborty R, et al. (2021) Nyströmformer: A nyström-based algorithm for approximating self-attention. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 14138–14148
Xu Z, Ding C (2023) Combining convolutional attention mechanism and residual deformable transformer for infarct segmentation from ct scans of acute ischemic stroke patients. Front Neurol 14
Yang H, Huang W, Qi K, et al. (2019) Clci-net: Cross-level fusion and context inference networks for lesion segmentation of chronic stroke. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part III 22, Springer, pp 266–274
Yu W, Huang Z, Zhang J et al. (2023) San-net: Learning generalization to unseen sites for stroke lesion segmentation with self-adaptive normalization. Comput Biol Med 156:106717
Article Google Scholar
Yu W, Lei Y, Shan H (2023) Fan-net: Fourier-based adaptive normalization for cross-domain stroke lesion segmentation. ICASSP 2023–2023 IEEE International Conference on Acoustics. IEEE, Speech and Signal Processing (ICASSP), pp 1–5
Yu W, Luo M, Zhou P, et al. (2022) Metaformer is actually what you need for vision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10819–10829
Zhang R, Zhao L, Lou W et al. (2018) Automatic segmentation of acute ischemic stroke from dwi using 3-d fully convolutional densenets. IEEE Trans Med Imaging 37(9):2149–2160
Article Google Scholar
Zhang Z, Liu Q, Wang Y (2018) Road extraction by deep residual u-net. IEEE Geosci Remote Sens Lett 15(5):749–753
Article Google Scholar
Zhang L, Song R, Wang Y et al. (2020) Ischemic stroke lesion segmentation using multi-plane information fusion. IEEE Access 8:45715–45725
Article Google Scholar
Zhang H, Chen H (2023) Efficient 3d transformer with cluster-based domain-adversarial learning for 3d medical image segmentation. In: 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), IEEE, pp 1–5
Zhang Y, Liu H, Hu Q (2021) Transfuse: Fusing transformers and cnns for medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, Springer, pp 14–24
Zhang Y, Liu S, Li C, et al. (2022) Application of deep learning method on ischemic stroke lesion segmentation. Journal of Shanghai Jiaotong University (Science) pp 1–13
Zhao H, Shi J, Qi X, et al. (2017) Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2881–2890
Zhou HY, Guo J, Zhang Y, et al. (2021) nnformer: Interleaved transformer for volumetric segmentation. arXiv preprint arXiv:2109.03201
Zhou SK, Rueckert D, Fichtinger G (2019) Handbook of medical image computing and computer assisted intervention. Academic Press
Google Scholar
Zhou Y, Huang W, Dong P et al. (2019) D-unet: a dimension-fusion u shape network for chronic stroke lesion segmentation. IEEE/ACM Trans Comput Biol Bioinf 18(3):940–950
Article Google Scholar
Zhu X, Su W, Lu L, et al. (2020) Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159

Download references

Acknowledgements

The findings presented in this paper have emerged from a project funded by the Qatar Japan Research Collaboration Research Program under grant number M-QJRC-2023-313. The authors extend their sincere gratitude to Marubeni and Qatar University for their consistent and generous support. This work was also partially supported by JST, PRESTO Grant Number JPMJPR23P7, Japan.

Author information

Authors and Affiliations

Department of Mathematics and Statistics, Qatar University, Doha, P.O.Box 2713, Qatar
Yalda Zafari-Ghadim & Mohamed Mabrok
Graduate School of Information Science, University of Hyogo, Kobe 650-0047, Japan
Essam A. Rashed
Department of Computer Engineering, Qatar University, Doha, P.O.Box 2713, Qatar
Amr Mohamed

Contributions

Y.G., E.R., and M.M. wrote the main manuscript text. Y.G. prepared Figs 1–3. A.M. helped review the paper. All authors reviewed the manuscript.

Corresponding author

Correspondence to Mohamed Mabrok.

Ethics declarations

Conflict of interests

The authors declare no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zafari-Ghadim, Y., Rashed, E.A., Mohamed, A. et al. Transformers-based architectures for stroke segmentation: a review. Artif Intell Rev 57, 307 (2024). https://doi.org/10.1007/s10462-024-10900-5

Accepted: 06 August 2024
Published: 30 September 2024
DOI: https://doi.org/10.1007/s10462-024-10900-5
