Abstract
Malignant tumors are a common cytopathologic disease, and pathological tissue examination is a key tool for diagnosing them. Doctors must manually analyze images of pathological tissue sections, which is time-consuming and highly subjective and can easily lead to misdiagnosis. Most existing computer-aided diagnostic techniques for pathological tissue images focus heavily on accuracy and overlook two problems: developing countries often lack the resources needed to train large models, and annotated medical data are difficult to obtain. This study therefore proposes an artificial intelligence multiprocessing scheme (MSPInet) for digital pathology images of malignant tumors. We first use techniques such as data expansion and noise reduction to enhance the dataset. We then design a coarse segmentation method for cell nuclei in pathology images based on Transformer for Semantic Segmentation and further optimize the segmentation of tumor edges using conditional random fields. Finally, we improve the training strategy for knowledge distillation. As a medical assistive system, the method can quantify complex pathology images and convert them into analyzable image information. Experimental results show that our method performs well in terms of segmentation accuracy and also has advantages in time and space efficiency. This makes our technology usable in developing countries where medical resources and equipment are limited. The teacher model and lightweight student model included in our method achieve 71.6% and 66.1% Intersection over Union (IoU) in cell segmentation respectively, outperforming Swin-unet and CSWin Transformer.
Introduction
High-resolution digital pathology images contain a wealth of medical information such as tissue details, cellular structures, and lesion characteristics [1]. This information helps physicians diagnose and treat diseases more accurately. Traditional pathology diagnosis relies on manual, one-by-one judgment by pathologists [2]. This process is time-consuming and cumbersome, and diagnostic results are influenced by personal experience and subjective judgment, which can lead to discrepancies between diagnoses and thus affect patient outcomes. There is a growing shortage of pathologists globally, especially in developing countries [3]. The increased pressure on diagnostic pathology further affects the accuracy and efficiency of pathology diagnosis. Quantitative analysis of digital pathology images through computer-aided diagnostic techniques can greatly reduce the repetitive work of pathologists [4,5,6].
Computer-aided diagnosis provides medical workers with more efficient and accurate diagnostic results through image processing technology [7,8,9]. In particular, intelligent image segmentation can automatically identify and extract key features in images to improve diagnostic efficiency. Deep learning methods have promoted the development and application of artificial intelligence in digital pathology analysis owing to their powerful learning ability [10]. For example, convolutional neural networks (CNNs), U-Net, fully convolutional networks (FCNs), and recurrent neural networks (RNNs) have been widely used in pathological image preprocessing, cell nucleus and karyotype identification and quantification, and tumor classification and grading [11,12,13].
However, existing deep learning methods have limitations when segmenting pathological images of malignant tumors, which makes it difficult for them to meet practical needs [14]. Clustering-based methods are computationally efficient but sensitive to noise, resulting in low accuracy. Learning-based segmentation methods struggle to balance accuracy and segmentation efficiency [15]. CNN-based segmentation methods achieve higher accuracy but are time-consuming and memory-intensive [16]. In developing countries with limited economic resources in particular, the following problems remain:
1. Limited specialized personnel. Lower economic levels and weaker infrastructure in developing countries result in a relatively low per capita share of medical resources, and those resources are often distributed extremely unevenly. In areas with limited medical resources, most patients find it difficult to receive timely diagnosis and treatment in the early stages of disease development [17].
2. High annotation cost. Pixel-level annotation of pathology images requires a great deal of manpower and expertise. This cost is an obstacle to the further development and application of pathology image analysis.
3. Low value density of digital pathology images. Large sections contain a great deal of background information, and the average cell area is only 8.29% of the background, implying a low percentage of valid information [18]. Because pathology section images are numerous and high in resolution, screening and processing them with the naked eye takes a lot of time and effort.
4. High computational cost of models. Tumor cell morphology is complex, with stacking and other situations that demand high accuracy and generalization from the segmentation method. High-performance pathology image segmentation models usually have a large number of parameters and require a large amount of computational resources [19].
To realize an image segmentation method with high accuracy, high efficiency, high automation and good repeatability, we propose an artificial intelligence multiprocessing scheme (MSPInet) for digital pathology images of malignant tumors.
In preprocessing, we use data expansion methods such as random image flipping, cropping, rotation, and color augmentation to increase the diversity of the dataset. We introduce a fast non-local method using integral images to reduce noise caused by equipment quality, the production environment, and operational errors in processes such as slice generation. We then design a cell nucleus segmentation model based on Transformer for Semantic Segmentation (TFSS). It introduces a bootstrap aggregation bilateral network for coarse segmentation and a Conditional Random Field (CRF) to refine the coarse segmentation results. Finally, we improve the training strategy for knowledge distillation. The system can provide physicians with intuitive auxiliary diagnosis results.
Our contributions are specified below:
1. We design a segmentation model based on a semantic segmentation transformer (TFSS). By decoupling the body features obtained from low-frequency features and the edge features obtained from high-resolution shallow features, and then supervising them separately, the model achieves high-precision cell nucleus segmentation. Our method eases the difficulty of boundary fitting so that the segmentation results fit the actual cell nucleus region more closely. The accurate structure can help doctors reduce misdiagnosis.
2. We use the CRF model as an optimizer to further refine the fine details of cell nucleus gaps, cell nucleus edges, and overlapping cells. This method improves the fitting accuracy of the edge regions while reducing the loss of information about the cell nucleus edges.
3. We improve the training strategy of knowledge distillation so that the model not only trains efficiently but also focuses more on samples that are difficult to segment, which results in higher prediction accuracy.
4. More than 4000 samples were used for experiments. The results show that the proposed MSPInet strategy has clear advantages. Doctors can use the segmentation results as an important auxiliary basis in the diagnosis and treatment stages, enabling rapid diagnosis, reducing workload, and improving work efficiency.
Related work
As information technology and computer hardware advance, the utilization of artificial intelligence in medical diagnosis is progressively expanding. This integration enhances the precision and efficiency of medical diagnosis, empowering healthcare professionals with robust tools to elevate the processes of diagnosis and treatment for patients.
Several researchers are now focusing on investigating the influence of different image staining styles on deep learning models. Salehi et al. [20] proposed STST (Pix2Pix-based Stain-to-Stain Translation), initially converting RGB format original pathological images into grayscale images, using the two images as pairs for model training. However, due to the close similarity in color between cell nuclei and cytoplasm, it becomes difficult to distinguish between the two in the transformed images. Pérez-Bueno et al. [21] utilized blind color deconvolution to isolate individual staining channels in multi-stained images, reducing the impact of color variations on computer-aided diagnosis systems. This framework introduced an innovative variational Bayesian blind color deconvolution algorithm that automatically estimates color vector matrices, stain concentrations, and all model parameters, although its effectiveness is not evident in certain bands. Yiqing et al. [22] proposed RandStainNA, which combines stain normalization and stain augmentation to enhance the model's generalization ability, but its generalization ability in color spaces like YUV still needs improvement. Ho et al. [23] introduced the URUST framework, which considers the correlation between neighboring patches, aiming to minimize the color discrepancy between adjacent stained patches after normalization, thereby reducing post-staining errors. Moreover, URUST significantly reduces hardware requirements, enabling the processing of ultra-high-resolution pathological images in the same hardware environment.
Several widely used image enhancement techniques can also enhance the segmentation performance of the model, such as Stylemix [24] and TokenMixup [25]. These image enhancement techniques typically enable the model to achieve improved generalization in a variety of scenarios. Unfortunately, in the realm of medical imaging, these approaches can readily distort the morphological information of osteosarcoma cell nuclei and disrupt the training of the model. Since the introduction of GAN [26], there have been many data augmentation studies based on GAN, such as ACGAN [27], DANGAN [28], which have played a significant role in cancer diagnosis and treatment processes. For example, Zhang et al. [29] proposed a model based on WGAN-GP, which, when used to augment one-dimensional clinical data, performs well in solving the imbalance problem with fewer sample datasets. However, due to the inability of GAN networks to guarantee convergence, the performance of GAN-based data augmentation methods is mediocre when faced with large-scale datasets.
Many researchers have also conducted studies on semantic segmentation techniques for pathological images. Yedong et al. [30] suggested a novel semi-supervised approach for semantic segmentation of cell nuclei in pathological images. Their sampling strategy concentrates on the distribution of cell nuclei to enhance segmentation accuracy, overlooking the image variations induced by staining styles. Ouyang et al. [31] found a significant amount of redundant patches in the original pathological images and suggested that eliminating these patches can significantly reduce manual labor costs, offering a novel perspective on pathological semantic segmentation. However, despite only requiring 5% of labels, the cost of selecting suitable patches cannot be ignored.
Enhancing the performance of deep learning models is also a crucial strategy for improving segmentation accuracy. Convolutional Neural Networks (CNNs) [32] have demonstrated remarkable achievements in the realm of computer vision, especially in image segmentation tasks. CNNs can learn local features from images and efficiently fuse them, enabling precise image segmentation.
TFSS (Transformer for Semantic Segmentation) [33] is a novel deep learning network architecture that, compared to traditional Unet [34], maintains high-resolution representations during model training. This allows TFSS to capture finer details in images and achieve multi-scale feature fusion, improving the spatial accuracy of segmentation results. In recent advancements in image segmentation, techniques that integrate the Fully Convolutional Network (FCN) [35] with an encoder-decoder structure have emerged as the primary approaches for semantic segmentation. For example, Du et al. [36] improved the FCN model with Resnet as the backbone to achieve the accuracy of recognition and segmentation with less time cost.
Recently, Vision Transformers (ViTs) [37] introduced a transformer architecture without convolutional layers designed for image classification, treating input images as sequences of patch tokens. ViTs necessitate training on extensive datasets, while DeiT [38] proposed a labeled knowledge distillation approach and utilized CNNs to achieve competitive vision transformers trained on the ImageNet-1k dataset. Concurrent work has extended to video classification and semantic segmentation. In particular, SETR [39] employed a ViT backbone along with a conventional CNN decoder. Swin Transformer [40], a modification of ViT, utilizes local windows that shift and upsample between layers, forming a pyramid FCN decoder.
In this study, we present a segmenter, employing an encoder-decoder architecture based on transformers for semantic image segmentation. Our method utilizes a ViT backbone and incorporates a mask decoder inspired by DETR [41]. Without relying on convolutions, our architecture captures global image context through thoughtful design, delivering competitive performance on standard image segmentation benchmarks.
Systematic approach
Digital pathology images have emerged as the gold standard for diagnosing malignant tumors, owing to the abundant medical information they encapsulate [42]. Traditional pathology diagnosis relies on manual, one-by-one judgments by pathologists. This process is both time-consuming and cumbersome [43]. The complexity of pathology images results in diagnostic outcomes influenced by personal experience and subjective judgment [44,45,46]. Hence, our goal is to reduce the burden on doctors by streamlining their workload and improving the efficiency of clinical diagnosis through the use of intelligent medical technology. In this paper, an artificial intelligence multiprocessing scheme (MSPInet) for malignant tumor pathology tissue section images is developed. Specifically, we use a Vision Transformer (ViT)-based approach to segment cell nuclei in pathology images. This aids doctors in the initial identification of histopathology images and provides valuable references for their subsequent diagnosis and determination of the degree of soft tissue invasion. The system's overall design is depicted in Fig. 1.
The system comprises two main parts. The first part is the pre-processing of histopathological images before segmentation. In the initial stage, a pre-screening process selects a limited set of valuable images, specifically those that include areas with lesions, and passes them to the next stage. We then introduce a rapid non-local method that leverages integral images to remove noise from the images, improving both quantitative metrics and visual quality. The second part is the segmentation itself. The noise-reduced image is input into an image segmentation network (Segmenter), and a Guided Aggregation Bipartite Network assists in segmenting the histopathology image of osteosarcoma, producing a preliminary segmentation result map. The tumor segmentation boundaries are then refined using conditional random fields. Ultimately, histopathological images identifying the lesion region are obtained to aid doctors in disease diagnosis. Additionally, to address the heavy computation and large parameter counts of large models, we employ a knowledge distillation training strategy. Table 1 lists some notations used in this paper. This section is divided into three subsections: Sect. "Image Preprocessing Module" details the image preprocessing process, Sect. "Segmentation Model" analyzes the image segmentation model, and Sect. "Training Strategy" elaborates on the training strategy and loss function.
Image preprocessing module
Although we have more than 1,000 digital pathology images, each image is very large (about 3–5 GB) and contains hundreds of thousands of labeled nuclei [47]. Pixel-level labeling of pathology images is very expensive, and annotating all the cells in the images one by one would require a great deal of labor and cost. To make the most of these expensive labeled data, we built a data enhancement pipeline tailored to the features of malignant tumor pathology images. In each training round, every image first passes through this pipeline to ensure the diversity of the input data. A total of 10,000 images were obtained after random cropping. We then eliminated images that were contaminated, had too much blank background, or contained too few nuclei, leaving a total of 2164 images for annotation. These images were fed into the data enhancement pipeline to increase the diversity of the data. The pipeline uses the following conventions: the size of the input image is denoted as \(W\times H\), and the lower-left corner of the image serves as the origin of a Cartesian coordinate system whose positive x-axis and y-axis extend to the right and upward, respectively. \((\frac{W}{2},\frac{H}{2})\) represents the center of the image, while \((x,y)\) denotes the coordinates of any pixel in the input image. As shown in Fig. 2, we apply random cropping, rotation, translation, flipping, and color enhancement to the original digital pathology images, in that order. Random color enhancement in turn includes random brightness, saturation, hue, and contrast enhancement.
1. Randomly Crop Images: The image is randomly cropped to the size \({W}_{0}\times {H}_{0}\) required by the network input (\(512\times 512\) in our case). At the same time, we try to ensure that the percentage of non-zero regions in the original image is greater than a threshold \(\alpha \) (\(\alpha =0.7\) in our case). Equation 3 details the computation of the vertex \((X1,Y1)\) at the top-left corner of the cropped rectangle. The cropping is attempted up to 10 times and stops either when the non-zero region of the image meets the preset threshold or when all 10 attempts have been made without meeting it. The marker image is cropped in the same way.
2. Randomly Rotate Images: Considering the rotational invariance of the cell, the input image is rotated with a fifty percent probability by a random angle \(\theta \in [-180,180]\) about the center \((\frac{W}{2},\frac{H}{2})\). The marker image follows this change, and the rotated coordinate point becomes \(\left({x}_{1},{y}_{1}\right)\) (see Eq. 1). Finally, the original image and the marker image are filled with black to form the minimal axis-aligned bounding rectangle. As a result, the size of the rotated image changes to \({W}^{\prime} \times {H}^{\prime}\).
3. Randomly Translate Images: In contrast to conventional semantic segmentation tasks, the cells to be recognized here have an equal probability of appearing at any position in the image, including the center and edges, and a cell is often split across two images when a small block is cropped. To simulate this situation, the input image has a fifty percent probability of being randomly translated by distances \(h \in [-\frac{H}{2},\frac{H}{2}]\) and \(w \in [-\frac{W}{2},\frac{W}{2}]\) in the vertical and horizontal directions, respectively, resulting in translated coordinate points \(\left({x}_{2},{y}_{2}\right)\) (see Eq. 2). The size of the image remains unchanged after translation, and the labeled image undergoes the same modification. The blank area left after translating the original and labeled images is filled with black. The image after random translation is \({W}_{{x}_{2}}^{\prime}\times {H}_{{y}_{2}}^{\prime}\).
4. Randomly Flip Images: The image has a fifty percent probability of being flipped vertically (Eq. 3) or horizontally (Eq. 4). The coordinate points of the flipped image are denoted as \(\left({x}_{3},{y}_{3}\right)\) and \(\left({x}_{4},{y}_{4}\right)\), respectively, with the marker image adapting accordingly.
5. Random Color Enhancement: This data augmentation approach simulates staining differences or visual noise caused by various factors, such as different staining workers, staining batches, scanners, etc., during the processing of pathology images. The module has a fifty percent probability of being executed, and it comprises several sub-modules. Each sub-module has a fifty percent probability of being executed, and none of the labeled images will be altered unless specified otherwise.
6. Random Luminance Enhancement: Cytopathology images frequently exhibit uneven illumination due to factors such as microscope settings, sample thickness, staining depth, and leakage. To simulate these illumination differences, we vary the image luminance. Let \({p}_{i}\) denote the pixel value at any point in the image, and let a random \(\upgamma \in [-32, 32]\) denote the luminance change. A change in luminance results in a corresponding change in pixel value, as given in Eq. 5.
7. Image Conversion Module: The image is converted from RGB format to HSL format, with each component's conversion detailed in Eq. 6. This is an intermediate step, and the sub-module is executed whenever the Color Enhancement Module is executed.
8. Random Saturation Enhancement: Due to different degrees of staining depth, inconsistency of stains, uneven absorption of stains, etc., cytopathology images show diverse cell background colors and cell nucleus staining colors [34]. To simulate this diversity, we randomly adjust the saturation of colors within a certain range. The adjustment uses a random saturation factor \(\eta \in [0.7,1.3]\), as given in Eq. 7.
9. Random Hue Enhancement: The rationale for this data enhancement is the same as for the previous point. We randomly select a parameter \(\upsilon \in [-18,18]\) as the hue shift of the image, and the adjustment of the hue value is given in Eq. 8.
10. Image Transformation Module: The image is converted from HSL format back to RGB format using the conversion in Eq. 6. This module is executed whenever the color enhancement module is invoked.
11. Random Contrast Enhancement: Once again, to simulate color differences, we draw a random contrast factor \(\mu \in [0.5,1.5]\), which changes the pixel values at each point as given in Eq. 9.
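The steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: images are nested lists of RGB tuples, only cropping, flipping, and luminance are shown, and several details are assumptions because the corresponding equations are not reproduced here. In particular, the non-zero check is applied to the crop, the two flips are drawn independently, and the luminance change is taken to be an additive offset clipped to [0, 255]. All function names are hypothetical.

```python
import random


def nonzero_ratio(img):
    """Fraction of pixels with at least one non-zero channel."""
    pixels = [px for row in img for px in row]
    return sum(1 for px in pixels if any(px)) / len(pixels)


def random_crop(img, mask, w0, h0, alpha=0.7, tries=10, rng=random):
    """Crop image and mask to w0 x h0, retrying up to `tries` times until the
    crop's non-zero fraction exceeds alpha; the last attempt is kept otherwise."""
    h, w = len(img), len(img[0])
    for _ in range(tries):
        x1 = rng.randint(0, w - w0)  # left edge of the crop
        y1 = rng.randint(0, h - h0)  # top edge of the crop
        crop = [row[x1:x1 + w0] for row in img[y1:y1 + h0]]
        crop_mask = [row[x1:x1 + w0] for row in mask[y1:y1 + h0]]
        if nonzero_ratio(crop) > alpha:
            break
    return crop, crop_mask


def random_flip(img, mask, p=0.5, rng=random):
    """Flip image and mask together (assumed: the two flips are independent)."""
    if rng.random() < p:  # vertical flip: reverse the row order
        img, mask = img[::-1], mask[::-1]
    if rng.random() < p:  # horizontal flip: reverse every row
        img = [row[::-1] for row in img]
        mask = [row[::-1] for row in mask]
    return img, mask


def clip(value):
    """Round and clamp a channel value to the 8-bit range."""
    return max(0, min(255, int(round(value))))


def random_brightness(img, p=0.5, rng=random):
    """Add one random offset in [-32, 32] to every channel (assumed additive).
    The mask is not touched by color operations."""
    if rng.random() >= p:
        return img
    gamma = rng.uniform(-32, 32)
    return [[tuple(clip(c + gamma) for c in px) for px in row] for row in img]


def augment(img, mask, w0, h0, rng=random):
    """Apply crop, flip, and luminance change in the order used in the text."""
    img, mask = random_crop(img, mask, w0, h0, rng=rng)
    img, mask = random_flip(img, mask, rng=rng)
    return random_brightness(img, rng=rng), mask


rng = random.Random(0)
image = [[(x + 1, y + 1, 50) for x in range(8)] for y in range(8)]
mask = [[(x + y) % 2 for x in range(8)] for y in range(8)]
aug_img, aug_mask = augment(image, mask, 4, 4, rng=rng)
```

Passing the same random source to every step keeps the image and its marker image aligned, which is the property the pipeline relies on for the geometric operations.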
For model training, each image first passes through the five data enhancement modules (random cropping, rotation, translation, flipping, and color enhancement) to produce the enhanced inputs. Despite the limited dataset, this approach significantly enriches the image inputs for each epoch, leading to highly diverse input images.
We attempted to remove common artifacts in cytopathology images, such as background stray spots and dye-contaminated areas, as well as variations caused by staining depth, sample thickness, and microscope settings. However, we could not find an effective denoising method suitable for cytopathology images: the denoising methods designed for natural images that we tested showed limited effectiveness in improving results. We also refrained from using complex stain normalization methods to address staining level differences across pathology labs and staining batches, as this would have increased computational costs, contradicting our goal of reducing computation. Instead, we aimed to model these noise and color differences through data augmentation, allowing the deep learning network to learn their characteristics on its own.
Segmentation model
To achieve more accurate cell nucleus boundaries while ensuring computational efficiency and lower hardware requirements, we designed a cell nucleus segmentation model for pathology images based on the Transformer for Semantic Segmentation. The initial coarse segmentation is conducted using a visual transformer-based model, and further refinement is achieved through a Conditional Random Field Network (CRF). Each module is described in detail below.
Rough prediction of cell nuclei in pathological images
Our segmentation model employs a Guided Aggregation Bipartite Network to assist in segmenting histopathology images, producing the initial segmentation result map. The architecture of the segmentation model is illustrated in Fig. 3. We input the preprocessed pathology images into the segmentation network. First, the segmentation model splits the pathology image into patches and flattens each patch into a one-dimensional vector, which makes the image data easier to process. Next, a linear projection is applied to the flattened patches, which helps identify edges, textures, shapes, and other key elements in the pathology image. Two Mask Transformer layers then extract semantic information and features from the pathology images, enabling more accurate and higher-level image analysis. Finally, a scalar product is performed to generate the pathology image mask, yielding the preliminary segmentation result map.
(1) Encoder module
A pathological image of an osteosarcoma \(\mathbf{x}\in {\mathbb{R}}^{H\times W\times C}\) is split into a patch sequence \(\mathbf{x}=\left[{x}_{1},\dots ,{x}_{N}\right]\in {\mathbb{R}}^{N\times {P}^{2}\times C}\), where \((P,P)\) is the size of the patch, \(N=HW/{P}^{2}\) is the number of patches, and \(C\) is the number of channels. Every patch is flattened into a one-dimensional vector and then linearly projected to a patch embedding, resulting in a sequence of patch embeddings \({\mathbf{x}}_{0}=\left[\mathbf{E}{x}_{1},\dots ,\mathbf{E}{x}_{N}\right]\in {\mathbb{R}}^{N\times D}\), where \(\mathbf{E}\in {\mathbb{R}}^{D\times \left({P}^{2}C\right)}\). To incorporate positional information, learnable positional embeddings \(\text{pos}=\left[{\text{pos}}_{1},\dots ,{\text{pos}}_{N}\right]\in {\mathbb{R}}^{N\times D}\) are added to the patch sequence, giving the input token sequence \({\mathbf{z}}_{0}={\mathbf{x}}_{0}+{\text{pos}}\), which therefore carries positional information for each patch.
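The shapes in the paragraph above can be checked with a small plain-Python sketch. It is illustrative only: the image is a nested list, the projection matrix `E` is filled with random numbers instead of learned weights, and the positional embeddings are zeros. The names `patchify` and `embed` are hypothetical and do not come from the paper.

```python
import random


def patchify(image, p):
    """Split an H x W x C image into N = HW / P^2 flattened patches,
    each of length P * P * C, in row-major order over the patch grid."""
    h, w = len(image), len(image[0])
    if h % p or w % p:
        raise ValueError("H and W must be divisible by the patch size P")
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            flat = [channel
                    for row in image[top:top + p]
                    for pixel in row[left:left + p]
                    for channel in pixel]
            patches.append(flat)
    return patches


def embed(patches, E, pos):
    """Compute z_0 = x_0 + pos, where row n of x_0 is E applied to patch n.
    E has shape D x (P^2 C) and pos has shape N x D."""
    z0 = []
    for patch, pos_n in zip(patches, pos):
        projected = [sum(e * x for e, x in zip(e_row, patch)) for e_row in E]
        z0.append([a + b for a, b in zip(projected, pos_n)])
    return z0


H, W, C, P, D = 4, 6, 3, 2, 5
image = [[[y, x, 0] for x in range(W)] for y in range(H)]
patches = patchify(image, P)                     # N = 4 * 6 / 2^2 = 6 patches

rng = random.Random(0)
E = [[rng.uniform(-1, 1) for _ in range(P * P * C)] for _ in range(D)]
pos = [[0.0] * D for _ in patches]               # stand-in for learned embeddings
z0 = embed(patches, E, pos)                      # shape N x D
```

With \(H=4\), \(W=6\), and \(P=2\) the sketch yields \(N=6\) patches of length \(P^{2}C=12\), matching \(N=HW/P^{2}\).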
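A transformer encoder with \(L\) layers is applied to the token sequence \({\mathbf{z}}_{0}\) to generate a contextualized encoding \({\mathbf{z}}_{L}\in {\mathbb{R}}^{N\times D}\). Each transformer layer consists of a multi-headed self-attention (MSA) block followed by a pointwise MLP block with two layers. Layer normalization (LN) is applied before each block, and a residual connection is added after each block:
\[{\mathbf{a}}_{i-1}=\text{MSA}\left(\text{LN}\left({\mathbf{z}}_{i-1}\right)\right)+{\mathbf{z}}_{i-1},\qquad {\mathbf{z}}_{i}=\text{MLP}\left(\text{LN}\left({\mathbf{a}}_{i-1}\right)\right)+{\mathbf{a}}_{i-1},\]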
where \(i\in \{1,\dots ,L\}\). These stacked transformer layers process information from osteosarcoma pathology images and provide the contextual information needed to better understand the structures and features in them, which aids the analysis and diagnosis of osteosarcoma. The self-attention mechanism comprises three pointwise linear layers that map tokens to intermediate representations: query \(\mathbf{Q}\in {\mathbb{R}}^{N\times d}\), key \(\mathbf{K}\in {\mathbb{R}}^{N\times d}\), and value \(\mathbf{V}\in {\mathbb{R}}^{N\times d}\). Self-attention is then computed as follows:
\[\text{MSA}\left(\mathbf{Q},\mathbf{K},\mathbf{V}\right)=\text{softmax}\left(\frac{\mathbf{Q}{\mathbf{K}}^{T}}{\sqrt{d}}\right)\mathbf{V}.\]
The transformer encoder maps the input sequence \({\mathbf{z}}_{0}=\left[{z}_{0,1},\dots ,{z}_{0,N}\right]\) of embedded patches with positional encoding to \({\mathbf{z}}_{L}=\left[{z}_{L,1},\dots ,{z}_{L,N}\right]\), a contextualized encoding containing rich semantic information that is used by the decoder. The decoder is introduced next.
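A dependency-free sketch of the attention step in the encoder may help make the shapes concrete. It implements standard scaled dot-product attention for a single head, which is an assumption about the exact form used here, and the residual block uses identity query, key, and value projections instead of learned linear layers. Function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(q, k, v):
    """Single-head attention softmax(Q K^T / sqrt(d)) V for N x d inputs.
    Returns the attended values (N x d) and the attention weights (N x N)."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def layer_norm(x, eps=1e-6):
    """Normalize every token to zero mean and unit variance (no learned scale)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((val - mean) ** 2 for val in row) / len(row)
        out.append([(val - mean) / math.sqrt(var + eps) for val in row])
    return out


def attention_block(z):
    """Pre-norm residual step: attention on LN(z), then add z back.
    Identity Q/K/V projections are a simplification for illustration."""
    normed = layer_norm(z)
    attended, _ = self_attention(normed, normed, normed)
    return [[a + b for a, b in zip(ra, rz)] for ra, rz in zip(attended, z)]


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [1.0, 0.0]]   # identical keys give uniform attention weights
v = [[2.0, 0.0], [0.0, 4.0]]
out, weights = self_attention(q, k, v)
z1 = attention_block([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
```

Because both keys are identical in this toy input, every query attends equally to both tokens and each output row is the average of the value rows.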
(2) Decoder Module
The patch encoding sequence \({\mathbf{z}}_{L}\in {\mathbb{R}}^{N\times D}\) is decoded into the segmentation map \(\mathbf{s}\in {\mathbb{R}}^{H\times W\times K}\), where \(K\) is the number of classes. The decoder learns to map the patch-level encodings produced by the encoder to patch-level class scores. These patch-level class scores are then upsampled to pixel-level scores by bilinear interpolation. Below we describe a linear decoder, which serves as a baseline, whereas our approach, a mask transformer, is depicted in Fig. 3. In the linear decoder, a pointwise linear layer is applied to the patch encoding \({\mathbf{z}}_{L}\in {\mathbb{R}}^{N\times D}\) to produce patch-level class logits \({\mathbf{z}}_{\text{lin}}\in {\mathbb{R}}^{N\times K}\). The sequence is then reshaped into a two-dimensional feature map \({\mathbf{s}}_{\text{lin}}\in {\mathbb{R}}^{H/P\times W/P\times K}\) and bilinearly upsampled to the original pathology image size, \(\mathbf{s}\in {\mathbb{R}}^{H\times W\times K}\). A softmax is then applied on the class dimension to obtain the final segmentation map.
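The reshape, upsample, and softmax steps of the linear decoder can be sketched as follows. The sketch starts from given patch-level logits, so the pointwise linear layer itself is omitted, and it uses corner-aligned bilinear sampling, which is one common convention chosen here for simplicity and is not stated in the text. Function names are hypothetical.

```python
import math


def reshape_to_grid(z_lin, grid_h, grid_w):
    """Reshape N x K patch logits into a (H/P) x (W/P) x K grid, row-major."""
    if len(z_lin) != grid_h * grid_w:
        raise ValueError("number of patches must equal grid_h * grid_w")
    return [z_lin[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


def bilinear_upsample(grid, out_h, out_w):
    """Bilinearly interpolate an h x w x K grid to out_h x out_w x K."""
    h, w, k = len(grid), len(grid[0]), len(grid[0][0])
    out = []
    for i in range(out_h):
        y = i * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            row.append([
                (1 - fy) * ((1 - fx) * grid[y0][x0][c] + fx * grid[y0][x1][c])
                + fy * ((1 - fx) * grid[y1][x0][c] + fx * grid[y1][x1][c])
                for c in range(k)
            ])
        out.append(row)
    return out


def softmax(scores):
    """Numerically stable softmax over the class dimension."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


H, W, P = 4, 4, 2
z_lin = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0], [2.0, 0.0]]   # N = 4 patches, K = 2
s_lin = reshape_to_grid(z_lin, H // P, W // P)
s = [[softmax(px) for px in row] for row in bilinear_upsample(s_lin, H, W)]
labels = [[max(range(len(px)), key=px.__getitem__) for px in row] for row in s]
```

Each pixel of `s` holds \(K\) class scores that sum to one, and `labels` takes the per-pixel argmax to give a hard segmentation.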
Mask Transformer. For the transformer-based decoder, we introduce a set of \(K\) learnable class embeddings \(\text{cls}=\left[{\text{cls}}_{1},\dots ,{\text{cls}}_{K}\right]\in {\mathbb{R}}^{K\times D}\), where \(K\) is the number of classes. Each class embedding is randomly initialized and assigned to a single semantic class; it is used to generate the mask for that class. The class embeddings \(\text{cls}\) are processed by the decoder jointly with the patch encodings \({\mathbf{z}}_{L}\), as shown in Fig. 3. The decoder consists of \(M\) transformer layers. Our mask transformer produces \(K\) masks by computing the scalar product between the L2-normalized patch embeddings \({\mathbf{z}}_{M}^{\prime}\in {\mathbb{R}}^{N\times D}\) and the class embeddings \(\mathbf{c}\in {\mathbb{R}}^{K\times D}\) output by the decoder. The set of class masks is computed as follows:
\[\text{Masks}\left({\mathbf{z}}_{M}^{\prime},\mathbf{c}\right)={\mathbf{z}}_{M}^{\prime}{\mathbf{c}}^{T},\]
where \(\text{Masks}\left({\mathbf{z}}_{M}^{\prime},\mathbf{c}\right)\in {\mathbb{R}}^{N\times K}\) is a set of \(K\) patch sequences. Each mask sequence is then reshaped into a two-dimensional mask, forming \({\mathbf{s}}_{\text{mask}}\in {\mathbb{R}}^{H/P\times W/P\times K}\), and upsampled to the dimensions of the original pathology image to give the feature map \(\mathbf{s}\in {\mathbb{R}}^{H\times W\times K}\). A softmax is then applied on the class dimension, followed by a layer norm, to obtain the pixel-level class scores that form the final segmented histopathology image. The mask sequences are softly exclusive to one another, i.e. \({\sum }_{k=1}^{K} {s}_{i,j,k}=1\) for all \((i,j)\in H\times W\).
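The mask computation reduces to a scalar product between each normalized patch embedding and each class embedding. The following plain-Python sketch shows only that step plus the softmax over classes. The decoder layers that produce the embeddings, the reshaping and upsampling, and the layer norm are all left out, and only the patch embeddings are normalized, as the text states. Function names are hypothetical.

```python
import math


def l2_normalize(rows):
    """Scale every row vector to unit Euclidean length (zero rows are kept)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def class_masks(z_m, c):
    """Scalar product of normalized patch embeddings (N x D) with class
    embeddings (K x D), giving one score per patch and class (N x K)."""
    z_norm = l2_normalize(z_m)
    return [[sum(a * b for a, b in zip(patch, cls)) for cls in c] for patch in z_norm]


def class_scores(masks):
    """Softmax over the class dimension so that each patch's scores sum to one."""
    scores = []
    for row in masks:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        scores.append([e / total for e in exps])
    return scores


z_m = [[3.0, 4.0], [2.0, 0.0], [0.0, 5.0]]   # N = 3 patch embeddings, D = 2
c = [[1.0, 0.0], [0.0, 1.0]]                 # K = 2 class embeddings
masks = class_masks(z_m, c)
scores = class_scores(masks)
patch_labels = [max(range(len(row)), key=row.__getitem__) for row in scores]
```

Because the class embeddings are produced from the input together with the patches, these scalar products act as filters that change with each image.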
Our mask transformer draws inspiration from DETR and MaxDeepLab, both of which introduce object embeddings for instance mask generation. However, unlike our method, MaxDeepLab adopts a hybrid approach involving CNNs and transformers and, due to computational constraints, separates pixel and class embeddings into two streams. Using a pure transformer architecture and leveraging patch-level encoding, we propose a simple approach that jointly processes patch and class embeddings at the decoding stage. This approach allows the generation of dynamic filters that vary with the input. Although this study deals with the semantic segmentation of osteosarcoma, our mask transformer can also be directly adapted to perform panoptic segmentation [48] by replacing the class embeddings with object embeddings. Panoptic segmentation combines semantic segmentation of static background classes with instance segmentation of countable objects. By understanding the information of single-frame LiDAR scans, panoptic segmentation can provide much useful information for autonomous driving, such as future prediction and map construction.
Refinement of coarse predictions
Conditional Random Fields (CRF) is a probabilistic graphical model often used as a post-processing tool to improve the performance of algorithms for pattern classification, labeling, segmentation, and other tasks in image processing and computer vision [49]. For example, a U-Net neural network gives good results. However, upon close inspection of the prediction mask, small "islands" of mispredicted pixels are found. To improve these small inconsistencies, a CRF model can be used to enhance and refine the segmentation results. In the recognition of digital pathology images, CRF can improve the segmentation quality by taking into account pixel relationships and interactions with the context to further refine the prediction results of cell nuclei. It also reduces noise interference in the segmentation results by modeling the dependencies between pixels, helping to filter out isolated noise points, and ultimately improving the accuracy of the segmentation.
The process of using CRF to refine the initial coarse segmentation results is outlined below:
1. Prepare the pathological image data along with the corresponding preliminary segmentation results.
2. Define the CRF model, specifying its graph structure, observation nodes, hidden nodes, and feature functions. When refining a coarse segmentation, hidden nodes typically denote the label or category assigned to each pixel. Feature functions capture pixel relationships and interactions with the coarse segmentation.
3. Design feature functions that emphasize observed data (original images), hidden states (pixel labels), and their interactions. These functions may encompass pixel distances, color similarity, texture features, and other relevant factors.
4. Using the training data, estimate the weights of the feature functions in the CRF model to improve its fit to both the coarse segmentation results and the original image.
5. Use the learned CRF model to perform label inference, determining the most probable label for each pixel and, in turn, refining the coarse segmentation.
6. Post-process the CRF output to eliminate potential isolated noise points and enhance the smoothness of the segmentation results.
7. Assess the refined segmentation results, typically by computing segmentation accuracy, IoU (Intersection over Union), and other performance metrics.
Training strategy
Large deep learning models exhibit excellent performance but come with a high parameter count and low computational efficiency. On the other hand, small models are computationally efficient but tend to have poorer performance. In the context of medical auxiliary diagnosis systems, achieving both high precision and efficiency is challenging, especially in developing countries and regions with limited resources. Hence, we embrace the knowledge distillation (KD) training strategy, enabling swift computation while maintaining relatively excellent performance. Moreover, this approach helps address the challenge of expensive and hard-to-obtain pathological image annotations, as model training frequently encounters the issue of insufficient datasets.
Pathological images are input into two networks: the teacher network (TFSS_T) and the student network (TFSS_S). After the teacher network has been trained, its outputs are converted into soft labels to facilitate knowledge transfer from the teacher to the student network. Unlike traditional hard labels, which use only "0" and "1" to annotate pathological images, soft labels take values between 0 and 1, providing a more nuanced representation of the distinctions between background, cytoplasm, and nucleus in pathological images.
During the knowledge distillation training process, our segmentation model initially trains a TFSS_T network with numerous parameters. Subsequently, the network is employed to obtain the soft labels for the training set. These soft labels, along with the actual hard labels, serve as fitting objects for distillation training, with the parameter \(\mathrm{\alpha }\) adjusting the weight of the loss function. Upon completion of training, the TFSS_S network is used for further prediction.
The training loss function is:
Here, \(\mathrm{\alpha }\) represents the weight parameter between the two partial losses. \({D}_{KL}\) denotes the KL divergence loss function, and \(FL\left({p}_{t}\right)\) represents the focal loss. The softening results after the output of the student network and teacher network are denoted by \({Q}_{S}^{\tau }\) and \({Q}_{T}^{\tau }\), respectively. \({y}_{true}\) indicates the actual label.
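As an illustration of how the two partial losses can be combined, the sketch below computes a temperature-softened KL term between teacher and student outputs and a focal term against the hard labels. The weighting `alpha * soft + (1 - alpha) * hard`, the direction of the KL divergence, and all function names are assumptions of this sketch; the exact form used in the paper may differ.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Softmax over the last axis with temperature tau."""
    z = logits / tau
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-8):
    """Mean over pixels of KL(p || q) for distributions on the last axis."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def distillation_loss(student_logits, teacher_logits, y_true,
                      alpha=0.5, tau=4.0, lam=2.0):
    """student_logits, teacher_logits: (H, W, K); y_true: (H, W) integer labels."""
    # Soft term: match the softened student output Q_S to the teacher output Q_T.
    q_s = softmax(student_logits, tau)
    q_t = softmax(teacher_logits, tau)
    soft_term = kl_divergence(q_t, q_s)

    # Hard term: focal loss of the student against the true labels.
    p = softmax(student_logits)
    p_t = np.take_along_axis(p, y_true[..., None], axis=-1)[..., 0]
    hard_term = float(np.mean(-(1.0 - p_t) ** lam * np.log(p_t + 1e-8)))

    # Assumed convex combination of the two partial losses.
    return alpha * soft_term + (1.0 - alpha) * hard_term
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the focal term against the hard labels remains.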
Sample imbalance is a significant issue in medical images, particularly in pathological images. Addressing the weighting among case samples has become a crucial factor for enhancing model performance. While cross-entropy loss can handle the imbalance between positive and negative samples in the medical environment, it fails to address the challenge of difficult-to-separate samples. Hence, we introduce Focal Loss:
\(FL\left({p}_{t}\right)=-{\left(1-{p}_{t}\right)}^{\lambda }\mathrm{log}\left({p}_{t}\right)\)
where the parameter \(\lambda \) satisfies \(\lambda \ge 0\). The larger its value, the greater the impact of the control coefficient \({\left(1-{p}_{t}\right)}^{\lambda }\). In the medical decision-making system's pathological image processing, we aim to diminish the weight of easily classifiable background regions, allowing the model to prioritize challenging areas such as overlapping cell nuclei and cell edges during training. Additionally, to address the existing sample imbalance in our pathology image dataset, we incorporated cross-entropy loss. The parameter \(\lambda \) controls the rate of downweighting. When \(\lambda =0\), Focal Loss reduces to the cross-entropy loss function.
The updates to Focal Loss are as follows:
where \(\beta \) is the coefficient of the positive label sample.
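A minimal sketch of a class-weighted focal loss for the binary case follows. It assumes the common form \(-{\beta }_{t}{\left(1-{p}_{t}\right)}^{\lambda }\mathrm{log}\left({p}_{t}\right)\), with weight \(\beta \) on positive samples and \(1-\beta \) on negative samples; this weighting scheme and the function name `focal_loss` are assumptions and may not match the paper's exact update.

```python
import numpy as np

def focal_loss(p, y, lam=2.0, beta=None, eps=1e-8):
    """p: predicted probability of the positive class; y: 0/1 labels."""
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(y == 1, p, 1.0 - p)
    # Optional class weighting: beta for positives, 1 - beta for negatives (assumed).
    weight = 1.0 if beta is None else np.where(y == 1, beta, 1.0 - beta)
    # (1 - p_t)^lam downweights well-classified (easy) samples.
    return float(np.mean(-weight * (1.0 - p_t) ** lam * np.log(p_t + eps)))

p = np.array([0.9, 0.2, 0.6])
y = np.array([1, 0, 1])
cross_entropy = focal_loss(p, y, lam=0.0)    # lam = 0 recovers cross-entropy
focal = focal_loss(p, y, lam=2.0)            # easy samples contribute less
```

Setting `lam=0` and leaving `beta` unset recovers the plain cross-entropy, consistent with the limiting case stated above.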
Automatic recognition of pathology images is challenging due to the labor-intensive production and reading of digital pathology slides and the demanding professional skills of pathologists. Our proposed multiprocessing scheme improves the accuracy of predicting lesion regions in digital pathology images of tumors. It requires less hardware and is more computationally efficient, making it ideal for complex medical environments. As a reliable auxiliary diagnostic tool, our system enables physicians to rapidly acquire high-quality pathology images, which improves the identification and classification of malignant lesion regions.
Experiments
Introduction to experimental environment and dataset
Dataset: We used data from the Monash University Artificial Intelligence Research Center, consisting of 1000 pathology images. At a magnification of 40×, we randomly cropped 10 sub-images of 512 × 512 pixels from each pathology image, giving 10,000 images in total. Because the crop locations were random, many of the sub-images did not contain enough medical information to guide the diagnosis of osteosarcoma, so we carried out further screening and obtained a final dataset of 2,164 pathology images suitable for model training. The screening was implemented with a sliding window: if the blank background or contaminated area exceeded 0.7 of the area of a cropped image, the image was rejected. The 2,164 retained images were then given to three specialized pathologists for annotation. In addition, to improve the generalization ability of the deep neural network, we applied the described data augmentation pipeline to the images in the training set and processed the annotated images accordingly. In the experiments, 1700 pathology images were used as the training set and 464 as the test set, i.e. roughly 78.6% and 21.4% of the total dataset, respectively.
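The screening rule and the split percentages can be checked with a short sketch. The per-pixel brightness test, the `white_threshold` value, and the function names are hypothetical choices made for illustration; the paper's sliding-window procedure and its handling of contaminated regions are not reproduced here.

```python
import numpy as np

def blank_fraction(patch, white_threshold=0.85):
    """Fraction of near-white pixels in an (H, W, 3) patch with values in [0, 1].

    A pixel counts as blank background when all channels exceed the
    threshold (an assumed criterion, not taken from the paper).
    """
    return float(np.mean(patch.min(axis=-1) >= white_threshold))

def keep_patch(patch, max_blank=0.7):
    """Reject a patch when its blank fraction exceeds 0.7, as in the text."""
    return blank_fraction(patch) <= max_blank

def split_percentages(n_train, n_test):
    """Train and test shares of the dataset, in percent, to one decimal place."""
    total = n_train + n_test
    return round(100.0 * n_train / total, 1), round(100.0 * n_test / total, 1)

train_pct, test_pct = split_percentages(1700, 464)   # shares of the 2,164 images
```

With 1700 training and 464 test images, the shares come to 78.6% and 21.4%, matching the figures reported above.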
The experimental setup involved utilizing the Ubuntu 20.04.2 LTS operating system, PyTorch 1.10.0 deep learning framework, CUDA version 11.3, and Python version 3.8. The experiment employed an AMD EPYC 7543 32-Core Processor for the CPU and an RTX 3090 for the GPU.
Comparison models: The models used for comparison include U-Net, Unet++ [50], DeepLabv3+ [51], Attention U-Net [52], SETR, Swin-UNet [53], and CSWin Transformer [54]. These networks are arranged in chronological order of their proposal, spanning from 2015 to 2021. Among the selected comparative networks, U-Net stands as the most representative medical segmentation network, with its U-shaped architecture and classic skip-connection design allowing the later layers to freely select shallow and deep features. Unet++ represents a variant of the Unet series without attention structures, maximizing the potential of skip connections. The DeepLabv3+ series has achieved substantial success in the domain of semantic segmentation, with many of its designs, such as dilated convolutions, conditional random fields, and multi-scale dilated convolutions (ASPP module), carrying considerable inspirational value. Attention U-Net integrates soft attention into the simple, cost-effective Unet, significantly enhancing the model's accuracy. SETR, Swin-UNet, and CSWin Transformer are recent popular ViT-based networks, which often outperform networks consisting only of CNNs in image segmentation. Comparative experiments with these models provide a meaningful benchmark for the performance of our designed model.
Assessment metrics: We employed accuracy (Acc), precision (Pre), recall (Re), F1-score (F1), Intersection over Union (IoU), and Dice Similarity Coefficient (DSC) as the metrics to evaluate the efficacy of cell segmentation performed by the network [55, 56]. Among these, IoU is the ratio of the intersection of the actual and predicted cytosolic regions to their union, which effectively represents the similarity between the actual and predicted cytosolic regions. DSC measures the agreement between the model-predicted nucleus region and the manually labeled ground truth. The values of these two metrics lie in [0, 1], where a higher value signifies better model performance. In pathological image segmentation, our aim is to increase the IoU and DSC values of cells as much as possible to achieve precise segmentation.
Moreover, for a comparative analysis of computational costs between the lightweight model resulting from knowledge distillation and the original model, we utilized floating-point operations (FLOPs) as a measure of the model's computational complexity [57]. We used "Params" to measure the size of the model's parameters. The larger the values of these two parameters, the greater the computational and storage resources required by the model [58].
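For reference, IoU and DSC for a pair of binary masks can be computed as in the sketch below. The function name and the convention of returning 1.0 when both masks are empty are choices made here, not details given in the paper.

```python
import numpy as np

def iou_and_dsc(pred, target):
    """IoU and Dice Similarity Coefficient for two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Convention chosen here: two empty masks count as a perfect match.
    iou = inter / union if union else 1.0
    dsc = 2.0 * inter / total if total else 1.0
    return float(iou), float(dsc)

pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
iou, dsc = iou_and_dsc(pred, target)
```

In this toy case one of the two predicted pixels is correct, so IoU = 1/2 and DSC = 2/3. The two metrics are related by DSC = 2·IoU / (1 + IoU), which is why DSC is never lower than IoU on the same prediction.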
Hyperparameter settings: In all the experiments below, we trained the models for 300 epochs with a batch size of 4. The detailed parameters for the various algorithms in the data augmentation pipeline, such as execution probabilities and rotation angles, are described in the system methods module. During training, we used the Stochastic Gradient Descent (SGD) optimization algorithm, initializing the learning rate at 0.01, momentum at 0.9, and decay rate at 0.0005. In the focal loss, the parameters are the coefficients \(\mathrm{\alpha }\) of the positively labeled samples. In knowledge distillation, we set the temperature \(\tau \) to 4. For the logits map, we set α to 3, and for the intermediate feature map, we set α to 50.
Experimental results
Before feeding the original images into the neural network, they need preprocessing. In this context, a data augmentation pipeline is employed to improve the model's generalization. Data augmentation processes the original images to create new ones, allowing the model to learn features from different angles, scales, and morphologies. In this experiment, the data augmentation pipeline primarily involves color enhancement but excludes basic operations like rotation and flipping. This means that the resulting enhanced images primarily focus on changes in color rather than adjustments in shape or orientation. The effect of color enhancement is shown in Fig. 4, demonstrating significant color differences between the original images and those after color enhancement. This type of processing helps the model better learn color features in pathological images, enabling it to have stronger generalization capabilities when dealing with pathological images from different staining batches or various laboratories. As pathological images from different staining batches or laboratories may vary, color enhancement can somewhat mitigate the impact of these differences on the model's learning. The contrast between the cell images and background images becomes more distinct after color enhancement, aiding the model in distinguishing and extracting features. Additionally, as staining effects differ each time, color enhancement ensures the model isn't confined to a specific staining method, thereby enhancing its performance in practical applications.
The sequentially color-enhanced results of images with staining variations. Column a represents the original images; column b depicts images after random brightness enhancement; column c exhibits images after random saturation enhancement; column d displays images after random hue enhancement; column e showcases images after random contrast enhancement; and column f presents the images finally input into the training model
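The sequence of color operations named in the caption (brightness, saturation, hue, contrast) can be sketched in NumPy as follows. The parameter ranges, the YIQ-rotation implementation of the hue shift, and all function names are assumptions for illustration; the paper's actual execution probabilities and ranges are not reproduced.

```python
import numpy as np

# RGB <-> YIQ matrices; the first row gives the luminance (grayscale) weights.
_RGB2YIQ = np.array([[0.299, 0.587, 0.114],
                     [0.596, -0.274, -0.322],
                     [0.211, -0.523, 0.312]])
_YIQ2RGB = np.linalg.inv(_RGB2YIQ)

def _gray(img):
    """Luminance of an (H, W, 3) image, kept as (H, W, 1) for broadcasting."""
    return (img @ _RGB2YIQ[0])[..., None]

def adjust_brightness(img, factor):
    return np.clip(img * factor, 0.0, 1.0)

def adjust_saturation(img, factor):
    gray = _gray(img)
    return np.clip(gray + factor * (img - gray), 0.0, 1.0)

def adjust_contrast(img, factor):
    mean = _gray(img).mean()
    return np.clip(mean + factor * (img - mean), 0.0, 1.0)

def adjust_hue(img, theta):
    """Rotate the chroma (I, Q) plane by theta radians, keeping luminance fixed."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
    m = _YIQ2RGB @ rot @ _RGB2YIQ
    return np.clip(img @ m.T, 0.0, 1.0)

def color_jitter(img, rng, strength=0.2, max_hue=0.1):
    """Apply the four operations in the order listed in the caption.

    `strength` and `max_hue` are hypothetical ranges, not the paper's values.
    """
    img = adjust_brightness(img, rng.uniform(1 - strength, 1 + strength))
    img = adjust_saturation(img, rng.uniform(1 - strength, 1 + strength))
    img = adjust_hue(img, rng.uniform(-max_hue, max_hue))
    img = adjust_contrast(img, rng.uniform(1 - strength, 1 + strength))
    return img

rng = np.random.default_rng(0)
image = rng.uniform(0.1, 0.9, size=(4, 4, 3))    # stand-in for a stained patch
augmented = color_jitter(image, rng)
```

Because only pixel colors change, the annotation masks can be reused unchanged for images augmented this way.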
The well-trained segmentation networks can efficiently convert malignant tumor pathological images into segmented results of the same size. This technological advancement has greatly improved the accuracy and efficiency of medical diagnosis, providing physicians with important auxiliary support in identifying and treating malignant tumors.
Figure 5 displays the resulting images after the model's processing. In this illustration, we can observe the original image alongside its corresponding segmented result. The segmentation result distinctly presents the cellular structures in the pathological image, enabling doctors to more accurately identify tumor regions.
The representative segmentation results of several pathological images. Column a represents the original pathological tissue slide image; column b describes the ultimate image output of the system's processing; column c shows the prediction from the deep learning network; column d presents the labeled image, and column e showcases the DSC score for this predicted image
The system exhibits remarkable performance in identifying and segmenting cells within pathological images. By leveraging advanced image processing techniques, the model separates cells from surrounding tissue, thereby enhancing diagnostic accuracy. However, despite its overall excellence, the system still exhibits some limitations. In certain instances, the model may present slight under-segmentation, leading to incomplete recognition of some cells. Additionally, there might be instances of blurred edges on some cell boundaries, impacting image clarity. To assess the system's performance, we utilized the DSC (Dice Similarity Coefficient) score. DSC is a frequently utilized metric for assessing image segmentation, with values closer to 1 indicating better segmentation results. The model's DSC score demonstrates excellence, affirming its high accuracy and reliability in handling malignant tumor pathological images.
Comparative experiments
We conducted comprehensive comparative experiments under the same conditions between our segmented model, the model after knowledge distillation, and other commonly used models for medical image segmentation or that are highly representative in the realm of semantic segmentation, as mentioned above. The values of the individual metrics are averaged over all test images for the different models. Figure 6 illustrates the experimental data comparison between the aforementioned models and our model, and the detailed data are recorded in Table 2. Analyzing the results reveals the proficient performance of our segmentation model in the context of pathological image segmentation, outperforming other models in many indicators. The Precision (Pr) value is substantially higher than that of other classical models, indicating that our proposed method achieves high precision in pathological image segmentation, accurately separating cell nuclei from cytoplasm. Additionally, the key evaluation metrics, IoU (Intersection over Union) and DSC (Dice Similarity Coefficient), suggest that the predicted nuclear area of our model is reasonably comparable to the actual nuclear area.
The IoU and DSC of our teacher model exceed those of CSWin Transformer, the best-performing comparative model, by 5.9% and 1.8%, respectively. When compared only with traditional CNN models, our model demonstrates substantially superior performance: TFSS_T improves IoU by 11.8% over U-Net. We also consider the ViT model Swin-unet, which has a small parameter count and computational load. Its performance is only close to that of U-Net, which indicates that directly applying segmentation models from other domains to pathological image segmentation may not yield good results. In contrast, the lightweight student model we designed has lower FLOPs and Params than Swin-unet, yet its performance far surpasses that of Swin-unet.
In summary, our model has clear advantages compared to other models.
In addition, as shown in Fig. 7, we also compared our method with frameworks of similar FLOPs magnitude. Our approach achieves optimal performance while maintaining a low level of FLOPs, which is highly beneficial and important in areas with limited computational resources. Compared to Unet++ at the same FLOPs level, our model consistently outperformed Unet++. Furthermore, our method demonstrated significant performance improvements over several models with higher FLOPs. This indicates that our model can obtain more accurate pathological image segmentation results with faster training speed and lower cost.
Figure 8 shows the segmentation results of different segmentation models on several representative histopathology images, including a comparison between our teacher model and the student model. The figure shows that our TFSS_T model produces the most accurate segmentation results, and that the TFSS_S model also performs well. For simple segmentation tasks (e.g., Fig. 8d and e), the advantages of our model are less apparent, mainly because all models predict such cases well. However, for cell segmentation tasks with complex boundaries (e.g., Fig. 8a and b), our model is better able to outline the boundaries of densely stacked cells. For images with irregularly shaped cells (e.g., Fig. 8c and f), our model shows better robustness and generalization ability, accurately segmenting cells with uneven distribution, different sizes, and irregular shapes. The DSC values below these predictions illustrate the performance of our model more specifically.
Ablation experiments
To further investigate the impact of Focal Loss, the data augmentation pipeline, knowledge distillation, and CRF on the segmentation performance of the model, we conducted corresponding ablation experiments. The experimental results are shown in Figs. 9 and 10, and detailed experimental data are recorded in Table 3.
Figure 9 shows the performance of our model after separately incorporating the data augmentation pipeline and Focal Loss. When the teacher model used only the data augmentation method, its IoU and Re improved while its DSC decreased, possibly because the model did not reach its optimal performance point under the same training load even with the increased data volume. When using only Focal Loss, the model's Re increased by 4%, with other metrics remaining relatively unchanged. Although this improvement is modest, it is noteworthy because it is achieved on an already efficient model, indicating that Focal Loss further enhances the model's performance. For the model that incorporates both methods simultaneously, all metrics improve compared to the original model: relative to the original TFSS_T, the IoU, DSC, Pr, and F1 scores increased by 4.3%, 1.6%, 1.6%, and 1.6%, respectively. The model with both data augmentation and Focal Loss therefore exhibits stronger performance, substantially increasing the model's flexibility and enabling more accurate segmentation.
Figure 10 demonstrates the performance changes of the student model before and after incorporating knowledge distillation and CRF. It is evident that knowledge distillation greatly improves the performance of the student model, with an increase of 5% in IoU, 13% in DSC, 15% in Recall, and 10% in Precision. This is a significant improvement over the original model, allowing for good performance even with a small computational load. The addition of CRF as a post-processing step does not lead to notable numerical improvements in the model's performance. However, in practical terms, CRF helps the segmentation model to connect the contextual information of the image, making the segmentation of stacked cells in pathological images more distinct and aiding doctors in making accurate decisions.
Conclusion
In this study, we have developed an artificial intelligence multi-processing solution (MSPInet) specifically designed for malignant tumor pathological tissue slice images, aiming to assist doctors in reducing their workload. The multi-processing aspect of the method is mainly reflected in our processing steps and in the different segmentation models for pathology images. We have designed a segmentation method based on vision transformers, which first performs coarse segmentation of the image and then uses CRF to optimize the segmentation results. Additionally, we have introduced a training strategy using knowledge distillation to achieve higher prediction accuracy at a lower cost. The results show that our system has higher accuracy, lower complexity, and shorter training time. The results predicted by the system can provide doctors with an auxiliary reference basis for clinical diagnosis. Thus, the system plays an important role in simplifying the clinical diagnostic process and assisting doctors in improving diagnostic efficiency and accuracy.
In the future, we will continue to improve the accuracy of the model segmentation, optimize the clarity of cell edge segmentation, and explore the use of semi-supervised or unsupervised learning to address the issue of small sample sizes. In terms of data processing, we will explore more possible denoising algorithms. In the design of small tools, we hope to use clustering methods to grade complex cell boundaries instead of relying solely on mathematical parameters. Lastly, we will integrate clinical practice to continuously improve the intelligent assisted diagnostic system.
Data availability
All data analyzed during the current study are included in the submission. Data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after publication of this article, will be considered by the corresponding author.
References
Han Y, Holste G, Ding Y, Tewfik A, Peng Y, Wang Z (2023) Radiomics-guided global-local transformer for weakly supervised pathology localization in chest X-Rays. IEEE Trans Med Imaging 42(3):750–761. https://doi.org/10.1109/TMI.2022.3217218
Yuan T, Zeng J (2023) A medically assisted model for precise segmentation of osteosarcoma nuclei on pathological images. IEEE J Biomed Health Inform 27(8):3982–3993. https://doi.org/10.1109/JBHI.2023.3278303
Peng T, Tang C, Wu Y et al (2022) H-SegMed: a hybrid method for prostate segmentation in TRUS images via improved closed principal curve and improved enhanced machine learning. Int J Comput Vis 130:1896–1919. https://doi.org/10.1007/s11263-022-01619-3
Fan L, Sowmya A, Meijering E, Song Y (2023) Cancer survival prediction from whole slide images with self-supervised learning and slide consistency. IEEE Trans Med Imaging 42(5):1401–1412. https://doi.org/10.1109/TMI.2022.3228275
Song X et al (2023) Multicenter and multichannel pooling GCN for early AD diagnosis based on dual-modality fused brain network. IEEE Trans Med Imaging 42(2):354–367. https://doi.org/10.1109/TMI.2022.3187141
Luo T et al (2024) Continuous refinement-based digital pathology image assistance scheme in medical decision-making systems. IEEE J Biomed Health Inform. https://doi.org/10.1109/JBHI.2024.3351287
Wu H, Huang X, Guo X, Wen Z, Qin J (2023) Cross-image dependency modeling for breast ultrasound segmentation. IEEE Trans Med Imaging 42(6):1619–1631. https://doi.org/10.1109/TMI.2022.3233648
Li X, Guo R, Lu J, Chen T, Qian X (2023) Causality-driven graph neural network for early diagnosis of pancreatic cancer in non-contrast computerized tomography. IEEE Trans Med Imaging 42(6):1656–1667. https://doi.org/10.1109/TMI.2023.3236162
Wei H, Lv B, Liu F, Tang H (2023) A tumor MRI image segmentation framework based on class-correlation pattern aggregation in medical decision-making system. Mathematics 11(5):1187. https://doi.org/10.3390/math11051187
Huang P et al (2023) A ViT-AMC network with adaptive model fusion and multiobjective optimization for interpretable laryngeal tumor grading from histopathological images. IEEE Trans Med Imaging 42(1):15–28. https://doi.org/10.1109/TMI.2022.3202248
Wysocki O et al (2023) Assessing the communication gap between AI models and healthcare professionals: explainability, utility and trust in AI-driven clinical decision-making. Artif Intell 316:103839
Wu H, Lin C, Liu J, Song Y, Wen Z, Qin J (2023) Feature masking on non-overlapping regions for detecting dense cells in blood smear image. IEEE Trans Med Imaging 42(6):1668–1680. https://doi.org/10.1109/TMI.2023.3234688
Kostick-Quenet K, Rahimzadeh V (2023) Ethical hazards of health data governance in the metaverse. Nat Mach Intell 5:480–482. https://doi.org/10.1038/s42256-023-00658-w
He Z, Liu J (2023) An innovative solution based on TSCA-ViT for osteosarcoma diagnosis in resource-limited settings. Biomedicines 11(10):2740. https://doi.org/10.3390/biomedicines11102740
Huang Z, Ling Z (2024) Medical assisted-segmentation system based on global feature and stepwise feature integration for feature loss problem. Biomed Signal Process Control 89:105814. https://doi.org/10.1016/j.bspc.2023.105814
Wu H, Wang Z, Song Y, Yang L, Qin J (2022) Cross-patch dense contrastive learning for semi-supervised segmentation of cellular nuclei in histopathologic images. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 11656–11665. https://doi.org/10.1109/CVPR52688.2022.01137
Li W et al (2024) Artificial intelligence auxiliary diagnosis and treatment system for breast cancer in developing countries. J X-Ray Sci Technol. https://doi.org/10.3233/XST-230194
He K, Qin Y (2023) A novel medical decision-making system based on multi-scale feature enhancement for small samples. Mathematics 11:2116. https://doi.org/10.3390/math11092116
Gou F, Wu J (2022) An attention-based AI-assisted segmentation system for osteosarcoma MRI images. IEEE Int Conf Bioinform Biomed (BIBM) 2022:1539–1543. https://doi.org/10.1109/BIBM55620.2022.9995391
Salehi P, Chalechale A (2020) Pix2Pix-based stain-to-stain translation: a solution for robust stain normalization in histopathology images analysis. In: 2020 International Conference on Machine Vision and Image Processing (MVIP), Iran, pp. 1–7
Pérez-Bueno F et al (2021) Blind color deconvolution, normalization, and classification of histological images using general super Gaussian priors and Bayesian inference. Comput Methods Prog Biomed 211:106453
Shen Y, et al. (2022) Randstainna: learning stain-agnostic features from histology slides by bridging stain augmentation and normalization. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer Nature, Cham
Ho M-Y, Min-Sheng W, Che-Ming W (2022) Ultra-high-resolution unpaired stain transformation via kernelized instance normalization. In: European Conference on Computer Vision. Springer Nature, Cham, pp.490–505
Hong M, Woo C, Gunhee K (2021) Stylemix: separating content and style for enhanced data augmentation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Choi HK, Joonmyung C, Hyunwoo JK (2022) Tokenmixup: efficient attention-guided token-level data augmentation for transformers. Adv Neural Inform Process Syst 35:14224–14235
Goodfellow I, et al. (2014) Generative adversarial nets. Adv Neural Inform Process Syst 27
Odena A, Christopher O, Jonathon S (2017) Conditional image synthesis with auxiliary classifier gans. In: International conference on machine learning. PMLR
Pan X, et al. (2023) Drag your gan: interactive point-based manipulation on the generative image manifold. In: ACM SIGGRAPH 2023 Conference Proceedings
Zhang Y, Wang Z, Zhang Z et al (2023) GAN-based one dimensional medical data augmentation. Soft Comput 27:10481–10491. https://doi.org/10.1007/s00500-023-08345-z
Shen Y, Gou F, Dai Z (2022) Osteosarcoma MRI image-assisted segmentation system base on guided aggregated bilateral network. Mathematics 10(7):1090
Ouyang T et al (2022) Rethinking U-net from an attention perspective with transformers for osteosarcoma MRI image segmentation. Comput Intell Neurosci. 2022:1–17
Kim Y (2014) Convolutional neural networks for sentence classification. Eprint Arxiv
Strudel R, et al. (2021) Segmenter: transformer for semantic segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision
Ronneberger O, Philipp F, Thomas B (2015) U-net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany. Proceedings, Part III 18. Springer International Publishing
Long J, Evan S, Trevor D (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Du W, Yang H, Toe TT (2023) An improved image segmentation model of FCN based on residual network. In: 2023 4th International Conference on Computer Vision, Image and Deep Learning (CVIDL), Zhuhai, China, pp. 135–139. https://doi.org/10.1109/CVIDL58838.2023.10166778
Dosovitskiy A, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Touvron H, et al. (2021) Training data-efficient image transformers and distillation through attention. In: International conference on machine learning. PMLR
Zheng S, et al. (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Liu Z, et al. (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision
Carion N, et al. (2020) End-to-end object detection with transformers. In: European conference on computer vision. Springer International Publishing, Cham
He K, et al. (2023) Image segmentation technology based on transformer in medical decision-making system. IET Image Process 17(10): 3040–3054. https://doi.org/10.1049/ipr2.12854
Zhou L, Tan Y (2022) A residual fusion network for osteosarcoma MRI image segmentation in developing countries. Comput Intell Neurosci 2022:11–15
Lv B, Liu F, Li Y, Nie J (2023) Artificial intelligence-aided diagnosis solution by enhancing the edge features of medical images. Diagnostics 13(6):1063. https://doi.org/10.3390/diagnostics13061063
Zhan X et al (2023) An intelligent auxiliary framework for bone malignant tumor lesion segmentation in medical image analysis. Diagnostics 13(2):223. https://doi.org/10.3390/diagnostics13020223
Liu J, Zhu J, Wu J (2022) A multimodal auxiliary classification system for osteosarcoma histopathological images based on deep active learning. Healthcare 10(11):2189. https://doi.org/10.3390/healthcare10112189
Xiao P, Huang H, Zhou Z, Dai Z (2022) An artificial intelligence multiprocessing scheme for the diagnosis of osteosarcoma MRI images. IEEE J Biomed Health Inform 26(9):4656–4667. https://doi.org/10.1109/JBHI.2022.3184930
Yao S, et al. (2023) Radar-camera fusion for object detection and semantic segmentation in autonomous driving: a comprehensive review. arXiv preprint arXiv:2304.10410
Zheng S, et al. (2015) Conditional random fields as recurrent neural networks. In: Proceedings of the IEEE international conference on computer vision
Zhou Z, et al. (2018) UNet++: a nested U-Net architecture for medical image segmentation. arXiv preprint https://arxiv.org/abs/1807.10165
Chen L-C, et al. (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European conference on computer vision (ECCV)
Oktay O, et al. (2018) Attention U-Net: learning where to look for the pancreas. arXiv preprint https://arxiv.org/abs/1804.03999
Cao H, et al. (2022) Swin-Unet: Unet-like pure transformer for medical image segmentation. In: European conference on computer vision. Springer Nature, Cham
Dong X, et al. (2022) CSWin transformer: a general vision transformer backbone with cross-shaped windows. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 12114–12124
Liu F et al (2022) Auxiliary segmentation method of osteosarcoma MRI image based on transformer and U-Net. Comput Intell Neurosci 2022:9990092. https://doi.org/10.1155/2022/9990092
Lv B, Liu F (2022) Multi-scale tumor localization based on priori guidance-based segmentation method for osteosarcoma MRI images. Mathematics 10(12):2099. https://doi.org/10.3390/math10122099
Guo Y, Dai Z (2022) A medical assistant segmentation method for MRI images of osteosarcoma based on DecoupleSegNet. Int J Intell Syst 37(11):8436–8461. https://doi.org/10.1002/int.22949
He K, Zhu J, Li L (2024) Two-stage coarse-to-fine method for pathological images in medical decision-making systems. IET Image Process 18(1):175–193. https://doi.org/10.1049/ipr2.12941
Funding
Guizhou Provincial Science and Technology Program Project (Qiankehe Basic-zk[2024] General 063).
Ethics declarations
Conflicts of interest
The authors declare no conflict of interest.
Institutional review board statement
Not applicable.
Informed consent statement
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gou, F., Tang, X., Liu, J. et al. Artificial intelligence multiprocessing scheme for pathology images based on transformer for nuclei segmentation. Complex Intell. Syst. 10, 5831–5849 (2024). https://doi.org/10.1007/s40747-024-01471-7
DOI: https://doi.org/10.1007/s40747-024-01471-7