Introduction

High-resolution digital pathology images contain a wealth of medical information such as tissue details, cellular structures, and lesion characteristics [1]. This information helps physicians make more accurate diagnostic and treatment decisions. Traditional pathology diagnosis relies on pathologists examining slides one by one [2]. This process is time-consuming and cumbersome, and diagnostic results are influenced by personal experience and subjective judgment, which can lead to discrepancies between diagnoses and thus affect patient outcomes. There is a growing global shortage of pathologists, especially in developing countries [3]. The resulting pressure on diagnostic pathology further affects the accuracy and efficiency of diagnosis. Quantitative analysis of digital pathology images through computer-aided diagnostic techniques can greatly reduce the repetitive work of pathologists [4,5,6].

Computer-aided diagnosis provides medical workers with more efficient and accurate diagnostic results through image processing technology [7,8,9]. In particular, intelligent image segmentation can automatically identify and extract key features in images to improve diagnostic efficiency. Deep learning methods have driven the development and application of artificial intelligence in digital pathology analysis owing to their powerful learning ability [10]. For example, convolutional neural networks (CNNs), the U-Net model, the FCN model, and recurrent neural networks (RNNs) have been widely used in pathological image preprocessing, cell nucleus and karyotype identification and quantification, tumor classification and grading, etc. [11,12,13].

However, existing deep learning methods have limitations in segmenting pathological images of malignant tumors, making it difficult to meet practical needs [14]. Although clustering-based methods are computationally efficient, they are sensitive to noise, resulting in low accuracy. Learning-based segmentation methods struggle to balance accuracy and segmentation efficiency [15]. CNN-based segmentation methods achieve higher accuracy but are time-consuming and memory-intensive [16]. Especially in developing countries with limited economic resources, the following problems remain:

  1. Limited specialized personnel. Weak economies and infrastructure in developing countries result in a relatively low per capita share of medical resources, and those resources are often distributed extremely unevenly. In areas with limited medical resources, most patients find it difficult to receive timely diagnosis and treatment in the early stages of disease [17].

  2. Pixel-level annotation of pathology images is costly, requiring substantial manpower and expertise. This high annotation cost hinders the further development and application of pathology image analysis.

  3. The value density of digital pathology images is low. Large sections contain a great deal of background, and the average cell area is only 8.29% of the background area, implying a low proportion of valid information [18]. Because pathology section images are numerous and of high resolution, screening and processing them by eye takes considerable time and effort.

  4. The models consume substantial computational resources. Tumor cell morphology is complex and cells frequently stack and overlap, which demands high accuracy and generalization from the segmentation method. High-performance pathology image segmentation models usually have a large number of parameters and a large scale, requiring large amounts of computational resources [19].

To realize an image segmentation method with high accuracy, high efficiency, high automation and good repeatability, we propose an artificial intelligence multiprocessing scheme (MSPInet) for digital pathology images of malignant tumors.

In preprocessing, we use data augmentation methods such as random flipping, cropping, rotation, and color enhancement to increase the diversity of the dataset. We introduce a fast non-local method based on integral images to reduce noise caused by equipment quality, the production environment, and operational errors during slice generation. We then design a cell nucleus segmentation model based on the Transformer for Semantic Segmentation (TFSS), which uses a bootstrap aggregation bilateral network for coarse segmentation and a Conditional Random Field (CRF) to refine the coarse results. Finally, we improve the training strategy with knowledge distillation. The system can provide physicians with intuitive auxiliary diagnostic results.

Our contributions are specified below:

  1. We designed a segmentation model based on the Transformer for Semantic Segmentation (TFSS). By decoupling body features, obtained from low-frequency components, from edge features, obtained from high-resolution shallow features, and assigning each its own supervision, the model achieves high-precision cell nucleus segmentation. Our method eases the difficulty of boundary fitting, so the segmentation results fit the actual cell nucleus regions better. Accurate structures can help doctors reduce misdiagnosis.

  2. We use a CRF model as an optimizer to further refine fine details such as cell nucleus gaps, nucleus edges, and overlapping cells. This improves the fitting accuracy of edge regions while reducing the loss of information at cell nucleus edges.

  3. We improve the training strategy of knowledge distillation so that the model not only maintains training efficiency but also focuses more on difficult-to-segment samples, resulting in higher prediction accuracy.

  4. More than 4000 samples were used in our experiments. The results show that the proposed MSPInet strategy has clear advantages. Doctors can use the segmentation results as an important auxiliary basis during diagnosis and treatment, enabling rapid diagnosis, reducing workload, and improving efficiency.

Related work

As information technology and computer hardware advance, the utilization of artificial intelligence in medical diagnosis is progressively expanding. This integration enhances the precision and efficiency of medical diagnosis, empowering healthcare professionals with robust tools to elevate the processes of diagnosis and treatment for patients.

Several researchers are now focusing on investigating the influence of different image staining styles on deep learning models. Salehi et al. [20] proposed STST (Pix2Pix-based Stain-to-Stain Translation), initially converting RGB format original pathological images into grayscale images, using the two images as pairs for model training. However, due to the close similarity in color between cell nuclei and cytoplasm, it becomes difficult to distinguish between the two in the transformed images. Pérez-Bueno et al. [21] utilized blind color deconvolution to isolate individual staining channels in multi-stained images, reducing the impact of color variations on computer-aided diagnosis systems. This framework introduced an innovative variational Bayesian blind color deconvolution algorithm that automatically estimates color vector matrices, stain concentrations, and all model parameters, although its effectiveness is not evident in certain bands. Yiqing et al. [22] proposed RandStainNA, which combines stain normalization and stain augmentation to enhance the model's generalization ability, but its generalization ability in color spaces like YUV still needs improvement. Ho et al. [23] introduced the URUST framework, which considers the correlation between neighboring patches, aiming to minimize the color discrepancy between adjacent stained patches after normalization, thereby reducing post-staining errors. Moreover, URUST significantly reduces hardware requirements, enabling the processing of ultra-high-resolution pathological images in the same hardware environment.

Several widely used image enhancement techniques can also enhance the segmentation performance of the model, such as Stylemix [24] and TokenMixup [25]. These image enhancement techniques typically enable the model to achieve improved generalization in a variety of scenarios. Unfortunately, in the realm of medical imaging, these approaches can readily distort the morphological information of osteosarcoma cell nuclei and disrupt the training of the model. Since the introduction of GAN [26], there have been many data augmentation studies based on GAN, such as ACGAN [27], DANGAN [28], which have played a significant role in cancer diagnosis and treatment processes. For example, Zhang et al. [29] proposed a model based on WGAN-GP, which, when used to augment one-dimensional clinical data, performs well in solving the imbalance problem with fewer sample datasets. However, due to the inability of GAN networks to guarantee convergence, the performance of GAN-based data augmentation methods is mediocre when faced with large-scale datasets.

Many researchers have also conducted studies on semantic segmentation techniques for pathological images. Yedong et al. [30] suggested a novel semi-supervised approach for semantic segmentation of cell nuclei in pathological images. Their sampling strategy concentrates on the distribution of cell nuclei to enhance segmentation accuracy, but overlooks the image variations induced by staining styles. Ouyang et al. [31] found a significant amount of redundant patches in the original pathological images and suggested that eliminating these patches can significantly reduce manual labor costs, offering a novel perspective on pathological semantic segmentation. However, despite only requiring 5% of labels, the cost of selecting suitable patches cannot be ignored.

Enhancing the performance of deep learning models is also a crucial strategy for improving segmentation accuracy. Convolutional Neural Networks (CNNs) [32] have demonstrated remarkable achievements in the realm of computer vision, especially in image segmentation tasks. CNNs can learn local features from images and efficiently fuse them, enabling precise image segmentation.

TFSS (Transformer for Semantic Segmentation) [33] is a novel deep learning network architecture that, compared to the traditional Unet [34], maintains high-resolution representations during model training. This allows TFSS to capture finer details in images and achieve multi-scale feature fusion, improving the spatial accuracy of segmentation results. In recent advancements in image segmentation, techniques that integrate the Fully Convolutional Network (FCN) [35] with an encoder-decoder structure have emerged as the primary approaches for semantic segmentation. For example, Du et al. [36] improved the FCN model with ResNet as the backbone, achieving accurate recognition and segmentation at a lower time cost.

Recently, Vision Transformers (ViTs) [37] introduced a transformer architecture without convolutional layers designed for image classification, treating input images as sequences of patch tokens. ViTs require training on extensive datasets, while DeiT [38] proposed a labeled knowledge distillation approach and used CNNs to obtain competitive vision transformers trained on the ImageNet-1k dataset. Concurrent work has extended transformers to video classification and semantic segmentation. In particular, SETR [39] employed a ViT backbone along with a conventional CNN decoder. Swin Transformer [40], a variant of ViT, uses shifted local windows and merges patches between stages, forming a feature pyramid that can be paired with an FCN-style decoder.

In this study, we present a segmenter, employing an encoder-decoder architecture based on transformers for semantic image segmentation. Our method utilizes a ViT backbone and incorporates a mask decoder inspired by DETR [41]. Without relying on convolutions, our architecture captures global image context through thoughtful design, delivering competitive performance on standard image segmentation benchmarks.

Systematic approach

Digital pathology images have emerged as the gold standard for diagnosing malignant tumors, owing to the abundant medical information they encapsulate [42]. Traditional pathology diagnosis relies on manual, one-by-one judgments by pathologists. This process is both time-consuming and cumbersome [43]. The complexity of pathology images results in diagnostic outcomes influenced by personal experience and subjective judgment [44,45,46]. Hence, our goal is to reduce the burden on doctors by streamlining their workload and improving the efficiency of clinical diagnosis through the use of intelligent medical technology. In this paper, an artificial intelligence multiprocessing scheme (MSPInet) for malignant tumor pathology tissue section images is developed. Specifically, we utilize a Visual Transformer (ViT)-based approach to segment cell nuclei in pathology images. This aids doctors in the initial identification of histopathology images and provides valuable references for their subsequent diagnosis and determination of the degree of soft tissue invasion. The system's overall design is depicted in Fig. 1.

Fig. 1 Overall architecture diagram of MSPInet

The proposed scheme comprises two main parts. The first part involves the pre-processing of histopathological images before segmentation. In the initial stage, a pre-screening process selects a limited set of valuable images, specifically those containing lesion areas, and passes them to the next stage. Following this, we introduce a rapid non-local method that leverages integral images to remove noise from the images, consequently improving quantitative metrics and visual quality. The second part focuses on the system's segmentation. The noise-reduced image is input into an image segmentation network (Segmenter), and a Guided Aggregation Bipartite Network assists in segmenting the histopathology image of osteosarcoma, resulting in a preliminary segmentation result map. The tumor segmentation boundaries are then refined using conditional random fields. Ultimately, histopathological images identifying the lesion region are obtained to aid doctors in disease diagnosis. Additionally, to address the challenges of complex computation and large parameter counts in large models, we employ a knowledge distillation training strategy. Table 1 lists some notations used in this paper. This section is divided into three subsections: Sect. "Image Preprocessing Module" details the image preprocessing process, Sect. "Segmentation Model" analyzes the image segmentation model, and Sect. "Training Strategy" elaborates on the training strategy and loss function.

Table 1 Main parameters

Image preprocessing module

Although we have more than 1,000 digital pathology images, each image is very large, about 3–5 GB, and contains hundreds of thousands of labeled nuclei [47]. Pixel-level labeling of pathology images is very expensive; annotating all the cells one by one would require substantial labor and cost. To make the most of this expensive labeling, we built a data enhancement pipeline tailored to the characteristics of malignant tumor pathology images. Each image introduced in a training round first passes through this pipeline to ensure the diversity of input data. A total of 10,000 images were obtained after random cropping. We then eliminated images that were contaminated, had too much blank background, or contained too few nuclei, leaving 2164 images for annotation. These images were fed into the data enhancement pipeline to increase the diversity of the data. The pipeline includes the following elements (a minimal code sketch of the whole pipeline is given after this list). The size of the input image is denoted as \(W\times H\). The lower-left corner of the image serves as the origin of a rectangular coordinate system, with the positive x-axis extending to the right and the positive y-axis upward. \((\frac{W}{2},\frac{H}{2})\) is the center of the image, and \((x,y)\) denotes the coordinates of any pixel in the input image. As shown in Fig. 2, we apply random cropping, rotation, translation, flipping, and color enhancement to the original digital pathology images in sequence. Random color enhancement includes random brightness, saturation, hue, and contrast adjustments.

Fig. 2 Image preprocessing module flow


1. Randomly Crop Images: The image is randomly cropped to the size \({W}_{0}\times {H}_{0}\) required by the network input (\(512\times 512\) in our case), while ensuring that the proportion of non-zero pixels in the cropped region is greater than \(\alpha \) (set to \(\alpha =0.7\) in our case). The top-left vertex \((X_1,Y_1)\) of the cropped rectangle is chosen at random so that the crop lies within the image. This process repeats up to 10 times, concluding either when the non-zero region of the crop meets the preset threshold or when the loop completes without meeting it. The marker image is cropped in the same way.


2. Randomly Rotate Images: Considering the rotational invariance of the cell, the input image undergoes a rotation with a fifty percent probability through a random angle \(\theta \in [-180,180]\) with \((\frac{W}{2},\frac{H}{2})\) as the center of rotation. The marker image follows this change, and the rotated coordinate point becomes \(\left({x}_{1},{y}_{1}\right)\) (see Eq. 1). Finally, the original image and the marker image are padded with black to form the minimal enclosing horizontal rectangle. As a result, the size of the rotated image changes to \({W}^{\prime} \times {H}^{\prime}\).

$$\left\{\begin{array}{c}{x}_{1}=\left(x-\frac{W}{2}\right){\text{cos}}\theta +\left(y-\frac{H}{2}\right)\left(-{\text{sin}}\theta \right)+\frac{W}{2}\\ {y}_{1}=\left(x-\frac{W}{2}\right){\text{sin}}\theta +\left(y-\frac{H}{2}\right){\text{cos}}\theta +\frac{H}{2}\end{array}\right.$$
(1)

3. Randomly Translate Images: In conventional semantic segmentation, the cells to be recognized may appear with equal probability at any position in the image, including the center and the edges; when small patches are cropped, a cell is often cut across two patches. To simulate this situation, the input image has a fifty percent probability of being randomly translated by distances \(h \in [-\frac{H}{2},\frac{H}{2}]\) and \(w \in [-\frac{W}{2},\frac{W}{2}]\) in the vertical and horizontal directions, respectively, giving the translated coordinates \(\left({x}_{2},{y}_{2}\right)\) (see Eq. 2). The image size is unchanged after translation, and the labeled image undergoes the same transformation. The blank area created by the translation of the original and labeled images is filled with black. The image after random translation is denoted \({W}_{{x}_{2}}^{\prime}\times {H}_{{y}_{2}}^{\prime}\).

$$\left\{\begin{array}{c}{x}_{2}={x}_{1}+w\\ {y}_{2}={y}_{1}+h\end{array}\right.$$
(2)

4. Randomly Flip Images: The image has a fifty percent probability of being flipped vertically (Eq. 3) or horizontally (Eq. 4), giving the flipped coordinates \(\left({x}_{4},{y}_{4}\right)\) and \(\left({x}_{3},{y}_{3}\right)\), respectively, with the marker image adapting accordingly.

$$\left\{\begin{array}{c}{x}_{4}={x}_{3}\\ {y}_{4}={H}_{{y}_{2}}^{\prime}-{y}_{3}-1\end{array}\right.$$
(3)
$$\left\{\begin{array}{c}{x}_{3}={W}_{{x}_{2}}^{\prime}-{x}_{2}-1\\ {y}_{3}={y}_{2}\end{array}\right.$$
(4)

5. Random Color Enhancement: This data augmentation approach simulates staining differences or visual noise caused by various factors, such as different staining workers, staining batches, scanners, etc., during the processing of pathology images. The module has a fifty percent probability of being executed, and it comprises several sub-modules. Each sub-module has a fifty percent probability of being executed, and none of the labeled images will be altered unless specified otherwise.


6. Random Luminance Enhancement: Cytopathology images frequently exhibit uneven illumination due to factors such as microscope settings, sample thickness, staining depth, and light leakage. To simulate these illumination differences, data enhancement is implemented here by varying the image luminance. Let \({p}_{i}\) denote the pixel value at any point in the image, and let a random \(\upgamma \in [-32, 32]\) represent the randomly varying luminance value. A change in luminance results in a corresponding change in pixel value, as depicted in Eq. 5.

$${p}_{1i}={p}_{i}+\gamma $$
(5)

7. Image Conversion Module: Convert the image from RGB format to HSL format, with each component's conversion detailed in Eq. 6. This serves as an intermediate step, and this sub-module is executed whenever the Color Enhancement Module is executed;

$$\left\{\begin{array}{l}\vartheta ={\cos}^{-1} \left[\frac{\frac{1}{2}\left[\left(R-G\right)+(R-B)\right]}{\sqrt{{(R-G)}^{2}+(R-B)(G-B)}} \right]\\ T=\left\{\begin{array}{ll}\vartheta , & G\ge B\\ 2\pi -\vartheta , & G<B\end{array}\right.\\ L=\frac{1}{\sqrt{3}}(R+G+B)\\ S=1-\frac{3\,{\min}(R,G,B)}{R+G+B}\end{array}\right.$$
(6)

8. Random Saturation Enhancement: Due to different degrees of staining depth, inconsistency of stains, uneven absorption of stains, etc., cytopathology images yield diverse pathology pictures with distinct cell background colors and cell nucleus staining colors [34]. To simulate this diversity, we randomly adjust the saturation of colors within a certain range. This adjustment is accomplished by setting the random parameter \(\eta \in [0.7,1.3]\) for the change in saturation, as illustrated in Eq. 7.

$${S}^{\prime}=S+\eta $$
(7)

9. Random Hue Enhancement: The rationale for this data enhancement aligns with the previous point. We randomly select a parameter \(\upsilon \in [-18,18]\) to denote the value of the tonal enhancement of the image, and the adjustment of the tonal value is expressed in Eq. 8.

$${T}^{\prime}=\left(T+\upsilon \right)\%100$$
(8)

10. Image Transformation Module: The image is converted from HLS format back to RGB format using the inverse of the conversion in Eq. 6. This module is executed whenever the color enhancement module is invoked.

11. Random Contrast Enhancement: Once again, to simulate color differences, we set a random contrast enhancement threshold \(\mu \in [0.5,1.5]\), resulting in changes to the pixel values at each point, as depicted in Eq. 9.

$${p}_{2i}={p}_{1i}\times \mu $$
(9)
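To make the pipeline above concrete, the following sketch (our own illustration, not the released MSPInet code) implements the geometric transforms of Eqs. 1–4 and the color perturbations of Eqs. 5–9 with NumPy and OpenCV. The probability values and parameter ranges follow the text; OpenCV's HLS conversion stands in for Eq. 6, the saturation factor η is applied multiplicatively rather than additively, and the canvas size is kept fixed after rotation for simplicity. Helper names such as `augment_pair` are hypothetical.

```python
# Sketch of the augmentation pipeline (assumed helpers; ranges follow the text:
# alpha=0.7, theta in [-180,180], gamma in [-32,32], eta in [0.7,1.3],
# upsilon in [-18,18], mu in [0.5,1.5]); images are assumed RGB, >= 512 x 512.
import random
import cv2
import numpy as np


def random_crop(img, mask, size=512, alpha=0.7, tries=10):
    """Randomly crop a size x size patch whose non-zero fraction exceeds alpha."""
    h, w = img.shape[:2]
    for _ in range(tries):
        y0, x0 = random.randint(0, h - size), random.randint(0, w - size)
        patch = img[y0:y0 + size, x0:x0 + size]
        if np.count_nonzero(patch.any(axis=-1)) / float(size * size) > alpha:
            break
    return patch, mask[y0:y0 + size, x0:x0 + size]


def random_rotate(img, mask):
    """Rotate image and mask by a random angle about the center (Eq. 1)."""
    if random.random() < 0.5:
        theta = random.uniform(-180, 180)
        h, w = img.shape[:2]
        M = cv2.getRotationMatrix2D((w / 2, h / 2), theta, 1.0)
        img = cv2.warpAffine(img, M, (w, h), borderValue=0)
        mask = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST, borderValue=0)
    return img, mask


def random_translate_flip(img, mask):
    """Translate by up to half the image size (Eq. 2), then flip (Eqs. 3-4)."""
    h, w = img.shape[:2]
    if random.random() < 0.5:
        tx, ty = random.randint(-w // 2, w // 2), random.randint(-h // 2, h // 2)
        M = np.float32([[1, 0, tx], [0, 1, ty]])
        img = cv2.warpAffine(img, M, (w, h), borderValue=0)
        mask = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST, borderValue=0)
    if random.random() < 0.5:          # vertical flip (Eq. 3)
        img, mask = img[::-1].copy(), mask[::-1].copy()
    if random.random() < 0.5:          # horizontal flip (Eq. 4)
        img, mask = img[:, ::-1].copy(), mask[:, ::-1].copy()
    return img, mask


def random_color(img):
    """Brightness (Eq. 5), saturation/hue in HLS (Eqs. 6-8), contrast (Eq. 9)."""
    if random.random() >= 0.5:
        return img
    img = img.astype(np.float32)
    if random.random() < 0.5:                          # brightness
        img += random.uniform(-32, 32)
    hls = cv2.cvtColor(np.clip(img, 0, 255).astype(np.uint8), cv2.COLOR_RGB2HLS)
    h, l, s = cv2.split(hls.astype(np.float32))
    if random.random() < 0.5:                          # saturation (scale factor)
        s *= random.uniform(0.7, 1.3)
    if random.random() < 0.5:                          # hue; OpenCV hue range is [0, 180)
        h = (h + random.uniform(-18, 18)) % 180
    hls = cv2.merge([h % 180, np.clip(l, 0, 255), np.clip(s, 0, 255)]).astype(np.uint8)
    img = cv2.cvtColor(hls, cv2.COLOR_HLS2RGB).astype(np.float32)
    if random.random() < 0.5:                          # contrast
        img *= random.uniform(0.5, 1.5)
    return np.clip(img, 0, 255).astype(np.uint8)


def augment_pair(img, mask):
    """Apply the five augmentation stages in the order described above."""
    img, mask = random_crop(img, mask)
    img, mask = random_rotate(img, mask)
    img, mask = random_translate_flip(img, mask)
    return random_color(img), mask
```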

For model training, each image is first processed by the five data enhancement modules above, including random rotation and random color enhancement, to create augmented inputs to the system. Despite the limited dataset, this approach significantly enriches the image inputs for each epoch, yielding highly diverse input images.

We attempted to denoise common artifacts in cytopathology images, such as background stray spots, dye-contaminated areas, varying staining depths, sample thickness, and microscope settings. However, we could not find an effective denoising method suited to cytopathology images. Although we tried denoising methods designed for natural images, testing revealed limited improvement. We refrained from using complex staining normalization methods to address staining differences across pathology labs and staining batches, as this would have increased computational costs, contradicting our goal of reducing computation. Instead, we modeled these noise and color differences through data augmentation, allowing the deep learning network to learn their characteristics on its own.

Segmentation model

To achieve more accurate cell nucleus boundaries while ensuring computational efficiency and lower hardware requirements, we designed a cell nucleus segmentation model for pathology images based on the Transformer for Semantic Segmentation. The initial coarse segmentation is conducted using a visual transformer-based model, and further refinement is achieved through a Conditional Random Field Network (CRF). Each module is described in detail below.

Rough prediction of cell nuclei in pathological images

Our segmentation model employs a Guided Aggregation Bipartite Network to assist in segmenting histopathology images, resulting in the initial segmentation result map. The architecture of the segmentation model is illustrated in Fig. 3. We input the preprocessed pathology images into the segmentation network. First, the segmentation model flattens the pathology image into patch vectors and projects them, reducing the dimensionality of the data for easier processing. Next, a linear projection is applied to the pathology image data, aiding in identifying edges, textures, shapes, and other key elements in the pathology image. Following two Mask Transformer layers, semantic information and features in pathology images are extracted, enabling more accurate and higher-level image analysis. Finally, a scalar product is performed to generate the pathology image mask, yielding the preliminary segmentation result map.

Fig. 3 Overview of TFSS architecture

(1) Encoder module

A pathological image of an osteosarcoma \({\varvec{x}}\in {\mathbb{R}}^{H\times W\times C}\) is split into a patch sequence \({\varvec{x}}=\left[{x}_{1},\dots ,{x}_{N}\right]\in {\mathbb{R}}^{N\times {P}^{2}\times C}\), where \((P,P)\) is the patch size, \(N=HW/{P}^{2}\) is the number of patches, and \(C\) is the number of channels. Every patch is flattened into a one-dimensional vector and then linearly projected into a patch embedding, producing a sequence of patch embeddings \({{\varvec{x}}}_{0}=\left[{\varvec{E}}{x}_{1},\dots ,{\varvec{E}}{x}_{N}\right]\in {\mathbb{R}}^{N\times D}\), where \(\mathbf{E}\in {\mathbb{R}}^{D\times \left({P}^{2}C\right)}\). To incorporate positional information, learnable positional embeddings \(pos=\left[{pos}_{1},\dots ,{pos}_{N}\right]\in {\mathbb{R}}^{N\times D}\) are added to the patch sequence, giving the input sequence \({\mathbf{z}}_{0}={\mathbf{x}}_{0}+{\text{pos}}\), so that the position of each patch is encoded in the input.

A transformer encoder with L layers is applied to the sequence \({\mathbf{z}}_{0}\) to generate a contextualized encoded sequence \({\mathbf{z}}_{L}\in {\mathbb{R}}^{N\times D}\). Each transformer layer is composed of a multi-head self-attention (MSA) block followed by a two-layer, pointwise MLP block. Layer normalization (LN) is applied before each block, and residual connections are added after each block:

$${{\varvec{a}}}_{{\varvec{i}}-1}=MSA\left(LN\left({{\varvec{z}}}_{{\varvec{i}}-1}\right)\right)+{{\varvec{z}}}_{{\varvec{i}}-1}$$
(10)
$${{\varvec{z}}}_{{\varvec{i}}} =MLP\left(LN\left({{\varvec{a}}}_{{\varvec{i}}-1}\right)\right)+{{\varvec{a}}}_{{\varvec{i}}-1}$$
(11)

where \(i\in \{1,\dots ,L\}\). These stacked transformer layers process information from osteosarcoma pathology images, providing contextual information for a better understanding of structures and features in the pathology images, which aids the analysis and diagnosis of osteosarcoma. The self-attention mechanism comprises three pointwise linear layers that map tokens to intermediate representations: query \({\varvec{Q}}\in {\mathbb{R}}^{N\times d}\), key \({\varvec{K}}\in {\mathbb{R}}^{N\times d}\), and value \({\varvec{V}}\in {\mathbb{R}}^{N\times d}\). Self-attention is then computed as follows:

$$MSA({\varvec{Q}},{\varvec{K}},{\varvec{V}})=softmax\left(\frac{{{\varvec{Q}}{\varvec{K}}}^{T}}{\sqrt{d}}\right){\varvec{V}}$$
(12)

The transformer encoder maps the input sequence \({\mathbf{z}}_{0}=[{z}_{0,1},\dots ,{z}_{0,N}]\) of embedded patches with positional encoding to \({\mathbf{z}}_{L}=[{z}_{L,1},\dots ,{z}_{L,N}]\), a contextually encoded sequence containing rich semantic information used by the decoder, which we introduce in the next section.
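As an illustration of the encoder described above, the sketch below implements patch embedding, the pre-norm transformer layer of Eqs. 10–11, and the scaled dot-product attention of Eq. 12 in PyTorch. It is a single-head simplification under our own assumptions (class names such as `TransformerEncoder` and the dimensions are illustrative), not the authors' implementation.

```python
# Minimal ViT-style encoder sketch for Eqs. 10-12 (one attention head for clarity).
import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    """Scaled dot-product self-attention, Eq. 12 (single head)."""

    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = dim ** -0.5

    def forward(self, x):                      # x: (B, N, D)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v


class EncoderLayer(nn.Module):
    """Pre-norm transformer layer with residual connections, Eqs. 10-11."""

    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = SelfAttention(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):
        a = self.attn(self.norm1(z)) + z       # Eq. 10
        return self.mlp(self.norm2(a)) + a     # Eq. 11


class TransformerEncoder(nn.Module):
    """Patch embedding E, learnable positional embeddings, and L encoder layers."""

    def __init__(self, img_size=512, patch=16, in_ch=3, dim=384, depth=12):
        super().__init__()
        n = (img_size // patch) ** 2
        # A strided convolution is equivalent to flattening each P x P patch
        # and applying the linear projection E.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n, dim))
        self.layers = nn.ModuleList(EncoderLayer(dim) for _ in range(depth))

    def forward(self, x):                      # x: (B, C, H, W)
        z = self.proj(x).flatten(2).transpose(1, 2) + self.pos  # z0 = x0 + pos
        for layer in self.layers:
            z = layer(z)
        return z                               # z_L: (B, N, D)
```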

(2) Decoder Module

The patch encoding sequence \({\mathbf{z}}_{\mathbf{L}}\in {\mathbb{R}}^{N\times D}\) is decoded into a segmentation map \(\mathbf{s}\in {\mathbb{R}}^{H\times W\times K}\), where \(K\) is the number of classes. The decoder is trained to translate the patch-level encoding generated by the encoder into patch-level class scores. These patch-level class scores are then upsampled to pixel-level scores using bilinear interpolation. Below, we describe a linear decoder that serves as a baseline; our approach, the mask transformer, is depicted in Fig. 3. A pointwise linear layer is applied to the patch encoding \({{\varvec{z}}}_{{\varvec{L}}}\in {\mathbb{R}}^{N\times D}\) to produce patch-level class logits \({z}_{\text{lin}}\in {\mathbb{R}}^{N\times K}\). The sequence is then reshaped into a two-dimensional feature map \({s}_{\text{lin}}\in {\mathbb{R}}^{H/P\times W/P\times K}\) and bilinearly upsampled to the original pathology image size, giving \({\varvec{s}}\in {\mathbb{R}}^{H\times W\times K}\). A softmax is then applied on the class dimension to obtain the final segmentation map.

Masked Transformers. For the transformer-based decoder, we introduce a set of K learnable class embeddings \(cls=\left[{cls}_{1},\dots ,{cls}_{K}\right]\in {\mathbb{R}}^{K\times D}\), where \(K\) is the number of classes. Each class embedding is randomly initialized, assigned to a semantic class, and used to generate a class mask. The class embeddings \(cls\) are processed by the decoder jointly with the patch encoding \({\mathbf{z}}_{\mathbf{L}}\), as shown in Fig. 3. The decoder comprises a stack of M transformer layers. Our mask transformer produces K masks by computing the scalar product between the L2-normalized patch embeddings \({\mathbf{z}}_{\mathbf{M}}^{\prime}\in {\mathbb{R}}^{N\times D}\) and the class embeddings \(\mathbf{c}\in {\mathbb{R}}^{K\times D}\) produced by the decoder. The collection of class masks is computed as follows:

$$Masks\left({z}_{M}^{\prime},c\right)={z}_{M}^{\prime}\,{c}^{T}$$
(13)

where \(Masks\left({\mathbf{z}}_{\mathbf{M}}^{\prime},\mathbf{c}\right)\in {\mathbb{R}}^{N\times K}\) is a set of patch sequences. Each mask sequence is subsequently reshaped into a two-dimensional mask, forming \({\mathbf{s}}_{\text{mask}}\in {\mathbb{R}}^{H/P\times W/P\times K}\), and upsampled to the dimensions of the original pathology image, resulting in the feature map \(\mathbf{s}\in {\mathbb{R}}^{H\times W\times K}\). A softmax is then applied along the class dimension so that \({\sum }_{k=1}^{K} {s}_{i,j,k}=1\) for all \((i,j)\in H\times W\), yielding pixel-level class scores and, ultimately, the final segmented histopathology image.
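A minimal sketch of the mask transformer decoder follows, under our own assumptions: PyTorch's built-in `nn.TransformerEncoderLayer` stands in for the decoder layers, and the class count, depth, and head count are placeholders. Class embeddings are decoded jointly with the patch tokens, the scalar product of Eq. 13 produces K masks, and bilinear upsampling plus a softmax yields pixel-level class scores.

```python
# Illustrative mask decoder: joint decoding of patch tokens and K class
# embeddings, then the scalar product of Eq. 13 (not the authors' exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskDecoder(nn.Module):
    def __init__(self, dim=384, num_classes=2, depth=2, patch=16, heads=6):
        super().__init__()
        self.cls_emb = nn.Parameter(torch.randn(1, num_classes, dim))  # K class embeddings
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(depth))
        self.patch = patch

    def forward(self, z_L, img_hw):                            # z_L: (B, N, D)
        B, N, _ = z_L.shape
        tokens = torch.cat([z_L, self.cls_emb.expand(B, -1, -1)], dim=1)
        for layer in self.layers:
            tokens = layer(tokens)
        z_M, c = tokens[:, :N], tokens[:, N:]                  # patch / class embeddings
        masks = F.normalize(z_M, dim=-1) @ c.transpose(1, 2)   # Eq. 13 with L2-normalized z'_M
        h, w = img_hw[0] // self.patch, img_hw[1] // self.patch
        s = masks.transpose(1, 2).reshape(B, -1, h, w)         # (B, K, H/P, W/P)
        s = F.interpolate(s, size=img_hw, mode="bilinear", align_corners=False)
        return s.softmax(dim=1)                                # per-pixel class scores
```

For a 512 × 512 input, `MaskDecoder()(z_L, (512, 512))` returns a (B, K, 512, 512) score map that can be argmaxed to obtain the preliminary segmentation.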

Our mask transformer draws inspiration from DETR and MaX-DeepLab, both of which introduce object embeddings for instance mask generation. However, unlike our method, MaX-DeepLab adopts a hybrid approach involving CNNs and transformers and, due to computational constraints, separates pixel and class embeddings into two streams. Using a pure transformer architecture and leveraging patch-level encoding, we propose a simple approach that jointly processes patch and class embeddings at the decoding stage, which allows the generation of dynamic filters that vary with the input. Although used here for the semantic segmentation of osteosarcoma, our mask transformer can also be directly adapted to panoptic segmentation [48] by replacing class embeddings with object embeddings. Panoptic segmentation combines semantic segmentation of uncountable stuff with instance segmentation of countable objects; by interpreting single-frame LiDAR scans, it can also provide useful information for autonomous driving, such as future prediction and map construction.

Refinement of coarse predictions

Conditional Random Fields (CRF) is a probabilistic graphical model often used as a post-processing tool to improve the performance of algorithms for pattern classification, labeling, segmentation, and other tasks in image processing and computer vision [49]. For example, a U-Net neural network gives good results. However, upon close inspection of the prediction mask, small "islands" of mispredicted pixels are found. To improve these small inconsistencies, a CRF model can be used to enhance and refine the segmentation results. In the recognition of digital pathology images, CRF can improve the segmentation quality by taking into account pixel relationships and interactions with the context to further refine the prediction results of cell nuclei. It also reduces noise interference in the segmentation results by modeling the dependencies between pixels, helping to filter out isolated noise points, and ultimately improving the accuracy of the segmentation.

The process of using CRF to refine the initial coarse segmentation results is outlined below; a minimal code sketch follows the list:

  1. Prepare pathological image data along with their corresponding preliminary segmentation results.

  2. Define the CRF model, specifying its graph structure, observation nodes, hidden nodes, and characteristic functions. In refining the coarse segmentation, hidden nodes typically denote the label or category assigned to each pixel. Feature functions capture pixel relationships and interactions with the coarse segmentation.

  3. Design feature functions that emphasize observed data (original images), hidden states (pixel labels), and their interactions. These functions may encompass pixel distances, color similarity, texture features, and other relevant factors.

  4. Using the training data, estimate the weights of the feature functions in the CRF model so that it better fits both the coarse segmentation results and the original image.

  5. Apply the learned CRF model to perform label inference on each pixel, determining the most probable label for each pixel and thereby refining the coarse segmentation.

  6. Post-process the CRF output to eliminate potential isolated noise points and enhance the smoothness of the segmentation results.

  7. Assess the refined segmentation results, typically by computing segmentation accuracy, IoU (Intersection over Union), and other performance metrics.
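As a rough sketch of steps 1–6, a fully connected CRF such as the one provided by the pydensecrf package can be applied to the coarse softmax output; the kernel widths, compatibility values, and iteration count below are illustrative defaults, not tuned values from this work.

```python
# Illustrative CRF post-processing of a coarse softmax map with pydensecrf
# (assumed parameter values; the package must be installed separately).
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax


def crf_refine(rgb_image, softmax_probs, iters=5):
    """rgb_image: (H, W, 3) uint8; softmax_probs: (K, H, W) coarse class scores."""
    k, h, w = softmax_probs.shape
    d = dcrf.DenseCRF2D(w, h, k)
    d.setUnaryEnergy(unary_from_softmax(softmax_probs))          # unary term from coarse scores
    d.addPairwiseGaussian(sxy=3, compat=3)                       # smoothness kernel
    d.addPairwiseBilateral(sxy=60, srgb=13,                      # appearance kernel
                           rgbim=np.ascontiguousarray(rgb_image), compat=10)
    q = d.inference(iters)                                       # approximate label inference
    return np.argmax(np.array(q).reshape(k, h, w), axis=0)       # refined label map
```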

Training strategy

Large deep learning models exhibit excellent performance but come with a high parameter count and low computational efficiency. On the other hand, small models are computationally efficient but tend to have poorer performance. In the context of medical auxiliary diagnosis systems, achieving both high precision and efficiency is challenging, especially in developing countries and regions with limited resources. Hence, we embrace the knowledge distillation (KD) training strategy, enabling swift computation while maintaining relatively excellent performance. Moreover, this approach helps address the challenge of expensive and hard-to-obtain pathological image annotations, as model training frequently encounters the issue of insufficient datasets.

Pathological images are input into two networks: the teacher network (TFSS_T) and the student network (TFSS_S). After the teacher network is trained, its outputs are first converted into soft labels to facilitate knowledge transfer from the teacher to the student network. Unlike traditional hard labels that annotate pathological images with only "0" and "1", soft labels assign values between 0 and 1, providing a more nuanced representation of the distinctions between background, cytoplasm, and nucleus in pathological images.

During the knowledge distillation training process, our segmentation model initially trains a TFSS_T network with numerous parameters. Subsequently, the network is employed to obtain the soft labels for the training set. These soft labels, along with the actual hard labels, serve as fitting objects for distillation training, with the parameter \(\mathrm{\alpha }\) adjusting the weight of the loss function. Upon completion of training, the TFSS_S network is used for further prediction.

The training loss function is:

$${L}_{KD}=\mathrm{\alpha }{\Gamma }^{2}{D}_{KL}({Q}_{S}^{\tau },{Q}_{T}^{\tau })+(1-\mathrm{\alpha })FL\left({p}_{t}\right)({Q}_{S}^{\tau },{y}_{true})$$
(15)

Here, \(\mathrm{\alpha }\) represents the weight parameter between the two partial losses. \({D}_{KL}\) denotes the KL divergence loss function, and \(FL\left({p}_{t}\right)\) represents the focal loss. The softening results after the output of the student network and teacher network are denoted by \({Q}_{S}^{\tau }\) and \({Q}_{T}^{\tau }\), respectively. \({y}_{true}\) indicates the actual label.

Sample imbalance is a significant issue in medical images, particularly in pathological images. Addressing the weighting among case samples has become a crucial factor for enhancing model performance. While cross-entropy loss can handle the imbalance between positive and negative samples in the medical environment, it fails to address the challenge of difficult-to-separate samples. Hence, we introduced Focal Loss.

$$FL\left({p}_{t}\right)=-{\left(1-{p}_{t}\right)}^{\lambda }{\text{log}}({p}_{t})$$
(16)

where parameter \(\lambda \) satisfies \(\lambda \ge 0\); the larger its value, the greater the impact of the modulating factor \({\left(1-{p}_{t}\right)}^{\lambda }\). In the medical decision-making system's pathological image processing, we aim to diminish the weight of easily classifiable background regions, allowing the model to prioritize challenging areas such as overlapping cell nuclei and cell edges during training. Additionally, to address the sample imbalance in our pathology image dataset, we incorporated cross-entropy loss. The parameter \(\lambda \) controls the rate of down-weighting; when \(\lambda =0\), Focal Loss reduces to the cross-entropy loss function.

The updates to Focal Loss are as follows:

$$FL\left({p}_{t}\right)=-{\beta }_{t}\left(1-{p}_{t}\right){\text{log}}({p}_{t})$$
(17)

where \(\beta \) is the coefficient of the positive label sample.
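The combined objective of Eqs. 15–17 can be sketched in PyTorch as follows; the tensor shapes, the detached teacher output, and the way the positive-class coefficient β and the modulating exponent λ are combined are our assumptions rather than the released implementation.

```python
# Sketch of the distillation objective (Eqs. 15-17): KL divergence between
# temperature-softened teacher/student outputs plus a focal loss against the
# hard labels. Shapes: logits (B, K, H, W); target (B, H, W) long class indices.
import torch
import torch.nn.functional as F


def focal_loss(student_logits, target, lam=2.0, beta=0.75):
    """Eqs. 16-17: down-weight easy pixels; beta re-weights the positive class."""
    log_p = F.log_softmax(student_logits, dim=1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)     # log p_t per pixel
    pt = log_pt.exp()
    weight = (target > 0).float() * beta + (target == 0).float() * (1.0 - beta)
    return (-weight * (1.0 - pt) ** lam * log_pt).mean()


def kd_loss(student_logits, teacher_logits, target, alpha=0.5, tau=4.0):
    """Eq. 15: alpha * tau^2 * KL(student || teacher soft labels) + (1-alpha) * FL."""
    soft_t = F.softmax(teacher_logits.detach() / tau, dim=1)     # teacher soft labels
    log_soft_s = F.log_softmax(student_logits / tau, dim=1)
    kl = F.kl_div(log_soft_s, soft_t, reduction="batchmean")
    return alpha * tau ** 2 * kl + (1.0 - alpha) * focal_loss(student_logits, target)
```

In a training step, the teacher (TFSS_T) would be run under `torch.no_grad()` to produce `teacher_logits`, and `kd_loss` would then be backpropagated through the student (TFSS_S) only.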

Automatic recognition of pathology images is challenging due to the labor-intensive production and reading of digital pathology slides and the demanding professional skills of pathologists. Our proposed multiprocessing scheme improves the accuracy of predicting lesion regions in digital pathology images of tumors. It requires less hardware and is more computationally efficient, making it ideal for complex medical environments. As a reliable auxiliary diagnostic tool, our system enables physicians to rapidly acquire high-quality pathology images, which improves the identification and classification of malignant lesion regions.

Experiments

Introduction to experimental environment and dataset

Dataset: We used data from the Monash University Artificial Intelligence Research Center, which consists of 1000 pathology images. We chose a magnification of 40× and randomly captured regions to obtain 10 sub-images of each pathology image, each of size 512 × 512 pixels, for a total of 10,000 images. Because the captured sites were random, many of the images did not contain enough medical information to guide the diagnosis of osteosarcoma, so we carried out further screening and finally obtained a dataset of 2,164 pathology images suitable for model training. The filtering process uses a sliding window: if the blank background or contaminated area in a cropped image exceeded 70% of its area, the image was rejected. The 2164 retained images were then annotated by three specialized pathologists. In addition, to improve the generalization ability of the deep neural network, we applied the data enhancement pipeline described above to the training images and processed the annotated images accordingly. In the experiments, 1700 pathology images were used as the training set and 464 pathology images as the test set, roughly 78.6% and 21.4% of the total dataset, respectively.

The experimental setup involved utilizing the Ubuntu 20.04.2 LTS operating system, PyTorch 1.10.0 deep learning framework, CUDA version 11.3, and Python version 3.8. The experiment employed an AMD EPYC 7543 32-Core Processor for the CPU and an RTX 3090 for the GPU.

Comparison models: The models used for comparison include U-Net, Unet++ [50], DeepLabv3+ [51], Attention U-Net [52], SETR, Swin-UNet [53], and CSWin Transformer [54]. These networks are arranged in chronological order of their proposal, spanning from 2015 to 2021. Among the selected comparative networks, U-Net stands as the most representative medical segmentation network; its U-shaped architecture and classic skip-connection design allow later networks to freely combine shallow and deep features. Unet++ represents a variant of the Unet series without attention structures, maximizing the potential of skip connections. The DeepLabv3+ series has achieved substantial success in semantic segmentation, and many of its designs, such as dilated convolutions, conditional random fields, and the multi-scale dilated convolution (ASPP) module, remain highly influential. Attention U-Net integrates soft attention into the simple, cost-effective Unet, significantly enhancing the model's accuracy. SETR, Swin-UNet, and CSWin Transformer are recent, popular ViT-based networks that generally outperform purely CNN-based networks in image segmentation. Comparative experiments with these models therefore showcase the performance of our designed model more convincingly.


Assessment metrics: We employed accuracy (Acc), precision (Pre), recall (Re), F1-score (F1), Intersection over Union (IoU), and Dice Similarity Coefficient (DSC) as the metrics to evaluate the efficacy of cell segmentation performed by the network [55, 56]. Among these, IoU is the ratio of the intersection to the union of the actual and predicted cell regions, which effectively represents the similarity between them. DSC indicates how closely the model-predicted nucleus region agrees with the manually labeled ground truth. The values of both metrics lie in [0, 1], with higher values signifying better model performance. In pathological image segmentation, our aim is to increase the IoU and DSC values of cells as much as possible to achieve precise segmentation.

$$Acc=\frac{TP+TN}{TP+TN+FP+FN}$$
(18)
$$DSC=\frac{2TP}{2TP+FP+FN}$$
(19)
$$IoU=\frac{TP}{TP+FP+FN}$$
(20)
$$Pre=\frac{TP}{TP+FP}$$
(21)
$$Re=\frac{TP}{TP+FN}$$
(22)
$$F1=\frac{2\times Precision\times Recall}{Precision+Recall}$$
(23)
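For reference, Eqs. 18–23 can be computed from a binary prediction mask and its ground truth as in the NumPy sketch below (the small constant eps only guards against division by zero).

```python
# Computing Eqs. 18-23 from binary prediction/ground-truth masks (NumPy sketch).
import numpy as np


def segmentation_metrics(pred, gt, eps=1e-7):
    """pred, gt: boolean arrays of the same shape (True = nucleus pixel)."""
    tp = np.logical_and(pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    pre = tp / (tp + fp + eps)
    re = tp / (tp + fn + eps)
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn + eps),   # Eq. 18
        "DSC": 2 * tp / (2 * tp + fp + fn + eps),       # Eq. 19
        "IoU": tp / (tp + fp + fn + eps),               # Eq. 20
        "Pre": pre,                                     # Eq. 21
        "Re": re,                                       # Eq. 22
        "F1": 2 * pre * re / (pre + re + eps),          # Eq. 23
    }
```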

Moreover, for a comparative analysis of computational costs between the lightweight model resulting from knowledge distillation and the original model, we utilized floating-point operations (FLOPs) as a measure of the model's computational complexity [57]. We used "Params" to measure the size of the model's parameters. The larger the values of these two parameters, the greater the computational and storage resources required by the model [58].


Hyperparameter settings: In all the experiments below, we trained the models for 300 epochs with a batch size of 4. The detailed parameters of the various algorithms in the data augmentation pipeline, such as execution probabilities and rotation angles, are described in the system methods module. During training, we used the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005. In the focal loss, the coefficient of the positively labeled samples is used as defined in Eq. 17. In knowledge distillation, we set the temperature Γ to 4; for the logits map we set α to 3, and for the intermediate feature map we set α to 50.
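The optimizer configuration described above corresponds roughly to the following PyTorch setup; `student_model` is a placeholder for the TFSS_S network, not the actual architecture.

```python
# Hyperparameters as stated above (SGD, lr=0.01, momentum=0.9, weight decay=5e-4).
import torch
import torch.nn as nn

student_model = nn.Conv2d(3, 2, kernel_size=1)   # placeholder for the TFSS_S network

optimizer = torch.optim.SGD(student_model.parameters(),
                            lr=0.01, momentum=0.9, weight_decay=0.0005)

EPOCHS, BATCH_SIZE = 300, 4
TEMPERATURE = 4.0        # distillation temperature Gamma
ALPHA_LOGITS = 3.0       # alpha for the logits map
ALPHA_FEATURES = 50.0    # alpha for the intermediate feature map
```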

Experimental results

Before feeding the original images into the neural network, they need preprocessing. In this context, a data augmentation pipeline is employed to improve the model's generalization. Data augmentation processes the original images to create new ones, allowing the model to learn features from different angles, scales, and morphologies. In this experiment, the data augmentation pipeline primarily involves color enhancement but excludes basic operations like rotation and flipping. This means that the resulting enhanced images primarily focus on changes in color rather than adjustments in shape or orientation. The effect of color enhancement is shown in Fig. 4, demonstrating significant color differences between the original images and those after color enhancement. This type of processing helps the model better learn color features in pathological images, enabling it to have stronger generalization capabilities when dealing with pathological images from different staining batches or various laboratories. As pathological images from different staining batches or laboratories may vary, color enhancement can somewhat mitigate the impact of these differences on the model's learning. The contrast between the cell images and background images becomes more distinct after color enhancement, aiding the model in distinguishing and extracting features. Additionally, as staining effects differ each time, color enhancement ensures the model isn't confined to a specific staining method, thereby enhancing its performance in practical applications.

Fig. 4 The sequentially color-enhanced results of images with staining variations. Column a represents the original images; column b depicts images after random brightness enhancement; column c exhibits images after random saturation enhancement; column d displays images after random hue enhancement; column e showcases images after random contrast enhancement; and column f presents the images finally input into the training model

The well-trained segmentation networks can efficiently convert malignant tumor pathological images into segmented results of the same size. This technological advancement has greatly improved the accuracy and efficiency of medical diagnosis, providing physicians with important auxiliary support in identifying and treating malignant tumors.

Figure 5 displays the resulting images after the model's processing. In this illustration, we can observe the original image alongside its corresponding segmented result. The segmentation result distinctly presents the cellular structures in the pathological image, enabling doctors to more accurately identify tumor regions.

Fig. 5 The representative segmentation results of several pathological images. Column a represents the original pathological tissue slide image; column b describes the ultimate image output of the system's processing; column c shows the prediction from the deep learning network; column d presents the labeled image, and column e showcases the DSC score for this predicted image

The system exhibits remarkable performance in identifying and segmenting cells within pathological images. By leveraging advanced image processing techniques, the model separates cells from surrounding tissue, thereby enhancing diagnostic accuracy. However, despite its overall excellence, the system still exhibits some limitations. In certain instances, the model may present slight under-segmentation, leading to incomplete recognition of some cells. Additionally, there might be instances of blurred edges on some cell boundaries, impacting image clarity. To assess the system's performance, we utilized the DSC (Dice Similarity Coefficient) score. DSC is a frequently utilized metric for assessing image segmentation, with values closer to 1 indicating better segmentation results. The model's DSC score demonstrates excellence, affirming its high accuracy and reliability in handling malignant tumor pathological images.

Comparative experiments

We conducted comprehensive comparative experiments under the same conditions between our segmented model, the model after knowledge distillation, and other commonly used models for medical image segmentation or that are highly representative in the realm of semantic segmentation, as mentioned above. The values of the individual metrics are averaged over all test images for the different models. Figure 6 illustrates the experimental data comparison between the aforementioned models and our model, and the detailed data are recorded in Table 2. Analyzing the results reveals the proficient performance of our segmentation model in the context of pathological image segmentation, outperforming other models in many indicators. The Precision (Pr) value is substantially higher than that of other classical models, indicating that our proposed method achieves high precision in pathological image segmentation, accurately separating cell nuclei from cytoplasm. Additionally, the key evaluation metrics, IoU (Intersection over Union) and DSC (Dice Similarity Coefficient), suggest that the predicted nuclear area of our model is reasonably comparable to the actual nuclear area.

Fig. 6 The comparative experimental results of our approach and typical medical segmentation models in terms of Accuracy (Acc), Precision (Pre), and Recall (Re) metrics

Table 2 Comparative experiments on osteosarcoma pathological images

The IoU and DSC of our teacher model exceed those of the best-performing comparison model, CSWin Transformer, by 5.9% and 1.8%, respectively. Compared only with traditional CNN models, our model performs substantially better: TFSS_T improves IoU by 11.8% over the U-Net network. We also consider Swin-UNet, a ViT model with small parameter and computational loads; its performance is only close to that of U-Net, which indicates that directly applying segmentation models from other domains to pathological image segmentation may not yield good results. In contrast, the student model we designed has lower FLOPs and Params than Swin-UNet, yet its performance far surpasses it.

In summary, our model has clear advantages compared to other models.

In addition, as shown in Fig. 7, we also compared our method with frameworks of similar FLOPs magnitude. Our approach achieves optimal performance while maintaining a low level of FLOPs, which is highly beneficial and important in areas with limited computational resources. Compared to Unet++ at the same FLOPs level, our model consistently outperformed Unet++. Furthermore, our method demonstrated significant performance improvements over several models with higher FLOPs. This indicates that our model can obtain more accurate pathological image segmentation results with faster training speed and lower cost.

Fig. 7 Comparison graph of DSC and FLOPs

Figure 8 shows the segmentation results of different segmentation models on several representative histopathology images, including a comparison between our teacher model and the student model. The figure shows that our TFSS_T model produces the most accurate segmentation results, and the TFSS_S model also performs well. For simple segmentation tasks (e.g., Fig. 8d and e), the advantages of our model are less apparent, mainly because every model predicts these cases well. However, for cell segmentation tasks with complex boundaries (e.g., Fig. 8a and b), our model better outlines the boundaries of densely stacked cells. For images with irregularly shaped cells (e.g., Fig. 8c and f), our model shows better robustness and generalization, accurately segmenting cells with uneven distribution, different sizes, and irregular shapes. The DSC values below these predictions illustrate the performance of our model more concretely.

Fig. 8 Comparison of different segmentation model results

Ablation experiments

To further investigate the impact of Focus Loss, the data augmentation pipeline, knowledge distillation, and CRF on the segmentation performance of the model, we conducted corresponding ablation experiments. The experimental results are shown in Figs. 9 and 10, and detailed experimental data are recorded in Table 3.

Fig. 9 Results of the ablation experiment on the teacher model, where Our1 represents the original model, Our2 represents the model with only the data augmentation pipeline, Our3 represents the model with only Focus Loss, and Our represents the model with both the data augmentation pipeline and Focus Loss

Fig. 10 Results of the ablation experiment on the student model, where Our1 represents the original model, Our2 represents the model with knowledge distillation, and Our represents the model with both knowledge distillation and CRF

Table 3 Ablation experiment in osteosarcoma pathological images

Figure 9 shows the performance of our model after separately incorporating the data augmentation pipeline and Focus Loss. When the teacher model used only the data augmentation method, its IoU and Re improved while the DSC decreased, possibly because the model did not reach its optimal performance point with the increased data volume under the same training budget. When using only Focus Loss, the model's Re increased by 4%, with other metrics remaining largely unchanged. Although the improvement is not large, it is noteworthy because it is achieved on an already strong model, indicating that Focus Loss further enhances performance. The model that incorporates both methods improves on every metric compared to the original model: relative to the original TFSS_T, the IoU, DSC, Pre, and F1 scores increased by 4.3%, 1.6%, 1.6%, and 1.6%, respectively. The model with data augmentation and Focus Loss is thus markedly more flexible and segments more accurately.

Figure 10 demonstrates the performance changes of the student model before and after incorporating knowledge distillation and CRF. It is evident that knowledge distillation greatly improves the performance of the student model, with an increase of 5% in IoU, 13% in DSC, 15% in Recall, and 10% in Precision. This is a significant improvement over the original model, allowing for good performance even with a small computational load. The addition of CRF as a post-processing step does not lead to notable numerical improvements in the model's performance. However, in practical terms, CRF helps the segmentation model to connect the contextual information of the image, making the segmentation of stacked cells in pathological images more distinct and aiding doctors in making accurate decisions.

Conclusion

In this study, we have developed an artificial intelligence multi-processing solution (MSPInet) specifically designed for malignant tumor pathological tissue slice images, aiming to help doctors reduce their workload. The multi-processing nature of the method is reflected mainly in our preprocessing steps and segmentation models for pathology images. We designed a segmentation method based on visual transformers, which first performs coarse segmentation and then uses a CRF to optimize the segmentation results. Additionally, we introduced a knowledge distillation training strategy to achieve higher prediction accuracy at lower cost. The results show that our system has higher accuracy, lower complexity, and shorter training time. The system's predictions can provide doctors with an auxiliary reference for clinical diagnosis. Thus, the system plays an important role in simplifying the clinical diagnostic process and helping doctors improve diagnostic efficiency and accuracy.

In the future, we will continue to improve the accuracy of the model segmentation, optimize the clarity of cell edge segmentation, and explore the use of semi-supervised or unsupervised learning to address the issue of small sample sizes. In terms of data processing, we will explore more possible denoising algorithms. In the design of small tools, we hope to use clustering methods to grade complex cell boundaries instead of relying solely on mathematical parameters. Lastly, we will integrate clinical practice to continuously improve the intelligent assisted diagnostic system.