Background

Cervical cancer is a common malignancy that poses a serious threat to women’s health. It is the fourth most common cancer in terms of both incidence and mortality. In 2020, approximately 600,000 new cases of cervical cancer were diagnosed and more than 340,000 people died from this disease globally [1, 2]. Fortunately, cervical cancer has a long precancerous stage, and annual screening programs can help detect and treat it in a timely manner. If detected early, cervical cancer can be completely cured. At present, manual screening of abnormal cells from a cervical cytology slide is still common practice; however, it is tedious, inefficient, and costly. Consequently, automated cervical cancer cytology screening has attracted increasing attention. In the past few years, deep learning (DL), a branch of machine learning, has achieved great success in the field of medical image analysis [3,4,5]. The segmentation of cervical cytology images plays an important role in automated cervical cancer cytology screening [6]. However, the performance of cervical cell segmentation is still far from perfect [6,7,8,9,10].

Different from histology, which involves examining an entire section of tissue, cytology generally focuses on individual cells or clusters of cells. In some cases, several cells can determine the diagnostic result of the whole slide. One of the mainstream methods for automated cervical cancer cytology screening is cell segmentation followed by single cell classification. Compared to cervical cell segmentation, more research has been conducted on cell classification and more public datasets have been released [11,12,13,14]. According to the 2014 Bethesda guideline [15], nuclear morphologies, which include nuclear size and shape, nuclear pleomorphism, nucleus-to-cytoplasm ratio, multiple nuclei, and nucleoli morphology, are the most important biomarkers in cervical cytology screening. Therefore, both cytoplasm segmentation and nucleus segmentation are important for automated cervical cytology screening.

Previous studies have some limitations. Some studies segmented only the cytoplasm or only the nucleus, not both simultaneously [16]. Moreover, much of the research was based on very limited data, so the generalization ability of these algorithms is not guaranteed. For example, some studies used only 8 real cervical cytology images and over a hundred synthetic images [9, 10]. To the best of our knowledge, all previous studies adopted a single CNN such as the standard U-Net and did not use transfer learning during training [6]. Deep learning systems rely heavily on the amount and quality of data. So far, several public cervical cell segmentation datasets exist, including ISBI2014 [9], ISBI2015 [10], BTTFA [16] and the Cx22 dataset [6]. Among them, the recently released Cx22 dataset is the largest publicly available cervical cell segmentation dataset and contains both cytoplasm and nuclei annotations. The data descriptor paper of the Cx22 dataset also provided multiple baseline models, including U-Net [17], U-Net++ [18] and U-Net+++ [19]; however, the performances of these baseline models are far from perfect. The Dice, sensitivity, and specificity for cytoplasm segmentation and nucleus segmentation were 0.948, 0.954, 0.9823 and 0.750, 0.713, 0.9988, respectively.

This study aimed to develop an automated cervical cell segmentation algorithm covering both cytoplasm and nucleus segmentation. By means of a relatively large dataset, different model architectures with different encoders, model ensembling and pre-trained encoder weights, our algorithm outperformed those of previous studies.

Methods

Dataset and data processing

The Cx22 dataset delineates the contours of 14,946 cellular instances in 1320 images that were generated by a label cropping algorithm based on the region of interest. The data source and annotation pipeline were described in detail in the data descriptor paper [6]. A representative image and its ground truth labels can be found in the Results section. The Cx22 dataset stores its data as MATLAB .mat files in HDF5 format. For convenience, these files were converted into JPEG image and mask files using Python code. The Cx22 dataset contains a training dataset and a testing dataset with 400 and 100 samples, respectively. Every sample consists of an image and two mask files, one for the cytoplasm annotation and the other for the nuclei annotation. All images have a resolution of 512 × 512 pixels. For model selection and hyperparameter tuning, the training dataset was further split into a new training dataset and a tuning dataset with a ratio of 0.9 to 0.1. Because the Cx22 dataset provides a predefined test dataset whose sample size is not very small, and to facilitate comparison of our algorithm with the baselines, cross validation was not adopted in this study.
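As a reading aid, the following minimal sketch shows how such a conversion can be done with the h5py and OpenCV libraries. The HDF5 dataset keys ("img", "cyto_mask", "nuc_mask") and the output naming are illustrative assumptions; the actual keys depend on how Cx22 stores its arrays and are handled in the released source code.

```python
# Hypothetical sketch of the .mat-to-JPEG conversion step. The HDF5 keys below
# are assumptions; adapt them to the actual structure of the Cx22 files.
import h5py
import numpy as np
import cv2

def convert_sample(mat_path: str, out_prefix: str) -> None:
    """Read one Cx22 sample stored as an HDF5-based .mat file and save the
    image and its two binary masks as separate JPEG files."""
    with h5py.File(mat_path, "r") as f:
        image = np.array(f["img"])        # assumed key for the RGB image
        cyto = np.array(f["cyto_mask"])   # assumed key for the cytoplasm mask
        nuc = np.array(f["nuc_mask"])     # assumed key for the nuclei mask

    # MATLAB stores arrays column-major, so HDF5-backed arrays may need transposing.
    cv2.imwrite(f"{out_prefix}_image.jpg", image)
    cv2.imwrite(f"{out_prefix}_cyto.jpg", (cyto > 0).astype(np.uint8) * 255)
    cv2.imwrite(f"{out_prefix}_nuc.jpg", (nuc > 0).astype(np.uint8) * 255)
```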

Overall architecture

In this study, both cytoplasm segmentation and nucleus segmentation were considered as semantic segmentation tasks. These two tasks can be solved by either one multi-class classifier or two independent binary-class classifiers. To decouple the interference between cytoplasm segmentation and nucleus segmentation and simplify the hyper-parameter setting process, the latter method was adopted. According to common practice, the positive class stands for cytoplasm or nucleus and the negative class for background.

The flowchart of the automated cervical cell segmentation algorithm is shown in Fig. 1. Given an image, the cytoplasm and nuclei were segmented independently. For each segmentation task, the image was fed into multiple base models, and the final predictions were obtained by aggregating the results of these models with a model ensemble method.

Fig. 1
figure 1

The flowchart of automated cervical cell segmentation. The two dashed boxes demonstrate two ensemble models, one for cytoplasm segmentation and the other for nucleus segmentation. The model ensemble method is unweighted average

Base models

To obtain a good ensemble model, base models should be as accurate as possible and as diverse as possible [20]. Six different models, namely U-Net, U-Net++, DeepLabV3 [21], DeepLabV3Plus [22], TransUNet [23], and SegFormer [24], were chosen as candidate models. These models belong to three different architecture families, i.e., encoder-decoder, dilated convolution and vision transformer, all of which are widely used. Some other U-Net variants, including attention U-Net [25] and R2U-Net [26], were also tested in pre-experiments on these tasks; because they did not perform better than U-Net and U-Net++ and consumed more GPU memory, they were abandoned in this study. Likewise, the Swin Transformer semantic segmentation model [27] was not adopted because in pre-experiments on other tasks it did not perform better than its counterparts, the TransUNet and SegFormer models.

For every U-Net and U-Net++ model, two different encoders, resnet34 and densenet121, were used. Likewise, resnet34 and resnet50 were used as the encoders of every DeepLabV3 and DeepLabV3Plus model. Densenet121 was replaced by resnet50 in the DeepLabV3-series models because of bugs related to these models in the SMP implementation [16]. For TransUNet and SegFormer, only the default architecture settings were used. TransUNet: vit_blocks = 12, vit_heads = 12, vit_dim_linear_mhsa_block = 3072, patch_size = 8, vit_transformer_dim = 768, vit_transformer = None, vit_channels = None. SegFormer: dims = (32, 64, 160, 256), heads = (1, 2, 5, 8), ff_expansion = (8, 8, 4, 4), reduction_ratio = (8, 4, 2, 1), num_layers = 2, decoder_dim = 256. Model implementation details can be found in the source code. For convenience, a model that has both an architecture name and an encoder name was named by combining the two. For example, Unet_resnet34 denotes the model with the U-Net architecture and the resnet34 encoder. Characteristics of the candidate base models are shown in Table 1.

Table 1 Characteristics of candidate base models
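As an illustration, the CNN base models listed above can be instantiated with the segmentation_models_pytorch (SMP) library roughly as in the sketch below; the exact constructor arguments used in this study are in the released source code.

```python
# A minimal sketch of base-model construction with segmentation_models_pytorch.
import segmentation_models_pytorch as smp

ARCHS = {
    "Unet": smp.Unet,
    "UnetPlusPlus": smp.UnetPlusPlus,
    "DeepLabV3": smp.DeepLabV3,
    "DeepLabV3Plus": smp.DeepLabV3Plus,
}

def build_model(arch: str, encoder: str, pretrained: bool = True):
    """Build a single-class segmentation model, e.g. build_model("Unet", "resnet34")."""
    return ARCHS[arch](
        encoder_name=encoder,                                # "resnet34", "densenet121", "resnet50"
        encoder_weights="imagenet" if pretrained else None,  # ImageNet pre-trained or random init
        in_channels=3,
        classes=1,                                           # one foreground class per task
    )

# Example: the Unet_resnet34 base model with an ImageNet pre-trained encoder
model = build_model("Unet", "resnet34", pretrained=True)
```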

These models were trained independently, after which model selection was conducted based on the performance metrics. Finally, four models, i.e., Unet_resnet34, Unet_densenet121, UnetPlusPlus_resnet34 and UnetPlusPlus_densenet121, were chosen as the base models. Model performance comparisons are presented in the Results section.

Ensemble model

Although the performance differences among all candidate models were significant, the differences among the selected models were very small. Multiple ensemble methods, including weighted averaging (using the validation loss as the weighting factor), unweighted averaging and stacking, were tested in preliminary experiments. Even though every ensemble method performed better than any single model, there was no obvious performance difference among the ensemble methods. For simplicity, unweighted averaging was chosen as the model ensemble method [20, 28]. It not only eliminates the need to set weights as in weighted averaging or to train a new model as in stacking, but also did not decrease performance. Given an image, the four base models independently output a predicted probability for each pixel. The number of base models was set to 4 because further increasing it did not result in a perceivable performance improvement but would increase training time and slow down inference. The final probability was obtained by averaging the probabilities output by the base models without weighting. If the averaged probability of a pixel was above a predefined threshold, the pixel was considered positive, otherwise negative. For simplicity, the default value of 0.5 was used as the cut-off. The prediction for every pixel is given by:

$$\mathrm{pred\_class}=\left(\frac{1}{M}\sum_{i=1}^{M} p_i\right) > 0.5$$

For a given pixel, $p_i$ is the predicted probability of model $i$, and $M$ is the number of base models, which is equal to 4 in this case. If pred_class is true, the pixel is predicted as cytoplasm or nucleus, depending on the segmentation task.
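A minimal sketch of this unweighted-average ensemble, assuming each trained base model returns raw logits of shape (N, 1, H, W) for a batch of images, could look as follows.

```python
import torch

@torch.no_grad()
def ensemble_predict(models, images: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Average per-pixel probabilities over all base models and apply the 0.5 cut-off."""
    probs = [torch.sigmoid(m(images)) for m in models]  # per-model probabilities p_i
    mean_prob = torch.stack(probs, dim=0).mean(dim=0)   # unweighted average over M models
    return (mean_prob > threshold).float()              # binary cytoplasm or nucleus mask
```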

Training strategies

The sample size of Cx22 is not large, so real-time image augmentation was adopted during training to avoid overfitting. Compared with offline image augmentation, real-time augmentation is more flexible. The augmentations included random horizontal and vertical flipping, random brightness and contrast modification, Gaussian blur, and hue/saturation color transformation, among others. Image augmentation was implemented with the albumentations library and a PyTorch Dataset class.
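A sketch of such a pipeline with the albumentations library is given below; the probabilities and parameter ranges are illustrative assumptions, not the exact values used in this study.

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Real-time augmentation pipeline applied inside the PyTorch Dataset class.
train_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.5),
    A.GaussianBlur(p=0.3),
    A.HueSaturationValue(p=0.3),
    ToTensorV2(),
])

# Inside Dataset.__getitem__, image and mask are transformed together so that
# spatial augmentations stay aligned:
# augmented = train_transform(image=image, mask=mask)
# image, mask = augmented["image"], augmented["mask"]
```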

The data distribution of cytoplasm segmentation is relatively balanced, so binary cross-entropy was used as its loss function. However, the nuclei occupy only a small area of the image; to tackle this class imbalance, weighted binary cross-entropy was used as the loss function of nucleus segmentation, with the weight factor for the positive class set to 8. Compared with similarity-based loss functions such as the Dice loss and IoU loss, the binary cross-entropy loss has smooth gradients [29] and therefore trains faster.
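Assuming the models output raw logits, the two loss functions can be set up in PyTorch as in the following sketch, where pos_weight = 8 up-weights the sparse nucleus pixels.

```python
import torch
import torch.nn as nn

cytoplasm_criterion = nn.BCEWithLogitsLoss()                             # balanced task
nucleus_criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(8.0))   # weight factor 8 for positives

# Usage: loss = nucleus_criterion(logits, target_mask.float())
```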

For all models except SegFormer and TransUNet, the encoders have readily available ImageNet pre-trained weights. Consequently, these models were trained under two settings: trained from scratch, and with encoders initialized from ImageNet pre-trained models followed by fine-tuning of all layers.

Adam [30] with lookahead [31] (k = 5, alpha = 0.5) was used as the optimizer. Automatic mixed precision training [32] was used to speed up training and inference and to save GPU memory. Label smoothing (ε = 0.1) was used to calibrate probabilities and improve generalizability [33]. The batch size was set to 32 and the number of epochs to 20. The initial learning rate was set to 1e-3 and multiplied by a factor of 0.1 at 30%, 60% and 90% of the training epochs. Every model was trained 3 times under the same setting, and the model with the minimum validation loss was chosen as the final model. During training, performance was not sensitive to these hyper-parameters.
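A sketch of this training configuration in PyTorch is shown below; the Lookahead wrapper (k = 5, alpha = 0.5) and label smoothing are assumed to come from separate implementations and are omitted here, and milestones 6/12/18 correspond to 30%, 60% and 90% of the 20 epochs.

```python
import torch
import segmentation_models_pytorch as smp

model = smp.Unet("resnet34", encoder_weights="imagenet", in_channels=3, classes=1).cuda()
criterion = torch.nn.BCEWithLogitsLoss()  # weighted variant used for the nucleus task
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[6, 12, 18], gamma=0.1)
scaler = torch.cuda.amp.GradScaler()      # automatic mixed precision

for epoch in range(20):
    for images, masks in train_loader:    # train_loader is an assumed DataLoader (batch size 32)
        images, masks = images.cuda(), masks.cuda()
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = criterion(model(images), masks)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()
```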

Evaluation metrics

In the original Cx22 data descriptor paper, the Dice, true positive rate (sensitivity) and false positive rate (1 − specificity) [34] were used to quantitatively assess the baseline models. To make a fair comparison, the same performance metrics were used in this study.

$$\mathrm{Sensitivity}=\frac{TP}{TP+FN}$$
$$\mathrm{Specificity}=\frac{TN}{TN+FP}$$
$$\mathrm{Dice}=\frac{2TP}{2TP+FP+FN}$$
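The following sketch shows how these metrics can be computed from binary prediction and ground-truth mask arrays at the pixel level.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-7):
    """Compute Dice, sensitivity and specificity from binary (0/1) masks."""
    tp = np.sum((pred == 1) & (truth == 1))
    tn = np.sum((pred == 0) & (truth == 0))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    sensitivity = tp / (tp + fn + eps)
    specificity = tn / (tn + fp + eps)
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    return dice, sensitivity, specificity
```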

The bootstrap method at the pixel level with 500 resamples was used to calculate the 95% CIs. A P value of less than 0.05 was considered statistically significant. For simplicity, confidence intervals were calculated only for the performance indicators of the ensemble models.
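A sketch of the pixel-level bootstrap used for the 95% CIs, assuming flattened prediction and ground-truth arrays and a metric function such as the one above, could look as follows.

```python
import numpy as np

def bootstrap_ci(pred: np.ndarray, truth: np.ndarray, metric_fn,
                 n_boot: int = 500, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap confidence interval over resampled pixels."""
    rng = np.random.default_rng(seed)
    n = pred.size
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample pixels with replacement
        scores.append(metric_fn(pred[idx], truth[idx]))
    lower, upper = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Example: 95% CI of the Dice score
# low, high = bootstrap_ci(pred.ravel(), truth.ravel(),
#                          lambda p, t: segmentation_metrics(p, t)[0])
```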

Experimental settings

Hardware: Intel Core i7-10700, 128 GB memory, 2 × NVIDIA RTX 3090.

Software: Ubuntu 20.04, Cuda 11.3, Anaconda 4.10.

Programming language and libraries: Python 3.8, PyTorch 1.10, Torchvision, OpenCV, NumPy, SciPy, scikit-learn, Matplotlib, Pandas, Albumentations, segmentation_models_pytorch, tqdm. Detailed information about these software libraries can be found in the requirements.txt file of the source code.

Results

Training and validation loss curves were used to demonstrate convergence speed and to check for overfitting. The loss curves of cytoplasm segmentation and nucleus segmentation are shown in supplementary Figures S1 and S2, respectively. These graphs illustrate that these models train quickly and show no obvious overfitting. The loss curves of the TransUNet and SegFormer models were not included because some of these models did not converge during training and the performances of the others were poor.

All performance analyses were conducted on the testing dataset. Performance comparison of different models trained from scratch is shown in Table 2.

Table 2 Performance comparison of base models trained from scratch

Performance comparison of the different models whose encoders were initialized from the corresponding ImageNet pre-trained models is shown in Table 3.

Table 3 Performance comparison of base models whose encoders were initialized from ImageNet pre-trained models

As shown in Table 2, in all cases the U-Net-series models were consistently better than the DeepLabV3-series models. Regardless of the segmentation task and model architecture, using ImageNet pre-trained encoders clearly improved performance compared with training from scratch. Even though TransUNet [23] and SegFormer [24] obtained very good or even state-of-the-art results on many image segmentation benchmarks, in this study they performed much worse than their CNN counterparts; in most cases these models even collapsed and predicted all pixels as negative or positive. Finally, according to the performance metrics, for every segmentation task the four models Unet_resnet34, Unet_densenet121, UnetPlusPlus_resnet34, and UnetPlusPlus_densenet121 were chosen as the base models, all of which were trained with the transfer learning strategy.

Although not every performance indicator of the ensemble model was better than that of any single model, all performance metrics of the ensemble model were better than the arithmetic mean of performance metrics of base models. Performance comparison of ensemble models and the arithmetic means of base models on the testing dataset is depicted in Table 4. The performance metrics of ensemble models were better than arithmetic means of performance metrics of base models (P < 0.05). ROC curves including AUC scores of cytoplasm segmentation and nucleus segmentation are shown in Fig. 2.

Table 4 Performance comparison of ensemble models and the arithmetic means of base models on the testing dataset
Fig. 2
figure 2

The ROC curves including AUC scores of cytoplasm segmentation and nucleus segmentation

The data descriptor paper [6] also provided multiple baseline models, including U-Net, U-Net++ and U-Net+++. In this study, for every task we compared against the best baseline metrics. Performance comparison of the best baseline model and the ensemble model is shown in Table 5. Except for the specificity of nucleus segmentation, the ensemble model outperformed the best baseline model by a moderate margin on all tasks. The specificity of nucleus segmentation of the ensemble model was very close to that of the baseline model, and both were near perfect.

Table 5 Performance comparison of baseline and ensemble models on the testing dataset

Besides the quantitative analyses, qualitative analyses were also conducted in this study. From a human observer’s subjective point of view, the predicted masks were very close to the ground truth annotations. A randomly selected case, including the image, its ground truth annotations and the predicted masks, is shown in Fig. 3. It should be mentioned that most of the apparent false positives are not actually false positives. The region marked in red in the predicted cytoplasm image is a genuine cytoplasm area: because the main part of the cell was cropped into a neighboring image, the remaining small portion of cytoplasm was not labeled. Likewise, the noise areas marked by red circles in the predicted nucleus image are small nuclei neglected by the human annotations.

Fig. 3
figure 3

A representative image, its ground truth annotations and the predicted masks. The image and the ground truth annotations are shown in the first row; the predicted masks are shown in the second row. Cytoplasm images and nucleus images are shown in the second and third columns, respectively

Discussion

Based on the above results, the following assumptions are proposed: for medical image segmentation with small to medium sample sizes, U-Net variants are better than DeepLabV3 variants, and vision transformer models are much worse than CNNs. Vision transformers have fewer inductive priors, so they need more training data. Even though both TransUNet and SegFormer adopt a CNN-like hierarchical structure and use a few convolutional layers at the lower levels, they still need more data to train than U-Net variants. Whether these assumptions hold true for medical image segmentation tasks other than cervical cytology cell segmentation should be further investigated.

This study has both strengths and limitations. Its strengths are that, on the cytoplasm segmentation task, the proposed ensemble model outperformed the best baseline model on all performance metrics by a moderate margin, and on the nucleus segmentation task it outperformed the best baseline model by a moderate margin on all performance metrics except specificity. Moreover, this study compared the performances of different model architectures, different encoders, and different training strategies; these comparison results may extend to other medical image segmentation tasks. This study also has some limitations. First and most importantly, cells are the key objects in cervical cancer cytology screening, and both the cytoplasm and the nucleus are important parts of a cell; however, semantic segmentation models only classify every pixel and do not identify objects. Regarding this issue, both adding a post-processing algorithm after the semantic segmentation model to identify objects and using an instance segmentation algorithm are feasible solutions, but both add a certain degree of complexity. Second, this study only used the Cx22 dataset, so the generalization ability of the models is not guaranteed. We plan to conduct a new study in the future that will add cell object identification and carry out external validation.

Conclusions

In this study, we developed an automated cervical cytology cell segmentation algorithm on the Cx22 dataset by means of deep ensemble learning. The algorithm obtained Dice, sensitivity, and specificity of 0.9535 (CIs: 0.9534–0.9536), 0.9621 (0.9619–0.9622), 0.9835 (0.9834–0.9836) for cytoplasm segmentation and 0.7863 (0.7851–0.7876), 0.9581 (0.9573–0.959), 0.9961 (0.9961–0.9962) for nucleus segmentation. On most performance metrics, our algorithm outperformed the best baseline models (P < 0.05) by a moderate margin. In the future, after adding cell identification functionality and conducting sufficient external validation, it can be used in automated cervical cancer cytology screening systems.