Skip to main content

An Uncertainty-Aware Transformer for MRI Cardiac Semantic Segmentation via Mean Teachers

  • Conference paper
  • First Online:
Medical Image Understanding and Analysis (MIUA 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13413))

Included in the following conference series:


Deep learning methods have shown promising performance in medical image semantic segmentation. The cost of high-quality annotations, however, is still high and hard to access as clinicians are pressed for time. In this paper, we propose to utilize the power of Vision Transformer (ViT) with a semi-supervised framework for medical image semantic segmentation. The framework consists of a student model and a teacher model, where the student model learns from image feature information and helps teacher model to update parameters. The consistency of the inference of unlabeled data between the student model and teacher model is studied, so the whole framework is set to minimize segmentation supervision loss and consistency semi-supervision loss. To improve the semi-supervised performance, an uncertainty estimation scheme is introduced to enable the student model to learn from only reliable inference data during consistency loss calculation. The approach of filtering inconclusive images via an uncertainty value and the weighted sum of two losses in the training process is further studied. In addition, ViT is selected and properly developed as a backbone for the semi-supervised framework under the concern of long-range dependencies modeling. Our proposed method is tested with a variety of evaluation methods on a public benchmarking MRI dataset. The results of the proposed method demonstrate competitive performance against other state-of-the-art semi-supervised algorithms as well as several segmentation backbones.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
USD 89.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 119.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others


  1. Bernard, O., et al.: Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE Trans. Med. Imaging 37(11), 2514–2525 (2018)

    Article  Google Scholar 

  2. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020).

    Chapter  Google Scholar 

  3. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 424–432. Springer, Cham (2016).

    Chapter  Google Scholar 

  4. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  5. Ibtehaz, N., Rahman, M.S.: MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 121, 74–87 (2020)

    Article  Google Scholar 

  6. Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? Adv. Neural. Inf. Process. Syst. 30, 5574–5584 (2017)

    Google Scholar 

  7. Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 (2016)

  8. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)

  9. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)

  10. Luo, X.: SSL4MIS (2020).

  11. Oktay, O., et al.: Attention U-Net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018)

  12. Paszke, A., Chaurasia, A., Kim, S., Culurciello, E.: ENet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 (2016)

  13. Qiao, S., Shen, W., Zhang, Z., Wang, B., Yuille, A.: Deep co-training for semi-supervised image recognition. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 142–159. Springer, Cham (2018).

    Chapter  Google Scholar 

  14. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).

    Chapter  Google Scholar 

  15. Strudel, R.: Segmenter (2021).

  16. Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: transformer for semantic segmentation. arXiv preprint arXiv:2105.05633 (2021)

  17. Tarvainen, A., Valpola, H.: Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1195–1204 (2017)

    Google Scholar 

  18. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

    Google Scholar 

  19. Vu, T.H., Jain, H., Bucher, M., Cord, M., Pérez, P.: ADVENT: adversarial entropy minimization for domain adaptation in semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2517–2526 (2019)

    Google Scholar 

  20. Wang, Z., Voiculescu, I.: Quadruple augmented pyramid network for multi-class COVID-19 segmentation via CT. In: 2021 43rd Annual International Conference of the IEEE Engineering in Medicine Biology Society (EMBC) (2021)

    Google Scholar 

  21. Wang, Z., Zhang, Z., Voiculescu, I.: RAR-U-Net: a residual encoder to attention decoder by residual connections framework for spine segmentation under noisy labels. In: 2021 IEEE International Conference on Image Processing (ICIP), pp. 21–25. IEEE (2021)

    Google Scholar 

  22. Wightman, R.: Pytorch image models (2019).

  23. Yu, L., Wang, S., Li, X., Fu, C.-W., Heng, P.-A.: Uncertainty-aware self-ensembling model for semi-supervised 3D left atrium segmentation. In: Shen, D., et al. (eds.) MICCAI 2019. LNCS, vol. 11765, pp. 605–613. Springer, Cham (2019).

    Chapter  Google Scholar 

  24. Zhang, Y., Yang, L., Chen, J., Fredericksen, M., Hughes, D.P., Chen, D.Z.: Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In: Descoteaux, M., Maier-Hein, L., Franz, A., Jannin, P., Collins, D.L., Duchesne, S. (eds.) MICCAI 2017. LNCS, vol. 10435, pp. 408–416. Springer, Cham (2017).

    Chapter  Google Scholar 

  25. Zhang, Z., Li, S., Wang, Z., Lu, Y.: A novel and efficient tumor detection framework for pancreatic cancer via CT images. In: 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 1160–1164. IEEE (2020)

    Google Scholar 

Download references

Author information

Authors and Affiliations


Corresponding author

Correspondence to Ziyang Wang .

Editor information

Editors and Affiliations

A Appendix

A Appendix

Table 4 gives detailed systematic IOU results under different assumptions of the ratio of labeled to total data, on the MRI Cardiac test set. It is pleasantly remarkable to see serviceable results being obtained with a proportion of labelled data as small as 1%, 2%, or 3% of the total. Given the small set of type-specific annotations that exist, they can now be put to good use by pairing them with large amounts of unlabeled data and making them available through our proposed method.

Table 5 and Table 6 reports the different approaches to modify the threshold \(\tau \) of filtering certain or uncertain pixels with uncertainty estimation scheme, and the weight \(\lambda \) of loss \(\mathcal {L}_\mathrm{c}\) in each training iteration. We explore the fixed value, exponential ramp up [7], linear ramp up, cosine ramp down [9] and variants of them. Details of exponential ramp up, linear ramp up and cosine ramp down is illustrated in the following Eq. 8, 9, 10, respectively. Each experiment is conducted with different approaches under the other one either \(\tau \) or \(\lambda \) is fixed with exponential ramp up. The results illustrates different approaches of updating \(\tau \), \(\lambda \) in each training iteration cannot significantly improve the performance of proposed method, and all other experiments for \(\tau \), \(\lambda \) is with exponential ramp up.

$$\begin{aligned} \tau or \lambda = e^{-5\times (1-t_\mathrm{iteration}/t_\mathrm{maxiteration})^{2}} \end{aligned}$$
$$\begin{aligned} \tau or \lambda = t_\mathrm{iteration}/t_\mathrm{maxiteration)} \end{aligned}$$
$$\begin{aligned} \tau or \lambda = 0.5 \times (cosine(\pi \times t_\mathrm{iteration} / t_\mathrm{maxiteration})+1) \end{aligned}$$
Table 5. Ablation studies on the threshold setting of uncertainty in training process (the higher, the better)
Table 6. Ablation studies on the weight setting of consistency loss in training process (the higher, the better)

Figure 4 sketches randomly selected raw images with their corresponding uncertainty maps, and masks generated by proposed method at three different stages (from the beginning to the end) of the training process. In uncertainty maps, yellow represents the teacher ViT \(f_\mathrm{t}\) is uncertain of prediction with the given pixels, and blue represents the teacher ViT \(f_\mathrm{t}\) is certain of prediction with the given pixels. The uncertainty map is gradually moving from yellow to green in the training process as shown in Fig. 4. The threshold of certainty estimation is then applied with uncertainty map which results in masks, where the white represents that the prediction by teacher ViT \(f_\mathrm{t}\) is certain enough to guide the student ViT \(f_\mathrm{s}\) i.e. for calculation the consistency loss \(\mathcal {L}_\mathrm{s}\), and the black represents that the pixels with uncertainty is temporally unavailable to be considered in consistency semi-supervision loss calculation. Please remind that both the background and ROI can be certain with the white simultaneously. Some typical example masks illustrates that model is only uncertain with the boundary of ROI as shown in Fig. 4, and finally the framework is very likely to be certain with the whole image with a proper threshold setting, that the uncertainty map is going to be blue, mask is going to be white in the end of training process.

Fig. 4.
figure 4

Sample uncertainty maps, masks, and raw images during the training process (Color figure online)

Rights and permissions

Reprints and permissions

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Wang, Z., Zheng, JQ., Voiculescu, I. (2022). An Uncertainty-Aware Transformer for MRI Cardiac Semantic Segmentation via Mean Teachers. In: Yang, G., Aviles-Rivero, A., Roberts, M., Schönlieb, CB. (eds) Medical Image Understanding and Analysis. MIUA 2022. Lecture Notes in Computer Science, vol 13413. Springer, Cham.

Download citation

  • DOI:

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-12052-7

  • Online ISBN: 978-3-031-12053-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics