
MMY-Net: a multimodal network exploiting image and patient metadata for simultaneous segmentation and diagnosis

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Accurate medical image segmentation can effectively assist disease diagnosis and treatment. While neural networks have often been applied to the segmentation problem in recent computer-aided diagnosis, patient metadata is usually neglected. Motivated by this, we propose a medical image segmentation and diagnosis framework that takes advantage of both the image and the patient’s metadata, such as gender and age. We present MMY-Net, a new multimodal network for simultaneous tumor segmentation and diagnosis that exploits patient metadata. Our architecture consists of three parts: a visual encoder, a text encoder, and a decoder with a self-attention block. Specifically, we design a text preprocessing block to embed the metadata effectively; the image features and text embedding features are then fused at several layers between the two encoders. Moreover, Interlaced Sparse Self-Attention is added to the decoder to further boost performance. We apply our algorithm to two private datasets: ZJU2, and LISHUI for zero-shot validation. Results show that our algorithm combined with metadata outperforms its counterpart without metadata by a large margin for basal cell carcinoma segmentation (improvements of 14.3% in IoU and 8.5% in Dice on the ZJU2 dataset, and 7.1% in IoU on the LISHUI validation dataset). Additionally, we apply MMY-Net to one public segmentation dataset, GlaS, to demonstrate its general segmentation capability; MMY-Net outperforms state-of-the-art methods on the GlaS dataset.
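
The abstract describes the key idea: metadata such as gender and age is embedded by a text encoder and fused with image features at several encoder layers. As a rough illustration only, the following is a minimal PyTorch sketch of one such fusion stage. The fusion operator, module names, and dimensions (text_dim, img_channels) are assumptions; the abstract does not specify the authors' actual implementation.

```python
# Minimal sketch of metadata-image fusion at one encoder stage.
# Assumption: an additive broadcast fusion; the paper's actual fusion
# design is not given in the abstract, so all names here are hypothetical.
import torch
import torch.nn as nn

class MetadataFusionBlock(nn.Module):
    """Fuse a metadata (text) embedding into an image feature map,
    here by linear projection, spatial broadcast, and addition."""
    def __init__(self, text_dim: int, img_channels: int):
        super().__init__()
        self.proj = nn.Linear(text_dim, img_channels)

    def forward(self, img_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); text_emb: (B, text_dim)
        t = self.proj(text_emb)                # (B, C)
        return img_feat + t[:, :, None, None]  # broadcast over H and W

# Toy usage: a 256-channel feature map fused with a 768-d metadata
# embedding (e.g., a BERT-style encoding of "female, age 67").
fusion = MetadataFusionBlock(text_dim=768, img_channels=256)
img_feat = torch.randn(2, 256, 32, 32)
text_emb = torch.randn(2, 768)
fused = fusion(img_feat, text_emb)
print(fused.shape)  # torch.Size([2, 256, 32, 32])
```

Additive broadcast fusion is only one plausible choice; since the abstract says features are "fused at several layers between the two encoders", the actual network may instead use concatenation or cross-attention at each stage.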

Data availability

The GlaS dataset is openly available at https://warwick.ac.uk/fac/cross_fac/tia/data/glascontest/. The ZJU2 and LISHUI datasets are private and not shared publicly.

Acknowledgements

This research was supported by the National Natural Science Foundation Regional Innovation and Development Joint Fund (no. U20A20386), the National Natural Science Foundation of China (no. U22A2033, no. 62206242 and no. 62202130), National Key R&D Projects (no. 2022YFE0210700), the Zhejiang Provincial Science and Technology Program in China (no. LQ22F020026), Applied Research of Public Welfare Technology of Zhejiang Province (no. LGF22H120007), and the Open Project of the State Key Laboratory of CAD & CG at Zhejiang University (no. A2212).

Author information

Contributions

RG: Methodology, Software, Writing—Original draft preparation. YZ: Data curation, Software, Writing—Original draft preparation. YW: Visualization, Writing—Reviewing and Editing. DC: Supervision, Resources. RG: Supervision. ZJ: Validation. LW: Software, Validation. JY: Supervision, Resources, Project administration, Funding acquisition. GJ: Supervision, Resources, Project administration, Funding acquisition. LW: Conceptualization, Investigation, Resources, Writing—Reviewing and Editing.

Corresponding authors

Correspondence to Gangyong Jia or Linyan Wang.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by T. Li.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gu, R., Zhang, Y., Wang, L. et al. Mmy-net: a multimodal network exploiting image and patient metadata for simultaneous segmentation and diagnosis. Multimedia Systems 30, 72 (2024). https://doi.org/10.1007/s00530-024-01260-9
