Can surgical computer vision benefit from large-scale visual foundation models?

  • Short communication
International Journal of Computer Assisted Radiology and Surgery

Abstract

Purpose

We investigate whether foundation models pretrained on diverse visual data can benefit surgical computer vision. We use instrument and uterus segmentation in minimally invasive procedures as benchmarks. We propose multiple supervised, unsupervised and few-shot supervised adaptations of foundation models, including two novel adaptation methods.

Methods

We use DINOv1, DINOv2, DINOv2 with registers, and SAM backbones, together with the ART-Net surgical instrument and SurgAI3.8K uterus segmentation datasets. We investigate five approaches: unsupervised DINO, few-shot learning with a linear decoder, supervised learning with the proposed DINO-UNet adaptation, DPT with a DINO encoder, and unsupervised learning with the proposed SAM adaptation.
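
To make the few-shot linear-decoder approach concrete, the sketch below trains a per-patch linear classifier on frozen DINOv2 patch features and upsamples the resulting logits to pixel resolution. It illustrates the general recipe rather than the paper's exact adaptations (DINO-UNet, DPT or the SAM adaptation); names such as LinearSegDecoder, the ViT-S/14 backbone choice and the 224×224 input size are illustrative assumptions, not taken from the paper.

# Minimal sketch: frozen DINOv2 backbone + trainable 1x1-conv "linear decoder".
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSegDecoder(nn.Module):
    """Per-patch linear classifier on frozen DINOv2 patch tokens."""

    def __init__(self, backbone, embed_dim=384, patch_size=14, num_classes=2):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():        # keep the foundation model frozen
            p.requires_grad = False
        self.patch_size = patch_size
        self.head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, images):
        b, _, h, w = images.shape                   # h and w must be multiples of patch_size
        with torch.no_grad():                       # features are extracted, not fine-tuned
            tokens = self.backbone.forward_features(images)["x_norm_patchtokens"]
        gh, gw = h // self.patch_size, w // self.patch_size
        feats = tokens.permute(0, 2, 1).reshape(b, -1, gh, gw)
        logits = self.head(feats)                   # coarse, patch-level logits
        return F.interpolate(logits, size=(h, w),   # upsample to pixel resolution
                             mode="bilinear", align_corners=False)

# Usage: ViT-S/14 backbone from torch hub, binary instrument-vs-background masks.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model = LinearSegDecoder(backbone)
frames = torch.randn(2, 3, 224, 224)                # dummy laparoscopic frames
masks = model(frames).argmax(dim=1)                 # (2, 224, 224) label map

Since only the 1x1 convolution head is optimized, a small number of annotated frames and a standard cross-entropy loss are typically sufficient to fit it, which is what makes this kind of adaptation attractive in data-scarce surgical settings.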

Results

We evaluate 17 models for instrument segmentation and 7 models for uterus segmentation, and compare them to existing task-specific (ad hoc) models. We show that the linear decoder can be learned with few shots. The unsupervised and linear-decoder methods obtain slightly subpar results but could be useful in data-scarce settings. The unsupervised SAM model produces finer edges but has inconsistent outputs. In contrast, DPT and DINO-UNet obtain strikingly good results, defining a new state of the art by outperforming the previous best by 5.6 and 4.1 pp for instrument segmentation and by 4.4 and 1.5 pp for uterus segmentation, respectively. Both methods achieve high semantic and spatial precision, accurately segmenting intricate details.

Conclusion

Our results show the strong potential of using DINO and SAM for surgical computer vision, indicating a promising role for visual foundation models in medical image analysis, particularly in scenarios with limited or complex data.

Author information

Corresponding author

Correspondence to Navid Rabbani.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

All procedures involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors. Informed consent was obtained from the patients included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rabbani, N., Bartoli, A. Can surgical computer vision benefit from large-scale visual foundation models? Int J CARS (2024). https://doi.org/10.1007/s11548-024-03125-y
