Abstract
Purpose
We investigate whether foundation models pretrained on diverse visual data can benefit surgical computer vision, using instrument and uterus segmentation in minimally invasive procedures as benchmarks. We propose multiple supervised, unsupervised and few-shot supervised adaptations of foundation models, including two novel adaptation methods.
Methods
We use the DINOv1, DINOv2, DINOv2-with-registers and SAM backbones, with the ART-Net surgical instrument segmentation and SurgAI3.8K uterus segmentation datasets. We investigate five approaches: unsupervised DINO, few-shot learning with a linear decoder, supervised learning with the proposed DINO-UNet adaptation, DPT with a DINO encoder, and unsupervised learning with the proposed SAM adaptation.
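To make the few-shot linear-decoder approach concrete, the following minimal sketch trains a single 1x1 convolution on frozen DINOv2 patch features. It is an illustration under stated assumptions (the public torch-hub DINOv2 entry point, binary masks, and a placeholder loader named few_shot_loader), not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Frozen DINOv2 ViT-B/14 backbone (assumed torch-hub entry point).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Linear decoder: a 1x1 convolution mapping 768-d patch features to one logit.
head = nn.Conv2d(768, 1, kernel_size=1)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

def segment_logits(images):
    # images: (B, 3, H, W), with H and W multiples of the 14-pixel patch size.
    B, _, H, W = images.shape
    h, w = H // 14, W // 14
    with torch.no_grad():
        feats = backbone.forward_features(images)["x_norm_patchtokens"]  # (B, h*w, 768)
    feats = feats.permute(0, 2, 1).reshape(B, 768, h, w)
    logits = head(feats)  # (B, 1, h, w), one logit per patch
    return F.interpolate(logits, size=(H, W), mode="bilinear", align_corners=False)

# Hypothetical few-shot data: replace with a handful of labelled frames.
few_shot_loader = [(torch.randn(2, 3, 224, 224),
                    torch.randint(0, 2, (2, 1, 224, 224)).float())]

for epoch in range(100):
    for images, masks in few_shot_loader:
        loss = F.binary_cross_entropy_with_logits(segment_logits(images), masks)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Because only the 769 parameters of the head are trained, a few annotated frames can suffice, which is what makes this kind of adaptation attractive in data-scarce settings.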
Results
We evaluate 17 models for instrument segmentation and 7 models for uterus segmentation, and compare them to existing ad hoc models for the tasks at hand. We show that the linear decoder can be learned with few shots. The unsupervised and linear-decoder methods obtain slightly subpar results but could be useful in data-scarce settings. The unsupervised SAM model produces finer edges but has inconsistent outputs. However, DPT and DINO-UNet obtain strikingly good results, defining a new state of the art by outperforming the previous best by 5.6 and 4.1 pp for instrument segmentation and by 4.4 and 1.5 pp for uterus segmentation, respectively. Both methods achieve semantic and spatial precision, accurately segmenting intricate details.
Conclusion
Our results show the huge potential of using DINO and SAM for surgical computer vision, indicating a promising role for visual foundation models in medical image analysis, particularly in scenarios with limited or complex data.
References
Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, Min Y, Zhang B, Zhang J, Dong Z et al (2023) A survey of large language models. arXiv:2303.18223
Oquab M, Darcet T, Moutakanni T, Vo HV, Szafraniec M et al (2023) DINOv2: learning robust visual features without supervision. arXiv:2304.07193
Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo W-Y et al (2023) Segment anything. In: ICCV
Zou X, Yang J, Zhang H, Li F, Li L, Wang J, Wang L, Gao J, Lee YJ (2023) Segment everything everywhere all at once. In: NeurIPS
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G et al (2021) Learning transferable visual models from natural language supervision. In: ICML
Caron M, Touvron H, Misra I, Jégou H, Mairal J, Bojanowski P, Joulin A (2021) Emerging properties in self-supervised vision transformers. In: ICCV
Darcet T, Oquab M, Mairal J, Bojanowski P (2023) Vision transformers need registers. arXiv:2309.16588
Hasan MK, Calvet L, Rabbani N, Bartoli A (2021) Detection, segmentation, and 3D pose estimation of surgical tools using convolutional neural networks and algebraic geometry. Med Image Anal 70:101994
Zadeh SM, François T, Comptour A, Canis M, Bourdel N, Bartoli A (2023) SurgAI3.8K: a labeled dataset of gynecologic organs in laparoscopy with application to automatic augmented reality surgical guidance. J Minim Invasive Gynecol 30(5):397–405
Ranftl R, Bochkovskiy A, Koltun V (2021) Vision transformers for dense prediction. In: ICCV
Ramesh S, Srivastav V, Alapatt D, Yu T, Murali A, Sestini L, Nwoye CI, Hamoud I, Sharma S, Fleurentin A et al (2023) Dissecting self-supervised learning methods for surgical computer vision. Med Image Anal 88:102844
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
All procedures involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This article does not contain any studies with animals performed by any of the authors. Informed consent was obtained from the patients included in the study.
Cite this article
Rabbani N, Bartoli A (2024) Can surgical computer vision benefit from large-scale visual foundation models? Int J CARS. https://doi.org/10.1007/s11548-024-03125-y