Abstract
Purpose
The purpose of this study was to improve surgical scene perception by addressing the challenge of reconstructing highly dynamic surgical scenes. We proposed a novel depth estimation network and a reconstruction framework based on neural radiance fields, which together provide more accurate scene information for surgical task automation and AR navigation.
Methods
We added a spatial pyramid pooling module and a Swin-Transformer module to enhance the robustness of stereo depth estimation, and further improved depth accuracy by enforcing a unique-matching constraint derived from optimal transport. To avoid deformation distortion in highly dynamic scenes, we used neural radiance fields to represent the scene implicitly along the time dimension and optimized them with depth and color supervision in a learning-based manner.
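The unique-matching constraint from optimal transport can be illustrated with a minimal entropic-OT (Sinkhorn) sketch. This is an assumption for illustration only: the paper's exact cost construction and solver may differ. Because each row and column of the resulting transport plan is forced toward a uniform marginal, mass cannot concentrate on many-to-one pixel matches, which is what suppresses mismatches along a stereo scanline.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=50):
    """Entropic optimal transport via Sinkhorn iterations.

    cost: (M, N) matching-cost matrix, e.g. between left/right
    scanline pixels. Returns a soft assignment (transport plan)
    whose row and column sums approach uniform marginals,
    discouraging many-to-one matches.
    """
    M, N = cost.shape
    K = np.exp(-cost / reg)          # Gibbs kernel from the cost
    a = np.full(M, 1.0 / M)          # uniform source marginal
    b = np.full(N, 1.0 / N)          # uniform target marginal
    u = np.ones(M) / M
    v = np.ones(N) / N
    for _ in range(n_iters):         # alternating marginal scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]
```

A hard matching can then be read off with a row-wise argmax over the plan; in a network this soft plan is kept differentiable so the constraint can be trained end to end.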
Results
Our experiments on the KITTI and SCARED datasets show that the proposed depth estimation network performs close to the state-of-the-art (SOTA) methods on natural images and surpasses them on medical images by 1.12% in 3-px error and 0.45 px in end-point error (EPE). The proposed dynamic reconstruction framework successfully reconstructed the dynamic cardiac surface from a totally endoscopic coronary artery bypass video, achieving SOTA performance with 27.983 dB PSNR, 0.812 SSIM, and 0.189 LPIPS.
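The two depth metrics above have standard definitions, sketched below for reference: EPE is the mean absolute disparity error in pixels, and the 3-px error is the percentage of pixels whose error exceeds 3 px. Note this is the simple thresholded variant; some benchmarks (e.g. KITTI's D1 score) additionally require the error to exceed a relative fraction of the ground-truth disparity.

```python
import numpy as np

def disparity_metrics(pred, gt, thresh=3.0):
    """EPE and n-px error rate for dense disparity maps.

    pred, gt: (H, W) disparity arrays in pixels.
    Returns (epe, bad_pct): mean absolute error, and the
    percentage of pixels with error > `thresh` pixels.
    """
    err = np.abs(pred - gt)
    epe = float(err.mean())
    bad_pct = float(100.0 * (err > thresh).mean())
    return epe, bad_pct
```

In practice both metrics are computed only over pixels with valid ground truth (e.g. where a LiDAR or structured-light depth sample exists), which a real evaluation script would mask out first.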
Conclusion
Our proposed depth estimation network and reconstruction framework make a significant contribution to the field of surgical scene perception. The network achieves better results than SOTA methods on medical datasets, reducing mismatches and producing more accurate depth maps with clearer edges, and the proposed reconstruction framework is verified on a series of dynamic cardiac surgical images. Future efforts will focus on improving the training speed and addressing the limited field of view.
Data availability
The public datasets used during the current study are available from the MICCAI 2019 EndoVis Challenge (https://endovissub2019-scared.grand-challenge.org/) and the Hamlyn Centre Endoscopic Video Dataset (http://hamlyn.doc.ic.ac.uk/vision/).
Code availability
Code will be publicly available with the publication of this work.
Acknowledgements
The authors thank Ziqi Liu for polishing the article and Yuehao Wang for the guidance on NeRF theory.
Funding
This study was funded by the National Natural Science Foundation of China (Grant Nos. 52175028 and 51721003).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file1 (MP4 16995 kb)
Supplementary file2 (MP4 4139 kb)
Supplementary file3 (MP4 7667 kb)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sun, X., Wang, F., Ma, Z. et al. Dynamic surface reconstruction in robot-assisted minimally invasive surgery based on neural radiance fields. Int J CARS 19, 519–530 (2024). https://doi.org/10.1007/s11548-023-03016-8