Against spatial–temporal discrepancy: contrastive learning-based network for surgical workflow recognition

Xia, Tong; Jia, Fucang

doi:10.1007/s11548-021-02382-5

Original Article
Published: 05 May 2021

Volume 16, pages 839–848 (2021)
International Journal of Computer Assisted Radiology and Surgery

Abstract

Purpose

Automatic workflow recognition from surgical videos is fundamental to developing context-aware systems in modern operating rooms. Although many approaches have been proposed for this complex task, open problems remain, notably the fine-grained characteristics of surgical videos and their spatial–temporal discrepancies.

Methods

We propose a contrastive learning-based convolutional recurrent network with multi-level prediction to tackle these problems. Specifically, split-attention blocks are employed to extract spatial features. Through a mapping function in the step-phase branch, the current workflow can be predicted on two mutual-boosting levels. Furthermore, a contrastive branch is introduced to learn the spatial–temporal features that eliminate irrelevant changes in the environment.
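
As a rough illustration of predicting on two levels through a mapping function, the sketch below aggregates step-level probabilities into phase-level probabilities. It assumes a many-to-one relation between steps and phases; the names `STEP_TO_PHASE`, `phase_probabilities`, and `predict`, as well as the step and phase labels, are hypothetical and are not taken from the article, whose actual mapping function may differ.

```python
from collections import defaultdict

# Hypothetical many-to-one mapping from fine-grained steps to coarser phases.
# The labels are placeholders, not the Cataract-101 annotations.
STEP_TO_PHASE = {
    "step_a": "phase_1",
    "step_b": "phase_1",
    "step_c": "phase_2",
}


def phase_probabilities(step_probs):
    """Sum step-level probabilities into phase-level probabilities."""
    phase_probs = defaultdict(float)
    for step, prob in step_probs.items():
        phase_probs[STEP_TO_PHASE[step]] += prob
    return dict(phase_probs)


def predict(step_probs):
    """Return the most likely step and the most likely phase."""
    phase_probs = phase_probabilities(step_probs)
    best_step = max(step_probs, key=step_probs.get)
    best_phase = max(phase_probs, key=phase_probs.get)
    return best_step, best_phase
```

With step probabilities of 0.3, 0.3, and 0.4 for `step_a`, `step_b`, and `step_c`, the step-level prediction is `step_c` while the phase-level prediction is `phase_1`. Disagreements of this kind are the sort of signal that two coupled prediction levels could use to correct each other.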

Results

We evaluate our method on the Cataract-101 dataset. The results show that our method achieves an accuracy of 96.37% with only surgical step labels, which outperforms other state-of-the-art approaches.

Conclusion

The proposed convolutional recurrent network based on step-phase prediction and contrastive learning can leverage fine-grained characteristics and alleviate spatial–temporal discrepancies to improve the performance of surgical workflow recognition.

Acknowledgements

This work was supported by grants from the National Key R&D Program (No. 2019YFC0118100 and 2017YFC0110903), the National Natural Science Foundation of China (12026602), the Shenzhen Key Basic Science Program (JCYJ20180507182437217), the Key-Area Research and Development Program of Guangdong Province (2020B010165004), the Science and Technology Program of Guangdong Province (2017ZC0222) and the Shenzhen Key Laboratory Program (ZDSYS201707271637577).

Author information

Authors and Affiliations

Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Tong Xia & Fucang Jia
University of Chinese Academy of Sciences, Beijing, China
Tong Xia & Fucang Jia

Corresponding author

Correspondence to Fucang Jia.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflicts of interest.

Ethical approval

For this type of study, formal consent is not required.

Informed consent

This article uses patient data from a publicly available dataset.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Detailed study on sequence length

Table 3 Average duration of each surgical step in cataract surgery

When selecting the input sequence length, a key factor is the average duration of each surgical step. The per-step statistics were calculated from the Cataract-101 dataset and are presented in Table 3. The table shows that the shortest surgical step lasts 13 s, half of the steps last approximately 20 s, and the single longest step lasts 145 s. Different input sequence lengths can therefore be selected for different use cases. In our experiments, the sequence length is set to 10 for real-time intra-operative use, for safety reasons. For online post-operative use, it is set to 20, which improves the average recognition results at the expense of a few steps.
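
The relation between window length and step duration can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes one sampled frame per second so that a sequence length of 10 or 20 corresponds to a window of 10 s or 20 s, and the helper names `trailing_window` and `window_fits_step` are hypothetical.

```python
def trailing_window(frames, t, seq_len):
    """Return up to seq_len frames ending at index t (inclusive).

    Only frames at or before t are used, as a real-time setting requires.
    """
    start = max(0, t - seq_len + 1)
    return frames[start:t + 1]


def window_fits_step(seq_len_s, step_duration_s):
    """True if a window of seq_len_s seconds can lie entirely inside one step."""
    return seq_len_s <= step_duration_s


# Shortest step duration reported in the text, in seconds.
SHORTEST_STEP_S = 13

frames = list(range(30))  # stand-in for 30 sampled frames
window = trailing_window(frames, t=12, seq_len=10)
```

Under these assumptions, a 10 s window can fall entirely within the shortest step (13 s), whereas a 20 s window always spans at least one step boundary there, which matches the trade-off described above.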

Table 4 Accuracy and Jaccard performance with the extension of input sequence length

To validate these assumptions, we conducted experiments with different input sequence lengths (3 s, 5 s, 8 s, 10 s, 20 s, 40 s, and 60 s). The results in terms of recall, precision, accuracy, and F1 score are presented in Fig. 6 and Table 4. Figure 6 shows that, as the input sequence length extends from 3 s to 10 s, our model exploits more temporal information between frames and performs better on most surgical steps. When the sequence length increases to 20 s, most surgical steps reach higher performance. However, there is an obvious decline in Step 2, viscous agent injection. This supports our assumption that this step, which accounts for the least time in the whole procedure, is very likely to be disrupted when temporal information is aggregated from too far in the past. Extending the sequence length from 20 s to 60 s introduces more irrelevant information for most surgical steps and therefore lowers the average results, as shown in Table 4. These experimental results support our choice of sequence length for both intra-operative and post-operative use.
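
For reference, the two metrics named in the caption of Table 4 can be computed per frame as in the sketch below. It assumes frame-level labels and a mean Jaccard index taken over the classes present in the ground truth or the predictions; the function names are hypothetical, and the article's exact averaging protocol (for example, per video) is not stated in this excerpt.

```python
def accuracy(y_true, y_pred):
    """Fraction of frames whose predicted step equals the annotated step."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def jaccard_per_class(y_true, y_pred):
    """Intersection over union of frames for each step label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        inter = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        union = sum(t == label or p == label for t, p in zip(y_true, y_pred))
        scores[label] = inter / union  # union > 0 because label occurs somewhere
    return scores


def mean_jaccard(y_true, y_pred):
    """Unweighted mean of the per-class Jaccard indices."""
    scores = jaccard_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy frame-level labels, invented purely for illustration.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
```

On this toy input, four of the six frames are correct, and the per-class Jaccard indices are 1/3, 2/3, and 1/2, giving a mean of 0.5. The example shows that accuracy and mean Jaccard can differ noticeably on the same predictions.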

About this article

Cite this article

Xia, T., Jia, F. Against spatial–temporal discrepancy: contrastive learning-based network for surgical workflow recognition. Int J CARS 16, 839–848 (2021). https://doi.org/10.1007/s11548-021-02382-5

Received: 15 March 2021
Accepted: 16 April 2021
Published: 05 May 2021
Issue Date: May 2021
DOI: https://doi.org/10.1007/s11548-021-02382-5
