Abstract
In this paper, we introduce OperA, a transformer-based model that accurately predicts surgical phases from long video sequences. A novel attention regularization loss encourages the model to focus on high-quality frames during training. Moreover, the attention weights are used to identify characteristic high-attention frames for each surgical phase, which could in turn be used for surgery summarization. OperA is thoroughly evaluated on two datasets of laparoscopic cholecystectomy videos, outperforming various state-of-the-art temporal refinement approaches.
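The abstract does not spell out how high-attention frames are selected, so the following is only a minimal sketch of one plausible reading: given a per-frame attention score and a per-frame phase label, keep the frames with the largest scores within each phase. The function name `top_attention_frames`, the toy scores, and the phase names are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def top_attention_frames(attention, phases, k=1):
    """For each phase label, return the indices of the k frames
    with the highest attention score (hypothetical helper)."""
    if len(attention) != len(phases):
        raise ValueError("attention and phases must have the same length")

    # Group (score, frame index) pairs by phase label.
    by_phase = defaultdict(list)
    for idx, (weight, phase) in enumerate(zip(attention, phases)):
        by_phase[phase].append((weight, idx))

    result = {}
    for phase, items in by_phase.items():
        # Highest score first; ties go to the earlier frame.
        items.sort(key=lambda t: (-t[0], t[1]))
        result[phase] = [idx for _, idx in items[:k]]
    return result


# Toy input (assumed): one attention score and one phase label per frame.
attention = [0.05, 0.30, 0.10, 0.20, 0.25, 0.10]
phases = ["preparation", "preparation", "preparation",
          "dissection", "dissection", "dissection"]

summary = top_attention_frames(attention, phases, k=1)
print(summary)
```

In a real pipeline the scores would presumably be derived from the transformer's attention weights (for example, the attention each frame receives, averaged over heads), and the selected frames could serve as a compact visual summary of each phase.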
Acknowledgements
Our research is partly funded by the DFG research unit PLAFOKON (FKZ 620/33-2) and BMBF research project ARTEKMED (FKZ 16SV8088) in collaboration with the Minimal-invasive Interdisciplinary Intervention Group.
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N. (2021). OperA: Attention-Regularized Transformers for Surgical Phase Recognition. In: de Bruijne, M., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. Lecture Notes in Computer Science, vol 12904. Springer, Cham. https://doi.org/10.1007/978-3-030-87202-1_58
DOI: https://doi.org/10.1007/978-3-030-87202-1_58
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87201-4
Online ISBN: 978-3-030-87202-1
eBook Packages: Computer Science (R0)