OperA: Attention-Regularized Transformers for Surgical Phase Recognition

  • Conference paper
Medical Image Computing and Computer Assisted Intervention – MICCAI 2021 (MICCAI 2021)


In this paper, we introduce OperA, a transformer-based model that accurately predicts surgical phases from long video sequences. A novel attention regularization loss encourages the model to focus on high-quality frames during training. Moreover, the attention weights are used to identify characteristic high-attention frames for each surgical phase, which could in turn be used for surgery summarization. OperA is thoroughly evaluated on two datasets of laparoscopic cholecystectomy videos and outperforms various state-of-the-art temporal refinement approaches.
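The idea of picking characteristic high-attention frames per phase can be illustrated with a minimal sketch. It assumes that each frame already has a single scalar attention score (for example, averaged over heads and layers) and a predicted phase label; the function name `top_attention_frames`, the parameter `k`, and the toy values are hypothetical and are not taken from the paper, whose actual selection procedure is not described on this page.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def top_attention_frames(
    attention: Sequence[float],
    phases: Sequence[int],
    k: int = 1,
) -> Dict[int, List[int]]:
    """Return, for each phase, the indices of its k highest-attention frames.

    Illustrative only: `attention` is assumed to hold one scalar score per
    frame and `phases` one phase label per frame.
    """
    if len(attention) != len(phases):
        raise ValueError("attention and phases must have the same length")

    # Group frame indices by their phase label.
    by_phase: Dict[int, List[int]] = defaultdict(list)
    for idx, phase in enumerate(phases):
        by_phase[phase].append(idx)

    # Within each phase, rank frames by attention (highest first), keep k.
    return {
        phase: sorted(idxs, key=lambda i: attention[i], reverse=True)[:k]
        for phase, idxs in by_phase.items()
    }


# Toy example: six frames, two phases (values are made up).
attention = [0.05, 0.30, 0.10, 0.20, 0.25, 0.10]
phases = [0, 0, 0, 1, 1, 1]
summary = top_attention_frames(attention, phases, k=1)
print(summary)  # {0: [1], 1: [4]}
```

With `k` larger than 1, the same function returns a short ranked list of frames per phase, which is one plausible building block for a video summary.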



Our research is partly funded by the DFG research unit PLAFOKON (FKZ 620/33-2) and BMBF research project ARTEKMED (FKZ 16SV8088) in collaboration with the Minimal-invasive Interdisciplinary Intervention Group.

Corresponding author

Correspondence to Tobias Czempiel.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Czempiel, T., Paschali, M., Ostler, D., Kim, S.T., Busam, B., Navab, N. (2021). OperA: Attention-Regularized Transformers for Surgical Phase Recognition. In: de Bruijne, M., et al. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. MICCAI 2021. Lecture Notes in Computer Science, vol 12904. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87201-4

  • Online ISBN: 978-3-030-87202-1

  • eBook Packages: Computer Science, Computer Science (R0)
