Learning Where to Look While Tracking Instruments in Robot-Assisted Surgery

  • Mobarakol Islam
  • Yueyuan Li
  • Hongliang Ren
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11768)


Directing task-specific attention while tracking instruments in surgery holds great potential for robot-assisted intervention. To this end, we propose an end-to-end trainable multitask learning (MTL) model for real-time surgical instrument segmentation and attention prediction. Our model is designed with a weight-shared encoder and two task-oriented decoders and is optimized jointly for both tasks. We introduce a batch-Wasserstein (bW) loss and construct a soft attention module that refines the distinctive visual regions for efficient saliency learning. In multitask optimization, it is challenging to get both tasks to converge in the same epoch; we address this by adopting a 'poly' loss weight and a two-phase training scheme. We further propose a novel way to generate task-aware saliency maps and scanpaths of the instruments on the MICCAI robotic instrument segmentation dataset. Compared to state-of-the-art segmentation and saliency models, our model performs better on most evaluation metrics.
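Two of the training ingredients named above can be sketched concretely. The snippet below is an illustrative sketch only: the exponent in the 'poly' schedule and the 1-D formulation of the Wasserstein distance are assumptions (the paper's batch-Wasserstein loss operates on 2-D saliency maps per batch; for 1-D histograms, W1 reduces to the L1 distance between CDFs, which is used here purely to convey the idea).

```python
import numpy as np

def poly_weight(epoch: int, max_epoch: int, power: float = 0.9) -> float:
    """'Poly' decay: the weight falls smoothly from 1.0 to 0.0 over training.

    A 'poly' loss weight of this shape lets one task's loss dominate early
    and fade later. The exponent 0.9 is an assumption (the common default
    for poly learning-rate schedules), not a value taken from the paper.
    """
    return (1.0 - epoch / max_epoch) ** power

def wasserstein_1d(p, q) -> float:
    """Wasserstein-1 distance between two 1-D probability histograms.

    Unlike per-pixel cross-entropy, a Wasserstein loss accounts for how far
    probability mass must be moved, which suits spatially smooth saliency
    targets. For 1-D histograms, W1 equals the L1 distance between CDFs.
    """
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())
```

For example, `poly_weight(0, 100)` is 1.0 and `poly_weight(100, 100)` is 0.0, while `wasserstein_1d([1.0, 0, 0], [0, 0, 1.0])` is 2.0, since all the mass must travel two bins.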



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. NUS Graduate School for Integrative Sciences and Engineering (NGS), National University of Singapore, Singapore
  2. Department of Biomedical Engineering, National University of Singapore, Singapore
  3. University of Michigan - Joint Institute, Shanghai Jiao Tong University, Shanghai, China