Toward Clinically Assisted Colorectal Polyp Recognition via Structured Cross-Modal Representation Consistency

Part of the Lecture Notes in Computer Science book series (LNCS,volume 13433)


Colorectal polyp classification is a critical clinical examination. To improve classification accuracy, most computer-aided diagnosis algorithms recognize colorectal polyps from Narrow-Band Imaging (NBI). However, NBI is often unavailable in real clinical scenarios, because acquiring it requires manually switching the light mode after a polyp has been detected under White-Light (WL) imaging. To avoid this situation, we propose a novel method that directly achieves accurate white-light colonoscopy image classification by enforcing structured cross-modal representation consistency. In practice, a pair of multi-modal images, i.e. NBI and WL, is fed into a shared Transformer to extract hierarchical feature representations. A newly designed Spatial Attention Module (SAM) is then adopted to calculate the similarities between the class token and the patch tokens of each modality's image. By aligning the class tokens and spatial attention maps of paired NBI and WL images at different levels, the Transformer learns to keep both global and local representations consistent across the two modalities. Extensive experimental results show that the proposed method outperforms recent studies by a margin, realizing multi-modal prediction with a single Transformer while greatly improving classification accuracy when only WL images are available. Code is available at
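
The alignment idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the similarity function or the loss, so the cosine similarity, the softmax normalization, the mean-squared-error terms, and the names `spatial_attention` and `consistency_loss` are all assumptions made for illustration. Random arrays stand in for the class and patch tokens that a shared Transformer would produce at each level.

```python
import numpy as np


def spatial_attention(cls_token, patch_tokens):
    """Attention map over patches from class-token/patch-token similarity.

    Assumption: cosine similarity followed by a softmax; the abstract only
    says that similarities between class token and patch tokens are computed.
    """
    cls_unit = cls_token / np.linalg.norm(cls_token)
    patch_unit = patch_tokens / np.linalg.norm(patch_tokens, axis=1, keepdims=True)
    sim = patch_unit @ cls_unit            # shape (num_patches,)
    exp = np.exp(sim - sim.max())          # numerically stable softmax
    return exp / exp.sum()


def consistency_loss(levels_nbi, levels_wl):
    """Sum over levels of a global (class token) and a local (attention map) term.

    Assumption: both terms are mean squared errors with equal weight.
    Each level is a (class_token, patch_tokens) pair.
    """
    total = 0.0
    for (cls_n, patches_n), (cls_w, patches_w) in zip(levels_nbi, levels_wl):
        global_term = np.mean((cls_n - cls_w) ** 2)
        attn_n = spatial_attention(cls_n, patches_n)
        attn_w = spatial_attention(cls_w, patches_w)
        local_term = np.mean((attn_n - attn_w) ** 2)
        total += global_term + local_term
    return float(total)


# Hypothetical features: 2 levels, 16 patches, 8-dimensional tokens.
rng = np.random.default_rng(0)
nbi = [(rng.normal(size=8), rng.normal(size=(16, 8))) for _ in range(2)]
wl = [(rng.normal(size=8), rng.normal(size=(16, 8))) for _ in range(2)]

print(consistency_loss(nbi, nbi))        # identical features: 0.0
print(consistency_loss(nbi, wl) > 0.0)   # differing features: True
```

In a training setup of this kind, such a term would presumably be added to the classification loss so that WL features are pulled toward their paired NBI features; the weighting between the terms is left open here.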


  • Colorectal polyps classification
  • Multi-modal representation learning
  • Transformer architecture



This work is supported in part by the Young Scientists Fund of the National Natural Science Foundation of China under grant No. 62106154, by the Natural Science Foundation of Guangdong Province, China (General Program) under grant No. 2022A1515011524, and by the Guangdong Provincial Key Laboratory of Big Data Computing, The Chinese University of Hong Kong, Shenzhen.

Author information

Corresponding author

Correspondence to Ruimao Zhang.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ma, W. et al. (2022). Toward Clinically Assisted Colorectal Polyp Recognition via Structured Cross-Modal Representation Consistency. In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2022. MICCAI 2022. Lecture Notes in Computer Science, vol 13433. Springer, Cham.

  • DOI:

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16436-1

  • Online ISBN: 978-3-031-16437-8

  • eBook Packages: Computer Science, Computer Science (R0)