Unlocking Fine-Grained Details with Wavelet-Based High-Frequency Enhancement in Transformers

  • Conference paper
  • Machine Learning in Medical Imaging (MLMI 2023)

Abstract

Medical image segmentation is a critical task that plays a vital role in diagnosis, treatment planning, and disease monitoring. Accurate segmentation of anatomical structures and abnormalities from medical images can aid in the early detection and treatment of various diseases. In this paper, we address the local feature deficiency of the Transformer model by carefully re-designing the self-attention map to produce accurate dense predictions in medical images. To this end, we first apply the wavelet transformation to decompose the input feature map into low-frequency (LF) and high-frequency (HF) subbands. The LF segment is associated with coarse-grained features, while the HF components preserve fine-grained features such as texture and edge information. Next, we reformulate the self-attention operation using the efficient Transformer to perform both spatial and context attention on top of the frequency representation. Furthermore, to intensify the importance of the boundary information, we impose an additional attention map by creating a Gaussian pyramid on top of the HF components. Moreover, we propose a multi-scale context enhancement block within the skip connections to adaptively model inter-scale dependencies and overcome the semantic gap between the encoder and decoder stages. Through comprehensive experiments, we demonstrate the effectiveness of our strategy on multi-organ and skin lesion segmentation benchmarks. The implementation code will be made publicly available on GitHub upon acceptance.
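
To make the pipeline described in the abstract concrete, the following is a minimal, self-contained PyTorch sketch of its three main ingredients: a single-level Haar wavelet split of a feature map into LF/HF subbands, a linear-complexity attention that factorizes the softmax so the N × N attention matrix is never formed, and a toy Gaussian-pyramid boundary-attention map built from the HF magnitudes. All names (`HaarDWT`, `efficient_attention`, `hf_boundary_attention`), the choice of linear-attention variant, and the fusion step at the end are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Hedged sketch of the abstract's pipeline; module and function names are
# illustrative, not taken from the paper's code release.
import torch
import torch.nn.functional as F


class HaarDWT(torch.nn.Module):
    """Single-level 2D Haar wavelet transform as a fixed depthwise conv.

    Splits a (B, C, H, W) feature map into one low-frequency band (LL)
    and three high-frequency bands (LH, HL, HH), each (B, C, H/2, W/2).
    """

    def __init__(self):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        self.register_buffer("filters", torch.stack([ll, lh, hl, hh]).unsqueeze(1))

    def forward(self, x):
        b, c, h, w = x.shape
        f = self.filters.repeat(c, 1, 1, 1)        # (4*C, 1, 2, 2), one filter set per channel
        out = F.conv2d(x, f, stride=2, groups=c)   # (B, 4*C, H/2, W/2)
        out = out.view(b, c, 4, h // 2, w // 2)
        ll, lh, hl, hh = out.unbind(dim=2)
        return ll, (lh, hl, hh)


def efficient_attention(q, k, v):
    """Linear-complexity attention: softmax over the queries' feature dim and
    the keys' token dim lets us compute (K^T V) first, so the N x N attention
    matrix is never materialized."""
    q = q.softmax(dim=-1)                          # (B, N, d)
    k = k.softmax(dim=1)                           # (B, N, d)
    context = k.transpose(-2, -1) @ v              # (B, d, d)
    return q @ context                             # (B, N, d)


def hf_boundary_attention(hf_bands, levels=3):
    """Toy boundary-attention map from a Gaussian-style pyramid over the
    high-frequency magnitudes (a stand-in for the paper's HF attention)."""
    mag = sum(band.abs() for band in hf_bands)     # (B, C, h, w)
    pyr, cur = [mag], mag
    for _ in range(levels - 1):
        cur = F.avg_pool2d(cur, 2)                 # coarser pyramid level
        pyr.append(cur)
    size = mag.shape[-2:]
    fused = sum(F.interpolate(p, size=size, mode="bilinear",
                              align_corners=False) for p in pyr) / len(pyr)
    return fused.sigmoid()                         # squash to [0, 1]


if __name__ == "__main__":
    x = torch.randn(1, 32, 64, 64)                 # dummy encoder feature map
    ll, hf = HaarDWT()(x)
    attn_map = hf_boundary_attention(hf)           # (1, 32, 32, 32)
    ll = ll * attn_map                             # impose boundary attention (illustrative fusion)
    tokens = ll.flatten(2).transpose(1, 2)         # (1, 1024, 32) token sequence
    out = efficient_attention(tokens, tokens, tokens)
    print(ll.shape, attn_map.shape, out.shape)
```

Implementing the wavelet transform as fixed depthwise convolutions keeps the decomposition differentiable and GPU-friendly, which is the usual way wavelet blocks are embedded in Transformer backbones.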


Acknowledgments

This work was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) under project number 191948804.

Author information

Correspondence to Reza Azad.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2092 KB)


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Azad, R. et al. (2024). Unlocking Fine-Grained Details with Wavelet-Based High-Frequency Enhancement in Transformers. In: Cao, X., Xu, X., Rekik, I., Cui, Z., Ouyang, X. (eds) Machine Learning in Medical Imaging. MLMI 2023. Lecture Notes in Computer Science, vol 14348. Springer, Cham. https://doi.org/10.1007/978-3-031-45673-2_21

  • DOI: https://doi.org/10.1007/978-3-031-45673-2_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-45672-5

  • Online ISBN: 978-3-031-45673-2

  • eBook Packages: Computer Science, Computer Science (R0)
