Abstract
With the development of convolutional neural networks, image style transfer has drawn increasing attention. However, most existing approaches adopt a global feature transformation to transfer style patterns into content images (e.g., AdaIN and WCT). Such a design usually destroys the spatial information of the input images and fails to transfer fine-grained style patterns to the stylized results. To address this problem, we propose a novel STyle TRansformer (STTR) network that breaks both content and style images into visual tokens to achieve a fine-grained style transformation. Specifically, two attention mechanisms are adopted in our STTR. We first use self-attention to encode content and style tokens such that similar tokens can be grouped and learned together. We then adopt cross-attention between content and style tokens to encourage fine-grained style transformation. To compare STTR with existing approaches, we conduct user studies on Amazon Mechanical Turk (AMT) involving 50 human subjects and 1,000 votes in total. Extensive evaluations demonstrate the effectiveness and efficiency of the proposed STTR in generating visually pleasing style transfer results (code is available at https://github.com/researchmm/STTR).
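The pipeline described in the abstract (break content and style features into visual tokens, encode each token set with self-attention, then fuse them with cross-attention) can be illustrated with a minimal PyTorch sketch. The sketch below follows only the abstract's description; the module name, token dimension, head count, and flatten-based tokenization are illustrative assumptions, not the authors' STTR implementation (see the linked repository for the official code).

import torch
import torch.nn as nn

# Minimal sketch of a token-based style transformer as described in the
# abstract; hyperparameters and tokenization here are assumptions, not the
# official STTR code.
class StyleTransformerSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Self-attention encoders: group and jointly encode similar tokens.
        self.content_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.style_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention: each content token queries the style tokens,
        # which enables a fine-grained (per-token) style transformation.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    @staticmethod
    def to_tokens(feat):
        # Flatten a CNN feature map (B, C, H, W) into visual tokens (B, H*W, C).
        return feat.flatten(2).transpose(1, 2)

    def forward(self, content_feat, style_feat):
        c_tok = self.to_tokens(content_feat)  # content tokens
        s_tok = self.to_tokens(style_feat)    # style tokens
        c_tok, _ = self.content_self_attn(c_tok, c_tok, c_tok)
        s_tok, _ = self.style_self_attn(s_tok, s_tok, s_tok)
        out, _ = self.cross_attn(query=c_tok, key=s_tok, value=s_tok)
        b, c, h, w = content_feat.shape
        return out.transpose(1, 2).reshape(b, c, h, w)  # back to a feature map

# Example usage with random VGG-like feature maps (shapes are illustrative).
content = torch.randn(1, 256, 32, 32)
style = torch.randn(1, 256, 64, 64)
stylized = StyleTransformerSketch()(content, style)  # -> (1, 256, 32, 32)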
References
Date, P., Ganesan, A., Oates, T.: Fashioning with networks: neural style transfer to design clothes. In: KDD ML4Fashion workshop, vol. 2 (2017)
Chen, D., Liao, J., Yuan, L., Yu, N., Hua, G.: Coherent online video style transfer. In: ICCV, pp. 1105–1114 (2017)
Zhang, W., Cao, C., Chen, S., Liu, J., Tang, X.: Style transfer via image component analysis. TMM 15, 1594–1601 (2013)
Liu, J., Yang, W., Sun, X., Zeng, W.: Photo stylistic brush: robust style transfer via superpixel-based bipartite graph. TMM 20, 1724–1737 (2017)
Virtusio, J.J., Ople, J.J.M., Tan, D.S., Tanveer, M., Kumar, N., Hua, K.L.: Neural style palette: a multimodal and interactive style transfer from a single style image. TMM 23, 2245–2258 (2021)
Matsuo, S., Shimoda, W., Yanai, K.: Partial style transfer using weakly supervised semantic segmentation. In: ICME Workshops, pp. 267–272. IEEE (2017)
Kim, B.K., Kim, G., Lee, S.Y.: Style-controlled synthesis of clothing segments for fashion image manipulation. TMM 22, 298–310 (2019)
Liu, Y., Chen, W., Liu, L., Lew, M.S.: SwapGAN: a multistage generative approach for person-to-person fashion style transfer. TMM 21, 2209–2222 (2019)
Castillo, C., De, S., Han, X., Singh, B., Yadav, A.K., Goldstein, T.: Son of Zorn’s lemma: targeted style transfer using instance-aware semantic segmentation. In: ICASSP, pp. 1348–1352. IEEE (2017)
Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: CVPR, pp. 2414–2423 (2016)
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_43
Li, C., Wand, M.: Precomputed real-time texture synthesis with Markovian generative adversarial networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 702–716. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_43
Ulyanov, D., Lebedev, V., Vedaldi, A., Lempitsky, V.S.: Texture networks: feed-forward synthesis of textures and stylized images. In: ICML, p. 4 (2016)
Dumoulin, V., Shlens, J., Kudlur, M.: A learned representation for artistic style. arXiv preprint arXiv:1610.07629 (2016)
Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G.: StyleBank: an explicit representation for neural image style transfer. In: CVPR, pp. 1897–1906 (2017)
Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Diversified texture synthesis with feed-forward networks. In: CVPR, pp. 3920–3928 (2017)
Zhang, H., Dana, K.: Multi-style generative network for real-time transfer. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11132, pp. 349–365. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11018-5_32
Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV, pp. 1501–1510 (2017)
Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Universal style transfer via feature transforms. arXiv preprint arXiv:1705.08086 (2017)
Kitov, V., Kozlovtsev, K., Mishustina, M.: Depth-aware arbitrary style transfer using instance normalization. arXiv preprint arXiv:1906.01123 (2019)
Hu, Z., Jia, J., Liu, B., Bu, Y., Fu, J.: Aesthetic-aware image style transfer. In: ACM MM, pp. 3320–3329 (2020)
Deng, Y., Tang, F., Pan, X., Dong, W., Ma, C., Xu, C.: StyTr\(^2\): unbiased image style transfer with transformers. In: CVPR (2021)
Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
Mahendran, A., Vedaldi, A.: Understanding deep image representations by inverting them. In: CVPR, pp. 5188–5196 (2015)
Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015)
Chen, T.Q., Schmidt, M.: Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337 (2016)
An, J., Huang, S., Song, Y., Dou, D., Liu, W., Luo, J.: ArtFlow: unbiased image style transfer via reversible neural flows. In: CVPR, pp. 862–871 (2021)
Deng, Y., Tang, F., Dong, W., Huang, H., Ma, C., Xu, C.: Arbitrary video style transfer via multi-channel correlation. In: AAAI, vol. 1, pp. 1210–1217 (2021)
Park, D.Y., Lee, K.H.: Arbitrary style transfer with style-attentional networks. In: CVPR, pp. 5880–5888 (2019)
Liu, S., et al.: AdaAttN: revisit attention mechanism in arbitrary neural style transfer. In: ICCV, pp. 6649–6658 (2021)
Hong, K., Jeon, S., Yang, H., Fu, J., Byun, H.: Domain-aware universal style transfer. In: ICCV (2021)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Beal, J., Kim, E., Tzeng, E., Park, D.H., Zhai, A., Kislyuk, D.: Toward transformer-based object detection. arXiv preprint arXiv:2012.09958 (2020)
Pan, X., Xia, Z., Song, S., Li, L.E., Huang, G.: 3D object detection with pointformer. arXiv preprint arXiv:2012.11409 (2020)
Yuan, Z., Song, X., Bai, L., Zhou, W., Wang, Z., Ouyang, W.: Temporal-channel transformer for 3D lidar-based video object detection in autonomous driving. arXiv preprint arXiv:2011.13628 (2020)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)
Wang, H., Zhu, Y., Adam, H., Yuille, A., Chen, L.C.: MaX-DeepLab: end-to-end panoptic segmentation with mask transformers. arXiv preprint arXiv:2012.00759 (2020)
Wang, Y., et al.: End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503 (2020)
Zheng, S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2012.15840 (2020)
Huang, L., Tan, J., Liu, J., Yuan, J.: Hand-transformer: non-autoregressive structured modeling for 3D hand pose estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12370, pp. 17–33. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58595-2_2
Huang, L., Tan, J., Meng, J., Liu, J., Yuan, J.: HOT-net: non-autoregressive transformer for 3D hand-object pose estimation. In: ACM MM, pp. 3136–3145 (2020)
Lin, K., Wang, L., Liu, Z.: End-to-end human pose and mesh reconstruction with transformer. arXiv preprint arXiv:2012.09760 (2020)
Yang, S., Quan, Z., Nie, M., Yang, W.: TransPose: towards explainable human pose estimation by transformer. arXiv preprint arXiv:2012.14214 (2020)
Chen, M., et al.: Generative pretraining from pixels. In: ICML, pp. 1691–1703. PMLR (2020)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019)
Zeng, Y., Yang, H., Chao, H., Wang, J., Fu, J.: Improving visual quality of image synthesis by a token-based generator with transformers. In: NeurIPS (2021)
Yang, F., Yang, H., Fu, J., Lu, H., Guo, B.: Learning texture transformer network for image super-resolution. In: CVPR, pp. 5791–5800 (2020)
Liu, C., Yang, H., Fu, J., Qian, X.: Learning trajectory-aware transformer for video super-resolution. In: CVPR (2022)
Qiu, Z., Yang, H., Fu, J., Fu, D.: Learning spatiotemporal frequency-transformer for compressed video super-resolution. arXiv preprint arXiv:2208.03012 (2022)
Liu, C., Yang, H., Fu, J., Qian, X.: TTVFI: learning trajectory-aware transformer for video frame interpolation. arXiv preprint arXiv:2207.09048 (2022)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Ulyanov, D., Vedaldi, A., Lempitsky, V.: Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis. In: CVPR, pp. 6924–6932 (2017)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Nichol, K.: Painter by Numbers, WikiArt (2016)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Zheng, H., Yang, H., Fu, J., Zha, Z.J., Luo, J.: Learning conditional knowledge distillation for degraded-reference image quality assessment. In: ICCV (2021)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. TIP 13, 600–612 (2004)
Sheng, L., Lin, Z., Shao, J., Wang, X.: Avatar-net: multi-scale zero-shot style transfer by feature decoration. In: CVPR, pp. 8242–8250 (2018)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, J., Yang, H., Fu, J., Yamasaki, T., Guo, B. (2023). Fine-Grained Image Style Transfer with Visual Transformers. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13843. Springer, Cham. https://doi.org/10.1007/978-3-031-26313-2_26
DOI: https://doi.org/10.1007/978-3-031-26313-2_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26312-5
Online ISBN: 978-3-031-26313-2
eBook Packages: Computer Science, Computer Science (R0)