Fine-Grained Image Style Transfer with Visual Transformers

  • Conference paper

Computer Vision – ACCV 2022 (ACCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13843)


Abstract

With the development of convolutional neural networks, image style transfer has drawn increasing attention. However, most existing approaches adopt a global feature transformation to transfer style patterns into content images (e.g., AdaIN and WCT). Such a design usually destroys the spatial information of the input images and fails to transfer fine-grained style patterns into the style transfer results. To solve this problem, we propose a novel STyle TRansformer (STTR) network which breaks both content and style images into visual tokens to achieve a fine-grained style transformation. Specifically, two attention mechanisms are adopted in our STTR. We first use self-attention to encode content and style tokens so that similar tokens can be grouped and learned together. We then adopt cross-attention between content and style tokens to encourage fine-grained style transformation. To compare STTR with existing approaches, we conduct user studies on Amazon Mechanical Turk (AMT), carried out with 50 human subjects and 1,000 votes in total. Extensive evaluations demonstrate the effectiveness and efficiency of the proposed STTR in generating visually pleasing style transfer results. (Code is available at https://github.com/researchmm/STTR.)
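The token-level pipeline described in the abstract (tokenize content and style images, apply self-attention within each token set, then cross-attention from content tokens to style tokens) can be illustrated with a minimal sketch. The PyTorch snippet below is a hypothetical illustration under assumed settings (256-dimensional patch tokens, 8 attention heads, a single block named TokenStyleTransferBlock); it is not the authors' STTR implementation, which is available at the repository linked above.

```python
import torch
import torch.nn as nn

class TokenStyleTransferBlock(nn.Module):
    """Illustrative sketch only: self-attention over content and style tokens,
    followed by cross-attention from content (queries) to style (keys/values).
    Names and dimensions are assumptions, not the published STTR code."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.content_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.style_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_c = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_out = nn.LayerNorm(dim)

    def forward(self, content_tokens, style_tokens):
        # Self-attention lets similar tokens within each domain be grouped together.
        c, _ = self.content_self_attn(content_tokens, content_tokens, content_tokens)
        c = self.norm_c(content_tokens + c)
        s, _ = self.style_self_attn(style_tokens, style_tokens, style_tokens)
        s = self.norm_s(style_tokens + s)
        # Cross-attention: each content token queries the style tokens, so it can
        # retrieve fine-grained style patterns instead of a single global statistic.
        out, _ = self.cross_attn(c, s, s)
        return self.norm_out(c + out)

# Toy usage: 196 tokens per image (e.g., a 14x14 grid of patch embeddings).
content = torch.randn(1, 196, 256)
style = torch.randn(1, 196, 256)
stylized_tokens = TokenStyleTransferBlock()(content, style)
print(stylized_tokens.shape)  # torch.Size([1, 196, 256])
```

In practice such stylized tokens would be folded back into a feature map and passed through a decoder to produce the output image; this sketch only shows the token-level attention step.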


References

  1. Date, P., Ganesan, A., Oates, T.: Fashioning with networks: neural style transfer to design clothes. In: KDD ML4Fashion workshop, vol. 2 (2017)

  2. Chen, D., Liao, J., Yuan, L., Yu, N., Hua, G.: Coherent online video style transfer. In: ICCV, pp. 1105–1114 (2017)

  3. Zhang, W., Cao, C., Chen, S., Liu, J., Tang, X.: Style transfer via image component analysis. TMM 15, 1594–1601 (2013)

  4. Liu, J., Yang, W., Sun, X., Zeng, W.: Photo stylistic brush: robust style transfer via superpixel-based bipartite graph. TMM 20, 1724–1737 (2017)

  5. Virtusio, J.J., Ople, J.J.M., Tan, D.S., Tanveer, M., Kumar, N., Hua, K.L.: Neural style palette: a multimodal and interactive style transfer from a single style image. TMM 23, 2245–2258 (2021)

  6. Matsuo, S., Shimoda, W., Yanai, K.: Partial style transfer using weakly supervised semantic segmentation. In: ICME Workshops, pp. 267–272. IEEE (2017)

  7. Kim, B.K., Kim, G., Lee, S.Y.: Style-controlled synthesis of clothing segments for fashion image manipulation. TMM 22, 298–310 (2019)

  8. Liu, Y., Chen, W., Liu, L., Lew, M.S.: SwapGAN: a multistage generative approach for person-to-person fashion style transfer. TMM 21, 2209–2222 (2019)

  9. Castillo, C., De, S., Han, X., Singh, B., Yadav, A.K., Goldstein, T.: Son of Zorn’s lemma: targeted style transfer using instance-aware semantic segmentation. In: ICASSP, pp. 1348–1352. IEEE (2017)

  10. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: CVPR, pp. 2414–2423 (2016)

  11. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_43

  12. Li, C., Wand, M.: Precomputed real-time texture synthesis with Markovian generative adversarial networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 702–716. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_43

  13. Ulyanov, D., Lebedev, V., Vedaldi, A., Lempitsky, V.S.: Texture networks: feed-forward synthesis of textures and stylized images. In: ICML, p. 4 (2016)

  14. Dumoulin, V., Shlens, J., Kudlur, M.: A learned representation for artistic style. arXiv preprint arXiv:1610.07629 (2016)

  15. Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G.: Stylebank: an explicit representation for neural image style transfer. In: CVPR, pp. 1897–1906 (2017)

  16. Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Diversified texture synthesis with feed-forward networks. In: CVPR, pp. 3920–3928 (2017)

  17. Zhang, H., Dana, K.: Multi-style generative network for real-time transfer. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11132, pp. 349–365. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11018-5_32

  18. Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV, pp. 1501–1510 (2017)

  19. Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Universal style transfer via feature transforms. arXiv preprint arXiv:1705.08086 (2017)

  20. Kitov, V., Kozlovtsev, K., Mishustina, M.: Depth-aware arbitrary style transfer using instance normalization. arXiv preprint arXiv:1906.01123 (2019)

  21. Hu, Z., Jia, J., Liu, B., Bu, Y., Fu, J.: Aesthetic-aware image style transfer. In: ACM MM, pp. 3320–3329 (2020)

  22. Deng, Y., Tang, F., Pan, X., Dong, W., Ma, C., Xu, C.: StyTr\(^2\): unbiased image style transfer with transformers. In: CVPR (2021)

  23. Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)

  24. Mahendran, A., Vedaldi, A.: Understanding deep image representations by inverting them. In: CVPR, pp. 5188–5196 (2015)

  25. Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015)

  26. Chen, T.Q., Schmidt, M.: Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337 (2016)

  27. An, J., Huang, S., Song, Y., Dou, D., Liu, W., Luo, J.: Artflow: unbiased image style transfer via reversible neural flows. In: CVPR, pp. 862–871 (2021)

  28. Deng, Y., Tang, F., Dong, W., Huang, H., Ma, C., Xu, C.: Arbitrary video style transfer via multi-channel correlation. In: AAAI, vol. 1, pp. 1210–1217 (2021)

  29. Park, D.Y., Lee, K.H.: Arbitrary style transfer with style-attentional networks. In: CVPR, pp. 5880–5888 (2019)

  30. Liu, S., et al.: AdaAttN: revisit attention mechanism in arbitrary neural style transfer. In: ICCV, pp. 6649–6658 (2021)

  31. Hong, K., Jeon, S., Yang, H., Fu, J., Byun, H.: Domain-aware universal style transfer. In: ICCV (2021)

  32. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

  33. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13

  34. Beal, J., Kim, E., Tzeng, E., Park, D.H., Zhai, A., Kislyuk, D.: Toward transformer-based object detection. arXiv preprint arXiv:2012.09958 (2020)

  35. Pan, X., Xia, Z., Song, S., Li, L.E., Huang, G.: 3D object detection with pointformer. arXiv preprint arXiv:2012.11409 (2020)

  36. Yuan, Z., Song, X., Bai, L., Zhou, W., Wang, Z., Ouyang, W.: Temporal-channel transformer for 3D lidar-based video object detection in autonomous driving. arXiv preprint arXiv:2011.13628 (2020)

  37. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)

  38. Wang, H., Zhu, Y., Adam, H., Yuille, A., Chen, L.C.: MaX-DeepLab: end-to-end panoptic segmentation with mask transformers. arXiv preprint arXiv:2012.00759 (2020)

  39. Wang, Y., et al.: End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503 (2020)

  40. Zheng, S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2012.15840 (2020)

  41. Huang, L., Tan, J., Liu, J., Yuan, J.: Hand-transformer: non-autoregressive structured modeling for 3D hand pose estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12370, pp. 17–33. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58595-2_2

  42. Huang, L., Tan, J., Meng, J., Liu, J., Yuan, J.: HOT-net: non-autoregressive transformer for 3D hand-object pose estimation. In: ACM MM, pp. 3136–3145 (2020)

  43. Lin, K., Wang, L., Liu, Z.: End-to-end human pose and mesh reconstruction with transformer. arXiv preprint arXiv:2012.09760 (2020)

  44. Yang, S., Quan, Z., Nie, M., Yang, W.: TransPose: towards explainable human pose estimation by transformer. arXiv preprint arXiv:2012.14214 (2020)

  45. Chen, M., et al.: Generative pretraining from pixels. In: ICML, pp. 1691–1703. PMLR (2020)

  46. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019)

  47. Zeng, Y., Yang, H., Chao, H., Wang, J., Fu, J.: Improving visual quality of image synthesis by a token-based generator with transformers. In: NeurIPS (2021)

  48. Yang, F., Yang, H., Fu, J., Lu, H., Guo, B.: Learning texture transformer network for image super-resolution. In: CVPR, pp. 5791–5800 (2020)

  49. Liu, C., Yang, H., Fu, J., Qian, X.: Learning trajectory-aware transformer for video super-resolution. In: CVPR (2022)

  50. Qiu, Z., Yang, H., Fu, J., Fu, D.: Learning spatiotemporal frequency-transformer for compressed video super-resolution. arXiv preprint arXiv:2208.03012 (2022)

  51. Liu, C., Yang, H., Fu, J., Qian, X.: TTVFI: learning trajectory-aware transformer for video frame interpolation. arXiv preprint arXiv:2207.09048 (2022)

  52. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)

  53. Ulyanov, D., Vedaldi, A., Lempitsky, V.: Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis. In: CVPR, pp. 6924–6932 (2017)

  54. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  55. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  56. Nichol, K.: Painter by numbers, wikiart (2016)

  57. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  58. Zheng, H., Yang, H., Fu, J., Zha, Z.J., Luo, J.: Learning conditional knowledge distillation for degraded-reference image quality assessment. In: ICCV (2021)

  59. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. TIP 13, 600–612 (2004)

  60. Sheng, L., Lin, Z., Shao, J., Wang, X.: Avatar-net: multi-scale zero-shot style transfer by feature decoration. In: CVPR, pp. 8242–8250 (2018)

Author information


Corresponding author

Correspondence to Huan Yang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 10352 KB)

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, J., Yang, H., Fu, J., Yamasaki, T., Guo, B. (2023). Fine-Grained Image Style Transfer with Visual Transformers. In: Wang, L., Gall, J., Chin, T.-J., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13843. Springer, Cham. https://doi.org/10.1007/978-3-031-26313-2_26

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-26313-2_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-26312-5

  • Online ISBN: 978-3-031-26313-2

  • eBook Packages: Computer Science, Computer Science (R0)
