Abstract
With the development of convolutional neural networks, image style transfer has drawn increasing attention. However, most existing approaches adopt a global feature transformation to transfer style patterns into content images (e.g., AdaIN and WCT). Such a design usually destroys the spatial information of the input images and fails to transfer fine-grained style patterns to the stylized results. To address this problem, we propose a novel STyle TRansformer (STTR) network that breaks both content and style images into visual tokens to achieve a fine-grained style transformation. Specifically, two attention mechanisms are adopted in our STTR. We first use self-attention to encode content and style tokens such that similar tokens can be grouped and learned together. We then adopt cross-attention between content and style tokens to encourage fine-grained style transformation. To compare STTR with existing approaches, we conduct user studies on Amazon Mechanical Turk (AMT) involving 50 human subjects and 1,000 votes in total. Extensive evaluations demonstrate the effectiveness and efficiency of the proposed STTR in generating visually pleasing style transfer results (code is available at https://github.com/researchmm/STTR).
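The pipeline described in the abstract (break content and style features into visual tokens, encode each token set with self-attention, then fuse them with cross-attention) can be illustrated with a minimal PyTorch sketch. The sketch below follows only the abstract's description; the module name, token dimension, head count, and flatten-based tokenization are illustrative assumptions, not the authors' STTR implementation (see the linked repository for the official code).

import torch
import torch.nn as nn

# Minimal sketch of a token-based style transformer as described in the
# abstract; hyperparameters and tokenization here are assumptions, not the
# official STTR code.
class StyleTransformerSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # Self-attention encoders: group and jointly encode similar tokens.
        self.content_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.style_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention: each content token queries the style tokens,
        # which enables a fine-grained (per-token) style transformation.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    @staticmethod
    def to_tokens(feat):
        # Flatten a CNN feature map (B, C, H, W) into visual tokens (B, H*W, C).
        return feat.flatten(2).transpose(1, 2)

    def forward(self, content_feat, style_feat):
        c_tok = self.to_tokens(content_feat)  # content tokens
        s_tok = self.to_tokens(style_feat)    # style tokens
        c_tok, _ = self.content_self_attn(c_tok, c_tok, c_tok)
        s_tok, _ = self.style_self_attn(s_tok, s_tok, s_tok)
        out, _ = self.cross_attn(query=c_tok, key=s_tok, value=s_tok)
        b, c, h, w = content_feat.shape
        return out.transpose(1, 2).reshape(b, c, h, w)  # back to a feature map

# Example usage with random VGG-like feature maps (shapes are illustrative).
content = torch.randn(1, 256, 32, 32)
style = torch.randn(1, 256, 64, 64)
stylized = StyleTransformerSketch()(content, style)  # -> (1, 256, 32, 32)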
References
Date, P., Ganesan, A., Oates, T.: Fashioning with networks: neural style transfer to design clothes. In: KDD ML4Fashion workshop, vol. 2 (2017)
Chen, D., Liao, J., Yuan, L., Yu, N., Hua, G.: Coherent online video style transfer. In: ICCV, pp. 1105–1114 (2017)
Zhang, W., Cao, C., Chen, S., Liu, J., Tang, X.: Style transfer via image component analysis. TMM 15, 1594–1601 (2013)
Liu, J., Yang, W., Sun, X., Zeng, W.: Photo stylistic brush: robust style transfer via superpixel-based bipartite graph. TMM 20, 1724–1737 (2017)
Virtusio, J.J., Ople, J.J.M., Tan, D.S., Tanveer, M., Kumar, N., Hua, K.L.: Neural style palette: a multimodal and interactive style transfer from a single style image. TMM 23, 2245–2258 (2021)
Matsuo, S., Shimoda, W., Yanai, K.: Partial style transfer using weakly supervised semantic segmentation. In: ICME Workshops, pp. 267–272. IEEE (2017)
Kim, B.K., Kim, G., Lee, S.Y.: Style-controlled synthesis of clothing segments for fashion image manipulation. TMM 22, 298–310 (2019)
Liu, Y., Chen, W., Liu, L., Lew, M.S.: SwapGAN: a multistage generative approach for person-to-person fashion style transfer. TMM 21, 2209–2222 (2019)
Castillo, C., De, S., Han, X., Singh, B., Yadav, A.K., Goldstein, T.: Son of Zorn’s lemma: targeted style transfer using instance-aware semantic segmentation. In: ICASSP, pp. 1348–1352. IEEE (2017)
Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: CVPR, pp. 2414–2423 (2016)
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_43
Li, C., Wand, M.: Precomputed real-time texture synthesis with Markovian generative adversarial networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 702–716. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_43
Ulyanov, D., Lebedev, V., Vedaldi, A., Lempitsky, V.S.: Texture networks: feed-forward synthesis of textures and stylized images. In: ICML, p. 4 (2016)
Dumoulin, V., Shlens, J., Kudlur, M.: A learned representation for artistic style. arXiv preprint arXiv:1610.07629 (2016)
Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G.: StyleBank: an explicit representation for neural image style transfer. In: CVPR, pp. 1897–1906 (2017)
Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Diversified texture synthesis with feed-forward networks. In: CVPR, pp. 3920–3928 (2017)
Zhang, H., Dana, K.: Multi-style generative network for real-time transfer. In: Leal-Taixé, L., Roth, S. (eds.) ECCV 2018. LNCS, vol. 11132, pp. 349–365. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11018-5_32
Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV, pp. 1501–1510 (2017)
Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H.: Universal style transfer via feature transforms. arXiv preprint arXiv:1705.08086 (2017)
Kitov, V., Kozlovtsev, K., Mishustina, M.: Depth-aware arbitrary style transfer using instance normalization. arXiv preprint arXiv:1906.01123 (2019)
Hu, Z., Jia, J., Liu, B., Bu, Y., Fu, J.: Aesthetic-aware image style transfer. In: ACM MM, pp. 3320–3329 (2020)
Deng, Y., Tang, F., Pan, X., Dong, W., Ma, C., Xu, C.: StyTr\(^2\): unbiased image style transfer with transformers. In: CVPR (2021)
Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
Mahendran, A., Vedaldi, A.: Understanding deep image representations by inverting them. In: CVPR, pp. 5188–5196 (2015)
Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576 (2015)
Chen, T.Q., Schmidt, M.: Fast patch-based style transfer of arbitrary style. arXiv preprint arXiv:1612.04337 (2016)
An, J., Huang, S., Song, Y., Dou, D., Liu, W., Luo, J.: ArtFlow: unbiased image style transfer via reversible neural flows. In: CVPR, pp. 862–871 (2021)
Deng, Y., Tang, F., Dong, W., Huang, H., Ma, C., Xu, C.: Arbitrary video style transfer via multi-channel correlation. In: AAAI, vol. 1, pp. 1210–1217 (2021)
Park, D.Y., Lee, K.H.: Arbitrary style transfer with style-attentional networks. In: CVPR, pp. 5880–5888 (2019)
Liu, S., et al.: AdaAttN: revisit attention mechanism in arbitrary neural style transfer. In: ICCV, pp. 6649–6658 (2021)
Hong, K., Jeon, S., Yang, H., Fu, J., Byun, H.: Domain-aware universal style transfer. In: ICCV (2021)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Beal, J., Kim, E., Tzeng, E., Park, D.H., Zhai, A., Kislyuk, D.: Toward transformer-based object detection. arXiv preprint arXiv:2012.09958 (2020)
Pan, X., Xia, Z., Song, S., Li, L.E., Huang, G.: 3D object detection with pointformer. arXiv preprint arXiv:2012.11409 (2020)
Yuan, Z., Song, X., Bai, L., Zhou, W., Wang, Z., Ouyang, W.: Temporal-channel transformer for 3D lidar-based video object detection in autonomous driving. arXiv preprint arXiv:2011.13628 (2020)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)
Wang, H., Zhu, Y., Adam, H., Yuille, A., Chen, L.C.: MaX-DeepLab: end-to-end panoptic segmentation with mask transformers. arXiv preprint arXiv:2012.00759 (2020)
Wang, Y., et al.: End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503 (2020)
Zheng, S., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2012.15840 (2020)
Huang, L., Tan, J., Liu, J., Yuan, J.: Hand-transformer: non-autoregressive structured modeling for 3D hand pose estimation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12370, pp. 17–33. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58595-2_2
Huang, L., Tan, J., Meng, J., Liu, J., Yuan, J.: HOT-net: non-autoregressive transformer for 3D hand-object pose estimation. In: ACM MM, pp. 3136–3145 (2020)
Lin, K., Wang, L., Liu, Z.: End-to-end human pose and mesh reconstruction with transformer. arXiv preprint arXiv:2012.09760 (2020)
Yang, S., Quan, Z., Nie, M., Yang, W.: TransPose: towards explainable human pose estimation by transformer. arXiv preprint arXiv:2012.14214 (2020)
Chen, M., et al.: Generative pretraining from pixels. In: ICML, pp. 1691–1703. PMLR (2020)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019)
Zeng, Y., Yang, H., Chao, H., Wang, J., Fu, J.: Improving visual quality of image synthesis by a token-based generator with transformers. In: NeurIPS (2021)
Yang, F., Yang, H., Fu, J., Lu, H., Guo, B.: Learning texture transformer network for image super-resolution. In: CVPR, pp. 5791–5800 (2020)
Liu, C., Yang, H., Fu, J., Qian, X.: Learning trajectory-aware transformer for video super-resolution. In: CVPR (2022)
Qiu, Z., Yang, H., Fu, J., Fu, D.: Learning spatiotemporal frequency-transformer for compressed video super-resolution. arXiv preprint arXiv:2208.03012 (2022)
Liu, C., Yang, H., Fu, J., Qian, X.: TTVFI: learning trajectory-aware transformer for video frame interpolation. arXiv preprint arXiv:2207.09048 (2022)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Ulyanov, D., Vedaldi, A., Lempitsky, V.: Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis. In: CVPR, pp. 6924–6932 (2017)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Nichol, K.: Painter by Numbers, WikiArt (2016)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Zheng, H., Yang, H., Fu, J., Zha, Z.J., Luo, J.: Learning conditional knowledge distillation for degraded-reference image quality assessment. In: ICCV (2021)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. TIP 13, 600–612 (2004)
Sheng, L., Lin, Z., Shao, J., Wang, X.: Avatar-net: multi-scale zero-shot style transfer by feature decoration. In: CVPR, pp. 8242–8250 (2018)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, J., Yang, H., Fu, J., Yamasaki, T., Guo, B. (2023). Fine-Grained Image Style Transfer with Visual Transformers. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13843. Springer, Cham. https://doi.org/10.1007/978-3-031-26313-2_26
DOI: https://doi.org/10.1007/978-3-031-26313-2_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26312-5
Online ISBN: 978-3-031-26313-2
eBook Packages: Computer Science, Computer Science (R0)