Abstract
We present a novel framework for audio-guided localized image stylization. Sound often conveys the specific context of a scene and is closely tied to a particular object or region within it. However, existing image stylization methods stylize the entire image from an image or text input; stylizing only a particular part of the image from audio is natural but challenging. This work proposes a framework in which a user provides one audio input to localize the target in the input image and another to locally stylize the target object or scene. We first produce a fine-grained localization map with an audio-visual localization network that leverages the CLIP embedding space. We then use an implicit neural representation (INR), guided by the predicted localization map, to stylize the target according to the sound: the INR manipulates local pixel values so that they are semantically consistent with the given audio input. Our experiments show that the proposed framework outperforms other audio-guided stylization methods. Moreover, our method constructs concise localization maps and naturally manipulates the target object or scene in accordance with the given audio input.
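The two-stage pipeline described in the abstract can be summarized in a short sketch. The Python/PyTorch code below is an illustrative approximation, not the authors' implementation: it assumes pretrained encoders that place image patches and audio clips in a shared CLIP-like embedding space (for example, an AudioCLIP-style audio encoder), and the names patch_feats, audio_feat, and clip_image_embed, as well as all hyperparameters, are hypothetical placeholders.

```python
# Minimal sketch (assumptions noted above; not the authors' released code).
import torch
import torch.nn.functional as F

def localization_map(patch_feats, audio_feat):
    """Per-patch cosine similarity between CLIP-space patch features (N, D)
    and an audio embedding (D,), normalized to [0, 1]. The caller reshapes
    the result to the patch grid and upsamples it to a per-pixel soft mask."""
    sims = F.cosine_similarity(patch_feats, audio_feat[None, :], dim=-1)
    return (sims - sims.min()) / (sims.max() - sims.min() + 1e-8)

class FourierINR(torch.nn.Module):
    """Implicit neural representation: pixel coordinates -> RGB residual.
    Random Fourier features let the small MLP fit high-frequency local edits."""
    def __init__(self, n_freqs=64, hidden=256, sigma=10.0):
        super().__init__()
        self.register_buffer("B", torch.randn(2, n_freqs) * sigma)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2 * n_freqs, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 3), torch.nn.Tanh(),
        )

    def forward(self, coords):                    # coords: (N, 2) in [0, 1]
        proj = 2 * torch.pi * coords @ self.B     # (N, n_freqs)
        feats = torch.cat([proj.sin(), proj.cos()], dim=-1)
        return self.mlp(feats)                    # (N, 3) residual colors

def stylize(image, mask, audio_feat, clip_image_embed, steps=200, lr=1e-3):
    """Optimize the INR so masked pixels move toward the audio embedding in
    CLIP space while unmasked pixels stay fixed. image: (3, H, W) in [0, 1];
    mask: (H, W) soft localization map; clip_image_embed: hypothetical
    differentiable callable returning a CLIP image embedding."""
    _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.linspace(0, 1, h),
                            torch.linspace(0, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    inr = FourierINR()
    opt = torch.optim.Adam(inr.parameters(), lr=lr)
    for _ in range(steps):
        residual = inr(coords).T.reshape(3, h, w)
        out = (image + mask * residual).clamp(0, 1)   # edit only the localized region
        loss = 1 - F.cosine_similarity(clip_image_embed(out), audio_feat, dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return out.detach()
```

The design choice mirrored here is that the edit is represented implicitly, as a coordinate-based MLP with Fourier features, and is confined by the soft localization mask, so the CLIP-space optimization can only alter pixels that the map attributes to the sounding object.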
Acknowledgements
This research was supported by the Culture, Sports and Tourism R&D Program through the Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2022∼ (4D Content Generation and Copyright Protection with Artificial Intelligence, R2022020068, 30%; Research on neural watermark technology for copyright protection of generative AI 3D content, RS-2024-00348469, 40%; International Collaborative Research and Global Talent Development for the Development of Copyright Management and Protection Technologies for Generative AI, RS-2024-00345025, 10%) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (RS-2019-II190079, 10%; No. 2017-0-00417, 10%).
Author information
Contributions
Conceptualization: Seung Hyun Lee, Wonmin Byeon, Jinkyu Kim, Sangpil Kim; Methodology: Seung Hyun Lee, Wonmin Byeon, Sang Ho Yoon, Jinkyu Kim, Sangpil Kim; Formal analysis and investigation: Wonmin Byeon, Jinkyu Kim, Sung-Hee Hong, Sangpil Kim; Visualization: Sieun Kim, Gyeongrok Oh; Experiment: Seung Hyun Lee, Sieun Kim, Sumin In, Hyeongcheol Park.
Ethics declarations
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Seung Hyun Lee received his B.E. degree in computer science from the University of Seoul in 2022. He is currently a student of artificial intelligence at Korea University. His research interests include audio-visual representation, image editing, and generation by deep learning.
Sieun Kim received her B.S. degree in mathematics from Sookmyung Women’s University in 2023. She is currently a graduate student in the Department of Artificial Intelligence, Korea University. Her research interests include multimodal representation and video editing.
Wonmin Byeon is a senior research scientist at NVIDIA Research in Santa Clara, USA. Prior to joining NVIDIA, she was a post-doctoral researcher at ETH Zurich and IDSIA, Switzerland, working with Jurgen Schmidhuber and Petros Koumoutsakos. She received her Ph.D. degree in computer science from the Technical University of Kaiserslautern, Germany. Her research interests include spatio-temporal learning, continuous learning, and multi-dimensional data understanding, especially with recurrent neural networks (RNNs).
Gyeongrok Oh received his B.S. degree in computer science and engineering from Inha University in 2022. He is currently a graduate student in the Department of Artificial Intelligence, Korea University. His research interests include multi-modal representation and image restoration.
Sumin In received her B.S. degree in AI convergence from Soongsil University in 2024. She is currently a research intern in the Department of Artificial Intelligence, Korea University. Her research interests include multi-modal representation and generative models.
Hyeongcheol Park received his B.S. degree in converged electronics engineering from Sangmyung University in 2024. He is currently a research intern in the Department of Artificial Intelligence, Korea University. His research interests include multi-modal representation and generative models.
Sang Ho Yoon is an assistant professor in the Graduate School of Culture Technology at KAIST. Prior to joining KAIST, he worked at Samsung Research and the Microsoft Applied Sciences Lab as a principal engineer and research engineer. He received his Ph.D. degree in mechanical engineering from Purdue University and earned his B.S. and M.S. degrees in mechanical engineering from Carnegie Mellon University, USA. His research interests include human-computer interaction (HCI), interaction techniques, and intelligent user interfaces along with sensing and haptic technologies.
Sung-Hee Hong received his B.S. degree from the Department of Electrical Engineering, Sungkyunkwan University in 1999. He received his M.S. and Ph.D. degrees in information and communication engineering from Sungkyunkwan University in 2001 and 2016, respectively. He has been the director of the Hologram Research Center since 2019. He is interested in holography, multimedia, and computer graphics.
Jinkyu Kim is an assistant professor in the Department of Computer Science and Engineering, Korea University. He was a research scientist at Waymo (formerly the Google self-driving car project), conducting cutting-edge research on autonomous driving, in particular on outstanding challenges in planning and behavior prediction. He received his Ph.D. degree in computer science from UC Berkeley (advisor: Prof. John Canny) and was part of Berkeley AI Research and Berkeley DeepDrive, where his research focused on building explainable and advisable models that can explain their rationale, characterize their strengths and weaknesses, and convey an understanding of how they will behave in the future. He received his B.S. and M.S. degrees in electrical engineering from Korea University.
Sangpil Kim is an assistant professor in the Department of Computer Science and Engineering, Korea University. He received his Ph.D. degree in electrical and computer engineering from Purdue University and earned his B.S. degree in computer science from Korea University. His current research focuses on the intersection of computer vision and deep learning, with an emphasis on applications of multi-modal fusion for developing generative models.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
About this article
Cite this article
Lee, S.H., Kim, S., Byeon, W. et al. Audio-guided implicit neural representation for local image stylization. Comp. Visual Media (2024). https://doi.org/10.1007/s41095-024-0413-5