Abstract
Novel viewpoint image synthesis is very challenging, especially from sparse views, due to large changes in viewpoint and occlusion. Existing image-based methods fail to generate reasonable results for invisible regions, while geometry-based methods have difficulties in synthesizing detailed textures. In this paper, we propose STATE, an end-to-end deep neural network, for sparse view synthesis by learning structure and texture representations. Structure is encoded as a hybrid feature field to predict reasonable structures for invisible regions while maintaining original structures for visible regions, and texture is encoded as a deformed feature map to preserve detailed textures. We propose a hierarchical fusion scheme with intra-branch and inter-branch aggregation, in which spatio-view attention allows multi-view fusion at the feature level to adaptively select important information by regressing pixel-wise or voxel-wise confidence maps. By decoding the aggregated features, STATE is able to generate realistic images with reasonable structures and detailed textures. Experimental results demonstrate that our method achieves qualitatively and quantitatively better results than state-of-the-art methods. Our method also enables texture and structure editing applications benefiting from implicit disentanglement of structure and texture. Our code is available at http://cic.tju.edu.cn/faculty/likun/projects/STATE.
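The spatio-view attention described above fuses multi-view features by regressing per-view confidence maps and taking a confidence-weighted combination. As a minimal NumPy sketch of this idea (not the paper's implementation): pixel-wise confidence logits are normalized across views with a softmax and used to weight each view's feature map; in the actual network the logits would be regressed by a learned module, whereas here they are taken as given, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(features, logits):
    """Fuse per-view feature maps with pixel-wise confidence.

    features: (V, C, H, W) feature maps from V source views.
    logits:   (V, 1, H, W) raw confidence scores per view and pixel
              (in the paper these would be regressed by a network).
    Returns the (C, H, W) fused feature map.
    """
    weights = softmax(logits, axis=0)        # normalize across views
    return (weights * features).sum(axis=0)  # confidence-weighted sum

# Toy example: two views with 4-channel 8x8 feature maps.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 4, 8, 8))
logits = rng.normal(size=(2, 1, 8, 8))
fused = fuse_views(feats, logits)
print(fused.shape)  # (4, 8, 8)
```

The same scheme extends to voxel-wise confidence by adding a depth dimension to the feature grids; the softmax over the view axis is what lets the network adaptively favor whichever view sees a region best.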
Availability of data and materials
Our code and further results are available at http://cic.tju.edu.cn/faculty/likun/projects/STATE.
Acknowledgements
We are grateful to the Associate Editor and anonymous reviewers for their help in improving this paper.
Funding
This work was supported in part by the National Natural Science Foundation of China (62171317 and 62122058).
Author information
Contributions
Xinyi Jing: theoretical development, experiment implementation, paper writing, approving the final version of the article for publication, including references.
Qiao Feng: theoretical development, experiment implementation, paper writing, approving the final version of the article for publication, including references.
Yu-Kun Lai: guidance, theoretical development, experimental design, paper revision, approving the final version of the article for publication, including references.
Jinsong Zhang: theoretical development, experimental design, paper revision, approving the final version of the article for publication, including references.
Yuanqiang Yu: theoretical development, experiment implementation, paper writing, approving the final version of the article for publication, including references.
Kun Li: guidance, theoretical development, experimental design, paper writing, approving the final version of the article for publication, including references.
Ethics declarations
The authors have no competing interests relevant to the content of this article to declare.
Additional information
Xinyi Jing received her B.E. degree from the School of Computer Science, Shaanxi Normal University, Xi’an, China, in 2020. She is currently pursuing an M.E. degree in the College of Intelligence and Computing, Tianjin University, China. Her research interests are in computer vision and computer graphics.
Qiao Feng received his B.E. degree from the College of Intelligence and Computing, Tianjin University, in 2021. He is currently pursuing a master's degree in the College of Intelligence and Computing, Tianjin University. His research interests include machine learning and computer graphics.
Yu-Kun Lai received his bachelor's and Ph.D. degrees in computer science from Tsinghua University in 2003 and 2008, respectively. He is currently a professor in the School of Computer Science & Informatics, Cardiff University, UK. His research interests include computer graphics, geometry processing, image processing, and computer vision. He is on the editorial boards of Computer Graphics Forum and The Visual Computer.
Jinsong Zhang received his B.E. and M.E. degrees from Tianjin University in 2018. He is currently pursuing a Ph.D. degree in computer science at Tianjin University. His interests are mainly in computer vision and image synthesis.
Yuanqiang Yu received his B.E. degree from the School of Computer Science and Technology, Tiangong University, Tianjin, in 2020. He is currently pursuing an M.E. degree in the College of Intelligence and Computing, Tianjin University. His research interests are in deep reinforcement learning, transfer learning, and computer vision.
Kun Li received her B.E. degree from Beijing University of Posts and Telecommunications, Beijing, China, in 2006, and her master's and Ph.D. degrees from Tsinghua University, Beijing, in 2011. She visited the École Polytechnique Fédérale de Lausanne, Switzerland, in 2012 and 2014–2015. She is currently an associate professor in the College of Intelligence and Computing, Tianjin University. Her research interests include dynamic scene 3D reconstruction, and image and video processing.
Electronic supplementary material
Supplementary material, approximately 40.8 MB.
Rights and permissions
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Jing, X., Feng, Q., Lai, YK. et al. STATE: Learning structure and texture representations for novel view synthesis. Comp. Visual Media 9, 767–786 (2023). https://doi.org/10.1007/s41095-022-0301-9