STATE: Learning structure and texture representations for novel view synthesis

Novel viewpoint image synthesis is very challenging, especially from sparse views, due to large changes in viewpoint and occlusion. Existing image-based methods fail to generate reasonable results for invisible regions, while geometry-based methods have difficulties in synthesizing detailed textures. In this paper, we propose STATE, an end-to-end deep neural network, for sparse view synthesis by learning structure and texture representations. Structure is encoded as a hybrid feature field to predict reasonable structures for invisible regions while maintaining original structures for visible regions, and texture is encoded as a deformed feature map to preserve detailed textures. We propose a hierarchical fusion scheme with intra-branch and inter-branch aggregation, in which spatio-view attention allows multi-view fusion at the feature level to adaptively select important information by regressing pixel-wise or voxel-wise confidence maps. By decoding the aggregated features, STATE is able to generate realistic images with reasonable structures and detailed textures. Experimental results demonstrate that our method achieves qualitatively and quantitatively better results than state-of-the-art methods. Our method also enables texture and structure editing applications benefiting from implicit disentanglement of structure and texture. Our code is available at http://cic.tju.edu.cn/faculty/likun/projects/STATE.


Introduction
Given a single image of an object, or several images from different viewpoints, novel view synthesis aims to generate a further image seen from a new viewpoint. This has a wide range of applications in virtual reality, education, and movie production. It is a very challenging problem given sparse input views due to large appearance variations and occlusion.
Existing methods for novel view synthesis can be classified as image-based or geometry-based. Image-based methods warp a source image from the source viewpoint to the target viewpoint by estimating an affine transformation [1,2] or an appearance flow field [3-5]. Flow-based methods can deal with complex deformations more flexibly than affine transformation methods. However, due to the lack of geometric information, image-based methods tend to generate unsatisfactory results for invisible regions, especially given sophisticated objects or sparse views. Geometry-based methods first model the 3D structure of the object in an explicit [6-8] or implicit [9-11] manner, and then generate the target image by rotation and projection. Explicit representations use discrete volumes, while implicit methods use continuous implicit functions. Along with neural rendering based methods [12], the latter can be trained without 3D supervision. Although geometry-based methods can ensure structural consistency and predict reasonable shapes for invisible regions, results are poor for sparse views, and they may lose texture detail if the representation has limited resolution.
It is important to find an effective way to make better use of multi-view information, especially for sparse views. Most works [10, 13-16] directly average the representations from all inputs, treating every location of every input as equally valid. However, not all input locations have a positive impact on the target image. To solve this problem, Sun et al. [4] propose a self-learned confidence method to fuse the images generated from each input at the pixel level. However, this fusion scheme requires a large amount of memory and cannot deal with the unavoidable misalignment problem.
The aforementioned methods face three challenges in synthesizing satisfactory images: (i) the coupling of shape and texture in the input images, (ii) potential uncertainties in invisible regions, and (iii) difficulty in achieving color, texture, and shape consistency.
To address these problems, in this paper, we propose an end-to-end deep neural network, STATE, for sparse view synthesis; it disentangles the input images into STructure And TExture representations to ensure both shape and texture consistency. Although our method does not explicitly control disentanglement, proper design of the two branches achieves effective disentanglement of structure and texture as we verify later through experimental results. In the structure-aware encoder, we represent structure as a hybrid feature field, which can predict reasonable structure for invisible regions. In the texture-aware encoder, we estimate an appearance flow field and warp the source image feature from the source viewpoint to the target viewpoint at the feature level. To make the best use of multiple images, we also propose spatio-view attention aggregation to adaptively fuse multi-view information at the feature level by regressing pixel-wise or voxel-wise confidence maps. The final image is delivered by decoding the aggregated feature of structure-aware representation and texture-aware representation. Our model works well for both single view and multi-view inputs. Experimental results demonstrate that our method works better than the state-of-the-art. We also validate our approach by comprehensive ablation studies. Figure 1 gives some examples of our results.
Our main contributions are, in summary:
• STATE, an end-to-end deep neural network that disentangles sparse input images into two neural embedding representations, of structure and texture; it predicts plausible content for regions invisible in the source images while also recovering detailed textures,
• a hierarchical fusion scheme with intra-branch and inter-branch aggregation, in which spatio-view attention provides multi-view fusion at the feature level to adaptively select important information by regressing pixel-wise or voxel-wise confidence maps, and
• a model which can realize texture or structure swapping without retraining, due to effective disentanglement of structures and textures; our model can be easily and robustly trained with a hybrid loss, including a cosine loss, to achieve color, texture, and shape consistency, leading to state-of-the-art results.

Scope
We next review existing work on novel view synthesis for objects or humans, from a single image or multiple images; methods can be image-based or geometry-based. The former maintain appearance consistency by transferring pixels from the source images to the target image, while the latter maintain structural consistency by reconstructing the 3D object to render the novel image.

Image-based novel view synthesis
Image-based novel view synthesis methods directly generate pixels or move pixels from the source images to the target image. Tatarchenko et al. [1] and Yang et al. [2] generate pixels with affine transformation. Instead of learning to synthesize pixels from scratch, Zhou et al. [5] prove that the visual appearances of the same instance from different viewpoints are highly correlated, and such correlation can be explicitly learned to predict appearance flow [3,4,17], i.e., 2D coordinate vectors specifying which pixels in the input view can be used to reconstruct the target view. To use features at different scales, Yin et al. [18] estimate appearance flows with different resolutions to warp the source view to the target view. Controlled by the appearance flow, bilinear sampling is used to move pixels from the source images to the target image [4,5,17,19]. To avoid the poor gradient propagation of bilinear sampling, Ren et al. [3] propose a content-aware sampling method adopting a local attention mechanism. Most flow-based methods [4,5] warp the input images pixel-wise, which prevents the network from generating new content for invisible pixels. Warping the input images at the feature level can solve this problem [3,17,20]. Other methods synthesize invisible pixels without warping the input features. Park et al. [21] use a completion network to hallucinate the empty parts. In summary, image-based methods can generate detailed textures by moving pixels from the source images to the target image, but the results generated by the above methods lack a consistent shape and so may have artifacts along the silhouette.

Geometry-based novel view synthesis
Geometry-based novel view synthesis methods determine the 3D structure of the object in an explicit or implicit manner, and then generate the target image by rotation and projection. Approaches may be based on depth maps or 3D models (textured occupancy volumes, colored point clouds, or neural scene representations). Depth-map-based approaches [6,22,23] rely on estimated depth to transfer source pixels to the target view, while volumetric approaches use textured occupancy volumes [27,28] or RGBα-encoded volumes [29,30], with good results. Since explicit volumes are discrete, several methods [10, 31-33] based on implicit volume representations without any 3D supervision have been proposed. To gain a better understanding of object structure, Galama and Mensink [34] propose IterGANs to iteratively learn an implicit 3D model of the object. Implicit volume representations have gained popularity due to their continuous shape and texture representation. Some methods [9,11,35,36] predict continuous neural scene representations, and then use neural rendering to produce the novel view image. Geometry-based methods can keep structural consistency and predict reasonable shapes for invisible regions, but the generated textures tend to lack fine details.
In this paper, we propose an end-to-end deep neural network for sparse view synthesis by learning structure and texture representations. Structure is encoded as a hybrid feature field; texture is encoded as a deformed feature map. Each representation is generated by spatio-view attention aggregation for multi-view cases. The results generated by our approach have consistent structures and detailed textures.

Overview
The inputs to novel view synthesis from N images are a target camera pose p_t and N source images, each coupled with a camera pose: (I_s^1, p_s^1), ..., (I_s^N, p_s^N). Our goal is to synthesize the target image Î_t in the target camera pose p_t. I_t and Î_t denote the ground-truth and synthesized target images, respectively. In order to generate a result with reasonable structure and fine texture, we propose a new network, STATE, that aggregates information from both structure and texture representations. As Fig. 2 shows, STATE consists of a two-branch encoder and a fusion decoder.
The two-branch encoder E(·), consisting of a structure-aware branch and a texture-aware branch, encodes the inputs into a structure feature volume f_str and a texture feature map f_tex:

(f_str, f_tex) = E((I_s^1, p_s^1), ..., (I_s^N, p_s^N), p_t).

The structure-aware branch produces a hybrid feature field for each view, and then rotates and adaptively aggregates them into a single feature volume f_str containing structure information. The texture-aware branch generates a single feature map f_tex containing texture information by adaptively fusing the flow-warped features of all views. The fusion decoder D(·) takes the feature volume f_str and the feature map f_tex as input and generates the target image:

Î_t = D(f_str, f_tex).

Adaptive fusion of multi-view inputs is explained in detail in Section 3.3. Note that our model can handle an arbitrary number of inputs for both training and testing without modifying the encoder or decoder.

Two-branch encoder
We use a two-branch encoder to disentangle texture and structure from the sparse input images; it includes a texture-aware branch and a structure-aware branch. In both branches, to cope with occlusion and large view differences, pixels in the input images should not contribute equally. We therefore use spatio-view attention, based on calculating confidence maps for the multi-view images, to obtain the final texture representation f_tex and structure representation f_str; see Section 3.3.
In the texture-aware branch (see Fig. 2), we use an hourglass network F_warp to predict a warping field w^i and a confidence map c_tex^i for each input view i; it takes the target pose p_t, the i-th source image I_s^i, and the i-th source pose p_s^i as inputs:

(w^i, c_tex^i) = F_warp(I_s^i, p_s^i, p_t).

The warping field w^i is represented by displacements between the source image and the target image. Camera poses p_t and p_s^i are represented by quaternions. We expand the dimensions of each quaternion to match the dimensions of the image, and then concatenate them to form the input. The confidence map c_tex^i is used to fuse the feature maps from different views. c_tex^i and w^i share all weights of F_warp except for their output layers. We use a fully convolutional network F_tex to extract features f̃_tex^i from the source images, and then warp the features to get the target features f_tex^i:

f̃_tex^i = F_tex(I_s^i), f_tex^i = W(f̃_tex^i, w^i),

where W(·) is the warping function; bilinear sampling is used in our network.
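The flow-guided warping W(·) can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's code: `warp_features` is a hypothetical helper, and the flow convention (per-pixel displacements measured in pixels) is an assumption.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp a feature map by a per-pixel displacement field via bilinear sampling.

    feat: (B, C, H, W) source features (f~_tex in the paper's notation)
    flow: (B, 2, H, W) displacements (dx, dy) in pixels, as predicted by F_warp
    """
    b, _, h, w = feat.shape
    # Base sampling grid of pixel coordinates (x channel first, then y)
    ys, xs = torch.meshgrid(torch.arange(h, dtype=feat.dtype),
                            torch.arange(w, dtype=feat.dtype), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).expand(b, -1, -1, -1)
    coords = grid + flow                          # shift by the predicted flow
    # Normalize coordinates to [-1, 1] as required by grid_sample
    x_n = 2.0 * coords[:, 0] / (w - 1) - 1.0
    y_n = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((x_n, y_n), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat, sample_grid, mode="bilinear", align_corners=True)
```

With zero flow this reduces to the identity, which is a useful sanity check when debugging the warping branch.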
In the structure-aware branch, we use an encoder F_str [10] consisting of a series of 2D convolutions, reshaping, and 3D convolutions to extract a hybrid feature field, represented as a structure feature volume, for each image:

f̃_str^i = F_str(I_s^i),

where f̃_str^i is the structure feature volume in the corresponding pose p_s^i. Each voxel in our 3D feature volume corresponds to a point in 3D space and represents information such as its color and whether it lies inside the object. Such a 3D feature volume is more robust than a 2D feature map that represents depth information, and has been widely used in 3D reconstruction and novel view synthesis. It is also reasonable to reshape a 2D feature map into a 3D feature volume: a feature map with [c × d] channels can be treated as a concatenation of d feature maps with c channels, each of which represents geometry and appearance information for a slice in 3D space. Thus, the feature map with [c × d] channels contains d slices in 3D space and can be reshaped into a 3D feature volume with a depth resolution of d. Next, we rotate f̃_str^i from the source pose p_s^i to the target pose p_t:

f_str^i = R(f̃_str^i, p_s^i, p_t), c_str^i = 3DConv(f_str^i),

where R(·) is a rotation operation with trilinear sampling, f_str^i is the transformed feature volume having the same shape as f̃_str^i, and 3DConv(·) represents 3D convolution. The confidence map c_str^i is used to fuse the feature volumes from different views.
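The channel-to-depth reshaping described above can be illustrated with a short PyTorch sketch. The helper name `to_feature_volume` and the exact memory layout (d groups of c channels, each group becoming one depth slice) are our assumptions for illustration:

```python
import torch

def to_feature_volume(feat2d, c, d):
    """Reshape a 2D feature map with c*d channels into a 3D feature volume.

    feat2d: (B, c*d, H, W); its channels are read as d groups of c, each
    group becoming one depth slice, giving a volume of shape (B, c, d, H, W).
    """
    b, cd, h, w = feat2d.shape
    assert cd == c * d, "channel dimension must factor as c * d"
    return feat2d.view(b, d, c, h, w).permute(0, 2, 1, 3, 4).contiguous()
```

The resulting (B, C, D, H, W) layout is the one expected by PyTorch's 3D convolutions, so the reshaped volume can be fed directly to `nn.Conv3d`.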
The texture representation f_tex and the structure representation f_str are decoded by a fusion decoder, described in Section 3.4.

Spatio-view attention aggregation
Due to occlusions and large view variation, the texture representation f_tex^i of view i may be incomplete.
Missing regions should not have the same weighting as other regions. Moreover, more visible views should have more impact on the final result. Similarly, the structure-aware branch requires different weights for different regions of f_str^i and for different views. Therefore, instead of simply averaging the encoded feature maps, we apply adaptive aggregation with spatio-view attention in the texture-aware encoder and the structure-aware encoder by calculating a confidence map for each view, as shown in Fig. 3. The pixel-wise and voxel-wise confidence maps {c_tex^i}_{1≤i≤N} and {c_str^i}_{1≤i≤N} are used to fuse the texture features and structure features of all views:

f_tex = Σ_{i=1}^{N} c̄_tex^i ⊙ f_tex^i, f_str = Σ_{i=1}^{N} c̄_str^i ⊙ f_str^i,

where ⊙ denotes element-wise multiplication and c̄_tex^i and c̄_str^i are the normalized confidence maps, obtained by applying Softmax(·) to the predicted confidence maps across views. The normalized confidence maps then serve as the weights to aggregate the feature maps. This mechanism enables the weights to be automatically adjusted for any number of input views, which is very flexible. Moreover, fusion at the feature level needs less memory yet can produce a more continuous result.
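A minimal PyTorch sketch of this softmax-weighted fusion, assuming features and confidences are stacked along a leading view dimension (`fuse_views` is a hypothetical helper, not the paper's code):

```python
import torch

def fuse_views(feats, confs):
    """Fuse per-view features with softmax-normalized confidence maps.

    feats: (N, B, C, ...) per-view features (f_tex^i or f_str^i)
    confs: (N, B, 1, ...) raw per-view confidences, broadcast over channels
    Softmax across the view dimension yields spatially varying weights that
    sum to 1 at every location; plain averaging is recovered as the special
    case of equal confidences.
    """
    weights = torch.softmax(confs, dim=0)   # normalize across views
    return (weights * feats).sum(dim=0)     # weighted sum over views
```

Because the softmax is taken over however many views are stacked, the same function works unchanged for any number of inputs N.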

Fusion decoder
The fusion decoder fuses the texture feature map and the structure feature volume, and then generates the final image. After several 3D convolutions, the structure feature volume is turned into a structure feature map by merging the depth dimension into the channel dimension. We concatenate the structure feature map and the texture feature map, and then get the final image using a U-Net decoder. Instead of fusion at the pixel level, we fuse the structure representation and the texture representation at the feature level, for three reasons: (i) it is difficult to ensure the alignment of two-branch results, (ii) the features before the decoder contain more information than the decoded images, and (iii) fusion at the feature level enables the network to generate new content, especially for the invisible regions.
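The depth-to-channel merge and concatenation can be sketched as follows; the function names and tensor shapes are illustrative assumptions, and the U-Net decoder itself is omitted:

```python
import torch

def merge_depth_into_channels(vol):
    """Collapse a structure feature volume (B, C, D, H, W) into a 2D feature
    map (B, C*D, H, W) by folding the depth dimension into the channels."""
    b, c, d, h, w = vol.shape
    return vol.reshape(b, c * d, h, w)

def fuse_representations(f_str_vol, f_tex):
    """Concatenate the structure and texture representations channel-wise;
    the result would be fed to the U-Net decoder (not shown here)."""
    f_str = merge_depth_into_channels(f_str_vol)
    return torch.cat([f_str, f_tex], dim=1)
```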

Loss functions
Because STATE is an end-to-end trainable network, we directly define several losses in image space to train it. Our full training loss consists of a reconstruction term, a structural term, a perceptual term, a cosine term, and an adversarial term:

L = λ_r L_R + λ_s L_S + λ_p L_P + λ_c L_C + λ_a L_A,

where λ_r, λ_s, λ_p, λ_c, and λ_a weight the five loss terms. The reconstruction loss directly guides the similarity between the generated image Î_t and the ground-truth image I_t at the pixel level, accelerating convergence. L_R is defined as the ℓ1 distance:

L_R = ‖Î_t − I_t‖_1.

We use the structural similarity (SSIM) loss L_S [37] with a window size of 11×11 to improve structural similarity and consistency with human perception. The structural dissimilarity between the generated image Î_t and the ground-truth image I_t is given by

L_S = 1 − SSIM(Î_t, I_t).

In addition to low-level constraints at the pixel level, we adopt a perceptual loss [38] to compute the difference between deep features of the generated image Î_t and the ground-truth image I_t at a perceptual level:

L_P = Σ_i ‖φ_i(Î_t) − φ_i(I_t)‖_1,

where φ_i is the output of the i-th layer of VGG-19 [39] pre-trained on ImageNet [40]. We use layers 1, 6, 11, and 16 to supervise our network.
To ensure color consistency, we calculate the cosine similarity between the generated image Î_t and the ground-truth image I_t. Cosine similarity measures the similarity between two vectors as the cosine of the angle between them:

L_C = 1 − (Î_t · I_t) / (‖Î_t‖_2 ‖I_t‖_2).

We adopt the discriminator from generative adversarial networks [41], which have achieved great progress in image synthesis. It constrains the distance between the distributions of the generated image Î_t and the ground-truth image I_t. The adversarial loss is defined as

L_A = E[log D(I_t)] + E[log(1 − D(Î_t))],

where D(·) is a patch discriminator, log(·) is the base-2 logarithm, and E[·] is the expectation.
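A hedged PyTorch sketch of the cosine term and the weighted total loss follows. Computing cosine similarity over per-pixel RGB vectors is our interpretation of the description above, and the SSIM, perceptual, and adversarial terms are passed in precomputed since they require auxiliary networks:

```python
import torch
import torch.nn.functional as F

def cosine_loss(pred, gt):
    """Color-consistency term: one minus the cosine similarity between the
    per-pixel RGB vectors of the generated and ground-truth images."""
    cos = F.cosine_similarity(pred.flatten(2), gt.flatten(2), dim=1)
    return (1.0 - cos).mean()

def total_loss(pred, gt, l_s, l_p, l_a, lam=(1.0, 10.0, 0.5, 1.0, 1.0)):
    """Weighted sum of the five loss terms. The SSIM (l_s), perceptual (l_p),
    and adversarial (l_a) terms are supplied precomputed; lam defaults to the
    weights stated in the implementation details."""
    lam_r, lam_s, lam_p, lam_c, lam_a = lam
    l_r = F.l1_loss(pred, gt)        # reconstruction term
    l_c = cosine_loss(pred, gt)      # color-consistency term
    return lam_r * l_r + lam_s * l_s + lam_p * l_p + lam_c * l_c + lam_a * l_a
```

Note that the cosine term is invariant to a global rescaling of the image, which is why it is paired with the ℓ1 reconstruction term rather than used alone.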

Implementation details
Our framework is implemented in PyTorch. The hyper-parameters [λ_r, λ_s, λ_p, λ_c, λ_a] were set to [1, 10, 0.5, 1, 1] for training. The Adam optimizer [42] was used to optimize our network with the default parameters (β_1 = 0.9 and β_2 = 0.999) and learning rate 2 × 10^-4. We trained our model with four source view images until convergence on the training data, which took approximately 7 days using a single GeForce RTX 2080 Ti GPU. During testing, generating an image takes about 90 ms on the same GPU.
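The stated optimizer settings correspond to the following PyTorch setup (the Linear module is only a stand-in for the actual STATE network):

```python
import torch

# Optimizer configuration matching the stated hyper-parameters: Adam with
# default betas (0.9, 0.999) and a learning rate of 2e-4.
model = torch.nn.Linear(8, 8)  # placeholder for the STATE network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4,
                             betas=(0.9, 0.999))
```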

Datasets
To evaluate the performance of our view synthesis approach, we conducted experiments on ShapeNet (using Chair and Car) [43], in which the camera poses are represented by the rotation components around the object's central axis. We used the same training and testing splits as Refs. [4,5,10,21] (80% of models for training and the remaining 20% for testing). Each model was rendered as 256 × 256 RGB images at 18 azimuth angles sampled at 20° intervals and 3 elevations (0°, 10°, 20°), for a total of 54 viewpoints per model.
We also synthesized a dataset, Human, from 496 real scanned 3D human models from https://web.twindom.com. Each model was rendered as 256 × 256 RGB images at 18 azimuth angles sampled at 20° intervals and 3 elevations (0°, 10°, 20°), for a total of 54 viewpoints per model. We used 80% of the models for training and the remaining 20% for testing.
Models in the test images were not included in the training set.

Metrics
We used two popular metrics, learned perceptual image patch similarity (LPIPS) [44] and Fréchet inception distance (FID) [45], which are generally considered to be closer to human perception, to assess the reconstruction errors. LPIPS computes the distance between the generated image and the groundtruth image in the perceptual domain. FID calculates the Wasserstein-2 distance between the distributions of the generated images and the ground-truth images, which measures the realism of the generated images.

Ablation study
We first evaluate our method against four alternative models to determine the factors that contribute to achieving reasonable view synthesis from sparse input images. These models use the same setup, training schedule, and sequence of input images as STATE. We used the same training and test scheme as Refs. [4,10] for Chair, Car, and Human datasets: training with 4 views and testing with 1-4 views.
The alternative models were as follows.

w/o Tex. This model omits the texture-aware branch but retains multi-view adaptive weighting. It assesses the importance of the texture-aware branch.

w/o Str. This model omits the structure-aware branch but retains multi-view adaptive weighting. It assesses the importance of the structure-aware branch. Together with w/o Tex, it verifies the necessity of combining the texture and structure representations.

w/o SVA. This model is trained with multi-view averaging fusion, to assess the importance of spatio-view attention.
w/o Cos. This model omits cosine loss to assess the importance of cosine loss.
Full. Our full model includes the two-branch encoder and multi-view fusion at the feature level with adaptive weighting.

To verify the disentanglement of textures and structures, we visualize the results of the two branches: we output the result of one branch by zeroing out the features of the other. Figure 8 demonstrates that our method can effectively disentangle textures and structures to generate realistic images with correct shapes and textures.
We visualize the confidence maps to demonstrate the effect of spatio-view attention aggregation in Fig. 9, taking novel view synthesis from two views as an example. The first two columns show the source images, the third column shows the generated image, and the last two columns show the generated image multiplied by each view's confidence map. As can be seen, the generated image obtains more texture information from source image 2 because it is more similar to the target view, showing that our spatio-view attention aggregation can select the more relevant information from different input views.

Comparison to other methods
We compare our method to TBN [10] and pixelNeRF [11]. For simplicity, we omit comparisons to earlier works [4,9] that have already been compared to TBN or pixelNeRF, and to methods that do not work well for sparse views [30,35,46,47]. We use the same training and test scheme as TBN [10] for the Chair and Car datasets: training with 4 views and testing with 1-4 views. For single-view input, we use a single view for training, as multi-view adaptive weighting is not used. Pre-trained TBN [10] models for the Chair and Car datasets were used, and we retrained TBN [10] on the Human dataset for a fair comparison, using the same training and test scheme: training with 4 views and testing with any number of views. We also retrained pixelNeRF [11] on the Car, Chair, and Human datasets for a fair comparison: training with 4 views and testing with 2-4 views. For single-view input, we used a single view for training, as suggested by its authors. Table 2 provides a quantitative comparison on the Chair, Car, and Human datasets. Our proposed method outperforms the other methods in terms of FID by a significant margin on the Chair dataset, even in the challenging case of single-view input. For the Car dataset, benefiting from spatio-view attention, our method achieves the best performance for multi-view inputs. Cars are left-right symmetric, but not front-back. As a result, our texture-aware branch finds it difficult to provide reasonable textures when there is heavy occlusion in front of or behind the car for a single view, leading to some faults in the final textures, even when the shape estimated by the structure-aware branch is accurate.
For the Human dataset, our method achieves the best LPIPS scores in all cases. Clothed, posed humans have complex colors and asymmetric shapes, which makes synthesis more challenging. Qualitative comparisons are shown in Figs. 10-12. Due to its representation's limited resolution, TBN [10] struggles to recover image details, such as chair legs and the textures of cars and people.
PixelNeRF [11] generates certain artifacts along structural edges. In contrast, our method provides detailed textures while maintaining the structures of objects: e.g., see the stripes on the car and the suit on the person. Thanks to the disentangled learning of the structure representation and the texture representation, invisible regions and detailed textures are successfully recovered by our method for any number of input views. By fusing and decoding the two representations, our method does not suffer from missing pixels: our method can generate visually better and more realistic images.
Further results are given in the ESM.

User study
To better evaluate our method, we performed a perceptual evaluation via a user study, including a comparison to state-of-the-art methods. We showed results from TBN [10] (Method A), pixelNeRF [11] (Method B), and our method (Method C), together with the ground truth for the same input images, for twelve cases with three questions per case (38 questions in total, including the gender and age of the participant): 1-4 views as input on the Car, Chair, and Human datasets. The results shown were randomly selected, and users were asked to choose from A, B, and C the one closest to the ground truth in terms of texture, structure, and overall quality for each case. We collected 111 sets of answers, from 59 females and 52 males, with 108 users aged between 18 and 40, 1 user between 40 and 60, and 2 users over 60. Table 3 presents the results of the user study. For each gender, we give the percentage of participants who chose the result from a particular method for each case, as well as averages over the twelve cases. In addition to the percentages, we also carried out an independent-samples t-test [48] between results and gender: t is a statistic calculated from the results, and p is obtained from a table according to t. A p value greater than 0.05 means there is no significant difference between the results for the two genders. We use 1, 2, and 3 to represent Methods A, B, and C, respectively, and average the results of the three questions in each case. Table 3 shows that the user study results do not depend on gender. Overall, our method achieves the best results in the user study. Further results are given in the ESM.

Applications
Our method does not explicitly constrain texture and structure, but as the two branches are responsible for generating structure and texture respectively, this implicitly leads to disentanglement. We can thus achieve texture or structure swapping with models trained for novel view synthesis: using the texture and structure branches, we can easily edit the texture and the structure by changing the inputs to each branch. Figures 13 and 14 show some disentangled results on the Car and Chair datasets. The first row provides the texture information and the first column gives the structure information; each remaining result is decoded from the corresponding combination of structure and texture representations. It can be seen that the structure of each result is consistent with that of the first item in its row, and its texture is consistent with the top item in its column. Figure 15 shows some disentangled results for various views, confirming that our method achieves disentanglement of texture and structure.

Failure cases
Although our method generates realistic images with reasonable structures and detailed textures in most cases, it cannot cope well with structures and textures that deviate greatly from the training set distribution. The neural network predicts outputs by interpolating within the manifold built on the training data, so it is difficult to predict reasonable results for some challenging cases, especially those with extremely complex structures and textures. Figure 16 shows examples in which our method fails to predict correct textures and shapes for extremely complex cases.

Conclusions
In this paper, we propose STATE, an end-to-end deep neural network, for view synthesis from sparse input images by learning structure and texture representations. Specifically, we propose a two-branch encoder to extract an implicit structure representation and a deformed texture representation. We also propose spatio-view attention to adaptively fuse multi-view information at the feature level by regressing pixel-wise or voxel-wise confidence maps. By decoding the aggregated feature, STATE can generate realistic images with reasonable structures and detailed textures. Experimental results demonstrate that our method works better than current state-of-the-art methods. We have validated our approach via a comprehensive ablation study. Our method enables texture and structure editing applications benefiting from implicit disentanglement of structures and textures.
Despite its good novel view synthesis results, the training efficiency of our method is not high. Our method is implemented in PyTorch, and it takes approximately 7 days to train the model for four source images using a single GeForce RTX 2080 Ti GPU. In future, we hope to improve training efficiency using a Jittor model [49,50], which is 2.26 times faster than the equivalent PyTorch model on average.

Availability of data and materials
Our code and further results are available at http://cic.tju.edu.cn/faculty/likun/projects/STATE.

Funding
This work was supported in part by the National Natural Science Foundation of China (62171317 and 62122058).