Base Networks To evaluate the performance and various properties of AttSets, we choose the encoder–decoders of 3D-R2N2 (Choy et al. 2016) and SilNet (Wiles and Zisserman 2017) as two base networks.
Encoder–decoder of 3D-R2N2. The original 3D-R2N2 consists of (1) a shared ResNet-based 2D encoder which encodes \(127\times 127 \times 3\) images into 1024-dimensional latent vectors, (2) a GRU module which fuses N 1024-dimensional latent vectors into a single \(4\times 4\times 4\times 128\) tensor, and (3) a ResNet-based 3D decoder which decodes that tensor into a \(32\times 32\times 32\) voxel grid representing the 3D shape. Figure 4 shows the architecture of the AttSets-based multi-view 3D reconstruction network, where the only difference is that the original GRU module is replaced by AttSets in the middle. This network is called Base\(_{\text {r2n2}}\)-AttSets.
Encoder–decoder of SilNet. The original SilNet consists of (1) a shared 2D encoder which encodes \(127\times 127\times 3\) images together with the image viewing angles into 160-dimensional latent vectors, (2) a max pooling module which aggregates N latent vectors into a single vector, and (3) a 2D decoder which estimates an object silhouette (\(57\times 57\)) from the single latent vector and a new viewing angle. Instead of being explicitly supervised by 3D shape labels, SilNet aims to implicitly learn a 3D shape representation from multiple images via the supervision of 2D silhouettes. Figure 5 shows the architecture of the AttSets-based SilNet, where the only difference is that the original max pooling is replaced by AttSets in the middle. This network is called Base\(_{\text {silnet}}\)-AttSets.
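In both base networks, AttSets only replaces the aggregation step, i.e., it fuses N latent vectors into a single vector of the same dimension. Below is a minimal NumPy sketch of the fc-based aggregation as we understand it from Sect. 3; the variable names, toy initialization and dimensions are illustrative and not taken from the released code.

```python
import numpy as np

def attsets_fc(X, W, b):
    """Aggregate an (N, D) feature set into a single (D,) vector.

    X: (N, D) latent vectors from the shared encoder (N is arbitrary).
    W: (D, D) weights of the shared fully connected attention function.
    b: (D,) bias.
    """
    C = X @ W + b                                 # (N, D) attention activations
    A = np.exp(C - C.max(axis=0, keepdims=True))  # softmax over the N elements,
    A = A / A.sum(axis=0, keepdims=True)          # independently for each feature
    return (A * X).sum(axis=0)                    # weighted sum -> (D,)

# fuse the latent vectors of 5 views into one 1024-d vector
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 1024))
W = rng.standard_normal((1024, 1024)) * 0.01
b = np.zeros(1024)
print(attsets_fc(X, W, b).shape)                  # (1024,)
```

Because the softmax runs across the N elements and the weighted features are summed, the output has a fixed size and is independent of the order of the input views, which is why the GRU or max pooling module can be swapped out without touching the encoder or decoder.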
Competing Approaches We compare our AttSets and FASet with three groups of competing approaches. Note that all the following competing approaches are connected at the same location of the base encoder–decoder shown in the pink block of Figs. 4 and 5, with the same network configurations and training/testing settings.
RNNs. The original 3D-R2N2 makes use of the GRU unit (Choy et al. 2016; Kar et al. 2017) for feature aggregation and serves as a solid baseline.
First-order poolings. The widely used max/mean/sum pooling operations (Huang et al. 2018; Paschalidou et al. 2018; Eslami et al. 2018) are all implemented for comparison.
Higher-order poolings. We also compare with the state-of-the-art higher-order pooling approaches, including bilinear pooling (BP) (Lin et al. 2015), and the very recent MHBN (Yu et al. 2018) and SMSO poolings (Yu and Salzmann 2018).
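For reference, the first-order baselines collapse the same (N, D) feature set without any trainable parameters; a trivial sketch (illustrative only):

```python
import numpy as np

def pool(X, mode="max"):
    """Collapse an (N, D) feature set into a (D,) vector without learnable weights."""
    if mode == "max":
        return X.max(axis=0)
    if mode == "mean":
        return X.mean(axis=0)
    if mode == "sum":
        return X.sum(axis=0)
    raise ValueError("unknown pooling mode: %s" % mode)

print(pool(np.random.rand(8, 1024), "max").shape)   # (1024,)
```

The higher-order poolings (BP/MHBN/SMSO) instead build second-order statistics of the features before compacting them; we refer the reader to the cited papers for details.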
Datasets All approaches are evaluated on four large open datasets.
ShapeNet\(_{\text {r2n2}}\) Dataset (Choy et al. 2016). The released 3D-R2N2 dataset consists of 13 categories of 43,783 common objects with synthesized RGB images from the large scale ShapeNet 3D repository (Chang et al. 2015). For each 3D object, 24 images are rendered from different viewing angles circling around each object. The train/test dataset split is 0.8:0.2.
ShapeNet\(_{\text {lsm}}\) Dataset (Kar et al. 2017). Both the LSM and 3D-R2N2 datasets are generated from the same 3D ShapeNet repository (Chang et al. 2015), i.e., they have the same ground truth labels for the same objects. However, the ShapeNet\(_{\text {lsm}}\) dataset has entirely different camera viewing angles and lighting sources for the rendered RGB images. Therefore, we use the ShapeNet\(_{\text {lsm}}\) dataset to evaluate the robustness and generality of all approaches. All images of the ShapeNet\(_{\text {lsm}}\) dataset are resized from \(224\times 224\) to \(127\times 127\) through linear interpolation.
ModelNet40 Dataset. ModelNet40 (Wu et al. 2015) consists of 12,311 objects belonging to 40 categories. The 3D models are split into 9,843 training samples and 2,468 testing samples. Each 3D model is voxelized into a \(30\times 30\times 30\) grid in Qi et al. (2016), and 12 RGB images are rendered from different viewing angles. All 3D shapes are zero-padded to be \(32\times 32\times 32\), and the images are linearly resized from \(224\times 224\) to \(127\times 127\) for training and testing.
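A small sketch of this preprocessing; the symmetric one-voxel padding and the [0, 1] image normalization are our assumptions, and Pillow stands in for whatever resizing routine was actually used:

```python
import numpy as np
from PIL import Image

def pad_voxels(vox30):
    """Zero-pad a (30, 30, 30) occupancy grid to (32, 32, 32)."""
    return np.pad(vox30, ((1, 1), (1, 1), (1, 1)), mode="constant")

def load_view(path):
    """Linearly resize a rendered 224x224 RGB image to 127x127."""
    img = Image.open(path).convert("RGB").resize((127, 127), Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 255.0

print(pad_voxels(np.ones((30, 30, 30), dtype=np.float32)).shape)   # (32, 32, 32)
```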
Blobby Dataset (Wiles and Zisserman 2017). It contains 11,706 blobby objects. Each object has 5 RGB images paired with viewing angles and the corresponding silhouettes, which are rendered with the Cycles engine in Blender under different lighting sources and texture models.
Metrics The explicit 3D voxel reconstruction performance of Base\(_{\text {r2n2}}\)-AttSets and the competing approaches is evaluated on three datasets: ShapeNet\(_{\text {r2n2}}\), ShapeNet\(_{\text {lsm}}\) and ModelNet40. We use the mean Intersection-over-Union (IoU) (Choy et al. 2016) between predicted 3D voxel grids and their ground truth as the metric. The IoU for an individual voxel grid is formally defined as follows:
$$\begin{aligned} IoU = \frac{\sum _{i=1}^{L} \left[ I(h_i>p) \cdot I(\bar{h}_i) \right] }{ \sum _{i=1}^{L} \left[ I\left( I(h_i>p) + I(\bar{h}_i) \right) \right] } \end{aligned}$$
where \(I(\cdot )\) is an indicator function, \(h_{i}\) is the predicted value for the ith voxel, \(\bar{h}_i\) is the corresponding ground truth, p is the binarization threshold, and L is the total number of voxels in a whole voxel grid. As there is no validation split in the above three datasets, to calculate the IoU scores we independently search for the optimal binarization threshold from 0.2 to 0.8 with a step of 0.05 for every approach, for fair comparison. In our experiments, the optimal thresholds of all approaches end up being 0.3 or 0.35.
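A minimal NumPy sketch of this metric and the threshold search (the helper names are ours):

```python
import numpy as np

def voxel_iou(pred, gt, p):
    """IoU between a predicted voxel grid (occupancy probabilities) and binary ground truth."""
    occ = pred > p
    gt = gt.astype(bool)
    inter = np.logical_and(occ, gt).sum()
    union = np.logical_or(occ, gt).sum()
    return inter / float(union) if union else 1.0

def best_threshold(preds, gts):
    """Search the binarization threshold from 0.2 to 0.8 with a step of 0.05."""
    candidates = np.arange(0.2, 0.80001, 0.05)
    mean_ious = [np.mean([voxel_iou(pr, gt, p) for pr, gt in zip(preds, gts)])
                 for p in candidates]
    best = int(np.argmax(mean_ious))
    return candidates[best], mean_ious[best]

# toy usage with random "predictions" and "ground truth"
preds = [np.random.rand(32, 32, 32) for _ in range(4)]
gts = [np.random.rand(32, 32, 32) > 0.5 for _ in range(4)]
print(best_threshold(preds, gts))
```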
The implicit 3D shape learning performance of Base\(_{\text {silnet}}\)-AttSets and the competing approaches is evaluated on the Blobby dataset. The mean IoU between predicted 2D silhouettes and the ground truth is used as the metric (Wiles and Zisserman 2017).
Table 1 Group 1: mean IoU for multi-view reconstruction of all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 2 images per object in Stage 2, while other competing approaches are fine-tuned given 2 images per object in Stage 2
Table 2 Group 2: mean IoU for multi-view reconstruction of all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 8 images per object in Stage 2, while other competing approaches are fine-tuned given 8 images per object in Stage 2
Table 3 Group 3: mean IoU for multi-view reconstruction of all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 16 images per object in Stage 2, while other competing approaches are fine-tuned given 16 images per object in Stage 2
Evaluation on ShapeNet\(_{\text {r2n2}}\) Dataset
To fully evaluate the aggregation performance and robustness, we train Base\(_{\text {r2n2}}\)-AttSets and its competing approaches on the ShapeNet\(_{\text {r2n2}}\) dataset. For fair comparison, all networks (the pooling/GRU/AttSets based approaches) are trained according to the proposed two-stage training algorithm.
Training Stage 1 All networks are trained given only 1 image for each object, i.e., \(N=1\) in all training iterations, until convergence. Basically, this is to guarantee all networks are well optimized for the extreme case where there is only one input image.
Training Stage 2 To enable these networks to be more robust for multiple input images, all networks are further trained given more images per object. Particularly, we conduct the following five parallel groups of training experiments.
Group 1. All networks are further trained given only 2 images for each object, i.e., \(N=2\) in all iterations. For our Base\(_{\text {r2n2}}\)-AttSets, the encoder–decoder trained in Stage 1 is frozen and we only optimize the AttSets module according to our FASet algorithm (Algorithm 1); a schematic sketch of this staged optimization is given after the group descriptions below. For the competing approaches, e.g., the GRU and all poolings, we instead fine-tune the whole networks, because they have no separate aggregation parameters that could be trained in isolation. Specifically, we use a smaller learning rate (1e\({-}5\)) to carefully train these networks with \(N=2\) until convergence.
Group 2/3/4. Similarly, in these three groups of second-stage training experiments, N is set to 8, 16 and 24, respectively.
Group 5. All networks are further trained until convergence, but N is uniformly and randomly sampled from [1, 24] for each object during training. In the above Groups 1/2/3/4, N is fixed for each object, whereas in Group 5 it varies from object to object.
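The following toy sketch illustrates the staged optimization referred to in Group 1. It uses PyTorch purely for brevity (the paper's implementation is TensorFlow 1.x), and the layer sizes are placeholders rather than the real architecture.

```python
import torch
import torch.nn as nn

class AttSets(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.g = nn.Linear(d, d)                    # shared attention function
    def forward(self, X):                           # X: (N, d) feature set
        A = torch.softmax(self.g(X), dim=0)         # feature-wise scores over the N views
        return (A * X).sum(dim=0)                   # (d,) aggregated feature

enc, att, dec = nn.Linear(8, 16), AttSets(16), nn.Linear(16, 4)   # toy base network

def train_step(optim, views, gt):
    optim.zero_grad()
    loss = ((dec(att(enc(views))) - gt) ** 2).mean()
    loss.backward()
    optim.step()

gt = torch.rand(4)
# Stage 1: N = 1; all parameters are optimised.
opt_stage1 = torch.optim.Adam(
    list(enc.parameters()) + list(att.parameters()) + list(dec.parameters()), lr=1e-4)
train_step(opt_stage1, torch.rand(1, 8), gt)

# Stage 2: N > 1; the optimizer only holds the AttSets weights, so the base
# encoder-decoder stays fixed. The competing GRU/pooling networks have no such
# separate weights and are simply fine-tuned end-to-end at lr = 1e-5.
opt_stage2 = torch.optim.Adam(att.parameters(), lr=1e-4)
train_step(opt_stage2, torch.rand(24, 8), gt)
```

Note that with \(N=1\) the softmax weight over a single element is identically 1, so Stage 1 effectively shapes the base network, while Stage 2 exposes the attention weights to genuinely multi-view gradients with the base kept fixed.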
Table 4 Group 4: mean IoU for multi-view reconstruction of all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 24 images per object in Stage 2, while other competing approaches are fine-tuned given 24 images per object in Stage 2
Table 5 Group 5: mean IoU for multi-view reconstruction of all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given random number of images per object in Stage 2, i.e., N is uniformly sampled from [1, 24], while other competing approaches are fine-tuned given random number of views per object in Stage 2
The above experiment Groups 1/2/3/4 are designed to investigate how the competing approaches are further optimized towards the statistics of a fixed N during training, resulting in different levels of robustness to an arbitrary N during testing. By contrast, the paradigm in Group 5 aims at covering all possible N values during training, so its overall performance is expected to be more robust to an arbitrary number of input images during testing than in Groups 1/2/3/4.
Testing Stage All networks trained in the above five groups of experiments are separately tested given \(N = \{1, 2, 3, 4, 5, 8, 12, 16, 20, 24\}\). The permutations of input images are the same across approaches for fair comparison. Note that we do not test the networks trained only in Stage 1, because the AttSets module is not yet optimized and the corresponding Base\(_{\text {r2n2}}\)-AttSets is unable to generalize to multiple input images during testing. Therefore, it is meaningless to compare performance when the networks are trained solely on a single image.
Results Tables 1, 2, 3, 4 and 5 show the mean IoU scores of all 13 categories for experiments of Group 1–5, while Figs. 6, 7, 8, 9 and 10 show the trends of mean IoU changes in different Groups. Figure 11 shows the estimated 3D shapes in experiment Group 5, with an increasing number of images from 1 to 5 for different approaches.
Table 6 Per-category mean IoU for single view reconstruction on ShapeNet\(_{\text {r2n2}}\) testing split
We notice that the IoU scores on the ShapeNet data reported in the original LSM paper (Kar et al. 2017) are higher than ours. However, the experimental settings in LSM (Kar et al. 2017) differ from ours in two aspects. (1) The original LSM requires both RGB images and the corresponding viewing angles as input, while our experiments do not. (2) The original LSM dataset has different styles of rendered color images and different train/test splits from our experimental settings. Therefore, the IoU scores reported in LSM are not directly comparable with ours, and we do not include them in this paper to avoid confusion. Note that the aggregation module of LSM (Kar et al. 2017), i.e., the GRU, is the same as that used in 3D-R2N2 (Choy et al. 2016), and is therefore fully evaluated throughout our experiments.
To highlight the performance of single view 3D reconstruction, Table 6 shows the optimal per-category IoU scores of the different competing approaches from experiment Groups 1–5. In addition, we compare with state-of-the-art dedicated single view reconstruction approaches, including OGN (Tatarchenko et al. 2017), AORM (Yang et al. 2018) and PointSet (Fan et al. 2017), in Table 6. Overall, our AttSets based approach outperforms all others by a large margin for both single view and multi-view reconstruction, and generates much more compelling 3D shapes.
Table 7 Mean IoU for multi-view reconstruction of all 13 categories from ShapeNet\(_{\text {lsm}}\) dataset. All networks are well trained in previous experiment Group 5 of Sect. 5.1
Analysis We investigate the results as follows:
The GRU based approach can generate reasonable 3D shapes in all experiment Groups 1–5 given either few or multiple images during testing, but the performance saturates quickly once more images are given, e.g., 8 views, because the recurrent unit struggles to capture features from longer image sequences, as illustrated in Fig. 9.
In Groups 1–4, all pooling based approaches are able to estimate satisfactory 3D shapes when given a similar number of images to that used in training, but they are unlikely to predict reasonable shapes given an arbitrary number of images. For example, in experiment Group 4, all pooling based approaches achieve inferior IoU scores given only a few images, as shown in Table 4 and Fig. 9, because the features pooled from fewer images during testing are unlikely to be as general and representative as the features pooled from more images during training. Therefore, these models trained on 24 images fail to generalize well to a single image during testing.
In Group 5, as shown in Table 5 and Fig. 10, all pooling based approaches are much more robust than in Groups 1–4, because the networks are optimized over an arbitrary number of images during training. However, these networks tend to deliver middling performance. Compared with Group 4, all approaches in Group 5 tend to perform better when \(N=1\) but worse when \(N=24\); compared with Group 1, they are likely to be better when \(N=24\) but worse when \(N=1\). Essentially, these networks are optimized to learn features that are averaged over the varying number of views.
In all experiment Groups 1–5, all approaches tend to perform better when given enough input images, i.e., \(N=24\), because more images provide richer information for reconstruction.
In all experiment Groups 1–5, our AttSets based approach clearly outperforms all others for both single and multiple view 3D reconstruction, and it is more robust to a variable number of input images. Our FASet algorithm completely decouples the base network to learn visual features for accurate single view reconstruction, as illustrated in Fig. 9, while the trainable parameters of the AttSets module are separately responsible for learning attention scores for better multi-view reconstruction, as shown in Fig. 9. Therefore, the whole network does not suffer from the limitations of the GRU or pooling approaches, and achieves better performance for reconstruction from either fewer or more images.
Table 8 Group 1: mean IoU for multi-view reconstruction of all 40 categories in ModelNet40 testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 12 images per object in Stage 2, while other competing approaches are fine-tuned given 12 images per object in Stage 2
Table 9 Group 2: mean IoU for multi-view reconstruction of all 40 categories in ModelNet40 testing split. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given random number of images per object in Stage 2, i.e., N is uniformly sampled from [1, 12], while other competing approaches are fine-tuned given random number of views per object in Stage 2
Evaluation on ShapeNet\(_{\text {lsm}}\) Dataset
To further investigate how well the learnt visual features and attention scores generalize across different styles of images, we use the networks trained in Group 5 of Sect. 5.1 to test on the large ShapeNet\(_{\text {lsm}}\) dataset. Note that we only use the synthesized images from the ShapeNet\(_{\text {lsm}}\) dataset that correspond to objects in the ShapeNet\(_{\text {r2n2}}\) testing split. This guarantees that the trained models have never seen either the style of the LSM rendered images or the 3D object labels before. The image viewing angles from the original ShapeNet\(_{\text {lsm}}\) dataset are not used in our experiments, since the Base\(_{\text {r2n2}}\) network does not require viewing angles as input. Table 7 shows the mean IoU scores of all approaches, while Fig. 12 shows the qualitative results.
Our AttSets based approach outperforms all others given either few or multiple input images. This demonstrates that our Base\(_{\text {r2n2}}\)-AttSets approach does not overfit the training data, but has better generality and robustness to new styles of rendered color images than the other approaches.
Evaluation on ModelNet40 Dataset
We train the Base\(_{\text {r2n2}}\)-AttSets and its competing approaches on the ModelNet40 dataset from scratch. For fair comparison, all networks (the pooling/GRU/AttSets based approaches) are trained according to the proposed FASet algorithm, which is similar to the two-stage training strategy of Sect. 5.1.
Training Stage 1 All networks are trained given only 1 image for each object, i.e., \(N=1\) in all training iterations, until convergence. This guarantees all networks are well optimized for single view 3D reconstruction.
Training Stage 2 We further conduct the following two parallel groups of training experiments to optimize the networks for multi-view reconstruction.
Group 1. All networks are further trained given all 12 images for each object, i.e., \(N=12\) in all iterations, until convergence. For our Base\(_{\text {r2n2}}\)-AttSets, the encoder–decoder trained in Stage 1 is frozen and only the AttSets module is trained. All other competing approaches are fine-tuned with a smaller learning rate (1e\({-}5\)) in this stage.
Group 2. All networks are further trained until convergence, but N is uniformly and randomly sampled from [1, 12] for each object during training. Only the AttSets module is trained, while all other competing approaches are fine-tuned in this Stage 2.
Table 10 Group 1: mean IoU for silhouette prediction on the Blobby dataset. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 2 images per object, i.e., \(N=2\), while other competing approaches are fine-tuned given 2 views per object in Stage 2
Testing Stage All networks trained in the above two groups are separately tested given \(N = \{1, 2, 3, 4, 5, 8, 12\}\). The permutations of input images are the same for all approaches for fair comparison.
Results Tables 8 and 9 show the mean IoU scores of Groups 1 and 2 respectively, and Fig. 13 shows qualitative results of Group 2. Base\(_{\text {r2n2}}\)-AttSets surpasses all competing approaches by a large margin for both single and multiple view 3D reconstruction, and all results are consistent with the previous experiments on the ShapeNet\(_{\text {r2n2}}\) and ShapeNet\(_{\text {lsm}}\) datasets.
Evaluation on Blobby Dataset
In this section, we evaluate Base\(_{\text {silnet}}\)-AttSets and the competing approaches on the Blobby dataset. For fair comparison, the GRU module is implemented with a single fully connected layer of 160 hidden units, giving it a network capacity similar to that of our AttSets based network. All networks (the pooling/GRU/AttSets based approaches) are trained with the proposed two-stage FASet algorithm as follows:
Table 11 Group 2: mean IoU for silhouette prediction on the Blobby dataset. All networks are firstly trained given only 1 image for each object in Stage 1. The AttSets module is further trained given 4 images per object, i.e., \(N=4\), while other competing approaches are fine-tuned given 4 views per object in Stage 2
Training Stage 1 All networks are trained given only 1 image together with the viewing angle for each object, i.e., N=1 in all training iterations, until convergence. This guarantees the performance of single view shape learning.
Training Stage 2 Another two parallel groups of training experiments are conducted to further optimize the networks for multi-view shape learning.
Group 1. All networks are further trained given only 2 images for each object, i.e., \(N=2\) in all iterations. For Base\(_{\text {silnet}}\)-AttSets, only the AttSets module is optimized, with the well-trained base encoder–decoder frozen. For fair comparison, all competing approaches are fine-tuned with \(N=2\) until convergence for better performance.
Group 2. Similar to the above Group 1, all networks are further trained given all 4 images for each object, i.e., N=4, until convergence.
Testing Stage All networks trained in the above two groups are separately tested given \(N = \{1, 2, 3, 4\}\). The permutations of input images are the same for all networks for fair comparison.
Results Tables 10 and 11 show the mean IoUs of the above two groups of experiments, and Fig. 14 shows the qualitative results of Group 2. Note that the IoUs are calculated on predicted 2D silhouettes instead of 3D voxels, so they are not numerically comparable with the previous experiments on the ShapeNet\(_{\text {r2n2}}\), ShapeNet\(_{\text {lsm}}\) and ModelNet40 datasets. We do not include the IoU scores of the original SilNet (Wiles and Zisserman 2017), because those scores were obtained with an end-to-end training strategy, whereas in this paper we uniformly apply the proposed two-stage FASet training paradigm to all approaches for fair comparison. Our Base\(_{\text {silnet}}\)-AttSets consistently outperforms all competing approaches for shape learning from either single or multiple views.
Qualitative Results on Real-World Images
To the best of our knowledge, there is no public real-world dataset for multi-view 3D object reconstruction. Therefore, we manually collect real-world images from Amazon online shops to qualitatively demonstrate the generality of all networks trained on the synthetic ShapeNet\(_{\text {r2n2}}\) dataset in experiment Group 4 of Sect. 5.1, as shown in Fig. 15.
In the meantime, we use these real-world images to qualitatively examine the permutation invariance of different approaches. In particular, for each object we test 6 different permutations in total. As shown in Fig. 16, the GRU based approach generates inconsistent 3D shapes given different image permutations. For example, the arm of a chair and the leg of a table are reconstructed in permutation 1, but fail to be recovered in another permutation. By comparison, all other approaches are permutation invariant, as shown by the results in Fig. 15.
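The permutation check itself is straightforward to reproduce on the aggregation module alone: feed the same feature set in several random orders and verify that the fused output is unchanged. A self-contained sketch using the NumPy aggregation from earlier (toy dimensions):

```python
import numpy as np

def attsets_fc(X, W, b):                          # same aggregation sketch as before
    C = X @ W + b
    A = np.exp(C - C.max(axis=0, keepdims=True))
    A = A / A.sum(axis=0, keepdims=True)
    return (A * X).sum(axis=0)

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 32))                  # one object seen from 6 views
W, b = rng.standard_normal((32, 32)) * 0.1, np.zeros(32)

outputs = [attsets_fc(X[rng.permutation(6)], W, b) for _ in range(6)]
print(all(np.allclose(outputs[0], o) for o in outputs))   # True: view order is irrelevant
```

A GRU would generally fail this check, since its hidden state depends on the order in which the views are consumed.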
Table 12 Mean time consumption for a single object (\(32^3\) voxel grid) estimation from different number of images (milliseconds)
Computational Efficiency
To evaluate the computation and memory cost of AttSets, we implement Base\(_{\text {r2n2}}\)-AttSets and the competing approaches in Python 2.7 and TensorFlow 1.2 with CUDA 9.0 and cuDNN 7.1 as the back-end driver and library. All approaches share the same Base\(_{\text {r2n2}}\) network and run on the same Titan X GPU in the same software environment. Table 12 shows the average time consumption to reconstruct a single 3D object given different numbers of images. Our AttSets based approach is as efficient as the pooling methods, while Base\(_{\text {r2n2}}\)-GRU (i.e., 3D-R2N2) takes more time as the number of images increases, due to the sequential computation of its GRU module. In terms of total trainable weights, the max/mean/sum pooling based approaches have 16.66 million, while the AttSets based network has 17.71 million. By contrast, the original 3D-R2N2 has 34.78 million, and BP/MHBN/SMSO have 141.57, 60.78 and 17.71 million, respectively. Overall, our AttSets outperforms the recurrent unit and pooling operations without incurring notable computation and memory cost.
Table 13 Mean IoU of AttSets variants on all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split
Comparison Between Variants of AttSets
We further compare the aggregation performance of the fc, conv2d and conv3d based AttSets variants shown in Fig. 3 of Sect. 3.4. The fc based AttSets network is the same as in Sect. 5.1. The conv2d based AttSets is plugged into the middle of the 2D encoder, fusing an (N, 4, 4, 256) tensor into (1, 4, 4, 256), where N is an arbitrary number of images. The conv3d based AttSets is plugged into the middle of the 3D decoder, integrating an (N, 8, 8, 8, 128) tensor into (1, 8, 8, 8, 128). All other layers of these variants are the same. Both the conv2d and conv3d based AttSets networks are trained using the paradigm of experiment Group 4 in Sect. 5.1.

Table 13 shows the mean IoU scores of the three variants on the ShapeNet\(_{\text {r2n2}}\) testing split. The fc and conv3d based variants achieve similar IoU scores for both single and multi-view 3D reconstruction, demonstrating the superior aggregation capability of AttSets. In the meantime, we observe that the overall performance of the conv2d based AttSets network is slightly lower than that of the other two. One possible reason is that the 2D feature set is aggregated at an early layer of the network, so that useful features are discarded early.

Figure 17 visualizes the attention scores learnt for a 2D feature set, i.e., (N, 4, 4, 256) features, by the conv2d based AttSets network. To visualize the 2D feature scores, we average the scores along the channel axis and then roughly trace the spatial locations of those scores back to the original input. The more visual information an input image contains, the higher the attention scores learnt by AttSets for the corresponding latent features; for example, the third image carries richer visual information than the first, so its attention scores are higher. Note that, for a specific base network, there are many potential locations to plug in AttSets, and it is also possible to include multiple AttSets modules in the same network; fully evaluating these factors is beyond the scope of this paper.
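For concreteness, a NumPy sketch of the conv2d variant's aggregation over an (N, 4, 4, 256) tensor; treating the shared attention function as a 1×1 convolution over the channel axis (a (256, 256) matrix below) is our simplification, and the real kernel shape may differ.

```python
import numpy as np

def attsets_conv2d(X, K):
    """Fuse an (N, 4, 4, 256) feature set into a (1, 4, 4, 256) tensor."""
    C = X @ K                                     # (N, 4, 4, 256) attention activations
    A = np.exp(C - C.max(axis=0, keepdims=True))
    A = A / A.sum(axis=0, keepdims=True)          # softmax over the N views
    return (A * X).sum(axis=0, keepdims=True)     # (1, 4, 4, 256)

X = np.random.rand(7, 4, 4, 256)                  # N = 7 views
K = np.random.rand(256, 256) * 0.01               # shared per-channel attention weights
print(attsets_conv2d(X, K).shape)                 # (1, 4, 4, 256)
```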
Table 14 Mean IoU of all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split for feature-wise and element-wise attentional aggregation
Feature-Wise Attention versus Element-Wise Attention
Our AttSets module is designed to learn unique feature-wise attention scores for the whole input deep feature set, and we demonstrated in the previous Sects. 5.1, 5.2, 5.3 and 5.4 that it significantly improves the aggregation performance over dynamic feature sets. In this section, we further investigate the advantage of this feature-wise attentive pooling over element-wise attentional aggregation.
Table 15 Mean IoU of different training algorithms on all 13 categories in ShapeNet\(_{\text {r2n2}}\) testing split
For element-wise attentional aggregation, the AttSets module instead learns a single attention score for each element of the feature set \({\mathcal {A}} = \{{\varvec{x}}_1, {\varvec{x}}_2, \ldots , {\varvec{x}}_N\}\), followed by softmax normalization and weighted summation pooling. In particular, as shown in Fig. 2, the shared function \(g({\varvec{x}}_n, {\varvec{W}})\) now learns a scalar, instead of a vector, as the attention activation for each input element. Eventually, all features within the same element are weighted by a common learnt attention score. Intuitively, the original feature-wise AttSets performs fine-grained aggregation, while the element-wise AttSets aggregates features coarsely.
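The difference between the two schemes reduces to the shape of the attention activations produced by \(g\): a vector per element for feature-wise attention versus a single scalar per element that is broadcast across all features. A NumPy sketch (biases omitted, shapes illustrative):

```python
import numpy as np

def aggregate(X, scores):
    """Softmax-normalise the scores over the N elements, then weighted-sum the set."""
    A = np.exp(scores - scores.max(axis=0, keepdims=True))
    A = A / A.sum(axis=0, keepdims=True)
    return (A * X).sum(axis=0)

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 1024))                 # N = 5 views, 1024-d features
W_fw = rng.standard_normal((1024, 1024)) * 0.01    # feature-wise: g outputs a vector
W_ew = rng.standard_normal((1024, 1)) * 0.01       # element-wise: g outputs a scalar

y_fw = aggregate(X, X @ W_fw)   # a distinct attention score per feature and per view
y_ew = aggregate(X, X @ W_ew)   # one score per view, broadcast across all 1024 features
print(y_fw.shape, y_ew.shape)   # (1024,) (1024,)
```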
Following the same training settings of experiment Group 4 in Sect. 5.1, we conduct another group of experiments on the ShapeNet\(_{\text {r2n2}}\) dataset for element-wise attentional aggregation. Table 14 compares the mean IoU for 3D object reconstruction with feature-wise and element-wise attentional aggregation, and Fig. 18 shows an example of the learnt attention scores and the predicted 3D shapes. As expected, the feature-wise attention mechanism clearly achieves better aggregation performance than the coarse element-wise approach. As shown in Fig. 18, the element-wise attention mechanism tends to focus on a few images while completely ignoring the others, whereas the feature-wise AttSets learns to fuse information from all images, thus achieving better aggregation performance.
Significance of FASet Algorithm
In this section, we investigate the impact of the FASet algorithm by comparing it with standard end-to-end joint training (JoinT), in which all parameters \(\varTheta _{base}\) and \(\varTheta _{att}\) are jointly optimized with a single loss. Following the same training settings of experiment Group 4 in Sect. 5.1, we conduct another group of experiments on the ShapeNet\(_{\text {r2n2}}\) dataset under the JoinT training strategy. As its IoU scores in Table 15 show, the JoinT approach tends to optimize the whole network towards the multi-view training batches, and is therefore unable to generalize well to fewer images during testing. Essentially, unless it is trained with the proposed FASet algorithm, the network cannot dedicate the base layers to learning visual features and the AttSets module to learning attention scores. The theoretical reason is discussed in Sect. 4.1. The FASet algorithm may also be applicable to other learning based aggregation approaches, as long as the aggregation module can be decoupled from the base encoder/decoder.