Attentional Aggregation of Deep Feature Sets for Multi-view 3D Reconstruction

We study the problem of recovering an underlying 3D shape from a set of images. Existing learning-based approaches usually resort to recurrent neural nets, e.g., GRUs, or intuitive pooling operations, e.g., max/mean pooling, to fuse multiple deep features encoded from input images. However, GRU-based approaches are unable to consistently estimate 3D shapes given the same set of input images, as the recurrent unit is permutation variant. They are also unlikely to refine the 3D shape given more images, due to the long-term memory loss of GRUs. The widely used pooling approaches are limited to capturing only first order/moment information, ignoring other valuable features. In this paper, we present a new feed-forward neural module, named AttSets, together with a dedicated training algorithm, named JTSO, to attentionally aggregate an arbitrarily sized deep feature set for multi-view 3D reconstruction. AttSets is permutation invariant, computationally efficient, flexible, and robust to multiple input images. We thoroughly evaluate various properties of AttSets on large public datasets. Extensive experiments show that AttSets together with the JTSO algorithm significantly outperforms existing aggregation approaches.


Introduction
Given a set of images, recovering a geometric representation of the 3D world is classically defined as multi-view 3D reconstruction in computer vision. Traditional pipelines such as Structure from Motion (SfM) [20] and visual Simultaneous Localization and Mapping (vSLAM) [3] typically rely on hand-crafted feature extraction and matching across multiple views to reconstruct the underlying 3D model. However, if the multiple viewpoints are separated by a large baseline, feature matching becomes extremely challenging due to significant appearance changes or self occlusions [18]. Furthermore, the reconstructed 3D shape is usually a sparse point cloud without geometric details.
Recently, a number of deep learning approaches, such as 3D-R2N2 [6], LSM [15], DeepMVS [11] and RayNet [21], have been proposed to estimate a dense 3D shape from multiple images and have shown encouraging results. Both 3D-R2N2 [6] and LSM [15] formulate multi-view reconstruction as a sequence learning problem, and leverage RNNs, particularly GRUs, to fuse the multiple deep features extracted by a shared encoder from the input images. However, there are three limitations. First, the recurrent network is permutation variant, as the order of the input sequence matters [27]. Therefore, inconsistent 3D shapes are estimated from the same image set under different permutations. Second, it is difficult to capture long-term dependencies in the sequence because of gradient vanishing or exploding [2] [16], so the estimated 3D shapes are unlikely to be refined even if more images are given during training and testing. Third, the RNN unit is inefficient, as each element of the input sequence must be sequentially processed without parallelization [19], so it is time-consuming to generate the final 3D shape given a sequence of images. The recent DeepMVS [11] applies max pooling to aggregate deep features across a set of unordered images for multi-view stereo reconstruction, while RayNet [21] makes use of average pooling to aggregate the deep features corresponding to the same voxel from multiple images to recover a dense 3D model. Although max and average pooling do not suffer from the above limitations of RNNs, they only capture the first order or moment information from the large deep feature set, totally ignoring other features which might be valuable for accurate 3D shape estimation.

Figure 1: Overview of an attentional aggregation module for multi-view 3D reconstruction.
In this paper, we introduce a simple yet efficient attentional aggregation module, named AttSets, that can be easily included in an existing multi-view 3D reconstruction network to aggregate an arbitrary number of elements of a deep feature set, completely replacing the RNN module or max/average pooling operations. Inspired by the attention mechanism, which shows great success in natural language processing [1] [23], image captioning [29], etc., we design a feed-forward neural layer that can automatically learn to aggregate each element of the input deep feature set. In particular, as shown in Figure 1, given a variable sized deep feature set, whose elements are usually learnt view-invariant visual representations from a shared encoder [21], our AttSets module first learns an attention activation for each latent feature through a standard neural layer (e.g., a fully connected layer, a 2D or 3D convolutional layer), after which an attention score is computed for the corresponding feature. Subsequently, the attention scores are simply multiplied by the original elements of the deep feature set, generating a set of weighted features. At last, the weighted features are aggregated across the different elements of the deep feature set, producing a fixed-size aggregated feature which is then fed into a decoder to estimate 3D shapes. To enable AttSets to learn the desired attention scores for deep feature sets, we further propose a joint-training and separate-optimizing (JTSO) algorithm that decouples the base encoder-decoder, which learns deep features, from the AttSets module, which learns attention scores for feature sets.
Our AttSets is designed with the following desirable properties and advantages over existing approaches for multi-view 3D reconstruction:
• Compared with RNN approaches, AttSets is permutation invariant and is able to capture valuable information across a large number of elements of a set. AttSets consists of standard feed-forward neural nets, and therefore can be parallelized instead of sequentially computed.
• Compared with max/average pooling, AttSets learns scores for all elements of the deep feature set, attentionally aggregating useful features instead of simply capturing first order/moment information.
• In addition, AttSets is flexible enough to embed either standard fully connected or 2D/3D convolutional layers, and can be easily plugged into an existing encoder-decoder net to estimate 3D shapes from a variable number of images without notable increase in memory and computation cost.

Related Work
(1) Multi-view 3D Reconstruction. 3D shapes can be recovered from multiple color images or depth scans. To estimate the underlying 3D shape from multiple color images, classic SfM [20] and vSLAM [3] algorithms first extract and match hand-crafted geometric features [10] and then apply bundle adjustment [26] for both shape and camera motion estimation. Ji et al. [14] use "maximizing rigidity" for reconstruction, but this requires 2D point correspondences across images. Recent deep neural net based approaches tend to recover dense 3D shapes through learnt features from multiple images and achieve compelling results. To fuse the deep features from multiple images, both 3D-R2N2 [6] and LSM [15] apply the recurrent unit GRU, resulting in networks that are permutation variant and inefficient for aggregating long sequences of images. Recent SilNet [28] and DeepMVS [11] simply use max pooling to preserve the first order information of the deep features of multiple images, while RayNet [21] applies average pooling to preserve the first moment information of multiple deep features. MVSNet [31] proposes a variance-based approach to capture the second moment information for multiple feature aggregation. These pooling techniques only capture partial information, ignoring the majority of the deep features. Recent SurfaceNet [13] and SuperPixel Soup [17] can reconstruct 3D shapes from two images, but they are unable to process an arbitrary number of images. As to multiple depth image reconstruction, the traditional volumetric fusion method [7] integrates information from multiple viewpoints by averaging truncated signed distance functions (TSDF). The recent learning based OctNetFusion [24] adopts a similar strategy to integrate multiple depth observations. However, this integration might result in information loss since TSDF values are averaged [24].
(2) Deep Learning on Sets. In contrast to traditional approaches operating on fixed dimensional vectors or matrices, deep learning tasks defined on sets usually require the learnt functions to be permutation invariant and able to process an arbitrary number of elements in a set [32]. Such problems are widespread. Zaheer et al. introduce general permutation invariant and equivariant models in [32], and they end up with a summation pooling for permutation invariant tasks such as population statistics estimation and point cloud classification. In the very recent GQN [8], summation pooling is also used to aggregate an arbitrary number of orderless images for 3D scene representation. Gardner et al. [9] use average pooling to integrate an unordered deep feature set for a classification task. Su et al. [25] use max pooling to fuse the deep feature set of multiple views for 3D shape recognition. Similarly, PointNet [22] also uses max pooling to aggregate the set of features learnt from point clouds for 3D classification and segmentation. In fact, the above summation, average, and max pooling techniques are the most common aggregation operators on sets in mathematics [4]. However, these pooling operations ignore a majority of the information of a set, and they have no trainable parameters for the network to learn.
(3) Attention Mechanism. The attention mechanism was originally proposed for natural language processing [1]. Coupled with RNNs, it achieves compelling results in neural machine translation [1], image captioning [29], image question answering [30], etc. Little work has been done to explore attention mechanisms for learning tasks on sets, which usually require permutation invariance and adaptability to variable cardinality. Compared with the original attention mechanism, our AttSets is not coupled with RNNs. Instead, AttSets is a simplified feed-forward module which shares similar concepts with [23] [12]. However, [23] aims to solve the long-term memory problem for sequence learning, while [12] focuses on multiple instance learning for classification. By contrast, our AttSets and the dedicated JTSO algorithm are designed for general learning tasks on sets.

Problem Definition
This paper considers the problem of aggregating an arbitrary number of elements of a set A to a fixed dimensional output y. Usually, each element of set A is a feature vector extracted from a shared encoder, while the fixed dimensional y is fed into a subsequent decoder, such that the whole network can process an arbitrary number of input elements with a fixed and predefined network architecture.
Given N elements in the input deep feature set A = {x_1, x_2, · · · , x_N}, x_n ∈ R^{1×D}, where N is an arbitrary value while D is fixed for a specific encoder, and the output y ∈ R^{K×D}, where K is also fixed and predefined for the subsequent decoder, our task is to design an aggregation function f with learnable weights W: y = f(A, W), which should be permutation invariant, i.e., for any permutation π:

f({x_1, · · · , x_N}, W) = f({x_π(1), · · · , x_π(N)}, W)    (1)

Basically, the max/mean/sum pooling operations are the simplest instantiations of the function f, where W ∈ ∅. However, these pooling operations are predefined to capture only partial information and have no trainable weights, so they are unable to unleash the power of a standard neural net.
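As a point of reference, these pooling instantiations of f can be sketched in a few lines of NumPy; the helper name and array shapes here are ours, for illustration only:

```python
import numpy as np

def pool_aggregate(A, mode="max"):
    """Aggregate an (N, D) feature set into a (1, D) vector.

    These poolings are the simplest instantiations of f(A, W) with no
    trainable weights W: they are permutation invariant, but keep only
    first order/moment statistics of the set.
    """
    if mode == "max":
        return A.max(axis=0, keepdims=True)
    if mode == "mean":
        return A.mean(axis=0, keepdims=True)
    if mode == "sum":
        return A.sum(axis=0, keepdims=True)
    raise ValueError(mode)

A = np.random.rand(5, 8)            # N = 5 set elements, D = 8
perm = np.random.permutation(5)
for m in ("max", "mean", "sum"):
    # Reordering the set never changes the pooled output.
    assert np.allclose(pool_aggregate(A, m), pool_aggregate(A[perm], m))
```

Note that the output size (1, D) is independent of N, which is what lets a fixed decoder consume a variable number of input views.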

AttSets Module
The basic idea of our AttSets module is to learn an attention score for each latent feature of the whole deep feature set. In this paper, each latent feature refers to each entry of an individual element of the feature set, with an individual element usually represented by a latent vector, i.e., x n . The learnt scores can be regarded as a mask that automatically selects useful latent features across the set. The selected features are then summed across multiple elements of the set.
As shown in Figure 2, given a set of features A = {x 1 , x 2 , · · · , x N }, x n ∈ R 1×D , AttSets aims to fuse it into a fixed dimensional output y, where y ∈ R 1×D , i.e., we set K = 1 for simplicity. First of all, we feed each element of the feature set A into a shared function g which can be a standard neural layer, i.e., a linear transformation layer optionally followed by a non-linear activation function.
Here we use a fully connected layer followed by a tanh layer as an example; the bias term is dropped for simplicity. The output of the function g is a set of learnt attention activations C = {c_1, c_2, · · · , c_N}, where

c_n = g(x_n, W) = tanh(x_n W),    x_n ∈ R^{1×D}, W ∈ R^{D×D}    (2)

Secondly, the learnt attention activations are normalized across the N elements of the set, computing a set of attention scores S = {s_1, s_2, · · · , s_N}. We choose softmax as the normalization operation, so the attention scores for the n-th feature element are

s_n = [s_{n,1}, s_{n,2}, · · · , s_{n,D}],    s_{n,d} = exp(c_{n,d}) / Σ_{j=1}^{N} exp(c_{j,d})    (3)

Thirdly, the computed attention scores S are multiplied by the corresponding original feature set A, generating a new set of deep features, denoted as the weighted feature set O = {o_1, o_2, · · · , o_N}, where

o_n = x_n * s_n    (4)

and * denotes the element-wise product. Lastly, the set of weighted features O is summed up across the total N elements to get a fixed-size feature vector, denoted as y = [y_1, y_2, · · · , y_d, · · · , y_D], where

y_d = Σ_{n=1}^{N} o_{n,d}    (5)

In the above formulation, we show how AttSets gradually aggregates a set of N feature vectors A into a single vector y, where y ∈ R^{1×D}. If we want to learn a larger size of aggregated features y, where K > 1, we simply add another parallel branch of layers to learn another set of different attention scores for the original set of features. In this case, the number of learnable weights W and the capacity of AttSets increase accordingly, but the AttSets module is still computed in parallel.
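The four steps above can be sketched in NumPy; this is a minimal illustration of the fc + tanh instantiation with the bias dropped as in the text, not the released implementation:

```python
import numpy as np

def attsets(A, W):
    """AttSets forward pass for a feature set A of shape (N, D).

    W has shape (D, D): a shared fully connected layer (bias dropped)
    followed by tanh produces per-entry attention activations; softmax
    normalizes them over the N set elements; the weighted features are
    then summed into a single (D,) vector.
    """
    C = np.tanh(A @ W)                     # attention activations
    S = np.exp(C) / np.exp(C).sum(axis=0)  # softmax across the set
    O = A * S                              # weighted feature set
    return O.sum(axis=0)                   # fixed-size aggregated feature y

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 16))               # N = 6 views, D = 16
W = rng.normal(size=(16, 16))
y = attsets(A, W)                          # shape (16,) regardless of N
```

Feeding sets of different cardinality through the same `attsets` call always yields a D-dimensional output, so the subsequent decoder can be fixed in advance.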

Permutation Invariance
The output y of the AttSets module is permutation invariant with regard to the input deep feature set A.
Here is a simple proof for the case where y is a single vector. Substituting Equations (2)-(5), the d-th entry of the output y is computed as

y_d = Σ_{n=1}^{N} x_{n,d} exp(tanh(x_n w_d)) / Σ_{j=1}^{N} exp(tanh(x_j w_d))    (6)

where w_d is the d-th column of the weights W. Both the denominator and the numerator of Equation (6) are summations of permutation equivariant terms. Therefore the value y_d, and hence the full vector y, is invariant to different permutations of the deep feature set A = {x_1, x_2, · · · , x_n, · · · , x_N} [32]. For an aggregated feature y ∈ R^{K×D}, where K > 1, the AttSets module is still permutation invariant because all the aggregated feature vectors are learnt in parallel and independently.

Figure 3: Implementation of AttSets with fully connected layer, 2D ConvNet, and 3D ConvNet.
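The invariance can also be verified numerically; a small self-contained check under the same fc + tanh instantiation (toy sizes, ours):

```python
import numpy as np

def attsets(A, W):
    # Softmax normalizes over the N set elements, so permuting the rows
    # of A permutes numerator and denominator terms of y_d identically.
    C = np.tanh(A @ W)
    S = np.exp(C) / np.exp(C).sum(axis=0)
    return (A * S).sum(axis=0)

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 32))          # N = 8 elements, D = 32
W = rng.normal(size=(32, 32))
y = attsets(A, W)
for _ in range(5):
    perm = rng.permutation(8)
    # Any permutation of the set produces the same aggregated vector.
    assert np.allclose(y, attsets(A[perm], W))
```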

Implementation
In the above Section 3.2, our AttSets aggregates a set of an arbitrary number of vector features into a fixed number of vectors, where the attention activation learning function g embeds a fully connected (fc) layer. AttSets is also flexible and can easily be implemented with both 2D and 3D convolutional neural layers to aggregate both 2D and 3D deep feature sets. In particular, as shown in Figure 3, to aggregate a set of 2D features, i.e., a tensor of (width × height × channels), the attention activation learning function g embeds a standard conv2d layer with a stride of (1 × 1). Similarly, to fuse a set of 3D features, i.e., a tensor of (width × height × depth × channels), the function g embeds a standard conv3d layer with a stride of (1 × 1 × 1). Compared with the fc enabled AttSets, the conv2d or conv3d based AttSets tends to have fewer learnable weights. Note that both the conv2d and conv3d based AttSets are still permutation invariant, as the function g is shared across all elements of the deep feature set and it does not depend on the order of the elements [32].
We have already shown the implementation of an fc based AttSets to aggregate vector features in the previous Section 3.2. Similarly, a conv2d or conv3d based AttSets can be plugged into a 2D encoder or a 3D decoder to fuse 2D/3D feature sets with minimal deployment cost, which is studied in Section 4.2.
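A NumPy sketch of the conv2d based variant follows; we assume a 1 × 1 kernel (the text only specifies the stride), so the shared convolution reduces to a per-location channel mixing:

```python
import numpy as np

def attsets_conv2d(A, kernel):
    """conv2d based AttSets sketch for a set of 2D feature maps.

    A: (N, H, W, C) feature set; kernel: (C, C) weights of an assumed
    1x1 convolution shared across all set elements.  The shared conv +
    tanh produces activations, softmax normalizes over the N elements
    at every spatial location and channel, and the weighted maps are
    summed over the set.
    """
    act = np.tanh(A @ kernel)                  # shared 1x1 conv + tanh
    S = np.exp(act) / np.exp(act).sum(axis=0)  # softmax over the set
    return (A * S).sum(axis=0)                 # (H, W, C) aggregated map

rng = np.random.default_rng(2)
feats = rng.normal(size=(5, 4, 4, 256))        # N = 5 views of (4, 4, 256) maps
kernel = rng.normal(size=(256, 256))
out = attsets_conv2d(feats, kernel)
assert out.shape == (4, 4, 256)                # fixed size for any N
```

Since g is shared across set elements and the normalization runs only along the set axis, the same permutation-invariance argument from Section 3.2 applies unchanged.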

Training Algorithm
Our AttSets module can be easily plugged into an existing encoder-decoder multi-view 3D reconstruction network, replacing the RNN unit or pooling operation. Basically, in an AttSets enabled encoder-decoder net, the encoder-decoder serves as the base architecture to learn visual features for shape estimation, while the AttSets module learns to assign different attention scores to combine those features, instead of learning visual features itself. As such, the base network tends to have generality with regard to different input image content, while the AttSets module tends to be general regarding an arbitrary number of input images.
To train an AttSets enabled network, a naive approach is to apply a unified end-to-end training strategy, treating AttSets as a standard layer in the middle. However, the unified training paradigm may optimize the whole network with respect to the statistics of the training image batches, resulting in less generality overall. Therefore, we propose a joint-training separate-optimizing (JTSO) approach for an AttSets enabled network. In particular, the trainable weights of the encoder-decoder base network are denoted as Θ base , the trainable weights of the AttSets module are denoted as Θ att , and the loss function of the whole network is denoted as ℓ(Θ base , Θ att ), which is determined by the specific supervision signal of the base network. Our JTSO is shown in Algorithm 1.
Algorithm 1 Joint-training separate-optimizing of an AttSets enabled network. M is batch size, N is image number, k1 and k2 are hyperparameters. We use k1 = k2 = 1 in our experiments.
The gradient-based updates can use any algorithm.
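Algorithm 1 can be sketched as the following toy training loop; the gradient functions are placeholders standing in for backpropagation through the real encoder-decoder, and all sizes are illustrative:

```python
import numpy as np

# JTSO sketch (Algorithm 1): the base encoder-decoder weights are
# updated only on single-view batches (N = 1), and the AttSets weights
# only on multi-view batches (N > 1), with the other set held fixed.
rng = np.random.default_rng(0)
theta_base = rng.normal(size=8)      # encoder-decoder weights (toy)
theta_att = rng.normal(size=8)       # AttSets weights (toy)

def grad_wrt_base(batch):            # placeholder for d loss / d theta_base
    return 0.1 * theta_base

def grad_wrt_att(batch):             # placeholder for d loss / d theta_att
    return 0.1 * theta_att

k1 = k2 = 1                          # as in our experiments
lr = 1e-4
for step in range(100):
    for _ in range(k1):              # single-view batch: optimize base only
        batch = rng.normal(size=(2, 1, 8, 8, 3))   # M = 2, N = 1 (toy images)
        theta_base = theta_base - lr * grad_wrt_base(batch)
    for _ in range(k2):              # multi-view batch: optimize AttSets only
        n = int(rng.integers(2, 25)) # random N in [2, 24]
        batch = rng.normal(size=(2, n, 8, 8, 3))
        theta_att = theta_att - lr * grad_wrt_att(batch)
```

The key point is the alternation: each parameter group only ever sees gradients from the batch type it is responsible for, which is what decouples feature learning from attention learning.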

Evaluation
To evaluate the performance and various properties of AttSets, we choose 3D-R2N2 [6] as the base network. The original 3D-R2N2 consists of (1) a shared ResNet-based 2D encoder which encodes 127 × 127 × 3 images into 1024 dimensional latent vectors, (2) a GRU module which fuses N 1024 dimensional latent vectors into a single 4 × 4 × 4 × 128 tensor, and (3) a ResNet-based 3D decoder which decodes the single tensor into a 32 × 32 × 32 voxel grid representing the 3D shape. The released dataset consists of 13 categories of 43,783 common objects, with RGB images synthesized from the large scale ShapeNet 3D repository [5]. For each 3D object, 24 images are rendered from different viewing angles circling around the object. The train/test dataset split is 0.8 : 0.2.
Intersection-over-Union (IoU) is used to evaluate the reconstruction performance [6].

Comparison with GRU and Pooling Operations
To compare with the existing GRU module [6][15] and the widely used max/mean/sum pooling operations [28][11] [21], we replace the GRU module of 3D-R2N2 with our fc based AttSets and with each of the max/mean/sum poolings, keeping all other neural layers untouched. Architecture details are in the Appendix A. All networks are trained from scratch, with the image number N = 24 and learning rate = 0.0001, the same as in 3D-R2N2 [6], and batch size M = 2, on a single Titan X GPU. As there is no validation dataset split, to calculate the IoU scores we independently search for the optimal binarization threshold value from 0.2 to 0.8 with a step of 0.05 for each approach for fair comparison. In our experiments, we found that the optimal thresholds of all approaches end up at 0.3 or 0.35. We use the authors' released well-trained 3D-R2N2 weights to calculate its IoU.
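The threshold search described above can be sketched as follows; the voxel grids here are synthetic stand-ins, and the helper names are ours:

```python
import numpy as np

def voxel_iou(pred, gt, threshold):
    """IoU between a predicted occupancy grid (probabilities in [0, 1])
    and a binary ground-truth grid, after binarizing pred at threshold."""
    p = pred >= threshold
    inter = np.logical_and(p, gt).sum()
    union = np.logical_or(p, gt).sum()
    return inter / union if union > 0 else 1.0

def best_threshold_iou(pred, gt):
    # Search 0.2 .. 0.8 in steps of 0.05, as in the evaluation protocol.
    thresholds = np.arange(0.2, 0.8001, 0.05)
    ious = [voxel_iou(pred, gt, t) for t in thresholds]
    i = int(np.argmax(ious))
    return thresholds[i], ious[i]

rng = np.random.default_rng(4)
gt = rng.random((32, 32, 32)) > 0.7                         # toy ground truth
pred = np.clip(gt * 0.8 + rng.random(gt.shape) * 0.3, 0, 1)  # toy prediction
t, iou = best_threshold_iou(pred, gt)
```

In the actual protocol, one threshold is selected per approach over the whole test set rather than per object; the per-object search here is purely illustrative.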
Aggregation Performance. Table 1 shows the mean IoU scores of different approaches on all 13 categories in the 3D-R2N2 testing dataset, while Figure 5 shows the trends of IoU changes. Table 2 highlights per-category IoU scores for single view reconstruction. During testing, the permutation of the images is the same across different approaches for fair comparison. Figure 4 shows the estimated 3D shapes for an increasing number of images for different approaches. Our AttSets based approach outperforms all others by a large margin for either single view or multi view reconstruction, and generates much more compelling 3D shapes.
Analysis.
(1) The GRU based approach can generate reasonable 3D shapes given a few images, but the performance saturates quickly after more images are given, e.g., 8 views, because the recurrent unit can hardly capture features from longer image sequences (Figure 5, part 1). (2) All pooling based approaches are able to estimate satisfactory 3D shapes once given enough images, e.g., 12 views, but they are unlikely to learn reasonable shapes given only a few images (Figure 5, part 2), because the pooled features from fewer images are unlikely to be as general and representative as pooled features from more images. (3) Our JTSO algorithm completely decouples the base network to learn visual features for accurate single view reconstruction (Figure 5, part 3), while the trainable weights of the AttSets module are separately responsible for learning attention scores for better multi-view reconstruction (Figure 5, part 4). Therefore, the whole network does not suffer from the limitations of GRU or pooling approaches, and achieves superior performance for reconstruction from either few or many images. More results are in the Appendix B.1.
Permutation Invariance. To evaluate the permutation invariance of different approaches, we choose the bench category of the 3D-R2N2 testing dataset for experiments. In particular, for each object, we randomly select 3 images, with 6 different permutations in total, for testing. As shown in Figure 6, the mean IoU scores of 3D-R2N2 fluctuate for different image permutations, although the model has been officially trained with various image permutations by the authors [6]. In contrast, the AttSets based approach is completely insensitive to input image permutations, and its mean IoU score consistently achieves 0.601. The pooling approaches are also permutation invariant, but their mean IoU scores are all below 0.5. Qualitative results are in Appendix C.
Computation Efficiency. To evaluate the computation and memory cost of AttSets, we implement all nets in Python 2.7 and Tensorflow 1.6 with CUDA 9.0 and cuDNN 7.1 as the back-end driver and library. All approaches share the same base network and run in the same Titan X and software environments. Table 3 shows the average time consumption to reconstruct a single 3D object given different numbers of images. Our AttSets based approach is as efficient as all pooling based methods, while 3D-R2N2 takes more time when processing an increasing number of images due to the sequential computation mechanism of its GRU module. In terms of the total trainable weights, all pooling based approaches have 16.66 million, while the AttSets based net has 17.71 million. By contrast, the original 3D-R2N2 has 34.78 million. Overall, our AttSets module is able to replace the recurrent unit or pooling operations without incurring notable computation and memory cost.
Generality and Robustness. We further evaluate the generality and robustness of an AttSets enabled network. In particular, all the well-trained models, i.e., trained on the 3D-R2N2 training data split, are tested on the synthesized images in the LSM dataset [15], which have totally different camera viewing angles and lighting sources; however, both the 3D-R2N2 and LSM datasets are generated from the same ShapeNet 3D repository [5], i.e., they share the same ground truth labels for the same object. Note that we only borrow the synthesized images from the LSM dataset corresponding to the objects in the 3D-R2N2 testing data split, i.e., none of the trained models has seen either the style of the LSM synthesized images or the 3D object labels before. We resize the images of the LSM dataset from 224 × 224 to 127 × 127 through linear interpolation.
As shown in Table 4, the IoU scores of all pooling approaches increase given more images, but their overall performance is inferior, primarily because the learnt first order/moment features are unable to generalize well across different styles of synthesized images. 3D-R2N2 has fair generality, but it is unable to fuse more information after being given 8 views because of GRU memory loss. By contrast, our AttSets achieves satisfactory generality even when given a single image, and it can effectively aggregate information from more images without suffering from early saturation. More quantitative results are in the Appendix B.2. We also evaluate all approaches on real world RGB images; our AttSets is fairly robust and outperforms the others. Qualitative results are in the Appendix C.
AttSets Variants. We further compare the fc, conv2d and conv3d based implementations of AttSets. The conv2d based AttSets is plugged into the middle of the 2D encoder, integrating a (N, 4, 4, 256) tensor into (1, 4, 4, 256), where N is an arbitrary image number. The conv3d based AttSets is plugged into the middle of the 3D decoder, integrating a (N, 8, 8, 8, 128) tensor into (1, 8, 8, 8, 128). All other layers of these variants are the same. Architecture details are in the Appendix D. Table 5 shows the mean IoU scores of the three variants on the 3D-R2N2 testing dataset, and more results are in the Appendix E. The fc and conv3d based variants achieve similar IoU scores for either single or multi view 3D reconstruction, demonstrating the superior aggregation capability of AttSets. In the meantime, we observe that the overall performance of the conv2d based AttSets net is slightly less effective compared with the other two. One possible reason is that the 2D feature set is aggregated at an early layer of the network, so some features are lost early. Figure 7 visualizes the learnt attention scores for a 2D feature set, i.e., (N, 4, 4, 256) features, via the conv2d based AttSets net. To visualize the 2D feature scores, we average the scores along the channel axis and then roughly trace back the spatial locations of those scores in the original input.
The more visual information the input image contains, the higher the attention scores automatically learnt by AttSets for the corresponding latent features. For example, the fourth image has richer visual information than the third image, so its attention scores are higher. More visualization results are in the Appendix F. Note that, for a specific base network, there are many potential locations to plug in AttSets, and it is also possible to plug multiple AttSets modules into the same net. Searching for the optimal location and strategy to integrate AttSets is left for future work.

Impact of JTSO Algorithm
In this section, we investigate the impact of our JTSO algorithm by comparing it with the standard end-to-end joint training and optimizing approach (JTO). In particular, in JTO, all parameters Θ base and Θ att of the same fc based AttSets net are jointly trained and optimized with a single loss from scratch. As its IoU scores in Table 6 show, the JTO training approach tends to optimize the whole net with respect to the training multi-view batches, and is thus unable to generalize well for fewer images. Basically, unless it is trained with the JTSO algorithm, the network itself is unable to dedicate the base layers to learning general features while the AttSets module learns attention scores for different elements of a set. This issue also exists in the widely used pooling based networks, which are unable to wisely aggregate an arbitrary number of deep features; moreover, the pooling based approaches have no extra trainable weights to deal with multiple elements of a set.

Conclusion
In this paper, we present the AttSets module together with the JTSO training algorithm to aggregate elements of deep feature sets. AttSets is permutation invariant, computationally efficient, robust, and flexible to implement, and both theory and extensive experiments support its performance for multi-view 3D reconstruction. Both quantitative and qualitative results explicitly show that AttSets significantly outperforms other widely used aggregation approaches. Our future work is to integrate AttSets into more multi-view reconstruction networks such as LSM [15]. In addition, we also plan to apply AttSets to other general learning tasks on sets [32].

Appendix:
A Architecture of 3D-R2N2, Base-max/mean/sum pool, and Base-AttSets

B Per-category Mean IoU Scores
Table 7, 8 and 9 show the per-category mean IoU scores of different approaches for multi-view reconstruction on the 3D-R2N2 testing dataset.

C Qualitative Results
All models, which are trained on the synthetic 3D-R2N2 training dataset, are further tested on real world images crowdsourced from Amazon online shops. As shown in Figure 13, 3D-R2N2 estimates inconsistent 3D shapes given different image permutations, while the pooling approaches and our fc based AttSets net generate permutation invariant 3D shapes. AttSets generates more compelling 3D shapes, demonstrating stronger generality and robustness to real world images compared with the 3D-R2N2 and pooling based approaches. Note that we manually search for an appropriate visualization threshold for each approach. The same binarization threshold is applied for visualizing all different permutation results of 3D-R2N2.

D Architecture of fc, conv2d and conv3d based Base-AttSets

E Per-category Mean IoU of AttSets Variants
The following Table 14, 15 and 16 show the per-category mean IoU scores of fc, conv2d and conv3d based AttSets for multi-view reconstruction on the 3D-R2N2 testing dataset, and Table 17, 18 and 19 show the corresponding scores on the LSM dataset.

F Visualization of Learnt Attention Scores
Figure 17 visualizes the learnt attention scores for each latent 1024d vector encoded from an input image. The more visual features the input image has, the higher the attention scores automatically learnt for the latent vector of that image overall. For example, the fourth image tends to have more visual information than the third one. Therefore, the fourth column is darker, i.e., has higher attention scores, than the third column.
Figure 17: Visualization of the learnt attention scores for each latent 1024d vector encoded from an input image.