# Deep Shape from a Low Number of Silhouettes


## Abstract

Despite strong progress in the field of 3D reconstruction from multiple views, holes on objects, transparency of objects and textureless scenes continue to be open challenges. On the other hand, silhouette-based reconstruction techniques ease the dependency of 3D reconstruction on image pixels, but need a large number of silhouettes to be available from multiple views. In this paper, a novel end-to-end pipeline is proposed to produce high-quality reconstructions from a low number of silhouettes, the core of which is a deep shape reconstruction architecture. Evaluations on ShapeNet [1] show good reconstruction quality compared with the ground truth.

## Keywords

Deep 3D reconstruction · End-to-end architecture · Silhouettes

## 1 Introduction

3D geometry reconstruction techniques have made significant progress over the past two decades, from theoretical development to software implementation. The aim of both is to build high-quality 3D reconstructions of scenes and objects from 2D or 2.5D source information in terms of image pixels, depth and other source data. Current progress in this field involves wider application of 3D reconstruction to real-world problems, including video data, large-scale scene reconstruction, light field reconstruction and other applications. However, several questions remain open challenges for 3D reconstruction, such as holes, wrinkles, coarse regions and other unwanted artifacts in the rebuilt 3D world when reconstructing transparent objects, textureless scenes and other challenging objects and scenes (Fig. 1).

The second group of techniques builds the 3D real world in a grid space consisting of voxels. Priors such as connectivity priors and surface orientation priors are represented as data constraint terms. These terms are computed within a mathematical framework such as a multi-label convex framework or an MRF pipeline to provide solutions to the open challenges. In addition, sufficient viewpoint information, including the camera parameter matrix of each viewpoint, a large number of silhouettes and 2D image pixels, underpins the quality of 3D reconstruction in the grid space.

The proposed pipeline is a deep reconstruction pipeline consisting of two reconstruction stages. The first stage is a coarse shape reconstruction stage: it takes a small number of silhouettes as input, with known camera parameters for the associated viewpoints, and produces a coarse visual hull. The second stage is a deep shape reconstruction stage, built on a deep shape reconstruction architecture. This proposed 3D convolutional network (3D-CNN) architecture reconstructs good-quality shapes from coarse shapes. The pipeline as currently proposed is designed for reconstructing category-specific object shapes.
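The coarse shape reconstruction stage (silhouettes to visual hull) can be sketched as follows. This is a minimal illustration that assumes orthographic projections along the grid axes; the paper's pipeline uses full camera parameter matrices per viewpoint, which would replace the simple broadcasting step with a per-voxel projection:

```python
import numpy as np

def coarse_visual_hull(silhouettes, axes, n=50):
    """Carve a coarse visual hull H_k from k binary silhouettes.

    Simplified sketch: each silhouette is treated as an orthographic
    projection of the n^3 voxel grid along one coordinate axis.  A voxel
    survives only if it projects inside every silhouette.
    """
    hull = np.ones((n, n, n), dtype=bool)
    for sil, axis in zip(silhouettes, axes):
        # Broadcasting the 2D silhouette mask back along its projection
        # axis carves away everything outside the silhouette cone.
        hull &= np.expand_dims(sil, axis=axis)
    return hull
```

With a perspective camera, the broadcasting line would instead project each voxel centre through the camera matrix and test it against the silhouette image.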

Two contributions are made. First, this pipeline produces high-quality 3D reconstructions from a low number of silhouettes. It does not depend on image pixels, depth data, etc. Therefore, this silhouette-based technique is a potential solution for 3D reconstruction of transparent objects or objects with textureless parts. Second, compared with techniques relying on a large number of images taken from different viewpoints, the proposed pipeline reduces the number of views required.

This paper is organized as follows. We review related work in Sect. 2. We formulate the problem in Sect. 3. Both the solution to the formulated problem and the proposed reconstruction pipeline are presented in Sect. 4. Evaluation details, including the dataset, training, testing and results, are given in Sect. 5. Finally, conclusions and future work for our pipeline are discussed in Sect. 6.

## 2 Related Work

End-to-end deep learning architectures have been applied successfully to a variety of vision problems such as segmentation [2, 3, 4, 5], edge prediction [6], classification [7], optical flow prediction [8], depth prediction [9], keypoint prediction [4] and feature learning [10, 11]. Also, fully convolutional networks have proven their efficiency on one-dimensional input strings [12] (extended from LeNet [13]), two-dimensional detection [14, 15] with learning and inference, three-dimensional representation [16], and volumetric 3D shape classification and interpolation [17].

Convolutional Neural Networks (CNNs) have proven their efficiency in improving 3D reconstruction. For example, a CNN has been developed for predicting surface normals from a single image [18]. A combined framework for viewpoint estimation and local keypoint prediction has been proposed through the application of a convolutional neural network architecture [19]. A convolutional neural network has also been built that performs extremely well for stereo matching [20].

Unlike the above methods, end-to-end deep architectures have been built for 3D reconstruction. For example, a convolutional network is applied to build 3D models from single images [21]; a 3D recurrent reconstruction neural network (3D-R2N2) unifies single-view and multi-view 3D object reconstruction [22]; semantic deformation flows are learned with 3D convolutional networks to improve 3D reconstruction [23]; and 3D volumetric reconstruction is learned from a single view with projective transformations [24]. However, these deep 3D reconstruction architectures depend on image pixels, transformations, etc. In contrast, we propose a novel reconstruction pipeline that runs directly from a small number of silhouettes at the input end to a 3D reconstruction at the output end.

## 3 Problem Formulation

We aim to design a function *V* of the Visual Hull \(H_k\) (inferred from *k* silhouettes [25]) that reconstructs a shape as close as possible to the Ground Truth (*GT*) shape: \(V(H_k)\approx GT\). The visual hull \(H_k\) converges to a limit shape *H* as the number *k* of silhouettes increases such that \(\lim _{k\rightarrow \infty } H_k = H\), where *H* is the best approximation possible of the object shape estimated from silhouettes. However, *H* itself is often far away from the true shape *GT*, as concave areas fail to be recovered by Shape-from-Silhouette techniques [26].

To avoid these artifacts, we propose an improvement on the visual hull inferred by shape-from-silhouette techniques using a Bayesian-like framework, where we aim at using prior information about the object category to design a function \(V(H_k)=V_k\) that is as close as possible to *GT*. The function *V* is designed using a deep neural network and is trained on the ShapeNet dataset [1]. Our formulation is tested with \(k=2,3,4,5\) silhouettes available as input to the pipeline.

## 4 Two-Stage Deep Shape Reconstruction Pipeline

### 4.1 Architecture

Both the input end \(H_{k}\) and the output end \(V_{k}\) take the form of a voxel volume with binary voxel values. The key components of the deep learning architecture we built for deep shape reconstruction are the convolutional encoding layers (recognition network) and the convolutional decoding layers (generative network). As shown in Fig. 2, our deep shape reconstruction architecture has three 3D convolutional layers and three 3D deconvolutional layers. From the input end to the output end, we use different filter sizes. In the recognition network, the filter size shrinks from \(4\times 4\times 4\) to \(2\times 2\times 2\) while the number of filters increases from 64 to 1024. Conversely, in the generative network, the filter size grows from \(2\times 2\times 2\) to \(4\times 4\times 4\) while the number of filters decreases from 1024 to 64.

### 4.2 Convolution and Deconvolution
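The spatial bookkeeping of the convolution and deconvolution layers follows the standard output-size formulas. The stride and padding values in the example below are illustrative assumptions (the paper specifies only the kernel sizes, \(4\times 4\times 4\) down to \(2\times 2\times 2\), and the filter counts, 64 up to 1024), chosen so that a mirrored deconvolution exactly undoes its paired convolution:

```python
def conv3d_out(n, kernel, stride=1, padding=0):
    """Spatial size along one axis after a 3D convolution."""
    return (n + 2 * padding - kernel) // stride + 1

def deconv3d_out(n, kernel, stride=1, padding=0):
    """Spatial size along one axis after a 3D transposed convolution
    (the inverse of conv3d_out when the sizes divide evenly)."""
    return (n - 1) * stride - 2 * padding + kernel

# With an assumed stride of 2 and padding of 1, a 4x4x4 kernel halves the
# 50^3 grid, and the mirrored deconvolution restores it exactly.
encoded = conv3d_out(50, kernel=4, stride=2, padding=1)        # 50 -> 25
decoded = deconv3d_out(encoded, kernel=4, stride=2, padding=1)  # 25 -> 50
```

When the division is not exact, transposed-convolution layers typically need an extra output-padding term to recover the original resolution.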

## 5 Evaluation

Evaluations are conducted for the proposed pipeline. First, a deep shape model is trained using the ShapeNet [1] dataset. Second, tests are carried out to reconstruct 3D shapes from a low number of silhouettes. Finally, the reconstructed 3D shapes are compared with the ground truth shapes and the reconstruction errors are calculated. We ran 4 experiments; the goal of each experiment is to reconstruct the 3D shapes of instances of a specific object category. For each experiment, a category-specific deep shape model is trained and the reconstruction accuracy is then evaluated.

### 5.1 Dataset

ShapeNet is a richly-annotated, large-scale repository of shapes containing 3D CAD models of objects for a rich number of object categories. ShapeNet contains more than 3,000,000 models, 220,000 of which are classified into 3,135 categories, and it has been used in a range of deep 3D shape research [27, 28, 29]. Here, we use a subset of ShapeNet to train our category-specific deep shape reconstruction models. For example, we use 372 car CAD models to train our car deep shape reconstruction network, and then 110 car CAD models to test the performance of our pipeline. The numbers of CAD models used for training the other three categories, planes, motorbikes and chairs, are 372, 277 and 315 respectively; the numbers used for testing are 107, 164 and 62 respectively.

### 5.2 Training

### 5.3 Testing

### 5.4 Results

The shapes reconstructed in the two stages of our deep 3D reconstruction pipeline are presented. For the 4 view arrangements and 4 object categories tested, both the coarse shapes and the final shapes are shown qualitatively and quantitatively. Both the qualitative and quantitative results show that the deep 3D reconstruction pipeline is capable of reconstructing category-specific 3D shapes from a small number of silhouettes as input. They also demonstrate that the network improves the reconstruction after the coarse shape stage: the reconstruction error between \(V_k\) and *GT* is shown to be smaller than the reconstruction error between \(H_k\) and *GT*, showing that the deep shape reconstruction architecture reconstructs a \(V_k\) that is better than \(H_k\).

**Qualitative Results.** Figures 6, 7, 8 and 9 visualize the shape reconstructions for four object categories: cars, planes, motorbikes and chairs. The figures show the ground truth shapes alongside the shapes reconstructed in both the coarse shape reconstruction stage and the deep shape reconstruction stage.

**Quantitative Results.** In order to obtain quantitative measures of the reconstructed shapes, we evaluate the 3D reconstruction in two ways. The first is the mean square error between a 3D voxel reconstruction before thresholding and its voxelized ground truth model. The second is the voxel Intersection-over-Union (IoU) between the 3D voxel reconstruction and the ground truth model. More formally, let \(vp_{i}\in [0,1]\) be the predicted occupancy of voxel *i* in the grid space before thresholding, and let \(Gvp_{i}\in \{0,1\}\) be the corresponding ground truth occupancy. The mean square error is

\(MSE = \frac{1}{n}\sum _{i=1}^{n}(vp_{i}-Gvp_{i})^{2}.\)

Lower error indicates better reconstruction. Note that we train and test in a \(50\times 50\times 50\) grid space, so the total number of voxels is \(n=50\times 50\times 50\). The voxel IoU is

\(IoU = \frac{\sum _{i} I(vp_{i}\ge t)\, I(Gvp_{i})}{\sum _{i} I\big ( I(vp_{i}\ge t)+I(Gvp_{i})\big )},\)

where *I*(.) is an indicator function and \(t\in [0,1]\) is a voxelization threshold. Higher IoU values indicate better reconstruction. In the tests, the threshold is set to \(t=0.5\) for cars, planes and motorbikes, and \(t=0.3\) for chairs.
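Both measures can be computed directly on the voxel grids. The sketch below uses NumPy, with variable names `vp`/`gvp` mirroring the notation above; the empty-union convention in `voxel_iou` is our own choice, not specified in the paper:

```python
import numpy as np

def mse(vp, gvp):
    """Mean square error between predicted occupancies vp in [0, 1]
    and binary ground-truth occupancies gvp, averaged over all voxels."""
    vp = np.asarray(vp, dtype=float)
    gvp = np.asarray(gvp, dtype=float)
    return float(np.mean((vp - gvp) ** 2))

def voxel_iou(vp, gvp, t=0.5):
    """Voxel Intersection-over-Union after thresholding vp at t."""
    pred = np.asarray(vp) >= t
    gt = np.asarray(gvp).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both grids empty: treat as perfect agreement
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)
```

For a perfect reconstruction both metrics agree with the text: the MSE is 0 and the IoU is 1; lowering the threshold `t` (as done for chairs) admits lower-confidence voxels such as thin chair legs into the thresholded prediction.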

## 6 Conclusion and Future Work

Our 3D reconstruction pipeline using 3D CNNs has been trained end to end. The input to the pipeline is a small number of silhouettes with corresponding camera parameter matrices, and the output is the reconstruction of category-specific 3D shapes. Our approach proves its efficiency in tackling the complexity of the shapes considered, where the object categories contain instances with large non-linear shape variations. The proposed pipeline works in two stages: a coarse shape reconstruction stage and a deep shape reconstruction stage.

Our 3D reconstruction pipeline works from a low number of silhouettes given as input to reconstruct good-quality category-specific 3D shapes. The reconstruction pipeline is independent of pixel values, feature matches and other forms of data. It provides a potential solution to open challenges in the current 3D reconstruction field, including reconstruction failures for objects with textureless or transparent parts and low-quality reconstructions due to insufficiently dense feature correspondences. Furthermore, the pipeline is practical to use because it requires only a low number of silhouette inputs, as opposed to the large number of images/silhouettes from multiple views needed by other reconstruction techniques.

However, the reconstruction pipeline has some limitations. First, the proposed pipeline is not capable of reconstructing good shapes from two silhouettes or a single silhouette. Second, the proposed pipeline has been demonstrated only for a range of selected view arrangements: the silhouettes were taken from evenly spaced locations on a circle around the object. Third, the reconstruction pipeline relies on camera parameter matrices being available. Fourth, the quality of the final reconstruction is not very good for chairs whose legs come out broken. Finally, our current shape resolution is only \(50\times 50\times 50\) (inputs and outputs), and handling high-resolution volumetric shapes in 3D reconstruction remains an open computational challenge.

Therefore, our future work will improve the reconstruction quality of our pipeline in four respects. First, we will explore producing good reconstructions from two silhouettes or a single silhouette. Second, more work is planned to produce good reconstructions from random views. Third, in order to let our pipeline work more automatically, we will reduce the input to silhouettes alone, without knowledge of their camera parameter matrices. Finally, improvements toward higher-resolution final reconstructions and fewer failures, such as broken legs, will also be made.

## Notes

### Acknowledgments

This work has been supported by a scholarship from Trinity College Dublin (Ireland), and partially supported by EU FP7-PEOPLE-2013-IAPP GRAISearch grant (612334).

## References

- 1.Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: ShapeNet: an information-rich 3D model repository. Technical report [cs.GR], Stanford University – Princeton University – Toyota Technological Institute at Chicago (2015). arXiv:1512.03012
- 2.Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. **35**(8), 1915–1929 (2013)
- 3.Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
- 4.Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 447–456 (2015)
- 5.Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
- 6.Ganin, Y., Lempitsky, V.: \(N^4\)-fields: neural network nearest neighbor fields for image transforms. In: Cremers, D., Reid, I., Saito, H., Yang, M.-H. (eds.) ACCV 2014. LNCS, vol. 9004, pp. 536–551. Springer, Heidelberg (2015). doi: 10.1007/978-3-319-16808-1_36
- 7.Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014)
- 8.Dosovitskiy, A., Fischery, P., Ilg, E., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T., et al.: Flownet: learning optical flow with convolutional networks. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2758–2766. IEEE (2015)
- 9.Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems, pp. 2366–2374 (2014)
- 10.Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
- 11.Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497. IEEE (2015)
- 12.Matan, O., Burges, C.J., LeCun, Y., Denker, J.S.: Multi-digit recognition using a space displacement neural network. In: NIPS, pp. 488–495 (1991)
- 13.LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. **1**(4), 541–551 (1989)
- 14.Wolf, R., Platt, J.C.: Postal address block location using a convolutional locator network. In: Advances in Neural Information Processing Systems, p. 745 (1994)
- 15.Ning, F., Delhomme, D., LeCun, Y., Piano, F., Bottou, L., Barbano, P.E.: Toward automatic phenotyping of developing embryos from videos. IEEE Trans. Image Process. **14**(9), 1360–1371 (2005)
- 16.Dosovitskiy, A., Tobias Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1538–1546 (2015)
- 17.Sharma, A., Grau, O., Fritz, M.: VConv-DAE: deep volumetric shape learning without object labels. arXiv preprint (2016). arXiv:1604.03755
- 18.Wang, X., Fouhey, D., Gupta, A.: Designing deep networks for surface normal estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 539–547 (2015)
- 19.Tulsiani, S., Malik, J.: Viewpoints and keypoints. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1510–1519. IEEE (2015)
- 20.Luo, W., Schwing, A.G., Urtasun, R.: Efficient deep learning for stereo matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5695–5703 (2016)
- 21.Tatarchenko, M., Dosovitskiy, A., Brox, T.: Multi-view 3D models from single images with a convolutional network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 322–337. Springer, Heidelberg (2016). doi: 10.1007/978-3-319-46478-7_20
- 22.Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. arXiv preprint (2016). arXiv:1604.00449
- 23.Yumer, M.E., Mitra, N.J.: Learning semantic deformation flows with 3D convolutional networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 294–311. Springer, Heidelberg (2016). doi: 10.1007/978-3-319-46466-4_18
- 24.Yan, X., Yang, J., Yumer, E., Guo, Y., Lee, H.: Learning volumetric 3D object reconstruction from single-view with projective transformations. In: Neural Information Processing Systems (NIPS 2016) (2016)
- 25.Laurentini, A.: The visual hull concept for silhouette-based image understanding. IEEE Trans. Pattern Anal. Mach. Intell. **16**(2), 150–162 (1994)
- 26.Kim, D., Ruttle, J., Dahyot, R.: Bayesian 3D shape from silhouettes. Digit. Signal Proc. **23**(6), 1844–1855 (2013)
- 27.Su, H., Qi, C.R., Li, Y., Guibas, L.J.: Render for CNN: viewpoint estimation in images using CNNs trained with rendered 3D model views. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2686–2694 (2015)
- 28.Maturana, D., Scherer, S.: Voxnet: a 3D convolutional neural network for real-time object recognition. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. IEEE (2015)
- 29.Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3D shapenets: a deep representation for volumetric shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1912–1920 (2015)