## Abstract

Many man-made objects have intrinsic symmetries and often Manhattan structure. Assuming an orthographic or weak perspective projection model, this paper addresses the estimation of 3D structure and camera projection using symmetry and/or Manhattan structure cues, for two cases: when the input is a single image, and when it is multiple images from the same category, e.g. multiple different cars seen from various viewpoints. More specifically, analysis of the single-image case shows that Manhattan structure alone is sufficient to recover the camera projection, after which the 3D structure can be reconstructed uniquely by exploiting symmetry. However, Manhattan structure can be hard to observe from a single image due to occlusion. Hence, we extend to the multiple-image case, which can also exploit symmetry but does not require Manhattan structure. We propose novel structure from motion methods for both rigid and non-rigid object deformations, which exploit symmetry and use multiple images from the same object category as input. We perform experiments on the Pascal3D+ dataset with 2D keypoints that are either human-labeled or localized by a convolutional neural network. The results show that our methods, which exploit symmetry, significantly outperform the baseline methods.

## Notes

1. However, the general framework in Hong and Fitzgibbon (2015) cannot be applied to SfM directly, because it does not constrain all the keypoints within the same frame to have the same translation. Instead, Hong and Fitzgibbon (2015) focused on better optimization of rank-*r* matrix factorization and better runtime.

2. Note that we set hard constraints on \(\mathbb {{\bar{S}}}\) and \(\mathbb {{\bar{S}}}^{\dag }\), i.e. we replace \(\mathbb {{\bar{S}}}^{\dag }\) by \({\mathcal {A}}_P \mathbb {{\bar{S}}}\) in Eq. (57), because this property can be guaranteed by our Sym-RSfM initialization in Sect. 6. In contrast, the PCA initialization of \({\mathbf {V}}\) and \({\mathbf {V}}^{\dag }\) cannot guarantee such a desirable property, so a Lagrange multiplier term is used for the constraint on \({\mathbf {V}}\) and \({\mathbf {V}}^{\dag }\) in the following Eq. (61).

3. For the subtypes of more categories, please refer to the Pascal3D+ official website at http://cvgl.stanford.edu/projects/pascal3d.html.

4. For the rigid case, because we use images from the same *subtype* as input (so that we can reasonably assume rigid deformation among them), we also report the rotation error by *subtype* for the rigid experiments.

5. As there is no baseline method for comparison, we also calculate the average rotation error, measured by the averaged geodesic distance \(\frac{1}{N} \sum _{n=1}^{N} ||\log ({R_n^{\text {aligned}}}^\top R_n^*) ||_\text {F} / \sqrt{2}\), which represents the angle difference between two rotation matrices. The results show that the rotation error is *4.1766* degrees on average.

6. As analyzed in Remark 10 and Eq. (38), the number of allowed deformation bases *K* and the number of keypoint pairs *P* satisfy \(K \le P/3\).

7. This is because self-occluded information/features can be recovered from training images taken from a different viewpoint, but the training data cannot exhaustively cover the various occlusions introduced by other objects or the various types of truncation.

8. They are not directly comparable because (i) Tables 1 and 2 use 2D annotations from Bourdev et al. (2010) [the same as those used in Kar et al. (2015)], while the keypoint localization network for Tables 4 and 5 is trained on 2D annotations from Pascal3D+ (Xiang et al. 2014), and (ii) we exclude the occluded-by-others and truncated objects in Tables 4 and 5 [the same as those in Pavlakos et al. (2017)] because the stacked hourglass network (Newell et al. 2016) does not produce satisfactory results on those images.
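The averaged geodesic distance used as the rotation error in the notes above can be illustrated with a short sketch. It relies on a standard identity: for a relative rotation with angle \(\theta\), \(||\log (R)||_\text {F} / \sqrt{2} = \theta\), and \(\theta\) can be read off the trace. The function names (`rotation_z`, `geodesic_angle_deg`, `mean_rotation_error_deg`) are hypothetical and chosen for illustration only; this is not the authors' implementation, and the trace-based closed form is swapped in for an explicit matrix logarithm.

```python
import numpy as np

def rotation_z(deg):
    """Helper (illustrative only): rotation about the z-axis by `deg` degrees."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def geodesic_angle_deg(R_est, R_gt):
    """Angle (in degrees) of the relative rotation R_est^T R_gt.

    Equals ||log(R_est^T R_gt)||_F / sqrt(2), computed here through the
    equivalent trace formula cos(theta) = (trace(R) - 1) / 2.
    """
    R_rel = R_est.T @ R_gt
    cos_theta = (np.trace(R_rel) - 1.0) / 2.0
    # Clip to guard against round-off pushing the value outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def mean_rotation_error_deg(R_est_list, R_gt_list):
    """Average geodesic angle over N pairs of (estimated, ground-truth) rotations."""
    angles = [geodesic_angle_deg(a, b) for a, b in zip(R_est_list, R_gt_list)]
    return float(np.mean(angles))

if __name__ == "__main__":
    est = [rotation_z(0.0), rotation_z(0.0)]
    gt = [rotation_z(30.0), rotation_z(10.0)]
    print(mean_rotation_error_deg(est, gt))
```

The example pairs differ by 30 and 10 degrees about a common axis, so the printed average should be close to 20 degrees, up to floating-point round-off.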

## References

Agudo, A., Agapito, L., Calvo, B., & Montiel, J. (2014). Good vibrations: A modal analysis approach for sequential non-rigid structure from motion. In *CVPR* (pp. 1558–1565).

Akhter, I., Sheikh, Y., & Khan, S. (2009). In defense of orthonormality constraints for nonrigid structure from motion. In *CVPR*.

Akhter, I., Sheikh, Y., Khan, S., & Kanade, T. (2008). Nonrigid structure from motion in trajectory space. In *NIPS*.

Akhter, I., Sheikh, Y., Khan, S., & Kanade, T. (2011). Trajectory space: A dual representation for nonrigid structure from motion. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *33*(7), 1442–1456.

Bishop, C. M. (2006). *Pattern recognition and machine learning*. New York: Springer.

Bourdev, L., Maji, S., Brox, T., & Malik, J. (2010). Detecting people using mutually consistent poselet activations. In *ECCV*.

Bregler, C., Hertzmann, A., & Biermann, H. (2000). Recovering non-rigid 3D shape from image streams. In *CVPR*.

Ceylan, D., Mitra, N. J., Zheng, Y., & Pauly, M. (2014). Coupled structure-from-motion and 3D symmetry detection for urban facades. *ACM Transactions on Graphics*, *33*, 2. https://doi.org/10.1145/2517348.

Chen, X., & Yuille, A. L. (2014). Articulated pose estimation by a graphical model with image dependent pairwise relations. In *NIPS* (pp. 1736–1744).

Coughlan, J. M., & Yuille, A. L. (1999). Manhattan world: Compass direction from a single image by bayesian inference. In *ICCV*.

Coughlan, J. M., & Yuille, A. L. (2003). Manhattan world: Orientation and outlier detection by bayesian inference. *Neural Computation*, *15*(5), 1063–1088.

Dai, Y., Li, H., & He, M. (2012). A simple prior-free method for non-rigid structure-from-motion factorization. In *CVPR*.

Dai, Y., Li, H., & He, M. (2014). A simple prior-free method for non-rigid structure-from-motion factorization. *International Journal of Computer Vision*, *107*, 101–122.

Furukawa, Y., Curless, B., Seitz, S. M., & Szeliski, R. (2009). Manhattan-world stereo. In *CVPR*.

Gao, Y., Ma, J., Zhao, M., Liu, W., & Yuille, A. L. (2019). NDDR-CNN: Layerwise feature fusing in multi-task CNNs by neural discriminative dimensionality reduction. In *CVPR*.

Gao, Y., & Yuille, A. L. (2016). Symmetry non-rigid structure from motion for category-specific object structure estimation. In *ECCV*.

Gao, Y., & Yuille, A. L. (2017). Exploiting symmetry and/or manhattan properties for 3D object structure estimation from single and multiple images. In *IEEE international conference on computer vision and pattern recognition*.

Gordon, G. G. (1990). Shape from symmetry. In *Proceedings of SPIE*.

Gotardo, P., & Martinez, A. (2011). Computing smooth time-trajectories for camera and deformable shape in structure from motion with occlusion. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *33*, 2051–2065.

Grossmann, E., Ortin, D., & Santos-Victor, J. (2002). Single and multi-view reconstruction of structured scenes. In *ACCV*.

Grossmann, E., & Santos-Victor, J. (2002). Maximum likehood 3D reconstruction from one or more images under geometric constraints. In *BMVC*.

Grossmann, E., & Santos-Victor, J. (2005). Least-squares 3D reconstruction from one or more views and geometric clues. *Computer Vision and Image Understanding*, *99*(2), 151–174.

Hamsici, O. C., Gotardo, P. F., & Martinez, A. M. (2012). Learning spatially-smooth mappings in non-rigid structure from motion. In *ECCV* (pp. 260–273).

Hartley, R. I., & Zisserman, A. (2004). *Multiple view geometry in computer vision* (2nd ed.). Cambridge: Cambridge University Press.

Hong, J. H., & Fitzgibbon, A. (2015). Secrets of matrix factorization: Approximations, numerics, manifold optimization and random restarts. In *ICCV*.

Hong, W., Yang, A. Y., Huang, K., & Ma, Y. (2004). On symmetry and multiple-view geometry: Structure, pose, and calibration from a single image. *International Journal of Computer Vision*, *60*, 241–265.

Kar, A., Tulsiani, S., Carreira, J., & Malik, J. (2015). Category-specific object reconstruction from a single image. In *CVPR*.

Kontsevich, L. L. (1993). Pairwise comparison technique: A simple solution for depth reconstruction. *JOSA A*, *10*(6), 1129–1135.

Kontsevich, L. L., Kontsevich, M. L., & Shen, A. K. (1987). Two algorithms for reconstructing shapes. *Optoelectronics, Instrumentation and Data Processing*, *5*, 76–81.

Li, Y., & Pizlo, Z. (2007). Reconstruction of shapes of 3D symmetric objects by using planarity and compactness constraints. In *Proceedings of SPIE-IS&T electronic imaging*.

Ma, J., Zhao, J., Tian, J., Tu, Z., & Yuille, A. L. (2013). Robust estimation of nonrigid transformation for point set registration. In *CVPR* (pp. 2147–2154).

Marques, M., & Costeira, J. (2009). Estimating 3D shape from degenerate sequences with missing data. *Computer Vision and Image Understanding*, *113*(2), 261–272.

Ma, J., Zhao, J., Ma, Y., & Tian, J. (2015). Non-rigid visible and infrared face registration via regularized gaussian fields criterion. *Pattern Recognition*, *48*(3), 772–784.

Ma, J., Zhao, J., Tian, J., Bai, X., & Tu, Z. (2013). Regularized vector field learning with sparse approximation for mismatch removal. *Pattern Recognition*, *46*(12), 3519–3532.

Morris, D. D., Kanatani, K., & Kanade, T. (2001). Gauge fixing for accurate 3D estimation. In *CVPR*.

Mukherjee, D. P., Zisserman, A., & Brady, M. (1995). Shape from symmetry: Detecting and exploiting symmetry in affine images. *Philosophical Transactions: Physical Sciences and Engineering*, *351*, 77–106.

Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In *European conference on computer vision* (pp. 483–499). Springer.

Olsen, S. I., & Bartoli, A. (2008). Implicit non-rigid structure-from-motion with priors. *Journal of Mathematical Imaging and Vision*, *31*(2–3), 233–244.

Pavlakos, G., Zhou, X., Chan, A., Derpanis, K. G., & Daniilidis, K. (2017). 6-DoF object pose from semantic keypoints. In *2017 IEEE international conference on robotics and automation (ICRA)* (pp. 2011–2018). IEEE.

Rosen, J. (2011). *Symmetry discovered: Concepts and applications in nature and science*. Mineola: Dover Publications.

Schönemann, P. H. (1966). A generalized solution of the orthogonal procrustes problem. *Psychometrika*, *31*, 1–10.

Thrun, S., & Wegbreit, B. (2005). Shape from symmetry. In *ICCV*.

Tomasi, C., & Kanade, T. (1992). Shape and motion from image streams under orthography: A factorization method. *International Journal of Computer Vision*, *9*(2), 137–154.

Torresani, L., Hertzmann, A., & Bregler, C. (2003). Learning non-rigid 3D shape from 2D motion. In *NIPS*.

Torresani, L., Hertzmann, A., & Bregler, C. (2008). Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *30*, 878–892.

Vetter, T., & Poggio, T. (1994). Symmetric 3D objects are an easy case for 2D object recognition. *Spatial Vision*, *8*, 443–453.

Vicente, S., Carreira, J., Agapito, L., & Batista, J. (2014). Reconstructing PASCAL VOC. In *CVPR*.

Xiang, Y., Mottaghi, R., & Savarese, S. (2014). Beyond pascal: A benchmark for 3D object detection in the wild. In *WACV*.

Xiao, J., Chai, J., & Kanade, T. (2004). A closed-form solution to nonrigid shape and motion recovery. In *ECCV*.

## Acknowledgements

We would like to thank Ehsan Jahangiri, Cihang Xie, Weichao Qiu, Xuan Dong, and Siyuan Qiao for their feedback on the manuscript. This work was partially supported by ARO 62250-CS, ONR N00014-15-1-2356, and the NSF award CCF-1317376.

## Additional information

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by Cordelia Schmid.

## About this article

### Cite this article

Gao, Y., Yuille, A.L. Estimation of 3D Category-Specific Object Structure: Symmetry, Manhattan and/or Multiple Images. *Int J Comput Vis* **127**, 1501–1526 (2019). https://doi.org/10.1007/s11263-019-01195-z

### Keywords

- Symmetry
- Manhattan
- Single image
- Symmetric rigid structure from motion
- Symmetric non-rigid structure from motion