Object Detection and Viewpoint Estimation with Auto-masking Neural Network

  • Linjie Yang
  • Jianzhuang Liu
  • Xiaoou Tang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8691)


Simultaneously detecting an object and determining its pose has become a popular research topic in recent years. Because object appearance varies widely across images, it is critical to capture the discriminative object parts that provide key information about the object's pose. Recent part-based models have obtained state-of-the-art results for this task. However, such models require either manually defined object parts with heavy supervision or a complicated algorithm to find discriminative object parts. In this study, we design a novel deep architecture, called the Auto-masking Neural Network (ANN), for object detection and viewpoint estimation. ANN automatically learns to select the most discriminative object parts across different viewpoints from training images. We also propose a method for accurate continuous viewpoint estimation based on the output of ANN. Experimental results on related datasets show that ANN outperforms previous methods.


Keywords: Object Detection; Average Precision; Image Patch; Convolutional Neural Network; Discriminative Feature



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Linjie Yang (1)
  • Jianzhuang Liu (1, 3)
  • Xiaoou Tang (1, 2)

  1. Department of Information Engineering, The Chinese University of Hong Kong, China
  2. Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
  3. Media Lab, Huawei Technologies Co. Ltd., China
