Computational Visual Media, Volume 5, Issue 1, pp 91–104

Recurrent 3D attentional networks for end-to-end active object recognition

  • Min Liu
  • Yifei Shi
  • Lintao Zheng
  • Kai Xu
  • Hui Huang
  • Dinesh Manocha
Open Access
Research Article


Abstract

Active vision is inherently attention-driven: an agent actively selects views to attend to in order to perform a vision task rapidly while improving its internal representation of the scene being observed. Inspired by the recent success of attention-based models in 2D vision tasks on single RGB images, we address multi-view, depth-based active object recognition with an attention mechanism, using an end-to-end recurrent 3D attentional network. The architecture exploits a recurrent neural network to store and update an internal representation. Trained on 3D shape datasets, our model iteratively attends to the best views of a target object in order to recognize it. To realize 3D view selection, we derive a 3D spatial transformer network. It is differentiable, allowing training with backpropagation, and thus achieves much faster convergence than the reinforcement learning employed by most existing attention-based models. Experiments show that our method, with only depth input, achieves state-of-the-art next-best-view performance in both time taken and recognition accuracy.
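The recurrent attentional loop described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the dimensions, the random weights, and the `render_depth_view` stand-in are all hypothetical, and the differentiable 3D spatial transformer is reduced to a simple linear next-view prediction to show the control flow (observe a depth view, update the recurrent state, propose the next view, then classify).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper).
D_VIEW, D_HID, N_CLASS = 64, 32, 10

# Random weights stand in for trained parameters.
W_enc = rng.normal(scale=0.1, size=(D_HID, D_VIEW))  # depth-view encoder
W_rec = rng.normal(scale=0.1, size=(D_HID, D_HID))   # recurrent update
W_cls = rng.normal(scale=0.1, size=(N_CLASS, D_HID)) # classifier head
W_loc = rng.normal(scale=0.1, size=(3, D_HID))       # next-view parameters

def render_depth_view(view_params):
    """Stand-in for rendering a depth image of the object from a viewpoint."""
    return np.tanh(rng.normal(size=D_VIEW) + view_params.sum())

def recurrent_attention(n_glimpses=3):
    h = np.zeros(D_HID)   # internal representation, updated across views
    view = np.zeros(3)    # initial viewpoint parameters
    for _ in range(n_glimpses):
        x = render_depth_view(view)
        h = np.tanh(W_enc @ x + W_rec @ h)  # fold the new view into the state
        view = W_loc @ h                    # propose the next view (differentiable)
    logits = W_cls @ h                      # classify from the final state
    return int(np.argmax(logits)), view

label, final_view = recurrent_attention()
```

Because every step (encoding, state update, view proposal) is a differentiable function of the previous state, the whole loop can in principle be trained end-to-end with backpropagation through time, which is the property the paper's 3D spatial transformer provides for view selection.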


Keywords: active object recognition; recurrent neural network; next-best-view; 3D attention



Acknowledgements

We thank the anonymous reviewers for their valuable comments. This work was supported in part by the National Natural Science Foundation of China (Nos. 61572507, 61622212, and 61532003). Min Liu is supported by the China Scholarship Council.



Copyright information

© The Author(s) 2019

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit


Authors and Affiliations

  • Min Liu (1, 2)
  • Yifei Shi (1)
  • Lintao Zheng (1)
  • Kai Xu (1)
  • Hui Huang (3)
  • Dinesh Manocha (2)
  1. School of Computer, National University of Defense Technology, Changsha, China
  2. Department of Computer Science and Electrical & Computer Engineering, University of Maryland, College Park, USA
  3. Visual Computing Research Center, Shenzhen University, Shenzhen, China
