Skip to main content
Log in

A Neural Autoregressive Approach to Attention-based Recognition

  • Published:
International Journal of Computer Vision Aims and scope Submit manuscript

Abstract

Tasks that require the synchronization of perception and action are incredibly hard and pose a fundamental challenge to the fields of machine learning and computer vision. One important example of such a task is the problem of performing visual recognition through a sequence of controllable fixations; this requires jointly deciding what inference to perform from fixations and where to perform these fixations. While these two problems are challenging when addressed separately, they become even more formidable if solved jointly. Recently, a restricted Boltzmann machine (RBM) model was proposed that could learn meaningful fixation policies and achieve good recognition performance. In this paper, we propose an alternative approach based on a feed-forward, auto-regressive architecture, which permits exact calculation of training gradients (given the fixation sequence), unlike for the RBM model. On a problem of facial expression recognition, we demonstrate the improvement gained by this alternative approach. Additionally, we investigate several variations of the model in order to shed some light on successful strategies for fixation-based recognition.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Notes

  1. This is done by setting \(\mathbf {z}\left( i_k,j_k\right) = \mathrm{sigmoid}\left( \bar{ \mathbf {z}}\left( i_k,j_k\right) \right) \), and learning the unconstrained \(\bar{ \mathbf {z}}\left( i_k,j_k\right) \) vectors instead. We also use a learning rate \(100\) times larger than learning the other parameters.

  2. The retinal transformation covered a patch of \(44\times 44\) pixels, without using a lower resolution periphery. Hence, the total number of pixels is \(1936\).

References

  • Bazzani, L., Freitas, N., Larochelle, H., Murino, V., & Ting, J.-A. (2011). Learning attentional policies for tracking and recognition in video with deep networks. In Proceedings of the 28th international conference on machine learning (ICML 2011) (pp. 937–944). ACM.

  • Butko, N. J., & Movellan, J. R. (2010). Infomax control of eye movements. IEEE Transactions on Autonomous Mental Development, 2(2), 91–107.

    Article  Google Scholar 

  • Cheng, M.-M., Zhang, G.-X., Mitra, N. J., Huang, X., & Hu, S.-M. (2011). Global contrast based salient region detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011 (pp. 409–416). IEEE.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE computer society conference on computer vision and pattern recognition. CVPR 2005 (Vol. 1, pp. 886–893). IEEE.

  • David, G. (2004). Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

    Article  Google Scholar 

  • Denil, M., Bazzani, L., Larochelle, H., & de Freitas, N. (2012). Learning where to attend with deep architectures for image tracking. Neural Computation, 24(8), 2151–2184.

    Article  MathSciNet  Google Scholar 

  • Erez, T., Tramper, J. J., Smart, W. D., & Stan CAM Gielen. (2011). A pomdp model of eye-hand coordination. In AAAI.

  • Fazl, A., Grossberg, S., & Mingolla, E. (2009). View-invariant object category learning, recognition, and search: How spatial and object attention are coordinated using surface-based attentional shrouds. Cognitive psychology, 58(1), 1–48.

    Article  Google Scholar 

  • Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

  • Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.

    Article  MATH  MathSciNet  Google Scholar 

  • Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259.

  • Judd, T., Ehinger, K., Durand, F., & Torralba, A. (2009). Learning to predict where humans look. In IEEE International Conference on Computer Vision (ICCV).

  • Kanan, C., & Cottrell, G. (2010) Robust classification of objects, faces, and flowers using natural image statistics. In CVPR.

  • Krause, A., & Ong, C. S. (2011). Contextual gaussian process bandit optimization. In NIPS (pp. 2447–2455).

  • Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1106–1114.

    Google Scholar 

  • Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted boltzmann machines. In Proceedings of the 25th international conference on machine learning (pp. 536–543). ACM.

  • Larochelle, H., & Hinton, G. E. (2010). Learning to combine foveal glimpses with a third-order Boltzmann machine. In Advances in neural information processing systems (pp. 1243–1251).

  • Larochelle, H., & Murray, I. (2011). The neural autoregressive distribution estimator. Artificial Intelligence and Statistics (AISTATS), 15, 29–37.

    Google Scholar 

  • Larochelle, H., & Lauly, S. (2012). A neural autoregressive topic model. Advances in Neural Information Processing Systems, 25, 2717–2725.

    Google Scholar 

  • Lazebnik, S. (2006). Cordelia, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.

  • Mathe, S., & Sminchisescu, C. (2013). Action from still image dataset and inverse optimal control to learn task specific visual scanpaths. In Advances in neural information processing systems (pp. 1923–1931, 2013).

  • Nair, V., & Hinton, G. E. (2010) Rectified linear units improve restricted boltzmann machines. In ICML.

  • Najemnik, J., & Geisler, W. S. (2005). Optimal eye movement strategies in visual search. Nature, 434(7031), 387–391.

    Article  Google Scholar 

  • Perazzi, F., Krahenbuhl, P., Pritch, Y., & Hornung, A. (2012). Saliency filters: Contrast based filtering for salient region detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012 (pp. 733–740). IEEE.

  • Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th international conference on machine learning (ICML 2011).

  • Schmidhuber, J., & Huber, R. (1991). Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(01n02), 125–134.

  • Southall, J. P. C. (1962). Helmholtzs treatise on physiological optics. vol. 2: The sensation of vision, trans. J. P. C. Southall. (translated from the third german edition).

  • Susskind, J. M., Anderson, A. K., & Hinton, G. E. (2010). The toronto face database. Department of Computer Science, University of Toronto, Toronto, ON, Canada, Tech. Rep.

  • Uria, B., Murray, I., & Larochelle, H. (2013). Rnade: The real-valued neural autoregressive density-estimator. Advances in Neural Information Processing Systems, 26, 2175–2183.

    Google Scholar 

  • Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on machine learning (ICML 2008) (pp. 1096–1103). ACM.

  • Yang, J., Yu., K., & Gong, Y. (2009). Linear spatial pyramid matching using sparse coding for image classification. In CVPR.

Download references

Acknowledgments

This work was partially supported by the Natural Sciences and Engineering Research Council of Canada, the National Natural Science Foundation under Grants NNSF-61171118 and the Ministry of Education under Grants SRFDP-20110002110057 of China.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yin Zheng.

Additional information

Communicated by Marc’Aurelio Ranzato, Geoffrey E. Hinton, and Yann LeCun.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zheng, Y., Zemel, R.S., Zhang, YJ. et al. A Neural Autoregressive Approach to Attention-based Recognition. Int J Comput Vis 113, 67–79 (2015). https://doi.org/10.1007/s11263-014-0765-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11263-014-0765-x

Keywords

Navigation