A Neural Autoregressive Approach to Attention-based Recognition

Zheng, Yin; Zemel, Richard S.; Zhang, Yu-Jin; Larochelle, Hugo

doi:10.1007/s11263-014-0765-x

Published: 30 September 2014

Volume 113, pages 67–79, (2015)

International Journal of Computer Vision

Yin Zheng¹,
Richard S. Zemel²,
Yu-Jin Zhang¹ &
Hugo Larochelle³

Abstract

Tasks that require the synchronization of perception and action are incredibly hard and pose a fundamental challenge to the fields of machine learning and computer vision. One important example of such a task is the problem of performing visual recognition through a sequence of controllable fixations; this requires jointly deciding what inference to perform from fixations and where to perform these fixations. While these two problems are challenging when addressed separately, they become even more formidable if solved jointly. Recently, a restricted Boltzmann machine (RBM) model was proposed that could learn meaningful fixation policies and achieve good recognition performance. In this paper, we propose an alternative approach based on a feed-forward, auto-regressive architecture, which permits exact calculation of training gradients (given the fixation sequence), unlike for the RBM model. On a problem of facial expression recognition, we demonstrate the improvement gained by this alternative approach. Additionally, we investigate several variations of the model in order to shed some light on successful strategies for fixation-based recognition.

Notes

This is done by setting \(\mathbf{z}(i_k,j_k) = \mathrm{sigmoid}\left(\bar{\mathbf{z}}(i_k,j_k)\right)\) and learning the unconstrained vectors \(\bar{\mathbf{z}}(i_k,j_k)\) instead. We also use a learning rate \(100\) times larger than the one used for the other parameters.
The retinal transformation covered a patch of \(44\times 44\) pixels, without using a lower-resolution periphery. Hence, the total number of pixels is \(1936\).
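
The two notes above can be illustrated with a minimal sketch. It shows one way to keep a vector \(\mathbf{z}\) inside \((0,1)\) by learning an unconstrained \(\bar{\mathbf{z}}\) and passing it through a sigmoid, with a step size scaled by 100 relative to a base learning rate. The function names, the embedding dimension, the base learning rate, and the toy squared-error loss are all assumptions made for illustration; the paper's actual objective and update rule are not reproduced here.

```python
import numpy as np

def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(z_bar, grad_z, base_lr, lr_scale=100.0):
    """One gradient step on the unconstrained z_bar, given dLoss/dz.

    lr_scale=100 mirrors the note's "100 times larger" learning rate;
    base_lr is a hypothetical rate for the other parameters.
    """
    z = sigmoid(z_bar)
    grad_z_bar = grad_z * z * (1.0 - z)  # chain rule through the sigmoid
    return z_bar - base_lr * lr_scale * grad_z_bar

rng = np.random.default_rng(0)
z_bar = rng.normal(size=8)       # hypothetical embedding dimension
target = np.full(8, 0.9)         # toy target, chosen inside (0, 1)
initial_error = np.max(np.abs(sigmoid(z_bar) - target))

for _ in range(5000):
    z = sigmoid(z_bar)
    grad_z = 2.0 * (z - target)  # gradient of a toy squared-error loss
    z_bar = sgd_step(z_bar, grad_z, base_lr=0.001)

z = sigmoid(z_bar)               # constrained vector, always in (0, 1)

# Pixel count of the 44 x 44 retinal patch from the second note.
patch_side = 44
num_pixels = patch_side * patch_side
```

Because the gradient is taken with respect to \(\bar{\mathbf{z}}\), no projection or clipping step is needed: whatever value \(\bar{\mathbf{z}}\) takes, the sigmoid maps it back into the open unit interval.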

Acknowledgments

This work was partially supported by the Natural Sciences and Engineering Research Council of Canada, the National Natural Science Foundation of China under Grant NNSF-61171118, and the Ministry of Education of China under Grant SRFDP-20110002110057.

Author information

Authors and Affiliations

Department of Electronic Engineering, Tsinghua University, Beijing, 10084, China
Yin Zheng & Yu-Jin Zhang
Department of Computer Science, University of Toronto, Toronto, M5S 3G4, Canada
Richard S. Zemel
Département d’informatique, Université de Sherbrooke, Sherbrooke, J1K 2R1, Canada
Hugo Larochelle

Corresponding author

Correspondence to Yin Zheng.

Additional information

Communicated by Marc’Aurelio Ranzato, Geoffrey E. Hinton, and Yann LeCun.

About this article

Cite this article

Zheng, Y., Zemel, R.S., Zhang, YJ. et al. A Neural Autoregressive Approach to Attention-based Recognition. Int J Comput Vis 113, 67–79 (2015). https://doi.org/10.1007/s11263-014-0765-x

Received: 11 February 2014
Accepted: 08 September 2014
Published: 30 September 2014
Issue Date: May 2015
DOI: https://doi.org/10.1007/s11263-014-0765-x
