You Only Look as Much as You Have To

Using the Free Energy Principle for Active Vision

  • Conference paper
  • First Online:
Active Inference (IWAI 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1326)

Abstract

Active vision considers the problem of choosing the optimal next viewpoint from which an autonomous agent can observe its environment. In this paper, we propose to use the active inference paradigm as a natural solution to this problem, and evaluate this on a realistic scenario with a robot manipulator. We tackle this problem using a generative model that was learned unsupervised purely from pixel-based observations. We show that our agent exhibits information-seeking behavior, choosing viewpoints of regions it has not yet observed. We also show that goal-seeking behavior emerges when the agent has to reach a target goal, and it does so more efficiently than a systematic grid search.


References

  1. Aloimonos, J., Weiss, I., Bandyopadhyay, A.: Active vision. Int. J. Comput. Vision 1(4), 333–356 (1988). https://doi.org/10.1007/bf00133571

  2. Denzler, J., Zobel, M., Niemann, H.: Information theoretic focal length selection for real-time active 3D object tracking. In: ICCV (2003)

  3. Doumanoglou, A., Kouskouridas, R., Malassiotis, S., Kim, T.K.: 6D object detection and next-best-view prediction in the crowd. In: CVPR (2016)

  4. Ali Eslami, S.M., et al.: Neural scene representation and rendering. Science 360, 1204–1210 (2018)

  5. Finn, C., Levine, S.: Deep visual foresight for planning robot motion. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2017), pp. 2786–2793 (2017)

  6. Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., O’Doherty, J., Pezzulo, G.: Active inference and learning. Neurosci. Biobehav. Rev. 68, 862–879 (2016)

  7. Heins, R.C., Mirza, M.B., Parr, T., Friston, K., Kagan, I., Pooresmaeili, A.: Deep active inference and scene construction (2020). https://doi.org/10.1101/2020.04.14.041129

  8. Hepp, B., Dey, D., Sinha, S.N., Kapoor, A., Joshi, N., Hilliges, O.: Learn-to-score: efficient 3D scene exploration by predicting view utility. In: ECCV (2018)

  9. Kaba, M.D., Uzunbas, M.G., Lim, S.N.: A reinforcement learning approach to the view planning problem. In: CVPR (2017)

  10. Kalman, R.E.: A new approach to linear filtering and prediction problems. Trans. ASME J. Fluids Eng. 82(1), 35–45 (1960). https://doi.org/10.1115/1.3662552

  11. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings (Ml), pp. 1–14 (2014)

  12. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: ICML Workshop on Deep Learning for Audio, Speech and Language Processing (2013)

  13. Mendoza, M., Vasquez-Gomez, J.I., Taud, H., Sucar, L.E., Reta, C.: Supervised learning of the next-best-view for 3D object reconstruction. Pattern Recogn. Lett. 133, 224–231 (2020)

  14. Mirza, M.B., Adams, R.A., Mathys, C.D., Friston, K.J.: Scene construction, visual foraging, and active inference. Front. Comput. Neurosci. 10, 56 (2016)

  15. Nair, A.V., Pong, V., Dalal, M., Bahl, S., Lin, S., Levine, S.: Visual reinforcement learning with imagined goals. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 9191–9200. Curran Associates, Inc. (2018). http://papers.nips.cc/paper/8132-visual-reinforcement-learning-with-imagined-goals.pdf

  16. Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: FiLM: visual reasoning with a general conditioning layer. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 3942–3951 (2018)

  17. Rasouli, A., Lanillos, P., Cheng, G., Tsotsos, J.K.: Attention-based active visual search for mobile robots. Auton. Robots 44(2), 131–146 (2019). https://doi.org/10.1007/s10514-019-09882-z

  18. Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, vol. 4, pp. 3057–3070 (2014)

  19. Rezende, D.J., Viola, F.: Taming VAEs. CoRR abs/1810.00597 (2018)

  20. Sancaktar, C., van Gerven, M., Lanillos, P.: End-to-end pixel-based deep active inference for body perception and action (2019)

  21. Yamauchi, B.: A frontier-based approach for autonomous exploration. In: ICRA (1997)

  22. Çatal, O., Verbelen, T., Nauta, J., Boom, C.D., Dhoedt, B.: Learning perception and planning with deep active inference. In: ICASSP, pp. 3952–3956 (2020)

Acknowledgments

This research received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.

Author information

Corresponding author

Correspondence to Toon Van de Maele.

A The Generative Model

The generative model described in this paper is approximated by a neural network that predicts a multivariate Gaussian distribution with a diagonal covariance matrix. We consider a neural network architecture from the family of variational autoencoders (VAE) [11, 18], very similar to the Generative Query Network (GQN) [4]. In contrast to traditional autoencoders, this model encodes multiple observations into a single latent distribution that describes the scene. Given a query viewpoint, new, unseen views can then be generated from the encoded scene description. A high-level description of the architecture is shown in Fig. 5.

We represent the camera pose as a 3D point for the position and a quaternion for the orientation, as the latter does not suffer from gimbal lock. The encoder maps each observation to a latent distribution, which we model as a 32-dimensional multivariate Gaussian with a diagonal covariance matrix. The latent distributions of all observations are combined into a distribution over the entire scene, analogous to the update step of a Kalman filter [10]. No prediction step is necessary, as the agent does not influence the environment. The decoder takes as input a concatenation of the scene representation and the query viewpoint. Intuitively, both are important: the viewpoint determines which area of the scene is observed, and the representation determines which objects are visible at each position. Between the convolutional layers, the intermediate representation is transformed by a FiLM layer conditioned on this input vector, which allows the model to learn which features are relevant at different stages of the decoding process.
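The following minimal sketch illustrates this fusion step for diagonal Gaussians: each per-observation belief is weighted by its precision, exactly as in a Kalman filter update without a prediction step. Function names and tensor shapes are illustrative and not taken from our implementation.

```python
import torch

def fuse_gaussians(means, variances):
    """Fuse N diagonal-Gaussian beliefs about the same scene into a single
    Gaussian by precision weighting, i.e. a Kalman filter update step with
    no prediction step (the agent does not change the scene).

    means, variances: tensors of shape (N, latent_dim), e.g. latent_dim = 32.
    Returns the fused mean and variance, each of shape (latent_dim,).
    """
    precisions = 1.0 / variances             # per-observation precisions
    fused_var = 1.0 / precisions.sum(dim=0)  # combined precision -> variance
    fused_mean = fused_var * (precisions * means).sum(dim=0)
    return fused_mean, fused_var

# Hypothetical usage: three observations, 32-dimensional latent space.
mus = torch.randn(3, 32)
variances = torch.rand(3, 32) + 0.1
scene_mu, scene_var = fuse_gaussians(mus, variances)
```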

A dataset of 250 scenes, each consisting of approximately 25 (image, viewpoint) pairs, was created in a simulator in order to train this model. To limit the complexity of the model, all observations are taken with the same fixed downward orientation.

Table 1. Training implementation details.

The neural network is optimized using the Adam optimizer with the parameters shown in Table 1. For each scene, between 3 and 10 randomly picked observations are provided to the model, which is then tasked to predict a new one. The model is trained end-to-end using the GECO algorithm [19] on the following loss function:

$$\begin{aligned} \mathcal {L}_\lambda = D_{KL}[Q(\tilde{\pmb {s}}|\tilde{\pmb {o}}) || \mathcal {N}(\pmb {0}, \pmb {I})] + \lambda \cdot \mathcal {C}(\pmb {o}, \pmb {\hat{o}}) \end{aligned}$$
(3)
Fig. 5.

Schematic view of the generative model. The left part is the encoder, which produces a latent distribution for every (observation, viewpoint) pair. This encoder consists of 4 convolutional layers interleaved with FiLM [16] layers that condition on the viewpoint, transforming the intermediate representation to encompass the spatial information of the viewpoint. The latent distributions are combined to form an aggregated distribution over the latent space. A sampled vector is concatenated with the query viewpoint, from which the decoder generates a novel view. The decoder mimics the encoder architecture and has 4 convolutional blocks (each upsampling the image and processing it with two convolutional layers) interleaved with FiLM layers that condition on the concatenated information vector. Each layer is activated with a LeakyReLU [12] activation function.
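As an illustration of the conditioning mechanism used throughout the encoder and decoder, a minimal FiLM layer could be implemented as below. The channel count and context dimension are placeholders; only the scale-and-shift structure follows [16].

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation: scales and shifts every feature map
    of a convolutional block, conditioned on a context vector (the viewpoint
    in the encoder, the scene-and-query vector in the decoder)."""

    def __init__(self, context_dim: int, num_channels: int):
        super().__init__()
        # One (gamma, beta) pair per feature channel.
        self.proj = nn.Linear(context_dim, 2 * num_channels)

    def forward(self, feature_maps: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # feature_maps: (B, C, H, W); context: (B, context_dim)
        gamma, beta = self.proj(context).chunk(2, dim=-1)
        gamma = gamma[..., None, None]  # broadcast over spatial dimensions
        beta = beta[..., None, None]
        return gamma * feature_maps + beta

# Hypothetical usage: condition 64 feature maps on a 7-dimensional viewpoint.
film = FiLMLayer(context_dim=7, num_channels=64)
modulated = film(torch.randn(2, 64, 16, 16), torch.randn(2, 7))
```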

The constraint \(\mathcal {C}\) is applied to an MSE loss between the reconstructed and the ground-truth observation; it simply requires that the MSE stay below a fixed tolerance. \(\lambda \) is a Lagrange multiplier, and the loss is optimized using a min-max scheme [19]. Specific implementation values are shown in Table 1.
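A minimal sketch of one GECO-style training step on the loss in Eq. (3) is shown below, assuming a multiplicative update of the Lagrange multiplier; the tolerance and update rate are placeholder values rather than the settings from Table 1.

```python
import torch

def geco_step(kl, mse, lagrange_lambda, tolerance=1e-2, lambda_lr=1e-2):
    """One GECO-style update for the loss in Eq. (3): minimize
    KL + lambda * (MSE - tolerance) over the network weights, while the
    multiplier is pushed up when the reconstruction constraint is violated
    and relaxed otherwise (min-max scheme).

    kl, mse: scalar tensors for the current batch.
    lagrange_lambda: treated as a constant here (a float or detached tensor).
    Returns the loss to backpropagate and the updated multiplier.
    """
    constraint = mse - tolerance
    loss = kl + lagrange_lambda * constraint
    with torch.no_grad():
        # Multiplicative update keeps lambda positive (placeholder rate).
        new_lambda = lagrange_lambda * torch.exp(lambda_lr * constraint)
    return loss, new_lambda
```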

The expected free energy is computed for a set of candidate poses. The generative model is first used to predict the expected view for each considered pose. The expected value of the posterior given this expected view is then computed over a large number of samples, which yields the expected epistemic term. For numerical stability, we clamp the variances of the posterior distributions to a value of 0.25. The instrumental value is computed as the MSE between the preferred state and the expected observation, which boils down to computing the log likelihood when every pixel is modelled by a Gaussian with a fixed variance of 1.
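The sketch below assembles these two terms for a single candidate pose. Note that it substitutes a closed-form KL divergence between the imagined posterior and the current scene prior for the sampled epistemic estimate described above, and scores the instrumental term as a unit-variance Gaussian log likelihood (half the squared error, up to a constant); all names and the sign convention (lower G is better) are illustrative.

```python
import torch

def diag_gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL[N(mu_q, var_q) || N(mu_p, var_p)] for diagonal Gaussians."""
    return 0.5 * (torch.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0).sum(-1)

def expected_free_energy(post_mu, post_var, prior_mu, prior_var,
                         expected_view, preferred_view):
    """Illustrative G(pose): negative information gain (epistemic term)
    plus the negative log preference of the expected view (instrumental
    term). Candidate poses with the lowest G would be selected."""
    post_var = post_var.clamp(min=0.25)    # clamp variances as in the paper
    prior_var = prior_var.clamp(min=0.25)
    epistemic = -diag_gaussian_kl(post_mu, post_var, prior_mu, prior_var)
    # Unit-variance Gaussian log likelihood per pixel ~ 0.5 * squared error.
    instrumental = 0.5 * ((expected_view - preferred_view) ** 2).mean()
    return epistemic + instrumental
```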

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Van de Maele, T., Verbelen, T., Çatal, O., De Boom, C., Dhoedt, B. (2020). You Only Look as Much as You Have To. In: Verbelen, T., Lanillos, P., Buckley, C.L., De Boom, C. (eds) Active Inference. IWAI 2020. Communications in Computer and Information Science, vol 1326. Springer, Cham. https://doi.org/10.1007/978-3-030-64919-7_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-64919-7_11

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-64918-0

  • Online ISBN: 978-3-030-64919-7

  • eBook Packages: Computer Science (R0)
