Abstract
Active vision considers the problem of choosing the optimal next viewpoint from which an autonomous agent can observe its environment. In this paper, we propose the active inference paradigm as a natural solution to this problem, and evaluate it in a realistic scenario with a robot manipulator. We tackle the problem using a generative model learned, without supervision, purely from pixel-based observations. We show that our agent exhibits information-seeking behavior, choosing viewpoints of regions it has not yet observed. We also show that goal-seeking behavior emerges when the agent has to reach a target goal, and that it does so more efficiently than a systematic grid search.
References
Aloimonos, J., Weiss, I., Bandyopadhyay, A.: Active vision. Int. J. Comput. Vision 1(4), 333–356 (1988). https://doi.org/10.1007/bf00133571
Denzler, J., Zobel, M., Niemann, H.: Information theoretic focal length selection for real-time active 3D object tracking. In: ICCV (2003)
Doumanoglou, A., Kouskouridas, R., Malassiotis, S., Kim, T.K.: 6D object detection and next-best-view prediction in the crowd. In: CVPR (2016)
Eslami, S.M.A., et al.: Neural scene representation and rendering. Science 360(6394), 1204–1210 (2018)
Finn, C., Levine, S.: Deep visual foresight for planning robot motion. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2017), pp. 2786–2793 (2017)
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., O’Doherty, J., Pezzulo, G.: Active inference and learning. Neurosci. Biobehav. Rev. 68, 862–879 (2016)
Heins, R.C., Mirza, M.B., Parr, T., Friston, K., Kagan, I., Pooresmaeili, A.: Deep active inference and scene construction (2020). https://doi.org/10.1101/2020.04.14.041129
Hepp, B., Dey, D., Sinha, S.N., Kapoor, A., Joshi, N., Hilliges, O.: Learn-to-score: efficient 3D scene exploration by predicting view utility. In: ECCV (2018)
Kaba, M.D., Uzunbas, M.G., Lim, S.N.: A reinforcement learning approach to the view planning problem. In: CVPR (2017)
Kalman, R.E.: A new approach to linear filtering and prediction problems. Trans. ASME J. Basic Eng. 82(1), 35–45 (1960). https://doi.org/10.1115/1.3662552
Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations (ICLR 2014), pp. 1–14 (2014)
Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: ICML Workshop on Deep Learning for Audio, Speech and Language Processing (2013)
Mendoza, M., Vasquez-Gomez, J.I., Taud, H., Sucar, L.E., Reta, C.: Supervised learning of the next-best-view for 3D object reconstruction. Pattern Recogn. Lett. 133, 224–231 (2020)
Mirza, M.B., Adams, R.A., Mathys, C.D., Friston, K.J.: Scene construction, visual foraging, and active inference. Front. Comput. Neurosci. 10, 56 (2016)
Nair, A.V., Pong, V., Dalal, M., Bahl, S., Lin, S., Levine, S.: Visual reinforcement learning with imagined goals. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 9191–9200. Curran Associates, Inc. (2018). http://papers.nips.cc/paper/8132-visual-reinforcement-learning-with-imagined-goals.pdf
Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: FiLM: visual reasoning with a general conditioning layer. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 3942–3951 (2018)
Rasouli, A., Lanillos, P., Cheng, G., Tsotsos, J.K.: Attention-based active visual search for mobile robots. Auton. Robots 44(2), 131–146 (2019). https://doi.org/10.1007/s10514-019-09882-z
Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, vol. 4, pp. 3057–3070 (2014)
Rezende, D.J., Viola, F.: Taming VAEs. CoRR abs/1810.00597 (2018)
Sancaktar, C., van Gerven, M., Lanillos, P.: End-to-end pixel-based deep active inference for body perception and action (2019)
Yamauchi, B.: A frontier-based approach for autonomous exploration. In: CIRA (1997)
Çatal, O., Verbelen, T., Nauta, J., De Boom, C., Dhoedt, B.: Learning perception and planning with deep active inference. In: ICASSP, pp. 3952–3956 (2020)
Acknowledgments
This research received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.
A The Generative Model
The generative model described in this paper is approximated by a neural network that predicts a multivariate Gaussian distribution with a diagonal covariance matrix. We consider a neural network architecture from the family of variational autoencoders (VAEs) [11, 18], very similar to the Generative Query Network (GQN) [4]. In contrast to traditional autoencoders, this model encodes multiple observations into a single latent distribution that describes the scene. Given a query viewpoint, new, unseen views can then be generated from the encoded scene description. A high-level description of the architecture is shown in Fig. 5.
We represent the camera position as a 3D point and the orientation as a quaternion, as this representation does not suffer from gimbal lock. The encoder maps each observation to a latent distribution, which we model as a 32-dimensional multivariate Gaussian with a diagonal covariance matrix. The latent distributions of all observations are combined into a distribution over the entire scene, analogous to the update step of the Kalman filter [10]; no prediction step is necessary, as the agent does not influence the environment. The decoder takes as input the concatenation of the scene representation and the query viewpoint. Intuitively, both are important: the viewpoint determines which area of the scene is observed, and the representation determines which objects are visible at each position. Between the convolutional layers, the intermediate representation is transformed by a FiLM layer [16] conditioned on this input vector, which allows the model to learn which features are relevant at different stages of the decoding process.
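To make the aggregation step concrete, the following is a minimal sketch of this precision-weighted fusion of diagonal Gaussians (a toy NumPy example, not the paper's implementation; the function name and the toy values are our own):

```python
import numpy as np

def fuse_gaussians(means, variances):
    """Fuse per-observation diagonal Gaussians into one scene-level Gaussian
    via a precision-weighted product -- the Kalman filter update step, with
    no prediction step since the agent leaves the scene unchanged."""
    precisions = [1.0 / v for v in variances]
    fused_var = 1.0 / np.sum(precisions, axis=0)
    # Means are weighted by their precision, so confident observations dominate.
    fused_mean = fused_var * np.sum([p * m for p, m in zip(precisions, means)], axis=0)
    return fused_mean, fused_var

# Two encodings of the same 32-dimensional scene latent:
m1, v1 = np.zeros(32), np.full(32, 1.0)
m2, v2 = np.ones(32), np.full(32, 0.5)
mean, var = fuse_gaussians([m1, m2], [v1, v2])  # mean ~= 0.67, var ~= 0.33
```

Each additional observation can only increase the precision of the scene belief, which is exactly the behavior one wants from an evidence-accumulating representation.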
To train this model, we created a dataset of 250 scenes in a simulator, each consisting of approximately 25 (image, viewpoint) pairs. To limit the complexity of the model, all observations share the same fixed, downward-facing orientation.
The neural network is optimized using the Adam optimizer with the parameters shown in Table 1. For each scene, between 3 and 10 randomly picked observations are provided to the model, which is then tasked with predicting a novel view. The model is trained end-to-end using the GECO algorithm [19] on the following loss function:
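We assume the standard GECO Lagrangian of [19]; as a sketch, with \(o_{1:k}\) and \(v_{1:k}\) the provided observations and viewpoints, \(\hat{o}_q\) the reconstruction of the query view \(o_q\), and \(z\) the scene latent (notation ours):

\[ \min_\theta \max_{\lambda \ge 0} \; D_{\mathrm{KL}}\big[\, q_\theta(z \mid o_{1:k}, v_{1:k}) \,\|\, p(z) \,\big] + \lambda \, \mathcal{C}(o_q, \hat{o}_q), \qquad \mathcal{C}(o_q, \hat{o}_q) = \mathrm{MSE}(o_q, \hat{o}_q) - \mathrm{tol} \]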
The constraint \(\mathcal {C}\) is applied to an MSE loss between the reconstructed and ground-truth observations; it simply requires that the MSE stay below a fixed tolerance. \(\lambda \) is a Lagrange multiplier, and the loss is optimized using a min-max scheme [19]. Specific implementation values are given in Table 1.
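As a hedged sketch of that min-max scheme (a minimal PyTorch-style example following [19], not the paper's training loop; `geco_step`, the multiplicative update, and the learning rate are our assumptions):

```python
import torch

def geco_step(mse, kl, lam, tolerance, lambda_lr=1e-2):
    """One GECO-style update: the model minimises kl + lam * (mse - tolerance),
    while lam performs multiplicative ascent on the same constraint."""
    constraint = mse - tolerance        # satisfied once the MSE is within tolerance
    loss = kl + lam * constraint        # backpropagate this through the model
    # Dual update: lam grows while the constraint is violated and decays once
    # reconstructions are good enough; clamped so it stays positive.
    with torch.no_grad():
        lam = (lam * torch.exp(lambda_lr * constraint)).clamp(min=1e-6)
    return loss, lam
```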
The expected free energy is computed for a set of candidate poses. The generative model is first used to imagine the expected view for each candidate pose. The expected epistemic term is then estimated by evaluating the posterior under this expected view for a large number of samples. For numerical stability, we clamp the variances of the posterior distributions to a value of 0.25. The instrumental value is computed as the MSE between the preferred state and the expected observation, which boils down to computing the log-likelihood under a model in which every pixel is Gaussian with a fixed variance of 1.
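A rough sketch of this pose-scoring procedure (our own reading of the description above; `model.encode`, `model.decode`, the belief object, and the sign conventions are assumptions, not the authors' code):

```python
import torch
from torch.distributions import Normal, kl_divergence

def expected_free_energy(model, belief, pose, preferred_obs, n_samples=64):
    """Score a candidate pose by expected information gain (epistemic term)
    plus distance to the preferred observation (instrumental term)."""
    gains = []
    for _ in range(n_samples):
        z = belief.rsample()                 # sample a scene hypothesis
        imagined = model.decode(z, pose)     # expected view from this pose
        post = model.encode(imagined, pose)  # posterior after "seeing" it
        # Clamp the posterior variance at 0.25 for numerical stability
        # (we assume the clamp acts as a floor).
        post = Normal(post.mean, post.variance.clamp(min=0.25).sqrt())
        gains.append(kl_divergence(post, belief).sum())
    epistemic = torch.stack(gains).mean()
    # Instrumental term: pixel-wise MSE to the preferred observation, i.e. a
    # Gaussian log-likelihood with fixed variance 1, up to a constant.
    expected_view = model.decode(belief.mean, pose)
    instrumental = ((expected_view - preferred_obs) ** 2).mean()
    return instrumental - epistemic          # choose the pose minimising this
```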
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Van de Maele, T., Verbelen, T., Çatal, O., De Boom, C., Dhoedt, B. (2020). You Only Look as Much as You Have To. In: Verbelen, T., Lanillos, P., Buckley, C.L., De Boom, C. (eds) Active Inference. IWAI 2020. Communications in Computer and Information Science, vol 1326. Springer, Cham. https://doi.org/10.1007/978-3-030-64919-7_11
DOI: https://doi.org/10.1007/978-3-030-64919-7_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-64918-0
Online ISBN: 978-3-030-64919-7