You Only Look as Much as You Have To

Using the Free Energy Principle for Active Vision

  • Conference paper
  • First Online:
Active Inference (IWAI 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1326)

Abstract

Active vision considers the problem of choosing the optimal next viewpoint from which an autonomous agent can observe its environment. In this paper, we propose to use the active inference paradigm as a natural solution to this problem, and evaluate this on a realistic scenario with a robot manipulator. We tackle this problem using a generative model that was learned unsupervised purely from pixel-based observations. We show that our agent exhibits information-seeking behavior, choosing viewpoints of regions it has not yet observed. We also show that goal-seeking behavior emerges when the agent has to reach a target goal, and it does so more efficiently than a systematic grid search.


References

  1. Aloimonos, J., Weiss, I., Bandyopadhyay, A.: Active vision. Int. J. Comput. Vision 1(4), 333–356 (1988). https://doi.org/10.1007/bf00133571

  2. Denzler, J., Zobel, M., Niemann, H.: Information theoretic focal length selection for real-time active 3D object tracking. In: ICCV (2003)

  3. Doumanoglou, A., Kouskouridas, R., Malassiotis, S., Kim, T.K.: 6D object detection and next-best-view prediction in the crowd. In: CVPR (2016)

  4. Ali Eslami, S.M., et al.: Neural scene representation and rendering. Science 360, 1204–1210 (2018)

  5. Finn, C., Levine, S.: Deep visual foresight for planning robot motion. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2017), pp. 2786–2793 (2017)

  6. Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., O’Doherty, J., Pezzulo, G.: Active inference and learning. Neurosci. Biobehav. Rev. 68, 862–879 (2016)

  7. Heins, R.C., Mirza, M.B., Parr, T., Friston, K., Kagan, I., Pooresmaeili, A.: Deep active inference and scene construction (2020). https://doi.org/10.1101/2020.04.14.041129

  8. Hepp, B., Dey, D., Sinha, S.N., Kapoor, A., Joshi, N., Hilliges, O.: Learn-to-score: efficient 3D scene exploration by predicting view utility. In: ECCV (2018)

  9. Kaba, M.D., Uzunbas, M.G., Lim, S.N.: A reinforcement learning approach to the view planning problem. In: CVPR (2017)

  10. Kalman, R.E.: A new approach to linear filtering and prediction problems. Trans. ASME J. Fluids Eng. 82(1), 35–45 (1960). https://doi.org/10.1115/1.3662552

  11. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. In: Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings (Ml), pp. 1–14 (2014)

  12. Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: ICML Workshop on Deep Learning for Audio, Speech and Language Processing (2013)

  13. Mendoza, M., Vasquez-Gomez, J.I., Taud, H., Sucar, L.E., Reta, C.: Supervised learning of the next-best-view for 3D object reconstruction. Pattern Recogn. Lett. 133, 224–231 (2020)

  14. Mirza, M.B., Adams, R.A., Mathys, C.D., Friston, K.J.: Scene construction, visual foraging, and active inference. Front. Comput. Neurosci. 10, 56 (2016)

  15. Nair, A.V., Pong, V., Dalal, M., Bahl, S., Lin, S., Levine, S.: Visual reinforcement learning with imagined goals. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 9191–9200. Curran Associates, Inc. (2018). http://papers.nips.cc/paper/8132-visual-reinforcement-learning-with-imagined-goals.pdf

  16. Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: FiLM: visual reasoning with a general conditioning layer. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 3942–3951 (2018)

  17. Rasouli, A., Lanillos, P., Cheng, G., Tsotsos, J.K.: Attention-based active visual search for mobile robots. Auton. Robots 44(2), 131–146 (2019). https://doi.org/10.1007/s10514-019-09882-z

  18. Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models. In: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, vol. 4, pp. 3057–3070 (2014)

  19. Rezende, D.J., Viola, F.: Taming VAEs. CoRR abs/1810.00597 (2018)

  20. Sancaktar, C., van Gerven, M., Lanillos, P.: End-to-end pixel-based deep active inference for body perception and action (2019)

  21. Yamauchi, B.: A frontier-based approach for autonomous exploration. In: ICRA (1997)

  22. Çatal, O., Verbelen, T., Nauta, J., Boom, C.D., Dhoedt, B.: Learning perception and planning with deep active inference. In: ICASSP, pp. 3952–3956 (2020)

Acknowledgments

This research received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.

Author information

Corresponding author

Correspondence to Toon Van de Maele.

A The Generative Model

The generative model described in this paper is approximated by a neural network that predicts a multivariate Gaussian distribution with a diagonal covariance matrix. We consider a neural network architecture from the family of variational autoencoders (VAE) [11, 18], very similar to the Generative Query Network (GQN) [4]. In contrast to traditional autoencoders, this model encodes multiple observations into a single latent distribution that describes the scene. Given a query viewpoint, new, unseen views can then be generated from the encoded scene description. A high-level description of the architecture is shown in Fig. 5.

We represent the camera pose as a 3D point for the position and a quaternion for the orientation, as the latter does not suffer from gimbal lock. The encoder maps each observation to a latent distribution, which we model as a 32-dimensional multivariate Gaussian with a diagonal covariance matrix. The latent distributions of all observations are combined into a distribution over the entire scene, analogous to the update step of a Kalman filter [10]. No prediction step is necessary, as the agent does not influence the environment. The decoder takes as input a concatenation of the scene representation and the query viewpoint. Intuitively, both are important: the viewpoint determines which area of the scene is observed, and the representation determines which objects are visible at each position. Between the convolutional layers, the intermediate representation is transformed by a FiLM layer conditioned on this input vector, which allows the model to learn which features are relevant at different stages of the decoding process.
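The following minimal sketch illustrates this fusion step for diagonal Gaussians: each per-observation belief is weighted by its precision, exactly as in a Kalman filter update without a prediction step. Function names and tensor shapes are illustrative and not taken from our implementation.

```python
import torch

def fuse_gaussians(means, variances):
    """Fuse N diagonal-Gaussian beliefs about the same scene into a single
    Gaussian by precision weighting, i.e. a Kalman filter update step with
    no prediction step (the agent does not change the scene).

    means, variances: tensors of shape (N, latent_dim), e.g. latent_dim = 32.
    Returns the fused mean and variance, each of shape (latent_dim,).
    """
    precisions = 1.0 / variances             # per-observation precisions
    fused_var = 1.0 / precisions.sum(dim=0)  # combined precision -> variance
    fused_mean = fused_var * (precisions * means).sum(dim=0)
    return fused_mean, fused_var

# Hypothetical usage: three observations, 32-dimensional latent space.
mus = torch.randn(3, 32)
variances = torch.rand(3, 32) + 0.1
scene_mu, scene_var = fuse_gaussians(mus, variances)
```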

A dataset of 250 scenes, each consisting of approximately 25 (image, viewpoint) pairs, was created in a simulator in order to train this model. To limit the complexity of the model, all observations are taken with the same fixed downward orientation.

Table 1. Training implementation details.

The neural network is optimized using the Adam optimizer with the parameters shown in Table 1. For each scene, between 3 and 10 randomly picked observations are provided to the model, which is then tasked to predict a new one. The model is trained end-to-end using the GECO algorithm [19] on the following loss function:

$$\begin{aligned} \mathcal {L}_\lambda = D_{KL}[Q(\tilde{\pmb {s}}|\tilde{\pmb {o}}) || \mathcal {N}(\pmb {0}, \pmb {I})] + \lambda \cdot \mathcal {C}(\pmb {o}, \pmb {\hat{o}}) \end{aligned}$$
(3)
Fig. 5.

Schematic view of the generative model. The left part is the encoder, which produces a latent distribution for every (observation, viewpoint) pair. This encoder consists of 4 convolutional layers interleaved with FiLM [16] layers that condition on the viewpoint, transforming the intermediate representation to encompass the spatial information of the viewpoint. The latent distributions are combined to form an aggregated distribution over the latent space. A sampled vector is concatenated with the query viewpoint, from which the decoder generates a novel view. The decoder mimics the encoder architecture and has 4 convolutional blocks (each upsampling the image and processing it with two convolutional layers) interleaved with FiLM layers that condition on the concatenated information vector. Each layer is activated with a LeakyReLU [12] activation function.
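As an illustration of the conditioning mechanism used throughout the encoder and decoder, a minimal FiLM layer could be implemented as below. The channel count and context dimension are placeholders; only the scale-and-shift structure follows [16].

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise Linear Modulation: scales and shifts every feature map
    of a convolutional block, conditioned on a context vector (the viewpoint
    in the encoder, the scene-and-query vector in the decoder)."""

    def __init__(self, context_dim: int, num_channels: int):
        super().__init__()
        # One (gamma, beta) pair per feature channel.
        self.proj = nn.Linear(context_dim, 2 * num_channels)

    def forward(self, feature_maps: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # feature_maps: (B, C, H, W); context: (B, context_dim)
        gamma, beta = self.proj(context).chunk(2, dim=-1)
        gamma = gamma[..., None, None]  # broadcast over spatial dimensions
        beta = beta[..., None, None]
        return gamma * feature_maps + beta

# Hypothetical usage: condition 64 feature maps on a 7-dimensional viewpoint.
film = FiLMLayer(context_dim=7, num_channels=64)
modulated = film(torch.randn(2, 64, 16, 16), torch.randn(2, 7))
```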

The constraint \(\mathcal {C}\) is applied to an MSE loss between the reconstructed and the ground-truth observation; it simply requires that the MSE stay below a fixed tolerance. \(\lambda \) is a Lagrange multiplier, and the loss is optimized using a min-max scheme [19]. Specific implementation values are shown in Table 1.
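A minimal sketch of one GECO-style training step on the loss in Eq. (3) is shown below, assuming a multiplicative update of the Lagrange multiplier; the tolerance and update rate are placeholder values rather than the settings from Table 1.

```python
import torch

def geco_step(kl, mse, lagrange_lambda, tolerance=1e-2, lambda_lr=1e-2):
    """One GECO-style update for the loss in Eq. (3): minimize
    KL + lambda * (MSE - tolerance) over the network weights, while the
    multiplier is pushed up when the reconstruction constraint is violated
    and relaxed otherwise (min-max scheme).

    kl, mse: scalar tensors for the current batch.
    lagrange_lambda: treated as a constant here (a float or detached tensor).
    Returns the loss to backpropagate and the updated multiplier.
    """
    constraint = mse - tolerance
    loss = kl + lagrange_lambda * constraint
    with torch.no_grad():
        # Multiplicative update keeps lambda positive (placeholder rate).
        new_lambda = lagrange_lambda * torch.exp(lambda_lr * constraint)
    return loss, new_lambda
```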

The expected free energy is computed for a set of candidate poses. The generative model is first used to predict the expected view for each considered pose. The expected value of the posterior given this expected view is then computed over a large number of samples, which yields the expected epistemic term. For numerical stability, we clamp the variances of the posterior distributions to a value of 0.25. The instrumental value is computed as the MSE between the preferred state and the expected observation, which boils down to computing the log likelihood when every pixel is modelled by a Gaussian with a fixed variance of 1.
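The sketch below assembles these two terms for a single candidate pose. Note that it substitutes a closed-form KL divergence between the imagined posterior and the current scene prior for the sampled epistemic estimate described above, and scores the instrumental term as a unit-variance Gaussian log likelihood (half the squared error, up to a constant); all names and the sign convention (lower G is better) are illustrative.

```python
import torch

def diag_gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL[N(mu_q, var_q) || N(mu_p, var_p)] for diagonal Gaussians."""
    return 0.5 * (torch.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0).sum(-1)

def expected_free_energy(post_mu, post_var, prior_mu, prior_var,
                         expected_view, preferred_view):
    """Illustrative G(pose): negative information gain (epistemic term)
    plus the negative log preference of the expected view (instrumental
    term). Candidate poses with the lowest G would be selected."""
    post_var = post_var.clamp(min=0.25)    # clamp variances as in the paper
    prior_var = prior_var.clamp(min=0.25)
    epistemic = -diag_gaussian_kl(post_mu, post_var, prior_mu, prior_var)
    # Unit-variance Gaussian log likelihood per pixel ~ 0.5 * squared error.
    instrumental = 0.5 * ((expected_view - preferred_view) ** 2).mean()
    return epistemic + instrumental
```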

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Van de Maele, T., Verbelen, T., Çatal, O., De Boom, C., Dhoedt, B. (2020). You Only Look as Much as You Have To. In: Verbelen, T., Lanillos, P., Buckley, C.L., De Boom, C. (eds) Active Inference. IWAI 2020. Communications in Computer and Information Science, vol 1326. Springer, Cham. https://doi.org/10.1007/978-3-030-64919-7_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-64919-7_11

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-64918-0

  • Online ISBN: 978-3-030-64919-7

  • eBook Packages: Computer Science (R0)
