Sensorimotor Visual Perception on Embodied System Using Free Energy Principle

  • Conference paper
  • In: Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021)

Abstract

We propose an embodied system based on the free energy principle (FEP) for sensorimotor visual perception (SMVP). Although the FEP mathematically describes the rules that living things obey, the limitations imposed by embodiment are also required to model SMVP. The proposed system consists of a body, which partially observes the environment, and a memory, which retains classified knowledge about the environment as a generative model, and it executes active and perceptual inference. Evaluation on the MNIST dataset showed that the proposed system recognizes characters through active and perceptual inference, and that an intentionality corresponding to human confirmation bias is reproduced by the system.



Acknowledgements

The authors thank Dr. Qinghua Sun from Hitachi Ltd. for his constructive comments and suggestions for improving this paper.


Corresponding author

Correspondence to Kanako Esaki.


Appendix

The generative model described in this paper is a combination of a variational autoencoder (VAE) and a fully connected neural network (FNN); the architecture is shown in Fig. 7. The encoder consists of four 2D convolutional layers, each followed by batch normalization and a rectified linear unit. The bottleneck consists of two linear layers that compute the mean and the variance of the latent distribution, combined with the reparameterization trick. The decoder consists of four 2D transposed convolutional layers, each followed by batch normalization and a rectified linear unit (a sigmoid unit for the last layer). The classifier consists of a linear layer followed by a rectified linear unit and a second linear layer followed by a softmax unit. The model was trained with the Adam optimizer (learning rate 0.001) on the sum of the VAE loss and the FNN loss.

Fig. 7. Architecture of generative model
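For concreteness, the following is a minimal PyTorch sketch of such a VAE-plus-classifier model, assuming 28x28 single-channel inputs such as MNIST. The channel counts, kernel sizes, latent dimension, and the choice of classifying from the latent code are illustrative assumptions; only the overall structure (four convolutional encoder layers with batch normalization and ReLU, a reparameterized bottleneck, four transposed-convolutional decoder layers ending in a sigmoid, a two-layer classifier, and training on the sum of the VAE and FNN losses with Adam at a learning rate of 0.001) follows the description above.

```python
# Hypothetical sketch of the VAE + FNN classifier described above, assuming
# 28x28 single-channel inputs (e.g. MNIST). Channel counts, kernel sizes and
# the latent dimension are illustrative, not the authors' exact settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeModel(nn.Module):
    def __init__(self, latent_dim=16, num_classes=10):
        super().__init__()
        # Encoder: four 2D conv layers, each followed by batch norm and ReLU.
        chs = [1, 32, 64, 128, 256]
        self.encoder = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(chs[i], chs[i + 1], 3, stride=2, padding=1),
                          nn.BatchNorm2d(chs[i + 1]), nn.ReLU())
            for i in range(4)])                    # 28 -> 14 -> 7 -> 4 -> 2
        # Bottleneck: two linear layers for the mean and the (log-)variance.
        self.fc_mu = nn.Linear(256 * 2 * 2, latent_dim)
        self.fc_logvar = nn.Linear(256 * 2 * 2, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, 256 * 2 * 2)
        # Decoder: four transposed conv layers with batch norm and ReLU,
        # ending in a sigmoid.                     # 2 -> 4 -> 7 -> 14 -> 28
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 3, 2, 1, output_padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 3, 2, 1, output_padding=0),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 3, 2, 1, output_padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 3, 2, 1, output_padding=1),
            nn.Sigmoid())
        # Classifier (FNN): linear + ReLU, then linear; the softmax is applied
        # inside the cross-entropy loss. Classifying from the latent code is an
        # assumption about Fig. 7.
        self.classifier = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes))

    def forward(self, s):
        h = self.encoder(s).flatten(1)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        recon = self.decoder(self.fc_dec(z).view(-1, 256, 2, 2))
        return recon, self.classifier(z), mu, logvar

def loss_fn(s, labels, recon, logits, mu, logvar):
    # VAE loss: reconstruction term plus KL divergence to the unit Gaussian.
    recon_loss = F.binary_cross_entropy(recon, s, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # FNN loss: classification cross-entropy. Total loss = VAE loss + FNN loss.
    return recon_loss + kl + F.cross_entropy(logits, labels)

# Training would use torch.optim.Adam(model.parameters(), lr=0.001).
```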

Algorithm 1 shows the pseudo code of the processing flow. The generative model \({p}_{{a}_{t-1}}\left({s}_{t},{x}_{t}\right)\) is pre-trained on training data of \(\left({s}_{t},{x}_{t}\right)\). All the training data of \({s}_{t}\) are pre-processed so that the center of gravity of each image is shifted to the center position. During the operation of the proposed embodied system, the process from the 2nd line to the 12th line is repeated. First, an attention image \(s^{\prime}_{t}\) is obtained from the vision sensor. The past sensory input images are composed with the obtained \(s^{\prime}_{t}\) while maintaining each relative attention position. The center of gravity of the composed image is calculated, and the composed image is shifted so that its center of gravity is located at the center of the image; the shifted composed image is \({s}_{t}\). Then, \(q\left({x}_{t}|{\phi }_{{x}_{t}}\right)\) is calculated by inputting \({s}_{t}\) to \({p}_{{a}_{t-1}}\left({s}_{t},{x}_{t}\right)\). After that, the sub-function starting from the 14th line is called to generate the expected sensory input images \({s}_{t+1}\). In the sub-function, \({q}_{img}\left({x}_{t}|{\phi }_{{x}_{t}}\right)\) is calculated by inputting \({s}_{t}\) to \({p}_{{a}_{t-1}}\left({s}_{t},{x}_{t}\right)\). A template image is generated by detecting the bounding rectangle of the non-zero pixels in \({s}_{t}\) and extracting that area from \({s}_{t}\). Template matching is carried out in \({q}_{img}\left({x}_{t}|{\phi }_{{x}_{t}}\right)\), and the representative position of the current \(s_{t}\), \(u_{cur}\), is obtained. To calculate the next candidate attention positions \({u}_{next}\), a candidate region for \({u}_{next}\) is set. The candidate region is obtained by adding a fixed margin, in pixels, to the region of \({s}_{t}\) in \({q}_{img}\left({x}_{t}|{\phi }_{{x}_{t}}\right)\); the region of \({s}_{t}\) is defined by \({u}_{cur}\) and the size of the template image. The \({u}_{next}\) are calculated by sliding a window of the size of \(s^{\prime}_{t}\) with a fixed stride, in pixels, over the candidate region; the representative positions of all the window positions during sliding are \({u}_{next}\). The \({s}_{t+1}\) are generated by extracting the region of \({s}_{t}\) and the regions of the next candidate attention images \(s^{\prime}_{t + 1}\) from \({q}_{img}\left({x}_{t}|{\phi }_{{x}_{t}}\right)\). The region of \({s}_{t}\) is defined by \({u}_{cur}\) and the size of the template image, as mentioned above, and the regions of \(s^{\prime}_{t + 1}\) are defined by \({u}_{next}\) and the size of \(s^{\prime}_{t + 1}\). The extracted images are clipped or zero-padded to have the same size as \({s}_{t}\). Each approximate posterior distribution \(q\left({x}_{t+1}|{\phi }_{{x}_{t+1}}\right)\) is calculated by inputting each image included in \({s}_{t+1}\) to \({p}_{{a}_{t-1}}\left({s}_{t},{x}_{t}\right)\). Note that \(q\left({x}_{t}|{\phi }_{{x}_{t}}\right)\) is calculated using the current \({s}_{t}\), while \(q\left({x}_{t+1}|{\phi }_{{x}_{t+1}}\right)\) is calculated using \({s}_{t+1}\). The entropy of each \(q\left({x}_{t+1}|{\phi }_{{x}_{t+1}}\right)\) is calculated and added to the uncertainty map \(M\). Finally, the attention position having the minimum value in \(M\) is chosen as the next attention position.

Algorithm 1. Pseudo code of the processing flow
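As an illustration, the following Python sketch outlines the sub-function of the above processing flow: it builds the uncertainty map \(M\) over candidate attention positions and returns the position with minimum entropy. The model interface (model.expected_image, model.posterior), the margin and stride values, and the simplification of taking \(u_{cur}\) directly from the bounding rectangle instead of from template matching are all hypothetical assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the sub-function described above. model.expected_image
# and model.posterior stand in for the pre-trained generative model
# p_{a_{t-1}}(s_t, x_t); margin and stride are illustrative values.
import numpy as np

def entropy(q):
    """Entropy of a discrete approximate posterior q(x | phi_x)."""
    q = np.clip(q, 1e-12, 1.0)
    return float(-np.sum(q * np.log(q)))

def select_next_attention(model, s_t, window, margin=4, stride=2):
    """Build the uncertainty map M over candidate attention positions u_next
    and return the position with the minimum entropy."""
    q_img = model.expected_image(s_t)   # expected image from q_img(x_t | phi_x_t)
    # Template image: bounding rectangle of the non-zero pixels of s_t.
    ys, xs = np.nonzero(s_t)
    th, tw = ys.max() - ys.min() + 1, xs.max() - xs.min() + 1
    # u_cur should come from template matching in q_img; here it is approximated
    # by the bounding-rectangle corner for brevity.
    u_cur = (int(ys.min()), int(xs.min()))
    # Candidate region: the region of s_t (u_cur plus the template size),
    # expanded by a fixed margin and kept inside the image.
    top = max(u_cur[0] - margin, 0)
    left = max(u_cur[1] - margin, 0)
    bottom = min(u_cur[0] + th + margin, q_img.shape[0]) - window[0]
    right = min(u_cur[1] + tw + margin, q_img.shape[1]) - window[1]
    M = {}
    # Slide a window of the size of s'_t over the candidate region; every
    # window position is a candidate u_next.
    for y in range(top, max(bottom, top) + 1, stride):
        for x in range(left, max(right, left) + 1, stride):
            # Expected sensory input s_{t+1}: the region of s_t plus the next
            # candidate attention image s'_{t+1}, both extracted from q_img
            # and placed on a zero canvas of the size of s_t.
            s_next = np.zeros_like(s_t)
            s_next[u_cur[0]:u_cur[0] + th, u_cur[1]:u_cur[1] + tw] = \
                q_img[u_cur[0]:u_cur[0] + th, u_cur[1]:u_cur[1] + tw]
            s_next[y:y + window[0], x:x + window[1]] = \
                q_img[y:y + window[0], x:x + window[1]]
            # Entropy of q(x_{t+1} | phi_{x_{t+1}}) fills the uncertainty map M.
            M[(y, x)] = entropy(model.posterior(s_next))
    # The next attention position minimises the uncertainty map.
    return min(M, key=M.get)
```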

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Esaki, K., Matsumura, T., Ito, K., Mizuno, H. (2021). Sensorimotor Visual Perception on Embodied System Using Free Energy Principle. In: Kamp, M., et al. Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2021. Communications in Computer and Information Science, vol 1524. Springer, Cham. https://doi.org/10.1007/978-3-030-93736-2_62

  • DOI: https://doi.org/10.1007/978-3-030-93736-2_62

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93735-5

  • Online ISBN: 978-3-030-93736-2

  • eBook Packages: Computer Science, Computer Science (R0)
