Abstract
We propose an embodied system based on the free energy principle (FEP) for sensorimotor visual perception (SMVP). Although the FEP mathematically describes the rules that living things obey, modeling SMVP additionally requires the constraints imposed by embodiment. The proposed system consists of a body, which partially observes the environment, and a memory, which retains classified knowledge about the environment as a generative model; on this basis, the system performs active and perceptual inference. Evaluation on the MNIST dataset showed that the proposed system recognizes characters through active and perceptual inference, and that an intentionality corresponding to human confirmation bias is reproduced in the system.
Acknowledgements
The authors thank Dr. Qinghua Sun from Hitachi Ltd. for his constructive comments and suggestions for improving this paper.
Appendix
The generative model described in this paper is a combination of a variational autoencoder (VAE) and a fully connected neural network (FNN). The architecture is shown in Fig. 7. The encoder consists of four 2D convolutional layers, each followed by batch normalization and a rectified linear unit. The bottleneck consists of two linear transformation layers that compute the mean and the variance, with a reparameterization function. The decoder consists of four 2D transposed convolutional layers, each followed by batch normalization and a rectified linear unit (a sigmoid unit for the last layer). The classifier consists of a linear transformation layer followed by a rectified linear unit and a linear transformation layer followed by a softmax unit. The model was trained with the Adam optimizer (learning rate: 0.001) on the sum of the VAE loss and the FNN loss.
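The architecture above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the channel counts, latent dimension (16), classifier width (64), and the choice to classify from the latent mean are all assumptions, since Fig. 7 is not reproduced here.

```python
import torch
import torch.nn as nn

class VAEWithClassifier(nn.Module):
    """Sketch of the generative model: a convolutional VAE plus an FNN
    classifier on the latent code. Layer sizes are illustrative assumptions."""

    def __init__(self, latent_dim=16, n_classes=10):
        super().__init__()
        # Encoder: four 2D conv layers, each followed by batch norm + ReLU.
        def block(c_in, c_out, k, s, p):
            return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, p),
                                 nn.BatchNorm2d(c_out), nn.ReLU())
        self.encoder = nn.Sequential(
            block(1, 32, 4, 2, 1),     # 28 -> 14
            block(32, 64, 4, 2, 1),    # 14 -> 7
            block(64, 128, 3, 2, 1),   # 7  -> 4
            block(128, 256, 3, 2, 1),  # 4  -> 2
            nn.Flatten())
        # Bottleneck: two linear layers for the mean and log-variance.
        self.fc_mu = nn.Linear(256 * 2 * 2, latent_dim)
        self.fc_logvar = nn.Linear(256 * 2 * 2, latent_dim)
        # Decoder: four transposed conv layers, BN + ReLU (sigmoid on the last).
        self.fc_dec = nn.Linear(latent_dim, 256 * 2 * 2)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # 2 -> 4
            nn.ConvTranspose2d(128, 64, 3, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # 4 -> 7
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),     # 7 -> 14
            nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Sigmoid())                       # 14 -> 28
        # Classifier FNN: linear + ReLU, then linear + softmax.
        self.classifier = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_classes), nn.Softmax(dim=1))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization: z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.decoder(self.fc_dec(z).view(-1, 256, 2, 2))
        return recon, mu, logvar, self.classifier(mu)
```

Training would then minimize the sum of the VAE loss (reconstruction plus KL term) and the FNN classification loss with `torch.optim.Adam(model.parameters(), lr=0.001)`, matching the optimizer settings stated above.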
Algorithm 1 shows the pseudocode of the processing flow. The generative model \({p}_{{a}_{t-1}}\left({s}_{t},{x}_{t}\right)\) is pre-trained on training pairs \(\left({s}_{t},{x}_{t}\right)\). All training images \({s}_{t}\) are pre-processed so that the center of gravity of each image is shifted to the center position. During operation of the proposed embodied system, the process from line 2 to line 12 is repeated. First, an attention image \(s^{\prime}_{t}\) is obtained from the vision sensor. The past sensory input images are composed with the obtained \(s^{\prime}_{t}\) while maintaining each relative attention position. The center of gravity of the composed image is calculated, and the composed image is shifted so that the center of gravity lies at the center of the image; the shifted composed image is \({s}_{t}\). Then, \(q\left({x}_{t}|{\phi }_{{x}_{t}}\right)\) is calculated by inputting \({s}_{t}\) into \({p}_{{a}_{t-1}}\left({s}_{t},{x}_{t}\right)\). After that, the sub-function starting at line 14 is called to generate the expected sensory input images \({s}_{t+1}\). In the sub-function, \({q}_{img}\left({x}_{t}|{\phi }_{{x}_{t}}\right)\) is calculated by inputting \({s}_{t}\) into \({p}_{{a}_{t-1}}\left({s}_{t},{x}_{t}\right)\). A template image is generated by detecting the bounding rectangle of the non-zero pixels in \({s}_{t}\) and extracting that area from \({s}_{t}\). Template matching is carried out in \({q}_{img}\left({x}_{t}|{\phi }_{{x}_{t}}\right)\), yielding the representative position \(u_{cur}\) of the current \(s_{t}\). To calculate the next candidate attention positions \({u}_{next}\), a candidate region for \({u}_{next}\) is set. The candidate region is obtained by adding a fixed pixel margin to the region of \({s}_{t}\) in \({q}_{img}\left({x}_{t}|{\phi }_{{x}_{t}}\right)\); the region of \({s}_{t}\) is defined by \({u}_{cur}\) and the size of the template image.
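The center-of-gravity shift used both in pre-processing and when composing sensory inputs can be sketched as follows. This is a minimal numpy version under stated assumptions: the circular shift via `np.roll` is an assumption, since the paper does not specify boundary handling.

```python
import numpy as np

def recenter(img):
    """Shift a grayscale image so that its intensity center of gravity
    lies at the geometric center of the frame (boundary handling via a
    circular shift is an assumption of this sketch)."""
    h, w = img.shape
    total = img.sum()
    if total == 0:
        return img  # nothing to center
    ys, xs = np.mgrid[0:h, 0:w]
    cy = (ys * img).sum() / total  # center of gravity, row coordinate
    cx = (xs * img).sum() / total  # center of gravity, column coordinate
    dy = int(round(h / 2 - cy))
    dx = int(round(w / 2 - cx))
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)
```

In the composition step above, each new attention image \(s^{\prime}_{t}\) would first be pasted at its relative attention position before this recentering is applied to the composed image.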
The \({u}_{next}\) are calculated by sliding a window of the size of \(s^{\prime}_{t}\) over the candidate region with a fixed pixel stride; the representative positions of all window positions during sliding constitute \({u}_{next}\). The \({s}_{t+1}\) are generated by extracting the region of \({s}_{t}\) and the regions of the next candidate attention images \(s^{\prime}_{t + 1}\) from \({q}_{img}\left({x}_{t}|{\phi }_{{x}_{t}}\right)\). The region of \({s}_{t}\) is defined by \({u}_{cur}\) and the size of the template image, as mentioned above. The region of each \(s^{\prime}_{t + 1}\) is defined by \({u}_{next}\) and the size of \(s^{\prime}_{t + 1}\). The extracted images are clipped or zero-padded to the same size as \({s}_{t}\). Each approximate posterior distribution \(q\left({x}_{t+1}|{\phi }_{{x}_{t+1}}\right)\) is calculated by inputting each image included in \({s}_{t+1}\) into \({p}_{{a}_{t-1}}\left({s}_{t},{x}_{t}\right)\). Note that \(q\left({x}_{t}|{\phi }_{{x}_{t}}\right)\) is calculated from the current \({s}_{t}\), whereas \(q\left({x}_{t+1}|{\phi }_{{x}_{t+1}}\right)\) is calculated from \({s}_{t+1}\). The entropy of each \(q\left({x}_{t+1}|{\phi }_{{x}_{t+1}}\right)\) is calculated and added to the uncertainty map \(M\). Finally, the attention position with the minimum value in \(M\) is selected as the next attention position.
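The final selection step, scoring each candidate attention position by the entropy of its predicted posterior and fixating where the entropy is minimal, can be sketched as follows (function names are illustrative, not the paper's):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (natural log)."""
    p = np.clip(p, 1e-12, 1.0)  # guard against log(0)
    return -(p * np.log(p)).sum()

def select_next_attention(candidates):
    """candidates: list of (position, posterior) pairs, one per window
    position u_next. Returns the position whose predicted posterior
    q(x_{t+1} | phi_{x_{t+1}}) has minimum entropy, i.e. the minimum
    of the uncertainty map M described above."""
    return min(candidates, key=lambda c: entropy(c[1]))[0]
```

Selecting the minimum of \(M\) drives the system toward fixations whose predicted outcome it is already most confident about, which is how the confirmation-bias-like intentionality reported in the evaluation arises.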
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Esaki, K., Matsumura, T., Ito, K., Mizuno, H. (2021). Sensorimotor Visual Perception on Embodied System Using Free Energy Principle. In: Kamp, M., et al. Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2021. Communications in Computer and Information Science, vol 1524. Springer, Cham. https://doi.org/10.1007/978-3-030-93736-2_62
DOI: https://doi.org/10.1007/978-3-030-93736-2_62
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-93735-5
Online ISBN: 978-3-030-93736-2
eBook Packages: Computer Science (R0)