Disentangling What and Where for 3D Object-Centric Representations Through Active Inference

  • Conference paper
  • In: Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021)


Although modern object detection and classification models achieve high accuracy, they are typically trained on a fixed set of categories and therefore cannot flexibly deal with novel, unseen object categories. Moreover, these models most often operate on a single frame, which may yield incorrect classifications for ambiguous viewpoints. In this paper, we propose an active inference agent that actively gathers evidence for object classifications and can learn novel object categories over time. Drawing inspiration from the human brain, we build object-centric generative models composed of two information streams: a what-stream and a where-stream. The what-stream predicts whether the observed object belongs to a specific category, while the where-stream is responsible for representing the object in its internal 3D reference frame. We show that our agent (i) is able to learn representations for many object categories in an unsupervised way, (ii) achieves state-of-the-art classification accuracy, actively resolving ambiguity when required, and (iii) identifies novel object categories. Furthermore, we validate our system in an end-to-end fashion, where the agent is able to search for an object at a given pose from a pixel-based rendering. We believe that this is a first step towards building modular, intelligent systems that can be used for a wide range of tasks involving three-dimensional objects.



  1. Calli, B., Singh, A., Walsman, A., Srinivasa, S., Abbeel, P., Dollar, A.M.: The YCB object and model set: towards common benchmarks for manipulation research. In: 2015 International Conference on Advanced Robotics (ICAR), pp. 510–517 (2015)

  2. Daucé, E., Perrinet, L.: Visual search as active inference. In: IWAI (2020).

  3. Eslami, S.M.A., et al.: Neural scene representation and rendering. Science 360(6394), 1204–1210 (2018)

  4. Ettlinger, G.: Object vision and spatial vision: the neuropsychological evidence for the distinction. Cortex 26(3), 319–341 (1990)

  5. Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., O’Doherty, J., Pezzulo, G.: Active inference and learning. Neurosc. Biobehav. Rev. 68, 862–879 (2016)

  6. Gilmer, J., Adams, R.P., Goodfellow, I.J., Andersen, D.G., Dahl, G.E.: Motivating the rules of the game for adversarial example research. CoRR, abs/1807.06732 (2018)

  7. Hadsell, R., Rao, D., Rusu, A.A., Pascanu, R.: Embracing change: continual learning in deep neural networks. Trends Cogn. Sci. 24(12), 1028–1040 (2020)

  8. Hawkins, J., Ahmad, S., Cui, Y.: A theory of how columns in the neocortex enable learning the structure of the world. Front. Neural Circ. 11, 81 (2017)

  9. Hawkins, J., Lewis, M., Klukas, M., Purdy, S., Ahmad, S.: A framework for intelligence and cortical function based on grid cells in the neocortex. Front. Neural Circ. 12, 121 (2019)

  10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, June 2016

  11. Hinton, G.E.: How to represent part-whole hierarchies in a neural network. CoRR, abs/2102.12627 (2021)

  12. Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local experts. Neural Comput. 3(1), 79–87 (1991)

  13. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2017)

  14. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)

  15. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - vol. 1, NIPS 2012, pp. 1097–1105, Red Hook, NY, USA (2012). Curran Associates Inc

  16. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 405–421. Springer, Cham (2020).

  17. Nikolov, N., Kirschner, J., Berkenkamp, F., Krause, A.: Information-directed exploration for deep reinforcement learning (2019)

  18. Parr, T., Sajid, N., Da Costa, L., Mirza, M.B., Friston, K.J.: Generative models for active vision. Front. Neurorobot. 15, 34 (2021)

  19. Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and approximate inference in deep generative models (2014)

  20. Rezende, D.J., Viola, F.: Taming VAEs (2018)

  21. Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. CoRR, abs/1710.09829 (2017)

  22. Safron, A.: The radically embodied conscious cybernetic Bayesian brain: from free energy to free will and back again. Entropy 23(6), 783 (2021)

  23. Shanahan, M., Kaplanis, C., Mitrovic, J.: Encoders and ensembles for task-free continual learning. CoRR, abs/2105.13327 (2021)

  24. Sitzmann, V., Zollhöfer, M., Wetzstein, G.: Scene representation networks: continuous 3D-structure-aware neural scene representations. In: Advances in Neural Information Processing Systems (2019)

  25. Smith, R., Friston, K., Whyte, C.: A step-by-step tutorial on active inference and its application to empirical data (2021)

  26. Van de Maele, T., Verbelen, T., Çatal, O., De Boom, C., Dhoedt, B.: Front. Neurorobot. 15, 14 (2021)

Acknowledgements


This research received funding from the Flemish Government (AI Research Program). Ozan Çatal was funded by a Ph.D. grant of the Flanders Research Foundation (FWO). Part of this work has been supported by Flanders Innovation & Entrepreneurship, by way of grant agreement HBC.2020.2347.

Author information

Correspondence to Toon Van de Maele.

Appendix A Neural Network Architecture and Training Details

The neural network is based on a variational autoencoder [14, 19], consisting of an encoder and a decoder. The encoder \(\phi _\theta \) uses a convolutional pipeline to map a high dimensional input image (64 \(\times \) 64 \(\times \) 3) to a low dimensional latent distribution. We parameterize the identity of the object as a Bernoulli distribution, and the camera viewpoint as a multivariate Normal distribution with diagonal covariance matrix over 8 latent dimensions. The decoder \(\psi _\theta \) then takes a sample from the viewpoint distribution and reconstructs the observation through a convolutional pipeline using transposed convolutions. In addition to a traditional variational autoencoder, we have a transition model \(\chi _\theta \) that, given an action, transforms a sample from the viewpoint distribution into a novel latent distribution. This action is a 7D vector representing the translation as 3D coordinates and the rotation as a quaternion. The architectures of the encoder, decoder and transition model are shown in Tables 1, 2 and 3, respectively.
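The two-stream split described above can be sketched as follows. This is a minimal, hypothetical NumPy sketch: random, untrained linear projections stand in for the convolutional trunk and the three heads, and only the 8 viewpoint dimensions follow the text; everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

N_LATENT = 8  # viewpoint latent dimensions, as stated in the text

def encoder_heads(features: np.ndarray):
    """Map trunk features to (what-probability, where-mu, where-logvar).

    The projections below are random placeholders; a real model
    would learn them end-to-end.
    """
    d = features.shape[-1]
    w_what = rng.standard_normal(d) / np.sqrt(d)
    w_mu = rng.standard_normal((d, N_LATENT)) / np.sqrt(d)
    w_logvar = rng.standard_normal((d, N_LATENT)) / np.sqrt(d)
    what_prob = 1.0 / (1.0 + np.exp(-features @ w_what))  # Bernoulli parameter
    mu = features @ w_mu                                   # Gaussian mean
    logvar = features @ w_logvar                           # log-variance head
    return what_prob, mu, logvar

def reparameterize(mu, logvar):
    """Draw a viewpoint sample v ~ N(mu, diag(exp(logvar)))."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

features = rng.standard_normal(128)  # stand-in for trunk output
p, mu, logvar = encoder_heads(features)
v = reparameterize(mu, logvar)
```

The reparameterized sample `v` is what the decoder and transition model would consume downstream.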

The model is optimized end-to-end by minimizing the free energy described in Eq. 2. The expectations over the different terms are approximated through stochastic gradient descent using the Adam optimizer [13]. As minimizing the negative log likelihood of the reconstruction under a Gaussian likelihood is equivalent to minimizing the mean squared error (MSE), the latter is used in practice. Similarly, the negative log likelihood of the identity is implemented as a binary cross-entropy (BCE) term. We choose the prior belief over \(\mathbf {v}\) to be an isotropic Gaussian with unit variance. The individual terms of the loss function are constrained and weighted using Lagrangian multipliers [20]. We consider only a single timestep during the optimization process. In practice, this boils down to:

$$\begin{aligned} \begin{aligned} L_{FE} =&\lambda _1 \cdot L_{BCE}(\hat{i}, i) + \lambda _2 \cdot L_{MSE}(\psi _\theta (\mathbf {\hat{v}}_{t+1}), \mathbf {o}_{t+1}) \\&+ D_{KL}[ \underbrace{\chi _\theta (\mathbf {v}_t, \mathbf {a}_t)}_{q(\mathbf {v}_{t+1}|\mathbf {v}_{t}, \mathbf {a}_t, \mathbf {i})} \, || \, \underbrace{\phi _\theta (\mathbf {\hat{o}}_{t+1})}_{p(\mathbf {v}_{t+1}|\mathbf {i}, \mathbf {o}_t)} ] \end{aligned} \end{aligned}$$

where \(\hat{i}\) is the what-stream prediction \(\phi _\theta (\mathbf {o}_t)\) of the encoder, \(\mathbf {\hat{v}}_{t+1}\) is a sample from the predicted transitioned distribution \(\chi _\theta (\mathbf {v}_t, \mathbf {a}_t)\), and \(\mathbf {\hat{o}}_{t+1}\) is the expected observation from viewpoint \(\mathbf {\hat{v}}_{t+1}\), decoded through \(\psi _\theta (\mathbf {\hat{v}}_{t+1})\). The \(\lambda _{i}\) variables are the Lagrangian multipliers used in the optimization process.
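For concreteness, the three terms of this loss can be written out in plain NumPy. This is an illustrative sketch, not the paper's implementation: `bce`, `mse` and the closed-form diagonal-Gaussian KL divergence correspond to the \(L_{BCE}\), \(L_{MSE}\) and \(D_{KL}\) terms, and the \(\lambda \) weights default to 1 as placeholders rather than the tuned Lagrangian multipliers.

```python
import numpy as np

def bce(p_hat, i, eps=1e-7):
    """Binary cross-entropy: negative log likelihood of the Bernoulli what-stream."""
    p_hat = np.clip(p_hat, eps, 1 - eps)  # avoid log(0)
    return -(i * np.log(p_hat) + (1 - i) * np.log(1 - p_hat))

def mse(o_hat, o):
    """Mean squared error: the Gaussian reconstruction term."""
    return np.mean((o_hat - o) ** 2)

def kl_diag_gauss(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL[N(mu_q, diag) || N(mu_p, diag)] for diagonal Gaussians."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def free_energy(p_hat, i, o_hat, o, q_params, p_params, lam1=1.0, lam2=1.0):
    """Single-timestep free-energy loss: weighted BCE + MSE + KL."""
    mu_q, lv_q = q_params
    mu_p, lv_p = p_params
    return (lam1 * bce(p_hat, i)
            + lam2 * mse(o_hat, o)
            + kl_diag_gauss(mu_q, lv_q, mu_p, lv_p))
```

Note that the KL term vanishes exactly when the transitioned distribution matches the encoder's posterior, which is the behaviour the loss encourages.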

During training, pairs of observations \(\mathbf {o}_t\) and \(\mathbf {o}_{t+1}\) with the corresponding action \(\mathbf {a}_t\) are required. To maximize data efficiency, the equation is also evaluated for zero-actions using only a single observation, which is reconstructed directly without the transition model.
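The zero-action case can be made concrete with the 7D pose representation. In this sketch a scalar-first (w, x, y, z) quaternion ordering is assumed, which the paper does not specify: the zero-action is a zero translation combined with the identity quaternion, so applying it leaves a pose unchanged.

```python
import numpy as np

# 7D action: 3D translation followed by a unit quaternion (w, x, y, z assumed).
ZERO_ACTION = np.array([0.0, 0.0, 0.0,        # zero translation
                        1.0, 0.0, 0.0, 0.0])  # identity quaternion

def apply_action(point, action):
    """Rotate a 3D point by the action's quaternion, then translate it."""
    t, (w, x, y, z) = action[:3], action[3:]
    # Standard rotation matrix of a unit quaternion.
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    return R @ point + t

p = np.array([1.0, 0.0, 0.0])
same = apply_action(p, ZERO_ACTION)  # zero-action leaves the point unchanged

# A 90-degree rotation about z: w = cos(45 deg), z = sin(45 deg).
quarter_turn_z = np.array([0, 0, 0, np.cos(np.pi/4), 0, 0, np.sin(np.pi/4)])
rotated = apply_action(p, quarter_turn_z)  # maps the x-axis onto the y-axis
```

In the model, the transition network \(\chi _\theta \) operates on latent codes rather than points, but the zero-action input it receives has exactly this form.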

Table 1. Neural network architecture for the image encoder. All strided layers use a stride of 2. The input image has a shape of 3 \(\times \) 64 \(\times \) 64. The output of the convolutional pipeline feeds three heads: the first predicts the mean \(\mu \) of the distribution, the second predicts the natural logarithm of the variance \(\sigma ^2\) for stability reasons, and the third predicts the classification score c, mapped between zero and one by a sigmoid activation.
Table 2. Neural network architecture for the image decoder. The input of this model is a sample drawn from the latent distribution, either directly from the encoder or transitioned through the transition model. All transposed convolution layers use a stride of 2. The final layer is a regular convolution with stride 1 and kernel size 1, followed by a sigmoid activation to map the outputs into the valid image range.
Table 3. Neural network architecture for the transition model. The input of this model is an 8-dimensional latent code concatenated with the 7-dimensional representation of the relative transform (position coordinates and orientation as a quaternion). For stability reasons, the log-variance is predicted rather than the variance directly.
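The log-variance trick mentioned in Tables 1 and 3 can be illustrated in a few lines. This is a generic sketch, not tied to the paper's code: exponentiating an unconstrained head output always yields a valid, strictly positive variance, whereas predicting \(\sigma ^2\) directly could go negative.

```python
import numpy as np

# Any real-valued network output maps to a strictly positive variance.
raw_outputs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])  # unconstrained head outputs
variances = np.exp(raw_outputs)     # exp(logvar) > 0 for every input
sigmas = np.exp(0.5 * raw_outputs)  # standard deviation: exp(logvar / 2)
```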

Appendix B Additional experimental details

In Table 4, the computed angular and translational distances for the 9 evaluated objects are shown. Figure 5 shows a sequence of imaginations for all 9 objects: the top row represents the ground truth input, the second row the reconstruction, and the subsequent rows imagined observations along a trajectory.

Table 4. The mean distance error in meters and angle error in radians for different objects of the YCB dataset [1] in our simulated environment. For each object, 20 arbitrary target poses were generated, over which the mean values are computed.
Fig. 5. The top row represents the ground truth observation that was provided as input to the model. The second row shows a direct reconstruction when no action is applied to the transition model. All subsequent rows show imagined observations along a trajectory.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Van de Maele, T., Verbelen, T., Çatal, O., Dhoedt, B. (2021). Disentangling What and Where for 3D Object-Centric Representations Through Active Inference. In: Kamp, M., et al. (eds.) Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2021. Communications in Computer and Information Science, vol 1524. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93735-5

  • Online ISBN: 978-3-030-93736-2
