Sensorimotor representation learning for an "active self" in robots: A model survey

Safe human-robot interactions require robots to be able to learn how to behave appropriately in spaces populated by people and thus to cope with the challenges posed by our dynamic and unstructured environment, rather than being provided a rigid set of rules for operations. In humans, these capabilities are thought to be related to our ability to perceive our body in space, sensing the location of our limbs during movement, being aware of other objects and agents, and controlling our body parts to interact with them intentionally. Toward the next generation of robots with bio-inspired capacities, in this paper, we first review the developmental processes of the mechanisms underlying these abilities: the sensory representations of body schema, peripersonal space, and the active self in humans. Second, we provide a survey of robotics models of these sensory representations and robotics models of the self, and we compare these models with their human counterparts. Finally, we analyse what is missing from these robotics models and propose a theoretical computational framework, which aims to allow the emergence of the sense of self in artificial agents by developing sensory representations through self-exploration.


Introduction
To enable robots to cooperate safely with humans within the same environment, it is vital for the next generation of robots to be equipped with the abilities to learn, adapt, and act autonomously in unstructured and dynamic environments. In other words, we want robots to operate in the same situations and conditions as humans do, to use the same tools, and to interact with and understand the same world in which humans' daily lives take place. To achieve this, we want robots to be able to learn how to behave appropriately in our world and thus to find efficient ways to cope with the challenges posed by our dynamic and unstructured environment, rather than being given a rigid set of rules to drive their actions. Robots should be able to deal efficiently with unexpected changes in the perceived environment as well as with modifications of their own physical structure in a scalable manner. So how can we build robots that possess such abilities?
We address this question by reviewing interdisciplinary research related to the developmental processes that form the representations of the body schema and the peripersonal space (PPS) and the sense of agency, and by discussing how these relate to the active self. We first review the development of the body schema, the sense of agency and the PPS in humans in Section 2. We also highlight that the body schema and the PPS representations emerge by exploration, and that they are critical for the development of agency and higher cognitive functions.
Then, in Section 3, we discuss the behavioral function and properties of the body schema and PPS representations in humans, and review the state-of-the-art models in developmental robotics with respect to these representations. Finally, in Section 4 we analyze these issues with respect to the so-called "active self" and conclude in Section 5 by proposing a general blueprint that builds on the verification principle by Stoytchev (2009) to overcome the limitations of current robotic systems. In the remainder of this section, we provide brief overall background and point to similar reviews in developmental robotics.
This review focuses on the first phases of the development of the self in humans. We consider especially the physical mechanisms of this development and the models inspired by these developmental processes. Therefore, our attention is on computational and robotics models proposed for humanoid robots or robots of equivalent configuration, i.e. having cameras as eyes, manipulators as arms and possibly a tactile-covered system as artificial skin. We acknowledge that social interactions might have some influence on the developmental processes of multisensory representations and the self; these are discussed elsewhere, e.g. in Cléry et al. (2015, Section 6), Serino (2019, Section 6), and others (Teneggi et al., 2013; Meltzoff, 2007; Meltzoff and Marshall, 2020; Tsakiris, 2017; Fotopoulou and Tsakiris, 2017). We leave the systematic review of social aspects for future work.

The active self and the sense of agency
The notion of the active self relates to the connections between perception, action, and prediction, and how these connections facilitate the emergence of a minimal self. By using the term "active self", we argue that the sensorimotor activities of an agent are a prerequisite for the emergence of the minimal self, in the sense that "the phenomenal, minimal self is empirically derived from sensorimotor experience and not a theoretical and empirical given" (Verschoor and Hommel, 2017). The minimal self, or "minimal phenomenal selfhood" (Blanke and Metzinger, 2009), refers to the pre-reflective sense of being a self that is the subject of immediate experience (Gallagher, 2000). This minimal notion of the self involves the sense of agency (the sense of the self as the one causing or generating an action) and the sense of ownership (the sense of the self as the one subjected to an experience) (Gallagher, 2000).
The sense of agency and body ownership are emergent properties of a complex, embodied system that is situated in a dynamic environment that has a level of uncertainty. However, one can argue that the level of "complexity" of the environment is related to the system's own sensory and motor capacities. Simply put, the more information a system can perceive from the interaction with the environment, and the "richness" with which the system can act upon the environment, the more "complex" the information from the environment will be to that system (Pfeifer et al., 2007). Thus, information is formed by the interaction, rather than being "provided" by the environment and decoded by the perceptual system of the agent. It follows then that the properties of the system, the body, along with the properties of the environment, govern the interaction between the embodied agent and the environment in which it is situated (the ecological niche), as well as the developmental process of the agent itself. Infants are born into a dynamic, uncertain environment with which the interaction is complex. However, human infants (as well as other complex biological systems) are not born with complete pre-existing knowledge about their environment, nor about their own body. Infants construct this knowledge over time, and progressively form a model of the body (a body representation) and a model of the environment through interactions (O'Regan and Noë, 2001; Varela et al., 1991; see also Hoffmann et al., 2010 and Jacquey et al., 2019 for reviews).
Table 1 (caption, partially recovered): [...] (Cardinali et al., 2009; de Vignemont, 2010; Serino, 2019). Serino (2019) suggests that the interaction of proprioception ( * ) and visual signals about the body part is vital for frame transformations in the dynamic cases of the PPS.

The body schema and the peripersonal space
In humans, the capabilities to deal with unexpected changes in the environment and modifications of our own physical structure (e.g. growth or extending an arm with a tool) emerge from our ability to perceive our body in space, sensing the location of our limbs during movement, being aware of other objects and agents, and controlling our body parts to interact with them intentionally. These abilities are thought to be related to the presence of a body schema, peripersonal space (PPS), and the minimal self including the sense of body ownership and sense of agency. As initially defined by Head and Holmes (1911), the body schema is a sensorimotor representation of the structure and position of the human body, which is encoded in the brain and allows the agent to perform body movements. Also maintained by the brain, the PPS denotes the representation of the proximal space surrounding the agent's body. This space is commonly defined as the space within reach but outside the body surface, differentiated from the extrapersonal space and the personal space. Specifically, PPS is the space where all motor activities of an agent such as object manipulations take place (Serino, 2019).
Fig. 1: Development path of different body-related representations and sensations in the first year in infants. The development of the body schema is reviewed and discussed in section 2.1, from the fetal stage to the stage around 3 months of age after birth. The development of the PPS is discussed in section 2.3, which is suggested to continue from 3 to 6-10 months of age. The development of the active self, which relates to causal action-effect and action selection, is discussed in section 2.2. Literature suggests that the active self, as a process of self emergence, takes place from birth to 9 months of age.
For example, consider grasping an external object in the reachable space of a robot. In order to execute this action, an agent requires two prerequisites: first, it needs to be aware of and monitor the position of its body part, e.g. a limb to execute the movement. Second, it needs to "compute" the dynamic position, dimension, etc. of the target with respect to the agent's body. The brain provides the awareness about the body schema and body configuration, and the computation of the target location is a result of the brain's PPS representation. These two representations, the body schema and the PPS, emerge from the low-level integration of different sensory modalities available in a human body (see Table 1 for details). They are closely related and interact with each other.
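The two prerequisites for grasping described above can be illustrated with a deliberately simplified sketch. It assumes a planar two-joint arm whose link lengths are known in advance, which is not how the representations reviewed here arise (they are learned from multisensory experience); the function names, link lengths and target coordinates are hypothetical and chosen only for illustration. The first function plays the role of the body schema (where is the hand, given the joint angles?), the second the role of a frame-of-reference transformation associated with the PPS (where is the target relative to the hand?).

```python
import math

def hand_pose(shoulder_angle, elbow_angle, upper_len=0.30, fore_len=0.25):
    """Body-schema step: hand position and orientation in the body frame.

    Link lengths (metres) are hypothetical values for a planar two-joint arm.
    """
    total = shoulder_angle + elbow_angle
    x = upper_len * math.cos(shoulder_angle) + fore_len * math.cos(total)
    y = upper_len * math.sin(shoulder_angle) + fore_len * math.sin(total)
    return x, y, total

def target_in_hand_frame(target_xy, hand):
    """PPS step: re-express a body-frame target relative to the hand."""
    hx, hy, heading = hand
    dx, dy = target_xy[0] - hx, target_xy[1] - hy
    # Rotate the offset by -heading so that it is aligned with the hand.
    c, s = math.cos(-heading), math.sin(-heading)
    return (c * dx - s * dy, s * dx + c * dy)

def is_reachable(target_xy, upper_len=0.30, fore_len=0.25):
    """A target is within reach if its distance fits the arm's workspace."""
    distance = math.hypot(target_xy[0], target_xy[1])
    return abs(upper_len - fore_len) <= distance <= upper_len + fore_len

hand = hand_pose(math.radians(90), math.radians(-90))
target = (0.35, 0.30)
relative = target_in_hand_frame(target, hand)
print("target relative to hand:", relative)
print("within reach:", is_reachable(target))
```

In this toy setting, extending the arm with a tool would amount to changing `fore_len`, which changes both the hand pose and the set of reachable targets at once; this is one simple way to picture why plastic changes in the two representations are so closely linked.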
Indeed, there are some overlapping functionalities of the body schema and the PPS representation, namely (i) they are both multisensory representations; (ii) they convey frame of reference (FoR) transformations; (iii) they have a strong link with actions within the reachable space; and (iv) the representations are plastic. According to Cardinali et al. (2009) and Canzoneri et al. (2013), these overlaps are potentially due to their causal relation (the extension of the body representation leads to the extension of the PPS representation in tool-use cases) or their unique identity. Nevertheless, differences between the two representations exist, and stem from the involvement of external objects within reach in the environment (in tool-use cases, external out-of-reach objects also get involved), which cause the non-bodily stimuli. Because of the requirement of body continuity, there are also certain tool-use cases, e.g. a remotely controlled tool like the computer mouse and its pointer, in which the body schema representation cannot include the tool. Hence, the representation is not affected. Instead, the spatial representation of PPS would be modulated due to the availability of visual-tactile correlation and action-effect association (Cardinali et al., 2009). The recent behavioral study of D'Angelo et al. (2018) suggests that there are separate mechanisms for the plastic changes of body schema and PPS representations.
As reviewed by de Vignemont (2010), the representation of an agent's body can be divided into the body schema (the representation for actions) and the body image (other body-related representations for perception, conception and emotion), according to the dyadic taxonomy (see also Dijkerman and de Haan, 2007). The body image can be further separated into two distinct representations, namely the visuo-spatial body map (the structural description of body parts) and body semantics (the conceptual and linguistic level of body parts), according to the triadic taxonomy. However, from the perspective of the enactive approach (Varela et al., 1991), in which sensorimotor exploration gives rise to perceptual experiences, the distinction between the bodily action-oriented and perceptual representations is quite blurry. For example, the visual appearance and boundary of a limb would have an effect on the agent's perception of the length and position of the limb. Hence, it is reasonable to include the body structure description of the body image representation (and its sources of sensory information) when considering the body schema in action, especially from the computational perspective. Indeed, most robotics models of the so-called body schema fall into this category (see Section 3.2 and Table 2). Furthermore, from the technical point of view, it is difficult to model the mental level of the body image when the definition is unclear. Therefore, we will use the term "body schema" in an extended meaning including both "body schema" and "body structure description".

Developmental processes of emergent selfhood and body awareness
Although the bodily senses exist in human adults as a result of multisensory integration processes, these abilities are not innate: newborns and infants develop them over time (Bremner et al., 2012b). Indeed, the senses of the bodily self, i.e. the sensation of the position of a body part, the surrounding space, and the feeling of owning and controlling one's body, incrementally develop in newborns in the very first months of their lives (e.g. Rochat, 1998; Rochat and Morgan, 1995a; Bremner et al., 2008a, 2012b; Orioli et al., 2019; Meltzoff and Marshall, 2020).
Taken together, it is reasonable to argue that after birth, infants spend their first months of life passing through many developmental milestones to incrementally develop the representation of their body. This body schema is related mainly to touch, proprioception, and vision (see Table 1), as these sensory modalities continue to develop from the fetal stage (see Hoffmann, 2017; Adolph and Joh, 2007 for reviews). Later on, the representation of the space surrounding the body, the PPS, is aggregated from the proprioceptive and exteroceptive modalities (see Table 1). In addition, infants develop the capability to generate motor actions corresponding to desired outcomes, and the ability to distinguish between self and other, both related to the senses of body ownership and agency. At first, these developments may be triggered by self-exploration movements. Subsequently, the enhanced perceptual capability may help infants to improve their motor control, from reflexive to intentional, goal-directed movement (see Fig. 1). Insights from the developmental dynamics of these abilities may suggest important prerequisites for formulating developmental models of artificial intelligence.

Related work
There are several reviews that relate to the topic of this paper. First, the review by Hoffmann et al. (2010) on robotic models of body schema surveyed the concept of body schema in biology, its properties, and its relation with the forward models used in the field of robotics. The review also provides a thorough overview on body schema-inspired robotic models. In this work, we will briefly review the body schema properties and further provide a complementary view on this sensory representation. Furthermore, we will provide an update on robotic models of the body schema representation.
In (Cangelosi and Schlesinger, 2014), the authors presented both theoretical and experimental aspects of the developmental robotics approach. The approach promotes the idea of building artificial agents by receiving inspiration from human developmental science. The authors outline the theoretical principles of the approach including embodiment, enaction, cross-modality, and online, cumulative, open-ended learning. The experimental review provides an overview on developmental robotics models from intrinsic motivation, perceptual and motor developments, to social learning, language skills, and abstract knowledge developments.
Moreover, Schillaci et al. (2016) reviewed the literature and argued for the fundamental role of sensorimotor interaction in the development of both human and artificial agents. In this process, the agent's motor exploration in a situated environment serves as a means for gathering sensorimotor experiences, which in turn facilitate the emergence of other cognitive functions. For example, sensorimotor experiences are used to learn a forward model, and a forward model can be the basis for learning high-level cognitive conceptual representations. In agreement with Schillaci et al. (2016), we aim to go deeper into the role of multisensory information collected through exploration in the formation of an agent's body and peripersonal space representation, and how these sensorimotor representations affect the agent's sense of the active self, including the sense of agency and the sense of body ownership. Thus, motor explorations will be mentioned but not exhaustively discussed in this survey. Instead, we focus on the body schema, the peripersonal space, and the emergence of the sense of agency.
Georgie et al. (2019) discussed the development of body representations as a prerequisite for the emergence of the minimal self, which includes body ownership and agency. They discuss some of the behavioural measures indicating the presence of body ownership and sense of agency in humans, and survey some of the related robotics research that examined and developed these concepts. In their review, the authors suggested possible expansions to the robotics research for exploring the development of an artificial minimal self. Specifically, they proposed focusing on models that incorporate a whole developmental path in a real robot, including e.g. self-exploration and self-touch, where behavioural indices can be measured at different points along the developmental path.
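The point reviewed above, that sensorimotor experience gathered through motor exploration can be used to learn a forward model, can be made concrete with a minimal sketch. It is not the method of any of the cited works: the planar two-joint arm, the link lengths, the uniform random babbling and the nearest-neighbour predictor are all assumptions made for illustration. The agent issues random motor commands, stores the sensed hand position for each, and later predicts the outcome of a new command from the most similar stored experience.

```python
import math
import random

UPPER, FORE = 0.30, 0.25  # hypothetical link lengths in metres

def true_hand_position(q):
    """Stand-in for the physical arm. The learner never inspects this
    formula; it only observes the outcomes of its own commands."""
    shoulder, elbow = q
    return (UPPER * math.cos(shoulder) + FORE * math.cos(shoulder + elbow),
            UPPER * math.sin(shoulder) + FORE * math.sin(shoulder + elbow))

def babble(n_samples, seed=0):
    """Motor babbling: issue random joint commands and store what is sensed."""
    rng = random.Random(seed)
    memory = []
    for _ in range(n_samples):
        q = (rng.uniform(0.0, math.pi), rng.uniform(0.0, math.pi))
        memory.append((q, true_hand_position(q)))
    return memory

def predict(memory, q):
    """Forward model: predict the sensory outcome of command q from the
    most similar command experienced so far (1-nearest neighbour)."""
    _, outcome = min(memory, key=lambda item: math.dist(item[0], q))
    return outcome

memory = babble(3000)
query = (1.0, 1.2)
error = math.dist(predict(memory, query), true_hand_position(query))
print(f"prediction error: {error:.3f} m")
```

A nearest-neighbour memory is used here only because it needs no assumptions about the arm's structure; a parametric or neural model could take its place, and the prediction error it leaves behind is the kind of signal that accounts of agency often build on.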
Concurrently with this paper, Tani and White (2020) review models of the sense of minimal and narrative self in cognitive neurorobotics, but mainly focus on models utilizing RNN architectures that follow the free-energy principle and active inference approach (Friston and Kiebel, 2009).

Development of the body schema, peripersonal space, and the sense of agency in humans
Before reviewing robotic models of the body schema, PPS representations, and the sense of agency, we first consider the development of these representations and agency in humans. This covers the development of the body schema from gestation to infancy (Section 2.1), the emergence of the sense of agency in infants (Section 2.2), and the PPS representation in infants (Section 2.3).

Development of the body schema from gestation to infancy
The development of the body schema is inseparably linked with sensorimotor development, starting from as early as the fetal stage and continuing after birth. The body schema's neural foundation is formed by the neurological representations of the different anatomical divisions of the body. These are the cortical "homunculi" (see Fig. 2 for an illustration) in the primary sensory (S1) and motor (M1) cortices (Penfield and Boldrey, 1937). The different anatomical divisions of the body are mapped onto brain areas in charge of sensory and motor processing along S1 and M1. The organization of these specialized areas is realized in a somatotopic map, where adjacent body parts are represented closely together (for the most part; see Rasmussen, 1950, but also Di Noto et al., 2013). Moreover, the extent of cortex dedicated to a body region is proportional to the density of innervation in that specific part (e.g. the mouth and palms) rather than to its size in the body. The establishment of the somatotopic organisation in S1 and M1 is facilitated by genetic factors, and later refined through connectivity changes driven by embodied interactions both before and after birth (Dall'Orso et al., 2018).
In terms of motor development, fetuses in the first weeks of gestation typically display different types of motor patterns such as spontaneous startles that start at 7-8 weeks, general movements which start at 8 weeks, isolated movements that emerge soon after, and twitches which start at 10-12 weeks and are produced during active sleep (Fagard et al., 2018). These very early motor patterns seem to be spontaneous rather than responses to sensations. However, the first sense to develop in the fetus is the tactile sense (Bradley and Mistretta, 1975), which develops at around the same time as motor movements, and fetuses are in a state of constantly being touched by their environment. Once sensory receptors develop, the fetus' spontaneous movements inevitably lead to sensations, thus facilitating the formation of contingencies between movements and their sensory outcomes (Fagard et al., 2018). Also, fetuses engage in self-touch in the womb: they often touch body parts that are highly innervated and therefore most sensitive to touch, such as the mouth and feet, and later on other parts of the body. The early tendency for movements and self-touch in parts of the body that are more sensitive points to a certain preference towards movements that induce more informative sensations (for a review on fetal sensorimotor development see Fagard et al., 2018).
Positron emission tomography (PET) studies revealed that in infants under 5 weeks after birth, the dominant metabolic activity is in subcortical regions and the sensorimotor cortex, and by 3 months, metabolic activity increases in the parietal, temporal, and dorsolateral occipital cortices (Chugani, 1994). It seems that at around 2 months after birth, behavioral control transitions from subcortical to cortical systems. In addition, subcortical regions such as the superior colliculus have been investigated as a hub for multimodal integration in human and animal studies (Bahrick and Lickliter, 2002). Specifically, the superior colliculus has been implicated as able to support social behaviour in early infancy (Pitti et al., 2013), due to its role in attentional behaviours (Valenza et al., 1996; Stein and Meredith, 1993).
It seems that the ability to predict sensory consequences of actions, and subsequently to form sensorimotor contingencies, begins to develop already in the uterus. There is evidence of anticipatory behaviour in fetuses during hand-to-mouth touch as early as 19 weeks (Myowa-Yamakoshi and Takeshita, 2006; Reissland et al., 2014), indicating the presence of a sort of sensorimotor mapping and inference. From 22 weeks of gestation, movements seem to show an early form of goal-directedness, in that the properties of a movement differ depending on the action's target (more careful movement towards the eye than towards the mouth) (Zoia et al., 2007). In turn, these in-utero embodied interactions are thought to lay the foundation for the later integration of tactile-proprioceptive and visual information after birth. Using an embodied brain model of a human fetus in a simulated uterine environment, Yamada et al. (2016) showed how these interactions promote the cortical learning of body representations by way of regularities in sensorimotor experiences, and instantiate postnatal visual-somatosensory integration.
Right after birth, there is a certain regression in motor control, possibly due to the fundamental change in environment: the newborn has to adapt to an aerial environment in which gravity is felt more strongly and to the sudden change in brightness, and is highly preoccupied with bodily functions such as feeding, sleeping, and crying (Fagard et al., 2018). Nonetheless, hand-mouth coordination continues to develop after birth. Infants seem to frequently explore their body at around 2 or 3 months, and from birth to 6 months they display self-touch progressively throughout their body, from frequently touching rostral parts such as the head and trunk to more caudal parts such as the hips, legs, and feet (Thomas et al., 2015). The evidence brought forth by Rochat and Morgan (1995a), Morgan and Rochat (1997) and Rochat (1998) suggests that infants develop the ability to perceive multisensory spatial contingencies (e.g. visual-proprioceptive or visual-tactile) soon after birth (e.g. Bahrick and Watson, 1985; see also Bremner et al., 2012b for a review), and also form the perceptual body schema (via intermodal calibration) by 3 months of age.
While evidence from neural development studies suggests that, even before birth, the prenatal brain should be able to perceive information arising from the body (a rudimentary body schema involving tactile and proprioceptive information), the later maturation of cortical association areas constitutes higher-level (multimodal) representations that are possibly formed during the first year after birth (Hoffmann, 2017). As Hoffmann (2017) writes, "However, the formation of more holistic multimodal representations of the body in space occurs probably only after birth, in particular from about 2-3 months." Studies show that infants develop a body schema from early on in life, allowing them to form expectations about how their bodies look and where they are located in space (Rochat, 1998). From 3 months of age on, when presented with a real-time display of their own legs, infants look longer at an unfamiliar, third-person perspective of their legs than at a familiar, first-person view (Rochat and Morgan, 1995b). These longer looking times were interpreted as showing that infants expected the images to match their own body schema and were surprised when a mismatch between what they expected and what they observed on the display violated those expectations.
Others provided further evidence on infants' body representations using an adapted version of the rubber hand illusion paradigm (Zmyj et al., 2011). In the first experiment, infants observed two adjacent displays of baby doll legs being stroked while their own leg was also stroked. In the contingent display, the stroking of the infant's own leg corresponded to the movements on the display, whereas in the non-contingent display there was a mismatch between the felt and observed stroking of the leg. Results showed that 10-month-old infants, but not 7-month-olds, looked longer at the contingent display, suggesting that at 10 months of age infants detected the visual-tactile contingencies necessary for the identification of self-related stimuli. In this study, longer looking times were interpreted as indicating the early ability to detect visual-tactile contingencies. In order to find out whether morphological properties of the body facilitated the detection of visual-tactile contingencies, Zmyj et al. (2011) ran a control experiment in which 10-month-old infants observed wooden blocks instead of baby doll legs, which were stroked in synchrony or out of synchrony with their own leg. Infants looked equally long at the contingent and non-contingent displays, suggesting that they were able to detect visual-tactile contingencies only when the visual information was related to the body (Zmyj et al., 2011).
This preference for specifically body-related synchrony was also later found in newborns. Filippetti et al. (2013) investigated the role of temporal synchrony in multisensory integration, to examine whether body-related temporal synchrony detection plays a role even from birth. In two experiments, Filippetti et al. presented newborns (from as early as 12 hours after birth) with temporally synchronous and asynchronous visual-tactile stimulation. The visual information was either body-related (an upright newborn face in experiment 1) or non-body-related (an inverted newborn face in experiment 2). Preference for, or increased attention to, the stimuli was measured by longer looking time. Newborns showed a preference for the synchronous visual-tactile stimulus only in the body-related condition, indicating that this increased attention was present only when the synchrony was related to their own body, rather than reflecting a general preference for synchrony. The results provide another piece of evidence for the notion that even right after birth, newborns are able to integrate multisensory information and detect synchronous multisensory stimulation, processes that are fundamental for body representations.
In another study, Filippetti et al. (2015b) presented newborns with videos of newborn faces being stroked with a paintbrush at a location that was either spatially congruent or incongruent with the tactile stimulation the newborns received. The newborns showed a preference for the spatially congruent visual-tactile stimulation, suggesting that even shortly after birth, newborns are sensitive to visual-tactile multisensory information. These two studies showed that the ability to detect temporal and spatial contingencies in multisensory information is present shortly after birth, even without self-generated movement.
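The temporal contingencies probed in these looking-time studies can be given a simple computational reading: a felt touch and a seen touch count as contingent when they occur within a short time window of each other. The following sketch illustrates only this idea; the function name, the 0.1 s tolerance and the event times are hypothetical and are not taken from the cited experiments.

```python
def synchrony_score(felt_times, seen_times, tolerance=0.1):
    """Fraction of felt touches that have a seen touch within `tolerance` seconds."""
    if not felt_times:
        return 0.0
    matched = sum(
        any(abs(felt - seen) <= tolerance for seen in seen_times)
        for felt in felt_times
    )
    return matched / len(felt_times)

# Hypothetical stroke onsets in seconds.
felt = [0.5, 1.5, 2.5, 3.5]             # tactile strokes on the infant
seen_sync = [0.52, 1.48, 2.55, 3.5]     # display stroked in synchrony
seen_async = [1.0, 2.0, 3.0, 4.0]       # display shifted by half a second

print("synchronous display: ", synchrony_score(felt, seen_sync))
print("asynchronous display:", synchrony_score(felt, seen_async))
```

A detector of this kind responds to synchrony regardless of what is shown, so on its own it would not reproduce the body-specific preference reported above; some additional constraint tied to body-related visual information would be needed.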
It is worth pointing out that, besides methodological differences between studies (e.g. age groups, sample size), the different feedback modalities (i.e. visual-tactile vs. visual-proprioceptive) and task complexity might have played a role in the different looking-time responses in infants. More research with different measures (e.g. pupillometry, EEG) is needed to clarify this point.
Following up on Filippetti et al. (2013), Filippetti et al. (2015a) ran an fNIRS study to investigate the brain regions involved in visual-tactile contingency detection for body ownership in infants. Five-month-old infants observed either real-time or delayed videos of themselves while they received tactile stimulation on the cheek with a soft brush. Data revealed that infants showed bilateral activation over the superior temporal sulcus (STS), temporoparietal junction (TPJ) and inferior frontal gyrus (IFG) cortical regions in the contingent condition in response to visual-tactile (and visual-proprioceptive) contingencies. This finding shows that infants as young as 5 months of age show activation in brain regions similar to those of adults when they process information related to their own bodies.
Recently, employing neuroscience techniques, Marshall, Meltzoff and colleagues conducted a set of experiments on infants' representations of bodies at the neural level (see Meltzoff and Marshall, 2020 for a review). Using EEG, Saby et al. (2015) reported that a group of 7-month-old infants showed somatotopic patterns resembling the homunculus map: tactile stimuli to infants' feet corresponded to responses in the midline area of the brain, whereas stimuli to their hands yielded responses in lateral central areas. Even a younger group of infants (60 days old) showed brain responses when touched on the hand, foot and upper lip (Meltzoff et al., 2019). Notably, the magnitude of the response to lip touch was much higher than the responses to hand or foot touch, suggesting a heightened tactile sensitivity of the lip area after birth.

Emergence of sense of agency
Developmental researchers have pointed towards two potential underlying mechanisms explaining how infants become agents over their bodies and the environment, namely (i) associative learning and (ii) a causal representation of the world.
One line of research emphasized an associative learning mechanism that enables infants to detect the sensory contingencies in their environment. Although its focus was on infants' memory functions, the seminal work by Rovee and Rovee (1969) yielded some of the earliest findings on infants' sense of agency. In their mobile-paradigm experiments, infants at around 3 months of age lay in a crib above which a mobile was hanging. One of the infant's limbs was connected to the mobile with a ribbon, so that in the connect phase moving that limb made the mobile move. Infants moved the connected limb with increasing frequency when it was connected to the mobile, but not when the connected limb was switched or when there was a delay between the movement of the limb and the effect. Interestingly, infants showed increased kicking when the mobile was disconnected, suggesting that they were trying to re-elicit the effect (Rovee-Collier et al., 1978). Using the mobile paradigm, Watanabe and Taga (2006) showed that whereas 2-month-old infants produced increased movement in all limbs compared to a baseline period, by the age of 3 to 4 months they showed increased movement only in the connected limb. These findings were interpreted as showing that at around 3 months of age infants have learned the causal link between self-produced movements and their effects in the environment, an indication of "a sense of self-agency" (Watanabe and Taga, 2011). Other researchers investigated infants' sense of agency using different paradigms. For example, Rochat and Striano (1999) measured infants' sucking on a dummy pacifier to investigate whether 2-month-old infants showed differential oral activity based on auditory feedback. In the Analog condition, each time infants sucked on the pacifier, they heard a pitch variation of the sound corresponding to the oral pressure applied on the pacifier. In the Non-Analog condition, each time infants applied pressure on the pacifier, they heard a random pitch variation. Data revealed that 2-month-old infants produced more frequent oral pressure on the pacifier when the auditory effect matched their sucking behavior, suggesting that they detected the link between their sucking behavior and the sound effect.
Another line of research emphasized the causal representation of actions and their effects underlying the sense of agency. Researchers ascribing to this view argue that an associative learning mechanism would not be sufficient to account for infants' sense of agency because sense of agency requires a causal representation of the world (Zaadnoordijk et al., 2018). Because the behavioral patterns such as increased movement frequency when connected to a mobile can be explained by alternative mechanisms, these findings provide no evidence for infants' causal representations of their actions and the effects, i.e. sense of agency. Zaadnoordijk et al. (2018) simulated the mobile paradigm with a "babybot" that functioned on operant conditioning, thus, it did not have a causal representation of itself and its environment to guide its actions. The simulation results showed that the non-representational babybot produced increased movement with the connected limb as compared to the baseline level of that limb as well as other unconnected limbs. That is, even in the absence of a causal model of the world, the babybot replicated the behavioral findings observed in infant experiments that have been interpreted as evidence for a sense of agency. However, unlike infants, the babybot did not increase its movement rate when the mobile was disconnected. In other words, in the absence of reinforcement, the babybot ceased its behavior. Based on these findings, the authors argued that a sense of agency requires representing the causal link between one's actions and an effect, which is observed in infants but not in non-representational agents.
In a follow-up EEG study, Zaadnoordijk et al. (2020) tested whether 3- to 4.5-month-old infants showed neural markers of the causal action-effect models that are required for a sense of agency. Infants' limbs were connected to a digital mobile on a computer screen via four accelerometers, one per limb, only one of which was functional and activated the mobile. In the connect phase, the image was animated when the infant moved the limb connected to the functional accelerometer. In the disconnect phase, the link between infants' movement and the effect was broken, that is, the image remained static even if infants moved the limb operating the mobile. Data showed that a group of infants who showed an increased error response in their brains in the disconnect phase (i.e. when the action-effect link was broken) also showed an extinction burst in their behavior, indicating that they had constructed a causal model of their actions and the effect. Moreover, the same group of infants moved the limb that operated the mobile more frequently than the other connected limbs. These findings suggest that the causal action-effect models necessary for a sense of agency only begin to emerge between 3 and 4.5 months of age. It is worth noting that the causal relation of actions and sensory effects can be represented as computational forward models that map the current state of the system to the next state through actions.
Other evidence regarding infants' ability to detect sensory contingencies, presented by Verschoor and Hommel (2017), also supports the idea that the sense of agency emerges through the agent's own sensorimotor experience at around the same time, rather than being innate. As the authors point out, however, the ability to anticipate the outcomes of actions, realized by a forward model, is vital but not sufficient for the full development of the sense of agency in infants. Without the ability to control their own bodies to perform actions that change the environment in line with expected sensory effects, it is hard to rule out the possibility that the increase in infants' activity (during and after the experiments) might be due to an entrainment effect. It is worth recalling that infants' motor movement is highly reflex-like during this early stage of development, rather than voluntary and controlled (see discussion in Section 2.3). Not before 9 months of age do infants know how to select which actions to perform in order to achieve an expected or desired outcome, which relates to the action selection process (some sort of inverse model) (see Verschoor and Hommel, 2017 for a review; also Willatts, 1999; Woodward and Sommerville, 2000; Woodward et al., 2009; Elsner, 2007). This timeline corresponds with the development of other motor skills in infants, e.g. reaching, as we will discuss in Section 2.3. These processes are, of course, coordinated with the maturation of other skills in infants such as eye-head coordination and postural control (see e.g. Von Hofsten, 2004; Adolph and Joh, 2007 for a review). However, the ability to predict sensory outcomes of motor actions develops earlier and precedes the ability to predict motor actions that would produce a desired sensory state (see Jacquey et al., 2019 for a discussion and review on the development of predictive abilities in humans).
The bidirectional associations between actions and effects, refined through the forward and inverse models, are hypothesized to trigger the sense of agency: While the forward model helps to predict the outcomes of conducted actions, the inverse model maps expected effects to the actions to perform. The smaller the error between the predicted and the actual outcome of an intentional action (the predictive-coding process; Friston and Kiebel, 2009; Friston, 2012; Apps and Tsakiris, 2014), the stronger the agency experience (see Verschoor and Hommel, 2017 for a review; Hommel, 2015a; Chambon and Haggard, 2013; Tsakiris et al., 2007).
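
The interplay of forward model, inverse model and prediction error can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the cited works: the "body" is a hypothetical linear map `W_true` from actions to sensory effects, the forward model is fitted by least squares, the inverse model is its pseudo-inverse, and `agency_score` is an arbitrary monotone mapping from prediction error to a value in (0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear "body": sensory effect = W_true @ action (illustrative assumption).
W_true = np.array([[1.0, 0.5],
                   [-0.3, 2.0]])

# Self-exploration: random actions and the effects they produce.
actions = rng.normal(size=(200, 2))
effects = actions @ W_true.T

# Forward model: least-squares fit that predicts effects from actions.
W_fwd = np.linalg.lstsq(actions, effects, rcond=None)[0].T


def forward_model(action):
    """Predict the sensory effect of an action."""
    return W_fwd @ action


def inverse_model(desired_effect):
    """Select an action expected to produce the desired effect."""
    return np.linalg.pinv(W_fwd) @ desired_effect


def agency_score(predicted, observed):
    """Map prediction error to (0, 1]; a smaller error gives a higher score."""
    return float(np.exp(-np.linalg.norm(predicted - observed)))


goal = np.array([0.4, -1.0])
action = inverse_model(goal)
self_caused = W_true @ action                   # effect actually produced by the action
external = self_caused + np.array([1.0, 1.0])   # same action, outcome altered by an outside cause

score_self = agency_score(forward_model(action), self_caused)
score_external = agency_score(forward_model(action), external)
```

In this toy setting the self-caused outcome matches the prediction and yields a score close to 1, while the externally perturbed outcome yields a lower one, mirroring the hypothesized relation between prediction error and agency experience.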

Development of the peripersonal space
While there is a body of studies on the representation of peripersonal space (PPS) in adults (see Section 3.3 for a brief review), there is very little research on this representation in infants, especially in their first months after birth. In a recent study, Orioli et al. (2019) present a modified version of the reaction time (RT) measurement developed by Canzoneri et al. (2012) to address the question of whether the boundaries of the PPS representation are available in newborns. Instead of measuring the participants' vocal response time to tactile stimuli during an audio-tactile interaction task, they propose to measure the saccadic latency to visual targets (sRTs) as an indirect measure of infants' RTs. With the infants' sRTs showing a similar pattern to the adults' RTs, Orioli et al. (2019) suggest that some sort of PPS boundaries already exist soon after birth, which facilitate simultaneous multisensory matching in newborns.
More systematically, Bremner et al. (2012a, 2008a) propose that the development of the PPS representation relates to two main mechanisms, namely visual spatial reliance and postural remapping. The former mechanism, which develops as early as 6 months of age, allows infants to estimate the body and its surroundings based on the statistical variability of sensory sources and the canonical layout of their body. This seems to follow the ability to detect sensory contingencies, which contributes to constructing some sort of perceptual body schema (as discussed in Section 2.1). However, these sensory contingencies, at this early age, may not necessarily be encoded in a specific body-part reference frame, which is an important functionality of the PPS representation (Bremner et al., 2012a). The latter mechanism, postural remapping, takes postural changes into account to dynamically map external stimuli and limb positions. This mechanism develops in infants (and works alongside the former) at around 6.5 to 10 months. In their experiments, Bremner et al. (2008b) reveal that 6.5-month-old infants bias their crossmodal responses to the typical side of their hands, whereas 10-month-old infants can respond appropriately on both sides even in crossed-hand postures. These findings suggest that the PPS representation emerges through the combination of the two mechanisms and is not yet fully developed prior to 6.5 months. This stage-wise development is in line with a recent neuroscience finding on somatosensory processing in 6- to 7-month-old infants (using somatosensory mismatch negativity (sMMN)), which speculates that the somatotopic phase of tactile processing does exist at that age while the later phases involving the shifting of the frame of reference are still under development (Shen et al., 2018).
As we present later (in Section 3.3), the sensorimotor mapping of the PPS representation takes part in voluntary movements toward nearby objects within the reachable space. The development of these motor movements in infants can be observed as a source of behavioral measures for PPS development (Bremner et al., 2008a). Furthermore, these changes in the properties of the motor movements, in turn, provide sensory experiences for the refinement and alignment of the different sensory modalities underlying the PPS representation. In the first year after birth, reaching movements develop from discontinuous, reflex-like movements to more directed, organized, and visually elicited reaching (see Corbetta et al., 2018 for a review; also Thelen et al., 1993). In the former phase, the movement appears to proceed in a trial-and-error manner (Thelen et al., 1993) and is monitored mainly by proprioceptive feedback (Schlesinger and Parisi, 2001; Bremner et al., 2008b). That is, the movements to the goal can be conducted without visual feedback of the infant's hand (e.g. Clifton et al., 1993, 1991, 1994). During this pre-reaching phase, infants are also observed to accidentally touch their own bodies (double touch; Rochat, 1998) or clothes during spontaneous movements, giving rise to the grounding of bodily perception by integrating proprioception and touch. At reaching onset, infants prefer looking at the space in which the hand and object make contact (Corbetta et al., 2014). This suggests that tactile feedback facilitates the emergence of hand-eye coordination, when the perception of the body and the external space intersect and are being calibrated (Corbetta et al., 2018). These events are in agreement with results from Bremner et al. (2012b), who argue that this development of reaching behaviors is due to the infants' improvement in using both familiar and unfamiliar postural information (e.g. crossed hands) to competently align spatial information from different sensory sources. These observations and results place the emergence of PPS in infants at around 6-10 months of age.

3 Computational and robotic models of body schema and PPS representations
In this section, we first discuss the behavioral functionalities and properties of the body schema and PPS representations in humans (Sections 3.1 and 3.3). Second, we review computational and robotics models of the representations (Sections 3.2 and 3.4). This structure may help readers to directly compare models of those sensory representations constructed in artificial agents with the ones in humans.

3.1 Properties and function of the body schema representation
As discussed above, the representation of the body schema seems to develop at a very early stage in newborns (in continuity with the development during the fetal stage) and is based upon multisensory integration, i.e. of proprioceptive, tactile and possibly visual information (see Table 1 and e.g. Holmes and Spence, 2004; Cardinali et al., 2009; Gallese and Sinigaglia, 2010). Along with the maturation of the visual modality, the body schema representation would be grounded and extended with the perceptual representation.
Due to the integration of sensory information, the body schema representation can plastically be modulated to include other objects such as a tool. This is known as the body schema extension paradigm, where agents are trained to actively use a tool to conduct motor actions (Martin et al., 2014;Cardinali et al., 2009;Martel et al., 2016;Serino, 2019). It is worth noting that this plasticity property does not exist when the tool is passively held by the agents. This dynamic plasticity of the body schema enables humans (and primates) to use tools flexibly.
The role of body schema in actions has been suggested as related to the motor control process through two types of internal models of the agent, namely the forward and inverse models. These two models construct the bi-directional mapping between the sensory information with motor information. Taking into account the temporal properties of sensory information forming the body representation, there exists a short-term representation, updated constantly like the angle of a joint, and a long-term representation, such as the size of a limb, which is relatively stable over time. Jointly, these two representations provide a good initial estimate for the body schema. This is required for the inverse computation (of the inverse model) for motor commands generation to achieve a desired state of the body. Concurrently, the forward model predicts the outcomes of the motor commands, resulting in the predicted body schema, and receives the feedback from the sensory system as the updated body schema (Hoffmann et al., 2010;de Vignemont, 2010).
Another key function of the body schema is to allow the coordinate transformations between different sensory modalities conducted by the brain. The transformations are thought to be processed through population-based encoding conducted by gain-field neurons (see Hoffmann et al., 2010 for a review; also Bullock et al., 1993; Blohm and Crawford, 2009; Salinas and Abbott, 2001; Pouget et al., 2002; Ajemian et al., 2001; Baraduc et al., 2001). In robotics, the frame of reference (FoR) transformation is normally computed by a chain of transformation matrices, each represented by a Denavit-Hartenberg (D-H) parameterization (Siciliano et al., 2009; Siciliano and Khatib, 2016). However, the D-H transformations do not directly allow the mapping between different sensory modalities in the way that gain-field neurons do.
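
For readers less familiar with the robotics side, the chain of D-H transformation matrices mentioned above can be sketched as follows. This is a minimal illustration using the classic D-H convention; the function names `dh_matrix` and `forward_kinematics` and the two-link planar arm are assumptions chosen for the example, not part of any reviewed model.

```python
import numpy as np


def dh_matrix(theta, d, a, alpha):
    """Homogeneous transform between consecutive links (classic D-H convention)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])


def forward_kinematics(joint_angles, link_params):
    """Chain the per-link transforms; link_params holds (d, a, alpha) per link."""
    T = np.eye(4)
    for theta, (d, a, alpha) in zip(joint_angles, link_params):
        T = T @ dh_matrix(theta, d, a, alpha)
    return T


# Hypothetical planar arm with two revolute joints and unit-length links.
planar_arm = [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
T_end = forward_kinematics([np.pi / 2, -np.pi / 2], planar_arm)
end_effector_position = T_end[:3, 3]
```

The resulting matrix expresses the end-effector frame in the base frame; it transforms between frames of a single kinematic chain and, as noted above, does not by itself relate different sensory modalities.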

Computational and robotic models of the body schema
The problem of learning the robot's body schema is often broken down into two main problems: (i) kinematic model identification/calibration, and (ii) visuomotor learning/mapping, depending on the type of input signals. Models of the former group mostly require only body-related sensors including proprioception and touch, e.g.  (2016); Nguyen et al. (2018c). As a result, the former category requires some sort of a priori knowledge of the robot's body in terms of parameterized functions, e.g. a CAD model, forward kinematics, inverse kinematics, etc. The approaches of the latter category can work completely model-free and without a priori knowledge.
In the following, we present a survey of models of the robotic body schema in ascending order of the amount of a priori knowledge provided in the learning problem. By organizing the reviewed models in this order, we aim to emphasize one important aspect of autonomous systems: The ability to learn and adapt to dynamic environments. Ideally, an autonomous system should be able to learn to complete different tasks with only little provided information. A summary of the reviewed models is presented in Table 2.
Inspired by infants' self-touch behaviors for "body calibration", Roncone et al. (2014) present a strategy for a humanoid robot to self-calibrate its body schema by bringing the end-effector of one arm to touch various locations on the other arm (which is covered by artificial skin taxels). In this work, the body schema is represented in the form of kinematic chains. Positions of the end-effector computed from proprioceptive input (i.e. joint encoders) and estimated from the skin system are used for kinematic calibration by an optimization algorithm.
Similarly, Li et al. (2015) consider the problem of learning the body schema as kinematic calibration, in which they can exploit the CAD model for initialization. In detail, the authors use continuous self-touch movements (sliding) to calibrate the closed kinematic chain formed by two KUKA LWR arms (i.e. the slave and master in a torso setup) touching each other. Hence, the calibration problem becomes computing the relative transformation matrix by least squares estimation, given pairs of measured contact locations on the two arms.
Table 2: Summary of models of body schema representations. Sensory information is coded as: visual-V, proprioception-P, tactile-T, audio-A
Vicente et al. (2016a,b) cast the internal process of adapting the robot body schema as a hand-eye coordination problem: First, the hand pose and initial calibrated offsets are estimated with the particle filter method, using stereo-vision and encoder measurements; then the internal model is updated by reducing the differences between the model prediction of the end-effector and its observed value. For this approach, it is vital to have prior knowledge about the kinematic structure of the robot, i.e. a kinematics model, transformation matrices and the camera's intrinsic parameters.
In contrast to Vicente et al. (2016a), Zenha et al. (2018) employ an Extended Kalman filter instead of the Monte Carlo particle filter for incremental kinematics model calibration in an iCub simulation. In addition, tactile input caused by touch events between the robot's finger and known surfaces during the robot's random movements is employed instead of visual input. The prior knowledge of the robot model is also employed in a goal babbling strategy toward the desired contact surfaces.
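
To make the least-squares estimation of a relative transformation from paired contact locations more concrete, the following sketch uses the standard SVD-based (Kabsch) solution for a rigid transform between two point sets. It is an illustrative assumption rather than the formulation of Li et al. (2015); the function name `fit_rigid_transform`, the synthetic contact points and the noise level are all hypothetical.

```python
import numpy as np


def fit_rigid_transform(points_a, points_b):
    """Least-squares R, t such that points_b is approximately points_a @ R.T + t."""
    centroid_a = points_a.mean(axis=0)
    centroid_b = points_b.mean(axis=0)
    H = (points_a - centroid_a).T @ (points_b - centroid_b)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that det(R) = +1.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = centroid_b - R @ centroid_a
    return R, t


rng = np.random.default_rng(0)

# Hypothetical ground truth: rotation of 0.3 rad about z plus a translation.
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -0.2, 0.1])

# Synthetic contact locations expressed in the frames of the two arms.
contacts_a = rng.uniform(-0.3, 0.3, size=(50, 3))
contacts_b = contacts_a @ R_true.T + t_true + rng.normal(scale=1e-3, size=(50, 3))

R_est, t_est = fit_rigid_transform(contacts_a, contacts_b)
```

With enough well-spread contact pairs, the estimate stays close to the true transform despite measurement noise, which is why sliding self-touch over a large skin area is attractive for this kind of calibration.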
Diaz Ledezma and Haddadin (2019) present a versatile and dedicated framework using the First-Order-Principle (FOP), derived from the Newton-Euler equations, for learning both the body schema, i.e. topology and morphology, and the inverse dynamics, i.e. the inertial properties, of a simulated ATLAS humanoid and a Franka Emika arm in a modular manner. The parameters of the FOP are learnt from proprioceptive signals only, including kinematics-related measurements K and dynamics-related measurements D, collected during random trajectories generated by a PD controller. In particular, the authors propose to exploit knowledge regarding the physical system, i.e. physical laws and joint connectivity, as optimization constraints to facilitate the topology search problem.
In a different approach, Hoffmann et al. (2018) construct a representation of the iCub robot's whole-body skin surface in the form of a 2D map (a robotic somatosensory homunculus) by employing the dot-product-based SOM (DP-SOM) with an additional mask vector as a way to impose the binding constraint between neurons and the input layer, i.e. skin taxels, to steer the learning process of the network. The authors show that the new variant of SOM, the Maximum Receptive Field SOM (MRF-SOM), can handle multiple tactile contacts simultaneously and enables the robot to learn a topological representation similar to the primary somatosensory cortex of primates. In a later study, Gama and Hoffmann (2019) extend the MRF-SOM to the proprioceptive domain, with preliminary results. They aim to enable a robot to learn a proprioceptive representation of its joint space that resembles the proprioceptive representations in the somatosensory cortex. The underlying hypothesis is that body representations may arise as a consequence of the agent's self-touch.
Inspired by the gain-field mechanism for spatial transformation in the human brain, Abrossimoff et al. (2018) propose a neural network model consisting of two gain-field networks, sigma-pi networks of Radial Basis Functions, for sensorimotor transformation and multimodal integration. The former is a visuomotor network for inverse dynamics learning, and the latter learns a body-centered coordinate system of the robot's hand and the target. After being trained, the networks enable a three-link robot to reach visual targets in a simulated 2D environment.
Ulbrich et al. (2009) propose a method to learn the forward kinematics (FK) mapping from the robot's joint configuration and the visual position of the end-effector as body schema learning. Moreover, they represent the FK with Kinematic Bézier Maps (KB-Maps), a technique derived from computational geometry, and show that the model can be learned more efficiently with linear least squares optimization by constraining the KB-Map with some topology knowledge. The learning method is validated on noisy data collected from random joint movements of the ARMAR-IIIa humanoid robot in both simulation and hardware.
Lallee and Dominey (2013) propose so-called Multimodal Convergence Maps (MMCMs), a SOM-based implementation of the Convergence-Divergence Zones framework, for the integration of multiple sensory modalities to encode sensorimotor experiences of iCub robots. MMCMs contain bi-directional connections from each sensory modality through a hierarchical structure (i.e. unimodal-amodal). Thus, after being trained, they allow predicting the activation of missing modalities given the other(s). Herein, the visuomotor mapping³ is constructed by training the MMCMs with proprioceptive data from the arm and head, and image data from the robot camera during gazing and reaching activities. The encoded map of the learnt internal representation allows the robot to "mentally imagine" the appearance and position of its body parts.
Schillaci et al. (2014) learn a visuo-motor coordination task in the Nao humanoid robot with a model consisting of two Dynamic Self-Organising Maps (DSOMs) encoding the arm and head joint space input, associated by Hebbian links to simulate the synaptic plasticity of the brain. Two learning processes, one for updating the DSOMs and another for Hebbian learning, are employed to train the model in an online manner during the robot's motor babbling. As a result, the robot gradually improves its ability to track the movement of its arm during the exploration process by controlling the head with output from the DSOM-based model.
Widmaier et al. (2016) propose an algorithm based on Random Forests to estimate the robot's arm pose by regressing the joint angles directly from the depth input images at the pixel level. The model can work in a frame-by-frame manner, without requiring an initialization or segmentation step. Instead of a random forest, the model of Nguyen et al. (2018c) uses a deep neural network to regress the joint angles of the iCub humanoid robot, given a pair of stereo-vision images and the 6-DoF joint configuration of the robot's head (and eyes). The model is trained on a self-generated dataset from the robot's motor babbling of its head and arms in a simulated environment and on the real robot. Furthermore, a GAN-based framework is designed for transferring the learnt visuo-motor mapping from the simulation to a real robot, which helps to overcome calibration errors that often occur in physical robots.
Based on the hypothesis about the slow dynamics of the agent's own body compared to the dynamics of the environment, Laflaquiere and Hafner (2019) propose a deep neural network model for body representation estimation. The network is composed of two branches consisting of deconvolution and convolution layers. The former branch generates images of the robot's body with respect to the robot motor input, whereas the latter estimates the pixel-wise prediction error between the generated image from the former branch and the ground-truth. After training, the robot is able to predict the image of its own body in the environment, and to differentiate which part from the predicted image, i.e. a pixel, belongs to the agent's body or the environment based on its element-wise prediction error.
³ Considered as PPS representation by the authors.
Wijesinghe et al. (2018) present a bio-inspired predictive model for visuomotor mapping to track the robot's end-effector from the visual and proprioceptive inputs (i.e. from position, velocity and acceleration of 4 arm joints and position and velocity of 2 eye joints). The authors employ the Generative Adaptive Subspace SOMs (GASSOMs) in their neural model for two purposes: (i) to encode the raw visual stimuli before combining with proprioception to generate one-step prediction of the encoded visual stimuli; (ii) to combine the encoded visual stimuli with its prediction. The output of the network is further used to control the robot's eye in tracking the arm movements.
Lanillos and Cheng (2018) introduce a computational perceptual model based on a Gaussian additive noise model and free-energy minimization that enables a robot to learn, infer and update its body configuration from different sources of information, i.e. tactile, visual and proprioceptive. The model is evaluated on a real multisensory robotic arm, showing the contributions of the different sensory modalities to improving the body estimation, and the adaptability of the system to visuotactile perturbations.
So far, all models reviewed in this section share two common steps as shown in Fig. 3. The first step employs robots' movement as motor babbling for data generation. The second step constructs the relation between different sensory data by using analytical functions or machine learning techniques, e.g. artificial neural networks. While the performance of the analytically-based approaches depends mostly on the designers' choices of functions, the approaches using machine learning techniques depend strongly on sensory data. Irrespective of the representation form employed as the body schema model, the main achievement of these approaches is the optimal estimation of the agents' body, i.e. joint configuration, end-effector position, or image of the hand/arm, with respect to the distribution of collected data from the babbling step. However, while these models demonstrate that they can (potentially) serve as a building block for more complex robotics behaviors, there are no possibilities for agents to continuously develop and learn these models outside the optimal estimation task they are meant to perform. We will discuss these points in detail in Section 5.
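
The two common steps can be sketched in a few lines for the analytical case. This is a minimal illustration under stated assumptions, not a reimplementation of any reviewed model: the robot is a hypothetical planar two-link arm, `observe_hand` stands in for a noisy visual measurement of the hand, and the designer-chosen function is the textbook forward kinematics whose link lengths are fitted by linear least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_LENGTHS = np.array([0.30, 0.25])  # hypothetical link lengths in metres


def observe_hand(q):
    """Stand-in for vision: noisy hand position of a planar two-link arm."""
    x = TRUE_LENGTHS[0] * np.cos(q[:, 0]) + TRUE_LENGTHS[1] * np.cos(q[:, 0] + q[:, 1])
    y = TRUE_LENGTHS[0] * np.sin(q[:, 0]) + TRUE_LENGTHS[1] * np.sin(q[:, 0] + q[:, 1])
    return np.stack([x, y], axis=1) + rng.normal(scale=0.002, size=(len(q), 2))


# Step 1: motor babbling, i.e. random joint configurations and their observed outcomes.
q = rng.uniform(-np.pi, np.pi, size=(300, 2))
hand = observe_hand(q)

# Step 2: relate proprioception to vision with a designer-chosen function.
# x = l1*cos(q1) + l2*cos(q1+q2) and y = l1*sin(q1) + l2*sin(q1+q2) are linear in (l1, l2).
A = np.concatenate([
    np.stack([np.cos(q[:, 0]), np.cos(q[:, 0] + q[:, 1])], axis=1),
    np.stack([np.sin(q[:, 0]), np.sin(q[:, 0] + q[:, 1])], axis=1),
], axis=0)
b = np.concatenate([hand[:, 0], hand[:, 1]])
estimated_lengths = np.linalg.lstsq(A, b, rcond=None)[0]
```

The estimate is only as good as the chosen function and the coverage of the babbling data, which reflects the dependence on designer choices and on the collected data distribution noted above.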

3.3 Peripersonal space as a brain's representation of the dynamic interface between the body and the environment
Similar to the body schema, the PPS representation is a result of various multisensory integration processes happening in the brain. The sources of sensory information include touch on the body, and vision and audio close to the body. Additionally, proprioception is also thought to take part in the process (Serino, 2019), especially in the arm-centered PPS (see the text below for more details on body-part-centered PPS). This spatial representation helps to facilitate the manipulation of objects (Holmes and Spence, 2004; Goerick et al., 2005) and to ease a variety of human actions such as reaching and locomotion with obstacle avoidance (Holmes and Spence, 2004; Làdavas and Serino, 2008). Notably, this is not the case for the space farther from the human body (Farnè et al., 2005).
In terms of neuronal activation, the neuronal network of parieto-premotor areas of the cortex plays a vital role in the PPS representation. In fact, PPS-encoding neurons are found in several regions of primate brains, namely the ventral intraparietal area (VIP), parietal area 7b and the premotor cortex (PMC), i.e. areas F4 and F5 (see Cléry et al., 2015 for a review). Neuroimaging studies in humans show similar results: Neurons in the ventral PMC and the inferior parietal sulcus (IPS⁴) relate to the hand-PPS; IPS neurons also relate to the face-PPS; many clusters of activation in the parietal cortex and PMC correlate with PPS events (see Serino, 2019 for a recent review; Grivaz et al., 2017). The activation of regions in the premotor cortex (during PPS events) also implies a link between the multisensory integration representation of PPS and motor activities.
The PPS representation serves as an interface between an agent's body and the environment through the multisensory neural network: It maps sensory stimuli, e.g. objects perceived via vision, directly to a body-part frame of reference (FoR) to generate both voluntary and involuntary motor movements, e.g. reaching to grasp or avoidance reactions. The mapping is thought to be due to the multimodal receptive fields (RFs) of the activated PPS neurons anchored to this body part (Fogassi et al., 1996; Serino, 2019). Furthermore, the two types of PPS motor movements are not mutually exclusive (Brozzoli et al., 2011; di Pellegrino and Làdavas, 2015; Serino, 2019), and are thought to be related to two systems of PPS representation. First, the active PPS is linked with voluntary actions toward objects in the reachable working space. Second, the defensive PPS serves involuntary defensive actions (de Vignemont and Iannetti, 2015; Cléry et al., 2015). In the brain, there are specific networks for these two systems of PPS representation: The VIP-F4 network mainly processes information for the defensive PPS (Graziano, 2003, 2004; Graziano and Cooke, 2006; Graziano et al., 2002, 1997; Bremmer et al., 2002a,b); the 7b-F5 network plays a core role in the active PPS (Matelli and Luppino, 2001; Rizzo- ...).
Figure: ... (from Cléry et al., 2015); Center: active PPS as reachable regions in a Nao robot (from Schillaci et al., 2016); Right: defensive PPS as safety margin of a forearm in an iCub robot (from Nguyen et al., 2018a).
The PPS representation is maintained (in the brain) by neurons with visuotactile RFs attached to different body parts, following the parts as they move (see e.g. Holmes and Spence, 2004;Cléry et al., 2015 for a recent survey). This forms a distributed and very dense coverage of the "safety margin" around the whole body. This defensive representation is not a unique space for the whole body, but rather composed of many different subrepresentations corresponding to different body parts. For example, the hands' PPS margin ranges around 30-45 cm from the surface, the trunk 70-80 cm, and the face 50-60 cm (Serino, 2019) (see Fig. 4, Left). Each sub-representation of a body part is closely coupled with that part even in movement, which is very useful for obstacle avoidance. When a body part moves, its PPS representation is modified independently from other body parts' representations, eliciting adaptive behaviors for only that specific body part.
That said, Cléry et al. (2015) suggest that the separate PPS representations of body parts can interact and merge, depending on their relative positions. Moreover, this protective safety zone is dynamically adapted to the action that the agent is performing, namely reaching vs. grasping (Brozzoli et al., 2010). It is also modulated by the state of the agent or by the identity and the "valence" (positive or negative) of the approaching object. For example, the safety zones are different in the cases of empty and full glasses of water (de Haan et al., 2014), or in the cases of interacting with spiders and butterflies (de Haan et al., 2016). Furthermore, the social and emotional cues of interaction contexts also cause dynamic adjustment of the PPS representation (Teneggi et al., 2013; Lourenco et al., 2011).
Moreover, the PPS representation is incrementally trained and adapted (i.e. expanded, shrunk, enhanced, etc.) through motor activities, as reported in, among others, Cléry et al. (2015), Làdavas and Serino (2008) and Serino et al. (2015). One of the most extensively studied motor actions is tool use, where evidence from both primate and human studies reveals the enlargement of visuotactile RFs to include the tool (Iriki et al., 1996; Maravita and Iriki, 2004)⁵ or the increase of cross-modal extinction after actively using a tool to interact with far-space objects (see Martel et al., 2016; Serino, 2019 for reviews). Using short tools within the reachable space is not sufficient for this effect. More importantly, the degree of the extension of the PPS representation depends on the way tools are used rather than on the physical properties, e.g. the length, of the tools. In other words, bodily experiences are necessary for the plasticity of the PPS representation. The underlying reason for this plasticity is the temporally synchronous tactile and visual/audio stimulation during tool use, which activates the multisensory neurons integrating the corresponding unisensory tactile and visual/audio neurons. Thus the synapses between the two sets of neurons are reinforced, according to the Hebbian learning principle.
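
The Hebbian account of this plasticity can be illustrated with a toy example. All quantities below are assumptions made for illustration and are not drawn from the cited studies: a single multisensory unit with a `tanh` response, five visual positions ordered from near to far, initial visual weights that are strong only near the hand, and weights clipped to the range [0, 1].

```python
import numpy as np

# Visual input positions ordered from near the hand (index 0) to far space (index 4).
w_visual_initial = np.array([1.0, 0.6, 0.2, 0.05, 0.05])  # hypothetical synaptic weights
w_tactile = 1.0
learning_rate = 0.1


def response(tactile, visual, w_visual):
    """Activity of a multisensory neuron driven by tactile and visual input."""
    return float(np.tanh(w_tactile * tactile + w_visual @ visual))


far_stimulus = np.zeros(5)
far_stimulus[4] = 1.0  # visual stimulus at the far position (e.g. the tool tip)

response_before = response(0.0, far_stimulus, w_visual_initial)

# Tool use: touch on the hand is synchronous with vision at the tool tip.
w_visual = w_visual_initial.copy()
for _ in range(50):
    post = response(1.0, far_stimulus, w_visual)
    w_visual += learning_rate * post * far_stimulus  # Hebbian rule: pre * post activity
    w_visual = np.clip(w_visual, 0.0, 1.0)           # keep weights bounded

response_after = response(0.0, far_stimulus, w_visual)
```

After the simulated tool use, a purely visual stimulus at the far position drives the multisensory unit more strongly than before, while the weights of positions that were never co-stimulated remain unchanged.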
The capability of the PPS representation to update external stimuli with respect to body parts (even during movement) implies that FoR transformations are necessary to align sensory modalities coded in different FoRs. This is also the role of the body schema (recall Section 3.1). However, in the PPS representation, the FoR transformations include both bodily and external stimuli (e.g. from vision, audio) (Serino, 2019). To support this functionality, proprioceptive stimuli may become involved with other sensory modalities, i.e. vision or audio, especially in the case of the hand-centered PPS representations (Serino, 2019). There is no clear evidence whether the body schema representation takes part in the FoR transformation within the PPS representation. Cardinali et al. (2009) suggest that the body schema may serve as the "skeleton" for PPS, but that it alone is not sufficient.

Computational and robotic models of PPS

Similar to Section 3.2, this section provides an overview of the research related to computational and robotics models of the PPS representation, organized in increasing order of a priori information. The main differences between the approaches considered here are outlined in Table 3, which is constructed accounting for the following criteria: computational model for the PPS representation, sources of sensory information, agent's body, and learning approach (i.e. model-based or model-free, autonomous or not).

Roncone et al. (2015, 2016) propose a model of PPS representation as collision predictors distributed around the robot's body, acting as a protective safety zone. The authors aim to investigate an integrated representation of the artificial visual and tactile sensors in the iCub humanoid robot. 
The multisensory information is integrated by probability associations between visual information, as the objects are seen approaching the body, and actual tactile information as the objects eventually physically contact the skin. Nguyen et al. (2018a,b) further extend this PPS model with adaptability to the identity of approaching objects, e.g. neutral vs. dangerous, and to the interaction situation, e.g. hand-on interaction, to replicate the behavior of the protective PPS in humans (Cléry et al., 2015; de Haan et al., 2014, 2016). Notably, the defensive behaviors of this PPS representation do not hinder planned manipulation actions such as reaching for or grasping an object. Instead, these two capabilities work harmoniously within the cognitive architecture through an optimal control algorithm. Hence the model facilitates the robot's activities alongside human partners in different human-robot interaction scenarios (Moulin-Frier et al., 2018; Nguyen et al., 2018b).

Magosso et al. (2010b) propose and analyse a neural network model to integrate visuotactile stimuli for the PPS representation. This model is composed of two identical networks, corresponding to the left and right hemispheres of the brain. Each network is composed of unimodal neurons for visual and tactile stimulus input, and multimodal neurons for multisensory integration. Inhibitory connections also exist between the left and right hemisphere networks so as to model their mutually inhibiting relations: When one hemisphere activates, the other is inhibited to an equal extent. This brain-like construction allows the behaviour of the PPS to be modeled at the physical level and compared with data collected from humans. Similar models are proposed for the case of audiotactile stimuli in (Serino et al., 2015) and (Magosso et al., 2010a). 
The authors did not design a training procedure, except for the tool-use case presented in (Magosso et al., 2010a), where the Hebbian learning rule is employed.
Similarly, the computational model of PPS representation by Straka and Hoffmann (2017) associates visual and tactile stimuli in a simulated 2D scenario. The model is composed of a Restricted Boltzmann Machine for the association of object properties (i.e. position and velocity), and a two-layer fully-connected artificial neural network for "temporal" prediction. After training, the model is capable of predicting the collision position given the visual stimulus, as in (Roncone et al., 2016). The designed scenario remains quite simple, however, since it boils down to a simulation in 2D space: The skin area is a line and there is no concept of the body, hence no transformation between sensory frames is taken into account.
Differently, Kuipers (2016, 2018) model the PPS representation as a graph of nodes in the robot's reachable space, built through constrained motor babbling of a Baxter robot. Each node in the graph is composed of inputs from joint encoder values and images. With the learned graph, search algorithms can be applied to find the shortest path connecting the current and the final state. In their most recent work, the final state search algorithm is extended to allow grasping objects. Although the graph model can be learnt without a kinematics model, the authors utilize some image segmentation techniques to locate the robot's gripper during the learning phase, and the targets in the action phase, from the input image(s). Requiring each node in the graph to store images is a memory-intensive solution.

Antonelli et al. (2013) and Chinellato et al. (2011) adopt radial basis function networks to construct the forward and inverse mappings between stereo visual data and proprioceptive data in a robot platform. This is conducted through the robot's gazing and reaching activities within the reachable space. Their mapping, however, requires visual markers to extract features with known disparity. Although the authors aim to form a model of PPS representation, without the involvement of external objects and tactile sensing there is not much difference between this model and the visuomotor mapping models of the body schema.
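The idea of planning a reach by searching a learnt graph of visited states can be sketched as follows. This is a minimal illustration under stated assumptions, not the reviewed implementation: the node names and edges are hypothetical, nodes are plain labels rather than stored joint values and images, and breadth-first search is chosen here simply because it returns a shortest path in an unweighted graph.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the parents to reconstruct the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None

# Hypothetical graph: each key stands for an arm state visited during motor
# babbling; edges link states reachable from one another by a small movement.
pps_graph = {
    "rest": ["a", "b"],
    "a": ["rest", "c"],
    "b": ["rest", "c", "d"],
    "c": ["a", "b", "target"],
    "d": ["b"],
    "target": ["c"],
}
path = shortest_path(pps_graph, "rest", "target")
```

Storing only compact features per node, instead of full images, would be one way to reduce the memory cost noted above, at the price of an extra feature-extraction step.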
Inspired by (Magosso et al., 2010b; Antonelli et al., 2013), Nguyen et al. (2019) present a model of the spatial representation based on a visuo-tactile-proprioceptive integration neural network for reaching external objects in the reachable space of iCub robots. The model maps the visual input from a 6-DoF stereo-vision system to the 10-DoF motor space including the torso and an arm. This takes place under the supervision signal of touch events between objects and the artificial skin taxels covering the robot body. After training, this model allows the robot to estimate the likelihood of reaching or colliding with a visual stimulus within its reachable space, similarly to the PPS representation.
De La Bourdonnaye et al. (2018) present a stage-wise approach for a robotic agent learning to touch an object in the scene with a reinforcement learning algorithm. First, the robot learns to fixate the object by learning the configuration of the camera system to encode the object. Then it learns the hand-eye coordination by constructing the mapping from the robot's motor space to the camera space. Finally, the previously learnt information is used to shape the reward in learning to touch objects. While the first learning stage is equivalent to learning the PPS representation, the second stage is learning the body schema of the agent.

Pugach et al. (2019) implement a gain-field network (recall Abrossimoff et al. (2018) in Section 3.2) to construct the representations of a Jaco arm's body schema and PPS. Inputs for the network come from a fixed camera, a system of artificial skin covering the robot's forearm, and its encoder, collected during one-degree-of-freedom movement of the arm. The tactile signal is employed to trigger the process of learning the visual representation, i.e. the visual-tactile receptive field. Though the approach requires some preprocessing steps, i.e. color-based object recognition, constrained movement of the robot and denoising filters for the outputs of the gain-field network, it presents some potential aspects of a defensive PPS representation as in (Roncone et al., 2016).
On the other hand, Ramírez Contla (2014) focuses on the plastic nature of PPS representation to account for the modification the body undergoes, and the impact of this plasticity on the confidence levels in respect to reaching activities. In their experiments, the author first assesses the contribution of visual and proprioceptive data to reaching performance, then measures the contribution of posture and arm-modification to reaching regions. The modifications applied to the arm, i.e. the changes in the arm's length, have similar effects as the extension of the PPS representation during tool-use.
As we discussed earlier, the main difference between the models of PPS representation reviewed in this section and body-schema models is the involvement of external objects in the vicinity of the agent's body, and thus of tactile sensing. Unsurprisingly, most approaches to modeling the PPS representation apply steps similar to those of the body schema models: (i) generating sensory data through the agent's movement for (ii) learning the model of PPS representation. The PPS representations are mostly constructed by artificial neural networks. The approaches are able to fulfill the main function of the PPS representation: Correlating information from different sensory modalities including FoR transformations; and mapping the external objects within reach onto the agent's body parts. However, as is the case for body schema models, they lack the ability to learn continuously outside the context of the designed learning tasks.

Table 3: Summary of models of PPS representation. Sensory information is coded as: visual-V, proprioception-P, tactile-T, audio-A.

The active self

The self in humans

The process of infants' development involves, among other things, the acquisition of "body knowledge". Body knowledge has been described within the context of infants' development as the formation of the body's sensorimotor map (the body schema) and the variety of actions that support motor and cognitive development (Mannella et al., 2018). The formation of the body schema, the sensorimotor representation of the body, begins with the genetic predisposition for the organisation of body part representations in S1 and M1. It is later elaborated through early (fetal stage) involuntary body-environment interactions such as the touch of the amniotic fluid on the skin (part of the development of tactile perception), and most importantly, body-body interactions (e.g. self-touch). In the first months of life, the infant is more focused on body-body interactions. 
For example, the infant acquires body knowledge through self-touch behaviours. This goes alongside motor development, and as the body is the most accessible part of the environment, and also the most predictable, the body is the first part of the environment to be modeled (Stoytchev, 2009). At this time, the agent is learning the forward model, i.e. the causal relationship between motor actions and their sensory effects on the body. Also at this time, motor actions do not necessarily have to be voluntary, intentional, or goal-directed in order to construct the forward model and develop the causal representation of action-effect links. However, the bidirectional associations between actions and effects will develop with an inverse model that is involved in goal-directed movements: Selecting actions that produce a predicted or desired sensory effect. This stage can be thought of as one that incorporates verification (Stoytchev, 2009).
According to the basic principles of developmental robotics (Stoytchev, 2009), artificial agents and robots need to be able to verify what they learn about the environment (Sutton, 2001), in order to effectively interact with a complex and dynamic external environment. Verification requires the ability to act upon the environment, hence the agent needs to be embodied (Stoytchev, 2009). In addition, the verification needs "grounding", a process or its outcome that establishes what constitutes valid verification. Because the environment is probabilistic, grounding requires the agent to construct action-effect pairs, and therefore to have a causal representation of actions and effects in a probabilistic manner. The process of grounding the verification in a probabilistic way requires the agent to repeat its actions to test and refine what it learns about the environment as a causal representation of action-effect, through, for example, detecting temporal contingencies (Stoytchev, 2009). In this view, it follows that the developmental process goes from exploring the most predictable and verifiable parts of the environment (i.e. the body) to the least. Exploration is driven intrinsically: The agent is "drawn" to explore that part of the environment which has intermediate variability, until the variability is reduced, and the attention shifts to other parts (see also Schillaci et al., 2016).
The recent term "body know-how" focuses more on the practical aspects of body knowledge (Jacquey et al., 2020), and was defined as "the ability to sense and use the body parts in an organized and differentiated manner" (Jacquey et al., 2020, p. 109). Body know-how and its acquisition are therefore interlinked with motor development. The more body know-how is accumulated, motor skills are enhanced, and the forward model is perfected, the more the agent can learn about its environment. This is because more body know-how leads to more informative and complex interactions. These are "informative" in the sense that the verification becomes more and more efficient as the agent learns about the morphological properties of the body, and about how to move the body. One can argue that the sort of information that the agent learns from the interaction with the environment is statistical information: Spatiotemporal, sensorimotor contingencies, as well as causal links between actions and effects. Because the world is not deterministic, this information is therefore probabilistic.
Developing a representation of causal links between actions and effects on the environment is necessary, but not sufficient for the development of the sense of agency. This is because having a representation of associations between actions and effects, is not informative with regards to who the author of the action was. In order to verify that the author of an action having led to an effect was oneself, the agent needs to perform goal-directed actions. In computational terms, the forward model represents the causal links between actions and effects, and allows the agent to predict sensory outcomes of actions. The agent makes use of the predictions brought by the forward model to produce goal-directed actions. The agent also needs to perform goal-directed actions to refine the inverse model, a representation of the links between a sensory effect and the action that will cause it, i.e. bidirectional action-effect links. Verschoor and Hommel (2017) argue that goal-directed action is a prerequisite for the emergence of the minimal self, rather than an indication for its emergence.
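The interplay of the two models can be illustrated with a deliberately simple numerical toy. This is a minimal sketch under stated assumptions, not a model from the literature: the one-dimensional linear "body" dynamics, the function names and the least-squares fit are all hypothetical choices. Random (non goal-directed) babbling is enough to fit the forward model, while the inverse model is what selects an action for a desired effect and lets the agent verify its prediction.

```python
import random

def true_effect(action):
    """Unknown 'body' dynamics (a toy assumption): effect = 2 * action + 1."""
    return 2.0 * action + 1.0

# 1) Motor babbling: random actions paired with their observed sensory effects.
random.seed(0)
actions = [random.uniform(-1.0, 1.0) for _ in range(50)]
effects = [true_effect(a) for a in actions]

# 2) Forward model: least-squares line fit, effect ~ slope * action + offset.
n = len(actions)
mean_a = sum(actions) / n
mean_e = sum(effects) / n
covariance = sum((a - mean_a) * (e - mean_e) for a, e in zip(actions, effects))
variance = sum((a - mean_a) ** 2 for a in actions)
slope = covariance / variance
offset = mean_e - slope * mean_a

def forward_model(action):
    """Predict the sensory effect of an action."""
    return slope * action + offset

def inverse_model(desired_effect):
    """Select the action expected to produce the desired effect."""
    return (desired_effect - offset) / slope

# 3) Goal-directed action: pick an action for a goal, then verify the outcome.
goal = 2.0
chosen_action = inverse_model(goal)
prediction_error = abs(true_effect(chosen_action) - goal)
```

A small prediction error after a goal-directed action is the kind of evidence that, in the account above, allows the agent to attribute the effect to itself.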
Moreover, the developmental process is iterative: Acquiring knowledge about the body ("this movement led to this body sensation", i.e. what body sensation does a certain movement elicit?) leads to acquiring knowledge about the environment ("this movement led to this perceived effect on the environment", i.e. what is the perception that comes from this movement?), which leads back to knowledge about the body ("to get effect x on the environment, I need to move this way", i.e. how to move to achieve a certain goal). The interface through which body know-how is acquired is the body schema representation, and the interface through which complex knowledge about the environment is acquired is the PPS representation.
The notion of verification reflects the active inference approach (Friston et al., 2015), which postulates that to reduce uncertainty (free energy), an embodied agent uses an internal generative model that samples sensory data through action. Sampling is done through approximate Bayesian inference to induce posterior beliefs, under the assumption that active sampling will update model priors. The uncertainty is resolved with actions that hold "epistemic value" to the agent, i.e. information-seeking behaviours (Friston et al., 2015). The principles of active inference and free energy present the forward model as a mechanism to fulfill curiosity by minimizing the expected prediction error (Friston et al., 2011).
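For reference, the quantity minimized in this family of approaches is commonly written as the variational free energy. The notation below is an assumption introduced here for illustration and is not taken from the text above: $o$ denotes sensory observations, $s$ hidden states, $q(s)$ the agent's approximate posterior belief, and $p(o, s)$ its generative model.

```latex
F = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
  = D_{\mathrm{KL}}\big[q(s)\,\|\,p(s \mid o)\big] - \ln p(o)
```

Since the Kullback-Leibler term is non-negative, $F$ is an upper bound on the surprise $-\ln p(o)$; lowering $F$ by updating beliefs or by acting to change $o$ therefore reduces the agent's uncertainty in the sense used above.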
In this probabilistic framework, the agent gathers information about statistical regularities, through predictive processes-making predictions about sensory outcomes of generated actions, and resolving "prediction errors"-either in favor of updating the model, or in favor of adapting the sensory information itself (see Limanowski and Blankenburg, 2013 for a review on the minimal self in this framework). One might think about the body model as explicitly distinct from the "world model". However, the boundary between the body and the environment can also be thought of as a sort of statistically-dependent boundary: The body is the most predictable and consistent part of the environment, and therefore the most verifiable (Stoytchev, 2009).
Lending this notion to the minimal self, the boundary between the (sensorimotor, minimal) self model and non-self model can also be thought of as statistically-dependent. For example, the notion of nested Markov blankets (Kirchhoff et al., 2018) postulates that biological systems tend to autonomously self-organise in a coherent way, through active inference, to separate their internal states from external ones, with nested hierarchical Markov blankets that define their boundaries in a statistical sense. Similarly, Hafner et al. (2020) propose the notion of the self-manifold for an artificial agent, which is defined as a dynamic and adaptive outline for the boundaries of the self, and is related to both body ownership and agency since, in their view, the two cannot be separated. They propose to formalize the self-manifold as a Markov blanket around the sensorimotor states of an agent.

Robotic models of the active self
In this section, we review robotics models of the active self, or models sharing a common feature, namely the use of a predictive coding mechanism or a forward model. This focus stems from the idea that the feeling of agency can emerge in an agent with an ability to anticipate the effect of its own action (see detailed discussions in Sections 2.2 and 4.1). We first review models employing multiple sensory modalities (Section 4.2.1 and Table 4), then continue with models using a single sensory modality (mostly visual input; Section 4.2.2 and Table 5). The latter are able to capture the dynamics of the whole system (including the agent and the interactive environment) from a single input thanks to the special design of that input, i.e. the visual input is not taken from a first-person perspective (as in humans and other animals).

Models with multisensory input
Zambelli and Demiris (2017) introduce a learning architecture where forward and inverse models are coupled and updated as new data becomes available, without prior information about the robot kinematic structure. The ensemble learning process of the forward model combines different parametric and non-parametric online algorithms to build the sensorimotor representation models, while the inverse models are learned by interacting with a piano keyboard, thus engaging vision, touch, motor encoders and sound. Zambelli et al. (2020) extended the idea but trained a multimodal variational autoencoder (MVAE) model from motor babbling data that included combinations of complete and missing data from joint position, vision, touch, sound, and motor command modalities. They tested the model in the same imitation task that involved predicting the sensory state of the robot arising from visual input alone when observing another agent's actions.
The computational model by Copete et al. (2017) allows a simulated robot to (i) acquire the ability to predict the intention of others' actions, and (ii) learn to produce the same actions. The main component of the model is a deep autoencoder-based predictor, whose aim is to integrate visual, motor and tactile signals (in both spatial and temporal manners). In the action learning mode, the autoencoder receives input from all sensory modalities to train the network, while in the action observation (of the other robot) mode, the learnt network receives only visual signals as input and is able to produce the missing sensory modalities, i.e. tactile and joint signals. Feeding the output signals back into the input of the network allows it to predict the future sensorimotor signals.

Hwang et al. (2018) construct a multiple-layer predictive model (P-VMDNN) with two pathways for visual and proprioceptive inputs, in which the pathways are only connected at the highest layer to simulate the link between perception and action. These pathways employ variations of RNN, namely Predictive-Multiple Spatio-Temporal Scales and Multiple Timescales, for processing visual and proprioceptive input, respectively. The model is trained end-to-end by back propagation through time (BPTT) in order to minimize the (one-step ahead) prediction errors of the two inputs. As a result, a simulated iCub can imitate some primitive hand-waving gestures of another robot displayed on a screen, even when one of the sensory inputs is missing (similarly to the autoencoder-based models). Recently, this model has also been employed for imitative interaction between an iCub robot and a human (Hwang et al., 2020).

Saponaro et al. (2018) further exploit the body schema and forward model (developed from vision and proprioception by Vicente et al., 2016b) in the "mental" simulation of sensory outcomes in an affordance learning task. 
This is carried out by employing Principal Component Analysis (PCA) and an additional Bayesian Network to construct the relation between four predefined actions (in varied directions) of robots and the known hand configurations or objects/tools.

Lang et al. (2018) employ a deep convolutional neural network that integrates proprioception, vision and motor commands to predict the visual outcomes of a Nao robot's actions. This forward model is trained with self-generated data from the robot's motor babbling, and is employed in the task of self-other distinction. It is expected that the prediction error of the forward model is lower when observed arm movements are performed by the robot itself than by other agents. The authors also showed how predictions can be used to attenuate self-generated movements, and thus create enhanced visual perceptions, where the sight of objects (originally occluded by the robot body) was still maintained.

Lanillos et al. (2017) conceive a hierarchical Bayesian model, which aims to integrate movement and touch from an artificial skin system with vision from a camera. The hierarchical model consists of three layers: The first two deal with self-detection using inter-modal contingencies to avoid relying on visual assumptions like markers, whereas the last layer employs self-detection to enable conceptual interpretation such as object discovery. To validate the model, the authors design an experiment entailing object discovery through interactions, in which the robot has to discern between its own body, usable objects and illusion in the scene.

Hinz et al. (2018) extend the model of body estimation by Lanillos and Cheng (2018) (see discussion in Section 3.2) with an additional visual-tactile sensation, in the task of replicating the rubber hand illusion in a humanoid robot. 
In this experiment, the authors consider the difference between the estimated position of the robot's end-effector and the ground truth as the drift of the illusion, which shows patterns similar to those observed in the experiment with human participants.
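The use of forward-model prediction error for self-other distinction, as described above for Lang et al. (2018), can be sketched in a few lines. This is a minimal illustration and not the reviewed implementation: the scalar arm position, the additive forward model, the trajectories and the threshold value are all assumptions made for the example.

```python
def forward_model(position, command):
    """Hypothetical learnt forward model: next position = position + command."""
    return position + command

def mean_prediction_error(predicted, observed):
    """Mean absolute difference between predicted and observed positions."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)

def classify_motion(positions, commands, threshold=0.05):
    """Label observed motion as 'self' if own commands predict it well."""
    predicted = [forward_model(p, c) for p, c in zip(positions[:-1], commands)]
    error = mean_prediction_error(predicted, positions[1:])
    label = "self" if error < threshold else "other"
    return label, error

# Motor commands issued by the robot, and two observed arm trajectories.
commands = [0.1, 0.1, -0.2, 0.3]
own_arm = [0.0, 0.1, 0.2, 0.0, 0.3]      # follows the issued commands
other_arm = [0.0, -0.3, 0.1, 0.4, 0.2]   # unrelated to the issued commands

own_label, own_error = classify_motion(own_arm, commands)
other_label, other_error = classify_motion(other_arm, commands)
```

In practice the threshold would have to be calibrated on the robot's own babbling data, since sensor noise keeps the self-generated error above zero.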
Instead of the Gaussian process regression used in previous models, Lanillos et al. (2020) employ a mixture density network (MDN) to encode the visual generative model and follow the free energy minimization framework to estimate the robot's body. The authors further utilize a deep learning-based classifier for contingency learning, i.e. the probability of association between the visual input from optical flow and the joint velocity of the robot. Finally, both the prediction error of the robot's body estimation and the sensory contingency contribute to the tasks of self-recognition and self/other distinction at a sensorimotor level.

Table 4: Summary of active self models based on multisensory sources. Sensory information is coded as: visual-V, proprioception-P, motor-M, tactile-T, audio-A.

Models with single sensory input

Watter et al. (2015) employ a Variational Autoencoder (VAE) to probabilistically encode the visual depiction of the system state into a latent space, where the dynamic transition from the current latent state to the next state (under the untransformed action) is assumed to be linear. As a result, the problem of non-linear system identification and control from high-dimensional images becomes one of locally optimal control in a linearized latent space. The learnt features allow locally optimal actions to be found in closed form by stochastic optimal control algorithms. An additional constraint is also employed to enforce the similarity between samples from the state transition distribution and from the inference distribution, thus guaranteeing a valid encoded representation for long-term prediction. Both the autoencoder and the transition networks are learnt jointly.
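The benefit of a linear latent transition can be made concrete with a small numerical sketch. The code below is an illustration under stated assumptions and not the method of Watter et al. (2015): the two-dimensional latent state, the matrices `A`, `B` and offset `o` are hypothetical, and a one-step least-squares action stands in for the stochastic optimal control algorithms used in the reviewed work.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def latent_step(A, B, o, z, u):
    """Linear latent transition: z_next = A z + B u + o."""
    Az, Bu = matvec(A, z), matvec(B, u)
    return [x + y + c for x, y, c in zip(Az, Bu, o)]

def one_step_action(A, B, o, z, z_goal):
    """Scalar action minimising ||z_goal - (A z + B u + o)||^2 in closed form."""
    b = [row[0] for row in B]  # single action dimension assumed
    residual = [g - x - c for g, x, c in zip(z_goal, matvec(A, z), o)]
    numerator = sum(bi * ri for bi, ri in zip(b, residual))
    denominator = sum(bi * bi for bi in b)
    return [numerator / denominator]

# Hypothetical latent dynamics over (position, velocity).
A = [[1.0, 0.1], [0.0, 1.0]]
B = [[0.0], [0.1]]            # the action changes the latent velocity
o = [0.0, 0.0]

z = [0.0, 0.0]
u = one_step_action(A, B, o, z, z_goal=[0.0, 0.5])
z_next = latent_step(A, B, o, z, u)
```

Because the transition is linear in both the latent state and the action, the best action for a quadratic cost has a closed-form solution; with non-linear dynamics in image space no such shortcut would be available.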
Similarly, Van Hoof et al. (2016) propose a variant of the VAE to encode low-dimensional features of the raw tactile input for more efficient reinforcement learning. The VAE is modified to take into account the transition dynamics by linearly combining the estimated latent state with the action (through a linear neural network layer), and generating a prediction of the next latent state. The features are learnt by optimizing the marginal likelihood of the sensory input with respect to the prediction of the next latent state (instead of the latent state).
Borrowing some ideas from Watter et al. (2015), Byravan et al. (2018) develop a deep learning-based predictive model to learn the latent space from a pair of successive input images related by an action. The predictive model is formed as a U-net with an encoder of convolutional layers and a decoder of de-convolutional layers. Specifically, the network can (i) model the structure of the scene $x_t$ in the form of (predefined) segmented moving parts $k \in K$ and their 6D poses; (ii) predict the changes of each part $k$ under the applied action; and (iii) output the prediction of the scene dynamics, i.e. a predicted point cloud, as a result of the rigid rotation and translation of all points $x_j$ belonging to the part $k$. The model is trained with joint prediction losses at the point cloud and pose levels. After training, the model is employed for closed-loop control directly in latent space with a reactive controller using gradient-based methods.

[...] (2016) propose a method to jointly learn a forward model (for action outcome prediction) and an inverse model (for a greedy planner to generate the robot's discretized poking action) from the feature space of visual input in a supervised manner. The authors show that the forward model helps to regularize the inverse model, and that the combination generalizes better than using only the inverse model (especially when the robot is tasked to poke the object over a long distance).

Park et al. (2018) deploy a computational model based on RNNPB, a recurrent neural network with parametric bias (PB), on robots (i.e. a virtual 2-DoF arm and a NAO humanoid) and gradually allow them to imitate goal-directed motor behaviors in terms of the movement shape. In order to do so, the network is trained by BPTT with the prediction error (between the network output and the reference) during the learning phase. During the imitation phase, the PB is first recognized by BPTT from the observed actions and can then be used to generate imitated actions as output of the network.

Pathak et al. (2019) propose to use an ensemble of forward dynamics functions within a policy-gradient-based deep reinforcement learning agent. The model also exploits the disagreement among prediction errors in the ensemble as the intrinsic motivation to drive the agent's exploration without external reward from the environment. Furthermore, the authors formulate the intrinsic reward as a differentiable function to perform policy optimization in a supervised learning manner instead of reinforcement. The authors show that a robotic manipulator can learn to touch a random object in the scene with only visual input.
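Ensemble disagreement as an intrinsic reward can be sketched as follows. This is a minimal illustration under stated assumptions and not the reviewed implementation: disagreement is measured here as the variance of the next-state predictions across the ensemble, and the three scalar forward models are hypothetical stand-ins for learnt networks that agree in a well-explored region and diverge elsewhere.

```python
from statistics import pvariance

def disagreement_reward(ensemble, state, action):
    """Intrinsic reward: variance of next-state predictions across the ensemble."""
    predictions = [model(state, action) for model in ensemble]
    return pvariance(predictions)

# Hypothetical ensemble of scalar forward models. They agree for small
# actions (well-explored region) and diverge for large ones (novel region).
ensemble = [
    lambda s, a, k=k: s + a + k * max(0.0, abs(a) - 1.0)
    for k in (-0.5, 0.0, 0.5)
]

familiar = disagreement_reward(ensemble, state=0.0, action=0.5)
novel = disagreement_reward(ensemble, state=0.0, action=2.0)
```

An agent maximising this reward is drawn towards state-action pairs where its models still disagree, which drives exploration without any external reward signal.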

From biological agents to artificial agents
In humans, the senses of body ownership and of agency develop through interaction with the environment, which is perceived and controlled with the available sensorimotor system. The underlying mechanisms for the sense of body ownership and the sense of agency are built on interactions and associations between different sensory modalities and sensorimotor contingencies. This leads to the formation of representations of the body and the surrounding environment within reach (including other objects and agents).
Most of the research on learning multisensory representations that we review in Section 3.2 and 3.4 casts the development of multisensory representations in bioagents into equivalent robotics learning tasks, namely body calibration, pose estimation and visuomotor mapping for the body schema representation; or reaching estimation and collision estimation for the PPS representation (refer to Fig. 3 for different learning approaches). Tackling the development problems in this way and following two-step approaches, most approaches are able to find the optimal solution for the designed learning tasks, and provide the learning outcome as a building block in a more complex architecture for robotics behaviors. This is, however, different from the development of sensorimotor representations in biology, which is a continuous iterative and interactive process. For example, the body schema representation in humans not only adapts during the motor babbling phase in infants, but also continues to adapt during the tool-use context, where the agent's intention is to optimize the actions of grasping and manipulating the tool rather than optimizing the estimation of the position of the hand and arm. In other words, the human sensorimotor representations develop in multiple settings: They do not only learn once through random actions and serve as input for more complex actions. These representations are continuously refined through feedback from the perceived outcomes of complex actions.
Similarly, the models of the active self presented in Section 4.2.1 focus on learning to optimize the prediction loss of the forward models with respect to the raw sensory input from multiple sources directly, without constructing explicit representations of the body and environment. The prediction errors of the learnt forward models are then employed to generate movements similar to those learnt through imitation or babbling. By additionally constructing an explicit sensory representation of the agent's body (in the form of generative images or joint estimation), other models like (Lang et al., 2018; Hinz et al., 2018; Lanillos et al., 2020) enable agents to distinguish between the agent's body and external objects. However, all of the existing approaches lack the ability to generalize beyond the learning tasks.
The predictive models with single sensory input that we review in Section 4.2.2 lack certain properties of bio-agents related to multisensory integration. However, their proposed architectures can efficiently enable agents to develop the ability to predict outcomes of their own actions in a latent representational space. In these models, the latent state abstraction serves as dimensionality reduction for the desired learning tasks. However, all existing models learn these two steps separately instead of simultaneously (Pathak et al., 2017).
In humans, the involvement of the body schema and PPS representations in various motor activities (as we review in Sections 3.1 and 3.3) suggests that the brain might learn and use these representations as a process of dimensionality reduction or state abstraction, which then facilitates learning manipulation skills and transferring knowledge between different learnt skills. Furthermore, the sense of touch plays a crucial role in the development of PPS and body schema representations, especially in the later development of manipulation skills when interacting with the external environment. Some models already take tactile sensing into account as one of the sensory modalities, e.g. Roncone et al. (2014, 2016). Thus it is worth considering this sensory modality in an architecture for developmental agents.

A conceptual sketch for the development of an artificial minimal self
Our review of the state of the art in models of the active self and body-related representations suggests certain guidelines and principles that are important for modeling a self computationally. Here we propose a sketch of an architecture that integrates these principles (see Fig. 5), aiming to enable artificial agents to develop the active self through self-exploration within an environment, as discussed by Schillaci et al. (2016).
Fig. 5: $s^{v}_t$, $s^{tac}_t$ and $s^{p}_t$ denote the raw visual, tactile and proprioceptive input at time $t$, respectively. The blue and red arrows denote the source of data affecting the learning of the target module: blue for the Predictor and red for the Action generator.
Our review points out that agents require two critical components to develop a self: (i) a representation of multimodal sensorimotor contingencies, and (ii) bidirectional associations of actions and effects. The former condition is addressed in our proposal by the Multisensory integration module. The latter condition is fulfilled by two modules, namely the Predictor and the Action generator.
The Predictor is a multimodal forward model that predicts a sensory effect $\hat{\phi}(s_{t+1})$ from the currently conducted action $a_t$ and the currently perceived sensory state representation $\phi(s_t)$. The Action generator generates motor actions $a_t$ under constraints exerted by the environment and under consideration of the prediction error $e_{t+1}$ of the Predictor. Both the Predictor and the Action generator operate in the latent space of the multimodal sensory input, which is compressed by the Multisensory integration process.
Table 6: Summary of sub-problems addressed by the reviewed models
We specify the operation of these modules as follows:
$$
\begin{aligned}
\text{Multisensory representations:} \quad & \phi(s_t) = \big[\phi_{PPS}(s_t^{e \cup i}),\ \phi_{body}(s_t^{i})\big] \\
\text{Predictor:} \quad & \hat{\phi}(s_{t+1}) = f\big(\phi(s_t), a_t\big) \\
\text{Prediction error:} \quad & e_{t+1}, \ \text{the mismatch between } \hat{\phi}(s_{t+1}) \text{ and } \phi(s_{t+1})
\end{aligned}
\tag{1}
$$
Here, $\phi_{PPS}(s_t^{e \cup i})$ denotes the representation of the PPS, $s_t$ denotes the current sensory state and $\phi_{body}(s_t^{i})$ denotes the body schema representation. In terms of implementation, all of these modules can be constructed as a multi-head neural network in which each head corresponds to one module output. A large part of the network is shared between the different modules. This artificial neural architecture reflects the hierarchical structure of multisensory integration processes, which generate abstract, multimodal predictions at the high level from low-level unimodal sensory signals (Friston, 2012).
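The module structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `MultiHeadSketch`, the layer sizes, the random linear weights, the use of concatenation to combine the PPS and body schema heads, and the Euclidean distance as the prediction error are all assumptions made for illustration. For simplicity, both heads read the same shared hidden layer, and no learning is performed.

```python
import math
import random


def init_matrix(n_out, n_in, rng, scale=0.1):
    """Random weight matrix stored as a list of rows (illustrative only)."""
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def linear(weights, x):
    """Matrix-vector product for list-based weights and inputs."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class MultiHeadSketch:
    """Hypothetical shared trunk with a PPS head, a body schema head and a Predictor head."""

    def __init__(self, n_vis, n_tac, n_prop, n_act, n_hidden=8, n_latent=3, seed=0):
        rng = random.Random(seed)
        n_in = n_vis + n_tac + n_prop
        self.w_shared = init_matrix(n_hidden, n_in, rng)    # shared part of the network
        self.w_pps = init_matrix(n_latent, n_hidden, rng)   # head for phi_PPS
        self.w_body = init_matrix(n_latent, n_hidden, rng)  # head for phi_body
        self.w_pred = init_matrix(2 * n_latent, 2 * n_latent + n_act, rng)  # forward model f

    def encode(self, s_vis, s_tac, s_prop):
        """phi(s_t): concatenation of the two heads (combination rule is an assumption)."""
        hidden = [math.tanh(v) for v in linear(self.w_shared, s_vis + s_tac + s_prop)]
        return linear(self.w_pps, hidden) + linear(self.w_body, hidden)

    def predict(self, phi, action):
        """Predicted next latent state from phi(s_t) and a_t."""
        return linear(self.w_pred, phi + action)


def prediction_error(phi_pred, phi_next):
    """Euclidean distance in latent space (the choice of norm is an assumption)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(phi_pred, phi_next)))


model = MultiHeadSketch(n_vis=4, n_tac=2, n_prop=3, n_act=2)
phi_t = model.encode([0.1, 0.2, 0.3, 0.4], [0.0, 1.0], [0.5, -0.5, 0.2])
phi_hat = model.predict(phi_t, [0.1, -0.2])
phi_next = model.encode([0.2, 0.2, 0.3, 0.5], [0.0, 0.0], [0.4, -0.4, 0.2])
e_next = prediction_error(phi_hat, phi_next)
```

Inputs are plain Python lists so that the sketch runs without third-party libraries. A trainable version would replace the fixed weights with parameters updated from the prediction error and the skill objective.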
Importantly, all modules learn simultaneously through the agent's own interactive experience in the environment. Their behavior is driven by sparse extrinsic feedback and the intrinsic motivation to minimize prediction errors of their intentional actions. In this setting, learning to minimize the prediction errors and to integrate multisensory input are auxiliary tasks alongside the main task of learning to generate skill-dependent actions. One possibility to model the Action generator is to combine motor babbling, as used by most of the reviewed approaches, with sampled outputs of the reinforcement learning policy, which is known as $\epsilon$-greedy exploration (Sutton and Barto, 2017, Chapter 13). Taking the example of a reinforcement learning agent, at every time step the agent selects an action drawn from the policy $\pi$ (an action generator) based on the current state $s_t$, applies it to the environment and receives an extrinsic reward $r^{e}_t$ depending on the next state $s_{t+1}$ of the whole system. Moreover, the Predictor provides an additional internal reward $r^{i}_t$ based on the prediction error. In turn, the total reward $r_t = r^{e}_t + r^{i}_t$ guides the improvement of the policy $\pi$ through established algorithms such as policy gradient (Sutton and Barto, 2017).
One problem, however, is that agents are prone to overfitting when learning only from a single task or in a single environment. As we point out in Sections 3.2 and 3.4, irrespective of the chosen form for the models of the sensory representations, the behaviors of trained agents are optimized w.r.t. the estimation task they are designed to perform. They lack the ability to learn these models continuously outside the context of the tasks. For example, an agent that is trained to perform a visuomotor tracking skill cannot easily adapt to a grasping skill without catastrophically forgetting the trained knowledge. To address this issue, we propose to use the sub-problems in the third column of Table 6 (i.e. calibration, pose estimation, visuomotor mapping, reaching estimation, and collision estimation) as benchmark tests instead of using them as objective functions for the learning task (e.g. object manipulation, tool use). Our main hypothesis is that, since embodied agents have a variety of sensory modalities such as vision, touch and proprioception, the developed agents should pass the benchmark tests and show behaviors equivalent to those of humans, including sensory phenomena like the Rubber Hand Illusion.
The general learning objective function is designed to maximize the agents' ability to learn skills while minimizing the prediction error of the agents' internal Predictor. Furthermore, we propose to employ stage-wise or curriculum learning strategies for a set of different skills$^{6}$ that are gradually more difficult to achieve (Parisi et al., 2019a). Since the sensory representations continuously mature while one skill is being learned, e.g. object manipulation, this development implicitly facilitates transfer to other, more sophisticated skills, e.g. grasping a tool and using it to manipulate objects, which can then be learned faster and more easily than from scratch. During the learning process, the skill-dependent objective function motivates the agent to generate actions that fulfill the skill requirement, while the auxiliary objective function ensures that multisensory representation learning minimizes the prediction error $e_{t+1}$ (Eq. 1). The former learns from the stored long-term experiences, whereas the latter is trained with the short-term prediction error (as shown on the right side of Fig. 5). The auxiliary task of learning multisensory representations acts as an intrinsic motivation for the transition from learning one skill to another.
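The action selection and reward composition described above can be sketched as follows. This is an illustration under stated assumptions rather than the proposed implementation: the function names, the `weight` parameter, and the choice of the negative prediction error as the intrinsic reward $r^{i}_t$ (following the stated motivation to minimize prediction errors) are hypothetical.

```python
import random


def select_action(policy_sample, babble_sample, epsilon, rng):
    """Epsilon-greedy mix: babble with probability epsilon, otherwise sample the policy.

    policy_sample and babble_sample are zero-argument callables returning an action.
    """
    if rng.random() < epsilon:
        return babble_sample()
    return policy_sample()


def intrinsic_reward(prediction_error, weight=1.0):
    """Assumed form of r^i_t: a larger prediction error yields a lower reward."""
    return -weight * prediction_error


def total_reward(extrinsic_reward, prediction_error, weight=1.0):
    """r_t = r^e_t + r^i_t, with r^i_t derived from the Predictor's error."""
    return extrinsic_reward + intrinsic_reward(prediction_error, weight)


rng = random.Random(0)
action = select_action(
    policy_sample=lambda: [0.2, -0.1],                             # stand-in for the policy
    babble_sample=lambda: [rng.uniform(-1, 1) for _ in range(2)],  # random motor babbling
    epsilon=0.1,
    rng=rng,
)
r_t = total_reward(extrinsic_reward=1.0, prediction_error=0.25)
```

In a full agent, `r_t` would be passed to a policy-gradient update of the policy, while the prediction error itself would drive the auxiliary representation-learning loss.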
The multitask learning process of the proposed architecture includes learning the multisensory representations and learning the predictive model for control tasks. This learning process is equivalent to state representation learning for control, as highlighted in a recent review by Lesort et al. (2018). Furthermore, our architecture shares some similarities with the proposal by Nagai (2019), who focuses on modeling cognitive development by minimizing the prediction errors of a forward model. In our proposal, however, we emphasize the importance of learning the sensory representations as a state abstraction from multiple sources simultaneously with learning the internal models. In summary, we propose to combine a number of strategies that support continual learning, as highlighted by Parisi et al. (2019b), namely multisensory learning and intrinsic motivation (in the form of minimizing prediction error). This combination is supported by the reviewed evidence from the development of biological agents and from related computational and robotics models.

Towards modelling a self with higher cognitive functions
The embodied conceptualization hypothesis by Lakoff and Johnson (1999) entails that our body-specific sensorimotor apparatus and, therefore, our representations of body schema and PPS, determine how we conceptualize the world. Hence, these representations have a strong influence on higher cognitive functions, as they directly shape the way we think (Pfeifer and Bongard, 2006). This becomes evident in natural language, where metaphorical expression involves basic body-related concepts (Trott et al., 2016). What remains open, though, is how we can model the grounding of sensorimotor concepts computationally. Several approaches exist, including the Theory of Event Coding (Hommel, 2015b) and Event Segmentation Theory (Zacks et al., 2007; Gumbsch et al., 2019). However, it remains for future work to fully integrate these approaches within a unifying computational theory of high-level cognition. Research on the minimal active self fosters the development of such a unifying theory, as it allows one to investigate how basic body-related concepts emerge from sensorimotor interaction.