1 Shared Intentionality as Marker of Human Uniqueness

Ideas about what makes humans distinct from other animals have a long history. Aristotle (2020) famously defined humans as “rational animals.” Others argue that humans are unique in their capacity to craft and use tools or in their ability to use language and symbols (Cassirer 1944/1994), yet others propose that it is the ability to laugh and cry (Plessner 1940), to negotiate (Rochat and Ferreira 2008), or to plan for the future (e.g., Bratman 2017) that sets humans apart from other animals. A popular, relatively recent suggestion as to where the human–animal difference lies is the shared intentionality thesis, according to which only humans share experiences with one another in acts of cooperation and joint attention (Tomasello et al. 2005; Tomasello 2019).

In a recent article, Kern and Moll (2017) argue that trying to define the “anthropological difference” by pinpointing a special behavior or capacity that humans alone show or have—be it toolcraft, language, foresight, or cooperation—is futile because there are countless such differences. After all, only humans play water polo, watch television, and go to church. Isolating any one capacity or trait will always be more or less arbitrary. Moreover, every new proposal regarding the differentia specifica has led comparative psychologists to try to show that nonhuman animals (e.g., primates, birds, cetaceans) have the same capacity, if only in rudimentary form. The discovery that chimpanzees use stones to crack nuts and sticks to fish for termites was meant to cast doubt on the notion that only humans craft and use tools (Boesch and Boesch 1990), and the observation that apes can be trained to communicate with humans via “lexigrams” fueled the idea that certain primates are language and symbol users, too (Savage-Rumbaugh et al. 1993). The shared intentionality thesis was also attacked—based on findings suggesting that when chimpanzees hunt, they show the same kind of mutual responsiveness and role differentiation that this thesis claims to be unique to human interaction (Boesch 2002).

Kern and Moll (2017) do not conclude that we should give up on articulating sharp human–animal differences, but that instead of looking for the special something that separates us from them, we should discern a distinct principle by which human life is organized. The goal should be to identify a red thread that unites the manifold human–animal differences and points to a common underlying cause. An account that tries to do just that is a transformative account. While an additive account maintains that humans have a special skill in addition to the capacities they share with other primates (e.g., perception, attention, teleological understanding), a transformative account states that humans have a distinct form of life that shapes their cognition as a whole—not just the way they think or act in any particular domain. The idea of a transformative account of human cognition is not new. Jung (2009) argued along similar lines when proposing a holistic understanding of human–animal differences (Differenzholismus), and Boyle (2016) has proposed a transformative, as opposed to an additive, account of human rationality.

Our goal in this paper is to present a transformative variant of the shared intentionality thesis briefly sketched above. In its original version, the shared intentionality thesis had a strong additive flavor: Shared intentionality was mainly conceived as a special set of social–cognitive skills that humans evolved (phylogenetically) and developed (ontogenetically) in addition to the cognitive skills they already possessed by way of being a primate. For example, in A Natural History of Human Thinking, Tomasello (2014, p. 34) states that “out of the elements of […] sophisticated processes of individual intentionality built for competition…humans evolved, in addition (our emphasis), even more sophisticated processes of joint intentionality… built for social coordination.” Similarly, Tomasello and colleagues conclude in a research article, “Humans share many cognitive skills with nonhuman apes, especially for dealing with the physical world, but in addition (our emphasis) have evolved special skills of social cognition” (Herrmann et al. 2010, p. 102). In other words, what Tomasello and his team have proposed is that humans’ cognitive apparatus is much the same as that of apes but that, at a certain time in evolution and in the life course of the individual, specific adaptive social skills were/are added, allowing us to excel in mind reading and cooperation.

Here, we follow Kern and Moll’s (2017) suggestion to replace this additive picture of shared intentionality with a transformative one, according to which “shared intentionality” marks out a uniquely relational form of life that shapes humans’ entire cognitive development, including “nonsocial” domains, such as instrumental problem-solving or tool use. By engaging with things in the world together with one’s caretakers and seeing it through their eyes, children participate in shared or collective representations of the world. And by imitating others, infants come to handle objects and situations in ways that reflect cultural modes of engaging with the world that have persisted for generations. The transformative thesis thus argues that shared intentionality is not a capacity that sits on top of a foundation of primate cognition, but instead denotes a special form of life: one whose members recognize and relate to one another as potential communicative and cooperative partners and who, consequently, develop their cognition differently than do other animals.

Contra its additive analogue, the transformative interpretation of the shared intentionality thesis defended here maintains that shared intentionality is always already present in the child’s life: There is no particular moment in ontogeny when shared intentionality or human-unique thinking suddenly erupts. In the original additive variant of the thesis, the age of 9 to 12 months marks a crucial watershed: At this time, a child is said to cognitively differ for the first time in notable ways from an ape, because now, unlike the ape, the child starts to be able to share experiences with others in triadic joint attentional engagement. In line with the transformative interpretation, we view the child as distinctly human from the start, and we will point out even earlier forms of social exchange, prior to the dawn of joint attention, demonstrating that human sociality has a unique quality from the very beginning.

In a broader theoretical context, the shared intentionality thesis takes as its point of departure Lev Vygotsky’s (1978; see Kozulin 1990) sociocultural account of development. Vygotsky was the first developmentalist to argue that humans’ “mental functions” are social in origin and that a child’s reciprocal interactions with others drive her intellectual or cognitive growth. In his own words, “We could formulate the general genetic law of cultural development as follows: Any function in the child’s cultural development appears twice, or on two planes. First it appears on the social plane and then it appears on the psychological plane. First it appears between people as an interpsychological category, and then within a child as an intrapsychological category” (Vygotsky 1981, p. 163). The view Vygotsky articulates here is in contrast to what has been called the “inside-out” view of development, often attributed to Piaget, according to which children first grasp the world as individuals and then learn to communicate their thoughts and feelings to others through a (secondary) process of socialization. In Vygotsky’s “outside-in” view—similar to the way in which George Herbert Mead (1922) thinks of the development of the self through communication—mental representations and thoughts are internalized social experiences.

Methodologically, the shared intentionality thesis relies heavily on experimental and observational work in the psychology of early child development. It furthermore draws on studies in comparative psychology and primatology, with the goal to contrast the cognition (and its development) of humans with that of other animals. The thesis also incorporates ideas from evolutionary anthropology about how humans’ physiology, behavior, and cognition evolved to be what they are today. Lastly, the shared intentionality thesis draws on models of cultural evolution that try to explain how human culture accumulates over historical time and is passed on between generations.

2 From Dyadic Encounters to Appreciating Perspectives: A Step Model

In this article, we will show how human shared intentionality becomes increasingly complex in early human ontogeny. We identify three milestones of this development, shown in Fig. 1. Throughout these stages, the child’s social orientation must be thought of as equally present or powerful. The idea of a process of socialization, by which a child is increasingly drawn into the lives of others, is misleading, as there is no such thing as a presocialized state in child development. What changes is the complexity of the interaction and the child’s increasing ability both to learn about the objective world with and through other persons and, in parallel, to understand the subjective mental lives of those with whom she explores this world.

Fig. 1

Steps in the early development of shared intentionality. Shared intentionality begins with young infants’ exchanging affect with others in dyadic interaction. By age 1, infants jointly attend with others to objects in their surroundings. Joint attention leads to the capacity of perspective-taking, which emerges in two distinct levels: practical perspective-taking and theoretical perspective-confronting

The first crucial step is taken by infants at about 2 months of age, when they smile, coo, and look at others in episodes of so-called primary intersubjectivity (Reddy 2008). By engaging in such “protoconversations,” young infants affectively connect with others and, together with these persons, prepare the forging of attachment bonds with their caregivers. The second step is taken when infants, as they approach their first birthdays, begin to co-orient and gesture to objects of shared interest in acts of triadic joint attention. The ability for joint attention goes beyond the earlier, dyadic engagements because the infant and her partner now focus together on some object of shared interest. Infant and other are thus conjoined into a “plural subject” of experience, and the infant learns what it means to experience the world together with other persons. This joint experience is foundational for advancing to the next, third, milestone, which is children’s ability to take different perspectives and imagine the world from alternative viewpoints. As will be shown, the development of perspective-taking is protracted and can be broken down into two separate levels, defined as two different sets of skills. At the first level (“level 1” = perspective-taking), a child competently interacts with other agents whose viewpoints differ from the child’s own. For example, a child might discern the referent of a speaker’s ambiguous speech act by taking into account which objects the speaker has and has not engaged with in the past. However, at this level, children’s perspective-related skills are exclusively practical and used in interaction with other persons: they serve as means of social coordination and communication but do not, at this young age, include the capacity to reflect on others’ viewpoints and how they might differ from one’s own. Perspective-taking is available to toddlers between around 1.5 and 3 years of age.
At the second level (“level 2” = perspective-confronting), children have acquired the ability to juxtapose or confront two perspectives and acknowledge their difference. They now know at a theoretical level that people can have different views of a given situation, or that the same state of affairs can be represented in alternative ways—sometimes by different people at the same time (e.g., when two persons look at the same object from different viewpoints), by the same person at different times (e.g., when one changes one’s mind or adopts a different visuospatial position), or by the same person at the same time (e.g., when juxtaposing actual with counterfactual perspectives). At this level, we might say, children have gained theoretical insight into the perspectival nature of our access to the world. Children reach this level at around 4 or 5 years of age.

In the following sections, we will further unpack each of these milestones of shared intentionality. We will also describe gradual transitions from one stage to the next. Although these steps are introduced as distinct sequential stages, we take development to be continuous, with children gradually moving from one stage to the next as their cognition matures and their sociocultural experiences (e.g., the higher communicative demands that are placed on them) change accordingly. We will also cite literature from comparative psychology and primatology to show that none of these markers of shared intentionality are present in nonhuman animals.

3 Dyadic Interaction in Early Infancy

Soon after birth, humans relate to other members of their species in ways no other animal does. They engage dyadically, in face-to-face interaction. By 7 weeks of age, infants respond to another’s direct gaze with a “social smile” (Anisfeld 1982)—a smile indicating recognition of the other person as a fellow human being (Meltzoff 2007). From this time on, infants become increasingly responsive to and demanding of affective exchanges. At around 2 months of age, infants maintain eye contact while smiling, cooing, and moving rhythmically during episodes of primary intersubjectivity (Trevarthen and Aitken 2001). Among the first scholars to have drawn attention to the unique quality of these early human–human exchanges were Trevarthen (1979), Stern (1985), Reddy (2008), and Bråten (1988), who all remarked on infants’ efforts to enter into a dialogue with other humans and “converse” with them at a prelinguistic level.

Tronick et al.’s (1978) famous still-face experiments vividly demonstrate that infants expect and desire reciprocal social engagement. When encountering another human, infants demand mutual recognition and an exchange of affect. If they are instead shown a deadpan expression (sometimes misleadingly called a “neutral” expression), infants become upset, cry, and “protest,” trying desperately to revive the recalcitrant partner. What these experiments show is that infants want to be in what Buber referred to as “I-Thou” relations, in which two persons recognize each other’s humanity and give one another their full attention. Young infants thus already seem to have normative, rather than just statistical, expectations of mutual regard and consideration, as shown by the fact that they are more appalled than surprised when their interlocutor refuses their second-personal address.

It is worth mentioning here that in the original shared intentionality thesis of Tomasello et al. (2005), infants’ dyadic interaction with others in protoconversation was argued not to be intersubjective and thus not a form of shared intentionality (see also Tomasello 1999 on this point). Although Tomasello (2019) recognizes that no other animal engages in these kinds of affect-laden dyadic exchanges, he claims that young infants fail to recognize themselves and others as subjects of experience or attention. The cooing and smiling of young infants do not, in his version of the thesis, occur in an “intersubjective” space because the infant allegedly has not come to realize that persons have attentional or psychological states that can be directed or shared. In a critique of this position, we have argued that infants’ communicative efforts in primary intersubjectivity are themselves proof that infants have some (perhaps primitive) sense that others are subjects of experience distinct from themselves (Moll et al. 2021). In the transformative model of shared intentionality that we are proposing, young infants’ engagement in dyadic exchanges is the first of a sequence of steps in humans’ unique social–cognitive ontogeny. At this ground level of intersubjectivity, infants express their demand to be treated as what MacMurray (1961) called “persons in relation.”

3.1 The Human Face: Locus of Expression and Interaction

An excursion into the morphology of the human face and its evolution is helpful at this juncture because preverbal infants manifest their drive for social contact by using an extraordinary repertoire of facial expressions. Vertebrate faces evolved around 400 million years ago in ancient fish called placoderms. A face can be defined as the forward-facing part of an animal that unites multiple sense organs (organs responsible for vision, olfaction, etc.) with the site for food ingestion (Wilkins 2017). With the evolution of the face, processes of identifying and ingesting food became intertwined and highly efficient.

In humans, however, a face is more than a locus of perception coupled with food ingestion. The face is that part of the body by which human individuals recognize and present themselves to one another. As self-conscious beings, humans are—as Goffman (1967) discussed—invested in their faces and what they might indicate or reveal to their interaction partners. The face has undergone drastic changes over the course of hominin evolution (Lacruz et al. 2019). Tool use and the resulting ability to break down food and cut meat made hominin faces more delicate and flatter. The brow ridge receded, the muzzle shortened, and the jaw became smaller and retruded over time (Fig. 2). Together, these changes amount to the face’s verticalization (Wilkins 2017). Within the vertical plane of the face, the eyes underwent morphological changes that made them the center of attention, as they became elongated and the sclera took on a white color (Kobayashi and Kohshima 2001). The perceptual effect of these changes is that others can easily detect where one is looking. Humans almost “advertise” their attentional focus to one another (Tomasello et al. 2007). Shared intentionality theory interprets this as a physiological marker of our cooperative nature: We can afford to reveal what we are attending to because our interactions tend to be friendly, if not cooperative (Moll and Tomasello 2007). Because the evolutionary changes to hominin faces occurred in tandem with bipedalism, the angle at which the skull connects with the spine (the “cranial base flexure”) increased so that humans look forward, not downward, as they stand or locomote (Lieberman et al. 2002). Hominins started to encounter one another face to face. Human facial traits are more varied than other physical traits and the facial traits of other animals. Unlike other animals, who recognize their conspecifics by, for example, their smell, humans recognize each other mainly through their faces (Sheehan and Nachman 2014).

Fig. 2

The face of an extant chimpanzee (a) and a reconstructed face of a hominin (Australopithecus afarensis) (b) compared to that of modern Homo sapiens (c)

Over the course of evolution, the human face verticalized, and features relevant for communication (eyes, eyebrows, lips) were enhanced. A species was born whose members encounter one another face to face.

With the evolution of the human face came a large repertoire of emotional expressions. As Kret et al. (2020, p. 379) remark, “[H]umans have evolved communicative faces to facilitate emotion transmission, where the expressive parts are enlarged and accentuated.” In addition to having bigger eyes with a contrastive sclera, humans also have less facial hair, redder lips, and starker eyebrows, all of which facilitate communication. Face-to-face interaction not only invites the expression of emotions—as interlocutors open up to one another and share how they feel—but also produces its own. Protoconversations with others allow infants to feel and express joy and curiosity, as well as tension and surprise, such as when their partner introduces suspense by playing peekaboo (Parrott and Gleitman 1989). Peekaboo is played in numerous, if not all, cultures across the globe (Fernald and O’Neill 1993) because playfully interrupting the mutually expected, species-typical encounter causes joyful arousal and trains infants’ developing understanding of object permanence.

Taken together, the evolution of Homo can be thought of as the birth of a genus whose members address and communicate with each other face to face. Humans, we might say, evolved to be “one toward another” (Rödl 2014), i.e., partners of intentional transactions that unite them “in the manner of holding them apart” (Rödl 2014). Although some nonhuman animals, such as lions and deer, are also at times “one toward another,” their manner of being face to face is different: It is antagonistic, not second-personal. The human species is the only species whose members face each other in benign and cooperative interaction. There are only a few comparative studies on mutual eye gaze in nonhuman apes, and while these studies suggest that the amount of eye contact held between individuals varies considerably between species, none of these individuals turn to one another to exchange affect in the way and to the degree that humans do (Kano and Call 2014; Kano et al. 2015). We thus find that the dialogical nature of humans manifests from life’s beginning, with young infants engaging in protoconversations in which they express themselves vis-à-vis others in dyadic communication.

3.2 Interim Summary

In our model, the first step of shared intentionality can be observed within mere weeks after birth. Human infants have a characteristic desire to interact with other persons and express themselves to others in dyadic exchanges. Humans’ unique dialogical ability and motivation first manifests when infants smile, coo, and look at others in bouts of “primary intersubjectivity.” In the hominin lineage, facial features evolved that support the kinds of communicative, face-to-face encounters that are unique to the human species and universally sought by young infants. Unlike other animal faces, the human face is more than a hub of sensory perception and food intake; it is a plane on which individuals express their subjective states of mind and attitudes toward one another. Humans make themselves known to others by expressing themselves through their faces, even in early infancy. This, we argue, is to be recognized as a unique point of departure of humans’ cognitive development.

4 Sharing Experiences in Joint Attention

A new milestone is reached in infant social–cognitive development at around the age of 1 year, when infants begin to show a motivation and capacity to share experiences with others in acts of joint attentional engagement. The dyadic relation between the infant and the other has expanded and come to include objects of joint interest and reference. Having greatly prioritized attention to humans in the first half year of life, infants in the second half of the first year, having acquired the ability to grasp things, become increasingly interested in the physical world (e.g., Maestro et al. 2005). Importantly, infants do not lose sight of the other person as they explore objects; they typically keep a close eye on their caregiver. Although infants are initially unable to coordinate their attention between the object and the person, they soon start shifting their gaze between the other and an interesting object or event. They might look to the partner with a “sharing look” and flash a “knowing smile,” indicating their awareness that the experience is shared.

The newly arisen ability to share attention to a target of mutual interest manifests in the following suite of behaviors (see Carpenter et al. 1998):

  • Perceiving objects together by seeing or hearing them simultaneously and looking back to the partner

  • Gesturing deictically to objects with the goal to share attention to them

  • Engaging in imitative learning by repeating another’s action in recognition of performing the same action

  • Turning to others as guides when encountering novel or ambivalent situations (“social referencing”)

  • Playing one’s part in simple collaborative games with shared goals, such as giving and taking an object

By age 1, children generally show all of the above behaviors, which they experience as rewarding (Gangi et al. 2014; Siposova and Carpenter 2019). In their longitudinal study, Carpenter et al. (1998) observed that infants who displayed one of these behaviors also tended to show the others. The fact that these behaviors are correlated points to a common origin. According to the classic shared intentionality thesis, the underlying psychological cause of this is infants’ newly gained understanding that self and other have attentional states that can be directed (e.g., by pointing) and interpersonally shared (Tomasello et al. 2005). Infants are no longer limited to participating in dyadic relations, in which you and I are mutually engaged (as we have seen with primary intersubjectivity); they can now form a “dual” subject, such that we, together, act on or attend to some object “as a unit,” as Margaret Gilbert (2007) writes.

Jointly attending to objects is immensely important for infants’ healthy cognitive and social development. It is a sine qua non for language acquisition (e.g., Tomasello and Farrar 1986) and for developing shared attitudes or orientations to the world (Hobson 2002). “Social referencing,” for example, which infants engage in when facing ambivalent situations, lets them endorse others’ adaptive attitudes toward new and potentially harmful objects and situations. It is through joint attention that infants come to see the world through the eyes of others.

How devastating impairments of one’s ability for joint attention are for one’s overall development can be demonstrated by cases of autism spectrum disorder. Many children with autism miss out on key joint attentional experiences and, as a consequence, have difficulties with acquiring language, taking others’ perspectives, and, more generally, developing species-typical ways of relating to others and the world (e.g., Charman et al. 2000). Impairments of joint attention cast a long shadow on the child’s development within as well as outside of the social arena, which supports the transformative, as opposed to the additive, account of shared intentionality.

Infants’ engagement in joint attention demonstrates their trust in others as guides and interpreters as they open themselves up to the world. In jointly attending with others, infants knowingly act as part of a dual subject: “We are looking at the moon,” “We are rolling the ball back and forth,” etc. It is the duality of the subject that distinguishes joint attention from the earlier, dyadic interactions (as discussed in Sect. 3). In dyadic interaction, I encounter you, and vice versa. I, the infant, am opening myself up to you, the adult, and you are doing the same toward me. In triadic joint attention, by contrast, we put our heads together over a problem to which we dedicate ourselves as a dual or plural subject. Rödl (2014) demonstrated that “dyadic” self-predication or intentional transaction (“I give you a greeting/you are greeted by me”) is logically prior to dual self-predication or joint intentional/attentional action (“We are looking at the moon together”) because it is in the former that the second person is established, whereas the latter presupposes it. What we find is that this logical order is matched by the ontogenetic progression from dyadic encounters to joint attention.

Although some primatologists claim to have observed wild and captive apes engaging in joint attention (Leavens and Racine 2009), data from experiments have not been able to confirm this. Even apes that are reared by (helpful and cooperative) humans have not been found to use gestures or turn their heads back and forth between another agent and an object or event with the goal to share the experience (Carpenter and Call 2013). Whenever a human-reared ape was observed gesturing toward something in its surroundings, the motivation seems to have been imperative: getting access to the pointed-to object. As with the first level of shared intentionality, then, research in comparative psychology overall suggests that the behaviors characteristic of the second level of shared intentionality—joint attention—are absent in other primates.

Participating in joint attention paves the way for children’s understanding of others’ experiences and perspectives. A series of studies we conducted demonstrates how joint attention facilitates perspective-taking (Moll et al. 2008; Moll and Tomasello 2007). In one of these experiments, a 1-year-old and an adult played with a novel object together. They then shared a second, different novel object. Next, the adult left the room, and another person played with the infant with a third novel object. Finally, all three objects were placed in front of the infant. In that moment, the adult who had been absent returned and excitedly requested “that one,” while gesturing vaguely to the cluster of objects. The finding was that infants selected the object that was new for the adult, even though this object was not new for them—suggesting that the infants realized what the other had and had not previously experienced. Importantly, infants were able to select the target object only if they had experienced the other, familiar objects with the adult in joint attention. If infants merely looked on as the adult explored the familiar objects by herself, they later failed to disambiguate the referent and were unable to discern what the adult wanted. Other studies we conducted yielded the same or similar results.

As we have argued elsewhere (Moll and Meltzoff 2011b), joint attention is the cradle of perspectivity. It is through joint attention that children establish a common ground of experience with other persons, and it is only against this shared background that others’ communicative acts obtain their meaning (Moll and Kadipasaoglu 2013). Without a joint attentional framing of experiences, infants fail to understand others’ gestures and speech. Joint attention, we argue, is an entry gate through which infants gain access to other minds.

5 Understanding Others’ Perspectives

Empirical work by us and others has revealed that children build up their competence in perspective-taking in an extended learning process that spans many years. Even adolescents’ perspective-related competence has not entirely matured. Teenagers tend to perform slightly less well than adults on tasks requiring prompt and accurate judgments about others’ points of view (Choudhury et al. 2006; Dumontheil et al. 2010). Despite this protracted development, even infants and young children are capable of considering others’ viewpoints. In our own work, we have focused on the early years and discovered that there are broadly two developmental stages in which children transcend their own perspective and take that of others into account. We now describe these two stages.

5.1 Perspective-Taking (Level 1): A Practical Skill

Between ages 1.5 and 3 years, infants and young children become competent at taking another’s perspective. This early-developing skill is limited to contexts in which young children directly interact with other persons in pragmatic contexts. Only later, at 4 to 5 years of age, are children able to attribute perspectives to other agents from a reflective or theoretical stance, allowing them to realize that a given object can be viewed or construed in various ways. The first skill is practical (“perspective-taking”), whereas the second, later-developing capacity is theoretical (“perspective-confronting”; see Moll et al. 2022).

Children’s perspective-taking skills manifest in the way they respond to others’ communicative gestures, speech, or actions. For example, an agent might ambiguously ask an infant for a piece of food when two different food items (e.g., broccoli and crackers) are available for the child to choose from. By around 1.5 to 2 years old, infants resolve the reference problem by giving the adult what she, the adult, had previously expressed a preference for (broccoli), although this mismatches the infant’s own preference (Repacholi and Gopnik 1997). Similarly, when a speaker vaguely asks for “the doka” without having previously shown the child which of several potential referents this novel word picks out, infants factor in the speaker’s prior interaction with the objects to discern what the speaker has in mind (e.g., Grassmann et al. 2009).

Young children have also demonstrated perspective-taking skills by anticipating how others, whose perspectives differ in salient ways from their own, will act. In a study by Garnham and Perner (2001), 3‑year-olds anticipated the actions of someone with a false belief about where her object was located. The children knew that the object was in location B, but the character falsely believed that the object was in A, where she had last seen it. When asked to soften the character’s fall from a slide she came down to fetch the object, children correctly placed a mat under the slide to A, indicating that they were, at an implicit level, aware of the character’s false view of the object’s whereabouts. However, this knowledge was not available to children in explicit form: When asked where the character thought her object was, children confidently gave the wrong answer, “B.” Toddlers are thus able to shift into others’ perspectives in interaction (level 1), but they cannot “confront” alternative perspectives in discourse about mental representations (level 2).

More evidence for the dissociation between perspective-taking and perspective-confronting comes from studies measuring children’s facial expressions. Moll et al. (2016, 2017; Ni et al. 2023) repeatedly found that 2‑ and 3‑year-olds express tension in their faces when watching an agent approach an unexpected reality. For example, a child might bite her lip or furrow her brow when Cookie Monster returns to his cookie box after someone raided it while Cookie Monster (but not the child) was absent. Toddlers, in other words, anticipate others’ rude awakening when confronted with a reality they (the others) do not expect. They are susceptible to the feelings of suspense that arise when one knows more than someone for whom this knowledge would be crucially important. But again, the anticipatory sensitivity to feelings of suspense is available only in the “heat of the moment” of ongoing social interaction; it is a skill that is embedded in and limited to pragmatic contexts. This early skill does not inform children’s mental state ascriptions that afford the confrontation of another’s perspective with one’s own.

Before turning to the higher-level capacity of “confronting” perspectives, let us briefly review literature on apes’ perspective-taking. Apes evolved skills of tracking others’ perceptual experiences that look somewhat similar to infants’ and toddlers’ perspective-taking skills. However, apes display these abilities not in cooperative or communicative settings, as do human children, but in antagonistic contexts, such as when they avoid being seen by a dominant group member. Subordinate apes, for example, have been shown to preferably approach food that is hidden from a dominant’s sight, rather than food that is out in the open (Hare et al. 2000). Moreover, subordinates avoid food after having seen it being placed at an occluded location while the dominant was watching, suggesting that apes know what others “know” in the sense of what they have witnessed (Hare et al. 2001). It seems that through a process of convergent evolution, great apes developed skills that are, at some level, parallel to toddlers’ perspective-taking abilities. However, only in humans is this perspective-related capacity part of a life form that is characterized by shared intentionality. Apes, by contrast, evolved a limited set of tracking skills that help them to keep actions hidden from higher-ranking individuals or to outsmart others in competition.

5.2 Perspective-Confronting (Level 2): A Theoretical Skill

By 4 to 5 years old, children come to realize that there can be conflicting perspectives on the same object. For example, in standard false-belief tasks, children of this age accurately judge that Maxi has a false view of where the chocolate is (Wimmer and Perner 1983). Children also now understand that a person can change their mind about a state of affairs and revise their beliefs. In the famous unexpected-content task, 5‑year-olds, but not 3.5-year-olds, recalled that they initially assumed that there were Smarties, not candles, in the Smarties pouch (Hogrefe et al. 1986). Similarly, children between 4 and 5 years come to understand that appearances can deceive: Objects can look to be one thing (e.g., chocolate) but really be something else (e.g., an eraser; see Flavell et al. 1983; Moll and Tomasello 2012). At this age, children also first understand that an object can be brought under various conceptual perspectives: I can call this rabbit a “bunny,” a “rabbit,” or an “animal.” Children younger than 4 do not accept more than one label for the same object, despite having coreferring terms in their vocabulary that they use on different occasions. Children younger than 4 would insist that this “rabbit” can only come under a single sortal description (Doherty and Perner 1998).

The décalage between taking and confronting perspectives is best demonstrated by studies showing how these two capacities come apart within the same research paradigm. Moll and colleagues had 3‑ and 4‑year-olds determine which of two blue objects appeared green for an adult because she, the adult, saw one of the blue objects through a yellow color filter. Three-year-olds had no trouble identifying the correct object when the adult asked for the “green” object, although the children themselves saw the object in its true, blue color (Moll and Meltzoff 2011a; Moll et al. 2013). However, only children aged 4 and older acknowledged that they and the other person saw the object in different colors. Younger children instead insisted that they and the adult saw the objects in the same color. Interestingly, and in support of the transformative account, 3‑year-olds who had previously engaged in perspective-taking (by properly identifying the “green” object) stated in subsequent pilot trials of the perspective-confronting task that the object looked green to them. According to their visual sense, the object was clearly blue—however, having recently imagined the point of view of another person “colored” their perception so that the children judged the object as having a color it did not, in fact, have.

Studies using the suspense paradigm introduced above also provide evidence for the dissociation between perspective-taking and perspective-confronting. As mentioned, 2‑ to 3‑year-olds are susceptible to feelings of suspense when witnessing others approaching a scene with false expectations. However, Ni et al. (2023) found that young children’s awareness of the knowledge gap is below the threshold of explicit judgments. Toddlers expressed suspense only when they tracked an agent’s interactions with objects and projected or anticipated how the agent would react when finding her object gone or replaced. The same children expressed no suspense when there were no agent–object interactions to be tracked, so the children instead had to rely on general (situation-independent) knowledge about how appearances can deceive and cause false beliefs.

Literature on apes’ belief understanding is relatively scarce, mostly because very few studies have yielded positive data (see Tomasello and Moll 2013), and publishing negative findings continues to be difficult. In a study using a paradigm informally known as “chimp chess,” subordinate chimpanzees failed to understand that their misled dominant opponent would look in the wrong (empty) choice of two possible locations for food (Kaminski et al. 2008). The subordinates did not distinguish between false belief and simple ignorance, assuming random choices in both cases. Since then, Krupenye and colleagues (Kano et al. 2017; Krupenye et al. 2016) reported eye-tracking studies in which apes anticipated with their gaze where an agent would return to retrieve food that had been moved since the agent last saw it. However, as we have argued before (see above and Moll et al. 2022), anticipatory looks can only identify the ability to track experiences and take perspectives (level 1); it takes more complex, discursive actions in the form of judgments to detect the ability to confront perspectives (level 2).

5.3 Interim Summary

Taken together, we find that early in their lives, humans develop an extraordinary set of capacities by which they can “put themselves in the mental shoes of others.” These skills originate from the prior capacity to share experiences with other persons in joint attention. Between 1.5 and 3 years of age, children are sensitive to the perspectives of other persons, and they take these perspectives into consideration when coordinating their actions with others. Their perspective-taking skills enable toddlers to communicate and cooperate effectively even when their own view of a situation is incongruent with that of their interaction partners. Interestingly, however, the practical skills of perspective-taking do not translate, at this young age, into an explicit acknowledgement that one and the same object or state of affairs can be represented from various perspectives. Such reflective understanding of the perspectival nature of our access to the world takes shape between the ages of 4 and 5. Training studies have shed some light on key social and linguistic experiences that help children take the leap from practical perspective-taking to theoretical perspective-confronting. Effective training involves discourse in which conflicts between different perspectives are made the point of discussion (Lohmann and Tomasello 2003). For example, hearing a sentence such as “The girl thought the toy was broken, but it wasn’t” helps preschoolers become aware that people can hold different, including false, perspectives on a problem and that appearances can mislead and induce false representations of reality. What this shows is that an understanding of perspectival differences and of subjective mental lives, counterintuitively perhaps, emerges from shared representations of the objective world. In a first step, joint attention identifies what is out there in the world for us to see. In a second step, it invites consideration of the different perspectives from which things in the world are represented.

6 Synopsis and Outlook

We began this article by defending a particular version of the shared intentionality thesis, according to which humans are the only kind of being that shares experiences with conspecifics and puts their heads together in acts of collaboration and joint attention. Rather than viewing shared intentionality first and foremost as a specialized skill that humans have in addition to skills of individual intentionality (intending and acting as a singular subject), which Tomasello suggests in some of his writings (see Kern and Moll 2017), we argued that shared intentionality is to be understood as a feature of a unique form of life: a form of life whose bearers understand their status as second persons for one another.

We identified three early milestones of shared intentionality in human development. The first is young infants’ drive to enter a “meeting of minds” with others in dyadic, face-to-face encounters. Young infants open themselves up to protoconversational partners and engage in preverbal dialogue with them in bouts of “primary intersubjectivity.” We discussed how evolutionary changes in the hominin lineage facilitated the communication and expression of one’s mind via one’s face. The next big step is taken when infants approach their first birthdays and start attending to the world with others as a dual subject. They begin experiencing the world as part of a larger social unit: one that encompasses them and another person, typically their caregiver. Infants enter the three-way experiential relationship of joint attention by holding up or gesturing to objects, by singling out objects for shared attention with language, by imitating others’ actions, by following others’ deictic gestures to join them in their attention to an object, and by looking to others with a “quizzical” look as if to ask whether they should approach or withdraw from an object they find ambivalent. For these behaviors to count as joint attention, the infant and the other have to engage in some kind of exchange—which can be as subtle and fleeting as a moment of eye contact—that renders the experience shared. We discussed the importance of joint attentional experiences for the child’s further cognitive development, including her language and perspective-taking abilities.

The third big achievement emerges gradually over the course of the next few years, when young children build an understanding of the perspectival nature of their and others’ access to the world (perspectivity). We broke this capacity down into two steps, one practical (perspective-taking), the other theoretical (perspective-confronting). Between 1.5 and 3 years old, children can discern what another person is referring to or wants by factoring in that person’s visual or epistemic perspective. They also anticipate what an agent, based on her prior experiences, expects in a given situation. However, at this early age, children lack appreciation of the plurality of perspectives: They have no explicit knowledge of, but in fact forcefully deny, the possibility that one and the same object can be represented in various ways. An explicit understanding of perspectives develops between ages 4 and 5 years and represents what has been labeled a “theory of mind” (Premack and Woodruff 1978). What helps children build this knowledge is participation in social discourse in which differences and clashes between perspectives are highlighted and discussed.

One might ask to what extent the collection of cognitive–developmental material presented here directly supports the transformative account we espoused before introducing these data. We admit to not having done enough to interrelate the theory with the data, which would require an article of its own. Let us just say a few things about this here. First, we showed that newborns’ participation in “primary intersubjectivity” implies that the point of departure of human development is different from that of the development of other animals, and if the point of departure is different, then so is the journey. This negates the idea that infants, as is sometimes suggested, begin their development in an “ape stage” which they only outgrow once they master language and tool use to a degree that even enculturated apes fail to reach. This picture of a transition from ape to human cognition within individual human development is clearly false: There is no ontogenetic process of “becoming human” simply because humans are always already cognitively distinct.

We can also point to evidence that infants’ unique sociality molds their entire cognition. Studies show that the object cognition of young infants, as they become increasingly curious about the inanimate environment, is not formed individually but is shaped by the child’s engagement with other persons around objects. Striano et al. (2006) found that when 6‑month-olds face an agent who shifts her gaze back and forth between them and an object—as if to invite attention sharing—infants encode the object more deeply than if the person focuses only on the object. Some months later, when joint attention is in full bloom, its effects on the child’s perception and representation of objects are blatantly obvious, as the child now imitates her interaction partner’s ways of engaging with these objects. Soon thereafter, by the time the child is in preschool, her receptiveness to pedagogy has deeply impacted her cognition about the physical world. For example, what preschoolers are told about an object’s functioning or properties guides and constrains the ways in which they manually explore the object (Bonawitz et al. 2011). Similarly, pedagogical messages about how to approach a specific instrumental problem greatly impact a child’s effectiveness at solving problems of this kind (e.g., Moll 2018). Early ontogenetic development is thus replete with evidence that shared intentionality is much more than just a special tool that helps children navigate social situations.

One might ask about the relevance of learning about human shared intentionality and its development. In response to this, we can point back to the beginning of the article, in which we introduced different approaches to solving the puzzle of human uniqueness. The work reported here furthers our understanding of the human condition by showing that human development is suffused with shared intentionality from the start, and that any philosophical anthropology has to start with the recognition that human sociality is unique in that even as infants, humans understand themselves to be “persons in relation.” Another response to the question of significance points to practical application. Treatment of autism spectrum disorder has been shown to be effective if it targets shared intentionality, especially if the intervention begins early, so that negative cascading effects of missing out on foundational social experiences can be minimized (Jones et al. 2006; Schertz and Odom 2004). Other clinical conditions, including schizophrenia and borderline personality disorder, also involve impairments of shared intentionality. Effective psychotherapy often centers on the person’s ways of relating to others’ (and their own) mentality or subjectivity. Overall, shared intentionality is of major interest across the social sciences, life sciences, and humanities because it is that which keeps individuals in dialogue with one another and maintains society’s functioning.

Unresolved issues to be addressed in future research span both empirical and conceptual questions. The relative contributions of maturational processes on the one hand and sociocultural experiences on the other, such as parental efforts to draw infants into joint attention, remain poorly understood. The transition period from dyadic to triadic interaction in the second half of the first year of life has also not been sufficiently illuminated. Microgenetic studies that zoom in on critical transition periods will help gain deeper insights into how children progress toward higher and more complex forms of shared intentionality.

Despite helpful philosophical investigation in this area (e.g., Eilan 2005; Gilbert 2007; Rödl 2014), we also still lack a thorough understanding of how a plural subject is defined and what it means to “share” an experience. For example, is the dyadic encounter of you and me in primary intersubjectivity already a shared “we” experience, or does an object of joint interest over which we put our heads together need to be introduced for us to fuse together and form a plural subject? And if the infant’s and other’s experience in a dyadic encounter is already “shared” such that they are participating in one and the same experience, then why do infants so passionately communicate to their partner, through sound, gesture, and word, what it is they are experiencing? Do these communicative acts serve to acquire reassurance from the other that the experience is indeed shared, or do these acts constitute the sharing of the experience? Questions like these need to be further examined in collaboration with philosophers to gain a deeper understanding of human shared intentionality and how it unfolds in early ontogeny.