1 Introduction

In this paper, we offer criticism of traditional cognitivist theories of socio-cognitive development and explore an alternative, action-based framework. The accounts considered are nativism, rational constructivism, and two-systems theory. We argue that even though they all seem to explain empirical data about socio-cognitive development—infant mindreading, modulation by experiential factors, and cross-cultural variance—there are other, theoretical reasons why their explanations are untenable. Specifically, we discuss the problem of foundationalism and the related problem of nativism. Finally, we explore an alternative, action-based framework that avoids these theoretical limitations and offer an interpretation of the empirical data from that perspective.

2 Current empirical data

It has been over three decades since Premack and Woodruff’s landmark paper set the course for contemporary research on the human ability to read other minds (Premack and Woodruff 1978; Wimmer and Perner 1983). The area of study, known as theory of mind (ToM) or mindreading, has produced a staggering number of empirical findings. Notably, recent years have abounded in significant findings that fall into three groups, each of which poses a challenge for any theory aiming to account for them:

  1. The false belief test (FBT) has recently been adapted to minimize extraneous cognitive demands on the child.[Footnote 1] The results from studies adopting this new, spontaneous-response, non-verbal FBT are among the main points of contestation in the field. Infants as young as 15 months pass the spontaneous-response test, whereas children pass the elicited-response version only at around four years of age (Onishi and Baillargeon 2005; Surian et al. 2007). The methodology is similar, but the crucial difference is that instead of asking the child what she thinks, the experimenter measures the child’s looking time in both possible scenarios. If the child looks longer at the situation where the observed person acts contrary to her false belief, this is interpreted as the child finding the event unexpected and therefore as understanding false beliefs (Träuble et al. 2010). Alternatively, the anticipatory looking of the child is measured before the observed agent makes her choice. If the child passes the spontaneous-response test, she is often considered to possess an “implicit” theory of mind. This finding has been the main line of argumentation for researchers located on the nativist side of the innate-constructed scale.

  2. Linguistic and social factors influence the development of explicit theory of mind in many ways (e.g. Astington and Baird 2005; de Villiers and de Villiers 2014; Kristen and Sodian 2014; Milligan et al. 2007; Ruffman et al. 2003). The first inquiries into the significance of language for ToM sought to determine whether any particular element of its structure, its syntax or its semantics, was responsible (Astington and Jenkins 1999; de Villiers 2005; de Villiers and de Villiers 2009, 2014; de Villiers and Pyers 2002; Hale and Tager-Flusberg 2003; Olson 1989; Tager-Flusberg and Joseph 2005). With time, however, it became clear that the issue is not so simple, and more comprehensive studies showed that virtually all aspects of language facilitate ToM development (Cheung et al. 2004; Milligan et al. 2007; Ruffman et al. 2003; Ruffman et al. 2002), including pragmatics (Furrow et al. 1992; Peskin and Astington 2004; Ruffman et al. 2002).

The use of language is intimately connected with social factors, and these too have been shown to matter independently of language (see Devine and Hughes 2016 for a meta-analysis). Correlations have been found between ToM and the number of siblings (Jenkins and Astington 1996; Perner et al. 1994); the quality and quantity of interaction with parents (Ruffman et al. 2002); mothers’ disciplinary strategies (Shahaeian et al. 2014); mothers’ personal epistemologies (Tafreshi and Racine 2016); and, for deaf children, the parents’ fluency in sign language (Wellman and Peterson 2013; Woolfe et al. 2002). The general consensus has been that social interaction in general, and social interaction that highlights mental life in particular, facilitates the development of social cognition, including false-belief understanding (Carpendale and Lewis 2006; Galende et al. 2014).

  3. There are cultures with folk psychologies that differ from the Western one, and children from such cultures tend to show different developmental timetables and trajectories of socio-cognitive abilities as measured by various tests (Gut and Mirski 2016; Howell 1981; Lebra 1993; Lillard 1998; Mayer and Träuble 2012; Mills 2001; Strijbos and De Bruin 2013; Vinden 1996; Wellman et al. 2011; Wierzbicka 1992).[Footnote 2]

False belief tests conducted in the above cultures produce significantly different results from those obtained in the West (Kallberg-Schroff and Miller 2014). Children from Samoa pass the false belief test at around eight, as opposed to the age of four in the West (Mayer and Träuble 2012). Another Pacific culture, Vanuatu, is similar in this respect (Dixson et al. 2017). Chinese and American children take different trajectories in ToM scale progression (the ToM scale is a set of tests designed by Wellman et al. for more fine-grained measurement of ToM than the single FBT can provide) (Wellman et al. 2011). The same progression difference was found for Iranian children (Shahaeian et al. 2011), and a completely novel one in Vanuatu (Dixson et al. 2017). In fact, Dixson et al. (2017) established great differences in the sequence between different social groups within one culture, suggesting that even relatively small differences in sociocultural context can have a great impact on social cognitive development. Kuntoro et al. (2017) drew similar conclusions from their study in Indonesia, where they obtained different sequences depending on the city of origin and suggested that the differences in parenting styles between the two cities were responsible. Further, Naito (2014) reports that sixty percent of the Japanese children tested did not pass the FBT until they were 6 years old. Pakistani children likewise showed a lag behind the “standard” Western performance (Nawaz et al. 2014). A brain imaging study by Kobayashi et al. (2006) demonstrated significant differences between Japanese and American subjects in the brain areas activated during mindreading. Finally, and probably most significantly, individuals speaking a newly formed and developing sign language in Nicaragua still did not pass the FBT in their twenties (Pyers and Senghas 2009).

The two traditional accounts of theory of mind, nativism (Fodor 1992; Leslie et al. 2004) and rational constructivism (theory) (Goodman et al. 2006; Gopnik and Wellman 1992), had already formed their respective positions on socio-cognitive development long before most of the empirical findings presented above emerged. When confronted with these findings, the two approaches had to adjust their models in order to keep their initial form. For example, nativism had to account for the observed variance in socio-cognitive skills across cultures and social contexts, as well as the influence of language. This has come down mainly to polishing the competence-performance argument (e.g. Helming et al. 2016; Westra and Carruthers 2017). Constructivism, on the other hand, has run up against the challenge of explaining the apparent infant mindreading skills. The general strategy for rational constructivists here has been to downplay the importance of the infant experiments, claiming that passing them does not really require a full-blown, belief-desire theory of mind as such (Wellman 2014). It is within that climate that the two-systems account has been formulated, which tries to find the middle ground between the two extremes of nativism and rational constructivism (Apperly and Butterfill 2009; Butterfill and Apperly 2013; Low et al. 2016).

Admittedly, much effort has been made by the three parties to account for the rich empirical data we have available. However, a theory must hold not only on empirical, but also on theoretical grounds. We believe that there are serious theoretical problems with all three accounts. They are untenable because of their foundationalism: they presuppose representational primitives and cannot account for their emergence. Below we explain why this is a serious problem.

3 Foundationalism of the three dominant theories of social cognition

Following Bickhard and Terveen (1995), we define a theory as foundational if it cannot account for the emergence of representational content and therefore must posit a set of representational primitives from which cognitive development starts (see also Thelen and Smith 2002/1996, pp. 28–34). Such a position is untenable because representational emergence is an empirical fact, yet it should be impossible from a foundationalist perspective.

Foundationalism is a necessary consequence of theories that view mental representations as encodings, that is, symbols with semantic content that refer to the outside reality. There is just no way for such representationality to emerge from non-representational phenomena. All three traditional ToM accounts share that view of representation and hence they too are foundationalist, regardless of their particular differences in accounting for the empirical data. We demonstrate this in the following subsections. Our criticism is deeply indebted to Bickhard and Terveen (1995), who offer a detailed criticism of foundationalism in the cognitive sciences.

3.1 The problem of emergence and the necessity for foundational concepts

Natural cognition is the ability to acquire information from the environment, retain it, and use it for the purpose of adaptive behavior. Mental representation is argued to be the central process making this possible. We should ask, then, what it means for mental representation to be in service of information acquisition, retention and behavior guidance. At the most general level, it means that the organism understands something about the represented reality: representation should give a clue to the organism about what it can expect to happen, what it can do, and what it should do considering its goals and current states.

Traditionally, representation is argued to achieve the above in virtue of its correspondence to the represented reality and systematic relationship to other mental representations (Cain 2013; Fodor 1975, 1983; Pylyshyn 1984; Smith and Medin 1981). This has classically been viewed as a causal, informational, and ruleful relationship between the representation, the represented, and other representations of the representational system. To capture that, mental representations have been assumed to have semantic properties much like logical terms, propositions and propositional attitudes do; they refer to reality via the meaning encoded in them and to each other via their syntactic properties. Consequently, much of cognitive psychology today views cognition in terms of manipulation of semantic symbols in this sense. Environmental information is said to enter the system through the senses, after which it is transduced into a symbolic or representational format that is independent from the specific sense modality through which the information was acquired. That is when sensory representations become concepts. Once in the amodal format, information is processed in a way similar to logical inferences. Finally, results of this computation are transduced back into the embodied format of motor directions.

There is a problem, however, in accounting for the emergence of the amodal semantic content—concepts must be already in place for the sensory information to be transduced into them and back into motor information (Bickhard and Terveen 1995). In fact, emergence of semantic content is impossible; there is no way for the organism with such a code to actually know what it is about (for the criticism see Bickhard 2009b; Bickhard and Terveen 1995), and for that reason any model built around semantic representation is forced to be foundational—to necessarily assume a set of conceptual primitives from which development can start. Below, we point out how the traditional ToM approaches share this problem.

3.2 Nativism

Researchers with nativist views generally follow the standard modularist model of cognition (Carruthers 2013, 2015, 2016; Helming et al. 2016, 2014; Leslie 1994; Scholl and Leslie 1999). As with similar accounts of other cognitive abilities (e.g. Lightfoot 1989; Pinker 2014/1994; Wynn 1992), the story here goes as follows: There is an innately specified cognitive mechanism, or module, dedicated to a specific domain, in our case mindreading. This mechanism is independent from and insensitive to virtually any extra-organismic factors and develops according to a biologically predetermined schedule. The innate, encoded information that it contains consists of basic inborn concepts (e.g. BELIEF, DESIRE, SEE, and PRETENSE) and heuristics (e.g. “seeing leads to believing”) that enable the child to pick out relevant stimuli, cognize them, and draw basic conclusions (in an unconscious, modular way, that is). These are then fed in some form to the central system, adding an aspect of another’s agency to the child’s perceived reality.

Nativists take this to be the cognitive process that underwrites 15-month-olds’ performance on the spontaneous-response FBT; infants pass it because they have this basic inborn theory, which makes them expect the false-belief-congruent scenario. As Westra and Carruthers (2017) propose, this innate theory is open to learning; it is claimed that it gets enriched with more complex concepts throughout development, or that its harmonization with the rest of the cognitive system can improve, with domain-general processes putting the mindreading module to work for their purposes (Carruthers 2015).

Accordingly, any differences in performance on the elicited-response FBT and related tests across populations—e.g. different cultures, parenting practices, or linguistic inputs—are explained away by factors other than an actual lack of mentalistic understanding of the mind (this is most clearly argued in Westra and Carruthers 2017, but see also Helming et al. 2014, 2016). Nativists take two directions here. One is the claim that the tasks in question make demands for more complex concepts and heuristics than the basic ones (culturally embedded ones not yet acquired). The other is the recourse to the competence-performance chasm. The latter path is naturally necessary for FBTs. Children are said to understand false beliefs innately and the varying performance is due to (a) misconstruals of implicatures of the test questions, (b) lack of adequate vocabulary in the language they are growing up with, or (c) undeveloped executive functions or general-processing resources.

Nativism is thus openly foundationalist, and seeks a solution to the impossibility of emergence in the claim that the representational primitives are innate. This does not help, however, as the idea of inborn semantic content cannot be defended either. We discuss this in Sect. 4.

3.3 Rational constructivism

The alternative, rational constructivist idea has been that children construct a theory of mind much like a scientist would (Gopnik et al. 1999; Gopnik and Wellman 1992). This faced a challenge in light of the spontaneous-response FBT results with infants. In order to explain the gap between infant and preschooler mindreading, and not to commit to innate belief understanding, constructivists were forced to show that infant mindreading could be explained by simpler concepts than those of a belief-desire theory of mind.[Footnote 3]

Wellman (2014) argues that infant data can be explained with only a desire-awareness conceptual framework, which with time is built on and becomes a proper theory of mind with belief understanding. The way the original concepts are modified and enriched is modeled with the use of Bayesian hierarchical networks, and the child’s conceptual development is viewed as a theory revision process, the child being a “little scientist.” Thus, in contrast to the nativist view, constructivism of this kind views ToM development as utilizing domain-general resources, and conceives of ToM as a somewhat real theory in the mind of the child, not a modular mechanism operating according to the same principle as a theory of mind. It does, however, start with a set of foundational mental representations (Wellman 2014, p. 197), and necessarily so. For a theory-based development, the initial states must be concepts understood as encodings; theory by its nature just has to start with initial concepts that enable hypothesis formation and further theoretical change. It is fairly unquestionable that any account has to start with something, to take something for granted. Rational constructivism, however, forces us to assume that these initial states are conceptual in nature, and this is clearly a case of foundationalism.

3.4 Two systems

The two-systems view seeks a middle ground between nativism and constructivism. Although there have been a number of different proposals that advance two systems (e.g. De Bruin and Newen 2012), Apperly and Butterfill’s account is the one most commonly associated with the term (Apperly 2012a, b; Apperly and Butterfill 2009; Low et al. 2016).

The basic assumption of the two-system view is that children pass the spontaneous–response FBT because they are in possession of mindreading system 1, which is a limited foundational theory of mind: a belief-tracking mechanism (Apperly 2012a; Apperly and Butterfill 2009; Butterfill and Apperly 2013).

Imputing spontaneous–response FBT results to the workings of system 1, two-system proponents claim that passing the elicited-response FBT requires a much more effortful and explicit way of thinking about other minds—what they call system 2—that children before preschool cannot really use. Although Apperly and Butterfill do not really address the development of system 2, it would make sense that it is constructed or somehow acquired just as other explicit ways of thinking about reality, and so its development might be influenced by environmental factors, which would explain the cross-cultural data we have presented before.

Theoretically speaking, however, the two-systems account does not offer a way out of foundationalism. Although Apperly and Butterfill are much more subtle than the proponents of the other accounts in their distinction between a theory-of-mind ability (an ability to behave as if one had a theory of mind) and theory-of-mind cognition (using a theory of mind as such), they still view their system 1 in terms of the latter, albeit not “full-blown.” And with it necessarily come its encoded representations:

We do not aim to argue that someone could track beliefs, true and false, without any theory of mind cognition at all. Our concern is rather with the construction of a minimal form of theory of mind cognition. As we shall explain, minimal theory of mind does involve representing belief-like states, but it does not involve representing beliefs or other propositional attitudes as such. (Butterfill and Apperly 2013, p. 3).

Further, it does not change matters much that they view the minimal theory they ascribe to infants as only a construct at the computational level of explanation (cf. Marr 1982/2010). The computational-level explanation still imposes significant constraints on possible implementations. Apperly and Butterfill’s framing of the computational problem in terms of a minimal theory of mind still poses a foundationalist problem: The computational-level theory has to be implemented in a way that merits being called “a theory”; that is, there have to be implementational equivalents of computational-level processes that relate to each other in the prescribed, theory-like way (compare the discussion of tacit knowledge in Davies and Stone 2001 and Fenici’s (2013) application of Davies and Stone’s ideas).

The computational problem they describe still consists in ascribing states to observed agents. Thus, system 1, though minimalistic, still presupposes concepts of object and agent and registration, whose contents are similarly left unexplained developmentally. As far as system 2 is concerned, here they run into all the problems that rational constructivism does—they openly draw an analogy to Fodor’s central system (Apperly and Butterfill 2009, p. 956), where explicit theories reside, when explaining why system-2 theory of mind should be effortful but flexible.

Notably, it can be argued that the two-systems account does not aspire to explain the development of the representational states in question, and thus that the charge of foundationalism does not apply to it. Still, the problem has to be solved if we are to give an exhaustive account of socio-cognitive development, and there does not seem to be much potential in the two-systems account to do so, as it frames the problem in cognitivist terms.

The general insight of the two-systems theory, however, is consistent with the interpretation offered by our action-based account. Following the action-based principle, we too arrive at the very similar conclusion that there is an important chasm between the processes underlying competent social interaction and explicit theorizing about others’ mental life. How we reach that interpretation, however, is importantly different, as later parts of the paper will show.

4 Emptiness of the concept of innateness

As we demonstrated above, all three accounts are foundational, which renders them theoretically untenable. One argumentational move that is often employed by foundational accounts is to defend foundationalism with a recourse to nativism. This is explicit in the nativistic accounts (e.g. Carruthers 2013, p. 151), but also a potential response of the other two frameworks. Therefore, below we point out why nativism is untenable in its own right.

As Racine (2013) argues, core or foundational knowledge approaches tend to use a neo-Darwinian adaptionist view on innateness, claiming that the inborn knowledge and skills present in infancy were an object of natural selection in phylogeny due to their evolutionary advantage, and hence are coded in genes and necessarily present in every individual. Rather than being solved, the foundational weight is thus moved onto biology. However, the move is unwarranted as developmental biology speaks against phenotypic traits as complex as concepts being formed prenatally and irrespective of experience.

Great revisions are afoot in modern biological sciences as some consider the twenty-first century to be the century of biology (Venter and Cohen 2004). One of the central issues in this revolutionary climate is precisely that of evolutionary mechanisms and viable notions of innateness. Following works of such researchers as Lewontin and Gould, modern biology is much more cautious with adaptionist stories of traits and the idea of them developing “innately” in ontogeny (Gould and Lewontin 1979; Lewontin 2001; Oyama 1985/2000). Psychology, however, seems much slower to catch on to this trend (cf. Racine 2013), as we see evidenced by the foundationalist accounts of cognitive development.

As biological research demonstrates, development is a multiply contingent process (Elman 1996; Gould and Lewontin 1979; Gould and Vrba 1982; Mameli and Bateson 2011; Oyama 1985/2000; Pigliucci and Müller 2010). A number of psychologists urge researchers to consider this in cognitive development too (Carpendale et al. 2013, p. 130; Carpendale and Wereha 2013, p. 208; Lewis et al. 2013, pp. 159–160; Lewkowicz 2011; Spencer et al. 2009). They point out that there are multiple elements whose interaction leads to the development of biological and cognitive forms, and hence any talk about “innate,” meaning encoded in genes, distorts the way in which genes matter for development. The “interactivist lesson” taken from the discussions in biology is that genes have their developmental significance only in the context of other intra-organismic as well as extra-organismic interactants. In other words, they have their “information” about a particular form only inasmuch as we keep other causes constant, which is hardly the case in nature. This interactive nature of development renders any talk about “genetically specified” innateness meaningless. We would be making just as much sense talking about innateness being “environmentally specified,” since for genes to have their particular causal powers there needs to be a particular environmental context (Carpendale and Wereha 2013, p. 208).[Footnote 4]

We argue that the nativist ToM approaches assume the idea of innateness that no longer fits with current research in biology and therefore construct their theories in a theoretical vacuum. Let us have a look at this excerpt from Carruthers (2013):

The infant-mindreading hypothesis, in contrast, postulates an innately channeled body of core knowledge, or an innately structured processing mechanism (or both), with an internal structure that approximates a simple theory of mind. The explanatory burden, then, is an evolutionary one: it needs to be shown that there were sufficient adaptive pressures among our ancestors for such a mechanism to evolve. There is now an extensive body of work suggesting that this is indeed the case. The gains provided by such a mechanism might derive from enabling so-called ‘Machiavellian intelligence’ (Byrne and Whiten 1988, 1997), from facilitating larger group sizes (Dunbar 1998), from enabling distinctively human forms of cooperation and collaboration (Richerson and Boyd 2005; Hrdy 2009), or from any combination of these. (Carruthers 2013, p. 151).

According to Carruthers, a main challenge for the nativist explanation is supposed to be telling an adaptionist story. However, this contributes little to developmental models because phylogenetic adaptations are not preformed phenotypes that are necessarily expressed in ontogeny.[Footnote 5] Neo-Darwinian adaptionism is a meta-theory conceived to talk about phylogeny exclusively: “The neo-Darwinian framework is at root, by definition, a nondevelopmental framework” (Racine 2013, p. 144). We may talk about innate features in phylogenetic analyses where the term is used to mean “reliably present in the species in a given environment”; these analyses assume developmental contingencies to be constant and talk about changes in population over phylogenetic time. When we are interested in ontogeny, however, we are trying to figure out precisely that which is excluded from the neo-Darwinian adaptionist framework: the contingencies of development and how phylogenetic heritage interacts with the actual context of growth. The fact that some trait evolved through adaptation does not mean that it is innate in the sense that most nativist theorists seem to assume, namely that it is preprogrammed and necessarily present regardless of environment (Oyama 1985/2000, p. 25).

Adaptations happen in an environmental context and certain aspects of that context are usually necessary for them to develop in ontogeny. What is then “innate” is not an intraorganismic encoding that is the problem of the evolutionary biologist to explain, but an integrated organism-environment stability that any developmental account must tell the story of.

Accordingly, even if a cross-cultural universality is established in infant performance on the spontaneous-response FBT, this does not entail that the necessary cognitive skill develops innately. The universality is most likely due to similarities of experience across these cultures, not to a genetically or internally specified module. This means that not only more cross-cultural spontaneous-response FBT studies are needed, but also inquiries into the nature of the contexts of growth in the cultures studied, which would make it possible to identify potential similarities and differences that can modulate the development of the skills. Only once these potential environmental modulators have been excluded as a partial cause of socio-cognitive skills could we advance any nativist (i.e. developmentally internalist) claims.[Footnote 6]

In sum, empirical and theoretical considerations about development speak in favor of the view that evolutionary endowment interacts with other factors in ontogeny and leads to social competence, rather than providing a preformed ability or representation. Consequently, the nativist ToM accounts are stuck between their inability to account for the emergence of representation in ontogeny and the implausibility of representation forming in phylogeny.

Below we present the action-based framework that solves the problem of foundationalism and is consistent with developmental science. Finally, we offer a sketch of an action-based account of the three groups of empirical data.

5 Solutions offered by an action-based perspective

In recent years, we have been witnessing a pragmatic turn in cognitive science (Engel et al. 2015) as various action-oriented views are proposed to redress the flaws of classic symbol-manipulating models. Although contemporary action-based approaches to cognitive development are still in the works (Pezzulo et al. 2015, p. 49), the central importance of action has been recognized by a number of theories, both older and more recent ones. To name a few: Piagetian approaches (Allen and Bickhard 2013; Bickhard 2009a, b; Carpendale and Lewis 2004, 2006; Newcombe 2011), Vygotskian approaches (Nelson 2007), dynamic systems (Thelen and Smith 2002/1996), grounded cognition (Barsalou 2008), radical enactivism (Hutto and Myin 2013, 2017), or the Predictive Processing Theory (Clark 2016).

Although there is much work to be done before we arrive at a comprehensive action-based account of cognitive development, the action-based principle has a lot of potential to create a much-needed unifying framework for development. Here we are interested in what the framework can offer to the research on social cognitive development. To explore this, we briefly sketch the action-based principle and demonstrate how it deals with the problems of foundationalism and nativism. Then, we provide a provisional action-based interpretation of the three main groups of data about social cognition, and stress the interpretation’s fundamental difference from the classic cognitivist ones.

In our sketch, we draw on three frameworks: Bickhard’s interactivism (Bickhard 2009a), Carpendale and Lewis’ (C&L) social-experiential approach (Carpendale and Lewis 2006), and Nelson’s Community of Minds (Nelson 2007). These frameworks, although largely underrepresented, have a lot to offer to the current debates in social cognition, especially in reference to the problems we have discussed in this paper. We do refer to other compatible and relevant theories in passing, but do not wish to present an exhaustive review of this sort.

5.1 The action-based principle

Here is what follows from the criticism of the ToM approaches we offered above, which we contrast with what the action-based principle claims: (1) For the criticized theories, every action or cognitive skill is underwritten by disembodied representational competence of sorts; for action-based models, action can precede representational mental content.[Footnote 7] (2) For the criticized theories, the development of social competence must start with an inborn base of amodal representation; for action-based models, it does not have to: representation can emerge from non-representational phenomena.

It is important to stress at this point that the action-based representation (which we take from interactivism) and the standard idea of representation as amodal encoding differ fundamentally.[Footnote 8] First, what we are interested in when modeling representation is not solving the metaphysical problem of reference (see, e.g. Quine 1960/2013, pp. 23–72), but only proposing an idea of representation that describes a viable way in which real organisms can represent reality. An action-based representation does not represent on the basis of reference or correspondence; it is not a disembodied symbol with a semantic stand-in for what is being represented. It does, however, have the necessary properties of representation, namely intentionality and truth value. And most importantly, it has them in virtue of processes which are consistent with developmental reality and which allow for representation to emerge in ontogeny (and phylogeny) from non-representational phenomena. How it does so, and how it achieves intentionality and truth value, should become clear in our exposition of the action-based principle below. This is drawn from interactivism (Bickhard 2007, 2009a, b, 2010; Bickhard and Terveen 1995).

An internal state S is a detection of an external state S*. The organism does not know that, but merely experiences the internal state as “this state.” Being in internal state S, the organism undertakes action A (from among others; let us assume for the sake of the illustration that for newborns actions can be random at first), which results in the external state S* changing to Y*. The organism’s physical organization is such that Y* evokes another internal state, state Y. This way, on the basis of non-representational detection properties, the organism can create an action-internal state contingency pattern: while in state S, action A leads to state Y. This provides the germ for normativity: the organism will now (implicitly) know that state S is not only just “this state” but also a state that can lead to state Y via action A. Thus, the organism functionally predicates something about the current situation; and it does so in virtue of the action-internal state contingency it has knowledge of, without the need to refer to the outside world at all. Moreover, the predication can be false or true, and the truth value can potentially be known to the organism: all it takes is to engage in action A to find out whether it is possible to go to state Y from state S. If it is not, then either the previous situation was not the situation that should have produced state S (no external state S*), or state S needs some other co-occurring states in order to afford going to Y, states which were only accidentally present in the previous interactions that went from S to Y via action A. The organism can accordingly update its functional predications after failing to achieve the expected result of its action.
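The contingency pattern just described can be made concrete in a minimal sketch. It is only an illustration under our own simplifying assumptions, not a formalization taken from interactivism: the names `Organism`, `detect`, `env_usual`, and `env_changed` are hypothetical, states are plain labels, and a falsified anticipation is handled by simply overwriting the stored contingency rather than by the richer revisions mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# Hypothetical type: (external state, action) -> resulting external state.
Environment = Callable[[str, str], str]


@dataclass
class Organism:
    # Maps an external state to the internal state it evokes (mere detection).
    detect: Callable[[str], str]
    # Learned contingencies: (internal state, action) -> anticipated internal state.
    contingencies: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def act(self, external: str, action: str, env: Environment) -> Optional[bool]:
        """Perform an action and check the anticipation attached to it.

        Returns None if no anticipation existed yet, True if the anticipated
        internal state occurred, and False if the anticipation was falsified.
        """
        before = self.detect(external)
        anticipated = self.contingencies.get((before, action))
        after = self.detect(env(external, action))
        # Simplifying assumption: the latest outcome replaces the old contingency.
        self.contingencies[(before, action)] = after
        if anticipated is None:
            return None
        return anticipated == after


def detect(external: str) -> str:
    # The organism only ever has access to the internal labels "S" and "Y".
    return {"S*": "S", "Y*": "Y"}[external]


def env_usual(external: str, action: str) -> str:
    # In the usual situation, action A turns S* into Y*.
    return "Y*" if (external, action) == ("S*", "A") else external


def env_changed(external: str, action: str) -> str:
    # In the changed situation, action A no longer has any effect.
    return external


organism = Organism(detect=detect)
first = organism.act("S*", "A", env_usual)     # no anticipation yet
second = organism.act("S*", "A", env_usual)    # anticipation confirmed
third = organism.act("S*", "A", env_changed)   # anticipation falsified
```

Note that the error signal (`third` being `False`) is computed entirely from the organism's own internal states and its stored contingency; nothing in `act` compares an internal state with the external state it detects, which is the sense in which the error is system-detectable.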

Now, there may be processes in more complex organisms whose main function is to probe the external reality in the way presented above. They can serve to keep track of what interactions are possible in the given situation so that the organism can be a competent agent that can choose from an array of affordable actions in light of its current goals. Bickhard refers to these kinds of representational phenomena as apperception. They are much more like the classic idea of representation as their function is mainly not to directly indicate the possibility of a given action, but only to predicate something about the current situation. This predication can then be the basis for many other interactions and constitutes what Bickhard calls the situation knowledge. Thus, apperceptive processes have the function of representation for other interactive processes that rely on them in order to guide adaptive behavior. In fact, any representation of a possible interaction can serve two functions—to cue the organism to engage in the interaction, or to provide information about the situation for another interaction representation (something like knowing that you can order a taxi anytime at a party makes you entertain staying after the last bus home leaves). Moreover, there are most likely many levels of recursive interactive systems within the architecture of the human mind, in which one system interacts with interaction potentials (representations) of a lower one, enabling explicit thought about its properties (Bickhard 1998; Campbell and Bickhard 1986). However, it is the function of representation for the organism’s processes that makes it representation, not some amodal, symbolic format it is coded in.

Importantly, the action-based representation offered by interactivism is inherently embodied and situated. The content is constructed through interaction and is grounded in the modalities of the experiences. (Ap)Perceptual processes constituted by sensorimotor contingencies (SMCs) may be grounded in one particular modality (O’Regan and Noë 2001), but the situation knowledge they create, and on which higher-order interactive representations rely, will span all the modalities. Representing a car, for instance, will be based on SMCs grounded in past explorations of how cars look, smell, and feel, what sound they produce, and, for some perhaps, what they taste like. The interactive representation of a car will consist in connecting the expectations of all these modalities under a common contingency: if I hear a car, the visual contingencies associated with it activate too, and I expect to see it when I turn my head. The central point is that the content of action-based representation is modal by definition, as it is past experience that constitutes it.

Consider a simple example (this is just to make a point, not an empirical claim). The infant experiences a state of hunger and starts crying (crying action could occur randomly, but it is plausible that such a simple state-action coupling could form prenatally; note, however, that the infant does not know anything about why she acts this way). Crying has its impact on the environment such that the mother comes and starts feeding the baby. As a result, the state of hunger changes to the state of satisfaction. Thus, the infant functionally and implicitly comes to know that when hungry, crying leads to satisfaction, and she has some (again, functional and implicit) idea what the internal state of hunger “means”. This knowledge—of hunger-crying-satisfaction contingency—does not have to be innate as the information about it is reliably present in the environment of growth; hence, evolutionary selection was more likely to predispose the child to quickly learn it in the above way rather than prewire it whole.

If the mother is not in the room, however, then crying will not have its effect.Footnote 9 Making sure that the mother is in the room will therefore be an apperceptual process in service of hunger satisfaction via crying. Perception is an apperceptual process par excellence, and we have an action-based account of perception nicely worked out by O’Regan and Noë (2001).Footnote 10 This way, through apperceptual processes, the child keeps track of whether the given situation really is a situation in which crying will lead to feeding and satisfaction of hunger.Footnote 11

The way that language changes the interactive context needs to be noted, as we refer to it in our account of the three groups of empirical findings. Linguistic interaction will build on non-linguistic interaction patterns, and words will come to represent the interaction possibilities they have been used in. Consider a game of which a linguistic utterance is a part: a simple naming game that mothers play with children many times, where, when presented with a toy, one has to say its name. Grounded in such an interactive pattern, the word Zebra, for instance, can come to represent the toy, as it was used in the game upon the presentation of the toy Zebra (feedback for failures in naming it so could have been provided in the form of undesirable interaction on the mother’s part, such as negative facial expressions or continuing to hold the toy instead of progressing to the next one). Linguistic units grounded in such a way can be naturally uttered in any situation, and the child will learn this when her mother uses the same expression in a different context, which evokes the interaction potentials in the child’s mind grounded in the past use of that word. Something similar will take place when the child herself uses it in a different context and observes its impact on the external, social world. Importantly, further contexts of use can themselves be linguistic, which is possibly how abstract meanings emerge.Footnote 12

There is much more to be said about language in an action-based framework (Bickhard 2007). However, for the present purposes, we want to point out that some socio-cognitive abilities could require language to develop, while some would not. There are some social competences that will be greatly improved by associating them with linguistic interaction, some that are embedded in linguistic interaction entirely, and some that do not gain much from it. For instance, the merely physical, sequential social interaction of changing a diaper would not gain anything from added linguistic components, whereas going for a walk would (linguistic structuring of activities outdoors makes them both safer and more interesting for the child). And it goes without saying that interactive competences that are entirely embedded in language, such as being able to hold a conversation, would need previous linguistic experience in order to exist at all.

It should be clear after this exposition that an action-based perspective solves the problem of foundationalism and is consistent with the multiply contingent nature of developmental phenomena. In fact, many of the aspects of the model mentioned above—such as emergence of representation from non-representational phenomena, naturalized normativity, system-detectable error, or the possibility of multiple, interrelated but qualitatively different representational processes within the organism—are simply absent in the framework upon which the traditional ToM accounts are based. As such, an action-based perspective offers a much more comprehensive alternative to modeling socio-cognitive development that can replace or potentially complement the standard models.

5.2 Constraints on experience in an action-based framework

With the above model of cognitive development, the kind of interactive experience the child engages in will determine the content of its cognitive structure. It is possible that some internal state-action contingencies are already present in newborns (think of all kinds of reflexes), but it does not seem likely that they come with prewired complex interaction representations of theory of mind. A more viable thesis is that the macroarchitecture of the nervous system predisposes the child to certain kinds of experience, and it is these experiences that form its microstructure, furnishing the mind with representational contents (understood pragmatically, not semantically). Natural selection has happened in the environment where certain elements were reliably present and so whatever developmental process it has selected will implicitly presuppose their presence in ontogeny. In other words, the phylogenetic heritage determines what can be possibly experienced by the child, but it is the context of growth and experience therein that determine what will be experienced (cf. Nelson 2007, p. 249).

The kind of interactive experiences available to the child will naturally differ depending on the time and place of development. Nelson’s (2007, p. 19) model of constraints on cognitive development captures this fact nicely. She identifies six kinds of constraints: evolved, embodied, ecological, socially embedded, encultured, and that of past experience.Footnote 13 What is experienced at a given time in development is jointly determined by the constraints. The view on development that we get here is therefore necessarily scaffolded—starting from the interactions available to an infant, she actively “acts her way up,” establishing first basic action-based representations (e.g. sensorimotor contingencies), and then using them to engage in new interactive experiences and establish further, more complex, representations (e.g. social competence).Footnote 14 The basic representations will be fairly invariant across environmental contexts as they will rely on largely universal patterns of physical interaction with the environment and caretakers; higher representations will be grounded in more complex social and cultural interaction, which will naturally differ greatly across social and cultural contexts.

5.3 Socio-cognitive development in an action-based framework

Coming back to the problem at hand, two constraints on experience are especially relevant for the emergence of socio-cognitive mental structure: non-linguistic social interaction and linguistic social interaction. Below we show why.

Carpendale and Lewis (2004, 2006) stress the role triadic interaction with parents and objects has for cognitive development. Naturally, the kind of interactive experience the child gets while interacting with other people will establish socially embedded representations of interactive potentials: The child will have developed certain expectations of how people behave in a range of situations and how they react to the child’s own actions. Such largely behavioral interaction competence will not, however, involve abstract concepts of minds or beliefs. The action-based representations established will be derived from purely physical interaction with other people, not understanding their minds. As argued by others (e.g. Fenici 2012, 2015; Gallagher 2008), such embodied interactive competence can readily explain early social cognition and spontaneous–response test results.

Things change when language enters the developmental system. Conversational situations constitute a new kind of interactive context that enables representation of unobservable minds. Carpendale and Lewis claim that language is acquired by learning patterns of interactions for which “it is appropriate to use a particular term, for example, mental or emotional, or dealing with pain, and so forth” (Carpendale and Lewis 2004, p. 88). Linguistic interaction is part of the external environment and as such provides interactive potentials embedded in language, otherwise unavailable. This fact—that language enables later social cognition based on abstract notions of mind and belief—has also been argued by others on various grounds (e.g. Fenici 2012; Gallagher and Hutto 2008; Hutto 2008; Nelson 2005, 2007). An action-based principle could provide a more detailed model of why this is so. The bottom line is that linguistic interaction potentials can go a long way in explaining how children pass the standard FBT, without the need to posit theoretical knowledge (cf. Gopnik and Wellman 1992).Footnote 15

5.4 Empirical data within an action-based framework

Now we can turn to the difference an action-based approach makes in the interpretation of the groups of findings we reviewed at the beginning of the paper. As already discussed, social interaction is necessary for the emergence of any representation that pertains to other people as such representations originate in past social interactions. Language is necessary for representations that are embedded in linguistic social interaction, and facilitates those that benefit from linguistic input. For the development of folk psychology both linguistic and social interaction are naturally necessary (see Fenici 2017; Fenici and Garofoli 2017; Hutto 2008).

  1.

    Infants’ performance in the spontaneous–response FBT can accordingly be explained as reflecting the point in the development of interactive competence that 15-month-olds have reached. Over the span of their lives so far, the infants have had enough interactive experience to represent and expect a false-belief-congruent scenario in the spontaneous–response FBT. From an action-based view, however, this is only functional competence, not underwritten by abstract concepts. The representational processes involved are interactive potentials and the expectations that follow from them, grounded in past interactions. More specifically, perhaps the infant’s expectancy is based on the chronic helpfulness in such situations that children of that age exhibit (Warneken and Tomasello 2007), combined with their experience of situations where previously absent adults interact with the child and the toys in the room in a way that betrays ignorance about the toys’ location.Footnote 16 To wit, the expectation that is violated in the spontaneous–response FBT has its source in the infant’s representation of the current interaction, not of mental states. It is possible that the surprise consists in the falsification of the interaction representation in which the child helps the adult and the adult’s established and expected role is that of not knowing where the objects are when she or he comes back into the room.

This is, of course, just a provisional projection; a more thorough analysis is needed, and more empirical tests done in the action-based spirit would illuminate the issue as well. What is important for us now, however, is that from an action-based perspective, the psychological phenomena behind the performance are not inborn or foundational, but constructed by past interactions, and the mental representation involved pertains to the physical aspect of the interaction only (cf. Banovsky 2016). There is no knowledge about unobservable minds involved. Infant social cognition is in this view competent social interaction based on previous non-linguistic interaction. This is generally consistent with minimalist accounts of early social cognition and some of the theoretical solutions offered there (Fenici 2015; Heyes 2014; Hutto 2015; Perner and Ruffman 2005; Ruffman 2014; Ruffman and Taumoepeau 2014). It is not consistent, however, with the two-systems account, whose treatment of early social competence is not minimalist enough: it concedes that system-1 agents still understand something about mental life.

  2.

    It should be clear by now that the significance of linguistic and social factors for social cognitive development is completely reevaluated from an action-based perspective. First, to have any expectations about social situations, infants need to have had relevant interactive experience that has led to the capacity for the formation of the respective situation knowledge—other people are part of the interaction potentials that the child represents in any situation that is social. How other people have behaved in the past and how they reacted to the child’s actions determines the kind of representation the child will have of any particular social situation. Without language, however, this kind of representation builds on purely physical interaction. Language introduces its unique interactive possibilities that are grounded in past linguistic interactions and make it possible to represent what is not physically there. Seeing Sally come back to the room in the FBT (Baron-Cohen et al. 1985) will induce interactive potentials of not only the observable reality, but also of linguistic interaction, rooted in past experience that involved talk about minds and intentions. It is only through language, then, that one can talk about the formation of abstract concepts; minds, beliefs, desires and other mentalistic concepts are developed through linguistic interactions that the child takes part in. Once there has been a sufficient amount of such linguistic experiences, it becomes possible to think about other minds too regardless of the current situation, as language imbues every situation with its interactive potential (cf. Clark 1997, pp. 193–218; Dennett 1996, pp. 147–152).

Consequently, the social-linguistic factors present in previous experience will naturally influence empirical results, as these factors are what largely constitutes the child’s representational abilities. In other words, more sophisticated social cognition, e.g. such that involves understanding opacity of terms, or one that involves giving justification of one’s answer, will build on previous linguistic interaction. This is in line with the fact that all aspects of language matter for social cognition (Milligan et al. 2007); this empirical observation makes sense in the current framework because language is not a computational tool that provides some special format or syntactic tool for symbol manipulation, but rather an interactive system that permeates the child’s cognitive structure (Bickhard 2007).

  3.

    Anthropological research informs us that linguistic and social interaction differs greatly across cultures and therefore affords often starkly different interactive experience. It is true that we may obtain similar results from the spontaneous–response FBT with infants across different cultures, since the kind of behavioral interaction that provides content for the expectation tested there can be fairly universal. As Carpendale and Lewis have it, “children […] may achieve comparable levels of development at similar ages because of commonalities in their experience” (Carpendale and Lewis 2004, p. 85). Relevant linguistic interactions, however, evidently differ markedly across cultures, and language-embedded representation of other minds will therefore differ too. The same applies to the narrow context of family and friends: they too afford varying interactive experiences and influence cognitive development, which is evident in the empirical data we presented at the beginning of the paper. The way other people are talked about, the way that social interaction is narrated, and the role that linguistic behavior plays in social interactions determine the kind of abstract representation of mental life that members of a given culture or family construct. We observe, for example, Japanese children failing the verbal FBT not because they predict the searching to be incongruent with the false belief, but because their justification, which is necessary to pass some versions of the test, does not refer to minds and beliefs, but to social relations (Naito 2014). This is not a surprise from an action-based perspective, as Japanese children have had different linguistic interactions in their past than children from the West, on the basis of which they have formed different interactive potentials (representations).

It is then entirely natural from an action-based perspective that we tend to find the greatest differences in social cognition in cultures that differ from the West in both social (family relations, social conventions, philosophical traditions etc.) and linguistic (syntax, semantics, and pragmatics) respects. Representational underpinnings of folk psychology will genuinely differ across cultures inasmuch as social and linguistic experience differs in them in relevant aspects (cf. Fenici 2017).

It becomes clear that our view is generally consistent with other views that proclaim the embodied nature of early social cognition and the linguistic nature of later mindreading (e.g. Andrews 2012; Fenici 2012; Fenici and Garofoli 2017; Gallagher and Hutto 2008; Newen 2015; Zawidzki 2013). However, what these approaches hold about the nature of the cognitive processes involved could differ greatly from the action-based ontogeny we have presented.

6 Conclusion

We have shown that nativism, rational constructivism, and two-systems theory offer unsatisfactory explanations of socio-cognitive development for two related reasons—their foundationalism and nativism. All three theories are foundational in virtue of their encoding-based view on mental representation, which precludes representational emergence a priori. Appeals to innateness do not offer a solution to this problem: The idea of inborn concepts is inconsistent with modern biology and glosses over the process of their development, which is the job of a developmental account to explain. Even though all three frameworks have been argued to account for empirical data, they remain untenable in light of the above theoretical issues. Although needing much work, the alternative, action-based perspective we have presented offers a framework that naturally avoids foundationalism and nativism.

Interpretations of the empirical data (infant mindreading, context-sensitive developmental progressions, and cross-cultural differences in folk psychology) differ greatly between the two paradigms. Foundational concepts set the course for development in the traditional frameworks, which means that experience plays only a mediating role in that ontologically internal composition. The action-based paradigm, on the other hand, adopts a radically different, grounded position, and sees past experience as constitutive of representations involved in social cognition. Variance in social cognitive skills observed across cultures, and other linguistic and social contexts, is therefore deeply significant as it evidences genuine conceptual differences in people from different socio-linguistic contexts. Different interactions that those contexts afford lead to the emergence of essentially different representations of the human social world.