Language is often considered one of the core characteristics that distinguish humans from nonhuman animals (e.g., Culotta & Hanson, 2004). Aiming to identify general principles that guide the behavior of many species, behavior analysis concentrates on the similarities, rather than the differences, between speaking and other kinds of behavior. Consistent with evolutionary theory, behavior analysts seek an analysis of all species and all types of behavior in a common framework. To highlight the similarities between species, Skinner (1957) chose the label verbal behavior when introducing his monistic, non-mentalistic, and naturalistic account of human language. Outside of behavior analysis, many attempts to understand human language are still primarily dualistic and characterized by what Ryle (1949/2009) called category mistakes.

An example of dualism in the analysis of human language is the Shannon-Weaver model (1949), which suggests that a sender encodes an idea or information and then passes it through a communicative channel to a receiver, who decodes it. As the communication theorists Bavelas et al. assert, the “classic [Shannon-Weaver] model is deeply embedded in the terms we still use to describe conversational participants and processes, such as sender, receiver, and channel” (Bavelas et al., 2000, pp. 941–942; emphasis in the original). Sender-receiver models are popular in folk psychology, but they are problematic from an epistemological standpoint. Models such as the Shannon-Weaver model add no testable statements to what we observe. It remains unclear where we might look for the message, what it may consist of, or who or what can be said to do the encoding and decoding. In no species will applying dualistic sender–receiver models to communication yield answers to such questions (Baum, 2016).

An example of a category mistake leading to circular reasoning in the analysis of human language becomes evident in linguist Arne Torp’s (2004) book-length discussion of the origin of the Nordic languages, where he writes “language families [are] groups of languages which have developed from a common origin. . . . Sound commonalities can be used to establish different language families” (p. 23). Then, he gives an example of a word in three languages that belong to the Indo-European language family: English (“one”), German (“eins”), and Norwegian (“en/ein”), and concludes that these numbers sound so much alike “because they belong to the same branch of the Indo-European ‘tree,’ the Germanic branch.” (p. 24; emphasis added). First, the relation between the languages is established on the basis of sound commonalities, and then the relation between the languages is used to explain the sound commonalities. This circularity, or category mistake, stems from a confusion of categories (“all that have sounds XYZ in common”) and instances (Norwegian, English, and German, which belong to, or are part of, the Indo-European lineage). A confusion of (summary) labels and causal explanations risks concealing that no actual causes have been discovered (Ryle, 1949/2009).

Skinner (1957) defined verbal behavior as “behavior reinforced through the mediation of other persons” (p. 2) who “must be responding in ways which have been conditioned precisely in order to reinforce the behavior of the speaker” (p. 225; emphasis in the original). This definition is not restricted to vocal communication; it includes, for example, signing, writing, or gesturing. The advantages and disadvantages of this particular definition (e.g., as discussed by de Lourdes R. da F. Passos, 2012) have little bearing on the significance of Skinner’s (1938, 1953) insight that speaking is what he called “operant behavior.” Operant behavior is behavior that is partly produced by an organism’s environment and that acts (or “operates”) upon that environment.

In his interpretative work, Skinner (1957) primarily suggested an understanding of human language by means of an identification of certain operants, which he termed tacts, mands, echoics, and so on. During the 65 years that have passed since the publication of Verbal Behavior, the lion’s share of the research it has inspired has focused on these operants, generating widely published evidence of considerable applied success but fewer basic experiments or theoretical insights. Presti and Moderato’s (2016) account titled Verbal Behavior: What is Really Researched? reveals that research on verbal behavior published in the journal The Analysis of Verbal Behavior from 1982 to 2013 is almost exclusively restricted to an analysis of the verbal operants suggested by Skinner. Experimental work on verbal behavior not focusing on Skinner’s operants (e.g., Kraut et al., 1982; Salzinger, 1959; Simon & Baum, 2017) has been accumulating more slowly. Noteworthy theoretical insights on verbal behavior from an evolutionary perspective have been offered by Hayes et al., who link it to relational frame theory (e.g., Hayes et al., 2017), and by Greer and colleagues (Greer, 2008; Greer & Keohane, 2006). Greer and colleagues have introduced an account of the development of verbal behavior in ontogeny (i.e., from conception) based on the child’s mastering of behavioral cusps (Greer et al., 2017; Pohl et al., 2020). Like the view suggested in the following, Pohl et al. (2020) highlight a parallel between the evolution of morphology and the evolution of behavior during the lifetime of an individual. Whereas Pohl et al. focus on the concept of metamorphosis in morphology and in behavior, the present article focuses on multilevel/scale selection over generations and within the lifetime of the individual.

My primary goal is to build a bridge between two of Skinner’s lines of research in order to inspire future basic studies that go beyond the limits set by a focus on the verbal operants Skinner suggested. I have previously connected Skinner’s approach to speech as behavior with his evolutionary approach to behavior (Simon, 2018, 2020). Here, I give an example of empirical research derived from this crossing of research lines. I hope to help extend the community of behavior analytic researchers who focus on understanding human language from an evolutionary perspective and to inspire experiments going beyond Skinner’s operants. After clarifying the theoretical framework, I present a study inspired by the potential analogy between natural selection and what Skinner called operant selection.

Skinner (1981) suggested that natural selection and behavior change during the lifetime of the individual might follow similar principles. Like other natural phenomena, such as the workings of the adaptive immune system, antibiotic resistance, the ostensible phenomenon of “beginner’s luck” (which results from unsuccessful novice gamblers quitting), or the peculiar shape of giraffes’ necks, operant behavior, including speech, results from a combination of variation, transmission, and selection of units (Simon, 2020). The nature of these units, however, differs from one selective process to another. This means that behavior change during an organism’s lifetime follows in many regards the same principles as the evolution of behavioral and physiological traits from generation to generation. Ontogenetic selection (the former) has arisen from Darwinian selection (the latter), and the two mechanisms share similarities. These similarities are also found in other dynamic systems that have arisen from Darwinian selection. Since Skinner’s (1981) first proposal, this parallel between phylogenetic and ontogenetic change has been examined in detail, which has led both to a plethora of elaborations (e.g., Baum, 2017; Donahoe et al., 1993; Glenn et al., 1992; Hayes et al., 2020; Simon, 2016a, 2016b; Simon & Hessen, 2018) and to critiques (e.g., Sjøberg & Kennair, 2016; commentary on Skinner, 1984; Tonneau & Sokolowski, 2000).

In these systems, change results from variation, recurrence, and selection of traits. Selection may in some respects be compared to the workings of a sieve (Dawkins, 1995) that selects physiological characteristics and behavioral traits with the best fit to environmental requirements. Well-adapted traits have the highest probability of fitting through the “holes of the sieve” and thus of being passed on to the next generation. Baum (e.g., 2013) and Rachlin (e.g., 1992) were the first to point out a particular parallel between ontogenetic and Darwinian selection: The nature of the units of selection does not seem to be limited to what it appeared to be in earlier analyses. Analogously to biologists’ suggestion that natural selection can work on multiple nested levels, Rachlin and Baum put forward that operant behavior might be best understood in terms of nested units. They named their views teleological behaviorism and the molar-multiscale view, respectively (Baum, 2013, 2016, 2018; Rachlin, 1992, 1999, 2013; see Footnote 1). Baum’s term multiscale selection (in ontogeny) mirrors the term multilevel selection (in phylogeny).

Darwinian Multilevel Selection

In 1998, Wilson and Sober expanded the principle of Darwinian selection by suggesting multilevel selection as a mechanism of evolution. In this view, social organization of units is not a by-product of self-interest; rather, social groups are in themselves adaptive units. According to multilevel selection, it is not only genes or individuals that compete to pass through the environmental “sieve” to the next generation; instead, the layers of competing units can in some respects be understood to resemble nested sets of Russian matryoshka dolls (Wilson & Wilson, 2008; see Footnote 2). The lowest level of units that can be selected are genes. Genes are nested in cells, which are nested in organisms, which are parts of groups. Units on all these levels of complexity can pass through the selective “sieve.” Units on these different levels can be selected as wholes because they function cohesively to maximize fitness, or reproductive success. For a group-benefiting trait to spread, selection at the group level, involving competition between groups, must outweigh selection at the individual level, involving individuals competing within a group (O'Gorman et al., 2008). In group selection on different levels of complexity (genes, cells, organisms, groups of organisms), group fitness is higher or lower than the mean of the individual members’ fitness values, which means that traits evolve according to the survival and reproductive success of the group. Behavioral examples on the level of the organism that appear to influence group selection include cooperative raising of young, such as in elephants; cooperative hunting, such as among lions; systems of predatory warning, such as those used by prairie dogs and ground squirrels; and altruistic acts in humans.
Although researchers disagree about the relative importance of phylogenetic selection at the levels of genes, individuals, and groups, there is widespread agreement that evolution can, in principle, operate at multiple levels (Gerkey, 2015).

Multiscale Selection in Ontogeny

Parallel to Darwinian multilevel selection, Baum (2013, 2016, 2018) and Rachlin (1991, 2019) have proposed a process of multiscale or group selection of behavior in ontogeny. Although there is considerable debate among biologists as to whether the conditions for multilevel selection are met in phylogenetic evolution, Rachlin (2019) argues that they are, nonetheless, met in the ontogenetic evolution of behavior. The central claim of multiscale selection is that the environment selects behavior in units that are nested in, and consist of, other behavioral units of varying complexity on a continuous scale (see Footnote 3). Patterns (or extended units) of behavior emerge and cohere at different points on a scale. To use Rachlin’s (1992) examples, a dance consists of lower-scale movements or steps, and a melody consists of tones. The dance and the melody as wholes have a different function than their parts. You enjoy a movie as a whole, not as a sum of parts. If a part, say the last 10 minutes of a 100-minute movie, is missing, you do not experience 90% of the enjoyment that you would have experienced if you had seen the whole movie. Neither the melody nor the dance is learned because all of the tones or the steps were individually followed by reinforcement.

Taking care of your family may have parts such as building a house; building a house has parts such as putting up a wall; and putting up a wall may involve hammering a nail, which consists of a certain movement of your arm, and so on. This example illustrates how units of behavior can be viewed as nested into more extended units and as consisting of less extended units. Selection is assumed to act on the more and the less complex units as wholes. Building a house takes longer than putting up a wall. Parts of activities can vary. Whether I use a green or a red hammer does not influence how effectively I put up the wall, but whether I use a hammer or a stone will influence the completion of the more extended unit of putting up a wall. Thus, unlike the choice of tool, the color of the hammer will not be subject to selection by consequences (if there are no relevant consequences).

In some situations, the selection pressure on a part of a whole is opposite in direction to that on the whole. This happens, for example, in questions of so-called self-control. Dieting to reduce excessive weight is a temporally extended behavioral unit that may be selected by its correlation with consequences such as better health, social advantages, and simplification of everyday life, but the less extended choices that are necessary to achieve weight reduction, and that add up to behavior resulting in weight loss, correlate with punishment. Not eating the tempting dessert and exercising when one would prefer to watch a movie lying on the sofa are not reinforced but punished; that is, at that level of analysis, they are selected against. That means that when I cut desserts to lose weight, no particular acts are reinforced (Rachlin, 2004). The possibility of learning altruistic behavior despite the punishment that, by definition, follows single altruistic acts does not defy behavior analysis. When one acts altruistically, a valuable pattern of being nice to others is maintained. The pattern is valuable because making social decisions on a case-by-case basis is generally just as bad a policy as making decisions on whether to stop at a red light on a case-by-case basis. Group selection of organisms may be responsible for inheritance of genes contributing to altruistic tendencies (Wilson, 2015). By analogy, selection by reinforcement of patterns of responses may be responsible for our ability to learn to be altruistic (Rachlin & Locey, 2011; Rachlin, 2019). Likewise, an alcohol addict can become sober although every refusal of a drink is punished by withdrawal symptoms, because the temporally extended pattern of sobriety can evolve as a whole and does not need to be constructed in a chain-like fashion from more basic units (Rachlin, 2004, 2019).

Although the principle of selection of behavioral units on multiple scales has been laid out in detail (e.g., Baum, 2016), to my knowledge no data have been published that support or contradict this proposal, for either verbal or nonverbal behavior. Thus, in this article, I present an exploratory experimental design as a first attempt to fill the gap between postulating a process for theoretical reasons and supporting that process empirically.

The evolution of language and the evolution of cooperation are likely tightly intertwined. Hayes and Sanford (2014) highlight that “cooperation came first” and drove the evolution of human language because language facilitated increased cooperation. Independent of the “chicken or the egg” question, it is conceivable that language and cooperation have evolved (and still evolve) into more sophisticated versions, continuously catalyzed by each other. After all, one of the functions of language is to aid cooperation, which has an adaptive advantage.

Therefore, our experiment involves a cooperation task during which participants speak to each other. The task consists of arranging parts, which are themselves parts of larger parts. We tested whether participants’ talk during the task contained verbal markers that map onto the levels built into the task. To investigate the assumption of a process of ontogenetic selection of units of verbal behavior acting on multiple nested levels (see Footnote 4), we set out to identify verbal markers that conversational partners use to signal the onset and offset of an extended verbal unit at a particular level of complexity.

Some aspects of the multiscale nature of activities have received empirical attention in other fields of psychology that have developed independently of Baum’s and Rachlin’s framework. In an article published in Cognitive Science, Bangerter and Clark (2003) studied what they described as joint activities composed of nested projects and subprojects. The study reported in this article is derived from Bangerter and Clark (2003).

The Attempt to Gather Data

Bangerter and Clark (2003) analyzed recorded dialogue between pairs of participants cooperating to complete different predefined tasks together. The authors described the joint activities as emerging in hierarchical projects and subprojects. Two people jointly moving benches to another room or cooking dinner together are examples of such joint activities characteristically involving verbal coordination. Bangerter and Clark hypothesized that certain words, which they termed project markers, would occur to coordinate horizontal transitions between activities on the same level and that different project markers would be used to coordinate movements upwards or downwards through the hierarchy. Two of the corpora analyzed by Bangerter and Clark were dialogue recordings from settings in which pairs of participants were asked to solve Tangram tasks together. Tangram is a traditional Chinese puzzle game, an example of which can be seen in Fig. 1.

Fig. 1 Example Tangram solution with definitions of levels

In Bangerter and Clark’s (2003) setting, participants were to arrange sets of Tangram pictures in a particular layout. One participant (“The Matcher”) could not see the solution showing the correct layout, which was visible to the other participant (“The Director”). The Matcher had identical individual Tangram pictures to The Director’s layout solutions. The participants were asked not to show each other their pictures, but to talk as they normally would to solve a total of 12 sets of Tangram puzzles as correctly as possible. One corpus was recorded with a total of 18 pairs of participants speaking Swiss German, the other corpus was recorded with 18 pairs speaking American English.

Bangerter and Clark (2003) predefined three hierarchical goal and subgoal levels within the Tangram task. One of the main findings of their study was that participants often used the word “okay” for navigating into and out of subprojects but not for continuing on the same level (Fig. 1 illustrates such subprojects and levels). The patterns of verbal behavior that emerged in their study suggest that conversational partners utter discriminative stimuli that signal whether or not a transition to another task level is about to happen. This pattern of presumed discriminative stimuli, which we set out to explore further in our partial replication of Bangerter and Clark (2003), suggests that their descriptions consisted of nested levels, which reflect the levels of the task. We used nine new Tangram pictures for each trial and added a further level, the completion of each group of three Tangram pictures. Our participants completed the task speaking Norwegian.

The specific objective was to investigate whether verbal responses indicating the onset or conclusion of a nested verbal unit would be observable in a controlled setting in which pairs of participants were asked to complete an activity consisting of nested parts. We designed the Tangram task such that it consisted of four nested levels. Verbal responses occurring at the beginnings and ends of each level were analyzed. See Fig. 1 for an example of a complete Tangram task with labels of what we defined as levels. The levels in the task were defined as completing one:

  • A4 Sheet including a set of nine Tangram Pictures (Whole activity = one trial)

  • Group of 3 Tangram Pictures (Nested Activity Level 1 = NA1)

  • Individual Tangram Pictures (Nested Activity Level 2 = NA2)

  • Components of a Tangram Picture (Nested Activity Level 3 = NA3)

Method

Participants

Results from six pairs of participants were included. Seven participants were female and five were male. All were either 14 or 15 years old. Participants were attending a 60-min workshop giving a general introduction to behavior analysis as a science and field of application, which was part of a 2-day annual science fair at Oslo Metropolitan University, Norway. Participants did not receive payment or course credit in return for their participation. All high schools in Oslo were invited to the science fair and class teachers registered their class to attend workshops on different sciences throughout the 2 days. Our behavior analysis workshop included an optional participation in the experiment. The workshop and the experiment were open for all visitors of the science fair. In the workshop, we were careful not to give information on behavior analysis prior to participation that could be assumed to influence the results of the experiment.

An initial sample of 60 high school students and two teachers, forming 31 pairs (23 males and 39 females, aged between 14 and 27 years old), chose to participate in the experiment. All participants reported having no previous knowledge of behavior analysis. Half of the participants reported having Norwegian as their first language. Three of the pairs in which both students were native speakers used their right to withdraw from the experiment prior to completion. The sample included in this study resulted from exclusion criteria that were chosen to reduce the impact of potential confounding variables. Anyone who wished to participate could do so, but only the data produced by the sample of six pairs that met our predefined inclusion criteria were analyzed. They were the only pairs who (1) spoke exclusively Norwegian throughout the experimental session; (2) reported having Norwegian as a first language; and (3) did not violate the rules of the experiment by turning around in their seats during the experiment or showing each other a solution. We also excluded data from pairs who (4) either stopped the experiment to ask the experimenters questions or (5) did not progress further than completing four Tangram sheets after 30 min. The study was performed in adherence with the Helsinki Declaration and was approved by the relevant local ethics board.

Setting and Apparatus

Experimental sessions were conducted in seven rooms at Oslo Metropolitan University, Norway. Each room was equipped with a large table (1 m × 2 m), a small table (1 m × 1 m), and three chairs. The large table was placed against one of the walls. One chair was placed at this table facing the wall. Back-to-back with this chair, another chair was placed, facing the opposite wall. The small table was placed in front of this chair. A chair for the experimenter was placed in the corner of the room.

Prior to the experiment, a total of 72 Tangram pictures were created using INKSCAPE 0.92. These Tangram pictures were used to make eight Tangram solution sheets (lettered A–H) and corresponding sets of nine Tangram picture cards (also lettered A–H). Each of the 72 Tangram pictures created were randomly assigned to a set of nine. The nine Tangram pictures were printed in color and the Tangram solution sheets were A4 sheets of card as shown in Fig. 1. The positioning of the Tangram pictures across the page remained the same for all solution sheets, although the Tangram pictures were unique to each set of nine.
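The random assignment of the 72 pictures to eight lettered sets of nine can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used in the study: the function name, the numeric picture IDs, and the seed are hypothetical and introduced here for the example.

```python
import random

def assign_pictures_to_sets(n_pictures=72, set_labels="ABCDEFGH", set_size=9, seed=None):
    """Randomly partition picture IDs into equally sized, lettered sets.

    The numeric IDs (1..n_pictures) are hypothetical stand-ins for the
    Tangram pictures; the source does not say how pictures were identified.
    """
    if n_pictures != len(set_labels) * set_size:
        raise ValueError("pictures must divide evenly into the lettered sets")
    rng = random.Random(seed)           # seeded only to make the sketch reproducible
    picture_ids = list(range(1, n_pictures + 1))
    rng.shuffle(picture_ids)            # random order, then cut into consecutive chunks
    return {
        label: picture_ids[i * set_size:(i + 1) * set_size]
        for i, label in enumerate(set_labels)
    }

sets = assign_pictures_to_sets(seed=0)
```

Shuffling once and cutting the shuffled list into consecutive chunks guarantees that every picture appears in exactly one set, which matches the statement that the pictures were unique to each set of nine.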

The solution sheets were placed on the small table in the room, facing down, stacked in order from A–H with their corresponding letter printed on the back. The eight sets of picture cards were placed on the large table. Each set contained nine cards, each of the nine cards consisted of one Tangram picture, printed in the same size as on the solution sheets. The nine cards were shuffled into a pile in a random order. Each set was placed in a deck facing down on a blank A4 card, which was labelled with their set letter (A–H).

A Toshiba Camileo H20 video camera on a tripod was located on the right-hand side of the table, statically directed towards the card sets and blank pieces of paper, filming participants’ hands moving the Tangram pieces.

Procedure

Seven undergraduate students in behavior analysis were trained as experimenters and workshop instructors. The classes of high school students who participated in the workshops were first given a 10-min introduction to behavior analysis as a field and then asked if they would like to participate in some research at the department, looking at how people work together. They were told that they would receive further instructions in the experimental rooms should they choose to participate and that they could leave the experiment at any time to return to the classroom, where their teacher would be waiting. The students who volunteered to participate all knew each other beforehand but were randomly assigned to pairs. One of the experimenters took each pair to one of the seven experimental rooms.

Once inside the experimental room, the two participants were randomly assigned to one of the two chairs. The participant seated in the chair facing the large table was informed that they would work as the Matcher and the participant seated in the other chair was told that they would work as the Director. They were informed both orally and in writing that a recording would be made of both of their voices and the Matcher’s hands when on the table once they started the experiment at their consent. They then received consent forms and were asked if they had any questions about them before signing them. They were also verbally reminded that if they did not wish to participate in the experiment or wanted to leave the experiment at any time, they could return to the classroom without needing to explain why they did not wish to participate. They were also told that if they chose to withdraw, or requested so at the end of the experiment, the recordings of their session would be erased.

Once the consent forms were signed, the participants received an instruction sheet. The instructions told the participants that their goal was to complete as many Tangram sets as accurately as they could within a maximum time of 30 min, after which the experimenter would stop them. They were informed that the Matcher had eight sets of nine Tangram pictures labelled A–H and that each of the Director’s sets A–H contained the same pictures as the Matcher’s. They were also told that completion of a set involved the Matcher laying out their Tangram cards on the blank pieces of card in the same layout as on the Director’s solutions. They were then told that, when they thought they had the correct solution, the Matcher should cover it with one of the blank pieces of white paper and move it to the side of the table, and that they should then begin the next set, working through the sets alphabetically and starting the next only after they had finished the previous one.

The Matcher had the blank pieces of card and Tangram picture cards but could not see the Director’s Tangram solutions. The Director had the Tangram picture solutions but could not see the Matcher’s picture cards or the cardboard sheets on which the Matcher placed the cards. The participants were finally informed that they should not show each other anything on their tables but that they should talk as they normally would to complete the task together. They were then asked if they had any questions, in which case the experimenter clarified the instructions. After this, they were told that if they were unsure of what to do next during the course of the experiment, they should refer to their printed instruction sheet.

The experimenter then started the camera recording, sat on a chair in the corner of the room and started a 30-minute timer. Unless the pair had completed all sets before the 30 minutes were up, once 30 minutes had passed, the experimenter waited until the participants had completed the group of three Tangram pictures they were currently working on, then asked the participants to stop their task, and then stopped the camera. The pairs were then asked to complete a demographics form with questions on their age, gender, first language, and whether they participated in their session as the Matcher or Director.

Pairs were given the opportunity to review but not change their solutions before returning to the main classroom together with the experimenter. Participants were given a debriefing in the classroom, describing the background and aims of the experiment. They were invited into a conversation where they could ask any questions both about the experiment and behavior analysis as a field.

Coding

Recordings of each experimental session were saved on an external hard drive and named with a code. The answers to the demographics forms were given the matching code in Excel. The recordings of the pairs of participants that met the inclusion criteria were transcribed for analysis, including the start time and completion time of the session, as well as the completion time of the Tangram sets (trials), such that it became visible in the text when the pair started or completed a level of the task. The first and last word said on each level were noted across trials. Total word counts were made for words frequently said starting and ending a level. The recordings of the pairs of participants that did not meet the inclusion criteria were deleted.
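The tallying of first and last words per level could be sketched as follows. The data layout (a list of `(level, text)` pairs, one per completed unit), the function name, and the toy Norwegian utterances are all invented for illustration; they are not taken from the study's transcripts, and the study's own coding was done by human observers.

```python
from collections import Counter, defaultdict

PUNCTUATION = ".,!?;:\"'"

def tally_boundary_words(segments):
    """Count the first and last word of each transcribed unit, grouped by level.

    `segments` is a hypothetical structure: an iterable of (level, text) pairs,
    where `text` is everything said while the pair worked on that unit.
    """
    first_words = defaultdict(Counter)
    last_words = defaultdict(Counter)
    for level, text in segments:
        # Lowercase and strip surrounding punctuation so "Ok," and "ok" match.
        words = [w.strip(PUNCTUATION) for w in text.lower().split()]
        words = [w for w in words if w]
        if not words:
            continue  # nothing was said during this unit
        first_words[level][words[0]] += 1
        last_words[level][words[-1]] += 1
    return first_words, last_words

# Toy utterances, invented for this example only.
toy_segments = [
    ("Whole", "Ok, da begynner vi. Ja"),
    ("NA2", "Også en som ligner en fugl. Ja"),
    ("NA2", "Ok, neste bilde. Ja"),
]
first_words, last_words = tally_boundary_words(toy_segments)
```

Note that this sketch treats a marker as a single whitespace-delimited token, so a two-word marker such as “Og så” would be counted under its first word and would need separate handling.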

Interobserver Agreement

To assess interobserver agreement, the verbal interactions of a random selection of three of the six pairs were assessed in full by a second observer. An interobserver point-by-point agreement ratio (Kazdin, 2011) was calculated. Where the two observers had noted the same verbal response as occurring at a particular level, it was scored as an agreement. Where the observers had noted different verbal responses, it was scored as a disagreement. The total number of agreements was divided by the total number of agreements plus the total number of disagreements, and that number was multiplied by 100. Mean agreement coefficients for the three pairs were 89%, 91%, and 95%.
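The point-by-point agreement calculation described above can be written out as a short function. This is a minimal sketch: the function name, the representation of each observer's record as a list of noted words in matching order, and the example records are assumptions made for illustration, not the study's data.

```python
def point_by_point_agreement(observer_a, observer_b):
    """Return agreements / (agreements + disagreements) * 100.

    Each argument is a list of the verbal responses one observer noted,
    aligned position by position (a hypothetical representation).
    """
    if len(observer_a) != len(observer_b):
        raise ValueError("both observers must score the same set of points")
    if not observer_a:
        raise ValueError("at least one scored point is required")
    agreements = sum(a == b for a, b in zip(observer_a, observer_b))
    disagreements = len(observer_a) - agreements
    return 100 * agreements / (agreements + disagreements)

# Invented records: the observers differ on one of ten points.
first_observer = ["ok", "ja", "også", "ja", "ok", "ja", "også", "ok", "ja", "ja"]
second_observer = ["ok", "ja", "også", "ja", "ok", "ja", "også", "ok", "ja", "ok"]
agreement = point_by_point_agreement(first_observer, second_observer)  # 90.0
```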

Results

A total of 25,191 words were transcribed, with a mean word count per experimental session of 4,199 (N = 6; SD = 446). During six 30-minute sessions, there were a total of 34 replications of the start of the complete sheet (Whole Activity), 33 replications of the end of the Whole Activity, 100 replications of the start and end of nested activity 1 (NA1), and 300 replications of the start and end of nested activity 2 (NA2). Analysis of the start and end of nested activity 3 (NA3) was conducted on a random selection of 50% of the pairs, which gave data on 417 replications of the start and end. See Table 1 for a summary of the number of completed levels for each pair, total words spoken per session, and total words spoken per minute. From trial to trial, participants needed less time and uttered fewer words to complete the task, as visible in Figs. 2 and 3, which show word counts and duration per trial. See Table 2 for a rough translation of words that occurred at the start of each of the activity levels (see Footnote 5).

Table 1 Number of Completed Whole Activities (Whole) and Nested Activities (NA2, NA3) per Pair alongside Total Words Spoken per Experimental Session and Words Spoken per Minute
Fig. 2 Total words spoken per trial for each pair of participants

Fig. 3 Time in minutes taken to complete each trial for each pair

Table 2 Translation of Norwegian Words to English words

The most commonly occurring words at the start and end of each level of activity were “Ok”, “Ja” (Yes, Yeah), “Også” (Also, And, As well) or “Og så” (And then). Table 3 shows, for each level, the percentage of instances that were started with each word. If the same project markers are used to coordinate transitions between activities on the same level and different project markers are used to coordinate movements upwards or downwards through the hierarchy, Table 3 would in four of its rows show high percentages in one cell per row and low percentages in all other cells in the same row. Although the differences between percentages are not large enough to be entirely conclusive, the data suggest that the words used at the start of a level might indeed serve as project markers. Over trials, the Whole activity (N = 34) was started with the word “Ok” 41% of the time and “Ja” 29% of the time. NA1 (N = 100) was started with the words “Også” or “Og så” 28% of the time and “Ok” 29% of the time. NA2 (N = 300) was started with the word “Også” or “Og så” 49.3% of the time and NA3 (N = 417) was started with the word “Også” or “Og så” 61.1% of the time. Figure 4 shows an increasing trend of “Også” being spoken at the start, and a weak decreasing trend of “Ok” being spoken at the start, as activities become more deeply nested (from the Whole activity to NA3).

Table 3 Percentage of Activity Levels that were Started with each Marker Word for the Whole Activity and the Nested Activities (NA1, NA2, NA3)
Fig. 4 Ratios of activity levels that began with the word “Også” and “Ok” for each pair (0 = Whole, 1 = NA1, 2 = NA2, and 3 = NA3)

Percentages of words spoken at the end of each of the levels are presented in Table 4. All levels were most frequently completed with the word “Ja,” with little variability across levels. A mean normalized word count for each of the most commonly used words at the start and end of activity levels is presented in Table 5. Compared to its normalized word count per 1,000 words, “Ja” was spoken at a much higher frequency at the end of the trials than during the rest of the conversation.

Table 4 Percentage of Activity Levels that were Ended with each Marker Word Analyzed for the Whole Activity and the Nested Activities (NA1, NA2, NA3)
Table 5 Mean Normalized Word Count (Occurrence per 1000 words) for the Most Commonly Used Marker Words for All Pairs throughout each Experimental Session
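
A normalized word count of the kind reported in Table 5 expresses how often a word occurs per 1,000 transcribed words, which makes sessions of different lengths comparable. The sketch below is only an illustration under stated assumptions: the function name, the simple regular-expression tokenization, and the short example transcript are hypothetical and are not taken from the study's transcripts or counting procedure. It handles single-word markers only; a two-word marker such as “Og så” would need separate handling.

```python
import re


def normalized_count(transcript, marker, per=1000):
    """Occurrences of a single-word marker per `per` transcribed words."""
    # Assumed tokenization: lowercase the text and split on non-word characters.
    tokens = re.findall(r"\w+", transcript.lower())
    if not tokens:
        raise ValueError("Transcript contains no words")
    return tokens.count(marker.lower()) / len(tokens) * per


# Hypothetical transcript fragment of 12 words, three of which are "ja".
example = "Ok så tar vi den. Ja. Også den store trekanten. Ja ja."
ja_per_1000 = normalized_count(example, "ja")
```

Comparing such a baseline rate with the percentage of level endings at which the word occurs is what allows the statement that a marker is overrepresented at level boundaries.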

Discussion

As pointed out in the brief of this special issue: “Despite different theoretical particularities, the study of human language . . . within behavior-analysis seeks to develop a monistic, naturalistic account of human language . . . that is devoid of mentalistic theorizing” (Harte, 2023). Spelling out the connection of the analysis of behavior, including verbal behavior, to evolution is one way to pave the way for achieving this goal. Commonalities between natural selection and selection of behavior in ontogeny have, for example, been suggested by Pohl et al.’s (2020) work on metamorphosis, which undertakes to explain the developmental unfolding of speech based on mastering behavioral cusps. Borgstede and Eggert (2021) have provided a definition of reinforcers based on their effect on biological fitness. Baum (2016) and Rachlin (2019) have elaborated on parallels between biological group selection and the nature of the units of selection of behavior in ontogeny. Although other work, such as that on metacontingencies (Glenn, 1988; Houmanfar & Rodrigues, 2006), on resonance (Field & Hineline, 2008), or on the hyperdimensional, multilevel (HDML) framework (Barnes-Holmes et al., 2021), has been concerned with the nature of the units of selection that is central in Baum’s molar multiscale theory and Rachlin’s teleological behaviorism, to my knowledge, no one has published attempts to test empirical implications of Rachlin’s and Baum’s approach.

Although the study at hand shows what words people use more or less while completing a nested collaborative task, it is not a test of multiscale selection of verbal behavior. The prediction was that there is a pattern in what words participants use to open and close the work on a particular task level because such a pattern presumably makes the verbal interchange, and thus the solution of the task, more effective. Results can be interpreted in a multiscale selection framework, but the absence of such a pattern would not necessarily contradict multiscale selection. A possible reason for the absence of empirical tests of the, for many, convincing multiscale narrative resembles a much-discussed problem of evolutionary theory in general (McCain & Weslake, 2013; Thompson, 1981; Williams, 1973): It is difficult to distinguish whether deriving hypotheses whose test would falsify the idea of nesting is challenging or impossible. In our Tangram task, we defined the levels, but what would motivate the assumption that these nested task levels map onto nested levels in the verbal behavior that accompanied the task? And how could one be sure what parts of the dialogue belong to what level? According to Bangerter and Clark (2003), participants need to manage both the task, with the help of the dialogue, and the dialogue itself, with different words serving one function rather than the other. Would that imply two interdependent nesting systems?

Thus, rather than presenting a test aiming at falsification of multiscale theory, my attempt to illustrate a novel approach to the study of verbal behavior sticks to empirically exploring a possible implication of Baum’s (2018) and Rachlin’s (1991) multiscale view on behavioral units of selection. In our study, based on the presumed parallels between group selection in phylogeny and behavioral group selection in ontogeny, we aimed to see whether patterns of verbal responses map onto nested levels of a joint task. If the participants give verbal discriminative stimuli signaling transitions between task levels, this suggests that dialogue is constructed in a way that helps navigation between task levels. Effective completion can then be assumed to select the occurrence of certain level marker words, whose occurrence, in turn, leads to a more effective completion of the task.

The frequency of words occurring at the start and end of each level of a nested activity was analyzed. “Ok” was most frequently spoken at the start of the Whole activity, with a decreasing trend from NA1 to NA3, whereas “Også” / “Og så” (“also” / “and then”) occurred frequently at the start of both NA2 and NA3 and became more frequent as the level of activity decreased. “Ja” (“yes”) was the single most commonly occurring verbal response at both the end of the Whole activity and of each of the nested activities. These patterns may suggest, on the one hand, that the choice of words at the start of a description on a particular level of a task may indeed function to support orientation among levels. On the other hand, the lack of one-to-one correspondence between particular words and onsets of levels leaves it open whether the starting words naturally signal only probabilistically what level of description will follow, and whether patterns would become clearer if the sessions lasted longer. Another possibility is that the lack of clarity in the data arose from artifacts of our particular data collection, such as variability in the words used at the start of a level due to the participants’ dialects (which can vary widely in Norwegian). It is possible that the common occurrence of “ja” at the end of all levels means that it is useful to signal that a level is completed, whereas it is unnecessary to restate which one it was because that would not help the matcher with the task.

Our observation that the Whole activity is most reliably started with the same words (“ok”, “ja”) whereas the lower levels started most frequently with “også,” is in line with Bangerter and Clark’s (2003) finding that most higher-level projects were reliably started with “okay” whereas the lower levels were less distinguishable by use of a “project marker.” At the same time, most of our participants used the same word to start all lower-level projects whereas Bangerter and Clark observed less consistency.

This exploratory study is descriptive in nature and, by design, does not allow for causal conclusions. It is intended to provide a design that will generate data which are a prerequisite for future studies that can uncover causal relations. In such a future study, participants could, for example, speak to confederates who use the marker words “incorrectly” or omit them in half of the trials. If these words actually function as discriminative stimuli signaling the onset or completion of a level, the “incorrect” use of them should lead to less effective cooperation on the task in half of the trials. All pairs both used less time and spoke fewer words over the course of the trials in our study, which may be a sign of increasing effectiveness and skillfulness as they became more experienced in solving the task. This trend should be weaker, absent or in the opposite direction when the marker words collected in the study at hand are used incorrectly.

To be sure, our design did not allow for finding selection effects at more than one level. The data could also be interpreted without the theory that frames the article. As an example, the increase of “også”/“og så” across nested levels to mark the beginning of working on a lower-level Tangram part can be interpreted as an increase in either the tacting of task incompleteness or manding for continued activity on a task. A test of implications of the multiscale view, building on this study’s findings about what words typically occur when Norwegian speakers start working on a particular level of our Tangram task, would need to show that successful completion of the task on one level affected behavior on more than one level.

Multiscale theory has most explanatory power when the consequences on a lower level are opposite to the consequences on a higher level into which the lower level is nested. That is, selection pressure acts in two opposite directions. This is the case when selfish behavior of individuals outcompetes altruistic behavior of individuals whereas altruistic groups (consisting of individuals) outcompete selfish groups (consisting of individuals; Wilson, 2015). Questions of self-control also follow this pattern, when individual acts (such as drinking alcohol, eating candy, doing drugs, smoking cigarettes) are reinforced whereas extended patterns of these acts (frequent consumption of alcohol, candy, drugs, cigarettes) are punished by the consequences of addiction (Rachlin, 2004). I would like to invite the community to design an experiment testing the effects of selection pressure in opposite directions on nested units of verbal behavior because this would be one way to go about collecting empirical support for or against multiscale selection. Despite the weaknesses of the particular study presented here, I hope that this first attempt at applying the multiscale view will spark a conversation on how to further develop the study of verbal behavior informed by evolutionary theory.