Communication allows people to interact and build meaningful relationships by providing a means to exchange thoughts, feelings, and information. The processes of producing and comprehending language are fundamentally different; one involves producing ordered sequential sounds that convey an intended meaning, and the other involves translating heard sounds into meaning. Yet, in some areas of overlap, both speakers and listeners engage in similar processes. One of these areas is perspective taking: Both speakers and listeners use the perspective of their conversational partner to guide their language use. This is especially true when particular objects are being referred to, a process called referential communication. Perspective taking guides speakers’ word choices when they refer to an object, and it serves to narrow the possible referents that listeners will consider after hearing a reference (Krauss & Fussell, 1991).

One way that speakers and listeners use perspective to shape their language use is by separating information into two broad categories: common ground, which refers to mutually known information, and privileged ground, which refers to information known only to the speaker or listener. Processing perspective requires computing or inferring perspective differences (e.g., whether information is in common or privileged ground) and then integrating that information into communicative behavior (cf. Clark & Marshall, 1981). Speakers’ and listeners’ abilities to make computations and inferences about perspective differences vary, resulting in occasional failures to use perspective to guide their referential behaviors (Wardlow Lane & Ferreira, 2008). Little research has directly examined the factors that may be predictive of perspective-taking performance for speakers. Although previous research has implicated a role for domain-general processes in listeners’ perspective taking (Brown-Schmidt, 2009; Lin, Keysar, & Epley, 2010), such processes have not been examined with respect to production.

What are the general processes for producing language, and how do they differ from comprehension? Producing an utterance begins with a to-be-expressed thought. That thought then activates the associated features that define the to-be-expressed information to construct a preverbal message, at a level called message encoding (Bock, 1982; Levelt, 1989). The most highly activated of those features, the ones that when expressed will enable the speaker to express the intended message, will be selected for further processing at the next stage, grammatical encoding. At the grammatical-encoding stage, the selected features are mapped onto syntactic structures and lexical items. Those items are then sent to a phonological/articulatory stage, at which the linguistic features are sounded out for production. Finally, before and during utterance production, a monitoring stage checks the output to ensure that the to-be-conveyed message communicates what was intended at conceptual encoding.

In contrast, comprehension proceeds from the perception of sounds to the determination of meaning. Comprehension is made difficult by the facts that generally no markers indicate where a particular word begins and ends, and that the acoustic signal for a particular word can vary by speaker and context. Even so, within a syllable or two, listeners have made decisions regarding the word’s identity, its syntactic class, aspects of the syntactic phrase to which the word belongs, and how it relates semantically to previous input (Garrett, 1990). As such, listeners must make quick decisions regarding whether information is in common or privileged ground.

In terms of perspective taking, speakers must account for perspective differences when formulating an utterance. This may occur in the initial stages of production, such as during message encoding, or as part of a later process, such as a post-grammatical-encoding monitoring process (cf. Brennan & Hanna, 2009; Horton & Keysar, 1996; Keysar, Barr, & Horton, 1998). In contrast, when listeners hear a reference, their task is to search for that referent, a process that may include restricting the searched space on the basis of perspective differences, or only consulting perspective as part of a monitoring process (cf. Brown-Schmidt, 2009; Keysar, Barr, Balin, & Paek, 1998). That differences exist between the processes involved in accounting for perspective differences in production and comprehension suggests that one cannot directly infer that the same cognitive mechanisms that support perspective taking in comprehension will also support perspective taking in production.

The present experiment examines whether domain-general mechanisms, in this case working memory (WM) and executive control (EC), are predictive of speakers’ perspective taking. WM and EC are executive functions, which are a group of higher-order cognitive processes used to plan and execute complex tasks (Miyake et al., 2000; Pennington & Ozonoff, 1996). Executive functions include processes such as WM, EC, attention, planning, reasoning, and mental flexibility.

WM refers to a brain system that allows for the temporary storage and manipulation of information. According to Baddeley and Hitch’s (1974) influential model, WM comprises a central executive and two subsidiary systems, the phonological loop and visuospatial sketchpad. The central executive is responsible for directing attention and coordinating the activities of the subsystems. The phonological loop and visuospatial sketchpad are systems used to store and maintain verbal, visual, and spatial information. Higher WM capacity means greater resources to put toward attentionally and cognitively demanding tasks (Just & Carpenter, 1992). WM may enable speakers to mentally compare to-be-referred-to objects and surrounding objects in order to design and produce an appropriate reference.

Previous research has suggested that the human attentional system can be divided into three functionally and neuroanatomically separable networks: alerting, orienting, and EC (cf. Fan, McCandliss, Fossella, Flombaum, & Posner, 2005). EC refers to processes that modify behavior, such as strategy selection, conflict resolution, error correction, and inhibitory control (Reder & Schunn, 1996). Measures of EC typically require participants to inhibit a prepotent response in order to react to a stimulus. The present proposal is that EC may enable speakers to limit the selection of the conceptual features included in a particular referential expression to those that are intended, rather than to those that are highly activated. Additionally, EC and WM were chosen for investigation because past research had suggested a role for these executive functions in listeners’ perspective taking. If these executive functions regulate speakers’ perspective taking as well, this would suggest that the perspective-taking process is controlled by domain-general mechanisms, irrespective of performance modality.

Recent work on listeners’ perspective taking suggests domain generality. Brown-Schmidt (2009) and Nilsen and Graham (2009) both showed that listeners’ verbal inhibitory control performance predicted their perspective-taking performance. Brown-Schmidt measured listeners’ eye movements during a referential communication task and had participants complete a Stroop-like task (Stroop, 1935). Participants’ performance on the Stroop task significantly predicted their ability to use perspective to guide their reference resolution. Similarly, Nilsen and Graham gave 4- and 5-year-old children an adapted Stroop task and a tapping task (Diamond & Taylor, 1996; Luria, 1966) to measure inhibitory control, a backward digit span and a memory-for-objects task to test WM, and the flexible item selection task (Jacques & Zelazo, 2001) to measure cognitive flexibility. Children also completed referential-communication tasks as speakers and as listeners. Nilsen and Graham’s findings indicated that children’s reference resolution, but not their reference production, was related to their performance on inhibitory control measures, and was unrelated to measures of WM or cognitive flexibility (see Footnote 1).

Research has also shown that WM is predictive of adult listeners’ perspective taking. Lin, Keysar, and Epley (2010) measured listeners’ eye movements as they participated in a referential communication task, and also had participants perform a WM task. The results indicated that WM performance predicted the degree to which listeners used perspective to guide their reference resolution. This finding suggests that at least some part of the perspective-taking process for listeners is regulated by WM.

In the present study, individual differences in executive functioning were used to assess the domain generality of speakers’ perspective taking. Speakers were run on a referential-communication task and measures of WM and EC were collected. If perspective taking in production is regulated by these mechanisms that more generally regulate human behavior, as has been shown for language comprehension, performance on the executive function measures should predict performance on the referential-communication task.



A group of 60 undergraduates from the University of California, San Diego, participated. Two of the participants were eliminated due to failure to follow instructions.

Perspective-taking task

On each trial, participants viewed four pictures of objects on 8 1/2 × 11-in. paper. Three of these objects were mutually visible to the participants, and one was visible to the speaker only (see Fig. 1). The speakers named a particular target object for the listener so that the listener could pick it out of the display. Listeners were told that the speaker would name a hidden object on some trials, so that if the speaker named an object that the listener did not see, the listener could guess that the target was the hidden object. The objects varied in size, such that the actual size of a large (or small) object (relative to the size of the other objects in that set) on one trial might be small (or large) on another trial. The critical targets were medium-sized and in common ground. The task consisted of 16 filler trials, ten baseline trials, and ten privileged-ground trials. On baseline trials, targets were unique to the set. On privileged-ground trials, targets had size-contrasting pair mates in privileged ground. Each critical target was presented in both the baseline and privileged-ground conditions across participants. Four filler trials required participants to name a privileged object: Three of these were singletons, and one had a pair mate in common ground. Three of the filler trials required participants to name a common-ground singleton, and nine required them to name a common-ground target that had a size-contrasting common-ground pair mate.

Fig. 1 Example experimental display

Each shape was used on only one trial per participant and never occurred with more than one other object of the same shape. Speakers’ use of size-contrasting modifiers was measured: More modifiers on privileged-ground than on baseline trials would serve as evidence of a failure to account for perspective differences.

To begin, an experimenter read instructions and then administered two practice trials. After 36 trials, participants were administered the digit span and flanker tasks and then run on a second round of the perspective-taking task. For the second round, participants switched speaker/addressee roles and were instructed that although they would be viewing the same shapes as in the previous round, the shapes were paired differently, so using one’s memory of pairings from the initial round would not be helpful.

At the beginning of each trial, addressees closed their eyes, and speakers turned to a page in a stimulus binder, revealing a set of four objects. Speakers looked at those objects for 2 s, after which the experimenter pointed to one of the objects. Speakers then blocked that object using an occluder, making the object privileged. After another 2-s delay, the experimenter indicated a target by pointing.

Digit span task

To measure WM, participants completed the forward and backward digit span subtests of the Wechsler Adult Intelligence Scale, 3rd edition (Wechsler, 1997). Forward and backward scores were combined to create one digit span score. The forward span task measures participants’ ability to recall increasingly longer number strings in the same order as presented. This task measures short-term auditory memory and language-related attention resources (Hale, Hoeppner, & Fiorello, 2002; Lezak, 1983). The backward span task measures participants’ ability to recall number strings in reverse order, relative to presentation. This task measures participants’ ability to store, maintain, and manipulate information. Because of the linguistic component of these tasks, they likely tap into the phonological loop to a greater degree than the visuospatial sketchpad. However, research supports the view that WM capacity is a unitary construct across both verbal and visuospatial span tasks (Kane et al., 2004).

Flanker task

Participants completed a flanker task to measure EC (Fan, McCandliss, Sommer, Raz, & Posner, 2002). In this task, participants respond to the direction of a center arrow, which is flanked by horizontal lines or arrows. Participants should press a key with their left hand if an arrow points left, or with their right hand if an arrow points right. The targets were 32 congruent (five arrows pointing in the same direction), 32 neutral (a single arrow flanked by lines without arrowheads), and 32 incongruent (two arrows on each side pointing in the opposite direction from the center arrow) displays. The displays were evenly divided between left- and right-pointing center arrows and were randomly presented. Trials included a 500-ms central fixation point, followed by the target stimulus. The intertrial interval was 500 ms. Practice blocks began with six neutral trials, followed by six congruent trials, then six incongruent trials, and then six trials with equal numbers of the different conditions in random order.
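The composition of the flanker trial list described above can be sketched in a few lines of Python. This is only an illustration of the stated design (32 trials per congruency condition, split evenly between left- and right-pointing center arrows, presented in random order); the function name `build_flanker_trials`, the dictionary layout, and the `seed` parameter are assumptions introduced here and are not part of the original experiment software.

```python
import random
from collections import Counter

CONDITIONS = ("congruent", "neutral", "incongruent")
DIRECTIONS = ("left", "right")
TRIALS_PER_CONDITION = 32  # taken from the task description


def build_flanker_trials(seed=None):
    """Return a randomly ordered list of trials (hypothetical helper).

    Each condition contributes TRIALS_PER_CONDITION trials, half with a
    left-pointing and half with a right-pointing center arrow.
    """
    rng = random.Random(seed)
    per_cell = TRIALS_PER_CONDITION // len(DIRECTIONS)
    trials = [
        {"condition": condition, "direction": direction}
        for condition in CONDITIONS
        for direction in DIRECTIONS
        for _ in range(per_cell)
    ]
    rng.shuffle(trials)  # random presentation order
    return trials


trials = build_flanker_trials(seed=1)
cell_counts = Counter((t["condition"], t["direction"]) for t in trials)
```

Under these assumptions the list contains 96 test trials, with 16 trials in each condition-by-direction cell; the practice blocks are not modeled.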

Performance on the flanker task measures participants’ ability to avoid inappropriate action and to select appropriate action, both of which are implicated in executive control (cf. Kopp, Rist, & Mattler, 1996).


Perspective-taking task

The dependent variable, computed for each participant by contrast type (privileged ground vs. baseline trials), was the percentage of targets on which speakers used direction-appropriate modifiers. Speakers modified more often when the targets had a size-contrasting match in privileged ground (29.0 %) than when the targets had no match in the set (1.4 %), t(57) = 5.9, p < .001. Thus, speakers produced non-perspective-adjusted references on 29 % of privileged-ground trials.
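For readers who want to see how a comparison of this kind is computed, the following Python sketch derives a per-participant percentage of modified references and a paired-samples t statistic. It is a minimal illustration, assuming a within-participant (paired) comparison of the two trial types; the helper names and all numbers are hypothetical and are not the study data.

```python
import math
from statistics import mean, stdev


def percent_modified(trial_flags):
    """Percentage of trials on which a modifier was produced.

    trial_flags is a list of booleans, one per trial (hypothetical format).
    """
    return 100.0 * sum(trial_flags) / len(trial_flags)


def paired_t(xs, ys):
    """Paired-samples t statistic and its degrees of freedom (n - 1)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical percentages for four illustrative participants.
privileged = [30.0, 20.0, 40.0, 10.0]
baseline = [0.0, 10.0, 0.0, 0.0]
t_value, df = paired_t(privileged, baseline)
```

With ten trials per condition, one modified trial corresponds to 10 %. For a paired comparison the degrees of freedom equal the number of participants minus one, so a sample of 58 participants yields 57.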

Digit span task

The mean forward digit span score was 10.9 (SD = 1.9), and the mean backward digit span score was 7.8 (SD = 2.1). The mean combined digit span score was 18.8 (SD = 4.4), and the combined digit span scores significantly and negatively correlated with modifier use on privileged-ground trials (r = –.47, p < .005; see Footnote 2): The lower speakers’ scores on the digit span tasks, the more often they produced non-perspective-adjusted references.

Flanker task

Table 1 displays the mean RTs and error rates for each congruency condition. Participants were significantly faster on congruent than on neutral or incongruent trials (both ps < .01). Both RTs on incongruent trials and flanker interference scores (incongruent minus neutral RTs) were significantly and positively correlated with modifier use on privileged-ground trials (see Table 2), such that speakers with faster RTs on incongruent trials and smaller interference scores used size-contrasting modifiers less often on privileged-ground trials (r = .42, p < .01, and r = .33, p < .01, respectively).
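The interference score and its correlation with modifier use can be illustrated with a short Python sketch. The interference score follows the definition given above (incongruent minus neutral RT), and the correlation is a standard Pearson coefficient; the function names and the four data points are hypothetical and serve only to show the arithmetic.

```python
import math


def interference_score(incongruent_rt, neutral_rt):
    """Flanker interference: incongruent minus neutral mean RT (ms)."""
    return incongruent_rt - neutral_rt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical mean RTs (ms) and modifier-use percentages for four speakers.
incongruent = [520.0, 560.0, 600.0, 640.0]
neutral = [480.0, 500.0, 520.0, 540.0]
modifier_use = [10.0, 20.0, 40.0, 30.0]

scores = [interference_score(i, n) for i, n in zip(incongruent, neutral)]
r = pearson_r(scores, modifier_use)
```

A positive coefficient, as in this toy example, means that speakers with larger interference scores produced size-contrasting modifiers more often on privileged-ground trials.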

Table 1 Flanker task and digit span means and standard deviations
Table 2 Correlation matrix: Pearson bivariate correlations between task scores


Participants’ performance on the perspective-taking, WM, and EC tasks was measured. Of interest was whether domain-general executive function measures were predictive of the speakers’ perspective-taking behavior. Speakers regularly made errors in perspective taking by including information in their references that contrasted the referent with an object that was unknown to their listener. Furthermore, this performance was predicted by performance on WM and EC tasks, suggesting a role for domain-general mechanisms in speakers’ perspective taking.

How might WM be relevant to speakers’ perspective taking? Figure 2 illustrates a possible mechanism as it operates over time. Scanning a display to identify each object requires that the speaker store the identity of each of the other relevant objects in WM. Storing these objects allows the speaker to compare them and enables the relevant distinguishing features to become activated at appropriate levels. Imagine a real-world situation, such as referring to a particular car in a crowded parking lot. In this situation, there would likely be many other cars from which to distinguish the target, and the target would differ from other cars in more than one manner (colors, shapes, and sizes). To refer to one particular car, the speaker must determine which conceptual features are necessary to distinguish the target from the other mutually visible cars. WM is a resource that speakers could use to hold in memory the features of the target and compare those to the relevant surrounding objects, which are potential targets to a listener. This is not dissimilar to the processes necessary for performance on the digit span tasks, in which participants must hold the identities of numbers in memory while comparing and manipulating their order.

Fig. 2 Task analysis

How might EC be relevant to perspective taking in this type of task? The activation of conceptual features that differentiate common- and privileged-ground objects must be established and maintained appropriately and dynamically as relevant contextual conditions change. In Fig. 2, the concepts for “big” and “small” related to the triangles are relatively highly activated from Time Point 2 onward. This is necessary in case the speaker must identify one of those triangles while both are visible to the listener. In this situation, no role for EC is predicted. However, if one of the triangles becomes occluded, the size contrasts should lose activation, because the two triangles need not be distinguished from one another. In this situation, a role for EC in the management and inhibition of the activation and selection of size-differentiating features is proposed, similar to the management and inhibition of prepotent responses, as measured in the flanker task. Outside of the laboratory, situational features that serve to increase the activation of the now-nonrelevant size information, or to delay its loss of activation, should also make it more likely that the nonrelevant size information will be included in the speaker’s utterance.

Using the example of a parking lot, a speaker must be able to make reference to a particular car without selecting and producing inappropriate conceptual features, even if they are highly activated. Why might particular conceptual features be highly activated, even when speakers do not intend to include those features in their reference? Drawing a speaker’s attention to particular item features could cause the concepts associated with those features to become highly activated. Past research has suggested that when this occurs, speakers are more likely to include those conceptual features in their references, even when doing so is counterproductive to their own goals (Wardlow Lane, Groisman, & Ferreira, 2006; Wardlow Lane & Liersch, 2012). Outside the confines of a laboratory, speakers’ attention can be drawn in countless ways to the conceptual features of a to-be-referred-to object, as well as to surrounding objects. For example, a particular car might grab a speaker’s attention because it is unusual (rare, odd, or driving by quickly). The present proposal is that the extent to which a specific speaker can avoid producing a reference that is altered by that attention-grabbing object should be related to that speaker’s EC capacity.

Two prominent theories have proposed explanations for how interlocutors take others’ perspectives into account. The models make different predictions about how EC could influence perspective taking. First, the constraint satisfaction model proposes that the distinction between common and privileged knowledge is one of many contextual constraints that speakers use, and therefore its effect is only partial (Hanna, Tanenhaus, & Trueswell, 2003; Nadig & Sedivy, 2002). These constraints are weighted according to their salience and reliability, and their effects are related to the probability that each cue will be important for communication (Hanna et al., 2003).

In contrast, the dual-process model proposes that interlocutors take perspective into account through a two-stage process. The first stage is a quick and automatic stage, during which interlocutors begin formulating their message or interpreting a message using only their own perspective. The second stage is an effortful and controlled stage, during which the production and comprehension systems monitor for perspective. Given adequate resources, the system(s) will adjust to correct for the original egocentric perspective (Keysar, Barr, Balin, & Paek, 1998; Keysar, Lin, & Barr, 2003).

A constraint satisfaction approach might predict that EC mechanisms will reduce or “shut off” the size feature representation when inclusion of that representation violates perspective differences. In contrast, a dual-process approach might predict that EC mechanisms influence perspective taking via a monitoring process, recognizing at some point in the production process that a formulation with the size feature representation violates perspective differences. A third possibility is that EC mechanisms are able to affect perspective taking in both of these ways, through both a reduction of activation of the size feature representation and a later monitoring process. Further specification of process models of speakers’ perspective taking should take into account domain-general mechanisms that influence individual speakers’ abilities to use perspective to guide their referential behaviors.

This proposal does not assume that inclusion of a modifier is a prepotent response, only that the concepts related to the modifier will sometimes be highly activated due to the context within which a reference is made, and that this activation in turn will influence a particular speaker’s tendency to include that information, as a function of EC capacity. In this way, the present proposal differs from proposals made regarding listeners’ inhibitory control and perspective taking. Those proposals have suggested that listeners inhibit their own perspective (i.e., a prepotent response) in order to correctly identify a referential target (Brown-Schmidt, 2009; Nilsen & Graham, 2009). The present data do not speak specifically to whether speakers must inhibit a prepotent response in the form of their own perspective before composing their utterance. These results reveal that domain-general mechanisms exert influence over speakers’ perspective taking.