Behavior-analytic approaches to intervention with individuals with autism spectrum disorder (ASD) and other intellectual and developmental disabilities (IDD) have consistently been supported as effective (Courtade et al., 2015; Hume et al., 2021). A feature of some of these empirically derived interventions (e.g., discrete trial training) is repeated practice opportunities under meticulous stimulus control, which can lead to responding that appears “rote” (Stauch et al., 2017). This is a concern, particularly when language or communication repertoires are the focus of intervention. A primary feature of language is generativity, or the ability to create and understand a potentially infinite number of sentences never previously heard or said (Ming et al., 2014). As a result, facilitating generative language has become a key area of interest in behavior analysis. Methods to promote the emergence of untrained responding have been developed from several theoretical frameworks, such as naming theory (Horne & Lowe, 1996) and stimulus equivalence (Sidman, 1971; see Gibbs & Tullis, 2021 for an overview of these two frameworks). The current review focuses on interventions derived from relational frame theory (RFT; Hayes et al., 2001) because of its foundation in the inherently derived nature of language.

Relational Frame Theory

RFT (Hayes et al., 2001) suggests that language arises from the ability to engage in a generalized repertoire of responding to stimuli in terms of other stimuli, otherwise known as relational responding (Stewart & Roche, 2013). Relational responding can be nonarbitrary (i.e., based on the formal properties of the stimuli being related) or arbitrarily applicable (i.e., based on verbal or contextual control). Arbitrarily applicable relational responding is developed via exposure to multiple exemplars and contingencies provided by the larger socioverbal community, and it appears to form the foundation of human language (Stewart & Roche, 2013). Various patterns of arbitrarily applicable relational responding, or ‘frame families’, have been described throughout the literature. Coordination, based on sameness or similarity, develops earliest (Hayes et al., 2001) and has been the focus of the majority of research in relational responding (Gibbs & Tullis, 2021). Other frames that have been evaluated empirically include distinction (difference), opposition (opposite), comparison (relativity between stimuli along a specific dimension), hierarchical (containment, inclusion), temporal (sequencing), and deictic (perspective-taking) relations (Barnes-Holmes et al., 2018). Because any stimulus can be related to any other stimulus in accordance with any relational frame, arbitrarily applicable relational responding has the potential to be incredibly generative.

All relational frames are defined by the properties of mutual entailment, combinatorial entailment, and transformation of stimulus function (Rehfeldt & Barnes-Holmes, 2009). Mutual entailment occurs when relations between stimuli are bidirectional, such that responding in one direction entails responding in the other direction (e.g., if A is the opposite of B, then B is the opposite of A; if A contains B, then B is part of A). Combinatorial entailment occurs when two stimulus relations combine to allow a third relation to be derived (e.g., if A is more than B and B is more than C, then A is more than C; if A is the opposite of B and B is the opposite of C, then A is the same as C). Transformation of stimulus function occurs when the functions of one stimulus change, or transform, the functions of another stimulus based on the derived relation between the two stimuli (Dymond & Rehfeldt, 2000). As an example, a person who likes sweet desserts derives that lemon cookies (stimulus A) are sweeter than lemons (stimulus B), and then their friend tells them that lemon bars (stimulus C) taste even sweeter than lemon cookies. Later, when given a choice between lemon cookies and lemon bars for dessert, that person chooses lemon bars despite never having eaten them before. The evaluative functions of lemon bars have been transformed based on their relation to lemon cookies (i.e., lemon bars are sweeter than lemon cookies).
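To make these entailment properties concrete, the brief sketch below encodes the lemon dessert example as trained “sweeter than” relations and derives the combinatorially entailed relation via transitive closure. This is an illustrative formalization only, not a procedure drawn from the RFT literature, and all names in it are hypothetical.

```python
# Illustrative formalization of the dessert example (hypothetical names throughout):
# trained "sweeter than" relations, combinatorial entailment via transitive closure,
# and a transformed choice-making function for a never-tasted dessert.

TRAINED = {
    ("lemon cookies", "lemons"),      # lemon cookies are sweeter than lemons
    ("lemon bars", "lemon cookies"),  # lemon bars are sweeter than lemon cookies
}

def derive_sweeter_than(trained):
    """Return the transitive closure of the trained relations (combinatorial
    entailment); the reversed pairs ("less sweet than") are mutually entailed
    and could be read off by swapping each pair."""
    relations = set(trained)
    changed = True
    while changed:
        changed = False
        for a, b in list(relations):
            for c, d in list(relations):
                if b == c and (a, d) not in relations:
                    relations.add((a, d))  # if A > B and B > C, then A > C
                    changed = True
    return relations

def choose_dessert(options, derived):
    """Transformation of stimulus function: choose the option derived to be
    sweeter than every alternative, even if it was never directly experienced."""
    for option in options:
        if all((option, other) in derived for other in options if other != option):
            return option
    return None

derived = derive_sweeter_than(TRAINED)
print(("lemon bars", "lemons") in derived)                       # True (derived, never trained)
print(choose_dessert(["lemon cookies", "lemon bars"], derived))  # lemon bars
```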

Interventions Promoting Derived Responding

Previous reviews have evaluated the emergence of derived responding within and across specific relational frames. For example, Ming et al. (2014) reviewed studies that either (a) demonstrated the establishment of derived relational responding within various frames, or (b) used existing derived relational responding repertoires to teach educationally relevant skills to individuals with ASD. They concluded that programming should focus on using multiple exemplar training (MET) to establish relevant patterns of derived relational responding that are absent from a learner’s repertoire, and that if learners can demonstrate specific types of derived relational responding (e.g., equivalence, naming), then those skills can and should be used to make subsequent programming more efficient (Ming et al., 2014). They recommended future research focus on the development of a standardized tool to systematically evaluate derived relational responding abilities, highlighting some of the work that has been done with this aim in mind (i.e., the Training and Assessment of Relational Precursors and Abilities; TARPA; Moran et al., 2010, 2014). Although their review encompassed studies that taught multiple relational frames (i.e., coordination, comparison, opposition, and deictic frames), it was not a complete, systematic evaluation of the literature base, and it included neither a quantitative analysis nor a measure of overall study rigor.

Ming and Stewart (2017) reviewed research evaluating the relation of distinction, or difference, with both nonarbitrary and arbitrary stimuli. At the time of the review, no studies had examined how best to establish frames of distinction in individuals who were unable to demonstrate those relational responses. Based on prior work, Ming and Stewart (2017) recommended that derived relational responding to relations of distinction be conceptualized as a continuum of responding from nonarbitrary to arbitrary stimuli. They recommended that future research determine the necessary hierarchy of component skills, the optimal sequencing of those skills, and the most effective teaching procedures for establishing them. Although fairly comprehensive in examining research across a variety of domains, this review did not systematically search the existing literature and did not provide a quantitative analysis. While Ming and Stewart (2017) focused on relations of distinction, Montoya-Rodríguez et al. (2017a, b) published a bibliographical review of research published between 2001 and 2015 evaluating deictic relational responding in typically developing and atypically developing populations. Their review found that, while the number of empirical articles being published had increased, the studies were most often carried out with typically developing participants, and training protocols for deictic relational responding have had limited development and investigation (Montoya-Rodríguez et al., 2017a, b).

Raaymakers et al. (2019) completed a systematic review of derived verbal behavior research conducted with typical and atypical populations published between 2000 and 2017. The 52 included studies were required to evaluate derived verbal operants (i.e., mands, tacts, intraverbals, echoics, textual, dictation, and autoclitics) from a stimulus equivalence (Sidman, 1971) or RFT framework, excluding articles whose methodologies referenced naming theory (Horne & Lowe, 1996). Their results revealed marked variability in the reporting of participant prerequisite skills, existing verbal repertoires, and assessments utilized, which limited analysis of the repertoires potentially necessary for derived relational responding to occur. Results also indicated that different instructional procedures were most effective for different verbal operants (e.g., tact and intraverbal training were most effective for establishing derived intraverbal responses, whereas conditional discrimination training was most effective for establishing derived mand and tact responses). Mastery criteria also varied across studies, with some using a percentage correct and others using block or rolling block mastery criteria. A limitation of Raaymakers et al. (2019) is that, despite requiring included articles to evaluate derived verbal behavior from either a stimulus equivalence or RFT perspective, there was no consideration of the pattern of relational responding, or frame family, that the derived verbal operants were part of, and there was little discussion of the basic processes underlying the emergence of these responses.

Two recent citation analyses found increasing interest in the use of RFT technologies to promote derived relational responding in atypically developing populations (O’Connor et al., 2017; Belisle et al., 2020a, b), with the major limitation that the relational frame of coordination was most frequently targeted for investigation. This specific relational frame was explored further by Gibbs and Tullis (2021) in a systematic review of 47 articles published since 2013 evaluating the emergence of derived responding in accordance with coordination across the theoretical bases of naming, stimulus equivalence, and RFT in learners with IDD and ASD. Gibbs and Tullis sought to determine whether individuals with IDD and ASD can demonstrate the emergence of untrained coordination relations. Questions asked included whether specific learner characteristics influenced emergence, whether specific assessment tools could identify learners capable of demonstrating emergence, and whether particular instructional procedures facilitated emergence. The findings supported the conclusion that individuals with IDD and ASD are able to demonstrate derived coordination relations; however, the authors emphasized that, due to the low quality and rigor of many of the studies evaluated, the results should be interpreted with caution (Gibbs & Tullis, 2021).

Gibbs and Tullis (2021) found that while the expansiveness of a learner’s verbal repertoire, particularly the skill of bidirectional naming, influences the emergence of derived responding, further research is required to better determine other characteristics contributing to emergence. They concluded that more research is needed to identify tools with predictive validity for derived relational responding, and that instructional procedures beyond match-to-sample (MTS) are worthy of investigation (Gibbs & Tullis, 2021). Despite its comprehensive scope, a limitation of Gibbs and Tullis (2021) is that it focused solely on the frame of coordination. It is unknown whether investigations conducted across other relational frames could shed further light on learner characteristics, assessment tools, and instructional procedures that may facilitate derived relational responding in learners with IDD and ASD. As a result, a systematic review of the literature evaluating the emergence of derived relational responding beyond the frame of coordination in individuals with IDD and ASD is warranted.

Research Questions and Statement of Purpose

The purpose of this review is to extend the work of Gibbs and Tullis (2021) by systematically synthesizing and analyzing the results of research facilitating derived relational responding beyond the frame of coordination in learners with IDD and ASD, and to make recommendations for additional areas of investigation. This review aims to answer several questions. First, is there sufficient, high-quality evidence that individuals with IDD and ASD have demonstrated derived relational responding beyond coordination in the context of empirical research? Second, is there evidence of distinct learner characteristics or profiles that influence the development of this skill? Third, are there particular assessment instruments that are ideal for determining a learner’s relational skill repertoire? Last, are there specific instructional procedures best suited to developing relational responding in this population?

Method

Search Procedures

Systematic searches of peer-reviewed journal articles were conducted using the APA PsycInfo®, CINAHL Plus, ProQuest Central, PubMed, and Google Scholar electronic databases. Two Boolean searches of each database were conducted: the first using intellectual and developmental disability AND (a) relational frame theory, (b) derived relational responding, or (c) relational frame(s), and the second using autism spectrum disorder in combination with the previously listed search terms. When a search returned 1000 or more articles, the results were further narrowed by adding the terms (a) comparison, (b) opposition, (c) distinction, (d) hierarchical, and (e) deictic. In addition to database searches, ancestry searches of the reference lists of identified studies and reviews of published citation analyses were also completed. A total of 2870 articles were identified from electronic databases and evaluated for inclusion. After evaluation, 2768 articles were excluded and 67 were identified as duplicates returned by multiple databases, leaving 35 articles for full-text review. Of those articles, 23 were identified as eligible for inclusion, and an additional seven articles were found through ancestry searches and citation analysis reviews, resulting in a total of 30 articles containing 38 studies. See Fig. 1 for a visual representation of this search.
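As a rough illustration of how the search strings described above could be assembled, the sketch below combines the population and RFT terms and adds the narrowing terms only when a search returns 1000 or more records. The term lists come from the text, but the exact syntax and field codes submitted to each database are not reported here, so the combining logic is an assumption.

```python
# Sketch of the Boolean search construction described above; the combining
# syntax is an assumption, not the authors' exact database queries.

POPULATION_TERMS = ['"intellectual and developmental disability"',
                    '"autism spectrum disorder"']
RFT_TERMS = ['"relational frame theory"',
             '"derived relational responding"',
             '"relational frame*"']
NARROWING_TERMS = ["comparison", "opposition", "distinction", "hierarchical", "deictic"]

def build_queries(narrow=False):
    """Build one query per population term; narrow=True mirrors the extra terms
    added when a search returned 1000 or more articles."""
    queries = []
    for population in POPULATION_TERMS:
        query = f"{population} AND ({' OR '.join(RFT_TERMS)})"
        if narrow:
            query += f" AND ({' OR '.join(NARROWING_TERMS)})"
        queries.append(query)
    return queries

for q in build_queries(narrow=True):
    print(q)
```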

Fig. 1

Visual representation of systematic search procedures

Inclusion and Exclusion Criteria

Studies were required to be published in English in peer-reviewed journals, and have at least one participant with a formal diagnosis of ASD or IDD whose data could be disaggregated for analysis. Articles were further required to measure generalized or derived responding. Review articles (e.g., Ming & Stewart, 2017; Montoya-Rodríguez et al., 2017a, b), non-experimental articles (e.g., Kavanagh et al., 2020; McHugh & Reed, 2008), articles evaluating assessment tools without an additional evaluation of training and emergence (e.g., Pomorska et al., 2020), and articles evaluating relational responding within the frame of coordination were excluded from this review.

Data Classification

Article Analysis

All articles were analyzed using the following criteria: (a) participants (chronological age, diagnosis); (b) assessments conducted and associated results (e.g., PPVT-4, WISC-IV); (c) experimental design; (d) relational frame family or families evaluated (e.g., comparison, opposition, distinction, hierarchical, deictic, temporal, or multiple frames); (e) content taught (e.g., metaphor comprehension, working memory); (f) setting (e.g., school, home); (g) teaching procedures (e.g., multiple exemplar teaching, conditional discrimination training); (h) generalized responses measured; (i) outcomes (whether participants demonstrated derived relational responding and whether responding was variable); and (j) reliability and fidelity (reporting of interobserver agreement and procedural fidelity). Outcome criteria similar to those used by Gibbs and Tullis (2021) were utilized in the current review, with outcomes classified as positive, negative, or variable based on the demonstration of derived relational responding. Positive outcomes indicated that all participants demonstrated derived responding when tested, and negative outcomes indicated that no participants demonstrated evidence of derived responding. Variable outcomes indicated that (a) some participants demonstrated derived responding while others did not, (b) participants demonstrated some, but not all, derived responses, or (c) some participants required additional intervention to demonstrate derived responding. The results of the article analysis are presented in Table 1.
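A minimal sketch of the outcome classification rule described above is given below; the data structure and field names are hypothetical and are shown only to make the positive/negative/variable decision explicit.

```python
# Hypothetical coding sketch of the positive / negative / variable outcome rule.

from dataclasses import dataclass

@dataclass
class ParticipantResult:
    derived_all_relations: bool       # demonstrated every tested derived relation
    derived_any_relation: bool        # demonstrated at least one derived relation
    needed_extra_intervention: bool   # required additional or remedial intervention

def classify_outcome(results):
    """Positive if every participant derived all tested relations without extra
    intervention, negative if no participant derived any relation, variable otherwise."""
    if all(r.derived_all_relations and not r.needed_extra_intervention for r in results):
        return "positive"
    if not any(r.derived_any_relation for r in results):
        return "negative"
    return "variable"

study = [
    ParticipantResult(True, True, False),
    ParticipantResult(False, True, True),  # derived some relations only after remedial training
]
print(classify_outcome(study))  # variable
```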

Table 1 Summary of articles evaluating derived relational responding beyond coordination in individuals with intellectual and developmental disabilities

Single Case Analysis and Review Framework v2.0

The Single-Case Analysis and Review Framework (SCARF; Ledford et al., 2020), a tool for evaluating the quality, rigor, and outcomes of single-case design studies, was completed for all studies that included an appropriate graphic display of data (i.e., line or bar graphs showing a minimum of two primary comparison conditions, such as baseline and intervention or pre- and posttest data). The SCARF quantifies rigor, quality and breadth of measurement, and outcomes across 10 elements. These data yield a scatterplot representation of an article’s overall quality and rigor, the extent to which generalization outcomes are internally valid, and the extent to which maintenance outcomes are measured separately in time from intervention (Ledford et al., 2020). Within each scatterplot, a majority of data points falling in the upper right quadrant indicates the best outcomes for that measure. Elements associated with the quality of a single-case design evaluated by the SCARF include (a) participant description (i.e., demographics, formal assessment results, general learner information, inclusion and exclusion criteria), (b) dependent variable descriptions (i.e., operational definitions, examples and non-examples, measurement system and utilization), (c) condition descriptions (i.e., adequate description of procedures, dosage, setting, and implementors), (d) social validity (i.e., importance of a behavior to key stakeholders and society), (e) ecological validity (i.e., the relevance and reliable implementation of an intervention outside of controlled settings), (f) generalization measurement and measures (i.e., whether stimulus or response generalization occurred and how it was measured), and (g) maintenance measurement (i.e., whether evidence of continued behavior change occurred and how it was measured). Elements associated with the rigor of a single-case design evaluated by the SCARF include (a) dependent variable reliability (i.e., interobserver agreement, or the extent to which independent observers measure the same behavior), (b) implementation fidelity (i.e., the degree to which experimental procedures are implemented as intended), and (c) sufficiency of data (i.e., whether the data allow for analysis of level, trend, and variability within and across conditions).

Of the 30 articles within this review, 21 could be assessed with the SCARF. Of the remaining nine, six were excluded from analysis because they did not use a formal single-case design, and three were excluded because they lacked a graphical display or because the graphical display did not allow for disaggregation of the data (e.g., Molina-Cobos & Amador-Castro, 2010). Among the six articles that did not use a formal single-case design, Cassidy et al. (2011) utilized a pre-test/post-test quasi-experimental design with no control group, while Dunne et al. (2014), Gorham et al. (2009), Kent et al. (2017), and Murphy and Barnes-Holmes (2009, 2010) utilized a series of pre-post tests without a clear experimental design. We are unaware of any tools suited to assessing the quality of such designs. The articles excluded from the SCARF analysis, as well as the reasons for their exclusion, are noted in Table 1. As in Gibbs and Tullis (2021), each directly trained relation was evaluated as its own comparison and received its own entry (N = 68), and the emergence of derived responding was categorized in the SCARF as an instance of response generalization (i.e., the measurement of different behaviors than those taught in the study) or of both response and stimulus generalization (i.e., the measurement of a target behavior performed with materials separate from those used in teaching).

Interrater Reliability

Interrater reliability for the descriptive article coding and the SCARF was assessed by the fourth author for 31% of the articles selected for inclusion. Reliability was calculated for each article by dividing the total number of agreements by the total number of agreements and disagreements and multiplying the quotient by 100 to yield percent agreement. Mean reliability was 93% for the descriptive coding (range, 81%–100%) and 95% for the SCARF (range, 80%–100%). Interrater reliability could not be assessed for the systematic database searches, which is a limitation that should be addressed in future reviews.
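The percent-agreement calculation described above amounts to the simple computation sketched below; the coded values shown are hypothetical.

```python
# Point-by-point percent agreement: agreements / (agreements + disagreements) * 100.
# The example codes are hypothetical.

def percent_agreement(primary_codes, secondary_codes):
    if len(primary_codes) != len(secondary_codes):
        raise ValueError("Both raters must code the same number of items")
    agreements = sum(a == b for a, b in zip(primary_codes, secondary_codes))
    return 100 * agreements / len(primary_codes)

primary = ["deictic", "MET", "positive", "school", "multiple baseline"]
secondary = ["deictic", "MET", "variable", "school", "multiple baseline"]
print(f"{percent_agreement(primary, secondary):.0f}%")  # 80%
```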

Results

Article Variables

Participants

The chronological age ranges, genders, and diagnoses of participants in each study are included in Table 1 and summarized in Table 2. A total of 122 participants (accounting for multiple experiments with the same individuals) were included in the 30 articles evaluated and completed all study procedures. Gender was reported for 108 participants, with the majority of participants being male (74.6%, N = 91) as compared to female (13.9%, N = 17). Data on the race and ethnicity of study participants were not reported in any of the 30 articles. The majority of participants had a diagnosis of ASD (67.2%, N = 82). Other reported diagnoses included Down syndrome and developmental delay (12.3%, N = 15), Down syndrome (5.7%, N = 7), educational and/or behavioral difficulties (6.6%, N = 8), language difficulties (3.3%, N = 4), developmental delay (2.5%, N = 3), pervasive developmental disorder (1.6%, N = 2), and comprehension difficulties (1.6%, N = 2). Participant ages in the included studies ranged from 3 to 35 years.

Table 2 Demographic characteristics of participants across studies

Nearly three quarters (71.1%, N = 27) of the 38 included studies included three or fewer participants, while 21.1% (N = 8) included between four and six participants, 2.6% (N = 1) included between seven and 10 participants, and 5.3% (N = 2) included more than 10 participants.

Assessments

Assessment information for each study is provided in Table 1, and a more detailed breakdown of the frequency with which each assessment was used is provided in Table 3. A majority of the 38 studies included in this review (73.7%, N = 28) provided participant assessment information, such as scores or learner characteristics based on assessment results. Twelve studies (31.6%) reported using exclusively norm-referenced measures and provided participant information such as intelligence quotient scores, percentile ranks, and age equivalencies. Nine studies (23.7%) reported using exclusively criterion-referenced measures and provided participant information such as criteria achieved and existing language repertoires. Seven studies (18.4%) reported using both norm- and criterion-referenced measures.

Table 3 Assessments used

Experimental Design

The experimental design used most frequently in the included 38 studies was a concurrent or nonconcurrent multiple baseline or multiple probe design across participants (42.1%, N = 16). The second most frequently used arrangement was not a formal experimental design, but rather phases of training and testing relations (23.1%, N = 9). One study (2.6%) used a multiple baseline design across skills, one study (2.6%) used a pre- and post-intervention probe design across participants, and one study (2.6%) used a pre-test/post-test quasi-experimental design with no control group. An additional five studies (13.2%) used a multiple baseline design with an embedded pretest and posttest or multiple probe to measure the effect of the intervention on the emergence of derived responding. Four studies (10.5%) employed a variation of a reversal design (e.g., A-B-A, A-B-C-A, or A-B-A-C-A-D-A), and of the remaining two studies, one (2.6%) utilized an A-B design with a pretest and posttest, and one (2.6%) utilized a multiple probe with an embedded A-B-C design.

Relational Frame and Content Taught

The relational frames and content taught within the included studies can be found in Table 4. Across these studies, the deictic relational frame was most commonly targeted for teaching (N = 12, 31.6%). The second most commonly targeted relational frame was comparison (N = 10, 26.3%). Three studies targeted opposition relations (7.9%) to teach content including understanding sarcasm. An additional two studies (5.3%) focused on distinction relations, three studies (7.9%) targeted hierarchical relations, and one study (2.6%) targeted temporal relations. Three studies (7.9%) targeted relational framing itself, teaching multiple relational frames in sequence using stimuli such as nonsense syllables and nonarbitrary and arbitrary pictures. The remaining four studies (10.5%) targeted multiple relational frames while teaching specific content.

Table 4 Relational frame and content taught

Setting

The majority of studies (86.8%; N = 33) were conducted in participants’ natural environments (e.g., home, community, day program, school). Of the remaining five studies, one (2.6%) was conducted in a university-based clinic setting (Barron et al., 2019), one (2.6%) used a combination of home and university-based settings for different participants (Jackson et al., 2014), two (5.3%) utilized a clinic setting (Grannan & Rehfeldt, 2012), and one (2.6%) did not specify its setting (Cassidy et al., 2011).

Teaching Procedures

Half (50.0%, N = 19) of the included 38 studies used multiple exemplar training (MET) to teach content across relational frames and evaluate the emergence of derived responding. Other teaching procedures included: (a) single exemplar instruction (SEI) and multiple exemplar instruction (MEI) (10.5%); (b) relational training consisting of reinforcement, prompting, and error correction as needed (7.9%); (c) conditional discrimination training (CDT) either individually or in combination with match to sample (MTS) procedures (7.9%); (d) MET within a precision teaching (PT) instructional paradigm (2.6%); (e) observation either individually or in combination with discrimination training (5.3%); (f) tact training in combination with MTS procedures (2.6%); (g) intraverbal training alone or in combination with reverse intraverbal training (5.3%); (h) an adaptation of the TARPA (Moran et al., 2010, 2014) (2.6%); (i) a false belief training protocol (2.6%); and (j) a training package of providing rules, modeling, practice, and feedback followed by in vivo training (2.6%). See Fig. 2 for a visual representation of the frequencies of different teaching procedures used per relational frame.

Fig. 2

Visual representation of the frequencies of interventions used to teach each relational frame or frames. CDT = conditional discrimination training, DT = discrimination training, FBTP = false belief training protocol, IT = intraverbal training, MEI = multiple exemplar instruction, MET = multiple exemplar training, MTS = match to sample, O = observation, RT = relational training, RIT = reverse intraverbal training, SEI = single exemplar instruction, TARPA = training and assessment of relational precursors and abilities, TT = tact training

Derived Responses

The derived responses evaluated varied across investigations. The vast majority of the included studies (89.5%, N = 34) investigated the emergence of (a) mutually entailed relations, (b) combinatorially entailed relations, and/or (c) transformation of stimulus function following the direct training of specific relations. Additional outcome variables were assessed in several studies, including the emergence of novel drawing and writing behaviors in the presence of deictic cues (Barron et al., 2019), as well as scores on various assessments administered as pretests and posttests to evaluate the effect the acquisition of derived relational responding has on broad outcomes such as working memory (Baltruschat et al., 2012), intelligence quotient (Cassidy et al., 2011), theory of mind (ToM; Jackson et al., 2014; Lovett & Rehfeldt, 2014), and reading comprehension (Newsome et al., 2014).

Study Outcomes

Of the 38 studies evaluated, 42.1% reported positive outcomes (N = 16) and 55.3% reported variable outcomes (N = 21). The most common reason for a variable classification was that some participants required additional or more extensive intervention to demonstrate derived responding (66.7%, N = 14), while the remaining seven studies (33.3%) were classified as variable because some participants demonstrated the emergence of some, but not all, derived relations. Only one study, Jackson et al. (2014), reported a negative outcome, with all participants requiring explicit training to demonstrate mastery of complex deictic relations and no improvements seen in posttest ToM scores.

Reliability and Fidelity

The majority of the 38 studies in this review (84.2%, N = 32) measured and reported interobserver agreement during a proportion of all teaching and testing sessions, with agreement ranging from 87% to 100%. Of the six studies that did not report interobserver agreement (15.8%), four reported using electronic devices (e.g., computer, laptop, tablet) and automated data collection procedures (Cassidy et al., 2011; Dunne et al., 2014, experiments two, three, and four). Procedural fidelity or treatment integrity data were reported in only 28.9% of the included studies (N = 11), with fidelity ranging from 87.5% to 100%.

SCARF Data Analysis

Primary Data Measurement

Figure 3 depicts the results of the primary data measurement for the 68 relations across the 21 articles able to be analyzed. As stated previously, the data displayed in this graph depict the outcomes of the direct training done in each study, not the emergence of untrained relations, which is assessed in the generalization measurement graph. The scatterplot is designed such that the x-axis designates overall study quality and rigor, with quality increasing as data points move toward the right, and the y-axis designates primary outcomes, with improved effects as data points move up the axis. Low rigor was classified as a score of one or below, moderate rigor as a score between one and three, and high rigor as a score of three or above (Ledford et al., 2020). Nine comparisons (13.6%) had lower quality evidence of minimal or weak effects, while one comparison (1.5%) had lower quality evidence of positive effects. Fifteen comparisons (22.1%) had moderate quality evidence of minimal or weak effects, while 39 comparisons (59.1%) had moderate quality evidence of positive effects. Four comparisons (6.1%) had higher quality evidence of positive effects. Although the majority of the evaluated comparisons showed positive effects (64.7%, N = 44), 94.1% (N = 64) had low (15.2%, N = 10) to moderate (77.9%, N = 54) quality and rigor, and thus their results should be accepted with reservations.
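For clarity, the rigor banding just described can be expressed as the small sketch below; the 0–4 scales and the outcome cutoff used for the “positive” label are assumptions made for illustration and are not part of the SCARF scoring itself.

```python
# Sketch of the quality/rigor banding described above (assumed 0-4 SCARF scales;
# the outcome cutoff of 2 for "positive" effects is an illustrative assumption).

def rigor_band(quality_rigor_score):
    if quality_rigor_score <= 1:
        return "lower"
    if quality_rigor_score < 3:
        return "moderate"
    return "higher"

def describe_comparison(quality_rigor_score, outcome_score):
    effect = "positive" if outcome_score >= 2 else "minimal or weak"
    return f"{rigor_band(quality_rigor_score)} quality evidence of {effect} effects"

print(describe_comparison(2.4, 3.1))  # moderate quality evidence of positive effects
print(describe_comparison(0.8, 1.2))  # lower quality evidence of minimal or weak effects
```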

Fig. 3

Results for primary data measurement. Study quality and rigor increase as data points move to the right along the x-axis, while primary outcomes are increasingly positive as data points move up the y-axis. While many of the studies included for review had positive outcomes, as indicated by the clustering of data points at the top of the graph, many of the studies had weak to moderate overall quality and rigor, as indicated by the clustering of data points on the left-hand side of the graph

Generalization Measurement

Figure 4 depicts the results of the generalization measurement for the 68 relations analyzed. The scatterplot indicates that most comparisons showed positive (72.1%, N = 49) or moderate (22.1%, N = 15) generalization effects, with only four (5.9%) showing negative effects. Generalization was measured with pre- and posttests (45.6%, N = 31) or posttests only (25.0%, N = 17) in over half of the comparisons, with fewer measuring generalization intermittently throughout intervention (23.5%, N = 16) or experimentally (i.e., at least three times per condition; 5.9%, N = 4), reducing confidence in the internal validity of the reported effects.

Fig. 4

Results for generalization measurement. As data points move to the right along the x-axis, internal validity increases, while generalized outcomes are increasingly positive as data points move up the y-axis. While many of the studies included for review had positive generalized outcomes, as indicated by the clustering of data points at the top of the graph, most data points being plotted on the left-hand side of the graph indicates that internal validity was relatively weak for many of the included studies

Maintenance Measurement

Figure 5 depicts the results of the maintenance measurement for the 68 relations analyzed. Over three quarters of the comparisons did not include measurement of maintenance outcomes (79.4%, N = 54). Of those that did (20.6%, N = 14), most measured maintenance at or beyond one month after intervention (71.4%, N = 10), while two (14.3%) evaluated maintenance between one and three weeks after intervention (Gould et al., 2011; Zagrabska-Swiatkowska et al., 2020) and two (14.3%) measured maintenance immediately following intervention (Baltruschat et al., 2012; Persicke et al., 2012). Although each comparison that measured maintenance showed moderate to positive outcomes, both immediately following intervention and more than one month after intervention, because so many did not measure maintenance at all, limited conclusions can be drawn about the retention of derived responding in the content areas taught.

Fig. 5

Results for maintenance measurement. As data points move to the right along the x-axis, the time between intervention and maintenance measurement increases, while maintained outcomes are increasingly positive as data points move up the y-axis. Maintained outcomes were generally positive, which is evidenced by the clustering of data points at the top of the graph. However, there are fewer data points represented within the graph as few of the included studies measured participant maintenance

Discussion

The current review sought to synthesize and analyze the results of research facilitating the emergence of derived relational responding beyond the frame of coordination. Systematic searches identified 30 articles comprising 38 studies that met criteria for inclusion, supporting the findings of recent citation analyses by O’Connor et al. (2017) and Belisle et al. (2020a, b) that investigations into the use of technologies derived from the theoretical foundation of RFT with learners with IDD and ASD have increased in frequency over the last 15 years. The present review focused on four key questions. The first was whether sufficient, high-quality evidence suggests that learners with IDD and ASD can demonstrate derived relational responding beyond coordination. The second was whether evidence supports the presence of distinct learner characteristics or profiles that influence the development of derived relational responding beyond coordination. The third was whether there are particular assessment instruments ideal for determining a learner’s relational skill repertoire. The final question was whether there are specific instructional procedures best suited to developing relational responding in learners with IDD and ASD.

Derived Relational Responding in Individuals with IDD and ASD

A total of 38 studies containing 122 participants were identified as having investigated the emergence of derived relational responding in accordance with a variety of relational frames beyond coordination. The analysis of the results of these studies found that over half of the participants (54.1%, N = 66) demonstrated the emergence of derived relational responding in a variety of relational frames using several different teaching procedures (e.g., MET). An additional 38 participants (31.1%) demonstrated derived relational responding following additional intervention (e.g., additional prompting, remedial training, reverse intraverbal training, etc.). These findings support the assertion that learners with IDD and ASD can acquire the skills to demonstrate more complex derived relational responding beyond coordination relations. However, due to the quality and rigor of many of the included studies, these results must be interpreted with caution.

Learner Characteristics

The results of the current review support and extend the conclusions of previous work evaluating coordination relations that a learner’s verbal repertoire influences their ability to demonstrate derived relational responding (Gibbs & Tullis, 2021; O’Connor et al., 2009; Lee et al., 2015). Two studies asked specific questions in an effort to determine how an individual’s verbal repertoire influenced derived relational responding beyond coordination. Dunne et al. (2014) evaluated the effects of a program of testing and training on the emergence of derived relational responding in the frames of opposition (study two), distinction (study three) and comparison (study four) with learners with ASD. Across all three studies, the learners with higher VB-MAPP scores required considerably less training than learners with lower scores. In another set of studies, Kent et al. (2017) reported similar outcomes. Seven of 11 learners with ASD whose PPVT and K-BIT scores indicated more significant receptive and expressive language limitations were unable to progress through the entire test protocol of nonarbitrary and arbitrary coordination, distinction, comparison, and opposition relations even after extensive training trials (Kent et al., 2017, study two). In contrast, four learners with ASD who demonstrated receptive and expressive language delays, but whose PPVT and K-BIT scores indicated less significant limitations, were able to complete the entire test protocol following no more than 44 training trials (Kent et al., 2017, study three). These results bolster the hypothesis that a more expansive verbal repertoire influences the training requirements and test performances for derived relational responding across relational frames, but additional research is still required for more conclusive answers.

Furthermore, it has previously been suggested that bidirectional naming (BiN; Miguel, 2016) may be an essential repertoire for learners to demonstrate emergent equivalence relations (Howarth et al., 2015; Kobari-Wright & Miguel, 2014; Morgan et al., 2020), but the necessity of a BiN repertoire for the emergence of derived relations beyond coordination is less clear. Lovett and Rehfeldt (2014) taught perspective-taking skills to adolescents with ASD and assessed generalization of these skills to a more natural language situation. For two of the three participants whose performance on reversed and double reversed deictic relations during generalization probes did not meet mastery criteria, the tact for each emotion experienced in the natural language situation was stated vocally following the video presentation. This resulted in an increase in response accuracy to mastery criterion levels for reversed relations, and improved response accuracy for double reversed relations, though still below mastery criterion. Lovett and Rehfeldt’s (2014) results suggest generalization of perspective-taking skills may require not only a deictic relational responding repertoire, but also a tact repertoire. While certainly interesting, these findings are by no means conclusive, and there is currently little research investigating the relationship between BiN and different relational frames. Additional inquiry is required to better clarify the influence BiN has on the emergence of derived responding beyond coordination.

Another potential requirement for the acquisition of derived relational responding is the ability to first demonstrate nonarbitrary relational responding. As discussed previously, nonarbitrary relational responding is based on the physical properties of the target stimuli (Hayes et al., 2001), and it has been suggested that a repertoire of nonarbitrary relational responding may be a necessary prerequisite for demonstrating, or at least facilitating, derived relational responding (Barnes-Holmes et al., 2004; Berens & Hayes, 2007; Dunne et al., 2014; Kent et al., 2017). This concept was supported by Gale and Stewart (2020), whose three participants with ASD all had nonarbitrary relational responding in their repertoires and all demonstrated derived relational responding in accordance with comparison. However, this suggestion has recently been brought into question. Although not included in the current review because it evaluated only an assessment protocol, Pomorska et al. (2020) reported findings suggesting the relationship between nonarbitrary and derived relational responding may be more dynamic than linear, with both skills potentially emerging in tandem. More research is necessary to determine how nonarbitrary relational responding influences derived relational responding.

Though a well-developed verbal repertoire inclusive of BiN and the ability to engage in nonarbitrary relational responding may be influential in facilitating derived relational responding across relational frames, the possibility that distinct precursor skills are required for the emergence of derived responding in each individual frame cannot be ruled out. For example, Molina-Cobos and Amador-Castro (2010) reported that participants with Down syndrome and developmental delay were unable to successfully report on the preferences of another until discrimination training was used to teach participants to discriminate between themselves and the character used in the study. As no assessment results were reported, the participants’ overall discrimination skills are unknown, which precludes suggestions about the strength of discrimination repertoire required for deictic responding. While the previously described studies have contributed to our understanding of the skill repertoires that may be necessary to support derived relational responding in this population of learners, few studies have specifically sought to answer this fundamental question. Future research should continue to investigate whether broad skill repertoires and/or specific precursor skills are required to facilitate derived relational responding within and across relational frames and, if so, which ones.

Assessment Tools

Similar to the findings of Gibbs and Tullis (2021), many of the studies in the current review reported the results of a variety of norm-referenced (e.g., WISC-IV) and criterion-referenced (e.g., PIRK) assessments in an effort to describe participants’ cognitive and language skills. However, the use of specific assessment tools to determine learners’ relational skill repertoires occurred less frequently. The preassessments associated with the Promoting the Emergence of Advanced Knowledge: Equivalence (PEAK-E; Dixon, 2015) and Transformation (PEAK-T; Dixon, 2016) modules were used in four studies. Positive results were reported when participants demonstrated mutual and combinatorial entailment on the PEAK-E (Belisle et al., 2020a, b) and PEAK-T (Belisle et al., 2016) preassessments. Barron et al. (2019) also reported positive outcomes for participants able to demonstrate nonarbitrary deictic relations on the PEAK-T receptive and expressive preassessments.

Two other assessments were used in an attempt to ascertain learners’ relational abilities. Cassidy et al. (2011) devised and administered the Relational Abilities Index (RAI), a computer-based assessment that measures proficiency in coordination, opposition, and comparison relational responding. While the RAI has been cited as an acceptable surrogate measure of IQ in the RFT literature (Colbert et al., 2017), in Cassidy et al. (2011) its purpose was to evaluate participants’ relational skill repertoires before and after intervention to determine whether MET was responsible for changes in relational abilities. Since the development and extension (Cassidy et al., 2016) of the RAI, further work has been done to create a more expansive assessment of relational responding. The Relational Abilities Index+ (RAI+; Colbert et al., 2020) assesses relational performance across distinction, temporality, and analogy in addition to coordination, opposition, and comparison. However, as no studies have yet been published on the use of the RAI+ with learners with IDD and ASD, more research is needed to determine its utility in assessing their relational repertoires.

Gale and Stewart (2020) utilized the TARPA (Moran et al., 2010, 2014), a hierarchical testing and training protocol based on simple and conditional discrimination skills in which the first stage tests an individual’s ability to learn simple discriminations, the second stage tests an individual’s ability to engage in nonarbitrary relational responding, and the final stage tests an individual’s ability to engage in derived relational responding. Gale and Stewart (2020) first used the TARPA as a testing method to assess participants’ abilities to learn the repertoires necessary to derive comparative relations, then as a training procedure to establish comparative relational responding, which was successful with all three participants with ASD. Given the TARPA’s ability to assess and train prerequisite and relational responding skills, further research is warranted on its utility in clinical practice with learners with ASD as well as learners with IDD.

Teaching Procedures

Multiple exemplar training (MET) remains the most frequently implemented teaching procedure and was utilized across relational frames and content; however, several studies utilized multiple exemplar instruction (MEI). While sometimes used interchangeably, MET and MEI are considered two different procedures, with MET often used in RFT approaches to derived responding and MEI utilized in investigations grounded in BiN (LaFrance & Tarbox, 2020). Briefly, MET presents different exemplars while teaching a target response topography that serves the same function, whereas MEI rotates instructions targeting different responses to produce interdependence between speaker and listener repertoires (see LaFrance & Tarbox, 2020 for a more detailed discussion of each methodology). In the current review, both MET and MEI resulted in the acquisition of derived relational responding in various content areas, but participants sometimes required additional instruction under both procedures. For example, one participant in Belisle et al.’s (2016) investigation of teaching single reversal I-you deictic frames to learners with ASD via MET required additional mixed deictic frame training after failing to demonstrate derived relations despite mastery of the initial training content. Similarly, in Greer and Yuan’s (2008) evaluation of MEI to teach past tense verbs in an autoclitic function to learners with developmental delays, some participants demonstrated derived responding following MEI with one set of stimuli, while others needed intervention with an additional set. Future research should seek to determine whether there is an optimal number of exemplars, or level of explicit instruction, required to facilitate derived responding with either of these two procedures.

Other teaching procedures used less frequently in the current review, but with positive results on derived relational responding, include a tabletop version of the TARPA (Gale & Stewart, 2020), intraverbal training (Lee et al., 2019), match-to-sample and conditional discrimination training (Murphy & Barnes-Holmes, 2010), and a combined intervention of observation and discrimination training (Molina-Cobos & Amador-Castro, 2010, experiment two). The use of the term ‘observation’ rather than ‘observational learning’ to describe this final intervention procedure is deliberate, as observational learning is characterized by observation of a modeled response and a subsequent consequence (Masia & Chase, 1997). In both experiments, Molina-Cobos and Amador-Castro (2010) had participants observe a photo of a named character engaging in an activity he was reported to prefer, but no subsequent consequence was observed. Thus, the procedure does not fully meet the common definition of observational learning. While observation alone in experiment one was not sufficient for participants to learn the preferences of another, as previously discussed, all participants demonstrated the target response after observation was paired with discrimination training so that participants could effectively discriminate between themselves and the character (Molina-Cobos & Amador-Castro, 2010). Given the relative simplicity of the interventions, additional research on the effects of combining observation and discrimination training on the emergence of derived relational responding is certainly warranted.

Design Rigor

The rigor of the studies included in the current review was measured using the SCARF (Ledford et al., 2020), and as reported in the results, while 67% of the studies reported positive outcomes for participants, only four of those with positive outcomes had high quality and rigor. Of the three SCARF elements associated with design rigor, sufficiency of data and implementation fidelity were the two biggest contributors to the classification of many studies as low rigor, which is similar to the findings of Gibbs and Tullis (2021). Sufficiency of data refers to the presence of an adequate number of demonstrations of effect and an adequate number of data points in each condition to allow for analysis of level, trend, and variability within and across conditions. The standard for sufficiency of data set by the SCARF requires at least three data points per condition and at least three potential demonstrations of effect (Ledford et al., 2020), a standard that only 16.7% of the included studies met. The majority of the included studies failed to meet this standard because they had fewer than three data points per condition, which occurred most frequently in baseline (e.g., Jackson et al., 2014); some studies also had fewer than three potential demonstrations of effect (Belisle et al., 2020a, b, experiments one and two; Grannan & Rehfeldt, 2012). A limitation of the current paper is that, because some studies were not amenable to evaluation of quality using the SCARF, it is unknown whether those articles would have strengthened or weakened the rigor of the overall body of work in this area, or substantially influenced the findings as a whole.

While certain quality indicator checklists put forth less rigorous standards for data sufficiency (e.g., Horner et al., 2005; Tate et al., 2008), in addition to the SCARF, quality standards set forth by the Council for Exceptional Children (CEC, 2014), the What Works Clearinghouse (WWC, 2020), and others (Reichow et al., 2008; Smith, 2012) require a minimum of three data points in each experimental condition. Valid arguments do exist for conditions in which fewer than three data points are appropriate to determine a pattern of responding (e.g., severe problem behavior, or when an extended baseline would withhold necessary education from a learner). However, this did not appear to be the case in the included studies. Many of the studies that did not meet criteria utilized multiple probe designs or did not use a discernable experimental design at all, but rather a series of testing and training phases. Two potential arguments could be made specific to the derived relational responding literature. First, because relational pretesting can require an extensive number of trials (e.g., the 36-trial deictic framing protocol used for pre- and posttest probes by Lovett & Rehfeldt, 2014), three data points in baseline might require too much time. Second, when arbitrary stimuli are used within a stimulus class and the established stimulus relations are determined by the experimenter, it is incredibly unlikely that a learner has a history of reinforcement for responding to such relations. However, this point must be balanced with the need to establish via repeated observations that a learner truly lacks a derived relational responding repertoire in order to improve the believability of the intervention methods, especially given the evidence that learners with IDD and ASD may present with splinter skills and engage in responding beyond what their developmental level or IQ score suggests they are capable of (Mayes et al., 2012).

In addition to sufficient data to support the presence of a functional relation, a high level of implementation fidelity, or the degree to which the independent variable is implemented as intended, is considered necessary by the SCARF as well as several other quality assessment tools (CEC, 2014; Horner et al., 2005; Tate et al., 2016) for confidence that changes in participant behavior are the result of the intervention being applied. Only 28.9% of the included studies reported procedural fidelity data. This is similar to the results of Gibbs and Tullis (2021), though the percentage is slightly improved from their reported 22.6%. While very few studies measured maintenance, many of those that did reported maintained effects more than one month after intervention. However, given that 79.4% of the included comparisons did not measure maintenance at all, whether derived relational responding is a skill that maintains over time is an area requiring additional investigation.

Areas of Future Research

As previously reported, data on the race and ethnicity of research participants were not included in any of the 30 articles within this review, an unsurprising finding given the results of a systematic review by Steinbrenner et al. (2022), which found that only 25% of articles published since 1990 examining evidence-based practices in autism intervention reported on participant race and ethnicity. The absence of these data raises the question of whether interventions designed to facilitate derived relational responding have similar efficacy across racial and ethnic groups. Similarly, of the 30 articles, Lee et al. (2019) was the only one to clearly state participants’ native language, despite several other studies being conducted internationally and in non-English speaking countries. In order to function as communication between speaker and listener, language requires context, which includes cultural background knowledge, among other factors (Garten et al., 2019). This may be especially true when teaching complex relational responding such as comprehending metaphors or idioms, as the verbally mediated meaning in one language may not transfer to another (e.g., in German, “Ich habe einen Kater” means “I have a hangover,” not the literal translation, “I have a tomcat”). Future studies should report this information explicitly to better determine whether the race, ethnicity, and native language of participants influence intervention efficacy, and to aid in the development of more nuanced research questions regarding complex language generativity.

Although only one study examining the effect of improved derived relational responding on IQ scores in learners with IDD was included in the current review (Cassidy et al., 2011), there has been increased interest in this area of research. Studies have reported significant IQ increases following interventions to derive coordination, opposition, and comparison relations in typically developing learners (Cassidy et al., 2016; Colbert et al., 2018). An initial study by Dixon et al. (2021) indicates that similar increases in IQ score might be achieved with learners with IDD and/or ASD, but questions remain as to whether an increase in IQ score correlates with an increase in adaptive functioning following this type of intervention. Although general intelligence is by no means unimportant, adaptive functioning (e.g., daily living skills, interpersonal skills) may arguably be of more concern for learners with IDD and ASD, as it encompasses skills needed for everyday functioning (McQuaid et al., 2021). Previous research has demonstrated that among learners with IDD and ASD, a lower IQ score is correlated with greater deficits in adaptive skills (Kanne et al., 2011; Matson & Shoemaker, 2009). However, research has also indicated that adaptive functioning is much lower than would be expected based on IQ score in learners with ASD without a co-occurring intellectual disability (Bradshaw et al., 2019; Klin et al., 2007). Given that interventions to facilitate derived relational responding have the potential advantage of instructional efficiency, it is key to determine whether the positive effects of these interventions extend to additional populations and beyond IQ score.

Similar to Gibbs and Tullis (2021), a limitation of the literature contained in the current review is the lack of social validity measures, a limitation that can and should be remedied with additional research. While 89.5% of the included studies took place in participants’ natural environments (e.g., school, home), Lee et al. (2019) was the only study to directly measure social validity. Participants’ parents completed a questionnaire, and investigators interviewed the participants and their teachers (who were not aware of the research) to determine whether the intervention in metaphor comprehension facilitated improvements in communication, social, or reading comprehension skills in the classroom environment. As derived relational responding may be a potential avenue to more generative language, it behooves researchers to include generative measures outside of the experimental context with others who are blind to the goal of an investigation. Further, derived relational responding offers a more efficient treatment option, with intervention on only a few relations resulting in the derivation of other relations without explicit teaching. However, most social validity measures focus on stakeholder satisfaction, and few consider issues such as the time and cost-effectiveness associated with an intervention (Callahan et al., 2017), two factors that may influence more widespread adoption of these technologies. Future research should include not only objective measures of social validity to determine stakeholder satisfaction with interventions to facilitate derived relational responding, but also measures of the instructional and resource efficiency of these interventions.

Practical Implications

The finding of the current review that learners with IDD and ASD demonstrated derived relational responding beyond the frame of coordination has pronounced implications for skill acquisition and education, particularly language generativity, in this population. The evidence suggests that a more expansive verbal repertoire, inclusive of BiN, and the ability to engage in nonarbitrary relational responding may influence this population’s ability to demonstrate derived relational responding. This may assist practitioners in identifying learners who have the necessary prerequisite skills to begin this type of intervention. Additionally, the identification of assessments such as the PEAK-E and PEAK-T preassessments, the RAI, and the TARPA to determine current learner repertoires and, in some cases, guide intervention (i.e., the PEAK modules and TARPA) is important. Finally, while MET was the most frequently utilized teaching procedure in the included studies, previous research has found that technology-aided instruction such as the Teaching Implicit Relational Assessment Procedure (T-IRAP; Kilroe et al., 2014) may be more instructionally efficient than traditional tabletop presentations for learners with ASD. Technology-based instructional arrangements certainly have appeal, especially when viewed in light of the COVID-19 pandemic, which resulted in the loss of face-to-face instruction for many students with and without disabilities around the world. However, regardless of whether instruction takes place in person, finding the most efficient way to teach and to reach the greatest possible number of learners certainly merits continued attention and research.