The more closely actual judgments are studied, the more evident does it become that they do not proceed according to the clean logical schemes which we are prone to devise for them in advance. Robert S. Woodworth, 1899, p. 818

Discriminating between stimuli that differ along one physical dimension (e.g., sound pressure, duration) constitutes one of the most fundamental cognitive abilities. It is therefore not surprising that psychophysicists have employed discrimination tasks extensively and across different areas of perception ever since the dawn of experimental psychology. Notably, to this day, the experimental procedures commonly used in this regard are often based on a classic procedure already employed by Gustav T. Fechner under the name method of constant stimuli (Fechner, 1860; Hegelmaier, 1852; Laming & Laming, 1992). The most prominent variant of this procedure might be the two-alternative forced-choice task (2AFC), originally developed by Hegelmaier (1852). In this task, a standard \(s\) with constant magnitude and a comparison \(c\) with magnitude varying from trial to trial are presented successively, and participants have to identify the larger stimulus at the end of each trial. The order of \(s\) and \(c\) varies randomly from trial to trial, yielding trials with order \(\langle cs \rangle \) or order \(\langle sc \rangle \). As explained below, one intriguing finding is that discrimination performance differs as a function of stimulus order. The goal of the present work is to provide a review of this temporal order effect, as well as a discussion of its theoretical implications.

Measuring discrimination performance

To measure discrimination sensitivity (i.e., the difference threshold) as obtained with this paradigm, one typically estimates the magnitude difference between \(s\) and \(c\), which enables identification of the larger stimulus with an accuracy level of 75% (Gescheider, 1997). This measure is conventionally defined as the difference limen (\(DL\); or just noticeable difference \(JND\)). In order to estimate \(DL\), a psychometric function is typically fitted to the data obtained from a discrimination experiment. This psychometric function plots the probability that \(c\) is judged to be larger than \(s\) on the y-axis as a function of the physical magnitude of \(c\) on the x-axis. Psychometric functions typically increase from 0 to 1 with increasing values of \(c\) (Fig. 1). Note that \(DL\) is captured by the slope of the psychometric function. Specifically, half the distance between the levels of c corresponding to the 0.25 and 0.75 probabilities (i.e., half the interquartile range) defines \(DL\). Thus, the steeper the psychometric function, the smaller the \(DL\) and hence the higher the participant’s sensitivity.
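The relation between the steepness of the psychometric function and \(DL\) can be made concrete with a minimal Python sketch. It assumes a logistic psychometric function; the function names and the parameter values (a \(PSE\) of 500 msec and two arbitrary scale values) are hypothetical and chosen only for illustration.

```python
import math

def logistic_pf(c, pse, scale):
    """Probability that c is judged larger than s (logistic psychometric function)."""
    return 1.0 / (1.0 + math.exp(-(c - pse) / scale))

def logistic_quantile(p, pse, scale):
    """Level of c at which the function reaches probability p (inverse of logistic_pf)."""
    return pse + scale * math.log(p / (1.0 - p))

def difference_limen(pse, scale):
    """DL as half the distance between the 25% and 75% points (half the interquartile range)."""
    c25 = logistic_quantile(0.25, pse, scale)
    c75 = logistic_quantile(0.75, pse, scale)
    return (c75 - c25) / 2.0

# Hypothetical parameters: same PSE, two different steepnesses.
dl_steep = difference_limen(pse=500.0, scale=20.0)  # steeper function
dl_flat = difference_limen(pse=500.0, scale=40.0)   # flatter function
print(round(dl_steep, 2), round(dl_flat, 2))
```

For the logistic function, this definition reduces to \(DL = \text{scale} \cdot \ln 3\), so doubling the scale parameter (i.e., flattening the function) doubles \(DL\).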

Fig. 1

Left plot. Hypothetical logistic psychometric function of a 2AFC duration discrimination experiment with a standard \(s\) of 500 msec. The location of the function defines the \(PSE\), or point of subjective equality. Half the distance between the levels of c respectively corresponding to \(c_{25\%}\) and \(c_{75\%}\) (i.e., half the interquartile range) defines \(DL\). Hence, \(DL\) is a measure of sensitivity reflected in the steepness of the psychometric function. Right plot. Hypothetical order-dependent logistic psychometric functions \( F_1 = P(R_{c>s} | \langle cs \rangle )\) and \( F_2 = P(R_{c>s} | \langle sc \rangle )\) of a 2AFC duration discrimination experiment with a standard \(s\) of 500 msec. Note that both functions have identical locations corresponding to an identical \(PSE\) for both stimulus orders. However, \( F_2\) is steeper than \( F_1\), implying a larger \(DL\) for \(\langle cs \rangle \) trials than for \(\langle sc \rangle \) trials and hence exhibiting a negative TBE

Contrary to intuition, discrimination sensitivity as measured in this task is not merely a function of the physical difference between s and c, but also depends on their temporal order. To our knowledge, the first to notice this were Lillie J. Martin and Georg E. Müller when exploring weight discrimination (Martin & Müller, 1899). These pioneers of experimental psychology noted that participants more often judged correctly whether the second of two successively lifted weights was lighter or heavier than the first stimulus when \(s\) preceded rather than followed \(c\), implying higher discrimination performance in trials with stimulus order \(\langle sc \rangle \) than \(\langle cs \rangle \). In detail, they concluded that “with the same effective difference, in those cases where the comparison weight was lifted second, more correct judgments (and also more correct judgments where the difference was more pronounced) were obtained compared to those cases where the comparison weight was lifted first” (p. 25).Footnote 1 However, although Robert S. Woodworth (Woodworth, 1899) praised the work of Lillie J. Martin and Georg E. Müller for demonstrating the actual complexity underlying seemingly trivial judgment procedures, this contribution was not widely received within the scientific community.

Theoretical relevance of temporal order effects

It was about a hundred years later that psychophysicists not only rediscovered the effects of temporal order on discrimination sensitivity empirically but also realized their theoretical importance. For example, Nachmias (2006, p. 2462) noted the slope of the psychometric function to be steeper for \(\langle sc \rangle \) than for \(\langle cs \rangle \) trials. His observation reflects an instance of the Type B effect (TBE, Ulrich & Vorberg, 2009), and is illustrated in Fig. 1.Footnote 2

Importantly, one speaks of a negative TBE if \(DL\) is larger for \(\langle cs \rangle \)-trials than for \(\langle sc \rangle \)-trials, while a positive TBE refers to the opposite result pattern, i.e., smaller \(DL\) for \(\langle cs \rangle \)-trials than for \(\langle sc \rangle \)-trials. Mainly negative TBEs have been reported in the literature (Bausenhart, Dyjas, & Ulrich, 2015; Dyjas et al., 2012; Ellinghaus, Ulrich, & Bausenhart, 2018; Rammsayer & Ulrich, 2012; Stott, 1935; Ulrich, 2010; Woodrow, 1935). Positive TBEs have mostly been reported for stimuli of short duration presented with very short interstimulus intervals (Hellström & Rammsayer, 2004, 2015; Hellström, 2000). Although the TBE has been mainly investigated for the case of duration discrimination (Ellinghaus, Gick, Ulrich, & Bausenhart, 2019; Hellström & Rammsayer, 2015; Woodrow, 1935), some studies have reported a TBE for physical dimensions outside the temporal domain (Ellinghaus et al., 2018; Ross & Gregory, 1964).Footnote 3

The resurgence of interest in temporal order effects may be partially attributed to the formal psychophysical models that explicitly specify stimulus discrimination mechanisms. Many of these models, as for example signal detection theory (Green & Swets, 1966; Macmillan & Creelman, 2005; Wickens, 2002) and other prominent psychophysical models (Luce & Galanter, 1963; Yeshurun, Carrasco, & Maloney, 2008), are so-called difference models of stimulus discrimination. These models are rooted in the pioneering work of Thurstone (1927a, 1927b), who postulated that humans base their comparative judgment on the difference of the internal stimulus representations \({\textbf{D}}=\textbf{X}_1-\textbf{X}_2\), whereby \(\textbf{X}_1\) and \(\textbf{X}_2\) represent the internal magnitudes of the first and second stimulus in a trial. Crucially, these traditional difference models of the 2AFC task imply that the perceived magnitude difference \({\textbf{D}}\) on any trial does not depend on the stimulus order of s and c but merely on their magnitudes. Consequently, as shown by Dyjas et al. (2012), difference models cannot easily account for the TBE.

Type B effect: Potential explanations

Various explanations have been proposed to account for the TBE. These might be subsumed under three separate theoretical frameworks. First, based on a suggestion by Durlach and Braida (1969), Nachmias (2006) reasoned that “judgments would be based on comparing the second stimulus presented on the trial with some conglomerate of the virtual standardFootnote 4 and the first stimulus of the trial” (p. 2462). Elaborating on this idea, Lapid, Ulrich, and Rammsayer (2008) and Dyjas et al. (2012) formalized a cognitive mechanism according to which such an internal reference I, akin to Nachmias’s virtual standard, is established and updated on every trial. According to this internal reference model (IRM), participants complete their task in a current trial n by comparing the internal representation \(\textbf{X}_{2,n}\) of the second stimulus in this trial against the current internal reference \({\textbf{I}}_n\), which is a conglomerate of previous and current stimulus instances and updates continuously from trial to trial. As the subjective difference \({\textbf{D}}_n\) on the present trial is the result of this comparison process, it can be stated as

$$\begin{aligned} {\textbf{D}}_n = {\textbf{I}}_n - \textbf{X}_{2,n}. \end{aligned}$$
(1)

Specifically, the internal reference \({\textbf{I}}_n = g \cdot {\textbf{I}}_{n-1} +(1-g) \cdot \textbf{X}_{1,n}\) on trial n is a weighted sum of the first stimulus’ internal representation \(\textbf{X}_{1,n}\) on the current trial n and the internal reference \({\textbf{I}}_{n-1}\) from the previous trial \(n-1\), with constant weight g, \(0 \le g <1\). If \({\textbf{D}}_n >0\), then participants judge the first stimulus to be larger than the second stimulus, whereas when \({\textbf{D}}_n<0\), they judge the second stimulus to be larger than the first stimulus. Dyjas et al. (2012) demonstrated that this mechanism implies a better discrimination performance when \(s\) precedes rather than follows \(c\), and thus predicts a negative TBE. Several related ideas have been proposed within the Bayesian framework (de Jong et al., 2021; Glasauer & Shi, 2021; Jazayeri & Shadlen, 2010; Schumacher & Voss, 2023; Shi, Church, & Meck, 2013; Wiener, Thompson, & Coslett, 2014), which are closely related to IRM (cf. de Jong et al., 2021). For completeness, it should be noted here that IRM and these related models have not only proven useful as potential accounts for the TBE, but may also explain some related temporal context effects such as central tendencies in judgment (e.g., Bausenhart, Dyjas, & Ulrich, 2014; Hollingworth, 1910; Vierordt, 1868) and assimilatory sequence effects (e.g., Cicchini, Arrighi, Cecchetti, Giusti, & Burr, 2012; Dyjas et al., 2012; Fischer & Whitney, 2014; Fritsche, Mostert, & de Lange, 2017); for an integrative review, see Sadibolova and Terhune (2022).
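The updating rule and the decision rule of IRM can be illustrated with a small simulation. The following Python sketch is not the implementation used by Dyjas et al. (2012); it assumes Gaussian internal representations with a common standard deviation, and all names and parameter values (standard, comparison levels, noise, weight g, seed) are hypothetical. With \(g = 0\), the internal reference equals \(\textbf{X}_{1,n}\), so the model reduces to a plain difference model.

```python
import random

def simulate_irm(g, n_trials=40000, s=500.0, deltas=(-60.0, -30.0, 30.0, 60.0),
                 sigma=40.0, seed=1):
    """Simulate 2AFC trials under the internal reference model (IRM).

    Returns the proportion of correct responses for <sc> and <cs> trials.
    All default parameter values are hypothetical.
    """
    rng = random.Random(seed)
    reference = s                       # I_0: start the internal reference at the standard
    correct = {"sc": 0, "cs": 0}
    count = {"sc": 0, "cs": 0}
    for _ in range(n_trials):
        c = s + rng.choice(deltas)      # comparison magnitude on this trial
        order = rng.choice(("sc", "cs"))
        first, second = (s, c) if order == "sc" else (c, s)
        x1 = rng.gauss(first, sigma)    # internal representation X_{1,n}
        x2 = rng.gauss(second, sigma)   # internal representation X_{2,n}
        reference = g * reference + (1.0 - g) * x1  # I_n = g*I_{n-1} + (1-g)*X_{1,n}
        d = reference - x2                          # D_n = I_n - X_{2,n}
        judged_first_larger = d > 0
        first_is_larger = first > second
        correct[order] += judged_first_larger == first_is_larger
        count[order] += 1
    return {k: correct[k] / count[k] for k in correct}

acc_difference_model = simulate_irm(g=0.0)  # g = 0: no carry-over, plain difference model
acc_irm = simulate_irm(g=0.6)               # g > 0: reference carries over across trials
print(acc_difference_model, acc_irm)
```

Under these assumptions, accuracy should be about equal for both orders when \(g = 0\), whereas for \(g > 0\) it should be higher in \(\langle sc \rangle \) than in \(\langle cs \rangle \) trials. The reason is that in \(\langle cs \rangle \) trials the comparison enters the reference only with weight \(1-g\), which shrinks the expected difference to \((1-g)(c-s)\).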

A second framework within which effects of stimulus order can be accounted for is the sensation weighting model (SWM; Hellström, 1977; Hellström, 1985; Hellström, 2000; Hellström & Rammsayer, 2004; Hellström & Rammsayer, 2015). As SWM is conceptually based on the adaptation level theory (Helson, 1947, 1964; Michels & Helson, 1954), it shares with IRM the premise that judgments rely on both past and present stimulus information. In contrast to IRM, SWM assumes that the magnitudes of the internal representations for the first and second stimuli are differently weighted when the stimuli are compared. Depending on the relative weights of the first and second stimulus, discrimination performance is higher or lower for \(\langle sc \rangle \) compared to \(\langle cs \rangle \) trials (also see Bausenhart et al., 2015). Therefore, SWM can also account for a positive TBE (i.e., larger \(DL\) for \(\langle sc \rangle \) than for \(\langle cs \rangle \) trials). As noted above, such positive TBEs have been reported by Hellström and his colleagues for duration discrimination experiments (e.g., Hellström & Rammsayer, 2004; Hellström, Patching, & Rammsayer, 2020).

Third, it has been argued that the TBE might not reflect any specific cognitive mechanism at all, but rather constitutes a methodological artifact which results from researcher decisions in estimating the difference threshold. For example, García-Pérez and Alcalá-Quintana (2010) argued that the higher difference thresholds in \(\langle cs \rangle \) as compared to \(\langle sc \rangle \) trials as observed by Lapid et al. (2008) are due to a supposedly erroneous procedure of \(DL\) estimation. In detail, these authors claim on p. 1160 that “the wrong choice of a psychometric function to fit to 2AFC data, as well as the lack of a free lapsing-rate parameter, spuriously inflated DLs estimated by Lapid et al. (2008)”. While Ulrich (2010) doubted that the method employed by Lapid et al. (2008) is actually flawed, according to this interpretation of García-Pérez and Alcalá-Quintana (2010), a TBE should be evident (as an artifact) only in studies which employ a particular and supposedly incorrect procedure of \(DL\) estimation.

In summary, the TBE has diagnostic value for evaluating the validity of psychophysical models. Specifically, as outlined above, TBEs are at variance with the predictions of traditional difference models but consistent with extensions of these models, such as IRM or SWM. Because these extensions generally apply to any stimulus discrimination task, the TBE was hypothesized to also systematically emerge outside the temporal domain. Accordingly, the present meta-analysis includes both temporal and non-temporal studies in a random-effects model (Borenstein, Hedges, Higgins, & Rothstein, 2009).

Method

Sample of studies

Several papers addressing the effect of stimulus order on discrimination sensitivity for the 2AFC task have been published throughout the decades. The starting point for selecting studies was an overview provided by Dyjas et al. (2012). For the present analysis, we supplemented this preliminary compilation by adding more recent and earlier studies that had not been incorporated in this initial overview. Since authors have used various terms to refer to the TBE, keyword-based literature research would not yield meaningful results. Therefore, we attempted to identify all relevant publications using a snowball approach, and searched for additional relevant studies based on the references and citations of studies on the TBE. In essence, we searched for all 2AFC experiments that reported separate \(DL\)s for \(\langle cs \rangle \) and \(\langle sc \rangle \) trials, and contained sufficient information to compute the standardized mean difference (SMD, see below). The obtained studies were subdivided into temporal (i.e., duration) and non-temporal (e.g., brightness, loudness) discrimination experiments. Regarding temporal discrimination, this resulted in the following studies: Bausenhart et al. (2015), Bruno, Ayhan, and Johnston (2012), Dyjas et al. (2012), Dyjas et al. (2014), Dyjas and Ulrich (2014), Ellinghaus et al. (2019), Ellinghaus, Giel, Ulrich, and Bausenhart (2021), Gao, Miller, Rudd, Webster, and Jiang (2021), Gordon (1967), Grondin and McAuley (2009), Harrison, Binetti, Mareschal, and Johnston (2017), Hellström and Rammsayer (2004), Hellström and Rammsayer (2015), Hellström et al. (2020), Lapid et al. (2008), Marchman (1969), Rammsayer and Wittkowski (1990), Van Allen, Benton, and Gordon (1966), Thönes, Von Castell, Iflinger, and Oberfeld (2018), Ulrich (2010). Regarding experiments outside the temporal domain, the following studies were obtained: Ellinghaus et al. (2018), Ellinghaus et al. (2021), Lapid, Ulrich, and Rammsayer (2009), Nachmias (2006), Ross and Gregory (1964), von Castell, Hecht, and Oberfeld (2017).

Fig. 2

Results of temporal studies

Coded factors and hypotheses

Two separate random-effects models were run. The first model included only the temporal studies and duration was used as a factor, with experiments employing an \(s\) shorter or longer than 500 msec classified as short or long, respectively. Theoretical considerations drove the inclusion of this factor. Namely, studies on duration discrimination suggest that the processing of very short durations relies on different mechanisms than the processing of longer durations. For example, Michon (1985) assumed that temporal processing of intervals longer than 500 ms is cognitively mediated, whereas shorter intervals are perceptual in nature. This distinct timing hypothesis received both neuroscientific (Lewis & Miall, 2003a, 2003b) and behavioral support (Rammsayer & Lima, 1991; Rammsayer & Ulrich, 2011); however, see Rammsayer and Ulrich (2005) for contradicting results. Consistent with the distinct timing hypothesis and earlier studies (cf. Bausenhart et al., 2015; Hellström & Rammsayer, 2004), we conjectured that the TBE will be negative at a relatively long interval length but become more positive at a relatively short interval length. Second, an additional random-effects model included only the non-temporal studies. Here, we expected the SMD to be significantly smaller than 0, reflecting substantial evidence for the negative TBE.

Fig. 3

Results of non-temporal studies

Effect size analysis

In a first step, the standardized mean difference (SMD, computational details are given by Borenstein et al., 2009) was computed for each documented TBE, defined as \(D = DL_{\langle sc \rangle } - DL_{\langle cs \rangle }\). For within-subjects designs, the employed expression was \(\widehat{SMD} = \frac{M_D}{S_{within}}\), with \(S_{within} = \frac{S_D}{\sqrt{2(1-r)}}\).Footnote 5 For between-subjects designs, the employed expression was \(\widehat{SMD} = t \cdot \sqrt{\frac{N_1 + N_2}{N_1 \cdot N_2}}\). In a second step, these estimated effect sizes of both the temporal and non-temporal stimuli were submitted to a random-effects model using the function metagen() of the R package meta (Schwarzer et al., 2007). For both analyses, the corresponding model accounted for the hierarchically nested data structure, that is, in each model each observed TBE was nested within a certain (sub)sample of a certain experiment, which was nested within a certain study. For both models, SMD was estimated across all observations. In addition, for the temporal studies, a subgroup analysis with the factor interval length was carried out to get separate SMD estimates for short and long stimulus durations. Finally, the function forest.meta() was employed to create the forest plots.
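The two effect size expressions can be written out as a short Python sketch. The function names and the input values are hypothetical and serve only to illustrate the arithmetic; the analysis itself was carried out in R.

```python
import math

def smd_within(mean_diff, sd_diff, r):
    """SMD for within-subjects designs: M_D / S_within, with S_within = S_D / sqrt(2 * (1 - r))."""
    s_within = sd_diff / math.sqrt(2.0 * (1.0 - r))
    return mean_diff / s_within

def smd_between(t, n1, n2):
    """SMD for between-subjects designs, recovered from a t statistic and the group sizes."""
    return t * math.sqrt((n1 + n2) / (n1 * n2))

# Hypothetical within-subjects example: mean DL difference of -10 msec,
# standard deviation of the differences of 20 msec, correlation r = 0.5.
print(smd_within(mean_diff=-10.0, sd_diff=20.0, r=0.5))

# Hypothetical between-subjects example: t = -2.0 with 8 participants per group.
print(smd_between(t=-2.0, n1=8, n2=8))
```

Because \(D\) is defined as the \(DL\) in \(\langle sc \rangle \) trials minus the \(DL\) in \(\langle cs \rangle \) trials, a negative SMD corresponds to a negative TBE.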

Results

An overview of the temporal studies is given in Fig. 2. In the corresponding model, the SMD was estimated to be -0.61 (95% CI [-0.82; -0.40]) when averaged across all observations, and this estimate was significantly different from zero, \(t(52) = -5.83\), \(p < .001\). In addition, the subgroup analysis revealed a significantly more negative SMD for long durations (-0.67, 95% CI [-0.87; -0.48]) as compared to short durations (-0.23, 95% CI [-0.51; 0.05]), \(F(1,51) = 13.84\), \(p < .01\). An overview of the non-temporal studies is given in Fig. 3. The corresponding SMD was estimated as -0.86 (95% CI [-1.07; -0.65]), and significantly smaller than zero, \(t(6) = -10.05\), \(p < .001\).
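To illustrate what random-effects pooling of SMDs involves, the following Python sketch implements the DerSimonian-Laird estimator. This is a deliberately simplified stand-in and not the model fitted here: it ignores the nested data structure, uses a normal-based instead of a t-based confidence interval, and the function name and input values are hypothetical rather than taken from the included studies.

```python
import math

def pool_random_effects(effects, variances):
    """Pool effect sizes with the DerSimonian-Laird random-effects estimator.

    Simplified sketch: treats all effects as independent (no nesting),
    requires at least two effects, and uses a normal-based 95% CI.
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                    # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {"smd": pooled, "se": se, "tau2": tau2,
            "ci": (pooled - 1.96 * se, pooled + 1.96 * se)}

# Hypothetical SMDs and sampling variances, for illustration only.
result = pool_random_effects([-0.9, -0.4, -0.7, -0.2], [0.05, 0.04, 0.06, 0.05])
print(result)
```

A pooled SMD whose confidence interval lies entirely below zero would indicate a negative TBE across the pooled observations.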

Discussion

The goal of the present article was to review the literature on the TBE, i.e., the finding that discrimination performance as indexed by \(DL\) differs as a function of the temporal order of \(s\) and \(c\) in the 2AFC task. Importantly, since the TBE had been primarily investigated for the temporal domain (i.e., duration discrimination), it was unclear whether this effect is the signature of some general cognitive process related to comparative judgments or specific to duration discrimination. Investigating this generality is theoretically important since the TBE poses a problem for traditional difference models (Green & Swets, 1966; Macmillan & Creelman, 2005; Wickens, 2002). Since these models are formulated rather generally, their predictions should hold across various stimulus attributes. Therefore, the present meta-analysis aimed to assess the TBE’s generality by including and comparing effect sizes of both temporal and non-temporal studies in a random-effects model (Borenstein et al., 2009).

The main results of this analysis can be summarized as follows: First, the meta-analytic regression model indicated substantial evidence for the TBE. Hence, in contrast to the predictions of traditional difference models, the subjective difference between two stimuli compared is not merely a function of their physical difference but also depends on their temporal order. Given that the analysis procedures of the 2AFC studies published throughout the decades and incorporated in the present analysis are very heterogeneous, it seems unlikely that the TBE merely reflects the consequences of a particular erroneous \(DL\) estimation procedure, as claimed by García-Pérez and Alcalá-Quintana (2010). Rather, the TBE appears to constitute a real phenomenon with a mechanistic origin. As such, it challenges established models of stimulus discrimination, and can thus be considered a benchmark effect to elaborate these models. Notably, while most TBEs reported in the literature stem from the temporal domain, based on our analysis, it seems plausible that the TBE is an inherent feature of discrimination experiments where standard and comparison are presented successively, as the present analysis provides substantial evidence for the TBE in various non-temporal tasks such as line length or brightness discrimination. In fact, the absolute value of the estimated SMD for the non-temporal domain seems numerically even slightly larger than for the temporal domain. A potential reason why this phenomenon may have been underrepresented in the literature so far is the common practice of averaging data from the two stimulus presentation orders in 2AFC experiments (see Ulrich & Vorberg, 2009). This practice is not advisable, however, as these authors have shown that neglecting order effects can distort \(DL\) estimates. Computational tools for estimating discrimination performance that avoid such pitfalls are given by Bausenhart, Dyjas, Vorberg, and Ulrich (2012).

Second, while most TBEs correspond to the direction as predicted by IRM and related Bayesian updating models, that is, higher \(DL\) for \(\langle cs \rangle \) than for \(\langle sc \rangle \) trials (hence reflecting a negative TBE), positive TBEs (higher \(DL\) for \(\langle sc \rangle \) than for \(\langle cs \rangle \) trials) have also been documented (Hellström & Rammsayer, 2015). Notably, all of these documented positive TBEs stem from duration discrimination experiments, most of which employed very short standard durations (<100 ms). Accordingly, when the factor interval length was employed as a moderator (short vs. long durations) in the meta-analytic regression model for the duration discrimination studies, the effect size of the TBE differed significantly between experiments employing short versus long standard durations. This result is theoretically important, because positive TBEs are inconsistent with the predictions of IRM and related models (but see Bausenhart et al., 2015), yet can be accounted for by SWM. In detail, SWM can predict lower \(DL\) for \(\langle cs \rangle \) than for \(\langle sc \rangle \) trials by postulating that the stimulus magnitude in the first stimulus position is given more weight than the magnitude in the second position. Nevertheless, it should be noted that the estimated mean SMD for the short duration temporal studies was still negative, and hence the present meta-analysis does not provide sufficient evidence to discard IRM in favor of SWM. It should also be noted that the number of experiments employing standard durations not longer than 100 ms is relatively small and only a few replication attempts outside the lab of Hellström and colleagues have been made, some of which were unsuccessful in replicating a positive TBE (Bausenhart et al., 2015; Rammsayer & Wittkowski, 1990).

In any case, even if the positive TBEs documented for short standard durations constitute a true effect, this does not generally falsify the mechanisms as proposed by IRM and related models in favor of SWM. Rather, it is possible that the scope of these models is not as general as previously hypothesized. For example, it is possible that the nature of integrating past and present stimulus information differs between relatively long and relatively short durations. In fact, it is a prominent idea in the psychology of time that different mechanisms operate in different time scales (Lewis & Miall, 2003b; Michon, 1985). Consistent with this idea, effects analogous to the TBE (different decision weights for the two stimulus positions) have been reported as increasingly positive (greater impact of the first stimulus than of the second) with brief durations and brief ISI (e.g., Hellström, 1979; Hellström, 2003).Footnote 6 As the stimulus conditions employed in these experiments are rarely used by researchers, a biased view might arise when making inferences from the current literature. Clearly, investigating the exact boundary conditions under which positive and negative TBEs, respectively, occur will be crucial for future theory-building research. Finally, although not directly relevant to the TBE and its origin, it is worth mentioning that of the models considered here only SWM can genuinely account for the classic time order error.Footnote 7

In this article, we focused on the slope of the psychometric function, which is captured by the difference limen (\(DL\), Luce & Galanter, 1963). It is theoretically possible that the order of \(s\) and \(c\) not only affects \(DL\) but also higher moments of the psychometric function, such as the skewness of the function (cf. Ulrich, 1987). Unfortunately, we are not aware of any study that assessed potential order-effects for higher moments of the psychometric function. For future research, it might be valuable to also address higher moments of the psychometric function as a benchmark for evaluating models of stimulus discrimination.Footnote 8

In conclusion, the present meta-analysis reveals that the TBE is a ubiquitous feature of discrimination tasks when a constant standard and a variable comparison are presented successively, as for example in the classic 2AFC task. This effect constitutes a challenge for classic difference models, such as SDT (e.g., Green & Swets, 1966) and other prominent psychophysical difference models (e.g., Yeshurun et al., 2008). Potential candidate mechanisms underlying the TBE are (1) differential weighting of the stimulus magnitudes at the two positions (e.g., Hellström, 1977), (2) internal reference formation (e.g., Dyjas et al., 2012), (3) Bayesian updating (e.g., de Jong et al., 2021), and (4) biased threshold estimation (García-Pérez & Alcalá-Quintana, 2010). In any case, future studies are needed to establish the conditions under which positive and negative TBEs arise, and thereby to delineate more clearly the mechanisms underlying discrimination performance.

Open Practices Statement

The data and analysis script of this meta-analysis will be made available upon publication of this article.