Participants
A total of 47 students (29 females; mean age 20.6 years) at the University of Groningen participated in exchange for partial course credit. Participants gave informed consent before the experiment. Based on the predefined performance criterion (see Procedure section), one participant was excluded from further analysis.
Apparatus and stimuli
Stimulus generation and presentation were controlled by Psychtoolbox (Brainard, 1997). We used a 19-in. CRT screen, an Iiyama Prolite G2773HS-GB1, with a resolution of 1,280 × 1,024 pixels, running at 100 Hz. Participants were seated in a sound-attenuated room with dimmed lights, approximately 60 cm from the screen. A grey background was maintained during the entire experiment, and a black fixation dot was presented throughout each trial. Feedback sounds were brief (150 ms) pure tones: high (1,000 Hz) for correct and low (200 Hz) for incorrect responses. Stimuli were white circles with a diameter of 6.69°, presented in the center of the screen.
Procedure
After four practice trials (one for each standard duration, in random order), participants completed a total of 200 trials of a duration discrimination task (see Fig. 1). On each trial, a standard (S) duration, always presented first, was compared with a comparison (C) duration. At the start of each trial, a black fixation dot was presented at the center of the screen and remained visible throughout the entire trial. After 2 seconds, the standard was presented, which had a duration of 0.3, 0.6, 1.2, or 2.4 seconds (randomly sampled without replacement). Then, after a delay of 1 second, C was presented, which was either shorter or longer than S. The difference in duration between S and C is referred to as Δd, expressed as a proportion of S. When C was shorter than S, the comparison duration equaled \( \frac{S}{1+\varDelta d} \); when it was longer, its duration equaled S ∗ (1 + Δd). Participants indicated whether C was shorter (key ‘c’) or longer (key ‘m’) than S. Participants also received auditory feedback after their response: a brief high tone for correct and a brief low tone for incorrect responses. All combinations of S and longer/shorter C were presented equally often (across standards, 50% of the comparisons were longer and 50% were shorter) and in randomized order.
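To make this mapping concrete, a minimal sketch in R of how the comparison duration follows from S and Δd (the function and argument names are illustrative, not taken from the experiment code):

```r
# Comparison duration (in seconds) for a given standard S and proportional
# difference delta_d; `longer` indicates whether C should exceed S.
comparison_duration <- function(S, delta_d, longer) {
  if (longer) S * (1 + delta_d) else S / (1 + delta_d)
}

comparison_duration(0.6, 0.25, longer = TRUE)   # 0.75 s
comparison_duration(0.6, 0.25, longer = FALSE)  # 0.48 s
```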
The difference in duration between S and C (Δd, as a proportion of S) was varied with an adaptive staircase procedure throughout the experiment, using a transformed up/down rule: Δd was decreased after three consecutive correct responses and increased after a single error. This rule approximates a performance level of 79.4% correct (Levitt, 1971). The starting value of Δd was 0.6, with a step size of 0.05 and a minimum of 0.05. Participants whose Δd exceeded 1 were excluded from any further analysis (see Participants section).
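A minimal sketch of this staircase rule in R, using the step size and floor described above (the function and bookkeeping variables are ours):

```r
# Transformed three-down/one-up rule: decrease delta_d after three
# consecutive correct responses, increase it after a single error.
update_staircase <- function(delta_d, correct, n_correct,
                             step = 0.05, minimum = 0.05) {
  if (correct) {
    n_correct <- n_correct + 1
    if (n_correct == 3) {                 # three in a row: make the task harder
      delta_d   <- max(delta_d - step, minimum)
      n_correct <- 0
    }
  } else {                                # single error: make the task easier
    delta_d   <- delta_d + step
    n_correct <- 0
  }
  list(delta_d = delta_d, n_correct = n_correct)
}
```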
Analysis
Generalized linear mixed models (GLMMs) were fitted with the lme4 package (Version 1.1-19; Bates, Mächler, Bolker, & Walker, 2015) using the ‘nlminb’ optimizer from the optimx package (Version 2018-7.10; Nash & Varadhan, 2011). Mixed-effects models are more powerful than the traditional approach of aggregating data at the subject level, since they take subject-level variability into account (Moscatelli, Mezzetti, & Lacquaniti, 2012). The data were fitted with cumulative normal psychometric functions via the ‘probit’ link function. We centered S around the geometric mean (849 ms) to make the results more interpretable: assuming that the geometric mean represents the center of the duration distribution, this produces an overall intercept of zero. In the GLMMs, ‘subject’ was always included as a random intercept. We sequentially added fixed effects and their associated random effects to the GLMM. To quantify evidence for more complex models over simpler ones, Bayes factors were approximated from the Bayesian Information Criterion (BIC) values of the GLMMs (Wagenmakers, 2007):
$$ {BF}_{01}=\exp \left(\frac{{\Delta BIC}_{10}}{2}\right) $$
(1)
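As an illustration, the fitting and model-comparison steps might look as follows in R; the data frame `dat` and the column names `longer`, `delta_d` (the signed difference between C and S), `standard_c` (the centered log standard), and `subject` are placeholders rather than the actual variable names used in the analysis:

```r
library(lme4)
library(optimx)

ctrl <- glmerControl(optimizer = "optimx",
                     optCtrl   = list(method = "nlminb"))

# Simpler model: probit psychometric function with a random intercept per subject
m0 <- glmer(longer ~ delta_d + (1 | subject),
            family = binomial(link = "probit"), data = dat, control = ctrl)

# More complex model: adds the centered (log) standard as a fixed effect
m1 <- update(m0, . ~ . + standard_c)

# Approximate Bayes factor in favor of the simpler model (Equation 1);
# values below 1 favor the more complex model
bf01 <- exp((BIC(m1) - BIC(m0)) / 2)
```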
In order to quantify the relationship between bias and precision, we estimated the effect of standard and the average slope for each participant individually in a GLM. Then, we computed nonparametric Spearman correlation coefficients and the associated 95% confidence intervals, using the z-transformation method implemented in the psych package (Version 2.0.12; Revelle, 2020).
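A sketch of this correlation analysis in R, assuming the per-participant estimates are stored in a data frame `estimates` with placeholder columns `standard_effect` (bias) and `slope` (precision); `r.con` is the psych function for Fisher z-based confidence intervals, though whether this is the exact routine used is an assumption:

```r
library(psych)

# Spearman correlation between per-participant bias (effect of standard)
# and precision (average slope) estimates
rho <- cor(estimates$standard_effect, estimates$slope, method = "spearman")

# 95% confidence interval via the Fisher z-transformation
r.con(rho, n = nrow(estimates), p = 0.95)
```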
Modelling
In order to formalize the differences between IRM and the Kalman filter, we implemented both in R (R Core Team, 2018). We did not perform extensive optimization routines to fit each model to the data, since we only wanted to demonstrate what each model predicts given a reasonable set of fixed parameters. We wanted to ensure that both models had identical inputs and similar rules for determining outputs, so that differences in model predictions could be attributed to the internal workings of each model. Inputs (\( x_m \)) were logarithmically transformed durations (d), as used in the experiment, perturbed by Gaussian noise (\( n_m \)):
$$ {x}_m=\ln (d)+{n}_m $$
(2)
with \( p\left({n}_m\right)=N\left(0,{\sigma}_m^2\right) \). Here, \( {\sigma}_m^2 \) determines the noisiness of the sensory input. Put differently, \( 1/{\sigma}_m^2 \) is the precision of the sensory input. In order to simulate data for individual subjects with different precision, we randomly selected values for \( \sigma_{m,i} \) from a truncated normal distribution whose mean corresponds to \( \sigma_m \). Importantly, these input parameters were not free parameters, since they were fixed across different models. We found that \( \sigma_m = 0.2 \) provided a reasonable fit for the aggregate results. We simulated data from each model for 200 subjects with 840 trials each, matching the random order of stimulus presentations to the real participants. We did not run a staircase procedure on simulated subjects, but instead presented combinations of S and Δd in random order without replacement.
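A minimal sketch of this input stage (Equation 2) in R; the between-subject spread of the noise levels (0.05) and the truncation at zero are assumptions made for illustration:

```r
library(truncnorm)  # provides rtruncnorm() for the truncated normal draws

set.seed(1)
n_subjects <- 200
sigma_m    <- 0.2                      # mean sensory noise (SD on the log scale)

# Per-subject noise level sigma_m_i, truncated at zero; the between-subject
# SD of 0.05 is an assumption for illustration only
sigma_m_i <- rtruncnorm(n_subjects, a = 0, mean = sigma_m, sd = 0.05)

# Noisy internal measurement x_m of a duration d (in seconds) for subject i
measure <- function(d, i) log(d) + rnorm(length(d), 0, sigma_m_i[i])

measure(c(0.3, 0.6, 1.2, 2.4), i = 1)
```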
Outputs of the models (responding ‘longer’ or ‘shorter’) were determined by comparing the representations of S and C produced by each model. If C > S, the model responds ‘comparison longer’; if C < S, ‘comparison shorter.’ In the case of IRM, S and C were the internal references that resulted from perceiving the standard and comparison. In the case of the Kalman filter, S and C were the means of the priors that resulted from perceiving the standard and comparison.
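Expressed in R, this shared decision rule is a simple comparison of the two representations (names are ours):

```r
# Shared decision rule: compare the model's representation of the
# comparison (rep_C) with its representation of the standard (rep_S)
decide <- function(rep_S, rep_C) {
  if (rep_C > rep_S) "comparison longer" else "comparison shorter"
}
```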
Kalman filter
The Kalman filter is a Bayesian model, which assumes that subjects maintain and update a prior, which represents the distribution of previously observed durations. We based our implementation of the Kalman filter on Glasauer and Shi (2018), and Petzschner and Glasauer (2011). Rather than representing each duration with only a single value, the Kalman filter also represents the uncertainty associated with that representation. The prior is modelled as a Gaussian distribution: \( N\left({\mu}_p,{\sigma}_p^2\right) \). When a duration, indexed by n, is sensed, it is represented by the likelihood function \( p\left({x}_{m,n}\right)\sim N\left({x}_{m,n},{\sigma}_m^2\right) \). When a duration is estimated, the prior is updated through a weighted average of the previous prior distribution and the currently sensed likelihood. The weight of the new observation is called the Kalman gain (k):
$$ {k}_n=\frac{\sigma_{p,n-1}^2+q}{\sigma_{p,n-1}^2+q+{\sigma}_m^2} $$
(3)
As can be seen, the weight of the new observation is determined by the relative uncertainty of the previous prior \( {\sigma}_{p,n-1}^2 \) and the uncertainty of the current likelihood \( {\sigma}_m^2 \). When the uncertainty of the current sensory observation (likelihood) is large relative to the uncertainty of the prior, k will be small (for a method to empirically estimate the Kalman gain over time, see Berniker, Voss, & Kording, 2010). The Kalman gain is further determined by the process variance q, which reflects that the observer assumes a prior that fluctuates randomly over time: \( {\mu}_{p,n}\sim {\mu}_{p,n-1}+N\left(0,q\right) \). In other words, there is always a level of uncertainty involved in representing the non-static prior, which is determined by the process variance q. This is an important assumption, since the variance of the prior \( {\sigma}_p^2 \) is updated as follows:
$$ {\sigma}_{p,n}^2={k}_n\ast {\sigma}_m^2 $$
(4)
We can see from Equations 3 and 4 that, if q = 0 and \( {\sigma}_m^2 \) is constant, k would continually decrease alongside \( {\sigma}_p^2 \), so that new observations would eventually be unable to change the prior, resulting in an overly rigid representation of stimulus history. The prior mean (\( \mu_p \)) is updated as follows:
$$ {\mu}_{p,n}=\left(1-{k}_n\right)\ast {\mu}_{p,n-1}+{k}_n\ast {x}_{m,n} $$
(5)
In order to use the Kalman filter for duration discrimination, we assume that subjects use the updated prior for both the first and second duration and compare their means. For all simulations, we used q = 0.9. We chose this value because it produced an average k of around 0.85 across simulated subjects. This, in turn, ensured that the parameters of the Kalman filter and IRM are comparable, since the Kalman gain determines the weight on new observations, and g determines the weight on the internal reference (i.e., k = 1 − g). It should be noted that, since the variance of the likelihood function (\( {\sigma}_m^2 \)) varies between subjects, k also varies between subjects. When \( {\sigma}_m^2 \) is high (low), k will be low (high), and the prior will have a more (less) pronounced influence in the form of global and local context effects. In other words, more precise subjects will have smaller context effects.
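A minimal sketch of the Kalman filter updates (Equations 3–5) in R; the initial prior in the example is an arbitrary choice for illustration, and measurement noise is omitted for clarity:

```r
# One Kalman filter update of the prior (mu_p, var_p) given a noisy
# measurement x_m with measurement variance var_m
kalman_update <- function(mu_p, var_p, x_m, var_m, q = 0.9) {
  k     <- (var_p + q) / (var_p + q + var_m)  # Kalman gain (Equation 3)
  var_p <- k * var_m                          # updated prior variance (Equation 4)
  mu_p  <- (1 - k) * mu_p + k * x_m           # updated prior mean (Equation 5)
  list(mu_p = mu_p, var_p = var_p, k = k)
}

# Example trial: update the prior with the standard, then with the comparison,
# and compare the two successive prior means
prior   <- list(mu_p = log(0.849), var_p = 1)
after_S <- kalman_update(prior$mu_p, prior$var_p, x_m = log(0.6), var_m = 0.2^2)
after_C <- kalman_update(after_S$mu_p, after_S$var_p, x_m = log(0.75), var_m = 0.2^2)
after_C$mu_p > after_S$mu_p  # TRUE, so the model responds "comparison longer"
```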
Internal reference model
The internal reference model (IRM; Dyjas et al., 2012) assumes that subjects maintain and update an internal reference, which represents a geometric moving average of previously observed durations. When a duration \( x_{m,n} \), indexed by n, is observed, the internal reference \( I_n \) is updated through a weighted average of the previous internal reference (\( I_{n-1} \)) and the currently observed duration \( x_{m,n} \):
$$ {I}_n=g\ast {I}_{n-1}+\left(1-g\right)\ast {x}_{m,n} $$
(6)
where g, with 0 ≤ g < 1, is the constant weight on \( I_{n-1} \). In effect, g controls a trade-off between having a stable internal reference (high g) and quickly adapting to new durations (low g). Dyjas et al. (2012) describe two different versions of IRM that can be used to explain performance in duration discrimination tasks. In the first version, which we will refer to as IRM1, only the first duration updates the internal reference. This internal reference, in turn, is compared with the observed second duration, which does not update the internal reference. In the second version (IRM2), both durations update the internal reference, and the two resulting internal references are compared to generate a decision about whether the second stimulus was longer or shorter than the first. For all simulations, we used g = 0.15. All data, materials, code, and supplemental materials are available on the Open Science Framework (osf.io/hu43y).
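A sketch of the reference update (Equation 6) and the two decision variants in R (function and variable names are ours; the inputs x_S and x_C stand for the measured standard and comparison):

```r
# Update the internal reference I with a new measurement x_m (Equation 6)
irm_update <- function(I, x_m, g = 0.15) {
  g * I + (1 - g) * x_m
}

# IRM1: only the standard updates the reference, which is then compared
# directly with the measured comparison
irm1_trial <- function(I, x_S, x_C, g = 0.15) {
  I_S <- irm_update(I, x_S, g)
  list(I = I_S,
       response = if (x_C > I_S) "comparison longer" else "comparison shorter")
}

# IRM2: both durations update the reference, and the two successive
# reference values are compared
irm2_trial <- function(I, x_S, x_C, g = 0.15) {
  I_S <- irm_update(I, x_S, g)
  I_C <- irm_update(I_S, x_C, g)
  list(I = I_C,
       response = if (I_C > I_S) "comparison longer" else "comparison shorter")
}
```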