Introduction

Specifying the relationship between physical time and perceived duration has been explored in many facets in psychophysics. Particularly when duration perception is compared with other sensory modalities, Stevens’ power law is invoked. Employing it raises two related and fundamental questions: first, whether perceived duration satisfies the condition of ratio scalability, and second, whether the power law parameters obtained in duration scaling experiments remain unaffected by certain characteristics of the task. This study examines these questions by testing the validity of a number of pertinent axioms from representational measurement theory.

The relationship between the physical intensity of a stimulus and its perceived magnitude can be described by Stevens’ power law (1946, 1956), which is formulated as:

$$ \varphi(t) = \alpha t^{\beta}, t > 0. $$
(1)

That is, the perceived magnitude of a stimulus t is described by a power function \(\alpha t^{\beta}\). Whereas the parameter α is a proportionality factor depending on the units used, the exponent β depends on the sensory modality. If β > 1, the perceived magnitude of the stimulus grows faster than the intensity of the physical stimulus. If β < 1, the increments in perceived stimulus magnitude become smaller with increasing physical stimulus intensity. In the case of β = 1, there is a directly proportional relationship between physical and perceived stimuli, i.e., the relationship can be described by a simple linear function.
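The role of the exponent can be illustrated with a minimal sketch. The function names and the example values are illustrative assumptions, not part of the study; the only thing taken from the text is the form of Eq. 1.

```python
def perceived_magnitude(t, alpha=1.0, beta=1.0):
    """Stevens' power law, phi(t) = alpha * t**beta, defined for t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    return alpha * t ** beta


def perceived_ratio(t_long, t_short, beta):
    """Ratio phi(t_long) / phi(t_short); the factor alpha cancels out."""
    return (t_long / t_short) ** beta


# Doubling the physical duration (hypothetical values, in ms)
for beta in (0.5, 1.0, 1.5):
    print(beta, perceived_ratio(200.0, 100.0, beta))
```

Under this sketch, a physical doubling is perceived as less than a doubling when β < 1, as exactly a doubling when β = 1, and as more than a doubling when β > 1.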

Physical time and its perceived duration were also found to be related by a power function (Stevens and Galanter, 1957; Allan, 1979). The power function was fitted in several experiments applying different scaling methods (Eisler, 1975), among them ratings and magnitude estimation with and without a standard (Bobko et al., 1977). These approaches yielded exponents ranging from 0.44 to 1.87 (Kornbrot et al., 2013), with an average exponent of 0.90 most suitably describing the relationship between physical and perceived duration (Eisler, 1976).

Established methods to determine the exponent of Stevens’ psychophysical function are scaling procedures, in which participants are asked to produce correspondences between the perceived intensity of stimuli and numerical values consistent with the instruction. Stevens (1956) described two direct scaling methods, which are called magnitude estimation and magnitude production.

Though Stevens, in his later writings (e.g., Stevens, 1975), expressed a preference for using these methods without any constraints such as fixed standards or pre-assigned numerical values, their earliest applications were implemented in a manner similar to the classical methods for measuring sensory thresholds; that is, they used a fixed stimulus, the standard, and a variable stimulus called the comparison. These versions of magnitude estimation and production have later been termed ‘ratio estimation’ and ‘ratio production’, respectively (Gescheider, 1997).

There are two implicit assumptions fundamental to these direct scaling procedures: It is assumed that the participants are able to estimate or to produce perceived intensities on a ratio-scale level and, furthermore, that the numerals the participants use to describe their sensations may be treated like rational numbers in mathematics and therefore can be taken at face value.

Narens (1996) may be credited with making these assumptions explicit (they were never actually tested by Stevens or his followers) and with formulating mathematical axioms that make it possible to validate them. He distinguishes between behavioral and cognitive axioms: The untestable cognitive axioms describe the relationship between the participant’s unobservable sensation of a stimulus’ intensity and its numerical representation. The behavioral axioms characterize the participant’s behavior in a scaling experiment and relate their numerical representation to the number words used to describe the stimulus’ intensity. In contrast to the cognitive axioms, the behavioral axioms are empirically testable.

The behavioral axioms crucial for the assumption that participants are capable of estimating or producing ratios of stimulus intensities are monotonicity, commutativity, and multiplicativity. Their validity can be evaluated by analyzing data collected in magnitude or ratio production experiments (Luce, 2002). In the latter, when applied to the psychophysics of duration, the participant is instructed to adjust the duration of a comparison stimulus (such as w, x, y, z in the following) so that it stands in the ratio p, q, or r to the perceived duration of the standard stimulus t. The notation \((x,\mathbf{p},t)\) represents a participant’s adjustment x, which is perceived to last p times as long as the standard interval t, with the boldface letter referring to the number word used in the magnitude production instructions.

First of all, besides a number of technical axioms concerning the continuity of the physical stimulus values, the axiom of monotonicity (Augustin, 2008; Axiom 3.1 in Narens, 1996), also known as ordering, has to be tested. It is formulated as:

$$ \text{If} \ (x,\mathbf{p},t) \in E \ \text{and} \ (y,\mathbf{q},t) \in E, \ \text{then} \ p > q \Leftrightarrow x \succ y. $$
(2)

This means, if x has been adjusted to appear p times as long (×p, in the following) as the standard t and another adjustment y is q times as long (×q, in the following) as the standard, and p is greater than q, then the adjusted duration x must be longer than the duration y. According to Narens’ (1996) theory, if the axiom of monotonicity holds, it can be assumed that the perception of stimuli of the investigated modality occurs on a sensory continuum. It is a necessary condition not only for the subsequently elaborated axioms of commutativity and multiplicativity, but also fundamental to any scaling at all, because even the categories of an ordinal scale can be arranged in an ascending or descending (and therefore monotonic) order. Furthermore, the axiom of commutativity can be evaluated, which is formulated as:

$$\begin{array}{@{}rcl@{}} \text{If} \ (x,\mathbf{p},t) \in E, (z,\mathbf{q},x) \in E, (y,\mathbf{q},t) \in E, \ \text{and}\ \\ (w,\mathbf{p},y) \in E, \ \text{then} \ z = w. \end{array} $$
(3)

In other words, commutativity holds, if the stimulus duration resulting from a successive production sequence ×p×q is equal to the stimulus duration resulting from successive adjustments with interchanged ratio production factors ×q×p. For example, doubling the duration of a standard tone and then tripling the outcome should result in the same final duration as tripling the standard duration first and then doubling the result. Narens showed that if the axiom of commutativity holds, it can be assumed that the participant perceives stimulus magnitudes of the investigated modality on ratio scale level. But even if a ratio scale of perception does exist, there is no evidence that the scale values used by the participants can be interpreted as scientific numbers. To show the latter, the axiom of multiplicativity has to be evaluated, which is formulated as:

$$ \text{If} \ (x,\mathbf{p},t) \in E, \ (z,\mathbf{q},x) \in E, \ \text{and} \ r = qp, \ \text{then} \ (z,\mathbf{r},t) \in E. $$
(4)

In other words, the multiplicativity property holds, if the stimulus duration resulting from the successive adjustments ×p×q is equal to the stimulus duration resulting from a single adjustment ×r with r being the mathematical product of p and q. For example, doubling the duration of a standard tone and then tripling this adjustment should result in the same final duration as making the standard six times as long in a single adjustment. If the axiom of multiplicativity holds, the numerals as used by the participants to describe the perceived stimulus magnitudes can be taken at face value.

During the last decade, the axiomatic approach to magnitude scaling pioneered by Narens (1996) has been extended by Luce and colleagues (Luce, 2002, 2008; Luce et al., 2010). One recent interpretation of the axiom of multiplicativity argues that a veridical interpretation of numbers, and thus the validity of multiplicativity, is not mandatory for direct ratio scaling: If the axiom of commutativity is satisfied, thus implying ratio scalability for the modality studied, it may be said that the participants interpret the numbers as some ratio, though not the exact ratio stated in the instructions.
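The difference between the two axioms can be made concrete with a small simulation. This is only a sketch under assumptions that go beyond the text: perceived duration is taken to follow the power law of Eq. 1, and the participant is assumed to treat the number word p as some subjective value W(p), so that a production x satisfies φ(x) = W(p)·φ(t). The names `produce`, `veridical`, and `distorted`, as well as the particular distortion, are hypothetical.

```python
def produce(t, p, beta, W):
    """Duration x with phi(x) = W(p) * phi(t), given phi(t) = alpha * t**beta."""
    return t * W(p) ** (1.0 / beta)


def commutativity_gap(t, p, q, beta, W):
    """Difference between the outcomes of the sequences xp xq and xq xp."""
    z = produce(produce(t, p, beta, W), q, beta, W)
    w = produce(produce(t, q, beta, W), p, beta, W)
    return z - w


def multiplicativity_gap(t, p, q, beta, W):
    """Difference between the sequence xp xq and a single x(p*q) production."""
    z = produce(produce(t, p, beta, W), q, beta, W)
    direct = produce(t, p * q, beta, W)
    return z - direct


def veridical(p):
    """Number words taken at face value."""
    return float(p)


def distorted(p):
    """Hypothetical compressive interpretation of number words, W(1) = 1."""
    return 1.0 + 0.8 * (p - 1.0)


for W in (veridical, distorted):
    print(W.__name__,
          commutativity_gap(100.0, 2, 3, 0.9, W),
          multiplicativity_gap(100.0, 2, 3, 0.9, W))
```

In this model, the commutativity gap vanishes (up to rounding) for any W, because the order of the two multiplications does not matter, whereas the multiplicativity gap vanishes only if W(pq) = W(p)W(q), which the hypothetical `distorted` function does not satisfy.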

The axiomatic framework has been applied to a number of psychophysical dimensions such as loudness (Ellermeier and Faulhammer, 2000; Steingrimsson and Luce, 2005a, b; Zimmer, 2005), area (Augustin and Maier, 2008), brightness (Steingrimsson, 2011; Steingrimsson et al., 2012), and, most recently, pitch (Kattner and Ellermeier, 2014). Duration perception, however, has not been studied in this axiomatic manner.

Therefore, the aim of the first experiment was to investigate whether the fundamental axioms of Narens’ theory hold for duration perception, i.e., whether participants are capable of processing durations on a ratio scale. This was tested in a ratio production experiment in which participants were required to adjust the duration of a comparison tone to specific positive integer ratios of two different standard durations (\(t_1 = 100\) ms, \(t_2 = 400\) ms).

The experiment employed a method that is typical for axiomatic testing, requiring the participant to adjust the duration of the comparison interval iteratively until it subjectively matches the desired ratio. In contrast to one-shot estimations (e.g., “Turn the sound off as soon as it is p times as long.”), which seem to be less cumbersome, this procedure does not introduce a bias due to motor latency. Furthermore, the initial duration of the comparison was randomly chosen to fall above or below the estimated ‘target duration’ in order to counterbalance trials in which the participants had to shorten or lengthen the comparison tone.

In the second experiment, two further axioms, weak multiplicativity and invertibility (Augustin, 2008), were tested to provide evidence for the psychological meaningfulness of scaling perceived duration, i.e., whether the size of the power law exponent for duration perception remains unaffected by the size of the standard used in ratio production. Again, participants had to adjust the duration of auditory intervals to a certain ratio with respect to a standard tone (\(t_3 = 600\) ms). This time, fractions as well as integers were used as ratio production factors.

Experiment 1

Method

Participants

Ten participants took part in the experiment. The sample consisted of four female and six male participants with a median age of 24 years ranging from 21 to 56 years. They did not have any prior knowledge of the hypotheses being tested. The experiment was conducted individually in a double-walled sound-attenuated listening chamber (IAC).

Stimuli and apparatus

All stimuli were sine waves of the same frequency of 440 Hz (A4 standard pitch) converted with a sampling rate of 44.1 kHz, and with 16-bit resolution. Their duration varied as a result of the protocol and contained 10-ms cosine-shaped rise-and-decay ramps to avoid unwanted switching transients. The standards were of fixed durations of 100 and 400 ms or of individual duration generated according to the adjustments of the participants. The comparison stimuli varied accordingly; their initial length was randomly chosen between one and ten times the duration of the corresponding standard. The tones were preset to a comfortable sound pressure level of 65 dB SPL. After passing through a headphone amplifier (Behringer HA 800 Powerplay PRO 8), the tones were presented diotically via headphones (Beyerdynamics DT 990 PRO). The experiment was programmed in MATLAB using the PsychToolbox-package by Brainard (1997) and Pelli (1997).

Procedure

In the first time-production experiment, the participants had to complete 264 trials altogether. They were divided into four identical test sessions taking place on different days. Each session was composed of three blocks of 22 trials, resulting in a total of 66 trials, respectively. After the completion of a block, the participants were allowed to take a short break. The recording of the data started after the participants had become familiar with the task during three practice trials at the beginning of each session.

Each trial consisted of two duration intervals marked by continuous tones, which were presented successively. The first tone, or standard, was of fixed duration, either 100 or 400 ms, while the second tone, or comparison, was of variable starting duration and could be adjusted by the participants. The tones were separated by a fixed silent inter-stimulus interval of 500 ms. During the presentation of both tones, a yellow numeral p (p=1,2,3,4,6,8) was displayed in the upper part of the screen, which was the instruction for the participant to adjust the duration of the second tone so that it was perceived to be p-times as long as the first tone. The adjustments could be made by pressing either the left cursor key for decreasing or the right cursor key for increasing the duration of the comparison tone. The steps for incrementing/decrementing duration were \(\frac {1}{20}\) of the standard interval, that is 5 ms for the standard of 100 ms and 20 ms for the standard of 400 ms. To increase step size, participants could press the shift key together with the cursor key resulting in steps being ten times as long as the original steps, that is 50 ms or 200 ms, respectively.

The participants were asked to adjust the duration of the comparison tone step by step, i.e., after each key press response, the current standard and the altered comparison were replayed and the instruction was presented again. The participants were instructed to adjust the comparison until they were satisfied with the result and to eventually press the enter key to register the final value. The next trial started after an inter-trial interval of 2,000 ms. There was no time restriction to performing the task.

In each of the blocks, the standards of \(t_1 = 100\) ms and \(t_2 = 400\) ms were combined with the ratio production factors p = 1, 2, 3, 4, 6, and 8. These trials are called basic trials and their outcomes are primarily used to test monotonicity. The testing of commutativity and multiplicativity is based on the outcomes of so-called successive trials, in which the individual adjustments produced by the participants in the basic trials were used as standards. They were combined with the ratio production factors q = 2, 3, and 4. Each type of adjustment was made 12 times (i = 1, …, 12). In the following, the basic adjustments are indicated by \((x_i,\mathbf{p},t)\). As an example, \((x_3,\mathbf{2},100)\) is the third (i = 3) adjustment of a trial with a ratio production factor p = 2 and a standard stimulus t = 100 ms.

In the successive trials, for each participant, the individual basic adjustments \((x_i,\mathbf{p},100)\) and \((x_i,\mathbf{p},400)\) were used as standard stimuli. More precisely, the new standards \((x_i,\mathbf{2},100)\) and \((x_i,\mathbf{2},400)\), derived from a basic doubling trial, had to be made q = 2, 3, and 4 times as long. Likewise, the standards \((x_i,\mathbf{3},100)\), \((x_i,\mathbf{4},100)\), \((x_i,\mathbf{3},400)\), and \((x_i,\mathbf{4},400)\) were subsequently doubled (q = 2). The procedure is illustrated in Fig. 1: The arrows starting from the x-axis depict the basic adjustments, whereas the arrows starting from the arrowheads depict the successive adjustments.

Fig. 1 Ratio productions made by N = 10 participants in Experiment 1: Arithmetic means and standard deviations of basic and successive trials for \(t_1 = 100\) ms (top) and \(t_2 = 400\) ms (bottom). Adjustments connected by dashed lines should coincide if commutativity and multiplicativity hold

On the whole, there were 22 different types of adjustments: Each of the two standard stimuli was paired with each of the six ratio production factors p=1,2,3,4,6, and 8, resulting in 12 types of basic ×p adjustments. In addition, each standard was combined with each of the five pairs (p,q)=(2,2),(2,3),(2,4),(3,2), and (4,2), resulting in ten different types of successive ×p×q adjustments. Each type of adjustment was made 12 times, resulting in 264 trials per participant.
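The trial counts reported above can be re-derived with a short enumeration. This is merely a bookkeeping sketch; the variable names are illustrative, while the standards, factors, and repetition count are those given in the text.

```python
from itertools import product

STANDARDS_MS = (100, 400)
BASIC_FACTORS = (1, 2, 3, 4, 6, 8)
SUCCESSIVE_PAIRS = ((2, 2), (2, 3), (2, 4), (3, 2), (4, 2))
REPETITIONS = 12

# A trial type is a standard plus the sequence of ratio production factors.
basic_types = [(t, (p,)) for t, p in product(STANDARDS_MS, BASIC_FACTORS)]
successive_types = [(t, pair) for t, pair in product(STANDARDS_MS, SUCCESSIVE_PAIRS)]
all_types = basic_types + successive_types

total_trials = len(all_types) * REPETITIONS
print(len(basic_types), len(successive_types), len(all_types), total_trials)
```

The total agrees with the session structure described in the Procedure: four sessions of three blocks with 22 trials each.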

Results and discussion

Overall results

Overall mean adjustments for the N = 10 participants are depicted in Fig. 1, in the upper panel for the shorter standard duration of \(t_1 = 100\) ms and in the lower panel for the longer standard of \(t_2 = 400\) ms. The mean number of adjustments made in one trial was M = 13.3. In 66 % of all trials, participants made fine-step adjustments of duration. In further analyses, after a brief descriptive overview, the data sets of each participant are treated separately.

Monotonicity

The axiom of monotonicity was tested to confirm that duration perception of short intervals (100 to 4000 ms) occurs on a sensory continuum, i.e., that unequal temporal intervals are perceived as such and can be discriminated, respectively. From a descriptive point of view, the axiom of monotonicity seems to hold, because, as Fig. 1 shows, the mean outcome durations increase for increasing ratio production factors.

For the inferential statistics, two one-factor, repeated-measures analyses of variance (ANOVAs) tested the effect of the ratio production factor on the mean individual duration adjustments produced in basic trials only, separately for the two standards. For the standard \(t_1 = 100\) ms, the ANOVA yielded significant differences among the different ratio production factors, F(5,45) = 306.9, p < .001, η² = .97. A post hoc Tukey HSD test was conducted to check whether the mean adjustments of a pair of two adjacent ratio production factors are similar (∼). The results showed that all pairs of ratio production factors, (x,1,100)∼(x,2,100), (x,2,100)∼(x,3,100), (x,3,100)∼(x,4,100), (x,4,100)∼(x,6,100), and (x,6,100)∼(x,8,100), differ significantly at p < .001. For the standard \(t_2 = 400\) ms, a comparable ANOVA also yielded significant variations among the ratio production factors, F(5,45) = 140.7, p < .001, η² = .94. Post hoc Tukey HSD comparisons revealed significant differences for all pairs of ratio production factors: p < .001 for (x,4,400)∼(x,6,400) and (x,6,400)∼(x,8,400), p < .01 for (x,1,400)∼(x,2,400) and (x,3,400)∼(x,4,400), and p < .05 for (x,2,400)∼(x,3,400). Further analyses of variance containing the factors block and session revealed no main effects of these factors; the data thus give no indication of practice effects.

Furthermore, a graphical analysis based on cumulative sums of the adjustments made, as proposed by Augustin and Maier (2008), was conducted for each participant. The axiom of monotonicity requires that, for a fixed standard stimulus t and a fixed number of repetitions i, the inequality \(S(x_i,p,t) < S(x_i,q,t)\) holds whenever p < q, with S representing the sum of the duration adjustments x made up to the i-th trial. That is, the axiom of monotonicity holds if, for each standard t and each number i of repetitions (adjustments), the cumulative sums can be ordered by the ratio production factors used: \(S(x_i,1,t) < S(x_i,2,t) < S(x_i,3,t) < S(x_i,4,t) < S(x_i,6,t) < S(x_i,8,t)\). Thus, for each participant and both standards \(t_1\) and \(t_2\), the n = 12 outcome durations of each type of ×p adjustment were summed up successively across trials. The cumulative sums S(x,1,t), S(x,2,t), S(x,3,t), S(x,4,t), S(x,6,t), and S(x,8,t) of participant mg12, who is representative of the sample, are depicted in Fig. 2; the left panel shows the shorter and the middle panel the longer standard duration. Although the outcome durations of all trials n = 1 to 12 were summed up successively, only the cumulative sums in the range of trials n = 7 to 12 are plotted, in order to avoid inspecting the effects resulting from random influences for a small number of observations. Both graphs show that the curves for different ratio production factors never cross; for example, for the standard duration \(t_1\), each cumulated outcome duration for p = 2 is shorter than the corresponding cumulated outcome duration for p = 3. Thus, at no point in the sequence of trials is monotonicity violated, which provides a more rigorous test than a comparison of overall condition means would.
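The cumulative-sum criterion can be expressed as a short check. The following is a sketch rather than the authors' analysis code: the function names, the data layout (a dictionary mapping each ratio production factor to its adjustments in trial order), and the example durations are assumptions made for illustration.

```python
from itertools import accumulate


def cumulative_sums(adjustments):
    """Running sums S_1, S_2, ... of the adjusted durations in trial order."""
    return list(accumulate(adjustments))


def monotonicity_violations(productions, first_trial=7):
    """Return (trial, p, q) triples at which S(p) >= S(q) for adjacent p < q.

    productions maps each ratio production factor to the list of adjusted
    durations for one standard. Checking adjacent factors is sufficient,
    because a strict chain of inequalities implies the full ordering.
    """
    factors = sorted(productions)
    sums = {p: cumulative_sums(productions[p]) for p in factors}
    n_trials = min(len(s) for s in sums.values())
    violations = []
    for i in range(first_trial, n_trials + 1):
        for p, q in zip(factors, factors[1:]):
            if sums[p][i - 1] >= sums[q][i - 1]:
                violations.append((i, p, q))
    return violations


# Invented durations in ms for one standard, 12 repetitions per factor
example = {1: [100.0] * 12, 2: [190.0] * 12, 3: [280.0] * 12}
print(monotonicity_violations(example))
```

An empty list corresponds to curves that never cross in the plotted range of trials; the default `first_trial=7` mirrors the range inspected for Experiment 1.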

Fig. 2 Cumulative sums of the ratio productions made in Experiments 1 and 2. Each curve depicts the cumulative sums for a particular ratio production factor p as a function of the trial number (7 to 12, or 13 to 24, respectively, to minimize the effect of random influences for a ‘small’ number of repetitions). The left graph refers to the shorter standard duration (\(t_1 = 100\) ms), the middle one refers to the longer standard duration (\(t_2 = 400\) ms), both produced by participant mg12 and showing no violations of monotonicity, representative of the outcome of Experiment 1. The right graph refers to Experiment 2 and a standard of \(t_3 = 600\) ms, showing magnitude productions by participant kr14 and violations of monotonicity for basic trials with \(\mathbf {p} = \frac {1}{3}\) and \(\mathbf {p} = \frac {1}{2}\)

Commutativity

The axiom of commutativity provides evidence for the assumption that duration perception is based on a ratio scale. For testing commutativity, adjustments produced in successive trials are analyzed: Commutativity is taken to be satisfied, if a successive ×p×q adjustment is statistically indistinguishable from a successive ×q×p adjustment, i.e., if both types of raw adjustments emanate from the same distribution. For a descriptive analysis, Fig. 1 shows that most of the corresponding pairs of successive adjustments ×p×q and ×q×p which are connected by dashed lines coincide, indicating that the axiom holds for the overall means.

For individual inferential testing, nonparametric Mann–Whitney U tests (two-tailed, α=.1) for both pairs (p,q)=(2,3) and (2,4) and both standards were conducted resulting in four tests per participant and a total of 40 tests for the entire sample.

A standard significance level of α=.1 was used, because the aim of the analysis was to accept a statistical null hypothesis, thus making it harder to assume that an axiom holds for a particular comparison. A correction for multiple comparisons was not applied for the same reason.
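The decision rule can be sketched as follows. Note that this is not the authors' procedure: it uses the large-sample normal approximation to the Mann–Whitney U test (midranks for ties, no tie or continuity correction) so that it runs with the standard library only, and the function names are hypothetical. An exact or library-based test would be preferable for the actual sample sizes of 12 adjustments per condition.

```python
import math


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic of sample a, approximate two-sided p-value).
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    n1, n2 = len(a), len(b)
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_value


def axiom_retained(a, b, alpha=0.1):
    """The axiom is retained only if the two samples do not differ at alpha."""
    _, p_value = mann_whitney_u(a, b)
    return p_value >= alpha
```

Because the axiom is retained when the null hypothesis is not rejected, raising α from .05 to .1 enlarges the rejection region and therefore makes the test stricter with respect to the axiom.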

For the entire sample, five violations in the 40 tests of the axiom of commutativity were observed (compare Table 1). Four of the five violations were produced by two participants (ml06, mn21), both for the standard of 100 ms. For seven of ten participants, the axiom of commutativity held in all cases.

Table 1 Experiment 1: Empirical evaluation of the commutative property for both standard stimuli with \(t_1 = 100\) ms and \(t_2 = 400\) ms for each of the N = 10 participants

Multiplicativity

The axiom of multiplicativity was tested to check whether the numerals as used by the participants can be taken at face value, i.e., whether there is a veridical transformation between perceived and mathematical numbers. For testing multiplicativity, the adjusted durations resulting from successive trials are compared with durations adjusted in basic trials: The axiom holds if the duration resulting from the successive ×p×q (or ×q×p, respectively) adjustment is statistically indistinguishable from the basic ×r adjustment, with r = pq. Descriptively, Fig. 1 also shows that most of the pairs of successive adjustments ×p×q and ×q×p are commensurate with the corresponding ×r adjustments (with which they are connected by dashed lines), thus indicating that multiplicativity holds for the means of the entire sample.

The individual inferential statistics tested multiplicativity by conducting Mann–Whitney U tests (two-tailed, α=.1) for the three pairs (p,q)=(2,2), (2,3) and (2,4) and both standards, which results in six tests for each participant and a total of 60 tests for the entire sample. Altogether, 19 violations of 60 comparisons for the axiom of multiplicativity were observed (compare Table 2). For only two participants did the axiom of multiplicativity hold in all cases, whereas the other participants showed violations in one to five of six tests.

Table 2 Experiment 1: Empirical evaluation of the multiplicative property for both standard stimuli with \(t_1 = 100\) ms and \(t_2 = 400\) ms for each of the N = 10 participants

Model fitting procedure

Furthermore, linear regressions were computed for all participants and both standards to estimate the parameters α and β for the power law (\(\varphi(t) = \alpha t^{\beta}\)) as well as the parameters a and b for a simple linear function (\(\varphi(t) = a + bt\)). It was assumed that the individually adjusted durations of (x,p,100) and (x,p,400) are perceived to be p times as long as the standards, respectively. Thus, for the linear model, a linear regression of the ratio production factors p constituting the dependent variable on the individual adjustments constituting the independent variable was computed. For the power function, a linear regression was computed as well, with the logarithmically transformed ratio production factor p as the dependent variable and the logarithmically transformed individual adjustments serving as the independent variable.
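Both fits reduce to ordinary least squares, once on the raw values and once on their logarithms. The sketch below illustrates this with noise-free, invented data generated from a power function with β = 0.87; the function names and the data are assumptions, and the intercept of the log-log fit absorbs the duration of the standard.

```python
import math


def least_squares(xs, ys):
    """Ordinary least squares of ys on xs: returns (intercept, slope, R^2)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, sxy ** 2 / (sxx * syy)


def fit_linear(adjustments, factors):
    """Linear model p = a + b * x."""
    return least_squares(adjustments, factors)


def fit_power(adjustments, factors):
    """Power model p = alpha * x**beta, fitted as log p = log alpha + beta * log x."""
    log_alpha, beta, r2 = least_squares(
        [math.log(x) for x in adjustments],
        [math.log(p) for p in factors],
    )
    return math.exp(log_alpha), beta, r2


# Invented, noise-free productions for a 100-ms standard and beta = 0.87
factors = [1, 2, 3, 4, 6, 8]
adjustments = [100.0 * p ** (1.0 / 0.87) for p in factors]
alpha, beta, pow_r2 = fit_power(adjustments, factors)
a, b, lin_r2 = fit_linear(adjustments, factors)
print(beta, pow_r2, lin_r2)
```

With real data, the same two calls would be applied to each participant's adjustments and each standard separately, and the resulting R² values compared.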

The estimated parameters and squared correlation coefficients \(R^2\) for both the linear model and the power function and for both standards are shown in Table 3. The comparison between the linear and the power-function model shows that, for the short standard, the power-function model results in a slightly better fit (t(13.15) = 1.885, p = .082), explaining 4.7 % more of the variance. For the longer standard, the linear model seems to fit the data as well as the power-function model (t(15.96) = 0.735, p = .47), the latter explaining only 2.3 % more of the variance. Furthermore, the power function exponents estimated for the two standards differ significantly in size, t(11.58) = 3.67, p = .003. The exponent β of the power function yielded an average of \(\beta(t_1) = 0.87\) (β < 1 in all cases) for the shorter standard and \(\beta(t_2) = 1.02\) (β > 1 in 6 of 10 cases) for the longer standard duration. Both the linear and the power function indicate a reasonable fit to the data, with \(R^2\) ranging from 0.71 to 0.98 for the raw-data adjustments.

Table 3 Experiment 1: Estimated parameters and squared correlation coefficients for the linear model and the power function for both standard stimuli with \(t_1 = 100\) ms and \(t_2 = 400\) ms and each of the N = 10 participants

Summary

The analyses showed that the axiom of monotonicity was not violated, i.e., the participants were able to produce monotonically ordered adjustments according to the different ratio production factors. The axiom of commutativity was violated in 12.5 % of all tests, while multiplicativity was violated in 32 % of all tests. The estimated power function exponents for the two standards clearly differ in value, that is, the estimation of the parameters of the power law seems to depend on the duration of the standard, and, for the longer standard, seems to be close to 1 resulting in a simple linear function.

Experiment 2

The previous experiment investigated the axioms of monotonicity, commutativity, and multiplicativity for the perception of duration to test the validity of assumptions basic to Stevens’ direct scaling methods. Since the axiom of commutativity was found to be valid in 87.5 % of all cases, it can be assumed that participants’ processing of short durations in a ratio production experiment is based on a ratio scale. However, it might be difficult to describe the relationship between the mathematical numbers provided in the experimental instruction and the numbers as interpreted by the participants, because the axiom of multiplicativity held in only 68 % of the tests, i.e., in roughly a third of the tests the participants did not appear to process the numbers at their face value. Comparisons of the estimated exponents of the power functions describing the relationship between physical and perceived duration yielded significantly different exponents for the two standard durations employed.

The observation that the two different standard durations used in Experiment 1 result in diverging exponents has traditionally been classified as a context effect. In the domain of psychophysical scaling, several types of context effects have been described: Besides the stimulus range used in the experiment (Garner, 1954; Ward et al., 1996), the numerical examples given in the experimental instruction (Robinson, 1976) and the number values assigned to the standard stimuli (Beck and Shaw, 1965), or even the entire experimental context might have an influence on the size of the exponent. Therefore, the psychological meaningfulness of the exponent has been called into question (Lockhead, 1992). In contrast to this point of view, other investigators have argued that finding the ‘true’ exponent is still possible (Teghtsoonian and Teghtsoonian, 2003; Teghtsoonian, 2012).

However, in the axiomatic-measurement literature, this problem has been framed as a more fundamental issue of meaningfulness (Stevens, 1946; Luce, 1978; Narens, 1981). For each power function describing the relationship between the physical intensity of a stimulus and its perceived magnitude, one might ask whether the parameters of this function are psychologically meaningful, i.e., invariant under certain transformations. Note that the exponent of the power function depends on the sensory continuum, the participant’s individual perception—which does not exert a very strong influence (Teghtsoonian and Teghtsoonian, 1983)—and potential contextual influences as mentioned above. Furthermore, it might also vary under changes of the physical measurement scale f (Narens and Mausfeld, 1992) and the size of the standard (Augustin, 2008) used in an experiment. If, for example, the measurement scale f is transformed to another scale g measuring the same physical intensity as f and if these scales are neither log-interval nor ratio scales, then it must be assumed that the choice of the scale has an influence on the exponent of the power function. Thus, the obtained exponent has no psychological relevance, or is not meaningful.

But even if the exponent of the power function is invariant under changes of the physical stimulus scale applied in the experiment, it has to be investigated whether the exponent is invariant under changes of the standard stimulus t that forms the basis for the estimates or adjustments made by the participants. Augustin (2008) suggests a mathematical method to examine the dependency on the standard by postulating two further axioms that can be evaluated empirically, namely weak multiplicativity and invertibility. The axiom of weak multiplicativity is formulated as:

$$\begin{array}{@{}rcl@{}} \text{For} \ t, y, z \in X \text{ and a real number} \ p > 0,\\ (y,\mathbf{p}, t) \in E, (z,\mathbf{1/p},y) \in E \Rightarrow (z, \mathbf{1}, t) \in E. \end{array} $$
(5)

That means, weak multiplicativity holds, if the stimulus intensity resulting from successive adjustments \(\times \mathbf {p} \times \frac {\mathbf {1}}{\mathbf {p}}\) is equal to the stimulus intensity resulting from the basic adjustment with p=1. For example, doubling the duration of the standard and then halving this adjustment should result in the same final duration as matching the duration of the comparison interval to that of the standard. Weak multiplicativity is very similar to Narens’ axiom of multiplicativity. But while multiplicativity has to hold for all cases p>0 and q>0, weak multiplicativity is a special case of multiplicativity with \(\mathbf {q} = \frac {\mathbf {1}}{\mathbf {p}}\), i.e., even if the axiom of multiplicativity is violated, the axiom of weak multiplicativity might hold.

The axiom of invertibility is formulated as:

$$ \text{For} \ t, y \in X \ \text{and} \ \mathbf{p} > 0, (y,\mathbf{p},t) \in E \Leftrightarrow (t, \mathbf{1/p}, y) \in E. $$
(6)

In other words, invertibility holds if the intensity of a stimulus resulting from successive adjustments \(\times \mathbf {p} \times \frac {\mathbf {1}}{\mathbf {p}}\) is equal to the stimulus intensity of the standard t or, put simply, if it is possible to undo a ×p adjustment by asking the participant to produce its reciprocal \(\times \frac {\mathbf {1}}{\mathbf {p}}\). Weak multiplicativity and invertibility thus differ in the reference against which the outcome of the successive adjustment \(\times \mathbf {p} \times \frac {\mathbf {1}}{\mathbf {p}}\) is compared: the ×1 adjustment in the first case and the actual duration of the standard in the second. As Augustin (2008) stated, both axioms are necessary and sufficient conditions for the exponent of Stevens’ power law to be invariant under changes of the standard t.
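The two reference values can be kept apart with a small helper. This is only an illustrative sketch: the function name, the dictionary keys, and the durations are invented, and in an actual analysis the comparison would be made on distributions of adjustments rather than on single values.

```python
def successive_deviation(successive_ms, standard_ms, unity_adjustment_ms):
    """Deviation of a xp x(1/p) outcome from the two reference durations.

    Weak multiplicativity compares the outcome with the participant's own
    x1 adjustment; invertibility compares it with the physical standard.
    """
    return {
        "weak_multiplicativity": successive_ms - unity_adjustment_ms,
        "invertibility": successive_ms - standard_ms,
    }


# Hypothetical durations in ms: x2 x(1/2) outcome, standard, and x1 adjustment
print(successive_deviation(620.0, 600.0, 630.0))
```

If a participant's ×1 adjustment deviates from the physical standard, the two axioms make different predictions for the same successive adjustment, which is why they have to be tested separately.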

However, previous magnitude production experiments using ratio production factors p<1<q suggest that fractions and integers are processed differently: A study by Luce, Steingrimsson and Narens (2010) showed the axiom of commutativity to be violated for the N=2 participants tested when fractions and integer ratios were mixed. Steingrimsson and Luce (2007) found comparable discrepancies for the axiom of multiplicativity for N=3 participants in an experiment on loudness production. Augustin (2008) explicitly tested the two crucial axioms of weak multiplicativity and invertibility and found them to be violated for all N=10 participants who performed ratio productions of the area of visually presented circles.

For the perception of duration, numerous experiments to determine the exponent of Stevens’ power law were conducted using standard durations ranging from 50 ms to 300 s (Eisler, 1976). Although the exponents derived from these experiments vary between β=0.23 and 1.36, it has not been sufficiently investigated whether these differences may be caused by the use of different standards. A study by Kane and Lown (1986) used standard durations of 30 and 180 s and did not find the length of standard duration to affect the size of the power law exponent. Eisler’s (1976) review of 111 studies on duration perception, however, reported lower exponents obtained from experiments using standard durations shorter than 500 ms, but did not specify this observation in more detail.

Because, in contrast, even the exponents derived from Experiment 1, using standards of \(t_{1}=100\) and \(t_{2}=400\) ms, differ significantly in size, \(\beta(t_{1})=0.87\) and \(\beta(t_{2})=1.02\), it is plausible to investigate the meaningfulness of the power law exponent for the perception of duration by means of Augustin’s (2008) additional axioms.

Methods

Participants

Fifteen participants were tested in the experiment. The sample consisted of 14 female participants and one male with a median age of 23 years, ranging from 23 to 45 years. They were all students of psychology, but did not have any prior knowledge of the current hypotheses. Again, testing was conducted individually in a double-walled sound-attenuated listening chamber (IAC).

Stimuli and apparatus

Stimuli were generated using the same apparatus and signal parameters as in Experiment 1. The fixed standard, however, had a duration of 600 ms, while comparison stimuli varied in duration; their initial length was randomly chosen between 200 and 1,800 ms.

Procedure

In Experiment 2, the participants had to complete 216 trials altogether. The trials were divided into two identical test sessions taking place on two different days. Each session was composed of 12 blocks of nine trials, each. After three practice trials, data were recorded. After having completed three blocks, the participants could take a short break.

As in Experiment 1, participants had to adjust the comparison interval, separated from the standard (see Footnote 1) by an inter-stimulus interval of 500 ms, according to a certain ratio production factor p \((\mathbf {p} = \frac {1}{3},\frac {1}{2}, 1, 2, 3)\) presented on the screen. To increase or decrease the duration of the comparison interval, participants had to press the appropriate cursor key, either in small (20 ms) or in large steps (200 ms). Again, both tones were replayed after each keystroke, with the comparison tone having changed in duration.

In each of the 24 blocks, the standard of \(t_{3}=600\) ms was combined with the ratio production factors \(\frac {1}{3},\frac {1}{2}, 1, 2, \) and 3 resulting in five types of ×p adjustments and 120 basic trials altogether. In the successive trials, the individual basic adjustments \((x_{i},\mathbf {p},600)\) were used as standard stimuli, i.e., the new standard \((x_{i},\frac {1}{3},600)\) had to be adjusted using the ratio production factor q=3, the standard \((x_{i},\frac {1}{2},600)\) was combined with the ratio production factor q=2, the standard \((x_{i},2,600)\) was combined with the ratio production factor \(\mathbf {q}=\frac {1}{2}\) and the standard \((x_{i},3,600)\) had to be adjusted with the ratio production factor \(\mathbf {q}=\frac {1}{3}\). The four types of ×p×q adjustments resulted in 96 successive trials altogether.
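The trial structure described above can be sketched in a few lines of Python. This is only an illustration of the counts given in the text; the names `BASIC_FACTORS`, `block_trials`, and the tuple layout are hypothetical, and the randomization of trial order within a block is not modeled.

```python
from fractions import Fraction

N_BLOCKS = 24  # 2 sessions x 12 blocks
BASIC_FACTORS = [Fraction(1, 3), Fraction(1, 2), Fraction(1), Fraction(2), Fraction(3)]


def block_trials():
    """Return the nine trials of one block as (kind, p, q) tuples."""
    # Five basic x p adjustments made relative to the 600-ms standard.
    basic = [("basic", p, None) for p in BASIC_FACTORS]
    # Four successive trials: each basic outcome with p != 1 becomes the new
    # standard and is adjusted with the reciprocal factor q = 1/p.
    successive = [("successive", p, 1 / p) for p in BASIC_FACTORS if p != 1]
    return basic + successive


trials = [trial for _ in range(N_BLOCKS) for trial in block_trials()]
n_basic = sum(1 for kind, _, _ in trials if kind == "basic")
n_successive = sum(1 for kind, _, _ in trials if kind == "successive")
print(n_basic, n_successive, len(trials))
```

Using `Fraction` keeps the reciprocal factors exact, so the product p·q of every successive trial equals 1 without rounding error.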

Results and discussion

Overall results

The overall means based on all (N=15) participants are depicted in Fig. 3. The mean number of adjustments made in one trial was M=7.4. In 35 % of the adjustments, participants were using large steps to reach their final decision. In further analyses, after a brief descriptive overview, the data sets of each participant are treated separately.

Fig. 3 Ratio productions obtained in Experiment 2: Arithmetic means and standard deviations of basic and successive trials for \(t_{3}=600\) ms and (N=11) participants. Adjustments connected by dashed lines should coincide, if weak multiplicativity holds

Monotonicity

An ANOVA on the duration adjustments yielded significant differences between the different ratio production factors, F(4,56)=200.6, p<.001, \(\eta^{2}=.93\). Post hoc Tukey HSD comparisons revealed significant differences (p<.001) for all but one pair of ratio production factors, i.e., \((x,\frac {1}{3},600) \sim (x,\frac {1}{2},600)\), p=.55.

Furthermore, a graphical analysis based on cumulative sums was applied. As suspected from the results of the Tukey test, 4 of 15 participants exhibited violations of monotonicity in their adjustments of \(\times \frac {1}{3}\) and \(\times \frac {1}{2}\). An example is shown in the right panel of Fig. 2 for participant kr14, whose lines for \(\times \frac {1}{3}\) and \(\times \frac {1}{2}\) are at the same level or even cross. These four participants were excluded from further analyses.

Weak multiplicativity

The axiom of weak multiplicativity is satisfied, when the outcome of the \(\times \mathbf {p} \times \frac {\mathbf {1}}{\mathbf {p}}\) adjustment is statistically indistinguishable from the duration of the ×1 adjustment. This axiom was tested by performing nonparametric Mann–Whitney U tests (two-tailed, α=.1) comparing the outcome of the four combinations \((p, q) = (\frac {1}{3}, 3)\), \((\frac {1}{2}, 2)\), \((2, \frac {1}{2})\), and \((3, \frac {1}{3})\) with that of the ×1 adjustment. That was done individually for each of (N=11) participants, resulting in a total of 44 tests. For the entire sample, 24 violations in 44 tests of the axiom of weak multiplicativity were observed. Of these, 18 violations in 22 tests were produced in trials with \((p, q) = (\frac {1}{3}, 3)\) and \((p, q) = (\frac {1}{2}, 2)\), while in trials with \((p, q) = (2, \frac {1}{2})\) and \((p, q) = (3, \frac {1}{3})\), only six violations in 22 tests were found; compare Table 4, left column.

Table 4 Experiment 2: Empirical evaluation of weak multiplicativity and invertibility for the standard of \(t_{3}=600\) ms duration for each (N=11) participant
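A single test of weak multiplicativity for one participant might look like the following sketch. It assumes a normal approximation to the Mann–Whitney U statistic with tie correction and without continuity correction, which may differ from the test variant actually used here; the function name `mann_whitney_u` and all sample values are hypothetical and are not data from the experiment.

```python
import math


def mann_whitney_u(a, b):
    """Two-tailed Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U for sample a, p value).
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all observations identical: no evidence of a difference
        return u_a, 1.0
    z = (u_a - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u_a, p_value


# Hypothetical adjustments (ms), 24 per condition as in the 24 blocks.
times_one = [580 + 20 * (i % 5) for i in range(24)]    # x 1 adjustments
successive = [700 + 20 * (i % 5) for i in range(24)]   # x 1/3 x 3 outcomes
u, p = mann_whitney_u(successive, times_one)
violated = p < 0.1  # alpha = .1, two-tailed
print(u, violated)
```

With 24 adjustments per condition the normal approximation is usually considered adequate; for smaller samples an exact test would be the more cautious choice.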

Invertibility

The axiom of invertibility is satisfied, when the final outcome of the successive \(\times \mathbf {p} \times \frac {\mathbf {1}}{\mathbf {p}}\) adjustments is statistically indistinguishable from the duration of the standard, \(t_{3}=600\) ms. By conducting nonparametric Mann–Whitney U tests (two-tailed, α=.1), it was tested whether the duration adjustments of the successive trials with \((p, q) = (\frac {1}{3}, 3)\), \((\frac {1}{2}, 2)\), \((2, \frac {1}{2})\), and \((3, \frac {1}{3})\) may be produced by distributions with μ=600 ms. Tests were performed separately for the four combinations and individually for each of (N=11) participants, resulting in a total of 44 tests. For the entire sample, 25 violations in 44 tests of the axiom of invertibility were observed. Of these, 20 violations in 22 tests were produced in trials with \((p, q) = (\frac {1}{3}, 3)\) and \((p, q) = (\frac {1}{2}, 2)\), while in trials with \((p, q) = (2, \frac {1}{2})\) and \((p, q) = (3, \frac {1}{3})\), only five violations in 22 tests were found; compare Table 4, right column.

Model fitting procedure

Furthermore, regressions were computed for all participants to estimate the parameters for a linear psychophysical function as well as for a power function. The estimated parameters and squared correlation coefficients \(R^{2}\) for both linear model and power function are shown in Table 5. The estimation of the exponent β of the power function revealed a β>1 in 9 of 11 cases with an average of β=1.16. The comparison between the two models shows no significant difference in their goodness of fit (t(18.62)=0.058, p=.53), but they both explain considerably less variance, 78 %, than the models fitted in Experiment 1.

Table 5 Experiment 2: Estimated parameters and squared correlation coefficients for linear model and power function for the standard stimulus of \(t_{3}=600\) ms and each (N=11) participant
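The regression procedure is not spelled out in detail, so the following is only one possible way to estimate the power function exponent from ratio productions. It assumes that Eq. 1 holds and that the ratio production factor p is taken at face value, so that \(\log p = \beta \log x - \beta \log t\) for a produced duration x and a standard t; β is then the least-squares slope of log p on log x. The function name `fit_power_exponent` and the noise-free synthetic data are hypothetical.

```python
import math


def fit_power_exponent(factors, productions):
    """Least-squares fit of log(p) = slope * log(x) + intercept.

    Returns (slope, intercept, r_squared); the slope estimates beta.
    """
    xs = [math.log(x) for x in productions]
    ys = [math.log(p) for p in factors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Synthetic, noise-free productions for a 600-ms standard and beta = 1.16,
# generated from x = t * p**(1 / beta); these are not data from the study.
standard_ms = 600.0
beta_true = 1.16
factors = [1 / 3, 1 / 2, 1, 2, 3]
productions = [standard_ms * p ** (1 / beta_true) for p in factors]

beta_hat, intercept, r_squared = fit_power_exponent(factors, productions)
print(round(beta_hat, 3), round(r_squared, 3))
```

Because the synthetic productions follow the model exactly, the fit recovers the generating exponent; with real adjustments the squared correlation would fall below 1.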

Summary

The analyses showed that the axiom of monotonicity was violated by four participants, i.e., these participants were not able to produce monotonically ordered adjustments for the ratio production factors \(\mathbf {p} = \frac {\mathbf {1}}{\mathbf {2}}\) and \(\mathbf {p} = \frac {\mathbf {1}}{\mathbf {3}}\). The axiom of weak multiplicativity was violated in 55 % of all tests. The axiom of invertibility showed a comparable violation rate of 57 %. For \(\times \frac {\mathbf {1}}{\mathbf {p}} \times \mathbf {p}\) adjustments, both axioms were violated more often (82 % and 91 %, respectively) than for \(\times \mathbf {p} \times \frac {\mathbf {1}}{\mathbf {p}}\) adjustments (27 % and 23 %, respectively).
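The percentages in this summary follow directly from the violation counts reported for the two axioms, as the short sketch below shows. The dictionary layout and the labels `fraction_first` and `integer_first` are illustrative names only; the counts are those given in the text.

```python
# (violations, tests) per axiom and order of the two adjustments.
counts = {
    "weak multiplicativity": {"fraction_first": (18, 22), "integer_first": (6, 22)},
    "invertibility": {"fraction_first": (20, 22), "integer_first": (5, 22)},
}


def percent(violations, tests):
    """Violation rate in percent, rounded to the nearest integer."""
    return round(100 * violations / tests)


rates = {}
for axiom, by_order in counts.items():
    total_violations = sum(v for v, _ in by_order.values())
    total_tests = sum(n for _, n in by_order.values())
    rates[axiom] = {
        "overall": percent(total_violations, total_tests),
        **{order: percent(v, n) for order, (v, n) in by_order.items()},
    }

print(rates)
```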

General discussion

In two experiments, the present study examined the validity of a number of axioms from representational measurement theory for the ratio production of time intervals. These axioms are fundamental for determining whether subjective duration may be assumed to constitute a ratio scale, and how the numerical scale values obtained may be interpreted. Furthermore, they can confirm the psychological meaningfulness of the function describing the relationship between physical and subjective duration.

Axiomatic evaluation and model fitting

In Experiment 1, multiple analyses revealed that, with all ratio production factors p≥1, the axiom of monotonicity was corroborated, indicating that all participants were able to produce monotonically increasing durations in response to appropriate ratio instructions, thus satisfying a basic ordinal requirement for a scale.

The individual evaluation of commutativity and multiplicativity revealed large differences between participants: For some participants, such as mg12, ml16, and mw28, we found almost no axiom violations, whereas others (mh15, ml06) showed as many as five violations in ten tests. This finding implies that some participants were able to deal with the instructions of a ratio production experiment, i.e., they use the numbers presented in the experiment as they are requested to, whereas others were not.

The overall axiomatic evaluation showed the commutative property to hold for most participants (12.5 % violations), implying that, in general, they are capable of processing duration on a ratio scale. However, the multiplicative property was violated in 32 % of all tests, showing that the numerals as used by the participants or in the instructions to describe perceived duration cannot always be taken at face value. Thus, Narens’ (1996) axioms which are fundamental to Stevens’ direct scaling approach could be validated, in that a ratio scale of duration can be assumed, but there is no obvious way to derive the actual scale values.

The results for commutativity and multiplicativity of the present experiment are comparable with findings for other sensory continua. For the perception of area, Augustin and Maier (2008) reported violation rates of 12 % for the axiom of commutativity and 61 % for multiplicativity. Ellermeier and Faulhammer (2000) found commutativity to be violated in 11 % of all cases and violations of multiplicativity in 94 % of the tests, while Zimmer (2005) reported violation rates of 14 % and 89 %, with both studies examining the perception of loudness. For the perception of pitch, Kattner and Ellermeier (2014) found the axioms to be violated in 22 % and 33 % of all tests, respectively.

Furthermore, power function exponents fitted to the ratio productions made relative to the two different standards used in Experiment 1 turned out to differ significantly. Therefore, it was tested whether the dependency on the standard can be confirmed by axiomatic testing. The axioms of weak multiplicativity and invertibility, necessary and sufficient conditions for the invariance of the exponent of the power function under changes of the standard, were evaluated in Experiment 2.

The results show the crucial axioms of weak multiplicativity and invertibility to be violated in 55 % and 57 % of all cases, respectively, suggesting, as already assumed in Experiment 1, the power function exponent to depend on the size of the standard. From a scaling perspective, this might be construed as a context effect due to the use of a fallible method: ratio production. It might be argued that an unconstrained method using ‘no designated standard, no assigned modulus’ disposes of the influence of the standard simply by omitting it. But since there is no axiomatic framework to test this (one stimulus - one response) methodology for internal consistency, we appear to be stuck with ratio production (or estimation) for the time being.

The results of the present axiomatic evaluation are comparable with findings made in the perception of area, where violation rates of 70 % for the axiom of weak multiplicativity and 72.5 % for the axiom of invertibility were reported (Augustin, 2008). For the perception of loudness and pitch, weak multiplicativity and invertibility have not yet been evaluated.

Comparisons between a linear model and a psychophysical power function reveal both types of models to fit the data quite well, with comparably high proportions of variance explained. However, since Experiment 2 has shown that their estimated parameters depend on the size of the standard, the psychophysical functions fitted do not appear to be meaningful.

Implications

The results of the present experiments may help in designing further studies of duration scaling.

An interesting question—suggested by one of the reviewers—might be, whether the participants who performed ratio productions of duration without any axiom violations do so for other sensory modalities, as well. That might clarify whether full compliance with the axioms is due to a superior way of handling numbers in general or whether it is specific to a given modality studied.

For successive adjustments, a systematic bias as reported in other studies was found: The final adjustments reached in successive trials, e.g., ×2×3, often exceeded the adjustments made in corresponding basic trials, e.g., ×6. This pattern seems to be systematic, since it was found for other sensory modalities as well, e.g., Augustin (2008) reported a similar bias for area adjustments. Ellermeier and Faulhammer (2000) found ×2×3 loudness adjustments to be systematically higher in level than ×6 adjustments, and Zimmer (2005) found the same pattern for loudness fractionation, i.e., the outcome of a \(\times \frac {1}{6}\) adjustment produced less of a level reduction than the outcome of successive \(\times \frac {1}{2} \times \frac {1}{3}\) adjustments. Steingrimsson and Luce (2007) investigated this bias and explained it by referring to a ‘numerical distortion’. They stated that the relationship between scientific numbers and numbers used by the participants is not linear but can be described by another function, e.g., by a power function with an exponent <1 causing successive adjustments to be greater in size than basic adjustments. So, if multiplicativity as tested in this experiment fails, so-called k-multiplicativity can be tested to examine whether the relationship between scientific numbers and numbers used by the participants follows a power function with a constant exponent.

Furthermore, in Experiment 2, the very basic axiom of monotonicity was found to be violated for four of 15 participants. These participants did not produce distinguishable duration adjustments for ratio production factors p<1, although their adjustments for p≥1 clearly follow a monotonic order. It can be ruled out that this finding might be due to a kind of floor effect, because in Experiment 1, even shorter durations were adjusted without any difficulty.

Furthermore, an unpublished experiment conducted in our laboratory investigated whether monotonicity, commutativity and multiplicativity can reliably be shown to hold for the fractionation of time intervals and revealed violation rates comparable to the results of Experiment 1. Thus, it might be assumed that the participants who violated monotonicity in Experiment 2 did not necessarily have difficulties in processing fractions, but might have misconceptions regarding the instructions of the mixed condition itself.

Additionally, a noteworthy order effect was observed when comparing the adjustments of \(\times \frac {1}{3} \times 3\) and \(\times \frac {1}{2} \times 2\) with the adjustments of \(\times 3 \times \frac {1}{3}\) and \(\times 2 \times \frac {1}{2}\): All successive adjustments ×p×q with p<1 preceding q>1 resulted in considerably longer outcome durations than successive adjustments with p>1 followed by q<1. Augustin (2008) did not report this pattern for the perception of area, so this finding may be assumed to be specific for the perception of duration, but will have to be further investigated.

Furthermore, it might be investigated how exactly the exponent of the power function varies under changes of the standard stimulus. It might be plausible, as the results of the present experiments suggest, that the exponent increases with increasing standard duration.

Conclusions

In conclusion, the present experiments show that when ratio production of temporal intervals is used, the measurement is based on a ratio scale, although a ‘numerical distortion’ impedes an unequivocal interpretation of the scale values. Thus, before the shape of the transformation function relating perceived and mathematical numbers is determined, power law fitting using ratio production should be taken with a grain of salt.

Furthermore, the fitting of curves describing the relationship between physical and perceived time, regardless of power function or linear relationship, is difficult: Even if both kinds of models seem to describe the relationship quite well, the estimated parameters depend on the size of the reference stimulus used in the experiment and thus can hardly be interpreted in a psychologically meaningful way.