1 Introduction

Return expectations are one of the main determinants of traders’ investment decisions and therefore play a crucial role in financial market dynamics. Consequently, a profound insight into how these expectations are formed will contribute substantially to our understanding of financial market phenomena such as excess volatility and the emergence of bubbles and crashes. It is therefore hardly surprising that, in the last couple of decades, considerable effort has been put into analyzing expectation formation, both by using data from questionnaire studies and by using data from laboratory experiments.Footnote 1

However, the forecasts collected from these survey studies and laboratory experiments may be systematically affected by the way in which forecasts are elicited, as well as the format in which past data is presented. Glaser et al. (2007), for example, show that subjects that have to forecast returns exhibit a stronger tendency to extrapolate past trends than subjects that are asked to forecast prices.Footnote 2 In addition, in a recent study, Glaser et al. (2019) find that return expectations are higher when return forecasts, instead of price forecasts, are elicited. Furthermore, they show that return expectations are lower when past returns, instead of past prices, are shown to the subjects. This is of particular interest since both formats are used in investor documents of mutual funds, on financial websites, and so on, and may change according to changes in regulation.Footnote 3 Moreover, some well-known financial market surveys differ in how they elicit forecasts (see Glaser et al. (2019) for an overview).

These findings imply that, at a minimum, results on expectation formation from questionnaire studies and laboratory experiments should be interpreted with care, in particular since these results are used in economic policy debates (see e.g., Glaser et al. (2007) and Hoffmann et al. (2017)) and discussed in the popular press, thereby partially shaping the expectations of the general public. Moreover, return expectations guide investment decisions and thereby have a direct impact on realized market returns. They therefore have an effect on the performance of other traders’ investment strategies as well. If the format of the information presented, or the format of the forecasting task, leads to systematic differences in expectation formation, financial market behavior as measured by, for example, price volatility, is likely to be affected as well. In particular, if eliciting return forecasts leads to stronger trend extrapolation, this may increase the incidence of bubbles and crashes. Similarly, if presenting past performance by return bar charts (instead of price line charts) leads to more moderate return forecasts, then bubbles may be less likely to occur. Although many important behavioral tendencies have been established in individual decision-making environments, it is not always clear if and how these findings translate to market dynamics. An important motivation for our paper is to investigate the extent to which aggregate behavior is affected by the choice of variable to be predicted as well as by the presentation of past performance.

In the studies mentioned above subjects need to forecast the next realization of a predetermined and exogenously given time series of prices or returns (either simulated, or based on historical stock price data). These studies therefore abstract from any effect that return forecasts have on return realizations. However, these feedback effects can be substantial: market prices are determined by the trading decisions of individual traders which are informed by these traders’ expectations. The earlier contributions therefore only investigate part of the story: they miss the link between individual decisions and market realizations, and thereby the channel through which the presentation of information and of the forecasting task may have an actual impact on the functioning of financial markets. We go beyond the existing literature by explicitly taking expectations feedback into account. The focus of our study therefore does not lie on individual forecasting behavior, but on the effect on market dynamics instead. In this so-called learning to forecast experimentFootnote 4 subjects’ average expectations of prices/returns are an important determinant of realized prices/returns. This setting allows us to investigate whether price volatility is affected by how information about past performance is presented and by which variable is being forecast. Obviously this cannot be studied when the time series are exogenously given. Moreover it is not clear whether the individual differences observed in previous contributions also result in differences at the market level. As a second contribution, we run an additional treatment where subjects are free to choose the format in which they give their prediction and the format in which information is presented to them. In this way we can study whether the aggregate effects that we identify in our main treatments are robust.

In our main experiment, we use a \(2\times 2\) between-subjects design similar to the one used in Glaser et al. (2019). Depending on the treatment, subjects either see past prices or past returns, and either have to forecast the next price or the next return, for 50 consecutive periods. The underlying data generating mechanism is the same for all four treatments, with the only differences between the treatments in how the forecasting task and past information are presented. Subjects are paid for their forecasting accuracy. We find that the format of past information (either presented as a return bar chart or as a price line chart) does not have a significant effect on the resulting market dynamics, but we provide evidence that the format in which forecasts are elicited does have an effect. In particular, we find that asking for return forecasts tends to increase price volatility, when compared to asking for price forecasts. Unfortunately, the general picture is somewhat obfuscated by the fact that in a number of markets at least one of the subjects submits a very high prediction for a certain period, possibly as the consequence of a typing error, or because of an inclination to experiment with the forecasting environment. These extreme predictions, which occur much more often in the treatments where subjects’ task is to predict the price, have a substantial and persistent effect on the ensuing market dynamics. Correcting for these outlier markets substantially strengthens our main result that forecasting returns leads to more price volatility than forecasting prices. An analysis of subjects’ individual forecasts suggests that the difference between predicting returns and predicting prices is due to a tendency of subjects to coordinate on forecasting strategies that extrapolate trends (in prices) more strongly when forecasting returns than when forecasting prices.

In each of the four experimental treatments described above, we fixed both the format in which past information is presented as well as the format of the prediction task. In practice, investors and financial advisors may indeed be prone to using particular formats, due to specific regulation, because of an established habit or default bias, or because the investors they are advising demand that advice in a specific format. Nevertheless, investors and financial advisors may have a choice in how they want information about past performance of assets to be presented, and whether they prefer to predict future prices or future returns as an input to an investment decision (taken by themselves, or by the investor they are advising). Given that predicting returns tends to lead to more volatile prices than predicting prices, the question arises whether, if the prediction format can be chosen freely, one of the formats will emerge as the dominant choice, and how price volatility will be affected. In an attempt to answer these questions, we run an additional treatment of our experiment where in each period subjects can choose whether to submit a price forecast or a return forecast. In addition, in each period subjects can also decide whether past performance of the asset is presented either by past prices or past returns, and they are allowed to switch between those formats as often as they want. We find no clear preference for predicting prices or returns in this treatment with endogenous choice; about half of the predictions are price predictions, and the other half are return predictions. We do find that subjects have a preference to observe information about the variable they will forecast. About 80% of the decisions fall into this category.Footnote 5 Combining these two insights we find that two out of our four initial treatments emerge as a natural environment for participants: the treatment where subjects observe and predict prices and the treatment where they observe and predict returns. In terms of market dynamics, we do not find systematic significant differences between this additional treatment and the four initial treatments.

Summarizing, we provide evidence that thinking about returns, instead of thinking about prices, has a substantial impact on the performance of financial markets. These findings are also important, as subjects do not have a clear preference for forecasting prices, which would result in more stable market dynamics. Policy makers and regulators should take the effect of the forecasting format into account when designing policies aimed at stabilizing financial markets, in particular when such policies are in part based upon survey data or laboratory experiments that choose one particular format for forecasts.

Our findings are consistent with earlier research that suggests that trend extrapolation may explain differences in forecasting behavior between different elicitation formats. Glaser et al. (2007) review the literature and show that questionnaire studies and laboratory experiments where return (or price change) forecasts are elicited typically document trend extrapolation, whereas mean reverting behavior is found in studies where price forecasts are elicited. For example, in the questionnaire study presented in Glaser et al. (2007) the effect of the format of the task is isolated: in all treatments subjects observe different series of either increasing, stable or decreasing historical prices of actual stocks from the German stock exchange. Subjects that have to forecast returns exhibit a stronger tendency to extrapolate past trends: that is, return forecasts are higher (lower) after prices have been increasing (decreasing) when return forecasts are elicited directly than when these forecasts are derived from elicited price forecasts.Footnote 6

Glaser et al. (2019) do not only vary the elicitation task but also, in a \(2\times 2\) between-subjects design, the format in which past data are presented, either as a line chart of past prices or as a bar chart of past returns. Instead of a stronger tendency to extrapolate trends when return forecasts are elicited, they find that these return forecasts are higher. Nevertheless, higher return forecasts would increase demand for the asset and are therefore likely to increase the incidence of bubbles in asset prices as well. Glaser et al. (2019) also find that return forecasts are lower when past returns are shown than when past prices are shown which, by a similar argument, has the potential to diminish the likelihood of bubbles. However, we do not find evidence for an effect of the format of past information in our experiment. Indeed, results from other experimental work that focuses on the effect of the presentation of past information are mixed as well. Andreassen (1988) lets subjects trade an artificial stock. Their behavior is consistent with stronger expected mean reversion when prices are observed than when returns are observed. Diacon and Hasseldine (2007) find that investment decisions are not significantly affected by the presentation format of the different funds. As opposed to Glaser et al. (2019), Stössel and Meier (2015) find that subjects overestimate returns substantially when seeing a return bar chart. Finally, the five-year ahead return forecasts of the subjects of Huber and Huber (2019) are more extreme when subjects observe returns than when subjects observe prices. This seems to be consistent with the results from Andreassen (1988), but not necessarily with those from Glaser et al. (2019).

Our experiment is also related to previous work on learning to forecast experiments. In applications to financial markets, with subjects forecasting future prices on the basis of past prices, learning to forecast experiments typically exhibit persistent deviations of realized prices from fundamentals and the endogenous emergence of bubbles and crashes (see e.g., Hommes et al. (2005, 2008) and Heemeijer et al. (2009)), as well as a remarkably high degree of coordination of individual forecasts on a common prediction strategy. These results are quite robust, for example with respect to information about the underlying model (Sonnemans and Tuinstra, 2010), the number of subjects in a group (Hommes et al., 2021), and letting subjects make trading decisions instead of, or in addition to, letting them predict future prices (Bao et al., 2017). Our results show that when eliciting returns (instead of eliciting prices, which is the case in the previous learning to forecast experiments), these persistent deviations from fundamental values are exacerbated even further.

Finally, by establishing that the format of the forecasting task may increase mispricing and lead to bubbles and crashes in asset prices, our paper contributes to the literature that shows that framing may have important effects on financial market decisions (see e.g., Kirchler et al. (2005, 2012), and Anufriev et al. (2019)).

The remainder of this paper is structured as follows. Section 2 presents the experimental design, the underlying asset pricing model and our main hypotheses. We discuss our main results on market stability in Sect. 3. In addition, in this section we also present the results after correcting for outliers in different ways. In Sect. 4 we discuss our findings from the additional treatment with endogenous task and information formats. Section 5 concludes. Experimental instructions and other supplementary material, such as time series for all markets and the analysis of subjects’ individual forecasting strategies, are presented in the online appendices.

2 Experimental design

The experiment, programmed in PHP, was run in February and March 2017, in September and November 2019 and in May 2022 at the CREED experimental laboratory of the University of Amsterdam.Footnote 7 In total 504 subjects (students from various fields) participated in the experiment in 26 sessions.Footnote 8 Experimental sessions (up to paying out) lasted for approximately 90 minutes, with payments for each subject typically between €23.5 and €28. Below we outline the main features of the experimental design.

2.1 Subjects’ task and main treatments

Our design is based on the standard learning to forecast experimental design (see Hommes et al. (2005, 2008) and Heemeijer et al. (2009) for examples, or Hommes (2011) for an overview). Subjects are told that their role is that of an advisor to a pension fund. This pension fund has to decide how much of its wealth to invest in a risky asset, and bases its decision upon the forecast provided by the subject. The task of the subjects is to forecast the price or the return of the risky asset for 50 consecutive periods, using information about past prices or information about past returns. Subjects’ earnings are based upon their forecasting accuracy.

Following Glaser et al. (2019), we vary: (i) the manner in which forecasts are elicited (‘task’), either by asking for a forecast of the price, \(p_{t}\), or by asking for a forecast of the return (i.e. the relative price change), \(r_{t}=\frac{p_{t}-p_{t-1}}{p_{t-1}}\); and (ii) the way in which information is provided to the subjects (‘information’), again either as a time series of past prices, or as a bar chart of past returns. This gives a \(2\times 2\) between-subjects design with four treatments, PP, RP, PR and RR, where, for example, PR means that subjects observe prices (P) and forecast returns (R), and similarly for the other treatments – see Table 1. Six subjects are active in each market in each treatment.Footnote 9 We have 19 to 23 markets for each treatment.

Table 1 Treatments in the \(2\times 2\) design and the number of markets (in parentheses)

In contrast to Glaser et al. (2019) and many other experimental studies on expectation formation, in learning to forecast experiments the realization of the variable that subjects need to predict is not exogenously given, but determined by the subjects’ predictions. In particular, when subjects predict a higher price/return for the risky asset, the pension funds they advise will demand more of this asset, in order to reap the potential capital gains. Increased aggregate demand for the risky asset will then drive the price/return of this asset up instantaneously. We formalize this ‘expectations feedback’ by the pricing equation from Bao et al. (2017), which is given byFootnote 10

$$\begin{aligned} p_{t}=66+\frac{1}{1.05}\left( \overline{p}_{t}^{f}-66\right) +\varepsilon _{t}. \end{aligned}$$

Here \(p_{t}\) is the price of the risky asset in period t and \(\overline{p} _{t}^{f}=\frac{1}{6}\sum _{h=1}^{6}p_{h,t}^{f}\) is the average price forecast of the subjects for period t, averaged over the six subjects in the same market. Furthermore, \(\varepsilon _{t}\) is a small demand shock with \(\varepsilon _{t}\sim N(0,0.25)\). Note that the (rational expectations) fundamental value in this market equals 66, and is constant over time. Moreover, expectations feedback is positive (an increase in the average forecast increases the price realization) and the feedback strength, given by the discount factor 1/1.05, is high (abstracting from the demand shocks, the realized price is a weighted average of the average price forecast and the fundamental value, with most of the weight on the former).Footnote 11

It is straightforward to transform (expected) prices into (expected) returns, and the other way around (using \(r_{t}=\left( p_{t}-p_{t-1}\right) /p_{t-1}\) or \(p_{t}=\left( 1+r_{t}\right) p_{t-1}\)). Equation (1) therefore generates prices (and hence returns) for each of the four treatments. Also note that the realization of demand shocks \(\varepsilon _{t}\) is the same for each market and for each treatment.
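As an illustration of the expectations feedback in Eq. (1) and of the price/return conversions, the following minimal Python sketch may be helpful. The function names and the example forecasts are hypothetical. The sketch also assumes that, in treatments where returns are forecast, each return forecast is converted into a price forecast using the most recent realized price before Eq. (1) is applied. The demand shock is passed in as an argument rather than drawn, so no assumption is made about how \(\varepsilon _{t}\) was generated.

```python
FUNDAMENTAL = 66.0   # fundamental value appearing in Eq. (1)
GROSS_RATE = 1.05    # the feedback strength is 1 / 1.05


def realized_price(price_forecasts, shock=0.0):
    """Eq. (1): map the individual price forecasts for period t to the price p_t."""
    avg_forecast = sum(price_forecasts) / len(price_forecasts)
    return FUNDAMENTAL + (avg_forecast - FUNDAMENTAL) / GROSS_RATE + shock


def price_to_return(price, prev_price):
    """r_t = (p_t - p_{t-1}) / p_{t-1}."""
    return (price - prev_price) / prev_price


def return_to_price(ret, prev_price):
    """p_t = (1 + r_t) * p_{t-1}."""
    return (1.0 + ret) * prev_price


if __name__ == "__main__":
    prev_price = 50.0  # p_0 shown on the initial decision screen
    # Hypothetical return forecasts of the six subjects in one market.
    return_forecasts = [0.02, 0.05, 0.00, 0.10, -0.01, 0.04]
    # Assumption: return forecasts are translated into price forecasts first.
    price_forecasts = [return_to_price(r, prev_price) for r in return_forecasts]
    p1 = realized_price(price_forecasts)  # shock set to zero for illustration
    print(round(p1, 3), round(price_to_return(p1, prev_price), 4))
```

Because the feedback coefficient 1/1.05 is close to one, an average forecast above 66 produces a realized price that is only slightly closer to the fundamental value than the forecast itself.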

Subjects do not have full knowledge of the price/return generating mechanism shown in Eq. (1). However, they are provided with qualitative information about how the market works. That is, it is explained to them that: (i) a higher forecast will lead the pension fund they advise to buy more units of the risky asset; and (ii) the market price will be higher if the total demand for the risky asset is higher. In addition, they know that the number of subjects in their market is six.Footnote 12

2.2 Information and incentives

Fig. 1 Example of the decision screen in treatments PR (panel a) and RP (panel b)

Examples of the decision screens, for treatments PR and RP, are shown in Fig. 1. Subjects can submit their forecast at the top of the screen.Footnote 13 Depending on the treatment, the information subjects have when they need to submit a forecast for period t consists of: (i) a table with the realized prices or returns, their own forecasts, their earnings up to the last period, and their accumulated earnings thus far (lower right part of the decision screen); (ii) a figure with either a time series of past realized prices or a bar chart of past realized returns (lower left part of the decision screen);Footnote 14 and (iii) the most recent price. To be more precise, in the treatments where subjects receive information about prices (PR and PP), we present the previously realized prices in the table, and a figure showing the time series of the realized prices so far. In treatments where subjects receive information about returns (RR and RP), we present the previously realized returns in the table, and a bar chart of the past returns in the figure. Furthermore, subjects see their own forecasts in the table, expressed in the variable they have to forecast. So in treatments PR and RR the table contains the return forecasts subjects submitted, whereas in treatments PP and RP the table contains their price forecasts. Note that the most recent price is given in all treatments, including treatments RP and RR, since otherwise the task for subjects in treatment RP, where they observe past returns but have to forecast the price, would become overly complicated. For treatment RR this most recent price is not required, but it is given in order to provide subjects with the same information in treatments RP and RR. For all treatments a price of \(p_{0}=50\) is shown on the initial decision screen.

Subjects are paid based on their forecasting accuracy. In particular, the number of points earned by subject h in period t is given by:

$$\begin{aligned} \text {payoff}_{h,t}=1300\times \max \left\{ 1-625\times F_{h,t}^{2},0\right\} \text {,} \end{aligned}$$

where \(F_{h,t}\) is a measure of the forecast error made by participant h in period t. Here we take \(F_{h,t}=\frac{p_{h,t}^{e}-p_{t}}{p_{t-1}}\) for treatments PP and RP and \(F_{h,t}=r_{h,t}^{e}-r_{t}\) for treatments PR and RR, so that incentives are exactly the same in each treatment.Footnote 15 Subjects earn between 0 (if the forecast error in that period is \(4\%\) or larger) and 1300 (if the forecast is correct) points per period. At the end of the experiment subjects’ total points over the 50 periods are transformed into euros (with 2600 points giving €1.00), in addition to a €5.00 show-up fee.
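The payoff rule and the two forecast-error definitions can be written down in a few lines. The Python sketch below is only an illustration with hypothetical function names and example numbers. It also checks numerically that a return forecast and the price forecast it implies, \(\left( 1+r_{h,t}^{e}\right) p_{t-1}\), give the same error \(F_{h,t}\) and hence the same payoff.

```python
def forecast_error_price(price_forecast, price, prev_price):
    """F for treatments PP and RP: price forecast error scaled by p_{t-1}."""
    return (price_forecast - price) / prev_price


def forecast_error_return(return_forecast, ret):
    """F for treatments PR and RR: difference between forecast and realized return."""
    return return_forecast - ret


def payoff(error):
    """Points earned in one period: 1300 * max(1 - 625 * F^2, 0)."""
    return 1300.0 * max(1.0 - 625.0 * error ** 2, 0.0)


def earnings_in_euros(total_points, show_up_fee=5.0, points_per_euro=2600.0):
    """Convert the points collected over all periods into euros."""
    return total_points / points_per_euro + show_up_fee


if __name__ == "__main__":
    prev_price, price = 60.0, 63.0                 # hypothetical p_{t-1} and p_t
    ret = (price - prev_price) / prev_price        # realized return, 5%
    return_forecast = 0.07                         # hypothetical forecast
    price_forecast = (1.0 + return_forecast) * prev_price
    f_price = forecast_error_price(price_forecast, price, prev_price)
    f_return = forecast_error_return(return_forecast, ret)
    print(f_price, f_return, payoff(f_return))
```

Since the payoff reaches zero once \(625\times F_{h,t}^{2}\ge 1\), that is, once \(|F_{h,t}|\ge 1/25\), the threshold of \(4\%\) follows directly from the formula. A subject who forecasts perfectly in all 50 periods would collect \(50\times 1300=65000\) points, or €25.00 on top of the show-up fee.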

We impose upper and lower bounds for price and return forecasts, with price forecasts restricted to be between 0 and 1000 and return forecasts restricted to be between \(-100\%\) and \(300\%\).Footnote 16 In order not to provide focal points, subjects are not informed about these restrictions beforehand, but they receive an (individual) message as soon as they try to submit a forecast that violates a restriction that is relevant for them.Footnote 17 For the first period subjects do not have any information about prices or returns yet (except that \(p_0=50\)), but in the instructions we suggest that the price (return) in the first period is likely to be in the interval [0, 100] (\([-10\%,10\%]\)), although subjects are not obliged to choose a forecast from that interval.Footnote 18 Subjects have two minutes to make their decision during each of the first 10 periods and one minute for each of the periods 11 to 50.Footnote 19

2.3 Hypotheses

Essentially, in each of the four treatments subjects are asked to perform the same task, have the same information for doing the task and are rewarded in the same way. The only difference between treatments is how the task and information are presented to the subjects, either in terms of prices or in terms of returns. One might therefore conjecture that behavior of subjects is independent of the treatment. However, previous individual choice experiments suggest that the format of information and the forecasting task may matter. Indeed, Glaser et al. (2007) and Glaser et al. (2019) provide compelling evidence that expectation formation, for time series of prices/returns that are exogenously given, is affected by how past information and the forecasting task are presented. In particular, Glaser et al. (2007) conclude that subjects have a larger tendency to extrapolate trends when they forecast returns than when they forecast prices. Glaser et al. (2019), on the other hand, find that asking for returns leads to higher forecasts than asking for prices. Both results suggest that at the aggregate market level, in an environment with positive expectations feedback, the incidence of bubbles, as well as price volatility, will be higher when subjects forecast returns than when they forecast prices. This leads to our first hypothesis.

Hypothesis 1

Forecasting returns instead of prices leads to more unstable market dynamics.

In addition, Glaser et al. (2019) find that showing subjects past returns leads to lower forecasts than when showing them past prices. In earlier learning to forecast experiments price instability typically starts out with a steady increase in asset prices, resulting in a bubble with prices much higher than their underlying fundamental value. Eventually these price increases stop, after which prices start to decline and revert back to the vicinity of the fundamental value, or even overshoot that value. In our setting, lower expectations (due to observing returns) will lead to lower market prices, and this is likely to diminish the extent of price increases and therefore the amplitude of these bubbles. This gives rise to our second hypothesis.

Hypothesis 2

Observing returns instead of prices leads to more stable market dynamics.

We will test Hypotheses 1 and 2 by considering different measures of instability and comparing these measures between treatments.

3 The effect of information format and forecasting task on aggregate market outcomes

We present our main findings in this section, starting with providing a first overview of the experimental results in Sect. 3.1. In Sect. 3.2 we investigate whether asking for returns instead of asking for prices, or providing past returns instead of providing past prices, has an effect on market dynamics. Because our main results are slightly diluted by the existence of a number of outlier markets, which are characterised by one or more extreme predictions, we study stability for different subsamples of the data in Sect. 3.3.

3.1 An overview of the experimental results

Fig. 2 Median price per treatment, smoothed over 5 periods. Treatment ENDO is discussed in Sect. 4

Figure 2 shows the median market prices in each treatment. That is, for each of the 50 periods we calculate the median price in the given period over the markets in a specific treatment. In order to enhance visibility, we smooth these time series by taking the moving average over 5 periods. The median prices show that there are price oscillations in each treatment but those in treatments PR and RR typically have a larger amplitude (or there are fewer stable markets in these treatments). Figure 2 suggests that the market price in the PP markets (and in the first half of the experiment in the RP markets) stays closer to the fundamental value of 66 whereas markets in the other two treatments exhibit oscillations around the fundamental value. In particular, in 14 of the 45 markets (\(31\%\)) in treatments PP and RP prices are within \(10\%\) of the fundamental value (that is, in the interval \(\left[ 59.4,72.6\right]\)) for at least 40 consecutive periods. On the other hand, in treatments PR and RR a much higher fraction of the markets exhibit prominent fluctuations: in only 4 out of 39 markets (\(10\%\)) in these two treatments are prices within \(10\%\) of the fundamental value for at least 40 consecutive periods. The figure does not provide clear evidence on the effect of observing returns versus observing prices on the volatility of prices and returns.Footnote 20
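The descriptive steps in this overview (per-period medians, smoothing, and the check on whether prices stay within \(10\%\) of the fundamental value for at least 40 consecutive periods) can be sketched in Python as follows. The function names are hypothetical, and the smoothing uses a trailing 5-period window, which is an assumption since the windowing convention is not specified in the text.

```python
from statistics import median


def median_price_path(markets):
    """Per-period median over markets; each market is a list of prices."""
    return [median(prices) for prices in zip(*markets)]


def moving_average(series, window=5):
    """Trailing moving average (assumed convention, not stated in the text)."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]


def longest_run_near_fundamental(prices, fundamental=66.0, band=0.10):
    """Length of the longest streak of consecutive prices inside the band."""
    low, high = fundamental * (1 - band), fundamental * (1 + band)
    longest = current = 0
    for p in prices:
        current = current + 1 if low <= p <= high else 0
        longest = max(longest, current)
    return longest


def is_stable(prices, min_run=40):
    """True if prices stay in the band for at least min_run consecutive periods."""
    return longest_run_near_fundamental(prices) >= min_run
```

Counting `is_stable` over the markets of a treatment gives the kind of fraction reported above, for example 14 out of 45 markets in treatments PP and RP.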

3.2 Market stability

In this section we formally investigate whether the stability of aggregate market prices is affected by the presentation format of the task and/or information given to the subjects. To that end we use different instability measures, which we calculate for each market on the basis of market prices and returns from period 11 to period 50. We exclude the first ten periods to allow for some learning.

We consider six different measures. Our first three measures are standard measures of price dispersion and volatility: the standard deviation of the logreturns (\(std_r\)), the standard deviation of the market price (\(std_p\)) and the interquartile range of market prices (IQR), which is the difference between the third and the first quartile of the realized prices in the given market. That is,

$$\begin{aligned} \begin{array}{ccc} std_r=\sqrt{\frac{1}{40}\sum \limits _{t=11}^{50}{(\ln (p_{t}/p_{t-1})-\overline{\ln (1+r)})^{2}}},&std_p=\sqrt{\frac{1}{40}\sum \limits _{t=11}^{50}{(p_{t}-\bar{p})^{2}}},&\text {and} \end{array}\\ IQR=Q_3\left( p_{11},\dots ,p_{50}\right) -Q_1\left( p_{11},\dots ,p_{50}\right) , \end{aligned}$$

where \(\overline{\ln (1+r)}=\frac{1}{40}\sum \nolimits _{t=11}^{50}{\ln (p_{t}/p_{t-1})}\) is the average realized logreturn over the last 40 periods, \(\bar{p}=\frac{1}{40}\sum \nolimits _{t=11}^{50}{p_{t}}\) is the average realized asset price over the last 40 periods and \(Q_1\) and \(Q_3\) denote the first and third quartile, respectively. Our fourth measure, absolute return (AR) is equal to the median of the absolute returns between periods 11 and 50, that is

$$\begin{aligned} AR=\text {median}(|r_{11}|, \dots , |r_{50}|). \end{aligned}$$

As with the std and IQR measures, a higher value of AR implies higher price volatility.Footnote 21

The four measures discussed above measure price volatility, but do not necessarily capture mispricing (i.e. deviations of realized prices from the fundamental value) very well. For example, if prices are relatively stable but at a level substantially different from the fundamental value these measures will be low, whereas mispricing will be significant. Even though our hypotheses test stability of the market, we investigate two measures of mispricing as well, as they can also be viewed as an indirect measure of stability. These final two measures, relative absolute deviation from the fundamental value (RAD) and relative deviation from the fundamental value (RD), take such deviations from the fundamental value into account (they were introduced in Stöckl et al. (2010), and have become standard bubble measures in the literature on experimental asset markets). These measures are defined as

$$\begin{aligned} RAD=\frac{1}{40}\sum \limits _{t=11}^{50}\frac{|p_{t}-p^{*}|}{p^{*}}\text { and }RD=\frac{1}{40}\sum \limits _{t=11}^{50}\frac{p_{t}-p^{*}}{p^{*}}, \end{aligned}$$

where, in our experiment, \(p^{*}=66\).Footnote 22
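For concreteness, the six measures can be computed for a single market along the following lines. This is a sketch with hypothetical names, not the code used for the paper. It assumes that `prices[t]` holds the realized price of period t with `prices[0]` equal to \(p_{0}\), and it uses the "inclusive" quartile convention of Python's `statistics.quantiles`, which is an assumption because the quartile convention is not specified here. AR, RAD and RD are returned as fractions rather than percentages.

```python
import math
from statistics import median, pstdev, quantiles


def instability_measures(prices, fundamental=66.0, first=11, last=50):
    """Six instability measures over periods first..last for one market.

    prices[t] is the realized price in period t, with prices[0] = p_0.
    """
    periods = range(first, last + 1)
    window = [prices[t] for t in periods]                       # p_11, ..., p_50
    log_returns = [math.log(prices[t] / prices[t - 1]) for t in periods]
    returns = [(prices[t] - prices[t - 1]) / prices[t - 1] for t in periods]
    # Quartile convention is an assumption (other conventions differ slightly).
    q1, _, q3 = quantiles(window, n=4, method="inclusive")
    n = len(window)
    return {
        "std_r": pstdev(log_returns),        # divides by n, as in the 1/40 formula
        "std_p": pstdev(window),
        "IQR": q3 - q1,
        "AR": median([abs(r) for r in returns]),
        "RAD": sum(abs(p - fundamental) for p in window) / (n * fundamental),
        "RD": sum(p - fundamental for p in window) / (n * fundamental),
    }
```

A price path that is constant at 72.6, for instance, yields zero for the four volatility measures but RAD = RD = 0.1, which illustrates why the mispricing measures are reported alongside the volatility measures.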

Table 2 Median values of the instability measures over the markets for each treatment, and combined treatments per information or task

Table 2 summarizes the median values of the instability measures over the markets for each treatment.Footnote 23 Furthermore, as our hypotheses test either observing or forecasting prices versus returns, we also report the values for the merged treatments. That is, to test Hypothesis 1 (differences between forecasting price and return) we merge treatments PP and RP (into *P) on the one hand and PR and RR (into *R) on the other hand, and to test Hypothesis 2 (differences between observing price and return) we merge PP and PR (into P*) and RP and RR (into R*).Footnote 24 The values of the measures AR, RAD and RD are reported in percentages. Although the within-treatment variation in each of the measures is substantial, the measures are to a large extent mutually consistent. The median values show a picture consistent with Fig. 2: the median value of the instability measures is lower in treatments PP and RP than in treatments PR and RR. This is in line with our earlier observation that treatments PP and RP are more stable than the other two treatments.
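The merging of treatments can be made explicit with a short Python sketch. The first letter of a treatment code refers to the information format and the second to the forecasting task, so pooling over the first letter gives *P and *R, and pooling over the second gives P* and R*. The function names and any input values are hypothetical and serve only to illustrate the bookkeeping.

```python
from statistics import median


def pooled_groups(treatment):
    """Map a treatment code such as 'PR' (observe prices, forecast returns)
    to its merged groups along the task and information dimensions."""
    info, task = treatment[0], treatment[1]
    return {"task": "*" + task, "information": info + "*"}


def pooled_medians(values_by_treatment, dimension):
    """Median of one instability measure after merging treatments.

    values_by_treatment maps 'PP', 'RP', 'PR', 'RR' to lists of market-level
    values; dimension is either 'task' or 'information'.
    """
    pooled = {}
    for treatment, values in values_by_treatment.items():
        label = pooled_groups(treatment)[dimension]
        pooled.setdefault(label, []).extend(values)
    return {label: median(vals) for label, vals in pooled.items()}
```

Calling `pooled_medians(data, "task")` returns the medians for *P and *R that are relevant for Hypothesis 1, and `pooled_medians(data, "information")` returns those for P* and R* that are relevant for Hypothesis 2.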

Table 3 Summary of p-values of the Kolmogorov–Smirnov tests for comparing treatments in terms of instability

In order to formally test our hypotheses, we use Kolmogorov–Smirnov tests for each measure. The test results are collected in Table 3.Footnote 25 Note that we apply one-sided Kolmogorov–Smirnov tests, with the direction given by Hypotheses 1 and 2 – e.g., we test and accept the alternative hypothesis that measure \(std_r\) takes lower values for the markets in *P than for the markets in *R, which confirms our Hypothesis 1. The tests indicate significant differences (\(p<0.05\)) between *P and *R for five of the six instability measures while differences between P* and R* are never significant.Footnote 26 Based on these results we do not find support for Hypothesis 2, but we do find support for Hypothesis 1.
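To make the direction of the one-sided test concrete, the sketch below computes the one-sided two-sample Kolmogorov–Smirnov statistic \(D^{+}=\max _{x}\left[ F_{low}(x)-F_{high}(x)\right]\), where \(F_{low}\) is the empirical distribution function of the group hypothesized to have lower values (for example *P). The p-value uses the standard large-sample approximation \(\exp \left( -2\left( D^{+}\right) ^{2}mn/(m+n)\right)\). This is an assumption for illustration only: the text does not state how the reported p-values were computed, and with around 40 markets per group exact or software-specific p-values may differ. All names are hypothetical.

```python
import math


def ecdf(sample, x):
    """Empirical distribution function of sample evaluated at x."""
    return sum(1 for v in sample if v <= x) / len(sample)


def ks_one_sided(low_group, high_group):
    """One-sided two-sample KS statistic and an asymptotic p-value.

    The alternative is that values in low_group tend to be lower, i.e. that
    the ECDF of low_group lies above the ECDF of high_group somewhere.
    """
    points = sorted(set(low_group) | set(high_group))
    d_plus = max(ecdf(low_group, x) - ecdf(high_group, x) for x in points)
    m, n = len(low_group), len(high_group)
    # Large-sample approximation; an assumption, not the paper's stated method.
    p_value = math.exp(-2.0 * d_plus ** 2 * m * n / (m + n))
    return d_plus, p_value
```

In use, `low_group` would hold the market-level values of one instability measure for the *P markets and `high_group` those for the *R markets when testing Hypothesis 1.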

A possible reason for the absence of an effect of observing prices versus observing returns might be that in each treatment of our experiment we show the most recent price to the subjects in each period. This is in line with the design chosen in Glaser et al. (2019) but there is an important difference. In Glaser et al. (2019) subjects have to make a one-time forecast: In their treatments RP and RR subjects could observe all past returns and only the most recent price. In our experiment, however, subjects have to make a forecast for 50 consecutive periods so even though we show them only the most recent price in treatments RP and RR, they can write the prices down and essentially have the possibility to use all past prices for forecasting.

We have established that there is a difference between the distributions of the markets where subjects need to predict prices compared to markets where subjects need to predict returns, with the former tending to be more stable. Now we investigate with a multivariate multiple regression whether these findings also hold true when considering the exact values of the instability measures. Table 4 presents the results of the multivariate multiple linear regression with the six different measures as the dependent variables and two dummy variables as independent variables to test our hypotheses: *P which is one for treatments RP and PP, and P* which is one for treatments PR and PP.Footnote 27 The regression results cannot support either hypothesis with almost all dummy variables being insignificant. Based on the above tests our main result is the following:

Table 4 Multivariate multiple linear regressions for testing Hypotheses 1 and 2

Result 1

We find some evidence supporting the hypothesis that forecasting prices tends to lead to more stable market dynamics than forecasting returns. The format in which past information is presented has no effect on market stability.

Our result is based upon the observation that, when comparing distributions, forecasting prices clearly leads to more stable market dynamics than forecasting returns. However, this difference is not significant for the regression results. A possible reason behind the difference between the Kolmogorov-Smirnov test and the regression results can be seen when looking at the individual markets. Both the figures in Online Appendix C and the exact values of the six measures per market in Online Appendix D show the same picture. Markets in *P tend to be more extreme than markets in *R. This means that considering the more stable markets (with lower instability measures), we observe more *P markets with smaller values. However, when we look at the other end of the distribution, we again see more *P markets than *R markets. Note, however, that there are more relatively stable markets than markets with very high price volatility. Furthermore, these markets with extraordinarily high price volatility result in very high instability measures, which can be seen in Tables D.2 and D.3 in Online Appendix D, as well as in Table D.1 when comparing the medians and the standard deviations. This low number of substantially higher values could lead to positive, but insignificant, regression coefficients. Due to this feature of our data, in the next section we investigate different subsamples of the data.

3.3 Outlier markets and sample split

When looking at the data of individual markets, we can see that there are several markets where one out of the six subjects submits a very different forecast than the other five subjects (see individual figures in Online Appendix C). Such outliers have a substantial and long-lasting effect on the market dynamics in the given market (usually, but not necessarily, resulting in larger bubbles) and thereby obfuscate the differences between treatments. These outliers typically come from two sources: either subjects make a typo, or they experiment with very high (or very low) forecasts. While in real life we cannot rule out such behavior or typos (or “fat-finger errors”Footnote 28) either, they do not happen that often.

In order to investigate the effect these outliers have on our main result, we consider two ways to look at a subsample of the data. First, we remove outlier markets. Second, as the first method results in an unbalanced distribution of markets across treatments, as an alternative we split the sample evenly per treatment.

We identify a subject’s forecast in period \(t>1\) as an outlier if it deviates by more than 50% from the median forecast, taken over the six participants in that market in that period. Because typically there is a high degree of coordination between forecasts, a forecast far from this median can indeed be considered an outlier. We do not consider outliers in the first period, \(t=1\), because subjects have no information about past prices in that period, and their forecasts therefore are not necessarily close to each other.Footnote 29 Unlike the main analysis, to identify outliers we look at periods before period 11 as well, as having an outlier forecast might have a long-lasting effect on prices. By using this approach, we identify 9 (out of 22) markets in PP, 13 (out of 23) markets in RP, 5 (out of 19) markets in PR and 2 (out of 20) markets in RR as outliers. These markets are classified as outlier markets based on a total of 112 outlier predictions which corresponds to 0.4% of all possible predictions after period 1.Footnote 30
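The outlier rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the data are hypothetical, and we assume that all forecasts are expressed in the same (price) units so that a relative deviation from the median is well defined.

```python
from statistics import median

def outlier_flags(forecasts, threshold=0.5):
    """Flag forecasts deviating by more than `threshold` (50%) from the period median.

    forecasts[t][i] is subject i's forecast in period t + 1 (index 0 is period 1).
    """
    flags = []
    for t, period in enumerate(forecasts):
        if t == 0:
            # Period 1 is not screened: subjects have no past prices to coordinate on.
            flags.append([False] * len(period))
            continue
        med = median(period)
        flags.append([abs(f - med) > threshold * abs(med) for f in period])
    return flags

def is_outlier_market(forecasts, threshold=0.5):
    """A market counts as an outlier market if any forecast after period 1 is flagged."""
    return any(any(row) for row in outlier_flags(forecasts, threshold))

# Hypothetical forecasts of six subjects over three periods (not experimental data).
forecasts = [
    [50, 80, 20, 60, 55, 65],    # period 1: dispersed, but never screened
    [60, 61, 62, 60, 59, 610],   # period 2: last subject typed an extra zero
    [61, 62, 61, 60, 62, 61],    # period 3: coordinated forecasts
]
flags = outlier_flags(forecasts)
```

Using the median rather than the mean as the reference point keeps the benchmark itself from being pulled toward the extreme forecast.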

Table 5 Kolmogorov-Smirnov tests and multivariate multiple linear regressions after removing outlier markets

Table 5 confirms the observation that markets that do not feature outliers are much more stable in *P than in *R. When restricting our analysis to markets that do not exhibit such outlier behavior, we see a highly significant difference in the distribution for predicting price vs predicting return, with the former being more stable (see Panel A). Furthermore, the regressions (Panel B) also show that predicting price results in lower instability measures, with a significant negative coefficient for four out of the six measures. Whether subjects observe prices or returns still does not make a difference for stability.

Let us comment on the uneven effect of a typo in the different treatments. Due to the nature of the forecasting task a high forecast may cause larger price fluctuations in the *P treatments than in the *R treatments. For example, when a subject makes a typo by accidentally typing an extra zero, then the forecasted price of that person will be 10 times larger. However, if we consider a relatively stable market in the *R treatment, where the returns are below 10%, such a typo has a much smaller effect. An extra zero would mean a maximum return forecast of 100%, which translates to a price forecast that is only double the original price level. The same holds when submitting the maximum possible return forecast of 300%, which is equivalent to a price prediction that is four times the current price. If we are in a stable market (with prices around 65-70), this price prediction is much lower than the maximum possible price in *P (1000). Outlier markets in *P therefore have the potential to be much more unstable than outlier markets in *R.
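The arithmetic behind this asymmetry can be checked with a short sketch. The helper name and the current price of 65 are assumptions chosen for illustration (the text only states that stable prices lie around 65-70); the bounds of 300% and 1000 are taken from the text.

```python
def price_from_return(current_price, return_pct):
    """Price forecast implied by a return forecast given in percent."""
    return current_price * (1 + return_pct / 100)

current_price = 65.0    # hypothetical price in a stable market
max_price = 1000.0      # maximum admissible price forecast in *P
max_return = 300.0      # maximum admissible return forecast in *R, in percent

# Extra zero on a 10% return forecast: 100%, i.e. twice the current price.
typo_return_price = price_from_return(current_price, 100.0)        # 130.0
# Largest possible return forecast: four times the current price.
max_return_price = price_from_return(current_price, max_return)    # 260.0

# Largest price multiple reachable in each type of treatment.
multiple_r = max_return_price / current_price   # 4.0
multiple_p = max_price / current_price          # roughly 15.4
```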

One might argue that the subsample we obtain after removing the outliers is selective as we end up with an unbalanced sample with respect to the number of observations per treatment. To circumvent that we consider a sample split based on a median market per treatment, which results in a more even split of the total sample as well as a more even split for each treatment. To do so, we combine the six instability measures discussed above to rank the markets per treatment. We implement the ranking in the following way. For each measure we rank the markets in each treatment by giving the market with the lowest value for that measure rank 1, and so on, up to rank n for the market with the highest value of that measure. Subsequently we order all markets in the given treatment by the sum of the ranks they have for the six measures. This gives us an overall ranking for each market in each treatment. We determine the median rank for each treatment, and all markets with a lower rank belong to the stable sample, whereas all markets with a higher ranking belong to the unstable sample.Footnote 31
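The ranking procedure can be sketched as follows for a single treatment. The function name and the data are hypothetical, and two simplifications are assumptions of this sketch rather than part of our procedure: ties are broken by input order, and a market whose overall rank equals the median is assigned to neither sample.

```python
from statistics import median

def rank_sum_split(measures):
    """Split the markets of one treatment into a stable and an unstable sample.

    measures maps a market id to its list of instability measures (same order
    and length for every market); lower values mean a more stable market.
    """
    markets = list(measures)
    n_measures = len(next(iter(measures.values())))
    rank_sum = {m: 0 for m in markets}
    for k in range(n_measures):
        # Rank 1 goes to the market with the lowest value of measure k.
        ordered = sorted(markets, key=lambda m: measures[m][k])
        for rank, m in enumerate(ordered, start=1):
            rank_sum[m] += rank
    # Overall ranking by the sum of the per-measure ranks.
    overall = sorted(markets, key=lambda m: rank_sum[m])
    overall_rank = {m: r for r, m in enumerate(overall, start=1)}
    med = median(overall_rank.values())
    stable = [m for m in overall if overall_rank[m] < med]
    unstable = [m for m in overall if overall_rank[m] > med]
    return stable, unstable

# Hypothetical values of two measures for four markets (not experimental data).
measures = {
    "m1": [1.0, 0.2],
    "m2": [3.0, 0.9],
    "m3": [2.0, 0.5],
    "m4": [5.0, 0.7],
}
stable, unstable = rank_sum_split(measures)
```

Because the split is carried out within each treatment, every treatment contributes roughly half of its markets to each sample, which is what makes the resulting subsamples balanced.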

Table 6 Kolmogorov–Smirnov tests and multivariate multiple linear regressions for the split samples

Table 6 shows the results for the Kolmogorov-Smirnov tests (Panels A and C) and the multivariate multiple regressions (Panels B and D). In Panels A and B we present the markets with the better ranking (stable markets) and, for completeness, in Panels C and D we present the results for the more unstable markets. Given that outlier markets often exhibit unstable market dynamics, the results for the stable markets are very similar to the results we obtained after focusing on the non-outlier markets. Predicting price results in more stable market dynamics in those markets (the regression results are even stronger than in Table 5) compared to the markets where subjects predict returns. There is no significant difference between observing prices and returns. Looking at the unstable markets (see Panels C and D in Table 6), we find no support for our hypotheses. These findings lead us to our second result:

Result 2

After correcting for outliers, whether by only considering non-outlier markets, or by considering only the stable half of the markets, we find that forecasting prices leads to more stable markets than forecasting returns. The differences are strongly significant. The format in which past information is presented still has no effect on market stability.

Even though our main analyses and hypotheses connect with market stability and aggregate market behavior, we have also looked at individual behavior for a potential behavioral explanation behind the results. We investigate how subjects react to past price changes by looking at trend-extrapolation with a regression on decision-level data. Furthermore, we estimate for each individual a forecasting rule to measure the most important factors (past prices or past own forecasts, both with several lags) that enter the decision making process. Both analyses show that subjects forecasting returns extrapolate past price changes more strongly than subjects forecasting prices. These results are consistent with the findings of Glaser et al. (2007) who also found stronger trend-extrapolation when predicting returns. The analyses are relegated to Online Appendices E and F.

4 Endogenous choice of task and information

In the main experiment subjects receive information about either past returns or past prices, and they have to submit either price or return forecasts. However, in reality, investors may not be bound to one type of information, but might be free to choose what to look at, and whether to forecast prices or forecast returns. We designed an additional treatment, treatment ENDO, to investigate the effect of the endogenous choice of these two formats on market stability.Footnote 32 In Sect. 4.1 we present the design of this additional treatment, and in Sects. 4.2 and 4.3 we discuss the experimental results.

4.1 Experimental design

The design of treatment ENDO is largely the same as the treatments described in Sect. 2. In contrast to these other treatments, the ENDO treatment allows for a choice between submitting a price or a return, and for switching between return and price history information. Figure 3 illustrates subjects’ screen in this treatment. Switching the observed information only changes the market realizations shown in the table; subjects always see their own forecasts as they submitted them.

Fig. 3 Example of the decision screen in treatment ENDO

The incentives, information and time given to subjects are the same as in the main treatments.Footnote 33 It is important to note that we give earning examples to subjects in the instructions, so that they can see that there is no monetary advantage of forecasting either variable.Footnote 34

We collected data on 17 markets for this treatment. In total 102 subjects from the University of Amsterdam participated in June and September 2022. The practical procedure was the same as in the main treatments, including incentives, exchange rates and decision time restrictions. The sessions (up to payments) lasted about 100 min and subjects typically earned between €24 and €27.5.

4.2 Stability of the endogenous markets

Figure 2 in Sect. 3.1 not only displays the median prices for the original treatments, but also for treatment ENDO. In the first half of the experiment the median market in this treatment is relatively stable with a price slightly higher than the price in treatments PP and RP. In the second half of the experiment we see similar oscillations as for the median markets in the other treatments.

We use the Kolmogorov-Smirnov test for pairwise comparisons of the six instability measures between treatment ENDO and all original individual and merged treatments.Footnote 35 Again, we restrict our analysis to the last 40 periods to account for learning. As we do not have specific hypotheses about the direction of potential differences, all tests are two-sided. The non-parametric tests reveal no systematic differences; most of the test results are not significant at the 5% significance level (see Table D.5 in Online Appendix D). Multivariate regressions confirm these results: we do not find systematic differences between the instability measures of treatment ENDO compared to the other treatments (see Table D.7 in Online Appendix D).

For completeness, we repeat the analysis for the same sample splits that we used for the other treatments. We identify 10 markets as outliers in treatment ENDO.Footnote 36 The non-parametric tests and regressions do not reveal systematic differences between treatments for the non-outlier markets or for the stable or unstable markets. The only significant difference that occurs for multiple instability measures and both for the non-outlier markets as well as for the more stable markets is that the markets from treatment ENDO seem to be a bit more stable than markets in the RR treatment (and also more stable than the *R treatments, see Tables D.8 and D.9 in Online Appendix D). These results hold for the first four measures, but not for RAD and RD. For these two measures the Kolmogorov-Smirnov tests find that treatment ENDO is less stable than treatment PP (and than *P, except for RAD in the stable markets). These results from the sample splits suggest that, consistent with Fig. 2, treatment ENDO lies somewhere between the most and least stable treatments. However, given that we only find these differences for specific treatments and for the sample split, we conclude the following:

Result 3

Giving subjects the freedom to choose the format of the task and the information does not lead to systematically more stable or systematically more unstable market dynamics.

4.3 Choice of task and information in the endogenous markets

In this subsection we investigate the choices subjects make in treatment ENDO. As we already pointed out earlier, subjects do not seem to be influenced by the variable presented to them on the top of the decision screen. In fact, in the first period 57 subjects choose to predict prices, and 38 choose to predict returns. These choices are uncorrelated with the order of the variables on the screen. Looking at the last 40 periods, 58% of subjects (59 out of 102) always predict the same variable, and thus do not switch prediction task. Throughout these 40 periods, 50.41% of all decisions are price predictions, and 50.49% of the last seen information is price information. Subjects switch prediction task on average 2.94 times throughout the 40 periods (standard deviation of 4.56 and maximum of 21) with a median of zero switches. They switch information while thinking about their prediction (thus within a period) on average 0.65 times with zero median (standard deviation of 1.53) and a maximum of 19 switches within a single period by a single subject. Thus the majority of subjects stick both to the information they saw in the previous period and to the task they chose in the previous period, and are reluctant to change. Importantly, these two variables are the same for the large majority of the decisions. When using the last seen variable in a period as a proxy for the information subjects use to make their decision in that period, we find that in the last 40 periods of the experiment 40.8% of all decisions fall in treatment category PP, 9.6% in RP, 9.6% in PR and 40.0% in RR. The correlation is very strong: subjects are likely to submit a price prediction if they have seen the price chart last, and a return prediction if they have chosen to look at returns (the p-value of the Spearman correlation test is \(<0.001\)). This shows that the PP and RR treatments provide a more natural environment for subjects than the PR and RP treatments.
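The classification of individual decisions into treatment categories can be sketched as follows. The function name and the list of decisions are hypothetical; only the four reported percentages are taken from the text. As in the treatment labels, the first letter denotes the last seen information and the second letter the forecasting task.

```python
from collections import Counter

def category_shares(decisions):
    """Share of decisions per treatment category.

    decisions is an iterable of (last_seen_info, task) pairs, each 'P' or 'R'.
    """
    counts = Counter(info + task for info, task in decisions)
    total = sum(counts.values())
    return {cat: counts[cat] / total for cat in ("PP", "RP", "PR", "RR")}

# Hypothetical decisions, for illustration only.
decisions = [("P", "P"), ("P", "P"), ("R", "R"), ("R", "P"), ("R", "R")]
shares = category_shares(decisions)
matching = shares["PP"] + shares["RR"]   # information and task coincide

# Percentages reported above for the last 40 periods of treatment ENDO.
reported = {"PP": 40.8, "RP": 9.6, "PR": 9.6, "RR": 40.0}
reported_matching = reported["PP"] + reported["RR"]   # about 80.8
```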

Finally, one might ask how the group composition in terms of number of subjects choosing to observe or forecast prices affects stability. We used multivariate multiple linear regressions again to investigate this question with markets as observations. The analysis is relegated to Online Appendix D (see Table D.10). Based on this analysis within treatment ENDO we do not find support for either Hypothesis 1 or Hypothesis 2. This is probably due to the fact that the markets are already relatively close to each other in terms of stability, we have only 17 observations, and that the markets constitute a mix between price forecasters and return forecasters. Also note that this treatment was not designed to test these hypotheses.

To summarize our analysis of treatment ENDO, we find no systematic differences between this treatment and the initial treatments. However, we do find that 80% of the decisions are made such that the information and the decision task are the same, and subjects do not have a clear preference for either predicting price or predicting returns (around 50% of the previously mentioned “same variable” decisions are price decisions, the other half are return decisions). Thus our findings for the initial four treatments are relevant, as all treatment categories, especially PP and RR, are observed among the individual decisions. Given our results on the difference between predicting prices and predicting returns, our experiment suggests that regulation that steers investors to think about prices instead of returns facilitates market stability.Footnote 37

5 Concluding remarks

In this paper we fill an important gap in the research on the impact of the presentation format of past information and of the choice of forecasting task on expectation formation. In particular, where previous research focused on individual decision making, we analyze how aggregate behavior is impacted, by acknowledging that expectations have an effect on actual investor behavior, and therefore on realized market dynamics. We go beyond earlier contributions that only consider the effect on forecasting an exogenously given time series, and consider the more realistic environment where market prices are determined endogenously. Our contribution is an example of mapping individual decision biases into aggregate behavior.

Although we do not find evidence that the format of past information, which is either presented as a return bar chart or as a price line chart, has a notable impact on aggregate market dynamics, the format of the task (either forecasting prices, or forecasting returns) does have a significant effect. In particular, when subjects are asked to forecast returns, they tend to coordinate on expectation rules that exhibit stronger extrapolation of past trends than when they are asked to forecast prices. This leads to larger price volatility and a higher incidence of bubbles and crashes in those treatments. Earlier empirical research has already shown that financial market participants have a tendency to extrapolate trends in past observations, see e.g., Sirri and Tufano (1998); Choi et al. (2009) and Greenwood and Shleifer (2014). Our results suggest that this tendency increases when investors think in terms of returns instead of prices, and that this may have a substantial adverse impact on financial market stability. As a second contribution, we designed an additional treatment where subjects had the opportunity to choose the formats of the forecasting task and the presentation of the information themselves. We find no clear preference for forecasting prices or returns, but we do find a clear preference for forecasting and observing the same variables, a more natural, cognitively less demanding task, corresponding to treatments PP and RR of the main experiment.

Andreassen (1987, 1988) and Glaser et al. (2007) refer to the representativeness heuristic (see Tversky and Kahneman, 1982) to argue that subjects that think in prices are more likely to predict mean reversion (in prices) than subjects that think in price changes or returns. To illustrate their point, consider the following sequence of monotonically increasing prices (adapted from Andreassen (1987)): 60, 62, 65, 68, 70. Under the assumption that the representative price corresponds to the mean or median price of that sequence, the participant’s prediction would be around 65. However, the returns corresponding to this series of prices are 3.33%, 4.84%, 4.62% and 2.94%, respectively, with a representative (average) return of around 4%, which translates into a price forecast of about 72.8, instead of the mean reversion resulting from a price forecast of 65. Admittedly, it is not obvious that the average price or return is representative for the sequence of prices or returns.Footnote 38 Nevertheless, even if subjects have simple naïve or adaptive expectations (i.e. their forecast is equal to the last realized value of that variable, or it is equal to a weighted average of that last observation and their previous forecast), clearly trends in prices will be extrapolated when returns are forecasted, but not when prices are forecasted. This is consistent with the stronger trend extrapolation we find in treatments PR and RR.
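The numbers in this example can be reproduced with a short sketch. The variable names are ours; the price sequence and the rounded 4% return are taken from the text, and the last two lines illustrate the naive expectations case.

```python
prices = [60, 62, 65, 68, 70]

# Thinking in prices: the representative (mean) price of the sequence.
mean_price = sum(prices) / len(prices)   # 65.0

# Thinking in returns: period-to-period returns in percent.
returns = [(b / a - 1) * 100 for a, b in zip(prices, prices[1:])]
mean_return = sum(returns) / len(returns)   # a little under 4%

# Price forecast implied by the representative return, rounded to 4% as in the text.
forecast_from_return = prices[-1] * 1.04   # about 72.8

# Naive expectations: repeat the last observed value of the forecast variable.
naive_price_forecast = prices[-1]                            # 70, no extrapolation
naive_return_forecast = prices[-1] * (1 + returns[-1] / 100)  # above 70, trend continues
```

Under naive expectations the price forecaster predicts no further change, whereas the return forecaster implicitly predicts a further price increase of about 2.94%.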

The argument based upon the representativeness heuristic also suggests that the format of past information should have an effect on forecasting behavior: Returns are much more salient in treatments RP and RR than in treatments PP and PR. However, we do not find evidence for a significant difference between those treatments. We can think of two possible reasons for this. We have already discussed our first explanation before Result 1 in Sect. 3.2, namely that we provide subjects in treatments RP and RR with the most recent price, not only with returns. Second, independent of the format of past information, subjects may focus on the variable that they need to forecast. If required, they translate the variable that they observe into the variable that they need to forecast. This would diminish the effect of the format of past information, and would be consistent with the mixed results on the effect of the chart format in the existing experimental literature that we discussed in the Introduction. Notwithstanding our results, for actual financial markets the format of the presentation of past information may still have a real effect. In our experiment we impose the variable that subjects need to forecast, but in actual financial markets this is – to a certain extent – up to the market participants themselves (except professional forecasters and financial analysts who are asked for predictions in a specific format by investors). One may imagine that investors who observe past returns will be more inclined to forecast future returns than investors who observe past prices – making treatments PP and RR the more relevant treatments in our experiment. In this way, the format of the presentation of past information may still have an impact, albeit indirect, upon actual financial market dynamics. Even though our endogenous treatment (ENDO) did not explicitly test this explanation, we do indeed find that subjects have a preference to observe and predict the same variable, although there does not seem to be a preference for either price forecasts or return forecasts. Given that in the first period subjects do not have any past information, it might be that they choose the task first and then the information they want to observe. However, by presenting past performance by means of prices, traders may be nudged into forecasting prices as well, leading to a decrease in price volatility. We leave this question for possible future research.

Many learning to forecast experiments with positive expectations feedback feature persistent deviations from fundamentals and the emergence of bubbles and crashes. In those experiments subjects typically observe past prices and have to forecast future prices, as in our treatment PP. Our results show that if return forecasts are elicited instead of price forecasts, these persistent deviations from the fundamental value are exacerbated, although the underlying price/return generating mechanism remains exactly the same.

Two final remarks about the model we chose for investigating forecasting behavior are in order here. First, as in previous learning to forecast experiments, the feedback strength we choose is relatively high: the relevant coefficient in Eq. (1) is 1/1.05, which is approximately equal to 0.95, implying that the realized price (return) will be quite close to the average price (return) forecast. Earlier work on learning to forecast experiments with positive expectations feedback has shown that a smaller value of this feedback strength will mitigate the endogenous emergence of bubbles and crashes in this framework. In fact, deviations from fundamentals quickly vanish and prices converge to fundamentals if the feedback coefficient is about 0.70 or less (see Sonnemans and Tuinstra (2010), and Bao and Hommes (2019)).Footnote 39 From our results we conjecture that, in a learning to forecast experiment where return forecasts are elicited, the feedback strength would have to be even lower to induce convergence to the fundamental value. Second, in our learning to forecast experiment we focus exclusively on expectation formation, whereas in other experimental studies on the endogenous emergence of bubbles and crashes, subjects can buy and sell the asset (see Palan (2013) for an overview of the sizable literature on bubbles in experimental asset markets pioneered by Smith et al. (1988)). However, in a recent study, Bao et al. (2017) show that the bubbles and crashes that emerge in experiments with positive feedback are robust when subjects can also trade in that asset. In addition, Amromin and Sharpe (2014) and Greenwood and Shleifer (2014) show that portfolio choices can be explained to a large extent by survey expectations.Footnote 40 We therefore believe that our results will translate to an experiment where beliefs of the participants are elicited, either as return forecast or as price forecast, and participants subsequently can trade in the asset. Our conjecture is that in the former case participants show a higher willingness to buy (sell) assets, also against prices higher (lower) than the fundamental value. We leave it to future work to investigate this issue.