Statistical inference for measures of predictive success

We provide statistical inference for measures of predictive success. These measures are frequently used to evaluate and compare the performance of different models of individual and group decision making in experimental and revealed preference studies. We provide a brief illustration of our findings by comparing the predictive success of different revealed preference tests for models of intertemporal decision making. This demonstrates that it is possible to compare the predictive success of different models in a statistically meaningful way. JEL-codes: C10, C90, D12


Introduction
Given a behavioural model and an outcome space of possible observations, Selten (1991) distinguishes between three types of theories. A point theory gives a single element of the outcome space and predicts this point as the central tendency of the observations. A distribution theory gives a probability distribution over the outcome space and predicts that observations are independently drawn according to this distribution. Finally, an area theory only predicts that the observed outcomes should lie in a certain subset of the outcome space. For example, a distribution theory could predict that some variable of interest is uniformly distributed on the unit interval. A point theory, on the other hand, would predict that the mean (or median) of the observations is equal to 0.5. Finally, an area theory would predict that the observations lie in the interval [0, 1]. Given this classification, a distribution theory is more informative than either a point theory or an area theory in the sense that if we know the observations to be uniformly distributed, we also know their central tendency (mean or median) and their area (support).
Many applications in experimental and revealed preference settings fall into the class of area theories. With respect to these theories, models are often evaluated on the basis of two metrics: the hit rate and the area. The hit rate gives the percentage of all observations that fall within the predicted subset of the outcome space. A high hit rate implies that many subjects have made choices that are consistent with the model's predictions. The hit rate, however, only captures one dimension of the model's performance. In general, the hit rate of a model will be higher if the model becomes more permissive (i.e. if the model imposes weaker restrictions on the observed behaviour). Therefore, for an area theory to be meaningful, it is desirable that the empirical test is sufficiently strong. The permissiveness can be measured by the 'area' of the test, which gives the relative size of the predicted subset compared to the set of all possible outcomes.1 Generally, a favourable hit rate for a specific behavioural model provides convincing support for the model only if the associated area is sufficiently small. In practice, however, the two measures are almost always positively correlated, which makes it interesting to define a summarizing measure that combines the two measures of empirical performance into a single metric, a so-called measure of predictive success. Selten (1991) argues in favour of the functional specification that determines the predictive success as the difference between the hit rate and the area: predictive success = hit rate − area. This measure of predictive success is frequently used in experimental studies2 and has recently been advocated for use with revealed preference tests by Beatty and Crawford (2011).3
3 In revealed preference studies, the area is usually quantified as one minus the Bronars (1987) power, which gives the probability that a randomly generated data set (obtained from a uniform distribution on the budget hyperplanes) will fail the revealed preference test.
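The area-by-simulation idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `passes_test` is a hypothetical stand-in for an actual revealed preference test, and the uniform draws stand in for sampling on the budget hyperplanes.

```python
import numpy as np

rng = np.random.default_rng(0)

def passes_test(dataset):
    # Hypothetical stand-in for a revealed preference test such as GARP:
    # here a data set "passes" iff the mean of its entries exceeds 0.5.
    return dataset.mean() > 0.5

def bronars_area(simulate_dataset, passes_test, m=1000):
    """Area = 1 - Bronars power: the share of m randomly generated
    data sets that PASS the test (the power is the share that fail)."""
    return np.mean([passes_test(simulate_dataset()) for _ in range(m)])

# Illustrative random-draw mechanism: uniform points in [0, 1]^10.
area = bronars_area(lambda: rng.uniform(size=10), passes_test, m=2000)
```

A permissive test passes many random data sets and therefore has a large area; a stringent one has an area close to zero.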
Different area theories (and revealed preference models) can be evaluated on the basis of their predictive success, and models with higher predictive success can be seen as having a better empirical fit. However, when comparing the predictive success between two models, it is not at all obvious how big the difference in predictive success needs to be in order to be 'significant'. The literature dealing with predictive success measures is silent on this point. The main reason for this is that the theory underlying the predictive success measure is not a stochastic theory: the observations are either inside or outside the predicted set (see Hey (1998) for a discussion). Although it is true that the predictive success measure is devoid of any statistical interpretation, the fact remains that its computation is based on the observed behaviour of a finite number of (randomly chosen) subjects. By considering the space of all possible observed behaviour as the relevant population, we show that it is nevertheless possible to conduct valid statistical inference. Our paper uses elementary large sample theory to construct asymptotically valid confidence intervals for various predictive success measures. In this way it becomes possible to construct asymptotically valid hypothesis tests to verify whether the predictive success of a model is larger than some benchmark threshold (e.g. zero) or to compare the predictive success between different opposing models.

1 Of course, the 'size' of a set will always be conditional on a specific measure on the outcome space. Our framework will be flexible enough to allow for different specifications of this measure.
In the next section, we set out the framework and derive the statistical results. Section 3 contains an empirical illustration of our findings that compares the predictive success of different revealed preference tests for models of intertemporal decision making.

Framework
The building blocks of our framework are data sets, denoted by s. A data set may correspond to the outcome of an experiment for a single subject. We denote by Ω the set of all possible data sets that can be observed. An experiment is given by a finite number of data sets {s_i}_{i≤n} from Ω.
Hit rate An area theory for a certain model of behaviour predicts that the data sets will fall within a certain subset A of the outcome space Ω. Given such an area theory, we consider the indicator function I : Ω → {0, 1} : s ↦ I(s) such that I(s) = 1 if and only if s ∈ A. The hit rate r_n of the experiment {s_i}_{i≤n} is given by the proportion of data sets that fall within the set A, r_n = (1/n) ∑_{i≤n} I(s_i).
Area In order to define the area, we need a bit more work. To start, let us fix a data set s_i ∈ Ω and consider a probability space (Ω_i, B_i, F_i) which may depend on the specificities of the data set s_i. Here, Ω_i ⊆ Ω is a subset of the outcome space such that s_i ∈ Ω_i. The set B_i is a sigma algebra on Ω_i such that the function I(.) restricted to Ω_i is measurable, and F_i : B_i → [0, 1] is a probability measure. We define the area of the data set s_i by ρ(s_i) = F_i(A ∩ Ω_i). Intuitively, ρ(s_i) measures the size of the set A according to the measure F_i. The area of the experiment {s_i}_{i≤n} is defined as the mean of the areas of the data sets in the experiment, a_n = (1/n) ∑_{i≤n} ρ(s_i). In many experimental settings we have that Ω is finite, Ω_i = Ω and F_i equals the uniform distribution on Ω, i.e. each individual data set is given equal probability. In such a setting, ρ(s_i) will be the same for all s_i and the measure a_n will coincide with ρ. Observe, however, that our framework is flexible enough to allow for other specifications of the probability measure F_i.4 In some cases, it is possible to obtain ρ(.) in closed form. In other settings (like revealed preference theory) no closed form solutions are known. To encompass those situations, we allow ρ(s_i) to be approximated by simulation. In such cases, we draw m i.i.d. data sets {s_i^1, . . . , s_i^m} according to the probability measure F_i and compute the finite sample approximation ρ_m(s_i) = (1/m) ∑_{j≤m} I(s_i^j). The area of the experiment is then approximated by a_{n,m} = (1/n) ∑_{i≤n} ρ_m(s_i). By the law of large numbers, for m → ∞, a_{n,m} →_P a_n.
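The definitions of r_n and a_{n,m} translate directly into a short simulation sketch. Everything below is an illustrative assumption: the indicator encodes a toy area theory (coordinates must be non-decreasing), and F_i is taken as the uniform distribution on [0, 1]^3, independently of s_i.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy area theory on Omega = [0, 1]^3: a data set "hits" (I(s) = 1)
# iff its coordinates are non-decreasing. This is an illustrative
# stand-in, not an actual revealed preference test.
def indicator(s):
    return float(np.all(np.diff(s) >= 0))

def hit_rate(experiment):
    # r_n = (1/n) sum_{i<=n} I(s_i)
    return float(np.mean([indicator(s) for s in experiment]))

def area(experiment, sample_from_F, m=500):
    # a_{n,m} = (1/n) sum_i rho_m(s_i),  rho_m(s_i) = (1/m) sum_j I(s_i^j)
    rhos = [np.mean([indicator(d) for d in sample_from_F(s, m)])
            for s in experiment]
    return float(np.mean(rhos))

# 50 subjects whose (sorted) choices all satisfy the theory.
experiment = [np.sort(rng.uniform(size=3)) for _ in range(50)]
r_n = hit_rate(experiment)
# F_i: m uniform draws on [0, 1]^3 (here independent of s_i).
a_nm = area(experiment, lambda s, m: rng.uniform(size=(m, len(s))))
```

In this toy example every subject hits (r_n = 1), while a random draw is non-decreasing with probability 1/3! = 1/6, so a_{n,m} is close to 0.167.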

Predictive success
The hit rate r_n and the area a_{n,m} can be combined into a measure of predictive success p : [0, 1]² → ℝ : (r, a) ↦ p(r, a). Intuitively, p(r_n, a_{n,m}) measures the performance of the behavioural model underlying the indicator function I(.). Usually, p is increasing in its first argument and decreasing in its second. We assume that p(., .) is continuously differentiable.

Large sample results
We consider the probability space (Ω, B, P) where B is a sigma algebra on Ω and P is a probability distribution on Ω giving the law by which the individual data sets in the experiment are obtained. We assume that B is such that both the functions I(.) and ρ(.) are measurable. The population hit rate and area are given by r = E_P[I(s)] and a = E_P[ρ(s)]. Consider an experiment {s_1, . . . , s_n} which is obtained from n i.i.d. draws according to the law P. By the law of large numbers, we have that, as n → ∞ and m n^{−1} → ∞, r_n →_P r and a_{n,m} →_P a. Further, using the classical central limit theorem, we have that √n ((r_n, a_{n,m}) − (r, a)) →_d N(0, Σ), where Σ, with entries Var(I(s)), Var(ρ(s)) and Cov(I(s), ρ(s)), is the asymptotic variance-covariance matrix. The elements of Σ can be consistently estimated by their finite sample analogues.
Using the continuous mapping theorem, we have that for n → ∞ and m n^{−1} → ∞, p(r_n, a_{n,m}) →_P p(r, a). Next, let δ be the row vector of partial derivatives of the predictive success measure p(r, a) evaluated at (r, a), δ = [∂p(r, a)/∂r, ∂p(r, a)/∂a]. By the delta method, √n (p(r_n, a_{n,m}) − p(r, a)) →_d N(0, δΣδ′).
The variance, δΣδ′, can be consistently estimated by δ_n S_{n,m} δ_n′, where δ_n denotes δ evaluated at (r_n, a_{n,m}) and S_{n,m} is the finite sample estimate of Σ. It follows that [ p(r_n, a_{n,m}) − z_{(1+α)/2} √(δ_n S_{n,m} δ_n′ / n), p(r_n, a_{n,m}) + z_{(1+α)/2} √(δ_n S_{n,m} δ_n′ / n) ], with z_{(1+α)/2} the (1+α)/2 quantile of the standard normal distribution, is an asymptotic α × 100% confidence interval for the predictive success measure p(r, a).
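For the benchmark specification p(r, a) = r − a, the confidence interval construction can be sketched as follows; the synthetic inputs and variable names are illustrative assumptions, not the paper's data.

```python
import numpy as np
from statistics import NormalDist

def ci_predictive_success(hits, rhos, alpha=0.95):
    """Asymptotic alpha*100% confidence interval for p(r, a) = r - a.
    hits: indicators I(s_i) in {0, 1}; rhos: simulated areas rho_m(s_i)."""
    hits = np.asarray(hits, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    n = len(hits)
    r_n, a_nm = hits.mean(), rhos.mean()
    S = np.cov(np.vstack([hits, rhos]))   # 2x2 finite sample estimate of Sigma
    delta = np.array([1.0, -1.0])         # gradient of p(r, a) = r - a
    se = np.sqrt(delta @ S @ delta / n)   # sqrt(delta S delta' / n)
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    p = r_n - a_nm
    return p - z * se, p + z * se

# Synthetic inputs for illustration only.
rng = np.random.default_rng(2)
hits = rng.integers(0, 2, size=400)
rhos = rng.uniform(0.2, 0.4, size=400)
lo, hi = ci_predictive_success(hits, rhos)
```

The same construction applies to other specifications of p by swapping in the appropriate gradient vector δ.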
Comparing predictive success In many cases it is also interesting to compare two tests on the basis of their difference in predictive success. Consider two tests with hit rates and areas equal to r, a and r̃, ã, respectively. By the central limit theorem, we know that √n ((r_n, a_{n,m}, r̃_n, ã_{n,m}) − (r, a, r̃, ã)) →_d N(0, Σ_∆), where Σ_∆ is the asymptotic variance-covariance matrix whose elements can be consistently estimated using the finite sample plug-ins. For example, the covariance between r_n and r̃_n is equal to Cov(I(s), Ĩ(s)) = E_P[I(s)Ĩ(s)] − r r̃, which can be consistently estimated by (1/n) ∑_{i≤n} I(s_i)Ĩ(s_i) − r_n r̃_n. We denote the estimator of the variance-covariance matrix by S_∆,n,m. Again, using the delta method, the asymptotic distribution of the difference in predictive success is given by √n ((p(r_n, a_{n,m}) − p(r̃_n, ã_{n,m})) − (p(r, a) − p(r̃, ã))) →_d N(0, δ_∆ Σ_∆ δ_∆′), where δ_∆ is the row vector of partial derivatives δ_∆ = [∂p(r, a)/∂r, ∂p(r, a)/∂a, −∂p(r̃, ã)/∂r̃, −∂p(r̃, ã)/∂ã]. It follows that [ (p(r_n, a_{n,m}) − p(r̃_n, ã_{n,m})) ± z_{(1+α)/2} √(δ_∆ S_∆,n,m δ_∆′ / n) ] is an asymptotic α × 100% CI for p(r, a) − p(r̃, ã).
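The paired comparison can be sketched in the same way; both tests are evaluated on the same subjects, so the cross-covariances enter Σ_∆ through the joint sample covariance. All inputs below are synthetic and for illustration only.

```python
import numpy as np
from statistics import NormalDist

def ci_difference(hits1, rhos1, hits2, rhos2, alpha=0.95):
    """Asymptotic alpha*100% CI for (r - a) - (r~ - a~) when both tests
    are run on the same n subjects (cross-covariances via the joint cov)."""
    X = np.vstack([hits1, rhos1, hits2, rhos2]).astype(float)
    n = X.shape[1]
    m = X.mean(axis=1)
    S = np.cov(X)                             # 4x4 plug-in for Sigma_Delta
    delta = np.array([1.0, -1.0, -1.0, 1.0])  # gradient for p1 - p1~
    diff = (m[0] - m[1]) - (m[2] - m[3])
    se = np.sqrt(delta @ S @ delta / n)
    z = NormalDist().inv_cdf((1 + alpha) / 2)
    return diff - z * se, diff + z * se

# Synthetic example: test 1 has hit rate ~0.8, test 2 ~0.6, areas ~0.3.
rng = np.random.default_rng(4)
n = 500
hits1 = (rng.random(n) < 0.8).astype(float)
hits2 = (rng.random(n) < 0.6).astype(float)
rhos1 = rng.uniform(0.25, 0.35, n)
rhos2 = rng.uniform(0.25, 0.35, n)
lo, hi = ci_difference(hits1, rhos1, hits2, rhos2)
```

If the interval excludes zero, the hypothesis of equal predictive success is rejected at the corresponding level.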

Illustration
We illustrate our results using various revealed preference tests for different models of intertemporal decision making. The first model is the standard life cycle (LC) model where an individual optimizes a time separable additive utility function ∑_t δ^t u(q_t) subject to an intertemporal budget constraint p_t q_t + a_t = I_t + (1 + r_t) a_{t−1}. Here δ < 1 is a subjective discount factor, p_t are the period t prices, a_t is the value of assets at period t, I_t is the contemporaneous income and r_t is the interest rate. Data sets for this model are determined by prices, quantities and interest rates for a finite number of periods. The revealed preference conditions for this life cycle model were derived by Browning (1989). For the second model, let us first single out a habit forming good c. The habits (H) model replaces the intertemporally separable utility function by a utility function of the form ∑_t δ^t u(q_t, c_{t−1}). Here, the consumption of the addictive good in period t − 1 is allowed to influence the utility in period t. The revealed preference characterization of this model was given by Crawford (2010).
Our third model, the habits as durables (HAD) model, considers a variant where the intertemporal utility function is given by ∑_t δ^t u(q_t, A_t), where A_t = βA_{t−1} + c_t represents a stock of addiction with depreciation rate β that determines how fast the addiction wears off. This is the rational addiction model put forward by Becker and Murphy (1988). The revealed preference characterization of this model was derived by Demuynck and Verriest (2013).
As a final, fourth model, we consider the static utility maximization model where the household maximizes in each period a time independent utility function u(q) subject to a budget constraint p_t q = m_t for some level of expenditure m_t. The revealed preference condition for this model is given by the Generalized Axiom of Revealed Preference (GARP) (see, for example, Varian (1982)). Typically, in a revealed preference setting, we specify the measure F_i as the probability law that randomly samples data sets uniformly on the budget hyperplanes. This is analogous to the way the Bronars (1987) power is computed. We consider three measures of predictive success.
The first measure, p_1(r, a) = r − a, takes the difference between the hit rate and the area and is the measure that has become standard in the literature. It is bounded between −1 and 1. In the best case scenario, r → 1 and a → 0. This gives a predictive success close to one. In such a case, most data sets pass the test while the area is very small. In the worst case scenario, r → 0 and a → 1, which gives a predictive success close to minus one. In this case, almost all observations are inconsistent with the model while the area is almost equal to the outcome space Ω. In intermediate cases, the measure of predictive success is found somewhere between minus one and plus one. Zero is a natural benchmark where r = a. The second measure, p_2(r, a) = r/a, takes the ratio of the hit rate and the area. Intuitively, p_2 measures the density of the observed data sets within the predicted area. It is bounded from below by zero. The natural benchmark, where r = a, gives a predictive success equal to one. The third measure, p_3(r, a) = (r − a)/(1 − a), is obtained from the first measure by dividing it by the maximal value that it can attain for fixed a. It can also be written as 1 − (1 − r)/(1 − a). Intuitively, this predictive success measure will be higher, the lower the density outside the predicted area. Its benchmark is equal to zero. We refer to Selten (1991) for a more thorough discussion of the differences between these predictive success measures.
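The three measures, together with the gradient vectors δ needed for the delta method, can be written down directly. A small sketch; the `measures` dictionary and its keys are our own naming.

```python
import numpy as np

# p1 = r - a, p2 = r / a, p3 = (r - a) / (1 - a), each paired with its
# gradient [dp/dr, dp/da] as used in the delta method.
measures = {
    "p1": (lambda r, a: r - a,
           lambda r, a: np.array([1.0, -1.0])),
    "p2": (lambda r, a: r / a,
           lambda r, a: np.array([1.0 / a, -r / a ** 2])),
    "p3": (lambda r, a: (r - a) / (1 - a),
           lambda r, a: np.array([1.0 / (1 - a), (r - 1) / (1 - a) ** 2])),
}

p1, _ = measures["p1"]
p2, _ = measures["p2"]
p3, _ = measures["p3"]
```

At the benchmark r = a the measures equal 0, 1 and 0 respectively, and p_3(r, a) coincides with 1 − (1 − r)/(1 − a).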

Data description
We use data from the Encuesta Continua de Presupuestos Familiares. This data set contains detailed information on consumed quantities and prices for a large sample of Spanish households. We refer to Browning and Collado (2001), Crawford (2010) and Demuynck and Verriest (2013) for a more detailed description of this data set. The observations range from 1985 until 1997 and are obtained on a quarterly basis. Every quarter, new households enter the rotating panel while others are dropped. There is a maximum of eight consecutive observations per household. We consider 14 nondurable commodity categories,5 and take tobacco as the habit forming good.6 We have a sample of 671 households (n = 671). Finally, we simulate the areas ρ(s_i) using 1000 random draws per data set (in other words, we set m equal to 1000).
Results Table 1 provides the estimates of p(r, a) for the different measures together with the 95% asymptotic confidence intervals. For the first measure, the highest estimate is for the HAD model, which is also the only model whose confidence interval excludes the benchmark value 0. For the second measure, the highest value is found for the LC model. However, this model also has the highest variance, which makes its value highly uncertain. Both the H and HAD models exclude 1 from their 95% confidence intervals. The last measure gives qualitatively similar results to the first measure. Table 2 gives the mean values and 95% asymptotic confidence intervals for the difference in predictive success between the different revealed preference tests. Many intervals include the value of zero, meaning that the hypothesis of equal predictive success cannot be rejected at the 5% level. Exceptions to this are the differences between the GARP and HAD tests for measures 1 and 2, the difference between the LC and HAD tests for measures 1 and 3, and the difference between the H and HAD tests for all predictive success measures under consideration.

Size analysis
Our results are based on large sample statistics. This means that they may be unreliable if the number of data sets in the experiment is small. In order to analyse this, we conduct a simple level analysis based on an artificial data set of 10 observations (|T| = 10) and 10 goods.7 We compute the area of this data set using the Bronars procedure. We consider the case where the null hypotheses p_1(r, a) = 0, p_2(r, a) = 1 and p_3(r, a) = 0 hold. Towards this end, we randomly generated experiments of various sizes. Table 3 gives the level of the test that rejects if the sample based predictive success measure falls outside the 95% confidence interval. The table shows that the level of the test coincides with the nominal level for experiments of 250 data sets or more. Small experiments, however, tend to reject the null hypothesis too often.

Table 3: Level of the test that rejects the hypothesis p_1(r, a) = 0, p_2(r, a) = 1 or p_3(r, a) = 0 if these values fall outside the 95% confidence intervals.

5 In particular, we have (1) Food and non-alcoholic drinks at home, (2) Alcohol, (3) Tobacco, (4) Energy at home, (5) Services at home, (6) Nondurables at home, (7) Nondurable medicines, (8) Medical services, (9) Transportation, (10) Petrol, (11) Leisure, (12) Personal services, (13) Personal nondurables, (14) Restaurants and bars.

6 We restrict the sample to the subset of households for which the wife is outside of the labour market and for which we have observations for all eight quarters. We further restrict the sample to households with strictly positive consumption of the addictive good in all periods.

7 The results are not sensitive to the number of observations or goods.
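The level analysis can be mimicked with synthetic data. The following sketch generates experiments under the null hypothesis p_1(r, a) = 0; all distributional choices here are our own illustrative assumptions (the hit indicators are Bernoulli with success probability equal to a known area), not the paper's actual design.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
z = NormalDist().inv_cdf(0.975)  # 95% two-sided critical value

def rejection_rate(n, a=0.3, n_reps=2000):
    """Empirical level of the test of H0: p1(r, a) = r - a = 0.
    Data are generated so that H0 holds: I(s_i) ~ Bernoulli(a), and
    the area is taken as known, rho(s_i) = a for all i."""
    rejections = 0
    for _ in range(n_reps):
        hits = (rng.random(n) < a).astype(float)
        rhos = np.full(n, a)
        p_hat = hits.mean() - rhos.mean()
        S = np.cov(np.vstack([hits, rhos]))   # rho row is degenerate here
        se = np.sqrt((S[0, 0] - 2 * S[0, 1] + S[1, 1]) / n)
        rejections += abs(p_hat) > z * se
    return rejections / n_reps

small, large = rejection_rate(10), rejection_rate(500)
```

Consistent with Table 3, the small experiment rejects the true null far more often than 5%, while the large experiment is close to the nominal level.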

Conclusion
This note provides statistical inference for measures of predictive success. Predictive success measures are frequently used to evaluate and compare the performance of different models of individual and group behaviour in experimental and revealed preference studies. Our results allow us to derive confidence intervals for the value of a predictive success measure or for the difference between the predictive success measures of two opposing models. We provide a brief illustration of our findings by comparing the predictive success of different revealed preference tests for models of intertemporal decision making. Finally, simulation results indicate that our tests give reliable results for moderately sized experiments, but that type I errors may be above the 5% nominal value in small samples.