Nonparametric predictive distributions based on conformal prediction
Abstract
This paper applies conformal prediction to derive predictive distributions that are valid under a nonparametric assumption. Namely, we introduce and explore predictive distribution functions that always satisfy a natural property of validity in terms of guaranteed coverage for IID observations. The focus is on a prediction algorithm that we call the Least Squares Prediction Machine (LSPM). The LSPM generalizes the classical Dempster–Hill predictive distributions to nonparametric regression problems. If the standard parametric assumptions for Least Squares linear regression hold, the LSPM is as efficient as the Dempster–Hill procedure, in a natural sense. And if those parametric assumptions fail, the LSPM is still valid, provided the observations are IID.
Keywords
Conformal prediction, Least Squares, Nonparametric regression, Predictive distributions, Regression
1 Introduction
This paper applies conformal prediction to derive predictive distribution functions that are valid under a nonparametric assumption. In our definition of predictive distribution functions and their property of validity we follow Shen et al. (2018, Section 1), whose terminology we adopt, and Schweder and Hjort (2016, Chapter 12), who use the term “prediction confidence distributions”. The theory of predictive distributions as developed by Schweder and Hjort (2016) and Shen et al. (2018) assumes that the observations are generated from a parametric statistical model. We extend the theory to the case of regression under the general IID model (the observations are generated independently from the same distribution), where the distribution form does not need to be specified; however, our exposition is self-contained. Our predictive distributions generalize the classical Dempster–Hill procedure (to be formally defined in Sect. 5), which these authors referred to as direct probabilities (Dempster) and \(A_{(n)}\)/\(H_{(n)}\) (Hill). For a well-known review of predictive distributions, see Lawless and Fredette (2005). The more recent review by Gneiting and Katzfuss (2014) refers to the notion of validity used in this paper as probabilistic calibration and describes it as critical in forecasting; Gneiting and Katzfuss (2014, Section 2.2.3) also give further references.
We start our formal exposition by defining conformal predictive distributions (CPDs), nonparametric predictive distributions based on conformal prediction, and algorithms producing CPDs (conformal predictive systems, CPSs) in Sect. 2; we are only interested in (nonparametric) regression problems in this paper. An unusual feature of CPDs is that they are randomized, although they are typically affected by randomness very little. The starting point for conformal prediction is the choice of a conformity measure; not all conformity measures produce CPDs, but we give simple sufficient conditions. In Sect. 3 we apply the method to the classical Least Squares procedure, obtaining what we call the Least Squares Prediction Machine (LSPM). The LSPM is defined in terms of regression residuals; accordingly, it has three main versions: ordinary, deleted, and studentized. The most useful version appears to be studentized, which does not require any assumptions on how influential any of the individual observations is. We state the studentized version (and, more briefly, the ordinary version) as an explicit algorithm. The next two sections, Sects. 4 and 5, are devoted to the validity and efficiency of the LSPM. Whereas the LSPM, as any CPS, is valid under the general IID model, for investigating its efficiency we assume a parametric model, namely the standard Gaussian linear model. The question that we try to answer in Sect. 5 is how much we should pay (in terms of efficiency) for the validity under the general IID model enjoyed by the LSPM. We compare the LSPM with three kinds of oracles under the parametric model; the oracles are adapted to the parametric model and are only required to be valid under it. The weakest oracle (Oracle I) only knows the parametric model, and the strongest one (Oracle III) also knows the parameters of the model. In important cases the LSPM turns out to be as efficient as the Dempster–Hill procedure. All proofs are postponed to Sect. 6, which also contains further discussions. Section 7 is devoted to experimental results demonstrating the validity and, to some degree, efficiency of our methods. Finally, Sect. 8 concludes and lists three directions of further research.
Another method of generating predictive distributions that are valid under the IID model is Venn prediction (Vovk et al. 2005, Chapter 6). An advantage of the method proposed in this paper is that it works in the case of regression, whereas Venn prediction, at the time of writing of this paper, was only known to work in the case of classification (see, however, the recent paper by Nouretdinov et al. 2018, discussed in Sect. 8).
The conference version of this paper (Vovk et al. 2017b), announcing the main results, was published in the Proceedings of COPA 2017. This expanded journal version includes detailed proofs, further discussion of the intuition behind the proposed algorithms, and topics for further research, in addition to improved exposition.
A significant advantage of conformal predictive distributions over traditional conformal prediction is that the former can be combined with a utility function to arrive at optimal decisions. A first step in this direction has been made in Vovk and Bendtsen (2018) (developing ideas of the conference version of this paper).
2 Randomized and conformal predictive distributions
We consider the regression problem with p attributes. Correspondingly, the observation space is defined to be \(\mathbb {R}^{p+1}=\mathbb {R}^p\times \mathbb {R}\); its element \(z=(x,y)\), where \(x\in \mathbb {R}^p\) and \(y\in \mathbb {R}\), is interpreted as an observation consisting of an object \(x\in \mathbb {R}^p\) and its label \(y\in \mathbb {R}\). Our task is, given a training sequence of observations \(z_i=(x_i,y_i)\), \(i=1,\ldots ,n\), and a new test object \(x_{n+1}\in \mathbb {R}^p\), to predict the label \(y_{n+1}\) of the \((n+1)\)st observation. Our statistical model is the general IID model: the observations \(z_1,z_2,\ldots \), where \(z_{i}=(x_{i},y_{i})\), are generated independently from the same unknown probability measure P on \(\mathbb {R}^{p+1}\).
We start from defining predictive distribution functions following Shen et al. (2018, Definition 1), except that we relax the definition of a distribution function and allow randomization. Let U be the uniform probability measure on the interval [0, 1].
Definition 1
 R1a
 For each training sequence \((z_1,\ldots ,z_n)\in (\mathbb {R}^{p+1})^n\) and each test object \(x_{n+1}\in \mathbb {R}^p\), the function \(Q(z_1,\ldots ,z_n,(x_{n+1},y),\tau )\) is monotonically increasing both in \(y\in \mathbb {R}\) and in \(\tau \in [0,1]\) (where “monotonically increasing” is understood in the wide sense allowing intervals of constancy). In other words, for each \(\tau \in [0,1]\), the function $$\begin{aligned} y\in \mathbb {R}\mapsto Q(z_1,\ldots ,z_n,(x_{n+1},y),\tau ) \end{aligned}$$ is monotonically increasing, and for each \(y\in \mathbb {R}\), the function $$\begin{aligned} \tau \in [0,1] \mapsto Q(z_1,\ldots ,z_n,(x_{n+1},y),\tau ) \end{aligned}$$ is monotonically increasing.
 R1b
 For each training sequence \((z_1,\ldots ,z_n)\in (\mathbb {R}^{p+1})^n\) and each test object \(x_{n+1}\in \mathbb {R}^p\), $$\begin{aligned} \lim _{y\rightarrow -\infty } Q(z_1,\ldots ,z_n,(x_{n+1},y),0) = 0 \end{aligned}$$ (1) and $$\begin{aligned} \lim _{y\rightarrow \infty } Q(z_1,\ldots ,z_n,(x_{n+1},y),1) = 1. \end{aligned}$$
 R2
 As a function of random training observations \(z_1\sim P\),..., \(z_n\sim P\), a random test observation \(z_{n+1}\sim P\), and a random number \(\tau \sim U\), all assumed independent, the distribution of Q is uniform: $$\begin{aligned} \forall \alpha \in [0,1]: {{\mathrm{\mathbb {P}}}}\left\{ Q(z_1,\ldots ,z_n,z_{n+1},\tau ) \le \alpha \right\} = \alpha . \end{aligned}$$
In this paper we will be interested in RPDs of thickness \(\frac{1}{n+1}\) with exception size at most n, for typical training sequences of length n [cf. (17) below]. In all our examples, \(Q(z_1,\ldots ,z_n,z_{n+1},\tau )\) will be a continuous function of \(\tau \). Therefore, the set (4) will be a closed interval in [0, 1]. However, we do not include these requirements in our official definition.
Four examples of predictive distributions are shown in Fig. 5 below as shaded areas; let us concentrate, for concreteness, on the top left one. The length of the training sequence for that plot (and the other three plots) is \(n=10\); see Sect. 7 for details. Therefore, we are discussing an instance of \(Q_{10}\), of width 1/11 with exception size 10. The shaded area is \(\{(y,Q_{10}(y,\tau ))\mid y\in \mathbb {R},\tau \in [0,1]\}\). We can regard \((y,\tau )\) as a coordinate system for the shaded area. The cut of the shaded area by the vertical line passing through a point y of the horizontal axis is the closed interval [Q(y, 0), Q(y, 1)], where \(Q:=Q_{10}\). The notation Q(y) for the vertical axis is slightly informal.
Remark 1
The usual interpretation of (7) is that it is a randomized p value for testing the null hypothesis of the observations being IID. In the case of CPDs, the informal alternative hypothesis is that \(y_{n+1}=y\) is smaller than expected under the IID model. Then (6) can be interpreted as a degree of conformity of the observation \((x_{n+1},y_{n+1})\) to the remaining observations. Notice the one-sided nature of this notion of conformity: a label can only be strange (nonconforming) if it is too small; large is never strange. This notion of conformity is somewhat counterintuitive, and we use it only as a technical tool.
2.1 Defining properties of distribution functions
Next we discuss why Definition 1 (essentially taken from Shen et al. 2018) is natural. The key elements of this definition are that (1) the distribution function Q is monotonically increasing, and (2) its value is uniformly distributed. The following two lemmas show that these are defining properties of distribution functions of probability measures on the real line. All proofs are postponed to Sect. 6.
First we consider the case of a continuous distribution function; the justification for this case, given in the next lemma, is simpler.
Lemma 1
Suppose F is a continuous distribution function on \(\mathbb {R}\) and Y is a random variable distributed as F. If \(Q:\mathbb {R}\rightarrow \mathbb {R}\) is a monotonically increasing function such that the distribution of Q(Y) is uniform on [0, 1], then \(Q=F\).
In the general case we need randomization. Remember the definition of the lexicographic order on \(\mathbb {R}\times [0,1]\): \((y,\tau )\le (y',\tau ')\) is defined to mean that \(y<y'\) or both \(y=y'\) and \(\tau \le \tau '\).
Lemma 2
Equality (10) says that Q is essentially F; in particular, \(Q(y,\tau )=F(y)\) at each point y of F’s continuity. It is a known fact that if we define Q by (10) for the distribution function F of a probability measure P, the distribution of Q will be uniform when its domain \(\mathbb {R}\times [0,1]\) is equipped with the probability measure \(P\times U\).
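The uniformity fact just mentioned can be checked exactly on a small discrete example. The sketch below assumes that (10) is the usual randomized probability integral transform \(Q(y,\tau )=F(y-)+\tau (F(y)-F(y-))\); the three-point distribution, the dictionary input format, and the function names are illustrative choices, not taken from the paper.

```python
from fractions import Fraction

def randomized_cdf(atoms, y, tau):
    """Q(y, tau) = F(y-) + tau * (F(y) - F(y-)) for a discrete distribution.

    `atoms` maps each support point to its probability (illustrative format).
    """
    below = sum(p for v, p in atoms.items() if v < y)   # F(y-)
    at = atoms.get(y, Fraction(0))                      # jump of F at y
    return below + tau * at

def prob_q_at_most(atoms, alpha):
    """Exact P{Q(Y, tau) <= alpha} for Y ~ P and tau ~ U[0, 1], independent."""
    total = Fraction(0)
    for v, p in atoms.items():
        lo = randomized_cdf(atoms, v, 0)                # F(v-)
        # Given Y = v, Q is uniform on [lo, lo + p]; clip to a probability.
        frac = min(max((alpha - lo) / p, Fraction(0)), Fraction(1))
        total += p * frac
    return total

# A toy distribution with three atoms (an assumption for illustration).
atoms = {0: Fraction(1, 2), 1: Fraction(1, 3), 5: Fraction(1, 6)}
checks = [prob_q_at_most(atoms, Fraction(k, 12)) == Fraction(k, 12)
          for k in range(13)]
```

Exact rational arithmetic is used so that the identity \({{\mathrm{\mathbb {P}}}}\{Q\le \alpha \}=\alpha \) can be verified without rounding error on the chosen grid of \(\alpha \).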
The previous two lemmas suggest that properties R1a and R2 in the definition of RPSs are the important ones. However, property R1b is formally independent of R1a and R2 in our case of the general IID model (rather than a single probability measure on \(\mathbb {R}\)): consider, e.g., a conformity measure A that depends only on the objects \(x_i\) but does not depend on their labels \(y_i\); e.g., the left-hand side of (1) will be close to 1 for large n and highly conforming \(x_{n+1}\).
2.2 Simplest example: monotonic conformity measures
 monotonically increasing in \(y_{n+1}\), $$\begin{aligned} y_{n+1} \le y'_{n+1} \Longrightarrow A(z_1,\ldots ,z_n,(x_{n+1},y_{n+1})) \le A(z_1,\ldots ,z_n,(x_{n+1},y'_{n+1})); \end{aligned}$$
 monotonically decreasing in \(y_{1}\), $$\begin{aligned} y_{1} \le y'_{1} \Longrightarrow A((x_1,y_1),z_{2},\ldots ,z_n,z_{n+1}) \ge A((x_1,y'_1),z_{2},\ldots ,z_n,z_{n+1}). \end{aligned}$$ (Because of the requirement of invariance (5), being decreasing in \(y_1\) is equivalent to being decreasing in \(y_i\) for any \(i=2,\ldots ,n\).)
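The two monotonicity requirements above can be illustrated on a toy conformity measure. The measure used in the sketch below (test label minus the average training label, ignoring the objects) is a hypothetical example chosen for simplicity and is not claimed to be one of the paper's examples.

```python
def mean_residual(train_labels, y_new):
    """Hypothetical conformity measure: test label minus the mean training label."""
    return y_new - sum(train_labels) / len(train_labels)

base = [1.0, 4.0, 2.5]

# Increasing in the test label y_{n+1}: raising it raises the score.
increasing_in_test = mean_residual(base, 3.0) <= mean_residual(base, 3.5)

# Decreasing in a training label y_1: raising it lowers the score.
decreasing_in_train = (mean_residual([1.0, 4.0, 2.5], 3.0)
                       >= mean_residual([2.0, 4.0, 2.5], 3.0))
```

This measure is also invariant under permutations of the training labels, since it depends on them only through their mean.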
2.3 Criterion of being a CPS
Unfortunately, many important conformity measures are not monotonic, and the next lemma introduces a weaker sufficient condition for a conformal transducer to be an RPS.
Lemma 3
The conformal transducer determined by a conformity measure A satisfies condition R1a if, for each \(i\in \{1,\ldots ,n\}\), each training sequence \((z_1,\ldots ,z_n)\in (\mathbb {R}^{p+1})^n\), and each test object \(x_{n+1}\in \mathbb {R}^p\), \(\alpha ^y_{n+1}-\alpha _i^y\) is a monotonically increasing function of \(y\in \mathbb {R}\) (in the notation of (8)).
3 Least squares prediction machine
In this section we will introduce three versions of what we call the Least Squares Prediction Machine (LSPM). They are analogous to the Ridge Regression Confidence Machine (RRCM), as described in Vovk et al. (2005, Section 2.3) (and called the IID predictor in Vovk et al. 2009), but produce (at least usually) distribution functions rather than prediction intervals.
Unfortunately, the ordinary and deleted LSPM are not RPSs, because their output \(Q_n\) [see (2)] is not necessarily monotonically increasing in y (remember that, for conformal transducers, \(Q_n(y,\tau )\) is monotonically increasing in \(\tau \) automatically). However, we will see that this can happen only in the presence of high-leverage points.
The following proposition can be deduced from Lemma 3 and the explicit form [analogous to Algorithm 1 below but using (22)] of the ordinary LSPM. The details of the proofs for all results of this section will be spelled out in Sect. 6.
Proposition 1
The function \(Q_n\) output by the ordinary LSPM [see (2)] is monotonically increasing in y provided \(\bar{h}_{n+1}<0.5\).
The condition needed for \(Q_n\) to be monotonically increasing, \(\bar{h}_{n+1}<0.5\), means that the test object \(x_{n+1}\) is not a very influential point. An overview of high-leverage points is given by Chatterjee and Hadi (1988, Section 4.2.3.1), where they start from Huber’s 1981 proposal to regard points \(x_i\) with \(\bar{h}_{i}>0.2\) as influential.
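To make the leverage condition concrete, the sketch below computes the leverages \(\bar{h}_i\) of all \(n+1\) objects (training objects followed by the test object) and inspects \(\bar{h}_{n+1}\). It assumes simple regression with an intercept, so that the diagonal of the hat matrix has a closed form; the data and function name are illustrative.

```python
def leverages(xs):
    """Diagonal of the hat matrix for simple regression with an intercept.

    `xs` holds the single non-constant attribute of all n+1 objects
    (training objects followed by the test object).
    """
    m = len(xs)
    mean = sum(xs) / m
    sxx = sum((x - mean) ** 2 for x in xs)
    return [1 / m + (x - mean) ** 2 / sxx for x in xs]

train_x = [0.1, -0.4, 0.3, 0.0, -0.2, 0.5, -0.1, 0.2]

# Test object near the bulk of the training objects: low leverage.
typical = leverages(train_x + [0.15])[-1]

# Test object far from the training objects: leverage above 0.5.
outlying = leverages(train_x + [6.0])[-1]
```

In this toy data set the first test object satisfies \(\bar{h}_{n+1}<0.5\), while the second does not, so only the first is covered by Proposition 1.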
The assumption \(\bar{h}_{n+1}<0.5\) in Proposition 1 is essential:
Proposition 2
Proposition 1 ceases to be true if the constant 0.5 in it is replaced by a larger constant.
The next two propositions show that for the deleted LSPM, determined by (12), the situation is even worse than for the ordinary LSPM: we have to require \(\bar{h}_{i}<0.5\) for all \(i=1,\ldots ,n\).
Proposition 3
The function \(Q_n\) output by the deleted LSPM according to (2) is monotonically increasing in y provided \(\max _{i=1,\ldots ,n}\bar{h}_{i}<0.5\).
We have the following analogue of Proposition 2 for the deleted LSPM.
Proposition 4
Proposition 3 ceases to be true if the constant 0.5 in it is replaced by a larger constant.
An important advantage of the studentized LSPM is that to get predictive distributions we do not need any assumptions of low leverage.
Proposition 5
The studentized LSPM is an RPS and, therefore, a CPS.
3.1 The studentized LSPM in an explicit form
Finally, let us discuss the condition that all \(B_i\) are defined and positive, \(i=1,\ldots ,n\). By Chatterjee and Hadi (1988, Property 2.6(b)), \(\bar{h}_{n+1}=1\) implies \(\bar{h}_{i,n+1}=0\) for \(i=1,\ldots ,n\); therefore, the condition is equivalent to \(\bar{h}_i<1\) for all \(i=1,\ldots ,n+1\). By Mohammadi (2016, Lemma 2.1(iii)), this means that the rank of the extended data matrix \(\bar{X}\) is p and it remains p after removal of any one of its \(n+1\) rows. If this condition is not satisfied, we define \(Q_n(y):=[0,1]\) for all y. This ensures that the studentized LSPM is a CPS.
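For intuition, the studentized LSPM can be sketched by brute force, refitting Least Squares for each postulated label y. The sketch below is not the paper's explicit algorithm: it assumes simple regression with an intercept, takes the studentized conformity scores to be \(\alpha _i=(y_i-\hat{y}_i)/\sqrt{1-\bar{h}_i}\) (our reading of the studentized residuals), and returns \(\tau \) (that is, \(Q_n(y)=[0,1]\)) when the rank condition fails. All names and data are illustrative.

```python
import math

def studentized_lspm(train, x_new, y, tau):
    """Brute-force Q_n(y, tau) for simple regression with an intercept.

    train: list of (x, label) pairs; x_new: test attribute; y: postulated label.
    """
    xs = [x for x, _ in train] + [x_new]
    ys = [label for _, label in train] + [y]
    m = len(xs)
    mean = sum(xs) / m
    sxx = sum((x - mean) ** 2 for x in xs)
    if sxx == 0:                       # rank condition fails: Q_n(y) = [0, 1]
        return tau

    def hat(i, j):                     # entry (i, j) of the extended hat matrix
        return 1 / m + (xs[i] - mean) * (xs[j] - mean) / sxx

    alphas = []
    for i in range(m):
        h_ii = hat(i, i)
        if h_ii >= 1:                  # leverage 1: Q_n(y) = [0, 1]
            return tau
        fitted = sum(hat(i, j) * ys[j] for j in range(m))
        alphas.append((ys[i] - fitted) / math.sqrt(1 - h_ii))

    last = alphas[-1]
    less = sum(a < last for a in alphas)
    equal = sum(a == last for a in alphas)   # always counts i = n+1 itself
    return (less + tau * equal) / m

train = [(0.0, 0.1), (1.0, 1.9), (2.0, 4.2), (3.0, 5.8), (4.0, 8.1)]
values = [studentized_lspm(train, 2.5, -20 + 0.5 * k, 0.5) for k in range(101)]
```

On this grid the computed values are nondecreasing in y, in line with Proposition 5; the cost is one Least Squares fit per candidate label, which an explicit algorithm avoids.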
3.2 The batch version of the studentized LSPM
3.3 The ordinary LSPM
4 A property of validity of the LSPM in the online mode
In the previous section (cf. Algorithm 1) we defined a procedure producing a “fuzzy” distribution function \(Q_n\) given a training sequence \(z_i=(x_i,y_i)\), \(i=1,\ldots ,n\), and a test object \(x_{n+1}\). In this and following sections we will use both notation \(Q_n(y)\) (for an interval) and \(Q_n(y,\tau )\) (for a point inside that interval, as above). Remember that U is the uniform distribution on [0, 1].
Prediction in the online mode proceeds as follows:
Protocol 1
Of course, Forecaster does not know P and \(y_{n+1}\) when computing \(Q_n\).
In the online mode we can strengthen condition R2 as follows:
Theorem 1
(Vovk et al. 2005, Theorem 8.1) In the online mode of prediction (in which \((z_i,\tau _i)\sim P\times U\) are IID), the sequence \((p_1,p_2,\ldots )\) is IID and \((p_1,p_2,\ldots )\sim U^{\infty }\), provided that Forecaster uses the studentized LSPM (or any other conformal transducer).
The property of validity asserted in Theorem 1 is marginal, in that we do not assert that the distribution of \(p_n\) is uniform conditionally on \(x_{n+1}\). Conditional validity is attained by the LSPM only asymptotically and under additional assumptions, as we will see in the next section.
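The marginal validity asserted in Theorem 1 can be illustrated by a small simulation of the online mode. Since the theorem covers any conformal transducer, the sketch below uses the simplest conformity score \(\alpha _i=y_i\) (no objects) instead of the LSPM; the exponential data source, the seed, and the function names are arbitrary illustrative choices.

```python
import random

def conformal_p(alphas, tau):
    """Conformal transducer output; alphas[-1] is the score of the test observation."""
    last = alphas[-1]
    less = sum(a < last for a in alphas)
    equal = sum(a == last for a in alphas)   # includes the test observation
    return (less + tau * equal) / len(alphas)

def online_p_values(steps, seed=0):
    """Run the online protocol and collect p_n = Q_n(y_{n+1}, tau_n)."""
    rng = random.Random(seed)
    labels, p_values = [], []
    for _ in range(steps):
        y_next = rng.expovariate(1.0)        # IID but deliberately non-Gaussian
        tau = rng.random()
        p_values.append(conformal_p(labels + [y_next], tau))
        labels.append(y_next)                # the label is revealed after prediction
    return p_values

p = online_p_values(2000)
mean_p = sum(p) / len(p)
share_below_10pct = sum(v <= 0.1 for v in p) / len(p)
```

If the values \(p_n\) are uniform, `mean_p` should be close to 0.5 and `share_below_10pct` close to 0.1, up to sampling error.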
5 Asymptotic efficiency
In this section we obtain some basic results about the LSPM’s efficiency. The LSPM has a property of validity under the general IID model, but a natural question is how much we should pay for it in terms of efficiency in situations where narrow parametric or even Bayesian assumptions are also satisfied. This question was asked independently by Evgeny Burnaev (in September 2013) and Larry Wasserman. It has an analogue in nonparametric hypothesis testing: e.g., a major impetus for the widespread use of the Wilcoxon rank-sum test was Pitman’s discovery in 1949 that even in the situation where the Gaussian assumptions of Student’s t-test are satisfied the efficiency (“Pitman’s efficiency”) of the Wilcoxon test is still 0.95.
 A1
 The sequence \(x_1,x_2,\ldots \) is bounded: \(\sup _i\left\| x_i\right\| <\infty \).
 A2
 The first component of each vector \(x_i\) is 1.
 A3
 The empirical second-moment matrix has its smallest eigenvalue eventually bounded away from 0: $$\begin{aligned} \liminf _{n\rightarrow \infty } \lambda _{\min } \left( \frac{1}{n}\sum _{i=1}^n x_i x'_i \right) >0, \end{aligned}$$ where \(\lambda _{\min }\) stands for the smallest eigenvalue.
 A4
 The labels \(y_1,y_2,\ldots \) are generated according to (24): \(y_i= w' x_i+\xi _i\), where \(\xi _i\) are independent Gaussian noise random variables distributed as \(N(0,\sigma ^2)\).
Alongside the three versions of the LSPM, we will consider three “oracles” (at first concentrating on the first two). Intuitively, all three oracles know that the data is generated from the model (24). Oracle I knows neither w nor \(\sigma \) (and has to estimate them from the data or somehow manage without them). Oracle II does not know w but knows \(\sigma \). Finally, Oracle III knows both w and \(\sigma \).
Theorem 2
Theorem 3
In Theorems 2 and 3, we have \(\tau \sim U\); alternatively, they will remain true if we fix \(\tau \) to any value in [0, 1]. For simplified oracles, we have \(Q^\mathrm{I}_n(\hat{y}_{n+1} + \hat{\sigma }_n t)=\varPhi (t)\) in Theorem 2 and \(Q^\mathrm{II}_n(\hat{y}_{n+1} + \sigma t)=\varPhi (t)\) in Theorem 3. Our proofs of these theorems (given in Sect. 6) are based on the representation (22) and the results of Mugantseva (1977) (see also Chen 1991, Chapter 2).
Applying Theorems 2 and 3 to a fixed argument t, we obtain (dropping \(\tau \) altogether):
Corollary 1
We can see that under the Gaussian model (24) complemented by other natural assumptions, the LSPM is asymptotically close to the oracular predictive distributions for Oracles I and II, and therefore is approximately conditionally valid and efficient (i.e., valid and efficient given \(x_1,x_2,\ldots \)). On the other hand, Theorem 1 guarantees the marginal validity of the LSPM under the general IID model, regardless of whether (24) holds.
5.1 Comparison with the Dempster–Hill procedure
In this subsection we discuss a classical procedure that was most clearly articulated by Dempster (1963, p. 110) and Hill (1968, 1988); therefore, in this paper we refer to it as the Dempster–Hill procedure. Both Dempster and Hill trace their ideas to Fisher’s (1939; 1948) nonparametric version of his fiducial method, but Fisher was interested in confidence distributions for quantiles rather than predictive distributions. Hill (1988) also referred to his procedure as Bayesian nonparametric predictive inference, which was abbreviated to nonparametric predictive inference (NPI) by Frank Coolen (Augustin and Coolen 2004). We are not using the last term since it seems that all of this paper (and the whole area of conformal prediction) falls under the rubric of “nonparametric predictive inference”. An important predecessor of Dempster and Hill was Jeffreys (1932), who postulated what Hill later denoted as \(A_{(2)}\) (see Lane 1980 and Seidenfeld 1995 for discussions of Jeffreys’s paper and Fisher’s reaction).
Notice that the LSPM, as presented in (23), is a very natural adaptation of \(A_{(n)}\) to the Least Squares regression.
Since (31) is a conformal transducer (provided a point from an interval in (31) is chosen randomly from the uniform distribution on that interval), we have the same guarantees of validity as those given above: the distribution of (31) is uniform over the interval [0, 1].
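The Dempster–Hill predictive distribution is easy to compute directly. The sketch below assumes that (31) is the conformal transducer with conformity scores \(\alpha _i=y_i\), so that \(Q_n(y)\) is the interval from \(\#\{i: y_i<y\}/(n+1)\) to \((\#\{i: y_i\le y\}+1)/(n+1)\); the function name and the data are illustrative.

```python
def dempster_hill(labels, y):
    """Return the interval (Q_n(y, 0), Q_n(y, 1)) for conformity scores alpha_i = y_i."""
    n = len(labels)
    less = sum(v < y for v in labels)
    equal = sum(v == y for v in labels)
    return less / (n + 1), (less + equal + 1) / (n + 1)

labels = [2.0, 5.0, 3.0, 9.0]

# Strictly between the 2nd and 3rd order statistics: an interval of length 1/(n+1).
between = dempster_hill(labels, 4.0)

# Exactly at an observed label: the interval is twice as long.
at_point = dempster_hill(labels, 5.0)
```

With \(n=4\), a point strictly between two adjacent order statistics gets an interval of length \(1/5\), which matches the thickness \(\frac{1}{n+1}\) discussed in Sect. 2.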
As for efficiency, it is interesting that, in the most standard case of IID Gaussian observations, our predictive distributions for linear regression are as precise as the Dempster–Hill ones asymptotically when compared with Oracles I and II. Let us apply the Dempster–Hill procedure to the location/scale model \(y_i=w+\xi _i\), \(i=1,2,\ldots \), where \(\xi _i\sim N(0,\sigma ^2)\) are independent. As in the case of the LSPM, we can compare the Dempster–Hill procedure with three oracles (we consider only simplified versions): Oracle I knows neither w nor \(\sigma \), Oracle II knows \(\sigma \), and Oracle III knows both w and \(\sigma \).
It is interesting that Theorems 2 and 3 (and therefore the blue and black plots in Fig. 1) are applicable to both the LSPM and Dempster–Hill predictive distributions. (The fact that the analogous asymptotic variances for standard linear regression are as good as those for the location/scale model was emphasized in the pioneering paper by Pierce and Kopecky 1979.) The situation with Oracle III is different. Donsker’s (1952) classical result implies the following simplification of Theorems 2 and 3, where \(Q^\mathrm{III}\) stands for Oracle III’s predictive distribution (independent of n).
Theorem 4
The variance \(\varPhi (t)(1-\varPhi (t))\) of the Brownian bridge is shown as the red line in Fig. 1. However, the analogue of the process (32) does not converge in general for the LSPM (under this section’s assumption of fixed objects).
6 Proofs, calculations, and additional observations
In this section we give all proofs and calculations for the results of the previous sections and provide some additional comments.
6.1 Proofs for Sect. 2
6.1.1 Proof of Lemma 1
6.1.2 Proof of Lemma 2
 if the supremum in (33) is attained, $$\begin{aligned} F(y) < Q(y,1) = {{\mathrm{\mathbb {P}}}}(Q(Y,1)\le Q(y,1)) = {{\mathrm{\mathbb {P}}}}((Y,1)\le (y',1)) = F(y'), \end{aligned}$$ and so Q maps the lexicographic interval \(((y,1),(y',1)]\) of positive probability \(F(y')-F(y)\) into one point;
 if the supremum in (33) is not attained, $$\begin{aligned} F(y)< Q(y,1) = {{\mathrm{\mathbb {P}}}}(Q(Y,1)\le Q(y,1)) = {{\mathrm{\mathbb {P}}}}((Y,1)<(y',1)) = F(y'-), \end{aligned}$$ and so Q maps the lexicographic interval \(((y,1),(y',0))\) of positive probability \(F(y'-)-F(y)\) into one point.
In the same way we prove that \(Q(y,0)=F(y)\) for all \(y\in \mathbb {R}\).
Now (10) holds trivially when F is continuous at y. If it is not, \(Q^{-1}((F(y-),F(y)))\) will only contain points \((y,\tau )\) for various \(\tau \), and so (10) is the only way to ensure that the distribution of Q is uniform.
6.1.3 Proof of Lemma 3
Let us split all numbers \(i\in \{1,\ldots ,n+1\}\) into three classes: i of class I are those satisfying \(\alpha ^y_i>\alpha ^y_{n+1}\), i of class II are those satisfying \(\alpha ^y_i=\alpha ^y_{n+1}\), and i of class III are those satisfying \(\alpha ^y_i<\alpha ^y_{n+1}\). Each of those numbers is assigned a weight: 0 for i of class I, \(\tau /(n+1)\) for i of class II, and \(1/(n+1)\) for i of class III; notice that the weights are larger for higher-numbered classes. According to (7), \(Q_n(y,\tau )\) is the sum of the weights of all \(i\in \{1,\ldots ,n+1\}\). As y increases, each individual weight can only increase (as i can move only to a higher-numbered class), and so the total weight \(Q_n(y,\tau )\) can also only increase.
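The weighting argument of this proof translates directly into code. The sketch below sums the class weights as described above; the scores are made-up numbers, and the test score is simply raised step by step to mimic \(\alpha ^y_{n+1}-\alpha ^y_i\) increasing in y.

```python
def q_n(alphas, tau):
    """Sum of class weights over i = 1, ..., n+1; alphas[-1] is alpha_{n+1}."""
    last = alphas[-1]
    m = len(alphas)
    total = 0.0
    for a in alphas:
        if a > last:            # class I
            weight = 0.0
        elif a == last:         # class II (always contains i = n+1 itself)
            weight = tau / m
        else:                   # class III
            weight = 1.0 / m
        total += weight
    return total

train_scores = [0.3, -1.2, 0.8]          # illustrative values only
test_scores = [-2.0, -1.2, 0.0, 0.8, 1.0]
values = [q_n(train_scores + [s], 0.5) for s in test_scores]
```

As the test score rises, each training index can only move from class I to class II to class III, so the entries of `values` never decrease.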
6.2 Comments and proofs for Sect. 3
After a brief discussion of Ridge Regression Prediction Machines (analogous to Ridge Regression Confidence Machines, mentioned at the beginning of Sect. 3), we prove Propositions 1–5 and find the explicit forms for the studentized, ordinary, and deleted LSPM.
6.2.1 Ridge regression prediction machines
We can generalize the LSPM to the Ridge Regression Prediction Machine (RRPM) by replacing the Least Squares predictions in (11), (12), and (14) by Ridge Regression predictions (see Vovk et al. 2017a for details). In this paper we are interested in the case \(p\ll n\), and so Least Squares often provides a reasonable result as compared with Ridge Regression. When we move on to the kernel case (and Kernel Ridge Regression), the Least Squares method ceases to be competitive. Vovk et al. (2017a) extend some results of this paper to the kernel case replacing the LSPM by the RRPM.
Remark 2
The early versions of the Ridge Regression Confidence Machines used [expression missing in the source] in place of the right-hand side of (11) (see, e.g., Vovk et al. 2005, Section 2.3). For the first time the operation \(\left| \cdots \right| \) of taking the absolute value was dropped in Burnaev and Vovk (2014) to facilitate theoretical analysis.
6.2.2 Proof of Proposition 1
6.2.3 Proof of Proposition 2
6.2.4 Proof of Proposition 3
6.2.5 Proof of Proposition 4
6.2.6 Proof of Proposition 5
6.2.7 Computations for the studentized LSPM
6.2.8 The ordinary and deleted LSPM
6.3 Comments and proofs for Sect. 5
There are different notions of weak convergence of empirical processes used in the literature, but in this paper (in particular, Theorems 2 and 3) we use the old-fashioned one due to Skorokhod: see, e.g., Billingsley (1999, except for Section 15). We will regard empirical distribution functions and empirical processes as elements of a function space which we will denote \(\mathbb {D}\): its elements are càdlàg (i.e., right-continuous with left limits) functions \(f:\mathbb {R}\rightarrow \mathbb {R}\), and the distance between \(f,g\in \mathbb {D}\) will be defined to be the Skorokhod distance (either d or \(d^{\circ }\) in the notation of Billingsley 1999, Theorem 12.1) between the functions \(t\in [0,1]\mapsto f(\varPhi ^{-1}(t))\) and \(t\in [0,1]\mapsto g(\varPhi ^{-1}(t))\) in D[0, 1]. (Here \(\varPhi \) is the standard Gaussian distribution function; we could have used any other function on the real line that is strictly monotonically increasing from 0 to 1.)
6.3.1 Proofs of Theorems 2 and 3 for the ordinary LSPM
We will start our proof from the ordinary LSPM, in which case the predictive distribution is particularly simple.
Proof
Now Theorems 2 and 3 will follow from Mugantseva (1977) and Chen (1991). Mugantseva only treats simple linear regression, and in general we deduce Theorem 2 from Chen (1991, Theorem 2.4.3) and deduce Theorem 3 from Chen’s Theorems 2.4.3 and 2.3.2. However, to make those results applicable we need to show that the fraction \(\frac{1+g_{n+1}}{1+g_i}\) in (22) can be ignored; the following lemma shows that both \(g_{n+1}\) and \(g_i\) are sufficiently close to 0.
Lemma 5
Under our conditions A1–A4, \(\max _{i=1,\ldots ,n+1}\left| g_i\right| =O(n^{-1})=o(n^{-1/2})\).
Proof
The idea is to use Prokhorov’s theorem, in the form of Theorem 13.1 in Billingsley (1999), first proving that the finite-dimensional distributions of \(G_n\) converge to those of Z and then that the sequence \(G_n\) is tight. The functional space \(D(-\infty ,\infty )\) is defined and studied in Billingsley (1999, p. 191); we can use it in place of \(\mathbb {D}\) if we consider, without loss of generality, the domains of \(G_n\) and \(\bar{G}_n\) to be bounded. Let \(\pi _{t^*_1,\ldots ,t^*_k}\) be the projection of \(D(-\infty ,\infty )\) onto \(\mathbb {R}^k\): \(\pi _{t^*_1,\ldots ,t^*_k}(x):=(x(t^*_1),\ldots ,x(t^*_k))\).
Lemma 6
The finite-dimensional distributions of \(G_n\) weakly converge to those of Z: \(\pi _{t^*_1,\ldots ,t^*_k}(G_n)\Rightarrow \pi _{t^*_1,\ldots ,t^*_k}(Z)\).
Proof
The second step in the proof of Theorem 3 is to prove the tightness of the perturbed empirical distribution functions for the residuals.
Lemma 7
The sequence \(G_n\), \(n=1,2,\ldots \), is tight.
Proof
We will use the standard notation for càdlàg functions x on a closed interval of the real line (Billingsley 1999, Section 12): j(x) stands for the size of the largest jump of x, \(w_x(T):=\sup _{s,t\in T}\left| x(s)-x(t)\right| \) for any subset T of the domain of x, \(w_x(\delta ):=\sup _t w_x[t,t+\delta ]\) for any \(\delta >0\), and \(w'_x(\delta ):=\inf _{\{t_i\}}\max _{i\in \{1,\ldots ,v\}}w_x[t_{i-1},t_i)\), where \(t_0<t_1<\cdots <t_v\) range over the partitions of the domain \([t_0,t_v]\) of x that are \(\delta \)-sparse in the sense of \(\min _{i\in \{1,\ldots ,v\}}(t_i-t_{i-1})>\delta \).
Now Theorem 3 follows from Lemmas 6 and 7 by Billingsley (1999, Theorem 13.1).
The following lemma was used in the proof of Lemma 6.
Lemma 8
Suppose a sequence \(\bar{G}_n\) of random functions in \(D(-\infty ,\infty )\) weakly converges to a random function Z in \(C(-\infty ,\infty )\) and suppose \(t_n\rightarrow t\) are real numbers (or, more generally, \(t_n\) are random variables converging to t in probability). Then \(\bar{G}_n(t_n)\) weakly converges to Z(t).
6.3.2 Proofs for the studentized LSPM
Let us see that Theorems 2 and 3 still hold for the deleted and studentized LSPM. For concreteness, we will only consider the studentized LSPM. We have the following stronger form of Lemma 5.
Lemma 9
Under conditions A1–A4, \(\max _{i,j=1,\ldots ,n+1}\left| \bar{h}_{i,j}\right| =O(n^{-1})\).
Proof
7 Experimental results
In this section we explore experimentally the validity and efficiency of the studentized LSPM.
7.1 Online validity
7.2 Efficiency
Next we explore empirically the efficiency of the studentized LSPM. Figure 5 compares the conformal predictive distribution with the true (Oracle III’s) distribution for four randomly generated test objects and a randomly generated training sequence of length 10 with 2 attributes. The first attribute is a dummy all1 attribute; remember that Theorems 2 and 3 depend on the assumption that one of the attributes is an identical 1 (without it, the plots become qualitatively different: cf. Chen 1991, Corollary 2.4.1). The second attribute is generated from the standard Gaussian distribution, and the labels are generated as \(y_n\sim 2x_{n,2}+N(0,1)\), \(x_{n,2}\) being the second attribute. We also show (with thinner lines) the output of Oracle I and Oracle II, but only for the simplified versions, in order not to clutter the plots. Instead, in the lefthand plot of Fig. 6 we show the first plot of Fig. 5 that is normalized by subtracting the true distribution function; this time, we show the output of both simplified and proper Oracles I and II; the difference is not large but noticeable. The righthand plot of Fig. 6 is similar except that the training sequence is of length 100 and there are 20 attributes generated independently from the standard Gaussian distribution except for the first one, which is the dummy all1 attribute; the labels are generated as before, \(y_n\sim 2x_{n,2}+N(0,1)\).
Since Oracle III is more powerful than Oracles I and II (it knows the true data-generating distribution), it is more difficult to compete with; therefore, the black line is farther from the shaded area than the blue and red lines for all four plots in Fig. 5. The estimated distribution functions being to the left of the true distribution functions is a coincidence: the four plots correspond to the values 0–3 of the seed for the R pseudorandom number generator, and for other seeds the estimated distribution functions are sometimes to the right and sometimes to the left.
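The 2-attribute setup of this section (a dummy all-1 attribute, a standard Gaussian attribute, and labels \(2x_{n,2}+N(0,1)\)) can be sketched in a few lines. The following Python sketch is only an illustration and does not reproduce Figs. 5 and 6: it uses Python's generator rather than R's, so its seed is unrelated to the seeds 0–3 mentioned above. It also assumes that the studentized conformity score is the least-squares residual computed from all \(n+1\) observations divided by \(\sqrt{1-\bar{h}_{i,i}}\), and it evaluates the predictive distribution by brute force from the conformal transducer formula rather than through any closed-form expression. All function names are hypothetical.

```python
import math
import random


def studentized_scores(xs, ys):
    """Studentized least-squares residuals for a line with intercept.

    xs, ys: lists of equal length m (training data plus the test object
    with a postulated label). Returns e_i / sqrt(1 - h_ii), where h_ii is
    the i-th diagonal element of the hat matrix of the design [1, x].
    ASSUMPTION: this is the studentized conformity score of the paper.
    """
    m = len(xs)
    x_bar = sum(xs) / m
    y_bar = sum(ys) / m
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    scores = []
    for x, y in zip(xs, ys):
        residual = y - (intercept + slope * x)
        leverage = 1.0 / m + (x - x_bar) ** 2 / sxx  # hat-matrix diagonal
        scores.append(residual / math.sqrt(1.0 - leverage))
    return scores


def predictive_cdf(train_x, train_y, test_x, y, tau):
    """Conformal transducer value Q(y, tau) for the postulated label y."""
    scores = studentized_scores(train_x + [test_x], train_y + [y])
    last = scores[-1]
    below = sum(1 for s in scores if s < last)
    ties = sum(1 for s in scores if s == last)  # includes the test score itself
    return (below + tau * ties) / len(scores)


def true_cdf(test_x, y):
    """Distribution function of N(2 * test_x, 1), the data-generating law."""
    return 0.5 * (1.0 + math.erf((y - 2.0 * test_x) / math.sqrt(2.0)))


rng = random.Random(0)  # Python seed; not comparable to the R seeds in the text
n = 10
train_x = [rng.gauss(0.0, 1.0) for _ in range(n)]
train_y = [2.0 * x + rng.gauss(0.0, 1.0) for x in train_x]
test_x = rng.gauss(0.0, 1.0)

# Evaluate both distribution functions on a grid centred at the true mean.
grid = [2.0 * test_x + 0.5 * k for k in range(-8, 9)]
estimated = [predictive_cdf(train_x, train_y, test_x, y, tau=0.5) for y in grid]
for y, q in zip(grid, estimated):
    print(f"y={y:7.3f}  Q={q:.3f}  true={true_cdf(test_x, y):.3f}")
```

With a training sequence of length 10 the estimated function is a step function with jumps of size 1/11, so the printed columns can differ visibly; rerunning with other seeds shows the estimate falling on either side of the true distribution function.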
8 Conclusion
This paper introduces conformal predictive distributions in regression problems. Their advantage over the usual conformal prediction intervals is that a conformal predictive distribution \(Q_n\) contains more information; in particular, it can produce a plethora of prediction intervals: e.g., for each \(\epsilon >0\), \(\{y\in \mathbb {R}\mid \epsilon /2\le Q_n(y,\tau )\le 1-\epsilon /2\}\) is a conformal prediction interval at confidence level \(1-\epsilon \).
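As an illustration of how an interval is read off a predictive distribution, the following Python sketch scans a grid of candidate labels and keeps those satisfying \(\epsilon /2\le Q_n(y,\tau )\le 1-\epsilon /2\). To stay self-contained it assumes the simplest setting, with no attributes and the label itself as conformity score, so that \(Q_n(y,\tau )\) is the number of training labels below \(y\), plus \(\tau \) times the number of ties (counting the postulated label), divided by \(n+1\). The function names, the toy labels, and the grid are hypothetical choices made for the example.

```python
def simple_conformal_cdf(labels, y, tau):
    """Q_n(y, tau) when the conformity score is the label itself (assumption)."""
    n = len(labels)
    below = sum(1 for v in labels if v < y)
    ties = sum(1 for v in labels if v == y) + 1  # postulated label ties with itself
    return (below + tau * ties) / (n + 1)


def prediction_interval(cdf, grid, epsilon):
    """Smallest and largest grid points y with eps/2 <= cdf(y) <= 1 - eps/2."""
    inside = [y for y in grid if epsilon / 2 <= cdf(y) <= 1 - epsilon / 2]
    return (min(inside), max(inside)) if inside else None


labels = [float(k) for k in range(1, 20)]  # toy training labels 1, ..., 19
grid = [0.5 * k for k in range(41)]        # candidate labels 0, 0.5, ..., 20
interval = prediction_interval(
    lambda y: simple_conformal_cdf(labels, y, tau=0.5), grid, epsilon=0.25
)
print(interval)
```

The grid scan only approximates the set in the text to the resolution of the grid; a finer grid, or a direct search over the jump points of \(Q_n\), would locate the endpoints more precisely.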

This paper is based on the most traditional approach to weak convergence of empirical processes, originated by Skorokhod and described in detail by Billingsley (1999). This approach encounters severe difficulties in more general situations (such as multi-dimensional labels). Alternative approaches have been proposed by numerous authors, including Dudley (using the uniform topology and ball \(\sigma \)-algebra, Dudley 1966, 1967) and Hoffmann-Jørgensen (dropping measurability and working with outer integrals; see, e.g., van der Vaart and Wellner 1996, Section 1.3 and the references in the historical section). Translating our results into those alternative languages might facilitate various generalizations.

Another generalization of the traditional notion of weak convergence is Belyaev’s notion of weakly approaching sequences of random distributions (Belyaev and Sjöstedt–de Luna 2000). When comparing the LSPM with Oracle III, we limited ourselves to stating the absence of weak convergence and calculating the asymptotics of 1-dimensional distributions; Belyaev’s definition is likely to lead to more precise results.

The recent paper by Nouretdinov et al. (2018) uses inductive Venn–Abers predictors to produce predictive distributions, with very different guarantees of validity. Establishing connections between the approach of this paper and that of Nouretdinov et al. (2018) is an interesting direction of further research.
Notes
Acknowledgements
We are grateful to Teddy Seidenfeld for useful historical information. We also thank the anonymous referees of the conference and journal versions of this paper for helpful comments. Supported by the EPSRC (Grant EP/K033344/1), EU Horizon 2020 Research and Innovation programme (Grant 671555), US NSF (Grant DMS-1513483), and Leverhulme Magna Carta Doctoral Centre.
References
Augustin, T., & Coolen, F. P. A. (2004). Nonparametric predictive inference and interval probability. Journal of Statistical Planning and Inference, 124, 251–272.
Balasubramanian, V. N., Ho, S. S., & Vovk, V. (Eds.). (2014). Conformal prediction for reliable machine learning: Theory, adaptations, and applications. Amsterdam: Elsevier.
Belyaev, Y., & Sjöstedt–de Luna, S. (2000). Weakly approaching sequences of random distributions. Journal of Applied Probability, 37, 807–822.
Billingsley, P. (1999). Convergence of probability measures (2nd ed.). New York: Wiley.
Burnaev, E., & Vovk, V. (2014). Efficiency of conformalized ridge regression. JMLR: Workshop and Conference Proceedings, 35, 605–622. (COLT 2014).
Chatterjee, S., & Hadi, A. S. (1988). Sensitivity analysis in linear regression. New York: Wiley.
Chen, G. (1991). Empirical processes based on regression residuals: Theory and applications. PhD thesis, Department of Mathematics and Statistics, Simon Fraser University.
Dempster, A. P. (1963). On direct probabilities. Journal of the Royal Statistical Society B, 25, 100–110.
Donsker, M. D. (1952). Justification and extension of Doob’s heuristic approach to the Kolmogorov–Smirnov theorems. Annals of Mathematical Statistics, 23, 277–281.
Dudley, R. M. (1966). Weak convergence of probabilities on nonseparable metric spaces and empirical measures on Euclidean spaces. Illinois Journal of Mathematics, 10, 109–126.
Dudley, R. M. (1967). Measures on non-separable metric spaces. Illinois Journal of Mathematics, 11, 449–453.
Fisher, R. A. (1939). Student. Annals of Eugenics, 9, 1–9.
Fisher, R. A. (1948). Conclusions fiduciaires. Annales de l’Institut Henry Poincaré, 10, 191–213.
Genest, C., & Kalbfleisch, J. (1988). Bayesian nonparametric survival analysis: Comment. Journal of the American Statistical Association, 83, 780–781.
Geyer, C. J., & Meeden, G. D. (2005). Fuzzy and randomized confidence intervals and p-values (with discussion). Statistical Science, 20, 358–387.
Gneiting, T., & Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1, 125–151.
Hill, B. M. (1968). Posterior distribution of percentiles: Bayes’ theorem for sampling from a population. Journal of the American Statistical Association, 63, 677–691.
Hill, B. M. (1988). De Finetti’s theorem, induction, and \(A_{(n)}\) or Bayesian nonparametric predictive inference (with discussion). In D. V. Lindley, J. M. Bernardo, M. H. DeGroot, & A. F. M. Smith (Eds.), Bayesian statistics (Vol. 3, pp. 211–241). Oxford: Oxford University Press.
Jeffreys, H. (1932). On the theory of errors and least squares. Proceedings of the Royal Society of London A, 138, 48–55.
Lane, D. A. (1980). Fisher, Jeffreys, and the nature of probability. In S. E. Fienberg & D. V. Hinkley (Eds.), R. A. Fisher: An appreciation, lecture notes in statistics (Vol. 1, pp. 148–160). Berlin: Springer.
Lawless, J. F., & Fredette, M. (2005). Frequentist prediction intervals and predictive distributions. Biometrika, 92, 529–542.
Mohammadi, M. (2016). On the bounds for diagonal and off-diagonal elements of the hat matrix in the linear regression model. REVSTAT Statistical Journal, 14, 75–87.
Montgomery, D. C., Peck, E. A., & Vining, G. G. (2012). Introduction to linear regression analysis (5th ed.). Hoboken, NJ: Wiley.
Mugantseva, L. A. (1977). Testing normality in one-dimensional and multi-dimensional linear regression. Theory of Probability and Its Applications, 22, 591–602.
Nouretdinov, I., Volkhonskiy, D., Lim, P., Toccaceli, P., & Gammerman, A. (2018). Inductive Venn–Abers predictive distribution. Proceedings of Machine Learning Research, 60, 15–36. (COPA 2018).
Pierce, D. A., & Kopecky, K. J. (1979). Testing goodness of fit for the distribution of errors in regression models. Biometrika, 66, 1–5.
Pinelis, I. (2015). Exact bounds on the closeness between the Student and standard normal distributions. ESAIM: Probability and Statistics, 19, 24–27.
Schweder, T., & Hjort, N. L. (2016). Confidence, likelihood, probability: Statistical inference with confidence distributions. Cambridge: Cambridge University Press.
Seber, G. A. F., & Lee, A. J. (2003). Linear regression analysis (2nd ed.). Hoboken, NJ: Wiley.
Seidenfeld, T. (1995). Jeffreys, Fisher, and Keynes: Predicting the third observation, given the first two. In A. F. Cottrell & M. S. Lawlor (Eds.), New perspectives on Keynes (pp. 39–52). Durham, NC: Duke University Press.
Shen, J., Liu, R., & Xie, M. (2018). Prediction with confidence—A general framework for predictive inference. Journal of Statistical Planning and Inference, 195, 126–140.
van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes: With applications to statistics. New York: Springer.
Vovk, V., & Bendtsen, C. (2018). Conformal predictive decision making. Proceedings of Machine Learning Research, 91, 52–62. (COPA 2018).
Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning in a random world. New York: Springer.
Vovk, V., Nouretdinov, I., & Gammerman, A. (2009). On-line predictive linear regression. Annals of Statistics, 37, 1566–1590.
Vovk, V., Nouretdinov, I., Manokhin, V., & Gammerman, A. (2017a). Conformal predictive distributions with kernels. On-line compression modelling project (new series). http://alrw.net. Working paper 20.
Vovk, V., Shen, J., Manokhin, V., & Xie, M. (2017b). Nonparametric predictive distributions based on conformal prediction. Proceedings of Machine Learning Research, 60, 82–102. (COPA 2017).
Wang, C. M., Hannig, J., & Iyer, H. K. (2012). Fiducial prediction intervals. Journal of Statistical Planning and Inference, 142, 1980–1990.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.