Introduction

Machine learning techniques have been widely applied in many areas because of their expressive power, aided by gradient descent [13, 15, 19, 23, 24] and metaheuristics [1,2,3, 16, 17] for efficient parameter search. Based on machine learning, one can build predictive models that account for the uncertainty of the outputs. This paper concentrates on predictive systems that provide a predictive distribution for the test label, which is desirable in many application domains [11]. For regression problems, the predictive distribution contains the full information about the uncertainty: it can provide the probability of any event relevant to the test label and can be transformed into a point prediction or a prediction interval by using the corresponding first-order moment or quantiles. For many applications, especially high-risk ones, the predictive distributions are required to be valid, which means that the distributions or their derived prediction intervals are statistically compatible with the realizations, i.e., they ought to tell the truth [33].

Nowadays, many algorithms in statistics and machine learning have been proposed to output predictive distributions for test labels. However, most of them, such as Bayesian regression and Gaussian process regression, depend heavily on their prior distribution assumptions and can be far from valid when those assumptions are incorrect [6, 18]. Recently, some frequentist probabilistic prediction algorithms have been proposed with compatibility with realizations in mind [25, 27]. Although these approaches are more concerned with frequentist probability, their original parametric forms limit their applicability. This issue has been tackled by a collection of promising works on conformal predictive systems (CPSs) [31, 33, 34], which build predictive systems within the learning framework of conformal prediction [6, 32] and extend the above frequentist approaches to a general nonparametric setting that is valid even in small-sample cases.

The purpose of conformal prediction is to output valid prediction sets for test labels. One of the key characteristics of conformal prediction is that its p values, calculated from conformity scores, follow the uniform distribution on \([0, 1]\) under the assumption that the samples are independent and identically distributed. This excellent property enables us to transform the unknown uncertainty in the data into one of our most familiar distributions. CPSs utilize the p values of conformal prediction and transform them into predictive distributions, which gives CPSs the small-sample property of validity [31].

The pioneering work [31] first proposed CPSs with the classical least squares procedure as the underlying algorithm and proved their asymptotic efficiency under some strong assumptions. After that, the work in [31] answered some general questions about the existence and construction of consistent CPSs. In addition to the general theoretical studies above, two kinds of works concentrating on the applicability of CPSs have been carried out. The first kind proposes more flexible CPSs, whose representatives are [33] and [35]. The former replaces the classical least squares procedure with a more powerful underlying algorithm, kernel ridge regression, and the latter proposed conformal calibration, whose underlying algorithms are existing predictive systems. The second kind speeds up the learning process of CPSs, as CPSs inherit the computational issue of conformal prediction [14, 20, 30]. There are two ways to address this. One way is to modify the learning process of the original CPSs, as in split conformal predictive systems (SCPSs) and cross-conformal predictive systems (CCPSs) [34]. SCPSs are also valid even in small-sample cases, but they may lose predictive efficiency because they split the data into two parts, one of which is used to train the underlying algorithm and the other to calculate conformity scores. Although CCPSs do not have a theoretical guarantee of validity, they improve prediction performance by making full use of the data. Another way is to use a fast and well-performing underlying algorithm to compute the conformity scores, which was our previous approach to building a fast probabilistic prediction algorithm [37]. In that work, based on the Leave-One-Out CCPS and the extreme learning machine [12], we proposed a fast CPS named LOO–CCPS–RELM and analysed its asymptotic validity. LOO–CCPS–RELM takes advantage of jackknife prediction of residuals and their closed-form formula to make the whole learning process fast, which makes it suitable for real-time applications.

This work extends our previous work on LOO–CCPS–RELM in two aspects. First, we design a more general learning framework for probabilistic prediction in the spirit of LOO–CCPS–RELM, whose underlying algorithm can be any uniformly stable algorithm. Second, in contrast to LOO–CCPS–RELM, which was designed and proved to be asymptotically valid only for homoscedastic cases, the learning framework in this paper considers heteroscedastic cases, and a more general theoretical guarantee of asymptotic validity is proved. The heteroscedastic cases are addressed by the idea of locally weighted jackknife prediction, whose theoretical analysis for prediction intervals was conducted in our earlier work [38]. This paper extends the related concepts and analytical techniques to probabilistic prediction. Since the proposed predictive system is based on the idea of locally weighted jackknife prediction, it is named the locally weighted jackknife predictive system (LW-JPS).

In summary, to build a valid and computationally efficient predictive system, we develop a locally weighted jackknife prediction approach with an asymptotic guarantee of validity. The contributions are as follows:

  • A general predictive system based on the idea of locally weighted jackknife prediction is proposed for probabilistic prediction; it is easy to implement and learns fast if the underlying algorithms have closed-form formulas for leave-one-out residuals.

  • The asymptotic validity of the proposed predictive system is proved under some regularity assumptions, which extends the analysis of LOO–CCPS–RELM by considering a more general setting and heteroscedastic cases.

  • Experiments on 20 public data sets are conducted, which empirically demonstrate the effectiveness and efficiency of the proposed predictive system.

The rest of this paper is organized as follows. "Conformal predictive systems and locally weighted jackknife predictive system" reviews conformal predictive systems and defines the proposed LW-JPS. "Asymptotic analysis of locally weighted jackknife predictive system" proves the asymptotic validity of LW-JPS under some regularity assumptions and conditions. In "Experiments", experiments are designed to test the validity and efficiency of LW-JPS empirically, and the conclusions of this paper are drawn in "Conclusion".

Conformal predictive systems and locally weighted jackknife predictive system

Throughout this paper, \({\varvec{X}}\subseteq {{\varvec{R}}}^{n}\) denotes the object space and \({\varvec{Y}}\subseteq {\varvec{R}}\) the label space. The observation space is denoted by \({\varvec{Z}}={\varvec{X}}\times {\varvec{Y}}\) and each observation \({\varvec{z}}=({\varvec{x}}, y)\in {\varvec{X}}\times {\varvec{Y}}\) comprises its object \({\varvec{x}}\) and corresponding label \(y\). \({{\varvec{Z}}}^{l}=\{{{\varvec{Z}}}_{i}, i=1, \cdots , l\}\) denotes a random training set whose realization is \({{\varvec{z}}}^{l}=\{{{\varvec{z}}}_{i}, i=1, \cdots , l\}\). \({{\varvec{Z}}}_{0}\) denotes a random test observation whose realization is \({{\varvec{z}}}_{0}\), where \({{\varvec{Z}}}_{0}=\left({{\varvec{X}}}_{0},{Y}_{0}\right),{{\varvec{Z}}}_{1}=\left({{\varvec{X}}}_{1},{Y}_{1}\right),\cdots ,{{\varvec{Z}}}_{l}=\left({{\varvec{X}}}_{l},{Y}_{l}\right)\) are independent and identically distributed and drawn from the distribution \(\rho \) on \({\varvec{Z}}={\varvec{X}}\times {\varvec{Y}}\). \(T\) denotes a random number uniformly distributed on \(\left[0,1\right]\), which is independent of all observations; its realization is denoted by \(t\).

For a fixed training set \({{\varvec{z}}}^{l}\) and a test input object \({{\varvec{x}}}_{0}\), the goal of predictive systems is to construct a predictive distribution on \(y\in {\varvec{R}}\), which contains much of the information about \({y}_{0}\).

Predictive system and randomized predictive system

We first give the definition of a predictive system, which was first formally defined in [35].

Definition 1

A measurable function \(Q:{\varvec{Z}}^{l + 1} \to \left[ {0, 1} \right]\) is a predictive system (PS) if it satisfies the following two conditions:

A. For each realization \({{\varvec{z}}}^{l}\) and \({{\varvec{x}}}_{0}\) , the function \(Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right)\right)\) is increasing in \(y\in {\varvec{R}}\) .

B. For each realization \({{\varvec{z}}}^{l}\) and \({{\varvec{x}}}_{0}\) ,

$$\underset{y\to -\infty }{\mathit{lim}}Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right)\right)=0$$

and

$$\underset{y\to \infty }{\mathit{lim}}Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right)\right)=1.$$

Next, the notion of a randomized predictive system is needed to introduce conformal predictive systems.

Definition 2

A measurable function \(Q:{\varvec{Z}}^{l + 1} \times \left[ {0,1} \right] \to \left[ {0,1} \right]\) is a randomized predictive system (RPS) if it satisfies the following two conditions:

A. For each realization \({{\varvec{z}}}^{l}\) and \({{\varvec{x}}}_{0}\) , the function \(Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right),t\right)\) is increasing in \(y\in {\varvec{R}}\) and \(t\in \left[0,1\right]\) .

B. For each realization \({{\varvec{z}}}^{l}\) and \({{\varvec{x}}}_{0}\) ,

$$\underset{y\to -\infty }{\mathit{lim}}Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right),0\right)=0$$

and

$$\underset{y\to \infty }{\mathit{lim}}Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right),1\right)=1.$$

In this paper, we use the shorthand notation \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)=Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right)\right)\) to explicitly regard it as a function of \(y\) dependent on \({{\varvec{z}}}^{l}\) and \({{\varvec{x}}}_{0}\), and the shorthand notation \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)=Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right),t\right)\) to explicitly regard it as a function of \(y\) dependent on \({{\varvec{z}}}^{l}\), \({{\varvec{x}}}_{0}\) and \(t\). \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)\) is the predictive distribution of a PS, which is a cumulative distribution function (CDF) of \({Y}_{0}\) given \({{\varvec{z}}}^{l}\) and \({{\varvec{x}}}_{0}\). In contrast, an RPS introduces a random number \(t\) to build the predictive distribution \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)\). For a fixed training set \({{\varvec{z}}}^{l}\) and \({{\varvec{x}}}_{0}\), the lower and upper bounds of \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)\) are \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},0}\left(y\right)\) and \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},1}\left(y\right)\), respectively. The gap \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},1}\left(y\right)-{Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},0}\left(y\right)\) converges to 0 quickly for existing RPS designs [31]. Thus, one can use a CDF between or approximating \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},0}\left(y\right)\) and \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},1}\left(y\right)\) to remove the impact of \(t\) and build the predictive distribution of \({Y}_{0}\).

A predictive system \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)\) is valid if the following holds:

$$P\left\{{Q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \eta \right\}=\eta ,$$
(1)

where \({Q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left(y\right)\) is a random function of \(y\) whose realization is \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)\). In addition, \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)\) is asymptotically valid if formula (1) holds asymptotically. Let \({\widehat{q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}^{\left(\eta /2\right)}\) and \({\widehat{q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}^{\left(1-\eta /2\right)}\) be the \(\eta /2\) and \(1-\eta /2\) quantiles of \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)\). Then, the property of validity defined by formula (1) ensures that

$$ P\left\{ {Y_{0} \in C_{{{\varvec{Z}}^{l} ,{\varvec{X}}_{0} }}^{{\left( {1 - \eta } \right)}} } \right\} = 1 - \eta , $$
(2)

where \({C}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left(1-\eta \right)}=\left[{\widehat{q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}^{\left(\eta /2\right)},{\widehat{q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}^{\left(1-\eta /2\right)}\right]\) is the prediction interval derived from \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)\), whose expected coverage rate is \(1-\eta \).
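For concreteness, the following minimal sketch (our own illustrative code, not part of the paper's algorithms) shows how a \(1-\eta \) prediction interval can be read off from a predictive distribution, assuming the distribution is given as the empirical CDF of a finite set of support values, as in the constructions reviewed below.

```python
import numpy as np

def prediction_interval(support, eta):
    """Derive a (1 - eta) prediction interval from an empirical predictive
    distribution given by its support values C_1, ..., C_n: take the
    eta/2 and 1 - eta/2 quantiles of the empirical CDF."""
    lower = np.quantile(support, eta / 2)
    upper = np.quantile(support, 1 - eta / 2)
    return lower, upper

# toy usage with an arbitrary synthetic predictive distribution
rng = np.random.default_rng(0)
C = np.sort(rng.normal(loc=2.0, scale=0.5, size=200))
print(prediction_interval(C, eta=0.1))  # interval with expected coverage 0.9
```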

The predictive systems developed in the literature need strong assumptions to be valid in small-sample cases [25, 27]. Therefore, to obtain validity in small-sample cases, a randomized predictive system introduces the extra random number \(t\), which is used to define a similar property of validity for an RPS as follows:

$$P\left\{{Q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0},T}\left({Y}_{0}\right)\le \eta \right\}=\eta $$
(3)

If \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)\) is the \(p\) value of conformal prediction, the corresponding RPS is called a conformal predictive system, which has the small-sample validity property defined by formula (3), and an equation like formula (2) holds after introducing \(T\).

Next, we review SCPSs and CCPSs to demonstrate how to construct the function \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)\).

Split conformal predictive system

To build a CPS, the conformity scores of observations calculated by a conformity measure \(A\left(S,{\varvec{z}}\right)\) are needed, where \(S\) is a data set and \({\varvec{z}}\) is an observation. The conformity measure evaluates the degree of agreement between \(S\) and \({\varvec{z}}\). In the context of SCPSs, \(A\left(S,{\varvec{z}}\right)\) should be a balanced isotonic function [34]. In general, with a regression algorithm \(\mu \), \(A\left(S,{\varvec{z}}\right)\) can be designed as

$$A\left(S,{\varvec{z}}\right)=y-{\widehat{\mu }}_{S}\left({\varvec{x}}\right),$$
(4)

or

$$A\left(S,{\varvec{z}}\right)=\frac{y-{\widehat{\mu }}_{S}\left({\varvec{x}}\right)}{\sqrt{{\widehat{\upsilon }}_{S}\left({\varvec{x}}\right)}},$$
(5)

where \({\widehat{\mu }}_{S}\) and \({\widehat{\upsilon }}_{S}\) are estimated mean function and conditional variance function learned from \(S\), respectively.

The learning process of SCPSs splits the training set \({{\varvec{z}}}^{l}\) into two parts, which are the proper training set \({{\varvec{z}}}_{1}^{m}=\{\left({{\varvec{x}}}_{j},{y}_{j}\right),j=\mathrm{1,2},\cdots ,m\}\) and the calibration set \({{\varvec{z}}}_{m}^{l}=\{\left({{\varvec{x}}}_{j},{y}_{j}\right),j=m+1,\cdots ,l\}\). For each possible label \(y\in {\varvec{R}}\), \(l-m+1\) conformity scores can be computed as follows:

$${\alpha }_{i}=A\left({{\varvec{z}}}_{1}^{m},\left({{\varvec{x}}}_{i},{y}_{i}\right)\right),$$
$${\alpha }_{0}^{y}=A\left({{\varvec{z}}}_{1}^{m},\left({{\varvec{x}}}_{0},y\right)\right),$$

where \({{\varvec{x}}}_{0}\) is a test input, \(y\) is a postulated label for \({{\varvec{x}}}_{0}\) and \(i=m+1,m+2,\cdots ,l\). Based on the above calculation, \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)\) can be obtained as formula (5) in [37]. The theory in [34] shows that the above \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)\) is a valid RPS.
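As an illustration only, the sketch below shows one standard form of the smoothed split conformal transducer built from the calibration conformity scores and the random number \(t\); the exact expression used by SCPSs is formula (5) in [37], which we do not reproduce here.

```python
import numpy as np

def split_conformal_q(alpha_cal, alpha_0_y, t):
    """One standard form of the smoothed split conformal p value:
    Q = (#{alpha_i < alpha_0^y} + t * (#{alpha_i = alpha_0^y} + 1)) / (n_cal + 1),
    where alpha_cal are the calibration conformity scores and alpha_0_y is the
    score of the test object with postulated label y."""
    alpha_cal = np.asarray(alpha_cal)
    smaller = np.sum(alpha_cal < alpha_0_y)
    ties = np.sum(alpha_cal == alpha_0_y)
    return (smaller + t * (ties + 1)) / (len(alpha_cal) + 1)
```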

Different \(A\left(S,{\varvec{z}}\right)\) leads to different \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)\). Suppose that formula (5) is the conformity measure and define \({C}_{i}\) as

$${C}_{i}={\widehat{\mu }}_{{{\varvec{z}}}_{1}^{m}}\left({{\varvec{x}}}_{0}\right)+\frac{{y}_{m+i}-{\widehat{\mu }}_{{{\varvec{z}}}_{1}^{m}}\left({{\varvec{x}}}_{m+i}\right)}{\sqrt{{\widehat{\upsilon }}_{{{\varvec{z}}}_{1}^{m}}\left({{\varvec{x}}}_{m+i}\right)}}\times \sqrt{{\widehat{\upsilon }}_{{{\varvec{z}}}_{1}^{m}}\left({{\varvec{x}}}_{0}\right)}.$$

Sort \({C}_{i}\) to obtain \({C}_{\left(1\right)}\le \cdots \le {C}_{\left(l-m\right)}\) and let \({C}_{\left(0\right)}=-\infty \) and \({C}_{\left(l-m+1\right)}=\infty \). Then, the corresponding \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)\) is calculated as formula (7) in [37], which can be further modified to become a formal CDF as formula (8) in [37], i.e., the empirical CDF of \(\left\{{C}_{\left(i\right)},i=1,\cdots ,l-m\right\}\).
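The following sketch (with our own illustrative names; `mu_hat` and `v_hat` stand for any fitted mean and conditional-variance regressors) shows how the \({C}_{i}\) and the resulting empirical CDF can be computed for an SCPS with the normalized conformity measure (5).

```python
import numpy as np

def scps_support(mu_hat, v_hat, X_cal, y_cal, x0):
    """C_i = mu_hat(x0) + normalized calibration residual * sqrt(v_hat(x0)),
    where mu_hat and v_hat were fitted on the proper training set only."""
    mu_cal = mu_hat.predict(X_cal)
    v_cal = np.clip(v_hat.predict(X_cal), 1e-12, None)   # keep variance estimates positive
    x0 = np.asarray(x0).reshape(1, -1)
    mu0 = mu_hat.predict(x0)[0]
    v0 = max(v_hat.predict(x0)[0], 1e-12)
    residuals = (y_cal - mu_cal) / np.sqrt(v_cal)
    return np.sort(mu0 + residuals * np.sqrt(v0))        # C_(1) <= ... <= C_(l-m)

def empirical_cdf(support, y):
    """Evaluate the empirical CDF of the support values at a candidate label y."""
    return np.searchsorted(support, y, side="right") / len(support)
```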

The splitting process of SCPSs may not make full use of the training data, which motivated the development of CCPSs.

Cross-conformal predictive system

Based on the idea of cross validation, CCPSs first partition the training data into \(k\) folds. Let \({o}_{i}\) denote the indices of the training data in the \(i\)th fold and \({{\varvec{z}}}_{\left({o}_{i}\right)}^{l}\) denote the training data without the \(i\)th fold. For each \(i\in \left\{1,\cdots ,k\right\}\), a CCPS with conformity measure \(A\left(S,{\varvec{z}}\right)\) calculates the conformity scores with \({{\varvec{z}}}_{\left({o}_{i}\right)}^{l}\) being the proper training set and \(\left\{{{\varvec{z}}}_{j}|j\in {o}_{i}\right\}\) the calibration set. The corresponding conformity scores are

$${\alpha }_{j,i}=A\left({{\varvec{z}}}_{\left({o}_{i}\right)}^{l},{{\varvec{z}}}_{j}\right)$$

and

$${\alpha }_{0,i}^{y}=A\left({{\varvec{z}}}_{\left({o}_{i}\right)}^{l},\left({{\varvec{x}}}_{0},y\right)\right).$$

Finally, the function \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)\) of the CCPS is written as formula (9) in [37].

Suppose that formula (5) is the conformity measure and for \(j\in {o}_{i}\), \({C}_{j,i}\) is written as

$${C}_{j,i}={\widehat{\mu }}_{{{\varvec{z}}}_{\left({o}_{i}\right)}^{l}}\left({{\varvec{x}}}_{0}\right)+\frac{{y}_{j}-{\widehat{\mu }}_{{{\varvec{z}}}_{\left({o}_{i}\right)}^{l}}\left({{\varvec{x}}}_{j}\right)}{\sqrt{{\widehat{v}}_{{{\varvec{z}}}_{\left({o}_{i}\right)}^{l}}\left({{\varvec{x}}}_{j}\right)}}\times \sqrt{{\widehat{v}}_{{{\varvec{z}}}_{\left({o}_{i}\right)}^{l}}\left({{\varvec{x}}}_{0}\right)}.$$

Sort all \({C}_{j,i}\) to obtain \({C}_{\left(1\right)}\le \cdots \le {C}_{\left(l\right)}\) and set \({C}_{\left(0\right)}=-\infty \) and \({C}_{\left(l+1\right)}=\infty \). Then, the \({Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)\) of the above CCPS can be written as formula (10) in [37], which can be further modified to become a formal CDF as formula (11) in [37], i.e., the empirical CDF of \(\left\{{C}_{\left(i\right)},i=1,\cdots ,l\right\}.\)

The Leave-One-Out CCPS with formula (5) as the conformity measure is obtained by choosing \(k=l\); its predictive distribution is the empirical CDF of \(\left\{{C}_{i},i=1,\cdots ,l\right\}\), with \({C}_{i}\) written as

$${C}_{i}={\widehat{\mu }}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{0}\right)+\frac{{y}_{i}-{\widehat{\mu }}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{i}\right)}{\sqrt{{\widehat{v}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{i}\right)}}\times \sqrt{{\widehat{v}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{0}\right)}.$$

We summarize the Leave-One-Out CCPS in Algorithm 1, since our proposed predictive system based on locally weighted jackknife prediction is highly related to it.

Algorithm 1 Leave-One-Out CCPS

Locally weighted jackknife predictive system

Jackknife prediction employs leave-one-out predictions on the training data and was proposed in the context of conformal prediction to build interval predictors [14, 36, 38]. Here, we extend it to build predictive systems, inspired by the Leave-One-Out CCPS. Locally weighted jackknife prediction is jackknife prediction with the square root of \({\widehat{\upsilon }}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{i}\right)\) in Algorithm 1 as the local weight. In fact, Algorithm 1 can be modified to be based on locally weighted jackknife prediction by changing \({\widehat{\mu }}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{0}\right)\) and \({\widehat{v}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{0}\right)\) to \({\widehat{\mu }}_{{{\varvec{z}}}^{l}}\left({{\varvec{x}}}_{0}\right)\) and \({\widehat{v}}_{{{\varvec{z}}}^{l}}\left({{\varvec{x}}}_{0}\right)\), respectively, which reduces the number of regressor fits from \(l\) to 1. In addition, one also needs a way of calculating or approximating \({\widehat{v}}_{{{\varvec{z}}}^{l}}\) and \({\widehat{v}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\) efficiently to build the predictive system. In this paper, we employ the approximation developed in our previous works on conformal prediction [36, 38], which leads to the proposed predictive system based on locally weighted jackknife prediction in Algorithm 2.

Algorithm 2 Locally weighted jackknife predictive system (LW-JPS)

Algorithm 2 utilizes the jackknife predictions \({\widehat{\mu }}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\) and calculates the locally weighted leave-one-out residuals, with the square root of \({\widehat{v}}_{{\widehat{{\varvec{z}}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{i}\right)\) as the weight, to build the predictive system.

Although Algorithm 2 needs to compute leave-one-out residuals, the learning process can be fast if the underlying algorithms \(\mu \) and \(v\) are linear smoothers [39], for which the leave-one-out residuals have closed-form formulas.
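The sketch below illustrates this point with ridge regression as \(\mu \), for which the leave-one-out residuals have the well-known closed form \((y_i-\hat{y}_i)/(1-H_{ii})\) with \(H\) the hat matrix; the \(k\)-nearest-neighbor regressor fitted on squared leave-one-out residuals is only a simple stand-in for the approximation of \(\widehat{v}\) used in [36, 38], and all names are ours rather than the paper's.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def lw_jps_support(X, y, x0, lam=1.0):
    """LW-JPS sketch: the predictive distribution is the empirical CDF of
    mu_hat(x0) + a_i * sqrt(v_hat(x0)), where a_i are the locally weighted
    leave-one-out residuals of Algorithm 2 (illustrative estimators only)."""
    l, n = X.shape
    # ridge regression is a linear smoother: exact closed-form LOO residuals
    G = X.T @ X + lam * np.eye(n)
    H = X @ np.linalg.solve(G, X.T)                      # hat matrix
    loo_res = (y - H @ y) / (1.0 - np.diag(H))           # leave-one-out residuals
    # stand-in conditional variance estimate fitted once on squared LOO residuals
    v_hat = KNeighborsRegressor(n_neighbors=max(1, int(np.sqrt(l)))).fit(X, loo_res ** 2)
    v_i = np.clip(v_hat.predict(X), 1e-12, None)
    a = loo_res / np.sqrt(v_i)                           # normalized LOO residuals a_{l,i}
    x0 = np.asarray(x0).reshape(1, -1)
    mu0 = (x0 @ np.linalg.solve(G, X.T @ y))[0]          # full-data ridge prediction at x0
    v0 = max(v_hat.predict(x0)[0], 1e-12)
    return np.sort(mu0 + a * np.sqrt(v0))                # support of the predictive CDF
```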

Asymptotic analysis of locally weighted jackknife predictive system

This section provides the asymptotic analysis of LW-JPS. We first give the related definition, assumptions and conditions, and then prove the asymptotic validity of LW-JPS.

Definitions, assumptions and conditions

Throughout the paper, we assume that the labels are bounded by \(D\), i.e., \({sup}_{y\in {\varvec{Y}}}\left|y\right|\le D\). Regularity properties of the probability distribution \(\rho \) on \({\varvec{Z}}={\varvec{X}}\times {\varvec{Y}}\) will be assumed when needed, as in [9]. All observations \(\left({{\varvec{X}}}_{i},{Y}_{i}\right)\) are i.i.d. samples. The generalization error of a function \(f:{\varvec{X}}\to {\varvec{Y}}\) is measured by

$$\xi \left(f\right)=E\left[{\left(f\left({\varvec{X}}\right)-Y\right)}^{2}\right]={\int }_{{\varvec{Z}}}\left(f\left({\varvec{x}}\right)-y\right)^{2}d\rho .$$

Denote the marginal probability distribution of \(\rho \) on \({\varvec{X}}\) as \({\rho }_{{\varvec{X}}}\), which is \({\rho }_{{\varvec{X}}}\left(S\right)=\rho \left(S\times {\varvec{Y}}\right)\) for the measurable set \(S\subseteq {\varvec{X}}\). The conditional distribution of \(y\) given \({\varvec{x}}\) is \(\rho \left(y|{\varvec{x}}\right)\) and the regression function of \(\rho \) is

$${\mu }_{\rho }\left({\varvec{x}}\right)=E\left[Y|{\varvec{X}}={\varvec{x}}\right]={\int }_{{\varvec{Y}}}y\,d\rho \left(y|{\varvec{x}}\right).$$

Therefore, based on Proposition 1.8 in [9], \({\mu }_{\rho }\) is the minimizer of \(\xi \left(f\right)\) and for each \(f:{\varvec{X}}\to {\varvec{Y}}\),

$$\xi \left(f\right)-\xi \left({\mu }_{\rho }\right)={\int }_{{\varvec{X}}}{\left(f\left({\varvec{x}}\right)-{\mu }_{\rho }\left({\varvec{x}}\right)\right)}^{2}d{\rho }_{{\varvec{X}}}.$$

It can be concluded that \({\mu }_{\rho }\) is bounded by \(D\), since \(\left|Y\right|\le D\).

For the regression problem, we assume that the samples satisfy Assumption 1, where \({\Vert f\Vert }_{\infty }\) is the infinity norm of \(f\) on its domain, i.e., \({\Vert f\Vert }_{\infty }={\mathrm{sup}}_{{\varvec{x}}\in {\varvec{X}}}\left|f\left({\varvec{x}}\right)\right|\).

Assumption 1

Each observation \(\left({\varvec{X}},Y\right)\) satisfies the following formula:

$$Y={\mu }_{\rho }\left({\varvec{X}}\right)+\sqrt{{v}_{\rho }\left({\varvec{X}}\right)}\times \zeta ,$$

where \({v}_{\rho }\left({\varvec{X}}\right)\) is the conditional variance function and \(\zeta \) is a random variable with zero mean and unit variance. \(\zeta \) is independent of \({\varvec{X}}\) and \(0<{v}_{min}\le {\Vert {v}_{\rho }\Vert }_{\infty }\le {v}_{max}<\infty \). In addition, \(\left|\zeta \right|\le {\zeta }_{max}\), and its cumulative distribution function \(F\left(b\right)=P\left\{\zeta \le b\right\}\) is continuous and strictly increasing on \(\left\{b|F\left(b\right)\in \left(\mathrm{0,1}\right)\right\}\).

The formula in Assumption 1 is a standard assumption for regression problems in the heteroscedastic setting, where the conditional variance of \(Y\) depends on \({\varvec{X}}\) rather than being a constant. Since \(Y\) is bounded, \(\left|\zeta \right|\le {\zeta }_{max}\) is assumed.
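As a small illustration of this data model (with an arbitrary choice of \({\mu }_{\rho }\), \({v}_{\rho }\) and a bounded noise distribution; none of these specific choices come from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
l, n = 500, 3
X = rng.uniform(0.0, 1.0, size=(l, n))
mu = X @ np.array([0.5, -0.3, 0.2])                 # an example regression function mu_rho
v = 0.01 + 0.05 * X[:, 0] ** 2                      # conditional variance depends on X (heteroscedastic)
zeta = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), l)  # bounded noise with zero mean and unit variance
y = mu + np.sqrt(v) * zeta                          # Y = mu_rho(X) + sqrt(v_rho(X)) * zeta
```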

We write \({X}_{l}{\to }_{p}X\) when an array of random variables \({X}_{l}\), \(l\in {{\varvec{N}}}^{+}\), converges to a random variable \(X\) in probability; the formal definition can be found in Definition 1 of [38].

To prove the asymptotic validity of LW-JPS, the following four conditions are needed for the algorithms \(\mu \) and \(v\); they were also introduced in our earlier work on the theoretical analysis of locally weighted jackknife prediction [38]. In the conditions, \(r\) represents a general regression algorithm and \({{\varvec{Z}}}^{l}\) is a general random data set of i.i.d. samples used to train \(r\). \({\widehat{r}}_{{{\varvec{Z}}}^{l}}\) is the learned regressor, whose randomness comes from \({{\varvec{Z}}}^{l}\), and \({\widehat{r}}_{{{\varvec{z}}}^{l}}\) is the corresponding realization.


Condition 1. The regression algorithm \(r\) is symmetric in the observations, such that for each \(l\), each \({{\varvec{z}}}^{l}\) and each permutation \(\pi \) of \(\left\{1,\cdots ,l\right\}\), there holds

$${\widehat{r}}_{{{\varvec{z}}}^{l}}={\widehat{r}}_{{\pi }_{l}\left({{\varvec{z}}}^{l}\right)},$$

where \({\pi }_{l}\left({{\varvec{z}}}^{l}\right)=\left\{{{\varvec{z}}}_{\pi \left(j\right)},j=1,\cdots ,l\right\}\).


Condition 2. The regressor \({\widehat{r}}_{{{\varvec{Z}}}^{l}}\) uniformly converges in probability to the regression function \({\mu }_{\rho }\) of \({\varvec{Z}}\), i.e.,

$${\Vert {\widehat{r}}_{{{\varvec{Z}}}^{l}}-{\mu }_{\rho }\Vert }_{\infty }{\to }_{p}0.$$

Condition 3. The regression algorithm \(r\) is a uniformly stable algorithm [8], whose uniform stability with respect to the square loss is \(\beta =\beta \left(l\right)\), i.e., for each \(l\) and each \({{\varvec{z}}}^{l}\),

$$\underset{i}{\mathrm{sup}}\left(\underset{({\varvec{x}},y)\in {\varvec{X}}\times {\varvec{Y}}}{\mathrm{sup}}\left|{\left(y-{\widehat{r}}_{{{\varvec{z}}}^{l}}\left({\varvec{x}}\right)\right)}^{2}-{\left(y-{\widehat{r}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({\varvec{x}}\right)\right)}^{2}\right|\right)\le \beta \left(l\right),$$

where \(\underset{l\to \infty }{\mathit{lim}}\beta \left(l\right)=0\).

Condition 4. For two fixed data sets \({\widehat{{\varvec{z}}}}^{l}=\big\{\big({\varvec{x}}_{j},{\widehat{y}}_{j}\big),j=1,\cdots ,l\big\}\) and \({\widetilde{{\varvec{z}}}}^{l}=\big\{\big({{\varvec{x}}}_{j},{\widetilde{y}}_{j}\big),j=1,\cdots ,l\big\}\) with the same input objects, if for each \(l\), the labels satisfy

$$\underset{i\in \left\{\mathrm{1,2},\cdots ,l\right\}}{\mathrm{sup}}\left|{\widehat{y}}_{i}-{\widetilde{y}}_{i}\right|\le \eta ,$$

there holds

$${\Vert {\widehat{r}}_{{\widehat{{\varvec{z}}}}^{l}}-{\widehat{r}}_{{\widetilde{{\varvec{z}}}}^{l}}\Vert }_{\infty }\le \eta .$$

Using the same mathematical techniques as in [38], we need the algorithm \(\mu \) to satisfy Conditions 1, 2 and 3 and \(v\) to satisfy Conditions 1, 2 and 4 to prove the asymptotic validity of LW-JPS. These conditions are not too restrictive for applications, as we analyzed in Section 3.3 of [38].

Asymptotic validity of LW-JPS

We introduce Lemma 1 to guarantee that \({\widehat{v}}_{{\widehat{{\varvec{Z}}}}^{l}}\) is a consistent estimator for the conditional variance function in Algorithm 2, which has been proved in [38].

Lemma 1

If Assumption 1 holds, \(\mu \) satisfies Conditions 2 and 3, and \(v\) satisfies Conditions 2 and 4, then we have

$${\Vert {\widehat{v}}_{{\widehat{{\varvec{Z}}}}^{l}}-{v}_{\rho }\Vert }_{\infty }{\to }_{p}0.$$

We will prove in Theorem 1 that Algorithm 2 is asymptotically valid by showing that the corresponding predictive distribution \({\widehat{Q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)\) satisfies

$$P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}{\to }_{p}\alpha ,$$
(6)

which is an asymptotic version of formula (1). To do so, we need to prove that

$$P\left\{{Y}_{0}\le {\widehat{q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}\alpha ,$$
(7)

where \({\widehat{q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}^{\left(\alpha \right)}\) is the \(\alpha \) quantile of \({\widehat{Q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)\). Formula (7) is equivalent to

$$P\left\{{\Gamma }_{{{\varvec{Z}}}^{l}}\le {\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}\alpha ,$$
(8)

where \({\Gamma }_{{{\varvec{z}}}^{l}}\) is the normalized residual defined by

$${\Gamma }_{{{\varvec{z}}}^{l}}=\frac{{Y}_{0}-{\widehat{\mu }}_{{{\varvec{z}}}^{l}}\left({{\varvec{X}}}_{0}\right)}{\sqrt{{\widehat{v}}_{{\widehat{{\varvec{z}}}}^{l}}\left({{\varvec{X}}}_{0}\right)}},$$

and \({\widehat{q}}_{{{\varvec{z}}}^{l}}^{\left(\alpha \right)}\) is the \(\alpha \) quantile of the normalized leave-one-out residuals \(\left\{{a}_{l,i},i=1,\cdots ,l\right\}\) defined by

$${a}_{l,i}=\frac{{y}_{i}-{\widehat{\mu }}_{{\mathbf{z}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{i}\right)}{\sqrt{{\widehat{v}}_{{\widehat{\mathbf{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{i}\right)}}.$$

Denote the CDF of \({\Gamma }_{{{\varvec{z}}}^{l}}\) by \({F}_{{{\varvec{z}}}^{l}}\left(b\right)\), i.e.,

$${F}_{{{\varvec{z}}}^{l}}\left(b\right)=P\left\{{\Gamma }_{{{\varvec{z}}}^{l}}\le b|{{\varvec{z}}}^{l}\right\},$$

and \({q}^{(\alpha )}\) is the \(\alpha \) quantile of \(F(b)\) in Assumption 1. Since Lemma 1 confirms that the estimator \({\widehat{v}}_{{\widehat{{\varvec{z}}}}^{l}}\) uniformly converges to \({v}_{\rho }\) in probability and \(\mu \) satisfies Condition 2, we can connect \({\Gamma }_{{{\varvec{z}}}^{l}}\) with the normalized noise term of Assumption 1, which is

$${\zeta }_{0}=\frac{{Y}_{0}-{\mu }_{\rho }\left({{\varvec{X}}}_{0}\right)}{\sqrt{{v}_{\rho }\left({{\varvec{X}}}_{0}\right)}},$$

and prove in Lemma 2 that

$$\underset{b\in R}{\mathrm{sup}}\left|{F}_{{{\varvec{Z}}}^{l}}\left(b\right)-F(b)\right|{\to }_{p}0.$$

Also, \({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\) and \({q}^{\left(\alpha \right)}\) are highly related as we show in Lemma 2 that

$${\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}{\to }_{p}{q}^{\left(\alpha \right)}.$$

Building on Lemma 2, formula (8) and formula (6) can be proved in turn and the conclusion of LW-JPS being asymptotically valid can be drawn in Theorem 1.

The analysis techniques in Lemma 2 were first introduced in [28] for linear regression problems with homoscedastic errors and were further developed for nonlinear regression problems with heteroscedastic errors in our earlier work on locally weighted jackknife prediction [38]. Both of those works concern interval prediction rather than probabilistic prediction, which makes the detailed expressions different from this work. In addition, our work on LOO–CCPS–RELM [37] only considers nonlinear regression problems with homoscedastic errors, and its proofs are specific to the extreme learning machine. Therefore, we introduce and prove Lemma 2, which is essential for a rigorous proof of Theorem 1 in this paper.

Lemma 2

Fix \(\alpha \in \left(\mathrm{0,1}\right)\). If the conditions of Lemma 1 hold and both \(\mu \) and \(v\) also satisfy Condition 1, then we have

$$\underset{b\in R}{\mathrm{sup}}\left|{F}_{{{\varvec{Z}}}^{l}}\left(b\right)-F(b)\right|{\to }_{p}0,$$
(9)

and

$${\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}{\to }_{p}{q}^{\left(\alpha \right)}.$$
(10)

Proof

Since \({\widehat{\mu }}_{{{\varvec{z}}}^{l}}\) satisfies that

$${\Vert {\widehat{\mu }}_{{{\varvec{Z}}}^{l}}-{\mu }_{\rho }\Vert }_{\infty }{\to }_{p}0,$$

and \({\widehat{v}}_{{\widehat{{\varvec{z}}}}^{l}}\) satisfies that

$${\Vert {\widehat{v}}_{{\widehat{{\varvec{Z}}}}^{l}}-{v}_{\rho }\Vert }_{\infty }{\to }_{p}0,$$

for each \(l\) we can define a nonempty set \(B(l)\) as

$$B\left(l\right)=\left\{{{\varvec{z}}}^{l}|\mathrm{max}\left\{{\Vert {\widehat{\mu }}_{{{\varvec{z}}}^{l}}-{\mu }_{\rho }\Vert }_{\infty },{\Vert {\widehat{v}}_{{\widehat{{\varvec{z}}}}^{l}}-{v}_{\rho }\Vert }_{\infty }\right\}\le g\left(l\right)\right\},$$

where \(g\left(l\right)\) is nonnegative and converges to 0 sufficiently slowly. Then, we can construct an array of random variables \({\Gamma }_{{{\varvec{z}}}^{l}}\) by taking an arbitrary \({{\varvec{z}}}^{l}\) in \(B\left(l\right)\). As \({{\varvec{z}}}^{l}\in B\left(l\right)\), \({v}_{min}>0\) and \(g\left(l\right)\) converges to 0, there exists an \({l}_{1}\) such that for all \(l>{l}_{1}\), there holds

$${\Vert {\widehat{v}}_{{\widehat{{\varvec{z}}}}^{l}}\Vert }_{\infty }\ge {v}_{min}-g\left(l\right)\ge {v}_{min}-g\left({l}_{1}\right)>0.$$

For all \(l>{l}_{1}\), by the definitions, we have

$$\left|{\Gamma }_{{{\varvec{z}}}^{l}}-{\zeta }_{0}\right|\le \frac{\frac{g\left(l\right)}{\left|\sqrt{{v}_{min}}+\sqrt{{v}_{min}-g\left({l}_{1}\right)}\right|}\times {\zeta }_{max}+g\left(l\right)}{\sqrt{{v}_{min}-g\left({l}_{1}\right)}},$$

which guarantees that

$${\Gamma }_{{{\varvec{z}}}^{l}}{\to }_{p}{\zeta }_{0}.$$

Since convergence in probability implies convergence in distribution and the CDF of \({\zeta }_{0}\) is continuous, according to Proposition 1.16 of [26], we have

$$\underset{l\to \infty }{\mathrm{lim}}\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{{{\varvec{z}}}^{l}}\left(b\right)-F\left(b\right)\right|=0.$$

The arbitrarily chosen \({{\varvec{z}}}^{l}\) from \(B\left(l\right)\) leads to

$$\underset{l\to \infty }{\mathrm{lim}}\underset{{{\varvec{z}}}^{l}\in B\left(l\right)}{\mathrm{sup}}\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{{{\varvec{z}}}^{l}}\left(b\right)-F\left(b\right)\right|=0,$$

which implies that formula (9) is correct [38].

Next, we prove formula (10). For every \(\epsilon >0\), we have

$$P\left\{\left|{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}-{q}^{\left(\alpha \right)}\right|>\epsilon \right\}=P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}>{q}^{\left(\alpha \right)}+\epsilon \right\}+P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}<{q}^{\left(\alpha \right)}-\epsilon \right\}.$$

Thus, we need to show that

$$P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}>{q}^{\left(\alpha \right)}+\epsilon \right\}$$

and

$$P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}<{q}^{\left(\alpha \right)}-\epsilon \right\}$$

each converge to 0. Define \({F}_{l}\left(b\right)\) by

$$ F_{l} \left( b \right) = P\left\{ {\frac{{Y_{1} - \hat{\mu }_{{{\varvec{Z}}_{\left( 1 \right)}^{l} }} \left( {{\varvec{X}}_{1} } \right)}}{{\sqrt {\hat{v}_{{\hat{\user2{Z}}_{\left( 1 \right)}^{l} }} \left( {{\varvec{X}}_{1} } \right)} }} \le b} \right\}{ } = E\left[ {P\left\{ {\frac{{Y_{1} - \hat{\mu }_{{{\varvec{Z}}_{\left( 1 \right)}^{l} }} \left( {{\varvec{X}}_{1} } \right)}}{{\sqrt {\hat{v}_{{\hat{\user2{Z}}_{\left( 1 \right)}^{l} }} \left( {{\varvec{X}}_{1} } \right)} }} \le b{|}{\varvec{Z}}_{\left( 1 \right)}^{l} } \right\}} \right]{ } = E\left[ {F_{{{\varvec{Z}}_{\left( 1 \right)}^{l} }} \left( b \right)} \right] $$

whose distance from \(F\left(b\right)\) can be bounded by

$$\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{l}\left(b\right)-F\left(b\right)\right|\le E\left[\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{{{\varvec{Z}}}_{\left(1\right)}^{l}}\left(b\right)-F\left(b\right)\right|\right].$$

From formula (9) and the definition of leave-one-out samples, the bounded random variable \(\underset{b\in {\varvec{R}}}{\mathit{sup}}\left|{F}_{{{\varvec{Z}}}_{\left(1\right)}^{l}}\left(b\right)-F\left(b\right)\right|\) converges to 0 in probability. This leads to

$$\underset{l\to \infty }{\mathrm{lim}}\,\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{l}\left(b\right)-F\left(b\right)\right|\le \underset{l\to \infty }{\mathrm{lim}}\,E\left[\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{{{\varvec{Z}}}_{\left(1\right)}^{l}}\left(b\right)-F\left(b\right)\right|\right]=0,$$

i.e.,

$$\underset{l\to \infty }{\mathrm{lim}}\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{l}\left(b\right)-F\left(b\right)\right|=0.$$
(11)

Let \({F}_{{a}_{l}}\left(b\right)\) denote the empirical CDF of the normalized leave-one-out residuals \(\left\{{a}_{l,i},i=1,\cdots ,l\right\}\), and let \({F}_{{A}_{l}}\left(b\right)\) and \({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\) be the corresponding random function and random variable obtained by introducing the randomness of \({{\varvec{Z}}}^{l}\). Define \({J}_{l,i}={1}_{\left\{{A}_{l,i}>{q}^{\left(\alpha \right)}+\epsilon \right\}}\), the indicator function of \(\left\{{A}_{l,i}>{q}^{\left(\alpha \right)}+\epsilon \right\}\). Since the symmetry of the algorithms \(\mu \) and \(v\) (Condition 1) implies that \(\left\{{J}_{l,j},j=1,\cdots ,l\right\}\) are exchangeable, based on the property of the quantile function [29], we have

$$\begin{aligned} P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}>{q}^{\left(\alpha \right)}+\epsilon \right\}&=P\left\{\alpha >{F}_{{A}_{l}}\left({q}^{\left(\alpha \right)}+\epsilon \right)\right\}\\ &=P\left\{1-{F}_{{A}_{l}}\left({q}^{\left(\alpha \right)}+\epsilon \right)>1-\alpha \right\}\\ &=P\left\{\frac{1}{l}\sum_{i=1}^{l}\left({J}_{l,i}-E\left[{J}_{l,1}\right]\right)>1-\alpha -E\left[{J}_{l,1}\right]\right\}\\ &=P\left\{\frac{1}{l}\sum_{i=1}^{l}\left({J}_{l,i}-E\left[{J}_{l,i}\right]\right)>{F}_{l}\left({q}^{\left(\alpha \right)}+\epsilon \right)-\alpha \right\}.\end{aligned}$$

Since \(F\) is continuous and strictly increasing by Assumption 1, it follows that

$$F\left({q}^{\left(\alpha \right)}+\epsilon \right)>\alpha ,$$

which, combined with formula (11), implies that \({F}_{l}\left({q}^{\left(\alpha \right)}+\epsilon \right)-\alpha >0\) for sufficiently large \(l\). Thus, it follows from Markov’s inequality that for sufficiently large \(l\), the probability,

$$P\left\{\frac{1}{l}\sum_{i=1}^{l}\left({J}_{l,i}-E\left[{J}_{l,i}\right]\right)>{F}_{l}\left({q}^{\left(\alpha \right)}+\epsilon \right)-\alpha \right\},$$

is bounded by

$$\frac{\frac{1}{l}var\left({J}_{l,1}\right)+\frac{l\left(l-1\right)}{{l}^{2}}cov\left({J}_{l,1},{J}_{l,2}\right)}{{\left({F}_{l}\left({q}^{\left(\alpha \right)}+\epsilon \right)-\alpha \right)}^{2}},$$
(12)

where \(var\) and \(cov\) are the variance and covariance function, respectively. Therefore, to prove \(P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}>{q}^{\left(\alpha \right)}+\epsilon \right\}\) approaches 0, we need to prove \(cov\left({J}_{l,1},{J}_{l,2}\right)\) converges to 0 as \(l\to \infty \).

Let \({{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}\) and \({\widehat{{\varvec{Z}}}}_{\left(\mathrm{1,2}\right)}^{l}\) be the corresponding data sets without the first two observations. Define \({A}_{l,\left(\mathrm{1,2}\right)}\) and \({A}_{l,\left(\mathrm{2,1}\right)}\) by

$${A}_{l,\left(\mathrm{1,2}\right)}=\frac{{Y}_{1}-{\widehat{\mu }}_{{{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({{\varvec{X}}}_{1}\right)}{\sqrt{{\widehat{v}}_{{\widehat{{\varvec{Z}}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({{\varvec{X}}}_{1}\right)}}, \quad {A}_{l,\left(\mathrm{2,1}\right)}=\frac{{Y}_{2}-{\widehat{\mu }}_{{{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({{\varvec{X}}}_{2}\right)}{\sqrt{{\widehat{v}}_{{\widehat{{\varvec{Z}}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({{\varvec{X}}}_{2}\right)}},$$

and define \({\widetilde{A}}_{1}\) and \({\widetilde{A}}_{2}\) by

$${\widetilde{A}}_{1}=\left[{A}_{l,1},{A}_{l,2}\right], {\widetilde{A}}_{2}=\left[{A}_{l,\left(\mathrm{1,2}\right)},{A}_{l,\left(\mathrm{2,1}\right)}\right].$$

Let \({\widetilde{F}}_{l,1}\) and \({\widetilde{F}}_{l,2}\) be the CDFs of \({\widetilde{A}}_{1}\) and \({\widetilde{A}}_{2}\), respectively, i.e.,

$${\widetilde{F}}_{l,1}\left({b}_{1},{b}_{2}\right)=P\left\{{A}_{l,1}\le {b}_{1},{A}_{l,2}\le {b}_{2}\right\},$$

and

$${\widetilde{F}}_{l,2}\left({b}_{1},{b}_{2}\right)=P\left\{{A}_{l,\left(\mathrm{1,2}\right)}\le {b}_{1},{A}_{l,\left(\mathrm{2,1}\right)}\le {b}_{2}\right\}.$$

For \({\widetilde{F}}_{l,2}\), we have

$${\widetilde{F}}_{l,2}\left({b}_{1},{b}_{2}\right)=E\left[{F}_{{{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({b}_{1}\right){F}_{{{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({b}_{2}\right)\right].$$
(13)

Since \({F}_{{{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({b}_{1}\right)\) and \({F}_{{{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({b}_{2}\right)\) are bounded random variables, which converge in probability to \(F\left({b}_{1}\right)\) and \(F\left({b}_{2}\right)\), respectively, due to formula (9), we have

$$\underset{l\to \infty }{\mathrm{lim}}{\widetilde{F}}_{l,2}\left({b}_{1},{b}_{2}\right)=F\left({b}_{1}\right)F\left({b}_{2}\right).$$

As Lemma 1 holds and \(\mu \) satisfies Condition 2, we can deduce that \({A}_{l,1}\), \({A}_{l,2}\), \({A}_{l,\left(\mathrm{1,2}\right)}\) and \({A}_{l,\left(\mathrm{2,1}\right)}\) all converge in probability to the corresponding normalized noise terms of Assumption 1, which implies that \({A}_{l,1}-{A}_{l,\left(\mathrm{1,2}\right)}{\to }_{p}0\) and \({A}_{l,2}-{A}_{l,\left(\mathrm{2,1}\right)}{\to }_{p}0\). Therefore, from Lemma 2.8 in [29], there holds

$$\underset{l\to \infty }{\mathrm{lim}}{\widetilde{F}}_{l,1}\left({b}_{1},{b}_{2}\right)=F\left({b}_{1}\right)F\left({b}_{2}\right).$$

Furthermore, with

$$\begin{aligned} \mathrm{cov}\left({J}_{l,1},{J}_{l,2}\right)&=\mathrm{cov}\left(1-{J}_{l,1},1-{J}_{l,2}\right)\\ &={\widetilde{F}}_{l,1}\left({q}^{\left(\alpha \right)}+\epsilon ,{q}^{\left(\alpha \right)}+\epsilon \right)-{F}_{l}\left({q}^{\left(\alpha \right)}+\epsilon \right){F}_{l}\left({q}^{\left(\alpha \right)}+\epsilon \right)\end{aligned}$$

and formula (11), we have

$$\underset{l\to \infty }{\mathrm{lim}}cov\left({J}_{l,1},{J}_{l,2}\right)=0.$$

Based on the formula above and Eq. (12), we have

$$\underset{l\to \infty }{\mathrm{lim}}P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}>{q}^{\left(\alpha \right)}+\epsilon \right\}=0.$$

Similarly, we can also prove that

$$\underset{l\to \infty }{\mathrm{lim}}P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}<{q}^{\left(\alpha \right)}-\epsilon \right\}=0.$$

Thus, since

$$P\left\{\left|{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}-{q}^{\left(\alpha \right)}\right|>\epsilon \right\}=P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}>{q}^{\left(\alpha \right)}+\epsilon \right\}+P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}<{q}^{\left(\alpha \right)}-\epsilon \right\}$$

and the two limit equations above hold, we have

$${\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}{\to }_{p}{q}^{\left(\alpha \right)}.$$
(14)

The following two theorems describe the statistical compatibility of the predictive distributions output by LW-JPS with the observations in the asymptotic setting. Theorem 1 proves the asymptotic version of formula (1), and Theorem 2 proves an asymptotic version of formula (2) in which the quantile levels can be set arbitrarily.

Theorem 1

Fix \(\alpha \in \left(\mathrm{0,1}\right)\). If Assumption 1 holds, \(\mu \) satisfies Conditions 1, 2 and 3, and \(v\) satisfies Conditions 1, 2 and 4, then we have

$$P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}{\to }_{p}\alpha .$$

Proof

Based on Assumption 1, we have \(F\left({q}^{\left(\alpha \right)}\right)=\alpha \). Therefore,

$$\begin{aligned} \left|P\left\{{\Gamma }_{{{\varvec{Z}}}^{l}}\le {\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}-\alpha \right|&=\left|{F}_{{{\varvec{Z}}}^{l}}\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\right)-F\left({q}^{\left(\alpha \right)}\right)\right|\\ &\le \left|{F}_{{{\varvec{Z}}}^{l}}\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\right)-F\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\right)\right|+\left|F\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\right)-F\left({q}^{\left(\alpha \right)}\right)\right|\\ &\le \underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{{{\varvec{Z}}}^{l}}\left(b\right)-F\left(b\right)\right|+\left|F\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\right)-F\left({q}^{\left(\alpha \right)}\right)\right|.\end{aligned}$$

From Lemma 2 and the continuity of \(F\left(b\right)\), we have \(\left|F\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\right)-F\left({q}^{\left(\alpha \right)}\right)\right|{\to }_{p}0\) by Theorem 1.10 in [26], and \(\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{{{\varvec{Z}}}^{l}}\left(b\right)-F\left(b\right)\right|{\to }_{p}0\). Thus, we can conclude that

$${F}_{{{\varvec{Z}}}^{l}}\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\right)=P\left\{{\Gamma }_{{{\varvec{Z}}}^{l}}\le {\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}\alpha ,$$
(15)

which is equivalent to

$$P\left\{{Y}_{0}\le {\widehat{q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}\alpha ,$$
(16)

since for every \(\alpha \in \left(\mathrm{0,1}\right)\), there holds

$${F}_{{{\varvec{Z}}}^{l}}\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\right)=P\left\{{\Gamma }_{{{\varvec{Z}}}^{l}}\le {\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}=P\left\{{Y}_{0}\le {\widehat{q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}.$$

For every \(\epsilon \) such that \(0<\epsilon <\mathrm{min}\left\{\alpha ,1-\alpha \right\}\), by the definition of quantiles, we have

$$P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}\le {F}_{{{\varvec{Z}}}^{l}}\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha +\epsilon \right)}\right),$$
(17)

and

$$P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}\ge {F}_{{{\varvec{Z}}}^{l}}\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha -\epsilon \right)}\right).$$
(18)

Based on formula (15), for every \(\delta >0\) and all sufficiently large \(l\), we have

$$P\left\{\left|{F}_{{{\varvec{Z}}}^{l}}\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha +\epsilon \right)}\right)-\left(\alpha +\epsilon \right)\right|>\epsilon \right\}<\delta ,$$

which, combined with formula (17), leads to

$$P\left\{P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}-\left(\alpha +\epsilon \right)>\epsilon \right\}<\delta .$$

Similarly, with formula (18), there holds

$$P\left\{P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}-\left(\alpha -\epsilon \right)<-\epsilon \right\}<\delta .$$

Then, we have

$$P\left\{\left|P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}-\alpha \right|>2\epsilon \right\}<2\delta .$$

Since \(\epsilon \) and \(\delta \) are arbitrary, the conclusion of Theorem 1 can be drawn.

Based on the derivation of Theorem 1, we can obtain the following coverage guarantee for prediction intervals derived from \({\widehat{Q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\), which is desirable for practitioners performing interval prediction.

Theorem 2

Fix \({\eta }_{1}\) and \({\eta }_{2}\) such that \(0<{\eta }_{1}<{\eta }_{2}<1\) . If the conditions of Theorem 1 hold, we have

$$P\left\{{q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}\right)}\le {Y}_{0}\le {q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{2}\right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}{\eta }_{2}-{\eta }_{1}.$$

Proof

For every \(\epsilon \) such that \(0<\epsilon <\mathrm{min}\left\{{\eta }_{1},{\Delta }_{\eta }\right\}\), where \({\Delta }_{\eta }={\eta }_{2}-{\eta }_{1}\), based on formula (16), we have

$$P\left\{{Y}_{0}\le {\widehat{q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}-\epsilon \right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}{\eta }_{1}-\epsilon ,$$

which leads to

$$P\left\{{{q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}-\epsilon \right)}<Y}_{0}\le {q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{2}\right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}{\Delta }_{\eta }+\epsilon ,$$

Then, for every \(\delta >0\) and all sufficiently large \(l\), there holds

$$P\left\{\left|P\left\{{{q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}-\epsilon \right)}<Y}_{0}\le {q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{2}\right)}|{{\varvec{Z}}}^{l}\right\}-\left({\Delta }_{\eta }+\epsilon \right)\right|>\epsilon \right\}<\delta .$$

Thus, we have

$$P\left\{P\left\{{{q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}\right)}\le Y}_{0}\le {q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{2}\right)}|{{\varvec{Z}}}^{l}\right\}-\left({\Delta }_{\eta }+\epsilon \right)>\epsilon \right\}<\delta .$$

Similarly, there holds

$$P\left\{P\left\{{{q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}\right)}\le Y}_{0}\le {q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{2}\right)}|{{\varvec{Z}}}^{l}\right\}-\left({\Delta }_{\eta }-\epsilon \right)<-\epsilon \right\}<\delta .$$

Therefore, we have

$$P\left\{\left|P\left\{{{q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}\right)}\le Y}_{0}\le {q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{2}\right)}|{{\varvec{Z}}}^{l}\right\}-{\Delta }_{\eta }\right|>2\epsilon \right\}<2\delta ,$$

which proves the conclusion of Theorem 2, since \(\epsilon \) and \(\delta \) are arbitrary.

Experiments

In this section, to test LW-JPS empirically, randomized kernel ridge regression with random Fourier features [21] is used as \(\mu \) and \(k\)-nearest neighbor regression is used as \(v\), since they satisfy the conditions assumed in "Asymptotic analysis of locally weighted jackknife predictive system". Following [38], the number of random features was set to 1000 and \(k=\sqrt{l}\) for \(k\)-nearest neighbor regression. The ridge parameter with the smallest leave-one-out error was chosen for LW-JPS. The comparison predictive systems are SCPS with support vector regression (SCPS–SVR), SCPS with random forests (SCPS–RF), CCPS with support vector regression (CCPS–SVR), CCPS with random forests (CCPS–RF) and CPS with random forests with out-of-bag errors as conformity scores (OOB–CPS–RF). All the comparison algorithms employ formula (5) as the conformity measure, based on the recent empirical evaluation in [40]. SCPS–SVR, SCPS–RF, CCPS–SVR and CCPS–RF use the same normalization for the conformity measure as LW-JPS, whereas OOB–CPS–RF uses the standard deviation of out-of-bag predictions for normalization, following the approach in [40]. OOB–CPS–RF was first proposed in [40], which extends the idea of the state-of-the-art conformal regressor with random forests [7]. Following [37], for all SCPSs, 40 percent of the training data was used as the calibration set, and for all CCPSs, the number of folds was 5. In addition, the meta-parameters of all comparison algorithms were chosen using threefold cross-validation on the training set based on \({R}^{2}\) scores. SVR with a Gaussian kernel was employed, whose regularization parameter \(C\) was chosen from \(\left\{{10}^{-5},{10}^{-4},\cdots ,{10}^{4},{10}^{5}\right\}\). For random forests, the number of trees was chosen from \(\left\{100, 300, 500, 1000\right\}\) and the minimum number of samples per tree leaf was chosen from \(\left\{1, 3, 5\right\}\).
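A sketch of how these two underlying algorithms can be configured with scikit-learn is given below; the random Fourier features are obtained with `RBFSampler`, and the kernel width and ridge parameter shown here are placeholders rather than the values tuned in the experiments.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

def make_mu(gamma=1.0, ridge_alpha=1.0, seed=0):
    """Randomized kernel ridge regression: 1000 random Fourier features
    followed by ridge regression (gamma and ridge_alpha are placeholders)."""
    return make_pipeline(
        RBFSampler(gamma=gamma, n_components=1000, random_state=seed),
        Ridge(alpha=ridge_alpha),
    )

def make_v(l):
    """k-nearest-neighbor regressor with k = sqrt(l) for the variance estimate."""
    return KNeighborsRegressor(n_neighbors=max(1, int(round(np.sqrt(l)))))
```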

The experiments were conducted on 20 public data sets from the Delve [22], KEEL [4] and UCI [5] repositories, whose detailed information is summarized in Table 1. The features and labels were all normalized to \([0, 1]\) with min–max normalization. Tenfold cross-validation was used to test the algorithms, i.e., each data set was randomly split into ten folds, each fold was used to evaluate the algorithms trained on the other nine folds, and the mean of the ten results for each algorithm was reported. All the algorithms in this section were implemented in Python using the NumPy and scikit-learn libraries, and the experimental results were collected on a computer with a 3.5 GHz CPU and 32 GB RAM.

Table 1 Data sets

Test the validity of LW-JPS

This section tests whether LW-JPS is a valid predictive system in the sense of formula (1). To do so, the values of the CDF of LW-JPS on the test data were collected and the frequency with which these values did not exceed \(\alpha \) was calculated. The results are shown in Table 2, where "mean" denotes the mean value of each column. Table 2 demonstrates that the frequencies are compatible with the corresponding \(\alpha \), which empirically supports the validity of LW-JPS.
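The check described above reduces to the following computation on the test data, where `cdf_values` denotes the collected values \({\widehat{Q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left({y}_{0}\right)\) (a sketch with our own names and an arbitrary grid of \(\alpha \) values):

```python
import numpy as np

def empirical_validity(cdf_values, alphas=(0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95)):
    """Frequency of predictive CDF values not exceeding each alpha; for a valid
    predictive system these frequencies should be close to alpha."""
    cdf_values = np.asarray(cdf_values)
    return {a: float(np.mean(cdf_values <= a)) for a in alphas}
```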

Table 2 Validity test of LW-JPS

As analyzed in "Conformal predictive systems and locally weighted jackknife predictive system", the validity property of formula (1) implies the coverage guarantee of formula (2), which will be shown in the next experiment.

Comparison with the other CPSs

This section compares the performance of LW-JPS with SCPS–SVR, SCPS–RF, CCPS–SVR, CCPS–RF and OOB–CPS–RF. To compare the quality of the predictive distributions, the widely used continuous ranked probability score (CRPS) is employed, whose definition can be found in [34]. The lower the CRPS, the better the predictive distribution. The bar plots of the mean continuous ranked probability scores for the different data sets are shown in Fig. 1, which demonstrates that LW-JPS performs better in most cases. Table 3 records the mean CRPS of all algorithms, with the smallest value for each data set shown in bold. For each data set, the rank of each algorithm is obtained, and the mean rank in Table 3 is the mean value of these ranks for each algorithm. From Table 3, we can see that LW-JPS performs better than the other predictive systems, which indicates the effectiveness of LW-JPS.
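For a predictive distribution given as the empirical CDF of support values \({C}_{1},\cdots ,{C}_{m}\), the CRPS at a realized label \(y\) can be computed with the standard sample-based formula below; this is a generic sketch rather than the exact evaluation code used for the reported results.

```python
import numpy as np

def crps_empirical(C, y):
    """CRPS of the empirical CDF of C at label y, via the energy form:
    E|C - y| - 0.5 * E|C - C'| (expectations over the support points)."""
    C = np.asarray(C, dtype=float)
    return np.mean(np.abs(C - y)) - 0.5 * np.mean(np.abs(C[:, None] - C[None, :]))
```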

Fig. 1 Mean of continuous ranked probability scores for different algorithms trained on different data sets

Table 3 The mean CRPS of all algorithms

We also test the prediction intervals derived from the predictive distributions of all predictive systems. For a significance level \(\eta \) preset by practitioners, the expected coverage rate is \(1-\eta \) and the derived prediction interval is based on formula (2), using the \(\eta /2\) and \(1-\eta /2\) quantiles. Two indicators are employed to describe the quality of the prediction intervals. One is the prediction error rate, which is the frequency of the true label lying outside the prediction interval. The other is the average interval size, which measures the information efficiency of the prediction intervals. The smaller the average interval size, the more informative the prediction intervals are. We set the significance levels to 0.2, 0.1 and 0.05 and show the experimental results in Tables 4, 5 and 6 for error rates and in Tables 7, 8 and 9 for average interval sizes; these indicators can be computed as in the sketch below. We also summarize the error rates and the means and mean ranks of the average interval sizes in Figs. 2, 3, and 4.
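The two indicators can be computed from the derived intervals as in the following sketch (names are ours):

```python
import numpy as np

def interval_metrics(lower, upper, y_true):
    """Error rate: fraction of true labels falling outside [lower, upper].
    Average interval size: mean width of the prediction intervals."""
    lower, upper, y_true = map(np.asarray, (lower, upper, y_true))
    error_rate = float(np.mean((y_true < lower) | (y_true > upper)))
    avg_size = float(np.mean(upper - lower))
    return error_rate, avg_size
```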

Table 4 Error rate \(\left(\eta =0.2\right)\)
Table 5 Error rate \(\left(\eta =0.1\right)\)
Table 6 Error rate \(\left(\eta =0.05\right)\)
Table 7 Average interval size \(\left(\eta =0.2\right)\)
Table 8 Average interval size \(\left(\eta =0.1\right)\)
Table 9 Average interval size \(\left(\eta =0.05\right)\)
Fig. 2 Mean of prediction error rates of the prediction intervals derived from the predictive distributions

Fig. 3 Mean of average interval sizes of the prediction intervals derived from the predictive distributions

Fig. 4 Mean rank of average interval sizes of the prediction intervals derived from the predictive distributions

From Tables 4, 5 and 6, we can see that all predictive systems are empirically valid on these data sets, which also empirically supports the coverage guarantee of LW-JPS. Besides, Tables 7, 8 and 9 show that the prediction intervals of LW-JPS are more informationally efficient than those of the other algorithms, which is also demonstrated in Figs. 3 and 4. The box plots of the average interval sizes are shown in Fig. 5, which also demonstrate that LW-JPS performs better than the other CPSs.

Fig. 5 Box plots of average interval sizes of the prediction intervals derived from the predictive distributions

We also conducted the Wilcoxon test [10] to assess whether LW-JPS performs significantly better than the comparison algorithms. Table 10 reports the p values for the experimental results on CRPS and average interval sizes with \(\eta \in \{0.2, 0.1, 0.05\}\); values less than 0.05 are shown in bold and indicate significant differences. From Table 10, we can see that LW-JPS performs significantly better than SCPS–SVR, SCPS–RF, CCPS–SVR and CCPS–RF, while the differences between LW-JPS and OOB–CPS–RF are not significant in most cases. Since OOB–CPS–RF represents a state-of-the-art conformal approach for regression problems, the statistical tests confirm the effectiveness of LW-JPS for probabilistic prediction.

Table 10 The p values of Wilcoxon tests

Regarding training speed, all of the algorithms are computationally efficient versions of CPSs; the mean training times of SCPS–SVR, SCPS–RF, CCPS–SVR, CCPS–RF, OOB–CPS–RF and LW-JPS over the 20 data sets are 0.293 s, 8.704 s, 1.940 s, 59.336 s, 15.393 s and 1.443 s, respectively, indicating that the LW-JPS used in this paper is also computationally efficient.

In summary, the experimental results in this section not only verify the empirical validity of LW-JPS but also show its better performance compared with the other algorithms, which indicates the effectiveness and efficiency of LW-JPS for probabilistic prediction.

Conclusion

This paper proposes a predictive system based on the idea of jackknife prediction, inspired by the Leave-One-Out cross-conformal predictive system. The proposed LW-JPS can turn any regression algorithm for point prediction into a probabilistic predictor that describes the uncertainty of test labels. The asymptotic validity of LW-JPS is proved under some regularity assumptions and conditions. Based on this analysis, LW-JPS was tested empirically with randomized kernel ridge regression and \(k\)-nearest neighbor regression. The experiments demonstrated the empirical validity of LW-JPS, and its performance for probabilistic prediction compared favourably with the other algorithms, confirming the effectiveness and efficiency of LW-JPS for probabilistic prediction.

Although our method is empirically valid and performs better than the comparison CPSs, we only employ two representative regression algorithms satisfying the related conditions in this paper. Therefore, future empirical studies with a wider range of regression algorithms are needed. Moreover, the LW-JPS approach proposed in this paper cannot be built efficiently on deep learning models for complex learning problems, such as image segmentation or image-to-image regression, since in those cases there is no efficient way to compute leave-one-out predictions on the training data. Thus, future work on approximately computing leave-one-out predictions for deep neural networks is worth exploring, in order to make the jackknife prediction approach more tractable for complex problems.