Probabilistic prediction with locally weighted jackknife predictive system

Wang, Di; Wang, Ping; Wang, Pingping; Wang, Cong; He, Zhen; Zhang, Wei

doi:10.1007/s40747-023-01044-0

Probabilistic prediction with locally weighted jackknife predictive system

Original Article
Open access
Published: 05 April 2023

Volume 9, pages 5761–5778, (2023)
Cite this article

Download PDF

You have full access to this open access article

Complex & Intelligent Systems Aims and scope Submit manuscript

Probabilistic prediction with locally weighted jackknife predictive system

Download PDF

Di Wang^1,2,
Ping Wang^1,2,
Pingping Wang³,
Cong Wang^1,2,
Zhen He⁴ &
…
Wei Zhang ORCID: orcid.org/0000-0002-6517-4373⁴

1327 Accesses
1 Citation
1 Altmetric
Explore all metrics

Abstract

Probabilistic predictions for regression problems are more popular than point predictions and interval predictions, since they contain more information for test labels. Conformal predictive system is a recently proposed non-parametric method to do reliable probabilistic predictions, which is computationally inefficient due to its learning process. To build faster conformal predictive system and make full use of training data, this paper proposes the predictive system based on locally weighted jackknife prediction approach. The theoretical property of our proposed method is proved with some regularity assumptions in the asymptotic setting, which extends our earlier theoretical researches from interval predictions to probabilistic predictions. In the experimental section, our method is implemented based on our theoretical analysis and its comparison with other predictive systems is conducted using 20 public data sets. The continuous ranked probability scores of the predictive distributions and the performance of the derived prediction intervals are compared. The better performance of our proposed method is confirmed with Wilcoxon tests. The experimental results demonstrate that the predictive system we proposed is not only empirically valid, but also provides more information than the other comparison predictive systems.

Valid prediction intervals for regression problems

Article 18 April 2022

Nonparametric predictive distributions based on conformal prediction

Article Open access 17 August 2018

Prediction intervals in supervised learning for model evaluation and discrimination

Article 31 December 2014

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Machine learning techniques been widely applied to many areas because of their expressive power with the help of gradient descent [13, 15, 19, 23, 24] and metaheuristics [1,2,3, 16, 17] for efficient parameter searching. Based on machine learning, one can build predictive models considering the uncertainty of outputs. This paper concentrates on the predictive system providing predictive distribution for the test label, which is desirable for all walks of life [11]. For regression problems, the predictive distribution contains the full information of the uncertainty, as it can provide the probability of any event relevant to the test label and be transformed to prediction point or prediction interval by use of the corresponding first-order moment or quantiles. For many applications, especially the high-risk ones, the predictive distributions are required to be valid, which implies that the distributions or their derived prediction intervals have statistical compatibility with realizations, i.e., they ought to tell the truth [33].

Nowadays, many algorithms in the context of statistics or machine learning have been proposed to output predictive distributions for test labels. However, most of them such as Bayesian regression and Gaussian process regression are highly dependent on their prior distribution assumptions, which can be far away from being valid if the prior assumptions are not correct [6, 18]. Recently, some frequentist approaches of probabilistic prediction algorithms have been proposed with compatibility with realizations in mind [25, 27]. While these approaches concern more about frequentist probability, they are limited in applications due to their original parametric forms. This issue has been tackled by a collection of promising works about conformal predictive systems (CPSs) [31, 33, 34], which build predictive system using learning framework of conformal prediction [6, 32] and extend the above frequentist approaches to a general nonparametric setting of being valid even in the small-sample cases.

The purpose of conformal prediction is to output valid prediction sets for test labels. One of the key characteristics of conformal prediction is that its p values calculated using conformity scores follows the uniform distribution on $[0, 1]$ with the assumption of the samples being independent and identically distributed. This excellent property enables us to transform the unknown uncertainty from data to one of our most familiar distributions. CPSs utilize the p values of conformal prediction and transform them to the predictive distributions, which makes CPSs have the small-sample property of validity [31].

The pioneer work [31] first proposed CPSs with the classical least square procedure as the underlying algorithms and the asymptotic efficiency was proved with some strong assumptions. After that, [31] was done to answer some general questions about the existence and construction of consistent CPSs. In addition to the general theoretical studies above, two kinds of works concentrating on the applicability of CPSs were done. The first kind is to propose more flexible CPSs, whose representatives are [33] and [35]. The former extends using the classical least square procedure as underlying algorithm to using a more powerful algorithm named kernel ridge regression, and the latter proposed conformal calibration whose underlying algorithms are existing predictive systems. The second kind is to speed up the learning process of CPSs, as CPSs inherit the computational issue from conformal prediction [14, 20, 30]. To address this, there are two ways to try. One way is to modify the learning process of the original CPSs, such as split conformal predictive systems (SCPSs) and cross-conformal predictive systems (CCPSs) [34]. SCPSs are also valid even in small-sample cases, but they may lose predictive efficiency, as they split the data into two parts, one of which is used to train the underlying algorithm and the other of which is used to calculate conformity scores. Although CCPSs do not have the theoretical guarantee of validity, they improve the prediction performance by making full use of the data. Another way is to use a fast and well-performed underlying algorithm to compute the conformity scores, which was our previous work for building a fast probabilistic prediction algorithm [37]. In that work, based on the Leave-One-Out CCPS and extreme learning machine [12], we proposed a fast CPS named LOO–CCPS–RELM and analysed its asymptotic property of validity. LOO–CCPS–RELM takes advantage of jackknife prediction of residuals and their closed-form formula to make the whole learning process fast, which is competent in real-time applications.

This work extends our previous work about LOO–CCPS–RELM in two aspects. First, we design a more general learning framework in the spirit of LOO–CCPS–RELM to make probabilistic prediction, whose underlying algorithms can be any uniformly stable algorithm. Second, contrast with LOO–CCPS–RELM designed and proved to be asymptotically valid only for homoscedastic cases, the learning framework in this paper considers the heteroscedastic cases and a more general theoretical guarantee of the asymptotical validity is proved. The heteroscedastic cases are addressed by the idea of locally weighted jackknife prediction, whose theoretical analysis for prediction intervals has been conducted in our earlier work [38]. This paper extends the related concepts and analytical techniques to the probabilistic prediction. Since the predictive system we proposed is based on the idea of locally weighted jackknife prediction, it is named as locally weighted jackknife predictive system (LW-JPS) in this paper.

In summary, to build valid and computationally efficient predictive system, we develop locally weighted jackknife prediction approach with asymptotic guarantee of validity with the contributions as follows:

A general predictive system based on the idea of locally weighted jackknife prediction is proposed for probabilistic prediction, which is easy-to-code and can learn fast if the underlying algorithms have the closed-form formula for leave-one-out residuals.
The asymptotical validity of our predictive system is proved with some regularity assumptions, which extends the analysis of LOO–CCPS–RELM by considering a more general setting and the heteroscedastic cases.
The experiments with 20 public data sets are conducted, which empirically proves the effective and efficiency of the proposed predictive system.

The rest of this paper is organized as follows. “Conformal predictive systems and locally weighted jackknife predictive system” reviews conformal predictive systems and defines the proposed LW-JPS. “Asymptotic analysis of locally weighted jackknife predictive system” proves the asymptotic validity of LW-JPS with some regularity assumptions and conditions. In “Experiments”, the experiments are designed to test the validity and efficiency of LW-JPS empirically and the conclusions of this paper are drawn in “Conclusion”.

Conformal predictive systems and locally weighted jackknife predictive system

Throughout this paper, ${\varvec{X}}\subseteq {{\varvec{R}}}^{n}$ denotes the object space and ${\varvec{Y}}\subseteq {\varvec{R}}$ the label space. The observation space is denoted by ${\varvec{Z}}={\varvec{X}}\times {\varvec{Y}}$ and each observation ${\varvec{z}}=({\varvec{x}}, y)\in {\varvec{X}}\times {\varvec{Y}}$ comprises its object ${\varvec{x}}$ and corresponding label $y$. ${\varvec{Z}}l=\{{{\varvec{Z}}}_{i}, i=1, \cdot \cdot \cdot , l\}$ denotes a random training set whose realization is ${{\varvec{z}}}^{l}=\{{{\varvec{z}}}_{i}, i=1, \cdot \cdot \cdot , l\}$. ${{\varvec{Z}}}_{0}$ denotes a random test observation whose realization is ${z}_{0}$, where ${{\varvec{Z}}}_{0}=\left({{\varvec{X}}}_{0},{Y}_{0}\right),{{\varvec{Z}}}_{1}=\left({{\varvec{X}}}_{1},{Y}_{1}\right),\cdots ,{{\varvec{Z}}}_{l}=\left({{\varvec{X}}}_{l},{Y}_{l}\right)$ are independent and identically distributed and drawn from the distribution $\rho $ on ${\varvec{Z}}={\varvec{X}}\times {\varvec{Y}}$. $T$ denotes a random number uniformly distributed on $\left[0,1\right]$, which is independent of all observations and its realization is denoted by $t$.

For a fixed training set ${\varvec{z}}l$ and a test input object ${{\varvec{x}}}_{0}$, the goal of predictive systems is to construct a predictive distribution on $y\in {\varvec{R}}$, which contains much of the information about ${y}_{0}$.

Predictive system and randomized predictive system

We first give the definition of predictive system which is first formally defined in [35].

Definition 1

A measurable function $Q:{\varvec{Z}}^{l + 1} \to \left[ {0, 1} \right]$ is a predictive system (PS) if it satisfies the following two conditions:

A. For each realization ${{\varvec{z}}}^{l}$ and ${{\varvec{x}}}_{0}$ , the function $Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right)\right)$ is increasing in $y\in {\varvec{R}}$ .

B. For each realization ${{\varvec{z}}}^{l}$ and ${{\varvec{x}}}_{0}$ ,

$$\underset{y\to -\infty }{\mathit{lim}}Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right)\right)=0$$

and

$$\underset{y\to \infty }{\mathit{lim}}Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right)\right)=1.$$

Next, the notion of randomized predictive system is needed to introduce conformal predictive system.

Definition 2

A measurable function $Q:{\varvec{Z}}^{l + 1} \times \left[ {0,1} \right] \to \left[ {0,1} \right]$ is a randomized predictive system (RPS) if it satisfies the following two conditions:

A. For each realization ${{\varvec{z}}}^{l}$ and ${{\varvec{x}}}_{0}$ , the function $Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right),t\right)$ is increasing in $y\in {\varvec{R}}$ and $t\in \left[0,1\right]$ .

B. For each realization ${{\varvec{z}}}^{l}$ and ${{\varvec{x}}}_{0}$ ,

$$\underset{y\to -\infty }{\mathit{lim}}Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right),0\right)=0$$

and

$$\underset{y\to \infty }{\mathit{lim}}Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right),1\right)=1.$$

In this paper, we use the shorthand notation ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)=Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right)\right)$ to explicitly regard it as a function of $y$ dependent on ${{\varvec{z}}}^{l}$ and ${{\varvec{x}}}_{0}$, and the shorthand notation ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)=Q\left({{\varvec{z}}}^{l},\left({{\varvec{x}}}_{0},y\right),t\right)$ to explicitly regard it as a function of $y$ dependent on ${{\varvec{z}}}^{l}$, ${{\varvec{x}}}_{0}$ and $t$. ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)$ is the predictive distribution of PS, which is a cumulative distribution function (CDF) of ${Y}_{0}$ given ${{\varvec{z}}}^{l}$ and ${{\varvec{x}}}_{0}$. Different from that, RPS introduces a random number $t$ to build the predictive distribution ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)$. For fixed training set ${{\varvec{z}}}^{l}$ and ${{\varvec{x}}}_{0}$, the lower bound and upper bound of ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)$ are ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},0}\left(y\right)$ and ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},1}\left(y\right)$, respectively. The gap ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},1}\left(y\right)-{Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},0}\left(y\right)$ can converge to 0 quickly for the existing designed RPSs [31]. Thus, one can use a CDF between or approximating ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},0}\left(y\right)$ and ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},1}\left(y\right)$ to remove the impact of $t$ and build the predictive distribution of ${Y}_{0}$.

A predictive system ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)$ is valid, if the following holds:

$$P\left\{{Q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \eta \right\}=\eta ,$$

(1)

where ${Q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left(y\right)$ is a random function of $y$ whose realization is ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)$. In addition, ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)$ is asymptotically valid if formula (1) holds asymptotically. Let ${\widehat{q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}^{\left(\eta /2\right)}$ and ${\widehat{q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}^{\left(1-\eta /2\right)}$ be the $\eta /2$ and $1-\eta /2$ quantiles of ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)$. Then, the property of validity defined by formula (1) ensures that

$$ P\left\{ {Y_{0} \in C_{{{\varvec{Z}}^{l} ,{\varvec{X}}_{0} }}^{{\left( {1 - \eta } \right)}} } \right\} = 1 - \eta , $$

(2)

where ${C}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left(1-\eta \right)}=\left[{\widehat{q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}^{\left(\eta /2\right)},{\widehat{q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}^{\left(1-\eta /2\right)}\right]$ is the prediction interval derived from ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)$, whose expected coverage rate is $1-\eta $.

The predictive systems developed in the literature needs strong assumptions to be valid in small-sample cases [25, 27]. Therefore, to obtain validity in small-sample cases, randomized predictive system introduces the extra random number $t$, whose purpose is to define a similar property of validity for RPS as follows,

$$P\left\{{Q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0},T}\left({Y}_{0}\right)\le \eta \right\}=\eta $$

(3)

If ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)$ is the $p$ value of conformal prediction, the corresponding RPS is called conformal predictive system, which has the property of validity in small-sample cases defined by formula (3) and the equation-like formula (2) holds by introducing $T$.

Next, we review SCPSs and CCPSs to demonstrate how to construct the function ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)$.

Split conformal predictive system

To build a CPS, the conformity scores of observations calculated by a conformity measure $A\left(S,{\varvec{z}}\right)$ are needed, where $S$ is a data set and $z$ is an observation. The conformity measure evaluates the degree of agreement between $S$ and $z$. In the context of SCPSs, $A\left(S,{\varvec{z}}\right)$ should be a balance isotonic function [34]. In general, with a regression algorithm $u$, $A\left(S,{\varvec{z}}\right)$ can be designed as

$$A\left(S,{\varvec{z}}\right)=y-{\widehat{\mu }}_{S}\left({\varvec{x}}\right),$$

(4)

or

$$A\left(S,{\varvec{z}}\right)=\frac{y-{\widehat{\mu }}_{S}\left({\varvec{x}}\right)}{\sqrt{{\widehat{\upsilon }}_{S}\left({\varvec{x}}\right)}},$$

(5)

where ${\widehat{\mu }}_{S}$ and ${\widehat{\upsilon }}_{S}$ are estimated mean function and conditional variance function learned from $S$, respectively.

The learning process of SCPSs splits the training set ${{\varvec{z}}}^{l}$ into two parts, which are the proper training set ${{\varvec{z}}}_{1}^{m}=\{\left({{\varvec{x}}}_{j},{y}_{j}\right),j=\mathrm{1,2},\cdots ,m\}$ and the calibration set ${{\varvec{z}}}_{m}^{l}=\{\left({{\varvec{x}}}_{j},{y}_{j}\right),j=m+1,\cdots ,l\}$. For each possible label $y\in {\varvec{R}}$, $l-m+1$ conformity scores can be computed as follows:

$${\alpha }_{i}=A\left({{\varvec{z}}}_{1}^{m},\left({{\varvec{x}}}_{i},{y}_{i}\right)\right),$$

$${\alpha }_{0}^{y}=A\left({{\varvec{z}}}_{1}^{m},\left({{\varvec{x}}}_{0},y\right)\right),$$

where ${{\varvec{x}}}_{0}$ is a test input, $y$ is the corresponding label of ${{\varvec{x}}}_{0}$ and $i=m+1,m+2,\cdot \cdot \cdot ,l$. Based on the above calculation, ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)$ can be obtained as formula (5) in [37]. The theory in [34] shows that the above ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)$ is a valid RPS.

Different $A\left(S,{\varvec{z}}\right)$ leads to different ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)$. Suppose that formula (5) is the conformity measure and define ${C}_{i}$ as

$${C}_{i}={\widehat{\mu }}_{{{\varvec{z}}}_{1}^{m}}\left({{\varvec{x}}}_{0}\right)+\frac{{y}_{m+i}-{\widehat{\mu }}_{{{\varvec{z}}}_{1}^{m}}\left({{\varvec{x}}}_{m+i}\right)}{\sqrt{{\widehat{\upsilon }}_{{{\varvec{z}}}_{1}^{m}}\left({{\varvec{x}}}_{m+i}\right)}}\times \sqrt{{\widehat{\upsilon }}_{{{\varvec{z}}}_{1}^{m}}\left({{\varvec{x}}}_{0}\right).}$$

Sort ${C}_{i}$ to obtain ${C}_{\left(1\right)}\le \cdots \le {C}_{\left(l-m\right)}$ and let ${C}_{\left(0\right)}=-\infty $ and ${C}_{\left(l-m+1\right)}=\infty $. Then, the corresponding ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)$ is calculated as formula (7) in [37], which can be further modified to become a formal CDF as formula (8) in [37], i.e., the empirical CDF of $\left\{{C}_{\left(i\right)},i=1,\cdots ,l-m\right\}$.

The split process of SCPSs may not make full use of the training data, which is the reason of the development of CCPSs.

Cross-conformal predictive system

Based on the idea of cross validation, CCPSs first partition the training data into $k$ folds. Let ${o}_{i}$ denote the ordinals of training data in the $i$ th fold and ${z}_{\left({o}_{i}\right)}^{l}$ denote the training data without the $i$ th fold. For each $i\in \left\{1,\cdot \cdot \cdot ,k\right\}$, a CCPS with conformity measure $A\left(S,{\varvec{z}}\right)$ calculates the conformity scores with ${{\varvec{z}}}_{\left({o}_{i}\right)}^{l}$ being the proper training set and $\left\{{{\varvec{z}}}_{j}|j\in {o}_{i}\right\}$ the calibration set. The corresponding conformity scores are

$${\alpha }_{j,i}=A\left({{\varvec{z}}}_{\left({o}_{i}\right)}^{l},{{\varvec{z}}}_{j}\right)$$

and

$${\alpha }_{0,i}^{y}=A\left({{\varvec{z}}}_{\left({o}_{i}\right)}^{l},\left({{\varvec{x}}}_{0},y\right)\right).$$

Finally, the function ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)$ of the CCPS is written as formula (9) in [37].

Suppose that formula (5) is the conformity measure and for $j\in {o}_{i}$, ${C}_{j,i}$ is written as

$${C}_{j,i}={\widehat{\mu }}_{{{\varvec{z}}}_{\left({o}_{i}\right)}^{l}}\left({{\varvec{x}}}_{0}\right)+\frac{{y}_{i}-{\widehat{\mu }}_{{{\varvec{z}}}_{\left({o}_{i}\right)}^{l}}\left({{\varvec{x}}}_{i}\right)}{\sqrt{{\widehat{v}}_{{{\varvec{z}}}_{\left({o}_{i}\right)}^{l}}\left({{\varvec{x}}}_{i}\right)}}\times \sqrt{{\widehat{v}}_{{{\varvec{z}}}_{\left({o}_{i}\right)}^{l}}\left({{\varvec{x}}}_{0}\right).}$$

Sort all ${C}_{j,i}$ to obtain ${C}_{\left(1\right)}\le \cdots \le {C}_{\left(l\right)}$ and set ${C}_{\left(0\right)}=-\infty $ and ${C}_{\left(l+1\right)}=\infty $. Then, the ${Q}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0},t}\left(y\right)$ of the above CCPS can be written as formula (10) in [37], which can be further modified to become a formal CDF as formula (11) in [37], i.e., the empirical CDF of $\left\{{C}_{\left(i\right)},i=1,\cdots ,l\right\}.$

Leave-One-Out CCPS with formula (5) as conformity measure can be obtained by choosing $k=l$, whose predictive distribution is the empirical CDF of $\left\{{C}_{i},i=1,\cdots ,l\right\}$, with ${C}_{i}$ being written as.

$${C}_{i}={\widehat{u}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{0}\right)+\frac{{y}_{i}-{\widehat{u}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{i}\right)}{\sqrt{{\widehat{v}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{i}\right)}}\times \sqrt{{\widehat{v}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{0}\right).}$$

We summarize the Leave-One-Out CCPS in Algorithm 1, since our proposed predictive system based on locally weighted jackknife prediction is highly related to it.

Locally weighted jackknife predictive system

Jackknife prediction employs leave-one-out predictions for training data, which was proposed in the context of conformal prediction to build interval predictors [14, 36, 38]. Here, we extend it to build predictive systems inspired by Leave-One-Out CCPS. Locally weighted jackknife prediction is the jackknife prediction with the square root of ${\widehat{\upsilon }}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{i}\right)$ in Algorithm 1 as the local weight. In fact, Algorithm 1 can be modified to base on locally weighted jackknife prediction by changing ${\widehat{u}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{0}\right)$ and ${\widehat{v}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{0}\right)$ to ${\widehat{u}}_{{{\varvec{z}}}^{l}}\left({{\varvec{x}}}_{0}\right)$ and ${\widehat{v}}_{{{\varvec{z}}}^{l}}\left({{\varvec{x}}}_{0}\right)$, respectively, which reduces the times of the computation for regressors from $l$ to 1. In addition, one also needs a way of calculating or approximating ${\widehat{v}}_{{{\varvec{z}}}^{l}}$ and ${\widehat{v}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}$ efficiently to build the predictive system. In this paper, we employ the way of approximating them developed in our previous works [36, 38] about conformal prediction, which leads to our proposed predictive system based on locally weighted jackknife prediction in Algorithm 2.

Algorithm 2 utilizes the jackknife prediction ${\widehat{u}}_{{{\varvec{z}}}_{\left(i\right)}^{l}}$ and calculates the locally weighted leave-one-out residuals with the square root of ${\widehat{v}}_{{\widehat{{\varvec{z}}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{i}\right)$ as the weight to build the predictive system.

Although Algorithm 2 needs to compute leave-one-out residuals, the learning process can be fast if the underlying algorithms $u$ and $v$ are linear smoothers [39], which have closed-form formula for computation.

Asymptotic analysis of locally weighted jackknife predictive system

This section provides the asymptotic analysis of LW-JPS. We first give the related definition, assumptions and conditions, and then prove the asymptotic validity of LW-JPS.

Definitions, assumptions and conditions

Throughout the paper, we assume that the labels are bounded by $D$, i.e., ${sup}_{y\in {\varvec{Y}}}\left|y\right|\le D$. The regularity properties of the probability distribution $\rho $ on ${\varvec{Z}}={\varvec{X}}\times {\varvec{Y}}$ will be assumed when it is needed as in [9]. All observations $\left({{\varvec{X}}}_{i},{Y}_{i}\right)$ are i.i.d. samples. The generalization error of a function $f:{\varvec{X}}\to {\varvec{Y}}$ is measured by

$$\xi \left(f\right)=E\left[{\left(f\left({\varvec{X}}\right)-Y\right)}^{2}\right]={\int }_{{\varvec{Z}}}\left(f\left({\varvec{x}}\right)-y\right)^{2}d\rho .$$

Denote the marginal probability distribution of $\rho $ on ${\varvec{X}}$ as ${\rho }_{{\varvec{X}}}$, which is ${\rho }_{{\varvec{X}}}\left(S\right)=\rho \left(S\times {\varvec{Y}}\right)$ for the measurable set $S\subseteq {\varvec{X}}$. The conditional distribution of $y$ given ${\varvec{x}}$ is $\rho \left(y|{\varvec{x}}\right)$ and the regression function of $\rho $ is

$${\mu }_{\rho }\left({\varvec{x}}\right){=E\left[Y|{\varvec{X}}={\varvec{x}}\right]=\int }_{{\varvec{Y}}}yd\rho \left(y|{\varvec{x}}\right).$$

Therefore, based on Proposition 1.8 in [9], ${\mu }_{\rho }$ is the minimizer of $\xi \left(f\right)$ and for each $f:{\varvec{X}}\to {\varvec{Y}}$,

$$\xi \left(f\right)-\xi \left({\mu }_{\rho }\right)={\int }_{{\varvec{X}}}{\left(f\left({\varvec{x}}\right)-{\mu }_{\rho }\left({\varvec{x}}\right)\right)}^{2}d{\rho }_{{\varvec{X}}}.$$

It can be concluded that ${\mu }_{\rho }$ is bounded by $D$ as $Y\le D$.

For the regression problem, we assume that the samples satisfy Assumption 1, where ${\Vert f\Vert }_{\infty }$ is the infinite norm of $f$ on its domain, i.e., ${\Vert f\Vert }_{\infty }={\mathrm{sup}}_{{\varvec{x}}\in {\varvec{X}}}\left|f\left({\varvec{x}}\right)\right|$.

Assumption 1

Each observation $\left({\varvec{X}},Y\right)$ satisfies the following formula:

$$Y={\mu }_{\rho }\left({\varvec{X}}\right)+\sqrt{{v}_{\rho }\left({\varvec{X}}\right)}\times \zeta ,$$

where ${v}_{\rho }\left({\varvec{X}}\right)$ is the conditional variance function and $\zeta $ is a random variable with zero mean and unit variance. $\zeta $ is independent of ${\varvec{X}}$ and $0<{v}_{min}\le {\Vert {v}_{\rho }\Vert }_{\infty }\le {v}_{max}<\infty $ . In addition, $\left|\zeta \right|\le {\zeta }_{max}$ whose cumulative distribution function $F\left(b\right)=P\left\{\zeta \le b\right\}$ is continuous and strictly increasing on $\left\{b|F\left(b\right)\in \left(\mathrm{0,1}\right)\right\}$ .

The formula in Assumption 1 is a standard assumption for regression problems with heteroscedastic setting, where the conditional variance of $Y$ is dependent on ${\varvec{X}}$ instead of a constant. Since $Y$ is bounded, $\left|\zeta \right|\le {\zeta }_{max}$ is assumed.

An array of random variables ${X}_{l}$ for $l\in {{\varvec{N}}}^{+}$ converges to a random variable $X$ in probability is written as ${X}_{l}{\to }_{p}X$, whose definition can be found from the Definition 1 in [38].

To prove the asymptotic validity of LW-JPS, the following four conditions are needed for algorithms $\upmu $ and $v$, which were also introduced in our earlier work about the theoretical analysis of locally weighted jackknife prediction [38]. In the conditions, $r$ represents a general regression algorithm and ${{\varvec{Z}}}^{l}$ is a general random data set for training r, whose samples are i.i.d.. ${\widehat{r}}_{{{\varvec{Z}}}^{l}}$ is the learned regressor whose randomness is from ${{\varvec{Z}}}^{l}$ and ${\widehat{r}}_{{{\varvec{z}}}^{l}}$ is the corresponding realization.

Condition 1. The regression algorithm $r$ is symmetric in the observations, such that for each $l$, each ${{\varvec{z}}}^{l}$ and each permutation $\pi $ of $\left\{1,\cdots ,l\right\}$, there holds

$${\widehat{r}}_{{{\varvec{z}}}^{l}}={\widehat{r}}_{{\pi }_{l}\left({{\varvec{z}}}^{l}\right)},$$

where ${\pi }_{l}\left({{\varvec{z}}}^{l}\right)=\left\{{{\varvec{z}}}_{\pi \left(j\right)},j=1,\cdots ,l\right\}$.

Condition 2. The regressor ${\widehat{r}}_{{{\varvec{Z}}}^{l}}$ uniformly converges in probability to the regression function ${\mu }_{\rho }$ of ${\varvec{Z}}$, i.e.,

$${\Vert {\widehat{r}}_{{{\varvec{Z}}}^{l}}-{\mu }_{\rho }\Vert }_{\infty }{\to }_{p}0.$$

Condition 3. The regression algorithm $r$ is a uniformly stable algorithm [8], whose uniform stability with respect to the square loss is $\beta =\beta \left(l\right)$, i.e., for each $l$ and each ${{\varvec{z}}}^{l}$,

$$ \mathop {\sup }\limits_{i} \left( {\mathop {\sup }\limits_{{({\mathbf{x}},y) \in {\mathbf{X}} \times Y}} \left| {\left( {y - \hat{r}_{{{\mathbf{z}}^{l} }} \left( {\mathbf{x}} \right)} \right)^{2} - \left( {y - \hat{r}_{{{\mathbf{z}}_{\left( i \right)}^{l} }} \left( {\mathbf{x}} \right)} \right)^{2} } \right|} \right) \le \beta (l), $$

where $\underset{l\to \infty }{\mathit{lim}}\beta \left(l\right)=0$.

Condition 4. For two fixed data sets ${\widehat{{\varvec{z}}}}^{l}=\big\{\big({\varvec{x}}_{j},{\widehat{y}}_{j}\big),j=1,\cdots ,l\big\}$ and ${\widetilde{{\varvec{z}}}}^{l}=\big\{\big({{\varvec{x}}}_{j},{\widetilde{y}}_{j}\big),j=1,\cdots ,l\big\}$ with the same input objects, if for each $l$, the labels satisfy

$$\underset{i\in \left\{\mathrm{1,2},\cdots ,l\right\}}{\mathrm{sup}}\left|{\widehat{y}}_{i}-{\widetilde{y}}_{i}\right|\le \eta ,$$

there holds

$${\Vert {\widehat{r}}_{{\widehat{{\varvec{z}}}}^{l}}-{\widehat{r}}_{{\widetilde{{\varvec{z}}}}^{l}}\Vert }_{\infty }\le \eta .$$

With the same mathematical skills in [38], we need the algorithm $\upmu $ to satisfy Condition 1, 2, and 3 and $v$ to satisfy Condition 1, 2 and 4 to prove the asymptotic validity of LW-JPS. The conditions are not too restrict for applications, which we have analyzed in section 3.3 of [38].

Asymptotic validity of LW-JPS

We introduce Lemma 1 to guarantee that ${\widehat{v}}_{{\widehat{{\varvec{Z}}}}^{l}}$ is a consistent estimator for the conditional variance function in Algorithm 2, which has been proved in [38].

Lemma 1

With Assumption 1 being hold, $\mu $ satisfying Condition 2 and 3, and $v$ satisfying Condition 2 and 4, we have

$${\Vert {\widehat{v}}_{{\widehat{{\varvec{Z}}}}^{l}}-{v}_{\rho }\Vert }_{\infty }{\to }_{p}0.$$

We will prove in Theorem 1 that Algorithm 2 is asymptotically valid by showing that the corresponding predictive distribution ${\widehat{Q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)$ satisfies

$$P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}{\to }_{p}\alpha ,$$

(6)

which is an asymptotic version of formula (1). To do so, we need to prove that

$$P\left\{{Y}_{0}\le {\widehat{q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}\alpha ,$$

(7)

where ${\widehat{q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}^{\left(\alpha \right)}$ is the $\alpha $ quantile of ${\widehat{Q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}\left(y\right)$. Formula (7) is equivalent to

$$P\left\{{\Gamma }_{{{\varvec{Z}}}^{l}}\le {\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}\alpha ,$$

(8)

where ${\Gamma }_{{{\varvec{z}}}^{l}}$ is the normalized residual defined by

$${\Gamma }_{{{\varvec{z}}}^{l}}=\frac{{Y}_{0}-{\widehat{\mu }}_{{{\varvec{z}}}^{l}}\left({{\varvec{X}}}_{0}\right)}{\sqrt{{\widehat{v}}_{{\widehat{{\varvec{z}}}}^{l}}\left({{\varvec{X}}}_{0}\right)}},$$

and ${\widehat{q}}_{{{\varvec{z}}}^{l}}^{\left(\alpha \right)}$ is the $\alpha $ quantile of the normalized leave-one-out residuals $\left\{{a}_{l,i},i=1,\cdots ,l\right\}$ defined by

$${a}_{l,i}=\frac{{y}_{i}-{\widehat{\mu }}_{{\mathbf{z}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{i}\right)}{\sqrt{{\widehat{v}}_{{\widehat{\mathbf{z}}}_{\left(i\right)}^{l}}\left({{\varvec{x}}}_{i}\right)}}.$$

Denote the CDF of ${\Gamma }_{{{\varvec{z}}}^{l}}$ by ${F}_{{{\varvec{z}}}^{l}}\left(b\right)$, i.e.,

$${F}_{{{\varvec{z}}}^{l}}\left(b\right)=P\left\{{\Gamma }_{{{\varvec{z}}}^{l}}\le b|{{\varvec{z}}}^{l}\right\},$$

and ${q}^{(\alpha )}$ is the $\alpha $ quantile of $F(b)$ in Assumption 1. Since Lemma 1 confirms that the estimator ${\widehat{v}}_{{\widehat{{\varvec{z}}}}^{l}}$ uniformly converges to ${v}_{\rho }$ in probability and $\mu $ satisfying Condition 2, we can make the connection between ${\Gamma }_{{{\varvec{z}}}^{l}}$ and the normalized noise term of Assumption 1, which is

$${\zeta }_{0}=\frac{{Y}_{0}-{\mu }_{\rho }\left({{\varvec{X}}}_{0}\right)}{\sqrt{{v}_{\rho }\left({{\varvec{X}}}_{0}\right)}},$$

and prove in Lemma 2 that

$$\underset{b\in R}{\mathrm{sup}}\left|{F}_{{{\varvec{Z}}}^{l}}\left(b\right)-F(b)\right|{\to }_{p}0.$$

Also, ${\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}$ and ${q}^{\left(\alpha \right)}$ are highly related as we show in Lemma 2 that

$${\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}{\to }_{p}{q}^{\left(\alpha \right)}.$$

Building on Lemma 2, formula (8) and formula (6) can be proved in turn and the conclusion of LW-JPS being asymptotically valid can be drawn in Theorem 1.

The analysis techniques in Lemma 2 was first introduced in [28] for linear regression problems with homoscedastic errors, which was further improved for nonlinear regression problems with heteroscedastic errors by our earlier work for locally weighted jackknife prediction [38]. The above two works both concerns building interval prediction other than probabilistic prediction, which makes the detailed expressions different from this work. In addition, our work about LOO–CCPS–RELM [37] only considers nonlinear regression problems with homoscedastic errors, and its proofs are specific to extreme learning machine. Therefore, we introduce and prove Lemma 2, which is essential to proving Theorem 1 strictly in this paper.

Lemma 2

Fix $\alpha \in \left(\mathrm{0,1}\right)$. If the conditions of Lemma 1 hold and both $\mu $ and $v$ also satisfy Condition 1, then we have.

$$\underset{b\in R}{\mathrm{sup}}\left|{F}_{{{\varvec{Z}}}^{l}}\left(b\right)-F(b)\right|{\to }_{p}0,$$

(9)

and

$${\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}{\to }_{p}{q}^{\left(\alpha \right)}.$$

(10)

P r oof

Since ${\widehat{\mu }}_{{{\varvec{z}}}^{l}}$ satisfies that

$${\Vert {\widehat{\mu }}_{{{\varvec{Z}}}^{l}}-{\mu }_{\rho }\Vert }_{\infty }{\to }_{p}0,$$

and ${\widehat{v}}_{{\widehat{{\varvec{z}}}}^{l}}$ satisfies that

$${\Vert {\widehat{v}}_{{\widehat{{\varvec{Z}}}}^{l}}-{v}_{\rho }\Vert }_{\infty }{\to }_{p}0,$$

for each $l$ we can define a nonempty set $B(l)$ as

$$B\left(l\right)=\left\{{{\varvec{z}}}^{l}|\mathrm{max}\left\{{\Vert {\widehat{\mu }}_{{{\varvec{z}}}^{l}}-{\mu }_{\rho }\Vert }_{\infty },{\Vert {\widehat{v}}_{{\widehat{{\varvec{z}}}}^{l}}-{v}_{\rho }\Vert }_{\infty }\right\}\le g\left(l\right)\right\},$$

where $g\left(l\right)$ is nonnegative and converges to 0 sufficiently slow. Then, we can construct an array of random variables ${\Gamma }_{{{\varvec{z}}}^{l}}$ by taking an arbitrary ${{\varvec{z}}}^{l}$ in $B\left(l\right)$. As ${{\varvec{z}}}^{l}\in B\left(l\right)$, ${v}_{min}>0$ and $g\left(l\right)$ converges to 0, there exists a ${l}_{1}$, such that for all $l>{l}_{1}$, there holds

$${\Vert {\widehat{v}}_{{\widehat{{\varvec{z}}}}^{l}}\Vert }_{\infty }\ge {v}_{min}-g\left(l\right)\ge {v}_{min}-g\left({l}_{1}\right)>0.$$

For all $l>{l}_{1}$, by the definitions, we have

$$\left|{\Gamma }_{{{\varvec{z}}}^{l}}-{\zeta }_{0}\right|\le \frac{\frac{g\left(l\right)}{\left|\sqrt{{v}_{min}}+\sqrt{{v}_{min}-g\left({l}_{1}\right)}\right|}\times {\zeta }_{max}+g\left(l\right)}{\sqrt{{v}_{min}-g\left({l}_{1}\right)}},$$

which guarantees that

$${\Gamma }_{{{\varvec{z}}}^{l}}{\to }_{p}{\zeta }_{0}.$$

Since convergence in probability implies convergence in distribution and the CDF of ${\zeta }_{0}$ is continuous, according to Proposition 1.16 of [26], we have

$$\underset{l\to \infty }{\mathrm{lim}}\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{{{\varvec{z}}}^{l}}\left(b\right)-F\left(b\right)\right|=0.$$

The arbitrarily chosen ${{\varvec{z}}}^{l}$ from $B\left(l\right)$ leads to

$$\underset{l\to \infty }{\mathrm{lim}}\underset{{{\varvec{z}}}^{l}\in B\left(l\right)}{\mathrm{sup}}\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{{{\varvec{z}}}^{l}}\left(b\right)-F\left(b\right)\right|=0,$$

which implies that formula (9) is correct [38].

Next, we prove formula (10). Since for every $\epsilon >0$, we have

$$P\left\{\left|{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}-{q}^{\left(\alpha \right)}\right|>\epsilon \right\}=P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}>{q}^{\left(\alpha \right)}+\epsilon \right\}+P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}<{q}^{\left(\alpha \right)}-\epsilon \right\}.$$

Thus, we need to show that

$$P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}>{q}^{\left(\alpha \right)}+\epsilon \right\}$$

and

$$P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}<{q}^{\left(\alpha \right)}-\epsilon \right\}$$

converges to 0, respectively. Define ${F}_{l}\left(b\right)$ by

$$ F_{l} \left( b \right) = P\left\{ {\frac{{Y_{1} - \hat{\mu }_{{{\varvec{Z}}_{\left( 1 \right)}^{l} }} \left( {{\varvec{X}}_{1} } \right)}}{{\sqrt {\hat{v}_{{\hat{\user2{Z}}_{\left( 1 \right)}^{l} }} \left( {{\varvec{X}}_{1} } \right)} }} \le b} \right\}{ } = E\left[ {P\left\{ {\frac{{Y_{1} - \hat{\mu }_{{{\varvec{Z}}_{\left( 1 \right)}^{l} }} \left( {{\varvec{X}}_{1} } \right)}}{{\sqrt {\hat{v}_{{\hat{\user2{Z}}_{\left( 1 \right)}^{l} }} \left( {{\varvec{X}}_{1} } \right)} }} \le b{|}{\varvec{Z}}_{\left( 1 \right)}^{l} } \right\}} \right]{ } = E\left[ {F_{{{\varvec{Z}}_{\left( 1 \right)}^{l} }} \left( b \right)} \right] $$

whose distance from $F\left(b\right)$ can be bounded by

$$\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{l}\left(b\right)-F\left(b\right)\right|\le E\left[\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{{{\varvec{Z}}}_{\left(1\right)}^{l}}\left(b\right)-F\left(b\right)\right|\right].$$

From formula (9) and the definition of leave-one-out samples, the bounded random variable $\underset{b\in {\varvec{R}}}{\mathit{sup}}\left|{F}_{{{\varvec{Z}}}_{\left(1\right)}^{l}}\left(b\right)-F\left(b\right)\right|$ converges to 0 in probability. This leads to

$$ \mathop {\lim }\limits_{l \to \infty } \mathop {\sup }\limits_{{b \in {\varvec{R}}}} \left| {F_{l} \left( b \right) - F\left( b \right)} \right| \le \mathop {\lim }\limits_{l \to \infty } \begin{array}{*{20}c} {E\left[ {\mathop {\sup }\limits_{{b \in {\varvec{R}}}} \left| {F_{{{\varvec{Z}}_{\left( 1 \right)}^{l} }} \left( b \right) - F\left( b \right)} \right|} \right],} = 0 \\ \end{array} $$

i.e.,

$$\underset{l\to \infty }{\mathrm{lim}}\underset{b\in {\varvec{R}}}{\mathrm{sup}}\left|{F}_{l}\left(b\right)-F\left(b\right)\right|=0.$$

(11)

Let the CDF of the normalized leave-one-out residuals $\left\{{a}_{l,i},i=1,\cdots ,l\right\}$ be denoted by ${F}_{al}\left(b\right)$. Therefore, ${F}_{{A}_{l}}\left(b\right)$ and ${\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}$ are the corresponding random function and random variable by introducing the randomness of ${{\varvec{Z}}}^{l}$. Define ${J}_{l,i}={1}_{\left\{{A}_{l,i}>{q}^{\left(\alpha \right)}+\epsilon \right\}}$, which is the indicator function of $\left\{{A}_{l,i}>{q}^{\left(\alpha \right)}+\epsilon \right\}$. As algorithms $\mu $ and $v$ being exchangeable implies that $\left\{{J}_{l,j},j=1,\cdots ,l\right\}$ are exchangeable, based on the property of quantile function [29], we have

$$ \begin{gathered} P\left\{ {\hat{q}_{{{\varvec{Z}}^{l} }}^{\left( \alpha \right)} > q^{\left( \alpha \right)} + \in } \right\} = P\left\{ {\alpha > F_{{A_{l} }} \left( {q^{\left( \alpha \right)} + \in } \right)} \right\}{ } \hfill \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; = P\left\{ {1 - F_{{A_{l} }} \left( {q^{\left( \alpha \right)} + \in } \right) > 1 - \alpha } \right\}{ } \hfill \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; = P\left\{ {\frac{1}{l}\mathop \sum \limits_{i = 1}^{l} \left( {J_{l,i} - E\left[ {J_{l,1} } \right]} \right) > 1 - \alpha - E\left[ {J_{l,1} } \right]} \right\}{ } \hfill \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; = P\left\{ {\frac{1}{l}\mathop \sum \limits_{i = 1}^{l} \left( {J_{l,i} - E\left[ {J_{l,i} } \right]} \right) > F_{l} \left( {q^{\left( \alpha \right)} + \in } \right) - \alpha } \right\}. \hfill \\ \end{gathered} $$

Since formula (11) holds, it follows that

$$F\left({q}^{\left(\alpha \right)}+\epsilon \right)>\alpha ,$$

which implies that ${F}_{l}\left({q}^{\left(\alpha \right)}+\epsilon \right)>0$ for sufficiently large $l$. Thus, it follows from Markov’s inequality that for sufficiently large $l$, the probability,

$$P\left\{\frac{1}{l}\sum_{i=1}^{l}\left({J}_{l,i}-E\left[{J}_{l,i}\right]\right)>{F}_{l}\left({q}^{\left(\alpha \right)}+\epsilon \right)-\alpha \right\},$$

is bounded by

$$\frac{\frac{1}{l}var\left({J}_{l,1}\right)+\frac{l\left(l-1\right)}{{l}^{2}}cov\left({J}_{l,1},{J}_{l,2}\right)}{{\left({F}_{l}\left({q}^{\left(\alpha \right)}+\epsilon \right)-\alpha \right)}^{2}},$$

(12)

where $var$ and $cov$ are the variance and covariance function, respectively. Therefore, to prove $P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}>{q}^{\left(\alpha \right)}+\epsilon \right\}$ approaches 0, we need to prove $cov\left({J}_{l,1},{J}_{l,2}\right)$ converges to 0 as $l\to \infty $.

Let ${{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}$ and ${\widehat{{\varvec{Z}}}}_{\left(\mathrm{1,2}\right)}^{l}$ be the corresponding data set without the first two observations. Define ${A}_{l,\left(\mathrm{1,2}\right)}$ and ${A}_{l,\left(\mathrm{2,1}\right)}$ by

$${A}_{l,\left(\mathrm{1,2}\right)}=\frac{{{Y}_{1}-\widehat{m}}_{{{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({{\varvec{X}}}_{1}\right)}{\sqrt{{\widehat{v}}_{{\widehat{{\varvec{Z}}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({{\varvec{X}}}_{1}\right)}}, {A}_{l,\left(\mathrm{2,1}\right)}=\frac{{{Y}_{2}-\widehat{m}}_{{{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({{\varvec{X}}}_{2}\right)}{\sqrt{{\widehat{v}}_{{\widehat{{\varvec{Z}}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({{\varvec{X}}}_{2}\right)}},$$

and define ${\widetilde{A}}_{1}$ and ${\widetilde{A}}_{2}$ by

$${\widetilde{A}}_{1}=\left[{A}_{l,1},{A}_{l,2}\right], {\widetilde{A}}_{2}=\left[{A}_{l,\left(\mathrm{1,2}\right)},{A}_{l,\left(\mathrm{2,1}\right)}\right].$$

Let ${\widetilde{F}}_{l,1}$ and ${\widetilde{F}}_{l,2}$ be the CDFs of ${\widetilde{A}}_{1}$ and ${\widetilde{A}}_{2}$, respectively, i.e.,

$${\widetilde{F}}_{l,1}\left({b}_{1},{b}_{2}\right)=P\left\{{A}_{l,1}\le {b}_{1},{A}_{l,1}\le {b}_{2}\right\},$$

and

$${\widetilde{F}}_{l,2}\left({b}_{1},{b}_{2}\right)=P\left\{{A}_{l,\left(\mathrm{1,2}\right)}\le {b}_{1},{A}_{l,\left(\mathrm{1,2}\right)}\le {b}_{2}\right\}.$$

For ${\widetilde{F}}_{l,2}$, we have

$${\widetilde{F}}_{l,2}\left({b}_{1},{b}_{2}\right)=E\left[{F}_{{{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({b}_{1}\right){F}_{{{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({b}_{2}\right)\right].$$

(13)

Since ${F}_{{{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({b}_{1}\right)$ and ${F}_{{{\varvec{Z}}}_{\left(\mathrm{1,2}\right)}^{l}}\left({b}_{2}\right)$ are bounded random variables, which converge to $F\left({b}_{1}\right)$ and $F\left({b}_{2}\right)$, respectively, due to formula (9), we have

$$\underset{l\to \infty }{\mathrm{lim}}{\widetilde{F}}_{l,2}\left({b}_{1},{b}_{2}\right)=F\left({b}_{1}\right)F\left({b}_{2}\right).$$

As Lemma 1 holds and $\upmu $ satisfies Condition 2, we can deduce that ${A}_{l,1}$, ${A}_{l,2}$, ${A}_{l,\left(\mathrm{1,2}\right)}$ and ${A}_{l,\left(\mathrm{2,1}\right)}$ are all convergent in probability to ${\upzeta }_{0}$, which implies that ${A}_{l,1}-{A}_{l,\left(\mathrm{1,2}\right)}{\to }_{p}0$ and ${A}_{l,2}-{A}_{l,\left(\mathrm{2,1}\right)}{\to }_{p}0$. Therefore, from Lemma 2.8 in [29], there holds

$$\underset{l\to \infty }{\mathrm{lim}}{\widetilde{F}}_{l,1}\left({b}_{1},{b}_{2}\right)=F\left({b}_{1}\right)F\left({b}_{2}\right).$$

Furthermore, with

$$ \begin{gathered} {\text{cov}}\left( {J_{l,1} ,J_{l,1} } \right){ } = {\text{cov}}\left( {1 - J_{l,1} ,1 - J_{l,1} } \right){ } \hfill \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; = \tilde{F}_{l,1} \left( {q^{\left( \alpha \right)} + \in ,q^{\left( \alpha \right)} + \in } \right) - F_{l} \left( {q^{\left( \alpha \right)} + \in } \right)F_{l} \left( {q^{\left( \alpha \right)} + \in } \right) \hfill \\ \end{gathered} $$

and formula (11), we have

$$\underset{l\to \infty }{\mathrm{lim}}cov\left({J}_{l,1},{J}_{l,1}\right)=0.$$

Based on the formula above and Eq. (12), we have

$$\underset{l\to \infty }{\mathrm{lim}}P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}>{q}^{\left(\alpha \right)}+\epsilon \right\}=0.$$

Similarly, we can also prove that

$$\underset{l\to \infty }{\mathrm{lim}}P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}<{q}^{\left(\alpha \right)}-\epsilon \right\}=0.$$

Thus, since the fact that

$$P\left\{\left|{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}-{q}^{\left(\alpha \right)}\right|>\epsilon \right\}=P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}>{q}^{\left(\alpha \right)}+\epsilon \right\}+P\left\{{\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}<{q}^{\left(\alpha \right)}-\epsilon \right\}.$$

and the two limit equations above hold, we have

$${\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(1-\alpha \right)}{\to }_{p}{q}^{\left(1-\alpha \right)}.$$

(14)

The following two theorems describe the statistical compatibility of the predictive distributions output by LW-JPS with observations in the asymptotic setting. Theorem 1 proves the asymptotic version of formula (1) and Theorem 2 proves a sufficient condition of the asymptotic version of formula (2), where quantiles can be set arbitrarily.

Theorem 1

Fix $\alpha \in \left(\mathrm{0,1}\right)$ . If Assumption 1 holds, $\mu $ satisfying Condition 1, 2 and 3, and $v$ satisfying Condition 1, 2 and 4, we have

$$P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}{\to }_{p}\alpha .$$

Proof

Based on Assumption 1, we have $F\left({q}^{\left(\alpha \right)}\right)=\alpha $. Therefore,

$$ \begin{gathered} \left| {P\left\{ {\Gamma_{{{\varvec{Z}}^{l} }} \le \hat{q}_{{{\varvec{Z}}^{l} }}^{\left( \alpha \right)} {|}{\varvec{Z}}^{l} } \right\} - \alpha } \right|{ } = \left| {F_{{{\varvec{Z}}^{l} }} \left( {\hat{q}_{{{\varvec{Z}}^{l} }}^{\left( \alpha \right)} } \right) - F\left( {q^{\left( \alpha \right)} } \right)} \right|{ } \hfill \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \le \left| {F_{{{\varvec{Z}}^{l} }} \left( {\hat{q}_{{{\varvec{Z}}^{l} }}^{\left( \alpha \right)} } \right) - F\left( {\hat{q}_{{{\varvec{Z}}^{l} }}^{\left( \alpha \right)} } \right)} \right| + \left| {F\left( {\hat{q}_{{{\varvec{Z}}^{l} }}^{\left( \alpha \right)} } \right) - F\left( {q^{\left( \alpha \right)} } \right)} \right|{ } \hfill \\ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \le \mathop {\sup }\limits_{{b \in {\varvec{R}}}} \left| {F_{{{\varvec{Z}}^{l} }} \left( b \right) - F\left( b \right)} \right| + \left| {F\left( {\hat{q}_{{{\varvec{Z}}^{l} }}^{\left( \alpha \right)} } \right) - F\left( {q^{\left( \alpha \right)} } \right)} \right|. \hfill \\ \end{gathered} $$

From Lemma 2 and $F\left(b\right)$ being continuous, we have $\left|F\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\right)-F\left({q}^{\left(\alpha \right)}\right)\right|{\to }_{p}0$ using Theorem 1.10 in [26] and $\underset{b\in {\varvec{R}}}{\mathrm{sup}}{\to }_{p}0$. Thus, we can conclude that

$${F}_{{{\varvec{Z}}}^{l}}\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\right)=P\left\{{\Gamma }_{{{\varvec{Z}}}^{l}}\le {\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}\alpha ,$$

(15)

which is equivalent to

$$P\left\{{Y}_{0}\le {\widehat{q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}\alpha ,$$

(16)

since for every $\alpha \in \left(\mathrm{0,1}\right)$, there holds

$${F}_{{{\varvec{Z}}}^{l}}\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}\right)=P\left\{{\Gamma }_{{{\varvec{Z}}}^{l}}\le {\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}=P\left\{{Y}_{0}\le {\widehat{q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left(\alpha \right)}|{{\varvec{Z}}}^{l}\right\}.$$

For every $\epsilon $ such that $0<\epsilon <max\left\{\alpha ,1-\alpha \right\}$, by the definition of quantiles, we have

$$P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}\le {F}_{{{\varvec{Z}}}^{l}}\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha +\epsilon \right)}\right),$$

(17)

and

$$P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}\ge {F}_{{{\varvec{Z}}}^{l}}\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha -\epsilon \right)}\right).$$

(18)

Based on formula (15), for every $\delta >$ 0, we have

$$P\left\{\left|{F}_{{{\varvec{Z}}}^{l}}\left({\widehat{q}}_{{{\varvec{Z}}}^{l}}^{\left(\alpha +\epsilon \right)}\right)-\left(\alpha +\epsilon \right)\right|>\epsilon \right\}<\delta ,$$

which combing formula (17) lead to

$$P\left\{P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}-\left(\alpha +\epsilon \right)>\epsilon \right\}<\delta .$$

Similarly, with formula (18), there holds

$$P\left\{P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}-\left(\alpha -\epsilon \right)<-\epsilon \right\}<\delta .$$

Then, we have

$$P\left\{\left|P\left\{{\widehat{Q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}\left({Y}_{0}\right)\le \alpha |{{\varvec{Z}}}^{l}\right\}-\alpha \right|>2\epsilon \right\}<2\delta .$$

Since $\epsilon $ and $\delta $ are arbitrary, the conclusion of Theorem 1 can be drawn.

Based on the deduction of Theorem 1, we can obtain the following coverage guarantee for derived prediction intervals from ${\widehat{Q}}_{{{\varvec{z}}}^{l},{{\varvec{x}}}_{0}}$, which is desirable for practitioners for interval prediction.

Theorem 2

Fix ${\eta }_{1}$ and ${\eta }_{2}$ such that $0<{\eta }_{1}<{\eta }_{2}<1$ . If the conditions of Theorem 1 hold, we have

$$P\left\{{q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}\right)}\le {Y}_{0}\le {q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{2}\right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}{\eta }_{2}-{\eta }_{1}.$$

Proof

For every $\epsilon $ such that $0<\epsilon <\mathrm{max}\left\{{\eta }_{1},{\Delta }_{\eta }\right\}$, based on formula (16), we have

$$P\left\{{Y}_{0}\le {\widehat{q}}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}-\epsilon \right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}{\eta }_{1}-\epsilon ,$$

which leads to

$$P\left\{{{q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}-\epsilon \right)}<Y}_{0}\le {q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{2}\right)}|{{\varvec{Z}}}^{l}\right\}{\to }_{p}{\Delta }_{\eta }-\epsilon ,$$

where ${\Delta }_{\eta }={\eta }_{2}-{\eta }_{1}$. Then, for every $\delta >$ 0, there holds

$$P\left\{\left|P\left\{{{q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}-\epsilon \right)}<Y}_{0}\le {q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{2}\right)}|{{\varvec{Z}}}^{l}\right\}-\left({\Delta }_{\eta }+\epsilon \right)\right|>\epsilon \right\}<\delta .$$

Thus, we have

$$P\left\{P\left\{{{q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}\right)}\le Y}_{0}\le {q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{2}\right)}|{{\varvec{Z}}}^{l}\right\}-\left({\Delta }_{\eta }+\epsilon \right)>\epsilon \right\}<\delta .$$

Similarly, there holds

$$P\left\{P\left\{{{q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}\right)}\le Y}_{0}\le {q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{2}\right)}|{{\varvec{Z}}}^{l}\right\}-\left({\Delta }_{\eta }-\epsilon \right)<-\epsilon \right\}<\delta .$$

Therefore, we have

$$P\left\{\left|P\left\{{{q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{1}\right)}\le Y}_{0}\le {q}_{{{\varvec{Z}}}^{l},{{\varvec{X}}}_{0}}^{\left({\eta }_{2}\right)}|{{\varvec{Z}}}^{l}\right\}-{\Delta }_{\eta }\right|>2\epsilon \right\}<2\delta ,$$

which proves the conclusion of Theorem 2, since $\epsilon $ and $\delta $ are arbitrary.

Experiments

In this section, to test LW-JPS empirically, randomized kernel ridge regression with random Fourier features [21] is used as $\mu $ and $k$-nearest neighbor regression is used as $v$, respectively, since they satisfy the conditions we assumed in “Asymptotic analysis of locally weighted jackknife predictive system”. Following [38], the number of random features were set to 1000 and $k=\sqrt{l}$ for $k$-nearest neighbor regression. The ridge parameter with the least leave-one-out errors was chosen for LW-JPS. The comparison predictive systems are SPCS with support vector regression (SCPS–SVR), SCPS with random forests (SCPS–RF), CCPS with support vector regression (CCPS–SVR), CCPS with random forests (CCPS–RF) and CPS with random forests with out-of-bag errors as conformity scores (OOB–CPS–RF). All the comparison algorithms employ formula (5) as the conformity measure based on the recently empirical evaluation research in [40]. SCPS–SVR, SCPS–RF, CCPS–SVR and CCPS–RF use the same normalization for conformity measure as LW-JPS, whereas OOB–CPS–RF uses the standard deviation of out-of-bag predictions for normalization based on the approach in [40]. OOB–CPS–RF was first proposed in [40], which extends the idea of the state-of-the-art conformal regressor with random forests [7]. Following [37], for all SCPSs, 40 percent of the training data was used as calibration set and for all CCPSs, the number of folds was 5. In addition, the meta-parameters of all comparison algorithms were chosen using threefold cross-validation on the training set based on ${R}^{2}$ scores. SVR with Gaussian kernel was employed, whose regularization parameter $C$ was chosen from $\left\{{10}^{-5},{10}^{-4},\cdots ,{10}^{4},{10}^{5}\right\}$. For random forests, the number of trees was chosen from $\left\{100, 300, 500, 1000\right\}$ and the minimum number of samples per tree leaf was chosen from $\left\{1, 3, 5\right\}$, respectively.

The experiments were conducted on 20 public data sets, which are from Delve [22], KEEL [4] and UCI [5] repositories whose detailed information is summarized in Table 1. The features and labels were all normalized to $[0, 1]$ with min–max normalization. Tenfold cross-validation was used to test the algorithms, i.e., each data set was randomly split into tenfolds, where each fold was used to evaluate the algorithms trained from the other ninefolds and the mean of ten results for each algorithm was reported. All the algorithms in this section were coded with python based on numpy and scikit-learn library and the experimental results were collected from the computer with 3.5 GHz CPU and 32 GB RAM.

Table 1 Data sets

Full size table

Test the validity of LW-JPS

This section tests whether LW-JPS is a valid predictive system by the definition of formula (1). To do so, the values of the CDF of LW-JPS on the test data were collected and the frequency of the values being not more than $\alpha $ was calculated, whose results are shown in Table 2 with mean representing the mean value of each column. Table 2 demonstrates that the frequencies are compatible with corresponding $\alpha $, which empirically proves the validity of LW-JPS.

Table 2 Validity test of LW-JPS

Full size table

As we analyze in “Conformal predictive systems and locally weighted jackknife predictive system”, the validity property of formula (1) implies the coverage guarantee by formula (2), which will be shown in the next experiment.

Comparison with the other CPSs

This section compares the performance of LW-JPS with SCPS–SVR, SCPS–RF, CCPS–SVR, CCPS–RF and OOB–CPS–RF. To compare the quality of the predictive distributions, the widely used continuous ranked probability score (CRPS) are employed whose definition can be found in [34]. The lower the CRPS is, the better the predictive distribution is. The barplots of the mean of continuous ranked probability scores for different data sets are shown in Fig. 1, which demonstrates that LW-JPS performs better in most cases. Table 3 records the mean CRPS of all algorithms, with the least one of each data set shown in bold. For each data set, the rank of an algorithm is obtained and the mean rank in Table 3 is the mean value of all ranks for each algorithm. From Table 3, we can see that the LW-JPS performs better than the other predictive systems, which indicates the effectiveness of LW-JPS.

Table 3 The mean CRPS of all algorithms

Full size table

We also test the derived prediction intervals from the predictive distributions of all predictive systems. For a significance level$\eta $, which is the expected coverage rate preset by practitioners, the derived prediction interval is based on formula (2) with the help of $\eta /2$ and $1-\eta /2$ quantiles. Two indicators are employed to describe the quality of prediction intervals. One is the prediction error rate, which is the frequency of the true label being out of the prediction intervals. The other is the average interval size, which measures the information efficiency of the prediction intervals. The smaller the average interval size, the more information the prediction intervals contain. We set the significance levels as 0.2, 0.1 and 0.05 and show the experimental results in Tables 4, 5 and 6 for error rates and in Tables 7, 8 and 9 for average interval sizes. We also summarize the error rates, the means and mean ranks of average interval sizes in Figs. 2, 3, and 4.

Table 4 Error rate $\left(\eta =0.2\right)$

Full size table

Table 5 Error rate $\left(\eta =0.1\right)$

Full size table

Table 6 Error rate $\left(\eta =0.05\right)$

Full size table

Table 7 Average interval size $\left(\eta =0.2\right)$

Full size table

Table 8 Average interval size $\left(\eta =0.1\right)$

Full size table

Table 9 Average interval size $\left(\eta =0.05\right)$

Full size table

From Tables 4, 5 and 6, we can see that all predictive systems are empirically valid for the data sets, which also empirically proves the coverage guarantee of LW-JPS. Besides, it is shown in Tables 7, 8 and 9 that prediction intervals of LW-JPS are more informationally efficient than those of the other algorithms, which is demonstrated in Figs. 3 and 4. The box plots of average interval size are also shown in Fig. 5, which also demonstrates that JPS performs better than other CPSs.

We also conducted Wilcoxon test [10] to answer the question of whether LW-JPS performs better than other comparison algorithms significantly. Table 10 demonstrates the p values of the experimental results about CRPS and average interval sizes with $\eta \in \{\mathrm{0.2,0.1,0.05}\}$ and the bold values are less than 0.05, which shows the significant differences. From Table 10, we can see that LW-JPS significantly performs better than SCPS–SVR, SCPS–RF, CCPS–SVR and CCPS–RF, and the differences between LW-JPS and OOB–CPS–RF are not significant in most cases. Since OOB–CPS–RF represents the state-of-the-art process using conformal approach for regression problems, the statistical tests confirm the effectiveness of LW-JPS for probabilistic prediction.

Table 10 The p values of Wilcoxon tests

Full size table

For training speed, all of the algorithms are computationally efficient versions of CPSs and the mean values of the training times of SCPS–SVR, SCPS–RF, CCPS–SVR, CCPS–RF, OOB–CPS–RF and LW-JPS on 20 data sets are 0.293 s, 8.704 s, 1.940 s, 59.336 s, 15.393 s and 1.443 s, respectively, indicating that the LW-JPS used in this paper is also computationally efficient.

In summary, the experimental results in this section not only verifies the empirical validity of LW-JPS, but also shows its better performance than the other comparison algorithms, which indicates the effectiveness and efficiency of LW-JPS for probabilistic prediction.

Conclusion

This paper proposes a predictive system based on the idea of jackknife prediction, which is inspired by the leave-one-out cross-conformal predictive system. The proposed LW-JPS can transform any regression algorithm for point prediction to probabilistic prediction, which can describe the uncertainty of test labels. The asymptotic validity of LW-JPS is proved with some regularity assumptions and conditions. Based on the analysis, the empirical testing of LW-JPS with randomized kernel ridge regression and $k$-nearest neighbor regression was conducted. The empirical validity of LW-JPS was demonstrated in the experiments and its performance for probabilistic prediction compared favourably with the other comparison algorithms, which demonstrates the effectiveness and efficiency of LW-JPS for probabilistic prediction.

Although our method is empirically valid and shows better performance when compared with other comparison CPSs, we only employ two representative regression algorithms satisfying the related conditions in this paper. Therefore, future work about empirical studies with a wider range of regression algorithms needs to be done. Moreover, the approach of LW-JPS we proposed in this paper cannot be built on deep learning models efficiently for complex learning problems, such as image segmentation or image-to-image regression problems, since in those cases, there are no efficient ways to compute leave-one-out predictions on training data. Thus, future work about approximately computing leave-one-out predictions for deep neural networks is worth exploring, in order to make the jackknife prediction approach more tractable for complex problems.

Data availability

The data used during the current study are available from the corresponding author upon reasonable request.

References

MT Abdulkhaleq TA Rashid A Alsadoon 2022 Harmony search: current studies and uses on healthcare systems Artif Intell Med 131 102348 https://doi.org/10.1016/j.artmed.2022.102348
Article Google Scholar
MT Abdulkhaleq TA Rashid BA Hassan 2023 Fitness dependent optimizer with neural networks for COVID-19 patients Comput Methods Programs Biomed Update 3 100090 https://doi.org/10.1016/j.cmpbup.2022.100090
Article Google Scholar
JM Abdullah TA Rashid BB Maaroof 2023 Multi-objective fitness-dependent optimizer algorithm Neural Comput Appl https://doi.org/10.1007/s00521-023-08332-3
Article Google Scholar
J Alcala-Fdez A Fernandez J Luengo 2011 KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework J Mult-Valued Log Soft Comput 17 2–3 255 287 https://doi.org/10.1016/j.jlap.2009.12.002
Article Google Scholar
Asuncion A, Newman D (2007) UCI machine learning repository Irvine. Retrieved from http://www.ics.uci.edu/~mlearn/MLRepository.html. Accessed 01 Jan 2013
V Balasubramanian S-S Ho V Vovk 2014 Conformal prediction for reliable machine learning: theory, adaptations and applications Morgan Kaufmann Publishers Inc. Newnes
MATH Google Scholar
Boström H, Linusson H, Löfström T et al (2016) Evaluation of a variance-based nonconformity measure for regression forests. Paper presented at the 5th International Symposium on Conformal and Probabilistic Prediction with Applications, COPA 2016, Madrid, Spain, 9653, pp 75–89
O Bousquet A Elisseeff 2002 Stability and generalization J Mach Learn Res 2 3 499 526 https://doi.org/10.1162/153244302760200704
Article MathSciNet MATH Google Scholar
F Cucker DX Zhou 2007 Learning theory: an approximation theory viewpoint Cambridge monographs on applied and computational mathematics Cambridge University Press
Google Scholar
J Derrac S García D Molina 2011 A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms Swarm Evol Comput 1 1 3 18 https://doi.org/10.1016/j.swevo.2011.02.002
Article Google Scholar
T Gneiting M Katzfuss 2014 Probabilistic forecasting Ann Rev Stat Appl 1 1 125 151 https://doi.org/10.1146/annurev-statistics-062713-085831
Article Google Scholar
G Huang H Zhou X Ding 2012 Extreme learning machine for regression and multiclass classification IEEE Trans Syst Man Cybernetics Part B (Cybernetics) 42 2 513 529 https://doi.org/10.1109/TSMCB.2011.2168604
Article Google Scholar
Y LeCun Y Bengio G Hinton 2015 Deep learning Nature 521 7553 436 444 https://doi.org/10.1038/nature14539
Article Google Scholar
J Lei M G’Sell A Rinaldo 2018 Distribution-free predictive inference for regression J Am Stat Assoc 113 523 1094 1111 https://doi.org/10.1080/01621459.2017.1307116
Article MathSciNet MATH Google Scholar
Z Li F Liu W Yang 2022 A survey of convolutional neural networks: analysis, applications, and prospects IEEE Trans Neural Netw Learn Syst 33 12 6999 7019 https://doi.org/10.1109/TNNLS.2021.3084827
Article MathSciNet Google Scholar
BB Maaroof TA Rashid JM Abdulla 2022 Current studies and applications of shuffled frog leaping algorithm: a review Arch Computat Methods Eng 29 5 3459 3474 https://doi.org/10.1007/s11831-021-09707-2
Article MathSciNet Google Scholar
A Mahmoodzadeh HR Nejati M Mohammadi 2022 Forecasting tunnel boring machine penetration rate using LSTM deep neural network optimized by grey wolf optimization algorithm Expert Syst Appl 209 118303 https://doi.org/10.1016/j.eswa.2022.118303
Article Google Scholar
Melluish T, Saunders C, Nouretdinov I et al (2001) Comparing the Bayes and typicalness frameworks. Paper presented at the 12th European Conference on Machine Learning, ECML 2001, Freiburg, Germany, 2167, pp 360–371
MR Mohebbian HR Marateb KA Wahid 2023 Semi-supervised active transfer learning for fetal ECG arrhythmia detection Comput Methods Programs Biomed Update 3 100096 https://doi.org/10.1016/j.cmpbup.2023.100096
Article Google Scholar
H Papadopoulos 2008 Inductive conformal prediction: theory and application to neural networks P Fritzsche Eds Tools in artificial intelligence IntechOpen London https://doi.org/10.5772/6078
Chapter Google Scholar
Rahimi A, Recht B (2007) Random features for large-scale kernel machines. Paper presented at the Advances in Neural Information Processing Systems 20 (NIPS 2007), Vancouver, British Columbia, Canada, 3 (4): 1177–1184
Rasmussen CE, Neal RM, Hinton G et al. Delve data for evaluating learning in valid experiments, 1995–1996. Retrieved from: https://www.cs.toronto.edu/~delve/. Accessed 01 Mar 2003
A Sayeed Y Choi J Jung 2023 A deep convolutional neural network model for improving WRF simulations IEEE Trans Neural Netw Learn Syst 34 2 750 760 https://doi.org/10.1109/TNNLS.2021.3100902
Article Google Scholar
J Schmidhuber 2015 Deep learning in neural networks: an overview Neural Netw 61 85 117 https://doi.org/10.1016/j.neunet.2014.09.003
Article Google Scholar
T Schweder NL Hjort 2016 Confidence, likelihood, probability 41 Cambridge University Press Cambridge
Book MATH Google Scholar
J Shao 2003 Mathematical statistics Springer Science and Business Media New York
Book MATH Google Scholar
J Shen RY Liu M-g Xie 2018 Prediction with confidence—a general framework for predictive inference J Statist Plan Inference 195 126 140 https://doi.org/10.1016/j.jspi.2017.09.012
Article MathSciNet MATH Google Scholar
Steinberger L, Leeb H (2016) Leave-one-out prediction intervals in linear regression models with many variables. arXiv e-prints, arXiv:1602.05801. https://doi.org/10.48550/arXiv.1602.05801
AW Vaart van der 2000 Asymptotic statistics 3 Cambridge University Press Cambridge
Google Scholar
V Vovk 2015 Cross-conformal predictors Ann Math Artif Intell 74 1 9 28 https://doi.org/10.1007/s10472-013-9368-4
Article MathSciNet MATH Google Scholar
Vovk V (2019) Universally consistent conformal predictive distributions. Paper presented at the Proceedings of the Eighth Symposium on Conformal and Probabilistic Prediction and Applications, 105, pp 105-122
V Vovk A Gammerman G Shafer 2005 Algorithmic learning in a random world Springer Science and Business Media New York
MATH Google Scholar
Vovk V, Nouretdinov I, Manokhin V et al (2018) Conformal Predictive Distributions with Kernels. Paper presented at the International Conference Commemorating the 40th Anniversary of Emmanuil Braverman's Decease, Boston, MA, USA, 11100, pp 103–121
Vovk V, Nouretdinov I, Manokhin V et al (2018) Cross-conformal predictive distributions. Paper presented at the Proceedings of the Seventh Workshop on Conformal and Probabilistic Prediction and Applications, 91, pp 37–51
Vovk V, Petej I, Toccaceli P et al (2020) Conformal calibrators. Paper presented at the Proceedings of the Ninth Symposium on Conformal and Probabilistic Prediction and Applications, Proceedings of Machine Learning Research, 128, pp 84–99
D Wang P Wang J Shi 2018 A fast and efficient conformal regressor with regularized extreme learning machine Neurocomputing 304 1 11 https://doi.org/10.1016/j.neucom.2018.04.012
Article Google Scholar
D Wang P Wang Y Yuan 2020 A fast conformal predictive system with regularized extreme learning machine Neural Netw 126 347 361 https://doi.org/10.1016/j.neunet.2020.03.022
Article Google Scholar
D Wang P Wang S Zhuang 2020 Asymptotic analysis of locally weighted jackknife prediction Neurocomputing 417 10 22 https://doi.org/10.1016/j.neucom.2020.07.074
Article Google Scholar
L Wasserman 2006 All of nonparametric statistics Springer Science and Business Media New York
MATH Google Scholar
Werner H, Carlsson L, Ahlberg E et al (2020) Evaluating different approaches to calibrating conformal predictive systems. Paper presented at the Proceedings of the Ninth Symposium on Conformal and Probabilistic Prediction and Applications, 128, pp 134–150

Download references

Acknowledgements

The authors would like to thank the anonymous editor and reviewers for their valuable comments and suggestions which improved this work.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62106169 and 61972282.

Author information

Authors and Affiliations

School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, People’s Republic of China
Di Wang, Ping Wang & Cong Wang
Joint Laboratory of Intelligent Identification and Nowcasting Service for Convective System, CMA Public Meteorological Service Center, Beijing, 100081, People’s Republic of China
Di Wang, Ping Wang & Cong Wang
Qingdao Academy of Chinese Medical Science, Shandong University of Traditional Chinese Medicine, Qingdao, 266112, Shandong, People’s Republic of China
Pingping Wang
College of Management and Economics, Tianjin University, Tianjin, 300072, People’s Republic of China
Zhen He & Wei Zhang

Authors

Di Wang
View author publications
You can also search for this author in PubMed Google Scholar
Ping Wang
View author publications
You can also search for this author in PubMed Google Scholar
Pingping Wang
View author publications
You can also search for this author in PubMed Google Scholar
Cong Wang
View author publications
You can also search for this author in PubMed Google Scholar
Zhen He
View author publications
You can also search for this author in PubMed Google Scholar
Wei Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wei Zhang.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Wang, D., Wang, P., Wang, P. et al. Probabilistic prediction with locally weighted jackknife predictive system. Complex Intell. Syst. 9, 5761–5778 (2023). https://doi.org/10.1007/s40747-023-01044-0

Download citation

Received: 13 September 2022
Accepted: 09 March 2023
Published: 05 April 2023
Issue Date: October 2023
DOI: https://doi.org/10.1007/s40747-023-01044-0

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Probabilistic prediction with locally weighted jackknife predictive system

Abstract

Similar content being viewed by others

Valid prediction intervals for regression problems

Nonparametric predictive distributions based on conformal prediction

Prediction intervals in supervised learning for model evaluation and discrimination

Introduction

Conformal predictive systems and locally weighted jackknife predictive system

Predictive system and randomized predictive system

Definition 1

Definition 2

Split conformal predictive system

Cross-conformal predictive system

Locally weighted jackknife predictive system

Asymptotic analysis of locally weighted jackknife predictive system

Definitions, assumptions and conditions

Assumption 1

Asymptotic validity of LW-JPS

Lemma 1

Lemma 2

P r oof

Theorem 1

Proof

Theorem 2

Proof

Experiments

Test the validity of LW-JPS

Comparison with the other CPSs

Conclusion

Data availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Ethical approval

Consent to participate

Consent for publication

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation