1 Introduction

One of my indelible memories of Peter Schmidt was a conversation we had in my kitchen at a party for Midwest Econometrics Group participants in 1993 about the uneasy relationship between statistics and econometrics. “If a statistical tree falls in the forest, but no econometrician sees it,” Peter said matter-of-factly, “then it never happened.” In 1939 Harold Hotelling, arguably one of the most eminent statisticians and econometricians of the twentieth century, witnessed such an event and wrote about it in Hotelling (1939). The paper inspired Hermann Weyl to write a highly influential paper, Weyl (1939), generalizing it. Hotelling’s idea has attracted a small coterie of admirers in statistics, but it is fair to say that it remains almost unknown in econometrics.

My quixotic aim in this paper is to rescue Hotelling’s idea from econometric obscurity. I will begin by describing a simple setting in which the idea can be employed to construct a confidence interval for a scalar parameter that enters awkwardly in a standard regression problem. Then, I will describe how it can be used to construct uniform confidence bands for nonparametric regression using penalty methods, and finally I will compare performance with confidence bands constructed with recently developed methods of conformal inference.

2 Hotelling’s regression problem

Consider the nonlinear regression model

$$\begin{aligned} Y_i=x_i^\top \alpha + \lambda _i(\tau )\beta + \varepsilon _i \end{aligned}$$

where \(\alpha , \beta , \tau \) are unknown parameters, \(\lambda _i(\cdot )\) are known functions and \(\varepsilon _i\sim {{\mathcal {N}}}(0, \sigma ^2)\). For the sake of concreteness, we might interpret \(\lambda _i (\tau )\) as a Box-Cox transformation of another covariate, say \((z_i^\tau -1)/\tau \). We would like to test \(H_0: \beta =0\). Under the null, the Box-Cox parameter \(\tau \) is not identified, so we need to consider strategies that properly account for this.

By the familiar Frisch and Waugh (1933) trickery, we can eliminate the \(\alpha \) effect. Redefining the notation and assuming for convenience that \(\sigma ^2 = 1\), we are left with the likelihood ratio statistic

$$\begin{aligned} L = \inf _\tau \sum (Y_i-{\hat{\beta _\tau }} \lambda _i(\tau ))^2/\sum Y_i^2 \end{aligned}$$

Now, denoting the n-vectors, \(Y = (Y_i)\), \(\lambda = (\lambda _i)\), and the Euclidean norm by \(\Vert \cdot \Vert \), \({\hat{\beta _\tau }} = Y^\top \lambda (\tau )/ \Vert \lambda (\tau ) \Vert ^2\) so we can rewrite,

$$\begin{aligned} L= & {} \inf _\tau \Vert Y \Vert ^{-2} ( \Vert Y \Vert ^2 - 2(Y^\top \lambda )^2/ \Vert \lambda \Vert ^2 + (Y^\top \lambda )^2/ \Vert \lambda \Vert ^2)\\= & {} 1 - \sup _\tau \left( \frac{Y^\top \lambda (\tau )}{ \Vert \lambda (\tau ) \Vert \Vert Y \Vert }\right) ^2\\\equiv & {} 1 - \sup _\tau (\gamma (\tau )^\top U )^2 \end{aligned}$$

Now \( U =Y/ \Vert Y \Vert \) is uniformly distributed on the sphere \(S^{n-1}\) and \(\gamma (\tau ) = \lambda (\tau )/ \Vert \lambda (\tau ) \Vert \) is a curve in \(S^{n-1}\). Thus, the test rejects when \(W=\sup _\tau \gamma (\tau )^\top U \) exceeds some value \(w=\cos \theta \), which is equivalent to

$$\begin{aligned} U \in \gamma ^\theta= & {} \{u\in S^{n-1}: \sup _t u^\top \gamma (t)\ge \cos \theta \}\\= & {} \{ u\in S^{n-1}: \text {d}(u, \gamma ) \le (2(1-w))^{1/2}\}. \end{aligned}$$

Note that the original definition of L is such that we reject for small values, so \(L<c\) implies we reject when \(\sup _\tau \gamma (\tau )^\top U > w = \cos \theta \) for some critical value of \(\theta \). This is illustrated in Fig. 1 of Johansen and Johnstone (1990), reproduced here as Fig. 1. They call this the “angular or geodesic radius \(\theta \) about \(\gamma \):”

$$\begin{aligned} d^2(u, \gamma )= & {} \sin ^2(\theta ) + (1-\cos (\theta ))^2\\= & {} 1- 2\cos \theta + \cos ^2\theta + \sin ^2\theta \\= & {} 2(1-\cos \theta ). \end{aligned}$$

So when the distance \(d(u,\gamma )\) is small, U falls inside the tube, and we reject. This may seem a bit counter-intuitive, but is nonetheless correct. There are probably many ways to make it sound more intuitive. Here is one possibility. Since it all boils down to a cosine, that is, the simple correlation between \(\lambda (\tau )\) and Y, we want to reject \(H_0: \beta =0\) if this correlation/cosine is too large, but the Y’s that make it too large are precisely those that fall inside the tube.

Fig. 1

Angular distance from \(\gamma (t)\) to \(u = (1,0)\)

So how do we compute the critical w, or equivalently the critical \(\theta \)? Since \(W>w \equiv \cos \theta \) is equivalent to U being in the tube, we need the volume of the tube. Let \(| \gamma |\) denote the length of the arc \(\gamma (\tau )\) on the sphere. This can be approximated by the finite difference formula,

$$\begin{aligned} | \gamma | = \int \Vert \dot{\gamma }(\tau ) \Vert \text {d} \tau \approx \sum _{i=2}^m \Vert (\gamma (\tau _i) - \gamma (\tau _{i-1}))\Vert , \end{aligned}$$

where the \(\tau \)’s are chosen on some relatively fine grid of m points. Note that in the finite difference approximation, the \(\tau _i - \tau _{i-1}\) that would normally appear in the denominator of the difference quotient inside the norm expression cancels with the contribution of the \(\text {d}\tau \).
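To make this concrete, here is how \(|\gamma |\) might be computed for a Box-Cox curve. This is only a sketch in Python: the covariate values are purely illustrative and the function names are my own.

```python
import math

def boxcox(z, tau):
    # Box-Cox transform (z^tau - 1)/tau, with the log limit near tau = 0
    return math.log(z) if abs(tau) < 1e-8 else (z ** tau - 1.0) / tau

def gamma_curve(zs, tau):
    # gamma(tau) = lambda(tau)/||lambda(tau)||, a point on the sphere S^{n-1}
    lam = [boxcox(z, tau) for z in zs]
    nrm = math.sqrt(sum(v * v for v in lam))
    return [v / nrm for v in lam]

def arc_length(zs, taus):
    # |gamma| ~ sum of chord lengths ||gamma(tau_i) - gamma(tau_{i-1})||
    pts = [gamma_curve(zs, t) for t in taus]
    return sum(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
               for p, q in zip(pts[1:], pts[:-1]))

z = [0.5, 1.3, 2.1, 0.9, 3.4, 1.7]              # hypothetical covariate values
taus = [(i - 500) / 1000 for i in range(1001)]  # fine grid on [-0.5, 0.5]
print(arc_length(z, taus))
```

Refining the grid changes the chord-sum only negligibly, since the chordal approximation error is of second order in the grid spacing.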

Theorem 1

If \(\gamma \) is a non-closed regular curve in \(S^{d-1}\) then for w near 1,

$$\begin{aligned} {{\mathcal {P}}}(W\ge w) = \frac{| \gamma |}{2\pi }(1-w^2)^{\frac{d-2}{2}} +\frac{1}{2} {{\mathcal {P}}} (B\left( \frac{1}{2}, \frac{d-1}{2}\right) \ge w^2) \end{aligned}$$
(1)

where \(B(1/2, (d-1)/2) \) is a beta random variable. If \(\gamma \) is closed, i.e., forms a closed loop without end points, then the second “cap” term is omitted.

Theorem 1 follows from a result of Hotelling (1939), as does the next theorem; we ignore pathological complications involving self-intersections of the curve \(\gamma \).
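Formula (1) is easy to compute. The following Python sketch (function names my own; the beta tail is evaluated with the \(x \rightarrow y^2\) substitution used later in the text, and a simple midpoint rule that is valid for \(d \ge 3\)) also includes a Monte Carlo sanity check: for a degenerate curve of length zero, (1) reduces to the probability that a single fixed direction exceeds w.

```python
import math, random

def beta_tail(d, w):
    # P(B(1/2,(d-1)/2) >= w^2), computed after the substitution x = y^2 as
    # (2/B(1/2,(d-1)/2)) * integral_w^1 (1-y^2)^((d-3)/2) dy   (smooth for d >= 3)
    B = math.gamma(0.5) * math.gamma((d - 1) / 2) / math.gamma(d / 2)
    m = 20000
    h = (1.0 - w) / m
    s = sum((1.0 - (w + (i + 0.5) * h) ** 2) ** ((d - 3) / 2) for i in range(m))
    return 2.0 * s * h / B

def tube_tail(length, d, w):
    # Hotelling's formula (1): P(W >= w) for a non-closed curve of the given
    # length on S^{d-1}; drop the second ("cap") term for a closed curve
    return (length / (2.0 * math.pi)) * (1.0 - w * w) ** ((d - 2) / 2) \
        + 0.5 * beta_tail(d, w)

# sanity check: with length 0 the formula reduces to P(u0' U >= w) for a
# single fixed direction u0, which is easy to simulate with U uniform on S^{d-1}
random.seed(1)
d, w, m = 5, 0.3, 100000
hits = 0
for _ in range(m):
    g = [random.gauss(0.0, 1.0) for _ in range(d)]
    r = math.sqrt(sum(x * x for x in g))
    hits += g[0] / r >= w
print(hits / m, tube_tail(0.0, d, w))  # the two should nearly agree
```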

Theorem 2

Let \(\gamma \) be a regular closed curve in \(S^{d-1}\) with length \(|\gamma |\), and define

$$\begin{aligned} \gamma ^\theta= & {} \{u \in S^{d-1}\big | \sup _t u^\top \gamma (t) \ge \cos \theta \}\\= & {} \{u\in S^{d-1} \big | \text {d}(u, \gamma ) \le (2(1-w))^{1/2}\} \end{aligned}$$

where \(w=\cos \theta \). If \(\theta \) is sufficiently small, then the volume of the tube \(V(\gamma ^\theta )\) is given by

$$\begin{aligned} V(\gamma ^\theta ) = |\gamma | \Omega _{d-2} \sin ^{d-2} \theta \end{aligned}$$
(2)

where \(\Omega _{d-2} = \pi ^{(d-2)/2}/\Gamma (d/2)\) is the volume of the unit ball in \({\mathbb {R}}^{d-2}\).

Heuristically, the formula is,

$$\begin{aligned} V(\gamma ^\theta ) = (\text {length of curve}) \cdot (\text {volume of unit ball}) \cdot \text {radius}^{d-2} \end{aligned}$$

Recall that the volume of the unit ball in dimension d is \(V=\pi ^{d/2}/\Gamma ((d+2)/2)\). When \(\theta \) is larger, or \(\gamma \) is twisty, the tube may intersect itself and the formula would need some refinement. Figure 2 is a crude attempt to depict a tube on the 2-sphere; those with enhanced geometric imagination may try to visualize a three-dimensional tube on the 3-sphere embedded in 4-space.

Fig. 2

A Hotelling tube on a 2-sphere

When the curve is not closed, then it needs “caps” on each end. These caps are given by

$$\begin{aligned} w_{d-2}\int _{\cos \theta }^1 (1-z^2)^{(d-3)/2}\text {d}z \end{aligned}$$

where \(w_{d-2} = 2\pi ^{(d-1)/2} /\Gamma ((d-1)/2)\) is the \((d-2)\)-volume of \(S^{d-2}\). Note that the volume of the sphere, \(V(S^{d-1}) = 2 \pi ^{d/2} /\Gamma (d/2)\), is not the same as the volume of the ball. Note also that \((1-z^2)^{1/2}\) is again the radius and integrating out the \(r^{d-3}\) yields a \(d-2\) dimensional volume. A useful reference for this sort of geometry is Kendall (1961).

How do we get from (2) to (1)? Recall that U is uniform on the \((d-1)\)-sphere, so we need to divide by the volume of that sphere to evaluate the probability of being in the tube; thus, for closed curves,

$$\begin{aligned} \frac{V(\text{ tube})}{V(\text{ sphere})}= & {} \frac{ | \gamma | \Omega _{d-2}\sin ^{d-2}\theta }{2\pi ^{d/2}/\Gamma (d/2)}\\= & {} \frac{ | \gamma | (\pi ^{(d-2)/2}/\Gamma (d/2))\sin ^{d-2}\theta }{2\pi (\pi ^{(d-2)/2}/\Gamma (d/2))}\\= & {} \frac{ | \gamma | }{2\pi }(1-w^2)^{(d-2)/2}. \end{aligned}$$

To include caps, we also need to divide by the volume of the sphere. Note that

$$\begin{aligned} \mathbb {P}(B_{1/2, \frac{d-1}{2}} \ge w^2)= & {} \int _{w^2}^1 \left[ x^{1/2-1}(1-x)^{\frac{d-1}{2}-1}/B(1/2, \frac{d-1}{2})\right] \text {d}x\\= & {} \int _{w^2}^1 \left[ x^{-1/2} (1-x)^{\frac{d-3}{2}}/B\right] \text {d}x. \end{aligned}$$

Changing variables \(x\rightarrow y^2\), so that the lower limit becomes \(y_0 = w\), we have

$$\begin{aligned}= & {} \int _{y_0}^1\left[ y^{-1}(1-y^2)^{\frac{d-3}{2}}/B\right] 2y \text {d}y\\= & {} 2\int _{y_0}^1 B^{-1}(1-y^2)^{\frac{d-3}{2}}\text {d}y. \end{aligned}$$

It remains to show that \(B^{-1} = w_{d-2}/\text{V(sphere) }\), which follows after a little simplification and recalling that \(\Gamma (1/2) = \sqrt{\pi }\).

To check how the Hotelling tube procedure performs in moderate sample sizes, Table 1 reports results of a small simulation experiment. Data are generated with iid \(x_i\) standard log-normal and

$$\begin{aligned} y_i = \beta _n (x_i^\tau - 1)/\tau + \epsilon _i, \quad \epsilon _i \sim \mathcal {N}(0,1). \end{aligned}$$

Three values of \(\tau \) are considered, \(\tau \in \{ -0.5, 0, 0.5 \}\), as are local alternatives \(\beta _n = \beta _0/\sqrt{n}\) with \(\beta _0 \in \{ 0, 1, 2\}\). The nominal level of the Hotelling test is taken to be 0.05, and 1000 replications of the experiment are made for each parametric setting. When \(\beta _0 = 0\), so the null is true, the test delivers quite accurate size for all of the sample sizes considered, and power is respectable when \(\beta _0\) deviates from zero.
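To convey the flavor of such a size calculation, here is a deliberately simplified Python sketch: it drops the \(x_i^\top \alpha \) nuisance component entirely (so no Frisch-Waugh projection is needed), discretizes \(\tau \) on a grid, and inverts the tube approximation (1) for a critical value of the statistic \(W = \sup _\tau \gamma (\tau )^\top U\). It is a toy version of the Table 1 design, not the exact experiment reported there.

```python
import math, random

def beta_tail(d, w):
    # P(B(1/2,(d-1)/2) >= w^2) via the substitution x = y^2 (midpoint rule, d >= 3)
    B = math.gamma(0.5) * math.gamma((d - 1) / 2) / math.gamma(d / 2)
    m = 5000
    h = (1.0 - w) / m
    s = sum((1.0 - (w + (i + 0.5) * h) ** 2) ** ((d - 3) / 2) for i in range(m))
    return 2.0 * s * h / B

def tube_tail(length, d, w):
    # Hotelling's approximation (1) to P(sup_tau gamma(tau)' U >= w)
    return (length / (2.0 * math.pi)) * (1.0 - w * w) ** ((d - 2) / 2) \
        + 0.5 * beta_tail(d, w)

random.seed(7)
n, alpha = 50, 0.05
z = [math.exp(random.gauss(0, 1)) for _ in range(n)]  # fixed log-normal design
taus = [(i - 25) / 50 for i in range(51)]             # grid on [-0.5, 0.5]

def gam(t):
    # normalized Box-Cox direction gamma(tau), a point on S^{n-1}
    lam = [math.log(zi) if abs(t) < 1e-8 else (zi ** t - 1.0) / t for zi in z]
    nrm = math.sqrt(sum(v * v for v in lam))
    return [v / nrm for v in lam]

curve = [gam(t) for t in taus]
length = sum(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
             for p, q in zip(curve[1:], curve[:-1]))

# invert the (monotone decreasing) tube formula for the level-alpha critical value
lo, hi = 0.01, 0.99
for _ in range(50):
    mid = 0.5 * (lo + hi)
    if tube_tail(length, n, mid) > alpha:
        lo = mid
    else:
        hi = mid
wcrit = 0.5 * (lo + hi)

# rejection frequency under the null (beta = 0, no nuisance covariates)
reps, rej = 500, 0
for _ in range(reps):
    y = [random.gauss(0, 1) for _ in range(n)]
    r = math.sqrt(sum(v * v for v in y))
    u = [v / r for v in y]
    if max(sum(gi * ui for gi, ui in zip(g, u)) for g in curve) >= wcrit:
        rej += 1
print(wcrit, rej / reps)  # rejection rate should be near the nominal 0.05
```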

3 Uniform confidence bands for nonparametric regression

Consider the series expansion model

$$\begin{aligned} Y_i=\sum _{j=1}^d \beta _j a_j(t_i) + \varepsilon _i \end{aligned}$$

with \(\varepsilon _i\sim {{\mathcal {N}}}(0, \sigma ^2)\) as before and \(t\in I \subset {\mathbb {R}}\). Our objective is to find a positive c such that

$$\begin{aligned} P_{\beta ,\sigma , \Sigma }(|\beta ^\top a(t) - {\hat{\beta }}^\top a(t)| \le c\sigma (a(t)^\top \Sigma a(t))^{1/2} \; \forall \; t\in I) \approx 1-\alpha , \end{aligned}$$

uniformly in \(\beta , \sigma \). Johansen and Johnstone (1990) write this as \( P_{\beta ,\sigma , \Sigma } (T < c\sigma ) \), where

$$\begin{aligned} T=\sup _{a\in C} \frac{a^\top ({\hat{\beta }} - \beta )}{\sqrt{a^\top \Sigma a}} \end{aligned}$$
Table 1 Rejection frequencies for the Hotelling likelihood ratio test for a simple Box-Cox example

Now consider \(X\sim {{\mathcal {N}}}(\xi , \Sigma )\), so X plays the role of \({\hat{\beta }}\) and \(\xi \) that of \(\beta \). We’d like to make a confidence statement about \(\{a^\top \xi \,|\, a\in C\}\), where C is some sort of “curve.” So now we write,

$$\begin{aligned} T=T(X, \xi ) = \sup _{a\in C} \frac{a^\top (X-\xi )}{\sqrt{a^\top \Sigma a}}. \end{aligned}$$

We want the distribution of T, so we can obtain the confidence set

$$\begin{aligned} R_x=\{ \{a^\top \xi \}_{a\in C} | T(X, \xi ) < c_{1-\varepsilon }\} \end{aligned}$$

where \(P_{\xi , \Sigma } (T< c_{1-\varepsilon }) = 1-\varepsilon .\) Write \(T=RW\) where,

$$\begin{aligned} R^2=(X-\xi )^\top \Sigma ^{-1}(X-\xi ) \sim \chi _d^2, \end{aligned}$$

and

$$\begin{aligned} W = \sup _{a\in C} \frac{a^\top (X-\xi )}{\sqrt{a^\top \Sigma a}\sqrt{(X-\xi )^\top \Sigma ^{-1}(X-\xi )}} = \sup _{a\in C} \frac{(\Sigma ^{1/2}a)^\top \Sigma ^{-1/2} (X-\xi )}{ \Vert \Sigma ^{1/2} a \Vert \, \Vert \Sigma ^{-1/2}(X-\xi ) \Vert }. \end{aligned}$$

Now to put things back into the earlier framework of \(\gamma \) and U we set,

$$\begin{aligned} \gamma (a)= & {} \frac{\Sigma ^{1/2}a}{ | \Sigma ^{1/2} a | }\\ U= & {} \Sigma ^{-1/2} (X-\xi )/ | \Sigma ^{-1/2} (X-\xi ) |. \end{aligned}$$

So as before, \(\gamma =\gamma (C) \subset S^{d-1}\), and U is uniform on \(S^{d-1}\). R and W do not depend on \(\xi \) and \(\Sigma \), except through \(\gamma \). Moreover, \(R^2\) is independent of W and \(R^2\sim \chi _d^2\), so,

$$\begin{aligned} \mathbb {P}(T>c)=\int _c^\infty \mathbb {P}(W>c/r) \mathbb {P}(R\in dr). \end{aligned}$$

The random variable W has the same form as in the simple example so,

$$\begin{aligned} \mathbb {P}(W>w) = \frac{ | \gamma | }{2\pi }(1-w^2)^{(d-2)/2} + \frac{1}{2}{{\mathcal {P}}}(B\ge w^2) \equiv b_\gamma (w). \end{aligned}$$

Naiman (1986) bounds this probability by,

$$\begin{aligned} \mathbb {P}(T>c) \le \int _c^\infty \min \{b_\gamma (c/r), 1\} {\mathcal P}(R\in \text {d}r), \end{aligned}$$

and Knowles (1987) suggests ignoring the \(b_\gamma <1\) constraint and integrating the bound to obtain,

$$\begin{aligned} \mathbb {P}(T>c) \le \frac{ | \gamma | }{2\pi } e^{-c^2/2} + 1-\Phi (c). \end{aligned}$$

This integration may appear somewhat miraculous, but it does work out provided that one handles the \({{\mathcal {P}}}(R\in \text {d}r)\) term carefully. Since \(R^2\sim \chi _d^2\), letting F denote the distribution function of \(\chi _d^2\), we have,

$$\begin{aligned} \mathbb {P}(R\le r) = \mathbb {P}(R^2\le r^2) = F(r^2) \end{aligned}$$

so the corresponding density of R is

$$\begin{aligned} f_{\scriptscriptstyle R}(r) = 2rF'(r^2) = 2rf_{{\scriptscriptstyle R}^2}(r^2). \end{aligned}$$
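This change of variables is easy to verify numerically. The sketch below (plain Python, illustrative function names) checks that the implied density of R integrates to one.

```python
import math

def chi2_pdf(x, d):
    # chi-square density with d degrees of freedom
    return x ** (d / 2 - 1) * math.exp(-x / 2) / (2 ** (d / 2) * math.gamma(d / 2))

def r_pdf(r, d):
    # density of R = sqrt(chi-square_d): f_R(r) = 2 r F'(r^2) = 2 r f_{R^2}(r^2)
    return 2.0 * r * chi2_pdf(r * r, d)

# midpoint-rule check that f_R integrates to one (d = 5, grid on (0, 12),
# beyond which the remaining mass is negligible)
d, m, upper = 5, 40000, 12.0
h = upper / m
total = sum(r_pdf((i + 0.5) * h, d) for i in range(m)) * h
print(total)  # should be very close to 1
```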

Once one has this bound then various other things fall into place. For example,

$$\begin{aligned} {{\mathcal {P}}}(|T|>c) \le 2\mathbb {P}(T>c). \end{aligned}$$

Johansen and Johnstone (1990) give further details on the accuracy of the bounds and applications.

4 Additive models for total variation penalized nonparametric quantile regression

In Koenker (2011), I have described a general approach to estimation and inference for additive nonparametric quantile regression models of the form,

$$\begin{aligned} Q_{{\scriptscriptstyle Y}_i|x_i, z_i} (\tau |x_i, z_i) = x_i^\top \theta _0 + \sum _{j=1}^J g_j (z_{ij}). \end{aligned}$$

The components \(g = (g_1, \cdots , g_J)\) can be univariate or bivariate. Their smoothness can be controlled by penalizing total variation of the functions themselves or their gradients. Estimation is carried out by solving the linear program,

$$\begin{aligned} \min _{(\theta _0, g)} \sum \rho _\tau (y_i-x_i^\top \theta _0 - \sum g_j(z_{ij})) + \lambda _0 \Vert \theta _0 \Vert _1 + \sum _{j=1}^{\scriptscriptstyle J} \lambda _j \bigvee (\nabla g_j) \end{aligned}$$
(3)

where \(\rho _\tau (u) = u (\tau - \mathbb {1}(u < 0))\) is the usual quantile objective function, \(\Vert \theta _0 \Vert _1 = \sum _{k=1}^{\scriptscriptstyle K} |\theta _{0k}|\), and \(\bigvee (\nabla g_j)\) denotes the total variation of the derivative or gradient of the function \(g_j\). Recall that for g with absolutely continuous derivative \(g'\) we can express the total variation of \(g':{\mathbb {R}} \rightarrow {\mathbb {R}}\) as

$$\begin{aligned} \bigvee (g'(z)) = \int |g''(z) |\text {d}z \end{aligned}$$

while for \(g:{\mathbb {R}}^2 \rightarrow {\mathbb {R}}\) with absolutely continuous gradient,

$$\begin{aligned} \bigvee (\nabla g) = \int \Vert \nabla ^2g(z)\Vert \text {d}z \end{aligned}$$

where \(\nabla ^2g(z)\) denotes the Hessian of g and \(\Vert \cdot \Vert \) denotes the Hilbert–Schmidt norm for matrices. In contrast, total variation penalization of the component functions themselves yields piecewise constant solutions.
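On a discrete grid these penalties reduce to \(\ell _1\) norms of difference quotients, which is what makes the linear programming formulation possible. A small illustrative sketch (function names my own):

```python
def tv_of_derivative(t, g):
    # discrete version of V(g'): total variation of the slopes of the
    # piecewise linear interpolant, i.e. the sum of absolute kink sizes
    slopes = [(g[i + 1] - g[i]) / (t[i + 1] - t[i]) for i in range(len(t) - 1)]
    return sum(abs(slopes[i + 1] - slopes[i]) for i in range(len(slopes) - 1))

def tv(g):
    # total variation of g itself; penalizing this yields piecewise constant fits
    return sum(abs(g[i + 1] - g[i]) for i in range(len(g) - 1))

t = [0.0, 1.0, 2.0, 3.0]
hinge = [0.0, 0.0, 1.0, 2.0]       # one kink at t = 1: slope jumps from 0 to 1
print(tv_of_derivative(t, hinge))  # 1.0 -- only the kink is penalized
print(tv(hinge))                   # 2.0 -- the total rise of the function
```

Penalizing `tv_of_derivative` charges nothing for linear segments, which is why the solutions are piecewise linear; penalizing `tv` charges for any movement at all, which is why those solutions are piecewise constant.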

Adapting the Hotelling tube idea to construct uniform confidence bands for these components is also described in Koenker (2011), as is selection of the smoothing parameters \(\lambda _j, j = 0, 1, \ldots , J\). It should be stressed that all of this machinery relies on the validity of Gaussian approximations for the fitted parameters and estimated functions, and is conditional on the selected tuning parameters. This is in accord with a large strand of earlier literature, including Wahba (1983), Nychka (1983), and Krivobokova et al. (2010); however, there are inevitable questions that can be raised about both aspects. To explore this, we consider in the next section some recent proposals for strengthening coverage guarantees based on conformal inference.

5 Conformal quantile regression

Conformal prediction, and conformal inference more generally, has grown out of work by Vladimir Vovk and colleagues; see, e.g., Shafer and Vovk (2008) for an overview. It has emerged as an essential tool for uncertainty quantification throughout statistics and machine learning. A central feature of the conformal approach in regression is a sample splitting device that allows one to adjust a confidence band constructed with training data based on its performance on a validation sample. Strong finite sample performance guarantees can be proven under seemingly rather weak exchangeability assumptions. In regression settings, early work presumed conventional iid error structure when constructing the initial bands from the training data; however, Romano et al. (2019) noted that in more heterogeneous settings narrower bands could be constructed using quantile regression methods. This approach has been further developed in Lei and Candès (2022). In high-dimensional regression, this typically would involve some form of random forest or neural network model for the initial bands, but the same methods can be used in simpler models like the additive models described above.

Construction of conformal prediction bands for additive quantile regression models can be described briefly as follows:

figure a

Note that the conformal adjustment of the initial band can make it wider or narrower. When \(Q<0\), the validation sample fell well inside the initial band, indicating that it is safe to shrink its width.
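A minimal sketch of this split-conformal recipe follows, with the training sample's marginal quantiles standing in for a fitted additive quantile regression band (a covariate-free simplification). All names here are illustrative; the conformity scores and the adjustment Q follow the split-conformal construction of Romano et al. (2019).

```python
import math, random

random.seed(3)
alpha = 0.1
train = [random.gauss(0, 1) for _ in range(500)]
calib = [random.gauss(0, 1) for _ in range(500)]
newdata = [random.gauss(0, 1) for _ in range(2000)]

def empirical_quantile(v, p):
    # order-statistic quantile: the ceil(p * n)-th smallest value
    s = sorted(v)
    k = min(len(s) - 1, max(0, math.ceil(p * len(s)) - 1))
    return s[k]

# initial band from the training half (a crude, covariate-free stand-in
# for the fitted additive quantile regression band)
q_lo = empirical_quantile(train, alpha / 2)
q_hi = empirical_quantile(train, 1 - alpha / 2)

# conformity scores on the validation half: positive when y falls outside
scores = [max(q_lo - y, y - q_hi) for y in calib]
n = len(scores)
Q = empirical_quantile(scores, (1 - alpha) * (n + 1) / n)

lo, hi = q_lo - Q, q_hi + Q  # Q < 0 narrows the band, Q > 0 widens it
coverage = sum(lo <= y <= hi for y in newdata) / len(newdata)
print(coverage)  # should be close to, and roughly at least, 1 - alpha
```

The adjusted band inherits the finite sample marginal coverage guarantee however crude the initial band is; Q simply compensates for the stand-in band's deficiencies.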

There are several potential difficulties with the foregoing recipe.

  • Predictions based on the training sample typically are not equipped to extrapolate beyond the empirical support of the training data, so if the validation data, or new data requiring a conformal interval, lie outside that support, some accommodation must be made.

  • Performance guarantees are based on marginal coverage of the band, so in certain regions of the design space there may be failures of coverage that are compensated by satisfactory coverage elsewhere. As shown by Foygel Barber et al. (2020), conditional coverage is not achievable in any generality.

  • All of the familiar challenges of penalty methods for regression smoothing persist, so choice of smoothing parameters, in particular, can cause headaches, even though poor \(\lambda \) selection can in principle be ameliorated by the conformal adjustment.

Fig. 3

Example 1 of Romano, Patterson and Candès: As described in the text, the response is concentrated in bands determined by a Poisson component, with some quite extreme outliers that are (mostly) invisible in this plot. The Poisson rate is periodic, accounting for the obvious heteroscedasticity. The red curves depict the predicted 0.05 and 0.95 conditional quantile estimates based on the training data, using penalization of the derivative of the fitted function, while the blue curves depict the conformally modified estimates. In this example, the conformity scores \(E_i\) are quite small and the conformal modification is almost negligible

Fig. 4

Example 1 of Romano, Patterson and Candès: In contrast to the earlier piecewise linear fit obtained by total variation penalization of the first derivative of g, in this figure the total variation of the fitted function itself is penalized, resulting in a piecewise constant fit. Clearly this penalty is better suited to the example and mimics quite well the fit depicted in that paper. Again, the conformal adjustment is only barely visible

We conclude this section by illustrating the use of the conformal method in an artificial data example, Example 1 of Romano et al. (2019), in which the response combines a Poisson component whose rate is periodic in the covariate with occasional extreme outliers.

There are 7000 observations plotted in grey. The Poisson contribution to the response produces a banded structure in the scatterplot with pronounced heteroscedasticity. There are a small number of extreme outliers, many of which lie outside the frame of the figure; such outliers are harmless, since we are estimating conditional quantile functions. Penalizing total variation of \(g'\) yields a piecewise linear fit that does not fit the scatter as well as the piecewise constant estimate obtained by penalizing the total variation of g itself. It is striking here that the conformal adjustment in both figures is almost imperceptible. Thus, if interest focuses on prediction intervals for the response, the initial estimates provided by the penalized quantile regression estimates are fine, even though they are based on only half the original sample.

Fig. 5

Pointwise and uniform confidence bands for RPC Example: In contrast to the conformal prediction band, pointwise and uniform bands for the 0.05 and 0.95 conditional quantile functions are considerably wider. The uniform band is based on the Hotelling tube construction described in Koenker (2011) and is depicted as the light grey shaded band enclosing the darker grey pointwise band

Prediction bands for Y are fine as far as they go, but what if we wanted confidence bands for the conditional quantile functions themselves? Some might argue, e.g., Geisser (1993) and Clarke and Clarke (2018), that it is pointless to predict quantities that can never be observed, but I subscribe to the principle: every decent estimate deserves a standard error. Figure 5 illustrates confidence bands for the lower (\(\tau = 0.05\)) and upper (\(\tau = 0.95\)) conditional quantile functions as estimated using penalization of \(g'\). The dark grey bands are the pointwise bands, while the lighter grey bands are those based on the Hotelling tube approach. Note that the bands for the \(\tau = 0.05\) estimate are extremely narrow since the data are very concentrated in this region, so that conditional quantile is very precisely estimated.

6 Discussion

The large literature in econometrics on stochastic frontier models is mostly concerned with parametric models of the tail behavior of the response “near the production frontier.” Nonparametric quantile regression offers yet another perspective on estimating such models. It would be extremely foolish to make any claims for the alternative methodology described here on the basis of the flimsy evidence offered, so let me conclude simply by saying that it might be worthy of further consideration.