Hotelling tubes, confidence bands and conformal inference

Stochastic frontier models and methods as pioneered by Peter Schmidt in Aigner et al. (J Econom 6:21–37, 1977), Horrace and Schmidt (J Product Anal 7:257–282, 1996), and Amsler et al. (J Econom 190:280–288, 2016) constitute a rare departure from the usual econometric obsession with models for conditional means. They also provided an early stimulus for the development of quantile regression methods. After a brief tutorial on Hotelling tube methods for constructing confidence bands for nonparametric quantile regression, strengthened performance guarantees for such bands are described based on recent developments in conformal inference. These methods may be considered a rather idiosyncratic new approach to nonparametric inference for stochastic frontier models.


Introduction
One of my indelible memories of Peter Schmidt was a conversation we had in my kitchen at a party for Midwest Econometrics Group participants in 1993 about the uneasy relationship between statistics and econometrics. "If a statistical tree falls in the forest, but no econometrician sees it," Peter said matter-of-factly, "then it never happened." In 1939 Harold Hotelling, arguably one of the most eminent statisticians and econometricians of the twentieth century, witnessed such an event and wrote about it in Hotelling (1939). The paper inspired Hermann Weyl to write a highly influential paper, Weyl (1939), generalizing it. Hotelling's idea has attracted a small coterie of admirers in statistics, but it is fair to say that it remains almost unknown in econometrics.
My quixotic aim in this paper is to rescue Hotelling's idea from econometric obscurity. I will begin by describing a simple setting in which the idea can be employed to construct a confidence interval for a scalar parameter that enters awkwardly in a standard regression problem. Then, I will describe how it can be used to construct uniform confidence bands for nonparametric regression using penalty methods, and finally I will compare performance with confidence bands constructed with recently developed methods of conformal inference.

Hotelling's regression problem
Consider the nonlinear regression model

Y_i = α + β λ_i(τ) + ε_i,  i = 1, …, n,

where α, β, τ are unknown parameters, the λ_i(·) are known functions, and ε_i ∼ N(0, σ²). For the sake of concreteness, we might interpret λ_i(τ) as a Box-Cox transformation of another covariate, say (z_i^τ − 1)/τ. We would like to test H_0 : β = 0. Under the null, the Box-Cox parameter τ is not identified, so we need to consider strategies that properly account for this.¹ By the familiar Frisch and Waugh (1933) trickery, we can eliminate the α effect.² Redefining the notation and assuming for convenience that σ² = 1, we are left with the likelihood ratio statistic. Denoting the n-vectors Y = (Y_i) and λ(τ) = (λ_i(τ)), and the Euclidean norm by ‖·‖, we have β̂_τ = Y'λ(τ)/‖λ(τ)‖², so we can rewrite the statistic as

W = sup_τ γ(τ)'U,

where γ(τ) = λ(τ)/‖λ(τ)‖ and U = Y/‖Y‖. Thus, the test rejects when W = sup_τ γ(τ)'U exceeds some value w = cos θ, which is equivalent to the geodesic distance d(U, γ) = inf_τ arccos(γ(τ)'U) falling below θ, that is, to U falling inside the tube of radius θ about the curve γ.

¹ There is of course a large literature on such problems, notably Davies (1977, 1987), Andrews and Ploberger (1994), and Hansen (1996), none of whom mention Hotelling. An exception that justifies the qualified "almost unknown" above is Kim et al. (1998). I do not claim that the Hotelling approach is "best" in any sense, only that it is worthy of further consideration. To this end, software to compute the confidence bands described below is available in the R package quantreg, Koenker (1999), for a general class of total variation penalized, additive, nonparametric quantile regression models.

² Hotelling obviously knew all about how to do this, and one doubts that he learned it from Frisch, but this would probably be hard to establish.
So when the distance d(U, γ) is small, U falls inside the tube, and we reject. This may seem a bit counter-intuitive, but is nonetheless correct. There are probably many ways to make it sound more intuitive. Here is one possibility: since it all boils down to a cosine, that is, the simple correlation between λ(τ) and Y, we want to reject H_0 : β = 0 if this correlation/cosine is too large, and the Y's that make it too large are the Y's that fall inside the tube.

So how do we compute the critical w, or equivalently the critical θ? Since W > w ≡ cos θ is equivalent to U being in the tube, we need the volume of the tube. Let |γ| denote the length of the arc γ(τ) on the sphere. This can be approximated by the finite difference formula

|γ| ≈ Σ_{i=2}^m ‖γ(τ_i) − γ(τ_{i−1})‖,

where the τ's are chosen on some relatively fine grid of m points. Note that in the finite difference approximation the τ_i − τ_{i−1} that would normally appear in the denominator of the difference quotient is cancelled by the same factor in the Riemann sum approximating |γ| = ∫ ‖γ′(τ)‖ dτ.

Theorem 1 If γ is a non-closed regular curve in S^{d−1}, then for w near 1,

P(W ≥ w) ≈ (|γ|/(2π)) (1 − w²)^{(d−2)/2} + (1/2) P(B ≥ w²),   (1)

where B ∼ B(1/2, (d−1)/2) is a beta random variable. If γ is closed, i.e., forms a closed loop without end points, then the second "cap" term is omitted.
We ignore pathological complications involving self-intersections of the curve γ. Theorem 1 follows from a result of Hotelling (1939), as does the next theorem.
Theorem 2 Let γ be a regular closed curve in S^{d−1} with length |γ|, and let T_θ(γ) denote the tube of geodesic radius θ = arccos(w) about γ. Then

V(T_θ(γ)) = |γ| V_{d−2} (1 − w²)^{(d−2)/2}.   (2)

Heuristically, the formula is the length of the curve times the volume of the tube's cross-section, a (d−2)-dimensional ball of radius sin θ = (1 − w²)^{1/2}. Recall that the volume of the unit ball in dimension d is V_d = π^{d/2}/Γ((d+2)/2). When θ is larger, or γ is twisty, then the tube may intersect itself and the formula would need some refinement. Figure 2 is a crude attempt to depict a tube on the 2-sphere; those with enhanced geometric imagination may try to visualize a three-dimensional tube on the 3-sphere embedded in 4-space.
When the curve is not closed, it needs "caps" on each end. Each cap is half of a spherical cap on S^{d−1}; the surface volume of the full cap {u ∈ S^{d−1} : u'p ≥ w} is

ω_{d−2} ∫_w^1 (1 − z²)^{(d−3)/2} dz,

where ω_{d−2} denotes the surface area of the unit sphere S^{d−2}. This is not the same as the volume of the ball. Note also that (1 − z²)^{1/2} is again the radius, here of the cross-section of the cap at height z, and integrating out the r^{d−3} yields a (d−2)-dimensional volume. A useful reference for this sort of geometry is Kendall (1961).
How do we get from (2) to (1)? Recall that U is uniform on the (d−1)-sphere, so we need to divide by the surface area of that sphere, ω_{d−1} = 2π^{d/2}/Γ(d/2), to evaluate the probability of being in the tube, so for closed curves,

P(W ≥ w) ≈ |γ| V_{d−2} (1 − w²)^{(d−2)/2} / ω_{d−1} = (|γ|/(2π)) (1 − w²)^{(d−2)/2}.

To include caps, we also need to divide by the surface area of the sphere. Note that

P(B ≥ w²) = B(1/2, (d−1)/2)^{−1} ∫_{w²}^1 x^{−1/2} (1 − x)^{(d−3)/2} dx.

Changing variables x → y², we have

P(B ≥ w²) = 2 B(1/2, (d−1)/2)^{−1} ∫_w^1 (1 − y²)^{(d−3)/2} dy.

It remains to show that B(1/2, (d−1)/2)^{−1} = ω_{d−2}/ω_{d−1}, which follows after a little simplification and recalling that Γ(1/2) = √π.
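As a numerical sanity check on the tube formula, not taken from the original paper, consider the closed great circle γ(t) = (cos t, sin t, 0, …, 0) in S^{d−1}. Here |γ| = 2π and W = sup_t γ(t)'U = (U_1² + U_2²)^{1/2}, so the closed-curve formula can be compared directly with simulation:

```python
import numpy as np

rng = np.random.default_rng(3)

d, w = 10, 0.6
# Closed great circle gamma(t) = (cos t, sin t, 0, ..., 0): |gamma| = 2*pi,
# so the tube formula gives P(W >= w) = (|gamma|/(2*pi)) * (1 - w^2)^((d-2)/2).
p_formula = (1 - w**2) ** ((d - 2) / 2)

# Monte Carlo: for U uniform on S^(d-1), W = sup_t <gamma(t), U> = sqrt(U1^2 + U2^2)
nmc = 200_000
U = rng.standard_normal((nmc, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
W = np.sqrt(U[:, 0] ** 2 + U[:, 1] ** 2)
p_mc = (W >= w).mean()

print(p_formula, p_mc)  # the two should agree closely
```

For a great circle the agreement is essentially exact, since U_1² + U_2² has a Beta(1, (d−2)/2) distribution whose survival function at w² is precisely (1 − w²)^{(d−2)/2}.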
To check how the Hotelling tube procedure performs in moderate sample sizes, Table 1 reports results of a small simulation experiment. Data are generated from the Box-Cox model above with iid x_i standard log-normal. The nominal level of the Hotelling test is taken to be 0.05, and 1000 replications of the experiment are made for each parametric setting. When β = 0, so the null is true, the test delivers quite accurate size for all of the sample sizes considered, and power is respectable when β deviates from zero.
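The flavor of such an experiment can be reproduced with a short script. This is a sketch, not the Table 1 design: the sample size, the grid of τ values, and the omission of the intercept (and hence of the Frisch-Waugh step) are simplifying assumptions here.

```python
import numpy as np
from scipy.stats import beta
from scipy.optimize import brentq

rng = np.random.default_rng(1)

def boxcox(z, tau):
    # Box-Cox transformation (z^tau - 1)/tau for tau != 0
    return (z**tau - 1) / tau

def tube_critical_value(gamma_curve, level=0.05):
    """Solve (|gamma|/(2*pi))*(1-w^2)^((d-2)/2) + 0.5*P(B >= w^2) = level
    for w, where B ~ Beta(1/2, (d-1)/2) (the non-closed-curve formula)."""
    d = gamma_curve.shape[1]
    # arc length by finite differences; the grid spacing cancels
    arclen = np.sum(np.linalg.norm(np.diff(gamma_curve, axis=0), axis=1))
    def excess(w):
        tube = arclen / (2 * np.pi) * (1 - w**2) ** ((d - 2) / 2)
        cap = 0.5 * beta.sf(w**2, 0.5, (d - 1) / 2)
        return tube + cap - level
    return brentq(excess, 0.1, 1 - 1e-10)

n, nrep = 50, 1000
taus = np.linspace(0.1, 2.0, 100)
z = rng.lognormal(size=n)                       # fixed design across replications
lam = np.column_stack([boxcox(z, t) for t in taus]).T   # m x n matrix of lambda(tau)
gamma_curve = lam / np.linalg.norm(lam, axis=1, keepdims=True)
w = tube_critical_value(gamma_curve)

rej = 0
for _ in range(nrep):
    y = rng.standard_normal(n)                  # beta = 0, so the null is true
    U = y / np.linalg.norm(y)
    W = np.max(gamma_curve @ U)
    rej += W > w
print(rej / nrep)  # empirical size; the tube approximation should put this near 0.05
```

Power against β ≠ 0 can be examined by adding β·λ(x_i, τ₀) to the simulated response for some fixed τ₀.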

Uniform confidence bands for nonparametric regression
Consider the series expansion model

y_i = g(t_i) + ε_i,  g(t) = Σ_{j=1}^d β_j φ_j(t),

with ε_i ∼ N(0, σ²) as before and t ∈ I ⊂ R. Our objective is to find a positive c such that

P_{β,σ}(|φ(t)'(β̂ − β)| ≤ c σ ‖φ(t)‖ for all t ∈ I) ≥ 1 − α,

uniformly in β, σ. Johansen and Johnstone (1990) write this as P_{β,σ}(T < cσ), where

T = sup_{t ∈ I} |φ(t)'(β̂ − β)| / ‖φ(t)‖.

Now consider X ∼ N(ξ, I), so X plays the role of β̂ and ξ of β. We would like to make a confidence statement about {a'ξ | a ∈ C}, where C is some sort of "curve." So now we write

T = sup_{a ∈ C} a'(X − ξ)/‖a‖.

We want the distribution of T, so we can obtain the confidence set

{ξ : a'(X − ξ) ≤ c ‖a‖ for all a ∈ C}.

Now to put things back into the earlier framework of γ and U we set

γ = {a/‖a‖ : a ∈ C},  U = (X − ξ)/‖X − ξ‖,  R = ‖X − ξ‖,  W = sup_{γ∈γ(C)} γ'U,

so that T = RW. So as before, γ = γ(C) ⊂ S^{d−1}, and U is uniform on S^{d−1}. R and W do not depend on ξ, or rather they do, but only via γ. R² is independent of W and R² ∼ χ²_d, so

P(T ≥ c) = ∫ P(W ≥ c/r) P(R ∈ dr).

The random variable W has the same form as in the simple example, so its tail probability is given by the tube formula (1). Naiman (1986) bounds this probability by

P(W ≥ w) ≤ min{(|γ|/(2π))(1 − w²)^{(d−2)/2} + (1/2) P(B ≥ w²), 1},

and Knowles (1987) suggests ignoring the constraint that the bound not exceed one and then integrates the bound to obtain

P(T ≥ c) ≤ (|γ|/(2π)) e^{−c²/2} + 1 − Φ(c),

where Φ denotes the standard normal distribution function. This integration may appear somewhat miraculous, but does actually work out provided that one carefully observes the P(R ∈ dr) term. Since R² ∼ χ²_d, letting F denote the distribution function of χ²_d, we have P(R ≤ r) = F(r²), so the corresponding density of R is f_R(r) = 2r F′(r²). Once one has this bound, various other things fall into place. Johansen and Johnstone (1990) give further details on the accuracy of the bounds and applications.

Additive models for total variation penalized nonparametric quantile regression
In Koenker (2011), I have described a general approach to estimation and inference for additive nonparametric quantile regression models of the form

Q_{Y_i | x_i, z_i}(τ) = x_i'θ_0 + Σ_{j=1}^J g_j(z_{ij}).

The components g = (g_1, …, g_J) can be univariate or bivariate. Their smoothness can be controlled by penalizing the total variation of the functions themselves or of their gradients. Estimation is carried out by solving the linear program

min_{θ_0, g} Σ_{i=1}^n ρ_τ(y_i − x_i'θ_0 − Σ_j g_j(z_{ij})) + λ_0 ‖θ_0‖_1 + Σ_{j=1}^J λ_j ⋁(∇g_j),

where ρ_τ(u) = u(τ − 1(u < 0)) is the usual quantile objective function, ‖θ_0‖_1 = Σ_{k=1}^K |θ_{0k}|, and ⋁(∇g_j) denotes the total variation of the derivative or gradient of the function g_j. Recall that for g with absolutely continuous derivative g′ we can express the total variation of g′ : R → R as

⋁(g′) = ∫ |g″(z)| dz,

while for g : R² → R with absolutely continuous gradient,

⋁(∇g) = ∫ ‖∇²g(z)‖ dz,

where ∇²g(z) denotes the Hessian of g and ‖·‖ denotes the Hilbert-Schmidt norm for matrices. In contrast, total variation penalization of the component functions themselves yields piecewise constant solutions.
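To make the linear programming formulation concrete, here is a minimal sketch of the unpenalized linear quantile regression LP using scipy's linprog rather than the quantreg package; the penalized additive estimator augments this program with the ℓ₁ and total variation penalty terms above.

```python
import numpy as np
from scipy.optimize import linprog

def qreg(X, y, tau):
    """Linear quantile regression via the standard LP:
    min tau*1'u + (1-tau)*1'v  s.t.  Xb + u - v = y,  u, v >= 0,
    with b split as b = b_pos - b_neg so all variables are nonnegative."""
    n, p = X.shape
    c = np.concatenate([np.zeros(2 * p), np.full(n, tau), np.full(n, 1 - tau)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    z = res.x
    return z[:p] - z[p:2 * p]

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 10, n)
# heteroscedastic noise; the conditional median is still 1 + 2x
y = 1 + 2 * x + (0.5 + 0.2 * x) * rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
b50 = qreg(X, y, 0.5)
print(b50)  # estimates should be near the true values (1, 2)
```

Solving the same LP for τ = 0.05 and τ = 0.95 gives the initial band estimates used in the conformal procedure of the next section.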
Adapting the Hotelling tube idea to construct uniform confidence bands for these components is also described in Koenker (2011), as is selection of the smoothing parameters λ_j, j = 0, 1, …, J. It should be stressed that all of this machinery relies on the validity of Gaussian approximations for the fitted parameters and estimated functions and is conditional on selected tuning parameters. This is in accord with a large strand of earlier literature including Wahba (1983), Nychka (1983), and Krivobokova et al. (2010); however, there are inevitable questions that can be raised about both aspects. To explore this, we consider some recent proposals for strengthening coverage guarantees based on conformal inference in the next section.

Conformal quantile regression
Conformal prediction, and conformal inference more generally, has grown out of work by Vladimir Vovk and colleagues; see, e.g., Shafer and Vovk (2008) for an overview. It has emerged as an essential tool in uncertainty quantification throughout statistics and machine learning. An essential feature of the conformal inference approach in regression is a sample splitting device that allows one to adjust a confidence band constructed with training data based on its performance on a validation sample. Strong finite sample performance guarantees can be proven based on seemingly rather weak exchangeability assumptions. In regression settings, early work presumed conventional iid error structure when constructing the initial bands from the training data; however, Romano et al. (2019) noted that in more heterogeneous settings narrower bands could be constructed using quantile regression methods. This approach has been further developed in Lei and Candès (2022). In high-dimensional regression, this typically would involve some form of random forest or neural network model for the initial bands, but the same methods can be used in simpler models like the additive models described above.

The conformalized quantile regression recipe of Romano et al. (2019) can be summarized as follows:

1: Split the sample into a training set T₁ and a validation set T₂.
2: Using T₁, estimate conditional quantile functions q̂_{τ₀} and q̂_{τ₁} for a pair of quantiles τ₀ < τ₁, yielding the initial prediction band [q̂_{τ₀}(x), q̂_{τ₁}(x)].
3: Compute the conformal scores E_i = max{q̂_{τ₀}(x_i) − y_i, y_i − q̂_{τ₁}(x_i)} for (x_i, y_i) ∈ T₂.
4: Compute Q, the τ₁ − τ₀ quantile of the conformal scores.
5: Return the augmented prediction band [q̂_{τ₀}(x) − Q, q̂_{τ₁}(x) + Q].

Note that the conformal adjustment of the initial band can make it wider or narrower. When Q < 0, the validation sample fell well inside the initial band, indicating that it is safe to shrink the width of the initial band.
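The recipe can be sketched in a few lines. This is an illustrative implementation under stated assumptions: a crude binned empirical quantile estimator stands in for the penalized quantile regression fit, and the data generating process is an invented heteroscedastic example, not the one used below.

```python
import numpy as np

rng = np.random.default_rng(0)

def dgp(n):
    # assumed DGP: Poisson signal with heteroscedastic Gaussian noise
    x = rng.uniform(0, 5, n)
    y = rng.poisson(np.sin(x) ** 2 + 0.1) + 0.03 * x * rng.standard_normal(n)
    return x, y

tau0, tau1 = 0.05, 0.95
x_tr, y_tr = dgp(2000)    # training half: fit the initial band
x_cal, y_cal = dgp(2000)  # validation half: conformal scores
x_new, y_new = dgp(2000)  # fresh data: check marginal coverage

bins = np.linspace(0, 5, 21)
def qfit(xq, tau):
    # binned empirical conditional quantiles fit on the training half
    kq = np.clip(np.digitize(xq, bins) - 1, 0, 19)
    ktr = np.clip(np.digitize(x_tr, bins) - 1, 0, 19)
    return np.array([np.quantile(y_tr[ktr == j], tau) for j in range(20)])[kq]

# conformal scores on the validation sample
E = np.maximum(qfit(x_cal, tau0) - y_cal, y_cal - qfit(x_cal, tau1))
n2 = len(E)
Q = np.quantile(E, min(1.0, np.ceil((n2 + 1) * (tau1 - tau0)) / n2))

# augmented band evaluated on fresh data
lo = qfit(x_new, tau0) - Q
hi = qfit(x_new, tau1) + Q
coverage = np.mean((y_new >= lo) & (y_new <= hi))
print(round(coverage, 3))  # marginal coverage should be close to tau1 - tau0 = 0.90
```

The finite-sample correction ⌈(n₂ + 1)(τ₁ − τ₀)⌉/n₂ in the quantile level is what delivers the marginal coverage guarantee under exchangeability.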
There are several potential difficulties with the foregoing recipe.
• Predictions based on the training sample typically are not equipped to extrapolate beyond the empirical support of the training data, so if the validation data, or new data requiring a conformal interval, lie outside that support some accommodation must be made.
• Performance guarantees are based on marginal coverage of the band, so it may happen that in certain regions of design space there may be failures of coverage that are compensated by satisfactory coverage elsewhere. As shown by Foygel Barber et al. (2020), conditional coverage is not achievable in any generality.
• All of the familiar challenges of penalty methods for regression smoothing persist, so choice of smoothing parameters, in particular, can cause headaches, even though poor λ selection can in principle be ameliorated by the conformal adjustment.
We conclude this section by illustrating the use of the conformal method in an artificial data example taken from Romano et al. (2019). There are 7000 simulated observations plotted in grey. The Poisson contribution to the response produces a banded structure in the scatterplot with pronounced heteroscedasticity. There are also a small number of extreme outliers, many of which lie outside the frame of the figure; such outliers are harmless since we are estimating conditional quantile functions. Penalizing the total variation of g′ yields a piecewise linear fit that does not fit the scatter as well as the piecewise constant estimate obtained by penalizing the total variation of g itself. The latter penalty is clearly better suited to the example and mimics quite well the fit depicted in the original paper. It is striking that the conformal adjustment in both figures is almost imperceptible. Thus, if interest focuses on prediction intervals for the response, the initial estimates provided by the penalized quantile regression are fine, even though they are based on only half the original sample. Prediction bands for Y are fine as far as they go, but what if we wanted confidence bands for the conditional quantile functions themselves? Some might argue, e.g., Geisser (1993) and Clarke and Clarke (2018), that it is pointless to predict quantities that can never be observed, but I subscribe to the principle that every decent estimate deserves a standard error. Figure 5 illustrates confidence bands for the lower, τ = 0.05, and upper, τ = 0.95, conditional quantile functions as estimated using penalization of g′.
The dark grey bands are the pointwise bands, while the lighter grey bands are those based on the Hotelling tube approach. Note that the bands for the τ = 0.05 estimate are extremely narrow since the data are very concentrated in this region, so the τ = 0.05 conditional quantile is very precisely estimated.

Discussion
The large literature in econometrics on stochastic frontier models is mostly concerned with parametric models of the tail behavior of the response "near the production frontier." Nonparametric quantile regression offers yet another perspective on estimating such models. It would be extremely foolish to make any claims for the alternative methodology described here on the basis of the flimsy evidence offered, so let me conclude simply by saying that it might be worthy of further consideration.

Conflict of interest
The author has had no funding support or other potential conflict of interest.
Human and animal rights The study does not involve any human or animal participants; nor have any vegetables been harmed in the preparation of this work.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.