Abstract
We consider the problem of nonparametric estimation of a convex regression function \(\phi _0\). We study the risk of the least squares estimator (LSE) under the natural squared error loss. We show that the risk is always bounded from above by \(n^{-4/5}\) modulo logarithmic factors while being much smaller when \(\phi _0\) is well-approximable by a piecewise affine convex function with not too many affine pieces (in which case, the risk is at most \(1/n\) up to logarithmic factors). On the other hand, when \(\phi _0\) has curvature, we show that no estimator can have risk smaller than a constant multiple of \(n^{-4/5}\) in a very strong sense by proving a “local” minimax lower bound. We also study the case of model misspecification where we show that the LSE exhibits the same global behavior provided the loss is measured from the closest convex projection of the true regression function. In the process of deriving our risk bounds, we prove new results for the metric entropy of local neighborhoods of the space of univariate convex functions. These results, which may be of independent interest, demonstrate the non-uniform nature of the space of univariate convex functions in sharp contrast to classical function spaces based on smoothness constraints.
1 Introduction
We consider the problem of estimating an unknown convex function \(\phi _0\) on \([0, 1]\) from observations \((x_1, Y_1), \dots , (x_n, Y_n)\) drawn according to the model
where \(x_1, \dots , x_n\) are fixed points in \([0, 1]\) and \(\xi _1, \dots , \xi _n\) represent independent mean zero errors. Convex regression is an important problem in the general area of nonparametric estimation under shape constraints. It often arises in applications: typical examples appear in economics (indirect utility, production or cost functions), medicine (dose response experiments) and biology (growth curves).
The most natural and commonly used estimator for \(\phi _0\) is the full least squares estimator (LSE), \(\hat{\phi }_{ls}\), which is defined as any minimizer of the LS criterion, i.e.,
where \({\mathcal {C}}\) denotes the set of all real-valued convex functions on \([0, 1]\). \(\hat{\phi }_{ls}\) is not unique even though its values at the data points \(x_1, \dots , x_n\) are unique. This follows from the fact that \((\hat{\phi }_{ls}(x_1),\ldots , \hat{\phi }_{ls}(x_n)) \in {\mathbb R}^n\) is the projection of \((Y_1,\ldots , Y_n)\) onto a closed convex cone. A simple linear interpolation of these values leads to a unique continuous and piecewise linear convex function with possible knots at the data points, which can be treated as the canonical LSE. The canonical LSE can be easily computed by solving a quadratic program with \((n-2)\) linear constraints.
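As an illustration (this is not the paper's computational scheme), the fitted values of the canonical LSE can be obtained without a general-purpose QP solver through a cone reparametrization: every continuous piecewise linear convex function with knots at the design points can be written as \(a + bx + \sum _j c_j (x - x_j)_+\) with \(c_j \ge 0\), which turns the convexity-constrained least squares problem into a nonnegative least squares problem. A minimal sketch, with function name and reparametrization chosen by us:

```python
import numpy as np
from scipy.optimize import nnls

def convex_lse_fit(x, y):
    """Fitted values of the convex LSE at sorted design points x.

    Reparametrize theta = a + b*x + sum_j c_j * max(0, x - x_j), c_j >= 0,
    with hinge basis functions at the interior design points; the free
    affine part (a, b) is split into positive and negative halves so that
    nonnegative least squares (NNLS) can handle it.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    hinges = np.maximum(0.0, x[:, None] - x[None, 1:-1])  # shape (n, n-2)
    A = np.column_stack([np.ones(n), -np.ones(n), x, -x, hinges])
    coef, _ = nnls(A, np.asarray(y, dtype=float))
    return A @ coef
```

By construction the fitted vector has nondecreasing slopes (nonnegative slope changes \(c_j\)), and on noiseless convex data the piecewise linear interpolant lies in the cone, so the fit interpolates.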
Unlike other methods for function estimation, such as kernel-based methods that require the choice of tuning parameters like smoothing bandwidths, the LSE has the obvious advantage of being completely automated. It was first proposed by [20] for the estimation of production functions and Engel curves. Algorithms for its computation can be found in [13] and [14]. The theoretical behavior of the LSE has been investigated by many authors. Its consistency in the supremum norm on compact sets in the interior of the support of the covariate was proved by [19]. Mammen [22] derived the rate of convergence of the LSE and its derivative at a fixed point, while [16] proved consistency and derived its asymptotic distribution at a fixed point of positive curvature. Dümbgen et al. [12] showed that the supremum distance between the LSE and \(\phi _0\), assuming twice differentiability, on a compact interval in the interior of the support of the design points is of the order \((\log (n)/n)^{2/5}\).
In spite of all the above mentioned work, surprisingly, not much is known about the global risk behavior of the LSE under the natural loss function:
This is the main focus of our paper. In particular, we satisfactorily address the following questions in the paper: At what rate does the risk of the LSE \(\hat{\phi }_{ls}\) decrease to zero? How does this rate of convergence depend on the underlying true function \(\phi _0 \in {\mathcal {C}}\); i.e., does the LSE exhibit faster rates of convergence for certain functions \(\phi _0\)? How does \(\hat{\phi }_{ls}\) behave, in terms of its risk, when the model is misspecified, i.e., the regression function is not convex?
We assume, throughout the paper, that, in (1), \(x_1 < x_2 < \dots < x_n\) are fixed design points in \([0, 1]\) satisfying
where \(c_1\) and \(c_2\) are positive constants, and that \(\xi _1, \ldots , \xi _n\) are independent normally distributed random variables with mean zero and variance \(\sigma ^2 > 0\). In fact, all the results in our paper, excluding those in Sect. 5, hold under the milder assumption of subgaussianity of the errors. Our contributions in this paper can be summarized as follows.
1.
We establish, for the first time, a finite sample upper bound for the risk of the LSE \(\hat{\phi }_{ls}\) under the loss \(\ell ^2\) in Sect. 2. The analysis of the risk behavior of \(\hat{\phi }_{ls}\) is complicated by two facts: (1) \(\hat{\phi }_{ls}\) does not have a closed form expression, and (2) the class \({\mathcal {C}}\) (over which \(\hat{\phi }_{ls}\) minimizes the LS criterion) is not totally bounded. Our risk upper bound involves a minimum of two terms; see Theorem 2.1. The first term says that the risk \({\mathbb E}_{\phi _0} \ell ^2(\hat{\phi }_{ls}, \phi _0)\) is bounded by \(n^{-4/5}\) up to logarithmic multiplicative factors in \(n\). The second term in the risk bound says that the risk is bounded from above by a combination of the parametric rate \(1/n\) and an approximation term that dictates how well \(\phi _0\) is approximated by a piecewise affine convex function (up to logarithmic multiplicative factors). Our risk bound, in addition to establishing the \(n^{-4/5}\) worst case bound, implies that \(\hat{\phi }_{ls}\) adapts to piecewise affine convex functions with not too many pieces (see Sect. 2 for the precise definition). This is remarkable because the LSE minimizes the LS criterion over all convex functions with no explicit special treatment for piecewise affine convex functions.
2.
In the process of proving our risk bound for the LSE, we prove new results for the metric entropy of balls in the space of convex functions. One of the standard approaches to finding risk bounds for procedures based on empirical risk minimization (ERM) says that the risk behavior of \(\hat{\phi }_{ls}\) is determined by the metric entropy of balls in the parameter space around the true function (see, for example, [4, 23, 30, 32]). The ball around \(\phi _0\) in \({\mathcal {C}}\) of radius \(r\) is defined as
$$\begin{aligned} S(\phi _0, r)\, {:=}\, \{\phi \in {\mathcal {C}}: \ell ^2(\phi , \phi _0) \le r^2 \}. \end{aligned}$$ (4)

Recall that, for a subset \({\mathcal {F}}\) of a metric space \(({\mathcal {X}},\rho )\), the \(\epsilon \)-covering number of \({\mathcal {F}}\) under the metric \(\rho \) is denoted by \(M(\epsilon , {\mathcal {F}}, \rho )\) and is defined as the smallest number of closed balls of radius \(\epsilon \) whose union contains \({\mathcal {F}}\). Metric entropy is the logarithm of the covering number. We prove new upper bounds for the metric entropy of \(S(\phi _0, r)\) in Sect. 3. These bounds depend crucially on \(\phi _0\). When \(\phi _0\) is a piecewise affine function with not too many pieces, the metric entropy of \(S(\phi _0, r)\) is much smaller than when \(\phi _0\) has a second derivative that is bounded from above and below by positive constants. This difference in the sizes of the balls \(S(\phi _0, r)\) is the reason why \(\hat{\phi }_{ls}\) exhibits different rates for different convex functions \(\phi _0\). It should be noted that the convex functions in \(S(\phi _0, r)\) are not uniformly bounded and hence existing results on the metric entropy of classes of convex functions (see [5, 11, 18]) cannot be used directly to bound the metric entropy of \(S(\phi _0, r)\). Our main risk bound Theorem 2.1 is proved in Sect. 4 using the developed metric entropy bounds for \(S(\phi _0, r)\). These new bounds are also of independent interest.
3.
We investigate the optimality of the rate \(n^{-4/5}\). We show that for convex functions \(\phi _0\) having a bounded (from both above and below) curvature on a sub-interval of \([0, 1]\), the rate \(n^{-4/5}\) cannot be improved (in a very strong sense) by any other estimator. Specifically we show that a certain “local” minimax risk (see Sect. 5 for the details), under the loss \(\ell ^2\), is bounded from below by \(n^{-4/5}\). This shows, in particular, that the same holds for the global minimax rate for this problem.
4.
We also provide risk bounds in the case of model misspecification where we do not assume that the underlying regression function in (1) is convex. In this case we prove the exact same upper bounds for \({\mathbb E}_{\phi _0} \ell ^2(\hat{\phi }_{ls}, \phi _0)\) where \(\phi _0\) now denotes any convex projection (defined in Sect. 6) of the unknown true regression function. To the best of our knowledge, this is the first result on global risk bounds for the estimation of convex regression functions under model misspecification. Some auxiliary results about convex functions useful in the proofs of the main results are deferred to “Appendix”.
Two special features of our analysis are that: (1) all our risk-bounds are non-asymptotic, and (2) none of our results uses any (explicit) characterization of the LSE (except that it minimizes the least squares criterion) as a result of which our approach can, in principle, be extended to more complex ERM procedures, including shape restricted function estimation in higher dimensions; see e.g., [10, 26, 27].
The adaptation behavior of the LSE established here implies, in particular, that the LSE converges at different rates depending on the true convex function \(\phi _0\). We believe that such adaptation is rather unique to problems of shape restricted function estimation and is currently not very well understood. For example, in the related problem of monotone function estimation, which has an enormous literature (see e.g., [3, 15, 33] and the references therein), the only result on adaptive global behavior of the LSE is found in [17]; also see [29]. This result, however, holds only in an asymptotic sense and only when the true function is a constant. Results on the pointwise adaptive behavior of the LSE in monotone function estimation are more prevalent and can be found, for example, in [7, 8, 21]. For convex function estimation, as far as we are aware, adaptation behavior of the LSE has not been studied before. Adaptation behavior for the estimation of a convex function at a single point has been recently studied by [6] but they focus on different estimators that are based on local averaging techniques.
2 Risk analysis of the LSE
Before stating our main risk bound, we need some notation. Recall that \({\mathcal {C}}\) denotes the set of all real-valued convex functions on \([0, 1]\). For \(\phi \in {\mathcal {C}}\), let \({\mathfrak {L}}(\phi )\) denote the “distance” of \(\phi \) from affine functions. More precisely,
Note that \({\mathfrak {L}}(\phi ) = 0\) when \(\phi \) is affine.
We also need the notion of piecewise affine convex functions. A convex function \(\alpha \) on \([0, 1]\) is said to be piecewise affine if there exists an integer \(k\) and points \(0 = t_0 < t_1 < \dots < t_k = 1\) such that \(\alpha \) is affine on each of the \(k\) intervals \([t_{i-1}, t_i]\) for \(i = 1, \dots , k\). We define \(k(\alpha )\) to be the smallest such \(k\). Let \({\mathcal {P}}_{k}\) denote the collection of all piecewise affine convex functions with \(k(\alpha ) \le k\) and let \({\mathcal {P}}\) denote the collection of all piecewise affine convex functions on \([0, 1]\).
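For a convex function that is already piecewise linear, \(k(\alpha )\) can be read off as the number of maximal runs of equal slopes in its left-to-right slope sequence. A hypothetical helper illustrating this (the function name and interface are ours):

```python
def num_affine_pieces(slopes, tol=1e-12):
    """k(alpha) for a piecewise linear convex function, given the sequence
    of slopes of its linear segments from left to right: count maximal
    runs of equal slopes. Convexity means the slopes are nondecreasing."""
    k = 1
    for s_prev, s_next in zip(slopes, slopes[1:]):
        if s_next - s_prev > tol:  # a strict slope increase starts a new piece
            k += 1
    return k
```

For example, an affine function (constant slope sequence) gives \(k = 1\), consistent with \(\alpha \in {\mathcal {P}}_1\).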
We are now ready to state our main upper bound for the risk of \(\hat{\phi }_{ls}\).
Theorem 2.1
Let \(R {:=} \max (1, {\mathfrak {L}}(\phi _0))\). There exists a positive constant \(C\) depending only on the ratio \(c_1/c_2\) such that
provided
Because of the presence of the minimum in the risk bound presented above, the bound actually involves two parts. We isolate these two parts in the following two separate results. The first result says that the risk is bounded by \(n^{-4/5}\) up to multiplicative factors that are logarithmic in \(n\). The second result says that the risk is bounded from above by a combination of the parametric rate \(1/n\) and an approximation term that dictates how well \(\phi _0\) is approximated by a piecewise affine convex function (up to logarithmic multiplicative factors). The implications of these two theorems are explained in the remarks below. It is clear that Theorems 2.2 and 2.3 together imply Theorem 2.1. We therefore prove Theorem 2.1 by proving Theorems 2.2 and 2.3 separately in Sect. 4.
Theorem 2.2
Let \(R {:=} \max (1, {\mathfrak {L}}(\phi _0))\). There exists a positive constant \(C\) depending only on the ratio \(c_1/c_2\) such that
whenever
Theorem 2.3
There exists a constant \(C\), depending only on the ratio \(c_1/c_2\), such that
for all \(n\).
The following remarks clarify the meaning of these results. The first remark below is about Theorem 2.2; the last three remarks are about Theorem 2.3.
Remark 2.1
(Why convexity is similar to second order smoothness) From the classical theory of nonparametric statistics, the rate \(n^{-4/5}\) is the rate one obtains for the estimation of twice differentiable functions (satisfying a condition such as \(\sup _{x \in [0, 1]} |\phi _0''(x)| \le B\)) on the unit interval. In Theorem 2.2, we prove that \(\hat{\phi }_{ls}\) achieves the same rate (up to log factors) when the true function is convex, under no assumptions whatsoever on the smoothness of the function. Therefore, the constraint of convexity is similar to the constraint of second order smoothness. This has long been believed to be true but, to the best of our knowledge, Theorem 2.2 is the first result to rigorously prove this via a nonasymptotic risk bound for the estimator \(\hat{\phi }_{ls}\) with no assumption of smoothness.
Remark 2.2
(Parametric rates for piecewise affine convex functions) Theorem 2.3 implies that \(\hat{\phi }_{ls}\) has the parametric rate for estimating piecewise affine convex functions. Indeed, suppose \(\phi _0\) is a piecewise affine convex function on \([0, 1]\) i.e., \(\phi _0 \in {\mathcal {P}}\). Then using \(\alpha = \phi _0\) in (5), we have the risk bound
This is the parametric rate \(1/n\) up to logarithmic factors and is of course much smaller than the nonparametric rate \(n^{-4/5}\) given in Theorem 2.2. Therefore, \(\hat{\phi }_{ls}\) adapts to each class \({\mathcal {P}}_k\) of piecewise affine convex functions.
Remark 2.3
(Automatic adaptation) Risk bounds such as (5) are usually provable for estimators based on empirical model selection criteria (see, for example, [2]) or aggregation (see, for example, [25]). Specializing to the present situation, in order to adapt over \({\mathcal {P}}_k\) as \(k\) varies, one constructs LSE over each \({\mathcal {P}}_k\) and then either selects one estimator from this collection by an empirical model selection criterion or aggregates these estimators with data-dependent weights. While the theory for such penalization estimators is well-developed (see e.g., [2]), these estimators are computationally expensive, might rely on certain tuning parameters which might be difficult to choose in practice and also require estimation of \(\sigma ^2\). The LSE \(\hat{\phi }_{ls}\) is very different from these estimators because it simply minimizes the LS criterion over the whole space \({\mathcal {C}}\). It is therefore very easy to compute, does not depend on any tuning parameter or estimates for \(\sigma ^2\) and, remarkably, it automatically adapts over the classes \({\mathcal {P}}_k\) as \(k\) varies.
Remark 2.4
(Why convexity is different from second order smoothness) In Remark 2.1, we argued how estimation under convexity is similar to estimation under second order smoothness. Here we describe how the two are different. The risk bound given by Theorem 2.3 crucially depends on the true function \(\phi _0\). In other words, the LSE converges at different rates depending on the true convex function \(\phi _0\). Therefore, the rate of the LSE is not uniform over the class of all convex functions but it varies quite a bit from function to function in that class. As will be clear from our proofs, the reason for this difference in rates is that the class of convex functions \({\mathcal {C}}\) is locally non-uniform in the sense that the local neighborhoods around certain convex functions (e.g., affine functions) are much sparser than local neighborhoods around other convex functions. On the other hand, in the class of twice differentiable functions, all local neighborhoods are, in some sense, equally sized.
Remark 2.5
(On the logarithmic factors) We believe that Theorems 2.2 and 2.3 might have redundant logarithmic factors. In particular, we conjecture that there should be no logarithmic term in Theorem 2.2 and that the logarithmic term should be \(\log (en/(2c_1))\) instead of \((\log (en/(2c_1)))^{5/4}\) in Theorem 2.3; cf. analogous results in isotonic regression—[33] and [9]. These additional logarithmic factors mainly arise due to the fact that the class \(S(\phi _0, r)\), of convex functions appearing in the proofs, is not uniformly bounded. Sharpening these factors might be possible by using an explicit characterization of the LSE (as was done in [33] and [9] for isotonic regression) and other techniques that are beyond the scope of the present paper.
The proofs of Theorems 2.2 and 2.3 are presented in Sect. 4. A high level overview of the proof goes as follows. The convex LSE is an ERM procedure. These procedures are very well studied and numerous risk bounds exist in mathematical statistics and machine learning (see, for example, [4, 23, 30, 32]). These results essentially say that the risk behavior of \(\hat{\phi }_{ls}\) is determined by the metric entropy of the balls \(S(\phi _0, r)\) (defined in (4)) in \({\mathcal {C}}\) around the true function \(\phi _0\). Controlling the metric entropy of the balls \(S(\phi _0, r)\) is the key step in the proofs of Theorems 2.2 and 2.3. The next section deals with bounds for the metric entropy of \(S(\phi _0, r)\).
3 The local structure of the space of convex functions
In this section, we prove bounds for the metric entropy of the balls \(S(\phi _0, r)\) as \(\phi _0\) ranges over the space of convex functions. Our results give new insights into the local structure of the space of convex functions. We show that the metric entropy of \(S(\phi _0, r)\) behaves differently for different convex functions \(\phi _0\). This is the reason why the LSE exhibits different rates of convergence depending on the true function \(\phi _0\). The metric entropy of \(S(\phi _0, r)\) is much smaller when \(\phi _0\) is a piecewise affine convex function with not too many affine pieces than when \(\phi _0\) has a second derivative that is bounded from above and below by positive constants.
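Before turning to the formal results, the covering-number definition can be made concrete with a simple greedy construction: scan a finite collection of function-value vectors and keep every point farther than \(\epsilon \) from all centers kept so far. Each point then lies within \(\epsilon \) of some kept center, so the kept set is itself an \(\epsilon \)-cover and its cardinality upper-bounds \(M(\epsilon , \cdot , \ell )\). A small sketch, purely illustrative and not from the paper:

```python
import numpy as np

def greedy_cover(points, eps, dist):
    """Greedy eps-net: centers of closed eps-balls covering `points`.

    Any point not kept was, at the time it was scanned, within eps of an
    already-kept center, so the returned centers form an eps-cover.
    """
    centers = []
    for p in points:
        if all(dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers

def ell(f, g):
    """Discrete root-mean-square metric on function values at the design
    points, mirroring the loss used in the paper."""
    f, g = np.asarray(f), np.asarray(g)
    return np.sqrt(np.mean((f - g) ** 2))
```

The logarithm of the number of centers returned is an empirical stand-in for the metric entropy that Theorems 3.1 and 3.2 bound analytically.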
The next theorem is the main result of this section.
Theorem 3.1
There exists a positive constant \(c\) depending only on the ratio \(c_1/c_2\) such that for every \(\phi _0 \in {\mathcal {C}}\) and \(\epsilon > 0\), we have
where
Note that the dependence of the right hand side of (6) on \(\epsilon \) is always \(\epsilon ^{-1/2}\). The dependence on \(r\) is captured by \(\Gamma (r; \phi _0)\), which varies with \(\phi _0\). This function \(\Gamma (r; \phi _0)\) controls the size of the ball \(S(\phi _0, r)\): the larger the value of \(\Gamma (r; \phi _0)\), the larger the metric entropy of \(S(\phi _0, r)\). The smallest possible value of \(\Gamma (r; \phi _0)\) equals \(r\) and is achieved for affine functions. When \(\phi _0\) is piecewise affine, \(\Gamma (r; \phi _0)\) is larger than \(r\) but not by much provided \(k(\phi _0)\) is small; this is because \(\Gamma (r; \phi _0) \le r k^{5/2}(\phi _0)\). When \(\phi _0\) cannot be well-approximated by piecewise affine functions with a small number of pieces, it can be shown that \(\Gamma (r; \phi _0)\) is bounded from below by a constant independent of \(r\). This will be the case, for example, when \(\phi _0\) is twice differentiable with \(\phi _0''(x)\) bounded from above and below by positive constants. As shown in the next theorem, \(S(\phi _0, r)\) has the largest possible size for such \(\phi _0\). Note also that one always has the upper bound \(\Gamma (r; \phi _0) \le \sqrt{r^2 + {\mathfrak {L}}^2(\phi _0)}\), which can be proved by restricting the infimum in the definition of \(\Gamma (r; \phi _0)\) to affine functions.
We need the following definition for the next theorem. For a subinterval \([a, b]\) of \([0, 1]\) and positive real numbers \(\kappa _1 < \kappa _2\), we define \(\mathfrak {K}{:=} \mathfrak {K}(a, b, \kappa _1, \kappa _2)\) to be the class of all convex functions \(\phi \) on \([0, 1]\) which are twice differentiable on \([a, b]\) and which satisfy \(\kappa _1 \le \phi ''(x) \le \kappa _2\) for all \(x \in [a, b]\).
Theorem 3.2
Suppose \(\phi _0 \in \mathfrak {K}(a, b, \kappa _1, \kappa _2)\). Then there exist positive constants \(c\), \(\epsilon _0\) and \(\epsilon _1\) depending only on \(\kappa _1, \kappa _2\), \(b-a\) and \(c_2\) such that
Note that the right hand side of (7) does not depend on \(r\). This should be contrasted with the right hand side of (6) when \(\phi _0\) is, say, an affine function. The non-uniform nature of the space of univariate convex functions should be clear from this: balls \(S(\phi _0, r)\) of the same radius \(r\) in the space have different sizes depending on their center, \(\phi _0\). This should be contrasted with the space of twice differentiable functions in which all balls are equally sized in the sense that they all satisfy (7).
Remark 3.1
Note that the inequality (7) only holds when \(\epsilon \ge \epsilon _1 n^{-2}\). In other words, it does not hold when \(\epsilon \downarrow 0\). This is actually inevitable because, ignoring the convexity of functions in \(S(\phi _0, r)\), the metric entropy of \(S(\phi _0, r)\) under \(\ell \) cannot be larger than the metric entropy of the ball of radius \(r\) in \({\mathbb R}^n\), which is bounded from above by \(n \log (1 + (3r/\epsilon ))\) (see e.g., [24], Lemma 4.1). Thus, as \(\epsilon \downarrow 0\), the metric entropy of \(S(\phi _0, r)\) grows only logarithmically in \(1/\epsilon \), as opposed to like \(\epsilon ^{-1/2}\). Also note that inequality (7) only holds for \(\epsilon \le r \epsilon _0\). This also makes sense because the diameter of \(S(\phi _0, r)\) in the metric \(\ell \) equals \(2r\) and, consequently, the left hand side of (7) equals zero for \(\epsilon > 2r\). Therefore, one cannot expect (7) to hold for all \(\epsilon > 0\).
Remark 3.2
The proof of Theorem 3.2 actually implies a conclusion stronger than (7). Let \(S'(\phi _0, r) {:=} \left\{ \phi \in {\mathcal {C}}: \sup _x |\phi (x) - \phi _0(x)| \le r\right\} \). Clearly this is a smaller neighborhood of \(\phi _0\) than \(S(\phi _0, r)\) i.e., \(S'(\phi _0, r) \subseteq S(\phi _0, r)\). The proof of Theorem 3.2 shows that the lower bound (7) also holds for \(\log M(\epsilon , S'(\phi _0, r), \ell )\).
In the remainder of this section, we provide the proofs of Theorems 3.1 and 3.2. Let us start with the proof of Theorem 3.1. Since functions in \(S(\phi _0, r)\) are convex, we need to analyze the covering numbers of subsets of convex functions. Only two previous results exist here. Bronshtein [5] proved bounds on the covering numbers of classes of convex functions that are uniformly bounded and uniformly Lipschitz under the supremum metric. This result was extended by [11], who dropped the uniform Lipschitz assumption (this result was further extended by [18] to the multivariate case). Unfortunately, the convex functions in \(S(\phi _0, r)\) are not uniformly bounded (they only satisfy a weaker discrete \(L^2\)-type constraint) and hence Dryanov's result cannot be used directly for proving Theorem 3.1. Another difficulty is that we need covering numbers under \(\ell \) while the results in [11] are based on integral \(L_p\) metrics.
Here is a high-level outline of the proof of Theorem 3.1. The first step is to reduce the general problem to the case when \(\phi _0 \equiv 0\). The result for \(\phi _0 \equiv 0\) immediately implies the result for all affine functions \(\phi _0\). One can then generalize to piecewise affine convex functions by repeating the argument over each affine piece. Finally, the result is derived for general \(\phi _0\) by approximating \(\phi _0\) by piecewise affine convex functions.
For \(\phi _0 \equiv 0\), the class of convex functions under consideration is \(S(0, r)\). Unfortunately, functions in \(S(0, r)\) are not uniformly bounded; they only satisfy a weaker discrete \(L^2\)-type boundedness constraint. We get around the lack of uniform boundedness by noting that convexity and the \(L^2\)-constraint imply that functions in \(S(0, r)\) are uniformly bounded on subintervals that are in the interior of \([x_1, x_n]\) (this is proved via Lemma 7.3). We use this to partition the interval \([x_1, x_n]\) into appropriate subintervals where Dryanov’s metric entropy result can be employed. We first carry out this argument for another class of convex functions where the discrete \(L^2\)-constraint is replaced by an integral \(L^2\)-constraint. From this result, we deduce the covering numbers of \(S(0, r)\) by using straightforward interpolation results (Lemma 7.4).
3.1 Proof of Theorem 3.1
3.1.1 Reduction to the case when \(\phi _0 \equiv 0\)
The first step is to note that it suffices to prove the theorem when \(\phi _0\) is the constant function equal to 0. For \(\phi _0 \equiv 0\), Theorem 3.1 is equivalent to the following statement: there exists a constant \(c > 0\), depending only on the ratio \(c_1/c_2\), such that
Below, we prove Theorem 3.1 assuming that (8) is true. Let \(\alpha \in {\mathcal {P}}_k\) be a piecewise affine function with \(k(\alpha ) = k\). We shall show that
This inequality immediately implies Theorem 3.1 because for every \(\phi _0, \phi \in {\mathcal {C}}\) and \(\alpha \in {\mathcal {P}}\), we have
by the trivial inequality \((a + b)^2 \le 2 a^2 + 2 b^2\). This means that \(\ell ^2(\phi , \alpha ) \le 2 r^2 + 2 \ell ^2(\phi _0, \alpha )\) for every \(\phi \in S(\phi _0, r)\). Hence
This inequality and (9) together clearly imply (6). It suffices therefore to prove (9).
Suppose that \(\alpha \) is affine on each of the \(k\) intervals \(I_i = [t_{i-1}, t_i]\) for \(i = 2, \dots , k\), where \(0 = t_0 < t_1 < \dots < t_{k-1} < t_k = 1\), and \(I_1 = [0, t_1]\). Then there exist \(k\) affine functions \(\tau _1, \dots , \tau _k\) on \([0, 1]\) such that \(\alpha (x) = \tau _i(x)\) for \(x \in I_i\) for every \(i = 1, \dots , k\).
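Since the slopes of the \(\tau _i\) increase with \(i\) and consecutive pieces agree at the knots, such an \(\alpha \) can equivalently be written as the pointwise maximum \(\alpha (x) = \max _i \tau _i(x)\), a standard fact for piecewise affine convex functions. A numerical check with a made-up three-piece example (the knots and slopes below are our choices, not from the paper):

```python
import numpy as np

# Three affine pieces with increasing slopes (-2 < 0.5 < 3), chosen to be
# continuous at the knots t_1 = 0.3 and t_2 = 0.7.
taus = [lambda x: -2.0 * x + 1.0,   # tau_1, active on [0, 0.3]
        lambda x: 0.5 * x + 0.25,   # tau_2, active on [0.3, 0.7]
        lambda x: 3.0 * x - 1.5]    # tau_3, active on [0.7, 1]
knots = [0.0, 0.3, 0.7, 1.0]

def alpha_piecewise(x):
    """alpha defined piece by piece, as in the text."""
    i = min(int(np.searchsorted(knots, x, side="right")) - 1, 2)
    return taus[i](x)

def alpha_max(x):
    """alpha as the pointwise maximum of its affine pieces."""
    return max(tau(x) for tau in taus)
```

The two definitions agree on all of \([0, 1]\), which is what makes the per-piece decomposition of \(\ell ^2\) below convenient.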
For every pair of functions \(f\) and \(g\) on \([0, 1]\), we have the trivial identity: \(\ell ^2(f, g) = \sum _{i=1}^k \ell _{i}^2(f, g)\) where
As a result, we clearly have
Fix an \(i \in \{1, \dots , k\}\). Note that for every \(f \in S(\alpha , r)\), we have
Therefore
where \(S_i(\tau _i, r)\) consists of the class of all convex functions \(f : I_i \rightarrow {\mathbb R}\) for which \(\ell _i^2(\tau _i, f) \le r^2\).
By the translation invariance of the Euclidean distance and the fact that \(\phi - \tau \) is convex whenever \(\phi \) is convex and \(\tau \) is affine, it follows that
where \(S_i(0, r)\) is defined as the class of all convex functions \(f: I_i \rightarrow {\mathbb R}\) for which \(\ell _i^2(0, f) \le r^2\).
The covering number \(M(\epsilon /\sqrt{k}, S_i(0, r), \ell _i)\) can be easily bounded using (8) by the following scaling argument. Let \(J {:=} \{j \in \{1, \dots , n\} : x_j \in I_i\}\) with \(m\) being the cardinality of \(J\). Also write \([a, b]\) for the interval \(I_i\) and let \(u_j {:=} (x_j - a)/(b-a)\) for \(j \in J\). For \(f, g \in {\mathcal {C}}\), let
and \(S^{(u)}(0, \gamma ) {:=} \{f \in {\mathcal {C}}: \ell ^{(u)}(f, 0) \le \gamma \}\). By associating, for each \(f \in S_i(0, r)\), the convex function \(\tilde{f} \in {\mathcal {C}}\) defined by \(\tilde{f}(x) {:=} f(a + (b-a)x)\), it can be shown that
The assumption (3) implies that the distance between neighboring points in \(\{u_j, j \in J\}\) lies between \(c_1/(n(b-a))\) and \(c_2/(n(b-a))\); that is, these \(m\) points satisfy the analogue of (3) with \(n\) replaced by \(m\) and with \(c_1\), \(c_2\) replaced by \(mc_1/(n(b-a))\) and \(mc_2/(n(b-a))\), whose ratio is still \(c_1/c_2\). Therefore, by applying (8) to \(\{u_j, j \in J\}\) instead of \(\{x_i\}\), we obtain the existence of a positive constant \(c\) depending only on the ratio \(c_1/c_2\) such that
The required inequality (9) now follows from the above and (10).
3.1.2 The integral version
We have established above that it suffices to prove Theorem 3.1 for \(\phi _0 \equiv 0\) i.e., it suffices to prove (8). The ball \(S(0, r)\) consists of all convex functions \(\phi \) such that
For \(a < b\) and \(B > 0\), let \({\mathfrak {I}}([a, b], B)\) denote the class of all real-valued convex functions \(f\) on \([a, b]\) for which \(\int _a^b f^2(x) dx \le B^2\). The ball \(S(0, r)\) is intuitively very close to the class \({\mathfrak {I}}([0, 1], r)\); the only difference is that the average constraint (11) is replaced by the integral constraint \(\int _0^1 \phi ^2(x) dx \le r^2\) in \({\mathfrak {I}}([0, 1], r)\). We shall prove a good upper bound for the metric entropy of \({\mathfrak {I}}([0, 1], r)\). The metric entropy of \(S(0, r)\) will then be derived as a consequence.
Theorem 3.3
There exist a constant \(c\) such that for every \(0 < \eta < 1/2\), \(B > 0\) and \(\epsilon > 0\), we have
where, by \(L_2[\eta , 1-\eta ]\), we mean the metric where the distance between \(f\) and \(g\) is given by
Remark 3.3
We take the metric above to be \(L_2[\eta , 1 - \eta ]\) as opposed to \(L_2[0, 1]\) because
To see this, take \(f_j(t) = 2^{j/2} \max (0, 1 - 2^j t)\) for \(t \in [0,1]\) and \(j \ge 1\). It is then easy to check that \(f_j \in {\mathfrak {I}}([0, 1], B)\) for \(B \ge 3^{-1/2}\) (indeed, \(\int _0^1 f_j^2 = 1/3\) for every \(j\)) and that \(\int _0^1 (f_j - f_{j+1})^2 \ge c\) for some positive constant \(c\), which proves (13). The equality (13) is also the reason why the right hand side of (12) approaches \(\infty \) as \(\eta \downarrow 0\).
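The two integral claims about the functions \(f_j\) can be checked numerically; the sketch below uses a midpoint Riemann sum (the grid resolution is our choice). One finds \(\int _0^1 f_j^2 = 1/3\) for every \(j\), while \(\int _0^1 (f_j - f_{j+1})^2\) stays bounded away from zero, so the \(f_j\) form an infinite separated set in \(L_2[0, 1]\):

```python
import numpy as np

def f(j, t):
    """f_j(t) = 2^{j/2} * max(0, 1 - 2^j t), the spike family from the remark."""
    return 2.0 ** (j / 2.0) * np.maximum(0.0, 1.0 - (2.0 ** j) * t)

# midpoint Riemann sum on [0, 1]
m = 1 << 20
t = (np.arange(m) + 0.5) / m

sqs = [np.mean(f(j, t) ** 2) for j in range(1, 7)]               # ~ 1/3 each
gaps = [np.mean((f(j, t) - f(j + 1, t)) ** 2) for j in range(1, 7)]
```

The separation constant comes out to \((8 - 5\sqrt{2})/12 \approx 0.077\) for every \(j\), independent of \(j\), which is exactly what (13) needs.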
The above theorem is a new result. If the constraint \(\int _0^1 \phi ^2(x) dx \le B^2\) is replaced by the stronger constraint \(\sup _{x \in [0, 1]} |\phi (x)| \le B\), then the corresponding bound was proved by [11]. Specifically, [11] considered the class \({\mathcal {C}}([a, b], B)\) consisting of all convex functions \(f\) on \([a, b]\) which satisfy \(\sup _{x \in [a, b]}|f(x)| \le B\) and proved the following ([18] extended this to the multivariate case).
Theorem 3.4
(Dryanov) There exists a positive constant \(c\) such that for every \(B > 0\) and \(b > a\), we have
Remark 3.4
In [11], inequality (14) was only asserted for \(\epsilon \le \epsilon _0 B(b-a)^{1/2}\) for a positive constant \(\epsilon _0\). It turns out, however, that this condition is redundant. This follows from the observation that the diameter of the space \({\mathcal {C}}([a, b], B)\) in the \(L_2[a, b]\) metric is at most \(2B(b-a)^{1/2}\), which means that the left hand side of (14) equals 0 for \(\epsilon > 2B(b-a)^{1/2}\); thus, by changing the constant \(c\) suitably in Dryanov's result, we obtain (14).
The class \({\mathfrak {I}}([0, 1], B)\) is much larger than \({\mathcal {C}}([0, 1], B)\) because the integral constraint \(\int _0^1 \phi ^2(x) dx \le B^2\) is much weaker than \(\sup _{x \in [0, 1]} |\phi (x)| \le B\). Therefore, Theorem 3.3 does not directly follow from Theorem 3.4. However, it is possible to derive Theorem 3.3 from Theorem 3.4 via the observation (made rigorous in Lemma 7.3) that functions in \({\mathfrak {I}}([0, 1], B)\) become uniformly bounded on subintervals of \([0, 1]\) that are sufficiently far away from the boundary points. On such subintervals, we may use Theorem 3.4 to bound the covering numbers. Theorem 3.3 is then proved by putting together these different covering numbers as shown below.
Proof of Theorem 3.3
By a trivial scaling argument, we can assume without loss of generality that \(B = 1\). Let \(l\) be the largest integer that is strictly smaller than \(-\log (2\eta )/\log 2\) and let \(\eta _i {:=} 2^i \eta \) for \(i = 0, \dots , l+1\). Observe that \(\eta _l < 1/2 \le \eta _{l+1}\).
Fix \(i \in \{0, \dots , l\}\). By Lemma 7.3, the restriction of a function \(\phi \in {\mathfrak {I}}([0, 1], 1)\) to \([\eta _i, \eta _{i+1}]\) is convex and uniformly bounded by \(2 \sqrt{3} \eta _i^{-1/2}\). Therefore, by Theorem 3.4, there exists a positive constant \(c\) such that we can cover the functions in \({\mathfrak {I}}([0, 1], 1)\) in the \(L_2[\eta _i, \eta _{i+1}]\) metric to within \(\alpha _{i}\) by a finite set having cardinality at most
Because
we obtain a cover of the functions in \({\mathfrak {I}}([0, 1], 1)\) in the \(L_2[\eta , 1/2]\) metric to within \(\left( \sum _{i=0}^l \alpha _i^2 \right) ^{1/2}\), with cardinality at most \(\exp \left( c \sum _{i=0}^l \alpha _i^{-1/2} \right) \).
Taking \(\alpha _i = \epsilon (l+1)^{-1/2}\), we get that
where \(c_1\) depends only on \(c\). By an analogous argument, the above inequality will also hold for \(\log M(\epsilon , {\mathfrak {I}}([0, 1], 1), L_2[1/2, 1-\eta ])\). The proof is completed by putting these two bounds together. \(\square \)
3.1.3 Completion of the Proof of Theorem 3.1
We now complete the proof of Theorem 3.1 by proving inequality (8). We will use Theorem 3.3. We need to switch between the pseudometrics \(\ell \) and \(L_2[\eta , 1-\eta ]\). This will be made convenient by the use of Lemma 7.4.
By an elementary scaling argument, it follows that
We, therefore, only need to prove (8) for \(r = 1\). For ease of notation, let us denote \(S(0, 1)\) by \(S\).
Because \(x_i - x_{i-1} \ge c_1/n\) for all \(i = 2, \dots , n\), we have \(x_2, \dots , x_{n-1} \in [c_1/n, 1 - (c_1/n)]\). We shall first prove an upper bound for \(\log M(\epsilon , S, \ell _1)\) where
For each function \(\phi \in S\), let \(\tilde{\phi }\) be the convex function on \([x_2, x_{n-1}]\) defined by
where \(i = 2, \dots , n-2\). Also let \(\tilde{S} {:=} \left\{ \tilde{\phi }: \phi \in S \right\} \).
By Lemma 7.4 and the assumption that \(x_i - x_{i-1} \ge c_1/n\) for all \(i\), we get that
for every pair of functions \(\phi \) and \(\psi \) in \(S\). Letting \(\delta {:=} \epsilon \sqrt{c_1/6}\), this inequality implies that
Again by Lemma 7.4 and the assumption \(x_i - x_{i-1} \le c_2/n\), we have that
As a result, we have that \(\tilde{S} \subseteq {\mathfrak {I}}([x_1, x_n], \sqrt{c_2})\). Further, because \(x_2 \ge x_1 + c_1/n\) and \(x_{n-1} \le x_n - c_1/n\), we get that
where \(\eta {:=}\, c_1/n\). By a simple scaling argument, the covering number on the right hand side above is upper bounded by
Indeed, for each \(f \in {\mathfrak {I}}([x_1, x_n], \sqrt{c_2})\), we can associate \(\tilde{f}(y) {:=} f(x_1 + y(x_n - x_1))\) for \(y \in [0, 1]\). It is then easy to check that \(\tilde{f} \in {\mathfrak {I}}([0, 1], \sqrt{c_2(x_n - x_1)})\) and
from which (15) easily follows. From the bound (15), it is now easy to see that (because \(x_n - x_1 \le 1\))
Thus, by Theorem 3.3, we assert the existence of a positive constant \(c\) such that
Now for every pair of functions \(\phi \) and \(\psi \) in \(S\), we have
We make the simple observation that \((\phi (x_1), \phi (x_n))\) lies in the closed ball of radius \(\sqrt{n}\) in \({\mathbb R}^2\) denoted by \(B_2(0, \sqrt{n})\). As a result, using Pollard ([24], Lemma 4.1), we have
where the covering number of \(B_2(0, \sqrt{n})\) is in the usual Euclidean metric. Using (16), we get
Because \(\log (1 + x) \le 3 \sqrt{x}\) for all \(x > 0\), the first term on the right hand side above is bounded by a constant multiple of \(\epsilon ^{-1/2}\). This proves (8) once the constant \(c\) is renamed appropriately.
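The elementary inequality invoked in the last step is easy to check numerically (illustration only); the ratio \(\log (1+x)/\sqrt{x}\) is in fact maximized near \(x \approx 4\), where it is about \(0.8\), comfortably below \(3\):

```python
# Check of the elementary bound log(1 + x) <= 3*sqrt(x) for x > 0 over a
# wide grid; the maximal ratio log(1 + x)/sqrt(x) is roughly 0.8.
import math

xs = [k / 100 for k in range(1, 10_000)] + [10.0 ** e for e in range(2, 9)]
ratios = [math.log1p(x) / math.sqrt(x) for x in xs]
print(max(ratios) < 3)  # True
```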
3.2 Proof of Theorem 3.2
In our proof below, we shall make use of Lemma 7.1 (stated and proved in the Appendix), which bounds the distance between functions in \(\mathfrak {K}(a, b, \kappa _1, \kappa _2)\) and their piecewise linear interpolants.
Fix \(m \ge 1\) and let \(t_i = a + (b-a)i/m\) for \(i = 0, \dots , m\). For each \(i = 1, \dots , m\), let \(\alpha _i\) denote the linear interpolant of the points \((t_{i-1}, \phi _0(t_{i-1}))\) and \((t_i, \phi _0(t_i))\), i.e.,
By error estimates for linear interpolation (see e.g., Chapter 3 of [1]), for every \(x \in [t_{i-1}, t_i]\), there exists a point \(t_x \in [t_{i-1}, t_i]\) for which
which implies, because \(\phi _0 \in \mathfrak {K}(a, b, \kappa _1, \kappa _2)\), that
for every \(x \in [a, b]\). By convexity of \(\phi _0\), it is obvious that \(\alpha _i(x) \ge \phi _0(x)\) for \(x \in [t_{i-1}, t_i]\) and \(\alpha _i(x) \le \phi _0(x)\) for \(x \notin [t_{i-1}, t_i]\).
Now for each \(\tau \in \{0, 1\}^m\), let us define
The functions \(\phi _{\tau }\) are clearly convex because they equal the pointwise maximum of convex functions. Moreover, for \(x \in [t_{i-1}, t_i]\), we have
Also, from (18),
Because \(\ell (\phi _{\tau }, \phi _0) \le \sup _{x} |\phi _{\tau }(x) - \phi _0(x)|\), it follows that \(\phi _{\tau } \in S(\phi _0, r)\) provided
Observe now that for every \(\tau , \tau ' \in \{0, 1\}^m\),
where \({\Upsilon }(\tau , \tau ') {:=} \sum _i I\{\tau _i \ne \tau '_i \}\). We now use Lemma 7.1 to bound \(\ell ^2(\phi _0, \max (\phi _0, \alpha _i))\) from below. Since \(\alpha _i\) is the linear interpolant of \((t_{i-1}, \phi _0(t_{i-1}))\) and \((t_i, \phi _0(t_i))\), we use Lemma 7.1 (inequality (38)) with \(a = t_{i-1}\) and \(b = t_i\) to assert
provided
From (20), we thus have
Using now the Varshamov-Gilbert lemma (see, for example, Massart ([23], Lemma 4.7)), which asserts the existence of a subset \(W\) of \(\{0, 1\}^m\) with cardinality \(|W| \ge \exp (m/8)\) such that \({\Upsilon }(\tau , \tau ') \ge m/4\) for all \(\tau , \tau ' \in W\) with \(\tau \ne \tau '\), we get that
Let us now fix \(\epsilon > 0\) and choose \(m\) so that
From (22), we then see that \(\{\phi _{\tau }: \tau \in W\}\) is an \(\epsilon \)-packing set under the pseudometric \(\ell \). The condition (19) would hold provided
Also, the condition (21) is equivalent to
We have therefore shown that for \(\epsilon \) satisfying the above pair of inequalities, there exists an \(\epsilon \)-packing subset of \(S(\phi _0, r)\) with cardinality \(|W|\) satisfying
The proof of Theorem 3.2 is now complete if we take
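The Varshamov-Gilbert lemma used above is nonconstructive, but for small \(m\) a set \(W\) with the stated properties can also be produced by a greedy search. The following sketch (illustration only) does this for \(m = 12\):

```python
# Greedy construction illustrating the Varshamov-Gilbert lemma: build
# W, a subset of {0,1}^m with pairwise Hamming distance >= m/4; the lemma
# guarantees such a W of cardinality at least exp(m/8).
import itertools
import math

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

m = 12
W = []
for tau in itertools.product((0, 1), repeat=m):
    if all(hamming(tau, w) >= m / 4 for w in W):
        W.append(tau)

print(len(W) >= math.exp(m / 8))  # True: greedy already meets the bound
```

Greedy selection in fact achieves the classical Gilbert bound \(2^m / \sum _{j < m/4} \binom{m}{j}\), which is well above \(\exp (m/8)\).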
4 Proofs of the risk bounds of the LSE
In this section, we provide the proofs of Theorems 2.2 and 2.3. As mentioned in Sect. 3, these two theorems together imply our main risk bound for the convex LSE, Theorem 2.1. Our proofs are based on the local metric entropy result (Theorem 3.1) for the space of univariate convex functions derived in the previous section, together with standard results on the risk behavior of ERM procedures. Before proceeding further, let us state precisely the result from the literature on ERM procedures that we use to analyze the risk of \(\hat{\phi }_{ls}\). There exist many such results, all similar in spirit; the following result from Van de Geer ([30], Theorem 9.1) is especially convenient to use.
Theorem 4.1
[30] For each \(r > 0\), let
Suppose \(H\) is a function on \((0, \infty )\) such that
and such that \(H(r)/r^2\) is decreasing on \((0, \infty )\). Then there exists a universal constant \(C\) such that
for every \(\delta > 0\) satisfying \(\sqrt{n} \delta \ge C\sigma H(\sqrt{\delta })\).
Let us note that our local metric entropy result, Theorem 3.1, easily implies an upper bound for the entropy integral
appearing in Theorem 4.1. Indeed, using the bound given by (6) for \(M(\epsilon , S(\phi _0, r), \ell )\) above and integrating, we obtain that (23) is bounded from above by
for every \(\phi _0 \in {\mathcal {C}}\) and \(r > 0\) where \(K\) is a constant that only depends on the ratio \(c_1/c_2\).
4.1 Proof of Theorem 2.2
Let us define
where \(A\) is a constant whose value will be specified shortly. Observe that \(\delta _0 \le R^2\) whenever \(n \ge A^{5/4} \left( \log ((en)/(2c_1)) \right) ^{5/4} \sigma ^2/R^2\). We use the bound (24) for the entropy integral (23). By restricting the infimum in the right hand side of (24) to affine functions (i.e., \(\alpha \in {\mathcal {P}}_1\)) for which \(k(\alpha ) = 1\), we obtain (note that \(\inf _{\alpha \in {\mathcal {P}}_1} \ell ^2(\phi _0, \alpha ) = {\mathfrak {L}}^2(\phi _0) \le R^2\))
for every \(r > 0\). Suppose now that
so that \(\delta _0 \le R^2\) and inequality (25) holds for every \(r > 0\). Let \(H(r)\) denote the right hand side of (25). It is clear that \(H(r)/r^2\) is decreasing on \((0, \infty )\). As a result, a condition of the form \(\sqrt{n} \delta \ge C \sigma H(\sqrt{\delta })\) for some positive constant \(C\) holds for every \(\delta \ge \delta _0\) provided it holds for \(\delta = \delta _0\). Clearly
Assuming that (26) holds and noting then that \(\delta _0 \le R^2\), we get
We shall now use Theorem 4.1. Let \(C\) be the constant given by Theorem 4.1. By the above inequality, the condition \(\sqrt{n} \delta \ge C \sigma H(\sqrt{\delta })\) holds for each \(\delta \ge \delta _0\) provided \(A = 2^{1/5} (C K)^{8/5}\). Thus by Theorem 4.1, we obtain
for all \(\delta \ge \delta _0\) whenever \(n\) satisfies (26). Using the expression for \(\delta _0\) and (26), we get for \(\delta \ge \delta _0\),
We thus have
for some constant \(C_1\) (depending only on \(C\) and \(A = 2^{1/5}(CK)^{8/5}\)) provided \(n\) satisfies (26). Integrating both sides of this inequality with respect to \(\delta \) [and using (27) again], we obtain the risk bound
for some positive constant \(C_2\) depending only on \(C\) and \(K\). Because \(C\) is an absolute constant and \(K\) only depends on the ratio \(c_1/c_2\), the proof is complete by an appropriate renaming of the constant \(C\).
4.2 Proof of Theorem 2.3
For each \(1 \le k \le n\), let
so that
It is also easy to check that
As a result, there exists an integer \(u \in \{1, \dots , n\}\) such that \(\ell _k^2 > \sigma ^2 k^{5/4}/n\) if \(1 \le k < u\) and \(\ell _k^2 \le \sigma ^2 k^{5/4}/n\) if \(k \ge u\). This means that when \(1 \le k < u\) (which implies that \(u \ge 2\) and hence \(u-1 \ge u/2\))
It then follows that
Consequently, the proof will be complete if we show that
To prove this, we start by defining
for a constant \(A\) whose value will be specified shortly. Because \(\ell _u^2 \le \sigma ^2 u^{5/4}/n\), it follows that \(\ell _u^2 \le \delta _0/A\).
By (24), there exists a positive constant \(K\) depending only on the ratio \(c_1/c_2\) such that
for every \(r > 0\). Let \(H(r)\) denote the right hand side above. It is clear that \(H(r)/r^2\) is decreasing on \((0, \infty )\). As a result, a condition of the form \(\sqrt{n} \delta \ge C \sigma H(\sqrt{\delta })\) for some positive constant \(C\) holds for every \(\delta \ge \delta _0\) provided it holds for \(\delta = \delta _0\). Because \(\ell _u^2 \le \delta _0/A\), we have
Consequently,
We shall now use Theorem 4.1. Let \(C\) be the positive constant given by Theorem 4.1. By inequality (29), we can clearly choose \(A\) depending only on \(K\) and \(C\) so that \(\sqrt{n} \delta _0 \ge C \sigma H(\sqrt{\delta _0})\). Because \(H(r)/r^2\) is a decreasing function of \(r\), this choice of \(A\) also ensures that \(\sqrt{n} \delta \ge C \sigma H(\sqrt{\delta })\) for every \(\delta \ge \delta _0\). Thus by Theorem 4.1, we obtain
Note further, from the definition of \(\delta _0\), that \(\delta _0 \ge \sigma ^2 A/n\) which implies that the sum on the right hand side of (30) is dominated by the first term. We thus have
for a constant \(C_1\) depending upon only \(C\) and \(A\). The required risk bound (28) is now derived by integrating both sides of the above inequality with respect to \(\delta \) and using that \(\delta _0 \ge \sigma ^2 A/n\).
5 Non-adaptable convex functions
We showed that the risk of the convex LSE is always bounded from above by \(n^{-4/5}\) up to logarithmic factors in \(n\) and that, for convex functions that are well-approximable by piecewise affine functions with not too many pieces, the risk of the convex LSE is bounded by \(1/n\) up to logarithmic factors. The reason the risk is much smaller for these functions is that the local balls around them are small. We also showed in Theorem 3.2 that for convex functions with curvature, the local balls are genuinely large. Here, we show that for such convex functions, in a very strong sense, the rate \(n^{-4/5}\) cannot be improved by any estimator.
Recall the class of functions, \(\mathfrak {K}(a, b, \kappa _1, \kappa _2)\), that was defined in Theorem 3.2. The constants \(a, b, \kappa _1\) and \(\kappa _2\) will be fixed constants in this section and we shall therefore refer to \(\mathfrak {K}(a, b, \kappa _1, \kappa _2)\) by just \(\mathfrak {K}\). For every function \(\phi _0 \in \mathfrak {K}\), let us define the local neighborhood \(N(\phi _0)\) of \(\phi _0\) in \({\mathcal {C}}\) by
Recall that the constant \(c_1\) is defined in (3). We define the local minimax risk of \(\phi _0 \in \mathfrak {K}\) to be
the infimum above being over all possible estimators \(\hat{\phi }\). \({\mathfrak {R}}_n(\phi _0)\) represents the smallest possible risk under the knowledge that the unknown convex function \(\phi \) lies in the local neighborhood \(N(\phi _0)\) of \(\phi _0\).
In the next theorem, we shall show that the local minimax risk of every function \(\phi _0 \in \mathfrak {K}\) is bounded from below by a constant multiple of \(n^{-4/5}\). Observe that the \(\ell ^2\) diameter of \(N(\phi _0)\), defined as \(\sup _{\phi _1, \phi _2 \in N(\phi _0)} \ell ^2(\phi _1, \phi _2)\), is bounded from above by \(n^{-4/5}\) up to multiplicative factors that are independent of \(n\). Therefore, the supremum risk over \(N(\phi _0)\) of any reasonable estimator is bounded from above by \(n^{-4/5}\) up to multiplicative factors. The next theorem shows that if \(\phi _0 \in \mathfrak {K}\), then the supremum risk of every estimator is also bounded from below by \(n^{-4/5}\) up to multiplicative factors. Therefore, one cannot estimate \(\phi _0\) at a rate faster than \(n^{-4/5}\).
Theorem 5.1
(Lower bound) For every \(\phi _0 \in \mathfrak {K}(a, b, \kappa _1, \kappa _2)\), we have
provided \(n^2 \ge (2c_2)^{5/2} \kappa _2/(\sigma \sqrt{c_1})\).
Prototypical examples of functions in \(\mathfrak {K}\) include the power functions \(x^k\) for \(k \ge 2\), and the above theorem implies that no estimator can converge faster than \(n^{-4/5}\) for any of these functions. Note that the LSE attains the rate \(n^{-4/5}\) up to logarithmic factors in \(n\) for all functions \(\phi _0\). In particular, the LSE is rate optimal (up to logarithmic factors) for all functions in \(\mathfrak {K}\).
Prominent examples of functions not in the class \(\mathfrak {K}\) include the piecewise affine convex functions. As shown in Theorem 2.3, faster rates are possible for these functions. Essentially, the LSE converges at the parametric rate (up to logarithmic factors) for these functions.
The hardest functions to estimate under the global risk are therefore smooth convex functions. This is in sharp contrast to pointwise estimation where, for example, the cusp of the function \(f(x) = |x|\) is the hardest point to estimate. In fact, one would expect a rate of \(n^{-2/3}\) near such cusp points (see [6] for a detailed study of pointwise estimation, although they work with estimators that are different from the LSE). However, for global estimation, the region over which one gets such slower rates is small enough not to affect the overall near-parametric rate for piecewise affine convex functions.
Our proof of Theorem 5.1 is based on an application of Assouad’s lemma, the following version of which is a consequence of Lemma 24.3 of Van der Vaart ([31], p. 347). We start by introducing some notation. Let \({\mathbb P}_{\phi }\) denote the joint distribution of the observations \((x_1, Y_1), \dots , (x_n, Y_n)\) when the true convex function equals \(\phi \). For two probability measures \(P\) and \(Q\) having densities \(p\) and \(q\) with respect to a common measure \(\mu \), the total variation distance, \(\Vert P-Q\Vert _{TV}\), is defined as \(\int (|p-q|/2) d\mu \), and the Kullback-Leibler divergence, \(D(P\Vert Q)\), is defined as \(\int p \log (p/q) d\mu \). Pinsker’s inequality asserts
for all probability measures \(P\) and \(Q\).
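As a concrete illustration (not part of the argument), for \(P = N(0, \sigma ^2)\) and \(Q = N(\mu , \sigma ^2)\) both quantities have closed forms, and Pinsker's inequality can be verified directly:

```python
# Pinsker's inequality ||P - Q||_TV <= sqrt(D(P||Q)/2), checked for
# P = N(0, sigma^2) and Q = N(mu, sigma^2), where the closed forms are
#   ||P - Q||_TV = erf(|mu| / (2*sqrt(2)*sigma)),
#   D(P||Q)     = mu^2 / (2*sigma^2).
import math

def tv(mu, sigma=1.0):
    return math.erf(abs(mu) / (2 * math.sqrt(2) * sigma))

def kl(mu, sigma=1.0):
    return mu ** 2 / (2 * sigma ** 2)

for mu in (0.1, 0.5, 1.0, 3.0, 10.0):
    print(mu, round(tv(mu), 4), round(math.sqrt(kl(mu) / 2), 4))
    assert tv(mu) <= math.sqrt(kl(mu) / 2)
```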
Lemma 5.2
(Assouad) Let \(m\) be a positive integer and suppose that, for each \(\tau \in \{0, 1\}^m\), there is an associated convex function \(\phi _{\tau }\) in \(N(\phi _0)\). Then the following inequality holds:
where \({\Upsilon }(\tau , \tau ') {:=} \sum _{i} I\{\tau _i \ne \tau '_i\}\).
Proof of Theorem 5.1
Fix \(m \ge 1\) and consider the same construction \(\{\phi _{\tau }, \tau \in \{0, 1\}^m\}\) from the proof of Theorem 3.2. We saw there that
and that
for every \(\tau , \tau ' \in \{0, 1\}^m\) provided \(n \ge 4mc_2/(b-a)\). Also, whenever \({\Upsilon }(\tau , \tau ') = 1\), it is clear that
We use Lemma 7.1 to bound \(\ell ^2(\phi _0, \max (\phi _0, \alpha _i))\) from above. Specifically, we use inequality (39) with \(a = t_{i-1}\) and \(b = t_i\) to get
provided \(n \ge 4mc_1/(b-a)\). Thus under the assumption \(n \ge 4mc_2/(b-a)\), we have (35) and also (note that \(c_2 \ge c_1\))
We apply Assouad’s lemma to these functions \(\phi _{\tau }\). By inequality (32), we get
By the Gaussian assumption and independence of the errors, the Kullback-Leibler divergence \(D({\mathbb P}_{\phi _{\tau }}\Vert {\mathbb P}_{\phi _{\tau '}})\) can be easily calculated to be \(n \ell ^2(\phi _{\tau }, \phi _{\tau '})/(2 \sigma ^2)\). We therefore obtain
Thus by the application of (33), we obtain the following lower bound for \({\mathfrak {R}}_n(\phi _0)\):
provided \(\phi _{\tau } \in N(\phi _0)\) for each \(\tau \). We make the choice
The inequality (34) implies that \(\phi _{\tau } \in N(\phi _0)\). The inequality (31) follows easily from (36). The constraint \(n \ge 4c_2m/(b-a)\) translates to
The proof is complete. \(\square \)
6 Model misspecification
In this section, we evaluate the performance of the convex LSE \(\hat{\phi }_{ls}\) in the case when the unknown regression function (to be denoted by \(f_0\)) is not necessarily convex. Specifically, suppose that \(f_0\) is an unknown function on \([0, 1]\) that is not necessarily convex. We consider observations \((x_1, Y_1), \dots , (x_n, Y_n)\) from the model:
where \(x_1< \dots < x_n\) are fixed design points in \([0, 1]\) and \(\xi _1, \dots , \xi _n\) are independent normal variables with zero mean and variance \(\sigma ^2\).
The convex LSE \(\hat{\phi }_{ls}\) is defined in the same way as before as any convex function that minimizes the sum of squares criterion. Since the true function \(f_0\) is not necessarily convex, it turns out that the LSE is really estimating the convex projections of \(f_0\). Any convex function \(\phi _0\) on \([0, 1]\) that minimizes \(\ell ^2(f_0, \phi )\) over \(\phi \in {\mathcal {C}}\) is a convex projection of \(f_0\) i.e.,
Convex projections are not unique. However, because \(\{(\phi (x_1), \dots , \phi (x_n)): \phi \in {\mathcal {C}}\}\) is a closed convex subset of \({\mathbb R}^n\), it follows (see, for example, Stark and Yang ([28], Chapter 2)) that the vector \((\phi _0(x_1), \dots , \phi _0(x_n))\) is unique for every convex projection \(\phi _0\) and, moreover, we have the inequality:
The following is the main result of this section. It is the exact analogue of Theorem 2.1 for the case of model misspecification.
Theorem 6.1
Let \(\phi _0\) denote any convex projection of \(f_0\) and let \(R {:=} \max (1, {\mathfrak {L}}(\phi _0))\). There exists a positive constant \(C\) depending only on the ratio \(c_1/c_2\) such that
provided
We omit the proof of this theorem because it is similar to the proof of Theorem 2.1. It is based on the metric entropy results from Sect. 3 and the following result from the literature on the risk behavior of ERMs.
Theorem 6.2
Let \(\phi _0\) denote any convex projection of \(f_0\). Suppose \(H\) is a function on \((0, \infty )\) such that
and such that \(H(r)/r^2\) is decreasing on \((0, \infty )\). Then there exists a universal constant \(C\) such that
for every \(\delta > 0\) satisfying \(\sqrt{n} \delta \ge C \sigma H(\sqrt{\delta })\).
This result is very similar to Theorem 4.1. Its proof proceeds in the same way as the proof of Theorem 4.1 (see Van de Geer ([30], proof of Theorem 9.1)). We provide a sketch of its proof below for the convenience of the reader.
Proof of Theorem 6.2
Because \(\phi _0\) is convex, we have, by the definition of \(\hat{\phi }_{ls}\), that
Writing \(Y_i = f_0(x_i) + \xi _i\) and simplifying the above expression, we get
Inequality (37) applied with \(\phi = \hat{\phi }_{ls}\) gives
Combining the above two inequalities, we obtain
This is of the same form as the “basic inequality” of Van de Geer ([30], p. 148). From here, the proof proceeds just as the proof of Theorem 9.1 in [30]. \(\square \)
Theorem 6.1 shows that one gets adaptation in the misspecified case provided \(f_0\) has a convex projection that is well-approximable by a piecewise affine convex function with not too many pieces. An illuminating example of this occurs when \(f_0\) is a concave function. In this case, we show in Lemma 7.5 (stated and proved in Appendix) that \(\phi _0\) can be taken to be an affine function, i.e., \(\phi _0 \in {\mathcal {P}}_1\). As a result, it follows that if \(f_0\) is concave, then the risk of \(\hat{\phi }_{ls}\) measured from any convex projection of \(f_0\) is bounded from above by the parametric rate up to a logarithmic factor of \(n\).
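The conclusion of Lemma 7.5 can be illustrated numerically (a sketch with one arbitrarily chosen concave \(f_0\), not a proof): for concave data, the ordinary least squares affine fit already satisfies the first-order optimality conditions for projecting \((f_0(x_1), \dots , f_0(x_n))\) onto the cone of convex sequences, namely orthogonality of the residual to affine functions and nonpositive inner products with the hinge generators \((x - x_j)_+\):

```python
# Illustration of Lemma 7.5: for a concave f0, the least squares *affine*
# fit of (f0(x_1), ..., f0(x_n)) satisfies the optimality conditions for
# projection onto the cone of convex sequences: the residual r obeys
# <r, 1> = 0, <r, x> = 0, and <r, (x - x_j)_+> <= 0 for every knot x_j.
n = 50
x = [i / n for i in range(1, n + 1)]
f0 = [xi * (1 - xi) for xi in x]  # one concave example, chosen arbitrarily

# simple linear regression of f0 on x
mx = sum(x) / n
my = sum(f0) / n
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, f0)) \
    / sum((xi - mx) ** 2 for xi in x)
intercept = my - slope * mx
r = [yi - (intercept + slope * xi) for xi, yi in zip(x, f0)]

print(abs(sum(r)) < 1e-9)                                         # True
print(abs(sum(ri * xi for ri, xi in zip(r, x))) < 1e-9)           # True
print(all(sum(ri * max(0.0, xi - xj) for ri, xi in zip(r, x)) <= 1e-9
          for xj in x))                                           # True
```

Since the cone of convex sequences contains the affine subspace, these conditions identify the affine fit as the projection, matching the lemma.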
References
Atkinson, K.E.: An Introduction to Numerical Analysis, 2nd edn. Wiley, New York (1989)
Barron, A., Birgé, L., Massart, P.: Risk bounds for model selection via penalisation. Probab. Theory Relat. Fields 113, 301–413 (1999)
Birgé, L.: The Grenander estimator: a nonasymptotic approach. Ann. Stat. 17(4), 1532–1549 (1989)
Birgé, L., Massart, P.: Rates of convergence for minimum contrast estimators. Probab. Theory Relat. Fields 97, 113–150 (1993)
Bronshtein, E.M.: \(\epsilon \)-entropy of convex sets and functions. Sib. Math. J. 17, 393–398 (1976)
Cai, T., Low, M.: A framework for estimation of convex functions. Stat. Sin. (to appear). Available at http://www3.stat.sinica.edu.tw/ss_newpaper/SS-13-279_na.pdf (2014)
Carolan, C., Dykstra, R.: Asymptotic behavior of the Grenander estimator at density flat regions. Can. J. Stat. 27(3), 557–566 (1999)
Cator, E.: Adaptivity and optimality of the monotone least-squares estimator. Bernoulli 17(2), 714–735 (2011)
Chatterjee, S., Guntuboyina, A., Sen, B.: On risk bounds in isotonic and other shape restricted regression problems (2014, submitted). arXiv:1311.3765
Cule, M.L., Samworth, R.J., Stewart, M.I.: Maximum likelihood estimation of a multi-dimensional log-concave density (with discussion). J. R. Stat. Soci. Ser. B 72, 545–600 (2010)
Dryanov, D.: Kolmogorov entropy for classes of convex functions. Constr. Approx. 30, 137–153 (2009)
Dümbgen, L., Freitag, S., Jongbloed, G.: Consistency of concave regression with an application to current-status data. Math. Methods Stat. 13(1), 69–81 (2004)
Dykstra, R.L.: An algorithm for restricted least squares regression. J. Am. Stat. Assoc. 78(384), 837–842 (1983)
Fraser, D.A.S., Massam, H.: A mixed primal-dual bases algorithm for regression under inequality constraints. Application to concave regression. Scand. J. Stat. 16(1), 65–74 (1989)
Grenander, U.: On the theory of mortality measurement. II. Skand. Aktuarietidskr. 39, 125–153 (1956)
Groeneboom, P., Jongbloed, G., Wellner, J.A.: Estimation of a convex function: characterizations and asymptotic theory. Ann. Stat. 29(6), 1653–1698 (2001)
Groeneboom, P., Pyke, R.: Asymptotic normality of statistics based on the convex minorants of empirical distribution functions. Ann. Probab. 11(2), 328–345 (1983)
Guntuboyina, A., Sen, B.: Covering numbers for convex functions. IEEE Trans. Inf. Theory 59(4), 1957–1965 (2013)
Hanson, D.L., Pledger, G.: Consistency in concave regression. Ann. Stat. 4(6), 1038–1050 (1976)
Hildreth, C.: Point estimates of ordinates of concave functions. J. Am. Stat. Assoc. 49, 598–619 (1954)
Jankowski, H.: Convergence of linear functionals of the Grenander estimator under misspecification. Ann. Stat. 42(2), 625–653 (2014)
Mammen, E.: Nonparametric regression under qualitative smoothness assumptions. Ann. Stat. 19(2), 741–759 (1991)
Massart, P.: Concentration Inequalities and Model Selection. Lecture Notes in Mathematics, volume 1896. Springer, Berlin (2007)
Pollard, D.: Empirical Processes: Theory and Applications, volume 2 of NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics, Hayward, CA (1990)
Rigollet, P., Tsybakov, A.B.: Sparse estimation by exponential weighting. Stat. Sci. 27(4), 558–575 (2012)
Seijo, E., Sen, B.: Nonparametric least squares estimation of a multivariate convex regression function. Ann. Stat. 39, 1633–1657 (2011)
Seregin, A., Wellner, J.A.: Nonparametric estimation of multivariate convex-transformed densities. Ann. Stat. 38, 3751–3781 (2010)
Stark, H., Yang, Y.: Vector Space Projections. Wiley, New York (1998)
Van de Geer, S.: Hellinger-consistency of certain nonparametric maximum likelihood estimators. Ann. Stat. 21(1), 14–44 (1993)
Van de Geer, S.: Applications of Empirical Process Theory. Cambridge University Press, Cambridge (2000)
Van der Vaart, A.: Asymptotic Statistics. Cambridge University Press, Cambridge (1998)
van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Process: With Applications to Statistics. Springer, Berlin (1996)
Zhang, C.-H.: Risk bounds in isotonic regression. Ann. Stat. 30(2), 528–555 (2002)
Acknowledgments
The authors would like to thank Aritra Guha, Sasha Tsybakov, a referee and an Associate Editor for their helpful comments.
B. Sen was supported by NSF Grants DMS-1150435 and AST-1107373. A. Guntuboyina was supported by NSF Grant DMS-1309356.
Appendix: Some auxiliary results
Lemma 7.1
Fix \(\phi _0 \in {\mathcal {C}}\) and suppose there exists a subinterval \([a, b]\) of \([0, 1]\) such that \(\phi _0\) is twice differentiable on \([a, b]\). Let \(\alpha \) denote the linear interpolant of the points \((a, \phi _0(a))\) and \((b, \phi _0(b))\) i.e.,
1. If \(\phi _0''(x) \ge \kappa _1\) for all \(x \in [a, b]\), then
$$\begin{aligned} \ell ^2(\phi _0, \max (\phi _0, \alpha )) \ge \frac{\kappa _1^2(b-a)^5}{4{,}096 c_2} \qquad \text {when}\,\, n \ge 4c_2/(b-a). \end{aligned}$$(38)
2. If \(\phi _0''(x) \le \kappa _2\) for all \(x \in [a, b]\), then
$$\begin{aligned} \ell ^2(\phi _0, \max (\phi _0, \alpha )) \le \frac{\kappa _2^2(b-a)^5}{32 c_1} \qquad \text {when}\,\, n \ge 4c_1/(b-a). \end{aligned}$$(39)
Proof of Lemma 7.1
By convexity of \(\phi _0\), it is obvious that \(\alpha (x) \ge \phi _0(x)\) for \(x \in [a, b]\) and \(\alpha (x) \le \phi _0(x)\) for \(x \notin [a, b]\). We therefore have
where \(I\) denotes the indicator function. By standard error estimates for linear interpolation, for every \(x \in [a, b]\), there exists a point \(t_x \in [a, b]\) for which
Let us first prove (38). By (41) and the assumption \(\phi _0''(x) \ge \kappa _1\) for \(x \in [a, b]\), we have
Thus, from (40), we get
Clearly \((x-a)(b-x) \ge (b-a)^2/16\) for every \(x \in [(3a+b)/4, (a+3b)/4]\) and hence,
To get a lower bound on the number of points \(x_1, \dots , x_n\) that are contained in the interval \([(3a+b)/4, (a+3b)/4]\), we use Lemma 7.2 which gives
The condition \(n \ge 4c_2/(b-a)\) now implies that
which completes the proof of (38). We now turn to the proof of (39). By (41) and the assumption \(\phi _0''(x) \le \kappa _2\) for \(x \in [a, b]\), we have
Thus from (40), we write
Because \((x-a)(b-x) \le (b-a)^2/4\) for all \(x \in [a, b]\), we obtain
To obtain an upper bound on the number of points \(x_1, \dots , x_n\) that are contained in \([a, b]\), we again use Lemma 7.2 to get
When \(n \ge 4c_1/(b-a)\), we have
and this completes the proof. \(\square \)
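The two bounds of Lemma 7.1 can be spot-checked numerically (illustration only) for \(\phi _0(x) = x^2\), where \(\phi _0'' \equiv 2\) so \(\kappa _1 = \kappa _2 = 2\), on the uniform design \(x_i = i/n\) (so \(c_1 = c_2 = 1\)):

```python
# Spot check of Lemma 7.1 (illustration only) for phi0(x) = x^2 with
# kappa1 = kappa2 = 2 on the uniform design x_i = i/n (c1 = c2 = 1):
# the discrete loss must lie between the bounds (38) and (39).
n = 10_000
a, b = 0.2, 0.8          # n >= 4 c_2 / (b - a) holds easily

def phi0(x):
    return x * x

def alpha(x):
    # secant of phi0 through (a, phi0(a)) and (b, phi0(b))
    return phi0(a) + (phi0(b) - phi0(a)) * (x - a) / (b - a)

loss = sum(max(0.0, alpha(i / n) - phi0(i / n)) ** 2
           for i in range(1, n + 1)) / n

lower = 2 ** 2 * (b - a) ** 5 / 4096   # inequality (38), kappa1 = 2, c2 = 1
upper = 2 ** 2 * (b - a) ** 5 / 32     # inequality (39), kappa2 = 2, c1 = 1
print(lower <= loss <= upper)          # True
```

For this example the loss is close to \(\int _a^b (x-a)^2(b-x)^2 dx = (b-a)^5/30\), which indeed sits between the two bounds.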
Lemma 7.2
Let \(x_1 < \dots < x_n\) be fixed points in \([0, 1]\) satisfying \(c_1 \le n(x_i - x_{i-1}) \le c_2\) for all \(2 \le i \le n\). Let \([a, b]\) be a subinterval of \([0, 1]\) that contains \(m\) of the \(n\) real numbers \(x_1, \dots , x_n\). Then
Proof
Let \(x_0 \,{:=}\, \max \left( x_1 - c_2/n, 0 \right) \) and \(x_{n+1}\, {:=}\, \min \left( x_n + c_2/n, 1 \right) \). Let
for some \(0 \le k \le n-m\). Clearly
which gives the upper bound in (42). On the other hand,
which gives the lower bound in (42). The proof is complete. \(\square \)
Lemma 7.3
Let \(\phi \) be a convex function on \([0, 1]\) for which \(\int _0^1 |\phi (x)|^p dx \le 1\) for a fixed \(p \ge 1\). Then \(|\phi (y)| \le 2(1+p)^{1/p} \max \left( y^{-1/p}, (1-y)^{-1/p} \right) \) for all \(y \in (0, 1)\).
Proof
By the symmetry \(x \mapsto 1-x\), it suffices to prove the lemma for \(0 < y < 1/2\).
Suppose \(\phi (y) > y^{-1/p}\). Then, by convexity of \(\phi \), the condition \(\phi (x) > \phi (y)\) must hold either for all \(x \in (0, y)\) or for all \(x \in (y, 1)\). Therefore,
which gives a contradiction. Therefore \(\phi (y) \le y^{-1/p}\).
Suppose, if possible, that \(\phi (y) < -c y^{-1/p}\) for some \(c > 1\). We consider the following cases separately.
Case (\(i\)) Assume \(\phi (0) < -cy^{-1/p}\). In this case, by convexity of \(\phi \), it follows that \(\phi (x) < -cy^{-1/p}\) for all \(x \in [0, y]\). Therefore \(|\phi (x)| > cy^{-1/p}\) and thus
This contradicts \(c > 1\).
Case (\(ii\)) Here \(\phi (0) \ge -cy^{-1/p}\). We now consider the following two subcases:
1. \(\phi (0) \le 0\). Then \(\phi (x) \le 0\) for all \(x \in [0, y]\). For each \(0 \le x \le y\), we have, by convexity,
$$\begin{aligned} \phi (x) \le \left( 1-\frac{x}{y} \right) \phi (0) + \frac{x}{y} \phi (y) \le \frac{x}{y} \phi (y). \end{aligned}$$
Thus \(y \phi (x) \le x \phi (y) \le 0\) for each \(0 \le x \le y\). As a result,
$$\begin{aligned} y^p |\phi (x)|^p \ge x^p |\phi (y)|^p \qquad \text {for}\, 0 \le x \le y. \end{aligned}$$
Integrating both sides from \(x = 0\) to \(x = y\), we obtain
$$\begin{aligned} y^p \int _0^y |\phi (x)|^p dx \ge |\phi (y)|^p \frac{y^{p+1}}{p+1} \end{aligned}$$
which implies that \(|\phi (y)|^{p} \le (p+1)/y\), i.e., \(|\phi (y)| \le (1+p)^{1/p} y^{-1/p}\), which is a contradiction if \(c > (1+p)^{1/p}\).
2. \(\phi (0) > 0\). Let \(z \in (0, y)\) be such that \(\phi (z) = 0\). For \(x < z\), we can write, by convexity,
$$\begin{aligned} 0 = \phi (z) \le \frac{y-z}{y-x}\phi (x) + \frac{z-x}{y-x} \phi (y) \end{aligned}$$
which implies that
$$\begin{aligned} 0 > \phi (y) \ge \frac{y-z}{x-z} \phi (x). \end{aligned}$$
As a result, \(|z-x|^{p} |\phi (y)|^p \le |y-z|^p |\phi (x)|^p\) for \(0 < x < z\). Integrating both sides from \(x = 0\) to \(x = z\), we get
$$\begin{aligned} |\phi (y)|^p \frac{z^{p+1}}{p+1} \le |y-z|^p \int _0^z |\phi (x)|^p dx. \end{aligned}$$(43)
For \(z < x < y\), again, by convexity, we write
$$\begin{aligned} \phi (x) \le \frac{x-z}{y-z} \phi (y) + \frac{y-x}{y-z} \phi (z) = \frac{x-z}{y-z} \phi (y) \le 0. \end{aligned}$$
As a result, \(|y-z|^p |\phi (x)|^p \ge |x-z|^p |\phi (y)|^p\). Integrating from \(x = z\) to \(x = y\), we get
$$\begin{aligned} |\phi (y)|^p \frac{(y-z)^{p+1}}{p+1} \le |y-z|^p \int _z^y |\phi (x)|^p dx. \end{aligned}$$(44)
Adding the two inequalities (43) and (44), we obtain
$$\begin{aligned} \frac{|\phi (y)|^p}{p+1} \left( z^{p+1} + (y-z)^{p+1} \right) \le |y-z|^p \int _0^y |\phi (x)|^p dx < y^p. \end{aligned}$$
Now
$$\begin{aligned} z^{p+1} + (y-z)^{p+1} \ge \min _{0< u < y} \left( u^{p+1} + (y-u)^{p+1} \right) = 2^{-p} y^{p+1}. \end{aligned}$$
Combining, we obtain
$$\begin{aligned} |\phi (y)| < 2 (1+p)^{1/p} y^{-1/p} \end{aligned}$$
which results in a contradiction if \(c \ge 2(1+p)^{1/p}\).
\(\square \)
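The minimization step used in the last display is easy to verify numerically. Below is a small sanity check (ours, not part of the paper) confirming that \(u \mapsto u^{p+1} + (y-u)^{p+1}\) is minimized over \((0, y)\) at \(u = y/2\) with value \(2^{-p} y^{p+1}\):

```python
def pointwise_min(p, y, grid=10000):
    """Brute-force minimum of u^(p+1) + (y-u)^(p+1) over u in (0, y)."""
    best = float("inf")
    for k in range(1, grid):
        u = y * k / grid
        best = min(best, u ** (p + 1) + (y - u) ** (p + 1))
    return best

# the minimum is attained at u = y/2 and equals 2^(-p) * y^(p+1)
for p in (1, 2, 3.5):
    for y in (0.25, 1.0):
        claimed = 2 ** (-p) * y ** (p + 1)
        assert abs(pointwise_min(p, y) - claimed) < 1e-6 * claimed
```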
Lemma 7.4
(Interpolation Lemma) Fix \(x_1 < x_2 < \dots < x_n\) and suppose that \(c_1 \le n(x_i - x_{i-1}) \le c_2\) for all \(2 \le i \le n\). For every function \(f\) on \([x_1, x_n]\), associate another function \(\tilde{f}\) on \([x_1, x_n]\) by
$$\begin{aligned} \tilde{f}(x) {:=} \frac{x_{i+1} - x}{x_{i+1} - x_i} f(x_i) + \frac{x - x_i}{x_{i+1} - x_i} f(x_{i+1}) \qquad \text {for}\, x \in [x_i, x_{i+1}], \end{aligned}$$where \(i = 1, \dots , n-1\). Then for every pair of functions \(f\) and \(g\) on \([x_1, x_n]\), we have
$$\begin{aligned} \frac{c_1}{6n} \sum _{i=1}^n \left( f(x_i) - g(x_i) \right) ^2 \le \int _{x_1}^{x_n} \left( \tilde{f}(x) - \tilde{g}(x) \right) ^2 dx \le \frac{c_2}{n} \sum _{i=1}^n \left( f(x_i) - g(x_i) \right) ^2. \end{aligned}$$
Proof
It is elementary to check that for every \(1 \le i \le n-1\), we have
$$\begin{aligned} \int _{x_i}^{x_{i+1}} \left( \tilde{f}(x) - \tilde{g}(x) \right) ^2 dx = \frac{x_{i+1} - x_i}{3} \left( \alpha ^2 + \alpha \beta + \beta ^2 \right) , \end{aligned}$$where \(\alpha {:=} f(x_i) - g(x_i)\) and \(\beta {:=} f(x_{i+1}) - g(x_{i+1})\). Using the inequalities
$$\begin{aligned} \frac{\alpha ^2 + \beta ^2}{2} \le \alpha ^2 + \alpha \beta + \beta ^2 \le \frac{3 \left( \alpha ^2 + \beta ^2 \right) }{2} \qquad \text {and}\, \qquad \frac{c_1}{n} \le x_{i+1} - x_i \le \frac{c_2}{n}, \end{aligned}$$we obtain
$$\begin{aligned} \frac{c_1}{6n} \left( \alpha ^2 + \beta ^2 \right) \le \int _{x_i}^{x_{i+1}} \left( \tilde{f}(x) - \tilde{g}(x) \right) ^2 dx \le \frac{c_2}{2n} \left( \alpha ^2 + \beta ^2 \right) . \end{aligned}$$Adding these inequalities from \(i = 1\) to \(i = n-1\) and noting that each term \(\left( f(x_j) - g(x_j) \right) ^2\) appears at least once and at most twice in the resulting sum, we deduce
$$\begin{aligned} \frac{c_1}{6n} \sum _{i=1}^n \left( f(x_i) - g(x_i) \right) ^2 \le \int _{x_1}^{x_n} \left( \tilde{f}(x) - \tilde{g}(x) \right) ^2 dx \le \frac{c_2}{n} \sum _{i=1}^n \left( f(x_i) - g(x_i) \right) ^2 \end{aligned}$$
which yields the desired result. \(\square \)
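As a quick numerical sanity check (our own sketch, not part of the paper), the two-sided bound can be verified on the equispaced grid \(x_i = i/n\), for which \(c_1 = c_2 = 1\); here \(\tilde{f}\) is taken to be the piecewise-linear interpolant of \(f\) at the grid points:

```python
# Sanity check of the interpolation bounds on an equispaced grid (c1 = c2 = 1).
n = 20
xs = [i / n for i in range(1, n + 1)]            # x_1 < ... < x_n, gaps 1/n
f = [x * x for x in xs]                          # a convex f
g = [abs(x - 0.3) for x in xs]                   # a convex g
sq = sum((a - b) ** 2 for a, b in zip(f, g))     # sum_i (f(x_i) - g(x_i))^2

# The difference of the two piecewise-linear interpolants is linear on each
# cell, so its square is quadratic and Simpson's rule integrates it exactly.
integral = 0.0
for i in range(n - 1):
    h = xs[i + 1] - xs[i]
    d0, d1 = f[i] - g[i], f[i + 1] - g[i + 1]
    dm = 0.5 * (d0 + d1)                         # difference at the cell midpoint
    integral += h * (d0 ** 2 + 4 * dm ** 2 + d1 ** 2) / 6

# two-sided bound with c1 = c2 = 1
assert sq / (6 * n) <= integral <= sq / n
```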
Remark 7.1
Observe that if \(f\) is a convex function on \([a, b]\), then \(\tilde{f}\) is also convex on \([a, b]\).
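This is easy to confirm numerically: a piecewise-linear interpolant is convex precisely when its successive slopes are nondecreasing. A small check (ours, not from the paper) with the convex function \(e^x\):

```python
import math

# Grid values of a convex function; the piecewise-linear interpolant of
# convex values is convex iff the cell slopes are nondecreasing.
n = 30
xs = [i / n for i in range(1, n + 1)]
vals = [math.exp(x) for x in xs]                 # exp is convex
slopes = [(vals[i + 1] - vals[i]) / (xs[i + 1] - xs[i]) for i in range(n - 1)]
assert all(s1 <= s2 + 1e-12 for s1, s2 in zip(slopes, slopes[1:]))
```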
Lemma 7.5
The set of all convex projections of a concave function \(f_0\) includes an affine function.
Proof
We prove this result by contradiction. Suppose that no convex projection of \(f_0\) is affine. Let \(\phi _0\) be the continuous piecewise affine convex projection of \(f_0\). For a function \(g:[0,1] \rightarrow {\mathbb R}\) we define \(g(0+) {:=} \lim _{x \rightarrow 0 +} g(x)\) and \(g(1-) {:=} \lim _{x \rightarrow 1 -} g(x)\). This notation is necessary as \(f_0\) need not be continuous at the boundary points \(\{0,1\}\).
- Case (\(i\)): Suppose that \(f_0(0+) \ge \phi _0(0)\) and \(f_0(1-) \ge \phi _0(1)\). Then the affine function \(\tilde{\phi }_0\) obtained by joining \((0,\phi _0(0))\) and \((1,\phi _0(1))\), i.e., \(\tilde{\phi }_0(x) = (1-x)\phi _0(0) + x \phi _0(1)\), for \(x \in [0,1]\), lies in-between \(\phi _0\) and \(f_0\) (as \(f_0\) is concave) and \(\ell ^2(\phi _0,f_0) \ge \ell ^2(\tilde{\phi }_0,f_0)\), giving rise to a contradiction.
- Case (\(ii\)): Suppose that \(f_0(0+) < \phi _0(0)\) and \(f_0(1-) \ge \phi _0(1)\). Then there is a point \(u \in (0,1)\) such that \(f_0(u) = \phi _0(u)\). Let us define \(\tilde{\phi }_0\) to be the affine function joining \((u,\phi _0(u))\) and \((1,\phi _0(1))\). Again, \(\tilde{\phi }_0\) lies in-between \(\phi _0\) and \(f_0\) and \(\ell ^2(\phi _0,f_0) \ge \ell ^2(\tilde{\phi }_0,f_0)\), thus giving rise to a contradiction.
- Case (\(iii\)): Suppose that \(f_0(0+) \ge \phi _0(0)\) and \(f_0(1-) < \phi _0(1)\). A similar analysis as in (\(ii\)) by looking at the affine function obtained by joining \((0,\phi _0(0))\) and \((v,\phi _0(v))\) where \(\phi _0(v) = f_0(v)\), \(v \in (0,1)\), gives a contradiction.
- Case (\(iv\)): Suppose that \(f_0(0+) < \phi _0(0)\) and \(f_0(1-) < \phi _0(1)\). Suppose that there are two points \(u_0, u_1 \in (0,1)\) such that \(f_0(u_i) = \phi _0(u_i)\), for \(i = 0,1\). Then define \(\tilde{\phi }_0\) to be the affine function joining \((u_0, \phi _0(u_0))\) and \((u_1, \phi _0(u_1))\). Again, \(\tilde{\phi }_0\) lies in-between \(\phi _0\) and \(f_0\) and \(\ell ^2(\phi _0,f_0) \ge \ell ^2(\tilde{\phi }_0,f_0)\), thus giving rise to a contradiction. Suppose that \(f_0\) and \(\phi _0\) touch at just one point \(v \in (0,1)\). Then defining \(\tilde{\phi }_0\) to be the affine function that passes through \((v,\phi _0(v))\) and is a sub-gradient to both \(\phi _0\) and \(f_0\) at \(v\) yields a contradiction. If \(f_0\) and \(\phi _0\) do not touch at all, then defining \(\tilde{\phi }_0\) to be any affine function lying between \(\phi _0\) and \(f_0\) shows that \(\ell ^2(\phi _0,f_0) \ge \ell ^2(\tilde{\phi }_0,f_0)\). This completes the proof. \(\square \)
Remark 7.2
Note that if \(n > 2\), the convex projection of a concave \(f_0\) is in fact unique on \((0,1)\) and affine.
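As a toy illustration of this remark (our own example, not from the paper): for \(n = 3\) equispaced design points, the discrete analogue of convexity is the single halfspace constraint \(v_1 - 2v_2 + v_3 \ge 0\), and the Euclidean projection of strictly concave data onto it lands on the boundary of the halfspace, i.e., on an affine sequence:

```python
# Toy discrete analogue: for n = 3 equispaced points, convexity of
# (v1, v2, v3) means v1 - 2*v2 + v3 >= 0, a halfspace. Projecting strictly
# concave data onto it hits the boundary, so the projection is affine.
def project_convex3(y):
    """Euclidean projection of y = (y1, y2, y3) onto {v : v1 - 2 v2 + v3 >= 0}."""
    a = (1.0, -2.0, 1.0)
    viol = sum(ai * yi for ai, yi in zip(a, y))
    if viol >= 0:                        # already convex: nothing to do
        return list(y)
    norm_sq = sum(ai * ai for ai in a)   # = 6
    return [yi - (viol / norm_sq) * ai for yi, ai in zip(y, a)]

y = [0.0, 1.0, 0.0]                      # strictly concave data
v = project_convex3(y)
# the second difference of the projection vanishes: the projection is affine
assert abs(v[0] - 2 * v[1] + v[2]) < 1e-12
```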
Guntuboyina, A., Sen, B. Global risk bounds and adaptation in univariate convex regression. Probab. Theory Relat. Fields 163, 379–411 (2015). https://doi.org/10.1007/s00440-014-0595-3