1 Introduction

We consider the problem of estimating an unknown convex function \(\phi _0\) on \([0, 1]\) from observations \((x_1, Y_1), \dots , (x_n, Y_n)\) drawn according to the model

$$\begin{aligned} Y_i = \phi _0(x_i) + \xi _i, \qquad \text {for}\quad i = 1, \dots , n, \end{aligned}$$
(1)

where \(x_1, \dots , x_n\) are fixed points in \([0, 1]\) and \(\xi _1, \dots , \xi _n\) represent independent mean zero errors. Convex regression is an important problem in the general area of nonparametric estimation under shape constraints. It often arises in applications: typical examples appear in economics (indirect utility, production or cost functions), medicine (dose response experiments) and biology (growth curves).

The most natural and commonly used estimator for \(\phi _0\) is the full least squares estimator (LSE), \(\hat{\phi }_{ls}\), which is defined as any minimizer of the LS criterion, i.e.,

$$\begin{aligned} \hat{\phi }_{ls} \in \mathop {\mathrm{argmin}}_{\psi \in {\mathcal {C}}} \sum _{i=1}^n \left( Y_i - \psi (x_i) \right) ^2, \end{aligned}$$

where \({\mathcal {C}}\) denotes the set of all real-valued convex functions on \([0, 1]\). \(\hat{\phi }_{ls}\) is not unique even though its values at the data points \(x_1, \dots , x_n\) are unique. This follows from the fact that \((\hat{\phi }_{ls}(x_1),\ldots , \hat{\phi }_{ls}(x_n)) \in {\mathbb R}^n\) is the projection of \((Y_1,\ldots , Y_n)\) onto a closed convex cone. A simple linear interpolation of these values leads to a unique continuous and piecewise linear convex function with possible knots at the data points, which can be treated as the canonical LSE. The canonical LSE can be easily computed by solving a quadratic program with \((n-2)\) linear constraints.
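As an illustration (not part of the original analysis), the following minimal sketch computes these fitted values by solving the quadratic program just described; it assumes the cvxpy package, and the function and variable names are ours.

```python
# Minimal sketch: fitted values of the canonical convex LSE at sorted design points.
# Convexity of the fit is encoded by (n-2) linear constraints (nondecreasing slopes).
import numpy as np
import cvxpy as cp

def convex_lse(x, y):
    """Return (theta_1, ..., theta_n), the LSE fitted values at x_1 < ... < x_n."""
    n = len(x)
    theta = cp.Variable(n)
    # Slope between consecutive fitted points must be nondecreasing:
    # (theta[i+1]-theta[i])/(x[i+1]-x[i]) >= (theta[i]-theta[i-1])/(x[i]-x[i-1]).
    constraints = [
        (theta[i + 1] - theta[i]) * (x[i] - x[i - 1])
        >= (theta[i] - theta[i - 1]) * (x[i + 1] - x[i])
        for i in range(1, n - 1)
    ]
    cp.Problem(cp.Minimize(cp.sum_squares(y - theta)), constraints).solve()
    return theta.value

# Example with a convex truth and Gaussian noise on a regular grid.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = (x - 0.3) ** 2 + 0.1 * rng.standard_normal(x.size)
fitted = convex_lse(x, y)
```

The canonical LSE on all of \([0, 1]\) is then the piecewise linear interpolation of the points \((x_i, \hat{\phi }_{ls}(x_i))\).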

Unlike other methods for function estimation such as those based on kernels which depend on tuning parameters such as smoothing bandwidths, the LSE has the obvious advantage of being completely automated. It was first proposed by [20] for the estimation of production functions and Engel curves. Algorithms for its computation can be found in [13] and [14]. The theoretical behavior of the LSE has been investigated by many authors. Its consistency in the supremum norm on compact sets in the interior of the support of the covariate was proved by [19]. Mammen [22] derived the rate of convergence of the LSE and its derivative at a fixed point, while [16] proved consistency and derived its asymptotic distribution at a fixed point of positive curvature. Dümbgen et al. [12] showed that the supremum distance between the LSE and \(\phi _0\), assuming twice differentiability, on a compact interval in the interior of the support of the design points is of the order \((\log (n)/n)^{2/5}\).

In spite of all the above mentioned work, surprisingly, not much is known about the global risk behavior of the LSE under the natural loss function:

$$\begin{aligned} \ell ^2(\phi , \psi ) {:=} \frac{1}{n} \sum _{i=1}^n \left( \phi (x_i) - \psi (x_i) \right) ^2. \end{aligned}$$
(2)

This is the main focus of our paper. In particular, we satisfactorily address the following questions in the paper: At what rate does the risk of the LSE \(\hat{\phi }_{ls}\) decrease to zero? How does this rate of convergence depend on the underlying true function \(\phi _0 \in {\mathcal {C}}\); i.e., does the LSE exhibit faster rates of convergence for certain functions \(\phi _0\)? How does \(\hat{\phi }_{ls}\) behave, in terms of its risk, when the model is misspecified, i.e., the regression function is not convex?

We assume, throughout the paper, that, in (1), \(x_1 < x_2 < \dots < x_n\) are fixed design points in \([0, 1]\) satisfying

$$\begin{aligned} c_1 \le n(x_i - x_{i-1}) \le c_2, \quad \text{ for }\quad i =2,3,\ldots , n, \end{aligned}$$
(3)

where \(c_1\) and \(c_2\) are positive constants, and that \(\xi _1, \ldots , \xi _n\) are independent normally distributed random variables with mean zero and variance \(\sigma ^2 > 0\). In fact, all the results in our paper, excluding those in Sect. 5, hold under the milder assumption of subgaussianity of the errors. Our contributions in this paper can be summarized as follows.

  1.

    We establish, for the first time, a finite sample upper bound for the risk of the LSE \(\hat{\phi }_{ls}\) under the loss \(\ell ^2\) in Sect. 2. The analysis of the risk behavior of \(\hat{\phi }_{ls}\) is complicated due to two facts: (1) \(\hat{\phi }_{ls}\) does not have a closed form expression, and (2) the class \({\mathcal {C}}\) (over which \(\hat{\phi }_{ls}\) minimizes the LS criterion) is not totally bounded. Our risk upper bound involves a minimum of two terms; see Theorem 2.1. The first term says that the risk \({\mathbb E}_{\phi _0} \ell ^2(\hat{\phi }_{ls}, \phi _0)\) is bounded by \(n^{-4/5}\) up to logarithmic multiplicative factors in \(n\). The second term in the risk bound says that the risk is bounded from above by a combination of the parametric rate \(1/n\) and an approximation term that dictates how well \(\phi _0\) is approximated by a piecewise affine convex function (up to logarithmic multiplicative factors). Our risk bound, in addition to establishing the \(n^{-4/5}\) worst case bound, implies that \(\hat{\phi }_{ls}\) adapts to piecewise affine convex functions with not too many pieces (see Sect. 2 for the precise definition). This is remarkable because the LSE minimizes the LS criterion over all convex functions with no explicit special treatment for piecewise affine convex functions.

  2.

    In the process of proving our risk bound for the LSE, we prove new results for the metric entropy of balls in the space of convex functions. One of the standard approaches to finding risk bounds for procedures based on empirical risk minimization (ERM) says that the risk behavior of \(\hat{\phi }_{ls}\) is determined by the metric entropy of balls in the parameter space around the true function (see, for example, [4, 23, 30, 32]). The ball around \(\phi _0\) in \({\mathcal {C}}\) of radius \(r\) is defined as

    $$\begin{aligned} S(\phi _0, r)\, {:=}\, \{\phi \in {\mathcal {C}}: \ell ^2(\phi , \phi _0) \le r^2 \}. \end{aligned}$$
    (4)

    Recall that, for a subset \({\mathcal {F}}\) of a metric space \(({\mathcal {X}},\rho )\), the \(\epsilon \)-covering number of \({\mathcal {F}}\) under the metric \(\rho \) is denoted by \(M(\epsilon , {\mathcal {F}}, \rho )\) and is defined as the smallest number of closed balls of radius \(\epsilon \) whose union contains \({\mathcal {F}}\). Metric entropy is the logarithm of the covering number. We prove new upper bounds for the metric entropy of \(S(\phi _0, r)\) in Sect. 3. These bounds depend crucially on \(\phi _0\). When \(\phi _0\) is a piecewise affine function with not too many pieces, the metric entropy of \(S(\phi _0, r)\) is much smaller than when \(\phi _0\) has a second derivative that is bounded from above and below by positive constants. This difference in the sizes of the balls \(S(\phi _0, r)\) is the reason why \(\hat{\phi }_{ls}\) exhibits different rates for different convex functions \(\phi _0\). It should be noted that the convex functions in \(S(\phi _0, r)\) are not uniformly bounded and hence existing results on the metric entropy of classes of convex functions (see [5, 11, 18]) cannot be used directly to bound the metric entropy of \(S(\phi _0, r)\). Our main risk bound, Theorem 2.1, is proved in Sect. 4 using the developed metric entropy bounds for \(S(\phi _0, r)\). These new bounds are also of independent interest.

  3.

    We investigate the optimality of the rate \(n^{-4/5}\). We show that for convex functions \(\phi _0\) whose curvature is bounded from above and below by positive constants on a sub-interval of \([0, 1]\), the rate \(n^{-4/5}\) cannot be improved (in a very strong sense) by any other estimator. Specifically, we show that a certain “local” minimax risk (see Sect. 5 for the details), under the loss \(\ell ^2\), is bounded from below by \(n^{-4/5}\). This shows, in particular, that the same holds for the global minimax rate for this problem.

  4.

    We also provide risk bounds in the case of model misspecification where we do not assume that the underlying regression function in (1) is convex. In this case we prove the exact same upper bounds for \({\mathbb E}_{\phi _0} \ell ^2(\hat{\phi }_{ls}, \phi _0)\) where \(\phi _0\) now denotes any convex projection (defined in Sect. 6) of the unknown true regression function. To the best of our knowledge, this is the first result on global risk bounds for the estimation of convex regression functions under model misspecification. Some auxiliary results about convex functions useful in the proofs of the main results are deferred to “Appendix”.

Two special features of our analysis are that: (1) all our risk bounds are non-asymptotic, and (2) none of our results uses any (explicit) characterization of the LSE (except that it minimizes the least squares criterion), as a result of which our approach can, in principle, be extended to more complex ERM procedures, including shape restricted function estimation in higher dimensions; see e.g., [10, 26, 27].

The adaptive behavior of the LSE established here implies, in particular, that the LSE converges at different rates depending on the true convex function \(\phi _0\). We believe that such adaptation is rather unique to problems of shape restricted function estimation and is currently not very well understood. For example, in the related problem of monotone function estimation, which has an enormous literature (see e.g., [3, 15, 33] and the references therein), the only result on adaptive global behavior of the LSE is found in [17]; also see [29]. This result, however, holds only in an asymptotic sense and only when the true function is a constant. Results on the pointwise adaptive behavior of the LSE in monotone function estimation are more prevalent and can be found, for example, in [7, 8, 21]. For convex function estimation, as far as we are aware, adaptation behavior of the LSE has not been studied before. Adaptation behavior for the estimation of a convex function at a single point has been recently studied by [6] but they focus on different estimators that are based on local averaging techniques.

2 Risk analysis of the LSE

Before stating our main risk bound, we need some notation. Recall that \({\mathcal {C}}\) denotes the set of all real-valued convex functions on \([0, 1]\). For \(\phi \in {\mathcal {C}}\), let \({\mathfrak {L}}(\phi )\) denote the “distance” of \(\phi \) from affine functions. More precisely,

$$\begin{aligned} {\mathfrak {L}}(\phi ) {:=} \inf \left\{ \ell (\phi , \tau ) : \tau \text { is affine on } [0, 1] \right\} \!. \end{aligned}$$

Note that \({\mathfrak {L}}(\phi ) = 0\) when \(\phi \) is affine.

We also need the notion of piecewise affine convex functions. A convex function \(\alpha \) on \([0, 1]\) is said to be piecewise affine if there exists an integer \(k\) and points \(0 = t_0 < t_1 < \dots < t_k = 1\) such that \(\alpha \) is affine on each of the \(k\) intervals \([t_{i-1}, t_i]\) for \(i = 1, \dots , k\). We define \(k(\alpha )\) to be the smallest such \(k\). Let \({\mathcal {P}}_{k}\) denote the collection of all piecewise affine convex functions with \(k(\alpha ) \le k\) and let \({\mathcal {P}}\) denote the collection of all piecewise affine convex functions on \([0, 1]\).

We are now ready to state our main upper bound for the risk of \(\hat{\phi }_{ls}\).

Theorem 2.1

Let \(R {:=} \max (1, {\mathfrak {L}}(\phi _0))\). There exists a positive constant \(C\) depending only on the ratio \(c_1/c_2\) such that

$$\begin{aligned}&{\mathbb E}_{\phi _0} \ell ^2(\hat{\phi }_{ls}, \phi _0) \le C \left( \log \frac{en}{2c_1} \right) ^{5/4}\\&\quad \times \min \left[ \left( \frac{\sigma ^2 \sqrt{R}}{n} \right) ^{4/5}, \inf _{\alpha \in {\mathcal {P}}} \left( \ell ^2(\phi _0, \alpha ) + \frac{\sigma ^2 k^{5/4}(\alpha )}{n} \right) \right] \end{aligned}$$

provided

$$\begin{aligned} n \ge C \frac{\sigma ^2}{R^2} \left( \log \frac{en}{2c_1} \right) ^{5/4}. \end{aligned}$$

Because of the presence of the minimum in the risk bound presented above, the bound actually involves two parts. We isolate these two parts in the following two separate results. The first result says that the risk is bounded by \(n^{-4/5}\) up to multiplicative factors that are logarithmic in \(n\). The second result says that the risk is bounded from above by a combination of the parametric rate \(1/n\) and an approximation term that dictates how well \(\phi _0\) is approximated by a piecewise affine convex function (up to logarithmic multiplicative factors). The implications of these two theorems are explained in the remarks below. It is clear that Theorems 2.2 and 2.3 together imply Theorem 2.1. We therefore prove Theorem 2.1 by proving Theorems 2.2 and 2.3 separately in Sect. 4.

Theorem 2.2

Let \(R {:=} \max (1, {\mathfrak {L}}(\phi _0))\). There exists a positive constant \(C\) depending only on the ratio \(c_1/c_2\) such that

$$\begin{aligned} {\mathbb E}_{\phi _0} \ell ^2 \left( \hat{\phi }_{ls}, \phi _0 \right) \le C \left( \log \frac{en}{2c_1} \right) \left( \frac{\sigma ^2 \sqrt{R}}{n} \right) ^{4/5} \end{aligned}$$

whenever

$$\begin{aligned} n \ge C \left( \log \frac{en}{2c_1} \right) ^{5/4} \frac{\sigma ^2}{R^2}. \end{aligned}$$

Theorem 2.3

There exists a constant \(C\), depending only on the ratio \(c_1/c_2\), such that

$$\begin{aligned} {\mathbb E}_{\phi _0} \ell ^2(\phi _0, \hat{\phi }_{ls}) \le C \left( \log \frac{en}{2c_1} \right) ^{5/4} \inf _{\alpha \in {\mathcal {P}}} \left( \ell ^2(\phi _0, \alpha ) + \frac{\sigma ^2 k^{5/4}(\alpha )}{n}{} \right) \end{aligned}$$
(5)

for all \(n\).

The following remarks will better clarify the meaning of these results. The first remark below is about Theorem 2.2. The last three remarks are about Theorem 2.3.

Remark 2.1

(Why convexity is similar to second order smoothness) From the classical theory of nonparametric statistics, it follows that the rate \(n^{-4/5}\) is the same rate that one obtains for the estimation of twice differentiable functions (satisfying a condition such as \(\sup _{x \in [0, 1]} |\phi _0''(x)| \le B\)) on the unit interval. In Theorem 2.2, we prove that \(\hat{\phi }_{ls}\) achieves the same rate (up to log factors) when the true function is convex, under no assumptions whatsoever on the smoothness of the function. Therefore, the constraint of convexity is similar to the constraint of second order smoothness. This has long been believed to be true but, to the best of our knowledge, Theorem 2.2 is the first result to rigorously prove this via a nonasymptotic risk bound for the estimator \(\hat{\phi }_{ls}\) with no assumption of smoothness.

Remark 2.2

(Parametric rates for piecewise affine convex functions) Theorem 2.3 implies that \(\hat{\phi }_{ls}\) has the parametric rate for estimating piecewise affine convex functions. Indeed, suppose \(\phi _0\) is a piecewise affine convex function on \([0, 1]\) i.e., \(\phi _0 \in {\mathcal {P}}\). Then using \(\alpha = \phi _0\) in (5), we have the risk bound

$$\begin{aligned} {\mathbb E}_{\phi _0} \ell ^2(\phi _0, \hat{\phi }_{ls}) \le C \left( \log \frac{en}{2c_1} \right) ^{5/4} \frac{\sigma ^2 k^{5/4}(\phi _0)}{n}. \end{aligned}$$

This is the parametric rate \(1/n\) up to logarithmic factors and is of course much smaller than the nonparametric rate \(n^{-4/5}\) given in Theorem 2.2. Therefore, \(\hat{\phi }_{ls}\) adapts to each class \({\mathcal {P}}_k\) of piecewise affine convex functions.
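The adaptation can also be seen numerically. The following small simulation sketch (ours, not part of the paper) compares the loss \(\ell ^2(\hat{\phi }_{ls}, \phi _0)\) for an affine truth and for a curved truth; it again assumes the cvxpy package, and the noise level, sample sizes and number of replications are illustrative choices.

```python
# Sketch: the LSE's loss decays near the parametric rate 1/n for an affine truth
# and near the nonparametric rate n^{-4/5} for a strictly convex truth (up to logs).
import numpy as np
import cvxpy as cp

def lse_fit(x, y):
    # Same quadratic program as in the sketch from the Introduction.
    n = len(x)
    theta = cp.Variable(n)
    cons = [(theta[i + 1] - theta[i]) * (x[i] - x[i - 1])
            >= (theta[i] - theta[i - 1]) * (x[i + 1] - x[i]) for i in range(1, n - 1)]
    cp.Problem(cp.Minimize(cp.sum_squares(y - theta)), cons).solve()
    return theta.value

def mean_loss(phi, n, sigma=0.2, reps=20, seed=0):
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    losses = []
    for _ in range(reps):
        y = phi(x) + sigma * rng.standard_normal(n)
        losses.append(np.mean((lse_fit(x, y) - phi(x)) ** 2))
    return float(np.mean(losses))

for n in (50, 200, 800):
    affine = mean_loss(lambda t: 1.0 + 2.0 * t, n)         # phi_0 in P_1
    curved = mean_loss(lambda t: 4.0 * (t - 0.5) ** 2, n)  # phi_0 with curvature
    print(n, affine, curved)
```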

Remark 2.3

(Automatic adaptation) Risk bounds such as (5) are usually provable for estimators based on empirical model selection criteria (see, for example, [2]) or aggregation (see, for example, [25]). Specializing to the present situation, in order to adapt over \({\mathcal {P}}_k\) as \(k\) varies, one constructs the LSE over each \({\mathcal {P}}_k\) and then either selects one estimator from this collection by an empirical model selection criterion or aggregates these estimators with data-dependent weights. While the theory for such penalization estimators is well-developed (see e.g., [2]), these estimators are computationally expensive, may rely on tuning parameters that are difficult to choose in practice and also require estimation of \(\sigma ^2\). The LSE \(\hat{\phi }_{ls}\) is very different from these estimators because it simply minimizes the LS criterion over the whole space \({\mathcal {C}}\). It is therefore very easy to compute, does not depend on any tuning parameter or estimate of \(\sigma ^2\) and, remarkably, it automatically adapts over the classes \({\mathcal {P}}_k\) as \(k\) varies.

Remark 2.4

(Why convexity is different from second order smoothness) In Remark 2.1, we argued how estimation under convexity is similar to estimation under second order smoothness. Here we describe how the two are different. The risk bound given by Theorem 2.3 crucially depends on the true function \(\phi _0\). In other words, the LSE converges at different rates depending on the true convex function \(\phi _0\). Therefore, the rate of the LSE is not uniform over the class of all convex functions but it varies quite a bit from function to function in that class. As will be clear from our proofs, the reason for this difference in rates is that the class of convex functions \({\mathcal {C}}\) is locally non-uniform in the sense that the local neighborhoods around certain convex functions (e.g., affine functions) are much sparser than local neighborhoods around other convex functions. On the other hand, in the class of twice differentiable functions, all local neighborhoods are, in some sense, equally sized.

Remark 2.5

(On the logarithmic factors) We believe that Theorems 2.2 and 2.3 might have redundant logarithmic factors. In particular, we conjecture that there should be no logarithmic term in Theorem 2.2 and that the logarithmic term should be \(\log (en/(2c_1))\) instead of \((\log (en/(2c_1)))^{5/4}\) in Theorem 2.3; cf. analogous results in isotonic regression—[33] and [9]. These additional logarithmic factors mainly arise due to the fact that the class \(S(\phi _0, r)\), of convex functions appearing in the proofs, is not uniformly bounded. Sharpening these factors might be possible by using an explicit characterization of the LSE (as was done in [33] and [9] for isotonic regression) and other techniques that are beyond the scope of the present paper.

The proofs of Theorems 2.2 and 2.3 are presented in Sect. 4. A high level overview of the proof goes as follows. The convex LSE is an ERM procedure. These procedures are very well studied and numerous risk bounds exist in mathematical statistics and machine learning (see, for example, [4, 23, 30, 32]). These results essentially say that the risk behavior of \(\hat{\phi }_{ls}\) is determined by the metric entropy of the balls \(S(\phi _0, r)\) (defined in (4)) in \({\mathcal {C}}\) around the true function \(\phi _0\). Controlling the metric entropy of the \(S(\phi _0, r)\) is the key step in the proofs of Theorems 2.2 and 2.3. The next section deals with bounds for the metric entropy of \(S(\phi _0, r)\).

3 The local structure of the space of convex functions

In this section, we prove bounds for the metric entropy of the balls \(S(\phi _0, r)\) as \(\phi _0\) ranges over the space of convex functions. Our results give new insights into the local structure of the space of convex functions. We show that the metric entropy of \(S(\phi _0, r)\) behaves differently for different convex functions \(\phi _0\). This is the reason why the LSE exhibits different rates of convergence depending on the true function \(\phi _0\). The metric entropy of \(S(\phi _0, r)\) is much smaller when \(\phi _0\) is a piecewise affine convex function with not too many affine pieces than when \(\phi _0\) has a second derivative that is bounded from above and below by positive constants.

The next theorem is the main result of this section.

Theorem 3.1

There exists a positive constant \(c\) depending only on the ratio \(c_1/c_2\) such that for every \(\phi _0 \in {\mathcal {C}}\) and \(\epsilon > 0\), we have

$$\begin{aligned} \log M(\epsilon , S(\phi _0, r), \ell ) \le c \left( \log \frac{en}{2c_1} \right) ^{5/4} \sqrt{\frac{\Gamma (r; \phi _0)}{\epsilon }} \end{aligned}$$
(6)

where

$$\begin{aligned} \Gamma (r; \phi _0) {:=} \inf _{\alpha \in {\mathcal {P}}} \left( k^{5/2}(\alpha ) \left( r^2 + \ell ^2(\phi _0, \alpha ) \right) ^{1/2} \right) . \end{aligned}$$

Note that the dependence of the right hand side of (6) on \(\epsilon \) is always \(\epsilon ^{-1/2}\). The dependence on \(r\) enters through \(\Gamma (r; \phi _0)\), which also depends on \(\phi _0\). This function \(\Gamma (r; \phi _0)\) controls the size of the ball \(S(\phi _0, r)\): the larger the value of \(\Gamma (r; \phi _0)\), the larger the metric entropy of \(S(\phi _0, r)\). The smallest possible value of \(\Gamma (r; \phi _0)\) equals \(r\) and is achieved for affine functions. When \(\phi _0\) is piecewise affine, \(\Gamma (r; \phi _0)\) is larger than \(r\) but it is not much larger provided \(k(\phi _0)\) is small. This is because \(\Gamma (r; \phi _0) \le r k^{5/2}(\phi _0)\). When \(\phi _0\) cannot be well approximated by piecewise affine functions with a small number of pieces, it can be shown that \(\Gamma (r; \phi _0)\) is bounded from below by a constant independent of \(r\). This will be the case, for example, when \(\phi _0\) is twice differentiable with \(\phi _0''(x)\) bounded from above and below by positive constants. As shown in the next theorem, \(S(\phi _0, r)\) has the largest possible size for such \(\phi _0\). Note also that one always has the upper bound \(\Gamma (r; \phi _0) \le \sqrt{r^2 + {\mathfrak {L}}^2(\phi _0)}\), which can be proved by restricting the infimum in the definition of \(\Gamma (r; \phi _0)\) to affine functions.

We need the following definition for the next theorem. For a subinterval \([a, b]\) of \([0, 1]\) and positive real numbers \(\kappa _1 < \kappa _2\), we define \(\mathfrak {K}{:=} \mathfrak {K}(a, b, \kappa _1, \kappa _2)\) to be the class of all convex functions \(\phi \) on \([0, 1]\) which are twice differentiable on \([a, b]\) and which satisfy \(\kappa _1 \le \phi ''(x) \le \kappa _2\) for all \(x \in [a, b]\).

Theorem 3.2

Suppose \(\phi _0 \in \mathfrak {K}(a, b, \kappa _1, \kappa _2)\). Then there exist positive constants \(c\), \(\epsilon _0\) and \(\epsilon _1\) depending only on \(\kappa _1, \kappa _2\), \(b-a\) and \(c_2\) such that

$$\begin{aligned} \log M(\epsilon , S(\phi _0, r), \ell ) \ge c \epsilon ^{-1/2} \qquad \text {for}\,\, \epsilon _1 n^{-2} \le \epsilon \le r \epsilon _0. \end{aligned}$$
(7)

Note that the right hand side of (7) does not depend on \(r\). This should be contrasted with the right hand side of (6) when \(\phi _0\) is, say, an affine function. The non-uniform nature of the space of univariate convex functions should be clear from this: balls \(S(\phi _0, r)\) of the same radius \(r\) in the space have different sizes depending on their center, \(\phi _0\). This should be contrasted with the space of twice differentiable functions in which all balls are equally sized in the sense that they all satisfy (7).

Remark 3.1

Note that the inequality (7) only holds when \(\epsilon \ge \epsilon _1 n^{-2}\). In other words, it does not hold when \(\epsilon \downarrow 0\). This is actually inevitable because, ignoring the convexity of functions in \(S(\phi _0, r)\), the metric entropy of \(S(\phi _0, r)\) under \(\ell \) cannot be larger than the metric entropy of the ball of radius \(r\) in \({\mathbb R}^n\), which is bounded from above by \(n \log (1 + (3r/\epsilon ))\) (see e.g., [24], Lemma 4.1). Thus, as \(\epsilon \downarrow 0\), the metric entropy of \(S(\phi _0, r)\) becomes logarithmic in \(\epsilon \) as opposed to \(\epsilon ^{-1/2}\). Also note that inequality (7) only holds for \(\epsilon \le r \epsilon _0\). This also makes sense because the diameter of \(S(\phi _0, r)\) in the metric \(\ell \) equals \(2r\) and, consequently, the left hand side of (7) equals zero for \(\epsilon > 2r\). Therefore, one cannot expect (7) to hold for all \(\epsilon > 0\).

Remark 3.2

The proof of Theorem 3.2 actually implies a conclusion stronger than (7). Let \(S'(\phi _0, r) {:=} \left\{ \phi \in {\mathcal {C}}: \sup _x |\phi (x) - \phi _0(x)| \le r\right\} \). Clearly this is a smaller neighborhood of \(\phi _0\) than \(S(\phi _0, r)\) i.e., \(S'(\phi _0, r) \subseteq S(\phi _0, r)\). The proof of Theorem 3.2 shows that the lower bound (7) also holds for \(\log M(\epsilon , S'(\phi _0, r), \ell )\).

In the remainder of this section, we provide the proofs of Theorems 3.1 and 3.2. Let us start with the proof of Theorem 3.1. Since functions in \(S(\phi _0, r)\) are convex, we need to analyze the covering numbers of subsets of convex functions. There exist only two previous results here. Bronshtein [5] proved covering number bounds for classes of convex functions that are uniformly bounded and uniformly Lipschitz under the supremum metric. This result was extended by [11], who dropped the uniform Lipschitz assumption (this result was further extended by [18] to the multivariate case). Unfortunately, the convex functions in \(S(\phi _0, r)\) are not uniformly bounded (they only satisfy a weaker integral-type constraint) and hence Dryanov’s result cannot be used directly for proving Theorem 3.1. Another difficulty is that we need covering numbers under \(\ell \) while the results in [11] are based on integral \(L_p\) metrics.

Here is a high-level outline of the proof of Theorem 3.1. The first step is to reduce the general problem to the case when \(\phi _0 \equiv 0\). The result for \(\phi _0 \equiv 0\) immediately implies the result for all affine functions \(\phi _0\). One can then generalize to piecewise affine convex functions by repeating the argument over each affine piece. Finally, the result is derived for general \(\phi _0\) by approximating \(\phi _0\) by piecewise affine convex functions.

For \(\phi _0 \equiv 0\), the class of convex functions under consideration is \(S(0, r)\). Unfortunately, functions in \(S(0, r)\) are not uniformly bounded; they only satisfy a weaker discrete \(L^2\)-type boundedness constraint. We get around the lack of uniform boundedness by noting that convexity and the \(L^2\)-constraint imply that functions in \(S(0, r)\) are uniformly bounded on subintervals that are in the interior of \([x_1, x_n]\) (this is proved via Lemma 7.3). We use this to partition the interval \([x_1, x_n]\) into appropriate subintervals where Dryanov’s metric entropy result can be employed. We first carry out this argument for another class of convex functions where the discrete \(L^2\)-constraint is replaced by an integral \(L^2\)-constraint. From this result, we deduce the covering numbers of \(S(0, r)\) by using straightforward interpolation results (Lemma 7.4).

3.1 Proof of Theorem 3.1

3.1.1 Reduction to the case when \(\phi _0 \equiv 0\)

The first step is to note that it suffices to prove the theorem when \(\phi _0\) is the constant function equal to 0. For \(\phi _0 \equiv 0\), Theorem 3.1 is equivalent to the following statement: there exists a constant \(c > 0\), depending only on the ratio \(c_1/c_2\), such that

$$\begin{aligned} \log M(\epsilon , S(0, r), \ell ) \le c \left( \log \frac{en}{2c_1} \right) ^{5/4} \left( \frac{\epsilon }{r} \right) ^{-1/2} \qquad \text {for all}\, \epsilon > 0. \end{aligned}$$
(8)

Below, we prove Theorem 3.1 assuming that (8) is true. Let \(\alpha \in {\mathcal {P}}_k\) be a piecewise affine function with \(k(\alpha ) = k\). We shall show that

$$\begin{aligned} \log M(\epsilon , S(\alpha , r), \ell ) \le c k^{5/4} \left( \log \frac{en}{2c_1} \right) ^{5/4} \left( \frac{\epsilon }{r} \right) ^{-1/2} \qquad \text {for every}\, \epsilon > 0. \end{aligned}$$
(9)

This inequality immediately implies Theorem 3.1 because for every \(\phi _0, \phi \in {\mathcal {C}}\) and \(\alpha \in {\mathcal {P}}\), we have

$$\begin{aligned} \ell ^2(\phi , \alpha ) \le 2 \ell ^2(\phi , \phi _0) + 2 \ell ^2(\phi _0, \alpha ) \end{aligned}$$

by the trivial inequality \((a + b)^2 \le 2 a^2 + 2 b^2\). This means that \(\ell ^2(\phi , \alpha ) \le 2 r^2 + 2 \ell ^2(\phi _0, \alpha )\) for every \(\phi \in S(\phi _0, r)\). Hence

$$\begin{aligned} M(\epsilon , S(\phi _0, r), \ell ) \le M\left( \epsilon , S\left( \alpha , \sqrt{2(r^2 + \ell ^2(\phi _0, \alpha ))}\right) , \ell \right) . \end{aligned}$$

This inequality and (9) together clearly imply (6). It suffices therefore to prove (9).

Suppose that \(\alpha \) is affine on each of the \(k\) intervals \(I_1 = [0, t_1]\) and \(I_i = (t_{i-1}, t_i]\) for \(i = 2, \dots , k\), where \(0 = t_0 < t_1 < \dots < t_{k-1} < t_k = 1\), so that the intervals \(I_1, \dots , I_k\) partition \([0, 1]\). Then there exist \(k\) affine functions \(\tau _1, \dots , \tau _k\) on \([0, 1]\) such that \(\alpha (x) = \tau _i(x)\) for \(x \in I_i\) for every \(i = 1, \dots , k\).

For every pair of functions \(f\) and \(g\) on \([0, 1]\), we have the trivial identity: \(\ell ^2(f, g) = \sum _{i=1}^k \ell _{i}^2(f, g)\) where

$$\begin{aligned} \ell _i^2(f, g) {:=} \frac{1}{n} \sum _{j: x_j \in I_i} \left( f(x_j) - g(x_j) \right) ^2. \end{aligned}$$

As a result, we clearly have

$$\begin{aligned} M(\epsilon , S(\alpha , r), \ell ) \le \prod _{i=1}^k M(\epsilon /\sqrt{k}, S(\alpha , r), \ell _{i}). \end{aligned}$$
(10)

Fix an \(i \in \{1, \dots , k\}\). Note that for every \(f \in S(\alpha , r)\), we have

$$\begin{aligned} \ell _i^2(f, \tau _i) = \ell _i^2(f, \alpha ) \le \ell ^2(f, \alpha ) \le r^2. \end{aligned}$$

Therefore

$$\begin{aligned} M(\epsilon /\sqrt{k}, S(\alpha , r), \ell _i) \le M(\epsilon /\sqrt{k}, S_i(\tau _i, r), \ell _i) \end{aligned}$$

where \(S_i(\tau _i, r)\) consists of the class of all convex functions \(f : I_i \rightarrow {\mathbb R}\) for which \(\ell _i^2(\tau _i, f) \le r^2\).

By the translation invariance of the Euclidean distance and the fact that \(\phi - \tau \) is convex whenever \(\phi \) is convex and \(\tau \) is affine, it follows that

$$\begin{aligned} M(\epsilon /\sqrt{k}, S_i(\tau _i, r), \ell _i) = M(\epsilon /\sqrt{k}, S_i(0, r), \ell _i) \end{aligned}$$

where \(S_i(0, r)\) is defined as the class of all convex functions \(f: I_i \rightarrow {\mathbb R}\) for which \(\ell _i^2(0, f) \le r^2\).

The covering number \(M(\epsilon /\sqrt{k}, S_i(0, r), \ell _i)\) can be easily bounded using (8) by the following scaling argument. Let \(J {:=} \{j \in \{1, \dots , n\} : x_j \in I_i\}\) with \(m\) being the cardinality of \(J\). Also write \([a, b]\) for the interval \(I_i\) and let \(u_j {:=} (x_j - a)/(b-a)\) for \(j \in J\). For \(f, g \in {\mathcal {C}}\), let

$$\begin{aligned} \ell ^{(u)}(f, g) {:=} \left( \frac{1}{m} \sum _{j \in J} (f(u_j) - g(u_j))^2 \right) ^{1/2} \end{aligned}$$

and \(S^{(u)}(0, \gamma ) {:=} \{f \in {\mathcal {C}}: \ell ^{(u)}(f, 0) \le \gamma \}\). By associating, for each \(f \in S_i(0, r)\), the convex function \(\tilde{f} \in {\mathcal {C}}\) defined by \(\tilde{f}(x) {:=} f(a + (b-a)x)\), it can be shown that

$$\begin{aligned} M(\epsilon /\sqrt{k}, S_i(0, r), \ell _i) = M \left( \sqrt{\frac{n}{m}} \frac{\epsilon }{\sqrt{k}}, S^{(u)}(0, r \sqrt{n/m}), \ell ^{(u)} \right) . \end{aligned}$$

The assumption (3) implies that the distance between neighboring points in \(\{u_j, j \in J\}\) lies between \(mc_1/(n(b-a))\) and \(mc_2/(n(b-a))\). Therefore, by applying (8) to \(\{u_j, j \in J\}\) instead of \(\{x_i\}\), we obtain the existence of a positive constant \(c\) depending only on the ratio \(c_1/c_2\) such that

$$\begin{aligned} \log M \left( \sqrt{\frac{n}{m}} \frac{\epsilon }{\sqrt{k}}, S^{(u)}(0, r \sqrt{n/m}), \ell ^{(u)} \right)&\le c \left( \log \frac{en(b-a)}{2c_1} \right) ^{5/4} \left( \frac{\epsilon }{\sqrt{k}r} \right) ^{-1/2} \\&\le c \left( \log \frac{en}{2c_1} \right) ^{5/4} \left( \frac{\epsilon }{\sqrt{k}r} \right) ^{-1/2}. \end{aligned}$$

The required inequality (9) now follows from the above and (10).

3.1.2 The integral version

We have established above that it suffices to prove Theorem 3.1 for \(\phi _0 \equiv 0\) i.e., it suffices to prove (8). The ball \(S(0, r)\) consists of all convex functions \(\phi \) such that

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^n \phi ^2(x_i) \le r^2. \end{aligned}$$
(11)

For \(a < b\) and \(B > 0\), let \({\mathfrak {I}}([a, b], B)\) denote the class of all real-valued convex functions \(f\) on \([a, b]\) for which \(\int _a^b f^2(x) dx \le B^2\). The ball \(S(0, r)\) is intuitively very close to the class \({\mathfrak {I}}([0, 1], r)\), the only difference being that the average constraint (11) is replaced by the integral constraint \(\int _0^1 \phi ^2(x) dx \le r^2\) in \({\mathfrak {I}}([0, 1], r)\). We shall prove a good upper bound for the metric entropy of \({\mathfrak {I}}([0, 1], r)\). The metric entropy of \(S(0, r)\) will then be derived as a consequence.

Theorem 3.3

There exists a constant \(c\) such that for every \(0 < \eta < 1/2\), \(B > 0\) and \(\epsilon > 0\), we have

$$\begin{aligned} \log M \left( \epsilon , {\mathfrak {I}}([0, 1], B), L_2[\eta , 1 - \eta ] \right) \le c \left( \log \frac{e}{2\eta } \right) ^{5/4} \left( \frac{\epsilon }{B} \right) ^{-1/2}. \end{aligned}$$
(12)

where, by \(L_2[\eta , 1-\eta ]\), we mean the metric where the distance between \(f\) and \(g\) is given by

$$\begin{aligned} \left( \int _{\eta }^{1-\eta } \left( f(x) - g(x) \right) ^2 dx \right) ^{1/2}. \end{aligned}$$

Remark 3.3

We take the metric above to be \(L_2[\eta , 1 - \eta ]\) as opposed to \(L_2[0, 1]\) because

$$\begin{aligned} \log M \left( \epsilon , {\mathfrak {I}}([0, 1], B), L_2[0, 1] \right) = \infty \end{aligned}$$
(13)

To see this, take \(f_j(t) = 2^{j/2} \max (0, 1 - 2^j t)\) for \(t \in [0,1]\) and \(j \ge 1\). It is then easy to check that \(f_j \in {\mathfrak {I}}([0, 1], B)\) for \(B \ge 1/\sqrt{3}\) and that \(\int _0^1 (f_j - f_{j+1})^2 \ge c\) for some positive constant \(c\), which proves (13). The equality (13) is also the reason why the right hand side of (12) approaches \(\infty \) as \(\eta \downarrow 0\).
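For instance, the membership claim follows from the direct computation (recorded here for convenience; the substitution \(u = 2^j t\) is ours):

$$\begin{aligned} \int _0^1 f_j^2(t) \, dt = 2^j \int _0^{2^{-j}} (1 - 2^j t)^2 \, dt = \int _0^1 (1 - u)^2 \, du = \frac{1}{3}, \end{aligned}$$

so that \(\int _0^1 f_j^2(t) \, dt \le B^2\) whenever \(B \ge 1/\sqrt{3}\).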

The above theorem is a new result. When the constraint \(\int _0^1 \phi ^2(x) dx \le B^2\) is replaced by the stronger constraint \(\sup _{x \in [0, 1]} |\phi (x)| \le B\), the corresponding bound was proved by [11]. Specifically, [11] considered the class \({\mathcal {C}}([a, b], B)\) consisting of all convex functions \(f\) on \([a, b]\) which satisfy \(\sup _{x \in [a, b]}|f(x)| \le B\) and proved the following result, which [18] later extended to the multivariate case.

Theorem 3.4

(Dryanov) There exists a positive constant \(c\) such that for every \(B > 0\) and \(b > a\), we have

$$\begin{aligned} \log M \left( \epsilon , {\mathcal {C}}([a, b], B), L_2[a,b] \right) \le c \left( \frac{\epsilon }{B (b - a)^{1/2}} \right) ^{-1/2} \qquad \text {for every}\, \epsilon > 0.\qquad \end{aligned}$$
(14)

Remark 3.4

In [11], inequality (14) was only asserted for \(\epsilon \le \epsilon _0 B(b-a)^{1/2}\) for a positive constant \(\epsilon _0\). It turns out however that this condition is redundant. This follows from the observation that the diameter of the space \({\mathcal {C}}([a, b], B)\) in the \(L_2[a, b]\) metric is at most \(2B(b-a)^{1/2}\) which means that the left hand side of (14) equals 0 for \(\epsilon > 2B(b-a)^{1/2}\) and, thus, by changing the constant \(c\) suitably in Dryanov’s result, we obtain (14).

The class \({\mathfrak {I}}([0, 1], B)\) is much larger than \({\mathcal {C}}([0, 1], B)\) because the integral constraint \(\int _0^1 \phi ^2(x) dx \le B^2\) is much weaker than \(\sup _{x \in [0, 1]} |\phi (x)| \le B\). Therefore, Theorem 3.3 does not directly follow from Theorem 3.4. However, it is possible to derive Theorem 3.3 from Theorem 3.4 via the observation (made rigorous in Lemma 7.3) that functions in \({\mathfrak {I}}([0, 1], B)\) become uniformly bounded on subintervals of \([0, 1]\) that are sufficiently far away from the boundary points. On such subintervals, we may use Theorem 3.4 to bound the covering numbers. Theorem 3.3 is then proved by putting together these different covering numbers as shown below.

Proof of Theorem 3.3

By a trivial scaling argument, we can assume without loss of generality that \(B = 1\). Let \(l\) be the largest integer that is strictly smaller than \(-\log (2\eta )/\log 2\) and let \(\eta _i {:=} 2^i \eta \) for \(i = 0, \dots , l+1\). Observe that \(\eta _l < 1/2 \le \eta _{l+1}\).

Fix \(i \in \{0, \dots , l\}\). By Lemma 7.3, the restriction of a function \(\phi \in {\mathfrak {I}}([0, 1], 1)\) to \([\eta _i, \eta _{i+1}]\) is convex and uniformly bounded by \(2 \sqrt{3} \eta _i^{-1/2}\). Therefore, by Theorem 3.4, there exists a positive constant \(c\) such that we can cover the functions in \({\mathfrak {I}}([0, 1], 1)\) in the \(L_2[\eta _i, \eta _{i+1}]\) metric to within \(\alpha _{i}\) by a finite set having cardinality at most

$$\begin{aligned} \exp \left[ c \left( \frac{\alpha _i \sqrt{\eta _i}}{\sqrt{\eta _{i+1} - \eta _i}} \right) ^{-1/2} \right] = \exp \left( c \alpha _i^{-1/2} \right) . \end{aligned}$$

Because

$$\begin{aligned} \int _{\eta }^{1/2} \left( \phi (x) - f(x) \right) ^2 dx \le \sum _{i=0}^{l} \int _{\eta _i}^{\eta _{i+1}} \left( \phi (x) - f(x) \right) ^2 dx, \end{aligned}$$

we get a cover for functions in \({\mathfrak {I}}([0, 1], 1)\) in the \(L_2[\eta , 1/2]\) metric of radius at most \(\left( \sum _{i=0}^l \alpha _i^2 \right) ^{1/2}\) and cardinality at most \(\exp \left( c \sum _{i=0}^l \alpha _i^{-1/2} \right) \).

Taking \(\alpha _i = \epsilon (l+1)^{-1/2}\) (so that \(\big ( \sum _{i=0}^l \alpha _i^2 \big )^{1/2} = \epsilon \) and \(\sum _{i=0}^l \alpha _i^{-1/2} = (l+1)^{5/4} \epsilon ^{-1/2}\)), we get that

$$\begin{aligned} \log M(\epsilon , {\mathfrak {I}}([0, 1], 1), L_2[\eta , 1/2]) \le c \epsilon ^{-1/2} (l+1)^{5/4} \le c' \epsilon ^{-1/2} \left( \log \frac{e}{2\eta } \right) ^{5/4} \end{aligned}$$

where \(c'\) depends only on \(c\). By an analogous argument, the above inequality will also hold for \(\log M(\epsilon , {\mathfrak {I}}([0, 1], 1), L_2[1/2, 1-\eta ])\). The proof is completed by putting these two bounds together. \(\square \)

3.1.3 Completion of the Proof of Theorem 3.1

We now complete the proof of Theorem 3.1 by proving inequality (8). We will use Theorem 3.3. We need to switch between the pseudometrics \(\ell \) and \(L_2[\eta , 1-\eta ]\). This will be made convenient by the use of Lemma 7.4.

By an elementary scaling argument, it follows that

$$\begin{aligned} M(\epsilon , S(0, r), \ell ) = M(\epsilon /r, S(0, 1), \ell ). \end{aligned}$$

We, therefore, only need to prove (8) for \(r = 1\). For ease of notation, let us denote \(S(0, 1)\) by \(S\).

Because \(x_i - x_{i-1} \ge c_1/n\) for all \(i = 2, \dots , n\), we have \(x_2, \dots , x_{n-1} \in [c_1/n, 1 - (c_1/n)]\). We shall first prove an upper bound for \(\log M(\epsilon , S, \ell _1)\) where

$$\begin{aligned} \ell ^2_1(\phi , \psi ) {:=} \frac{1}{n-2} \sum _{i=2}^{n-1} \left( \phi (x_i) - \psi (x_i) \right) ^2. \end{aligned}$$

For each function \(\phi \in S\), let \(\tilde{\phi }\) be the convex function on \([x_2, x_{n-1}]\) defined by

$$\begin{aligned} \tilde{\phi }(x) {:=} \frac{x_{i+1} - x}{x_{i+1} - x_i}\phi (x_i) + \frac{x-x_i}{x_{i+1} - x_i} \phi (x_{i+1}) \qquad \text {for}\, x_i \le x \le x_{i+1} \end{aligned}$$

where \(i = 2, \dots , n-2\). Also let \(\tilde{S} {:=} \left\{ \tilde{\phi }: \phi \in S \right\} \).

By Lemma 7.4 and the assumption that \(x_i - x_{i-1} \ge c_1/n\) for all \(i\), we get that

$$\begin{aligned} \ell _1^2(\phi , \psi ) \le \frac{6}{c_1} \int _{x_2}^{x_{n-1}} \left( \tilde{\phi }(x) - \tilde{\psi }(x) \right) ^2 dx \end{aligned}$$

for every pair of functions \(\phi \) and \(\psi \) in \(S\). Letting \(\delta {:=} \epsilon \sqrt{c_1/6}\) this inequality implies that

$$\begin{aligned} M \left( \epsilon , S, \ell _1 \right) \le M \left( \delta , \tilde{S}, L_2[x_2, x_{n-1}] \right) . \end{aligned}$$

Again by Lemma 7.4 and the assumption \(x_i - x_{i-1} \le c_2/n\), we have that

$$\begin{aligned} \int _{x_1}^{x_n} \tilde{\phi }^2(x) dx \le \frac{c_2}{n} \sum _{i=1}^n \phi ^2(x_i) \le c_2 \qquad \text {for every}\, \phi \in S. \end{aligned}$$

As a result, we have that \(\tilde{S} \subseteq {\mathfrak {I}}([x_1, x_n], \sqrt{c_2})\). Further, because \(x_2 \ge x_1 + c_1/n\) and \(x_{n-1} \le x_n - c_1/n\), we get that

$$\begin{aligned} M \left( \delta , \tilde{S}, L_2[x_2, x_{n-1}] \right) \le M \left( \delta , {\mathfrak {I}}([x_1, x_n], \sqrt{c_2}), L_2[x_1 + \eta , x_{n} - \eta ] \right) \end{aligned}$$

where \(\eta {:=}\, c_1/n\). By a simple scaling argument, the covering number on the right hand side above is upper bounded by

$$\begin{aligned} M \left( \frac{\delta }{\sqrt{x_n - x_1}}, {\mathfrak {I}}([0, 1], \sqrt{c_2(x_n - x_1)}), L_2\left[ \frac{\eta }{x_n - x_1}, 1 - \frac{\eta }{x_n - x_1}\right] \right) . \end{aligned}$$
(15)

Indeed, for each \(f \in {\mathfrak {I}}([x_1, x_n], \sqrt{c_2})\), we can associate \(\tilde{f}(y) {:=} f(x_1 + y(x_n - x_1))\) for \(y \in [0, 1]\). It is then easy to check that \(\tilde{f} \in {\mathfrak {I}}([0, 1], \sqrt{c_2(x_n - x_1)})\) and

$$\begin{aligned} \int _{x_1 + \eta }^{x_n - \eta } \left( f_1(x) - f_2(x) \right) ^2 dx = (x_n - x_1) \int _{\eta /(x_n - x_1)}^{1 - (\eta /(x_n - x_1))} \left( \tilde{f}_1(y) - \tilde{f}_2(y) \right) ^2 dy, \end{aligned}$$

from which (15) easily follows. From the bound (15), it is now easy to see that (because \(x_n - x_1 \le 1\))

$$\begin{aligned} M \left( \delta , {\mathfrak {I}}([x_1, x_n], \sqrt{c_2}), L_2[x_1 + \eta , x_{n} - \eta ] \right) \le M \left( \delta , {\mathfrak {I}}([0, 1], \sqrt{c_2}), L_2[\eta , 1 - \eta ] \right) . \end{aligned}$$

Thus, by Theorem 3.3, we assert the existence of a positive constant \(c\) such that

$$\begin{aligned} \log M(\epsilon , S, \ell _1) \le c \left( \log \frac{en}{2c_1} \right) ^{5/4} \left( \frac{\sqrt{c_1} \epsilon }{\sqrt{c_2}} \right) ^{-1/2}. \end{aligned}$$
(16)

Now for every pair of functions \(\phi \) and \(\psi \) in \(S\), we have

$$\begin{aligned} \ell ^2(\psi , \phi ) \le \ell _1^2(\psi , \phi ) + \frac{1}{n} \sum _{i \in \{1, n\}} \left( \phi (x_i) - \psi (x_i) \right) ^2. \end{aligned}$$

We make the simple observation that \((\phi (x_1), \phi (x_n))\) lies in the closed ball of radius \(\sqrt{n}\) in \({\mathbb R}^2\) denoted by \(B_2(0, \sqrt{n})\). As a result, using  Pollard ([24], Lemma 4.1), we have

$$\begin{aligned}&M(\epsilon , S, \ell ) \le M\left( \frac{\epsilon }{\sqrt{2}}, S, \ell _1\right) M\left( \frac{\sqrt{n}\epsilon }{\sqrt{2}}, B_2(0, \sqrt{n})\right) \\&\quad \le \left( 1 + \frac{3\sqrt{2}}{\epsilon } \right) ^2 M\left( \frac{\epsilon }{\sqrt{2}}, S, \ell _1\right) \end{aligned}$$

where the covering number of \(B_2(0, \sqrt{n})\) is in the usual Euclidean metric. Using (16), we get

$$\begin{aligned} \log M(\epsilon , S, \ell ) \le 2 \log \left( 1 + \frac{3\sqrt{2}}{\epsilon } \right) + c \left( \log \frac{en}{2c_1} \right) ^{5/4} \left( \frac{\sqrt{c_1} \epsilon }{\sqrt{2c_2}} \right) ^{-1/2}. \end{aligned}$$
(17)

Because \(\log (1 + x) \le 3 \sqrt{x}\) for all \(x > 0\), the first term in the right hand side above is bounded by a constant multiple of \(\epsilon ^{-1/2}\). This proves (8) provided the constant \(c\) is renamed appropriately.

3.2 Proof of Theorem 3.2

In our proof below, we shall make use of Lemma 7.1 (stated and proved in Appendix) which bounds the distance between functions in \(\mathfrak {K}(a, b, \kappa _1, \kappa _2)\) and their piecewise linear interpolants.

Fix \(m \ge 1\) and let \(t_i = a + (b-a)i/m\) for \(i = 0, \dots , m\). For each \(i = 1, \dots , m\), let \(\alpha _i\) denote the linear interpolant of the points \((t_{i-1}, \phi _0(t_{i-1}))\) and \((t_i, \phi _0(t_i))\), i.e.,

$$\begin{aligned} \alpha _i(x) {:=}\,\, \phi _0(t_{i-1}) + \frac{\phi _0(t_i) - \phi _0(t_{i-1})}{t_i - t_{i-1}} \left( x - t_{i-1} \right) \qquad \text {for}\, x \in [0, 1]. \end{aligned}$$

By error estimates for linear interpolation (see e.g., Chapter 3 of [1]), for every \(x \in [t_{i-1}, t_i]\), there exists a point \(t_x \in [t_{i-1}, t_i]\) for which

$$\begin{aligned} |\phi _0(x) - \alpha _i(x)| = (x - t_{i-1})(t_i - x) \frac{\phi _0''(t_x)}{2} \end{aligned}$$

which implies, because \(\phi _0 \in \mathfrak {K}(a, b, \kappa _1, \kappa _2)\), that

$$\begin{aligned} |\phi _0(x) - \alpha _i(x)| \le (x - t_{i-1})(t_i - x) \frac{\kappa _2}{2} \le \frac{\kappa _2}{8}(t_i - t_{i-1})^2 = \frac{(b-a)^2 \kappa _2}{8 m^2} \end{aligned}$$
(18)

for every \(x \in [t_{i-1}, t_i]\). By convexity of \(\phi _0\), it is obvious that \(\alpha _i(x) \ge \phi _0(x)\) for \(x \in [t_{i-1}, t_i]\) and \(\alpha _i(x) \le \phi _0(x)\) for \(x \notin [t_{i-1}, t_i]\).

Now for each \(\tau \in \{0, 1\}^m\), let us define

$$\begin{aligned} \phi _{\tau }(x) {:=} \max \left( \phi _0(x), \max _{i : \tau _i = 1} \alpha _i(x) \right) \qquad \text {for}\, x \in [0, 1]. \end{aligned}$$

The functions \(\phi _{\tau }\) are clearly convex because they equal the pointwise maximum of convex functions. Moreover, for \(x \in [t_{i-1}, t_i]\), we have

$$\begin{aligned} \phi _{\tau }(x) = \left\{ \begin{array}{rl} \alpha _i(x) &{}\quad \text{ if } \tau _i = 1 \\ \phi _0(x) &{}\quad \text{ if } \tau _i = 0\text{. } \end{array} \right. \end{aligned}$$

Also, from (18),

$$\begin{aligned} \sup _{x \in [0, 1]} \left| \phi _{\tau }(x) - \phi _0(x) \right| \le \max _{1 \le i \le m} \sup _{x \in [t_{i-1}, t_i]} \left| \phi _{0}(x) - \alpha _i(x) \right| \le \frac{(b-a)^2\kappa _2}{8 m^2}. \end{aligned}$$

Because \(\ell (\phi _{\tau }, \phi _0) \le \sup _{x} |\phi _{\tau }(x) - \phi _0(x)|\), it follows that \(\phi _{\tau } \in S(\phi _0, r)\) provided

$$\begin{aligned} \frac{(b-a)^2 \kappa _2}{8 m^2} \le r. \end{aligned}$$
(19)

Observe now that for every \(\tau , \tau ' \in \{0, 1\}^m\),

$$\begin{aligned} \ell ^2\left( \phi _{\tau }, \phi _{\tau '} \right) = \sum _{i: \tau _i \ne \tau _i'} \ell ^2 \left( \phi _0, \max (\phi _0, \alpha _i) \right) \ge {\Upsilon }(\tau , \tau ') \min _{1 \le i \le m} \ell ^2(\phi _0, \max (\phi _0, \alpha _i)) \end{aligned}$$
(20)

where \({\Upsilon }(\tau , \tau ') {:=} \sum _i {\mathbb {1}}\{\tau _i \ne \tau '_i \}\) denotes the Hamming distance. We now use Lemma 7.1 to bound \(\ell ^2(\phi _0, \max (\phi _0, \alpha _i))\) from below. Since \(\alpha _i\) is the linear interpolant of \((t_{i-1}, \phi _0(t_{i-1}))\) and \((t_i, \phi _0(t_i))\), we use Lemma 7.1 (inequality (38)) with \(a = t_{i-1}\) and \(b = t_i\) to assert

$$\begin{aligned} \ell ^2(\phi _0, \max (\phi _0, \alpha _i)) \ge \frac{\kappa _1^2 (t_i - t_{i-1})^5}{4{,}096 c_2} = \frac{\kappa _1^2 (b-a)^5}{4{,}096 c_2 m^5} \end{aligned}$$

provided

$$\begin{aligned} n \ge \frac{4c_2}{t_i-t_{i-1}} = \frac{4mc_2}{b-a}. \end{aligned}$$
(21)

From (20), we thus have

$$\begin{aligned} \ell ^2(\phi _{\tau }, \phi _{\tau '}) \ge {\Upsilon }(\tau , \tau ') \frac{\kappa _1^2 (b-a)^5}{4{,}096 c_2 m^5}. \end{aligned}$$

Using now the Varshamov-Gilbert lemma (see, for example, Massart [23], Lemma 4.7), which asserts the existence of a subset \(W\) of \(\{0, 1\}^m\) with cardinality \(|W| \ge \exp (m/8)\) such that \({\Upsilon }(\tau , \tau ') \ge m/4\) for all \(\tau , \tau ' \in W\) with \(\tau \ne \tau '\), we get that

$$\begin{aligned} \ell ^2(\phi _{\tau }, \phi _{\tau '}) \ge \frac{\kappa _1^2 (b-a)^5}{16{,}384 c_2 m^4} \qquad \text {for all}\, \tau , \tau ' \in W\quad \hbox { with }\tau \ne \tau '. \end{aligned}$$
(22)

Let us now fix \(\epsilon > 0\) and choose \(m\) so that

$$\begin{aligned} m^4 = \frac{\kappa _1^2(b-a)^5}{16{,}384 c_2 \epsilon ^2}. \end{aligned}$$

From (22), we then see that \(\{\phi _{\tau }: \tau \in W\}\) is an \(\epsilon \)-packing set under the pseudometric \(\ell \). The condition (19) would hold provided

$$\begin{aligned} \epsilon \le \frac{\kappa _1 \sqrt{b-a}}{16 \sqrt{c_2} \kappa _2} r. \end{aligned}$$

Also, the condition (21) is equivalent to

$$\begin{aligned} \epsilon \ge \frac{c_2^2 \sqrt{b-a} \kappa _1}{8 \sqrt{c_2}n^2}. \end{aligned}$$
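For completeness, both reductions follow by substituting this choice of \(m\) (a routine computation which we record here):

$$\begin{aligned} m^2 = \frac{\kappa _1 (b-a)^{5/2}}{128 \sqrt{c_2} \, \epsilon }, \qquad \frac{(b-a)^2 \kappa _2}{8 m^2} = \frac{16 \kappa _2 \sqrt{c_2}}{\kappa _1 \sqrt{b-a}} \, \epsilon , \qquad \frac{4 m c_2}{b-a} = \frac{\kappa _1^{1/2} c_2^{3/4} (b-a)^{1/4}}{2 \sqrt{2} \, \epsilon ^{1/2}}, \end{aligned}$$

and requiring the second quantity to be at most \(r\) and the third to be at most \(n\) yields the two displayed restrictions on \(\epsilon \).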

We have therefore shown that for \(\epsilon \) satisfying the above pair of inequalities, there exists an \(\epsilon \)-packing subset of \(S(\phi _0, r)\) with cardinality \(|W|\) satisfying

$$\begin{aligned} \log |W| \ge \frac{m}{8} \ge \frac{\sqrt{\kappa _1} (b-a)^{5/4}}{96 c_2^{1/4}} \epsilon ^{-1/2}. \end{aligned}$$

The proof of Theorem 3.2 is now complete if we take

$$\begin{aligned} \epsilon _0 {:=} \frac{\kappa _1 \sqrt{b-a}}{16 \kappa _2 \sqrt{c_2}} \quad \text { and } \quad c {:=} \frac{\sqrt{\kappa _1} (b-a)^{5/4}}{96 c_2^{1/4}} \quad \text { and } \quad \epsilon _1 {:=} \frac{c_2^2 \sqrt{b-a} \kappa _1}{8 \sqrt{c_2}}. \end{aligned}$$

4 Proofs of the risk bounds of the LSE

In this section, we provide the proofs of Theorems 2.2 and 2.3. As mentioned in Sect. 2, these two theorems together imply our main risk bound, Theorem 2.1, for the convex LSE. Our proofs are based on the local metric entropy results for the space of univariate convex functions derived in the previous section (Theorem 3.1), together with standard results on the risk behavior of ERM procedures. Before proceeding further, let us state precisely the result from the literature on ERM procedures that we use to analyze the risk of \(\hat{\phi }_{ls}\). There exist many such results but they are all similar in spirit, and the following result from Van de Geer ([30], Theorem 9.1) is especially convenient to use.

Theorem 4.1

[30] For each \(r > 0\), let

$$\begin{aligned} S(\phi _0, r) {:=} \{\phi \in {\mathcal {C}}: \ell ^2(\phi _0, \phi ) \le r^2 \}. \end{aligned}$$

Suppose \(H\) is a function on \((0, \infty )\) such that

$$\begin{aligned} H(r) \ge \int _0^r \sqrt{\log M(\epsilon , S(\phi _0, r) , \ell )} \ d\epsilon \qquad \text {for every}\, r > 0 \end{aligned}$$

and such that \(H(r)/r^2\) is decreasing on \((0, \infty )\). Then there exists a universal constant \(C\) such that

$$\begin{aligned} {\mathbb P}_{\phi _0}\left( \ell ^2(\hat{\phi }_{ls}, \phi _0) > \delta \right) \le C \sum _{s \ge 0} \exp \left( - \frac{n2^{2s}\delta }{C^2 \sigma ^2} \right) \end{aligned}$$

for every \(\delta > 0\) satisfying \(\sqrt{n} \delta \ge C\sigma H(\sqrt{\delta })\).

Let us note that our local metric entropy result, Theorem 3.1, easily implies an upper bound for the entropy integral

$$\begin{aligned} \int _0^r \sqrt{\log M(\epsilon , S(\phi _0, r), \ell )} d\epsilon \end{aligned}$$
(23)

appearing in Theorem 4.1. Indeed, using the bound given by (6) for \(\log M(\epsilon , S(\phi _0, r), \ell )\) above and integrating, we obtain that (23) is bounded from above by

$$\begin{aligned} K \left( \log \frac{en}{2c_1} \right) ^{5/8} r^{3/4} \inf _{\alpha \in {\mathcal {P}}} \left[ k^{5/8}(\alpha ) \left( r^2 + \ell ^2(\phi _0, \alpha ) \right) ^{1/8} \right] \end{aligned}$$
(24)

for every \(\phi _0 \in {\mathcal {C}}\) and \(r > 0\) where \(K\) is a constant that only depends on the ratio \(c_1/c_2\).
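For the reader's convenience, we record the elementary integration behind this bound (the computation is ours):

$$\begin{aligned} \int _0^r \sqrt{c \left( \log \frac{en}{2c_1} \right) ^{5/4} \sqrt{\frac{\Gamma (r; \phi _0)}{\epsilon }}} \, d\epsilon = \sqrt{c} \left( \log \frac{en}{2c_1} \right) ^{5/8} \Gamma ^{1/4}(r; \phi _0) \int _0^r \epsilon ^{-1/4} \, d\epsilon = \frac{4 \sqrt{c}}{3} \left( \log \frac{en}{2c_1} \right) ^{5/8} \Gamma ^{1/4}(r; \phi _0) \, r^{3/4}, \end{aligned}$$

and, since \(x \mapsto x^{1/4}\) is nondecreasing, substituting the definition of \(\Gamma (r; \phi _0)\) gives (24) with \(K = 4\sqrt{c}/3\).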

4.1 Proof of Theorem 2.2

Let us define

$$\begin{aligned} \delta _0 \,{:=}\, A \left( \frac{\sigma ^2}{n} \right) ^{4/5} R^{2/5} \log \frac{en}{2c_1} \end{aligned}$$

where \(A\) is a constant whose value will be specified shortly. Observe that \(\delta _0 \le R^2\) whenever \(n \ge A^{5/4} \left( \log ((en)/(2c_1)) \right) ^{5/4} \sigma ^2/R^2\). We use the bound (24) for the entropy integral (23). By restricting the infimum in the right hand side of (24) to affine functions (i.e., \(\alpha \in {\mathcal {P}}_1\)) for which \(k(\alpha ) = 1\), we obtain (note that \(\inf _{\alpha \in {\mathcal {P}}_1} \ell ^2(\phi _0, \alpha ) = {\mathfrak {L}}^2(\phi _0) \le R^2\))

$$\begin{aligned} \int _0^r \sqrt{\log M(\epsilon , S(\phi _0, r), \ell )} d\epsilon \le K \left( \log \frac{en}{2c_1} \right) ^{5/8} r^{3/4} \left( r^2 + R^2 \right) ^{1/8} \end{aligned}$$
(25)

for every \(r > 0\). Suppose now that

$$\begin{aligned} n \ge A^{5/4} \left( \log \frac{en}{2c_1} \right) ^{5/4} \frac{\sigma ^2}{R^2} \end{aligned}$$
(26)

so that \(\delta _0 \le R^2\). Let \(H(r)\) denote the right hand side of (25). It is clear that \(H(r)/r^2\) is decreasing on \((0, \infty )\). As a result, a condition of the form \(\sqrt{n} \delta \ge C \sigma H(\sqrt{\delta })\) for some positive constant \(C\) holds for every \(\delta \ge \delta _0\) provided it holds for \(\delta = \delta _0\). Clearly

$$\begin{aligned} \frac{H(\sqrt{\delta _0})}{\delta _0} = K \left( \log \frac{en}{2c_1} \right) ^{5/8} \delta _0^{-5/8} \left( \delta _0 + R^2 \right) ^{1/8}. \end{aligned}$$

Assuming that (26) holds and noting then that \(\delta _0 \le R^2\), we get

$$\begin{aligned} \frac{H(\sqrt{\delta _0})}{\delta _0} \le 2^{1/8} K \left( \log \frac{en}{2c_1} \right) ^{5/8} \delta _0^{-5/8} R^{1/4} = 2^{1/8} K A^{-5/8} \frac{\sqrt{n}}{\sigma }. \end{aligned}$$

We shall now use Theorem 4.1. Let \(C\) be the constant given by Theorem 4.1. By the above inequality, the condition \(\sqrt{n} \delta \ge C \sigma H(\sqrt{\delta })\) holds for each \(\delta \ge \delta _0\) provided \(A = 2^{1/5} (C K)^{8/5}\). Thus by Theorem 4.1, we obtain

$$\begin{aligned} {\mathbb P}_{\phi _0} \left( \ell ^2(\hat{\phi }_{ls}, \phi _0) > \delta \right) \le C \sum _{s \ge 0} \exp \left( -\frac{n2^{2s} \delta }{C^2 \sigma ^2} \right) \end{aligned}$$

for all \(\delta \ge \delta _0\) whenever \(n\) satisfies (26). Using the expression for \(\delta _0\) and (26), we get for \(\delta \ge \delta _0\),

$$\begin{aligned} \frac{n\delta }{\sigma ^2} \ge \frac{n \delta _0}{\sigma ^2} = A \left( \frac{n}{\sigma ^2} \right) ^{1/5} R^{2/5} \log \frac{en}{2c_1} \ge A^{5/4} \left( \log \frac{en}{2c_1} \right) ^{5/4}. \end{aligned}$$
(27)

We thus have

$$\begin{aligned} {\mathbb P}_{\phi _0} \left( \ell ^2(\hat{\phi }_{ls}, \phi _0) > \delta \right) \le C_1 \exp \left( - \frac{n \delta }{C_1 \sigma ^2} \right) \qquad \text {for all}\, \delta \ge \delta _0 \end{aligned}$$

for some constant \(C_1\) (depending only on \(C\) and \(A = 2^{1/5}(CK)^{8/5}\)) provided \(n\) satisfies (26). Integrating both sides of this inequality with respect to \(\delta \) [and using (27) again], we obtain the risk bound

$$\begin{aligned} {\mathbb E}_{\phi _0} \ell ^2(\hat{\phi }_{ls}, \phi _0 ) \le C_2 \delta _0 = C_2 \left( \frac{\sigma ^2}{n} \right) ^{4/5} A R^{2/5} \log \frac{en}{2c_1} \end{aligned}$$

for some positive constant \(C_2\) depending only on \(C\) and \(K\). Because \(C\) is an absolute constant and \(K\) only depends on the ratio \(c_1/c_2\), the proof is complete by an appropriate renaming of the constant \(C\).
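For completeness, the integration step can be spelled out as follows (this computation is ours, using the constants above):

$$\begin{aligned} {\mathbb E}_{\phi _0} \ell ^2(\hat{\phi }_{ls}, \phi _0) = \int _0^{\infty } {\mathbb P}_{\phi _0} \left( \ell ^2(\hat{\phi }_{ls}, \phi _0) > \delta \right) d\delta \le \delta _0 + \int _{\delta _0}^{\infty } C_1 \exp \left( - \frac{n \delta }{C_1 \sigma ^2} \right) d\delta = \delta _0 + \frac{C_1^2 \sigma ^2}{n} \exp \left( - \frac{n \delta _0}{C_1 \sigma ^2} \right) , \end{aligned}$$

and, by (27), \(\sigma ^2/n \le \delta _0 A^{-5/4} \left( \log (en/(2c_1)) \right) ^{-5/4}\), so the right hand side is bounded by a constant multiple of \(\delta _0\).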

4.2 Proof of Theorem 2.3

For each \(1 \le k \le n\), let

$$\begin{aligned} \ell _k^2 = \inf \{\ell ^2(\phi _0, \alpha ) : \alpha \in {\mathcal {P}}\quad \text {and}\quad k(\alpha ) = k \} \end{aligned}$$

so that

$$\begin{aligned} \inf _{\alpha \in {\mathcal {P}}} \left( \ell ^2(\phi _0, \alpha ) + \frac{\sigma ^2 k^{5/4}(\alpha )}{n} \right) = \inf _{1 \le k \le n} \left( \ell _k^2 + \frac{\sigma ^2 k^{5/4}}{n} \right) . \end{aligned}$$

It is also easy to check that

$$\begin{aligned} \ell _1^2 \ge \ell _2^2 \ge \dots \ge \ell _n^2 = 0. \end{aligned}$$

As a result, there exists an integer \(u \in \{1, \dots , n\}\) such that \(\ell _k^2 > \sigma ^2 k^{5/4}/n\) if \(1 \le k < u\) and \(\ell _k^2 \le \sigma ^2 k^{5/4}/n\) if \(k \ge u\). This means that when \(1 \le k < u\) (which implies that \(u \ge 2\) and hence \(u-1 \ge u/2\))

$$\begin{aligned} \ell _k^2 + \frac{\sigma ^2 k^{5/4}}{n} \ge \ell _{u-1}^2 > \frac{\sigma ^2}{n} (u-1)^{5/4} \ge \frac{\sigma ^2 u^{5/4}}{2^{5/4}n}. \end{aligned}$$

It then follows that

$$\begin{aligned} \inf _{1 \le k \le n} \left( \ell _k^2 + \frac{\sigma ^2 k^{5/4}}{n} \right) \ge \frac{\sigma ^2 u^{5/4}}{2^{5/4}n}. \end{aligned}$$

Consequently, the proof will be complete if we show that

$$\begin{aligned} {\mathbb E}_{\phi _0} \ell ^2(\phi _0, \hat{\phi }_{ls}) \le C \left( \log \frac{en}{2c_1} \right) ^{5/4} \frac{\sigma ^2 u^{5/4}}{n}. \end{aligned}$$
(28)

To prove this, we start by defining

$$\begin{aligned} \delta _0 {:=} A \left( \log \frac{en}{2c_1} \right) ^{5/4} \frac{\sigma ^2 u^{5/4}}{n} \end{aligned}$$

for a constant \(A\) whose value will be specified shortly. Because \(\ell _u^2 \le \sigma ^2 u^{5/4}/n\), it follows that \(\ell _u^2 \le \delta _0/A\).

By (24), there exists a positive constant \(K\) depending only on the ratio \(c_1/c_2\) such that

$$\begin{aligned} \int _0^r \sqrt{\log M(\epsilon , S(\phi _0, r), \ell )} d\epsilon&\le K \left( \log \frac{en}{2c_1} \right) ^{5/8} \inf _{\alpha \in {\mathcal {P}}} \left[ k^{5/8}(\alpha ) r^{3/4} \left( r^2 + \ell ^2(\phi _0, \alpha ) \right) ^{1/8} \right] \\&\le K \left( \log \frac{en}{2c_1} \right) ^{5/8} \inf _{\alpha \in {\mathcal {P}}_u} \left[ k^{5/8}(\alpha ) r^{3/4} \left( r^2 + \ell ^2(\phi _0, \alpha ) \right) ^{1/8} \right] \\&\le K \left( \log \frac{en}{2c_1} \right) ^{5/8} u^{5/8} r^{3/4} \left( r^2 + \ell _u^2 \right) ^{1/8}. \end{aligned}$$

for every \(r > 0\). Let \(H(r)\) denote the right hand side above. It is clear that \(H(r)/r^2\) is decreasing on \((0, \infty )\). As a result, a condition of the form \(\sqrt{n} \delta \ge C \sigma H(\sqrt{\delta })\) for some positive constant \(C\) holds for every \(\delta \ge \delta _0\) provided it holds for \(\delta = \delta _0\). Because \(\ell _u^2 \le \delta _0/A\), we have

$$\begin{aligned} H(\sqrt{\delta _0}) \le K \left( \log \frac{en}{2c_1} \right) ^{5/8} u^{5/8} \sqrt{\delta _0} \left( 1 + \frac{1}{A} \right) ^{1/8}. \end{aligned}$$

Consequently, since \(\sqrt{\delta _0} = \sqrt{A} \left( \log \frac{en}{2c_1} \right) ^{5/8} u^{5/8} \sigma /\sqrt{n}\),

$$\begin{aligned} \frac{H(\sqrt{\delta _0})}{\delta _0} \le \frac{K}{\sqrt{A}} \left( 1 + \frac{1}{A} \right) ^{1/8} \frac{\sqrt{n}}{\sigma }. \end{aligned}$$
(29)

We shall now use Theorem 4.1. Let \(C\) be the positive constant given by Theorem 4.1. By inequality (29), we can clearly choose \(A\) depending only on \(K\) and \(C\) so that \(\sqrt{n} \delta _0 \ge C \sigma H(\sqrt{\delta _0})\). Because \(H(r)/r^2\) is a decreasing function of \(r\), this choice of \(A\) also ensures that \(\sqrt{n} \delta \ge C \sigma H(\sqrt{\delta })\) for every \(\delta \ge \delta _0\). Thus by Theorem 4.1, we obtain

$$\begin{aligned} {\mathbb P}_{\phi _0} \left( \ell ^2(\hat{\phi }_{ls}, \phi _0) > \delta \right) \le C \sum _{s \ge 0} \exp \left( - \frac{n2^{2s} \delta }{C^2 \sigma ^2} \right) \qquad \text {for all}\, \delta \ge \delta _0. \end{aligned}$$
(30)

Note further, from the definition of \(\delta _0\), that \(\delta _0 \ge \sigma ^2 A/n\), which implies that the sum on the right-hand side of (30) is bounded by a constant multiple of its first term. We thus have

$$\begin{aligned} {\mathbb P}_{\phi _0} \left( \ell ^2(\hat{\phi }_{ls}, \phi _0) > \delta \right) \le C_1 \exp \left( - \frac{n \delta }{C_1 \sigma ^2} \right) \qquad \text {for all}\, \delta \ge \delta _0 \end{aligned}$$

for a constant \(C_1\) depending only on \(C\) and \(A\). The required risk bound (28) is now derived by integrating both sides of the above inequality with respect to \(\delta \) and using that \(\delta _0 \ge \sigma ^2 A/n\).

5 Non-adaptable convex functions

We showed that the risk of the convex LSE is always bounded from above by \(n^{-4/5}\) up to logarithmic factors in \(n\), and that for convex functions that are well-approximable by piecewise affine functions with not too many pieces, the risk of the convex LSE is bounded by \(1/n\) up to logarithmic factors. The reason why the risk is much smaller for these functions is that the balls around them have small metric entropy. We also showed in Theorem 3.2 that for convex functions with curvature, the balls are genuinely non-local. Here, we show that for such convex functions, the rate \(n^{-4/5}\) cannot be improved by any estimator, in a very strong (local minimax) sense.

Recall the class of functions, \(\mathfrak {K}(a, b, \kappa _1, \kappa _2)\), that was defined in Theorem 3.2. The constants \(a, b, \kappa _1\) and \(\kappa _2\) will be fixed constants in this section and we shall therefore refer to \(\mathfrak {K}(a, b, \kappa _1, \kappa _2)\) by just \(\mathfrak {K}\). For every function \(\phi _0 \in \mathfrak {K}\), let us define the local neighborhood \(N(\phi _0)\) of \(\phi _0\) in \({\mathcal {C}}\) by

$$\begin{aligned} N(\phi _0) {:=} \left\{ \phi \in {\mathcal {C}}: \sup _{x \in [0, 1]}|\phi (x) - \phi _0(x)| \le \left( \frac{\kappa _2 c_1^2}{32}\right) ^{1/5} \left( \frac{\sigma ^2}{n} \right) ^{2/5} \right\} . \end{aligned}$$

Recall that the constant \(c_1\) is defined in (3). We define the local minimax risk of \(\phi _0 \in \mathfrak {K}\) to be

$$\begin{aligned} {\mathfrak {R}}_n(\phi _0) {:=} \inf _{\hat{\phi }} \sup _{\phi \in N(\phi _0)} {\mathbb E}_{\phi } \ell ^2(\phi , \hat{\phi }), \end{aligned}$$

the infimum above being over all possible estimators \(\hat{\phi }\). \({\mathfrak {R}}_n(\phi _0)\) represents the smallest possible risk under the knowledge that the unknown convex function \(\phi \) lies in the local neighborhood \(N(\phi _0)\) of \(\phi _0\).

In the next theorem, we shall show that the local minimax risk of every function \(\phi _0 \in \mathfrak {K}\) is bounded from below by a constant multiple of \(n^{-4/5}\). Observe that the \(\ell ^2\) diameter of \(N(\phi _0)\), defined as \(\sup _{\phi _1, \phi _2 \in N(\phi _0)} \ell ^2(\phi _1, \phi _2)\), is bounded from above by \(n^{-4/5}\) up to multiplicative factors that are independent of \(n\). Therefore, the supremum risk over \(N(\phi _0)\) of any reasonable estimator is bounded from above by \(n^{-4/5}\) up to multiplicative factors. The next theorem shows that if \(\phi _0 \in \mathfrak {K}\), then the supremum risk of every estimator is also bounded from below by \(n^{-4/5}\) up to multiplicative factors. Therefore, one cannot estimate \(\phi _0\) at a rate faster than \(n^{-4/5}\).
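
Indeed, the diameter claim is immediate from the definition of \(N(\phi _0)\): for any \(\phi _1, \phi _2 \in N(\phi _0)\),

$$\begin{aligned} \ell ^2(\phi _1, \phi _2) \le \left( \sup _{x \in [0, 1]}|\phi _1(x) - \phi _0(x)| + \sup _{x \in [0, 1]}|\phi _2(x) - \phi _0(x)| \right) ^2 \le 4 \left( \frac{\kappa _2 c_1^2}{32} \right) ^{2/5} \left( \frac{\sigma ^2}{n} \right) ^{4/5}. \end{aligned}$$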

Theorem 5.1

(Lower bound) For every \(\phi _0 \in \mathfrak {K}(a, b, \kappa _1, \kappa _2)\), we have

$$\begin{aligned} {\mathfrak {R}}_n(\phi _0) \ge \frac{\kappa _1^2}{4{,}096c_2} \left( \frac{\sqrt{c_1}}{\kappa _2}\right) ^{8/5} (b-a) \left( \frac{\sigma ^2}{n} \right) ^{4/5} \end{aligned}$$
(31)

provided \(n^2 \ge (2c_2)^{5/2} \kappa _2/(\sigma \sqrt{c_1})\).

Prototypical examples of functions in \(\mathfrak {K}\) include the power functions \(x^k\) for \(k \ge 2\), and the above theorem implies that the risk of every estimator is at least of order \(n^{-4/5}\) for these functions. Recall that the LSE attains the rate \(n^{-4/5}\), up to logarithmic factors of \(n\), for every convex function \(\phi _0\). In particular, the LSE is rate optimal (up to logarithmic factors) for all functions in \(\mathfrak {K}\).

Prominent examples of functions not in the class \(\mathfrak {K}\) include the piecewise affine convex functions. As shown in Theorem 2.3, faster rates are possible for these functions. Essentially, the LSE converges at the parametric rate (up to logarithmic factors) for these functions.

The hardest functions to estimate under the global risk are therefore smooth convex functions. This is in sharp contrast with pointwise risk estimation where, for example, the cusp of the function \(f(x) = |x|\) is the hardest point to estimate. In fact, one would expect a rate of \(n^{-2/3}\) near such cusp points (see [6] for a detailed study of pointwise estimation, although the estimators studied there are different from the LSE). For global estimation, however, the region over which one gets such slower rates is small enough not to affect the overall near-parametric rate for piecewise affine convex functions.

Our proof of Theorem 5.1 is based on an application of Assouad’s lemma, the following version of which is a consequence of Lemma 24.3 of Van der Vaart ([31], p. 347). We start by introducing some notation. Let \({\mathbb P}_{\phi }\) denote the joint distribution of the observations \((x_1, Y_1), \dots , (x_n, Y_n)\) when the true convex function equals \(\phi \). For two probability measures \(P\) and \(Q\) having densities \(p\) and \(q\) with respect to a common measure \(\mu \), the total variation distance, \(\Vert P-Q\Vert _{TV}\), is defined as \(\int (|p-q|/2) d\mu \) and the Kullback-Leibler divergence, \(D(P\Vert Q)\), is defined as \(\int p \log (p/q) d\mu \). Pinsker’s inequality asserts

$$\begin{aligned} D(P \Vert Q) \ge 2 \Vert P - Q \Vert _{TV}^2 \end{aligned}$$
(32)

for all probability measures \(P\) and \(Q\).

Lemma 5.2

(Assouad) Let \(m\) be a positive integer and suppose that, for each \(\tau \in \{0, 1\}^m\), there is an associated convex function \(\phi _{\tau }\) in \(N(\phi _0)\). Then the following inequality holds:

$$\begin{aligned} {\mathfrak {R}}_n(\phi _0) \ge \frac{m}{8} \min _{\tau \ne \tau '} \frac{\ell ^2(\phi _{\tau }, \phi _{\tau '})}{{\Upsilon }(\tau , \tau ')} \min _{{\Upsilon }(\tau , \tau ') = 1} \left( 1 - \Vert {\mathbb P}_{\phi _{\tau }} - {\mathbb P}_{\phi _{\tau '}}\Vert _{TV} \right) , \end{aligned}$$
(33)

where \({\Upsilon }(\tau , \tau ') {:=} \sum _{i=1}^{m} 1\{\tau _i \ne \tau '_i\}\) denotes the Hamming distance between \(\tau \) and \(\tau '\).

Proof of Theorem 5.1

Fix \(m \ge 1\) and consider the same construction \(\{\phi _{\tau }, \tau \in \{0, 1\}^m\}\) from the proof of Theorem 3.2. We saw there that

$$\begin{aligned} \sup _{x \in [0, 1]} |\phi _{\tau }(x) - \phi _0(x)| \le \frac{(b-a)^2 \kappa _2}{8 m^2} \end{aligned}$$
(34)

and that

$$\begin{aligned} \ell ^2(\phi _{\tau }, \phi _{\tau '}) \ge {\Upsilon }(\tau , \tau ') \frac{\kappa _1^2 (b-a)^5}{4{,}096 c_2 m^5} \end{aligned}$$
(35)

for every \(\tau , \tau ' \in \{0, 1\}^m\) provided \(n \ge 4mc_2/(b-a)\). Also, whenever \({\Upsilon }(\tau , \tau ') = 1\), it is clear that

$$\begin{aligned} \ell ^2(\phi _{\tau }, \phi _{\tau '}) \le \max _{1 \le i \le m} \ell ^2(\phi _0, \max (\phi _0, \alpha _i)). \end{aligned}$$

We use Lemma 7.1 to bound \(\ell ^2(\phi _0, \max (\phi _0, \alpha _i))\) from above. Specifically, we use inequality (39) with \(a = t_{i-1}\) and \(b = t_i\) to get

$$\begin{aligned} \ell ^2(\phi _0, \max (\phi _0, \alpha _i)) \le \frac{\kappa _2^2 (t_i - t_{i-1})^5}{32 c_1} = \frac{\kappa _2^2 (b-a)^5}{32 c_1 m^5} \end{aligned}$$

provided \(n \ge 4mc_1/(b-a)\). Thus under the assumption \(n \ge 4mc_2/(b-a)\), we have (35) and also (note that \(c_2 \ge c_1\))

$$\begin{aligned} \ell ^2(\phi _{\tau }, \phi _{\tau '}) \le \frac{\kappa _2^2 (b-a)^5}{32 c_1 m^5} \qquad \text {whenever}\, {\Upsilon }(\tau , \tau ') = 1. \end{aligned}$$

We apply Assouad’s lemma to these functions \(\phi _{\tau }\). By inequality (32), we get

$$\begin{aligned} \Vert {\mathbb P}_{\phi _{\tau }} - {\mathbb P}_{\phi _{\tau '}}\Vert ^2_{TV} \le \frac{1}{2} D({\mathbb P}_{\phi _{\tau }}\Vert {\mathbb P}_{\phi _{\tau '}}). \end{aligned}$$

By the Gaussian assumption and independence of the errors, the Kullback-Leibler divergence \(D({\mathbb P}_{\phi _{\tau }}\Vert {\mathbb P}_{\phi _{\tau '}})\) can be easily calculated to be \(n \ell ^2(\phi _{\tau }, \phi _{\tau '})/(2 \sigma ^2)\). We therefore obtain

$$\begin{aligned} \Vert {\mathbb P}_{\phi _{\tau }} - {\mathbb P}_{\phi _{\tau '}}\Vert _{TV} \le \frac{\sqrt{n}}{2 \sigma } \ell (\phi _{\tau }, \phi _{\tau '}). \end{aligned}$$
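
In more detail, \({\mathbb P}_{\phi }\) is a product of Gaussian distributions with means \(\phi (x_i)\) and common variance \(\sigma ^2\), so

$$\begin{aligned} D({\mathbb P}_{\phi _{\tau }}\Vert {\mathbb P}_{\phi _{\tau '}}) = \sum _{i=1}^n \frac{\left( \phi _{\tau }(x_i) - \phi _{\tau '}(x_i) \right) ^2}{2\sigma ^2} = \frac{n \ell ^2(\phi _{\tau }, \phi _{\tau '})}{2\sigma ^2}, \end{aligned}$$

and the previous display follows by combining this with Pinsker’s inequality (32).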

Thus by the application of (33), we obtain the following lower bound for \({\mathfrak {R}}_n(\phi _0)\):

$$\begin{aligned} {\mathfrak {R}}_n(\phi _0) \ge \frac{m}{8} \frac{\kappa _1^2 (b-a)^5}{4{,}096 m^5 c_2} \left( 1 - \frac{\sqrt{n}\kappa _2}{2\sigma } \sqrt{\frac{(b-a)^5}{m^5 32 c_1}} \right) \end{aligned}$$
(36)

provided \(\phi _{\tau } \in N(\phi _0)\) for each \(\tau \). We make the choice

$$\begin{aligned} \frac{m}{b-a} {:=} \left( \frac{\sqrt{n}\kappa _2}{\sigma \sqrt{32 c_1}} \right) ^{2/5}. \end{aligned}$$
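
With this choice of \(m\), a direct substitution shows that the bound in (34) equals the radius of \(N(\phi _0)\) and that the term subtracted inside the parentheses in (36) equals \(1/2\):

$$\begin{aligned} \frac{(b-a)^2 \kappa _2}{8 m^2} = \frac{\kappa _2}{8} \left( \frac{\sigma \sqrt{32 c_1}}{\sqrt{n} \kappa _2} \right) ^{4/5} = \left( \frac{\kappa _2 c_1^2}{32} \right) ^{1/5} \left( \frac{\sigma ^2}{n} \right) ^{2/5} \quad \text {and} \quad \frac{\sqrt{n}\kappa _2}{2\sigma \sqrt{32 c_1}} \left( \frac{b-a}{m} \right) ^{5/2} = \frac{1}{2}. \end{aligned}$$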

In particular, (34) implies that \(\phi _{\tau } \in N(\phi _0)\), and the inequality (31) follows from (36). The constraint \(n \ge 4c_2m/(b-a)\) translates to

$$\begin{aligned} n^2 \ge (2c_2)^{5/2} \kappa _2/(\sigma \sqrt{c_1}). \end{aligned}$$

The proof is complete. \(\square \)

6 Model misspecification

In this section, we evaluate the performance of the convex LSE \(\hat{\phi }_{ls}\) in the case when the unknown regression function (to be denoted by \(f_0\)) is not necessarily convex. Specifically, suppose that \(f_0\) is an unknown function on \([0, 1]\) that is not necessarily convex. We consider observations \((x_1, Y_1), \dots , (x_n, Y_n)\) from the model:

$$\begin{aligned} Y_i = f_0(x_i) + \xi _i, \qquad \text {for}\quad i = 1, \dots , n, \end{aligned}$$

where \(x_1< \dots < x_n\) are fixed design points in \([0, 1]\) and \(\xi _1, \dots , \xi _n\) are independent normal variables with zero mean and variance \(\sigma ^2\).

The convex LSE \(\hat{\phi }_{ls}\) is defined in the same way as before, as any convex function that minimizes the sum of squares criterion. Since the true function \(f_0\) is not necessarily convex, it turns out that the LSE is really estimating the convex projections of \(f_0\). Any convex function \(\phi _0\) on \([0, 1]\) that minimizes \(\ell ^2(f_0, \psi )\) over \(\psi \in {\mathcal {C}}\) is called a convex projection of \(f_0\), i.e.,

$$\begin{aligned} \phi _0 \in \mathop {\mathrm{argmin}}_{\psi \in {\mathcal {C}}} \sum _{i=1}^n \left( f_0(x_i) - \psi (x_i)\right) ^2. \end{aligned}$$

Convex projections are not unique. However, because \(\{(\phi (x_1), \dots , \phi (x_n)): \phi \in {\mathcal {C}}\}\) is a closed convex subset of \({\mathbb R}^n\), it follows (see, for example, Stark and Yang ([28], Chapter 2)) that the vector \((\phi _0(x_1), \dots , \phi _0(x_n))\) is the same for every convex projection \(\phi _0\) and, moreover, we have the inequality:

$$\begin{aligned} \ell ^2(f_0, \phi ) \ge \ell ^2(f_0, \phi _0) + \ell ^2(\phi _0, \phi ) \qquad \text {for every}\, \phi \in {\mathcal {C}}. \end{aligned}$$
(37)
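
As a purely illustrative aside (not needed for the theory), both the convex projection and the convex LSE are solutions of finite-dimensional quadratic programs, since a vector is the restriction of a convex function to the design points exactly when its chord slopes are nondecreasing. The following is a minimal numerical sketch under stated assumptions: it assumes the cvxpy package, uses an arbitrarily chosen concave \(f_0\), and the helper name convex_fit is ours; it computes both fits and checks inequality (37) numerically.

```python
import numpy as np
import cvxpy as cp

def convex_fit(x, y):
    """Least squares fit of a convex sequence over design points x[0] < ... < x[-1].

    Convexity is imposed through nondecreasing chord slopes, which characterizes
    vectors of the form (phi(x_1), ..., phi(x_n)) for a convex function phi.
    """
    theta = cp.Variable(len(x))
    slopes = cp.multiply(1.0 / np.diff(x), cp.diff(theta))    # chord slopes
    problem = cp.Problem(cp.Minimize(cp.sum_squares(y - theta)),
                         [cp.diff(slopes) >= 0])              # slopes nondecreasing
    problem.solve()
    return theta.value

rng = np.random.default_rng(0)
n, sigma = 200, 0.1
x = np.sort(rng.uniform(size=n))
f0 = np.sin(np.pi * x)                    # a non-convex (in fact concave) regression function
y = f0 + sigma * rng.standard_normal(n)

phi0 = convex_fit(x, f0)                  # convex projection of f0 at the design points
phi_ls = convex_fit(x, y)                 # convex LSE computed from the noisy data

ell2 = lambda u, v: np.mean((u - v) ** 2)
lhs, rhs = ell2(f0, phi_ls), ell2(f0, phi0) + ell2(phi0, phi_ls)
print(lhs >= rhs - 1e-6)                  # inequality (37) with phi = phi_ls, up to solver tolerance
```

Since this \(f_0\) is concave, the computed projection should be (up to solver tolerance) affine, consistent with Lemma 7.5 below.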

The following is the main result of this section. It is the exact analogue of Theorem 2.1 for the case of model misspecification.

Theorem 6.1

Let \(\phi _0\) denote any convex projection of \(f_0\) and let \(R {:=} \max (1, {\mathfrak {L}}(\phi _0))\). There exists a positive constant \(C\) depending only on the ratio \(c_1/c_2\) such that

$$\begin{aligned}&{\mathbb E}_{f_0} \ell ^2(\hat{\phi }_{ls}, \phi _0) \le C \left( \log \frac{en}{2c_1} \right) ^{5/4}\min \left[ \left( \frac{\sigma ^2 \sqrt{R}}{n} \right) ^{4/5},\right. \\&\quad \left. \inf _{\alpha \in {\mathcal {P}}} \left( \ell ^2(\phi _0, \alpha ) + \frac{\sigma ^2 k^{5/4}(\alpha )}{n} \right) \right] \end{aligned}$$

provided

$$\begin{aligned} n \ge C \frac{\sigma ^2}{R^2} \left( \log \frac{en}{2c_1} \right) ^{5/4}. \end{aligned}$$

We omit the proof of this theorem because it is similar to the proof of Theorem 2.1. It is based on the metric entropy results from Sect. 3 and the following result from the literature on the risk behavior of ERMs.

Theorem 6.2

Let \(\phi _0\) denote any convex projection of \(f_0\). Suppose \(H\) is a function on \((0, \infty )\) such that

$$\begin{aligned} H(r) \ge \int _0^r \sqrt{\log M(\epsilon , S(\phi _0, r))} d\epsilon \qquad \text {for every}\, r > 0 \end{aligned}$$

and such that \(H(r)/r^2\) is decreasing on \((0, \infty )\). Then there exists a universal constant \(C\) such that

$$\begin{aligned} {\mathbb P}_{f_0} \left( \ell ^2(\hat{\phi }_{ls}, \phi _0) > \delta \right) \le C \sum _{s \ge 0} \exp \left( -\frac{n 2^{2s} \delta }{C^2 \sigma ^2} \right) \end{aligned}$$

for every \(\delta > 0\) satisfying \(\sqrt{n} \delta \ge C \sigma H(\sqrt{\delta })\).

This result is very similar to Theorem 4.1. Its proof proceeds in the same way as the proof of Theorem 4.1 (see Van de Geer [30], proof of Theorem 9.1). We provide below a sketch of its proof for the convenience of the reader.

Proof of Theorem 6.2

Because \(\phi _0\) is convex, we have, by the definition of \(\hat{\phi }_{ls}\), that

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^n \left( Y_i - \hat{\phi }_{ls}(x_i) \right) ^2 \le \frac{1}{n} \sum _{i=1}^n \left( Y_i - \phi _0(x_i) \right) ^2. \end{aligned}$$

Writing \(Y_i = f_0(x_i) + \xi _i\), expanding both squares, and cancelling the common term \(n^{-1}\sum _{i=1}^n \xi _i^2\), we get

$$\begin{aligned} \ell ^2(f_0, \hat{\phi }_{ls}) - \ell ^2(f_0, \phi _0) \le \frac{2}{n} \sum _{i=1}^n \xi _i \left( \hat{\phi }_{ls}(x_i) - \phi _0(x_i) \right) . \end{aligned}$$

Inequality (37) applied with \(\phi = \hat{\phi }_{ls}\) gives

$$\begin{aligned} \ell ^2(\hat{\phi }_{ls}, \phi _0) \le \ell ^2(f_0, \hat{\phi }_{ls}) - \ell ^2(f_0, \phi _0). \end{aligned}$$

Combining the above two inequalities, we obtain

$$\begin{aligned} \ell ^2( \hat{\phi }_{ls}, \phi _0) \le \frac{2}{n} \sum _{i=1}^n \xi _i \left( \hat{\phi }_{ls}(x_i) - \phi _0(x_i) \right) . \end{aligned}$$

This is of the same form as the “basic inequality” of Van de Geer ([30], p. 148). From here, the proof proceeds just as the proof of Theorem 9.1 in [30]. \(\square \)

Theorem 6.1 shows that one gets adaptation in the misspecified case provided \(f_0\) has a convex projection that is well-approximable by a piecewise affine convex function with not too many pieces. An illuminating example of this occurs when \(f_0\) is a concave function. In this case, we show in Lemma 7.5 (stated and proved in the Appendix) that \(\phi _0\) can be taken to be an affine function, i.e., \(\phi _0 \in {\mathcal {P}}_1\). As a result, it follows that if \(f_0\) is concave, then the risk of \(\hat{\phi }_{ls}\), measured from any convex projection of \(f_0\), is bounded from above by the parametric rate up to a logarithmic factor of \(n\).