Abstract
We consider regression function estimation over a bounded domain of arbitrary shape based on triangulation and penalization techniques. A total variation type penalty is imposed to encourage fusion of adjacent triangles, which leads to a partition of the domain consisting of disjoint polygons. The proposed method provides a piecewise linear and continuous estimator over a data-adaptive polygonal partition of the domain. We adopt a coordinate descent algorithm to handle the non-separable structure of the penalty and investigate its convergence properties. Regarding the asymptotic results, we establish an oracle-type inequality and the convergence rate of the proposed estimator. A numerical study is carried out to illustrate the performance of the method. An R software package polygram is available.
References
Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.
Bregman, L. M. (1966). A relaxation method of finding a common point of convex sets and its application to problems of optimization. Soviet Mathematics Doklady, 7, 1578–1581.
Breiman, L. (1991). The \(\Pi\) method for estimating multivariate functions from noisy data. Technometrics, 33(2), 125–143.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth.
Bunea, F., Tsybakov, A., & Wegkamp, M. (2007). Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics, 1, 169–194.
Courant, R. (1943). Variational methods for the solution of problems of equilibrium and vibrations. Bulletin of the American Mathematical Society, 49(1), 1–23.
De Boor, C. (1978). A practical guide to splines (Vol. 27). Springer.
Nychka, D., Furrer, R., Paige, J., & Sain, S. (2017). fields: Tools for spatial data. Boulder, CO, USA: University Corporation for Atmospheric Research. R package version 9.8-1.
Franke, R. (1979). A critical comparison of some methods for interpolation of scattered data. Naval Postgraduate School Monterey CA: Technical report.
Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.
Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1–67.
Friedman, J. H., & Silverman, B. W. (1989). Flexible parsimonious smoothing and additive modeling. Technometrics, 31(1), 3–21.
Gaines, B. R., Kim, J., & Zhou, H. (2018). Algorithms for fitting the constrained lasso. Journal of Computational and Graphical Statistics, 27(4), 861–871.
Gu, C., Bates, D., Chen, Z., & Wahba, G. (1989). The computation of GCV functions through Householder tridiagonalization with application to the fitting of interaction spline models. SIAM Journal on Matrix Analysis and Applications, 10, 457–480.
Hansen, M. (1994). Extended linear models, multivariate splines, and ANOVA. Ph.D. dissertation, University of California, Berkeley.
Hansen, M., Kooperberg, C., & Sardy, S. (1998). Triogram models. Journal of the American Statistical Association, 93(441), 101–119.
He, X., & Shi, P. (1996). Bivariate tensor-product B-splines in a partly linear model. Journal of Multivariate Analysis, 58(2), 162–181.
Huang, J. Z. (1998). Projection estimation in multiple regression with application to functional ANOVA models. The Annals of Statistics, 26(1), 242–272.
Huang, J. Z. (2003). Asymptotics for polynomial spline regression under weak conditions. Statistics & Probability Letters, 65(3), 207–216.
Huang, J. Z. (2003). Local asymptotics for polynomial spline regression. The Annals of Statistics, 31(5), 1600–1635.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). ISLR: Data for an Introduction to Statistical Learning with Applications in R. R package version 1.2.
James, G.M., Paulson, C., & Rusmevichientong, P. (2013). Penalized and constrained regression. Unpublished manuscript, http://www.bcf.usc.edu/~gareth/research/Research.html
Jhong, J. H., Koo, J. Y., & Lee, S. W. (2017). Penalized B-spline estimator for regression functions using total variation penalty. Journal of Statistical Planning and Inference, 184, 77–93.
Keller, J. M., Gray, M. R., & Givens, J. A. (1985). A fuzzy k-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics, SMC-15(4), 580–585.
Koenker, R., & Mizera, I. (2004). Penalized triograms: Total variation regularization for bivariate smoothing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1), 145–163.
Kooperberg, C., Stone, C. J., & Truong, Y. K. (1995a). The L2 rate of convergence for hazard regression. Scandinavian Journal of Statistics, 22, 143–157.
Kooperberg, C., Stone, C. J., & Truong, Y. K. (1995b). Rate of convergence for logspline spectral density estimation. Journal of Time Series Analysis, 16(4), 389–401.
Lai, M. J. (2007). Multivariate splines for data fitting and approximation. In Approximation Theory XII: San Antonio 2007 (pp. 210–228). Nashboro Press.
Lai, M. J., & Schumaker, L. L. (1998). On the approximation power of bivariate splines. Advances in Computational Mathematics, 9(3–4), 251–279.
Lai, M. J., & Schumaker, L. L. (2007). Spline functions on triangulations. Cambridge University Press.
Lai, M. J., & Wang, L. (2013). Bivariate penalized splines for regression. Statistica Sinica, 23(3), 1399–1417.
Lange, K., Hunter, D. R., & Yang, I. (2000). Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1), 1–20.
Meier, L., Van De Geer, S., & Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 53–71.
Ramsay, T. (2002). Spline smoothing over difficult regions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(2), 307–319.
Rippa, S. (1992). Adaptive approximation by piecewise linear polynomials on triangulations of subsets of scattered data. SIAM Journal on Scientific and Statistical Computing, 13(5), 1123–1141.
Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics, 11(4), 735–757.
Schwarz, G., et al. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Shewchuk, J. R. (1996). Triangle: Engineering a 2D quality mesh generator and Delaunay triangulator. In M. C. Lin & D. Manocha (Eds.), Applied Computational Geometry: Towards Geometric Engineering, Lecture Notes in Computer Science (Vol. 1148, pp. 203–222). Springer-Verlag. From the First ACM Workshop on Applied Computational Geometry.
Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4), 1040–1053.
Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals of Statistics, 13(2), 689–705.
Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. The Annals of Statistics, 14(2), 590–606.
Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. The Annals of Statistics, 22(1), 118–171.
Stone, C. J., Hansen, M. H., Kooperberg, C., & Truong, Y. K. (1997). Polynomial splines and their tensor products in extended linear modeling: 1994 Wald Memorial Lecture. The Annals of Statistics, 25(4), 1371–1470.
Szeliski, R. (2010). Computer vision: algorithms and applications. Springer.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
Tibshirani, R., & Saunders, M. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91–108.
Tibshirani, R. J., & Taylor, J. (2011). The solution path of the generalized lasso. The Annals of Statistics, 39(3), 1335–1371.
Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3), 475–494.
Xiao, L. (2019). Asymptotics of bivariate penalised splines. Journal of Nonparametric Statistics, 31(2), 289–314.
Xiao, L., Li, Y., & Ruppert, D. (2013). Fast bivariate p-splines: The sandwich smoother. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 577–599.
Ye, G. B., & Xie, X. (2011). Split Bregman method for large scale fused lasso. Computational Statistics & Data Analysis, 55(4), 1552–1569.
Yu, D., Won, J. H., Lee, T., Lim, J., & Yoon, S. (2015). High-dimensional fused lasso regression using majorization-minimization and parallel processing. Journal of Computational and Graphical Statistics, 24(1), 121–153.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Acknowledgements
The research of Ja-Yong Koo was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2018R1D1A1B07049972). The research of Jae-Hwan Jhong was supported by the NRF (NRF-2019R1A6A3A01096135 and NRF-2020R1G1A1A01100869). The research of Kwan-Young Bak was supported by the NRF (NRF-2021R1A6A3A01086417).
Appendices
A Preliminaries
A.1 Triangulations
Let \(\Omega\) be a compact and convex polygon in the plane. A collection \(\triangle = \{T_1, \ldots , T_G\}\) of triangles in the plane with pairwise disjoint interiors whose union is \(\Omega\), i.e., \(\Omega = \bigcup _{T \in \triangle } T\), is called a triangulation of \(\Omega\). In this paper, we assume that each element \(T \in \triangle\) is a planar triangle. The vertices of the triangles of \(\triangle\) are called the vertices of the triangulation \(\triangle\). If a vertex v is a boundary point of \(\Omega\), we say that it is a boundary vertex; otherwise, we call it an interior vertex. Any two triangles in \(\triangle\) are either completely disjoint, or their intersection is a common vertex or a common edge.
Figure 8a shows the basic notation for the triangle \({\langle 123 \rangle }\). In Fig. 8b, the edge \(e_{23}\) is a common edge of the two triangles \({\langle 123 \rangle }\) and \({\langle 243 \rangle }\). Given a triangulation \(\triangle\), we define a vertex-dependent sub-triangulation called a star. The \({{\,\mathrm{star}\,}}(v)\) of a vertex v is the set of all triangles in \(\triangle\) that share the vertex v. Figure 8c displays \({{\,\mathrm{star}\,}}(v)\). A minimal code sketch for extracting \({{\,\mathrm{star}\,}}(v)\) from a list of triangles is given below.
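The following sketch, referenced above, collects the star of a vertex from a triangulation stored as vertex-index triples. The array representation and the function name `star` are assumptions made here for illustration; this is not part of the paper or the polygram package.

```python
import numpy as np

# A triangulation stored as vertex-index triples, one row per triangle
# (an illustrative representation; any triangle list works the same way).
triangles = np.array([
    [0, 1, 2],
    [1, 3, 2],
    [0, 2, 4],
])

def star(v, triangles):
    """Return the rows of `triangles` that share the vertex index v."""
    mask = np.any(triangles == v, axis=1)
    return triangles[mask]

print(star(2, triangles))  # all three triangles above contain vertex 2
```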
A.2 Barycentric coordinates
Choose a triangle \({\langle 123 \rangle }\) with vertices \(v_1, v_2, v_3\) for example. The barycentric coordinate vector of \(x = (x_1, x_2)\) relative to the triangle \({\langle 123 \rangle }\) is defined as
$$\begin{aligned} b^{123}(x) = \left( b^{123}_1(x), b^{123}_2(x), b^{123}_3(x) \right) \end{aligned}$$
satisfying the conditions
$$\begin{aligned} x = b^{123}_1(x)\, v_1 + b^{123}_2(x)\, v_2 + b^{123}_3(x)\, v_3 \quad \text{and} \quad b^{123}_1(x) + b^{123}_2(x) + b^{123}_3(x) = 1. \end{aligned}$$
By Cramer’s rule,
$$\begin{aligned} b^{123}_1(x) = \frac{A^{x23}}{A^{123}}, \quad b^{123}_2(x) = \frac{A^{1x3}}{A^{123}}, \quad b^{123}_3(x) = \frac{A^{12x}}{A^{123}}, \end{aligned}$$
where \(A^{abc}\) denotes the signed area of the triangle with vertices a, b, c and
$$\begin{aligned} A^{123} = \frac{1}{2} \det \begin{pmatrix} 1 & 1 & 1 \\ v_1 & v_2 & v_3 \end{pmatrix} \end{aligned}$$
is the signed area of the triangle \({\langle 123 \rangle }\). The values \(b^{123}_1(x)\), \(b^{123}_2(x)\) and \(b^{123}_3(x)\) are the relative areas of the green, the blue and the red triangles, respectively; see Fig. 9. A minimal computational sketch of these area ratios is given below.
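As a minimal illustration of the area-ratio formulas above, the following sketch computes barycentric coordinates from signed areas. The function names and the array representation of the vertices are choices made here for illustration, not part of the paper.

```python
import numpy as np

def signed_area(a, b, c):
    """Signed area of the triangle with vertices a, b, c (counter-clockwise positive)."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))

def barycentric(x, v1, v2, v3):
    """Barycentric coordinates of x relative to the triangle <123> as ratios of signed areas."""
    A = signed_area(v1, v2, v3)
    return np.array([
        signed_area(x, v2, v3) / A,   # b_1: relative area of the sub-triangle opposite v_1
        signed_area(v1, x, v3) / A,   # b_2
        signed_area(v1, v2, x) / A,   # b_3
    ])

# The coordinates sum to one and recover x as a convex combination of the vertices.
v1, v2, v3 = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])
b = barycentric(np.array([0.25, 0.25]), v1, v2, v3)
assert np.isclose(b.sum(), 1.0)
```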
A.3 Linear hat splines
Suppose a linear spline s belongs to the space \({\mathcal {S}}\) of continuous linear splines. Given a vertex set \(V = \{v_1, \ldots , v_J\}\) in the triangulation \(\triangle\), the continuous linear hat splines \(B_1, \ldots , B_J\) are defined as
$$\begin{aligned} B_j(x) = {\left\{ \begin{array}{ll} b^{T_x}_j(x) &{} \text {if } v_j \text { is a vertex of } T_x \\ 0 &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
where \(T_x\) is the triangle which contains the point x and \(b^{T_x}_j(x)\) denotes the barycentric coordinate of x in \(T_x\) associated with the vertex \(v_j\). Then the hat splines \(B_1, \ldots , B_J\) form a basis for \({\mathcal {S}}\), which means that any continuous linear spline \(s \in {\mathcal {S}}\) can be expressed as
$$\begin{aligned} s(x) = \sum _{j = 1}^J \beta _j B_j(x) \end{aligned}$$
for some coefficient vector \(\beta = (\beta _1, \ldots , \beta _J)\). A small evaluation sketch is given below.
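To make the basis representation concrete, the following sketch evaluates \(s(x) = \sum_j \beta_j B_j(x)\) by locating the triangle containing x and combining its barycentric coordinates with the coefficients attached to its three vertices. It reuses the barycentric() helper from the sketch in Appendix A.2; the data layout (index triples, coefficient vector) is assumed for illustration and is not the polygram implementation.

```python
import numpy as np

def evaluate_spline(x, vertices, triangles, beta):
    """Evaluate s(x) = sum_j beta_j B_j(x) for the continuous linear hat-spline basis.

    vertices : (J, 2) array of vertex coordinates v_1, ..., v_J
    triangles: (G, 3) array of vertex indices defining the triangulation
    beta     : (J,) coefficient vector
    Uses barycentric() from the sketch in Appendix A.2.
    """
    for tri in triangles:
        b = barycentric(x, *vertices[tri])
        # x lies in this triangle iff all barycentric coordinates are nonnegative
        if np.all(b >= -1e-12):
            # only the hat splines of the three vertices of T_x are nonzero at x
            return float(np.dot(b, beta[tri]))
    raise ValueError("x lies outside the triangulation")
```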
B Algorithm details
We describe each step of the algorithm in detail.
- Descent step: Let \({\tilde{\beta }} = ({\tilde{\beta }}_1, \ldots , {\tilde{\beta }}_J)\) be the current value of the coefficients. For each \(j = 1, \ldots , J\), the algorithm partially optimizes over a single coefficient \(\beta _j\), holding the other coefficients fixed at their current values \(\{{\tilde{\beta }}_\ell \}_{\ell \ne j}\):
$$\begin{aligned} {\tilde{\beta }}_j \leftarrow \mathop {\mathrm {argmin}}\limits _{\beta _j \in \mathbb {R}} R^\lambda \left( {\tilde{\beta }}_1, \ldots , {\tilde{\beta }}_{j - 1}, \beta _j, {\tilde{\beta }}_{j + 1}, \ldots , {\tilde{\beta }}_J \right) . \end{aligned}$$Ignoring terms independent of \(\beta _j\), we have
$$\begin{aligned}&R^\lambda ({\tilde{\beta }}_1, \ldots , {\tilde{\beta }}_{j - 1}, \beta _j, {\tilde{\beta }}_{j + 1}, \ldots , {\tilde{\beta }}_J) \nonumber \\&= \frac{1}{N} \sum _{n = 1}^N \left( y_n - \sum _{\ell \ne j} {\tilde{\beta }}_\ell G_\ell (x_n) - \beta _j G_j(x_n) \right) ^2 + \lambda \sum _{k = 1}^K {\left|\beta _j c_{kj} + \sum _{\ell \ne j} {\tilde{\beta }}_\ell c_{k\ell } \right|} \nonumber \\&= \frac{1}{N} \sum _{n = 1}^N \left( { y_{nj} - \beta _j G_j(x_n)} \right) ^2 + \lambda \sum _{k:c_{kj} \ne 0} {\left|c_{kj} \right|} {\left|\beta _j - \left( - \sum _{\ell \ne j} \frac{c_{k\ell }}{c_{kj}} {\tilde{\beta }}_\ell \right) \right|} \\&\quad + (\text{ terms } \text{ independent } \text{ of } \beta _j) \nonumber \end{aligned}$$(9)with \(y_{nj} = y_n - \sum _{\ell \ne j} {\tilde{\beta }}_\ell G_\ell (x_n)\) being the partial residual. Thus, minimizing (9) with respect to \(\beta _j\) reduces to the univariate convex optimization problem
$$\begin{aligned} \min _{\beta _j \in \mathbb {R}} \frac{a}{2} (\beta _j - b)^2 + \lambda \sum _{m = 1}^M d_m{\left|\beta _j - e_m \right|}, \end{aligned}$$(10)where
$$\begin{aligned} a = \frac{2}{N} \sum _{n = 1}^N G_j^2(x_n), \quad b = \frac{\sum _{n = 1}^N y_{nj} G_j(x_n)}{\sum _{n = 1}^N G_j^2(x_n)}, \quad d_m = {\left|c_{mj} \right|}, \quad e_m = - \sum _{\ell \ne j} \frac{c_{m\ell }}{c_{mj}} {\tilde{\beta }}_\ell . \end{aligned}$$The positive integer \(M = {\left|{\{ k: c_{kj} \ne 0 \}} \right|}\) is relatively small because the matrix C is sparse. Observe that (10) is a strictly convex, piecewise quadratic function of \(\beta _j\) and hence always has a unique global minimum. By Theorem 1 of Jhong et al. (2017), we obtain the exact solution of this problem; a minimal sketch of this univariate update is given after this step. Thus, the algorithm iteratively updates the coefficients by the minimizer of (10) until convergence.
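The following is a minimal sketch of an exact univariate update for (10), referenced in the descent step above. It is not the polygram implementation; it simply scans the kinks \(e_m\) (assumed distinct here) and the quadratic pieces between them, relying only on strict convexity. The names univariate_update, a, b, d, e follow the notation of (10).

```python
import numpy as np

def univariate_update(a, b, lam, d, e):
    """Exact minimizer of (a/2)*(beta - b)**2 + lam * sum_m d[m] * |beta - e[m]|
    with a > 0, d >= 0, and distinct kink locations e[m]."""
    d = np.asarray(d, dtype=float)
    e = np.asarray(e, dtype=float)
    order = np.argsort(e)
    d, e = d[order], e[order]
    M = len(e)

    # Case 1: a kink e[m] is the minimizer iff 0 lies between the one-sided
    # derivatives of the objective at e[m].
    for m in range(M):
        base = a * (e[m] - b) + lam * (d[:m].sum() - d[m + 1:].sum())
        if base - lam * d[m] <= 0.0 <= base + lam * d[m]:
            return e[m]

    # Case 2: the minimizer is the stationary point of one quadratic piece.
    # On the piece with i kinks to its left, sign(beta - e[m]) = +1 for m < i
    # and -1 for m >= i, so the stationary point is b - (lam/a) * sum_m d[m]*sign_m.
    knots = np.concatenate(([-np.inf], e, [np.inf]))
    for i in range(M + 1):
        signs = np.where(np.arange(M) < i, 1.0, -1.0)
        beta = b - (lam / a) * float(np.dot(d, signs))
        if knots[i] < beta < knots[i + 1]:
            return beta
    raise RuntimeError("unreachable for a > 0: the objective is strictly convex")
```

Because M is small when C is sparse, this O(M^2) scan is inexpensive within each coordinate update.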
- Pruning step: The descent step moves one parameter at a time. This approach may not be effective because the penalty function is not separable. One way to resolve this problem is to consider a reparameterization using the active constraints. The algorithm removes the edges corresponding to the active constraints and updates the basis and the constraint matrix. When a single active condition is satisfied during the descent step, we recast problem (4) as
$$\begin{aligned} \min _{\beta \in \mathbb {R}^{J - 1}} \frac{1}{N} {\left|y - {\tilde{G}} \beta \right|}_2^2 + \lambda {\left|{\tilde{C}}^\top \beta \right|}_1, \end{aligned}$$(11)where \({\tilde{G}}\) and \({\tilde{C}}\) are the updated basis and constraint matrices that satisfy the active condition.
The algorithm reduces the dimensions J and K by pruning the corresponding common edge. To this end, we need one non-zero entry of \(c_k\) as a baseline. For numerical stability, the baseline is chosen as the element of \(c_k\) with the largest absolute value. The remaining steps are as follows.
1. Let \(c_{kj}\) be the baseline. The active condition can then be written as \({\tilde{\beta }}_j = - ( {c_k^{-j}})^\top {\tilde{\beta }}^{-j} / c_{kj}\) with respect to \({\tilde{\beta }}_j\), where \(c_k^{-j}\) and \({\tilde{\beta }}^{-j}\) are the \((J-1)\)-dimensional vectors obtained by deleting the jth entries \(c_{kj}\) and \({\tilde{\beta }}_j\) from \(c_k\) and \({\tilde{\beta }}\), respectively.
2. Update G and C: Let \({\tilde{G}}_\ell = G_\ell - (c_{k \ell } / c_{kj}) G_j\) for \(\ell = 1, \ldots , J\), \(\ell \ne j\); then the new design matrix becomes
$$\begin{aligned} {\tilde{G}} = \begin{bmatrix} {\tilde{G}}_1&\cdots&{\tilde{G}}_{j - 1}&{\tilde{G}}_{j + 1}&\cdots&{\tilde{G}}_J \end{bmatrix} \in \mathbb {R}^{N \times (J - 1)}. \end{aligned}$$Similarly, the updated constraint matrix has the form
$$\begin{aligned} {\tilde{C}} = \begin{bmatrix} {\tilde{c}}_1&\cdots&{\tilde{c}}_{k - 1}&{\tilde{c}}_{k + 1}&\cdots&{\tilde{c}}_K \end{bmatrix} \in \mathbb {R}^{(J - 1) \times (K - 1)}, \end{aligned}$$where \({\tilde{c}}_m = c_m^{-j} - (c_{mj} / c_{kj})\, c_k^{-j}\) for \(m = 1, \ldots , K\), \(m \ne k\).
3. Now, the minimization problem (4) subject to \(c_k^\top {\tilde{\beta }} = 0\) is equivalent to (11). We then update the notation as follows:
$$\begin{aligned} G \leftarrow {\tilde{G}}, \quad C \leftarrow {\tilde{C}}, \quad {\tilde{\beta }} \leftarrow {\tilde{\beta }}^{-j}, \quad J \leftarrow J-1, \quad K \leftarrow K-1. \end{aligned}$$
The algorithm simply removes the kth column from C if \(c_k\) is a zero vector. This case does not occur early in the algorithm, but it may appear over a series of updates because the initial constraint matrix is not of full rank. That is, there is redundancy among the edge-removal conditions of the initial interior edges, so some edges are removed together in a single pruning step. Therefore, in practice, the algorithm requires far fewer pruning steps than the initial number of edges. A minimal code sketch of the reparameterization in steps 1–2 is given below.
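The sketch below, referenced in the pruning step, performs the reparameterization for one active constraint. The function name and the array layout (G of size N×J, C of size J×K with columns \(c_k\)) are assumptions for illustration, not the polygram implementation.

```python
import numpy as np

def prune_active_constraint(G, C, beta, k):
    """Reparameterize after the constraint c_k^T beta = 0 becomes active.

    G    : (N, J) design matrix of hat-spline evaluations G_1, ..., G_J
    C    : (J, K) constraint matrix with columns c_1, ..., c_K
    beta : (J,) current coefficient vector
    k    : index of the active constraint
    Returns the reduced design matrix, constraint matrix, and coefficients.
    """
    c_k = C[:, k]
    j = int(np.argmax(np.abs(c_k)))            # baseline c_{kj}: largest absolute value
    keep_j = np.arange(G.shape[1]) != j
    keep_k = np.arange(C.shape[1]) != k

    # G_l <- G_l - (c_{kl} / c_{kj}) * G_j  for l != j
    G_new = G[:, keep_j] - np.outer(G[:, j], c_k[keep_j] / c_k[j])
    # c_m^{-j} <- c_m^{-j} - (c_{mj} / c_{kj}) * c_k^{-j}  for m != k
    C_new = C[np.ix_(keep_j, keep_k)] - np.outer(c_k[keep_j] / c_k[j], C[j, keep_k])
    beta_new = beta[keep_j]
    return G_new, C_new, beta_new
```

Columns of the updated constraint matrix that become zero after repeated updates can then simply be dropped, matching the remark above.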
- Smoothing step: The strategy of the algorithm is to solve a series of problems sequentially, varying \(\lambda\). We obtain the solutions for an increasing sequence of values \(\lambda _1< \lambda _2 < \cdots\), equally spaced on the log scale, stopping at a value of \(\lambda\) large enough that all polygonal pieces are merged. In our polygram package, we set the largest value of \(\lambda\) to \(\lambda _{max}\) and define \(\lambda _{min} = \lambda _{max} \times \epsilon _\lambda\). We then generate an increasing sequence \(\{ \lambda _2 = \lambda _{min}, \lambda _3, \ldots , \lambda _{max} \}\) on the log scale. The default values of \(\lambda _{max}\) and \(\epsilon _\lambda\) in the package are 10 and \(10^{-10}\), respectively, and these values can be adjusted according to the given data. The algorithm begins with \(\lambda _1 = 0\), so the first fit is the unpenalized least squares estimator. We then progressively increase \(\lambda \leftarrow \lambda _i\), \(i = 1, 2, \ldots\), and run the descent step and the pruning step repeatedly until no further changes occur. A sketch of this \(\lambda\) path construction is given below.
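A minimal sketch of the tuning-parameter path described in the smoothing step. The function name and the number of grid points are illustrative assumptions (the defaults \(\lambda_{max} = 10\) and \(\epsilon_\lambda = 10^{-10}\) follow the text); this is not the polygram code itself.

```python
import numpy as np

def lambda_path(lam_max=10.0, eps_lambda=1e-10, n_grid=50):
    """Increasing lambda sequence: lambda_1 = 0 (unpenalized fit), then
    n_grid values log-spaced from lam_min = lam_max * eps_lambda to lam_max."""
    lam_min = lam_max * eps_lambda
    grid = np.exp(np.linspace(np.log(lam_min), np.log(lam_max), n_grid))
    return np.concatenate(([0.0], grid))

# Warm starts: each fit would be initialized at the previous solution, and the
# descent and pruning steps repeated until no further changes occur, e.g.
# for lam in lambda_path():
#     beta, G, C = fit_one_lambda(lam, beta, G, C)   # hypothetical driver
```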
C Proofs of the results in Sect. 3.2
Lemma C.1 states that \(\hat{\beta }^\lambda\) lies in the same region generated by \(\{ c_k \}_{k=1}^K\) as \(\hat{\beta }^0\) does.
Lemma C.1
Suppose (C.1) and (C.2) hold. If \(0< \lambda < \lambda ^1\), then
Proof
It follows from the result in Section 7 of Tibshirani and Taylor (2011) that \(\hat{\beta }^\lambda\) is continuous with respect to \(\lambda\). This implies that \(c_k^\top \hat{\beta }^\lambda\) is a real-valued function continuous in \(\lambda\). If \(\mathop {\mathrm {\mathsf {sign}}}\limits (c_k^\top {\hat{\beta }}^\lambda ) = -\tau _k\) for some \(\lambda\) with \(0<\lambda < \lambda ^1\), the intermediate value theorem implies that the function \(c_k^\top {\hat{\beta }}^\lambda\) assumes the value zero at some \(\zeta\) with \(0<\zeta < \lambda ^1\), which contradicts the definition of \(\lambda ^1\). \(\square\)
We now give proofs of Proposition 3.1, Proposition 3.2 and Theorem 3.3.
Proof of Proposition 3.1
By Lemma C.1, we obtain the following results from the stationarity condition
and
Taking the limit \(\lambda \rightarrow \lambda ^1 -\) on both sides of (12), we obtain
Then, we have the explicit form of \(\lambda ^1\):
Using the fact
and the definition of \(\lambda ^1\), we have the desired result. \(\square\)
Proof of Proposition 3.2
Define
For any \(\lambda \in (0, \lambda ^1)\), it follows from Lemma C.1 that the KKT condition for minimizing \(R^\lambda (\beta )\) can be expressed as
Let \({\tilde{\beta }}^\lambda = \mathop {\mathrm {argmin}}\limits _{\beta \in \mathbb {R}^J} R_+^\lambda (\beta )\) for \(0< \lambda < \lambda ^1\). Then the stationarity condition is
Observe that (13) is the stationarity condition for minimizing \(R^\lambda\) over \(\mathbb {R}^J\). By the uniqueness of \({\hat{\beta }}^\lambda\), it follows that \({\hat{\beta }}^\lambda \in \mathcal {H}_+\) minimizes \(R_+^\lambda\) over \(\mathbb {R}^J\) for \(\lambda \in (0, \lambda ^1)\). \(\square\)
Proof of Theorem 3.3
Since \(\lambda \in (0, \lambda ^1)\), it follows from Proposition 3.2 that
Because \(R_+^\lambda\) is a smooth function with a positive definite Hessian matrix, the arguments of Theorem 4.1 (c) in Tseng (2001) give the desired result. \(\square\)
D Proofs of the results in Sect. 4
Proof of Theorem 4.1
Define
Observe
and, from the definition of \({\hat{\beta }}\),
Combining these, we have
and thus
On the event \(\mathcal {D}_1\), we have
From the triangle inequality and the fact that the row sums of C are bounded by 4, we obtain the following inequality
on the event \(\mathcal {D}_1\). \(\square\)
Proof of Theorem 4.2
Lemma D.2 and the definitions of \(\mathcal {D}_2(\beta )\) and \(\mathcal {D}_3\) imply that, on the event \(\mathcal {D}(\beta )\),
where the last inequality follows from the inequality \(2uv \le 4u^2 + v^2 /4\) for \(u,v \in \mathbb {R}\) applied to the rightmost term. This implies that, for \(\beta \in \mathcal {B}\),
Thus, for \(\beta \in \mathcal {B}\), we have
on the event \(\mathcal {D}(\beta )\). \(\square\)
Proof of Corollary 4.3
From the arguments in Hansen (1994), for \(f \in \mathcal {C}^2(Q)\), there exists \(\beta ^* \in \mathbb {R}^J\) such that
which implies that
Assuming \(\overline{d}_N \asymp {\left( N / \log N \right) }^{-1/6}\), we have \(J \asymp {\left( N / \log N \right) }^{1/3}\). We conclude that \(\mathcal {B}\) is nonempty for a suitable choice of constants, and that \(\,\mathbb {P}{\left( \mathcal {D}(\beta ^*) \right) }\) tends to 1 by Lemma D.3, Lemma D.4, and Lemma A.2 of Lai and Wang (2013). It follows from Theorem 4.2 that
\(\square\)
Technical lemmas
Lemma D.1
For \(\beta \in \mathbb {R}^J\), we have
Proof
Following Lai and Schumaker (2007), we have
The desired result follows from the assumption \(J = M_4 / \overline{d}_N^2\). \(\square\)
Lemma D.2
For \(\beta \in \mathbb {R}^J\), we have
on the event \(\mathcal {D}_1 \cap \mathcal {D}_2 (\beta )\).
Proof
From Lemma D.1, Theorem 4.1, the Cauchy-Schwarz inequality, and the definition of \(\mathcal {D}_2(\beta )\), we have
\(\square\)
Lemma D.3
We have
Proof
By (S3) and Lemma 9 of Stone (1986), we have
Under (S4), this implies
It follows from Lemma 12.26 of Breiman et al. (1984) that
Therefore, we have
provided that we choose \(M_5 = C_5^{-1}\). \(\square\)
Lemma D.4
For \(\beta \in \mathbb {R}^J\), we have
where \(S(\beta ) = {\Vert f- \,\mathsf {s}_\beta \Vert }_\infty\).
Proof
We apply the Bernstein inequality in Lemma 3 of Bunea et al. (2007) to random variables \(\xi _n = {\left( \,\mathsf {s}_\beta (X_n) - f(X_n) \right) }^2\) for \(n=1,\ldots ,N\). With the uniform upper bound \(S^2(\beta ) {\Vert f - \,\mathsf {s}_\beta \Vert }_2^2\) on the second moment of \(\{ \xi _n \}_{n=1}^N\), we obtain
\(\square\)