Penalized polygram regression

Abstract

We study regression function estimation over a bounded domain of arbitrary shape based on triangulation and penalization techniques. A total variation type penalty is imposed to encourage the fusion of adjacent triangles, which leads to a partition of the domain consisting of disjoint polygons. The proposed method provides a piecewise linear and continuous estimator over a data-adaptive polygonal partition of the domain. We adopt a coordinate descent algorithm to handle the non-separable structure of the penalty and investigate its convergence properties. Regarding the asymptotic results, we establish an oracle type inequality and the convergence rate of the proposed estimator. A numerical study is carried out to illustrate the performance of the method. An R software package polygram is available.


References

  • Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122.

  • Bregman, L. M. (1966). A relaxation method of finding a common point of convex sets and its application to problems of optimization. Soviet Mathematics Doklady, 7, 1578–1581.

  • Breiman, L. (1991). The Π method for estimating multivariate functions from noisy data. Technometrics, 33(2), 125–143.

  • Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth.

  • Bunea, F., Tsybakov, A., Wegkamp, M., et al. (2007). Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics, 1, 169–194.

  • Courant, R. (1943). Variational methods for the solution of problems of equilibrium and vibrations. Bulletin of the American Mathematical Society, 49, 1–23.

  • De Boor, C. (1978). A practical guide to splines, Volume 27. Springer.

  • Nychka, D., Furrer, R., Paige, J., & Sain, S. (2017). fields: Tools for spatial data. Boulder, CO, USA: University Corporation for Atmospheric Research. R package version 9.8-1.

  • Franke, R. (1979). A critical comparison of some methods for interpolation of scattered data. Technical report, Naval Postgraduate School, Monterey, CA.

  • Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332.

  • Friedman, J. H., et al. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19(1), 1–67.

  • Friedman, J. H., & Silverman, B. W. (1989). Flexible parsimonious smoothing and additive modeling. Technometrics, 31(1), 3–21.

  • Gaines, B. R., Kim, J., & Zhou, H. (2018). Algorithms for fitting the constrained lasso. Journal of Computational and Graphical Statistics, 27(4), 861–871.

  • Gu, C., Bates, D., Chen, Z., & Wahba, G. (1989). The computation of GCV functions through Householder tridiagonalization with application to the fitting of interaction spline models. SIAM Journal on Matrix Analysis and Applications, 10, 457–480.

  • Hansen, M. (1994). Extended linear models, multivariate splines, and anova. Ph.D. dissertation.

  • Hansen, M., Kooperberg, C., & Sardy, S. (1998). Triogram models. Journal of the American Statistical Association, 93(441), 101–119.

  • He, X., & Shi, P. (1996). Bivariate tensor-product b-splines in a partly linear model. Journal of Multivariate Analysis, 58(2), 162–181.

  • Huang, J. Z. (1998). Projection estimation in multiple regression with application to functional anova models. The Annals of Statistics, 26(1), 242–272.

  • Huang, J. Z. (2003). Asymptotics for polynomial spline regression under weak conditions. Statistics & Probability Letters, 65(3), 207–216.

  • Huang, J. Z. (2003). Local asymptotics for polynomial spline regression. The Annals of Statistics, 31(5), 1600–1635.

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2017). ISLR: Data for an Introduction to Statistical Learning with Applications in R. R package version 1.2.

  • James, G.M., Paulson, C., & Rusmevichientong, P. (2013). Penalized and constrained regression. Unpublished manuscript, http://www.bcf.usc.edu/~gareth/research/Research.html

  • Jhong, J. H., Koo, J. Y., & Lee, S. W. (2017). Penalized B-spline estimator for regression functions using total variation penalty. Journal of Statistical Planning and Inference, 184, 77–93.

  • Keller, J. M., Gray, M. R., & Givens, J. A. (1985). A fuzzy k-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics, SMC-15(4), 580–585.

  • Koenker, R., & Mizera, I. (2004). Penalized triograms: Total variation regularization for bivariate smoothing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1), 145–163.

  • Kooperberg, C., Stone, C. J., & Truong, Y. K. (1995a). The L2 rate of convergence for hazard regression. Scandinavian Journal of Statistics, 143–157.

  • Kooperberg, C., Stone, C. J., & Truong, Y. K. (1995). Rate of convergence for logspline spectral density estimation. Journal of Time Series Analysis, 16(4), 389–401.

  • Lai, M. J. (2007). Multivariate splines for data fitting and approximation. In Approximation Theory XII: San Antonio 2007 (pp. 210–228).

  • Lai, M. J., & Schumaker, L. L. (1998). On the approximation power of bivariate splines. Advances in Computational Mathematics, 9(3–4), 251–279.

  • Lai, M. J., & Schumaker, L. L. (2007). Spline functions on triangulations. Cambridge University Press.

  • Lai, M. J., & Wang, L. (2013). Bivariate penalized splines for regression. Statistica Sinica, 23(3), 1399–1417.

  • Lange, K., Hunter, D. R., & Yang, I. (2000). Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, 9(1), 1–20.

  • Meier, L., Van De Geer, S., & Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 53–71.

  • Ramsay, T. (2002). Spline smoothing over difficult regions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(2), 307–319.

  • Rippa, S. (1992). Adaptive approximation by piecewise linear polynomials on triangulations of subsets of scattered data. SIAM Journal on Scientific and Statistical Computing, 13(5), 1123–1141.

  • Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics, 11(4), 735–757.

  • Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.

  • Shewchuk, J. R. (1996). Triangle: Engineering a 2D quality mesh generator and Delaunay triangulator. In M. C. Lin & D. Manocha (Eds.), Applied Computational Geometry: Towards Geometric Engineering, Lecture Notes in Computer Science, Vol. 1148 (pp. 203–222). Springer-Verlag. From the First ACM Workshop on Applied Computational Geometry.

  • Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4), 1040–1053.

  • Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals of Statistics, 13(2), 689–705.

  • Stone, C. J. (1986). The dimensionality reduction principle for generalized additive models. The Annals of Statistics, 14(2), 590–606.

  • Stone, C. J. (1994). The use of polynomial splines and their tensor products in multivariate function estimation. The Annals of Statistics, 22(1), 118–171.

  • Stone, C. J., Hansen, M. H., Kooperberg, C., Truong, Y. K., et al. (1997). Polynomial splines and their tensor products in extended linear modeling: 1994 wald memorial lecture. The Annals of Statistics, 25(4), 1371–1470.

  • Szeliski, R. (2010). Computer vision: algorithms and applications. Springer.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.

  • Tibshirani, R., & Saunders, M. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1), 91–108.

  • Tibshirani, R. J., & Taylor, J. (2011). The solution path of the generalized lasso. The Annals of Statistics, 39(3), 1335–1371.

  • Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3), 475–494.

  • Xiao, L. (2019). Asymptotics of bivariate penalised splines. Journal of Nonparametric Statistics, 31(2), 289–314.

  • Xiao, L., Li, Y., & Ruppert, D. (2013). Fast bivariate p-splines: The sandwich smoother. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3), 577–599.

  • Ye, G. B., & Xie, X. (2011). Split bregman method for large scale fused lasso. Computational Statistics & Data Analysis, 55(4), 1552–1569.

  • Yu, D., Won, J. H., Lee, T., Lim, J., & Yoon, S. (2015). High-dimensional fused lasso regression using majorization-minimization and parallel processing. Journal of Computational and Graphical Statistics, 24(1), 121–153.

  • Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.

Acknowledgements

The research of Ja-Yong Koo was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2018R1D1A1B07049972). The research of Jae-Hwan Jhong was supported by the NRF (NRF-2019R1A6A3A01096135 and NRF-2020R1G1A1A01100869). The research of Kwan-Young Bak was supported by the NRF (NRF-2021R1A6A3A01086417).

Author information

Corresponding author

Correspondence to Ja-Yong Koo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Appendices

A Preliminaries

A.1 Triangulations

Let \(\Omega\) be a compact and convex polygon in the plane. A collection \(\triangle = \{T_1, \ldots , T_G\}\) of triangles in the plane with pairwise disjoint interiors such that \(\Omega = \bigcup _{T \in \triangle } T\) is called a triangulation of \(\Omega\). In this paper, we assume that each element \(T \in \triangle\) is a planar triangle. The vertices of the triangles of \(\triangle\) are called the vertices of the triangulation \(\triangle\). If a vertex v is a boundary point of \(\Omega\), we say that it is a boundary vertex; otherwise, we call it an interior vertex. Any two triangles in \(\triangle\) are either completely disjoint, or their intersection is a common vertex or a common edge.

Figure 8a shows the basic notation for the triangle \({\langle 123 \rangle }\). In Fig. 8b, the edge \(e_{23}\) is the common edge of the two triangles \({\langle 123 \rangle }\) and \({\langle 243 \rangle }\). Given a triangulation \(\triangle\), we define a sub-triangulation associated with each vertex, called the star. The \({{\,\mathrm{star}\,}}(v)\) of a vertex v is the set of all triangles in \(\triangle\) that share the vertex v. Figure 8c displays \({{\,\mathrm{star}\,}}(v)\).

Fig. 8
figure 8

Examples of basic notations for triangulation. Vertices and edges of \({\langle 123 \rangle }\) (a), Common edge \(e_{23}\) (blue line) of two triangles \({\langle 123 \rangle }\) and \({\langle 243 \rangle }\) (b), and \({{\,\mathrm{star}\,}}(v)\) of a vertex v represented by gray surface (c)
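
To make the star structure concrete, the following is a minimal Python sketch (not part of the polygram package); the fan of three triangles around vertex 0 is a hypothetical toy example, with a triangulation stored simply as index triples into a vertex list.

```python
# Hypothetical toy triangulation: a fan of three triangles around vertex 0,
# stored as vertex-index triples.
triangles = [(0, 1, 2), (0, 2, 3), (0, 3, 4)]

def star(v, triangles):
    """star(v): all triangles of the triangulation that share the vertex v."""
    return [tri for tri in triangles if v in tri]

print(star(0, triangles))  # the whole fan: [(0, 1, 2), (0, 2, 3), (0, 3, 4)]
print(star(4, triangles))  # only one triangle: [(0, 3, 4)]
```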

A.2 Barycentric coordinates

Consider, for example, a triangle \({\langle 123 \rangle }\) with vertices \(v_1, v_2, v_3\). The barycentric coordinate vector of \(x = (x_1, x_2)\) relative to the triangle \({\langle 123 \rangle }\) is defined as

$$\begin{aligned} b^{123}(x) = \left( b^{123}_1(x), b^{123}_2(x), b^{123}_3(x) \right) \in \mathbb {R}^3, \end{aligned}$$

which satisfies the conditions

$$\begin{aligned} x_j = \sum _{\ell = 1}^3 b^{123}_\ell (x) v_{\ell j} {\quad \text{ for } \quad }j = 1, 2, {\quad \text{ and } \quad }\sum _{\ell = 1}^3 b^{123}_\ell (x) = 1. \end{aligned}$$

By Cramer’s rule,

$$\begin{aligned} b^{123}_1(x) = \frac{{{\,\mathrm{area}\,}}(x, v_2, v_3)}{{{\,\mathrm{area}\,}}(v_1, v_2, v_3)}, \quad b^{123}_2(x) = \frac{{{\,\mathrm{area}\,}}(v_1, x, v_3)}{{{\,\mathrm{area}\,}}(v_1, v_2, v_3)} {\quad \text{ and } \quad }b^{123}_3(x) = \frac{{{\,\mathrm{area}\,}}(v_1, v_2, x)}{{{\,\mathrm{area}\,}}(v_1, v_2, v_3)}, \end{aligned}$$

where

$$\begin{aligned} {{\,\mathrm{area}\,}}(v_1, v_2, v_3) = \frac{1}{2} \begin{vmatrix} 1&1&1 \\ v_{11}&v_{21}&v_{31} \\ v_{12}&v_{22}&v_{32} \end{vmatrix} \end{aligned}$$

is the signed area of the triangle \({\langle 123 \rangle }\). The values \(b^{123}_1(x)\), \(b^{123}_2(x)\) and \(b^{123}_3(x)\) are the relative areas of the green, the blue and the red triangles, respectively; see Fig. 9.

Fig. 9
figure 9

The values \(b^{123}_1(x)\), \(b^{123}_2(x)\) and \(b^{123}_3(x)\) are the relative areas of the green, the blue and the red triangles, respectively
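
As a numerical illustration of the area-ratio formulas above, here is a minimal Python sketch (not taken from the polygram package); the triangle and the evaluation point are hypothetical.

```python
import numpy as np

def signed_area(p, q, r):
    """Signed area of the triangle (p, q, r); positive for counterclockwise order."""
    return 0.5 * ((q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1]))

def barycentric(x, v1, v2, v3):
    """Barycentric coordinates of x relative to <123> as ratios of signed areas."""
    a = signed_area(v1, v2, v3)
    return np.array([signed_area(x, v2, v3),
                     signed_area(v1, x, v3),
                     signed_area(v1, v2, x)]) / a

# The coordinates sum to one and reproduce x as a convex combination of the vertices.
v1, v2, v3 = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])
b = barycentric(np.array([0.25, 0.25]), v1, v2, v3)
print(b, b.sum())  # [0.5  0.25 0.25] 1.0
```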

A.3 Linear hat splines

Suppose a linear spline s belongs to the space \({\mathcal {S}}\) of continuous linear splines. Given the vertex set \(V = \{v_1, \ldots , v_J\}\) of the triangulation \(\triangle\), the continuous linear hat splines \(B_1, \ldots , B_J\) are defined as

$$\begin{aligned} B_j (x) = {\left\{ \begin{array}{ll} b^{T_x}_j (x) &{} \quad \text{ if } x \in {{\,\mathrm{star}\,}}(v_j), \\ 0 &{} \quad \text{ otherwise }, \end{array}\right. } {\quad \text{ for } \quad }j = 1, \ldots , J, \end{aligned}$$

where \(T_x\) is the triangle that contains the point x. The hat splines \(B_1, \ldots , B_J\) then form a basis for \({\mathcal {S}}\), which means that any continuous linear spline s can be expressed as

$$\begin{aligned} s(x;\beta ) = \sum _{j = 1}^J \beta _j B_j(x) {\quad \text{ for } \quad }\beta \in \mathbb {R}^J. \end{aligned}$$
(8)
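
The representation (8) can be evaluated directly from barycentric coordinates, since on the triangle containing x the hat spline \(B_j(x)\) equals the barycentric coordinate attached to vertex \(v_j\). Below is a minimal Python sketch with brute-force point location over a hypothetical two-triangle triangulation of the unit square; it only illustrates (8) and is not the implementation used in the polygram package.

```python
import numpy as np

def barycentric(x, v1, v2, v3):
    # Solve sum_l b_l v_l = x and sum_l b_l = 1, the defining conditions in A.2.
    A = np.array([[1.0, 1.0, 1.0],
                  [v1[0], v2[0], v3[0]],
                  [v1[1], v2[1], v3[1]]])
    return np.linalg.solve(A, np.array([1.0, x[0], x[1]]))

def eval_spline(x, vertices, triangles, beta):
    """Evaluate s(x; beta) = sum_j beta_j B_j(x) as in (8)."""
    for tri in triangles:                                   # brute-force point location
        b = barycentric(x, *(vertices[j] for j in tri))
        if np.all(b >= -1e-12):                             # x lies in this triangle
            return sum(beta[j] * b[l] for l, j in enumerate(tri))
    raise ValueError("x lies outside the triangulation")

# Hypothetical example: two triangles covering the unit square; beta lists the
# values of s at the four vertices.
vertices = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
triangles = [(0, 1, 2), (0, 2, 3)]
beta = np.array([0.0, 1.0, 2.0, 1.0])
print(eval_spline(np.array([0.5, 0.5]), vertices, triangles, beta))  # 1.0
```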

B Algorithm details

We describe each step of the algorithm in detail.

  • Descent step: Let \({\tilde{\beta }} = ({\tilde{\beta }}_1, \ldots , {\tilde{\beta }}_J)\) be the current value of the coefficients. For each \(j = 1, \ldots , J\), the algorithm partially optimizes over the single coefficient \(\beta _j\), holding the other coefficients fixed at their current values \(\{{\tilde{\beta }}_\ell \}_{\ell \ne j}\):

    $$\begin{aligned} {\tilde{\beta }}_j \leftarrow \mathop {\mathrm {argmin}}\limits _{\beta _j \in \mathbb {R}} R^\lambda \left( {\tilde{\beta }}_1, \ldots , {\tilde{\beta }}_{j - 1}, \beta _j, {\tilde{\beta }}_{j + 1}, \ldots , {\tilde{\beta }}_J \right) . \end{aligned}$$

    Ignoring terms independent of \(\beta _j\), we have

    $$\begin{aligned}&R^\lambda ({\tilde{\beta }}_1, \ldots , {\tilde{\beta }}_{j - 1}, \beta _j, {\tilde{\beta }}_{j + 1}, \ldots , {\tilde{\beta }}_J) \nonumber \\&= \frac{1}{N} \sum _{n = 1}^N \left( y_n - \sum _{\ell \ne j} {\tilde{\beta }}_\ell G_\ell (x_n) - \beta _j G_j(x_n) \right) ^2 + \lambda \sum _{k = 1}^K {\left|\beta _j c_{kj} + \sum _{\ell \ne j} {\tilde{\beta }}_\ell c_{k\ell } \right|} \nonumber \\&= \frac{1}{N} \sum _{n = 1}^N \left( { y_{nj} - \beta _j G_j(x_n)} \right) ^2 + \lambda \sum _{k:c_{kj} \ne 0} {\left|c_{kj} \right|} {\left|\beta _j - \left( - \sum _{\ell \ne j} \frac{c_{k\ell }}{c_{kj}} {\tilde{\beta }}_\ell \right) \right|} \\&\quad + (\text{ terms } \text{ independent } \text{ of } \beta _j) \nonumber \end{aligned}$$
    (9)

    with \(y_{nj} = y_n - \sum _{\ell \ne j} {\tilde{\beta }}_\ell G_\ell (x_n)\) being the partial residual. Consequently, minimizing (9) with respect to \(\beta _j\) is the univariate convex optimization problem:

    $$\begin{aligned} \min _{\beta _j \in \mathbb {R}} \frac{a}{2} (\beta _j - b)^2 + \lambda \sum _{m = 1}^M d_m{\left|\beta _j - e_m \right|}, \end{aligned}$$
    (10)

    where

    $$\begin{aligned} a = \frac{2}{N} \sum _{n = 1}^N G_j^2(x_n), \quad b = \frac{\sum _{n = 1}^N y_{nj} G_j(x_n)}{\sum _{n = 1}^N G_j^2(x_n)}, \quad d_m = {\left|c_{mj} \right|}, \quad e_m = - \sum _{\ell \ne j} \frac{c_{m\ell }}{c_{mj}} {\tilde{\beta }}_\ell . \end{aligned}$$

    The positive integer \(M = {\left|{\{ k: c_{kj} \ne 0 \}} \right|}\) is relatively small because the matrix C is sparse. Observe that (10) is a strictly convex, piecewise quadratic function of \(\beta _j\) and hence always has a unique global minimum. By Theorem 1 of Jhong et al. (2017), we obtain the exact solution of this problem; a minimal numerical sketch of this univariate minimization is given after the smoothing step below. Thus, the algorithm iteratively updates the coefficients with the minimizer of (10) until convergence.

  • Pruning step: The descent step moves one parameter at a time. This approach may not be effective because the penalty function is not separable. One way to resolve this problem is to consider a reparameterization using the active constraint. The algorithm removes the edges corresponding to the active constraints and updates the basis and the constraint matrix accordingly. When a single active condition is satisfied during the descent step, we rewrite the problem (4) as

    $$\begin{aligned} \min _{\beta \in \mathbb {R}^{J - 1}} \frac{1}{N} {\left|y - {\tilde{G}} \beta \right|}_2^2 + \lambda {\left|{\tilde{C}}^\top \beta \right|}_1, \end{aligned}$$
    (11)

    where \({\tilde{G}}\), \({\tilde{C}}\) are the updated basis and the constraint matrix that satisfy the active condition.

The algorithm reduces the dimensions J and K by pruning the corresponding common edge. To this end, we need one non-zero entry of \(c_k\) as a baseline. For numerical stability, the baseline is chosen as the element of \(c_k\) with the largest absolute value. The remaining steps are as follows.

  1. 1.

    Let \(c_{kj}\) be the baseline. The active condition can therefore be written as \({\tilde{\beta }}_j = - ( {c_k^{-j}})^\top {\tilde{\beta }}^{-j} / c_{kj}\), where \(c_k^{-j}\) and \({\tilde{\beta }}^{-j}\) are the \((J-1)\)-dimensional vectors obtained by excluding the jth entries \(c_{kj}\) and \({\tilde{\beta }}_j\), respectively.

  2. 2.

    Update G and C: Let \({\tilde{G}}_\ell = G_\ell - (c_{k \ell } / c_{kj}) G_j\) for \(\ell = 1, \ldots , J\), \(\ell \ne j\), and then the new design matrix becomes

    $$\begin{aligned} {\tilde{G}} = \begin{bmatrix} {\tilde{G}}_1&\cdots&{\tilde{G}}_{j - 1}&{\tilde{G}}_{j + 1}&\cdots&{\tilde{G}}_J \end{bmatrix} \in \mathbb {R}^{N \times (J - 1)}. \end{aligned}$$

    Similarly, the updated constraint matrix has the form

    $$\begin{aligned} {\tilde{C}} = \begin{bmatrix} {\tilde{c}}_1&\cdots&{\tilde{c}}_{k - 1}&{\tilde{c}}_{k + 1}&\cdots {\tilde{c}}_K \end{bmatrix} \in \mathbb {R}^{(J - 1) \times (K - 1)}, \end{aligned}$$

    where \({\tilde{c}}_m = c_m^{-j} - (c_{mj} / c_{kj}) c_k^{-j}\) for \(m = 1, \ldots , K\), \(m \ne k\).

  3. 3.

    Now, the minimization problem (4) subject to \(c_k^\top \beta = 0\) is equivalent to (11). We then update the notation as follows:

    $$\begin{aligned} G \leftarrow {\tilde{G}}, \quad C \leftarrow {\tilde{C}}, \quad {\tilde{\beta }} \leftarrow {\tilde{\beta }}^{-j}, \quad J \leftarrow J-1, \quad K \leftarrow K-1. \end{aligned}$$

If \(c_k\) is a zero vector, the algorithm simply removes the kth column from C. This case does not occur early in the algorithm, but it may appear during the sequence of updates because the initial constraint matrix is not of full rank. That is, there is redundancy among the edge-removal conditions of the initial interior edges, so some edges are removed together in a single pruning step. Therefore, in practice, the algorithm requires a number of pruning steps much smaller than the initial number of edges.

  • Smoothing step: The strategy of the algorithm is to solve a sequence of problems with varying \(\lambda\). We obtain the solutions for an increasing sequence of values \(\lambda _1< \lambda _2 < \cdots\) spaced on the log scale, stopping at a value large enough that all polygonal pieces are merged. In our polygram package, we set the largest value of \(\lambda\) to \(\lambda _{max}\) and define \(\lambda _{min} = \lambda _{max} \times \epsilon _\lambda\). We then generate an increasing sequence \(\{ \lambda _2 = \lambda _{min}, \lambda _3, \ldots , \lambda _{max} \}\) on the log scale. The default values of \(\lambda _{max}\) and \(\epsilon _\lambda\) in the package are 10 and \(10^{-10}\), respectively, and these values can be adjusted according to the given data. The algorithm begins with \(\lambda _1 = 0\), so we first obtain the unpenalized least squares estimator. We then progressively increase \(\lambda \leftarrow \lambda _i\), \(i = 1, 2, \ldots\), and run the descent and pruning steps repeatedly until no further changes occur.
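
As referenced in the descent step above, the following is a minimal Python sketch of one way to solve the univariate subproblem (10) exactly: on each interval between consecutive knots the objective is quadratic, and at each knot the subdifferential is an interval, so a single scan over the sorted knots finds the global minimizer. This is an illustrative re-implementation under those assumptions, not the routine used in the polygram package, and the closing comparison with a grid search uses hypothetical inputs.

```python
import numpy as np

def univariate_min(a, b, lam, d, e):
    """Exact minimizer of (a/2)(t - b)^2 + lam * sum_m d_m |t - e_m| (cf. (10));
    requires a > 0 and d_m > 0."""
    order = np.argsort(e)
    e, d = np.asarray(e, float)[order], np.asarray(d, float)[order]
    for k in range(len(e) + 1):
        s = d[:k].sum() - d[k:].sum()          # slope of the penalty on the k-th interval
        cand = b - lam * s / a                 # stationary point of that quadratic piece
        lo = -np.inf if k == 0 else e[k - 1]
        hi = np.inf if k == len(e) else e[k]
        if lo <= cand <= hi:
            return cand
        if k < len(e):
            # the knot e_k is the minimizer when 0 lies in the subdifferential there
            g = a * (e[k] - b) + lam * (d[:k].sum() - d[k + 1:].sum())
            if abs(g) <= lam * d[k]:
                return e[k]

# Sanity check on hypothetical inputs against a fine grid search.
a, b, lam = 1.5, 0.3, 0.8
d, e = np.array([0.5, 1.0, 0.25]), np.array([-1.0, 0.2, 2.0])
grid = np.linspace(-3.0, 3.0, 200001)
vals = 0.5 * a * (grid - b) ** 2 + lam * (np.abs(grid[:, None] - e) @ d)
print(univariate_min(a, b, lam, d, e), grid[np.argmin(vals)])  # both close to 0.2
```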

C Proofs of the results in Sect. 3.2

Lemma C.1 states that \(\hat{\beta }^\lambda\) lies in the same region generated by \(\{ c_k \}_{k=1}^K\) as \(\hat{\beta }^0\) does.

Lemma C.1

Suppose (C.1) and (C.2) hold. If \(0< \lambda < \lambda ^1\), then

$$\begin{aligned} \mathop {\mathrm {\mathsf {sign}}}\limits (c_k^\top {\hat{\beta }}^\lambda ) = \tau _k, {\quad \text{ for } \quad }k = 1, \ldots , K. \end{aligned}$$

Proof

It follows from the result in Section 7 of Tibshirani and Taylor (2011) that \(\hat{\beta }^\lambda\) is continuous with respect to \(\lambda\). This implies that \(c_k^\top \hat{\beta }^\lambda\) is a real-valued function continuous with respect to \(\lambda\). If \(\mathop {\mathrm {\mathsf {sign}}}\limits (c_k^\top {\hat{\beta }}^\lambda ) = -\tau _k\) for \(0<\lambda < \lambda ^1\), the intermediate value theorem implies that the function \(c_k^\top {\hat{\beta }}^\lambda\) assumes the value zero for some \(\zeta\) with \(0<\zeta < \lambda ^1\), which contradicts the definition of \(\lambda ^1\). \(\square\)

We now give proofs of Proposition 3.1, Proposition 3.2 and Theorem 3.3.

Proof of Proposition 3.1

By Lemma C.1, we obtain the following results from the stationarity condition

$$\begin{aligned} {\hat{\beta }}^\lambda = {\hat{\beta }}^0 - \lambda S c {\quad \text{ for } \quad }\lambda < \lambda ^1 \end{aligned}$$

and

$$\begin{aligned} c_k^\top {\hat{\beta }}^\lambda = c_k^\top {\hat{\beta }}^0 - \lambda \eta _k {\quad \text{ for } \quad }\lambda < \lambda ^1, \quad k = 1, \ldots , K. \end{aligned}$$
(12)

Taking the limit \(\lambda \rightarrow \lambda ^1 -\) on both sides of (12), we obtain

$$\begin{aligned} c_k^\top {\hat{\beta }}^{\lambda ^1} = c_k^\top {\hat{\beta }}^0 - \lambda ^1 \eta _k {\quad \text{ for } \quad }k = 1, \ldots , K. \end{aligned}$$

Then, we have the explicit form of \(\lambda ^1\):

$$\begin{aligned} \lambda ^1 = \frac{c_k^\top {\hat{\beta }}^0 - c_k^\top {\hat{\beta }}^{\lambda ^1}}{\eta _k} {\quad \text{ for } \quad }\eta _k \ne 0, \quad k = 1, \ldots , K. \end{aligned}$$

Using the fact

$$\begin{aligned} {\left|c_k^\top {\hat{\beta }}^0 \right|} \le {\left|c_k^\top {\hat{\beta }}^\lambda \right|}, \quad k = 1, \ldots , K, \quad \lambda < \lambda ^1 \end{aligned}$$

and the definition of \(\lambda ^1\), we have the desired result. \(\square\)

Proof of Proposition 3.2

Define

$$\begin{aligned} \mathcal {H}_+ = \left\{ \beta \in \mathbb {R}^J: \tau _k (c_k^\top \beta ) > 0, {\quad \text{ for } \quad }k =1,\ldots ,K \right\} . \end{aligned}$$

For any \(\lambda \in (0, \lambda ^1)\), it follows from Lemma C.1 that the KKT condition for minimizing \(R^\lambda (\beta )\) can be expressed as

$$\begin{aligned} \nabla R ({\hat{\beta }}^\lambda ) + \lambda c = 0, \quad 0< \lambda < \lambda ^1, \ {\hat{\beta }}^\lambda \in \mathcal {H}_+. \end{aligned}$$

Assume that \({\tilde{\beta }}^\lambda = \mathop {\mathrm {argmin}}\limits _{\beta \in \mathbb {R}^J} R_+^\lambda (\beta )\) for \(0< \lambda < \lambda ^1\). Then the stationarity condition is

$$\begin{aligned} \nabla R ({\tilde{\beta }}^\lambda ) + \lambda c = 0, \quad 0< \lambda < \lambda ^1. \end{aligned}$$
(13)

Observe that (13) is the stationarity condition for minimizing \(R^\lambda\) over \(\mathbb {R}^J\). By the uniqueness of \({\hat{\beta }}^\lambda\), it follows that \({\hat{\beta }}^\lambda \in \mathcal {H}_+\) minimizes \(R_+^\lambda\) over \(\mathbb {R}^J\) for \(\lambda \in (0, \lambda ^1)\). \(\square\)

Proof of Theorem 3.3

Since \(\lambda \in (0, \lambda ^1)\), it follows from Proposition 3.2 that

$$\begin{aligned} {\hat{\beta }}^\lambda = \mathop {\mathrm {argmin}}\limits _{\beta \in \mathbb {R}^J} R^\lambda (\beta ) = \mathop {\mathrm {argmin}}\limits _{\beta \in \mathbb {R}^J} R_+^\lambda (\beta ). \end{aligned}$$

Because \(R_+^\lambda\) is a smooth function with a positive definite Hessian matrix, the arguments of Theorem 4.1 (c) in Tseng (2001) give the desired result. \(\square\)

D Proofs of the results in Sect. 4

Proof of Theorem 4.1

Define

$$\begin{aligned} R(\beta ) = \frac{1}{N} {\left|\mathsf {Y}- \mathsf {G}\beta \right|}_2^2 \end{aligned}$$

Observe

$$\begin{aligned} {\Vert f - {\hat{f}} \Vert }_{2,N}^2 - {\Vert f - \,\mathsf {s}_\beta \Vert }_{2,N}^2 = \frac{1}{N} \left( {\left|\mathsf {G}{\hat{\beta }} \right|}_2^2 - {\left|\mathsf {G}\beta \right|}_2^2 \right) - \frac{2}{N} \left( {\hat{\beta }} - \beta \right) ^\top \mathsf {G}^\top \,\mathsf {F}\end{aligned}$$

and, from the definition of \({\hat{\beta }}\),

$$\begin{aligned} \lambda {\left|C^\top {\hat{\beta }} \right|}_1 - \lambda {\left|C^\top \beta \right|}_1 \le R(\beta ) - R({\hat{\beta }}) = \frac{1}{N} \left( {\left|\mathsf {G}\beta \right|}_2^2 - {\left|\mathsf {G}{\hat{\beta }} \right|}_2^2 \right) - \frac{2}{N} \left( \beta - {\hat{\beta }} \right) ^\top \mathsf {G}^\top \mathsf {Y}. \end{aligned}$$

Combining these, we have

$$\begin{aligned} {{\Vert f - {\hat{f}} \Vert }_{2,N}^2 + \lambda {\left|C^\top {\hat{\beta }} \right|}_1 - \left( {\Vert f - \,\mathsf {s}_\beta \Vert }_{2,N}^2 + \lambda {\left|C^\top \beta \right|}_1 \right) }\\&\le \frac{1}{N} \left( {\left|\mathsf {G}{\hat{\beta }} \right|}_2^2 - {\left|\mathsf {G}\beta \right|}_2^2 \right) - \frac{2}{N} \left( {\hat{\beta }} - \beta \right) ^\top \mathsf {G}^\top \,\mathsf {F}+ \left( \frac{1}{N} \left( {\left|\mathsf {G}\beta \right|}_2^2 - {\left|\mathsf {G}{\hat{\beta }} \right|}_2^2 \right) - \frac{2}{N} \left( \beta - {\hat{\beta }} \right) ^\top \mathsf {G}^\top \mathsf {Y}\right) \\&= \frac{2}{N} ({\hat{\beta }} - \beta )^\top \mathsf {G}^\top (\mathsf {Y}- \,\mathsf {F}), \end{aligned}$$

and thus

$$\begin{aligned} {\Vert f - {\hat{f}} \Vert }_{2,N}^2 \le {\Vert f - \,\mathsf {s}_\beta \Vert }_{2,N}^2 + \frac{2}{N} ({\hat{\beta }} - \beta )^\top \mathsf {G}^\top (\mathsf {Y}- \,\mathsf {F}) - \lambda \left( {\left|C^\top {\hat{\beta }} \right|}_1 - {\left|C^\top \beta \right|}_1 \right) . \end{aligned}$$

On event \(\mathcal {D}_1\), we have

$$\begin{aligned} \frac{1}{N} ({\hat{\beta }} - \beta )^\top \mathsf {G}^\top (\mathsf {Y}- \,\mathsf {F}) \le \sum _{j=1}^J {\left|{\hat{\beta }}_j - \beta _j \right|} {\left|\frac{1}{N} \mathsf {G}_j^\top (\mathsf {Y}- \,\mathsf {F}) \right|} \le \lambda {\left|{\hat{\beta }} - \beta \right|}_1. \end{aligned}$$

From the triangle inequality and the fact that the row sums of C are bounded by 4, we obtain the following.

$$\begin{aligned} {\Vert f - {\hat{f}} \Vert }_{2,N}^2&\le {\Vert f - \,\mathsf {s}_\beta \Vert }_{2,N}^2 + \frac{2}{N} ({\hat{\beta }} - \beta )^\top \mathsf {G}^\top (\mathsf {Y}- \,\mathsf {F}) - \lambda \left( {\left|C^\top {\hat{\beta }} \right|}_1 - {\left|C^\top \beta \right|}_1 \right) \\&\le {\Vert f - \,\mathsf {s}_\beta \Vert }_{2,N}^2 + 2\lambda {\left|{\hat{\beta }} - \beta \right|}_1 + \lambda {\left|C^\top (\hat{\beta } - \beta ) \right|}_1\\&\le {\Vert f - \,\mathsf {s}_\beta \Vert }_{2,N}^2 + 6 \lambda {\left|{\hat{\beta }} - \beta \right|}_1 \end{aligned}$$

on the event \(\mathcal {D}_1\). \(\square\)

Proof of Theorem 4.2

Lemma D.2 and the definitions of \(\mathcal {D}_2(\beta )\) and \(\mathcal {D}_3\) imply that, on the event \(\mathcal {D}(\beta )\),

$$\begin{aligned} \frac{1}{2}{\Vert \hat{f} - \,\mathsf {s}_\beta \Vert }_2^2&\le {\Vert \hat{f} - \,\mathsf {s}_\beta \Vert }_{2,N}^2\\&\le 2{\Vert f- \,\mathsf {s}_\beta \Vert }_{2,N}^2 + 2{\Vert f - \hat{f} \Vert }_{2,N}^2\\&\le 4 {\Vert f- \,\mathsf {s}_\beta \Vert }_{2}^2 + 2 \lambda ^2 J^2 + 2{\Vert f - \hat{f} \Vert }_{2,N}^2\\&\le 4 {\Vert f- \,\mathsf {s}_\beta \Vert }_{2}^2 + 2 \lambda ^2 J^2 +2 \left[ 2 {\Vert f - \,\mathsf {s}_\beta \Vert }_2^2 + \lambda ^2 J^2 + 6 \lambda J {\Vert \hat{f} - \,\mathsf {s}_\beta \Vert }_2 \right] \\&\le 8 {\Vert f - \,\mathsf {s}_\beta \Vert }_2^2 + 4\lambda ^2 J^2 + 144 \lambda ^2 J^2 + \frac{1}{4} {\Vert \hat{f} - \,\mathsf {s}_\beta \Vert }_2^2, \end{aligned}$$

where the last inequality follows from the inequality \(2uv \le 4u^2 + v^2 /4\) for \(u,v \in \mathbb {R}\) applied to the rightmost term. This implies that, for \(\beta \in \mathcal {B}\),

$$\begin{aligned} {\Vert \hat{f} - \,\mathsf {s}_\beta \Vert }_2^2&\le 32 {\Vert f- \,\mathsf {s}_\beta \Vert }_2^2 + 592 \lambda ^2 J^2 \le M_7^2 \lambda ^2 J^2 {\quad \text{ for } \quad }M_7 = \sqrt{32 M_6 + 592}. \end{aligned}$$

Thus, for \(\beta \in \mathcal {B}\), we have

$$\begin{aligned} {\Vert f - \hat{f} \Vert }_2&\le {\Vert f- \,\mathsf {s}_\beta \Vert }_2 + {\Vert \hat{f} - \,\mathsf {s}_\beta \Vert }_2 \le M_8 \lambda J {\quad \text{ for } \quad }M_8 = \sqrt{M_6} + M_7 \end{aligned}$$

on the event \(\mathcal {D}(\beta )\). \(\square\)

Proof of Corollary 4.3

From the arguments in Hansen (1994), for \(f \in \mathcal {C}^2(Q)\), there exists \(\beta ^* \in \mathbb {R}^J\) such that

$$\begin{aligned} {\Vert f - \,\mathsf {s}_{\beta ^*} \Vert }_\infty \le C_1 \overline{d}_N^2 \le C_2 J^{-1}, \end{aligned}$$

which implies that

$$\begin{aligned} {\Vert f - \,\mathsf {s}_{\beta ^*} \Vert }_2^2 \le C_3 J^{-2}. \end{aligned}$$

Assuming \(\overline{d}_N \asymp {\left( N / \log N \right) }^{-1/6}\), we have \(J \asymp {\left( N / \log N \right) }^{1/3}\). We conclude that \(\mathcal {B}\) is nonempty for a suitable choice of constants and that \(\,\mathbb {P}{\left( \mathcal {D}(\beta ^*) \right) }\) tends to 1 by Lemma D.3, Lemma D.4, and Lemma A.2 of Lai and Wang (2013). It follows from Theorem 4.2 that

$$\begin{aligned} {\Vert f - \hat{f} \Vert }_2^2 \le M_8^2 \lambda ^2 J^2 \le C_4 \frac{\log N}{N J} J^2 \le M_9 {\left( \frac{N}{\log N} \right) }^{-2/3}. \end{aligned}$$

\(\square\)

D.1 Technical lemmas

Lemma D.1

For \(\beta \in \mathbb {R}^J\), we have

$$\begin{aligned} M_{10} {\left|\beta \right|}_2 \le \sqrt{J} {\Vert \,\mathsf {s}_\beta \Vert }_2 \le M_{11} {\left|\beta \right|}_2. \end{aligned}$$

Proof

Following Lai and Schumaker (2007), we have

$$\begin{aligned} C_1 \overline{d}_N^2 {\left|\beta \right|}_2^2 \le {\Vert \,\mathsf {s}_\beta \Vert }_2^2 \le C_2 \overline{d}_N^2 {\left|\beta \right|}_2^2. \end{aligned}$$

The desired result follows from the assumption \(J = M_4 / \overline{d}_N^2\). \(\square\)

Lemma D.2

For \(\beta \in \mathbb {R}^J\), we have

$$\begin{aligned} {\Vert f - \hat{f} \Vert }_{2,N}^2 \le 2 {\Vert f - \,\mathsf {s}_\beta \Vert }_2^2 + \lambda ^2 J^2 + 6 \lambda J {\Vert \hat{f} - \,\mathsf {s}_\beta \Vert }_2 \end{aligned}$$

on the event \(\mathcal {D}_1 \cap \mathcal {D}_2 (\beta )\).

Proof

From Lemma D.1, Theorem 4.1, the Cauchy-Schwarz inequality, and the definition of \(\mathcal {D}_2(\beta )\), we have

$$\begin{aligned} {\Vert f - \hat{f} \Vert }_{2,N}^2&\le {\Vert f-\,\mathsf {s}_\beta \Vert }_{2,N}^2 + 6\lambda {\left|\hat{\beta } - \beta \right|}_1\\&\le 2 {\Vert f - \,\mathsf {s}_\beta \Vert }_2^2 + \lambda ^2 J^2 + 6\lambda \sqrt{J} {\left|\hat{\beta } - \beta \right|}_2\\&\le 2 {\Vert f - \,\mathsf {s}_\beta \Vert }_2^2 + \lambda ^2 J^2 + 6\lambda J {\Vert \hat{f} - \,\mathsf {s}_\beta \Vert }_2. \end{aligned}$$

\(\square\)

Lemma D.3

We have

$$\begin{aligned} \,\mathbb {P}{\left( \mathcal {D}_1 \right) } \ge 1 - 2J \exp (- C_5 \lambda ^2 JN) = 1 - \delta . \end{aligned}$$

Proof

By (S3) and Lemma 9 of Stone (1986), we have

$$\begin{aligned} \mathbb {E}\left[ e^{t B_j(x) (Y - f(x))} \mid X = x \right] \le 1 + C_1 t^2 B_j^2(x) {\quad \text{ for } \quad }{\left|t \right|} \le C_2. \end{aligned}$$

Under (S4), this implies

$$\begin{aligned} \mathbb {E}\left[ e^{t B_j(X) (Y - f(X))} \right] \le 1 + \frac{C_3}{J} t^2 \le \exp {\left( \frac{C_4}{2J} t^2 \right) }. \end{aligned}$$

It follows from Lemma 12.26 of Breiman et al. (1984) that

$$\begin{aligned} \,\mathbb {P}{\left( {\left|V_j \right|}> \lambda \right) }&= \,\mathbb {P}{\left( {\left|\frac{1}{N} \sum _{n=1}^N B_j(X_n) (Y_n - f(X_n)) \right|} > \lambda \right) }\\&\le 2 \exp {\left( - C_5 \lambda ^2 J N \right) }. \end{aligned}$$

Therefore, we have

$$\begin{aligned} 1 - \,\mathbb {P}{\left( \mathcal {D}_1 \right) } \le \sum _{j=1}^J \,\mathbb {P}{\left( {\left|V_j \right|} > \lambda \right) } \le 2J \exp (- C_5 \lambda ^2 JN) = \delta , \end{aligned}$$

provided that we choose \(M_5 = C_5^{-1}\). \(\square\)

Lemma D.4

For \(\beta \in \mathbb {R}^J\), we have

$$\begin{aligned} \,\mathbb {P}{\left( \mathcal {D}_2(\beta ) \right) } \ge 1 - \exp {\left( - \frac{J^2 N\lambda ^2}{4 S^2(\beta )} \right) }, \end{aligned}$$

where \(S(\beta ) = {\Vert f- \,\mathsf {s}_\beta \Vert }_\infty\).

Proof

We apply the Bernstein inequality in Lemma 3 of Bunea et al. (2007) to random variables \(\xi _n = {\left( \,\mathsf {s}_\beta (X_n) - f(X_n) \right) }^2\) for \(n=1,\ldots ,N\). With the uniform upper bound \(S^2(\beta ) {\Vert f - \,\mathsf {s}_\beta \Vert }_2^2\) on the second moment of \(\{ \xi _n \}_{n=1}^N\), we obtain

$$\begin{aligned} {\,\mathbb {P}{\left( {\Vert f-\,\mathsf {s}_\beta \Vert }_{2,N}^2 \ge 2 {\Vert f-\,\mathsf {s}_\beta \Vert }_2^2 + \lambda ^2 J^2 \right) } }\\&\le \exp {\left( - \frac{N {\left( {\Vert f-\,\mathsf {s}_\beta \Vert }_2^2 + \lambda ^2 J^2 \right) }^2}{2{\left( S^2(\beta ) {\Vert f-\,\mathsf {s}_\beta \Vert }_2^2 + S^2(\beta ) {\left( {\Vert f-\,\mathsf {s}_\beta \Vert }_2^2 + \lambda ^2 J^2 \right) } \right) }} \right) } \le \exp {\left( - \frac{N\lambda ^2 J^2}{4\,S^2(\beta )} \right) }. \end{aligned}$$

\(\square\)

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jhong, JH., Bak, KY. & Koo, JY. Penalized polygram regression. J. Korean Stat. Soc. 51, 1161–1192 (2022). https://doi.org/10.1007/s42952-022-00181-5
