In statistics, we usually assume that the number of samples N is larger than the number of variables p. Otherwise, linear regression does not produce a unique least squares solution, and finding the optimal variable set requires comparing the information criterion values of all \(2^p\) subsets of the p variables, which is computationally expensive when p is large. Estimating the parameters is therefore difficult. In such a situation, regularization is often used. In linear regression, we add a penalty term to the squared error to prevent the coefficient values from growing. When the penalty is a constant \(\lambda \) times the L1 norm of the coefficients, the method is called lasso; when it is \(\lambda \) times the squared L2 norm, the method is called ridge. In lasso, as the constant \(\lambda \) increases, some coefficients become exactly 0, and all coefficients are 0 once \(\lambda \) is sufficiently large. In that sense, lasso plays the role of model selection. In this chapter, we consider the principle of lasso and compare it with ridge. Finally, we learn how to choose the constant \(\lambda \).
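
The contrast between the two penalties can be seen in the simplest possible setting. The following is a minimal Python sketch, not the chapter's R code, under stated assumptions: a single predictor scaled so that \(\frac{1}{N}\sum _i x_i^2=1\), with \(z:=\frac{1}{N}\sum _i x_iy_i\); the lasso objective \(\frac{1}{2N}\sum _i(y_i-x_i\beta )^2+\lambda |\beta |\); and the ridge objective \(\frac{1}{N}\sum _i(y_i-x_i\beta )^2+\lambda \beta ^2\). Under these assumptions the minimizers have closed forms (soft-thresholding of z for lasso, \(z/(1+\lambda )\) for ridge). The function names and the value of z are hypothetical.

```python
def soft_threshold(lam, x):
    """Soft-thresholding: move x toward 0 by lam and clip at 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_1d(z, lam):
    # Minimizer of (1/(2N)) * sum (y_i - x_i*b)^2 + lam*|b|,
    # assuming (1/N) * sum x_i^2 = 1 and z = (1/N) * sum x_i*y_i.
    return soft_threshold(lam, z)


def ridge_1d(z, lam):
    # Minimizer of (1/N) * sum (y_i - x_i*b)^2 + lam*b^2 under the same scaling:
    # setting the derivative -2z + 2b + 2*lam*b to zero gives b = z / (1 + lam).
    return z / (1.0 + lam)


z = 0.8  # hypothetical value, chosen only for illustration
for lam in [0.0, 0.25, 0.5, 1.0, 2.0]:
    print(f"lambda={lam:4.2f}  lasso={lasso_1d(z, lam):.3f}  ridge={ridge_1d(z, lam):.3f}")
```

In this sketch the lasso coefficient is exactly 0 as soon as \(\lambda \ge |z|\), whereas the ridge coefficient shrinks toward 0 but never reaches it for finite \(\lambda \).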

  • DOI: 10.1007/978-981-15-7568-6_6
  • Chapter length: 15 pages


  1. In this book, convexity always means convex below and does not mean concave (convex above).

  2. In such a case, we do not express the subderivative as \(\{f'(x_0)\}\) but as \(f'(x_0)\).

Author information

Corresponding author

Correspondence to Joe Suzuki.

Exercises 49–56

  49. Let \(N, p\ge 1\). For \(X\in {\mathbb R}^{N\times p}\), \(y\in {\mathbb R}^N\), and \(\lambda \ge 0\), we wish to obtain the \(\beta \in {\mathbb R}^{p}\) that minimizes

    $$\frac{1}{N}\Vert y-X\beta \Vert ^2+\lambda \Vert \beta \Vert _2^2\ ,$$

    where for \(\beta =(\beta _1,\ldots ,\beta _p)\), we denote \(\Vert \beta \Vert _2:=\sqrt{\sum _{j=1}^p\beta _j^2}\). Suppose \(N<p\). Show that the minimizer is uniquely determined for every X and y if and only if \(\lambda >0\). Hint: To show a necessary and sufficient condition, both directions should be proved.

  50.
    (a) Suppose that a function \(f: {\mathbb R}\rightarrow {\mathbb R}\) is convex and differentiable at \(x=x_0\). Show that there exists a z such that \(f(x)\ge f(x_0)+z(x-x_0)\) for all \(x\in {\mathbb R}\) (a subderivative of f at \(x_0\)), and that z coincides with the derivative \(f'(x_0)\).

    (b) Show that \(-1\le z\le 1\) is equivalent to \(zx \le |x|\) for all \(x\in {\mathbb R}\).

    (c) Find the set of z defined in (a) for the function \(f(x)=|x|\) and \(x_0\in {\mathbb R}\). Hint: Consider the cases \(x_0>0\), \(x_0<0\), and \(x_0=0\).

    (d) Compute the subderivatives of \(f(x)=x^2-3x+|x|\) and \(f(x)=x^2+x+2|x|\) at each point, and find the maximal and minimal values for each of the two functions.

  51. Write an R function of \(\lambda \) and \(x\) that computes \(\mathcal{S}_\lambda (x)\), \(\lambda >0\), \(x\in {\mathbb R}\), defined by

    $$\mathcal{S}_\lambda (x):=\left\{ \begin{array}{ll} x-\lambda ,&x>\lambda \\ 0,&|x|\le \lambda \\ x+\lambda ,&x<-\lambda \end{array} \right. $$

    and execute the following.

    figure h

    Hint: Use pmax rather than max.

  52. We wish to find the \(\beta \in {\mathbb R}\) that minimizes

    $$L=\frac{1}{2N}\sum _{i=1}^N(y_i-x_i\beta )^2+\lambda |\beta |$$

    given \((x_i,y_i)\in {\mathbb R}\times {\mathbb R},\ i=1,\ldots ,N\), and \(\lambda >0\), where we assume that \(x_1,\ldots ,x_N\) have been scaled so that \(\frac{1}{N}\sum _{i=1}^N x_i^2=1\). Express the solution in terms of \(z:=\frac{1}{N}\sum _{i=1}^N x_iy_i\) and the function \(\mathcal{S}_\lambda (\cdot )\).

  53. For \(p>1\) and \(\lambda >0\), we estimate the coefficients \(\beta _0\in {\mathbb R}\) and \(\beta \in {\mathbb R}^p\) as follows. Initially, we assign random values to the coefficients \(\beta \in {\mathbb R}^p\). Then, we update \(\beta _j\) to \(\displaystyle \mathcal{S}_\lambda \left( \sum _{i=1}^N \frac{x_{i,j}r_{i,j}}{N}\right) \), where \(\displaystyle r_{i,j}:=y_i-\sum _{k\ne j}x_{i,k}\beta _k\). We repeat this update for \(j=1,\ldots ,p\) and repeat the cycle until convergence. The function lasso below scales each of the p variables to unit sample variance before estimating \((\beta _0,\beta )\). Fill in the blanks and execute the procedure.

    figure i

  54. Transform Problem 53 (lasso) into the setting of Problem 49 (ridge) and execute it. Hint: Replace the line that sets eps and the while loop in the function lasso by

    figure j

    and change the function name to ridge. Blank (4) should be ridge rather than lasso.

  55. Look up the meanings of glmnet and cv.glmnet and find the optimal \(\lambda \) and \(\beta \) for the data below. Which of the five variables are selected?

    figure k

    Hint: The coefficients are displayed via fit$beta. If a coefficient is nonzero, we consider it to be selected.

  56. Given \(x_{i,1},x_{i,2},y_i\in {\mathbb R},\ i=1,\ldots ,N\), let \(\hat{\beta }_1,\hat{\beta }_2\) be the \(\beta _1,\beta _2\) that minimize \(\displaystyle S:=\sum _{i=1}^N(y_i-\beta _1x_{i,1}-\beta _2x_{i,2})^2\), and let \(\hat{y}_i:=\hat{\beta }_1x_{i,1}+\hat{\beta }_{2}x_{i,2}\) \((i=1,\ldots ,N)\).

    (a) Show the following three equations.

      $$\sum _{i=1}^Nx_{i,1}(y_i-\hat{y}_i)=\sum _{i=1}^Nx_{i,2}(y_i-\hat{y}_i)=0.$$

      For arbitrary \(\beta _1,\beta _2\),

      $$y_i-\beta _1x_{i,1}-\beta _2x_{i,2}=y_i-\hat{y}_i-(\beta _1-\hat{\beta }_1)x_{i,1}-(\beta _2-\hat{\beta }_2)x_{i,2}.$$

      For arbitrary \(\beta _1,\beta _2\), \(\displaystyle \sum _{i=1}^N(y_i-\beta _1x_{i,1}-\beta _2x_{i,2})^2\) can be expressed as

      $$\begin{aligned}&(\beta _1-\hat{\beta }_1)^2\sum _{i=1}^Nx_{i,1}^2+2(\beta _1-\hat{\beta }_1)(\beta _2-\hat{\beta }_2)\sum _{i=1}^Nx_{i,1}x_{i,2}+(\beta _2-\hat{\beta }_2)^2\sum _{i=1}^Nx_{i,2}^2\\&+\sum _{i=1}^N(y_i-\hat{y}_i)^2. \end{aligned}$$

    (b) We consider the case \(\displaystyle \sum _{i=1}^Nx_{i,1}^2=\sum _{i=1}^Nx_{i,2}^2=1,\ \sum _{i=1}^Nx_{i,1}x_{i,2}=0\). In the standard least squares method, we choose the coefficients \(\beta _1=\hat{\beta }_1\), \(\beta _2=\hat{\beta }_2\). However, under the constraint that \(|\beta _1|+|\beta _2|\) is at most a constant, we choose the point \((\beta _1,\beta _2)\) at which the circle centered at \((\hat{\beta }_1,\hat{\beta }_2)\) with the smallest possible radius touches the rhombus defined by the constraint. Suppose that we grow the radius of the circle centered at \((\hat{\beta }_1,\hat{\beta }_2)\) until it touches the rhombus with vertices \((1,0),(0,1),(-1,0),(0,-1)\). Show the region of centers \((\hat{\beta }_1,\hat{\beta }_2)\) for which the contact point has one coordinate (\(\beta _1\) or \(\beta _2\)) equal to zero.

    (c) What if the rhombus in (b) is replaced by a unit circle?
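
The soft-thresholding function of Exercise 51 and the cyclic update of Exercise 53 can be sketched together as follows. This is a Python analogue under stated assumptions, not the R function lasso whose blanks the exercise asks the reader to fill in: the names soft_threshold and lasso_cd, the stopping tolerance, the zero initial value (the exercise starts from random values), and the toy data are all hypothetical. The sketch centers y, centers each column of X, and scales each column to unit sample variance with divisor N, so that \(\frac{1}{N}\sum _i x_{i,j}^2=1\) as in Exercise 52. It works on scalars; the exercise's hint about pmax concerns the vectorized R version.

```python
def soft_threshold(lam, x):
    """S_lambda(x) = sign(x) * max(|x| - lam, 0)."""
    sign = 1.0 if x > 0 else -1.0
    return sign * max(abs(x) - lam, 0.0)


def lasso_cd(X, y, lam, tol=1e-10, max_iter=10_000):
    """Cyclic coordinate descent for the lasso on standardized columns.

    Assumes no column of X is constant (otherwise its standard deviation is 0).
    Returns (beta0, beta) on the original scale of X.
    """
    n, p = len(X), len(X[0])
    mean_x = [sum(row[j] for row in X) / n for j in range(p)]
    sd_x = [(sum((row[j] - mean_x[j]) ** 2 for row in X) / n) ** 0.5 for j in range(p)]
    mean_y = sum(y) / n
    Z = [[(row[j] - mean_x[j]) / sd_x[j] for j in range(p)] for row in X]
    yc = [v - mean_y for v in y]

    beta = [0.0] * p  # assumption: start from zero rather than random values
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            # s accumulates x_ij * r_ij with r_ij = y_i - sum_{k != j} x_ik * beta_k
            s = 0.0
            for i in range(n):
                fit_without_j = sum(Z[i][k] * beta[k] for k in range(p) if k != j)
                s += Z[i][j] * (yc[i] - fit_without_j)
            new = soft_threshold(lam, s / n)
            max_change = max(max_change, abs(new - beta[j]))
            beta[j] = new
        if max_change < tol:  # stop when a full cycle changes nothing noticeable
            break

    # Undo the scaling and recover the intercept from the means
    beta_orig = [beta[j] / sd_x[j] for j in range(p)]
    beta0 = mean_y - sum(mean_x[j] * beta_orig[j] for j in range(p))
    return beta0, beta_orig


# Toy data (hypothetical): y depends on the first column only
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]]
y = [2.0 * row[0] + 1.0 for row in X]
for lam in [0.0, 2.5, 10.0]:
    print(lam, lasso_cd(X, y, lam))
```

On this toy data, \(\lambda =0\) should recover the least squares fit, an intermediate value such as \(\lambda =2.5\) sets the second coefficient to exactly 0 while keeping the first, and a large value such as \(\lambda =10\) sets both coefficients to 0 so that only the intercept (the mean of y) remains.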

Copyright information

© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Suzuki, J. (2020). Regularization. In: Statistical Learning with Math and R. Springer, Singapore.

  • DOI: 10.1007/978-981-15-7568-6_6

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-7567-9

  • Online ISBN: 978-981-15-7568-6

  • eBook Packages: Computer Science, Computer Science (R0)