
Information Criteria



Until now, we have considered the following two tasks, each starting from the observed data:

  • Build a statistical model and estimate the parameters contained in it

  • Estimate the statistical model itself

In this chapter, we consider the latter for linear regression. Finding rules from observational data is not limited to data science and statistics; many scientific discoveries have been born through such processes. For example, the law of elliptical orbits, the law of equal areas, and the harmonic law in the theory of planetary motion, published by Kepler in 1609 and 1619, marked the transition from the then-dominant geocentric theory to the heliocentric theory. While explanations under the geocentric theory rested on countless assumptions rooted in philosophy and thought, Kepler’s laws answered most of the open questions of the time with only three laws. In other words, a scientific law must not only explain phenomena (fitness) but also be simple (simplicity). In this chapter, we will learn how to derive and apply the AIC and BIC, which evaluate statistical models of data by balancing fitness and simplicity.

  • DOI: 10.1007/978-981-15-7568-6_5
  • Chapter length: 18 pages


  1. 1.

    By |S|, we mean the cardinality of set S.

  2. 2.

    In many practical situations, including linear regression, no problem occurs.

  3. 3.

    By O(f(N)), we denote a function g(N) such that g(N)/f(N) is bounded.

  4. 4.

    By O(f(N)), we denote a function g(N) such that g(N)/f(N) is bounded.

Author information

Correspondence to Joe Suzuki.


Appendix: Proof of Propositions

Proposition 18

For covariates \(x_1,\ldots ,x_N\), if the responses are \(z_1,\ldots ,z_N\), the negative log-likelihood \(\displaystyle -\sum _{i=1}^N\log f(z_i|x_i,\gamma )\) of \(\gamma \in {\mathbb R}^{p+1}\) is

$$\begin{aligned} \frac{N}{2}\log 2\pi \sigma ^2 +\frac{1}{2\sigma ^2}\Vert z-X\beta \Vert ^2 -\frac{1}{\sigma ^2}(\gamma -\beta )^TX^T(z-X\beta )+\frac{1}{2\sigma ^2}(\gamma -\beta )^TX^TX(\gamma -\beta ) \end{aligned}$$

for an arbitrary \(\beta \in {\mathbb R}^{p+1}\).


In fact, for \(u\in {\mathbb R}\) and a row vector \(x\in {\mathbb R}^{p+1}\), we have that

$$\begin{aligned} \log f(u|x,\gamma )= & {} -\frac{1}{2}\log 2\pi \sigma ^2-\frac{1}{2\sigma ^2}(u-x\gamma )^2\\ (u-x\gamma )^2= & {} \{(u-x\beta )-x(\gamma -\beta )\}^2\\= & {} (u-x\beta )^2-2(\gamma -\beta )^Tx^T(u-x\beta )+(\gamma -\beta )^Tx^Tx(\gamma -\beta )\\ \log f(u|x,\gamma )= & {} -\frac{1}{2}\log 2\pi \sigma ^2 -\frac{1}{2\sigma ^2}(u-x\beta )^2\\&+\frac{1}{\sigma ^2}(\gamma -\beta )^Tx^T(u-x\beta )-\frac{1}{2\sigma ^2}(\gamma -\beta )^Tx^Tx(\gamma -\beta ) \end{aligned}$$

and, if we sum over \((x,u)=(x_1,z_1),\ldots ,(x_N,z_N)\), we can write

$$\begin{aligned} -\sum _{i=1}^N\log f(z_i|x_i,\gamma )= & {} \frac{N}{2}\log 2\pi \sigma ^2 +\frac{1}{2\sigma ^2}\Vert z-X\beta \Vert ^2\\&-\frac{1}{\sigma ^2}(\gamma -\beta )^TX^T(z-X\beta )+\frac{1}{2\sigma ^2}(\gamma -\beta )^TX^TX(\gamma -\beta )\ , \end{aligned}$$

where we have used \(z=[z_1,\ldots ,z_N]^T\) and \(\displaystyle \Vert z-X\beta \Vert ^2=\sum _{i=1}^N(z_i-x_i\beta )^2,\ \displaystyle X^TX=\sum _{i=1}^Nx_i^Tx_i,\ X^T(z-X\beta )=\sum _{i=1}^Nx_i^T(z_i-x_i\beta )\).
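The identity in Proposition 18 can be checked numerically. The following is a minimal sketch, assuming made-up data and arbitrary values of \(\gamma \), \(\beta \), and \(\sigma ^2\); the function names `neg_log_lik` and `expanded_form` are hypothetical and do not come from the text.

```python
import math
import random

def neg_log_lik(X, z, gamma, sigma2):
    """-sum_i log f(z_i | x_i, gamma) for the Gaussian linear model."""
    total = 0.0
    for x, zi in zip(X, z):
        resid = zi - sum(a * g for a, g in zip(x, gamma))
        total += 0.5 * math.log(2 * math.pi * sigma2) + resid ** 2 / (2 * sigma2)
    return total

def expanded_form(X, z, gamma, beta, sigma2):
    """Right-hand side of Proposition 18, built from the three sums."""
    d = [g - b for g, b in zip(gamma, beta)]
    rss = cross = quad = 0.0
    for x, zi in zip(X, z):
        r = zi - sum(a * b for a, b in zip(x, beta))   # z_i - x_i beta
        xd = sum(a * di for a, di in zip(x, d))        # x_i (gamma - beta)
        rss += r * r      # accumulates ||z - X beta||^2
        cross += xd * r   # accumulates (gamma-beta)^T X^T (z - X beta)
        quad += xd * xd   # accumulates (gamma-beta)^T X^T X (gamma-beta)
    n = len(z)
    return ((n / 2) * math.log(2 * math.pi * sigma2)
            + rss / (2 * sigma2) - cross / sigma2 + quad / (2 * sigma2))

# Illustrative data only: N = 20 rows, intercept plus p = 2 covariates.
random.seed(0)
N, p, sigma2 = 20, 2, 1.5
X = [[1.0] + [random.gauss(0, 1) for _ in range(p)] for _ in range(N)]
z = [random.gauss(0, 1) for _ in range(N)]
gamma = [0.3, -1.2, 0.7]
beta = [1.0, 0.5, -0.4]   # arbitrary, as the proposition allows

lhs = neg_log_lik(X, z, gamma, sigma2)
rhs = expanded_form(X, z, gamma, beta, sigma2)
print(lhs, rhs)
```

The two printed values should agree up to floating-point rounding for any choice of `beta`, which is the content of the proposition.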

Proposition 19

Let k(S) be the cardinality of S. Then, we have the following (see footnote 4):

$$E[\log \hat{\sigma }^2(S)]=\log \sigma ^2(S)-\frac{k(S)+2}{N}+O\left( \frac{1}{N^2}\right) \ .$$


Let \(m\ge 1\), let \(V_1,\ldots ,V_m\sim N(0,1)\) be independent, and let \(U:=V_1^2+\cdots +V_m^2\sim \chi ^2_m\). For \(i=1,\ldots ,m\) and \(t<1/2\), we have that

$$\begin{aligned} Ee^{tV_i^2}= & {} \int _{-\infty }^\infty e^{tv_i^2}\frac{1}{\sqrt{2\pi }}e^{-v_i^2/2}dv_i= \int _{-\infty }^\infty \frac{1}{\sqrt{2\pi }} \exp \left\{ -\frac{(1-2t)v_i^2}{2}\right\} dv_i= (1-2t)^{-1/2}\\ Ee^{tU}= & {} \int _{-\infty }^\infty \cdots \int _{-\infty }^\infty e^{t(v_1^2+\cdots +v_{m}^2)}\frac{1}{(2\pi )^{m/2}} e^{-(v_1^2+\cdots +v_{m}^2)/2}\, dv_1\cdots dv_{m} =(1-2t)^{-m/2}\ , \end{aligned}$$

which means that, for \(n=1,2,\ldots \), the moments are given by (5.16):

$$\begin{aligned} EU^n= & {} \frac{d{^n}Ee^{tU}}{dt^n}\bigg |_{t=0}=m(m+2)\cdots (m+2n-2) \ , \end{aligned}$$

where \(\displaystyle Ee^{tU}=1+tE[U]+\frac{t^2}{2}E[U^2]+\cdots \) has been used. Moreover, from the Taylor expansion of the logarithm around one, we have (5.17):

$$\begin{aligned} E[\log \frac{U}{m}]= & {} E\left( \frac{U}{m}-1\right) -\frac{1}{2}E\left( \frac{U}{m}-1\right) ^2+\cdots \ . \end{aligned}$$

Applying (5.16) with \(n=1,2\), i.e., \(EU=m\) and \(EU^2=m(m+2)\), the first and second terms of (5.17) are zero and

$$-\frac{1}{2m^2}(EU^2-2mEU+m^2)=-\frac{1}{2m^2}\{m(m+2)-2m^2+m^2\}=-\frac{1}{m}\ ,$$

respectively.
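As a sanity check on the moment formula and on the value \(-1/m\) of the second-order term, the sketch below compares the product \(m(m+2)\cdots (m+2n-2)\) with the standard closed form \(2^n\Gamma (m/2+n)/\Gamma (m/2)\) for the moments of a chi-squared variable. The function names are hypothetical, and the chosen values of m and n are arbitrary.

```python
import math
from fractions import Fraction

def raw_moment(m, n):
    """E[U^n] = m (m+2) ... (m+2n-2) for U ~ chi^2_m."""
    out = 1
    for j in range(n):
        out *= m + 2 * j
    return out

def raw_moment_gamma(m, n):
    """Same moment via 2^n Gamma(m/2 + n) / Gamma(m/2)."""
    return 2 ** n * math.gamma(m / 2 + n) / math.gamma(m / 2)

def second_order_term(m):
    """-(1/2) E(U/m - 1)^2, computed exactly from the first two moments."""
    eu, eu2 = raw_moment(m, 1), raw_moment(m, 2)
    return -Fraction(eu2 - 2 * m * eu + m * m, 2 * m * m)

for m in (1, 5, 30):
    for n in (1, 2, 3, 4):
        print(m, n, raw_moment(m, n), raw_moment_gamma(m, n))
    print("second-order term for m =", m, "is", second_order_term(m))
```

Exact rational arithmetic is used for the second-order term so that the comparison with \(-1/m\) does not depend on rounding.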


Next, we show that each term in (5.17) of order \(n\ge 3\) is \(O(1/m^2)\). From the binomial theorem and (5.16), we have (5.18):

$$\begin{aligned} E(U-m)^n = \sum _{j=0}^n \left( \begin{array}{c} n\\ j \end{array} \right) EU^j(-m)^{n-j} = \sum _{j=0}^n(-1)^{n-j} \left( \begin{array}{c} n\\ j \end{array} \right) m^{n-j}m(m+2)\cdots (m+2j-2)\ . \end{aligned}$$

If we regard

$$m^{n-j}m(m+2)\cdots (m+2j-2)$$

as a polynomial in m, the coefficients of the terms of degree n and \(n-1\) are one and \(2\{1+2+\cdots +(j-1)\}=j(j-1)\), respectively. Hence, up to the common factor \((-1)^n\), the coefficients of the terms of degree n and \(n-1\) in (5.18) are

$$\displaystyle \sum _{j=0}^n\left( \begin{array}{c} n\\ j \end{array} \right) (-1)^j= \sum _{j=0}^n\left( \begin{array}{c} n\\ j \end{array} \right) (-1)^j1^{n-j}=(-1+1)^n= 0$$

and

$$ \sum _{j=0}^n \left( \begin{array}{c} n\\ j \end{array} \right) (-1)^jj(j-1)= \sum _{j=2}^n \frac{n!}{(n-j)!(j-2)!} (-1)^{j-2}= n(n-1) \sum _{i=0}^{n-2} \left( \begin{array}{c} n-2\\ i \end{array} \right) (-1)^{i}=0\ , $$

respectively. Thus, \(E(U-m)^n\) is a polynomial in m of degree at most \(n-2\), and we have shown that for \(n\ge 3\),

$$E\left( \frac{U}{m}-1\right) ^n=O\left( \frac{1}{m^2}\right) \ .$$
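The bound on the higher-order terms can be illustrated with exact arithmetic. The sketch below evaluates \(E(U-m)^n\) through the binomial expansion and prints \(m^2E(U/m-1)^n\), which should stay bounded as m grows; the helper names and the sampled values of m are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def raw_moment(m, j):
    """E[U^j] = m (m+2) ... (m+2j-2) for U ~ chi^2_m."""
    out = 1
    for i in range(j):
        out *= m + 2 * i
    return out

def central_moment(m, n):
    """E(U - m)^n via the binomial expansion of (U - m)^n."""
    return sum((-1) ** (n - j) * comb(n, j) * m ** (n - j) * raw_moment(m, j)
               for j in range(n + 1))

def scaled_term(m, n):
    """m^2 * E(U/m - 1)^n as an exact fraction."""
    return Fraction(central_moment(m, n) * m * m, m ** n)

for n in range(3, 7):
    print(n, [float(scaled_term(m, n)) for m in (10, 100, 1000)])
```

For n = 3 the scaled term is constant in m, and for larger n it decreases with m, consistent with the \(O(1/m^2)\) claim.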

Finally, from \(\displaystyle \frac{RSS(S)}{\sigma ^2(S)}= \frac{N\hat{\sigma }^2(S)}{{\sigma }^2(S)}\sim \chi ^2_{N-k(S)-1}\), we apply (5.17) with \(m=N-k(S)-1\). Since

$$\log \frac{N}{N-k(S)-1}=\frac{k(S)+1}{N-k(S)-1}+O\left( \frac{1}{(N-k(S)-1)^2}\right) =\frac{k(S)+1}{N}+O\left( \frac{1}{N^2}\right) $$

and

$$E\left[ \log \left( \frac{\hat{\sigma }^2(S)}{N-k(S)-1}\bigg /\frac{{\sigma }^2(S)}{N}\right) \right] =-\frac{1}{N-k(S)-1}+O\left( \frac{1}{N^2}\right) =-\frac{1}{N}+O\left( \frac{1}{N^2}\right) \ ,$$

we obtain

$$ E\left[ \log \frac{\hat{\sigma }^2(S)}{\sigma ^2(S)}\right] =-\frac{1}{N}-\frac{k(S)+1}{N}+O\left( \frac{1}{N^2}\right) =-\frac{k(S)+2}{N}+O\left( \frac{1}{N^2}\right) \ . $$
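Proposition 19 can also be checked numerically, using the standard fact that \(E[\log U]=\psi (m/2)+\log 2\) for \(U\sim \chi ^2_m\), where \(\psi \) is the digamma function. The following sketch assumes a hand-written `digamma` (recurrence plus the usual asymptotic series) and an arbitrary choice \(k(S)=3\); neither comes from the text.

```python
import math

def digamma(x):
    """Digamma function: upward recurrence, then the standard asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x      # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return acc + math.log(x) - 0.5 / x - series

def expected_log_ratio(N, k):
    """E[log(sigma_hat^2 / sigma^2)] when N * sigma_hat^2 / sigma^2 ~ chi^2_{N-k-1}."""
    m = N - k - 1
    return digamma(m / 2) + math.log(2) - math.log(N)

k = 3
for N in (100, 1000, 10000):
    exact = expected_log_ratio(N, k)
    approx = -(k + 2) / N
    # The last column should stay bounded if the remainder is O(1/N^2).
    print(N, exact, approx, (exact - approx) * N * N)
```

The scaled remainder in the last column settling near a constant is what the \(O(1/N^2)\) term predicts.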

Exercises 40–48

In the following, we define

$$X= \left[ \begin{array}{c} {x_1}\\ \vdots \\ x_N \end{array} \right] \in {\mathbb R}^{N\times (p+1)} ,\ y= \left[ \begin{array}{c} y_1\\ \vdots \\ y_N \end{array} \right] \in {\mathbb R}^N ,\ z= \left[ \begin{array}{c} z_1\\ \vdots \\ z_N \end{array} \right] \in {\mathbb R}^N ,\ \beta = \left[ \begin{array}{c} \beta _0\\ \beta _1\\ \vdots \\ \beta _p \end{array} \right] \in {\mathbb R}^{p+1}\ , $$

where \(x_1,\ldots ,x_N\) are row vectors. We assume that \(X^TX\) has an inverse matrix and denote by \(E[\cdot ]\) the expectation w.r.t.

$$f(y|x,\beta ):=\frac{1}{\sqrt{2\pi \sigma ^2}}\exp \left\{ -\frac{\Vert y-x\beta \Vert ^2}{2\sigma ^2}\right\} \ .$$
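To make the density above concrete, the sketch below evaluates the log-likelihood \(l=\sum _{i}\log f(y_i|x_i,\beta )\) for a single covariate (\(p=1\)) and compares its value at the least squares estimate with its value at nearby parameters. The data and the function names are made up for illustration.

```python
import math

def log_lik(x, y, b0, b1, sigma2):
    """Sum of log f(y_i | x_i, beta) for the model y = b0 + b1 * x + noise."""
    n = len(y)
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return -n / 2 * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)

def least_squares(x, y):
    """Closed-form intercept and slope for simple linear regression."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Illustrative data only.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9]

b0, b1 = least_squares(x, y)
sigma2_hat = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / len(y)
best = log_lik(x, y, b0, b1, sigma2_hat)

print("least squares:", b0, b1, "RSS/N:", sigma2_hat)
print("l at estimate:", best)
print("l with shifted slope:", log_lik(x, y, b0, b1 + 0.1, sigma2_hat))
print("l with doubled variance:", log_lik(x, y, b0, b1, 2 * sigma2_hat))
```

In this example, moving either the coefficients or the variance away from the least squares fit and \(RSS/N\) lowers the log-likelihood.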
  1. 40.

    For \(X \in {\mathbb R}^{N\times (p+1)}\), \(y\in {\mathbb R}^N\), show each of the following.

    1. (a)

      If the variance \(\sigma ^2>0\) is known, the \(\beta \in {\mathbb R}^{p+1}\) that maximizes \(\displaystyle l:=\sum _{i=1}^N \log f(y_i|x_i,\beta )\) coincides with the least squares solution. Hint:

      $${l}=-\frac{N}{2}\log (2\pi \sigma ^2)-\frac{1}{2\sigma ^2}\Vert y-X\beta \Vert ^2$$
    2. (b)

      If both \(\beta \in {\mathbb R}^{p+1}\) and \(\sigma ^2>0\) are unknown, the maximum likelihood estimate of \(\sigma ^2\) is given by

      $$\hat{\sigma }^2=\frac{1}{N}\Vert y-X\hat{\beta }\Vert ^2\ .$$

      Hint: If we partially differentiate l with respect to \(\sigma ^2\) and set the result to zero, we have

      $$\displaystyle \frac{\partial l}{\partial \sigma ^2}= -\frac{N}{2\sigma ^2}+\frac{\Vert y-X\beta \Vert ^2}{2(\sigma ^2)^2} =0$$
    3. (c)

      For probability density functions f and g over \(\mathbb R\), the Kullback-Leibler divergence is nonnegative, i.e.,

      $$\displaystyle D(f\Vert g):=\int _{-\infty }^\infty f(x)\log \frac{f(x)}{g(x)}dx\ge 0$$
  2. 41.

    Let \(f^N(y|x,\beta ):=\prod _{i=1}^Nf(y_i|x_i,\beta )\). By showing (a) through (d), prove

    $$J=\frac{1}{N}E(\nabla l)^2=-\frac{1}{N}E\nabla ^2 l$$
    1. (a)

      \(\displaystyle {\nabla l}=\frac{\nabla f^N(y|x,\beta )}{f^N(y|x,\beta )}\)

    2. (b)

      \(\displaystyle \int \nabla f^N(y|x,\beta )dy=0\)

    3. (c)

      \(E\nabla l=0\)

    4. (d)

      \(\nabla E[\nabla l]= E[\nabla ^2 l]+E[(\nabla l)^2]\)

  3. 42.

    Let \(\tilde{\beta }\in {\mathbb R}^{p+1}\) be an arbitrary unbiased estimate of \(\beta \). By showing (a) through (c), prove the Cramér-Rao inequality

    $$V(\tilde{\beta })\ge (NJ)^{-1}$$
    1. (a)

      \(E[(\tilde{\beta }-\beta )(\nabla l)^T]=I\)

    2. (b)

      The covariance matrix of the vector of size \(2(p+1)\) obtained by stacking \(\tilde{\beta }-\beta \) and \(\nabla l\) is

      $$ \left[ \begin{array}{c@{\quad }c} V(\tilde{\beta })&{}I\\ I&{}NJ \end{array} \right] $$
    3. (c)

      Both sides of

      $$ \left[ \begin{array}{c@{\quad }c} V(\tilde{\beta })-(NJ)^{-1}&{}0\\ 0&{}NJ \end{array} \right] = \left[ \begin{array}{c@{\quad }c} I&{}-(NJ)^{-1}\\ 0&{}I \end{array} \right] \left[ \begin{array}{c@{\quad }c} V(\tilde{\beta })&{}I\\ I&{}NJ \end{array} \right] \left[ \begin{array}{c@{\quad }c} I&{}0\\ -(NJ)^{-1}&{}I \end{array} \right] $$

      are nonnegative definite.

  4. 43.

    By showing (a) through (c), prove \(E\Vert X(\tilde{\beta }-\beta )\Vert ^2\ge \sigma ^2(p+1)\).

    1. (a)

      \(E[(\tilde{\beta }-\beta )^T\nabla {l}]=p+1\)

    2. (b)

      \(E\Vert X(X^TX)^{-1}\nabla l\Vert ^2=(p+1)/\sigma ^2\)

    3. (c)

      \(\{E(\tilde{\beta }-\beta )^T\nabla {l}\}^2\le E\Vert X(X^TX)^{-1}\nabla l\Vert ^2E\Vert X(\tilde{\beta }-\beta )\Vert ^2\). Hint: For random variables \(U,V\in {\mathbb R}^m\) (\(m\ge 1\)), prove \(\{E[U^TV]\}^2\le E[\Vert U\Vert ^2]E[\Vert V\Vert ^2]\) (Schwarz’s inequality).

  5. 44.

    Prove the following statements.

    1. (a)

      For covariates \(x_1,\ldots ,x_N\), if we obtain the responses \(z_1,\ldots ,z_N\), then the negative log-likelihood \(\displaystyle -\sum _{i=1}^N\log f(z_i|x_i,\gamma )\) of the parameter \(\gamma \in {\mathbb R}^{p+1}\) is

      $$\frac{N}{2}\log 2\pi \sigma ^2 +\frac{1}{2\sigma ^2}\Vert z-X\beta \Vert ^2 -\frac{1}{\sigma ^2}(\gamma -\beta )^TX^T(z-X\beta )+\frac{1}{2\sigma ^2}(\gamma -\beta )^TX^TX(\gamma -\beta ) $$

      for an arbitrary \(\beta \in {\mathbb R}^{p+1}\).

    2. (b)

      If we take the expectation of (a) w.r.t. \(z_1,\ldots ,z_N\), it is

      $$\frac{N}{2}\log (2\pi \sigma ^2e)+\frac{1}{2\sigma ^2}\Vert X(\gamma -\beta )\Vert ^2\ .$$
    3. (c)

      If we estimate \(\beta \) and choose an estimate \(\gamma \) of \(\beta \), the minimum value of (b) on average is

      $$\frac{N}{2}\log (2\pi \sigma ^2e)+\frac{1}{2}(p+1)$$

      and the minimum value is realized by the least squares method.

    4. (d)

      Instead of choosing all the p covariates, we choose \(0\le k\le p\) covariates from p. Minimizing

      $$\frac{N}{2}\log (2\pi \sigma _k^2e)+\frac{1}{2}(k+1)$$

      w.r.t. k is equivalent to minimizing \({N}\log \sigma ^2_k+k\) w.r.t. k, where \(\sigma ^2_k\) is the minimum variance when we choose k covariates.

  6. 45.

    By showing (a) through (f), prove

    $$E\log \frac{\hat{\sigma }^2(S)}{\sigma ^2} =-\frac{1}{N}-\frac{k(S)+1}{N}+O\left( \frac{1}{N^2}\right) =-\frac{k(S)+2}{N}+O\left( \frac{1}{N^2}\right) \ .$$

    Use the fact that the n-th moment of \(U\sim \chi ^2_m\) is

    $$EU^n=m(m+2)\cdots (m+2n-2)\ $$

    without proving it.

    1. (a)

      \(\displaystyle E\log \frac{U}{m}=E\left( \frac{U}{m}-1\right) -\frac{1}{2}E\left( \frac{U}{m}-1\right) ^2+\cdots \)

    2. (b)

      \(\displaystyle E\left( \frac{U}{m}-1\right) =0\) and \(\displaystyle E\left( \frac{U}{m}-1\right) ^2=\frac{2}{m}\)

    3. (c)

      \(\displaystyle \sum _{j=0}^n(-1)^{n-j} \left( \begin{array}{c} n\\ j \end{array} \right) =0 \)

    4. (d)

      If we regard \( \displaystyle E(U-m)^n= \sum _{j=0}^n(-1)^{n-j} \left( \begin{array}{c} n\\ j \end{array} \right) m^{n-j}m(m+2)\cdots (m+2j-2) \) as a polynomial in m, the sum of the terms of degree n is zero. Hint: Use (c).

    5. (e)

      The sum of the terms of degree \(n-1\) is zero. Hint: Derive that the coefficient of degree \(n-1\) is \(2\{1+2+\cdots +(j-1)\}=j(j-1)\) for each j and that \( \displaystyle \sum _{j=0}^n \left( \begin{array}{c} n\\ j \end{array} \right) (-1)^jj(j-1)=0 \).

    6. (f)

      \(\displaystyle E\log \left( \frac{\hat{\sigma }^2(S)}{N-k(S)-1}\bigg /\frac{{\sigma }^2}{N}\right) =-\frac{1}{N}+O\left( \frac{1}{N^2}\right) \)

  7. 46.

    The following procedure produces the AIC value. Fill in the blanks and execute the procedure.

    (Figure f: program listing with blanks, not reproduced here.)
  8. 47.

    Instead of AIC, we consider a criterion that minimizes the following quantity (BIC, Bayesian Information Criterion):

    $$N\log \hat{\sigma }^2+k\log N$$

    Replace the associated lines of the AIC procedure above, and name the function BIC. For the same data, execute BIC. Moreover, construct a procedure to choose the covariate set that maximizes

    $$AR^2:=1-\frac{RSS/(N-k-1)}{TSS/(N-1)}$$

    (adjusted coefficient of determination, where RSS and TSS are the residual and total sums of squares) and name the function AR2. For the same data, execute AR2.

  9. 48.

    We wish to visualize the k that minimizes AIC and BIC. Fill in the blanks and execute the procedure.

    (Figure g: program listing with blanks, not reproduced here.)
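The listings for Exercises 46 through 48 are not part of this extract. As a rough illustration of what such procedures compute, here is a minimal Python sketch of exhaustive subset selection, assuming the criteria \(N\log \hat{\sigma }^2+2k\) (AIC), \(N\log \hat{\sigma }^2+k\log N\) (BIC), and the usual adjusted coefficient of determination, with \(\hat{\sigma }^2=RSS/N\). It is not the book's code: the data are synthetic, and all function names are hypothetical.

```python
import math
import random
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def rss(X, y, S):
    """Residual sum of squares for least squares on an intercept plus columns S."""
    Z = [[1.0] + [row[j] for j in S] for row in X]
    q = len(S) + 1
    A = [[sum(z[a] * z[b] for z in Z) for b in range(q)] for a in range(q)]
    c = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(q)]
    beta = solve(A, c)  # normal equations (Z^T Z) beta = Z^T y
    return sum((yi - sum(za * ba for za, ba in zip(z, beta))) ** 2
               for z, yi in zip(Z, y))

def aic(N, k, r, tss):
    return N * math.log(r / N) + 2 * k

def bic(N, k, r, tss):
    return N * math.log(r / N) + k * math.log(N)

def ar2(N, k, r, tss):
    return 1 - (r / (N - k - 1)) / (tss / (N - 1))

def best_subset(X, y, criterion, maximize=False):
    """Score every covariate subset and return (score, subset) of the best one."""
    N, p = len(y), len(X[0])
    ybar = sum(y) / N
    tss = sum((yi - ybar) ** 2 for yi in y)
    scored = [(criterion(N, len(S), rss(X, y, S), tss), S)
              for k in range(p + 1) for S in combinations(range(p), k)]
    return max(scored) if maximize else min(scored)

# Synthetic data: only covariate 0 actually influences y.
random.seed(1)
N = 60
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(N)]
y = [1.0 + 3.0 * row[0] + random.gauss(0, 0.5) for row in X]

aic_val, aic_set = best_subset(X, y, aic)
bic_val, bic_set = best_subset(X, y, bic)
ar2_val, ar2_set = best_subset(X, y, ar2, maximize=True)
print("AIC:", aic_val, aic_set)
print("BIC:", bic_val, bic_set)
print("AR2:", ar2_val, ar2_set)
```

Because \(\log N>2\) for \(N\ge 8\), BIC penalizes each covariate more heavily than AIC, so the subset it selects is never larger than the one AIC selects. Exhaustive search over all \(2^p\) subsets is only practical for small p.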

Copyright information

© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Suzuki, J. (2020). Information Criteria. In: Statistical Learning with Math and R. Springer, Singapore.

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-7567-9

  • Online ISBN: 978-981-15-7568-6

  • eBook Packages: Computer Science (R0)