



In this chapter, we consider constructing a classification rule from covariates to a response that takes values in a finite set, such as \(\{\pm 1\}\) or the digits \(0,1,\ldots ,9\). For example, we may wish to recognize the digits of a postal code from handwritten characters, i.e., to build a rule that maps each character to a digit. First, we consider logistic regression, which constructs a classifier from the training data so as to minimize the error rate on the test data. The second approach is to draw borders that separate the regions of the responses, using linear and quadratic discriminators and the k-nearest neighbor algorithm. Linear and quadratic discrimination draw linear and quadratic borders, respectively, and both introduce the notion of prior probability to minimize the average error probability. The k-nearest neighbor method searches for the border more flexibly than the linear and quadratic discriminators do. Finally, we take into account the balance between two kinds of risk, such as classifying a sick person as healthy versus classifying a healthy person as sick, and consider an alternative approach beyond minimizing the average error probability. The regression methods of the previous chapter and the classification methods of this chapter are two major topics in the field of machine learning.

Figures 3.1–3.4


  1. In this chapter, instead of \(\beta \in {\mathbb R}^{p+1}\), we separate the slope \(\beta \in {\mathbb R}^p\) and the intercept \(\beta _0\in {\mathbb R}\).

Author information

Corresponding author

Correspondence to Joe Suzuki.

Exercises 19–31


  19.

    We assume that there exist \(\beta _0\in {\mathbb R}\) and \(\beta \in {\mathbb R}^p\) such that for \(x\in {\mathbb R}^p\), the probabilities of \(Y=1\) and \(Y=-1\) are \(\displaystyle \frac{e^{\beta _0+x\beta }}{1+e^{\beta _0+x\beta }}\) and \(\displaystyle \frac{1}{1+e^{\beta _0+x\beta }}\), respectively. Show that the probability of \(Y=y\in \{-1,1\}\) can be written as \(\displaystyle \frac{1}{1+e^{-y(\beta _0+x\beta )}}\).
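The claimed identity can be checked numerically. Below is a minimal Python sketch (the book works in R; Python is used here only for illustration) comparing the two expressions at arbitrary values of \(t=\beta _0+x\beta \):

```python
import math

def p_pos(t):
    """P(Y = 1) when t = beta0 + x * beta."""
    return math.exp(t) / (1 + math.exp(t))

def p_unified(y, t):
    """Claimed unified form 1 / (1 + e^{-y t}) for y in {-1, 1}."""
    return 1 / (1 + math.exp(-y * t))

for t in [-2.0, -0.5, 0.0, 1.3, 4.0]:
    assert abs(p_pos(t) - p_unified(1, t)) < 1e-12         # y = 1 case
    assert abs((1 - p_pos(t)) - p_unified(-1, t)) < 1e-12  # y = -1 case
```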

  20.

    For \(p=1\) and \(\beta >0\), show that the function \(f(x)=\displaystyle \frac{1}{1+e^{-(\beta _0+x\beta )}}\) is monotonically increasing for \(x\in {\mathbb R}\) and convex and concave in \(x<-\beta _0/\beta \) and \(x>-\beta _0/\beta \), respectively. How does the function change as \(\beta \) increases? Execute the following to answer this question.

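The original R listing appears as a figure in the book and is not reproduced here. The following Python sketch (an illustration, not the book's code) verifies the monotonicity and the convex/concave regions numerically; since the slope at the inflection point is \(\beta /4\), the curve steepens as \(\beta \) increases:

```python
import numpy as np

def f(x, beta0, beta):
    """Logistic function 1 / (1 + e^{-(beta0 + x beta)})."""
    return 1 / (1 + np.exp(-(beta0 + x * beta)))

x = np.linspace(-10, 10, 2001)          # grid with step h = 0.01
beta0 = 0.0                             # inflection point at x = -beta0/beta = 0
for beta in [0.5, 1.0, 2.0]:
    y = f(x, beta0, beta)
    assert np.all(np.diff(y) > 0)       # monotonically increasing
    d2 = np.diff(y, 2)                  # discrete second derivative
    assert np.all(d2[:990] > 0)         # convex for x < -beta0/beta
    assert np.all(d2[1010:] < 0)        # concave for x > -beta0/beta

# slope at the inflection point is beta/4, so larger beta gives a steeper curve
slopes = [beta / 4 for beta in [0.5, 1.0, 2.0]]
assert slopes == sorted(slopes)
```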
  21.

    We wish to obtain the estimates of \(\beta _0\in {\mathbb R}\) and \(\ \beta \in {\mathbb R}^p\) by maximizing the likelihood \(\displaystyle \prod _{i=1}^N\frac{1}{1+e^{-y_i(\beta _0+x_i\beta )}}\), or equivalently, by minimizing the negated logarithm

    $$l(\beta _0,\beta )=\sum _{i=1}^N\log (1+v_i) ,\ v_i=e^{-y_i(\beta _0+x_i\beta )}$$

    from observations \((x_1,y_1),\ldots ,(x_N,y_N)\in {\mathbb R}^{p}\times \{-1,1\}\) (maximum likelihood). Show that \(l(\beta _0,\beta )\) is convex by obtaining the derivative \(\nabla l(\beta _0,\beta )\) and the second derivative \(\nabla ^2 l(\beta _0,\beta )\). Hint: Let \(\nabla l(\beta _0,\beta )\) and \(\nabla ^2 l(\beta _0,\beta )\) be the column vector of size \(p+1\) whose jth element is \(\displaystyle \frac{\partial l}{\partial \beta _j}\) and the matrix of size \((p+1)\times (p+1)\) whose (j, k)th element is \(\displaystyle \frac{\partial ^2 l}{\partial \beta _j\partial \beta _k}\), respectively. It suffices to show that the matrix is nonnegative definite. To this end, show that \(\nabla ^2 l(\beta _0,\beta )=X^TWX\), where \(W\in {\mathbb R}^{N\times N}\) is the diagonal matrix with diagonal elements \(v_i/(1+v_i)^2\ge 0\). Since W is diagonal with nonnegative entries, it can be written as \(W=U^TU\), where the diagonal elements of U are the square roots of those of W, which means \(\nabla ^2 l(\beta _0,\beta )=(UX)^TUX\).
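The formulas in the hint can be checked numerically. The sketch below (Python, with simulated data, as an illustration) compares \(\nabla l\) with a finite-difference approximation and confirms that \(\nabla ^2 l=X^TWX\) is nonnegative definite:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])  # design with intercept column
y = rng.choice([-1, 1], size=N)
beta = rng.normal(size=p + 1)

v = np.exp(-y * (X @ beta))
grad = -X.T @ (y * v / (1 + v))      # gradient: -X^T u with u_i = y_i v_i/(1+v_i)
W = np.diag(v / (1 + v) ** 2)        # diagonal with nonnegative entries
hess = X.T @ W @ X                   # Hessian: X^T W X

# nonnegative definite: all eigenvalues >= 0 (up to rounding)
assert np.all(np.linalg.eigvalsh(hess) > -1e-10)

# compare the gradient with central finite differences
def l(b):
    return np.sum(np.log(1 + np.exp(-y * (X @ b))))

eps = 1e-6
fd = np.array([(l(beta + eps * e) - l(beta - eps * e)) / (2 * eps)
               for e in np.eye(p + 1)])
assert np.allclose(grad, fd, atol=1e-4)
```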

  22.

    Solve the following equations via the Newton–Raphson method by constructing an R program.

    (a)

      For \(f(x)=x^2-1\), set \(x=2\) and repeat the recursion \(x\leftarrow x-f(x)/f'(x)\) 100 times.

    (b)

      For \(f(x,y)=x^2+y^2-1\), \(g(x,y)=x+y\), set \((x,y)=(1,2)\) and repeat the recursion 100 times.

      $$ \left[ \begin{array}{c} x\\ y\\ \end{array} \right] \leftarrow \left[ \begin{array}{c} x\\ y\\ \end{array} \right] -\left[ \begin{array}{c@{\quad }c} \displaystyle \frac{\partial f(x,y)}{\partial x}&{}\displaystyle \frac{\partial f(x,y)}{\partial y}\\[3mm] \displaystyle \frac{\partial g(x,y)}{\partial x}&{}\displaystyle \frac{\partial g(x,y)}{\partial y}\\ \end{array} \right] ^{-1} \left[ \begin{array}{c} f(x,y)\\ g(x,y)\\ \end{array} \right] $$

      Hint: Define the procedure and repeat it 100 times.

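The exercise asks for an R program; the same two recursions can be sketched in Python as follows (an illustration, not the book's listing):

```python
import numpy as np

# (a) one-dimensional: f(x) = x^2 - 1, starting from x = 2
x = 2.0
for _ in range(100):
    x = x - (x**2 - 1) / (2 * x)       # x <- x - f(x)/f'(x)
assert abs(x - 1.0) < 1e-12            # converges to the root x = 1

# (b) two-dimensional: f(x,y) = x^2 + y^2 - 1, g(x,y) = x + y, from (1, 2)
z = np.array([1.0, 2.0])
for _ in range(100):
    u, w = z
    J = np.array([[2 * u, 2 * w],      # Jacobian of (f, g)
                  [1.0,   1.0]])
    F = np.array([u**2 + w**2 - 1, u + w])
    z = z - np.linalg.solve(J, F)      # the matrix-inverse recursion above
assert abs(z[0]**2 + z[1]**2 - 1) < 1e-10 and abs(z[0] + z[1]) < 1e-10
```

From (1, 2) the iteration converges to \((-1/\sqrt{2},\,1/\sqrt{2})\), one of the two intersection points of the circle and the line.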
  23.

    We wish to solve \(\nabla l(\beta _0,\beta )=0\), \((\beta _0,\beta )\in {\mathbb R}\times {\mathbb R}^p\) in Problem 21 via the Newton–Raphson method using the recursion

    $$(\beta _0,\beta )\leftarrow (\beta _0,\beta )-\{\nabla ^2 l(\beta _0,\beta )\}^{-1}\nabla l(\beta _0,\beta )\ ,$$

    where \(\nabla f(v) \in {\mathbb R}^{p+1}\) and \(\nabla ^2 f(v) \in {\mathbb R}^{(p+1)\times (p+1)}\) are the vector whose ith element is \(\displaystyle \frac{\partial f}{\partial v_i}\) and the square matrix whose (i, j)th element is \(\displaystyle \frac{\partial ^2 f}{\partial v_i\partial v_j}\), respectively. In the following, for ease of notation, we write \((\beta _0,\beta )\in {\mathbb R}\times {\mathbb R}^p\) as \(\beta \in {\mathbb R}^{p+1}\). Show that the update rule can be written as follows:

    $$\begin{aligned} \beta _{ new}\leftarrow (X^TW X)^{-1}X^TW z\ , \end{aligned}$$

    where \(u \in {\mathbb R}^{N}\) is such that \(\nabla l(\beta _{ old})=-X^Tu\), \(W\in {\mathbb R}^{N\times N}\) is the diagonal matrix such that \(\nabla ^2 l(\beta _{ old})=X^TW X\), \(z\in {\mathbb R}^N\) is defined by \(z:=X\beta _{ old}+W^{-1}u\), and \(X^TWX\) is assumed to be nonsingular. Hint: The update rule can be written as \(\beta _{ new}\leftarrow \beta _{ old}+(X^TWX)^{-1}X^Tu\).

  24.

    We construct a procedure to solve Problem 23. Fill in blanks (1), (2), and (3), and check that the procedure works.

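The book's listing with blanks appears as a figure and is not reproduced here. One possible self-contained version of the update of Problem 23 is sketched below in Python with simulated data (names such as `logistic_irls` are ours, not the book's):

```python
import numpy as np

def logistic_irls(X, y, iters=20):
    """Newton-Raphson (IRLS) for logistic regression with y in {-1, 1}.

    X is N x (p+1) with a leading column of ones; returns beta in R^{p+1}.
    """
    N, q = X.shape
    beta = np.zeros(q)
    for _ in range(iters):
        v = np.exp(-y * (X @ beta))
        u = y * v / (1 + v)                 # gradient is -X^T u
        w = v / (1 + v) ** 2                # Hessian is X^T W X, W = diag(w)
        z = X @ beta + u / w                # working response z = X beta + W^{-1} u
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))
    return beta

# generate data from a known model and check the estimate is close
rng = np.random.default_rng(1)
N = 2000
x = rng.normal(size=(N, 1))
X = np.hstack([np.ones((N, 1)), x])
true_beta = np.array([0.5, 2.0])
prob = 1 / (1 + np.exp(-(X @ true_beta)))
y = np.where(rng.uniform(size=N) < prob, 1, -1)
beta_hat = logistic_irls(X, y)
assert np.allclose(beta_hat, true_beta, atol=0.3)
```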
  25.

    Suppose that there exist \(\beta _0\in {\mathbb R}\) and \(\beta \in {\mathbb R}^p\), not both zero, such that \(y_i(\beta _0+x_i\beta )\ge 0\) for all \((x_i,y_i)\in {\mathbb R}^{p}\times \{-1,1\},\ i=1,\ldots ,N\), i.e., the training data are linearly separable. Then, we cannot obtain the parameters of logistic regression via maximum likelihood. Why?
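A small numerical illustration (a Python sketch of our own construction): on separable data, scaling \((\beta _0,\beta )\) upward keeps decreasing the negated log-likelihood toward 0, so no maximizer exists.

```python
import numpy as np

# linearly separable data: y_i (beta0 + x_i beta) >= 0 holds with beta0 = 0, beta = 1
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1, -1, 1, 1])

def neg_log_lik(beta0, beta):
    return np.sum(np.log(1 + np.exp(-y * (beta0 + x * beta))))

# scaling (beta0, beta) = (0, c) upward drives -log-likelihood toward 0,
# so the likelihood increases without bound along this ray
vals = [neg_log_lik(0.0, c) for c in [1, 5, 10]]
assert vals[0] > vals[1] > vals[2] > 0
```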

  26.

    For \(p=1\), we wish to estimate the parameters of logistic regression from N/2 training data and to predict the responses of the N/2 test data that are not used as the training data. Fill in the blanks and execute the program.


    Hint: For prediction, see whether \(\beta _0+x\beta _1\) is positive or negative.
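A Python sketch of the whole experiment with simulated data (an illustration rather than the book's program), using the hint for prediction:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000
x = rng.normal(size=N)
y = np.where(rng.uniform(size=N) < 1 / (1 + np.exp(-(0.5 + 2 * x))), 1, -1)

# first half for training, second half for testing
x_tr, y_tr, x_te, y_te = x[:N//2], y[:N//2], x[N//2:], y[N//2:]

# fit (beta0, beta1) by Newton-Raphson on the training half
X = np.column_stack([np.ones(N // 2), x_tr])
beta = np.zeros(2)
for _ in range(20):
    v = np.exp(-y_tr * (X @ beta))
    u = y_tr * v / (1 + v)
    w = v / (1 + v) ** 2
    beta += np.linalg.solve(X.T @ (X * w[:, None]), X.T @ u)

# predict by the sign of beta0 + x * beta1 and measure the test error rate
pred = np.where(beta[0] + beta[1] * x_te > 0, 1, -1)
accuracy = np.mean(pred == y_te)
assert accuracy > 0.75
```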

  27.

    In linear discrimination, let \(\pi _k\) be the prior probability of \(Y=k\) for \(k=1,\ldots ,m\) (\(m\ge 2\)), and let \(f_k(x)\) be the probability density function of the p covariates \(x\in {\mathbb R}^p\) given response \(Y=k\) with mean \(\mu _k\in {\mathbb R}^p\) and covariance matrix \(\Sigma _k\in {\mathbb R}^{p\times p}\). We consider the set \(S_{k,l}\) of \(x\in {\mathbb R}^p\) such that

    $$\frac{\pi _k f_k(x)}{\displaystyle \sum _{j=1}^m\pi _j f_j(x)}=\frac{\pi _l f_l(x)}{\displaystyle \sum _{j=1}^m\pi _j f_j(x)}$$

    for \(k,l=1,\ldots , m, k\not =l.\)

    (a)

      Show that when \(\pi _k=\pi _l\), \(S_{k,l}\) is the set of \(x\in {\mathbb R}^p\) on the quadratic surface

      $$ -(x-\mu _k)^T\Sigma ^{-1}_k(x-\mu _k)+(x-\mu _l)^T\Sigma ^{-1}_l(x-\mu _l)=\log \frac{\det \Sigma _k}{\det \Sigma _l}\ . $$
    (b)

      Show that when \(\Sigma _k=\Sigma _l\) (\(=\Sigma \)), \(S_{k,l}\) is the set of \(x\in {\mathbb R}^p\) on the hyperplane \(a^Tx+b=0\) with \(a\in {\mathbb R}^p\) and \(b\in {\mathbb R}\), and express a, b using \(\mu _k,\mu _l,\Sigma ,\pi _k,\pi _l\).

    (c)

      When \(\pi _k=\pi _l\) and \(\Sigma _k=\Sigma _l\), show that the surface of (b) passes through the midpoint \(x=(\mu _k+\mu _l)/2\).
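The statements in (b) and (c) can be verified numerically. In the Python sketch below (with example values of \(\mu _k,\mu _l,\Sigma \) of our own choosing), a and b are taken as \(a=\Sigma ^{-1}(\mu _k-\mu _l)\) and \(b=-\frac{1}{2}(\mu _k^T\Sigma ^{-1}\mu _k-\mu _l^T\Sigma ^{-1}\mu _l)\), which is the equal-prior answer to (b):

```python
import numpy as np

def gauss(x, mu, Sigma):
    """Multivariate normal density at x."""
    p = len(mu)
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / \
           np.sqrt((2 * np.pi) ** p * np.linalg.det(Sigma))

mu_k, mu_l = np.array([1.0, 0.0]), np.array([-1.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

# (c) equal priors and covariances: the midpoint lies on the boundary S_{k,l}
mid = (mu_k + mu_l) / 2
assert abs(gauss(mid, mu_k, Sigma) - gauss(mid, mu_l, Sigma)) < 1e-12

# (b) linear boundary a^T x + b = 0
a = np.linalg.solve(Sigma, mu_k - mu_l)
b = -0.5 * (mu_k @ np.linalg.solve(Sigma, mu_k)
            - mu_l @ np.linalg.solve(Sigma, mu_l))
# any point on the hyperplane has equal densities under the two classes
x0 = mid + 0.7 * np.array([-a[1], a[0]]) / np.linalg.norm(a)
assert abs(a @ x0 + b) < 1e-12
assert abs(gauss(x0, mu_k, Sigma) - gauss(x0, mu_l, Sigma)) < 1e-12
```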

  28.

    In the following, we estimate the distributions of the two classes and draw the boundary on which the two posterior probabilities are equal, so that each side is assigned the class with the maximum posterior probability. If the covariance matrices are assumed to be equal, how does the boundary change? Modify the program.


    Hint: Modify the lines marked with #.
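The effect of assuming equal covariances can be seen in the following Python sketch (simulated two-class data of our own; not the book's R listing): replacing the per-class covariances by the pooled estimate turns the quadratic boundary into a linear one.

```python
import numpy as np

rng = np.random.default_rng(3)
# two classes with different covariance matrices
X1 = rng.multivariate_normal([1, 1], [[1.0, 0.3], [0.3, 1.0]], size=100)
X2 = rng.multivariate_normal([-1, -1], [[2.0, -0.5], [-0.5, 1.0]], size=100)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)

def log_disc(x, mu, S):
    """Log density of N(mu, S) at x, up to a common additive constant."""
    d = x - mu
    return -0.5 * d @ np.linalg.solve(S, d) - 0.5 * np.log(np.linalg.det(S))

# assuming equal covariances replaces S1, S2 by the pooled estimate,
# and the boundary g(x) = 0 becomes linear instead of quadratic
S_pool = ((len(X1) - 1) * S1 + (len(X2) - 1) * S2) / (len(X1) + len(X2) - 2)

def g(x):
    return log_disc(x, mu1, S_pool) - log_disc(x, mu2, S_pool)

assert g(mu1) > 0 > g(mu2)                   # each mean is on its own side
x0, h = np.array([0.0, 0.0]), np.array([0.37, 0.0])
second_diff = g(x0 + h) + g(x0 - h) - 2 * g(x0)
assert abs(second_diff) < 1e-10              # g is linear in x
```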

  29.

    Even when the response takes three or more values, we can select the class that maximizes the posterior probability. From the four covariates (sepal length, sepal width, petal length, petal width) of Fisher's iris data, we wish to identify the three types of irises (Setosa, Versicolor, and Virginica) via quadratic discrimination. Specifically, we learn rules from training data and evaluate them on test data. Assuming \(N=150\) and \(p=4\), each of the three iris types contains 50 samples, and the prior probabilities are assumed to be equal to 1/3. If we instead find that the prior probabilities of the Setosa, Versicolor, and Virginica irises are 0.5, 0.25, and 0.25, respectively, how should the program be changed to determine the maximum posterior probability?

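Since the iris data themselves are not bundled here, the Python sketch below substitutes three simulated Gaussian classes with 50 training samples each (an assumption; the structure of the computation is the same): the unequal priors enter the quadratic discrimination rule only through the additive term \(\log \pi _k\).

```python
import numpy as np

rng = np.random.default_rng(4)
p = 4                                     # four covariates, as in the iris data
means = [np.zeros(p), np.full(p, 2.0), np.full(p, -2.0)]
priors = [0.5, 0.25, 0.25]                # unequal prior probabilities

# training data: 50 samples per class (stand-ins for the three iris types)
train = [rng.normal(loc=m, scale=1.0, size=(50, p)) for m in means]
mu = [X.mean(axis=0) for X in train]
S = [np.cov(X, rowvar=False) for X in train]

def classify(x):
    """Maximize log pi_k + log f_k(x); the prior enters only as + log pi_k."""
    scores = [np.log(priors[k])
              - 0.5 * (x - mu[k]) @ np.linalg.solve(S[k], x - mu[k])
              - 0.5 * np.log(np.linalg.det(S[k]))
              for k in range(3)]
    return int(np.argmax(scores))

# evaluate on fresh test data drawn from the same classes
correct = total = 0
for k, m in enumerate(means):
    for x in rng.normal(loc=m, scale=1.0, size=(20, p)):
        correct += classify(x) == k
        total += 1
assert correct / total > 0.9
```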
  30.

    In the k-nearest neighbor method, we do not construct an explicit rule from the training data \((x_1,y_1),\ldots ,(x_N,y_N)\in {\mathbb R}^{p}\times \)(finite set). Given a new data point \(x_*\), let \(S\subseteq \{1,\ldots ,N\}\) be the set of indices of the k training data \(x_i\) whose distances to \(x_*\) are the smallest. The k-nearest neighbor method predicts the response \(y_*\) of \(x_*\) by majority vote among \(y_i\), \(i \in S\). If the vote is tied, we remove from S the index i whose \(x_i\) is farthest from \(x_*\) and vote again; once S contains exactly one element, the majority is determined. The following procedure assumes a single test data point, but the method extends to more than one. Then, apply the method to the data in Problem 29.

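The procedure above, including the tie-breaking rule, can be sketched in Python as follows (an illustration on toy data, standing in for the book's R listing):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_star, k):
    """k-nearest neighbor with the tie-break described above:
    while the vote is tied, drop the farthest of the remaining neighbors."""
    dist = np.linalg.norm(X_train - x_star, axis=1)
    S = list(np.argsort(dist)[:k])               # indices of the k nearest points
    while True:
        votes = Counter(y_train[i] for i in S)
        top = votes.most_common()
        if len(top) == 1 or top[0][1] > top[1][1]:
            return top[0][0]                     # unique majority found
        S.remove(max(S, key=lambda i: dist[i]))  # drop the farthest, revote

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0, 0, 1, 1])
assert knn_predict(X_train, y_train, np.array([0.2]), 3) == 0
# k = 4 gives a 2-2 tie; removing the farthest point (10.0) breaks it
assert knn_predict(X_train, y_train, np.array([0.2]), 4) == 0
```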
  31.

    Let \(f_1(x)\) and \(f_0(x)\) be the distributions of a measurement x for people with a disease and for those without it, respectively. For each positive \(\theta \), the decision that a person has the disease is made according to whether

    $$\frac{f_1 (x)}{f_0 (x)} \ge \theta \ .$$

    In the following, we suppose that the distributions of sick and healthy people are N(1, 1) and \(N(-1,1)\), respectively. Fill in the blank and draw the ROC curve.

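For these two normal distributions the likelihood ratio is \(f_1(x)/f_0(x)=e^{2x}\), which is increasing in x, so thresholding the ratio at \(\theta \) is equivalent to thresholding x at \(\log \theta /2\). A Python sketch tracing ROC points (illustration only; the book's program draws the curve in R):

```python
import math

def norm_sf(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# f1 = N(1,1) (sick), f0 = N(-1,1) (healthy);
# f1(x)/f0(x) = exp(2x) >= theta  <=>  x >= log(theta)/2
points = []
for theta in [0.1, 0.5, 1.0, 2.0, 10.0]:
    c = math.log(theta) / 2
    tpr = norm_sf(c - 1)      # true positive rate: P(X >= c), X ~ N(1, 1)
    fpr = norm_sf(c + 1)      # false positive rate: P(X >= c), X ~ N(-1, 1)
    points.append((fpr, tpr))

# the ROC curve lies above the diagonal, since the two distributions differ
for fpr, tpr in points:
    assert tpr > fpr
```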


Copyright information

© 2020 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Suzuki, J. (2020). Classification. In: Statistical Learning with Math and R. Springer, Singapore.


  • DOI: 10.1007/978-981-15-7568-6_3


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-7567-9

  • Online ISBN: 978-981-15-7568-6

  • eBook Packages: Computer Science, Computer Science (R0)