
Unsupervised Learning


Thus far, we have considered supervised learning from N observations \((x_1, y_1), \ldots, (x_N, y_N)\), where \(y_1, \ldots, y_N\) take either real values (regression) or a finite number of values (classification). In this chapter, we consider unsupervised learning, in which no such teacher exists, and the relations among the N samples and among the p variables are learned only from the covariates \(x_1, \ldots, x_N\). There are various types of unsupervised learning; in this chapter, we focus on clustering and principal component analysis. Clustering means dividing the samples \(x_1, \ldots, x_N\) into several groups (clusters). We consider K-means clustering, which requires the number of clusters K to be given in advance, and hierarchical clustering, which does not need such information. We also consider principal component analysis (PCA), a data analysis method often used in machine learning and multivariate analysis. For PCA, we consider another equivalent definition along with its mathematical meaning.

  • DOI: 10.1007/978-981-15-7877-9_10
  • Chapter length: 27 pages
Figs. 10.1–10.11


Appendix: Program

A program generates the dendrogram of hierarchical clustering. After obtaining the cluster object via the function hc, we compare the distances between consecutive clusters using the ordered sample y. Specifically, we express the positions of the branches by z[k, 1], z[k, 2], z[k, 3], z[k, 4], and z[k, 5].

Exercises 88–100

  1. 88.

    The following procedure divides N samples with p variables into K disjoint sets, given K (K-means clustering). We repeat the following two steps after randomly assigning one of 1, …, K to each sample:

    1. (a)

      Compute the centers of clusters k = 1, …, K.

    2. (b)

      Assign each of the N samples to the cluster whose center is nearest among the K clusters.

    Fill in the blanks and execute the procedure.
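The two steps of Problem 88 can be sketched as follows. This is a minimal illustration, not the book's fill-in-the-blank listing: the function name `k_means`, the parameters `n_iter`, `init_labels`, and `seed`, the zero-based cluster labels, and the handling of empty clusters (re-seeding the center at a random sample) are all assumptions made for the example.

```python
import numpy as np

def k_means(X, K, n_iter=10, init_labels=None, seed=0):
    """Alternate steps (a) and (b) for a fixed number of iterations.

    Clusters are labeled 0, ..., K-1 here (the text uses 1, ..., K).
    """
    X = np.asarray(X, dtype=float)
    N, p = X.shape
    rng = np.random.default_rng(seed)
    if init_labels is None:
        labels = rng.integers(0, K, size=N)      # random initial assignment
    else:
        labels = np.asarray(init_labels).copy()  # user-supplied initial state
    centers = np.zeros((K, p))
    for _ in range(n_iter):
        # Step (a): compute the center of each cluster.
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
            else:
                # Assumption: an empty cluster is re-seeded at a random sample.
                centers[k] = X[rng.integers(N)]
        # Step (b): assign each sample to the nearest center.
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = sq_dist.argmin(axis=1)
    return labels, centers

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels_demo, centers_demo = k_means(X_demo, K=2, init_labels=[0, 1, 0, 1])
print(labels_demo, centers_demo)
```

Passing `init_labels` makes a run reproducible by hand; omitting it gives the random initialization described in the exercise.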

  2. 89.

    The clusters that K-means clustering generates depend on the randomly chosen initial values. Run the procedure ten times with different initial values, and for each run record the sequence of score values obtained immediately after each two-step update. Display each sequence as a line in a single plot.

  3. 90.

    K-means clustering minimizes

    $$\displaystyle \begin{aligned}S:=\sum_{k=1}^K\frac{1}{|C_k|}\sum_{i\in C_k}\sum_{i'\in C_k}\sum_{j=1}^p(x_{i,j}-x_{i',j})^2\end{aligned}$$

    w.r.t. \(C_1, \ldots, C_K\) from data \(X = (x_{i,j})\).

    1. (a)

      Show the following equation:

      $$\displaystyle \begin{aligned}\frac{1}{|C_k|}\sum_{i\in C_k}\sum_{i'\in C_k}\sum_{j=1}^p(x_{i,j}-x_{i',j})^2=2\sum_{i\in C_k}\sum_{j=1}^p(x_{i,j}-\bar{x}_{k,j})^2.\end{aligned}$$
    2. (b)

      Show that the score S never increases (i.e., it is monotonically non-increasing) each time the two steps in Problem 88 are executed.

    3. (c)

      Let N = 3, p = 1, and K = 2, and assume that the samples are 0, 6, and 10. We consider two initial states: clusters 1 and 2 are assigned to {0, 6} and {10}, respectively, or clusters 1 and 2 are assigned to {0} and {6, 10}, respectively. To which clustering does the procedure converge from each initial state, and what score does it finally obtain?
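A numerical check can accompany Problem 90, although it does not replace the proofs. The sketch below computes S both from the pairwise definition and from the centered form in (a), then applies one two-step update to the two initial states in (c). The helper names `score_pairwise`, `score_centered`, and `two_step_update` are hypothetical, clusters are labeled 0 and 1, and the update assumes that no cluster is empty.

```python
import numpy as np

def score_pairwise(X, labels, K):
    """S from the definition: within-cluster pairwise squared distances."""
    total = 0.0
    for k in range(K):
        C = X[labels == k]
        if len(C) == 0:
            continue
        diff = C[:, None, :] - C[None, :, :]
        total += (diff ** 2).sum() / len(C)
    return total

def score_centered(X, labels, K):
    """S via the right-hand side of (a): twice the squared deviations."""
    total = 0.0
    for k in range(K):
        C = X[labels == k]
        if len(C) == 0:
            continue
        total += 2.0 * ((C - C.mean(axis=0)) ** 2).sum()
    return total

def two_step_update(X, labels, K):
    """One pass of steps (a) and (b); assumes every cluster is non-empty."""
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

X = np.array([[0.0], [6.0], [10.0]])
for init in ([0, 0, 1], [0, 1, 1]):
    labels = np.array(init)
    updated = two_step_update(X, labels, 2)
    print(init, updated.tolist(),
          score_pairwise(X, updated, 2), score_centered(X, updated, 2))
```

In this example both initial states are left unchanged by the update, with scores 36 and 16 respectively, which illustrates that the procedure can stop at a clustering that is not the global minimizer of S.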

  4. 91.

    Given matrices x and y composed of multiple rows extracted from \(X\in {\mathbb R}^{N\times p}\), write Python code for the functions dist_complete, dist_single, dist_centroid, and dist_average, which return, respectively, the maximum distance between the rows of x and y, the minimum distance between the rows of x and y, the distance between the centers of x and y, and the average distance between the rows of x and y.
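One possible shape for the four functions in Problem 91 is sketched below. It assumes Euclidean distance between rows, which the exercise statement does not fix, and the helper `_pairwise` is a hypothetical name introduced only for this example.

```python
import numpy as np

def _pairwise(x, y):
    """Matrix of Euclidean distances between every row of x and every row of y."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    y = np.atleast_2d(np.asarray(y, dtype=float))
    return np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2))

def dist_complete(x, y):
    """Maximum distance between a row of x and a row of y."""
    return _pairwise(x, y).max()

def dist_single(x, y):
    """Minimum distance between a row of x and a row of y."""
    return _pairwise(x, y).min()

def dist_centroid(x, y):
    """Distance between the centers (row means) of x and y."""
    cx = np.atleast_2d(np.asarray(x, dtype=float)).mean(axis=0)
    cy = np.atleast_2d(np.asarray(y, dtype=float)).mean(axis=0)
    return np.linalg.norm(cx - cy)

def dist_average(x, y):
    """Average distance over all pairs (row of x, row of y)."""
    return _pairwise(x, y).mean()

x = np.array([[0.0, 0.0], [0.0, 2.0]])
y = np.array([[3.0, 0.0], [3.0, 2.0]])
print(dist_complete(x, y), dist_single(x, y),
      dist_centroid(x, y), dist_average(x, y))
```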

  5. 92.

    The following procedure executes hierarchical clustering w.r.t. data \(x_1,\ldots ,x_N \in {\mathbb R}^p\). Initially, each cluster contains exactly one sample. We merge the clusters to obtain a clustering with any number K of clusters. Fill in the blanks and execute the procedure.
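The merging procedure of Problem 92 can be sketched as a naive loop that repeatedly joins the two closest clusters until K remain. This is an illustration under stated assumptions rather than the book's listing: the names `hierarchical` and `complete_linkage` are hypothetical, Euclidean distance with complete linkage is used as the default, and ties are broken in favor of the first pair found.

```python
import numpy as np

def complete_linkage(x, y):
    """Maximum Euclidean distance between a row of x and a row of y."""
    d = np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2))
    return d.max()

def hierarchical(X, K, dist=complete_linkage):
    """Merge clusters until K remain; returns lists of sample indices."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(X.shape[0])]  # one sample per cluster
    while len(clusters) > K:
        best = None
        # Search all pairs of clusters for the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist(X[clusters[a]], X[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

X_demo = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [20, 0]])
print(hierarchical(X_demo, K=3))
```

Any function with the same two-argument signature can be passed as `dist` to switch the linkage.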

  6. 93.

    In hierarchical clustering, if we use centroid linkage, which connects the clusters with the smallest value of dist_centroid, inversion may occur, i.e., clusters with a smaller distance can be connected later. Explain the phenomenon for the case (0, 0), (5, 8), (9, 0) with N = 3 and p = 2.
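The inversion in Problem 93 can be observed numerically. The sketch below, assuming Euclidean distance between cluster centers, runs centroid linkage on the three points and records the distance at which each merge happens; the variable names are chosen only for this example.

```python
import numpy as np

points = np.array([[0.0, 0.0], [5.0, 8.0], [9.0, 0.0]])

def dist_centroid(x, y):
    """Euclidean distance between the centers of two clusters."""
    return np.linalg.norm(x.mean(axis=0) - y.mean(axis=0))

clusters = [[0], [1], [2]]
merge_distances = []
while len(clusters) > 1:
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = dist_centroid(points[clusters[a]], points[clusters[b]])
            if best is None or d < best[0]:
                best = (d, a, b)
    d, a, b = best
    merge_distances.append(d)                # distance at this merge
    clusters[a] = clusters[a] + clusters[b]
    del clusters[b]

print(merge_distances)
```

The first merge joins (5, 8) and (9, 0) at distance \(\sqrt{80}\); their center (7, 4) then lies at distance \(\sqrt{65}\) from (0, 0), so the second merge happens at a smaller distance than the first.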

  7. 94.

    Let \(\Sigma = X^TX/N\) for \(X\in {\mathbb R}^{N\times p}\), and let \(\lambda_i\) be the i-th largest eigenvalue of \(\Sigma\).

    1. (a)

      Show that the \(\phi\) that maximizes \(\|X\phi\|^2\) among \(\phi \in {\mathbb R}^p\) with \(\|\phi\| = 1\) satisfies \(\Sigma\phi = \lambda_1\phi\).

    2. (b)

      Show that \(\phi_1, \ldots, \phi_m\) such that \(\Sigma\phi_1 = \lambda_1\phi_1, \ldots, \Sigma\phi_m = \lambda_m\phi_m\) are orthogonal when \(\lambda_1 > \cdots > \lambda_m\).

  8. 95.

    Using the np.linalg.eig function in the Python language, write a Python program pca that outputs the average of the p columns, the eigenvalues \(\lambda_1, \ldots, \lambda_p\), and the matrix that consists of \(\phi_1, \ldots, \phi_p\), given input \(X\in {\mathbb R}^{N\times p}\). Moreover, execute the following to show that the results obtained via PCA in sklearn.decomposition coincide:
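A function of the kind Problem 95 asks for might look like the sketch below. It is an illustration, not the book's solution: the dictionary keys `lam`, `vectors`, and `centers` are assumed names, the columns are centered before forming \(\Sigma = X^TX/N\), and the comparison with sklearn.decomposition is left out.

```python
import numpy as np

def pca(X):
    """Column means, eigenvalues (descending), and eigenvectors of Sigma."""
    X = np.asarray(X, dtype=float)
    centers = X.mean(axis=0)              # average of each of the p columns
    Xc = X - centers                      # centered data
    Sigma = Xc.T @ Xc / X.shape[0]        # Sigma = X^T X / N
    lam, phi = np.linalg.eig(Sigma)
    lam, phi = lam.real, phi.real         # Sigma is symmetric, so these are real
    order = np.argsort(lam)[::-1]         # sort eigenvalues in descending order
    return {"lam": lam[order], "vectors": phi[:, order], "centers": centers}

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3)) @ np.diag([3.0, 2.0, 1.0])
result = pca(X_demo)
print(result["lam"])
print(result["vectors"])
```

When comparing against another PCA implementation, note that each eigenvector is determined only up to sign, and that implementations dividing by N − 1 instead of N (as sklearn's explained_variance_ does) report eigenvalues scaled by N/(N − 1).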

  9. 96.

    The following procedure produces the first and second principal component vectors \(\phi_1\) and \(\phi_2\) from N samples \((x_1, y_1), \ldots, (x_N, y_N)\). Fill in the blanks and execute it.


    Moreover, show that the product of the slopes is −1.

  10. 97.

    There is another equivalent definition of PCA. Suppose that we have centered the matrix \(X\in {\mathbb R}^{N\times p}\), and let \(x_i\) be the i-th row vector of X and \(\Phi \in {\mathbb R}^{p\times m}\) be the matrix whose columns are the mutually orthogonal unit vectors \(\phi_1, \ldots, \phi_m\). Then, we obtain the projections \(z_1=x_1\Phi ,\ldots ,z_N=x_N\Phi \in {\mathbb R}^m\) of \(x_1, \ldots, x_N\) onto \(\phi_1, \ldots, \phi_m\). Multiplying \(z_1, \ldots, z_N\) by \(\Phi^T\) from the right gives the reconstructions \(x_i\Phi\Phi^T\), and we evaluate how well \(x_1, \ldots, x_N\) are recovered by \(L:=\sum _{i=1}^N \|x_i-x_i\Phi \Phi ^T\|^2\). We can regard PCA as the problem of finding \(\phi_1, \ldots, \phi_m\) that minimize L. Show the following two equations:

    $$\displaystyle \begin{aligned} \begin{array}{rcl} & \displaystyle \sum_{i=1}^N\|x_i-x_i\Phi\Phi^T\|{}^2=\sum_{i=1}^N\|x_i\|{}^2-\sum_{i=1}^N\|x_i\Phi\|{}^2 \\ & \displaystyle \sum_{i=1}^N\|x_i\Phi\|{}^2=\sum_{j=1}^m\|X \phi_j\|{}^2. \end{array} \end{aligned} $$
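The two equations of Problem 97 hold for any \(\Phi\) with orthonormal columns, which can be checked numerically before attempting the proof. The sketch below uses randomly generated data and a random orthonormal \(\Phi\) obtained from a QR decomposition; the sizes and variable names are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, m = 50, 4, 2

X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)                         # centered data matrix

# Phi: p x m matrix with orthonormal columns (reduced QR of a random matrix).
Phi, _ = np.linalg.qr(rng.normal(size=(p, m)))

# First equation: reconstruction error = total norm - projected norm.
loss = ((X - X @ Phi @ Phi.T) ** 2).sum()
total = (X ** 2).sum()
projected = ((X @ Phi) ** 2).sum()

# Second equation: projected norm, summed column by column.
by_columns = sum(np.linalg.norm(X @ Phi[:, j]) ** 2 for j in range(m))

print(loss, total - projected)
print(projected, by_columns)
```

Because \(\sum_i \|x_i\|^2\) does not depend on \(\Phi\), the first equation shows that minimizing L is the same as maximizing \(\sum_j \|X\phi_j\|^2\).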
  11. 98.

    We prepare a dataset containing the numbers of arrests for four crimes in all fifty states.

    Fill in the blanks and execute the following code:

  12. 99.

    The proportion of the k-th component and the cumulative proportion up to the m-th component are defined by \(\displaystyle \frac {\lambda _k}{\sum _{j=1}^p\lambda _j}\) for each 1 ≤ k ≤ p and \(\displaystyle \frac {\sum _{k=1}^m\lambda _k}{\sum _{j=1}^p\lambda _j}\) for each 1 ≤ m ≤ p, respectively. Fill in the blanks and draw the graph.
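The two quantities in Problem 99 follow directly from the eigenvalues. A minimal sketch, with the function name `variance_proportions` and the example eigenvalues chosen purely for illustration (plotting is omitted):

```python
import numpy as np

def variance_proportions(lam):
    """Return (proportion, cumulative proportion) for eigenvalues lam."""
    lam = np.asarray(lam, dtype=float)
    proportion = lam / lam.sum()        # lambda_k / sum_j lambda_j
    cumulative = np.cumsum(proportion)  # sum_{k<=m} lambda_k / sum_j lambda_j
    return proportion, cumulative

# Hypothetical eigenvalues, already sorted in descending order.
prop, cum = variance_proportions([4.0, 3.0, 2.0, 1.0])
print(prop)
print(cum)
```

The cumulative proportion always ends at 1 for m = p, so the graph is usually read to find the smallest m that reaches a chosen fraction of the total variance.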

  13. 100.

    In addition to PCA and linear regression, we may use principal component regression: find the matrix \(Z=X\Phi \in {\mathbb R}^{N\times m}\) that consists of the m principal components obtained via PCA, find \(\theta \in {\mathbb R}^m\) that minimizes \(\|y - Z\theta\|^2\), and display via \(\hat {\theta }\) the relation between the response and the m components (a replacement of the p covariates). Principal component regression regresses y on the columns of Z instead of those of X.

    Show that \(\Phi \hat {\theta }\) and \(\beta = (X^TX)^{-1}X^Ty\) coincide for m = p. Moreover, fill in the blanks and execute it.

    Hint: Because \(\min_\beta \|y - X\beta\|^2 \le \min_\theta \|y - X\Phi\theta\|^2 = \min_\theta \|y - Z\theta\|^2\), it is sufficient to show that there exists \(\theta\) such that \(\beta = \Phi\theta\) for an arbitrary \(\beta \in {\mathbb R}^p\) when p = m.
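The claim of Problem 100 can be checked numerically on synthetic data. The sketch below is an illustration under stated assumptions, not the book's listing: the function name `pcr`, the simulated data, and the use of np.linalg.eigh (suited to the symmetric matrix \(X^TX/N\)) are choices made for this example, and no intercept is fitted because X and y are centered.

```python
import numpy as np

def pcr(X, y, m):
    """Principal component regression on centered X and y with m components."""
    Sigma = X.T @ X / X.shape[0]
    lam, vectors = np.linalg.eigh(Sigma)         # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:m]            # indices of the m largest
    Phi = vectors[:, order]                      # p x m
    Z = X @ Phi                                  # N x m principal components
    theta = np.linalg.solve(Z.T @ Z, Z.T @ y)    # least squares of y on Z
    return Phi, theta

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)
y = y - y.mean()

beta = np.linalg.solve(X.T @ X, X.T @ y)         # ordinary least squares
Phi, theta = pcr(X, y, m=p)
print(beta)
print(Phi @ theta)                               # agrees with beta when m = p
```

For m < p the two vectors generally differ, since the regression is then restricted to the span of the first m principal component vectors.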

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Suzuki, J. (2021). Unsupervised Learning. In: Statistical Learning with Math and Python. Springer, Singapore.

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-7876-2

  • Online ISBN: 978-981-15-7877-9

  • eBook Packages: Computer Science, Computer Science (R0)