Abstract
Thus far, we have considered supervised learning from N observations (x _{1}, y _{1}), …, (x _{N}, y _{N}), where y _{1}, …, y _{N} take either real values (regression) or a finite number of values (classification). In this chapter, we consider unsupervised learning, in which no such teacher exists, and the relations among the N samples and among the p variables are learned only from the covariates x _{1}, …, x _{N}. There are various types of unsupervised learning; in this chapter, we focus on clustering and principal component analysis. Clustering means dividing the samples x _{1}, …, x _{N} into several groups (clusters). We consider K-means clustering, which requires us to give the number of clusters K in advance, and hierarchical clustering, which does not need such information. We also consider principal component analysis (PCA), a data analysis method that is often used in machine learning and multivariate analysis. For PCA, we consider another equivalent definition along with its mathematical meaning.
Appendices
Appendix: Program
The following program generates the dendrogram of hierarchical clustering. After obtaining the cluster object via the function hc, we compare the distances between consecutive merges using the ordered samples y. Specifically, we express the positions of the branches by z[k, 1], z[k, 2], z[k, 3], z[k, 4], and z[k, 5].
Exercises 88–100

88.
The following procedure divides N samples with p variables into K disjoint sets for a given K (K-means clustering). After randomly assigning one of 1, …, K to each sample, we repeat the following two steps:

(a)
Compute the centers of clusters k = 1, …, K.

(b)
Assign each of the N samples to the cluster with the nearest center among the K centers.
Fill in the blanks and execute the procedure.
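A minimal NumPy sketch of the two-step procedure (not the book's blank-filled code; the reseeding of empty clusters is an added safeguard):

```python
import numpy as np

def k_means(X, K, n_iter=50, seed=0):
    """Alternate (a) center update and (b) nearest-center assignment."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    labels = rng.integers(0, K, size=N)          # random initial assignment
    centers = np.zeros((K, p))
    for _ in range(n_iter):
        for k in range(K):                       # (a) centers of clusters 1..K
            members = X[labels == k]
            # reseed an empty cluster with a random sample (safeguard)
            centers[k] = members.mean(axis=0) if len(members) else X[rng.integers(N)]
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)           # (b) nearest center
        if np.array_equal(new_labels, labels):   # converged
            break
        labels = new_labels
    return centers, labels
```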

89.
The clusters that K-means clustering generates depend on the randomly chosen initial values. Repeat the procedure ten times and record the sequence of score values obtained immediately after each two-step update. Display the ten transitions as line graphs in the same plot.
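A sketch that records the score after each two-step update instead of plotting (the line graphs themselves can then be drawn from path, e.g., with matplotlib):

```python
import numpy as np

def score(X, labels, K):
    """Within-cluster sum of squares around the centroids."""
    s = 0.0
    for k in range(K):
        members = X[labels == k]
        if len(members):
            s += ((members - members.mean(axis=0)) ** 2).sum()
    return s

def score_path(X, K, n_iter=10, seed=0):
    """Score after each two-step update, from one random initialization."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    labels = rng.integers(0, K, size=N)
    centers = np.zeros((K, p))
    path = []
    for _ in range(n_iter):
        for k in range(K):                       # (a) center update
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)               # (b) reassignment
        path.append(score(X, labels, K))
    return path
```

By the argument of Problem 90(b), each recorded sequence should be nonincreasing.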

90.
K-means clustering minimizes
$$\displaystyle \begin{aligned}S:=\sum_{k=1}^K\frac{1}{|C_k|}\sum_{i\in C_k}\sum_{i'\in C_k}\sum_{j=1}^p(x_{i,j}-x_{i',j})^2\end{aligned}$$w.r.t. C _{1}, …, C _{K} from data X = (x _{i,j}), where |C _{k}| denotes the number of samples in cluster C _{k}.

(a)
Show the following equation:
$$\displaystyle \begin{aligned}\frac{1}{|C_k|}\sum_{i\in C_k}\sum_{i'\in C_k}\sum_{j=1}^p(x_{i,j}-x_{i',j})^2=2\sum_{i\in C_k}\sum_{j=1}^p(x_{i,j}-\bar{x}_{k,j})^2.\end{aligned}$$ 
(b)
Show that the score S is monotonically nonincreasing each time the two steps of Problem 88 are executed.

(c)
Let N = 3, p = 1, and K = 2, and assume that the samples are at 0, 6, and 10. We consider two initial assignments: in the first, clusters one and two contain {0, 6} and {10}, respectively; in the second, they contain {0} and {6, 10}, respectively. To what states does the procedure converge from each of the two initial states, and what scores are finally obtained?
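Both assignments are fixed points of the two-step update (the centers become 3, 10 and 0, 8, respectively, and no sample changes cluster), so the final scores can be checked directly from the definition of S:

```python
def S(clusters):
    """Score from Problem 90 for one-dimensional data."""
    total = 0.0
    for C in clusters:
        total += sum((a - b) ** 2 for a in C for b in C) / len(C)
    return total

S1 = S([[0, 6], [10]])   # assignment {0, 6} | {10}: centers 3 and 10
S2 = S([[0], [6, 10]])   # assignment {0} | {6, 10}: centers 0 and 8
```

Depending on the initial state, K-means thus converges to a score of 36 or of 16; only the latter is the global minimum, which illustrates the dependence on initial values.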

91.
Write Python functions dist_complete, dist_single, dist_centroid, and dist_average that, given matrices x and y composed of multiple rows extracted from \(X\in {\mathbb R}^{N\times p}\), return the maximum distance between the rows of x and y, the minimum distance between the rows of x and y, the distance between the centers of x and y, and the average distance between the rows of x and y, respectively.
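One possible implementation, assuming Euclidean distance between rows (the book may use a different convention):

```python
import numpy as np

def _pairwise(x, y):
    """Euclidean distances between every row of x and every row of y."""
    return np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2))

def dist_complete(x, y):
    return _pairwise(x, y).max()

def dist_single(x, y):
    return _pairwise(x, y).min()

def dist_centroid(x, y):
    return np.sqrt(((x.mean(axis=0) - y.mean(axis=0)) ** 2).sum())

def dist_average(x, y):
    return _pairwise(x, y).mean()
```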

92.
The following procedure executes hierarchical clustering w.r.t. data \(x_1,\ldots ,x_N \in {\mathbb R}^p\). Initially, each cluster contains exactly one sample. We merge the clusters to obtain a clustering with any number K of clusters. Fill in the blanks and execute the procedure.
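A minimal greedy sketch of the idea (not the book's blank-filled procedure; complete linkage is assumed as the default):

```python
import numpy as np

def dist_complete(x, y):
    """Maximum Euclidean distance between rows of x and rows of y."""
    return np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)).max()

def hc(X, K, linkage=dist_complete):
    """Start from singletons; merge the closest pair until K clusters remain."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > K:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(X[clusters[a]], X[clusters[b]])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]
    return clusters
```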

93.
In hierarchical clustering, if we use centroid linkage, which connects the clusters with the smallest value of dist_centroid, an inversion may occur, i.e., a later merge can take place at a smaller distance than an earlier one. Explain this phenomenon for the case (0, 0), (5, 8), (9, 0) with N = 3 and p = 2.
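A numeric check of the inversion for these three points:

```python
import numpy as np

A, B, C = np.array([0.0, 0.0]), np.array([5.0, 8.0]), np.array([9.0, 0.0])
d = lambda u, v: np.sqrt(((u - v) ** 2).sum())

# centroid distances among the three singleton clusters
d_AB, d_AC, d_BC = d(A, B), d(A, C), d(B, C)   # sqrt(89), 9, sqrt(80)

# B and C are merged first (smallest distance); their centroid is (7, 4)
d_A_BC = d(A, (B + C) / 2)                      # sqrt(65) < d_BC: inversion
```

The second merge occurs at distance √65 ≈ 8.06, which is smaller than the first merge distance √80 ≈ 8.94, so the dendrogram branches out of order.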

94.
Let Σ := X ^{T} X∕N for \(X\in {\mathbb R}^{N\times p}\), and let λ _{i} be the ith largest eigenvalue of Σ.

(a)
Show that the ϕ that maximizes ∥Xϕ∥^{2} among \(\phi \in {\mathbb R}^p\) with ∥ϕ∥ = 1 satisfies Σϕ = λ _{1} ϕ.

(b)
Show that ϕ _{1}, …, ϕ _{m} such that Σϕ _{1} = λ _{1} ϕ _{1}, …, Σϕ _{m} = λ _{m} ϕ _{m} are mutually orthogonal when λ _{1} > ⋯ > λ _{m}.
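One way to see the two claims (a sketch, not the book's solution). For (a), since ∥Xϕ∥^{2} = Nϕ^{T}Σϕ, the Lagrangian of the constrained maximization is
$$\displaystyle \begin{aligned}N\phi^T\Sigma\phi-\mu(\phi^T\phi-1),\end{aligned}$$and the stationarity condition 2NΣϕ − 2μϕ = 0 shows that any maximizer is an eigenvector of Σ; the objective then equals N times the corresponding eigenvalue, so the maximum is attained at λ _{1}, i.e., Σϕ = λ _{1} ϕ. For (b), if i ≠ j, the symmetry of Σ gives
$$\displaystyle \begin{aligned}\lambda_i\phi_i^T\phi_j=(\Sigma\phi_i)^T\phi_j=\phi_i^T\Sigma\phi_j=\lambda_j\phi_i^T\phi_j,\end{aligned}$$so (λ _{i} − λ _{j})ϕ _{i}^{T} ϕ _{j} = 0, and ϕ _{i}^{T} ϕ _{j} = 0 because λ _{i} ≠ λ _{j}.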

95.
Using the np.linalg.eig function in the Python language, write a Python program pca that outputs the average of the p columns, the eigenvalues λ _{1}, …, λ _{p}, and the matrix that consists of ϕ _{1}, …, ϕ _{p}, given input \(X\in {\mathbb R}^{N\times p}\). Moreover, execute the following to confirm that the results coincide with those obtained via PCA in sklearn.decomposition:
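One possible implementation (a sketch: the output of np.linalg.eig is sorted into decreasing order, and real parts are taken since Σ is symmetric):

```python
import numpy as np

def pca(X):
    """Column means, eigenvalues (descending), and eigenvectors of Sigma."""
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / len(X)
    lam, Phi = np.linalg.eig(Sigma)
    lam, Phi = lam.real, Phi.real            # Sigma is symmetric, so these are real
    idx = np.argsort(lam)[::-1]              # decreasing eigenvalue order
    return mu, lam[idx], Phi[:, idx]
```

When comparing with sklearn.decomposition.PCA, note that the eigenvectors should match the rows of components_ only up to sign, and that sklearn's explained_variance_ divides by N − 1 rather than N, so the eigenvalues should agree up to the factor N∕(N − 1).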

96.
The following procedure produces the first and second principal component vectors ϕ _{1} and ϕ _{2} from N samples (x _{1}, y _{1}), …, (x _{N}, y _{N}). Fill in the blanks and execute it.
Moreover, show that the product of the slopes of the two components is − 1.
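Since p = 2 and ϕ _{1} ⊥ ϕ _{2}, the slopes of the two direction vectors must multiply to − 1. A small check on synthetic data (the data and covariance below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.3], [0.3, 1.0]])
Xc = X - X.mean(axis=0)
lam, Phi = np.linalg.eigh(Xc.T @ Xc / len(X))   # eigh: ascending eigenvalues
phi2, phi1 = Phi[:, 0], Phi[:, 1]               # phi1 = first principal component
slope1 = phi1[1] / phi1[0]
slope2 = phi2[1] / phi2[0]
```

Orthogonality gives ϕ _{1,1} ϕ _{2,1} + ϕ _{1,2} ϕ _{2,2} = 0, from which the product of the slopes equals − 1 whenever neither first coordinate vanishes.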

97.
There is another equivalent definition of PCA. Suppose that the matrix \(X\in {\mathbb R}^{N\times p}\) has been centralized, let x _{i} be the ith row vector of X, and let \(\Phi \in {\mathbb R}^{p\times m}\) be the matrix whose columns are the mutually orthogonal unit vectors ϕ _{1}, …, ϕ _{m}. Then, we can obtain the projections \(z_1=x_1\Phi ,\ldots ,z_N=x_N\Phi \in {\mathbb R}^m\) of x _{1}, …, x _{N} onto ϕ _{1}, …, ϕ _{m}. We evaluate how well x _{1}, …, x _{N} are recovered by \(L:=\sum _{i=1}^N \|x_i-x_i\Phi \Phi ^T\|^2\), where the reconstruction x _{i}ΦΦ^{T} is obtained by multiplying z _{i} by Φ^{T} from the right. We can regard PCA as the problem of finding ϕ _{1}, …, ϕ _{m} that minimize this value. Show the two equations:
$$\displaystyle \begin{aligned} \begin{array}{rcl} & \displaystyle \sum_{i=1}^N\|x_i-x_i\Phi\Phi^T\|^2=\sum_{i=1}^N\|x_i\|^2-\sum_{i=1}^N\|x_i\Phi\|^2 \\ & \displaystyle \sum_{i=1}^N\|x_i\Phi\|^2=\sum_{j=1}^m\|X \phi_j\|^2. \end{array} \end{aligned} $$ 
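A numeric sanity check of the two identities, with a random centered X and a Φ with orthonormal columns obtained via QR (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, m = 50, 5, 2
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                               # centralize
Phi, _ = np.linalg.qr(rng.normal(size=(p, m)))    # orthonormal columns

L = ((X - X @ Phi @ Phi.T) ** 2).sum()            # reconstruction error
rhs1 = (X ** 2).sum() - ((X @ Phi) ** 2).sum()    # first identity
per_column = sum(((X @ Phi[:, j]) ** 2).sum() for j in range(m))
```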
98.
We prepare a dataset containing the numbers of arrests for four crimes in all fifty states.
Fill in the blanks and execute the following code:

99.
The proportion and the accumulated proportion are defined by \(\displaystyle \frac {\lambda _k}{\sum _{j=1}^p\lambda _j}\) and \(\displaystyle \frac {\sum _{k=1}^m\lambda _k}{\sum _{j=1}^p\lambda _j}\) for each 1 ≤ m ≤ p. Fill in the blanks and draw the graph.
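With the eigenvalues λ _{1} ≥ ⋯ ≥ λ _{p} in hand, both quantities are one line of NumPy each (the eigenvalues below are made up):

```python
import numpy as np

lam = np.array([4.0, 3.0, 2.0, 1.0])     # example eigenvalues (hypothetical)
prop = lam / lam.sum()                   # proportion for each component
cumprop = np.cumsum(lam) / lam.sum()     # accumulated proportion up to m
```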

100.
In addition to PCA and linear regression, we may use principal component regression: find the matrix \(Z=X\Phi \in {\mathbb R}^{N\times m}\) that consists of the m principal components obtained via PCA, find \(\theta \in {\mathbb R}^m\) that minimizes ∥y − Zθ∥^{2}, and display via \(\hat {\theta }\) the relation between the response and the m components (a replacement of the p covariates). Principal component regression regresses y on the columns of Z instead of those of X.
Show that \(\Phi \hat {\theta }\) and β = (X ^{T} X)^{−1} X ^{T} y coincide for m = p. Moreover, fill in the blanks and execute it.
Hint: Because min_{β}∥y − Xβ∥^{2} ≤min_{θ}∥y − X Φθ∥^{2} =min_{θ}∥y − Zθ∥^{2}, it is sufficient to show that there exists θ such that β = Φθ for an arbitrary \(\beta \in {\mathbb R}^p\) when p = m.
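A numeric check of the m = p case on synthetic data (the data are made up; np.linalg.solve plays the role of the least-squares computations):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 60, 3
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)
y = rng.normal(size=N)

lam, Phi = np.linalg.eigh(X.T @ X / N)        # Phi: all p principal directions
Z = X @ Phi                                    # principal component scores (m = p)
theta = np.linalg.solve(Z.T @ Z, Z.T @ y)      # least squares of y on Z
beta = np.linalg.solve(X.T @ X, X.T @ y)       # ordinary least squares of y on X
```

Since Φ is orthogonal when m = p, Φθ̂ = Φ(Φ^{T}X^{T}XΦ)^{−1}Φ^{T}X^{T}y = (X^{T}X)^{−1}X^{T}y = β, which the assertion below confirms numerically.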
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Suzuki, J. (2021). Unsupervised Learning. In: Statistical Learning with Math and Python. Springer, Singapore. https://doi.org/10.1007/978-981-15-7877-9_10
Print ISBN: 978-981-15-7876-2
Online ISBN: 978-981-15-7877-9