## Abstract

We find that different Deep Neural Networks (DNNs) trained on the same dataset share a common principal subspace in their latent spaces, regardless of the architecture (e.g., Convolutional Neural Networks (CNNs), Multi-Layer Perceptrons (MLPs), and Autoencoders (AEs)) or of whether labels were used in training (e.g., supervised, unsupervised, and self-supervised learning). Specifically, we design a new metric, the \({\mathcal {P}}\)-vector, to represent the principal subspace of the deep features learned by a DNN, and propose to measure the angles between principal subspaces using \({\mathcal {P}}\)-vectors. Small angles (with cosine close to 1.0) are found in comparisons between any two DNNs trained with different algorithms or architectures. Furthermore, during training from a random initialization, the angle decreases from a large value (usually 70°–80°) to a small one, which coincides with the progress of feature-space learning from scratch to convergence. We then carry out case studies to measure the angle between the \({\mathcal {P}}\)-vector and the principal subspace of the training dataset, and connect this angle with generalization performance. Extensive experiments with commonly used MLPs, AEs, and CNNs for classification, image reconstruction, and self-supervised learning tasks on the MNIST, CIFAR-10, and CIFAR-100 datasets provide solid evidence for our claims.

All models and datasets are obtained from open-source contributions.

Code will be open-sourced upon acceptance of the paper.

#samples and #features refer to the numbers of samples and features respectively.

We use the term “model \({\mathcal {P}}\)-vector” for a \({\mathcal {P}}\)-vector estimated from the feature vectors of a deep model, and “data \({\mathcal {P}}\)-vector” for the top left singular vector of the raw data matrix.
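As a minimal sketch of these definitions, the following NumPy snippet computes a \({\mathcal {P}}\)-vector and the angle between two of them. It assumes that feature matrices are laid out as #samples × #features and that the angle is taken up to the sign ambiguity of singular vectors; both are assumptions of this illustration, and the function names `p_vector` and `angle_degrees` are hypothetical rather than taken from the authors' code.

```python
import numpy as np

def p_vector(features):
    """Top left singular vector of a (#samples x #features) matrix."""
    X = np.asarray(features, dtype=np.float64)
    # Thin SVD: U has shape (#samples, min(#samples, #features)).
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, 0]

def angle_degrees(p, q):
    """Angle between two vectors, ignoring sign (assumed convention)."""
    cos = abs(np.dot(p, q)) / (np.linalg.norm(p) * np.linalg.norm(q))
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))

# Toy data: two "models" with different feature widths over the same 100 samples.
rng = np.random.default_rng(0)
shared = rng.normal(size=(100, 1))  # common per-sample structure
feats_a = shared @ rng.normal(size=(1, 64)) + 0.01 * rng.normal(size=(100, 64))
feats_b = shared @ rng.normal(size=(1, 32)) + 0.01 * rng.normal(size=(100, 32))

angle = angle_degrees(p_vector(feats_a), p_vector(feats_b))
print(f"angle between toy P-vectors: {angle:.3f} degrees")
```

Because the left singular vector has one entry per sample, \({\mathcal {P}}\)-vectors from models with different feature dimensions remain directly comparable as long as they are computed on the same samples in the same order.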


Haoran Liu is currently affiliated with the Department of Computer Science & Engineering, Texas A&M University, College Station, TX. The work was done when she was a research intern at Baidu.

### A Appendix

### A.0 Comparison of angles between \({\mathcal {P}}\)-vectors extracted from well-trained models using different architectures on ImageNet

See Fig. 11.

### A.1 Comparison of angles between checkpoints and well-trained \({\mathcal {P}}\)-vector using different architectures on CIFAR-100

In the main text, we presented the results on the CIFAR-10 dataset. To generalize the observations, we repeated the experiments on the CIFAR-100 dataset to validate our hypothesis that the angles between the \({\mathcal {P}}\)-vectors of model checkpoints and those of well-trained models converge. We track the angle between the \({\mathcal {P}}\)-vector of each per-epoch training checkpoint and the \({\mathcal {P}}\)-vector of the well-trained model (the model at epoch 200 in our case). As shown in Fig. 12, the curves decrease gradually, and the angles between \({\mathcal {P}}\)-vectors across models with different supervisory manners generally converge to values smaller than 10°. We conclude that the hypothesis of a common subspace emerging during the learning procedure also holds for the experiments on the CIFAR-100 dataset.

### A.2 The non-monotonic trend within the first epoch

This section reports the variation of the angle between the \({\mathcal {P}}\)-vectors of the well-trained model and of its training checkpoints at each iteration within the first epoch. The non-monotonic trend within the first epoch also appears in the experiments using the testing sets of the CIFAR-10 and CIFAR-100 datasets. Figure 13 shows the curves of the angle between the training model \({\mathcal {P}}\)-vectors and the well-trained model \({\mathcal {P}}\)-vectors over the iterations of the first epoch. Since we use a batch size of 128 in training, there are 391 update iterations per epoch. We observe a non-monotonic trend: the angle first rises after the random initialization and then drops. For the rest of the training process, the angle decreases approximately monotonically and converges to small values. The experiments on the testing sets of CIFAR-10 and CIFAR-100 show results and conclusions consistent with the discussion in Sect. 3.
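The iteration count quoted above follows from the 50,000-image training split of CIFAR-10 and CIFAR-100, assuming the final partial batch is kept rather than dropped (an assumption about the data loader, not stated in the text):

```python
import math

num_train_images = 50_000  # CIFAR-10 / CIFAR-100 training split
batch_size = 128

# 50,000 / 128 = 390.625, so the last batch is partial and is counted.
iterations_per_epoch = math.ceil(num_train_images / batch_size)
print(iterations_per_epoch)  # 391
```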

### A.3 Model-to-model common subspace

We also test and verify the model-to-model common subspace shared by models trained with different supervisory manners on the CIFAR-100 dataset. Experiments are carried out to evaluate the angles between the \({\mathcal {P}}\)-vectors of the checkpoints of all models and the \({\mathcal {P}}\)-vectors of well-trained supervised, unsupervised, and self-supervised models, where we use the well-trained Wide-ResNet28, convolutional auto-encoder, and SimCLR models (trained for 200 epochs under the suggested settings) as the references for supervised, unsupervised, and self-supervised models, respectively. As shown in Fig. 14, the angle curves converge consistently, which supports our hypothesis that the learning dynamics construct the common subspace gradually.

### A.4 The non-monotonic trend in the first epoch of comparison of angles between model and raw data \({\mathcal {P}}\)-vectors

We also explore how the common subspace shared between the feature vectors and the raw data is constructed during the training process. Experiments are carried out to compare the model and raw data \({\mathcal {P}}\)-vectors on the training dataset. As shown in Fig. 15, we observe a non-monotonic trend: the angle first rises after the random initialization and then drops. In the following training epochs, the angle decreases approximately monotonically and converges to small values. The experiments show consistent results on both the CIFAR-10 and CIFAR-100 datasets. Note that we follow the default random data augmentation policy to pre-process the training dataset.

### A.5 Case studies on the angle variation between model and raw data for each layer

To explore the dynamics in more detail by zooming in on the layer-wise variation in each epoch, we perform a case study using the ResNet-18 structure on the CIFAR-10 dataset. As shown in Fig. 16, we report the per-layer angles at 5 different epochs, where the x-axis indicates the indices of the residual blocks in the network structure and the y-axis shows the angles between the model checkpoint and raw data \({\mathcal {P}}\)-vectors. We observe that in the early training stage, the angle between the model checkpoint and raw data \({\mathcal {P}}\)-vectors keeps increasing as the features pass through the layers, whereas in the late training stage it turns into a decreasing trend across the stacked layers. This set of experiments further supports our hypothesis that the learning dynamics construct the common subspace gradually throughout the training process.

### A.6 Distribution of values in the \({\mathcal {P}}\)-vector

Please refer to Fig. 17 for the results of experiments carried out on the CIFAR-10 and CIFAR-100 datasets using ResNet-50. To give a better view of the distribution drift, we further provide the KDE-smoothed frequency map of the \({\mathcal {P}}\)-vector for Wide-ResNet training on the CIFAR-100 dataset, as shown in Fig. 18.

### A.7 No convergence found in comparisons between the top singular vectors other than the \({\mathcal {P}}\)-vectors

Please refer to Fig. 19 for the results of experiments carried out on the CIFAR-10 dataset using ResNet-50.

### A.8 Log–log plots that correlate the \({\mathcal {P}}\)-vector angles and the model performance

Please refer to Fig. 20 for the log–log plot of the results based on the CIFAR-10 dataset (see also Fig. 21).

### A.9 Explained variance analysis of top-k singular vector

The explained variance and the approximation error of the top-*k* dimensional subspace, \(E=\Vert X-U_k \varSigma _k V_k^T\Vert _F^2\), are shown in Table 2, where *X* is the feature matrix and \(U_k\), \(\varSigma _k\), and \(V_k\) are the top-*k* factors of its SVD (the first *k* singular vectors and singular values). Here we analyze only the changes of the top (\(k=1\)) singular vector across training checkpoints. The evolution of the approximation (reconstruction) error is further shown in Fig. 22. The variance ratio analysis for various values of *k* is shown in Table 3.
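The two quantities in this analysis can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the feature matrix is not mean-centered, the explained variance ratio is taken as the share of squared singular values captured by the top-*k* components, and the helper name `top_k_stats` is hypothetical.

```python
import numpy as np

def top_k_stats(X, k=1):
    """Return (E, ratio) for the rank-k truncated SVD of X.

    E     = ||X - U_k S_k V_k^T||_F^2  (approximation error)
    ratio = sum(s[:k]^2) / sum(s^2)    (assumed explained-variance ratio)
    """
    X = np.asarray(X, dtype=np.float64)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Rank-k reconstruction: scale the first k columns of U by s[:k].
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    error = np.linalg.norm(X - X_k, ord="fro") ** 2
    ratio = np.sum(s[:k] ** 2) / np.sum(s ** 2)
    return float(error), float(ratio)

# Toy feature matrix standing in for the features of one checkpoint.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
err, ratio = top_k_stats(X, k=1)
print(f"E = {err:.2f}, explained variance ratio = {ratio:.4f}")
```

By the Eckart–Young theorem, the error *E* equals the sum of the squared singular values beyond the first *k*, so under these assumptions the two quantities are complementary: *E* divided by the total squared Frobenius norm of *X* is one minus the ratio.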

