## Abstract

We find that different Deep Neural Networks (DNNs) trained on the same dataset share a common principal subspace in their latent spaces, regardless of the architecture used (e.g., Convolutional Neural Networks (CNNs), Multi-Layer Perceptrons (MLPs), and Autoencoders (AEs)) or whether labels were used in training (e.g., supervised, unsupervised, and self-supervised learning). Specifically, we design a new metric, the \({\mathcal {P}}\)-vector, to represent the principal subspace of the deep features learned by a DNN, and propose to measure the angles between principal subspaces using \({\mathcal {P}}\)-vectors. Small angles (with cosine close to 1.0) are found in comparisons between any two DNNs trained with different algorithms/architectures. Furthermore, during training from random initialization, the angle decreases from a large value (usually 70°–80°) to a small one, which coincides with the progress of feature-space learning from scratch to convergence. We then carry out case studies measuring the angle between the \({\mathcal {P}}\)-vector and the principal subspace of the training dataset, and connect this angle with generalization performance. Extensive experiments with practically used MLPs, AEs, and CNNs for classification, image reconstruction, and self-supervised learning tasks on the MNIST, CIFAR-10, and CIFAR-100 datasets support our claims with solid evidence.


## Availability of data and materials

All models and datasets are obtained from open-source contributions.

## Code availability

Code will be open-sourced upon acceptance of the paper.

## Notes

#samples and #features refer to the numbers of samples and features, respectively.

We use the term “model \({\mathcal {P}}\)-vector” to denote a \({\mathcal {P}}\)-vector estimated from the feature vectors of a deep model, while “data \({\mathcal {P}}\)-vector” refers to the top left singular vector of the raw data matrix.
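The distinction can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the matrix shapes, the random stand-in data, and the `p_vector` helper are illustrative assumptions. Both \({\mathcal {P}}\)-vectors live in the #samples-dimensional space, which is what makes their angle well defined.

```python
import numpy as np

def p_vector(feature_matrix: np.ndarray) -> np.ndarray:
    """Top left singular vector of a (#samples x #features) matrix."""
    # Full SVD suffices for small matrices; randomized SVD scales
    # better when the feature matrix is large.
    u, _, _ = np.linalg.svd(feature_matrix, full_matrices=False)
    return u[:, 0]

rng = np.random.default_rng(0)

# "Data P-vector": top left singular vector of the raw data matrix.
raw_data = rng.standard_normal((100, 32))        # 100 samples, 32 raw features
data_pvec = p_vector(raw_data)

# "Model P-vector": same computation on deep features of the same samples
# (here stubbed with random values; in practice, e.g. penultimate-layer features).
deep_features = rng.standard_normal((100, 16))
model_pvec = p_vector(deep_features)

# Both vectors are in R^{#samples}, so their angle is well defined;
# the sign of a singular vector is arbitrary, hence the absolute value.
cosine = abs(float(data_pvec @ model_pvec))
```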


## Funding

Not applicable.

## Author information

### Authors and Affiliations

### Contributions

Haoyi Xiong proposed the research problem, formulated the research hypotheses, and wrote the manuscript. Haoran Liu conducted the experiments, analyzed the data, and wrote part of the manuscript and the appendix. Yaqing Wang was involved in the discussion and wrote parts of the manuscript. Haozhe An contributed the code for pseudo testing-accuracy estimation and ran the experiments for generalization performance evaluation. Dongrui Wu helped establish the connections between our observations and the local linear behaviors of DNNs, as well as their generalization performance, and wrote parts of the manuscript. Dejing Dou oversaw the research progress and was involved in the discussion.

### Corresponding author

## Ethics declarations

### Conflict of interest

Not applicable.

### Ethics approval

Not applicable.

### Consent to participate

Not applicable.

### Consent for publication

Not applicable.

## Additional information

Editors: Yu-Feng Li, Mehmet Gönen, Kee-Eung Kim.

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Haoran Liu is currently affiliated with the Department of Computer Science & Engineering, Texas A&M University, College Station, TX. The work was done while she was a research intern at Baidu.

## A Appendix


### A.0 Comparison of angles between \({\mathcal {P}}\)-vectors extracted from well-trained models using different architectures on ImageNet

See Fig. 11.

### A.1 Comparison of angles between checkpoints and well-trained \({\mathcal {P}}\)-vector using different architectures on CIFAR-100

In the main text, we presented results on the CIFAR-10 dataset. To generalize these observations, we repeated the experiments on the CIFAR-100 dataset to validate our hypothesis about the convergence of the angles between the \({\mathcal {P}}\)-vectors of model checkpoints and those of well-trained models. We track how the angles between the \({\mathcal {P}}\)-vectors of per-epoch training checkpoints and the \({\mathcal {P}}\)-vectors of well-trained models (the model at epoch 200 in our case) evolve over training. As shown in Fig. 12, the angle curves decrease gradually, and all angles between \({\mathcal {P}}\)-vectors across models trained under different supervisory manners generally converge to values smaller than 10°. We conclude that the hypothesis of a common subspace emerging during the learning procedure also holds in the experiments on the CIFAR-100 dataset.
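The angle tracked in these experiments can be sketched as follows. The checkpoint trajectory below is a synthetic stand-in (vectors drifting toward a fixed "well-trained" \({\mathcal {P}}\)-vector), and `angle_deg` is a hypothetical helper, not the paper's code; it only illustrates how the per-epoch angles would be computed and why they shrink as checkpoints converge.

```python
import numpy as np

def angle_deg(u: np.ndarray, v: np.ndarray) -> float:
    """Angle in degrees between two P-vectors; the sign of a singular
    vector is arbitrary, so take the absolute cosine."""
    cos = abs(float(u @ v)) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

rng = np.random.default_rng(1)

# Fixed "well-trained" P-vector (unit norm).
final = rng.standard_normal(100)
final /= np.linalg.norm(final)

# Synthetic checkpoints: noise shrinks as training "converges".
angles = []
for noise in (2.0, 1.0, 0.5, 0.1):
    ckpt = final + noise * rng.standard_normal(100)
    angles.append(angle_deg(ckpt, final))
# angles decrease toward 0 as the checkpoint approaches the final model
```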

### A.2 The non-monotonic trend within the first epoch

We examine the variation of the angle between the \({\mathcal {P}}\)-vector of the well-trained model and the \({\mathcal {P}}\)-vector of each checkpoint at every iteration of the first training epoch. The non-monotonic trend within the first epoch also appears in the experiments on the testing sets of the CIFAR-10 and CIFAR-100 datasets. Figure 13 shows the curves of the angles between the training-checkpoint \({\mathcal {P}}\)-vectors and the well-trained-model \({\mathcal {P}}\)-vectors over the iterations of the first epoch. As we use a batch size of 128 during training, the number of update iterations per epoch is 391. We observe a non-monotonic trend: the angle first rises from the random initialization and then drops. Over the rest of the training process, the angles keep decreasing approximately monotonically and converge to small values. These experiments on the testing sets of CIFAR-10 and CIFAR-100 show results consistent with the discussion in Sect. 3.

### A.3 Model-to-model common subspace

We also test and verify the model-to-model common subspace shared by models trained under different supervisory manners on the CIFAR-100 dataset. Experiments were carried out to evaluate the angles between the \({\mathcal {P}}\)-vectors of the checkpoints of all models and the \({\mathcal {P}}\)-vectors of well-trained supervised, unsupervised, and self-supervised models, where we use a well-trained Wide-ResNet28, a convolutional autoencoder, and a SimCLR model (each trained for 200 epochs under the suggested settings) as the supervised, unsupervised, and self-supervised references, respectively. As shown in Fig. 14, a consistent convergence of the angle curves can be observed, supporting our hypothesis that the learning dynamics construct the common subspace gradually.

### A.4 The non-monotonic trend in the first epoch for angles between model and raw data \({\mathcal {P}}\)-vectors

We also explore how the common subspace shared between the feature vectors and the raw data is constructed during the training process. Experiments are carried out to compare the model \({\mathcal {P}}\)-vectors with the raw data \({\mathcal {P}}\)-vectors on the training dataset. As shown in Fig. 15, we observe a non-monotonic trend: the angle first rises from the random initialization and then drops. In the following training epochs, the angles keep decreasing approximately monotonically and converge to small values. The experiments show consistent results on both the CIFAR-10 and CIFAR-100 datasets. Note that we follow the default random data augmentation policy to pre-process the training dataset.

### A.5 Case studies on the angle variety between model and raw data for each layer

To explore the dynamics of this process by zooming in on the per-layer variation in each epoch, we perform a case study using the ResNet-18 architecture on the CIFAR-10 dataset. As shown in Fig. 16, we plot the angles layer by layer for five different epochs, where the x-axis indicates the indices of the residual blocks in the network and the y-axis refers to the angles between the model-checkpoint and raw-data \({\mathcal {P}}\)-vectors. We observe that in the early training stage, the angles increase as the features pass through the layers, whereas in the late training stage they decrease across the stacked layers. This set of experiments further supports our hypothesis that the learning dynamics construct the common subspace gradually over the course of training.
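The per-layer measurement can be sketched as one angle per residual block. In the sketch below the per-block feature matrices are random stand-ins (in practice they would come from, e.g., forward hooks on a ResNet-18 checkpoint), and the block widths and helpers are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def p_vector(m: np.ndarray) -> np.ndarray:
    """Top left singular vector of a (#samples x #features) matrix."""
    u, _, _ = np.linalg.svd(m, full_matrices=False)
    return u[:, 0]

def angle_deg(u: np.ndarray, v: np.ndarray) -> float:
    """Angle in degrees between two P-vectors (sign-invariant)."""
    cos = abs(float(u @ v)) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

rng = np.random.default_rng(2)
n_samples = 64

# Raw data P-vector from flattened CIFAR-like inputs.
raw = rng.standard_normal((n_samples, 3 * 32 * 32))
data_pvec = p_vector(raw)

# Hypothetical per-block feature matrices (one per residual block stage),
# stubbed with random features of typical ResNet-18 widths.
block_features = [rng.standard_normal((n_samples, d)) for d in (64, 128, 256, 512)]

# One angle per block: the quantity plotted on the y-axis of the case study.
layer_angles = [angle_deg(p_vector(f), data_pvec) for f in block_features]
```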

### A.6 Distribution of values in the \({\mathcal {P}}\)-vector

Please refer to Fig. 17 for the results of experiments carried out on CIFAR-10 and CIFAR-100 datasets using ResNet-50. To have a better view of the distribution drift, we further provide the KDE-smoothed frequency map of the \({\mathcal {P}}\)-vector for wide-resnet training using CIFAR-100 dataset, as shown in Fig. 18.

### A.7 No convergence found in comparisons between the top singular vectors other than the \({\mathcal {P}}\)-vectors

Please refer to Fig. 19 for the results of experiments carried out on CIFAR-10 datasets using ResNet-50.

### A.8 Log–log plots that correlate the \({\mathcal {P}}\)-vector angles and the model performance

Please refer to Fig. 20 for the log–log plot of the results based on CIFAR-10 datasets (Fig. 21).

### A.9 Explained variance analysis of top-k singular vector

The explained variances and the approximation error of the top-*k* dimensional subspace, \(E=\Vert X-U_k \varSigma _k V_k^T\Vert _F^2\), are shown in Table 2, where *X* is the feature matrix and \(U_k\), \(\varSigma _k\), \(V_k\) are the leading components of its SVD; here we analyze only the changes of the top \(k=1\) singular vectors across different training checkpoints. The evolution of the reconstruction error of the approximation is further shown in Fig. 22. The variance-ratio analysis for various values of *k* is shown in Table 3.
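Both quantities follow directly from the SVD: the rank-*k* approximation error equals the sum of the discarded squared singular values, and the explained-variance ratio is the retained share of the total squared singular values. A small NumPy sketch (the random matrix `X` is a stand-in for a real feature matrix):

```python
import numpy as np

def topk_svd_stats(X: np.ndarray, k: int = 1):
    """Return (E, ratio): the rank-k approximation error
    E = ||X - U_k S_k V_k^T||_F^2 and the fraction of variance
    explained by the top-k singular vectors."""
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    approx = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]
    error = float(np.linalg.norm(X - approx, "fro") ** 2)
    explained_ratio = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return error, explained_ratio

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 20))   # stand-in feature matrix

e1, r1 = topk_svd_stats(X, k=1)
e5, r5 = topk_svd_stats(X, k=5)
# As k grows, the error shrinks and the explained ratio grows;
# E and the ratio are complementary: E = (1 - ratio) * ||X||_F^2.
```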

## Rights and permissions

## About this article

### Cite this article

Liu, H., Xiong, H., Wang, Y. *et al.* Exploring the common principal subspace of deep features in neural networks.
*Mach Learn*, **111**, 1125–1157 (2022). https://doi.org/10.1007/s10994-021-06076-6


### Keywords

- Interpretability of deep learning
- Feature learning
- Subspaces of deep features