
Exploring the common principal subspace of deep features in neural networks

Abstract

We find that different Deep Neural Networks (DNNs) trained on the same dataset share a common principal subspace in their latent spaces, regardless of the architectures in which the DNNs were built (e.g., Convolutional Neural Networks (CNNs), Multi-Layer Perceptrons (MLPs), and Autoencoders (AEs)) and even of whether labels were used in training (e.g., supervised, unsupervised, and self-supervised learning). Specifically, we design a new metric, the \({\mathcal {P}}\)-vector, to represent the principal subspace of the deep features learned by a DNN, and propose to measure the angles between principal subspaces using \({\mathcal {P}}\)-vectors. Small angles (with cosine close to 1.0) are found in comparisons between any two DNNs trained with different algorithms/architectures. Furthermore, during training from random initialization, the angle decreases from a large value (usually 70°–80°) to a small one, which coincides with the progress of feature-space learning from scratch to convergence. We then carry out case studies that measure the angle between the \({\mathcal {P}}\)-vector and the principal subspace of the training dataset, and connect this angle to generalization performance. Extensive experiments with practically used MLPs, AEs, and CNNs for classification, image reconstruction, and self-supervised learning tasks on the MNIST, CIFAR-10, and CIFAR-100 datasets support our claims with solid evidence.
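For illustration, the following is a minimal sketch, not the authors' released implementation, of how a \({\mathcal {P}}\)-vector and the angle between two \({\mathcal {P}}\)-vectors could be computed. It assumes the deep features of a model have already been collected into a #samples × #features matrix, and it resolves the sign ambiguity of singular vectors by taking the absolute cosine.

```python
# Minimal sketch (not the paper's released code): the P-vector of a feature
# matrix and the angle between two P-vectors.
import numpy as np

def p_vector(feature_matrix: np.ndarray) -> np.ndarray:
    """Top left singular vector of a (#samples x #features) feature matrix."""
    u, _, _ = np.linalg.svd(feature_matrix, full_matrices=False)
    return u[:, 0]

def angle_deg(p1: np.ndarray, p2: np.ndarray) -> float:
    """Angle in degrees between two P-vectors; the sign of a singular vector is
    arbitrary, so the absolute cosine is used (an assumption of this sketch)."""
    cos = abs(np.dot(p1, p2)) / (np.linalg.norm(p1) * np.linalg.norm(p2))
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))

# Hypothetical usage: features_a and features_b hold deep features of the same
# samples extracted from two trained networks (e.g., a CNN and an AE):
#   angle_deg(p_vector(features_a), p_vector(features_b))
```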

Availability of data and materials

All models and datasets are obtained from open-source contributions.

Code availability

Code will be open-sourced upon acceptance of the paper.

Notes

  1. #samples and #features refer to the numbers of samples and features respectively.

  2. We use the term “model \({\mathcal {P}}\)-vector” to represent a \({\mathcal {P}}\)-vector estimated using feature vectors of a deep model, while using “data \({\mathcal {P}}\)-vector” as the top left singular vector of the raw data matrix.
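As a complement to Note 2, here is a minimal sketch (an illustration under the definitions above, not the paper's code) of how a data \({\mathcal {P}}\)-vector could be obtained from the raw data matrix, assuming images are flattened into a #samples × #features matrix; a randomized SVD is used only to keep the decomposition tractable for large matrices.

```python
# Data P-vector: top left singular vector of the raw data matrix (Note 2).
import numpy as np
from sklearn.utils.extmath import randomized_svd

def data_p_vector(images: np.ndarray) -> np.ndarray:
    """images: array of shape (#samples, H, W) or (#samples, H, W, C)."""
    x = images.reshape(images.shape[0], -1).astype(np.float64)  # #samples x #features
    u, _, _ = randomized_svd(x, n_components=1, random_state=0)
    return u[:, 0]
```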


Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

Haoyi Xiong proposed the research problem, formulated the research hypotheses, and wrote the manuscript. Haoran Liu conducted experiments, analyzed data, and wrote part of the manuscript and the appendix. Yaqing Wang was involved in the discussion and wrote parts of the manuscript. Haozhe An contributed the code for pseudo testing accuracy estimation and ran the experiments for generalization performance evaluation. Dongrui Wu helped establish the connections between our observations, the local linear behaviors of the DNN, and the generalization performance, and wrote parts of the manuscript. Dejing Dou oversaw the research progress and was involved in the discussion.

Corresponding author

Correspondence to Haoyi Xiong.

Ethics declarations

Conflict of interest

Not applicable.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editors: Yu-Feng Li, Mehmet Gönen, Kee-Eung Kim.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Haoran Liu is currently affiliated with the Department of Computer Science & Engineering, Texas A&M University, College Station, TX. The work was done when she was a research intern at Baidu.

A Appendix

A.0 Comparison of angles between \({\mathcal {P}}\)-vectors extracted from well-trained models with different architectures on ImageNet

See Fig. 11.

Fig. 11

Cosine of angles between principal subspaces of deep features, measured using \({\mathcal {P}}\)-vectors, for models trained under default settings

A.1 Comparison of angles between the \({\mathcal {P}}\)-vectors of checkpoints and well-trained models with different architectures on CIFAR-100

In the main text, we presented results on the CIFAR-10 dataset. To generalize the observations, we repeated the experiments on the CIFAR-100 dataset to validate our hypothesis on the convergence of the angles between the \({\mathcal {P}}\)-vectors of model checkpoints and of well-trained models. We investigate how the angles between the \({\mathcal {P}}\)-vectors of the training checkpoints at each epoch and the \({\mathcal {P}}\)-vectors of the well-trained models (the model at epoch 200 in our case) change over training. As shown in Fig. 12, the angle curves decrease gradually, and the angles across models trained with different supervisory manners generally converge to values smaller than 10°. We conclude that the hypothesis of a common subspace emerging during the learning procedure also holds in the experiments on the CIFAR-100 dataset.
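A minimal sketch of this per-epoch tracking is given below; load_checkpoint, extract_features, and test_loader are hypothetical placeholders (a checkpointed model, its #samples × #features feature matrix on the test set, and a test data loader), and p_vector/angle_deg are the helpers sketched after the abstract.

```python
# Per-epoch tracking of the angle between checkpoint P-vectors and the
# well-trained model's P-vector. load_checkpoint, extract_features, and
# test_loader are hypothetical placeholders, not the paper's code.
reference = p_vector(extract_features(load_checkpoint(epoch=200), test_loader))

angles_per_epoch = []
for epoch in range(1, 201):
    feats = extract_features(load_checkpoint(epoch), test_loader)
    angles_per_epoch.append(angle_deg(p_vector(feats), reference))
# On CIFAR-100 these angles are observed to fall below roughly 10 degrees as
# training converges (Fig. 12).
```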

Fig. 12

Convergence to the common feature subspace on CIFAR-100. Curves of the angles between the \({\mathcal {P}}\)-vectors of the well-trained model and its checkpoints per training epoch under three supervisory manners. Convergence trends of the angles can be observed for all models

A.2 The non-monotonic trend within the first epoch

We examine the variation of the angles between the \({\mathcal {P}}\)-vectors of the well-trained model and its checkpoints at each iteration of the first training epoch. The non-monotonic trend within the first epoch also appears in the experiments using the testing sets of the CIFAR-10 and CIFAR-100 datasets. Figure 13 shows the curves of the angles between the \({\mathcal {P}}\)-vectors of the training model and of the well-trained model over the iterations of the first epoch. Since we use a batch size of 128 during training, there are 391 update iterations per epoch. We observe a non-monotonic trend: the angle first rises from the random initialization and then drops. Over the rest of the training process, the angle decreases approximately monotonically and converges to small values. The results on the testing sets of CIFAR-10 and CIFAR-100 are consistent with the discussion in Sect. 3.

Fig. 13

Angles between principal subspaces, measured using \({\mathcal {P}}\)-vectors based on the testing sets of CIFAR-10 and CIFAR-100, between well-trained models and checkpoints per training iteration in the first epoch

A.3 Model-to-model common subspace

We also test and verify the model-to-model common subspace shared by models trained with different supervisory manners on the CIFAR-100 dataset. Experiments are carried out to evaluate the angles between the \({\mathcal {P}}\)-vectors of the checkpoints of all models and the \({\mathcal {P}}\)-vectors of well-trained supervised, unsupervised, and self-supervised models, where we use the well-trained Wide-ResNet28, convolutional autoencoder, and SimCLR model (each trained for 200 epochs under the suggested settings) as the references for supervised, unsupervised, and self-supervised learning, respectively. As shown in Fig. 14, a consistent convergence of the angle curves can be observed, supporting our hypothesis that the learning dynamics gradually construct the common subspace.

Fig. 14

Convergence of the angles between the \({\mathcal {P}}\)-vectors of per-epoch checkpoints and of well-trained models on CIFAR-100

A.4 The non-monotonic trend of the angles between model and raw-data \({\mathcal {P}}\)-vectors in the first epoch

We also explore how the common subspace shared between the feature vectors and the raw data is constructed during training. Experiments are carried out to compare the \({\mathcal {P}}\)-vectors of the models and of the raw data on the training dataset. As shown in Fig. 15, we observe a non-monotonic trend: the angle first rises from the random initialization and then drops. In the following training epochs, the angle decreases approximately monotonically and converges to small values. The experiments show consistent results on both the CIFAR-10 and CIFAR-100 datasets. Note that we follow the default random data augmentation policy to pre-process the training dataset.

Fig. 15

Angles between the \({\mathcal {P}}\)-vectors of the training model and the raw datasets over the number of iterations in the first epoch using CIFAR-10/CIFAR-100

A.5 Case studies on the layer-wise variation of the angle between model and raw-data \({\mathcal {P}}\)-vectors

To explore the dynamic variation process by zooming in on every layer at each epoch, we perform a case study using the ResNet-18 architecture on the CIFAR-10 dataset. As shown in Fig. 16, we report the angles per layer at 5 different epochs, where the x-axis indicates the indices of the residual blocks in the network and the y-axis refers to the angle between the \({\mathcal {P}}\)-vectors of the model checkpoint and of the raw data. We observe that in the early training stage the angles increase as the features pass through the layers, whereas in the late training stage they decrease along the stacked layers. This set of experiments further supports our hypothesis that the learning dynamics gradually construct the common subspace throughout training.
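The layer-wise feature collection behind this case study can be sketched with PyTorch forward hooks. The snippet below assumes torchvision's ResNet-18 layout (residual blocks grouped under layer1...layer4), a hypothetical CIFAR-10 DataLoader named loader, and the p_vector helper sketched after the abstract; it illustrates the idea rather than the paper's exact pipeline.

```python
# Collect per-residual-block features with forward hooks, then compute one
# P-vector per block. In practice a trained checkpoint would be loaded;
# `loader` is a hypothetical CIFAR-10 DataLoader.
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10).eval()
features = {}  # block name -> list of flattened feature batches

def make_hook(name):
    def hook(module, inputs, output):
        features.setdefault(name, []).append(output.flatten(1).detach().cpu())
    return hook

for layer_name in ["layer1", "layer2", "layer3", "layer4"]:
    for i, block in enumerate(getattr(model, layer_name)):
        block.register_forward_hook(make_hook(f"{layer_name}.{i}"))

with torch.no_grad():
    for images, _ in loader:
        model(images)

block_p_vectors = {
    name: p_vector(torch.cat(chunks).numpy()) for name, chunks in features.items()
}
# Each block's P-vector can then be compared against the raw-data P-vector.
```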

Fig. 16

Layer-wise changes of the angles between the \({\mathcal {P}}\)-vectors of the training model and of the raw data at different training epochs on CIFAR-10

A.6 Distribution of values in the \({\mathcal {P}}\)-vector

Please refer to Fig. 17 for the results of experiments carried out on the CIFAR-10 and CIFAR-100 datasets using ResNet-50. To give a better view of the distribution drift, we further provide the KDE-smoothed frequency map of the \({\mathcal {P}}\)-vector for Wide-ResNet trained on the CIFAR-100 dataset, as shown in Fig. 18.
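A minimal sketch of such a KDE-smoothed view is given below, assuming a hypothetical dictionary p_vectors_by_epoch that maps a checkpoint epoch to its \({\mathcal {P}}\)-vector (a 1-D NumPy array); the plotted value range is an assumption of this illustration.

```python
# KDE-smoothed frequency of P-vector entry values across checkpoints.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

grid = np.linspace(-0.05, 0.05, 500)   # assumed range of P-vector entries
for epoch, p in sorted(p_vectors_by_epoch.items()):
    plt.plot(grid, gaussian_kde(p)(grid), label=f"epoch {epoch}")
plt.xlabel("P-vector entry value")
plt.ylabel("KDE-smoothed frequency")
plt.legend()
plt.show()
```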

Fig. 17

Frequency of the values appearing in the \({\mathcal {P}}\)-vector versus training epochs

Fig. 18

KDE-smoothed frequency of the values appearing in the \({\mathcal {P}}\)-vector versus training epochs

A.7 No convergence found in comparisons between the top singular vectors other than the \({\mathcal {P}}\)-vectors

Please refer to Fig. 19 for the results of experiments carried out on the CIFAR-10 dataset using ResNet-50.

Fig. 19

Angles between the top-k left singular vectors (\(k=1\) gives the \({\mathcal {P}}\)-vector) of the training and well-trained models over the number of epochs in the training process (ResNet-50, CIFAR-10). Note that the first plotted point refers to the feature matrix after training for one epoch

A.8 Log–log plots that correlate the \({\mathcal {P}}\)-vector angles and the model performance

Please refer to Fig. 20 for the log–log plot of the results on the CIFAR-10 dataset, and to Fig. 21 for the corresponding plot on CIFAR-100.

Fig. 20

Log–log plots: correlations between the model performance (training and testing accuracy, in log range) and the angles (in log range) between model and data \({\mathcal {P}}\)-vectors on the CIFAR-10 dataset

Fig. 21

Log–log plots: correlations between the model performance (training and testing accuracy, in log range) and the angles (in log range) between model and data \({\mathcal {P}}\)-vectors on the CIFAR-100 dataset

A.9 Explained variance analysis of the top-k singular vectors

The explained variances and the approximation error of the top-k dimensional subspace, \(E=\Vert X-U_k \varSigma _k V_k^T\Vert _F^2\), spanned by the top-k singular vectors are shown in Table 2, where \(X\) is the feature matrix and \(U_k\), \(\varSigma _k\), and \(V_k\) consist of the first \(k\) columns of the SVD result; here we analyze only the change of the top (\(k=1\)) singular vector across the training checkpoints. The evolution of the reconstruction error of this approximation is further shown in Fig. 22. The explained-variance-ratio analysis with various values of \(k\) is shown in Table 3; a minimal computation sketch is given after the tables.

Table 2 Explained variance and reconstruction analysis of top-1 singular vector through the training process
Fig. 22

The reconstruction error of the rank-1 approximation (top-1 singular vector) through the training process

Table 3 Explained variances of top-k singular vectors
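The quantities in Tables 2 and 3 can be computed, for a single feature matrix, as in the minimal sketch below; treating the squared singular values of the (uncentered) feature matrix as variances is an assumption of this illustration rather than a statement of the paper's exact procedure.

```python
# Explained-variance ratio of the top-k singular vectors and the rank-k
# reconstruction error E = ||X - U_k Sigma_k V_k^T||_F^2.
import numpy as np

def topk_svd_stats(X: np.ndarray, k: int = 1):
    u, s, vt = np.linalg.svd(X, full_matrices=False)
    explained_ratio = float(np.sum(s[:k] ** 2) / np.sum(s ** 2))
    X_k = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]   # rank-k approximation of X
    reconstruction_error = float(np.linalg.norm(X - X_k, ord="fro") ** 2)
    return explained_ratio, reconstruction_error
```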


About this article


Cite this article

Liu, H., Xiong, H., Wang, Y. et al. Exploring the common principal subspace of deep features in neural networks. Mach Learn 111, 1125–1157 (2022). https://doi.org/10.1007/s10994-021-06076-6


Keywords

  • Interpretability of deep learning
  • Feature learning
  • Subspaces of deep features