Using regression makes extraction of shared variation in multiple datasets easy


In many data analysis tasks it is important to understand the relationships between different datasets. Several methods exist for this task but many of them are limited to two datasets and linear relationships. In this paper, we propose a new efficient algorithm, termed cocoreg, for the extraction of variation common to all datasets in a given collection of arbitrary size. cocoreg extends redundancy analysis to more than two datasets, utilizing chains of regression functions to extract the shared variation in the original data space. The algorithm can be used with any linear or non-linear regression function, which makes it robust, straightforward, fast, and easy to implement and use. We empirically demonstrate the efficacy of shared variation extraction using the cocoreg algorithm on five artificial and three real datasets.
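The idea of chaining regressions can be illustrated with a minimal sketch. It is not the authors' cocoreg implementation: the helper names (`fit_linear`, `chain_projection`, `mix`), the synthetic data, the use of ordinary least squares, and the choice to train each link on the original dataset pair and follow a single chain are all assumptions made for illustration. Three datasets share a signal `s`; datasets 1 and 2 additionally share `q`, which dataset 3 lacks. Passing dataset 2 through regressions 2 → 3 → 1 should leave mostly the part of dataset 1 that is common to all three, expressed in dataset 1's own data space.

```python
import numpy as np

def fit_linear(X, Y):
    """Fit a least-squares map X -> Y (with intercept) and return a predict function."""
    Xa = np.column_stack([X, np.ones(len(X))])
    W, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
    return lambda Z: np.column_stack([Z, np.ones(len(Z))]) @ W

def chain_projection(datasets, path):
    """Map datasets[path[0]] into the space of datasets[path[-1]] via chained regressions.

    Assumption: each link is trained on the original pair of datasets,
    and the fitted links are then composed along the path.
    """
    current = datasets[path[0]]
    for src, dst in zip(path[:-1], path[1:]):
        predict = fit_linear(datasets[src], datasets[dst])
        current = predict(current)
    return current

def rel_error(A, B):
    """Frobenius-norm error of A relative to the reference B."""
    return np.linalg.norm(A - B) / np.linalg.norm(B)

rng = np.random.default_rng(0)
n = 2000
# s: shared by all datasets; q: shared by datasets 1 and 2 only; p1..p3: private.
s, q, p1, p2, p3 = rng.normal(size=(5, n))

def mix(sources, noise=0.1):
    """Mix source signals with a random orthogonal matrix and add observation noise."""
    S = np.column_stack(sources)
    Q, _ = np.linalg.qr(rng.normal(size=(S.shape[1], S.shape[1])))
    return S @ Q + noise * rng.normal(size=S.shape), Q

X1, Q1 = mix([s, q, p1])
X2, _ = mix([s, q, p2])
X3, _ = mix([s, p3])

# Ground truth for this toy setup: the contribution of s to dataset 1.
shared_true = np.outer(s, Q1[0])

# Chain 2 -> 3 -> 1 (zero-based indices 1, 2, 0).
shared_est = chain_projection([X1, X2, X3], path=[1, 2, 0])

err_chain = rel_error(shared_est, shared_true)
err_raw = rel_error(X1, shared_true)
print(f"relative error, chained estimate: {err_chain:.2f}")
print(f"relative error, raw dataset 1:    {err_raw:.2f}")
```

Because `fit_linear` only needs to return a predict function, any other regressor (for example a non-linear one) could be substituted in this sketch without changing `chain_projection`, which mirrors the abstract's point that the approach works with any linear or non-linear regression function.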









This work was supported in part by the Revolution of Knowledge Work Project, funded by Tekes (Grants 40228/13 and 5159/31/2014), and in part by the Academy of Finland (Finnish Centre of Excellence in Computational Research COIN, 251170; 26696).

Author information


Corresponding author

Correspondence to Jussi Korpela.

Additional information

Responsible editors: Thomas Gärtner, Mirco Nanni, Andrea Passerini and Celine Robardet.

About this article

Cite this article

Korpela, J., Henelius, A., Ahonen, L. et al. Using regression makes extraction of shared variation in multiple datasets easy. Data Min Knowl Disc 30, 1112–1133 (2016).


Keywords

  • Shared variation
  • Multiple regression
  • Regression chains