Using regression makes extraction of shared variation in multiple datasets easy

Korpela, Jussi; Henelius, Andreas; Ahonen, Lauri; Klami, Arto; Puolamäki, Kai

doi:10.1007/s10618-016-0465-y

Published: 26 May 2016

Volume 30, pages 1112–1133 (2016)
Data Mining and Knowledge Discovery

Jussi Korpela¹, Andreas Henelius¹, Lauri Ahonen¹, Arto Klami² & Kai Puolamäki¹

Abstract

In many data analysis tasks it is important to understand the relationships between different datasets. Several methods exist for this task but many of them are limited to two datasets and linear relationships. In this paper, we propose a new efficient algorithm, termed cocoreg, for the extraction of variation common to all datasets in a given collection of arbitrary size. cocoreg extends redundancy analysis to more than two datasets, utilizing chains of regression functions to extract the shared variation in the original data space. The algorithm can be used with any linear or non-linear regression function, which makes it robust, straightforward, fast, and easy to implement and use. We empirically demonstrate the efficacy of shared variation extraction using the cocoreg algorithm on five artificial and three real datasets.
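
The idea of chaining regression functions across datasets can be illustrated with a toy sketch. This is not the cocoreg algorithm as published, whose exact procedure is not given on this page; it only assumes a fixed chain D1 → D2 → D3, ordinary least squares without intercept, and centred data. All names (`fit_linear`, `d1`, `d2`, `d3`, `s`, `p1`, `p2`, `p3`) are hypothetical, and the signals are constructed to be exactly orthogonal so that the outcome is easy to check by hand.

```python
from typing import Callable, List

Matrix = List[List[float]]  # rows are samples, columns are variables


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear(x: Matrix, y: Matrix) -> Callable[[Matrix], Matrix]:
    """Least-squares map x -> y without intercept (assumes centred data)."""
    xt = transpose(x)
    coef = solve(matmul(xt, x), matmul(xt, y))
    return lambda new_x: matmul(new_x, coef)


# One shared signal s and three dataset-specific signals, all zero-mean
# and mutually orthogonal (an assumption made purely for illustration).
s = [1, 1, 1, 1, -1, -1, -1, -1]
p1 = [1, 1, -1, -1, 1, 1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
p3 = [1, -1, -1, 1, 1, -1, -1, 1]

# Each dataset mixes the shared signal with its own private signal.
d1 = [[a + b, a - b] for a, b in zip(s, p1)]
d2 = [[2 * a, b] for a, b in zip(s, p2)]
d3 = [[-a, b] for a, b in zip(s, p3)]

# Chain of regressions: D1 -> D2 -> D3.
f12 = fit_linear(d1, d2)
f23 = fit_linear(d2, d3)
shared_in_d3 = f23(f12(d1))

for row in shared_in_d3:
    print(row)
```

In this constructed case the chained prediction keeps only the part of `d3` that is predictable from the other datasets (the column driven by `s`), while the private column of `d3` is mapped to zero. Any regression model with a fit and predict step could be substituted for `fit_linear` in the same chain.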

Acknowledgments

This work was supported in part by the Revolution of Knowledge Work project, funded by Tekes (Grants 40228/13 and 5159/31/2014), and in part by the Academy of Finland (Finnish Centre of Excellence in Computational Research COIN, 251170; 26696).

Author information

Authors and Affiliations

Finnish Institute of Occupational Health, PO Box 40, 00251, Helsinki, Finland
Jussi Korpela, Andreas Henelius, Lauri Ahonen & Kai Puolamäki
Department of Computer Science, Helsinki Institute for Information Technology HIIT, University of Helsinki, PO Box 68, 00014, Helsinki, Finland
Arto Klami

Corresponding author

Correspondence to Jussi Korpela.

Additional information

Responsible editors: Thomas Gärtner, Mirco Nanni, Andrea Passerini and Celine Robardet.

About this article

Cite this article

Korpela, J., Henelius, A., Ahonen, L. et al. Using regression makes extraction of shared variation in multiple datasets easy. Data Min Knowl Disc 30, 1112–1133 (2016). https://doi.org/10.1007/s10618-016-0465-y

Received: 21 December 2015
Accepted: 13 May 2016
Published: 26 May 2016
Issue Date: September 2016
DOI: https://doi.org/10.1007/s10618-016-0465-y
