Kernel canonical correlation analysis for data combination of multiple-source datasets

Mitsuhiro, Masaki; Hoshino, Takahiro

doi:10.1007/s42081-020-00074-z

Original Paper
Theory and Practice of Surveys
Published: 06 March 2020

Volume 3, pages 651–668, (2020)

Japanese Journal of Statistics and Data Science

Abstract

To investigate the relationship between variables that are not observed simultaneously in the same dataset, “multiple-source datasets” obtained from different individuals or units must be integrated into a “(quasi) single-source dataset”, in which all the relevant variables are observed for the same units. Among the various data combination methods, statistical matching, which is frequently used in marketing and the social sciences, matches units from one dataset with similar units from another dataset, where similarity is measured by the distance between the units’ values of covariates related to the variables of interest. However, when multiple-source datasets have a large number of covariates, it is difficult to obtain an accurate quasi single-source dataset using matching methods, because the combinations of covariate values become complicated and/or it is difficult to handle nonlinear relationships between the variables of interest. In this study, we propose a data combination method that combines an extension of kernel canonical correlation analysis with statistical matching. The proposed method estimates canonical variables in a common low-dimensional space that preserves the relationship between the covariates and the outcome variables. Using a simulation study and a real-world data analysis, we compare our method with existing methods and demonstrate its utility.
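The distance-based matching step described in the abstract can be illustrated with a minimal sketch. It assumes plain nearest-neighbour matching on standardized covariates with Euclidean distance; it does not implement the kernel canonical correlation step of the proposed method, which would replace the raw covariates with estimated canonical variables before matching. The function name `nearest_neighbor_match` and all variable names are hypothetical and do not come from the paper.

```python
import numpy as np


def nearest_neighbor_match(x_recipient, x_donor, y_donor):
    """Borrow, for each recipient unit, the outcome of its closest donor unit.

    Illustrative only: distance is Euclidean on covariates standardized
    with the pooled mean and standard deviation (an assumption of this sketch).
    """
    pooled = np.vstack([x_recipient, x_donor])
    mean = pooled.mean(axis=0)
    std = pooled.std(axis=0)
    std[std == 0] = 1.0  # guard against constant covariates

    z_recipient = (x_recipient - mean) / std
    z_donor = (x_donor - mean) / std

    # Pairwise squared distances, shape (n_recipient, n_donor).
    diff = z_recipient[:, None, :] - z_donor[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)

    donor_idx = dist2.argmin(axis=1)
    return y_donor[donor_idx], donor_idx


# Toy data: dataset A observes only covariates, dataset B observes covariates and y.
x_a = np.array([[0.0, 0.0], [10.0, 10.0]])
x_b = np.array([[9.0, 11.0], [1.0, -1.0], [5.0, 5.0]])
y_b = np.array([100.0, 200.0, 300.0])

y_imputed, donor_idx = nearest_neighbor_match(x_a, x_b, y_b)
print(donor_idx, y_imputed)
```

With many covariates, the pairwise distances in this sketch become less informative, which is the difficulty the abstract raises and the motivation for matching in a low-dimensional space instead.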

Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers 18H03209, 16H02013, 16H06323. The authors would like to thank the reviewers and editors for their helpful comments on this work.

Author information

Authors and Affiliations

Nikkei Research Inc., 2-2-1 Uchikanda, Chiyoda-ku, Tokyo, 101-0047, Japan
Masaki Mitsuhiro
Keio University, 2-15-45 Mita, Minato-ku, Tokyo, 108-8345, Japan
Masaki Mitsuhiro & Takahiro Hoshino
RIKEN Center for Advanced Intelligence Project, Nihonbashi 1-chome Mitsui Building, 15th Floor, 1-4-1 Nihonbashi, Chuo-ku, Tokyo, 103-0027, Japan
Takahiro Hoshino

Corresponding author

Correspondence to Takahiro Hoshino.

About this article

Cite this article

Mitsuhiro, M., Hoshino, T. Kernel canonical correlation analysis for data combination of multiple-source datasets. Jpn J Stat Data Sci 3, 651–668 (2020). https://doi.org/10.1007/s42081-020-00074-z

Received: 26 September 2019
Accepted: 22 January 2020
Published: 06 March 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s42081-020-00074-z
