Kernel canonical correlation analysis for data combination of multiple-source datasets

  • Original Paper
  • Theory and Practice of Surveys
Japanese Journal of Statistics and Data Science

Abstract

To investigate the relationship between variables that are not observed together in the same dataset, “multiple-source datasets” obtained from different individuals or units must be integrated into a “(quasi) single-source dataset”, in which all the relevant variables are observed for the same units. Among the various data combination methods, statistical matching, which is frequently used in marketing and the social sciences, pairs each unit in one dataset with similar units in another dataset, where similarity is measured by the distance between the units’ values of the covariates related to the variables of interest. However, when multiple-source datasets have a large number of covariates, it is difficult to obtain an accurate quasi single-source dataset using matching methods, because the combinations of covariate values become complicated and because nonlinear relationships between the variables of interest are difficult to handle. In this study, we propose a data combination method that combines an extension of kernel canonical correlation analysis with statistical matching. The proposed method estimates canonical variables in a common low-dimensional space that preserves the relationship between the covariates and the outcome variables. Using a simulation study and a real-world data analysis, we compare our method with existing methods and demonstrate its utility.
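
As an illustration of the distance-based statistical matching described above, the following minimal Python sketch pairs each unit of one dataset with its nearest unit in another dataset on shared covariates. It shows only plain nearest-neighbor matching with Euclidean distance, not the proposed kernel canonical correlation method. The function name, variable names, and toy data are hypothetical, and donors are assumed to be reusable (matching with replacement).

```python
import math

def nearest_neighbor_match(recipients, donors):
    """Return, for each recipient covariate vector, the index of the closest donor.

    Distance is Euclidean; donors may be reused (matching with replacement).
    """
    matches = []
    for x in recipients:
        best = min(range(len(donors)), key=lambda j: math.dist(x, donors[j]))
        matches.append(best)
    return matches

# Hypothetical toy data: dataset A observes (covariates, y),
# dataset B observes (covariates, z); y and z are never observed together.
covariates_a = [(0.0, 1.0), (2.0, 2.0), (5.0, 0.5)]
y_a = [10.0, 20.0, 30.0]
covariates_b = [(4.8, 0.4), (0.1, 1.2), (2.2, 1.9)]
z_b = ["high", "low", "mid"]

# Match each unit in A to its nearest unit in B, then borrow that unit's z.
matched = nearest_neighbor_match(covariates_a, covariates_b)
quasi_single_source = [
    (x, y, z_b[j]) for x, y, j in zip(covariates_a, y_a, matched)
]
```

With only two covariates the nearest donor is easy to identify. The difficulty the abstract points to arises when the covariate vectors are high-dimensional, which is why the authors match in a low-dimensional space of canonical variables instead.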

Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers 18H03209, 16H02013, 16H06323. The authors would like to thank the reviewers and editors for their helpful comments on this work.

Author information

Corresponding author

Correspondence to Takahiro Hoshino.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mitsuhiro, M., Hoshino, T. Kernel canonical correlation analysis for data combination of multiple-source datasets. Jpn J Stat Data Sci 3, 651–668 (2020). https://doi.org/10.1007/s42081-020-00074-z

  • DOI: https://doi.org/10.1007/s42081-020-00074-z
