Abstract
To investigate the relationship between variables that are not observed simultaneously in the same dataset, “multiple-source datasets” obtained from different individuals or units must be integrated into a “(quasi) single-source dataset”, in which all the relevant variables are observed for the same units. Among the various data combination methods, statistical matching, which is frequently used in practice in marketing and the social sciences, matches each unit in one dataset with similar units in another dataset, where similarity is measured by the distance between the units’ values of covariates related to the variables of interest. However, when multiple-source datasets have a large number of covariates, it is difficult to obtain an accurate quasi single-source dataset using matching methods, because the combinations of covariate values become complicated and/or it is difficult to handle nonlinear relationships between the variables of interest. In this study, we propose a data combination method that combines an extension of kernel canonical correlation analysis with statistical matching. The proposed method estimates canonical variables in a common low-dimensional space that preserves the relationship between the covariates and the outcome variables. Using a simulation study and a real-world data analysis, we compare our method with existing methods and demonstrate its utility.
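As a rough illustration of the statistical matching step described above, the following sketch performs plain nearest-neighbor matching on raw covariates. It is not the proposed kernel canonical correlation method. The function name `nearest_neighbor_match`, the toy arrays, and the choice of Euclidean distance are all assumptions made for illustration only.

```python
import numpy as np


def nearest_neighbor_match(x_recipient, x_donor, z_donor):
    """Impute z for each recipient unit from its closest donor unit.

    Hypothetical helper: closeness is Euclidean distance on the shared
    covariates x, which is an assumption, not the paper's method.
    """
    # Pairwise squared distances, shape (n_recipient, n_donor).
    diff = x_recipient[:, None, :] - x_donor[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    # Index of the nearest donor for every recipient unit.
    donor_idx = np.argmin(dist2, axis=1)
    return z_donor[donor_idx], donor_idx


# Toy multiple-source data: dataset A observes (x, y), dataset B observes (x, z).
x_a = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_a = np.array([10.0, 11.0, 20.0])
x_b = np.array([[0.1, -0.1], [4.8, 5.1]])
z_b = np.array([100.0, 200.0])

# Attach a donated z to every unit of A, giving a quasi single-source dataset
# whose columns are (x1, x2, y, z).
z_imputed, idx = nearest_neighbor_match(x_a, x_b, z_b)
quasi_single_source = np.column_stack([x_a, y_a, z_imputed])
print(quasi_single_source)
```

With many covariates, distances computed on the raw covariate space become less informative, which is the difficulty the abstract raises. Matching on a small number of estimated canonical variables instead of on `x` directly would replace the inputs to this function while leaving the matching step unchanged.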
Acknowledgements
This work was supported by JSPS KAKENHI Grant Numbers 18H03209, 16H02013, 16H06323. The authors would like to thank the reviewers and editors for their helpful comments on this work.
About this article
Cite this article
Mitsuhiro, M., Hoshino, T. Kernel canonical correlation analysis for data combination of multiple-source datasets. Jpn J Stat Data Sci 3, 651–668 (2020). https://doi.org/10.1007/s42081-020-00074-z
DOI: https://doi.org/10.1007/s42081-020-00074-z