Abstract
In the analysis of multivariate data, an important problem is to identify a subset of observations for which the variables are strongly associated. One example is driving safety analytics, where we may wish to identify a subset of drivers whose driving behavior characteristics are strongly associated. Other application domains include finance, health care, and marketing. Existing approaches, such as the Top-k method or the tau-path approach, primarily address bivariate data and/or rely on a normality assumption, and adapting them directly to the multivariate setting is cumbersome. In this work, we propose a semiparametric statistical approach for optimal subpopulation selection based on the patterns of association in multivariate data. The proposed method leverages the concept of general correlation coefficients to enable the optimal selection of a subpopulation for a variety of association patterns. We develop efficient algorithms that sequentially include cases into the subpopulation. We illustrate the performance of the proposed method using simulated data and a real data set.
Acknowledgements
We are grateful for very useful referee comments that helped us enhance the paper.
Appendix
Proof of Proposition 1
For (c), if \(u^{(x)}_{ij} = x_{i} - x_{j}\) and \(u^{(y)}_{ij} = y_{i} - y_{j}\), it is easy to see that
Similarly, we can have
Then the expression in (2) can be written as
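The identities behind part (c) can be sketched as follows. This is a hedged outline rather than the chapter's own displays: it assumes that (2) denotes Kendall's general correlation coefficient, written here with the hypothetical symbol \(\Gamma\), and that all sums run over \(i, j = 1, \ldots, n\).

```latex
% Assumption: (2) is the general correlation coefficient
%   \Gamma = \sum_{i,j} u^{(x)}_{ij} u^{(y)}_{ij}
%            / \sqrt{\sum_{i,j} (u^{(x)}_{ij})^2 \sum_{i,j} (u^{(y)}_{ij})^2}.
\begin{align*}
\sum_{i,j} (x_i - x_j)^2
  &= \sum_{i,j} \big\{ (x_i - \bar{x}) - (x_j - \bar{x}) \big\}^2
   = 2n \sum_{i} (x_i - \bar{x})^2, \\
\sum_{i,j} (x_i - x_j)(y_i - y_j)
  &= 2n \sum_{i} (x_i - \bar{x})(y_i - \bar{y}), \\
\Gamma
  &= \frac{2n \sum_{i} (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{2n \sum_{i} (x_i - \bar{x})^2 \cdot 2n \sum_{i} (y_i - \bar{y})^2}}
   = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum_{i} (x_i - \bar{x})^2 \sum_{i} (y_i - \bar{y})^2}}.
\end{align*}
```

The cross terms vanish because \(\sum _{i} (x_{i} - \bar{x}) = 0\), and the same argument applies with \(y\) in place of \(x\). Under the stated assumption, the factors of \(2n\) cancel and the coefficient reduces to the Pearson correlation coefficient.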
Proof of Corollary 1
For binary variables \(x \in \{0, 1\}\) and \(y \in \{0, 1\}\) and data points \((x_{i}, y_{i})\), \(i=1,\ldots , n\), we can first check the Pearson correlation coefficient. First, we have
Second, we can get
Similarly, we obtain \(\sum _{i} (y_{i}- \bar{y})^{2} = \frac{1}{n} \big [ n_{\cdot 1} n_{\cdot 0} \big ]\). Thus,
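The binary case can be checked numerically. The sketch below assumes the corollary's conclusion is the familiar phi coefficient \(\phi = (n_{11} n_{00} - n_{10} n_{01}) / \sqrt{n_{1 \cdot } n_{0 \cdot } n_{\cdot 1} n_{\cdot 0}}\), where the cell counts \(n_{ab}\) (the number of cases with \(x_{i} = a\) and \(y_{i} = b\)) and the function names are illustrative choices rather than notation taken from the chapter.

```python
import math


def pearson(x, y):
    """Pearson correlation computed directly from centered sums."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def phi_coefficient(x, y):
    """Phi coefficient from the 2x2 table of binary data (assumed notation)."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    row1, row0 = n11 + n10, n01 + n00  # margins of x: n_{1.}, n_{0.}
    col1, col0 = n11 + n01, n10 + n00  # margins of y: n_{.1}, n_{.0}
    return (n11 * n00 - n10 * n01) / math.sqrt(row1 * row0 * col1 * col0)


# Small illustrative data set (made up for this check, not from the chapter).
x = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

r = pearson(x, y)
phi = phi_coefficient(x, y)
print(r, phi)  # the two values should agree up to floating-point rounding
```

Both functions require each variable to take both values at least once; otherwise a margin is zero and the denominator vanishes.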
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Guo, Q., Deng, X., Ravishanker, N. (2022). Association-Based Optimal Subpopulation Selection for Multivariate Data. In: Bekker, A., Ferreira, J.T., Arashi, M., Chen, DG. (eds) Innovations in Multivariate Statistical Modeling. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-031-13971-0_1
DOI: https://doi.org/10.1007/978-3-031-13971-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13970-3
Online ISBN: 978-3-031-13971-0
eBook Packages: Mathematics and Statistics (R0)