Association-Based Optimal Subpopulation Selection for Multivariate Data

Guo, Qing; Deng, Xinwei; Ravishanker, Nalini

doi:10.1007/978-3-031-13971-0_1

Part of the book series: Emerging Topics in Statistics and Biostatistics (ETSB)

Abstract

In the analysis of multivariate data, a problem of practical interest is to identify a subset of observations for which the variables are strongly associated. One example is driving safety analytics, where we may wish to identify a subset of drivers whose driving behavior characteristics are strongly associated. Other application domains include finance, health care, and marketing. Existing approaches, such as the Top-k method and the tau-path approach, primarily address bivariate data and/or invoke a normality assumption, and adapting them directly to the multivariate framework is cumbersome. In this work, we propose a semiparametric statistical approach for optimal subpopulation selection based on the patterns of association in multivariate data. The proposed method leverages the concept of general correlation coefficients to enable the optimal selection of a subpopulation for a variety of association patterns. We develop efficient algorithms that sequentially include cases in the subpopulation. We illustrate the performance of the proposed method using simulated data and a real data set.
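To make the idea of sequential inclusion concrete, the following is a minimal sketch of a greedy forward selection. It is not the chapter's algorithm: the function names (`pearson`, `association_score`, `greedy_subpopulation`), the user-supplied seed set, the toy data, and the choice of the mean absolute pairwise Pearson correlation as the association criterion are all illustrative assumptions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant variable as unassociated
    return sxy / math.sqrt(sxx * syy)


def association_score(rows):
    """Mean absolute correlation over all variable pairs (assumed criterion)."""
    cols = list(zip(*rows))
    pairs = list(combinations(range(len(cols)), 2))
    return sum(abs(pearson(cols[a], cols[b])) for a, b in pairs) / len(pairs)


def greedy_subpopulation(data, seed, size):
    """Grow `seed` (a list of row indices) one case at a time up to `size` cases."""
    selected = list(seed)
    remaining = [i for i in range(len(data)) if i not in selected]
    while len(selected) < size and remaining:
        # Include the case whose addition yields the strongest association.
        best = max(
            remaining,
            key=lambda i: association_score([data[j] for j in selected + [i]]),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: rows 0-5 satisfy y = 2x exactly, rows 6-9 do not.
data = [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8), (5, 10),
        (1, 9), (4, -3), (2.5, 7), (0, 5)]
chosen = greedy_subpopulation(data, seed=[0, 1, 2], size=6)
```

On this toy input the sketch adds only cases that lie on the line, since any of the other four rows would lower the score. A greedy search of this kind carries no optimality guarantee; it only illustrates the case-by-case growth of a subpopulation.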

Notes

1. https://github.com/acaloiaro/topk-taupath

Acknowledgements

We are grateful for very useful referee comments that helped us enhance the paper.

Author information

Authors and Affiliations

Department of Statistics, Virginia Tech, Blacksburg, VA, USA
Qing Guo
Department of Statistics, Virginia Tech, Blacksburg, VA, USA
Xinwei Deng
Department of Statistics, University of Connecticut, Mansfield, CT, USA
Nalini Ravishanker

Corresponding author

Correspondence to Nalini Ravishanker.

Editor information

Editors and Affiliations

Department of Statistics, University of Pretoria, Pretoria, South Africa
Andriëtte Bekker
Department of Statistics, University of Pretoria, Pretoria, South Africa
Johannes T. Ferreira
Department of Statistics, Ferdowsi University of Mashhad, Mashhad, Iran
Mohammad Arashi
Department of Statistics, University of Pretoria, Pretoria, South Africa
Ding-Geng Chen

Appendix

Proof of Proposition 1

For (c), if $u^{(x)}_{ij} = x_{i} - x_{j}$ and $u^{(y)}_{ij} = y_{i} - y_{j}$, it is easy to see that

$$\begin{aligned} \sum _{i, j} u^{(x)}_{ij} u^{(y)}_{ij}&= \sum _{i, j} (x_{i} - x_{j}) (y_{i} - y_{j}) = 2 \sum _{i, j} x_{i}y_{i} - 2 \sum _{i, j} x_{i}y_{j} \\&= 2n \sum _{i} x_{i}y_{i} - 2n^{2} \bar{x} \bar{y} \\&= 2n(\sum _{i} x_{i}y_{i} - n \bar{x} \bar{y}) \\&= 2n \sum _{i} (x_{i}- \bar{x}) (y_{i} - \bar{y}). \end{aligned}$$

Similarly, we have

$$\begin{aligned} \sum _{i, j} (u^{(x)}_{ij})^{2}&= \sum _{i, j} (x_{i} - x_{j})^{2} = 2 \sum _{i, j} x_{i}^{2} - 2 \sum _{i, j} x_{i}x_{j} \\&= 2n \sum _{i} x_{i}^{2} - 2n^{2} \bar{x}^{2} \\&= 2n(\sum _{i} x_{i}^{2} - n \bar{x}^{2}) \\&= 2n \sum _{i} (x_{i}- \bar{x})^{2}. \end{aligned}$$

Then the expression in (2) reduces to the Pearson correlation coefficient:

$$\begin{aligned} \tau (x, y)&= \frac{\sum _{i, j} u^{(x)}_{ij} u^{(y)}_{ij}}{\sqrt{ \sum _{i, j} (u^{(x)}_{ij})^{2} \sum _{i,j} (u^{(y)}_{ij})^{2} } } = \frac{ \sum _{i, j} (x_{i} - x_{j}) (y_{i} - y_{j}) }{\sqrt{ \sum _{i, j} (x_{i} - x_{j})^{2} \sum _{i,j} (y_{i} - y_{j})^{2} } } \\&= \frac{ 2n \sum _{i} (x_{i}- \bar{x}) (y_{i} - \bar{y}) }{\sqrt{ 2n \sum _{i} (x_{i}- \bar{x})^{2} \, 2n \sum _{i} (y_{i}- \bar{y})^{2} } } \\&= \frac{\sum _{i} (x_{i}- \bar{x}) (y_{i} - \bar{y}) }{\sqrt{ \sum _{i} (x_{i}- \bar{x})^{2} \sum _{i} (y_{i}- \bar{y})^{2} } }. \end{aligned}$$
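The identity can also be checked numerically. The sketch below evaluates the pairwise-score form of the coefficient directly and compares it with the usual centered formula; the function names, the `score` parameter, and the sample values are illustrative assumptions rather than anything taken from the chapter.

```python
import math


def general_corr(x, y, score=lambda a, b: a - b):
    """Pairwise-score coefficient with u_ij = score(v_i, v_j), summed over all (i, j)."""
    n = len(x)
    num = sxx = syy = 0.0
    for i in range(n):
        for j in range(n):
            ux = score(x[i], x[j])
            uy = score(y[i], y[j])
            num += ux * uy
            sxx += ux * ux
            syy += uy * uy
    return num / math.sqrt(sxx * syy)


def pearson(x, y):
    """Sample Pearson correlation from centered sums."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical sample values, chosen only for illustration.
x = [1.0, 2.5, 3.0, 4.5, 7.0]
y = [2.0, 1.5, 4.0, 3.5, 6.0]

diff_form = general_corr(x, y)   # difference scores, as in case (c)
centered_form = pearson(x, y)    # agrees with diff_form up to rounding


def sign(a, b):
    """Sign score; with no ties this choice gives Kendall's rank coefficient."""
    return (a > b) - (a < b)


monotone = general_corr([1, 2, 3, 4], [1, 4, 9, 16], score=sign)
```

With difference scores the two functions agree up to floating-point rounding, which mirrors the cancellation of the common factor $2n$ in the derivation above.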

Proof of Corollary 1

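For binary variables $x \in \{0, 1\}$ and $y \in \{0, 1\}$ with data points $(x_{i}, y_{i})$, $i=1,\ldots , n$, we start from the Pearson form of $\tau (x, y)$. Let $n_{ab}$ denote the number of observations with $(x_{i}, y_{i}) = (a, b)$, and write $n_{a\cdot } = n_{a0} + n_{a1}$ and $n_{\cdot b} = n_{0b} + n_{1b}$ for the marginal counts. First, we have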

$$\begin{aligned} \sum _{i} (x_{i}- \bar{x}) (y_{i} - \bar{y})&= \sum _{i} x_{i}y_{i} - n \bar{x} \bar{y} = n_{11} - n \frac{n_{1\cdot }}{n}\frac{n_{\cdot 1}}{n} \\&= \frac{1}{n} \big [ n_{11} n - n_{1\cdot }n_{\cdot 1}\big ] \\&= \frac{1}{n} \big [ n_{11} (n_{11}+n_{10}+n_{01}+n_{00}) - (n_{10}+n_{11})(n_{01}+n_{11})\big ] \\&= \frac{1}{n} \big [ n_{11}n_{00} - n_{10}n_{01} \big ]. \end{aligned}$$

Second, we have

$$\begin{aligned} \sum _{i} (x_{i}- \bar{x})^{2}&= \sum _{i} x_{i}^{2} - n \bar{x}^{2} = n_{1\cdot } - n \left( \frac{n_{1\cdot }}{n} \right) ^{2} \\&= \frac{1}{n} \big [ n_{1\cdot }n -n_{1\cdot }^{2} \big ] \\&= \frac{1}{n} \big [ n_{1\cdot } n_{0\cdot } \big ]. \end{aligned}$$

Similarly, we obtain $\sum _{i} (y_{i}- \bar{y})^{2} = \frac{1}{n} \big [ n_{\cdot 1} n_{\cdot 0} \big ]$. Thus,

$$\begin{aligned} \tau (x, y)&= \frac{\sum _{i} (x_{i}- \bar{x}) (y_{i} - \bar{y}) }{\sqrt{ \sum _{i} (x_{i}- \bar{x})^{2} \sum _{i} (y_{i}- \bar{y})^{2} } } \\&= \frac{n_{11}n_{00} - n_{10}n_{01}}{\sqrt{n_{1\cdot }n_{0\cdot }n_{\cdot 1}n_{\cdot 0}}}. \end{aligned}$$
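As a quick sanity check of this closed form, the sketch below computes the cell counts of a small 0/1 data set and compares the count-based expression with the Pearson correlation computed directly. The function names and the data are illustrative assumptions.

```python
import math


def phi_from_counts(x, y):
    """(n11*n00 - n10*n01) / sqrt(n1. * n0. * n.1 * n.0) for 0/1 data."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    row1, row0 = n11 + n10, n01 + n00   # marginal counts of x = 1 and x = 0
    col1, col0 = n11 + n01, n10 + n00   # marginal counts of y = 1 and y = 0
    return (n11 * n00 - n10 * n01) / math.sqrt(row1 * row0 * col1 * col0)


def pearson(x, y):
    """Sample Pearson correlation from centered sums."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical binary data with cell counts n11 = 3, n10 = 1, n01 = 1, n00 = 3.
x = [1, 1, 1, 0, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0, 1, 1, 0]

phi = phi_from_counts(x, y)   # (3*3 - 1*1) / sqrt(4*4*4*4) = 0.5
r = pearson(x, y)             # matches phi up to rounding
```

The count-based form needs only the four cell counts, so it can be updated cheaply when a single case is added to or removed from a subset.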

About this chapter

Cite this chapter

Guo, Q., Deng, X., Ravishanker, N. (2022). Association-Based Optimal Subpopulation Selection for Multivariate Data. In: Bekker, A., Ferreira, J.T., Arashi, M., Chen, DG. (eds) Innovations in Multivariate Statistical Modeling. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-031-13971-0_1

DOI: https://doi.org/10.1007/978-3-031-13971-0_1
Published: 16 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13970-3
Online ISBN: 978-3-031-13971-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
