Association-Based Optimal Subpopulation Selection for Multivariate Data

Innovations in Multivariate Statistical Modeling

Abstract

In the analysis of multivariate data, a useful problem is to identify a subset of observations for which the variables are strongly associated. One example arises in driving safety analytics, where we may wish to identify a subset of drivers with a strong association among their driving behavior characteristics. Other application domains include finance, health care, and marketing. Existing approaches, such as the Top-k method or the tau-path approach, primarily address bivariate data and/or invoke the normality assumption, and adapting them directly to the multivariate framework is cumbersome. In this work, we propose a semiparametric statistical approach for optimal subpopulation selection based on the patterns of association in multivariate data. The proposed method leverages the concept of general correlation coefficients to enable the optimal selection of a subpopulation for a variety of association patterns. We develop efficient algorithms that sequentially include cases in the subpopulation. We illustrate the performance of the proposed method using simulated data and a real data set.
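
The idea of sequentially including cases can be illustrated with a rough sketch. This is not the chapter's algorithm: the criterion (average absolute pairwise Pearson correlation), the seed cases, the stopping size, and the names `mean_abs_correlation` and `greedy_subpopulation` are all assumptions made for illustration only.

```python
import numpy as np

def mean_abs_correlation(data):
    """Average absolute off-diagonal Pearson correlation among the columns."""
    corr = np.corrcoef(data, rowvar=False)
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diagonal)))

def greedy_subpopulation(data, seed_rows, size):
    """Grow a subset one case at a time, adding the case that maximizes the criterion."""
    selected = list(seed_rows)
    remaining = [i for i in range(len(data)) if i not in selected]
    while len(selected) < size and remaining:
        # Score every candidate by the criterion of the subset it would produce.
        scores = [mean_abs_correlation(data[selected + [i]]) for i in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (an assumption for illustration): 40 cases whose three
# variables share a latent factor, followed by 60 cases of independent noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 1))
associated = latent + 0.3 * rng.normal(size=(40, 3))
noise = rng.normal(size=(60, 3))
data = np.vstack([associated, noise])

selected = greedy_subpopulation(data, seed_rows=[0, 1, 2, 3, 4], size=30)
print(mean_abs_correlation(data[selected]), mean_abs_correlation(data))
```

The greedy step re-evaluates the criterion for every remaining case, which is simple but quadratic in the number of cases; a practical implementation would presumably update the criterion incrementally.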

Notes

  1. https://github.com/acaloiaro/topk-taupath

Acknowledgements

We are grateful for very useful referee comments that helped us enhance the paper.

Author information

Corresponding author

Correspondence to Nalini Ravishanker.

Appendix

Proof of Proposition 1

For part (c), let \(u^{(x)}_{ij} = x_{i} - x_{j}\) and \(u^{(y)}_{ij} = y_{i} - y_{j}\). Expanding the product and using \(\sum _{i} x_{i} = n \bar{x}\) and \(\sum _{i} y_{i} = n \bar{y}\), we obtain

$$\begin{aligned} \sum _{i, j} u^{(x)}_{ij} u^{(y)}_{ij}&= \sum _{i, j} (x_{i} - x_{j}) (y_{i} - y_{j}) = 2 \sum _{i, j} x_{i}y_{i} - 2 \sum _{i, j} x_{i}y_{j} \\&= 2n \sum _{i} x_{i}y_{i} - 2n^{2} \bar{x} \bar{y} \\&= 2n(\sum _{i} x_{i}y_{i} - n \bar{x} \bar{y}) \\&= 2n \sum _{i} (x_{i}- \bar{x}) (y_{i} - \bar{y}). \end{aligned}$$

Similarly, replacing \(y\) by \(x\) in the identity above gives

$$\begin{aligned} \sum _{i, j} (u^{(x)}_{ij})^{2}&= \sum _{i, j} (x_{i} - x_{j})^{2} = 2 \sum _{i, j} x_{i}^{2} - 2 \sum _{i, j} x_{i}x_{j} \\&= 2n \sum _{i} x_{i}^{2} - 2n^{2} \bar{x}^{2} \\&= 2n(\sum _{i} x_{i}^{2} - n \bar{x}^{2}) \\&= 2n \sum _{i} (x_{i}- \bar{x})^{2}. \end{aligned}$$

Then the expression in (2) can be written as

$$\begin{aligned} \tau (x, y)&= \frac{\sum _{i, j} u^{(x)}_{ij} u^{(y)}_{ij}}{\sqrt{ \sum _{i, j} (u^{(x)}_{ij})^{2} \sum _{i,j} (u^{(y)}_{ij})^{2} } } = \frac{ \sum _{i, j} (x_{i} - x_{j}) (y_{i} - y_{j}) }{\sqrt{ \sum _{i, j} (x_{i} - x_{j})^{2} \sum _{i,j} (y_{i} - y_{j})^{2} } } \\&= \frac{ 2n \sum _{i} (x_{i}- \bar{x}) (y_{i} - \bar{y}) }{\sqrt{ 2n \sum _{i} (x_{i}- \bar{x})^{2} \, 2n \sum _{i} (y_{i}- \bar{y})^{2} } } \\&= \frac{\sum _{i} (x_{i}- \bar{x}) (y_{i} - \bar{y}) }{\sqrt{ \sum _{i} (x_{i}- \bar{x})^{2} \sum _{i} (y_{i}- \bar{y})^{2} } }. \end{aligned}$$
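
The identity can also be checked numerically. The sketch below assumes that the general correlation coefficient is the ratio in the first line of the display above; the function names `general_correlation` and `difference_score` and the simulated data are illustrative assumptions, not part of the chapter.

```python
import numpy as np

def general_correlation(x, y, score):
    """Ratio sum(u_x * u_y) / sqrt(sum(u_x^2) * sum(u_y^2)) over all pairs (i, j)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ux = score(x[:, None], x[None, :])  # n-by-n matrix with entries u(x_i, x_j)
    uy = score(y[:, None], y[None, :])
    return np.sum(ux * uy) / np.sqrt(np.sum(ux**2) * np.sum(uy**2))

def difference_score(a, b):
    """Score used in part (c): u_ij = a_i - a_j."""
    return a - b

# Simulated data, chosen only for illustration.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(size=n)

tau = general_correlation(x, y, difference_score)
pearson = np.corrcoef(x, y)[0, 1]

# Intermediate identity: sum_{i,j} (x_i - x_j)^2 = 2n * sum_i (x_i - mean)^2.
lhs = np.sum((x[:, None] - x[None, :]) ** 2)
rhs = 2 * n * np.sum((x - x.mean()) ** 2)

print(tau, pearson)
print(lhs, rhs)
```

Up to floating-point error, the two printed pairs should agree, as the derivation predicts.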

Proof of Corollary 1

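For binary variables \(x \in \{0, 1\}\) and \(y \in \{0, 1\}\) and data points \((x_{i}, y_{i})\), \(i=1,\ldots , n\), we evaluate each term of the Pearson correlation coefficient. Here \(n_{ab}\) denotes the number of observations with \(x_{i} = a\) and \(y_{i} = b\), with marginal counts \(n_{a\cdot } = n_{a0} + n_{a1}\) and \(n_{\cdot b} = n_{0b} + n_{1b}\). First, we have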

$$\begin{aligned} \sum _{i} (x_{i}- \bar{x}) (y_{i} - \bar{y})&= \sum _{i} x_{i}y_{i} - n \bar{x} \bar{y} = n_{11} - n \frac{n_{1\cdot }}{n}\frac{n_{\cdot 1}}{n} \\&= \frac{1}{n} \big [ n_{11} n - n_{1\cdot }n_{\cdot 1}\big ] \\&= \frac{1}{n} \big [ n_{11} (n_{11}+n_{10}+n_{01}+n_{00}) - (n_{10}+n_{11})(n_{01}+n_{11})\big ] \\&= \frac{1}{n} \big [ n_{11}n_{00} - n_{10}n_{01} \big ]. \end{aligned}$$

Second, since \(x_{i}^{2} = x_{i}\) for binary \(x_{i}\) and \(n - n_{1\cdot } = n_{0\cdot }\), we get

$$\begin{aligned} \sum _{i} (x_{i}- \bar{x})^{2}&= \sum _{i} x_{i}^{2} - n \bar{x}^{2} = n_{1\cdot } - n (\frac{n_{1\cdot }}{n})^{2} \\&= \frac{1}{n} \big [ n_{1\cdot }n -n_{1\cdot }^{2} \big ] \\&= \frac{1}{n} \big [ n_{1\cdot } n_{0\cdot } \big ]. \end{aligned}$$

Similarly, we obtain \(\sum _{i} (y_{i}- \bar{y})^{2} = \frac{1}{n} \big [ n_{\cdot 1} n_{\cdot 0} \big ]\). The factors of \(1/n\) in the numerator and denominator cancel, and thus

$$\begin{aligned} \tau (x, y)&= \frac{\sum _{i} (x_{i}- \bar{x}) (y_{i} - \bar{y}) }{\sqrt{ \sum _{i} (x_{i}- \bar{x})^{2} \sum _{i} (y_{i}- \bar{y})^{2} } } \\&= \frac{n_{11}n_{00} - n_{10}n_{01}}{\sqrt{n_{1\cdot }n_{0\cdot }n_{\cdot 1}n_{\cdot 0}}}. \end{aligned}$$
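
A quick numerical check of this closed form, as a sketch on simulated binary data. The helper name `phi_from_counts` and the data-generating step are assumptions for illustration; the counts follow the convention that \(n_{ab}\) counts observations with \(x_{i} = a\) and \(y_{i} = b\), as implied by the derivation above.

```python
import numpy as np

def phi_from_counts(n11, n10, n01, n00):
    """Closed form (n11*n00 - n10*n01) / sqrt(n1. * n0. * n.1 * n.0)."""
    n1_dot, n0_dot = n11 + n10, n01 + n00  # totals for x = 1 and x = 0
    n_dot1, n_dot0 = n11 + n01, n10 + n00  # totals for y = 1 and y = 0
    return (n11 * n00 - n10 * n01) / np.sqrt(n1_dot * n0_dot * n_dot1 * n_dot0)

# Simulated binary data: y copies x with probability 0.7, otherwise flips it.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=200)
y = np.where(rng.random(200) < 0.7, x, 1 - x)

# Cell counts of the 2-by-2 table.
n11 = int(np.sum((x == 1) & (y == 1)))
n10 = int(np.sum((x == 1) & (y == 0)))
n01 = int(np.sum((x == 0) & (y == 1)))
n00 = int(np.sum((x == 0) & (y == 0)))

phi = phi_from_counts(n11, n10, n01, n00)
pearson = np.corrcoef(x, y)[0, 1]
print(phi, pearson)
```

The two printed values should agree up to floating-point error. The closed form is undefined when any marginal count is zero, which corresponds to a constant variable.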

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this chapter

Guo, Q., Deng, X., Ravishanker, N. (2022). Association-Based Optimal Subpopulation Selection for Multivariate Data. In: Bekker, A., Ferreira, J.T., Arashi, M., Chen, DG. (eds) Innovations in Multivariate Statistical Modeling. Emerging Topics in Statistics and Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-031-13971-0_1
