Cluster analysis application to identify groups of individuals with high health expenditures


We compare two clustering methods and demonstrate their effectiveness in identifying characteristic profiles of high utilizers of health care. We use three mutually independent sets of longitudinal data that are nationally representative of the US civilian non-institutionalized population of working-age adults. We compare k-means, a commonly used clustering method, with a k-medoids algorithm called Partitioning Around Medoids. For both methods, we use one cohort of data to create clusters of individuals with similar characteristics. We examine the characteristic compositions of the three clusters with the highest average total expenditures in this cohort. We also examine the health expenditure distributions for this cohort over the following two years. We validate the approach by applying the cluster centers to two other cohorts of similar data. We form clusters based on demographic, economic, and health-related characteristics that are commonly used in studies of health care utilization. We demonstrate the consistency of our results across the three cohorts and across different types of health expenditures, such as office-based/outpatient and drug expenditures. Clusters can also be formed with other, more homogeneous data, such as data from Medicaid, Medicare, employer-sponsored insurance, or individual private plans issued under the Affordable Care Act. This approach can be used to follow similar groups over time for other types of health outcomes.
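The validation step, in which cluster centers from one cohort are applied to other cohorts, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy records, and the simple-matching dissimilarity, which stands in for whatever similarity measure a study actually uses and is not the measure used in this article.

```python
def simple_matching(x, y):
    """Share of attributes on which two records disagree (stand-in measure)."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def assign_to_centers(records, centers, dissim=simple_matching):
    """Label each record with the index of its least-dissimilar center."""
    return [
        min(range(len(centers)), key=lambda j: dissim(record, centers[j]))
        for record in records
    ]


# Hypothetical centers obtained from a first cohort: (age band, health, insurance).
centers = [
    ("50-64", "poor", "insured"),
    ("18-34", "good", "uninsured"),
]

# Hypothetical records from a later cohort, described by the same attributes.
new_cohort = [
    ("50-64", "poor", "uninsured"),
    ("18-34", "good", "insured"),
    ("35-49", "good", "uninsured"),
]

labels = assign_to_centers(new_cohort, centers)
print(labels)
```

Because the centers are held fixed, the later cohorts play no role in forming the clusters, so their expenditure distributions by cluster serve as an out-of-sample check.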




  1. Aday, L.A., Andersen, R.: A framework for the study of access to medical care. Health Serv. Res. 9(3), 208 (1974)


  2. Agency for Healthcare Research and Quality: Medical Expenditure Panel Survey. US Department of Health and Human Services (2020)

  3. Ahmad, A., Khan, S.S.: Survey of state-of-the-art mixed data clustering algorithms. IEEE Access 7, 31883–31902 (2019)


  4. Andersen, R.: A behavioral model of families’ use of health services. 25, Chicago: Center for Health Administration Studies, 5720 S. Woodlawn Avenue, University of Chicago, Illinois 60637, USA (1968)

  5. Andersen, R., Newman, J.F.: Societal and individual determinants of medical care utilization in the United States. The Milbank Memorial Fund Quarterly Health and Society, pp. 95–124 (1973)

  6. Aranganayagi, S., Thangavel, K.: Improved k-modes for categorical clustering using weighted dissimilarity measure. World Acad. Sci. Eng. Technol. 3, 813–819 (2009)


  7. Barbará, D., Li, Y., Couto, J.: Coolcat: an entropy-based algorithm for categorical clustering. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management. ACM, pp. 582–589 (2002)

  8. Bayliss, E.A., Powers, J.D., Ellis, J.L., Barrow, J.C., Strobel, M., Beck, A.: Applying sequential analytic methods to self-reported information to anticipate care needs. eGEMs (2016).


  9. Boriah, S., Chandola, V., Kumar, V.: Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 2008 SIAM International Conference on Data Mining, SIAM, pp. 243–254 (2008)

  10. Boscardin, C.K., Gonzales, R., Bradley, K.L., Raven, M.C.: Predicting cost of care using self-reported health status data. BMC Health Serv. Res. 15(1), 406 (2015)


  11. Charlson, M., Wells, M.T., Ullman, R., King, F., Shmukler, C.: The Charlson comorbidity index can be used prospectively to identify patients who will incur high future costs. PLoS ONE 9(12), e112479 (2014)


  12. Cibulková, J., Šulc, Z., Sirota, S., Rezanková, H.: The effect of binary data transformation in categorical data clustering. STATISTICS (2019).


  13. Crawford, A.G., Fuhr Jr., J.P., Clarke, J., Hubbs, B.: Comparative effectiveness of total population versus disease-specific neural network models in predicting medical costs. Dis. Manag. 8(5), 277–287 (2005)


  14. Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96, 226–231 (1996)


  15. Fleishman, J.A., Cohen, J.W.: Using information on clinical conditions to predict high-cost patients. Health Serv. Res. 45(2), 532–552 (2010)


  16. Goodall, D.W.: A new similarity index based on probability. Biometrics 22, 882–907 (1966)


  17. Gower, J.C.: Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53(3–4), 325–338 (1966)


  18. Guha, S., Rastogi, R., Shim, K.: Rock: a robust clustering algorithm for categorical attributes. Inf. Syst. 25(5), 345–366 (2000)


  19. Hamad, R., Modrek, S., Kubo, J., Goldstein, B.A., Cullen, M.R.: Using “big data” to capture overall health status: properties and predictive value of a claims-based health risk score. PLoS ONE 10(5), e0126054 (2015)


  20. Healthy People: Social Determinants. Office of Disease Prevention and Health Promotion, Washington, D.C. (2020)

  21. Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 2(3), 283–304 (1998)


  22. Huang, Z., Ng, M.K.: A fuzzy k-modes algorithm for clustering categorical data. IEEE Trans. Fuzzy Syst. 7(4), 446–452 (1999)


  23. Ienco, D., Pensa, R.G., Meo, R.: From context to distance: learning dissimilarity for categorical data clustering. ACM Trans. Knowl. Discov. Data (TKDD) 6(1), 1 (2012)


  24. Jia, H., Cheung, Y.M., Liu, J.: A new distance metric for unsupervised learning of categorical data. IEEE Trans. Neural Netw. Learn. Syst. 27(5), 1065–1079 (2016)


  25. Kaufman, L., Rousseeuw, P.J.: Partitioning around medoids (program pam). Finding groups in data: An introduction to cluster analysis pp. 68–125 (2005)

  26. Kim, D.W., Lee, K., Lee, D., Lee, K.H.: A k-populations algorithm for clustering categorical data. Pattern Recognit. 38(7), 1131–1134 (2005)


  27. Kim, K., Rosenberg, M.A.: Determinants of persistent high utilizers in US adults using nationally representative data. N. Am. Actuar. J. 24(1), 1–21 (2020)


  28. Lee, N.S., Whitman, N., Vakharia, N., Rothberg, M.B.: High-cost patients: hot-spotters don’t explain the half of it. J. Gen. Intern. Med. 32(1), 28–34 (2017)


  29. Leisch, F.: Neighborhood graphs, stripes and shadow plots for cluster visualization. Stat. Comput. 20(4), 457–469 (2010)


  30. Li, T., Ma, S., Ogihara, M.: Entropy-based criterion in categorical clustering. In: Proceedings of the Twenty-First International Conference on Machine Learning. ACM, p. 68 (2004)

  31. Liao, M., Li, Y., Kianifard, F., Obi, E., Arcona, S.: Cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis. BMC Nephrol. 17(1), 25 (2016)


  32. Little, R.J., Rubin, D.B.: Statistical Analysis with Missing Data, 3rd edn. Wiley, Hoboken (2020)


  33. Long, P., Abrams, M., Milstein, A., Anderson, G., Apton, K., Dahlberg, M., et al.: Effective care for high-need patients, Washington DC (2017)

  34. Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.: cluster: cluster analysis basics and extensions. R package version 2.0.7-1 (2018)

  35. Mitchell, E.: Statistical brief #497: concentration of health expenditures in the US civilian noninstitutionalized population, 2014 (2016)

  36. Morissette, L., Chartier, S.: The k-means clustering technique: general considerations and implementation in Mathematica. Tutor. Quant. Methods Psychol. 9(1), 15–24 (2013)


  37. National Center for Health Statistics: National Health Interview Survey. Centers for Disease Control and Prevention (2020)

  38. National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention: NCHHSTP Social Determinants. Centers for Disease Control and Prevention, Washington, D.C. (2020)

  39. Peltz, A., Hall, M., Rubin, D.M., Mandl, K.D., Neff, J., Brittan, M., Cohen, E., Hall, D.E., Kuo, D.Z., Agrawal, R., et al.: Hospital utilization among children with the highest annual inpatient cost. Pediatrics 137(2), e20151829 (2016)


  40. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2020)

  41. Řezanková, H.: Cluster analysis of economic data. Statistika 94(1), 73–86 (2014)


  42. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)


  43. Shenas, S.A.I., Raahemi, B., Tekieh, M.H., Kuziemsky, C.: Identifying high-cost patients using data mining techniques and a small set of non-trivial attributes. Comput. Biol. Med. 53, 9–18 (2014)


  44. Sneath, P.H., Sokal, R.R.: Numerical Taxonomy. The Principles and Practice of Numerical Classification. W.H. Freeman and Company, New York (1973)


  45. Sokal, R.R., Camin, J., Rohlf, F., Sneath, P.: Numerical taxonomy: some points of view. Syst. Zool. 14(3), 237–243 (1965)


  46. Šulc, Z., Řezanková, H.: Comparison of similarity measures for categorical data in hierarchical clustering. J. Classif. 36(1), 58–72 (2019)


  47. Šulc, Z., Matějka, M., Procházka, J., Řezanková, H.: Evaluation of the Gower coefficient modifications in hierarchical clustering. Metodoloski Zvezki 14, 37–48 (2017)


  48. Thorndike, R.L.: Who belongs in the family? Psychometrika 18(4), 267–276 (1953)


  49. Wammes, J.J.G., van der Wees, P.J., Tanke, M.A., Westert, G.P., Jeurissen, P.P.: Systematic review of high-cost patients’ characteristics and healthcare utilisation. BMJ Open 8(9), e023113 (2018)


  50. Wherry, L.R., Burns, M.E., Leininger, L.J.: Using self-reported health measures to predict high-need cases among Medicaid-eligible adults. Health Serv. Res. 49(S2), 2147–2172 (2014)


  51. Zhu, M., Ghodsi, A.: Automatic dimensionality selection from the scree plot via the use of profile likelihood. Comput. Stat. Data Anal. 51(2), 918–930 (2006)


  52. Zook, C.J., Moore, F.D.: High-cost users of medical care. N. Engl. J. Med. 302(18), 996–1002 (1980)




We acknowledge the Society of Actuaries Center of Excellence Research Grant Program for partial support of this research.

Author information



Corresponding authors

Correspondence to Joshua Agterberg or Marjorie Rosenberg.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

All authors approve of the article contents.

Informed consent

NHIS and MEPS data are publicly available.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Diagnostic results


The elbow, silhouette, and stripes plots guide the choice of the number of clusters. We choose 15 clusters because this number is large enough to show differences between the clusters but not so large as to be unmanageable, in line with one of the principles of Sneath and Sokal (1973). The elbow plot in Fig. 8a indicates that 6, 15, or 22 clusters are good choices under the method of Zhu and Ghodsi (2006). The graph of the silhouette coefficients in Fig. 8b peaks at 6 clusters and has local maxima at 12 and 15 clusters. The overall silhouette index is 0.018.
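As a point of reference for the silhouette index reported above, the following is a minimal sketch of how an average silhouette width can be computed from a precomputed dissimilarity matrix, following the definition in Rousseeuw (1987). The function name, the toy one-dimensional data, and the absolute-difference dissimilarity are assumptions made for illustration; they are not the data or the dissimilarity measure used in this study.

```python
from statistics import mean


def silhouette_index(dissim, labels):
    """Average silhouette width from a square dissimilarity matrix."""
    # Group observation indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # singleton cluster: s(i) = 0 by convention
            continue
        # a: mean dissimilarity to the other members of the own cluster.
        a = mean(dissim[i][j] for j in own)
        # b: smallest mean dissimilarity to the members of any other cluster.
        b = min(
            mean(dissim[i][j] for j in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return mean(widths)


# Toy data: two well-separated groups on a line (illustrative only).
points = [0.0, 1.0, 10.0, 11.0]
dissim = [[abs(p - q) for q in points] for p in points]
labels = [0, 0, 1, 1]
score = silhouette_index(dissim, labels)
print(round(score, 3))
```

On this well-separated toy example the index is close to 0.9. Values near zero are generally read as a sign that clusters overlap, which is consistent with the overlap between clusters visible in the stripes plot.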

Each column in the stripes plot in Fig. 9 shows the dissimilarities from the medoid of cluster \(j\), \(j = 1, \ldots , 15\), for the observations within cluster \(j\), as well as the dissimilarities from the medoid of cluster \(j\) for those observations in another cluster whose second-closest medoid is that of cluster \(j\). Each column \(j\) can have up to 15 sub-columns, with the markings in cluster order from \(1\) to \(15\). The clusters are labeled by average total expenditures, from cluster 1 with the highest average to cluster 15 with the lowest. In all columns, sub-column 1 is reserved for individuals from cluster 1. The cluster 1 column shows additional entries for individuals who are secondarily assigned to cluster 1 but whose primary cluster is not cluster 1. The other cluster columns are defined similarly. The cluster 1 column shows some secondary assignments, but fewer than the middle-expenditure clusters. Also, cluster 1 individuals more frequently have cluster 2 as their second-closest cluster. These findings help justify applying the method to identify the very high utilizers of health care.
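The bookkeeping behind such a plot can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, clusters are indexed from 0 rather than 1, and the dissimilarities to three medoids are made-up numbers, not values from this study.

```python
def closest_two_medoids(dissim_to_medoids):
    """For each observation, find its closest and second-closest medoid.

    Each row holds one observation's dissimilarities to the k medoids.
    Returns tuples (primary, secondary, d_primary, d_secondary).
    """
    out = []
    for row in dissim_to_medoids:
        order = sorted(range(len(row)), key=lambda j: row[j])
        first, second = order[0], order[1]
        out.append((first, second, row[first], row[second]))
    return out


def stripes_columns(assignments, k):
    """Collect the values drawn in each column of a stripes plot.

    Column j maps a primary cluster (the sub-column) to dissimilarities
    from medoid j: own members under key j, plus observations from other
    clusters whose second-closest medoid is j.
    """
    columns = [{} for _ in range(k)]
    for first, second, d1, d2 in assignments:
        columns[first].setdefault(first, []).append(d1)
        columns[second].setdefault(first, []).append(d2)
    return columns


# Hypothetical dissimilarities of four observations to three medoids.
rows = [
    [0.1, 0.5, 0.9],
    [0.2, 0.3, 0.8],
    [0.7, 0.1, 0.4],
    [0.9, 0.6, 0.2],
]
assignments = closest_two_medoids(rows)
columns = stripes_columns(assignments, k=3)
print(columns[1])
```

In this toy case, column 1 holds its own member together with secondary entries from clusters 0 and 2, while column 0 has no secondary entries. A column with few secondary entries corresponds to a cluster that is well separated from the others.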

We explored a wide range for the number of clusters (5 to 30) and considered other methods for determining the number of clusters. We do not claim to have found the optimal number of clusters, but we found that having more clusters allowed for better separation of the clusters from an expenditure perspective. With 15 clusters, we are able to see differences among clusters 1 to 3, which differ in covariates from the aggregate and whose expenditure distributions are the highest of all fifteen clusters. Had we used fewer clusters, this difference in expenditure distributions would not be evident, and the stripes plot shows more overlap between the high-expenditure clusters and the other clusters.

Fig. 8 Elbow plot and graph of silhouette coefficients

Fig. 9 Stripes plot


About this article


Cite this article

Agterberg, J., Zhong, F., Crabb, R. et al. Cluster analysis application to identify groups of individuals with high health expenditures. Health Serv Outcomes Res Method (2020).

Keywords


  • Unsupervised machine learning
  • Goodall similarities
  • k-Means
  • Partitioning around medoids
  • Predicting rare events