Cluster analysis application to identify groups of individuals with high health expenditures

Health Services and Outcomes Research Methodology

Abstract

We compare and demonstrate the effectiveness of two clustering methods with the main purpose of identifying characteristic profiles of high utilizers of health care. In this work, we use three sets of mutually independent longitudinal data that are nationally representative of the US adult working-age civilian non-institutionalized population. We compare k-means, a commonly used clustering method, with a k-medoids algorithm called Partitioning Around Medoids. We use one cohort of data to create clusters based on similar characteristics of individuals for both clustering methods. We examine the characteristic compositions of the three clusters with the highest average total expenditures from this cohort. We also examine the health expenditure distributions for this cohort over the following two years. We validate the approach by applying the centers of the clusters to two other cohorts of similar data. We form clusters based on demographic, economic, and health-related characteristics that are commonly used in studies of health care utilization. We demonstrate the consistency of our results across the three cohorts of data and across different types of health expenditures, such as office-based/outpatient and drug. Clusters can be formed with other more homogeneous data, such as Medicaid, Medicare, employer-sponsored insurance, or individual private plans issued under the Affordable Care Act. This approach can be used to follow similar groups over time for other types of health outcomes.
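The validation step, in which cluster centers from one cohort are applied to other cohorts, can be illustrated with a minimal sketch. It assumes categorical profiles and a simple matching dissimilarity; the dissimilarity, the attribute values, and the function names `simple_matching` and `assign_to_centers` are illustrative assumptions, not the measure or code used in the article.

```python
from typing import List, Sequence


def simple_matching(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of attributes on which two profiles disagree (0 means identical)."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def assign_to_centers(
    profiles: Sequence[Sequence[str]], centers: Sequence[Sequence[str]]
) -> List[int]:
    """Index of the closest center for each profile (ties go to the lowest index)."""
    labels = []
    for profile in profiles:
        dists = [simple_matching(profile, center) for center in centers]
        labels.append(dists.index(min(dists)))
    return labels


# Hypothetical centers (health status, insurance, activity limitation)
# obtained from a first cohort.
centers = [
    ("poor", "insured", "limited"),
    ("good", "insured", "none"),
    ("good", "uninsured", "none"),
]

# Hypothetical individuals from a later cohort, assigned to the existing centers.
new_cohort = [
    ("poor", "uninsured", "limited"),
    ("good", "uninsured", "none"),
    ("fair", "insured", "none"),
]

labels = assign_to_centers(new_cohort, centers)
print(labels)
```

Because the centers stay fixed, the later cohorts are only assigned, not re-clustered, so expenditure distributions can be compared across cohorts cluster by cluster.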

References

  • Aday, L.A., Andersen, R.: A framework for the study of access to medical care. Health Serv. Res. 9(3), 208 (1974)

  • Agency for Healthcare Research and Quality.: Medical Expenditure Panel Survey. US Department of Health and Human Services (2020). https://meps.ahrq.gov/mepsweb/

  • Ahmad, A., Khan, S.S.: Survey of state-of-the-art mixed data clustering algorithms. IEEE Access 7, 31883–31902 (2019)

  • Andersen, R.: A behavioral model of families’ use of health services. 25, Chicago: Center for Health Administration Studies, 5720 S. Woodlawn Avenue, University of Chicago, Illinois 60637, USA (1968)

  • Andersen, R., Newman, J.F.: Societal and individual determinants of medical care utilization in the United States. The Milbank Memorial Fund Quarterly Health and Society, pp. 95–124 (1973)

  • Aranganayagi, S., Thangavel, K.: Improved k-modes for categorical clustering using weighted dissimilarity measure. World Acad. Sci. Eng. Technol. 3, 813–819 (2009)

  • Barbará, D., Li, Y., Couto, J.: Coolcat: an entropy-based algorithm for categorical clustering. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management. ACM, pp. 582–589 (2002)

  • Bayliss, E.A., Powers, J.D., Ellis, J.L., Barrow, J.C., Strobel, M., Beck, A.: Applying sequential analytic methods to self-reported information to anticipate care needs. eGEMs (2016). https://doi.org/10.13063/2327-9214.1258

  • Boriah, S., Chandola, V., Kumar, V.: Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 2008 SIAM International Conference on Data Mining, SIAM, pp. 243–254 (2008)

  • Boscardin, C.K., Gonzales, R., Bradley, K.L., Raven, M.C.: Predicting cost of care using self-reported health status data. BMC Health Serv. Res. 15(1), 406 (2015)

  • Charlson, M., Wells, M.T., Ullman, R., King, F., Shmukler, C.: The Charlson comorbidity index can be used prospectively to identify patients who will incur high future costs. PLoS ONE 9(12), e112479 (2014)

  • Cibulková, J., Šulc, Z., Sirota, S., Rezanková, H.: The effect of binary data transformation in categorical data clustering. STATISTICS (2019). https://doi.org/10.21307/stattrans-2019-013

  • Crawford, A.G., Fuhr Jr., J.P., Clarke, J., Hubbs, B.: Comparative effectiveness of total population versus disease-specific neural network models in predicting medical costs. Dis. Manag. 8(5), 277–287 (2005)

  • Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd 96, 226–231 (1996)

  • Fleishman, J.A., Cohen, J.W.: Using information on clinical conditions to predict high-cost patients. Health Serv. Res. 45(2), 532–552 (2010)

  • Goodall, D.W.: A new similarity index based on probability. Biometrics 22, 882–907 (1966)

  • Gower, J.C.: Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53(3–4), 325–338 (1966)

  • Guha, S., Rastogi, R., Shim, K.: Rock: a robust clustering algorithm for categorical attributes. Inf. Syst. 25(5), 345–366 (2000)

  • Hamad, R., Modrek, S., Kubo, J., Goldstein, B.A., Cullen, M.R.: Using “big data” to capture overall health status: properties and predictive value of a claims-based health risk score. PLoS ONE 10(5), e0126054 (2015)

  • Healthy People.: Social Determinants. Office of Disease Prevention and Health Promotion, Washington, D.C (2020). https://www.healthypeople.gov/2020/leading-health-indicators/2020-lhi-topics/Social-Determinants

  • Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 2(3), 283–304 (1998)

  • Huang, Z., Ng, M.K.: A fuzzy k-modes algorithm for clustering categorical data. IEEE Trans. Fuzzy Syst. 7(4), 446–452 (1999)

  • Ienco, D., Pensa, R.G., Meo, R.: From context to distance: learning dissimilarity for categorical data clustering. ACM Trans. Knowl. Discov. Data (TKDD) 6(1), 1 (2012)

  • Jia, H., Cheung, Y.M., Liu, J.: A new distance metric for unsupervised learning of categorical data. IEEE Trans. Neural Netw. Learn. Syst. 27(5), 1065–1079 (2016)

  • Kaufman, L., Rousseeuw, P.J.: Partitioning around medoids (program pam). Finding groups in data: An introduction to cluster analysis pp. 68–125 (2005)

  • Kim, D.W., Lee, K., Lee, D., Lee, K.H.: A k-populations algorithm for clustering categorical data. Pattern Recognit. 38(7), 1131–1134 (2005)

  • Kim, K., Rosenberg, M.A.: Determinants of persistent high utilizers in US adults using nationally representative data. N. Am. Actuar. J. 24(1), 1–21 (2020)

  • Lee, N.S., Whitman, N., Vakharia, N., Rothberg, M.B.: High-cost patients: hot-spotters don’t explain the half of it. J. Gen. Intern. Med. 32(1), 28–34 (2017)

  • Leisch, F.: Neighborhood graphs, stripes and shadow plots for cluster visualization. Stat. Comput. 20(4), 457–469 (2010)

  • Li, T., Ma, S., Ogihara, M.: Entropy-based criterion in categorical clustering. In: Proceedings of the Twenty-First International Conference on Machine Learning. ACM, p. 68 (2004)

  • Liao, M., Li, Y., Kianifard, F., Obi, E., Arcona, S.: Cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis. BMC Nephrol. 17(1), 25 (2016)

  • Little, R.J., Rubin, D.B.: Statistical Analysis with Missing Data, 3rd edn. Wiley, Hoboken (2020)

  • Long, P., Abrams, M., Milstein, A., Anderson, G., Apton, K., Dahlberg, M., et al.: Effective care for high-need patients, Washington DC (2017)

  • Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.: cluster: cluster analysis basics and extensions. R package version 2.0.7-1—For new features, see the ’Changelog’ file (in the package source) (2018)

  • Mitchell, E.: Statistical brief #497: concentration of health expenditures in the US civilian noninstitutionalized population, 2014 (2016)

  • Morissette, L., Chartier, S.: The k-means clustering technique: general considerations and implementation in Mathematica. Tutor. Quant. Methods Psychol. 9(1), 15–24 (2013)

  • National Center for Health Statistics.: National Health Interview Survey. Centers for Disease Control and Prevention (2020). https://www.cdc.gov/nchs/nhis/index.htm

  • National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention.: NCHHSTP Social Determinants. Centers for Disease Control and Prevention, Washington, D.C (2020). https://www.cdc.gov/nchhstp/socialdeterminants/index.html

  • Peltz, A., Hall, M., Rubin, D.M., Mandl, K.D., Neff, J., Brittan, M., Cohen, E., Hall, D.E., Kuo, D.Z., Agrawal, R., et al.: Hospital utilization among children with the highest annual inpatient cost. Pediatrics 137(2), e20151829 (2016)

  • R Core Team.: R: a Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2020). https://www.R-project.org/

  • Řezanková, H.: Cluster analysis of economic data. Statistika 94(1), 73–86 (2014)

  • Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)

  • Shenas, S.A.I., Raahemi, B., Tekieh, M.H., Kuziemsky, C.: Identifying high-cost patients using data mining techniques and a small set of non-trivial attributes. Comput. Biol. Med. 53, 9–18 (2014)

  • Sneath, P.H., Sokal, R.R.: Numerical Taxonomy. The Principles and Practice of Numerical Classification. W.H. Freeman and Company, New York (1973)

  • Sokal, R.R., Camin, J., Rohlf, F., Sneath, P.: Numerical taxonomy: some points of view. Syst. Zool. 14(3), 237–243 (1965)

  • Šulc, Z., Řezanková, H.: Comparison of similarity measures for categorical data in hierarchical clustering. J. Classif. 36(1), 58–72 (2019)

  • Šulc, Z., Matějka, M., Procházka, J., Řezanková, H.: Evaluation of the Gower coefficient modifications in hierarchical clustering. Metodoloski Zvezki 14, 37–48 (2017)

  • Thorndike, R.L.: Who belongs in the family? Psychometrika 18(4), 267–276 (1953)

  • Wammes, J.J.G., van der Wees, P.J., Tanke, M.A., Westert, G.P., Jeurissen, P.P.: Systematic review of high-cost patients’ characteristics and healthcare utilisation. BMJ Open 8(9), e023113 (2018)

  • Wherry, L.R., Burns, M.E., Leininger, L.J.: Using self-reported health measures to predict high-need cases among Medicaid-eligible adults. Health Serv. Res. 49(S2), 2147–2172 (2014)

  • Zhu, M., Ghodsi, A.: Automatic dimensionality selection from the scree plot via the use of profile likelihood. Comput. Stat. Data Anal. 51(2), 918–930 (2006)

  • Zook, C.J., Moore, F.D.: High-cost users of medical care. N. Engl. J. Med. 302(18), 996–1002 (1980)

Funding

We acknowledge the Society of Actuaries Center of Excellence Research Grant Program for partial support of this research.

Author information

Corresponding authors

Correspondence to Joshua Agterberg or Marjorie Rosenberg.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

All authors approve of the article contents.

Informed consent

NHIS and MEPS data are publicly available.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Diagnostic results

The elbow, silhouette, and stripes plots guide the choice of the number of clusters. We choose 15 clusters because this number is large enough to show differences between the clusters, but not so large as to be unmanageable, following one of the principles of Sneath and Sokal (1973). The elbow plot in Fig. 8a indicates that 6, 15, or 22 clusters are good choices according to the method of Zhu and Ghodsi (2006). The graph of the silhouette coefficients in Fig. 8b peaks at 6 clusters and has local maxima at 12 and 15 clusters. The overall silhouette index is 0.018.
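As a reminder of what the silhouette index measures, the sketch below computes silhouette widths in the sense of Rousseeuw (1987) from a precomputed dissimilarity matrix. It is a minimal illustration on toy one-dimensional data; the function names and the data are assumptions, and it is not the code used to produce Fig. 8b.

```python
from typing import Dict, List, Sequence


def silhouette_widths(
    dist: Sequence[Sequence[float]], labels: Sequence[int]
) -> List[float]:
    """Silhouette width s(i) = (b - a) / max(a, b) for each observation i."""
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouettes need at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean dissimilarity to the other members of the own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean dissimilarity to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def average_silhouette(
    dist: Sequence[Sequence[float]], labels: Sequence[int]
) -> float:
    """Overall silhouette index: the mean of the individual widths."""
    widths = silhouette_widths(dist, labels)
    return sum(widths) / len(widths)


# Toy data: four points on a line forming two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

good = average_silhouette(dist, [0, 0, 1, 1])  # well-separated partition
bad = average_silhouette(dist, [0, 1, 0, 1])   # partition that mixes the groups
print(round(good, 3), round(bad, 3))
```

Widths near 1 indicate observations that sit well inside their cluster, widths near 0 indicate observations close to the boundary between two clusters, and negative widths indicate observations that are on average closer to another cluster.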

Each column in the stripes plot in Fig. 9 shows the dissimilarities from the medoid of cluster \(j, \, j = 1, \ldots , 15\), for the observations within cluster j, as well as the dissimilarities from the medoid of cluster j for those observations in another cluster whose second-closest medoid is that of cluster j. Each column j can contain up to 15 sub-columns, with the markings arranged in cluster order from \(j = 1, \ldots , 15\). The clusters are labeled by average total expenditures, from cluster 1, with the highest average total expenditures, to cluster 15, with the lowest. In all columns, sub-column 1 is reserved for individuals from cluster 1. The cluster 1 column shows additional entries for individuals whose primary cluster is not cluster 1 but who are secondarily assigned to cluster 1. The other cluster columns are defined similarly. The cluster 1 column shows some secondary assignments, but fewer than the columns for the middle expenditure clusters. Also, there is a greater prevalence of cluster 1 individuals having cluster 2 as their second-closest cluster. These findings help justify applying the method to identify the very high utilizers of health care.
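The layout of a stripes plot can be made concrete with a short sketch that builds the data behind each column from a matrix of dissimilarities between observations and medoids. The function name `stripes_columns`, the toy dissimilarities, and the 0-based cluster indices are assumptions for illustration; this is not the plotting code used for Fig. 9.

```python
from typing import Dict, List, Sequence


def stripes_columns(
    dist_to_medoids: Sequence[Sequence[float]],
) -> Dict[int, Dict[int, List[float]]]:
    """Group dissimilarities by column j and sub-column c.

    columns[j][c] holds dissimilarities to medoid j for observations whose
    primary (closest) cluster is c. For c == j these are the cluster's own
    members; for c != j they are observations whose second-closest medoid is j.
    """
    k = len(dist_to_medoids[0])
    if k < 2:
        raise ValueError("need at least two medoids")
    columns: Dict[int, Dict[int, List[float]]] = {j: {} for j in range(k)}
    for row in dist_to_medoids:
        order = sorted(range(k), key=lambda j: row[j])
        primary, secondary = order[0], order[1]
        columns[primary].setdefault(primary, []).append(row[primary])
        columns[secondary].setdefault(primary, []).append(row[secondary])
    return columns


# Toy example: four observations, three medoids (indices 0, 1, 2).
dist_to_medoids = [
    [0.10, 0.50, 0.90],  # primary 0, secondary 1
    [0.20, 0.80, 0.40],  # primary 0, secondary 2
    [0.60, 0.10, 0.30],  # primary 1, secondary 2
    [0.30, 0.70, 0.05],  # primary 2, secondary 0
]

columns = stripes_columns(dist_to_medoids)
for j, subcolumns in columns.items():
    print(j, subcolumns)
```

A column whose entries come almost entirely from its own sub-column corresponds to a cluster that few outside observations regard as their second choice, which is the pattern described above for cluster 1.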

We explored a wide range for the number of clusters (5 to 30) and also considered other methods for determining the number of clusters. We do not claim to have found the optimal number of clusters, but we found that having more clusters allowed for better separation of the clusters from an expenditure perspective. With the choice of 15 clusters, we are able to see that clusters 1 to 3 differ in covariates from the aggregate and that their expenditure distributions are the highest of all fifteen clusters. If we had used a smaller number of clusters, this difference in expenditure distributions would not be evident, and the stripes plot shows more overlap between the high expenditure clusters and the other clusters.
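The labeling convention used in this appendix, with cluster 1 having the highest average total expenditures, can be sketched as a relabeling step applied after clustering. The function name, the raw labels, and the expenditure amounts below are hypothetical and serve only to illustrate the ordering.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def relabel_by_mean_expenditure(
    labels: Sequence[Hashable], expenditures: Sequence[float]
) -> Tuple[List[int], Dict[Hashable, int]]:
    """Renumber clusters 1, 2, ... in decreasing order of mean expenditure."""
    if len(labels) != len(expenditures):
        raise ValueError("labels and expenditures must have the same length")
    totals: Dict[Hashable, float] = {}
    counts: Dict[Hashable, int] = {}
    for lab, amount in zip(labels, expenditures):
        totals[lab] = totals.get(lab, 0.0) + amount
        counts[lab] = counts.get(lab, 0) + 1
    means = {lab: totals[lab] / counts[lab] for lab in totals}
    # Highest mean expenditure first, so it receives label 1.
    ranked = sorted(means, key=lambda lab: means[lab], reverse=True)
    mapping = {old: new for new, old in enumerate(ranked, start=1)}
    return [mapping[lab] for lab in labels], mapping


# Hypothetical raw cluster labels and total expenditures for six individuals.
raw_labels = ["a", "b", "a", "c", "b", "c"]
expenditures = [100.0, 9000.0, 300.0, 2500.0, 11000.0, 1500.0]

new_labels, mapping = relabel_by_mean_expenditure(raw_labels, expenditures)
print(new_labels, mapping)
```

Ordering the labels this way makes the high expenditure clusters directly comparable across solutions with different numbers of clusters.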

Fig. 8 Elbow plot (a) and graph of silhouette coefficients (b)

Fig. 9 Stripes plot

About this article

Cite this article

Agterberg, J., Zhong, F., Crabb, R. et al. Cluster analysis application to identify groups of individuals with high health expenditures. Health Serv Outcomes Res Method 20, 140–182 (2020). https://doi.org/10.1007/s10742-020-00214-8

  • DOI: https://doi.org/10.1007/s10742-020-00214-8
