Cluster analysis application to identify groups of individuals with high health expenditures

Agterberg, Joshua; Zhong, Fanghao; Crabb, Richard; Rosenberg, Marjorie

doi:10.1007/s10742-020-00214-8

Published: 01 August 2020

Volume 20, pages 140–182, (2020)

Health Services and Outcomes Research Methodology

Abstract

We compare and demonstrate the effectiveness of two clustering methods with the main purpose of identifying characteristic profiles of high utilizers of health care. In this work, we use three sets of mutually independent longitudinal data that are nationally representative of the US adult working-age civilian non-institutionalized population. We compare k-means, a commonly used clustering method, with a k-medoids algorithm called Partitioning Around Medoids. We use one cohort of data to create clusters based on similar characteristics of individuals for both clustering methods. We examine the characteristic compositions of the three clusters with the highest average total expenditures from this cohort. We also examine the health expenditure distributions for this cohort over the following two years. We validate the approach by applying the centers of the clusters to two other cohorts of similar data. We form clusters based on demographic, economic, and health-related characteristics that are commonly used in studies of health care utilization. We demonstrate the consistency of our results across the three cohorts of data and across different types of health expenditures, such as office-based/outpatient and drug. Clusters can be formed with other more homogeneous data, such as Medicaid, Medicare, employer-sponsored insurance, or individual private plans issued under the Affordable Care Act. This approach can be used to follow similar groups over time for other types of health outcomes.

References

Aday, L.A., Andersen, R.: A framework for the study of access to medical care. Health Serv. Res. 9(3), 208 (1974)
Agency for Healthcare Research and Quality.: Medical Expenditure Panel Survey. US Department of Health and Human Services (2020). https://www.cdc.gov/nchs/nhis/index.htm
Ahmad, A., Khan, S.S.: Survey of state-of-the-art mixed data clustering algorithms. IEEE Access 7, 31883–31902 (2019)
Andersen, R.: A behavioral model of families’ use of health services. 25, Chicago: Center for Health Administration Studies, 5720 S. Woodlawn Avenue, University of Chicago, Illinois 60637, USA (1968)
Andersen, R., Newman, J.F.: Societal and individual determinants of medical care utilization in the United States. The Milbank Memorial Fund Quarterly. Health and Society, pp. 95–124 (1973)
Aranganayagi, S., Thangavel, K.: Improved k-modes for categorical clustering using weighted dissimilarity measure. World Acad. Sci. Eng. Technol. 3, 813–819 (2009)
Barbará, D., Li, Y., Couto, J.: Coolcat: an entropy-based algorithm for categorical clustering. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management. ACM, pp. 582–589 (2002)
Bayliss, E.A., Powers, J.D., Ellis, J.L., Barrow, J.C., Strobel, M., Beck, A.: Applying sequential analytic methods to self-reported information to anticipate care needs. eGEMs (2016). https://doi.org/10.13063/2327-9214.1258
Boriah, S., Chandola, V., Kumar, V.: Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 2008 SIAM International Conference on Data Mining, SIAM, pp. 243–254 (2008)
Boscardin, C.K., Gonzales, R., Bradley, K.L., Raven, M.C.: Predicting cost of care using self-reported health status data. BMC Health Serv. Res. 15(1), 406 (2015)
Charlson, M., Wells, M.T., Ullman, R., King, F., Shmukler, C.: The Charlson comorbidity index can be used prospectively to identify patients who will incur high future costs. PLoS ONE 9(12), e112479 (2014)
Cibulková, J., Šulc, Z., Sirota, S., Rezanková, H.: The effect of binary data transformation in categorical data clustering. STATISTICS (2019). https://doi.org/10.21307/stattrans-2019-013
Crawford, A.G., Fuhr Jr., J.P., Clarke, J., Hubbs, B.: Comparative effectiveness of total population versus disease-specific neural network models in predicting medical costs. Dis. Manag. 8(5), 277–287 (2005)
Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96, 226–231 (1996)
Fleishman, J.A., Cohen, J.W.: Using information on clinical conditions to predict high-cost patients. Health Serv. Res. 45(2), 532–552 (2010)
Goodall, D.W.: A new similarity index based on probability. Biometrics 22, 882–907 (1966)
Gower, J.C.: Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53(3–4), 325–338 (1966)
Guha, S., Rastogi, R., Shim, K.: Rock: a robust clustering algorithm for categorical attributes. Inf. Syst. 25(5), 345–366 (2000)
Hamad, R., Modrek, S., Kubo, J., Goldstein, B.A., Cullen, M.R.: Using “big data” to capture overall health status: properties and predictive value of a claims-based health risk score. PLoS ONE 10(5), e0126054 (2015)
Healthy People.: Social Determinants. Office of Disease Prevention and Health Promotion, Washington, D.C (2020). https://www.healthypeople.gov/2020/leading-health-indicators/2020-lhi-topics/Social-Determinants
Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 2(3), 283–304 (1998)
Huang, Z., Ng, M.K.: A fuzzy k-modes algorithm for clustering categorical data. IEEE Trans. Fuzzy Syst. 7(4), 446–452 (1999)
Ienco, D., Pensa, R.G., Meo, R.: From context to distance: learning dissimilarity for categorical data clustering. ACM Trans. Knowl. Discov. Data (TKDD) 6(1), 1 (2012)
Jia, H., Cheung, Y.M., Liu, J.: A new distance metric for unsupervised learning of categorical data. IEEE Trans. Neural Netw. Learn. Syst. 27(5), 1065–1079 (2016)
Kaufman, L., Rousseeuw, P.J.: Partitioning around medoids (program PAM). In: Finding Groups in Data: An Introduction to Cluster Analysis, pp. 68–125 (2005)
Kim, D.W., Lee, K., Lee, D., Lee, K.H.: A k-populations algorithm for clustering categorical data. Pattern Recognit. 38(7), 1131–1134 (2005)
Kim, K., Rosenberg, M.A.: Determinants of persistent high utilizers in US adults using nationally representative data. N. Am. Actuar. J. 24(1), 1–21 (2020)
Lee, N.S., Whitman, N., Vakharia, N., Rothberg, M.B.: High-cost patients: hot-spotters don’t explain the half of it. J. Gen. Intern. Med. 32(1), 28–34 (2017)
Leisch, F.: Neighborhood graphs, stripes and shadow plots for cluster visualization. Stat. Comput. 20(4), 457–469 (2010)
Li, T., Ma, S., Ogihara, M.: Entropy-based criterion in categorical clustering. In: Proceedings of the Twenty-First International Conference on Machine Learning. ACM, p. 68 (2004)
Liao, M., Li, Y., Kianifard, F., Obi, E., Arcona, S.: Cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis. BMC Nephrol. 17(1), 25 (2016)
Little, R.J., Rubin, D.B.: Statistical Analysis with Missing Data, 3rd edn. Wiley, Hoboken (2020)
Long, P., Abrams, M., Milstein, A., Anderson, G., Apton, K., Dahlberg, M., et al.: Effective care for high-need patients, Washington DC (2017)
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.: cluster: cluster analysis basics and extensions. R package version 2.0.7-1 (2018)
Mitchell, E.: Statistical brief #497: concentration of health expenditures in the US civilian noninstitutionalized population, 2014 (2016)
Morissette, L., Chartier, S.: The k-means clustering technique: general considerations and implementation in mathematica. Tutor. Quant. Methods Psychol. 9(1), 15–24 (2013)
National Center for Health Statistics.: National Health Interview Survey. Centers for Disease Control and Prevention (2020). https://www.cdc.gov/nchs/nhis/index.htm
National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention.: NCHHSTP Social Determinants. Centers for Disease Control and Prevention, Washington, D.C (2020). https://www.cdc.gov/nchhstp/socialdeterminants/index.html
Peltz, A., Hall, M., Rubin, D.M., Mandl, K.D., Neff, J., Brittan, M., Cohen, E., Hall, D.E., Kuo, D.Z., Agrawal, R., et al.: Hospital utilization among children with the highest annual inpatient cost. Pediatrics 137(2), e20151829 (2016)
R Core Team: R: a Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2020). https://www.R-project.org/
Řezanková, H.: Cluster analysis of economic data. Statistika 94(1), 73–86 (2014)
Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Shenas, S.A.I., Raahemi, B., Tekieh, M.H., Kuziemsky, C.: Identifying high-cost patients using data mining techniques and a small set of non-trivial attributes. Comput. Biol. Med. 53, 9–18 (2014)
Sneath, P.H., Sokal, R.R.: Numerical Taxonomy. The Principles and Practice of Numerical Classification. W.H. Freeman and Company, New York (1973)
Sokal, R.R., Camin, J., Rohlf, F., Sneath, P.: Numerical taxonomy: some points of view. Syst. Zool. 14(3), 237–243 (1965)
Šulc, Z., Řezanková, H.: Comparison of similarity measures for categorical data in hierarchical clustering. J. Classif. 36(1), 58–72 (2019)
Šulc, Z., Matějka, M., Procházka, J., Řezanková, H.: Evaluation of the Gower coefficient modifications in hierarchical clustering. Metodoloski Zvezki 14, 37–48 (2017)
Thorndike, R.L.: Who belongs in the family? Psychometrika 18(4), 267–276 (1953)
Wammes, J.J.G., van der Wees, P.J., Tanke, M.A., Westert, G.P., Jeurissen, P.P.: Systematic review of high-cost patients’ characteristics and healthcare utilisation. BMJ Open 8(9), e023113 (2018)
Wherry, L.R., Burns, M.E., Leininger, L.J.: Using self-reported health measures to predict high-need cases among Medicaid-eligible adults. Health Serv. Res. 49(S2), 2147–2172 (2014)
Zhu, M., Ghodsi, A.: Automatic dimensionality selection from the scree plot via the use of profile likelihood. Comput. Stat. Data Anal. 51(2), 918–930 (2006)
Zook, C.J., Moore, F.D.: High-cost users of medical care. N. Engl. J. Med. 302(18), 996–1002 (1980)

Funding

We acknowledge the Society of Actuaries Center of Excellence Research Grant Program for partial support of this research.

Author information

Authors and Affiliations

Johns Hopkins University, Whitehead Hall, 3400 N Charles St, Baltimore, MD, 21218, USA
Joshua Agterberg
New York University, New York, USA
Fanghao Zhong
University of Wisconsin-Madison, Madison, USA
Richard Crabb & Marjorie Rosenberg

Corresponding authors

Correspondence to Joshua Agterberg or Marjorie Rosenberg.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

All authors approve of the article contents.

Informed consent

NHIS and MEPS data are publicly available.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Diagnostic results

The elbow, silhouette, and stripes plots guide the choice of the number of clusters. We choose 15 clusters because this number is large enough to show differences between the clusters but not so large as to be unmanageable, in keeping with one of the principles of Sneath and Sokal (1973). The elbow plot in Fig. 8a indicates that 6, 15, or 22 clusters are good choices per the method of Zhu and Ghodsi (2006). The silhouette coefficients in Fig. 8b peak at 6 clusters and have local maxima at 12 and 15 clusters. The overall silhouette index is 0.018.
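As an illustration of how a silhouette index of this kind is computed, the following is a minimal sketch in Python following the definition of Rousseeuw (1987). It is not the authors' code. The simple-matching dissimilarity and the names `simple_matching`, `silhouette_widths`, and `overall_silhouette` are assumptions introduced only for this example, and the sketch assumes at least two clusters.

```python
from typing import Callable, List, Sequence


def simple_matching(a: Sequence, b: Sequence) -> float:
    """Fraction of attributes on which two records disagree (assumed dissimilarity)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def silhouette_widths(
    data: List[Sequence],
    labels: List[int],
    dist: Callable[[Sequence, Sequence], float] = simple_matching,
) -> List[float]:
    """Silhouette width s(i) = (b - a) / max(a, b) for every record.

    a is the mean dissimilarity from record i to the other members of its own
    cluster; b is the smallest mean dissimilarity from i to any other cluster.
    Assumes at least two distinct cluster labels.
    """
    n = len(data)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        # Mean dissimilarity from record i to each cluster, excluding i itself.
        mean_to = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dist(data[i], data[j]) for j in members) / len(members)
        own = labels[i]
        if own not in mean_to:
            # Record i is alone in its cluster: s(i) = 0 by convention.
            widths.append(0.0)
            continue
        a = mean_to[own]
        b = min(v for c, v in mean_to.items() if c != own)
        denom = max(a, b)
        widths.append(0.0 if denom == 0 else (b - a) / denom)
    return widths


def overall_silhouette(widths: List[float]) -> float:
    """Overall silhouette index: the average silhouette width over all records."""
    return sum(widths) / len(widths)
```

Computing `overall_silhouette` for each candidate number of clusters and plotting the values against that number would give a curve of the kind described for Fig. 8b. Values near 1 indicate well-separated clusters, and values near 0 indicate that records lie between clusters.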

Each column in the stripes plot in Fig. 9 shows the dissimilarities from the medoid for those observations within cluster \(j, \, j = 1, \ldots , 15\), as well as the dissimilarity from the medoid in cluster j for those observations in another cluster whose second closest medoid is cluster j. Each column j can contain up to 15 sub-columns, with the markings ordered by cluster from \(j = 1, \ldots , 15\). The clusters are labeled by average total expenditures, from cluster 1 (highest average) to cluster 15 (lowest average). In all columns, sub-column 1 is reserved for those individuals from cluster 1. The cluster 1 column shows additional entries for those individuals secondarily assigned to cluster 1 but whose primary cluster is other than cluster 1. Other cluster columns can be similarly defined. The cluster 1 column shows some secondary assignments, but has fewer additional assignments than the middle expenditure clusters. Also, there is a greater prevalence of cluster 1 individuals having cluster 2 as their second closest cluster. These findings help justify the application to identifying the very high utilizers of health care.
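The quantities behind a stripes plot can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the simple-matching dissimilarity, the function names, and the dictionary layout are hypothetical, and each record's primary cluster is taken to be its closest medoid.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def simple_matching(a: Sequence, b: Sequence) -> float:
    """Fraction of attributes on which two records disagree (assumed dissimilarity)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def two_closest_medoids(
    record: Sequence,
    medoids: List[Sequence],
    dist: Callable[[Sequence, Sequence], float] = simple_matching,
) -> Tuple[Tuple[int, float], Tuple[int, float]]:
    """Return (index, dissimilarity) for the closest and second closest medoid.

    Ties are broken in favor of the lower medoid index. Requires >= 2 medoids.
    """
    ranked = sorted((dist(record, m), k) for k, m in enumerate(medoids))
    (d1, k1), (d2, k2) = ranked[0], ranked[1]
    return (k1, d1), (k2, d2)


def stripes_data(
    data: List[Sequence],
    medoids: List[Sequence],
    dist: Callable[[Sequence, Sequence], float] = simple_matching,
) -> List[Dict]:
    """Collect, for each cluster column j, the dissimilarities to medoid j.

    "own" holds values for records whose closest medoid is j. "second" maps a
    primary cluster index to the values for records from that cluster whose
    second closest medoid is j (the sub-columns of column j).
    """
    columns = [{"own": [], "second": {}} for _ in medoids]
    for rec in data:
        (k1, d1), (k2, d2) = two_closest_medoids(rec, medoids, dist)
        columns[k1]["own"].append(d1)
        columns[k2]["second"].setdefault(k1, []).append(d2)
    return columns
```

A column whose `"second"` entry is sparse corresponds to a cluster that few outside observations are close to, which is the pattern the text describes for cluster 1.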

We explored a wide range for the number of clusters (5 to 30) and considered other methods for determining the number of clusters. We do not claim to have the optimal number of clusters, but we found that having more clusters allowed for better separation of the clusters from an expenditure perspective. With the choice of 15 clusters, we can see differences among clusters 1 to 3, which differ in covariates from the aggregate and whose expenditure distributions are the highest of all fifteen clusters. With a smaller number of clusters, this difference in expenditure distributions would not be evident, and the stripes plot shows more overlap between the high expenditure clusters and the other clusters.

About this article

Cite this article

Agterberg, J., Zhong, F., Crabb, R. et al. Cluster analysis application to identify groups of individuals with high health expenditures. Health Serv Outcomes Res Method 20, 140–182 (2020). https://doi.org/10.1007/s10742-020-00214-8

Received: 13 August 2019
Revised: 23 June 2020
Accepted: 17 July 2020
Published: 01 August 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s10742-020-00214-8
