We compare two clustering methods and demonstrate their effectiveness, with the main purpose of identifying characteristic profiles of high utilizers of health care. In this work, we use three sets of mutually independent longitudinal data that are nationally representative of the US adult working-age civilian non-institutionalized population. We compare k-means, a commonly used clustering method, with a k-medoids algorithm called Partitioning Around Medoids. For both clustering methods, we use one cohort of data to create clusters of individuals with similar characteristics. We form clusters based on demographic, economic, and health-related characteristics that are commonly used in studies of health care utilization. We examine the characteristic composition of the three clusters with the highest average total expenditures in this cohort. We also examine the health expenditure distributions for this cohort over the following two years. We validate the approach by applying the centers of the clusters to two other cohorts of similar data. We demonstrate the consistency of our results across the three cohorts of data and across different types of health expenditures, such as office-based/outpatient and drug. Clusters can also be formed with other, more homogeneous data, such as Medicaid, Medicare, employer-sponsored insurance, or individual private plans issued under the Affordable Care Act. This approach can be used to follow similar groups over time for other types of health outcomes.
Aday, L.A., Andersen, R.: A framework for the study of access to medical care. Health Serv. Res. 9(3), 208 (1974)
Agency for Healthcare Research and Quality.: Medical Expenditure Panel Survey. US Department of Health and Human Services (2020). https://www.cdc.gov/nchs/nhis/index.htm
Ahmad, A., Khan, S.S.: Survey of state-of-the-art mixed data clustering algorithms. IEEE Access 7, 31883–31902 (2019)
Andersen, R.: A behavioral model of families’ use of health services. 25, Chicago: Center for Health Administration Studies, 5720 S. Woodlawn Avenue, University of Chicago, Illinois 60637, USA (1968)
Andersen, R., Newman, J.F.: Societal and individual determinants of medical care utilization in the united states. The Milbank Memorial Fund Quarterly Health and Society, pp. 95–124 (1973)
Aranganayagi, S., Thangavel, K.: Improved k-modes for categorical clustering using weighted dissimilarity measure. World Acad. Sci. Eng. Technol. 3, 813–819 (2009)
Barbará, D., Li, Y., Couto, J.: Coolcat: an entropy-based algorithm for categorical clustering. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management. ACM, pp. 582–589 (2002)
Bayliss, E.A., Powers, J.D., Ellis, J.L., Barrow, J.C., Strobel, M., Beck, A.: Applying sequential analytic methods to self-reported information to anticipate care needs. eGEMs (2016). https://doi.org/10.13063/2327-9214.1258
Boriah, S., Chandola, V., Kumar, V.: Similarity measures for categorical data: a comparative evaluation. In: Proceedings of the 2008 SIAM International Conference on Data Mining, SIAM, pp. 243–254 (2008)
Boscardin, C.K., Gonzales, R., Bradley, K.L., Raven, M.C.: Predicting cost of care using self-reported health status data. BMC Health Serv. Res. 15(1), 406 (2015)
Charlson, M., Wells, M.T., Ullman, R., King, F., Shmukler, C.: The charlson comorbidity index can be used prospectively to identify patients who will incur high future costs. PLoS ONE 9(12), e112479 (2014)
Cibulková, J., Šulc, Z., Sirota, S., Rezanková, H.: The effect of binary data transformation in categorical data clustering. STATISTICS (2019). https://doi.org/10.21307/stattrans-2019-013
Crawford, A.G., Fuhr Jr., J.P., Clarke, J., Hubbs, B.: Comparative effectiveness of total population versus disease-specific neural network models in predicting medical costs. Dis. Manag. 8(5), 277–287 (2005)
Ester, M., Kriegel, H.P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd 96, 226–231 (1996)
Fleishman, J.A., Cohen, J.W.: Using information on clinical conditions to predict high-cost patients. Health Serv. Res. 45(2), 532–552 (2010)
Goodall, D.W.: A new similarity index based on probability. Biometrics 22, 882–907 (1966)
Gower, J.C.: Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53(3–4), 325–338 (1966)
Guha, S., Rastogi, R., Shim, K.: Rock: a robust clustering algorithm for categorical attributes. Inf. Syst. 25(5), 345–366 (2000)
Hamad, R., Modrek, S., Kubo, J., Goldstein, B.A., Cullen, M.R.: Using “big data” to capture overall health status: properties and predictive value of a claims-based health risk score. PLoS ONE 10(5), e0126054 (2015)
Healthy People.: Social Determinants. Office of Disease Prevention and Health Promotion, Washington, D.C (2020). https://www.healthypeople.gov/2020/leading-health-indicators/2020-lhi-topics/Social-Determinants
Huang, Z.: Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Min. Knowl. Discov. 2(3), 283–304 (1998)
Huang, Z., Ng, M.K.: A fuzzy k-modes algorithm for clustering categorical data. IEEE Trans. Fuzzy Syst. 7(4), 446–452 (1999)
Ienco, D., Pensa, R.G., Meo, R.: From context to distance: learning dissimilarity for categorical data clustering. ACM Trans. Knowl. Discov. Data (TKDD) 6(1), 1 (2012)
Jia, H., Cheung, Y.M., Liu, J.: A new distance metric for unsupervised learning of categorical data. IEEE Trans. Neural Netw. Learn. Syst. 27(5), 1065–1079 (2016)
Kaufman, L., Rousseeuw, P.J.: Partitioning around medoids (program pam). Finding groups in data: An introduction to cluster analysis pp. 68–125 (2005)
Kim, D.W., Lee, K., Lee, D., Lee, K.H.: A k-populations algorithm for clustering categorical data. Pattern Recognit. 38(7), 1131–1134 (2005)
Kim, K., Rosenberg, M.A.: Determinants of persistent high utilizers in US adults using nationally representative data. N. Am. Actuar. J. 24(1), 1–21 (2020)
Lee, N.S., Whitman, N., Vakharia, N., Rothberg, M.B.: High-cost patients: hot-spotters don’t explain the half of it. J. Gen. Intern. Med. 32(1), 28–34 (2017)
Leisch, F.: Neighborhood graphs, stripes and shadow plots for cluster visualization. Stat. Comput. 20(4), 457–469 (2010)
Li, T., Ma, S., Ogihara, M.: Entropy-based criterion in categorical clustering. In: Proceedings of the Twenty-First International Conference on Machine Learning. ACM, p. 68 (2004)
Liao, M., Li, Y., Kianifard, F., Obi, E., Arcona, S.: Cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis. BMC Nephrol. 17(1), 25 (2016)
Little, R.J., Rubin, D.B.: Statistical Analysis with Missing Data, 3rd edn. Wiley, Hoboken (2020)
Long, P., Abrams, M., Milstein, A., Anderson, G., Apton, K., Dahlberg, M., et al.: Effective care for high-need patients, Washington DC (2017)
Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.: cluster: cluster analysis basics and extensions. R package version 2.0.7-1—For new features, see the ’Changelog’ file (in the package source) (2018)
Mitchell, E.: Statistical brief# 497: concentration of health expenditures in the us civilian noninstitutionalized population, 2014 (2016)
Morissette, L., Chartier, S.: The k-means clustering technique: general considerations and implementation in mathematica. Tutor. Quant. Methods Psychol. 9(1), 15–24 (2013)
National Center for Health Statistics.: National Health Interview Survey. Centers for Disease Control and Prevention (2020). https://www.cdc.gov/nchs/nhis/index.htm
National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention.: NCHHSTP Social Determinants. Centers for Disease Control and Prevention, Washington, D.C (2020). https://www.cdc.gov/nchhstp/socialdeterminants/index.html
Peltz, A., Hall, M., Rubin, D.M., Mandl, K.D., Neff, J., Brittan, M., Cohen, E., Hall, D.E., Kuo, D.Z., Agrawal, R., et al.: Hospital utilization among children with the highest annual inpatient cost. Pediatrics 137(2), e20151829 (2016)
R Core Team.: R: a Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2020). https://www.R-project.org/
Řezanková, H.: Cluster analysis of economic data. Statistika 94(1), 73–86 (2014)
Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Shenas, S.A.I., Raahemi, B., Tekieh, M.H., Kuziemsky, C.: Identifying high-cost patients using data mining techniques and a small set of non-trivial attributes. Comput. Biol. Med. 53, 9–18 (2014)
Sneath, P.H., Sokal, R.R.: Numerical Taxonomy. The Principles and Practice of Numerical Classification. W.H. Freeman and Company, New York (1973)
Sokal, R.R., Camin, J., Rohlf, F., Sneath, P.: Numerical taxonomy: some points of view. Syst. Zool. 14(3), 237–243 (1965)
Šulc, Z., Řezanková, H.: Comparison of similarity measures for categorical data in hierarchical clustering. J. Classif. 36(1), 58–72 (2019)
Šulc, Z., Matějka, M., Procházka, J., Řezanková, H.: Evaluation of the Gower coefficient modifications in hierarchical clustering. Metodoloski Zvezki 14, 37–48 (2017)
Thorndike, R.L.: Who belongs in the family? Psychometrika 18(4), 267–276 (1953)
Wammes, J.J.G., van der Wees, P.J., Tanke, M.A., Westert, G.P., Jeurissen, P.P.: Systematic review of high-cost patients’ characteristics and healthcare utilisation. BMJ Open 8(9), e023113 (2018)
Wherry, L.R., Burns, M.E., Leininger, L.J.: Using self-reported health measures to predict high-need cases among medicaid-eligible adults. Health Serv. Res. 49(S2), 2147–2172 (2014)
Zhu, M., Ghodsi, A.: Automatic dimensionality selection from the scree plot via the use of profile likelihood. Comput. Stat. Data Anal. 51(2), 918–930 (2006)
Zook, C.J., Moore, F.D.: High-cost users of medical care. N. Engl. J. Med. 302(18), 996–1002 (1980)
We acknowledge the Society of Actuaries Center of Excellence Research Grant Program for partial support of this research.
Conflict of interest
The authors declare that they have no conflict of interest.
All authors approve of the article contents.
NHIS and MEPS data are publicly available.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: Diagnostic results
The elbow, silhouette, and stripes plots guide the choice of the number of clusters. We choose 15 clusters because this number is large enough to show differences between the clusters but not so large as to be unmanageable, in line with one of the principles of Sneath and Sokal (1973). The elbow plot in Fig. 8a indicates that 6, 15, or 22 clusters are good choices per the method of Zhu and Ghodsi (2006). The graph of the silhouette coefficients in Fig. 8b peaks at 6 clusters and has local maxima at 12 and 15 clusters. The overall silhouette index is 0.018.
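As an illustration of how a silhouette index like the one reported above can be computed, the following is a minimal sketch based on the definition in Rousseeuw (1987), working from a precomputed dissimilarity matrix. It is not the authors' code: the function name `silhouette_index`, the four-point toy data, and the convention of scoring singleton clusters as 0 are assumptions made for this example.

```python
from statistics import mean


def silhouette_index(dissim, labels):
    """Average silhouette width from a square dissimilarity matrix.

    dissim[i][j] is the dissimilarity between observations i and j;
    labels[i] is the cluster label of observation i.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: average dissimilarity to the other members of i's own cluster
        a = mean(dissim[i][j] for j in own)
        # b: smallest average dissimilarity to the members of another cluster
        b = min(
            mean(dissim[i][j] for j in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return mean(widths)


# Hypothetical toy data: four points on a line, dissimilarity = absolute difference.
points = [0, 1, 10, 11]
dissim = [[abs(p - q) for q in points] for p in points]

good = silhouette_index(dissim, [0, 0, 1, 1])  # well-separated grouping, about 0.90
bad = silhouette_index(dissim, [0, 1, 0, 1])   # mixed grouping, about -0.45
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest observations that sit closer to another cluster than to their own. Repeating the calculation for each candidate number of clusters gives the kind of curve shown in a silhouette plot.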
Each column in the stripes plot in Fig. 9 shows the dissimilarities from the medoid for those observations within cluster \(j, \, j = 1, \ldots , 15\), as well as the dissimilarity from the medoid in cluster j for those observations in another cluster whose second closest medoid is cluster j. Each column j can contain up to 15 sub-columns, with the markings arranged in cluster order from \(j = 1, \ldots , 15\). The clusters have been labeled based on average total expenditures, from cluster 1, which has the highest average total expenditures, to cluster 15, which has the lowest. In all columns, sub-column 1 is reserved for those individuals from cluster 1. The cluster 1 column shows additional entries for those individuals secondarily assigned to cluster 1 but whose primary cluster is other than cluster 1. The other cluster columns are defined similarly. The cluster 1 column shows some secondary assignments, but fewer than the middle expenditure clusters. Also, cluster 1 individuals most often have cluster 2 as their second closest cluster. These findings help justify applying the approach to identify the very high utilizers of health care.
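The bookkeeping behind a stripes plot can be sketched as follows: for each observation, find the closest and second closest medoid, then file the two dissimilarities under the corresponding columns, keyed by the observation's primary cluster. This is an illustrative sketch only. The function names, the zero-based cluster indices, and the small dissimilarity table are hypothetical and are not taken from the article's data or software.

```python
def nearest_two_medoids(dissim_to_medoids):
    """For each observation, return (primary, d_primary, secondary, d_secondary).

    dissim_to_medoids[i][j] is the dissimilarity of observation i from medoid j.
    """
    result = []
    for row in dissim_to_medoids:
        order = sorted(range(len(row)), key=lambda j: row[j])
        first, second = order[0], order[1]
        result.append((first, row[first], second, row[second]))
    return result


def stripes_columns(assignments, n_clusters):
    """Group dissimilarities by medoid column, then by primary cluster (sub-column).

    columns[j][c] holds the dissimilarities from medoid j of the observations
    whose primary cluster is c and whose closest or second closest medoid is j.
    """
    columns = {j: {} for j in range(n_clusters)}
    for first, d_first, second, d_second in assignments:
        columns[first].setdefault(first, []).append(d_first)     # own cluster
        columns[second].setdefault(first, []).append(d_second)   # secondary assignment
    return columns


# Hypothetical dissimilarities of four observations from three medoids (0, 1, 2).
table = [
    [0.1, 0.4, 0.9],
    [0.5, 0.2, 0.3],
    [0.8, 0.7, 0.1],
    [0.2, 0.3, 0.6],
]
assignments = nearest_two_medoids(table)
columns = stripes_columns(assignments, n_clusters=3)
```

In this toy example, column 0 contains only its own members, while column 1 also collects secondary assignments from clusters 0 and 2. A column with few entries from other clusters corresponds to a cluster that is well separated from the rest.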
We explored a wide range for the number of clusters (5 to 30) and also considered other methods for determining the number of clusters. We do not claim to have the optimal number of clusters, but we found that having more clusters allowed for better separation of the clusters from an expenditure perspective. With 15 clusters, we can see that clusters 1 to 3 differ from the aggregate in their covariates and that their expenditure distributions are consistently the highest of all fifteen clusters. With a smaller number of clusters, this difference in expenditure distributions would not be evident, and the stripes plot shows more overlap between the high expenditure clusters and the other clusters.
Cite this article
Agterberg, J., Zhong, F., Crabb, R. et al. Cluster analysis application to identify groups of individuals with high health expenditures. Health Serv Outcomes Res Method (2020). https://doi.org/10.1007/s10742-020-00214-8
- Unsupervised machine learning
- Goodall similarities
- Partitioning around medoids
- Predicting rare events