Extending cluster-based ensemble learning through synthetic population generation for modeling disparities in health insurance coverage across Missouri

Mueller, Erik D.; Sandoval, J. S. Onésimo; Mudigonda, Srikanth P.; Elliott, Michael

doi:10.1007/s42001-019-00047-7


Research Article
Published: 10 June 2019

Volume 2, pages 271–291 (2019)

Journal of Computational Social Science



Abstract

In a previous study, Mueller et al. (ISPRS Int J Geo-Inf 8(1):13, 2019) presented a machine learning ensemble algorithm that uses K-means clustering as a preprocessing technique to increase predictive modeling performance. As a follow-on effort, this study tests the stability and sensitivity of that algorithm and presents a method for extracting localized and state-level variable importance information from the original dataset using synthetic population generation. By iteratively generating synthetic populations whose underlying statistical properties are similar to those of the original dataset, and by exploring the distribution of health insurance coverage across the state of Missouri, we identified three sets of variables: those that contributed to clustering decisions, those that contributed most significantly to modeling the distribution of health insurance status throughout the state, and those that were most influential in optimizing model performance, as measured by their impact on change in mean squared error (MSE). Results suggest that cluster-based preprocessing approaches for machine learning algorithms can significantly increase performance. They also demonstrate how synthetic populations can be used in performance measurement to identify and test the extent to which the statistical properties of variables within a dataset can vary without significant performance loss.
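
The idea of clustering as a preprocessing step can be illustrated with a minimal sketch. It is not the authors' implementation: the toy data, the single feature, the helper names (`kmeans_1d`, `fit_line`, `nearest`) and the use of ordinary least squares as the per-cluster model are all assumptions made for illustration. The sketch fits one model per K-means cluster and compares its in-sample MSE with that of a single global model.

```python
import random

def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)

def kmeans_1d(xs, k, iters=20):
    """Plain Lloyd's algorithm on one feature (hypothetical helper)."""
    ordered = sorted(xs)
    # Deterministic start: spread initial centroids across the sorted values.
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(x, centroids)].append(x)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx

def predict(x, model):
    a, b = model
    return a * x + b

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Toy data (assumed, not from the study): two regimes with different slopes.
rng = random.Random(0)
xs, ys = [], []
for _ in range(100):
    x = rng.gauss(0.0, 1.0)
    xs.append(x)
    ys.append(2.0 * x + 1.0 + rng.gauss(0.0, 0.1))
    x = rng.gauss(10.0, 1.0)
    xs.append(x)
    ys.append(-3.0 * x + 50.0 + rng.gauss(0.0, 0.1))

# Baseline: one model fitted to the whole dataset.
global_model = fit_line(xs, ys)
global_mse = mse([predict(x, global_model) for x in xs], ys)

# Cluster-based variant: cluster first, then fit one model per cluster.
centroids = kmeans_1d(xs, k=2)
labels = [nearest(x, centroids) for x in xs]
models = []
for j in range(len(centroids)):
    cx = [x for x, lab in zip(xs, labels) if lab == j]
    cy = [y for y, lab in zip(ys, labels) if lab == j]
    models.append(fit_line(cx, cy))
clustered_mse = mse([predict(x, models[lab]) for x, lab in zip(xs, labels)], ys)

print(f"global MSE:    {global_mse:.4f}")
print(f"clustered MSE: {clustered_mse:.4f}")
```

On data like this, where the relationship between predictor and outcome differs by subgroup, the per-cluster models should fit far better than the global one. Whether the same holds for a real dataset depends on how well the clusters align with genuinely different local relationships.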


References

Alpaydin, E. (2014). Introduction to machine learning. Cambridge: MIT Press.
Breiman, L., Cutler, A., Liaw, A., & Wiener, M. (2011). R-Statistics Package 'randomForest'. http://www.stat.berkeley.edu/users/breiman/RandomForests. Accessed 27 Dec 2018.
Claussen, P. E. C. (2012). Regression: when a nonparametric approach is most fitting (Doctoral dissertation).
Dayhoff, J. E., & DeLeo, J. M. (2001). Artificial neural networks: Opening the black box. Cancer: Interdisciplinary International Journal of the American Cancer Society, 91(8), 1615–1635.
DHSS (2007). Missouri Office of Rural Health Biennial Report, 2006–2007. https://health.mo.gov/living/families/ruralhealth/pdf/report07.pdf. Accessed 27 Dec 2018.
DHSS. (2015). Missouri bureau of healthcare analysis and data dissemination. Jefferson City: Missouri Department of Health and Senior Services.
ESRI. (2018). ArcGIS desktop—release 10.6. Redlands: Environmental Systems Research Institute.
Friedman, J.H., Hastie, T., & Tibshirani, R. (2010). glmnet: Lasso and elastic-net regularized generalized linear models. R package version, 1.1-5. http://CRAN.R-project.org/package=glmnet. Accessed 27 Dec 2018.
Goldman, D. P., Smith, J. P., & Sood, N. (2005). Legal status and health insurance among immigrants. Health Affairs, 24, 1640–1653.
Haas, J. S., Lee, L. B., Kaplan, C. P., Sonneborn, D., Phillips, K. A., & Liang, S.-Y. (2003). The association of race, socioeconomic status, and health insurance status with the prevalence of overweight among children and adolescents. American Journal of Public Health, 93, 2105–2110.
Juarez, P., Matthews-Juarez, P., Hood, D., Im, W., Levine, R., Kilbourne, B., & Estes, S. (2014). The public health exposome: A population-based, exposure science approach to health disparities research. International Journal of Environmental Research and Public Health, 11(12), 12866–12895.
Kennedy, B. P., Kawachi, I., Glass, R., & Prothrow-Stith, D. (1998). Income distribution, socioeconomic status, and self rated health in the United States: multilevel analysis. BMJ, 317, 917–921.
Kuhnert, P., Venables, B., & Zocchi, S.S. (2005). An introduction to R: Software for statistical modelling and computing: USP/ESALQ/LCE. https://www.uv.es/conesa/CursoR/material/Rlecturenotes.pdf. Accessed 27 Dec 2018.
Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F., Chang, C.-C., Lin, C.-C., & Meyer, M.D. (2018). R-Statistics Package ‘e1071’. https://cran.r-project.org/web/packages/e1071/e1071.pdf. Accessed 27 Dec 2018.
Mueller, E., Sandoval, J., Mudigonda, S., & Elliott, M. (2019). A cluster-based machine learning ensemble approach for geospatial data: Estimation of health insurance status in Missouri. ISPRS International Journal of Geo-Information, 8(1), 13.
Nasrabadi, N. M. (2007). Pattern recognition and machine learning. Journal of Electronic Imaging, 16, 049901.
Nowok, B., Raab, G. M., & Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74, 1–26.
Olden, J. D., & Jackson, D. A. (2002). Illuminating the “black box”: A randomization approach for understanding variable contributions in artificial neural networks. Ecological Modelling, 154, 135–150.
Priddy, K. L., & Keller, P. E. (2005). Artificial neural networks: An introduction. Bellingham: SPIE Press.
RStudio. (2012). RStudio: integrated development environment for R (p. 74). Boston: RStudio Inc.
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge: Cambridge University Press.
Trivedi, S., Pardos, Z., Sárközy, G., & Heffernan, N. (2010). Spectral clustering in educational data mining. Educational Data Mining 2011. http://educationaldatamining.org/EDM2011/wp-content/uploads/proc/edm2011_paper22_full_Trivedi.pdf. Accessed 27 Dec 2018.
Trivedi, S., Pardos, Z.A., & Heffernan, N.T. (2011). Clustering students to generate an ensemble to improve standard test score predictions. International Conference on Artificial Intelligence in Education (pp. 377–384). Springer.
USCB. (2016). US Census Bureau: American Community Survey (ACS). https://www.census.gov/programs-surveys/acs/. Accessed 27 Dec 2018.
USCB. (2017). The National Map: Transportation. US Census Bureau. Washington, DC, USA. https://www.usgs.gov/core-science-systems/national-geospatial-program/national-map. Accessed 27 Dec 2017.
USCB. (2018). Topologically Integrated Geographic Encoding and Referencing Datasets. US Census Bureau. Washington, DC, USA. https://www.census.gov/geo/maps-data/data/tiger.html. Accessed 27 Dec 2018.
USDA-ERS. (2000). Three Rural Definitions based on Census Places. United States Department of Agriculture, Economic Research Service. https://www.ers.usda.gov/data-products/rural-definitions/data-documentation-and-methods.aspx. Accessed 27 Dec 2018.
Wharam, J. F., Zhang, F., Landon, B. E., Soumerai, S. B., & Ross-Degnan, D. (2013). Low-socioeconomic-status enrollees in high-deductible plans reduced high-severity emergency care. Health Affairs, 32, 1398–1406.
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data mining: Practical machine learning tools and techniques. Burlington: Morgan Kaufmann.
Yao, W., Basu, S., Wei-Nchih, L. E. E., & Singhal, S. (2015). US Patent Application No. 14/762,590.


Author information

Authors and Affiliations

Integrated and Applied Sciences: Bioinformatics and Geospatial Biology, College of Arts and Sciences, Saint Louis University, St. Louis, MO, USA
Erik D. Mueller
Department of Sociology and Anthropology, College of Arts and Sciences, Saint Louis University, St. Louis, MO, USA
J. S. Onésimo Sandoval
School for Professional Studies, Saint Louis University, St. Louis, MO, USA
Srikanth P. Mudigonda
Department of Epidemiology and Biostatistics, College for Public Health and Social Justice, Saint Louis University, St. Louis, MO, USA
Michael Elliott


Contributions

This publication contributes to the research requirement for conferral of Erik Mueller’s Ph.D. in Integrated and Applied Sciences, Bioinformatics and Geospatial Biology. Erik Mueller planned, organized, carried out, and analyzed all aspects of this research study under the supervision of J. S. Onésimo Sandoval, who served as primary advisor and mentor, and of Srikanth Mudigonda and Michael Elliott, who served as dissertation committee members and secondary mentors.

Corresponding author

Correspondence to Erik D. Mueller.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Mueller, E.D., Sandoval, J.S.O., Mudigonda, S.P. et al. Extending cluster-based ensemble learning through synthetic population generation for modeling disparities in health insurance coverage across Missouri. J Comput Soc Sc 2, 271–291 (2019). https://doi.org/10.1007/s42001-019-00047-7


Received: 21 January 2019
Accepted: 28 May 2019
Published: 10 June 2019
Issue Date: 01 July 2019
DOI: https://doi.org/10.1007/s42001-019-00047-7
