Extending cluster-based ensemble learning through synthetic population generation for modeling disparities in health insurance coverage across Missouri

  • Research Article
Journal of Computational Social Science

Abstract

In a previous study, Mueller et al. (ISPRS Int J Geo-Inf 8(1):13, 2019) presented a machine learning ensemble algorithm that uses K-means clustering as a preprocessing step to improve predictive modeling performance. This follow-on study tests the stability and sensitivity of that algorithm and presents a method for extracting localized and state-level variable importance information from the original dataset using synthetic population generation. By iteratively generating synthetic populations whose underlying statistical properties resemble those of the original dataset, and by exploring the distribution of health insurance coverage across the state of Missouri, we identified three sets of variables: those that contributed to clustering decisions, those that contributed most to modeling the distribution of health insurance status throughout the state, and those that were most influential in optimizing model performance, as measured by their impact on the change in mean squared error (MSE). The results suggest that cluster-based preprocessing for machine learning algorithms can significantly increase performance. They also demonstrate how synthetic populations can be used in performance measurement to test how far the statistical properties of variables within a dataset can vary without significant performance loss.
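
The workflow summarized in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it is a minimal example, assuming plain linear models in place of the ensemble learners, a bootstrap with small Gaussian jitter as a crude stand-in for synthetic population generation, and invented two-variable data. All function names (`kmeans`, `fit_clusters`, `delta_mse`, `synthesize`) are hypothetical. It partitions the data with K-means, fits one model per cluster, and measures per-cluster variable importance as the change in MSE after shuffling one variable.

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=25):
    """Lloyd's algorithm with deterministic farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def fit_ols(X, y):
    """Least squares with intercept: normal equations solved by Gauss-Jordan."""
    rows = [(1.0,) + tuple(x) for x in X]
    n = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(A[r][i]))  # partial pivoting
        A[i], A[p] = A[p], A[i]
        pivot = A[i][i]
        A[i] = [v / pivot for v in A[i]]
        for r in range(n):
            if r != i:
                f = A[r][i]
                A[r] = [vr - f * vi for vr, vi in zip(A[r], A[i])]
    return [A[i][n] for i in range(n)]

def mse(w, X, y):
    pred = lambda x: w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return sum((pred(x) - t) ** 2 for x, t in zip(X, y)) / len(y)

def fit_clusters(X, y, k):
    """Partition with K-means, then fit one linear model per cluster."""
    labels = kmeans(X, k)
    parts = []
    for c in range(k):
        Xc = [x for x, lab in zip(X, labels) if lab == c]
        yc = [t for t, lab in zip(y, labels) if lab == c]
        parts.append((Xc, yc, fit_ols(Xc, yc)))
    return parts

def ensemble_mse(parts):
    n = sum(len(yc) for _, yc, _ in parts)
    return sum(mse(w, Xc, yc) * len(yc) for Xc, yc, w in parts) / n

def delta_mse(parts, rng):
    """Per-cluster change in MSE after shuffling each column within its cluster."""
    out = []
    for Xc, yc, w in parts:
        base, deltas = mse(w, Xc, yc), []
        for j in range(len(Xc[0])):
            col = [x[j] for x in Xc]
            rng.shuffle(col)
            Xp = [x[:j] + (v,) + x[j + 1:] for x, v in zip(Xc, col)]
            deltas.append(mse(w, Xp, yc) - base)
        out.append(deltas)
    return out

def synthesize(X, y, rng, jitter=0.05):
    """Assumed stand-in for a synthetic population: bootstrap rows, add noise."""
    idx = [rng.randrange(len(X)) for _ in X]
    Xs = [tuple(v + rng.gauss(0, jitter) for v in X[i]) for i in idx]
    ys = [y[i] + rng.gauss(0, jitter) for i in idx]
    return Xs, ys

# Invented toy data: two regimes in which x1 drives y and x2 is irrelevant.
rng = random.Random(42)
X, y = [], []
for _ in range(100):  # regime A: y rises with x1
    x1, x2 = rng.random(), rng.random()
    X.append((x1, x2))
    y.append(2 * x1 + rng.gauss(0, 0.1))
for _ in range(100):  # regime B: y falls with x1
    x1, x2 = 5 + rng.random(), 5 + rng.random()
    X.append((x1, x2))
    y.append(20 - 3 * x1 + rng.gauss(0, 0.1))

global_mse = mse(fit_ols(X, y), X, y)       # single model, no clustering
parts = fit_clusters(X, y, 2)
cluster_mse = ensemble_mse(parts)           # one model per cluster
local_importance = delta_mse(parts, rng)    # [delta for x1, delta for x2] per cluster

# Repeat on several synthetic populations to see whether the ranking is stable.
synthetic_runs = []
for _ in range(5):
    Xs, ys = synthesize(X, y, rng)
    synthetic_runs.append(delta_mse(fit_clusters(Xs, ys, 2), rng))

print("global MSE:", global_mse, "cluster MSE:", cluster_mse)
print("per-cluster delta-MSE:", local_importance)
```

Shuffling within a cluster gives a localized importance measure; averaging the per-cluster deltas weighted by cluster size would give one dataset-wide figure. Comparing the rankings across the synthetic runs is one simple way to probe stability in the sense the abstract describes.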

References

  1. Alpaydin, E. (2014). Introduction to machine learning. Cambridge: MIT Press.

  2. Breiman, L., Cutler, A., Liaw, A., & Wiener, M. (2011). R-Statistics Package 'randomForest'. http://www.stat.berkeley.edu/users/breiman/RandomForests. Accessed 27 Dec 2018.

  3. Claussen, P. E. C. (2012). Regression: when a nonparametric approach is most fitting (Doctoral dissertation).

  4. Dayhoff, J. E., & DeLeo, J. M. (2001). Artificial neural networks: Opening the black box. Cancer: Interdisciplinary International Journal of the American Cancer Society, 91(8), 1615–1635.

  5. DHSS (2007). Missouri Office of Rural Health Biennial Report, 2006–2007. https://health.mo.gov/living/families/ruralhealth/pdf/report07.pdf. Accessed 27 Dec 2018.

  6. DHSS. (2015). Missouri bureau of healthcare analysis and data dissemination. Jefferson City: Missouri Department of Health and Senior Services.

  7. ESRI. (2018). ArcGIS desktop—release 10.6. Redlands: Environmental Systems Research Institute.

  8. Friedman, J.H., Hastie, T., & Tibshirani, R. (2010). glmnet: Lasso and elastic-net regularized generalized linear models. R package version, 1.1-5. http://CRAN.R-project.org/package=glmnet. Accessed 27 Dec 2018.

  9. Goldman, D. P., Smith, J. P., & Sood, N. (2005). Legal status and health insurance among immigrants. Health Affairs, 24, 1640–1653.

  10. Haas, J. S., Lee, L. B., Kaplan, C. P., Sonneborn, D., Phillips, K. A., & Liang, S.-Y. (2003). The association of race, socioeconomic status, and health insurance status with the prevalence of overweight among children and adolescents. American Journal of Public Health, 93, 2105–2110.

  11. Juarez, P., Matthews-Juarez, P., Hood, D., Im, W., Levine, R., Kilbourne, B., & Estes, S. (2014). The public health exposome: A population-based, exposure science approach to health disparities research. International Journal of Environmental Research and Public Health, 11(12), 12866–12895.

  12. Kennedy, B. P., Kawachi, I., Glass, R., & Prothrow-Stith, D. (1998). Income distribution, socioeconomic status, and self rated health in the United States: multilevel analysis. BMJ, 317, 917–921.

  13. Kuhnert, P., Venables, B., & Zocchi, S.S. (2005). An introduction to R: Software for statistical modelling and computing: USP/ESALQ/LCE. https://www.uv.es/conesa/CursoR/material/Rlecturenotes.pdf. Accessed 27 Dec 2018.

  14. Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F., Chang, C.-C., Lin, C.-C., & Meyer, M.D. (2018). R-Statistics Package ‘e1071’. https://cran.r-project.org/web/packages/e1071/e1071.pdf. Accessed 27 Dec 2018.

  15. Mueller, E., Sandoval, J., Mudigonda, S., & Elliott, M. (2019). A cluster-based machine learning ensemble approach for geospatial data: Estimation of health insurance status in Missouri. ISPRS International Journal of Geo-Information, 8(1), 13.

  16. Nasrabadi, N. M. (2007). Pattern recognition and machine learning. Journal of Electronic Imaging, 16, 049901.

  17. Nowok, B., Raab, G. M., & Dibben, C. (2016). synthpop: Bespoke creation of synthetic data in R. Journal of Statistical Software, 74, 1–26.

  18. Olden, J. D., & Jackson, D. A. (2002). Illuminating the “black box”: A randomization approach for understanding variable contributions in artificial neural networks. Ecological Modelling, 154, 135–150.

  19. Priddy, K. L., & Keller, P. E. (2005). Artificial neural networks: An introduction. Bellingham: SPIE Press.

  20. RStudio. (2012). RStudio: integrated development environment for R (p. 74). Boston: RStudio Inc.

  21. Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge: Cambridge University Press.

  22. Trivedi, S., Pardos, Z., Sárközy, G., & Heffernan, N. (2010). Spectral clustering in educational data mining. Educational Data Mining 2011. http://educationaldatamining.org/EDM2011/wp-content/uploads/proc/edm2011_paper22_full_Trivedi.pdf. Accessed 27 Dec 2018.

  23. Trivedi, S., Pardos, Z.A., & Heffernan, N.T. (2011). Clustering students to generate an ensemble to improve standard test score predictions. International Conference on Artificial Intelligence in Education (pp. 377–384). Springer.

  24. USCB. (2016). US Census Bureau: American Community Survey (ACS). https://www.census.gov/programs-surveys/acs/. Accessed 27 Dec 2018.

  25. USCB. (2017). The National Map: Transportation. US Census Bureau. Washington, DC, USA. https://www.usgs.gov/core-science-systems/national-geospatial-program/national-map. Accessed 27 Dec 2017.

  26. USCB. (2018). Topologically Integrated Geographic Encoding and Referencing Datasets. US Census Bureau. Washington, DC, USA. https://www.census.gov/geo/maps-data/data/tiger.html. Accessed 27 Dec 2018.

  27. USDA-ERS. (2000). Three Rural Definitions based on Census Places. United States Department of Agriculture, Economic Research Service. https://www.ers.usda.gov/data-products/rural-definitions/data-documentation-and-methods.aspx. Accessed 27 Dec 2018.

  28. Wharam, J. F., Zhang, F., Landon, B. E., Soumerai, S. B., & Ross-Degnan, D. (2013). Low-socioeconomic-status enrollees in high-deductible plans reduced high-severity emergency care. Health Affairs, 32, 1398–1406.

  29. Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data mining: Practical machine learning tools and techniques. Burlington: Morgan Kaufmann.

  30. Yao, W., Basu, S., Wei-Nchih, L. E. E., & Singhal, S. (2015). US Patent Application No. 14/762,590.

Author information

Contributions

This publication contributes to the research requirement for conferral of Erik Mueller’s Ph.D. in Integrated and Applied Sciences, Bioinformatics and Geospatial Biology. Erik Mueller planned, organized, carried out, and analyzed all aspects of this research study under the supervision of J.S. Onésimo Sandoval, who served as primary advisor and mentor, and of Srikanth Mudigonda and Michael Elliott, who served as dissertation committee members and secondary mentors.

Corresponding author

Correspondence to Erik D. Mueller.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mueller, E.D., Sandoval, J.S.O., Mudigonda, S.P. et al. Extending cluster-based ensemble learning through synthetic population generation for modeling disparities in health insurance coverage across Missouri. J Comput Soc Sc 2, 271–291 (2019). https://doi.org/10.1007/s42001-019-00047-7

  • DOI: https://doi.org/10.1007/s42001-019-00047-7
