Multiple Imputation based Clustering Validation (MIV) for Big Longitudinal Trial Data with Missing Values in eHealth

  • Zhaoyang Zhang
  • Hua Fang
  • Honggang Wang
Systems-Level Quality Improvement
Part of the following topical collections:
  1. Advances in Big-Data based mHealth Theories and Applications


Web-delivered trials are an important component of eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal and high-dimensional and that contain missing values. Unsupervised learning methods have been widely applied in this area; however, validating the optimal number of clusters has been challenging. Building on our multiple imputation (MI) based fuzzy clustering method, MIfuzzy, we propose a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values and, more generally, for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by automatically searching and synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods, as well as the Xie and Beni index, which is specific to fuzzy clustering. We demonstrate MIV performance on a big longitudinal dataset from a real web-delivered trial and in simulation. The results indicate that the MI-based Xie and Beni index for fuzzy clustering is more appropriate for detecting the optimal number of clusters in such complex data. The MIV concept and algorithms could be easily adapted to other types of clustering that process big incomplete longitudinal trial data in eHealth services.
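To make the idea of an MI-based Xie and Beni index concrete, the following is a minimal sketch, not the authors' MIV or MIfuzzy implementation. It assumes rows are subjects and columns are time points, uses a plain fuzzy c-means, substitutes a crude stand-in for proper multiple imputation (random draws from each column's observed mean and standard deviation), and pools the index across imputed datasets by simple averaging. All function names and the pooling rule are hypothetical choices made for illustration.

```python
import numpy as np


def _sq_dist(X, centers):
    # Squared Euclidean distance from every row of X to every center: shape (n, c).
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)


def _centers(X, U, m):
    # Membership-weighted cluster centers: shape (c, d).
    W = U ** m
    return (W.T @ X) / W.sum(axis=0)[:, None]


def fuzzy_c_means(X, c, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Plain fuzzy c-means. Returns (centers, U) with U of shape (n, c)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=X.shape[0])  # each row sums to 1
    for _ in range(n_iter):
        d2 = np.maximum(_sq_dist(X, _centers(X, U, m)), 1e-12)
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        shift = np.abs(U_new - U).max()
        U = U_new
        if shift < tol:
            break
    return _centers(X, U, m), U


def xie_beni(X, centers, U, m=2.0):
    """Xie and Beni index: compactness / (n * minimum center separation).

    Lower values indicate more compact, better separated clusters.
    """
    compactness = ((U ** m) * _sq_dist(X, centers)).sum()
    c2 = _sq_dist(centers, centers)
    separation = c2[~np.eye(len(centers), dtype=bool)].min()
    return compactness / (X.shape[0] * separation)


def impute_once(X, rng):
    # ASSUMPTION: crude stand-in for one MI draw. Each missing entry is drawn
    # from a normal with the observed mean and std of its column.
    X_imp = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        obs = X[~miss, j]
        X_imp[miss, j] = rng.normal(obs.mean(), obs.std(), size=miss.sum())
    return X_imp


def mi_xie_beni(X, candidates=(2, 3, 4, 5), n_imputations=5, seed=0):
    """Average the Xie and Beni index over imputed datasets for each candidate
    number of clusters, and return the candidate with the lowest average."""
    rng = np.random.default_rng(seed)
    imputed = [impute_once(X, rng) for _ in range(n_imputations)]
    scores = {
        c: float(np.mean([xie_beni(Xi, *fuzzy_c_means(Xi, c)) for Xi in imputed]))
        for c in candidates
    }
    return min(scores, key=scores.get), scores
```

Calling `mi_xie_beni(X)` on an array `X` with `np.nan` marking missing values returns the selected number of clusters together with the averaged index for each candidate. In practice the imputation step would be replaced by a proper MI procedure, and averaging is only one of several plausible ways to synthesize the index across imputations.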


Big data, Validation, Multiple imputation, Fuzzy clustering, Missing data, Longitudinal trial



This research was supported by NIH grants R01 DA033323 and 1UL1RR031982-01 (Pilot Project) to Dr. Fang. We thank Dr. Thomas Houston for providing the longitudinal web-delivered QuitPrimo trial data. This work was partially supported by the National Science Foundation through awards IIS#1401711 and ECCS#1407882.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Department of Quantitative Health Science, University of Massachusetts Medical School, Worcester, USA
  2. Department of Electrical and Computer Engineering, University of Massachusetts Dartmouth, North Dartmouth, USA
