Self-organizing network for variable clustering
Advanced sensing and the Internet of Things generate big data, which provides an unprecedented opportunity for data-driven knowledge discovery. However, such data commonly involve a large number of variables (also called predictors or features). Complex interdependence structures among variables pose significant challenges to the traditional framework of predictive modeling. This paper presents a new self-organizing network methodology that characterizes the interrelationships among variables and clusters them into homogeneous subgroups for predictive modeling. Specifically, we develop a new approach, namely nonlinear coupling analysis, to measure variable-to-variable interdependence structures. Each variable is then represented as a node in a complex network, and nonlinear-coupling forces move these nodes to derive a self-organizing topology of the network. As a result, variables are clustered into sub-network communities. Simulation experiments demonstrate that the proposed method not only outperforms traditional variable clustering algorithms such as hierarchical clustering and oblique principal component analysis, but also effectively identifies interdependence structures among variables and further improves the performance of predictive modeling. Additionally, a real-world case study shows that the proposed method yields an average sensitivity of 96.80% and an average specificity of 92.62% in the identification of myocardial infarctions using sparse parameters of vectorcardiogram representation models. The proposed self-organizing network is generally applicable to predictive modeling in many disciplines that involve a large number of highly redundant variables.
Keywords: Self-organizing network; Variable clustering; Predictive modeling; Nonlinear coupling analysis; Myocardial infarction; Vectorcardiogram
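The general idea of coupling-driven self-organization can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify how nonlinear coupling analysis is computed, so the absolute Spearman rank correlation is used here purely as a stand-in measure, and the spring-style node updates, the distance threshold for communities, and all function names (`rank_coupling`, `self_organize`, `communities`) are assumptions made for illustration.

```python
import numpy as np

def rank_coupling(X):
    """Stand-in coupling measure (assumption): absolute Spearman rank
    correlation between the columns of X, returned as a p x p matrix."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    return np.abs(np.corrcoef(ranks, rowvar=False))

def self_organize(W, dim=2, iters=2000, step=0.5, seed=0):
    """Move nodes with spring-like forces so that the distance between
    nodes i and j approaches 1 - W[i, j] (strong coupling -> close)."""
    rng = np.random.default_rng(seed)
    p = W.shape[0]
    pos = rng.normal(scale=0.5, size=(p, dim))
    target = 1.0 - W
    for _ in range(iters):
        diff = pos[:, None, :] - pos[None, :, :]      # pairwise offsets
        dist = np.linalg.norm(diff, axis=2) + 1e-9    # avoid divide-by-zero
        stretch = (dist - target) / dist              # >0 too far, <0 too close
        np.fill_diagonal(stretch, 0.0)
        force = -(stretch[:, :, None] * diff).sum(axis=1)
        pos += (step / p) * force
    return pos

def communities(pos, radius=0.5):
    """Label connected groups of nodes that lie within `radius` of each other."""
    p = len(pos)
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    labels = -np.ones(p, dtype=int)
    current = 0
    for start in range(p):
        if labels[start] >= 0:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero((dist[i] < radius) & (labels < 0)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

# Synthetic example: six variables driven by two independent latent factors,
# three variables per factor, related through monotone nonlinear transforms.
rng = np.random.default_rng(1)
z1, z2 = rng.normal(size=(2, 500))
noise = 0.05 * rng.normal(size=(500, 6))
X = np.column_stack([z1, z1 ** 3, np.tanh(2 * z1),
                     z2, z2 ** 3, np.tanh(2 * z2)]) + noise

W = rank_coupling(X)
labels = communities(self_organize(W))
print(labels)
```

In this toy setting the first three variables and the last three should end up in separate communities. A rank-based measure only captures monotone dependence, so a measure designed for general nonlinear coupling would be needed to reproduce the behavior the paper describes.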
The authors would like to thank the National Science Foundation (CMMI-1646660, CMMI-1617148, CMMI-1619648, and IOS-1146882) for supporting this research. The authors also thank the Harold and Inge Marcus Career Professorship (HY) for additional financial support. The authors are very grateful to the anonymous reviewers for their constructive suggestions, which greatly improved the quality of this paper.