Annals of Operations Research, Volume 263, Issue 1–2, pp 119–140

Self-organizing network for variable clustering

Data Mining and Analytics


Advanced sensing and the Internet of Things generate big data, which provides an unprecedented opportunity for data-driven knowledge discovery. However, such data commonly involve a large number of variables (also called predictors or features). Complex interdependence structures among these variables pose significant challenges to the traditional framework of predictive modeling. This paper presents a new self-organizing network methodology that characterizes the interrelationships among variables and clusters them into homogeneous subgroups for predictive modeling. Specifically, we develop a new approach, nonlinear coupling analysis, to measure variable-to-variable interdependence structures. Each variable is then represented as a node in a complex network, and nonlinear-coupling forces move these nodes to derive a self-organizing topology of the network. As a result, variables are clustered into sub-network communities. Simulation experiments demonstrate that the proposed method not only outperforms traditional variable clustering algorithms, such as hierarchical clustering and oblique principal component analysis, but also effectively identifies interdependence structures among variables and further improves the performance of predictive modeling. Additionally, a real-world case study shows that the proposed method yields an average sensitivity of 96.80% and an average specificity of 92.62% in the identification of myocardial infarctions using sparse parameters of vectorcardiogram representation models. The proposed self-organizing network is generally applicable to predictive modeling in many disciplines that involve a large number of highly redundant variables.
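
The general idea of letting pairwise dependence act as a force on variable nodes can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not define nonlinear coupling analysis or the force model, so absolute Spearman rank correlation stands in for the coupling measure, and a simple spring attraction with softened repulsion stands in for the coupling forces. All function names, the synthetic data, and the parameters (`step`, `repulsion`, `radius`) are hypothetical choices made for illustration only.

```python
import numpy as np


def make_data(n=2000, seed=1):
    """Synthetic data: 3 latent factors, each observed through 4 noisy monotone transforms."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n, 3))
    transforms = [lambda z: z, np.tanh, lambda z: z ** 3, np.exp]
    cols, groups = [], []
    for g in range(3):
        for f in transforms:
            cols.append(f(latent[:, g]) + 0.1 * rng.normal(size=n))
            groups.append(g)
    return np.column_stack(cols), np.array(groups)


def dependence_matrix(X):
    """Pairwise dependence between the columns of X.

    ASSUMPTION: absolute Spearman rank correlation is only a stand-in;
    it is not the paper's nonlinear coupling analysis.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    W = np.abs(np.corrcoef(ranks, rowvar=False))
    np.fill_diagonal(W, 0.0)  # no self-coupling
    return W


def self_organize(W, n_iter=2000, step=0.05, repulsion=0.1, seed=0):
    """Move one 2-D node per variable under attraction (weighted by W) and repulsion."""
    rng = np.random.default_rng(seed)
    pos = rng.normal(size=(W.shape[0], 2))
    for _ in range(n_iter):
        diff = pos[:, None, :] - pos[None, :, :]      # diff[i, j] = pos_i - pos_j
        sq = (diff ** 2).sum(axis=2) + 1e-2           # softened squared distance
        attract = -(W[:, :, None] * diff).sum(axis=1)            # pull toward coupled nodes
        repel = repulsion * (diff / sq[:, :, None]).sum(axis=1)  # push away from every node
        pos = pos + step * (attract + repel)
    return pos


def communities(pos, radius):
    """Label connected components of the graph linking nodes closer than `radius`."""
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    labels = -np.ones(len(pos), dtype=int)
    current = 0
    for start in range(len(pos)):
        if labels[start] >= 0:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero((dist[i] < radius) & (labels < 0)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels


if __name__ == "__main__":
    X, groups = make_data()
    pos = self_organize(dependence_matrix(X))
    # radius is a hypothetical tuning parameter, not a value from the paper
    print("true groups:  ", groups)
    print("found labels: ", communities(pos, radius=0.8))
```

In this sketch, strongly dependent variables settle close together while weakly dependent ones drift apart, so groups can be read off the final layout. The distance threshold used to extract communities is an arbitrary choice and would need tuning on real data.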


Self-organizing network; Variable clustering; Predictive modeling; Nonlinear coupling analysis; Myocardial infarction; Vectorcardiogram



The authors would like to thank the National Science Foundation (CMMI-1646660, CMMI-1617148, CMMI-1619648, and IOS-1146882) for supporting this research. The authors also thank the Harold and Inge Marcus Career Professorship (HY) for additional financial support. The authors are very grateful to the anonymous reviewers for their constructive suggestions, which greatly improved the quality of this paper.



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  1. Arbor Research Collaborative for Health, Ann Arbor, USA
  2. Harold and Inge Marcus Department of Industrial and Manufacturing Engineering, Pennsylvania State University, University Park, USA
