An open-source software package for multivariate modeling and clustering: applications to air quality management

Abstract

This paper presents rSCA, an open-source software package that implements a stepwise cluster analysis method and serves as a statistical tool for modeling the relationships between multiple dependent and independent variables. The package handles both continuous and discrete variables, as well as nonlinear relationships among them. It divides the sample set of dependent variables into subsets (subclusters) through a series of cutting and merging operations based on the theory of multivariate analysis of variance (MANOVA). The modeling results are expressed as a cluster tree, which contains the intermediate and leaf subclusters together with the flow path from the root of the tree to each leaf subcluster, as defined by the sequence of cutting and merging actions. The rSCA package is easy to use and freely available at http://cran.r-project.org/package=rSCA. Applying the package to air quality management in an urban environment demonstrates its effectiveness in handling the complicated relationships among multiple variables in real-world problems.
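To make the MANOVA-based cutting operation concrete, the following is a minimal Python sketch, not the rSCA implementation or its API. It assumes the usual two-group form of Wilks' lambda, Λ = |W| / |T| (pooled within-group scatter over total scatter), and the standard exact F statistic for two groups, F = ((1 − Λ)/Λ) · ((n − d − 1)/d) with (d, n − d − 1) degrees of freedom. The function names `scatter`, `det`, `wilks_lambda`, and `best_cut`, and the sample data, are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Rows = Sequence[Sequence[float]]


def scatter(rows: Rows) -> List[List[float]]:
    """Sum-of-squares-and-cross-products matrix about the mean of `rows`."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows)
             for j in range(d)] for i in range(d)]


def det(m: List[List[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, result = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return result


def wilks_lambda(group_e: Rows, group_f: Rows) -> float:
    """Lambda = |W| / |T|; smaller values mean better-separated groups.

    Assumes the total scatter matrix is nonsingular (more samples than
    dependent variables and no exact collinearity).
    """
    se, sf = scatter(group_e), scatter(group_f)
    d = len(se)
    w = [[se[i][j] + sf[i][j] for j in range(d)] for i in range(d)]
    t = scatter(list(group_e) + list(group_f))
    return det(w) / det(t)


def best_cut(x: Sequence[float], y: Rows) -> Optional[Tuple[float, float, float]]:
    """Order samples by one independent variable x and try every split.

    Returns (lambda, threshold, F) for the split with the smallest Wilks'
    lambda, or None if x takes a single value and no cut is possible.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    n, d = len(ys), len(ys[0])
    best = None
    for k in range(1, n):
        if xs[k - 1] == xs[k]:
            continue  # cannot cut between equal values of x
        lam = wilks_lambda(ys[:k], ys[k:])
        if best is None or lam < best[0]:
            best = (lam, (xs[k - 1] + xs[k]) / 2)
    if best is None:
        return None
    lam, threshold = best
    f_stat = (1 - lam) / lam * (n - d - 1) / d  # compare with F(d, n - d - 1)
    return lam, threshold, f_stat


# Hypothetical data: one independent variable, two dependent variables.
x = [1, 2, 3, 4, 5, 6]
y = [[1.0, 10.2], [1.2, 9.9], [0.9, 10.1],
     [5.1, 20.0], [4.8, 20.3], [5.0, 19.8]]
print(best_cut(x, y))
```

In a stepwise procedure of this kind, a cut would be accepted only if the F statistic exceeds the critical value at the chosen significance level. The same statistic can be used in the opposite direction for merging: two subclusters whose difference is not significant are candidates to be merged. Repeating both steps until no further cut or merge is possible would yield the cluster tree described above.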

Acknowledgments

This research was supported by the Program for Innovative Research Team in University (IRT1127), the 111 Project (B14008), and the Natural Sciences and Engineering Research Council of Canada.

Author information

Corresponding author

Correspondence to Guohe Huang.

Additional information

Responsible editor: Marcus Schulz

Electronic supplementary material

Below is the link to the electronic supplementary material.

Text S1

Description of the functionality of the rSCA package. (PDF 209 kb)

Text S2

Sample codes and outputs of the application for multivariate modeling. (PDF 184 kb)

Text S3

Sample codes and outputs of the application for multivariate clustering. (PDF 166 kb)

About this article

Cite this article

Wang, X., Huang, G., Zhao, S. et al. An open-source software package for multivariate modeling and clustering: applications to air quality management. Environ Sci Pollut Res 22, 14220–14233 (2015). https://doi.org/10.1007/s11356-015-4664-7

Keywords

  • Multivariate modeling
  • Multivariate clustering
  • Stepwise cluster analysis
  • Cluster tree
  • Air quality management