Abstract
This paper presents rSCA, an open-source software package built on a stepwise cluster analysis method that serves as a statistical tool for modeling the relationships between multiple dependent and independent variables. The rSCA package deals efficiently with both continuous and discrete variables, as well as with nonlinear relationships between them. It divides the sample sets of dependent variables into subsets (or subclusters) through a series of cutting and merging operations based on the theory of multivariate analysis of variance (MANOVA). The modeling results are given as a cluster tree, which includes both intermediate and leaf subclusters together with the flow paths from the root of the tree to each leaf subcluster, each path specified by a series of cutting and merging actions. The rSCA package is easy to use and freely available at http://cran.r-project.org/package=rSCA. By applying the package to air quality management in an urban environment, we demonstrate its effectiveness in dealing with complicated relationships among multiple variables in real-world problems.
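To make the MANOVA-based cutting operation more concrete, the following is a minimal sketch of one cutting step, written in plain Python rather than R. It is not the rSCA implementation: the function names (`scatter_matrix`, `wilks_lambda`, `f_statistic`, `best_cut`) and the `min_size` parameter are hypothetical and chosen for illustration. The sketch assumes the usual two-group form of Wilks' lambda, the ratio of the determinants of the within-group and total scatter matrices, and its exact F transformation for two groups; the abstract does not state which statistic or threshold the package applies.

```python
def scatter_matrix(rows):
    """Sum of squares and cross-products of the rows about their mean."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows)
             for j in range(d)] for i in range(d)]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size, det = len(m), 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det


def wilks_lambda(group_a, group_b):
    """Wilks' lambda |W| / |T| for two groups of response vectors."""
    w_a, w_b = scatter_matrix(group_a), scatter_matrix(group_b)
    d = len(w_a)
    within = [[w_a[i][j] + w_b[i][j] for j in range(d)] for i in range(d)]
    total = scatter_matrix(group_a + group_b)
    return determinant(within) / determinant(total)


def f_statistic(lam, n_a, n_b, d):
    """Two-group F transform of lambda, with (d, n_a + n_b - d - 1) degrees of freedom."""
    return (1.0 - lam) / lam * (n_a + n_b - d - 1) / d


def best_cut(x, y, min_size=3):
    """Sort samples by one predictor x and return (lambda, cut point) for the
    split of the responses y with the smallest lambda.

    min_size is an assumed safeguard that keeps both subclusters non-trivial.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    best = None
    for i in range(min_size, len(xs) - min_size + 1):
        if xs[i - 1] == xs[i]:
            continue  # no cut point can separate tied predictor values
        lam = wilks_lambda(ys[:i], ys[i:])
        if best is None or lam < best[0]:
            best = (lam, 0.5 * (xs[i - 1] + xs[i]))
    return best
```

A small lambda means the two candidate subclusters differ strongly in their response means. In a stepwise scheme of this kind, a cut would be accepted only if the corresponding F value exceeds a chosen critical value, and two subclusters would be merged when the same test finds no significant difference between them.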
Acknowledgments
This research was supported by the Program for Innovative Research Team in University (IRT1127), the 111 Project (B14008), and the Natural Sciences and Engineering Research Council of Canada.
Additional information
Responsible editor: Marcus Schulz
Electronic supplementary material
Below is the link to the electronic supplementary material.
Text S1
Description of the functionality of the rSCA package. (PDF 209 kb)
Text S2
Sample codes and outputs of the application for multivariate modeling. (PDF 184 kb)
Text S3
Sample codes and outputs of the application for multivariate clustering. (PDF 166 kb)
Cite this article
Wang, X., Huang, G., Zhao, S. et al. An open-source software package for multivariate modeling and clustering: applications to air quality management. Environ Sci Pollut Res 22, 14220–14233 (2015). https://doi.org/10.1007/s11356-015-4664-7
DOI: https://doi.org/10.1007/s11356-015-4664-7