Abstract
Mechanistic hydrologic and water quality models provide useful alternatives for estimating water quality in unmonitored streams. However, developing these elaborate models for large watersheds can be time-consuming and expensive, and calibration is difficult when monitored in-stream water quality data are limited in space and/or time. The main objective of this research was to investigate different approaches for developing multivariate analysis models as alternative methods for rapidly assessing relationships between spatio-temporal physical attributes of the watershed and water quality conditions in monitored streams, and then to use the developed relationships to estimate water quality conditions in unmonitored streams. The study compares the use of various statistical estimates (mean, geometric mean, trimmed mean, and median) of monitored water quality variables to represent annual and seasonal water quality conditions. The relationship between these estimates and the spatial data is then modeled via linear and non-linear multivariate methods. Overall, the non-linear techniques for classification outperformed the linear techniques, with an average cross-validation accuracy of 79.7%. Additionally, the geometric-mean-based models outperformed models based on the other statistical indicators, with an average cross-validation accuracy of 80.2%. Dividing the data into annual and quarterly datasets also offered important insights into the behavior of certain water quality variables affected by seasonal variations. The research provides useful guidance on the use and interpretation of the various statistical estimates and statistical models for multivariate water quality analyses.
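As a rough illustration of the four statistical estimates named in the abstract, the sketch below computes each one for a single set of measurements. It is not the authors' procedure: the function names `trimmed_mean` and `summarize`, the 10% trimming proportion, and the sample values are all assumptions made up for this example.

```python
import statistics


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the observations from each tail.

    The 10% default is an illustrative assumption, not a value from the study.
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations dropped per tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)


def summarize(values, proportion=0.1):
    """Return the four candidate summary statistics for one variable."""
    return {
        "mean": statistics.fmean(values),
        # The geometric mean requires strictly positive values.
        "geometric_mean": statistics.geometric_mean(values),
        "trimmed_mean": trimmed_mean(values, proportion),
        "median": statistics.median(values),
    }


# Hypothetical concentrations (made-up numbers) with one extreme high value.
samples = [1.0, 2.0, 2.0, 4.0, 4.0, 8.0, 8.0, 16.0, 16.0, 1000.0]
summary = summarize(samples)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this made-up sample the single extreme value pulls the arithmetic mean far above the other three estimates, which is one reason the choice of summary statistic can matter for skewed water quality data.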
Cite this article
Gamble, A., Babbar-Sebens, M. On the use of multivariate statistical methods for combining in-stream monitoring data and spatial analysis to characterize water quality conditions in the White River Basin, Indiana, USA. Environ Monit Assess 184, 845–875 (2012). https://doi.org/10.1007/s10661-011-2005-y
DOI: https://doi.org/10.1007/s10661-011-2005-y