Environmental Monitoring and Assessment, Volume 184, Issue 2, pp 845–875

On the use of multivariate statistical methods for combining in-stream monitoring data and spatial analysis to characterize water quality conditions in the White River Basin, Indiana, USA

  • Andrew Gamble
  • Meghna Babbar-Sebens


Mechanistic hydrologic and water quality models provide useful alternatives for estimating water quality in unmonitored streams. However, developing these elaborate models for large watersheds can be time-consuming and expensive, and calibration is challenging when monitored in-stream water quality data are limited in space and/or time. The main objective of this research was to investigate different approaches for developing multivariate analysis models as alternative methods for rapidly assessing relationships between spatio-temporal physical attributes of the watershed and water quality conditions in monitored streams, and then to use the developed relationships to estimate water quality conditions in unmonitored streams. The study compares the use of various statistical estimates (mean, geometric mean, trimmed mean, and median) of monitored water quality variables to represent annual and seasonal water quality conditions. The relationship between these estimates and the spatial data is then modeled via linear and non-linear multivariate methods. Overall, the non-linear techniques for classification outperformed the linear techniques, with an average cross-validation accuracy of 79.7%. Additionally, the geometric-mean-based models outperformed models based on the other statistical indicators, with an average cross-validation accuracy of 80.2%. Dividing the data into annual and quarterly datasets also offered important insights into the behavior of certain water quality variables affected by seasonal variations. The research provides useful guidance on the use and interpretation of the various statistical estimates and statistical models for multivariate water quality analyses.
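As a rough illustration of the four statistical estimates named in the abstract, the following Python sketch computes the mean, geometric mean, trimmed mean, and median for a single water quality variable. It is not the authors' procedure: the sample concentrations, the function names, and the 10% trimming proportion are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import statistics


def trimmed_mean(values, proportion=0.1):
    """Mean after dropping the lowest and highest `proportion` of sorted values.

    `proportion` is an assumed setting and should be below 0.5.
    """
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations cut from each tail
    kept = ordered[k:len(ordered) - k]
    return statistics.fmean(kept)


def summarize(values, proportion=0.1):
    """Return the four candidate estimates for one monitored variable."""
    return {
        "mean": statistics.fmean(values),
        "geometric_mean": statistics.geometric_mean(values),  # needs positive values
        "trimmed_mean": trimmed_mean(values, proportion),
        "median": statistics.median(values),
    }


# Hypothetical concentrations (mg/L) at one station, with one high outlier.
samples = [1.0, 2.0, 2.0, 4.0, 4.0, 8.0, 8.0, 16.0, 16.0, 100.0]
summary = summarize(samples)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With skewed data such as these hypothetical samples, the arithmetic mean is pulled toward the outlier, while the geometric mean, trimmed mean, and median stay closer to the bulk of the observations. That difference is one reason the choice of estimate can matter when summarizing annual or seasonal conditions.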


Keywords: Water quality; Principal component analysis; Linear discriminant analysis; Kohonen self-organizing map; Support vector machine; Cluster analysis





Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. Department of Earth Science, Indiana University Purdue University Indianapolis, Indianapolis, USA
