Abstract
Accurate and inexpensive identification of potentially contaminated wells is critical for water resources protection and management. The objectives of this study are to (1) assess the suitability of approximation tools such as neural networks (NN) and support vector machines (SVM), integrated in a geographic information system (GIS), for identifying contaminated wells and (2) use logistic regression and feature selection methods to identify variables that are significant for the transport of contaminants into and through the soil profile to the groundwater. Fourteen GIS-derived soil hydrogeologic and land-use parameters were used as initial inputs. Well water quality data (nitrate-N) from 6,917 wells, provided by the Florida Department of Environmental Protection (USA), were used as the output target class. Logistic regression and feature selection reduced the number of input variables to nine. Receiver operating characteristic (ROC) curves were used to evaluate these approximation tools. The NN outperformed the SVM, especially on training data, while results on testing data were comparable. Feature selection did not improve accuracy; however, it increased the sensitivity, or true positive rate (TPR). Thus, a higher TPR was obtainable with fewer variables.
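The distinction between accuracy and TPR drawn above can be illustrated with a minimal sketch. The labels and predictions below are hypothetical and are not taken from the study's well data; the function names are likewise illustrative. The sketch only shows how two classifiers can share the same accuracy while differing in sensitivity.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels, 1 = contaminated well."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    """Fraction of all wells classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def true_positive_rate(y_true, y_pred):
    """Sensitivity: fraction of contaminated wells that were flagged."""
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


# Hypothetical data: 4 contaminated wells (1) and 6 clean wells (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical predictions from a model using all input variables.
pred_all_inputs = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Hypothetical predictions from a model using a reduced variable set.
pred_selected = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

for name, pred in [("all inputs", pred_all_inputs), ("selected", pred_selected)]:
    print(name, accuracy(y_true, pred), true_positive_rate(y_true, pred))
```

In this made-up case both prediction sets classify 8 of 10 wells correctly, but the second flags 3 of the 4 contaminated wells instead of 2, at the cost of one false alarm. For screening wells, that trade-off may be acceptable, which is why TPR is reported alongside accuracy.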







Acknowledgements
This research was funded by FWRRC Subcontract UF-EIES-0404012-USF. I would like to thank N. Candade and R. Stetson for their help with this research (database creation).
Cite this article
Dixon, B. A case study using support vector machines, neural networks and logistic regression in a GIS to identify wells contaminated with nitrate-N. Hydrogeol J 17, 1507–1520 (2009). https://doi.org/10.1007/s10040-009-0451-1
DOI: https://doi.org/10.1007/s10040-009-0451-1


