Abstract
Soil pollution levels can be quantified through sampling and experimental analysis. However, because funding and human resources are limited, samples are collected at discrete, widely spaced points, which is insufficient to characterize an entire study area. Spatial prediction is therefore required to comprehensively investigate potentially contaminated areas. Machine learning models, which can simulate complex nonlinear relationships between a variety of environmental conditions and soil contamination, have consequently become popular tools for predicting soil pollution. This study reviews the characteristics, advantages, and applications of machine learning models used to predict soil pollution. Satisfactory model performance generally requires the following: 1) selection of the most appropriate model with the required structure; 2) selection of appropriate independent variables related to pollutant sources and pathways to improve model interpretability; 3) improvement of model reliability through comprehensive model evaluation; and 4) integration of geostatistics with the machine learning model. As environmental data become richer and algorithms continue to develop, machine learning is expected to become a powerful tool for predicting the spatial distribution of soil contamination and identifying its sources.
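As a rough illustration of how requirements 1) to 3) fit together, the sketch below predicts a pollutant concentration at unsampled locations and evaluates the model by leave-one-out cross-validation. It is a minimal example under stated assumptions, not a method taken from the reviewed studies: a simple k-nearest-neighbour regressor stands in for whichever model is selected, the survey data are synthetic, and the point source, decay rate, noise level, and all function names are hypothetical.

```python
import math
import random


def knn_predict(train_X, train_y, query, k=3):
    """Average the targets of the k training samples closest to `query`."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def leave_one_out(X, y, k=3):
    """Return (RMSE, R^2) from leave-one-out cross-validation."""
    preds = []
    for i in range(len(X)):
        # Hold out sample i and predict it from all remaining samples.
        preds.append(knn_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i], k))
    mean_y = sum(y) / len(y)
    ss_res = sum((p - t) ** 2 for p, t in zip(preds, y))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return math.sqrt(ss_res / len(y)), 1.0 - ss_res / ss_tot


# Hypothetical synthetic survey: 60 sampling points on a 10 km x 10 km area,
# with concentration decaying away from an assumed point source plus noise.
random.seed(0)
SOURCE = (5.0, 5.0)


def features(x, y):
    """Coordinates plus distance to the assumed source (all in km)."""
    return [x, y, math.dist((x, y), SOURCE)]


X, conc = [], []
for _ in range(60):
    fx = features(random.uniform(0, 10), random.uniform(0, 10))
    X.append(fx)
    conc.append(100.0 * math.exp(-fx[2] / 3.0) + random.gauss(0, 2.0))

# Model evaluation, then prediction at two unsampled locations.
rmse, r2 = leave_one_out(X, conc, k=3)
near_source = knn_predict(X, conc, features(5.0, 5.0), k=3)
far_corner = knn_predict(X, conc, features(0.0, 0.0), k=3)
print(f"LOO RMSE={rmse:.2f}, R2={r2:.2f}")
print(f"predicted near source={near_source:.1f}, far corner={far_corner:.1f}")
```

The distance-to-source feature is an example of an independent variable tied to a pollutant source and pathway. In a real study it would be replaced by measured covariates, and the evaluation would typically report several metrics on independent validation data.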
Acknowledgements
This research was supported by the National Key Research and Development Program of China (No. 2018YFC1800100) and the National Natural Science Foundation of China (No. 42277475).
Additional information
Highlights
• A review of machine learning (ML) for spatial prediction of soil contamination.
• ML has achieved significant breakthroughs in soil contamination prediction.
• A structured guideline for using ML in soil contamination is proposed.
• The guideline includes variable selection, model evaluation, and interpretation.
Special Issue—Artificial Intelligence/Machine Learning on Environmental Science & Engineering (Responsible Editors: Yongsheng Chen, Xiaonan Wang, Joe F. Bozeman III & Shouliang Yi)
Cite this article
Zhang, Y., Lei, M., Li, K. et al. Spatial prediction of soil contamination based on machine learning: a review. Front. Environ. Sci. Eng. 17, 93 (2023). https://doi.org/10.1007/s11783-023-1693-1