Introduction

Machine learning models are increasingly widely used to predict or interpolate the spatial distribution of hazardous chemical constituents of a wide variety of environmental media, including groundwater (e.g. Amini et al. 2008; Winkel et al. 2008; Park et al. 2016; Tesoriero et al. 2017; Podgorski et al. 2018, 2022; Podgorski and Berg 2020; Chakraborty et al. 2020; Tan et al. 2020; Mukherjee et al. 2021; Erickson et al. 2021; Lombard et al. 2021; Wu et al. 2020, 2021a, 2021b; Perović et al. 2021; Connolly et al. 2021; Cao et al. 2022; Kumar and Pati 2022; Ottong et al. 2022; Knierim et al. 2022; Dhamija and Joshi 2022) and soils (Lado et al. 2008; Li et al. 2022; Hengl et al. 2017; Mikkonen et al. 2018; de Menezes et al. 2020; Jia et al. 2021; Kebonye et al. 2021).

Whilst the distribution of environmental chemical hazards has been determined by logistic regression models (e.g. arsenic in groundwater (Wu et al. 2020); fluoride in groundwater (Podgorski et al. 2018)) or boosted regression trees (e.g. arsenic in groundwater (Tan et al. 2020); nitrate in groundwater (Sajed-Hosseini et al. 2018)), there is an increasing preponderance of supervised random forest models, which are ensembles of decision trees. Because of the widely acknowledged (Millot et al. 2011; Bhattacharya et al. 2017; Bretzler et al. 2017b; UNICEF/WHO 2018; Polya et al. 2019) importance of arsenic in groundwater used as drinking water in contributing to massive detrimental public health outcomes, such random forest models have been widely used to predict the distribution of arsenic contamination in groundwater. These models have been rendered at various spatial scales, including the global scale (Podgorski and Berg 2020); the country scale: India (Podgorski et al. 2020; Wu et al. 2021b; Mukherjee et al. 2021), China (Rodríguez-Lado et al. 2013), Burkina Faso (Bretzler et al. 2017a), and Uruguay (Wu et al. 2021a); and more local scales: Gujarat state (Wu et al. 2020), Varanasi (Uttar Pradesh) (Kumar and Pati 2022), and Purulia (West Bengal) (Ruidas et al. 2022).

Machine learning random forest models are used to predict binary target dependent variables (e.g. high (value assigned = 1) or low (value assigned = 0) hazard compared to a user-defined hazard concentration). These models generate four different types of prediction: true positive (TP), true negative (TN), false positive (FP), and false negative (FN) with respect to the binary target variable. In the context of classifying an environmental hazard variable as “high” or “low”, model sensitivity (also known as the true-positive rate) refers to the proportion of truly “high” hazard values being correctly modelled as being “high”; whilst model specificity (also known as the true-negative rate), refers to the proportion of truly “low” hazard values being correctly modelled as being “low”. In general, there is a trade-off between model sensitivity and model specificity.
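As an illustration only, the four prediction types and the resulting sensitivity and specificity can be computed as in the following Python sketch. The function names, the toy data, and the convention that a predicted probability greater than or equal to the cutoff is classified as “high” are assumptions made for this example and are not taken from the modelling workflow of the study.

```python
def confusion_counts(y_true, prob, cutoff):
    """Return (TP, TN, FP, FN), classifying prob >= cutoff as 'high' (1)."""
    tp = tn = fp = fn = 0
    for truth, p in zip(y_true, prob):
        predicted = 1 if p >= cutoff else 0  # assumed convention: >= cutoff is "high"
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def sensitivity_specificity(y_true, prob, cutoff):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, tn, fp, fn = confusion_counts(y_true, prob, cutoff)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical observations (1 = "high", 0 = "low") and modelled probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
prob = [0.9, 0.6, 0.3, 0.4, 0.1, 0.7, 0.2, 0.8]

print(sensitivity_specificity(y_true, prob, 0.5))   # (0.75, 0.75)
print(sensitivity_specificity(y_true, prob, 0.25))  # (1.0, 0.5)
```

With these toy data, lowering the cutoff from 0.5 to 0.25 raises sensitivity from 0.75 to 1.0 but lowers specificity from 0.75 to 0.5.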

Thus, all machine learning models which ultimately classify areas as being of “high” or “low” hazard require some criterion to be used to determine the balance to be made between model sensitivity and model specificity or between other measures of model accuracy (see Table 1). Typically, the criteria used in previously published studies of machine learning modelling of the spatial distributions of chemical contaminants in environmental media have been based on either (i) adopting a cutoff value where specificity = sensitivity (Podgorski et al. 2020; Wu et al. 2021a) or (ii) simply using a cutoff value of 0.5 (Wu et al. 2020, 2021b). Although these cutoff criteria are convenient to use, there is no objective reason to choose one criterion over another, other than perhaps intellectual or artistic elegance. Further, as shown in Table 1, there are many other criteria that could be, and have been, used as the basis for determining cutoff values: again, the selection of one of these cutoff criteria in preference to another would largely seem to be a matter of convenience or personal preference rather than being based on any documented comprehensive objective reasoning.

Table 1 List of commonly used criteria for determining cutoff values for machine learning models. (Modified after Lopez-Raton et al. (2019))

In contrast to current published models of environmental chemical hazard distribution, models of public health relevant tests that aim to classify patients for the purposes of indicating preferred methods of treatment (or, indeed, of non-treatment) (Brinati et al. 2020) now widely use cutoff criteria that explicitly take into account the relative utility or costs of false-positive and false-negative classifications. The substantial disparity that may exist between the costs of false positives (e.g. arising from recalling non-diseased patients for unnecessary further diagnostic tests or treatments) and false negatives (e.g. detrimental health outcomes arising from the failure to promptly treat a disease) frequently gives rise to cutoff values that are very different from those that would arise from giving explicit or implicit equal weighting to sensitivity and specificity. This is reflected in the widespread adoption during the early stage of the COVID-19 pandemic of rapid antigen detection tests with relatively low sensitivities compared to specificities (ECDC 2021) and, in contrast, elsewhere, in the recognition that using cutoff criteria that result in relatively high numbers of false positives can render screening for certain disease states not cost-effective (e.g. Sharib et al. 2020). Habibzadeh et al. (2016), amongst others, indicate the importance of factoring in the costs of model-based misdiagnosis. Medical decision theory has long emphasized (i) the importance of utilizing cost-optimized cutoff criteria (e.g. Phelps and Mushlin 1988), (ii) that such criteria would generally give rise to unequal weighting of sensitivity and specificity (e.g. Gail and Pfeiffer 2005), and (iii) that “regret probability”, defined as 1 minus the positive predictive value (PPV) (Maxim et al. 2014), may vary substantially for the same test depending upon the prevalence of the disease state in the population (Grimes and Schulz 2002; Maxim et al. 2014).
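The dependence of regret probability on prevalence follows directly from Bayes' theorem and can be sketched in a few lines of Python. The function names and the test characteristics used below (sensitivity = specificity = 0.9) are hypothetical values chosen purely for illustration and are not drawn from any of the cited studies.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = P(condition present | positive result), via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0:
        raise ValueError("no positive results are possible for these inputs")
    return true_pos / (true_pos + false_pos)


def regret_probability(sensitivity, specificity, prevalence):
    """Regret probability, taken here as 1 minus the PPV."""
    return 1.0 - positive_predictive_value(sensitivity, specificity, prevalence)


# Same hypothetical test applied to populations with different prevalence.
for prevalence in (0.01, 0.1, 0.5):
    regret = regret_probability(0.9, 0.9, prevalence)
    print(f"prevalence={prevalence:.2f}  regret probability={regret:.3f}")
```

For this hypothetical test, the regret probability falls from roughly 0.92 at 1% prevalence to 0.10 at 50% prevalence, even though sensitivity and specificity are unchanged.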

Few if any of the commonly used cutoff criteria methods for random forest modelling of the spatial distribution of environmental chemical hazards explicitly take into account the relative costs of misclassifying the hazard. Indeed, they would seem to implicitly either ignore cost implications or assume that there is little material difference in the cost consequences of false-positive and false-negative classifications. As such, they are not designed to optimize utility taking into account combined mitigation, testing, and health impact costs. This raises the questions: does this really matter? And if it does matter, then how much does it matter? In particular, under what circumstances is it materially important to consider the criteria used to obtain optimal cutoff values?

To test and illustrate the importance of using objective cost-optimized cutoff criteria, we present our analysis of a case study related to the machine learning modelling of the 2-D spatial distribution of groundwater arsenic in India. Our analysis is built upon our previous modelling (Wu et al. 2021b). We demonstrate that the use of such cost-optimized criteria not only results in the selection of numerically different cutoff values but also enables considerable reduction in overall potential mitigation/testing/health costs. Wider implications for the machine learning modelling of environmental chemicals of public health significance are discussed.

Methodology

Machine Learning Model

A binary target variable, groundwater arsenic, for India, with two possible values (“high” or “low”), was determined using random forest modelling, utilizing 145,813 geolocated (longitude/latitude) secondary groundwater arsenic concentration measurements from India and its neighbouring countries (Bangladesh, Pakistan, and Nepal) and 31 potential environmental predictors. The WHO provisional guideline value for arsenic in drinking water, viz. 10 μg/L, was adopted as the concentration used to classify concentrations as being either “high” or “low” in arsenic. The random forest model was used to create a map at 1-km2 (pixel) resolution of the predicted probability of groundwater arsenic exceeding 10 μg/L (Fig. 1a). The process of random forest model generation and the dependent and independent variables are outlined here and described in detail in our previous studies (Wu et al. 2020, 2021b; Podgorski et al. 2020).

Fig. 1

Distribution of arsenic in groundwater in India as determined by random forest modelling. a Probability map of groundwater arsenic exceeding 10 μg/L (from Wu et al. 2021b; reproduced here under the terms of a Creative Commons CC-BY Licence). b–d Maps of “high” groundwater arsenic hazard arising from using probability cutoffs of b 0.2, c 0.5, or d 0.7

Comparison of Non-cost-optimized Probability Cutoffs

A number of non-cost-optimized methods, as listed in Table 1, for determining cutoffs were separately used with the India groundwater arsenic random forest model to calculate the method-optimized cutoffs, together with their corresponding sensitivity, specificity, positive prediction values, and negative prediction values following the methods otherwise detailed in Wu et al. (2021b).

Calculation of Misclassification and Overall Costs

The misclassification costs, \({Cost}_{FP+FN}\), of a model can be expressed (Thiele and Hirschfeld 2020) as the sum of the costs arising from false-positive and false-negative classifications according to Eq. (1):

$$Cost_{ FP + FN} = N_{FPpixels} \times \overline{{CR_{pixel,FP} }} + N_{FNpixels} \times \overline{{CR_{pixel,FN} }},$$
(1)

where

\({N}_{FPpixels}\) and \({N}_{FNpixels}\) are the numbers of pixels misclassified as false positives and false negatives, respectively;

\(\overline{{CR}_{pixel,FP}}\) and \(\overline{{CR}_{pixel,FN}}\) are the weighted arithmetic mean per-pixel cost arising from misclassification of a pixel as a false positive or as a false negative, respectively.

We also define here the overall costs, \({Cost}_{FP+FN+TP}\), of a model as the sum of the costs arising from all false-positive, false-negative, and true-positive model classifications, assuming that the costs arising from a true-negative classification are zero, according to Eq. (2):

$$Cost_{FP + FN + TP} = N_{FPpixels} \times \overline{{CR_{pixel,FP} }} + N_{FNpixels} \times \overline{{CR_{pixel,FN} }} + N_{TPpixels} \times \overline{{CR_{pixel,TP} }},$$
(2)

where

\({N}_{FPpixels}\), \({N}_{FNpixels}\), and \({N}_{TPpixels}\) are the number of pixels classified as FP, FN, and TP, respectively;

\(\overline{{CR}_{pixel,FP}}\), \(\overline{{CR}_{pixel,FN}}\), and \(\overline{{CR}_{pixel,TP}}\) are the weighted arithmetic mean per-pixel costs arising from misclassification of a pixel as a false positive or as a false negative, or from classification as a true positive, respectively.
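Equations (1) and (2) are simple weighted pixel counts, as the following Python sketch illustrates. The function name, argument names, and numerical values are hypothetical and chosen only to show the arithmetic; setting the true-positive cost to zero recovers Eq. (1).

```python
def overall_cost(n_fp, n_fn, n_tp, mean_cost_fp, mean_cost_fn, mean_cost_tp=0.0):
    """Eq. (2): pixel counts weighted by mean per-pixel costs.

    True negatives are assumed to incur no cost. With mean_cost_tp = 0
    the result is the misclassification cost of Eq. (1).
    """
    return n_fp * mean_cost_fp + n_fn * mean_cost_fn + n_tp * mean_cost_tp


# Hypothetical pixel counts and relative per-pixel costs.
misclassification = overall_cost(120, 30, 50, mean_cost_fp=1, mean_cost_fn=10)
overall = overall_cost(120, 30, 50, mean_cost_fp=1, mean_cost_fn=10, mean_cost_tp=4)
print(misclassification, overall)  # 420.0 620
```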

In practice, \(\overline{{CR}_{pixel,FP}}\), \(\overline{{CR}_{pixel,FN}}\), and \(\overline{{CR}_{pixel,TP}}\) are inconvenient to calculate, so for the case study of India groundwater arsenic, we calculated \({Cost}_{ FP+FN}\) and \({Cost}_{FP+FN+TP}\) using the following methodology, which renders the same results as Eqs. (1) and (2), respectively. We note that India is administratively divided into several hundred districts, each composed of a number of 1 km2 pixels for which we modelled groundwater arsenic status. Rounding errors arising from the imperfect alignment of some 1 km2 pixels with district boundaries were determined to be of only minor importance in the context of this study. We further note that costs (related to well testing or well water remediation) arising from false-positive and true-positive classifications will largely be incurred on a per well basis, whereas those (related to detrimental health outcomes) arising from a false-negative classification will largely be incurred on a per groundwater arsenic-exposed person basis.

Misclassification costs, \({Cost}_{ FP+FN}\), were calculated according to

$$Cost_{ FP + FN} = \sum \left( {N_{FP, \,\,district} \times \frac{{P_{ district} \times Pro_{GW,\,district} \times Wells_{district} }}{{N_{pixel, district} }}} \right)\, \times \,CR_{wells, \,FP} + \sum \left( {N_{FN, district} \times \frac{{P_{ district} \times Pro_{GW,\,district} }}{{N_{pixel,\,district} }}} \right)\, \times \,CR_{people,\,FN},$$
(3)

where

\({N}_{FP, district}\) and \({N}_{FN, district}\) are the number of FP and FN classified pixels, respectively, based on the random forest model of groundwater arsenic exceeding 10 \(\mu g/L\);

\({P}_{ district}\) is the population of the subscripted district;

\({Pro}_{GW,district}\) is the proportion of the population in the subscripted district drinking groundwater via tubewells/boreholes and hand pumps;

\({N}_{pixel, district}\) is the total number of pixels in the subscripted district;

\({Wells}_{district}\) is the prevalence of wells utilized for drinking in the subscripted district; and

\({CR}_{wells, FP}\) and \({CR}_{people,FN}\) are the relative unit costs for each well and each groundwater arsenic-exposed person.

Values of \({P}_{ district}\), \({Pro}_{GW,district}\), and \({N}_{pixel, district}\) were obtained or derived from India Census (2011), whilst \({Wells}_{district}\) was approximated by assuming a ratio of 1 drinking water well per 20 people based on whole country estimates of population and wells (Government of India 2011a, b; CGWB 2022).
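A minimal Python sketch of the district-level summation in Eq. (3) is given below. The `District` class, its field names, and the two example districts are hypothetical; in particular, the sketch assumes that \({Wells}_{district}\) is expressed as wells per person (0.05 for the ratio of 1 well per 20 people), so that the first sum counts wells and the second counts people.

```python
from dataclasses import dataclass


@dataclass
class District:
    n_fp: int                 # N_FP,district: false-positive pixels
    n_fn: int                 # N_FN,district: false-negative pixels
    population: float         # P_district
    prop_groundwater: float   # Pro_GW,district
    wells_per_person: float   # Wells_district (assumed unit: wells per person)
    n_pixels: int             # N_pixel,district


def misclassification_cost(districts, cr_wells_fp, cr_people_fn):
    """Eq. (3): FP costs accrue per well, FN costs per exposed person."""
    wells_in_fp_pixels = sum(
        d.n_fp * d.population * d.prop_groundwater * d.wells_per_person / d.n_pixels
        for d in districts
    )
    people_in_fn_pixels = sum(
        d.n_fn * d.population * d.prop_groundwater / d.n_pixels
        for d in districts
    )
    return wells_in_fp_pixels * cr_wells_fp + people_in_fn_pixels * cr_people_fn


# Two hypothetical districts (values are illustrative, not census data).
districts = [
    District(n_fp=10, n_fn=5, population=1_000_000, prop_groundwater=0.5,
             wells_per_person=0.05, n_pixels=1000),
    District(n_fp=0, n_fn=20, population=2_000_000, prop_groundwater=0.25,
             wells_per_person=0.05, n_pixels=2000),
]
cost = misclassification_cost(districts, cr_wells_fp=1.0, cr_people_fn=2.0)
print(round(cost, 6))
```

In this toy case the false-positive pixels contain about 250 wells and the false-negative pixels about 7500 groundwater users, giving a relative cost of about 15250.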

Overall costs, \({Cost}_{ FP+FN+TP}\), were calculated according to

$$Cost_{FP + FN + TP} = \sum \left( {N_{FP,\,district} \times \frac{{P_{district} \times Pro_{GW,\,district} \times Wells_{district} }}{{N_{pixel,\,district} }}} \right) \times CR_{wells,\,FP} + \sum \left( {N_{FN,\,district} \times \frac{{P_{district} \times Pro_{GW,\,district} }}{{N_{pixel,\,district} }}} \right) \times CR_{people,\,FN} + \sum \left( {N_{TP,\,district} \times \frac{{P_{district} \times Pro_{GW,\,district} \times Wells_{district} }}{{N_{pixel,\,district} }}} \right) \times CR_{wells,\,TP},$$
(4)

where

\({N}_{TP, district}\) is the number of TP classified pixels in the subscripted district based on the random forest model of groundwater arsenic exceeding 10 \(\mu g/L\) in India; and

\({CR}_{wells, TP}\) is the relative unit cost arising for TP classified pixels on a per well basis.

Comparison of Costs Arising Using Cost-Optimized Cutoffs Compared to a Default Cutoff

Cost proportion differences between the use of cost-optimized cutoff values and a default cutoff of 0.5 were calculated according to Eqs. (5) and (6), both for misclassification costs, for various ratios of \({CR}_{wells, FP}\):\({CR}_{people, FN}\) (for the discrete cost ratio values selected, see Table S1), and for overall costs, for various ratios of \({CR}_{wells, FP}\):\({CR}_{people, FN}\):\({CR}_{wells, TP}\) (for the discrete cost ratio values selected, see Table S2). These comparisons were made both on an all India basis and for the individual states of Assam, Gujarat, Uttar Pradesh, and West Bengal, which collectively exhibit a wide range of prevalence of high groundwater arsenic.

$$CPD_{1} = \frac{{Cost_{0.5} { } - { }Cost_{cost - optimized cutoff} }}{{Cost_{0.5} }},$$
(5)
$$CPD_{2} = \frac{{Cost_{0.5} { } - { }Cost_{cost - optimized cutoff} }}{{Cost_{cost - optimized cutoff} }},$$
(6)

where

\({CPD}_{1}\) and \({CPD}_{2}\) are the cost proportion differences using Eqs. (5) and (6), respectively;

\({Cost}_{0.5}\) is the cost (misclassification or overall) arising from the use of a default cutoff of 0.5 (cf. Wu et al. 2021b); and \({Cost}_{cost-optimized cutoff}\) is the cost (misclassification or overall) arising from the cost-optimized cutoff value.
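The two cost proportion differences share the same numerator and differ only in the denominator, as the following Python sketch shows. The function name and the example costs are hypothetical.

```python
def cost_proportion_differences(cost_default, cost_optimized):
    """Return (CPD_1, CPD_2) as defined in Eqs. (5) and (6).

    CPD_1 expresses the saving relative to the default-cutoff cost;
    CPD_2 expresses the same saving relative to the cost-optimized cost.
    """
    if cost_default <= 0 or cost_optimized <= 0:
        raise ValueError("both costs must be positive for the ratios to be defined")
    saving = cost_default - cost_optimized
    return saving / cost_default, saving / cost_optimized


# Hypothetical relative costs at the default and cost-optimized cutoffs.
cpd_1, cpd_2 = cost_proportion_differences(cost_default=200.0, cost_optimized=50.0)
print(cpd_1, cpd_2)  # 0.75 3.0
```

Because the cost-optimized cost cannot exceed the default cost, \({CPD}_{1}\) lies between 0 and 1, whereas \({CPD}_{2}\) = \({CPD}_{1}\)/(1 − \({CPD}_{1}\)) is unbounded above.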

Illustrative Example of Selecting Cost-Optimal Cutoff for Groundwater Arsenic in India

In order to estimate the potential cost savings arising from using a cost-optimized cutoff as opposed to a default cutoff value, we further used the estimated unit costs for well testing (FP), well remediation (TP), and detrimental groundwater arsenic-attributable health outcomes (FN) detailed in Table 2. For illustrative purposes, the costs of a true-positive (TP) classification were taken to be the costs of remediation for each “high” arsenic well, the numbers of which were calculated as above; the costs of a false-positive (FP) classification were taken to be the costs of diagnostic testing for each well classified as “high” arsenic; and the costs of a false-negative (FN) classification were taken to be the costs of healthy lives lost as a result of unmitigated exposure to “high” arsenic groundwater, determined on a per well-user basis. The figures adopted here are broadly based on Wu et al. (2021b), supplemented by discussions with technology (remediation and chemical analysis) providers in India. The unit costs are intended to be illustrative, not definitive, and, in any event, the actual unit costs may vary substantially from place to place, from well-user to well-user, and from time to time.

Table 2 Adopted illustrative potential unit costs arising from classification and misclassification of machine learning model of groundwater arsenic exceeding 10 \(\mathrm{\mu g}/\mathrm{L}\)

Results and Discussion

Machine Learning Model of Groundwater Arsenic Distribution in India

The random forest model for India of the probability of groundwater arsenic exceeding 10 µg/L is shown in Fig. 1a. The characteristics of this distribution have been described and discussed previously in considerably more detail (Wu et al. 2021b) and are not the focus of the current study. This probability distribution gives rise to very different overall areas classified as “high” (> 10 µg/L) groundwater arsenic hazard depending upon the value of the probability cutoff selected, viz. 0.2 (Fig. 1b), 0.5 (Fig. 1c), or 0.7 (Fig. 1d). These illustrated cutoff values encompass the range of cutoff values (0.4 to 0.6) arising from the use of commonly used non-cost-optimized cutoff criteria (Table 1), for which the corresponding sensitivity, specificity, positive prediction values (PPV), and negative prediction values (NPV) are shown in Table 3. Clearly, the use of different cutoff criteria gives rise both to different cutoff values (Table 3) and to different extents of areas classified as “high” groundwater arsenic hazard (Fig. 1). It is further noted that, for higher cutoffs, specificity and positive prediction value increase, whilst sensitivity and negative prediction value decrease. Further, for higher cutoffs, there is a decrease in the relative number of false positives but an increase in the relative number of false negatives.

Table 3 Comparison of cutoff, sensitivity, specificity, positive prediction values (PPV), and negative prediction values (NPV) arising from applying non-cost-optimized probability cutoff criteria listed in Table 1 to the random forest model of distribution of groundwater arsenic exceeding 10 μg/L in India

Cost-Optimized Cutoffs as Function of Misclassification Costs

Calculated cost-optimized cutoffs for the whole of India groundwater arsenic distribution model as a function of the relative costs of false-positive and false-negative classifications, expressed as \({CR}_{wells, FP}\):\({CR}_{people, FN}\), are shown in Table S1.

As the cost ratio \({CR}_{wells, FP}\):\({CR}_{people, FN}\) increases from 1:1000 to 1000:1, the cost-optimized cutoffs become larger, increasing from just above zero (0.00 to two decimal places) to 0.68. The relationship between these cutoff values and log(\({CR}_{wells, FP}\):\({CR}_{people, FN}\)) is further illustrated in Fig. 2c. The equivalent relationships for the states of Gujarat (Fig. 2a), Uttar Pradesh (Fig. 2b), West Bengal (Fig. 2d), and Assam (Fig. 2e), where the prevalence of high arsenic in groundwaters is different, are also shown. In all cases, the relationships are monotonically increasing and sigmoidal in form, although over much of the range considered they can be roughly approximated by first-order linear fits with very high (> 0.94) R2. The associations between cost-optimized cutoffs and log(\({CR}_{wells, FP}\):\({CR}_{people, FN}\)) can serve as a guide to choosing cost-optimized cutoff values when the cost ratio \({CR}_{wells, FP}\):\({CR}_{people, FN}\) is known.

Fig. 2

Relationship between cost-optimized probability cutoffs and the relative costs of false-positive and false-negative classification expressed as log (\({CR}_{wells, FP}\):\({CR}_{people, FN}\)) (see text for explanation) for a Gujarat, b Uttar Pradesh, c India, d West Bengal, and e Assam, based on the machine learning model of the distribution of groundwater arsenic in India

For a given value of log(\({CR}_{wells, FP}\):\({CR}_{people, FN}\)), the calculated cost-optimized cutoff value is a strong function of the prevalence of high arsenic groundwaters in the area being considered. For example, for log(\({CR}_{wells, FP}\):\({CR}_{people, FN}\)) = 0, the calculated cost-optimized cutoff values increase monotonically with prevalence of high arsenic groundwaters as follows: Gujarat (cutoff value 0.27; prevalence 0.4%), Uttar Pradesh (0.28; 3%), India (0.30; 8%), West Bengal (0.30; 30%), and Assam (0.41; 67%).

Misclassification Costs

Misclassification costs (\({Cost}_{ FP+FN}\)) as a function of probability cutoff (101 values ranging from 0 to 1 at intervals of 0.01) are plotted for each discrete selected cost ratio in Fig. 3. Where the ratio of the unit FP (\({CR}_{wells, FP}\)) and FN (\({CR}_{people, FN}\)) relative costs is very large or very small (e.g. \({CR}_{wells, FP}\):\({CR}_{people, FN}\) of 1000:1, 500:1, and 200:1, or of 1:1000, 1:500, and 1:200), the misclassification costs tend to have very high values at cutoff values of 0 or 1, respectively, with the lowest misclassification costs occurring at cutoff values close to 1 or 0, respectively. Where the cost ratio is less extreme (between 2 and 100 in either direction), however, the plotted curves are more obviously “U” shaped, with the lowest misclassification costs arising from cutoff values in the range 0.2–0.6, that is, closer to the widely used default value of 0.5.
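The cutoff scan described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the procedure used in the study: it counts misclassified validation samples rather than the population-weighted pixels of Eq. (3), it classifies probabilities greater than or equal to the cutoff as “high”, and the function name and toy data are hypothetical.

```python
def cost_optimized_cutoff(y_true, prob, cost_fp, cost_fn):
    """Scan 101 cutoffs (0.00 to 1.00) and return (cutoff, cost) at the minimum.

    Ties are resolved in favour of the lowest cutoff.
    """
    best_cutoff, best_cost = None, float("inf")
    for i in range(101):
        cutoff = i / 100
        n_fp = sum(1 for t, p in zip(y_true, prob) if p >= cutoff and t == 0)
        n_fn = sum(1 for t, p in zip(y_true, prob) if p < cutoff and t == 1)
        cost = n_fp * cost_fp + n_fn * cost_fn
        if cost < best_cost:
            best_cutoff, best_cost = cutoff, cost
    return best_cutoff, best_cost


# Hypothetical observations (1 = "high", 0 = "low") and modelled probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
prob = [0.9, 0.6, 0.3, 0.4, 0.1, 0.7, 0.2, 0.8]

print(cost_optimized_cutoff(y_true, prob, cost_fp=1, cost_fn=10))  # (0.21, 2)
print(cost_optimized_cutoff(y_true, prob, cost_fp=10, cost_fn=1))  # (0.71, 2)
```

In this toy case, a high relative cost of false negatives pushes the cost-optimized cutoff down towards 0, and a high relative cost of false positives pushes it up towards 1.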

Fig. 3

Misclassification costs (Y-axis) as a function of cutoff value (X-axis) for discrete \({CR}_{wells, FP}\):\({CR}_{people, FN}\) ratios ranging from 1:1000 to 1000:1, viz. a 1:1000, b 1:500, c 1:200, d 1:100, e 1:50, f 1:10, g 1:5, h 1:2, i 2:1, j 5:1, k 10:1, l 50:1, m 100:1, n 200:1, o 500:1, p 1000:1. Only costs arising from false-positive and false-negative misclassifications are considered

Overall Model-Dependent Costs

Cost-optimized cutoffs as a function of \({CR}_{wells, FP}\):\({CR}_{people, FN}\) and \({CR}_{wells, FP}\):\({CR}_{wells, TP}\), calculated using the sets of cost ratio values listed in Table S2, are tabulated in Table S3 and illustrated in Fig. 4. These cost-optimized cutoffs varied from 0 to 1 and increased with both increasing \({CR}_{wells, FP}\):\({CR}_{people, FN}\) and decreasing \({CR}_{wells, FP}\):\({CR}_{wells, TP}\). More than half of these cutoffs exceeded the commonly used default cutoff value of 0.5. In all cases, the relationships are monotonically increasing and sigmoidal in form, although over much of the range considered they can be roughly approximated by first-order linear fits.

Fig. 4

Relationship between cost-optimized probability cutoffs and log (\({CR}_{wells, FP}\):\({CR}_{people, FN}\)) at different discrete \({CR}_{people, FN}: {CR}_{wells, TP}\) ratios (see Table S2 for CR values selected for each point) for the random forest model of groundwater arsenic distribution in India. Note that where the costs arising from a false positive are very low compared to those arising from a false negative, the cost-optimized probability cutoff tends to 0 (which tends to classify all samples as “high” groundwater arsenic)

Comparison of Costs Arising Using Cost-Optimized Cutoffs Compared to a Default Cutoff

The misclassification costs (arising from FP and FN) from using cost-optimized cutoffs, compared with those that would arise from the use of a default cutoff of 0.5, are shown for each cost ratio in Fig. 5. When these differences are expressed as a percentage of the costs arising from the use of a default cutoff of 0.5 (Fig. 5a–e), the relative costs vs cost ratios (\({CR}_{wells, FP}\):\({CR}_{people, FN}\)) curves tend to be “V” shaped, with the minimum value arising when the cost ratio (\({CR}_{wells, FP}\):\({CR}_{people, FN}\)) gives rise to the default cutoff of 0.5 also being the cost-optimized cutoff. Obviously, this only occurs for a particular value of (\({CR}_{wells, FP}\):\({CR}_{people, FN}\)); in general, the cost-optimized cutoff will be different, and use of the default cutoff gives rise to increasingly substantial excess misclassification costs as the difference between the two cutoffs increases.

Fig. 5

Calculated potential model associated excess costs as a function of misclassification cost ratio (\({CR}_{wells, FP}\):\({CR}_{people, FN}\)), expressed as percentage of costs arising from the use of (i) default cutoff value of 0.5 (a–e) (see Eq. (5)); (ii) cost-optimized cutoff value (f–j) (see Eq. (6)) for the states of Gujarat (a, f), Uttar Pradesh (b, g), West Bengal (c, h), and Assam (d, i) and for the whole of India (e, j)

Interestingly, for the random forest model of groundwater arsenic exceeding 10 μg/L in this study, the cost ratio (\({CR}_{wells, FP}\):\({CR}_{people, FN}\)) at which the default cutoff value would also be the cost-optimized cutoff value is clearly a strong function of the prevalence of high groundwater arsenic. The equality of costs arising from the use of the cost-optimized cutoff and that of the default cutoff of 0.5 for all of India and for Gujarat, Uttar Pradesh, West Bengal, and Assam states was found to occur for cost ratios (\({CR}_{wells, FP}\):\({CR}_{people, FN}\)) in the range of 10:1 to 1000:1, approximately as follows: Gujarat: 1000 (high groundwater arsenic prevalence 0.4%), Uttar Pradesh: 100 (3%), the whole of India: 50 (8%), West Bengal: a value between 10 and 50 (30%), and Assam: 10 (67%). Thus, the cost ratio at which 0.5 is the cost-optimized cutoff, with the lowest misclassification cost, decreases with increasing prevalence of high groundwater arsenic across the 5 regions considered here.

Where the cost differences (\({CPD}_{2}\)) are expressed relative to the costs arising from the use of cost-optimized cutoffs (Fig. 5f–j), similar relationships are observed, with the major differences being that (i) the magnitude of the relative cost difference is much greater than when expressed relative to costs arising from the use of the default cutoff of 0.5 and (ii) the shapes of the resultant curves are much more asymmetrical.

It is clear that, for this case study, in addition to misclassification and overall cost ratios, the actual prevalence of high groundwater arsenic concentrations also materially impacts the selection of cost-optimized cutoffs. When the high groundwater arsenic prevalence is low (e.g. Gujarat state), the proportion of FP misclassified pixels tends to be high. In contrast, when the groundwater arsenic prevalence is high (e.g. Assam state), the proportion of FN misclassified pixels tends to be low.

Further, it is evident that, for whole country models of groundwater arsenic, the inclusion of sub-regions (e.g. states) with highly different prevalence of high groundwater arsenic means that data from high groundwater arsenic prevalence states (e.g. Assam) will impact the model for low groundwater arsenic prevalence states (e.g. Gujarat) and vice versa. We speculate, therefore, that whole country modelling may not give the best cost-optimized models for smaller sub-regions and that global models may not give the best cost-optimized models for individual countries, particularly where there are wide differences in the prevalence of high arsenic groundwaters and where the mechanisms leading to such high arsenic groundwaters may differ from place to place (cf. Wu et al. 2021a). Interestingly, this is closely analogous to the conclusions of Chen et al. (2020), albeit that their study was with respect to sub-populations of pregnant women with highly different likelihoods of bearing children with trisomy or open neural tube defects.

Illustrative Example of Selecting Cost-Optimal Cutoff for Groundwater Arsenic in India

Using the specific illustrative unit cost values (\({CR}_{wells, FP}\), \({CR}_{people, FN}\), and \({CR}_{wells, TP}\)) listed in Table 2, a cost-optimized cutoff value between 0.00 and 0.01 was determined for the all India random forest model of groundwater arsenic (Fig. 6). The perhaps surprising closeness of this cutoff value to zero is due to the unit treatment and other costs for individual people at risk of suffering avoidable arsenic-attributable detrimental health impacts, \({CR}_{people, FN}\), being substantially higher than the unit costs of well remediation or of chemical analytical testing. An implication of this all India model is that the whole of India should be classified as an area of high groundwater arsenic to optimize model-related costs (health treatment, well remediation, chemical analysis); however, an all India model may not be the optimal model upon which to inform policy, for the reasons previously discussed. Notably, Maxim et al. (2014) warned of the excess costs arising from indiscriminate use of medical tests in populations with a very low prevalence of the conditions being tested for, and this warning is also relevant to the large-scale modelling of environmental chemical hazards, such as high arsenic groundwaters.

Fig. 6

Overall costs (FP, FN, and TP relative costs) as a function of probability cutoff, based on the machine learning model of the distribution of groundwater arsenic in the whole of India. The unit cost values are as defined in Table S2

Conclusion

Probability maps of environmental chemical hazards generated by machine learning models are generally converted into hazard area maps by setting some probability cutoff. Choosing the probability cutoff is a crucial step in determining the modelled area of high hazard, but existing methodologies for this are not designed to optimize costs related to health impacts, well remediation, and testing.

We demonstrate that, for a case study of random forest-modelled groundwater arsenic distribution in India, the use of objective cost optimization criteria for selecting probability cutoffs not only gives rise to probability cutoffs different from those obtained from the most commonly published methods (e.g. the cutoff where sensitivity equals specificity, the cutoff where the positive prediction value equals the negative prediction value, or a default cutoff of 0.5) but also substantially reduces overall potential (health impacts, remediation, analytical) costs arising from the use of the model in informing practice.

The magnitude of the benefit of using cost-optimized probability cutoff criteria compared to commonly used default methods depends upon (i) the ratios of costs arising from false-positive, false-negative, and true-positive model classifications and (ii) the underlying prevalence of “high” (in this case study, higher than 10 µg/L arsenic) groundwater arsenic.

Where the distribution of high groundwater arsenic is highly heterogeneous, as it is in India, the greatest cost benefits may arise from the use of models of smaller, more granular areas at a more detailed scale (e.g. individual states; cf. Wu et al. 2020). State and basin scale modelling of groundwater arsenic using locally relevant cost bases for cost optimization of probability cutoffs is therefore indicated. We suggest that this would be a productive direction for studies not only of groundwater arsenic distribution in India but also of the distribution of other chemical hazards in various environmental media (waters, soils, crops) in India and other areas.