Introduction

As water managers around the world grapple with the impacts of climate change on water resources, such as increasing drought frequency (Williams et al. 2020), annual temperatures (Overpeck and Udall 2020), and atmospheric demand for water (Albano et al. 2022), there is a need for tools that employ findable, accessible, interoperable, and reusable data (Wilkinson et al. 2016) for data-driven water budgets (Moyers et al. 2023). Actual evapotranspiration (ETa) continues to be one of the components of the water budget being refined through modeling and measurement techniques because it is indicative of the water consumed by agriculture, often the largest consumer of water in dryland environments. The increase in atmospheric water demand and decrease in water resources due to climate change pose a challenge to water managers trying to meet all agricultural, domestic, and environmental water needs. As a result, water managers need accurate ETa measurements to improve estimates of consumptive water use.

Accurate measurements or estimates of ETa can be difficult for decision-makers to obtain, as ET varies across different landscapes and crop types and is impacted by differences in climatic variables such as wind speed and direction, temperature, soil heat flux, and air density (Khand et al. 2017). ETa measurement systems have been installed in agricultural fields (Boyko et al. 2021), reservoirs (Fournier et al. 2021), wetlands (Matthes et al. 2014), and other regions where ETa needs to be accurately measured. ETa measurement systems such as eddy covariance towers are non-destructive and provide continuous measurement of the boundary layer for areas of 5000–100,000 m² (Jensen et al. 2016). Eddy covariance has a few limitations: empirical corrections are needed, energy balance closure error ranges from 10–30%, the system works best where there are large fetch distances of relatively homogeneous vegetation, and the instrumentation is expensive to install (Jensen et al. 2016). Although eddy covariance systems can produce ETa measurements automatically, they require frequent maintenance (Morin 2019) and an expert to interpret and process the climatic data to extract accurate ETa information (Helbig et al. 2021).

An alternative option for water managers to obtain spatiotemporal estimates of ETa is to use publicly available satellite-based remote sensing evapotranspiration models. Satellite remote sensing imagery and climate data have been incorporated into models that have been found to estimate ETa within ±20% of in-situ measurements (Melton et al. 2022; Samani and Bawazir 2015). The literature on validation of remote sensing ET models is rich, and models have been utilized for estimating ETa for different crops, land covers, and locations. For example, the surface energy balance algorithm for land (SEBAL) was assessed for measuring maize ETa in Nebraska (Singh et al. 2008). The disaggregate atmosphere–land exchange inverse model (DisALEXI) was used in grape vineyards and almond orchards in California (Knipper et al. 2023). METRIC and SSEBop were compared to eddy covariance measurements made in maize and soybeans (Singh and Senay 2015). The Priestley-Taylor Jet Propulsion Lab (PT-JPL) model was assessed at 23 eddy covariance sites with different land cover classes in China (Chen et al. 2014). The Satellite Irrigation Management Support (SIMS) model was evaluated in grape vineyards in California (Doherty et al. 2022). The models have been used across different landscapes such as the continental United States (Senay et al. 2013) and South American ecoregions (Melo et al. 2021). While remote sensing ET models have been used for over twenty years, their use has been limited because the information has been difficult to access by individuals outside of academic and agency research circles.

OpenET has recently emerged as a tool to address the gap of inaccessibility by providing ETa estimates using remote sensing ET models to individuals outside of academic and agency research circles. OpenET is a web-based platform that provides ETa estimations for agriculture in the western United States on daily, monthly, and annual time scales and at individual field, 30 × 30 m pixel, or user defined area spatial scales. OpenET is composed of six ET models (METRIC, SSEBop, SIMS, PT-JPL, DisALEXI, SEBAL) that have been used at the regional level for official or government purposes of water assessment and an Ensemble model that is an average of the other models. The six models were modified for integration into Google Earth Engine and rely primarily on satellite imagery from Landsat and also incorporate imagery from GOES, Sentinel-2, Suomi NPP, MODIS, and others (Melton et al. 2022). The OpenET platform has the potential to assist in calculating water budgets at the watershed scale that can be used by farmers, water and irrigation districts, and policymakers.

Issues can arise when models are implemented in dryland environments, specifically underestimation when compared to in-situ measurements due to factors such as edge effect or pixel size (Jones and Sirault 2014), models failing to capture soil evaporation (Chen et al. 2020), and advective heating from the surrounding landscape not being accurately depicted (Yang et al. 2015). These issues have caused underestimation in drylands of daily alfalfa ETa amounting to −28.9 and −14.8% by SEBAL and METRIC (Mkhwanazi and Chávez 2013), seasonal alfalfa ETa of −546 and −847 mm by SSEBop and ALEXI (Samani and Bawazir 2015), and −0.31 mm/day by SIMS in beets (Wang et al. 2021). Developers of OpenET have modified some of the internal calculations in the original formulation of each of the six ET models to improve the models’ performance; however, there are currently a limited number of studies (e.g., Melton et al. 2022; Volk et al. 2024) that assess whether models within the OpenET platform still underestimate ET in dryland environments and the implications for using OpenET as a water budget component. This study addresses two questions: (1) Are there any significant differences in means between OpenET’s model estimations and in-situ ETa measurements and crop evapotranspiration (ETc) estimates of alfalfa at the field scale? (2) What are the potential implications, in terms of the range of predicted values, for calculating a valley-scale water budget for alfalfa?

Materials and methods

Site description

The study was based on data collected from a field study of three alfalfa fields in the Mesilla Valley located in south-central New Mexico during the 2017 growing season (Fig. 1). The three fields are referred to as Leyendecker (32° 12' 21.46" N, 106° 44' 50.74" W), Horse Farm (32° 16' 14.70" N, 106° 46' 17.29" W), and Willie’s Field (32° 12' 03.07" N, 106° 44' 09.90" W). The three sites are 1.79, 2.19, and 35.2 hectares for Leyendecker, Horse Farm, and Willie’s Field, respectively. The Mesilla Valley is located in Doña Ana County and is part of the New Mexico portion of the Rio Grande River basin, spanning 90 kilometers from Radium Springs, New Mexico to El Paso, Texas and the border with Mexico. Doña Ana County has over 29,814.4 hectares of irrigated agriculture (United States Department of Agriculture, 2024), with the majority located in Mesilla Valley. Alfalfa is the second most commonly grown crop in Mesilla Valley, with 2,913.7 hectares harvested in 2017 (United States Department of Agriculture, 2024). Flood irrigation using a combination of surface and groundwater is the predominant irrigation method. The area is typical of dryland climates with a Köppen climate classification of BWh (dry arid climate) or BSh (dry semi-arid climate) depending on annual precipitation. The average reference evapotranspiration was 1616 mm per year between 2017 and 2023 (ZiaMet Weather Station Network), which exceeds the average annual precipitation of 222 mm per year. The average annual temperature is 15.8 °C (Malm et al. 1994).

Fig. 1

Location of the three alfalfa fields in the Mesilla Valley, New Mexico

Data

Field data

The field measurements used in this study come from Boyko et al. (2021), where individual components of the water budget were measured in the three alfalfa fields for the purpose of improving groundwater recharge estimates. In Willie’s Field, ETa was calculated using the eddy covariance method utilizing an eddy covariance station situated in the center of the field. Field instrumentation at Willie’s Field included a three-dimensional sonic anemometer model CSAT3 (CSI, Logan, Utah) to measure sensible heat, a NR-Lite net radiometer (CSI, Logan, Utah) to measure net radiation, two model HFT3 soil heat flux plates (CSI, Logan, Utah) in combination with soil moisture sensor model CS 616 (CSI, Logan, Utah), and a pair of model CAV two-averaging soil temperature thermocouples (CSI, Logan, Utah) to measure soil heat flux. Additional sensors were included to measure relative humidity and ambient air temperature. The instrumentation and site were maintained by a field technician at least once per week, and the data were corrected for sonic temperature, frequency attenuation, and cross contamination. ETa was calculated as a residual of the energy balance, and the energy balance closure ranged from 0.67 to 1.31 with a mean closure of 0.96 for the growing period (Boyko et al. 2021).

Eddy covariance was only used at Willie’s Field because the other two sites did not meet the required fetch distance for installation of an eddy covariance station. Instead, crop evapotranspiration (ETc) was estimated for the Leyendecker field using the crop coefficient (Kc) approach, where ETc is the product of Kc and a nearby reference evapotranspiration (ETos) value (Jensen et al. 2016). The Leyendecker field was a mixture of half alfalfa and half native Saltgrass; as a result, Kc was weighted relative to the proportion of alfalfa and Saltgrass observed in the field. The daily alfalfa Kc was determined from the eddy covariance tower in Willie’s Field and the Leyendecker III climate station (Kc = ETa/ETos). The Leyendecker III climate station was located approximately 0.6 km from the eddy covariance tower and approximately 0.75 km from the center of the Leyendecker alfalfa field and was maintained as a short grass reference (Bawazir et al. 2014). Bawazir et al. (2014) developed the Kc for Saltgrass in the region.
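As a minimal sketch of the Kc approach described above, the following Python functions compute an area-weighted Kc for a mixed alfalfa/Saltgrass field and the resulting ETc. The function names and all numeric inputs are hypothetical and chosen only for illustration; they are not values from Boyko et al. (2021).

```python
def weighted_kc(kc_alfalfa: float, kc_saltgrass: float, alfalfa_fraction: float) -> float:
    """Area-weighted crop coefficient for a mixed alfalfa/Saltgrass field."""
    if not 0.0 <= alfalfa_fraction <= 1.0:
        raise ValueError("alfalfa_fraction must be between 0 and 1")
    return alfalfa_fraction * kc_alfalfa + (1.0 - alfalfa_fraction) * kc_saltgrass


def crop_et(kc: float, reference_et_mm: float) -> float:
    """ETc (mm/day) as the product of Kc and reference ET (mm/day)."""
    return kc * reference_et_mm


# Illustrative inputs only (assumed, not measured values)
eta_tower_mm = 6.0      # daily ETa from an eddy covariance tower
eto_station_mm = 7.5    # daily reference ET from a nearby climate station
kc_saltgrass = 0.5      # assumed Saltgrass Kc for the same day

kc_alfalfa = eta_tower_mm / eto_station_mm                 # Kc = ETa / ETos
kc_mixed = weighted_kc(kc_alfalfa, kc_saltgrass, 0.5)      # half alfalfa, half Saltgrass
etc_mm = crop_et(kc_mixed, eto_station_mm)                 # ETc for the mixed field
```

With these assumed inputs, the mixed-field Kc falls midway between the alfalfa and Saltgrass coefficients because the two covers are weighted equally.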

The Horse Farm site was assumed to have the same ETa as Willie’s Field due to similar climatic conditions, close proximity, similar soil texture at 120 cm depth, and synchronization of irrigation and cutting events (Boyko et al. 2021). The average time between cutting and irrigation events was within 4 days at the different sites for the growing season. The daily reference ET (ETsz) values from climate stations near the Leyendecker and Horse Farm sites were compared to each other to confirm there was a high degree of correlation between the sites (r2 = 0.974).

By definition, ETa and ETc are different: ETa is the actual evapotranspiration from a surface, which can be less than the potential, and ETc is the crop evapotranspiration for a non-stressed crop with full irrigation requirements met. However, in this study we assume negligible differences between ETc and ETa because all three sites were managed, cut, and irrigated on a similar cycle and because the Kc used for Leyendecker and Horse Farm was determined daily from the eddy covariance data and the local climate station, thus capturing measured changes in environmental, phenological, and physiological conditions of the sites. Given that ETa was measured at Willie’s Field and ETc was estimated at Leyendecker and Horse Farm, only Willie’s Field was used for daily comparisons. For the other two sites, the daily estimates were aggregated to monthly and annual values for comparison to OpenET. Recently published research also utilized the Kc approach for comparing OpenET in pecans (Tawalbeh et al. 2024) and maize (Djaman et al. 2023).

In addition to the measured and estimated ET, Boyko et al. (2021) measured precipitation (P), irrigation (I), and change in soil storage (Δθ) to estimate recharge as runoff/discharge (Q) using the soil water balance approach (P + I – ET – Q – Δθ = µ), where µ is the error term. Soil textures were similar at 120 cm at all three sites. No alfalfa roots were observed below 120 cm. Their results showed a recharge between 618 and 718 mm for the growing season. The seasonal ETa for Leyendecker, Horse Farm, and Willie’s Field were 964 mm, 1116 mm, and 1309 mm (Table 1), and the growing season dates for Leyendecker, Horse Farm, and Willie’s Field were 02/25/2017 to 11/26/2017, 04/13/2017 to 11/26/2017, and 03/03/2017 to 11/26/2017, respectively. For more detailed site descriptions and methodology for field measurements and results, see Boyko et al. (2021).
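The soil water balance above can be rearranged to solve for Q once the other terms are known. The short Python sketch below does this under the simplifying assumption that the error term µ is set to zero unless supplied; the function name and the input depths are hypothetical and are not the measurements reported by Boyko et al. (2021).

```python
def recharge_mm(precip_mm: float, irrigation_mm: float, et_mm: float,
                delta_storage_mm: float, error_mm: float = 0.0) -> float:
    """Solve P + I - ET - Q - dTheta = mu for Q (all terms in mm of water)."""
    return precip_mm + irrigation_mm - et_mm - delta_storage_mm - error_mm


# Illustrative seasonal depths only (assumed values)
q_mm = recharge_mm(precip_mm=150.0, irrigation_mm=1800.0,
                   et_mm=1300.0, delta_storage_mm=10.0)
```

Because Q is computed as a residual, any bias in ET carries directly into the recharge estimate with the opposite sign, which is why the accuracy of the ET term matters for the water budget.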

Table 1 Field-measured and estimated monthly and seasonal ETa for three alfalfa fields in Mesilla Valley during the 2017 growing season (Boyko et al. 2021)

OpenET

OpenET models

OpenET is composed of six models: Atmosphere-Land Exchange Inverse / Disaggregation of the Atmosphere-Land Exchange Inverse (ALEXI/DisALEXI) (Anderson et al. 2007, 2018), Google Earth Engine implementation of the Mapping Evapotranspiration at high Resolution with Internalized Calibration model (eeMETRIC) (Allen et al. 2005, 2007, 2011), Google Earth Engine implementation of the Surface Energy Balance Algorithm for Land (geeSEBAL) (Bastiaanssen et al. 1998; Laipelt et al. 2021), Operational Simplified Surface Energy Balance (SSEBop) (Senay 2018; Senay et al. 2013), Priestley-Taylor Jet Propulsion Laboratory (PT-JPL) (Fisher et al. 2008), and Satellite Irrigation Management Support (SIMS) (Melton et al. 2012; Pereira et al. 2020).

All but one of the models included in OpenET are either surface energy balance (SEB) models (ALEXI/DisALEXI, eeMETRIC, and geeSEBAL) or simplified versions of the SEB approach (PT-JPL and SSEBop) (Volk et al. 2024). The SEB models solve for the latent energy consumed by ET by calculating the net radiation, soil heat flux, and sensible heat flux from the satellite imagery and climate data and then converting latent energy into an equivalent depth of water, as described in the methodology used in Samani and Bawazir (2015). PT-JPL and SSEBop are versions of the SEB models in which certain components of the SEB equation are simplified using hot and cold reference pixels, empirical coefficients, and crop coefficients (Fisher et al. 2008; Senay et al. 2013).

Of the models within OpenET, the SIMS model is the only one that is not based on the energy balance equation. Rather, the SIMS model uses surface reflectance and crop type information to calculate ETa as a function of canopy density (Melton et al. 2012). Satellite imagery is used by SIMS for calculating the normalized difference vegetation index (NDVI), which is required for fractional cover and basal crop coefficient calculations. All of the models included in the OpenET platform have additional calculations that must be solved before the final ET calculations described above. For more information regarding the full set of model ET calculations, refer to the original papers cited at the beginning of this section.

The last ETa estimate calculated by the OpenET platform is produced through an ensemble approach that includes estimates from the other models. The OpenET Ensemble is a simple average of all model estimates after outliers have been removed using Mean Absolute Deviation (MAD), where outliers are defined as being more than ±2 times the MAD; this is performed at all desired time steps. The Ensemble requires values from at least four of the models so that a range of models still contributes to each estimate (Melton et al. 2022). As a result, Ensemble estimates are not always exact averages of all six models, as they can be calculated from four, five, or all six model estimates. The OpenET Ensemble produces an estimate of ETa that is, on average, as accurate as or more accurate than any individual model (Melton et al. 2022).
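The outlier-filtered averaging described above can be sketched in a few lines of Python. This is an illustrative reading rather than OpenET's implementation: MAD is computed here as the median absolute deviation about the median (an assumption of this sketch), and the rule for keeping at least four models when several values are flagged (dropping the most extreme values first) is also an assumption. The function name and the input values are hypothetical.

```python
from statistics import mean, median


def ensemble_mean(estimates, n_mad=2.0, min_models=4):
    """Average model estimates after dropping values beyond n_mad * MAD.

    Returns None when fewer than min_models estimates are available.
    """
    values = [v for v in estimates.values() if v is not None]
    if len(values) < min_models:
        return None
    center = median(values)
    mad = median([abs(v - center) for v in values])
    # Order by distance from the center so the most extreme values are removed first
    kept = sorted(values, key=lambda v: abs(v - center))
    # Assumed rule: never drop below min_models contributing estimates
    while len(kept) > min_models and abs(kept[-1] - center) > n_mad * mad:
        kept.pop()
    return mean(kept)


# Illustrative monthly ETa values in mm (assumed, not OpenET output)
monthly_eta = {
    "eeMETRIC": 200.0, "SSEBop": 210.0, "SIMS": 190.0,
    "PT-JPL": 195.0, "DisALEXI": 205.0, "geeSEBAL": 90.0,
}
ensemble_value = ensemble_mean(monthly_eta)
```

In this made-up example only the 90 mm value lies more than two MADs from the center, so the ensemble is the mean of the remaining five estimates.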

OpenET data collection

Daily and monthly OpenET data were downloaded on October 11, 2023 using OpenET’s API and Python code for polygon datasets provided by the OpenET website. We extracted OpenET data for individual field polygons for the three fields because this method will likely be used by most users of OpenET to access the data. Additionally, the methodology adopted by Volk et al. (2024) in their comparisons of OpenET to in-situ ET data was used to extract OpenET data from a 7 × 7 grid of 30 m pixels (210 m × 210 m) centered over the eddy covariance tower in Willie’s Field, hereinafter referred to as Willie’s Footprint. The dates used for Leyendecker, Horse Farm, and Willie’s Field were 02/25/2017 to 11/26/2017, 04/13/2017 to 11/26/2017, and 03/03/2017 to 11/26/2017, respectively. OpenET model versions were ALEXI/DisALEXI (version 0.032), eeMETRIC (version 0.20.26), geeSEBAL (version 0.2.2), PT-JPL (version 0.2.1), SIMS (version 0.1.0), SSEBop (version 0.2.6), and the OpenET Ensemble. Seasonal ETa estimates were derived by summing all monthly data. The daily, monthly, and seasonal ETa OpenET estimates were subsequently compared to the measured and estimated data from Boyko et al. (2021).
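To make the footprint geometry concrete, the sketch below builds the corner coordinates of a 7 × 7 block of 30 m pixels around a tower location. It assumes the tower position is given in a projected coordinate system with units of meters; the coordinates used are placeholders, not the actual tower location, and converting the ring to the geographic coordinates expected by a data request is left out. The function name is hypothetical.

```python
def footprint_square(x_center_m, y_center_m, n_pixels=7, pixel_size_m=30.0):
    """Closed ring of (x, y) corners for an n_pixels x n_pixels block of pixels."""
    half_side = n_pixels * pixel_size_m / 2.0
    x_min, x_max = x_center_m - half_side, x_center_m + half_side
    y_min, y_max = y_center_m - half_side, y_center_m + half_side
    # Counter-clockwise ring that repeats the first vertex to close the polygon
    return [(x_min, y_min), (x_max, y_min), (x_max, y_max),
            (x_min, y_max), (x_min, y_min)]


# Placeholder projected coordinates in meters (assumed for illustration)
ring = footprint_square(330000.0, 3565000.0)
side_m = ring[1][0] - ring[0][0]
area_ha = side_m * side_m / 10000.0
```

The resulting square is 210 m on a side, or roughly 4.4 hectares, which is far smaller than the 35.2 hectare Willie’s Field boundary polygon.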

ET data analysis

Statistical significance

The in-situ and OpenET data were tested for normal distributions using a Shapiro-Wilk normality test (Eq. 1) at a 95% confidence level to determine if a parametric or non-parametric statistical test should be used to test for statistical significance (Ghasemi and Zahediasl 2012). All months with over 15 days of observations were included in the Shapiro-Wilk test, yielding a sample size of 26 months for each ET method: March to November for Leyendecker and Willie’s Field and April to November for Horse Farm. The daily data at Willie’s Field had a large enough sample size (n = 269) to allow normality and statistical significance testing at this site for the field boundary polygon and the eddy covariance tower footprint polygon. When all of the ETa data follow a normal distribution, a two-sample t-test (Eq. 2) should be used to determine if there are any significant differences between model estimates and in-situ measurements; however, if any of the models in the dataset fail to have a normal distribution, the non-parametric Mann-Whitney U test (Eq. 3) should be performed.

$$W=\frac{{\left({\sum }_{i=1}^{n}{a}_{i}{x}_{(i)}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({x}_{i}-\overline{x}\right)}^{2}}$$
(1)

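where W is the test statistic, n is the sample size, xi represents each data point, x(i) is the ith smallest value (order statistic) of the sample, x̄ is the sample mean, and ai represents constants derived from the means, variances, and covariances of the order statistics of a sample of size n from a normal distribution.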

$$t=\frac{{\overline{x}}_{1}-{\overline{x}}_{2}}{S\sqrt{\frac{1}{{n}_{1}}+\frac{1}{{n}_{2}}}}$$
(2)

where t is the test statistic, x̄1 is the sample mean of the first group, x̄2 is the sample mean of the second group, S is the pooled standard deviation, and n1 and n2 are the sample sizes of the two groups.
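As a worked illustration of Eq. 2, the following Python sketch computes the pooled two-sample t statistic with the standard library. The pooled standard deviation is taken as the usual degrees-of-freedom-weighted combination of the two sample variances; the function name and the sample values are hypothetical, and converting t to a p-value is not shown.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_1, sample_2):
    """Two-sample t statistic (Eq. 2) using a pooled standard deviation."""
    n1, n2 = len(sample_1), len(sample_2)
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((n1 - 1) * variance(sample_1)
                  + (n2 - 1) * variance(sample_2)) / (n1 + n2 - 2)
    s_pooled = sqrt(pooled_var)
    return (mean(sample_1) - mean(sample_2)) / (s_pooled * sqrt(1.0 / n1 + 1.0 / n2))


# Illustrative values only (assumed, not study data)
t_value = pooled_t_statistic([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
```

A negative t indicates that the first group has the lower mean, which in a model-versus-measurement comparison would correspond to underestimation if the model estimates are passed as the first sample.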

$${U}_{1}= {n}_{1}{n}_{2}+\frac{{n}_{1}({n}_{1}+1)}{2}-{R}_{1}$$
$${U}_{2}= {n}_{1}{n}_{2}+\frac{{n}_{2}({n}_{2}+1)}{2}-{R}_{2}$$
(3)

where n1 and n2 are the sample sizes and R1 and R2 are the sums of ranks for groups one and two.
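Eq. 3 can be illustrated with a small standard-library Python sketch that ranks the pooled observations and computes U1 and U2 from the rank sums. Tied values are given their average rank, which is a common convention and an assumption here; the function names and sample values are hypothetical, and the p-value calculation is not shown.

```python
def rank_sums(group_1, group_2):
    """Sum of pooled ranks for each group, using average ranks for ties."""
    pooled = sorted([(v, 0) for v in group_1] + [(v, 1) for v in group_2])
    sums = [0.0, 0.0]
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold tied values; their ranks are i+1..j
        average_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            sums[pooled[k][1]] += average_rank
        i = j
    return sums[0], sums[1]


def mann_whitney_u(group_1, group_2):
    """U1 and U2 as defined in Eq. 3."""
    n1, n2 = len(group_1), len(group_2)
    r1, r2 = rank_sums(group_1, group_2)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2.0 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2.0 - r2
    return u1, u2


# Illustrative values only (assumed, not study data)
u1, u2 = mann_whitney_u([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

The two statistics always sum to n1 × n2, so a value of zero for one of them means the two groups do not overlap at all.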

Performance assessment

Several statistics have been used to quantify OpenET model performance in the accuracy assessments by Melton et al. (2022) and Volk et al. (2024). Mean Bias Error (MBE), Root Mean Square Error (RMSE), Coefficient of Determination (r2), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and model percent mean bias error are frequently used to assess model performance (Djaman et al. 2023; Yang et al. 2015; Zhang et al. 2023; Zheng et al. 2022). MBE allows for discerning the direction and magnitude of overall model bias, with a value of zero indicating unbiased estimates. RMSE evaluates overall model accuracy by considering both bias and variability, offering insight into the precision of model estimates. r2 provides a measure of the goodness of fit between estimates and measurements, indicating how well the models capture the observed variability. MAE assesses the average prediction accuracy without sensitivity to the direction of errors, while MAPE expresses the same errors as a percentage. Model seasonal percent mean bias error was calculated as ((mod-obs)/obs)*100. Based on the MBE values recorded in this study, a 95% confidence interval range can be calculated using the following formula (Eq. 4).

$$\overline{x }\pm 1.96\frac{s}{\sqrt{n}}$$
(4)

where x̄ is the average model bias, s is the sample standard deviation, and n is the sample size.
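The performance statistics described in this section can be sketched with the standard library as follows. The definitions used are the conventional ones (error = model − observation); the function name and input values are hypothetical, r2 is omitted for brevity, and the seasonal percent bias is computed on the summed values following ((mod-obs)/obs)*100.

```python
from math import sqrt


def performance_metrics(modelled, observed):
    """MBE, RMSE, MAE (input units), MAPE and seasonal percent bias (%)."""
    errors = [m - o for m, o in zip(modelled, observed)]
    n = len(errors)
    seasonal_mod, seasonal_obs = sum(modelled), sum(observed)
    return {
        "MBE": sum(errors) / n,
        "RMSE": sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100.0 * sum(abs(e) / abs(o) for e, o in zip(errors, observed)) / n,
        "percent_bias": 100.0 * (seasonal_mod - seasonal_obs) / seasonal_obs,
    }


# Illustrative monthly ETa values in mm (assumed, not study data)
metrics = performance_metrics([90.0, 110.0, 100.0], [100.0, 100.0, 100.0])
```

In this made-up example the over- and under-predictions cancel, so MBE and percent bias are zero while RMSE and MAE remain positive, which is why the bias statistics are reported alongside the absolute-error statistics.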

With this bias range, we expanded our results to the 2913.7 hectares of alfalfa grown in Mesilla Valley in 2017, assuming that the model performance at the study sites is representative of Mesilla Valley alfalfa. This illustrates the potential impact that using OpenET’s models in place of in-situ measurements can have on water budget calculations for alfalfa.
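The scaling step can be illustrated with a short Python sketch that applies Eq. 4 to a set of seasonal biases and converts a depth in millimeters over an area in hectares to million cubic meters (MCM), using the fact that 1 mm of water over 1 hectare equals 10 m³. The function names and bias values are hypothetical; only the 2913.7 hectare area comes from the text.

```python
from math import sqrt
from statistics import mean, stdev


def mbe_confidence_interval(biases_mm, z=1.96):
    """95% confidence interval of the mean bias (Eq. 4), in mm."""
    center = mean(biases_mm)
    half_width = z * stdev(biases_mm) / sqrt(len(biases_mm))
    return center - half_width, center + half_width


def depth_to_mcm(depth_mm, area_ha):
    """Convert a water depth over an area to million cubic meters."""
    cubic_meters = depth_mm * area_ha * 10.0  # 1 mm over 1 ha = 10 m^3
    return cubic_meters / 1.0e6


# Illustrative seasonal biases in mm for three fields (assumed values)
lower_mm, upper_mm = mbe_confidence_interval([-100.0, -200.0, -300.0])
alfalfa_area_ha = 2913.7  # alfalfa harvested in Mesilla Valley in 2017
lower_mcm = depth_to_mcm(lower_mm, alfalfa_area_ha)
upper_mcm = depth_to_mcm(upper_mm, alfalfa_area_ha)
```

As a rule of thumb from this conversion, every 100 mm of seasonal bias corresponds to about 2.9 MCM over the 2913.7 hectares of alfalfa.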

Results and discussion

OpenET model estimates

The seasonal ETa from the OpenET models ranged from 737 to 1458 mm/yr for the three alfalfa sites (Table 2). Monthly data (supplementary file 1) ranged from 27.5 to 237 mm/month. Leyendecker had the lowest seasonal values for OpenET model estimates, most likely due to the field being composed of alfalfa and native grass (Boyko et al. 2021). Horse Farm was assumed to have the same ETa as Willie’s Field from April 13th to the end of the growing season, but had lower seasonal values as it took longer for the newly planted alfalfa to reach maturity, as shown in Table 1. Results of the seasonal comparisons revealed that SSEBop and eeMETRIC produced higher values of ETa when compared to the in-situ data for all sites except Horse Farm (Table 2). Djaman et al. (2023) also found eeMETRIC to produce higher values of seasonal ET in two years of their six-year study, with higher seasonal ET of 79 mm and 91 mm. The lower seasonal prediction of ETa relative to in-situ data from the other OpenET models at all three sites ranged from −375 to −33 mm/yr. The lower seasonal ETa predictions observed in this study fit within the previously described range, in which the models included in OpenET underestimated seasonal ETa, relative to in-situ measurements, by −730.9 to −8.9 mm when they were applied in other dryland environments for maize and alfalfa (Djaman et al. 2023; Samani and Bawazir 2015).

Table 2 Measured in-situ ETa and estimated ETc (Boyko et al. 2021), seasonal ETa estimates of the OpenET models, and seasonal percent mean bias error for OpenET model estimates

Model seasonal percent mean bias error

The seasonal ETa model percent mean bias error at each site ranged from −33.99% to +11.37% (Table 2). With the exception of eeMETRIC and SSEBop, all model errors were due to lower predicted values of ETa at all three field sites relative to in-situ measurements and estimates. SSEBop exceeded the ±10–20% error OpenET uses as a standard for acceptable model performance (Melton et al. 2022; Volk et al. 2024) for both Leyendecker and Horse Farm, while DisALEXI and geeSEBAL exceeded the standard at Horse Farm. The larger percent errors at Leyendecker and Horse Farm are likely due to the smaller field sizes and edge effects. SIMS and PT-JPL had similar percent errors at all three sites, which supports using the three sites in the comparison. eeMETRIC, SIMS, PT-JPL, and the OpenET Ensemble are within the ±20% benchmark OpenET set at all field sites, with eeMETRIC and SIMS having an average percent error within ±10%. All models were within the ±10–20% error standard for data extracted using either Willie’s Field boundary or Willie’s Footprint, although the order of model performance changes, illustrating that caution is needed when interpreting ETa using field boundaries to extract data.

Shapiro-Wilk normality test

Normality tests for daily data were performed on Willie’s Field boundary data and Willie’s eddy covariance footprint data because these datasets were the only ones with daily ETa values measured using an eddy covariance tower. Both data sets had at least one model that produced a p-value less than 0.05 (Table 3), indicating that they do not all follow a normal distribution and should be tested for significance with the non-parametric Mann-Whitney U test (Djaman et al. 2023; Shah et al. 2021). For the monthly data, all sites were used and normality tests yielded p-values above 0.05 (Table 4), indicating that all monthly data follow a normal distribution and should be tested with the parametric two-sample t-test (Castle et al. 2016).

Table 3 Daily Shapiro-Wilk normality test p-values for the in-situ data for Willie’s Field (Boyko et al. 2021) and the OpenET models
Table 4 Shapiro-Wilk normality test p-values for monthly data from all three field sites for the in-situ data (Boyko et al. 2021) and the OpenET models

Mann-Whitney U test and two sample t-test results

The daily ETa data did not all follow a normal distribution (Table 3) and thus were tested using the non-parametric Mann-Whitney U test at a 0.05 significance level. The results show that PT-JPL, DisALEXI, and geeSEBAL yielded significantly different daily ETa estimates from in-situ measurements for both Willie’s Field and Willie’s Footprint (p-values < 0.05) (Table 5). eeMETRIC was significantly different from in-situ measurements for Willie’s Footprint and SSEBop was significantly different from in-situ measurements for Willie’s Field. The OpenET Ensemble and SIMS were the only two models that did not produce significantly different daily ET estimates for either Willie’s Field or Willie’s Footprint.

Table 5 Mann-Whitney U test p-values comparing the daily in-situ data from Boyko et al. (2021) to the OpenET models

The monthly data were normally distributed and a two-sample t-test was used at a 0.05 significance level to determine if there were any significant differences between the monthly OpenET model estimates and in-situ data (Table 6). SSEBop and DisALEXI produced significantly different monthly estimates when compared to the in-situ measurements and estimates (p = 0.0338 and p = 0.0348) when Willie’s Field boundaries were used as the queried polygon. When Willie’s Footprint was used as the queried polygon in addition to the other two sites, no models produced significantly different monthly estimates when compared to in-situ data. The higher frequency of significant differences for daily ETa when compared to monthly ETa is most likely due to averaging out of errors over the longer timeframe. This finding is consistent with Fisher and Pringle (2013), where under- and over-prediction errors were found to offset each other and reduce the MBE values.

Table 6 Two-sample t-test p-values comparing the monthly in-situ data (Boyko et al. 2021) to the OpenET models 

Performance assessment statistics

Mean bias error

Performance assessment statistics were calculated for Horse Farm, Leyendecker, and Willie’s Field using the field boundaries as the queried polygons (Table 7) and for Horse Farm, Leyendecker, and Willie’s Footprint as the queried polygons (Table 8). MBE values were consistently negative for monthly and seasonal time frames for all models except eeMETRIC (Tables 7 and 8). The overestimation at Willie’s Field and the Leyendecker site (Table 2) resulted in a positive MBE for the combined seasonal ETa for eeMETRIC despite underestimating seasonal ETa by over 100 mm, relative to in-situ measurements, at the Horse Farm site. The seasonal MBE for eeMETRIC and SIMS was within ±75 mm of in-situ measurements using both polygons for Willie’s Field (Tables 7 and 8), indicating that using these models may only cause slight discrepancies in seasonal water budget calculations. Although the seasonal MBE values either improved or remained the same for all other models when using a polygon centered around the eddy covariance tower at Willie’s Field (Table 8), which reduces impacts from factors such as edge effects, the seasonal MBE magnitudes for these models still exceed 100 mm. With the exception of eeMETRIC and SIMS, the models in OpenET produce seasonal MBE values between −102 and −238 mm, indicating that there could potentially be much larger impacts on water budget calculations if these models were used in place of in-situ measurements.

Table 7 Monthly average MBE, RMSE, MAE, MAPE, r2 values and seasonal MBE for models included in OpenET compared to the in-situ data (Boyko et al. 2021) for the Horse Farm, Leyendecker, and Willie’s Field boundary polygons
Table 8 Monthly average MBE, RMSE, MAE, MAPE, r2 values, and seasonal MBE for the models included in OpenET compared to the in-situ data (Boyko et al. 2021) for the Horse Farm, Leyendecker, and Willie’s Footprint

Root mean square error and coefficient of determination

Linear regression equations and their associated RMSE and coefficients of determination (r2) were calculated in RStudio by regressing in-situ measurements against OpenET model estimates on a monthly basis (Tables 7 and 8). The monthly RMSE ranged from 23.5 to 41.8 mm/month for Horse Farm, Leyendecker, and Willie’s Field queried using the entirety of the field and from 23.3 to 39.4 mm/month for Horse Farm, Leyendecker, and Willie’s Footprint. The models’ r2 values ranged from 0.55 to 0.71 for the monthly data (Table 7) using Horse Farm, Leyendecker, and Willie’s Field boundaries (Fig. 2) and from 0.56 to 0.70 (Table 8) using Horse Farm, Leyendecker, and Willie’s Field eddy covariance tower footprint (Fig. 3). Similar monthly RMSE (8.5 to 62 mm/month (Acharya and Sharma 2021)) and r2 (0.66 to 0.93 (Acharya and Sharma 2021) and 0.83 to 0.97 (Senay et al. 2013)) values have been reported for these models prior to their inclusion in the OpenET platform.

Fig. 2

Relationship between monthly in-situ ETa measurements and ETc estimates (pooled from the three sites used by Boyko et al. 2021) and the models included in OpenET. In-situ vs a Ensemble b eeMETRIC c SSEBop d SIMS e PT-JPL f DisALEXI g geeSEBAL. The dashed line indicates a 1:1 relationship between model estimates and in-situ measurements

Fig. 3

Relationship between monthly in-situ ETa measurements and ETc estimates (pooled from the three sites used by Boyko et al. 2021) and the models included in OpenET. Willie’s Field estimates were defined by a polygon centered around the eddy covariance tower located in the middle of the field. In-situ vs a Ensemble b eeMETRIC c SSEBop d SIMS e PT-JPL f DisALEXI g geeSEBAL. The dashed line indicates a 1:1 relationship between model estimates and in-situ measurements

Mean absolute error and mean absolute percentage error

RMSE should be paired with MAE, which treats all errors equally, to better evaluate model performance (Chai and Draxler 2014) because RMSE places greater emphasis on more substantial discrepancies between model estimates and in-situ data. The MAE values, which ranged from 19.4 to 34.3 mm/month (Table 7) and 19.6 to 33.2 mm/month (Table 8), are lower than the RMSE values because the larger discrepancies between model estimates and in-situ measurements are not squared. The monthly MAE values for eeMETRIC, SIMS, PT-JPL, and the Ensemble are within the range that was previously reported for OpenET of 15.3 to 22.5 mm/month (Melton et al. 2022; Volk et al. 2024) when using values from Willie’s Footprint. When the actual field boundaries were used for Willie’s Field, only eeMETRIC, SIMS, and the Ensemble fall within the range of MAE values previously reported by OpenET. Model MAPE ranged from 16.1 to 26.2% (Table 7) and 16.2 to 25.4% (Table 8) for monthly data.

Water budget implications

We performed an exploratory exercise to better understand the range of potential implications of using OpenET for assessing the water budget of alfalfa in the Mesilla Valley. A previous study in the Mesilla Valley suggested a mean annual ET of ~1000 mm for all alfalfa fields and a theoretical optimum of 1,451 mm (Ahadi et al. 2013; Samani and Bawazir 2015) using eddy covariance tower data and the Regional ET Estimation Model (REEM), which is consistent with the in-situ data of this study. The potential impacts of using OpenET models for water budget calculations were estimated by calculating the 95% confidence interval of the observed seasonal MBE values and scaling up to the 2913.7 hectares of alfalfa grown in Mesilla Valley in 2017. Table 9 shows the upper and lower bounds of the 95% confidence interval for each model’s MBE in MCM (million cubic meters) scaled up to the 2913.7 hectares of alfalfa grown in Mesilla Valley in 2017. The confidence intervals ranged from −11.7 to +4.56 MCM for the three sites with Willie’s Field defined by the field boundaries and from −12.01 to +5.60 MCM for the three sites with Willie’s Footprint. The only model that showed potential overestimation for larger alfalfa water budget calculations using the three sites with Willie’s Field defined by the field boundaries was eeMETRIC (Fig. 4), while all other models are expected to underestimate alfalfa ETa for Mesilla Valley water budget calculations by −11.7 to −0.37 MCM. When the confidence intervals were calculated for the three sites with Willie’s Field defined by the eddy covariance tower footprint, the only models that showed potential overestimation were eeMETRIC and SSEBop. With the exception of eeMETRIC and PT-JPL, the upper bounds of all other models shifted closer to zero when Willie’s Field was defined by the eddy covariance tower footprint.
This exploratory analysis was only for one year of data and based on a small sample size of three fields, thus future work needs to include additional fields and years to improve the comparison.
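
The scaling step described above can be sketched as a short calculation. This is a hedged illustration rather than the authors' code: it assumes the 95% confidence interval is a Student's t interval on the mean of three site-level seasonal MBE values, which the text does not specify, and the function names and sample MBE values are hypothetical. The unit conversion relies on the fact that 1 mm of water over 1 ha equals 10 cubic meters.

```python
import math
import statistics

ALFALFA_AREA_HA = 2913.7  # alfalfa area in Mesilla Valley in 2017 (from the text)
T_CRIT_DF2 = 4.303        # two-sided 95% Student's t critical value for n = 3 (df = 2)


def mm_to_mcm(depth_mm, area_ha):
    """Convert a water depth (mm) over an area (ha) to million cubic meters."""
    # 1 mm over 1 ha = 0.001 m * 10,000 m^2 = 10 m^3
    return depth_mm * area_ha * 10.0 / 1e6


def mbe_interval_mcm(site_mbe_mm, area_ha=ALFALFA_AREA_HA, t_crit=T_CRIT_DF2):
    """Assumed t-based 95% CI of the mean site MBE, scaled to a volume in MCM.

    The default t_crit is only valid for three sites; pass another value otherwise.
    """
    n = len(site_mbe_mm)
    mean = statistics.mean(site_mbe_mm)
    half_width = t_crit * statistics.stdev(site_mbe_mm) / math.sqrt(n)
    return mm_to_mcm(mean - half_width, area_ha), mm_to_mcm(mean + half_width, area_ha)


# Hypothetical seasonal MBE values (mm) for three fields, not values from this study.
site_mbe_mm = [-50.0, -150.0, -250.0]
lower_mcm, upper_mcm = mbe_interval_mcm(site_mbe_mm)
```

Under this conversion, a seasonal bias of 100 mm applied to 2913.7 ha corresponds to about 2.91 MCM, which shows how modest field-scale biases become large volumes at the valley scale.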

Table 9 Seasonal MBE at a 95% confidence interval derived for the six models included in OpenET and the OpenET Ensemble for field scale estimation of alfalfa ETa
Fig. 4

The potential variation of OpenET model estimates for alfalfa water budget calculations in Mesilla Valley, New Mexico. Alfalfa constitutes approximately 20% of total irrigated agriculture in Mesilla Valley (United States Department of Agriculture 2024). The range of each model shows the 95% confidence interval calculated using seasonal MBE values in million cubic meters (MCM). The solid line 95% confidence interval ranges were calculated using the three fields with Willie’s Field defined by the field boundaries; the dashed line 95% confidence interval ranges were calculated using the three fields with Willie’s Footprint. The dashed line indicates no difference between model estimates and in-situ measurements of ETa

There is a crucial need for accurate water budgets, especially in the American Southwest drylands, as droughts, annual temperatures, and atmospheric water demand (Albano et al. 2022; Overpeck and Udall 2020; Williams et al. 2020) continue to increase in this region, impacting water resources and water security. As ET is a key component in these water budgets (Hanson 1991), it needs to be accurately estimated to ensure water resources are managed appropriately. Due to the varying degrees of model performance for the OpenET platform noted in this study, stakeholders need to be aware of potential underestimation when using these models for water budget calculations. The lower estimates when compared to in-situ measurements for all models excluding eeMETRIC would equate to between −11.7 and −0.37 MCM of water that is being lost to ET but not reported when the results of this study are expanded out to the 2913.7 hectares of alfalfa that were grown in Mesilla Valley in 2017. When using an eddy covariance tower footprint to derive OpenET model estimates, model performance generally improved, with underestimation ranging from −12.01 to −0.14 MCM of water. When using eeMETRIC for larger alfalfa water budget calculations, there may not be any differences between estimates and in-situ measurements as the 95% confidence interval includes zero, as shown in Table 9 and Fig. 4. When using an eddy covariance tower footprint to define field boundaries, most models performed better, with eeMETRIC and SSEBop 95% confidence intervals spanning zero, indicating that there may not be any differences between these models and in-situ ET data.

Model performance

The results of this study suggest that the underprediction relative to in-situ measurements noted in previous literature (Cheng et al. 2021; Li et al. 2017; Samani and Bawazir 2015) occurs in several of the models included in the OpenET platform when applied to the three alfalfa fields in this study. Similar error percentages (−14.8 to −28.9%; Mkhwanazi and Chávez 2013) were reported for the models prior to being incorporated into the OpenET platform. SSEBop exceeded −20% at the Leyendecker site and SSEBop and DisALEXI both exceeded −20% at the Horse Farm site. OpenET specifies an error of ±20% as a target accuracy. When the seasonal percent error values observed in this study for alfalfa are compared with previous literature on maize, as shown in Table 10, several models do not perform as well as indicated in the Melton et al. (2022) and Volk et al. (2024) studies conducted across the United States on cauliflower, wheat, grape, Sudan grass, lettuce, maize, soybean, alfalfa, almond, and rice.

Table 10 Seasonal percent error values for the models included in OpenET from this study and three other evaluations of OpenET

eeMETRIC and SIMS are within the seasonal percent error of ±10% (Table 10) that was reported in the OpenET intercomparison studies by Melton et al. (2022) and Volk et al. (2024), for both this study and the Djaman et al. (2023) study that took place in Farmington, New Mexico. geeSEBAL also has consistent performance between all four studies, with seasonal underestimation of more than 10% relative to in-situ measurements. The seasonal underestimation of the OpenET Ensemble, SSEBop, PT-JPL, and DisALEXI models was more than 10% for this study as well as the Djaman et al. (2023) study, but was reported as being within ±10% by Melton et al. (2022) and within ±10% for monthly estimates by Volk et al. (2024). While this study and the Djaman et al. (2023) study both took place in New Mexico, the Melton et al. (2022) and Volk et al. (2024) studies used cropland sites from multiple states and non-dryland environments. This inclusion of dryland and non-dryland regions by Melton et al. (2022) and Volk et al. (2024) could lead to the averaging out of the lower estimates associated with using ET models in dryland regions. Given the MBE results exhibited in this study, we expect the eeMETRIC and SIMS models to be within ±100 mm of seasonal alfalfa ETa. The Ensemble, PT-JPL, and geeSEBAL models are expected to consistently underestimate seasonal alfalfa ETa by upwards of 100 mm for a given field, and DisALEXI is expected to underestimate seasonal ETa by upwards of 200 mm for a given field, relative to in-situ measurements. SSEBop showed the largest change in MBE when comparing between field boundaries and an eddy covariance footprint.

While MAE can be preferred over RMSE for evaluating model performance because it does not weight larger outliers more heavily (Chai and Draxler 2014), the differences observed between the two metrics at the monthly timestep are less than 10 mm. The MAE and RMSE values observed in this study for the eeMETRIC, SIMS, and Ensemble models are within the ranges of 15.5 to 22.5 and 19.7 to 28.7 mm/month, respectively, that OpenET was found to have when initial performance was tested (Melton et al. 2022; Volk et al. 2024), while all other models are outside these ranges for both statistics. This study found model underestimation, relative to in-situ measurements, at all three alfalfa sites for the Ensemble, SSEBop, SIMS, PT-JPL, DisALEXI, and geeSEBAL models. However, only SSEBop and DisALEXI were found to significantly underestimate monthly ETa when tested against the in-situ data using a two-sample t-test when Willie’s Field was queried using a 210 m footprint centered around the eddy covariance tower. SSEBop, PT-JPL, and DisALEXI were the only models that significantly underestimated daily ETa at Willie’s Field when both the eddy covariance tower footprint and the field boundaries were queried using a Mann-Whitney U test; all other models had at least one non-significant observation. eeMETRIC and SIMS estimates of ETa were the closest to the field data, with the lowest MAE, MBE, RMSE, and percent error values, and had the lowest potential impacts on water budget calculations.

There was a difference in model performance based on how Willie’s Field was defined for model estimates, with most models having closer agreement to the in-situ ETa when the eddy covariance tower footprint was used. Most notably, the SSEBop model went from underpredicting ET by −10.77% for Willie’s Field when the true field boundaries were used to define the field to overestimating Willie’s Field ET by +2.14% (Table 2) when the eddy covariance tower footprint was used to define the field. DisALEXI, geeSEBAL, and the Ensemble models all saw improved performance with reduced MBE, MAE, and RMSE values as well when estimates for Willie’s Field were defined by the eddy covariance tower footprint (Tables 7 and 8). eeMETRIC, SIMS, and PT-JPL all had either similar or slightly poorer performance assessment statistics when the eddy covariance tower footprint was used to define Willie’s Field. While it is important to consider factors that can contribute to underestimation of remotely sensed ET such as edge effect and mixed pixels, it is not always possible for users of OpenET to sample their fields using appropriate buffers. OpenET currently displays ET estimates for polygons that are similar in size to or smaller than Horse Farm and Leyendecker, and it is important that stakeholders who do not have an extensive understanding of remote sensing ET models understand the underestimation factors that can affect their ET estimates, as shown in this study.

Potential model underperformance causes

OpenET classifies eeMETRIC, geeSEBAL, DisALEXI, SSEBop, and PT-JPL as either full or simplified implementations of the surface energy balance (SEB) approach (Melton et al. 2022; Volk et al. 2024), while SIMS computes ET as a function of canopy density using a crop coefficient approach (Volk et al. 2024). The varying degrees of performance in the SEB models (eeMETRIC, geeSEBAL, and DisALEXI) are most likely due to slight differences in how the models solve for the variables required for the SEB approach. One potential explanation for the greater discrepancies relative to the in-situ data observed in this study for DisALEXI is the inclusion of coarse spatial resolution imagery from GOES and MODIS as a primary model input. Using coarse spatial resolution imagery as a primary data input could increase ET error due to edge effects and the inclusion of mixed pixels (Jones and Sirault 2014; Piñon-Villarreal et al. 2020) such as those observed in this study.

The underestimation issues, relative to in-situ measurements, noted in this study for the simplified SEB models (SSEBop and PT-JPL) could be due to the assumptions made for simplifying the SEB equation. PT-JPL and eeMETRIC both use surface and thermal data from Landsat TM/ETM+/OLI and meteorological data from NLDAS, indicating that the varying degrees of performance between the two models are most likely due to assumptions made in PT-JPL’s calculation of ETa. Because SSEBop only calculates latent heat flux for ETa (Senay 2018), the model requires less input data for use in the OpenET platform when compared to models like eeMETRIC and SIMS (Melton et al. 2022). The lack of additional data may cause issues with capturing local variability and could be a potential reason why it is producing significantly different monthly ETa estimates when compared to the in-situ data. There are several potential reasons why the SIMS model is generating ET estimates closer to the in-situ data than other models included in the OpenET platform. It is the only model that uses Sentinel 2a and 2b imagery, with a spatial resolution of up to 10 m, as a primary input, reducing the possibility of underestimation issues due to edge effects, and it is the only model that does not use the SEB equation but instead relies on crop coefficients and ETo to calculate ETa.

The Ensemble underestimation observed in this study can be explained by the lower estimates observed in the other models included in the OpenET platform; additional underestimation could be due to the MAD outlier removal process OpenET uses for calculating Ensemble estimates. OpenET states on its website that model estimates can be dropped from the Ensemble calculation if they are outside ±2 times the MAD, but it is rare that more than one model is dropped in cropland environments; however, when one or more models are dropped, it is commonly in dryland environments, and the models are typically underestimating ETa. While this may be the more frequent case, it is possible that the outlier values are closer to the true ETa value and are dropped in favor of models that are producing more erroneous values. OpenET did report this occurring in their initial test of the platform, but it was a rare occurrence (Melton et al. 2022; Volk et al. 2024).
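
The outlier removal described above can be sketched as follows. This is only an illustration under stated assumptions: MAD is taken to mean the median absolute deviation, values further than 2 × MAD from the median of the model estimates are dropped, and the remaining values are averaged. OpenET's actual implementation may differ, and the function name, model labels, and values are hypothetical.

```python
import statistics


def mad_filtered_ensemble(estimates, k=2.0):
    """Mean of model estimates after dropping values beyond k * MAD of the median.

    `estimates` maps a model label to its ET estimate. Returns the filtered
    mean and the sorted labels of the dropped models.
    """
    values = list(estimates.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    kept = {name: v for name, v in estimates.items() if abs(v - median) <= k * mad}
    dropped = sorted(set(estimates) - set(kept))
    return statistics.mean(list(kept.values())), dropped


# Hypothetical monthly ET estimates (mm) from six unnamed models.
estimates = {
    "model_a": 160.0,
    "model_b": 155.0,
    "model_c": 140.0,
    "model_d": 135.0,
    "model_e": 130.0,
    "model_f": 80.0,
}
ensemble_mm, dropped_models = mad_filtered_ensemble(estimates)
```

In this example the lowest estimate is removed and the ensemble is the mean of the remaining five values. If the dropped value were in fact the one closest to the true ETa, the filter would move the ensemble away from it, which is the rare case discussed above.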

Several of the models included in OpenET use coarse spatial resolution imagery from MODIS and GOES (Anderson et al. 2007; Melton et al. 2022) either as primary data inputs or for gap-filling. The coarse spatial resolution of MODIS and GOES may cause underestimation issues due to the inclusion of the surrounding desert landscape and irrigated agriculture in a single pixel (Jones and Sirault 2014; Piñon-Villarreal et al. 2020). Even with finer spatial resolution imagery such as Landsat and Sentinel 2, there can still be issues with mixed pixels and edge effects if the area of the field is relatively small. In these cases it is recommended to assess ET using a buffer or single point to reduce the effects the surrounding area can have on ET estimates. While Willie’s Field in this study was able to produce a 7×7 grid of 30 m pixels for ET estimates that will not be affected by the surrounding landscape, this is not always the case. Horse Farm and Leyendecker fields were too small to create such a footprint, and this is a common issue for stakeholders. OpenET has noted the need to create a best practices manual that could potentially address how to implement this tool for relatively small agricultural fields in dry regions; however, this manual is currently in development and there is very little information available on the platform as to how users can still obtain reliable estimates of ETa for smaller agricultural fields in dry regions.

Alfalfa poses a unique challenge for measuring ET using satellite-based models due to multiple cutting periods throughout a growing season. The period between a cut and the return of the alfalfa field to full canopy cover is often very short, 12 days on average in Boyko et al. (2021). Landsat thermal data have an 8-day temporal resolution; however, images are often obscured by clouds. In the Mesilla Valley, this happens often during the peak growing season due to the seasonal monsoon between July and September. In 2017, the average return period was 11.8 days with Landsat ETM+ and OLI, thus missing important information between dates. Future investigations into OpenET should consider this nuance given the prevalence of alfalfa in the western United States.
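
An average return period of this kind can be computed from a list of usable acquisition dates. The sketch below is illustrative only: the function name and the dates are hypothetical and are not the 2017 acquisitions used in this study.

```python
from datetime import date


def return_interval_days(acquisition_dates):
    """Return the mean and maximum gap (days) between consecutive usable images."""
    ordered = sorted(acquisition_dates)
    if len(ordered) < 2:
        raise ValueError("at least two acquisition dates are required")
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps), max(gaps)


# Hypothetical cloud-free acquisition dates; one 8-day overpass is assumed lost
# to cloud cover in each of the two longer gaps.
usable_dates = [date(2017, 7, 1), date(2017, 7, 9), date(2017, 7, 25), date(2017, 8, 10)]
mean_gap_days, max_gap_days = return_interval_days(usable_dates)
```

Comparing the maximum gap with the roughly 12-day regrowth period gives a quick indication of whether a full cutting and regrowth cycle could fall between two usable images.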

The boundaries used for extracting estimated ETa from OpenET are an important consideration when applied to comparison studies and water management decisions. The platform currently offers users the ability to select a predefined field polygon to retrieve a depth of ET, which is perhaps the path of least resistance for ET retrieval for novice users. Many agricultural fields in irrigated valleys are narrow, sometimes narrower than 90 m, and land cover is heterogeneous, which makes the ET values subject to impacts from the edge effect (Jones and Sirault 2014; Piñon-Villarreal et al. 2020) of the 100 m native pixel size of the Landsat thermal band. Piñon-Villarreal et al. (2020) suggest a field buffer size of 0.75 × the Landsat thermal pixel spatial resolution to optimize for edge effects. However, in the Mesilla Valley, several thousand of the polygons included in OpenET are not large enough to apply the buffer and retain even one center pixel without edge effect influences. Users unfamiliar with the nuances of remote sensing methodologies would most likely not be aware of the potential errors introduced by the edge effect. Future improvements to the OpenET platform could include a warning to users when a small field polygon is selected to indicate potential error in the estimated value due to edge effect.
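
A check of the kind such a warning could rely on is sketched below. It is a simplified illustration, not an OpenET feature: the field is approximated as a rectangle, the buffer is applied inward from every edge, and a field is assumed to be usable only if the remaining interior is at least one 30 m output pixel wide in both directions. The function names and the 30 m threshold are assumptions.

```python
THERMAL_PIXEL_M = 100.0  # native Landsat thermal pixel size cited in the text
BUFFER_FACTOR = 0.75     # buffer factor attributed to Piñon-Villarreal et al. (2020)
OUTPUT_PIXEL_M = 30.0    # assumed size of one ET output pixel


def interior_after_buffer(width_m, length_m):
    """Return the interior width and length (m) left after an inward buffer."""
    buffer_m = BUFFER_FACTOR * THERMAL_PIXEL_M  # 75 m from each edge
    inner_width = max(width_m - 2 * buffer_m, 0.0)
    inner_length = max(length_m - 2 * buffer_m, 0.0)
    return inner_width, inner_length


def has_center_pixel(width_m, length_m):
    """True if at least one output pixel fits inside the buffered interior."""
    inner_width, inner_length = interior_after_buffer(width_m, length_m)
    return inner_width >= OUTPUT_PIXEL_M and inner_length >= OUTPUT_PIXEL_M


narrow_field_ok = has_center_pixel(90.0, 400.0)   # hypothetical narrow field
square_field_ok = has_center_pixel(210.0, 210.0)  # hypothetical 210 m square field
```

Under these assumptions, a 90 m wide field has no interior left after a 75 m buffer on each side, regardless of its length, while a 210 m square retains a 60 m interior.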

Conclusions

This study utilized ETa measurements and ETc estimates from Boyko et al. (2021) derived from eddy covariance and crop coefficient methods for three alfalfa fields during the 2017 growing season in Mesilla Valley, New Mexico and compared them to OpenET’s timeseries data for the OpenET models. Two-sample t-test results found that two models (SSEBop and DisALEXI) included in OpenET significantly underestimated monthly alfalfa ETa when compared to in-situ measurements (p-values of 0.0338 and 0.0348) for a combination of Horse Farm, Leyendecker, and Willie’s Field boundary polygons. When estimates were queried at Willie’s Footprint, no models were found to be significantly different at the monthly timestep. Seasonal ETa percent error was found to range between −33.99 and +11.37%, depending on the model used. These results indicate that some models (eeMETRIC and SIMS) can be used to estimate seasonal alfalfa ETa within 10% of in-situ measurements. The potential impacts of the lower estimates exhibited in this study when related to water budget calculations are seasonal alfalfa ETa underestimation of −12.04 to −0.14 MCM for alfalfa fields in Mesilla Valley when DisALEXI, PT-JPL, SIMS, geeSEBAL, and the OpenET Ensemble are used. eeMETRIC was the only model found to have a 95% confidence interval for alfalfa water budget estimations that spanned zero when Willie’s Field was defined using the field boundaries, while eeMETRIC and SSEBop were found to have 95% confidence intervals that spanned zero when Willie’s Field was defined using the eddy covariance tower footprint. For these models, the results indicate that there may not be any differences between model estimations and in-situ data.

If model errors, specifically lower estimates, are not accounted for in water budgets, ETa and consumptive water use will be underrepresented, potentially leading to overestimation of water resources. Water budgets calculated with underestimated ETa could potentially lead to water scarcity, reduced crop productivity, and inefficient water allocation. Addressing this underestimation is crucial for ensuring sustainable irrigation strategies and efficient water resource management, particularly in regions such as the American Southwest where droughts and water scarcity are prevalent issues.

Remote sensing ET models have the potential to provide accurate ETa data to a range of stakeholders for purposes such as water reporting and adjudication on a scale that point measurement systems cannot. While several remote sensing ET models, such as the ones included in OpenET, have been found to accurately estimate ETa in different regions, several still fail to accurately estimate seasonal ETa in dryland environments. These lower estimates when compared to in-situ measurements need to be addressed, and models may need to be refined to accurately estimate ETa in the American Southwest drylands, where water budgets are crucial for water management and ensuring a secure water future for future generations. Future work should focus on improving OpenET’s model estimates in smaller agricultural fields and for areas with multiple crops grown in a mosaic pattern. Further exploration is required to determine if the discrepancies between OpenET’s model estimates and in-situ measurements of ET observed in this study occur with other crops, such as pecans and cotton, in dryland environments.