Introduction

As the world's population continues to grow and water resources become increasingly scarce (Boretti and Rosa 2019) due to climate change, competition among beneficial uses, and increased regulation of agricultural water use, there is an urgent need for more sustainable production practices. For example, from the winter of 2011 to 2016, California experienced an extreme drought that impacted agricultural production and served as the catalyst for passing the Sustainable Groundwater Management Act (SGMA), which limits groundwater pumping for irrigation. Currently (as of October 2021), California is experiencing another major drought that started in 2020. The increased frequency of extreme drought and increased regulation of groundwater use are making irrigated agriculture challenging in California. Achieving precision irrigation water management requires a thorough understanding of crop-specific biophysical processes, including crop response to water at different growth stages and its variability over space. Soil moisture is a critical variable that can help improve our understanding of the relationships between climate dynamics (D’Odorico and Porporato 2004), water (Gao et al. 2014), drought (Sheffield and Wood 2008), and food security (Sadri et al. 2020). Information on soil moisture is important for the development of appropriate irrigation systems to maximize crop yield, and long-term soil moisture information combined with climatic information provides insights into patterns, agricultural thresholds, and losses (Bastiaanssen et al. 2007; Lin et al. 2018). Thus, soil moisture information is required to achieve the benefits of precision agriculture and agricultural sustainability.

Soil moisture collected at the soil surface is referred to as surface soil moisture (SSM), and most remote-sensing methods are confined to the determination of SSM. Soil moisture obtained at deeper depths where plant root water uptake occurs is referred to as root zone soil moisture (RZSM) and is of much more relevance (Scott et al. 2003; Bauer-Marschallinger et al. 2019) to management and decision making. Atmospheric conditions have a more direct impact on SSM than on RZSM (Hirschi et al. 2014). A precise RZSM estimate is required for quantifying plant available water and for irrigation scheduling. In situ sensors deployed at various depths along the soil profile can provide direct RZSM data (Cosh et al. 2016). However, one disadvantage of the soil moisture sensors is that they require site-specific calibration to provide reliable measurements of volumetric water content. Because of the effects that variations in soil properties have on sensor output, it is necessary to calibrate the sensor at each site (Peddinti et al. 2020a). Also, the volume of soil sensed is somewhat limited. With the advent of new in situ and proximal sensors, new satellites, other sensing technologies, and increased modeling capabilities, soil moisture monitoring techniques are undergoing fast expansion and innovation (Peddinti et al. 2018, 2020b). As a result, an increasing variety of soil moisture data products are being developed.

Various theoretical and empirical models were developed over the last few decades to retrieve surface soil moisture at a depth of 0–5 cm (Petropoulos et al. 2015) by utilizing passive or active microwave sensors and establishing the relationship between the soil dielectric constant and water content (Jackson 1993). However, remote-sensing satellites are unable to provide direct soil moisture content in the root zone at depths of 30–60 cm because of technical limitations imposed by L-band and X-band characteristics within the microwave wavelength range (Engman 1991). In most cases, SSM data can be obtained more easily than RZSM data, and RZSM can be extrapolated from SSM data obtained either in situ or through satellites (Wigneron et al. 1999; Montaldo et al. 2001; Sabater et al. 2007). Obtaining spatially distributed RZSM can be a challenge, since the installation of a large sensor network within the subsurface is costly and time-consuming, and is likely to affect the soil characteristics (González-Teruel et al. 2019). It is, unfortunately, less common to measure RZSM at depths of 100 cm and beyond, as is required for deep rooting crops, e.g., pistachio trees. Data-driven models that relate the inputs to the desired output while remaining computationally efficient are needed to precisely predict spatially distributed RZSM (Kornelsen and Coulibaly 2014; Carranza et al. 2021).

At large scales, detailed process-based models that simulate soil moisture dynamics with the Richards equation require large amounts of data for parameterization and can be computationally expensive. Data-driven prediction techniques (Kornelsen and Coulibaly 2014) such as artificial neural networks (ANNs) (Hassan-Esfahani et al. 2015), Random Forest (RF) (Carranza et al. 2021), and statistical learning tools such as Support Vector Machines (SVMs) (Yu et al. 2012) are increasingly being used for SSM and RZSM estimation. They are designed to extract information from data by examining patterns of variability and to simulate the responses learned from the data. A common prerequisite is that in situ data are properly calibrated and sufficient for model training. Specifically, data-driven techniques implicitly include and assess all of the interacting processes that produce a specific RZSM state (Carranza et al. 2021). Machine learning (ML) techniques have been widely used in hydrology (Lange and Sippel 2020) and climate research (Huntingford et al. 2019) for the prediction and forecasting of environmental variables (Li et al. 2011), as well as for the optimization of model parameters. Over the last few years, ML approaches have become more common in soil hydrology research to estimate model-derived RZSM using ANNs or satellite-derived SSM using SVMs (Yu et al. 2012; Adab et al. 2020; Carranza et al. 2021).

Yu et al. (2012) used SVMs and the ensemble particle filter (EnPF) to develop a multi-layer soil moisture prediction model for the Meilin watershed in China, which showed that SVMs are statistically significant and resilient for soil moisture prediction in both the surface and root zone layers. Using simulated soil moisture data from the Soil and Water Assessment Tool (SWAT), Al-Mukhtar (2016) determined which ANNs were most effective for modeling the RZSM up to 2 m depth. He found that layer recurrent network and feedforward network were the most effective estimators. Kornelsen and Coulibaly (2014) applied the process-based HYDRUS model and data-driven ANNs using surface soil moisture observations to predict RZSM. They demonstrated that ANNs were capable of accurately predicting soil moisture as estimated by HYDRUS, but the performance was reduced when compared to in situ moisture observations outside the training conditions. According to the findings of a recent study by Carranza et al. (2021), in situations where adequate training data can be obtained from intense observing campaigns where soil hydraulic parameters are not accessible, the RF model was shown to be more favorable than the process-based HYDRUS model, particularly when the primary goal was to predict soil moisture content.

The purpose of this study was to compare the performance of two semi-empirical models, pySEBAL and EFSOIL, with that of the machine learning-based RF model, in combination with remote-sensing data, for predicting spatial–temporal changes in RZSM in a commercial vineyard. The in situ soil moisture data were acquired from eight TDR soil moisture sensors in a vineyard at Ripperdan Ranch, California, during the growing seasons from 2018 to 2020 as part of the Grape Remote Sensing Atmospheric Profile and Evapotranspiration eXperiment (GRAPEX).

Materials and methods

Study area description and soil moisture data

In this study, in situ soil moisture measurements were acquired from Ripperdan Ranch near Madera, CA (38.8500 N, − 120.1768 W), as part of the GRAPEX project (Kustas et al. 2018; Alfieri et al. 2019). As depicted in Fig. 1, soil moisture data were collected from eight TDR soil moisture sensors (Model: CS655, Campbell Sci. Inc., Logan, UT, USA) over the 2018–2020 grapevine growing seasons. Soil samples were collected in 2018 from eight different locations at three depths (30, 60, and 90 cm) for soil texture assessment and sensor calibration. Using the gravimetric calibration procedure (Peddinti et al. 2020a), the following calibration equation was developed to obtain calibrated volumetric water content from the sensors:

Fig. 1

Study area showing the locations of the eight soil moisture sensors (RIPC1, RIPC2, RIPC3, RIPC4, RIPT1, RIPT2, RIPT3, and RIPT4) and four flux towers (red triangles) at Ripperdan Ranch, near Madera. The sensors were placed under the grapevines for which root zone soil moisture and both carbon and water flux data were collected

$${\theta }_{\text{volumetric}}=\left(1.2713\times {{(\theta }_{\text{gravimetric}})}^{2}\right)+\left(0.2914\times {\theta }_{\text{gravimetric}}\right)+0.0873;$$

here, \({\theta }_{\text{volumetric}}\) is volumetric water content obtained from the sensors and \({\theta }_{\text{gravimetric}}\) is the gravimetric water content obtained from the soil samples (Table 1).
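As a minimal sketch, the calibration equation above can be evaluated as follows. The function name and the sample inputs are hypothetical, and the gravimetric water content is assumed to be expressed as a fraction; the text does not state its unit.

```python
def calibrated_vwc(theta_gravimetric: float) -> float:
    """Quadratic calibration equation from the text.

    theta_gravimetric is assumed to be a fraction (assumption, not stated
    in the source); the result is volumetric water content in cm3 cm-3.
    """
    return 1.2713 * theta_gravimetric ** 2 + 0.2914 * theta_gravimetric + 0.0873


# Hypothetical gravimetric water contents, for illustration only.
for theta_g in (0.05, 0.10, 0.15):
    print(f"{theta_g:.2f} -> {calibrated_vwc(theta_g):.4f}")
```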

Table 1 Soil properties within the sensor locations at the root zone in the study region

The daily average RZSM refers to the volumetric soil moisture content within the top 60 cm, calculated as the average of sensor readings at 30 and 60 cm or, in some cases, 40 and 60 cm depth. The majority of the grapevine root system may extend deeper than 60 cm; however, on the basis of the available data, we considered the data down to 60 cm depth as the effective RZSM. The majority of sensors were positioned beneath the vine row, where they sensed irrigation events during the growing season. In the root zone, the soil was classified as sandy loam, with 60% sand, 25% silt, and 15% clay. Soil bulk density ranged between 1.47 and 1.55 g cm−3 across the sensor locations. Field capacity was estimated as 0.21 cm3 cm−3, and the permanent wilting point was 0.10 cm3 cm−3 (Table 2). The soil moisture readings collected from all of the sensor locations over the three-year period ranged from 0.08 to 0.30 cm3 cm−3 (Fig. 2), with a saturated moisture content of 0.39 cm3 cm−3 and a residual water content of 0.048 cm3 cm−3.

Table 2 Goodness-of-fit statistics for the pySEBAL, EFSOIL, and RF models, comparing measured and predicted root zone soil moisture at eight sensor locations

Fig. 2

Box plots depicting the distribution of root zone soil moisture at eight sensor locations on the Ripperdan Ranch, near Madera, California

Remote sensing and meteorological data

To estimate the EF from an energy balance model during the 2018–2020 grapevine growing seasons, a total of 123 Landsat Thematic Mapper (TM)-Enhanced Thematic Mapper Plus (ETM+)/Operational Land Imager (OLI) images under clear sky conditions were collected from the USGS Earth Resources Observation and Science Center (https://earthexplorer.usgs.gov/). Image dates for Landsat 7 and 8 are presented in Fig. 3 along with the corresponding day of the year (DOY). The USGS EROS Center produced a high-resolution digital elevation model (DEM) from the Shuttle Radar Topography Mission (SRTM) with a resolution of 90 m, which was rescaled to a resolution of 30 m to match the Landsat resolution.

Fig. 3

Cloud free Landsat scene imagery (path: 042/043 row:034) was used to estimate evaporation fraction

The hourly and daily meteorological data required to run the pySEBAL model (solar radiation, air temperature, relative humidity, and wind speed) and the energy balance components (net radiation, soil heat flux, and sensible heat flux) were collected from the eddy covariance (EC) flux tower located within the study region, as shown in Fig. 1. ET, one of the primary biophysical processes governing root zone water dynamics, was also obtained from this EC tower. Alfieri et al. (2019) provide detailed information on the general sensor design of the GRAPEX eddy covariance flux towers, with more specific details given by Knipper et al. (2019).

Root zone soil moisture estimation from pySEBAL

Traditional systems that use soil moisture sensors in conjunction with data from meteorological stations provide only point measurements, are relatively expensive, and require frequent maintenance. IrriWatch (https://www.irriwatch.com/en/: IrriWatch, Maurik, The Netherlands) addresses this issue by combining root zone soil moisture and crop ET into a single, cost-effective commercial service that does not require hardware installation. IrriWatch is one of the first products to deliver to farmers comprehensive soil water potential and soil moisture data in the root zone derived from thermal, multispectral, and optical remote sensing. In addition, estimates of ET and crop production, among others, are provided at a spatial resolution of 10 m for any field around the globe. The Surface Energy Balance Algorithm for Land (SEBAL) is the core algorithm behind IrriWatch, and its implementation in a Python environment is referred to as pySEBAL (python Surface Energy Balance Algorithm for Land) (Hessels et al. 2017; Jaafar and Ahmad 2020). The following section includes the major equations for estimating RZSM using the pySEBAL algorithm. Readers are referred to Bastiaanssen et al. (1998a, b) and Laipelt et al. (2021) for a more detailed explanation of the algorithm.

The soil moisture content is computed as a function of the evaporation fraction (EF) using the pySEBAL package (Waters et al. 2002). Bastiaanssen et al. (1997) and Scott et al. (2003) were the first studies to identify a relationship between soil moisture and the EF, leading to the formulation of Eq. (1)

$$\theta ={\theta }_{\text{sat}}\times \text{exp}\left(\frac{\text{EF}-a}{b}\right),$$
(1)

where \({\theta }_{\text{sat}}\) is saturated soil water content (cm3 cm−3), EF is evaporation fraction from pySEBAL, and a and b are curve fitting parameters set to 1 and 0.421. Note that Eq. (1) directly relates EF to volumetric soil water content. Nutini et al. (2014) evaluated this equation in semi-arid rangeland ecosystems in Niger and Chad, and Petropoulos et al. (2020) investigated the relationship in a Mediterranean environment in Spain.
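To make Eq. (1) concrete, the sketch below evaluates it with the parameter values given in the text (a = 1, b = 0.421). The function name is hypothetical, and using the saturated water content of 0.39 cm3 cm−3 reported for the study site as the default input is an assumption made only for illustration.

```python
import math


def theta_from_ef(ef: float, theta_sat: float = 0.39,
                  a: float = 1.0, b: float = 0.421) -> float:
    """Eq. (1): volumetric soil water content (cm3 cm-3) from EF.

    theta_sat defaults to the site value quoted in the text (assumption
    for illustration); a and b are the curve fitting parameters.
    """
    return theta_sat * math.exp((ef - a) / b)


# Hypothetical EF values; theta equals theta_sat when EF equals a.
for ef in (0.3, 0.6, 1.0):
    print(f"EF = {ef:.1f} -> theta = {theta_from_ef(ef):.3f} cm3 cm-3")
```

Because the exponent is zero when EF equals a, the relation returns the saturated water content at EF = 1 and decays exponentially as EF decreases.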

The following formula is used to calculate the EF from the energy balance equation:

$$\text{EF}=\frac{{R}_{n}-G-H}{{R}_{n}-G},$$
(2)

where \({R}_{n}\) is the net radiation (W m−2) at the surface, \(H\) is the sensible heat flux (W m−2), and \(G\) is the soil heat flux (W m−2).
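A minimal sketch of Eq. (2) follows. The function name and the flux values are hypothetical; the guard against non-positive available energy is an added safeguard, not something described in the source.

```python
def evaporative_fraction(rn: float, g: float, h: float) -> float:
    """Eq. (2): EF = (Rn - G - H) / (Rn - G), all fluxes in W m-2."""
    available_energy = rn - g
    if available_energy <= 0:
        # Added safeguard (assumption): EF is undefined without available energy.
        raise ValueError("Rn - G must be positive")
    return (available_energy - h) / available_energy


# Hypothetical midday fluxes in W m-2, for illustration only.
print(evaporative_fraction(rn=500.0, g=50.0, h=150.0))
```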

The net radiation is the first and most important computing step in the pySEBAL method. \({R}_{n}\) is computed by subtracting all outgoing radiation fluxes from all incoming radiation fluxes, as given by Eq. (3)

$${R}_{n}=\left(1-\alpha \right){R}_{s\downarrow }+{\varepsilon }_{0}{R}_{L\downarrow }-{\varepsilon }_{0}\sigma {T}_{s}^{4},$$
(3)

where \({R}_{s\downarrow }\) is the incoming shortwave radiation measured at the time of satellite overpass (W m−2), \({R}_{L\downarrow }\) is the incoming longwave radiation (W m−2), \(\alpha \) is the surface albedo, \({\varepsilon }_{0}\) is the surface emissivity calculated by a semi-empirical relationship involving the Normalized Difference Vegetation Index (NDVI) and Leaf Area Index (LAI) (Xue et al. 2020), which can be retrieved from the red and near-infrared bands of the electromagnetic spectrum, \(\sigma \) is the Stefan–Boltzmann constant (5.67 × 10−8 W m−2 K−4), and \({T}_{s}\) is the temperature of the land surface (K).
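The radiation balance of Eq. (3) can be sketched as below. The function name and the input values are hypothetical and chosen only to show the order of magnitude of each term.

```python
SIGMA = 5.67e-8  # Stefan-Boltzmann constant, W m-2 K-4


def net_radiation(albedo: float, rs_down: float, rl_down: float,
                  emissivity: float, ts_kelvin: float) -> float:
    """Eq. (3): net radiation at the surface (W m-2)."""
    absorbed_shortwave = (1.0 - albedo) * rs_down
    absorbed_longwave = emissivity * rl_down
    emitted_longwave = emissivity * SIGMA * ts_kelvin ** 4
    return absorbed_shortwave + absorbed_longwave - emitted_longwave


# Hypothetical clear-sky inputs, for illustration only.
rn = net_radiation(albedo=0.2, rs_down=800.0, rl_down=350.0,
                   emissivity=0.98, ts_kelvin=300.0)
print(f"Rn = {rn:.1f} W m-2")
```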

G is expressed as a fraction of Rn, and pySEBAL calculates it with the empirical formula (Eq. 4) established by Bastiaanssen et al. (1997)

$$G={T}_{{\text{s,reference}}}\times \left(0.0038+0.007\alpha \right)\times \left(1-0.98{\text{NDVI}}^{4}\right)\times {R}_{n},$$
(4)

where \({T}_{{\text{s,reference}}}\) is the corrected land surface temperature (Ts) based on the DEM of the area of interest (AOI), considering the slope and aspect of the land surface and NDVI.
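A minimal sketch of Eq. (4) is given below. The text does not state the unit of the corrected surface temperature in this equation; the sketch assumes degrees Celsius, as is common in SEBAL-type formulations, and this assumption should be checked against the pySEBAL source before reuse. The function name and inputs are hypothetical.

```python
def soil_heat_flux(ts_ref_celsius: float, albedo: float,
                   ndvi: float, rn: float) -> float:
    """Eq. (4): soil heat flux G (W m-2) as a fraction of Rn.

    ts_ref_celsius: corrected land surface temperature, ASSUMED to be in
    degrees Celsius (the unit is not stated in the source text).
    """
    g_over_rn = (ts_ref_celsius
                 * (0.0038 + 0.007 * albedo)
                 * (1.0 - 0.98 * ndvi ** 4))
    return g_over_rn * rn


# Hypothetical inputs, for illustration only.
g = soil_heat_flux(ts_ref_celsius=30.0, albedo=0.2, ndvi=0.6, rn=500.0)
print(f"G = {g:.1f} W m-2")
```

The NDVI term reduces G under dense canopy, so a fully vegetated pixel conducts much less heat into the soil than bare soil at the same temperature and albedo.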

When using pySEBAL, an internal calibration of H is implemented, eliminating the requirement for an additional atmospheric adjustment of \({T}_{s}\) to be performed. While calculating H, pySEBAL makes use of the bulk aerodynamic resistance equation, expressed as Eq. (5)

$$H=\frac{\rho \times {C}_{p}\times dT}{{r}_{\text{ah}}},$$
(5)

where \(\rho \) is the air density (kg m−3), \({C}_{p}\) is the specific heat of air at constant pressure, equal to 1004 J/(kg K), \({r}_{\text{ah}}\) is the aerodynamic resistance to heat transfer between z1 and z2 (s/m), and \(dT\) is the temperature difference (K) between two near-surface heights (z1 = 0.1 m and z2 = 2 m) above the canopy layer (Xue et al. 2020). \(dT\) is estimated as a linear function of the corrected surface temperature \({T}_{{\text{s,reference}}}\) (Eq. 4), which is a major assumption in estimating the sensible heat flux (Bastiaanssen 1995; Allen et al. 2005b). Readers are referred to Xue et al. (2020) and Jaafar and Ahmad (2020) for a more detailed description of the pySEBAL algorithm and the automated selection of hot and cold pixels in the calculation of sensible heat fluxes. After estimating ET, IrriWatch uses a soil water balance and site-specific information on soil physical characteristics such as field capacity, wilting point, and effective crop root depth to estimate soil water content within the root zone. Soil physical characteristics are obtained from public databases such as gSSURGO. The EF data were validated against EC tower data from the tower's source footprint area within the study region.
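As a numerical illustration of Eq. (5), the sketch below computes H from a temperature difference that is a linear function of the corrected surface temperature. The coefficients c0 and c1 are hypothetical placeholders; in SEBAL they come from the internal calibration with hot and cold pixels, which is not reproduced here.

```python
CP_AIR = 1004.0  # specific heat of air at constant pressure, J kg-1 K-1


def linear_dt(ts_ref: float, c0: float, c1: float) -> float:
    """dT (K) as a linear function of corrected surface temperature.

    c0 and c1 are hypothetical calibration coefficients (assumption).
    """
    return c0 + c1 * ts_ref


def sensible_heat_flux(rho: float, d_t: float, r_ah: float) -> float:
    """Eq. (5): H = rho * Cp * dT / r_ah, in W m-2."""
    return rho * CP_AIR * d_t / r_ah


# Hypothetical values: air density 1.15 kg m-3, resistance 30 s m-1.
d_t = linear_dt(ts_ref=305.0, c0=-57.0, c1=0.2)
print(f"dT = {d_t:.1f} K, H = {sensible_heat_flux(1.15, d_t, 30.0):.1f} W m-2")
```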

Root zone soil moisture estimation from evaporation fraction (EF) and soil properties

In the absence of ground-based auxiliary measurements, Pradhan (2019) proposed a method for estimating soil moisture content from the satellite-derived EF and soil physical properties. In this study, the relationship between satellite-based EF and soil properties was used to derive RZSM, and the approach is referred to as EFSOIL (Evapotranspiration fraction and soil properties-based RZSM).

Budyko and Zubenok (1961) defined the ratio of actual crop evapotranspiration (AET) to plant-specific reference evapotranspiration (\(ETr\)) as a function of actual available soil moisture (SM) and plant available soil moisture (PAM), which can be written as

$$\frac{{{\rm{AET}}}}{{{\rm{ETr}}}} = \frac{{{\rm{SM}}}}{{{\rm{PAM}}}}$$
(6)

This is a simplified equation for a complex physical process that requires additional testing for various soil and climatic conditions. In Eq. (6), the right-hand side term can be taken as the relative saturation or soil wetness index defined as

$$ \frac{{{\rm{SM}}}}{{{\rm{PAM}}}} = \frac{{\left( {\theta - {\theta _{{\rm{wp}}}}} \right)}}{{\left( {{\theta _{{\rm{fc}}}} - {\theta _{{\rm{wp}}}}} \right)}}$$
(7)

where \(\theta \) is the actual soil water content within the root zone (cm3 cm−3), and \({\theta }_{\text{fc}}\) and \({\theta }_{\text{wp}}\) are site-specific field capacity and permanent wilting point (cm3 cm−3), respectively.

By rearranging the terms from Eqs. (6) and (7), Eq. (8) can be written as

$$\frac{\text{AET}}{\text{ETr}}=\frac{\left(\theta -{\theta }_{\text{wp}}\right)}{\left({\theta }_{\text{fc}}-{\theta }_{\text{wp}}\right)}.$$
(8)

Similarly, in Eq. (8), the left-hand side term is equal to the crop coefficient (Kc). As reported by Trezza (2002), Tasumi (2003), and Allen et al. (2007), Kc can be similar to the EF under certain conditions (Allen et al. 1998). While the parameter EF takes water stress into consideration, the actual ET is the variable that accounts for environmental stresses (Allen et al. 2005a). Assuming that EF is equal to Kc and that it is related to the soil moisture fraction defined in Eq. (8), the following equation can be written:

$$\text{EF}=\frac{\left(\theta -{\theta }_{\text{wp}}\right)}{\left({\theta }_{\text{fc}}-{\theta }_{\text{wp}}\right)}.$$
(9)

Equation (9) can be used to derive spatially distributed soil moisture content \({(\theta }_{i})\) at any spatial location denoted as

$${\theta }_{i}={\text{EF}}\left({\theta }_{\text{fc}i}-{\theta }_{\text{wp}i}\right)+{\theta }_{\text{wp}i}.$$
(10)

This is the fundamental equation that was used in the EFSOIL model to retrieve the RZSM within the vineyard at the Ripperdan Ranch by utilizing the Landsat-derived EF from pySEBAL algorithm and soil characteristics at each sensor location.
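Equation (10) can be sketched as below, using the field capacity (0.21 cm3 cm−3) and permanent wilting point (0.10 cm3 cm−3) reported for the study site as default inputs. The function name and the EF values are hypothetical.

```python
def efsoil_theta(ef: float, theta_fc: float = 0.21,
                 theta_wp: float = 0.10) -> float:
    """Eq. (10): root zone soil moisture (cm3 cm-3) from EF and soil limits.

    Defaults are the site values quoted in the text; in the study the soil
    characteristics are taken at each sensor location.
    """
    return ef * (theta_fc - theta_wp) + theta_wp


# Hypothetical EF values, for illustration only.
for ef in (0.0, 0.5, 1.0):
    print(f"EF = {ef:.1f} -> theta = {efsoil_theta(ef):.3f} cm3 cm-3")
```

Because the relation is linear, an EF between 0 and 1 always yields a soil moisture estimate between the permanent wilting point and field capacity.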

Root zone soil moisture estimation using Random Forest

To estimate RZSM at fine spatial resolution, the RF machine learning algorithm was trained using in situ soil moisture measurements from eight sensors, with EF derived from Landsat imagery and meteorological, soil, and topography data as predictor variables. RF is an ensemble-based machine learning approach that builds multiple classification and regression trees on samples drawn by bootstrapping, a procedure referred to as “bagging” (Breiman 1996, 2001). The detailed RF flowchart used in this study is shown in Fig. 4. In addition to the bootstrap sampling, the selection of variables is randomized in each decision tree, so that only a portion of the samples is used in each of the multiple trees. RF is capable of performing both classification and regression (Fig. 4). The number of trees (ntree) and the number of features (mtry) are the two most influential parameters in the RF algorithm. RF generally improves precision relative to linear regression, because it is resistant to multicollinearity and can capture complex interactions among the predictor variables (Drobnič et al. 2020). During the generation of RF samples, the input data are organized into rows (samples) and columns (features). Rows are sampled with replacement, which means that some samples may occur several times in the training set of a tree and others may not occur at all (Meyer et al. 2019).
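The row sampling with replacement and the randomized variable selection described above can be illustrated with a short standard-library sketch. It is not an RF implementation; the function names are hypothetical, the feature labels are borrowed from the variable-importance figure, and drawing mtry candidate features at each split follows the standard algorithm rather than any detail stated in the text.

```python
import random


def bootstrap_rows(n_rows: int, rng: random.Random):
    """Draw n_rows row indices with replacement; also return unused rows."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


def random_feature_subset(features, mtry: int, rng: random.Random):
    """Pick mtry candidate features without replacement (one split)."""
    return rng.sample(features, mtry)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = ["BLD", "AvgST", "EF", "NDVI", "DEM", "SolRad"]
in_bag, out_of_bag = bootstrap_rows(10, rng)
print("in-bag rows:", in_bag)
print("rows never drawn:", out_of_bag)
print("candidate features:", random_feature_subset(features, 2, rng))
```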

Fig. 4

Workflow diagram for the Random Forest (RF) model. The construction of regression trees is based on a large number of bootstrap samples. Each tree is formed by picking the datasets from each subsample and putting them together. Each tree's predictions were averaged to provide a single value for the purpose of building a model, which was then used to predict the spatial distribution of root zone soil moisture

The RF approach was implemented using the R package CAST developed by Hanna Meyer (https://cran.r-project.org/web/packages/CAST/index.html). To estimate RZSM using RF, points were randomly selected from the 2018–2020 time-series data at each sensor location. Data from all eight TDR soil moisture sensor stations on the Ripperdan Ranch were integrated to develop a single RF model, which was applied to the spatial prediction of soil moisture over the entire vineyard. Random samples were taken from the daily time-series data at each site to perform RF interpolation (Fig. 4). The samples from each station comprised 70–80% of the daily time-series measurements at that site and were then combined into a single training set for developing the RF model. For interpolation, the ntree option was set to 500 trees. The RF model was optimized by varying the mtry value from 2 to 10 for each training set proportion tested. The covariates, or predictor variables, used to construct the RF regression model were meteorological data, soil properties, and EF derived from Landsat imagery. During the training phase, EF and normalized difference vegetation index (NDVI) values from 123 images taken over 3 years were linearly interpolated to daily values between satellite dates and used as predictor variables. Meteorological factors, namely daily average solar radiation, reference evapotranspiration, wind speed, maximum, minimum, and daily average air temperature, and relative humidity, were also used as covariates. Additionally, the spatial distribution of bulk density, soil temperature (three temperature sensors were installed at the flux tower site at a depth of 10 cm), and a digital elevation model were employed as predictor variables.
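The daily linear interpolation of EF between two satellite dates can be sketched as follows. The function name, the image dates, and the EF values are hypothetical, and the sketch does not reproduce the authors' R workflow.

```python
from datetime import date, timedelta


def interpolate_daily(observations: dict) -> dict:
    """Linearly interpolate values to daily steps between observation dates.

    observations maps a date to a value (e.g., EF on a Landsat image date).
    """
    if not observations:
        return {}
    dates = sorted(observations)
    daily = {}
    for start, end in zip(dates, dates[1:]):
        span = (end - start).days
        for offset in range(span):
            weight = offset / span  # 0 at the first image, approaching 1 at the next
            daily[start + timedelta(days=offset)] = (
                (1.0 - weight) * observations[start] + weight * observations[end]
            )
    daily[dates[-1]] = observations[dates[-1]]
    return daily


# Hypothetical EF values on two image dates 8 days apart.
ef_images = {date(2019, 6, 1): 0.60, date(2019, 6, 9): 0.80}
daily_ef = interpolate_daily(ef_images)
for day in sorted(daily_ef):
    print(day.isoformat(), f"{daily_ef[day]:.3f}")
```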

The RF model was validated using k-fold cross-validation (CV). CV is widely used to estimate the performance of a model on data that have not been used for model training (Meyer et al. 2018, 2019). The CV procedure involves training k models; in each model run, the data from one fold are set aside and used for model validation rather than for model training. This way, the model's performance can be assessed using data that were not used in its training (Meyer et al. 2018, 2019). We used the index argument to account for data dependencies by leaving the entire dataset from one sensor location out. By contrast, in a random k-fold CV, each fold almost certainly contains data points from every sensor site.
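The leave-one-location-out logic described above can be sketched in plain Python. This illustrates only how the folds are formed; it is not the CAST interface, and the function name and the toy sample list are hypothetical.

```python
def leave_location_out_folds(locations):
    """Yield (held_out, train_indices, test_indices) for each location.

    locations holds one sensor label per sample; each fold withholds all
    samples from a single sensor location for validation.
    """
    for held_out in sorted(set(locations)):
        train = [i for i, loc in enumerate(locations) if loc != held_out]
        test = [i for i, loc in enumerate(locations) if loc == held_out]
        yield held_out, train, test


# Toy example: six samples from three of the sensor locations.
sample_locations = ["RIPC1", "RIPC1", "RIPC2", "RIPT1", "RIPT1", "RIPC2"]
for held_out, train, test in leave_location_out_folds(sample_locations):
    print(held_out, "train:", train, "test:", test)
```

Holding out a whole location tests how well the model transfers to a site it has never seen, which a random split cannot show because every fold then shares sites with the training data.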

Performance evaluation criteria

Several statistical goodness-of-fit indicators were employed to assess the errors between modeled and observed soil moisture values from the three models. The difference in RZSM values between observed (in situ) and modeled values was quantified using root-mean-square error (RMSE), index of agreement (d), coefficient of determination (R2), and bias (Huryna et al. 2019); see Eqs. (11)–(14)

$$\text{RMSE}={\left[\frac{1}{n}\sum_{i=1}^{n}{\left({P}_{i}-{O}_{i}\right)}^{2}\right]}^{1/2}$$
(11)
$$d=1-\frac{\sum_{i=1}^{n}{\left({O}_{i}-{P}_{i}\right)}^{2}}{\sum_{i=1}^{n}{\left(\left|{P}_{i}-\overline{O }\right|+\left|{O}_{i}-\overline{O }\right|\right)}^{2}} , 0\le d\le 1$$
(12)
$${R}^{2}=1-\frac{\sum {({P}_{i}-{O}_{i})}^{2}}{\sum {({P}_{i}-\overline{O })}^{2}}$$
(13)
$$\text{bias}=\frac{\sum_{i=1}^{n}({P}_{i}-{O}_{i})}{n},$$
(14)

where \({P}_{i}\) is modeled value, \({O}_{i}\) is the observed value, \(\overline{O }\) is the average of the observed values, and \(n\) is the number of observations.
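As a minimal sketch, Eqs. (11), (12), and (14) can be computed as follows. The function names and the sample values are hypothetical, and R2 is left out of this illustration.

```python
import math


def rmse(predicted, observed):
    """Eq. (11): root-mean-square error."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)


def index_of_agreement(predicted, observed):
    """Eq. (12): index of agreement d, between 0 and 1."""
    o_bar = sum(observed) / len(observed)
    numerator = sum((o - p) ** 2 for p, o in zip(predicted, observed))
    denominator = sum((abs(p - o_bar) + abs(o - o_bar)) ** 2
                      for p, o in zip(predicted, observed))
    return 1.0 - numerator / denominator


def bias(predicted, observed):
    """Eq. (14): mean of predicted minus observed values."""
    return sum(p - o for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical soil moisture values in cm3 cm-3.
obs = [0.10, 0.15, 0.20]
pred = [0.12, 0.15, 0.18]
print(f"RMSE = {rmse(pred, obs):.4f}")
print(f"d    = {index_of_agreement(pred, obs):.4f}")
print(f"bias = {bias(pred, obs):.4f}")
```

Note that d is undefined when every observed and predicted value equals the observed mean, because the denominator is then zero; the sketch does not guard against that case.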

Results and discussion

Model tuning and variable significance in the random forest model

The RF model built with varying proportions of training data showed the lowest RMSE when 70–80% of the total data was used, with R2 of 0.84 and RMSE of 0.02 cm3 cm−3. As a result, with ntree set to 500, the 80% training set was chosen for further evaluation of the model. The results of the tenfold cross-validation demonstrated that, compared to meteorological factors, soil bulk density, EF, NDVI, and average soil temperature had the greatest impact on the RF model accuracy (Fig. 5). The soil bulk density and the soil temperature were the most critical parameters affecting the amount of moisture present in the soil within the root zone. Water movement in the root zone was influenced by the dominant sandy soil found at the study site, which has a high bulk density and thus a low porosity due to the coarse texture that affects the saturated water content. Additionally, at low temperatures, root water uptake may be decreased due to lower evaporative demand, which is also accompanied by reductions in the photosynthetic rate of the grapevines. Topography and meteorological variables had a minor effect on the RF model accuracy (Fig. 5), which is not surprising given that the study vineyard is relatively flat. The EF variable is strongly connected to crop ET, which drives RZSM. Reference ET was found to significantly influence the RF model predictions (Fig. 5). Even though the NDVI is one of the variables used to derive EF, it had a major impact on the RF model when used as an independent covariate, which was not surprising given that canopy size affects light interception and consequently crop water use. Although precipitation has a direct impact on soil moisture, the rainfall parameter did not rank highly among the important variables; this may be explained by the Mediterranean rainfall pattern, in which there is negligible rainfall during the summer growing season.

Fig. 5

Variable importance of the RF model used in this study. Here, BLD is soil bulk density at each sensor location, AvgST is the daily average soil temperature, EF is the evaporation fraction, NDVI is the normalized difference vegetation index, DEM is the digital elevation model, SolRad is the daily solar radiation, MinRH, MaxRH, and AvgRH are the minimum, maximum, and daily average relative humidity, respectively, MaxAT, MinAT, and AvgAT are the maximum, minimum, and daily average air temperatures, respectively, Etr is reference evapotranspiration, AvgWS is daily average wind speed, and rainfall is the total precipitation on a given day

Validation of the evaporation fraction and evapotranspiration

The evaporation fraction (EF) was determined using the energy balance components from the EC tower and then correlated with the EF derived from the pySEBAL model on the available dates (Fig. 6). The correlations revealed that the EF measured at the EC tower agreed well with the remote-sensing-derived EF. This was critical, since EF is one of the important covariates (Fig. 5) in the training of the RF model for soil moisture predictions within the study area. Furthermore, the trained model was used to make spatial predictions from satellite-derived spatial EF data. The RMSE and nRMSE (normalized root-mean-square error) between measured and modeled EF were 0.07 and 0.102, respectively, with a coefficient of determination (R2) of 0.75. Irrigation was applied separately at each sensor location in the research block. The daily dynamics of ET with combined irrigation and precipitation at the flux tower location are depicted in Fig. 7. The ET patterns indicate that ET is low at the beginning of the crop season, gradually increases with crop growth as transpiration rates increase, and then begins to decrease at the end of the crop season. In addition, a relationship between soil moisture, the amount of water applied through irrigation and precipitation, and ET can be detected in Figs. 7 and 8. For example, during the middle part of the season, when ET is at its peak, RZSM is depleted at a faster rate and reaches its lowest levels. Also, soil moisture reaches its peak during the winter, when ET is low and precipitation is high (California has a Mediterranean climate with dominant winter rainfall). Additionally, it is reasonable to infer that the EF contains indirect information on irrigation, which has an impact on soil water dynamics.

Fig. 6

Evaporation fraction (EF) values obtained from the eddy covariance flux tower and those derived using pySEBAL were compared for the three crop seasons, with scatter plots reflecting the root-mean-square error (RMSE), normalized RMSE (nRMSE), and coefficient of determination (R2)

Fig. 7

Daily evapotranspiration from eddy covariance flux tower and combined irrigation and precipitation for the three seasons from 2018 to 2020 in a vineyard at the Ripperdan Ranch near Madera, CA

Fig. 8

Time-series plots of root zone soil moisture estimates from pySEBAL (orange dots), EFSOIL (blue stars), and RF models (green dots), together with in situ measurements (black solid lines), at eight sensor locations (RIPC1, RIPC2, RIPC3, RIPC4, RIPT1, RIPT2, RIPT3, and RIPT4) for the three growing seasons of grapevines. Combined irrigation and precipitation (histogram) at each sensor location is shown on the secondary axis

Root zone soil moisture dynamics

The daily average soil moisture in the root zone, considered as the top 60 cm in this study, was acquired from soil moisture sensors at eight sensor locations. These data were used to train the RF model and to validate the soil moisture predicted by the three models against measured values. The daily observed and predicted RZSM dynamics from the three models (pySEBAL, EFSOIL, and RF) are shown in Fig. 8 for the vineyard growing seasons from 2018 to 2020. It should be noted that the RZSM correlations were limited to the times when EF data were available over the 3-year study period (i.e., 123 Landsat-derived EF values were considered over the 3-year time frame). When comparing the three models with the observed RZSM, the soil moisture demonstrated significant temporal variability. During each of the three growing seasons, the RF model showed high correlations with in situ RZSM data. However, the pySEBAL model either overestimated or underestimated soil moisture when compared to in situ data from all sensor locations. Similar results were found for the EFSOIL model at all eight sensor locations. As expected, the in situ RZSM effectively tracked precipitation; for example, when rainfall was high, as in 2019 and 2020, soil moisture content within the vineyard was high. The RF model was able to capture these wetting and drying cycles better than the pySEBAL and EFSOIL models (Fig. 8). Under dry conditions, the predictions of both the pySEBAL and EFSOIL models agreed closely with in situ sensor data, but the RF model produced more accurate predictions across all conditions. It was encouraging to see that, while in situ data for the 2018 growth period were not available for the RIPC3 and RIPT3 sensor locations, the RF model predictions showed highly consistent dynamics, similar to the dynamics observed at the other sensor locations during this period. This observation provided confidence in predicting the RZSM at this vineyard, which is useful for future implementations of the RF model.

The goodness-of-fit statistical indicators obtained from the comparison of observed and predicted root zone soil moisture at each sensor location are reported in Table 2. The results from the RF model had a high R2 (> 0.80), low RMSE (0.012–0.036 cm3 cm−3), low mean bias error (− 0.008 to 0.031), and a high index of agreement value (> 0.86) at all eight sensor locations, indicating that a data-driven ML-based method such as RF is capable of accurately predicting RZSM across the study site. The pySEBAL model had the weakest performance, with low R2 (< 0.04), high RMSE (> 0.09 to 0.12 cm3 cm−3), high bias (> 0.02 to 0.052), and a low index of agreement value (0.18–0.31) at all of the monitoring locations. The EFSOIL model predictions were only marginally better than those of pySEBAL, with low R2 (< 0.09), high RMSE (0.05–0.07 cm3 cm−3), high mean bias (− 0.003 to 0.034), and a very low index of agreement value (0.09–0.22) at all eight soil moisture sensor locations. Within the vineyard, soil moisture patterns were controlled by a variety of factors including soil evaporation and crop root water uptake. Compared to the data-driven RF-based approach, the semi-empirical models based on pySEBAL and EFSOIL were unable to accurately predict the RZSM values. As discussed earlier, the RF model was developed using a number of covariates such as meteorological data and soil characteristics that have high correlations with the in situ soil moisture data. The findings from this study indicate that the RF model can be used to accurately predict soil water status to guide irrigation scheduling decisions or to evaluate the root zone soil water balance.
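Two of the indicators reported above, the mean bias error and the index of agreement, are less familiar than R2 and RMSE, so a minimal sketch of how they can be computed is given here. It assumes the bias is defined as predicted minus observed and that the index of agreement follows Willmott's original formulation; the paper does not state which variants were used. The soil moisture values are hypothetical.

```python
def mean_bias_error(observed, predicted):
    """Average of (predicted - observed); positive values mean overestimation."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)

def index_of_agreement(observed, predicted):
    """Willmott's d: 1 - sum((P - O)^2) / sum((|P - Obar| + |O - Obar|)^2)."""
    obs_mean = sum(observed) / len(observed)
    numerator = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    denominator = sum(
        (abs(p - obs_mean) + abs(o - obs_mean)) ** 2
        for o, p in zip(observed, predicted)
    )
    if denominator == 0:
        raise ValueError("index of agreement is undefined for constant, identical series")
    return 1.0 - numerator / denominator

# Hypothetical volumetric water contents (cm3 cm-3), for illustration only.
observed_rzsm = [0.20, 0.25, 0.30]
predicted_rzsm = [0.22, 0.27, 0.32]

print("MBE:", mean_bias_error(observed_rzsm, predicted_rzsm))
print("d  :", index_of_agreement(observed_rzsm, predicted_rzsm))
```

The index of agreement ranges from 0 to 1, with 1 indicating a perfect match, which is why values above 0.86 for RF and below 0.31 for the semi-empirical models point to very different levels of skill.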

Spatial root zone soil moisture dynamics

The spatial distribution pattern of soil moisture predicted by the pySEBAL, EFSOIL, and RF models for each grapevine crop growth stage (a single day for each stage) in 2018, 2019, and 2020 was evaluated at a 30 m spatial resolution, as shown in Figs. 9, 10 and 11, respectively. The spatial patterns were evaluated in relation to the growth stages of the grapevine. Several processes and events occur during the annual growth cycle of grapevines; however, the major ones are classified into four categories: budburst, bloom, veraison, and fruit maturation and harvest. In each of the 3 years, the spatial variability in RZSM differed among the three models in terms of magnitude and crop stage. The pySEBAL model revealed consistent patterns at the beginning (budburst) and end (harvest) of the crop season. However, the pySEBAL model produced very high soil moisture levels during the veraison stage that were not realistic (compared to soil moisture sensor values), given that the farmer applied deficit irrigation to allow light stress on the crop to increase grape quality at this growth stage, as shown in Fig. 6. It is worth noting that the pySEBAL modeling framework was developed for predicting ET over large spatial scales and that soil moisture is derived as an auxiliary output. This might explain why it did not perform well at the vineyard scale. The sensor measurements revealed a decrease in the observed RZSM over the months of June–July across all three crop seasons (Fig. 8). Furthermore, the RF model was more stable during both well-watered and crop stress growth stages, and performed well both spatially and temporally (Figs. 8, 9, 10 and 11). The RF-based data-driven model, which was trained solely on point location data from the TDR soil moisture sensors, provided accurate predictions at the sensor locations. The spatial distributions of RZSM from the EFSOIL model had constant soil moisture values throughout the crop's growing season, which may be attributed to the fact that the field capacity and wilting point do not vary from season to season. Although a spatial resolution of 30 m was insufficient to reliably predict soil moisture in response to crop phenological phases, spatially distributed estimates of soil moisture from the RF model can be used to refine site-specific irrigation scheduling in vineyards and other high-value crops by considering the location in addition to irrigation timing and amount.

Fig. 9

Example of the spatial distribution of root zone soil moisture from the three models (pySEBAL, EFSOIL, and RF) on a selected day of each crop growth stage in the 2018 growing season. The crop growth stages are Budburst, Bloom, Veraison, and Maturation & Harvest

Fig. 10

Example of the spatial distribution of root zone soil moisture from the three models (pySEBAL, EFSOIL, and RF) on a selected day of each crop growth stage in the 2019 growing season. The crop growth stages are Budburst, Bloom, Veraison, and Maturation & Harvest

Fig. 11

Example of the spatial distribution of root zone soil moisture from the three models (pySEBAL, EFSOIL, and RF) on a selected day of each crop growth stage in the 2020 growing season. The crop growth stages are Budburst, Bloom, Veraison, and Maturation & Harvest

Future possibilities of Random Forest model

The use of random forest modeling for spatial prediction, regression, and classification with complex data sets has become increasingly popular across many fields in recent years. This is because it has a computational advantage over other regression models and is simple to implement. Because the model relies heavily on the dependent variable and covariates, RF predictions always remain within a reasonable range of the observed data, and the model is relatively insensitive to the values of its tuning parameters. The drawback of RF approaches is that they require very high densities of in situ data within a given study region to be properly trained, and appropriate data sets are not always available. For example, obtaining soil moisture data at high spatial and temporal resolution is extremely difficult in some regions due to the high cost of the sensors and the need for frequent and proper sensor maintenance. In addition, growing a large number of trees requires significantly more computational power and resources, as well as a significant amount of time to train a model that makes decisions based on the majority of tree votes. Future research should emphasize spatial predictions of different variables in data-poor regions, as well as the development of more accurate validation methods. In addition, future work should explore developing cyber-physical infrastructures that combine low-cost sensors and Internet of Things technologies with the predictive power of RF-based machine learning models to enhance technology adoption among users.
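The bounded-prediction property mentioned above can be illustrated with a deliberately small, pure-Python sketch of a regression forest. This is not the implementation used in the study: the tree depth, the number of trees, the choice to try one third of the covariates at each split, and the synthetic relationship between the covariates (EF, soil temperature, bulk density) and RZSM are all assumptions made for illustration. For regression, the forest averages the tree outputs, which is the counterpart of majority voting in classification.

```python
import random
from statistics import mean

def best_split(rows, targets, feature_ids):
    """Return the (feature, threshold) pair with the lowest squared error, or None."""
    best, best_sse = None, float("inf")
    for f in feature_ids:
        values = sorted({row[f] for row in rows})
        for low, high in zip(values, values[1:]):
            thr = (low + high) / 2.0
            left = [t for row, t in zip(rows, targets) if row[f] <= thr]
            right = [t for row, t in zip(rows, targets) if row[f] > thr]
            if not left or not right:
                continue
            m_left, m_right = mean(left), mean(right)
            sse = (sum((t - m_left) ** 2 for t in left)
                   + sum((t - m_right) ** 2 for t in right))
            if sse < best_sse:
                best, best_sse = (f, thr), sse
    return best

def grow_tree(rows, targets, rng, n_try, depth=0, max_depth=4):
    """Grow one regression tree; a leaf stores the mean target of its samples."""
    if depth >= max_depth or len(rows) < 4:
        return mean(targets)
    split = best_split(rows, targets, rng.sample(range(len(rows[0])), n_try))
    if split is None:
        return mean(targets)
    f, thr = split
    go_left = [row[f] <= thr for row in rows]
    left = grow_tree([r for r, g in zip(rows, go_left) if g],
                     [t for t, g in zip(targets, go_left) if g],
                     rng, n_try, depth + 1, max_depth)
    right = grow_tree([r for r, g in zip(rows, go_left) if not g],
                      [t for t, g in zip(targets, go_left) if not g],
                      rng, n_try, depth + 1, max_depth)
    return (f, thr, left, right)

def predict_tree(node, row):
    """Walk from the root to a leaf and return the stored mean."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node

def fit_forest(rows, targets, n_trees=50, seed=0):
    """Fit each tree on a bootstrap sample, trying a random covariate subset per split."""
    rng = random.Random(seed)
    n = len(rows)
    n_try = max(1, len(rows[0]) // 3)  # assumed choice: one third of the covariates
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(grow_tree([rows[i] for i in idx],
                                [targets[i] for i in idx], rng, n_try))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean(predict_tree(tree, row) for tree in forest)

# Synthetic training data (NOT from the study): EF, soil temperature, bulk density.
data_rng = random.Random(42)
rows, targets = [], []
for _ in range(80):
    ef = data_rng.uniform(0.2, 0.9)
    soil_temp = data_rng.uniform(10.0, 35.0)
    bulk_density = data_rng.uniform(1.3, 1.7)
    rows.append((ef, soil_temp, bulk_density))
    targets.append(0.10 + 0.25 * ef - 0.002 * (soil_temp - 10.0)
                   + data_rng.gauss(0.0, 0.01))

forest = fit_forest(rows, targets)
extreme = (1.5, 60.0, 2.0)  # covariates far outside the training range
print(min(targets), predict_forest(forest, extreme), max(targets))
```

Because every leaf stores a mean of training targets and the forest averages those leaves, the prediction for the extreme covariate vector cannot leave the range of the training data. The same property means that a forest cannot extrapolate beyond observed soil moisture conditions, which is one reason adequate training coverage matters.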

Conclusions

This study compared two semi-empirical approaches that use the evaporation fraction (EF) from remote sensing with a machine learning data-driven approach that uses Random Forest (RF) for predicting the spatial–temporal root zone soil moisture distribution in a vineyard. When sufficient observed data for training are available, data-driven models based on machine learning can be developed that accurately predict RZSM and are less computationally expensive than process-based models. RF showed the highest agreement with observed soil moisture compared to the semi-empirical models. Soil bulk density, soil temperature, and EF were the most influential covariates for predicting spatially distributed root zone soil moisture within the vineyard. The semi-empirical analytical models pySEBAL and EFSOIL, which were tested in this work, were unable to accurately predict the root zone soil moisture dynamics at the soil moisture sensor locations; depending on the wet and dry conditions, they either overestimated or underestimated the RZSM. During the crop stress period, the pySEBAL model predicted very high spatial RZSM values that were not comparable to sensor data, but during the budburst and harvest growth stages, the model predicted reasonable spatial distributions throughout the study area. With some algorithmic adjustments or reparameterization, the pySEBAL model could potentially be enhanced to produce accurate spatially distributed RZSM predictions, which would be extremely valuable for site-specific irrigation scheduling and soil water balance evaluations, because it is based on Landsat imagery that is freely available. In summary, the RF approach produced good predictions of RZSM. Although it was demonstrated in a vineyard, the framework can be applied to other cropping systems or conditions where accurate spatially distributed predictions of root zone soil moisture are needed, as long as there are adequate soil moisture monitoring data over the range of soil textures within the field to train the machine learning algorithm. Combining remote sensing with machine learning techniques has the potential to enhance precision agricultural water management.