Introduction

Accurate pre-harvest yield forecasting is essential for all agricultural and horticultural industries as it helps growers (and the greater industry) be better prepared to meet market demands and respond to potential production issues such as seasonal, locational and varietal variability and climate volatility. Whilst there has been extensive research on the accuracy of remote sensing (RS) for the yield forecasting of grain crops (Weiss et al., 2020), there has been little work on similar applications in horticulture and even less on carrots. This paper addresses this shortfall by evaluating the accuracy of RS for forecasting yield in commercial carrot crops grown over multiple seasons and locations within Australia.

Carrot is an important vegetable crop due to its high nutrient content and benefits for human health (Que et al., 2019). However, forecasting carrot yield is not a common practice. This is likely because carrots grow underground, and the farming systems used are often intensive, characterised by small planting areas, spatially dispersed crop planting (i.e. crops distributed with spatial gaps), and frequent crop rotation. These characteristics also limit the development and adoption of commercial carrot yield monitors or other technological developments that allow growers to have a clear understanding of the within-field yield variability.

In a recent review by Schauberger et al. (2020), 362 studies on yield forecasting were identified from a query to the Web of Science® covering 2004 to 2019. Among horticultural crops, 12 papers explored potatoes, followed by olives (9), citrus (5), apples (4), mangoes (3) and strawberry (2), with none on carrots. Subsequent to this review, Suarez et al. (2020) explored the accuracies of hyperspectral data, Sentinel-2 (S2) and Worldview-3 (WV3) satellite imagery for forecasting carrot yield prior to harvest as a surrogate approach for yield monitors.

While accurate crop yield forecasting can be achieved using various RS platforms and sensors (satellite, airborne, UAV, proximal) (Weiss et al., 2020), it is crucial to consider the intricate interactions between canopy reflectance properties and factors such as crop type, phenological stages, plant densities, soil type, and agro-climatic zones (Al-Gaadi et al., 2016; Mkhabela et al., 2011). Vegetation indices (VIs) have been developed to measure different biophysical or biochemical variables of crops, such as water status, chlorophyll content, or biomass (Zarco-Tejada et al., 2005). These VIs change during the growing season, reflecting variations in crop variables that, in turn, impact crop status, strongly correlated with yield (Schlemmer et al., 2013). VIs thus act as proxies for yield forecasting (Shanahan et al., 2001). To establish robust relationships with yield, reflectance-based data must be calibrated to account for these influences.

The simplest calibration approaches are linear or non-linear regressions using reflectance information, usually in the form of VIs, to estimate biophysical variables (Schauberger et al., 2020). Although simple, these methods are statistically preferable as they are easier to interpret, are less likely to overfit and have been proven accurate (Robson et al., 2017; Suarez et al., 2020). Bolton and Friedl (2013) established relevant univariate linear regressions between the Normalised difference vegetation index (NDVI), the two-band Enhanced vegetation index (EVI2) and the Normalised difference water index (NDWI) derived from MODIS and maize and soybean yield at the regional level. Due to the influence of phenological growth stage on the reflectance properties of the crop, yield predictions from the linear models improved when they included days after sowing (DAS) or ‘greenup’ instead of the “day of the year”. From this analysis, 70 and 80 DAS were identified as the optimum growth stages that produced the highest correlation between the VIs and maize and soybean yield, respectively. Of the VIs tested, NDWI produced the highest correlation to yield in maize (R2 = 0.58) and EVI2 in soybean (R2 = 0.70). For potato (another subterranean crop), Al-Gaadi et al. (2016) obtained prediction errors of between 7.9 and 13.5% from Landsat-8 and between 3.8 and 10.2% from S2 using a univariate approach a few days prior to harvest.
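As an illustration of the three indices named above, the following minimal sketch computes them from band reflectances using their commonly published forms. The function and argument names are hypothetical, and the NIR/SWIR form of NDWI is an assumption here; the formulations actually used in this study are those listed in Table 2.

```python
def ndvi(nir, red):
    """Normalised difference vegetation index."""
    return (nir - red) / (nir + red)


def evi2(nir, red):
    """Two-band enhanced vegetation index (commonly published form)."""
    return 2.5 * (nir - red) / (nir + 2.4 * red + 1.0)


def ndwi(nir, swir):
    """Normalised difference water index, NIR/SWIR form (assumed here)."""
    return (nir - swir) / (nir + swir)


if __name__ == "__main__":
    # Illustrative reflectances for a green canopy (made-up values).
    print(ndvi(0.5, 0.1), evi2(0.5, 0.1), ndwi(0.5, 0.2))
```

All three are ratios of band differences, which is why they are largely insensitive to uniform changes in illumination.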

Although in recent times more complex regression approaches, including Machine Learning (ML) and Artificial Intelligence (AI) algorithms, have gained popularity for the forecasting of crop yield from remote sensing, their capacity to accurately extrapolate or forecast under unknown events is often limited in comparison to regression models (Johnson et al., 2016; Shaub, 2020). These approaches also require large datasets for calibration and validation of the models. Johnson et al. (2016) tested multivariate linear regressions (MLR), Bayesian neural networks (BNN) and model-based recursive partitioning models for predicting the yield of barley, canola and spring wheat over the Canadian Prairies between 2000 and 2011. The authors used NDVI and EVI from AVHRR and MODIS as predictors (i.e. MODIS-NDVI, MODIS-EVI and NOAA-NDVI) and hierarchically clustered the crops (i.e. prairies) at different geographical levels. Whilst accuracies did vary per crop, the MLR models with NDVI and EVI as predictors produced significantly higher forecasting accuracies. The interaction of NDVI and EVI (NDVI x EVI) was found to be more accurate in the forecasting of barley than canola or spring wheat yields. The rationale behind these results is the linear relationship found between MODIS-NDVI and each crop yield. By contrast, a study by Gomez et al. (2019) reported that machine learning algorithms produced the lowest prediction error (11.2% at the block level) using S2 for potato yield forecasting 2 months prior to harvest.

From previous studies, there is no single conclusively superior statistical approach for forecasting yield from remotely sensed data. The accuracies vary according to the crop, level of association (clustering, global, region and block level), resolution of the imagery and timing of capture in relation to the crop growth stage (Tedesco et al., 2021). However, one common trend is to find the point in time with the highest correlation between RS data and crop yield (Bala & Islam, 2009; Tedesco et al., 2021; Zhao et al., 2007). Tedesco et al. (2021) demonstrated that VIs serve as effective proxies for monitoring the temporal changes in sweet potato crops and distinguishing between their phenological stages. They found that the period of active growth (200–500 growing degree days, GDD) resulted in the smallest yield prediction errors, regardless of the season (i.e., summer or winter). In a similar vein, Ayu Purnamasari et al. (2019) identified the greenup period as the most suitable for predicting cassava yields using VIs derived from S2 satellite imagery and biophysical properties. Bala and Islam (2009) identified the optimal period for forecasting potato tuber yield using MODIS imagery as being between 48 and 64 DAS within a growing season of approximately 96 days. In contrast, Suarez et al. (2020) achieved overall accuracies, represented as percentage errors (%), ranging from 9.2 to 12.7% when estimating carrot root yield using WV3 satellite imagery. These estimations were made approximately 4 weeks before harvest and were conducted across various vegetable growing regions in Australia. The ranges for optimal yield estimation are often from the middle to the end of the season, with the latter being too late in the growing season for growers to implement alternative management changes to maximise yields.

As one methodology to achieve earlier yield forecasts, Rahman and Robson (2016) developed a two-step ‘time series’ approach using historic Landsat imagery and corresponding annual yield data that accurately depicted the growth profile of sugarcane using the Green normalised difference vegetation index (GNDVI). The authors fitted a quadratic equation using the annual growing profile and identified single images acquired in April (3 months prior to harvest) achieved the highest correlations to yield as this period represented when all regional crops had reached full canopy cover and had not yet started senescence. From these quadratic equations it was possible to estimate the maximum GNDVI value from any capture date and then input this value into a second linear equation between maximum GNDVI and yield. This methodology provided a large window in which imagery could be used to forecast yield (Feb–June). This is particularly useful in regions of continual cloud cover and extended planting periods.

For the forecasting of carrot yield from RS data an understanding of the seasonal growth profile, as measured from changes in canopy reflectance, needs to be established. This will indicate spectral changes associated with growth stages, seasonal, locational, varietal and management influences. From here, the attempt to develop yield forecasting can be better addressed in terms of identifying what period is best correlated with yield and how robust that relationship is. Therefore, the aim of this study is to develop a yield forecasting algorithm for carrot from remotely sensed imagery and to identify the optimum capture window (OCW), under the hypotheses that:

  1. VI values change per growing region, sowing arrangement, and per growing stage (Rapaport et al., 2014; Suarez et al., 2017; Tedesco et al., 2021);

  2. The relationship between VIs and yield varies during the growing period (Wang et al., 2016);

  3. VIs can be used as predictors of crop yield (Robson et al., 2017; Shanahan et al., 2001);

  4. The integration of more than one VI increases the prediction accuracy earlier in the season compared to using a single VI (to be tested);

  5. The regression fit needs to be both robust and simple enough to facilitate interpretation and industry adoption.

This paper addresses an important root crop (carrots) where research is lacking and the provision of yield forecasts early in the season is not available. We will demonstrate that the growing pattern as indicated by VIs varies per growing region. We will optimize the capture window to provide yield forecast as early as possible regardless of region by integrating different structure-based, pigment-based and water-related VIs in a multivariate analysis so a unique generic algorithm can be used among seasons and locations, facilitating the adoption by growers.

Material and methods

Study area

For this study, three major vegetable (carrot) growing regions were selected from both the sub-tropical (Western Australia, WA, and Queensland, Qld) and the temperate (Tasmania, Tas) climatic regions of Australia (Fig. 1). The soil type was variable across the regions. Arenosols with low water-holding capacity dominate in WA, alluvial vertisols (cracking clay soils) dominate in Qld, and nitisols occur in Tas. More information regarding the growing window of each region and management practices can be found in Suarez et al. (2020).

Fig. 1 Study area for carrot yield forecasting including three vegetable growing regions in Australia (WA Western Australia, Qld Queensland, Tas Tasmania) with location of the investigated carrot fields

Crop distribution and field data collection

The carrot crops selected for this study exhibited similar planting and harvest dates within each growing region. Carrots are predominantly grown during the winter-spring season in Qld (with a crop duration ranging from 115 to 150 days), in summer in Tas (125 days), and throughout the year in WA (with a duration of 130–165 days). Data collection occurred between March 2017 and January 2019, encompassing four growing periods in Qld and two in both Tas and WA (Table 1).

Table 1 Carrot field crops growing periods by region included in this study

Crop boundaries of between seven and sixteen carrot fields per region were manually delineated from high resolution WV3 images acquired over each region (Fig. 1). WV3 provides 8 multispectral bands in the visible (VIS) and near-infrared (NIR) with a spatial resolution of 1.24 m, 8 short-wave infrared (SWIR) bands (3.7 m) and 12 bands to map clouds, aerosol, water vapor, ice and snow (CAVIS at 30 m spatial resolution) (DigitalGlobe, 2018). For each carrot field crop, NDVI derived from WV3 images was calculated and Iso Cluster unsupervised classification (Ball & Hall, 1965) was used to assign each NDVI pixel value into one of 8 vigor classes (from very low to very high). Classification thresholds were assigned per field. Six sample areas (located over low, medium and high vigor zones) were randomly selected and a total of 18 areas per field crop were manually sampled for whole plant carrot yield assessment in an area defined according to the sowing arrangement (~ 1–2 m2). The manually harvested carrot yields from the 18 areas per crop were averaged and converted into t ha−1. This sampling methodology was applied to ensure that the variability of canopy reflectance, and therefore yield variation, was encompassed within each field crop (Suarez et al., 2020). Crop quality assessment (grading) was performed for each of the samples (visual assessment of individual carrots) using standards defined in commercial practices. Leaves (fresh biomass) were removed from roots, weighed and converted into t ha−1 for additional analysis (data not presented).
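The conversion from hand-harvested sample mass to t ha−1 can be sketched as follows. This is a minimal illustration only: the function names and the example masses are hypothetical, and it assumes each sample is recorded as root mass (kg) together with its sampled area (m2).

```python
def sample_yield_t_ha(root_mass_kg, area_m2):
    """Convert one hand-harvested sample to t/ha.

    1 kg per m2 equals 10 t per ha (10,000 m2 per ha, 1,000 kg per t).
    """
    return root_mass_kg / area_m2 * 10.0


def field_yield_t_ha(samples):
    """Average the per-sample yields of one field crop.

    `samples` is a list of (root_mass_kg, area_m2) pairs,
    e.g. the 18 sample areas harvested per field.
    """
    yields = [sample_yield_t_ha(mass, area) for mass, area in samples]
    return sum(yields) / len(yields)


if __name__ == "__main__":
    # Two made-up samples of 1.5 m2 each.
    print(field_yield_t_ha([(12.0, 1.5), (9.0, 1.5)]))
```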

Satellite image acquisition for time series analysis

S2 satellite imagery (Level 1C product), available in Google Earth Engine (GEE) (Gorelick et al., 2017), was evaluated for each of the growing seasons. This product level (1C) provides a Top-Of-Atmosphere (TOA) reflectance product produced by the European Space Agency (ESA) (European Space Agency, 2023). The lowest 1% of pixel values in each tile per band were selected to remove the darkest pixels in the image, which are likely candidates for dark objects. Cloud cover analysis was performed for each of the carrot crop fields across the growing periods and only cloud-free images over the fields were retained. A total of 99 captures were analyzed, including 43 over the Qld crops, and 28 for each of the WA and Tas regions.
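One common way to use the darkest 1% of pixels is a dark-object subtraction, sketched below for a single band. This is an assumption about the intent of the step described above, not a reproduction of the processing applied in GEE; the function names and the `fraction` parameter are hypothetical.

```python
def dark_object_value(pixels, fraction=0.01):
    """Return the upper bound of the darkest `fraction` of pixel values."""
    ordered = sorted(pixels)
    # Number of pixels in the darkest fraction (at least one).
    k = max(1, int(len(ordered) * fraction))
    return ordered[k - 1]


def subtract_dark_object(pixels, fraction=0.01):
    """Subtract the dark-object value from every pixel, clipping at zero."""
    dark = dark_object_value(pixels, fraction)
    return [max(0, p - dark) for p in pixels]


if __name__ == "__main__":
    band = list(range(100, 300))  # made-up digital numbers for one band
    corrected = subtract_dark_object(band)
    print(dark_object_value(band), corrected[0], corrected[-1])
```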

Satellite data extraction

The field crop boundaries, including the respective field crop ID, sowing date (SDate) and average yield (t ha−1), were imported into GEE using Google Fusion Tables. In GEE, a query based on sowing and harvest date was used to select the S2-L1C imagery available for each growing season. VIs (listed in Table 2) were calculated and the mean value of each multispectral band and VI was extracted per field crop. A data table was generated per region, which contained the crop information (i.e. crop ID, yield, SDate), capture date (CDate) and the reflectance values for all the available multispectral bands and the VIs. An R script (R Core Team, 2014) was designed to import the resulting data tables from GEE and to run the required exploratory and statistical analysis. Time series were produced based on the calculation of the DAS:

$$\text{DAS} = \text{CDate} - \text{SDate} \tag{1}$$

Table 2 Sentinel-2 bands and vegetation indices (VIs) tested in this study

VI and reflectance values were interpolated to ± 10 day intervals, stopping at 150 ± 10 DAS. This produced a time series consisting of eight phenological stages (PhS) (Bolton & Friedl, 2013): 10, 30, 50, 70, 90, 110, 130 and 150 DAS with 0 DAS equivalent to the SDate. An initial attempt to reduce the interpolation period to ± 5 days was performed. However, many crops did not have sufficient images available due to cloud conditions, limiting sample size and the ability to undertake the subsequent statistical analysis. Hence, the broader ± 10 day interval was used to guarantee image availability for each PhS.
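The DAS calculation (Eq. 1) and the grouping of captures into the eight PhS can be sketched as below. This is a simplified stand-in for the interpolation described above: it averages the captures that fall within ± 10 days of each PhS centre, and it treats the windows as half-open so that a capture on a boundary is assigned to one PhS only. Both choices, and all names used, are assumptions for illustration.

```python
from datetime import date

# Centres of the eight phenological stages (DAS).
PHS_CENTRES = (10, 30, 50, 70, 90, 110, 130, 150)


def das(capture_date, sowing_date):
    """Days after sowing: DAS = CDate - SDate (Eq. 1)."""
    return (capture_date - sowing_date).days


def assign_phs(das_value, half_width=10):
    """Return the PhS centre whose window contains das_value, else None."""
    for centre in PHS_CENTRES:
        if centre - half_width <= das_value < centre + half_width:
            return centre
    return None


def mean_vi_per_phs(sowing_date, captures):
    """Average VI values per PhS; `captures` is a list of (date, vi) pairs."""
    bins = {}
    for capture_date, vi in captures:
        phs = assign_phs(das(capture_date, sowing_date))
        if phs is not None:
            bins.setdefault(phs, []).append(vi)
    return {phs: sum(v) / len(v) for phs, v in sorted(bins.items())}


if __name__ == "__main__":
    sown = date(2017, 3, 1)  # made-up sowing date
    obs = [(date(2017, 5, 5), 0.6), (date(2017, 5, 15), 0.8)]
    print(mean_vi_per_phs(sown, obs))
```

A PhS with no cloud-free capture is simply absent from the returned dictionary, which mirrors the sample-size problem reported for the narrower ± 5 day window.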

Carrot canopy reflectance profile and Optimal Capture Window (OCW) for yield forecasting per growing region

The VIs measured across the growing regions were compared to better understand the variation of the canopy reflectance both temporally and spatially. The proposed VIs (Table 2) include structure-related, pigment-related and water-related indices, enabling them to effectively indicate the crop condition within the carrot fields.

The time series of the aggregated VI values were analyzed per region, from which the VI variability could be established (hypothesis 1). To identify the OCW, several univariate linear models were fitted, with the derived VIs (Table 2) used as predictors of yield (t ha−1) per region and at each PhS. By analyzing the temporal relationship between each individual VI and yield, we tested the hypothesis that such a relationship is not stable but changes according to the PhS and the VI (hypothesis 2). Once the OCW was established, we tested the hypothesis that a specific VI can be used to forecast yield (hypothesis 3). The R2 was plotted per growing region from 0 to 150 DAS and a smoothing method using local polynomial regression fitting (loess) was added to better identify trends.
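The search for the PhS with the highest univariate R2 can be sketched as follows for a single VI in one region. This is a minimal illustration, not the R script used in the study; the function names and the example numbers are hypothetical. For a simple linear regression with an intercept, R2 equals the squared Pearson correlation, which is what the sketch computes.

```python
def r_squared(x, y):
    """R2 of the simple linear regression of y on x (squared correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def best_phs(vi_by_phs, yields):
    """Return (PhS with the highest R2, dict of R2 per PhS).

    `vi_by_phs` maps each PhS (DAS) to the VI values of the field crops,
    listed in the same order as `yields`.
    """
    scores = {phs: r_squared(vi, yields) for phs, vi in vi_by_phs.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    field_yields = [40.0, 50.0, 60.0, 70.0]  # made-up yields, t/ha
    vi = {30: [0.2, 0.3, 0.4, 0.5], 70: [0.5, 0.3, 0.6, 0.4]}
    print(best_phs(vi, field_yields))
```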

Accelerating the optimal yield forecasting window capture and validation

Multivariate models for predicting yield were developed in an attempt to reduce the capture gap (CG) (i.e. time between the SDate and the OCW identified with the univariate linear models). We aimed to test the hypothesis that using this method, yield forecasts can be provided earlier in the season (hypothesis 4). Spatial variability (across the 3 regions) was also included, together with the derived VIs, in the multivariate linear models. These new models were fitted for each DAS using a stepwise regression to identify which VIs best described the variability in \(log\left( {yield} \right)\). \(Log\left( {yield} \right)\) was identified as a more suitable response variable than \(Yield\) due to the non-constant variability exhibited by the residuals in all fitted models. Upon transforming the response variable, model assumptions were satisfied for all DAS (normality, constant variability of residuals, independence). The stepwise regression method was carried out based upon the Akaike Information Criterion (AIC) to identify the model with the optimal AIC (Burnham & Anderson, 2004). To identify the OCW, the coefficient of determination (R2) value was compared for the models fitted for each DAS. Independent variables exhibiting multicollinearity were removed from the model, according to the generalized variance inflation factor (GVIF) (Fox & Monette, 1992), to produce a simplified model for predicting \(log\left( {yield} \right)\) at the OCW across the regions.
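For reference, the two selection criteria named above take the following standard forms. The symbols used here (k, n, RSS, \(\hat{L}\), \(R_j^2\), d) are introduced for illustration only and do not appear in the original analysis.

$$\mathrm{AIC} = 2k - 2\ln \hat{L}$$

$$\mathrm{AIC} = n\ln\left( \frac{RSS}{n} \right) + 2k + \text{constant} \quad \text{(linear model with Gaussian errors)}$$

$$\mathrm{VIF}_{j} = \frac{1}{1 - R_{j}^{2}}$$

Here k is the number of estimated parameters, n the number of observations, RSS the residual sum of squares, and \(R_j^2\) the coefficient of determination obtained when predictor j is regressed on all other predictors. For a term with a single degree of freedom the GVIF equals the VIF; for a term with d degrees of freedom (such as a three-level region factor, d = 2), Fox and Monette (1992) suggest comparing \(GVIF^{1/(2d)}\) across terms.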

The resulting ‘best’ models were validated with independent datasets from the Tas and WA regions, which were withheld from the training process. The validation dataset included new crops both from seasons represented in the training data and from new seasons. An independent dataset from the Qld region was not available and therefore validation results are only shown for the Tas and WA regions. Figure 2 shows the steps for satellite data extraction, processing and statistical analysis performed in this study.

Fig. 2 Flowchart of main data collection and processing steps

Results and discussion

Crop profile characterization: spatio-temporal VI variability

Average reflectance and VI values were calculated at ten-day intervals throughout the growing season, displaying distinct profiles across regions (Fig. 3). The spectral curves at sampled sites transitioned from very low VI values (similar to bare soils) to increasing values, reaching a plateau, and then decreasing between 90 and 130 DAS. This shift signifies the change in crop canopy from active growth to declining condition, consistent with the physiological growth stage when carrots maximize photosynthetic capacity (Johansen et al., 2015) and when the cessation of the carrot root growth coincides with the fall of the shoot weight (Nilsson, 1987). The alignment of VI profiles with crop growth underscores the predominant influence of crop development on RS data changes, with soil type playing a minor role. This is evidenced by the spectral profiles depicted by structural-related indices (e.g., NDVI, EVI2, SAVI) and pigment-related VIs (CHI, NDRE), whose values constantly increased as the crop developed.

Fig. 3 Smoothed vegetation indices profiles across growing regions at different days after sowing

VIs such as CHI, NDRE, NDVI and SAVI constantly increased, reaching peak values around 90 DAS in the Tas region and 110 DAS in the WA and Qld regions. This suggests that crops in the Tas region reach their maximum photosynthetic capacity earlier in the growing period than the WA and Qld crops and, as a result, are harvested earlier. Other bands or indices clearly showed that the growth profile differed between regions during the entire growing period (i.e. NDRE740, NDRE783, NDRE865 and TCARI), whilst the signatures of other VIs were similar over certain periods. The latter was the case for SR (up to 50 DAS), SIPI (70–110 DAS), and EVI2 and SAVI (up to 90 DAS). These results support hypothesis 1, as the temporal variability of the VIs fluctuates based on both the growing region and the specific VI.

Vegetation indices and carrot root yield: variability per region and growing period

Univariate linear models (yield vs. VI) were calculated at each PhS per region to determine the peak of maximum correlation for yield forecasting, \(PhS_{{R2_{max} }}\). The coefficients of determination (R2) varied per region and VI at different PhS, confirming hypothesis 2 (Fig. 4). However, \(PhS_{{R2_{max} }}\) did not always coincide with the PhS at which the maximum VI value was achieved (\(PhS_{{VI_{max} }}\)), which typically fell between 90 and 110 DAS (Fig. 3). Most of the VIs in the WA region reached maximum correlation with yield early in the season (~ 30–50 DAS), after which the relationships declined sharply until about 90 DAS. This response may indicate that rapid early vegetative development is crucial for the efficient utilization of resources, in terms of yield potential, in a short growing period (Evers, 1988; Suojala, 2000b). In the Qld region, \(PhS_{{R2_{max} }}\) occurred at around 130 DAS, indicating that canopy growth did not decline until later in the season and that the interaction of leaf senescence with carrot growth was different than in the WA crops. This interaction may be affected by genotype and the nutritional characteristics of the crops. The lengthy vegetative growth indicates that the Qld crops took more time to accumulate the final harvested yield than those in WA (Nilsson, 1987). However, a prolonged growing season does expose the carrots to increased risk of unfavorable environmental conditions such as frost.

Fig. 4 Smoothed regression coefficient (R2) from univariate linear models for each growing region

In the Tas region, the \(PhS_{{VI_{max} }}\) and \(PhS_{{R2_{max} }}\) were both around 90 DAS for many of the indices evaluated (e.g. NDRE, GNDVI, NDRE740, SR_G), suggesting that the PhS at which the maximum photosynthetic capacity occurred coincided with the peak of vegetation development indicating that root growth gain did not vary much from 110 DAS until harvest (around 125 DAS). However, EVI2, SAVI and to some extent SR, showed two peaks of correlation to yield: between 30–50 DAS and 90–110 DAS. This suggests that there is potential for yield to be estimated earlier in the growing season. Hole et al. (1987) reported that the highest differences in relative root growth, defined by the shoot to root ratio, can be estimated between 27 and 48 DAS and Suojala (2000a) found that nearly 60% of the total harvested carrot yield was gained by the middle of the growing season after which there was no significant increments in yield gain. This situation can explain the capability of RS-derived data for forecasting yield at such early growing stages supporting hypothesis 3.

Reducing the capture gap (CG) for early yield forecasting

The \(PhS_{{R2_{max} }}\) differed per region and according to the VI used. It occurred early, mid and late in the growing season (WA, Tas and Qld, respectively). Therefore, it is essential to minimize the CG (i.e. number of days from sowing to the forecast date) among the regions so the early yield forecast can be used in the current season to quantify and identify the extent of underperforming areas.

Multivariate models that included all the multispectral bands, the VIs and all regions were generated at each DAS to investigate whether it was possible to reduce the CG, in other words, to provide earlier yield forecasts. However, these models were overfitted as many of the predictors (bands and VIs) showed multicollinearity. Simplified models were tested based on the GVIF values, and the VIs with high GVIF were removed until the influence of multicollinearity was reduced. The R2 values of the best models per DAS are presented in Fig. 5. The predictive capability is similar for 30, 50, 70 and 110 DAS, with moderate R2 values in the range of 0.5 to 0.62, while the R2 values of the models for 10 and 90 DAS are lower, at 0.35 and 0.18 respectively. At 130 DAS (close to harvest), the model performs very well, with R2 = 0.8 (Fig. 5).

Fig. 5 Performance parameter R2 of the best multivariate models

Some VIs are common across most of the best models. NDWI was present in the ‘best’ model for 6 of the DAS models, while NDRE740, NDRE783 and TCARI were present in 5 models (Table 3). The region variable is present in all models except 90 DAS. Notably, VIs related to water and pigment content play a significant role in accurately estimating carrot yields. This is attributed to key limiting factors in carrot crop growth, development, and yield. These factors include a larger photosynthetic surface, often quantified as the Leaf Area Index (LAI), which can store more macronutrients such as nitrogen (N), phosphorus (P), and potassium (K) (Abdel-Mawly, 2004), as well as an ample supply of water (Jeptoo et al., 2013; Reid & Gillespie, 2017). Increased leaf nitrogen levels enhance the photosynthetic capacity of vegetation and, consequently, the chlorophyll (Chl) content (Gitelson et al., 2003). The red edge bands are highly sensitive to changes in chlorophyll content, which explains their consistent presence in the models, either directly or via VIs.

Table 3 Variables included in the best models per ‘days after sowing’ (DAS)

The resulting multivariate models were further validated with an independent dataset of 18 carrot field crops located in the WA (12) and Tas (6) regions. The actual average carrot root yield (t ha−1) per field crop was provided by the respective growers and compared with the forecasted yield (t ha−1). The Root Mean Square Error (RMSE) was calculated for each DAS model. The best performing model, in terms of adjusted R2, was at 130 DAS. However, selection of the best prediction model overall was based on a number of factors, including RMSE (Fig. 6) and the usefulness of the model in terms of reducing the capture gap. The model for 70 DAS performs well in terms of R2 and RMSE for both the training and the validation datasets.

Fig. 6 Root mean square error of the best multivariate models

The final optimal model developed for 70 DAS is shown below (Eq. 2):

$$\begin{aligned} log\left( {yield_{{t\,ha^{ - 1} }} } \right) &= 8.25 \mp 0.390\,Re_{TAS} + 0.21\,Re_{WA} \\ &\quad + \left( { - 3.32\,GNDVI - 2.55\,NDRE740 - 19.89\,NDRE783 + 16.21\,NDRE865 + 4.91\,NDWI} \right. \\ &\quad \left. { - 4.66\,RENDVI + 0.74\,RE740 - 0.322\,SIPI + 0.12\,SR} \right)_{70DAS} \end{aligned} \tag{2}$$

where Re represents the region effect, comparing Tas and WA regions to the Qld region. The adjusted R2 value for this model at 70 DAS is 0.50, and the RMSE is 10.21 t ha−1.
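Applying Eq. 2 to a new field can be sketched as below. Several points are assumptions rather than statements from the study: the logarithm is taken to be the natural log, the function and variable names are hypothetical, and the region effect is passed in by the caller (0 for the Qld baseline, 0.21 for WA). The Tas coefficient is not hard-coded because its sign is printed ambiguously in Eq. 2.

```python
import math

INTERCEPT = 8.25
RE_WA = 0.21  # WA region effect relative to Qld (Eq. 2)

# Coefficients of the 70 DAS predictors, copied from Eq. 2.
VI_COEFFS = {
    "GNDVI": -3.32,
    "NDRE740": -2.55,
    "NDRE783": -19.89,
    "NDRE865": 16.21,
    "NDWI": 4.91,
    "RENDVI": -4.66,
    "RE740": 0.74,
    "SIPI": -0.322,
    "SR": 0.12,
}


def predict_yield_t_ha(vis_70das, region_effect=0.0):
    """Back-transform the Eq. 2 prediction to t/ha.

    `vis_70das` maps each predictor name in VI_COEFFS to its field-mean
    value at 70 DAS. Natural log is assumed for the response variable.
    """
    linear_part = sum(coef * vis_70das[name] for name, coef in VI_COEFFS.items())
    return math.exp(INTERCEPT + region_effect + linear_part)
```

Because the response is log-transformed, the region effect acts multiplicatively on the predicted yield: under these assumptions a WA field is predicted at exp(0.21) times the yield of a Qld field with identical VI values.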

By integrating several VIs in the prediction model, the correlations of crop reflectance properties to yield variability may be better explained, as different VIs relate to different plant properties, i.e. vegetative cover, nutritional status and water status (Zhao et al., 2007). This may explain why the final model that includes multiple variables performed better, as the respective VIs have been related to biomass and the physiological condition of the crops (SR, GNDVI), as well as biochemical composition (NDRE) and water status (NDWI) (Zarco-Tejada et al., 2005). Furthermore, as the generic yield forecast model incorporates the spatial variability associated with growing location (region) and its interactions with the different VIs, it is more likely to compensate for a wide range of constraints that may limit yield. These results support hypotheses 4 and 5.

Validation of the final optimal model

The total harvested yield for each sampled field crop was provided by the respective growers. This value was compared against the predicted yield from Eq. (2). These comparisons are presented in Fig. 7, with the gray colored points corresponding to the data used for training the model in Eq. (2). RMSE for the training dataset was 10.21 t ha−1. Furthermore, yield forecast of 18 additional crops (12 in the WA region and 6 in the Tas region) was calculated at 70 DAS to validate the fitted model (2). Results indicated that the model performed moderately well at predicting yield for WA and Tas crops, with a reasonably small RMSE of 16.97 t ha−1 considering that the standard deviation of the validation dataset was calculated as 19.32 t ha−1. The validation data is presented in Fig. 7 as the black markers.
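The comparison between RMSE and the spread of the validation data can be sketched as follows. The function names are hypothetical; the only figures taken from the text are the reported validation RMSE (16.97 t ha−1) and standard deviation (19.32 t ha−1).

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted yields."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_to_sd_ratio(rmse_value, sd_value):
    """Ratio below 1 means the model beats predicting the mean yield."""
    return rmse_value / sd_value


if __name__ == "__main__":
    # Values reported for the validation dataset at 70 DAS.
    print(round(rmse_to_sd_ratio(16.97, 19.32), 3))
```

A ratio of about 0.88 indicates that the model error is smaller than the natural spread of the validation yields, though only by a modest margin.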

Fig. 7 Predicted vs observed root yield (t ha−1) for the 70 DAS model

Limitations

From Fig. 7, the model tended to underestimate yield (i.e. the majority of the fields in the validation dataset were below the parity line). There are two outliers (1 for each region) in the validation dataset, both with unusually large observed yields. Yields around 90–100 t ha−1 were not common across the sampled fields, and are therefore not well represented in the training dataset. The model also tends to under-predict yields for the high-yielding fields in the training dataset, but to a lesser degree. There is room to improve the model's ability to predict higher yielding crops with the inclusion of more training data from crops with higher yields. Future research could explore the use of cumulative VIs over time. As noted by Lai et al. (2018), time-integrated VIs offer a more comprehensive representation of the phenological cycle when compared to a single-date approach. This approach has the potential to enhance the accuracy of the yield estimates.
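A time-integrated VI of the kind suggested above could be computed with a simple trapezoidal rule, as in the sketch below. This is one possible formulation chosen for illustration, not the method of Lai et al. (2018); the function name and the example values are hypothetical.

```python
def cumulative_vi(das_values, vi_values):
    """Integrate a VI profile over DAS with the trapezoidal rule.

    `das_values` must be sorted in increasing order and aligned
    with `vi_values`.
    """
    total = 0.0
    for i in range(1, len(das_values)):
        width = das_values[i] - das_values[i - 1]
        total += 0.5 * (vi_values[i] + vi_values[i - 1]) * width
    return total


if __name__ == "__main__":
    # Made-up NDVI values at three phenological stages.
    print(cumulative_vi([10, 30, 50], [0.2, 0.4, 0.6]))
```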

Conclusion

The potential of remote sensing for predicting carrot yield across multiple growing regions, seasons and at different growth stages was explored in this study. When using a single VI as a predictor, the OCW varied per region and per VI. In two regions, the OCW was close to harvest. Whilst this outcome offers some benefit for pre-harvest yield forecasting, i.e. for forward selling and harvesting logistics (labor, storage, transport etc.), it is likely too late to assist growers with the implementation of remedial actions to maximize production. For the first time in root crops, the methodology proposed in this study successfully reduced the capture gap by more than 60 days for some regions by incorporating different RS data and the region as input parameters. This alone greatly improves the potential of optical remote sensing for yield forecasting in growing regions and times of the year that are cloud dominated. This result offers the immediate advantage of narrowing down yield predictions as early as 70 DAS. The fitted model presents a simple linear relationship between the regions, VIs, a multispectral band and yield. It is plausible that interactions exist between the predictors, which are yet to be explored. As more data become available, more complex models incorporating such interactions can be explored, which has the potential to improve the accuracy of yield predictions at this stage of the growing season. The outcomes presented in this study are important to industry considering the subterranean growth habit of the carrot and the limited ability to derive an accurate, non-destructive pre-harvest prediction of yield.