Introduction

Recent advances in machine learning techniques, coupled with copious amounts of data, have fostered a new niche of big data analytics in agriculture. Crop yield prediction plays a crucial role in optimizing production, allocating resources, and monitoring crop performance in real time. With the increasing availability of agricultural data, including remote sensing data, sensor data, and historical crop yield data, there is a growing need to leverage machine learning techniques to translate complex data into actionable insights for improving agricultural practices and enhancing crop yields.

Previous studies have attempted to understand yield variability in sweet corn, accounting for different planting dates (Williams, 2008), in-row spacing and genotype (Rangarajan et al., 2002; Williams, 2015; Dhaliwal & Williams, 2019), and planting date–weed control interactions (Williams & Lindquist, 2007). However, these factors reflect narrow aspects of the production of sweet corn for processing. More recent studies use long-term observational datasets to examine variability in historical yield trends of cereal crops (Jeong et al., 2016; Iwanska et al., 2018). To our knowledge, this is the first study to use fine-scale yield data (field-level observations) from mechanically harvested commercial sweet corn production fields to expand our understanding of yield variability.

Crop yield is a product of complex interactions among genotype, environment, and management practices, and is typically predicted by process-based or statistical modeling. Process-based models use detailed field-level measurements to simulate crop growth and development in response to environmental conditions and management practices (Muchow et al., 1990). In contrast, statistical models rely on historic climate and soil data and observational yield data to predict crop yields, while accounting for underlying eco-physiological relationships (Schlenker & Roberts, 2009). Process-based models generally provide more accurate outcomes than statistical models; however, statistical models scale well to the volume, velocity, and variety of available data (Roberts et al., 2017).

Statistical models implemented with machine learning algorithms provide a powerful alternative to traditional regression models and come with tools for gaining deeper insights into data. Different machine learning models have been used for crop yield forecasting, ranging from linear regression and decision trees (Jeong et al., 2016; Osman et al., 2017; Ranjan & Parida, 2019; Shahhosseini et al., 2020) to deep learning algorithms (Wang et al., 2018; Rao & Manasa, 2019; Khaki et al., 2021). For instance, Jeong et al. (2016) used random forest (RF) models to predict wheat, field corn, and potato yield at regional and global scales. Deep learning algorithms, such as convolutional neural networks (Rao & Manasa, 2019; Khaki et al., 2021), have produced substantial improvements in crop yield prediction, enabled by advances in computational power and access to larger volumes of data. However, the interpretability of higher-order deep learning algorithms remains a challenge compared to other machine learning models.

Machine learning approaches have proven successful in identifying genetic variations (Yoosefzadeh-Najafabadi et al., 2021; Xu et al., 2022), understanding weather impacts, and determining effective management practices (Crane-Droesch, 2018; Shook et al., 2021) in agriculture. These models learn from historical information, considering factors such as environmental conditions, genetics, and management practices, to make accurate predictions. However, machine learning models can suffer from overfitting, whereby they become overly specialized to the training data and struggle to generalize to new data. To address this challenge, ensemble techniques have emerged as a solution, combining multiple machine learning models to improve prediction accuracy and mitigate overfitting (Dietterich, 2000).

Ensemble techniques combine multiple models to enhance prediction accuracy and reduce overfitting. Bagging trains independent models on different subsets of the training data and combines their predictions, effectively smoothing out biases and reducing variance. Boosting builds models sequentially, each focusing on correcting the errors made by its predecessors, thus improving the overall performance of the ensemble. Stacking trains multiple models on the same dataset and uses their predictions as input features for a meta-model, capturing complex patterns and reducing overfitting. Regularization techniques, such as adding penalty terms to the model’s objective function, can be applied to each individual model within the ensemble to control complexity and prevent overfitting. Combined in these ways, ensembles provide more robust and accurate predictions on new data, improving generalization and mitigating the risk of overfitting.
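To make the bagging idea concrete, the following minimal R sketch trains regression trees on bootstrap samples and averages their predictions; the data frame, the ‘yield’ response, and B = 50 are illustrative assumptions, not taken from this study.

```r
# Minimal bagging sketch: fit B regression trees on bootstrap samples and
# average their predictions. 'train', 'test', and 'yield' are hypothetical.
library(rpart)

bagged_predict <- function(train, test, B = 50) {
  preds <- sapply(seq_len(B), function(b) {
    boot <- train[sample(nrow(train), replace = TRUE), ]  # bootstrap resample
    fit  <- rpart(yield ~ ., data = boot)                 # one high-variance tree
    predict(fit, newdata = test)
  })
  rowMeans(preds)  # averaging across trees reduces variance
}
```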

By leveraging advanced machine learning techniques to analyze vast amounts of data from commercial sweet corn fields, combined with historic weather and genotype data, it is possible to gain a deeper understanding of the complex interactions and variables that significantly impact sweet corn yield. This enables farmers, breeders, and industry professionals to make data-driven decisions that can optimize crop performance and ultimately improve yields.

The objectives of this study were to: (a) evaluate machine learning model performances on sweet corn yield prediction and (b) identify the most influential variables for crop yield predictions.

Materials and methods

Data description

Field-level historic sweet corn yield data were obtained from multiple US vegetable processors for 1992 to 2018. The dataset (hereafter referred to as ‘US sweet corn data’) contains sweet corn yields from the two primary regions of commercial sweet corn production for processing in the US: the Upper Midwest (the states of IL, MN, WI) and the Pacific Northwest (the state of WA). These regions were further classified into five production areas: IL-Irrigated, IL-Rainfed, MN-Rainfed, WA-Irrigated, and WI-Irrigated (Fig. S1). Sweet corn yields, i.e., green ear mass (Mt/ha), were recorded from contract growers’ fields by the processor. Contract growers’ fields reflect typical standards of commercial sweet corn production regarding plant density, nutrient management, pest and weed control, etc. The Materials Transfer Agreement governing the use of this dataset dictates strict confidentiality, including the names of processors, contract growers, and hybrids.

The US sweet corn dataset was accompanied by observed information on the hybrid grown, cultural practices, and important agronomic dates (planting, tasseling, and harvest; Table 1). Hybrid information was then matched to the seed source/company, and this new variable (seed source) served as a proxy for hybrid. This was done to reduce possible confounding bias arising from the likelihood of similar genetic material contributed by individual seed companies, thereby collapsing around 100 hybrids into nine unique seed sources, as described in Table 1.

Table 1 Description of genetics and crop management variables in the US sweet corn data. Seed source denotes the parent seed company that makes the hybrid grown in the contract growers’ fields

Information on growing season weather and soil characteristics was added to the US sweet corn data using field location coordinates. Weather data were obtained from the Daymet daily surface weather database on a 1-km grid for North America (Thornton et al., 2021). The following eight variables were included to represent growing season weather conditions: daily minimum, maximum, and average air temperatures; precipitation; shortwave solar radiation; growing degree days; average vapor pressure deficit; and average potential evapotranspiration (see Table 2). Daily average potential evapotranspiration was calculated using the Priestley-Taylor method, which captures both diurnal and seasonal variations in evaporative demand (Priestley & Taylor, 1972). All weather variables were calculated for four intervals corresponding to different crop growth and development stages. The intervals were estimated using the observed planting, tassel, and harvest dates recorded for each of the contract growers’ fields. The first, second, third, and fourth intervals represented the periods of 0–10 days after planting, 0–30 days before tasseling, 0–10 days after tasseling, and 0–20 days before harvest, respectively. The critical stages of sweet corn growth and development captured by the four intervals are seed germination and emergence, exponential growth, pollen shed and silking (anthesis), and kernel development (R2 or blister stage), respectively. Weather variables were also calculated over the growing season duration, representing season long estimates of all eight weather variables.
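As an illustration of how a daily weather variable can be aggregated over one of these crop-stage intervals, the sketch below computes growing degree days between two dates; the column names and the 10/30 °C base and ceiling temperatures (a common convention for corn) are assumptions, not specifications from this study.

```r
# Hedged sketch: growing degree days (GDD) accumulated over a crop-stage
# interval. 'wx' is a hypothetical data frame of daily Daymet-derived records
# with columns 'date', 'tmax_c', and 'tmin_c'.
interval_gdd <- function(wx, start, end, base = 10, ceiling = 30) {
  d    <- wx[wx$date >= start & wx$date <= end, ]
  tmax <- pmin(d$tmax_c, ceiling)  # cap daily maximum temperature
  tmin <- pmax(d$tmin_c, base)     # floor daily minimum temperature
  sum(pmax((tmax + tmin) / 2 - base, 0))
}

# Example: third interval = 0-10 days after the observed tassel date
# gdd3 <- interval_gdd(field_wx, tassel_date, tassel_date + 10)
```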

Table 2 Weather and soil characteristics included in the US sweet corn data. Using field location coordinates, weather and soil data were extracted from Daymet and SSURGO databases, respectively

Soil characteristics used to describe the environmental conditions included soil texture (clay, sand, silt content), cation exchange capacity, and soil organic carbon (see Table 2), obtained from the SSURGO database (Web Soil Survey, 2020). All soil characteristics were obtained at four depths along the soil profile (0–30 cm, 30–60 cm, 60–100 cm, and 100–200 cm; see Table 2), resulting in a total of twenty soil features in the US sweet corn dataset.

After data augmentation, the US sweet corn dataset consisted of 16,040 unique green ear yield observations (i.e., fields) with 67 explanatory features, detailed above. The explanatory variables comprised a time component (year), a spatial component (production area), genetics (seed source), crop management, and environmental conditions (weather and soil). This comprehensive set of explanatory variables captures sufficient information to explain spatio-temporal variability in sweet corn yield and was used in model building to predict sweet corn yields.

Data pre-processing

Prior to model building, the US sweet corn dataset was split into two sets — a training set to build the model and a test set to provide an unbiased evaluation of the trained model. A random subset of 70% of yield observations was assigned to the training set, and the remainder was used as the test set. k-fold cross-validation with k = 10 was used on the training dataset. This involved dividing the training data into 10 subsets, where nine subsets were used for training and the remaining subset for validation. The process was repeated 10 times, with each subset used as the validation set once, allowing us to estimate the model’s performance and its variability by averaging performance over the k validation sets.
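A minimal R sketch of this split and cross-validation scheme follows; the data frame name ‘sweetcorn’, the ‘yield’ column, and the seed are illustrative assumptions.

```r
# 70/30 train-test split followed by 10-fold cross-validation on the
# training set. 'sweetcorn' and 'yield' are hypothetical names.
set.seed(42)  # arbitrary seed for illustration
n         <- nrow(sweetcorn)
train_idx <- sample(n, size = round(0.7 * n))
train     <- sweetcorn[train_idx, ]
test      <- sweetcorn[-train_idx, ]

folds <- sample(rep(1:10, length.out = nrow(train)))  # fold assignments
for (k in 1:10) {
  cal <- train[folds != k, ]  # nine folds for model fitting
  val <- train[folds == k, ]  # one fold held out for validation
  # fit a candidate model on 'cal', evaluate on 'val', and store the metric
}
```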

Machine learning model building

Principal components regression

Principal components regression performs unsupervised dimensionality reduction via principal component analysis (PCA) in a first step and then builds a linear regression model on the transformed principal components (Jolliffe, 1986). We utilized the ‘prcomp’ function from the ‘stats’ package in R for PCA and the ‘lm’ function for linear regression (R Core Team, 2021).
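A minimal sketch of this two-step procedure, assuming a hypothetical character vector ‘predictors’ naming the numeric predictor columns and an illustrative choice of 10 retained components:

```r
# Step 1: PCA on standardized predictors; Step 2: linear regression on scores.
pca    <- prcomp(train[, predictors], center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:10])  # keep first 10 components (illustrative)
scores$yield <- train$yield
pcr_fit <- lm(yield ~ ., data = scores)

# Project test predictors onto the same components before predicting
test_scores <- as.data.frame(predict(pca, newdata = test[, predictors])[, 1:10])
pcr_pred    <- predict(pcr_fit, newdata = test_scores)
```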

Partial least squares regression

Partial least squares regression (PLS) is a supervised dimensionality reduction technique that reduces the predictors to a smaller set of uncorrelated components and performs least squares regression on these components. While principal component analysis chooses linear components that summarize the maximum variation of the predictors, PLS finds components that summarize the maximum variability in predictors while maintaining maximum correlation between components and the response (Kuhn & Johnson, 2013). We utilized the ‘plsr’ function from the ‘pls’ package in R (Mevik & Wehrens, 2007).
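A hedged sketch with ‘plsr’ is shown below; the internal cross-validation and the one-sigma rule for choosing the number of components are illustrative choices, not the study’s documented settings.

```r
library(pls)

# Fit PLS with standardized predictors and internal cross-validation
pls_fit <- plsr(yield ~ ., data = train, scale = TRUE, validation = "CV")

# Choose a parsimonious number of components, then predict on the test set
nc       <- selectNcomp(pls_fit, method = "onesigma")
pls_pred <- predict(pls_fit, newdata = test, ncomp = nc)
```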

Multiple linear regression

Multiple linear regression accommodates multiple predictors to predict a measurable response variable (Y). The multiple linear regression model can be expressed as:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_P X_P + \varepsilon,$$

where \(X_j\) represents the jth predictor and \(\beta_j\) quantifies the association between that variable and the response. The model assumes a linear relationship between the predictors and the response variable, normality of the errors, absence of multicollinearity, and homoscedasticity (Hastie et al., 2009). The ‘lm’ function from the ‘stats’ package in R was used to build the regression model (R Core Team, 2021).
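The benchmark model is then a one-liner; the formula interface expands categorical predictors (e.g., seed source) into dummy variables automatically. Variable names follow the earlier hypothetical split.

```r
# Benchmark multiple linear regression on all explanatory features
mlr_fit  <- lm(yield ~ ., data = train)
mlr_pred <- predict(mlr_fit, newdata = test)
summary(mlr_fit)  # coefficient estimates and fit diagnostics
```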

Regularized regression

Regularized regression constrains the slope coefficient estimates and shrinks them towards zero to build a more parsimonious model. We used the ‘glmnet’ package in R to implement regularized regression (Friedman et al., 2010).
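A sketch with ‘cv.glmnet’, where the penalty strength is tuned by cross-validation; the elastic-net mixing value alpha = 0.5 is an illustrative assumption (alpha = 0 gives ridge regression, alpha = 1 the lasso).

```r
library(glmnet)

# glmnet expects a numeric matrix, so factors are expanded to dummy columns
x_train <- model.matrix(yield ~ . - 1, data = train)
x_test  <- model.matrix(yield ~ . - 1, data = test)

cv_fit   <- cv.glmnet(x_train, train$yield, alpha = 0.5)  # tune lambda by CV
reg_pred <- predict(cv_fit, newx = x_test, s = "lambda.min")
```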

Multivariate adaptive regression splines

The multivariate adaptive regression splines (MARS) model creates a piecewise linear model which captures nonlinear relationships in data by automatically determining the number and location of breakpoints (or knots) (Friedman, 1991). The ‘earth’ package in R was used to implement the MARS model (Milborrow, 2019).
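A minimal sketch with ‘earth’; allowing second-degree interactions (degree = 2) is an illustrative choice, not the study’s documented setting.

```r
library(earth)

# MARS: piecewise-linear basis functions with automatically chosen knots
mars_fit  <- earth(yield ~ ., data = train, degree = 2)
mars_pred <- predict(mars_fit, newdata = test)
summary(mars_fit)  # selected hinge functions and their coefficients
```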

Random forest

Random forest (Breiman, 2001) is a tree-based ensemble model built on the concept of bootstrap aggregation. Bootstrap aggregation (or bagging) reduces the variance of notoriously noisy decision trees by averaging predictions from trees fit to many random sub-samples drawn with replacement (the bootstrap procedure). The random forest algorithm additionally considers a random subset of features at each split when constructing each tree, repeats this procedure many times, and averages the predictions made by all trees. Thus, the RF model addresses both the bias and variance components of the error and has proved to be powerful. The random forest model was implemented using the ‘randomForest’ package in R (Liaw & Wiener, 2002).

For reproducibility, the random forest model was built with carefully chosen hyperparameters. The forest consisted of 300 trees, a number determined through a validation process to ensure optimal performance. The ‘mtry’ parameter, which sets the number of randomly selected features considered at each split, was set to the square root of the total number of variables in the dataset. This selection aimed to strike a balance between model complexity and feature diversity, enabling the random forest model to effectively capture the relationships within the data.
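A sketch matching the configuration described above (300 trees, ‘mtry’ equal to the square root of the number of predictors); variable names follow the earlier hypothetical split.

```r
library(randomForest)

p <- ncol(train) - 1  # number of explanatory features (excluding yield)
rf_fit <- randomForest(yield ~ ., data = train,
                       ntree = 300,             # forest size from validation
                       mtry  = floor(sqrt(p)),  # features tried at each split
                       importance = TRUE)       # store permutation importance
rf_pred <- predict(rf_fit, newdata = test)
```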

Model performance comparison metrics

To assess model performance, the following two metrics were calculated on the test dataset.

Root Mean Square Error (RMSE) is calculated using the equation:

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

where \(y_i\) represents the observed values in the test dataset, \(\hat{y}_i\) the corresponding predictions, and \(N\) the total number of observations.

Pearson’s Correlation Coefficient (r) was calculated between yield observations and yield predictions obtained from the test dataset.
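Both metrics reduce to one line of R each; here they are sketched using the hypothetical random forest predictions from the previous section.

```r
# Test-set performance metrics for a fitted model
rmse <- sqrt(mean((test$yield - rf_pred)^2))  # Root Mean Square Error
r    <- cor(test$yield, rf_pred)              # Pearson's correlation coefficient
```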

Results

Model performance comparisons

Model performances for sweet corn yield prediction were compared using root mean square error (RMSE) and Pearson’s correlation coefficient (r). Using the same test dataset across all models, the comparison metrics were computed between predicted and observed yield values. The random forest model outperformed all trained models, with the lowest RMSE of 3.29 Mt/ha and the highest Pearson’s correlation coefficient of 0.77 (Fig. 1). Multivariate adaptive regression splines (MARS), regularized regression, and partial least squares models did not show meaningful improvements over the benchmark multiple linear regression model in either RMSE or Pearson’s correlation coefficient (Fig. 1). Principal components regression had the worst prediction accuracy among the six trained models.

Fig. 1 Comparison of machine learning model performances on the test dataset (sample size = 4,800). Scatterplots of observed vs. predicted yields are shown. The solid line represents the 1:1 relation between observed and predicted yields. Model comparison metrics shown are RMSE and Pearson’s correlation coefficient (r)

Variable importance plot

Variable importance plots provide a better understanding of the data by quantifying the contribution of explanatory variables to predictive modeling. Since the RF model was the best performing model, this section discusses the important variables extracted from the RF model. The metric used to quantify variable importance is the loss in RMSE, as described in Fisher et al. (2019): the more influential a variable, the greater the loss in model performance when its effect is eliminated. To eliminate the effect of a variable, its values were permuted and model performance was re-evaluated using the RMSE metric. Fifty permutations were carried out, and the average loss in RMSE is reported (Fig. 2). The top fifteen variables for predicting sweet corn yield, ranked in descending order of importance, are shown in Fig. 2.
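A hedged sketch of this permutation procedure follows; the study cites Fisher et al. (2019), and the implementation below is a generic re-creation with a hypothetical column name in the usage example.

```r
# Permutation importance: average increase in RMSE over n_perm shuffles of
# one variable, relative to the unpermuted baseline.
perm_importance <- function(fit, data, var, n_perm = 50) {
  base_rmse <- sqrt(mean((data$yield - predict(fit, newdata = data))^2))
  losses <- replicate(n_perm, {
    shuffled        <- data
    shuffled[[var]] <- sample(shuffled[[var]])  # break variable-response link
    sqrt(mean((data$yield - predict(fit, newdata = shuffled))^2)) - base_rmse
  })
  mean(losses)  # average loss in RMSE attributable to 'var'
}

# Example (hypothetical column name):
# perm_importance(rf_fit, test, "precip_season")
```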

Fig. 2 Variable importance plot calculated using fifty permutations and the RMSE loss function for a random forest model. The top fifteen variables are shown, with year being the most important and soil organic carbon content (0–30 cm) the least important

Year, production area, and seed source were the top three influential variables in predicting sweet corn yield. Season long precipitation and average minimum temperature during the third crop growth interval were the two most important weather variables. The only soil characteristic that appeared among the top fifteen predictor variables was average soil organic carbon at 0–30 cm soil depth (Fig. 2).

Partial dependence plots

Partial dependence plots show the marginal effect of continuous predictor variables on sweet corn yield (Fig. 3). On average, sweet corn yield increased from nearly 16.5 Mt/ha to 20.0 Mt/ha over one decade (2000–2010). Since 2010, crop yields have shown no improvement. Crop yields remain relatively stable across season long precipitation until total precipitation exceeds ~500 mm.
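For reference, a partial dependence curve like panel (A) of Fig. 3 can be sketched directly from a fitted randomForest object; the variable name “year” follows the text, and the fitted object comes from the earlier hypothetical sketch.

```r
library(randomForest)

# Marginal effect of 'year' on predicted yield: predictions are averaged over
# the training data while 'year' is varied across its observed range
partialPlot(rf_fit, pred.data = train, x.var = "year",
            main = "Partial dependence of yield on year")
```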

Fig. 3 Partial dependence plots, based on results from the random forest model, showing the mean marginal influence of the top six explanatory variables on sweet corn yield prediction: (A) year, (B) season long precipitation, (C) third interval avg. minimum temperature, (D) second interval avg. vapor pressure deficit, (E) fourth interval growing degree days, and (F) fourth interval avg. temperature. Each plot represents the effect of a single variable while holding the other variables constant

During the third crop growth interval, average minimum temperature shows a detrimental effect on crop yields beyond 16 °C. The dependence of crop yield on the remaining important weather variables is shown in Fig. 3.

Discussion

This study utilizes machine learning models to understand the temporal and spatial heterogeneities in sweet corn yield as a function of genetics, management, weather, and soil factors. Machine learning models were compared using common model performance metrics: RMSE and Pearson’s correlation coefficient. Our results illustrate that the RF model is highly effective for sweet corn yield prediction, outperforming all other machine learning models we tested. The accuracy obtained in this study (r = 0.77) is similar to that obtained in field corn yield prediction using different machine learning models (Ahalawat & Minsker, 2016; Pantazi et al., 2016). However, this is the first report of a data-driven analysis predicting sweet corn yields using yield data at a finer scale (i.e., field-level) to build machine learning models.

The high performance of the RF model is likely more evident when the response is a result of complex interactions among multiple predictors, such as crop management, environmental, and edaphic factors, that can complicate classical linear regression modeling. Crop production variables from weather, management, and soil factors are usually highly correlated and violate the assumption of independent variables in traditional linear regression models. In such instances, RF models are highly advantageous. Additionally, RF models allow flexibility in using both categorical and continuous data in model-building frameworks.

The second objective of this study was to understand which variables contribute the most to sweet corn yield prediction. Variable importance and partial dependence plots are key utilities of RF models that allow comparisons of variable importance and dependence. Our results indicate that year and production area, representing broad temporal and spatial heterogeneities, were the top two influential variables in predicting sweet corn yield. The year variable captured technological advancements and improved agronomic recommendations over time. Sweet corn yields increased substantially in the first decade of the 21st century and have stagnated since 2010. The lack of yield improvement in processing sweet corn in recent years is also supported by USDA–NASS data (USDA–NASS, 2021). However, the partial dependence plot (Fig. 3a) does not capture yield trends at the regional level, i.e., by production area.

Production area represents the spatial component in the US sweet corn data. Production areas in the Pacific Northwest were higher yielding than those in the Upper Midwest (Fig. S2). Historically, production areas in the inland Pacific Northwest have led in fruit and vegetable yields, aided mainly by abundant seasonal daylight, moderate nighttime temperatures, and ample irrigation water. Furthermore, irrigated production areas in the Upper Midwest yielded higher than rainfed production areas in the same region (Fig. S2). Previous studies have shown that the cooling effect of irrigation can negate extreme heat stress on field corn growth and development (Lobell et al., 2008; Siebert et al., 2017; Li et al., 2020).

One of the unique attributes of this observational dataset is the availability of hybrid information represented by seed sources. The Materials Transfer Agreement dictates the strict confidentiality of hybrid names. Nonetheless, seed source (genetics) is a significant driver of sweet corn yield heterogeneities across the production areas in the Upper Midwest and Pacific Northwest (Fig. S3). This can be attributed to differences in breeding programs (i.e., eating quality, yield improvement, host-disease resistance) and other unknown organizational dynamics within seed companies.

Our results show that season long precipitation is the most influential weather variable on sweet corn yields. The partial dependence plot for the effect of season long precipitation on crop yield integrates information from all production areas in the Upper Midwest and Pacific Northwest. The initial decline in crop yield with increasing precipitation reflects the shift in data from the high-yielding WA-Irrigated area to production areas in the Upper Midwest (Fig. S4). The yield decline observed near 500 mm precipitation underscores the importance of extreme weather conditions, such as floods, on sweet corn yield (Fig. 3b). These data provide an excellent opportunity to study regional effects of precipitation anomalies on sweet corn yield.

Average minimum (i.e., nighttime) temperature during the third crop growth interval was the top temperature-related variable in sweet corn yield prediction. The third crop growth interval (anthesis) represents the phenological stage most vulnerable to heat stress. Higher nighttime temperatures during the third crop growth interval are associated with reduced sweet corn yields (Fig. 3c). In C4 crop species, including sweet corn, yield losses from high nighttime temperatures are a result of complex eco-physiological processes involving increased plant respiration rates and, potentially, changes in crop evaporative demands (Atkin & Tjoelker, 2003; Sadok & Jagadish, 2020). Vyn (2010) noted that grain fill is a 24-h process, so both daytime and nighttime temperatures are important. High nighttime temperatures (> 21 °C) increase nighttime respiration, resulting in lower dry matter accumulation in plants. With high nighttime temperatures, more of the photoassimilates produced during the day are lost; less is available to fill developing kernels, thereby lowering grain yield (Thomison, 2005). The lower threshold for yield loss in sweet corn, 16 °C compared to > 21 °C reported in field corn, suggests sweet corn is more sensitive than field corn to yield losses from high nighttime temperatures.

Conclusions

This study used a rich spatiotemporal sweet corn dataset with field-level yield observations to evaluate the prediction capabilities of various machine learning models. Our results demonstrate that the RF model provides the most accurate yield predictions among the models tested. Variable importance plots quantified the importance of predictor variables, with year, production area, and seed source being the top three influential variables on sweet corn yield. Season long precipitation and average minimum (i.e., nighttime) temperature during anthesis were the two most important weather variables in sweet corn yield prediction. Improving model prediction accuracy may be possible with the inclusion of additional crop management information, such as fertilizer application rate, plant density, and amount of irrigation, where applicable.