Abstract
Accurate and timely prediction of crop yield plays a crucial role in the planning, management, and decision-making processes within the agricultural sector. In this investigation, utilizing area under irrigation (%) as an exogenous variable, we have attempted to assess the suitability of different hybrid models, namely ARIMAX (Autoregressive Integrated Moving Average with eXogenous Regressor)–TDNN (Time-Delay Neural Network), ARIMAX–NLSVR (Non-Linear Support Vector Regression), ARIMAX–WNN (Wavelet Neural Network), ARIMAX–CNN (Convolutional Neural Network), ARIMAX–RNN (Recurrent Neural Network) and ARIMAX–LSTM (Long Short Term Memory), as compared to their individual counterparts for yield forecasting of major Rabi crops in India. The accuracy of the ARIMA model has also been considered as a benchmark. Empirical outcomes reveal that the ARIMAX–LSTM hybrid modeling combination outperforms all other time series models in terms of root mean square error (RMSE) and mean absolute percentage error (MAPE) values. For these models, the average improvement in RMSE and MAPE values has been observed to be 10.41% and 12.28%, respectively, over all other competing models and 15.83% and 18.42%, respectively, over the benchmark ARIMA model. The incorporation of the area under irrigation (%) as an exogenous variable in the ARIMAX framework and the inbuilt capability of the LSTM model to process complex non-linear patterns have been observed to significantly enhance the accuracy of forecasting. The superior performance of the other hybrid models over their individual counterparts has also been evident. The results also suggest that the relative performance of individual models should not be generalized to their hybrid structures.
Introduction
Agriculture plays a crucial role in the economy and sustenance of societies across India. The dependence of an enormous number of farmers, intermediaries, private enterprises, and public sector bodies on agriculture makes this sector an integral component of the country’s development1. Farmers perceive agriculture as financially worthwhile only when a successful crop season yields abundant harvests that fetch favorable prices2. Therefore, anticipating crop yield is critical in the agricultural sector for planning, management, and decision-making processes. It enables planners and stakeholders to take proactive measures to ensure adequate food supply and distribution, thereby enhancing food security at local, regional, and national levels. With reliable and timely forecasts, governments can arrive at well-informed decisions about imports, exports, and food aid programs.
Among earlier attempts of yield prediction, crop weather models developed by Fisher3 and Baier4 are noteworthy. Of late, remote sensing (RS)-based approaches5,6 and different simulation techniques7,8 have gained traction due to the considerable benefits in tracking farm yield and operations over the cultivation period. Models for yield prediction based on the physiological characteristics of crops are also in vogue9,10. However, these approaches may not be suitable at the macro level due to economic and data availability constraints, whereas thanks to their relative ease of use, statistical models can be effectively deployed for forecasting tasks.
The Autoregressive Integrated Moving Average (ARIMA) model stands as a crucial and frequently utilized model for time series analysis. The ARIMAX (ARIMA with eXogenous Regressor) model, an enhanced version of the ARIMA model, has been leveraged as well for quantitative understanding of crop responses. It offers flexibility by including pertinent auxiliary variable(s) through a linear modeling structure. However, both ARIMA and ARIMAX models suffer from the presumption of linearity. Over the past thirty years, a significant amount of literature has emerged that focuses on the modeling of non-linear characteristics in time series data.
The growing body of research on machine learning (ML) algorithms, together with their many successful forecasting applications11,12,13,14, positions them as viable candidates for time series forecasting. In contrast to the conventional models, these ML approaches are data-driven, self-adaptive, non-linear, and non-parametric with few restrictive presumptions. Nevertheless, given the inherent mixture of linear and non-linear patterns observed in agricultural data, it is impractical to rely solely on a single linear or non-linear model to effectively capture all the features exhibited by these time series data. In such cases, hybrid modeling strategies, i.e., sequential implementation of linear and non-linear models, have consistently shown better outcomes15,16,17,18,19.
Several authors have put efforts into comparing traditional, ML, and hybrid methodologies in their studies. Kumar et al.20 compared ARIMA and Time-Delay Neural Network (TDNN) models for potato price forecasting in India. In their analysis, the TDNN model handled non-stationary, non-linear, and non-normal aspects of the datasets concurrently, outperforming the classical ARIMA model. Rathod et al.21 predicted banana production in Karnataka by employing hybrid models. These models integrated the ARIMA model with the TDNN and Non-Linear Support Vector Regression (NLSVR) models, respectively. The hybrid models exhibited superior forecasting accuracies in comparison to the individual models, as evidenced by empirical results. The findings of Rathod and Mishra22 were also in consonance, whereby similar hybrid combinations were compared to the individual and stepwise regression models. Supriya23 discovered that the sequential combination of the ARIMAX–Artificial Neural Network (ANN) model outperformed the component models in predicting the damage caused by the yellow stem borer in the Telangana State of India. Neog et al.18 used ARIMAX–ANN and ARIMAX–SVM hybrid combinations for forecasting autumn rice production to demonstrate their competitive advantages over the corresponding single linear and non-linear models.
The aforementioned facts clearly demonstrate that the hybrid models employed so far mostly consist of either TDNN or NLSVR for modeling the non-linear residuals. As many advanced neural network models, such as Wavelet Neural Network, Convolutional Neural Network, Recurrent Neural Network, Long Short Term Memory network, etc., have recently been included in the model builders’ arsenal, their performance needs to be examined in the hybrid structure. The literature review further reveals a scarce amount of research in the realm of agricultural yield forecasting, especially using hybrid techniques with exogenous variables.
Among the different factors of production, irrigation plays a crucial role in enhancing crop yields and ensuring food security on a global scale24,25. Countries like the United States, China, Russia, and Australia have made significant investments in irrigation infrastructure, establishing themselves as major wheat producers26,27,28. In the context of wheat production in India, irrigation has transformed agriculture notably in the ‘wheat belt’ states of Punjab, Haryana, and Uttar Pradesh29,30,31,32,33,34. In the case of sugarcane, which is also a water-intensive crop, consistent supply of water via irrigation networks, such as canals, wells, and sprinklers ensures the necessary moisture for sugarcane growth35,36,37,38. Similarly, the adoption of irrigation techniques has resulted in a substantial increase in groundnut yields in India39,40.
Against this backdrop, this study focuses on forecasting the yield of major Rabi crops in India, utilizing area under irrigation (%) as an exogenous variable. The core of the paper is a comparative assessment of different ARIMAX-based hybrid models and their individual counterparts for forecasting major Rabi crop yields in India. The performance of the ARIMA model serves as a benchmark. Yield forecasts for the next five years (2021–2025) have also been obtained from the best-performing model to assist policymakers and stakeholders in their decision-making processes.
Materials and methods
Data
Yearly data on yield (kg/hectare) and area under irrigation (%) of India for wheat, sugarcane, and groundnut have been collected and compiled from the various issues of ‘Agricultural Statistics at a Glance’ published by Economics & Statistics Division, Department of Agriculture and Farmers Welfare, Ministry of Agriculture and Farmers Welfare, Government of India (www.agricoop.nic.in and http://desagri.gov.in). The required time series data are available for the period 1950 to 2019 for wheat and sugarcane and 1952 to 2019 for groundnut. For each model, the last seven observations have been retained for testing and the remaining observations have been utilized for model building.
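The hold-out scheme described above can be sketched in Python as follows. This is a minimal illustration only: the function name `train_test_split_last` and the placeholder series of years are hypothetical and stand in for the actual yield data.

```python
def train_test_split_last(series, n_test=7):
    """Split a time-ordered series, keeping the last n_test points for testing."""
    if not 0 < n_test < len(series):
        raise ValueError("n_test must be between 1 and len(series) - 1")
    return series[:-n_test], series[-n_test:]


# Placeholder yearly index for 1950-2019 (70 observations), not real yield data.
years = list(range(1950, 2020))
train_years, test_years = train_test_split_last(years)
print(len(train_years), len(test_years), test_years[0])
```

Because the split is purely positional, the temporal order is preserved and no future observation leaks into model building.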
Time series models
Autoregressive integrated moving average (ARIMA) model
The most commonly used models to represent linear dynamics in time series literature are the ARIMA models41. We state that a univariate process \(\{{{\text{y}}}_{{\text{t}}}\}\) conforms to the ARIMA (p, d, q) model if it can be expressed as follows:
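Assuming the conventional backshift operator \(B\), with \(B{y}_{t}={y}_{t-1}\), and the Box-Jenkins sign convention for the moving average polynomial (both are assumptions made here for concreteness), a standard statement of the model is:

\[
\phi_{p}(B)\,(1-B)^{d}\,y_{t}=\theta_{q}(B)\,\varepsilon_{t} \tag{1}
\]

\[
\phi_{p}(B)=1-\phi_{1}B-\dots-\phi_{p}B^{p},\qquad \theta_{q}(B)=1-\theta_{1}B-\dots-\theta_{q}B^{q}
\]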
where p, d, and q represent the orders of autoregression, differencing, and moving average, respectively.
{\({\upvarepsilon }_{{\text{t}}}\)} is hypothesised to adhere to a standard white noise process, exhibiting a normal distribution with zero mean and a variance of \({\upsigma }^{2}\). In case \({{\text{y}}}_{{\text{t}}}\) does not undergo mean-adjustment, there is provision to append a constant term, denoted by μ, to the right side of Eq. (1). The ARIMA methodology can be segmented into three essential stages: identification, estimation, and diagnostic checking. In the identification stage, the parameters for the ARIMA model are provisionally chosen. These tentatively selected parameters are then quantified in the estimation stage. During the subsequent diagnostic checking stage, the model adequacy is evaluated thoroughly. If the model is deemed unsuitable, the entire three-step process resumes and continues until an apt ARIMA model for the given time series is obtained.
Autoregressive integrated moving average with exogenous regressor (ARIMAX) model
The ARIMAX model is a more sophisticated version of the ARIMA model. It has the ability to incorporate an external input variable42. It operates by assuming a form predicated on a given historical input vector \({{\text{x}}}_{{\text{t}}}\):
where \(\phi_{{\text{p}}} \left( {\text{B}} \right)\) and \({\uptheta }_{{\text{q}}} \left( {\text{B}} \right)\) assume the predefined form of the ARIMA model and \({\upnu }\left( {\text{B}} \right){\text{x}}_{{{\text{mt}}}}\) is defined as:
In the model, m denotes the number of exogenous input variables. Additionally, it is presumed that \(\{{\upvarepsilon }_{{\text{t}}}\}\) follows a white noise process with \({\text{N}}(0, {\upsigma }^{2})\).
Time-delay neural network (TDNN) model
Artificial neural networks, modelled after the human brain, are composed of mathematical functions known as artificial neurons (nodes). These neurons are grouped together to form a layer of processing elements. Typically, neural networks are structured with three layers: the input, hidden, and output layers.
An approach to forecast time series with neural networks involves incorporating dynamic characteristics into a static structure, like a multilayer perceptron. This approach offers an implicit functional depiction of time, which is useful in portraying the behavior of data that evolves over time43. A potential approach to incorporate short-term memory is using time delay as input44. TDNN exemplifies such a structure. The general expression for the final output \({{\text{y}}}_{{\text{t}}}\) of a TDNN model is expressed as follows:
where \({\upalpha }_{{\text{j}}}\) and \({\upbeta }_{{{\text{ij}}}}\) are the connection weights of the model. p and q denote the number of input and hidden nodes, respectively. The hidden layer activation function (g) takes the form of the Rectified Linear Unit (ReLU).
The identity function operates as the output layer activation function (f).
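As an illustration of the forward pass just described, the following Python sketch computes a one-step TDNN output from p lagged values, with ReLU in the hidden layer and the identity function at the output. It assumes the common single-hidden-layer form \(y_{t}=f\left(\alpha_{0}+\sum_{j=1}^{q}\alpha_{j}\,g\left(\beta_{0j}+\sum_{i=1}^{p}\beta_{ij}\,y_{t-i}\right)\right)\); the bias terms `alpha0` and `beta0`, the function names, and the toy weights are illustrative assumptions rather than values from this study.

```python
def relu(x):
    """Rectified Linear Unit: g(x) = max(0, x)."""
    return x if x > 0.0 else 0.0


def tdnn_forward(lags, alpha0, alpha, beta0, beta):
    """One-step TDNN output.

    lags  : [y_{t-1}, ..., y_{t-p}], the p tapped delays
    alpha : q hidden-to-output weights, alpha0 is the output bias
    beta0 : q hidden biases
    beta  : beta[i][j] is the weight from input i to hidden node j
    """
    p, q = len(lags), len(alpha)
    hidden = []
    for j in range(q):
        z = beta0[j] + sum(beta[i][j] * lags[i] for i in range(p))
        hidden.append(relu(z))
    # Identity activation at the output layer: f(x) = x.
    return alpha0 + sum(alpha[j] * hidden[j] for j in range(q))


# Toy example with p = 2 inputs and q = 2 hidden nodes (hypothetical weights).
lags = [1.0, 2.0]
beta = [[0.5, -1.0], [0.25, -1.0]]
beta0 = [0.0, 0.5]
alpha = [2.0, 3.0]
alpha0 = 0.1
y_hat = tdnn_forward(lags, alpha0, alpha, beta0, beta)
```

In this toy case the second hidden node receives a negative pre-activation and is switched off by ReLU, so only the first node contributes to the output.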
Non-linear support vector regression (NLSVR) model
NLSVR primarily converts the initial input space into a feature space with higher dimensions. Subsequently, a linear regression model is built within it, effectively representing non-linear regression in the original space21,45. In the context of a dataset represented by \({\text{Z}}={\{({{\text{x}}}_{{\text{i}}}, {{\text{y}}}_{{\text{i}}})\}}_{{\text{i}}=1}^{{\text{N}}}\), where the input vector \({{\text{x}}}_{{\text{i}}}\) belongs to the n-dimensional real space, \({{\text{y}}}_{{\text{i}}}\) represents the scalar output, and N denotes the size of Z, the general equation of NLSVR can be expressed as follows:
where w represents the weight vector, \(\phi \left( {\text{x}} \right)\) stands for the non-linear mapping function, b signifies the bias term, and superscript T indicates the transpose operation. The data is used to estimate the coefficients w and b through the minimization of a regularized risk function:
Equation (10) comprises two elements: the first is a regularization term, \(\frac{1}{2}{\Vert {\text{w}}\Vert }^{2}\), while the second is the empirical error, \(\frac{1}{{\text{N}}}\sum\nolimits_{{{\text{i}} = 1}}^{{\text{N}}} {{\text{L}}_{{\upvarepsilon }} \left( {{\text{y}}_{{\text{i}}} ,{\text{ f}}\left( {{\text{x}}_{{\text{i}}} } \right)} \right)}\). Minimizing the regularized risk function balances these two components, trading model complexity against training error and thereby guarding against both underfitting and overfitting.
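The two components of the regularized risk can be made concrete with a short Python sketch. It assumes the standard \(\varepsilon\)-insensitive loss \(L_{\varepsilon}(y, f(x))=\max(0, |y-f(x)|-\varepsilon)\) and weights the empirical term by 1/N as described above. Many formulations also multiply that term by a regularization constant C, which is exposed here as a hypothetical argument `c` defaulting to 1; the function names and toy numbers are likewise illustrative.

```python
def eps_insensitive_loss(y, y_hat, eps):
    """Zero inside the tolerance band of half-width eps, linear outside it."""
    return max(0.0, abs(y - y_hat) - eps)


def regularized_risk(w, y, y_hat, eps, c=1.0):
    """0.5 * ||w||^2 plus c times the mean eps-insensitive loss."""
    n = len(y)
    regularizer = 0.5 * sum(wi * wi for wi in w)
    empirical = sum(eps_insensitive_loss(a, b, eps) for a, b in zip(y, y_hat)) / n
    return regularizer + c * empirical


# Toy values: the first prediction falls inside the band, the second outside.
risk = regularized_risk(w=[1.0, -2.0], y=[3.0, 5.0], y_hat=[3.05, 4.0], eps=0.1)
```

Errors smaller than the margin of tolerance do not contribute to the risk at all, which is what makes the fitted function depend only on a subset of the observations.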
Wavelet neural network (WNN) model
In a neural network configuration, different wavelet functions can be effective for approximating a function or predicting output data. Because of this, wavelet functions can serve as activation functions within hidden neurons46. This concept leads to the formulation of the WNN. In a WNN, the activation functions are derived from an orthonormal wavelet basis. The term ‘wavelon’ is used to refer to the neurons in this context. The output of a wavelon with a single input is defined as follows:
The Morlet function47 has been used as the wavelet function for the WNN in this study, which is expressed as follows:
This wavelet is derived from a function that is proportional to both the cosine function and the normal probability density function.
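A commonly used real-valued Morlet mother wavelet is \(\psi(x)=\cos(1.75x)\,\exp(-x^{2}/2)\). The Python sketch below evaluates it and the output of a single-input wavelon; the constant 1.75 and the parameter names `dilation` and `translation` are assumptions made for illustration.

```python
import math


def morlet(x):
    """Real-valued Morlet wavelet: a cosine damped by a Gaussian envelope."""
    return math.cos(1.75 * x) * math.exp(-0.5 * x * x)


def wavelon(u, dilation, translation):
    """Output of a single-input wavelon: the wavelet applied to a shifted, scaled input."""
    if dilation <= 0:
        raise ValueError("dilation must be positive")
    return morlet((u - translation) / dilation)


peak = morlet(0.0)                  # the wavelet attains its maximum at the origin
centered = wavelon(2.0, 1.0, 2.0)   # an input equal to the translation maps to the peak
```

The Gaussian factor localizes the response around the translation, while the dilation controls how quickly it decays away from that point.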
Convolutional neural network (CNN) model
CNN is a network model introduced by Lecun et al.48, which has a neural connectivity pattern similar to the visual cortex of animals. CNN has found extensive usage in the realm of image and natural language processing49. Nonetheless, it can be effectively employed for time series forecasting. One of the key advantages of CNNs lies in their ability to perceive data locally and share weights, which can greatly lessen the number of parameters and thereby improve learning efficiency. CNN consists of two structural layers: the convolutional layer and the pooling layer50. Within the convolution layer, there are several convolutional kernels. The process of convolution can be represented as follows:
where tanh is the activation function. \({{\text{l}}}_{{\text{t}}}\), \({{\text{x}}}_{{\text{t}}}\), \({{\text{k}}}_{{\text{t}}}\), and \({{\text{b}}}_{{\text{t}}}\) represent the output value after convolution, the input vector, the weight of the convolution kernel, and the bias of the convolution kernel, respectively. The convolution operation extracts the main characteristics of the data, but it also expands the feature dimensions. To address this challenge as well as to lessen the load during training, a pooling layer is introduced before providing the final output51.
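To make the convolution and pooling steps concrete, the following Python sketch applies a single kernel to a window of inputs, passes the result through tanh, and then pools. The sliding-window ('valid') convolution, the choice of max pooling, and all names and numbers are illustrative assumptions rather than the configuration used in this study.

```python
import math


def conv1d_tanh(x, kernel, bias):
    """Slide the kernel over x (no padding) and apply tanh to each response.

    As in most deep learning libraries, the kernel is not flipped,
    so this is strictly a cross-correlation.
    """
    k = len(kernel)
    return [
        math.tanh(sum(kernel[j] * x[i + j] for j in range(k)) + bias)
        for i in range(len(x) - k + 1)
    ]


def max_pool1d(values, pool_size):
    """Non-overlapping max pooling; a trailing incomplete window is dropped."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]


x = [0.0, 1.0, 2.0, 3.0, 4.0]
features = conv1d_tanh(x, kernel=[1.0, -1.0], bias=0.0)
pooled = max_pool1d(features, pool_size=2)
```

The same two kernel weights are reused at every position, which is the weight sharing that keeps the parameter count low, and pooling halves the length of the feature map here.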
Recurrent neural network (RNN) model
RNNs pose greater technical complexity compared to feedforward networks, necessitating a solid grasp of dynamic recurrence mechanisms. A basic RNN can be thought of as a single-layer RNN where the activation is delayed and simultaneously looped back with the external input (or the output from a preceding layer). The conventional RNN can be described mathematically as52:
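In terms of the symbols used in this subsection, and assuming the usual additive form of the basic recurrence, this can be written as:

\[
h_{t}=\sigma_{t}\left(U\,h_{t-1}+W\,s_{t}+b\right),\qquad t=0,1,\dots,N
\]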
where t (0, 1, …, N) represents a discrete time point, N signifies the final time in a finite time period, \({{\text{s}}}_{{\text{t}}}\) refers to a vector with m dimensions representing external inputs at time t, and \({{\text{h}}}_{{\text{t}}}\) represents the n-dimensional output activation via \({\upsigma }_{{\text{t}}}\). This \({\upsigma }_{{\text{t}}}\) may vary over time and can exhibit non-linear behavior. The non-indexed parameters, to be set via training, are the \(\mathrm{n }\times \mathrm{ n}\) matrix U, the \(\mathrm{n }\times \mathrm{ m}\) matrix W, and the \(\mathrm{n }\times 1\) vector b.
Long short term memory (LSTM) model
Hochreiter and Schmidhuber53 proposed the LSTM neural network in 1997 to deal with long-term data dependencies. The design of LSTM allows it to recall information for an extended length of time while resolving the problem of vanishing gradient54,55, which is the main lacuna of the RNN model56. The LSTM model consists of three memory modules, namely forget gate \(({{\text{f}}}_{{\text{t}}})\), input gate \(({{\text{i}}}_{{\text{t}}})\), and output gate \(({{\text{o}}}_{{\text{t}}})\). The primary roles of these three gates are to retain critical information while eliminating unnecessary information from the cell state. The pivotal element in LSTM is the cell state \(({{\text{C}}}_{{\text{t}}})\), which operates concurrently throughout the entire recurrent chain with a few minor interactions. The structural representation of the LSTM is graphically provided in Fig. 1.
To commence the information processing with the LSTM model, the first step involves removing extraneous information from \({{\text{C}}}_{{\text{t}}}\). The forget gate offers this by employing a sigmoid function.
where \({{\text{W}}}_{{\text{f}}}\) and \({{\text{b}}}_{{\text{f}}}\) are the weight and bias of the forget gate, respectively, \({{\text{y}}}_{{\text{t}}}\) is the input value of the current time and \({{\text{h}}}_{{\text{t}}-1}\) is the output value of the prior unit.
The cell state is then updated with new or pertinent information. This is accomplished by the use of another gate with a sigmoid function, known as the input gate, and a tanh layer.
where \({{\text{W}}}_{{\text{i}}}\), \({{\text{b}}}_{{\text{i}}}\) and \({{\text{W}}}_{{\text{c}}}\), \({{\text{b}}}_{{\text{c}}}\) are the weight and bias of the input gate and the candidate input, respectively.
In the next step, the current cell state is updated as follows:
Then, the output gate takes \({\text{h}}_{{{\text{t}} - 1{ }}}\) and \({\text{y}}_{{\text{t}}}\) as input values, and its output is calculated using the formula:
where \({{\text{W}}}_{{\text{o}}}\) and \({{\text{b}}}_{{\text{o}}}\) are the weight and bias of the output gate, respectively.
Finally, the output of the LSTM model is computed as follows:
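The gate computations described in this subsection can be traced in a minimal Python sketch of a single LSTM step. It assumes the standard formulation: sigmoid gates acting on the previous output \(h_{t-1}\) and the current input \(y_{t}\), a tanh candidate, the update \(C_{t}=f_{t}C_{t-1}+i_{t}\tilde{C}_{t}\), and the output \(h_{t}=o_{t}\tanh(C_{t})\). For brevity it uses a single hidden unit with a scalar input; the parameter layout and the toy values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(y_t, h_prev, c_prev, params):
    """One LSTM step for a single hidden unit.

    params maps each gate name ("f", "i", "c", "o") to a tuple
    (weight on h_prev, weight on y_t, bias).
    """
    def pre_activation(name):
        w_h, w_y, b = params[name]
        return w_h * h_prev + w_y * y_t + b

    f_t = sigmoid(pre_activation("f"))        # forget gate
    i_t = sigmoid(pre_activation("i"))        # input gate
    c_tilde = math.tanh(pre_activation("c"))  # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # updated cell state
    o_t = sigmoid(pre_activation("o"))        # output gate
    h_t = o_t * math.tanh(c_t)                # output of the unit
    return h_t, c_t


# With all weights and biases at zero, every gate equals 0.5 and the candidate is 0.
zero = {name: (0.0, 0.0, 0.0) for name in "fico"}
h_t, c_t = lstm_step(y_t=1.0, h_prev=0.0, c_prev=2.0, params=zero)
```

In the zero-parameter case the cell state is simply halved at each step, which shows how the forget gate alone controls how much past information is carried forward.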
Hybrid models
Hybrid time series models leverage various modeling approaches to enhance the accuracy and robustness of forecasts. These models consider the time series \({{\text{y}}}_{{\text{t}}}\) as a blend of both linear and non-linear elements.
where \({{\text{L}}}_{{\text{t}}}\) and \({{\text{N}}}_{{\text{t}}}\) represent the linear and non-linear components, respectively.
The operational premise of the hybridization approach57 begins with fitting a linear model to the data and obtaining the corresponding forecast \((\widehat{{{\text{L}}}_{{\text{t}}}})\). The subsequent stage entails acquiring the residuals (\({{\text{e}}}_{{\text{t}}}\)) of the linear model and checking for the existence of non-linear patterns in its structure.
Once the residuals validate the presence of non-linearity, they are subsequently fed into an appropriate non-linear model. After obtaining forecasts \((\widehat{{{\text{N}}}_{{\text{t}}}})\) for the non-linear component, these are combined with the linear forecasts to generate the aggregate forecasts.
In this investigation, we have used the ARIMAX model to capture the linear component, whereby the residuals are modelled separately by the TDNN, NLSVR, WNN, CNN, RNN, and LSTM models to examine the performance of different combinations. Figure 2 provides the schematic representation of the hybridization technique.
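The three steps of the hybridization scheme (linear fit, residual modelling, recombination) can be sketched generically in Python. The stand-in models below, a least-squares trend line in place of ARIMAX and a mean of recent residuals in place of a neural network, are deliberately simple placeholders and are not the models fitted in this study; all function names are hypothetical.

```python
def fit_trend(y):
    """Placeholder linear model: least-squares line on the time index."""
    n = len(y)
    t_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    fitted = [intercept + slope * t for t in range(n)]

    def forecast(horizon):
        return [intercept + slope * (n + k) for k in range(horizon)]

    return fitted, forecast


def fit_residual_mean(residuals, window=3):
    """Placeholder residual model: carry forward the mean of recent residuals."""
    level = sum(residuals[-window:]) / window
    return lambda horizon: [level] * horizon


def hybrid_forecast(y, horizon):
    # Step 1: fit the linear model and obtain its fitted values and forecasts.
    fitted, linear_forecast = fit_trend(y)
    # Step 2: residuals e_t = y_t - L_hat_t are passed to the second-stage model
    # (in practice only after a test confirms non-linear structure in them).
    residuals = [obs - fit for obs, fit in zip(y, fitted)]
    residual_forecast = fit_residual_mean(residuals)
    # Step 3: aggregate forecast = linear forecast + residual forecast.
    l_hat = linear_forecast(horizon)
    n_hat = residual_forecast(horizon)
    return [l + n for l, n in zip(l_hat, n_hat)]


series = [1.0, 2.0, 3.0, 4.0, 5.0]
forecast = hybrid_forecast(series, horizon=2)
```

Swapping either placeholder for a richer model leaves `hybrid_forecast` unchanged, which is why the same first-stage residuals can be reused across several second-stage models.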
Assessment of forecasting accuracy
To assess the forecasting accuracy of the time series models under investigation, two performance measures, namely the root mean square error (RMSE) and the mean absolute percentage error (MAPE), are employed. The model exhibiting the lowest RMSE and MAPE values is deemed the best.
where \({{\text{y}}}_{{\text{t}}}\) and \(\widehat{{{\text{y}}}_{{\text{t}}}}\) denote the actual and predicted values of the tth observation of the test data, respectively and n is the size of the test data set.
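For reference, both measures can be computed as in the following Python sketch, assuming the usual definitions \(\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{t=1}^{n}(y_{t}-\widehat{y}_{t})^{2}}\) and \(\mathrm{MAPE}=\frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_{t}-\widehat{y}_{t}}{y_{t}}\right|\). The toy numbers are purely illustrative and are not results from this study.

```python
import math


def rmse(actual, predicted):
    """Root mean square error over paired observations."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
rmse_value = rmse(actual, predicted)
mape_value = mape(actual, predicted)
```

RMSE is expressed in the units of the series (here kg/hectare for yield), whereas MAPE is scale-free, so the two measures together allow comparison both within and across crops.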
Results
Summary statistics
Table 1 displays summary statistics of the data series utilised in the investigation. Sugarcane has exhibited the highest mean irrigated area (%), with wheat and groundnut trailing behind. A high coefficient of variation value signifies considerable volatility in these series.
Results of the augmented Dickey-Fuller (ADF) test
The ADF test58,59 has been utilised to ascertain the order of differencing and the results are presented in Table 2. All the yield series have exhibited non-stationary behavior at level series and stationary behavior at the first difference series.
Fitting of the ARIMA models
In the context of the ARIMA model, the autocorrelation function (ACF) and partial autocorrelation function (PACF) provide key insights into the potential order of the model. The optimal model selection is based on the minimum values of the Akaike information criterion (AIC) and Bayesian information criterion (BIC), as well as the RMSE and MAPE. The parameter estimates of the chosen ARIMA models, along with their significance levels, are provided in Table 3.
Fitting of the ARIMAX models
The selection of a suitable exogenous variable is crucial for the ARIMAX model-building procedure. The significance of the correlation coefficient between yield and area under irrigation (%) in each case, as reported in Table 4, suggests that the ARIMAX models may outperform the traditional ARIMA models. As in the ARIMA model-building process, the optimal model has been chosen based on the lowest AIC, BIC, RMSE, and MAPE values. Table 5 displays the specifications of the selected ARIMAX models.
Results of the Broock–Dechert–Scheinkman (BDS) test
Before proceeding to non-linear or hybrid modeling strategies, it is required to examine the series for the presence of non-linear features. To assess non-linearity, the BDS test60 has been implemented. Table 6 provides the outcomes of the BDS test. For all three cases, a strong rejection of linearity has been observed. It implies that the non-linear models can effectively be implemented in these series.
Fitting of the TDNN models
In this study, we have searched for the optimal time-delay neural network with a single hidden layer. Experimentation has been carried out to identify the number of tapped delays and hidden nodes. We have varied the number of input nodes from 1 to 6 and the number of hidden nodes from 1 to 10 for both the original and ARIMAX residual series. The training of networks has been accomplished by utilizing the Levenberg–Marquardt back-propagation algorithm. Table 7 contains the specifications of the chosen TDNN models.
Fitting of the NLSVR models
An essential step in NLSVR modeling is the selection of optimal hyper-parameters. The input lags, kernel function, regularization parameter, kernel width, and margin of tolerance significantly influence the NLSVR performance. In this study, we have employed the widely adopted radial basis function (RBF) as the kernel function. We have constructed NLSVR models for both the original and residual series based on the specifications outlined in Table 8.
Fitting of the WNN models
For each crop, wavelet neural network models have been chosen based on their forecasting performance at various numbers of input lags (from 1 to 6) and hidden nodes (from 1 to 10). Table 9 presents the specifications of the best performing WNN models for the original and residual series, respectively. Similar to TDNN, the Levenberg–Marquardt back-propagation algorithm has been used for training purposes.
Fitting of the CNN models
The performance of the CNN model crucially relies on the optimal selection of hyper-parameters. The set of hyper-parameters tuned for the training of the CNN model consists of the number of input nodes, number of filters and kernel size at the convolution layer, and the pool size at the pooling layer. Based on previous literature, we have used ReLU as an activation function. The best-performing CNN models have been constructed based on the specifications detailed in Table 10.
Fitting of the RNN models
RNNs are well suited to predicting the sequential, non-linear behavior of a series. Each series has already been duly assessed for the existence of non-linearity. Subsequently, by varying the number of nodes from 1 to 6 at the input layer and from 1 to 10 at the hidden layer, the best configuration of hyper-parameters has been acquired. Specifications of the selected RNN models for both original and residual series are provided in Table 11.
Fitting of the LSTM models
The structure of the LSTM models studied in this work comprises an input layer, a hidden layer with LSTM cells as hidden nodes, and an output layer with a single output node. Because of its flexibility to apply multiple learning rates for different parameters, adaptive moment estimation (Adam), a prominent variant of the stochastic gradient descent (SGD) method, has been used for loss function optimization61. To obtain the optimal set, multiple configurations of the LSTM model have been explored by varying the different hyper-parameters. The LSTM tuning involves considering hyper-parameters such as the number of input and hidden nodes, batch size, and the number of epochs. These hyper-parameters serve not only to govern the model’s architecture and topology, but also to optimize key parameters like biases and weights. Several automated approaches are available in the literature that can potentially be used for hyper-parameter tuning. Out of these techniques, we have used the grid search method, which examines all the possible combinations of hyper-parameters. After trying various combinations, substantial effects have been observed for altering the number of input and hidden nodes, whereas the number of epochs and batch size have shown better results when set to 300 and 1, respectively. The trade-off between the number of (trainable) parameters and the error metrics is also considered for giving due weightage to parsimony. The outcomes of the eventual optimised configuration of the input and hidden nodes are depicted in Table 12.
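A grid search of this kind can be sketched in Python as an exhaustive loop over candidate settings. The scoring function below is a hypothetical stand-in for training a model and returning its validation error; in practice it would fit an LSTM for each combination. The ranges 1 to 6 (input nodes) and 1 to 10 (hidden nodes) mirror those reported for the other networks in this study and are only assumed here for the LSTM.

```python
from itertools import product


def grid_search(score, grid):
    """Evaluate every combination in the grid and return the lowest-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score


def toy_score(params):
    # Hypothetical error surface with its minimum at 3 input and 4 hidden nodes.
    return (params["input_nodes"] - 3) ** 2 + (params["hidden_nodes"] - 4) ** 2


grid = {"input_nodes": range(1, 7), "hidden_nodes": range(1, 11)}
best_params, best_score = grid_search(toy_score, grid)
```

With 6 × 10 = 60 combinations the exhaustive search is cheap, but the cost grows multiplicatively with each additional hyper-parameter, which is one reason to fix the number of epochs and the batch size once good values have been found.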
Discussion
The comparative assessment of out-of-sample accuracy for different time series models under consideration is given in Table 13. The ARIMAX-LSTM hybrid models appear to have demonstrated superior performance in forecasting yield series, faring better than all the other models. This implies that the forecasted series obtained through the ARIMAX-LSTM framework tends to align more closely with the actual yield series values. The plots of the original series and predicted series by the best performing ARIMAX-LSTM model are shown in Figs. 3, 4, and 5, respectively. The plots clearly show that the ARIMAX-LSTM hybrid models have effectively captured the trends and trajectories of yield movements. Yield forecasts for the next five years (2021–2025) by these models have also been provided in Table 14. Figure 6 displays a radar plot depicting the error metrics (RMSE and MAPE values) for various models under study.
It is also noteworthy that, despite the presumption of linearity, the careful selection of area under irrigation (%) as an auxiliary variable has enabled the ARIMAX model to outperform the univariate non-linear models. As we confront evolving climatic outlooks, the importance of irrigation will become even more pronounced. By reducing reliance on precipitation and maintaining a balanced level of moisture in the soil, irrigated farming can show more resilience to shifts in weather patterns. In turn, the superiority of the ARIMAX-LSTM models over the other ARIMAX-based hybrid models can be attributed to the inbuilt ability of the LSTM model to process sequential data effectively. Its unique structural makeup has helped to provide a more comprehensive view of the complex time series data, encompassing both short-term and long-term insights. The competitive advantages of using LSTM in capturing residual patterns were also evidenced in the studies of Manowska et al.62 for natural gas consumption forecasting, Wu et al.63 for precipitation amount and drought forecasting, Dave et al.64 for export forecasting, Khozani et al.65 for groundwater level forecasting, etc.
Outcomes emanated from this investigation also suggest the superior performance of all the hybrid models over their individual counterparts. As real-world time series data are subjected to shifts, abrupt changes, and evolving patterns, different forecasting techniques excel in their respective domains66,67. Hybrid models can leverage the strengths of various methods to adapt to changing conditions, making them more robust in scenarios where the underlying data-generating process may vary over time. This adaptability is especially valuable for dealing with data from domains where external factors can significantly influence the time series, such as in this case, yield forecasting of Rabi crops68,69.
In addition, it has been noticed that the ranking of performance among the individual models is distinct from their hybrid counterparts. To illustrate, the performance hierarchy of the non-linear models is as follows: LSTM > RNN > WNN > TDNN > CNN > NLSVR. Conversely, within the hybrid framework, a different performance hierarchy has been observed: LSTM > WNN > CNN > RNN > TDNN > NLSVR. It clearly indicates that the comparative performance of individual models cannot be generalized for the comparison of their hybrid structures, emphasizing the importance of data-driven forecasting exercises.
Conclusions
In this study, we have employed different ARIMAX-based hybrid models and compared their performances with their individual counterparts as well as with the ARIMA model for yield forecasting of major Rabi crops in India. It has been observed that the ARIMAX–LSTM modeling combination has provided better forecasts than other time series models, as evidenced by various accuracy measures. For these models, an average improvement of RMSE and MAPE values has been observed to be 10.41% and 12.28%, respectively over all other competing models and 15.83% and 18.42%, respectively over the benchmark ARIMA model. It can also be inferred that the inclusion of area under irrigation (%) as an exogenous variable in the ARIMAX framework and the inbuilt ability of the LSTM model to process complex non-linear patterns have greatly improved the forecasting accuracy. The performance supremacy of other hybrid models over their individual counterparts has also been evident. It is also suggested to avoid any performance generalization of individual models for their hybrid structures. Future works are expected to explore the performance of other hybrid structures such as ARIMA–NARX, ARIMA–NLSVRX, ARIMA–LSTMX, etc. for agricultural yield forecasting.
Data availability
The data that support the findings of this study are available on request from the first author: Pramit Pandit.
References
Guntukula, R. Assessing the impact of climate change on Indian agriculture: Evidence from major crop yields. J. Public Aff. 20(1), e2040 (2020).
Dharmaraja, S., Jain, V., Anjoy, P. & Chandra, H. Empirical analysis for crop yield forecasting in India. Agric. Res. 9, 132–138 (2020).
Fisher, R. A. The influence of rainfall on the yield of wheat at Rothamsted. Philos. Trans. R. Soc. Lond. B Biol. Sci. 213(402–410), 89–142 (1925).
Baier, W. Crop Weather Models and Their Use in Yield Assessments. WMO Technical Note No. 151 (World Meteorological Organization, 1977).
Khaki, S., Pham, H. & Wang, L. Simultaneous corn and soybean yield prediction from remote sensing data using deep transfer learning. Sci. Rep. 11(1), 11132 (2021).
Ma, Y., Zhang, Z., Kang, Y. & Özdoğan, M. Corn yield prediction and uncertainty analysis based on remotely sensed variables using a Bayesian neural network approach. Remote Sens. Environ. 259, 112408 (2021).
Basso, B. & Liu, L. Seasonal crop yield forecast: Methods, applications, and accuracies. Adv. Agron. 154, 201–255 (2019).
Feng, P. et al. Dynamic wheat yield forecasts are improved by a hybrid approach using a biophysical model and machine learning technique. Agric. For. Meteorol. 285, 107922 (2020).
Lin, D., Wei, R. & Xu, L. An integrated yield prediction model for greenhouse tomato. Agronomy 9(12), 873 (2019).
Bian, C. et al. Prediction of field-scale wheat yield using machine learning method and multi-spectral UAV data. Remote Sens. 14(6), 1474 (2022).
Demolli, H., Dokuz, A. S., Ecemis, A. & Gokcek, M. Wind power forecasting based on daily wind speed data using machine learning algorithms. Energy Convers. Manag. 198, 111823 (2019).
Moein, M. M. et al. Predictive models for concrete properties using machine learning and deep learning approaches: A review. J. Build. Eng. 63, 105444 (2023).
Bai, F. J. J. S., Shanmugaiah, K., Sonthalia, A., Devarajan, Y. & Varuvel, E. G. Application of machine learning algorithms for predicting the engine characteristics of a wheat germ oil–Hydrogen fuelled dual fuel engine. Int. J. Hydrog. Energy 48(60), 23308–23322 (2023).
Barrera-Animas, A. Y. et al. Rainfall prediction: A comparative analysis of modern machine learning algorithms for time-series forecasting. Mach. Learn. Appl. 7, 100204 (2022).
Naveena, K., Singh, S., Rathod, S. & Singh, A. Hybrid ARIMA-ANN modelling for forecasting the price of Robusta coffee in India. Int. J. Curr. Microbiol. Appl. Sci. 6(7), 1721–1726 (2017).
Rahim, N. F., Othman, M. & Sokkalingam, R. A comparative review on various methods of forecasting crude palm oil prices, in Journal of Physics: Conference Series (2018).
Purohit, S. K., Panigrahi, S., Sethy, P. K. & Behera, S. K. Time series forecasting of price of agricultural products using hybrid methods. Appl. Artif. Intell. 35(15), 1388–1406 (2021).
Neog, B., Gogoi, B. & Patowary, A. N. Development of hybrid time series models for forecasting autumn rice using ARIMAX-ANN and ARIMAX-SVM. Ann. For. Res. 65(1), 9119–9133 (2022).
Chitikela, G. et al. Artificial-intelligence-based time-series intervention models to assess the impact of the COVID-19 pandemic on tomato supply and prices in Hyderabad, India. Agronomy 11(9), 1878 (2021).
Kumar, S. et al. Performance comparison of ARIMA and time delay neural network for forecasting of potato prices in India. Agric. Econ. Res. Rev. 35, 119–134 (2022).
Rathod, S., Mishra, G. C. & Singh, K. N. Hybrid time series models for forecasting banana production in Karnataka State, India. J. Indian Soc. Agric. Stat. 71(3), 193–200 (2017).
Rathod, S. & Mishra, G. C. Statistical models for forecasting mango and banana yield of Karnataka, India. J. Agric. Sci. Technol. 20(4), 803–816 (2018).
Supriya, K. Comparative study of ARIMAX-ANN hybrid model with ANN and ARIMAX models to forecast the damage caused by yellow stem borer (Scirpophaga incertulas) in Telangana state. Int. J. Curr. Microbiol. Appl. Sci. 10(01), 3421–3428 (2021).
Kang, S. et al. Improving agricultural water productivity to ensure food security in China under changing environment: From research to practice. Agric. Water Manag. 179, 5–17 (2017).
Wang, X. Managing land carrying capacity: Key to achieving sustainable production systems for food security. Land 11(4), 484 (2022).
Li, Y. et al. An analysis of China’s grain production: Looking back and looking forward. Food Energy Secur. 3(1), 19–32 (2014).
Tanaka, A. et al. Adaptation pathways of global wheat production: Importance of strategic adaptation to climate change. Sci. Rep. 5(1), 14312 (2015).
FAOSTAT. https://www.fao.org/faostat/en/#data (2023).
Kannan, E., Bathla, S. & Das, G. K. Irrigation governance and the performance of the public irrigation system across states in India. Agric. Econ. Res. Rev. 32(1), 27–41 (2019).
Zaveri, E. & Lobell, D. The role of irrigation in changing wheat yields and heat sensitivity in India. Nat. Commun. 10(1), 4144 (2019).
Baranski, M. & Ollenburger, M. How to improve the social benefits of agricultural research. Issues Sci. Technol. 36(3), 47–53 (2020).
Anantha, K. H. & Wani, S. P. Evaluation of cropping activities in the Adarsha watershed project, southern India. Food Secur. 8, 885–897 (2016).
Qaim, M. Role of new plant breeding technologies for food security and sustainable agricultural development. Appl. Econ. Perspect. Policy 42(2), 129–150 (2020).
Ajl, M. & Sharma, D. The green revolution and transversal countermovements: Recovering alternative agronomic imaginaries in Tunisia and India. Rev. Can. Etudes. Dev. 43(3), 418–438 (2022).
Gunarathna, M. H. J. P. et al. Optimized subsurface irrigation system: The future of sugarcane irrigation. Water 10(3), 314 (2018).
Khumla, N. et al. Sugarcane breeding, germplasm development and supporting genetics research in Thailand. Sugar Tech. 24(1), 193–209 (2022).
Press Information Bureau, Government of India. https://www.pib.gov.in/PressReleseDetailm.aspx?PRID=1865320 (2023).
Solomon, S. Sugarcane production and development of sugar industry in India. Sugar Tech. 18(6), 588–602 (2016).
Namara, R. E., Nagar, R. K. & Upadhyay, B. Economics, adoption determinants, and impacts of micro-irrigation technologies: Empirical results from India. Irrig. Sci. 25(3), 283–297 (2007).
Rao, C. S. et al. Potential and challenges of rainfed farming in India. Adv. Agron. 133, 113–181 (2015).
Box, G. E., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. Time Series Analysis: Forecasting and Control (Wiley, 2015).
Alharbi, F. R. & Csala, D. A seasonal autoregressive integrated moving average with exogenous factors (SARIMAX) forecasting model-based time series approach. Inventions 7(4), 94 (2022).
Haykin, S. Neural Networks: A Comprehensive Foundation (Prentice Hall, 1999).
Xi, Z., Wang, R., Fu, Y. & Mi, C. Accurate and reliable state of charge estimation of lithium ion batteries using time-delayed recurrent neural networks through the identification of overexcited neurons. Appl. Energy 305, 117962 (2022).
Vapnik, V., Golowich, S. & Smola, A. Support vector method for function approximation, regression estimation, and signal processing. In Advances in Neural Information Processing Systems (eds Mozer, M. et al.) 281–287 (MIT Press, 1997).
Sharma, V., Yang, D., Walsh, W. & Reindl, T. Short term solar irradiance forecasting using a mixed wavelet neural network. Renew. Energy 90, 481–492 (2016).
Chitsaz, H., Amjady, N. & Zareipour, H. Wind power forecast using wavelet neural network trained by improved Clonal selection algorithm. Energy Convers. Manag. 89, 588–598 (2015).
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998).
Kim, B. S. & Kim, T. G. Cooperation of simulation and data model for performance analysis of complex systems. Int. J. Simul. Model. 18(4), 608–619 (2019).
Lu, W., Li, J., Li, Y., Sun, A. & Wang, J. A CNN-LSTM-based model to forecast stock prices. Complexity 2020, 1–10 (2020).
Widiputra, H., Mailangkay, A. & Gautama, E. Multivariate CNN-LSTM model for multiple parallel financial time-series prediction. Complexity 2021, 1–14 (2021).
Salem, F. M. Recurrent Neural Networks: From Simple to Gated Architectures (Springer, 2022).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997).
Monge, J., Ribeiro, G., Raimundo, A., Postolache, O. & Santos, J. AI-based smart sensing and AR for gait rehabilitation assessment. Information 14(7), 355 (2023).
Sheng, Z., An, Z., Wang, H., Chen, G. & Tian, K. Residual LSTM based short-term load forecasting. Appl. Soft Comput. 144, 110461 (2023).
Chaturvedi, S., Rajasekar, E., Natarajan, S. & McCullen, N. A comparative assessment of SARIMA, LSTM, RNN and FB Prophet models to forecast total and peak monthly energy demand for India. Energy Policy 168, 113097 (2022).
Zhang, G. P. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 50, 159–175 (2003).
Dickey, D. A. & Fuller, W. A. Distribution of the estimators for autoregressive time series with a unit root. J. Am. Stat. Assoc. 74(366a), 427–431 (1979).
Dickey, D. A. & Fuller, W. A. Likelihood ratio statistics for autoregressive time series with a unit root. Econometrica 49, 1057–1072 (1981).
Broock, W. A., Scheinkman, J. A., Dechert, W. D. & LeBaron, B. A test for independence based on the correlation dimension. Econom. Rev. 15(3), 197–235 (1996).
Nanni, L., Manfè, A., Maguolo, G., Lumini, A. & Brahnam, S. High performing ensemble of convolutional neural networks for insect pest image detection. Ecol. Inform. 67, 101515 (2022).
Manowska, A., Rybak, A., Dylong, A. & Pielot, J. Forecasting of natural gas consumption in Poland based on ARIMA-LSTM hybrid model. Energies 14(24), 8597 (2021).
Wu, X. et al. The development of a hybrid wavelet-ARIMA-LSTM model for precipitation amounts and drought analysis. Atmosphere 12(1), 74 (2021).
Dave, E., Leonardo, A., Jeanice, M. & Hanafiah, N. Forecasting Indonesia exports using a hybrid model ARIMA-LSTM. Procedia Comput. Sci. 179, 480–487 (2021).
Khozani, Z. S., Banadkooki, F. B., Ehteram, M., Ahmed, A. N. & El-Shafie, A. Combining autoregressive integrated moving average with long short-term memory neural network and optimisation algorithms for predicting ground water level. J. Clean. Prod. 348, 131224 (2022).
Hamrani, A., Akbarzadeh, A. & Madramootoo, C. A. Machine learning for predicting greenhouse gas emissions from agricultural soils. Sci. Total Environ. 741, 140338 (2020).
Mahto, A. K., Alam, M. A., Biswas, R., Ahmad, J. & Alam, S. I. Short-term forecasting of agriculture commodities in context of Indian market for sustainable agriculture by using the artificial neural network. J. Food Qual. 2021, 1–13 (2021).
Xu, D., Zhang, Q., Ding, Y. & Zhang, D. Application of a hybrid ARIMA-LSTM model based on the SPEI for drought forecasting. Environ. Sci. Pollut. Res. 29(3), 4128–4144 (2022).
Xavier, A. L., Fernandes, B. J. & De Oliveira, J. F. A hybrid swarm-based system for commodity price forecasting during the Covid-19 pandemic. IEEE Access 11, 74379–74387 (2023).
Acknowledgements
The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through Research Groups Project under grant number RGP 1/440/44. The authors are also grateful to the editor and anonymous reviewers for their valuable suggestions and comments, which helped to improve the manuscript to a great extent.
Funding
The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through Large Groups Project under grant number RGP 1/440/44.
Author information
Contributions
Conceptualization, P.P. and A.S.; methodology, P.P. and B.G.; software, P.P., A.S. and B.G.; validation, B.G., M.P. and P.D.; formal analysis, P.P., B.G. and M.P.; investigation, P.P. and A.S.; data curation, A.S.; writing – original draft preparation, P.P. and M.P.; writing – review and editing, A.S., P.D., S.A., J.M., H.A. and H.G.A.; visualization, A.S. and P.D.; supervision, P.P. and A.S. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pandit, P., Sagar, A., Ghose, B. et al. Hybrid time series models with exogenous variable for improved yield forecasting of major Rabi crops in India. Sci Rep 13, 22240 (2023). https://doi.org/10.1038/s41598-023-49544-w