1 Introduction

Agriculture is an area of major social interest because farmers produce a sizable proportion of the world's food supply. Due to population growth and food shortages, hunger persists in many nations today, and increasing food production is an attractive strategy for eradicating it. The United Nations has set 2030 as the target date for two of its most important goals: increasing food security and reducing hunger. Policymakers need accurate forecasts to make informed decisions about food exports and imports, and farmers and growers can use yield predictions to better plan their budgets and operations. However, accurate prediction of agricultural yields is notoriously difficult because crop success depends on many interacting factors, including the weather, the soil, the terrain, the presence of pests, the water supply, the plant's genetic makeup, crop management, and more [1, 2].

Researchers are making more precise predictions with the help of data-driven models [3]. Machine Learning (ML) methods are essential for enhancing the accuracy of such data-driven models [4]. Machine learning enables computers to gain new abilities without being explicitly programmed. These procedures can model agricultural systems, whether linear or nonlinear, and thus provide exceptional predictive power [5]. ML-based agronomic models acquire their strategies through learning: they require substantial training before they can be applied successfully, and once the training phase is complete, the fitted model is validated on unseen data.

While ML and its realization have made great strides, there are still limits to what can be accomplished when relying solely on the data. The accuracy and limitations of ML predictions are influenced by the model parameters, the data quality, and the relationship between the target and input variables in the obtained datasets [6]. Incomplete or inaccurate data, biases, outliers, and noise can all severely weaken the predictive ability of models [7]. Multivariate regression, random forests, regression trees, neural networks, and association rules are among the ML models used in numerous studies to predict agricultural yields. These models treat the output, crop yield, as a function of many input factors, such as soil conditions and weather components.

ML has developed into a powerful tool for obtaining insights and patterns from data, applied to various applications and domains, including agriculture. ML models can be broadly split into unsupervised and supervised learning models. Supervised learning, where models are trained on labeled data to make predictions or decisions, offers a structured approach to solving predictive tasks. In contrast, unsupervised learning explores patterns and structures within unlabeled data, often uncovering hidden relationships or clusters. This paper focuses on supervised learning models such as SVR, kNN, and RF to evaluate the proposed framework. Supervised learning allows us to leverage historical datasets containing labeled information about environmental factors and corresponding crop yields. This labeled data provides a clear signal for the model to learn from, enabling it to capture complex relationships and make accurate predictions. By utilizing supervised learning, we aim to develop robust models that effectively forecast crop yields, thereby informing optimized agricultural practices and contributing to food security efforts.

More recently, the Support Vector Machine (SVM) has attracted the attention of researchers, practitioners, and statisticians due to its theoretical and practical advantages, and it has been shown to perform well in both classification and regression. SVM was originally applied to classification problems [22] and was later extended to nonlinear regression problems as Support Vector Regression (SVR) [22]. SVR has two distinct advantages. First, it guarantees convergence to optimal solutions by solving a quadratic program with linear constraints on the training data. Second, it is computationally efficient at modeling nonlinear relationships through kernel mapping. However, the computational efficiency of SVR depends on a couple of hyperparameters and factors that directly or indirectly affect finding the optimal solutions. Ordinarily, an exhaustive grid search is used to explore all the hyperparameter combinations, and cross-validation is conducted to evaluate the prediction capability of SVR. Despite its striking features, SVR has limitations [26]; the most important is that it cannot perform feature selection on its own [27].

On the other hand, feature selection plays an essential part in supervised learning in obtaining more promising and efficient results. Feature selection filters unnecessary information from a dataset using statistical metrics to improve a learning algorithm. Its primary goal is to obtain a compact set of features that characterize the dataset well. Feature selection reduces computation time [8], improves forecasting outcomes, and enhances data comprehension, making it a typical preprocessing step for high-dimensional data. The goals are to reduce the dimensionality of the data and the model, making them easier to interpret, and to improve forecast accuracy. In other words, feature selection identifies the most relevant input variables from a pool of potential predictors, such as weather conditions, soil attributes, and agricultural practices, to improve and enhance the crop prediction phase.

Feature selection (FS) methodologies fall into three distinct types: filter, wrapper, and hybrid. Because higher-dimensional feature spaces introduce additional complexity, the problem cannot be solved by simply enumerating all possible feature subsets. Filter approaches can identify and eliminate irrelevant features, but they cannot do the same for redundant features because they do not account for possible associations among features [9, 10]. In the filter method, properties of the data themselves, such as correlation, the Fisher score, information gain, mutual information, and entropy, define which subset of features is most important [11]. Wrapper feature selection approaches are wrapped within the induction process [12, 13]: several search methods, including recursive feature elimination and backward and forward elimination passes, can be used to find a feature subset that optimizes a suitable objective function [14]. Wrapper methods are recognized for the excellent quality of the features they select, although at the expense of a higher computational cost. Hybrid approaches combine the two and aim for a compromise between computational complexity and speed [1, 15], striking a good balance between accuracy and processing time. In this study, we propose a hybrid approach utilizing elements of both the filter and wrapper approaches.

Despite the promising potential of supervised learning and feature selection techniques in agriculture, challenges persist in effectively integrating these methods to enhance crop yield prediction models. Agricultural datasets often exhibit high dimensionality and contain numerous variables, necessitating robust feature selection approaches to identify the most influential factors. Moreover, manual tuning of SVR hyperparameters can be labor-intensive and may not always yield optimal results.

To address these challenges, this paper proposes a new framework with three phases: preprocessing, feature selection, and prediction. First, the k-means (KM) technique is used to cluster the dataset; it strives to keep the clusters as far apart as feasible while making the members of each cluster consistent. The CFS ranking method then ranks the features within each cluster independently. These two techniques simplify the search space by addressing the high-dimensionality and redundancy problems. After the top features from each cluster are chosen, the resulting reduced dataset is forwarded to the feature selection phase. Second, in the feature selection phase, a novel hybrid feature selection strategy is proposed to narrow the pool of candidates down to the top-performing features. Two filter-type approaches, the Fisher score and mutual information gain, are applied, and the intersection of their resulting feature sets is fed into the wrapper approach. The wrapper, recursive feature elimination (RFE), is combined with the filter approaches to create a hybrid feature selection technique. Finally, for the prediction phase, a novel improved algorithm, ICOA, is proposed to optimize the hyperparameters of the SVR model and enhance the prediction results of the final phase. The COA algorithm is enhanced with a chaotic map and the Lévy distribution function to improve its exploration and exploitation phases, resulting in the novel ICOA algorithm. The paddy crop dataset is used with the proposed method to identify the best features for future crop production prediction.

This paper’s main contributions are summed up as follows:

  • A framework that integrates a novel hybrid feature selection approach with optimized SVR model to enhance the prediction results is proposed.

  • This paper provides a hybrid approach to feature selection that combines filter and wrapper methods.

  • An improved variant of COA algorithm is proposed to enhance the exploration and exploitation phases of COA.

  • The Levy flight and chaotic maps are integrated into the original COA resulting in a promising ICOA applied to optimize the SVR model.

  • The dataset’s redundancy and high dimensionality were mitigated through an unsupervised feature selection strategy in the preprocessing phase, by combining KM clustering and CFS ranking.

  • Experimental results confirm that the proposed approach selects the most relevant features and enhances the prediction results.

The following sections make up this paper: Sect. 2 compiles previous research on predicting crop yields, and Sect. 3 explains the background material used in this study. Section 4 discusses the proposed framework and its components; Sect. 5 presents the experiments and their results; Sect. 6 presents the discussion of the obtained results; and Sect. 7 offers the conclusion and future work.

2 Related work

Estimating crop yields is critical in today's world, where an ever-growing population demands more and more food. It aids in enhancing the management procedures vital to maximizing agricultural yield. ML methods, traditional regression methods, and crop models [16,17,18] have been used to estimate crop yields over the previous decade. Crop yield models are a type of crop growth model; given their parameters, they are essentially simulations of the underlying processes observed in field studies [19]. By providing reliable data on agricultural output, these models help policymakers, farmers, and governments achieve maximum sustainability [20].

Vani and Rathi [21] described big data analysis as gathering, maintaining, and analyzing massive amounts of data to find connections and other insights. Big data was used to analyze harvest, soil, and climate data from internal and external sources for agricultural applications. Several machine learning algorithms grouped the data to estimate agricultural productivity; however, the resulting grouping was inaccurate and of poor quality. In contrast, Proximity Likelihood Maximization Data Clustering (PLMDC) uses fewer characteristics from vast and densely packed farming data to improve clustering and farmer crop output projections. An appropriate linear regression method was utilized to remove extraneous features from dense and sparse agricultural data, and the Genetic Algorithm (GA) selected the clustering features with the best fitness. The A-FP growth methods evaluated the decision-support system's capability to predict agricultural yields using meteorological data and crop quality. The reported observations showed that PLMDC was more effective than existing methods.

Predictions of frost danger for Zhejiang tea plantations using ML methods have also been made [22]. Damage was calculated using meteorology, topography, and coordinate geometry (latitude and longitude), with ANN and SVM used for the estimation. The authors in [23] built a spatio-temporal hybrid model using satellite-derived hydro-meteorological data from 20 sites over 20 years in Bangladesh. Dragonfly optimization and support vector regression (SVR) were employed in this research. This hybrid model reduced the relative error in predicting tea crops by 11%.

Reyana et al. [24] utilized data from IoT sensors to remotely monitor crops. In contemporary agriculture, producers now control the surrounding environment of their crops to optimize yields. The authors presented MMLA, a novel method for recognizing multisensory information, and the suggested recommendation system categorized eight crops. Different machine learning algorithms, specifically Random Forest and the J48 Decision Tree, were utilized to classify the crop types. The classifiers' performance was evaluated using precision, F-measure, and recall, and the findings were compared to advanced classifiers. The Random Forest method demonstrated superior efficiency in categorizing agriculture-related text, exhibiting the lowest error measures, such as a 13% Root Mean Square Error (RMSE).

Paudel et al. [25] used Crop Yield Prediction model (MARS) data from the Joint Research Centre of the European Commission to evaluate the accuracy and accessibility of NN crop yield forecasting models. The 1DCNN and LSTM could handle time-series data, and their effectiveness was compared to a GBDT model with hand-crafted attributes. Agriculture and crop yield forecasting experts used feature importance algorithms to rate the significance of the input parameters. LSTM models outperformed GBDT models for the wheat crop in Germany. LSTM models accurately captured the impacts of yield trends and static features, including biomass and soil water-retention capacity, on crop output; however, high-temperature and moisture conditions were harder to capture. This study shows that DL can automatically learn features and provide accurate crop output estimates, and it highlights the advantages and challenges of involving stakeholders in assessing model interpretability.

Khaki et al. [26] used ML to successfully forecast corn production and yield differences among corn hybrids given environmental or genotype data. Using remotely sensed data collected before harvest, You et al. [27] applied DL algorithms to estimate soybean production. An ANN model was also built to forecast environmental impacts and tea yields (black, green, and oolong) in Iran [28]. A comparison of deep fully connected neural networks, LASSO, and RF found that combining a CNN and an RNN was superior for predicting soybean and corn yields [29]. Researchers have also created a decision-support system using information about the soil and the surrounding environment [30].

Swanth et al. [31] propose a new way of predicting crop yield using a hybrid classification model that incorporates an enhanced feature-ranking fusion technique. The authors propose a new SMOTE algorithm for data enrichment to ensure that the extracted features are optimized. Their feature extraction includes statistical features, improved correlation-based features, raw data, and entropy features. They also offer an enhanced way of combining feature rankings using the results of various feature selection techniques: Relief, RFE, and Chi-square. Their hybrid model, which combines DBN and LSTM models, is used for prediction. The authors' results show that their approach improves upon traditional classifiers, including LSTM, DBN, Bi-GRU, CNN, and SVM.

Fatma M. Talaat [32] introduces the Crop Yield Prediction Algorithm (CYPA), a new method that employs IoT techniques in precision farming. Climate, meteorological, chemical, and agricultural yield data are integrated into CYPA to enable policymakers to forecast annual crop yields. The authors developed a decision-support tool to aid farmers and decision makers in predicting agricultural yields by analyzing the meteorological circumstances specific to their regions, and they suggested an advanced machine learning approach for predicting agricultural yields. In addition, active learning was implemented in CYPA to optimize the model's performance by minimizing the amount of labeled data required for training. Through active learning, CYPA can respond to changing field conditions, such as pest outbreaks or weather, by actively choosing fresh samples for labeling that accurately reflect the current conditions.

The Levenberg–Marquardt technique was previously exploited to evaluate and forecast human gait [33], and it can be utilized in surveys of forests and farms. Surveying with older methods is difficult, time-consuming, and costly, especially in remote or rugged places or where dense vegetation is present, such as mountains, forests, or fields. In another paper, relevant operational laws and the necessary weighted aggregation operators were devised [34]. There, scalar multiplication and neutral-addition operational rules define the properties of the neutral type in the group association degrees and the sum of probabilities, and all facets of the proposed laws are examined.

However, research on the use of DL for predicting tea yield is scant [35]. To estimate yield, ML and DL methods analyze data on climate, soil, crops, and satellite imagery [36]. Remote sensing data, which provide different microwave and spectral wavelengths, enable crop status monitoring [37]. Predictions of wheat crops have been made using satellite and climate data [38]. A prediction model for sorghum biomass was proposed using the APSIM sorghum crop model, a multi-layer perceptron (MLP), and SVM as inputs; after comparing the models, the authors concluded that the MLP was the most reliable [39].

Previous authors have used data mining (DM) techniques to identify and rank the critical features influencing sugarcane output and then to create mathematical models for predicting sugarcane yield [40]. Three different DM methods were used to analyze data from the databases of numerous sugar mills in Brazil. Some DM strategies have been used to investigate relationships between weather conditions and plant care. An external dataset was used to estimate the accuracy of the derived models, and the RF algorithm was utilized for comparison.

In [41], a CNN and an LSTM are coupled to estimate county-level soybean yields using outdoor remote sensing data at the end of the growing season. There is a shortage of literature on using deep learning to estimate agricultural yields in indoor greenhouse settings, in contrast to outdoor application scenarios. The research in [42, 43] motivated us to apply an RNN with long short-term memory (LSTM) units to the problem of predicting crop yields for tomatoes and ficus. The evaluation results also demonstrate that conventional machine learning algorithms are inferior to deep learning techniques in terms of prediction accuracy and root mean square error (RMSE).

To define agricultural targets for import and export, as well as to boost farmer incomes, crop yields must be predicted quickly and precisely in both quantitative and economic assessments. Crop production forecasting is one of the most difficult problems in the agricultural industry, and machine learning techniques are used to estimate crop output. According to the previously reported studies, literature that relies on the DT algorithm suffers from overfitting, resulting in inaccurate predictions. The primary obstacles and issues in the related work can be summarized as follows:

  • More classifier techniques for agricultural yield prediction should be examined.

  • The proposed techniques should consider all variables of the agricultural sector to enhance the forecasting process.

  • Climatic data should be added to the suggested methods to boost prediction accuracy.

  • More crop-related features should be incorporated into the suggested techniques for accurate prediction.

  • Improved strategies need to be analyzed for higher accuracy in crop forecasting.

  • The related work does not use an optimized model, which can have a significant impact on prediction accuracy compared to traditional models.

Therefore, this paper proposes a comprehensive framework that enhances prediction performance by introducing a new hybrid feature selection approach and a novel algorithm for optimizing the hyperparameters of the prediction process. The proposed framework handles the issues raised by related works: the novel hybrid feature selection approach focuses on the best features to reduce dimensionality, the new optimized prediction model improves the prediction results compared with recent approaches, and a new set of climatic features is integrated to further enhance the obtained results.

3 Preliminaries

This section provides a high-level overview of the tools and techniques used in this study.

3.1 Crayfish optimization algorithm (COA)

Jia et al. [44] developed the Crayfish Optimization Algorithm (COA) by mimicking the behaviors of crayfish: summer resorting, competition, and foraging. Foraging and competition imitate the exploitation and exploration processes of the optimization procedure, respectively, and both are controlled by temperature. At high temperatures, crayfish seek shelter in caves, either for summer retreats or to compete for cave possession. At suitable temperatures, crayfish exhibit foraging behavior as their means of exploitation. Temperature adjustment makes the search for the global solution more random. The following sections detail the process of COA.

3.1.1 Initialization

The COA randomly initializes a population \(X\) of \(N\) candidate solutions, each with \(dim\) dimensions. The position of each solution \(X_{i,j}\) is modeled as:

$$X_{i,j} = Lb + \left( {Ub - Lb} \right) \times rand$$
(1)

where \(Lb\) and \(Ub\) are the lower and upper bounds of dimension \(j\).
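To make the initialization concrete, the following is a minimal Python sketch of Eq. (1); the bounds, population size, and NumPy-based implementation are illustrative assumptions rather than the authors' original code.

```python
import numpy as np

def initialize_population(n_agents, dim, lb, ub, rng=None):
    """Random initialization of the crayfish population, Eq. (1):
    X[i, j] = lb_j + (ub_j - lb_j) * rand."""
    rng = np.random.default_rng(rng)
    lb = np.broadcast_to(np.asarray(lb, dtype=float), (dim,))
    ub = np.broadcast_to(np.asarray(ub, dtype=float), (dim,))
    return lb + (ub - lb) * rng.random((n_agents, dim))

# Example (hypothetical bounds): 30 agents in a 2-dimensional search space,
# e.g. the SVR hyperparameters C and sigma optimized later in the paper.
X = initialize_population(30, 2, lb=[1e-3, 1e-4], ub=[1e3, 1e1], rng=42)
```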

As previously mentioned, temperature is a crucial factor in multiple phases of the crayfish life cycle and is defined in Eq. (2). When the temperature exceeds 30 degrees, the crayfish relocates to a cooler area for its summer retreat. When the temperature is suitable, the crayfish initiates its foraging behavior; the temperature range for foraging is specified as 15 to 30 degrees. The foraging behavior can therefore be modeled by a normal distribution influenced by the temperature, whose mathematical representation is presented in Eq. (3).

$${\text{temp}} = {\text{rand}} \times 15 + 20$$
(2)
$$p = C_{1} \times \frac{1}{{\sqrt {2\pi } \,\sigma }} \times \exp \left( { - \frac{{({\text{temp}} - \mu )^{2} }}{{2\sigma^{2} }}} \right)$$
(3)

where the temperature of the crayfish location is denoted by \(temp\) while \(\mu\) refers to the temperature of the best crayfish. In addition, \(\sigma\) and \(C_{1}\) parameters control the different temperatures of crayfish.

3.1.2 Summer resort phase

In this phase, if the temperature is greater than 30 degrees, the crayfish takes its summer break at the cave \(X_{shade}\), which is formulated as:

$$X_{{{\text{shade}}}} = X_{{{\text{gbest}}}} + X_{{{\text{local}}}}$$
(4)

where the best position obtained so far is denoted by \(X_{gbest}\) and the position of the current population is denoted by \(X_{local}\). Crayfish usually compete and fight for the cave, which is modeled as a random process. If \(rand < 0.5\), no competition takes place and the crayfish occupies the cave directly, as follows:

$$X_{i,j}^{t + 1} = X_{i,j}^{t} + C_{2} \times {\text{rand}} \times \left( {X_{{{\text{shade}}}} - X_{i,j}^{t} } \right)$$
(5)
$$C_{2} = 2 - \left( {\frac{t}{{{\text{Max}}_{{{\text{iter}}}} }}} \right)$$
(6)

where \(t + 1\) denotes the next iteration, \(t\) is the present iteration, and \(Max_{iter}\) is the maximum number of iterations.

3.1.3 Competition phase

In this phase, the crayfish fight for the possession of the cave. If the temperature is over 30 degrees and the random value is greater than 0.5, the other crayfish are attracted to the same cave. Consequently, they engage in conflict with one another in order to obtain possession of the cave, as indicated by Eq. (7).

$$X_{i,j}^{t + 1} = X_{i,j}^{t} - X_{z,j}^{t} + X_{{\text{shade }}}$$
(7)
$$z = {\text{round}}\left( {{\text{rand}} \times \left( {N - 1} \right)} \right) + 1$$
(8)

where the total number of population’s agents is defined by \(N\).

3.1.4 Foraging phase

In this phase, the crayfish start the process of searching for food (the optimal solution). When the temperature is 30 degrees or lower, the crayfish search for food at different locations; this simulates the search for the optimal solution of the problem. The location and size of the food are formulated as:

$$X_{{\text{food }}} = X_{G}$$
(9)
$$Q = C_{3} \times \left( {\frac{{{\text{fit}}_{i} }}{{{\text{fit}}_{{{\text{food}}}} }}} \right)$$
(10)
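To illustrate how the temperature drives the switch between the summer resort, competition, and foraging phases, a simplified Python sketch is given below. It assumes the update rules of Eqs. (2)–(9); the foraging move toward the food position is abbreviated, and the food-size factor \(Q\) of Eq. (10) is omitted for brevity, so this is a pedagogical sketch rather than the complete COA.

```python
import numpy as np

def coa_step(X, X_gbest, X_local, t, max_iter, rng):
    """One simplified COA iteration: a random temperature in [20, 35] (Eq. (2))
    selects summer resort / competition (temp > 30) or foraging (temp <= 30)."""
    N, dim = X.shape
    temp = rng.random() * 15 + 20                  # Eq. (2)
    X_shade = X_gbest + X_local                    # cave position, Eq. (4)
    C2 = 2 - t / max_iter                          # Eq. (6)
    X_new = X.copy()
    for i in range(N):
        if temp > 30:
            if rng.random() < 0.5:                 # summer resort, Eq. (5)
                X_new[i] = X[i] + C2 * rng.random() * (X_shade - X[i])
            else:                                  # competition, Eqs. (7)-(8)
                z = rng.integers(0, N)
                X_new[i] = X[i] - X[z] + X_shade
        else:
            # Foraging (exploitation): move toward the food, i.e. the global
            # best of Eq. (9); the full algorithm also scales this step by the
            # food size Q of Eq. (10), omitted here.
            X_new[i] = X[i] + rng.random() * (X_gbest - X[i])
    return X_new
```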

3.2 Chaotic maps

Chaos refers to the inherent unpredictability exhibited by a complex system. A chaotic map is a mathematical function used to introduce chaotic behavior into a parameter of an algorithm. Chaotic maps are commonly employed in optimization problems because of their ergodic properties: they enable dynamic exploration of the search space at a faster rate than stochastic searches that rely purely on probability. Substituting the stochastic elements in meta-heuristic algorithms with chaotic maps rather than conventional probability distributions can therefore be beneficial. The use of chaos enhances the ability to find a global optimum by helping the search escape local optima. This study considers 8 one-dimensional chaotic maps, shown in Fig. 1, to enhance the basic COA.

Fig. 1 The different types of one-dimensional chaotic maps
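The text does not spell out the eight maps, so as an illustration the sketch below uses the logistic map, one of the most common one-dimensional chaotic maps; any of the maps in Fig. 1 could be substituted for it.

```python
def logistic_map_sequence(length, x0=0.7, r=4.0):
    """Generate a chaotic sequence in (0, 1) with the logistic map
    x_{k+1} = r * x_k * (1 - x_k); with r = 4 the map is fully chaotic.
    The starting value x0 is an arbitrary seed in (0, 1)."""
    seq, x = [], x0
    for _ in range(length):
        x = r * x * (1.0 - x)
        seq.append(x)
    return seq

# Chaotic values ch(i), later used in the ICOA position update of Eq. (17).
ch = logistic_map_sequence(100)
```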

3.3 Levy flight

A Lévy flight is a type of random walk whose step lengths follow the Lévy distribution, a probability distribution with heavier tails than the normal distribution. The concept was initially developed by Paul Lévy in 1937 and later elaborated by Benoit Mandelbrot [45]. Multiple studies indicate that when animals and insects look for food, their flight behavior often exhibits a characteristic pattern of randomly selected directions that can be described as a Lévy flight. In [46], Reynolds et al. investigated the movement patterns of fruit flies, which navigate their environment using a series of straight paths interrupted by sudden 90° turns, resulting in a scale-free intermittent Lévy-flight search pattern. In [47], the authors showed that Lévy flights can be utilized to replicate particular light phenomena. The Lévy flight can be defined as a stochastic process with infinite variance whose step-length distribution is given by Eq. (11).

$${\text{Levy}}\left( \gamma \right) \sim {\text{u}} = {\text{t}}^{ - 1 - \gamma } ,(0 < \gamma \le 2)$$
(11)

The Mantegna method is an effective approach for producing random step lengths that exhibit behavior similar to Lévy flights.

$$s = \frac{\mu }{{|\nu |^{{\frac{1}{\gamma }}} }}$$
(12)

where \(\mu\) and \(\nu\) are drawn from normal distributions, \(\mu \sim {\text{N}}\left( {0,\sigma_{\mu }^{2} } \right)\) and \(\nu \sim {\text{N}}\left( {0,\sigma_{\nu }^{2} } \right)\), with \(\gamma = 1.5\) and \(\sigma_{\nu } = 1\). The standard deviation \(\sigma_{\mu }\) is given by Eq. (13), where \(\Gamma\) is the gamma function:

$$\sigma_{\mu } = \left[ {\frac{{{\Gamma }\left( {1 + \gamma } \right) \times {\text{sin}}\left( {\pi \times \frac{\gamma }{2}} \right)}}{{{\Gamma }\left( {\frac{1 + \gamma }{2}} \right) \times \gamma \times 2^{{\frac{\gamma - 1}{2}}} }}} \right]^{{\frac{1}{\gamma }}}$$
(13)

where \({\Gamma }\left( {\text{z}} \right) = \smallint_{0}^{\infty } t^{z - 1} e^{ - t} dt\).
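A minimal NumPy sketch of Mantegna's step generation (Eqs. (12)–(13)) is shown below, assuming the standard choices \(\sigma_{\nu} = 1\) and \(\gamma = 1.5\); the function name and vectorized form are illustrative.

```python
import math
import numpy as np

def levy_step(size, gamma=1.5, rng=None):
    """Mantegna's algorithm for Levy-distributed steps, Eqs. (12)-(13):
    s = mu / |nu|**(1/gamma), with mu ~ N(0, sigma_mu**2) and nu ~ N(0, 1)."""
    rng = np.random.default_rng(rng)
    sigma_mu = (math.gamma(1 + gamma) * math.sin(math.pi * gamma / 2)
                / (math.gamma((1 + gamma) / 2) * gamma * 2 ** ((gamma - 1) / 2))
                ) ** (1 / gamma)
    mu = rng.normal(0.0, sigma_mu, size)
    nu = rng.normal(0.0, 1.0, size)
    return mu / np.abs(nu) ** (1 / gamma)
```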

4 Proposed framework

This section presents the proposed framework, which consists of three main phases as shown in Fig. 2: preprocessing, hybrid feature selection, and prediction. The preprocessing phase includes the normalization task and the clustering-with-CFS task. The goal of this phase is to normalize the values of the dataset features and then to cluster the dataset into different clusters; the groups/clusters help to reveal hidden information in the dataset. After that, CFS is applied to reduce each cluster's features, yielding a new reduced dataset. In the second phase, a hybrid feature selection approach with filter and wrapper stages is proposed, and the most relevant features are selected for the prediction phase. In the prediction phase, a novel variant of the COA algorithm is proposed to optimize the hyperparameters of different machine learning models, particularly SVR, because manual tuning of hyperparameters rarely leads to the most promising solution.

Fig. 2 The proposed framework phases

4.1 Pre-processing phase

This section presents the components of the preprocessing phase, which consists of two main stages: normalization, and clustering with CFS. Normalization is applied first to bring the values of different features, which may span various ranges, into a specified range. Clustering is then applied to group the dataset into clusters of similar patterns, and the features of each cluster are ranked using CFS. The top-ranked features from each cluster form a new minimized dataset used in the subsequent phases of the proposed framework. The details of each stage of the preprocessing phase are discussed in the following subsections.

4.1.1 Normalization

Normalizing values is a necessary data-processing step. Some forms of normalization require merely rescaling values relative to another variable. For crop population data, normalization also helps correct inconsistencies: once the inaccuracies are corrected, the population values follow a regular rather than arbitrary distribution. The first step in normalization is obtaining the z-score, which can be written as:

$$z = \left[ {\left( {x - \mu } \right)/\sigma } \right]$$
(14)

where \(\mu\) and \(\sigma\) denote the mean and standard deviation of the crop population, respectively.
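A small sketch of the z-score normalization of Eq. (14), applied column-wise to a feature matrix, might look as follows; the guard for constant columns is an added safeguard, not part of the original description.

```python
import numpy as np

def z_score_normalize(X):
    """Column-wise z-score normalization, Eq. (14): z = (x - mu) / sigma."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0   # avoid division by zero for constant features
    return (X - mu) / sigma

# An equivalent scikit-learn call would be:
# sklearn.preprocessing.StandardScaler().fit_transform(X)
```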

4.1.2 Clustering with CFS

In this phase, we apply a preprocessing strategy that combines KM clustering and CFS ranking to deal with the high dimensionality of the input data. The KM approach is used on the raw data to generate the initial 'k' clusters. As in traditional approaches, the number of clusters 'k' is predefined each time, with 'k' values of 8, 10, and 12 being considered. The features of each cluster are then ranked and sorted using the CFS score. A minimized dataset is obtained by choosing the top CFS-ranked features from each cluster. This reduced dataset, which contains less redundancy, is sent to the next phase. KM clustering is applied to the training data to identify commonalities and create subsets; the rationale behind this strategy is that clustering can reveal previously hidden information and highlight the underlying data structure that was not apparent before grouping.

The cluster analysis output can serve as a valuable guide when extracting and prioritizing critical features from the clusters. Using the KM approach, the training data are partitioned into k groups, and the center of each cluster is determined by taking the mean of the data within it. Choosing how many clusters to create, i.e., the value of k, is a crucial challenge in the KM clustering process.

After the clusters have been generated, a CFS filter approach is applied to select the crucial features for each cluster's resulting minimal subset dataset. CFS uses statistical metrics to evaluate feature subsets as part of its filtering procedure. When one feature is highly correlated with another, it is considered redundant. Here, the features are discrete assessments of the distinctive qualities of the variable of interest. Given a set of features, if the correlation between each feature and the external (target) variable is known, and the correlations between all pairs of features are also known, then Eq. (15) can be used to determine the correlation between the composite test, which includes all features, and the external variable,

$$r = \frac{{\sum_{i = 1}^{n} \left( {x_{i} - \overline{x}} \right)\left( {y_{i} - \overline{y}} \right)}}{{\sqrt {\sum_{i = 1}^{n} \left( {x_{i} - \overline{x}} \right)^{2} } \sqrt {\sum_{i = 1}^{n} \left( {y_{i} - \overline{y}} \right)^{2} } }}$$
(15)

The formula mentioned above characterizes Pearson’s correlation coefficient (PCC). Here, \(\overline{x}\) and \(x_{i}\) define the mean and actual values related to the features under consideration. The average and actual values of the dataset class are denoted by \(\overline{y}\) and \(y_{i}\), respectively.

As the degree of similarity between the features and the class grows, so does the merit of the resulting feature set. Furthermore, the overall feature-class correlation is denoted by \(r_{ny} = p\left( {X_{n} ,Y} \right)\), while the average feature-feature correlation is denoted by \(r_{nn} = p\left( {X_{n} ,X_{n} } \right)\). The relative merit of the chosen features is calculated using Eq. 16,

$$J\left( {X_{n} ,Y} \right) = \frac{{nr_{ny} }}{{\sqrt {n + \left( {n - 1} \right)r_{nn} } }}$$
(16)

This indicates that the correlation between the feature group and the external variable grows stronger as the size of the group increases. The steps for selecting features from each cluster using the CFS filter are presented in pseudo-code form in Algorithm 1.

Algorithm 1 CFS-based feature selection from each cluster (pseudo-code)
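The following Python sketch illustrates the spirit of Algorithm 1 under two stated assumptions: the training samples are partitioned with k-means, and the per-cluster CFS ranking is approximated by the absolute Pearson correlation of each feature with the target (Eq. (15)). Function and parameter names (e.g. `top_m`) are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_cfs_reduce(X, y, k=8, top_m=5, random_state=0):
    """Preprocessing sketch: partition the training samples into k clusters,
    rank the features inside each cluster by |Pearson correlation| with the
    target, and keep the union of the top-ranked features from every cluster."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
    selected = set()
    for c in range(k):
        Xc, yc = X[labels == c], y[labels == c]
        scores = np.zeros(X.shape[1])
        for j in range(X.shape[1]):
            if Xc[:, j].std() > 0 and yc.std() > 0:
                scores[j] = abs(np.corrcoef(Xc[:, j], yc)[0, 1])   # Eq. (15)
        selected.update(np.argsort(scores)[::-1][:top_m])
    return sorted(selected)   # column indices forming the reduced dataset
```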

4.2 Hybrid feature selection approach

Because feature selection is optimized for the employed learning algorithm, wrappers typically produce better results than filter procedures [48]. This makes wrappers challenging to work with and prohibitively costly to execute, especially for large databases with many features.

In most cases, filters are faster to execute than the wrapper, making them a much more attractive option for scaling to extensive databases with many features. A filter can supply an intelligent initial feature subset for a wrapper when higher precision is desired for a specific learning algorithm. Wrapper and filter methods are combined in the proposed methodology to use the filter method’s ranking data. In this approach, we take full advantage of filters and wrappers. Combining filter and wrapper approaches may boost the former’s prediction accuracy while reducing the latter’s runtime.

The proposed hybrid feature selection approach uses Recursive Feature Elimination with a support vector machine predictor (RFE-SVM) as the wrapper and two filter approaches to choose the most relevant subset of features.

Given the reduced dataset obtained from the preprocessing phase, the top 10% of features are first chosen based on their predictive quality, using a combination of two feature ranking approaches, the Fisher score and Mutual Information Gain (MIG). During this step, features that are unlikely to help predict samples in the dataset are eliminated. With the filter-based procedures, the optimum feature set is found by taking the intersection of the two feature sets generated by the distinct feature rankers. The feature set obtained by the filter methods is then passed to a wrapper technique, which eliminates redundant features and boosts accuracy.

Recursive Feature Elimination (RFE) is a cyclical procedure that removes features in reverse. Using a learning algorithm, this technique ranks feature subsets to produce a superior feature set. Figure 3 depicts the proposed hybrid feature selection approach’s architecture. Each step of the proposed approach is outlined in greater detail below.

Fig. 3 The proposed hybrid feature selection approach

4.2.1 Feature ranking approaches—filter stage

Identifying the best features is crucial for classification. Classification algorithms benefit little from high-dimensional inputs because using all features lengthens the processing time. The Fisher score and mutual information gain are popular supervised feature selection methods; by picking the most valuable features, they can drastically reduce the number of dimensions being considered.

The features in the reduced dataset obtained from the clustering-with-CFS phase are ranked and sorted using the Fisher score and MIG simultaneously. The output of each ranking approach is a ranked feature subset, and the two outputs may differ. Therefore, the top-ranked 10% of features from each feature subset are chosen, and their intersection forms the new, final reduced dataset. This dataset is then fed into the wrapper-based stage to select the optimal features for the final prediction.
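A hedged sketch of this filter stage using scikit-learn is shown below. Because the yield target is continuous, the univariate F statistic (`f_regression`) stands in for the Fisher score here, which is an assumption of this illustration; `mutual_info_regression` provides the MIG ranking, and the 10% threshold and the intersection follow the description above.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

def filter_stage(X, y, feature_names, top_frac=0.10):
    """Filter stage: rank features with two filters and keep the
    intersection of the top 10% of each ranking."""
    n_top = max(1, int(np.ceil(top_frac * X.shape[1])))
    fisher_like = f_regression(X, y)[0]                 # F-score per feature
    mig = mutual_info_regression(X, y, random_state=0)  # MIG per feature
    top_fisher = set(np.argsort(fisher_like)[::-1][:n_top])
    top_mig = set(np.argsort(mig)[::-1][:n_top])
    keep = sorted(top_fisher & top_mig)                 # intersection
    return [feature_names[i] for i in keep], keep
```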

4.2.2 Recursive feature elimination (RFE)—wrapper stage

While filter approaches choose the most informative features to include in a feature subset, they do not consider the correlation between features in making their decision. Wrapper techniques can optimize the feature subset by employing a learning algorithm in the feature selection process. Compared to filter approaches, they are more costly because updating the feature subset requires re-training the learning algorithm. Therefore, combining a filter with a wrapper technique for feature selection can lead to a superior solution in terms of feature set size and accuracy. Recursive feature elimination is a ranking method that iteratively removes features based on their importance; RFE ranks features using a greedy approach.

RFE always starts by removing the least essential features from the feature set obtained in the previous stage. The relative importance of particular features can change drastically when evaluated over a different subgroup of features during the step-by-step elimination, which makes the recursion necessary. One example of a suitable estimator is the ensemble-based classification or prediction method known as the random forest.

A fixed rule is applied to select the various predictors, and the resulting prediction is made for the supplied dataset. In addition, different trees are created using subsets of the training set to ensure that the insights they yield are independent and novel. Furthermore, the algorithm incorporates randomness in its search for optimal splits [49], so the trees can differ from one another. The application of random forests to this problem dictates how far the wrapper stage of feature selection goes. Since every tree in a random forest is built from a bootstrap sample, some portions of the training set are not used during training; this subset, known as out-of-bag, provides an objective way to quantify error estimates.

The RFE-SVM feature selection is described in Algorithm 2. The RFE-SVM method relies on a recursive feature elimination procedure in which features are ranked according to the order in which they are eliminated. The idea behind the RFE-SVM wrapper model is to repeatedly build a model, select the best (or worst) feature, set it aside, and repeat the process with the remaining features until all the features in the reduced dataset have been considered. After each feature is discarded, its place in the ranking is determined by its elimination order. RFE thus uses a greedy optimization search to find the optimal features. Algorithm 3 presents the overall steps of the proposed hybrid feature selection method.

Algorithm 2 RFE-SVM feature selection (pseudo-code)
Algorithm 3 The proposed hybrid feature selection method (pseudo-code)
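The wrapper stage can be sketched with scikit-learn's RFE as below. Because RFE requires an estimator that exposes coefficients or importances, this sketch uses a linear-kernel SVR for the ranking (a RandomForestRegressor could equally be substituted, matching the out-of-bag discussion above); the number of retained features is an illustrative choice, not a value from the paper.

```python
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

def wrapper_stage(X_filtered, y, n_features_to_select=5):
    """Wrapper stage sketch (RFE-SVM): recursively drop the least important
    feature, ranked by the linear SVR coefficients, until the requested
    number of features remains."""
    estimator = SVR(kernel="linear", C=1.0)
    rfe = RFE(estimator, n_features_to_select=n_features_to_select, step=1)
    rfe.fit(X_filtered, y)
    # support_ is a boolean mask of kept features; ranking_ gives the
    # elimination order (1 = kept, larger = eliminated earlier).
    return rfe.support_, rfe.ranking_
```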

4.3 Prediction

In this phase, multiple ML predictors are used to predict the crop yield, including SVR, kNN, DT, RF, and gradient boosting. To obtain the best combination of hyperparameters for these predictors, a new optimization algorithm is proposed. ICOA is a new variant of the COA algorithm that enhances COA's search process to find the parameter combination that minimizes the prediction error, defined by the MSE. The proposed ICOA is used to optimize the parameters of the different ML models, and the obtained results indicate that SVR performs best among all the algorithms on the crop yield prediction problem. The main enhancement strategy used to boost the performance of COA is the combination of chaotic mapping and Lévy flight operators. Multiple studies indicate that incorporating a Lévy flight trajectory improves the equilibrium between exploration and exploitation in optimization algorithms. In this paper, the Lévy flight technique is employed to further adjust the positions of the search agents (crayfish). In addition, the chaotic maps are utilized to further explore the hyperparameter search space and obtain more promising parameter sets for the different ML models. Therefore, the combination of chaotic and Lévy flight operators helps the ICOA algorithm avoid falling into local optima and enhances its search process by balancing the exploration and exploitation phases. The mathematical model for the newly formulated combination is defined as follows:

$$X_{i} \left( {t + 1} \right) = X_{i} \left( t \right) + ch\left( i \right) \times {\text{sign}}\left[ {r_{1} - \frac{1}{2}} \right] \times {\text{L}}\acute{{\text{e}}}{\text{vy}}\left( \gamma \right)$$
(17)

where \(X_{i} \left( t \right)\) indicates the position of the ith crayfish (search agent) at the tth iteration, \(r_{1}\) is a stochastic number between 0 and 1, and \(ch\left( i \right)\) is a chaotic value obtained from the chaotic map. The stochastic random-walk update of Eq. (17) helps the COA guarantee that the search agent systematically explores the search space: the step length increases over time, which helps the search escape local minima. This study incorporates the Lévy flight trajectory together with the chaotic map into the COA. The proposed ICOA is then used to optimize the parameters of the SVR model to improve the prediction results.
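A small sketch of the Lévy-chaotic refinement of Eq. (17) is given below; it reuses the `levy_step` generator sketched in Sect. 3.3 and a chaotic sequence `ch` as in Sect. 3.2, both of which are assumptions of this illustration rather than the authors' code.

```python
import numpy as np

def levy_chaotic_update(X, ch, gamma=1.5, rng=None):
    """Eq. (17): X_i(t+1) = X_i(t) + ch(i) * sign(r1 - 1/2) * Levy(gamma)."""
    rng = np.random.default_rng(rng)
    N, dim = X.shape
    r1 = rng.random((N, 1))                        # stochastic number in [0, 1)
    step = levy_step((N, dim), gamma, rng)         # Mantegna step, Sect. 3.3 sketch
    ch = np.asarray(ch, dtype=float).reshape(N, 1) # one chaotic value per agent
    return X + ch * np.sign(r1 - 0.5) * step
```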

The regression performance of SVR is highly dependent on the values of the bandwidth of the Gaussian kernel σ in the Radial Basis Function (RBF) and the penalty factor C. The SVR’s regression performance is enhanced by utilizing the ICOA to optimize the bandwidth of the Gaussian kernel σ and the penalty factor C. The mean square error (MSE) between the actual values and the predictive values of SVR is employed as the fitness function of ICOA. The ICOA-SVR model algorithm is depicted below.

  1. Initialize the population of ICOA with a set of candidate parameter vectors; set the population size and the maximum number of iterations, along with the lower bound (LB) and upper bound (UB) of the optimized parameters (σ and C). Each candidate solution is two-dimensional, representing the two parameters, and the initial population is generated randomly.

  2. Evaluate the fitness value of each solution (parameter set) using the MSE fitness function, and determine \(X_{G}\) and \(X_{L}\).

  3. Execute the algorithm's update formulas for the exploration and exploitation phases according to the \(temp\) variable, which can be either less than or greater than thirty degrees.

  4. Apply the Lévy-chaotic position update (Eq. (17)) to refine the parameters toward a more enhanced parameter set.

  5. At the end of each iteration, record the solution with the minimum MSE, which represents the best parameter set at this iteration.

  6. Save the optimal crayfish over all iterations; it represents the optimal set of σ and C that minimizes the MSE fitness value.

  7. Repeat steps 2–6 until the maximum number of iterations is reached, then output the optimal solution, which represents the optimal parameter set. These steps form the flowchart for optimizing the SVR parameters.
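To make the fitness evaluation of these steps concrete, the following sketch computes the cross-validated MSE of an RBF-kernel SVR for one candidate (C, σ) pair. The mapping gamma = 1/(2σ²) between scikit-learn's RBF parameter and the kernel bandwidth, the bounds in the commented loop skeleton, and the population size are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVR

def svr_mse_fitness(params, X_train, y_train, cv=5):
    """Fitness of one candidate solution (C, sigma): cross-validated MSE of an
    RBF-kernel SVR, with gamma = 1 / (2 * sigma**2)."""
    C, sigma = params
    model = SVR(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2))
    y_pred = cross_val_predict(model, X_train, y_train, cv=cv)
    return mean_squared_error(y_train, y_pred)

# Skeleton of the ICOA-SVR loop (Steps 1-7), reusing the earlier sketches:
# lb, ub = np.array([1e-2, 1e-3]), np.array([1e3, 1e1])   # assumed bounds for C, sigma
# X = initialize_population(30, 2, lb, ub, rng=0)
# for t in range(max_iter):
#     fit = np.array([svr_mse_fitness(x, X_train, y_train) for x in X])
#     ...  # COA summer-resort / competition / foraging updates, then Eq. (17)
#     ...  # keep the candidate with the minimum MSE as the best (C, sigma)
```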

5 Experimental results

In this section, a set of experiments are conducted to evaluate the performance of the proposed framework including the proposed hybrid feature selection and prediction phases.

5.1 Simulation environment

The crop yield forecasting model was implemented in Python. The Python version was 3.7, and the processor was an "Intel(R) Core(TM) i7-10750G7 @ 2.40 GHz"; the system contained 32.0 GB of RAM. A set of ML models, including RF, kNN, SVM, gradient boosting, and DT, is used to evaluate the ICOA approach. The training model parameters and configurations are set according to Table 1.

Table 1 training models parameters and configurations

5.2 Dataset description

With vast increases in both domain knowledge and the availability of suitable tools, the field of machine learning has grown rapidly in recent years, allowing novel approaches to discover, evaluate, and exploit information-rich methods in agricultural contexts. The dataset for this study was collected from official Indian government websites belonging to several agricultural ministries. Three primary keys are used to piece the data together: state, year, and crop. Feature selection methods require a substantial amount of data to work well, and data with well-defined features simplify pattern discovery by filtering out details that are not pertinent to the study's goals.

This section describes the dataset used to make predictions about crop yield in the study. Factors such as rainfall, crop type, market price, and yield are collected to create a dataset that can forecast whether or not a crop will be profitable. Data from many sources are gathered, filtered, and combined using Python; the sources include eands.dacnet.nic.in [50], Agmarknet [51], and mospi.nic.in [52].

Paddy crop production prediction in the Vellore district of southern India is the focus of the planned study; Ponnai, Sholinghur, Arcot, Thimiri, Ammur, and Kalavai are all part of the study's geographical focus. Since paddy is a significant cash crop in the area, it is a natural subject for economic analysis. The information includes non-typical meteorological and soil features, such as the characteristics of the groundwater used by the crops and the amount of fertilizer applied to them. Parameters such as evapotranspiration, wet-day frequency, groundwater nutrients, and aquifer features were examined in this study. Brief details regarding the study's crop parameters can be found in Table 2.

Table 2 description of dataset parameters

The estimated paddy crop yield is calculated from a combination of paddy output (tonnes), cultivated area (hectares), and yield obtained (kg/hectare). Regular climatic factors were used, such as reference crop evapotranspiration, mean temperature, humidity, potential evapotranspiration, and precipitation; unique climatic data such as diurnal temperature range, ground frost frequency, and wind speed were also considered. The climate information comes from the Indian Meteorological Department's online platform. Soil parameters include topsoil density, soil macronutrients, and soil pH. The analysis also considers the many hydro-chemical characteristics of the groundwater, such as its permeability, aquifer type, transmissivity, electrical conductivity, and pre- and post-monsoon micro-nutrient (chloride, magnesium, potassium, calcium, and sodium) content. The years covered by the testing datasets with cross-validation are listed in Table 3.

Table 3 Years of testing datasets with cross-validation

5.3 Evaluation metrics and machine learning models

Validating the proposed framework is essential for achieving the desired results. It is implemented alongside different machine learning algorithms: Decision Tree [53], Support Vector Machine [54], k-Nearest Neighbor [55], gradient boosting [56], and Random Forest [57]. The following subsections detail the evaluation metrics used to assess the different components of the proposed model. The evaluation covers several aspects: the hybrid FS approach, the proposed optimized prediction phase, and the full framework.

5.3.1 Metrics of evaluation

The evaluation measures used define how model performance is judged. Evaluation measures and performance metrics determine how effective a machine learning model is and are crucial for distinguishing between the outcomes of multiple learning models. Mean Square Error (MSE), Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Determination Coefficient (R2), Median Absolute Error (MedAE), and Root Mean Squared Error (RMSE) are the performance indicators considered in evaluating the developed work; a short code sketch showing how they can be computed follows the list below.

  • Mean Absolute Error (MAE) measures the typical magnitude of errors [58] by averaging the absolute differences between a set of predictions and the observations, i.e., the mean absolute deviation from the expected value, as given by the following equation.

    $$MAE = \frac{1}{n}\sum_{j = 1}^{n} \left| {y_{j} - y_{j}^{\prime } } \right|$$
    (18)

    where the sample size \(n\) represents the number of observations, \(y_{j}\) represents the observed value of interest, and \(y_{j}^{\prime }\) represents the predicted value of interest.

  • Mean Squared Error (MSE) is an essential metric for evaluating an estimator’s efficacy. Moreover, this characterizes how well a regressor line corresponds to the points in the dataset [59]. Mean squared error (MSE) can be calculated using the formula:

    $$MSE = \frac{1}{n}\sum_{j = 1}^{n} \left( {y_{j} - y_{j}^{\prime } } \right)^{2}$$
    (19)
  • Root Mean Square Error (RMSE) measures prediction uncertainty via the squared differences between the observed and predicted values [60]. More specifically, it indicates the degree to which the data are concentrated along the line of best fit. Equation (20) gives the calculation of the RMSE:

    $${\text{RMSE}} = \sqrt {\sum_{j = 1}^{n} \frac{{\left( {y_{j} - y_{j}^{\prime } } \right)^{2} }}{n}}$$
    (20)
  • R-squared, also known as the determination coefficient, is a statistical metric used to assess how well the regression model fits the data [61]. The determination coefficient characterizes the extent to which the fitted model outperforms a baseline. Equation (21) provides the following definition:

    $$R^{2} = \left( {\frac{{n\left( {\sum xy} \right) - \left( {\sum x} \right)\left( {\sum y} \right)}}{{\sqrt {\left[ {n\sum x^{2} - \left( {\sum x} \right)^{2} } \right]\left[ {n\sum y^{2} - \left( {\sum y} \right)^{2} } \right]} }}} \right)^{2}$$
    (21)
  • Mean Absolute Percentage Error (MAPE) measures how far the model's forecasts differ from the actual results. Simply put, it is the average of the individual percentage errors, calculated by dividing each absolute error by the corresponding actual value. Its definition, given in Eq. (22), is as follows:

    $${\text{MAPE}} = \frac{1}{n}\sum_{j = 1}^{n} \left| {\frac{{y_{j} - y_{j}^{\prime } }}{{y_{j} }}} \right|$$
    (22)
  • Median Absolute Error (MedAE) is an attractive metric because it is robust to outliers. The loss is the median of all absolute differences between the predictions and the targets. The median absolute error estimated over n samples is defined as follows, where \(\hat{y}_{i}\) represents the predicted value for sample \(i\) and \(y_{i}\) is the actual value.

    $${\text{MedAE}}\left( {y,\hat{y}} \right) = {\text{median}}\left( {\left| {y_{1} - \hat{y}_{1} } \right|, \ldots ,\left| {y_{n} - \hat{y}_{n} } \right|} \right)$$
    (23)
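As referenced above, the metrics of Eqs. (18)–(23) can be computed with scikit-learn and NumPy as in the following sketch; reporting MAPE as a fraction (rather than a percentage) and the no-zero-target caveat are assumptions of this illustration.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error, r2_score)

def evaluate_predictions(y_true, y_pred):
    """Compute the evaluation metrics of Eqs. (18)-(23) for one model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2_score(y_true, y_pred),
        "MAPE": np.mean(np.abs((y_true - y_pred) / y_true)),  # assumes no zero targets
        "MedAE": median_absolute_error(y_true, y_pred),
    }
```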

5.3.2 Cross-validation

When building a machine learning model, it is common to split the dataset into a training set and a test set, with the larger set used to fit and refine the model. Because the test dataset is small, there is always a chance that important data that could have helped the model is left out, and high variance in the data is also a cause for concern. K-fold cross-validation is used to deal with this issue. Modeling and predicting time-series data is a challenging and involved process, and cross-validation based on randomly splitting a time series does not work well: a temporal dependency issue arises from the reliance on earlier observations and the possibility of leakage through lag variables derived from the response variable. Because of this, the data exhibit non-stationarity, with fluctuating mean and variance. In this case, a forward-chaining method is more suitable for executing the cross-validation. The proposed approach builds its model on historical data and makes predictions using a five-fold cross-validation procedure. Table 4 summarizes these results.

Table 4 Evaluation of ML models' performance using all features of the dataset

It’s similar to training on a small sample of data and then using that to make predictions about new data and assess how well those predictions hold up. The data points predicted are included in the next batch of training data, and predictions are made for the following data points. The results of the cross-validation on the proposed method are shown in Fig. 4.

Fig. 4 The proposed forecasting model's predicted values compared to actual values: a after cross-validation, b before cross-validation

Cross-validation is conducted using scikit-learn, a Python machine learning library. The preprocessing steps are first applied to the dataset. The train_test_split function is imported from the model_selection module so that the data can be split into training and test sets. To discover the best possible value for K, the cross_val_score function is used while fine-tuning the cross-validation hyperparameters; the data is divided into K equal subgroups by setting the n_splits argument to 5. In this work, we allocate 75% of the data to training and 25% to testing. The error measure for the trained model is established through K-fold cross-validation, and the R2 score is used to quantify the accuracy of the model, refined at each iteration until an optimal value is reached.
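A minimal sketch of the described setup is shown below, assuming TimeSeriesSplit as the forward-chaining splitter and an SVR as the example estimator; the names X and y stand for the preprocessed feature matrix and the yield target, and the scoring choice mirrors the R2 measure mentioned above.

```python
from sklearn.model_selection import TimeSeriesSplit, cross_val_score, train_test_split
from sklearn.svm import SVR

# 75% / 25% train-test split without shuffling, to respect the temporal order.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)

# Forward-chaining (expanding-window) cross-validation: each fold trains on all
# earlier observations and validates on the following block, as described above.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(SVR(kernel="rbf"), X_train, y_train, cv=cv, scoring="r2")
print(scores.mean())
```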

Below is a detailed illustration of the experimental setup for the proposed hybrid feature selection approach for crop prediction, which may be used in conjunction with several different machine learning frameworks, including KNN, gradient boosting, SVM, decision trees, and random forest.

5.4 Experimental results

Here, we summarize the results of the experiments that favored the proposed framework over the baseline machine learning models. Feature selection, which involves picking the most pertinent features from a dataset, can boost the accuracy of a prediction model. In addition, the prediction phase is optimized by using the novel ICOA algorithm to obtain the best set of parameters for the ML models. To ensure that the proposed hybrid statistical feature selection technique works as intended, it is implemented with the following machine learning models:

  • Decision tree

  • Random forest

  • Gradient boosting

  • Support vector machine

  • KNN

Some machine learning algorithms include a helpful built-in mechanism known as feature importance. These techniques are commonly used in forecasting because they allow close monitoring of the variables that matter most to the model. Depending on the situation, this information can be used to modify existing models by engineering new features or discarding noisy ones. The proposed hybrid feature selection framework is compared against this metric as one of its benchmarks. The model's analysis and evaluation proceed in five stages.

First, the models are built using all the features in the dataset and the prediction results are verified with several statistical assessment measures. Second, models are built using the feature importance techniques included within the algorithms, with only the most essential features being chosen. Third, the models are built using the proposed FMIG-RFE-SVM approach, which selects the most critical features, and the predicted outcomes are assessed. Fourth, the prediction phase is evaluated by comparing the novel ICOA with other optimization algorithms for optimizing the different ML parameters. Finally, the full framework is compared with some state-of-the-art approaches and models.
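As background for the second stage, the sketch below shows how a built-in feature importance can be extracted from a tree-based scikit-learn model; the synthetic data, feature names and mean-importance threshold are illustrative assumptions, not the procedure used to produce Table 5.

```python
# Minimal sketch of the inherent feature importance approach with a
# tree-based model. Data, feature names and threshold are illustrative only.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])

rf = RandomForestRegressor(random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)

# Keep only features whose importance exceeds the mean importance.
selected = importances[importances > importances.mean()].index.tolist()
print("Features kept by the built-in importance mechanism:", selected)
```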

5.4.1 Estimating the effectiveness of the proposed hybrid feature selection approach

To evaluate the efficacy of the proposed hybrid feature selection method, a set of experimental results is captured. The accuracy of the experimental models is evaluated using three feature sets: the features obtained via the proposed feature selection method, the features obtained via the built-in feature importance approach, and all of the features. The evaluation metrics define the effectiveness of the running model. The residuals acquired in the experiments measure the differences between the predicted and actual values, and the size of the residual spread indicates a model's efficacy and accuracy. As shown in Tables 5, 6 and 7, the evaluation metrics attained via the proposed hybrid feature selection technique outperform those acquired via the other investigated methods.

Table 5 Evaluation of ML models' performance using the inherent approach of feature importance
Table 6 Evaluation of ML models' performance using the proposed FMIG-RFE-SVM approach
Table 7 Effectiveness of a hybrid FMIG-RFE-SVM approach for machine learning models

Efficiency evaluation is a crucial part of developing a better model: it helps determine the framework that best describes the data and guides future iterations. A prediction's accuracy is evaluated by how close the prediction comes to the true value, i.e. how often the model produces correct results. Table 8 reports the accuracy of the tested models using the proposed FMIG-RFE-SVM hybrid approach for feature selection, the built-in feature importance approach, and the entire dataset.

Table 8 Optimized Prediction models evaluation without using Hybrid FMIG-RFE-SVM as feature selection

The results show higher accuracy when the models are evaluated with the proposed feature selection approach. Figure 5 visually depicts the performance of the different machine learning models, each trained using either the complete set of features, the inherent feature importance approach, or the subset of features generated by the proposed feature selection method.

Fig. 5

Machine learning model performance metrics with a all features of the dataset, b selected features using the inherent feature importance approach, and c features obtained using the proposed FMIG-RFE-SVM approach

Figures 6, 7, 8, 9 and 10 show visual representations of the accuracy of the machine learning models trained on the entire dataset and on the features selected by the proposed method. For instance, Fig. 6 illustrates how effective the proposed feature selection approach is for the Gradient boosting algorithm: when all of the features in the dataset are used, as shown in Fig. 6a, the algorithm achieves an accuracy of 84.22%, whereas with the proposed hybrid feature selection approach it reaches 86.77%, as depicted in Fig. 6b. In Fig. 7a, the random forest algorithm applied to the entire dataset achieves an accuracy of 90.84%; the proposed feature selection approach raises this to 91.98% (Fig. 7b). Figure 8a shows an accuracy of 79.25% when the decision tree method is applied to all dataset features, and Fig. 8b shows the accuracy obtained after using the proposed feature selection method. Using all features in the dataset, the SVM model achieves an accuracy of 86.22% (Fig. 9a), which increases to 88.91% after applying the proposed feature selection approach (Fig. 9b). Finally, in Fig. 10a the KNN model achieves an accuracy of 78.21% on all of the features, while the proposed feature selection approach yields 80.69% (Fig. 10b).

Fig. 6

The results of comparing the accuracy of the Gradient boosting with a all of the features in the dataset and b the proposed feature selection approach

Fig. 7

The results of a comparison of the accuracy of the Random forest model with a all of the features in the dataset and b the proposed feature selection approach

Fig. 8

The results of a comparison of the accuracy of the Decision-tree model with a all of the features in the dataset and b the proposed feature selection approach

Fig. 9

The results of a comparison of the accuracy of the SVM model with a all of the features in the dataset and b the proposed feature selection approach

Fig. 10

The results of a comparison of the accuracy of the KNN model with a all of the features in the dataset and b the proposed feature selection approach

5.4.2 Estimating the effectiveness of the prediction phase

This section evaluates the proposed prediction model for crop yield. To assess the proposed ICOA as an optimizer of the parameters of different machine learning models, ICOA is applied to the five utilized machine learning models and the results are captured. The ICOA parameter optimization is tested both with and without the proposed hybrid feature selection approach: Tables 9 and 10 report the results without and with the hybrid feature selection approach, respectively. According to Table 9, the performance of the ML models clearly increases after utilizing ICOA compared with Tables 5 and 7, because the chaotic Lévy-based Crayfish algorithm strengthens the search for the best set of parameters for the ML models. Moreover, Table 10 evaluates the proposed ICOA when adapting the parameters of the ML models in the presence of the hybrid feature selection approach. Parameter optimization combined with the hybrid feature selection approach substantially improves the predictive performance of the ML models. From Table 10, SVR obtains the best results, i.e. the lowest prediction errors, compared with the other algorithms; its superior performance indicates that a robust prediction can be obtained.

Table 9 Optimized Prediction model evaluation using Hybrid FMIG-RFE-SVM as feature selection
Table 10 Prediction model evaluation without using Hybrid FMIG-RFE-SVM as feature selection

For further evaluation of the proposed ICOA as a parameter optimizer, several optimization algorithms are compared with it. Table 11 presents the results of applying the different optimization algorithms to the SVM model; the MAE, MSE, RMSE, R2 and MedAE are used to differentiate between the approaches. According to Table 11, ICOA-SVM is clearly superior to all other algorithms, particularly COA-SVM, which denotes the original COA. The MAE and RMSE of ICOA-SVM are 0.151 and 0.228, respectively, the minimum among all optimizers. Therefore, the proposed ICOA-SVM is a promising algorithm for optimizing the parameters of ML models, particularly the SVM model. Figure 11 visualizes the comparison reported in Table 11 for further analysis.
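The full ICOA update rules are beyond a short snippet, but the way any population-based optimizer (including ICOA) scores a candidate SVR parameter set can be sketched as follows; the random-search stand-in, parameter ranges and synthetic data are illustrative assumptions, not the proposed algorithm or its settings.

```python
# Minimal sketch: a metaheuristic-style search over SVR hyperparameters that
# minimizes cross-validated MAE. The random sampling stands in for the ICOA
# update rules; ranges and data are illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=15, noise=0.2, random_state=0)

def fitness(params):
    C, epsilon, gamma = params
    model = SVR(C=C, epsilon=epsilon, gamma=gamma)
    # scikit-learn returns negative MAE, so negate it to get an error to minimize.
    return -cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_absolute_error").mean()

rng = np.random.default_rng(0)
candidates = [(rng.uniform(0.1, 100), rng.uniform(0.01, 1.0), rng.uniform(1e-3, 1.0))
              for _ in range(30)]
best = min(candidates, key=fitness)
print("Best (C, epsilon, gamma):", best, "-> CV MAE:", fitness(best))
```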

Table 11 Evaluation of the proposed ICOA compared with other optimization algorithms for the SVM model
Fig. 11

Comparison between ICOA and several optimization algorithms to optimize SVM

5.4.3 Comparison with some recent state-of-the-art approaches

To undertake a more extensive study of the suggested framework, it is compared with various current state-of-the-art techniques. In this experiment, several recently published methodologies are used as baselines to demonstrate the proposed model as a possible solution to the crop yield forecasting problem. The compared state-of-the-art techniques include RF [62], 1DCNN [25], LSTM-DBN [31] and CYPA [32]. The results show that the suggested model outperforms the other prediction models, indicating a robust model. Table 12 records the captured results in terms of MAE, MSE, R2 and MedAE. According to Table 12, the MAE and RMSE obtained by the proposed framework are the minimum among all state-of-the-art works, while CYPA ranks second best. These results indicate that the proposed framework presents an excellent contribution to the literature.

Table 12 Comparison of the proposed framework with recent state-of-the-art approaches

5.4.4 Statistical analysis on accuracy

Table 12 shows the statistical comparison of the proposed model to RF, 1DCNN, CYPA, and LSTM-DBN for agricultural yield prediction, and the comparison is evaluated in terms of accuracy. Because metaheuristic procedures are stochastic, each method is rigorously tested to ensure a reliable estimate. Several statistical measures are investigated: the mean, median, standard deviation, maximum, minimum, the Wilcoxon test at a significance level of 0.05 [63], and the Friedman rank [64]. The proposed framework achieves a maximum accuracy of 0.949, while RF achieves 0.853, LSTM-DBN 0.914, CYPA 0.939 and 1DCNN [24] 0.924. The average accuracy obtained by the proposed framework, 0.943, is the best among all compared works. In addition, the p-value of the proposed framework versus every other approach is less than 0.05, indicating that the differences between the proposed framework and the other approaches are statistically significant. Finally, the Friedman rank is used to order the approaches: the proposed framework is ranked first with a rank of 1.42, while CYPA is ranked second with 2.29.
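A minimal sketch of how such significance tests can be computed with SciPy is given below; the per-run accuracy vectors are placeholders, not the reported results.

```python
# Minimal sketch of the Wilcoxon and Friedman tests used in the statistical
# comparison. The accuracy vectors are placeholders for per-run accuracies.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

proposed = np.array([0.941, 0.949, 0.938, 0.946, 0.943])
cypa     = np.array([0.930, 0.939, 0.925, 0.934, 0.931])
rf       = np.array([0.851, 0.853, 0.846, 0.849, 0.852])

# Pairwise Wilcoxon signed-rank test at the 0.05 significance level.
stat, p = wilcoxon(proposed, cypa)
print("Proposed vs CYPA: p =", p)

# Friedman test across all approaches over the same runs.
stat, p = friedmanchisquare(proposed, cypa, rf)
print("Friedman test: p =", p)
```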

5.4.5 Comparative analyses of regression performance

Specifically, diagnostic regression plots [61] are built using the features from the proposed hybrid feature selection approach to validate the regression findings of the machine learning models. By providing a convenient set of tools for assessing a model's validity, diagnostic regression plots improve the exploratory analysis of the regression model. This examination may involve checking the model's underlying statistical assumptions or analyzing the model's structure by considering alternatives with fewer or different explanatory components. The plots are also helpful when searching for outliers, or for samples that have an outsized influence on the regression model's predictions but are not well represented by the data. Whenever a model is fit to data, it leaves behind residuals, which show how well or poorly the model represents the data and can reveal previously unknown patterns. These quantities let us check whether the regression assumptions hold and allow us to make educated guesses about improving the model. Figure 12 shows four diagnostic plots that depict the residuals in different ways. In this section, we summarize the results of evaluating the forecasting models that employ the proposed hybrid feature selection approach.

Fig. 12

Diagnostic residual plots for regression analysis. a the scale-location plot, b the residuals vs leverage plot, c the residuals vs fitted plot, and d the normal Q-Q plot
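The four plots of Fig. 12 can be reproduced along the following lines; the linear model and synthetic data below are illustrative stand-ins, not the models evaluated in this work, and the leverage is computed from the hat-matrix diagonal of a plain least-squares fit.

```python
# Minimal sketch of the four diagnostic residual plots (cf. Fig. 12) for a
# simple least-squares fit on synthetic data; a stand-in for the real models.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
fitted = LinearRegression().fit(X, y).predict(X)
residuals = y - fitted
std_resid = residuals / residuals.std()

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# (c) Residuals vs fitted: systematic curvature would indicate missed nonlinearity.
axes[0, 0].scatter(fitted, residuals, s=10)
axes[0, 0].axhline(0, color="grey")
axes[0, 0].set(title="Residuals vs fitted", xlabel="Fitted values", ylabel="Residuals")

# (d) Normal Q-Q plot of the residuals.
stats.probplot(residuals, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title("Normal Q-Q")

# (a) Scale-location plot: checks homoscedasticity.
axes[1, 0].scatter(fitted, np.sqrt(np.abs(std_resid)), s=10)
axes[1, 0].set(title="Scale-location", xlabel="Fitted values",
               ylabel="sqrt(|standardized residuals|)")

# (b) Residuals vs leverage, with leverage from the hat-matrix diagonal.
X1 = np.column_stack([np.ones(len(X)), X])
leverage = np.einsum("ij,ji->i", X1 @ np.linalg.pinv(X1.T @ X1), X1.T)
axes[1, 1].scatter(leverage, std_resid, s=10)
axes[1, 1].set(title="Residuals vs leverage", xlabel="Leverage",
               ylabel="Standardized residuals")

plt.tight_layout()
plt.show()
```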

5.4.6 Features of data distributions

The probability density function of the experimental models' predictions and of the actual crop yield data is examined to ascertain whether the proposed model retains the original distributional attributes of the data. The probability density function (PDF) is an analytical expression describing how the values of a random variable are distributed: the probability that the variable falls within a given interval equals the area under the PDF curve over that interval, so the PDF can be used to determine the probability of specific outcomes. The probability density curves for the raw data and the tested ML models are shown in Fig. 13. As Fig. 13 clearly shows, the random forest model matches the distributional features of the actual crop production data better than the other tested machine learning algorithms.

Fig. 13

Probability density curves of the actual data and of the data predicted by the proposed FMIG-RFE-SVM approach used with: b Random Forest, c Gradient Boosting, d Decision Tree, e Support Vector Machine, and f K-Nearest Neighbor
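One way to produce such density curves is with a Gaussian kernel density estimate from SciPy, as sketched below; the yield arrays are synthetic placeholders, not the study's data.

```python
# Minimal sketch of probability density curves for actual vs predicted yields
# (cf. Fig. 13), using SciPy's Gaussian KDE on synthetic placeholder data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
y_actual = rng.normal(loc=3.0, scale=0.5, size=500)          # stand-in yields
y_predicted = y_actual + rng.normal(scale=0.15, size=500)    # stand-in predictions

grid = np.linspace(min(y_actual.min(), y_predicted.min()),
                   max(y_actual.max(), y_predicted.max()), 200)
plt.plot(grid, gaussian_kde(y_actual)(grid), label="Actual")
plt.plot(grid, gaussian_kde(y_predicted)(grid), label="Predicted")
plt.xlabel("Crop yield")
plt.ylabel("Density")
plt.legend()
plt.show()
```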

6 Discussion

Results from the proposed model are discussed, and the future implications of this research are briefly outlined. Nonlinear residual patterns can be seen clearly in the residuals vs fitted plot: if the model fails to capture a nonlinear relationship between the response and the predictor variables, that relationship shows up as a systematic pattern in the residuals, whereas residuals scattered randomly about the zero line indicate that no such pattern remains.

Scale-location plots check whether the residuals are spread evenly across the range of the predictors, allowing us to test for homoscedasticity [65], i.e. the assumption of equal variance. A horizontal line with randomly placed points is preferable, and Fig. 12a shows that the residuals are spread in this manner. Not every outlier matters for the regression line; Cook's distance establishes a margin, and the influential points are those that fall beyond the norm and have a high Cook's distance score. As shown in Fig. 12b, there are no significant outliers of this kind. The regression diagnostic plots therefore confirm the improved model performance obtained with the proposed feature selection technique. Based on Fig. 12c, the model data appear to meet the linear regression assumptions well, and no unusual pattern emerges in the distribution of the residuals. The Q-Q plot reveals whether the residuals are normally distributed with a constant standard deviation; in the best case the residuals lie smoothly along a straight line with minimal deviation. If the residuals deviate significantly from what would be expected under a normal distribution, the confidence intervals and p-values do not correctly reflect the true extent of the variation in the data. Figure 12d depicts an approximately normal distribution of the residuals, with the points drawn almost perfectly along the diagonal line.

Factor analysis is an exploratory data analysis approach, performed in addition to the feature extraction techniques, to discover the essential or hidden variables. It reduces the number of potential factors, making it easier to draw conclusions from the data. Factor analysis is a linear statistical model that explains the variance among the observed variables and refers to the unobserved variables as "factors"; multiple variables with consistent response patterns are linked to the same factor. It examines whether the factors f1, f2, …, fn can explain the relationships between the observed variables x1, x2, …, xn. Factor analysis seeks to isolate latent factors by reducing the number of measured variables, and factor extraction or rotation can be used to accomplish this. The factor_analyzer library in Python is used to implement the factor analysis in the proposed work. Before running the factor analysis, the factorability of the dataset must be evaluated; the Kaiser–Meyer–Olkin (KMO) test is used to determine whether the data are suitable for factor analysis and whether the overall model and dataset are adequate. KMO values range from 0 to 1, with low values deemed inadequate for factor analysis.

For the crop dataset, the overall KMO is 0.83, which is a good fit for moving forward with factor analysis. The number of factors is determined from the eigenvalues in a scree plot, in which each factor is plotted against its eigenvalue, and factors with eigenvalues greater than one are retained. The scree plot shown in Fig. 14 reveals 32 factors with eigenvalues larger than 1, and together these factors account for 55% of the total variance. By analyzing large datasets, factor analysis can uncover hidden relationships and identify groups of connected variables.
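A minimal sketch of the factorability check and scree plot is given below, assuming the Python factor_analyzer package with its calculate_kmo and FactorAnalyzer interfaces; the DataFrame is a synthetic stand-in for the crop dataset, so the KMO value and factor count will differ from those reported here.

```python
# Minimal sketch: KMO factorability test and scree plot with factor_analyzer.
# The random DataFrame is a stand-in for the crop dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_kmo

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 40)),
                  columns=[f"x{i}" for i in range(40)])

kmo_per_variable, kmo_total = calculate_kmo(df)
print("Overall KMO:", round(kmo_total, 2))    # reported as 0.83 for the crop data

fa = FactorAnalyzer(rotation=None)
fa.fit(df)
eigenvalues, _ = fa.get_eigenvalues()

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.axhline(1, color="grey", linestyle="--")  # Kaiser criterion: eigenvalue > 1
plt.xlabel("Factor number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```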

Fig. 14

Factor analysis scree plot defined with the number of factors

In any case, the same data components might be used to support competing explanations. The proposed feature selection technique yields 32 deciding factors, close to the number of selected features. The overall performance and comparison findings demonstrate that the proposed hybrid feature selection approach yields superior results compared to the other feature selection processes. As a result, the frameworks' prediction ability and efficiency are enhanced, as evidenced by lower mean squared error (MSE), root mean square error (RMSE), mean absolute error (MAE) and median absolute error (MedAE), and a higher coefficient of determination. The diagnostic plots additionally show the improved exploratory performance of the models.

7 Conclusion and future work

Agriculture is one of the most challenging domains in which to apply analytical results. Weather, soil, crop diseases, and pest infestations affect agricultural productivity and precision agriculture. Machine learning can transform agribusiness by incorporating yield forecasting components: machine learning models assess facts, interpret data, and provide in-depth process knowledge. Feature selection based on statistical measures is critical for streamlining the predictive model's learning process and for representing the dataset efficiently. This paper proposes a novel framework with a new hybrid feature selection strategy for machine learning models that predict paddy crop production from soil, climate, and groundwater hydrochemical parameters. The FMIG-RFE-SVM method identifies the most crucial crop-yield features for the study area by combining the filter approach with the RFE-SVM wrapper approach; the filter approach eliminates redundant and non-essential features using information gain and the Fisher score, resulting in a smaller subset. These features can be used to build an intelligent agricultural model for crop prediction. Future research may focus on new cutting-edge fuzzy-based clustering algorithms that can provide more helpful information for yield prediction, and additional features may be included in the dataset to improve the accuracy of the prediction model.