1 Introduction

Agriculture has always been based on mass labor and intensive physical work, with the results heavily dependent on weather and climate conditions [42]. Despite the everlasting uncertainty of crop yields, the increasing demand for food due to rapid population growth, Fig. 1, has provided farmers with a stable income and thus made agriculture the largest workforce absorber. Even nowadays, despite modern technology and automation of agricultural production, a third of the world’s economically active population derives its income from agriculture [22].

Fig. 1

Global population growth with key milestones, based on [48, 66]

Fig. 2

Global cereal yield and production versus population and land used for production [50]

New technologies facilitated and mechanized work in agriculture, reducing the number of required farmers [24]. In the second half of the twentieth century, a significant effort known as the Green Revolution [17] was made to increase the production of high-yielding cereals, especially wheat and rice, to suppress hunger and increase yield per plant, Fig. 2. Food production transitioned from being local in character, i.e., farmers producing food for their families or communities, to global food trade, which has made diets around the world more diverse and brought new business opportunities to farmers and processing industries [49]. This process was not interrupted even by major events such as the 2007–2008 Global Financial Crisis (GFC), the recent COVID-19 pandemic, or the ongoing conflict in Ukraine.

It is necessary to emphasize the fact that the problem of hunger in the world has not been solved yet. Many countries still have a significant percentage of the population that cannot meet their nutritional energy needs on a regular basis [36].

Fig. 3

Total area versus arable land (left) and percentage of global arable land (right), based on [6, 65]

Fig. 4

Arable land versus population versus cereal yield per country on a global scale, the most populated countries, based on [6, 64]

Fig. 5

Arable land versus population versus cereal yield per country on global scale, countries with prevalence of undernourishment (% of population), based on [6, 36, 64]

Arable land comprises only a small part of the total area of each country. According to currently available statistical data, \(\sim\) 15.83 million square kilometers (\(\text{mkm}^{2}\)) are cultivated on the global level [6], i.e., just under 11% of the total land mass, and yet only 10 countries control more than half of the globally available arable land, Fig. 3.

A simplified assessment of how effectively the most populous countries currently address food security can be made by comparing the population-to-available-arable-land ratio and the cereal yield, both against the global average. Among the ten most populous countries, Fig. 4, the US, Brazil, and Russia have the most favorable population-to-arable-land ratios (< 1), yet only the US and China achieve significantly above-average cereal yields, \(\sim\) 200% and \(\sim\) 150% of the global average, respectively. Countries with a prevalence of undernourishment, Fig. 5, mostly have an unfavorable ratio of population to arable land (ratio > 2) and cereal yields significantly below the global average (\(\sim\) 40%).

Although an evident increase in crop yield was achieved, nutritional quality failed to keep pace. Modern cereals suffer from deficiencies such as low-quality proteins and a lack of essential amino acids, vitamins, and minerals [53]. The so-called ancient grains and heirloom varieties became popular in the early twenty-first century, but their lower yield per plant may present a problem in resource-poor areas, where crops used by producers to feed their own families and livestock in subsistence agriculture (food crops) are being replaced by crops grown for profit (cash crops) [18].

Intensive farming with synthetic fertilizers and pesticides increased crop productivity but also increased environmental pollution and its impact on the quality of life; growing public awareness of these effects revived interest in organic, regenerative, and sustainable agriculture. The European Union was a pioneer in this field, introducing certification of organic food in 1991 [20]. Research interest in alternative technologies was reestablished, primarily in pest management, selective breeding, and controlled environment agriculture [60].

Until recently, it was believed that the only way to survive in agriculture was to increase holdings and productivity, with the mandatory (mis)use of chemical fertilizers and appropriate machinery, but as mentioned before, this approach has brought a whole new set of challenges. Research most often considers the scenario of industrial agricultural production, while individual households and small farms are classified as a recurrence of the pre-industrial form of production [22]. However, during the COVID-19 pandemic, a significant number of urban households classified themselves as food insecure due to occasional food shortages caused by disruptions in supply chains and began producing food for their own needs through gardening, fishing, backyard livestock, etc. [40]. Trends of exurbanization and counter-urbanization gained popularity, with part of the urban population trading city life for life in smaller but healthier environments. These people are mostly digitally savvy, work remotely, and are not interested in industrial agriculture but may produce organic food for their own needs [11].

Crop yield forecasting could help alleviate many uncertainties associated with food production. However, as many factors influence production, this is not a straightforward task. Weather fluctuations, unpredictable disease outbreaks, natural disasters, and many other factors can severely and unforeseeably impact yields. To help tackle this ever-pressing challenge, robust techniques are needed, capable of responding to an ever-changing environment.

One possible approach comes from the application of artificial intelligence (AI). These algorithms mathematically mimic behaviors observed in biological brains and, given enough computational time and data, can adjust their behavior to suit a specific problem without being explicitly programmed to do so. By applying powerful AI algorithms to the task of crop yield prediction, nonlinear relations between various factors can be observed and leveraged to produce more accurate forecasts. These can become a valuable tool for both farmers and policy-makers, allowing preemptive measures to be taken to prevent crop failure and even famine. Preceding works have explored the potential of powerful machine learning (ML) techniques for incorporating emerging technologies into agricultural systems [10], as well as for other difficult challenges in computing [3, 47, 59].

A notably interesting emerging class of AI algorithms is that of neuroevolutionary algorithms, inspired by evolution and the connective structures of the brain, as well as the processes that shape these connections. While more traditional algorithms rely on simulated predefined structures divided into layers and organized into a network, neuroevolutionary algorithms are capable of evolving the structure of the network to better suit the specific problem. The process of selection is similar to that of the genetic algorithm (GA) [38], where each potential solution is assigned a describing genome that is altered, mutated, and combined with other solutions to attain an optimum. This can often result in simpler and smaller networks, requiring less computational power to execute. Furthermore, while traditional networks heavily depend on weights and biases, the approach of using weight agnostic neural networks (WANN) [19] simplifies the process of selecting weights, pushing the responsibility for addressing the problem toward the network architecture.

The motivation behind employing WANN in this research lies in the intriguing concept that a network’s architecture can be dynamically evolved to address specific challenges, diverging from the conventional emphasis solely on weights and biases. This relatively unexplored avenue within computer science offers a novel approach to problem-solving. Additionally, WANN’s distinct advantage in mitigating the computational complexity associated with traditional backpropagation methods serves as a key motivation. By streamlining the network optimization process during training, WANN provides an efficient and promising framework for enhancing crop yield forecasting models, prompting further exploration and investigation within the scope of this work.

The performance of AI algorithms is heavily connected to parameter values that define behavior. Often referred to as hyperparameters, these are usually defined by default to ensure a good general performance of the algorithm. Additional tuning is often required for the algorithm to address a specific problem effectively. This task was traditionally tackled via trial and error; however, with the increasing number of hyperparameters present in newer algorithms, automated processes are needed.

Hyperparameter tuning can often be considered an NP-hard task. Therefore, to tackle it effectively, algorithms capable of resolving NP-hard problems are required. One notably interesting subgroup, known for the ability to tackle NP-hard problems with reasonable computational resources and within realistic time frames, is that of swarm intelligence algorithms. These algorithms simulate cooperative populations usually observed in nature through a set of simple rules. By following these rules, individuals allow complex behaviors to emerge on a global scale, enabling effective optimization to take place.

To effectively tackle the pressing issue of crop yield forecasting, several related problems need to be addressed as well. To apply WANN efficiently, adequate parameter selections need to be made. Furthermore, the shared weights need to be adjusted to attain satisfactory results when the approach is applied to this specific problem.

This work proposes a two-layer cooperative framework based on metaheuristic optimization algorithms. The first layer (L1) is tasked with optimizing the WANN hyperparameters to efficiently select the best possible network architecture suited to crop yield forecasting. The best models created by L1 are passed on to layer two (L2), where the shared weights are further optimized. Additionally, a modified version of the recently introduced reptile search algorithm (RSA) is developed and applied within the context of the optimization framework. This approach has been evaluated on two real-world datasets, and the results compared with several WANNs tuned by state-of-the-art metaheuristics, as well as several standard ML and AI models applied to the same problem. Finally, the best-performing models have been subjected to statistical evaluations to determine the significance of the attained improvements, followed by a Simulator for Autonomy and Generality Evaluation (SAGE) [21] analysis to determine which features have the highest impact on the model predictions.

The scientific contributions of this work may be summarized as the following:

  • A proposal of a novel WANN-based approach for forecasting crop yields.

  • An introduction of a cooperative two-layer framework for WANN structure optimization and shared weights tuning.

  • An introduction of a modified version of the RSA metaheuristics which is incorporated into the proposed framework.

The remainder of this work is structured according to the following: Sect. 2 presents preceding research that has contributed to the conducted work. In Sect. 3, the proposed method is described in detail, and the logic behind the modified metaheuristic is elaborated. The experimental setup and developed two-layer experimental framework, as well as the two utilized datasets, are described in Sect. 4, and the attained results are presented and discussed in Sect. 5. Finally, Sect. 6 gives a few concluding words on the work and presents proposals for future research.

2 Background and related work

The concept of precision agriculture implies the application of Information and Communication Technologies (ICT) to provide better situational awareness on the farm, and therefore provide the possibility of making more effective decisions. Smart farms are systems with a multi-layered structure that allows individual components to be added or removed according to specific needs. IoT enables the collection, transmission, and exchange of information between components, while AI brings automation of system management through autonomous decision-making.

Optimum crop management requires prior soil analysis and assessment of irrigation needs. Then, the estimate of actual crop growth and yield can be compared with projections, taking into account weather conditions and other factors. Deep convolutional neural networks (CNN or DCNN) can recognize objects by shape, color, and texture. Computer vision (CV) and the YOLOv3 algorithm may be successfully used, with a precision of over 92%, by harvesting robots for fruit detection and yield counting, as described in [32]. AI, CV, and YOLO can also help in the real-time detection of crop diseases, e.g., early blight disease in potato fields, apple scab and rust, or grapevine disease, to name a few [51]. Existing algorithms for fruit recognition are the basis for creating a new generation of robots capable of solving the problem of labor shortages for fruit harvesting.

Precise crop yield estimation may prove to be difficult due to complex, interrelated environmental factors. Significant variations in the assessment can be influenced by weather changes in different stages of plant growth, spatial variability of soil properties, crop rotation, fertilization, irrigation, etc. There are two basic approaches to estimating crop yields: crop growth models and data-driven models.

The existing literature on crop yield forecasting has witnessed advancements in various methodologies; however, a notable research gap exists concerning the exploration and optimization of WANN in this context. While the potential of WANNs to generate lightweight networks with shared weights has been acknowledged, their application and optimization for crop yield forecasting remain underexplored. The majority of studies in this domain focus on traditional approaches or lack comprehensive investigations into leveraging the capabilities of WANNs. This research aims to bridge this gap by introducing a novel two-layer cooperative framework and a modified metaheuristic to optimize WANN parameters for enhanced crop yield forecasting accuracy. The proposed methodology addresses the current gap by providing a systematic exploration of WANNs in the specific context of crop yield prediction, offering insights that contribute to the advancement of predictive modeling in agriculture.

Various mathematical models, the so-called crop growth models can be used to simulate the interaction of plant physiological processes with the environment. For the model to work, it is necessary to provide real data on the type of soil, solar radiation, precipitation, temperature changes, adopted management practices, etc. Semi-empirical crop models may provide fair results [25, 46, 61], but they are expensive in terms of time and money and impractical for mass applications and agricultural planning.

The empirical approach is more practical and easier to use than the crop growth model. Here, yield data from the recent past is used, and a set of the most influential parameters on yield variation is determined. Accepting these parameters as independent, and the harvest yield as the dependent variable, empirical equations are formed to calculate the coefficients of these parameters, which are then used for the final estimation of the crop yield. This approach is economically more viable and easier to implement and does not require prior information about the physiological processes involved in plant growth or a predefined model structure [39].

Modern ICT provided the basis for agriculture to become more efficient, first by massively embracing web technologies in the early twenty-first century. A decade later, the emergence of affordable sensors, microcontrollers, single-board computers (SBC), and eventually wireless sensor networks (WSN), has inevitably led to the ever-expanding use of modern electronics in agriculture, both industrial and subsistence, with the aim of data collection, transfer, aggregation, and analytics, all toward tasks automation and increased productivity. Internet of Things (IoT), fog computing (FC), and cloud computing (CC) have become indispensable components of modern farming (Fig. 6).

Fig. 6

Data flow in agriculture—Active/passive remote sensing, IoT and fog/cloud computing

Metaheuristic algorithms have proven to be powerful optimization algorithms, with the ability to address even NP-hard problems. Swarm intelligence algorithms are notably interesting for their ability to tackle these problems using a relatively simple set of rules imposed on a population. By following these rules, overarching behaviors occur on a global scale leading the algorithm toward promising areas in the search space and eventually optimal solutions.

Some notably powerful algorithms are inspired by nature such as the artificial bee colony (ABC) [15, 29] algorithm, firefly algorithm (FA) [30, 67], particle swarm optimization (PSO) [62] algorithm, bat algorithm (BA) [68], Harris hawks optimization algorithm (HHO) [23], and whale optimization algorithm (WOA) [37]. More novel algorithm examples include the reptile search algorithm (RSA) [1] and the chimp optimization algorithm (ChOA) [31]. Finally, while demonstrating admirable performance, these algorithms do not come without shortcomings, and one notable approach for improving performance comes from algorithm hybridization.

Hybrid algorithms have been applied to several real-world problems and demonstrated admirable performance. Some notable examples come from health care [8, 70]. Ways of tackling computer security issues have also been improved through the use of hybrid algorithms [26, 69]. Forecasting has also been improved with hybrid algorithms as demonstrated on crude oil [27], stock prices [4, 28], and energy prediction [45, 56].

3 Methods

The following section presents an overview of WANN principles. After this, the original RSA is presented followed by the introduced modifications. Finally, the proposed two-layer framework is presented and discussed.

3.1 Weight agnostic artificial neural networks

Weight agnostic neural network is a neural architecture search (NAS) technique and an evolutionary strategy for developing neural networks in which the model weights are not trained [19]. The goal is to find the smallest neural network architecture capable of coping with several reinforcement learning (RL) tasks without training weights. WANNs are inspired by animals whose newborns come with innate reflexes (e.g., walking, swimming, and hiding from predators), not having to acquire them through trial and error, i.e., training. The task of WANN is to provide satisfactory performance with a single common weight, even when randomly assigned.

WANNs imply smaller network sizes, with fewer connections between nodes and an optimized overall architecture. Performance depends exclusively on the network architecture and can prove inferior to other methods, depending on the given scenario. The model weights are not optimized for the given task and are therefore underutilized, and there are no clear stopping criteria since the search space is unbounded. A low number of iterations will lead to poor network performance, while too many iterations will take more computing resources than necessary [35].

The WANN algorithm can be described through the following steps:

  1. Initialize. Generate an initial population of minimal neural network topologies, consisting of input and output layers only.

  2. Evaluate. Assess each network's performance, with a different common weight value assigned in each pass. Using a fixed sequence, e.g., − 2, − 1, − 0.5, + 0.5, + 1, and + 2, helps reduce the variance between evaluations; values greater than 2 lead to similar behavior due to saturation of the activation functions.

  3. Rank. After evaluation, the networks are ranked based on achieved performance and simplicity. When two network models show similar performance, the one with the simpler structure is chosen.

  4. Vary. Create new network topologies by mutating existing simple networks: adding connections, inserting nodes, or changing activation functions. The best topologies are chosen through tournament selection.

At this point, the algorithm may return to Step 2, and the process repeats. The outcome is a weight agnostic topology of gradually increasing complexity and enhanced performance in each successive generation. When the number of iterations hits the allowed maximum, the algorithm stops.
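The steps above can be condensed into a toy Python sketch (illustrative only, not the authors' implementation): a "network" is reduced to a chain of activation nodes, and fitness is averaged over the fixed shared-weight sequence from Step 2; all function names and the toy fitness target are our assumptions.

```python
import math
import random

SHARED_WEIGHTS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]  # fixed sequence from Step 2
ACTIVATIONS = {
    "linear": lambda x: x,
    "tanh": math.tanh,
    "relu": lambda x: max(0.0, x),
}

def evaluate(genome, weight, x=1.0):
    # Propagate one input through the node chain using a single shared weight.
    for act in genome:
        x = ACTIVATIONS[act](weight * x)
    return x

def fitness(genome, target=0.5):
    # Mean error across shared weight values (lower is better), plus a small
    # complexity penalty so simpler topologies rank higher (Step 3).
    err = sum(abs(evaluate(genome, w) - target) for w in SHARED_WEIGHTS)
    return err / len(SHARED_WEIGHTS) + 0.01 * len(genome)

def mutate(genome, rng):
    # Step 4: insert a node or change an activation function.
    g = list(genome)
    if rng.random() < 0.5:
        g.append(rng.choice(list(ACTIVATIONS)))
    else:
        g[rng.randrange(len(g))] = rng.choice(list(ACTIVATIONS))
    return g

rng = random.Random(0)
population = [["linear"] for _ in range(8)]            # Step 1: minimal topologies
for _ in range(30):                                    # Steps 2-4 repeated
    ranked = sorted(population, key=fitness)
    parents = ranked[: len(population) // 2]           # truncation-style selection
    population = parents + [mutate(rng.choice(parents), rng) for _ in parents]
best = min(population, key=fitness)
```

Because ranked parents survive unchanged, the best fitness in the population never worsens across generations, mirroring the gradually improving topologies described above.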

WANNs are capable of learning abstract associations, without the need for encoding explicit relationships between inputs. The use of learned features can be evaluated through the execution of continuous control tasks, e.g., CartPoleSwingUP, BipedalWalker, and CarRacing-v0 tasks as described in [5, 19].

Unlike conventional fixed topology networks, which require extensive tuning to produce the desired behavior, WANNs may accomplish this with random shared weights due to an architecture strongly biased toward the solution. Although the magnitude of the weights may not be crucial, their relative values and consistency of sign are, and thus WANNs can fail with randomly assigned individual weights. Finally, the use of a single shared weight is much simpler compared to the use of gradient-based methods.

Besides RL tasks, WANNs may also be used to solve high-dimensional classification tasks, e.g., image classification, as demonstrated on the MNIST dataset in [19, 34]. Restricted to a single weight value, WANNs performed MNIST digit classification as well as a single-layer neural network with thousands of weights, and yet the WANN structure remains flexible enough to allow further weight training and accuracy improvements.

The WANN structure provides different predictions at each weight value, each of which may be treated as a distinct classifier. This makes it possible to use a single WANN with multiple weight values as a self-contained ensemble. Conversely, as WANNs are optimized to perform well using a shared weight over a range of values, this single parameter can be tuned to increase network performance, which may prove useful in few-shot learning [16] and continual learning [43].
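The self-contained ensemble idea can be sketched as follows (illustrative only, not the paper's implementation); `tiny_wann` is a hypothetical stand-in for an evolved topology, and the weight sequence reuses the fixed values mentioned earlier.

```python
import math

def tiny_wann(x, weight):
    # Hypothetical stand-in for an evolved topology: two nodes, one shared weight.
    hidden = math.tanh(weight * x)
    return weight * hidden

def ensemble_predict(x, weights=(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)):
    # Each shared weight value yields a distinct predictor; average their outputs.
    return sum(tiny_wann(x, w) for w in weights) / len(weights)
```

Each weight value defines a different input-output mapping over the same architecture, so averaging them acts as an ensemble without storing more than one network.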

3.2 Original reptile search algorithm (RSA)

Inspired by the social, hunting, and encircling behaviors of crocodiles, the RSA is a novel gradient-free and population-based optimization algorithm originally introduced by [1]. By mathematically simulating these processes, the RSA can address complex tasks, and simulating agent cooperation further augments its robustness. The algorithm comprises several stages, described below.

3.2.1 Initialization stage

The first step in the optimization procedure is creating a population of agents (X) as per Eq. 1 that represents potential solutions. These solutions are created through a stochastic process. The best-attained solution is treated as optimal through subsequent iterations.

$$\begin{aligned} X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,j} & \cdots & x_{1,n} \\ x_{2,1} & \cdots & x_{2,j} & \cdots & x_{2,n} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{i,1} & \cdots & x_{i,j} & \cdots & x_{i,n} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{N,1} & \cdots & x_{N,j} & \cdots & x_{N,n} \end{bmatrix} \end{aligned}$$
(1)

in this context, X refers to a collection of potential solutions that are generated randomly using Eq. 2. Here \(x_{i,j}\) represents the value at the j-th position of the i-th solution. N represents the total number of potential solutions in the set X, and n represents the dimension size of the given problem.

$$\begin{aligned} x_{i,j} = \text{rand} \times (\text{UB} - \text{LB}) + \text{LB}, \quad j = 1, 2, \ldots , n \end{aligned}$$
(2)

in which the term rand refers to a randomly generated value. The lower and upper bounds of the given problem are represented by LB and UB, respectively.
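The initialization of Eqs. (1) and (2) can be sketched in Python as follows (a list-based illustrative version with scalar bounds; the function name is ours, not from the original paper):

```python
import random

def init_population(N, n, LB, UB, seed=None):
    """Generate N candidate solutions of dimension n within [LB, UB] (Eqs. 1-2)."""
    rng = random.Random(seed)
    # Each component follows x_ij = rand * (UB - LB) + LB
    return [[rng.random() * (UB - LB) + LB for _ in range(n)] for _ in range(N)]
```

In the general case, LB and UB may be per-dimension vectors; scalars are used here for brevity.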

3.2.2 Encircling (exploration) stage

The algorithm employs two distinct search stages: exploration and exploitation. The transition between behaviors is determined by four variables, which involve separating iterations into four segments. During the exploration phase, the RSA deploys different search strategies to explore the search space and approach a better solution. These strategies include the high walking strategy and the belly walking strategy.

The current stage of searching is governed by two conditions: the high walking movement strategy is triggered for t values less than or equal to \(\frac{T}{4}\), and the belly walking movement strategy is activated for t values between \(\frac{T}{4}\) and \(\frac{T}{2}\). These conditions together cover the first half of the total iterations, with the high walking and belly walking strategies utilized in their respective quarters. Both of these approaches involve exploration-based search techniques. Additionally, to generate diverse solutions and explore varied regions, a stochastic scaling coefficient is considered for each element. This coefficient follows a simple rule that mimics the encircling behavior of crocodiles. The position-updating equations for the exploration stage are outlined in Eq. 3.

$$\begin{aligned} x_{i,j}(t+1) = {\left\{ \begin{array}{ll} \text {Best}_j(t) \times -\eta _{(i,j)}(t) \times \beta - R_{(i,j)}(t) \times \text{rand}, &{} t \le \frac{T}{4} \\ \text {Best}_j(t) \times x_{(r_1,j)} \times \text {ES}(t) \times \text{rand}, &{} \frac{T}{4} < t \le \frac{T}{2} \end{array}\right. } \end{aligned}$$
(3)

where the j-th position in the best solution obtained so far is represented by \(\text {Best}_j(t)\). The variable rand represents a random number ranging from 0 to 1. The value of t indicates the current iteration number, while T represents the maximum iteration count. The operator \(\eta _{(i,j)}\) is used for hunting and corresponds to the j-th position in the i-th solution; it is computed by applying Eq. (4). The parameter \(\beta\), fixed at 0.1, determines the explorative accuracy during the encircling stage and governs the high walking behavior over the iterations. The reduction function, \(R_{(i,j)}\), is used to decrease the search area and is calculated using Eq. (5). The position of the i-th solution is denoted by \(x_{(r_1,j)}\), where \(r_1\) is a random integer between 1 and N, and N represents the population size. The probability ratio, \(\text{ES}(t)\), takes decreasing random values in the range \([-2, 2]\) over the iterations and is calculated using Eq. (6).

$$\begin{aligned} \eta _{(i,j)} = \text {Best}_j(t) \times P_{(i,j)}, \end{aligned}$$
(4)
$$\begin{aligned} R_{(i,j)} = \frac{\text {Best}_j(t) - x_{(r_2, j)}}{\text {Best}_j(t) + \epsilon }, \end{aligned}$$
(5)
$$\begin{aligned} \text {ES}(t) = 2 \times r_3 \times \left( 1 - \frac{t}{T} \right) , \end{aligned}$$
(6)

in the given equations, \(\epsilon\) is a small value, and \(r_2\) is a random integer from the range [1, N]. The coefficient 2 in Eq. (6) generates values decreasing from 2 toward 0, while \(r_3\) represents a random integer value in the range \([-1, 1]\). The percentage difference between the j-th location of the best solution obtained so far and the j-th position of the current solution is indicated by \(P_{(i,j)}\). This value is computed with Eq. (7)

$$\begin{aligned} P_{(i,j)} = \alpha + \frac{x_{(i,j)} - M(x_i)}{\text {Best}_j(t) \times (\text{UB}_{(j)} - \text{LB}_{(j)}) + \epsilon } \end{aligned}$$
(7)

in which \(M(x_i)\) represents the average position of the i-th agent, which can be determined using Eq. (8). The upper and lower boundaries of the j-th position are represented by \(\text{UB}_{(j)}\) and \(\text{LB}_{(j)}\), respectively. The parameter \(\alpha\), fixed at 0.1, represents a sensitive control value that determines the exploration accuracy, i.e., the difference between agent fitness during hunting through the run.

$$\begin{aligned} M(x_i) = \frac{1}{n} \sum _{j=1}^{n} x_{(i,j)}. \end{aligned}$$
(8)
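Assuming simple list-based agents and scalar bounds, the encircling-stage update of Eqs. (3)-(8) can be sketched in Python; the function names and the final bound-clamping step are our assumptions, not part of the original formulation:

```python
import random

ALPHA, BETA, EPS = 0.1, 0.1, 1e-10   # fixed parameters and small epsilon

def mean_position(xi):
    # Eq. (8): average position of agent i.
    return sum(xi) / len(xi)

def evolutionary_sense(t, T, rng):
    # Eq. (6): decreasing random values in [-2, 2]; r3 is an integer in [-1, 1].
    r3 = rng.choice([-1, 0, 1])
    return 2 * r3 * (1 - t / T)

def exploration_update(X, best, t, T, LB, UB, rng):
    """One encircling-stage pass (Eq. 3) over every agent and dimension."""
    N, n = len(X), len(X[0])
    new_X = []
    for i in range(N):
        M = mean_position(X[i])
        row = []
        for j in range(n):
            P = ALPHA + (X[i][j] - M) / (best[j] * (UB - LB) + EPS)  # Eq. (7)
            eta = best[j] * P                                        # Eq. (4)
            r2 = rng.randrange(N)
            R = (best[j] - X[r2][j]) / (best[j] + EPS)               # Eq. (5)
            if t <= T / 4:                                           # high walking
                x = best[j] * -eta * BETA - R * rng.random()
            else:                                                    # belly walking
                r1 = rng.randrange(N)
                x = best[j] * X[r1][j] * evolutionary_sense(t, T, rng) * rng.random()
            row.append(min(max(x, LB), UB))                          # clamp to bounds
        new_X.append(row)
    return new_X
```

A full RSA run would call this for all iterations with \(t \le \frac{T}{2}\), updating the best solution after each pass.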

3.2.3 Hunting (exploitation) stage

This section describes the hunting (exploitation) behavior of the RSA, modeled on two crocodile strategies: hunting coordination and hunting cooperation. These strategies utilize different intensification techniques, which focus on exploiting local search areas. With this intensified approach, crocodiles can approach their target prey more easily than with the encircling mechanisms. As a result, the exploitation stage can identify a near-optimal solution, although it may require multiple attempts. In addition, during this stage, exploitation mechanisms are utilized to conduct a more focused search close to the optimal solution, while also emphasizing communication between the mechanisms.

The RSA’s exploitation mechanisms use two primary search strategies (hunting coordination and hunting cooperation) to explore potential solutions and locate the optimal solution. These strategies are represented mathematically in Eq. (9). During this phase, the search is guided by specific conditions: The hunting coordination strategy is employed when t is between \(\frac{T}{2}\) and \(\frac{3T}{4}\), while the hunting cooperation strategy is used when t is between \(\frac{3T}{4}\) and T. Additionally, stochastic methods are used to generate denser solutions and focus locally on promising regions. To simulate the hunting behavior of crocodiles, the authors employed a simple rule. The position-updating equations for the exploitation phase are given in Eq. (9).

$$\begin{aligned} x_{i,j}(t+1) = {\left\{ \begin{array}{ll} \text {Best}_j(t) \times P_{(i,j)}(t) \times \text{rand}, &{} \frac{T}{2} < t \le \frac{3T}{4} \\ \text {Best}_j(t) - \eta _{(i,j)}(t) \times \epsilon - R_{(i,j)}(t) \times \text{rand}, &{} \frac{3T}{4} < t \le T \end{array}\right. } \end{aligned}$$
(9)

The variable \(\text{Best}_j(t)\) represents the j-th location in the best solution found up to the current time step t. \(\eta _{(i,j)}\) refers to the hunting operator for the j-th location in the i-th solution and is determined by Eq. (4), with \(\epsilon\) denoting a small constant value. The variable \(P_{(i,j)}\) represents the percentage difference between the j-th location in the best solution and the j-th location in the current agent and is computed using Eq. (7). Finally, \(R_{(i,j)}\) is applied to shrink the search space and is computed using Eq. (5).

The exploitation search mechanisms, including hunting coordination and cooperation, aim to avoid being stuck in local optima. The mechanisms help the exploration search find the optimal agent and maintain diversity among candidate agents. The authors designed two parameters, \(\beta\) and \(\alpha\), to generate a stochastic variable following every iteration, which facilitates exploration during the early iterations and the later ones. This aspect of the search is particularly useful when faced with local stagnation, especially in the final iterations.
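Under the same list-based assumptions as the exploration sketch, the hunting-stage update of Eq. (9) can be written as follows; again, the function name and the bound clamping are ours:

```python
import random

EPS = 1e-10    # small epsilon from Eqs. (5), (7), (9)
ALPHA = 0.1    # sensitive control parameter from Eq. (7)

def exploitation_update(X, best, t, T, LB, UB, rng):
    """One hunting-stage pass (Eq. 9): coordination for T/2 < t <= 3T/4,
    cooperation for 3T/4 < t <= T. Positions are clamped to [LB, UB]."""
    N, n = len(X), len(X[0])
    out = []
    for i in range(N):
        M = sum(X[i]) / n                                            # Eq. (8)
        row = []
        for j in range(n):
            P = ALPHA + (X[i][j] - M) / (best[j] * (UB - LB) + EPS)  # Eq. (7)
            if t <= 3 * T / 4:                                       # hunting coordination
                x = best[j] * P * rng.random()
            else:                                                    # hunting cooperation
                eta = best[j] * P                                    # Eq. (4)
                r2 = rng.randrange(N)
                R = (best[j] - X[r2][j]) / (best[j] + EPS)           # Eq. (5)
                x = best[j] - eta * EPS - R * rng.random()
            row.append(min(max(x, LB), UB))
        out.append(row)
    return out
```

Note how the cooperation branch perturbs the best solution directly, producing the focused local search described above.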

Algorithm 1
figure a

Original RSA pseudocode

3.3 Multi-swarm RSA (MSRSA)

Metaheuristic algorithms rely on an effective balance between their two primary mechanisms: exploration and exploitation. Exploration helps algorithms locate promising areas, while exploitation focuses on promising regions, helping locate near-optimal (sub-optimal) solutions within a smaller region. While these mechanisms help metaheuristics overcome many difficult and even NP-hard tasks, it is also important to note that, as per the no free lunch (NFL) theorem of optimization [63], no single metaheuristic is equally suited to all problems. All metaheuristics have certain advantages as well as limitations, and constant experimentation with and improvement of existing metaheuristics are essential for determining the most suitable tools for tackling emerging challenges.

One promising approach for tackling deficiencies present in certain metaheuristics is hybridization. By combining attributes of compatible algorithms, the resulting approach can overcome the deficiencies of the originals and even produce results that are more than the sum of their parts. While this may slightly increase complexity, hybrid algorithms often demonstrate performance that justifies the increase. Two major approaches to hybridization exist today: low-level hybridization (LLH) and high-level hybridization (HLH). In LLH, the search mechanisms of one algorithm are replaced or augmented by mechanisms of another, while in HLH the combined algorithms remain self-contained, with one guiding or refining the other.

The introduced metaheuristic takes the LLH approach, introducing mechanisms of two well-known metaheuristics, the ABC and FA, into the robust base of the novel RSA. The FA is well known for its powerful exploitation mechanism, while the ABC possesses a powerful exploration mechanism. The ABC and FA are, therefore, used to boost the exploration and exploitation of the original RSA, respectively.

The initialization procedure for the introduced method incorporates random population generation as well as two additional mechanisms to boost initial population diversity: chaotic map initialization and quasi-reflection-based learning (QRL). Given a population of size N, an initial portion of \(\frac{N}{2}\) is randomly initialized within the constraints of the given search space. The remaining half is divided into two parts: one \(\frac{N}{4}\) is generated by applying chaotic maps, while the final \(\frac{N}{4}\) is created through the use of the QRL mechanism. These mechanisms are further described in the following.

Applying chaotic maps can aid the search procedures of metaheuristics. Several options for chaotic maps exist; however, empirical experimentation suggests that the logistic map is best suited to this application.

In the initialization stage, a pseudo-random number \(\theta _0\) is used to seed a chaotic sequence as per Eq. 10

$$\begin{aligned} \theta _{i+1} = \mu \theta _i \times (1-\theta _i),\; i=0,1,\ldots ,N-2 \end{aligned}$$
(10)

in this context, N refers to the size of the population, i represents the sequence number, and \(\mu\) is a control parameter for a chaotic sequence with an empirically selected value of 4. The value of \(\theta _0\) falls between 0 and 1, but is not equal to 0.25, 0.5, 0.75, or 1.

Each potential agent is mapped based on the generated chaotic sequence as demonstrated in Eq. 11.

$$\begin{aligned} X^c_i = \theta _i X_i \end{aligned}$$
(11)

where the variable \(X^c_i\) represents the updated position of individual i following chaotic disturbances.
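Taken together, Eqs. 10 and 11 amount to scaling each randomly generated agent by successive terms of a logistic sequence. A minimal sketch, assuming agent positions are plain Python lists (the function name is hypothetical):

```python
import random

def chaotic_init(population, mu=4.0, theta0=None):
    """Perturb randomly generated agents with a logistic chaotic
    sequence (Eqs. 10-11): X^c_i = theta_i * X_i, where
    theta_{i+1} = mu * theta_i * (1 - theta_i)."""
    if theta0 is None:
        # seed in (0, 1), avoiding the fixed points 0.25, 0.5, 0.75
        theta0 = random.random()
        while theta0 in (0.0, 0.25, 0.5, 0.75):
            theta0 = random.random()
    theta = theta0
    chaotic = []
    for x in population:
        chaotic.append([theta * xi for xi in x])  # Eq. 11
        theta = mu * theta * (1.0 - theta)        # Eq. 10
    return chaotic
```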

The QRL method involves producing quasi-reflexive-opposite solutions by following the principle that if an individual is located far away from the optimal solution, there is a higher likelihood that the opposite solution could be situated closer to the optimum.

When implementing the QRL process as described previously, the quasi-reflexive-opposite individual \(X^{\text{qr}}\) of the solution X can be generated using Eq. 12 for each component j of the X solution:

$$\begin{aligned} X^{\text{qr}} = \text{rnd}\bigg (\frac{\text{LB} + \text{UB}}{2}, X\bigg ) \end{aligned}$$
(12)

where \(\text{rnd}\bigg (\frac{\text{LB} + \text{UB}}{2}, X\bigg )\) denotes the generation of a random value from a uniform distribution between \(\frac{\text{LB} + \text{UB}}{2}\) and X, with LB and UB denoting the lower and upper boundaries, respectively. This initialization procedure is used for every individual in the population.
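Under these definitions, the QRL step of Eq. 12 can be sketched as follows; `quasi_reflect` is a hypothetical helper name, and positions and bounds are plain lists:

```python
import random

def quasi_reflect(x, lb, ub):
    """Generate the quasi-reflexive-opposite solution X^qr (Eq. 12):
    each component is drawn uniformly between the search-space centre
    (lb + ub) / 2 and the component's current value."""
    out = []
    for xj, l, u in zip(x, lb, ub):
        centre = (l + u) / 2.0
        lo, hi = min(centre, xj), max(centre, xj)  # order the interval
        out.append(random.uniform(lo, hi))
    return out
```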

The population is divided into a pair of subpopulations, each of which applies an LLH version of the RSA. One subpopulation leverages the ABC algorithm hybridized into the RSA for a boost in exploration, while the second utilizes the FA introduced into the RSA to focus on exploitation.

The utilized mechanisms from the ABC algorithm are described with a set of equations. The scouting phase is described in Eq. 13.

$$\begin{aligned} x_{i,j} = \text{lb}_j + \text{rand}(0,1)\times (\text{ub}_j - \text{lb}_j) \end{aligned}$$
(13)

in which \(x_{i,j}\) represents the j-th parameter of bee i from the population, \(\text{rand}(0,1)\) denotes a random value from a uniform distribution between 0 and 1, and \(\text{lb}_j\) and \(\text{ub}_j\) represent the lower and upper bounds of parameter j.

The bee and onlooker formulas are given in Eq. 14.

$$\begin{aligned} v_{i,j} = {\left\{ \begin{array}{ll} x_{i,j} + \phi \times (x_{i,j} - x_{k,j}), &{} R_j < \text{MR}\\ x_{i,j}, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(14)

in which \(x_{i,j}\) denotes the j-th element of the previous solution i, \(x_{k,j}\) the j-th element of a neighboring solution k, \(\phi\) a random value in range [0, 1], \(R_j\) a uniform random number drawn per parameter, and MR the modification rate.
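The scout and employed/onlooker updates of Eqs. 13 and 14 can be illustrated with the following sketch. The function names are hypothetical, and the neighboring solution k is assumed to be selected elsewhere:

```python
import random

def abc_scout(lb, ub):
    """Scout phase (Eq. 13): re-seed a solution uniformly in bounds."""
    return [l + random.random() * (u - l) for l, u in zip(lb, ub)]

def abc_onlooker(x_i, x_k, mr=0.8):
    """Employed/onlooker update (Eq. 14): each parameter j is moved
    relative to neighbour k with probability MR, else kept as-is."""
    v = list(x_i)
    for j in range(len(x_i)):
        if random.random() < mr:           # R_j < MR
            phi = random.random()          # phi in [0, 1] per the text
            v[j] = x_i[j] + phi * (x_i[j] - x_k[j])
    return v
```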

The primary search mechanism for the FA algorithm is described in Eq. 15.

$$\begin{aligned} x_i^{t+1} = x_i^{t} + \beta _0 * e^{-\gamma r^2_{i,j}}(x^t_j - x^t_i) + \alpha ^t(k-0.5) \end{aligned}$$
(15)

where \(\textbf{x}_i\) and \(\textbf{x}_j\) are the positions of the i-th and j-th fireflies, \(r_{i,j}\) is the distance between them, \(\beta _0\) and \(\gamma\) are the parameters controlling the attractiveness, \(\alpha ^t\) is the step size at iteration t, and k is a random number drawn from a uniform distribution.
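Eq. 15 can be illustrated as a single movement step. This is a sketch rather than the original FA implementation, with the random term drawn uniformly per dimension:

```python
import math
import random

def firefly_move(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.2):
    """Move firefly i toward the brighter firefly j (Eq. 15):
    attraction decays with squared distance, plus a random step."""
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_j))  # r_{i,j}^2
    beta = beta0 * math.exp(-gamma * r2)              # attractiveness
    return [a + beta * (b - a) + alpha * (random.random() - 0.5)
            for a, b in zip(x_i, x_j)]
```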

The hybrid search mechanisms for the LLH ABC and FA subpopulation are shown in Algorithm 2 and Algorithm 3, respectively.

Algorithm 2
figure b

ABC-RSA hybrid search process pseudocode

Algorithm 3
figure c

FA-RSA hybrid search process pseudocode

The described ABC and FA have been carefully chosen for their complementary characteristics: the ABC boosts the hybrid algorithm’s exploratory power, while the FA’s exceptionally powerful exploitation mechanism strengthens its exploitation.

One additional mechanism inspired by the GA, referred to as transfer learning, is introduced. This mechanism considers the best-performing individuals from each subpopulation as potential parents for new agents. The new agents adopt a certain set of traits from both parents, representing a combination of the two; this is simulated using a uniform crossover between agent traits. To govern this process, an additional control parameter representing population crossover, denoted as PC, is introduced; it has been empirically determined to give the best results when \(\text{PC} = 0.1\). When applying uniform crossover, two parents are selected from a population and offspring are generated from the parents’ values; these offspring replace the worst-performing agents in each respective population. The fitness of the newly generated agents is not evaluated after generation, so the computational complexity is maintained.
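Assuming PC acts as a crossover-trigger probability and each subpopulation is kept sorted best-first (both assumptions of this sketch, not statements from the original), the transfer mechanism might look as follows:

```python
import random

def uniform_crossover(parent_a, parent_b):
    """Uniform crossover: each trait is inherited from either parent
    with equal probability."""
    return [a if random.random() < 0.5 else b
            for a, b in zip(parent_a, parent_b)]

def transfer(pop_a, pop_b, pc=0.1):
    """With probability PC, replace the worst agent of each
    subpopulation (assumed sorted best-first) with crossover offspring
    of the two subpopulation leaders. Offspring fitness is not
    evaluated here, keeping the complexity unchanged."""
    if random.random() < pc:
        pop_a[-1] = uniform_crossover(pop_a[0], pop_b[0])
        pop_b[-1] = uniform_crossover(pop_a[0], pop_b[0])
    return pop_a, pop_b
```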

Finally, a high-level overview of the complete proposed MS-RSA algorithm can be seen in Algorithm 4.

Algorithm 4
figure d

Introduced MSRSA algorithm high-level pseudocode

3.4 Experimental framework

The experimental framework comprises two layers, each tasked with handling one specific task. The layers are labeled Layer 1 (L1) and Layer 2 (L2). Both evaluated datasets have been subjected to both stages of the framework to select the optimal models suited to forecasting yields. A visualization of the framework is shown in Fig. 7.

Fig. 7
figure 7

Structure of the two-layer framework utilized for research

3.5 Metaheuristic optimizers

Metaheuristic algorithms present a popular choice for selecting near-optimal parameters of baseline algorithms. Contemporary ML and AI algorithms are usually designed with good general performance in mind. However, while showing good general performance, algorithms usually require adaptation for a specific problem. This is done through a set of exposed parameters that are available to programmers. The process of selection is not without its challenges, as the number of combinations often makes this an NP-hard problem. This work evaluates several contemporary optimizers alongside the introduced algorithms. Brief descriptions of these algorithms, with their respective inspirations and basic search strategies, are discussed below.

The recently introduced ChOA [31] is inspired by the individual intelligence and mating motivation observed in chimpanzees during group hunting. The algorithm is designed to address two common issues in optimization problems: slow convergence speed and getting trapped in local optima, especially in high-dimensional scenarios. The algorithm incorporates a mathematical model representing diverse intelligence and mating motivation in chimps. Four types of simulated chimps—attacker, barrier, chaser, and driver—are utilized to capture the range of intelligence observed in chimpanzee groups. The hunting behavior is divided into four main steps: driving, chasing, blocking, and attacking.

Several well-known algorithms have also been included in the comparisons, such as the GA [38], a search and optimization method inspired by natural selection. It operates with a population of potential solutions, represented as chromosomes. The algorithm involves the selection of individuals based on their fitness, crossover (genetic recombination) to create offspring, and mutation to introduce random changes. Through iterations, the population evolves, mimicking the process of natural evolution. GAs are versatile and effective for solving complex optimization problems across various domains due to their ability to explore vast solution spaces and handle multiple local optima simultaneously.

The PSO [62] is an optimization algorithm based on the collective behavior of social organisms, particularly the movement patterns of bird flocks or fish schools. In PSO, a population of potential solutions is represented as particles, each navigating through the solution space. These particles adjust their positions based on their own experience (local best) and the shared knowledge of the entire swarm (global best). The movement is influenced by both the particle’s current velocity and the historical best positions. This collaborative exploration and exploitation process encourages convergence toward optimal solutions. PSO is known for its simplicity and efficiency in finding solutions to optimization problems, particularly in continuous and high-dimensional spaces. The algorithm’s ability to balance exploration and exploitation makes it suitable for various applications, including engineering, finance, and ML.
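The velocity and position update described above can be shown in a minimal single-step sketch; the parameter values here are illustrative defaults, not those used in the experiments:

```python
import random

def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One PSO update: inertia (w) plus stochastic attraction toward
    the particle's own best (pbest) and the swarm's best (gbest)."""
    new_vel = [w * v
               + c1 * random.random() * (pb - x)   # cognitive term
               + c2 * random.random() * (gb - x)   # social term
               for x, v, pb, gb in zip(pos, vel, pbest, gbest)]
    new_pos = [x + v for x, v in zip(pos, new_vel)]
    return new_pos, new_vel
```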

The ABC [29] algorithm is a technique inspired by the foraging behavior of honeybees. In ABC, the population consists of artificial bees, and the algorithm is designed to mimic the food source exploration process observed in a bee colony. The optimization process involves three main components: employed bees, onlooker bees, and scout bees. Employed bees explore the solution space, representing potential solutions. Onlooker bees select solutions based on the employed bees’ performance and exploit these solutions for further exploration. Scout bees, in turn, introduce randomness by exploring new solutions when the algorithm stagnates or fails to improve. The iterative nature of ABC allows for the continuous refinement of solutions, making it suitable for various optimization problems, especially in domains such as engineering, logistics, and data analysis.

The FA [67] is inspired by the light-flashing behavior of fireflies in nature. In FA, potential solutions to an optimization problem are represented as fireflies, and the algorithm seeks to improve these solutions iteratively. The attractiveness of a firefly is determined by its brightness, influenced by both its distance from other fireflies and their respective brightness levels. Brighter fireflies attract others, and the algorithm simulates this process to converge toward optimal solutions.

The main steps of the firefly algorithm include the initialization of firefly positions, the computation of their attractiveness, the movement toward brighter fireflies, and the updating of the solution space. FA effectively balances exploration and exploitation, making it suitable for various optimization problems. It has been applied in fields such as engineering, finance, and image processing, demonstrating its versatility and effectiveness in finding solutions to complex optimization challenges.

The BA [68] is a metaheuristic optimization method inspired by bats’ echolocation behavior. It utilizes a population of virtual bats for solving optimization problems, incorporating both local and global search strategies. BA represents solutions as bat positions, adjusting their emission rates (loudness) for exploration–exploitation trade-offs. The algorithm’s random walks facilitate global exploration. Known for its simplicity and effectiveness, BA has been successfully applied to diverse optimization domains, including engineering, finance, and data science.

The HHO [23] algorithm is a nature-inspired optimization technique based on the cooperative hunting behavior of Harris’s Hawks. In HHO, potential solutions to an optimization problem are represented as hawks, and the algorithm mimics their collaboration during hunting. The optimization process involves exploration, exploitation, and communication among the hawks to improve solutions iteratively.

The key features of HHO include the representation of solutions as hawk positions, the integration of explorative and exploitative movements, and the adoption of a leader-follower strategy. The leader attracts followers based on their fitness, and the followers adjust their positions accordingly. HHO has shown promise in solving various optimization problems due to its ability to balance exploration and exploitation inspired by the collaborative hunting nature of Harris’s hawks.

The WOA [37] is a method based on the cooperative hunting behavior of humpback whales. It represents potential solutions as whale positions and incorporates exploration and exploitation strategies. WOA has demonstrated effectiveness in solving diverse optimization problems, making it applicable to fields such as engineering, finance, and data science.

4 Experimental setup

During experimentation, WANNs are assigned the task of predicting crop yields across two distinct datasets. Metaheuristics are employed to optimize the hyperparameters of the WANN within a two-layer framework to enhance performance. The first layer involves architecture selection, and in the second layer, shared weights are optimized. Subsequently, the outcomes undergo meticulous validation and interpretation using the SAGE method.

4.1 Employed datasets

The wild blueberry (Vaccinium angustifolium Aiton) dataset, available on Kaggle, concerns a crop whose yield depends on bee-mediated cross-pollination [2, 12]. Yield is therefore affected by current bee density [2], but also by other factors including weather, soil fertility, pests, and disease. To produce useful results, ML algorithms for crop yield prediction generally require large amounts of data, and the availability of training data of sufficient quality and quantity can be a problem. Wild blueberry predictive yield models require data that sufficiently characterize the influence of the spatial characteristics of plants, bees, and weather conditions on production. Experiments may therefore be performed on a calibrated version of a blueberry simulation model: the simulated dataset is examined with proper feature selection and afterward used to build four ML-based prediction models, which may be compared with real-life data acquired in the fields. The wild blueberry yield prediction dataset [41, 54] was generated by a wild blueberry pollination model, a spatially explicit simulation model validated by field observations and experimental data collected in Maine (USA) over the past three decades.

Another crop yield prediction dataset [44], also available on Kaggle, provides yield data on the most common agricultural cultures per country per year (maize, potatoes, rice, wheat, sorghum, soybeans, yams, cassava, sweet potatoes, and plantains), combined with data on average rainfall, temperature, and pesticide use. The data sheets were compiled from publicly available datasets of the Food and Agriculture Organization (FAO) and World Data Bank.

Both datasets underwent pre-processing, involving the transformation of categorical features, notably the crop species, using the one-hot encoding technique to enhance forecasting accuracy. This approach, however, expands the feature set of each dataset. Specifically, the crop yield dataset was transformed from the original seven features to a total of 116 features, while the blueberry dataset went from the initial 18 features to 17 used as inputs in this specific test case.
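One-hot encoding, as applied to the categorical crop-species feature, can be illustrated with a small self-contained sketch (the function name is hypothetical):

```python
def one_hot(values):
    """One-hot encode a categorical column: each distinct category
    becomes its own binary feature, expanding the feature set."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]
    return rows, categories
```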

4.2 Evaluation metrics

To ensure a thorough examination, several metrics have been utilized during experimentation. The utilized metrics include the mean square error (MSE) described in Eq. 16, root-mean-square error (RMSE) described in Eq. 17, mean absolute error (MAE) described in Eq. 18, as well as the coefficient of determination \(R^2\) described in Eq. 19. It is important to note that the \(R^2\) metric is utilized as the primary objective function guiding the network architecture selection process in L1 of the framework, while MSE is used as the objective function when optimizing shared weights in L2.

$$\begin{aligned} \text{MSE}= & {} \frac{1}{n}\sum _{i=1}^{n}(y_i - \hat{y_i})^2 \end{aligned}$$
(16)
$$\begin{aligned} \text{RMSE}= & {} \sqrt{\frac{1}{n}\sum _{i=1}^{n}(y_i - \hat{y_i})^2} \end{aligned}$$
(17)
$$\begin{aligned} \text{MAE}= & {} \frac{1}{n}\sum _{i=1}^{n}|y_i - \hat{y_i}| \end{aligned}$$
(18)
$$\begin{aligned} R^2= & {} 1 - \frac{\sum _{i=1}^{n}(y_i - \hat{y_i})^2}{\sum _{i=1}^{n}(y_i - \bar{y})^2} \end{aligned}$$
(19)

where n is the total number of observations, \(y_i\) is the actual value of the i-th observation, \(\hat{y_i}\) is the predicted value of the i-th observation, and \(\bar{y}\) is the mean value of all observations.
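Eqs. 16-19 translate directly into code; a dependency-free sketch:

```python
def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R^2 as defined in Eqs. 16-19."""
    n = len(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    mse = sse / n                                            # Eq. 16
    rmse = mse ** 0.5                                        # Eq. 17
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n  # Eq. 18
    y_bar = sum(y_true) / n
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    r2 = 1.0 - sse / ss_tot                                  # Eq. 19
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}
```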

4.3 Experimental setup and framework adjustments

The initial step in the framework involves selecting a suitable network architecture. To accomplish this, a population of potential architectures is created and evolved through a predefined number of iterations. For the blueberry dataset, a population of 200 individuals was used with 400 generations allocated for improvement. Due to the larger number of input parameters, the crop yield dataset was assigned a larger population of 300 individuals, with 800 generations allocated for improvement. Metaheuristics were leveraged to select control parameters of the WANN. The parameters selected for optimization and their respective ranges are shown in Table 1. It is also important to note that apart from hyperparameters, activation functions for each node also needed to be selected. The considered functions include sigmoid, tanh, gauss, relu, sin, inv, and identity. Due to extensive computational demands, each metaheuristic was assigned a population size of 6 and given 15 iterations to attain optimal performance. Furthermore, to provide grounds for a fair comparison, the evaluations have been repeated over 20 independent runs to account for the randomness inherent in this class of algorithms.

Table 1 Hyperparameters optimized by metaheuristics in L1 with their respective ranges

After each iteration, individual architectures are evaluated using the \(R^2\) metric. A selection of shared weights is assigned and used to construct and evaluate networks. Population fitness is assessed based on the mean \(R^2\) value of all individuals in a population, tested with each possible shared weight value. This process allows the network architecture to grow as needed, introducing new neurons and connections so the network can evolve to better address the given task. The best-performing architecture is passed to layer two of the framework.

During L2 optimization, each metaheuristic was assigned a population size of 40 and given 30 iterations to improve network weights to attain optimal performance. Furthermore, to provide grounds for a fair comparison, the evaluations have been repeated over 30 independent runs to account for the heuristics inherent in this class of algorithms.

In the first stage, metaheuristic algorithms are tasked with selecting optimal hyperparameters for the evolving neural networks. In the second, they were tasked with optimizing the values of the shared weights to boost network performance. Several metaheuristics have been considered for tackling this task, and their performance has been subjected to a comparative analysis with the introduced MSRSA. The evaluated algorithms include the original RSA [1] as well as the novel ChOA [31]. Several well-known algorithms have also been compared including the GA [38], PSO [62], ABC [29], FA [67], BA [68], HHO [23], and WOA [37]. For each of the utilized metaheuristics, the parameters used during the optimizations are the values suggested in the works that originally introduced them. The control parameter of the proposed metaheuristics PC was set to \(\text{PC}=0.1\) as this value has been empirically determined to give the best results.

5 Results and discussion

The following section demonstrates the results attained in each layer of the framework individually. Following the presentation, the results are discussed in detail.

5.1 L1 observed outcomes

In L1, metaheuristics algorithms were used to select optimal control parameters for evolving WANN architectures suited for yield forecasting. The results attained by networks optimized in L1 are shown and discussed in the following segment.

5.1.1 Wild blueberry optimal network architecture

During the architecture selection process, fitness metrics were monitored and tracked to determine the influence of the competing metaheuristics on the optimization process. Table 2 demonstrates the results attained during the best, worst, median, and mean runs, as well as the standard deviation and results variance. Furthermore, fitness convergence rates are shown in Fig. 8.

Table 2 Fitness results attained by each metaheuristic-optimized population for the best, worst, median, and mean execution for blueberry dataset

As demonstrated, the networks tuned by the introduced metaheuristic attained the best results compared to all other metaheuristic-optimized networks.

Fig. 8
figure 8

Population fitness convergence rates attained by each metaheuristic for the blueberry dataset

Fitness distributions are observed in Fig. 9.

Fig. 9
figure 9

Population fitness distributions attained by each metaheuristic for the blueberry dataset

As can be observed, the proposed MSRSA attained the lowest variation in results across several runs, demonstrating the highest reliability and stability compared to other algorithms. Furthermore, it attained significantly better results than the unoptimized WANN, which with default parameters obtained a fitness value of 0.056813. The optimal network architecture evolved in L1 of the framework is shown in Fig. 10.


Fig. 10
figure 10

Best selected WANN model architecture for the wild blueberry yield forecasting

The selected hyperparameters for the best-performing WANN model are given in Table 3.

Table 3 Best selected WANN hyperparameters for the blueberry dataset

The relative simplicity of the constructed network is worth emphasizing, with a total of only 14 neuron nodes and 61 active weighted connections. Furthermore, even though the mean fitness of the best population in L1 was only 0.381873, and despite the network’s simplicity, the optimal genome attained an admirable \(R^2\) score of 0.884506 with a shared weight of 1. This network was subjected to further optimization in L2, where further improvements were attained.

5.1.2 Crop yield optimal network architecture

While network architectures were evolved, the fitness metrics were recorded to better understand the effects each evaluated metaheuristic has on the optimization process. Table 4 demonstrates the results attained during the best, worst, median, and mean runs, as well as the standard deviation and variance of the results. Additionally, fitness convergence is demonstrated in Fig. 11.

Table 4 Fitness results attained by each metaheuristic-optimized population for the best, worst, median, and mean execution for crop yield dataset

As demonstrated, the networks tuned by the introduced metaheuristic attained the best results compared to all other metaheuristic-optimized networks. However, it is also important to note that the performance of the population remains relatively poor regardless of the optimization algorithm, with the average population fitness being a net negative value. This is likely due to the relatively high complexity of the crop yield dataset and the large number of available features, making it harder for lighter network structures to determine adequate architectures with fewer connections. An unoptimized version of the WANN was also applied to this specific task under identical test conditions. The resulting average population fitness was only \(-0.3793214\), significantly worse than even the worst-performing optimized version (Fig. 11).

Fig. 11
figure 11

Population fitness convergence rates attained by each metaheuristic for the crop yield dataset

Fitness distributions are observed in Fig. 12.

Fig. 12
figure 12

Population fitness distributions attained by each metaheuristic for the crop yield dataset

As can be observed, the proposed MSRSA attained the lowest variation in results across several runs, suggesting that it demonstrates the highest reliability and stability compared to other algorithms. The optimal evolved network architecture by L1 of the framework is shown in Fig. 13.

Fig. 13
figure 13

Best selected WANN model architecture for the crop yield forecasting

The relatively simple architecture evolved by the WANN failed to sufficiently address this task in L1, with most populations having a very low \(R^2\) score. The best genome attained an \(R^2\) value of \(-\,0.103878\); the selected architecture contained a total of 15 nodes with 259 connections and a shared weight value of \(-\,1.9854476\). The poor performance is very likely due to the high complexity of the crop yield dataset, with significantly more inputs compared to the previous dataset. However, by optimizing shared weights, performance can be significantly improved in L2 of the framework.

5.2 L2 observed outcomes

In L2, metaheuristics are tasked with selecting optimal shared weights within the already constructed network architecture. The results attained by networks optimized in L2 are shown and discussed in the following segment.

5.2.1 Wild blueberry dataset results

In the second stage of the optimization, several state-of-the-art algorithms were tasked with fine-tuning the shared network weight. The experiments were carried out over 20 independent runs to account for the randomness inherent to this class of algorithms. The resulting models’ performance was recorded, and the results are demonstrated below. The objective function used in L2 is MSE, and the results attained by each metaheuristic-optimized WANN in the best, worst, mean, and median run are demonstrated in Table 5. The already decent performance attained in L1 is further improved by the metaheuristics in L2 of the framework through shared weight fine-tuning (Table 6).

Table 5 Overall objective function results for each evaluated metaheuristic-optimized network for the blueberry dataset

As can be observed, the introduced metaheuristic obtained the best results compared to all other evaluated algorithms. However, to provide further insight into the improvements made by the introduced modifications, the best-performing models have been evaluated using additional metrics. The results are demonstrated in Table 6, with normalized values given in Table 7.

Table 6 Best-performing metaheuristics optimized models detailed evaluation results for blueberry dataset

It can be observed that the introduced metaheuristic attained the best MSE results as this was the optimization target. Several metaheuristics share the first place for \(R^2\) and R, while the FA-optimized WANN attained the best MAE. This is in line with the NFL theorem, which states that no single approach is the best for all problems.

Table 7 Best-performing metaheuristics optimized models detailed evaluation results for blueberry dataset normalized

Convergence rates of each optimized WANN through each metaheuristic iteration are shown in Fig. 14, alongside distribution plots of the results.

Fig. 14
figure 14

Objective and \(R^2\) metrics convergence and distribution plots for blueberry dataset

It can be observed that the introduced metaheuristic improved convergence rates in comparison with the original algorithm. Furthermore, the result distributions show that the introduced algorithm has the narrowest spread of results, suggesting the highest level of robustness and reliability. This is further reinforced by the KDE plots shown in Fig. 15.

Fig. 15
figure 15

Compared model KDE plots for the objective and \(R^2\) function for blueberry dataset

Finally, the predictions cast by the best-performing metaheuristic-optimized WANN in comparison with actual yields are shown in Fig. 16.

Fig. 16
figure 16

Forecasts made by the best-performing model compared to actual values for blueberry dataset

5.2.2 Crop yield dataset results

Similarly, the second optimization layer applied several state-of-the-art algorithms tasked with fine-tuning the shared network weight for forecasting crop yield. The experiments were carried out over 20 independent runs to account for the randomness inherent to this class of algorithms. The resulting models’ performance was recorded, and the results are demonstrated below. The objective function used in L2 is MSE, and the results attained by each metaheuristic-optimized WANN in the best, worst, mean, and median run are demonstrated in Table 8. Additionally, it is important to note that, since the networks evolved in the first layer of the framework attained quite modest results, the application of metaheuristics to shared weight fine-tuning demonstrated a significant improvement in performance compared to the initial results.

Table 8 Overall objective function results for each evaluated metaheuristic-optimized network for the crop yield dataset

The detailed metrics for each of the best runs are shown in Table 9. The results indicate that several metaheuristics share the best \(R^2\) score, while the newly introduced metaheuristic attained the best MSE and RMSE scores. Nevertheless, the FA once again demonstrated the best MAE result, further reinforcing the NFL theorem (Table 10).

Table 9 Best-performing metaheuristic-optimized models' detailed evaluation results for the crop yield dataset
Table 10 Best-performing metaheuristic-optimized models' detailed evaluation results for the crop yield dataset (normalized)

Convergence rates and final result distributions are shown in Fig. 17, followed by KDE diagrams in Fig. 18.

Fig. 17
figure 17

Objective and \(R^2\) metrics convergence and distribution plots for crop yield dataset

Fig. 18
figure 18

Compared model KDE plots for the objective and \(R^2\) function for crop yield dataset

Finally, the forecasts of the best-performing model optimized by metaheuristics compared to actual values are shown in Fig. 19.

Fig. 19
figure 19

Forecasts made by the best-performing model compared to actual values for crop yield dataset

The selected hyperparameters in L1 for the best-performing WANN architecture are given in Table 11.

Table 11 Best selected WANN hyperparameters for the crop yield dataset

Considering both steps in the optimization process, the role of metaheuristic optimization cannot be overstated. While in the first step the WANN attained more modest outcomes, likely due to the increased data complexity of the crop yield dataset coupled with the relative simplicity of the evolved networks, the major improvements made by shared weight tuning in L2 can to a degree mitigate the shortcomings of L1.

5.3 Comparative analysis with other well-known ML and ANN models

The proposed approach has also been compared with several well-known and well-performing ML and ANN models. The compared methods include eXtreme Gradient Boosting (XGBoost) [7], support vector machines (SVM) [57] with various kernel functions, and several traditional ANN architectures with one, two, and three hidden layers.

For the SVM, popular kernel functions have been considered. The results attained using the radial basis function kernel are marked as SVM (RBF), those attained using a polynomial kernel as SVM (poly), and those attained using a linear kernel as SVM (linear).

The network architectures marked ANN1, ANN2, and ANN3 have different structures for each dataset. For the blueberry dataset, ANN1 has one hidden layer with 16 neurons, ANN2 has two hidden layers with 32 and 16 neurons, and ANN3 consists of three hidden layers with 32, 16, and 8 neurons, respectively. For the crop yield dataset, ANN1 has one hidden layer with 230 neurons, ANN2 has two hidden layers with 230 and 115 neurons, and ANN3 consists of three hidden layers with 230, 115, and 58 neurons, respectively. All networks utilized the Adam optimizer and the ReLU activation function.
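The baselines described above could be instantiated as in the following sketch using scikit-learn; the paper does not specify which implementation library was used, so this is an assumption, and XGBoost is omitted since it requires a separate package. Layer sizes follow the blueberry dataset configuration stated in the text:

```python
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

# Hypothetical instantiation of the compared baselines (scikit-learn assumed).
# SVM is applied as a regressor (SVR) since crop forecasting is regression.
baselines = {
    "SVM (RBF)":    SVR(kernel="rbf"),
    "SVM (poly)":   SVR(kernel="poly"),
    "SVM (linear)": SVR(kernel="linear"),
    # ANN1-ANN3 for the blueberry dataset, Adam optimizer and ReLU activation
    "ANN1": MLPRegressor(hidden_layer_sizes=(16,), activation="relu", solver="adam"),
    "ANN2": MLPRegressor(hidden_layer_sizes=(32, 16), activation="relu", solver="adam"),
    "ANN3": MLPRegressor(hidden_layer_sizes=(32, 16, 8), activation="relu", solver="adam"),
}
```

Each model would then be fit and evaluated over 20 executions, as described below.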

Due to the stochastic nature of the training process, experiments have been carried out over 20 executions. Mean results attained over the 20 runs of each method applied to the blueberry dataset are shown in Table 12, while the results for the crop yield dataset are shown in Table 13. As crop forecasting is a regression problem, the SVM is applied as a support vector regressor (SVR). It is also important to note that during experimentation one-hot encoding was not used for XGBoost and SVM, giving these methods a slight advantage due to the lower number of input features.

Table 12 Comparison of contemporary ML and ANN with the proposed metaheuristic-optimized WANN on the blueberry dataset
Table 13 Comparison of contemporary ML and ANN with the proposed metaheuristic-optimized WANN on the crop yield dataset

From the presented results, several interesting deductions can be drawn. Firstly, the methods that do not use one-hot encoding, and thus work with fewer features, have an advantage. While state-of-the-art techniques such as XGBoost perform best, helped by the reduced number of features, the ANN models display admirable performance as well.

It is also important to note that the goal of this comparison was not to prove that no method outperforms the WANN. Even when optimized through metaheuristics, these networks have their limitations. However, their advantage lies in their lightweight architectures, which require less computation to produce a prediction.

One especially interesting observation is that the proposed WANNs, despite having significantly simpler structures with only 14 and 15 neurons, respectively, attained better performance than most of the more complex ANN architectures. This is a significant advantage when working with systems that have limited computational power.

5.4 Findings validation and best model interpretation

An important part of modern computer science research is determining whether the improvements made are statistically significant. Outcomes alone are insufficient to establish an advantage of one algorithm over others. In this work, nine established methods were evaluated alongside the proposed MSRSA based on their ability to optimize WANN performance for crop yield forecasting. The comparison was conducted over two datasets and two problems, the L1 and L2 parts of the framework, which address WANN structure and shared weight tuning, respectively, yielding a total of four different experiments.

As suggested in [9], statistical evaluation in these scenarios should be preceded by adequate sampling of each method, determining objective averages through multiple independent executions for each problem. This approach can be inconclusive, or even produce misleading conclusions, in cases where the samples do not follow a normal distribution. It is also worth noting that researchers remain divided on whether taking the average objective function value for statistical tests is appropriate when comparing stochastic methods [14, 52]. Nevertheless, the objective function values over 20 independent runs for each of the four problems are considered in this research.

To determine the statistical significance of the obtained results, the best values from each of the 20 runs of every algorithm were selected for both the wild blueberry and crop yield prediction dataset instances. These values were then treated as data series. However, before choosing between parametric and non-parametric statistical tests, the conditions for safe usage of parametric tests, namely the independence, normality, and homoscedasticity of the data variances [33], need to be investigated.

The independence condition is satisfied because each run is executed separately with its own pseudo-random number seed. However, the normality requirement is not satisfied, since the acquired samples do not originate from a normal distribution. This can be observed from the KDE plots and is further reinforced by the Shapiro–Wilk test for single-problem analysis [55]. By conducting the Shapiro–Wilk test, p-values are computed for every method–problem pair; these results are shown in Table 14.

Table 14 Shapiro–Wilk test scores for the single-problem analysis

At both standard threshold values \(\alpha = 0.05\) and \(\alpha = 0.1\), the null hypothesis (H0) can be rejected, leading to the conclusion that none of the samples (for any problem–method pair) comes from a normal distribution. Therefore, since the normality condition for safe usage of parametric tests was not satisfied, there was no need to verify the homoscedasticity constraint, and the statistical analysis continued with non-parametric tests.

Due to the limited number of problems addressed in the study (four in total), a multi-problem analysis was not conducted, as an insufficient number of samples can produce misleading results. The analysis therefore proceeded with a pair-wise non-parametric test, with the introduced MSRSA designated as the control method.

The non-parametric Wilcoxon signed-rank test [58] was conducted between the introduced MSRSA and every other method for each of the problems addressed in each framework layer. The results of this analysis are summarized in Table 15, where generated p-values higher than the threshold of \(\alpha =0.05\) are marked in bold.
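The overall statistical procedure, a Shapiro–Wilk normality check per sample followed by the pairwise Wilcoxon signed-rank test against the control method, can be sketched as follows, assuming SciPy is available; the objective values below are synthetic stand-ins, not results from the experiments:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic objective (MSE) samples over 20 runs for the control method
# and one compared method; real samples would come from the experiments.
msrsa = rng.exponential(0.010, size=20)
other = rng.exponential(0.014, size=20)

# Shapiro-Wilk: a small p-value rejects H0 that the sample is normal
_, p_norm = stats.shapiro(msrsa)

# Pairwise non-parametric comparison between the two paired run series
_, p_wilcoxon = stats.wilcoxon(msrsa, other)

print(f"Shapiro-Wilk p={p_norm:.4f}, Wilcoxon p={p_wilcoxon:.4f}")
```

A Wilcoxon p-value below the chosen \(\alpha\) indicates a statistically significant difference between the control method and the compared method on that problem.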

Table 15 Wilcoxon signed-rank test findings

The Wilcoxon signed-rank test p-values shown in Table 15 exhibit that, in the scenario of the L1 experiments (WANN structure tuning), the proposed MSRSA significantly outperformed all other methods, yielding p-values substantially lower than 0.05. Therefore, in this extremely computationally intensive experiment, the MSRSA proved to be a robust and efficient optimizer.

Although in the L2 simulations (WANN shared weight tuning) the MSRSA outperformed most of the other methods according to the Wilcoxon signed-rank analysis, there are some instances where other approaches showed competitive performance. More precisely, in the case of L2 blueberry yield prediction, the MSRSA did not show a significant improvement over the ABC approach when the threshold of 0.05 is taken into account. Similarly, in the L2 crop yield prediction simulations, according to the generated p-values, the RSA and HHO showed performance similar to the MSRSA at both threshold values, 0.1 and 0.05.

Nonetheless, as an overall conclusion of the statistical analysis, the MSRSA exhibited performance in both experiments that is statistically significantly better than that of most of the other metaheuristics included in the analysis.

Finally, the best model generated by the proposed MSRSA in the L2 experiment for the wild blueberry dataset is taken, and the SAGE [13] method is applied to observe feature importance. SAGE provides a global, Shapley-value-based measure of how much each individual feature contributes to the model's predictive performance. By evaluating the model's response to changes in each variable, conclusions can be drawn about the relations the model has extracted and how well it exploits them. The SAGE method is applied to determine the importance of each feature for the best-constructed model, providing potentially very useful information for further studies, as it indicates parameters that researchers may want to focus on in order to attain accurate forecasts using the proposed model, as well as lighter models if needed.

Since the reference SAGE library does not natively support WANNs, the code for generating SAGE values was written from scratch in Python. However, since the crop yield dataset undergoes one-hot encoding of categorical variables in the pre-processing phase, running SAGE for the best-generated model on this dataset would not yield interpretable per-feature importances, and it was therefore omitted.
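The from-scratch implementation is not reproduced here, but the underlying idea, how much the model's loss degrades when a feature's information is removed, can be illustrated with a simpler permutation-based proxy for global importance. The model and data below are synthetic, and permutation importance is only a cheap approximation of the Shapley-based SAGE values:

```python
import numpy as np

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Average MSE increase when one feature column is shuffled,
    a rough proxy for Shapley-based global importance (illustrative)."""
    rng = np.random.default_rng(seed)
    base = np.mean((model(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # destroy feature j's information
            scores[j] += np.mean((model(Xp) - y) ** 2) - base
    return scores / n_repeats

# Toy model where feature 0 dominates (a hypothetical stand-in for the
# dominant feature found by the SAGE analysis); feature 2 is irrelevant
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
model = lambda X: 3.0 * X[:, 0] + 0.5 * X[:, 1]
y = model(X)

imp = permutation_importance(model, X, y)
print(imp)  # feature 0 scores highest, feature 2 near zero
```

An unused feature scores exactly zero under this proxy, since shuffling it leaves the predictions unchanged.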

The feature importance bar plot for the wild blueberry best model is depicted in Fig. 20.

Fig. 20
figure 20

The SAGE feature importance bar plot for wild blueberry simulations

According to the SAGE analysis outcomes, the fruit set feature has the highest impact on model predictions, followed by the seeds feature and fruit mass.

6 Conclusion

The presented work put forth a novel two-layer framework for constructing optimized network architectures based on neuroevolutionary algorithms. By using WANNs, much of the training required by traditional neural networks can be avoided. Additionally, the responsibility of the network is shifted from fine-tuned weights and biases toward the network architecture. One result of taking this approach is the generation of simpler and tighter neural networks. However, the performance of these networks is heavily dependent on adequate parameter selection. This work introduces a novel MSRSA algorithm that is applied to both aspects of network optimization and demonstrates the best performance in both.

The proposed framework possesses a two-layer structure. Optimal network architectures are evolved in the first layer, while the shared weights are further optimized in the second layer to improve performance. The approach has been tested on two real-world datasets, blueberry and crop yield, with mixed results. Depending on data complexity, L1 attained decent results on the simple blueberry dataset, while attaining noticeably more modest results on the more complex crop yield dataset. However, following shared weight optimization by metaheuristic algorithms in L2, the networks saw a significant improvement. The best-performing model for the blueberry dataset has been subjected to SAGE analysis to determine the features that have the highest impact on model forecasts. Additionally, the performance of WANNs has been compared to several proven AI and ML methods to determine their relative effectiveness. While the proposed approach did not match the performance of state-of-the-art methods, it nonetheless proves promising, outperforming significantly more computationally demanding network architectures.

One notable advantage of this approach is the relative simplicity of the final models, featuring few nodes and connections in comparison with traditional networks handling similar tasks. This feature can be a significant advantage when creating models for systems with limited computational power. This process is somewhat balanced by the demanding process of generating optimized models. Nevertheless, by introducing metaheuristic optimization, generation times can be significantly reduced and model performance improved.

As with any study, certain limitations are present within this work. The high cost of the comparative analysis between optimization algorithms limits population sizes and allocated iterations, parameters that might improve the performance of slower-converging algorithms or improve the exploratory power of fast-converging ones. Furthermore, only a limited number of optimization algorithms have been considered. Future works hope to address some of the stated limitations and will focus on exploring the potential of WANNs for tackling pressing real-world problems. Additionally, the potential of the introduced MSRSA algorithm will be explored in other optimization areas.