Background

The walnut is one of the most important nut crops in the world. Persian walnut (Juglans regia L.) is the only edible walnut species that is widely grown for its nuts and timber [1]. Walnut trees are still propagated mainly from seed rather than vegetatively, which results in non-uniform nut quality and irregular yields [2]. In vitro propagation is therefore used to overcome these problems. However, walnuts are considered recalcitrant to in vitro culture, which makes mass propagation of different genotypes difficult, even though several micropropagation protocols have been published for different genotypes [3,4,5,6,7,8,9,10,11]. Walnut micropropagation results have been shown to depend strongly on genotype [7, 9,10,11]. In addition to genotype, the formulation of the culture medium has a great impact on all micropropagation stages. Up to now, the Driver and Kuniyuki [3] walnut (DKW) culture medium has been the most widely used formulation for walnut tissue culture. Nevertheless, some studies have reported improved results using modified DKW or other formulations [6,7,8, 12,13,14,15].

However, to the best of our knowledge, no comprehensive study has examined the balance of culture media components (mineral nutrients, plant growth regulators (PGRs) and vitamins), their interactions with one another and with genotype, and the effect of these interactions on walnut in vitro performance, with the goal of making the micropropagation process more efficient by enhancing proliferation rate and reducing physiological disorders.

Predicting the effect of the interaction of mineral nutrients, PGRs, vitamins and genotype on explant in vitro performance involves modeling a very complex database. This is a problematic and time-consuming process with classic statistical analyses and calls for accurate, advanced modeling procedures [16, 17]. Machine learning (ML) tools help researchers understand the studied process and make sound decisions when developing optimal culture media [17]. In recent years, different ML models such as neural networks [18,19,20,21,22,23] have been successfully applied to predict and optimize different plant tissue culture processes. In our previous studies, we described hybrid ML techniques that combine an artificial neural network (ANN) with a genetic algorithm (ANN-GA) in Pyrus [24] and Prunus rootstocks [25,26,27], and gene expression programming (GEP) with GA (GEP-GA) [20] and with particle swarm optimization (PSO) (GEP-PSO) [28] in Pyrus rootstocks. These are powerful data mining approaches that allow complicated databases to be modeled and the factors influencing a given response in the micropropagation process to be identified.

ANNs are inspired by the functioning of the human brain [29]. ANNs [multi-layer perceptron neural network (MLPNN) and radial basis function neural network (RBFNN)] have shown considerable promise in modeling complex plant tissue culture systems [20, 24,25,26,27]. One advantage of ANNs is that they require no prior knowledge of the structure of, or the interrelationships between, input and output signals [16]. Other reported uses of ANNs include predicting plant biomass [30], clustering micropropagated plantlets, and influencing the growth and quality of regenerated plants by controlling light, ventilation, CO2 and air temperature inside the culture containers [16].

GEP is another ML-based optimization technique, presented by [31], which combines useful traits of both genetic programming (GP) and GA. This approach, which is based on evolving computer programs, was used in our previous studies on Pyrus rootstock micropropagation, where it precisely detected nonlinear and complicated relationships between inputs and outputs [20, 24].

Here, ANN and GEP are compared with the k-nearest neighbors (KNN) method, one of the simplest machine learning techniques. KNN identifies the training samples that most closely match the “current” conditions according to a set of predefined attributes; these samples are the neighbors. The predicted value is then derived from the values associated with those neighbors [32]. Compared with mathematical modeling, the KNN method involves no model development or validation and can therefore be used without recombining data, in contrast to common data-based models [33]. Despite these potential advantages, no research has yet used this technique in the area of plant micropropagation.
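The KNN idea described above can be illustrated with a minimal sketch. It assumes Euclidean distance and an unweighted mean of the neighbors' responses; the function name `knn_predict`, the two input attributes and all numbers are hypothetical and are not taken from the study's data.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict a response as the mean of the k nearest training responses."""
    # Euclidean distance from the query point to every training sample
    dists = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Sort by distance only, so the closest samples come first
    dists.sort(key=lambda pair: pair[0])
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical toy data: (BAP mg/l, IBA mg/l) -> proliferation rate
train_X = [(0.5, 0.01), (1.0, 0.01), (1.5, 0.05), (2.0, 0.10)]
train_y = [2.0, 4.0, 6.0, 3.0]

print(knn_predict(train_X, train_y, query=(1.1, 0.02), k=2))
```

In practice the inputs would usually be scaled first, because attributes measured in different units otherwise contribute unequally to the distance.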

In our previous study [20], we compared RBFNN and GEP for optimizing the composition of in vitro culture media for pear rootstocks. In our results, GEP was significantly more powerful and precise than RBFNN in predicting the quantity and quality of in vitro proliferation, so the GA technique was applied to optimize the GEP models [20]. However, GA optimized the input levels required for each specific output separately. Consequently, in our recent study [28], in order to obtain a complete optimum formulation for the culture medium, we compared two algorithms, GEP and the M5’ model tree, for predicting the effects of media minerals and PGRs on in vitro proliferation of pear rootstocks. GEP showed higher prediction precision than the M5’ model tree. We therefore optimized the GEP prediction models using multi-objective evolutionary optimization algorithms (MOEAs), including GA and PSO, and compared them with the mono-objective GA optimization procedure. The PSO-optimized GEP prediction models produced the best outputs in both rootstocks [28].

With MOEAs, inputs are evaluated as multi-objective optimization problems (MOPs), and the solutions identify the best possible balance between two conflicting objectives [34]. Several mathematical methods have recently been used to solve MOPs; however, real MOP applications are typically nonlinear and sometimes non-differentiable [35]. This has increased interest in metaheuristic methods, among which MOEAs are of special interest. Here, PSO, an evolutionary computation technique, was used to determine optimized culture media.
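The notion of a balance between conflicting objectives is usually formalized through Pareto dominance. The sketch below is only an illustration of that concept, assuming every objective is written as a quantity to minimize; the helper names and candidate values are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidate media scored as (-PR, CW):
# maximizing proliferation rate is rewritten as minimizing -PR.
candidates = [(-20.0, 0.3), (-24.0, 0.9), (-18.0, 0.6), (-22.0, 0.5)]
print(pareto_front(candidates))
```

Every point on the resulting front is a different trade-off; choosing one of them still requires a preference among the objectives.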

The aim of this study is to employ three soft computing methods, namely MLPNN, GEP and KNN, to compare their prediction accuracy with that of the multiple linear regression (MLR) technique, and to apply the PSO algorithm, with the goal of predicting and optimizing walnut tissue culture media. Briefly, the new contributions of the present research are:

  • Comparing the appropriateness of MLPNN, KNN and GEP nonlinear methods for modeling the impacts of mineral nutrients, PGRs and vitamins on in vitro culture of walnut.

  • Constructing hybrid models in order to assess how Chandler and Rayen explants respond to the culture medium composition, based on the newly produced shoots obtained under the Taguchi design.

  • Finding the optimal composition of culture media to maximize the proliferation rate (PR) and minimize callus weight (CW), shoot tip necrosis (STN) and vitrification (Vit) by optimizing the developed model using PSO.

To our knowledge, this study is the first application of MLR, KNN, ANN, GEP and PSO methods for optimizing walnut tissue culture media. In addition, this work is the first use of KNN modeling procedure in plant tissue culture.

Results

Models of the effects of the modified inputs (nutrients, PGRs and vitamins) on the outputs (PR, CW, STN and Vit) were developed using the MLR, MLPNN, KNN and GEP techniques. Here, we assess the performance of the developed models by evaluating the precision of each modeling method in predicting the composition of micropropagation media for walnut. The PSO optimization results for the selected modeling method are then examined to find the most efficient media compositions for each considered trait. An outline of the techniques used here to obtain the most appropriate model is shown in Fig. 1.

Fig. 1 Schema of the techniques used to construct prediction models for Persian walnut in vitro culture media

Comparison of modeling techniques performances

The mathematical equations obtained with the GEP method, which give the best estimates of the explant growth parameters, are shown in Table 1. The statistics calculated for the output variables (PR, CW, STN and Vit) with the MLR, MLPNN, KNN and GEP models are given in Table 2. Unlike MLR, the trained MLPNN, KNN and GEP models of PR, CW, STN and Vit gave balanced statistics for both the training and testing subsets (Table 2). For all output variables, the statistics for the KNN, MLPNN and GEP models showed considerably higher prediction accuracy than those for the MLR models. The calculated R2 values for MLPNN, KNN and GEP vs. MLR were: 0.672, 0.695 and 0.802 vs. 0.412 for PR of Chandler; 0.377, 0.354 and 0.428 vs. 0.178 for PR of Rayen; 0.923, 0.931 and 0.844 vs. 0.696 for CW of Chandler; 0.929, 0.930 and 0.839 vs. 0.276 for CW of Rayen; 0.855, 0.915 and 0.807 vs. 0.241 for STN of Chandler; 0.812, 0.831 and 0.808 vs. 0.341 for STN of Rayen; 0.974, 0.975 and 0.853 vs. 0.434 for Vit of Chandler; and 0.977, 0.978 and 0.891 vs. 0.299 for Vit of Rayen, respectively (Table 2).

Table 1 Constructed models using gene expression programming to predict explant growth traits in Persian walnut
Table 2 Evaluation of different developed models using various statistics for PR, CW, STN, and Vit of Persian walnut through in vitro proliferation

Comparing the observed and predicted values of the outputs helps to explain the performance of the developed models with respect to the studied inputs. Fitted lines and their squared correlation coefficients were used to produce plots from the constructed models, showing how each of the four outputs varied as the concentration of media components changed. The plots may be helpful for understanding the complete relationship between media components and responses, and for assessing the combined effects of modifying the components of the DKW medium. The values predicted by the MLR, MLPNN, KNN and GEP models are plotted against the observed values of PR, CW, STN and Vit in Figs. 2, 3, 4, 5, 6, 7, 8 and 9. Comparing the fitted simple regression lines of MLR with those of the ML models showed that MLR gave the lowest agreement between observed and predicted values for all considered outputs. The calculated R2 values for MLR vs. the three ML models were: 0.412 vs. 0.696, 0.672 and 0.802 for PR of Chandler (Fig. 2); 0.178 vs. 0.359, 0.377 and 0.428 for PR of Rayen (Fig. 3); 0.696 vs. 0.931, 0.924 and 0.844 for CW of Chandler (Fig. 4); 0.276 vs. 0.874, 0.930 and 0.840 for CW of Rayen (Fig. 5); 0.241 vs. 0.916, 0.856 and 0.807 for STN of Chandler (Fig. 6); 0.342 vs. 0.810, 0.813 and 0.809 for STN of Rayen (Fig. 7); 0.435 vs. 0.976, 0.975 and 0.853 for Vit of Chandler (Fig. 8); and 0.300 vs. 0.979, 0.978 and 0.891 for Vit of Rayen (Fig. 9). Therefore, the ML models were able to predict the outputs accurately, whereas the MLR models could not describe the extensive variation in growth parameters, owing to interactions among the studied variables that may hide the effects of media components.

Fig. 2 Observed vs. predicted values of proliferation rate (PR) related to A multiple linear regression (MLR); B multi-layer perceptron neural network (MLPNN); C k-nearest neighbors (KNN); D gene expression programming (GEP) developed models (n = 224) for walnut cv. Chandler

Fig. 3 Observed vs. predicted values of proliferation rate (PR) related to A multiple linear regression (MLR); B multi-layer perceptron neural network (MLPNN); C k-nearest neighbors (KNN); D gene expression programming (GEP) developed models (n = 224) for walnut cv. Rayen

Fig. 4 Observed vs. predicted values of callus weight (CW) related to A multiple linear regression (MLR); B multi-layer perceptron neural network (MLPNN); C k-nearest neighbors (KNN); D gene expression programming (GEP) developed models (n = 224) for walnut cv. Chandler

Fig. 5 Observed vs. predicted values of callus weight (CW) related to A multiple linear regression (MLR); B multi-layer perceptron neural network (MLPNN); C k-nearest neighbors (KNN); D gene expression programming (GEP) developed models (n = 224) for walnut cv. Rayen

Fig. 6 Observed vs. predicted values of shoot tip necrosis (STN) related to A multiple linear regression (MLR); B multi-layer perceptron neural network (MLPNN); C k-nearest neighbors (KNN); D gene expression programming (GEP) developed models (n = 224) for walnut cv. Chandler

Fig. 7 Observed vs. predicted values of shoot tip necrosis (STN) related to A multiple linear regression (MLR); B multi-layer perceptron neural network (MLPNN); C k-nearest neighbors (KNN); D gene expression programming (GEP) developed models (n = 224) for walnut cv. Rayen

Fig. 8 Observed vs. predicted values of vitrification (Vit) related to A multiple linear regression (MLR); B multi-layer perceptron neural network (MLPNN); C k-nearest neighbors (KNN); D gene expression programming (GEP) developed models (n = 224) for walnut cv. Chandler

Fig. 9 Observed vs. predicted values of vitrification (Vit) related to A multiple linear regression (MLR); B multi-layer perceptron neural network (MLPNN); C k-nearest neighbors (KNN); D gene expression programming (GEP) developed models (n = 224) for walnut cv. Rayen

According to the results presented in Table 2 and Figs. 2, 3, 4, 5, 6, 7, 8 and 9, the MLPNN, KNN and GEP models accurately predicted the effect of media components on the in vitro performance of Persian walnut. To select one of these ML modeling techniques for optimization and to obtain final models for in vitro proliferation of Persian walnut, we considered how easily the end user could apply the model. Although MLPNN and KNN performed relatively well, neither offers an explicit mathematical expression. Unlike MLPNN and KNN, which produce black-box models, GEP generates explicit mathematical equations relating the independent variables to the dependent variable. These equations can be optimized (to find the optimal values of the variables) and can be used by researchers in the pre-test stage (the initial phase of a study) when designing and developing their experiments. Hence, we selected the GEP models for optimization to obtain proliferation media formulations for Chandler and Rayen.

Optimization of GEP models

To obtain an optimized medium giving the highest PR and the lowest CW, STN and Vit in walnut, we optimized the developed GEP models using the multi-objective PSO technique.
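As a rough illustration of how PSO searches for an input combination, the sketch below runs a basic single-objective PSO on a weighted sum of two toy objectives. This is a simplification and not the multi-objective PSO used in the study: the surrogate functions `predicted_pr` and `predicted_cw`, the weight, the bounds and the swarm parameters are all hypothetical stand-ins for the fitted GEP equations and the settings actually used.

```python
import random

def predicted_pr(x):
    # Hypothetical surrogate for a proliferation-rate model (to maximize)
    return 20.0 - (x[0] - 1.5) ** 2 - (x[1] - 0.7) ** 2

def predicted_cw(x):
    # Hypothetical surrogate for a callus-weight model (to minimize)
    return 0.1 + 0.5 * (x[0] - 1.0) ** 2

def cost(x, weight=2.0):
    # Weighted-sum scalarization: high PR and low CW both lower the cost
    return -predicted_pr(x) + weight * predicted_cw(x)

def pso_minimize(cost, bounds, n_particles=30, n_iter=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen by each particle
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]   # best position seen by the swarm
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Clamp the new position to the allowed concentration range
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost

best, best_cost = pso_minimize(cost, bounds=[(0.0, 3.0), (0.0, 3.0)])
print(best, best_cost)
```

A weighted sum returns a single compromise that depends on the chosen weight, whereas a true multi-objective PSO keeps an archive of non-dominated solutions.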

The optimized amounts of the studied factors and the values of the growth parameters predicted by the GEP models are shown in Table 3. For Chandler, PSO optimization of the GEP models indicated that a medium containing 1.76 × NH4NO3, Ca(NO3)2 and Zn(NO3)2; 1.67 × KNO3; 0.96 × K2SO4; 0.66 × MgSO4, MnSO4 and CuSO4; 2.35 × KH2PO4, H3BO3 and Na2MoO4; 1.64 × FeEDDHA; and 1.89 × thiamine, nicotinic acid and glycine, relative to their concentrations in DKW medium, together with 0.67 mg/l BAP, 1.30 mg/l TDZ and 1.30 mg/l IBA, could lead to optimal PR (23.54), CW (0.12), STN (2.23) and Vit (9.95). For Rayen, a medium containing 0.73 × NH4NO3, Ca(NO3)2 and Zn(NO3)2; 0.69 × KNO3; 0.94 × K2SO4; 0.64 × MgSO4, MnSO4 and CuSO4; 0.83 × KH2PO4, H3BO3 and Na2MoO4; 1.35 × FeEDDHA; and 1.52 × thiamine, nicotinic acid and glycine, relative to their concentrations in DKW medium, together with 0.67 mg/l BAP, 1.23 mg/l TDZ and 1.23 mg/l IBA, could result in optimal PR (24.57), CW (0.64), STN (12.48) and Vit (3.04).

Table 3 Multi-objective PSO optimization of GEP models to achieve the highest quantity and quality through in vitro proliferation of walnut

Discussion

Walnut, one of the important woody plants, is considered recalcitrant to in vitro culture, and genetic determinism, together with other factors such as media components, further complicates the different stages of micropropagation. In the present study, three different ML modeling approaches, along with the PSO optimization algorithm, were applied to determine and predict the effect of genotype and media formulation during the proliferation of walnut. Walnut micropropagation can be improved by including different physiological disorders in the modeling and optimization processes. The incidence of physiological disorders during micropropagation of walnut has not been comprehensively investigated. Studies on walnut tissue culture have focused on adding chemicals such as phloroglucinol and FeEDDHA to DKW or Murashige and Skoog [36] (MS) basal media [11, 15], supplementing media with various concentrations of different PGRs [37,38,39], removing agar [39], and ventilation and reducing sucrose concentration [40], but only a few studies have focused on the effect of the interaction of media components, including mineral nutrients [9, 41], vitamins and PGRs [39], on proliferation quality and quantity.

Here, we concentrated on increasing PR and reducing important abnormalities occurring during this phase by recording data from several designed experiments. The resulting database, which includes a range of concentrations of each culture medium component, allows the effects of all minerals, vitamins and PGRs used in the media, as well as genotype, on the explant growth indices to be evaluated simultaneously, which is possible only with ML tools.

Machine learning is a powerful tool that has been effectively applied in plant biology studies [42, 43], including plant tissue culture data analysis and accurate prediction of optimal in vitro culture media composition [20, 24,25,26,27,28]. The development of in vitro plant tissues is controlled by the minerals, vitamins and PGRs in the culture media. Predicting the most efficient media composition is highly useful for achieving maximum explant performance, since optimizing the type and concentration of minerals, vitamins and PGRs in media experimentally is a time-consuming, expensive and laborious task [9, 41].

In our previous studies, we successfully constructed neural models using the ANN technique to study the effects of different combinations of minerals and PGRs on in vitro proliferation and rooting of the G × N15 Prunus rootstock [25,26,27]. Our study comparing ANN with MLR modeling to forecast the optimum concentrations of macronutrients in in vitro media for OHF 69 and Pyrodwarf Pyrus rootstocks showed ANN to be a precise and promising technique [24]. An important benefit of ANN-based methods is that they need no prior identification of a proper fitting function; consequently, they have a general approximation ability that lets them represent practically all kinds of non-linear functions. This trait may help the modeler develop the most precise model possible. Although ANN is a good alternative to MLR, it does not provide equations describing the relationships between input and output variables. Moreover, the ANN technique needs a time-consuming process of trial and error to find network parameters such as the number of neurons and hidden layers [44,45,46]. ANNs, as the most extensively used ML models, can efficiently solve different multivariate, non-linear and nonparametric problems via opaque “black box” training [47]. Nevertheless, this “black box” nature also has drawbacks [48]. In general, an ANN cannot explain its own reasoning, and this constraint makes ANNs less suitable for natural science studies, as they can only simulate the change process according to experimental data, without helping us understand the reason for the change.

Considering these restrictions of ANN models, in another study on in vitro proliferation of Pyrus rootstocks [20] we compared the power of the GEP technique with that of ANN (RBFNN) and MLR in predicting the optimal media. RBFNN and GEP were more precise than MLR, and GEP produced the most precise model while also being practical [20]. In our recent research [28], we used two algorithms, GEP and the M5’ model tree, to overcome the weaknesses of the ANN method and to simplify forecasting the effects of media component interactions on in vitro proliferation of Pyrus rootstocks. Again, we found GEP to be more accurate than the M5’ model tree [28].

Consequently, in the present study, we applied GEP as the most precise modeling procedure identified in [20, 28]; MLPNN as an ANN technique that gives precise predictions more readily than RBFNN when input data are randomly distributed [49]; and KNN as one of the simplest machine learning approaches, which can also be used for regression problems [50]. MLR was also applied as a linear modeling method for comparison with the above-mentioned ML procedures in predicting the optimum in vitro proliferation media composition for walnut. The accuracy of the developed prediction models was evaluated using the mean absolute error (MAE), root mean square error (RMSE) and R2 statistics and the correlation coefficient between the observed and predicted values of each output. To our knowledge, KNN algorithms have never been applied to predict plant tissue culture media composition. The advantage of the KNN algorithm is that it requires no specific assumptions about the distribution of the predictors. KNN predicts each sample from the mean of the responses of its k nearest neighbors in the predictor space [51]. Each training example is described by n traits and therefore represents a point in an n-dimensional space, so all training examples are stored in an n-dimensional pattern space. Here, the number of neighbors (k) leading to the best results for each model is presented in Table 3.
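For reference, the three accuracy statistics named above can be computed as in the following sketch. It assumes R2 is taken as the coefficient of determination (1 minus the ratio of residual to total sum of squares); the text does not state which definition was used, and the observed and predicted values below are hypothetical.

```python
import math

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def r2(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and predicted proliferation rates
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
print(mae(observed, predicted), rmse(observed, predicted), r2(observed, predicted))
```

Reporting these statistics separately for the training and testing subsets is what reveals whether a model is balanced or overfitted.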

A key advantage of GP-based procedures such as GEP over other methods is that they need no prior assumption about the form of the relationship in order to produce prediction equations. GP and its variants have been applied in many studies to find complicated relationships that fit different experimental data [52,53,54]. The technique employs a population of individuals, from which better individuals are selected using a fitness function and then modified by genetic variation, which is introduced by genetic operators. Machine learning approaches including GEP are designed to learn the relationships among variables in data collections. GEP differs from its precursors, GA and GP, in how individuals are encoded: in GEP, individuals are encoded as chromosomes, i.e. fixed-length linear strings, which are finally expressed as a simple diagram called an expression tree. In GA and GP, by contrast, individuals are expressed as chromosomes and as nonlinear entities of different sizes and shapes (parse trees), respectively. One strength of GEP over GA and GP is that genetic operators act very simply at the chromosome level, which promotes genetic diversity. The unique multigenic nature of GEP is another important point, as it allows more complicated programs composed of multiple sub-programs to be evolved. GEP thus combines the advantages of both GA and GP while overcoming some of their limitations [55].
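The step from a fixed-length linear chromosome to an expression tree can be sketched as follows. The example assumes the usual breadth-first reading of a GEP gene (Karva notation) with a small function set of binary operators; the gene string, symbol names and input values are hypothetical and unrelated to the equations in Table 1.

```python
import operator

# Function set: symbol -> (implementation, number of arguments)
FUNCS = {"+": (operator.add, 2), "-": (operator.sub, 2), "*": (operator.mul, 2)}

def decode(gene):
    """Build an expression tree from a gene, filling the tree level by level."""
    symbols = iter(gene)
    root = [next(symbols), []]          # node = [symbol, children]
    queue = [root]
    while queue:
        node = queue.pop(0)
        arity = FUNCS[node[0]][1] if node[0] in FUNCS else 0
        for _ in range(arity):
            child = [next(symbols), []]
            node[1].append(child)
            queue.append(child)
    return root                          # unread trailing symbols are non-coding

def evaluate(node, env):
    """Evaluate the tree for the variable values given in env."""
    sym, children = node
    if sym in FUNCS:
        return FUNCS[sym][0](*(evaluate(c, env) for c in children))
    return env[sym]

# Head "*+a" (length 3) and tail "bcda" (length 4): the gene always has 7 symbols,
# but only "*+abc" is expressed here, giving (b + c) * a.
tree = decode("*+abcda")
print(evaluate(tree, {"a": 2.0, "b": 3.0, "c": 5.0, "d": 1.0}))
```

Because the tail holds only terminals, any mutation of the string still decodes to a valid tree, which is why genetic operators can act so simply at the chromosome level.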

Based on the results presented in Table 2, the KNN, MLPNN and GEP models were much more accurate than MLR. In most cases, the MLPNN method provided a better fit than KNN and GEP. However, according to our aforementioned studies [20, 28], the optimized GEP method provides a better fit than other approaches. Furthermore, GEP is preferred over ANN models because an ANN is a black-box model, whereas GEP expresses the constructed prediction models as mathematical equations [54].

In recent years, GEP has been applied extensively in other areas because of its high efficiency and effectiveness. GEP applications are wide-ranging and growing rapidly [55]. GEP is one of the most effective function mining algorithms and has been widely used in classification, pattern recognition, prediction and other research areas. The algorithm can mine a suitable function to deal with more complex tasks [56]. GEP has been used to determine the quality of, and stress on, water in lakes or rivers resulting from wastewater pollutants [31]. The problem of missing values in a data set caused by the measurement conditions can also be solved simply by employing GEP [31]. Results based on real data sets confirmed that a combination of multiple GEP and a fuzzy expert system outperforms other detection methods in the medical field by attaining high prediction precision [57].

Our previous studies [20, 24, 28] on pear rootstocks using ML-based modeling showed that responses to the concentrations of macronutrients and PGRs differ among genotypes, as we found here in Persian walnut varieties. Because of the complex interactions, identifying the optimum levels of minerals and PGRs for a given plant genotype is complicated [58]. Furthermore, the incidence of physiological disorders such as Vit and STN throughout the proliferation phase of walnut calls for improved media for optimal explant growth. Optimized and effective media have previously been constructed for different plant species using sound mathematical modeling and optimization methods [17, 20, 24,25,26,27,28, 59,60,61]. Here, we therefore propose ML-based modeling to identify the concentrations of minerals and PGRs that maximize PR while minimizing CW, STN and Vit [24]. As we found here (Table 3), our previous results on pear [20, 24] showed that ANN prediction models had higher precision than MLR models and that MLR is not a trustworthy method for assessing nonlinear or non-polynomial relationships among variables.

Our recent study on pear rootstock micropropagation [28] showed that multi-objective PSO was the most efficient method for optimizing GEP models. Therefore, we used multi-objective PSO here to optimize the selected GEP models. Our GEP-PSO optimized models provided complete optimized formulations for the proliferation of Chandler and Rayen (Table 3).

The mono-objective GA-optimized MLPNN- and RBFNN-based models obtained in our previous studies [20, 24] on Pyrodwarf and OHF Pyrus rootstocks showed the significance of some minerals, such as NH4+ and NO3−, and/or PGRs for explant proliferation. Our previous research [25] on the G × N15 Prunus rootstock, using mono-objective GA-optimized ANN models, found that NO3−, NH4+, Ca2+, K+ and PO42− were more important than Mg2+, Cl− and SO42− for in vitro proliferation. Our recent study [28] on Pyrus rootstocks using mono-objective GA optimization of GEP models indicated that a high PR may produce low-quality plantlets. In line with this, our study [25] on G × N15 using mono-objective GA optimization of ANN models predicted that increasing the NH4+ concentration enhances shoot number and length but also the number of unhealthy shoots, whereas decreasing NH4+ improves plantlet quality. Our results [28] on pear rootstocks using RBFNN and GEP modeling also indicated that a lower nitrogen content results in higher-quality plantlets. The interaction of NH4+, NO3− and K+ has been the main subject of most in vitro studies [62], but using ML models, [63] reported an interaction of K+, EDTA and SO42−, with a critical effect of K+ on the PR of pistachio: low and high concentrations of K+ resulted in the highest and lowest PR, respectively. A study on Prunus sp. also showed that K+ at low concentration promotes PR [64]. Nezami-Alanagh et al. [63] concluded that either too low or too high amounts of K+, EDTA and SO42− ions result in low-quality plantlets. Considering macro- and micro-elements, our multi-objective PSO-optimized GEP models for Chandler showed that increasing NH4+, NO3− and SO42− increased PR and Vit while decreasing CW and STN. In Rayen, however, increasing SO42− (except from K2SO4) and NO3− (except from KNO3) increased PR and CW while decreasing STN and Vit (Table 3).

Reed et al. [65] emphasized optimizing the nitrogen components of the culture medium to stimulate a high number of elongated shoots and reduce the amount of callus in different pear species. Nezami-Alanagh et al. [66] suggested avoiding a high NH4+ content to reduce callus formation in in vitro pistachio shoots. Low amounts of some MS medium components, such as KNO3, MgSO4, KH2PO4, CaCl2 and NH4NO3, have been reported to promote STN in some Pyrus species [67]. By contrast, in our results lower concentrations of K2SO4, MgSO4, MnSO4 and CuSO4 in Chandler, and of K2SO4 and KNO3 in Rayen, reduced the occurrence of STN. The results of [63], obtained using neurofuzzy logic, showed that a low amount of K+ and mid to high concentrations of SO42− inhibit STN in pistachio explants, with fewer symptoms in UCB1 than in Ghazvini; this points to genotype differences like those found in our study. Here again, the ion confounding problem prevents the exact relationship between a given mineral and the physiological disorder from being determined.

The neurofuzzy logic procedure showed a positive linear effect of nicotinic acid and pyridoxine-HCl on shoot multiplication parameters in pistachio [68], but, to our knowledge, no study has addressed the impact of vitamins on walnut proliferation. Nezami-Alanagh et al. [66] showed that glycine and thiamine-HCl had different effects on some in vitro disorders of pistachio, and that increasing the glycine content strongly reduced callus development. Our study showed that a higher vitamin content reduced CW in Chandler, whereas the reduced vitamin content in Rayen led to higher CW (Table 3). Rayen was more recalcitrant to micropropagation than Chandler; hence, achieving a higher PR and a lower incidence of STN and Vit can compensate for the small increase in CW.

Genotype is an important factor influencing the occurrence of physiological disorders in walnut, which agrees with the reports of [63, 66] in pistachio. Similarly, other studies on pear [20, 24, 28, 67] found that the incidence of in vitro physiological disorders caused by unbalanced mineral nutrition differed among genotypes.

The purpose of our study was to present a highly accurate ML approach for predicting optimized culture media. We applied MLPNN, KNN and GEP, combined with PSO, to walnut proliferation data sets to achieve the most appropriate proliferation results. Comparison of our results with previous ones [20, 24,25,26,27,28, 63, 64, 66] indicates that using at least two methods together gives more precise results. Comparing the results of the methods used showed which media components enhanced or reduced the measured parameters (Table 3). The efficiency of the developed optimized media was compared with that of DKW. The media constituents proposed by our PSO-optimized GEP models for Chandler showed that decreasing K2SO4, MgSO4, MnSO4, CuSO4 and BAP while increasing the other nutrients, PGRs and vitamins increased PR as well as Vit, while reducing CW and STN. The outcome was slightly different for Rayen, in which decreasing K2SO4, KNO3, vitamins and BAP while increasing the remaining nutrients and PGRs gave higher PR and CW but lower STN and Vit (Table 3). The use of macro- and micro-nutrient salts as factors in many micropropagation studies [20, 24, 28, 63, 68] gives rise to the ion confounding problem, which makes it difficult to identify precisely which ion(s) affect the studied parameter [69]. Compared with previous studies on walnut [11, 15, 37,38,39], which addressed the effects of minerals and/or PGRs, our results show for the first time that the effects of minerals depend not only on the PGR concentrations used but also on the vitamin concentration, which itself affects the explant response. The interaction of minerals, PGRs and vitamins could determine the quantity and quality of proliferated plantlets. Plant species and genotype are also highly important in predicting the explant growth response to the interaction of minerals, PGRs and vitamins.

PGR interactions also complicate the regulation of plant growth processes. Cytokinin controls cell proliferation [70], and auxin enhances the sensitivity of the less mitotically active cells of the apical meristem to cytokinin [71]. The cytokinin to auxin ratio is a key signal that controls phenotype [72], as auxin and cytokinin have roles in DNA replication and cell cycle regulation, respectively [73]. The effects of PGRs may vary with plant species. Results on Prunus rootstock [26] indicated that applying cytokinin and auxin together gives higher PR than applying either alone. According to those results, PGR concentration and interaction are also important. Based on these results and the findings of [74] and [75], we used various concentrations of BAP, TDZ and IBA in our experiments. Our adverse results can be attributed to the interaction of genotype and culture medium constituents [76] with PGRs [20]. Type and concentration of cytokinin strongly affected in vitro growth and survival of black walnut [39]. Ref. [37] reported that lower concentrations of zeatin were better than BAP for fast shoot elongation of black walnut nodal explants, while higher levels of zeatin and BAP led to shoot necrosis. Using TDZ at 0.01–0.02 mg/l in the medium increased the rate of morphological disorders [37]. In our present study, however, higher levels of TDZ (1.30 and 0.52 mg/l in Chandler and Rayen, respectively) reduced STN in both Chandler and Rayen. Juglans regia has been successfully micropropagated using 0.1–2.01 mg/l BAP [4, 8, 12, 77,78,79,80,81]. The BAP concentrations we used (0.67 and 0.99 mg/l for Chandler and Rayen, respectively) are also within this range. There are no reports in the literature on the effect of BAP on the incidence of in vitro physiological disorders in walnut. However, in vitro studies on other plant species such as pistachio [64, 82, 83] show that adding an adequate amount of BAP strongly decreases the incidence of STN.

Therefore, in the present study, we evaluated the interaction of cytokinin and auxin PGRs with medium components, including nutrients and vitamins, on walnut proliferation to achieve the most efficient protocol with a reasonable range of PGRs. Our analyses showed that PSO-optimized GEP modeling can be used as an efficient procedure for evaluating the interaction of different factors on walnut explant growth indices in the proliferation phase. GEP is therefore introduced here, for the first time, as a tool for optimizing walnut tissue culture protocols of higher quality and efficiency in less time.

Callus development during explant proliferation is a common problem in walnut micropropagation. Here it decreased as PR increased in Chandler, whereas it increased with PR in Rayen (Table 3). Yegizbayeva et al. [15] reported that callus formation is not correlated with PR in walnut. Callus formation has been attributed to certain concentrations of different mineral nutrients in various plant species, such as KH2PO4, CaCl2 and MgSO4 in some Prunus cultivars [67], NO3 in Rubus germplasm [84] and MgSO4 in Prunus armeniaca [85]. Using CHAID analysis, Akin et al. [86] reported NH4+, followed by genotype and SO42−, as significant factors affecting callus formation in hazelnut in vitro proliferation. Using neurofuzzy logic, Nezami-Alanagh et al. [63] predicted that high Fe2+ and low SO42− concentrations result in the lowest callus formation in pistachio rootstock explants. They suggested that a lower concentration of SO42− in MS reduces shoot tip necrosis and callus development in pistachio in vitro proliferation. Our results showed that lower concentrations of both FeEDDHA and SO42−-containing minerals in DKW caused lower CW in both Chandler and Rayen. Bosela et al. [37] showed that high-salt media, i.e. DKW and MS, resulted in lower Vit than WPM and 1/2X DKW media in walnut.

Conclusions

Walnut micropropagation is a problematic process with many in vitro drawbacks, including necrosis, callusing and vitrification. The present study demonstrated the efficiency of predictive models of plant in vitro proliferation built with advanced ML modeling procedures. A regression model, i.e. MLR, and three advanced ML models, MLPNN, KNN and GEP, were constructed to predict walnut in vitro PR and associated physiological disorders under the effect of culture medium constituents and genotype. Based on the results, the following conclusions and suggestions are presented:

  • Advanced computational models are highly precise approaches that can be applied to control and predict walnut explant in vitro performance. They can also be employed as a well-performing alternative to linear regression and conventional statistical analysis methods. The KNN model has been used for the first time in this study for predicting plant in vitro performance. The optimized models should be applied to predict walnut PR in experimental designs for controlling undesirable physiological disorders.

  • All ML models performed accurately in forecasting PR, CW, STN and Vit. Nevertheless, the accuracy of the GEP models was mostly higher than that of the ANN and KNN models. The GEP models were therefore selected for optimization by PSO in order to achieve optimal culture media.

  • Using the above-mentioned ML models is extremely useful for reducing the time and cost of formulating efficient walnut tissue culture media.

  • The ML-designed media for walnut can not only raise PR (especially for Chandler) but also, simultaneously, reduce CW, STN and Vit.

  • Genotype is a very important factor affecting in vitro performance. Based on our results, Rayen, an unbred genotype, appears to be more recalcitrant to in vitro propagation than the bred cultivar Chandler.

  • Other factors such as sucrose, along with the medium components studied here and their interactions, also need to be incorporated into the predictive model of PR and the occurrence of physiological disorders so that PR can be controlled comprehensively.

Methods

MLR, MLPNN, KNN and GEP modeling techniques were used to build models with various combinations of minerals, vitamins and PGRs at different concentrations as inputs and different proliferation indices as outputs. The selected models were then optimized using PSO. Two case studies were carried out with the walnut cultivar Chandler and the genotype Rayen; the procedures used to find the optimized input combinations are detailed below.

Case studies

In vitro established nodal cultures of Chandler and Rayen were sub-cultured in modified DKW media supplemented with various concentrations of auxin and cytokinin PGRs, 30 g/l sucrose and 3 g/l Gelrite. After adjusting the pH to 5.5, the media were dispensed into jam jars (250 ml) with polyethylene caps and autoclaved for 15 min at 121 °C and 1 kg cm−2. The cultures were kept under 16 h of white fluorescent light (80 µmol m−2 s−1) at 25 ± 2 °C for 30 days. Subsequently, PR, CW, STN and Vit were measured. In each experimental set, every treatment included 8 replicates (jam jars) for both Chandler and Rayen.

Taguchi experimental design for optimization of explant proliferation

Taguchi design is a robust and effective tool for process optimization that performs consistently and optimally across different conditions. Taguchi designs, i.e. orthogonal arrays, make it possible to evaluate numerous factors with a limited number of runs. In this design, no factor is weighted more or less than the others within an experiment, so all factors can be analyzed independently of each other. Noise factors such as human error cause the functional characteristics of a product to deviate from their target values. Based on Taguchi's orthogonal arrays, a standard L27 (3^5) orthogonal array with 27 experiments and 26 degrees of freedom was applied to each of Chandler and Rayen to evaluate the effect of nine factors (Table 4) on PR, CW, STN and Vit. For each experiment, the three levels of each factor were based on various coefficients × DKW basal medium nutrients and different PGR concentrations (Table 5). Every nutrient and PGR concentration treatment included at least 8 replicates. Among 224 experimental sets, 157 (70% of data lines) were randomly chosen for training the models and the remaining 67 (30% of data lines) were used for testing the models' generalization capacity. In all ML models, the k-fold (k = 10) cross validation method [87, 88] was used during training to maintain and guarantee the generalizability of the constructed models.
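
The data partitioning described above can be sketched as follows. This is a minimal illustration and not the procedure of the software used in the study: the function names, the fixed random seed and the way folds are cut are assumptions made for the example, and only the 224 sets, the 70/30 split and k = 10 come from the text.

```python
import random

def train_test_split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and cut them into training and testing parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # seed is an arbitrary assumption
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]

def k_fold_indices(indices, k=10):
    """Yield (fit, validation) index lists for k-fold cross validation."""
    folds = [indices[i::k] for i in range(k)]     # k nearly equal, disjoint folds
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation

train_idx, test_idx = train_test_split_indices(224)
print(len(train_idx), len(test_idx))  # 157 67
```

Each of the ten (fit, validation) pairs is drawn from the 157 training sets only, so the 67 testing sets remain unseen until the final evaluation.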

Table 4 The components and levels of factors used for walnut micropropagation and measured traits mean values applied to characterize it
Table 5 The components of factors and experimental runs ranges based on DKW medium

Modeling techniques

Multiple linear regression

MLR analysis is a multivariate statistical method for assessing the relationship between multiple independent variables and a single dependent variable. Its two main purposes are prediction and explanation. Prediction concerns the extent to which the independent variables can predict the dependent variable. Explanation concerns the estimated regression coefficients, their sign, magnitude and statistical significance, for each independent variable [89]. Linear regression is regarded as the first statistical regression method and serves as a benchmark for newer methods. As in other regression methods, MLR models the relationship between a dependent variable and multiple independent variables by fitting a linear equation to the experimental data. MLR relates the values of the independent variables k to the value of the dependent variable M. The regression equation for n input variables k1, k2, …, kn is as follows:

$$M={\alpha }_{0}+{\alpha }_{1}{k}_{1}+\dots +{\alpha }_{n}{k}_{n}$$

in which M is the dependent variable, k (k1, …, kn) denotes the vector of input variables, α0 is the intercept (a constant), and α is the vector of regression coefficients, one for each explanatory variable. The experimental values of the dependent variable are assumed to have different means and an identical standard deviation ε. The SPSS 19 software package was used for the MLR modeling.
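
To make the regression equation concrete, the sketch below estimates the intercept and coefficients by ordinary least squares through the normal equations. It is an illustration only and does not reproduce the SPSS procedure used in the study; the function names and the small synthetic data set are assumptions.

```python
def fit_mlr(rows, targets):
    """Ordinary least squares: returns [alpha_0, alpha_1, ..., alpha_n]."""
    design = [[1.0] + list(row) for row in rows]   # leading 1 carries the intercept
    p = len(design[0])
    # Normal equations (X^T X) alpha = X^T M, stored as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in design) for j in range(p)]
           + [sum(r[i] * m for r, m in zip(design, targets))]
           for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]

def predict_mlr(alpha, row):
    """Evaluate M = alpha_0 + alpha_1 k_1 + ... + alpha_n k_n for one input row."""
    return alpha[0] + sum(a * k for a, k in zip(alpha[1:], row))

# Synthetic data generated from M = 2 + 3*k1 - 1*k2 (not data from the study).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
targets = [2, 5, 1, 4, 7]
alpha = fit_mlr(rows, targets)
```

Because the synthetic responses are noise-free, the fitted coefficients recover the generating values up to rounding error.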

Multilayer perceptron neural network

Neural networks are divided into various types according to their transfer functions. In the present study, we used a multilayer perceptron neural network (MLPNN). The MLPNN is the most common and widely used type of artificial neural network [90]. This model contains an input layer and an output layer, with one or more hidden layers between them. Each neuron in this structure has a number of inputs and a number of outputs. A neuron calculates its output by applying an activation (transfer) function to the weighted sum of all its inputs. In the MLPNN model, information flows in only one direction: from the input layer (independent variables), through the hidden layer(s), to the output layer (dependent variable). Training an MLPNN involves adjusting the weights of the connections between neurons using different network training methods [91]. In this study, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) training algorithm was used. The hyperbolic tangent (Tanh), sigmoid (Logs), exponential (Exp) and rectified linear unit (Relu) activation functions in the hidden layer and the linear function (Idn) in the output layer were compared and evaluated, and the best function was selected. The number of hidden layers was determined by trial and error to minimize the error rate. See [91, 92] for more information.
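
The one-directional flow of information through the network can be sketched as a forward pass. The example below assumes a single hidden layer with Tanh activation and an identity output, which is only one of the configurations compared in the study; the function name and all weight values are hypothetical, and BFGS training is not shown.

```python
import math

def mlp_forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer with tanh activation and an identity (linear) output."""
    # Each hidden neuron applies tanh to the weighted sum of all inputs plus a bias.
    hidden = [math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
              for weights, bias in zip(hidden_weights, hidden_biases)]
    # The output neuron returns the weighted sum of hidden outputs unchanged.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output.
prediction = mlp_forward(
    inputs=[0.5, 0.2],
    hidden_weights=[[1.0, -1.0], [0.5, 0.5]],
    hidden_biases=[0.0, 0.1],
    output_weights=[2.0, -1.0],
    output_bias=0.3,
)
```

Training would adjust the weights and biases so that such predictions approach the observed responses.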

k-nearest neighbors algorithm

The k-nearest neighbors (KNN) model is a supervised machine learning algorithm used for classification and, as here, for predicting continuous responses. In this model, each data point is represented by a coordinate position in a vector space, and points that belong together are expected to have similar properties and lie close to each other. In the KNN algorithm, the number of neighbors (k) and the method used to calculate the distance between points are of particular importance. If k is too small, the prediction rests on very few neighboring points and becomes sensitive to noise, which reduces the accuracy of the results. On the other hand, if k is too large, distinct groups may be merged and the computational cost increases [93]. Different values of k were evaluated to find the one giving the highest model accuracy. Distances between neighboring points were determined using various geometric methods. In this study, the Euclidean, Chebyshev, Manhattan and Minkowski distances were evaluated and the best method was selected.
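
As an illustration of how k and the distance measure enter the prediction, the sketch below predicts a continuous response as the unweighted mean of the k nearest training responses. The averaging rule, the function names and the toy data are assumptions for the example and may differ from the implementation used in the study.

```python
def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan and p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev(a, b):
    """Chebyshev distance: the largest coordinate-wise difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_x, train_y, query, k=3, distance=minkowski):
    """Predict a continuous response as the mean of the k nearest training responses."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: distance(pair[0], query))
    nearest = [y for _, y in ranked[:k]]
    return sum(nearest) / len(nearest)

# Toy one-dimensional data (not data from the study).
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [1.0, 2.0, 3.0, 50.0]
print(knn_predict(train_x, train_y, (1.1,), k=3))  # 2.0
```

With k = 3 the distant point at 10.0 is ignored; with k = 4 it would be averaged in and pull the prediction far from its neighbors, which illustrates the effect of choosing k too large.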

Gene expression programming

Genetic programming (GP) is a modeling approach that has been used, for example, to model the behavior of complex structural engineering problems. It is an extension of the genetic algorithm that searches a program space rather than a data space. An important advantage of GP-based techniques over other methods is their ability to produce prediction equations without assuming any prior form for the relationship. Many researchers have applied GP and GP-based methods to find complicated relationships fitting different experimental data [44, 94, 95]. GEP was introduced as an effective alternative to conventional GP [31, 46]. In GEP, computer programs are encoded in linear chromosomes of fixed length, each of which comprises several genes [31, 96]. GEP derives from evolutionary algorithms such as GA and GP. In this technique, a population of individuals is created, and a fitness function and genetic variation are then used to select better individuals. Genetic variation is introduced by genetic operators. GEP is a machine learning method that learns the relationships between variables in datasets. GEP differs from its predecessors GA and GP in how individuals are represented: GEP encodes individuals as fixed-length linear strings (chromosomes) that are finally expressed as expression trees, i.e. simple diagrams. In contrast, GA represents individuals as fixed-length linear strings (chromosomes), and GP represents them as nonlinear entities of diverse shapes and sizes (parse trees). One strength of GEP over GP and GA is that genetic operators work very simply at the chromosome level, which creates genetic diversity. Another strength of GEP is its unique multigenic nature, which allows more complicated programs with numerous subprograms to be evolved. GEP thus combines the advantages of both GP and GA while overcoming some of their limitations [57, 97, 98].

The phenotype of a GEP chromosome is illustrated in Fig. 10, and the genotype can be derived simply from the phenotype, as represented in Eq. (1):

Fig. 10 Diagram of gene expression programming as a prediction model

$$\left(a+b\right)*(c-d)$$
(1)
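
The link between a linear gene and the expression in Eq. (1) can be illustrated with a short decoder. The sketch assumes the breadth-first (Karva) gene notation commonly used in GEP, in which Eq. (1) reads "*+-abcd"; the notation, the function names and the restriction to two-argument functions are assumptions for the example and are not taken from the GeneXpro configuration.

```python
import operator

FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def decode(k_expression):
    """Build an expression tree from symbols listed level by level (breadth-first)."""
    symbols = iter(k_expression)
    root = [next(symbols)]            # each node is [symbol, child, child, ...]
    level = [root]
    while level:
        next_level = []
        for node in level:
            if node[0] in FUNCTIONS:  # every function here takes two arguments
                for _ in range(2):
                    child = [next(symbols)]
                    node.append(child)
                    next_level.append(child)
        level = next_level
    return root                       # symbols left unread form the unused gene tail

def evaluate(node, values):
    """Compute the value of the tree for given terminal values."""
    if node[0] in FUNCTIONS:
        return FUNCTIONS[node[0]](evaluate(node[1], values), evaluate(node[2], values))
    return values[node[0]]

def to_infix(node):
    """Write the tree as a fully parenthesized algebraic expression."""
    if node[0] in FUNCTIONS:
        return "(" + to_infix(node[1]) + node[0] + to_infix(node[2]) + ")"
    return node[0]

tree = decode("*+-abcd")
print(to_infix(tree))                                    # ((a+b)*(c-d))
print(evaluate(tree, {"a": 1, "b": 2, "c": 5, "d": 3}))  # 6
```

Because decoding stops as soon as the tree is complete, a fixed-length gene can carry extra trailing symbols without changing the expression, which is what lets genetic operators act on the linear string without producing invalid programs.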

The functional steps of GEP are represented in Fig. 10 [31]. According to this diagram, GEP starts with a population of chromosomes. The genes of the chromosomes are then expressed, and the fitness of each individual is evaluated. Next, individuals are selected according to their fitness to reproduce with modification. The same process is applied to the new generation of individuals. Overall, this procedure is repeated for a given number of generations or until a termination condition is reached. The GEP system uses roulette wheel sampling with elitism to ensure that the top individuals, according to fitness, are retained and copied to the next generation. Diversity is introduced into the population when genetic operators, comprising mutation, crossover and rotation, are applied to the chosen chromosomes.
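
The selection step described above, roulette wheel sampling with elitism, can be sketched as follows. This is a simplified illustration: the function name is hypothetical, fitness values are assumed to be non-negative with a positive sum, a single elite individual is kept, and the genetic operators that would subsequently modify the selected chromosomes are omitted.

```python
import random

def next_generation(population, fitness, rng):
    """Keep the fittest individual, then fill the rest by fitness-proportional draws."""
    scores = [fitness(individual) for individual in population]
    best = population[scores.index(max(scores))]            # elitism: copied unchanged
    # Roulette wheel: the chance of being drawn is proportional to fitness.
    drawn = rng.choices(population, weights=scores, k=len(population) - 1)
    return [best] + drawn

# Toy population whose fitness is simply the value of each individual.
population = [1, 2, 3, 4]
new_population = next_generation(population, lambda x: float(x), random.Random(0))
```

Elitism guarantees that the best fitness in the population never decreases from one generation to the next, whatever the random draws are.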

The GeneXpro software package was used to build the GEP models. The parameters employed in the GEP models are given in Table 6.

Table 6 Parameters of training GEP model

In this study, the selected functions and mathematical operators are a reasonable choice rather than a fixed one, so the model designer is free to select such functions according to the nature of the problem studied. The functions and operators were selected with a view to keeping the developed model simple and ensuring faster convergence. The population size (number of chromosomes) sets the number of programs in the population. The larger the population, the longer each iteration takes. Large numbers of chromosomes were tried to obtain models with minimum error. The program was run until there was no significant improvement in the models' performance. Here, we aimed to obtain an explicit relationship between decision variables and response variables. Explicit GEP formulations were obtained for PR, CW, STN and Vit as functions of the experimental parameters, i.e. Y1, Y2, Y3 and Y4 = f (X1, X2, X3, X4, X5, X6, X7, X8 and X9) (Table 1).

Input data were normalized to the range 0–1 according to Eq. 2:

$${X}_{n}=\frac{{X}_{i}-{X}_{min}}{{X}_{max}-{X}_{min}}$$
(2)

where Xn is the normalized dimensionless value, Xi is the observed value, and Xmin and Xmax are the minimum and maximum of the observed data, respectively.
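
Eq. 2 translates directly into code. In the sketch below the function name and the example values are assumptions; the guard against equal minimum and maximum is added because Eq. 2 is undefined in that case.

```python
def min_max_normalize(values):
    """Rescale observed values to the range 0-1 following Eq. 2."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("all values are equal, so Eq. 2 is undefined")
    return [(x - x_min) / (x_max - x_min) for x in values]

print(min_max_normalize([0.5, 1.0, 1.5]))  # [0.0, 0.5, 1.0]
```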

Comparison of the performance of developed models

To evaluate the precision of the developed models, we used different statistical indices, including the coefficient of determination (R2), root mean square error (RMSE) and mean absolute error (MAE), based on Eqs. 5, 6 and 7:

$${R}^{2}=\frac{{\left({\sum }_{i=1}^{N}\left({y}_{i}-\overline{y }\right)\left({\widehat{y}}_{i}-\overline{\widehat{y} }\right)\right)}^{2}}{{\sum }_{i=1}^{N}{\left({y}_{i}-\overline{y }\right)}^{2}{\sum }_{i=1}^{N}{\left({\widehat{y}}_{i}-\overline{\widehat{y} }\right)}^{2}}$$
(5)
$$RMSE=\sqrt{\frac{1}{N}{\sum }_{i=1}^{N}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}$$
(6)
$$MAE=\frac{1}{N}{\sum }_{i=1}^{N}\left|{y}_{i}-{\widehat{y}}_{i}\right|$$
(7)

where \(y\) and \(\overline{y }\) are the observed values and their mean, \(\widehat{y}\) and \(\overline{\widehat{y} }\) are the predicted values and their mean, and N is the number of samples. The parameters were analyzed together to arrive at an accurate medium composition.

In addition, the values predicted by the developed models were plotted against the corresponding observed values to evaluate the predictive ability of the models.
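
The three indices can be computed as in the sketch below. It assumes that R2 is the squared Pearson correlation between observed and predicted values, with each series centered on its own mean; the function names and example values are hypothetical.

```python
import math

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    covariance = sum((y - mean_obs) * (p - mean_pred) for y, p in zip(observed, predicted))
    ss_obs = sum((y - mean_obs) ** 2 for y in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return covariance ** 2 / (ss_obs * ss_pred)

def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed))

def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / len(observed)

# Hypothetical values: every prediction is 0.5 above the observation.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
print(r_squared(observed, predicted), rmse(observed, predicted), mae(observed, predicted))
```

In this example R2 equals 1 although every prediction is biased by 0.5, which is why RMSE and MAE are reported alongside R2.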

Particle swarm optimization of GEP models

PSO is a population-based evolutionary computation and swarm intelligence algorithm for solving global optimization problems that was developed by [99]. It is a mathematical computation method that starts with a swarm (a population of particles) and is mostly based on social models such as swarm theory, fish schooling and bird flocking [20]. The key feature of swarm behavior is that members keep optimum distances from their neighbors. To optimize its location, each particle's position is adjusted according to the objective function within the search space. Thus, the key quantity in PSO is the particle velocity, which is updated from its previous value in each iteration to lead the particle toward its optimal position. The best solution (fitness) that each particle in the swarm has achieved so far is named pbest. The best value achieved so far by any particle in the population, which is tracked by the particle swarm optimizer, is the global best, named gbest. The velocity and position of each particle in the swarm are updated by Eqs. 3 and 4 [99].

$${V}_{i+1}=w{V}_{i}+{c}_{1}{r}_{1}\left({pBest}_{i}-{X}_{i}\right)+{c}_{2}{r}_{2}({gBest}_{i}-{X}_{i})$$
(3)
$${X}_{i+1}={X}_{i}+{V}_{i+1}$$
(4)

in which Vi+1 is the new velocity of each particle based on its prior velocity (Vi), w is the inertia coefficient (0.8–1.2), c1 and c2 are the cognitive and social coefficients, respectively (0–2), r1 and r2 are random values drawn for each velocity update (0–1), and Xi+1 is the new location of each particle based on its prior location (Xi).
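
Eqs. 3 and 4 can be put together in a short optimization loop. The sketch below minimizes a hypothetical two-variable objective rather than a fitted GEP model; the function names, swarm size, number of iterations, coefficient values (chosen within the ranges stated above) and the clipping of positions to the search bounds are all assumptions for the example.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, n_iterations=100,
                 w=0.8, c1=1.2, c2=1.2, seed=0):
    """Minimize objective over box bounds using the updates of Eqs. 3 and 4."""
    rng = random.Random(seed)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    velocities = [[0.0] * len(bounds) for _ in range(n_particles)]
    p_best = [p[:] for p in positions]
    p_best_score = [objective(p) for p in positions]
    best = min(range(n_particles), key=lambda i: p_best_score[i])
    g_best, g_best_score = p_best[best][:], p_best_score[best]
    for _ in range(n_iterations):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Eq. 3: inertia + cognitive pull (pbest) + social pull (gbest).
                velocities[i][d] = (w * velocities[i][d]
                                    + c1 * r1 * (p_best[i][d] - positions[i][d])
                                    + c2 * r2 * (g_best[d] - positions[i][d]))
                # Eq. 4, with the new position clipped to the bounds (an assumption).
                positions[i][d] = min(max(positions[i][d] + velocities[i][d], lo), hi)
            score = objective(positions[i])
            if score < p_best_score[i]:
                p_best[i], p_best_score[i] = positions[i][:], score
                if score < g_best_score:
                    g_best, g_best_score = positions[i][:], score
    return g_best, g_best_score

def toy_objective(x):
    """Hypothetical stand-in for a model output, smallest at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

best_position, best_score = pso_minimize(toy_objective, [(-5.0, 5.0), (-5.0, 5.0)])
```

To maximize a response such as PR with the same routine, the objective can return the negative of the model output, and the bounds play the role of the admissible concentration range of each medium component.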