Introduction

Faba beans (Vicia faba L.) have a long history of consumption by humans, with origins in the Middle East and a presence spanning thousands of years (Dhull et al. 2022). In the global market, Australia has emerged as the leading exporter of faba beans (Australian Export Grains Innovation Center [AEGIC] 2017), while China has a much larger production capacity for domestic consumption. Faba beans contain high concentrations of polyphenols and other antioxidants (Dhull et al. 2022), which contribute to their potential protective properties against health conditions such as hypertension and cancer, and against damage caused by reactive oxygen species (ROS) (Siah et al. 2012; Turco et al. 2016). Consequently, there is growing interest in developing value-added ingredients and food products derived from faba beans (Rahate et al. 2021). However, a major drawback is the presence of anti-nutritional factors in the seed, particularly the pyrimidine glycosides vicine and convicine, which can reduce the health benefits associated with faba bean (Duc et al. 1999; Khazaei et al. 2019) and, in individuals with favism, can even lead to hemolytic anemia (Pulkkinen et al. 2016).

Recent studies have identified extensive variability in the phenolic profiles and antioxidant capacities among various faba bean varieties (Johnson et al. 2021; Zanotto et al. 2020). However, the relationship between these variables (individual polyphenols and antioxidant capacity) has not been thoroughly investigated. In order to develop high-value processed products from faba beans that have high antioxidant content, it is crucial to identify the key polyphenols (and potentially other compounds) which are the main contributors to the antioxidant capacity.

The relationship between specific phenolic compounds and the overall antioxidant capacity is complex, as they interact with one another and other dietary compounds (Cianciosi et al. 2022). However, depending on their concentration and individual structure (Chen et al. 2020), it would be anticipated that the majority of antioxidant activity in a given matrix could be attributed to a number of specific compounds.

Amarowicz and Shahidi (2017) found higher relative antioxidant activity in the ‘tannin fraction’ (high-molecular-weight compounds) of a faba bean acetone extract compared to the low-molecular-weight fraction. Although further fractionation was not performed, the researchers did identify 14 phenolics in the crude extract using liquid chromatography–mass spectrometry (LC–MS), including p-hydroxybenzoic acid, ferulic acid, and a number of proanthocyanidins. In a similar study, Siah et al. (2014) reported a moderately higher antioxidant activity in the less polar fraction of faba bean seeds, compared to the more polar fraction.

Studies on other matrices have established tentative links between specific polyphenols and antioxidant capacity. For example, a neural network regression model was able to predict the total antioxidant activity of table wine samples based on the concentrations of nine phenolic compounds (Kazak et al. 2022). Xiang et al. (2019) found strong Pearson correlations between antioxidant activity (ORAC and ABTS+) and the catechin, epicatechin, caffeic acid, and ferulic acid content in foxtail millet. Similarly, Sytařová et al. (2020) reported significant correlations between the antioxidant activity and caffeic acid, p-coumaric acid, and sinapic acid content in sea buckthorn berries.

Building on this previous work, the aim of this study was threefold: (1) to determine whether the antioxidant capacity and total phenolic content of faba bean could be predicted from the concentrations of 12 individual constituents, (2) to compare the performance of different linear and nonlinear regression techniques for the prediction of antioxidant capacity, and (3) to identify the constituents that show the strongest correlation with the antioxidant capacity and total phenolic content.

This information should provide food technologists and plant breeders with valuable insights into specific antioxidant-active compounds that can be targeted and accurately measured using precise analytical methods. Although it would not entirely replace the role of measuring ‘total antioxidant capacity,’ it would provide important supplementary information.

Methods

Sample details

For this study, we used existing data that were previously collected and reported for a total of 60 faba bean samples. These samples consisted of 10 different varieties, each of which was cultivated at two different locations in South Australia during the 2017 growing season (Skylas et al. 2019). The seeds were processed by impact milling, using a Falling Number grinder with a 0.8-mm screen, to obtain flour for further analysis.

Chemical analysis

The vicine and convicine contents were quantified in NH4OH extracts of the faba bean flour, using a Shimadzu LCMS-8050 Triple Quadrupole Mass Spectrometer, as described in Skylas et al. (2019).

The total phenolic content (TPC) and ferric reducing antioxidant power (FRAP) were determined in 90% methanol extracts of the faba bean flour, as detailed in Johnson et al. (2020a). The extracts were then concentrated using a rotary evaporator before the concentrations of 10 polyphenol compounds that commonly occur in pulse crops and other foods were measured on an Agilent 1100 high-performance liquid chromatography system with a diode array detector (HPLC–DAD) (Johnson et al. 2021). These targeted polyphenols were protocatechuic acid, catechin, chlorogenic acid, p-hydroxybenzoic acid, vanillic acid, syringic acid, p-coumaric acid, vitexin, trans-ferulic acid, and rutin.

Dataset descriptions

Two datasets were prepared for conducting regression analysis. The first dataset comprised the concentrations of the 10 polyphenol compounds in the neat 90% methanol extracts (i.e., in mg L−1), which were the predictor variables. For the target variables (i.e., those to be predicted), the TPC was calculated in mg of gallic acid equivalents per liter (mg GAE L−1), and the FRAP was calculated in mg of Trolox equivalents per liter (mg TE L−1). Since vicine and convicine were not measured in the methanol extracts, this dataset did not include those variables.

For the second dataset, the concentrations of the 10 polyphenols, vicine, convicine, FRAP, and TPC were calculated for the original faba bean flour on a w/w basis, taking into account the respective extraction masses and volumes. The concentrations of individual compounds (polyphenols, vicine, and convicine) were calculated as mg kg−1, while the concentrations of FRAP and TPC were calculated as mg TE kg−1 and mg GAE kg−1, respectively. All concentrations in the flour were expressed on a dry weight basis.
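The back-calculation from extract concentration to a dry-weight flour concentration can be illustrated with a minimal sketch. The function name, the extraction volume, the sample mass, and the dry matter fraction below are hypothetical values chosen for illustration; the text does not report the actual extraction masses and volumes.

```python
def extract_to_flour_mg_per_kg(conc_mg_per_l, extract_volume_l,
                               sample_mass_kg, dry_matter_fraction=1.0):
    """Convert a concentration in the neat extract (mg/L) to mg/kg of flour.

    All arguments are illustrative assumptions, not values from the study.
    """
    if sample_mass_kg <= 0 or not 0 < dry_matter_fraction <= 1:
        raise ValueError("mass must be positive and dry matter in (0, 1]")
    # Total mass of analyte recovered in the extract (mg)
    analyte_mg = conc_mg_per_l * extract_volume_l
    # Dry mass of flour that was extracted (kg)
    dry_mass_kg = sample_mass_kg * dry_matter_fraction
    return analyte_mg / dry_mass_kg


# Hypothetical example: 10 mg/L in 25 mL of extract from 1 g of flour
# containing 90% dry matter.
example = extract_to_flour_mg_per_kg(10.0, 0.025, 0.001, 0.9)
print(f"{example:.1f} mg/kg dry weight")
```

The same conversion applies to FRAP and TPC, with the result read as mg TE kg−1 or mg GAE kg−1 instead of mg kg−1.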

Regression analysis

Multiple linear regression (MLR) analysis was conducted in RStudio, running R 4.0.5 (R Core Team 2023). This served as a baseline for comparison, as MLR is a well-established and well-documented linear regression technique.
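As an illustration of the MLR baseline, the following sketch fits an ordinary least squares model on a calibration subset and evaluates it on held-out samples. It is written in Python with NumPy rather than R, uses synthetic data in place of the faba bean measurements, and assumes a 70/30 calibration/validation split; none of these choices is taken from the study.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares with an intercept term."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[0], beta[1:]  # intercept, coefficients


def predict_mlr(intercept, coefs, X):
    return intercept + X @ coefs


# Synthetic stand-in: 60 samples x 10 predictor concentrations (mg/L)
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 20.0, size=(60, 10))
true_coefs = np.linspace(0.5, 5.0, 10)
y = 3.0 + X @ true_coefs + rng.normal(0.0, 1.0, size=60)

# Assumed 70/30 split into calibration and validation samples
order = rng.permutation(60)
cal, val = order[:42], order[42:]

intercept, coefs = fit_mlr(X[cal], y[cal])
y_pred = predict_mlr(intercept, coefs, X[val])

# R2 of prediction on the held-out validation samples
ss_res = np.sum((y[val] - y_pred) ** 2)
ss_tot = np.sum((y[val] - y[val].mean()) ** 2)
r2_pred = 1.0 - ss_res / ss_tot
print(f"R2pred = {r2_pred:.3f}")
```

Because the fitted coefficients map directly onto individual predictors, a model of this form is straightforward to interpret.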

Regression analysis using machine learning (ML) and deep learning was conducted in Python 3.10, using the Scikit-learn library. Four machine learning techniques were trialed.

  • Machine learning—linear regression: This method assumes a linear relationship between the independent and dependent variable and uses the least squares method to estimate the coefficients of the linear equation.

  • Machine learning—nonlinear regression (RBF): This technique uses Radial Basis Function (RBF) as a kernel function to model the nonlinear relationship between dependent and independent variables. RBF is a popular kernel function used in support vector machines (SVM), due to its ability to model complex nonlinear relationships.

  • Machine learning—nonlinear regression (linear): A combination of linear and nonlinear functions; this technique assumes a linear relationship between the variables but uses nonlinear functions to transform the independent variables.

  • Machine learning—nonlinear regression (poly): This method assumes a nonlinear relationship between the dependent and independent variables and uses polynomial functions to transform the independent variables.
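The four techniques listed above could be set up in Scikit-learn roughly as follows. This is a sketch under stated assumptions: the three "nonlinear" variants are mapped onto support vector regression (SVR) with RBF, linear, and polynomial kernels, which is one plausible reading of the descriptions; the feature scaling, the value of C, the polynomial degree, the split ratio, and the synthetic data are all illustrative choices rather than the settings used in the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for 60 samples x 10 polyphenol concentrations
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 20.0, size=(60, 10))
weights = rng.uniform(0.5, 3.0, size=10)
y = X @ weights + rng.normal(0.0, 2.0, size=60)

X_cal, X_val, y_cal, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Assumed mapping of the four techniques onto Scikit-learn estimators
models = {
    "ML-linear regression": LinearRegression(),
    "ML-nonlinear (RBF)": make_pipeline(
        StandardScaler(), SVR(kernel="rbf", C=100.0)
    ),
    "ML-nonlinear (linear)": make_pipeline(
        StandardScaler(), SVR(kernel="linear", C=100.0)
    ),
    "ML-nonlinear (poly)": make_pipeline(
        StandardScaler(), SVR(kernel="poly", degree=3, C=100.0)
    ),
}

predictions = {}
for name, model in models.items():
    model.fit(X_cal, y_cal)                 # calibrate on the training split
    predictions[name] = model.predict(X_val)  # predict held-out samples
    print(name, predictions[name][:3])
```

Standardizing the predictors before SVR is included because kernel methods are sensitive to feature scale; whether this step was applied in the original analysis is not stated.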

Finally, deep machine learning was also conducted in Python using a convolutional neural network (CNN) method. This technique uses a training method to optimize the model and allow it to ‘learn’ the important predictor variables.

The results from the different regression methods were compared using three key metrics: the coefficient of determination (R2), the mean absolute error (MAE), and the mean square error (MSE) of prediction. These metrics are commonly used to evaluate the accuracy and performance of regression models (Kazak et al. 2022).
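For reference, the three metrics can be computed from measured and predicted values as in the sketch below. The standard definitions are used; the function names and the four example values are illustrative only and are not data from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE: mean of |measured - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """MSE: mean of (measured - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only
measured = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]

print(mean_absolute_error(measured, predicted))  # 0.5
print(mean_squared_error(measured, predicted))   # 0.375
print(round(r_squared(measured, predicted), 4))  # 0.9486
```

MAE is expressed in the units of the target variable and MSE in those units squared, whereas R2 is dimensionless.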

Results and discussion

Prediction in methanol extracts

The first portion of this study was the prediction of antioxidant activity in the faba bean methanolic extracts. Correlation analysis (Fig. 1) revealed a number of strong, positive correlations between several phenolic constituents, including between rutin and chlorogenic acid (r = 0.98), and between ferulic and p-hydroxybenzoic acid (r = 0.82). As a similar result was not seen in mung bean (Johnson et al. 2020b), this appears to be specific to the faba bean matrix. This degree of correlation may be indicative of closely linked biosynthetic pathways for these compounds or a common genetic regulation of different biosynthetic pathways. Based on information available on the polyphenol catabolism pathways in other plants (Schilmiller et al. 2010; Venkanna and Addepally 2021), the latter case would appear to be more likely for these compounds.

Fig. 1 Correlation plot of the phenolic constituents in the faba bean methanolic extracts. Darker blue values indicate a stronger positive correlation between constituents

Research by Clé et al. (2008) noted that both chlorogenic acid and rutin are synthesized through the phenylpropanoid metabolic pathway, and phenolic precursors could be redirected from the synthesis of chlorogenic acid to rutin. Cuong et al. (2018) proposed p-coumaroyl-CoA as the common biosynthetic precursor for both of these compounds in Momordica charantia. Consequently, it is likely that one or more regulatory genes could influence the ratio (and therefore the correlation) of these two compounds. Similarly, both p-hydroxybenzoic acid and ferulic acid are products of the shikimic acid pathway, being only three steps apart in the biosynthetic process (Srinivasulu et al. 2018), and are therefore likely to share common regulatory genes.

The constituents showing the strongest correlation with the antioxidant activity of the extracts (FRAP) were ferulic acid (r = 0.86) and p-hydroxybenzoic acid (r = 0.79), followed by protocatechuic acid (r = 0.60), vitexin (r = 0.52), and catechin (r = 0.47). Another observation was the relatively weak correlation found between FRAP and TPC (r = 0.26). This agreed with previous work by Valente et al. (2019) who found a poor correlation between TPC and antioxidant activity (r = 0.55) in faba bean, suggesting that other non-phenolic antioxidants were contributing to the overall antioxidant activity observed.
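The ranking of constituents by their correlation with FRAP can be sketched as follows. The Pearson coefficient is computed from its standard definition; the concentrations and FRAP values are made-up numbers used only to show the procedure, and the function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_by_correlation(constituents, target):
    """Sort constituents by the absolute value of r with the target."""
    scores = {name: pearson_r(vals, target)
              for name, vals in constituents.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Made-up example values (not measurements from the study)
frap = [10.0, 12.0, 15.0, 11.0, 18.0]
constituents = {
    "ferulic acid": [1.0, 1.3, 1.9, 1.1, 2.4],
    "rutin": [5.0, 4.0, 6.0, 5.0, 4.0],
}

ranking = rank_by_correlation(constituents, frap)
for name, r in ranking:
    print(f"{name}: r = {r:.2f}")
```

A high r for a single constituent indicates covariation with FRAP, not necessarily a causal contribution, which is why the regression models are considered alongside the correlations.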

Regression of the FRAP from the individual phenolic constituents showed a good linearity for MLR (R2pred = 0.86; see Table 1 and Fig. 2), comparable to the R2cal of 0.92 found for the prediction of antioxidant capacity in wine samples (Kazak et al. 2022). However, MLR also showed a much larger predictive error, as measured by the MAE and MSE. In general, the accuracy of the results was poorest using ML—nonlinear (poly), followed by MLR, and then ML—nonlinear (linear) methods. Deep learning and the other two machine learning methods showed reasonable accuracy (R2pred ≥ 0.93; MAE < 5 mg TE L−1). However, the best performance was found using the machine learning—nonlinear regression (RBF) technique, which gave an MAE of just 0.8 mg TE L−1 (Table 1, Fig. 2).

Table 1 Regression statistics for the prediction of ferric reducing antioxidant power (FRAP) from the individual phenolic constituents, in the faba bean methanolic extracts

Fig. 2 Prediction plot for the multiple linear regression of FRAP in faba bean methanolic extracts, using the contents of individual phenolic compounds. Calibration samples are shown in black; validation samples in red

In contrast to the results obtained for FRAP, the prediction of the TPC from the methanol extracts using MLR gave very poor performance (Table 2). This also contrasted with some results in other matrices such as foxtail millet (Xiang et al. 2019), where significant positive linear correlations between the concentrations of five individual phenolic compounds and the total phenolic content were found. This outcome suggests that the main polyphenols (or other compounds) contributing to the TPC of the extracts may not have been included among the common polyphenols targeted in this study. Based on previous research using more advanced analytical instrumentation (such as LC–MS), the major polyphenols in faba bean could include proanthocyanidins (prodelphinidin, epicatechin) or glycosides of flavonoids such as quercetin, kaempferol, luteolin, and myricetin (Kwon et al. 2018; Valente et al. 2019). Depending on their structure (Chen et al. 2020), individual polyphenols vary widely in their activity in antioxidant capacity and total phenolic content assays (Johnson et al. 2022), helping to explain this disparity between the targeted polyphenol concentrations and the TPC. It also suggests that there were no strong correlations between the 10 phenolic compounds analyzed here and the major phenolics found in faba bean, although this remains to be confirmed by future research.

Table 2 Regression statistics for the prediction of total phenolic content (TPC) from the individual phenolic constituents, in the faba bean methanolic extracts

The ML—nonlinear (poly) and ML—nonlinear (linear) methods performed similarly poorly; however, several of the other regression techniques showed promise. Both ML—nonlinear (RBF) and deep learning gave R2pred values of ≥ 0.99, although the MAE of the latter technique was much higher (9.18 mg GAE L−1 compared to 0.71 mg GAE L−1). Consequently, ML—nonlinear (RBF) provided the best mathematical model for predicting the TPCs of the methanolic extracts from their individual phenolic constituents, as was seen for the FRAP predictions.

Prediction in faba bean flour

Correlation analysis was also performed on the faba bean flour constituents, which again highlighted the strong correlations between rutin and chlorogenic acid (r = 0.98), and between ferulic and p-hydroxybenzoic acid (r = 0.81). Both of these correlations were of similar strength in the faba bean flour (Fig. 3) and in the methanol extracts. Notably, the FRAP and TPC were much more strongly correlated (r = 0.94) in the faba bean flour dataset than in the methanolic extracts.

Fig. 3 Correlation plot of the phenolic constituents in the faba bean flour samples. Darker blue values indicate a stronger positive correlation between constituents

As described in the Methods section, the concentrations of the pyrimidine glycosides vicine and convicine were also included in the predictor dataset for the faba bean flour. Neither was strongly correlated with the other or with any individual phenolic compound, although vicine showed a moderate to weak linear correlation with the TPC (r = 0.58) and ferulic acid (r = 0.51). It also had a moderate negative correlation (r = − 0.64) with the syringic acid concentration (Fig. 3). Although the structures of vicine and convicine do not contain any phenolic groups, it has been proposed that vicine may act as a free radical acceptor (Khatib et al. 2017), thus imparting it with antioxidant activity.

In terms of its R2pred value (linearity), the MLR prediction result for FRAP in the faba bean flour (R2 = 0.88; Table 3) was similar to that in the methanolic extracts (R2 = 0.86). It should be noted that the magnitude of MAE and MSE for the prediction of FRAP and TPC is much higher in the flour samples than in the methanol samples, as the concentrations of antioxidants/phenolics in the flour samples are much higher than in the methanolic extracts. Improved prediction results, in terms of R2pred, MAE, and MSE, were found using all machine learning methods except ML—nonlinear (poly). However, the best results were found using deep learning—CNN, with an R2pred of 0.99. This contrasted with the methanol extract predictions, where deep learning—CNN showed excellent linearity but either poorer MAE or MSE compared to nonlinear machine learning methods.
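The scale dependence of MAE and MSE noted above can be demonstrated with a short sketch. It assumes, as a simplification, that flour values are the extract values multiplied by a single constant conversion factor; the factor of 25 and the example values are hypothetical.

```python
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical extract-basis values (mg TE/L) and their predictions
extract_true = [40.0, 55.0, 62.0, 48.0, 70.0]
extract_pred = [42.0, 53.0, 60.0, 50.0, 69.0]

# Assumed constant factor converting the extract basis to the flour basis
scale = 25.0
flour_true = [v * scale for v in extract_true]
flour_pred = [v * scale for v in extract_pred]

print(mae(extract_true, extract_pred), mae(flour_true, flour_pred))
print(mse(extract_true, extract_pred), mse(flour_true, flour_pred))
print(r2(extract_true, extract_pred), r2(flour_true, flour_pred))
```

Under this assumption the MAE grows in proportion to the factor, the MSE grows with its square, and R2 is unchanged, so only R2 is directly comparable between the two datasets.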

Table 3 Regression statistics for the prediction of ferric reducing antioxidant power (FRAP) from the individual phenolic constituents, in the faba bean flour samples

In contrast to the results seen in the methanol extracts, multiple linear regression gave a much better R2pred value for the prediction of TPC in the flour samples (R2pred = 0.86; see Table 4 and Fig. 4). Again, the use of most machine learning regression techniques provided a moderate improvement in linearity and accuracy, with R2pred values up to 0.95, although deep learning with CNN gave the best performance overall.

Table 4 Regression statistics for the prediction of total phenolic content (TPC) from the individual phenolic constituents, in the faba bean flour samples

Fig. 4 Prediction plot for the multiple linear regression of TPC in the faba bean flour samples, using the contents of individual phenolic compounds, as well as vicine and convicine. Calibration samples are shown in black; validation samples in red

One of the major benefits of using MLR is that the model loadings are quite easy to interpret, compared to nonlinear machine learning or deep learning. Consequently, the variables significantly contributing to the three successful MLR models were investigated and are presented in Table 5. For all of these models, paired t tests showed that the predicted FRAP or TPC values for the test set were not significantly different from the measured FRAP or TPC values (P > 0.05 for all).
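The paired t test compares measured and predicted values sample by sample. A minimal sketch of the test statistic is given below; the function name and the example values are illustrative, and the P value would then be obtained from a t distribution with n − 1 degrees of freedom (for example with `scipy.stats.ttest_rel`, which performs the whole test in one call).

```python
import math


def paired_t_statistic(measured, predicted):
    """Return (t, degrees of freedom) for a paired t test."""
    diffs = [m - p for m, p in zip(measured, predicted)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the paired differences (n - 1 denominator)
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1


# Illustrative values only: predictions scatter evenly around the measurements
measured = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]

t_stat, dof = paired_t_statistic(measured, predicted)
print(t_stat, dof)
```

A t statistic close to zero, as in this example, indicates no systematic bias between predicted and measured values; it says nothing about the size of the individual errors, which is captured by the MAE and MSE.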

Table 5 Importance of different constituents for the prediction of TPC and FRAP in the faba bean flour and methanol extract samples

In all three MLR models, protocatechuic acid, p-hydroxybenzoic acid, and ferulic acid showed very strong contributions in the prediction of FRAP or TPC (P < 0.001 for all). Notably, all three of these compounds had shown moderate to strong correlations with the FRAP and TPC in the flour samples (see Fig. 3), confirming their importance as predictor variables. This is likely to be a function of both their relative antioxidant activity and their absolute concentrations in the faba bean matrix. Although they cannot be definitively identified as the only ‘main’ contributors to the FRAP or TPC, they appear to be the most important antioxidant-active compounds among the polyphenols quantified in this study.

Similarly, studies on other matrices have reported strong positive correlations between antioxidant activity and ferulic acid (Xiang et al. 2019) and p-hydroxybenzoic acid (Huang et al. 2020).

Syringic acid, catechin, and p-coumaric acid also contributed significantly to the FRAP of the methanol extracts, but not in the flour samples. Conversely, chlorogenic acid and vanillic acid showed significant but more minor contributions (P < 0.05) in the flour samples, while vitexin also gave a significant contribution (P < 0.05) for the flour—FRAP model, but not for the flour—TPC model. The only major anomaly was seen for vicine, which had a significant contribution to the flour—TPC model (P < 0.001), but not the FRAP model. Further research is required to determine whether vicine does act as a free radical acceptor as proposed by Khatib et al. (2017) and thus contributes to the reducing capacity of the sample, or if it is interfering with the Folin-Ciocalteu reagent as suggested by Higazi and Read (1974). In the latter case, it may indicate that the TPC of faba bean products would be over-estimated using the Folin-Ciocalteu method.

General discussion

When comparing the prediction results for the methanolic extracts and flour, the methanolic extracts generally gave better results for the prediction of antioxidant capacity. This is likely because the matrix used for the antioxidant assays and HPLC analysis was identical (the neat methanolic extract), whereas the antioxidant and phenolic concentrations in the flour had to be back-calculated from the subsequent analysis of the methanolic extracts.

In terms of the regression techniques, machine learning—nonlinear regression (poly) consistently gave poor results for these two datasets. Overall, multiple linear regression gave moderately accurate predictions for unknown samples (R2pred = 0.86–0.88), although not for the prediction of TPC in the methanolic extracts. The use of other machine learning techniques (linear regression, nonlinear regression—RBF, and nonlinear regression—linear) gave moderate improvements in predictive accuracy compared to regular multiple linear regression. Finally, deep learning gave a significant increase in predictive accuracy. This supported previous research, which similarly reported that deep learning performed quite well for the prediction of antioxidant capacity in other matrices (Chen et al. 2012; Kazak et al. 2022). However, deep learning may not outperform linear regression in all situations; therefore, research is required to determine the optimum modeling technique for each proposed application/matrix (Jiao et al. 2020).

Conclusion

This study demonstrated that multiple linear regression and machine learning regression techniques could be used to predict the antioxidant activity (FRAP) and total phenolic content (TPC) of faba bean methanol extracts from their individual phenolic constituents with moderate accuracy. However, the prediction accuracy was lowered when attempting to regress FRAP and TPC from phenolic constituents in faba bean flour. In general, machine learning approaches yielded better results compared to multiple linear regression, while deep learning methods often outperformed both linear and nonlinear machine learning techniques. The compounds identified as showing significant contribution to the models for antioxidant activity and total phenolic content in faba bean were protocatechuic acid, p-hydroxybenzoic acid, and ferulic acid.