Introduction

A better understanding of the biochemical basis of taste attributes is a main challenge for tomato breeding programs that aim to improve tomato flavor. Tomato (Solanum lycopersicum L.), a member of the Solanaceae family originally from South America, is one of the most consumed vegetables in the world and has a large impact on the human diet as well as on our health (Agarwal and Rao 2000). Although it is well known that the main components contributing to the flavor of tomato fruits are a mixture of sugars, acids and amino acids together with volatiles and minerals (Baldwin et al. 1991; Saliba-Colombani et al. 2001), the identification and quantification of the constituents that account for the differences in tomato flavor remain to a large extent an open problem. In this study, various statistical approaches that provide quantitative models explaining tomato taste attributes from metabolic measurements are compared.

Different studies have been conducted to decipher the relationship between sensory traits and metabolites, ranging from studies based on principal component analysis (Krumbein and Auerswald 1998; Krumbein et al. 2004) to more recent ones in which networks were constructed to illustrate the correlations between sensory traits and metabolites (Ursem et al. 2008; Carli et al. 2009). Multiple linear regression appears to be one of the most appropriate frameworks for providing quantitative models that link taste attributes to metabolic compounds. Multiple linear regression models were proposed by Skovgaard (1995) as a general framework for modeling relationships between instrumental and sensory measurements. In tomato-related studies, Verkerke et al. (1998) presented a model linking a set of pre-selected metabolites with certain sensory traits. More recent studies within the same scheme are those reported by Tandon et al. (2003) and Abegaz et al. (2004), in which predictive models for tomato taste were presented based on volatile and non-volatile compounds.

In the great majority of the aforementioned studies, multiple linear regression models were computed with ordinary least squares, in combination with forward stepwise techniques for variable selection. In this paper, we compare this existing methodology with a new generation of regression techniques, called regularization or penalization methods. This methodology is valued for its ability to perform estimation and variable selection at once, to handle models in which the number of variables is greater than the number of observations, and to produce more accurate models.

In particular, we focus on the Lasso (Tibshirani 1996) and the elastic net (Zou and Hastie 2005) together with different model selection strategies. We evaluate the advantages and disadvantages of this new methodology in comparison with traditional stepwise regression for the study of tomato sensory traits in relation to metabolic compounds.

Materials and methods

Data description

The collection of tomato germplasm analyzed in this study can be divided into three morphological types, beef, round and cherry, and consists of 94 cultivars provided by six different breeding companies. This set of cultivars represents, to a large extent, the important commercial varieties in the market and shows considerable phenotypic variation between the different types (beef, round and cherry) as well as between individuals of the same type.

Sensory and metabolic measurements form the empirical data for this study. The sensory data cover the spectrum of fragrance, taste, aftertaste and mouth feel, and were scored by a trained tasting panel of observers of taste, smell and texture. At the biochemical level, the data consisted of metabolic records that can be divided into two categories, volatiles and derivatized compounds (Table 1), analyzed from ripe tomato fruits. Of special interest are the volatile compounds (derived from different precursors including amino acids, fatty acids and carotenoids) because of their large influence on flavor perception. Volatiles were measured using gas chromatography and mass spectrometry according to the methods reported by Tikunov et al. (2005). The organic acids and sugars were profiled with the same techniques, following the quantification protocol described by Roessner-Tunali et al. (2003).

Table 1 Volatiles and non-derivatized chemical compounds used in this study

The same data set has been studied by Ursem et al. (2008) and van Berloo et al. (2008a, b), where more details about the data and their preparation can be found.

Penalized regression

In this investigation we are interested in finding the relationship between a given quantitative trait \({\bf Y}\) for an observed phenotype and a collection of metabolic variables \({\bf X}=({\bf x}_1,\ldots,{\bf x}_p)\). A simple and yet very convenient model to describe this type of association is the so-called linear model

$$ {\bf Y}={\bf X}{\beta} + {\boldsymbol{\epsilon}} $$
(1)

where \(p\), \({\beta} = \{\beta_0,\ldots,\beta_p\}\) and \({\boldsymbol{\epsilon}}=\{\epsilon_1,\ldots,\epsilon_n\}\) represent the number of variables, the regression coefficients and the errors in the model, respectively. The errors are assumed to be independent and identically distributed normal random variables with mean 0 and variance \(\sigma^2\).

Ordinary least squares (OLS) estimates are well-known solutions to the multiple linear regression problem (1), obtained by minimizing the residual sum of squares. Although unbiased, OLS estimates are criticized for being unstable and for overfitting the data in the presence of collinearity or in a high-dimensional setting, i.e., when the number of variables p is larger than the number of observations n. In either of these scenarios, OLS estimates are variance-inflated and have poor prediction accuracy. These problems can, however, be partially alleviated by conducting variable selection.

An alternative that avoids the drawbacks of the least squares solution is provided by the so-called penalization or regularization techniques, such as the Lasso proposed by Tibshirani (1996). The Lasso, an acronym for least absolute shrinkage and selection operator, has become very popular because it simultaneously performs estimation and variable selection. The main idea is to estimate the regression coefficients \({\beta}\) by minimizing the residual sum of squares plus an \(L_1\) constraint on the regression coefficients, as follows

$$ \hat{\beta}(\lambda)=\operatorname{argmin}_{\beta}\left\{\|{\bf Y}-{\bf X}\beta\|_2^2+\lambda\|\beta\|_1\right\} $$
(2)

where \(\|{\bf Y}-{\bf X}\beta\|_2^2=\sum_{i=1}^n(Y_i-({\bf X}\beta)_i)^2\) is the residual sum of squares, \(\|\beta\|_1=\sum_{j=1}^p|\beta_j|\), and \(\lambda\ge 0\) is the penalty parameter which controls the amount of shrinkage, acting as a tuning parameter for the model. Large values of λ correspond to greater amounts of shrinkage, drawing the model coefficients towards zero. Moreover, the geometry of the \(L_1\) constraint ensures that some of the coefficients will be exactly zero, thereby producing sparse models, which depend on the choice of the penalty parameter λ.
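
A minimal sketch of fitting (2) in Python, using scikit-learn's coordinate descent implementation on simulated stand-in data; the dimensions mimic the present setting (94 cultivars, a few dozen metabolite variables), but the data and the chosen penalty value are illustrative only. Note that scikit-learn scales the residual sum of squares by 1/(2n), so its alpha parameter is a rescaled version of λ in (2).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Simulated stand-ins: n = 94 cultivars, p = 50 metabolite variables,
# of which only the first 5 truly influence the synthetic sensory trait.
n, p = 94, 50
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:5] = [2.0, -1.5, 1.0, 0.8, -0.5]
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# scikit-learn minimizes (1/(2n))*RSS + alpha*||beta||_1,
# so alpha plays the role of lambda/(2n) in Eq. (2).
fit = Lasso(alpha=0.1).fit(X, y)

selected = np.flatnonzero(fit.coef_)   # L1 geometry: many exact zeros
print("selected variables:", selected)
print("estimates:", np.round(fit.coef_[selected], 2))
```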

Similarly to the Lasso, the elastic net is a shrinkage and variable selection method for linear regression, proposed by Zou and Hastie (2005). The elastic net tries to combine the good properties of the Lasso with those of ridge regression (Hoerl and Kennard 1970) to obtain sparse models with reduced standard error estimates. It solves problem (1) by minimizing the residual sum of squares with an added convex constraint on the regression coefficients, finding \(\hat{\beta}\) as the following minimizer

$$ \hat{\beta}(\lambda)=\operatorname{argmin}_{\beta}\left\{\|{\bf Y}-{\bf X}\beta\|_2^2+\lambda\left((1-\alpha)\|\beta\|_1+\alpha\|\beta\|_2^2\right)\right\} $$
(3)

with \(\|\beta\|_2^2=\sum_{j=1}^p\beta_j^2\) and \(\lambda\geq 0\) the penalty parameter, behaving as in Lasso regression. The value \(\alpha\in[0,1]\) determines the type of constraint applied, as a compromise between those of the Lasso and ridge regression. The first part of the constraint, equivalent to \(\|\beta\|_1\leq s\) (Lasso), generates a sparse model. The second part, \(\|\beta\|_2^2\leq s\) (ridge), encourages a grouping effect, removes the limitation on the number of selected variables and stabilizes the Lasso regularization path. The convex constraint \((1-\alpha)\|\beta\|_1+\alpha\|\beta\|_2^2\) thus allows, depending on the value of \(\alpha\in [0,1]\), the construction of a broad range of possible models that enjoy different properties. Finally, it is worth noting that here the optimal models depend on the choice of the two parameters λ and α.
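
The grouping effect can be seen in a small sketch with correlated predictors, again on simulated data. One caveat on conventions: scikit-learn's ElasticNet weights the L1 term by l1_ratio, so l1_ratio corresponds to (1 − α) in (3) up to scale factors, and its alpha is again a rescaled λ.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
n, p = 94, 50
X = rng.standard_normal((n, p))
# Two nearly collinear pairs, mimicking metabolites from a shared pathway.
X[:, 1] = X[:, 0] + 0.05 * rng.standard_normal(n)
X[:, 3] = X[:, 2] + 0.05 * rng.standard_normal(n)
y = X[:, 0] + X[:, 1] - X[:, 2] - X[:, 3] + 0.5 * rng.standard_normal(n)

# l1_ratio = 0.75 roughly corresponds to alpha = 0.25 in Eq. (3).
enet = ElasticNet(alpha=0.1, l1_ratio=0.75).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# The elastic net tends to spread weight over both members of each
# correlated pair, whereas the Lasso tends to pick one and drop the other.
print("elastic net:", np.round(enet.coef_[:4], 2))
print("lasso      :", np.round(lasso.coef_[:4], 2))
```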

The elastic net, being a combination of the Lasso and ridge regression, was expected to be suitable for modeling sensory traits on a set of metabolites, where these metabolites are assumed to come from a small set of metabolic pathways, with metabolites within pathways showing some correlation. The Lasso property of the elastic net should then select the pathways that enter the regression model for a particular sensory trait, while the ridge property subsequently shrinks the metabolites within pathways by about the same amount.

These regularization techniques rely on fast and efficient computing algorithms to calculate the set of possible Lasso or elastic net solutions \(\{\hat{\beta}(\lambda),\;\lambda\in[0,\infty)\}\), which depend on the parameter λ and are known as solution paths or traces. Different algorithms have been proposed to compute the whole path of Lasso solutions. One of the most popular is the least angle regression (LARS) path-following algorithm proposed by Efron et al. (2004), which has the same order of computation as a least squares fit (Hastie et al. 2009). An alternative algorithm for computing Lasso as well as elastic net solution paths is the coordinate descent algorithm of Friedman et al. (2007). In addition to being faster on large problems, this algorithm can also be applied to non-convex penalty functions.
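
As an illustration of the path point of view, the sketch below (simulated data again) computes the entire Lasso trace with scikit-learn's LARS implementation; the breakpoints of the piecewise-linear path and the active set are returned directly.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(2)
n, p = 94, 20
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - X[:, 1] + 0.5 * rng.standard_normal(n)

# method="lasso" makes LARS follow the Lasso path exactly;
# alphas are the breakpoints, coefs holds one solution per breakpoint.
alphas, active, coefs = lars_path(X, y, method="lasso")
print("active set at the end of the path:", active)
print("first breakpoints:", np.round(alphas[:5], 3))
print("path matrix (p x breakpoints):", coefs.shape)
```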

Model selection

Variable selection is a common problem in modern statistical analysis, arising from the necessity of identifying the set of important variables among all the superfluous ones. Noisy variables add complexity to the models and do not lead to great improvements in prediction power. The usual variable selection procedure is based on the residual sum of squares plus a penalty which takes into account the number of parameters in the candidate model. In analogy to the Lasso and elastic net, stepwise regression finds the candidate model as the minimizer of

$$ \hat{\beta}(\lambda)=\operatorname{argmin}_{\beta} \left\{\|{\bf Y}-{\bf X}\beta\|_2^2+\lambda\|\beta\|_0\right\} $$
(4)

where the \(L_0\) norm penalty is \(\|\beta\|_0 = \sum_{j=1}^p I(\beta_j\ne 0)\), which equals the number of variables included in the model. In stepwise regression, the variables that belong in the model are identified first, and once the model has been identified, the coefficients are estimated. Regularization techniques offer an alternative to traditional variable selection methods such as forward or stepwise regression (Efroymson 1960), which are known to be unstable in certain situations (Breiman 1996). Under the Lasso or elastic net framework, variable selection is equivalent to model choice. The penalty parameters there account for the amount of shrinkage in the regression coefficients, and therefore for the number of variables appearing in the model. When a high penalty is chosen, few variables are included, whereas when a very low penalty is selected, most of them will be present. Because complex models do not necessarily perform better than simpler ones, the main challenge here is to find a trade-off between sparsity and prediction accuracy.
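
To make the contrast with (2) and (3) concrete, the following sketch implements a simple forward variant of stepwise selection driven by BIC (cf. Eq. 5 below); it is a schematic re-implementation for illustration, not the exact procedure used in the original analysis.

```python
import numpy as np

def forward_stepwise_bic(X, y):
    """Greedy forward selection: repeatedly add the variable that most
    decreases BIC; stop when no addition improves it."""
    n, p = X.shape
    selected, remaining = [], list(range(p))

    def bic(cols):
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss = np.sum((y - Z @ beta) ** 2)
        # For Gaussian errors, -2 ln L = n ln(RSS/n) up to constants.
        return n * np.log(rss / n) + (len(cols) + 1) * np.log(n)

    best = bic(selected)
    while remaining:
        score, j = min((bic(selected + [k]), k) for k in remaining)
        if score >= best:
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected

rng = np.random.default_rng(3)
X = rng.standard_normal((94, 30))
y = 1.5 * X[:, 4] - X[:, 9] + 0.5 * rng.standard_normal(94)
print(forward_stepwise_bic(X, y))   # expected to recover variables 4 and 9
```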

Many model selection techniques have been developed in recent years; cross-validation (Stone 1974), based on the performance of the estimated model on a new data set (the generalization error), is one of the most widely used among them and selects the models with the best prediction performance. Other very popular model selection criteria are those of the form

$$ \Phi (\gamma) = -2 \ln (L)+|\gamma|D(n) $$
(5)

where L corresponds to the maximized value of the likelihood function for the estimated model γ, |γ| is the effective model dimension and D(n) is a function of the sample size (Broman and Speed 2002). Very well known examples are the Akaike information criterion (Akaike 1974), in which D(n) = 2, and BIC (Schwarz 1978), in which D(n) = ln(n).
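
For the Gaussian linear model (1), the maximized likelihood can be expressed through the residual sum of squares, which is how (5) is evaluated in practice; dropping additive constants that do not depend on the candidate model,

$$ -2\ln(L) = n\ln\left(\frac{RSS(\gamma)}{n}\right)+n\ln(2\pi)+n, \qquad \Phi(\gamma) = n\ln\left(\frac{RSS(\gamma)}{n}\right)+|\gamma|D(n)+\text{const}. $$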

For regression models computed via Lasso regression, the effective model dimension is equal to the number of variables included in the model (Zou et al. 2007); for the elastic net it is equal to \(\sum_{j=1}^{|A|}{\frac{d_j}{d_j+2}}\), where A denotes the set of variables in the model and \(d_j\) is the jth eigenvalue of the matrix \(X_A^tX_A\) (van der Kooij 2007).
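
The elastic net effective dimension quoted above can be computed directly from the active set; below is a short sketch of a hypothetical helper following that formula, with the constant 2 taken as stated in the text.

```python
import numpy as np

def enet_effective_dimension(X, coef):
    """Effective model dimension for an elastic net fit, following the
    formula quoted in the text (van der Kooij 2007):
    sum_j d_j / (d_j + 2), with d_j the eigenvalues of X_A^T X_A."""
    A = np.flatnonzero(coef)              # active set: selected variables
    if A.size == 0:
        return 0.0
    d = np.linalg.eigvalsh(X[:, A].T @ X[:, A])
    return float(np.sum(d / (d + 2.0)))
```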

Results and discussion

Models selected by cross-validation

Six tomato sensory traits were analyzed by different regression techniques to find their underlying metabolic models. Multiple linear regression models were computed with three particular elastic net penalties, namely α = 0.25, α = 0.50 and α = 0.75 as defined in Eq. 3, and with the Lasso. For these fits, the number of predictor variables selected in each model is rather large, as can be seen in Table 2, together with the corresponding goodness of fit R². The model selection criterion in all cases was cross-validation, which tunes models to achieve the best prediction accuracy (P.A.)

$$ P.A.(\lambda)=1-\frac{1}{n}\sum_{i=1}^n \left(\frac{Y_i-X_i\hat{\beta}_{\lambda}}{1-{\mathcal{H}}_{ii}}\right)^2 $$
(6)

where \(X_i\) is the ith row of X and \({\mathcal{H}}_{ii}\) is the ith diagonal element of the “hat” matrix \({\mathcal{H}}\) (such that \(\hat{Y}={\mathcal{H}}Y\)). The prediction accuracy measures the predictive power of a given model on new sample data. Nevertheless, cross-validation does not take the model complexity into account; as a consequence, correlated predictors may be included in the model, leading to a decrease in prediction accuracy.
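
For a linear smoother, (6) can be evaluated without refitting, since the leave-one-out residuals are obtained from the ordinary residuals and the diagonal of the hat matrix. The sketch below does this for ridge regression, where \({\mathcal{H}}=X(X^tX+\lambda I)^{-1}X^t\) is explicit; for the Lasso and elastic net no exact hat matrix of this form exists, so this is only an illustration of the shortcut.

```python
import numpy as np

def loo_prediction_accuracy(X, y, lam):
    """Prediction accuracy of Eq. (6) via the hat-matrix shortcut,
    sketched for ridge regression (a linear smoother)."""
    n, p = X.shape
    G = X.T @ X + lam * np.eye(p)
    H = X @ np.linalg.solve(G, X.T)            # hat matrix
    beta = np.linalg.solve(G, X.T @ y)
    loo = (y - X @ beta) / (1.0 - np.diag(H))  # leave-one-out residuals
    return 1.0 - np.mean(loo ** 2)

rng = np.random.default_rng(4)
X = rng.standard_normal((94, 20))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.standard_normal(94)

# Tune lambda on a small grid by maximizing the prediction accuracy.
grid = [0.01, 0.1, 1.0, 10.0]
best = max(grid, key=lambda lam: loo_prediction_accuracy(X, y, lam))
print("best penalty on the grid:", best)
```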

Table 2 Number of predictor variables included in the models selected by cross-validation for the Lasso and the elastic net with penalties α = 0.25, α = 0.50 and α = 0.75

The relatively minor influence of the three elastic net penalties and the Lasso on the final models arises because the prediction error curves for the four different models reached their minima at almost the same location, as shown in Fig. 1. Cross-validation selects the optimal models as the minimizers of those curves, and in this study the minima of the four prediction error curves fall very close together, producing models with almost identical numbers of regressors. However, in all cases the most parsimonious models were those given by the Lasso, since this method applies the strongest constraint to the regression coefficients.

Fig. 1 Prediction error curves for the studied sensory traits. The x-axis represents the grid of 70 \(\log_{10}\lambda\) values for which models were computed; the y-axis shows the corresponding prediction error values. Values on the right side of the x-axis correspond to models in which few variables were included, while moving left along the x-axis leads to models containing more variables

Models selected by BIC and stepwise regression

Models selected by BIC (Bayesian information criterion), a particular case of (5), for the elastic net penalties contained a large number of regressors (Table 3). The selected variables did not show clear grouping structures that could be interpreted in terms of chemical pathways. Therefore, we decided to focus on the Lasso and stepwise regression and to compare the performance of these methods further.

Table 3 Number of variables selected in the optimal models computed by BIC for elastic net

Lasso regression models selected by the BIC criterion were superior, in terms of the coefficient of determination R², to those selected by cross-validation, and achieved predictive power similar to that of stepwise regression (Table 4). For stepwise regression, the criterion used to decide whether a variable entered or left the model was BIC. It is also important to notice that the number of variables selected by stepwise regression is in general smaller than the number selected by the Lasso (Table 3). Stepwise regression coefficients are larger than those from the Lasso, which has an influence on the number of variables entering the model. This is shown in Fig. 2 for the sensory trait taste spicy, although it was the case for all traits. In general, the signs of the regression coefficients obtained from the Lasso and stepwise regression coincide in all cases. Models calculated by the Lasso which contained five variables were studied to assess the order in which predictor variables were selected along the traces. In addition, it is also of interest to compare the predictor variables selected by those models with the ones obtained by stepwise regression and by the Lasso when BIC was used as the selection criterion.

Table 4 Linear models for sensory traits based on volatiles and non-derivatized tomato chemical compounds, computed by the Lasso
Fig. 2 Regression coefficients for the sensory trait taste spicy computed by different models; common variables are shown in blue. Upper panel: coefficients computed by the Lasso with BIC as model selection criterion. Middle panel: coefficients computed by the Lasso, stopping the algorithm when the model contains five variables. Lower panel: coefficients obtained by stepwise regression

There is general agreement between the variables selected by BIC and by stepwise regression for all traits except taste spicy (Table 3). Models which contained exactly five variables succeeded in selecting the most important variables, although they failed in the estimation of the coefficient values, as is clear from Table 3. Furthermore, we are studying different criteria based on (5) to compare the performance of stepwise regression, the Lasso and the elastic net.

As shown by the above comparison, there are traits, such as taste spicy and taste watery, that were predicted reasonably well by the Lasso as well as by stepwise regression, whereas some others, like aftertaste bitter or scent tomato, could hardly be predicted from the considered set of metabolites with any of the studied methodologies. For the traits that failed to be predicted by any of the proposed methods, we conclude that their chemical basis was not contained in the set of studied metabolites. For the sensory traits that were predicted well, the relevance of some variables is clear, since they appeared in all the models regardless of the technique used.

Concluding remarks

In this study we have compared existing regression methodology for linear models, namely stepwise regression, with a new generation of regression procedures known as the elastic net and the Lasso. The aim was to analyze the different approaches for finding optimal biochemical models, based on metabolic information, to predict a group of sensory traits. In the setting of this investigation, that is, when the number of variables in the model is smaller than the number of observations, Lasso models selected by BIC achieved a fit comparable to that of stepwise regression in terms of R², and it was not very clear which method was superior. However, looking at the size of the regression coefficients, it was clear that those estimated by stepwise regression were larger than the ones calculated by the Lasso approach. The Lasso models contained more correlated predictors than the stepwise regression models, which may have induced the smaller coefficient estimates.

Based on our analysis, stepwise regression provided a very good platform for finding satisfactory prediction models, as we have seen in this study.

Elastic net and Lasso models selected by cross-validation failed to find the set of most important variables. This result agrees with the conclusions of Leng et al. (2006), who proved that regularization models selected by techniques based on prediction accuracy, as is the case with cross-validation, are not consistent in terms of variable selection. In other words, variable selection and model prediction are different issues which need to be taken into account simultaneously, suggesting that model selection criteria such as BIC lead to more appropriate models.

To further improve the prediction accuracy for the sensory traits that were not well predicted by any of the discussed methods, a further analysis with a more extensive set of metabolites will be carried out. Finally, we aim to obtain more accurate sensory-metabolic models by including genetic considerations.