Analytical and Bioanalytical Chemistry, Volume 409, Issue 3, pp 841–857

Improved quantification of important beer quality parameters based on nonlinear calibration methods applied to FT-MIR spectra

  • Carlos Cernuda
  • Edwin Lughofer
  • Helmut Klein
  • Clemens Forster
  • Marcin Pawliczek
  • Markus Brandstetter
Research Paper

DOI: 10.1007/s00216-016-9785-4

Cite this article as:
Cernuda, C., Lughofer, E., Klein, H. et al. Anal Bioanal Chem (2017) 409: 841. doi:10.1007/s00216-016-9785-4
Part of the following topical collections:
  1. Process Analytics in Science and Industry

Abstract

During the production process of beer, it is of utmost importance to guarantee a high consistency of the beer quality. For instance, the bitterness is an essential quality parameter which has to be controlled within the specifications at the beginning of the production process in the unfermented beer (wort) as well as in final products such as beer and beer mix beverages. Nowadays, analytical techniques for quality control in beer production are mainly based on manual supervision, i.e., samples are taken from the process and analyzed in the laboratory. This typically requires significant lab technician effort for only a small fraction of samples to be analyzed, which leads to significant costs for beer breweries and companies. Fourier transform mid-infrared (FT-MIR) spectroscopy was used in combination with nonlinear multivariate calibration techniques to overcome (i) the time-consuming off-line analyses in beer production and (ii) already known limitations of standard linear chemometric methods, like partial least squares (PLS) (Speers et al., J I Brewing. 2003;109(3):229–235; Zhang et al., J I Brewing. 2012;118(4):361–367), for important quality parameters such as bitterness, citric acid, total acids, free amino nitrogen, final attenuation, or foam stability. The calibration models are established with enhanced nonlinear techniques based (i) on a new piece-wise linear version of PLS, which employs fuzzy rules for locally partitioning the latent variable space, and (ii) on extensions of support vector regression variants (𝜖-PLSSVR and ν-PLSSVR) for overcoming high computation times in high-dimensional problems and time-intensive, inappropriate settings of the kernel parameters. Furthermore, we introduce a new model selection scheme based on bagged ensembles in order to improve robustness and thus the predictive quality of the final models. The approaches are tested on real-world calibration data sets for wort and beer mix beverages, and successfully compared to linear methods, showing a clear out-performance in most cases and being able to meet the model quality requirements defined by the experts at the beer company.

Figure

Workflow for calibration of non-linear model ensembles from FT-MIR spectra in beer production

Keywords

Quality control of beer · FT-MIR spectroscopy · Nonlinear PLS · Flexible fuzzy systems · Support vector regression variation · Bagged model selection

Introduction

Motivation and state-of-the-art

Beer is one of the most heavily consumed beverages in the world and the typical ‘national drink’ in Middle European countries, so a high quality of beer and its spin-offs (e.g., beer mix beverages) is of great importance in order to satisfy the consumers and the whole alcoholic drink market. For instance, the bitterness is an essential quality parameter which has to be controlled within the specifications already at the beginning of the production process in the unfermented beer (wort) as well as in final products such as beer and beer mix beverages [3]: it is the key parameter for achieving the taste a beer should have in order to fall within the common classification boundaries [4]. A high quality can only be guaranteed by permanent supervision of the liquid during its production.

By the application of an analytical method on the basis of FT-MIR spectroscopy in combination with suitable chemometric methods, it is possible to significantly reduce time-consuming laboratory analysis. Instead of measuring the relevant quality parameters—such as bitterness, free amino nitrogen, final attenuation, citric acid, total acid, and foam stability—with six different analytical methods sequentially, it is possible to have all quality parameters analyzed simultaneously in less than 15 min for 10 samples drawn from the liquid after production. In comparison, a manual analysis of the most relevant parameters, namely bitterness, final attenuation, and free amino nitrogen, requires operator effort of about 4 h, with an overall duration of about 24 h for final attenuation. This usually causes significant costs for beer breweries and companies.

Current analytical methods for quality control of beer rely on time- and resource-consuming chemical analysis in the laboratory, where an individual method and equipment is needed for each quality parameter. Recently, spectroscopic methods have been developed in order to determine relevant quality parameters simultaneously in much shorter time and with strongly reduced effort for sample preparation [5, 6]. In this context, chemometric methods are employed to gain mathematical models for the quantification of analytes [7] and process parameters [8]. Currently, most of these approaches and resulting models are based on linear calibration methods, which are not able to adequately resolve non-linearities contained in the production process with sufficient accuracy, mainly on the basis of partial least squares regression [9] and especially without the usage of robust model selection techniques. A nonlinear approach can be found in [2], where neural networks have been used for predicting the content of acetic acid; however, it does not address important beer parameters for the end consumer such as bitterness, final attenuation, or foam; moreover, no robust model selection strategies are embedded for appropriately addressing calibration problems based on a very low number of samples.

Our approach

Our approach aims at compensating current shortcomings in beer quality analysis and goes significantly beyond the state of the art in terms of the following aspects:
  • It enables the fully automatic quantification of several important beer parameters in wort as well as in the final products (beer and beer mix beverages), such as bitterness, final attenuation, free amino nitrogen, citric acid, total acid, and foam stability.

  • It employs an FT-MIR spectrometer equipped with an automatic sampler in order to draw samples from probes and to overcome the time-consuming off-line analyses in beer production.

  • It applies enhanced nonlinear calibration modeling methods to overcome already known limitations of standard linear chemometric methods (such as PLS) for the essential parameters mentioned above: one is based on a variation of support vector regression (SVR), the other one on a batch version of Gen-Smart-EFS (short for Generalized Smart Evolving Fuzzy Systems) for extracting generalized fuzzy rules in a fast single-pass manner; the latter is coupled with PLS in order to achieve a kind of piece-wise linear (thus overall non-linear) version of PLS with fuzzy transitions.

  • It embeds a new, robust model selection scheme based on bagged model ensembles constructed from multiple bags; the selection is carried out over a set of possible model candidates, which are obtained from various learning parameter combinations (a parameter grid) used in SVR and fuzzy rule base extraction. The bagged model selection scheme has been mainly motivated by the availability of a very low number of samples for calibration, as bagging explores a sparse sample space well, thus increasing the robustness of calibration [10].

We evaluate our approach on three data sets, two drawn from wort and one from beer mix beverages production. Thereby, we report both the cross-validation (CV) error and the error on a separate validation set. Results show a clear improvement in CV errors over classical linear state-of-the-art methods when applying the enhanced nonlinear techniques for most of the targets, finally achieving errors within the limits of the company’s requirements; for beer mix beverages, this is also confirmed on the separate validation data. For the beer mix beverages data, the application of bagging for model selection brings a substantial improvement in providing robust models on separate validation data with lower over-fitting proneness in case of bitterness and foam stability (the two most essential parameters); in fact, without bagging no useful results within the acceptable error ranges could be achieved for these two parameters.

The setup

Data acquisition

Spectroscopic data of the wort and beer mix beverages samples were acquired off-line using a Nicolet iZ10 FT-IR-spectrometer with CETAC ASX-520 autosampler. Besides the spectrometer core, this instrument contains a programmable logic controller (PLC) and an engine for the evaluation of chemometric models. The optics of the spectrometer contains a monolithic Michelson interferometer, which helps with temperature stability [11]. The resolution and the measurement rate of the instrument are configurable.

For mid-infrared spectroscopic measurement, we used a transmission flow cell with an optical path length of 15 or 20 μm and CaF2 windows. Data collection was set to 16 scans per sample with a resolution of 4 cm−1 in the spectral range 400 to 4000 cm−1. Absorbance spectra of the investigated samples were calculated according to Beer’s law [12] using pure water for recording the background single beam spectra. A schematic sketch of the measurement setup is shown in Fig. 1.
Fig. 1

Schematic view of the data acquisition framework as installed at beer production systems
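For illustration, a minimal NumPy sketch of the absorbance computation described above, obtained from the sample and background single-beam spectra; the function and variable names are ours and not part of the instrument software:

```python
import numpy as np

def absorbance_spectrum(sample_single_beam: np.ndarray,
                        background_single_beam: np.ndarray) -> np.ndarray:
    """Absorbance A = -log10(I / I0) from single-beam spectra sharing one wavenumber axis.

    sample_single_beam     : intensities I recorded with the sample in the flow cell
    background_single_beam : intensities I0 recorded with pure water as background
    """
    transmittance = sample_single_beam / background_single_beam
    return -np.log10(transmittance)
```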

Wort samples were filtered with Kieselgur and beer samples were degassed by ultrasonic treatment. In order to obtain appropriate and reproducible spectra, the filling of the flow cell was optimized by rinsing the flow cell between replicate measurements of a sample and back-washing with deionized water after each sample. Reference data for the target parameters to be supervised (bitterness, final attenuation, free amino nitrogen, types of acidities, and foam stability) have been obtained by manual analysis of wort and beer mix beverages probes taken at random from the production process. Using the reference data and the corresponding FT-MIR absorption spectra, the (nonlinear) chemometric models could be established, see “Results”.

Non-linear calibration methods

Nonlinear PLS with the usage of flexible fuzzy inference systems

Classical PLS

Partial least squares regression [13] is one of the most widely used calibration methods in today’s chemometric modeling tasks and applications [14, 15, 16]. The core concept of PLS is the transformation of the original input feature space—in the case of chemometrics, this is typically the space spanned by the wavelengths [17] or by partially connected pieces in the form of wavebands [18] contained in the spectra—into a reduced input space that best explains the variance contained in the target (which is typically a continuous numerical output when dealing with regression problems). Partial least squares is used to find the fundamental relations between two matrices (input X and output Y), i.e., a latent variable approach to model the covariance structures in these two spaces. A PLS model will try to find the multidimensional direction in the input space that explains the maximum multidimensional variance direction in the output space. It emphasizes a rotation of the input space in order to best explain the variance of the target by means of linear relations/mappings. The resulting transformed space is characterized by eigenvectors which form the so-called latent variable space. Their corresponding eigenvalues can be sorted in descending order in order to achieve a ranking of latent variables, from which a subset selected via a variance-explained cut-off is typically used for model calibration (as, e.g., typically done within the well-known and widely used PLS-Toolbox 1).
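As an illustration only (the paper does not disclose its computations in code form), a short sketch of a classical PLS calibration with a fixed number of latent variables, using scikit-learn as an assumed library; the function name and the arrays X, y are placeholders:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

# X: (n_samples, n_wavenumbers) preprocessed absorbance spectra
# y: (n_samples,) reference values of one target (e.g., bitterness in EBU)
def cv_rmse_pls(X: np.ndarray, y: np.ndarray, n_latent: int, folds: int = 10) -> float:
    """Cross-validated RMSE of a classical linear PLS model with n_latent latent variables."""
    pls = PLSRegression(n_components=n_latent)
    y_hat = cross_val_predict(pls, X, y, cv=folds).ravel()
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))
```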

NonLinear PLS (Variants) through Kernel Transformation

Even though PLS respects the target concept during space transformation, it is still a linear method, i.e., it emphasizes rotations to best represent covariance structures in the data in a linear sense. In order to establish a nonlinear variant of PLS for emphasizing the best variance explanation in a non-linear sense, the kernel-based PLS (K-PLS) is typically employed [19]. Firstly, it applies the kernel trick (in the same way as done in support vector regression, see below) in order to perform a nonlinear transformation of the original data set into an S-dimensional feature space. Secondly, it performs the conventional partial least squares algorithm on the transformed kernel Gram matrices K1 (for the input space) and K2 (for the output space), with entries K1ij = K(xi, xj) and K2ij = K(yi, yj), where K is the multi-dimensional kernel and xi, xj are the input vectors of the i-th and j-th samples (so the kernel function is applied to all sample pairs). The disadvantage of K-PLS becomes immediate when the number of samples N available for regression is large, because then the kernel Gram matrix from the input space explodes in size (N × N).
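A minimal sketch, assuming a Gaussian kernel and NumPy, of how such a Gram matrix would be built; this is not the authors' implementation, only an illustration of why the matrix grows as N × N:

```python
import numpy as np

def gaussian_gram(A: np.ndarray, B: np.ndarray, gamma: float) -> np.ndarray:
    """Kernel Gram matrix with entries K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Example: gaussian_gram(X, X, gamma=1e-3) yields the N x N input-space Gram matrix on which
# conventional PLS is then performed; its size is what makes K-PLS impractical for large N.
```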

Our nonlinear PLS Version

In this paper, we avoid the disadvantage of K-PLS and apply a nonlinear version recently introduced in [20] and successfully applied for establishing calibration models from Fourier transform near-infrared (FT-NIR) spectra in melamine resin production (for cloud point prediction and supervision). Its basic idea lies in partitioning the latent variable space (after transformation with classical PLS) into several (C) local pieces represented by fuzzy rules, i.e., dividing it into different partial principal component directions along the target. Each fuzzy rule embeds a local linear hyper-plane for local trend estimation; this results in piece-wise local linear PLS predictors, which are combined through a weighted linear combination, where the weights are rule activation levels in the form of multi-dimensional Gaussian kernels. This assures smoothness of the whole regression surface, as the piece-wise linear predictors are ’kernel-smoothened’ across their transitions [21].

Our fuzzy rules learning engine

Our engine for extracting the appropriate number and positioning of the fuzzy rules from data acts in a single-pass manner directly in the PLS space, i.e., each single sample taken from the calibration set is first transformed to the latent variable space via the loadings and then sent into the fuzzy rule learning process. Single-pass capability assures very fast learning speeds of the whole fuzzy system, as the rule base grows and the parameters are recursively updated based on single samples (loaded one-by-one into memory), leading to a method whose computational complexity and virtual memory requirements are linear in the number of samples in the calibration set. This makes it very attractive for calibrating models over larger parameter grids within time-intensive cross-validation procedures, see “Intervened nonlinear modeling and evaluation scheme (for all methods)”. In particular, our learning engine is based on the Gen-Smart-EFS approach [22], whose core functionality (without the concepts regarding rule merging and dynamic feature weighting for dimension reduction) is used to find the appropriate number of rules in single-pass evolution steps and also to estimate the kernel functions forming the antecedents of the rules; in this way, each rule antecedent is associated with a triplet (c, Σ−1, r), with c its center, Σ−1 the inverse covariance matrix defining its multivariate ellipsoidal shape, and r its tolerance radius (statistical range of influence, also termed prediction interval [23]), which is automatically extracted from data and steers rule evolution versus rule update, see below. In this sense, one fuzzy rule reads as
$$ \text{IF} \hspace{0.15cm} \mathbf{x} \hspace{0.15cm} \text{IS (about)} \hspace{0.15cm} \text{\(\mu_{i}\)} \hspace{0.15cm} \text{THEN} \hspace{0.15cm} l_{i}(\mathbf{x}) = w_{i0}+w_{i1}x_{1}+w_{i2}x_{2}+...+w_{ip}x_{p} $$
(1)
with \(\mu _{i} = exp (-\frac {1}{2} (\mathbf {x}-\mathbf {c_{i}})^{T}{\Sigma }_{i}^{-1}(\mathbf {x}-\mathbf {c_{i}}))\) denoting the multivariate Gaussian distribution.
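The prediction of such a fuzzy system is the activation-weighted combination of the local linear consequents. A minimal NumPy sketch of this inference step (function and variable names are ours, not from the paper):

```python
import numpy as np

def fuzzy_pls_predict(x, centers, inv_covs, consequents):
    """Weighted combination of local linear PLS predictors (one per fuzzy rule), cf. Eq. (1).

    x           : (p,) sample already projected into the PLS latent variable space
    centers     : list of rule centers c_i, each of shape (p,)
    inv_covs    : list of inverse covariance matrices Sigma_i^{-1}, each (p, p)
    consequents : list of weight vectors [w_i0, w_i1, ..., w_ip] of the local hyper-planes
    """
    acts = np.array([np.exp(-0.5 * (x - c) @ S @ (x - c))        # Gaussian membership mu_i(x)
                     for c, S in zip(centers, inv_covs)])
    weights = acts / (acts.sum() + 1e-12)                         # normalized rule activation levels
    x_ext = np.concatenate(([1.0], x))                            # prepend 1 for the intercept w_i0
    local_out = np.array([np.dot(w, x_ext) for w in consequents]) # local linear predictions l_i(x)
    return float(weights @ local_out)                             # smooth, overall nonlinear output
```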
The single-pass rule evolution and antecedent learning steps are as follows (with C = 0 initially):
  1. 1.

    Load a new sample x; if it is the first one, go to step 5 (there, ignoring the if-part);

     
  2. 2.

    Elicit the winning rule, i.e., the rule closest to the current sample, which is then denoted as cwin; for the distance calculation, the standard Mahalanobis distance is used [24] (as on the left-hand side of (2) below).

     
  3. 3.
    Check whether the following criterion is met (the rule evolution criterion):
    $$\begin{array}{@{}rcl@{}} min_{i=1,...,C} \sqrt{(\mathbf{x}-\mathbf{c_{i}})^{T} {\Sigma}^{-1} (\mathbf{x}-\mathbf{c_{i}})} > r_{i} \hspace{1.0cm}\\ r_{i} = vigi*p^{1/\sqrt{2}}*\frac{1.0}{\left( 1-1/(k_{i}+1)\right)^{m}} \end{array} $$
    (2)
    with p the dimensionality of the input feature space, vigi an a priori defined parameter steering the tradeoff between stability (update of an old cluster) and plasticity (evolution of a new cluster), ki the support of the ith rule, and m a tuning parameter set to 4 per default. The vigilance is the only sensitive parameter and is varied during the model evaluation phase, see “Evaluation scheme and parametrization”; for further explanation of this criterion, please refer to [22].
     
  4. 4.
    If (2) is not met, the center of the winning rule is updated by
    $$ \mathbf{c_{win}}(N+1) = \mathbf{c_{win}}(N) + \eta_{win}(\mathbf{x}-\mathbf{c_{win}}(N)) $$
    (3)
    and its inverse covariance matrix by (the index win neglected due to transparency reasons):
    $$ {\Sigma}^{-1}(k+1) = \frac{{\Sigma}^{-1}(k)}{1-\alpha} -\frac{\alpha}{1-\alpha} \frac{({\Sigma}^{-1}(k)(\mathbf{x}-\mathbf{c}))({\Sigma}^{-1}(k)(\mathbf{x}-\mathbf{c}))^{T}}{1+\alpha ((\mathbf{x}-\mathbf{c})^{T}{\Sigma}^{-1}(k)(\mathbf{x}-\mathbf{c}))} $$
    (4)
    with N the number of samples seen so far and \(\alpha =\frac {1}{k_{win}+1}\), with kwin the number of samples seen so far for which cwin has been the winning rule (cluster). The former stems from the idea of vector quantization [25], minimizing the expected squared quantization error; the learning gain ηwin is thereby set such that it fulfills the Robbins-Monro conditions. The latter is an exact recursive update that does not require the original covariance matrix and is analytically derived using the Neumann series; see [26] for full details.
     
  5. 5.

    If (2) is met, a new rule covering a new region in the feature space (i.e., having sufficient novelty content) is evolved by setting its center cC+1 to the coordinates of x and initializing its inverse covariance matrix \({\Sigma }_{C+1}^{-1}\) to a diagonal matrix with entries equal to 1 divided by a small fraction, i.e., 1/100, of the variable ranges (= initial rule spreads); increase the number of rules: C = C + 1.

     
  6. 6.

    If there have not yet been all samples in the calibration set processed, go to step 1, otherwise stop.

     
Once the rule antecedents are formed, the consequent parameters l1,..., lC for all C rules are estimated through fuzzily weighted least squares [27] in order to assure local learning, which has several advantages over global learning; see [21], Chapter 2, for a detailed analysis. A block diagram summarizing the procedure can be seen in Fig. 2.
Fig. 2

Block diagram summarizing the fuzzy rules learning engine

A special case arises when the inverse covariance matrix Σ−1 is restricted to a diagonal matrix (ignoring the covariances between the inputs). Then, axis-parallel fuzzy rules are obtained and the steps in the itemization above reduce to the classical flexible fuzzy inference systems (FLEXFIS) approach [28].
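For illustration, a simplified NumPy sketch of the single-pass rule evolution loop described in steps 1-6 above; the concrete learning gain and initialization details are assumptions on our part, and the full Gen-Smart-EFS logic in [22] is more elaborate:

```python
import numpy as np

def evolve_rules(T, vigi, m=4.0, init_frac=0.01):
    """Single-pass rule evolution in the latent variable space (simplified sketch of steps 1-6).

    T    : (N, p) calibration samples already transformed by the PLS loadings
    vigi : vigilance parameter in (0, 1) steering rule update vs. evolution (Eq. 2)
    """
    N, p = T.shape
    ranges = T.max(axis=0) - T.min(axis=0)
    ranges = np.where(ranges > 0, ranges, 1.0)
    centers, inv_covs, supports = [], [], []       # one entry per rule (C rules in total)

    for x in T:                                    # step 1: load samples one by one
        if centers:
            # step 2: winning rule via Mahalanobis distance
            dists = [np.sqrt((x - c) @ S @ (x - c)) for c, S in zip(centers, inv_covs)]
            win = int(np.argmin(dists))
            # step 3: rule evolution criterion (Eq. 2)
            r_win = vigi * p ** (1.0 / np.sqrt(2)) / (1.0 - 1.0 / (supports[win] + 1)) ** m
            evolve = dists[win] > r_win
        else:
            evolve = True                          # first sample always evolves a rule
        if evolve:
            # step 5: new rule centered at x, diagonal inverse covariance from initial spreads
            centers.append(x.copy())
            inv_covs.append(np.diag(1.0 / (init_frac * ranges)))
            supports.append(1)
        else:
            # step 4: update the winning rule (center update, Eq. 3; recursive inverse covariance, Eq. 4)
            supports[win] += 1
            alpha = 1.0 / supports[win]            # assumed learning gain, decreasing with support
            d_old = x - centers[win]
            centers[win] = centers[win] + alpha * d_old
            S = inv_covs[win]
            Sd = S @ d_old
            inv_covs[win] = (S / (1 - alpha)
                             - (alpha / (1 - alpha)) * np.outer(Sd, Sd)
                             / (1 + alpha * (d_old @ Sd)))
    return centers, inv_covs, supports
```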

Support vector regression variation

Support vector machines (SVM) [29, 30] are a well-known nonlinear classification method, based on the calculation of hyper-planes in the input feature space that separate classes with maximal margin. The samples closest to the decision boundary, i.e., those defining the positioning of the hyper-planes, are called support vectors. SVM employs the kernel trick [31] to perform a nonlinear transformation of the original data into a linearized space, where the conventional linear concept of separating hyper-planes with margin maximization can again be applied. There is a regression version, support vector regression (SVR) [32], with two variants called 𝜖-SVR and ν-SVR. The general principle behind SVR is the following: a mapping ϕ maps the data X to an m-dimensional feature space, where a linear model is generated
$$ f\left( \mathbf{X},\mathbf{\omega}\right) = \sum\limits_{j=1}^{m} \omega_{j}\phi_{j}(\mathbf{X}) + b $$
(5)
where b is the bias term (null for centered data), ϕj are the nonlinear transformations, and ωj are the model coefficients. The quality of the estimation is then measured by an 𝜖-insensitive loss function, meaning that any loss below 𝜖 is neglected.
Finally, the coefficients calculation depend on the two variants of SVR. For 𝜖-SVR, the coefficients are the solution of the quadratic problem
$$ \begin{array}{ll} min & \frac{1}{2}\Vert\mathbf{\omega}\Vert^{2} + \frac{C}{n}{\sum}_{i=1}^{n}\left( \xi_{i} + \xi_{i}^{*}\right) \\ s.t. &\left\{\begin{array}{rcl} y_{i} - \mathbf{\omega}^{T}x_{i} - b & \leq & \epsilon + \xi_{i} \\ \mathbf{\omega}^{T}x_{i} + b - y_{i} & \leq & \epsilon + \xi_{i}^{*} \\ \xi_{i},\: \xi_{i}^{*} & \geq & 0 \\ \end{array}\right. \end{array} $$
(6)
where n is the number of training samples, xi the inputs, yi the targets, C the cost parameter, and ξi and \(\xi _{i}^{*}\) slack variables.
For ν-SVR, the coefficients are the solution of the quadratic problem
$$ \begin{array}{ll} min & \frac{1}{2}\Vert\mathbf{\omega}\Vert^{2} + C\nu\epsilon + \frac{C}{n}{\sum}_{i=1}^{n}\left( \xi_{i} + \xi_{i}^{*}\right) \\ s.t. & \left\{\begin{array}{rcl} y_{i} - \mathbf{\omega}^{T}x_{i} - b & \leq & \epsilon + \xi_{i} \\ \mathbf{\omega}^{T}x_{i} + b - y_{i} & \leq & \epsilon + \xi_{i}^{*} \\ \xi_{i},\: \xi_{i}^{*} & \geq & 0 \\ \end{array}\right. \end{array} $$
(7)

The cost parameter C controls how strongly falling outside the 𝜖-insensitivity zone is penalized, and the parameter ν is used to bound the noise: ν is an upper bound on the fraction of errors and a lower bound on the fraction of support vectors. Figure Online Resource 1 shows an example of the 𝜖-insensitive area (between the dotted lines) and the loss for the only two points outside; the total loss is therefore the sum of the losses for those two points in this case.

There are plenty of kernel functions that could be used as nonlinear transformations. The most commonly used one is the Gaussian kernel, given by
$$ K(u,v) = e^{-\gamma\vert u-v\vert^{2}} $$
(8)
where γ determines the width of the Gaussian function (a larger γ corresponds to a narrower kernel). The most interesting parameters to be tuned are C and γ in both SVR approaches.

A drawback of SVR is the computational cost for high-dimensional data sets. Therefore, we propose a variant of both 𝜖-SVR and ν-SVR. It consists of adding a preceding step whose goal is to reduce the dimensionality in advance by compressing the data by means of PLS. We denote these variants by 𝜖-PLSSVR and ν-PLSSVR. Thereby, a new parameter arises: the number of latent variables to be used.
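A minimal sketch of the idea behind 𝜖-PLSSVR, using scikit-learn's PLSRegression and SVR as assumed stand-ins (class and parameter names are ours; swapping SVR for NuSVR yields the ν-variant):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.svm import SVR, NuSVR

class PLSSVR:
    """Sketch of eps-PLSSVR: compress the spectra with PLS, then run SVR on the latent scores.

    Using NuSVR(kernel="rbf", C=C, gamma=gamma, nu=nu) instead of SVR gives the nu-PLSSVR variant.
    """

    def __init__(self, n_latent: int, C: float, gamma: float, epsilon: float = 0.1):
        self.pls = PLSRegression(n_components=n_latent)   # number of latent variables: the new parameter
        self.svr = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=epsilon)

    def fit(self, X: np.ndarray, y: np.ndarray) -> "PLSSVR":
        self.pls.fit(X, y)                                 # learn the latent variable space
        self.svr.fit(self.pls.transform(X), y)             # epsilon-SVR with Gaussian kernel on the scores
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        return self.svr.predict(self.pls.transform(X))
```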

Intervened nonlinear modeling and evaluation scheme (for all methods)

Assuming that N calibration samples are available (drawn by the spectroscopic equipment as described in “Data acquisition”), our modeling procedure, together with the full evaluation, performs the following steps:
  1. 1.

    Calculate latent variables lat1,..., latall, with all denoting the number of wavelengths contained in the spectra, ordered according to their importance. Notice that our approach always includes a preceding data compression step by means of PLS; therefore, the latent variables from PLS are always needed.

     
  2. 2.
    Define parameter grid: Parameter selection for PLS, FLEXFIS-PLS, and PLSSVR is based on a grid search including a cross validation procedure. There are two parameter selection approaches: the classical CV selection based on the minimum CVRMSE (=cross-validated root mean squared error) and a robust model selection based on bagging CV, see “Bagged selection for increasing robustness of non-linear modelling”. The parameter grids are different for each of the algorithms:
    • For PLS, use dim = {1,..., a} for the number of latent variables to be included into the calibration model, achieving a vector of grid points gi = dimi.

    • For ridge regression, use different regularization parameters λ, with grid points gi = λi.

    • For generalized linear models with elastic net (GLMNet), use the coefficient α that controls the convex combination between Lasso and ridge regression, and the regularization parameter λ. This results in a matrix of grid points Gij = (αi, λj). See “State-of-art methods used for comparison” for further details on GLMNet.

    • For fuzzy systems, use the number of latent variables dimi and define the vigilance parameter vigi inside the interval (0,1) that steers the rule evolution criterion in (2) and thus controls the level of non-linearity applied [28]. This results in a matrix of grid points Gij = (dimi, vigij).

    • For the SVR approaches, the number of latent variables is fixed, taken from the model selection performed for PLS. The parameters to be tuned are the cost C and the width parameter γ of the Gaussian kernel function. The matrix of grid points is Gij = (Ci, γj), the default grid suggested by the authors of the guide for using LibSVM [33], the most widely used library for SVM.

     
  3. 3.

    For all grid points, perform 10-fold cross-validation [34], in both the classical and the bagged versions, and store the cross-validation errors CVerri and CVerrij, respectively. See “Bagged selection for increasing robustness of non-linear modelling” for further details on the bagged version.

     
  4. 4.
    Perform model selection. Complexity is measured in different ways in each of the considered algorithms, and in some cases there is a relationship between the complexity and the parameters in the grids. For instance, it increases with the number of latent variables in PLS. For fuzzy systems learning coupled with PLS, it increases directly with the dimensionality and inversely with the vigilance, because the lower the vigilance, the higher the number of rules (as (2) is more often fulfilled). For SVR, the complexity can be measured in terms of the number of support vectors: the higher the number, the higher the complexity. There is a direct relationship with the cost C, because a high cost means a high penalization for non-separable points, so a higher number of support vectors is stored in order to diminish the number of non-separable points. There is also a relation between the complexity and the width γ, as a higher value induces a lower kernel width, i.e., steeper surfaces and thus a higher nonlinearity. Our model selection procedure then selects the parameters corresponding to the grid point whose model has the lowest CVRMSE after being penalized according to its complexity, CVerr(pen):
    $$ \mathbf{CVerr}_{ij}(pen)=\mathbf{CVerr}_{ij}\cdot e^{\alpha\:\:param1_{i} + \beta\:\:(1-param2_{j})} $$
    (9)
    with param1 related to the dimensionality in case of fuzzy modeling and to the cost in case of SVR, and param2 to the vigilance in case of fuzzy modeling and to γ in case of SVR. α and β are normalization factors, which are set to 0.05 and 0.5, respectively, in our case.
     
  5. 5.

    Perform a final model training on the whole training set with the obtained optimum parameters \((param1_{i}^{*},param2_{j}^{*})\) and test it on a separate validation set (if available).

     
A block diagram summarizing the procedure can be seen in Fig. 3.
Fig. 3

Block diagram summarizing the standard model selection approach
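For illustration, a small NumPy sketch of the penalized selection of Eq. (9); it assumes the grid values have already been rescaled to [0, 1], which the paper does not state explicitly, and the function name is ours:

```python
import numpy as np

def select_grid_point(cv_err, param1, param2, alpha=0.05, beta=0.5):
    """Return the indices (i, j) minimizing the complexity-penalized CV error of Eq. (9).

    cv_err : (len(param1), len(param2)) matrix of CVRMSE values from the grid search
    param1 : values related to dimensionality (fuzzy modeling) or cost (SVR), scaled to [0, 1]
    param2 : values related to vigilance (fuzzy modeling) or gamma (SVR), scaled to [0, 1]
    """
    p1 = np.asarray(param1, dtype=float)[:, None]
    p2 = np.asarray(param2, dtype=float)[None, :]
    penalized = cv_err * np.exp(alpha * p1 + beta * (1.0 - p2))   # Eq. (9) penalty
    i, j = np.unravel_index(np.argmin(penalized), penalized.shape)
    return int(i), int(j), float(penalized[i, j])
```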

Bagged selection for increasing robustness of non-linear modelling

Bagging [35] stands for bootstrap aggregating. The basic idea behind it is the creation of M bags of N training samples each by means of sampling with replacement (bootstrap sampling [36]). The bagged algorithm is performed on all M bags, and the M outputs are aggregated according to a certain aggregation function, depending on the algorithm. Theoretically, the diversity brought by the bootstrap sampling leads to M models that are not necessarily good individually, but that together lead to a good final aggregated model. Notice that the usual size N of the bags coincides with the number of available samples N. In that concrete case, the expected percentage of unique samples in each bag is 63.2 % [37].

We use bagging for the specific purpose of model selection, thus including the following steps:
  1. 1.

    Create M bags with N samples in each bag.

     
  2. 2.

    For the k-th bag, perform the classical cross validation for all parameter combinations in the grid, depending on the regression approach under consideration. Store the errors \(\mathbf {CVerr}_{ij}^{k}\) for the parameters (param1i, param2j).

     
  3. 3.

    Aggregate the M cross-validation errors using the average as the aggregation function.

     
  4. 4.

    Penalize the errors according to the complexity, using (9).

     
  5. 5.

    Select the optimum parameters \((param1_{i}^{*},param2_{j}^{*})\), for which the penalized error is minimum.

     
A block diagram summarizing the procedure can be seen in Fig. 4.
Fig. 4

Block diagram summarizing the bagged model selection approach

Please note that taking the average as the aggregation function in step 3 is the standard choice, as used in many other bagged modeling approaches, such as, for instance, random forests [38]. It is well known that it produces more robust predictions, especially in case of a low number of samples, because the bootstrapped bags explore the sample space well, as analyzed, e.g., in [10]. A low number of calibration samples is expected in our application, as the real targets have to be manually elicited by the experts, and such a manual analysis requires an effort of several hours for only a couple of samples. In this sense, the usage of bagging for our application is well motivated.
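A compact sketch of the five bagged-selection steps above (NumPy; the CV-error and penalty callables are placeholders for whatever calibration method and Eq. (9) penalty are plugged in):

```python
import numpy as np

def bagged_model_selection(X, y, grid, cv_error_fn, penalty_fn, n_bags=20, seed=0):
    """Bagged model selection sketch (steps 1-5): average per-bag CV errors, penalize, pick the minimum.

    cv_error_fn(X, y, params) : returns the 10-fold CVRMSE for one parameter combination
    penalty_fn(params)        : multiplicative complexity penalty of Eq. (9) for that combination
    """
    rng = np.random.default_rng(seed)
    N = len(y)
    avg_err = np.zeros(len(grid))
    for _ in range(n_bags):                               # step 1: M bootstrap bags of size N
        idx = rng.integers(0, N, size=N)                  # sampling with replacement (~63.2 % unique)
        Xb, yb = X[idx], y[idx]
        avg_err += np.array([cv_error_fn(Xb, yb, g) for g in grid])   # step 2: CV on each bag
    avg_err /= n_bags                                     # step 3: aggregate by averaging
    penalized = avg_err * np.array([penalty_fn(g) for g in grid])     # step 4: complexity penalty
    return grid[int(np.argmin(penalized))]                # step 5: optimum parameter combination
```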

Case study configuration for evaluation of calibration methods

Data sets characteristics and pre-treatment

Three data sets have been made available from beer production with manual target measurements. Two of them correspond to unfermented beer (wort), independently recorded in 2014 and 2015 with different parts in the measurement equipment (15 and 20 μm cells), and the third one to beer mix beverages production, the latter including a separate validation set recorded several weeks later. Due to the differences in the composition and in the final product, the parameters that are relevant for the product quality, and that will therefore be monitored, are not the same. The concrete parameters and their acronyms (coming from the German language) are
  • For wort: (i) Bitterness (EBU), (ii) final attenuation (FA), and (iii) free amino nitrogen (FAN).

  • For beer mix beverages: (i) Bitterness, (ii) foam stability (S), (iii) citric acid (CA), and (iv) total acid (TA).

The most relevant one is bitterness, which is known to show quite nonlinear behavior and is thus a good motivation for nonlinear approaches. Moreover, a pre-study conducted by BrauUnion showed that linear approaches failed to reach an acceptable accuracy for these targets.
Figures 5 and 6 show, respectively, the absorbance spectra for one wort data set and for the beer mix beverages data set. It can be seen in Fig. 6 that the external validation set shows severe extrapolation, indicating a hard validation benchmark case, which can indeed be weakened by appropriate pre-processing methods (see below), but not completely avoided. Just by visual inspection, it is clear that not all wavelengths are relevant and constructive for the modeling process (e.g., sudden peaks should be removed). Thus, subsets of the original 1790 wavenumbers have been selected by an expert. Those subsets contain between 200 and 500 wavenumbers each, depending on the data set and target. The selections have been tested against several stochastic and non-stochastic variable selection methods, e.g., uninformative variable elimination [39, 40], forward selection [41], and genetic algorithms [18], and have been found to be optimal for those subset sizes.
Fig. 5

Absorbance spectra for wort (2014)

Fig. 6

Absorbance spectra for beer mix beverages, including both calibration and validation data

In each data set, the number of samples available for each target varies. Spectral data are continuously being recorded, but some targets require a longer time to be measured than others. Due to the high effort for manual analysis of the probes drawn in order to obtain the target values, the number of samples has been restricted to 31 for the beer mix beverages, 47 for the 2014 wort, and 37 for the 2015 wort data sets. After cleaning, this number reduces further to the values for the individual targets shown in Table 1.
Table 1

Data sets characteristics used for calibration and validation

Dataset | Target | Unit | Relevant spectral regions (cm−1) | # of samples calib/valid
Wort 2014 | Bitterness | EBU | 1346-1564 | 47/-
Wort 2014 | Final attenuation | % | 976-1363 | 47/-
Wort 2014 | Free amino nitrogen | mg/l | 1012-1475, 2517-2980 | 47/-
Wort 2015 | Bitterness | EBU | 1134-1499 | 37/-
Wort 2015 | Final attenuation | % | 1070-1421 | 37/-
Wort 2015 | Free amino nitrogen | mg/l | 1012-1437 | 37/-
Beer mix beverages | Bitterness | EBU | 1138-1443 | 31/11
Beer mix beverages | Foam stability | s | 1207-1437, 2748-2960 | 31/11
Beer mix beverages | Citric acid | g/l | 1148-1495 | 31/11
Beer mix beverages | Total acids | g/l | 1051-1128, 1168-1495, 2748-2931 | 31/11

Given this very low number of samples, bagging, which explores the sample space well, can be expected to provide more robust model selection than the classical CV.

Besides, several well-known preprocessing methods [42] have been employed in order to find a preprocessing strategy that behaves well for all targets. For operational reasons, as the spectral data are the same for all targets, a single strategy for all targets of each data set is required. The one chosen for all data sets has been the two-step strategy consisting of first applying standard normal variate [43] and then mean centering.
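A minimal NumPy sketch of this two-step preprocessing; the function name is ours, and reusing the calibration column mean to center validation spectra is our assumption of standard practice:

```python
import numpy as np

def snv_then_mean_center(calib: np.ndarray, valid: np.ndarray = None):
    """Two-step preprocessing used for all targets: standard normal variate, then mean centering.

    calib : (n_samples, n_wavenumbers) calibration absorbance spectra
    valid : optional validation spectra, centered with the calibration mean
    """
    def snv(spectra):
        # each spectrum (row) is centered and scaled by its own mean and standard deviation
        return (spectra - spectra.mean(axis=1, keepdims=True)) / spectra.std(axis=1, keepdims=True)

    calib_snv = snv(calib)
    col_mean = calib_snv.mean(axis=0)          # per-wavenumber mean of the calibration set
    if valid is None:
        return calib_snv - col_mean
    return calib_snv - col_mean, snv(valid) - col_mean
```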

State-of-art methods used for comparison

It is known by the experts that most of the parameters we are interested in show, to some extent, some nonlinear behavior. Nevertheless, basic linear methods have been applied at the company’s production site. In order to check the (improved) performance achievable with nonlinear methods, we will compare our nonlinear methods with the following state-of-the-art linear methods:
Partial Least Squares (PLS):

It is a linear method, used to find the fundamental relations between two matrices (input X and output Y), i.e., a latent variable approach to model the covariance structures in these two spaces. A PLS model will try to find the multidimensional direction in the input (X) space that explains the maximum multidimensional variance direction in the output (Y) space. It emphasizes a rotation of the input space in order to best explain (the variance of the) target by means of linear relations/mappings. We have used it for comparison purposes, because it is the most widely used state-of-the-art method in chemometrics and especially in automatic beer parameter analytics. For a compact summary of its principal concepts, please refer to the beginning of “Nonlinear PLS with the usage of flexible fuzzy inference systems”.

Generalized linear models with elastic net (GLMNet):

The Lasso method [44] and ridge regression [43] belong to the family of shrinkage methods and can be seen as regression algorithms including an ℓ1 and an ℓ2 penalty, respectively. The elastic net [45] includes a penalty based on a combination of both ℓ1 and ℓ2 penalties, looking for some elasticity in the regularization.

Generalized linear models [43] are a generalization of ordinary linear regression that provides flexibility in the sense that the distribution of the errors is not necessarily assumed to be normal, as it is in ordinary linear regression.

The combination of the elastic net with generalized linear models is a regression algorithm based on generalized least squares that uses cyclical coordinate descent [46] in a path-wise fashion [47] in order to select the optimum elasticity in the regularization via the elastic net.

Ridge regression:
Despite being a particular case of GLMNet (when the lasso part of the elastic net is ignored), ridge regression deserves its own separate spot. In MLR, we determine the best regression vector \(\hat {\mathbf {b}}\), according to a minimum least squares criterion, when trying to solve the regression problem y = Xb, with X the regression matrix. It is then well known that the regression vector is
$$\hat{\mathbf{b}} = (\mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T} \mathbf{y} $$
When handling variables that are highly correlated, singularity problems arise when it comes to calculating the inverse of XTX. A way to deal with this problem is regularization. It consists of adding a regularization term to the least squares minimization problem. Ridge regression uses λI as the regularization term, where λ > 0 is a parameter to be tuned. Then, the regression vector becomes
$$\hat{\mathbf{b}} = (\mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^{T} \mathbf{y} $$

The regularization parameter will be tuned with a grid, see below.
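For illustration, the closed-form ridge solution above as a short NumPy sketch (function name is ours):

```python
import numpy as np

def ridge_regression_vector(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form regularized regression vector b = (X^T X + lambda I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```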

Evaluation scheme and parametrization

The evaluation scheme is performed differently for unfermented beer and beer mix beverages, due to the characteristics of the data. For unfermented beer, we have performed the classical cross-validation model selection, as stated in “Intervened nonlinear modeling and evaluation scheme (for all methods)”, for both data sets separately, so that we can compare the performance of the regression methods. As a hard benchmark, we used the final models trained on the 2014 data for validation on the 2015 data (different measurement equipment), just to check how far our models are able to reliably extrapolate into the future. For beer mix beverages, the availability of validation data also offers the possibility of comparing the classical and bagged cross-validation model selections, in order to see how close those model selection approaches come to the best possible parameter combination for the external validation set (which is not accessible during CV selection).

The proposed parameter grids are the following:
  • PLS: The number of latent variables (coded as P1 in the tables in “Results”) varies from 1 to 15.

  • Ridge: The regularization parameter λ (coded as P2) goes from 0.01 to 0.95 in steps of 0.05. This grid has been successfully used in previous studies [48], in which the data were obtained under similar circumstances in similar real-world problems.

  • GLMNet: The regularization parameter λ (coded as P1) has been set from 0.01 to 0.09 in steps of 0.01, in order to leave the default value suggested by the proposers of the method, 0.05, in the middle of the grid. The parameter α (coded as P2), responsible for the elasticity in the elastic net, takes values from 0.1 to 1.0 in steps of 0.1. Notice that α = 1 is equivalent to using pure lasso, and α = 0 would be pure ridge (excluded here because it has its own spot).

  • FLEXFISPLS: The dimensionality (coded as P1) varies in the same way as the number of latent variables in PLS, and the vigilance (coded as P2) takes the values between 0.1 and 0.9, with steps of length 0.1.

  • 𝜖-PLSSVR, ν-PLSSVR: As mentioned in “Support vector regression variation”, the number of latent variables used is not tuned, but fixed to the selection made for PLS. Besides, the cost and kernel width parameters (coded as P1 and P2, respectively) take the values suggested by the LibSVM library developers. Thus, C takes values in {2−5, 2−3, … , 215}, and γ in {2−15, 2−13, … , 23}.

Results

The results section is structured according to the two validation schemes we have conducted for performance evaluation:
  1. 1.

    A classical and enhanced (employing bagging) cross-validation procedure on each of the training data sets for wort 2014, wort 2015, and beer mix beverages data.

     
  2. 2.

    Validation on a separate available test data set in case of beer mix beverages, as well as validation of the final models trained on wort 2014 data on the wort 2015 data (hard benchmark).

     
In the following two subsections, we visually show the results and perform a detailed interpretation of them.

Cross-validation performance

When it comes to unfermented beer, the most relevant characteristics to be monitored are bitterness, final attenuation, and free amino nitrogen.

Figure 7 shows the results for EBU in the data set for wort beer from year 2014; we can see: Fig. 7a, b the correlation plots corresponding to, respectively, the best nonlinear and linear methods; Fig. 7c a summary table containing the selected parameters, the CVRMSE, and the average R2 of the predictions in all folds (CVR2) for each calibration method; and Fig. 7d the observed vs. predicted plot for the method achieving the lowest CVRMSE (highlighted in bold font in Fig. 7c). Analogously, the results for the FA and FAN targets are shown in Figures Online Resource 2 and Online Resource 3, respectively.
Fig. 7

CV summary results for bitterness in the data set for wort beer from year 2014. We can see: (a), and (b) the correlation plots corresponding to, respectively, the best non-linear and linear methods; (c) a summary table containing the selected parameters, the CVRMSE and the average R2 of the predictions in all folds (CVR2) for each calibration method; and (d) the observed vs predicted plot for the method achieving the lowest CVRMSE (highlighted in bold font in (c))

Notice that for final attenuation (see Figure Online Resource 2), the performance of the nonlinear methods is indeed quite similar to the performance of PLS, which is linear and thus theoretically less prone to over-fitting. Besides, it is good to see that the prediction ability for samples close to the targets’ extreme values is high, despite the lack of balance in the data. For FAN (Figure Online Resource 3), both SVR approaches show around 15 % lower CVRMSE than the rest. Just by visual inspection, comparing Online Resource 3 (a) and (b), we can see that the SVR approach performs well at both the upper and lower boundaries, whereas GLMNet does not; this explains the difference in the CVRMSE. In case of bitterness, the most important parameter for wort supervision (as it is responsible for the final taste for customers), the improvement achieved by SVR compared to the best linear method, GLMNet, is about 36 %, finally achieving the company’s goal to stay within the error range limit of 3 (an error of 2.47 is achieved), which is not the case for the linear methods (an error of 3.85 is achieved).

When it comes to the data set for wort beer from 2015 (flow cell with 20 μm), the structure of the results is similar for both the linear and nonlinear methods. Again, there is a clear outperformance of linear methods by nonlinear ones in case of bitterness (see Fig. 8) and FAN (see Online Resource 5), but this time also for final attenuation (Online Resource 4). The overall conclusions are a good extrapolation behavior and little risk of over-fitting for both SVR approaches, and an even lower risk for FLEXFIS, which is noticeable when we see that the dimensionality is lower than in PLS and the vigilance is pretty high, at least 0.3, which is usually an indicator of a low nonlinearity degree. A vigilance below 0.3 would be an indicator of very high nonlinearity in our model, and thus a high risk of over-fitting.
Fig. 8

CV summary results for bitterness in the data set for wort beer from year 2015. We can see: (a), and (b) the correlation plots corresponding to, respectively, the best non-linear and linear methods; (c) a summary table containing the selected parameters, the CVRMSE and the average R2 of the predictions in all folds (CVR2) for each calibration method; and (d) the observed vs predicted plot for the method achieving the lowest CVRMSE (highlighted in bold font in (c))

For beer mix beverages, it is noticeable that for both citric acid (see Online Resource 7) and total acids (Online Resource 8) the performance of the linear and nonlinear approaches is similar. For both targets, FLEXFIS is the worst algorithm, but the situations are different. The parameters for total acids look coherent, but in the case of citric acid, it seems that the CV model selection aimed at a parameter combination with two huge clusters (dimensionality equal to 2, much lower than the number of LVs in PLS, because the vigilance is the lowest possible). In case of bitterness (Fig. 9), nonlinear methods again outperform linear ones significantly (as for the wort data); whether this is a matter of over-fitting or not (because of the high parameter values in SVR) will be clarified in the subsequent section when illuminating the results on the separate validation data set. When it comes to foam stability (Online Resource 6), the difference between the number of LVs (latent variables) for PLS and the dimensionality for FLEXFIS is quite big. Nevertheless, it has to be understood in terms of the nonlinearity degree: PLS needs more dimensions in order to catch part of the non-linearity, whereas FLEXFIS can do it with fewer (the vigilance indicates a mid-to-high degree of nonlinearity).
Fig. 9

CV summary results for bitterness in the data set for beer mix beverages. We can see: (a), and (b) the correlation plots corresponding to, respectively, the best non-linear and linear methods; (c) a summary table containing the selected parameters, the CVRMSE and the average R2 of the predictions in all folds (CVR2) for each calibration method; and (d) the observed vs predicted plot for the method achieving the lowest CVRMSE (highlighted in bold font in (c))

Performance on separate validation data

Regarding the data corresponding to beer mix beverages, the bagging approach has also been applied for model selection as an alternative to standard cross validation. Those results are not shown in the previous section, because the final error used in the bagging approach has a different purpose. That error measure is an average of CVRMSEs coming from the different bags, so comparing it with the usual CVRMSE makes no sense. Nevertheless, the aim is not to use that error measure as an estimation of the future performance on unseen data, but to take advantage of the robustness provided by the use of diverse bags for a better, more robust model selection, in order to obtain a lower root mean square error of prediction on separate test data (termed RMSEP) and to reduce the over-fitting effect.

In order to check the performance of the two model selection approaches (classical and bagged), we have calculated the RMSEP for all possible combinations of the model learning parameters (see “Evaluation scheme and parametrization”), so the parameter combination corresponding to the minimum RMSEP is considered as the best possible model selection overall,
$$\left( \widehat{P}_{1}, \widehat{P}_{2}\right) = argmin_{(P_{1}, P_{2})\in \mathbf{G}_{1}\times \mathbf{G}_{2}}\left( RMSEP_{P_{1}, P_{2}}\right) $$
In this way, we can see how close our model selections come to the best possible model selection overall.
Figure 10, Online Resource 9, Online Resource 10, and Online Resource 11 show the results respectively for bitterness, foam stability, citric acid, and total acid. The structure of each figure is: (a) correlation plot for the best method according to the classical CV selection, i.e., the parameters corresponding to the minimal CV error are selected (thus, a direct comparison to the figures for the CV results is possible), (b) correlation plot for the best method according to the new bagged selection, (c) correlation plot for the best method according to the (theoretically) best possible selection (corresponding to the minimal entry in the right half of the tables), and (d) a summary table containing the selected parameters for each algorithm (in both classical CV and bagged selections, the latter indicated in the method name by the appended term ’Bag’), the root mean square error of prediction, and the corresponding R2 for our selections (columns 2–5) and the best possible selections (columns 6–9).
Fig. 10

Summary results of external validation for bitterness for the data corresponding to beer mix beverages. The four parts correspond to: (a) correlation plot for the best method according to the classical CV selection, i.e. the parameters corresponding to the minimal CVRMSE), (b) correlation plot for the best method according to the new bagged selection, (c) correlation plot for the best method according to the (theoretically) best possible selection (corresponding to minimal entry in the right half of the summary table), and (d) summary table containing the selected parameters for each algorithm (in both classical CV and bagged selections, the latter indicated in the method name by the appended term ’Bag’), the root mean square error of prediction, and the corresponding R2 for our selection (columns 2-5) and the best possible selection (columns 6-9)

When it comes to bitterness, Fig. 10, we can see that for both the best model selection and the best possible selection there is a very good prediction ability in the important range [5,8]. Besides, the errors are systematically below 2, which is a requirement from the company. With one single exception, 𝜖-SVR, the use of the bagged model selection improves on the classical one, leading the selection to models with lower complexity (a lower number of LVs in PLS, a lower number of rules in FLEXFIS, and a lower number of support vectors in SVR), while being closer to the best possible models in terms of error performance. In fact, in most cases, the RMSEP of the model selected by grid search (RMSEPGS) can be significantly reduced, especially in case of fuzzy modeling (using FLEXFIS) down to 1.55, clearly outperforming all state-of-the-art methods. In the case of the SVR approaches, there is margin for improving the selection process; the most promising action would be to estimate/compute the adequate number of LVs to be used, instead of using the ones obtained for PLS. One more thing should be noted for ν-SVR: the improvement that bagging brings is not clearly visible in the RMSEP, but in the R2. The reason for this is the presence of some isolated high error peaks at the boundaries of the range, which penalize the error, but not the correlation.

For foam stability (see Online Resource 9), similar observations as in the case of bitterness can be made, whereas the improvement achieved by bagging is even more pronounced for the nonlinear methods (e.g., a reduction of more than 50 % in error for 𝜖-SVR, down to an error of around 26). In this sense, this variant in combination with bagging is the most feasible option. Compared to the CV results, the errors are indeed significantly worse, but with the help of bagging they still lie within the company’s upper limit of 30 (which is not achievable with classical CV selection). The best possible selection (right half of the table in Online Resource 9) does not really further improve the error on separate validation data. Hence, the bagged selection already achieves the optimum performance during CV, which is the ideal situation, as the separate test data set is generally not accessible during the training phase.

For citric acid and total acids (Online Resources 10 and 11, respectively), bagging helps only for the non-linear methods. The reason for that is clear: the higher the risk of overfitting, the bigger the advantage of bagged approaches, and here the non-bagged variants already perform pretty well (close to the CV results) and clearly within the upper error limit of 0.3. It is known by the experts that the most non-linear target is bitterness; consistently, for these two acid targets linear methods seem to be the best ones, with GLMNet being the preferred among them. Besides, fuzzy modeling with FLEXFIS behaves better than all linear models, even though the model selection cannot detect this. The reason is the flexibility of FLEXFIS to adapt to any degree of non-linearity, even a light nonlinearity as in the case of citric acid and total acids. In the case of citric acid, the classical selection for FLEXFIS works badly, leading to the lowest possible vigilance; the consequence is a high number of rules. Bagging selects the same dimensionality, but with a higher vigilance (= a lower number of rules), which is closer to the best possible selection and expected to be more robust (as less prone to over-fitting) for prediction on future data.

Finally, the validation of the 2014 wort models on the 2015 wort data, which we checked incidentally (it was not a requirement of the company), did not bring any reasonable results for bitterness and FAN, as the errors rose to significantly above 4 in case of bitterness and to above 14 in case of FAN (both significantly above the requested upper limits), even when taking into account the best possible parameter/model selection. However, for FA, the errors stayed in the same range as achieved through cross-validation, which is a remarkable result given that the two data sets have been recorded with different measurement equipment.

Conclusion and outlook

This paper proposes two nonlinear modeling techniques for calibrating models to predict important parameters during beer production. Their supervision is necessary in order to guarantee a high level of beer quality, to assure that a beer tastes the same way as it used to within small boundaries of variation, and thus that it satisfies the customers’ expectations. Current state-of-the-art chemometric methods based on spectroscopic measurements do not meet the minimal prediction error requirements provided by the company for all the important parameters (especially not for bitterness and final attenuation), which, however, can be resolved with two nonlinear modeling techniques: (1) the first one relying on a nonlinear version of PLS with the usage of Takagi-Sugeno fuzzy systems for obtaining piecewise linear predictors and (2) the second one (with even higher performance) relying on a variation of support vector regression. In particular, an error reduction of about 35 % up to 45 % in case of bitterness and of about 50 % in case of final attenuation could be achieved. Furthermore, in case of beer mix beverages, the new, robust model selection scheme based on bagged ensembles led to a significant error reduction on separate validation data for foam stability and bitterness: especially in case of foam, the error could be reduced from 64 down to below the upper allowed limit of 30, which is remarkable. In case of the acids for mix beverages, no significant improvement could be made, as the linear models already performed very well on them.

Future work includes the usage of enhanced genetic algorithms for wavelength selection in the context of differential evolution and co-evolution (as having been successfully applied before on FT-NIR spectra data from another application [18] by the main authors of this paper) as well as the application of more advanced ensemble methods such as, e.g., boosting or random forests for a better stability of prediction errors on separate validation data. Additionally, more important parameters for different types of alcoholic and non-alcoholic beverages will be analyzed by our non-linear modeling techniques.

Acknowledgments

Financial support was provided by (i) the Austrian research funding association (FFG) under the scope of the COMET programme within the research project Industrial Methods for Process Analytical Chemistry - From Measurement Technologies to Information Systems (imPACts) (contract #843546), (ii) the Basque Government through the ELKARTEK and BERC 2014-2017 programs, and (iii) the Spanish Ministry of Economy and Competitiveness MINECO: BCAM Severo Ochoa accreditation SEV-2013-032. This publication reflects only the authors’ views.

Compliance with Ethical Standards

This paper does not contain any studies with human participants or animals performed by any of the authors.

Conflict of interests

The authors declare that they have no conflict of interest.

Supplementary material

216_2016_9785_MOESM1_ESM.pdf (PDF 3.26 MB)

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Carlos Cernuda (1, 2)
  • Edwin Lughofer (2)
  • Helmut Klein (3)
  • Clemens Forster (3)
  • Marcin Pawliczek (4)
  • Markus Brandstetter (4)
  1. BCAM - Basque Center for Applied Mathematics, Bilbao, Spain
  2. Department of Knowledge-Based Mathematical Systems, Johannes Kepler University Linz, Linz, Austria
  3. BrauUnion GmbH, Linz, Austria
  4. RECENDT GmbH, Science Park 2, Linz, Austria
