# Improved quantification of important beer quality parameters based on nonlinear calibration methods applied to FT-MIR spectra

DOI: 10.1007/s00216-016-9785-4

- Cite this article as:
- Cernuda, C., Lughofer, E., Klein, H. et al. Anal Bioanal Chem (2017) 409: 841. doi:10.1007/s00216-016-9785-4


## Abstract

During the production process of beer, it is of utmost importance to guarantee a high consistency of the beer quality. For instance, bitterness is an essential quality parameter which has to be controlled within the specifications at the beginning of the production process in the unfermented beer (wort) as well as in final products such as beer and beer mix beverages. Nowadays, analytical techniques for quality control in beer production are mainly based on manual supervision, i.e., samples are taken from the process and analyzed in the laboratory. This typically requires significant lab technician effort for only a small fraction of samples to be analyzed, which leads to significant costs for beer breweries and companies. Fourier transform mid-infrared (FT-MIR) spectroscopy was used in combination with nonlinear multivariate calibration techniques to overcome (i) the time-consuming off-line analyses in beer production and (ii) already known limitations of standard linear chemometric methods, *like partial least squares (PLS)*, for important quality parameters (Speers et al., J I Brewing. 2003;109(3):229–235; Zhang et al., J I Brewing. 2012;118(4):361–367) such as bitterness, citric acid, total acids, free amino nitrogen, final attenuation, or foam stability. The calibration models are established with enhanced nonlinear techniques based (i) on a new *piece-wise linear version of PLS*, which employs fuzzy rules for locally partitioning the latent variable space, and (ii) on extensions of *support vector regression variants* (*𝜖*-PLSSVR and *ν*-PLSSVR), which overcome high computation times in high-dimensional problems and time-intensive, often inappropriate settings of the kernel parameters. Furthermore, we introduce a *new model selection scheme* based on bagged ensembles in order to improve robustness and thus the predictive quality of the final models.
The approaches are tested on real-world calibration data sets for wort and beer mix beverages, and successfully compared to linear methods, showing a clear out-performance in most cases and being able to meet the model quality requirements defined by the experts at the beer company.

### Keywords

Quality control of beer · FT-MIR spectroscopy · Nonlinear PLS · Flexible fuzzy systems · Support vector regression variation · Bagged model selection

## Introduction

### Motivation and state-of-the-art

A high quality of beer and its spin-offs (e.g., beer mix beverages), one of the most heavily consumed beverages in the world and the typical ‘national drink’ in Middle European countries, is of great importance in order to satisfy the consumers and the whole alcoholic drink market. For instance, bitterness is an essential quality parameter which has to be controlled within the specifications already at the beginning of the production process in the unfermented beer (wort) as well as in final products such as beer and beer mix beverages [3]: it is the key parameter for achieving a certain taste the beer should have in order to fall within the common classification boundaries [4]. A high quality can only be guaranteed by permanent supervision of the liquid during its production.

By applying an analytical method on the basis of FT-MIR spectroscopy in combination with suitable chemometric methods, it is possible to significantly reduce time-consuming laboratory analyses. Instead of measuring the relevant quality parameters—such as bitterness, free amino nitrogen, final attenuation, citric acid, total acid, and foam stability—with six different analytical methods sequentially, it is possible to have all quality parameters analyzed simultaneously in less than 15 min for 10 samples drawn from the liquid after production. In comparison, a manual analysis of the most relevant parameters, namely bitterness, final attenuation, and free amino nitrogen, requires operator effort of about 4 h, with an overall duration of about 24 h for final attenuation. This usually causes significant costs for beer breweries and companies.

Current analytical methods for quality control of beer rely on time- and resource-consuming chemical analysis in the laboratory, where an individual method and equipment is needed for each quality parameter. Recently, spectroscopic methods have been developed in order to determine relevant quality parameters simultaneously in much shorter time and with strongly reduced effort for sample preparation [5, 6]. In this context, chemometric methods are employed to obtain mathematical models for quantification of the analytes [7] and process parameters [8]. Currently, most of these approaches and resulting models are based on linear calibration methods (which are not able to resolve non-linearities contained in the production process with sufficient accuracy), mainly on the basis of partial least squares regression [9], and especially without the usage of robust model selection techniques. A nonlinear approach can be found in [2], where neural networks have been used for predicting the content of acetic acid; however, it does not address beer parameters important for the end consumer, such as bitterness, final attenuation, or foam; moreover, no robust model selection strategies are embedded for appropriately addressing calibration problems based on a very low number of samples.

### Our approach

Our approach is characterized as follows:

- It enables the fully automatic quantification of several important beer parameters in wort as well as in the final products (beer and beer mix beverages), such as bitterness, final attenuation, free amino nitrogen, citric acid, total acid, and foam stability.

- It employs an FT-MIR spectrometer equipped with an automatic sampler in order to draw samples from probes and to overcome the time-consuming off-line analyses in beer production.

- It applies enhanced nonlinear calibration modeling methods to overcome already known limitations of standard linear chemometric methods (such as PLS) for the essential parameters mentioned above: one is based on a variation of support vector regression (SVR), the other one on a batch version of *Gen-Smart-EFS* (*short for Generalized Smart Evolving Fuzzy Systems*) for extracting generalized fuzzy rules in a fast single-pass manner; the latter is coupled with PLS in order to achieve a kind of piece-wise linear (thus overall non-linear) version of PLS with fuzzy transitions.

- It embeds a new, robust model selection scheme based on bagged model ensembles which are constructed from multiple bags; the selection is carried out on a set of possible model candidates, which are obtained from various learning parameter combinations (parameter grid) used in SVR and fuzzy rule base extraction.

The bagged model selection scheme has been mainly motivated by the availability of a very low number of samples for calibration, as bagging explores a sparse sample space well, thus increasing the robustness of calibration [10].

We evaluate our approach on three data sets, two drawn from wort production and one from beer mix beverage production. We report both the cross-validation (CV) error and the error on a separate validation set. The results show a clear improvement in CV errors over classical linear state-of-the-art methods when applying the enhanced nonlinear techniques for most of the targets, finally achieving errors within the limits of the company’s requirements; for beer mix beverages, this is also confirmed on the separate validation data. For the beer mix beverages data, the application of bagging for model selection contributes much to providing robust models on separate validation data with lower over-fitting proneness in the case of bitterness and foam stability (the two most essential parameters); in fact, without bagging, no useful results within the acceptable error ranges could be achieved for these two parameters.

## The setup

### Data acquisition

Spectroscopic data of the wort and beer mix beverages samples were acquired off-line using a Nicolet iZ10 FT-IR-spectrometer with CETAC ASX-520 autosampler. Besides the spectrometer core, this instrument contains a programmable logic controller (PLC) and an engine for the evaluation of chemometric models. The optics of the spectrometer contains a monolithic Michelson interferometer, which helps with temperature stability [11]. The resolution and the measurement rate of the instrument are configurable.

The spectrometer is equipped with a flow cell with a path length in the *μ*m range and CaF2 windows. Data collection was set to 16 scans per sample with a resolution of 4 cm^{−1} in the spectral range 400 to 4000 cm^{−1}. Absorbance spectra of the investigated samples were calculated according to Beer’s law [12], using pure water for recording the background single beam spectra. A schematic sketch of the measurement setup is shown in Fig. 1.
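The absorbance computation from the recorded single-beam spectra can be sketched as a minimal numpy illustration of Beer's law as used here (the array sizes and the 50 % transmittance toy value are made up):

```python
import numpy as np

def absorbance(sample_beam: np.ndarray, background_beam: np.ndarray) -> np.ndarray:
    """Absorbance spectrum per Beer's law: A = -log10(I_sample / I_background).

    Both inputs are single-beam intensity spectra over the same wavenumber grid
    (here 400-4000 cm^-1); the background is recorded with pure water.
    """
    transmittance = sample_beam / background_beam
    return -np.log10(transmittance)

# Toy check: a sample transmitting 50 % of the background intensity everywhere.
wavenumbers = np.linspace(400, 4000, 901)
background = np.ones_like(wavenumbers)
a = absorbance(0.5 * background, background)
```

Identical sample and background beams give zero absorbance, as expected from the definition.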

Wort samples were filtered with Kieselgur, and beer samples were degassed by ultrasonic treatment. In order to obtain appropriate and reproducible spectra, the filling of the flow cell was optimized by rinsing the flow cell between the replicates of each sample and back-washing with deionized water after each sample. Reference data for the target parameters to be supervised (bitterness, final attenuation, free amino nitrogen, types of acidities, and foam stability) have been obtained by manual analysis of wort and beer mix beverages probes taken at random from the production process. Using the reference data and the corresponding FT-MIR absorption spectra, the (nonlinear) chemometric models could be established, see “Results”.

## Non-linear calibration methods

### Nonlinear PLS with the usage of flexible fuzzy inference systems

### Classical PLS

Partial least squares regression [13] is one of the most widely used calibration methods in today’s chemometric modeling tasks and applications [14, 15, 16]. The core concept of PLS is the transformation of the original input feature space—in chemometrics, typically the space spanned by the wavelengths [17] or by partially connected pieces in the form of wavebands [18] contained in the spectra—into a reduced input space that best explains the variance contained in the target (which is typically a continuous numerical output when dealing with regression problems). Partial least squares is used to find the fundamental relations between two matrices (input **X** and output **Y**), i.e., a latent variable approach to model the covariance structures in these two spaces. A PLS model tries to find the multidimensional direction in the input space that explains the maximum multidimensional variance direction in the output space. It emphasizes a rotation of the input space in order to best explain the variance of the target by means of linear relations/mappings. The resulting transformed space is characterized by eigenvectors which form the so-called *latent variable space*. Their corresponding eigenvalues can be sorted in descending order in order to achieve a ranking of latent variables, from which a subset selected via a variance-explained cut-off is typically used for model calibration (as, e.g., typically done within the well-known and widely used PLS-Toolbox ^{1}).

### Nonlinear PLS (variants) through kernel transformation

Even though PLS respects the target concept during space transformation, it is still a linear method, i.e., it emphasizes rotations to best represent covariance structures in the data in a linear sense. In order to establish a nonlinear variant of PLS emphasizing the best variance explanation in a non-linear sense, kernel-based PLS (K-PLS) is typically employed [19]. Firstly, it applies the kernel trick (in the same way as done in support vector regression, see below) in order to perform a nonlinear transformation of the original data set into an S-dimensional feature space. Secondly, it performs the conventional partial least squares algorithm on the transformed kernel Gram matrices **K**1 (for the input space) and **K**2 (for the output space), with entries **K**1_{ij} = K(**x**_{i}, **x**_{j}) and **K**2_{ij} = K(**y**_{i}, **y**_{j}), where K is the multi-dimensional kernel and **x**_{i}, **x**_{j} are the input vectors of the i-th and j-th sample (so, the kernel function is applied to all sample pairs). The disadvantage of K-PLS becomes immediate when the number of samples *N* available for regression is large, because then the kernel Gram matrix from the input space explodes in size (*N* × *N*).
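The construction of the two Gram matrices, and the N × N memory blow-up mentioned above, can be illustrated with a small sketch (a Gaussian kernel is assumed here; the data and γ are arbitrary):

```python
import numpy as np

def rbf_gram(A: np.ndarray, B: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """Kernel Gram matrix with entries K_ij = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 6))   # N samples, 6 input variables
Y = rng.normal(size=(N, 1))   # N target values

K1 = rbf_gram(X, X)           # input-space Gram matrix, size N x N
K2 = rbf_gram(Y, Y)           # output-space Gram matrix, size N x N
# Conventional PLS would now be run on K1/K2 instead of X/Y; note that the
# memory for K1 grows quadratically with the number of samples N.
gram_megabytes = K1.nbytes / 1e6
```

For N = 200 the Gram matrix is still tiny, but doubling N quadruples its size, which is exactly the disadvantage noted above.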

### Our nonlinear PLS version

In this paper, we avoid the disadvantage of nonlinear K-PLS and apply a nonlinear version recently introduced in [20] and successfully applied for establishing calibration models from *Fourier transform near-infrared (FT-NIR)* spectra in melamine resin production (for cloud point prediction and supervision). Its basic idea lies in the partitioning of the latent variable space (after transformation with classical PLS) into several (*C*) local pieces represented by fuzzy rules, i.e., dividing it into different partial principal component directions along the target. Each fuzzy rule embeds a local linear hyper-plane for local trend estimation; this results in piece-wise local linear PLS predictors, which are combined through a weighted linear combination, where the weights are rule activation levels in the form of multi-dimensional Gaussian kernels. This assures smoothness of the whole regression surface, as the piece-wise linear predictors are ’kernel-smoothened’ across their transitions [21].
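The weighted linear combination of local linear predictors can be sketched as follows (a minimal illustration with normalized Gaussian rule activations; the function name and the toy values are ours, not from [20]):

```python
import numpy as np

def fuzzy_pls_predict(z, centers, inv_covs, coefs, intercepts):
    """Weighted combination of C local linear models in the latent space.

    z          : (p,) latent-variable vector of one sample (after PLS projection)
    centers    : (C, p) rule centers
    inv_covs   : (C, p, p) inverse covariance matrices of the Gaussian kernels
    coefs, intercepts : local linear hyper-plane parameters per rule
    Activation of rule i: exp(-0.5 * (z - c_i)^T Sigma_i^{-1} (z - c_i)),
    normalized over all rules; prediction = sum_i w_i * (l_i^T z + b_i).
    """
    d = z - centers                                   # (C, p) offsets to centers
    maha = np.einsum('cp,cpq,cq->c', d, inv_covs, d)  # squared Mahalanobis dists
    act = np.exp(-0.5 * maha)                         # Gaussian rule activations
    w = act / act.sum()                               # normalized weights
    local = coefs @ z + intercepts                    # (C,) local linear outputs
    return float(w @ local)

# Toy example: two rules with opposite local trends in a 1-D latent space.
centers = np.array([[-2.0], [2.0]])
inv_covs = np.array([[[1.0]], [[1.0]]])
coefs = np.array([[1.0], [-1.0]])
intercepts = np.zeros(2)
pred = fuzzy_pls_predict(np.array([-2.0]), centers, inv_covs, coefs, intercepts)
```

At a rule center the prediction is dominated by that rule's local model, while between centers the Gaussian weights blend the two linear trends smoothly.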

### Our fuzzy rules learning engine

The rule antecedent learning relies on the *Gen-Smart-EFS* approach [22], whose core functionality (without the concepts regarding rule merging and dynamic feature weighting for dimension reduction) is used to find the appropriate number of rules in single-pass evolution steps and also to estimate the kernel functions forming the antecedents of the rules. In this way, each rule antecedent is associated with a triplet (**c**, Σ^{−1}, **r**), with **c** its center, Σ^{−1} the inverse covariance matrix defining its multivariate ellipsoidal shape, and **r** its tolerance radius (statistical range of influence, also termed prediction interval [23]), which is automatically extracted from data and steers rule evolution versus rule update, see below. The rule learning proceeds as follows (with *C* = 0 rules initially):

- 1. Load a new sample **x**; if it is the first one, go to step 5 (there, ignoring the if-part).
- 2. Elicit the winning rule, i.e., the rule closest to the current sample, which is then denoted as *c*_{win}; for the distance calculation, the standard Mahalanobis distance is used [24] (as on the left-hand side in (2) below).
- 3. Check whether the following criterion is met (the *rule evolution criterion*):

  $$\begin{array}{@{}rcl@{}} min_{i=1,...,C} \sqrt{(\mathbf{x}-\mathbf{c_{i}})^{T} {\Sigma}^{-1} (\mathbf{x}-\mathbf{c_{i}})} > r_{i} \hspace{1.0cm}\\ r_{i} = vigi*p^{1/\sqrt{2}}*\frac{1.0}{\left( 1-1/(k_{i}+1)\right)^{m}} \end{array} $$(2)

  with *p* the dimensionality of the input feature space and *vigi* an a priori defined parameter steering the tradeoff between stability (update of an old cluster) and plasticity (evolution of a new cluster); *k*_{i} is the support of the *i*-th rule and *m* a tuning parameter set to 4 per default. *vigi* is the only sensitive parameter and is varied during the model evaluation phase, see “Evaluation scheme and parametrization”—for further explanation of this criterion, please refer to [22].
- 4. *If* (2) *is not met*, the center of the winning rule is updated by

  $$ \mathbf{c_{win}}(N+1) = \mathbf{c_{win}}(N) + \eta_{win}(\mathbf{x}-\mathbf{c_{win}}(N)) $$(3)

  and its inverse covariance matrix by (the index *win* omitted for readability):

  $$ {\Sigma}^{-1}(k+1) = \frac{{\Sigma}^{-1}(k)}{1-\alpha} -\frac{\alpha}{1-\alpha} \frac{({\Sigma}^{-1}(k)(\mathbf{x}-\mathbf{c}))({\Sigma}^{-1}(k)(\mathbf{x}-\mathbf{c}))^{T}}{1+\alpha ((\mathbf{x}-\mathbf{c})^{T}{\Sigma}^{-1}(k)(\mathbf{x}-\mathbf{c}))} $$(4)

  with *N* the number of samples seen so far and \(\alpha =\frac {1}{k_{win}+1}\), where *k*_{win} is the number of samples seen so far for which *c*_{win} has been the winning rule (cluster). The former stems from the idea in vector quantization [25] of minimizing the expected squared quantization error; the learning gain *η*_{win} is thereby set such that it fulfills the Robbins-Monro conditions. The latter is a recursive exact update without requiring the original covariance matrix, analytically derived with the usage of the Neumann series, see [26] for full details.
- 5. *If* (2) *is met*, a new rule is evolved as it covers a new region in the feature space (i.e., has sufficient *novelty content*): its center **c**_{C+1} is set to the coordinates of **x**, and its inverse covariance matrix \({\Sigma }_{C+1}^{-1}\) is initialized as a diagonal matrix with entries 1 divided by a small fraction, i.e., 1/100, of the variable ranges (= initial rule spreads); increase the number of rules, *C* = *C* + 1.
- 6. If not all samples in the calibration set have been processed yet, go to step 1; otherwise stop.
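The steps above can be condensed into a single-pass sketch. Several simplifications are assumptions of ours, not of [22]: the variable ranges are taken as 1 for the initial spreads, the learning gain is set to α = 1/(k_win + 1) as one admissible Robbins-Monro choice, and the evolution criterion is checked against the winner's radius only:

```python
import numpy as np

def evolve_rules(X, vigi=0.3, m=4):
    """Single-pass rule evolution sketch following steps 1-6 above.

    Each rule is a (center c, inverse covariance Sigma^-1, support k) triple.
    Sigma^-1 starts as a diagonal matrix with entries 1/(0.01 * range),
    with all variable ranges assumed to be 1 here.
    """
    p = X.shape[1]
    init_inv = np.eye(p) / 0.01          # initial rule spreads (assumed range 1)
    centers, inv_covs, support = [], [], []
    for x in X:
        if not centers:                  # step 1: first sample evolves rule 1
            centers.append(x.copy()); inv_covs.append(init_inv.copy()); support.append(1)
            continue
        d = x - np.array(centers)
        dist = np.array([np.sqrt(di @ S @ di) for di, S in zip(d, inv_covs)])
        win = int(dist.argmin())         # step 2: winning rule (Mahalanobis)
        k = support[win]
        r = vigi * p ** (1 / np.sqrt(2)) / (1 - 1 / (k + 1)) ** m   # Eq. (2)
        if dist[win] > r:                # step 5: sufficient novelty -> new rule
            centers.append(x.copy()); inv_covs.append(init_inv.copy()); support.append(1)
        else:                            # step 4: update winner, Eqs. (3)-(4)
            alpha = 1.0 / (k + 1)
            centers[win] = centers[win] + alpha * (x - centers[win])
            S, e = inv_covs[win], x - centers[win]
            v = S @ e
            inv_covs[win] = S / (1 - alpha) - (alpha / (1 - alpha)) * \
                np.outer(v, v) / (1 + alpha * (e @ S @ e))
            support[win] = k + 1
    return np.array(centers), support

# Demo: two well-separated clusters should yield at least two rules.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal([0.0, 0.0], 0.05, size=(10, 2)),
                    rng.normal([5.0, 5.0], 0.05, size=(10, 2))])
centers, support = evolve_rules(X_demo)
```

Every sample either spawns a new rule or increments the support of its winner, so the supports always sum to the number of processed samples.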

Finally, the consequent parameters *l*_{1},...,*l*_{C} of all *C* rules are estimated through fuzzily weighted least squares [27] in order to assure local learning, which has several advantages over global learning; see [21], Chapter 2, for a detailed analysis.

A block diagram summarizing the procedure can be seen in Fig. 2.

A special case arises when the inverse covariance matrix Σ^{−1} is used as a diagonal matrix (ignoring the covariances between the inputs). Then, axis-parallel fuzzy rules are triggered, and the steps in the itemization above reduce to the classical *flexible fuzzy inference systems (FLEXFIS)* approach [28].

### Support vector regression variation

The support vector machine concept builds its models on a subset of the training samples, the so-called *support vectors*. It employs the kernel trick [31] for performing a nonlinear transformation of the original data into a linearized space, where the conventional linear concept of separating hyper-planes with margin maximization can be applied again. There is a regression version, support vector regression (SVR) [32], with two variants called *𝜖*-SVR and *ν*-SVR. The general principle behind SVR is the following: a mapping **ϕ** maps the data **X** to an *m*-dimensional feature space, where a linear model is generated:

$$ f(\mathbf{x},\boldsymbol{\omega}) = \sum\limits_{j=1}^{m} \omega_{j} \phi_{j}(\mathbf{x}) + b $$

where *b* is the bias term (null for centered data), *ϕ*_{j} are the nonlinear transformations, and *ω*_{j} are the model coefficients. The quality of the estimation is then measured by an *𝜖*-insensitive loss function, meaning that any loss below *𝜖* is neglected.

In *𝜖*-SVR, the coefficients are the solution of the quadratic problem

$$\begin{array}{ll} \underset{\boldsymbol{\omega},b,\boldsymbol{\xi},\boldsymbol{\xi}^{*}}{\min} & \frac{1}{2}\|\boldsymbol{\omega}\|^{2} + C \sum\limits_{i=1}^{n} (\xi_{i} + \xi_{i}^{*}) \\ \text{s.t.} & y_{i} - f(\mathbf{x}_{i},\boldsymbol{\omega}) \leq \epsilon + \xi_{i}^{*} \\ & f(\mathbf{x}_{i},\boldsymbol{\omega}) - y_{i} \leq \epsilon + \xi_{i} \\ & \xi_{i}, \xi_{i}^{*} \geq 0, \quad i=1,...,n \end{array} $$

where *n* is the number of input samples, **x**_{i} the inputs, *y*_{i} the targets, *C* the cost parameter, and *ξ*_{i} and \(\xi _{i}^{*}\) are slack variables.

In *ν*-SVR, the coefficients are the solution of the quadratic problem

$$\begin{array}{ll} \underset{\boldsymbol{\omega},b,\boldsymbol{\xi},\boldsymbol{\xi}^{*},\epsilon}{\min} & \frac{1}{2}\|\boldsymbol{\omega}\|^{2} + C \left( \nu\epsilon + \frac{1}{n}\sum\limits_{i=1}^{n} (\xi_{i} + \xi_{i}^{*}) \right) \\ \text{s.t.} & y_{i} - f(\mathbf{x}_{i},\boldsymbol{\omega}) \leq \epsilon + \xi_{i}^{*} \\ & f(\mathbf{x}_{i},\boldsymbol{\omega}) - y_{i} \leq \epsilon + \xi_{i} \\ & \xi_{i}, \xi_{i}^{*} \geq 0, \quad \epsilon \geq 0 \end{array} $$

in which *𝜖* itself becomes an optimization variable.

The cost parameter *C* controls how costly it is to fall out of the *𝜖*-insensitivity zone, and the parameter *ν* is used to bound the noise. Indeed, *ν* is an upper bound on the fraction of errors and a lower bound on the fraction of support vectors. The figure in Online Resource 1 shows an example of the *𝜖*-insensitive area (in between the dotted lines) and the loss for the only two points outside; in that case, the total loss is the sum of the losses for those two points.
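The ε-insensitive loss itself is a one-liner; the sketch below shows that residuals inside the tube contribute zero loss (ε = 0.1 and the toy residuals are arbitrary choices of ours):

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """L(y, f) = max(0, |y - f| - eps): residuals within the eps-tube cost nothing."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

# First residual (0.05) lies inside the tube, second (0.3) exceeds it by 0.2.
losses = eps_insensitive_loss(np.array([1.0, 1.0]), np.array([0.95, 0.7]), eps=0.1)
total_loss = losses.sum()
```

This mirrors the situation in the figure: only the points outside the tube contribute to the total loss.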

As kernel function, the Gaussian kernel \(K(\mathbf{x}_{i},\mathbf{x}_{j}) = \exp(-\gamma \|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2})\) is employed, where *γ* determines the spread of the Gaussian function. The most interesting parameters to be tuned are *C* and *γ* in both SVR approaches.

A drawback of SVR is the computational cost for high-dimensional data sets. Therefore, we propose a variant for both *𝜖*-SVR and *ν*-SVR. It consists of adding a preceding step, in which the goal is to reduce the dimensionality in advance by compressing the data by means of PLS. We denote these variants by *𝜖*-PLSSVR and *ν*-PLSSVR. A new parameter therefore arises: the number of latent variables to be used.

### Intervened nonlinear modeling and evaluation scheme (for all methods)

Given the *N* calibration samples available (drawn by the spectroscopic equipment as described in “Data acquisition”), our modeling procedure, together with the full evaluation, performs the following steps:

- 1. Calculate the latent variables *lat*_{1},...,*lat*_{all}, with *all* the number of wavelengths contained in the spectra, ordered according to their importance. Notice that our approach always includes a preceding data compression step by means of PLS; therefore, the latent variables from PLS are always needed.
- 2. Define the parameter grid: parameter selection for PLS, FLEXFIS-PLS, and PLSSVR is based on a grid search including a cross-validation procedure. There are two parameter selection approaches: the classical CV selection based on the minimum CVRMSE (= cross-validated root mean squared error) and a robust model selection based on bagged CV, see “Bagged selection for increasing robustness of non-linear modelling”. The parameter grids are different for each of the algorithms:
  - For PLS, use *dim* = {1,...,*a*} for the number of latent variables to be included in the calibration model, yielding a vector of grid points **g**_{i} = *dim*_{i}.
  - For ridge regression, use different regularization parameters *λ*, with grid points **g**_{i} = *λ*_{i}.
  - For generalized linear models with elastic net (GLMNet), use the coefficient *α*, which controls the convex combination between Lasso and ridge regression, and the regularization parameter *λ*. This results in a matrix of grid points **G**_{ij} = (*α*_{i}, *λ*_{j}). See “State-of-art methods used for comparison” for further details on GLMNet.
  - For fuzzy systems, use the number of latent variables *dim*_{i} and the vigilance parameter *vigi* inside the interval (0,1), which steers the rule evolution criterion in (2) and thus controls the level of non-linearity applied [28]. This results in a matrix of grid points **G**_{ij} = (*dim*_{i}, *vigi*_{j}).
  - For the SVR approaches, the number of latent variables is fixed, taken from the model selection performed for PLS. The parameters to be tuned are the cost *C* and the width of the Gaussian kernel function *γ*. The matrix of grid points **G**_{ij} = (*C*_{i}, *γ*_{j}) is the default grid suggested by the authors of the guide for using Lib-SVM [33], the most widely used library for SVM.

- 3. For all grid points, perform 10-fold cross-validation [34], in both the classical and the bagged versions, and store the cross-validation errors **CVerr**_{i} respectively **CVerr**_{ij}. See “Bagged selection for increasing robustness of non-linear modelling” for further details on the bagged version.
- 4. Perform model selection. Complexity is measured differently in each of the considered algorithms, and in some cases there is a relationship between the complexity and the parameters in the grids. For instance, it increases with the number of latent variables in PLS. For fuzzy systems learning coupled with PLS, it increases directly with dimensionality and inversely with vigilance, because the lower the vigilance, the higher the number of rules (as (2) is more often fulfilled). For SVR, the complexity can be measured in terms of the number of support vectors: the higher the number, the higher the complexity. There is a direct relationship with the cost *C*, because a high cost means a high penalization for non-separable points, so more support vectors would be stored in order to diminish the number of non-separable points. There is also a relation between the complexity and *γ*, as a higher value induces a lower kernel width, i.e., steeper surfaces and thus higher nonlinearity. Our model selection procedure then selects the parameters corresponding to the grid point whose model has the lowest CVRMSE after being penalized according to its complexity (**CVerr**(*pen*)):

  $$ \mathbf{CVerr}_{ij}(pen)=\mathbf{CVerr}_{ij}\cdot e^{\alpha\:\:param1_{i} + \beta\:\:(1-param2_{j})} $$(9)

  with *param*1 related to dimensionality in the case of fuzzy modeling and to the cost in the case of SVR, and *param*2 to the vigilance in the case of fuzzy modeling and to *γ* in the case of SVR. *α* and *β* are normalization factors, set to 0.05 and 0.5, respectively, in our case.
- 5. Perform a final model training on the whole training set with the obtained optimum parameters \((param1_{i}^{*},param2_{j}^{*})\) and test it on a separate validation set (if available).
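The complexity-penalized selection of step 4 can be sketched directly from Eq. (9) (the function name and the toy grids are ours; α and β keep the values 0.05 and 0.5):

```python
import numpy as np

def select_params(cv_err, param1, param2, alpha=0.05, beta=0.5):
    """Pick the grid point minimizing the penalized CV error of Eq. (9):
    CVerr_ij(pen) = CVerr_ij * exp(alpha * param1_i + beta * (1 - param2_j)).
    param1/param2 are assumed to be scaled so the exponent stays moderate.
    """
    P1 = np.asarray(param1, dtype=float)[:, None]
    P2 = np.asarray(param2, dtype=float)[None, :]
    pen = np.asarray(cv_err) * np.exp(alpha * P1 + beta * (1.0 - P2))
    i, j = np.unravel_index(pen.argmin(), pen.shape)
    return param1[i], param2[j]

# With equal raw CV errors, the penalty alone decides: it favors the smallest
# param1 and the largest param2, per the signs in the exponent of Eq. (9).
cv_err = np.full((2, 2), 0.5)
p1_star, p2_star = select_params(cv_err, param1=[1, 2], param2=[0.3, 0.9])
```

With unequal errors, the selection trades raw CVRMSE against the exponential complexity penalty, exactly as described in step 4.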

A block diagram summarizing the procedure can be seen in Fig. 3.

### Bagged selection for increasing robustness of non-linear modelling

Bagging [35] stands for bootstrap aggregating. The basic idea behind it is the creation of *M* bags of *N*^{′} training samples each by means of sampling with replacement (bootstrap sampling [36]). The bagged algorithm is performed on all *M* bags, and the *M* outputs are aggregated according to a certain aggregation function, depending on the algorithm. Theoretically, the diversity brought by the bootstrap sampling leads to *M* models that are not necessarily good individually, but that yield a good final aggregated model. Notice that the usual size *N*^{′} of the bags coincides with the number of available samples *N*. In that concrete case, the expected percentage of unique samples in each bag is 63.2 % [37]. Our bagged model selection proceeds as follows:

- 1. Create *M* bags with *N* samples in each bag.
- 2. For the *k*-th bag, perform the classical cross-validation for all parameter combinations in the grid, depending on the regression approach under consideration. Store the errors \(\mathbf {CVerr}_{ij}^{k}\) for the parameters (*param*1_{i}, *param*2_{j}).
- 3. Aggregate the cross-validation errors over all *k* bags, using the average as aggregation function.
- 4. Penalize the errors according to the complexity, using (9).
- 5. Select the optimum parameters \((param1_{i}^{*},param2_{j}^{*})\) for which the penalized error is minimum.

A block diagram summarizing the procedure can be seen in Fig. 4.

Please note that taking the average as aggregation function in step 3 is the standard choice, used in many other bagged modeling approaches such as, for instance, random forests [38]. It is well known that it produces more robust predictions, especially in the case of a low number of samples, due to its ability to explore the sample space well through the bootstrapped bags, as analyzed, e.g., in [10]. A low number of calibration samples is expected in our application, as the real targets have to be elicited manually by the experts, and such a manual analysis requires an effort of several hours for only a couple of samples. In this sense, the usage of bagging for our application is well motivated.
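The bagged selection steps 1–3 can be sketched as follows (the per-bag error grids are random stand-ins for the actual cross-validation results; N = 31 matches the beer mix calibration set size, while M = 50 is an arbitrary choice of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 31, 50                          # samples per bag / number of bags

# Step 1: M bootstrap bags of N indices each (sampling with replacement).
bags = [rng.integers(0, N, size=N) for _ in range(M)]

# With bag size N, the expected fraction of unique samples per bag
# is 1 - 1/e, i.e., roughly the 63.2 % quoted above.
unique_frac = np.mean([len(np.unique(b)) / N for b in bags])

# Steps 2-3: per-bag CV error grids over (param1_i, param2_j) -- here random
# stand-ins on a 4 x 4 grid -- aggregated by averaging across the bags.
grids = np.stack([rng.random((4, 4)) + 0.5 for _ in bags])   # shape (M, 4, 4)
mean_grid = grids.mean(axis=0)

# Steps 4-5 would penalize mean_grid via Eq. (9) and take the argmin.
best = np.unravel_index(mean_grid.argmin(), mean_grid.shape)
```

Averaging the error grids across bags smooths out the variance of any single bootstrap sample, which is what makes the selection robust for small calibration sets.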

## Case study configuration for evaluation of calibration methods

### Data sets characteristics and pre-treatment

Three data sets are employed: two correspond to wort production, recorded in 2014 and 2015 with different measurement equipment (*μ*m cells), and the third one to beer mix beverage production, the latter including a separate validation set recorded several weeks later. Due to the differences in the composition and in the final product, the parameters that are relevant for the product quality, and that will therefore be monitored, are not the same. The concrete parameters and their acronyms (coming from the German language) are:

For wort: (i) Bitterness (EBU), (ii) final attenuation (FA), and (iii) free amino nitrogen (FAN).

For beer mix beverages: (i) Bitterness, (ii) foam stability (S), (iii) citric acid (CA), and (iv) total acid (TA).

The relevant spectral regions for each target (listed in the table below) have been determined by variable selection techniques, e.g., uninformative variable elimination [39, 40], forward selection [41], and genetic algorithms [18], and have been found to be optimal for those subset sizes.

Data sets characteristics used for calibration and validation

| Dataset | Target | Unit | Relevant spectral regions (cm^{−1}) | # of samples calib/valid |
|---|---|---|---|---|
| Wort 2014 | Bitterness | EBU | 1346-1564 | 47/- |
| Wort 2014 | Final attenuation | % | 976-1363 | 47/- |
| Wort 2014 | Free amino nitrogen | mg/l | 1012-1475, 2517-2980 | 47/- |
| Wort 2015 | Bitterness | EBU | 1134-1499 | 37/- |
| Wort 2015 | Final attenuation | % | 1070-1421 | 37/- |
| Wort 2015 | Free amino nitrogen | mg/l | 1012-1437 | 37/- |
| Beer mix beverages | Bitterness | EBU | 1138-1443 | 31/11 |
| Beer mix beverages | Foam stability | s | 1207-1437, 2748-2960 | 31/11 |
| Beer mix beverages | Citric acid | g/l | 1148-1495 | 31/11 |
| Beer mix beverages | Total acids | g/l | 1051-1128, 1168-1495, 2748-2931 | 31/11 |

Given this very low number of samples, bagging, which explores the sample space well, can be expected to provide more robust models (model selection) than the classical CV.

Besides, several well-known preprocessing methods [42] have been examined in order to find a preprocessing strategy that behaves well for all targets. For operational reasons, as the spectral data are the same for all targets, a single strategy for all targets of each data set is required. The one chosen for all data sets is the two-step strategy consisting of first applying standard normal variate [43] and then mean centering.
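The chosen two-step preprocessing can be sketched in a few lines (SNV normalizes each spectrum row-wise, mean centering then removes the mean spectrum column-wise; the data here are synthetic):

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) individually."""
    mu = spectra.mean(axis=1, keepdims=True)
    sd = spectra.std(axis=1, keepdims=True)
    return (spectra - mu) / sd

def mean_center(spectra):
    """Subtract the mean spectrum (column-wise) over the calibration set."""
    return spectra - spectra.mean(axis=0, keepdims=True)

# Synthetic stand-in: 31 spectra x 200 wavelengths with offset and scale effects.
X = np.random.default_rng(0).normal(size=(31, 200)) * 3.0 + 10.0
X_pre = mean_center(snv(X))
```

Note the order matters: SNV removes per-sample offset and scaling effects first, and mean centering afterwards removes the common baseline across the calibration set.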

### State-of-art methods used for comparison

- Partial least squares (PLS): It is a linear method used to find the fundamental relations between two matrices (input **X** and output **Y**), i.e., a latent variable approach to model the covariance structures in these two spaces. A PLS model tries to find the multidimensional direction in the input (**X**) space that explains the maximum multidimensional variance direction in the output (**Y**) space. It emphasizes a rotation of the input space in order to best explain the variance of the target by means of linear relations/mappings. We have used it for comparison purposes because it is the most widely used state-of-the-art method in chemometrics, and especially in automatic beer parameter analytics. For a compact summary of its principal concepts, please refer to the beginning of “Nonlinear PLS with the usage of flexible fuzzy inference systems”.
- Generalized linear models with elastic net (GLMNet):
The Lasso method [44] and ridge regression [43] belong to the family of shrinkage methods; they can be seen as regression algorithms including an *ℓ*_{1} and an *ℓ*_{2} penalty, respectively. The *elastic net* [45] includes a penalty based on a combination of both *ℓ*_{1} and *ℓ*_{2} penalties, looking for some elasticity in the regularization. Generalized linear models [43] are a generalization of ordinary linear regression that provides flexibility in the sense that the distribution of the errors is not necessarily assumed to be normal, as happens in ordinary linear regression.

The combination of the elastic net with generalized linear models is a regression algorithm based on generalized least squares that uses cyclical coordinate descent [46] in a path-wise fashion [47] in order to select the optimum elasticity in the regularization via the elastic net.

- Ridge regression:
- Despite it is a particular case of GLMNet, when the lasso part of the elastic net is ignored, ridge regression deserves its own separate spot. In MLR, we determine the best regression vector \(\hat {\mathbf {b}}\), according to a minimum least squares criterion, when trying to solve the regression problem
**y**=**X**⋅**b**with**X**the regression matrix. Then, it is well known that the regression vector isWhen handling variables that are highly correlated, problems of singularities arise when it comes to calculating the inverse of$$\hat{\mathbf{b}} = (\mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X} \mathbf{y} $$**X**^{T}**X**. A way to deal with this problem is*regularization*. It consists on adding a regularization term in the least squares minimization problem. Ridge regression uses*α***I****X**as regularization term, where*λ*> 0 is a parameter to be tuned. Then, the regression vector becomes$$\hat{\mathbf{b}} = (\mathbf{X}^{T} \mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X} \mathbf{y} $$

The regularization parameter *λ* will be tuned over a grid (see below).
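As a minimal numerical sketch of the closed-form ridge estimator and the grid tuning just described (the data, split, and variable names are illustrative, not taken from the paper's software):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: b = (X^T X + lam*I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
X[:, 1] = X[:, 0] + 1e-6 * rng.normal(size=40)   # two highly correlated columns
y = X @ np.arange(10.0) + 0.01 * rng.normal(size=40)

# hold out part of the data, then tune lambda over a grid on held-out error
Xtr, Xte, ytr, yte = X[:30], X[30:], y[:30], y[30:]
grid = np.round(np.arange(0.01, 0.96, 0.05), 2)   # 0.01, 0.06, ..., 0.91
errors = [float(np.sqrt(np.mean((yte - Xte @ ridge_fit(Xtr, ytr, lam)) ** 2)))
          for lam in grid]
best_lam = float(grid[int(np.argmin(errors))])
```

Note that `np.linalg.solve` is used instead of an explicit matrix inverse; the added *λ***I** keeps the system well-conditioned even with the near-collinear columns.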

### Evaluation scheme and parametrization

The evaluation scheme is performed differently for the unfermented beer and the beer mix beverages, due to the characteristics of the data. For unfermented beer, we have performed the classical cross-validation model selection, as stated in “Intervened nonlinear modeling and evaluation scheme (for all methods)”, for both data sets separately, so that we can compare the performance of the regression methods. As a hard benchmark, we used the final models trained on the 2014 data for validation on the 2015 data (recorded with different measurement devices), just to check how far our models are able to reliably extrapolate into the future. For beer mix beverages, the availability of validation data also offers the possibility of comparing the classical and bagged cross-validation model selections, in order to see how close those model selection approaches come to the best possible parameter combination for the external validation set (which is not accessible during CV selection).

- PLS: The number of latent variables (coded as *P*1 in the tables in “Results”) varies from 1 to 15.

- Ridge: The regularization parameter *λ* (coded as *P*2) goes from 0.01 to 0.95 in steps of 0.05. This grid has been successfully used in previous studies [48], in which the data were obtained under similar circumstances in similar real-world problems.

- GLMNet: The regularization parameter *λ* (coded as *P*1) has been set from 0.01 to 0.09 in steps of 0.01, in order to leave the default value suggested by the proposers of the method, 0.05, in the middle of the grid. The parameter *α* (coded as *P*2), responsible for the elasticity in the elastic net, takes values from 0.1 to 1.0 in steps of 0.1. Notice that *α* = 1 is equivalent to using pure lasso, and *α* = 0 would be pure ridge (excluded here because it has its own spot).

- FLEXFISPLS: The dimensionality (coded as *P*1) varies in the same way as the number of latent variables in PLS, and the vigilance (coded as *P*2) takes values between 0.1 and 0.9, in steps of 0.1.

- *𝜖*-PLSSVR, *ν*-PLSSVR: As mentioned in “Support vector regression variation”, the number of latent variables is not tuned, but fixed to the selection made for PLS. Besides, the cost and spread parameters (coded as *P*1 and *P*2, respectively) take the values suggested by the LibSVM library developers. Thus, *C* takes values in {2^{−5}, 2^{−3}, … , 2^{15}}, and *γ* in {2^{−15}, 2^{−13}, … , 2^{3}}.
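The grids above can be laid out programmatically for a grid search; this is only an illustrative sketch (the dictionary keys are invented names, not identifiers from the paper's software):

```python
import numpy as np

# Candidate grids per method, following the ranges stated in the text.
grids = {
    "PLS_n_latent": list(range(1, 16)),               # P1: 1..15
    "Ridge_lambda": np.round(np.arange(0.01, 0.96, 0.05), 2),  # P2: from 0.01, step 0.05
    "GLMNet_lambda": np.linspace(0.01, 0.09, 9),      # P1: 0.01..0.09, step 0.01
    "GLMNet_alpha": np.linspace(0.1, 1.0, 10),        # P2: 0.1..1.0, step 0.1
    "FLEXFIS_vigilance": np.linspace(0.1, 0.9, 9),    # P2: 0.1..0.9, step 0.1
    "SVR_C": 2.0 ** np.arange(-5, 16, 2),             # {2^-5, 2^-3, ..., 2^15}
    "SVR_gamma": 2.0 ** np.arange(-15, 4, 2),         # {2^-15, 2^-13, ..., 2^3}
}
```

Powers of two for *C* and *γ* are the standard coarse grid recommended by the LibSVM authors; `np.linspace` avoids the floating-point endpoint pitfalls of `np.arange` for the decimal grids.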

## Results

The results comprise two evaluation stages:

- 1.
A classical and enhanced (employing bagging) cross-validation procedure on each of the training data sets for wort 2014, wort 2015, and beer mix beverages data.

- 2.
Validation on a separate available test data set in case of beer mix beverages, as well as validation of the final models trained on wort 2014 data on the wort 2015 data (hard benchmark).

### Cross-validation performance

When it comes to unfermented beer, the most relevant characteristics to be monitored are bitterness, final attenuation, and free amino nitrogen. For bitterness, Fig. 7c shows, for each calibration method, the CVRMSE and the average *R*^{2} of the predictions in all folds (CVR2); Fig. 7d shows the observed vs. predicted plot for the method achieving the lowest CVRMSE (highlighted in bold font in Fig. 7c). Analogously, the results for FA and FAN targets are shown in Figures Online Resource 2 and Online Resource 3, respectively.

Notice that for final attenuation (see Figure Online Resource 2), the performance of the nonlinear methods is indeed quite similar to that of PLS, which is linear and thus theoretically less prone to over-fitting. Besides, it is good to see that the prediction ability for samples close to the targets’ extreme values is high, despite the lack of balance in the data. For FAN (Figure Online Resource 3), both SVR approaches show around 15 % lower CVRMSE than the rest. Just by visual inspection, comparing Online Resource 3 (a) and (b), we can see that the SVR approach performs well at both the upper and lower boundaries, whereas GLMNet does not; this explains the difference in the CVRMSE. In case of bitterness, the most important parameter for wort supervision (as it is responsible for the final taste for customers), the improvement achieved by SVR compared to the best linear method, GLMNet, is about 36 %, finally achieving the company’s goal to stay within the error range limit of 3 (an error of 2.47 is achieved), which is not the case for the linear methods (an error of 3.85 is achieved).

For the 2015 wort data (recorded with a different measurement device), the structure of the results is similar for both the linear and nonlinear methods. Again, there is a clear outperformance of the linear methods by the nonlinear ones in case of bitterness (see Fig. 8) and FAN (see Online Resource 5), but this time also for final attenuation (Online Resource 4). The overall conclusions are a good extrapolation behavior and little risk of over-fitting for both SVR approaches, and an even lower one for FLEXFIS, which is noticeable in that the dimensionality is lower than in PLS and the vigilance is fairly high, at least 0.3, usually an indicator of a low degree of nonlinearity. A vigilance below 0.3 would indicate a very high degree of nonlinearity in the model, and thus a high risk of over-fitting.

### Performance on separate validation data

Regarding the data corresponding to the beer mix beverages, the bagging approach has also been applied for model selection as an alternative to standard cross validation. Those results are not shown in the previous section, because the final error used in the bagging approach serves a different purpose: it is an average of the CVRMSEs coming from the different bags, so comparing it with the usual CVRMSE makes no sense. The aim is not to use that error measure as an estimation of the future performance on unseen data, but to take advantage of the robustness provided by the use of diverse bags for a better, more robust model selection, in order to obtain a lower root mean square error of prediction on separate test data (termed RMSEP) and to reduce the over-fitting effect.
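The bagged selection idea can be sketched as follows: each candidate parameter is scored by its average CV error over several bootstrap bags, and the candidate with the lowest bagged score is chosen. Closed-form ridge on synthetic data stands in for the paper's learners; the bag count and candidate values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=50)

def ridge_cvrmse(lam, X, y, folds=5):
    """Plain k-fold CV RMSE for closed-form ridge regression."""
    idx = np.arange(len(y)); se = []
    for f in range(folds):
        te = idx[f::folds]; tr = np.setdiff1d(idx, te)
        b = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(X.shape[1]),
                            X[tr].T @ y[tr])
        se.append(np.mean((y[te] - X[te] @ b) ** 2))
    return float(np.sqrt(np.mean(se)))

candidates = [0.05, 0.5, 5.0]
n_bags = 10
bagged_scores = []
for lam in candidates:
    scores = []
    for _ in range(n_bags):
        bag = rng.integers(0, len(y), len(y))   # bootstrap resample with replacement
        scores.append(ridge_cvrmse(lam, X[bag], y[bag]))
    bagged_scores.append(float(np.mean(scores)))
best_lam = candidates[int(np.argmin(bagged_scores))]
```

Averaging over bags smooths out the fold-assignment luck of a single CV run, which is exactly the robustness the bagged selection exploits; the bagged score itself is not comparable to a single CVRMSE.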

The corresponding tables report the RMSEP and *R*^{2} for our selections (columns 2–5) and the best possible selections (columns 6–9).

When it comes to bitterness (Fig. 10), we can see that, for both our model selection and the best possible selection, there is a very good prediction ability in the important range [5, 8]. Besides, the errors are systematically below 2, which is a requirement from the company. With one single exception (*𝜖*-SVR), the use of the bagged model selection improves upon the classical one, leading the selection to models with lower complexity (lower number of LVs in PLS, lower number of rules in FLEXFIS, and lower number of support vectors in SVR), while being closer to the best possible models in terms of error performance. In fact, in most cases, the RMSEP of the model selected by grid search (RMSEPGS) can be significantly reduced, especially in case of fuzzy modeling (using FLEXFIS), down to 1.55, clearly outperforming all state-of-the-art methods. In the case of the SVR approaches, there is room for improving the selection process; the most promising action would be to estimate/compute the adequate number of LVs, instead of using the ones obtained for PLS. One more thing should be noticed for *ν*-SVR: the improvement that bagging brings is not clearly visible in the RMSEP, but in the *R*^{2}. The reason for this is the presence of some isolated high error peaks at the boundaries of the range, which penalize the error, but not the correlation.

For foam stability (see Online Resource 9), similar observations as in the case of bitterness can be made, whereas the improvement achieved by bagging is even more pronounced for the nonlinear methods (e.g., a reduction of more than 50 % in error in case of *𝜖*-SVR, down to an error of around 26). In this sense, this variant in combination with bagging is the most feasible option. Compared to the CV results, the errors are indeed significantly worse, but with the help of bagging they still lie within the company’s upper limit of 30 (which is not achievable with classical CV selection). The best possible selection (right half of the table in Online Resource 9) does not really further improve the error on separate validation data. Hence, the bagged selection already achieves the optimum performance during CV, which is the ideal situation, as the separate test data set is generally not accessible during the training phase.

For citric acid and total acids (Online Resources 10 and 11, respectively), bagging helps only in the nonlinear methods. The reason for that is clear: the higher the risk of over-fitting, the bigger the advantage of bagged approaches, and here the non-bagged variants already perform quite well (close to the CV results) and clearly within the upper error limit of 0.3. It is known by the experts that the most nonlinear target is bitterness; consistently, for the acids, the linear methods seem to be the best ones, with GLMNet the preferred among them. Besides, fuzzy modeling with FLEXFIS behaves better than all linear models, although the model selection cannot detect this. The reason is the flexibility of FLEXFIS to adapt to any degree of nonlinearity, even a light nonlinearity as in the case of both citric acid and total acids. In the case of citric acid, the classical selection for FLEXFIS performs poorly, leading to the lowest possible vigilance and, as a consequence, a high number of rules. Bagging selects the same dimensionality, but with a higher vigilance (= a lower number of rules), which is closer to the best possible selection and expected to be more robust (as less prone to over-fitting) for prediction on future data.

Finally, the validation of the 2014 wort models with the 2015 wort data, which we checked incidentally (it was not a requirement by the company), did not yield any reasonable results for bitterness and FAN: the errors rose to significantly above 4 in case of bitterness and to above 14 in case of FAN (both significantly above the requested upper limits), even when taking into account the best possible parameter/model selection. However, for FA, the errors stayed in the same range as achieved through cross-validation, which is a remarkable result given that the two data sets have been recorded with different measurement devices.

## Conclusion and outlook

This paper proposes two nonlinear modeling techniques for calibrating models to predict important parameters during beer production. The supervision of these parameters is necessary in order to guarantee a high level of beer quality, i.e., to assure that a beer tastes the same way as it used to, within small boundaries of variation, and thus satisfies the customers’ expectations. Current state-of-the-art chemometric methods based on spectroscopic measurements do not meet the minimal prediction error requirements provided by the company for all the important parameters (especially not for bitterness and final attenuation). This can be resolved with two nonlinear modeling techniques: (1) the first one relying on a nonlinear version of PLS with the usage of Takagi-Sugeno fuzzy systems for obtaining piecewise linear predictors and (2) the second one (with even higher performance) relying on a variation of support vector regression. In particular, an error reduction of about 35 % to 45 % in case of bitterness and of about 50 % in case of final attenuation could be achieved. Furthermore, in case of beer mix beverages, the new, robust model selection scheme based on bagged ensembles led to significant error reductions on separate validation data for foam stability and bitterness: especially in case of foam stability, the error could be reduced from 64 down to below the upper allowed limit of 30, which is remarkable. In case of the acids for mix beverages, no significant improvement could be made, as the linear models already performed very well on them.

Future work includes the usage of enhanced genetic algorithms for wavelength selection in the context of differential evolution and co-evolution (as successfully applied before on FT-NIR spectra data from another application [18] by the main authors of this paper), as well as the application of more advanced ensemble methods, such as boosting or random forests, for a better stability of the prediction errors on separate validation data. Additionally, further important parameters of different types of alcoholic and non-alcoholic beverages will be analyzed with our nonlinear modeling techniques.

## Acknowledgments

Financial support was provided by (i) the Austrian research funding association (FFG) under the scope of the COMET programme within the research project Industrial Methods for Process Analytical Chemistry - From Measurement Technologies to Information Systems (imPACts) (contract #843546), (ii) the Basque Government through the ELKARTEK and BERC 2014-2017 programs, and (iii) the Spanish Ministry of Economy and Competitiveness MINECO: BCAM Severo Ochoa accreditation SEV-2013-032. This publication reflects only the authors’ views.

### Compliance with Ethical Standards

This paper does not contain any studies with human participants or animals performed by any of the authors.

### Conflict of interests

The authors declare that they have no conflict of interest.