1 Introduction

Biotechnological processes require real-time monitoring of, ideally, many process variables using hardware sensors. However, several critical information-bearing variables, such as biomass and product concentration, often cannot be monitored in real time, or only with expensive sensor technology. In such cases, soft(ware) sensors are commonly used. In this sensor concept, existing hardware sensors are combined in mathematical models based on statistical correlations and process knowledge to predict the required target variables [1, 2]. This technology has already been applied widely in biotechnological processes. For example, the biomass concentration could be predicted in a process involving Pichia pastoris; to achieve this, a soft sensor model was developed based on various online process parameters, such as the CO2 and O2 concentrations in the exhaust gas, and on actuators, such as the addition of pH correction agents [3]. Another example is the prediction of biomass and product concentration in industrial insulin production with Escherichia coli. Galvanauskas et al. [4] showed that the target variables could be successfully predicted using neural networks and Monod kinetics. The model inputs included the temperature of the reactor heating jacket, the addition of pH correction agents, and variables derived from gas concentrations, such as the oxygen uptake rate (OUR) and the carbon dioxide evolution rate (CER). Soft sensor models for predicting biomass, substrate concentration, and spore yield were also successfully developed for a bioprocess with Clostridium butyricum; the input variables were fermentation time, capacitance, conductivity, pH, initial total sugar concentration, and ammonium and calcium concentrations [5]. Further examples of the successful application of soft sensors in bioprocesses can be found in several review articles [6,7,8,9].
However, soft sensors require regular recalibration to prevent a loss of prediction performance over time due to changing process characteristics [10]. In biotechnological processes, such changes can include varying raw materials, modified process strategies, or biological variability.

In many applications, soft sensors are still recalibrated manually, which is a very time-consuming and expensive process [10]. The automatic recalibration of soft sensors is an alternative, has been the subject of several studies, and is referred to as just-in-time modeling [11,12,13,14,15]. The approach is usually similar: First, a recalibration time is either defined in advance or determined depending on process conditions. Subsequently, historical data sets are selected to recalibrate the soft sensor model. This is necessary because the reference values of the current process would only become available with a significant delay. All these steps can be carried out automatically, with an optional manual specification of initial conditions. The selection can be based on the chronological order of the data sets (temporal similarity criterion) [15] as well as on spatial similarity (distance-based similarity criterion) [11,12,13,14]. Finally, the soft sensor model is recalibrated based on the selected data sets and defined as valid until the subsequent recalibration. When selecting historical data sets, selection based on distance-based similarity criteria is particularly promising for bioprocesses: reliable predictions can be maintained despite sudden changes in process characteristics, provided similar characteristics have already been logged in the past. If the data sets were selected solely based on their chronological occurrence, the prediction performance would be substantially reduced in the case of a raw material change, for example. In the distance-based selection of data sets, the temporal trends of the online variables are matched with the current process. This is often done in combination with multiway principal component analysis (MPCA) and batch-wise unfolding of the online measured variables [16,17,18]. The advantage of MPCA is that low-information noise in the data sets can additionally be removed with the help of an ordinary PCA [19].
For this, the historical process data array only has to be unfolded in advance, which gave the method its name. Similar data sets for the current process can then be detected in this principal component space based on similarity criteria. These criteria can rely on different distance measures, e.g., the Euclidean distance or the Mahalanobis distance [20,21,22].

Nevertheless, the problem with this approach is that particularly temporally similar data sets are selected. This means calibration data sets with equally long lag phases are selected if the current process has a similarly long lag phase. However, these calibration data sets do not necessarily guarantee the best prediction performance. It is conceivable that the curve patterns of the regions between different process-specific landmarks are significantly more relevant than the absolute length of individual sections. In this example, a process with a shorter lag phase but a similarly steep exponential phase could provide a better prediction performance as a calibration data set. To test this hypothesis, the process and the historical data sets must be temporally aligned, which can be done via data synchronization methods.

In this study, two synchronization approaches, dynamic time warping (DTW) and curve registration (CR), suitable for synchronizing all process-specific landmarks, were compared, and the influence of data synchronization on the prediction performance of an automatically recalibrated soft sensor approach was investigated. The two synchronization methods were modified for soft sensor recalibration and applied to the process variables of two different bioprocesses. Firstly, the automatic recalibration concept of the soft sensors selects similar data sets using an MPCA and distance-based similarity criterion. Next, the soft sensor is recalibrated using partial least squares regression (PLSR) with an additional model transition that contains a forgetting factor. A linear model was used as the basic soft sensor model, with all process variables available online and additionally calculated variables (CER and OUR) as input. To evaluate the influence of data synchronization on prediction performance, the normalized root mean squared error (NRMSE) was compared with and without the synchronization approach. The evaluation was performed for the biomass prediction of the P. pastoris process and the protein prediction of a Bacillus subtilis process.

2 Materials and methods

2.1 P. pastoris process—cultivation and hardware

2.1.1 Strain, preculture conditions, and main culture

In a preculture, P. pastoris (DSMZ 70382) was cultured in three 150-mL shake flasks, each containing 50 mL FM22 medium with glycerol, for the main process. The flasks were cultured for 70 h at 150 min−1 and 30 °C. The preculture was then pooled and used as inoculum for the following main culture. The main culture had a volume of 15 L and FM22 with glycerol as a medium. In the initial batch phase for biomass generation, all glycerol was consumed. In the following fed-batch phase, a substrate change to methanol occurred. The methanol addition was also supplemented with 12 mL L−1 PTM4 solution. Methanol concentration (4.5 g L−1), pH (5), pressure (500 mbar), temperature (30 °C), and dissolved oxygen (40%) were controlled. The FM22 medium contained the following [23]: (NH4)2SO4, 5 g L−1; CaSO4·2H2O, 1 g L−1; K2SO4, 14.3 g L−1; KH2PO4, 42.9 g L−1; MgSO4·7H2O, 11.7 g L−1; and glycerol, 40 g L−1. To the FM22 medium, an additional 2 mL L−1 of the PTM4 solution was added: CuSO4·5H2O, 2 g L−1; KI, 0.08 g L−1; MnSO4·H2O, 3 g L−1; Na2MoO4·2H2O, 0.2 g L−1; H3BO3, 0.02 g L−1; CaSO4·2H2O, 0.5 g L−1; CoCl2, 0.5 g L−1; ZnCl2, 7 g L−1; FeSO4·H2O, 22 g L−1; biotin, 0.2 g L−1; and conc. H2SO4, 1.0 mL.

2.1.2 Bioreactor, sensor systems, and reference measurements

The main culture was performed in a stirred tank reactor with a total volume of 42 L (Biostat® Cplus reactor; Sartorius AG). During the processes, the concentration of O2 and CO2 in the exhaust gas (BlueInOne sensor; BlueSens gas sensors GmbH) and the methanol concentration in the reactor (Alcosens sensor; Heinrich Frings GmbH & Co. KG) were monitored in real time, in addition to the standard probes (pH, pressure, and dissolved oxygen). The target variable for soft sensor prediction was biomass concentration in dry cell weight (DCW). As reference values for the soft sensor predictions, samples were taken and analyzed every 2–14 h. To determine DCW in triplicates, centrifuge tubes were pre-weighed, filled with 2 mL of sample, and centrifuged at 21,000 × g. The supernatant was discarded, and the cell pellet was dried at 80 °C for 72 h and subsequently weighed. The process was controlled with the Biostat® Cplus control unit. Data logging was performed using the SIMATIC SIPAT software (Siemens AG). All sensor values, actuator values, and reference values were logged.

2.2 Industrial B. subtilis process—cultivation and hardware

2.2.1 Strain, preculture conditions, and main culture

An optimized preculture cultivation strategy developed by Clariant Produkte (Deutschland) GmbH was implemented to generate an inoculum of B. subtilis for the main culture. The main culture (700 mL) was cultivated using a proprietary high-performance medium specifically designed for industrial cultivation; its detailed composition is not disclosed due to confidentiality agreements. The temperature was modified during the process, and glucose served as the substrate, supplied in an initial batch phase and later fed during the fed-batch phase. Oxygen was continuously provided to the process through a constant inflow (1.5 L min−1) of sterile air via a sparger.

2.2.2 Bioreactor, sensor systems, and reference measurements

Multifors reactors (1.4 L total volume, Infors AG) were used for the processes. These reactors were equipped with standard sensors for pH, pressure, and dissolved oxygen measurements. In-line exhaust gas analysis was performed using a mass spectrometer (Thermo Scientific™ Prima PRO; Thermo Fisher Scientific Inc.). Protein concentration was selected as the target variable for soft sensor prediction. Reference measurements were taken manually by trained laboratory personnel. The protein concentration was determined in triplicates by assessing the target protein’s activity. The data logging and process control were managed using the bioprocess platform software eve® (Infors AG).

2.3 Automatic recalibration of soft sensors with different synchronization methods

The soft sensor development and validation were performed in MATLAB R2023a (The MathWorks Inc.). As a basic prediction model, a linear model was used, with all process variables available online (pO2, pH, temperature, addition of pH correcting agents, addition of substrate, CO2 and O2 concentration in the exhaust gas) as well as additionally calculated variables (CER and OUR, plus their cumulative values) as input. This underlying linear model structure has already been used successfully for several bioprocesses [2], including P. pastoris and B. subtilis [14]. The structure of the algorithm used to recalibrate this soft sensor model is described below.

2.3.1 Structure of the automatic recalibration soft sensor concept

To determine the influence of the synchronization methods (DTW, CR) on the prediction performance of a soft sensor with automatic recalibration, the structure shown in Fig. 1 was used. A data pool of P. pastoris (n = 12) and a data pool of B. subtilis (n = 24) were available. The rough outline of the soft sensor structure is as follows: At the beginning, one data set is removed from the data pool and declared as the current process (query data set). This data set is passed to the algorithm step by step as if it occurred in real time. Additional input variables (CER, OUR, etc.) are then calculated, followed by an optional synchronization using DTW or CR. Next, the most similar data sets (n = 3) are automatically selected using an MPCA and a similarity analysis based on the weighted Euclidean distances between the historical data sets and the current query data set. Finally, the PLSR-based prediction model is recalibrated, and the model is evaluated. At the start, a soft sensor model calibrated with all historical data sets was provided, which was then recalibrated four times per process using the methodology described. More details on the sub-steps are presented in the following chapters.

Fig. 1
figure 1

The automatic recalibration of the soft sensors with different synchronization methods. OUR: oxygen uptake rate, CER: carbon dioxide evolution rate, MPCA: multiway principal component analysis, PLSR: partial least squares regression

2.3.2 Preprocessing of data sets

Initially, calculations were performed to determine additional input variables, namely the OUR and the CER. These calculations required various parameters, including the airflow rate (\({\dot{V}}_{\text{air}}\)), pressure (\(p\)), liquid reactor volume (\({V}_{\text{liquid}}\)), the universal gas constant (\(R=8.314\cdot {10}^{-2}\frac{\text{L bar}}{\text{mol K}}\)), temperature (\(T\)), and the mole fractions of oxygen (\({x}_{\text{O}2}\)) and carbon dioxide (\({x}_{\text{CO}2}\)) at the inlet (indexed as \(\text{in}\)) and outlet (indexed as \(\text{out}\)) [24].

$${\text{CER}} = { }\frac{{\dot{V}_{{{\text{air}}}} \cdot p}}{{V_{{{\text{liquid}}}} \cdot R \cdot T}} \cdot \left( {\frac{{1 - x_{{{\text{O}}2,{\text{ in}}}} - x_{{{\text{CO}}2,{\text{in}}}} }}{{1 - x_{{{\text{O}}2,{\text{out}}}} - x_{{{\text{CO}}2,{\text{out}}}} }} \cdot x_{{{\text{CO}}2,{\text{out}}}} - x_{{{\text{CO}}2,{\text{in}}}} } \right)$$
(1)
$${\text{OUR}} = { }\frac{{\dot{V}_{{{\text{air}}}} \cdot p}}{{V_{{{\text{liquid}}}} \cdot R \cdot T}} \cdot \left( {x_{{{\text{O}}2,{\text{in}}}} - \frac{{1 - x_{{{\text{O}}2,{\text{ in}}}} - x_{{{\text{CO}}2,{\text{in}}}} }}{{1 - x_{{{\text{O}}2,{\text{out}}}} - x_{{{\text{CO}}2,{\text{out}}}} }} \cdot x_{{{\text{O}}2,{\text{out}}}} } \right)$$
(2)

Most of the variables required for calculating the CER and OUR were measured directly with hardware sensors. In addition, the liquid reactor volume (\({V}_{\text{liquid}}\)) was determined by a balance approach, considering the initial volume, the liquids added during the process (such as pH corrector, antifoam, and substrate feed), and the liquids removed from the process (samples). The influence of evaporation could be neglected as an exhaust air condenser was used. Besides the CER and OUR themselves, their cumulative values were also calculated as input variables.
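Equations (1) and (2) can be evaluated directly from the gas analysis. The following minimal Python sketch illustrates this (function and variable names are our own; the study itself used MATLAB):

```python
def cer_our(v_air, p, v_liquid, t_kelvin,
            x_o2_in, x_co2_in, x_o2_out, x_co2_out,
            r=8.314e-2):
    """CER and OUR per Eqs. (1)-(2).

    v_air: aeration rate [L h^-1], p: pressure [bar],
    v_liquid: liquid reactor volume [L], t_kelvin: temperature [K],
    x_*: mole fractions at inlet/outlet, r: gas constant [L bar mol^-1 K^-1].
    """
    molar_term = (v_air * p) / (v_liquid * r * t_kelvin)
    # inert-gas balance corrects for the change in total gas flow
    inert = (1 - x_o2_in - x_co2_in) / (1 - x_o2_out - x_co2_out)
    cer = molar_term * (inert * x_co2_out - x_co2_in)
    our = molar_term * (x_o2_in - inert * x_o2_out)
    return cer, our
```

With typical aerobic cultivation values, both rates come out positive and the respiratory quotient CER/OUR lies near one.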

2.3.3 DTW

The synchronization of the online variables \({x}_{\text{raw}}\) of the historical data sets to the online variables \({x}_{\text{query}}\) of the query data set (validation data set) using DTW was performed iteratively. This data-driven method calculates a distance matrix between the data points of a historical data set and those of the current process. For this, the Euclidean distances between all possible pairs of points are calculated. Then, the optimal warping path through this matrix is searched for, minimizing the cumulative Euclidean distance. This search considers boundary conditions, such as prohibiting backward steps in time. The procedure is repeated iteratively until a termination criterion is reached. Process length and specific landmarks are synchronized between the data sets by applying the calculated warping path to skip values or to use them more than once [25,26,27,28,29]. This process is visualized in Fig. 2.

Fig. 2
figure 2

Synchronization of a data set with the query data set using dynamic time warping. The distance matrix with warping path (gray boxes) is shown in the lower right corner, representing the Euclidean distances between the points of the query data set and the other data set. Left: query data set, top: data set to be synchronized
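The warping just described can be sketched with the classic dynamic-programming formulation. The univariate, single-pass toy version below (our own simplification of the multivariate, iterative procedure detailed in Sect. 3.1) builds the accumulated-distance matrix, backtracks the optimal monotone path, and applies it so that values are skipped or duplicated:

```python
def dtw_path(query, ref):
    """Dynamic-programming DTW: accumulated-distance matrix plus
    backtracking of the optimal path; only monotone steps are allowed
    (no backward moves in time)."""
    n, m = len(query), len(ref)
    INF = float("inf")
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(query[i - 1] - ref[j - 1])  # pointwise distance
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # backtrack from the end of both series to the start
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
        if step == acc[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif step == acc[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def warp_to_query(ref, path):
    """Apply the warping path: for each query index take the aligned
    reference value, duplicating or skipping values as needed."""
    aligned = {}
    for qi, rj in path:
        aligned[qi] = ref[rj]
    return [aligned[k] for k in sorted(aligned)]
```

The warped output always has the length of the query series, which is exactly what the recalibration concept requires.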

2.3.4 CR

CR was used as an alternative synchronization method to DTW. In this method, the online variables \({x}_{\text{raw}}\) of the historical data sets are aligned to the online variables \({x}_{\text{query}}\) of the query data set. For this, characteristic features (landmarks) are identified in the curves and synchronized (Fig. 3). It is assumed that the curves of the sensor values consist of underlying continuous functions. For synchronization, the curve-specific characteristics of the trajectories, such as extrema or trend reversals, are identified and aligned between the processes. This can be done in raw and derived signal profiles [30,31,32,33]. Regions between the curve-specific characteristics are then linearly compressed or stretched.

Fig. 3
figure 3

Synchronization of a data set with the query data set using curve registration. The maximum of the curve was chosen as an exemplary landmark. For synchronization, more landmarks are used, e.g., turning points and other extrema. Regions between the landmarks are uniformly compressed or stretched
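The stretching step can be illustrated as a piecewise-linear time warp. The sketch below (names and the simple linear interpolation are our own illustration, not the study's implementation) maps the reference landmarks onto the query landmarks and uniformly rescales each segment in between:

```python
def _interp(x, xs, ys):
    """Piecewise-linear interpolation of (xs, ys) at point x (xs ascending)."""
    for k in range(len(xs) - 1):
        if xs[k] <= x <= xs[k + 1]:
            frac = (x - xs[k]) / (xs[k + 1] - xs[k])
            return ys[k] + frac * (ys[k + 1] - ys[k])
    return ys[-1]

def register_curve(ref, ref_marks, query_marks, n_query):
    """Landmark-based curve registration: segments of `ref` between
    consecutive landmarks are uniformly stretched or compressed so the
    reference landmarks land on the query landmark positions. Landmark
    lists must start at 0 and end at the last index of each curve."""
    out = []
    for t in range(n_query):
        pos = _interp(t, query_marks, ref_marks)        # warped reference position
        out.append(_interp(pos, list(range(len(ref))), ref))  # resample ref there
    return out
```

In contrast to DTW, no values are duplicated here, so the underlying curve shape between landmarks is preserved.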

2.3.5 MPCA

To create a search space for comparing the historical data sets with the query data set, an MPCA [16,17,18,19] was performed. This search space is dimension- and noise-reduced compared to the input variable space. For this, the data array of the data pool (content: batches \(I\), input variables \(J\), and time \(K\)) has to be unfolded from an \(I\times J\times K\) array to an \(I\times JK\) matrix. The query data set is also added as a row in this matrix. As a result, each batch represents one long row in a two-dimensional matrix. After this step, a regular PCA can be performed on the two-dimensional matrix. The scores of each data set's first four principal components are then weighted by the variance they explain. The new dimension-reduced search space created this way is used to identify similar data sets.
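The unfolding and weighting steps can be sketched in a few lines of NumPy (an illustrative reimplementation with our own naming; the study itself used MATLAB):

```python
import numpy as np

def mpca_weighted_scores(batches, query, n_pc=4):
    """Batch-wise unfolding + PCA (MPCA): each J x K batch becomes one long
    row of an (I+1) x JK matrix (historical batches plus the query batch);
    an ordinary PCA is then run via SVD on the mean-centred matrix, and the
    scores of the first n_pc components are weighted by the variance they
    explain."""
    X = np.array([np.asarray(b).ravel() for b in list(batches) + [query]])
    Xc = X - X.mean(axis=0)                      # column-wise mean centring
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_pc] * S[:n_pc]              # principal component scores
    weights = S[:n_pc] ** 2 / np.sum(S ** 2)     # fraction of explained variance
    return scores * weights, weights
```

The last row of the returned score matrix belongs to the query batch, so distances to all historical rows can be computed directly in this reduced space.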

2.3.6 Similarity analysis and selection of data sets via k-nearest neighbors

To identify the three most similar data sets, the variance-weighted scores (\(w\bullet t\)) of the query data set were compared with those of the historical data sets. For this, the Euclidean distance \({dk}_{j}\) between the query data set and each historical data set \(j\) was calculated.

$$dk_{j} = \|w \cdot t_{{{\text{query}}}} - w \cdot t_{j} \| = \sqrt {\mathop \sum \limits_{i = 1}^{4} (w_{i} \cdot t_{{{\text{query}}, i}} - w_{i} \cdot t_{j,i} )^{2} }$$
(3)

Now the three lowest Euclidean distances \({dk}_{j}\) could be identified, and thus, the three most similar data sets could be selected.
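Equation (3) and the subsequent selection reduce to a few lines of code; a minimal sketch with hypothetical names:

```python
import math

def select_most_similar(t_query, t_hist, w, k=3):
    """Weighted Euclidean distance in score space (Eq. (3)), followed by
    k-nearest-neighbour selection of historical data sets (here k = 3)."""
    def dk(t_j):
        # variance-weighted Euclidean distance over the first four scores
        return math.sqrt(sum((wi * a - wi * b) ** 2
                             for wi, a, b in zip(w, t_query, t_j)))
    ranked = sorted(range(len(t_hist)), key=lambda j: dk(t_hist[j]))
    return ranked[:k]
```

The returned indices identify the three historical data sets used for recalibration.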

2.3.7 Partial least squares regression

A linear model was used to predict the target variables. All available online-measurable variables of the query process (hardware sensors, actuators, calculated variables) served as input for this model. Recalibration was performed using the selected similar data sets. As a calibration approach, PLSR, which is widely applied in bioprocesses, was used [2]. The advantage of this methodology is that, in addition to the calibration of the prediction model, a dimension reduction of the input variables takes place. This is done by calculating latent variables representing combinations of the input variables. The composition of these latent variables is based on their covariance with the target variable. Iteratively, more and more latent variables are added to the prediction model, and each individual model is evaluated using the mean squared error (MSE). For this purpose, the \(n\) reference values \({y}_{\text{selected},i}\) of the selected data sets are compared with the predictions \({\widehat{y}}_{k,i}\) of the models with an increasing number \(k\) of latent variables.

$${\text{MSE}}_{k}=\frac{1}{n}\sum_{i=1}^{n}{({y}_{\text{selected},i}-{\widehat{y}}_{k,i})}^{2}$$
(4)

The optimal number of latent variables is determined based on the first local minimum of the MSE. Thus, the new, recalibrated prediction model \({f}_{\text{recal}}({x}_{\text{query}}(t))\) is defined. To prevent sharp jumps between the previous model \({f}_{\text{previous}}\) and the recalibrated model, a smooth transition is made for the \(m\) timesteps \({t}_{m}\) in a defined transition period from the recalibration timestamp \({t}_{\text{recal}}\) to the end of the transition phase \({t}_{tr}\) using a forgetting factor \(\lambda\) (changes linearly in the transition period from 1 to 0) between the old and the new models. The transition phase takes up the initial 40% of the time between two recalibrations.

$${f}_{\text{recal},tr }\left({x}_{\text{query}}\left({t}_{m}\right)\right)=\lambda {(t}_{m})\bullet {f}_{\text{previous}}\left({x}_{\text{query}}\left({t}_{m}\right)\right)+(1-\lambda \left({t}_{m}\right))\bullet {f}_{\text{recal}}\left({x}_{\text{query}}\left({t}_{m}\right)\right)$$
(5)

with
$$\lambda \left({t}_{m}\right)=1-\frac{{t}_{m}-{t}_{\text{recal}}}{{t}_{tr}-{t}_{\text{recal}}}$$
(6)
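The hand-over in Eqs. (5) and (6) can be expressed directly; a minimal sketch (names are our own) in which the two models are passed in as callables:

```python
def transition_prediction(f_previous, f_recal, x_query, t_m, t_recal, t_tr):
    """Smooth model hand-over per Eqs. (5)-(6): the forgetting factor
    lambda falls linearly from 1 at t_recal to 0 at t_tr, blending the
    previous and the recalibrated model."""
    lam = 1.0 - (t_m - t_recal) / (t_tr - t_recal)
    lam = min(1.0, max(0.0, lam))   # outside the window, only one model acts
    return lam * f_previous(x_query) + (1.0 - lam) * f_recal(x_query)
```

At \(t_m = t_{\text{recal}}\) the previous model still dominates entirely; at \(t_m = t_{tr}\) the prediction comes solely from the recalibrated model.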

2.3.8 Evaluation of prediction performance via quality parameters

The normalized root mean squared error of prediction (NRMSEP) is used to compare and evaluate the prediction performances of the models with and without the synchronization methodology. Here, the \(n\) reference values \({y}_{i,\text{query}}\) of the query data set are compared with the values \({\widehat{y}}_{i,\text{query}}\) predicted by the models \(f({x}_{\text{query}})\) with and without synchronization.

$$\text{RMSEP}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{({y}_{i,\text{query}}- {\widehat{y}}_{i,\text{query}})}^{2}}$$
(7)
$$\text{NRMSEP}=\frac{\text{RMSEP}}{{y}_{\text{max}} -{ y}_{\text{min}}}$$
(8)

3 Results and discussion

To discuss the influence of the synchronization methods on an automatically recalibrating soft sensor, the results are structured as follows: First, the choice of synchronization methods and their specific adaptation to the bioprocesses are discussed (Sect. 3.1). The results of the synchronizations are then presented and discussed. Here, the visual synchronization success is first assessed on two essential input variables for the soft sensor, the carbon dioxide and oxygen concentrations in the exhaust gas of the process. Then, the prediction performance of the soft sensor without synchronization is compared to that with CR and DTW. On the one hand, this is visualized for an example process; on the other hand, the mean NRMSE of the soft sensors for all data sets of a data pool is given, and differences are discussed. This is carried out for the P. pastoris process (Sect. 3.2) and the B. subtilis process (Sect. 3.3). Finally, the results are discussed regarding their transferability between bioprocesses and further potentials of the presented soft sensor concept (Sect. 3.4).

3.1 Selection of synchronization methods and implementation

Two different synchronization methods were used as preprocessing for the automatic recalibration of the soft sensors: DTW and CR. The objective was to increase the prediction performance of the soft sensors. It was therefore essential that not only the overall length of the process data but all process-specific landmarks were synchronized. In the following, the choice of suitable synchronization methods is briefly discussed, and modifications are described.

A comprehensive overview of different synchronization methods for bioprocesses was given by Brunner et al. [34]. They described three different synchronization methods for bioprocess data: Indicator variable techniques, DTW, and CR.

With indicator variable techniques, variables other than time are used to describe the process progress. These can be single variables as well as linear combinations of sensor values. When using linear combination models, these can be trained using partial least squares regression (PLSR) [35, 36]. For this, an additional maturity index (0–100%) is introduced, which describes the process progress. The problem with this method is that not all process-specific landmarks are synchronized; instead, only the process lengths are aligned. The suitability of this methodology for bioprocesses is therefore relatively limited. Consequently, it was not considered in this study.

DTW is suitable for synchronizing all process-specific landmarks and was therefore included in these investigations. A brief overview of the method has already been given in Sect. 2.3.3. Several modifications were made for the specific implementation. The synchronization was carried out iteratively. In the first three iterations, the historical data were synchronized to the query data set and aligned to the averaged trajectory of all data in the following iterations. With this procedure, it is possible to align all historical data sets to the same length as the query data set and to achieve the basic curve shape of the query data set. The subsequent synchronization to the mean curve achieves an even more exact synchronization of the curves without overweighting anomalies in the query data set (e.g., sensor faults). Multiple steps, following Kassidas et al. [25] and González-Martínez et al. [26], were used to perform the DTW. First, the variables were scaled by dividing them by the average range of each variable. This was followed by the first synchronization iteration, which determines the warping path based on the Euclidean distances between the data sets to be synchronized and the reference trajectory. In addition, temporally backward warping steps were prohibited when determining the warping path. Since a multivariate approach with several variables per data set was pursued, a weighting matrix for the next iteration was calculated, which weights variables with consistent trajectories higher. Two termination criteria for the iterations were defined: a maximum number of 20 iterations and a change in the weighting matrix between iterations of less than 1%. Furthermore, as preprocessing for the synchronization with DTW, a preselection of the data sets (n = 10) was performed using MPCA and k-nearest neighbors based on the Euclidean distance of the data sets to the query data set. The procedure was analogous to the methods described in Sects. 2.3.5 and 2.3.6. This step addressed the partly very different process characteristics, which would otherwise promote singularities during the synchronization from the 4th iteration onward. Singularities are defined as a loss of information during the synchronization of processes caused by overly frequent duplication of individual values.

CR is also suitable for synchronizing all process-specific landmarks and was part of this study. Using this methodology, the landmarks of multiple signals can be synchronized depending on curve-specific characteristics. For this study's specific implementation of CR, a principal component analysis was first performed with the variables \({x}_{\text{query}}\) of the query data set. This was done to enable multivariate synchronization. In addition to the scores of the first principal component of the query data set, the scores of the historical data sets were then determined using the calculated loadings [33]. To synchronize the principal component scores, the search for landmarks was not performed analytically; instead, characteristic landmarks were identified with a generic data-driven concept. The pruned exact linear time (PELT) method was used to identify the landmarks. This method is an efficient algorithm for automatically identifying structural changes in data sets. The algorithm traverses the data set step by step, examining possible subsegmentations to identify those that best explain structural changes (landmarks) [37]. Here, significant changes in the mean and the slope of the curve were considered as criteria. Due to the nonlinearity of bioprocesses, numerous landmarks were found in the first principal component of the bioprocesses. Given the variable process characteristics, however, many, and partly differing numbers of, landmarks were detected, which had to be harmonized and reduced. Therefore, a knowledge-based minimum distance between the individual landmarks was specified for each process, which resulted in ten landmarks per total process for the P. pastoris process and six landmarks per complete process for the B. subtilis process. Subsequently, the landmarks found in the historical data sets were synchronized to the query data set. Regions between two landmarks were linearly stretched or compressed.
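The harmonization step, keeping only landmarks that respect a knowledge-based minimum spacing, can be sketched as a single greedy pass (the function name and the greedy rule are our own illustration, not the exact implementation):

```python
def thin_landmarks(landmarks, min_gap):
    """Keep a detected landmark only if it lies at least `min_gap` samples
    after the previously kept landmark; the first landmark is always kept."""
    kept = []
    for lm in sorted(landmarks):
        if not kept or lm - kept[-1] >= min_gap:
            kept.append(lm)
    return kept
```

Applied to the dense changepoint output, such a rule reduces the landmark set to a fixed, process-specific count, as described above.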

3.2 Comparison of the different data synchronization techniques for the prediction of the biomass concentration of the P. pastoris processes

The biomass concentration was selected as the target variable for the soft sensor prediction in P. pastoris, since its real-time availability can be used to optimize control concepts, such as methanol control. In addition to the exhaust gas values shown in this chapter, further process variables of a query process are presented in the supplementary information (see Fig. S1).

Figure 4 shows a query process with a potential calibration data set. The potential calibration data set is shown unsynchronized, after DTW, and after CR. The plot represents the trajectories at the 4th recalibration point in time. Comparing the trajectories, several absolute differences can be observed. On the one hand, there is a slight initial difference, probably due to an inaccurate calibration of the exhaust gas measuring device. Significant absolute differences between the data sets can be recognized later in the process (time window 45–50 h, beginning of the fed-batch phase). In this time frame, the query data set only reaches CO2 concentrations of up to 1.2%, compared to the potential calibration data set with CO2 concentrations of up to 1.5%. This is due to differences in the adjustment of the methanol concentration at the beginning of the fed-batch phase. Analogous differences can be recognized in the O2 curves. Looking at the temporal variability of the processes, there are no major differences in the represented progressions until the end of the batch phase (~ 37 h); only after that do the processes vary in time (transition and fed-batch phases). There are also significant differences between the two synchronized curves. It can be observed that DTW changes the original curve shape more strongly than CR. However, evaluating the prediction performance of the resulting prediction model will show whether this is a valuable synchronization of the curve or overfitting and, thus, a loss of information.

Fig. 4
figure 4

Unsynchronized and synchronized CO2 (A) and O2 (B) trajectories of a historical data set with a query data set from the Pichia pastoris data pool at the last recalibration step. DTW, dynamic time warping; CR, curve registration

In the following, the influence of the synchronization methods on an automatically recalibrated soft sensor for the biomass concentration prediction of the process was considered. Comparing the prediction performance of the soft sensor with and without synchronization (Fig. 5), there are no apparent differences between the predictions in the batch phase. This is not unexpected since there are few temporal differences between the processes during this phase. In the subsequent transition and fed-batch phase, on the other hand, there is a clear deviation of the soft sensor without synchronization methodology. As already observed in Fig. 4, this is an area where synchronization of the historical data sets results in a clear shift of the curves. When now comparing the average prediction performance of the soft sensor for all data sets (each data set acts as a query data set), it results as follows: NRMSEPunsync = 13.5%; NRMSEPDTW = 13.0%; NRMSEPCR = 10.3%. A comparison of the NRMSEP of the recalibration without synchronization step with the recalibration with CR or DTW shows the following: On average, a 24% improvement in prediction performance could be achieved by applying CR during recalibration. However, no such substantial improvement (4%) can be achieved with DTW because of too extensive changes in the synchronized curve profiles and the resulting loss of information. Consequently, there is no optimal selection of the calibration data sets, and thus, regarding the mean NRMSEP, only a comparable prediction performance between DTW and the unsynchronized prediction methodology. In general, the main reason for the differences between the NRMSEPs lies primarily in the more frequent major misestimates in single recalibration steps (as in the example at hours 45–63 of the prediction without synchronization) than the minor differences between the predictions, such as at the end of the batch phase and the transition phase (around hour 37). 
Synchronizing the input variables of the soft sensor with CR can, therefore, lead to fewer major misestimates.
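For reference, the NRMSEP comparison above can be reproduced mechanically. The following is a minimal sketch; it assumes normalization by the range of the reference values, since the exact normalization used for the NRMSEP is not restated here:

```python
import numpy as np

def nrmsep(y_true, y_pred):
    """Normalized root mean squared error of prediction, in percent.
    Assumption: the RMSEP is normalized by the range of the reference values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmsep = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 100.0 * rmsep / (y_true.max() - y_true.min())

# Relative improvement of CR over the unsynchronized recalibration,
# using the mean values reported above (13.5% vs. 10.3%):
improvement = (13.5 - 10.3) / 13.5 * 100  # about 24%
```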

Fig. 5

Biomass predictions of the automatically recalibrated soft sensor with automatic selection of similar data sets, with and without synchronization, in the Pichia pastoris process. DTW, dynamic time warping; CR, curve registration

3.3 Comparison of the different data synchronization techniques for the prediction of the protein concentration of the B. subtilis processes

The B. subtilis bioprocess is an industrial process for commercial target protein production. The industrially most important variable, the protein concentration, cannot be measured directly online; therefore, a soft sensor was used to predict it. For confidentiality reasons, not all process variables can be provided. The process variables shown in the figures have been normalized to the maximum value of the respective measured variable.

A slightly different picture emerges for the B. subtilis process when comparing the CO2 and O2 curves in the exhaust gas of an exemplary query process with those of a potential calibration data set (Fig. 6). The curves start very similarly but show initial temporal differences during the batch phase. The potential calibration data set reaches the first CO2 peak after 13% of the total process duration (the point at which the batch-phase substrate is consumed and thus the end of the batch phase), whereas the query data set reaches this peak after 16% of the total process duration. Subsequently, O2 consumption and CO2 production increase again due to the limited addition of substrate during the fed-batch phase. This results in relatively stable exhaust gas values, which can, however, differ significantly between processes owing to different feeding strategies. In this example, the CO2 production level of the potential calibration data set is almost double that of the query process. Such differences are common for processes with different process characteristics. However, this already reveals a problem with DTW: long singularities occur in places, starting at about 30% of the total process duration. They are caused by the remaining absolute differences between the curve to be synchronized and the query process; despite normalization of the curves before synchronization, these differences persist due to the different feeding strategies. Such information losses do not occur during synchronization with CR: the underlying curve shape is largely preserved, and yet the landmarks are aligned.
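The behavior of DTW on such curves can be illustrated with a minimal implementation. This is a generic textbook DTW (accumulated-cost matrix plus backtracking), not the authors' exact configuration; repeated indices in the resulting path, where one curve's index stalls while the other advances, are the singularities described above:

```python
import numpy as np

def dtw_path(x, y):
    """Minimal DTW: fill the accumulated-cost matrix, then backtrack.
    Returns the optimal alignment path as a list of (i, j) index pairs."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack from (n, m) to (1, 1)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy curves with both a temporal shift and a persistent level offset,
# as caused by different feeding strategies; the level offset is what
# provokes the stalling behavior in the alignment path.
x = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # query
y = np.array([0.0, 2.0, 2.0, 2.0, 2.0, 2.0])  # historical data set
path = dtw_path(x, y)
```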

Fig. 6

Unsynchronized and synchronized CO2 (A) and O2 (B) trajectories of a historical data set with a query data set from the Bacillus subtilis data pool at the third recalibration step. Axes are in % due to confidentiality agreements. DTW, dynamic time warping; CR, curve registration

The differences between the unsynchronized and synchronized process variables are also noticeable in the predictions. The predictions of the protein concentration vary especially at the third recalibration (Fig. 7). The prediction without synchronization shows overshooting. The prediction with DTW lies closer to the reference points, but its overall shape is atypical for a bioprocess: a decrease in protein concentration (from about 55% of the total process duration) followed by an increase back to the original level (from 80% of the total process duration) is not a plausible prediction while CO2 production and O2 consumption remain constant. Again, the automatically recalibrated soft sensor with CR yields a plausible and steady curve shape. Considering the mean NRMSEP of all 24 data sets of the B. subtilis data pool, prediction performances of NRMSEPunsync = 17.4%, NRMSEPDTW = 20.7%, and NRMSEPCR = 15.9% result. The mean NRMSEP can thus be reduced by 9% using CR compared to the soft sensor without synchronization. With DTW, the prediction performance does not exceed that of the soft sensor without synchronization. The reason is the sometimes pronounced singularities caused by the large variance between processes. These occur mainly in the B. subtilis process, since a substantially wider range of process strategies was used in this data pool than in the P. pastoris data pool.

Fig. 7

Protein predictions of the automatically recalibrated soft sensor with automatic selection of similar data sets, with and without synchronization, in the Bacillus subtilis process. Axes are in % due to confidentiality agreements. DTW, dynamic time warping; CR, curve registration

3.4 Transferability between bioprocesses and further aspects

Both synchronization approaches could be transferred straightforwardly between the two bioprocesses. DTW could be applied directly in the form presented here, without individual adjustments, to both bioprocesses; a transfer to further bioprocesses is thus conceivable. For the transferability of CR, the number of landmarks must be defined manually. In this study, the number of landmarks was set to ten (P. pastoris) and six (B. subtilis). Because the appropriate number varies from process to process, depending on factors such as the general speed of the process or the number of technical process phases (batch, fed-batch), a recommendation for further processes can only be made to a very limited extent. However, iterative approaches are conceivable that automatically test several landmark counts at the outset and select a suitable number based on the synchronization performance. The number chosen here always represents a trade-off between too many landmarks (sensor faults incorrectly detected as landmarks) and too few (not all important landmarks synchronized).
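Such an iterative landmark-count selection could look like the following sketch. Since the CR implementation is not specified here, the registration routine is passed in as a user-supplied function; `toy_register` is a purely illustrative stand-in, not a real curve registration:

```python
import numpy as np

def select_landmark_count(query, ref, register, candidates=range(2, 12)):
    """Try several landmark counts and keep the one whose registration
    brings the reference curve closest to the query (RMSE after warping).
    `register(ref, query, k)` is a user-supplied CR routine."""
    best_k, best_err = None, np.inf
    for k in candidates:
        warped = np.asarray(register(ref, query, k))
        err = np.sqrt(np.mean((warped - np.asarray(query)) ** 2))
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err

# Illustrative stand-in: "more landmarks" here simply blends the reference
# toward the query; a real CR routine would warp the time axis instead.
def toy_register(ref, query, k):
    w = min(k, 10) / 10.0
    return (1 - w) * np.asarray(ref) + w * np.asarray(query)

query = np.sin(np.linspace(0, 3, 50))
ref = np.sin(np.linspace(0, 3, 50) - 0.3)
k, err = select_landmark_count(query, ref, toy_register, candidates=range(2, 11))
```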

When DTW and CR were used as preprocessing for the automatic recalibration of the soft sensors, CR led to higher prediction performance in both bioprocesses, whereas DTW did not lead to any significant improvement compared to recalibration without a synchronization method. The poor performance of DTW is mainly due to the complexity of the data sets. Changing process characteristics, such as changing raw materials, modified process strategies, and biological variability, cause the process variables to differ not only in the temporal occurrence of landmarks but also in their absolute values, as shown in Fig. 6 at the beginning of the fed-batch phase of the B. subtilis process (from 25% of the process duration). This exposes the weakness of DTW: with such persistent deviations in the absolute heights of the curve profiles, singularities occur during synchronization. The reason is that, with this method, the process curves are iteratively synchronized by skipping and duplicating individual sensor values until a termination criterion is reached. If the curves differ significantly from one another overall, due to changing process characteristics, the termination criterion is only reached after numerous iterations, and long singularities are generated in the curves up to that point. This loss of information in the process variables subsequently degrades the prediction performance of the soft sensors. This problem does not occur with CR. Here, characteristic landmarks are selected first, and only these are synchronized; the areas between the landmarks are then evenly stretched or compressed. The absolute values of the process variables are therefore less critical for synchronization, which is advantageous when process characteristics change.

The recalibration algorithm applied can be largely automated and transferred straightforwardly to further bioprocesses and target variables, such as other target proteins and by-products. The presented concept thus allows an even more comprehensive validation and optimization of the presented synchronization methods. As an initial condition, only a defined number of equal time intervals had to be specified to determine when the recalibrations are carried out. Beyond the synchronization methods, the general prediction performance of the soft sensor concept could be further improved by various other approaches. As mentioned, the prediction is recalibrated in five fixed sections per process. These prediction windows could be adapted to the process phases by, e.g., automated phase detection [14, 38,39,40]. In process sections selected per phase, the underlying relationships between the variables would then remain constant, yielding better prediction models. Furthermore, the algorithm would become even more automated, since the recalibration points would no longer need to be specified.
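The fixed-interval recalibration described above can be sketched as follows. The similarity measure (plain Euclidean distance on the input trajectories) and the ordinary-least-squares model are simplifications assumed here for illustration, not the paper's exact selection and modelling steps:

```python
import numpy as np

def recalibrate_windows(X_query, pool, n_windows=5, n_select=3):
    """Sketch of fixed-interval recalibration: split the process into
    `n_windows` equal sections; in each, pick the `n_select` most similar
    historical runs and fit a linear soft-sensor model on them.

    X_query : (T, p) input variables of the running process
    pool    : list of (X, y) historical runs, each X of shape (T, p), y of (T,)
    Returns one fitted coefficient vector per window.
    """
    T = X_query.shape[0]
    bounds = np.linspace(0, T, n_windows + 1).astype(int)
    models = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        # rank historical runs by similarity over the current window
        dists = [np.linalg.norm(X[a:b] - X_query[a:b]) for X, _ in pool]
        chosen = np.argsort(dists)[:n_select]
        Xc = np.vstack([pool[i][0][a:b] for i in chosen])
        yc = np.concatenate([pool[i][1][a:b] for i in chosen])
        # linear soft-sensor model: y ~ Xc @ beta (intercept omitted)
        beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
        models.append(beta)
    return models
```

Synchronization (DTW or CR) would slot in before the distance computation, so that the similarity ranking compares aligned rather than time-shifted trajectories.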

Linear models were used as the soft sensor model structure in the algorithm presented. As mentioned, the prediction performance of these models could be improved by phase-dependent segmentation. Alternatively, nonlinear models such as artificial neural networks could be used. These models have more complex structures and are also suitable when the underlying relationships are nonlinear [9]. However, a major challenge is the number of reference values in the data sets used for model training: an insufficient number can lead to overfitting during training and thus to poor prediction performance. This can be avoided by automated sampling and processing and, hence, a high number of reference values for model training. Nevertheless, the use of synchronization methods as preprocessing for the selection and calibration of soft sensor models is likely to increase prediction performance regardless of the model structure. Preprocessing the data sets with CR can therefore also be recommended for nonlinear models. The singularities that occur with DTW would most likely remain challenging, as the associated information loss causes problems for all model types.

Additional methods to compensate for sensor faults should also be incorporated to further optimize the robustness of the prediction models. Besides the fault-tolerant fusion of redundant soft sensor models [41], the detection of sensor faults by means of pattern recognition [42, 43], symptom signal methods [36], or multivariate statistical process control [35] is conceivable.

The recalibration concept presented enables the long-term, automated maintenance of soft sensors. However, another factor must be considered before a real-time implementation is realized. In this study, we worked with the data pools of two different bioprocesses, which were created over several months. For long-term real-time use on a bioprocess, however, the data pool must be expanded regularly, because changing process characteristics can only be compensated for automatically in the prediction model if these or similar characteristics have already occurred and are part of the data pool. If this condition is fulfilled, high long-term prediction performance can be achieved by combining synchronization methods and automated recalibration.

4 Conclusion

This study investigated the influence of synchronization methods on an automatically recalibrating soft sensor concept. To this end, two different synchronization methods (DTW and CR) were used as preprocessing for the automatic real-time selection of the most similar data sets, which were subsequently used to recalibrate the current soft sensor model. These studies were performed on two different bioprocesses (P. pastoris and B. subtilis) with different target variables. Comparing the NRMSEP of the soft sensors without a synchronization method and with CR or DTW revealed significant differences in prediction performance. The use of CR reduced the NRMSEP by up to 24%. DTW was less suitable for the synchronization of bioprocess data because of the occasionally large differences between data sets in the data pools, arising from different process characteristics such as new feeding strategies. These differences led to singularities in the synchronized data sets and an associated loss of information, which impaired the predictions of the soft sensors. Nevertheless, the DTW algorithm could be further developed and adapted to the different process characteristics; initial approaches have already been presented here. In the form presented here, however, CR is much more intuitive to implement and requires minimal process knowledge.

Overall, soft sensors allow enhanced bioprocess monitoring and control. Unfortunately, the implementation in the biotechnological industry often fails due to the long-term usability of these sensors [10]. Optimized intelligent recalibration can address this issue and secure soft sensors a permanent place in the biotechnology industry.