Multi-source transfer learning of time series in cyclical manufacturing

This paper describes a new transfer learning method for modeling sensor time series following multiple different distributions, e.g. originating from multiple different tool settings. The method aims at removing distribution specific information before the modeling of the individual time series takes place. This is done by mapping the data to a new space such that the representations of different distributions are aligned. Domain knowledge is incorporated by means of corresponding parameters, e.g. physical dimensions of tool settings. Results on a real-world problem of industrial manufacturing show that our method is able to significantly improve the performance of regression models on time series following previously unseen distributions.


Introduction
Standard machine learning techniques rely on the assumption that the entire data, both for training and for testing, follows the same distribution. However, this assumption can be violated. In particular, in cyclical manufacturing processes, data is often collected from different operating conditions and environments, called scenarios.
One example is the drilling of steel components (Pena et al. 2005; Ferreiro et al. 2012), where different machine settings can lead to different torque curves over time. A second example is the regression of spectroscopic measurements, where different instrumental responses, environmental conditions, or sample matrices can lead to different training and test measurements (Nikzad-Langerodi et al. 2018; Malli et al. 2017). Other examples can be found in the optical inspection of textures or surfaces (Malaca et al. 2016; Stübl et al. 2012; Zȃvoianu et al. 2017), where different lighting conditions and texture classes can lead to variations in measurements.
Approaching such heterogeneities in data with standard machine learning techniques requires modeling each scenario independently, which often causes expensive and time-consuming data collection efforts. To overcome this problem, approaches from the field of transfer learning have been proposed. Transfer learning aims at extracting knowledge from source scenarios (with large amounts of possibly labeled data) and applying it to the modeling of target scenarios (with little or no available data).
In this paper we address the problem of domain generalization (Muandet et al. 2013), where, assuming enough data from a representative set of (source) scenarios, no data at all is required for the generalization to previously unknown (target) scenarios. We aim at predicting time series from target scenarios arising in cyclical process problems in manufacturing, e.g. torque curves.
We propose a new transfer learning method called Scenario-Invariant Time Series Mapping (ScITSM) that leverages available information in multiple similar scenarios and applies it to the prediction of previously unseen scenarios (without available training data).
ScITSM does so by mapping the data into a new space where the scenario-specific data distributions are aligned, such that subsequent joint modeling of all transformed data samples is possible. The proposed method is based on the idea of the parameter-based multi-task learning approach presented in Zhang and Yang (2017), where coefficients of neighboring models are either shared or forced to be similar. Our method differs from the approach in Zhang and Yang (2017) by the incorporation of expert knowledge and by its application to time series data. The corrected data from different scenarios is more homogeneous and easier to learn in subsequent machine learning tasks. Furthermore, the learned correction formulas generalize to unseen scenarios. To the best of our knowledge, no comparable methods exist that were specifically designed for time series data. The ScITSM method is illustrated in Fig. 1.
The performance of the new algorithm is demonstrated by experiments on a real-world intelligent manufacturing problem. Details of the application must be kept confidential, so it is introduced here in an abstracted way. In particular, a schematic sketch of the application is shown in Fig. 1, the results of the experiments are presented and parts of the collected and preprocessed data are shown. The results indicate that prediction accuracy can be significantly improved by ScITSM.
This paper is organized as follows: Sect. 2 reviews related work, Sect. 3 formulates the problem of domain generalization, Sect. 4 describes the proposed method for Scenario-Invariant Domain Generalization and details our algorithm, Sect. 5 describes our industrial use case, our experiments and results, and, Sect. 6 concludes the work.

Related work
Transfer learning techniques are commonly applied in the areas of computer vision, natural language processing, biology, finance, business management and control applications; see e.g. Lu et al. (2015), Grubinger et al. (2016, 2017b), Zellinger et al. (2016, 2017) and references therein. Published work in manufacturing applications is relatively scarce. Successful applications in chemistry-oriented manufacturing processes using chemometric modeling techniques are presented in Nikzad-Langerodi et al. (2018) and Malli et al. (2017). Another successful application of transfer learning in intelligent manufacturing for improving product quality was presented in Luis et al. (2010).
The presented method corresponds to the transfer learning subtask of domain generalization (Muandet et al. 2013), which, in contrast to other popular transfer learning subtasks like domain adaptation (Zellinger et al. 2017, 2019), does not require any process data measurements of the target scenarios. Many existing domain generalization algorithms can be found in the area of kernel methods (Muandet et al. 2013; Grubinger et al. 2015, 2017a; Blanchard et al. 2017; Deshmukh et al. 2017; Gan et al. 2016; Erfani et al. 2016). These algorithms first map the source scenarios into a high-dimensional kernel Hilbert space where the different data distributions are aligned and subsequently train a prediction model. Neural network based domain generalization approaches were presented in Ghifary et al. (2015) and Li et al. (2017a, b). Domain generalization was also combined with SVM (Niu et al. 2015; Xu et al. 2014) and DC-programming (Hoffman et al. 2017). To the best of our knowledge, there is no domain generalization method that accounts for multiple source domains and temporal information in time series data.

Formal problem statement
For simplicity, we formulate the problem of multi-source domain generalization for time series of equal length T . Such time series are obtained as results of subsampling procedures as it is the case in our application in Sect. 5.
Following Muandet et al. (2013), Ben-David et al. (2010) and Zellinger et al. (2019), we consider distributions P 1 , . . . , P S and Q over the input space R N ×T which represent S source scenarios and one target scenario, respectively, where N represents the number of features. In this work, we assume for each of the S + 1 scenarios a given corresponding parameter vector p 1 , . . . , p S , p Q ∈ R P , e.g. corresponding tool dimensions or material properties. Note that the parameter vectors are not the parameters of the distributions P 1 , . . . , P S , Q.
Following Sugiyama and Kawanabe (2012), Ben-David and Urner (2014), we consider an unknown target function l : R N ×T → R T .
Given S source samples X 1 , . . . , X S drawn from P 1 , . . . , P S with corresponding target values Y 1 = l(X 1 ), . . . , Y S = l(X S ) and parameters p 1 , . . . , p S , respectively, the goal of domain generalization is to learn a regression model f : R N ×T → R T with a small error, given by Eq. (1), in the target scenario. The error is measured by means of the Euclidean norm ‖x‖ of a vector x. Note that, except for the parameter vector p Q , no information is given about data in the target scenario.
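For concreteness, a target error of the kind described above can be written as the expected Euclidean distance between prediction and target function under Q. The following is a hedged reading based only on the surrounding definitions (f, l, Q and the Euclidean norm); the symbol ε_Q and the exact form (for example, whether the norm is squared) are assumptions and not a quotation of Eq. (1).

```latex
% Assumed form of the target error; \varepsilon_Q is a hypothetical symbol.
\varepsilon_Q(f)
  \;=\; \mathbb{E}_{x \sim Q}\bigl[\, \lVert f(x) - l(x) \rVert \,\bigr]
  \;=\; \int_{\mathbb{R}^{N \times T}} \lVert f(x) - l(x) \rVert \,\mathrm{d}Q(x),
\qquad f : \mathbb{R}^{N \times T} \to \mathbb{R}^{T}.
```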

Scenario-invariant time series mapping
The aim of the proposed ScITSM method is to remove the scenario specific differences in heterogeneous cyclical process manufacturing data such that the transformed data can be jointly modeled by subsequent machine learning procedures.
In principle, any regression model that accepts time series data as input can subsequently be employed, e.g. recurrent neural networks or standard machine learning methods based on features constructed from expert knowledge. From our experience, the former is usually the first choice for complex applications with very large amounts of available data, while the latter is particularly useful if only a limited amount of data is available.

Theoretical motivation
Intuitively, the error in Eq. (1) cannot be small if the target scenario is too different from the source scenarios. However, if the data distributions of the scenarios are similar, this error can be small, as shown by the following theorem (obtained as an extension of (Ben-David et al. 2010, Theorem 1) to multiple sources and time series).
Theorem 1 Consider some distributions P 1 , . . . , P S and Q on the input space R N ×T and a target function l : R N ×T → [0, 1] T . Then the bound in Eq. (2) holds for all regression models f , where the distance between two distributions is measured by the total variation distance with Borel σ-algebra B.
Theorem 1 shows that the error in the target scenario can be expected to be small if the mean over all errors in the source scenarios is small and the mean distance of the target distribution to the source distributions is small. For simplicity, Theorem 1 assumes a target feature in the unit cube, which can be realized in practice by additional normalization procedures.
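A bound of the type described in Theorem 1 can be sketched as follows. The derivation assumes that both f and l take values in [0, 1]^T, that P_i and Q have densities p_i and q with respect to a common dominating measure μ, and that the error is the expected Euclidean distance between f and l. The notation ε_P(f) and d_TV as well as the constant 2√T are products of this sketch and may differ from the original statement.

```latex
% Assumed notation: \varepsilon_P(f) = \int \lVert f - l \rVert \, dP,
% d_{TV}(P, Q) = \sup_{B \in \mathcal{B}} |P(B) - Q(B)| = \tfrac{1}{2} \int |p - q| \, d\mu  (Scheffe's identity).
\begin{align*}
\varepsilon_Q(f)
  &= \varepsilon_{P_i}(f) + \int \lVert f - l \rVert \,(q - p_i)\,\mathrm{d}\mu \\
  &\le \varepsilon_{P_i}(f) + \sqrt{T} \int \lvert q - p_i \rvert \,\mathrm{d}\mu
     && \text{since } \lVert f(x) - l(x) \rVert \le \sqrt{T} \text{ on } [0,1]^T \\
  &= \varepsilon_{P_i}(f) + 2\sqrt{T}\, d_{TV}(P_i, Q).
\end{align*}
% Averaging the inequality over i = 1, ..., S gives
\[
\varepsilon_Q(f) \;\le\; \frac{1}{S} \sum_{i=1}^{S} \varepsilon_{P_i}(f)
  \;+\; \frac{2\sqrt{T}}{S} \sum_{i=1}^{S} d_{TV}(P_i, Q).
\]
```

The two terms on the right correspond to the mean source error and the mean distance between the target distribution and the source distributions.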
Our method tries to minimize the left-hand side of Eq. (2) (target error) by mapping the data in a new space where an approximation of the right-hand side is minimized. The minimization of the second term on the right-hand side is tackled by aligning all source distributions in the new space (they move towards zero in Fig. 1, right column). The minimization of the first term is tackled by subsequent regression.
It is important to note that the alignment of only the source distributions does not minimize the second term on the right-hand side if the target distribution Q is too different from all the source distributions P 1 , . . . , P S (Ben-David et al. 2010). As there is no data given from Q in our problem setting (Sect. 3), we cannot identify such cases based on samples. As one possible solution to this problem, we propose to consider only parameter vectors p Q which represent physical dimensions of tool settings that are similar to related tool settings represented by p 1 , . . . , p S (see Fig. 2 and compare Fig. 1).

Practical implementation
Consider some source samples X 1 , . . . , X S ∈ R L×N ×T with target feature vectors Y 1 , . . . , Y S ∈ R L×T and parameter vectors p 1 , . . . , p S ∈ N P (e.g. parameters 30, 50 in Fig. 1). For simplicity of the subsequent description, the number of samples L is assumed to be equal for each scenario.
The goal of ScITSM is to compute a mapping Ψ : R N ×T × R P → R N ×T (Eq. 4) which transforms a time series x and a scenario parameter vector p to a new time series Ψ (x, p) such that the (source) distributions of Ψ (X 1 , p 1 ), . . . , Ψ (X S , p S ) are similar and such that a subsequently learned regression model f : R N ×T → R T performs well on each scenario. Here, Ψ (X , p) refers to the sample matrix that is obtained by applying Ψ (·, p) to each row of the sample matrix X .
The computation of the function Ψ in ScITSM involves three processing steps: 1. Calculation of a mean curve for each source scenario, 2. Learning of correction functions at equidistant fixed time steps, and, 3. Smooth connection of correction functions.
Step 1: Calculation of mean curves In a first step a smooth curve called mean curve is fitted for each source scenario (dashed lines in middle column of Fig. 1).
To this end, for each of the scenario samples X 1 , . . . , X S , the mean value for each of the N features and T time steps is computed, and a spline curve is subsequently fitted by means of the algorithm proposed in Dierckx (1982). This process results in a tensor X̄ ∈ R S×N ×T storing the mean curves (rows) for each of the S source scenarios.
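Step 1 can be sketched in a few lines of Python. The function name `mean_curves` and the `smoothing` argument are hypothetical. SciPy's `UnivariateSpline` wraps the FITPACK routines by Dierckx, but whether its default settings match the configuration of the cited algorithm is an assumption.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline


def mean_curves(samples, smoothing=None):
    """Return an array of shape (S, N, T): one smoothed mean curve per scenario and feature.

    samples: list of S arrays, each of shape (L, N, T).
    smoothing: passed to UnivariateSpline as `s` (None lets SciPy pick its default).
    """
    t = np.arange(samples[0].shape[2], dtype=float)
    curves = []
    for x in samples:
        mean = x.mean(axis=0)  # (N, T): average over the L time series of the scenario
        # Fit one smoothing spline per feature and evaluate it on the original time grid.
        smooth = [UnivariateSpline(t, row, s=smoothing)(t) for row in mean]
        curves.append(np.stack(smooth))
    return np.stack(curves)
```

With `smoothing=0.0` the spline interpolates the raw means; larger values give smoother curves.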
Step 2: Learning of Equidistant Corrections After the mean curves are computed, K equidistant points t 1 , . . . , t K are fixed and K corresponding correction functions Φ 1 , . . . , Φ K are learned which map a parameter vector p i corresponding to the i-th scenario close to the corresponding points of the i-th row of X̄ . This is done under the constraint of similar predictions Φ t (p i ), Φ t′ (p i ) for nearby time steps t, t′ of two points x t , x t′ on the mean curve. We apply ideas from the Multi-Task Learning approach proposed in Evgeniou et al. (2004) that aims at similar predictions by means of similar parameters θ 1 , . . . , θ K of the learning functions Φ 1 , . . . , Φ K . More precisely, we propose the objective function in Eq. (6), where X̄ i,:, j is the vector of features corresponding to the i-th scenario and the j-th time step, ‖·‖ is the Euclidean norm, ‖·‖ 1 is the 1-norm and θ k ∈ R Z refers to the parameter vector of Φ k , e.g. Φ k (p) = ⟨θ k , p⟩ + b is a linear model with Euclidean inner product ⟨·, ·⟩, parameter vector θ k ∈ R P and bias b ∈ R. The first term of Eq. (6) ensures that the predictions of the correction functions are not far away from the mean curves themselves. The second term of Eq. (6) ensures similar parameter vectors of 2R nearby correction functions, where R ∈ N and α, l ∈ R are hyper-parameters. The last term ensures sparse parameter vectors by means of L1-regularization (Andrew and Gao 2007) with hyper-parameter β ∈ R.
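The three terms described for Step 2 can be sketched as a NumPy function. This follows only the verbal description (fit to the mean curves, coupling of nearby parameter vectors, L1 sparsity) and is not a transcription of Eq. (6). In particular, the exponential weighting `exp(-l * distance)` of the coupling term, the matrix-valued weights and all names are assumptions.

```python
import numpy as np


def scitsm_objective(theta, bias, params, mean_curves, steps, alpha, beta, l, R):
    """Evaluate a ScITSM-style objective for linear correction functions.

    theta: (K, N, P) weights and bias: (K, N) offsets, so Phi_k(p) = theta[k] @ p + bias[k].
    params: (S, P) scenario parameter vectors, mean_curves: (S, N, T), steps: K time indices.
    """
    K = theta.shape[0]
    # Phi_k(p_i) for all anchors k and scenarios i -> shape (K, S, N)
    pred = np.einsum("knp,sp->ksn", theta, params) + bias[:, None, :]
    target = np.transpose(mean_curves[:, :, steps], (2, 0, 1))  # (K, S, N)
    fit = np.sum((pred - target) ** 2)

    # Couple each theta_k with up to R later neighbours (earlier ones follow by symmetry).
    # ASSUMPTION: the weight decays as exp(-l * distance between anchors).
    couple = 0.0
    for r in range(1, min(R, K - 1) + 1):
        diff = theta[r:] - theta[:-r]
        couple += np.exp(-l * r) * np.sum(diff ** 2)

    sparse = np.sum(np.abs(theta))  # L1 penalty on all weights
    return fit + alpha * couple + beta * sparse
```

Such a function could be handed to a generic optimizer; the L1 term is non-smooth, so a solver that handles it (for example a proximal or orthant-wise method) would be the natural choice.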
Step 3: Smooth Connection To obtain a time series of length T , we aim at a smooth connection of the functions Φ 1 , . . . , Φ K between the points t 1 , . . . , t K . This is done by applying ideas from moving average filtering (Makridakis and Wheelwright 1977). For a new time step t ≤ T , we denote by R(t) a set of pairs constructed from the equidistant time steps t 1 , . . . , t K in a nested order, where ⌊t⌋ (⌈t⌉) denotes the largest (smallest) number in {t 1 , . . . , t K } being smaller (larger) than t. The coordinates of the final transformation Ψ (x, p) = (Ψ 1 (x, p), . . . , Ψ T (x, p)) in Eq. (4) are obtained by subtracting a correction term from x t , where |R(t)| is the cardinality of R(t) and m ∈ (0, 1] is the smoothing hyper-parameter. That is, for each vector element x t of the time series x, a sum is subtracted which describes a (weighted) average of linear interpolations between the points Φ i and Φ j for each time step pair (i, j) ∈ R(t). ScITSM is summarized by Algorithm 1.
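The idea of Step 3 can be illustrated with a simplified sketch. It averages linear interpolations between nested anchor pairs around each time step and subtracts the result from the series. The uniform weighting, the fixed number of pairs `n_pairs` and the function names are assumptions; the role of the hyper-parameter m is not modeled here.

```python
import numpy as np


def smooth_correction(phi, anchors, T, n_pairs=2):
    """Interpolate anchor corrections phi of shape (K, N) to all T time steps.

    For each t, linear interpolations between nested anchor pairs around t are averaged.
    ASSUMPTION: uniform weights and a fixed number of pairs; hyper-parameter m is omitted.
    """
    anchors = np.asarray(anchors, dtype=float)
    K = len(anchors)
    out = np.zeros((phi.shape[1], T))
    for t in range(T):
        hi = int(np.searchsorted(anchors, t, side="left"))  # first anchor >= t
        hi = min(max(hi, 1), K - 1)
        lo = hi - 1
        vals = []
        for r in range(n_pairs):  # nested pairs (lo, hi), (lo - 1, hi + 1), ...
            i, j = lo - r, hi + r
            if i < 0 or j >= K:
                break
            w = (t - anchors[i]) / (anchors[j] - anchors[i])
            vals.append((1.0 - w) * phi[i] + w * phi[j])
        out[:, t] = np.mean(vals, axis=0)
    return out


def apply_scitsm(x, phi, anchors, n_pairs=2):
    """Subtract the smoothed correction from a series x of shape (N, T)."""
    return x - smooth_correction(phi, anchors, x.shape[1], n_pairs)
```

Here `phi[k]` stands for the value Φ k (p) of the k-th correction function for a fixed parameter vector p. Averaging over wider pairs trades exact agreement at the anchors for a smoother correction curve.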

Algorithm 1 ScITSM
Init: Setting of hyper-parameters α, β, l ∈ R, K , R ∈ N and m ∈ (0, 1] and initialization of K correction functions Φ 1 , . . . , Φ K : R P → R N
Step 1: Calculation of mean curve tensor X̄ ∈ R S×N ×T as a row-wise concatenation of the means (over rows and columns) of X 1 , . . . , X S .

Subsequent regression
Consider a transformation function Ψ : R N ×T × R P → R N ×T as computed by ScITSM, a previously unseen target scenario sample X Q = (x 1 , . . . , x L ) of size L drawn from the unknown target distribution Q over R N ×T and a corresponding parameter vector p Q ∈ N P (e.g. parameter 40 in Fig. 1). As motivated in Sect. 4.1, the distribution of the transformed sample Ψ (X Q , p Q ) is assumed to be similar to the distributions of the samples Ψ (X 1 , p 1 ), . . . , Ψ (X S , p S ) which is induced by the selection of an appropriate corresponding parameter space (see e.g. Figs. 1 and 2).
Subsequent to ScITSM, a regression function f is trained using the concatenated input sample (Ψ (X 1 , p 1 ); . . . ; Ψ (X S , p S )) and its corresponding target features (Y 1 ; . . . ; Y S ). Finally, the target features of X Q can be computed by f (Ψ (X Q , p Q )). Theorem 1 shows that the empirical error of the function f • Ψ on the target sample can be expected to be small if the sample size is large enough (i.e. the empirical error approximates the error in Eq. 1 well) and the model f performs well on the concatenated source data.
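The training and prediction flow described above can be sketched as follows. Closed-form ridge regression is used purely as a stand-in for an arbitrary regressor f, and `psi` is any callable with the signature `psi(X, p)`. Both function names, the regularization value and the restriction to N = 1 (samples of shape (L, T)) are assumptions made to keep the example short.

```python
import numpy as np


def fit_ridge(X, Y, lam=1e-3):
    """Closed-form ridge regression; a stand-in for any regression model f."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return lambda Z: np.hstack([Z, np.ones((Z.shape[0], 1))]) @ W


def train_and_predict(psi, sources, targets, params, X_q, p_q):
    """Train f on the stacked transformed sources and return f(psi(X_Q, p_Q)).

    psi(X, p) must return an array with the same shape as X, here (L, T) with N = 1.
    """
    X_train = np.vstack([psi(X, p) for X, p in zip(sources, params)])
    Y_train = np.vstack(targets)
    f = fit_ridge(X_train, Y_train)
    return f(psi(X_q, p_q))  # no target data was used for training
```

Only the parameter vector of the target scenario enters the prediction, which matches the domain generalization setting of Sect. 3.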

Use case
Intelligent manufacturing extends control systems with machine learning models trained from gathered data, e.g. Virtual Sensors (Wang and Nace 2009). We integrated our approach described in Sect. 4 into the data-flow of a machine learning pipeline used to implement a Virtual Sensor in an Intelligent Manufacturing setting similar to the one described in Fig. 1.

Dataset
Our use case consists of 11 scenarios based on physical tool settings with parameters describing physical tool dimensions as illustrated in Fig. 2. For each scenario, we collected around 50 time series. We applied some application-specific normalization and transformation steps to each time series, including its subtraction from a finite element simulation of the mechanical tool process. Some representative resulting time series from the source scenarios are illustrated in Fig. 3 on the left. For our experiments we chose 6 (out of 11) scenarios as source scenarios and 5 scenarios as target scenarios. The target scenarios are chosen such that their parametrization is well captured by the parametrization of the source scenarios (see Fig. 2).

Validation procedure
To estimate the performance of the proposed ScITSM on previously unseen scenarios, we evaluate different regression models based on an unsupervised transductive training protocol (Ganin et al. 2016;Gong et al. 2013;Chopra et al. 2013;Long et al. 2017) combined with cross-validation on source scenarios. In a first step, we select appropriate hyper-parameters in a semi-automatic way. That is, the parameters are fixed by a method expert based only on the unsupervised data from the source scenarios without considering any labels, i.e. output values, or target samples. The decision is based on visual quantification of the distribution alignment in the representation space. As a result, the hyper-parameters are the same for all subsequently trained regression models. The result of some representative time series is illustrated in Fig. 3.
For evaluating the performance of regression models trained subsequently to ScITSM, we use 10-fold cross-validation (Varma and Simon 2006). That is, in each of 10 steps, 90% of the data (90% of each source scenario) are chosen as training data and 10% as validation data.
Since no data of the target scenarios is used for training, the models are evaluated on the whole data of the target scenarios in each fold.
Using this protocol, 10 different root mean squared errors for each model and each scenario are computed, properly aggregated and (together with their standard deviation) reported in Table 1.
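The evaluation protocol above can be sketched as a short routine. The function names, the use of the mean and standard deviation as aggregation, and the `fit(X, Y)` interface returning a callable model are assumptions; the text only states that the fold errors are properly aggregated.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error over all entries."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def evaluate(fit, sources, targets_src, X_q, Y_q, n_folds=10, seed=0):
    """Return mean and standard deviation of the target RMSE over the folds.

    fit(X, Y) must return a callable model; sources and targets_src are
    lists with one array per source scenario.
    """
    rng = np.random.default_rng(seed)
    # Split every source scenario separately, so each fold holds out ~10% of each scenario.
    splits = [np.array_split(rng.permutation(len(X)), n_folds) for X in sources]
    scores = []
    for k in range(n_folds):
        X_tr, Y_tr = [], []
        for X, Y, parts in zip(sources, targets_src, splits):
            keep = np.concatenate([p for j, p in enumerate(parts) if j != k])
            X_tr.append(X[keep])
            Y_tr.append(Y[keep])
        model = fit(np.vstack(X_tr), np.vstack(Y_tr))
        scores.append(rmse(Y_q, model(X_q)))  # the full target scenario is scored in every fold
    return float(np.mean(scores)), float(np.std(scores))
```

The held-out 10% of each source scenario is not scored in this sketch; it would be used to report source-scenario errors in the same loop.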
To show the advantage of using more than one source scenario, we additionally optimize each regression model using the training data of only a single source scenario (see Table 2). In the tables, best values of scenarios are shown in boldface and improvements of ScITSM are shown by italic numbers. We compare the following regression models and we use the following parameter sets for selection:
- Support Vector Regression (Smola and Schölkopf 2004) (SVR) with sigmoid kernel: The epsilon parameter is selected from the set {10^-1, 10^-2, 10^-3}, the parameter C is selected in {10^-5, 5 · 10^-4, 10^-4, 5 · 10^-3, 10^-3} and the algorithm is stopped when a selected error in the set {10^-3, 10^-5} is reached.
- Support Vector Regression with RBF kernel: The epsilon parameter is selected from the set {10^-1, 10^-2, 10^-3}, the parameter C is selected in {10, 25, 30}, the bandwidth parameter is selected in the set {10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1} and the algorithm is stopped when a selected error in the set {10^-3, 10^-2} is reached.

Results
Figure 3 illustrates some selected time series pre-processed by ScITSM. It can be seen that the diversity caused by different source scenarios is reduced, resulting in more homogeneous time series for subsequent regression.
Table 1 shows the results of applying ScITSM to multiple source scenarios. The application of ScITSM improves all regression models in average root mean squared error except the support vector regression model based on the RBF kernel.
The scenario (2, 40) is the only scenario where the application of ScITSM reduces the performance of support vector regression models by a large margin. From Fig. 2 it can be seen that neither of the tool dimensions 2 and 40 is considered in the source scenarios. We conclude that at least one dimension should be considered in the source scenarios in our use case; otherwise the scenario distributions are too different. This well-known phenomenon is often called negative transfer.
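The conclusion above suggests a simple sanity check before applying a learned transformation to a new scenario: at least one of its tool dimensions should also occur in a source scenario. The following sketch illustrates this rule; the function name and the source parameter values in the usage note are hypothetical and are not the settings of the use case.

```python
def shares_dimension(target, sources):
    """Return True if at least one coordinate of `target` also occurs,
    at the same position, in some source parameter vector."""
    return any(
        t == s[i]
        for i, t in enumerate(target)
        for s in sources
    )
```

For hypothetical source settings `[(1, 30), (1, 50), (3, 30)]`, the call `shares_dimension((2, 40), sources)` returns `False`, flagging a scenario at risk of negative transfer, while `shares_dimension((1, 40), sources)` returns `True`.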
It is interesting to observe that the random forest models 'overfit' the source scenarios. This can be seen by a low average root mean squared error on the source scenarios compared to the target scenarios. Consequently, it is hard for ScITSM to improve the performance on the source scenarios (average error decreased to 97.59% of that of the raw models), whereas the target scenario errors are improved by a large margin. The target scenario improvement is computed without considering scenario (4, 60), where the random forest model performed best over all models. This is not unexpected, as the 'overfitting' of source scenarios can imply performance improvements in some very similar target scenarios. However, our goal is an improvement in many scenarios, not in single ones.
In general ScITSM improves the results of regression models in 9 out of 11 scenarios, where the remaining two results have explainable reasons of negative transfer and overfitting.
In principle it is possible that a high root mean squared error of the models without ScITSM is caused by mixing data from different scenarios, i.e. negative transfer happens. To exclude this possibility, we trained one model for each scenario and computed the root mean squared error for all other scenarios.
In a first step, we observed that no model is able to generalize to scenarios other than the single training one. The resulting root mean squared errors of the models trained on single scenarios are excessively high and give no further information that can be reported in this work. One possible reason is that the scenarios are too different. For example, consider a model trained on the yellow time series in Fig. 3 on the left and tested on the green time series. This experiment underpins that generalization is not possible for models trained only on single scenarios (standard regression case) and that the considered problem of domain generalization is important in our use case.
It is interesting to observe that even models trained on single scenarios (standard regression case) can be improved by considering data from different scenarios. To see this, consider Table 2. Each column denoted by 'without ScITSM' shows the performance of different models trained on data from a single scenario only (shown by the row). This is in contrast to Table 1, where each column shows errors of the same model on different scenarios. Applying ScITSM to data from other scenarios almost always improves the performance of (standard) regression models. This is interesting, as one may expect that models trained on data from a specific scenario cannot be improved by data from different scenarios. However, this positive effect of transfer learning can happen.
Another interesting question is about the effect of ScITSM when the amount of source scenario samples decreases. To this end, we consider the average root mean squared errors over all target scenarios of the best regression models, i.e. SVR with RBF kernel, for a varying number of source samples. The result is shown in Fig. 4. It can be seen that the positive effect of ScITSM gets even stronger when the sample size of all scenarios decreases by a certain percentage value.
Our procedure of choosing appropriate parameters for ScITSM requires expert knowledge about our method. In our use case, long-term knowledge from several years resulted in a well-performing default setting. It is interesting to observe that this default setting gives a high performance independently of the data size (see Fig. 4). It is important to note that the selection of appropriate parameters is difficult in the considered problem of domain generalization, as no data of the target scenarios is given. Classical cross-validation procedures cannot be used, as they would suffer from an unbounded bias in the generalization error estimate (Zhong et al. 2010). Finding appropriate parameters for transfer learning is an active research area (You et al. 2019). Most methods rely on a small set of data from the target scenarios (Long et al. 2012; Ganin et al. 2016) or fix their parameters (Zellinger et al. 2017) to some default values. Unfortunately, neither variant can be used in our industrial use case. Note that, by using this method, the resulting performance of the regression models in the source scenarios cannot be directly interpreted as an estimate of the generalization error. However, in this work, we are more interested in the generalization error of the unseen target scenarios, which is not affected.
We finally conclude that our method successfully enables the improvement of the performance of regression models in previously unseen scenarios by using information from multiple similar source scenarios. The result is obtained by a single regression model, which is conceptually and computationally simpler than the application of multiple single models for separate scenarios.

Conclusion and future work
A multi-source transfer learning method for time series data is proposed. The method transforms the data into a new space such that the distributions of samples produced by multiple different tool settings are aligned. Domain knowledge is incorporated by means of corresponding tool dimensions. In a real-world application of industrial manufacturing, the proposed method significantly reduces the prediction error on data originating from already seen tool settings. The biggest benefit of the proposed method is that it can be applied to data from new, unseen tool settings without the need for time- and cost-intensive collection of training data using these settings.
Unfortunately, parameter selection becomes an important issue without data from unseen tool settings. Without such data, it is also hard to identify wrong expert knowledge used in our work to select appropriate future settings.
However, small amounts of (possibly unlabeled) data from new tool settings could be used to improve the parameter selection process in the future. These small amounts of data could also be used to overcome the phenomenon of negative transfer by strengthening the similarity assessment of data distributions from different tool settings.
In the proof of Theorem 1, the last equality follows from the application of Lemma 2.1 in Tsybakov (2008).