Multisource transfer learning of time series in cyclical manufacturing
Abstract
This paper describes a new transfer learning method for modeling sensor time series following multiple different distributions, e.g. originating from multiple different tool settings. The method aims at removing distribution-specific information before the modeling of the individual time series takes place. This is done by mapping the data to a new space such that the representations of different distributions are aligned. Domain knowledge is incorporated by means of corresponding parameters, e.g. physical dimensions of tool settings. Results on a real-world problem of industrial manufacturing show that our method is able to significantly improve the performance of regression models on time series following previously unseen distributions.
Keywords
Transfer learning, Multisource transfer learning, Regression, Domain generalization, Domain adaptation
Introduction
Standard machine learning techniques rely on the assumption that the entire data, both for training and for testing, follows the same distribution. However, this assumption can be violated. In particular, in cyclical manufacturing processes, data is often collected from different operating conditions and environments—called scenarios.
One example is the drilling of steel components (Pena et al. 2005; Ferreiro et al. 2012), where different machine settings can lead to different torque curves over time. A second example is the regression of spectroscopic measurements, where different instrumental responses, environmental conditions, or sample matrices can lead to different training and test measurements (Nikzad-Langerodi et al. 2018; Malli et al. 2017). Other examples can be found in the optical inspection of textures or surfaces (Malaca et al. 2016; Stübl et al. 2012; Zăvoianu et al. 2017), where different lighting conditions and texture classes can lead to variations in measurements.
Approaching such heterogeneities in data with standard machine learning techniques requires modeling each scenario independently, which often entails expensive and time-consuming data collection. To overcome this problem, approaches from the field of transfer learning (Pan and Yang 2010) have been proposed. Transfer learning aims at extracting knowledge from source scenarios (with large amounts of possibly labeled data) and applying it to the modeling of target scenarios (with little or no available data).
In this paper we address the problem of domain generalization (Muandet et al. 2013), where, assuming enough data from a representative set of (source) scenarios, no data at all is required for the generalization to previously unknown (target) scenarios. We aim at predicting time series from target scenarios arising in cyclical process problems in manufacturing, e.g. torque curves.
We propose a new transfer learning method called Scenario-Invariant Time Series Mapping (ScITSM) that leverages available information in multiple similar scenarios and applies it to the prediction of previously unseen scenarios (without available training data).
ScITSM does so by mapping the data into a new space where the scenario-specific data distributions are aligned, such that subsequent joint modeling of all transformed data samples is possible. The proposed method is based on the idea of the parameter-based multi-task learning approach presented in Zhang and Yang (2017), where coefficients of neighboring models are either shared or forced to be similar. Our method differs from the approach in Zhang and Yang (2017) by the incorporation of expert knowledge and by its application to time series data. The corrected data from different scenarios is more homogeneous and easier to learn by subsequent machine learning tasks. Furthermore, the learned correction formulas generalize to unseen scenarios. To the best of our knowledge, no comparable methods exist that were specifically designed for time series data. The ScITSM method is illustrated in Fig. 1.
The performance of the new algorithm is demonstrated by experiments on a real-world intelligent manufacturing problem. Details of the application must be kept confidential, so it is introduced here in an abstracted way. In particular, a schematic sketch of the application is shown in Fig. 1, the results of the experiments are presented and parts of the collected and preprocessed data are shown. The results indicate that prediction accuracy can be significantly improved by ScITSM.
Related work
Transfer learning techniques are commonly applied in the areas of computer vision, natural language processing, biology, finance, business management and control applications; see e.g. Lu et al. (2015), Grubinger et al. (2016, 2017b), Zellinger et al. (2016, 2017) and references therein. Published work on manufacturing applications is relatively scarce. Successful applications in chemistry-oriented manufacturing processes using chemometric modeling techniques are presented in Nikzad-Langerodi et al. (2018) and Malli et al. (2017). Another successful application of transfer learning in intelligent manufacturing for improving product quality was presented in Luis et al. (2010).
The presented method corresponds to the transfer learning subtask of domain generalization (Muandet et al. 2013), which, in contrast to other popular transfer learning subtasks like domain adaptation (Zellinger et al. 2017, 2019), does not require any process data measurements of the target scenarios. Many existing domain generalization algorithms can be found in the area of kernel methods (Muandet et al. 2013; Grubinger et al. 2015, 2017a, b; Blanchard et al. 2017; Deshmukh et al. 2017; Gan et al. 2016; Erfani et al. 2016). These algorithms first map the source scenarios into a high-dimensional kernel Hilbert space where the different data distributions are aligned and subsequently train a prediction model. Neural network based domain generalization approaches were presented in Ghifary et al. (2015) and Li et al. (2017a, b). Domain generalization was also combined with SVM (Niu et al. 2015; Xu et al. 2014) and DC-programming (Hoffman et al. 2017).
To the best of our knowledge there is no domain generalization method that accounts for multiple source domains and temporal information in time series data.
Formal problem statement
For simplicity, we formulate the problem of multisource domain generalization for time series of equal length T. Such time series are obtained as results of subsampling procedures, as is the case in our application in Sect. 5.
Following Muandet et al. (2013), Ben-David et al. (2010) and Zellinger et al. (2019), we consider distributions \(P_1\), \(\ldots \,\), \(P_S\) and Q over the input space \({\mathbb {R}}^{N\times T}\) which represent S source scenarios and one target scenario, respectively, where N represents the number of features. In this work, we assume that each of the \(S+1\) scenarios comes with a corresponding parameter vector \(\mathbf p_1,\ldots ,\mathbf p_S,\mathbf p_Q\in {\mathbb {R}}^P\), e.g. corresponding tool dimensions or material properties. Note that the parameter vectors are not the parameters of the distributions \(P_1,\ldots ,P_S,Q\).
Following Sugiyama and Kawanabe (2012) and Ben-David and Urner (2014), we consider an unknown target function \(l:{\mathbb {R}}^{N\times T}\rightarrow {\mathbb {R}}^T\).
Scenario-invariant time series mapping
The aim of the proposed ScITSM method is to remove the scenario-specific differences in heterogeneous cyclical process manufacturing data such that the transformed data can be jointly modeled by subsequent machine learning procedures. In principle, any regression model that accepts time series data as input can subsequently be employed, e.g. recurrent neural networks or standard machine learning methods based on features derived from expert knowledge. From our experience, the former is usually the first choice for complex applications with very large amounts of available data, while the latter is particularly useful if only a limited amount of data is available.
Theoretical motivation
Intuitively, the error in Eq. (1) cannot be small if the target scenario is too different from the source scenarios. However, if the data distributions of the scenarios are similar, this error can be small, as shown by the following theorem (obtained as an extension of (Ben-David et al. 2010, Theorem 1) to multiple sources and time series).
Theorem 1
Proof
See “Appendix”. \(\square \)
Theorem 1 shows that the error in the target scenario can be expected to be small if the mean over all errors in the source scenarios is small and the mean distance of the target distribution to the source distributions is small. For simplicity, Theorem 1 assumes a target feature in the unit cube, which can be realized in practice by additional normalization procedures.
It is important to note that the alignment of only the source distributions does not minimize the second term on the right-hand side if the target distribution Q is too different from all the source distributions \(P_1,\ldots ,P_S\) (Ben-David et al. 2010). As there is no data given from Q in our problem setting (Sect. 3), we cannot identify such cases based on samples. As one possible solution to this problem, we propose to consider only parameter vectors \(\mathbf p_Q\) that represent physical dimensions of tool settings similar to the tool settings represented by \(\mathbf p_1,\ldots ,\mathbf p_S\) (see Fig. 2 and compare Fig. 1).
Practical implementation
Consider some source samples \(X_1,\ldots ,X_S\in {\mathbb {R}}^{L\times N\times T}\) with target feature vectors \(Y_1,\ldots ,Y_S\in {\mathbb {R}}^{L\times T}\) and parameter vectors \(\mathbf p_1,\ldots ,\mathbf p_S\in {\mathbb {N}}^P\) (e.g. parameters 30, 50 in Fig. 1). For simplicity of the subsequent description, the number of samples L is assumed to be equal for each scenario.
Here, \(\varPsi (X,\mathbf p)\) refers to the sample matrix that is obtained by applying \(\varPsi (\cdot ,\mathbf p)\) to each row of the sample matrix X.
The computation of the function \(\varPsi \) in ScITSM involves three processing steps: 1. Calculation of a mean curve for each source scenario, 2. Learning of correction functions at equidistant fixed time steps, and, 3. Smooth connection of correction functions.
Step 1: Calculation of mean curves. In a first step, a smooth curve called the mean curve is fitted for each source scenario (dashed lines in the middle column of Fig. 1).
To this end, for each of the scenario samples \(X_1,\ldots ,X_S\), the mean value for each of the N features and T time steps is computed, and a spline curve is subsequently fitted by means of the algorithm proposed in Dierckx (1982). This process results in a matrix \({\widehat{X}}\in {\mathbb {R}}^{S\times N\times T}\) storing the mean curves (rows) for each of the S source scenarios.
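Step 1 can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the pointwise mean follows the description above, but the spline fit of Dierckx (1982) is replaced by a simple centered moving average as a stand-in smoother, and the function names `mean_curve`, `smooth` and `mean_curves` as well as the toy data are hypothetical.

```python
from statistics import fmean


def mean_curve(samples):
    """Pointwise mean over the L samples of one scenario.

    samples: list of L samples, each a list of N features,
             each feature a list of T values.
    Returns an N x T nested list.
    """
    n_features = len(samples[0])
    n_steps = len(samples[0][0])
    return [
        [fmean(sample[n][t] for sample in samples) for t in range(n_steps)]
        for n in range(n_features)
    ]


def smooth(curve, half_window=1):
    """Centered moving average (stand-in for the spline fit, an assumption)."""
    n_steps = len(curve)
    smoothed = []
    for t in range(n_steps):
        lo = max(0, t - half_window)
        hi = min(n_steps, t + half_window + 1)
        smoothed.append(fmean(curve[lo:hi]))
    return smoothed


def mean_curves(scenario_samples, half_window=1):
    """Build the S x N x T collection of smoothed mean curves."""
    return [
        [smooth(feature, half_window) for feature in mean_curve(samples)]
        for samples in scenario_samples
    ]


# Toy data: S=2 scenarios, L=2 samples, N=1 feature, T=4 time steps.
X1 = [[[0.0, 1.0, 2.0, 3.0]], [[2.0, 3.0, 4.0, 5.0]]]
X2 = [[[1.0, 1.0, 1.0, 1.0]], [[3.0, 3.0, 3.0, 3.0]]]
X_hat = mean_curves([X1, X2])
print(X_hat[0][0])  # smoothed mean curve of scenario 1, feature 1
```

In practice, a smoothing spline routine would take the place of `smooth`; the moving average is used here only to keep the sketch free of external dependencies.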
Subsequent regression
Use case
Intelligent manufacturing extends control systems with machine learning models trained from gathered data, e.g. Virtual Sensors (Wang and Nace 2009). We integrated our approach described in Sect. 4 into the dataflow of a machine learning pipeline used to implement a Virtual Sensor in an Intelligent Manufacturing setting similar to the one described in Fig. 1.
Dataset
Our use case consists of 11 scenarios based on physical tool settings with parameters describing physical tool dimensions, as illustrated in Fig. 2. For each scenario, we collected around 50 time series. We applied some application-specific normalization and transformation steps to each time series, including its subtraction from a finite element simulation of the mechanical tool process. Some representative resulting time series from the source scenarios are illustrated in Fig. 3 on the left. For our experiments we chose 6 (out of 11) scenarios as source scenarios and 5 scenarios as target scenarios. The target scenarios are chosen such that their parametrization is well captured by the parametrization of the source scenarios (see Fig. 2).
Validation procedure
To estimate the performance of the proposed ScITSM on previously unseen scenarios, we evaluate different regression models based on an unsupervised transductive training protocol (Ganin et al. 2016; Gong et al. 2013; Chopra et al. 2013; Long et al. 2017) combined with cross-validation on source scenarios.
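Table 1 Root mean squared error (and standard deviation) of regression models evaluated using tenfold cross-validation as described in Sect. 5.2. In each block, the first six scenarios are source scenarios and the remaining five are target scenarios; Perc. is the error with ScITSM as a percentage of the error without ScITSM
Scenario  Without ScITSM  With ScITSM  Perc.  Without ScITSM  With ScITSM  Perc. 
Bayesian ridge  Random forest  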
(1, 30)  0.443 (0.082)  0.239 (0.056)  53.93  0.259 (0.109)  0.262 (0.082)  101.13 
(1, 50)  0.645 (0.070)  0.359 (0.103)  55.69  0.322 (0.140)  0.311 (0.111)  96.62 
(1, 100)  0.431 (0.140)  0.299 (0.070)  69.34  0.308 (0.090)  0.267 (0.064)  86.48 
(4, 30)  0.690 (0.117)  0.334 (0.077)  48.47  0.346 (0.095)  0.372 (0.064)  107.31 
(4, 50)  0.431 (0.052)  0.243 (0.090)  56.44  0.317 (0.098)  0.238 (0.051)  75.11 
(4, 100)  0.488 (0.105)  0.235 (0.064)  48.05  0.197 (0.077)  0.234 (0.101)  118.87 
Average  0.521 (0.094)  0.285 (0.077)  55.32  0.292 (0.102)  0.281 (0.079)  97.59 
(1, 40)  0.523 (0.078)  0.403 (0.125)  77.12  0.707 (0.243)  0.418 (0.163)  59.12 
(1, 60)  0.709 (0.058)  0.394 (0.087)  55.54  0.461 (0.148)  0.381 (0.099)  82.72 
(2, 40)  0.576 (0.092)  0.426 (0.117)  73.90  0.949 (0.236)  0.440 (0.108)  46.34 
(4, 40)  0.426 (0.031)  0.342 (0.076)  80.30  1.062 (0.238)  0.399 (0.114)  37.57 
(4, 60)  0.519 (0.110)  0.371 (0.142)  71.58  0.291 (0.060)  0.395 (0.165)  135.76 
Average  0.551 (0.074)  0.387 (0.109)  71.69  0.694 (0.185)  0.407 (0.130)  72.30 
SVR (sigmoid)  SVR (RBF)  
(1, 30)  0.586 (0.114)  0.253 (0.081)  43.17  0.243 (0.072)  0.238 (0.068)  97.64 
(1, 50)  0.519 (0.221)  0.364 (0.170)  70.15  0.229 (0.092)  0.226 (0.078)  98.46 
(1, 100)  0.694 (0.202)  0.379 (0.159)  54.63  0.249 (0.064)  0.242 (0.070)  97.26 
(4, 30)  1.697 (0.341)  0.407 (0.067)  23.97  0.342 (0.122)  0.294 (0.098)  85.95 
(4, 50)  0.363 (0.154)  0.325 (0.141)  89.66  0.201 (0.060)  0.192 (0.042)  95.71 
(4, 100)  0.682 (0.199)  0.341 (0.090)  49.93  0.186 (0.059)  0.166 (0.032)  89.00 
Average  0.757 (0.205)  0.345 (0.118)  55.25  0.242 (0.078)  0.226 (0.065)  93.28 
(1, 40)  0.491 (0.142)  0.483 (0.134)  98.34  0.445 (0.151)  0.387 (0.129)  87.13 
(1, 60)  0.637 (0.208)  0.450 (0.134)  70.70  0.337 (0.079)  0.321 (0.064)  95.24 
(2, 40)  0.518 (0.085)  0.570 (0.158)  109.95  0.314 (0.055)  0.385 (0.096)  122.72 
(4, 40)  0.684 (0.189)  0.452 (0.153)  66.08  0.382 (0.156)  0.378 (0.156)  98.66 
(4, 60)  0.507 (0.202)  0.487 (0.196)  96.08  0.334 (0.056)  0.339 (0.134)  101.45 
Average  0.567 (0.165)  0.488 (0.155)  88.23  0.362 (0.099)  0.363 (0.116)  101.04 
For evaluating the performance of regression models trained subsequently to ScITSM we use 10-fold cross-validation (Varma and Simon 2006). That is, in each of 10 steps, \(90\%\) of the data (\(90\%\) of each source scenario) are chosen as training data and \(10\%\) as validation data.
Since no data of the target scenarios is used for training, the models are evaluated on the whole data of the target scenarios in each fold.
Using this protocol, 10 different root mean squared errors are computed for each model and each scenario, aggregated and reported (together with their standard deviation) in Table 1.
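The evaluation protocol can be sketched as follows. This is an illustration under simplifying assumptions rather than the authors' code: targets are scalars instead of time series, the ten errors are aggregated by their mean, and `MeanRegressor`, `kfold_indices` and `evaluate` are hypothetical names introduced here. Any regression model with `fit` and `predict` methods could take the place of the placeholder model.

```python
import math
import random
from statistics import fmean, pstdev


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    return math.sqrt(fmean((a - b) ** 2 for a, b in zip(y_true, y_pred)))


def kfold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


class MeanRegressor:
    """Placeholder model that always predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = fmean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def evaluate(sources, targets, make_model, k=10, seed=0):
    """sources/targets map a scenario name to a pair (X, y)."""
    rng = random.Random(seed)
    # Each source scenario is split separately, so every fold holds out
    # roughly 10% of each source scenario.
    folds = {s: kfold_indices(len(y), k, rng) for s, (X, y) in sources.items()}
    errors = {s: [] for s in list(sources) + list(targets)}
    for f in range(k):
        X_train, y_train = [], []
        for s, (X, y) in sources.items():
            held_out = set(folds[s][f])
            for i in range(len(y)):
                if i not in held_out:
                    X_train.append(X[i])
                    y_train.append(y[i])
        model = make_model().fit(X_train, y_train)
        # Source scenarios: error on the held-out part of this fold.
        for s, (X, y) in sources.items():
            val = folds[s][f]
            pred = model.predict([X[i] for i in val])
            errors[s].append(rmse([y[i] for i in val], pred))
        # Target scenarios: never used for training, evaluated in full.
        for s, (X, y) in targets.items():
            errors[s].append(rmse(y, model.predict(X)))
    return {s: (fmean(e), pstdev(e)) for s, e in errors.items()}


# Toy data: two source scenarios and one target scenario.
sources = {"A": ([[0.0]] * 20, [1.0] * 20), "B": ([[0.0]] * 20, [3.0] * 20)}
targets = {"Q": ([[0.0]] * 5, [2.0] * 5)}
results = evaluate(sources, targets, MeanRegressor)
print(results)
```

Each entry of `results` holds the mean and standard deviation of the ten fold errors for one scenario, mirroring the layout of the entries in Table 1.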
To show the advantage of using more than one source scenario, we additionally optimize each regression model using the training data of only a single source scenario (see Table 2).

Bayesian Ridge Regression (MacKay 1992): The four gamma priors are searched in the set \(\{10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}\) and the iterative algorithm is stopped when a selected error in the set \(\{10^{-2}, 10^{-3},10^{-4}, 10^{-5}\}\) is reached.

Random Forest (Breiman 2001): We used 100 estimators, the maximum depth is searched in the set \(\{1,2,4, 8,\ldots ,\infty \}\), where \(\infty \) refers to an expansion until all leaves are pure, and the minimum number of splits is selected in the set \(\{2,4,8,\ldots ,1024\}\).

Support Vector Regression (Smola and Schölkopf 2004) (SVR) with sigmoid kernel: The epsilon parameter is selected from the set \(\{10^{-1},10^{-2},10^{-3}\}\), the parameter C is selected in \(\{10^{5},5\cdot 10^{4},10^{4},5\cdot 10^{3},10^{3}\}\) and the algorithm is stopped when a selected error in the set \(\{10^{-3}, 10^{-5}\}\) is reached.

Support Vector Regression with RBF kernel: The epsilon parameter is selected from the set \(\{10^{-1},10^{-2},10^{-3}\}\), the parameter C is selected in \(\{10, 25, 30\}\), the bandwidth parameter is selected in the set \(\{10^{-5},10^{-4},10^{-3},10^{-2},10^{-1},1\}\) and the algorithm is stopped when a selected error in the set \(\{10^{-3}, 10^{-2}\}\) is reached.
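To give a sense of the size of such a search space, the following sketch enumerates the candidate settings for the SVR with RBF kernel. It assumes that the exponents of the epsilon, bandwidth and stopping-tolerance values are negative, and the dictionary keys are names chosen here for illustration. How the best candidate is selected is not covered by the sketch.

```python
from itertools import product

# Search space for SVR with RBF kernel (negative exponents are an assumption).
svr_rbf_grid = {
    "epsilon": [1e-1, 1e-2, 1e-3],
    "C": [10, 25, 30],
    "bandwidth": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1],
    "tolerance": [1e-3, 1e-2],
}

names = list(svr_rbf_grid)
# One dictionary per combination of parameter values.
candidates = [
    dict(zip(names, values)) for values in product(*svr_rbf_grid.values())
]
print(len(candidates))  # 3 * 3 * 6 * 2 combinations
```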
Table 2 Root mean squared error (and standard deviation) of regression models trained and evaluated on a single source scenario, i.e. one model per scenario, using tenfold cross-validation as described in Sect. 5.2
Scenario  Without ScITSM  With ScITSM  Perc.  Without ScITSM  With ScITSM  Perc. 
Bayesian ridge  Random forest  
(1,30)  0.215 (0.069)  0.210 (0.065)  97.66  0.255 (0.079)  0.261 (0.078)  102.15 
(1,50)  0.202 (0.047)  0.202 (0.048)  100.00  0.370 (0.172)  0.352 (0.151)  95.05 
(1,100)  0.342 (0.112)  0.341 (0.109)  99.67  0.325 (0.100)  0.330 (0.127)  101.55 
(4,30)  0.275 (0.072)  0.275 (0.074)  100.09  0.351 (0.090)  0.334 (0.094)  95.00 
(4,50)  0.217 (0.069)  0.217 (0.070)  100.00  0.301 (0.091)  0.292 (0.081)  96.84 
(4,100)  0.197 (0.057)  0.196 (0.058)  99.42  0.240 (0.058)  0.269 (0.095)  111.70 
SVR (sigmoid)  SVR (RBF)  
(1,30)  0.404 (0.096)  0.273 (0.099)  67.54  0.390 (0.157)  0.380 (0.161)  97.42 
(1,50)  0.486 (0.223)  0.394 (0.222)  81.01  0.364 (0.173)  0.357 (0.159)  98.12 
(1,100)  0.656 (0.229)  0.405 (0.167)  61.72  0.360 (0.201)  0.369 (0.194)  102.24 
(4,30)  1.130 (0.149)  0.440 (0.071)  38.97  0.502 (0.298)  0.438 (0.244)  87.17 
(4,50)  0.382 (0.176)  0.354 (0.174)  92.76  0.323 (0.108)  0.322 (0.110)  99.67 
(4,100)  0.580 (0.181)  0.364 (0.094)  62.80  0.215 (0.080)  0.234 (0.102)  108.64 
Results
Figure 3 illustrates some selected time series preprocessed by ScITSM. It can be seen that the diversity caused by different source scenarios is reduced, resulting in more homogeneous time series for subsequent regression.
Table 1 shows the results of applying ScITSM to multiple source scenarios. The application of ScITSM improves the average root mean squared error of all regression models, except for the support vector regression model with RBF kernel on the target scenarios (\(101.04\%\) of the error without ScITSM).
Scenario (2, 40) is the only scenario where the application of ScITSM reduces the performance of the support vector regression models by a large margin. From Fig. 2 it can be seen that neither of the tool dimensions 2 and 40 is considered in the source scenarios. We conclude that, in our use case, at least one dimension should be considered in the source scenarios; otherwise the scenario distributions are too different. This well-known phenomenon is often called negative transfer (Pan et al. 2010).
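The observation about scenario (2, 40) can be checked mechanically. The sketch below uses the scenario labels from Table 1 and flags every target scenario that shares no tool dimension with any source scenario; the helper name `shares_dimension` is hypothetical, and the rule itself is only the heuristic stated above for this use case.

```python
# Scenario labels (tool dimensions) as listed in Table 1.
SOURCES = [(1, 30), (1, 50), (1, 100), (4, 30), (4, 50), (4, 100)]
TARGETS = [(1, 40), (1, 60), (2, 40), (4, 40), (4, 60)]


def shares_dimension(target, sources):
    """True if at least one tool dimension of target occurs,
    at the same position, in some source scenario."""
    return any(
        value == source[i]
        for source in sources
        for i, value in enumerate(target)
    )


# Target scenarios for which negative transfer may be expected.
uncovered = [t for t in TARGETS if not shares_dimension(t, SOURCES)]
print(uncovered)
```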
It is interesting to observe that the random forest models ‘overfit’ the source scenarios. This can be seen from the low average root mean squared error on the source scenarios compared to the target scenarios. Consequently, it is hard for ScITSM to improve the performance on the source scenarios (average error decreased to \(97.59\%\) of that of the raw models), whereas the target scenario errors are improved by a large margin. The exception among the target scenarios is scenario (4, 60), where the random forest model without ScITSM performed best over all models. This good performance of the raw model is not unexpected, as the ‘overfitting’ of source scenarios can imply performance improvements in some very similar target scenarios. However, our goal is an improvement in many scenarios, not in single ones.
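A short calculation shows how the \(97.59\%\) figure relates to the entries of Table 1. The numbers below are copied from the random forest columns for the six source scenarios. The reported value appears to be the mean of the per-scenario percentages rather than the ratio of the two average errors; this reading is inferred from the table and is not stated in the text.

```python
from statistics import fmean

# Random forest, source scenarios: Perc. column of Table 1.
rf_source_perc = [101.13, 96.62, 86.48, 107.31, 75.11, 118.87]
# Random forest, source scenarios: "Average" row of Table 1.
avg_without, avg_with = 0.292, 0.281

mean_of_percentages = round(fmean(rf_source_perc), 2)
ratio_of_averages = round(100 * avg_with / avg_without, 2)
print(mean_of_percentages, ratio_of_averages)
```

The first value reproduces the reported average, while the second differs from it by more than one percentage point.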
In general ScITSM improves the results of regression models in 9 out of 11 scenarios, where the remaining two results have explainable reasons of negative transfer and overfitting.
In principle it is possible that a high root mean squared error of the models without ScITSM is caused by mixing data from different scenarios, i.e. negative transfer happens. To exclude this possibility, we trained one model for each scenario and computed the root mean squared error for all other scenarios.
In a first step, we observed that no model is able to generalize to scenarios other than the single training one. The resulting root mean squared errors of the single scenario trained models are excessively high and give no further information that can be reported in this work. One possible reason is that the scenarios are too different. For example, consider a model trained on the yellow time series in Fig. 3 on the left and tested on data on the green time series. This experiment underpins that generalization is not possible for models trained only on single scenarios (standard regression case) and that the considered problem of domain generalization is important in our use case.
It is interesting to observe that even models trained on single scenarios (standard regression case) can be improved by considering data from different scenarios. To see this, consider Table 2. Each column denoted by ‘without ScITSM’ shows the performance of different models trained on data from a single scenario only (shown by the row). This is in contrast to Table 1, where each column shows errors of the same model on different scenarios. Applying ScITSM to data from other scenarios almost always improves the performance of (standard) regression models. This is interesting, as one may expect that models trained on data from a specific scenario cannot be improved by data from different scenarios. However, this positive effect of transfer learning can happen when a high number of scenarios is considered with a comparatively low number of samples.
Our procedure of choosing appropriate parameters for ScITSM requires expert knowledge about our method. In our use case, long-term knowledge from several years resulted in a well-performing default setting. It is interesting to observe that this default setting gives a high performance independently of the data size (see Fig. 4). It is important to note that the selection of appropriate parameters is challenging in the considered problem of domain generalization, as no data of the target scenarios is given. Classical cross-validation procedures cannot be used, as they would suffer from an unbounded bias in the generalization error estimate (Zhong et al. 2010). Finding appropriate parameters for transfer learning is an active research area (You et al. 2019). Most methods rely on a small set of data from the target scenarios (Long et al. 2012; Ganin et al. 2016) or fix their parameters (Zellinger et al. 2017) to some default values. Unfortunately, neither variant can be used in our industrial use case. Note that, by using this method, the resulting performance of the regression models in the source scenarios cannot be directly interpreted as an estimate of the generalization error. However, in this work, we are more interested in the generalization error of the unseen target scenarios, which is not affected.
We finally conclude that our method successfully enables the improvement of the performance of regression models in previously unseen scenarios by using information from multiple similar source scenarios. The result is obtained by a single regression model, which is conceptually and computationally simpler than the application of multiple single models for separate scenarios.
Conclusion and future work
A multisource transfer learning method for time series data is proposed. The method transforms the data into a new space such that the distributions of samples produced by multiple different tool settings are aligned. Domain knowledge is incorporated by means of corresponding tool dimensions. In a real-world application of industrial manufacturing, the proposed method significantly reduces the prediction error on data originating from already seen tool settings. The biggest benefit of the proposed method is that it can be applied to data from new, unseen tool settings without the need for time- and cost-intensive collection of training data using these settings.
Unfortunately, parameter selection becomes an important issue without data from unseen tool settings. Without such data, it is also hard to identify wrong expert knowledge used in our work to select appropriate future settings.
However, small amounts of (possibly unlabeled) data from new tool settings could be used to improve the parameter selection process in the future. These small amounts of data could also be used to overcome the phenomenon of negative transfer by strengthening the similarity assessment of data distributions from different tool settings.
Notes
Acknowledgements
Open access funding provided by Johannes Kepler University Linz. This work was partially funded by SCCH within the Austrian COMET programme. We thank Ciprian Zavoianu for helpful discussions. The fourth author acknowledges the support by the LCM-K2 Center within the framework of the Austrian COMET-K2 program.
References
Andrew, G., & Gao, J. (2007). Scalable training of L1-regularized log-linear models. In Proceedings of the international conference on machine learning (pp. 33–40).
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Vaughan, J. W. (2010). A theory of learning from different domains. Machine Learning, 79(1–2), 151–175.
Ben-David, S., & Urner, R. (2014). Domain adaptation - can quantity compensate for quality? Annals of Mathematics and Artificial Intelligence, 70(3), 185–202.
Blanchard, G., Deshmukh, A. A., Dogan, U., Lee, G., & Scott, C. (2017). Domain generalization by marginal transfer learning. arXiv preprint arXiv:1711.07910.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Chopra, S., Balakrishnan, S., & Gopalan, R. (2013). Dlid: Deep learning for domain adaptation by interpolating between domains. In International conference on machine learning workshop on challenges in representation learning.
Deshmukh, A. A., Sharma, S., Cutler, J. W., & Scott, C. (2017). Multiclass domain generalization. In NIPS workshop on limited labeled data.
Dierckx, P. (1982). A fast algorithm for smoothing data on a rectangular grid while using spline functions. SIAM Journal on Numerical Analysis, 19(6), 1286–1304.
Erfani, S., Baktashmotlagh, M., Moshtaghi, M., Nguyen, V., Leckie, C., Bailey, J., & Kotagiri, R. (2016). Robust domain generalisation by enforcing distribution invariance. In Proceedings of the international joint conference on artificial intelligence (pp. 1455–1461). AAAI Press/International Joint Conferences on Artificial Intelligence.
Evgeniou, T., & Pontil, M. (2004). Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 109–117). ACM.
Ferreiro, S., Sierra, B., Irigoien, I., & Gorritxategi, E. (2012). A Bayesian network for burr detection in the drilling process. Journal of Intelligent Manufacturing, 23(5), 1463–1475.
Gan, C., Yang, T., & Gong, B. (2016). Learning attributes equals multi-source domain generalization. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 87–97).
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., et al. (2016). Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17, 1–35.
Ghifary, M., Bastiaan Kleijn, W., Zhang, M., & Balduzzi, D. (2015). Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE international conference on computer vision (pp. 2551–2559).
Gong, B., Grauman, K., & Sha, F. (2013). Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In Proceedings of the international conference on machine learning (pp. 222–230).
Grubinger, T., Birlutiu, A., Schöner, H., Natschläger, T., & Heskes, T. (2015). Domain generalization based on transfer component analysis. In I. Rojas, G. Joya, & A. Catala (Eds.), Advances in computational intelligence (pp. 325–334). Berlin: Springer.
Grubinger, T., Birlutiu, A., Schöner, H., Natschläger, T., & Heskes, T. (2017a). Multi-domain transfer component analysis for domain generalization. Neural Processing Letters, 46, 1–11.
Grubinger, T., Chasparis, G. C., & Natschläger, T. (2016). Online transfer learning for climate control in residential buildings. In Proceedings of the annual European control conference (ECC 2016) (pp. 1183–1188).
Grubinger, T., Chasparis, G. C., & Natschläger, T. (2017b). Generalized online transfer learning for climate control in residential buildings. Energy and Buildings, 139, 63–71.
Hoffman, J., Mohri, M., & Zhang, N. (2017). Multiple-source adaptation for regression problems. arXiv preprint arXiv:1711.05037.
Li, D., Yang, Y., Song, Y. Z., & Hospedales, T. M. (2017a). Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision (pp. 5543–5551).
Li, D., Yang, Y., Song, Y. Z., & Hospedales, T. M. (2017b). Learning to generalize: Meta-learning for domain generalization. arXiv preprint arXiv:1710.03463.
Long, M., Wang, J., Ding, G., Shen, D., & Yang, Q. (2012). Transfer learning with graph co-regularization. In Conference on artificial intelligence (pp. 1805–1818). AAAI.
Long, M., Wang, J., & Jordan, M. I. (2017). Deep transfer learning with joint adaptation networks. In Proceedings of the international conference on machine learning.
Lu, J., Behbood, V., Hao, P., Zuo, H., Xue, S., & Zhang, G. (2015). Transfer learning using computational intelligence: A survey. Knowledge-Based Systems, 80, 14–23.
Luis, R., Sucar, L. E., & Morales, E. F. (2010). Inductive transfer for learning Bayesian networks. Machine Learning, 79(1–2), 227–255.
MacKay, D. J. (1992). Bayesian interpolation. Neural Computation, 4(3), 415–447.
Makridakis, S., & Wheelwright, S. C. (1977). Adaptive filtering: An integrated autoregressive/moving average filter for time series forecasting. Journal of the Operational Research Society, 28(2), 425–437.
Malaca, P., Rocha, L. F., Gomes, D., Silva, J., & Veiga, G. (2016). Online inspection system based on machine learning techniques: Real case study of fabric textures classification for the automotive industry. Journal of Intelligent Manufacturing, 30, 1–11.
Malli, B., Birlutiu, A., & Natschläger, T. (2017). Standard-free calibration transfer: An evaluation of different techniques. Chemometrics and Intelligent Laboratory Systems, 161, 49–60.
Muandet, K., Balduzzi, D., & Schölkopf, B. (2013). Domain generalization via invariant feature representation. In Proceedings of the 30th international conference on machine learning (pp. 10–18).
Nikzad-Langerodi, R., Zellinger, W., Lughofer, E., & Saminger-Platz, S. (2018). Domain-invariant partial least squares regression. Analytical Chemistry, 90, 6693.
Niu, L., Li, W., & Xu, D. (2015). Multi-view domain generalization for visual recognition. In Proceedings of the IEEE international conference on computer vision (pp. 4193–4201).
Pan, S., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345–1359.
Pena, B., Aramendi, G., Rivero, A., & de Lacalle, L. N. L. (2005). Monitoring of drilling for burr detection using spindle torque. International Journal of Machine Tools and Manufacture, 45(14), 1614–1621.
Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222.
Stübl, G., Bouchot, J. L., Haslinger, P., & Moser, B. (2012). Discrepancy norm as fitness function for defect detection on regularly textured surfaces. In Joint DAGM (German Association for Pattern Recognition) and OAGM symposium (pp. 428–437). Springer.
Sugiyama, M., & Kawanabe, M. (2012). Machine learning in non-stationary environments: Introduction to covariate shift adaptation. Cambridge: MIT Press.
Tsybakov, A. B. (2008). Introduction to nonparametric estimation (1st ed.). Berlin: Springer.
Varma, S., & Simon, R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7(1), 91.
Wang, L., & Nace, A. (2009). A sensor-driven approach to web-based machining. Journal of Intelligent Manufacturing, 20(1), 1–14.
Xu, Z., Li, W., Niu, L., & Xu, D. (2014). Exploiting low-rank structure from latent domains for domain generalization. In Proceedings of the European conference on computer vision (pp. 628–643).
You, K., Wang, X., Long, M., & Jordan, M. (2019). Towards accurate model selection in deep unsupervised domain adaptation. In Proceedings of the international conference on machine learning (pp. 7124–7133).
Zăvoianu, A. C., Lughofer, E., Pollak, R., Meyer-Heye, P., Eitzinger, C., & Radauer, T. (2017). Multi-objective knowledge-based strategy for process parameter optimization in micro-fluidic chip production. In IEEE symposium series on computational intelligence (pp. 1–8). IEEE.
Zellinger, W., Grubinger, T., Lughofer, E., Natschläger, T., & Saminger-Platz, S. (2017). Central moment discrepancy (CMD) for domain-invariant representation learning. In International conference on learning representations. https://openreview.net/pdf?id=SkB_mcel.
Zellinger, W., Moser, B., Chouikhi, A., Seitner, F., Nezveda, M., & Gelautz, M. (2016). Linear optimization approach for depth range adaption of stereoscopic videos. Stereoscopic displays and applications XXVII, IS&T Electronic Imaging.
Zellinger, W., Moser, B. A., Grubinger, T., Lughofer, E., Natschläger, T., & Saminger-Platz, S. (2019). Robust unsupervised domain adaptation for neural networks via moment alignment. Information Sciences, 483, 174–191.
Zhang, Y., & Yang, Q. (2017). A survey on multi-task learning. CoRR http://arxiv.org/abs/1707.08114.
Zhong, E., Fan, W., Yang, Q., Verscheure, O., & Ren, J. (2010). Cross validation framework to choose amongst models and datasets for transfer learning. In Proceedings of the Joint European conference on machine learning and knowledge discovery in databases (pp. 547–562). Springer.
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.