The aim of the proposed ScITSM method is to remove the scenario-specific differences in heterogeneous cyclical process manufacturing data such that the transformed data can be jointly modeled by subsequent machine learning procedures. In principle, any regression model that accepts time-series data as input can subsequently be employed, e.g. recurrent neural networks or standard machine learning methods based on features constructed from expert knowledge. In our experience, the former is usually the first choice for complex applications with very large amounts of available data, while the latter is particularly useful if only a limited amount of data is available.
Theoretical motivation
Intuitively, the error in Eq. (1) cannot be small if the target scenario is too different from the source scenarios. However, if the data distributions of the scenarios are similar, this error can be small, as shown by the following theorem (obtained as an extension of (Ben-David et al. 2010, Theorem 1) to multiple sources and time series).
Theorem 1
Consider some distributions \(P_1,\ldots ,P_S\) and Q on the input space \({\mathbb {R}}^{N\times T}\) and a target function \(l:{\mathbb {R}}^{N\times T}\rightarrow [0,1]^T\). Then the following holds for all regression models \(f:{\mathbb {R}}^{N\times T}\rightarrow [0,1]^T\):
$$\begin{aligned} \mathrm {E}_{Q}\left[ \left\| f-l\right\| \right] \le \frac{1}{S}\sum _{i=1}^{S} \mathrm {E}_{P_i}\left[ \left\| f-l\right\| \right] + \frac{2 \sqrt{T}}{S} \sum _{i=1}^{S}d(P_{i},Q) \end{aligned}$$
(2)
where
$$\begin{aligned} d(P,Q) = \sup _{B\in {\mathcal {B}}} \left| P(B)-Q(B)\right| \end{aligned}$$
(3)
is the total variation distance with Borel \(\sigma \)-algebra \({\mathcal {B}}\).
Proof
See “Appendix”. \(\square \)
Theorem 1 shows that the error in the target scenario can be expected to be small if the mean over all errors in the source scenarios is small and the mean distance of the target distribution to the source distributions is small. For simplicity, Theorem 1 assumes a target feature in the unit cube, which can be realized in practice by additional normalization procedures.
Our method tries to minimize the left-hand side of Eq. (2) (target error) by mapping the data into a new space in which an approximation of the right-hand side is minimized. The minimization of the second term on the right-hand side is tackled by aligning all source distributions in the new space (they move towards zero in Fig. 1, right column). The minimization of the first term is tackled by subsequent regression.
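Since Eq. (3) is a supremum over all Borel sets, it cannot be evaluated exactly from finite samples. As an illustration of how the alignment of the source distributions could be monitored in practice, the following is a minimal sketch of a crude histogram-based approximation for a single feature; the function name, the binning scheme, and the restriction to one dimension are our own assumptions and not part of ScITSM:

```python
import numpy as np

def tv_distance_histogram(samples_p, samples_q, bins=20):
    """Crude histogram approximation of the distance d(P, Q) in Eq. (3)
    for one-dimensional samples (illustration only)."""
    lo = min(samples_p.min(), samples_q.min())
    hi = max(samples_p.max(), samples_q.max())
    hist_p, _ = np.histogram(samples_p, bins=bins, range=(lo, hi))
    hist_q, _ = np.histogram(samples_q, bins=bins, range=(lo, hi))
    p = hist_p / hist_p.sum()
    q = hist_q / hist_q.sum()
    # restricted to unions of bins, the supremum equals half the L1 distance
    return 0.5 * np.abs(p - q).sum()
```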
It is important to note that the alignment of only the source distributions does not minimize the second term on the right-hand side if the target distribution Q is too different from all the source distributions \(P_1,\ldots ,P_S\) (Ben-David et al. 2010). As there is no data given from Q in our problem setting (Sect. 3), we cannot identify such cases based on samples. As one possible solution to this problem, we propose to consider only parameter vectors \(\mathbf p_Q\) that represent physical dimensions of tool settings similar to the tool settings represented by \(\mathbf p_1,\ldots ,\mathbf p_S\) (see Fig. 2 and compare Fig. 1).
Practical implementation
Consider some source samples \(X_1,\ldots ,X_S\in {\mathbb {R}}^{L\times N\times T}\) with target feature vectors \(Y_1,\ldots ,Y_S\in {\mathbb {R}}^{L\times T}\) and parameter vectors \(\mathbf p_1,\ldots ,\mathbf p_S\in {\mathbb {N}}^P\) (e.g. parameters 30, 50 in Fig. 1). For simplicity of the subsequent description, the number of samples L is assumed to be equal for each scenario.
The goal of ScITSM is to compute a mapping
$$\begin{aligned} \varPsi : {\mathbb {R}}^{N\times T}\times {\mathbb {R}}^P \rightarrow {\mathbb {R}}^{N\times T} \end{aligned}$$
(4)
which transforms a time series \(\mathbf x\) and a scenario parameter vector \(\mathbf p\) to a new time series \(\varPsi (\mathbf x,\mathbf p)\) such that the (source) distributions of \(\varPsi (X_1,\mathbf p_1),\ldots ,\varPsi (X_S,\mathbf p_S)\) are similar and such that a subsequently learned regression model \(f:{\mathbb {R}}^{N\times T}\rightarrow {\mathbb {R}}^T\) performs well on each scenario.
Here, \(\varPsi (X,\mathbf p)\) refers to the sample matrix that is obtained by applying \(\varPsi (\cdot ,\mathbf p)\) to each row of the sample matrix X.
The computation of the function \(\varPsi \) in ScITSM involves three processing steps: 1. Calculation of a mean curve for each source scenario, 2. Learning of correction functions at equidistant fixed time steps, and 3. Smooth connection of the correction functions.
Step 1: Calculation of mean curves. In a first step, a smooth curve, called mean curve, is fitted for each source scenario (dashed lines in the middle column of Fig. 1).
To this end, for each of the scenario samples \(X_1,\ldots ,X_S\), the mean value over the L samples is computed for each of the N features and T time steps, and a spline curve is subsequently fitted by means of the algorithm proposed in Dierckx (1982). This process results in a matrix \({\widehat{X}}\in {\mathbb {R}}^{S\times N\times T}\) storing the mean curves (rows) for each of the S source scenarios.
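A minimal sketch of Step 1 is given below; it uses SciPy's FITPACK wrapper, which implements the smoothing-spline algorithm of Dierckx, while the function name, the array layout, and the smoothing factor are our own illustrative choices:

```python
import numpy as np
from scipy.interpolate import splrep, splev

def mean_curves(samples, smoothing=1.0):
    """Step 1: fit one smooth mean curve per source scenario.

    samples: list of S arrays of shape (L, N, T).
    Returns an array X_hat of shape (S, N, T) holding the spline-smoothed
    mean curves (the rows of X_hat in the notation above).
    """
    S = len(samples)
    _, N, T = samples[0].shape
    t = np.arange(T)
    X_hat = np.empty((S, N, T))
    for i, X in enumerate(samples):
        raw_mean = X.mean(axis=0)                       # (N, T): mean over the L samples
        for n in range(N):
            tck = splrep(t, raw_mean[n], s=smoothing)   # Dierckx smoothing spline
            X_hat[i, n] = splev(t, tck)                 # evaluate at the T time steps
    return X_hat
```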
Step 2: Learning of Equidistant Corrections. After the mean curves are computed, K equidistant time points \(t_1,\ldots ,t_K\) are fixed and K corresponding correction functions
$$\begin{aligned} \varPhi _1,\ldots ,\varPhi _K:{\mathbb {R}}^P\rightarrow {\mathbb {R}}^N \end{aligned}$$
(5)
are learned which map the parameter vector \(\mathbf p_i\) of the i-th scenario close to the corresponding points \({\widehat{x}}_{t_1},\ldots ,{\widehat{x}}_{t_K}\) of the i-th mean curve \(\widehat{\mathbf x}_i=({\widehat{x}}_1,\ldots ,{\widehat{x}}_T)\), i.e. the i-th row of \({\widehat{X}}\). This is done under the constraint that the predictions \(\varPhi _{t'}(\mathbf p_i)\) and \(\varPhi _{t''}(\mathbf p_i)\) for nearby time steps \(t'\) and \(t''\), i.e. for two close points \({\widehat{x}}_{t'}\) and \({\widehat{x}}_{t''}\) on the mean curve, are similar.
We apply ideas from the Multi-Task Learning approach proposed in Evgeniou et al. (2004) that aims at similar predictions by means of similar parameters \(\theta _1,\ldots ,\theta _K\) of the learning functions \(\varPhi _1,\ldots ,\varPhi _K\). More precisely, we propose the following objective function:
$$\begin{aligned}&\min _{\varPhi _1,\ldots ,\varPhi _K} \sum _{k=1}^{K} \bigg ( \sum _{i=1}^{S} \left\| {\widehat{X}}_{i,:,t_k} - \varPhi _k\left( \mathbf p_i\right) \right\| \nonumber \\&\qquad + \alpha \sum _{r=\max (1,k-R)}^{\min (k+R,K)} \frac{\left\| \theta _k-\theta _r\right\| ^2}{l^{|k-r|-1}} + \beta \left\| \theta _k\right\| _1 \bigg ) \end{aligned}$$
(6)
where \({\widehat{X}}_{i,:,j}\) is the vector of features corresponding to the i-th scenario and the j-th time step, \(\left\| \cdot \right\| \) is the Euclidean norm, \(\left\| \cdot \right\| _1\) is the 1-norm and \(\theta _k\in {\mathbb {R}}^Z\) refers to the parameter vector of \(\varPhi _k\), e.g. \(\varPhi _k(\mathbf p)=\langle \theta _k, \mathbf p\rangle +b\) is a linear model with Euclidean inner product \(\langle .,.\rangle \), parameter vector \(\theta _k\in {\mathbb {R}}^P\) and bias \(b\in {\mathbb {R}}\). The first term of Eq. (6) ensures that the predictions of the correction functions stay close to the corresponding points on the mean curves. The second term of Eq. (6) ensures similar parameter vectors of the 2R nearby correction functions, where \(R\in {\mathbb {N}}\) and \(\alpha ,l\in {\mathbb {R}}\) are hyper-parameters. The last term ensures sparse parameter vectors by means of L1-regularization (Andrew and Gao 2007) with hyper-parameter \(\beta \in {\mathbb {R}}\).
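The objective in Eq. (6) can be minimized with a generic optimizer. The sketch below assumes multi-output linear correction functions \(\varPhi _k(\mathbf p)=W_k\mathbf p+b_k\) with \(W_k\in {\mathbb {R}}^{N\times P}\), a multi-output variant of the linear example above, and uses a gradient-free SciPy routine because of the non-smooth L1 term; the optimizer, the exclusion of the bias from the regularization terms, and all hyper-parameter defaults are our own assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def fit_corrections(X_hat, params, t_idx, R=2, alpha=0.1, beta=0.01, l=2.0):
    """Step 2: learn K linear correction functions Phi_k(p) = W[k] @ p + b[k].

    X_hat  : (S, N, T) mean curves from Step 1.
    params : (S, P) scenario parameter vectors p_1, ..., p_S.
    t_idx  : K equidistant time indices t_1, ..., t_K.
    Returns W of shape (K, N, P) and b of shape (K, N).
    """
    S, N, T = X_hat.shape
    P = params.shape[1]
    K = len(t_idx)

    def unpack(z):
        W = z[:K * N * P].reshape(K, N, P)
        b = z[K * N * P:].reshape(K, N)
        return W, b

    def objective(z):
        W, b = unpack(z)
        total = 0.0
        for k, t in enumerate(t_idx):
            # data term: predictions should stay close to the mean-curve points
            pred = params @ W[k].T + b[k]                      # (S, N)
            total += np.linalg.norm(X_hat[:, :, t] - pred, axis=1).sum()
            # coupling term: nearby correction functions get similar parameters
            for r in range(max(0, k - R), min(k + R + 1, K)):
                if r != k:
                    total += alpha * np.sum((W[k] - W[r]) ** 2) / l ** (abs(k - r) - 1)
            # sparsity term (non-smooth, hence the gradient-free optimizer below)
            total += beta * np.abs(W[k]).sum()
        return total

    z0 = np.zeros(K * N * P + K * N)
    res = minimize(objective, z0, method="Powell")   # suitable for small problems only
    return unpack(res.x)
```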
Step 3: Smooth Connection. To obtain a time series of length T, we aim at a smooth connection of the functions \(\varPhi _1,\ldots ,\varPhi _K\) between the points \(t_1,\ldots ,t_K\). This is done by applying ideas from moving average filtering (Makridakis and Wheelwright 1977). For a new time step \(t\le T\), we denote by
$$\begin{aligned} {\mathcal {R}}(t) = \Big \{&\big (\lfloor t \rfloor -R+1,\lceil t \rceil +R-1\big ), \nonumber \\&\big (\lfloor t \rfloor -R+2,\lceil t \rceil +R-2\big ),\ldots ,\big (\lfloor t \rfloor ,\lceil t \rceil \big )\Big \} \end{aligned}$$
(7)
a set of pairs constructed from the equidistant time steps \(t_1,\ldots ,t_K\) in a nested order, where \(\lfloor t \rfloor \) (\(\lceil t \rceil \)) denotes the largest (smallest) element of \(\{t_1,\ldots ,t_K\}\) that is smaller (larger) than t. The coordinates of the final transformation \({\varPsi (\mathbf x,\mathbf p)=(\varPsi _1(\mathbf x,\mathbf p),\ldots ,\varPsi _T(\mathbf x,\mathbf p))}\) in Eq. (4) are obtained by
$$\begin{aligned} \varPsi _t(\mathbf x,\mathbf p)= \mathbf x_t - \sum _{(i,j)\in {\mathcal {R}}(t)} \frac{m^{\frac{|{\mathcal {R}}(t)|-2i+2}{2}}\left( \varPhi _{i}(\mathbf p) + (t-i) \frac{\varPhi _{j}(\mathbf p) - \varPhi _{i}(\mathbf p)}{j - i}\right) }{\sum _{(i',j')\in {\mathcal {R}}(t)} m^{\frac{|{\mathcal {R}}(t)|-2i'+2}{2}}} \end{aligned}$$
(8)
where \(|{\mathcal {R}}(t)|\) is the cardinality of \({\mathcal {R}}(t)\) and \(m\in (0,1]\) is the smoothing hyper-parameter. That is, from each vector element \(\mathbf x_t\) of the time series \(\mathbf x\), a weighted average of linear interpolations between the corrections \(\varPhi _{i}(\mathbf p)\) and \(\varPhi _{j}(\mathbf p)\) over all time step pairs \((i,j)\in {\mathcal {R}}(t)\) is subtracted. ScITSM is summarized by Algorithm 1.
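A minimal sketch of the smooth connection in Eqs. (7) and (8) is given below, reusing the linear correction functions from the previous sketch. We interpret the exponent index of the weight \(m^{(|{\mathcal {R}}(t)|-2i+2)/2}\) as the position of the pair within \({\mathcal {R}}(t)\), and the clipping at the boundary time steps is our own simplification:

```python
import numpy as np

def scitsm_transform(x, p, W, b, t_idx, R=2, m=0.8):
    """Step 3: apply the map of Eq. (8) to one time series x of shape (N, T).

    W, b  : correction-function parameters from Step 2, Phi_k(p) = W[k] @ p + b[k].
    t_idx : the K equidistant time indices t_1, ..., t_K (increasing).
    """
    N, T = x.shape
    t_idx = np.asarray(t_idx)
    K = len(t_idx)
    phi = np.array([W[k] @ p + b[k] for k in range(K)])      # (K, N) corrections
    out = np.empty_like(x, dtype=float)
    for t in range(T):
        k_lo = np.searchsorted(t_idx, t, side="right") - 1   # largest t_k <= t
        k_hi = min(k_lo + 1, K - 1)                          # next grid point (clipped)
        k_lo = max(k_lo, 0)
        correction, weight_sum = np.zeros(N), 0.0
        for r in range(1, R + 1):                  # pairs of R(t), widest to narrowest
            i = max(k_lo - R + r, 0)
            j = min(k_hi + R - r, K - 1)
            w = m ** ((R - 2 * r + 2) / 2.0)       # |R(t)| = R; narrower pairs weigh more
            ti, tj = t_idx[i], t_idx[j]
            if tj == ti:
                interp = phi[i]
            else:                                  # linear interpolation of Eq. (8)
                interp = phi[i] + (t - ti) * (phi[j] - phi[i]) / (tj - ti)
            correction += w * interp
            weight_sum += w
        out[:, t] = x[:, t] - correction / weight_sum
    return out
```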
Subsequent regression
Consider a transformation function \(\varPsi : {\mathbb {R}}^{N\times T}\times {\mathbb {R}}^P \rightarrow {\mathbb {R}}^{N\times T}\) as computed by ScITSM, a previously unseen target scenario sample \(X_Q=(\mathbf x_1,\ldots ,\mathbf x_L)\) of size L drawn from the unknown target distribution Q over \({\mathbb {R}}^{N\times T}\), and a corresponding parameter vector \(\mathbf p_Q\in {\mathbb {N}}^P\) (e.g. parameter 40 in Fig. 1). As motivated in Sect. 4.1, the distribution of the transformed sample \(\varPsi (X_Q,\mathbf p_Q)\) is assumed to be similar to the distributions of the transformed source samples \(\varPsi (X_1,\mathbf p_1),\ldots , \varPsi (X_S,\mathbf p_S)\), which is induced by the selection of an appropriate corresponding parameter space (see e.g. Figs. 1 and 2).
Subsequently to ScITSM, a regression function
$$\begin{aligned} f:{\mathbb {R}}^{N\times T}\rightarrow {\mathbb {R}}^T \end{aligned}$$
(9)
is trained using the concatenated input sample \((\varPsi (X_1,\mathbf p_1);\ldots ; \varPsi (X_S,\mathbf p_S))\) and its corresponding target features \((Y_1;\ldots ;Y_S)\). Finally, the target features of \(X_Q\) can be computed by \(f(\varPsi (X_Q,\mathbf p_Q))\).
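A minimal sketch of this regression step is shown below, using a standard multi-output regressor from scikit-learn as an example model; the choice of model, the flattening of each time series into a feature vector, and all names are illustrative assumptions rather than the method's prescribed setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_and_predict(transformed_sources, targets, X_Q_transformed):
    """Train f on the concatenated transformed source samples and
    predict the target features of the transformed target sample.

    transformed_sources: list of S arrays of shape (L, N, T)   (Psi(X_i, p_i))
    targets            : list of S arrays of shape (L, T)      (Y_i)
    X_Q_transformed    : array of shape (L_Q, N, T)            (Psi(X_Q, p_Q))
    """
    X_train = np.concatenate([X.reshape(len(X), -1) for X in transformed_sources])
    Y_train = np.concatenate(targets)
    model = RandomForestRegressor(n_estimators=100)   # any multi-output regressor works
    model.fit(X_train, Y_train)
    return model.predict(X_Q_transformed.reshape(len(X_Q_transformed), -1))
```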
Theorem 1 shows that the empirical error
$$\begin{aligned} \frac{1}{L}\sum _{i=1}^L\left\| f(\varPsi (\mathbf x_i,\mathbf p_Q))-l(\mathbf x_i)\right\| \end{aligned}$$
(10)
of the function \(f\circ \varPsi \) on the target sample can be expected to be small if the sample size is large enough (i.e. the empirical error approximates the error in Eq. 1 well) and the model f performs well on the concatenated source data.
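If labeled target data becomes available later, e.g. after the new tool setting has been run, the empirical error in Eq. (10) can be evaluated directly; a minimal sketch reusing the names of the preceding sketches (the variable names are our own and not part of the original formulation):

```python
import numpy as np

def empirical_error(model, X_Q_transformed, Y_Q):
    """Mean Euclidean error of f o Psi on a labeled target sample (Eq. 10)."""
    preds = model.predict(X_Q_transformed.reshape(len(X_Q_transformed), -1))
    return np.mean(np.linalg.norm(preds - Y_Q, axis=1))
```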