1 Introduction

Artificial intelligence (AI) has been widely applied in the industry over the last two decades [1, 2]. Machine learning (ML), including deep learning (DL), as a subfield of AI, is becoming increasingly important for research in image and signal processing, as well as for industrial applications [3, 4].

In [5], the focus was on the detection of defects in plastic parts with a convolutional neural network (CNN) through image analysis in combination with edge computing and Internet of Things (IoT) systems. Some challenges of quality prediction in the context of big data in the field of Industry 4.0 were mentioned in [6].

In this paper, our focus is on time series with a high sampling rate and different lengths, but a small number of records. In this context, time series classification (TSC) has gained importance as a key ML component of signal processing.

In [7], adaptive multi-scale pooling combined with temporal encoding was proposed to improve classification accuracy, especially on short data obtained as partial time series.

Dempster et al. [8] described the use of the HYbrid Dictionary-ROCKET Architecture (HydRa) model and compared it with the RandOm Convolutional KErnel Transform (ROCKET) algorithm for extracting and counting symbolic patterns in time series. It is shown that ROCKET is an outstanding algorithm capable of classifying time series accurately and quickly [9].

In [10], deep neural networks (DNNs) were evaluated for TSC in the most exhaustive study of DNNs for TSC: a total of 8730 deep learning models were trained on 97 time series data sets. Another method is the Shepard interpolation neural network (SINN) model, which provides a shallow learning approach with minimal use of training samples. SINNs are more interpretable than other neural networks. They learn metric features of the training time series. In [11], a novel SINN architecture was benchmarked against other TSC algorithms. However, the choice of ML models and algorithms for time series classification is limited by classification accuracy, computational time and interpretability.

Batch production processes are extremely popular in the process industry. A distinct property of such processes is that the process data are time series with different lengths by different batch operations. The analysis of non-periodic time series from batch production processes is the aim of this study. Such signals are often measured in discontinuous production processes, e.g. injection moulding of plastic parts. The product quality of a batch process will be influenced by the operating parameters.

The first step in obtaining information about the production process is data acquisition. The data consist of process parameters, quality values, or sensor signals. Figure 1 illustrates the steps to build a prediction model using ML. The upper part shows the steps mentioned in [12]. The lower part of Fig. 1 describes the approach proposed in this study. The data are collected from sensor signals inside the mould cavity during the production of plastic parts. A brief description of the process is given in Sect. 3.1. The next step of the upper part in Fig. 1 would be feature extraction. Feature extraction can be much more difficult than building the ML model and in some cases requires more expertise on the production process. The advantage of the proposed end-to-end method is that it simultaneously computes the features, predicts the product quality, and visualizes the result with the class activation mapping (CAM) algorithm to validate the model decision. This requires much less effort for feature extraction and gives clues as to which parts of the signal drive the classification decision of the ML model.

Fig. 1 Steps to design a prediction model using machine learning. The upper path shows the steps from [12]. The lower path shows the steps proposed in this study using end-to-end learning with explainability for decision-making

Another important aspect is the computation time needed to train the model. One way to reduce it is batch computation. According to [13], when training neural networks using batch normalization (BN), the batch size is an important hyper-parameter for the training time. To speed up training, a batch size greater than one combined with parallel computing is preferable. In [14], regularization techniques were used with a batch size greater than one. In general, however, a convolutional neural network (CNN) cannot be trained with a batch size greater than one on time series of different lengths. A solution without a CNN is the one-nearest-neighbour (1NN) classifier with dynamic time warping (DTW). This method is one of the most popular and traditional algorithms for time series classification [15,16,17]. In addition, DTW was described for time series of varying length in [18]. Bagnall and Lines [19] have stated that it is very difficult to outperform the 1NN-DTW combination.

In this work, we compare this method with a proposed 1D CNN using a masking layer for TSC for varying lengths. Both methods are investigated without feature extraction steps. As an application case for verifying the proposed ML model, discontinuous time series from a plastics production process are used in this study.

In [20, 21] the focus was on TSC and comprehensible quality prediction in the plastic injection moulding process. Monitoring the quality of plastic injection moulded parts is often difficult and expensive. With rising energy costs, quality prediction using ML algorithms has become a new focus for more and more manufacturers. Finding the right process parameters is a time-consuming task. To reduce this time and to find suitable process parameters e.g. pressure, temperature, and injection speed, ML methods were used in [1, 4].

In addition, the study of the interpretability of black-box models from DL is a promising research topic called eXplainable Artificial Intelligence (XAI) according to [22, 23]. In [24], Shapley Additive Explanations (SHAP) are used to interpret neural networks on Process State Points (PSPs) extracted from cavity pressure profiles. The study by [25] describes a method for multivariate data series classification using CNN and CAM.

In this study, we use visualization of the activation for interpretability of the 1D CNN model using CAM according to [26]. The input data for the models are sensor signals from inside the cavity of an injection mould during the production of the plastic part. The major contribution of this study lies in the TSC of non-periodic signals with different lengths. For better comparability, the 1NN with DTW and the 1D CNN with masking layer will be trained on the same data set without feature extraction.

1.1 Basic principles of 1NN with DTW

In this section, we give an overview of the reference method used for TSC. As a reference method without feature extraction, a DTW with 1NN is chosen. According to [19], this method is hard to beat for classification problems. The problem of classification models with unequal sample lengths was discussed in [17], where the Proximity Forest and DTW methods were used to deal with this type of time series.

DTW is a distance-based method for dealing with time series for clustering, classification and similarity search [27]. DTW can also be used for time series with unequal sampling rates or series of different lengths. For classification tasks, DTW combined with k-Nearest Neighbours (kNN) leads to a powerful model with high accuracy. Therefore, we chose such a model as a reference model to compare our results. For details of the DTW calculation see [28] and [29]. In order to gain a better understanding of the functionality of DTW, a brief overview is given as follows.

DTW compares two signals x (e.g. Eq. 1) and y (e.g. Eq. 2) and yields a distance between them. First, the element-wise absolute differences between the signals are computed using Eq. 3. The resulting matrix is called the cost matrix \(C_{(m,n)}\), where \(C \in {\mathbb {R}}^{M \times N}\) according to [30]. From these cost values, the accumulated cost matrix \(D_{(m,n)}\) is calculated according to Eqs. 4 to 8. Its last entry \(D_{(M,N)}\) is the DTW distance, which can be used for classification problems, for example with 1NN classifiers.

$$\begin{aligned} x(n) =&(2,1,1,8,8,8,6,1,1), \quad \{ n \in {\mathbb {N}}, 1 \le n \le N \} \end{aligned}$$
(1)
$$\begin{aligned} y(m) =&(2,2,0,8,7,4,3), \quad \{ m \in {\mathbb {N}}, 1 \le m \le M \} \end{aligned}$$
(2)
$$\begin{aligned} C_{(m,n)} =&|x(n) - y(m) | \end{aligned}$$
(3)
$$\begin{aligned} D_{(1,1)} =&C_{(1,1)} \end{aligned}$$
(4)
$$\begin{aligned} D_{(m,1)} =&C_{(m,1)} + D_{(m-1,1)} \end{aligned}$$
(5)
$$\begin{aligned} D_{(1,n)} =&C_{(1,n)} + D_{(1,n-1)} \end{aligned}$$
(6)
$$\begin{aligned} D_{(m,n)} =&C_{(m,n)} + d_{(m,n)} \end{aligned}$$
(7)
$$\begin{aligned} d_{(m,n)} =&\min \left\{ \begin{matrix} D_{(m,n-1)}, \\ D_{(m-1,n-1)}, \\ D_{(m-1,n)} \end{matrix} \right. \end{aligned}$$
(8)

Figures 2 and 3 show the results of the simple example with x and y.
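
The example of Eqs. 1 to 8 can be reproduced in a few lines of Python. The following is a minimal sketch using plain lists; the function names `cost_matrix` and `accumulated_cost` are illustrative and are not taken from the implementation used in this study.

```python
def cost_matrix(x, y):
    """Element-wise absolute differences, C[m][n] = |x[n] - y[m]| (Eq. 3)."""
    return [[abs(xn - ym) for xn in x] for ym in y]


def accumulated_cost(C):
    """Accumulated cost matrix D computed from C (Eqs. 4 to 8)."""
    M, N = len(C), len(C[0])
    D = [[0] * N for _ in range(M)]
    D[0][0] = C[0][0]                                   # Eq. 4
    for m in range(1, M):
        D[m][0] = C[m][0] + D[m - 1][0]                 # Eq. 5
    for n in range(1, N):
        D[0][n] = C[0][n] + D[0][n - 1]                 # Eq. 6
    for m in range(1, M):
        for n in range(1, N):
            # Eqs. 7 and 8: add the cheapest of the three predecessors
            D[m][n] = C[m][n] + min(D[m][n - 1], D[m - 1][n - 1], D[m - 1][n])
    return D


x = [2, 1, 1, 8, 8, 8, 6, 1, 1]   # Eq. 1, N = 9
y = [2, 2, 0, 8, 7, 4, 3]         # Eq. 2, M = 7
C = cost_matrix(x, y)
D = accumulated_cost(C)
dtw_distance = D[-1][-1]          # last entry of D is the DTW distance
print(dtw_distance)
```

Note that the code uses zero-based indices, whereas the equations count from 1.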

Fig. 2 Result of \(C_{(m,n)}\) for element-wise absolute differences between series x and y

Fig. 3 Resulting accumulated cost matrix \(D_{(m,n)}\) from Fig. 2 according to Eq. 7 and visualization of the optimal warping path \(p^*\) for this example

For the visualization of the difference between the two signals, the optimal warping path \(p^*\) with the minimum costs in the accumulated cost matrix can be calculated. Algorithm 1 shows the steps for calculating the optimal warping path in a different manner according to [30].

Algorithm 1 Optimal Warping Path
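
A common way to obtain an optimal warping path is to backtrack through the accumulated cost matrix from its last entry to its first. The sketch below illustrates this idea for the example signals x and y; it is not necessarily identical to Algorithm 1, and both the function names and the tie-breaking order (diagonal step first) are assumptions made for illustration.

```python
def dtw_matrices(x, y):
    """Return cost matrix C and accumulated cost matrix D (Eqs. 3 to 8)."""
    C = [[abs(xn - ym) for xn in x] for ym in y]
    M, N = len(y), len(x)
    D = [[0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            if m == 0 and n == 0:
                best = 0
            elif m == 0:
                best = D[0][n - 1]
            elif n == 0:
                best = D[m - 1][0]
            else:
                best = min(D[m][n - 1], D[m - 1][n - 1], D[m - 1][n])
            D[m][n] = C[m][n] + best
    return C, D


def optimal_warping_path(D):
    """Backtrack from D[M-1][N-1] to D[0][0] along the cheapest predecessors."""
    m, n = len(D) - 1, len(D[0]) - 1
    path = [(m, n)]
    while (m, n) != (0, 0):
        if m == 0:
            n -= 1                      # only a horizontal step is possible
        elif n == 0:
            m -= 1                      # only a vertical step is possible
        else:
            candidates = [
                (D[m - 1][n - 1], m - 1, n - 1),   # diagonal
                (D[m - 1][n], m - 1, n),           # vertical
                (D[m][n - 1], m, n - 1),           # horizontal
            ]
            _, m, n = min(candidates, key=lambda c: c[0])
        path.append((m, n))
    path.reverse()
    return path


x = [2, 1, 1, 8, 8, 8, 6, 1, 1]
y = [2, 2, 0, 8, 7, 4, 3]
C, D = dtw_matrices(x, y)
path = optimal_warping_path(D)
print(path)
```

Each path element is a zero-based index pair (m, n) that aligns y(m) with x(n). The costs \(C_{(m,n)}\) summed along the path equal the last entry of the accumulated cost matrix.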

The accumulated cost matrix \(D_{(m,n)}\) and the calculation of the optimal warp path are the key elements of the DTW algorithm. However, this method is not well suited for obtaining information about regions of interest in the time series. The reference method used has another disadvantage: prediction of long time series can be very slow for naive DTW computation. An idea to speed up the full DTW calculation is to use parallel computing on CPUs or GPUs [31]. Other ideas to speed up DTW computations as described in [32] are part of further work. In this study, we use parallel CPUs for the calculations.

2 Problem description

This study focuses on the following objectives. First, the classification of time series with varying lengths is considered. This type of time series is often encountered in problems with real data, such as data sets with varying production parameters. This leads to the task of predicting the quality of injection-moulded plastic parts from sensor signals during production cycles.

To solve this problem, sensor signals from inside an injection mould for the production of plastic parts were collected. The label for the ML model is a quality parameter of the products. However, this quality parameter is difficult to measure, which increases the cost of quality management. To address this problem, we investigate two classification models to predict the quality parameter in order to realize comprehensive quality monitoring during production. As a hard-to-measure critical quality parameter, the force between the part and a force gauge should be predicted with the ML algorithms.

We use a state-of-the-art ML method, 1NN with DTW, to treat this type of time series and compare it with a CNN with CAM. Our proposed method consists of a 1D CNN layer with a masking layer, a hidden CNN layer with Global Average Pooling (GAP) and a dense layer. Detailed information is given in Sect. 4.1.

The final objective is to make it possible to explain and understand the classification decision made by the 1D CNN classification model. For human operators, it will be essential to recognize or understand the decision from the model. A CAM algorithm will be used to colour the activation of some regions in the time signal for the classification decision.

3 Materials

3.1 Data acquisition and preprocessing

In each dataset, we recorded six sensor signals during the production cycle from inside the cavity of an injection mould. All sensor signals were acquired at a sampling rate of 1000 kHz. The produced parts were numbered, matched to the dataset and quality checked by quality management (QM). Two thermocouples in the cavity, two in the mould and two piezoelectric cavity pressure sensors were used for data acquisition. Figure 4 shows schematically the mould cavity with the sensor positions at part 1 (\(p_1\)) and part 2 (\(p_2\)), respectively.

The piezoelectric sensors are amplified by a chopper charge amplifier with an analogue-to-digital converter (ADC) described in [20, 33].

Fig. 4 Mould design with two cavities and the sensor positions of the thermocouples and the piezoelectric pressure sensors. The production machine is an Arburg A 320 S

The signals of the in-cavity sensors are available as time series whose length differs from product to product and between the steps of the design of experiments (DOE). Figure 5 shows two pressure signals from inside the cavity of part one.

Fig. 5 An example of piezoelectric pressure signals (\(p_{1}\)) from two different DOE parameter sets. The different lengths can be seen. The signal ‘A’ (red curve) shows a fast injection process with high injection pressure and signal ‘B’ (black curve) shows a slow process with lower injection pressure

The areas of the plastic part that determine the critical quality parameter are in a hard-to-measure position, and a tactile distance measurement process is too time-consuming and expensive. In addition, indirectly measuring the force based on the dimensions is also difficult to realize. Moreover, the tensile test affects the surface of the tested parts, which have to be destroyed after inspection. This leads us to an indirect measurement of the quality parameter. Section 4.1 describes the proposed 1D CNN model with masking layer and its use for the classification of time series with different lengths.

3.2 Production process and design of experiments

To predict the quality from sensor signals during the production of plastic products, we first designed a series of experiments for the production of plastic parts and data collection. This was done using the Taguchi method and the Minitab statistical software. This design of experiment (DOE) method can be used to visualize the effects of process parameters on product quality. In [34], an overview of injection moulding and, in particular, the most important process parameters for the production of plastic parts is presented.

Our Taguchi plan consists of the following parameters: the hot runner temperature \(T_{\text {HR}}\), the mould temperature \(T_{\text {M}}\), the velocity of the injected melt \(v_{\text {m}}\), the post-pressure \(P_{\text {a}}\), the cooling water flow rate \(\dot{V}_{\text {W}}\), and the cooling time \(t_{\text {c}}\). An extra parameter is built by the combination of \((T_{\text {M}}; t_{\text {c}})\). All of the above factors were varied to a high, medium and low level as applicable by the machine and tool limits. The combination of two factors is equivalent to a simultaneous change in the same direction. Details of the Taguchi Plan are confidential. Further information on the Taguchi method can be found in [35].

A combination of Taguchi, response surface method, and nondominated sorting genetic algorithm II (NSGA-II) was used to optimize the injection moulding process of fibre-reinforced composites [36]. The advantage of using DOE is that it helps to collect sensor and production data in a small space close to the working point for producing high-quality plastic parts. In fact, it is the easiest way to produce plastic parts with different quality parameters in a controlled way.

In addition to the process and quality parameters mentioned above, some sensor signals from inside the mould cavity were also sampled. We now need to determine which sensor has the highest effect on the prediction of the product quality. The sensors we used are shown schematically in Fig. 4 in Sect. 3.1. To understand the result of the DOE with the information from the sensor signals, some further steps are necessary. In [37], Principal Component Analysis (PCA) was used to identify the influence of the process parameters on the shrinkage behaviour in the production of plastic moulded gears. PCA or the ANalysis Of VAriance (ANOVA) method is an option to extract information about the sensor signals and how strongly they are influenced by the DOE parameter set. This helps to reduce the dimension of the classification problem from a multivariate to a univariate classification model, and thus also reduces the training time and the complexity of the model. The signals with higher fluctuations are more influenced by the DOE parameters. For the calculation, the mean of each signal is computed and used as input to ANOVA. The results of the DOE and ANOVA are shown in Sect. 5.
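
The use of signal means as ANOVA input can be sketched as follows. This is a minimal illustration that assumes a one-way ANOVA over the sensors of one group, with one mean value per recorded production cycle; the helper name `one_way_anova_f` and the toy signals are hypothetical and do not come from the recorded data.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F-value of a one-way ANOVA for a list of groups of scalar observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical signals of two sensors; lengths differ from cycle to cycle
p1_signals = [[0, 10, 20], [0, 12, 24, 12], [0, 11, 22]]
p2_signals = [[0, 20, 40], [0, 30, 60, 30], [0, 16, 32]]

# One mean per signal removes the dependence on the signal length
p1_means = [mean(s) for s in p1_signals]
p2_means = [mean(s) for s in p2_signals]

f_value = one_way_anova_f([p1_means, p2_means])
print(round(f_value, 3))
```

Reducing each signal to its mean makes series of different lengths directly comparable, at the price of discarding the shape of the curve.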

Our DOE analysis aims to predict the quality of the plastic parts. It is shown that the tensile force between the plastic parts and the gauge used is the critical quality parameter. If the force is less than a critical value, the product will be treated as a "good" (P positive) part. All products with a tensile force greater than or equal to the critical value will be labelled as "not good" (N negative) parts. The amount of good and bad parts varies considerably in the data collected. The ratio of good parts to bad parts is 1:4. Some effects from balanced and unbalanced data distributions are described in [38]. Table 1 summarizes the properties of the data set, i.e. the collected time series \(p_{\text {2}}\).

Table 1 Properties of the collected time series \(p_{\text {2}}\)

It is shown in Table 1 that a property of the recorded data set is the unbalanced label distribution. To estimate the influence of the unbalanced data, different data processing schemes are employed with adjusted data distributions. As a result, the number of bad and good parts in the training data is re-distributed. In Tables 2 and 3 the subdivision of the two variants with the corresponding data distributions for the 1NN and 1D CNN classifiers are shown.
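
One simple way to obtain a balanced training distribution is random undersampling of the over-represented class. The sketch below illustrates that option only; the re-distribution scheme actually used for Tables 2 and 3 is not detailed here, and the function name `undersample`, the seed and the toy labels are assumptions.

```python
import random


def undersample(records, labels, seed=0):
    """Randomly drop records until every class has as many records as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for record, label in zip(records, labels):
        by_class.setdefault(label, []).append(record)
    n_min = min(len(recs) for recs in by_class.values())
    balanced = []
    for label, recs in by_class.items():
        for record in rng.sample(recs, n_min):
            balanced.append((record, label))
    rng.shuffle(balanced)       # avoid blocks of identical labels
    return balanced


# Hypothetical training set with a good-to-bad ratio of 1:4
records = list(range(10))
labels = ["good"] * 2 + ["bad"] * 8
balanced = undersample(records, labels)
print(balanced)
```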

Table 2 Representation of the unbalanced training data distribution
Table 3 Representation of the balanced training data distribution

Two graphics processing units (GPUs) of an Nvidia DGX-1 deep learning cluster were used for our computational implementation. Python and Keras [39] with TensorFlow [40] were used for data preparation, classification and visualization.

4 Methods

4.1 Proposed 1D CNN model with masking layer

In this study, a method is developed based on a CNN model with an additional masking layer to deal with variations in the length of time-series. The feature of the proposed 1D CNN is that the whole set of signals is examined and not only parts of it. A major advantage of a masking layer is that the training time can be reduced by using a batch size larger than 1 for series with different lengths. In addition, training with batch normalization can increase the accuracy of the classification model, according to [14].

The proposed model has two convolutional layers with a Rectified Linear Unit (ReLU) layer, as shown in Fig. 6.

Fig. 6 The proposed CNN model for the classification task with two convolutional layers, ReLU, GAP layer inspired by a Fully Convolutional Neural Network (FCNN) from [16]

The masking method is often used in image analysis, for example in human pose estimation [41]. Here, we apply this method to time series analysis. Figure 7 shows the functionality of the masking layer for time series with varying lengths.

Fig. 7 Functionality of the masking layer. Shorter time series (red signal) are supplemented by large negative values not included in the signal itself (masking value / grey values). The information from the masking layer is passed to the next layers of the CNN model. The masked values are excluded from the back-propagation algorithm

The description of the masking layer is part of the Keras deep learning API that we used for our development. Further information on masking can be found in [42].

For the masking layer, a few more steps need to be taken. First, the data sets must be stacked. For DataFrames in Python, this step automatically fills all shorter series with NaN values up to the length of the longest series. The next step is to replace the NaN values with the masking value. The masking value has to be a value that does not occur in the time series. The function of the masking layer is to mask (skip) all time steps whose values are equal to the masking value. This is carried out in the first step as shown in Fig. 6. The next step is the first convolutional layer (orange). Each convolutional layer is followed by a ReLU layer, whose outputs are shown as blue boxes. The output (output 2) of the last convolutional layer is the input to the Global Average Pooling (GAP) layer. Finally, the dense layer connects the GAP outputs to the two final neurons to obtain the predicted quality classification. The combination of output 2, the weights \(w_x\) and the dense layer forms the CAM activation. The activation result is mapped onto the input time series, producing the coloured signal in regions of high activation for the corresponding class.
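
The two preparation steps, stacking with NaN and replacing NaN by the masking value, can be sketched without any framework. This is a minimal illustration with plain lists; the value of `MASK_VALUE`, the function names and the toy signals are assumptions. In Keras, the same masking value would be passed to the `Masking` layer via its `mask_value` argument.

```python
import math

MASK_VALUE = -1000.0  # hypothetical value; it must not occur in any recorded signal


def stack_with_nan(series_list):
    """Pad every series with NaN up to the length of the longest series."""
    max_len = max(len(s) for s in series_list)
    return [list(s) + [math.nan] * (max_len - len(s)) for s in series_list]


def replace_nan(stacked, mask_value=MASK_VALUE):
    """Replace the NaN padding with the masking value."""
    for row in stacked:
        if mask_value in row:
            raise ValueError("masking value occurs in the signal itself")
    return [[mask_value if math.isnan(v) else v for v in row] for row in stacked]


def time_step_mask(padded, mask_value=MASK_VALUE):
    """True for real samples, False for padded (masked) time steps."""
    return [[v != mask_value for v in row] for row in padded]


# Two hypothetical signals of different lengths
signals = [[0.1, 0.5, 0.9, 0.4], [0.2, 0.8]]
padded = replace_nan(stack_with_nan(signals))
mask = time_step_mask(padded)
print(padded)
print(mask)
```

After this step all series have the same length and can be combined into batches, while the Boolean mask records which time steps carry real measurements.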

To improve the generalization of the model to unknown or new data and to avoid overfitting, the data are randomly split into training, test, and validation subsets. The validation data are required for the evaluation of the performance of the model. The random split helps each subset to contain enough representative data for reliable model results, even if the data volume is relatively small. In addition, to bring the features of the dataset to the same order of magnitude, all input data are re-scaled to the interval [0, 1] using normalization. The corresponding target data are pre-processed for classification by one-hot coding. Furthermore, to reduce the training time, the sensor signals used are selected by an ANOVA method. All plastic parts produced are inspected and labelled by the QM department.
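
The re-scaling and the one-hot coding can be illustrated with a short sketch. It assumes that the minimum and maximum are taken over all given series, which is one possible choice and is not specified above; the function names, class names and toy values are hypothetical.

```python
def min_max_scale(series_list):
    """Re-scale all series to [0, 1] using the global minimum and maximum."""
    lo = min(min(s) for s in series_list)
    hi = max(max(s) for s in series_list)
    span = hi - lo
    if span == 0:
        raise ValueError("all values are identical; scaling is undefined")
    return [[(v - lo) / span for v in s] for s in series_list]


def one_hot(labels, classes=("good", "bad")):
    """Encode each label as a vector with a single 1 at the position of its class."""
    return [[1 if label == c else 0 for c in classes] for label in labels]


# Hypothetical raw signals of different lengths and their labels
signals = [[2.0, 4.0, 6.0], [3.0, 10.0]]
labels = ["good", "bad"]

scaled = min_max_scale(signals)
targets = one_hot(labels)
print(scaled)
print(targets)
```

Scaling has to be applied before the shorter series are filled with the masking value, otherwise the masking value would distort the minimum.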

The last step of our study is to explain the classification results. The CAM algorithm described in [26] is used for this task with some adjustments. Equations 9 to 11 [26] are used to calculate the activation back to the GAP layer. This activation can be visualized as a heat map on the input signal. As a result, this method helps to better understand the classification decision and find interesting regions in the time series to match with the injection moulding process.

$$\begin{aligned} S_c&= \sum _{k} w_{k}^{c} \sum _{x,y} f_{k}(x,y) \end{aligned}$$
(9)
$$\begin{aligned}&= \sum _{x,y} \sum _{k} w_{k}^{c} f_{k}(x,y) \end{aligned}$$
(10)
$$\begin{aligned} M_{c}(x,y)&= \sum _{k} w_{k}^{c} f_{k}(x,y) \end{aligned}$$
(11)
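
For a time series, the spatial index pair (x, y) in Eqs. 9 to 11 reduces to a single time index t. The following sketch illustrates the computation under that assumption; the feature maps, the weights and the function names are hypothetical and stand for the output of the last convolutional layer and the dense-layer weights of one class.

```python
def class_activation_map(feature_maps, class_weights):
    """M_c(t) = sum_k w_k^c * f_k(t), the 1D counterpart of Eq. 11."""
    n_steps = len(feature_maps[0])
    return [
        sum(w * f[t] for w, f in zip(class_weights, feature_maps))
        for t in range(n_steps)
    ]


def class_score(feature_maps, class_weights):
    """S_c = sum_k w_k^c * sum_t f_k(t), the 1D counterpart of Eq. 9."""
    return sum(w * sum(f) for w, f in zip(class_weights, feature_maps))


# Hypothetical output of the last convolutional layer: 2 filters, 4 time steps
feature_maps = [
    [0.0, 1.0, 2.0, 1.0],   # f_1(t)
    [1.0, 0.0, 0.0, 3.0],   # f_2(t)
]
class_weights = [0.5, -1.0]  # w_1^c and w_2^c for one class c

cam = class_activation_map(feature_maps, class_weights)
score = class_score(feature_maps, class_weights)
print(cam, score)
```

Summing the activation map over time gives the class score, which is the statement of Eq. 10: the order of the two sums can be exchanged, so each time step contributes \(M_c(t)\) to the decision for class c.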

5 Results

5.1 DOE & ANOVA analysis

Based on the developed DOE, 726 records were collected, of which 165 are labelled as good and 561 as bad parts. The regression analysis of the developed DOE and the process parameters yields an \(R^2\) of 91.62%. The regression equation from the DOE is as follows:

$$\begin{aligned} F_{1} =&-8.6 - 1.135 t_\text {c} + 0.4718 T_\text {M} - 0.189 \dot{V}_\text {W} \nonumber \\&+ 0.1091 v_\text {m} + 0.0221 T_\text {HR} + 0.00026 P_\text {a} \end{aligned}$$
(12)

This result shows which process parameters have more influence on the DOE regression model. However, it is not possible to determine the influence of environmental conditions, the quality of the plastic, and the sensor properties from this regression model. The information content of the process parameters themselves is not the same as that of the sensor signals from inside the cavity. The calculation of the variances leads to the result shown in Fig. 8. It is shown that the pressure sensor \(p_{2}\) has the highest variance.

Fig. 8 Plot of the variances for the investigated sensor signals: \(p_\text {1}\), \(p_\text {2}\), \(T_\text {M1}\), \(T_\text {M2}\), \(T_\text {1}\) and \(T_\text {2}\). Sensor \(p_\text {2}\) shows the highest variance

ANOVA is used for sensors with similar variances, types, and positions. Table 4 shows the grouped sensors and the results of the ANOVA with f-value and its significance. Due to problems with the sensor \(T_{M2}\), the signals \(T_{M1}\) and \(T_{M2}\) cannot be used for further investigations.

Table 4 Sensor groups from ANOVA

It is shown from ANOVA that sensor group 1 has a larger f-value and a smaller significance. Therefore, the second cavity pressure sensor \(p_{\text {2}}\) is selected as a representative sensor signal for the classification model. The following results are divided into classification approaches by the reference model and the proposed 1D CNN with masking layer model. In addition, the visualization of the activation for the interpretability of the 1D CNN classification model is compared by using CAM according to [26].

5.2 Results by 1NN with DTW and 1D CNN with masking layer

Figure 9 shows a DTW result for two pressure signals from the injection moulding process. The DTW calculation was done in parallel with 48 jobs. This reduces the calculation time to about 12 h with DTAIDistance and 24 h with the FastDTW algorithm. A run of the DTW algorithm on one job took between 16 s (balanced) and 27 s (unbalanced). For the 1NN prediction, a test series is assigned the class of the training series with the smallest DTW distance.
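
The 1NN decision rule can be sketched with a naive DTW in pure Python. This is an illustration of the principle only and does not use DTAIDistance or FastDTW; the function names and the toy signals are hypothetical.

```python
def dtw_distance(x, y):
    """Naive DTW distance with absolute differences as local cost."""
    prev = None
    for ym in y:
        curr = []
        for n, xn in enumerate(x):
            cost = abs(xn - ym)
            if prev is None and n == 0:
                best = 0                 # start cell
            elif prev is None:
                best = curr[n - 1]       # first row
            elif n == 0:
                best = prev[0]           # first column
            else:
                best = min(curr[n - 1], prev[n - 1], prev[n])
            curr.append(cost + best)
        prev = curr                      # keep only the previous row in memory
    return prev[-1]


def predict_1nn(train_series, train_labels, query):
    """Return the label of the training series closest to the query and its distance."""
    distances = [dtw_distance(s, query) for s in train_series]
    best = min(range(len(distances)), key=distances.__getitem__)
    return train_labels[best], distances[best]


# Hypothetical training signals of different lengths
train_series = [[0, 1, 5, 5, 1, 0], [0, 1, 2, 2, 1, 0, 0, 0]]
train_labels = ["good", "bad"]
query = [0, 1, 1, 5, 5, 5, 1, 0]

label, distance = predict_1nn(train_series, train_labels, query)
print(label, distance)
```

Every prediction requires one DTW computation per training series, which explains why the prediction time of this reference method grows with both the number and the length of the training signals.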

Fig. 9 Result of the DTW for two pressure signals. The calculated DTW distance, in this case, is around 5196. The heat-map shows the accumulated cost matrix. The red line shows the optimal warping path between the two signals

Table 5 shows the results for the unbalanced FastDTW model without scaling. The data distribution of this classification model is improved to balance the amount of good and bad parts within the training set. The result for the balanced model and unknown test data is displayed in Table 6. For the given training-test split, the results from the FastDTW calculation are slightly better than those from the DTAIDistance. The results from the DTAIDistance DTW and all other results can be found in Sect. 8. The same unknown test dataset is used to compare the results for all models. The corresponding overall classification accuracy for unknown test data of the 1NN with FastDTW classification model is calculated according to Eq. 13 [43]. The variable \(r_i\) stands for the correct predictions and S denotes the total number of predicted records.

$$\begin{aligned} Acc&= 100\% \cdot \frac{r_i}{S} \end{aligned}$$
(13)
$$\begin{aligned} r_i&= TN + TP \end{aligned}$$
(14)
$$\begin{aligned} S&= TP + FP + FN + TN \end{aligned}$$
(15)
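
Equations 13 to 15 translate directly into code. The sketch below uses hypothetical confusion matrix counts, not the values of Tables 5 to 9.

```python
def accuracy(tp, fp, fn, tn):
    """Overall classification accuracy in percent (Eqs. 13 to 15)."""
    r_i = tn + tp              # correct predictions, Eq. 14
    s = tp + fp + fn + tn      # total number of predicted records, Eq. 15
    return 100.0 * r_i / s     # Eq. 13


# Hypothetical counts for illustration only
acc = accuracy(tp=30, fp=10, fn=5, tn=55)
print(acc)
```

With unbalanced classes, the overall accuracy alone can be misleading, which is why the confusion matrices are reported alongside it.
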
Table 5 Confusion matrix result for unknown test data of the 1NN with FastDTW classification (test case: 1)
Table 6 Confusion matrix result of unknown test data for the 1NN FastDTW balanced and without scaling (test case: 2)

Table 7 lists the hyperparameters used to train the 1D CNN models. Training the 1D CNN model takes about 3 s per epoch for the unbalanced training dataset. The prediction time per dataset is about 0.0035 s. The same initialization parameters were used to train all models: unbalanced, unbalanced scaled, balanced and balanced scaled. The validation split for Keras was set to 0.3.

Table 7 Used hyperparameters in training the experimental 1D model

The first result represents the data distribution of the 1D CNN classification model with the unbalanced data distribution. Table 8 shows the corresponding confusion matrix. The overall classification accuracy using Eq. 13 with 1D CNN with a masking layer is calculated to 83.7%.

Table 8 Confusion matrix for the unknown test data of the 1D CNN classification model with unbalanced data distribution without scaling (test case: 9)
Fig. 10 Training process of the 1D CNN classification model with original data distribution. The blue lines represent the training data and the orange lines the validation data set

The 1D CNN classification model results with balanced training data are presented as follows. The results of the confusion matrix of the test data are presented in Table 9.

Table 9 Confusion matrix for the unknown test data of the 1D CNN classification model with balanced data distribution without scaling (test case: 10)

The corresponding training process and overall classification accuracy of the test data of this 1D CNN classification model are shown in Fig. 11.

Fig. 11 Training process of the 1D CNN classification model with balanced training data distribution. The blue lines represent the training dataset and the orange lines the validation dataset

5.2.1 CAM results & visualization

For generating the CAM result, the best models from the training were loaded. The visualization of the CAM activation is shown in Fig. 12. The signal and the CAM of an unbalanced CNN classification model show the result for a part predicted as good. The heat-map from the CAM algorithm shows which region of the time series has more influence on the model decision. The activation is shown in Fig. 12 for a signal classified as good and in Fig. 13 for a signal classified as bad. The results from CAM are also mapped to the sensor signal. This result depends on the DOE parameter set and may change for other critical quality parameters. Both results are predicted by model 9 from Sect. 8.

Fig. 12 CAM visualization by the 1D CNN classification unbalanced trained model from Table 8. The upper time series shows a pressure signal with the coloured CAM activation (lower curve). The signal shows a good classified cavity pressure curve resulting from test scenario 9

Fig. 13 CAM visualization result for a bad classified pressure signal, also resulting from test scenario 9 from Table 8

6 Discussion

Figure 8 shows the variance of the sensor signals. A high variance in this context means a greater influence of the DOE parameter set on the signal. For this type of classification task, the ANOVA method is suitable to reduce the training time and complexity of the classification models. Compared to the DOE, the ANOVA method approaches the problem from a different direction. The result of the DOE represents the influence of the variation of the process parameters on the part quality. The quality parameter can also be influenced by the machine, plastic quality level, environmental conditions, and many other factors. All these parameters influence the sensor signals. The position of the sensor, the sensor technology, or the ADC also affects the significance of the ANOVA result. However, our results show that the melt temperature has less influence on the ANOVA result. For this type of sensor signal, small changes in the DOE parameter set will reduce the variance of the mean. Another aspect could be the size of the plastic part and the sensor positions, but both parameters cannot be changed. The problem with the \(T_{M2}\) signal could be a hardware problem, but the analysis of a single sensor of this type is impossible.

The results of the DTW algorithms used are very satisfactory. Both DTW algorithms were tested under the same conditions. The slightly better accuracy of the FastDTW algorithm compared to the DTAIDistance algorithm is unexpected, but this is not the main aspect of our study. The training time of the DTAIDistance algorithm is half that of the FastDTW algorithm, but it is much slower than that of the proposed 1D CNN with a masking layer.

From Fig. 10, it can be seen that the validation loss of the 1D CNN classification model falls rapidly at the beginning of the training and increases afterwards. This is an indication of overfitting of the model to the over-represented labels in the unbalanced training. The high fluctuation at the beginning could be a result of the small training data and the validation split ratio of 0.8 in combination with the chosen batch size. With an unbalanced dataset, the probability of picking bad parts for a training batch is much higher than that of picking good parts. This could be the reason for the high variations at the beginning of the training process.

These fluctuations could therefore be reduced by using a balanced dataset for training or by increasing the batch size. The balanced training is shown in Fig. 11 and results in a better training process during the first training epochs. However, in the middle of training, the distance between the training loss and the validation loss increases, the model starts to overfit, and the validation accuracy cannot be improved. A possible reason could be the implementation of a dropout layer [44] and early stopping, but we believe that the main problem is the small number of recorded datasets.
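One simple way to obtain a balanced training set is random undersampling of the over-represented class. The sketch below is only an illustration under that assumption; the balancing procedure actually used for Fig. 11 is not specified here, and the function name `undersample_balanced` is hypothetical.

```python
import numpy as np


def undersample_balanced(labels, seed=0):
    """Return sorted indices that keep the same number of samples for every class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n_min = counts.min()                   # size of the smallest class
    keep = []
    for c in classes:
        idx = np.flatnonzero(labels == c)
        # Draw n_min samples of this class without replacement
        keep.append(rng.choice(idx, size=n_min, replace=False))
    return np.sort(np.concatenate(keep))


# Hypothetical unbalanced label vector: eight bad parts, two good parts
labels = np.array(["bad"] * 8 + ["good"] * 2)
selected = undersample_balanced(labels)
balanced_labels = labels[selected]
```

Undersampling discards records, which is a relevant drawback when only a small number of cycles has been recorded; class weighting would be an alternative that keeps all records.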

The results in Fig. 12 from the CAM algorithm show high activations in the pressure signals during the production phase, which agrees well with the experts’ opinion. A part of the compression phase, where the internal pressure increases rapidly, is recognized as an important region for predicting good parts. The highest CAM activation for bad parts is shown in the middle and at the end of the pressure curve, where the cavity pressure drops rapidly or has a step downward. These areas correspond to the technical meaning of the injection moulding process and show strong changes where the internal pressure drops or rises rapidly. This region of the injection moulding process is called the after-pressure and holding phase, which has a greater impact on the product quality than other phases. Some other CAM results are displayed in Sect. 8, which show how different the CAM results can be for different types of trained models. Some models show a high rise at the beginning of the pressure signal, whereas others show the step at the end of the signal. The difference in accuracy gives an indication of which model is better and how the model should be trained for a small number of collected data.
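How such activations arise can be sketched as follows, assuming the common CAM setting in which the last convolutional layer is followed by global average pooling and a dense output layer. The CAM for a class is then the weighted sum of the feature maps over the filters. The function names and the toy values are hypothetical and do not reproduce the trained models.

```python
import numpy as np


def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps over filters.

    feature_maps: array of shape (T, K), output of the last convolutional layer
    class_weights: array of shape (K,), dense-layer weights of one class
    Returns an array of shape (T,), one activation per time step.
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    class_weights = np.asarray(class_weights, dtype=float)
    return feature_maps @ class_weights


def normalize_cam(cam):
    """Scale a CAM to the range [0, 1] for plotting over the pressure curve."""
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Toy example with three time steps and two filters (hypothetical values)
feature_maps = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
class_weights = [1.0, -1.0]
cam = class_activation_map(feature_maps, class_weights)
cam_scaled = normalize_cam(cam)
```

Time steps with a high scaled value are the regions that contribute most to the score of the selected class.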

On the other hand, there are results from CAM that are not easy to understand even for experienced moulders. This could be caused by a non-optimal experimental design or overfitting of the classifier. The proposed CNN with CAM can highlight regions of interest in the pressure curves that contribute to its prediction. The reference model 1NN with DTW can also produce graphical output directly from the distance calculation of DTW. However, its warp path is difficult to interpret and only contains information about the phase and periodicity of the inputs. As a result, the reference model cannot provide this kind of information as CAM does.

7 Conclusions

The contributions of this study can be summarized as: a) the development of a fast classification model for TSC with up to 83.7 % accuracy for a hard-to-measure prediction task, b) the application of a CNN with a masking layer to time series with different lengths, c) the combination with CAM, which leads to a better understanding of the classification results, d) better results with the CNN with CAM than with a 1NN with DTW for this type of data, and e) the use of both methods to predict hard-to-measure quality parameters on real-world data.

Further details on the main findings of this study are as follows. With the proposed 1D CNN and a masking layer, a fast classification for TSC of discontinuous signals with different lengths can be realized. The state-of-the-art 1NN with DTW is a suitable reference method for comparing the classification results, but it is computationally expensive for a large number of time series. A simple way to decrease the computation time of the classification models is to reduce the multivariate classification problem to a univariate one. This was done by using the ANOVA method to determine the signal with the largest variance. In comparison to the multivariate 1NN with DTW, this makes the prediction task much faster. The overall classification accuracy of the best 1NN with DTW model reaches 82.3 %. For the 1D CNN models, the overall classification accuracy derived from the confusion matrix is 83.7 % with unbalanced data distribution and 66.0 % with balanced distribution. The result for the unbalanced data distribution is thus \(\approx 1.4\) percentage points better than that of the best 1NN with DTW model. The results for the two DTW algorithms (FastDTW and DTAIDistance) are also shown in the tables in Sect. 8.
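The idea behind a masking layer for series of different lengths can be sketched as follows: all series are padded to a common length with a marker value, and a Boolean mask records which time steps are real so that later operations can ignore the padding. This is a NumPy illustration only, not the network layer used in the study; the function names and the padding value of -1.0 are assumptions.

```python
import numpy as np


def pad_series(series, pad_value=-1.0):
    """Pad variable-length series to a common length and return the validity mask."""
    max_len = max(len(s) for s in series)
    padded = np.full((len(series), max_len), pad_value, dtype=float)
    mask = np.zeros((len(series), max_len), dtype=bool)
    for i, s in enumerate(series):
        padded[i, : len(s)] = s
        mask[i, : len(s)] = True           # True marks a recorded time step
    return padded, mask


def masked_mean(padded, mask):
    """Average over recorded time steps only (series must be non-empty)."""
    return (padded * mask).sum(axis=1) / mask.sum(axis=1)


# Two hypothetical pressure curves with different numbers of samples
padded, mask = pad_series([[1.0, 2.0, 3.0], [4.0, 6.0]])
means = masked_mean(padded, mask)
```

Because the padded steps are excluded from the average, the shorter series is not biased by the padding value, and several series can be processed together in one batch.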

The classification results with the implemented masking layer are promising, especially for the hard-to-predict quality parameter. The computation time of the reference model was between 12 and 24 h, which is too slow for prediction tasks with this type of data. Even these computation times of 12 h and 24 h for the 1NN with DTW on the given training and test subsets could only be achieved by parallel computing with 48 tasks. FastDTW [45] is slower than DTAIDistance [46], which agrees with the literature. The 3.4 % better accuracy of FastDTW compared to the DTAIDistance algorithm is negligible. The difference between the computation times was already analysed in [47] and could therefore be verified.

In comparison, the average training time of the CNN model with a masking layer is 33 min for the unbalanced dataset and is further reduced to 16 min for the balanced dataset. In addition, the 1D CNN with a masking layer delivers significantly faster predictions and has improved accuracy using the same dataset. This highlights the importance of training the CNN on diverse series with a batch size greater than one, which is part of our recommendation.

In subsequent applications, the use of up to 200 epochs proves to be sufficient for effective training. This large number of epochs provides a more robust basis for comparing different models. Notably, this approach not only reduces training time but also maintains the superior performance of the developed CNN for this specific data type.

While the accuracy and loss metrics show a favourable trajectory for the training data, a plateau is observed for the test data set. This observation is indicative of the complexity of the learning task, with the training model showing a tendency to overfit due to the limited number of records.

The properties of all models used in this paper are summarized in Sect. 8. The higher activation during the production phases in the pressure signals agrees well with the opinion of the injection moulding experts. On the other hand, some CAM results are difficult to understand, which could be due to the random training initialization. At these points, other sensors such as the temperature sensors may have a slightly greater influence on the results, but this information is not part of this work. The classification methods produce good results without the need for a time-consuming feature extraction algorithm. The major advantages of the 1D CNN model with a masking layer are the fast training time and the ability to obtain regions of interest from the CAM algorithm. This makes the proposed model more efficient for application to TSC with varying length series, especially in discontinuous production processes.

Regarding future work, the input data of the proposed 1D CNN models are currently the data sets of only one cavity pressure sensor. To avoid overfitting or to reduce the scatter of classification accuracy and loss value, it might be better to use more signals or simulation data. In addition, a combination with the second pressure sensor or with the melt temperature sensors would be of interest for improving the results. This requires a multivariate method to extract information from the sensor signals. In [37], PCA was used to reduce the influence of process parameters on the shrinkage behaviour in the manufacture of plastic moulded gears. In [36], a combination of the Taguchi method, response surface methodology, and nondominated sorting genetic algorithm II (NSGA-II) was used to optimize the injection moulding process of fibre-reinforced composites. For further work, PCA will be the next method to investigate for reducing the number of sensor signals in the context of multivariate classification tasks.

Another approach could be transfer learning. Some interesting aspects have been mentioned in [48]. Furthermore, it is conceivable to combine regression and classification approaches using so-called multi-output models to address the problem of unbalanced datasets on the one hand and to ensure compatibility with other explanatory models on the other hand. In this way, model-agnostic explanatory models could complement CAM explanations in the future. An implementation of our approach in Apache Spark may also be possible: as stated in [49], a near real-time quality prediction method for plastic injection moulding that meets the demands of digital transformation in Industry 4.0 could be effectively realized using Apache Spark.

A final point is the possibility of exploring Batch Normalization (BN) methods and Residual neural NETwork (ResNET) models for this type of data with masking layers and varying lengths. This can increase classification accuracy and may be more efficient for small and unbalanced datasets.

Moreover, research on LSTM and GRU models or the classification of other datasets can be topics of future work.