Continual learning of neural networks for quality prediction in production using memory aware synapses and weight transfer

Deep learning-based predictive quality enables manufacturing companies to make data-driven predictions of the quality of a produced product based on process data. A central challenge is that production processes are subject to continuous changes such as the manufacturing of new products, with the result that previously trained models may no longer perform well in the process. In this paper, we address this problem and propose a method for continual learning in such predictive quality scenarios. To this end, we adapt and extend the memory-aware synapses approach to train an artificial neural network across different product variations. Our evaluation in a real-world regression problem in injection molding shows that the approach successfully prevents the neural network from forgetting previous tasks and improves the training efficiency for new tasks. Moreover, by extending the approach with the transfer of network weights from similar previous tasks, we significantly improve its data efficiency and performance on sparse data. Our code is publicly available to reproduce our results and build upon them.


Introduction
Predictive quality enables manufacturing companies to make data-driven in-process predictions of the quality of a produced product based on process data. The general approach to predictive quality involves three main steps: the collection and aggregation of process and quality data, the training of a predictive model, and the use of the model for real-time predictions as a basis for decisions on measures to be taken in the process. Machine learning and especially deep learning methods based on neural networks enable such predictions based on multi-modal process, sensor and machine data. In the current state of research, there are already many examples that successfully demonstrate the feasibility of deep learning based predictive quality in various manufacturing processes such as deep drawing, hydrocracking, laser machining, or additive manufacturing (Baumeister et al. 2018; Meyes et al. 2019; Yuan et al. 2020; Li et al. 2020; McDonnell et al. 2021; Hsu and Liu 2021).

Hasan Tercan, tercan@uni-wuppertal.de, Chair of Technologies and Management for Digital Transformation, University of Wuppertal, Rainer-Gruenter-Strasse 21, Wuppertal, Germany
The mentioned examples mainly focus on a particular learning problem, where the training of a neural network happens under the assumption that enough data is available for the respective problem. However, this assumption is often not met in production. In fact, a central challenge is that production processes are subject to continuous changes. For example, as soon as a new product is manufactured or a process is reparameterized, the process behavior changes and with it the relationships between process and quality data. Consequently, a lot of new process data would have to be collected each time to train another completely new model on it (Escobar et al. 2021). This strongly limits the sustainable use of deep learning in the production context, especially since the collection of representative process data is costly and time-consuming. Other common problems in the production domain are that, due to limited hardware capacities or corporate policies, long-term process data cannot be stored or accessed and model training must be carried out in a resource-friendly manner. It is therefore necessary to address this research gap and to find solutions for the efficient training of neural networks across production process variations with sparse data (Wang et al. 2018).

Fig. 1 Use of a neural network for regression to estimate the product quality y (here part deformation) on the basis of machine parameters X (e.g. pressure and time). With each plastic brick, the relationship between X and y changes. Note that each new product represents a new prediction task while the input and output always remain the same for every task
We believe that potential solutions lie in the research field of continual learning, a paradigm in deep learning that addresses the training of neural networks over multiple (similar) tasks. The common goal in continual learning is to keep the training effort low (i.e. reduced computational effort, increased memory efficiency) and to prevent the so-called catastrophic forgetting of the networks with each new task.
In this paper, we address this issue and demonstrate the successful application of continual learning for a real use case in injection molding, where we train a neural network for numerical prediction of product quality based on machine parameters. Tercan et al. (2018) demonstrated the feasibility of such predictions in previous work. However, the network is also continually confronted with new prediction tasks due to product changes. Figure 1 schematically illustrates this problem. Our goal is to investigate a learning process of a single neural network across these prediction tasks. Since many production environments have the above mentioned constraints, our approach must meet the following criteria:

The two main contributions of this paper are:
1. We gain useful findings for the use of continual learning in a real-world predictive quality case in injection molding. To this end, we perform extensive experiments with an existing continual learning method (i.e. memory-aware synapses) and evaluate its feasibility and benefit by comparing it with baseline methods.
2. We provide a valuable extension of the method by transferring (i.e. cloning) neural network weights from previous tasks. As a result, we achieve improved performance on sparse data and can also satisfy the criteria mentioned above.
The paper is organized as follows: Sect. 2 presents the state of the art of continual learning and discusses it with respect to the criteria posed. Section 3 describes our approach, which is based on the memory-aware synapses method for continually learning neural networks. In Sect. 4, the injection molding use case and the experimental setup for evaluation are provided. The evaluation results and discussions are provided in Sect. 5. Finally, Sect. 6 briefly summarizes the main points of this paper and gives an outlook on future research.

Continual learning
One major assumption of machine learning is that the training data for a model is drawn from the same domain and shares the same characteristics (e.g. input features, distributions) as the test data. This ensures that models generalize well to new unseen data. However, in many real-world scenarios this is not the case and the training data for a learning task becomes available only during a certain time.
In such cases, a new model would have to be trained from scratch on new data for each new task. Continual learning is a paradigm of machine learning that tackles this problem and deals with training machine learning models over time in such a way that they can both acquire knowledge for new tasks and retain knowledge from previously trained tasks (Parisi et al. 2019; Chen and Liu 2018). A related paradigm to continual learning is transfer learning, which has also been successfully investigated in industrial applications, such as by Zhao et al. (2020) or Zellinger et al. (2020). The main idea of transfer learning is also to leverage the knowledge of a model pre-trained on a source task for a given new task (Pan and Yang 2009; Tan et al. 2018). However, when training neural networks with transfer learning, they may suffer from catastrophic forgetting. As soon as they are trained for sequentially occurring tasks, their performance for previously learned tasks drops due to changes of their parameters, i.e. the network weights (Goodfellow et al. 2013). In contrast, continual learning methods address this issue and try to find a trade-off between the stability and plasticity of network parameters when training on new tasks. State-of-the-art methods for continual learning can be roughly divided into three categories: memory-based rehearsal strategies, dynamic architectures, and regularization strategies (Parisi et al. 2019).
Rehearsal methods use a fixed-size memory to store data samples from previously trained tasks. These samples are then later revisited during the training of new tasks in order to mitigate catastrophic forgetting. For example, Rebuffi et al. (2017) keep an episodic memory with representative samples for each task. When training new tasks, they calculate an additional distillation loss to prevent the network's predictions for these samples from changing significantly. In contrast to that, Lopez-Paz and Ranzato (2017) use the memory to compute the network's gradients for previous tasks. They then formulate the learning of a new task as a dual optimization problem allowing the calculated gradients to minimize both the new loss and the previous losses. Shin et al. (2017) propose a pseudo-rehearsal approach that uses an autoencoder as a generative model to replay previous tasks and to generate new data for each of them when training a new task.
Approaches with dynamic architectures change the architecture of the network when training for new tasks. Often, they dynamically expand the capacity of a network in order to learn new patterns without conflicts. Parisi et al. (2017) use a growing when required (GWR) approach to train recurrent self-organizing neural networks that are hierarchically extended for new tasks. Rusu et al. (2016) propose a progressively growing neural network, with the network being extended by an additional column when trained on a new task. By freezing the previous columns and using lateral connections to the new column, catastrophic forgetting is prevented and previously learned knowledge is reused for the current task. Schwarz et al. (2018) consolidate previously learned knowledge in a base column (i.e. knowledge base) by means of knowledge distillation. For new tasks an additional active column is trained, which again is connected to the base column and can therefore leverage previous knowledge and at the same time acquire new knowledge.
Regularization strategies reduce catastrophic forgetting of neural networks by restricting network parameter updates while training on new tasks. In elastic weight consolidation (EWC) by Kirkpatrick et al. (2017), this is realized by penalizing changes of parameters that are important for previous tasks. The parameter importance is estimated via probability densities using the Fisher information matrix. Because of that, EWC is suited for continual classification tasks. In contrast to that, the memory-aware synapses approach proposed by Aljundi et al. (2018, 2019) represents the importance of network parameters by the sensitivity of the network output to changes of the parameters. By incorporating changes of important parameters into the loss function, mainly those network weights that are not yet important are adapted to a new task. A number of other regularization strategies exist that constrain the learning of a network in a different way. Zhizhong and Hoiem (2018) use a distillation loss to prevent the actual outputs (i.e. predictions) of the updated network from deviating too much from those of its older version. Pomponi et al. (2020) apply regularization to penalize significant changes of a network's embeddings (i.e. activations) for previously learned tasks.
The mentioned methods show promising results and most of them could also be applied for predictive quality scenarios with varying process conditions. However, only few of them fulfill the three requirements defined in Sect. 1. Although rehearsal methods can yield the best results (Hsu et al. 2018), they require access to old data and a sufficiently large memory for training. Dynamic architectures manage learning without forgetting but are mostly inefficient and do not scale well with the number of tasks. Among the established regularization methods, the memory-aware synapses approach by Aljundi et al. (2018) is most suitable for our use case because this method can be used for regression (unlike EWC by Kirkpatrick et al. (2017)) and it does not require access to old data (as required by Lopez-Paz and Ranzato (2017)). The importance of the network weights for a task is computed only once and retained for future training. Our proposed approach in this paper is therefore mostly based on memory-aware synapses (see detailed description in Sect. 3).

Continual learning methods are also investigated in applied research areas. In the field of medicine, for example, methods like elastic weight consolidation are used for semantic segmentation (Baweja et al. 2018; van Garderen et al. 2019) and classification of medical images (Lenga et al. 2020) or for pattern recognition in physiological time series data (Kiyasseh et al. 2020). In the field of engineering, research effort is put into continual learning strategies for robotic systems based on supervised learning or deep reinforcement learning scenarios (Lesort et al. 2020; Wong 2016). The goals are mainly to enable intelligent and autonomous agents to perform lifelong learning of new tasks and to overcome the challenge of needing large data sets from real robotic controls by means of incremental learning strategies. Examples are found in Dehghan et al. (2019) and Ayub and Wagner (2020), where both works deal with visual object detection systems for robotic tasks that allow incrementally adding new objects.
With regard to manufacturing and production, only few works exist so far on the use of continual learning methods. Tercan et al. (2019) continually train neural networks for predictive quality tasks, where the learning approach consists of a finetuning on the new task and a subsequent retraining on data of the old tasks, yielding better learning rates than training models from scratch. Maschler et al. (2020) use EWC for fault prediction of turbofan engines based on long short-term memory (LSTM) networks. Their results show that though EWC successfully prevents forgetting in similar tasks, its performance deteriorates as soon as the tasks are very different from each other. Tian et al. (2021) propose an online learning method for LSTM networks in vibration signal prediction.

Approach
Memory-aware synapses (MAS) by Aljundi et al. (2018) is a regularization-based continual learning approach for training a neural network across a sequence of consecutive tasks $T_n$. Given a neural network with parameters $\theta_i$ (i.e. network weights), changes of these parameters are penalized proportionally to their importance $\Omega_i$ w.r.t. previously learned tasks. The importance values are derived from the sensitivity of the respective output layer to the parameter changes. Considering a model trained on a task $T_A$ with training data examples $x_1, \ldots, x_N$ and learned parameter values $\theta_{A,i}$, the importance of each parameter for $T_A$ is then estimated using the gradients of the squared $\ell_2$-norm of the output layer:

$$\Omega_i = \frac{1}{N} \sum_{k=1}^{N} \left\lVert \frac{\partial\, \lVert O_A(x_k) - \mathbf{0} \rVert_2^2}{\partial \theta_i} \right\rVert$$

where $O_A$ is the network's output for the data examples and $\mathbf{0}$ is the zero vector of the same size as $O_A(x_k)$. In order to learn another task $T_B$, we create a separate output head for $T_B$ and use an additional loss term during the training:

$$L(\theta) = L_B(\theta) + \lambda \sum_{i} \Omega_i \left( \theta_i - \theta_{A,i} \right)^2$$

where $L_B$ is the regular loss of choice for the task and $\theta_i$ are the current parameter values during training. The hyperparameter λ is a positive real number representing the weighting of the additional Ω-loss term. The term $\theta_i - \theta_{A,i}$ describes the change in the network parameter value from its original value after learning $T_A$. This approach makes it possible to leverage the network's already acquired knowledge using shared parameters while minimizing changes of important network parameters.
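The importance estimation and the Ω-loss term described above could be sketched in PyTorch roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the split of the network into a shared `backbone` and a task-specific `head`, the restriction of the penalty to the shared parameters, and the function names `estimate_importance` and `mas_penalty` are all hypothetical choices made for this example.

```python
import torch
import torch.nn as nn


def estimate_importance(backbone, head, inputs):
    """Estimate importance values for the shared (backbone) parameters.

    For every example x_k, the gradient of the squared l2-norm of the
    network output w.r.t. each parameter is computed; the absolute
    gradients are then averaged over all N examples.
    Assumption: only the shared parameters are regularized.
    """
    params = [p for p in backbone.parameters() if p.requires_grad]
    omega = [torch.zeros_like(p) for p in params]
    for x in inputs:
        output = head(backbone(x.unsqueeze(0)))
        sq_norm = output.pow(2).sum()  # squared l2-norm of the output
        grads = torch.autograd.grad(sq_norm, params)
        for o, g in zip(omega, grads):
            o += g.abs()
    return [o / len(inputs) for o in omega]


def mas_penalty(backbone, omega, old_params, lam):
    """Regularization term: lam * sum_i omega_i * (theta_i - theta_old_i)^2."""
    params = [p for p in backbone.parameters() if p.requires_grad]
    penalty = sum(
        (o * (p - p_old).pow(2)).sum()
        for o, p, p_old in zip(omega, params, old_params)
    )
    return lam * penalty


# Usage sketch with random stand-in data (six inputs, one output).
torch.manual_seed(0)
backbone = nn.Sequential(nn.Linear(6, 20), nn.ReLU(), nn.Linear(20, 20), nn.ReLU())
head_a = nn.Linear(20, 1)
x_a = torch.randn(8, 6)

omega = estimate_importance(backbone, head_a, x_a)
theta_a = [p.detach().clone() for p in backbone.parameters()]
# While training task B, the total loss would be:
#   loss_b + mas_penalty(backbone, omega, theta_a, lam=1000.0)
```

Directly after training a task the penalty is zero, because the current parameters equal the stored ones; it only grows once training on the next task starts to move important weights.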
After training on task $T_B$, we then estimate the importance values $\Omega_{i,B}$ w.r.t. $T_B$ and accumulate them with the previously computed values:

$$\Omega_i \leftarrow \gamma\, \Omega_i + \Omega_{i,B}$$

Note that this procedure can easily be extended to arbitrary numbers of tasks. Figure 2 illustrates the approach for three tasks. In the equation above, γ represents another positive real number hyperparameter adjusting the impact of previous tasks. In our injection molding use case, each task is equally important in terms of forgetting, which is why we set γ = 1. In general, the optimal selection of the hyperparameters λ and γ depends on the learning problem, the tasks and the training data and must be evaluated individually for each use case. In Sect. 5.1 we show the results of our hyperparameter search.
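As a small illustration of the accumulation step, the following sketch assumes that the previously computed importance values are scaled by γ and the newly estimated values are added on top. That exact form of the update, the function name `accumulate_importance` and the toy numbers are assumptions made for this example.

```python
import torch


def accumulate_importance(omega_prev, omega_new, gamma=1.0):
    """Combine importance values per parameter tensor: gamma * previous + new.

    Assumption: gamma only scales the previously accumulated values.
    """
    return [gamma * o_prev + o_new for o_prev, o_new in zip(omega_prev, omega_new)]


# Toy values for two parameter tensors (hypothetical numbers).
omega_a = [torch.tensor([0.2, 0.0]), torch.tensor([1.0])]
omega_b = [torch.tensor([0.1, 0.5]), torch.tensor([0.0])]
omega = accumulate_importance(omega_a, omega_b, gamma=1.0)
```

With γ = 1, a weight that was important for any earlier task stays protected; a value of γ below 1 would let the influence of older tasks fade.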
The described MAS method creates randomly initialized output heads of the network for each new task. However, we want to take advantage of previously learned tasks when training on new tasks and propose an extension of the method based on the idea of transfer learning: instead of randomly initializing an output head before training a new task, we transfer (i.e. clone) the weights of an already trained head from a similar previous task and subsequently finetune it on the data for the new task. More precisely, we extend the MAS method by the following steps before training the model on a new task:
1. Iterate through all pre-trained output heads of the model and compute the loss of the network with each head on the new task data.
2. Identify the head with the lowest loss.
3. Create a new output head for the new task by cloning the parameter values of the identified output head.
4. Start training the model on the new task data according to MAS.
In the following, we refer to this approach as MAS-Cloning.
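The head selection and cloning of MAS-Cloning could be sketched as follows. This is an illustrative sketch only: the separation into a shared `backbone` and a list of `heads`, the function name `clone_best_head` and the use of the MSE as selection loss are assumptions, not details confirmed by the text.

```python
import copy

import torch
import torch.nn as nn


def clone_best_head(backbone, heads, x_new, y_new, loss_fn=None):
    """Clone the pre-trained head with the lowest loss on the new task data.

    Returns the cloned head, the index of the source head and all losses.
    """
    loss_fn = loss_fn or nn.MSELoss()
    with torch.no_grad():
        features = backbone(x_new)
        # Evaluate every existing head on the new data.
        losses = [loss_fn(head(features), y_new).item() for head in heads]
    # Pick the head with the lowest loss.
    best = min(range(len(heads)), key=losses.__getitem__)
    # Copy its weights into a new, independent head.
    return copy.deepcopy(heads[best]), best, losses
```

The cloned head is an independent copy, so finetuning it on the new task leaves the source head untouched. Selecting the head with the highest loss instead would be a one-line change (`max` in place of `min`).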
Our assumption behind this approach is that similar prediction tasks in the given injection molding use case also lead to similar parameter values for their respective output heads. Our experimental results in Sect. 5 support this assumption.

Data basis
Injection molding is a manufacturing process for producing plastic parts in a single production step (Kashyap and Datta 2015). During the process, plastic material is plasticized and subsequently injected into the cavity of the mold, where it is formed and cooled down to the final shape of the product. Injection molding involves complex behavior of interdependent process variables, with the mold design and the settings of the machine parameters in particular having a major impact on the quality of the product produced. The principal goal in this use case is to train a neural network for predicting the part quality, namely the maximum deformation under load, on the basis of six machine setting parameters (holding pressure level, holding pressure time, mold temperature, cooling time, melt temperature, and volume flow). Since the process behavior changes when producing a different product, we apply our proposed continual learning method to improve the performance of the model when applied to the data of a new part.
For evaluation purposes, we conducted molding simulations of different plastic brick specimens that provide the data basis. We designed 16 plastic bricks of different sizes by varying the number of studs on top of the brick (3, 4, or 6 studs per row, 1 or 2 rows) and the height of the bricks. Figure 3 illustrates two exemplary bricks with 3 × 1 studs and 4 × 1 studs. The simulations were performed with the software Cadmould 3D-F. For each part, i.e. each prediction task, we varied the parameters in a central composite experimental design with 77 examples.
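To illustrate how a central composite design with six factors can yield 77 runs, the sketch below builds one common variant in coded units: a full two-level factorial (64 corner points), two axial points per factor (12) and a single center point. Whether the study used exactly this variant, and which axial distance `alpha` was chosen, is not stated in the text and is assumed here.

```python
from itertools import product


def central_composite_design(n_factors, alpha=1.0):
    """Return coded design points: 2^k corners, 2k axial points, one center point."""
    corners = [list(p) for p in product((-1.0, 1.0), repeat=n_factors)]
    axial = []
    for i in range(n_factors):
        for value in (-alpha, alpha):
            point = [0.0] * n_factors
            point[i] = value  # vary one factor, keep the others at the center
            axial.append(point)
    center = [[0.0] * n_factors]
    return corners + axial + center


design = central_composite_design(6)
print(len(design))  # 64 + 12 + 1 = 77
```

Each coded value would then be mapped to the physical range of the corresponding machine parameter (e.g. holding pressure or mold temperature) before running the simulation.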

Experiments and pre-tests
In our experiments, with the exception of those in Sect. 5.1, we train the neural network incrementally over sequences of 16 consecutive tasks (one base task and 15 subsequent incremental tasks). All increments are conducted sequentially. The learning of the first (base) task involves an untrained network and is regarded as the zeroth increment.
In each increment, we train the model on the training data of a new task and compute its loss on a separate test set. The model trained in the previous increment serves as the starting point for the new increment.
To obtain reliable results, we run the experiments on ten different task sequences, each with a randomly selected task order, and each sequence with five different shuffles for training and holdout testing data set (thus 60 experiments in total).In addition, we also run the experiments on a reduced data set (45% of training data) to investigate how well the approaches perform on sparse data.
For the evaluation of the methods, we are mainly interested in their ability to learn new tasks as well as their capability to retain already learned tasks. For the former, we compute the mean squared error (MSE) on the test data of a new task. For the latter, we compute the backward loss for an increment, which defines the (positive or negative) effect of the increment on the model's performance on all previously trained tasks:

$$\mathrm{BL}_t = \frac{1}{t} \sum_{i=0}^{t-1} \left( L_{i,t} - L_{i,t-1} \right)$$

where t > 0 is the number of the increment of interest and $L_{i,j}$ refers to the test loss (MSE) of the ith task after the jth increment. A positive value, i.e. an increase in average loss, corresponds to a negative effect on previous tasks. Conversely, a negative value signifies that the tasks actually benefited from learning the new task. For a sequence of t increments, we can further compute the average backward loss over all its increments. We compare both approaches, the regular MAS method and MAS-Cloning, with two baseline approaches:
- From scratch: for each task a new neural network is trained from scratch, thus leading to 16 different networks by the end of the sequence.
- Finetuning: a neural network is further trained on new tasks via finetuning and without any regularization, thus leading to a single network with a single output head by the end of the sequence. Because the training in each increment uses an already pre-trained network without any restriction (regularization), we assume that the finetuning approach provides the best possible forward transfer and thus the lower limit for the average new task loss.
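The backward loss can be computed from a table of test losses, as in the following sketch. It assumes that the backward loss of increment t is the mean change in test loss of all previously trained tasks between increment t − 1 and increment t; the function name, the nested-list layout and the loss values are hypothetical.

```python
def backward_loss(losses, t):
    """Mean change in test loss of tasks 0..t-1 caused by increment t.

    losses[i][j] is the test MSE of task i after increment j
    (entries with j < i are unused, since task i is not yet trained).
    """
    if t <= 0:
        raise ValueError("t must be greater than 0")
    return sum(losses[i][t] - losses[i][t - 1] for i in range(t)) / t


# Hypothetical loss table for three tasks (rows) and three increments (columns).
losses = [
    [0.040, 0.050, 0.060],
    [None, 0.030, 0.035],
    [None, None, 0.045],
]
bl_1 = backward_loss(losses, 1)  # effect of increment 1 on task 0
bl_2 = backward_loss(losses, 2)  # effect of increment 2 on tasks 0 and 1
```

In this toy table both values are positive, meaning that each increment slightly worsened the performance on earlier tasks.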
We implement our methods using PyTorch, an open source library for deep learning (Pytorch 2020).Prior to the main evaluation, we conducted initial grid search based tests on the data to identify the best performing topologies and hyperparameters for the neural network (see Table 1 for parameter ranges), resulting in a two-layer multi-layer perceptron (MLP) with 20 neurons per layer and the Rectified Linear Unit (ReLU) activation functions.Each training is performed using early stopping with a patience of 50 epochs and a validation set size of 20% of the available training set.The training is performed using the Adam optimizer (Kingma and Ba 2017) with a batch size of 16 and a learning rate of 0.001 throughout.
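A network and training loop matching this description could look like the sketch below. Several details are assumptions made for illustration: "two-layer" is read as two hidden layers, the network has six inputs and one output, the best weights on the validation set are restored after early stopping, and the epoch limit `max_epochs` is a hypothetical safeguard that the text does not mention.

```python
import copy

import torch
import torch.nn as nn


def build_mlp(n_inputs=6, hidden=20):
    """MLP with two hidden layers of 20 ReLU units and one regression output."""
    return nn.Sequential(
        nn.Linear(n_inputs, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )


def train_with_early_stopping(model, x, y, patience=50, max_epochs=1000,
                              batch_size=16, lr=1e-3, val_fraction=0.2):
    """Train with Adam and stop once the validation loss stops improving."""
    n_val = max(1, int(val_fraction * len(x)))
    perm = torch.randperm(len(x))
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    best_val = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    wait = 0
    for _ in range(max_epochs):
        model.train()
        shuffled = train_idx[torch.randperm(len(train_idx))]
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            optimizer.zero_grad()
            loss = loss_fn(model(x[batch]), y[batch])
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(x[val_idx]), y[val_idx]).item()
        if val_loss < best_val:
            best_val, wait = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            wait += 1
            if wait >= patience:
                break
    # Assumption: restore the weights with the lowest validation loss.
    model.load_state_dict(best_state)
    return best_val
```

With the roughly 57 training examples per task, a batch size of 16 results in only a few parameter updates per epoch, which is one reason why a generous patience value is plausible here.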
The grid search, which was performed on the data sets using fivefold cross-validation, yielded an MSE of 0.037 for the best performing neural network. In order to better assess this performance, we performed the same experiments with polynomial regression of degrees two and three for comparison. The MSEs of these two methods are 0.046 and 0.044, respectively, thus significantly higher than the error of the neural network.

Hyperparameters for MAS
MAS has two main hyperparameters, namely λ and γ, which affect the abilities to learn new tasks and to avoid catastrophic forgetting. In order to find the best possible selection, we conduct an initial hyperparameter search by performing reduced experiments with 10 random sequences of seven tasks (one base task and six incremental tasks). Each experiment is further performed with five different data set shuffles. The box plots in Fig. 4 show the results for selected parameter combinations. Note that MAS with λ = 0 (green bars) does not use any regularization for the network training w.r.t. the Ω-loss.
The results clearly illustrate the trade-off between avoiding forgetting and learning new tasks. On average, the model without regularization performs best on new tasks (mean loss of 0.04), but suffers greatly from forgetting. In contrast, MAS effectively prevents forgetting for large λ values while only resulting in minor performance losses for new tasks. The best performing values are λ = 1000 and γ = 1 with a mean task loss of 0.05 and a mean backward loss of 0.04. We use these values in all subsequent experiments. It can also be seen that the loss values in the experiments vary significantly. We assume that this is mainly due to the fact that the plastic bricks and prediction tasks sometimes differ greatly from each other.

Table 2 (left columns) provides the computed average task losses and backward losses across all experiments when using 100% of the available training data. The task loss results show that although the MAS methods are constrained by regularization, they can learn new tasks similarly well as unconstrained models. In particular, MAS-Cloning performs best on new tasks, even slightly better than the finetuning approach. This also applies to the backward loss. While, as expected, simple finetuning causes the neural network to underperform on old tasks, both MAS approaches strongly prevent the network from forgetting when training on new tasks. Cloning also has a positive effect on forgetting. The reason might be that using pre-trained network weights leads to faster convergence of training and thus to fewer changes of the parameters. As a consequence, cloning leads to an average backward loss of almost 0.

Avoiding catastrophic forgetting
The difference between regularization (MAS) and finetuning with respect to forgetting becomes clear when looking at the network performances across the whole task sequences. Figure 5 shows how the task loss for the base task (increment 0) changes over all following increments. While the loss remains almost constant with MAS-Cloning and slightly increases for the regular MAS, finetuning significantly deteriorates the performance. Interestingly, its loss slowly decreases again with each task, a phenomenon we have observed often, but without the model ever reaching the original loss value again.

Training with sparse data
In order to evaluate the performances on sparse data, we conduct the same experiments with a reduced training data size of 45%. Table 2 (right columns) provides their results. On the one hand, they show that both MAS approaches still prevent the network from forgetting, though the backward loss values slightly increase. On the other hand, finetuning and MAS-Cloning perform best for new tasks. In particular, cloning the output head significantly improves the data efficiency of our method. While the regular MAS method suffers from the data reduction, cloning ensures that even with sparse data very good performances are achieved, which are almost as good as those of a model trained from scratch on 100% of the training data. The differences between the methods are also illustrated in Fig. 6a, which presents the loss values for each increment. It can be seen that finetuning and MAS-Cloning perform better in each increment than regular MAS, which sometimes performs very poorly, and than training a new network from scratch.
The finetuning approach performs almost as well with only 45% of the training data as with 100% of the data. This raises the question of how much data all other methods need for training a new task to achieve similarly good performances. For this purpose, we conduct further experiments by training the networks on different proportions of target training data. On this basis, we identify the proportion that is required to achieve the same test loss (with 10% margin) as with training on 100% of the data within the respective increment. Figure 6b depicts the proportions for all evaluated approaches. It shows that both finetuning and MAS-Cloning have a higher data efficiency than, for example, the regular MAS method. Starting with the first increment, MAS-Cloning requires only about 50% of the training data to achieve a similarly good performance as with 100% of the data. However, it can also be seen that its data efficiency does not improve over the course of the increments, so that a certain amount of training data is always necessary.
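The selection of the required data proportion can be expressed compactly. The sketch below assumes that "10% margin" means a test loss of at most 1.1 times the loss obtained with 100% of the data; the function name and the loss values are hypothetical and do not reproduce results from the experiments.

```python
def required_proportion(loss_by_proportion, margin=0.10):
    """Smallest data proportion whose test loss is within `margin` of the full-data loss.

    loss_by_proportion maps a training data proportion (0..1] to the test
    loss obtained with it and must contain the key 1.0.
    """
    threshold = loss_by_proportion[1.0] * (1.0 + margin)
    for proportion in sorted(loss_by_proportion):
        if loss_by_proportion[proportion] <= threshold:
            return proportion
    return 1.0


# Hypothetical test losses for one increment of one method.
example = {0.25: 0.080, 0.5: 0.043, 0.75: 0.041, 1.0: 0.040}
needed = required_proportion(example)
```

In this toy example the threshold is about 0.044, so half of the training data would already be sufficient.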

Cloning and task similarity
The results of our experiments show that cloning the output head of a similar previous task has a significant positive impact on the forward transfer of the neural network. To show that cloning must happen from a similar task, we compare our approach with a minor modification: instead of selecting the output head with the lowest loss on the new data for the initialization, we use the one with the highest loss (called negative cloning). Figure 7a shows its test results, where the experiments are conducted in the same way as before. Compared to MAS-Cloning, negative cloning not only provides worse performance, it actually results in a larger loss than MAS with randomly initialized weights. This indicates a knowledge transfer that has a negative impact on new tasks, a problem that is called negative transfer in deep learning research.
A closer look at the cloning approach reveals that the source task from which the output head is cloned for initialization is often similar to the new target task (similarity in terms of plastic bricks with similar dimensions). This leads us to the assumption that similar tasks also form similar network weights during training. To test this, we extract the trained weights (including the bias weight) of all output heads after a fully trained sequence of 16 tasks and visualize them in a two-dimensional space using the dimensionality reduction method UMAP (McInnes et al. 2018), see Fig. 7b. Each point in the plot accordingly represents one output head. For the representation of task similarity, we use the maximum flow distance of the plastic bricks. This parameter defines the distance the molten plastic must travel during molding and can be computed from the simulation for each brick. The figure shows that output heads for bricks that are more similar to each other are also closer to each other in the space and share similar network weights accordingly.

Conclusion and outlook
In this paper, we investigated a deep learning-based continual learning method for quality prediction across several different product variations in an injection molding use case. Our proposed approach extends the existing memory-aware synapses method by transferring pre-trained network weights. Thus, when learning a new product, the neural network is able to use network structures and weights already trained for previous products as well as train so far unused weights. Our extensive experiments showed that our approach can learn over multiple product variations without forgetting and does not require any data from previous tasks. In addition, it also performs better on sparse data than a model trained from scratch. We also found that the training of the network weights is related to the product characteristics. In future work, we will therefore investigate approaches to incorporate task-related information (such as product geometries) into the learning process. The goal here will be to develop a learning system that adapts its weights for new task variations based on their characteristics, thus increasing the system's generalizability and data efficiency. Furthermore, we will investigate the applicability of the proposed continual learning approach in other production use cases beyond injection molding. We expect that the findings and methods obtained are generalizable to other application fields.
3. Create a new output head for the new task by cloning the parameter values of the identified output head.4. Start training the model on the new task data according to MAS.

Fig. 2 MAS method for training the network across three consecutive tasks. Highlighted network edges represent weights that are important for the respective task output

Fig. 3 Used plastic brick parts with different numbers of studs on the top of the bricks and part heights: a 3 studs in a single row, b 4 studs in two rows with 50% height

Fig. 4 Performance comparison of different values for the hyperparameters λ and γ regarding the average loss after training new tasks (left) and the average backward loss for all previous tasks (right). Note that the regular MAS method was used here without cloning of the output heads

Fig. 5 Comparison of three approaches regarding forgetting. Lines represent the evolution of the test loss for the base task after all increments (mean curves and the interquartile ranges across all experiments)

Fig. 6 a Average new task losses across all increments while training on only a subset of training data (45%). b Average proportion of training data required to reach at least 90% of the accuracy as with training on 100% training data (mean values with a 95% confidence interval across all experiments)

Table 1 Parameters and values varied for initial hyperparameter grid search for the neural network

Table 2 (
left columns) provides the computed average task losses and backward losses across all experiments when using 100% of the available training data.The task loss

Table 2 Comparison of the approaches trained on 60 sequences with 57 training examples (100%, left) and 25 training examples (45%, right) per task