Introduction

Artificial intelligence (AI) is now recognised as a crucial key enabling technology (KET), facilitating the transition to Industry 4.0 through the development of AI-based industrial applications (Forni et al., 2023; Groshev et al., 2021). With the increased computational capacity of modern computing systems, the industrial sector can now leverage different intelligent applications including online manufacturing process monitoring (Caggiano et al., 2023).

Reinforcement learning (RL), a sub-field of machine learning, is particularly relevant in this regard, with its growing applications in various scientific disciplines (Li et al., 2023; Mattera & Mattera, 2023; Mattera et al., 2022; Polydoros & Nalpantidis, 2017). RL algorithms solve sequential decision-making problems by employing a trial-and-error approach, where an agent interacts with a system through actions and learns from experiences, receiving observations and rewards based on the dynamic environment (Sutton & Barto, 1998). Figure 1 illustrates a generic RL algorithm. In the case of industrial process control, the industrial process is the environment that exchanges observations, states and rewards with the agent operating on it through actions.

Fig. 1
figure 1

Reinforcement learning scheme: principal components. The environment exchanges observations, states and rewards with the agent that operates on it through actions

Reinforcement learning is a machine learning approach that differs from other algorithms since it does not require specific prior knowledge about how to solve a problem. Instead, it utilizes a trial-and-error approach and optimization algorithms to achieve the desired goal. The policy, which is the map learned by the agent that correlates the best actions to take with the observations, can be approximated using neural networks, leading to the so-called “deep reinforcement learning”. RL is commonly used to solve decision-making problems, including control problems (Recht, 2019).

While traditional controllers such as proportional-integral-derivative (PID) controllers are commonly used in industrial process control, their implementation can be challenging when dealing with non-linear systems and requires time-consuming design activities. RL offers an innovative approach to overcome such complexity issues. It enables controllers to learn by interacting with the process and incrementally improving the control behaviour through optimization algorithms. These approaches are referred to as model-free controllers, since the tuning procedure is carried out without a model of the system, regardless of its linearity, determinism, or multi-input/output nature. However, stability remains a significant concern when using RL in an industrial environment. While RL is suitable for developing high-level controllers used as reference generators, lower-level controllers are more critical, as they interact directly with the electronic parts of the system, as shown in Fig. 2.

Fig. 2
figure 2

High-level controllers are the “brain” of a hierarchical control module since they act as control planners, generating the optimal references for many low-level controllers (C#) using the knowledge that comes from the environment through perception modules

Moreover, the task of finding the optimal set of process parameters that maximize the quality of the final product is challenging and often requires conducting numerous time- and cost-consuming experiments. Thus, implementing an intelligent decision-making framework can significantly reduce the number of experiments required, resulting in cost savings and an increase in the overall intelligence of the system. In summary, while model-free controllers based on reinforcement learning are still relatively unexplored in the industrial domain, they have the potential to solve many challenges associated with industrial controller design and parameter optimization. To address these challenges, this research work introduces a comprehensive workflow for the implementation of reinforcement learning-based industrial process control. The proposed workflow involves the setup of a reward function, the development of reduced order models, and the construction of control policies. Notably, a novel process-based reward function is suggested. To illustrate the efficacy of this approach, a case study involving a wire arc additive manufacturing (WAAM) process is presented, and a WAAM simulator is developed to simulate the process in a realistic environment and to enable the generation of code for deployment on the motion platform controller.

“Background” section of this paper presents a detailed discussion on the topics of process control and reinforcement learning. “Reinforcement learning framework for industrial process control” section proposes a workflow for implementing reinforcement learning-based industrial process control and provides insights into data-driven modelling and agent design, such as constructing rewards and scaling the action space. Finally, an example of industrial application with reference to the WAAM process is provided in “Reinforcement learning application to wire arc additive manufacturing” section.

Background

Process control

Manufacturing processes can be challenging to control due to their non-linear and time-varying behaviour, the presence of constraints on control and state variables, and multivariable interactions. Despite this complexity, linear system design tools are often used to control manufacturing processes due to their rigorous stability and performance guarantees, as well as their ease of modelling and simulation. However, more advanced and complex strategies have been developed to achieve better performance than traditional controllers.

Since industrial processes suffer from coupling effects that preclude the use of traditional control theory, significant efforts are needed to decouple multi-input multi-output (MIMO) systems into several single-input single-output (SISO) systems (Liu et al., 2019). Once the MIMO system is transformed into a number of SISO systems, adaptive controllers can be developed to address the problem of non-linearity (Henson & Seborg, 1990). These adaptive controllers use active linearization around operating points and linear control tools or gain scheduling techniques to handle time-varying systems (Sereni et al., 2020).

Moreover, industrial processes are often characterized by multiple-input, multiple-output systems with uncertainties, which can be described as stochastic processes. To handle these uncertainties, robust controllers such as H-infinity control (Xu & Chen, 2002), linear quadratic Gaussian (LQG) control (Athans, 1971) and internal model control (IMC) (Rivera et al., 1986) have been proposed, and adaptive robust controllers (ARC) have been developed using mixed strategies (Yao et al., 1997). However, these methods do not handle the existing constraints on state and control variables. To address this issue, an evolution of LQG control called receding horizon control (RHC), also known as model predictive control (MPC), has been proposed. RHC involves solving a constrained optimization problem at each step, which makes it practical only for systems with low sampling rates, depending on the hardware used for computation, or when combined with techniques such as explicit MPC (Qin & Badgwell, 1997).

To develop industrial controllers using the presented techniques, it is crucial to have a clear understanding of the optimal process references during both the controller development and validation phases. As discussed in the introduction, all process parameters can usually be tuned by controlling single devices, such as electric motors for motion or orifice valves, which can often be treated as linear SISO systems. In these cases, once a model of the whole process is found, a high-level controller needs to be developed to optimise the process through optimal references that minimise errors. For this purpose, RHC controllers are extremely useful, since they solve quadratic optimisation problems that allow finding the optimal control variables considering state and control constraints according to the required process quality.

However, these controllers require a realistic system model to ensure that the optimization effort is not wasted. Additionally, high-performance hardware is necessary to meet the time constraints of a digital control system. In this complex scenario, for the development of new intelligent manufacturing systems, it is important to find a way to optimise processes with less computational effort. To this end, in this work a novel reinforcement learning approach is proposed for the optimisation and optimal control of industrial processes, and a suitable workflow to develop such applications in the industrial world is presented.

Reinforcement learning

The roots of reinforcement learning can be traced back to optimal control theory, including Receding Horizon Control and dynamic programming. These approaches involve optimizing the input trajectory obtained by interacting with the environment to maximize the value function \(V\left(s\right)\), which describes the goodness of being in a particular state. The Bellman equation (Bellman, 1966), reported in Eq. 1, is commonly used in reinforcement learning to describe the relationship between the value function \(V\left(s\right)\) and the obtained reward \(r(s)\).

$$ V\left( s \right) = r\left( s \right) + \gamma \sum P \left( {s^{\prime}|s,u} \right) \cdot V(s^{\prime}) $$
(1)

where s and s′ are the states at time t and t + 1 respectively, \(\gamma \) is the discount factor, u is the action vector and P is the transition probability of arriving at the next state, s′, given the current state s and control input u, representing the model of the system. Although the theoretical background of dynamic programming is elegant, this technique is hindered by the curse of dimensionality, since the computational cost grows exponentially with the number of states, and by the need for a realistic model of the system to achieve good results. To overcome these problems, different algorithms have been proposed to find the best action u to take in a given state s that maximises the value V, such as (Sutton & Barto, 1998):

  • Monte Carlo

  • Temporal difference learning

Today, most reinforcement learning algorithms are based on temporal difference (TD) learning, due to its simplicity and lower computational cost. TD methods combine the benefits of Monte Carlo methods, which enable learning from direct interaction with systems based on experience, and of dynamic programming’s bootstrapping approach, which involves updating estimates using other learned estimates without waiting for a final outcome. These characteristics allow a model-free agent to learn online during the interaction with the environment, without a model of the system and without waiting for the end of an episode to evaluate the final discounted cumulative reward, G. Conventionally, the computation of G for Monte Carlo methods is carried out as per Eq. 2, where it is defined as the sum of the rewards Ri obtained at each time step i. The calculation involves weighting each reward by a discount factor, γ, raised to the power of the time step i. The discount factor γ ranges from 0 to 1, and its value determines the relative significance of future rewards in comparison to immediate ones.

$$G={\sum }_{i=0}^{N}{\gamma }^{i}{R}_{i}$$
(2)

It is worth noting that in Monte Carlo methods, G cannot be estimated until the end of the episode, as it requires the full sequence of rewards to be computed. In fact, as reported in Eq. 3, the value V can also be computed only at the end of the episode. Conversely, in TD learning, the value function V(s) can be updated after each time step using Eq. 4, where \({\delta }_{t}\) is the TD error, defined as reported in Eq. 5.

$$V\left(s\right)=V\left(s\right)+\frac{1}{N}[G-V\left(s\right)]$$
(3)
$$V\left(s\right)=V\left(s\right)+\alpha {\delta }_{t}$$
(4)
$$ \delta _{t} = R_{{t + 1}} + \gamma V\left( {s^{\prime}} \right) - V(s) $$
(5)
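
To make the difference between Eqs. 3 and 4–5 concrete, the following minimal NumPy sketch contrasts a Monte Carlo value update, which must wait for the end of the episode, with a TD(0) update, which bootstraps after every step. The state encoding, learning rate and episode layout are illustrative assumptions, not taken from this work.

```python
import numpy as np

# Minimal sketch of the updates in Eqs. 2-5 on a small discrete state space.
n_states = 10
gamma, alpha = 0.9, 0.1
V = np.zeros(n_states)            # value estimates
visits = np.zeros(n_states)       # visit counter used by the Monte Carlo average

def monte_carlo_update(episode):
    """Eqs. 2-3: wait for the episode end, then update with the return G."""
    G = 0.0
    for s, r in reversed(episode):            # episode = [(state, reward), ...]
        G = r + gamma * G                      # discounted return, Eq. 2
        visits[s] += 1
        V[s] += (G - V[s]) / visits[s]         # running-average update, Eq. 3

def td0_update(s, r, s_next):
    """Eqs. 4-5: bootstrap after every single step."""
    delta = r + gamma * V[s_next] - V[s]       # TD error, Eq. 5
    V[s] += alpha * delta                      # value update, Eq. 4
```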

With reference to control tasks, the Q value, which describes the goodness of state-action pairs, is generally preferred over the V value, since it provides more information about the quality of control. The most important feature distinguishing reinforcement learning from other types of learning is that no prior knowledge is given to the agent regarding the best action to take, thus generating the need for active exploration. In this approach, it is crucial for any algorithm to find the best trade-off between exploration and exploitation. In general, during the initial stages of learning, the exploration phase occurs, in which the action space is explored using random policies. As the learning process progresses, there is an increasing emphasis on exploitation, where actions are directly derived from the optimal policy. Originally, RL was proposed to solve discrete problems, and in this case all the V or Q values were stored in tables. Since industrial processes are continuous physical systems, both the state and the action space are infinitely large, so methods based on the approximation of the value function were developed, as reported in Eq. 6 (Mes & Rivera, 2017), in which a function f is used to compute the value V from a continuous state s.

$$\widehat{{V}_{\pi }}(s)\approx {f}_{\theta }(s)$$
(6)

where \(\theta \) are the parameters of the function f that approximates the value V obtained by following the policy \(\pi \). In this case, the exact optimum is not guaranteed, but with a continuous and differentiable approximation function it is possible to use optimisation algorithms, such as gradient-based optimisation (Mes & Rivera, 2017), to estimate the Q or V values with low computational effort, using the TD error for the loss computation, as reported in Eqs. 7 and 8.

$$L=\frac{1}{N}{\sum }_{t}{\delta }_{t}^{2}$$
(7)
$$\theta =\theta -\alpha {\nabla }_{\theta }L$$
(8)

where L is the loss, N is the number of transitions collected over the episode with time step t, \(\alpha \) is the learning rate and \(\theta \) are the parameters of the function. Once the value function is learned, the optimal control policy \({\pi }^{*}\) follows directly, since the best action to take is:

$${\pi }^{*}\left(s\right)={\mathrm{argmax}}_{u}\left[Q\left(s,u\right)\right]$$
(9)

To deal with continuous problems, a simplification is also required for continuous actions, such as the discretisation of the action space. However, in this way, a strong dependence on the degrees of freedom of the system arises. In fact, considering a system with 6 degrees of freedom, such as a robotic manipulator with 6 joints, and discretising the action space in the simplest way, using only the lower limit, the upper limit and the central value for each joint, leads to the number of combined actions described in Eq. 10.

$${A}_{j}={d}^{DOF}={3}^{6}=729$$
(10)

Therefore, even such a coarse discretisation produces a very large action space, which makes it very difficult for the agent to explore it fully and renders the training process highly inefficient. For this reason, the policy also has to be approximated by a function, and neural networks are used in state-of-the-art methods. In the field of reinforcement learning, three different families of algorithms can be used, i.e. value-based, gradient-based and actor-critic algorithms.

In value-based algorithms, such as SARSA or DQN (Wang et al., 2013), a differentiable function with a softmax output is used to approximate the Q or V values, as reported in Eq. 6, and the policy is given by Eq. 9. Unfortunately, as demonstrated by Mnih et al. (2015), the adoption of these algorithms to solve real problems with huge action and state spaces leads to unstable training. Therefore, several adjustments, such as replay buffers and target networks, had to be introduced to stabilise learning (Van Hasselt et al., 2016).

Gradient-based methods form a family of algorithms in which the policy is approximated directly and optimised using gradient-based algorithms. An example is REINFORCE (Williams, 1992), in which the gradient of the objective in Eq. 11 is used in a gradient ascent optimization algorithm.

$$\nabla J\left({\pi }_{\theta }\right)={\sum}_{t=0}^{T}{R}_{t}{\nabla }_{\theta }{\text{log}}\left({\pi}_{\theta}\left(u|s\right)\right)$$
(11)

where \({\pi }_{\theta}\) is the policy approximated by a differentiable function with parameters \(\theta\), and \({\nabla }_{\theta}{\text{log}}\left({\pi }_{\theta}\left(u|s\right)\right)\) is the gradient of the log-probability of the distribution associated with the policy at time t, which is stochastic. This means there is a categorical distribution for a discrete action space and a diagonal Gaussian distribution for a continuous action space. In the case of continuous actions, the output dimension is \(2\cdot size\left(u\right)\), leading to \(size\left(u\right)\) outputs for the mean values and \(size\left(u\right)\) for the variances of the Gaussian distribution associated with each action, and different considerations have to be made for these two parameters depending on the choice of the output function. However, gradient-based methods have a major drawback associated with the high variance caused by the calculation of rewards at each time step, which results in training instability (Kakade & Langford, 2002). A common way to reduce the variance is to subtract from the rewards in the gradient of Eq. 11 a baseline that does not depend on the action taken by the policy. In this way, it does not introduce any bias into the policy gradient. This could be any action-independent quantity, but a good choice is to use the Q or V values as a baseline, as reported in Eq. 12.

$$\nabla J\left({\pi }_{\theta }\right)={\sum }_{t=0}^{T}\left({R}_{t}-V\left(s\right)\right){\nabla }_{\theta }{\text{log}}\left({\pi}_{\theta }\left(u|s\right)\right)$$
(12)
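
A minimal sketch of how the baselined gradient of Eq. 12 can be turned into a parameter update is given below, assuming a diagonal Gaussian policy network that outputs means and log-standard deviations and a separate value network for the baseline; the function and variable names are illustrative, and TensorFlow is used only because it is the library adopted later in this work.

```python
import numpy as np
import tensorflow as tf

# Sketch of a policy update based on the baselined gradient of Eq. 12.
# `policy_net` outputs [means, log-standard deviations] (2 * size(u) values)
# and `value_net` outputs the baseline V(s); both are assumed Keras models.
def policy_gradient_step(policy_net, value_net, optimizer, states, actions, returns):
    with tf.GradientTape() as tape:
        mean, log_std = tf.split(policy_net(states), 2, axis=-1)
        std = tf.exp(log_std)
        # log-probability of a diagonal Gaussian, summed over the action dimensions
        log_prob = tf.reduce_sum(
            -0.5 * tf.square((actions - mean) / std) - log_std
            - 0.5 * np.log(2.0 * np.pi),
            axis=-1)
        advantage = returns - tf.squeeze(value_net(states), axis=-1)   # R_t - V(s)
        # minus sign: gradient descent on this loss is gradient ascent on J
        loss = -tf.reduce_mean(tf.stop_gradient(advantage) * log_prob)
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
```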

If the baseline value function is itself approximated by a function, as in the case of DQN or SARSA, the structure of the agent becomes an actor (gradient-based) plus a critic (baseline approximation), as reported in Fig. 3. The gradient ascent formula in Eq. 13 is used to update the policy parameters, and the TD error with an L2 loss is used to update the critic parameters. A typical example is given by the Deep Deterministic Policy Gradient algorithm (Silver et al., 2014).

Fig. 3
figure 3

Actor-critic reinforcement learning algorithm. The actor approximates the optimal policy, while the critic approximates the V or Q function

$$\theta =\theta +\alpha \nabla J({\pi }_{\theta })$$
(13)

As an extension of the DDPG algorithm, the soft actor critic (SAC) (Haarnoja et al., 2018) algorithm introduced the concept of maximum entropy reward regularization. This approach incorporates the maximum entropy principle into the reward function to encourage the actor to take actions that are not only high-quality but also diverse. As a result, the agent is encouraged to explore the environment by taking actions that have uncertain outcomes.

Although the introduction of the baseline reduces the high variance of the returns, the problem of training instability requires additional adjustments. To overcome this problem, trust-region learning (TRL) was proposed by Schulman et al. (2015). TRL enhances the stability of training by limiting the size of the policy update at each learning iteration and preserves monotonic improvement guarantees by employing a second-order optimization algorithm based on the constraint outlined in Eq. 14.

$${D}_{KL}\left[{\pi }_{{\theta }^{\prime}}\left(u,s\right),{\pi }_{\theta }\left(u,s\right)\right]\le \delta $$
(14)

where \({\theta }^{\prime}\) are the parameters of the new policy, \(\theta \) those of the old policy, and \({D}_{KL}\) is the Kullback–Leibler divergence between the two stochastic policies. Solving the trust region policy optimization (TRPO) problem, i.e. an optimization with a hard constraint, becomes complicated for high-dimensional problems, so the same authors introduced proximal policy optimization (PPO) (Schulman et al., 2017), replacing the hard constraint with a clipping objective or a \({D}_{KL}\) regularization term, which reduces the mathematical and computational complexity while leaving the properties of the TRPO framework untouched by means of a “proof by analogy”. Nowadays, among the presented RL algorithms, DDPG and PPO are the state of the art, thanks to the numerous applications developed by researchers, and Multi-Layer Perceptron networks are used as function approximators for both the baseline and the policy (Mousavi et al., 2018).
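
For reference, the clipped surrogate objective that PPO uses as a soft replacement for the hard constraint of Eq. 14 can be sketched as follows; the clipping parameter and the variable names are illustrative, and the advantage estimate is assumed to be precomputed.

```python
import tensorflow as tf

# Sketch of the PPO clipped surrogate loss (negated, so it can be minimised).
def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    ratio = tf.exp(log_prob_new - log_prob_old)                # pi_new / pi_old
    clipped = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # taking the minimum removes any incentive to move the ratio outside the trust region
    return -tf.reduce_mean(tf.minimum(ratio * advantage, clipped * advantage))
```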

Reinforcement learning framework for industrial process control

To develop an effective reinforcement learning framework for industrial process control, several key ingredients must be considered. One of the most critical elements is data-driven modelling, which entails gathering and analysing large amounts of data from the process being controlled. These data are then used to build an accurate model of the system, which the RL agent employs to make decisions and take actions.

The formulation of the reward function is another key issue in RL for process control, as it determines the desired outcome for the controlled system. In industrial process control, the reward function must be carefully constructed to ensure that the system is optimized for the desired outcome, such as maximum production efficiency or quality control. This may involve incorporating multiple objectives, weighing different goals or incorporating feedback from human experts.

In addition to data-driven modelling and reward function design, there are other crucial aspects of RL that are relevant to process control. For instance, active exploration is a technique used to ensure that the RL agent can effectively learn and explore the state space of the system being controlled. Exploitation is another key element which involves making decisions based on the RL agent’s current understanding of the system to achieve the desired outcome.

Finally, output layer construction and scaling are critical factors that can improve the performance and stability of the RL framework and allow dealing with constraints on the control variables. Proper output layer construction and scaling ensure that the output of the RL agent is appropriately scaled and interpreted, enhancing the reliability and usefulness of the RL framework.

Reduced order modelling (ROM)

A model can be defined as a mathematical representation of the dynamic or static behaviour of a system. In the industrial field, physics-based modelling is a widely used approach, providing accurate input-to-output mapping for various industrial processes. However, since the physics-based dynamics of these processes is described by non-linear differential equations, many problems arise in the development of controllers, as reported in the previous section.

Many examples of physics-based modelling of industrial processes can be found in the literature, such as in welding (Tipi et al., 2015) or chemical processes (Harrison et al., 2006). In some cases more complex analyses, such as finite element analyses (FEA) (Liu et al., 2012) or computational fluid dynamics (CFD), are also used to study the behaviour of processes; moreover, static geometrical models can be used (Ding et al., 2015).

To address these difficulties, model reduction techniques are often employed to replace a given mathematical model with a lower-dimensional approximation that provides similar results. There are two main methodologies for model reduction: the black-box approach and the physics-based approach. Black-box approaches generate surrogate models using data-driven techniques, such as principal component analysis (Lang et al., 2009), artificial neural networks (Nabeshima et al., 1998) or transfer function parameterization (Juneja et al., 2020). In contrast, the physics-based approach involves manipulating the equations and structure of the larger model to find a lower-dimensional approximation.

In industrial settings where there is a high volume of data, data-driven techniques are often the preferred choice. Alternatively, a combination of approaches can be utilized. For instance, as highlighted in the study by Xiao et al., (2020), a realistic simulation model can be developed and calibrated using experimental data. This model can then be used to generate new synthetic data, such as data related to novel system configurations or operational points that may be challenging or hazardous to obtain in a real-world industrial setting. All this information can be leveraged to create a low-dimensional data-driven model that is valuable for control tasks, for example, by using it as a state observer or soft-sensor (Mattera et al., 2023).

In general, for control tasks, linear models are commonly used, such as regression (Xiong et al., 2014), autoregressive moving-average (ARMA) models (Xia et al., 2020), or transfer function identification. The reason for this is that it is easier to design a controller for a linear system under the assumption of no correlation between different states. On the other hand, when using a reduced-order model (ROM) to approximate the behaviour of a system, it is important to consider the correlation between the variables in the model. Ignoring this correlation can lead to biased parameter estimates (Srivastava & Giles, 2020). One way to address this issue is to estimate the coefficients of the system's equations all together, including the correlation between the error terms in the estimation procedure, for example by using seemingly unrelated regression equations (SURE).

Alternatively, a fully data-driven approach can be employed to extract a reduced-order model, utilizing techniques such as feedforward neural networks for static models or memory-based networks like recurrent neural networks for dynamic models. However, if a neural network is used as the reduced-order model, no explicit equation is available for traditional control design. In this case, there are two ways to solve the problem. One is to use neural network model predictive control (NN-MPC) (Piche et al., 2000), but this method is not suitable for fast dynamic systems due to the inference time and optimization required at each time step. The second is to treat the obtained simulation model as the real system and use a model-free controller, such as a reinforcement learning agent. Once a model of the system is obtained in a data-driven manner, and a deep reinforcement learning agent based on a neural network is chosen as the controller, it is important to find an optimal reward and to configure the output layer of the neural network that approximates the control policy. Therefore, designing an optimal reward function and appropriately configuring the output layer of the neural network are crucial aspects of the control design process. More research is needed in this area to apply RL techniques to industrial control systems.

In summary, physics-based modelling and model reduction techniques are important tools for developing controllers for industrial processes. Black-box and physics-based approaches are two common methodologies for model reduction, and linear models are often used for control tasks. However, more advanced data-driven techniques like neural networks enable improved modelling results. As a result, neural network model predictive control (NN-MPC) and reinforcement learning (RL) controllers are increasingly being explored for industrial applications. RL controllers, in particular, hold an advantage over MPC-based controllers by overcoming their computational time requirements.

Reward function

The reward function is the most important ingredient in a reinforcement learning algorithm, since it should reflect the desired outcome of the considered task. In control systems, the main goal is to force certain variables to follow prescribed trajectories or setpoints. Generally, a reward function is more successful if the agent has a gradient to work with, as this allows it to better understand when it gets closer to or further from the target. Matignon et al. (2006) proposed a Gaussian-based function, reported in Eq. 15, built so that R is uniform for states s far from the goal \({s}_{g}\), to avoid unlearning problems, while there is a reward gradient in a zone around the goal. Specifically, \(\beta \) adjusts the amplitude of the function and \(\sigma \) the standard deviation, which defines the reward gradient influence area.

$$R\left(s, u, s{\prime}\right)=\beta {e}^{-\frac{d{\left(s, {s}_{g}\right)}^{2}}{2\sigma }}$$
(15)

Spielberg et al. (2017) proposed the reward function reported in Eq. 16, which works better for process control problems since they are based on set-point tracking.

$$ R\left( s,u,s^{\prime} \right) = \begin{cases} c & if\; \left|\sum_{i} s_{g}^{i} - s^{i} \right| < \varepsilon \\ - \left|\sum_{i} s_{g}^{i} - s^{i} \right| & otherwise \end{cases} $$
(16)

When the process output is closer to the set-point than the error tolerance defined by \(\epsilon \), the agent receives the highest possible reward c; otherwise, it receives a negative reward whose magnitude is equal to the deviation from the set-point. From a learning perspective, if there are regions of the state space where the reward is primarily negative, the agent may attempt to reach a terminal state as quickly as possible to avoid accumulating large negative rewards. On the other hand, negative values can be added to the reward to accelerate the learning phase, so the negative amplitude of the reward needs to be carefully selected. In a control problem, these negative quantities need to be proportional to the errors or to the process power consumption. In conclusion, it is crucial for a reward function to possess a gradient that provides insight into the deviation of the current trajectory from the optimal solution. This gradient must be bounded both above and below by predefined values to prevent issues related to unlearning. Additionally, the incorporation of a negative value in the reward function can potentially accelerate the learning process.

Based on these considerations, in this research work a new polynomial reward function is proposed, as reported in Eq. 17, to deal with process parameters optimisation. In this formulation, \({\Delta }_{1}<{\Delta }_{2}\) are the reward bounds, \(\delta =\left|\sum {s}_{g}-s\right|\) is the absolute value of the error, and \(\epsilon \) is the error threshold parameter:

$$ R\left( s,u,s^{\prime} \right) = \begin{cases} \Delta_{2}\, \xi & if\; \delta < \varepsilon \\ - \left( \delta^{2} + k\delta + \Delta_{1} \right)\xi & otherwise \end{cases} $$
(17)

The values of \({\Delta }_{1}\) and \({\Delta }_{2}\) must be chosen in conjunction with other elements to be included in the reward calculation, such as power consumption or the presence of defects. For example, \({\Delta }_{2}\) can be multiplied by a variable \(\xi \) that continuously varies within the range \(0<\xi <1\) based on factors like power consumption or the quality of the final component. The value of k is determined by solving Eq. 18 for \(x=\epsilon \) and \(y={\Delta }_{1}\), as reported in Eq. 19. In this way, when the error threshold is reached, the reward is equal to the lower bound value.

$$y={x}^{2}+kx+ {\Delta }_{2}$$
(18)
$$k= \frac{{\Delta }_{2}-{\Delta }_{1}-{\epsilon }^{2}}{\epsilon }$$
(19)

For example, let’s consider a drilling operation where the optimal outcome is achieved when the diameter (d) is precisely equal to the nominal value and there are no burrs (b) present. These two conditions can be formalised as two setpoints \(d={d}_{nom} ;{b}_{ref}=0\) and, in addition, the parameter \(\xi \) can be set equal to the ratio between the instantaneous power consumption P and the maximum power allowable, i.e.\(\xi =\frac{P}{{P}_{max}}\).
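
A minimal sketch of the proposed reward of Eqs. 17–19, including the quality factor ξ used in the drilling example, could look as follows; the numerical bounds in the usage comment are purely illustrative.

```python
# Sketch of the proposed reward of Eqs. 17-19. delta is the absolute tracking
# error, xi a quality/consumption factor in (0, 1], eps the error threshold
# and delta_1 < delta_2 the reward bounds; the values in the example call are
# purely illustrative.
def process_reward(delta, xi, eps, delta_1, delta_2):
    k = (delta_2 - delta_1 - eps**2) / eps                 # Eq. 19
    if delta < eps:                                        # inside the tolerance
        return delta_2 * xi                                # Eq. 17, upper branch
    return -(delta**2 + k * delta + delta_1) * xi          # Eq. 17, penalty branch

# drilling example: error on the diameter, xi from the power consumption
# r = process_reward(delta=abs(d_nom - d), xi=P / P_max, eps=0.1, delta_1=1.0, delta_2=10.0)
```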

Policy approximation

As described in the previous section, neural networks have emerged as popular policy and value approximators in modern reinforcement learning techniques such as DDPG and PPO to tackle continuous state-action space problems (Fig. 3). However, industrial processes impose constraints on the action space, necessitating a bounded activation function in the output layer of the actor. The sigmoid and tanh functions are the two viable S-shaped bounded curves that can serve as activation functions in the output layer of a neural network employed to approximate the optimal control policy, although both suffer from vanishing gradients for large inputs. By examining the trends of the two functions in Fig. 4, it is evident that tanh is symmetric about the origin. This property can be leveraged to expedite training by normalizing the input data, which leads to faster convergence when preceded by a batch normalization layer (Sola & Sevilla, 1997).

Fig. 4
figure 4

Comparison between sigmoid and tanh activation function. The symmetry about the origin of the hyperbolic tangent helps the learning process

Fig. 5
figure 5

Scheme of the Wire Arc Additive Manufacturing process: a motion platform equipped with a welding system is used to deposit fused metal on a determined path

Therefore, the best choice to speed up the learning process is to use a hyperbolic tangent function. At this point, a scaling operation needs to be used to transform the maximum value of the network output, namely + 1, into the upper bound, and at the same time to transform the minimum value, namely − 1, into the lower bound, as reported in Eq. 20.

$$nv=\frac{ov-o{v}_{max}}{o{v}_{max}-o{v}_{min}}\left(n{v}_{max}-n{v}_{min}\right)+n{v}_{max}$$
(20)

where nv is the new value, defined in the range \([n{v}_{min}, n{v}_{max}]\), and ov is the old value, defined in the range \([o{v}_{min}, o{v}_{max}]\). Since \(o{v}_{max}=1\) and \(o{v}_{min}=-1\), the new value, namely the final bounded action, is described in Eq. 21.

$$u=\frac{\pi \left(s\right)-1}{2}\left(UB-LB\right)+UB$$
(21)

where UB and LB are the upper and lower bounds of the process parameters.
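
In practice, the scaling of Eq. 21 can be implemented as a simple element-wise operation on the actor output; the sketch below assumes NumPy arrays for the bounds, and the values in the usage comment are illustrative placeholders.

```python
import numpy as np

# Sketch of the scaling in Eq. 21: the actor output, bounded in [-1, 1] by the
# tanh activation, is mapped onto the admissible range [LB, UB] of each
# process parameter.
def scale_action(pi_s, lb, ub):
    pi_s = np.asarray(pi_s, dtype=float)
    return (pi_s - 1.0) / 2.0 * (ub - lb) + ub     # pi_s = +1 -> UB, pi_s = -1 -> LB

# e.g. u = scale_action(actor_output, lb=np.array([18.0, 2.5, 280.0, 12.0]),
#                                     ub=np.array([24.0, 7.0, 561.0, 20.0]))
```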

In the context of developing a first-tentative architecture for an RL framework in industrial process control, some guidelines are proposed in this work. First, it is recommended to avoid using a deep architecture for the actor, as this can lead to the vanishing gradient problem (Glorot & Bengio, 2010). Second, it is suggested to use two different input branches for the critic when estimating the Q-value, namely one for the state and one for the action. The state branch can be constructed similarly to the policy network, and a concatenation layer can be used to merge the two branches. Furthermore, to ensure effective neural network training, it is recommended to use the hyperbolic tangent activation function for the actor and the rectified linear unit (ReLU) for the critic. To initialize the weights of the actor, Xavier uniform weight initialization is suggested, starting with low values to avoid rapid saturation. For the critic, He normal initialization is recommended to ensure efficient training. These suggestions can serve as a starting point for the construction of an effective RL architecture for industrial process control (Datta, 2020).

In addition to the above considerations, there are other important aspects to be taken into account in the development of RL frameworks. One of these is the gradient amplitude, which can lead to activation saturation during the training process; therefore, clipping the gradient amplitude can be a useful technique to mitigate this problem. Moreover, exploration and exploitation strategies are also critical in the learning process, and finding a proper balance between the two is important for effective learning. In the case of DQN algorithms, the ϵ-greedy exploration technique is commonly employed, where a random action is taken with a probability of ϵ to explore the whole action space and obtain an unbiased view of the problem. The value of ϵ is typically reduced during the training process according to defined strategies, so that more exploration is carried out at the beginning of the learning process, and only the greedy policy is employed at the end of the training. For continuous action space problems, such as those addressed by DDPG algorithms, a similar myopic approach is used. While myopic exploration is known to have exponential sample complexity in the worst case (Osband et al., 2019), it remains a popular choice in practice, because it is easy to implement and works well in realistic scenarios (Dann et al., 2022). In DDPG, undirected exploration is applied by adding noise sampled from a noise process N, as described in Eq. 22.

$${\pi }^{\prime}\left(s\right)={\pi }_{\theta }\left(s\right)+N$$
(22)

where \(\theta \) are the parameters of the policy network and N is an Ornstein–Uhlenbeck correlated noise process. In the Twin Delayed Deep Deterministic policy gradient algorithm (TD3) presented in (Fujimoto et al., 2018), the authors proposed to use an uncorrelated noise for exploration as they found that the originally proposed noise process offered no performance benefits. Accordingly, a white Gaussian noise N can be added to give the agent exploration capability during the training, as described in Eq. 23.

$${\pi }^{\prime}\left(s\right)={\pi }_{\theta }\left(s\right)+N(0, \sigma )$$
(23)

Here \(\sigma <1\) and decreases as the number of episodes grows. For a MIMO control problem, a multivariate diagonal Gaussian process with zero mean and variable standard deviation may also be used. After training, hence during the validation and deployment phases, the exploration noise is removed.
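
A possible implementation of this decaying Gaussian exploration noise (Eq. 23) is sketched below; the initial standard deviation, decay rate and clipping to the tanh output range are illustrative choices, not values used in this work.

```python
import numpy as np

# Sketch of the decaying Gaussian exploration noise of Eq. 23.
def noisy_action(policy_output, episode, sigma_0=0.3, decay=0.995):
    sigma = sigma_0 * decay**episode                       # sigma < 1, shrinking with training
    noise = np.random.normal(0.0, sigma, size=np.shape(policy_output))
    return np.clip(policy_output + noise, -1.0, 1.0)       # stay inside the tanh output range

# at validation/deployment time the noise is dropped: u = policy_output
```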

Reinforcement learning application to wire arc additive manufacturing

In the subsequent section, a case study is presented to show the application of the reinforcement learning approach to a wire arc additive manufacturing (WAAM) process. To develop a control strategy for the WAAM process, experimental data are first collected to build a reduced-order model of the system. This model captures the essential dynamics of the process and allows for efficient simulation-based optimisation. This model is then used to train a deep deterministic policy gradient (DDPG) controller, which learns how to optimise the process parameters for depositing a steel weld bead with a pre-defined geometry value.

The DDPG controller operates in a continuous action space and its policy is represented by a neural network. The actor network takes the current state of the process as input and outputs the optimal set of process parameters. The critic network evaluates the performance of the policy by estimating the Q-value for the current state-action pair. During training, the DDPG controller explores the state-action space using a myopic approach, balancing exploration and exploitation, as described in the previous section.

Overall, this example demonstrates the potential of reinforcement learning in industrial process control. By developing accurate models and training controllers using machine learning techniques, it is possible to improve the efficiency and quality of complex manufacturing processes such as WAAM.

Wire arc additive manufacturing

Wire-arc additive manufacturing, schematically represented in Fig. 5, is a metal additive manufacturing process belonging to the category of directed energy deposition (DED) technologies. The adoption of WAAM is growing rapidly thanks to its advantages in terms of short production times for large metal components, low production cost (Wu et al., 2018a, 2018b) and reduced power consumption (Priarone et al., 2020).

In the literature, it has been demonstrated that the geometry of the deposited layer in the WAAM process, shown in Fig. 6, is highly dependent on the selection of process parameters (Xiong et al., 2014). Moreover, constant process parameters do not always lead to the desired layer geometry due to the impact of different mechanisms, such as heat accumulation (Wu et al., 2018a, 2018b). Thus, closed-loop control is essential for obtaining the final component with the highest possible quality. However, like welding, the wire arc additive manufacturing process presents several challenging issues from a control perspective, such as its MIMO non-linear stochastic behaviour and the need for managing constraints on control variables and state variables, such as the layer geometry. As reported by Mattera et al., (2023), several authors have proposed different control strategies to reduce the error between measured layer geometry and a reference from a path planner. In this context, a controller based on reinforcement learning is proposed in this work. The controller aims to optimise the process parameters in simulation to deposit a steel weld bead with a pre-defined geometry value, using a deep deterministic policy gradient algorithm.

Fig. 6
figure 6

Wire arc additive manufacturing bead geometry: layer height (h) and width (w)

Experimental set-up

The experimental set-up comprises a three-axis motion station equipped with a SELCO Quasar 400 MSE welding machine and a 430 Smart wire feeder unit. The welding torch was fixed to the motion platform at an angle of 90° with respect to the travel deposition axis. The motion platform was used to deposit several layers to obtain wall structures, as shown in Fig. 7, by changing the process parameters, such as the welding voltage (V), the wire feed speed (WFS), the gas flow rate (GFR), the welding speed (WS) and the nozzle-to-plate distance (NTPD).

Fig. 7
figure 7

Weld beads obtained using fixed WFS, CTWD and welding voltage but different torch speed values: a torch speed of 561 mm/min, b torch speed of 280 mm/min

In the experimental tests, an S355J2 plate was employed as substrate and an AWS ER 70S-6 copper-plated solid wire with a diameter of 1 mm, typically used for metal active gas (MAG) welding, was used as feedstock. This wire was produced by Elbor under the name SG2. The shielding gas employed for all the experiments was a mixture of argon and CO2, namely M21 according to ISO 14175, used with different gas flow rates (GFR) depending on the nozzle-to-plate distance: as the nozzle-to-plate distance increased, the gas flow rate was also increased to ensure proper shielding of the melting pool. In general, for WAAM applications, a wire diameter of 1.2 mm or greater is typically used since it allows for higher productivity (Oliveira et al., 2022). However, in this work, a wire diameter of 1 mm was chosen with the aim of obtaining components with a more precise geometry. Accordingly, the process parameters were selected following the recommendations of the wire manufacturer, since limited literature data are available for this wire type. Specifically, the process parameters were selected to obtain a short arc or spray arc transfer mode during a standard constant-voltage welding process. The reduced heat input of the short arc transfer mode allows for lower final distortion of the component, while the high heat input of the spray arc transfer mode allows for a higher deposition rate and reduced production time. Once the wire feed speed and welding voltage were selected to achieve the desired metal transfer mode, the nozzle-to-plate distance and welding speed were adjusted to reduce heat accumulation in the material or to increase productivity. The heat input produced during welding can be evaluated using Eq. 24, and the deposition rate is strongly dependent on the heat produced by the arc.

$$Q=\frac{{V}_{arc}{I}_{arc}}{TS}$$
(24)

It is widely recognized in the literature that the wire feed speed (WFS) is proportionally related to the arc current (Tipi et al., 2015), but the nozzle-to-plate distance also plays a crucial role in the final value of the arc current. As a matter of fact, by increasing the nozzle-to-plate distance through the motion platform, an increase in the arc length can be achieved. As a result, the increased arc resistance leads to a reduction in arc current and subsequently affects the final heat input to the material (Henckell et al., 2020).

Therefore, once the wire arc additive procedure specification (WAAPS) was selected in terms of welding voltage, wire feed speed and welding speed, the nozzle-to-plate distance and the gas flow rate were varied. Finally, the experimental tests were conducted using four welding voltage values (18, 20, 22 and 24 V), five wire feed speed values (2.5, 3.5, 4.5, 6 and 7 m/min), two values of nozzle-to-plate distance (12 and 20 mm) and gas flow rate (15 and 18 l/min) and three torch speed values (280, 487 and 561 mm/min), depositing the raw material of the wire on top of a substrate plate. The substrate plate was a low-carbon manganese steel plate with dimensions of 150 × 50 × 10 mm. The chemical composition and mechanical properties of the feedstock are reported in Table 1.

Table 1 Chemical composition and mechanical properties of the used wire

The experimental campaign involves the deposition of 18 walls, each comprising 10 layers, and an additional 6 single layers on a substrate, resulting in a total of 186 layers being deposited. Throughout the wall deposition process, three distinct interpass temperature strategies were utilized in order to investigate their impact on the geometry of the individual layers.

During the deposition, all the process parameters were kept constant. An optical microscope was used to measure the steady-state values of layer width and height, as depicted in Fig. 8. The resulting data, including the process parameters employed and the corresponding geometrical measurements, were recorded in a CSV file.

Fig. 8
figure 8

Manufactured specimens observed under an optical microscope. The procedure allowed the geometry of the layers to be determined

WAAM geometry reduced order model

Li et al. (2022) demonstrated that the process of layer formation in wire arc additive manufacturing (WAAM) involves three distinct stages—melting, flowing, and solidification—and is characterized by dynamic behaviour. In order to approximate the bead formation process with a lumped model, the double ellipsoid heat source model developed by Goldak and Akhlaghi (2005) can be considered, as shown in Fig. 9. As a matter of fact, the geometry of the bead at point 1, \({g}_{1}\), depends upon the previous geometry, \({g}_{0}\), as well as the process parameters used to arrive at point 2. Similarly, the geometry of the bead at point 2, \({g}_{2}\), depends upon the previous geometry and the process parameters used to arrive at point 3, and so on, in accordance with the Goldak heat model.

Fig. 9
figure 9

Goldak double ellipsoid heat source model. The geometry at a point behind the welding torch depends on the process parameters used to arrive at the next point

The Goldak ellipsoid heat model employs a set of coefficients to describe the heat distribution along each of the three principal axes of the ellipsoidal heat source. Four coefficients are used to describe the heat distribution along the major, intermediate, and minor axes of the ellipsoid, denoted as \({c}_{f}, {c}_{r}, a, b\). These coefficients are related to the shape of the heat distribution, and in Eqs. 25, 26, x, y, and z refer to the local coordinate system fixed to the moving heat source. Finally, the parameters \({f}_{f}\) and \({f}_{r}\) define the fractions of the heat deposited in the front and rear parts, respectively.

$${Q}_{f}\left(x,y,z\right)=\frac{6\sqrt{3}{f}_{f}Q}{ab{c}_{f}\pi \sqrt{\pi }} {e}^{-\frac{3{y}^{2}}{{b}^{2}}} {e}^{-\frac{3{z}^{2}}{{a}^{2}}} {e}^{-\frac{3{x}^{2}}{{c}_{f}^{2}}}$$
(25)
$${Q}_{r}\left(x,y,z\right)=\frac{6\sqrt{3}{f}_{r}Q}{ab{c}_{r}\pi \sqrt{\pi }} {e}^{-\frac{3{y}^{2}}{{b}^{2}}} {e}^{-\frac{3{z}^{2}}{{a}^{2}}} {e}^{-\frac{3{x}^{2}}{{c}_{r}^{2}}}$$
(26)
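
A sketch of Eqs. 25–26 as plain functions is reported below for reference; the coordinates refer to the local frame of the moving heat source, and all parameter values (Q, a, b, c_f, c_r, f_f, f_r) are left to the user, with no calibrated values implied here.

```python
import numpy as np

# Sketch of the Goldak double-ellipsoid heat source of Eqs. 25-26.
def goldak_front(x, y, z, Q, a, b, c_f, f_f):
    coeff = 6.0 * np.sqrt(3.0) * f_f * Q / (a * b * c_f * np.pi * np.sqrt(np.pi))
    return coeff * np.exp(-3.0 * y**2 / b**2 - 3.0 * z**2 / a**2 - 3.0 * x**2 / c_f**2)

def goldak_rear(x, y, z, Q, a, b, c_r, f_r):
    coeff = 6.0 * np.sqrt(3.0) * f_r * Q / (a * b * c_r * np.pi * np.sqrt(np.pi))
    return coeff * np.exp(-3.0 * y**2 / b**2 - 3.0 * z**2 / a**2 - 3.0 * x**2 / c_r**2)
```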

Using this assumption, it is possible to obtain a data-driven lumped dynamic model of the system which simulates the bead formation behind the welding torch. It was experimentally observed that, when the parameters employed to reach point \({g}_{1}\) remain unchanged, the bead geometry can be assumed to remain the same as that at point \({g}_{0}\). However, if the parameters used to move between points are changed, it takes approximately 1 s for the variation in the bead formation process to complete, given the specific materials and process parameters used. Hence, a time constant of \(\tau = 0.5\, s\) can be assumed for the system in this scenario. In this application, the width and height formation processes are assumed to follow first-order dynamics. After determining the steady-state value, \({x}_{ss}\), based on the process parameters, this value is used as the input of a state-space model, as shown in Eq. 27. This ultimately generates the final reduced-order dynamic model of the WAAM process.

$$\left\{\begin{array}{c}\dot{x}= -5\cdot x+{x}_{ss} \\ y=5\cdot x \end{array}\right.$$
(27)

For simulation purposes, the system is discretised with a sample time of \(TS=\frac{\tau }{10}\) and solved with the forward Euler method, as shown in Eq. 28.

$$\left\{\begin{array}{c}{x}_{k+1}= \left(1-0.5\cdot \tau \right)\cdot {x}_{k}+0.1\cdot \tau \cdot {x}_{ss} \\ {y}_{k}=5\cdot {x}_{k} \end{array}\right.$$
(28)
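
The discretised model reduces to a one-line update per time step, as in the following sketch, which assumes τ = 0.5 s and TS = τ/10 as discussed above.

```python
# Sketch of one simulation step of the discretised first-order model (Eq. 28).
tau = 0.5
TS = tau / 10.0

def rom_step(x, x_ss):
    """Forward Euler step of Eq. 27; returns the new state and the output y = 5 * x."""
    x_next = x + TS * (-5.0 * x + x_ss)      # equals (1 - 0.5*tau) * x + 0.1 * tau * x_ss
    return x_next, 5.0 * x_next
```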

Finally, a neural network is used to approximate the steady-state value for both layer width and height once the process parameters are given in input. The proposed network is a shallow network with:

  • 5 normalised inputs, namely the welding voltage, welding speed, wire feed speed, nozzle-to-plate distance and gas flow rate.

  • 15 neurons in the hidden layer, passed through a hyperbolic tangent activation function.

  • 2 neurons in the output layer, passed through a rectified linear unit activation function.

The network was trained with the root mean square propagation (RMSprop) algorithm with a learning rate \(\alpha = 1\cdot {10}^{-4}\), momentum \(\mu =0.1\) and a discounting factor \(\rho =0.9\) for 2000 epochs using the collected dataset. Specifically, 70% of the data (130 layers) was utilized for training, while the remaining 30% (56 layers) was allocated for testing. This division resulted in a final mean squared error (MSE) of 0.0018 on the test dataset, and a portion of the samples can be found in Table 2.

Table 2 Test dataset
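
A minimal Keras sketch of this shallow steady-state network, with the training options reported above, is given below; the dataset loading and input normalisation are omitted and the variable names are illustrative.

```python
import tensorflow as tf

# Sketch of the shallow steady-state network: 5 normalised process parameters in,
# 15 tanh hidden neurons, 2 ReLU outputs (steady-state width and height).
rom_net = tf.keras.Sequential([
    tf.keras.layers.Dense(15, activation="tanh", input_shape=(5,)),   # hidden layer
    tf.keras.layers.Dense(2, activation="relu"),                      # width and height
])
rom_net.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4, rho=0.9, momentum=0.1),
    loss="mse",
)
# rom_net.fit(x_train, y_train, epochs=2000, validation_data=(x_test, y_test))
```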

WAAM simulator development

Process simulators are engineering software tools that help technicians to study and optimize the behaviour of a system, reducing time and cost. They are typically based on mathematical models that describe the behaviour of the system being studied, which can rely on differential equations, experimental data or a mix of them, as presented in the previous section of this work. Once the mathematical model has been developed, the process simulator allows engineers and scientists to simulate the behaviour of the system under a variety of conditions. This can be used to optimize the design of the system, identify potential issues and study the effects of changes in operating conditions. A process simulator of advanced automatic manufacturing processes, such as a robotic welding simulator or the proposed WAAM simulator, usually combines a robotic simulator with the process model. The robotic simulator allows engineers to create a virtual environment where robots can be programmed and tested before the physical construction process begins. Therefore, by combining a robotic simulator with data-driven modelling techniques, it is possible to create process simulators that allow technicians to quickly test and optimize complex processes, representing an essential tool for modern manufacturing and engineering.

In this work, a process simulator based on the integration of a data-driven modelling technique and RoboDK is proposed. RoboDK is a robot simulator that eases the sim-to-real transition thanks to the possibility of generating Python-based robot programs using post-processors. Once the robotic cell is designed and calibrated in RoboDK, a realistic digital kinematic model is available, as reported in Fig. 10.

Fig. 10
figure 10

Comparison between the real (left) and digital (right) automatic welding cell

By utilizing the visualization tools provided by RoboDK in conjunction with the developed data-driven model of the bead geometry, it is possible to simulate and visualize the results of actual depositions with specific process parameters, as reported in Fig. 11. In this case, the neural network-based model of the process is tied to the material and welding technology used, i.e. mild steel wire deposited with a constant-voltage MAG process, while a digital model of a more flexible motion platform is used to increase the flexibility of the manufacturing cell.

Fig. 11
figure 11

Comparison between the simulated (left) and real (right) specimens

The developed WAAM simulator can be used in different ways and for several purposes: to manually optimize the process parameters to obtain the desired layer geometry, in combination with a software module that generates the path; with additional software modules, for the offline optimization of process parameters; or to test via simulation the control policy of an AI controller, such as the one presented in this work, or of any other control design technique. Therefore, in the presented example, a new motion platform is simulated using the same welding equipment used to conduct the experimental tests.

Deep deterministic policy gradient controller

The deep deterministic policy gradient (DDPG) algorithm uses an actor-critic approach to approximate both the action-value function and the policy (Silver et al., 2014) and is used to solve continuous action space problems. DDPG, like DQN, uses a memory buffer of predefined size, allowing it to store all the transitions given by the tuple \(({s}_{t}, {u}_{t}, r, {s}_{t+1})\). The actor and critic weights are then updated by randomly sampling from the replay memory at each step. Another innovation introduced in DDPG is the use of target networks, namely the target critic network and the target actor network, which copy the parameters of the original critic and actor networks and whose weights are updated using a soft update through a parameter \(\tau \ll 1\), as reported in Eqs. 29, 30.

$${\theta }_{c}^{\prime}\leftarrow \tau {\theta }_{c}+\left(1-\tau \right){\theta }_{c}^{\prime}$$
(29)
$${\theta }_{a}^{\prime}\leftarrow \tau {\theta }_{a}+\left(1-\tau \right){\theta }_{a}^{\prime}$$
(30)

This avoids the divergence problem that occurs during the training phase, so using \(\tau \ll 1\) greatly increases the stability of training (Lillicrap et al., 2015). In addition to the replay buffer and target networks, DDPG also applies the concept of batch normalisation from deep learning, which is useful when dealing with states with different units.
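
The soft update of Eqs. 29–30 can be sketched for Keras models as follows; the value of τ shown is illustrative.

```python
# Sketch of the soft target update of Eqs. 29-30 for Keras models; tau << 1
# slows the target networks down and stabilises training.
def soft_update(target_net, source_net, tau=1e-3):
    target_net.set_weights([
        tau * w + (1.0 - tau) * w_target
        for w, w_target in zip(source_net.get_weights(), target_net.get_weights())
    ])
```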

The simulation environment, shown in Fig. 12, composed of the dynamic model of the WAAM process and of the DDPG controller, was developed in Python using the NumPy and TensorFlow libraries.

Fig. 12
figure 12

The proposed framework is composed of a Deep Deterministic Policy Gradient optimal controller that interacts with a data-driven reduced order model of the Wire Arc Additive Manufacturing process. A path planner provides the controller with the path references; the other inputs to the controller are the measured width and height and the actions used at the previous time step

The actor-network is composed of 2 hidden layers with 64 neurons each and ReLU activation functions. The output layer is composed of 4 neurons, corresponding to the number of process parameters to control, assuming the gas flow rate and the interpass temperature fixed at 20 l/min and 25 °C respectively, with a hyperbolic tangent activation function and random uniform weight initialisation in the range of \(\pm 1\times {10}^{-5}\). The input layer of the actor-network has a size of 8, since the states in control problems can be composed of measured variables, such as layer width and height (2), the references (2) and the actions, scaled with respect to their maximum values, used at the previous time step (4). The critic-network has 2 different branches for actions and states. The state branch, like the policy network, has 64 neurons in one hidden layer, while the action branch has 32 neurons, since the action space size is roughly half the state space size. After a concatenation layer, two layers with 128 neurons are used to approximate the Q-value. All hidden layers of the critic-network use the ReLU activation function, while no activation function is applied to the output layer, in which the Q-value is estimated. The training options used are summarised in Table 3.

Table 3 Hyperparameters used during training

Results

In this case study, the goal of the controller is to adjust the process parameters to reduce the error on both layer width and height, once the width and height references needed to produce the layer in Fig. 13 are known. The error used to compute the reward is the mean of the width and height errors. The reduced order model (ROM) involves six input parameters, namely WFS, WS, CTWD, V, GFR and interpass temperature, while the controller's control variables are WFS, WS, CTWD and V; therefore, in this scenario, GFR and interpass temperature have been kept constant at 20 l/min and 25 °C, respectively, throughout the process. This choice was made because the interpass temperature cannot be altered in real time and, in typical welding systems, GFR is kept constant.
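For clarity, the sketch below illustrates the tracking error described above (the mean of the layer width and height errors); the function name is hypothetical, and the simple negative-error reward shown in the usage comment is only illustrative, since the actual training uses the process-based reward function introduced earlier in this work.

```python
import numpy as np

def geometry_error(width, height, width_ref, height_ref):
    """Mean of the absolute layer width and height tracking errors (mm)."""
    return 0.5 * (np.abs(width - width_ref) + np.abs(height - height_ref))

# Illustrative usage at a single control step (variable names are hypothetical):
# err = geometry_error(w_measured, h_measured, w_reference, h_reference)
# reward = -err  # simple shaping; the paper adopts the process-based reward function
```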

Fig. 13
figure 13

The layer to be made is 150 mm long. After 75 mm, the width changes from 6 to 3.5 mm, while the layer height is fixed at 2.5 mm

In Fig. 14, the average value of the rewards obtained during the episodes is reported; it shows that after 30 episodes the average episodic reward becomes positive and that after 40 episodes it is close to the maximum value, hence the problem can be considered solved. In Figs. 15 and 16, the values of the layer geometry at the beginning and at the end of the training are shown. In Fig. 17, the values of the controlled layer geometry obtained by using the reward function proposed by Spielberg et al. (2017) are presented: the comparison shows that the reward function proposed in this work achieves better results.

Fig. 14
figure 14

Average episodic reward trend

Fig. 15
figure 15

Output responses for layer width and height control using the DDPG controller at the end of episode 1, i.e. with random actions during the active exploration phase

Fig. 16
figure 16

Output responses for layer width and height control using the trained DDPG controller with the proposed reward computation

Fig. 17
figure 17

Output responses for layer width and height control using the trained DDPG controller with the reward computed as suggested by Spielberg et al. (2017) with \(c=1\)

In terms of computational complexity, the WAAM simulator, which has been developed by integrating the RoboDK kinematic simulator with a ROM derived from a neural network, performs a network update at every simulation time step. This update involves solving the kinematic problem and visualizing the geometry and, considering the inference time of the ROM, takes approximately 0.074 s. A full simulation, which involves depositing one layer, requires 27 s of computational time. This computation was conducted on hardware consisting of an AMD Ryzen 7 4000 CPU and an NVIDIA GTX 1650Ti GPU, the latter used to accelerate the neural network inference. In total, the training process spans 50 episodes, each lasting 22 min. Regarding the inference time for control actions, which relies on an input buffer composed of the references, the current layer geometry and the previous control action, it takes only 23 ms. This inference time is notably faster than a control logic based on Model Predictive Control (MPC), which necessitates simulating the entire welding process to determine the optimal action, as previously discussed in the introduction. Finally, in Fig. 18, a simulated multi-layer deposition process using the optimal control law is shown.

Fig. 18
figure 18

Results of using optimal control law in the WAAM simulator developed in this work

Since the WAAM simulator has shown a high degree of fidelity, the obtained results can be replicated in the real environment by exploiting the sim-to-real capability of the RoboDK software.

Limitations, current challenges and future developments

This work presents an application to the WAAM process in which an RL agent is trained using a simulation environment. While the provided simulator exhibits low errors in predicting the layer geometry and offers good computational performance in terms of time, it is crucial to consider the real-world generalization of the simulator. Indeed, the neural network used to develop the WAAM ROM in this work is susceptible to overfitting. Consequently, the high control performance achieved in the simulation environment necessitates an additional learning process on the real system. When different parameters are applied, the error between predicted values and actual outcomes can be higher, leading to a lower reward compared to the simulation, so additional learning steps are required to reach similar performance in the real world. However, because the policy is optimized for a system that closely resembles the real one, it requires fewer optimization steps, and the employed parameters are less likely to damage the system, since they are close to optimal. Even though a step in the real world is mandatory in this case, several limitations hinder the integration of real RL applications into manufacturing systems. For instance, the absence of frameworks like the robot operating system (ROS) for industrial machinery, such as spindles for machining, external additional axes or welders, and the lack of integration of industrial protocols like Profinet into such frameworks limit the ease of integration as well as the real-time and safety capabilities.

The robotics field has benefited from such frameworks for the development of both RL applications and digital twins, but they are primarily designed for advanced robotics rather than for manufacturing. For example, there is currently no framework designed to facilitate the integration of applications for welding machines, such as changing process parameters online using IoT protocols like OPC-UA or MQTT. These systems are tailored to industrial settings, whose architectures are typically more closed than those of the research world, and are not designed for the development of such applications.

However, by using external hardware, such as additional PLCs, and employing simple serial or TCP/IP Ethernet protocols, a case-specific system can be designed. Nevertheless, introducing an additional node into the network topology through communication protocols that are not typical of industry is complex, and factors such as latency, security and data quality must be considered. While this holds true for continuous online, model-free learning, additional challenges arise when neural networks are used only to make decisions, i.e. when only the trained policy is deployed. In some cases, devices exist that enable the integration of AI processing hardware into PLCs to perform inference with pre-trained neural networks; however, they require the network to be trained beforehand.

In conclusion, although some pieces of the RL puzzle are available, the seamless integration of all these components to develop complex, real-time control applications for multi-device systems remains an ongoing challenge. Our ongoing research is aimed at addressing these issues. The future developments of this work concern the physical implementation of the communication scheme illustrated in Fig. 13 and the integration of defect detection software modules in the simulator (Nele et al., 2022). By using the monitoring modules developed in the simulation environment, a new, more sophisticated optimal policy can be designed using the novel reward formula proposed in Eq. 17, and the developed simulator may also be used to generate synthetic data for future AI-based monitoring or control modules or to test their implementation.

Conclusion

In this work, we address the main challenges related to the development of optimal control applications for manufacturing processes. These challenges comprise factors such as process stochasticity, non-linear dynamics, coupling, and constraints on states and actions. To tackle these issues, we propose an approach based on reinforcement learning (RL) for Multi-Input Multi-Output non-linear constrained systems. While RL is commonly employed in domains like robotics, its application to manufacturing processes, such as machining, welding and additive manufacturing, remains relatively limited. In these scenarios, control problems are more complex than simple set-point control, necessitating the formulation of domain-specific reward functions. While it is possible to implement an RL controller by directly interacting with the real system, this approach may not be suitable for safety-critical or costly applications. Therefore, we investigate more sophisticated techniques, including the development of reduced-order models, reward function design, and control policy approximation architectures, and we introduce a novel process-based reward function tailored to manufacturing applications.

To illustrate and demonstrate our methodology, we present an example of a control problem within an additive manufacturing process, specifically the wire arc additive manufacturing (WAAM) process. Following an experimental campaign, we developed a WAAM simulator that encompasses all aspects of the robotic manufacturing process. We created a reduced order model (ROM) to capture the intricate non-linear relationship between layer geometry and process parameters. Additionally, a robotic simulator was employed to simulate the robot kinematics and to generate code for both the robot and the welding machine. This laid the foundation for the initialization of a control policy to be trained. The proposed methodology enables the development of an intelligent manufacturing process simulator, facilitating the training of optimal control laws and offline parameter optimization based on reinforcement learning.

Nonetheless, we acknowledge several challenges that hinder the real-time implementation of the RL framework and have thoroughly discussed these limitations. This research aims to inspire further investigations in this domain, providing valuable insights to the manufacturing community. Future research will focus on real-world implementations, comparative analyses with other control strategies, and the application of this framework to other manufacturing processes, such as intelligent feedback control for drilling operations.