Improving the surface quality of friction stir welds using reinforcement learning and Bayesian optimization

Friction stir welding is an advanced joining technology that is particularly suitable for aluminum alloys. Various studies have shown a significant dependence of the welding quality on the welding speed and the rotational speed of the tool. Frequently, an inappropriate setting of these parameters can be detected through an examination of the resulting surface defects, such as increased flash formation or surface galling. In this work, two different learning-based algorithms were applied to improve the surface topography of friction stir welds. For this purpose, the surface topographies of 262 welds, which were performed as part of ten studies, were evaluated offline. The aim was to use reinforcement learning and Bayesian optimization approaches to determine the most appropriate settings for the welding speed and the rotational speed of the tool. The optimization problem was solved using reinforcement learning, specifically value iteration. However, the value iteration algorithm was not efficient, since all actions and states had to be iterated over, i.e., each possible parameter combination had to be evaluated, to find the best policy. Instead, it was better to solve the optimization problem directly using Bayesian optimization. Two approaches were applied: one in which the information from the other studies was not used and one in which it was. On average, both Bayesian optimization approaches found suitable welding parameters significantly faster than a random search algorithm, and the latter approach improved the result even further compared with the former. Future research will aim to show that optimization of the surface topography also leads to an increase in the ultimate tensile strength.


Introduction
In friction stir welding (FSW), the mechanical properties [1] as well as the surface topography [2] are strongly affected by process parameters such as the welding speed v s and the tool rotational speed n. These parameters are typically determined by trial and error, based on handbook values, or by manufacturers' recommendations [3]. This selection may yield neither optimal nor near-optimal welding performance. Furthermore, it may cause additional energy and material consumption and may also result in low-quality welds [3]. For this reason, several algorithms have already been developed to optimize the process parameters in friction stir welding. Some of these are presented in the following section.

State of the art-Use of optimization algorithms in the field of FSW
Various statistical and mathematical methods have been used to investigate the influence of process parameters on mechanical properties, in particular the ultimate tensile strength, and subsequently optimize the mechanical properties [4]. In many of these investigations, either the robust parameter design (RPD) method [5] or the response surface methodology (RSM) [6] was applied: The RPD method focuses on choosing levels of parameters in a process to ensure that the mean of the output response is at a desired target and to ensure that the variability around the target value is as small as possible [5]. Taguchi [7] proposed an approach to solve the RPD problem based on designed experiments and novel methods for analyzing the resulting data [5]. He also simplified the use of orthogonal arrays [8]. An approach that has already been applied to FSW several times is the L9 orthogonal array, which is designed to study the influence of four independent factors at three levels each. With the L9 method, only nine experiments have to be performed in order to study four variables at three levels; the design thus reduces the 81 (3^4) possible configurations to nine experimental evaluations [8].
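The balance property of such a design can be illustrated in code. The sketch below lists the standard L9(3^4) array in coded levels (0 to 2) and checks that every factor level appears equally often; it is an illustration of the design principle, not data from the cited studies:

```python
import numpy as np

# Standard Taguchi L9 orthogonal array: 9 runs, 4 factors, 3 levels (coded 0..2).
L9 = np.array([
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 2, 2, 2],
    [1, 0, 1, 2],
    [1, 1, 2, 0],
    [1, 2, 0, 1],
    [2, 0, 2, 1],
    [2, 1, 0, 2],
    [2, 2, 1, 0],
])

# Balance property: in every column, each of the three levels appears exactly
# three times, so main effects can be estimated from only 9 of the 81 runs.
for col in range(L9.shape[1]):
    counts = np.bincount(L9[:, col], minlength=3)
    assert (counts == 3).all()
```

Because each level of each factor occurs equally often, the mean response per level can be computed from the nine runs alone, which is what makes the reduction from 81 to 9 experiments statistically workable.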
Lakshminarayanan et al. [9] determined the optimum settings for the rotational speed n, the welding speed v s , and the axial force F z at FSW by adapting the Taguchi L9 orthogonal array method and maximizing the signal-to-noise (S/N) ratio. In the Taguchi method, the S/N ratio is used to determine the deviation of the quality characteristics from the desired value [9]. In order to investigate nonlinearities, each of the three process parameters was varied at three levels. Welding experiments were conducted for only nine out of the 27 possible parameter combinations. For each of the nine applied parameter combinations, three tensile tests were performed, and the mean of the ultimate tensile strength was calculated. Based on the mean values for the S/N ratio and the ultimate tensile strength, an ideal parameter set was determined. The expected ultimate tensile strength UTS exp , when using this ideal parameter set, was calculated with the following formula [9]:

UTS exp = UTS n;L + UTS v s ;L + UTS F z ;L − 2 · UTS (1)

whereby UTS n;L , UTS v s ;L , and UTS F z ;L are the mean ultimate tensile strengths at level L of the corresponding process parameters n, v s , and F z , and UTS is the overall mean of all 27 determined ultimate tensile strengths. Subsequently, the expected maximum ultimate tensile strength UTS exp was compared with the actual ultimate tensile strength obtained by setting the previously determined ideal parameter set, and the deviation was 2.6%. It was also determined that the rotational speed n had an influence on the tensile strength of 41%, the welding speed v s of 33%, and the axial force F z of 21%. The remaining 5% was attributed to error. Ugender et al. [10] also used the Taguchi technique and the S/N ratio to find an optimum setting for the ratio of the diameter of the shoulder D s to the diameter of the probe d p , the tilt angle, and the welding speed.
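The two calculations used in this Taguchi workflow can be sketched as follows: the larger-is-better S/N ratio for repeated tensile tests, and the additive prediction of the expected ultimate tensile strength from the best level means. All numerical values here are illustrative stand-ins, not the measured data from [9]:

```python
import numpy as np

def sn_larger_is_better(y):
    """Taguchi larger-is-better signal-to-noise ratio in dB:
    S/N = -10 * log10(mean(1 / y^2))."""
    y = np.asarray(y, dtype=float)
    return -10.0 * np.log10(np.mean(1.0 / y**2))

# Three repeated tensile tests for one parameter combination (illustrative MPa values).
print(sn_larger_is_better([210.0, 215.0, 205.0]))

def predicted_uts(best_level_means, overall_mean):
    """Additive Taguchi prediction: sum of the best level means of the k
    factors minus (k - 1) times the overall mean."""
    k = len(best_level_means)
    return sum(best_level_means) - (k - 1) * overall_mean

# Illustrative level means for n, v_s, and F_z at their best levels.
print(predicted_uts([220.0, 218.0, 216.0], 210.0))
```

With three factors, the prediction reduces to the sum of the three level means minus twice the overall mean, matching the formula above.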
The results showed that the D s /d p ratio and the welding speed are the most important factors, followed by the tilt angle, in determining the mechanical properties of friction stir welds of aluminum alloys. Ganapathy et al. [11], Abbas et al. [12], and Ma et al. [13] also adopted Taguchi's L9 orthogonal array design and maximized the S/N ratio to optimize FSW process parameters. Vijayan et al. [14] investigated an approach using the Taguchi-based grey relational analysis (GRA) [15] instead of the S/N ratio. The RSM is an approach to solve the RPD problem that not only allows the use of Taguchi's robust design concept but also provides a more sound and more efficient approach to experiment design and analysis [5]. Furthermore, the RSM is a collection of mathematical and statistical techniques for analyzing problems in which several independent variables influence a dependent variable and the goal is to optimize the dependent variable [16]. Rajakumar et al. [3] applied the RSM and established an empirical relationship between the independent variables (tool rotational speed, welding speed, axial force, shoulder diameter, probe diameter, and tool material hardness) and the dependent variable, which was the ultimate tensile strength of the joint. For this purpose, a multiple regression model was developed for the ultimate tensile strength of the weld. The model was able to predict the ultimate tensile strength of FSW joints within the 95% confidence level. Khansare et al. [17] proposed a hybrid optimization methodology based on the combination of the RSM and a genetic algorithm (GA) [18] to approximate the optimal welding speed and tool rotational speed at which a maximum ultimate tensile strength could be achieved.
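The core of an RSM-style analysis is fitting a second-order polynomial to the response and locating its stationary point. The sketch below does this on synthetic data with a hypothetical quadratic dependence of the tensile strength on v s and n; the coefficients and the optimum are invented for illustration and do not come from the cited studies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: UTS as a quadratic response of welding speed v_s
# (mm/min) and rotational speed n (1/min) plus noise; not real measurements.
v_s = rng.uniform(500, 1500, 40)
n = rng.uniform(1500, 3500, 40)
uts = 200 - 1e-4 * (v_s - 900)**2 - 2e-5 * (n - 2200)**2 + rng.normal(0, 1, 40)

# Second-order response-surface model, fitted via ordinary least squares.
X = np.column_stack([np.ones_like(v_s), v_s, n, v_s * n, v_s**2, n**2])
beta, *_ = np.linalg.lstsq(X, uts, rcond=None)

# The stationary point of the fitted quadratic estimates the optimal setting:
# grad f = 0 gives a 2 x 2 linear system in (v_s, n).
A = np.array([[2 * beta[4], beta[3]], [beta[3], 2 * beta[5]]])
v_opt, n_opt = np.linalg.solve(A, -beta[1:3])
print(v_opt, n_opt)
```

On this synthetic response, the stationary point of the fitted surface recovers the assumed optimum near (900, 2200), which is the basic mechanism by which RSM locates optimal parameter settings.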
Tansel et al. [19] developed a genetically optimized neural network system (GONNS) for modeling and optimizing the FSW process. The GONNS was introduced by Tansel et al. [20] by using artificial neural networks (ANNs) in combination with a GA. The GONNS models the system by using the ANNs trained with the experimental data or observations. The optimal operating conditions are estimated by using a GA [19]. Tansel et al. [19] used one GA for searching the optimal tool rotational speed and welding speed by using five ANNs representing the FSW operation. The five separate neural networks with two identical inputs (welding speed and tool rotational speed) estimated the mechanical and metallurgical properties of the friction stir welds.

State of the art-Evaluation of the surface of friction stir welds
Trueba et al. [21] performed an optimization experiment using a factorial design to evaluate the effect of process parameters on the weld temperature, surface and internal quality, and mechanical properties during bobbin-tool friction stir welding. To evaluate the surface appearance, a semi-quantitative visual appearance rating (VAR) was developed based on the presence and severity of visually observable defects. The rating scale ranged from nine (poorest surface quality) to zero (best surface quality), and the criteria wormhole, galling, flash, and narrow bead were included. The wormhole was defined as an internal void extending to the surface. It was found that high levels of tool rotational speeds and welding speeds resulted in high welding temperatures and insufficient weld metal constraint. This in turn led to galling and the formation of wormholes with a corresponding decrease in surface quality. It was concluded that there is a relationship between rotational speed, weld temperature, surface appearance, and void formation.
According to Zuo et al. [2], the surface topography of friction stir welds plays an important role in the performance of the joints. A larger surface roughness leads to a more severe stress concentration, which promotes fatigue damage and reduces the fatigue strength of the parts [2]. Important process parameters to control the surface topography of friction stir welds are the welding speed and the rotational speed of the tool [22]. Hartl et al. [23] presented key indicators for quantifying the surface topography of friction stir welds and showed that some of these can be predicted by evaluating process variables such as the process forces or temperatures [24].
To date, there have been no investigations into the promising algorithm-based optimization of the surface topography of friction stir welds or into the application of reinforcement learning (RL) [25] and Bayesian optimization (BO) in the field of FSW, which is why these modern learning-based algorithms are used in this work. The fundamentals regarding these algorithms are contained in Appendix I.

Approach
Previous investigations have shown that the surface quality of friction stir welds significantly depends on the welding speed v s and the rotational speed n of the tool. The optimal setting of these parameters depends on factors such as the sheet thickness, the aluminum alloy, and the tool geometry used. Due to the complex interrelations, the ideal welding speed v s and tool rotational speed n can often only be found through experience and trial and error. In this work, a learning-based system was developed that helps the FSW user to find optimal settings for these two parameters. Since the production of friction stir welds is time-consuming, as few parameter combinations as possible should be sampled to find suitable parameters.
The evaluation of the surface quality was conducted on the basis of surface topography indicators for friction stir welds, which were presented in Hartl et al. [23]. The task was addressed as the following optimization problem:

(v s , n)* = arg min def(v s , n) (2)

Here, def is a function that indicates how defective the surface of the friction stir weld is for the given parameters. The function value def(v s , n) is smaller than the function value def(v s ′, n′) if the parameter combination (v s , n) leads to fewer surface defects than the parameter combination (v s ′, n′). The evaluation of def(v s , n) can be equated with the explicit testing of the parameter combination (v s , n), i.e., the production of the friction stir weld with the given parameters, the recording of the surface topography, and the calculation of the topography key indicators based on Hartl et al. [23]. Since this process is associated with considerable effort, the number of evaluations of def should be kept as low as possible. When implementing the algorithms, it had to be taken into account that no information about the gradient of def was available. Additionally, def is subject to error: even if the process parameters are identical for two experiments, the surface topographies of these two welds will not be completely identical. Small measurement inaccuracies may also occur when recording the surface topography with the three-dimensional profilometer. However, for simplification purposes, it was assumed that def has no error. To solve the optimization problem, three different approaches were considered:
I. For the first approach, the optimization problem was modeled as a Markov decision process (MDP) and solved using the RL-based value iteration algorithm.
II. For the second approach, the optimization problem was solved with BO. In the further discussion, the second approach will be called single-task.
III. For the third approach, the optimization problem was also solved using Bayesian optimization.
In contrast to the single-task approach, here the Gaussian process (GP) was provided with additional data that it could use to find the optimum. The GP was provided with information about the type of aluminum alloy, the sheet thickness, and the shoulder geometry used. In the further discussion, the third approach will be called multi-task.

Welding experiments
The welding experiments were conducted on a four-axis milling machining center MCH 250 from Gebr. Heller Maschinenfabrik GmbH, which was adapted for friction stir welding. The maximum axial force of the system was 30 kN.
In the experiments, the sheets were joined in the butt joint configuration, and a rigid clamping device prevented gaps between the two joining partners. All tests were performed in position-controlled operation with a 2° tilt angle of the tool. Two-piece tools consisting of a shoulder and a conical welding probe with a thread and three flats were used. A total of 262 welding experiments were conducted within the scope of 10 studies. In the 10 studies, the type of aluminum alloy, the tool shoulder geometry, and the sheet thickness were varied. Table 1 provides an overview of the different studies. Some of the studies have already been described in more detail in previous research conducted by Hartl et al. [23,26]. The evaluated weld seam length varied in the ten studies, but was always between 70 and 170 mm. The evaluated weld seam area started 10 mm after the plunge point and ended approximately 20 mm before the exit hole.
The welding speed v s and the tool rotational speed n were varied in a large parameter window. As high welding speeds v s are becoming increasingly important for industrial applications, especially in the context of electromobility [27], welding speeds of up to 1500 mm/min were employed. In order to protect the welding equipment, the minimum n/v s ratio was limited to 1 mm −1 . In studies no. 1 to 8, the welding speed v s and the tool rotational speed n were varied in a full factorial manner at four levels each. Thereby, the welding speeds v s ranged from 500 to 1500 mm/min and the tool rotational speeds n from 1500 to 3500 min −1 . In study no. 9, a total of 13 different rotational speeds n from 1500 to 3500 min −1 were set at a welding speed v s of 833 mm/min. In study no. 10, the welding speed v s was varied in eleven steps from 500 to 1500 mm/min and the rotational speed n was varied in eleven steps from 1500 to 3500 min −1 in a full factorial design.
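The resulting parameter grid and the ratio constraint can be sketched as follows. The intermediate levels are an assumption of equidistant spacing (which is at least consistent with the 833 mm/min welding speed used in study no. 9):

```python
import numpy as np

# Full-factorial grid as in study nos. 1 to 8, assuming equidistant levels.
v_s_levels = np.linspace(500, 1500, 4)   # mm/min: 500, 833.3, 1166.7, 1500
n_levels = np.linspace(1500, 3500, 4)    # 1/min:  1500, 2166.7, 2833.3, 3500

grid = [(v, n) for v in v_s_levels for n in n_levels]
print(len(grid))  # 16 parameter combinations per study

# Equipment-protection constraint: the ratio n / v_s must not fall below 1 mm^-1.
assert all(n / v >= 1.0 for v, n in grid)
```

The tightest point of the constraint is the corner (1500 mm/min, 1500 min−1), where the ratio is exactly 1 mm−1, so the whole window remains admissible.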

Data preprocessing
The topography of the friction stir welds was recorded using a three-dimensional profilometer VR-3100 from Keyence Deutschland GmbH, which was based on phase-coded structured light projection. Thereby, white LEDs projected light from two places onto the welds and the reflected light was measured by a CMOS sensor. The smallest measurable difference in the height direction normal to the sheet surface was 1 μm. The sheet surface was defined as the zero height. The distance between the individual topography points in the plane of the sheet surface was approximately 24 μm. A total of about 250,000 height information points per 10 mm weld seam length were generated. The point cloud was processed to determine the key indicators listed in Table 2 for each weld. A more detailed description of the key indicators is given in Hartl et al. [23]. Table 3 shows the ideal value for each of these eight key indicators as well as the best and the worst values obtained for the 262 welding experiments performed. The value of −2.80 mm for the largest seam underfill was notably high. This value was caused by a lack of fill occurring in some experiments in study no. 2 (see also Hartl et al. [23]). The maximum value for the peak material volume of 37.36 ml/m 2 was also remarkably high. This high value could be explained by flash that reached into the weld. The values displayed in Table 3 were therefore all considered plausible.
The eight topography indicators obtained for the 262 welds were then scaled to values between 0 and 1. The ideal value for each topography key indicator was scaled to a value of 0 and the worst occurring value for each topography key indicator was scaled to a value of 1. The ideal value for the ratio r arc is 1 [23]. The largest deviation from this ideal value was 0.95 at an r arc of 1.95, which is why that deviation was scaled as 1. The eight scaled values N for the eight topography indicators were then averaged for each weld according to:

def = (N 1 + N 2 + … + N 8 ) / 8 (3)

and a scaled and averaged key indicator def was obtained that took into account all eight topography indicators defined before (see Table 2). In Eq. 3, all defined topography key indicators were weighted equally. If a quality characteristic were particularly relevant in the application, for example, the flash height, it could be weighted more prominently in Eq. 3. The perfect friction stir weld surface would therefore have the value def of 0. The best actual weld of all 262 conducted experiments was experiment no. 53, which had the value def of 0.021. The worst obtained value for def was 0.603 for experiment no. 123. Figure 1 shows the evaluated areas of these two welds as color and topography images. The images were generated using the three-dimensional profilometer. Experiment no. 53, on the one hand, contained no surface defects. Neither pronounced flash formation, nor surface galling, nor cracks were visible on the surface of the weld. The topography image in Fig. 1 shows that the seam underfill was also low and regular. Experiment no. 123, on the other hand, showed a very strong flash formation and pronounced surface galling. The defined key indicator def was therefore assessed as suitable for documenting the surface quality in a scalar quantity.
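The scaling and averaging step can be sketched as follows. Only the r arc ideal of 1 and its worst deviation of 0.95 are taken from the text; all other numbers, including the worst values of the first seven indicators, are hypothetical placeholders:

```python
import numpy as np

def scale_indicator(value, ideal, worst):
    """Scale a topography key indicator so that the ideal value maps to 0 and
    the worst value observed over all welds maps to 1."""
    return abs(value - ideal) / abs(worst - ideal)

def def_value(indicators, ideals, worsts):
    """Equally weighted mean of the eight scaled indicators N (Eq. 3)."""
    scaled = [scale_indicator(v, i, w) for v, i, w in zip(indicators, ideals, worsts)]
    return float(np.mean(scaled))

# Illustrative values: seven indicators with ideal 0 and a hypothetical worst of
# 5.0, plus the ratio r_arc with ideal 1 and a worst observed deviation of 0.95.
ideals = [0.0] * 7 + [1.0]
worsts = [5.0] * 7 + [1.95]
measured = [1.0] * 7 + [1.475]
print(def_value(measured, ideals, worsts))
```

A weighted variant would simply replace the mean with a weighted average, which is the adjustment suggested for applications where, for example, the flash height matters most.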

Reinforcement learning
In order to solve the optimization problem using RL, it first had to be formulated as an MDP. Two formulations, labeled as formulation 1 and formulation 2, were implemented, which differed in the state transition function and the possible actions. In both formulations, the state transition function was deterministic. For a clearer presentation, the state transition function p was represented deterministically with the two functions T(s, a): S × A → S and r(s, a): S × A → ℝ. These two functions described the state s′ in which the environment resulted and the reward r the agent received when the agent executed the action a in the state s. The state transition function could be derived from both functions as follows:

p(s′, r | s, a) = 1 if s′ = T(s, a) and r = r(s, a), and 0 otherwise

Both formulations were solved using the value iteration algorithm. The parameter δ (see Algorithm A2) was set to 10 −8 .
In both formulations, the states S were the parameter combinations of welding speed v s and tool rotational speed n. For example, from study nos. 1 to 8, each of which included 16 different parameter combinations, S was assigned:

S = {(v s,1 , n 1 ), (v s,1 , n 2 ), …, (v s,4 , n 4 )}

where v s,1 , …, v s,4 and n 1 , …, n 4 denote the four levels of the welding speed and the tool rotational speed. The reward function r should lead the policy π (see Appendix I) into a minimum as fast as possible, which is why the following reward function was chosen for both formulations:

r(s, a) = −def(T(s, a)) − b s

where def was the function to be optimized and b s ∊ ℝ + was a constant. This reward function gave the agent a higher reward r, the smaller the value of def was, i.e., the less defective the topography of the friction stir weld was. In addition, the agent received a penalty of b s for each step, because each step was coupled with an explicit evaluation of def and thus connected with effort. The value for b s was set to 1. Since the value of def was between 0 and 1, the reward r for each step was between −2 and −1.
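The reward structure and the transition probability derived from the deterministic functions T and r can be sketched as follows; the function and state names are illustrative:

```python
# Deterministic MDP used for the value-iteration approach: the transition
# function T and the reward function r fully determine p(s', r | s, a).
B_S = 1.0  # per-step penalty b_s: each step costs one explicit evaluation of def

def reward(def_of_next_state):
    """r = -def(s') - b_s, so r lies in [-2, -1] because def lies in [0, 1]."""
    return -def_of_next_state - B_S

def p(s_next, r, s, a, T, r_fn):
    """Derived transition probability: 1 for the deterministic outcome, else 0."""
    return 1.0 if s_next == T(s, a) and r == r_fn(s, a) else 0.0

assert reward(0.0) == -1.0 and reward(1.0) == -2.0
```

Because every step carries at least a penalty of −1, the agent is pushed to reach a low-def state in as few explicit weld evaluations as possible.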
In formulation 1, only four different actions were allowed: increasing or decreasing the welding speed v s by one level and increasing or decreasing the rotational speed n by one level. For example, for study nos. 1 to 8, the state transition function T for formulation 1 was therefore defined such that each action moved the agent to the corresponding neighboring state on the 4 × 4 parameter grid. Figure 2 shows the 16 different states with the respective possible actions for formulation 1 for study nos. 1 to 8. The states and possible actions for study nos. 9 and 10 were analogous.
In formulation 2, in contrast to formulation 1, the agent was able to change from any state to any other state. It thus followed that the action space was A = S. The state transition function was thus simplified to T(s, a) = a. In both formulations, the discount factor γ was set to 1 (see Appendix I). This expressed the fact that both immediate and future evaluations of def were equally unwanted [25]. Additionally, the number of iterations no longer depended on the choice of γ.
To ensure that the algorithm still terminated, the minimum of def was regarded as the terminal state [25]. This meant that the value of the value function in the terminal state was always zero. The algorithm terminated because the agent could get from any state to the terminal state in finitely many steps and the agent received a negative reward in every state except the terminal state. So, in order to receive the least negative reward possible, the agent had to reach the terminal state as quickly as possible. In Algorithm A1, the slightly modified value iteration algorithm is described. The changes, compared with Algorithm A2, are underlined. The changes ensured that the value functions of the terminal states were always zero. In addition, the algorithm was adapted for the deterministic formulation.
Algorithm A1 Adapted value iteration algorithm

Figure 3 a shows the values for the scaled and averaged surface topography indicator def (see Eq. 3) for the 16 different states in study no. 1. It is evident that the best value for def, 0.041, in study no. 1 was obtained in the state (500, 1500). This was in good agreement with the result from Hartl et al. [23], wherein this parameter combination led to the best-rated result in study no. 1 according to the visual inspection. Figure 3 b shows the initialization of the value function V π (s) with zeros (compare line 1 in Algorithm A1).
In both formulations, the algorithm evaluated the function def for all 16 parameter combinations. Figure 4 illustrates the results for the value iteration algorithm for study no. 1 when using formulation 1. Thereby, the strategy for each state was the direction in which the sum of future rewards was maximized (see Eq. 19 in Appendix I). The algorithm terminated after five iterations. Figure 4 a shows the values of the value function V π (s) for the first and the fifth (last) iteration. Figure 4 b shows the values of the value function V π (s) plus the reward, if the agent changes from another state to this state, for these two iterations. Figure 4 c demonstrates the direction to the neighboring state that yields the highest improvement according to Fig. 4b. For example, for state (1500, 3500) in iteration 1, that would be the reduction of the welding speed v s . The terminal state was (500, 1500), which is why the value function V π (s) for this state was zero from the beginning. The results for the other studies (see Table 1) led to the same findings. Figure 5 displays the results for the value iteration algorithm for study no. 1 when using formulation 2 analogous to Fig. 4. For formulation 2, the strategy in each state s was the action (500, 1500). The labeling of the boxes in Figs. 4 and 5 is analogous to the labeling in Fig. 2a.
The results showed that the optimization problem (see Eq. 2) can be solved using RL. However, the value iteration algorithm was not efficient. Since all states s∊S and all actions a∊A were iterated over, the function def had to be evaluated for each process parameter combination. There are algorithms for RL that are more efficient in this respect. For example, the value function could be approximated using a Gaussian process to reduce the number of evaluations [29].
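The adapted value iteration can be sketched compactly for formulation 2, in which every state can be reached from every other state. The four-state grid and its def values below are toy stand-ins for the 16 measured parameter combinations:

```python
# Sketch of the adapted value iteration (cf. Algorithm A1) for formulation 2;
# all def values are illustrative toy numbers.
def_values = {
    (500, 1500): 0.04, (500, 3500): 0.30,
    (1500, 1500): 0.20, (1500, 3500): 0.55,
}
states = list(def_values)
terminal = min(states, key=def_values.get)   # minimum of def is the terminal state
b_s, gamma, delta = 1.0, 1.0, 1e-8

V = {s: 0.0 for s in states}                 # initialize the value function with zeros
while True:
    diff = 0.0
    for s in states:
        if s == terminal:
            continue                         # terminal value stays zero
        # In formulation 2 the action is the next state itself: T(s, a) = a.
        v_new = max(-def_values[a] - b_s + gamma * V[a]
                    for a in states if a != s)
        diff = max(diff, abs(v_new - V[s]))
        V[s] = v_new
    if diff < delta:
        break

# The greedy policy jumps directly to the terminal state from everywhere.
policy = {s: terminal for s in states if s != terminal}
```

Note that even this sketch needs def for every state before the policy is meaningful, which is exactly the inefficiency criticized above: the full grid must be evaluated regardless of where the optimum lies.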

Bayesian optimization
Optimization of the surface quality

In the second approach (single-task), the optimization problem was solved using Bayesian optimization (BO). Thereby, only the data from the respective study were used. Neither additional information from the other studies nor information such as the type of aluminum alloy was utilized. Since no information was available at the beginning of the optimization for the first selection of a parameter set, a random parameter set had to be selected in the first trial. A parameter set consisted of the welding speed v s and the tool rotational speed n. To ensure that the results were independent of the selected starting point, all parameter combinations were used once as the starting point, and the means and standard deviations of the required number of steps to find good parameter sets were subsequently calculated. In order to avoid overfitting the hyperparameters, it was assumed that they follow certain distributions. For the hyperparameter length-scale l of the Matérn 5/2 kernel, a uniform distribution was assumed, since the Gaussian process degenerated strongly at values below 0.1 and those were thus excluded. A logarithmic normal distribution was assumed for the variance σ to determine the approximate interval in which σ should be located. The exact choice of the distribution parameters was not significant. The parameters in Table 4 in Appendix II were found by trial and error using the data sets from study nos. 1 to 10, so that the distributions cover approximately the range of the hyperparameters that do not degenerate the GP. Figure 6 shows a degenerate GP for study no. 9 that had to be avoided. The cause of the degeneration was that the hyperparameter l was selected to be 0.01 and was therefore too small. Consequently, the mean in Fig. 6 is always zero and only spikes at known points. In addition, the hyperparameter variance σ of 0.001 was too small.
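A single-task BO loop of this kind can be sketched with scikit-learn. Two simplifications relative to the text are worth flagging: scikit-learn fits the Matérn 5/2 length-scale by maximizing the marginal likelihood within bounds rather than via the hyperprior distributions described above, and the objective here is a toy stand-in for def on a normalized grid, not measured weld data:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(1)

# Toy stand-in for def on a normalized 4 x 4 parameter grid (not measured data).
grid = np.array([(v, n) for v in np.linspace(0, 1, 4) for n in np.linspace(0, 1, 4)])
def_toy = np.array([0.1 * v + 0.5 * (n - 0.25) ** 2 + 0.02 for v, n in grid])

# Matern 5/2 kernel; the lower length-scale bound of 0.1 mirrors the exclusion
# of degenerate length-scales discussed above.
kernel = Matern(length_scale=0.5, nu=2.5, length_scale_bounds=(0.1, 10.0))
tried = list(rng.choice(len(grid), size=2, replace=False))  # random start

for _ in range(5):
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(grid[tried], def_toy[tried])
    mu, sigma = gp.predict(grid, return_std=True)
    best = def_toy[tried].min()
    # Probability-of-improvement acquisition for minimization.
    pi = norm.cdf((best - mu) / np.maximum(sigma, 1e-9))
    pi[tried] = -1.0                      # never re-test a known parameter set
    tried.append(int(np.argmax(pi)))

best_idx = tried[int(np.argmin(def_toy[tried]))]
print(grid[best_idx], def_toy[best_idx])
```

Swapping the acquisition for expected improvement only changes the `pi` line; as reported above, the two acquisitions behaved almost identically in the single-task setting.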
In the third approach (multi-task), the optimization problem was also solved using BO, but in this case, the GP received the data sets from the nine other studies as additional information, respectively. It was suspected that this would allow the GP to better estimate the function to be minimized and to find the optimum more quickly. In order for the GP to be able to estimate which other studies were similar, it was given additional features as input variables which influence the setting of the process parameters. These features were the type of aluminum alloy, the sheet thickness, and the shoulder geometry. Since they were identical in study nos. 3 ↔ 9, 4 ↔ 8, and 5 ↔ 6 (see Table 1), the study number was also used as an additional input variable. This resulted in the following kernel for the GP:

k : (ℝ × ℝ × T f × M f × S f × {1, 2, …, 10})² → ℝ

where M f , T f , and S f represent the quantities of the aluminum alloys m f , sheet thicknesses t f , and shoulder geometries s f used. Additionally, i is the number of the study and {1, 2, …, 10} is the set of natural numbers less than or equal to 10, since data from 10 studies were used in total. For the welding speed v s , the tool rotational speed n, and the sheet thickness t f , the Matérn 5/2 kernel (see Appendix I) was used. Since the type of aluminum alloy, the shoulder geometry, and the study number were categorical values, the coregionalization kernel (see Appendix I) had to be used for those. In that way, their covariance could be learned via hyperparameter optimization. The covariance function k was defined as follows:

k((v s , n, t f , m f , s f , i), (v s ′, n′, t f ′, m f ′, s f ′, i′)) = k E (v s , v s ′) · k E (n, n′) · k E (t f , t f ′) · k M (m f , m f ′) · k S (s f , s f ′) · k I (i, i′)

where k E is the Matérn 5/2 kernel, k M and k S are coregionalization kernels for two or three categories, and k I is a coregionalization kernel for ten categories. Since the GP was given information from the nine other studies, the first parameter set to be tested no longer had to be chosen randomly as in the single-task approach.
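The coregionalization construction can be sketched in NumPy. The product structure below is one plausible composition consistent with the description; the shoulder-geometry factor is omitted for brevity, and all matrix entries are random placeholders rather than learned hyperparameters:

```python
import numpy as np

def matern52(x, y, l=1.0):
    """Matern 5/2 covariance between two scalar inputs."""
    d = abs(x - y) / l
    return (1.0 + np.sqrt(5.0) * d + 5.0 * d**2 / 3.0) * np.exp(-np.sqrt(5.0) * d)

def coregionalization(W, kappa):
    """Coregionalization matrix B = W W^T + diag(kappa): a learnable positive
    semi-definite covariance between categories (alloys, geometries, studies)."""
    return W @ W.T + np.diag(kappa)

rng = np.random.default_rng(2)
B_alloy = coregionalization(rng.normal(size=(3, 1)), np.full(3, 0.1))    # 3 alloys
B_study = coregionalization(rng.normal(size=(10, 2)), np.full(10, 0.1))  # 10 studies

def k(x1, x2):
    """Product kernel over (v_s, n, t_f, alloy index, study index)."""
    v1, n1, t1, m1, i1 = x1
    v2, n2, t2, m2, i2 = x2
    return (matern52(v1, v2) * matern52(n1, n2) * matern52(t1, t2)
            * B_alloy[m1, m2] * B_study[i1, i2])

# A product of valid kernels is itself a valid kernel (Schur product theorem),
# so the Gram matrix over any set of inputs is positive semi-definite.
xs = [(0.1, 0.2, 0.3, 0, 0), (0.5, 0.1, 0.3, 1, 3), (0.9, 0.8, 0.4, 2, 7)]
K = np.array([[k(a, b) for b in xs] for a in xs])
assert np.all(np.linalg.eigvalsh(K) > -1e-9)
```

Optimizing the entries of the B matrices alongside the Matérn length-scales is what lets the GP learn how strongly information transfers between, say, two studies with the same alloy but different sheet thicknesses.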
In order to prevent overfitting the hyperparameters, it was also assumed for the multi-task approach that the hyperparameters follow certain distributions. A uniform distribution was assumed for the hyperparameter length-scale l of the Matérn 5/2 kernel. A logarithmic normal distribution was assumed for hyperparameters that can only take positive values, and a normal distribution for all remaining hyperparameters. Table 5 in Appendix II lists the hyperparameters for the multi-task approach and their distributions analogous to Table 4.
The number of expected steps until the random search algorithm finds a suitable parameter setting was calculated as described below: Let Z be a random variable that indicates after how many steps a random search finds one of the o optima for the first time. Thus, p z was the probability that a random search has not found an optimum in the previous z steps and finds an optimum in the (z + 1)-th step:

p z = (q − o)/q · (q − o − 1)/(q − 1) · … · (q − o − z + 1)/(q − z + 1) · o/(q − z)

whereby q is the number of possible parameter combinations and Z ∊ {1, 2, …, q − o + 1}, where {1, 2, …, q − o + 1} is the set of natural numbers less than or equal to (q − o + 1), since there are (q − o) parameter sets that are not an optimum and one of the optima is found in the (q − o + 1)-th step at the latest. Thus, the expected number of necessary steps to find an optimum using random search was:

E(Z) = ∑ z=0 … q−o (z + 1) · p z = (q + 1) / (o + 1)

Table 6 in Appendix II shows the number of steps required to find a suitable parameter set for the ten different studies. Initially, the best 20% of each study were defined as suitable. This resulted in the number of o 20 parameter sets for each study, which led to an optimum result. If the random search (RS) algorithm was used, the calculated expected value until one of the o 20 good parameter sets was found was 4.3 steps on average. When using the single-task approach and the probability of improvement (PI) acquisition function, an average of 3.3 steps, and when using the multi-task approach, an average of 2.4 steps were required to find a parameter set that was among the best 20%. The average results for the expected improvement (EI) acquisition function were almost identical with the results when using the PI acquisition function. Since the application of the single-task approach also depended on which parameter set was tested first, the mean value and standard deviation are given for each study in Table 6. The BO was started once with each of the parameter sets.
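This expectation for random search without replacement can be computed exactly. The sketch below evaluates the sum with rational arithmetic and checks it against the closed form (q + 1)/(o + 1); the values q = 16 and o = 3 illustrate a 4 × 4 study grid with the best 20% counted as suitable:

```python
from fractions import Fraction

def expected_steps(q, o):
    """Expected number of draws until a random search without replacement first
    hits one of the o optimal parameter sets among q combinations."""
    total = Fraction(0)
    for z in range(q - o + 1):               # z failures, success at step z + 1
        p_z = Fraction(o, q - z)
        for j in range(z):
            p_z *= Fraction(q - o - j, q - j)
        total += (z + 1) * p_z
    return total

# For a 4 x 4 grid (q = 16) with the best 20 % (o = 3) counted as suitable,
# the closed form (q + 1) / (o + 1) gives the same value.
assert expected_steps(16, 3) == Fraction(17, 4)
print(float(expected_steps(16, 3)))  # → 4.25
```

The resulting 4.25 steps for a 16-point grid is of the same order as the 4.3-step average reported across the ten studies, whose grid sizes and o 20 counts differ.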
Since the determination of suitable settings for the welding speed v s and the tool rotational speed n with both the single-task and the multi-task approaches succeeded faster than an RS algorithm would suggest, the use of BO was considered suitable. It was expected that BO would lead to suitable parameters faster than RS, since BO models the function to be optimized. This made it possible to estimate which parameter setting should be tested next. With RS, the parameter settings were evaluated in random order. In addition, the information given in Table 6 shows that, on average, the multi-task approach led to better results than the single-task approach. The information from the other data sets could thus be successfully used to determine suitable welding parameters faster. Since the similarity to the other data sets was modeled, it was also possible to derive from which of the other studies the information could be better transferred in the multi-task approach. Figure 7 illustrates the number of steps required for the different studies and approaches. It becomes obvious again that the multi-task approach led to the best results overall. Only in study nos. 7 and 9 did the multi-task approach yield the worst results. The reason was assumed to be that study no. 7 was the only study in which a sheet thickness of 2 mm was used (see Table 1), which is why the information from the other studies in this case even had a negative effect on the result compared with the single-task approach. Study no. 9 also differed from the other studies. In this study, the welding speed v s was not changed and only the rotational speed n was varied. Due to the greater difference as compared with the other studies, this information could therefore not be used advantageously. Table 7 in Appendix II shows the number of steps required to find suitable welding parameters for the various approaches in analogy to Table 6. Now, only the best 5% of each study were defined as suitable.
This yielded the number o_5 of good welding parameter sets for each study. This specification could be used for applications that require a very high surface quality, for example for visible welding seams. Since the number of optimal parameter sets was now lower, the number of steps necessary to find good parameter sets was higher on average. It once again became clear that the use of the BO reduced the number of necessary steps compared with RS. The average number of necessary steps was again lower with the multi-task approach than with the single-task approach. This time, the differences between the two acquisition functions were slightly bigger: the PI acquisition function was on average better for the single-task approach, whereas for the multi-task approach, the EI acquisition function performed marginally better. Figure 8 illustrates the steps necessary to find a parameter set among the best 5%. Compared with Fig. 7, it is particularly noticeable that in study no. 2, the multi-task approach required the most steps to find a good parameter set. This was assumed to be because study no. 2 was the only study in which a spiral shoulder geometry was used (see Table 1), so that the information from the other data sets was more confusing than useful for the multi-task BO algorithm. Figure 9a shows the sequence of the tested parameter sets in study no. 1 based on the multi-task approach. The sequence was identical for both acquisition functions, PI and EI. Figure 9b shows the achieved values for the surface topography def in each step. From the two figures, it becomes clear that the optimum parameter set (500, 1500) was already found in the first step. This is also in agreement with the result from Hartl et al. [23], where for study no. 1 the parameter set (500, 1500) led to the best result regarding the visual inspection. Figure 10 shows a color image and a topography image of the evaluated welding surface from this experiment.
Since the approach tested all 16 parameter settings and evaluated each parameter set only once (see Algorithm A3), the quality of the surface topography decreased in the subsequent steps (see Fig. 9b).

Fig. 8 Number of steps required to achieve a parameter set that is among the best 5% by using the acquisition functions a probability of improvement and b expected improvement

Figure 11 visualizes the GP for study no. 1 when using the BO multi-task approach with the mean function and the 68% confidence interval. The first three steps have already been performed, so the points (500, 1500), (500, 2167), and (833, 1500) in Fig. 11 are already known. Since the acquisition function value for the point (833, 2167) was the highest compared with the other points not yet evaluated, this point was evaluated next for both the PI and the EI acquisition function. Figure 12 is analogous to Fig. 11, but uses the BO single-task approach with the first three steps already performed; here, it was specified that in the first step the parameter set (500, 1500) was tested. It became clear that the GP mean function and the associated 68% confidence interval estimated the real data considerably less accurately than with the multi-task approach illustrated in Fig. 11.
In the investigations on the multi-task approach presented so far, all nine other studies shown in Table 1 were used to find suitable parameters. In the study pairs 3 ↔ 9, 4 ↔ 8, and 5 ↔ 6, the same aluminum alloy, the same sheet thickness, and the same shoulder geometry were used. The disadvantage of these studies, hereinafter referred to as duplicates, was that the GP needed additional hyperparameters (see Eq. 11), which made their learning more complex. Furthermore, it was not possible to investigate how well the other data sets, which had no duplicates, could be used for the data sets that had duplicates. In further investigations using the multi-task approach, the duplicates were therefore not used, and the study number i (see Eq. 11) was no longer required for the differentiation, which also simplified the covariance function. Table 8 in Appendix II shows which data sets were used to calculate suitable parameter sets, whereas Table 9 shows the number of steps required when the duplicates were omitted.

Fig. 9 a Sequence of the selected parameter sets for study no. 1 when using the multi-task approach for the two acquisition functions tested. b Scaled surface defects def achieved in each step

Fig. 10 Color image and topography image of the best weld surface achieved in study no. 1, in which the parameter set was found in the first step using the Bayesian optimization multi-task approach

Especially in study nos. 2 and 7, which had no duplicates, the results could be improved. In the studies that had duplicates, the results were worse if the duplicates were not used. Only in study no. 9 did the results improve, even though there was a duplicate. This was probably due to the fact that study no. 3 showed a poor welding result for a welding speed of 833 mm/min and a rotational speed of 2167 min⁻¹, whereas study no. 9 showed the best welding results in this parameter range.
Overall, it could be shown that Bayesian optimization can be used very efficiently to find suitable process parameters for friction stir welding. The Bayesian optimization was better suited than reinforcement learning for the aim of this project, namely to optimize the surface quality of FSW seams as efficiently as possible. For the optimization using reinforcement learning, a policy had to be found that maximizes the expected value of an infinite sum of random variables, and the function def had to be evaluated for each process parameter combination. Using the Bayesian optimization, the function def could be optimized directly by searching for values of the welding speed and the tool rotational speed that minimize def.

Fig. 11 Visualization of the Gaussian process and the acquisition functions after three tested parameter sets in study no. 1 after applying the multi-task approach
In this work, only the welding speed and the tool rotational speed were varied. This can be extended to other process parameters in further work. For example, the tilt angle or the immersion depth of the tool could additionally be included to optimize the surface quality. The Bayesian optimization method can also be transferred to force-controlled processes in order to find a suitable setting for the axial force of the tool.

Fig. 12 Visualization of the Gaussian process and the acquisition functions after three tested parameter sets in study no. 1 after applying the single-task approach
Optimization of the surface quality with consideration of the welding speed

Since the welding speed v_s has a direct influence on the process productivity [30] in any welding operation in an industrial context, the objective behind the selection of suitable welding parameters is to maximize the welding speed v_s while ensuring an acceptable welding quality [31]. The growing market for the use of friction stir welding in the electromobility sector also requires sufficiently high welding speeds in order to enable economic production [27]. Richter [27] recommends aiming for a welding speed v_s of at least 1000 mm/min. Therefore, in the investigations described in this section, slower welding speeds were penalized with p_lvp according to the following formula:

$$p_{lvp} = lvp \cdot \frac{v_{s,max} - v_s}{v_{s,max} - v_{s,min}},$$

with lvp being a factor indicating the magnitude of the low welding speed penalty, v_{s,max} being the maximum welding speed in a study, and v_{s,min} being the minimum welding speed in a study. The higher the selected lvp factor, the more strongly higher welding speeds are preferred. The value for p_lvp was then added to the values for def, which were calculated according to Eq. 16, to generate the new variable def_vs, which took into account both the surface quality and the welding speed v_s:

$$def_{vs} = def + p_{lvp}$$

In this project, the maximum welding speed in all ten studies was 1500 mm/min and the minimum welding speed was 500 mm/min. Additionally, lvp was chosen to be 0.15. For a welding speed v_s of 833 mm/min, for example, this resulted in a p_lvp of 0.10, which was then added to all values for def where a welding speed v_s of 833 mm/min was applied.

Fig. 14 Color image and topography image of the best weld in study no. 1, taking into account the weld surface quality and the welding speed; the parameter setting was achieved in the first step using the Bayesian optimization multi-task approach
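The penalty can be sketched as follows. Note that the closed form of p_lvp is inferred from the worked example in the text (lvp = 0.15 and v_s = 833 mm/min giving p_lvp ≈ 0.10), so it should be treated as an assumption rather than the paper's exact equation:

```python
# Sketch of the welding-speed penalty added to the surface-defect value.
# The linear form of p_lvp is reconstructed from the worked example in
# the text (lvp = 0.15, v_s = 833 mm/min -> p_lvp of about 0.10) and is
# therefore an assumption, not the paper's verbatim formula.

V_S_MAX = 1500.0  # maximum welding speed over all ten studies, mm/min
V_S_MIN = 500.0   # minimum welding speed over all ten studies, mm/min

def p_lvp(v_s, lvp=0.15):
    # Penalty grows linearly as the welding speed drops below v_s,max.
    return lvp * (V_S_MAX - v_s) / (V_S_MAX - V_S_MIN)

def def_vs(def_value, v_s, lvp=0.15):
    # Combined objective: surface-defect value plus low-speed penalty.
    return def_value + p_lvp(v_s, lvp)
```

With these values, the fastest setting (1500 mm/min) receives no penalty, while the slowest (500 mm/min) receives the full lvp of 0.15.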
Figure 13a shows the sequence of tested parameter sets when using the BO multi-task approach with all ten studies (including the duplicates) and penalizing low welding speeds v_s with p_lvp. It becomes clear that parameter settings with a welding speed v_s of 1500 mm/min were then preferred. Figure 13b shows the corresponding values for def_vs. The optimum def_vs of 0.12 for study no. 1 was again already reached in the first step, with the parameter set (1500, 2167). Figure 14 shows a color image and a topography image of the evaluated welding surface for the parameter set (1500, 2167). It becomes clear that the surface shows small irregular flash formation and slight surface galling compared with the result shown in Fig. 10. However, the welding speed v_s was three times as high.
This provides a means for the FSW user to weight the two criteria, surface quality and welding speed, individually according to the requirements by setting the parameter lvp, and to take this weighting into account in the learning-based automated search for suitable process parameters.

Conclusions and future research
A total of 262 friction stir welds were performed within ten studies. Subsequently, two learning-based algorithms, reinforcement learning and Bayesian optimization, were tested for their applicability in optimizing the surface topography by adjusting the welding speed and tool rotational speed. The following conclusions were drawn:

- The optimization problem could be solved by means of reinforcement learning, but not efficiently. Furthermore, it was complicated to solve the problem with reinforcement learning, because a policy had to be found that maximizes the expected value of an infinite sum of random variables. Instead, it was better to solve the optimization problem directly by using the Bayesian optimization.
- The Bayesian optimization found suitable settings for the process parameters significantly faster than random search, both without (single-task approach) and with (multi-task approach) the aid of the data sets from the other studies.
- In the multi-task approach, the information from the other studies could be used successfully to find suitable welding parameters even faster compared with the single-task approach.
- By penalizing low welding speeds, both the surface topography and the welding speed could be considered in the optimization.
In a future project, the aim will be to show that the optimization of the surface topography in general leads to an increase in the ultimate tensile strength of the friction stir welded joint. The transfer of the algorithms developed in this work for the inline optimization of the weld seam surface is also the subject of future investigations.
Acknowledgments The IGF research project no. 19389 N of the "Research Association on Welding and Allied Processes of the DVS" was funded by the AiF within the framework for the promotion of industrial community research (IGF) of the Federal Ministry for Economic Affairs and Energy, based on a decision of the German Bundestag.
Funding Open Access funding provided by Projekt DEAL.

Reinforcement learning
Reinforcement learning (RL) describes the concept of goal-oriented learning by an agent through interaction with its environment. At each discrete time step, the agent performs an action, which influences the state of the environment and for which the agent receives a reward. The agent seeks to maximize its rewards, whereby it is guided towards the goal. In general, the appeal of RL is that the agent learns autonomously to achieve the goal, while the user only needs to assess, by means of rewards, how well the goal was achieved [32].
Markov decision processes (MDPs) have become the standard formalism for mathematically describing sequential decision-making tasks in stochastic environments, such as RL problems [33]. An MDP is defined by [25]:

- S, the set of states s in which the environment can be,
- A, the set of actions a that the agent can perform in the environment, and
- p(s′, r | s, a): S × ℝ × S × A → [0, 1], the state transition function, which describes the probability distribution of the next state s′ and the reward r after the agent has executed the action a in the state s.
At the beginning (time step t = 1), the agent is in an initial state s 1 . At every time step t, the environment is in the state s t and the agent executes an action a t . According to the state transition function, the agent is in the state s t + 1 at the next time t + 1 and gets the reward r t . Figure 15 illustrates this cycle.
By using RL, a possibly stochastic policy π(a | s): A × S → [0, 1] is searched for, according to which the agent acts in the environment; π(a | s) indicates the probability with which the agent executes the action a in the state s. The policy is supposed to maximize the sum of all expected weighted future rewards

$$\mathbb{E}_{r_1, r_2, \ldots}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\right],$$

where $\mathbb{E}_{r_1, r_2, \ldots}$ is the expected value over the random variables r_1, r_2, …. To keep the sum of all future rewards finite, r_t is weighted by γ^{t−1}, where γ ∈ [0, 1].
A γ smaller than one can also express that immediate rewards are more important to the agent than future rewards [25].
Value iteration is an algorithm to find a policy π. It uses an auxiliary function, the value function V^π(s): S → ℝ, which represents the sum of all expected weighted future rewards when the agent starts in state s and follows the policy π [25]:

$$V^{\pi}(s) = \mathbb{E}_{r_1, r_2, \ldots}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1 = s\right]$$

The value function V^π can in turn be defined recursively [25]:

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^{\pi}(s')\right]$$

The policy is deterministic, returns the action a that maximizes the expected weighted future reward, and can be defined using the value function V^π [25]:

$$\pi(s) = \underset{a}{\arg\max} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma V^{\pi}(s')\right]$$

The value function is represented by a look-up table. Since the value function, and thus also the values of the look-up table, are unknown, the values are learned with the help of the value iteration algorithm. First, the value function is initialized arbitrarily; then, it is updated until the change is only marginal (i.e., smaller than δ) [25].
The value iteration algorithm is described in Algorithm A2.
Algorithm A2 Value iteration algorithm [34]

The value iteration algorithm converges to the optimal value function and thus also to the optimal policy [25]. However, both S and A must be finite, because the algorithm iterates over all states s and actions a.
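As a concrete illustration of the tabular update in Algorithm A2, the following is a minimal value iteration sketch on a toy two-state MDP; the states, actions, transition probabilities, and rewards are hypothetical and unrelated to the welding studies:

```python
# Minimal value iteration sketch (in the spirit of Algorithm A2) on a
# toy two-state, two-action MDP; all numbers here are hypothetical.

# transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9        # discount factor
delta_stop = 1e-8  # convergence threshold (the delta in Algorithm A2)

V = {s: 0.0 for s in transitions}  # arbitrary initialization
while True:
    delta = 0.0
    for s in transitions:
        # Bellman optimality update: best expected one-step return
        v_new = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
            for outcomes in transitions[s].values()
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < delta_stop:
        break

# Greedy (deterministic) policy derived from the converged value function
policy = {
    s: max(actions, key=lambda a: sum(
        p * (r + gamma * V[s2]) for p, s2, r in actions[a]))
    for s, actions in transitions.items()
}
```

Note how the loop must sweep over every state and action in each iteration, which is exactly why the approach scales poorly when each state-action evaluation corresponds to an actual weld.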

Gaussian processes
A Gaussian process (GP) is a stochastic process which, among other things, can be used to model probability distributions over functions. Therefore, GPs can be used as a model for regression analysis [35]. Figure 16 shows a GP with the mean of the GP, the 95% confidence interval, and three samples, whereby four data points are known. Let {x_1, x_2, …, x_n} ⊂ X be a finite set of data points [35]. Then, a GP is a stochastic process (T_x)_{x ∈ X} in which each finite subset of points {T_{x_1}, T_{x_2}, …, T_{x_n}} follows a multivariate normal distribution [35]:

$$(T_{x_1}, \ldots, T_{x_n}) \sim \mathcal{N}\big((m(x_i))_{i}, (k(x_i, x_j))_{i,j}\big)$$

Thereby, 𝒩 denotes the normal distribution, and m(x): X → ℝ is the mean function, which describes the expected value of the random variable T_x [35]:

$$m(x) = \mathbb{E}[T_x]$$

and k(x, x′): X × X → ℝ is the covariance function, which can be represented by a kernel function. The covariance function describes the relationship between the two random variables T_x and T_{x′}, respectively the similarity between the two points x and x′. In particular, an unknown point x_{n+1} with the corresponding random variable T_{x_{n+1}} is normally distributed, too [35]:

$$T_{x_{n+1}} \mid T_{x_1}, \ldots, T_{x_n} \sim \mathcal{N}\big(\mu_{n+1}, \sigma_{n+1}^2\big)$$

where T_{x_{n+1}} | T_{x_1}, …, T_{x_n} is the random variable T_{x_{n+1}} conditioned on T_{x_1}, …, T_{x_n}. A kernel k(x, x′): X × X → ℝ describes the similarity between the two points x and x′. The Matérn 5/2 kernel k_E is isotropic, i.e., the kernel depends only on the distance between the points x and x′ [35]. The bigger the distance between the two points x and x′, the more different they are from each other and the smaller k_E(x, x′) is [35, 36]:

$$k_E(x, x') = \sigma \left(1 + \frac{\sqrt{5}\,\lVert x - x' \rVert_2}{l} + \frac{5 \lVert x - x' \rVert_2^2}{3 l^2}\right) \exp\left(-\frac{\sqrt{5}\,\lVert x - x' \rVert_2}{l}\right)$$

where ‖·‖_2 is the Euclidean norm. The Matérn 5/2 kernel has two hyperparameters, the length-scale l ∈ ℝ⁺ and the variance σ ∈ ℝ⁺. Thereby, the length-scale l scales the distance and the variance σ scales the value of the function [37]. Along with the radial basis function (RBF) kernel, the Matérn 5/2 kernel is a common choice of kernel function in Euclidean space [35].

Fig. 15 The agent-environment interaction in an MDP [25]
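As a numerical sketch of this regression, the snippet below implements the Matérn 5/2 kernel and the GP conditional (posterior) mean and variance at new points, assuming noise-free observations and a zero mean function; the function names and data values are illustrative:

```python
import numpy as np

def matern52(x1, x2, length_scale=1.0, variance=1.0):
    # Matérn 5/2 kernel: isotropic, depends only on the Euclidean distance.
    d = np.linalg.norm(np.atleast_1d(x1) - np.atleast_1d(x2))
    s = np.sqrt(5.0) * d / length_scale
    return variance * (1.0 + s + s * s / 3.0) * np.exp(-s)

def gp_posterior(X_train, y_train, X_test, kernel=matern52, jitter=1e-8):
    # Conditional distribution of the GP at X_test given (noise-free)
    # observations y_train at X_train; zero mean function assumed.
    K = np.array([[kernel(a, b) for b in X_train] for a in X_train])
    K += jitter * np.eye(len(X_train))            # numerical stability
    K_s = np.array([[kernel(a, b) for b in X_train] for a in X_test])
    K_inv_y = np.linalg.solve(K, np.asarray(y_train, float))
    mean = K_s @ K_inv_y
    K_inv_Ks = np.linalg.solve(K, K_s.T)          # shape (n_train, n_test)
    var = np.array([kernel(a, a) for a in X_test]) - np.einsum(
        'ij,ji->i', K_s, K_inv_Ks)
    return mean, var

X_train = [0.0, 1.0, 2.0]
y_train = [0.0, 1.0, 0.0]
mean, var = gp_posterior(X_train, y_train, [1.0, 5.0])
```

At a training point the posterior mean interpolates the observation and the variance collapses towards zero, while far from the data the posterior reverts to the prior mean and variance — the behavior visible in Figs. 11, 12, and 16.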
The coregionalization kernel k_c(x, x′): {1, 2, …, n} × {1, 2, …, n} → ℝ is used to model the similarity of the output dimensions of a function with multi-dimensional output [38] and is defined as [36, 39]:

$$k_c(x, x') = \left(WW^{\top} + \mathrm{diag}(\kappa)\right)_{x, x'}$$

where

- n is the number of output dimensions,
- {1, 2, …, n} is the set of natural numbers less than or equal to n,
- diag(x): ℝⁿ → ℝ^{n×n} is a function that maps a vector with n dimensions to an n × n diagonal matrix,
- (…)_{x,x′} is the x′-th entry of the x-th row of the corresponding matrix,
- W ∈ ℝ^{n×m} and κ ∈ ℝⁿ are hyperparameters learned by hyperparameter optimization, and
- m is an arbitrary natural number, usually smaller than n.
With the aid of the coregionalization kernel, functions with multi-dimensional output can be modeled using a GP [38]. Assuming f: X → ℝⁿ is any function, then f can be represented by f′(x, i): X × {1, 2, …, n} → ℝ with f′(x, i) = f(x)_i. This means that the index of the output of f is seen as an input of the function f′ [38]. The similarity between the individual output dimensions of f can be modeled with the coregionalization kernel [38]. This results in a combined kernel [38]:

$$k'\big((x, i), (x', i')\big) = k(x, x') \cdot k_c(i, i')$$

where k is the kernel that measures the similarity between the points x and x′, and k_c is the coregionalization kernel that measures the similarity between the different output dimensions [38]. A GP can have hyperparameters such as the length-scale l and the variance σ of the Matérn 5/2 kernel. These hyperparameters are usually determined using a maximum likelihood estimate (MLE) [40], i.e., the hyperparameters H that best explain the data D are searched for [35, 37, 40]:

$$H^* = \underset{H}{\arg\max}\; p(D \mid H)$$

Fig. 16 Example of a Gaussian process [35]

If the GP has many hyperparameters and is optimized with too few data points, the hyperparameters may overfit the data points, i.e., they only explain the data that was used for the optimization, but not new data [35]. This can be prevented by assuming that the hyperparameters follow a certain distribution p(H) (e.g., a uniform, normal, log-normal, or gamma distribution) [40], which expresses which hyperparameter values are likely and which are not. The hyperparameters are then determined using a maximum a posteriori (MAP) estimate [41].
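A small sketch of this construction follows, with illustrative values for n, m, W, and κ (in the paper, these hyperparameters were learned from the study data, not drawn at random):

```python
import numpy as np

# Coregionalization term B = W W^T + diag(kappa): an n x n task-similarity
# matrix for a function with n output dimensions. n_tasks, m, W, and kappa
# are illustrative values, not the ones learned from the welding studies.
rng = np.random.default_rng(1)
n_tasks, m = 3, 2
W = rng.normal(size=(n_tasks, m))
kappa = np.abs(rng.normal(size=n_tasks))
B = W @ W.T + np.diag(kappa)

def input_kernel(x1, x2, length_scale=1.0, variance=1.0):
    # Matérn 5/2 kernel on the (one-dimensional) input points.
    d = abs(x1 - x2)
    s = np.sqrt(5.0) * d / length_scale
    return variance * (1.0 + s + s * s / 3.0) * np.exp(-s)

def combined_kernel(x1, i1, x2, i2):
    # k'((x, i), (x', i')) = k(x, x') * k_c(i, i')
    return input_kernel(x1, x2) * B[i1, i2]
```

The low-rank term W W^T encodes shared structure across the output dimensions (here, the studies), while diag(κ) lets each dimension keep some independent variance.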
In contrast to the MLE, the MAP estimate searches for the hyperparameters H that are most likely given the data D [35, 40]:

$$H^* = \underset{H}{\arg\max}\; p(H \mid D) = \underset{H}{\arg\max}\; p(D \mid H)\, p(H)$$

Bayesian optimization

The Bayesian optimization (BO) is a class of machine learning-based optimization methods focused on solving the problem [40]:

$$x^* = \underset{x \in X}{\arg\min}\; f(x)$$

The BO is suited when the objective function f is expensive to evaluate, no gradient information is available, and the set X is simple (e.g., a finite grid or a box) [40]. The BO is an iterative algorithm which evaluates the function f at a certain point in each iteration. In order for the algorithm to know at which point f should be evaluated next, the function f is first modeled using a GP and the already known points P. The strategy according to which the BO then selects the point to be evaluated next is determined by the maximum of the acquisition function a(x): X → ℝ [40]. One acquisition function is the probability of improvement (PI). This acquisition function indicates how likely it is that a point will improve on the previous optimum. The probability of improvement is defined as [42]:

$$a_{PI}(x') = P\big(T_{x'} \le f_{min} \,\big|\, T_{x_1}, \ldots, T_{x_n}\big)$$

where f_min is the best value evaluated so far (in this work, the goal was to minimize the surface defects), T_{x′} is the GP at the point x′, and T_{x_1}, …, T_{x_n} is the GP at the points evaluated up to now. The disadvantage of this acquisition function is that it only indicates the probability of an improvement, but not the magnitude of the improvement [42].
Another acquisition function, which also includes the magnitude of the improvement, is the expected improvement (EI). EI provides a good balance between exploration and exploitation [43]. The EI acquisition function is defined as [44]:

$$a_{EI}(x') = \mathbb{E}\big[\max(f_{min} - T_{x'},\, 0) \,\big|\, T_{x_1}, \ldots, T_{x_n}\big]$$

The BO can also be used to optimize a function f: X → ℝⁿ with multi-dimensional output [45]. The BO algorithm is described in Algorithm A3. At the beginning, m ∈ ℕ random samples are taken, first to investigate the function to be optimized, and second so that there is at least one point with which the function f can be modeled in the first iteration.
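The loop of Algorithm A3 can be sketched as follows on a finite grid of (welding speed, rotational speed) settings; the objective function, grid values, input normalization, and kernel settings are illustrative stand-ins, not the measured def values from the studies:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def objective(x):
    # Hypothetical stand-in for the measured surface-defect value def.
    vs, n = x
    return (vs / 1500 - 0.4) ** 2 + (n / 2500 - 0.6) ** 2

grid = [(vs, n) for vs in (500, 833, 1167, 1500)
                for n in (1500, 2167, 2833, 3500)]

def kernel(a, b, length_scale=0.5, variance=1.0):
    # Matérn 5/2 kernel on crudely normalized inputs.
    a = np.asarray(a, float) / (1500.0, 3500.0)
    b = np.asarray(b, float) / (1500.0, 3500.0)
    s = sqrt(5.0) * np.linalg.norm(a - b) / length_scale
    return variance * (1.0 + s + s * s / 3.0) * exp(-s)

def gp(X, y, x_star, jitter=1e-8):
    # GP posterior mean and standard deviation at one candidate point.
    K = np.array([[kernel(p, q) for q in X] for p in X]) + jitter * np.eye(len(X))
    k_s = np.array([kernel(p, x_star) for p in X])
    mu = k_s @ np.linalg.solve(K, y)
    var = kernel(x_star, x_star) - k_s @ np.linalg.solve(K, k_s)
    return mu, sqrt(max(var, 1e-12))

def expected_improvement(mu, sigma, f_min):
    z = (f_min - mu) / sigma
    Phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))    # standard normal CDF
    phi = exp(-z * z / 2.0) / sqrt(2.0 * pi)  # standard normal PDF
    return sigma * (z * Phi + phi)

rng = np.random.default_rng(0)
X = [grid[rng.integers(len(grid))]]           # one random initial sample
y = [objective(X[0])]
for _ in range(6):                            # BO iterations
    f_min = min(y)
    candidates = [c for c in grid if c not in X]
    scores = [expected_improvement(*gp(X, np.array(y), c), f_min)
              for c in candidates]
    x_next = candidates[int(np.argmax(scores))]  # maximize the acquisition
    X.append(x_next)
    y.append(objective(x_next))
best = X[int(np.argmin(y))]
```

On this synthetic bowl-shaped objective, the EI-driven loop typically homes in on the low-defect region of the grid within a handful of evaluations, mirroring the behavior reported for the BO approaches.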
Algorithm A3 Bayesian optimization algorithm [40] In Algorithm A3, a(x) is any acquisition function and X \{x 1 , …} is the set X without the points already evaluated.

Random search
Random search (RS) is an optimization algorithm that evaluates the function to be optimized at random points until a stop criterion is reached. The returned optimum is the optimum of the points evaluated [46,47]. The advantages are that no requirements are placed on the function to be optimized and no gradients are required [48]. In addition, RS is useful if a significant part of the input set X maps to an optimum.

Table 5 Hyperparameters for the multi-task approach, their distributions, and the parameters of the distributions (e.g., a log-normal distribution with μ = 0 and σ = 0.5)
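For comparison with Algorithm A3, RS over a finite grid reduces to shuffling the candidates and evaluating them in order; the grid and objective below are hypothetical:

```python
import random

def random_search(grid, objective, budget, seed=0):
    # Evaluate the objective at `budget` randomly ordered grid points
    # (sampling without replacement) and return the best one found.
    rng = random.Random(seed)
    candidates = list(grid)
    rng.shuffle(candidates)          # random evaluation order
    tested = candidates[:budget]
    best = min(tested, key=objective)
    return best, objective(best)

# Hypothetical stand-in objective; the grid mirrors a small
# (welding speed, rotational speed) parameter study.
grid = [(vs, n) for vs in (500, 1000, 1500) for n in (1500, 2500, 3500)]
best, val = random_search(grid, lambda x: abs(x[0] - 500) + abs(x[1] - 1500),
                          budget=len(grid))
```

Unlike the BO loop, no model of the objective is built, so nothing learned from earlier evaluations guides the next choice — which is why its expected number of steps to a good setting is higher.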