Introduction

With the continued scaling of silicon technology, process tuning and optimization become critical at advanced technology nodes. This work uses laser annealing as the case study for Bayesian reinforcement learning (BRL). In semiconductor annealing, traditional furnace annealing imposes a high thermal budget, while rapid thermal annealing (RTA) is generally full-wafer and non-localized. Laser annealing is therefore a promising alternative, with several advantages. First, the localized heating in laser annealing results in less dopant diffusion, which leads to a sharper junction profile and reduces the subthreshold leakage current (Gluschenkov & Jagannathan, 2018; Robinson, 1978; Takamura et al., 2002; Whelan et al., 2002). A low thermal budget is important in monolithic 3D integrated circuits because it reduces heat transfer to the bottom layers, and controlling the shielding layer's refractive index can confine the heated region and prevent metal interconnects or devices from being damaged (Pey & Lee, 2018; Rajendran et al., 2007). Second, solidification after laser annealing takes only several microseconds, so dopant diffusion is greatly suppressed (Pey & Lee, 2018). This enables the doping concentration to exceed the solid solubility, improving the film's conductivity (Gluschenkov & Jagannathan, 2018; Zhang et al., 2006). Third, laser annealing can be conducted in air because impurities in the atmosphere hardly diffuse into the samples owing to the fast solidification (White et al., 1979).

Parameter tuning is inevitable when optimizing the laser annealing process or, in general, any semiconductor process. In the new age of the industrial revolution, i.e., Industry 4.0, industries have developed strategic guidelines to facilitate process control and optimization using emerging technologies such as big data analytics, cyber-physical systems (CPS), data science, cloud computing, and the Internet of Things (IoT) (Kotsiopoulos et al., 2021; Wang et al., 2018). The concept of intelligent manufacturing (IM) refers to the application of artificial intelligence (AI), IoT, and sensors in manufacturing to make sound decisions through real-time communication (Yao et al., 2017). Modern factories are also expected to adopt new methodologies for better-optimized design, manufacturing quality, and production (Kusiak, 1990). To achieve fully automated, optimized manufacturing, the keys are machine learning (ML) and deep learning (DL) algorithms that can help develop strategies to automatically analyze, diagnose, and predict patterns from high-dimensional data. Specific to the semiconductor industry, where processes are complex, interwoven, and sensitive to process parameters, ML enables continuous quality improvement (Li & Huang, 2009; Monostori et al., 1998; Pham & Afify, 2005; Rawat et al., 2023). Supervised learning (SL) and unsupervised learning (USL) have been used extensively in manufacturing industries for process monitoring, control, optimization, fault detection, and prediction (Alpaydin, 2020; Çaydaş & Ekici, 2012; Gardner & Bicker, 2000; Khanzadeh et al., 2017; Li et al., 2019; Pham & Afify, 2005; Salahshoor et al., 2010; Susto et al., 2015). Nevertheless, SL/USL has drawbacks, such as being limited to a specific process within the entire manufacturing system (Doltsinis et al., 2012).

In contrast to SL/USL, reinforcement learning (RL) is another ML approach in which no supervision is required to train the model (Sutton & Barto, 2018). RL is a well-known approach that was initially used in control and prediction problems such as autonomous driving and the game of Go, and it has also been used in many optimization problems (Jacobs et al., 2021; Ruvolo et al., 2008; Wang et al., 2020). In manufacturing industries specifically, the production environment is often dynamic and non-deterministic, with unexpected scenarios or incidents (Monostori et al., 2004). In such a stochastic environment, where randomness in the dataset makes prediction difficult, an RL agent is more capable of learning the manufacturing process than SL/USL methods (Guevara et al., 2018; Günther et al., 2016; He et al., 2022; Hourfar et al., 2019; Kormushev et al., 2010; Silver et al., 2017). RL has been used for scheduling and dispatching in the semiconductor industry. In particular, Waschneck et al. used deep RL for production scheduling and realized optimization and decentralized self-learning (Waschneck et al., 2018). Stricker et al. presented the usability of deep RL in semiconductor dispatching and demonstrated improved system performance (Stricker et al., 2018). Khader et al. used RL to tune the surface mount technology (SMT) process for printed circuit boards (PCB) with experimental data (Khader & Yoon, 2021). In addition, some semiconductor process-control efforts have applied RL on a more theoretical level (Khakifirooz et al., 2021; Li et al., 2021; Pradeep & Noel, 2018). While RL has been used extensively in the semiconductor industry for production control (Altenmüller et al., 2020) or scheduling (Lee & Lee, 2022; Luo, 2020; Park et al., 2020; Shi et al., 2020; Waschneck et al., 2018), there have been fewer works using RL in intelligent semiconductor manufacturing, especially compared to the efforts in SL/USL. The advantage of using RL in the semiconductor industry is that, in general, RL can learn complex and dynamic problems with reasonable generalizability and efficient data utilization (Silver et al., 2017; Wiering & Otterlo, 2012). In addition, RL can split the primary task into several subtasks, resulting in a flexible, decentralized structure that facilitates computation parallelization (Chang et al., 2022; Wang & Usher, 2005). In our previous work, we used an RL agent to analyze its performance against human knowledge in semiconductor fabrication and showed the importance of exploration (Chang et al., 2022; Rawat et al., 2022).

In addition to RL, Bayesian inference is investigated and applied to the laser annealing problem in this work. Bayesian statistics and inference have recently been regarded as highly promising in machine learning fields and applications, such as the Bayesian network (BN) and the Bayesian neural network (BNN). A BN computes the conditional probability of each node and is usually applied in classifiers such as Naïve Bayes and BN-augmented Naïve Bayes (Muralidharan & Sugumaran, 2012). A BNN, on the other hand, can be used in classification (Auld et al., 2007; Thiagarajan et al., 2022) and even regression problems (Chen et al., 2019). For example, a BNN has been applied in semiconductor manufacturing by fitting a TCAD dataset and using the resulting weights and biases of each neuron as the prior, thus achieving lower regression loss with a limited amount of true data (Chen et al., 2019). In addition to BN and BNN, Bayes' theorem has also been applied to RL, leading to Bayesian reinforcement learning (BRL). Many methods have been proposed to realize BRL, such as the myopic value of perfect information (Myopic-VPI), Q-value sampling, and randomized prior functions (RPF) (Dearden et al., 1998; Ghavamzadeh et al., 2015; Hoel et al., 2020; Osband et al., 2018; Vlassis et al., 2012). BRL is widely used in various fields (Gronewold & Vallero, 2010; Hoel et al., 2020; Kong et al., 2020; Liu et al., 2021), such as autonomous driving, gaming, and energy management, but we have not found any prior art in semiconductor manufacturing. The advantage of BRL is that the agent can better select actions with assistance from the prior, which encodes domain knowledge if an informative prior is used. This reduces the required training time and amount of data and can lead to a more robust model (Ghavamzadeh et al., 2015; Vlassis et al., 2012). This work is included in our student thesis (Chang, 2023).

The optimization of the laser annealing process itself is not well studied, and few papers on it can be found in the literature (Alonso et al., 2022). Optimizing semiconductor processing in general is common in the literature, but based on our literature review, very few papers have applied Bayesian RL to semiconductor processing (Li et al., 2023). We could only find Bayesian statistics, such as Bayesian belief networks, Bayesian neural networks, hierarchical Bayesian models, and Bayesian parameter estimation, in the field of intelligent semiconductor manufacturing. The novelty of this work lies in incorporating a TCAD prior and a Bayesian formulation into RL, whereas Bayesian RL has so far been used in fields other than semiconductor processing.

To summarize the architecture of this paper, the "Methodology" section describes the method, including the laser annealing fabrication processes carried out in the clean room facilities. It presents the detailed formulations constituting the entire BRL process and how the BRL framework and formulas are fitted to the laser annealing case; the definitions of the state, action, reward, and prior are also clarified there. The "Results and discussion" section provides experimental and numerical results of BRL. Regular RL, BRL with a fixed prior, and BRL with a variable prior are compared, where the variable prior refers to a prior that changes continuously during the RL process. The relative strengths and weaknesses of the respective methods are stated, and the reasons for improvement are discussed. The contribution of this work is highlighted in the "Results and discussion" and "Conclusion" sections. Essentially, the application of BRL in semiconductor manufacturing is little studied in the literature, and the laser annealing case study demonstrates the potential of using BRL in other semiconductor processes. More importantly, the TCAD informative prior provides valuable guidance in RL, which is well demonstrated in this work.

Methodology

Sample fabrication

First, 6-inch p-type silicon wafers with a resistivity of 1–10 Ω cm were cleaned using the standard (STD) clean process, in which the wafers were cleaned in a wet bench before any high-temperature deposition. First, the wafers were cleaned in the SC1 solution of NH4OH:H2O2:H2O in a ratio of 1:4:20 at 75 °C for 10 min and then rinsed. The wafers were then cleaned using the SC2 process, a solution of HCl:H2O2:H2O in a ratio of 1:1:6 at 75 °C for 10 min. After another rinse, the wafers were cleaned with diluted hydrofluoric acid (DHF), HF:H2O in a ratio of 1:50, at room temperature for 1 min. After the DHF step, the wafers were rinsed and spin-dried.

After cleaning, the SVCS furnace system was used to deposit two films, 500 nm SiO2 and 100 nm polysilicon, to form a silicon-on-insulator (SOI) structure that isolates the doped layer from the substrate, because the target of this experiment is the sheet resistance (Rs) after laser annealing. If the substrate were not isolated, current could penetrate into the substrate during measurement and distort the result.

The next step is implantation with a Varian E500HP. In this experiment, two different ions (arsenic and phosphorus), four ion energies (10, 25, 40, and 55 keV), and four doses (5 × 1015, 2 × 1015, 8 × 1014, and 5 × 1014 cm−2) are used. After implantation, the wafers are cut into 1.7 × 1.7 cm2 pieces to prevent a shift of the correction factor in the 4-point probe measurement. The following step is laser annealing, which involves four different variables: laser wavelength, laser repetition rate (rep. rate), laser power (P), and processing temperature (T), as shown in Table 1. The laser annealing parameters selected in this work influence the annealing results. For example, the laser power is the main factor affecting the annealing procedure: too low a laser power prevents the activation energy from being reached because of heat loss, while too high a laser power leads to material damage and surface defects. The substrate temperature during laser annealing affects lattice restoration. The laser wavelength affects photon absorption through the photon energy, which in turn affects the heating efficiency and heating profile. The dopant type affects the material chemistry and the activation energy during annealing. The implant energy affects the dopant profile in the polysilicon and the implantation damage. The implant dose affects the doping concentration of the sample and the required annealing time and energy. The effect of the laser repetition rate is more ambiguous; it affects the cooling and recrystallization during annealing.

Table 1 Process parameters of laser annealing

Finally, every sample is measured with a NAPSON RT-80 to obtain the sheet resistance. Some samples are selected for film-thickness checks by scanning electron microscopy (SEM) with a Hitachi SU-8010 and for secondary ion mass spectrometry (SIMS) with a CAMECA IMS 7F. The process steps are shown in Fig. 1a, b below.

Fig. 1
figure 1

Summary of the experimental setup for BRL/RL optimization. a The steps involved in device preparation, such as deposition of wet oxide and polysilicon, ion implantation, and laser annealing, b device measurement using a 4-point probe system for sheet resistance and the resulting dataset, and c the BRL/RL model implemented in this work

TCAD simulation

The TCAD simulations use Synopsys Sentaurus 2016 (Sentaurus Process User Guide, 2016). According to the manual of the laser used in this study, the pulse width of the green and blue lasers is around 10–20 ns for both repetition rates (rep. rate), 50 and 100 kHz. Such a short pulse width cannot be resolved directly in TCAD. To work around this, 6ts, the interval in which most of the laser energy is released in the TCAD settings, is set to 1 ms and 2 ms for the rep. rates of 50 kHz and 100 kHz, respectively. The value of 6ts for the 100 kHz rep. rate is twice that for the 50 kHz rep. rate because the total energy of two pulses at 100 kHz equals that of one pulse at 50 kHz, i.e., it takes twice as long to release the same laser energy onto the wafer. To calculate the fluence, the integral in Eq. (1) is set equal to the total energy per unit area delivered by the laser used in this experiment. In Eq. (1), the laser energy is assumed to be distributed uniformly over the laser spot to simplify the calculation

$$ \int {\frac{F}{{\sqrt {2\pi } t_{s} }}\exp \left[ {\frac{{ - \left( {t - t_{0} } \right)^{2} }}{{2t_{s}^{2} }}} \right]} dt = \frac{P}{A} \times t_{real} $$
(1)

where F is the fluence, ts is the full width at half maximum (FWHM) time interval divided by \(2\sqrt {2\ln 2}\), t0 is set to 3ts, P is the experimental laser power, A is the laser spot area, and treal is the time the sample is illuminated by the laser, defined in Eq. (2).

$$ t_{real} = L/s $$
(2)

where L is the long axis of the laser spot, and s is the scanning rate, which is 5 cm/s.
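Since the Gaussian pulse in Eq. (1) integrates to F over the full pulse, the fluence follows directly from the deposited energy per unit area. A minimal numerical sketch of Eqs. (1) and (2) is given below; the spot area and long-axis values are illustrative assumptions, not the experimental settings.

```python
import numpy as np
from scipy.integrate import quad

# Minimal sketch of Eqs. (1)-(2); spot dimensions are illustrative assumptions.
P = 6.0            # laser power [W]
A = 0.01           # laser spot area [cm^2] (assumed for illustration)
L_spot = 0.1       # long axis of the laser spot [cm] (assumed for illustration)
s = 5.0            # scanning rate [cm/s] (from the text)
six_ts = 1e-3      # 6*t_s = 1 ms, the TCAD setting for the 50 kHz rep. rate

t_s = six_ts / 6.0
t_0 = 3.0 * t_s                    # the text sets t0 = 3*t_s
t_real = L_spot / s                # Eq. (2): illumination time
F = (P / A) * t_real               # Eq. (1): the Gaussian pulse integrates to F

# Numerical check: integrating the pulse over the 6*t_s window (±3 sigma) recovers ~F
pulse = lambda t: F / (np.sqrt(2 * np.pi) * t_s) * np.exp(-(t - t_0) ** 2 / (2 * t_s ** 2))
energy_per_area, _ = quad(pulse, 0.0, six_ts)
print(F, energy_per_area)          # the two values agree to within the ~0.3% truncation error
```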

Regular reinforcement learning (regular RL)

In this study, Python 3.8.13 (Rossum & Drake, 2009), Tensorflow 2.7.0 (Abadi et al., 2016), Numpy 1.21.2 (Harris et al., 2020), Scipy 1.8.1 (Jones et al., 2001), scikit-learn 1.2.0 (Pedregosa et al., 2011), and Pandas 1.4.2 (McKinney, 2011) are used to construct the models.

For data preprocessing, because small variations in laser power and temperature can cause errors during training, these values are fixed to the same nominal values and transformed into discrete levels. For example, the four ion energies of 10, 25, 40, and 55 keV correspond to levels 1, 2, 3, and 4, respectively. A deep Q network (DQN) is selected as the RL agent in this study. In every trial, it computes the Q value of each action, as shown in Eq. (3) (Mnih et al., 2015), and selects an action based on the epsilon-greedy algorithm expressed in Eq. (4).
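A minimal sketch of this level mapping is shown here (the hypothetical `to_levels` helper and the ascending dose-level ordering are assumptions for illustration); the DQN value estimate and the epsilon-greedy policy referenced above follow in Eqs. (3)–(5).

```python
import pandas as pd

# Hypothetical level maps for two of the seven state parameters; the raw values are
# from the text, and the ascending level ordering for the dose is an assumption.
LEVELS = {
    "Ion energy": {10: 1, 25: 2, 40: 3, 55: 4},            # keV -> level
    "Dose":       {5e14: 1, 8e14: 2, 2e15: 3, 5e15: 4},    # cm^-2 -> level
}

def to_levels(df: pd.DataFrame) -> pd.DataFrame:
    """Replace raw parameter values with the discrete levels used as state entries."""
    out = df.copy()
    for col, mapping in LEVELS.items():
        out[col] = out[col].map(mapping)
    return out

raw = pd.DataFrame({"Ion energy": [10, 40], "Dose": [5e15, 8e14]})
print(to_levels(raw))   # -> levels [1, 3] for ion energy and [4, 2] for dose
```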

$$ Q(s,a) = \left\{ \begin{gathered} r_{real} + \gamma \mathop {\max }\limits_{a^{\prime} \in A} Q_{original} (s^{\prime},a^{\prime}),\;{\text{sampled}}\;{\text{points}} \hfill \\ {\text{Estimate}}\;{\text{from}}\;{\text{Q - Table}}\;{\text{neural}}\;{\text{network,}}\;{\text{unsampled}}\;{\text{points}} \hfill \\ \end{gathered} \right. $$
(3)
$$ \pi (s,a) = \left\{ \begin{gathered} \arg \max Q(s,a),{\text{ with probability }}1 - \varepsilon \hfill \\ {\text{random choice, with probability }}\varepsilon \hfill \\ \end{gathered} \right. $$
(4)

where rreal is the reward received when transitioning from the current state (s) to the next state (s′), γ = 0.2 is the discount factor, A is the action space, a is the current action, a′ is a possible next action, and ε is the exploration rate defined in Eq. (5).

$$ \varepsilon = 0.5 \times 0.9^{({\text{epoch}} - 1)} $$
(5)
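As a concrete illustration, a minimal sketch of the exploration schedule in Eq. (5) and the epsilon-greedy rule in Eq. (4) is given below, assuming the Q values over the action space are already available as a vector.

```python
import numpy as np

def epsilon(epoch: int) -> float:
    """Eq. (5): exploration rate, starting at 0.5 and decaying by a factor of 0.9 per epoch."""
    return 0.5 * 0.9 ** (epoch - 1)

def epsilon_greedy(q_values: np.ndarray, epoch: int,
                   rng: np.random.Generator = np.random.default_rng()) -> int:
    """Eq. (4): argmax of Q with probability 1 - eps, a random action otherwise."""
    if rng.random() < epsilon(epoch):
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Usage with a dummy Q vector over three actions; rewards are -Rs, so less negative is better.
q = np.array([-120.0, -55.0, -80.0])
print(epsilon_greedy(q, epoch=1))
```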

For the state space, the state is defined as s = [Ion, Dose, Ion energy, Wavelength, Rep. rate, P, T], i.e., the experimental parameters used in laser annealing. According to this state and the action applied, the environment returns the reward. This study uses two different action types. In the first, an action (a) is defined as a direct transition to any state, so the action space is the same as the state space; regardless of the current state, the same action leads to the same next state, and the transition can be directed to any defined state. The second action type raises, maintains, or drops the level of each parameter in a state, i.e., a +1/0/−1 action. In this case, the result of an action depends not only on the action but also on the current state. As for the reward, each state-action pair has its own reward, defined as −Rs, the negative of the sheet resistance at the next state; the agent maximizes the reward by finding lower sheet resistance values. In this study, the Q-Table neural network model has 2 or 5 hidden layers with 100 neurons in each hidden layer, and the batch size is 2, 5, or 10 when updating the model based on Eq. (6) (Mnih et al., 2015). Training finishes in 5 RL epochs with 10 timesteps in each epoch. The initial exploration rate is set to 0.5 in the first epoch and is multiplied by 0.9 in each subsequent epoch because the agent gradually receives more data from the environment; therefore, Q-Table training and action selection should rely increasingly on the agent's experience. The update of the Q-Table is based on (Mnih et al., 2015)

$$ L_{i} = E_{(s,a,r,s^{\prime}) \sim U(D)}\left[ \left( r_{real} + \gamma \max_{a^{\prime}} Q\left( s^{\prime}, a^{\prime} \right) - Q\left( s, a \right) \right)^{2} \right] $$
(6)

where Li is the loss function of the Q-Table neural network, (s,a,r,s′) is the agent's experience, and D is the stored dataset of experiences accumulated at each timestep.
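A minimal TensorFlow sketch of the Q-Table neural network and the minibatch loss in Eq. (6) is given below; the ReLU activation and the target-bootstrapping details are assumptions, and the targets follow the sampled-point branch of Eq. (3).

```python
import tensorflow as tf

GAMMA = 0.2  # discount factor gamma from the text

def build_q_network(state_dim: int, n_actions: int, hidden_layers: int = 2) -> tf.keras.Model:
    """Sketch of the Q-Table neural network: 2 (or 5) hidden layers of 100 neurons,
    mapping a state vector to one Q value per action (activation choice assumed)."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(state_dim,)))
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(100, activation="relu"))
    model.add(tf.keras.layers.Dense(n_actions))
    return model

def q_loss(model, states, actions, rewards, next_states):
    """Eq. (6): mean squared TD error over a minibatch (s, a, r, s') drawn from D."""
    q_next = tf.stop_gradient(model(next_states))                # bootstrapped target values
    targets = rewards + GAMMA * tf.reduce_max(q_next, axis=1)    # Eq. (3), sampled points
    q_pred = tf.gather(model(states), actions, batch_dims=1)     # Q(s, a) of the taken actions
    return tf.reduce_mean(tf.square(targets - q_pred))
```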

Bayesian reinforcement learning (BRL)

In BRL, the definitions of the state space, the action space, the reward, and the batch size are the same as in regular RL. One difference is that BRL initializes two different Q-Table neural network models and selects one of them to determine actions, with a new selection made every 5 RL timesteps, as shown in Eq. (7). Q values are updated by the selected model, as shown in Eq. (8).

$$ k = {\text{Uniform}}\{ 1,2\} ,{\text{ every 5 timesteps}} $$
(7)
$$ Q_{k} (s,a) = \left\{ \begin{gathered} r_{real} + \gamma \mathop {\max }\limits_{{a^{\prime } \in A}} Q_{k,original} (s^{\prime } ,a^{\prime } ),\;{\text{sampled}}\;{\text{points}}\;{\text{from}}\;k{ = 1}\;{\text{and}}\;{2} \hfill \\ {\text{Estimate}}\;{\text{from}}\;{\text{Q - Table}}\;{\text{neural}}\;{\text{networks,}}\;{\text{unsampled}}\;{\text{points}} \hfill \\ \end{gathered} \right. $$
(8)

In addition, a prior from TCAD is incorporated to guide the agent toward better actions, as shown in Eq. (9).

$$ p(s,a) = r_{TCAD} $$
(9)

where rTCAD is the reward value from the TCAD Q-table.

Combining the prior in Eq. (9) and the experiential sampling, the agent selects an action based on Eq. (10) (Hoel et al., 2020)

$$ \pi (s,a) = \left\{ \begin{gathered} \arg \;\max \left( {Q_{k} (s,a) + 2p(s,a)} \right),\;{\text{with}}\;{\text{probability}}\;1 - \varepsilon \hfill \\ {\text{random}}\;{\text{choice,}}\;{\text{with}}\;{\text{probability}}\;\varepsilon \hfill \\ \end{gathered} \right. $$
(10)

where ε is the exploration rate defined in Eq. (5).
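The following sketch combines Eqs. (7), (9), and (10), assuming the Q values of the sampled network and the TCAD prior rewards are available as vectors over the action space; the helper names are illustrative only.

```python
import numpy as np

PRIOR_WEIGHT = 2.0   # the factor of 2 in Eq. (10)

def sample_model_index(rng: np.random.Generator = np.random.default_rng()) -> int:
    """Eq. (7): draw k uniformly from {1, 2}; the draw is repeated every 5 timesteps."""
    return int(rng.integers(1, 3))

def select_action(q_k: np.ndarray, prior: np.ndarray, eps: float,
                  rng: np.random.Generator = np.random.default_rng()) -> int:
    """Eq. (10): argmax(Q_k + 2*p) with probability 1 - eps, a random action otherwise."""
    if rng.random() < eps:
        return int(rng.integers(len(q_k)))
    return int(np.argmax(q_k + PRIOR_WEIGHT * prior))

# Usage: Q estimates from the sampled network k and the TCAD prior p(s, a) = r_TCAD = -Rs
q_k = np.array([-90.0, -60.0, -70.0])
prior = np.array([-80.0, -55.0, -65.0])
eps = 0.5 * 0.9 ** (1 - 1)           # Eq. (5) at the first epoch
print(sample_model_index(), select_action(q_k, prior, eps))
```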

There are two different types of priors: a fixed prior and a variable prior. A fixed prior keeps its values until the end of the BRL process, whereas a variable prior keeps changing its values during the BRL process. One update to the prior is that we partially correct the TCAD prior using the experimentally sampled true values. Thus, at the end of every RL step, true data from the experiment correct the TCAD values, and the data points to be modified in the TCAD Q-Table are selected based on their adjacency to the sampled data point in the input space. Since the input space of the optimization objective is discrete and levelized, a difference of one level, considering all seven semiconductor process input variables, is regarded as within the distance eligible for TCAD prior modification. The modification is expressed by Eq. (11). In addition to the TCAD prior modification by the true experimental values, the weight of the prior keeps decreasing from timestep i = 11 onward to suppress the effect of prior inaccuracy, as shown in Eq. (13), and the action is selected based on Eq. (14)

$$ P(s,a) = \left\{ \begin{gathered} r_{real} ,\;{\text{sampled}}\;{\text{points}}\;{\text{from}}\;k{ = 1}\;{\text{and}}\;{2} \hfill \\ r_{TCAD} \times multiplicand,\;{\text{around}}\;{\text{sampled}}\;{\text{points}}\;{\text{from}}\;k{ = 1}\;{\text{and}}\;{2} \hfill \\ r_{TCAD} ,\;{\text{others}} \hfill \\ \end{gathered} \right. $$
(11)

where multiplicand is defined in Eq. (12)

$$ multiplicand = abs\left( {reward_{i} /TCAD_{i} } \right) $$
(12)

where TCADi and rewardi are the −Rs values of the experimental parameters selected at timestep i from TCAD and from the environment, respectively.

$$ P^{\prime}(s,a) = \left\{ \begin{gathered} P(s,a),i \le 10 \hfill \\ P(s,a)/(i - 10)^{2} ,i \ge 11 \hfill \\ \end{gathered} \right. $$
(13)
$$ \pi (s,a) = \left\{ \begin{gathered} \arg \;\max \left( {Q_{k} (s,a) + 2P\prime (s,a)} \right),\;{\text{with}}\;{\text{probability}}\;1 - \varepsilon \hfill \\ {\text{random}}\;{\text{choice,}}\;{\text{with}}\;{\text{probability}}\;\varepsilon \hfill \\ \end{gathered} \right. $$
(14)

where i is the timestep, ε is the exploration rate defined in Eq. (5), and Qk is defined in Eq. (8). The comparison of the RL and BRL algorithms is shown in Fig. 2.
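A compact sketch of the variable-prior bookkeeping in Eqs. (11)–(13) is given below; the flattened indexing and the construction of the one-level neighborhood are assumptions for illustration, and the decayed prior then enters the action selection of Eq. (14) exactly as in Eq. (10).

```python
import numpy as np

def update_prior(prior: np.ndarray, tcad: np.ndarray, sampled: int,
                 reward_i: float, neighbors: np.ndarray) -> np.ndarray:
    """Eqs. (11)-(12): set the sampled point to the measured reward and rescale the
    TCAD values of points within one level of it by multiplicand = |reward_i / TCAD_i|."""
    multiplicand = abs(reward_i / tcad[sampled])
    new_prior = prior.copy()
    new_prior[neighbors] = tcad[neighbors] * multiplicand
    new_prior[sampled] = reward_i
    return new_prior

def decay_prior(prior: np.ndarray, i: int) -> np.ndarray:
    """Eq. (13): keep the prior unchanged up to timestep 10, then divide by (i - 10)^2."""
    return prior if i <= 10 else prior / (i - 10) ** 2

# Usage: indices into a flattened state/action table (neighborhood construction assumed)
prior = np.array([-80.0, -55.0, -65.0])
tcad = prior.copy()
prior = update_prior(prior, tcad, sampled=1, reward_i=-60.0, neighbors=np.array([0, 2]))
print(decay_prior(prior, i=12))      # prior weight suppressed by (12 - 10)^2 = 4
```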

Fig. 2
figure 2

BRL/RL model implemented in this work

Results and discussion

Laser annealing analysis

Some samples with the same ion (phosphorus), dose (5 × 1015 cm−2), ion energy (40 keV), and wavelength (532 nm), but different power, rep. rate, and chuck temperature, are shown in Table 2.

Table 2 Result of laser annealing

In Table 2, the sheet resistance clearly increases at lower power because the depth of the melted zone decreases and many defects remain in the silicon. This can be mitigated with a higher holder temperature to compensate for the lack of laser power. From the first two entries of Table 2, the difference between these results is about 40 Ω/sq at the same power and rep. rate, whereas there is almost no variation for the sample annealed at 6 W and 50 kHz. The reason is that this laser power is high enough for the silicon to recrystallize well without additional holder temperature. Another point to note is that the holder temperature cannot be too high: a higher holder temperature lets the dopant diffuse more easily in the silicon, which forfeits the advantage of laser annealing in confining the dopant to the region illuminated by the laser.

The cross-sectional SEM images of the fabricated samples are shown in Fig. 3, where the thicknesses of the deposited layers are labeled. The SEM images confirm the deposited thicknesses of 100 nm polysilicon and 500 nm wet oxide over the p-type silicon wafer. Figure 3a, b show arsenic as the dopant with a dose of 5 × 1015 cm−2 and energies of 25 keV and 55 keV, respectively. Figure 3c, d show phosphorus as the dopant with a dose of 5 × 1015 cm−2 and energies of 25 keV and 55 keV, respectively.

Fig. 3
figure 3

Cross-sectional SEM images of the samples. Arsenic as the dopant with a dose of 5 × 1015 cm−2 and energies of a 25 keV and b 55 keV, respectively, and phosphorus as the dopant with a dose of 5 × 1015 cm−2 and energies of c 25 keV and d 55 keV, respectively

Figure 4 shows the SIMS results. SIMS is a useful tool for analyzing the surface composition and studying the depth profile of the dopants in a multilayer sample, and for depth profiling it can detect dopant concentrations down to the sub-ppm or ppb level. In Fig. 4, the SIMS profile confirms the presence of the As dopant and its diffusion along the depth of the annealed and non-annealed samples. The samples in Fig. 4 were annealed at room temperature using a green laser with a repetition rate of 50 kHz and two different powers of 5.999 and 2.997 W. The SIMS profile also confirms increased diffusion of the As dopant with the higher laser power used in annealing.

Fig. 4
figure 4

SIMS profiles confirm the presence of the arsenic dopant in the samples. The dopant profiles are for samples annealed with a green laser at a 50 kHz repetition rate and powers of 5.999 and 2.997 W

RL and BRL analysis

From Tables 3 and 4, the average minimum Rs of regular RL without a prior is over 50 Ω/sq. Besides, the variance of the result is also significant, i.e., 20.73 in Table 3 and 44.25 in Table 4. This is a serious issue because it indicates that the result depends strongly on the hyperparameter selection and can be much degraded with improper hyperparameters. Due to a lack of prior knowledge of each action, regular RL can only retrieve information by experimentally sampling the environment, which leads to selected states whose rewards are very small, especially when the experimental sampling is insufficient. To circumvent this problem, BRL with a fixed prior is investigated. From the results of BRL with a fixed prior in Tables 3 and 4, the average minimum Rs for the two action types is 47.29 and 44.32 Ω/sq, with an average count before the minimum of only 11.50 and 10.50, where the average is taken over different hyperparameter settings. The reduced count before the minimum is an important benefit in semiconductor process development, where the experimental cost is very high. Additionally, Tables 3 and 4 show that the variance of the minimum Rs values across different hyperparameters is reduced relative to regular RL: the variance of BRL with a fixed prior is 2.29 in Table 3 and 0 in Table 4, compared with 20.73 in Table 3 and 44.25 in Table 4 for regular RL without a prior. Therefore, the burden and risk of hyperparameter selection are alleviated with the assistance of a fixed prior. These advantages show that a fixed-prior BRL can guide the agent to the RL state parameters, i.e., the semiconductor process input parameters, that have a higher potential to attain low sheet resistance. The more effective locating of such states shortens the optimization path and leads to lower Rs.

Table 3 Results in the first 50 timesteps for various hyperparameters and algorithms
Table 4 Results in the first 50 timesteps for various hyperparameters and algorithms. Action is + 1/0/−1 applied to the input parameters in the state

The fixed-prior BRL achieves better results than regular RL, but a drawback remains. Figures 5b, e and 6b, e clearly show that, regardless of the hyperparameter settings, the agent easily gets stuck at the same state because the prior value can exceed the value predicted by the experimental Q-Table model. This makes the agent choose actions based on the prior instead of the experimental Q-Table model, except at timesteps where exploration is activated. Intuitively, as more experimental steps are conducted, the experimental Q-Table becomes more and more reliable. In this scenario, it is desirable to let the effect of the prior decay even if the prior is informative. As a result, the prior decay here is based on the RL timestep, so the effect of the prior is gradually diminished. In addition, the information stored in the prior is also corrected with the true experimental data whenever a state is sampled, improving the TCAD prior.

Fig. 5
figure 5

Path of RL with network = 2 and batch size = 2: a RL w/o prior, b BRL w/ fixed prior, c BRL w/ variable prior; and with network = 5 and batch size = 2: d w/o prior, e w/ fixed prior, f w/ variable prior. The action can point to any state directly

Fig. 6
figure 6

Path of RL with network = 2 and batch size = 2: a RL w/o prior, b BRL w/ fixed prior, c BRL w/ variable prior; and with network = 5 and batch size = 2: d w/o prior, e w/ fixed prior, f w/ variable prior. The action is +1/0/−1 of the level applied to each input parameter in the RL/BRL state

From Tables 3 and 4, the minimum Rs values of BRL with a variable prior for the two action types are 45.19 and 44.32 Ω/sq, respectively, and the counts before the minimum are 14.00 and 4.50, respectively. Compared to fixed-prior BRL, the variable-prior method achieves either a smaller Rs at the expense of a higher step count before the minimum, or a reduced step count before the minimum at the same resultant Rs. This reflects the improvement from using a variable prior that accounts for the gradually improving experimentally sampled Q-Table. Another important advantage is that its variance across different hyperparameters is smaller than that of the other two methods: 0 is achieved, compared with 20.73 in Table 3 and 44.25 in Table 4 for RL without a prior and 2.29 in Table 3 and 0 in Table 4 for BRL with a fixed prior. In conclusion, BRL with a variable prior achieves either lower Rs at the expense of a higher step count or similar Rs with a reduced step count, and a lower variance in the optimal Rs value is observed. One potential risk of using a variable-prior BRL is that the agent can start to sample areas with degraded sheet resistance when the prior has almost vanished, as shown in Fig. 6c. As a result, balancing the decay of the prior against the guidance capability of the informative prior is essential. While we use the decay rate specified in Eq. (13) in this work, meta-learning could be used to control the decay rate adaptively and improve this aspect further. Table 5 shows the results of some conventional optimization methods applied to the same laser annealing parameter tuning. The Taguchi method achieves either a higher Rs or a larger total experiment count compared to the BRL results in Tables 3 and 4. Specifically, 2-level discretization gives Rs = 190.94 Ω/sq, which is too high for practical use, and 4-level discretization gives Rs = 55.27 Ω/sq using 64 total experiments, still higher than the BRL values obtained with 50 total experiments. Scipy differential evolution gives a reasonably low Rs = 55.98 Ω/sq, but the count before the minimum is large. Scipy basin hopping gives Rs = 57.98 Ω/sq, which is also higher than the Rs achieved by BRL and has a large count before the minimum.

Table 5 The results of conventional optimization methods

Discussions

Comparison of two different actions

The results of RL are shown in Tables 3 and 4. The case in which an action points directly to any state is clearly worse than the case of +1/0/−1 actions. The reason is that there are 1024 different actions, and the agent does not know their true rewards before sampling and can only learn them one by one. In addition, once the agent finds an action with a high reward, it is likely to use the same action in the next trial and receive the same reward. This can mislead the agent into regarding this action as the best one, leaving it trapped. Although the agent with +1/0/−1 actions also tends to select an action with a large reward from previous trials, the next state and reward depend on the current state, so the same action can yield a better or worse reward. As a result, this prevents the agent from being trapped in the same state and helps it learn the meaning of each action, as the sketch below illustrates. The disadvantage of this method is that it samples more data points.
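For illustration, a minimal sketch of how a +1/0/−1 action could be applied to the level-encoded state is given here; the uniform level count and the clipping at the level bounds are assumptions, but it makes concrete why the same action yields different next states from different starting states.

```python
import numpy as np

MAX_LEVEL = 4   # a uniform level count is assumed for brevity; in practice each of the
                # seven parameters has its own number of levels

def apply_action(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Apply a +1/0/-1 move to every parameter level and clip to the allowed range
    (clipping at the bounds is an assumption), so the next state depends on both the
    current state and the action."""
    return np.clip(state + action, 1, MAX_LEVEL)

# The same action applied from two different states reaches two different next states
a = np.array([+1, 0, -1, 0, +1, 0, 0])
print(apply_action(np.array([1, 2, 3, 4, 1, 2, 3]), a))   # -> [2 2 2 4 2 2 3]
print(apply_action(np.array([2, 2, 2, 2, 2, 2, 2]), a))   # -> [3 2 1 2 3 2 2]
```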

Bayesian formulation

With the standard form of Bayes' theorem, which multiplies the likelihood and the prior, the agent is excessively influenced by the likelihood, which is non-informative at the beginning, and wastes many trials correcting those points. The advantage of adding the prior and the likelihood is that the prior decay can be adjusted, which helps the agent follow the guidance of the prior appropriately and sample the regions with lower sheet resistance, increasing sampling efficiency. Compared to VPI, which embeds the prior information in the likelihood model, the method applied in this study can easily change the decay of the prior because it does not have to refit a likelihood model with the prior embedded inside at each step. Besides, VPI is computationally expensive because a VPI integration is required for each potential action; in our case, at least 128 different integrations would be needed at each step, which would take a long time to finish one trial.

Fixed and variable prior

In prior work (Hoel et al., 2020), 10 different non-informative prior functions are used, one for each member of an ensemble RL model, and the uncertainty among the priors prevents the agents from being trapped in local minima. In that setting, there is no need to tune the decay of the prior because the priors are non-informative and are trained and varied together with the likelihood in most cases. In semiconductor manufacturing, informative priors are preferred over non-informative priors because of the extremely high fabrication cost. In this study, there may be erroneous prior data due to the inaccuracy of the TCAD simulation, which can leave the agent stuck or cause it to waste trials. For these reasons, balancing the decay of the prior against the information stored in it is the most important issue. In our algorithm, during the first epoch the prior is only modified according to the collected data but not diminished, in order to identify where it underestimates or overestimates. This lets the agent adjust its search direction and avoid wasting trials. The advantage of initiating the prior decay from the second epoch is that the agent is not trapped in later trials and can incorporate more environmental knowledge and sample more efficiently. In addition, it prevents erroneous data from accumulating, according to Eq. (12).

Comparison with other machine learning methods

Compared to supervised learning (SL) and unsupervised learning (USL), RL trains the agent on data collected through interactions with the environment rather than on a fixed dataset. In SL and USL, the prediction accuracy depends heavily on whether the data collection properly spans the entire sample space; if SL were used to optimize the process condition, the data collection would have to extend over all possible choices of input process parameters, leading to inefficient optimization. Active learning (AL) is another method that, like RL, samples additional data from unlabeled pools to search the environment. The major difference is that RL can account for future rewards through the discount factor, which means RL makes decisions based on current and future rewards while AL cannot. In addition, RL can be more flexible by adjusting its reward function according to the specific task instead of labeling data directly. In this regard, RL can select a better path and is a more potent way to optimize parameters.

Conclusion

In this study, we investigate a Bayesian approach to RL to accelerate the optimization of semiconductor process parameter tuning. RL without TCAD prior knowledge requires more trials to achieve low Rs values and shows a large variance across different hyperparameter settings. In contrast, BRL with a fixed prior guides the agent to the region of the search space where the reward is larger, which significantly prevents the agent from wasting trials; the fixed TCAD prior effectively guides the learning process in the laser annealing problem. The variable TCAD prior, in which the effect of the prior is gradually decayed and corrected to reflect the growing body of experimental data, further improves the result so that the agent achieves either lower Rs at the expense of more steps or a similar Rs with a reduced step count. Specifically, using the action with direct state transition, the achieved Rs (and step count before the minimum Rs) for RL, BRL with a fixed prior, and BRL with a variable prior are 53.70 Ω/sq (15.17), 47.29 Ω/sq (11.50), and 45.19 Ω/sq (14.00), respectively. Using the +1/0/−1 action, the achieved Rs (and step count before the minimum Rs) for RL, BRL with a fixed prior, and BRL with a variable prior are 51.84 Ω/sq (18.17), 44.32 Ω/sq (10.50), and 44.32 Ω/sq (4.50), respectively. Based on this study, BRL with an informative prior can be an essential infrastructure for future intelligent semiconductor manufacturing.