Stochastic mixed model sequencing with multiple stations using reinforcement learning and probability quantiles

In this study, we propose a reinforcement learning (RL) approach for minimizing the number of work overload situations in the mixed model sequencing (MMS) problem with stochastic processing times. The learning environment simulates stochastic processing times and penalizes work overloads with negative rewards. To account for the stochastic component of the problem, we implement a state representation that specifies whether work overloads will occur if the processing times are equal to their respective 25%, 50%, and 75% probability quantiles. Thereby, the RL agent is guided toward minimizing the number of overload situations while being provided with statistical information about how fluctuations in processing times affect the solution quality. To the best of our knowledge, this study is the first to consider the stochastic problem variation with a minimization of overload situations.


Introduction
In the mixed model sequencing (MMS) problem, different models need to be sequenced for production on an assembly line that consists of multiple stations.Each station is operated by a human worker that requires specific processing times depending on the model.A work overload occurs if a worker cannot complete a workpiece on time, before it leaves the boundaries of the station.While the deterministic problem is well studied (see e.g., Boysen et al. 2009), little research has been done on the problem variation with stochastic processing times.Processing times may vary due to, e.g., different worker skills, fluctuations in the accuracy of tools, or lower material quality, which can disrupt schedules based on deterministic information.
The few existing studies that have considered MMS problems with stochastic processing times have focused on minimizing the expected work overload time."Work overload time" refers to the amount of work that is required by utility workers to assist at cycles, for which a work overload is foreseeable.The underlying assumption is that utility workers work side-by-side with regular workers, so that the processing speed can be doubled and the cycle is fully processed before it leaves the boundaries of the station.However, large European car manufacturers often apply a different policy, whereby utility workers exclusively process a cycle when a work overload is foreseeable (Boysen et al. 2011).This is due to the fact that it is often not possible to perform certain tasks side-by-side as there may not be enough space for both workers to process a workpiece.Utility workers may also not be flexible enough to instantly assist at the exact right moment that is sufficient to successfully process a cycle.In the alternative policy implemented by European car manufacturers, the utility worker first has to walk to the respective station, which causes setup costs that dominate the costs of utility work (Boysen et al. 2011).Accordingly, the goal of this problem variation is to minimize the number of work overload situations.
To the best of our knowledge, this study is the first to consider the stochastic MMS problem with the objective of minimizing work overload situations.We propose a reinforcement learning (RL) approach that generates the sequence iteratively, where actions denote the model to be sequenced next.The learning environment simulates normally distributed processing times (Dong et al. 2014;Mosadegh et al. 2017Mosadegh et al. , 2020;;Özcan et al. 2011) and penalizes actions that lead to work overloads with negative rewards.To make the RL agent account for stochastic processing times, we define a state representation that contains several variables which indicate if a work overload occurs given that the processing times are equal to their respective 25%, 50%, or 75% quantiles.We find that including more quantiles further reduces the number of overload situations, but the marginal decrease is underproportionate to the number of quantiles, while the time required for policy learning increases linearly.The proposed approach can theoretically be applied to all mixed model assembly lines provided that the distributions of the models' processing times are known.
The remainder of this paper is structured as follows.Section 2 provides an overview of related work.Section 3 formalizes the stochastic MMS problem and describes the characteristics of suitable solutions.Section 4 describes our RL approach by detailing environment, state and action space, reward function, and the method for policy learning.Section 5 presents the setup of our numerical evaluation and Sect.7 presents the results.Finally, Sect.8 describes several directions for how our approach could be extended.

3
Stochastic mixed model sequencing with multiple stations… 2 Related work MMS problems have been studied in multiple variations; for instance, with U-shaped (instead of straight) assembly lines (Li et al. 2012), two-sided assembly lines (Chutima and Naruemitwong 2014), heterogeneous worker types (Aroui et al. 2017;Cortez and Costa 2015), and alternative optimization targets, including worker idle times (Bautista et al. 2016;Mosadegh et al. 2020) or product rate variation (Chutima and Naruemitwong 2014).The MMS problem should be distinguished from car sequencing (Parrello et al. 1986), where work overload is implicitly minimized by satisfying handcrafted sequencing rules.Each rule has the form N o ∶ H o , which stipulates that, among N o sequence positions, a maxi- mum of H o vehicles with option o may occur.Options can denote, for instance, a sun roof, entertainment system, or a specific motor.The more rules are satisfied, the lower the work overload.A literature review on car sequencing studies is provided by Solnon et al. (2008).Compared to car sequencing, MMS provides more flexibility in finding an optimal solution, and, thus, sequences generated with MMS cause less work overload (Golle et al. 2014).However, applying MMS requires car manufacturers to spend significant effort in collecting the processing times for each model and station, but the practicability of MMS in industry has already been demonstrated in a case study (Bautista et al. 2012).
Prior research has studied the pure MMS problem and the integrated balancing and sequencing problem (Agrawal and Tiwari 2008;Boysen et al. 2007;Dong et al. 2014;Özcan et al. 2011), where the tasks to be performed for each model must first be assigned to stations given the precedence relation, such that a given criterion is optimized for a (near-)optimal sequence.Both problems exist since they are relevant in regard to different time horizons.The balancing decision has a mid-term horizon (e.g., several months) as it specifies work content and material usage for each station, while the exact daily demands are generally not known in advance.The sequencing problem instead has a short-term horizon (e.g., one shift) as it specifies the order in which a given demand of models is produced (Boysen et al. 2009).Comprehensive literature reviews of the balancing and sequencing problems, including the considered problem variations, are provided by Boysen et al. (2007) and Boysen et al. (2009).Boysen et al. (2009) have also identified stochastic processing times as a research gap.
So far, only a few studies have relaxed the strict assumption of deterministic processing times toward stochastic processing times.Table 1 provides an overview of existing studies on the stochastic MMS problem.The seminal study by Zhao et al. (2007) proposed an approach based on Markov chains to calculate the expected work overload time.In a nutshell, this approach approximates the current positions of the workers within their stations by dividing the interval of possible positions into several subintervals.For each subinterval, the expected overload time is calculated as the average of those overload times that would occur if the worker was located at the exact lower or upper interval boundaries.Based on the method by Zhao et al. (2007) and Dong et al. (2014) proposed a simulated annealing approach for the stochastic balancing and sequencing problem with U-shaped lines and a minimization of overload time.Mosadegh et al. (2017) developed a method similar to Dijkstra's algorithm that solves the stochastic MMS problem for one stochastic station.In their recent study, Mosadegh et al. (2020) considered the MMS problem with multiple stochastic stations.The authors presented an approach based on simulated annealing, for which the parameters are selected with Q-learning.
Studies on the balancing and sequencing problem with stochastic processing times generally focus on balancing workload.Agrawal and Tiwari (2008) proposed an ant colony algorithm to simultaneously minimize variation of workload and risk of line stoppage.The risk of line stoppage for a given station is calculated as the probability quantile of the z-normalized processing time with mean and variance processing time over all tasks.Özcan et al. (2011) presented a genetic algorithm to minimize the absolute deviation of workloads (ADW), such that ADW is not exceeded according to a given confidence level.
This study proposes a RL approach for the stochastic MMS problem with a minimization of overload situations.To the best of our knowledge, our study is the first to consider the stochastic problem with the objective of minimizing overload situations instead of overload time.

Stochastic mixed model sequencing with a minimization of overload situations
Mixed model assembly lines allow manufacturers to exploit the advantages of flowproduction, while offering a large diversified product portfolio (Boysen et al. 2009).
In mixed model assembly lines, the workers process a cycle while walking within the boundaries of their station.After a workpiece is completed, the worker walks back toward the beginning of the station to process the next cycle.Consequently, the starting position of a cycle depends on the processing time of the previous cycle.If the processing times are deterministic, the approach used for sequence generation can calculate the starting positions of the workers for all stations and cycles exactly and without uncertainty.However, processing times in real-world production may vary due to, e.g., different skills and experience among the workers, inaccurate tools, or faulty materials.Longer processing times increase the probability of work overloads for the current and all subsequent cycles at a given station.Therefore, the sequence must be generated in a way that accounts for stochastic processing times.
We consider the MMS problem with normally distributed processing times as they are a frequent choice in the literature (Dong et al. 2014;Mosadegh et al. 2017Mosadegh et al. , 2020;;Özcan et al. 2011).A normal distribution also reflects our idea that processing times in mixed model assembly lines should be close to their mean with high probability, while deviations from the mean are less likely.At the same time, a large positive deviation from the mean should have the same probability as a large negative deviation with the same absolute value.
In the following, we provide a mathematical problem formulation of the stochastic MMS problem with a minimization of overload situations.Subsequently, we describe the characteristics of suitable solutions.

Problem formulation
The formulation of the stochastic MMS problem with a minimization of overload situations is based on the deterministic problem introduced by Boysen et al. (2011).Table 2 provides an overview of all variables.The problem is formalized below from (1) to (13).The goal (1) is to minimize the total number of work overloads y k,t over all sta- tions k = 1, … , K and sequence positions t = 1, … , T .The sequence is modeled as several binary variables x m,t that equal 1 if model m is sequenced at position t.(2) and (3) ensure that y k,t and x m,t are binary.(4) ensures that each sequence position is filled with exactly one model.(5) ensures that the sequence complies with the demand plan.(6) ensures that the starting positions of the workers at the stations are non-negative.(7) sets the initial start position of all workers to zero and (8) resets the start position  13) ensure that the workers only move within the boundaries of their station.Besides, (13) incorporates the cycle time into the optimization problem, which specifies that a new cycle enters the conveyor belt at every c units of processing time.
If y k,t = 1 , the constraints ( 12) and ( 13) are weakened, so that it becomes easier to find a solution. (1) (2) Subject to: Stochastic mixed model sequencing with multiple stations…

Characteristics of suitable solutions
We now explain what characterizes suitable solutions for the stochastic MMS problem with a minimization of overload situations.If sequences are optimized based on deterministic overloads, the resulting sequences entail tight schedules for the workers.The cycles will be completed when the workers are close the boundaries of the station, which implies that they can easily be violated if the processing times are larger than their mean.A good solution for the stochastic problem instead provides more buffer times to account for variations in the processing times.However, such buffers come at the cost of a greater number of deterministic overloads.
We provide an example to better illustrate this idea.We consider an assembly line that consists of only one station of length 110.Let the demand plan be given as d = [2, 2, 2] and the means of the processing times be given as 1,1 = 95, 1,2 = 105 and 1,3 = 70.The cycle time is set to c = 90 .There are two sequences given as 1 plots the movements of the workers for a) seq 1 and b) seq 2 if the processing times are deterministic and equal to their means.The x-axis denotes the worker position in the interval [0, 110] and the y-axis denotes how much processing time has passed.seq 1 causes no determinis- tic overload, while seq 2 causes one deterministic overload as model m = 2 (with 1,2 = 105) is sequenced twice in a row.In particular, we observe that seq 1 always makes the worker finish a cycle close to the station boundary, while seq 2 leaves more buff- ers.We now assume that the processing time for model m = 1 increases from 95 to  1c and d, seq 1 now leads to two overloads at cycles 2 and 5, while seq 2 still causes only one overload at cycle 2.

Reinforcement learning approach
The goal of RL is to learn a policy (a t |s t ) with parameters that specifies which action a t to perform at state s t (Sutton and Barto 1998).For this purpose, the RL agent interacts with an environment and maximizes the discounted sum of rewards r t over the learning episode.In this study, one episode corresponds to the generation of a complete sequence of one problem instance.The discrete time t reflects the current sequence position t = 1, … , T .The RL agent is trained to create the sequence incrementally.At each sequence position t, the agent evaluates the current state s t and decides on the model m ∈ {1, … , M} to sequence next.Accordingly, the action space A is given by the set of models

Environment and state representation
The environment simulates the production process, including movements of workers and handling of work overloads.At each sequence position t, the environment provides the agent with the current state s t .After the agent decided on its next action a t , the environment simulates this action, updates the current state, and emits the reward r t to the agent.This continues until the learning episode is finished.After this, the environment is reset and another learning episode starts.The learning process is completed when the policy has converged to a stable state.
The generated sequence must comply with the demand plan [d1 , … , d M ] , i.e., the number of cycles where model m is produced must be equal to d m .To guide the agent toward generating valid sequences, we include the current remaining quantities d 1 t , … , d M t into the state representation s t .If the agent decides to sequence model m in state s t , then d m t is decreased by one in the next state s t+1 .When the sequence is empty at t = 1 , the remaining quantities are equal to the demand plan: The state representation also provides the agent with information about whether or not work overloads will occur if the processing times are equal to particular probability quantiles.For this purpose, we define a deterministic function o m t (q) ∶ [0, 1] → {0, 1} that equals 1 if sequencing model m at sequence posi- tion t leads to at least one work overload y k,t given that the processing times of all stations are equal to their respective q-quantile with q ∈ [0, 1]. 1 Consider the ( 14) 1 3 Stochastic mixed model sequencing with multiple stations… following example.Let the worker of a station k with length l k = 120 be at position w k,t = 30 , and the processing time of model m be distributed as p k,m ∼ N(90, 10) .The worker has l k − w k,t = 120 − 30 = 90 units of processing time left to process cycle t consisting of model m.The worker will process the cycle in time if the processing time is less than or equal to the mean which denotes the 50% quantile: o m t (q) = 0, ∀ q ≤ 0.50 .If the processing time is greater than the mean, a work over- load will occur as there is not enough time to fully process the cycle, which implies o m t (q) = 1, ∀ q > 0.50.
Based on the function o m t (q) , we implement and evaluate two environments RL sto and RL det as illustrated in Fig. 3.Both environments include the remaining quantities d 1 t , … , d M t in the state representation, but they differ in the reward function and the number of probability quantiles for which o m t (q) is provided.The stochastic environment RL sto is illustrated in Fig. 3a.The reward is calculated based on stochastic processing times p k,m that are drawn from their respective distribution N( k,m , k,m ) .The state representation of RL sto contains o m t (q) for the 25%, 50%, and 75% quantiles.
It should be noted that the state representation does not directly depend on the number of stations on the assembly line.Accordingly, a trained policy can also be applied to similar production systems with additional stations (we later provide a corresponding analysis to assess how this affects the performance).
Figure 2 shows the probability density function of a normal distribution with = 90 and = 10 .The 50% quantile is equal to the mean, whereas the 25% and 75% quantiles are equal to 83.25 and 96.75, respectively.We later present an analysis, where we study the effects of including o m t (q) for more than three quantiles.The results show that this additional state information further reduces work overload situations, but the marginal benefit decreases with the number of quantiles.
As a baseline, we also implement a purely deterministic environment RL det as shown in Fig. 3b.In RL det , the reward is calculated deterministically based on the mean processing times.Accordingly, the state representation of RL det only contains o m t (q) for the 50% quantile.

Reward
The reward signal should guide the RL agent toward generating sequences that minimize the number of work overloads.It thus seems intuitive to reward the agent with the negative sum of work overloads at the end of each learning episode.However, this implies that the reward had to be discounted over hundreds of actions, which results in a less efficient learning process (Sutton and Barto 1998).Instead, we provide an immediate reward as the negative sum of work overloads y k,t that are caused by cycle t.
The agent may decide to sequence a model m, for which the remaining quantity to be produced is zero ( d m t = 0 ).In this case, the action a t = m is invalid and the agent is punished with a negative reward of −10 .This value was determined based on a pre-study.The results of the pre-study can be found in Appendix 2 of the supplementary material.Handling invalid actions during the training process is still a challenging problem in RL (Zahavy et al. 2018).The action space cannot be altered as ( 15) 1 3 Stochastic mixed model sequencing with multiple stations… this would affect the structure of the neural network.For the purpose of this study, we handle invalid actions by returning a negative reward of −10 and the same state s t+1 = s t .This will implicitly make the agent avoid invalid actions, yet a full preven- tion is not guaranteed.An exemplary plot about the number of invalid actions over the learning process is provided in Appendix 3 of the supplementary material.Altogether, the reward r t for action a t = m is defined as During real-world application, we can always ensure that the generated sequences are valid.If the agent attempts to sequence a model m with d m t = 0 , we simply choose the next best valid action according to the policy (a t |s t ) .Invalid actions thus only affect the learning process, but they do not limit the applicability of our approach in real-world production.

Policy learning
RL alternates between generating trajectories (s 1 , a 1 , r 1 ), … , (s T , a T , r T ) in terms of state-action-reward tuples with the current policy (a t |s t ) , and updating the policy parameters based on the generated data.
In this study, we implement proximal policy optimization (PPO, Schulman et al. 2017) for policy learning.PPO is a policy gradient method (Williams 1992), where the policy is learned with a neural network, so that the weights of the network denote the policy parameters .The neural network receives a state s t as input and outputs a stochastic vector of size |A|.(a t |s t ) hence equals the probability that the agent will perform action a t in state s t .During policy learning, actions leading to higher rewards will be assigned higher probabilities, whereas actions leading to lower rewards will be assigned lower probabilities.PPO is easy to implement and tune (Schulman et al. 2017) and is considered a state-of-the-art policy gradient method for RL (Zheng et al. 2018).
We briefly describe the functionality of PPO.All explanations are based on Schulman et al. (2017).Algorithm 1 provides the pseudocode of policy learning with PPO.Besides the action probabilities (a t |s t ) , the policy network also updates a value estimate V (s t ) of state s t .This value denotes the expected reward that the agent will receive from s t to the end of the learning episode.Given a trajectory (s 1 , a 1 , r 1 ), … , (s T , a T , r T ) , PPO first calculates R t = ∑ T t � =t t � −t r t � as the sum of dis- counted rewards that the agent will receive between s t and s T , where denotes the discount parameter (set to 0.99).In addition, PPO calculates the estimated advantage of performing a t in s t as Ât = R t − V  (s t ) .The rationale of the advantage estimate is that PPO aims at assigning higher probabilities to actions that lead to higher rewards than the current estimate V (s t ) .Finally, PPO updates its policy by maximizing the loss function L( ).
The loss function L( ) consists of three terms.The first term L CLIP t ( ) ensures that the policy parameters will be updated such that actions with positive advantage Â are assigned higher probabilities, and vice versa.However, PPO limits the extent of policy updates by clipping the ratio between new and old probability The second term L VF t ( ) = (R t − V (s t )) 2 denotes the error in predicting the value of state s t .Including this term with a negative sign ensures that the update of also improves the estimate V (s t ).
The third term L H t (s t ) denotes the entropy (Shannon 1948) of the policy in state (a t �s t ) log 2 (a t �s t ) .Higher entropy values indicate that the probability distribution preserves randomness in the sense that the agent can still explore random actions instead of solely relying on the best action according to the current policy.The parameters c 1 (set to 0.50) and c 2 (set to 0.01) indicate the weight of the corresponding loss terms.All parameters except the number of time steps are set to their default values as stated in Appendix 1 of the supplementary material.We implement our RL approach in Python 3.6.8using the PPO implementation from the RL framework "Stable Baselines" in version 2.7.0. (Schulman et al. 2017) We train the policies for 50 million time steps, where one time step corresponds to one action.This results in a total of approximately 500,000 learning episodes.Figure 4 shows the learning curves as the mean overloads per episode for both RL approaches RL sto and RL det on a problem instance with sequence length T = 100 .As a baseline, the dashed lines denote the number of overloads for a simple greedy heuristic.Both policies converge to a stable state.After approximately 100,000 learning episodes, RL sto and RL det outperform the greedy heuristic (Boysen et al. 2011).As expected, the number of overloads per episode is smaller in the deterministic than in the stochastic environment.Recall that in the deterministic environment, the processing times always follow the mean, whereas processing times in the stochastic environment are drawn from their respective normal distributions.
Stochastic mixed model sequencing with multiple stations… 5 Numerical evaluation

Dataset
We use the dataset created by Boysen et al. (2011) that contains 216 problem layouts.The layouts can be divided into small and large layouts as shown in Table 3.The numbers of models and stations range from 5 to 30, while the sequence length varies between 15 and 300.The station length is either constant, or drawn from the uniform distributions U Combining all settings from Table 3 yields 216 unique problem layouts with fixed deterministic processing times.For each layout, we generate 10 demand plans based on a multinomial distribution with T trials and equal probability 1 M for each model.This results in a total of 2160 deterministic problems.2

Evaluation procedure
We train one RL policy for each layout, which results in a total of 216 policies.Training one policy requires approximately ten hours on our server infrastructure that consists of two Intel(R) Xeon(R) Gold 6150 CPUs and 755 gigabytes of main memory.The time needed for policy learning is neglected as this can be done in advance for a fixed production system.To assess the performance of our approach on unseen problems, we perform three further analyses with variations in the For the evaluation, we consider normally distributed processing times (Dong et al. 2014;Mosadegh et al. 2017Mosadegh et al. , 2020;;Özcan et al. 2011) with means k,m and joint standard deviation .
The means k,m are set to the deterministic processing times of the instances in our dataset.We require that 0 ≤ p k,m ≤ l k as negative processing times are not possible and there are no buffer areas between two adjacent stations.If p k,m > l k , it would be impossible to process the affected cycles within the boundaries of the station.The standard deviation is set to 10 in our main analysis, but we later also evaluate ∈ {5.0, 7.5, 12.5, 15} .To calculate the number of stochastic overloads, we evalu- ate 100 stochastic variations of each (deterministic) problem.In each variation, we sample the processing times from their respective distributions.
Algorithm 2 describes the evaluation procedure.We iterate over all competing approaches and problem instances.Each approach is provided a fixed cutoff time of 300 seconds to search for a solution (Boysen et al. 2011;Joly and Frein 2008;Lopes et al. 2020;Miltenburg 2002).The generated sequence is then evaluated based on 100 variations with random processing times.For this purpose, it is crucial that the sampled processing times are the same for all sequences.More precisely, if two sequences have the same model m at sequence position t, then the processing times of this model at all stations must be the same for both sequences.We therefore set a random seed (line 6) before evaluating each deterministic instance based on 100 stochastic variations.Work overloads are calculated based on the sampled processing times (lines 14, 15).The result of each stochastic variation j = 1, … , 100 is stored as a tuple (i, approach, j, O) (line 16); all results are finally aggregated according to the performance metrics (line 17).
To calculate the number of overload situations, we assume that the processing times are known before the actual working steps.The number of work overloads is then calculated using the method for the deterministic problem by Boysen et al. (2011).Once a work overload is foreseeable for cycle t, the worker of the affected station walks directly to the beginning of their station to process cycle t + 1 .The alternative would be to assume a probability threshold until which the workers still 1 3 Stochastic mixed model sequencing with multiple stations… try to process the cycle while hoping to be successful in time.However, if a work overload is signaled too late, the utility worker may not have enough time remaining to process the cycle.We simulate 100 stochastic variations to ensure that the sample distribution accurately reflects the population distribution.The sample means μk,m and the sample standard deviation σ usually vary from the population means k,m and the popula- tion standard deviation .Therefore, we calculate the 95% confidence intervals of sample mean and standard deviation for one station based on T ∈ 15, 100, 300 and = 10 .The confidence intervals of the sample mean equal μk,m ± 5.06 for T = 15 , μk,m ± 1.96 for T = 100 , and μk,m ± 1.13 for T = 300 .When evaluating 100 sto- chastic variations (i.e., 99 degrees of freedom), the confidence intervals change to μk,m ± 0.51 for T = 15 , μk,m ± 0.12 for T = 100 , and μk,m ± 0.11 for T = 300 .In addition, we calculate the 95% confidence interval of the sample standard deviation σ .For one stochastic variation, the confidence intervals are [0.73,1.58] × for T = 15 , [0.89, 1.16] × for T = 100 , and [0.93, 1.09] × for T = 300 .Again, when evaluating 100 stochastic variations, the confidence intervals narrow down to [0.97, 1.04] × for T = 15 , [0.99, 1.01] × for T = 100 and [0.99, 1.01] × for T = 300 .Altogether, when performing 100 variations, sample mean and sample standard deviation deviate less than one unit of processing time from population mean and population standard deviation.

Competing
We our approach against several competing approaches from the literature as shown in Table 4. First, we implement several stochastic approaches, including the hyper simulated annealing (HSA) approach by Mosadegh et al. (2020), which was developed for the stochastic MMS problem with a minimization of work overload time, and a multiple scenario approach [MSA, (Bent and Van Hentenryck 2004)].Second, we evaluate several deterministic approaches based on the mean processing times.This includes a genetic algorithm, simulated annealing (Aroui et al. 2017), local search (Cortez and Costa 2015), tabu search (Boysen et al. 2011), and Gurobi version 8.0.1.
However, simply applying deterministic approaches to stochastic problems is likely to result in poor performance (Pinedo and Weiss 1987).We therefore modify the deterministic problems by artificially increasing the mean processing times with ∈ {0, 0.25, 0.50, 0.75, 1} .This should yield sequences with more determin- istic overloads, but with less stochastic overloads situations as the increased processing times cause larger buffer times between two subsequent cycles.For each approach, we report the results of the value that performed best.Accordingly, these approaches have a small advantage in the sense that we perform an ex-post selection of parameters.
In the following, we briefly explain each competing approach.

Greedy heuristic
Several competing approaches (HSA, TS, LS, GA, SA) use the greedy heuristic (Boysen et al. 2011) to create an initial sequence.The greedy heuristic generates the sequence incrementally.At each sequence position, the heuristic chooses the next model based on the number of work overloads that will occur if this model is sequenced next according to the deterministic processing times.Ties are broken according to (1) the maximum sum of processing times over all stations, (2) the maximum processing time at any station, and (3) the lowest index m.

Hyper simulated annealing
We implement hyper simulated annealing (HSA) according to Mosadegh et al. (2020).HSA was originally designed to minimize the expected work overload time, which does not necessarily also minimize the number of overload situations.However, we still implement this approach as it provides a state-of-the-art solution for mixed model assembly lines with stochastic processing times.The idea of HSA is to combine simulated annealing (SA) with Q-learning.The metaheuristic improves an initial sequence through a series of mutations.At each sequence position, one out of 16 actions is either selected randomly (with probability ) or based on the current Q-learning policy.In this context, an action defines the next three mutation operators.There is a total of six mutation operators, which range from simple swaps (e.g., change models at positions (i, j)) to complex crossover operators (Kim et al. 1996).
After the sequence is updated, the approach updates the Q-table and lowers the temperature of the SA.A new sequence is accepted if it outperforms the old sequence in terms of the performance metric, i.e., expected overload time.In addition, an equal or worse sequence can still be accepted with a certain probability that depends on the temperature.We implement HSA with two different fitness criteria.The first variation ( HSA T ) minimizes the expected overload time, so that the approach is implemented like in the study by Mosadegh et al. (2020).The second variation (HSA) is based on the number of deterministic overload situations.

Multiple scenario approach
We implement a multiple scenario approach (MSA) according to Bent and Van Hentenryck (2004).The idea of MSA is to solve N deterministic instances individually within the cutoff time and aggregate the resulting sequences with a consensus function.For this purpose, we create a matrix C ∈ ℕ M×M that stores how often model i is fol- lowed immediately by model j over all sequences seq n , n = 1, … , N .Let I(condition) denote the indicator function that equals 1 if condition is true, and 0 otherwise.The matrix C is defined as The first sequence position is chosen randomly.All other positions are determined based on the consensus function f(t).Ties are broken according to the lower model index m.We implement MSA with N = 10 based on four variations.We evaluate Gurobi or SA as underlying method to generate sequences for each problem instance.In addition, we introduce a flag that indicates whether we allow for one model to be sequenced at two subsequent sequence positions.If the flag is active, the sequences may exhibit blocks of the same model.Otherwise, the sequences become more diverse as two subsequent positions must be different.In a pre-study, we find that MSA based on SA and the flag active performs best.The results of the pre-study are provided in Appendix 4 of the supplementary material.

Simulated annealing
We implement simulated annealing according to Aroui et al. (2017).In each iteration, the metaheuristic modifies the current sequence with a random flip, swap, or slide operation, each of which is performed with equal probability.If these operations decrease the number of overloads, the modified sequence is accepted as the new best sequence.
Otherwise, the sequence still has a certain chance to be accepted as the new solution, which depends on the current temperature.With each iteration, the temperature "cools down" exponentially according to a fixed cooling factor.Therefore, the metaheuristic can explore random solutions which appear worse during early iterations but reach a better overall solution during later iterations.

Genetic algorithm
We implement a genetic algorithm according to Aroui et al. (2017).We initialize the start population with the initial sequence and nine randomly generated sequences.
In each iteration, the metaheuristic improves the population by applying crossover and mutation operations to the sequences.An elitism procedure sorts the current population in ascending order according to the number of overloads.Subsequently, it copies the sequence with the fewest overloads to the last five positions of the population.Then, the algorithm performs crossover operations on the first four positions of the population to generate four child sequences based on two random pairs of parents.The child sequences are usually invalid in the sense that they no longer comply with the demand plan.The metaheuristic, therefore, employs a repair function that fixes the child sequences.If a child yields fewer overloads than a parent, the parent is replaced by the child.

Local search
We implement the local search metaheuristic according to Cortez and Costa (2015).The idea of local search is to performs swaps around sequence positions that lead to a high number of work overloads.In each iteration, the metaheuristic randomly selects a sequence position based on a probability distribution that weights the probabilities according to the number of overloads.Subsequently, local search iterates over the sequence positions and evaluates swapping sequence positions (i, j) that contain different models.If a swap yields an improved sequence, the improved 1 3 Stochastic mixed model sequencing with multiple stations… sequence becomes the new best sequence, the probability distribution is updated, and the procedure is repeated.Otherwise, the swap is undone.

Tabu search
We implement the tabu search metaheuristic according to Boysen et al. (2011).
In each iteration, the metaheuristic evaluates the neighborhood of the current best sequence, which refers to the set of all swaps (i, j) of different models.For each neighbor, the number of overloads is calculated and the neighbor with the lowest number of overloads becomes the new best sequence.To prevent back-and-forth swaps, the swap that created the next best sequence is added to a tabu list.The tabu list is defined as a first-in-first-out (FIFO) queue of capacity (T/16): when the capacity is exceeded, the oldest entry is removed.

Gurobi
Ultimately, we implement the mixed-integer linear programming solver Gurobi in version 8.0.1 based on the problem formulation from Sect.3.1 but without the constraints ( 9) and ( 10).Instead, the processing times p k,m are set to their mean values k,m or k,m + to artificially increase the processing times.Gurobi is implemented using the Python package "gurobipy" (version 8.0.1).All parameters except the cutoff time are set to their default values.

Performance metrics
We provide the results based on three performance metrics.First, and as our primary metric, we provide the mean number of stochastic work overload situations over all problem instances.Let P denote the number of problem instances and O sto j,a the number of stochastic overloads for approach a on problem j (recall that O sto j,a is calculated as the average over 100 stochastic variations of processing times).Our primary metric is defined as Second, we include the number of deterministic overloads that would occur if the processing times were deterministic and equal to the mean of their respective distribution.Let O det j,a denote the number of work overloads according to the mean processing times for approach a on problem instance j.The mean number of deterministic overloads is given as ( 24) (25) Third, we provide the average expected overload time over all instances as calculated with the method by Zhao et al. (2007).This allows us to assess the robustness of our results in regard to how minimizing the number of overload situations influences the expected overload time.Let O T j,a denote the expected overload time for approach a on instance j.The average overload time is given as We scale the result by 100 in order to facilitate the presentation of results.
In addition, we provide several metrics (mean deviation to best, percentage of best solutions, and calculation time) for informational purposes, see Appendix 5 of the supplementary material.

Results
We now describe the results of our numerical evaluation.We start with our main analysis, after which we provide the results of three analyses based on unseen problems with variations in the standard deviation of processing times, additional stations on the assembly line, and problems with additional models. 3Finally, we analyze the trade-off between the performance of our approach and the required time for policy learning when using more probability quantiles for the state representation.

Main analysis
We first analyze the performance of all approaches on a total of 2160 stochastic problem instances.Figure 5 presents the results for small, large, and all instances.The black bars denote the number of stochastic overloads, the white bars denote the number of deterministic overloads, and the gray bars denote the expected overload time (Zhao et al. 2007).The numerical results are provided in Appendix 6 of the supplementary material.The optimal values are indicated below the corresponding approaches.
Figure 5 shows that our approach as trained in the stochastic environment ( RL sto ) is superior to other methods in regard to the number of stochastic over- loads.Compared to the best competing approach SA, the relative reduction of stochastic overloads is 7.32−6.587.32   ≈ 10.11% on the small instances, 117.57−109.11117.57≈ 7.20% on the large instances, and 62.44−57.8462.44   ≈ 7.37% on all instances.Training the RL policy based on the deterministic problems also performs well, but RL det is out- performed by LS, GA, and SA on all instances.1 3 Stochastic mixed model sequencing with multiple stations… The number of stochastic overloads is always significantly greater than the number of deterministic overloads.All t-tests are statistically significant with p < 0.001 .Regarding the approaches that solve modified deterministic problems with artificially increased processing times, we find a U-shaped relation between and the solution quality.That is, small increases in deterministic processing time have a positive impact on the number of stochastic overloads, while large increases have a detrimental effect (see Appendix 7 of the supplementary material).
We also consider the expected overload time to assess the amount of utility work that is required to resolve work overloads.HSA T should perform best as it is the only approach that specifically minimizes the expected overload times.However, this only holds on the small instances, whereas HSA T achieves larger over- load time than RL sto on the large instances, and when considering all instances.One possible explanation is that Mosadegh et al. (2020) evaluated their approach based on comparably smaller problems with up to six models.
So far, we have only considered the expected values.As the results are aggregated over multiple instances of different size, we cannot directly assess the variation of the stochastic performance metrics.We now consider the variation of stochastic overloads and overload time for a fixed medium-sized problem instance with 100 cycles, 20 stations, and 20 models.Figure 6 presents the boxplots for a) stochastic overloads and b) overload time over 1000 samples of processing times for RL sto and the best competing approach SA with = 0.25 .Compared to SA, RL sto achieves a lower average and a better worst-case scenario in the number of stochastic overloads.Regarding overload time, the variation is similar but the worst-case slightly favors SA.
In summary, our main analysis indicates that RL sto is able to minimize the number of overload situations, while keeping the expected overload time similar to competing approaches.

Variation of standard deviation
Next, we study variations in the standard deviations of stochastic processing times.We assume that the variations occur during actual production and they are not known in advance.For the evaluation, we evaluate all approaches based on the same sequences as generated in the main analysis, but we sample the processing times based on the altered normal distribution.We specifically do not retrain the RL policies to ensure that this analysis is based on unseen probability distributions.
Figure 7 presents the results for ∈ {5.0, 7.5, 12.5, 15.0} .The numerical results are provided in Appendix 6 of the supplementary material.As expected, stochastic overloads and expected overload time increase for greater standard deviations.For = 5 , RL sto is outperformed by LS, GA, and, in particular, SA, which per- forms best.The relative difference of SA over RL sto is 38.52−36.3338.52   ≈ 5.69% .However, it should be noted that the evaluation with = 5 is also the closest to determin- istic processing times.For all analyses with  > 5 , RL sto is superior to the com- peting approaches.The relative improvements of RL sto over the best competing methods are 48.89−47.0448.89 ≈ 3.78% for = 7.5 , 75.79−70.4475.79 ≈ 7.06% for = 12.5 , and 89.73−84.2789.73 ≈ 6.08% for = 15.

Additional stations
We now perform another analysis, where we evaluate the performance of our approach in regard to unseen stations on the assembly line.For this purpose, we modify the problem instances to contain up to five additional stations.The processing times of the additional stations have similar processing times to the existing stations (see Sect. 5.1).All additional stations are added at the end of the assembly line without loss of generality.Due to computational constraints, we limit this analysis to a subset of 216 problem instances.Again, we do not retrain the RL policies to ensure that the problems are truly unseen but we adjust the implementation of o m t (q) accord- ing to the new environment to apply the RL policies.All competing approaches are directly applied to the modified problems.

3
Stochastic mixed model sequencing with multiple stations… Figure 8 presents the results of RL sto against the best competing approach SA (with = 0.25 ).The numerical values including all approaches are provided in Appendix 8 of the supplementary material.Evidently, the outperformance of RL sto persists when the assembly line contains up to three additional stations.However, RL sto is outperformed by SA on problems with more than three additional stations.

Additional models
We also analyze the sensitivity of our approach toward additional but unseen models.For this purpose, we generate new problem instances with up to five new models, which have similar processing times to those of the existing models.Again, all competing approaches are applied to the new problem instances, but we do not retrain the RL policies to ensure that the models are truly unseen.To apply the trained RL policy to unseen models, we simplify the new problem to a known problem, where the demand plan solely contains known models.For each new model m * , the function sim(m * ) returns the known model with the lowest Manhatten distance between the vectors of processing times.Ties are broken according to the lower index m.
When generating the sequence, we have to decide between sequencing the new model m * or the known model sim(m * ) whenever the policy wants to sequence sim(m * ) .If d m * = d m , we alternate between sequencing m * and sim(m * ) , starting with m * .If d m * > d m , we first sequence the new models until d m * t = d m t ; then we alternate between m * and sim(m * ) .And, if d m * < d m , we first alternate between both models until d m * t = 0 ; then we sequence the remaining known models.The results of RL sto and SA (with = 0.25 ) are shown in Fig. 9.The numerical values are provided in Appendix 9 of the supplementary material.We find that RL sto is superior to SA for up to two additional models.

Additional state information
Ultimately, we analyze how the performance of our approach changes if we extend the state representation by including more probability quantiles of processing times.
Recall that the function o m t (q) → {0, 1} indicates if scheduling model m at sequence position t will result in a work overload when the processing times of all stations are equal to their q-quantile.In the current implementation, we included o m t (q) for q ∈ {0.25, 0.50, 0.75} .However, it seems interesting to analyze how performance and policy learning time change if we include o m t (q) for more than three quantiles.Therefore, we retrain the policy in a modified environment, where the state representation contains o m t (q) for 0, 1, 3, 5, 7, or 9 quantiles.The respective quantiles are determined through an even split over the interval [0, 1], while ensuring that the 0.50-quantile (i.e., the mean) is always included.For instance, the environment with five quantiles calculates o m t (q) for q ∈ {0.16, 0.33, 0.50, 0.66, 0.83} .Due to compu- tational constraints, we limit this analysis to a single instance with sequence length 100, 20 stations, and 20 models.
Figure 10a plots the stochastic overloads for RL sto depending on the number of quantiles.The performance of SA is also indicated as a baseline.As expected, the number of overloads decreases with the number of quantiles, but the marginal reduction of overloads becomes smaller the more quantiles are included.At the same time, increasing the dimensionality of the state representation increases the time that is required for training the policy.Figure 10b shows that the time for policy learning increases linearly with the number of quantiles.
Table 5 presents the numerical values of Fig. 10a and b.We observe that the improvement from our current implementation RL sto 3 to RL sto 5 is less than one work overload.Hence, we argue that including o m t (q) for three probability quantiles provides a reasonable trade-off between performance and the time required for policy learning.

Extensibility
The main idea behind our approach is to incorporate deterministic information based on several probability quantiles into the state of an RL approach to solve the stochastic MMS problem.There appear to be several ways that this approach could be extended to similar stochastic problems.First, our method could be generalized to stochastic MMS problems with different probability distributions (e.g., uniform).For this purpose, the function o m t (q) must be defined with respect to the quantiles of the respective distribution.It also seems straightforward to consider other MMS problem variations, e.g., with U-shaped assembly lines or different worker types.Second, our approach can be applied to stochastic MMS problems with other objectives like minimize overload or idle time.For instance, to minimize the expected work overload time, the reward function needs to be changed to the negative sum of work overload times that occurred at cycle t.
Third, the stochastic MMS problem can be considered through the lens of robust optimization (e.g., Ben-Tal et al. 2009;Gabrel et al. 2014).Manufacturers may, for instance, only be able to handle a given number of overloads simultaneously or a maximum number of overloads during one shift.However, such constraints are generally not considered in the literature as they strongly depend on company-specific risk preferences regarding feasibility of solutions.It thus seems interesting to extend the problem formulation with additional constraints and combine our approach with methods from safe reinforcement learning (see e.g., Garcıa and Fernández 2015).
Fourth, our approach may also be extended to other stochastic scheduling problems.The main challenge is to find a meaningful output for the function o m t (q) that provides the crucial information for the state.For instance, for stochastic job-shop or flow-shop problems, o m t (q) could be defined in regard to the number of machines that will become idle if job m is sequenced next given that all processing times are equal to their q-quantile.
Funding Open Access funding enabled and organized by Projekt DEAL.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.To view a copy of this licence, visit http:// creat iveco mmons.org/ licen ses/ by/4.0/.

( a )Fig. 1
Fig. 1 Worker movements.The x-axis denotes the position of the worker with respect to the processing time on the y-axis

Fig. 2
Fig.2Probability density function of a normal distribution N(90, 10) .The quantiles q 0.25 , q 0.50 , and q 0.75 are highlighted [85, 125]  or U[85, 145] .The cycle time c is fixed to 90, which means that a new cycle enters the assembly line at every 90 units of processing time.The deterministic processing times of the stations were created in two steps.First, an average processing time p m of model m is randomly drawn from the interval [0.75 ⋅ c, c] .Second, p m was used to define the uniform interval from which to draw the deterministic processing time of each station U[0.5 ⋅ p m , min{l k , 1.5 ⋅ p m }].

( a )Fig. 4
Fig. 4 Learning curves for both reinforcement learning environments

Fig. 6
Fig. 6 Variation of stochastic overloads and overload time Fig. 7 Analysis with different standard deviations

Fig. 8
Fig. 8 Analysis with additional stations

Table 1
Existing studies on mixed model sequencing problems with stochastic processing times

Table 2
Variables of problem definition and standard deviation k,m .(10) enforces that the sampled processing times are non-negative and less or equal to the station length l k .The con- straint p k,m ≤ l k is necessary since we assume closed stations, which implies that it must be possible to process each cycle within the boundaries of the respective station.

Table 3
Descriptives about problem layouts