Applying reinforcement learning to plan manufacturing material handling

Applying machine learning methods to improve the efficiency of complex manufacturing processes, such as material handling, can be challenging. The interconnectedness of the multiple components that make up real-world manufacturing processes and the typically very large number of variables required to specify procedures and plans within them combine to make it very difficult to map the details of such processes to a formal mathematical representation suitable for conventional optimization methods. Instead, in this work reinforcement learning was applied to produce increasingly efficient plans for material handling in representative manufacturing facilities. Doing so included defining a formal representation of a realistically complex material handling plan, specifying a set of suitable plan change operators as reinforcement learning actions, implementing a simulation-based multi-objective reward function that considers multiple components of material handling costs, and abstracting the many possible material handling plans into a state set small enough to enable reinforcement learning. Experimentation with multiple material handling plans on two separate factory layouts indicated that reinforcement learning could consistently reduce the cost of material handling. This work demonstrates the applicability of reinforcement learning with a multi-objective reward function to realistically complex material handling processes.


Introduction and motivation
Current trends in advanced manufacturing, such as "make to order" fabrication and short lifespan products, require manufacturers to be flexible and responsive in order to remain competitive.In a review on manufacturing flexibility, Sethi defines flexibility as "… adaptability to a wide range of possible environments that it may encounter" [1].Of the different types and forms of manufacturing flexibility, the focus of this work is on material handling flexibility [2,3].Material handling (MH) is the activity of transporting materials from place to place within a manufacturing facility.The cost of MH is significant and non-value-added (i.e., the end product's value does not benefit from the MH component of the manufacturing expenses), thus MH managers strive to plan and conduct MH so as to enable smooth and efficient production while minimizing MH cost.The goal of this research was to facilitate fast and effective changes to MH plans in response to external uncertainties such as change in demand [4]. 1 3 Unfortunately, MH plans for realistic factories can be and often are very complicated [5].Representing an MH plan in a manner sufficiently formal for conventional mathematical optimization requires extensive definitions, notations, and assumptions (as detailed in Sect.3).Reducing the complexity of MH plans to the point that mathematical optimization methods could be applied to produce theoretically optimum MH plans appears to be an unsolved problem.
Therefore, in order to produce fast and effective changes to MH plans reinforcement learning (RL) was applied instead of mathematical optimization.Starting from simple initial MH plans, the RL process produced a sequence of iteratively modified plans.Those plans were tested, and their costs estimated, using an abstract discrete event simulation of a factory.The cost of the MH plans calculated by the simulation was used as the reward in the RL algorithm.After each plan was generated and tested, the RL algorithm revised it using one of a set of MH-specific plan change operators.Based on the rewards, the RL algorithm learned which of the operators to apply in different situations to improve a material handling plan over the iterations of the learning process.This process was performed for two separate factory layouts and MH scenarios.
The research questions examined in this work were: • How can MH in a realistically complex factory environment be mapped to the structure of an RL process?
• How should MH routes be generated and how can they be improved?
• What is a suitable RL reward function when applied to MH?
• How should the multiple components of the MH costs be included in a multi-objective reward function?
• How can an MH plan be changed in response to a reward to produce a better plan?
• Can the RL algorithm learn a policy to improve the MH plan?
A commercial office furniture manufacturing and assembly plant located in Athens, Alabama USA was carefully studied and served as a representative example of manufacturing and MH for this work.
The remainder of this paper is organized as follows. 1 Section 2 presents background concepts and reviews applicable prior work on the application of RL to MH. Section 3 defines the problem and presents a formal representation of an MH plan suitable for processing by an RL algorithm.Section 4 discusses the learning process and reward calculation.Section 5 reports and analyzes the experimental results for the first of the two factory layouts, and Sect.6 does the same for the second factory layout.Section 7 presents the conclusions of the work and Sect.8 lists recommendations for future work.

Background
This section provides background information on the two main fields of this research, MH and RL, and summarizes selected relevant research literature applicable to MH planning.

Material handling
MH is the activity of transporting materials within a manufacturing facility, e.g., from a storage area to a workstation, or from a workstation to a loading dock, or between a building and a vehicle [10].MH often plays a significant role in the smooth and timely production of finished products and is important in industrial competitiveness [11].Effective MH not only contributes to an efficient production environment but also implicitly supports planning, forecasting, resource allocation, and inventory management.Within a single manufacturing facility, MH may use a variety of manual, semiautomated, and automated equipment that supports the manufacturing process and interfaces with supply chain operations.An MH plan details the quantities and types of material to be transported, where the materials are transported from and to, and how the materials are transported [12].Figure 1 shows examples of two types of transport equipment used for material handling within manufacturing facilities: a tow tractor, also known as a tugger, and a forklift.(Both of these types of equipment are represented in the experimentation that is reported in Sects.5 and 6.) The potential for changes in the market demand a manufacturer's products motivates the need for MH flexibility.A change in demand will typically require changes to product production, and thus to MH, in the manufacturing facility.Some demand changes may be seasonal, such as peaks in demand for school furniture just prior to the start of a new school year.Other factors may lead to more frequent, even weekly, changes in demand.To accommodate a change in demand a facility's MH process must be re-planned [4,13].Although manual processes (e.g., lean MH) and semi-automated tools (e.g., enterprise resource planning) are available for planning MH, an automated process that can quickly generate an MH plan for a changed demand in a given factory could be very useful.

Reinforcement learning
Machine learning methods may be grouped into three categories: supervised learning, unsupervised learning, and reinforcement learning [14,15].Supervised and unsupervised learning methods typically require large data sets, either to train the learning algorithm (supervised learning) or to provide sufficient information for the algorithm to learn the pattern or structure in the data (unsupervised learning).In contrast, RL does not require large data sets; instead, in RL the data is generated during the learning process in the form of rewards for actions taken, which serve as evaluative feedback to a learning agent [16,17].RL has increased in importance in recent years because of its applicability in a wide range of disciplines for which identifying all possible examples to train the agent is difficult; those applications include robotics, gaming, computer vision, business, control system engineering, simulation-based optimization, networks, anti-crime operations [18] and cybersecurity [19].The central idea of RL is that the learning agent learns, over time, what action to take in any given situation by a process analogous to methodical trial and error.The learning agent tries the different actions available to it in different situations and evaluates the outcome of each action, both in terms of the action's immediate effect on the environment and its long-term contribution to the learning agent's overall goals.
In its basic form, RL can be understood in terms of five key elements: environment, agent, state, action, and reward.Figure 2 illustrates the relationship between these elements.The environment is anything which is outside the learning agent in the RL problem.
The agent is the learner, which learns to map states to actions by attempting actions and interacting with the environment.The state is the agent's internal representation or abstraction of a particular situation or configuration of the environment as perceived by the agent.An action is an operator that changes some attribute or aspect of the environment that may result in a new state.An agent in state s i may select action a i to move towards the goal, which takes the agent to a new state s i+1 and produces a reward r i+1 .The reward is an external signal (which is calculated by a reward function) from the environment in response to an action which tells the agent whether the action taken is positive or negative with respect to the agent's goal.The reward is a method of communicating the objective of the RL problem to the agent.In general, the agent seeks to select actions that maximize its long-term cumulative reward.In simple formulations of reinforcement learning, the reward value is a scalar [16].(The notation used in this paper for representing RL elements was adapted from [16].) Two additional components of the RL process are the policy and value functions.The policy is a total function π: S × A → pr, where S is a set of states, A is a set of actions, A(s) is the set of actions available in state s, with A(s) ⊆ A, and pr ∈ [1] is a probability.If π(s, a) = pr, then the probability of the learning agent choosing action a when in state s is pr.
A value function is a secondary reward function that estimates the long-term value, i.e., the cumulative future reward, when following policy π.There are two types of reward functions.The first, denoted v π (s), is the expected reward starting from state s and following policy π.The second, denoted q π (s, a), is the expected reward starting from taking action a in state s following policy π.If the reward value of taking a particular action being in a particular state stays the same over time, then the RL problem is called stationary; otherwise, it is called non-stationary.
The value of each state and state-action pair (s, a) can be determined by the Bellman equations, which are also called dynamic programming equations [20].The Bellman equations specify how the values of the current state and state-action pair of the learning agent can be estimated using the successor state and state-action pair, respectively.The value of state s i under policy π is denoted v π (s i ); function v π is called the state-value function for policy π.Similarly, the value of state-action pair (s i , a i ) under policy π is denoted q π (s i , a i ); function q π is called the action-value function for policy π.The current state value (v π (s i )) and state-action pair value (q π (s i , a i )) are updated using successive state value (v π (s i+1 )) and state-action value (q π (s i+1 , a i )).Equations ( 1) and ( 2) are the Bellman equations for v π the statevalue function for policy π and q π the action value function for policy π, for the simplest RL method, the temporal difference (TD (0)) method: where s i and s i+1 are the states at steps i and i + 1 respectively, a i is the action at step i, r i+1 is the reward at step i + 1, α represents a step-size parameter such that 0 ≤ α ≤ 1 and γ represents a discount rate parameter such that 0 ≤ γ ≤ 1 [16].
Tabular RL methods use a table to record the approximate values of the state (or state-action pair).Tabular methods are typically used when there is a small number of states and actions.In tabular stationary RL, the step size is the mean of the number of visits to the particular state.In continuous RL problems, as the number of iterations i tends to ∞ the sum of future rewards tends to ∞, so to keep the future rewards finite, a discount rate is used.RL requires a balance between greedy selection of the action with the maximum reward, as far as is currently known, and non-greedy selection of actions with unknown or currently non-maximum rewards in order to learn their effectiveness.In the RL context, the former is called exploitation and the latter is called exploration.These aspects differentiate RL learning from supervised learning, in which the agent is explicitly taught.
The central idea of RL is that the agent learns to influence or gain control of an environment over which it initially has no control by improving its policy.A policy which is better than or equal to all other policies for the same problem is called an optimal policy [16].Finding an optimal policy is difficult for most practical problems because of issues such as the lack of computational resources to do an exhaustive search and unknown environment dynamics.Hence, for most practical purposes an improved or approximately optimal policy is the goal. (1) (2) [q s i , a i = q s i , a i + [r i+1 + q s i+1 , a i q s i , a i ] Fig. 2 General framework for reinforcement learning [16]

Applying reinforcement learning to material handling
The application of RL to manufacturing MH was both motivated and challenged by special characteristics of the MH application.
• Motivation The inputs to the MH process are a specific factory layout (including workstations, walkways, and storage areas within the factory known as warehouses), a collection of MH equipment and workers to operate that equipment, and a given level of customer demand for products to be manufactured.From an MH perspective, the demand is manifested as a set of rates at which the workstations consume parts of different types.A change to any of these inputs (layout, equipment, or demand) would likely require a change to the MH plan.Consequently, there is a very wide range of possible inputs, and the associated MH plans can be very specific to a given set of inputs.Also, examples of MH inputs and corresponding plans were not available in sufficient number and generality to train unsupervised learning methods.
• Motivation The MH plans produced during an RL learning sequence could be used later as the starting plan for a subsequent RL learning sequence executed in response to a new customer demand, potentially shortening the later learning sequence or improving the later plan.• Challenge Factories are complicated.A typical manufacturing facility contains a wide range of physical elements, including workers, workstations, raw materials, pre-made parts, manufacturing equipment, transportation equipment, computers and communication networks, and many others.These physical elements support a number of tightly interconnected processes, including planning, inventory control, production, and MH.To apply RL to MH, the elements of the factory involved in the MH process must be represented in terms of the elements of an RL algorithm, i.e., environment, states, actions, and rewards.Doing so with sufficient mathematical specificity and formality was a significant challenge because of the larger number of variables involved.
• Challenge RL algorithms learn by selecting actions and seeing how well they work.Clearly, testing the numerous MH plans produced by the RL algorithm by actually trying each one in a real factory is completely impractical.Therefore, a simulation model of the factory and the MH process was developed and used to assess the performance of the MH plans generated by the RL algorithm and generate the reward for each.The simulation model was based on standard modeling techniques from discrete event simulation but is deterministic rather than stochastic and somewhat more abstract than a conventional discrete event simulation model.

Related work
This section briefly surveys selected examples of related work.Figure 3 illustrates the relationships between the relevant topics.Of particular interest are applications of RL to MH and applications of RL with multi-objective reward functions.Although considerable research has been performed on individual MH issues, most studies have focused on dynamic scheduling for fixed layout problems; responding to changes in demand has received little attention.This research is focused on applying RL for generating increasingly efficient MH plans for more complex scenarios with multiple assembly lines, routes, part types, and equipment types in response to temporally varying demand.

RL applied to manufacturing
The potential for RL to identify or generate an improved or optimal policy in real time has led to its application in many subfields of manufacturing, such as production scheduling [21] and routing [22].The RL literature includes a number of papers that describe the use of RL for learning control policies in various areas of manufacturing, including generating a dynamic scheduling policy for a multi-agent RL system [23]; learning an inventory replenishment policy [24]; addressing single-machine dispatching rule selection problem in production scheduling [25]; learning dispatching policies in a multi-agent job-shop scheduling problem [26]; performing maintenance task scheduling [27]; path planning for mobile robots in genetic algorithm-based task scheduling in dynamic intelligent warehouse environments [28]; studying global supply chain problems [29]; and developing routing policies in multi-agent job-shop environments [22].
Recent work applied RL to a multi-load carrier scheduling problem [5].In that work, a single dolly train serves a onedimensional assembly line of multiple workstations whenever a request arrives for parts from one or more workstations.The dolly train can carry multiple loads (containers).MH requests are generated using a look-ahead request generating policy, and heuristic-based dispatching algorithms and standard pickup and drop-off rules are used to generate a dropoff sequence.
The RL reward is related to both throughput of the assembly line and the MH distance.The RL state is represented using four features, and each state contains three actions.The action values are approximated using a radial basis function with Q-learning [30].The experimental setup is relatively simple; it includes only one assembly line.The RL algorithm primarily tries to learn a multi-load carrier scheduling policy, rather than learning a policy to improve the MH plan [5].

Multi-objective reward functions
An optimization problem where more than one objective is optimized simultaneously is referred to as a Multi-Objective Optimization Problem (MOOP) in the literature [31].The MH application considered in this work has four objectives: ensure that no workstations run out of parts during the given plan duration, minimize the number of pieces of equipment used in the MH plan, minimize the number of parts stored at each workstation, and minimize the total distance traveled by all of the pieces of equipment in the MH plan.These objectives potentially compete with each other, for example, reducing the number of parts stored at a workstation increases the likelihood of that workstation running out of parts.
Each of these objectives is associated with an individual reward signal; hence in the RL MH problem the learning agent receives a vector of rewards from the environment after each action.As mentioned earlier, conventional RL algorithms typically learn from a scalar reward.If the reward is not scalar, as in multi-objective problems, it must be somehow converted to a scalar so that RL can be applied.However, the most effective way in general to perform this mapping of a reward vector to a scalar reward method is an open research question in RL research.
RL problems with multiple objectives are referred to as multi-objective RL (MORL) in the literature [32].The two main classes of MORL algorithms are single-policy and multi-policy.Single-policy algorithms map the reward vector to a scalar reward and then learn the single optimal policy, whereas multi-policy algorithms learn multiple policies (one for each of the corresponding objectives).Usually these MORL algorithms generate a set of compromised solutions called a Pareto optimal set [32].Single-policy MORL algorithms include linear scalarization [33], non-linear scalarization such as Chebyshev scalarization [34], lexicographic ordering-based scalarization [35], and hypervolume indicator-based (the volume of the area between the Pareto set and the reference point) algorithm [36].
Multi-policy algorithms include the convex-hull value-iteration algorithm [37] and Pareto Q-learning [38].The hypervolume indicator has been used as a benchmark metric to compare the single-policy and the multi-policy MORL algorithms in order to evaluate their performance and limitations in an empirical study [32].Recent research has shown how deep RL can be applied to solve MORL problems [39,40].Other methods, such as the geometric approach [41], have been applied to MORL problems as well.

RL in plan space
A neural network trained with a temporal difference algorithm was used to learn the heuristic evaluation function over states for job-shop scheduling of NASA's space shuttle processing tasks [42,43].The main goal was to find the shortest schedule for many simultaneous jobs, each consisting of multiple tasks, that satisfied both temporal and resource constraints.Two operators, "REASSIGN-POOL" and "MOVE", were applied to reassign a resource if it was over-allocated and to move the task within the temporal limit to satisfy the violated resource requirement, respectively.A resource-dilation factor, which is independent of the length of the schedule, was used to measure the immediate rewards.With several modifications, the algorithm eventually converged and outperformed the previous simulated annealing-based iterative repair method [42,43].
According to Sutton, that type of problem is an instance of learning in a "plan space", which is similar to this research [16].The objectives in [42] were to generate a conflict free schedule by removing temporal and resource constraints and to minimize the duration of the schedule.Although the proposed method involves plan-space learning, learning stops once the two temporal and resource constraints are satisfied.To optimize the duration of the schedule, the plan space method reviews the plans and selects the one with the shortest duration.They used a neural network-based function approximator by carefully selecting the features representing the schedule as input for estimating the value of each schedule after applying each change operator.In contrast, the method used here includes all of the objectives in the learning process and all of the variables of the plan that are used as the input to the simulation which generates the states.Moreover, job-shop scheduling is a scheduling problem, as opposed to the multi-objective MH plan optimization problem in this work.

Comparison of reinforcement learning with other optimization methods
This section briefly compares and contrasts RL, the method used in this research, with several other optimization methods that could conceivably be applied to the MH problem.The other methods are well-known and excellent descriptions of them are available, thus no attempt is made to provide comprehensive tutorials here.
Evolutionary algorithms explore a space of possible solutions, or solution parameterizations, for a problem using methods inspired by biological evolution.A population of possible solutions is generated, and each is evaluated as a solution to the problem using a problem-specific fitness function.The "fittest" solutions, i.e., those with the best fitness function values, are used to create a next generation of possible solutions using evolution-inspired operations including crossover and mutation.The process iterates until some stopping condition is reached, such as insufficient generationto-generation improvement.In genetic algorithms, which are arguably the best-known type of evolutionary algorithms, the possible solutions are typically strings of numbers representing parameters to some solution function or process.
As explained in more detail in [16], evolutionary algorithms differ from the RL method in several significant ways.RL methods "learn while interacting with the environment, which evolutionary methods do not" [16].Evolutionary methods use a population of multiple possible solutions, that are each evaluated once for fitness, and then may be transformed into new possible solutions using general operations.RL uses a single policy, which is analogous to a single possible solution, but that policy is evaluated and changed many times during the learning process, and those changes may be made by problem-specific operations (as in the research described in this paper).Finally, RL uses value functions, which estimate the cumulative current and future value of specific states in the solution process controlled by the policy, whereas evolutionary methods "search directly in policy space" guided by fitness function evaluations of the complete solutions [16].
Hill climbing algorithms seek to optimize (either minimize or maximize) an objective function f(x), where x may be a vector of input values, by heuristically searching the space of possible input values.Typically, hill climbing proceeds iteratively; in each iteration, an element of x is adjusted in some way and a new value of the objective function f(x) is calculated; if the new value of f(x) is better than the previous value, the change to x is retained.The iterations continue until the value of f(x) cannot be improved.
Hill climbing has some similarities to RL; the adjustments to x in hill climbing can be seen as analogous to the actions in RL, and the objective function f(x) in hill climbing can be seen as analogous to the reward function in RL.There are also differences; computing the new objective function value in hill climbing depends only on the new value of x, whereas the value function in RL estimates not only the value of the next action but all the actions that follow it.Hill climbing is a greedy method, in that the search always moves in the direction that locally improves the objective function value, whereas the standard ε-greedy RL algorithm includes both exploitation (greedy optimization) and exploration (stochastic exploration of non-locally optimal actions).For this reason, RL is less likely than hill climbing to end at a solution that is locally, but not globally, optimum.
Simulated annealing, which is similar to hill climbing and has been described as an adaptation of the Metropolis-Hastings algorithm [44], addresses the problem of converging on a local, rather than global, optimum by allowing the possibility of accepting inputs to the objective function that produce a less-optimal value.The probability of such an acceptance gradually decreases over successive iterations, which is analogous to the slow cooling and hardening of a molten material.This aspect of simulated annealing is similar to the exploration possibility in ε-greedy RL.However the other differences between RL and hill climbing also apply to simulated annealing; the absence in simulated annealing of something like the RL value function is specifically mentioned in [16].
• Part (noun): An individual item that is part of a finished product, i.e., one of the building blocks of a product.
• Warehouse (noun): A location in the factory where parts are stored; this location has a maximum capacity for each type of part stored and a current inventory for each type of part stored.
• Route (noun): A sequence of locations through which each piece of equipment moves to pick up or deliver parts, or containers of parts.
• Workstation (noun): A location where work is performed.Workstations consume one or more types of parts by adding them to a product.Each workstation has a consumption rate for each type of part that it consumes and may have a buffer for holding unconsumed parts.• Demand (noun): The type and number of items to produce.Demand is realized in this research as the rates at which each workstation consumes different type of parts.
• Plan (noun): A specific configuration of routes, with assignments of pieces of equipment, scheduled execution times, and the parts to be delivered for each of the routes.• Non-consumed parts (noun): Parts not consumed by a workstation which would have been consumed during the period of plan execution but were not because the workstation had run out of parts.Non-consumed parts measure lost production.• Mean parts threshold (noun): An average number of parts required to keep the workstations running during the plan execution period.
When planning MH processes, this research is primarily interested in four constraints: • Ensuring that none of the workstations run out of parts.
• Minimizing the number of pieces of equipment needed to deliver parts.
• Minimizing the number of parts stored at each workstation.
• Minimizing the total distance traveled by all pieces of equipment while delivering parts from the warehouse to workstations.
Satisfying the first constraint would be simple if not for other three constraints, and vice-versa, to satisfy the constraints simultaneously can make MH planning challenging.

Assumptions and simplifications
The following characteristics of the factory MH process are assumed: • A workstation may consume more than one type of parts.
• The rates at which the workstations consume parts are known.
• Each workstation has a buffer (i.e., temporary storage location) for each type of part that the workstation consumes, and the capacities of those buffers are known.• A single piece of equipment may transport parts for multiple workstations on a single trip.
• Any piece of MH equipment can physically reach any of the workstations to deliver parts.
• All MH within the factory is included in the MH plan, i.e. parts are not delivered to the workstations other than as specified in the MH plan.
• Sufficient MH resources, such as pieces of equipment and the workers to operate the equipment, will be available to execute the MH plan.• Moving partially completed products between workstations and moving completed products to a shipping area are not part of the MH plan.
The following simplifying assumptions were made; any could be relaxed in future work: • Workstations and MH equipment do not malfunction or break down.
• Once execution of a MH plan is initiated, it is not pre-empted.There are no preemptive or unscheduled requests for parts from the workstations that affect the MH plan.
• The rates at which the workstations consume parts do not change during execution of an MH plan. 2 The time required to unload parts at a workstation does not change during execution of an MH plan.
• The factory layout (i.e., the location of the warehouses, workstations, and connecting walkways) does not change during execution of an MH plan.• The reward associated with a particular MH plan does not change over time, i.e., the same plan will produce the same reward at any point in a learning sequence.

MH as an RL problem
The MH reinforcement learning process is shown in Fig. 4. The elements of the MH planning task have been mapped to those of RL in the following way.The RL environment is the factory, which includes all the workstations, intersections, warehouses, and walkways, all the equipment, and all the different types of parts.The RL state is an abstract internal representation of an MH plan and its reward (to be explained later).The RL action is a change to a plan.Finally, the RL reward is the cost of a MH plan.
In the figure, initial MH plans are created manually and stored in the plan repository.A plan learning sequence begins when an initial plan P 0 is selected from the repository and tested using an abstract simulation.The resulting reward r 1 , i.e., the MH cost for plan P 0 , is calculated by the abstract factory simulation and then passed to the agent along with the state s 1 , which is determined from the plan and reward.Then the agent checks whether the total Fig. 4 Material handling reinforcement learning process number of actions (changes to the plan) performed so far is equal to 3000 (this value was chosen empirically, based on preliminary testing).If so, then the learning sequence stops and the plan is moved into the repository.Otherwise, the agent makes another change to the plan in an attempt to improve it.The learning agent is learning in "plan space", i.e., the agent explores the space of possible plans by making small changes to each successive plan in the sequence.RL is employed to help the agent to learn what types of plan changes work best in different situations.

RL state and environment
To use RL to modify an MH plan, a formal mathematical representation of the factory layout, the material handling scenario, and the MH plan are required.They are represented as follows.
Factory layout The factory layout is represented as a graph G F = (V F , E F ) with vertex set V F and edge set E F .The vertices represent the workstations, warehouses, and walkway intersections, and the edges represent the walkways which connect the warehouses, workstations, and intersections; the walkways are bidirectional and allow movement in either direction.The vertex set VF can be partitioned into three types of vertices: …} is the set of workstations; |K| = number of workstations.
Not all pairs of vertices are connected by edges (walkways) and there are no edges from a vertex to itself; thus Graphs have been used to represent factory layouts in prior research, e.g., [45].A graph provides information needed for planning MH, e.g., how the warehouses and workstations are connected by walkways, and omits unneeded details, e.g., the physical dimensions of a workstation.Figure 5 shows a small example of a factory layout represented as a graph using the notation just defined.
Material handling scenario The representation of the material handling situation has five components, as follows: • Equipment Q Set of all pieces of equipment available for MH in the factory for a given plan period.Q = {q 1 , q 2 , q 3 , …} and q i = piece of equipment i, i ∈ Z + ; |Q| = total number of different pieces of equipment.• Part type IT Set of all types of parts available in the factory for MH consumption for a given plan period.IT = {it 1 , it 2 , it 3 , …} and it i = part type i, i ∈ Z + ; |IT|= total number of different types of parts.where c ij = number of parts of type j consumed by workstation i per hour; 1 ≤ i ≤ |K| and 1 ≤ j ≤ |IT|.
• Buffer B Number of parts of each type that can be stored at each workstation.
where b ij = maximum number of parts of type j that can be stored at workstation i; • Plan duration T. Time duration (in minutes) over which the MH plan will be executed.
Plan The material handling plan consists of a set of routes: P = {p 1 , p 2 , p 3 , …}; where p i = route i, i ∈ Z + .Each route p i has four components, W i , PE i , F i , and D i , defined as follows: • Walk (graph theory) W i : A sequence of vertices in the factory layout such that adjacent vertices are connected by a distinct edge; where and w 1 , w n ∈ H.In terms of MH, each complete transit of a walk by a piece of equipment delivering parts is referred to as a trip.
• Piece of equipment PE i : A piece of equipment used in the route i; where PE i ∈ Q. • Times trips start F i : List of times that the trips of route i start; where F i = (t 1 , t 2 , t 3 , …, t f ), 0 ≤ t j < T, t j ≤ t j+1 for 1 ≤ j < f and inter-trip interval VF i = t j+1 -t j then, Here the inter-trip interval is a constant, thus the trips are evenly spaced.The first trip cannot start before VF i .• Deliveries D i : Single list of triples (i.e., 3-tuples), where , |D i |= k and d j = (w, x, y) with w = workstation, x = part type, and y = number of parts, i.e., w ∈ K, x ∈ IT and y ∈ Z + .

RL actions
The actions available to the RL learning agent are ten distinct operations that change an MH plan in ways that are intended to improve it.Those plan change operations, or actions, are defined as follows.
IncreaseTheNumberOfTrips Increase the frequency of the route serving the workstation that had the greatest shortage of parts.Let d j ∈ D i be the delivery in route i of plan P such that d j [1] is the workstation that has the largest number of non- consumed parts of type d j [2].This action increases the number of times parts will be delivered to d j [1] by decreasing the inter-visit interval VF i , i.e., VF i = VF i -x where 0 ≤ x < T. The value of x is a constant set to a value found empirically during preliminary testing.If more than one delivery is made in route i, then delivery d j will be removed from the route i and a new route m will be formed with new trip start times generated using VF i i.e., VF m = VF i -x where 0 ≤ x < T. Route m will be added to the plan P. All of the remaining route information such as a piece of equipment is copied from the original route.The walk for the new route is calculated using Warshall's shortest path algorithm [46].The new route will start from a warehouse and drop off part type d j [2] at workstation d j [1] before returning to the warehouse.If none of the workstations in plan P is running out of parts, then this action selects the d j ∈ D i , such that d j [1] is the workstation that has the largest number of extra parts of type d j [2] stored, and then increases the number of times parts will be delivered to d j [1].
DecreaseTheNumberOfTrips Decrease the frequency of the route serving the workstation that had the greatest excess of parts.Let d j ∈ D i be the delivery in route i of plan P such that d j [1] is the workstation that has the largest number of extra parts of type d j [2] stored.This action reduces the number of times parts will be delivered to d j [1] by increasing the intervisit interval VF i , i.e., VF i = VF i + x, 0 ≤ x < T. The value of x is a constant set to a value found empirically during preliminary testing.If more than one delivery is made in route i, then delivery d j will be removed from the route and a new route m will be formed with new trip start times generated using VF i , i.e., VF m = VF i + x where 0 ≤ x < T. Route m will be added to the plan P. All of the remaining route information, such as a piece of equipment, is copied from the original route.The walk for the new route is calculated using Warshall's shortest path algorithm [46].The new route will start from the warehouse and drop off part type d j [2] at workstation d j [1] and return back to the warehouse.If none of the workstation of plan P has extra parts stored, then this action selects the d j ∈ D i such that d j [1] is the workstation that has the largest number of non-consumed parts of type d j [2] and decreases the number of times parts will be delivered to d j [1].
IncreaseTheNumberOfPartsDelivered Increase the number of parts delivered per trip to the workstations that had the greatest shortage of parts.Let d j ∈ D i be the delivery in route i of plan P such that d j [1] is the workstation that has the largest number of non-consumed parts of type d j [2].This action increases the number of parts of type d j [2] delivered to d j [1] by increasing the number of parts d j [3] = d j [3] + m where m ∈ Z + , d j ∈ D i .The value of m is a constant set to a value found empirically during preliminary testing.If none of the workstations in plan P are running out of parts, then this actions picks the d j ∈ D i such that d j [1] is the workstation that has the largest number of extra parts of type d j [2] stored and increases the number of parts of type d j [2] delivered to d j [1].(Recall the definition given earlier; "non-consumed parts" are parts that would have been consumed had the workstation not run out of parts.) DecreaseTheNumberOfPartsDelivered Decrease the number of parts delivered per trip to the workstations that had the greatest excess of parts.Let d j ∈ D i be the delivery in route i of plan P such that d j [1] is the workstation that has the largest number of extra parts of type d j [2] is stored.This action decreases the number of parts of type d j [2] delivered to d j [1] by decreasing the number of parts d j [3] = d j [3] -m where m ∈ Z + , d j ∈ D i .The value of m is a constant set to a value found empirically during preliminary testing.If none of the workstations in plan P have extra parts stored, this action picks workstation d j ∈ D i such that d j [1] is the workstation that has the largest number of non-consumed parts of type d j [2] and decreases the number parts of type d j [2] delivered to d j [1].
ChangeNumberAndOrTypeOfEquipment Split the route serving the workstation with the greatest shortage of parts into two routes, and assign a piece of equipment to the new route.Let d j ∈ D i be the delivery in route i of plan P such that d j [1] is the workstation that has the largest number of non-consumed parts of type d j [2].If more than one delivery is made in the route i and at least one workstation is running out of parts in that route i, then this action splits the route i into two routes.An unassigned piece of equipment from Q is assigned to the second route.All of the remaining route information, such as trip start times, is copied from the original route.If there is only one delivery in route i or none of the workstations of D i in route i are running out of parts, then this action tries to remove one of the deliveries from route i and adds it to any other route in plan P which has same type of equipment as route i.This action requires sorting the routes of plan P in descending order of number of different type of parts on the pieces of equipment of each route.This action tries to reduce the number of pieces of equipment used in the plan P.
MergeWithTheRouteWithSameWorkstation Remove a delivery to a workstation from one route and add it to another route that already includes a delivery to that workstation.Let i be the route with the smallest number of workstations to be served in the plan P. Remove d j ∈ D i in route i in the plan P from the beginning of D i and add it to route x such that x is also delivering the parts to the same workstation as d j [1] (as each workstation can consume multiple types of parts), i.e., d j [1] = d h [1] where If more than one such route exists (such as route x), then this action selects the very first one in the sorted route list.This action requires sorting the routes of plan P in descending order of number of different type of parts on the pieces of equipment of each route.If the equipment capacity and equipment type do not permit adding the selected workstation d j [1] to any other route in the sorted plan P, this action will consider the next delivery d j+1 in the list of D i in the same route i.This action tries to reduce the number of pieces of equipment used in the plan P.
MergeWithNeighborWorkstation Remove a delivery to a workstation from one route and add it to another route that already includes a delivery to a nearby workstation.Let i be the route with the smallest number of workstations to be served in the plan P.This action removes delivery d j ∈ D i in route i in the plan P from the beginning of D i and adds it to route x such that at least one workstation d h [1] of D x is the neighbor of d j [1] of D i , i.e., (d j [1], d h [1]) ∈ E F or if d j [1] and d h [1] have a path connected only by one or more intersections, where d j ∈ D i and d h ∈ D x .If more than one such route exists (such as route x), then this action selects the very first one in the sorted route list.This action requires sorting the routes of plan P in descending order of number of different type of parts on the pieces of equipment of each route.If the equipment capacity and equipment type do not permit adding the selected delivery d j to any other route in the sorted plan P, the action will consider the next delivery d j+1 in the list of D i in the same route i.This action tries to reduce the number of pieces of equipment used in the plan P.
MergeWithTheRouteWithinTheWalk Remove a delivery to a workstation from one route and add it to another route that already passes by that workstation.Let i be the route with the smallest number of deliveries D i to be served in the plan P.This action removes delivery d j ∈ D i in route i in the plan P from the beginning of D i and adds it to route x if workstation d j [1] is in the walk of route x, i.e., W i ∩ W x = d j [1] where d j ∈ D i .If more than one such route exists (such as route x), then this action selects the very first one in the sorted route list.This action requires sorting the routes of plan P in descending order of number of different type of parts on the pieces of equipment of each route.If the equipment capacity and equipment type do not permit adding the selected delivery d j to any other route in the plan P, the action will consider the next delivery d j+1 in the list of D i in the same route i.This action tries to reduce the number of pieces of equipment used in the plan P.
MergeWithRouteArbitrarily Remove a delivery to a workstation from one route and add it to another route arbitrarily, i.e., without any other conditions.For any arbitrarily selected delivery d j ∈ D i in route i in the plan P, this action removes the delivery d j from route i and adds it to the arbitrarily selected route x.Route x must have the same equipment type as route i and must have capacity to carry part d j [2].
SplitTheRouteArbitrarily Remove a delivery to a workstation from one route and create a new route that includes only that delivery arbitrarily, i.e., without any other conditions.For any arbitrarily selected delivery d j ∈ D i in route i in the plan P, split route i at delivery j, create a new route x, and then add it into the plan P. All of the remaining information, such as trip start times, is copied from the original route.Some of the actions require selecting a particular workstation or route in a plan P to which the action is applied, that selection is done by reviewing the information from the abstract simulation of the plan (see next section for further explanation).

RL reward
As mentioned earlier, in RL the result of selecting and performing an action in a given state is both a new state and a reward; the latter indicates the value of the selected action in the given state.In this work, the reward is an assessment of the new MH plan produced by performing a plan change operation.Clearly, assessing MH plans by testing them in an actual factory would be impractical, so an alternative method must be found.This section describes how the rewards are calculated using a simulation-based multi-objective reward function.

Reward model
The reward is calculated as the cost of executing the MH plan for a one six-hour shift.The cost is calculated using a computer simulation of MH developed specifically for this purpose.(A similar approach, using a computer simulation to evaluate flexible flow shop schedules, is described in [47].)The simulation model is based on concepts from discrete event simulation (DES) [48], but is highly abstract.For example, detailed occurrences that might be separate events in a conventional DES model, such as visiting each workstation in a route, are abstracted into a single event, namely executing the route.The model uses a Future Event List (FEL) which is initialized using the times trips start list of all the routes of the plan P. The model executes the plan by performing each event in the FEL.Because the FEL for a given plan P is initialized completely at the beginning of a simulation, later events are not scheduled by earlier events, as is typical in conventional DES.
The main activity of the simulation is route execution, i.e., starting from the warehouse, visiting each of the workstations on the route, and returning to the warehouse.This is done for each route in a plan P. The time required to execute a route is deterministic because it is calculated directly from the distances between the successive pairs of vertices in the route.The reward model maintains the current time and distance lists for each distinct piece of equipment used in the plan P, and it maintains a separate list for the sum of parts stored and sum of non-consumed parts for each workstation and part type in the plan P.These lists record the data at every visit to a particular workstation to drop off a particular type of part.
The plan is simulated to determine the actual time spent visiting each workstation and to record the sum of stored parts and non-consumed parts, which is used to calculate the reward, i.e., the MH cost.During a simulation the model generates additional information that can be used to fill in the details of the next action selected.
Each simulation runs for T minutes of simulated time; in the experiments reported later, T = 360 (one six-hour shift).All the routes of the plan are assumed to have already been executed once and their parts delivered at the start of the simulation so that none of the workstations will run out of parts immediately at the beginning of the simulation.The routes are also executed one additional time at the end of the simulation without calculating the visiting times and distance, in order to record the sum of stored parts and the sum of non-consumed parts for each workstation-part type.

Notations used for calculating RL reward for MH plan P
The following notation is used to define the details of the reward model.

m(a, b)
Euclidian distance between two adjacent vertices in the factory, such that a, b ∈ V F and (a, b) Total distance to execute a given plan P V q (t, g) Current time t of a piece of equipment g ∈ Q in a plan P m r (i) Total distance for executing route i t r (i) Total time for executing route i V d (i, j, h) Last visiting time to workstation j to drop off part type k of route i where j ∈ K and h ∈ IT P d (i, j, h) Total parts of type h stored at workstation j during the last visit to workstation j of route i where j ∈ K and h ∈ IT A(i, j, h) Number of parts of type h consumed by workstation j since the last visit to workstation j of route i where j ∈ K and h ∈ IT T d (i, j, h) Number of parts of type h expected to be consumed by workstation j since the last visit to workstation j of route i where j ∈ K and h ∈ IT; calculated using consumption rate C B(i, j, h) Number of parts of type h currently at workstation j of route i after every visit where j ∈ K and h ∈ IT Z(i, j, h) List (z 1 , z 2 , z 3 , …, z |Fi|+1 ) of the sum of the number of parts of type h stored at every visit to workstation j of route i where j ∈ K and h ∈ IT V c (i, j, h) Number of visits to workstation j to drop off part type h of route i where j ∈ K and h ∈ IT H d (i, j, h) Total extra parts of type h stored at workstation j of route i since the last visit to workstation j of route i where j ∈ K and h ∈ IT V q Total number of distinct pieces of equipment used in the plan P Wp Total number of deliveries in the given plan P Y(i, j, h) List (y 1 , y 2 , y 3 , …, y |Fi|+1 ) of the sum of non-consumed parts of type h at every visit to workstation j of route i where j ∈ K and h ∈ IT Plty d (i, j, h) Total number of non-consumed parts (defined in the next section) of type h of workstation j of route i, where j ∈ K and h ∈ IT.

Reward response variables
Several response variables produced by the abstract MH model when simulating a plan are used to calculate the RL reward.Those response variables are: • Distance travelled M: Total distance (in meters) travelled by all pieces of equipment in the plan P to deliver the parts from the warehouse to the workstations during a given plan period T.This includes the distance travelled after the parts are delivered to the last workstation and the equipment returns to the warehouse.An example of a typical route starting from the warehouse h 1 ∈ H and visiting four workstations (k 1 , k 2 , k 3 , k 4 ∈ K) and returning back to the warehouse in dotted line is shown in Fig. 6.The total distance for executing route i in the figure is where m(h 1 , k 2 ) = distance between warehouse h 1 ∈ H and the first workstation k 2 ∈ K in the route, and m(k 1 , h 1 ) = distance between the last workstation k 1 ∈ K back to warehouse h 1 ∈ H.
Similarly, the total time for executing route i in the figure is where t(h 1 , k 2 ) = travel time between warehouse h 1 ∈ H and the first workstation k 2 ∈ K in the route, and t(k 1 , h 1 ) = travel time between the last workstation k 1 ∈ K back to warehouse h 1 ∈ H.
• Equipment utilization V utilization : Total number of distinct pieces of equipment needed to execute the given plan P.
• Mean parts V: Total time averaged quantity of all the parts that have been delivered to all the workstations but have not yet been used in production and are therefore stored in the workstation's part storage buffer, i.e., the average of all the parts stored at all the workstations at any given time within a given plan period T. The minimum number of parts stored at each workstation should be sufficient to keep the workstation from running out of parts before the next delivery.• Non-consumed parts Penalty: Number of parts that would have been consumed by all workstations had they been operating at their normal rate for the entire plan period T, but were not consumed because a workstation had run out of parts.If the MH plan has not delivered sufficient parts to the workstations to keep up with their consumption rates and one or more workstations have run out of parts then production stops.Production stoppages are expensive and disruptive.Therefore, the Penalty response variable is heavily weighted in the reward calculation and avoiding non-consumed parts due to stoppages is a goal of the learning process.

Weighted sum method
Because there are multiple reward response variables, the weighted sum method [49] was used to scalarize the vector of reward response variables so that the RL method can be applied.The weighted sum method is one of the most commonly used approaches for scalarizing a vector reward.As the name describes, each four response variables in the reward vector is weighted according to the importance of that variable to the final reward.In general, a single scalar RL reward can be calculated as where j = the number of response variables in the reward vector, Re i = the value of reward response variable i, N i = the cost per unit of the reward response variable i, and w i = the weight of reward response variable i.
As defined earlier, the four reward response variables were M, V utilization , V, and Penalty.Thus the expanded formula for reward R is The four reward response variables are not all in the same units; distance travelled M is in meters, equipment utilization V utilization is in number of pieces of equipment, mean parts V is in number of parts, and non-consumed parts Penalty is also in number of parts.Therefore, the reward response variables' values are normalized to use cost as a common unit.Each of the response variables' values are converted to a cost by multiplying the variable's value by a quantity representing the cost of that variable's unit; thus M is multiplied by cost per meter travelled, V is multiplied by cost per part per unit time stored at a workstation, V utilization is measured as the cost per piece of equipment used, and Penalty is measured as cost per part that a workstation ran out of and thus was unable to consume during the execution of an MH plan.The four cost normalization multipliers are denoted N i in Eq. ( 5) and N 1 , N 2 , N 3 , and N 4 in Eq. ( 6).The values for the four cost multipliers were estimated by subject matter experts based on historical data from the Steelcase factory used as the example in this work.Similarly, the weights w 1 , w 2 , w 3 , w 4 that were used to represent the relative importance of the four response variables were elicited from subject matter experts in the same facility. 3mportantly, note that the calculated reward is a cost, which MH managers want to reduce, thus the RL algorithm seeks to minimize the reward, not maximize it.
Figure 7 gives the pseudocode of the reward calculation, i.e., the abstract simulation of the MH plan.Lines 1-7 create the FEL from the times trips start (TS i ) for each route i of plan P. Each event notice consists of the piece of equipment PE i serving route i, the route number, and the time at which each trip is supposed to start.The trip time of each route can be reviewed without creating the list, although the list is convenient for implementation.Next, the list is sorted in increasing order of the time the trip starts, as shown in line 8.Then, all variables of the each workstation and part type are initialized as shown in lines 9-12.The first trip to each workstation starts at minute 0 of the simulation.Lines 13-38 model the actual execution of the route.All the relevant data for computing the RL reward R is recorded in these steps.For each route i and each workstation j on i, lines 15-16 calculate and update the total distance travelled and the current time of PE i .Then, lines 28 and 31-35 calculate and update the total sum of parts of type h stored at workstation j since the last visit and the sum of non-consumed parts of type h by workstation j since the last visit, respectively.The total mean parts V and non-consumed parts Penalty variables for the entire plan P are calculated at lines 39-45.Then, at line 46, the total RL reward, which is the MH cost, is calculated using the weighted sum method.

Learning process
In general, the goal of RL is to find a policy that optimizes (minimizes or maximizes) total reward of time.As stated in [16], finding an optimum policy is difficult, even for relatively simple tasks such as board games (which are much simpler than the MH problem), because of computational costs and memory constraints.Therefore, the goal of this research was not to find a strictly optimal policy, but rather to find an effective policy that produced improved (less costly) MH plans for a given demand.
Because this work sought to generate an effective policy which could simultaneously optimize multiple reward response variables, many improved plans were generated which together form a Pareto front [32].Certain plans minimize distance travelled, others minimize mean stored parts, whereas others minimized equipment usage.The ability to generate multiple optimal plans may be considered an advantage of the RL method in that MH managers can select from among the generated plans the one most best suited for their requirements.

Temporal difference learning
The RL MH problem is a continuing problem (i.e., a continual process control problem) as defined in [16] because a definite stopping condition is not available.A method to predict the state that will result from taking a particular action in the current state is not available, thus the probability of any particular state being the next state is unknown.Therefore, a model-free RL learning method is used to learn a policy that will improve an MH plan.In particular, tabular Sarsa(0), a temporal difference algorithm with one-step look ahead, was used [16].
At the outset of this work, our initial intention was to use the entire MH plan as the RL state.However, an MH plan contains many variables, each with many possible values (see Sect. 3.3); consequently, the number of possible different states, i.e., the state space of possible MH plans, is extremely large.Overly large state spaces are in general problematic for RL because the learning process is not able to visit each state often enough to try all of the actions available in each state and thereby learn an effective policy for each state.
Therefore, to reduce the size of the state space, the RL state was calculated from values of variables in the reward function, rather than using the MH plan itself as the state.(This is quite similar to function approximation in RL using features of the state [16], although in this work the state was calculated from the reward function's output.)Several different ways Fig. 7 Reward calculation pseudocode of doing so, with differing numbers of states, were manually devised based on knowledge of the application and then empirically tested.The key challenge was to identify a set of features that were simpler than the complete MH plan, but contained essential information needed to help the RL algorithm select an action that would improve the MH plan.After several iterations with state calculation schemes that did not perform well, testing with a simple scheme with two states (No penalty and Penalty) showed some promise, but clearly had room for improvement.Close examination of the two-state experimental results revealed that the RL algorithm would sometimes choose actions that overcorrected, switching too drastically from actions appropriate for a No penalty state to actions appropriate for a Penalty state, or vice versa.From that insight, a third state representing a transitional state "between" No penalty and Penalty was added.These states were used in the result report here: • State 0, No penalty.Penalty = 0 with V ≥ mean parts threshold.
• State 2, Penalty.Penalty > 0. Figure 8 illustrates these three states and the possible transitions between them.For example, from the No Penalty state, if the agent selects an action that changes the MH plan in a way that the reward model when executed for the changed plan returns a reward with a non-zero Penalty value in the reward vector, then transitions to the Penalty state.If the reward model instead returns a reward with a zero Penalty value and with the mean parts variable greater than the mean parts threshold, then the state remains in the No penalty state.Finally, if the reward model returns a reward with a zero Penalty value and with the mean parts variable less than the mean parts threshold, then the next state is the Transition state.
The Transition state was added after preliminary testing showed that using just two states, No penalty and Penalty, was ineffective.With only two states, it was possible that actions that are normally positive, such as reducing the number of parts delivered to a workstation so as to reduce the cost associated with storing excess parts, could be invoked when the number of parts at a workstation was very close to the minimum level required to avoid running out of parts.In that situation the action would reduce the parts enough to cause a workstation (or workstations) to run out of parts, producing a non-zero Penalty value, and the action would receive a large negative reinforcement, causing it to not be selected again, in spite of the fact that for many plans it would often be effective at reducing the cost.The Transition state provided a means for the learning process to learn to distinguish between situations when such actions are desirable and situations when they are not.
The best value for mean parts threshold (labeled "threshold" in Fig. 8) is learned along with the policy by the RL process.The threshold value was set to zero initially, then was updated during execution as the moving average of the mean parts V stored whenever a transition into the Penalty state occurred.This eliminated the need to select an empirical value for the means parts threshold, and allows the algorithm to be applied to any demand.Figure 9 gives the pseudocode of the RL MH plan learning algorithm.The learning process starts by selecting a naive initial plan P 0 , initializing the action values to 0.5, and setting the values for the RL Q-learning parameters learning rate α, discount rate γ, and ε-greedy random action selection probability ε in line 1.(The specific values for α, γ, and ε used in this work were chosen based on guidance from the RL literature and preliminary testing.)Then the algorithm calculates the reward for the initial plan by executing the abstract simulation (described in Sect.3.6) and sets the initial state (line 3).
In the initial state s i , the agent takes action a i based on the ε-greedy policy to produce a new plan P i+1 as indicated in lines 6-7.From the new plan P i+1 , the simulation is used to calculate the reward r i+1 and new state s i+1 as shown in line 8.The ε-greedy action selection process is used to select a new action a i+1 from the new state as shown in lines 9-13.
Instead of using the direct MH cost to update the action value, the algorithm uses the difference of the previous and the current immediate reward as an actual reward as shown in line 14.Then the previous action value is updated using the new action value and the MH cost from the actual reward (line 15 in Fig. 9).The current plan, state, and action are updated as shown in line 19.This process repeats until 3000 actions have been taken.The number of actions per run was set at 3000 based on preliminary testing, which showed that in almost all runs the reward had stabilized and the agent had explored the actions available in each state often enough to learn which actions were useful by the time 3000 actions had been completed.

Observations from the reward calculation
As described earlier, the reward is calculated using a simulation.That simulation produces considerable information beyond simply the values of the four response variables that are used to calculate the reward.For example, because the simulation computes the details of the MH process at discrete time steps, a time series for each of the factory variables is available.Some information available from the simulation, such as the identity of a specific workstation that runs out of parts and the type of parts it runs out of, or time series of parts stored at each workstation, are potentially very useful for improving the MH plan.A feature of this work is to exploit that additional information within the context of RL.The additional data produced by the simulation is retained in a list.For practical applications of RL, such as the MH application studied here, the additional information can reduce the size of the state space by retaining the additional information as supplementary information with each state, rather encoding it all as the state.

Action selection
The information produced by the abstract simulation is exploited in the action selection process.That process has two stages: first, an action is selected from among those available in the state, and second, the specific parameters of the chosen action are set.The stages proceed as follows: 1. Select the action.One of the actions available in the current state is chosen using the conventional RL ε-greedy method and the current action values learned by the RL process for those actions.2. Set the action's parameters.As described earlier (Sect.3.5), each of the actions has a number of specific parameters.
For example, action IncreaseTheNumberOfPartsDelivered has two parameters: part type and the number of parts to be increased.Once an action has been chosen, the specific values of that action's parameters are calculated by heuristics using the additional data produced by the abstract simulation.The action parameters' calculations are specific to each action, but all actions use the data produced by the abstract simulation.For example, for IncreaseTheNumber-OfPartsDelivered, the part type and number of parts are calculated using the penalty list Y(i, j, h) from the abstract simulation.

Experiments, results, and analysis for factory layout 1
This section reports and analyzes the experimental results for the first factory layout.Factory layout 1 is the smaller and simpler of the two factory layouts.As shown in Fig. 10, factory layout 1 consisted of 10 workstations (k 1 , k 2 , k 3 , …, k 10 ), seven intersections (x 1 , x 2 , x 3 , …, x 7 ), and one warehouse (h 1 ).The ten workstations consume a total of 20 different types of parts.Each workstation had a temporary storage buffer for parts that it has not consumed yet.The buffers' capacity for each part type was twice the number of parts of that type consumed by the workstation in an hour.Three types of equipment were used: tugger, forklift, and human puller.Each type of equipment could carry multiple parts of the same or of different types.Each equipment type was limited in the number of different part types it could carry; those limits are shown in Table 1.Nine pieces of equipment (three of each type) were used in total, and each part type could be carried on multiple types of equipment.In the current implementation, pieces of equipment are limited in terms of different types of parts they can transport, but not in the number of parts.
Table 2 shows the consumption rate of each part type by each workstation and the pieces of equipment that can be used to transport parts of that part type.In an actual application, the consumption rates would be derived from the demand prior to the start of the MH planning process; during the process they are constant (see Table 2).The consumption rates and equipment combination are generated randomly for the experiments.The equipment type combination and equipment limit on number of part types must be checked while moving the delivery from one route to another during the execution of a plan change action to ensure no physical constraints are violated.
The RL MH plan improvement process was tested with six different initial MH plans; those initial plans are summarized in Table 3. Recall that Table 3 describes the initial MH plans; the learning process is intended to modify an initial plan in ways that will improve it.In the table, the first column identifies the initial plan.The second column shows how many parts are present in each workstation's temporary parts buffer at the beginning of the shift; the two values in the column, 2.0 and 0.2, indicate 200% and 20% of the workstation's consumption rate of each part type respectively.The third column specifies the number of the deliveries present in each route of the plan.If the value in the column is 1, each route in the plan consists of one piece of the equipment delivering one type of part to one workstation.If the value is > 1, then a route may include multiple deliveries initially.The fourth column shows the frequency at which the workstation is visited; the two values 12 and 24 correspond to deliveries of parts to the workstation every 30 min and every 15 min respectively.Finally, the fifth column shows the state the learning process starts in for that plan.Table 4 shows the weights and cost per unit of response variables used in the experiment.The weights of the response variables were elicited from material handling subject matter experts at the Steelcase office furniture factory in Athens  Alabama.However, the cost values should be treated as relative, not absolute, i.e., we do not assert that the current implementation accurately estimates actual material handling costs, only that it can reduce those costs.As the figures show, the majority of the learning occurred within the first 250 actions, after which the cost of the MH plan tended to stabilize.Although the cost stabilized early, and only the first 1000 actions are shown for each initial plan, in total the algorithm was run for 3000 actions so that the learning agent had ample opportunity to explore and improve the plan.The slower stabilization observed for initial plan 5 suggests that the frequent visits to the workstation in that initial plan slowed the rate at which the plan could be improved.
Table 5 shows the final RL action values of all the actions for each state and for each initial plan.The action values were calculated using Bellman updates after each action selection as shown in Fig. 9  for a particular action in a particular state indicates that selecting that action in that state could lead to less desirable rewards.These values are the average of the five runs of 3000 iterations of each initial plan.The action values indicate that in the Penalty state an increase in the number of parts and the number of trips is appropriate whereas a decrease in the number of parts and trips is not appropriate, and vice-versa when the agent is in the No penalty state.The remaining actions generally lead to plan improvements (i.e., cost reductions) when there is no penalty, which the learning agent is expected to learn.Table 6 shows the mean MH cost reduction for factory layout 1 for each initial plan, averaged over 5 runs of 3000 actions each.As can be seen by comparing initial cost to final cost, the RL process produced a significant reduction in the total MH cost for every initial plan.

(line 15). A lower action value
For both factory layouts, the RL MH plan generating algorithm and the abstract discrete event simulation of the factory serving as the reward function were implemented in C++ using the Visual Studio 2017 framework on a Dell Core i5 computer.Each run of 3000 actions required approximately 5 min to execute on that computer.The Floyd-Warshall all-pairs shortest-paths algorithm [46] was used to pre-calculate the distances between the places in the layouts before learning began.This section reports and analyzes the experimental results for the second factory layout.Any details of the factory, the material handling equipment, the initial material handling plans, or of the experimentation not explicitly reported for factory layout 2 should be assumed to be the same as for factory layout 1.Factory layout 2 is approximately twice as large as factory layout 1 and has more complex connectivity, and thus presented a greater challenge to the RL algorithm.As shown in Fig. 17, factory layout 2 consisted of 20 workstations (k 1 , k 2 , k 3 , …, k 20 ), 14 intersections (x 1 , x 2 , x 3 , …, x 14 ), and one warehouse (h 1 ).
The 20 workstations consume a total of 20 different types of parts.Table 7 shows the consumption rate of each part type by each workstation and the pieces of equipment that can be used to transport parts of that part type.
The RL MH plan improvement process was tested for factory layout 2 with six different initial MH plans; those initial plans have the same descriptions as for factory layout 1 (see Table 3).The weights and costs of the reward response variables were also the same as for factory layout 1 (see Table 4).As the figures show, for the more complex factory layout the learning process required more actions.The majority of the learning occurred within the first 500 actions, after which the cost of the MH plan tended to stabilize.Although the cost stabilized early, and only the first 1000 actions are shown for each initial plan, in total the algorithm was run for 3000 actions for each initial plan.Table 8 shows the mean MH cost reduction for factory layout 2 for each initial plan, averaged over 5 runs of 3000 actions each.As can be seen by comparing initial cost to final cost, the RL process produced a significant reduction in the total MH cost for every initial plan.

Conclusions
All six of the research questions listed in Sect. 1 have been addressed.In particular, this work produced these results: • A set of useful MH plan change operators was defined.This was challenging because the actions were intended to simultaneously improve, or at least not degrade, multiple reward response variables.Similar operators may enable RL to be applied to complex manufacturing environments with a large number of variables and where limited test data sets are available to train the system.• A potentially very large state space was simplified using information from the reward calculation.Often, RL applications for problems with many states use neural networks or other non-linear function approximators to identify states based on carefully hand-selected features.However, such approximators typically require large amounts of training data to learn the correct weights.The approach used here, deriving states from the reward calculation, can reduce the state space without requiring training data.• Similarly, a potentially large set of RL actions was simplified using separate action selection and action parameterization stages; the latter stage exploited the supplemental information produced by the reward calculation.• An important criterion for state transition, the mean parts threshold value, was learned concurrently with learning the action selection policy.This possibility was previously identified as future work in [16].• During the experimental runs a sequence of increasingly efficient plans were produced from each initial plan, thereby enabling an MH manager to select from among those plans the one that is best suited to their requirements.• As shown in Tables 6 and 8, the RL process consistently reduced MH costs across a range of different initial plans; the minimum cost reduction for both layouts and all 12 initial plans was approximately 54.7% and the mean cost reduction was approximately 82.8%.For most real-world factories, a reduction in MH costs of even a tenth of those magnitudes would be very significant.The experimental results demonstrate that a conventional RL method (temporal learning) was able to learn a policy that could effectively reduce the cost of material handling plans.In this research, the action values were estimated using a one-step backup operation, i.e., the action values of a given state were used to estimate the action values of the previous state using the simulation-calculated immediate reward.Actions which reduced the reward, i.e., the cost of the material handling plan, were selected greedily in each state by following the standard ε-greedy policy.As is typical in RL approaches, the learning agent repetitively visited each state, using the estimated and recorded action values to learn an effective policy [16].A simulation-calculated cost was used as an immediate reward by the RL algorithm, and some of the additional non-cost information produced by the reward calculation was used by problem-specific heuristics to parameterize the selected action; this heuristic action parameterization is unlike conventional RL algorithms.One of the contributions of this research is this demonstration that for complex problems such as material handling planning, additional reward calculation information, which we call reward observation, can be useful along with recording the action values of each state for generating and applying an efficient policy.
Overall, this work demonstrates that RL can be successfully applied to a realistically complex MH problem with a multi-objective reward function.A key feature of this research is its potential to lead to practical application, which distinguishes it from much of the prior work on applying RL to MH.The inputs to the RL process included factory layout, equipment capacities, part types, workstation consumption rates and buffer capacities, and plan duration.These are also common inputs used in practical MH modeling applications.Because the cost data and factory layouts used in the experiments were based on a real factory and the discrete event simulation used to evaluate the MH plans is consistent with the discrete event simulations typically used in the manufacturing industry, it is reasonable to expect that an RL algorithm using multiple reward response variables and a two-stage action selection process can be potentially applied to real world MH plan generation in response to changes in demand.

Fig. 3
Fig. 3 Relationships between the related work topics

Fig. 5
Fig. 5 Example of a factory layout represented as a graph Speed of a piece of equipment in plan P (2.235 m/s) t(a, b) m(a, b) / s v = travel time between two vertices in the factory, such that a, b ∈ V F and (a, b) ∈ E F M

Figures 11 ,
12, 13, 14, 15 and 16 show the results of the experiments for the six initial plans; there is one figure for each of the initial plans.In each figure, the x axis represents the number of plan change actions taken and the y axis represents the MH cost of a plan after each action.The RL process was run five times for each of the six initial plans.The narrow blue lines show the MH costs for the five individual runs for each initial plan and the wide red line shows the mean MH cost of those five runs.

Figures 18 ,
19, 20, 21, 22 and 23  show the results of the experiments for the six initial plans; there is one figure for each of the initial plans.As before, in each figure, the x axis represents the number of plan change actions taken and the y axis represents the MH cost of a plan after each action.The RL process was run five times for each of the six initial plans.The narrow blue lines show the MH costs for the five individual runs for each initial plan and the wide red line shows the mean MH cost of those five runs.

Fig. 18
Fig. 18 MH cost improvement for initial plan 1 on factory layout 2

Fig. 23
Fig. 23 MH cost improvement for initial plan 6 on factory layout 2

Table 2
Workstations, consumption rate, and allowable equipment for each part type for factory layout 1.

Table 4
Weights and costs of the reward response variables

Table 7
Workstations, consumption rate, and allowable equipment for each part type for factory layout 2

Table 8
Reduction in MH cost produced by the RL process for factory layout 2