1 Introduction

Machine learning approaches such as Reinforcement Learning (RL) have the potential to solve problems in various manufacturing-related areas, leading to improvements in terms of time, cost, and quality [1]. One promising application field for RL is factory layout planning [2], which is a central element of the factory planning process [3]. Existing literature on the subject mainly focuses on the principal utilization [2] and the scalability of the RL-based planning approach [4]. However, the potential to support planners in the early stage of layout planning is highlighted without presenting a holistic framework that covers the corresponding application steps and the required information. This paper addresses this research gap by defining requirements that a factory planning approach has to meet. These requirements form the basis for the framework developed for RL-based factory layout planning.

The paper is structured as follows: First, an introduction to factory layout planning (Subsect. 2.1) and the existing planning approaches (Subsect. 2.2) is presented. Subsection 2.3 introduces RL and its functionality, followed by the research gap in Sect. 3. Section 4 presents the requirements as well as the developed framework, followed by an evaluation of the framework. Section 5 consists of a conclusion and a summary of future developments.

2 State of the Art

2.1 Introduction to Factory Layout Planning

Factory layouts have to be adapted to changing influential factors such as technological innovations, shortened product life cycles, and changing market conditions [5]. Consequently, factory layout planning is frequently performed to ensure an economical, efficient, flexible, and versatile production under consideration of all external and internal characteristics [6].

Factory layout planning problems can be divided into the new planning (greenfield planning) and the restructuring (brownfield planning) of a factory. The early stage of layout planning mainly focuses on the positioning of functional units (e.g. machines) in a given space without considering all planning details [7]. The result of the early stage can be a block layout that predefines the main characteristics of the future factory. It forms the starting point for further detailed manual planning steps in the later phase of layout planning. Consequently, the early planning phase is of high importance, and it needs to be ensured that the initial solution quality is as high as possible [3].

In the early stage of layout planning, a variety of different and partially conflicting goals must be considered. Examples are the minimization of transportation costs, transportation intensities, throughput times, locked capital in the form of unused inventory stocks, and area demand while maximizing the transparency of the material flow, machine utilization, and supply readiness. Furthermore, boundary conditions such as the floor-bearing capacity, safety requirements, and media supply and disposal have to be considered [8]. In the case of brownfield planning, it must also be ensured that the restructuring costs do not exceed the positive effects of adapting the existing factory to new circumstances [9].

Among the multiple goals, the material flow related target variables have a predominant position. The material flow can be defined as the interconnection of all operations in the processing and distribution of material goods within defined areas of manufacturing. Thus, the material flow describes the links between functional units and incorporates not only transport processes but also storage and handling processes [10]. The optimization complexity of the material flow and the corresponding material flow devices, such as conveyors, storage, and picking technology, is increased by multiple influential factors, such as [11]:

  • Increase in product complexity and variant diversity

  • Increase in the complexity of production networks

  • Smaller batch sizes

  • Shorter product life cycles

Therefore, multiple, partially dynamically changing interdependencies must be considered, for example the influence of order-related fluctuations on interlinked processes. Without considering dynamic effects in the planning phase, disturbances in the material flow can only be detected and eliminated during the operation phase. However, costly adaptations in the operation phase can be avoided by anticipative planning. Thus, the material flow in a manufacturing system can only be planned sufficiently if dynamic effects are already considered during layout planning [12]. Hence, discrete event material flow simulations (DES) are commonly used to analyze and optimize the material flow and its dynamic and stochastic characteristics. The central element of a DES is the simulation model, which is defined as a digital representation of the system behavior and its interdependencies. A DES allows performing simulation experiments with different material flow configurations to analyze their influence on throughput times or sensitivity to disruptions. This makes it possible to design a robust and well-performing material flow that leads to reduced operational costs [11].

In summary, a holistic planning approach needs to incorporate multiple planning objectives, the necessary boundary conditions, and a DES to appropriately consider the dynamic effects in the material flow.

2.2 Approaches for the Early Phase of Factory Layout Planning

The existing factory planning approaches can be divided into manual and computer-aided approaches. Manual planning requires expertise and can be characterized as a creative problem-solving process [12]. The starting point is usually an initial layout that was generated without consideration of detailed boundary conditions, such as geometrical restrictions (ideal layout). The cycle method, according to Schwerdtfeger, is an exemplary approach that can be used to generate the initial layout [13]. Another example is the Computerized Relative Allocation of Facilities (CRAFT) algorithm [14]. Furthermore, Sankey diagrams, which visualize the transportation intensity between functional units, can be used to generate the initial layout. Manual planning aims at transferring this initial layout into multiple layout configurations that satisfy the building-related boundary conditions of the factory (real layout) [13]. This process is often time-consuming due to the high variety of different positioning options. Consequently, only a limited number of factory layouts are generated in practice, and it can be assumed that the optimal layout configuration is difficult to find. The complexity and time consumption of generating one layout variant further increase in the case of multi-objective optimization [15]. Technologies such as virtual reality (VR) or augmented reality (AR) can be used to support the manual planning task. VR and AR applications can be interlinked with a DES, which allows analyzing material flow properties in detail [16]. However, building the VR/AR planning environments is itself time- and resource-consuming, which limits their industrial applicability [17].

Computer-aided approaches support the planner in the early phase of layout planning by generating several layout variants. These approaches consider building-related boundary conditions and objective functions in the form of equations, which are aggregated into an optimization problem called the facility layout problem [18]. However, since not all planning details are considered in the facility layout problem, a manual detailed planning step is required in the later phase of layout planning [3]. Computer-aided approaches can be further divided into exact and approximative planning approaches.

Exact planning approaches are built upon mathematical formulations such as the Branch-and-Bound method and calculate the optimal solution for a given problem. This is, however, computationally expensive since the facility layout problem is categorized as NP-hard. Consequently, such exact approaches are only suitable for small exemplary problem sizes [19].

Approximative planning approaches overcome the computational problem by generating near-optimal solutions. Commonly used solving methods (metaheuristics) are the genetic algorithm (GA), simulated annealing (SA), or the adaptive large neighborhood search [20]. The majority of the existing approaches are applied to single-objective optimization problems. An example is presented by Lin et al., who use a GA to optimize a layout according to the resulting transportation costs [21]. Chen et al. optimize a layout according to work-in-progress effects using SA [22]. Besides, novel techniques use quantum annealing to minimize the transportation intensity for different problem sizes within seconds [23].

Even though these approaches can generate valuable solutions, single-objective optimization seems inappropriate for the facility layout problem since usually multiple objectives must be considered at the same time. Only a few approaches allow optimization under consideration of multiple objectives [20]. Guan et al. use a particle swarm approach to optimize multiple workshops regarding the transportation distance, the number of workshops, and the workshop floor utilization [24].

Most such approaches rely on the optimization of analytically formulated objective functions that are defined using assumptions and simplifications, such as optimizing according to the transportation intensity. Dynamic performance metrics of the material flow, such as the throughput time, are often neglected [25], which can lead to a mismatch between a generated solution and the real behavior of the manufacturing system. Consequently, a DES is a more suitable basis for multi-objective optimization than analytically described objectives [12].

Our previous work highlights the potential of an RL-based factory planning approach [2, 4, 26]. RL-based factory layout planning is capable of solving problems of varying sizes by learning the metrics of the problem [4]. Furthermore, RL has been successfully used in other disciplines in combination with simulation environments [27], which also bears potential for an RL- and simulation-based factory planning approach. However, to extract the full potential of RL-based factory layout planning, a holistic framework is needed that describes the application phases, the information flow, and all sub-steps in detail.

2.3 Introduction to Reinforcement Learning

Besides unsupervised and supervised learning, RL constitutes a third subclass of machine learning algorithms. These algorithms have the ability to learn complex relationships between different parameters by extracting patterns from data. While supervised and unsupervised learning require existing training data, RL approaches generate their training data within the training process [28]. Figure 1 visualizes the structure of an RL approach, which consists of an agent and the environment. As depicted in Fig. 1, the agent interacts with its environment at timestep \(t\) by selecting an action \({A}_{t}\) based on the current state \({S}_{t}\). The environment receives the action and transitions to the next state \({S}_{t+1}\). Furthermore, a reward \({R}_{t}\) is returned to the agent. The agent aims at learning a strategy (policy) that maximizes the accumulated reward over an episode. Consequently, the reward is used to incentivize a certain behavior [29].
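
As an illustration of this loop, the following minimal Python sketch runs one episode against a toy environment; the environment dynamics and the random policy are placeholders of the sketch, not part of the cited approaches:

```python
# Minimal sketch of the agent-environment loop: the agent selects A_t based on
# S_t, the environment returns S_{t+1} and R_t. Environment and policy are toys.
import random

class ToyEnvironment:
    """Illustrative environment: the state is a step counter, episodes last 10 steps."""
    def reset(self):
        self.t = 0
        return self.t                         # initial state S_0

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0  # R_t incentivizes a certain behavior
        done = self.t >= 10                   # end of the episode
        return self.t, reward, done           # S_{t+1}, R_t, termination flag

env = ToyEnvironment()
state, episode_return, done = env.reset(), 0.0, False
while not done:
    action = random.choice([0, 1])            # A_t, here from a random policy
    state, reward, done = env.step(action)    # environment transition
    episode_return += reward                  # the agent maximizes this accumulated reward
print(f"Accumulated reward over the episode: {episode_return}")
```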

A variety of alternative agent architectures exist. They can be categorized into value-based, policy-based, and hybrid (Actor-Critic) approaches according to their action selection and policy improvement process. Value-based approaches either estimate the value of selecting an action in the current state (action-value function) or the value of being in a state (state-value function). In contrast, policy-based approaches map the state to a probability distribution that is used directly for the action selection process [30].

One of the most commonly used architectures is the Double Deep Q-Network (DDQN) agent, which is categorized as a value-based approach. This agent is characterized by a stable and converging training behavior. Since two artificial neural networks (ANNs) build the basis of the approach, DDQN can be used for problems with large state and action space representations. However, it only allows the selection of actions from a discrete action space [31]. The DDQN agent uses a replay buffer to store its experience in the form of transitions. One transition consists of the following elements:

  • State \({S}_{t}\)

  • Action \({A}_{t}\)

  • Next State \({S}_{t+1}\)

  • Reward \({R}_{t}\)

A subset of all transitions stored in the experience replay buffer is sampled after each step to train the agent by adjusting the weights of the DDQN networks according to the loss function [31]. The training process ends after a defined number of episodes, a certain training duration, or after reaching a performance threshold. Within the training process, the trade-off between exploration and exploitation is of special importance. The agent exploits if actions are selected according to the policy, while exploration describes behavior that deviates from the policy. Only with sufficiently high exploration rates is it possible to experience novel state-action combinations, which might help to overcome local minima. However, reaching meaningful states is only possible if the policy is exploited [29].
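
A minimal sketch of such a replay buffer follows; the capacity and batch size are illustrative assumptions, not the values used in [31]:

```python
# Illustrative sketch of an experience replay buffer storing transitions
# (S_t, A_t, S_{t+1}, R_t); capacity and batch size are assumptions.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "next_state", "reward"])

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded when full

    def store(self, state, action, next_state, reward):
        self.buffer.append(Transition(state, action, next_state, reward))

    def sample(self, batch_size=32):
        # A random subset of stored transitions is used to update the network weights
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

buffer = ReplayBuffer()
buffer.store(state=0, action=1, next_state=1, reward=-0.5)
batch = buffer.sample()  # sampled between steps to train the agent
```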

Continuous actions require different architectures, such as the Proximal Policy Optimization agent, which can be categorized as a hybrid approach [32]. Hybrid approaches aim at combining the positive effects of both value- and policy-based approaches. Precisely, a hybrid approach consists of two components: the actor and the critic. The actor selects an action according to its policy and is categorized as policy-based. The policy is updated according to the feedback of the value-based critic. Consequently, both approaches are combined, leading to increased performance [33].

Besides, action masking methods can be applied to ensure that all selected actions are valid. Consequently, the agent does not have to learn the difference between valid and invalid actions, which reduces the number of selectable actions and improves the learning behavior [34].
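
A minimal sketch of action masking for a value-based agent, assuming the environment supplies a Boolean validity mask (e.g., marking already occupied grid positions):

```python
# Sketch of action masking: invalid actions are excluded before the greedy
# selection, so the agent never needs to learn that they are invalid.
import numpy as np

def masked_greedy_action(q_values: np.ndarray, valid_mask: np.ndarray) -> int:
    """Select the valid action with the highest Q-value."""
    masked_q = np.where(valid_mask, q_values, -np.inf)  # invalid actions can never win
    return int(np.argmax(masked_q))

q = np.array([0.2, 0.9, 0.4, 0.7])
mask = np.array([True, False, True, True])  # action 1 is invalid (e.g., occupied position)
print(masked_greedy_action(q, mask))        # -> 3
```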

Fig. 1. Conventional reinforcement learning architecture consisting of an agent and its environment. The agent interacts with the environment by selecting actions based on the current state and receives a reward as a feedback signal.

RL has several potential application fields and has been successfully applied to manufacturing-related problems. Examples are robot arm path planning in human-robot collaboration environments [35], job shop scheduling [36], or RL as part of an energy management system that reduces energy-related costs in a manufacturing system [37]. Furthermore, RL has the ability to reach superior performance in complex decision-making problems with dynamic influences compared to existing approaches or manual methods. The reason for this superiority is the ability to extract and learn the patterns of a problem with only limited information. Hence, only the reward signal is necessary to develop a strategy, in contrast to classic optimization problems and solving strategies that are based on defined behavior rules [38].

Compared to standard metaheuristics, three advantages of the RL technique can be summarized. First, the agent can learn a strategy without rules being defined for each situation. This allows optimized behavior even in situations that deviate from the normal case. Second, the agent can develop novel strategies that exceed the existing problem-solving knowledge. Third, the problem patterns are stored in the model of the ANN. Consequently, the same model can be reused to solve a similar problem, which is not possible with standard metaheuristics such as GA or SA.

3 Research Gap

The advantages described in the previous section can be valuable in the context of factory layout planning for the following reasons. Factory layout planning is characterized as a decision-making process with a high combinatorial complexity that requires a high solution quality to reduce the operational costs to a minimum. The complexity of solving this problem is increased since dynamic effects due to changing interdependencies between functional units have to be incorporated, which can be performed most appropriately using a DES. RL has demonstrated in other application fields that it is suitable to solve such complex decision-making processes since it is capable of learning the (often non-linear) problem metrics. Furthermore, it has been successfully used in combination with simulation environments for problems that can hardly be described by distinct rules.

Previous research also highlights the match between the challenges of the facility layout problem and the capabilities of RL. However, it mainly focused on the principal utilization [2] and on ensuring the scalability of the existing approach [4].

Yet the application perspective remains undiscussed. It is unclear which information is needed in order to use an RL-based factory planning approach and how this information is processed.

This paper aims at closing this gap by improving the applicability of an RL-based factory layout planning approach with the development of a holistic framework that is divided into five steps. Within the framework, the information flow and all sub-steps are described in detail starting at the initialization phase of the approach up to the utilization of the results.

4 Framework for Factory Layout Planning Using Reinforcement Learning

4.1 Requirements

The first development step for the application of the RL method involves the definition of requirements for a holistic factory layout planning approach from a general perspective. These requirements define the boundary conditions for application purposes and influence the required information basis as well as the needed expertise to use an automated planning approach. The requirements are then transferred to an RL-related framework.

Requirement 1: Greenfield and Brownfield Planning of Factory Layouts

Factory planning can be divided into the brownfield planning (restructuring) of an existing factory and the greenfield planning of a new factory. Holistic approaches consider greenfield planning as a special case of brownfield planning without restructuring effort. Hence, the problem and its boundary conditions must be modeled accordingly to allow the application to both cases.

The majority of existing planning approaches focus only on greenfield planning. However, brownfield planning should not be neglected, since minor changes in the factory layout do not require entirely new planning but a slight variation of the existing layout. Furthermore, brownfield planning is performed more frequently than greenfield planning. A holistic approach should therefore incorporate both planning cases, even though the characteristics and the considered optimization objectives in brownfield planning differ slightly from those in the greenfield case.

Requirement 2: Multi-objective Optimization

Within layout planning, multiple, partially conflicting planning objectives can be considered. Consequently, a holistic planning approach should allow the selection and prioritization of multiple optimization objectives at the same time. These can be divided into analytic and dynamic optimization objectives. Dynamic planning objectives, such as the throughput time, should be incorporated using a DES. Not all objectives, however, require a simulation: analytic objectives, such as the area demand, can be calculated directly. Layout planning problems have a heterogeneous character and require the consideration of different objectives. Consequently, a holistic approach should be able to consider one or multiple analytic and dynamic objectives that apply to a large variety of existing planning scenarios. Existing approaches often use only a single objective for optimization. Furthermore, only a minority includes simulation results in the optimization phase.

Requirement 3: Practical Relevance of the Planning Results (Degree of Abstraction)

The applicability of the approach is directly connected to the degree of abstraction. A planning approach should generate results that are close to reality. To achieve this, diverse boundary conditions must be considered, for example the floor-bearing capacity or the media supply and disposal. Furthermore, the modeling approach defines whether functional units can be placed freely or only in defined positions. Modeling the facility layout problem as an open-field layout allows a high degree of freedom, since more placement options are available compared to a single-row planning problem.

The degree of abstraction differs significantly among the existing approaches. The majority model the problem as an open-field problem but only consider the width and length of the functional units. Boundary conditions such as the floor-bearing capacity or the availability of media supply are often neglected.

Requirement 4: Scalability

The scalability requirement ensures the industrial applicability of a layout planning approach. As described in Sect. 2, layout planning requires high-quality solutions to reduce the operational costs. However, high solution quality should be combined with a reasonable computation time to ensure a fast planning process. Ensuring scalability with a rising number of functional units is a central challenge in the development of automated planning approaches.

The scalability of existing automated approaches differs depending on the algorithm used. RL seems to be a valuable tool to overcome this problem since it has demonstrated in other application fields that it is capable of solving problems with a large degree of combinatorial complexity while ensuring high-quality results.

Requirement 5: Accessibility

The initialization of computer-aided, automated planning approaches can be challenging since mathematical formulations for optimization objectives and boundary conditions have to be defined. This process requires a large degree of expertise. Consequently, the planning approach should support the planner by providing suitable recommendations to reduce the manual modeling effort.

Requirement 6: Comprehensibility of the Planning Results

The last requirement is especially important for computer-aided, automated planning approaches. In contrast to manual planning approaches, computer-aided approaches lack comprehensibility since the optimization process is, to some degree, characterized by probabilistic influences. This effect is even more pronounced in the case of an RL-based planning approach since a machine learning model can be considered a black-box model. However, it is important to generate comprehensible and trustworthy results. Consequently, a holistic planning approach should be able to provide additional information regarding the main influencing factors as well as the degree of uncertainty in the solution. This increases the trustworthiness and supports the applicability of an automated planning approach. Conventional metaheuristics also lack comprehensibility since the solution process is characterized by probabilistic operations such as mutations in the GA. RL, in contrast, offers the potential to analyze the solution strategy (policy) if suitable explainable RL methods are applied.

4.2 Description of the Framework

The described requirements build the basis for the framework depicted in Fig. 2. As indicated in this figure, the framework consists of five sequentially executed steps.

Step 1: Initialization of the Layout Planning Problem

In the first step, all major characteristics of the layout planning problem are defined. First, the planning case (greenfield or brownfield planning) must be specified. It directly influences the initialization of the layout characteristics and optimization objectives, since brownfield planning requires an existing initial layout while greenfield planning does not.

The 4 dynamic and 5 analytic optimization objectives are summarized in Table 1. They can be considered individually or in an arbitrary combination, depending on the problem characteristics and strategic goals of the company. The target variables have a heterogeneous character and can thus be combined as desired. An exception are the transportation intensity and the throughput time: in the case of mass production that uses conveyor belts for material transportation, an optimization regarding the throughput time or the transportation intensity can lead to similar results. In these cases, only one of the two objectives should be considered at a time. In the brownfield planning case, the objective of similarity is available and a mandatory requirement. Similarity to the existing layout helps to reduce the restructuring effort. As mentioned in Sect. 2, restructuring is only profitable if the restructuring costs do not exceed the positive effects of the optimization. Consequently, the objective of similarity should always be combined with an additional objective to generate a layout that differs slightly from the existing solution.

The layout characteristics influence the available optimization objectives. They summarize the geometrical and boundary-related properties of the layout and can be divided into mandatory and optional conditions. A mandatory condition is the geometrical 2D shape of the layout. Optional properties are information regarding existing media supply, floor-bearing capacity, and height information. In the best case, all optional properties are available, since optimization regarding media supply and disposal is only possible if this information is provided. If the information is not provided, optimization regarding the throughput time or transportation intensity is still possible. Consequently, the available information density of the layout characteristics directly influences the practical relevance of the planning results (Requirement 3).

The layout characteristics are also interlinked with the characteristics of the functional units. The functional units must provide the same information density as the layout, and vice versa. Hence, the 2D shape of each functional unit is mandatory information, while additional information will only increase the practical relevance of the solution. Furthermore, the following general, process-related information must be provided (a hypothetical data structure is sketched after the list):

  • Identification number or name of each functional unit

  • Functional unit type: warehouse or processing unit

  • Maximum product storage capacity of each functional unit

  • Initial storage inventory at the start of the simulation of each functional unit

  • Information per processing mode:

    ○ Processing time

    ○ Setup time

    ○ Input-output relationship
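
The following dataclass sketch illustrates one possible way to capture this information; all field names and types are illustrative assumptions rather than a schema defined by the framework:

```python
# Hypothetical data structure for the functional unit information listed above;
# names and types are assumptions of this sketch.
from dataclasses import dataclass, field

@dataclass
class ProcessingMode:
    processing_time: float           # e.g., minutes per part
    setup_time: float                # changeover duration
    input_output: dict[str, int]     # input-output relationship: product -> quantity

@dataclass
class FunctionalUnit:
    unit_id: str                     # identification number or name
    unit_type: str                   # "warehouse" or "processing"
    footprint: tuple[float, float]   # mandatory 2D shape (width, length)
    max_storage_capacity: int        # maximum product storage capacity
    initial_inventory: int           # storage inventory at simulation start
    modes: list[ProcessingMode] = field(default_factory=list)
```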

Fig. 2. Framework for RL-based factory layout planning, structured as five sequentially executed steps containing loops to adjust earlier steps according to the results of later steps.

Table 1. Overview of optimization objectives

| Dynamic (simulation-based) objectives | Analytic objectives |
|---|---|
| Throughput time | Transportation intensity |
| Utilization of functional units | Media supply |
| Utilization of the material flow entities | Floor-bearing capacity |
| Traffic congestion | Clarity of the material flow |
| | Similarity (brownfield planning only) |

Finally, the material handling system has to be defined. This involves information regarding the means of transportation, external supply, and the considered jobs that can be defined using forecasting methods for future orders based on historical data.

For the means of transportation, the following information must be provided:

  • Identification number or name of the material handling systems

  • Loading capacity

  • Loading and unloading duration

  • Speed

The external supply delivers products to defined functional units, such as warehouses. Consequently, the amount of goods and the frequency of delivery need to be defined. The job-related information specifies the number of goods that should be produced and the corresponding production steps. Furthermore, control-related information must be provided, which involves the scheduling sequence of the jobs as well as a control strategy for the material flow, for example a push or pull strategy.

At the end of step 1, all characteristic information of the layout planning problem is defined. In the case of brownfield planning, this also involves an existing layout and information about additional or missing functional units for the new layout configuration.

Step 2: Initialization of the RL-Algorithm

The second step builds upon the defined characteristics of the first step. The agent of the RL approach consists of two ANNs with identical architecture that are designed to process the layout, material flow, and functional unit characteristics in order to select the position of each functional unit. The starting point for action selection is the current state of the layout combined with additional information regarding the material flow characteristics. Based on the information provided in step one, the state representation provides a different degree of information density. Mandatory information comprises the material flow characteristics, aggregated in the form of a transportation matrix, and the information about occupied and free space in the layout. The latter can be encoded by dividing the layout into a grid with \(n\) positions and transferring the positional information to a matrix. The additional but optional information (media, height, and floor-bearing capacity) can be transferred in a similar way. The input layer must fit the layout size and the amount of available information. One possible way is to use a graph neural network (GNN) to process the information of the transportation matrix, while a convolutional neural network processes the layout information (Fig. 3).

The GNN requires an embedding that can be generated by processing the material flow connections with a single dense layer in order to generate the initial node representation. Functional units are connected by edges if they have a material flow connection. Furthermore, pairwise attention scores between the functional units can be computed based on the outputs of the first dense layer. This allows incorporating the influence of connected nodes and reflects the interdependence between functional units. Finally, the result of this aggregation is added to the initial node representation and processed by an additional dense layer to obtain the final node embeddings for each functional unit.
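
The following NumPy sketch illustrates this embedding procedure under stated assumptions: the layer sizes, tanh activations, and softmax-based attention are choices of the sketch, as the text only outlines the structure (dense embedding, pairwise attention, aggregation, second dense layer):

```python
# Minimal sketch of the attention-based node embedding described above.
import numpy as np

rng = np.random.default_rng(0)
n_units, d = 5, 8                                    # functional units, embedding size

flows = rng.random((n_units, n_units))               # transportation matrix (material flow)
edges = (flows > 0.5) | np.eye(n_units, dtype=bool)  # material flow edges, plus self-loops

W_in = rng.normal(size=(n_units, d))
h0 = np.tanh(flows @ W_in)                           # initial node representation (first dense layer)

scores = h0 @ h0.T                                   # pairwise attention from the first layer's outputs
scores = np.where(edges, scores, -np.inf)            # attend only along material flow connections
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)              # row-wise softmax

h1 = h0 + attn @ h0                                  # aggregation added to the initial representation
W_out = rng.normal(size=(d, d))
embeddings = np.tanh(h1 @ W_out)                     # final embedding per functional unit (second dense layer)
print(embeddings.shape)                              # -> (5, 8)
```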

The architecture of the hidden layers can be defined without any restriction, while the output layer represents the placement options (action space). The functional units are placed sequentially by selecting the position of the bottom-left corner and the rotation. Consequently, the output layer consists of \({n}_{x}\) neurons used to select the x-position, \({n}_{y}\) neurons for the y-position, and 4 additional neurons to rotate the functional unit up to three times by an angle of 90° each. The final position is selected using the estimated Q-values: the \({n}_{x}\) Q-values of the first output head are compared, and the neuron with the highest value defines the x-position of the functional unit (the y-position respectively). The rotation is obtained by a similar logic: the first neuron implies a rotation of 0°, followed by 90° for the second neuron, up to 270° for the last one. All subparts are combined to define the final action that is executed.
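
To illustrate, a minimal sketch of this three-part action selection, with illustrative grid dimensions and random values standing in for the network's Q-value outputs:

```python
# Sketch of combining the three output heads (x, y, rotation) into one action.
import numpy as np

rng = np.random.default_rng(1)
n_x, n_y = 20, 12                          # assumed grid resolution of the layout

q_x = rng.random(n_x)                      # Q-values of the x-position head
q_y = rng.random(n_y)                      # Q-values of the y-position head
q_rot = rng.random(4)                      # Q-values for 0°, 90°, 180°, 270°

x = int(np.argmax(q_x))                    # neuron with the highest value -> x-position
y = int(np.argmax(q_y))                    # analogous for the y-position
rotation_deg = 90 * int(np.argmax(q_rot))  # first neuron = 0°, last neuron = 270°

action = (x, y, rotation_deg)              # placement of the bottom-left corner plus rotation
print(action)
```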

Fig. 3. Placement process and corresponding information. The input information is processed by a graph neural network and a convolutional neural network to obtain an action from the output layer.

Afterwards, the hyperparameters of the ANN and the agent have to be defined. This includes the optimizer, learning rate, activation functions, and agent-related parameters such as the exploration strategy. An alternative for this sub-step is the usage of a reference model that has already been trained to solve a similar layout planning problem. Such a model stores information about the solving process, which can be used to solve another problem without training the model from scratch (transfer learning). The weights of the model can be reused completely if the entire ANN structure of the pre-trained model matches the structure of the new ANN. If that is not the case, only parts can be reused (see the sketch below). Using a pre-trained model can lead to faster and better results, depending on the model quality and the problem characteristics (Requirement 4: Scalability).
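
As an illustration of such partial weight reuse, the following sketch copies only parameters whose names and shapes match; the toy architectures and the use of PyTorch are assumptions of the sketch, not prescribed by the framework:

```python
# Sketch of transfer learning: reuse only matching weights of a reference model.
import torch.nn as nn

reference = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))   # pre-trained model
new_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 12))  # different output size

pretrained = reference.state_dict()
target = new_model.state_dict()
compatible = {k: v for k, v in pretrained.items()
              if k in target and v.shape == target[k].shape}  # here: only the first layer matches
new_model.load_state_dict(compatible, strict=False)           # reuse the matching parts
print(sorted(compatible))                                     # parameters that were transferred
```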

Finally, the reward function (\({R}_{t}\)) is defined based on the following optimization objectives:

  • Throughput time (\({R}_{TT}\))

  • Utilization of functional units (\({R}_{UFU}\))

  • Utilization of the material flow entities (\({R}_{UMF}\))

  • Traffic congestion (\({R}_{TC}\))

  • Transportation intensity (\({R}_{TI}\))

  • Media supply (\({R}_{MS}\))

  • Floor-bearing capacity (\({R}_{FBC}\))

  • Clarity of the material flow (\({R}_{CMF}\))

  • Similarity (\({R}_{S}\))

Each objective \(i\) is prioritized using weight \({w}_{i}\) according to Eqs. 1 and 2. Within the environment, it needs to be ensured that each sub-reward only reaches values between −1 and 0, where better values lead to rewards closer to 0. Without this normalization, the prioritization would be ineffective.

$$\begin{array}{l} R_{t} = w_1 \cdot R_{TT}(t) + w_2 \cdot R_{UFU}(t) + w_3 \cdot R_{UMF}(t) \\ \quad + w_4 \cdot R_{TC}(t) + w_5 \cdot R_{TI}(t) + w_6 \cdot R_{MS}(t) \\ \quad + w_7 \cdot R_{FBC}(t) + w_8 \cdot R_{CMF}(t) + w_9 \cdot R_{S}(t) \end{array}$$
(1)
$$\sum_{i=1}^{9} w_{i} = 1$$
(2)
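
A minimal sketch of this weighted aggregation, assuming each sub-reward has already been normalized to [−1, 0] by the environment; the objective keys and values are illustrative:

```python
# Sketch of the weighted reward aggregation of Eqs. 1 and 2.
def aggregate_reward(sub_rewards: dict[str, float], weights: dict[str, float]) -> float:
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1 (Eq. 2)"
    assert all(-1.0 <= r <= 0.0 for r in sub_rewards.values()), "sub-rewards must lie in [-1, 0]"
    return sum(weights[k] * sub_rewards[k] for k in weights)  # Eq. 1

r_t = aggregate_reward(
    sub_rewards={"TT": -0.3, "TI": -0.5, "S": -0.1},  # throughput time, transport intensity, similarity
    weights={"TT": 0.5, "TI": 0.3, "S": 0.2},         # prioritization of the objectives
)
print(r_t)  # -> -0.32; values closer to 0 are better
```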

If the planner knows which prioritization is appropriate, only one model will be trained. If that is not the case, multiple prioritizations and corresponding weights need to be defined, leading to multiple training sets. Each training set will result in one final layout that is evaluated in step 4.

Step 3: Training

For each configuration of the reward function, a training set is initiated, which leads to a different layout variant at the end of the training. Before the training starts, the termination criterion needs to be defined; it can be linked to the number of training episodes, the training duration, or a reward threshold (see the sketch below).
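
A possible combined termination criterion could look as follows; all threshold values are illustrative assumptions:

```python
# Sketch of a termination criterion covering episodes, duration, and reward.
import time

def should_terminate(episode: int, start_time: float, best_reward: float,
                     max_episodes: int = 20_000,
                     max_duration_s: float = 3600.0,
                     reward_threshold: float = -0.05) -> bool:
    """Stop after a number of episodes, a training duration, or a reward threshold."""
    return (episode >= max_episodes
            or time.monotonic() - start_time >= max_duration_s
            or best_reward >= reward_threshold)

print(should_terminate(episode=21_000, start_time=time.monotonic(), best_reward=-0.4))  # -> True
```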

Within the training process, all improvements of the best-known solution are stored, containing the following information (a minimal logging sketch follows the list):

  • Current episode and training duration

  • Total reward and all sub-rewards

  • Weight of the prioritization

  • Position of all functional units (actions)
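
A minimal logging sketch under these assumptions; the record field names are illustrative:

```python
# Sketch of storing every improvement of the best-known solution during training.
class ImprovementLog:
    def __init__(self):
        self.best_reward = float("-inf")
        self.records = []

    def update(self, episode, duration_s, total_reward, sub_rewards, weights, actions):
        """Store a record whenever the total reward improves on the best-known solution."""
        if total_reward > self.best_reward:
            self.best_reward = total_reward
            self.records.append({
                "episode": episode,                 # current episode and ...
                "training_duration_s": duration_s,  # ... training duration
                "total_reward": total_reward,       # total reward ...
                "sub_rewards": dict(sub_rewards),   # ... and all sub-rewards
                "weights": dict(weights),           # weights of the prioritization
                "actions": list(actions),           # positions of all functional units
            })

log = ImprovementLog()
log.update(episode=1, duration_s=2.4, total_reward=-0.8,
           sub_rewards={"TT": -0.8}, weights={"TT": 1.0}, actions=[(0, 0, 0)])
```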

Fig. 4. Exemplary reward curve of a training process that stabilizes after 15,000 training episodes.

The changes in the training loss, the average reward, and the individual reward values can be observed, for example, using TensorBoard or self-built visualization tools. A stable and converging behavior is a sign of a successful training process, as depicted in Fig. 4. If the training process is not stable, a change in the ANN structure or the hyperparameters might lead to stabilization.

Step 4: Evaluation of the Training Results

The evaluation process aims at selecting one layout alternative for further detailed manual planning steps. The planner is supported by different visualization techniques. If multiple layouts are generated and multiple conflicting objectives are considered, a visualization of the corresponding Pareto frontier leads to valuable insights into the problem characteristics. By analyzing the Pareto frontier, the trade-offs between the objectives can be evaluated. Furthermore, different solutions can be selected in order to visualize the corresponding layout (Fig. 5). The most suitable layout is selected based on the preferences and the strategic goals.
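
For illustration, a simple sketch for extracting the Pareto frontier from a set of generated layouts, assuming all objectives are to be minimized (the layout names and objective values are made up):

```python
# Sketch of Pareto frontier extraction: keep layouts that no other layout
# dominates in all objectives.
def pareto_frontier(solutions: list[dict]) -> list[dict]:
    def dominates(a, b):
        obj_a, obj_b = a["objectives"], b["objectives"]
        return (all(x <= y for x, y in zip(obj_a, obj_b))
                and any(x < y for x, y in zip(obj_a, obj_b)))
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

layouts = [
    {"name": "variant A", "objectives": (4.0, 0.9)},  # (throughput time, transport intensity)
    {"name": "variant B", "objectives": (3.0, 1.2)},
    {"name": "variant C", "objectives": (5.0, 1.5)},  # dominated by variant A
]
print([s["name"] for s in pareto_frontier(layouts)])  # -> ['variant A', 'variant B']
```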

A second sub-step focuses on the field of explainable artificial intelligence (ExAI) and helps to increase the comprehensibility of the planning results (Requirement 6). ExAI methods provide information about the overall training behavior and can ensure that the reward function is aligned with the overall optimization objective. Furthermore, it is beneficial to evaluate the uncertainty in the placement process and to analyze in which situations a positioning conflict occurred (two or more functional units should be placed in the same position). This information can be extracted using policy summarization methods, which analyze the placement behavior of the RL model. Besides, feature relevance methods explain which input information has the biggest influence on the decision-making process. This information might also help to identify bottlenecks and positions of higher relevance in the layout. At the end of step 4, the most favorable layout alternative is selected as a starting point for step 5.

The loops in the framework allow repeating parts of the sequence if the planning results are not satisfying. Changes in the ANN structure in step 2 can lead to a change in the training behavior, and new prioritizations might lead to new layouts for the evaluation step. Furthermore, lengthening the training duration can lead to additional improvements in the solution quality.

Fig. 5. Exemplary Pareto frontier. Each data point corresponds to a different layout alternative. Two exemplary layouts are visualized.

Step 5: Manual Planning

The manual planning step is the final step of the framework and focuses on the detailed planning tasks for the layout variant selected in step 4. First, the final position of each functional unit needs to be defined. The explainability methods can highlight conflicting goals and uncertainties to support this process. Afterwards, the functional units are planned in detail, incorporating lighting conditions, emission limits, ergonomic conditions, and industrial safety to improve the workplace from an employee perspective. The detailed planning step can be supported by AR or VR technology, which assists the planner in designing the workspace according to the considered goals.

4.3 Evaluation of the Framework

In this subsection, the framework and the corresponding RL-based planning approach are evaluated under consideration of the requirements defined in Subsect. 4.1. The placement process of the functional units is modeled in a way that allows for both planning a new factory layout and restructuring an existing one. The only difference concerns the input information and the reward function: brownfield planning requires an initial layout configuration, and the reward function must contain the similarity reward. Consequently, Requirement 1 is satisfied. Furthermore, the framework allows selecting and prioritizing one or multiple objectives at the same time, which can be calculated based on a simulation or an analytic formulation; this satisfies Requirement 2. However, until now, no simulation-based approach has been published.

The practical relevance of the planning results (Requirement 3) can only be fully ensured if the mandatory and optional information is provided. Furthermore, the free placement process (open-field layout) leads to a higher level of practical relevance compared to commonly used single- and multi-row planning approaches. In addition, if only dynamic objectives are considered, the number of simplifications is smaller, since the DES allows a more realistic representation of the manufacturing system than a pure evaluation based on analytic formulations.

The scalability requirement can be ensured using an action masking method, as introduced in [4], which prevents the selection of invalid actions, speeds up the learning phase, and makes the approach applicable to problems of different sizes.

Requirement 5 (accessibility) is improved by this research by presenting the sequential configuration process and highlighting the mandatory and optional information basis. The process can be further improved by developing a graphical user interface that supports the planner during the initialization and evaluation process.

The last requirement (comprehensibility of the planning results) is still an open research topic. The presented framework contributes to this open research question by providing a first overview of the valuable information that can be generated by ExAI methods. However, the ANNs of the agent can be considered black-box models. Consequently, reaching full transparency is challenging. Nevertheless, the methods mentioned in step 4 can increase the trustworthiness and provide the necessary comprehensibility to apply the framework in industrial applications.

5 Conclusion and Outlook

Factory layout planning is an important but time-consuming process. Recently developed RL-based planning approaches have the potential to support the planner in the early planning phase by generating optimized layout variants. This paper contributes to these recent developments by presenting a holistic framework that increases the applicability of the RL-based approach by describing the necessary steps and the underlying information flow required to utilize the layout planning potential. Consequently, the framework can be used as guidance on how to initialize, train, and evaluate an RL-based factory layout planning approach. The framework is developed based on six layout planning-related requirements. It consists of five steps: the initialization of the layout planning problem, the initialization of the algorithm, multiple training sets, the evaluation of the training results, and the manual planning step for the selected layout variant. Each step consists of multiple sub-steps that are interlinked by an information flow. The framework describes for each sub-step which information is needed and further distinguishes between mandatory and optional information.

Furthermore, the framework provides guidance for further developments. Future work will consequently focus on developing a discrete event material flow simulation and the necessary interfaces to integrate it into the environment of an RL approach. The development aims at including the dynamic optimization objectives presented in this framework and further allows validating the framework in detail. Besides, the scalability of multiple optimization objectives needs to be investigated since the complexity increases compared to the existing single-objective investigations. Furthermore, explainable artificial intelligence methods will be developed to provide insights into the training process. For example, a policy summarization method can be used to increase the trustworthiness by providing information about the reward structure, the placement uncertainties, and the alignment of short-term rewards with the overall optimization objective.