1 Introduction

In recent decades, engineers have significantly improved the crashworthiness of vehicles, driven by ever-increasing requirements for the passive safety of automobiles imposed by legislation and consumer protection. The simulation of increasingly complex crash models using the finite element method (FEM) supports this development process. The simulation of crash-loaded structures is subject to strong nonlinearities caused by large structural deformations, contact between adjacent components, plasticity in the material model and failure of individual components. In addition, simulation results from crash analyses suffer from numerical noise. Depending on the model complexity and the available computing resources, a single simulation of a crash model can take hours to days. In this context, the use of algorithms for the automatic optimization of structures is difficult. Although sensitivities can be calculated numerically through finite differences, this is not only time consuming but also unreliable due to the complexity of crash simulation outlined above. Here, sensitivities are the derivatives of the objective functions and constraints of the optimization problem with respect to the design variables.

Algorithms of topology optimization can be used to improve the layout of a structure. In topology optimization, the mechanical structure is improved by adjusting the shape and position of structural components. The development of methods for the topology optimization of crash models is a current research topic, and the approaches are manifold. One possibility is to obtain useful gradients for the optimization, as shown in [1], where topological derivatives are determined via the adjoint equilibrium equation.

The Equivalent Static Loads Method (ESL) transforms the loads of a nonlinear problem into several static problems. The loads are determined such that the displacements in an equivalent linear simulation correspond to those at a specific time step of the nonlinear simulation. Sensitivities can then be determined for the static problems and used to perform the topology optimization [2,3,4].

In addition to the calculation of usable sensitivities, there are also methods for topology optimization that do not use direct sensitivity information. Instead, these use engineering knowledge that guides the optimization in the form of heuristic rules. The rules of cellular automata, as used in the Hybrid Cellular Automata (HCA) approach [5], can be used to form three-dimensional voxel based structures. In this process, the cellular automaton redistributes material such that each voxel in the model is utilized equally for the particular load case; the internal energy density of the respective voxels serves as the criterion. The process of the Graph- and Heuristic-based Topology optimization (GHT) is also driven by heuristic update rules [6,7,8,9]. These updates are performed on a mathematical graph consisting of nodes and edges, which describes the cross section of an extrusion profile. With the GHT, the actual objective and constraints can be considered directly in the optimization process.

This work presents a reinforcement learning (RL) based approach for the topology optimization of crash loaded structures using the GHT to improve local structural cells with respect to their stiffness. The RL model is integrated into the GHT optimization process and functions as an additional heuristic applicable to many different models and load cases. This allows for a more diverse design generation during optimization, which in turn should result in better optima at the cost of a higher simulation count.

The core concept presented here is the underlying RL environment, which defines the interface between the agent and the crash model that is to be improved. For this, a shape preservation metric is proposed that describes the stiffness of a cell by measuring how much the undeformed and deformed cells differ from each other geometrically. While the GHT process is already well described in the literature [6,7,8,9], the RL based approach is completely new and is investigated in this paper. The key contributions of this work are

  • the combination of the two research fields of RL and crash optimization,

  • the support of the GHT with a new RL based heuristic that increases the stiffness of local cells,

  • the concept of cells and their advantages and disadvantages as an interface between the graph based structures and the RL model,

  • the implementation of an environment the agents are trained on,

  • the calculation of the shape preservation metric describing the stiffness of a cell and

  • the assessment of the performance of the trained agents in practical optimizations.

The paper is structured as follows. Section 2 presents related literature and concepts that are used in this paper or have directly influenced the work. In Section 3, the requirements for the RL models, also called agents, the implementation of the RL environment and the training of the agents are introduced. It also describes how the environment and the agents are integrated into the GHT process. The best trained model is selected and evaluated in Section 4 within the training environment. To assess the performance of the agent based heuristic within practical topology optimizations, a frame model and a rocker model are studied with the GHT in Section 5. Finally, the results and findings are summarized in Section 6.

2 Related work

Section 2.1 gives an overview of different approaches to integrating artificial intelligence (AI) into crash simulation and crash optimization. These studies collectively highlight the intersection of AI with crash simulations, illustrating the growing trend in this research area. However, while they offer valuable insights, they do not directly lay the groundwork for the present work. This is followed by Section 2.2, where an introduction to RL is given. Lastly, Section 2.3 discusses how the GHT works, as it is the framework for the RL based method presented in this paper.

2.1 Use of artificial intelligence in crash simulation and optimization

Crash simulation and optimization are integral aspects of modern automotive and aeronautical crash safety analysis. Leveraging AI and machine learning techniques has recently gained momentum in the field of crash analysis, allowing for enhanced computational and predictive capabilities.

A primary focus of recent studies has been the application of dimensionality reduction and clustering techniques for analyzing crash simulations. In [10], a clustering algorithm is used to discern structural instabilities from reduced crash simulation data. The study clusters nodal displacements derived from finite element (FE) models spanning different simulations. By subsequently processing these clusters through dimensionality reduction techniques, inherent characteristics of the simulation runs are unraveled, obviating the need for manual sifting of the data. The practicality of the approach is underscored by its effective application to a longitudinal rail example.

[11] presented an automated evaluation technique aimed at discerning anomalous crash behaviour within sets of crash simulations. An outlier score is calculated for each simulation state via a k-nearest-neighbour strategy; averaging these scores for a given simulation yields a consolidated score that facilitates the distinction between regular and outlier simulations. The effectiveness of this method is underscored by its high precision and notable recall when evaluated on five distinct datasets.

A geometric approach is presented in [12]. Primarily one-dimensional structures are represented by a regression line, which is used to analyze the deformation behavior of different crash models. These regression lines are parameterized as Bézier curves. Simulation responses are then projected onto the regression line and smoothed with kernel density smoothing. Leveraging a discretized version of the smoothed data, it is possible to effectively identify and categorize distinct deformation patterns and to find, through data mining techniques, the parameters with the greatest influence on the deformation modes. The method is validated on several important structural components in a full frontal crash.

Crash simulations are intrinsically time dependent, so the use of time data for AI in crash simulations suggests itself. The study in [13] offers a novel data analysis methodology for efficiently post-processing bundles of FE data from numerical simulations. Similar to the Fourier transform, which decomposes temporal signals into their individual frequency spectra, [13] propose a method that characterises the geometry of structures using spectral coefficients. The coefficients with the highest values are decisive for the representation of the original geometry. By selecting these predominant spectral coefficients, the geometry can be represented in a low-dimensional way. The method is successfully validated on a full frontal crash by analyzing the behaviour of the vehicle's support structure.

In a simpler approach, [14] bring forth the concept of Oriented Bounding Boxes (OBBs). These are cuboids that encapsulate FE components at minimum volume throughout the simulation. This geometric abstraction enables the estimation of the size, rotation and translation of crash structures over time. Moreover, their method, which uses a Long Short-Term Memory (LSTM) [15] autoencoder to generate a low dimensional representation of the original data, paves the way for predicting and stopping simulations that exhibit undesirable deformation modes. The method is validated on 196 simulations with varying material properties in a full frontal crash by analyzing different crash relevant components.

In [16], the impact point in low speed crashes is identified from the time history of sensor data with conventional feature extraction algorithms. The impact points are classified into 8 different positions around the vehicle. From 3176 features extracted from the time series, the 9 most important are chosen and passed into a decision tree. Using this method, a cross-validated accuracy of 76% on the given dataset has been achieved.

2.2 Reinforcement learning overview

RL [17] is a subfield of AI that describes the process of learning tasks in an unknown and often dynamic environment. An agent performs actions within the environment over time according to its policy. The actions are selected by the agent depending on an observation of the environment, i.e. the agent's perception of the current state of the environment, with the goal of maximizing a cumulative reward. Applying the action to the environment and generating a new observation based on the new state of the environment is called a step. Depending on the task of the agent, anywhere from a few steps up to a theoretically infinite number of steps are performed until the environment reaches a terminal state. The steps from the initial state of the environment to the final state are called an episode. For each step performed, a numerical reward is given to the agent. This way the agent learns how beneficial the chosen action was for the previous state. After an episode, the environment is reset and a new episode starts. This iterative concept of stepping through the environment is referred to as the RL loop.
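Expressed with the gym API [25] that is also used later in this work, one episode of this loop can be sketched as follows; the environment name is only a stand-in for illustration, and a random policy takes the place of a trained agent:

```python
import gym

# Minimal sketch of the RL loop ("CartPole-v1" is a placeholder environment).
env = gym.make("CartPole-v1")

observation = env.reset()                 # initial state -> first observation
episode_return, done = 0.0, False
while not done:                           # one episode
    action = env.action_space.sample()    # a trained agent would choose here
    observation, reward, done, info = env.step(action)
    episode_return += reward              # sum of step rewards = return
env.close()
```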

In many state-of-the-art RL algorithms, the agent itself is a function approximator, usually an artificial neural network (ANN). Depending on the actual algorithm used, the ANN is trained to either predict the value of the observed state or to directly predict an action distribution over the possible actions that maximizes a given objective. The value of a state is the expected return when starting in that state and following the current policy. The return is a possibly discounted sum of the rewards gained in each performed step. Choosing an action is then done by finding the action that maximizes the value in the current state.
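Formally, in the standard notation of [17], the return \(G_t\) and the state value \(V^{\pi }(s)\) under policy \(\pi \) read

$$\begin{aligned} G_t = \sum _{k=0}^{\infty } \gamma ^k \, r_{t+k+1}, \qquad V^{\pi }(s) = \mathbb {E}_{\pi }\left[ G_t \mid s_t = s \right] , \end{aligned}$$

where \(\gamma \in [0, 1]\) is the discount factor that controls how strongly future rewards are weighted.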

In the case of actor-critic algorithms [18], which are used in this paper, both the action distribution and the state value are approximated by two distinct ANNs. The network predicting the action distribution is called the actor, and the network predicting the value of the states is called the critic. The policy given by the actor network is updated with a gradient based approach using information provided by the critic [18]. This combination of both algorithmic approaches enables a much higher sampling efficiency than their individual counterparts.

When using an RL model that has been trained with an actor-critic approach, only the actor is necessary for decision making given an observed state. This interaction between the actor and the environment is visualized in Fig. 1. Actions are sampled from the probability distribution over the actions predicted by the actor. The environment acts according to the given action and answers with a new state, which becomes the foundation for the next observation passed to the actor model.

Fig. 1 The RL loop showing the interaction between the ANN based actor model and the environment

In contrast to supervised learning, where an ANN is trained on a given dataset, RL training is driven by trial and error. The agent steps through the environment according to its policy, collecting data as it goes. This is the main advantage of using RL, especially for complex mechanical problems: no information about optimized structures needs to be known before training an RL agent.

This poses the problem described in the literature as the exploration-exploitation trade-off [17]. While training, the agent should act according to its current policy, so that it can use the knowledge already gained about the environment to reach favourable states. This is called exploitation. At the same time, the agent needs to try out new actions to avoid getting stuck in local optima by only acting according to the current policy. This is called exploration.

Typical examples where RL is frequently applied are robotics [20] and video games [19, 21]. There, new observations can be obtained quickly and in a simple and structured format, such as scalar sensor data in robotics and a visual representation of the environment in games. Thus, RL is a suitable method for training agents in these application fields.

An example of an application of RL to mechanical problems is given in [22, 23]. In the mentioned work, the volume of planar steel frames is optimized with RL while considering stresses, displacements and other engineering relevant constraints. The cross sectional size is chosen from a list of discrete sizes. The steel frames are represented by a graph, and a tailored graph embedding is used to preprocess the graph data into an RL suitable format.

In this work, Python is used for implementing the heuristic and the RL training. The most important Python modules used are stable-baselines3 [24], an RL framework; gym [25], which enables a standardized implementation of environments; networkx [26] for processing graph data; numpy [27] for numerical operations on arrays; and qd cae [28] for parsing the simulation results. The crash simulations are carried out with Ls-Dyna [29].

2.3 Crash optimization with the graph- and heuristic based topology optimization

The GHT is a shape and topology optimization method for crash structures. There are possibilities to optimize the cross section of an extrusion profile (GHT-2D) [6, 9] as well as the layout of different combined profiles (GHT-3D) [7].

In this work, the focus is on the GHT-2D; all following references to the GHT refer to the GHT-2D. In the GHT, the cross sections of extrusion profiles are described by graphs. The graph nodes contain the coordinates that describe the geometry of the profile, and the edges between the nodes represent the walls of the extruded model. For the automatic translation of a graph into an FE simulation model, the GHT-internal mesher GRAMB (Graph based Mechanics Builder) can be used. This software was initiated by [8] and further developed in [6]. An example of a graph and its FE counterpart is given in Fig. 2.

Fig. 2 Example of a GHT graph consisting of nodes and edges describing the cross section of the profile that is converted into an FE model of an extruded profile using GRAMB

As described in [7, 30], the use of a graph allows for

  • an easy way of manipulating the structure,

  • generating an interpretable structure with little effort,

  • a simple check of the manufacturing constraints and

  • high quality meshes due to the automatic FE meshing with GRAMB for every design.

Starting from an initial design, the graph is modified over several iterations using heuristics. Heuristics are expert knowledge condensed into formulas that analyze the mechanical behavior of the structure in the simulation. Within an iteration, these heuristics suggest new topologies. If desired, a dimensioning or shape optimization of each new structural proposal is subsequently performed. The heuristics operate in parallel and in competition, and only the best designs are passed on to the next iteration, where they serve as the basis for the following modifications.

The heuristics used in this work have been developed in [6, 7] and are listed below.

  • Delete Needless Walls (DNW) is a heuristic where unimportant edges are deleted. Here, the energy density of the structure walls is the criterion whether an edge is classified as unimportant.

  • Support Buckling Walls (SBW) identifies FE nodes that are moving rapidly towards each other, detecting that the structure has a buckling tendency. These areas are supported with an additional wall.

  • Balance Energy Density (BED) provides a homogeneous distribution of the absorbed energy in the structure by connecting low and high energy areas.

  • Use Deformation Space (UDS) has the variants compression (UDSC) and tension (UDST). Deformation spaces moving towards or away from each other are identified and supported by a wall.

  • Split Long Edges (SLE) reduces the buckling tendency by splitting and connecting the longest edge with another long edge in the graph.

3 Reinforcement learning based heuristic generation

In the following, the framework of the RL based GHT heuristic is formulated. The design of the training environment for the heuristic is described in detail in Section 3.1. Section 3.2 describes how the heuristic is integrated into the GHT optimization process.

In this investigation, the aim of the heuristic is to improve the local stiffness of the structure, hence the heuristic name RLS, which is an abbreviation for “Reinforcement Learning Stiffness”.

3.1 Environment implementation

The environment definition is the most important part when training an agent with RL. It defines what the agent should learn based on the received rewards and how the interaction between the agent and the environment is implemented. Therefore, the main concept of the environment is shown in Section 3.1.1. Section 3.1.2 explains the interaction between the agent and the environment, i.e. the actions the agent can take and the observations the agent receives. The random generation of models and load cases is shown in Section 3.1.3. When the agent inserts an edge into the graph locally, a reward is calculated, which indicates whether the newly added edge improved the local performance of the model. How the reward for each step is calculated is shown in Section 3.1.4. Lastly, in Section 3.1.5, the training process for the agent is described.

3.1.1 Environment concept

The first step in implementing the environment is to clarify what the environment should achieve. It is supposed to give a framework for an agent to increase the stiffness of a structure by manipulating the topology of the graph locally. This should be achieved by sequentially inserting edges into the graph. An optimal topology proposal by the trained agent acting in the environment is desirable, but not mandatory. This is because the RLS heuristic, which is derived from the proposed environment, works in competition to other heuristics listed in Section 2.3. Therefore, suboptimal topologies are sorted out in the optimization process.

Figure 3 gives an overview of stepping through an episode of the environment. The environment is split into two main modules. When an episode starts, the reset module is activated. The reset module handles the randomized generation of a GHT graph representing the cross section of an extrusion profile, which is then translated into a finite element model and simulated in a randomized load case. Local parts of the graph, referred to as cells, are identified; edges will be inserted directly into these cells. Based on the results of the FE analysis, an observation consisting of the mechanical properties of the initial and deformed simulation model is built. Using this initial observation, the agent chooses its first action, which is passed into the step module. The step module contains procedures similar to those of the reset module. Additionally, the topology of the cell inside the graph is first modified using the given action. The cell is evaluated based on its updated topology. Then the reward and a termination flag deciding whether the episode should terminate are calculated. All of these concepts are explained in more depth throughout this section; a sketch of the two-module structure in terms of the gym interface follows Fig. 3.

Fig. 3 Overview of stepping through an episode of the environment

Before explaining the concepts further, a simulation model must be defined with which the agent can interact. The simulation model is an FE model that is used in the environment to generate the observations based on all necessary crash responses. Because RL needs many function calls by its nature, a finite element model must be chosen that can be computed in a few seconds. Therefore, a simplified version of a frame simulation model, originally presented in [6] and modified in [30], is used. The graph representation of the simulation model is shown in Fig. 4. It is important to note that the entire method depends on the given simulation model. How well the trained agent is able to generalize the mechanical behaviour in GHT optimizations must be investigated further.

Fig. 4 Graph representation of the frame simulation model in its default load case. The semi-transparent box highlights the fully constrained side of the model. \(\hat{x}\) and \(\hat{y}\) refer to a local profile coordinate system

As is typical for an extrusion profile, it is made from aluminum; it has an initial wall thickness of 3 mm and a weight of 20.25 g. The rigid sphere has a velocity of 10 ms\(^{-1}\) and a mass of 105.1 g. The extrusion depth is set to 5 mm. Likewise, the mean element edge length is set to 5 mm, which means that there is only one row of elements in the extrusion direction. Because of this, phenomena in the extrusion direction cannot be represented. This compromise has to be made to save simulation time. A randomized derivative of this frame model is generated for the agent to train on, which is explained in more detail in Section 3.1.3.

In order to develop the training environment for the RL agent, it must be clarified in which format the environment passes observations to the agent and in which format the agent's actions are given. Due to the complexity of the mechanical problem, ANNs are used as function approximators for the agent. The use of ANNs for training on structured data, such as images and fixed-length vectors, is straightforward. Since the raw data obtained from the GHT either has a graph structure or originates from FE simulations with different numbers of nodes and elements for every design, the data is considered unstructured.

To circumvent these problems, the environment identifies a local part of the graph, called a cell, which is optimized by the agent. By looking at cells spanned by a fixed number of nodes, the mechanical state of the cell can be described by fixed length vectors. These vectors can directly be used for the observation.

An additional advantage of this approach is that the agent can identify generalizable behavior in the simulation data due to the similarities between all cells and their mechanical behaviour. It is also possible for the agent to recognize patterns from the training data in different models and load cases, as long as the underlying mechanics are similar enough to the behaviour of the frame model. The disadvantages of this approach are that the structure can only be optimized locally and that inserted edges are bound to the nodes along the cell boundaries. An example of a graph and a valid cell is shown in Fig. 5. The different size of the graph in Fig. 5 compared to the default graph in Fig. 4 is due to a randomization process that is applied to the default graph to generate a variety of different graphs.

Fig. 5 Example of an extrusion profile with a cell of 4 nodes that is to be optimized by the agent. The cell is chosen randomly from all possible cells

To increase the number of possible topologies in the cell, the edges along the boundaries of the cell are split. Some splitting nodes of the cell might already exist in the graph due to a connection from the outer graph onto a side of the cell. In this case, no additional edge splitting is required, since that side of the cell is already split.

A cell is considered valid if all of the following criteria are fulfilled (a sketch of such a check follows the list).

  • The cell shape must match the shape which the agent was trained for. In this example, the cell is a quadrilateral consisting of 4 nodes spanning the cell and 8 nodes in total along the cell boundary. This ensures that all features of the observation can be represented by vectors of fixed length.

  • The cell shape must be convex. This ensures that added edges by the agent will always be contained inside the cell.

  • The initial cell must be empty, i.e. there must be no edges or nodes in it.

  • The cell must be large enough, i.e. it is checked whether the area of the cell is greater than 5% of the area spanned by the entire graph.

  • The cell must initially absorb a minimum amount of energy. This is checked by calculating the shape preservation value, which is explained in more detail in Section 3.1.4. This ensures that the cell is deformed and that it makes sense to improve the cell.
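To make the list concrete, a possible numpy based sketch of such a validity check is given below. The thresholds follow the text (5% area ratio, \(\tilde{A}_\Delta \ge 0.03\) from Section 3.1.4); the function names, the representation of cells as coordinate arrays and the omission of the emptiness check are assumptions for illustration.

```python
import numpy as np

def polygon_area(pts):
    """Shoelace area of a closed 2D polygon given as an (n, 2) array."""
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def is_convex(pts):
    """True if all cross products of consecutive edge vectors share a sign."""
    d = np.roll(pts, -1, axis=0) - pts
    cross = np.cross(d, np.roll(d, -1, axis=0))
    return bool(np.all(cross >= 0) or np.all(cross <= 0))

def is_valid_cell(cell_pts, graph_pts, shape_preservation, n_nodes=4):
    """Check the validity criteria above (cell emptiness omitted here)."""
    return (len(cell_pts) == n_nodes                               # cell shape
            and is_convex(cell_pts)                                # convexity
            and polygon_area(cell_pts) > 0.05 * polygon_area(graph_pts)
            and shape_preservation >= 0.03)                        # deformed
```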

This cell based method is preferred over approaches that directly process graph data as the observation. Examples of such approaches include Graph Neural Networks (GNN) [31], Graph Convolutional Networks (GCN) [32] and graph embedding algorithms like the graph2vec algorithm [33]. These methods can parse graphs of various sizes, but are not used in this paper, since they introduce the following difficulties for the given task.

  • It is unclear how the action of an agent, which is usually a scalar or a vector, can be translated into a corresponding edge to be activated in an arbitrary graph.

  • Difficulties in training the agent on the mechanical behaviour of the entire graph would also arise due to the high mechanical complexity. Since the training models and load cases are randomly generated, no training model is identical to a previous one. For the RL agent, it would accordingly be difficult to recognize a generalizable pattern within the deformations of the simulated structure. This is especially the case for the limited number of training episodes dictated by the computational cost of the crash simulations.

  • Although the graph based approaches are currently a subject of research, a generally applicable and flexible integration of these approaches into a reinforcement learning library that can be used out of the box does not yet exist.

As already mentioned, the RLS heuristic is intended to improve the stiffness of the cells and thus of the overall structure. To allow a fair comparison of the structures found by the agent, the wall thickness of the whole structure is scaled such that the profile mass remains constant between all designs proposed by the agent.
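Since the mass of a thin-walled shell model is linear in the wall thickness, this rescaling reduces to multiplying the thickness by the mass ratio. A minimal sketch, assuming a uniform wall thickness; the bound values are only example limits in the style of the manufacturing constraints used later:

```python
def rescale_wall_thickness(t_current, m_current, m_target,
                           t_min=1.0, t_max=4.0):
    """Scale a uniform wall thickness so the profile mass stays constant.
    Assumes the shell mass scales linearly with the thickness."""
    t_new = t_current * m_target / m_current
    if not t_min <= t_new <= t_max:
        raise ValueError("rescaled thickness violates the thickness bounds")
    return t_new
```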

3.1.2 Interaction between agents and environment

The action of the agent is a tuple consisting of an identification number for the edge to be activated and a flag indicating whether the episode should be terminated. By giving the agent the option to terminate the episode itself, one can avoid making the cell unnecessarily complex or non-manufacturable by inserting more edges than necessary. The episode is also terminated when the cell performance is considered sufficient, i.e. when the deformation of the cell is small enough, or when more than 5 edges have been added by the agent. Because the cells always follow the same building principle, it is possible to create a mapping between the edge number given by the agent and the actual edge to be inserted into the graph [34]. If the chosen edge is already active in the graph, the action is ignored and no new edge is inserted.
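One possible way to realize such a mapping is sketched below under the assumption that the boundary nodes of a cell are numbered consecutively; the actual construction in [34] may differ. Every non-adjacent pair of boundary nodes defines an insertable edge, and the list index of the pair serves as the action id.

```python
from itertools import combinations

def candidate_edges(n_boundary_nodes=8):
    """Enumerate insertable edges as pairs of boundary nodes, skipping
    pairs that are adjacent on the boundary (these form the cell walls)."""
    edges = []
    for i, j in combinations(range(n_boundary_nodes), 2):
        adjacent = (j - i == 1) or (i == 0 and j == n_boundary_nodes - 1)
        if not adjacent:
            edges.append((i, j))
    return edges

EDGE_BY_ACTION = candidate_edges()       # action id -> node pair (20 edges)
TERMINATE_ACTION = len(EDGE_BY_ACTION)   # one extra id ends the episode
```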

The observation is composed of various geometric parameters and simulation responses. For the evaluation of the simulation data, a section plane is generated centered along the profile, perpendicular to its extrusion direction. Along this cut, the FE nodes and elements are extracted to form a new graph, the evaluation graph, which describes the deformation and other responses of the cell at the given section at every point in time [34]. Since the agent only needs information about the cell's behaviour, the evaluation graph only contains nodes and edges describing the deformation of the cell, not of the rest of the graph. Figure 6 shows an example of an evaluation graph in its deformed state.

Fig. 6 Example of an evaluation graph representing the cell's deformation. The nodes of the evaluation graph are derived from corresponding FE nodes at the center of the considered profile along its extrusion axis. The edges show the connectivity of the graph nodes similar to the FE elements connecting the FE nodes in the underlying FE model

Although the evaluation graph keeps track of the cell's behaviour at every point in time, only specific points in time are used for the generation of the observation. Most commonly, these are responses in the undeformed state and in a specific deformed state. The deformed state corresponds to the point in time where the internal energy of the cell is maximized. This point is called the evaluation time. Specific responses are also evaluated at an individual point in time for each edge of the cell, i.e. when their respective response value is maximized.

By reducing the problem to the consideration of one cell in a graph, the structural properties and responses of the mechanical model can be reduced to single vectors of fixed length for each property. The simulation responses are processed using the evaluation graph and translated into node and edge properties of the corresponding cell with a fixed number of nodes along its cell border.

The following features define the observation that is passed to the agent for analysis of the current state (a sketch of a corresponding observation space follows the list):

  • The edge lengths of the cell's edges. These give the agent a sense of the size of the model and allow a better assessment of the buckling tendency of the corresponding walls.

  • The wall thicknesses of the cell's edges. These should help the agent to further assess the buckling tendency. Due to the mass constraint on the model, the wall thickness becomes smaller with every inserted edge. This information helps the agent to estimate whether a new edge can be inserted without going below the minimum allowed wall thickness.

  • The manufacturability of the structure. This includes a flag for the intersection check, i.e. for any unresolved intersections inside the graph, a flag for the edge lengths and edge thicknesses of members, and a flag for the distances and angles between members. The exact manufacturing constraints are discussed in Section 3.1.3.

  • Whether graph nodes on the boundary of the cell are connected to the rest of the graph. This is important information, since those connections might be a source of additional support for the inserted member and the cell.

  • A reduced vector representation of the adjacency matrix of the cell. The adjacency matrix and the derived vector describe the connectivity of nodes in a graph. With this vector, the agent always knows the current topology of the cell [34].

  • Coordinates and displacements of the graph nodes at the evaluation time. These are given with respect to a local coordinate system of the cell and give the agent a sense of the scale of the cell and its deformation. Only the graph nodes along the border of the cell are evaluated, since those are sufficient to analyse the stiffness of the cell.

  • The internal energy density for each edge of the cell at the evaluation time and at the individual times. It tells the agent how much energy the specific member absorbed in the crash.

  • A cross sectional image (96 px by 96 px) of the cell in its undeformed and deformed state derived from the evaluation graph. This enables the agent to comprehend the entire cell deformation, independent of the mesh used in the FE model [34]. A Convolutional Neural Network (CNN) [35] is used to process the image data.
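Collecting these features, the observation can be described, for instance, as a gym dictionary space. The entry names and dimensions below are illustrative assumptions for a 4 sided cell with 8 boundary nodes and 20 candidate edges; only the 96 px image size is fixed by the text.

```python
import numpy as np
from gym import spaces

N_NODES, N_EDGES = 8, 20   # boundary nodes / candidate edges (illustrative)

observation_space = spaces.Dict({
    "edge_lengths":   spaces.Box(-np.inf, np.inf, shape=(N_EDGES,)),
    "wall_thickness": spaces.Box(-np.inf, np.inf, shape=(N_EDGES,)),
    "manufacturable": spaces.MultiBinary(3),        # the flags listed above
    "connectivity":   spaces.MultiBinary(N_NODES),  # links to outer graph
    "adjacency":      spaces.MultiBinary(N_EDGES),  # current cell topology
    "coords_displ":   spaces.Box(-np.inf, np.inf, shape=(N_NODES, 4)),
    "energy_density": spaces.Box(-np.inf, np.inf, shape=(N_EDGES, 2)),
    "images":         spaces.Box(0, 255, shape=(2, 96, 96), dtype=np.uint8),
})
```

Such a dictionary observation can be processed by combining a CNN for the image entry with flat vectors for the remaining entries, as stable-baselines3 does with its multi-input policies.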

Fig. 7 Example of an aggregated internal energy feature vector, where each of its entries is mapped onto its corresponding edge of the cell. The semi-transparent box highlights the fully constrained side of the model

For edge properties like edge lengths, a feature vector consists of entries for the edges along the frame boundary and all edges the agent can theoretically insert. For all edges already inserted into the cell, the structural responses are written into the vector. For all edges not yet inserted, a response type specific dummy value is used instead. Since added edges can intersect, the two original intersecting edges have to be removed and replaced with four split edges. The environment keeps track of these edge splittings and aggregates the responses such that the vectors remain of fixed length. An example of such an aggregated edge feature vector mapped onto the cell is given in Fig. 7. The inserted edges in the cell intersect, but the intersection is not resolved, i.e. no node is added at the intersection point, to emphasize that the feature vector length remains constant. Since the number of nodes of the cell is fixed to the number of nodes along the cell border and inserting edges does not introduce new nodes into the cell, node feature arrays like nodal displacements consist of as many entries as there are nodes along the border. In case the model is not manufacturable, no FE results are available. The response values from the simulations are then substituted with dummy values for the observation.

Fig. 8 Example of a randomized graph and load case built by the environment. The semi-transparent box highlights the fully constrained side of the model

All features that are not inherently bound to a known interval are normalized, e.g. the edge lengths, coordinates, displacements and energies. The mean and standard deviation used for normalization are not known before the training. Therefore, the normalization is done by calculating the running mean and running standard deviation of the distributions of the observation values during training.
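This is the same idea that the VecNormalize wrapper of stable-baselines3 implements. A self-contained sketch of such a running normalizer could look as follows:

```python
import numpy as np

class RunningNormalizer:
    """Normalize features with a running mean and standard deviation that
    are updated with every batch of observations seen during training."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, batch):
        # parallel (Chan et al.) update of the running mean and variance
        b_mean, b_var, n = batch.mean(0), batch.var(0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + n
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + b_var * n
                    + delta**2 * self.count * n / total) / total
        self.count = total

    def __call__(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)
```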

3.1.3 Model and load case generation

During the training, the agent should be able to analyze as many different deformation modes of various cells as possible. This enables the trained agent to perform useful actions for similar deformation states in GHT optimizations.

The frame model shown in Section 3.1.1 is used as a foundation to derive randomized models for training the agent. Based on the model's graph, about 30000 different graph topologies from previous GHT optimizations were identified to serve as the basis for the randomized model generator. A randomly selected edge segment along the outer frame of the structure is constrained and another edge segment is chosen as the impact segment. An edge segment is a series of edges with the same orientation. The velocity of the spherical impactor is also selected randomly. The sphere can either move perpendicular to the selected edge segment, or it can move in the direction of the center of gravity of the graph. With additional random adjustment of the size, the wall thickness and the orientation of the graph in space, an unlimited number of models and load cases can be built, resulting in a large variety of deformation modes. Figure 8 shows a randomized model in a random load case.

The manufacturing constraints used for training the agents are given in Table 1. They are derived from [30] and adapted to suit the environment. The edge distance is calculated for all edge pairs that do not share a node. Since the edges along the cell border are split, their distance to other edges changes. This results in smaller distances than in the unsplit cell, although the geometry is identical. For this specific reason, the minimum distance between edges d is set to the small value of 4 mm.

Table 1 Manufacturing constraints used for training the agent

3.1.4 Reward function

The definition of the reward function is essential because it controls what the agent learns in an RL problem. As mentioned earlier, the goal of the agent is to improve the overall stiffness of the model by increasing the stiffness of a cell. For this purpose, a shape preservation measure \(\tilde{A}_\Delta \) is presented, which is derived from the deformation of the structure based on the evaluation graph. The evaluation graphs of the undeformed state and of the deformed state at the evaluation time are superimposed at their respective centers of gravity. If the structure preserves its shape, both cell boundaries lie exactly on top of each other. Otherwise, difference areas emerge, which are summed up and normalized. This process is independent of any inserted edges and works both for empty cells and for cells with edges in them. Since rigid body translation and rotation have no influence on the shape preservation of the cell, they are eliminated when creating the evaluation graph. Figure 9 shows the superposition of the evaluation graphs in more detail.

Fig. 9 The superposition of an empty deformed and undeformed cell to calculate the shape preservation value \(\tilde{A}_\Delta \)

The original formula for calculating the shape preservation measure \(\tilde{A}_\Delta \) is given by

$$\begin{aligned} \tilde{A}_\Delta = \frac{\sum _j A_\Delta ^{(j)}\big |_{t_\text{eval}}}{A\big |_{t_0} + A\big |_{t_\text{eval}}}. \end{aligned}$$
(1)

The area spanned by the evaluation graph at a given simulation time is given by \(A\big |_{t}\). The difference areas between the superimposed evaluation graphs of the undeformed and deformed state are given by \(A_\Delta ^{(j)}\), where j is the index of the considered difference area.

The shape preservation measure is bound between 0 and 1 due to the normalization in \(\tilde{A}_\Delta \) with \(A\big |_{t_0} + A\big |_{t_\text{eval}}\). A value of 0 implies that the shape of the cell did not change from the initial state to the deformed state, and a value of 1 means that the structure collapsed into a point, i.e. the structure is infinitely weak. In the case \(\tilde{A}_\Delta = 0\), no difference areas emerge, setting the numerator to 0. For the collapsed cell, \(\tilde{A}_\Delta = 1\) holds because the cell has a cross sectional area of \(A\big |_{t_\text{eval}} = 0\); then only one difference area \(A_\Delta ^{(1)}\) emerges, and with \(A\big |_{t_0} = A_\Delta ^{(1)}\) one can see that the value of \(\tilde{A}_\Delta \) is indeed 1.
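Computationally, the summed difference areas correspond to the area of the symmetric difference of the two superimposed boundary polygons. The following is a sketch using the shapely library, which is an assumption here; the paper's own implementation is based on the evaluation graph, and the removal of rigid body rotation is omitted for brevity.

```python
from shapely.geometry import Polygon

def shape_preservation(boundary_t0, boundary_teval):
    """Shape preservation measure of Eq. (1), computed from the cell
    boundary polygons of the undeformed and deformed evaluation graph,
    superimposed at their centroids (rotation removal omitted)."""
    p0 = Polygon(boundary_t0)
    pt = Polygon(boundary_teval)
    dx = p0.centroid.x - pt.centroid.x
    dy = p0.centroid.y - pt.centroid.y
    pt = Polygon([(x + dx, y + dy) for x, y in pt.exterior.coords])
    diff_area = p0.symmetric_difference(pt).area   # sum of difference areas
    return diff_area / (p0.area + pt.area)
```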

One advantage of using such a metric is that its values are not subject to significant noise, unlike section forces or other crash relevant responses, as they are based entirely on the displacements of the cell nodes in the evaluation graph. In addition, the values of the metric are normalized, so they are model and load case independent. This simple behavior of the metric simplifies the training for the agent.

This measure is also used to identify whether an empty cell is a candidate for optimization with the agent. A value of \(\tilde{A}_\Delta \ge 0.03\) identifies the cell as deformed, so it makes sense to optimize it. A value of \(\tilde{A}_\Delta \le 0.01\) terminates the episode, since the deformation of the cell is small.

With this measure of how well the current cell performs, it is possible to reward or punish the agent based on how much the cell performance improved compared to the previous step of the episode. A relative improvement is considered instead of the absolute improvement to ensure that all resulting rewards have a similar range, independent of the actual load case and deformation of the cell. The relative improvement is given by

$$\begin{aligned} \delta = \text{clip}\left( \frac{\tilde{A}_\Delta \big |_{s_{i-1}} - \tilde{A}_\Delta \big |_{s_i}}{\tilde{A}_\Delta \big |_{s_0}},\, -3,\, 3\right) , \end{aligned}$$
(2)

where \(s_i\) refers to the evaluation of the metric at the current environment step. The clipping is a safety precaution for numerical stability: it prevents outliers from generating excessively small or large rewards.

Using this improvement, the reward function r evaluates to

$$\begin{aligned} r = p + \left\{ \begin{array}{ll} \delta & \text{if the model is manufacturable}, \\ -1 & \text{else}, \end{array}\right. \end{aligned}$$
(3)

which is to be maximized by the agent. Here, p is a penalty defined by the user to penalize every step through the environment. In this work it is set to \(p=-0.05\), which means that an edge added by the agent must generate at least a 5% relative improvement to be considered useful. In case the model derived from the graph with the newly added edge is not manufacturable, the agent receives a reward of \(p-1\).
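Put together, Eqs. (2) and (3) amount to only a few lines; the function below is a direct sketch with the values used in this work.

```python
import numpy as np

P_STEP = -0.05   # per-step penalty p from Eq. (3)

def step_reward(a_prev, a_curr, a_init, manufacturable):
    """Reward of Eqs. (2) and (3): clipped relative improvement of the
    shape preservation measure, or a penalty for unbuildable designs."""
    delta = np.clip((a_prev - a_curr) / a_init, -3.0, 3.0)
    return P_STEP + (delta if manufacturable else -1.0)
```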

3.1.5 Training of the agents

Before the agents can be trained, the algorithm to be used must be determined. In this work, the PPO (proximal policy optimization) algorithm [36] is used, since it is a state-of-the-art algorithm with good sample complexity that compares favourably with other policy gradient methods [36]. It can also handle discrete actions, i.e. discrete edge numbers that translate into the edges inserted into the cell, as well as continuous observation spaces.

It is necessary to determine the best hyperparameter settings, such as the architecture of the underlying ANN, to achieve the best possible performance with the PPO algorithm. Since the best hyperparameter settings are not known beforehand, one would usually perform an extensive hyperparameter search, which is too time consuming for the given task. Instead, 12 agents with different hyperparameters are trained simultaneously on the identical task, and after training the best agent is picked manually. Table 2 shows the parameters and the sets of values that are sampled from to generate the batch of 12 agents. The chosen parameters are the ones expected to significantly impact the training behaviour of the agents. Other parameters exist but are not shown, since they have not been tuned and are set to the default values used in stable-baselines3.

Table 2 Hyperparameters of the PPO algorithm and their sets of values that are sampled from to generate the batch of agents to train

The learning rate is a hyperparameter that determines the step size with which the parameters of the policy network are updated during training via stochastic gradient ascent algorithms. The batch size specifies how many samples of observations and actions are used to compute the policy gradient and update the policy network. Gamma is a discount factor used in RL algorithms to balance the importance of immediate and future rewards. A gamma of 0 results in an agent with myopic behaviour, i.e. an agent that does not look into the future and only uses its current state for decision making. A gamma of 1 means that future rewards are as important as the immediate reward. Using a value close to 0 would help in this environment due to the high uncertainty in the crash simulations; however, this approach might only find mediocre optima. A value closer to 1 would help in finding better optima due to the future planning involved in the decision making, but might ultimately result in worse performance, depending on the unknown uncertainty of the environment and how well the agent can predict the future. The number of rollout steps determines the number of steps taken in the environment before computing the policy update.

The possible hyperparameter values for the policy that are sampled from to generate the batch of 12 agents are given in Table 3.

Table 3 Hyperparameters of the policy and their set of values that is sampled from to generate the batch of agents to train

The policy parameters define the size of the underlying neural networks. Since PPO is an actor-critic algorithm, a neural network for the actor and a neural network for the critic are built with the respective number of hidden layers and neurons. Both the actor and the critic network share the same preceding layer with the specified number of neurons. For the CNN architecture, the default CNN in stable-baselines3 is used, which is the Nature CNN. The hyperparameters of the algorithm and the policy resulting in the best agents are shown in Section 4, where the top performing agents are selected and evaluated.
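In stable-baselines3, such an agent can be set up as sketched below. The numeric values are placeholders and not the tuned settings of Tables 4 and 5, and `env` stands for the cell environment of Section 3.1; the shared preceding layer is omitted for brevity.

```python
from stable_baselines3 import PPO

model = PPO(
    "MultiInputPolicy",        # handles dict observations incl. the image
    env,                       # the cell environment (placeholder)
    learning_rate=3e-4,
    batch_size=64,
    gamma=0.99,                # discount factor
    n_steps=2048,              # rollout steps before each policy update
    policy_kwargs=dict(net_arch=dict(pi=[64, 64], vf=[64, 64])),
)
model.learn(total_timesteps=100_000)
```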

Three different cell types are trained: cells with 3 sides, with 4 sides and with 5 sides. Every cell type needs an individually trained agent, since the cell type determines the action and observation spaces. In total, 36 agents are trained in parallel.

Table 4 PPO hyperparameters of the best performing agents, that differ from the stable-baselines3 defaults
Table 5 Policy hyperparameters of the best performing agents, that differ from the stable-baselines3 defaults

In order to expedite the training process, parallel computation across eight environments per agent is used. This approach ensures a rapid provision of data points, enhancing the efficiency of the training phase. With these settings, training a single agent on a compute cluster takes approximately a month. This high computational effort is justified by the fact that the agent only needs to be trained once and can then be used in the new heuristic for a wide variety of problems without any significant additional investment.
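With stable-baselines3, this kind of parallelization can be obtained with vectorized environments, e.g. as follows; the environment factory `CellStiffnessEnv` is the placeholder class sketched in Section 3.1.1:

```python
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

# eight copies of the environment stepped in parallel worker processes
vec_env = make_vec_env(lambda: CellStiffnessEnv(), n_envs=8,
                       vec_env_cls=SubprocVecEnv)
```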

3.2 Integration of the agents and the environment into the GHT process

In the previous sections, the structure of the training environment was explained. The environment can be used almost directly as the new RLS heuristic. The differences between the environment used for the RLS heuristic and the training environment are

  • the selection of actions, which is now done by the trained agent based on the normalized observations,

  • the FE simulation model to optimize, which is given in the GHT process and not randomly generated by the environment and

  • the cell selection process.

While in training mode the cell to optimize is chosen randomly from the set of valid cells inside the graph, the heuristic counterpart deploys a cell selection scheme. According to this scheme, the shape preservation measure \(\tilde{A}_\Delta \) is calculated for all empty cells within the current graph. The more a cell is deformed, the more suitable it is for optimization. At the same time, larger cells should be preferred, since they are more likely to influence the global structural behavior. Therefore, the shape preservation measure of the cell is weighted with the corresponding spanned area of the cell. This results in the importance \(\lambda \) of a cell

$$\begin{aligned} \lambda = \tilde{A}_\Delta \cdot A. \end{aligned}$$
(4)

The cell with the highest \(\lambda \) in the given graph is chosen for the heuristic activation.
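The selection itself is then a one-liner over all valid empty cells; the attribute names are placeholders:

```python
def select_cell(cells):
    """Pick the cell with the highest importance lambda = A_tilde * A,
    Eq. (4); each cell exposes its shape preservation value and area."""
    return max(cells, key=lambda c: c.shape_preservation * c.area)
```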

Due to a wrong decision, the agent may worsen the cell performance compared to the previous step. Therefore, the heuristic chooses the structure that generates the best shape preservation value of the cell over all steps of the episode.

4 Evaluation and selection of the agents in the training environment

In the following, the best performing agents out of the 12 trained agents per cell type are shown. Tables 4 and 5 show the hyperparameters of the best performing agents for every cell type. The parameters given are the ones that have been tuned in the hyperparameter tuning.

The corresponding training histories of the best performing agents are given in Figs. 10 and 11 as mean return and mean episode length, respectively. In the rollout plots, data is collected and averaged over the last 100 training episodes. Since these episodes are generated randomly and the actions of the agent are sampled from the probability distribution given by the actor network, values at different steps are not directly comparable. In the evaluation plots, 15 different models and load cases are averaged and compared between steps. These models and load cases are always the same for a specific cell type, and the agent always chooses the most promising action according to its current policy. This allows for a better comparison of the agent's performance between steps.

Fig. 10 Mean return on rollout and evaluation data for the agents trained on three different cell types

Fig. 11 Mean episode length on rollout and evaluation data for the agents trained on three different cell types

The plots show that all agents were able to achieve a significant improvement of the return, especially in the first steps. The mean return is higher for cell types with fewer cell sides. The episode length for all agents after training is close to 2, which approximates the number of inserted edges in the graph. It is not the exact number of inserted edges, because the agent can theoretically insert the same edge several times in a row, which would not change the structure. This effect is negligible in the trained state, however, because agents rarely insert edges multiple times due to the penalization of inserting them.

Since the mean return only measures the performance quantitatively and not qualitatively, examples of initial cell deformation behaviour compared to the improved counterparts are shown in Fig. 12. The figure includes the shape preservation value \(\tilde{A}_\Delta \) for the given examples.

Fig. 12 Examples of cell optimizations performed by the agents for different cell types, where SC is the selected cell in a given graph. The circled numbers indicate the order in which the agents inserted the edges. The semi-transparent box highlights the fully constrained side of the model

In all examples, the agent manages to improve the structural performance in terms of the shape preservation metric. Consistent with the episode length evaluation, all agents in these examples insert one or two edges. All example structures can be manufactured, which is not guaranteed given the high number of possible invalid edge combinations in the cell. Since these are only non-representative examples, it must be mentioned that the agents often, but not always, make reasonable decisions. It is noticeable that the shape preservation value is always close to 0 for the final cell. While this is desirable, it is not always achievable, e.g. if the cell has to absorb a lot of energy due to a direct impact of the sphere on the cell.

The final 3 sided cell has a shape preservation value of \(\tilde{A}_\Delta = 0.02\); only one edge is inserted by the agent. For the 4 and 5 sided cells, the intermediate steps and the corresponding cells where only one edge is inserted are also discussed. For the 4 sided cell, the agent reaches \(\tilde{A}_\Delta = 0.016\) with only the diagonal edge inserted first. This diagonal edge supports the cell stiffness by utilizing the deformation space along the compression direction of the deformed cell. From an engineering point of view, it might make more sense to support the cell in the tension direction to avoid buckling of the edge. Although the diagonal edge does not buckle in this example, the agent has learned to avoid the risk of buckling and inserts a supporting second edge, which is reasonable for cells that absorb more energy. This means that the agent failed to recognize that the episode could have been terminated earlier. For use in a later optimization, where the overall stiffness of a structure is to be improved, a correct recognition of the terminal step would have been advantageous, as the overall wall thickness would have remained larger due to the mass constraint. Similar behaviour can be observed for the cell with 5 sides. The shape preservation value with only the first edge inserted, connecting the blue and orange edge segment with the pink and red edges, is \(\tilde{A}_\Delta = 0.031\). This is only slightly worse than the shape preservation of the final cell. Although the structural performance improved to \(\tilde{A}_\Delta = 0.026\) with the final cell, the reward received for this action and final cell is slightly negative due to the penalization of inserting an edge, implying that the agent should not have inserted the second edge.

5 Performance of the agents in topology optimization processes

In this section, the RLS heuristic is used in practical GHT optimizations. When dealing with an optimization task with the GHT, it is important to note that the agent behind the RLS heuristic can insert multiple edges sequentially in one heuristic call, whereas conventional heuristics can insert or delete only one edge per call. This gives the RLS heuristic an advantage that must be factored into the evaluation.

5.1 Optimization of a frame model

The load case and the frame model for the GHT optimization have already been shown in Fig. 4. The objective of the optimization is to minimize the displacement of the rigid sphere in the global y-direction, \(\Delta y\), while satisfying the manufacturing constraints and keeping the mass m of the model constant. The global y-direction is identical to the local direction \(\hat{y}\) of the profile shown in Fig. 4. It is important to note that the optimization is performed with heuristic activations only; no shape optimization is done at any point. The number of designs passed on to the next iteration is set to 5 and a maximum of 10 iterations is allowed. The optimization problem is formulated as:

$$\begin{aligned} \min \quad & \Delta y \\ \text{subject to} \quad & l \ge 10\,\text{mm}, \\ & d \ge 10\,\text{mm}, \\ & \alpha \ge 15^{\circ }, \\ & 1\,\text{mm} \le t_{\text{wall}} \le 4\,\text{mm}, \\ & m = 20.25\,\text{g}. \end{aligned}$$
(5)

In Fig. 13, the optimization path that leads to the optimized design is shown. More simulations and iterations were carried out than shown, but they did not lead to a better design. It can be seen that the RLS heuristic found a valid design in iterations 1 to 6. The design proposed by RLS in iteration 1 is part of the path to the optimized design. In iteration 7, the RLS heuristic was not successful in finding a valid design, because either no cell reached the threshold of the shape preservation value \(\tilde{A}_\Delta \) or the edge splitting process made the graph non-manufacturable. All designs of the optimization fulfill the constraints.

Fig. 13 Optimization path that leads to the optimized frame design with the RLS heuristic active

The optimized design reduces the objective \(\Delta y\) from 72.36 mm to 7.46 mm. The greatest impact on structural performance comes from the combination of the RLS and DNW heuristics from iteration 1 to 3, which changes the shape of the overall structure to a triangular one. This is initiated by the RLS heuristic in iteration 1, where a diagonal edge is inserted into the graph.

Comparing these findings to an optimization without the RLS heuristic shows that the GHT then finds a much more complex structure with a worse objective value. This can be seen in the side by side comparison of the deformed states of the initial design, the optimized design with the RLS heuristic and the optimized design without the RLS heuristic in Fig. 14.

It can be observed that the edges must be much thinner in the optimized design without the RLS heuristic in order to keep the mass of the model constant. The objective of the optimized design without the RLS heuristic improves from 72.36 mm to 14.65 mm compared to the initial design. The lower performance can be attributed to the fact that none of the existing heuristics inserted a diagonal edge in the frame. If a shape optimization were active, the GHT without the RLS heuristic would also find a similar triangular shape.

Fig. 14 Comparison of the initial and the optimized frame designs with and without the RLS heuristic active in the optimization. The rigid impactor is shown in gray

In total, 273 function calls, i.e. FE simulations, were carried out in the optimization with the RLS heuristic active. These split into 185 function calls from the conventional GHT heuristics and an additional 73 function calls from the RLS heuristic. In the current implementation of the interface between the GHT and the RLS heuristic, more simulations than strictly necessary are performed. The number of function calls of the RLS heuristic could be reduced to 31 if this interface were optimized. In the optimization without the RLS heuristic, 230 simulations are performed.

5.2 Optimization of a rocker model

So far, the frame model, which was also used to train the agent, has been studied. This section examines how the agent based RLS heuristic performs in other models and load cases.

Since the agent is designed to improve the stiffness of a structure, a vehicle component relevant for occupant protection in a side crash is selected. In side impacts, there is little deformation space before the occupant is struck, so it is vital that the vehicle occupants are protected from excessive intrusion by the opposing vehicle or impactor.

Therefore, the performance of the RLS heuristic is investigated on a model of a rocker in a side crash against a rigid pole, which is presented in Fig. 15.

Fig. 15 Rocker model in a side crash against a rigid pole

The rocker is made of aluminum, has an initial wall thickness of 3.5 mm and an extrusion length of 600 mm, resulting in a mass of 2.801 kg. The energy of a moving rigid wall is introduced into the rocker through the seat cross member. The rigid wall has a mass of 85 kg and an initial velocity of 8.056 m s\(^{-1}\) in negative y-direction.
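
For reference, the kinetic energy introduced by the rigid wall follows directly from the stated mass and initial velocity:

$$E_{\text {kin}} = \tfrac{1}{2} m v^2 = \tfrac{1}{2} \cdot 85\,\text {kg} \cdot \left( 8.056\,\text {m}\,\text {s}^{-1}\right) ^2 \approx 2.76\,\text {kJ}.$$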

Fig. 16 Optimization path that leads to the optimized rocker design with the RLS heuristic active

The objective is to find a topology that minimizes the displacement \(\Delta y\) of the rigid wall and therefore increases the stiffness of the rocker. All manufacturing constraints must be fulfilled, and the mass of the model must be equal for all designs. The number of concurrent designs is set to 5 and the maximum number of iterations is set to 12. No shape optimization is performed during the optimization. The exact optimization problem is formulated as:

$$\begin{aligned} \min \quad & \Delta y \\ \text {subject to} \quad & l \ge 20\,\text {mm}, \\ & d \ge 10\,\text {mm}, \\ & \alpha \ge 15^{\circ }, \\ & 1.5\,\text {mm} \le t_{\text {wall}} \le 3.5\,\text {mm}, \\ & m = 2.801\,\text {kg}. \end{aligned}$$
(6)

The design variations that lead to the optimized structure are shown in Fig. 16. Along this path, the RLS heuristic is activated twice: once in iteration 2 and once in iteration 5. In iterations 3 and 4, the agent proposed the same topology change as in iteration 5, but the activation of other heuristics resulted in a better overall performance of the structure. The design proposals of the RLS heuristic in iterations 6 and 8 improve the shape preservation of the respective cell, but are not useful with respect to the stiffness of the rocker structure.

The performance of the structure gets worse from iteration 4 to 5, when the RLS heuristic is activated. In iteration 6, the DNW heuristic removes part of an edge added by the RLS heuristic, causing the structure to perform better in the long run. As in the optimization of the frame model, the combination of the RLS and DNW heuristics works well.

The optimization was able to reduce the objective \(\Delta y\) from 68.92 mm to 29.95 mm. In comparison, the optimization with the RLS heuristic inactive achieves a slightly smaller improvement, from 68.92 mm to 31.53 mm. A direct comparison of the deformed models is given in Fig. 17.

Fig. 17 Comparison of the initial and the optimized rocker designs with and without the RLS heuristic active in the optimization

Table 6 Summary of the results of the RLS heuristic in two distinct GHT optimizations

Although the optimized structure with the active RLS heuristic performs only slightly better, the emerging pattern of supporting walls inserted in an offset manner is similar in both designs. The optimization with the RLS heuristic was able to fill more space with this pattern.

The optimization with the RLS heuristic active required 296 function calls, of which 213 are from the conventional GHT heuristics and an additional 83 from the RLS heuristic. With an improved implementation of the interface between the RLS heuristic and the GHT, the number of function calls of the RLS heuristic could be reduced to 37, i.e. 213 + 37 = 250 function calls in total. The optimization without the RLS heuristic required 166 function calls.

6 Discussion and conclusion

In this paper, a novel heuristic for the topology optimization of crash structures with the GHT was presented. For this purpose, RL was used to train agents that can locally improve the stiffness of structures. Within the training environment, the agents were able to make plausible decisions about the topology of the cells. It was more difficult for the agents to recognize whether an episode could be terminated early.

The trained agents have been used as a new RL based heuristic in two GHT optimizations. Firstly, an optimization of a frame model was performed, where the new heuristic was able to direct the optimization to a better design compared to the optimization without the new heuristic. Secondly, an optimization of an application-oriented rocker model was performed. The differences between the designs with and without the new heuristic were smaller compared to the frame model optimization.

Table 6 summarizes the results of those two optimizations.

In both optimizations the optimized structures performed better with the RLS heuristic active. The use of a new heuristic increases the number of function calls in an optimization. This is especially true for the RL based heuristic, where the performance of the cell must be evaluated after every added edge, requiring many additional simulations.

Given the results shown, it is valid to state that the presented heuristic is able to help the GHT in the optimization process at the cost of an increased number of function calls. The underlying agents perform reasonably from an engineering perspective with respect to the goal of stiffening the cells. However, it is not guaranteed that the heuristic will always improve the optimization results. The displacements in crash simulations, which are assumed to play a major role in the agent's decision process, behave well from a mechanical point of view; one could therefore assume that the agent's decisions are fairly robust. Further research needs to be conducted to substantiate this assumption.

Accordingly, there are aspects that should be explored in future work. To enhance the design diversity of the cells, it would be useful to extend the edge splitting process with more nodes along one edge. There are limits to this, however, as very short edges are generated that do not fulfill the manufacturing constraints. Furthermore, only one simulation model with a limited amount of diversity in the load case has been considered for training. It is unclear how different training models would affect the performance in real optimizations. Objective functions other than stiffness were also not investigated in this paper. In crash development, force levels are often used as an optimization objective for crash load cases. An additional RL based heuristic that makes the structure more compliant instead of stiffer could help in those optimizations.