1 Introduction

Thanks to the recent developments in Deep Learning (DL), AI is permeating our daily lives. However, there is a growing concern about the trustworthiness of DL-based systems [1]. This is especially evident in high-stakes applications (e.g., finance, healthcare, and industrial control systems), where a single unexpected behavior of the AI system can cause catastrophic damage. These issues have led, in recent years, to a growing interest in XAI approaches [2,3,4,5].

However, the development of the XAI field is met with opposing points of view within the research community. On the one hand, some researchers think that DL is “hitting a wall” [6, 7]. Other researchers, instead, neglect the importance of XAI [8]. Regardless of this dispute, what is clear, judging from the interest shown in this topic by public institutions (e.g., the EU and UNESCO), is that XAI will have a dominant role in the future of AI.

To add more fuel to this debate, some researchers are advocating for a particular subset of XAI, called IAI, which focuses on the use of transparent models (e.g., DTs and rule-based systems) [9]. In fact, these researchers point out that general XAI methods merely provide a posteriori explanations of opaque ML models (e.g., DNNs), usually through simple model approximations, and as such they are not exact. In high-stakes scenarios, however, not having an exact explanation is not acceptable, as a wrong (and unexplainable) model behavior can lead to significant damage. As described in [2], interpretability is instead a structural property of a model, which means that an interpretable model can be exactly understood and inspected by humans, while explainability is, in general, a behavioral property of a model, meaning that one can give “explanations” about the decisions taken by the model even without knowing its internal details.

IRL has been recognized as one of the current grand challenges in the field of AI [9]. In fact, since RL is an extremely general methodology, it can be applied to a wide variety of problems, from the definition of taxation policies [10] to the control of tokamak plasmas in nuclear fusion plants [11]. These two application fields are also good examples of high-stakes domains where AI-based systems can be highly beneficial but, if controlled poorly, may have catastrophic effects. In these scenarios, users cannot just use opaque models in a plug-and-play manner, because even high-performing policies may be biased or have unpredictable behavior [9, 12]. Using an interpretable model, instead, would allow a thorough testing and inspection of the trained policy, to assess potentially unsafe behaviors.

Despite its importance and potential applications, IRL is currently falling behind the more established field of XAI. As discussed in [9], this is mainly because, compared to standard supervised settings (where a posteriori XAI models can be applied), IRL is much more challenging: 1) in RL, the agent does not receive all the data at once, observations may be partial, and rewards may be delayed and depend on the long-term consequences of a series of interdependent state-action transitions; 2) the search spaces in RL can be massive, with uncertain estimates. Yet, interpretability could be particularly valuable in RL, as it might help to reduce the RL search space, and possibly remove actions that might lead to harm. For these reasons, the research on IRL is highly relevant and timely, yet it presents several important challenges.

To partially address these challenges, some recent works have proposed hybrid approaches for producing IRL models [13,14,15]. In these works, the authors combine Grammatical Evolution (GE) [16], i.e., a variant of Genetic Programming (GP) [17], with Q-learning [18], to produce interpretable DTs with RL-optimized behaviors. Another hybrid AI approach that makes use of DTs is presented in [19], where the authors combine deep RL models with DTs to obtain learned and instinctive behaviors.

Another direction that has attracted growing interest in the recent literature focuses on RL algorithms that are capable of discovering not just one single policy, but rather a set of good, diverse policies. RL algorithms that focus on diversity can indeed explore the policy space better and discover more complex (and possibly more robust) behaviors [20]. One example of an RL application where diversification may be needed is game playing, where agents capable of producing moves that go beyond known patterns can be more successful. Another application area is intelligent control systems, such as smart industry or smart buildings, where a diversity-driven AI-based system could propose diverse control strategies to produce different system trajectories with different dynamics, also reacting to unexpected conditions, such as changes in the task and/or environment.

In IRL, diversity-driven algorithms could be even more important, as they would allow the discovery of a set of solutions that solve the task at hand not only with good performance but also using different strategies with different levels of complexity and interpretability. This could be extremely useful, as it would allow the users of the system to analyze and interpret the set of solutions and draw different insights about the problem at hand and, ultimately, improve their general understanding of the problem and their problem-solving skills.

An early attempt in this direction is presented in [20], where the authors propose a method called DOMiNO for discovering behaviorally different policies without significantly affecting performance. However, it should be noted that DOMiNO does not focus on interpretability, as it is based on DNNs. Furthermore, it relies on gradient-based optimization, whereas the present study uses gradient-free optimization. Concerning DT-based models, most of the previous works focus on goal-directed (i.e., not diversity-driven) optimization, while only limited work has addressed the search for diversity in the optimization of DTs for IRL.

One possible way to achieve diversity in this context is through the use of the so-called QD algorithms [21], namely, optimization methods driven by an explicit search for diversity of solutions rather than the minimization or maximization of a given objective function (as done in traditional goal-directed approaches). QD methods in fact allow for a better exploration of the search space, potentially discovering better-performing policies than goal-directed optimization [22, 23]. Among the existing QD algorithms, of particular interest, especially (but not only) in the context of DTs, is ME [23], which, as we will see later, is one of the main methods we use in the present study. This algorithm has in fact many desirable properties that can be particularly useful for IRL. First of all, differently from other (either goal-directed or QD) algorithms, ME allows the user to define specific feature descriptors (as we will see in Sect. 2.2.3, in our case, these are the tree depth and the action entropy) and explicitly makes use of these descriptors to explore the search space. These descriptors can be either problem-dependent or problem-agnostic (as the ones chosen in our experiments), so as to provide different levels of abstraction and knowledge in the exploration process (potentially, also in an interactive way). Furthermore, the possibility of defining a user-specified feature space comes with a simple yet intuitive visualization in the form of heatmaps (such as the ones we will show in Sect. 3), which easily allow the user to identify the most promising/effective areas of the search space. Another important advantage of ME is that it does not make use of gradients (differently from gradient-based, goal-directed optimization algorithms, such as DOMiNO [20]), hence it can also be applied to non-differentiable objective functions. Finally, another important feature of ME is its ease of implementation and use.

In our previous work [24], we have made a first step toward QD optimization of DTs for IRL. However, that work was limited to a single Evolutionary Algorithm (EA), namely GE, and a single QD scheme, namely ME.

In the present work, we extend the analysis previously presented in [24] by significantly increasing the scope of the experimentation. In particular, we address two relevant research questions in the context of optimization of DTs for IRL, namely: (1) What are the effects of the introduction of QD schemes in different EAs? (2) How do different QD schemes affect the optimization process? To address these questions, we employ two different EAs, namely, GE [16] and GP [17], and two different QD schemes, i.e., ME [23] and its recent variant CMA-ME [25], comparing them with their corresponding baseline goal-directed (i.e., with no QD scheme) EAs. We study the performance and the “illumination” patterns of the methods shown above in two well-known benchmarks from the OpenAI Gym suite [26]: CartPole-v1 and MountainCar-v0.

To summarize, the main contributions of the present work are the following:

  • We conduct an experimental study on two RL tasks, with two different EAs (GE and GP), combined with two different QD schemes, namely, ME and CMA-ME.

  • We compare the QD approaches with vanilla (i.e., goal-directed) versions of GE and GP, used as baselines, thus for a total of 6 different algorithms for each task.

  • For each task and algorithm, we analyze the fitness trend and the corresponding “illumination” capability w.r.t. a feature space described by the trees’ depth and the entropy of their actions.

The rest of the paper is structured as follows. The next section describes the methods used in our comparison. Then, Sect. 3 shows the numerical results. Finally, Sect. 4 concludes this work and suggests possible future works. Please note that the list of acronyms used throughout the paper is reported in Table 1.

Table 1 List of acronyms used in the paper

2 Methods

As mentioned above, our goal is to discover diverse IRL models for a given task. To do so, we evolve DTs by combining EAs with QD schemes, and RL (through Q-learning, see Sect. 2.3). More specifically, we compare different setups, involving two goal-directed EAs and two QD schemes. The two EAs we employ are GP [17] and GE [16]. The two QD schemes, instead, are the following:

  • ME [23]: in this scheme, the ME selection mechanism is applied on top of the EA used;

  • CMA-ME [25]: in this scheme, the solutions coming from the ME algorithm are further refined by using Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [27]. Finally, the ME scheme is applied to the refined solutions.

Moreover, we compare the results obtained with the two QD schemes to two goal-directed EAs (based, respectively, on vanilla GE and GP), which act as baselines.

2.1 Evolutionary algorithm

Here, we briefly present the EAs used in the experimentation.

2.1.1 Grammatical evolution

In the GE setup, the genotype of a DT (i.e., its encoding) is a vector \(\textbf{g}= (g_0,\dots , g_\mathrm{{size}});~g_i \in [0, M]\), where M is an integer whose value must be significantly larger than the number of productions for each of the rules, to ensure uniform probabilities for all the productions of all the rules.

As a mutation operator, we use a uniform random mutation: given a genotype (\(\textbf{g}\)), we replace a gene choosing a new value \(g_\mathrm{{new}} \sim \mathcal {U}(0, M)\), with probability \(p_g\), where \(\mathcal {U}(\cdot )\) denotes the uniform probability distribution.

As a crossover operator, instead, we use one-point crossover: given \(\textbf{p}_1\) and \(\textbf{p}_2\), we choose a random splitting point and create two new solutions by concatenating the two split genotypes.
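For concreteness, a minimal sketch (in Python) of these two variation operators is reported below; the gene bound M and the per-gene mutation probability are placeholder values and do not correspond to the settings used in our experiments.

```python
# Minimal sketch of the GE variation operators described above (not the exact implementation).
import random

M = 1000        # upper bound for gene values; must be much larger than the number of productions
P_GENE = 0.05   # per-gene mutation probability p_g (placeholder value)

def uniform_mutation(genotype):
    """Replace each gene, with probability P_GENE, by a new value drawn from U(0, M)."""
    return [random.randint(0, M) if random.random() < P_GENE else g for g in genotype]

def one_point_crossover(p1, p2):
    """Split both parents at a random point and concatenate the two halves."""
    cut = random.randint(1, min(len(p1), len(p2)) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
```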

The grammar used for GE in our experiments is shown in Table 2.

Table 2 Oblique grammar used in the experiments
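To illustrate the genotype-to-phenotype mapping, the sketch below decodes a genotype with a toy oblique grammar; the production rules shown here are purely illustrative placeholders, while the grammar actually used in the experiments is the one reported in Table 2.

```python
# Minimal sketch of GE decoding with a toy oblique grammar (the real grammar is in Table 2).
# Each expansion consumes one gene; the production is chosen with the usual modulo rule.
TOY_GRAMMAR = {
    "<dt>":    [["if", "<cond>", "then", "<dt>", "else", "<dt>"], ["<leaf>"]],
    "<cond>":  [["<const>", "*", "x_0", "+", "<const>", "*", "x_1", "<", "<const>"]],
    "<const>": [["-0.5"], ["0.0"], ["0.5"]],
    "<leaf>":  [["leaf"]],   # leaf actions are learned afterwards via Q-learning (Sect. 2.3)
}

def decode(genotype, start="<dt>", max_expansions=100):
    """Leftmost derivation: expand non-terminals, consuming genes (with wrapping)."""
    symbols, output, used = [start], [], 0
    while symbols and used < max_expansions:
        sym = symbols.pop(0)
        if sym in TOY_GRAMMAR:
            productions = TOY_GRAMMAR[sym]
            gene = genotype[used % len(genotype)]
            symbols = list(productions[gene % len(productions)]) + symbols
            used += 1
        else:
            output.append(sym)
    # individuals exceeding the expansion budget are typically penalized or discarded
    return " ".join(output)
```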

2.1.2 Genetic programming

When using GP, the genotype coincides with the DT. To enforce the interpretability of the solutions, we set a limit on the expression length.

The mutation operator works as follows. Initially, it randomly chooses a node. Then, if the chosen node is a leaf, it replaces it with a condition. Otherwise, if the chosen node is a condition, its expression is randomly modified.

To perform crossover between two trees, we randomly select two nodes, one for each tree, and swap them.

The functional set is composed of the following symbols: \(\texttt {if\_then\_else}\), \(\texttt {gt}\), \(\texttt {lt}\), +, -, \(\cdot\), /.

The terminal set, instead, is composed of leaves, input variables, and random constants.
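As an illustration (not the exact data structure used in our implementation), a GP individual can be encoded as a simple node-based tree, with crossover swapping random subtrees between two parents:

```python
# Minimal sketch of a GP genotype (node-based tree) and of subtree-swap crossover.
import copy
import random

class Node:
    def __init__(self, symbol, children=None):
        self.symbol = symbol        # e.g., "if_then_else", "gt", "lt", "+", a variable, a constant, or a leaf
        self.children = children or []

def all_nodes(tree):
    """Collect all nodes of the tree in pre-order."""
    nodes = [tree]
    for child in tree.children:
        nodes.extend(all_nodes(child))
    return nodes

def subtree_crossover(parent1, parent2):
    """Pick one random node per parent and swap the corresponding subtrees."""
    c1, c2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    n1, n2 = random.choice(all_nodes(c1)), random.choice(all_nodes(c2))
    n1.symbol, n2.symbol = n2.symbol, n1.symbol
    n1.children, n2.children = n2.children, n1.children
    return c1, c2
```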

2.2 Quality–diversity schemes

In this subsection, we briefly describe the QD schemes used in combination with the two EAs mentioned above.

2.2.1 MAP-Elites

ME (that stands for “Multi-dimensional Archive of Phenotypic Elites”) [23] is a QD algorithm that tries to maximize the “illumination” of the search space (i.e., the balance between exploration and exploitation) by maintaining a multi-dimensional archive of the best solutions, which are indexed using the values of their “feature descriptors” (which are typically based on problem-dependent, user-defined properties of the solutions). To have meaningful illumination patterns, it is extremely important to have descriptors that are orthogonal to the solutions’ fitnesses (i.e., the values of the objective function).

A descriptor can be defined as \(\textbf{d} \in \mathcal {D} = \{(d_0, \dots , d_n): d_i \in [\rm{min}_i, \rm{max}_i]; \forall i \in [0, n]\}\). A function \(\mathcal {F}: S \rightarrow \mathcal {D}\) is defined to compute the descriptors associated with any solution \(s \in S\).

The archive is structured as a multi-dimensional grid, where each dimension is divided into m equally-spaced bins.

When a new solution s is generated, its fitness and its descriptor \(\mathcal {F}(s)\) are computed. Then, if the corresponding location of the archive is empty, the solution is inserted in that cell of the grid. Otherwise, the fitness of the solution already present in the archive is checked: if it is worse than that of the new solution, the old solution is replaced by the new one; if the fitness of the new solution is equal to or worse than that of the existing one, the new solution is discarded.

When using ME, we first initialize the map by generating and evaluating \(\mathrm{init}_\mathrm{pop}\) random solutions. Then, we perform an iterative phase in which we sample solutions from the archive and generate new solutions through mutation and crossover (which depend on the use of GE or GP).
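A minimal sketch of this loop (assuming fitness maximization) is reported below; batch sizes, selection details, and operator probabilities are simplified with respect to our actual setup.

```python
# Minimal sketch of the MAP-Elites loop (fitness is maximized).
import random

def map_elites(random_solution, evaluate, descriptor, bin_index, variation,
               init_pop=100, iterations=10_000):
    archive = {}                                   # maps a grid cell (tuple of bin indices) to (fitness, solution)

    def try_insert(solution):
        fitness = evaluate(solution)
        cell = bin_index(descriptor(solution))     # discretize the descriptor into a grid cell
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, solution)    # keep only the best solution per cell

    for _ in range(init_pop):                      # initialization with random solutions
        try_insert(random_solution())
    for _ in range(iterations):                    # iterative phase: sample elites, vary, re-insert
        p1 = random.choice(list(archive.values()))[1]
        p2 = random.choice(list(archive.values()))[1]
        try_insert(variation(p1, p2))              # crossover + mutation from GE or GP (Sect. 2.1)
    return archive
```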

2.2.2 Covariance matrix adaptation MAP-Elites

CMA-ME [25] is a variant of ME that leverages the well-known CMA-ES algorithm [27]. The idea is to use the \(\mathrm{batch}_{n}\) solutions sampled for the new batch (i.e., the new set of candidate solutions sampled from ME) as the initial population for \(\mathrm{batch}_{n}\) parallel instances of CMA-ES. If CMA-ES is not able to improve a solution, at the next step of the algorithm the new candidate solutions are obtained through the mutation operator. This mechanism allows the algorithm to try to escape local optima.

Note that CMA-ES works on real-valued vectors, while ME works with GP or GE, as described before. To bridge this gap, we extract from the phenotype (i.e., the DT) all the real-valued constants used in the DT, which are then optimized by CMA-ES.
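The sketch below illustrates this constant-refinement idea using the `cma` Python package as a stand-in for the CMA-ES component; `extract_constants` and `with_constants` are hypothetical helpers, and the full emitter logic of CMA-ME follows [25].

```python
# Minimal sketch of refining the real-valued constants of a DT with CMA-ES.
# This illustrates the principle only; the full CMA-ME scheme follows [25].
import cma  # pycma package

def refine_constants(tree, evaluate, sigma0=0.5, budget=200):
    x0 = extract_constants(tree)                   # hypothetical helper: constants appearing in the DT
    es = cma.CMAEvolutionStrategy(x0, sigma0)
    spent = 0
    while spent < budget and not es.stop():
        candidates = es.ask()
        # CMA-ES minimizes, so the (maximized) RL fitness is negated
        losses = [-evaluate(with_constants(tree, c)) for c in candidates]   # hypothetical helper
        es.tell(candidates, losses)
        spent += len(candidates)
    return with_constants(tree, es.result.xbest)   # DT with the best constants found
```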

2.2.3 Feature descriptor

As explained earlier, ME (and its derived algorithms) need a descriptor function \(\mathcal {F}\) to map solutions to a location in the archive.

In the present study, we use a two-dimensional descriptor. The first dimension uses the entropy H of the actions taken by the agent: \(H(s) = - \sum \nolimits _{j=0}^{n_a - 1} f(j) \, \log _{n_a} f(j)\), where f(j) is the frequency of the j-th action in the list of actions taken by the solution s, and \(n_a\) is the number of available actions. Entropy allows us to measure the diversity in terms of action distribution. Note that we use the number of actions \(n_a\) as the base of the logarithm, hence \(H(s) \in [0, 1]\).

The second dimension of the descriptor, instead, measures a structural property of a solution: the depth of the DT. Note that, to make this descriptor more accurate, before computing the depth, we first execute a pruning on the DT, as described in [13].

It is important to note that the two features used for the descriptor are not to be considered as objectives, but as properties that are interesting for the study. For instance, one cannot say whether we want to maximize or minimize entropy, as the corresponding performance depends on the task at hand. On the other hand, one may need to have multiple solutions, each one with a different depth, to have the opportunity to choose the most appropriate model based on, e.g., hardware constraints.
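A minimal sketch of the descriptor computation is given below, assuming that the list of actions taken during the evaluation episodes is available and that the (pruned) DT exposes a `children` attribute per node; conventions (e.g., the depth assigned to a single leaf) may differ from our implementation.

```python
# Minimal sketch of the two-dimensional descriptor: action entropy (log base n_a) and tree depth.
import math
from collections import Counter

def action_entropy(actions, n_actions):
    """Entropy of the action distribution, with base n_a so that H lies in [0, 1]."""
    counts = Counter(actions)
    total = len(actions)
    return -sum((c / total) * math.log(c / total, n_actions) for c in counts.values())

def tree_depth(node):
    """Depth of the (pruned) DT; here a single leaf has depth 0 and a root condition with two leaves has depth 1."""
    if not node.children:
        return 0
    return 1 + max(tree_depth(child) for child in node.children)

def descriptor(pruned_tree, episode_actions, n_actions):
    return (action_entropy(episode_actions, n_actions), tree_depth(pruned_tree))
```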

2.3 Reinforcement learning

During the fitness evaluation phase, we perform RL on the leaves of the DTs, using \(\varepsilon\)-greedy Q-learning [18], meaning that, with probability \(\varepsilon\), we take a random action; otherwise, we choose the action with the highest estimated Q-value.

More specifically, each leaf of the DT represents a “macro-state” \(\sigma\), and the Q function is updated using the Bellman equation:

$$\begin{aligned} Q(\sigma , a) = (1 - \alpha )\, Q(\sigma , a) + \alpha \left( r + \gamma \max _{a'} Q(\sigma ', a') \right) \end{aligned}$$

where \(\sigma\) is a macro-state (i.e., a group of states that, navigating the DT, end in the same leaf), a is the action taken, \(\alpha\) is the learning rate, r is the reward received from the environment, \(\gamma\) is the discount factor (which tunes the importance of future rewards w.r.t. the current ones), and \(\sigma '\) is the next macro-state, reached after executing action a in the current macro-state.
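The following sketch shows how a leaf can hold its own Q-values and apply the update rule above with an \(\varepsilon\)-greedy action choice; hyperparameter values are placeholders (the actual ones are reported in Table 3).

```python
# Minimal sketch of epsilon-greedy Q-learning on the DT leaves (macro-states).
import random

class Leaf:
    def __init__(self, n_actions):
        self.q = [0.0] * n_actions                       # one Q-value per action

    def act(self, epsilon):
        if random.random() < epsilon:                    # exploration: random action
            return random.randrange(len(self.q))
        return max(range(len(self.q)), key=lambda a: self.q[a])   # exploitation: best Q-value

    def update(self, action, reward, next_leaf, alpha, gamma):
        target = reward + gamma * max(next_leaf.q)       # r + gamma * max_a' Q(sigma', a')
        self.q[action] = (1 - alpha) * self.q[action] + alpha * target
```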

The parameters used for the Q-Learning algorithm are shown in Table 3.

Table 3 Parameters used for Q-learning

2.4 Fitness evaluation

To evaluate the quality of the evolved DTs, we use two well-known OpenAI Gym [26] environments: CartPole-v1 and MountainCar-v0.

We simulate m episodes for each solution, in order to: (1) allow the RL algorithm to converge to a well-performing policy, and (2) obtain a reliable estimate of the quality of the DT. A simulation ends when either the task is solved or a predefined time limit is reached. Once the m simulations have been completed, we compute the fitness of the DT as the mean score across the m simulations.

Please note that, for the MountainCar-v0 environment, we normalize each variable composing the observations, since they have significantly different ranges of variation. To do so, we perform a min-max normalization (w.r.t. their respective ranges).
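A minimal sketch of this evaluation loop is shown below, using the (pre-Gymnasium) OpenAI Gym API; the `act`/`learn` hooks of the DT policy and the number of episodes are illustrative assumptions.

```python
# Minimal sketch of the fitness evaluation: m episodes per DT, mean score as fitness.
import gym

def evaluate(dt_policy, env_name="MountainCar-v0", m=10):
    env = gym.make(env_name)
    low, high = env.observation_space.low, env.observation_space.high
    scores = []
    for _ in range(m):
        obs, total, done = env.reset(), 0.0, False
        while not done:
            if env_name == "MountainCar-v0":
                obs = (obs - low) / (high - low)         # min-max normalization of the observation
            action = dt_policy.act(obs)                  # navigate the DT; epsilon-greedy at the leaf
            obs, reward, done, _ = env.step(action)
            dt_policy.learn(reward, obs)                 # assumed hook for the leaf Q-update
            total += reward
        scores.append(total)
    env.close()
    return sum(scores) / m                               # fitness = mean score over m episodes
```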

2.4.1 CartPole-v1

In the CartPole-v1 task, the goal of the agent is to keep a pole balanced by moving the cart on which it is mounted. The observations provided to the agent are: (1) x: the position of the cart, (2) v: the velocity of the cart, (3) \(\theta\): the angle of the pole, and (4) \(\omega\): the angular velocity of the pole. The agent can perform two actions: push the cart to the left, or push it to the right. The agent receives a reward of 1 for each timestep in which the pole is balanced and the cart is inside the bounds (i.e., \(|\theta | < 12^{\circ } \wedge |x| < 2.4\)), otherwise it receives a reward of 0. The simulation ends whenever \(|\theta | \ge 12^\circ\) or 500 timesteps have passed. This task is considered solved when the mean cumulative reward over 100 episodes is greater than or equal to 475.

2.4.2 MountainCar-v0

In the MountainCar-v0 task, the goal of the agent is to drive a car to the top of the right hill of a valley. To do so, the agent has to learn how to exploit the left hill to build momentum. The observation of the agent is composed of two variables: (1) x: the position of the car, and (2) v: the velocity of the car. The actions that the agent can perform are: (1) accelerate to the left, (2) do not accelerate, and (3) accelerate to the right. The agent receives a reward of \(-1\) for each step, and a reward of 0 when it reaches the top of the right hill. This reward function makes the problem hard to explore, since the agent experiences a reward greater than \(-1\) only when it completes the task. The simulation is terminated when the agent reaches the right hill or when the limit of 200 timesteps is reached. The task is considered solved when the mean cumulative reward, computed over 100 episodes, is greater than or equal to \(-110\).

3 Results

In this section, we quantitatively compare the setups described in Sect. 2.

For each setup (i.e., each combination of an EA and a QD scheme), we performed 5 independent runs in order to statistically assess the significance of the results. This number was chosen to have sufficient statistical evidence about the comparison between the different setups while keeping the computational cost of the experiments limited.

Table 4 Parameters used in the experiments

The parameters used for all the methods are shown in Table 4. Note that the bounds for the behavioral feature (i.e., the entropy) are different across the two tasks. In fact, while in MountainCar-v0 we use the entire co-domain for the entropy, in CartPole-v1 we only consider the interval [0.8, 1]. This is because, in preliminary experiments, we observed that the region [0.0, 0.8] is scarcely populated with good solutions. We hypothesize that the reason underlying this phenomenon is that, being CartPole-v1 a balancing task, the agent has to frequently switch between the two actions, leading to high entropy.

Regarding the interpretability of the solutions, the authors of [28] proposed a quantitative metric \(\mathcal {M}\) to measure the interpretability of a mathematical formula. Here, we use this metric to compare the interpretability of the solutions produced. Note, however, that instead of using the version of \(\mathcal {M}\) proposed in [28], we adopt the modified version from [13], as it is more general and can be used with any Machine Learning model. In fact, this version of \(\mathcal {M}\) is essentially a proxy for the model complexity (meaning that the higher \(\mathcal {M}\), the worse the interpretability), which is a general property of models. Moreover, it is worth mentioning that, for the schemes based on ME, we compute the \(\mathcal {M}\) value for the best-performing solution. In the case of ties, we choose the tree with minimum depth.

As for the “illumination” capability, we limit our analysis to a qualitative observation of how the two EAs fill the feature space.

3.1 CartPole-v1

As introduced before, we compare the results from both a performance and a diversity point of view.

For the GE setups, the top row of Fig. 1 shows the fitness trends of the best solutions found during the evolution. All the setups produce solutions capable of solving the task in less than 2000 fitness evaluations. Of note, GE+ME and GE+CMA-ME solve the task faster than GE, in terms of the number of fitness evaluations needed to converge.

A comparison of the results of our best DT (found across 5 runs) with the state-of-the-art is shown in Table 5. In the table, we can see that our method achieves the maximum score allowed by the environment, on par with most of the other methods (both interpretable and non-interpretable).

Concerning the illumination capability of the three setups, the bottom row of Fig. 1 shows the archives at the end of the evolution. Note that, in the case of GE, we consider all the solutions generated during the evolutionary process, rather than just the last generation, and fill the map a posteriori. In the case of GE+ME and GE+CMA-ME, instead, the map is filled during the evolutionary process, by construction. The results show that, while GE can find solutions that solve the task, its ability to illuminate the feature space is limited, as expected: in fact, the algorithm does not find a satisfactory number of diverse solutions. On the other hand, GE+ME finds at least one solution for each possible DT depth and level of entropy.

Regarding the behavioral feature, while GE+ME still finds a larger number of diverse, high-performing solutions, both GE vanilla and GE+ME seem to produce better results when the entropy values are in the range 0.9–0.92. This is probably due to the nature of the task, which requires high coordination between the two actions (Push Left/Push Right), leading to a similar frequency for the actions and, hence, high entropy. Interestingly, GE+CMA-ME found different solutions w.r.t. GE+ME, namely solutions with either a very high entropy level (between 0.98 and 1) or a relatively low entropy (in the interval 0.82–0.84). This suggests that this problem is multi-modal. Moreover, it appears that different QD schemes focus on different regions of the search space. Finally, it is worth noting that none of the considered QD schemes is able to illuminate well all the “promising” regions (i.e., \(H \in \{[0.82, 0.84), [0.9, 0.92), [0.92, 0.94), [0.98, 1.00]\}\)).

The results obtained using GP are different. Regarding performance (see Fig. 2, top row), we can observe that GP+ME converges more slowly than GP vanilla, while GP+CMA-ME converges faster than GP vanilla.

Regarding the illumination capabilities (Fig. 2, bottom row), the general observation that GP vanilla explores the search space less effectively than GP+ME/CMA-ME is still valid; in this case, however, GP+ME and GP+CMA-ME are able to find solutions that solve the task in most of the bins. Figure 3 shows some examples of DTs that solve the task, one for each GP setup.

Fig. 1

Results on the CartPole-v1 task (GE setups). Top: fitness trends (mean ± std. dev. across 5 runs at each step of the algorithm). Bottom: maps obtained with the three setups. The results in each bin are averaged over 5 runs

Fig. 2

Results on the CartPole-v1 task (GP setups). Top: fitness trends (mean ± std. dev. across 5 runs at each step of the algorithm). Bottom: maps obtained with the three setups. The results in each bin are averaged over 5 runs

Fig. 3

Representation of three DTs that solve the CartPole-v1 task (after simplification). In this case, all setups (including those not shown in the figure) are able to find solutions that solve the task based on a single condition

Table 5 Comparison of our results with SOTA methods on CartPole-v1.

3.2 MountainCar-v0

As for the MountainCar-v0 task, the top row of Fig. 4 shows the fitness trend for the GE setups. Similarly to the previous case, all the algorithms can solve the task. However, GE vanilla and GE+ME solve the task in a comparable number of evaluations, with the former slightly faster (\(11 \times 10^4\) vs. \(13 \times 10^4\) fitness evaluations). On the contrary, GE+CMA-ME is the fastest, finding a solution after only \(10^4\) fitness evaluations, probably due to the exploitation capabilities of the Covariance Matrix Adaptation (CMA) component. Note that, while GE+ME requires \(110\%\) of the fitness evaluations needed by GE to reach the same performance, GE+CMA-ME only requires \(10\%\), which means that this scheme may significantly reduce the time needed to train DTs for IRL.

A comparison of the results of our best DTs (found across 5 runs) with the state-of-the-art is shown in Table 6. Here, we observe that both GP+ME and GP+CMA-ME achieve state-of-the-art performance. Moreover, a two-tailed Welch t-test (with significance level \(\alpha =0.05\)) confirmed that the difference between these two methods and the previous state-of-the-art approach [13] is statistically significant.

The bottom row of Fig. 4 shows the archives at the end of the evolution for the three GE setups. Similarly to the CartPole-v1 case, GE+ME and GE+CMA-ME illuminate the feature space better than GE vanilla, covering \(97\%\) of the bins in all 5 runs, whereas GE vanilla concentrates on a small portion of the feature space. Overall, we can observe that the three setups find high-performing solutions in different areas of the feature space. Regarding the behavioral feature, while the DTs found by GE present a high entropy level (as in the CartPole-v1 task), GE+ME also produces DTs with lower entropy, i.e., with behaviors in which at least one action is less frequent than the others. GE+CMA-ME pushes the search toward solutions with even lower entropy (smaller than 0.1), finding trees that use, for the majority of the time, just a single action (accelerate left). As for the structural feature, we can observe that, as in the CartPole-v1 task, GE focuses only on small DTs (of depth 2 to 4), while GE+ME and GE+CMA-ME produce solutions that cover the entire range of tree depths [1, 10].

Of note, GE+ME and GE+CMA-ME also produce DTs with a depth equal to 1, meaning that the maximum number of leaves is 2. Hence, the entropy, in these cases, is bounded above by approximately 0.63, the value attained when the two actions have the same frequency (recall that we compute the entropy using the number of actions as the base of the logarithm, see Sect. 2.2.3). Figure 6 shows a representation of two example DTs.
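For completeness, the 0.63 bound follows directly from the entropy definition of Sect. 2.2.3, with \(n_a = 3\) actions and two equally frequent actions:

$$\begin{aligned} H = - \sum _{j} f(j)\, \log _{3} f(j) = - 2 \cdot \tfrac{1}{2}\, \log _{3} \tfrac{1}{2} = \log _{3} 2 \approx 0.63. \end{aligned}$$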

With the GP setups, we can observe a similar behavior. However, in this case, both versions based on ME converge faster than GP vanilla, which slowly increases its performance, solving the task only after \(12 \times 10^4\) fitness evaluations (see the top row of Fig. 5). Regarding the illumination capabilities, shown in the bottom row of Fig. 5, similarly to the GE setups, GP+ME and GP+CMA-ME achieve a better exploration of the feature space, finding several high-performing solutions. However, GP vanilla shows a better illumination performance than GE vanilla, probably due to the different mutation and crossover operators.

Fig. 4

Results on the MountainCar-v0 task (GE setups). Top: fitness trends (mean ± std. dev. across 5 runs at each step of the algorithm). Bottom: maps obtained with the three setups. The results in each bin are averaged over 5 runs

Fig. 5

Results on the MountainCar-v0 task (GP setups). Top: fitness trends (mean ± std. dev. across 5 runs at each step of the algorithm). Bottom: maps obtained with the three setups. The results in each bin are averaged over 5 runs

Fig. 6

Representation of two DTs that solve the MountainCar-v0 task (after simplification). GE finds solutions that use all three actions (see Sect. 2.4.2), hence the depth of the DT is 2, while GE+ME also finds solutions that do not use the Do Not Accelerate action, making it possible to produce a DT with a depth of 1

Table 6 Comparison of our results with SOTA methods on MountainCar-v0.

4 Conclusions and future works

In this paper, we have applied two QD schemes, namely, ME and CMA-ME, to a hybrid approach combining evolutionary optimization and RL for finding a diverse collection of interpretable models. In our experiments, we combined the two QD schemes with two EAs, namely, GE and GP. By testing all the combinations of schemes and EAs on two tasks from OpenAI Gym, we drew insights into the capabilities of each setup, in terms of performance, efficiency, and exploration capabilities. Our experimental findings mainly suggest that different QD schemes achieve different illumination patterns, meaning that each algorithm explores the feature space differently.

In summary, we observed that ME and CMA-ME find high-performing solutions while “illuminating” the feature space in a more efficient way w.r.t. the baseline approaches without QD. We also observed different behaviors, in terms of illumination capabilities, between GE and GP. This suggests that the encoding and the mutation/crossover operators highly contribute to the illumination capabilities of the algorithm, as discussed in other works [32,33,34] outside the context of IRL.

In future works, we will extend this study to more recent variants of ME, such as those proposed in [25, 35], and to more challenging RL tasks, such as tasks with larger observation and action spaces, as well as tasks with delayed rewards and/or partial observations. Our intuition is that, by leveraging better exploration, QD algorithms may be particularly effective in those scenarios. Moreover, we will investigate the scalability of ME schemes w.r.t. the number of features used in the descriptor. In fact, while in this work we used only one behavioral and one structural feature, in some specific applications one may need to define more than two features, e.g., to describe non-functional requirements of the solutions. Another interesting research direction would be to introduce interactions with the user during the search process, as done in [36]. In fact, incorporating user feedback during the search may complement the natural tendency of QD algorithms to explore the feature space, while also allowing the optimization process to focus on the areas of the search space that are more interesting from the user's perspective.