Model-free inverse reinforcement learning with multi-intention, unlabeled, and overlapping demonstrations

In this paper, we define a novel inverse reinforcement learning (IRL) problem where the demonstrations are multi-intention, i.e., collected from multi-intention experts, unlabeled, i.e., without intention labels, and partially overlapping, i.e., shared between multiple intentions. In the presence of overlapping demonstrations, current IRL methods, developed to handle multi-intention and unlabeled demonstrations, cannot successfully learn the underlying reward functions. To address this limitation, we propose a novel clustering-based approach to disentangle the observed demonstrations and experimentally validate its advantages. Traditional clustering-based approaches to multi-intention IRL, which are developed on the basis of model-based Reinforcement Learning (RL), formulate the problem using parametric density estimation. However, in high-dimensional environments with unknown system dynamics, i.e., model-free RL, the solution of parametric density estimation is only tractable up to the density normalization constant. To solve this, we formulate the problem as a mixture of logistic regressions to directly handle the unnormalized density. To study the challenges posed by overlapping demonstrations, we introduce the concepts of shared pair, which is a state-action pair that is shared by more than one intention, and separability, which reflects how well the multiple intentions can be separated in the joint state-action space. We provide theoretical analyses under the global optimality condition and the existence of shared pairs. Furthermore, we conduct extensive experiments on four simulated robotics tasks, extended to accept different intentions with specific levels of separability, and a synthetic driver task developed to directly control the separability. We evaluate the existing baselines on our defined problem and demonstrate, theoretically and experimentally, the advantages of our clustering-based solution, especially when the separability of the demonstrations decreases.


Introduction
In the last few decades, there has been a surge of interest in the task of learning from demonstrations (LfD) in various domains (Belogolovsky et al., 2021; Kangasrääsiö & Kaski, 2018; Neu & Szepesvári, 2009). In an LfD task, the agent learns a mapping, called a policy, from the world states to actions, solely by observing the experts' demonstrations, which are sequences of state-action pairs. Even though the policy can be directly learned from demonstrations in a supervised learning fashion, the recovery of the reward function via Inverse Reinforcement Learning (IRL) has been shown to provide a much more compact and generalizable description of behaviors (Ng et al., 2000).
In many IRL tasks, the demonstrations are collected from experts with multiple and inherently different intentions. In this situation, a single reward function is inadequate to cover all the demonstrations, and multiple reward functions are required to clearly express the differences between the multiple experts' intentions. To be able to recover multiple reward functions from the experts' demonstrations, each demonstration should be accompanied by an extra piece of information that indicates to which intention it belongs. However, such information is often unavailable in real-world scenarios or too expensive to add manually. This leads to the relatively new problem of multi-intention and unlabeled demonstrations, i.e., the demonstrations are without intention labels. Even though the IRL problem with Multi-intention and Unlabeled Demonstrations, referred to as IRL-MUD, is potentially relevant to many real-world applications of IRL, a search of the literature reveals that this domain has received relatively little attention from the community.
It is commonplace to distinguish the IRL-MUD studies based on the dimensionality of the joint state-action space and the knowledge about the system dynamics. The studies either address problems with low-dimensional spaces and known dynamics models, referred to as model-based IRL-MUD, e.g., Babes et al. (2011); Bighashdel et al. (2021), or they consider high dimensionality in the joint state-action space with an unknown dynamics model, referred to as model-free IRL-MUD, e.g., Li et al. (2017); Hsiao et al. (2019). Model-based approaches normally solve the IRL-MUD problem using methods from parametric density estimation, which involves selecting a family of distributions and employing a clustering algorithm, e.g., Expectation Maximization (EM), to estimate both the reward parameters and the intention labels of the demonstrations (Babes et al., 2011; Rajasekaran et al., 2017). Although this clustering-based technique is straightforward in model-based approaches, it is more difficult to apply in a model-free setting, i.e., high-dimensional spaces and unknown dynamics models, which is the focus of this work, due to the following two challenges.
Estimation of the partition function. Model-based approaches assume that the experts sample the demonstrations according to a Boltzmann distribution, parameterized up to the normalization factor, i.e., the partition function. Given the knowledge of system dynamics and an efficient Reinforcement Learning (RL) solver like dynamic programming, the partition function can be analytically computed in the inner loop of the reward learning algorithm (Ng et al., 2000; Ziebart et al., 2008). In model-free settings, the partition function cannot be expressed analytically, and it is computationally expensive to completely solve the RL problem via approximate methods in each iteration during reward learning. A practical approach, however, is to interleave the reward learning with a policy optimization procedure and adapt the sampling distribution in the policy optimization to estimate the partition function with importance weights. Unfortunately, this approach is shown to have high variance in the cases where the sampling distribution fails to cover the trajectories with high rewards. When the demonstrations have only one intention, this coverage problem can be addressed by mixing the generated samples with some samples from demonstrations with an estimated distribution, e.g., a Gaussian distribution. However, when it comes to model-free IRL-MUD, the distribution of multi-intention demonstrations is multi-modal and, therefore, more challenging to estimate. We later show (in Table 3) that using a fixed family of distributions like the Gaussian Mixture Model (GMM) performs poorly in estimating the distribution of the demonstrations.
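To make the coverage issue concrete, the following minimal sketch estimates a partition function by importance sampling on a toy, five-element state-action space. Everything in it is an illustrative assumption rather than part of the method: the reward values, the two proposal distributions, and the helper names `true_partition`, `importance_estimate`, and `estimator_variance`. It only illustrates that a proposal placing little mass on high-reward pairs inflates the variance of the importance weights.

```python
import math
import random


def true_partition(rewards):
    """Exact partition function Z = sum_x exp(r(x)) of a tabular Boltzmann model."""
    return sum(math.exp(r) for r in rewards)


def importance_estimate(rewards, proposal, n_samples, rng):
    """Monte Carlo estimate of Z from x ~ proposal, weighted by exp(r(x)) / proposal(x)."""
    indices = rng.choices(range(len(rewards)), weights=proposal, k=n_samples)
    weights = [math.exp(rewards[x]) / proposal[x] for x in indices]
    return sum(weights) / n_samples


def estimator_variance(rewards, proposal):
    """Exact variance of a single importance weight under the proposal."""
    z = true_partition(rewards)
    second_moment = sum(math.exp(2.0 * r) / q for r, q in zip(rewards, proposal))
    return second_moment - z * z


# Hypothetical toy problem: index 2 is the high-reward state-action pair.
REWARDS = [0.0, 1.0, 2.0, 0.5, 1.5]
COVERING = [0.2, 0.2, 0.2, 0.2, 0.2]   # covers every pair equally
POOR = [0.4, 0.1, 0.02, 0.38, 0.1]     # rarely visits the high-reward pair

if __name__ == "__main__":
    rng = random.Random(0)
    print("exact Z:", true_partition(REWARDS))
    print("estimate (covering proposal):", importance_estimate(REWARDS, COVERING, 20000, rng))
    print("weight variance, covering:", estimator_variance(REWARDS, COVERING))
    print("weight variance, poor coverage:", estimator_variance(REWARDS, POOR))
```

In this toy setting the variance can be computed in closed form, which is not possible in the high-dimensional, model-free case; there the same effect has to be mitigated by adapting or mixing the sampling distribution.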
Estimation of the posterior distribution. Since in our IRL-MUD problem the demonstrations are without intention labels, the posterior probabilities of the latent intentions, given the model parameters, need to be estimated. This estimation is normally done via Bayes' rule. In model-free IRL-MUD, the true likelihood function is not known and, at best, can be estimated with the sampling distribution of the policy optimizer. However, even with a known likelihood function, the high dimensionality of the state-action space makes it impractical to analytically compute the posterior probabilities for all demonstrations and intentions via Bayes' rule. Therefore, alternative approximate methods are required to efficiently estimate the posterior distribution of the latent intentions.
Given the two key challenges discussed above, applying a standard clustering-based technique like EM to model-free IRL-MUD remains an unsolved problem, which we aim to tackle in this work. Due to these challenges, a greater focus in the literature has been placed upon non-clustering approaches that solve the problem of model-free IRL-MUD by directly inferring the structures of the latent intentions. Two non-clustering methods, namely InfoGAIL (Li et al., 2017) and IntentionGAIL (Hausman et al., 2017), have shown relative success in various experiments compared to methods with single-intention assumptions. However, one critical question remains about their performance. Although both models propose a model-free IRL-MUD solution, it is not clear to what extent the multi-intention demonstrations need to be separable in the state-action space for these methods to work successfully. Therefore, in this work, we focus on the setting in which the multi-intention and unlabeled demonstrations can be partially overlapping. We refer to this setting as the problem of model-free IRL-MUD with Overlapping demonstrations (IRL-MUD-O).
For the sake of clarity, we discuss the following scenario that is depicted in Fig. 1. Here, the demonstrations, i.e., the sequences of state-action pairs, are the paths driven by an expert driver to one of the three colored destinations: blue, green, and brown, each of which is considered as one intention. What can be clearly seen in this figure is that, regardless of the different intentions, the demonstrations partially overlap as the paths to different destinations go over the same road. From the state-action point of view, all the state-action pairs from the start point to the first exit are shared between the three destinations. Although methods like InfoGAIL allocate an intention-specific policy for each intention, we later show in Sect. 4.2 that due to their specific reward structures in the policy optimization phase, a shared, perfectly imitated state-action pair receives a lower reward. The reward, however, should be proportional to the imitation quality rather than to the degree to which the demonstrations overlap with the imitated behavior.

Fig. 1 An example of partially overlapping demonstrations. The three demonstrations, i.e., sequences of state-action pairs, are the paths driven by an expert driver to one of the three colored destinations: blue, green, and brown, each of which is considered as one intention (Color figure online)

Previous studies have failed to demonstrate convincing evidence that the non-clustering solutions, e.g., InfoGAIL and IntentionGAIL, for the problem of model-free IRL-MUD are also valid when the demonstrations overlap. We claim and experimentally validate that a suitable clustering-based approach performs better in these situations.
In summary, the primary goal of this work is twofold:
• We propose a practical clustering-based approach via EM to the problem of model-free IRL where the demonstrations are multi-intention, unlabeled, and partially overlapping;
• We demonstrate, theoretically and experimentally, the benefits of our clustering-based approach over non-clustering ones, e.g., InfoGAIL and IntentionGAIL.
To accomplish our goals, we first provide suitable definitions for the problems of model-free IRL-MUD and model-free IRL-MUD-O. The latter is done by introducing the concept of the shared pair (Sect. 3). Then, we propose a solution for the problem of model-free IRL-MUD in Sect. 4.1 by addressing the aforementioned challenges. To address the challenge regarding the estimation of the partition function, we propose a mixture of logistic regressions, where the partition function is considered as a model parameter. As to the impracticality of estimating the posterior distributions, we propose a network, called the posterior network, which directly outputs an estimate of the posterior probabilities. In Sect. 4.2, we use our proposed clustering-based solution for the problem of model-free IRL-MUD-O and provide theoretical analysis to prove the correctness of our solution. Furthermore, we give experimental comparisons with non-clustering solutions to further demonstrate the advantages of our proposed solution (Sect. 5). This is done by (1) introducing the metric of separability to measure the level of overlap in the demonstrations, (2) extending well-known simulated robotics environments to cover multiple intentions with different separability levels, and (3) developing a synthetic driver environment where the separability can be directly controlled. The experimental results show that our clustering-based approach outperforms the state-of-the-art methods.

Related works
The research in IRL can be categorized into three frameworks: (1) Single-intention IRL, (2) Multi-intention IRL, and (3) Meta-IRL. Figure 2 illustrates an overview of each framework. Although all of these research directions target the LfD task, they fundamentally differ in their problem definition. In the following subsections, we clearly define the objectives of the three aforementioned IRL frameworks and provide an overview of the proposed methods in each research direction.

Single-intention IRL
Given the expert's demonstration set $D$ and intention set $I$, the single-intention IRL methods assume that the expert has only one intention, i.e., $|I| = 1$ (see Fig. 2). The objective of the single-intention IRL methods is to perfectly imitate the expert's demonstration set $D$. If the imitated behavior set is denoted by $\hat{D}$, the single-intention IRL methods try to minimize the distance function $d(D, \hat{D})$, which is normally done through recovering the reward function that justifies the expert's behaviors.
A great deal of previous research into the IRL task has focused on single-intention IRL. This line of research has addressed various aspects, including reward ambiguity and the problem of degeneracy (Ratliff et al., 2006; Ng et al., 2000), constraints on the demonstrations distribution (Ziebart et al., 2008), nonlinearity of the reward function (Wulfmeier et al., 2015), high dimensionality of state-action spaces (Fu et al., 2018; Ho & Ermon, 2016), outperforming the demonstrators (Yu et al., 2020), and imperfect and noisy demonstrations (Tangkaratt et al., 2021; Wu et al., 2019). Various survey articles have also extensively reviewed the challenges and future research trends in this domain (Fang et al., 2019; Hussein et al., 2017; Zheng et al., 2021).

Multi-intention IRL
In the multi-intention IRL approach, which is the focus of this work, it is assumed that the expert has multiple intentions, i.e., $|I| > 1$, and the expert's demonstration set $D$ typically does not include the intention labels, i.e., Inverse Reinforcement Learning with Multi-intention and Unlabeled Demonstrations (IRL-MUD). As in the case of single-intention IRL, the objective of multi-intention IRL methods is the perfect imitation of the expert's demonstration set $D$ (see Fig. 2). As the demonstrations are unlabeled, the multi-intention IRL methods try to infer the intention labels and the corresponding reward functions in $D$ in order to minimize the distance function $d(D, \hat{D})$, where $\hat{D}$ is the imitated behaviors.

Fig. 2 The schematics of problem definition in three research directions: single-intention IRL, multi-intention IRL, and meta-IRL. Given the expert's demonstration set $D$ and intention set $I$, the single-intention IRL methods assume $|I| = 1$ and try to minimize the distance function $d(D, \hat{D})$, where $\hat{D}$ is the imitated behaviors. While minimizing the same distance function, the multi-intention IRL methods assume that the expert has multiple intentions, $|I| > 1$, and the expert's demonstration set $D$ typically does not include the intention labels. The meta-IRL methods address rapid adaptation to unknown environments that have, so far, unseen sets of target behavior $D_t$ and intention $I_t$. The meta-IRL methods minimize $d(D_t, \hat{D}_t)$, where $D_t$, $|D_t| \ll |D|$, is the new target demonstration set with the new target intention set $I_t$, $i \neq i_t$ for all $i \in I$ and $i_t \in I_t$, and $\hat{D}_t$ is the imitated target behaviors (Color figure online)

Although it is natural in this research direction to assume that the demonstrations are unlabeled, several studies have simplified the problem by considering the intention label as an available piece of information (Likmeta et al., 2021; Lin & Zhang, 2018; Nikolaidis et al., 2015; Ramponi et al., 2020). However, providing the intention labels can be too expensive or practically impossible in many real-world applications, and researchers try to address the problem of unlabeled demonstrations, i.e., IRL-MUD. In an early work, Babes et al. (2011) turned the problem into a parametric density estimation of the experts' demonstrations. Using EM, they tried to iteratively cluster the observed demonstrations and estimate the parameters of the reward functions per cluster. The idea of density estimation has been applied by several authors to the problem of IRL-MUD (Almingol et al., 2013; Bighashdel et al., 2021; Choi & Kim, 2012; Michini et al., 2015; Michini & How, 2012; Rajasekaran et al., 2017; Ranchod et al., 2015). Despite the promising results, these methods are developed to handle low-dimensional problems with the knowledge of system dynamics, i.e., the model-based setting.
There are a number of works that address the problem of model-free IRL-MUD (Hsiao et al., 2019; Hausman et al., 2017; Lin & Zhang, 2018; Li et al., 2017; Morton & Kochenderfer, 2017; Wang et al., 2017). Except for ACGAIL (Lin & Zhang, 2018) and Goal-GAIL (Ding et al., 2019), where the demonstrations are labeled, all of the proposed methods have focused on a direct inference of the latent intentions in an unsupervised manner by employing generative models, namely variational auto-encoders (Hsiao et al., 2019; Morton & Kochenderfer, 2017; Wang et al., 2017) and Generative Adversarial Networks (GAN) (Hausman et al., 2017; Li et al., 2017). Hausman et al. (2017) and Li et al. (2017), in parallel studies, proposed two equivalent models, referred to as IntentionGAIL and InfoGAIL, respectively, as the extensions of Generative Adversarial Imitation Learning (GAIL) (Ho & Ermon, 2016) for the problem of model-free multi-intention IRL. Inspired by Chen et al. (2016), they tried to infer the latent structures by optimizing the mutual information between the intentions and the demonstrations in the GAN training. In comparison, we propose a clustering-based approach and demonstrate the benefits of our method in imitating the experts' behaviors when the multi-intention demonstrations are overlapping (see Sect. 5).

Meta-IRL
Another line of research in IRL is associated with the problem of generalization in the LfD task. When applied to unseen environments, satisfactory performance of the above-discussed IRL methods requires training from scratch. To avoid this, researchers have developed IRL methods in the framework of meta-learning, where the main goal is cross-domain generalization. The meta-IRL methods seek to exploit the structural similarity among a distribution of environments and optimize for rapid adaptation to unknown environments with a limited amount of data. Given the expert's demonstration set $D$ and intention set $I$, the meta-IRL methods try to perfectly imitate the so far unseen target demonstration set $D_t$ with the target intention set $I_t$, where $|D_t| \ll |D|$ and $i \neq i_t$ for all $i \in I$ and $i_t \in I_t$ (see Fig. 2). Therefore, the objective of the meta-IRL methods is to minimize the distance function $d(D_t, \hat{D}_t)$, where $\hat{D}_t$ is the imitated target behaviors. A meta-IRL method considers the available demonstrations $D$ as a prior for $D_t$ that can then be used to efficiently learn the structure of new demonstrations with new intentions. As the reward function is able to succinctly capture the structure of demonstrations, meta-IRL methods aim at quickly inferring the new reward functions governing the new demonstrations $D_t$. The goal, therefore, is not to acquire good reward functions that explain the expert's demonstration set $D$, but rather to learn reward functions that can be quickly and efficiently adapted to the new target demonstration set $D_t$.
In the LfD literature, the meta-learning methods are commonly developed by extending the single-intention and multi-intention approaches into the meta-learning frameworks. Table 1 shows an overview of various meta-learning methods and demonstrates their relations with their single-intention and multi-intention counterparts. In contrast to the multi-intention IRL approaches, the meta-IRL methods normally assume that the demonstration sets are labeled (Gleave & Habryka, 2018; Seyed Ghasemipour et al., 2019; Xu et al., 2019; Wang et al., 2021). Nevertheless, in the case of unlabeled demonstrations, the ideas from multi-intention IRL are employed, as done by Yu et al. (2019). The authors have proposed a meta-IRL method by extending the Adversarial IRL (Fu et al., 2018) model to a meta-learning framework with unlabeled demonstrations. Similar to InfoGAIL (Li et al., 2017), Yu et al. (2019) infer the intentions by optimizing the mutual information between the latent intention variable and the induced demonstrations. They further employ intention-specific regularization terms in the reward functions for more efficient reward adaptation in the case of new intentions. The authors compare their proposed meta-IRL method with a meta-learning variant of InfoGAIL (Li et al., 2017), called Meta-InfoGAIL, and demonstrate superior performance in various meta-learning tasks (Yu et al., 2019).
As one can see, this line of research is parallel to the research direction of multi-intention IRL, and any direct comparison between the methods of these two research directions requires modification on the problem formulation level. In this paper, we consider the setting of model-free IRL with multi-intention and unlabeled demonstrations, i.e., model-free IRL-MUD, and consequently, we only compare our method with the methods of this line of research. In the following sections, we define our problems in more detail and propose our novel approach.

Problem definition
In this section, we provide the exact definitions of the IRL problems discussed in Sect. 1, namely (1) model-free IRL-MUD and (2) model-free IRL-MUD-O. To accomplish this, we first define the task of RL with Multiple Reward Functions (RL-MRF), i.e., multiple intentions where each intention corresponds to one reward function.
is the initial state distribution, $I$ is the set of intentions with prior probability $p(i)$, and $R$ is the set of intention-specific reward functions $R = \{r_i \mid \forall i \in I\}$, where $r_i : S \times A \to \mathbb{R}$. We further define a policy as $\pi : S \times A \to [0, 1]$. In model-free RL-MRF, the dynamics model and the reward functions are not available in explicit forms. Therefore, model-free RL approaches are employed to approximate the optimal policy:

Definition 1 The task of model-free RL-MRF is defined as approximating the optimal policy $\pi^*(a|s) = \mathbb{E}_{i \sim p(i)}[\pi^*(a|s, i)]$ that maximizes the expected discounted reward.

Table 1 The connection between single-intention LfD methods and their multi-intention and meta-learning variants

For the optimal policy $\pi^*(a|s)$, its occupancy measure $p(s, a)$ is defined as in Ho and Ermon (2016) (Eq. 1). The occupancy measure is interpreted as the distribution of state-action pairs that an agent with the optimal policy (referred to as an expert) encounters in navigating the environment. Therefore, expectations under the expert's policy can be written as expectations with respect to the occupancy measure (Ho & Ermon, 2016).

In IRL-MUD, the reward functions $R = \{r_i \mid \forall i \in I\}$ are unknown, and instead a set of experts' unlabeled demonstrations $D$, i.e., without the intention labels, is given. An expert's demonstration is defined as a sequence of state-action pairs, $(s_0, a_0), (s_1, a_1), \ldots$, induced by an optimal policy which is assumed to be the solution of RL-MRF (Definition 1). Given the definition of the occupancy measure, the experts' demonstrations can be perceived as a set of state-action pairs $D = \{s, a \sim p(s, a)\}$, where $p(s, a)$ corresponds to the optimal policy $\pi^*(a|s)$ through Eq. (1).
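As a minimal illustration of the occupancy measure, the sketch below assumes the standard discounted form from Ho and Ermon (2016), $\rho_\pi(s, a) = \pi(a|s) \sum_{t} \gamma^t P(s_t = s \mid \pi)$, on a hypothetical two-state, two-action tabular MDP. The transition table `P`, policy `PI`, initial distribution `RHO0`, rewards `R`, and discount `GAMMA` are made-up values, and both function names are illustrative. The check at the end compares the reward summed over the occupancy measure with the expected discounted reward obtained by policy evaluation.

```python
def occupancy_measure(P, pi, rho0, gamma, horizon=2000):
    """Truncated discounted occupancy rho(s, a) = pi(a|s) * sum_t gamma^t P(s_t = s)."""
    n_states, n_actions = len(rho0), len(pi[0])
    visitation = [0.0] * n_states   # accumulates sum_t gamma^t P(s_t = s)
    state_dist = list(rho0)         # P(s_t = s) at the current step t
    discount = 1.0
    for _ in range(horizon):
        for s in range(n_states):
            visitation[s] += discount * state_dist[s]
        next_dist = [0.0] * n_states
        for s in range(n_states):
            for a in range(n_actions):
                for s2 in range(n_states):
                    next_dist[s2] += state_dist[s] * pi[s][a] * P[s][a][s2]
        state_dist = next_dist
        discount *= gamma
    return [[pi[s][a] * visitation[s] for a in range(n_actions)] for s in range(n_states)]


def expected_return(P, pi, rho0, r, gamma, sweeps=2000):
    """Expected discounted reward of the policy via iterative policy evaluation."""
    n_states, n_actions = len(rho0), len(pi[0])
    V = [0.0] * n_states
    for _ in range(sweeps):
        V = [
            sum(
                pi[s][a] * (r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states)))
                for a in range(n_actions)
            )
            for s in range(n_states)
        ]
    return sum(rho0[s] * V[s] for s in range(n_states))


# Hypothetical 2-state, 2-action MDP (all numbers are illustrative).
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]   # P[s][a][s']
PI = [[0.7, 0.3], [0.4, 0.6]]                               # pi[s][a]
RHO0 = [1.0, 0.0]                                           # initial state distribution
R = [[1.0, 0.0], [0.0, 2.0]]                                # r[s][a]
GAMMA = 0.9

if __name__ == "__main__":
    rho = occupancy_measure(P, PI, RHO0, GAMMA)
    via_occupancy = sum(rho[s][a] * R[s][a] for s in range(2) for a in range(2))
    print(via_occupancy, expected_return(P, PI, RHO0, R, GAMMA))
```

Note that this unnormalized occupancy sums to $1/(1-\gamma)$; dividing by that constant turns it into a proper distribution over state-action pairs.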

Problem 1
The problem of model-free IRL-MUD is defined as finding the pseudo-reward function $R(s, a) = \mathbb{E}_{i \sim p(i)}[R_i(s, a; \theta)]$ from a set of experts' unlabeled demonstrations $D$ such that the background distribution $q(s, a) = \mathbb{E}_{i \sim p(i)}[q(s, a|i)]$, induced by the approximated optimal policy under the pseudo-reward function, $\pi^*_R(a|s) = \mathbb{E}_{i \sim p(i)}[\pi^*_R(a|s, i)]$, matches $p(s, a)$. Similar to Eq. (1), the background distribution relates to the policy $\pi^*_R(a|s)$ through Eq. (3).

As discussed in Sect. 2.2, a number of studies have proposed solutions for the problem of model-free IRL-MUD via non-clustering approaches. These studies lack any evidence to support the validity of their proposed solutions in environments where the demonstrations overlap. The overlapping demonstrations can be characterized by the existence of shared pairs (Definition 2).
In other words, an imitated state-action pair is a shared pair when it is shared between more than one intention. Using the definition above, we can define the problem that is the focus of this work:

Problem 2
The problem of model-free IRL-MUD-O is defined as a model-free IRL-MUD problem in the presence of shared pairs.
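The notion of a shared pair can be illustrated with a small sketch modeled on the driver scenario of Fig. 1. It is a simplification under stated assumptions: state-action pairs are discrete tuples, the intention labels are used only to make the overlap visible (in the actual problem they are not observed), and the names `shared_pairs` and `DEMOS` as well as the road layout are hypothetical. The formal definition of a shared pair is not reproduced here.

```python
from collections import defaultdict


def shared_pairs(demos_by_intention):
    """Map each state-action pair seen under more than one intention to those intentions."""
    intentions_of = defaultdict(set)
    for intention, trajectories in demos_by_intention.items():
        for trajectory in trajectories:
            for pair in trajectory:
                intentions_of[pair].add(intention)
    return {pair: ids for pair, ids in intentions_of.items() if len(ids) > 1}


# Hypothetical driver example: three destinations reached over the same road.
DEMOS = {
    "blue": [[("start", "forward"), ("road", "forward"), ("exit1", "turn")]],
    "green": [[("start", "forward"), ("road", "forward"), ("exit1", "forward"),
               ("exit2", "turn")]],
    "brown": [[("start", "forward"), ("road", "forward"), ("exit1", "forward"),
               ("exit2", "forward"), ("exit3", "turn")]],
}

if __name__ == "__main__":
    for pair, ids in sorted(shared_pairs(DEMOS).items()):
        print(pair, sorted(ids))
```

Here the pairs before the first exit are shared by all three intentions, while the pair that continues past the first exit is shared by two of them, which is the kind of overlap the problem statement targets.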
To solve the problem of model-free IRL-MUD-O, we propose and study a clustering-based approach that is detailed in the next section.

Approach
In this section, we first introduce our clustering-based solution for Problem 1, model-free IRL-MUD (Sect. 4.1). Then, in Sect. 4.2, we verify the validity of our solution for Problem 2, model-free IRL-MUD with overlapping demonstrations. This is done by providing the theoretical analysis presented in Theorem 2 and Corollary 2.

Model-free IRL-MUD
To provide a clustering-based solution for the problem of model-free IRL-MUD, we first use parametric estimation of the occupancy measure $p(s, a)$. Then, we show the intractability of the solution by addressing the aforementioned challenges (see Sect. 1) and propose a new tractable approach using a mixture of logistic regressions.
Due to the multi-intention nature of the demonstrations, we model the occupancy measure $p(s, a)$ as a mixture of Boltzmann distributions, $p(s, a; \theta)$, parameterized by $\theta$, where the energy is given by the parameterized, intention-specific reward functions $r_i(s, a; \theta)$ and $Z_i(\theta)$ are the intention-specific partition functions.
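As a hedged illustration of the mixture just described, one standard way to write it is given below. It assumes that $\theta$ denotes the reward parameters, that $p(i)$ is the intention prior, and that the maximum-entropy sign convention is used, in which a higher reward means a higher density; the exact form in the original derivation may differ in notation.

```latex
p(s, a; \theta)
  = \mathbb{E}_{i \sim p(i)}\left[ p(s, a \mid i; \theta) \right]
  = \sum_{i \in I} p(i)\, \frac{\exp\big(r_i(s, a; \theta)\big)}{Z_i(\theta)},
\qquad
Z_i(\theta) = \int \exp\big(r_i(s, a; \theta)\big)\, \mathrm{d}s\, \mathrm{d}a .
```

The integral over the joint state-action space in $Z_i(\theta)$ is the quantity that cannot be computed analytically in the model-free setting.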

Definition 3
The Parametric Density Estimation (PDE) approach for the problem of model-free IRL-MUD is defined as minimizing the loss function in Eq. (6).

The partition functions cannot be obtained analytically for high-dimensional domains without knowing the system dynamics, i.e., model-free IRL-MUD. Therefore, a sampling-based approach is employed where the partition functions are estimated from a background distribution $q(s, a) = \mathbb{E}_{i \sim p(i)}[q(s, a|i)]$. The background distribution is adaptively refined using the current reward functions in a policy optimization procedure. However, especially in the early stages, where the reward estimates have high errors, the background distribution may fail to generate high-reward state-action pairs, which can lead to non-convergent behavior. As proposed by Finn et al. (2016a, b), this problem is addressed by mixing the background distribution with an estimated distribution of the occupancy measure that has naturally high rewards:

Definition 4
The mixed sampling estimation of the partition function is defined as the importance-sampling estimate of $Z_i(\theta)$ under the mixed sampler $\mu(s, a|i) = \frac{1}{2}\tilde{p}(s, a|i) + \frac{1}{2}q(s, a|i)$, where $\tilde{p}(s, a|i)$ is an estimate of the occupancy measure $p(s, a|i)$.
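A sketch of the importance-sampling form that such a mixed sampler suggests is given below. The symbols $\mu$ (mixed sampler), $\tilde{p}$ (estimated occupancy measure), and $\theta$ (reward parameters) are notational assumptions, and the expression follows from the standard importance-sampling identity rather than being a quotation of the original equation.

```latex
Z_i(\theta)
  = \int \exp\big(r_i(s, a; \theta)\big)\, \mathrm{d}s\, \mathrm{d}a
  = \mathbb{E}_{(s, a) \sim \mu(\cdot \mid i)}
      \left[ \frac{\exp\big(r_i(s, a; \theta)\big)}{\mu(s, a \mid i)} \right]
  \approx \frac{1}{N} \sum_{n=1}^{N}
      \frac{\exp\big(r_i(s_n, a_n; \theta)\big)}{\mu(s_n, a_n \mid i)},
\qquad (s_n, a_n) \sim \mu(\cdot \mid i).
```

Because $\mu$ places half of its mass on $\tilde{p}$, which concentrates on demonstrated (high-reward) pairs, the weights in the sum stay bounded where a background distribution $q$ alone would have poor coverage.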
The mixed sampling estimation can help us to reach a solution via the EM algorithm:

Proposition 1 Given the mixed sampling estimation, the PDE solution for the problem of model-free IRL-MUD consists of iterative E- and M-steps, where the E-step estimates the posterior probabilities of the intentions via Bayes' rule (Eq. 8).

The PDE solution is merely conceptual, and in order to arrive at a practical implementation, we first need to address the following two key challenges: (1) estimation of the partition function (Sect. 4.1.1), and (2) estimation of the posterior distribution (Sect. 4.1.2).

Estimation of the partition function
The mixed sampling estimation for approximating the partition function requires an estimation of the true distribution of the experts' state-action pairs. Due to the multi-modality of the experts' multi-intention behavior, this estimation can be quite challenging. Inspired by Gutmann and Hyvärinen (2010), we take a different approach and consider the intention-specific partition functions as a set of additional parameters of the model, $w = \{w_i \mid \forall i \in I\}$, to avoid the explicit estimation (Eq. 9). By defining the partition functions as model parameters, we can learn the partition functions rather than estimate them explicitly, e.g., via mixed sampling (Definition 4). However, this approach is no longer consistent with the loss function of the PDE approach (Eq. 6), as the maximum likelihood can become arbitrarily large by making the partition parameters, $w_i$, reach zero. To address this, we first define a substitute loss function for the problem of model-free IRL-MUD (Problem 1) using a Mixture of Logistic Regressions (MLR) (Definition 5) and obtain the solution via EM (Proposition 2). Then, we prove that the MLR approach results in the same solution as the PDE approach (Theorem 1) and, therefore, is a valid substitute for the PDE approach.

Definition 5
The idea of the MLR approach is to estimate the parameters by learning to discriminate between the real state-action pairs, which are sampled from the intention-specific occupancy measures $p(s, a|i; \theta)$, and the fake ones, sampled from the intention-specific background distributions $q(s, a|i)$. Once again, we can obtain a solution by employing the EM algorithm (Proposition 2).

Proof See Appendix A.3. ◻
In other words, the parameter set $w$ reaches the mixed sampling estimation of the partition functions. Now, we are ready to conclude that the MLR approach yields the same solution as the PDE approach (Theorem 1). Applying Theorem 1, we can avoid the explicit, mixed sampling estimation of the partition functions by employing the MLR solution and assuming $w \approx w^*$ at each iteration.
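The idea of treating the partition function as a learnable parameter can be illustrated in the single-intention, tabular case, in the spirit of noise-contrastive estimation (Gutmann & Hyvärinen, 2010). The following minimal sketch is not the MLR loss of this paper; the toy rewards, the uniform noise distribution, and the function `fit_log_partition` are assumptions for illustration. It learns a scalar $c$ such that $\exp(r(x) - c)$ is normalized, i.e., $c \approx \log Z$, purely by logistic discrimination between data and noise, and it evaluates the objective exactly over the finite space instead of sampling.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def fit_log_partition(rewards, noise, steps=500, lr=1.0):
    """Learn c ~= log Z by logistic discrimination between data and noise.

    The classifier is D(x) = sigmoid(r(x) - c - log noise(x)), i.e. the
    probability that x came from the model exp(r(x) - c) rather than the noise.
    """
    z = sum(math.exp(r) for r in rewards)
    # Data distribution; in practice only samples from it would be available.
    data = [math.exp(r) / z for r in rewards]
    c = 0.0
    for _ in range(steps):
        grad = 0.0
        for r, p, q in zip(rewards, data, noise):
            d = sigmoid(r - c - math.log(q))
            # Derivative w.r.t. c of p * log D + q * log(1 - D).
            grad += -p * (1.0 - d) + q * d
        c += lr * grad   # gradient ascent on the concave objective
    return c


if __name__ == "__main__":
    rewards = [0.0, 1.0, 2.0, 0.5, 1.5]   # hypothetical rewards
    noise = [0.2] * 5                     # hypothetical uniform noise distribution
    print("learned c:", fit_log_partition(rewards, noise))
    print("log Z    :", math.log(sum(math.exp(r) for r in rewards)))
```

By contrast, plain maximum likelihood on the unnormalized model would push $c$ toward $-\infty$ (a partition parameter approaching zero), which is the degeneracy that motivates a discrimination-based loss.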

Estimation of the posterior distribution
The goal of the E-step is to estimate the posterior probabilities of the a priori unknown intention labels, given the model parameters. This estimation is typically done via Bayes' rule, as shown in Eq. (8). However, Bayes' rule requires the intention-specific likelihood functions $p(s, a|i; \theta)$, which further depend on the availability of the intention-specific partition functions $Z_i(\theta)$. Even if the partition functions can be estimated via the partition parameters in the MLR approach, the problem's dimensionality makes it impractical to compute the posterior probabilities for all demonstrations in all intentions. To address this, we propose a posterior network $P$, parameterized by $\phi$, to obtain an estimate of the posterior probabilities. The posterior network can be trained in a supervised fashion by maximizing the likelihood of the state-action pairs sampled from the background distribution. Since these pairs are generated by intention-specific policies, their intention labels are known.
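For reference, a hedged sketch of the Bayes' rule computation discussed above is given below. It assumes a mixture-of-Boltzmann likelihood with reward parameters $\theta$ and partition functions $Z_i(\theta)$; the notation is an assumption and may differ from the original equation.

```latex
p(i \mid s, a; \theta)
  = \frac{p(i)\, p(s, a \mid i; \theta)}{\sum_{j \in I} p(j)\, p(s, a \mid j; \theta)},
\qquad
p(s, a \mid i; \theta) = \frac{\exp\big(r_i(s, a; \theta)\big)}{Z_i(\theta)} .
```

Every term depends on a partition function, and the ratio has to be evaluated for each demonstrated pair and each intention, which is what makes a learned approximation attractive.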

Definition 6
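The Posterior Probability Estimation (PPE) is defined as minimizing a loss function over the posterior-network parameters $\phi$, evaluated on state-action pairs sampled from the background distribution.

In each E-step, we first update the parameters $\phi$ and then use $P(i|s, a; \phi)$ as the estimate of the posterior probabilities.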
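A minimal sketch of this supervised training, under strong simplifying assumptions, is shown below. State-action pairs are reduced to a single scalar feature, the posterior network is a linear-softmax classifier rather than a deep network, and the intention-specific policies are replaced by two hypothetical Gaussian generators. The names `train_posterior`, `posterior`, and `sample_background` are illustrative only.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def posterior(params, x):
    """Estimated P(i | x) for a scalar feature x under a linear-softmax model."""
    return softmax([w * x + b for w, b in params])


def train_posterior(samples, n_intentions, steps=500, lr=0.1):
    """Fit the classifier by gradient descent on the mean cross-entropy loss."""
    params = [[0.0, 0.0] for _ in range(n_intentions)]   # [weight, bias] per intention
    for _ in range(steps):
        grads = [[0.0, 0.0] for _ in range(n_intentions)]
        for x, label in samples:
            probs = posterior(params, x)
            for i in range(n_intentions):
                err = probs[i] - (1.0 if i == label else 0.0)
                grads[i][0] += err * x
                grads[i][1] += err
        for i in range(n_intentions):
            params[i][0] -= lr * grads[i][0] / len(samples)
            params[i][1] -= lr * grads[i][1] / len(samples)
    return params


def sample_background(rng, n_per_intention=100):
    """Stand-in for intention-specific policies: intention 0 ~ N(-2, 1), intention 1 ~ N(+2, 1)."""
    means = [-2.0, 2.0]
    return [(rng.gauss(mu, 1.0), i) for i, mu in enumerate(means) for _ in range(n_per_intention)]


if __name__ == "__main__":
    params = train_posterior(sample_background(random.Random(0)), n_intentions=2)
    # Posterior over intentions for an unlabeled pair with feature value 2.0.
    print(posterior(params, 2.0))
```

The labels come for free because each generated sample is tagged with the intention of the policy that produced it; the trained classifier is then applied to the unlabeled demonstrations.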

A practical clustering-based solution
Although the defined MLR approach for the problem of model-free IRL-MUD eliminates the need for the mixed sampling estimation of the partition functions, it still requires a specific parameterization of the nonlinear functions $f(s, a, i; \theta)$ as the inputs to the logistic functions $D(s, a, i; \theta)$ (Eqs. 11 and 12). In order to avoid this, the logistic functions $D(s, a, i; \theta)$ can be directly parameterized in a more general manner, e.g., a neural network with a logistic function as the final layer, to output the classification probabilities. Given the E-step, our proposed solution, referred to as EM-GAIL, can be seen as a GAN where, in the M-step, the discriminator $D(s, a; \theta) = \mathbb{E}_{i \sim p(i)}[D(s, a, i; \theta)]$ is trained to minimize Eq. (11), and in the subsequent B-step (Background-step), the background distribution $q(s, a) = \mathbb{E}_{i \sim p(i)}[q(s, a|i)]$ is trained to maximize Eq. (11). Given the definition of the background distribution in Eq. (3), maximizing Eq. (11) is equal to training a policy under the pseudo-reward function $R(s, a)$, where $R_i(s, a; \theta)$ are the intention-specific pseudo-reward functions.

Solution 1 The EM-GAIL solution consists of the following iterative steps:
• E-step
• M-step
• B-step with the pseudo-reward function

The following corollary shows the resemblance of our clustering-based solution to the GAIL solution (Ho & Ermon, 2016), which is proposed for model-free IRL with single-intention demonstrations:

Corollary 1
The EM-GAIL solution reduces to GAIL solution when |I| = 1 , i.e., the number of intentions is one.
In the following, we first define the IntentionGAIL and InfoGAIL solutions as the non-clustering alternatives for the problem of IRL-MUD (Sect. 4.1.4). Then, we theoretically analyze our clustering-based solution, along with the non-clustering alternatives, in the presence of overlapping demonstrations (see Sect. 4.2).

Non-clustering alternatives
InfoGAIL (Li et al., 2017) and IntentionGAIL (Hausman et al., 2017) are non-clustering approaches for the problem of model-free IRL-MUD, proposed as extensions of GAIL (Ho & Ermon, 2016). Inspired by Chen et al. (2016), both methods infer the latent intentions by optimizing the mutual information between the intentions and the demonstrations in the GAN training. Despite the alternative derivations, the objective functions are identical (see Section A.5).

Solution 2 The Info/IntentionGAIL solution alternates between two steps: the Discriminator-step and the Generator-step with the pseudo-reward function.

Model-free IRL-MUD-O
In this section, we apply our clustering-based solution (Solution 1) to the problem of model-free IRL-MUD-O (Problem 2). Then, we provide theoretical comparisons with the Info/IntentionGAIL solution (Solution 2).
Given Definition 2, a shared pair is an imitated state-action pair that is shared between more than one intention. Since a shared pair mimics a demonstrated state-action pair, it should receive a high reward, regardless of the number of intentions in which it is shared. In other words, the reward of an imitated state-action pair should not decrease as the number of intentions in which the pair is shared increases. To study this behavior, we define the multi-intention reward error, which measures the sensitivity of the reward assignment in various solutions for the problem of model-free IRL-MUD-O.

Definition 7
The multi-intention reward error for a state-action pair (s, a) is defined as: where R_{|I|=1} is the pseudo-reward function when |I| = 1.
The multi-intention reward error can be used as an indication of solution validity for the problem of model-free IRL-MUD-O. When the multi-intention reward error of a state-action pair is zero, the reward assignment for that pair is not sensitive to the number of intentions, whether the pair in question is a shared pair or not. In other words, the reward assignment only depends on the imitation quality of the state-action pairs and not on the number of intentions in which they are shared. Therefore, a solution is considered to be valid for the problem of model-free IRL-MUD-O if the multi-intention reward errors of the state-action pairs are zero.
To prove Theorem 2, we first need to identify the condition for achieving a globally optimal solution. The global optimality condition holds when the discriminator cannot differentiate between a demonstrated and a generated state-action pair, and the intention predictions match the true intentions. Therefore:

Lemma 2 (Theorem 1 in Goodfellow et al. (2014)) The global optimality of EM-GAIL is achieved if and only if the D(s, a, i; ) and P(i|s, a; ) are optimal and p(s, a|i) = q(s, a|i) , ∀i ∈ I . The global optimality of Info/IntentionGAIL is achieved if and only if D(s, a; ) and P(i|s, a; ) are optimal and p(s, a) = q(s, a).
Proof See Appendix A.6. ◻
Now we are ready to give a direct proof of Theorem 2. A shared pair ŝ,â is similar to one state-action pair for every intention i ∈ Î . Therefore, the optimal posterior function outputs equal probabilities for all intentions in Î , including intention k, i.e., P(k|ŝ,â) = 1 ∕|Î| . Under the global optimality condition, we have D(ŝ,â) = D(ŝ,â, k) = 1 ∕2 (see Lemma 2). Therefore, the multi-intention reward error in EM-GAIL for the shared pair ŝ,â is zero. In other words, the reward assignment for a state-action pair in EM-GAIL is not sensitive to the number of intentions. As long as an intention-specific state-action pair is similar to a demonstrated one with the same intention, it receives a high reward in EM-GAIL. Therefore:

Corollary 2 EM-GAIL is a solution for the problem of model-free IRL-MUD-O (Problem 2).
In the case of Info/IntentionGAIL, the multi-intention reward error of the shared pair ŝ,â is: Since |Î| ≥ 1 , the error is always greater than or equal to zero, with equality when |Î| = 1 . This means that the pseudo-reward of a perfectly imitated state-action pair for a specific intention is reduced as the number of intentions in which the pair is shared increases.
The intuition behind these behaviors can be seen in the role of the discriminators. The goal of the policy optimizer in Info/IntentionGAIL is to fool the discriminator by making q(s, a) → p(s, a) , and consequently, π*_R(a|s) → π*(a|s) , which is the direct result of the policy and occupancy measure uniqueness property (Syed et al., 2008). However, no constraints are put upon the intention-specific background distributions q(s, a|i) and the corresponding intention-specific policies π*_R(a|s, i) . The lack of sufficient constraints on the policy may lead to errors in the policy optimization procedure in multi-intention tasks. In EM-GAIL, on the other hand, the stricter goal is to trick the intention-specific discriminators by making q(s, a|i) → p(s, a|i) , i.e., π*_R(a|s, i) → π*(a|s, i) . In other words, the discriminator views each individual intention-specific policy instead of the combined policy, as in Info/IntentionGAIL. This way, not only is the Info/IntentionGAIL goal satisfied, but more constraints are also set for the policy, leading to more stable behaviors in multi-intention environments.
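The value D = 1 ∕2 used in the argument above can be checked numerically. The sketch below assumes the standard pointwise GAN objective of Goodfellow et al. (2014), written here as a loss, and assumes that the discriminator outputs the probability that a pair is demonstrated. The function names are illustrative only.

```python
import math

def pointwise_loss(d, p, q):
    """Binary cross-entropy at one state-action pair with densities p and q."""
    return -(p * math.log(d) + q * math.log(1.0 - d))

def optimal_discriminator(p, q):
    """Closed-form minimizer of pointwise_loss over d in (0, 1)."""
    return p / (p + q)

def grid_minimizer(p, q, steps=9999):
    """Brute-force minimizer on a uniform grid, used only as a sanity check."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]
    return min(grid, key=lambda d: pointwise_loss(d, p, q))

# When the generated density matches the demonstrated one, the optimum is 1/2.
d_star = optimal_discriminator(0.3, 0.3)
```

With D = 1 ∕2 everywhere and both densities integrating to one, this loss evaluates to −(log 1 ∕2 + log 1 ∕2) = log 4, which matches the value quoted in Appendix A.6 under the same sign convention.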

Experimental analyses
In this section, we compare the experimental results of our proposed EM-GAIL with two well-known baselines: (1) GAIL, proposed by Ho and Ermon (2016), which was originally developed for single-intention imitation learning, and (2) Info/IntentionGAIL, proposed by Li et al. (2017) and Hausman et al. (2017) for multi-intention imitation learning. The main goal of this section is twofold: (1) to emphasize the limitation of the algorithms with the single-intention assumption, and (2) to experimentally validate our theoretical outcomes. The former is accomplished by evaluating the GAIL algorithm in various environments with a different number of intentions. For the second part, we introduce the metric of separability, which indicates the level of overlap in the demonstrations (the lower the separability, the higher the number of shared pairs), and we compare the algorithms in several environments with various levels of separability.

Environments
The experiments are conducted in four robotics environments, implemented in OpenAI Gym (Brockman et al., 2016) with the MuJoCo (Todorov et al., 2012) physics engine. For the purpose of this work, they are extended to cover multiple intentions by defining additional movements (see Fig. 3). Note that, as in their standard single-intention versions, the horizontal location of the robot is excluded from the state space in all multi-intention environments. As a result, it is not possible for the algorithms to separate the demonstrations by simply observing the moving direction of the robot.
Fig. 3 Environments. a Swimmer, which is a 2-intention environment. b Hopper, which is a 3-intention environment. c Half Cheetah, which is a 3-intention environment. d Reacher, which is a 4-intention environment. e Synthetic driver, which is a 2-intention environment (Color figure online).
Swimmer is a 2-intention environment with an 8D state space and a 2D action space. We have defined the intentions as moving forward and backward. For each intention, a linear reward is assigned for moving progress. We set the maximum time-step of each episode to 500.
Hopper is a 3-intention environment with an 11D state space and 3D action space. We have defined the intentions as moving forward, backward, and standing. A positive linear reward is assigned for moving progress for intended moving, and a negative moving reward for intended standing. The maximum time-step of each episode is set to 500.
Half cheetah is a 3-intention environment with a 17D state space and a 6D action space. We have defined the intentions as moving forward, backward, and flipping. A positive linear reward is assigned for moving progress for intended moving, and a negative moving reward for intended flipping, along with a positive linear reward for increasing the body angle. The maximum time-step is 500.
Reacher is a 4-intention environment with a 26D state space and a 2D action space. We have defined the intentions as reaching to separate colored balls. The balls are randomly located in separate quarters, and the reward is defined as the distance to each intended ball. The maximum time-step is 50.
We additionally introduce a synthetic driver task (see Fig. 3e) in which the separability between multiple intentions can be controlled directly.
Synthetic driver is a 2-intention environment in which the agent can move freely from the origin at a constant speed in a 2D environment by controlling the steering angle at each discrete time t. For the agent, the state at time t consists of the 2D positions from t − 4 to t. As shown in Fig. 3e, the intentions are defined as two destinations, where each destination is reached along a curved path with a specific radius. While the first destination is always fixed, the second destination and its corresponding path radius vary across experiments. As will be shown later (Fig. 4b), each radius of the second destination results in a specific level of overlap in the demonstrations.
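To make the description of the synthetic driver concrete, the following toy sketch implements one possible reading of its kinematics. Several details are assumptions and are not specified in the text: the step length `speed`, the initial heading along the x-axis, the steering angle acting as an increment of the heading, and the padding of the position history with the origin. The class name `SyntheticDriverSketch` is hypothetical.

```python
import math
from collections import deque

class SyntheticDriverSketch:
    """Toy constant-speed agent steered by an angle at each discrete step."""

    def __init__(self, speed=0.1, history=5):
        self.speed = speed            # assumed constant step length
        self.heading = 0.0            # assumed initial heading along the x-axis
        self.position = (0.0, 0.0)    # the agent starts at the origin
        # Keep the last `history` positions (t-4 .. t for history=5).
        self.positions = deque([self.position] * history, maxlen=history)

    def step(self, steering_angle):
        """Turn by steering_angle (radians), move forward, return the state."""
        self.heading += steering_angle
        x, y = self.position
        self.position = (x + self.speed * math.cos(self.heading),
                         y + self.speed * math.sin(self.heading))
        self.positions.append(self.position)
        return self.state()

    def state(self):
        """Flattened 2D positions from t-4 to t (10 numbers for history=5)."""
        return [coord for pos in self.positions for coord in pos]

env = SyntheticDriverSketch()
state = env.step(0.0)   # zero steering keeps the agent on the x-axis
```

Under these assumptions, a constant nonzero steering angle traces a circular arc, so paths with different radii toward the two destinations correspond to different constant steering angles.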

Metrics
The performances of IRL algorithms are evaluated by the following metrics:
Expected Reward Difference (ERD), which is inspired by Choi and Kim (2012) and Bighashdel et al. (2021), is a measure of how well the optimal policy, trained on the learned pseudo-reward function, performs under the ground truth reward function. We train an agent on the learned pseudo-reward function and evaluate its behavior on the ground truth reward function by computing the cumulative ground truth reward. The expected difference between the agent's cumulative ground truth reward and the expert's cumulative ground truth reward is referred to as ERD. The ERD avoids having to define task-specific metrics for each unique robotics environment (e.g., a swimming metric for the Swimmer environment or a hopping metric for the Hopper environment), thereby allowing a uniform approach to compare methods across different robotics tasks. The ERD is defined as: where r i is the ground truth intention-specific reward function. Given Eq. (2), ERD can be further written as: The only unknowns in Eq. (28) are the intention-specific policies π*_R(a|s, i) . When the IRL training is completed and the intention-specific pseudo-reward functions R i (s, a; ) are learned, the intention-specific policies are obtained by training an agent on the learned pseudo-reward functions using RL. We normalize the ERD values between 0 and 1, where 1 corresponds to a random policy and 0 represents the experts' policy π* .
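The normalization of the ERD can be illustrated with a small sketch. It assumes a linear rescaling per intention, so that an agent matching the expert scores 0 and an agent matching a random policy scores 1, followed by a uniform average over intentions. Both choices are assumptions of this sketch rather than details given in the text, and the function name and return values are hypothetical.

```python
def normalized_erd(agent_returns, expert_returns, random_returns):
    """Mean over intentions of a rescaled expert-minus-agent return gap.

    Each argument is a list with one average cumulative ground-truth reward
    per intention. The linear rescaling is an assumption of this sketch.
    """
    gaps = []
    for agent, expert, random_ in zip(agent_returns, expert_returns, random_returns):
        gaps.append((expert - agent) / (expert - random_))
    return sum(gaps) / len(gaps)

# Hypothetical returns for a 2-intention task.
erd = normalized_erd([80.0, 50.0], [100.0, 100.0], [0.0, 0.0])
```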
Unsupervised clustering accuracy is evaluated in the experts' state-action space and is defined as (Yang et al., 2010): $\max_{g \in G} \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\{k_m = g(\hat{k}_m)\}$, where $k_m$ and $\hat{k}_m$ are the ground truth and predicted intention labels of the expert's state-action pair m, respectively, M is the total number of experts' state-action pairs, and G is the set of all possible one-to-one mappings between ground truth and predicted intention labels. Unsupervised clustering accuracy can be used to evaluate the performance of the posterior networks.
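A minimal sketch of this metric is given below. It enumerates all one-to-one label mappings by brute force, which is only practical for a small number of intentions such as the two to four used here. The function name and the integer label encoding are illustrative assumptions.

```python
from itertools import permutations

def clustering_accuracy(true_labels, predicted_labels, num_intentions):
    """Best label-matching accuracy over all one-to-one label mappings."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    best = 0.0
    for mapping in permutations(range(num_intentions)):
        # mapping[k] is the ground-truth label assigned to predicted label k.
        hits = sum(1 for t, p in zip(true_labels, predicted_labels)
                   if t == mapping[p])
        best = max(best, hits / len(true_labels))
    return best

# Predicted labels 0 and 1 are swapped relative to the ground truth.
acc = clustering_accuracy([0, 0, 1, 1, 2], [1, 1, 0, 0, 2], 3)
```

For a larger number of intentions, the same maximum could be obtained more efficiently by solving an assignment problem on the confusion matrix instead of enumerating permutations.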
We further introduce a metric to measure the level of overlap in the demonstrations of each environment:
Separability is obtained by first fitting a Gaussian Mixture Model (GMM) on the expert's state-action space and then computing the unsupervised clustering accuracy of its predictions. The separability is then computed by normalizing this accuracy between 1 ∕|I| and 1. The separability is a measure of how well the intention-specific state-action pairs are represented by separate distributions. Low separability indicates a high level of overlap in the demonstrations and consequently increases the probability of generating shared pairs.
The separability of the experts' demonstrations in the simulated robotics and the synthetic driver environments is depicted in Fig. 4a, b, respectively. As can be seen in Fig. 4a, the separability differs across robotics environments with the same number of intentions, and in all cases, the separability is reduced by increasing the number of intentions. Figure 4b also shows that the separability in the synthetic driver environment decreases as the distance between the two destinations is reduced, which is the result of decreasing the radius of path 2.
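The normalization step can be sketched as follows, assuming a linear rescaling in which an accuracy of 1 ∕|I| maps to a separability of 0 and an accuracy of 1 maps to a separability of 1. The GMM fitting itself is omitted, and the function name is hypothetical.

```python
def separability(gmm_accuracy, num_intentions):
    """Rescale a clustering accuracy from [1/|I|, 1] to [0, 1]."""
    if num_intentions < 2:
        raise ValueError("separability needs at least two intentions")
    lowest = 1.0 / num_intentions
    return (gmm_accuracy - lowest) / (1.0 - lowest)

# A GMM clustering accuracy of 0.75 on a 2-intention task.
sep = separability(0.75, 2)
```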

Implementation details
We used one implementation for InfoGAIL (Li et al., 2017) and IntentionGAIL (Hausman et al., 2017), referred to as Info/IntentionGAIL, as both approaches follow exactly the same procedure (see Sect. 4.1.4 and A.5), although they were presented in parallel studies. We used policies and discriminators with the same neural network architecture in all IRL algorithms (GAIL by Ho and Ermon (2016), Info/IntentionGAIL by Li et al. (2017) and Hausman et al. (2017), and our EM-GAIL) for all experiments. We employed two hidden layers of dimension 64 for policies, discriminators, and posterior networks, with Tanh nonlinear functions in between. The policies of both Info/IntentionGAIL and EM-GAIL, as well as the discriminator of EM-GAIL, accept an additional one-hot intention assignment vector.
The number of learnable parameters of GAIL, Info/IntentionGAIL, and EM-GAIL, as well as their averaged training time for one iteration, is depicted in Table 2 for the evaluated robotics environments. As can be seen, both Info/IntentionGAIL and EM-GAIL have approximately the same number of learnable parameters, leading to similar training times for both methods. Due to additional posterior networks, both Info/IntentionGAIL and EM-GAIL have higher numbers of learnable parameters and training times compared to GAIL.
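To give a feel for how the one-hot intention vector affects the network size, the sketch below counts the weights and biases of a fully connected network with two hidden layers of 64 units. It is a back-of-the-envelope aid only: it does not attempt to reproduce Table 2, and the output parameterization (for example, any extra variance parameters of a stochastic policy) is an assumption that is ignored here.

```python
def mlp_parameter_count(input_dim, output_dim, hidden_dims=(64, 64)):
    """Number of weights and biases in a fully connected network."""
    dims = [input_dim, *hidden_dims, output_dim]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(dims[:-1], dims[1:]))

# Swimmer-sized example: 8D state plus a 2-way one-hot intention vector,
# mapped to a 2D action (the output parameterization is an assumption).
count = mlp_parameter_count(8 + 2, 2)
```

In this example, the intention vector adds only 2 × 64 first-layer weights compared with the same network without intention input.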
For each experiment, we first obtain the expert's policy by running the Trust Region Policy Optimization (TRPO) algorithm (Schulman et al., 2015) on the true reward functions to generate the expert's demonstrations, i.e., solving the task of model-free RL-MRF (Definition 1). The number of expert's state-action pairs per intention is fixed to 2500 for the swimmer, hopper, and half cheetah environments, 1500 for reacher, and 500 for the synthetic driver. Then, the IRL algorithms are trained for 400 iterations on the expert's demonstrations by running the Adam optimizer with a fixed learning rate of 0.001 (after testing the range of 0.1 to 0.0001) for the discriminators and posterior networks, and TRPO with the hyperparameters reported in Schulman et al. (2015) for the policies. The optimization outline is depicted in Algorithm 1. All the experiments are run on a GPU-enabled computer (NVIDIA GeForce RTX 2080 Ti).

Results
Each experiment is repeated four times, and the results are shown in the form of means (lines) and standard deviations (shadings). The results are normalized between 0, corresponding to the expert's behavior, and 1, corresponding to random behavior. Figure 5a shows the ERD for various levels of separability in the synthetic driver task. The results in this figure show that the GAIL algorithm, which is developed with the single-intention assumption, is not able to imitate the expert's behaviors in any of the experiments. What can be further seen in Fig. 5a is the insensitivity of our proposed EM-GAIL to the separability of the demonstrations. Info/IntentionGAIL shows insensitive behavior as long as the separability, i.e., the distance between the two destinations, is more than a certain limit ( ≈ 0.2 ). When the separability of the demonstrations gets lower than 0.2, which according to Fig. 4b corresponds to a second destination path radius of lower than ≈ 2 , the number of shared pairs grows exponentially, leading to higher multi-intention reward errors in Info/IntentionGAIL. This consequently results in a performance drop in Info/IntentionGAIL for separability values lower than 0.2.
Figure 6 shows the sensitivity of ERD to the number of intentions in four robotics environments. The results once again show the incapability of GAIL in environments with more than one intention. What stands out in Fig. 6 is that when the number of intentions gets higher, which according to Fig. 4a leads to a lower separability and consequently a growth in the number of shared pairs, the performance of Info/IntentionGAIL drops significantly. EM-GAIL, in contrast, shows more stable behavior regardless of the number of intentions. To make the conclusion clearer, we further show the performances of the algorithms directly with respect to the separability in Fig. 5b.
As shown, Info/IntentionGAIL shows insensitive behavior as long as the separability is more than a certain limit. These results are also consistent with the sub-optimal behavior of Info/IntentionGAIL in the synthetic driver task in Fig. 5a. In contrast, it is apparent from Fig. 5b that the performance of EM-GAIL is less sensitive to the separability of the demonstrations.
Fig. 5 ERD with respect to the separability of the demonstrations, where a lower ERD is better and an ERD of 1 is equal to random behavior. a Synthetic driver, where each separability point corresponds to a specific path 2 radius (see Fig. 4b). b Robotics environments, where each separability point corresponds to a specific robotic environment with a specific number of intentions (see Fig. 4a) (Color figure online).
Further experiments are conducted to evaluate the unsupervised clustering performance of the multi-intention IRL solutions. Since GAIL is not built for multi-intention IRL, it has been excluded from these experiments. To evaluate the clustering performance of Info/IntentionGAIL, the intention labels of the expert's state-action pairs are obtained using the posterior network. Table 3 indicates the average unsupervised clustering accuracy of the methods for the varying number of intentions. What is interesting in Table 3 is that even though both Info/IntentionGAIL and EM-GAIL are trained with the same loss function (Eq. 16), EM-GAIL outperforms Info/IntentionGAIL in unsupervised clustering, especially in environments with a higher number of intentions. The reason lies in the fact that the prediction performance of the posterior network highly depends on the quality of the generated samples, which are considered the training dataset for the posterior function. According to Theorem 2, a higher number of intentions results in higher multi-intention reward errors in Info/IntentionGAIL.
As a direct consequence, the policy of Info/IntentionGAIL is less able to generate perfectly imitated state-action pairs. Once again, these results emphasize the strength of EM-GAIL in imitation from overlapping demonstrations.

Conclusions
We proposed a model-free IRL framework to imitate the multi-intention behaviors of the experts from unlabeled (without intention labels) and partially overlapping (shared between multiple intentions) demonstrations. This is realized by a novel clustering-based approach, EM-GAIL, using a mixture of logistic regressions and a posterior network. The mixture of logistic regressions avoids the direct estimation of the partition function, and the posterior network approximates the posterior distributions that are impractical to be expressed and estimated analytically. We addressed the problem of overlapping demonstrations by defining the concept of a shared pair and analytically proved that, under the global optimality condition, EM-GAIL is a solution. We further compared EM-GAIL with well-known baselines on a set of simulated tasks. We conclude that, first, the algorithms with the single-intention assumption are incapable of imitating the experts' multi-intention demonstrations, and second, the performance of the non-clustering approaches, which focus on directly inferring the latent intention, significantly depends on the separability of the intentions. Our proposed clustering-based approach, in contrast, shows stable behavior regardless of the separability and the number of intentions.
Having shown the benefits of our approach with an a priori known number of experts' intentions, we aim to extend the same approach to also infer the intentions when the number of intentions is not known a priori.

Appendix A
This section provides the proofs for all the statements in the main paper.

A.1 Proof of Proposition 1
Minimizing the negative log-likelihood leads to the following loss function: −𝔼 s,a∼p(s,a) [log 𝔼 i∼p(i) [p(s, a|i; )]]. Taking the derivatives results in: We can use the definition of the intention posterior probability which is computed in the E-step: Multiplying the numerator and the denominator in the derivatives by p(s, a|i; )p(i) yields: Using the definition of the Boltzmann distribution results in: In the E-step of the EM algorithm, we set (i|s, a) ≈ p(i|s, a) where is the current set of parameters. Assuming a constant p(i), note that p(s, a) = Σ i p(s, a|i) p(i).

We can use the definition of the expectation as follows: where g is a function. This property of the expectation has been frequently used throughout the paper. Now, we can reach the final equation for the M-step:

A.2 Proof of Proposition 2
The mixture of logistic regressions is defined as: where D (c|s, a) is the true class label of the state-action pair s, a, and where we have defined a constant prior D(i) = p(i) . Taking the derivatives with respect to yields:

A.3 Proof of Lemma 1
The mixture of logistic regressions for the multi-intention IRL problem is defined as: The nonlinear logistic function D(s, a, i; ) is further defined as: Given that (s, a|i) = (1 ∕2) p(s, a|i) + (1 ∕2) q(s, a|i) and by setting p(s, a|i; ) as an estimate of p(s, a|i), i.e., p(s, a|i) = p(s, a|i; ) , we have: Replacing Eq. (A23) in Eq. (A21) and separating the terms independent of i yields: where −i is the set excluding i , and f and g are some functions. Taking the derivatives with respect to i leads to: The optimal value results by setting the derivative to zero: * i = 𝔼 s,a∼ (s,a|i) [exp(r i (s, a; )) ∕ (s, a|i)] (A26)

A.4 Proof of Theorem 1
Replacing Eq. (A23) in Eq. (A21) and separating the terms independent of yields: Taking the derivative with respect to leads to: Now by setting = * , i.d i = Z i ( ) , we have:

A.5 Equivalency of InfoGAIL and IntentionGAIL objectives
The main objective function of InfoGAIL is (Equation 3 in Li et al. (2017), without the constant coefficients and the entropy term): where: Please note that on page 4, paragraph 4 of the InfoGAIL paper (Li et al., 2017), the authors use a simplified posterior approximation Q(c|τ) ≈ Q(c|s, a) to avoid working with entire trajectories. Furthermore, in the InfoGAIL paper (Li et al., 2017), the authors use the letters "c" and "Q" to denote the intention and the posterior function, respectively. For consistency of notation, the letters "c" and "Q" are replaced here with the letters "i" and "p", respectively. Given these, the final objective function will be: On the other hand, the main objective function of IntentionGAIL is (Equation 8 in Hausman et al. (2017), without the constant coefficients and the entropy term): where the terms identical to those in Eq. (A32) are labeled. As can be seen, the objective functions of InfoGAIL and IntentionGAIL are identical.

A.6 Proof of Lemma 2
The Info/IntentionGAIL (Li et al., 2017; Hausman et al., 2017) corresponds to the following max-min game: For a fixed q(s, a) and P(i|s, a) , the optimal discriminator D * (s, a) is (Goodfellow et al., 2014): The max-min game can now be reformulated as: For a fixed P(i|s, a) , the maximum with respect to q(s, a) happens at p(s, a) = q * (s, a) (Goodfellow et al., 2014). At this point, V(D * , q * , P) achieves: Finally, the max-min game reaches the global optimum when the log-likelihood of the posterior function is maximized, i.e., at the optimal posterior P * (i|s, a): For EM-GAIL with an optimal posterior P * (i|s, a) , i.e., p(i|s, a) = P * (i|s, a) , we have the following max-min game: For a fixed q(s, a|i) , the optimal discriminator D * (s, a, i) is (Goodfellow et al., 2014): The max-min game can now be reformulated as: The maximum of V(D * , q * , P * ) is achieved for p(s, a|i) = q * (s, a|i) (Goodfellow et al., 2014) with the value of log 4.
Author contributions The authors' contributions are: Methodology, formal analysis and investigation: AB; Writing-original draft preparation: AB; Writing-review and editing: PJ and GD; Supervision: PJ and GD; Resources: GD.
Funding This research has received funding from ECSEL JU project PRYSTINE in collaboration with the European Union's 2020 Framework Programme and National Authorities, under grant agreement no. 783190.
Data availability Necessary data and materials are available with the code.
Code availability To ease the reproducibility of our work, the code of our method is shared with the community at https://github.com/tue-mps/EM-GAIL.

Conflict of interest
The authors have no competing interests to declare that are relevant to the content of this article.

Consent to participate Not applicable.
Consent for publication Not applicable.

Ethical approval Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.