Inverse Reinforcement Learning in Contextual MDPs

We consider the Inverse Reinforcement Learning (IRL) problem in Contextual Markov Decision Processes (CMDPs). Here, the reward of the environment, which is not available to the agent, depends on a static parameter referred to as the context. Each context defines an MDP (with a different reward signal), and the agent is provided demonstrations by an expert, for different contexts. The goal is to learn a mapping from contexts to rewards, such that planning with respect to the induced reward will perform similarly to the expert, even for unseen contexts. We suggest two learning algorithms for this scenario. (1) For rewards that are a linear function of the context, we provide a method that is guaranteed to return an $\epsilon$-optimal solution after a polynomial number of demonstrations. (2) For general reward functions, we propose black-box descent methods based on evolutionary strategies capable of working with nonlinear estimators (e.g., neural networks). We evaluate our algorithms in autonomous driving and medical treatment simulations and demonstrate their ability to learn and generalize to unseen contexts.


Introduction
We study a sequential decision-making problem where the environment is a Markov Decision Process (MDP), but the dynamics and the reward depend on a static parameter referred to as the context. For example, consider a lifelong learning task, in which an autonomous car must navigate the road while avoiding other vehicles. The car is likely to encounter similar instances of this problem under different conditions, such as visibility and weather. While we expect some tasks to be similar, the agent is also required to adapt. For instance, when encountering fog, the agent is expected to drive more safely and slowly than during optimal weather conditions.
[Itenov et al., 2018] Figure 1: The COIRL framework (left): a context vector parametrizes the environment. For each context, the expert uses the true mapping from contexts to rewards, $W^*$, and provides demonstrations. The agent learns an estimation of this mapping, $\hat{W}$, and acts optimally with respect to it.
Another example is the dynamic treatment regime [Chakraborty and Murphy, 2014]. Here, there is a sick patient and a clinician who acts to improve the patient's health. The state space of the MDP is composed of the patient's clinical measurements, and the actions are the clinician's decisions. Traditionally, treatments are targeted to the "average" patient. Instead, in personalized medicine, people are separated into different groups and the medical decisions are tailored to the individual patient based on the predicted response or risk of disease (Fig. 1b). Recently, with the cost of genetic sequencing dropping dramatically, and with the growing number of patients who are willing to track and share their healthcare records, personalized medicine is being developed to match the specific needs of patients. One success story of personalized medicine was the development of a drug called Herceptin for a group of cancers termed HER2+ that are highly aggressive and often have a poor prognosis. Herceptin was developed to treat women (the context) with HER2+ breast cancer, and it improved their survival time from 20 months to 5 years. For acute respiratory distress syndrome (ARDS), clinicians argue that treatment goals should rely on individual patients' physiology [Berngard et al., 2016]. Wesselink et al. [2018] study organ injury that might occur when mean arterial pressure decreases below a certain threshold, and report that this threshold varies across different patient groups.
These examples highlight the importance of patient information in the online treatment regime and motivate us to consider contextual information within the RL framework. One possibility is to expand the state so that it includes the patient information (the context). However, this approach can increase the complexity of the problem significantly, as the set of possible MDPs grows exponentially with the dimension of the context. Therefore, in the Contextual MDP framework [Hallak et al., 2015], the goal is to learn the mapping from contexts to the environment (dynamics and rewards). A learning algorithm for this problem should learn a mapping that generalizes to unseen contexts, improves as it observes more contexts, and achieves the desired sample complexity [Modi et al., 2018] or regret [Modi and Tewari, 2019].
Another issue that is prevalent in real-world problems is that the reward function may be sparse and misspecified. For example, in online treatment problems like sepsis [Komorowski et al., 2018], the only available signal is the mortality of the patient at the end of the treatment. Manually designing a reward function for this problem is complicated and could lead to poor performance [Raghu et al., 2017, Lee et al., 2019]. In many cases, it is easier for humans to define the reward implicitly by providing demonstrations of what constitutes a proper treatment. Inverse Reinforcement Learning (IRL) [Ng et al., 2000] is concerned with inferring the reward function by observing an expert, in order to find a policy that guarantees a value close to that of the expert.
Finally, deploying RL algorithms to treat patients or to drive cars cannot be regarded in the same way as solving a video game, due to safety considerations and the lack of a simulator. To address these issues, we propose the Contextual Inverse Reinforcement Learning (COIRL) model. We study a safe, online learning framework where an expert supervises the RL algorithm as follows. The agent observes a context, estimates the reward and proposes a policy. The expert evaluates the agent's actions and decides if they are $\epsilon$-optimal. If not, the expert provides a demonstration to the agent. The goal of the agent is to learn a mapping from contexts to rewards by observing expert demonstrations.
We design and analyze two algorithms for COIRL: (1) For linear reward mappings, we study the ellipsoid method, for which we provide theoretical guarantees and analyze the sample complexity for finding an $\epsilon$-optimal solution. (2) For nonlinear mappings, we study a black-box optimization solution that minimizes a surrogate loss using evolutionary strategies (ES) [Salimans et al., 2017]. We consider two loss functions that enable feature expectation matching: an intuitive but non-differentiable loss that minimizes the distance of the value of the agent from the value of the expert, and a differentiable loss that is based on a min-max objective. We evaluate our algorithms in autonomous driving and online treatment simulators and demonstrate their ability to generalize to unseen contexts.

Problem formulation and notation
Contextual Markov Decision Processes (CMDPs): A Markov Decision Process (MDP) is defined by the tuple $(S, A, P, \xi, R)$ where $S$ is the state space, $A$ the action space, $P : S \times S \times A \to [0, 1]$ the transition kernel, $\xi$ the initial state distribution and $R : S \to \mathbb{R}$ the reward function. A Contextual MDP (CMDP) is an extension of an MDP, and is defined by $(\mathcal{C}, S, A, \mathcal{M})$ where $\mathcal{C}$ is the context space, and $\mathcal{M}$ is a mapping on $\mathcal{C}$ such that $\mathcal{M}(c)$ is an MDP with state and action spaces $S, A$ for each $c \in \mathcal{C}$. We consider a CMDP with finite state and action spaces, and associate each state with a feature vector $\phi : S \to [0, 1]^k$. Additionally, we assume that the transitions $P$ and the initial state distribution $\xi$ are not context dependent.
In our work we further assume a linear setting, in which the reward function for context $c$ is linear in the state features: $R^*_c(s) = {r^*_c}^T \phi(s)$, where $r^*_c$ is the reward coefficient vector, given by the linear mapping $r^*_c = c^T W^*$ with $W^* \in \mathbb{R}^{d \times k}$. We assume $\|W^*\|_\infty \le 1$ and $c \in \mathcal{C} = \Delta_{d-1}$, i.e., the standard $(d-1)$-dimensional simplex. This allows a straightforward extension to a model in which the transitions are also given by a linear mapping of the context, as seen in [Modi et al., 2018]. One way of viewing this model is that each row in the mapping $W^*$ is a base reward coefficient vector, and the reward for a specific context is a convex combination of these base rewards.
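As an illustration of the linear setting above, the following minimal sketch computes the reward coefficient vector and a state reward for a toy instance. The matrix `W`, the context `c` and the feature vector `phi_s` are hypothetical values chosen for the example, not taken from the experiments.

```python
import numpy as np

def reward_coefficients(c, W):
    """Return r_c = c^T W, a length-k vector of reward coefficients."""
    return c @ W

def state_reward(c, W, phi_s):
    """Return R_c(s) = r_c^T phi(s) for the feature vector of one state."""
    return float(reward_coefficients(c, W) @ phi_s)

# Hypothetical toy instance: d = 2 context dimensions, k = 3 features.
W = np.array([[1.0, -1.0, 0.0],
              [0.0,  0.5, 1.0]])   # each row is a base reward coefficient vector
c = np.array([0.25, 0.75])         # a point on the simplex
phi_s = np.array([1.0, 0.0, 0.5])  # features of one state, in [0, 1]^k

r_c = reward_coefficients(c, W)    # convex combination of the rows of W
```

Because `c` lies on the simplex, `r_c` is a convex combination of the rows of `W`, which is the "base rewards" view described in the text.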
We consider deterministic policies $\pi : S \to A$ which dictate the agent's behaviour at each state. The (normalized) value of a policy $\pi$ for reward coefficient vector $r$ is $V^\pi_r = \frac{1-\gamma}{k}\, \mathbb{E}_{\xi, P, \pi}\left[\sum_{t=0}^{\infty} \gamma^t r^T \phi(s_t)\right] = r^T \mu(\pi)$, where $\mu(\pi) = \frac{1-\gamma}{k}\, \mathbb{E}_{\xi, P, \pi}\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t)\right] \in \mathbb{R}^k$ is called the feature expectations of $\pi$, and $\gamma \in [0, 1)$ is the discount factor. For the optimal policy with respect to (w.r.t.) the reward coefficient vector, we denote the value by $V^*_r$. The normalization of the value function by the constant $\frac{1-\gamma}{k}$ is for convenience, i.e., it ensures that $\|\mu(\pi)\|_1 \le 1$ for every $\pi$, and it does not affect the resulting policies. Given $W^*$, for each context $c \in \mathcal{C}$ we may calculate the reward coefficient vector $r^*_c$ and find the optimal policy, i.e., the policy with the highest value, using standard methods such as policy/value iteration.
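The following sketch shows one way to compute feature expectations for a deterministic policy in a small tabular MDP. It assumes that $\mu(\pi)$ is the discounted sum of state features scaled by $(1-\gamma)/k$, which matches the normalization constant mentioned in the text. The array layout (`P[a, s, s2]`) and the two-state toy MDP are assumptions made for the example.

```python
import numpy as np

def feature_expectations(P, policy, xi, Phi, gamma):
    """Normalized discounted feature expectations of a deterministic policy.

    P: (A, S, S) array, P[a, s, s2] = probability of moving s -> s2 under a.
    policy: length-S integer array, policy[s] = action taken in state s.
    xi: length-S initial state distribution.
    Phi: (S, k) feature matrix with entries in [0, 1].
    """
    S, k = Phi.shape
    P_pi = P[policy, np.arange(S), :]  # (S, S) transitions under the policy
    # Discounted state occupancy d^T = xi^T (I - gamma P_pi)^{-1}.
    occupancy = np.linalg.solve((np.eye(S) - gamma * P_pi).T, xi)
    return (1.0 - gamma) / k * occupancy @ Phi

def policy_value(r, mu):
    """Normalized value V = r^T mu(pi)."""
    return float(r @ mu)

# Hypothetical two-state MDP: action 0 stays, action 1 switches state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
Phi = np.eye(2)              # one indicator feature per state
xi = np.array([1.0, 0.0])    # always start in state 0
mu_stay = feature_expectations(P, np.array([0, 0]), xi, Phi, gamma=0.5)
mu_switch = feature_expectations(P, np.array([1, 1]), xi, Phi, gamma=0.5)
```

With these quantities, comparing two policies under a reward vector `r` reduces to comparing `policy_value(r, mu_stay)` and `policy_value(r, mu_switch)`.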

Inverse Reinforcement Learning in CMDPs:
In standard IRL the goal is to learn a reward which best explains the behavior of an observed expert. The model describing this scenario is the MDP\R, which is an MDP without a reward function (also commonly called a controlled Markov chain). Similarly, we denote a CMDP without a mapping of context to reward by CMDP\M. The goal in Contextual IRL (COIRL) is to infer a mapping $W$, from observations of an expert, which will induce near-optimal policies for all contexts. As shown in [Ng et al., 2000], the IRL problem is ill-defined and we are not guaranteed to learn the real reward or, in this case, the mapping $W^*$; however, it is still possible to learn a mapping which induces $\epsilon$-optimal policies and enables generalization to new contexts.
While learning a transition kernel and an initial distribution which are parametrized by the context is related to the IRL problem, it can be seen as a separate, precursory problem, allowing us to make the simplifying assumptions presented previously. By using existing methods to learn the mappings for the transition kernel and initial distribution in a contextual model, such as in [Modi et al., 2018], and by using the simulation lemma [Kearns and Singh, 2002], our results can be extended to a more general CMDP setting.

Methods
In this section, we study learning algorithms for COIRL that are motivated by the online treatment regime. We begin with an online learning framework, where we design algorithms that do not have access to a simulator of the environment, and the agent is only allowed to explore near-optimal actions. We then consider an offline learning framework, where observational (off-policy) data of expert demonstrations was collected a priori. For example, in the medical domain, these demonstrations may represent collected data of clinicians treating patients. Such data is publicly available, for example, in the MIMIC-III data set [Johnson et al., 2016]. Finally, we consider a warm-start framework where the agent policy is initialized in the offline framework and continues to learn in the online framework.
More explicitly, in the online framework, the agent learns under the supervision of an expert. We propose a setting in which at each time-step $t$ a new context $c_t$ is revealed, possibly adversarially, and the agent acts based on the optimal policy w.r.t. its estimated mapping $W_t$, denoted by $\hat{\pi}_t$. The expert provides two forms of supervision for the agent. First, the expert evaluates the agent's behavior and produces a binary signal which determines if the agent's policy is $\epsilon$-optimal, i.e., $V^*_{c_t^T W^*} - V^{\hat{\pi}_t}_{c_t^T W^*} \le \epsilon$. Second, when the agent is sub-optimal, the expert provides a demonstration in the form of its policy (or feature expectations) for $c_t$. The goal is to learn a mapping which induces $\epsilon$-optimal policies for all contexts based on a minimal number of examples from the expert.
Next, we present two approaches to solving the COIRL problem. We begin with the linear model, for which we propose an ellipsoid-based approach with proven polynomial upper bounds. We then consider nonlinear models and propose descent-based algorithms.

Ellipsoid algorithms for COIRL
Algorithm 1: Online ellipsoid algorithm for COIRL
The goal of the algorithms in this section is to find a linear mapping $W^*$ from contexts to rewards by observing expert demonstrations. We study ellipsoid-based algorithms that maintain an ellipsoid-shaped feasibility set that contains $W^*$. At any step, the current estimation $W_t$ of $W^*$ is defined as the center of the ellipsoid, and the agent acts optimally w.r.t. this estimation. If the agent performs sub-optimally, the expert provides a demonstration in the form of the optimal feature expectations for $c_t$, $\mu(\pi^*_{c_t})$. The feature expectations are used to generate a linear constraint (a hyperplane) that passes through the center of the ellipsoid. Under this constraint, we construct a new feasibility set that is half of the previous ellipsoid and still contains $W^*$. For the algorithm to proceed, we compute a new ellipsoid that is the minimum volume enclosing ellipsoid (MVEE) around this "half-ellipsoid". These updates are guaranteed to gradually reduce the volume of the ellipsoid (a well-known result [Boyd and Barratt, 1991]) until its center is a mapping which induces $\epsilon$-optimal policies. Theorem 1 shows that this algorithm achieves a polynomial upper bound on the number of sub-optimal time-steps. Finally, notice that in Algorithm 1 we use an underline notation to denote a "flattening" operator for matrices, and to denote a composition of an outer product and the flattening operator.
Theorem 1. In the linear setting where $R^*_c(s) = c^T W^* \phi(s)$, for an agent acting according to Algorithm 1, the number of rounds in which the agent is not $\epsilon$-optimal is $O\left(d^2 k^2 \log\left(\frac{dk}{\epsilon}\right)\right)$.
The proof for Theorem 1 is provided in the supplementary material, and is adapted from [Amin et al., 2017] to the COIRL problem.
Practical ellipsoid algorithm: In many real-world scenarios, the expert cannot evaluate the value of the agent's policy and cannot provide its policy or feature expectations. To address these issues, we consider a relaxed approach, in which the expert evaluates a single trajectory of the agent and, if it is sub-optimal, the expert demonstrates a single $H$-step trajectory. Due to the stochasticity of the underlying MDP, evaluating the value of the agent based on a single trajectory is impractical. Hence we consider an alternative approach, in which the expert evaluates each of the individual actions performed by the agent. We define the expert criterion for providing a demonstration to be for each state-action pair $(s, a)$ in the agent's trajectory. This implies that for the initial distribution which assigns probability 1 to a state in which the agent is sub-optimal, the value of the agent is not $\epsilon$-optimal, which enables us to make similar arguments as before.
Sub-optimal experts: In addition, we relax the requirement that the expert must be optimal and instead assume that, for each context $c_t$, the expert acts optimally w.r.t. $W^*_t$, which is close to $W^*$; the expert also evaluates the agent w.r.t. this mapping. This allows the agent to learn from different experts, and from non-stationary experts whose judgment and performance vary over time. If a sub-optimal action w.r.t. $W^*_t$ is played at state $s$, the expert provides a roll-out of $H$ steps from $s$ to the agent. As this roll-out is a sample of the optimal policy w.r.t. $W^*_t$, we aggregate $n$ examples to ensure that, with high probability, the linear constraint that we use in the ellipsoid algorithm does not exclude $W^*$ from the feasibility set. Note that these batches may be constructed across different contexts, different experts, and different states from which the demonstrations start. In the supplementary material, we provide pseudo code for this process (Algorithm 3). Theorem 2 below upper bounds the number of sub-optimal actions that Algorithm 3 chooses.
Theorem 2. For an agent acting according to Algorithm 3, with probability of at least $1 - \delta$, for
The proof for Theorem 2 is provided in the supplementary material, and is adapted from [Amin et al., 2017] to the setup of COIRL with near-optimal experts.
Warm-start for the ellipsoid algorithm: In this setup, the goal is to use offline data to initialize the ellipsoid algorithm with a smaller feasibility set. Although this approach leads to lower regret, an expert's supervision is still required for training, as in the online setting, in order to ensure that the optimal solution remains within the feasibility set. We simulate the online setting by iterating over the trajectories in the data. The expert evaluates the agent's suggested action for each state and provides the binary optimality signal. As each trajectory is an expert demonstration, we use it as an alternative to the online expert demonstration. By adhering to the conditions of Theorem 2, its theoretical guarantees remain.

Optimization methods for COIRL with nonlinear mappings
Algorithm 2: Black-box algorithm for COIRL. Input: contexts and their respective expert feature expectations; $\alpha$, learning rate; $\sigma$, noise standard deviation; $m$, number of evaluations.
In the previous section, we analyzed a scenario in which the mapping from contexts to rewards was linear, i.e., $R^*_c(s) = c^T W^* \phi(s)$. This reward structure enabled the analysis of the sample complexity of the ellipsoid algorithm and guaranteed its convergence. In this section, we extend the COIRL framework to nonlinear mappings, i.e., $R^*_c(s) = f^*(c)^T \phi(s)$, where $f^*$ is a nonlinear function. We formulate COIRL as an optimization problem and provide descent algorithms to solve it.
The goal is to find a mapping which induces policies that have feature expectations that match the expert's feature expectations for any context, i.e., minimize $\|\mu(\pi(f_\theta(c))) - \mu(\pi^*_c)\|$, an approach known as feature expectation matching [Abbeel and Ng, 2004, Ziebart et al., 2008]. However, minimizing such a loss is difficult, as it is piece-wise constant in $f_\theta(c)$ (or $W$ in the linear case). For this reason, we explore two surrogate loss functions (alternative loss functions whose minimization leads to feature expectation matching).
The first surrogate loss function is the MSE between the estimated value of the expert and the agent: where $\pi^*_c$ denotes the optimal policy w.r.t. $R^*_c(s)$ and $\hat{\pi}_c$ denotes the optimal policy w.r.t. $f_\theta(c)$. Note that in order to evaluate the loss, we have to compute the optimal policies w.r.t. $f_\theta(c)$, which involves solving tabular MDPs (e.g., with policy iteration). This fact makes Eq. (1) non-differentiable w.r.t. $\theta$, as solving an MDP is non-differentiable. On the other hand, the loss function is Lipschitz continuous w.r.t. $f_\theta(c)$, as the following lemma states. The proof can be found in the supplementary material and is based on the simulation lemma [Kearns and Singh, 2002].
To minimize Eq. (1), we design a black-box algorithm (Algorithm 2) that is based on Evolution Strategies (ES) [Salimans et al., 2017], a gradient-free descent method for solving black-box optimization problems based on computing finite differences [Nesterov and Spokoiny, 2017]. The algorithm receives a set of $D$ context-demonstration tuples and returns parameters $\theta$. At each step, a context-demonstration tuple is sampled at random. Next, a set of $m$ Gaussian noise vectors $(\epsilon_1, \ldots, \epsilon_m)$ is sampled and used to perturb $\theta$, yielding a set of $m$ reward functions $f_{\theta + \epsilon_j}(c)$. Each reward is used to evaluate the loss of Eq. (1), giving $L_j$. Finally, the descent direction $d_{L(\theta)}$ is computed as the sum of the perturbation vectors, weighted by the losses: $d_{L(\theta)} = \sum_{j=1}^{m} L_j \epsilon_j$.
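The ES update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 2: the loss is a generic black-box function `loss_fn` (here a hypothetical quadratic stand-in rather than the COIRL loss, which would require solving an MDP per evaluation), the perturbations are scaled by `sigma`, and the direction is normalized by `m * sigma` as in common ES implementations.

```python
import numpy as np

def es_step(theta, loss_fn, alpha, sigma, m, rng):
    """One ES descent step on a black-box loss.

    Samples m Gaussian perturbations, evaluates the loss at each perturbed
    parameter vector, and moves against the loss-weighted sum of the noise.
    """
    noise = rng.standard_normal((m, theta.size))
    losses = np.array([loss_fn(theta + sigma * eps) for eps in noise])
    # Descent direction: sum_j L_j * eps_j, normalized (an assumed scaling).
    direction = (losses[:, None] * noise).sum(axis=0) / (m * sigma)
    return theta - alpha * direction

def quadratic_loss(theta):
    """Hypothetical stand-in for the black-box loss."""
    return float(np.sum(theta ** 2))

rng = np.random.default_rng(0)
theta0 = np.array([3.0, -2.0])
theta = theta0.copy()
for _ in range(100):
    theta = es_step(theta, quadratic_loss, alpha=0.05, sigma=0.5, m=200, rng=rng)
```

No gradient of `loss_fn` is ever computed, which is what allows the same update to be applied to a loss whose evaluation involves a non-differentiable planning step.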
While the loss in Eq. (1) is intuitive, it has a few drawbacks. First, its evaluation requires solving MDPs (which is computationally prohibitive), and second, we found it hard to minimize in some settings (see the experiments section for more details). For these reasons, we consider a second surrogate optimization problem, defined by: a similar problem to the IRL formulation in [Syed and Schapire, 2008, Ho and Ermon, 2016]. This approach requires a two-step optimization process. At each iteration: (1) given the current estimation $f_\theta$, we compute the optimal policies for $\{c_i\}_{i=1}^{D}$ and their corresponding feature expectations. Then (2) given the feature expectations, we perform ES on the loss and take a single step to update $\theta$. On the positive side, this loss is differentiable, and can be optimized with standard backpropagation.

Experiments
This section is organized as follows. We begin by analyzing our approach on a common IRL task, an autonomous driving simulation [Abbeel and Ng, 2004, Syed and Schapire, 2008], adapted to the contextual setup. We then test our method in a medical domain, using a data set of expert (clinician) trajectories for treating patients with sepsis. More details will follow in the relevant subsections.
We experimented with the methods that we presented in the previous section, namely, the ellipsoid algorithm, and the ES method with the losses of Eq. (1) and Eq. (2). We evaluate and compare their cumulative regret, the number of demonstrations they require, and their ability to generalize to a holdout test set. In each experiment, we create a random sequence of contexts $\{c_t\}$, average the results across several seeds and report the mean and the standard deviation. Note that once an algorithm achieves an $\epsilon$-optimal value in the online framework, it stops requesting demonstrations from the expert. For that reason, algorithms that perform better request fewer demonstrations from the expert and their generalization graph appears truncated. We emphasize that if a plot ends abruptly in these experiments, the reason is that at that point the algorithm achieves an $\epsilon$-optimal value and stops requesting demonstrations. For the nonlinear reward model, we take $f_\theta(c)$ to be a multilayer perceptron; for the ES methods, we use value iteration to compute optimal policies; all details and hyper-parameters can be found in the supplementary material.
(1) a speed feature, (2) a collision feature, which is valued 0 in case of a collision and 0.5 otherwise, and (3) an off-road feature, which is 0.5 if the car is on the road and 0 otherwise. The environment is modeled as a tabular MDP that consists of 1531 states. The speed is selected once, at the initial state, and is kept constant afterward. The other 1530 states are generated by 17 X-axis positions for the agent's car, 3 available speed values, 3 lanes and 10 Y-axis positions in which car B may reside. During the simulation, the agent controls the steering direction of the car, moving left or right, i.e., two actions.
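The state count of the driving MDP can be checked with a short enumeration. The variable names below are illustrative assumptions; only the sizes (17, 3, 3, 10 and the single initial state) come from the text.

```python
from itertools import product

X_POSITIONS = range(17)   # X-axis positions of the agent's car
SPEEDS = range(3)         # available speed values
LANES = range(3)          # lanes in which car B may reside
Y_POSITIONS = range(10)   # Y-axis positions in which car B may reside

# Every combination of the four components is one driving state.
driving_states = list(product(X_POSITIONS, SPEEDS, LANES, Y_POSITIONS))

# One extra initial state, in which the speed is selected.
num_states = 1 + len(driving_states)
```

The product gives 17 × 3 × 3 × 10 = 1530 driving states, and adding the initial speed-selection state yields the 1531 states of the tabular MDP.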

Driving simulation
In this task, the context vector implies different priorities for the agent; should it prefer speed or safety? Is going off-road to avoid collisions a valid option? For example, an ambulance will prioritize speed and may allow going off-road as long as it goes fast and avoids collisions, while a bus will prioritize avoiding both collisions and off-road driving, as structural integrity is its main concern. The optimal behavior is defined using a linear mapping $W^*$ or a nonlinear mapping $f : \mathcal{C} \to [-1, 1]^k$. To demonstrate the effectiveness of our solutions, our mappings are constructed in a way that induces different behaviors for different contexts, making generalization a challenging task. For the nonlinear task, we consider two reward coefficient vectors $r_1$ and $r_2$, and define the mapping by $f^*(c) = r_1$ if $\|c\|_\infty \ge 0.55$, and $r_2$ otherwise.
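A minimal sketch of the piecewise-constant nonlinear mapping described above. The coefficient vectors `R1` and `R2` are hypothetical placeholders over the three features (speed, collision, off-road); the actual vectors used in the experiments are not given in this section.

```python
def f_star(c, r1, r2, threshold=0.55):
    """Return r1 if the max-norm of the context is at least threshold, else r2."""
    return r1 if max(abs(x) for x in c) >= threshold else r2

# Hypothetical reward coefficient vectors in [-1, 1]^3.
R1 = (1.0, 0.5, -0.5)
R2 = (-0.5, 1.0, 1.0)

dominated_context = (0.6, 0.3, 0.1)   # one coordinate dominates: maps to R1
balanced_context = (0.4, 0.3, 0.3)    # no coordinate reaches 0.55: maps to R2
```

Since contexts lie on the simplex, the condition selects contexts that are dominated by a single coordinate, so no single linear map $W$ reproduces this mapping exactly.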
Results: For the online linear setting (Fig. 3), we define the optimality threshold to be $\epsilon = 10^{-3}$ for all algorithms. We report the cumulative regret (Fig. 3a), the number of demonstrations that each algorithm requested (Fig. 3b), and their ability to generalize to a holdout test set (Fig. 3c), all calculated using 20 seeds. Examining the results, we can see that despite the theoretical guarantees of the ellipsoid, the descent methods achieve better sample efficiency and regret. Also, Eq. (2) leads to better regret overall and requires significantly fewer demonstrations to reach $\epsilon$-optimal performance.
For the nonlinear online setting (Fig. 4), we compare the ES method for minimizing loss (2) with the ellipsoid algorithm, with $\epsilon = 10^{-3}$ across 5 seeds. These results demonstrate that the ellipsoid does not perform well in nonlinear settings, as highlighted by its inability to generalize and its linear regret growth, while the ES method with a nonlinear model is able to converge to a near-optimal solution. Notably, loss (1) is excluded from the nonlinear results, as it was unable to generalize and thus required a demonstration at nearly every time-step, making it about on par with the ellipsoid. A possible explanation is that this loss discourages advancing in the correct direction under certain circumstances. For example, consider the case where the agent's coefficient for the speed feature is 0.1, and the agent's and expert's feature expectations are 0.5 and 1, respectively. A speculative step increasing the coefficient to 0.2 may not be sufficient to change the agent's feature expectations, and thus will increase the loss. On the other hand, an increase in the coefficient is necessary to match the feature expectations; therefore, the step the ES algorithm takes would go in the opposite direction. Loss (2) avoids such issues, which may explain its superior results.

Dynamic treatment regime
In this setup, there is a sick patient and a clinician who acts to improve the patient's medical condition.
The context (static information) represents patient features which do not change during treatment, such as age and gender. The state of the agent summarizes the dynamic measurements of the patient throughout the treatment, such as blood pressure and EEG readouts. The action space, i.e., the clinician's actions, consists of a sequence of decision rules, one per stage of intervention, and represents a combination of intervention categories. Dynamic treatment regimes are particularly useful for managing chronic disorders and fit well into the larger paradigm of personalized medicine [Komorowski et al., 2018, Prasad et al., 2017].
We focus on an intensive care task, where the agent needs to choose the right treatment for a patient that is diagnosed with sepsis. We use the MIMIC-III data set [Johnson et al., 2016] and follow the data processing steps that were taken in Jeter et al. [2019]. However, performing off-policy evaluation is not possible using this data set, as it does not satisfy basic requirements [Gottesman et al., 2018, 2019]. Therefore, we designed a simulator of a CMDP, based on this data.
The data set consists of 5366 trajectories. Each trajectory represents a sequential treatment that was provided by a clinician to a patient. The available information for each patient consists of $d = 8$ static features (the context, e.g., gender, age) and $k = 42$ dynamic measurements of the patient at each time step (e.g., heart rate, body temperature). In addition, each trajectory contains the reported clinician actions (the amount of fluids and vasopressors given to a patient at each time-step, binned to 25 different values), and a mortality signal which indicates whether the patient was alive 90 days after hospital admission. In order to create a tabular MDP, we cluster the dynamic features using K-means [MacQueen et al., 1967]. Each cluster is considered a state and the coordinates of the cluster centroids are taken as its features $\phi(s)$. We construct the transition kernel between the clusters using the empirical transitions in the data. As in the previous sections, we consider a reward which is linear in $W$, i.e., $R^*_c(s) = c^T W^* \phi(s)$, where $W^* \in \mathbb{R}^{8 \times 42}$ is a matrix we construct from the data for the simulator. In the simulator, the expert acts optimally w.r.t. this $W^*$.
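The empirical transition kernel can be sketched as a counting procedure over clustered trajectories. This is an illustration under assumptions: each trajectory is taken to be a list of `(state, action, next_state)` index triples produced after cluster assignment (the K-means step itself is omitted), and unvisited state-action pairs fall back to a uniform distribution, a choice the text does not specify.

```python
import numpy as np

def empirical_transition_kernel(trajectories, n_states, n_actions):
    """Estimate P[a, s, s2] from observed (s, a, s2) transitions by counting."""
    counts = np.zeros((n_actions, n_states, n_states))
    for trajectory in trajectories:
        for s, a, s_next in trajectory:
            counts[a, s, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Normalize visited rows; unvisited (s, a) pairs become uniform (assumed).
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

# Hypothetical toy data: 2 clusters (states), 2 binned actions.
toy_trajectories = [
    [(0, 0, 1), (1, 1, 0)],
    [(0, 0, 0), (0, 0, 1)],
]
P_hat = empirical_transition_kernel(toy_trajectories, n_states=2, n_actions=2)
```

In the setting of the text, `n_states` would be the number of K-means clusters and `n_actions` the 25 binned fluid and vasopressor combinations.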
When treating a sepsis patient, the clinician has several decisions to make. One such decision is whether or not to provide a patient with vasopressors, drugs which are commonly applied to restore and maintain blood pressure in patients with sepsis. However, what is regarded as normal blood pressure differs based on the age and weight of the patient [Wesselink et al., 2018]. In our setting, $W$ captures this information, as it maps from contextual information (age) and dynamic information (blood pressure) to reward.
Results: Here we compare our algorithms within the online framework, over 1000 time-steps, where $\epsilon = 5 \times 10^{-4}$, across 5 seeds for all algorithms. Similarly to the autonomous vehicle experiments, we measure the regret (Figure 5a) and the number of demonstrations that each algorithm requested (Figure 5b). In addition to generalization to a holdout set in Figure 5c, we provide results for the inaccuracy (miss rate) of the agents, i.e., in how many states the policy of the agent differs from that of the expert. These results suggest that in this more complicated environment, the ES approaches perform even better compared to the ellipsoid method. While all algorithms are able to learn and generalize, both ES approaches require significantly fewer demonstrations and accumulate less regret. We also note that although the miss rate decreases over time, it does not go below 11% for any of the methods. This shows that while the accuracy metric is indicative of good performance, it may not be a good metric when evaluating policies learned through IRL, as it only measures the ability to imitate the expert rather than the ability to learn the latent contextual reward structure.

Discussion
We studied the COIRL problem with linear and nonlinear reward mappings. While nonlinear mappings are more appropriate to model real-world problems, for a linear mapping we were able to provide theoretical guarantees and sample complexity analysis. Moreover, when applying AI agents to real-world problems, interpretability of the learned model is of major importance, in particular when considering deployment in medical domains [Komorowski et al., 2018]. Interpretability of linear models can be achieved by analyzing the mapping $W$ and providing insights on the importance of specific features for specific contexts.
We experimented with two approaches for COIRL in the linear setup: the ellipsoid and the ES methods. While the ellipsoid has theoretical guarantees, we observed that ES performed better in all of our experiments. This raises an important question: what is the lower bound on the number of samples required? In [Amin et al., 2017], the ellipsoid method was proposed in a non-contextual IRL setup, and was shown to achieve a sample complexity of $d^2 \log(1/\epsilon)$, while the lower bound is $d \log(1/\epsilon)$. This gap may explain the fact that ES achieved better performance than the ellipsoid, even in the linear setup, although we cannot analyze its performance.
Finally, the literature on contextual MDPs is concerned with providing theoretical guarantees and sample complexity analysis for the scenario in which we can model each patient as a tabular MDP. However, when the measurements of the patient are continuous, deep learning methods are likely to perform better than state aggregation. In the deep setup, the critical question is how to design an architecture that will leverage the structure of the static and dynamic information. While there has been some preliminary work in robotics domains [Xu et al., 2018], these works often focus on meta-learning, i.e., few-shot adaptation, whereas COIRL considers the zero-shot scenario.
A Ellipsoid Algorithm for trajectories
Algorithm 3: Batch ellipsoid algorithm for COIRL

B MVEE computation
This computation is commonly found in optimization lecture notes and textbooks. First, we define an ellipsoid by $\{x : (x - c)^T Q^{-1} (x - c) \le 1\}$ for a vector $c$, the center of the ellipsoid, and an invertible matrix $Q$. Our first task is computing $\Theta_1$, the MVEE for the initial feasibility set The result is of course a sphere around 0: and calculate the new ellipsoid by
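For reference, the textbook central-cut ellipsoid update can be sketched as below. This is the standard formula from the optimization literature, offered as an assumption about the kind of computation meant here rather than as the exact expressions of this appendix. The helper `initial_ellipsoid` assumes the initial feasibility set is the cube $[-1, 1]^{D}$ (from $\|W^*\|_\infty \le 1$), whose enclosing ball has radius $\sqrt{D}$.

```python
import numpy as np

def initial_ellipsoid(dim):
    """Ball of radius sqrt(dim) around 0, which encloses the cube [-1, 1]^dim."""
    return np.zeros(dim), dim * np.eye(dim)

def central_cut_update(center, Q, g):
    """MVEE of the half-ellipsoid {x in E : g^T (x - center) <= 0}.

    E = {x : (x - center)^T Q^{-1} (x - center) <= 1}; requires dimension >= 2.
    """
    n = center.size
    g_tilde = g / np.sqrt(g @ Q @ g)   # scale so that g_tilde^T Q g_tilde = 1
    Qg = Q @ g_tilde
    new_center = center - Qg / (n + 1)
    new_Q = (n ** 2 / (n ** 2 - 1.0)) * (Q - (2.0 / (n + 1)) * np.outer(Qg, Qg))
    return new_center, new_Q

def volume_ratio(Q_old, Q_new):
    """Ratio of ellipsoid volumes, Vol(new) / Vol(old)."""
    return float(np.sqrt(np.linalg.det(Q_new) / np.linalg.det(Q_old)))

# Cut the unit disc with the half-plane x_1 <= 0.
c0, Q0 = np.zeros(2), np.eye(2)
c1, Q1 = central_cut_update(c0, Q0, np.array([1.0, 0.0]))
ratio = volume_ratio(Q0, Q1)
```

To keep the opposite half-space, $g^T (x - c) \ge 0$, the same function can be called with `-g`. In the example the volume shrinks by a factor of about 0.77, below the classical bound $e^{-1/(2(n+1))}$ for $n = 2$.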

C Proof of Theorem 1
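For simpler analysis, we define a "flattening" operator, converting a matrix to a vector: $\mathbb{R}^{d \times k} \to \mathbb{R}^{d \cdot k}$ by $\underline{W} = [w_{1,1}, \ldots, w_{1,k}, \ldots, w_{d,1}, \ldots, w_{d,k}]$. We also define the operator to be the composition of the flattening operator and the outer product: Therefore, the value of policy $\pi$ for context $c$ is given by
Lemma 2 (Boyd and Barratt [1991]). If $B \subseteq \mathbb{R}^D$ is an ellipsoid with center $w$, and $x \in \mathbb{R}^D \setminus \{0\}$, we define $B' = \mathrm{MVEE}(\{\theta \in B : (\theta - w)^T x \ge 0\})$; then $\frac{\mathrm{Vol}(B')}{\mathrm{Vol}(B)} \le e^{-\frac{1}{2(D+1)}}$.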
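The role of the flattening operator can be checked numerically: the value $c^T W \mu(\pi)$ equals the inner product between the flattened $W$ and the flattened outer product of $c$ and $\mu(\pi)$. The sketch below uses random, hypothetical values for `W`, `c` and `mu`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 4
W = rng.uniform(-1.0, 1.0, size=(d, k))   # hypothetical mapping
c = rng.dirichlet(np.ones(d))             # a context on the simplex
mu = rng.uniform(0.0, 1.0, size=k) / k    # hypothetical feature expectations

# Matrix form of the value: c^T W mu.
value_matrix_form = c @ W @ mu

# Flattened form: <flatten(W), flatten(c mu^T)>, using row-major flattening
# [w_11, ..., w_1k, ..., w_d1, ..., w_dk] as in the definition above.
value_flat_form = W.flatten() @ np.outer(c, mu).flatten()
```

This identity is what lets the ellipsoid be maintained over vectors in $\mathbb{R}^{d \cdot k}$ while each expert demonstration contributes a single linear constraint.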
Proof of Theorem 1. We prove the theorem by showing that the volume of the ellipsoids $\Theta_t$ for $t = 1, 2, \ldots$ is bounded from below. In conjunction with Lemma 2, which claims there is a minimal rate of decay in the ellipsoid volume, this shows that the number of times the ellipsoid is updated is polynomially bounded.
We begin by showing that $W^*$ always remains in the ellipsoid. We note that in rounds where > . In addition, as the agent acts optimally w.r.t. the reward $r_t = c_t^T W_t$, we have that $c_t^T W_t \left(\mu(\pi^*_{c_t}) - \mu(\pi_t)\right) \le 0$. Combining these observations yields: This shows that $W^*$ is never disqualified when updating $\Theta_t$. Since $W^* \in \Theta_0$, this implies that $\forall t : W^* \in \Theta_t$.
Now we show that not only $W^*$ remains in the ellipsoid, but also a small ball surrounding it. If $\theta$ is disqualified by the algorithm: Multiplying this inequality by $-1$ and adding it to (3) yields: We apply Hölder's inequality to the LHS: Finally, let $M_T$ be the number of rounds by $T$ in which

D Proof of Theorem 2
Lemma 3 (Azuma's inequality). For a martingale
Proof of Theorem 2. We first note that we may assume that for any $t$: where $e_j$ is the indicator vector of coordinate $j$ in which $W_t$ exceeds 1, and the inequality direction depends on the sign of $(W_t)_j$. If $W_t \notin \Theta_0$ still, this process can be repeated for a finite number of steps until $W_t \in \Theta_0$, as the volume of the ellipsoid is bounded from below and each update reduces the volume (Lemma 2). Now we have W As no points of $\Theta_0$ are removed this way, this does not affect the correctness of the proof. Similarly, we may assume
We denote $W_t$, which remains constant for each update in the batch, by $W$. We define $t(i)$ as the time-steps corresponding to the demonstrations in the batch, for $i = 1, \ldots, n$. We define $z^{*,H}_i$ to be the expected value of $\hat{z}^{*,H}_i$, and $z^*_i$ to be the outer product of $c_{t(i)}$ and the feature expectations of the expert policy for $W^*_{t(i)}, c_{t(i)}, \xi_{t(i)}$. We also denote $W^*_{t(i)}$ by $W^*_i$. We bound the following term from below, as in Theorem 1: (1): is bounded from below by , identically to the previous proof. ( we can apply Azuma's inequality (Lemma 3) with $b = 4$ and with our chosen $n$ this yields: ) is never disqualified, and the number of updates is bounded by $2dk(dk+1)\log\left(\frac{12\sqrt{dk}}{\epsilon}\right)$, and multiplied by $n$ this yields the upper bound on the number of rounds in which a sub-optimal action is chosen. By the union bound, the required bound for term (4) holds in all updates with probability of at least $1 - \delta$.

E Proof of Lemma 1
Proof of Lemma 1. Our proof leverages the simulation lemma, showing that a small change in $f_\theta$ leads to a small change in $R$. In turn, the resulting policies are 'close' in value. We recall the results from [Kearns and Singh, 2002], both the definition of an $\alpha$-approximate MDP and the similarity result over the resulting value functions (Lemma 4).
Definition 1. Let $M$ and $\hat{M}$ be Markov decision processes over the same state space. Then we say that $\hat{M}$ is an $\alpha$-approximation of $M$ if:
Lemma 4 (Simulation Lemma [Kearns and Singh, 2002]). Let $M$ be any Markov decision process over , where $A$ and $B$ are some constants. This implies that the MSE is also Lipschitz, which concludes our proof.
Remark 1. This analysis can be extended to show that the objective is Lipschitz in $\theta$, e.g., the parameters of a neural network. Methods presented in Cisse et al. [2017] and Arjovsky et al. [2017] can be used to force the network to be Lipschitz continuous and in turn ensure that the objective is Lipschitz in $\theta$.

F Experimental Details
In this section, we describe the technical details of our experiments, including the hyper-parameters used. To solve MDPs, we use value iteration. Our implementation is based on a stopping condition with a tolerance threshold, $\tau$, such that the algorithm stops if $|V_t - V_{t-1}| < \tau$. In the driving simulation we used $\tau = 10^{-4}$ and in the sepsis treatment we used $\tau = 10^{-3}$.
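The stopping rule above can be sketched as follows. This is a minimal tabular value iteration, not the implementation used in the experiments: the array layout (`P` of shape actions x states x states, state-based rewards `R`), the discount factor, the iteration cap, and the use of the maximum norm for $|V_t - V_{t-1}|$ are all assumptions of the example.

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, tol=1e-4, max_iters=100_000):
    """Tabular value iteration with a tolerance-based stopping condition.

    P: array of shape (A, S, S), P[a, s, s'] = transition probability.
    R: array of shape (S,), reward of each state.
    Stops once max_s |V_t(s) - V_{t-1}(s)| < tol (max norm is an assumption).
    """
    V = np.zeros(P.shape[1])
    for _ in range(max_iters):
        Q = R[None, :] + gamma * (P @ V)    # Q[a, s], shape (A, S)
        V_new = Q.max(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy with respect to the final value estimate.
    Q = R[None, :] + gamma * (P @ V)
    return V, Q.argmax(axis=0)
```

With a discount factor $\gamma < 1$, stopping at tolerance $\tau$ leaves the value estimate within $\tau \gamma / (1 - \gamma)$ of the optimal value in the maximum norm, so a smaller $\tau$ trades running time for accuracy.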

F.1 Autonomous driving simulation
In these experiments, we define our mappings in a way that induces different behaviours for different contexts, making generalization a more challenging task. Specifically, for the linear setting we use $W^* = \begin{pmatrix} -1 & 0.75 & 0.75 \\ 0.5 & -1 & 1 \\ 0.75 & 1 & -0.75 \end{pmatrix}$, before normalization. For our nonlinear mapping, contexts with $\|c\|_\infty > 0.55$ are mapped to the reward coefficient vector $(1, -1, -0.05)$; otherwise they are mapped to $(-0.01, 1, -1)$. These induce the feature expectations $(9.75, 3.655, 5)$ and $(5.25, 5, 2.343)$, respectively. The decision regions for the nonlinear mapping are visualized in Appendix F.1. The contexts are sampled uniformly in the 2-dimensional simplex. We evaluate all algorithms on the same sequences of contexts, and average the results over 20 such sequences.
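The context sampling and the nonlinear mapping described above can be sketched as follows. The function names are hypothetical, and drawing from a Dirichlet distribution with all parameters equal to 1 is our choice of how to sample uniformly from the simplex; the threshold and the two coefficient vectors are the ones stated in the text.

```python
import numpy as np


def sample_contexts(n, rng):
    """Draw n contexts uniformly from the 2-dimensional simplex.

    Dirichlet(1, 1, 1) is the uniform distribution on that simplex.
    """
    return rng.dirichlet(np.ones(3), size=n)


def nonlinear_reward_coefficients(c, threshold=0.55):
    """Piecewise-constant mapping from a context to reward coefficients."""
    if np.max(np.abs(c)) > threshold:      # ||c||_inf > 0.55
        return np.array([1.0, -1.0, -0.05])
    return np.array([-0.01, 1.0, -1.0])
```

Because the mapping is piecewise constant in the context, it cannot be represented by a single linear map, which is what makes this setting a test of the nonlinear estimators.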
Hyper-parameter selection: By definition, the ellipsoid algorithm is hyper-parameter free and does not require tuning. For loss (1), the ES algorithm was executed with the parameters $\sigma = 0.1$, $m = 8$, $\alpha = 0.1$ with a decay rate of 0.94, for 50 epochs, where the algorithm takes one step to minimize the loss for each context, and the order of the contexts was randomized when a new context was added but not during the algorithm run. For loss (2), the algorithm was executed with the parameters $\sigma = 10^{-3}$, $m = 250$, $\alpha = 0.1$ with a decay rate of 0.95, for 50 iterations, which did not iterate randomly over the contexts but rather used the entire training set for each step. Note that for this loss, the additional points sampled for the descent direction estimation do not require solving MDPs, and thus more can be used for a more accurate calculation. For both losses, the matrix was normalized according to $\| \cdot \|_2$, and so was the step calculated by the ES algorithm, before it was multiplied by $\alpha$ and applied.
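To show how these hyper-parameters interact, here is a minimal sketch of a full-batch ES descent loop with a normalized step and a normalized parameter matrix. It is an illustration under assumptions rather than the code used in the experiments: the Gaussian finite-difference estimator in `es_direction`, the function names, and the generic `loss` callable are ours, and the defaults simply mirror the values quoted for loss (2).

```python
import numpy as np


def normalize(X, tiny=1e-12):
    """Scale X to unit L2 (Frobenius) norm."""
    return X / max(np.linalg.norm(X), tiny)


def es_direction(loss, W, sigma, m, rng):
    """Estimate an ascent direction of `loss` at W from m Gaussian perturbations.

    Assumed estimator: (1 / (m * sigma)) * sum_i (L(W + sigma*e_i) - L(W)) * e_i.
    """
    base = loss(W)
    g = np.zeros_like(W)
    for _ in range(m):
        eps = rng.standard_normal(W.shape)
        g += (loss(W + sigma * eps) - base) * eps
    return g / (m * sigma)


def es_descent(loss, W0, sigma=1e-3, m=250, alpha=0.1, decay=0.95,
               iters=50, seed=0):
    """Descend on `loss`, normalizing both the step and the matrix."""
    rng = np.random.default_rng(seed)
    W = normalize(W0)
    for _ in range(iters):
        step = normalize(es_direction(loss, W, sigma, m, rng))
        W = normalize(W - alpha * step)    # apply the step, then renormalize
        alpha *= decay                     # decaying step size
    return W
```

Normalizing the step makes $\alpha$ the actual distance moved per iteration, so the decay rate directly controls how finely the iterate settles near a minimum.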
For the nonlinear setting, the model used for the nonlinear mapping was a fully connected neural network, with layers of sizes 15, 10, 5, 3. The activation function used was the leaky ReLU function, with a parameter of 0.1. Note that we cannot normalize the parameters here as in the linear case; therefore, an L2-normalization layer is added to the output. The same parameters were used as in the linear case, except with 120 iterations over the entire training set. They were originally optimized for this model and setting and worked as-is for the linear environment. As we aim to estimate the gradient, a small $\sigma$ was used and performed best. The number of points, $m = 250$, was selected as fewer points produced noisy results. The step size, decay rate and the number of iterations were selected in a way that produced fast yet accurate convergence of the loss. Note that here the steps were also normalized before application, and the normalization was applied per layer.
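The architecture described above can be sketched as a plain forward pass. Several details are assumptions of this example and are not stated in the text: the input dimension of 3 (one entry per context coordinate), the use of bias terms, the Gaussian initialization, and the absence of an activation after the last linear layer. The function names are hypothetical.

```python
import numpy as np

LAYER_SIZES = [15, 10, 5, 3]    # layer widths quoted in the text


def init_params(input_dim, rng, scale=0.5):
    """Random weights and zero biases for each layer (assumed initialization)."""
    params, fan_in = [], input_dim
    for size in LAYER_SIZES:
        params.append((scale * rng.standard_normal((fan_in, size)),
                       np.zeros(size)))
        fan_in = size
    return params


def leaky_relu(x, slope=0.1):
    """Leaky ReLU with negative-side slope 0.1."""
    return np.where(x > 0, x, slope * x)


def forward(params, c):
    """Map a context c to a unit-norm vector of reward coefficients."""
    h = np.asarray(c, dtype=float)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:            # no activation on the last layer
            h = leaky_relu(h)
    return h / max(np.linalg.norm(h), 1e-12)   # L2-normalization layer
```

Placing the normalization at the output keeps the predicted reward coefficients on the unit sphere regardless of the scale of the weights, which plays the role that normalizing the matrix plays in the linear case.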
We also provide results for the offline framework, demonstrating that the ellipsoid method is not suited for this framework and must be initiated in the manner we describe in the warm start section. Here, we used a training and test set of contexts to evaluate the algorithms. The ellipsoid method uses all contexts in the training set to update its estimation of $W^*$, as it would for $\epsilon = 0$. The results show that the descent methods are appropriate for the offline framework, and the ellipsoid is not.

F.2 Sepsis treatment
The environment we describe in 4.2 simulates a decision-making process for treating sepsis. Sepsis is a life-threatening severe infection, where the treatment applied to a sepsis patient is crucial for saving

Figure 2: Driving simulator
The driving task simulates a three-lane highway, in which there are two visible cars: car A and car B. The agent, controlling car A, can drive both on the highway and off-road. Car B drives on a fixed lane, at a slower speed than car A. Upon leaving the frame, car B is replaced by a new car, appearing in a random lane at the top of the screen. The reward is defined to be linear in the feature expectations, $R^*_c(s) = {r^*_c}^T \phi(s)$, where $\phi(s)$ is composed of 3 features: (1) a speed feature, (2) a collision feature, which is valued 0 in case of a collision and 0.5 otherwise, and (3) an off-road feature, which is 0.5 if the car is on the road and 0 otherwise. The environment is modeled as a tabular MDP that consists of 1531 states. The speed is selected once, at the initial state, and is kept constant afterward. The other 1530 states are generated by 17 X-axis positions for the agent's car, 3 available speed values, 3

Figure 3: Experimental results in the autonomous driving simulation with a linear mapping

Figure 4: Experimental results in the autonomous driving simulation with a nonlinear mapping
Figure 5: Experimental results in the dynamic treatment regime with a linear mapping
for some constant $C$ and for all contexts $c$, implies that $\|R_c - \hat{R}_c\|_\infty \le \epsilon$. Plugging this result into the simulation lemma we conclude that |V π

Figure 6: Visualization of nonlinear decision boundaries