Learning from demonstration for locally assistive mobility aids

Active assistive systems for mobility aids are largely restricted to environments mapped a-priori, while passive assistance primarily provides collision mitigation and other hand-crafted behaviors in the platform's immediate space. This paper presents a framework providing active short-term assistance, combining the freedom of location independence with the intelligence of active assistance. Demonstration data consisting of on-board sensor data and driving inputs is gathered from an able-bodied expert maneuvering the mobility aid around a generic interior setting, and used in constructing a probabilistic intention model built with Radial Basis Function Networks. This allows for short-term intention prediction relying only upon immediately available user input and on-board sensor data, to be coupled with real-time path generation based upon the same expert demonstration data via Dynamic Policy Programming, a stochastic optimal control method. Together these two elements provide a combined assistive mobility system, capable of operating in restrictive environments without the need for additional obstacle avoidance protocols. Experimental results both in simulation and on the University of Technology Sydney semi-autonomous wheelchair, in settings not seen in training data, show promise in assisting users of power mobility aids.


Introduction
The global population is ageing quickly, with the worldwide proportion of people aged 60 and over expected to double by 2050 (United Nations 2015). As mobility aids promote independence and self-esteem in elderly and frail users, there is a strong motivation to develop intelligent systems capable of providing improved assistance to these groups (Tegart 2010). One aim is thus to develop an assistive framework that can be incorporated into power mobility devices (PMDs) such as the one shown in Fig. 1.
As devices such as power wheelchairs are large, heavy and powerful machines it is common for prospective users to meet a strict set of conditions before prescription is approved (Queensland Department of Health 2016) even if they are otherwise capable of independently performing other routine tasks. In this light, the target end users of this work are those who are still capable of independence apart from mobility rather than those requiring constant oversight from a carer due to more complex healthcare and lifestyle needs.
Assistance for sensor-equipped PMDs can be broadly divided into two dominant flavors, referred to as reactive or active assistance. Most reactively assistive systems keep the user significantly in the loop, only intervening in the user's input commands when collision is imminent or autonomous takeover is momentarily required, for example, in traversing a particularly narrow passage. In active assistance, the user is mostly removed from the control loop; their command signals are used to infer an intended destination, to which autonomous navigation algorithms then guide the PMD.
Reactive assistance primarily aims to mitigate collisions, as well as possibly accommodating hand-tooled heuristic behaviors such as driving parallel to walls. Earlier systems such as the Bremen Autonomous Wheelchair (Lankenau et al. 1998) or NavChair (Simpson et al. 1998) provided safety mechanisms such as nullifying potentially hazardous input signals or incorporating obstacle avoidance behaviours adapted from algorithms developed for autonomous robots, e.g. the Vector Field Histogram (Borenstein and Koren 1991) used recently in Ashley et al. (2017) and Li et al. (2017). A more recent approach in reactively assistive PMDs is the weighted fusion of robot command signals with that from the user, as investigated by Devigne et al. (2016), Goil et al. (2013) and Urdiales et al. (2011) among others.
These frameworks utilise a measure of "goodness" from the user's behaviours to allocate authority over the final command signal, typically a weighted sum between the user's input and command signals inferred from another algorithm. This measure typically considers metrics such as obstacle proximity, user input fluidity or navigational task relevance. The authors believe however that it is inherently safer for the user to merely provide suggestions to a system that is in total control at all times, as there is little promise of safety in the combined command signal even if both may be safe in isolation. The primary disadvantage to reactive assistance is that the user's intended destination in most implementations is not considered. As a result the system is unable to actively assist the user in areas where pure obstacle avoidance would become trapped in local minima.
Active assistance tends to rely on a-priori maps of the spaces that the system operates in. An early instance of this can be seen again on the NavChair (Simpson and Levine 1999), which seeks to change between modes such as doorway navigation and wall-following depending on its location and observed surroundings. Given data obtained from end-users or demonstrators, later works have approached inference of the likeliest intended target as a classification exercise. Such long-term "global" destination inference has previously been addressed by several means including Hierarchical Hidden Markov Models (Patel et al. 2014), Gaussian Processes (Matsubara et al. 2015) and various Bayesian frameworks (Huntemann et al. 2013; Escobedo et al. 2014) among other heuristic approaches (Carlson and Demiris 2012; Derry and Argall 2013; Narayanan et al. 2016). Once an intention is deemed sufficiently probable, semi-autonomous navigation typically commences. In other domains such as actuated walking canes (Wakita et al. 2013) and assistive limb exoskeletons (Huang et al. 2015; Li and Ge 2014), intentions have also been modelled as desired walking directions or limb movements respectively.
A motivating example for active assistance can be seen in Fig. 2. User inputs are projected forward over a brief temporal window, yielding an approximate navigational goal for a reactive local controller; in this example, the Dynamic Window Approach (Fox et al. 1997). These algorithms consider both the goal and immediately available sensor data to yield a safe control point and corresponding twist; however as they are limited solely to determining the next platform control signal with no consideration of longer-term path viability, the platform can become stuck in scenarios where comprehensive path planning is necessary for ensuring reliable traversal.
A significant limitation of active assistance is its restriction to areas with up-to-date occupancy maps. It is arguably more desirable for a system to provide assistance 'anywhere', as reactively assistive frameworks can, without the need for map-building, by relying only upon immediately available on-board sensor data. In this "local" space one can infer immediate short-term destinations, being points of interest that the user wishes to pass through or stop at in a given instant, towards which short-term path planning can take place. It is these short-term destinations which are defined as intentions for the purpose of this work. The limited physical range of these short-term destinations in comparison to the long-term destinations in the aforementioned literature is not a significant issue, given that user inputs and sensor data are readily available over the entire duration of travel. Although such an approach does not guarantee global long-term optimality, as postulated in Huntemann et al. (2007) a destination inference and its corresponding path only have to be accurate for a brief portion of travel until the next path is available, resulting in an overall route that is a concatenation of many fragments of these short-term paths. While the aforementioned classification methodologies can function solely on low-dimensional pose information when considered alongside user input data, applying such an approach directly to sensor data is meaningless as there is little correlation between long-term destination and the immediately observable scene. However, the high dimensionality of typical mobile robot sensor data has, to the best of the authors' knowledge, left short-term active assistance a largely unexplored area.
The contribution of this work to the literature on shared control of PMDs is short-term intention estimation without dependence upon an a-priori map, enabling subsequent robot path planning and navigation in a non-reactive manner. As these intentions can encompass difficult locations such as doorways, which reactively assistive systems handle through purpose-designed heuristics, the inference of intentions allows the merging of collision avoidance and selective interaction with artefacts, bypassing the need for the situational behaviors found in systems based upon obstacle avoidance. Our work was originally published as a conference paper (Poon et al. 2017). This manuscript expands upon this preliminary work with the following additions:
1. Consideration of additional metrics (Urdiales et al. 2013; Yang et al. 2017) for a more thorough framework evaluation.
2. A human-centric analysis of the simulator used in both works against the real wheelchair.
3. Further experimentation in simulation with a larger user pool in a more complex test environment, and on the UTS wheelchair with a disabled volunteer.
4. A quantitative a-posteriori evaluation of the framework's predictive accuracy on driving data by able users.
The remainder of this paper is arranged as follows. Section 2 documents a methodology to capture the local intention inference and path planning behaviors of able experts. Section 3 presents new experiments undertaken to evaluate the assistive framework both in simulation and on an instrumented mobility aid (Fig. 1), with a discussion of results and outcomes following in Sect. 4. Section 5 closes with conclusions and future work.

Methodology
This section details the modelling of local intention estimation and path planning, as summarized in Fig. 3. Training data comes from an able demonstrator driving the PMD around a simulated interior space while position, expert joystick inputs and sensor measurements from a planar laser scanner are logged. A simulated environment is used here to allow for flexibility in creating environments suited to the capture of expert behaviours. As demonstrated in Sect. 3.1, the simulation behaves acceptably closely to the real PMD for data collection and user interaction. For intention estimation a behavioural model is built yielding a likelihood distribution across discrete points in a moving window centred on the PMD, trained only upon sensor data and user input. In keeping with the proposed definition of a destination as a point which the user intends to traverse to or through, this distribution then allows the likeliest destination point to be inferred. This is covered in Sect. 2.2.
From the same training data, a methodology for the learning of short-term path planning behaviors and user compliance is also proposed and detailed in Sects. 2.3 and 2.4. Short local paths are combined with sensor data as a reward surface, which is then refined online via Dynamic Policy Programming (DPP) (Azar et al. 2012), a stochastic optimal control method. The final output of this planner is taken as the local path to be followed to its terminus, or until user compliance indicates a loss of alignment with their true intention.

Defining local paths
In order to break down a continuous sequence of driving data, several criteria to terminate paths are proposed in Fig. 4.
• Loss of visibility from starting position or exiting local window of starting position
• PMD turns ≥ 90° relative to starting orientation
• Inflection in forward/reverse joystick axis
The first criterion ensures paths are terminated once immediately available sensor data can no longer provide information on the PMD's future surroundings, whereas the latter two indicate that the user either intends to pursue a significantly different objective or has reached their final destination.
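The termination criteria can be sketched as a simple pass over logged driving data. The following is a minimal illustration rather than the authors' implementation: the function names, the log layout and the `half_window` parameter are hypothetical, the window is assumed to be the forward half of a square centred on the path's starting pose, and the loss-of-visibility test is omitted because it requires the laser scan.

```python
import math


def wrap_angle(a):
    """Wrap an angle in radians to [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def segment_paths(poses, joy_forward, half_window=2.5):
    """Split a driving log into local paths (hypothetical helper).

    poses: list of (x, y, heading) in a fixed frame, heading in radians.
    joy_forward: forward/reverse joystick axis per sample (positive = forward).
    half_window: assumed half-width of the local window in metres.
    Returns a list of (start_index, end_index) pairs, end inclusive.
    Note: the visibility criterion is NOT implemented in this sketch.
    """
    segments = []
    start = 0
    for i in range(1, len(poses)):
        x0, y0, h0 = poses[start]
        x, y, h = poses[i]
        # Express the current position in the frame of the starting pose.
        dx, dy = x - x0, y - y0
        fwd = math.cos(h0) * dx + math.sin(h0) * dy
        lat = -math.sin(h0) * dx + math.cos(h0) * dy
        # Criterion 1 (second half): the PMD has left the local window.
        left_window = not (0.0 <= fwd <= half_window and abs(lat) <= half_window)
        # Criterion 2: heading change of at least 90 degrees.
        turned = abs(wrap_angle(h - h0)) >= math.pi / 2
        # Criterion 3: sign change in the forward/reverse joystick axis.
        inflected = joy_forward[i - 1] * joy_forward[i] < 0
        if left_window or turned or inflected:
            segments.append((start, i))
            start = i
    if start < len(poses) - 1:
        segments.append((start, len(poses) - 1))
    return segments
```

Each returned index pair delimits one local path whose final pose can serve as a goal sample for the intention model.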
In determining a suitable local window size for both the latter half of the first termination criterion and the subsequent path planning, the spatial requirements of typical PMDs are first considered. In Australia, where this research was undertaken, an open circulation space of approximately 2 m² is a requirement for building planners (Standards Australia 2009) for spaces such as landing areas near doors for disabled access bathrooms, so this is taken as a lower bound. On the basis that the framework is concerned with only an immediate path to track, it follows that the usefulness of path points diminishes as the path length increases; when the system is able to constantly replan paths at a reasonable speed, there is no need to plan beyond challenges in the immediate vicinity. Hence the requisite quantity of local space can be thought of as a softly defined region, bounded between the minimum planning space implied by the PMD's physical characteristics and an upper limit beyond which a path planning cycle becomes needlessly complex.
In this work the front half of a 5 m square moving window centred at the PMD is taken, allowing for an acceptable path planning time while also exceeding the lower bound for maneuverability. With additional sensor coverage of the wheelchair's rear, the framework would also be readily applicable to reversing maneuvers.

Local intention estimation
Here, an intention within the local window around the PMD is defined as the discrete cell with the highest likelihood of being a local path termination point. A cell size of 0.05 m × 0.05 m is taken to allow for an acceptable path planning time (Sect. 2.3). Demonstration training data contains Cartesian pose information $X_{1:N}$, expert actions $Y_{1:N}$ of 2D Cartesian joystick positions, and sensor data $Z_{1:N}$ containing polar co-ordinates $(r, \theta)_{1:|z_n|}$ of obstacle points across an on-board laser scanner's 180° horizontal field of view. For all instances of training data a local path is obtained as per the criteria above.
Rather than attempting to match instances of $z_{1,\dots,N}$ in their entirety to the new $z^*$, which would be prone to overfitting, individual laser scan beams are instead considered in conjunction with user input. This is done with the aim of both mitigating effects from the curse of dimensionality, and to allow for improved tolerance to large variation in environmental structure while retaining some information on the relationship between $Z$ and the sense of space the expert afforded the PMD. For each path indexed $n = 1, 2, \dots, N$ a training data tuple, combining the user input with individual laser beam measurements, is recorded for the cell corresponding to goal $G_n = [x_g, y_g]$, the Cartesian point at which the path terminated (Fig. 5). A summary of nomenclature can be found in Table 5.
Only the edges of obstacles within the local window perceived by $z$ are considered, to remove reliance upon obstacle points from future measurements. Each grid cell then has its own one-class classifier built in the form of a Radial Basis Function Network (Orr 1996) utilising a strong zero prior, as the data only contains positive examples. Each classifier is designed to take a training tuple as input, and is trained to yield an intention likelihood between 0 and 1 following the additive approach in Hagan et al. (2014). Taking all classifiers into consideration and normalizing across the grid cells thus yields an intention likelihood estimate $P(g^* \mid y^*, z^*)$ given new joystick input $y^*$ and laser range information $z^*$. To reduce computational cost in experiments, only classifiers for cells which received training data were queried. The final output of the intention estimator is the position $g^*$ of the cell with the highest likelihood, which serves as an objective in the subsequent path planning step.
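The per-cell classification and normalisation step can be illustrated with a small sketch. This is not the authors' model: it assumes Gaussian basis functions centred on each training tuple, weights fitted by ridge regression towards a target of 1 (the ridge term standing in for the zero prior), and a hypothetical three-element tuple of joystick x, joystick y and one beam range. All function names and parameter values are illustrative.

```python
import math


def rbf(x, c, width):
    """Gaussian radial basis function centred at c."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, c))
    return math.exp(-d2 / (2.0 * width ** 2))


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w


def train_cell(samples, width=0.5, ridge=0.1):
    """Fit one cell's network so that its positive examples map towards 1.

    The ridge penalty shrinks the output towards 0 away from the data,
    which is how this sketch approximates a zero prior.
    """
    n = len(samples)
    phi = [[rbf(a, b, width) for b in samples] for a in samples]
    # Normal equations: (Phi^T Phi + ridge * I) w = Phi^T 1
    A = [[sum(phi[k][i] * phi[k][j] for k in range(n)) + (ridge if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(phi[k][i] for k in range(n)) for i in range(n)]
    return samples, solve(A, b), width


def cell_score(model, x):
    """Unnormalised likelihood of one cell, clipped to [0, 1]."""
    centres, w, width = model
    s = sum(wi * rbf(x, c, width) for wi, c in zip(w, centres))
    return min(1.0, max(0.0, s))


def estimate_intention(models, x):
    """Normalise scores over trained cells and return (best cell, distribution)."""
    scores = {cell: cell_score(m, x) for cell, m in models.items()}
    total = sum(scores.values())
    if total == 0.0:
        return None, scores  # no cell supports this input
    probs = {cell: s / total for cell, s in scores.items()}
    return max(probs, key=probs.get), probs
```

Only cells that received training data appear in `models`, mirroring the querying shortcut described above.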
An illustrative example of the intention estimator is shown in Fig. 6.

Local path planning
Given our significantly truncated operating window and rapid rate of re-planning, path primitives are utilized for planning on account of their simplicity and speed. This is in contrast to the more exhaustively complete longer-term path planning algorithms detailed in Yang et al. (2016) and other literature.
Local paths from training data with similar endpoint orientations are gathered and a primitive is obtained as a least-squares solution. As opposed to the velocity domain-based arcs in Yang et al. (2017), the primitives in this paper are in real space and are formulated similarly to the approach in Ballesteros et al. (2017). However the environment is not immediately considered, as the framework operates solely within visible space and considers the local occupancy map later. In this work 17 discretized endpoint orientations are taken for creating expert-styled path primitives; this number was determined as the limit beyond which no training data would be available for a given orientation. Then for a goal point $g^*$, a path primitive is selected based on the nearest average end point and spatially scaled (Fig. 7) to reach it for a natural baseline. As this inferred path is not guaranteed to be safe, it instead serves as the basis for real-time path generation with consideration to the local occupancy map.
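Primitive selection and scaling can be sketched as follows. The exact scaling used in Fig. 7 is not specified in the text, so this example assumes a similarity transform (rotation plus uniform scale about the PMD origin) that maps the primitive's end point onto the goal; the function name and data layout are hypothetical.

```python
import math


def select_and_scale(primitives, goal):
    """Pick the primitive ending nearest the goal and map it onto the goal.

    primitives: list of paths, each a list of (x, y) points starting at (0, 0)
                and ending away from the origin.
    goal: (x, y) goal point in the same PMD-centred frame.
    Assumption: a similarity transform is used, which may differ from the
    scaling actually applied by the authors.
    """
    gx, gy = goal
    best = min(primitives,
               key=lambda p: math.hypot(p[-1][0] - gx, p[-1][1] - gy))
    ex, ey = best[-1]
    scale = math.hypot(gx, gy) / math.hypot(ex, ey)
    rot = math.atan2(gy, gx) - math.atan2(ey, ex)
    c, s = scale * math.cos(rot), scale * math.sin(rot)
    # Rotate and scale every point; the origin stays fixed.
    return [(c * x - s * y, s * x + c * y) for x, y in best]
```

The resulting point list is only a baseline; it still has to be checked against the local occupancy map by the planner.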

Dynamic policy programming
Stochastic optimal control learns a Markov Decision Process (MDP) defined by a 5-tuple $(S, A, T, R, \gamma)$. $S$ is a finite set of states, $A$ is a finite set of actions, $T^a_{ss'}$ is the transition probability from state $s$ to state $s'$ under the action $a$, $r^a_{ss'} = R(s, s', a)$ is the reward for moving from state $s$ to state $s'$ under the action $a$, and $\gamma \in (0, 1)$ is the discount factor. $S$ is defined as the local window around the PMD discretized into a grid-world for path planning, and $A$ as a list of 9 actions that can be taken at each grid cell: moving to any of its 8 immediate neighbors or remaining in place (Fig. 8). The value function $V_\pi(s)$ is the expected reward returned from following policy $\pi$ starting at $s$, where the expectation is taken over the transition model $T$ and the current policy $\pi$. DPP penalises this value function with a Kullback-Leibler divergence term between the current policy and a baseline policy, where $\eta \in (0, 1)$ is a constant that controls the Kullback-Leibler divergence term.
According to Azar et al. (2012), the action preferences function $\Psi$ (Sutton and Barto 1998) for all state-action pairs represents the closed form of the optimal policy $\bar{\pi}^*$ (Eq. 5), so that the optimal action preferences function $\Psi^*$ determines DPP's optimal policy. $\Psi$ is updated by the recursion in Eq. (6), in which $M_\eta \Psi_k(s)$ is the Boltzmann soft-max operator. Following Eq. (6), DPP updates the action preferences function to iteratively improve its value function towards the optimal one while considering the smoothness of the policy update (controlled by $\eta$). A summary of nomenclature can be found in Table 6.
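For reference, the optimal policy, the update recursion and the soft-max operator take the following form in the DPP formulation of Azar et al. (2012), written here in this section's notation. This is a reconstruction from that formulation rather than a verbatim copy of the paper's own equations; the correspondence to Eqs. (5) and (6) is assumed, as is the use of a sum over a discrete transition model.

```latex
% Optimal policy induced by the optimal action preferences (assumed Eq. 5)
\bar{\pi}^*(a \mid s) =
  \frac{\exp\left(\eta \Psi^*(s,a)\right)}
       {\sum_{a' \in A} \exp\left(\eta \Psi^*(s,a')\right)}

% Update recursion of the action preferences (assumed Eq. 6)
\Psi_{k+1}(s,a) = \Psi_k(s,a) - M_\eta \Psi_k(s)
  + \sum_{s' \in S} T^a_{ss'} \left( r^a_{ss'} + \gamma M_\eta \Psi_k(s') \right)

% Boltzmann soft-max operator
M_\eta \Psi_k(s) =
  \sum_{a \in A}
  \frac{\exp\left(\eta \Psi_k(s,a)\right) \Psi_k(s,a)}
       {\sum_{a' \in A} \exp\left(\eta \Psi_k(s,a')\right)}
```

Smaller values of $\eta$ make the soft-max closer to a plain average over actions, which corresponds to smoother, more conservative policy updates.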

Path planning via DPP
In order to use the inferred local path within DPP, it is taken to serve as the baseline reward (Eq. 5). Cells along the path receive increasingly positive reward, whereas cells near obstacles perceived by the laser scanner receive an increasingly negative reward. DPP then optimizes its policy grid-world with smooth updates giving consideration to the baseline policy, and the resultant path drawn from the final policy is then followed. An example of path planning can be seen in Fig. 9. Given initial action preferences $\Psi_0(\cdot, \cdot)$, DPP parameters $\eta$ and $\gamma$, and the number of iterations $K$, the process of the path generator is summarised in Algorithm 1.
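A compact grid-world version of this planning loop is sketched below. It is an illustration under simplifying assumptions, not a transcription of Algorithm 1: transitions are deterministic, moves that would leave the window keep the platform in place, the reward depends only on the cell entered, the initial action preferences are zero, and the path is read out greedily from the final preferences. The default parameter values and all names are illustrative.

```python
import math

# 8 neighbouring moves plus remaining in place, as (row, col) offsets.
ACTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def softmax_op(prefs, eta):
    """Boltzmann soft-max operator over one state's action preferences."""
    m = max(prefs)  # shift for numerical stability
    w = [math.exp(eta * (p - m)) for p in prefs]
    return sum(wi * p for wi, p in zip(w, prefs)) / sum(w)


def dpp_plan(reward, start, eta=0.001, gamma=0.99, iters=100, max_steps=200):
    """Run DPP updates on a reward grid and roll out a greedy path.

    reward: 2D list, reward[r][c] received on entering cell (r, c).
    start: (row, col) of the PMD in the grid.
    """
    rows, cols = len(reward), len(reward[0])

    def step(r, c, a):
        nr, nc = r + a[0], c + a[1]
        # Assumption: moves leaving the window leave the state unchanged.
        return (nr, nc) if 0 <= nr < rows and 0 <= nc < cols else (r, c)

    psi = [[[0.0] * 9 for _ in range(cols)] for _ in range(rows)]
    for _ in range(iters):
        m = [[softmax_op(psi[r][c], eta) for c in range(cols)]
             for r in range(rows)]
        new = [[[0.0] * 9 for _ in range(cols)] for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                for i, a in enumerate(ACTIONS):
                    nr, nc = step(r, c, a)
                    # Psi_{k+1} = Psi_k - M Psi_k(s) + r + gamma * M Psi_k(s')
                    new[r][c][i] = (psi[r][c][i] - m[r][c]
                                    + reward[nr][nc] + gamma * m[nr][nc])
        psi = new

    # Greedy rollout: follow the most preferred action until it stays put.
    path = [start]
    r, c = start
    for _ in range(max_steps):
        i = max(range(9), key=lambda k: psi[r][c][k])
        nxt = step(r, c, ACTIONS[i])
        if nxt == (r, c):
            break
        r, c = nxt
        path.append(nxt)
    return path
```

For example, on a one-row corridor whose only reward sits in the last cell, the rollout from the first cell walks straight to that cell and stops there.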

User path compliance
Path tracking via Pure Pursuit (Coulter 1990) commences once a local path has been transformed into global co-ordinates. This controller was chosen primarily for its simplicity and speed, although it can be readily substituted by others such as DWA (Fox et al. 1997). The magnitude of joystick displacement scales the output linear and angular velocities, allowing the user to always control the rate of PMD motion. Pure Pursuit derives linear and angular velocities from a control point located a fixed lookahead distance (0.8 m for the UTS wheelchair) further ahead along the path. Given this point is always within the PMD's reference frame, it serves as a suitable objective to gauge the relevance of the remainder of the current tracking path against.
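The tracking step can be illustrated with the standard Pure Pursuit curvature law. This is a generic sketch rather than the controller used on the UTS wheelchair: the path is assumed to be expressed in the PMD frame (x forward, y left), `v_max` is a hypothetical speed limit, and the joystick magnitude is assumed to lie in [0, 1].

```python
import math


def pure_pursuit(path, lookahead=0.8, v_max=1.0, joystick_mag=1.0):
    """Compute (linear velocity, angular velocity, control point).

    path: list of (x, y) points in the PMD frame, ordered along the path.
    lookahead: lookahead distance in metres (0.8 m as quoted in the text).
    v_max: hypothetical maximum linear speed in m/s.
    joystick_mag: joystick displacement magnitude, clamped to [0, 1].
    """
    # Control point: first path point at or beyond the lookahead distance,
    # falling back to the end of the path when the path is shorter.
    target = path[-1]
    for p in path:
        if math.hypot(p[0], p[1]) >= lookahead:
            target = p
            break
    x, y = target
    d2 = x * x + y * y
    # Curvature of the arc through the origin and the control point.
    curvature = 2.0 * y / d2 if d2 > 0.0 else 0.0
    # The user's joystick displacement scales the commanded speed.
    v = v_max * max(0.0, min(1.0, joystick_mag))
    return v, v * curvature, target
```

Because both velocities are multiplied by the joystick magnitude, releasing the joystick stops the platform regardless of the planned path.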
In a similar manner to the construction of the intention estimation model, a compliance model is built for potential control points within the lookahead distance while only considering joystick input, resulting in a compliance estimate in the range (0, 1) given a control point and $a^*$. Either reaching the end of the path or insufficient compliance results in a new path request.

Experiments
This section details the experiments undertaken in further evaluating the assistive framework in both simulation and on the UTS wheelchair. Ethical approval was granted prior to experimentation, as was consent from all participants.
Training data was provided by an able expert in a Stage (wiki.ros.org/stage) simulation of the UTS wheelchair driving inside the home environment depicted in Fig. 10. The driver was tasked with navigating throughout the entirety of the environment's accessible space while maintaining safe distances to obstacles and walls. PMD odometry, user input and simulated laser scanner data were logged at 10 Hz.
Experimentation was conducted on an Intel i7 Ubuntu laptop with 16 GB RAM and an Nvidia 980M GPU. An average intention inference and path generation cycle took 0.46 ± 0.06 s, excluding minor latencies arising from MATLAB/ROS (www.ros.org) communications. For path planning a reward of −100 was allocated to obstacle cells. Cells near obstacles received a negative reward from Gaussian blurring of the obstacle reward map with a standard deviation of 2 cells, with all other cells initially set to 0. Cells along the path were then allocated a reward increasing linearly from 10 to 100 from beginning to end. The DPP learning parameters $\gamma$, $\eta$ and $K$ were set to 0.99, 0.001 and 100 respectively. The compliance threshold was set to 0.9.
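The reward surface described here can be sketched in a few functions. The numerical values come from the text; everything else is an assumption made for illustration: the blur uses a kernel truncated at three standard deviations with zero padding, obstacle cells are reset to the full penalty after blurring, and path rewards overwrite the blurred values except on occupied cells.

```python
import math


def gaussian_kernel(sd):
    """Normalised 1D Gaussian kernel truncated at 3 standard deviations."""
    radius = int(3 * sd)
    k = [math.exp(-(i * i) / (2.0 * sd * sd)) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]


def blur(grid, sd):
    """Separable Gaussian blur with zero padding outside the grid."""
    k = gaussian_kernel(sd)
    rad = len(k) // 2
    rows, cols = len(grid), len(grid[0])
    tmp = [[sum(k[j + rad] * grid[r][c + j]
                for j in range(-rad, rad + 1) if 0 <= c + j < cols)
            for c in range(cols)] for r in range(rows)]
    return [[sum(k[j + rad] * tmp[r + j][c]
                 for j in range(-rad, rad + 1) if 0 <= r + j < rows)
             for c in range(cols)] for r in range(rows)]


def build_reward(occupied, path, obstacle_reward=-100.0, sd=2.0,
                 path_lo=10.0, path_hi=100.0):
    """Combine obstacle penalties and path rewards into one reward grid.

    occupied: 2D list of booleans from the local occupancy map.
    path: list of (row, col) cells along the inferred local path.
    """
    rows, cols = len(occupied), len(occupied[0])
    obst = [[obstacle_reward if occupied[r][c] else 0.0 for c in range(cols)]
            for r in range(rows)]
    reward = blur(obst, sd)
    for r in range(rows):
        for c in range(cols):
            if occupied[r][c]:
                reward[r][c] = obstacle_reward  # keep the full penalty
    n = len(path)
    for i, (r, c) in enumerate(path):
        if occupied[r][c]:
            continue  # assumption: never reward an occupied cell
        frac = i / (n - 1) if n > 1 else 1.0
        reward[r][c] = path_lo + frac * (path_hi - path_lo)
    return reward
```

The output grid can be passed directly to a grid-world planner as its per-cell reward.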

Comparison of simulation to real PMD
Five able users drove around a 4 m by 2 m figure-eight in both a computer simulation and on the real PMD within the UTS Data Arena (Fig. 11), an augmented reality/motion capture facility capable of real-time millimeter accuracy 6-DOF marker tracking which serves as a localization ground truth. The UTS wheelchair (schematic in Fig. 12) is an Invacare Roller M1 fitted with a Hokuyo UTM-30LX laser scanner, shaft-mounted wheel encoders, and an Arduino to both read and write joystick signals. For additional information about modelling of the PMD's control, see Poon and Miro (2015). The resultant paths are shown in Fig. 13. The simulation GUI measured 26 × 13 cm, and the users were free to adjust the computer's position before attempting the task. Driving metrics from these experiments are presented in Table 1. The closeness in task completion time and input magnitude, and the low path deviation, indicate a strong similarity in both user performance and task relevance. A similar level of steering entropy and angular jerk was also observed, representing a similar level of task effectiveness between both experiments. Given the similarities in the results from both experiments the authors put forward that the simulation is a viable evaluation setting for the wheelchair in terms of human factors. With other phenomena such as wheel slippage across varying ground surfaces beyond the scope of such evaluations, it can be seen that users perform similarly in terms of task completion and efficiency.

Simulation experiment
The assistive framework is first assessed in simulation with 10 able-bodied test users, utilizing a course (Fig. 14) seen in Vanhooydonck et al. (2010). This course differs significantly from the home environment from which training data was gathered for model building purposes. PMD users suffering from limb paresis or other motor skill losses may only be able to provide rough indications (Vanhooydonck et al. 2010) of desired direction. To imitate this coarseness in able users, their joystick input signals were discretized amongst 5 evenly spaced joystick orientations. Each test user drove around the test course twice with such hampered input signals: first without assistance, then with the assistive framework in operation.
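The input coarsening can be sketched as snapping the joystick direction to a small set of headings while preserving its magnitude. The text does not state which range the 5 orientations cover, so this example assumes they are spread evenly across the forward half-plane (−90° to 90°); the function name, axis convention and `span` parameter are hypothetical.

```python
import math


def discretize_joystick(x, y, n_dirs=5, span=math.pi):
    """Snap a joystick vector to one of n_dirs evenly spaced orientations.

    x: forward axis, y: leftward axis (assumed convention).
    span: angular range covered by the orientations, centred on forward.
    Inputs pointing outside the span are clamped to its nearest edge.
    """
    mag = math.hypot(x, y)
    if mag == 0.0:
        return 0.0, 0.0
    ang = math.atan2(y, x)
    step = span / (n_dirs - 1)
    lo = -span / 2.0
    idx = round((ang - lo) / step)
    idx = max(0, min(n_dirs - 1, idx))
    snapped = lo + idx * step
    return mag * math.cos(snapped), mag * math.sin(snapped)
```

With the default settings the permitted headings are −90°, −45°, 0°, 45° and 90°, so a slightly off-axis forward push is reported as exactly forward.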

Real experiment
A 63-year-old female volunteered to evaluate the framework on the UTS wheelchair. Due to complications from back injuries, she is unable to walk without heavy reliance upon dual walking canes and has been considering PMD prescription. Her disabilities in conjunction with relative inexperience pose a significant challenge in maneuvering the PMD safely.
Experimentation took place on the publicly accessible campus of a college neighboring UTS with which the volunteer was already familiar, following a 10-minute acclimation period after which the test user deemed herself sufficiently confident. As shown in Fig. 15, the route featured several doorways and narrow corridors, as well as large open areas. Foot traffic at the time was sufficiently sparse that no moving entities had an effect on the user or PMD behavior; this is reasonable given both the tendency of pedestrians to avoid PMDs, and the framework's significantly truncated operating envelope.

Results
For quantitative comparison, all 20 simulated runs were ended upon re-entering the starting 'diamond'. From Urdiales et al. (2013) several task metrics are evaluated (Table 2) that do not consider deviation from some optimal route, to uphold the belief that users should not be penalized for particular driving preferences such as keeping to the right of corridors.
The same metrics extracted from the real experiment follow in Table 3. From the simulation experiments, the angular jerk decreased slightly with the assistive framework; the paths taken by the PMD also appear smoother, most noticeably in the slalom portions on the left sides of Fig. 14a, b. As the real test course largely consisted of straight corridors, the improvement in platform jerk was less noticeable in this experiment than in the simulated experiments. The distances travelled in both experiments are similar with or without assistance; however, the course completion times taken by the users in simulation were noticeably shorter with assistance, with a reduction of over 10%. The difference in completion times is far larger in the real experiment due to several factors. Firstly, the user lost speed scraping along the walls on several occasions, which also resulted in several stops for recovery. By contrast, such contact is not handled by the simulator and no hindrance was incurred as a result. Secondly, when unassisted the user drove at an average linear velocity of 0.6 m/s when moving freely compared to 0.85 m/s while assisted, whereas less disparate average linear velocities of 0.59 ± 0.07 and 0.67 ± 0.07 m/s were observed in the simulation experiments. There were also no collisions in either the simulation or the real experiments when users drove with assistance, whereas without assistance several collision events occurred.

Steering entropy
Introduced in Nakayama et al. (1999) to assess driver workload, steering entropy has also been taken as a measure of task effectiveness. For each input steering angle $u_t$ in a time series, an input prediction error $e_t = u_t - \hat{u}_t$ is taken as the difference between $u_t$ and a second-order Taylor approximation $\hat{u}_t$ extrapolated from the preceding inputs. A frequency distribution $P$ of all $e$ is then discretized into 9 bins (Nakayama et al. 1999), of which the Shannon entropy $H$ is taken as the steering entropy of the user over the entire time series. The steering entropies observed for the able users with simulated disabilities and for the disabled volunteer are very similar at approximately 0.57 (Tables 2, 3), indicating that the simulated disability provided a level of task impedance comparable to the disabled volunteer's various health issues.
However, this does not imply that her disabilities were accurately replicated; rather, that the able users experienced a similar difficulty in providing task-effective inputs.
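The steering entropy computation can be sketched as below. The second-order extrapolation, the bin edges at ±0.5α, ±α, ±2.5α and ±5α, and the base-9 logarithm follow the usual presentation of Nakayama et al. (1999) as recalled here; whether this paper used exactly these edges is an assumption, and `alpha` (the bin scale, conventionally the 90th percentile of baseline prediction errors) must be supplied by the caller.

```python
import math


def steering_entropy(u, alpha):
    """Steering entropy of a steering-angle time series.

    u: list of steering angles sampled at a fixed rate.
    alpha: bin scale for the prediction errors (assumed to be given).
    Returns a value in [0, 1] thanks to the base-9 logarithm.
    """
    errors = []
    for t in range(3, len(u)):
        d1 = u[t - 1] - u[t - 2]
        d2 = u[t - 2] - u[t - 3]
        # Second-order Taylor extrapolation from the three previous samples.
        predicted = u[t - 1] + d1 + 0.5 * (d1 - d2)
        errors.append(u[t] - predicted)
    if not errors:
        return 0.0
    edges = [-5 * alpha, -2.5 * alpha, -alpha, -0.5 * alpha,
             0.5 * alpha, alpha, 2.5 * alpha, 5 * alpha]
    counts = [0] * 9
    for e in errors:
        # Bin index is the number of edges lying below the error.
        counts[sum(1 for b in edges if e > b)] += 1
    n = len(errors)
    return sum(-(c / n) * math.log(c / n, 9) for c in counts if c > 0)
```

Perfectly predictable steering, such as a constant angle or a steady ramp, places every error in the central bin and yields an entropy of zero.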

Assistive framework metrics
For evaluating predicted paths only an initial segment truncated by the path's first non-compliant point is considered: given a path point $p$ generated from time-step $t$, PMD positions $X^*_{1:T}$ and user inputs $Y^*_{1:T}$, non-compliance is defined as the compliance estimator falling below the experimental threshold for $(p', Y^*_{t:T})$, where $p'$ is $p$ expressed with respect to $X^*_{t:T}$. Table 4 documents several metrics directly concerning predictive performance from both simulated (14,664 samples) and real (2517 samples) experiments. Paths can be evaluated in isolation due to the lack of reliance upon past or expected future data.
Although the observed percentage-wise path compliance and utilisation of each path is arguably equal to or greater than in the work by Huntemann et al. (2007), it is worth noting that this does not necessarily translate into superior overall performance given that both works can always provide new paths at a given instant. The RMSE between compliant path points and PMD positions indicates reliable performance of Pure Pursuit.

User impressions
When asked about her experience, the disabled volunteer stated that she received the impression that the PMD was "pulling" towards empty spaces, most noticeably in restrictive areas, and felt more confident knowing her inputs were buffered. The authors were also informed that the integrated joystick was awkwardly positioned, and that she felt it negatively impacted her performance. This has been previously documented in Esquenazi et al. (2016), and the authors are currently exploring alternatives.

Evaluation of generalization in path prediction
In order to evaluate alignment between predicted paths and natural human behaviour, the framework was run offline (Fig. 16) for all 10,948 driving samples from Matsubara et al. (2015) without retraining from the environment simulated in Fig. 10. These samples were collected from several able users freely driving the UTS wheelchair to various long-term destinations. For each path point planned over the course of a single driving sequence, an error is recorded as the distance from the path point to the nearest recorded PMD position in the sequence. In total 132,917 path points were generated, with a mean error of 0.068 ± 0.081 m. With a null hypothesis of an accuracy of 0.1 m, p < 0.01 is obtained. The value of 0.1 m, equivalent to the width of two cells in the path planning grid-world, is taken here as it is the average path tracking error encountered during the real experiment. This result indicates that the framework is capable of capturing user intentions, and can also produce natural human-like paths under conditions not seen in training data.
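The error measure used in this evaluation can be sketched directly from its definition. The hypothesis test is not specified in the text, so the one-sample t statistic below is an assumed choice shown only to illustrate how the 0.1 m reference enters the comparison; function names are hypothetical.

```python
import math
import statistics


def path_point_errors(path_points, positions):
    """Distance from each planned path point to the nearest recorded position.

    path_points: list of (x, y) planned points in the global frame.
    positions: list of (x, y) PMD positions recorded over the sequence.
    """
    return [min(math.hypot(px - x, py - y) for x, y in positions)
            for px, py in path_points]


def one_sample_t(errors, mu0=0.1):
    """t statistic for the mean error against a reference value mu0.

    Assumption: a one-sample t-test; the paper does not name its test.
    A strongly negative value indicates a mean error below mu0.
    """
    n = len(errors)
    mean = statistics.mean(errors)
    sd = statistics.stdev(errors)
    return (mean - mu0) / (sd / math.sqrt(n))
```

A brute-force nearest-position search is used for clarity; with over a hundred thousand path points a spatial index would be the more practical choice.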

Conclusions
Besides their amenability to being modelled upon demonstration data, a significant advantage of actively assistive mobility systems is the ability to naturally encompass maneuvers which would fail under the strict collision avoidance protocols that form a core component of passively assistive systems. This paper presents a framework allowing for the utilisation of expert demonstration data to serve as the basis of short-term intention estimation and path planning behaviors for users of sensor-equipped powered mobility aids, in order to provide active navigational assistance in areas that have not been explored and mapped a-priori.
Testing in environments not explored in training data, both by a disabled volunteer on the UTS semi-autonomous wheelchair and in simulation by able users subject to a comparable level of impedance, reveals strong correlations between desired wheelchair behavior and predicted paths. Although trials with disabled individuals were limited to a single user, the testing with a typical end-user suited for PMD prescription demonstrates how intelligent PMD assistance can offer significant benefits.
Avenues for further investigation include the application of deep learning methodologies towards both intention estimation and path planning as these techniques have been proven for building complex models given adequate data, as well as further experimentation with a greater test user base.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix: nomenclature
See Tables 5 and 6.

Table 6

| Symbol | Description |
| --- | --- |
| $T^a_{ss'}$ | Transition likelihood from state $s$ to $s'$ via action $a$ |
| $R(s, s', a)$ | Reward obtained by transitioning from $s$ to $s'$ via $a$ |
| $\gamma$ | Discount factor $\in (0, 1)$ |
| $\pi(a \mid s)$ | Policy representing probability of $a$ being performed at $s$ |
| $V_\pi(s)$ | Expected reward returned from following $\pi$ starting at $s$ |
| $K$ | Number of iterations |
| $\eta$ | Parameter $\in (0, 1)$ controlling policy update smoothness |
| $\Psi_k(s, a)$ | Action preference function at iteration $k$ |
| $M_\eta \Psi_k(s)$ | Boltzmann softmax operator of $\Psi_k$ at $s$ for all $a \in A$ |