1 Introduction

Mobile health (mHealth) applications deliver treatments in users’ everyday lives to support healthy behaviors. These mHealth applications offer an opportunity to impact health across a diverse range of domains, from substance use (Rabbi et al. 2017), to disease self-management (Hamine et al. 2015), to physical inactivity (Consolvo et al. 2008). For example, to help users increase their physical activity, an mHealth application might send walking suggestions at the times and in the contexts (e.g., current location or recent physical activity) when a user is likely to be able to pursue the suggestions. A goal of mHealth applications is to provide treatments in contexts in which users need support while avoiding over-treatment. Over-treatment can lead to user disengagement (Nahum-Shani et al. 2017); for example, users might ignore treatments or even delete the application. Consequently, the goal is to be able to learn an optimal policy for when and how to intervene for each user and context without over-treating.

Contextual bandit algorithms appear ideal for this task. Contextual bandit algorithms have been successful in a range of application settings from news recommendations (Li et al. 2010) to education (Qi et al. 2018). However, as we discuss below, many challenges remain to adapt contextual bandit algorithms for mHealth settings. Thompson sampling offers an attractive framework for addressing these challenges. In their seminal work (Agrawal and Goyal 2013), Agrawal and Goyal show that Thompson sampling for contextual bandits, which works well in practice, can also achieve strong theoretical guarantees. In our work, we propose a Thompson sampling contextual bandit algorithm that introduces a mixed effects structure for the weights on the feature vector, an algorithm we call IntelligentPooling. We demonstrate empirically that IntelligentPooling has many advantages. We also derive a high-probability regret bound for our approach which achieves similar regret to (Agrawal and Goyal 2013). Unlike (Agrawal and Goyal 2013), our regret bound depends on the variance components introduced by the mixed effects structure which is at the center of our approach.

1.1 Challenges

There are significant challenges to learning optimal policies in mHealth. This work primarily addresses the challenge of learning personalized user policies from limited data. Contextual bandit algorithms can be viewed as algorithms that use the user’s context to adapt treatment. While this approach can have advantages compared to ignoring the user’s context, it fails to address that users can respond differentially to treatments even when they appear to be in the same context. This occurs since sensors on smart devices are unlikely to record all aspects of a user’s context that affect their health behaviors. For example, the context may not include social constraints on the user (e.g., care-giving responsibilities), which may influence the user’s ability to be active. Thus, algorithms that can learn from the differential responsiveness to treatment are desirable. This motivates the need for an algorithm that not only incorporates contextual information, but that can also learn personalized policies. A natural first approach would be to use the algorithm separately for each user, but the algorithm is likely to learn very slowly if data on a user is sparse and/or noisy. However, typically in mHealth studies multiple users are using the application at any given time. Thus an algorithm that pools data over users intelligently so as to speed up learning of personalized policies is desirable.

An additional challenge is non-stationary responses to treatment (i.e., a non-stationary reward function). For example, in the beginning of a study, a user might be excited to receive a treatment; however, after a few weeks this excitement can wane. This motivates the need for algorithms that can learn time-varying treatment policies.

1.2 Contributions

We develop IntelligentPooling, a type of Thompson sampling contextual bandit algorithm specifically designed to overcome the above challenges. Our main contributions are:

  • IntelligentPooling: A Thompson sampling contextual bandit algorithm for rapid personalization in limited data settings. This algorithm employs classical random effects in the reward function (Raudenbush and Bryk 2002; Laird and Ware 1982) and empirical Bayes (Morris 1983; Casella 1985) to adaptively adjust the degree to which policies are personalized to each user. We present an analysis of this adaptivity in Sect. 3.5 showing that IntelligentPooling can learn to personalize to a user as a function of the observed variance in the treatment effect both between and within users.

  • A high probability regret bound for IntelligentPooling.

  • An empirical evaluation of IntelligentPooling in a simulation environment constructed from mHealth data. IntelligentPooling not only achieves 26% lower regret than state-of-the-art approaches, it also is better able to adapt to the degree of heterogeneity present in a population than this approach.

  • Feasibility of IntelligentPooling from a pilot study in a live clinical trial. We demonstrate that IntelligentPooling can be executed in a real-time online environment and show preliminary evidence of this method’s effectiveness.

  • We show how to modify IntelligentPooling to learn in non-stationary environments.

Next, in Sect. 2 we discuss relevant related work. In Sect. 3 we present IntelligentPooling and provide a high-probability regret bound for this algorithm. We then describe how we use historical data to construct a simulation environment and evaluate our approach against state-of-the-art in Sect. 4. Next, in Sect. 5 we introduce the feasibility study and provide preliminary evidence into the benefits of this approach. We then discuss how to extend this work to include time-varying effects in Sect. 6. Finally, we discuss the limitations with our approach in Sect. 7 before concluding.

2 Related work

To put the proposed work in a broader healthcare perspective, an overview of similar work in mHealth is provided by Sect. 2.1. Next, we discuss the extent to which reinforcement learning/bandit algorithms have been deployed in mHealth settings (Sect. 2.1). IntelligentPooling has similarities with several modeling approaches; here we discuss the most relevant: multi-task learning, meta-learning, Gaussian processes for Thompson Sampling contextual bandits, and time-delayed bandits. These topics are discussed in Sects. 2.2-2.4.

2.1 Connections to Bandit algorithms in mHealth

Bandit algorithms in mHealth have typically used one of two approaches. The first approach is person specific, that is, an algorithm is deployed separately on each user, such as in Rabbi et al. 2015; Jaimes et al. 2016; Forman et al. 2018 and Liao et al. 2020. This approach makes sense when users are highly heterogeneous, that is, their optimal policies differ greatly one from another. However, this approach can present challenges for policy learning when data is scarce and/or noisy, as in our motivating example of encouraging activity in an mHealth study where only a few decision time-points occur each day (see (Xia 2018) for an empirical evaluation of the shortcomings of Thompson sampling for personalized contextual bandits in mHealth settings). The second approach completely pools users’ data, that is one algorithm is used on all users so as to learn a common treatment policy both in bandit algorithms (Paredes et al. 2014; Yom-Tov et al. 2017), and in full reinforcement learning algorithms (Clarke et al. 2017; Zhou et al. 2018). This second approach can potentially learn quickly but may result in poor performance if there is large heterogeneity between users. We compare to these two approaches empirically as they not only represent state-of-the-art in practice, they also represent two intuitive theoretical extremes.

In IntelligentPooling we strike a balance between these two extremes, adjusting the degree of pooling to the degree that users are similarly responsive. When users are heterogeneous, IntelligentPooling achieves lower regret than the second approach while learning more quickly than the first approach. When users are homogeneous our method performs as well as the second approach.

2.2 Connections to multi-task learning and meta-learning

Following original work on non-pooled linear contextual bandits (Agrawal and Goyal 2013), researchers have proposed pooling data in a variety of ways. For example, Deshmukh et al. (2017) proposed pooling data from different arms of a single bandit problem. Li and Kar 2015 used context-sensitive clustering to produce aggregate reward estimates for the bandit algorithm. More relevant to this work is multi-task Gaussian Process (GP), e.g., Lawrence and Platt 2004; Bonilla et al. 2008; Wang and Khardon 2012, however these have been proposed in the prediction as opposed to the reinforcement learning setting. The Gang of Bandits approach (Cesa-Bianchi et al. 2013), which is a generalization from the original LinUCB algorithm for a single task (Li et al. 2010), has been shown to be successful when there is prior knowledge on the similarities between users. For example, a known social network graph might provide a mechanism for pooling. It was later extended to the Horde of Bandits in (Vaswani et al. 2017) which used Thompson Sampling, allowing the algorithm to deal with a large number of tasks.

Each of the multi-task approaches introduces some concept of similarity between users. The extent to which a given user’s data contributes to another user’s policy is some function of this similarity measure. This is fundamentally different from the approach taken in IntelligentPooling. Rather than determining the extent to which any two users are similar, IntelligentPooling determines the extent to which a given user’s reward function parameters differ from parameters in a population (average over all users) reward function. This approach has the advantage of requiring fewer hyper-parameters, as we do not need to learn a similarity function between users. Instead of a pairwise similarity function it is as if we are learning a similarity between each user and the population average. In the limited data setting, we expect this simpler model to be advantageous.

In meta-learning, one exploits shared structure across tasks to improve performance on new tasks. IntelligentPooling thus shares similarities with meta-learning for reinforcement learning (Nagabandi et al. 2018; Finn et al. 2019; Finn et al. 2018; Zintgraf et al. 2019; Gupta et al. 2018; Sæmundsson et al. 2018). At a high level, one can view our method as a form of meta-learning where the population-level parameters are learned from all available data and each user’s parameters represent deviations from the shared parameters. However, while meta-learning might require a large collection of source tasks, we demonstrate the efficacy of our approach on data on the small scale found in clinical mHealth studies.

2.3 Connections to Gaussian process models for Thompson sampling contextual bandits

IntelligentPooling is based on a Bayesian mixed effects model of the reward, which is similar to using a Gaussian Process (GP) model with a simple form of the kernel. GP models have been used for multi-armed bandits (Chowdhury and Gopalan 2017; Brochu et al. 2010; Srinivas et al. 2009; Desautels et al. 2014; Wang et al. 2016; Djolonga et al. 2013; Bogunovic et al. 2016), and for contextual bandits (Li et al. 2010; Krause and Ong 2011). However, the above approaches do not structure the way in which the pooling of data across users occurs. IntelligentPooling uses a mixed effects GP model to pool across users in a structured manner. Although mixed effects GP models have been previously used for off-line data analysis (Shi et al. 2012; Luo et al. 2018), to the best of our knowledge they have not been previously used in the online decision making setting considered in this work.

2.4 Connection to non-stationary linear bandits

There is a growing literature investigating how to adapt linear bandit algorithms to changing environments. A common approach is for the learning algorithm to differentially weight data across time. Differential weighting is used by both Russac et al. 2019 (using a LinUCB algorithm) and Kim and Tewari 2019 (using perturbation-based algorithms). Cheung et al. 2018 use a sliding window to estimate the parameters in the reward function, and Zhao et al. 2020 restart the algorithm at regular intervals, discarding the prior data. Similarly, Bogunovic et al. 2016, using GP-based UCB algorithms, accommodate non-stationarity by both restarting and using an autoregressive model for the reward function. Kim and Tewari 2020 analyze the non-stationary setting with randomized exploration. Wu et al. (https://dl.acm.org/doi/pdf/10.1145/3209978.3210051) introduce a model which detects abrupt changes over time.

IntelligentPooling allows for non-stationary reward functions by the use of time-varying random effects. The correlation between the time-varying random effects induces a weighted estimator whereby more weight is put on the recently collected samples, similar to the discounted estimators in Russac et al. 2019 and Kim and Tewari 2019. In contrast to existing approaches, IntelligentPooling considers both individual and time-specific variation.

3 Intelligent Pooling

IntelligentPooling is a generalization of a Thompson sampling contextual bandit for learning personalized treatment policies. We first outline the components of IntelligentPooling and then introduce the problem definition in Sect. 3.2. As our approach offers a natural alternative to two commonly used approaches, we begin by describing these simpler methods in Sect. 3.3. We introduce our method in Sect. 3.4.

3.1 Overview

The central component of IntelligentPooling is a Bayesian model for the reward function. In particular, IntelligentPooling uses a Gaussian mixed effects linear model for the reward function. Mixed effects models are widely used across the health and behavioral sciences to model the variation in the linear model parameters across users (Raudenbush and Bryk 2002; Laird and Ware 1982) and within a user across time. Use of these models enhances the ability of domain scientists to inform and critique the model used in IntelligentPooling. The properties and pitfalls of these models are well understood; see (Qian et al. 2019) for an application of a mixed effects model in mHealth. IntelligentPooling uses Bayesian inference for the mixed effects model. As discussed in Sect. 2.3, a Bayesian mixed effects linear model is a GP model with a simple kernel. This facilitates increasing the flexibility of the model for the reward function, given sufficient data.

Furthermore, IntelligentPooling uses Thompson sampling (Thompson 1933), also known as posterior sampling (Russo and Van Roy 2014), to select actions. At each decision point, the parameters in the model for the reward function are sampled from their posterior distribution, thus inducing exploration over the action space (Russo et al. 2018). These sampled parameters are then used to form an estimated reward function and the action with the highest estimated reward is selected.

The hyper-parameters (e.g., the variance of the random effects) control the extent of pooling across users and across decision times. The right amount of pooling depends on the heterogeneity among users and the non-stationarity, which is often difficult to pre-specify. Unlike other bandit algorithms in which the hyper-parameters are set at the beginning (Deshmukh et al. 2017; Cesa-Bianchi et al. 2013; Vaswani et al. 2017), IntelligentPooling includes a procedure for updating the hyper-parameters online. In particular, empirical Bayes (Carlin and Louis 2010) is used to update the hyper-parameters in the online setting, as more data becomes available.

3.2 Problem formulation

Consider an mHealth study which will recruit a total of N users. Let \(i \in [N] = \{1, \dots , N\}\) be a user index. For each user, we use \(k \in \{1, 2, \dots \}\) to index decision times, i.e., times at which a treatment could be provided. Denote by \(S_{i, k}\) the states/contexts at the \(k^{th}\) decision time of user i. For simplicity, we focus on the case where the action is binary, i.e., \(A_{{i, k}} \in \{0,1\}\). The algorithm can be easily generalized to cases with more than two actions. After the action \(A_{i, k}\) is chosen, the reward \(R_{i, k}\) is observed. Throughout the remainder of the paper, S, A and R are random variables and we use lower-case (s, a and r) to refer to a realization of these random variables.

Below we consider a simpler setting where the parameters in the reward are assumed time-stationary. We discuss how to generalize the algorithm to the non-stationary setting in Sect. 6. The goal is to learn personalized treatment policies for each of the N users. We treat this as N contextual bandit problems as the reward function may differ between users. In mHealth settings this might occur due to the inability of sensors to record users’ entire contexts. Section 3.3 reviews two approaches for using Thompson Sampling (Agrawal and Goyal 2012) and Sect. 3.4 presents IntelligentPooling, our approach for learning the treatment policy for any specific user.

3.3 Two Thompson sampling instantiations

Fig. 1

Consider a setting with two users, here we show the relationship between select random variables in our model: \(R_{i,k}\) the reward for user i at decision time k, \(\sigma ^2_{\epsilon _{i, k}}\) the noise for user i at time k and \(\varvec{w}_{i}\) the latent weight vector for user i. In Person-Specific we see that each user’s parameters are independent. Only the prior parameter values are shared, all else is updated independently

First, consider learning the treatment policy separately per person. We refer to this approach as Person-Specific. At each decision time k, we would like to select a treatment \(A_{i, k} \in \{0,1\}\) based on the context \(S _{i, k}\). We model the reward \(R_{i, k}\) by a Bayesian linear regression model: for user i and time k

$$\begin{aligned} R_{i, k} = \phi (S_{i, k}, A_{i, k})^\top w_{i} + \epsilon _{i, k}, \end{aligned} \tag{1}$$

where \(\phi (s, a)\) is a pre-specified mapping from a context s and treatment a (e.g., those described in Sect. 4.2), \(w_i\) is a vector of weights which we will learn, and \(\epsilon _{i, k} \sim {\mathbf {N}}(0, \sigma _{\epsilon }^2)\) is the error term. The weight vectors \(\{w_i\}\) are assumed independent across users and to follow a common prior distribution \(w_i \sim {\mathbf {N}}(\mu _w, \varSigma _w)\). See Fig. 1 for a graphical representation of this approach.

Now at the \(k^{th}\) decision time with the context \(S_{i, k} = s\), Person-Specific selects the treatment \(A_{i, k} = 1\) with probability

$$\begin{aligned} \pi _{i, k} = \text {Pr} \{ \phi (s, 1) ^\top {{\tilde{w}}}_{i,k} > \phi (s, 0)^\top {{\tilde{w}}}_{i, k}\} \end{aligned} \tag{2}$$

where \({{\tilde{w}}}_{i, k}\) follows the posterior distribution of the parameters \(w_i\) in the model (1) given the user’s history up to the current decision time k. We emphasize that in this formulation the posterior distribution of \(w_i\) is formed based on each user’s own data.
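
The Person-Specific update and the selection probability in (2) can be sketched as follows. This is a minimal illustration, assuming a known noise variance \(\sigma _\epsilon ^2\) and the Gaussian prior above; the function names and the feature map `phi` are hypothetical and are not taken from the implementation used in this work. Because \({{\tilde{w}}}_{i,k}\) is Gaussian, the probability in (2) reduces to a normal CDF evaluated at the standardized posterior mean of the treatment effect.

```python
import math
import numpy as np

def phi(s, a):
    # Hypothetical feature map: baseline terms plus treatment interaction terms.
    return np.array([1.0, s, a, a * s])

def blr_posterior(Phi, r, mu_w, Sigma_w, sigma_eps2):
    """Posterior N(mean, cov) of w for r = Phi @ w + noise, with prior w ~ N(mu_w, Sigma_w)."""
    prior_prec = np.linalg.inv(Sigma_w)
    post_cov = np.linalg.inv(prior_prec + Phi.T @ Phi / sigma_eps2)
    post_mean = post_cov @ (prior_prec @ mu_w + Phi.T @ r / sigma_eps2)
    return post_mean, post_cov

def prob_treat(phi1, phi0, post_mean, post_cov):
    """Pr{phi1^T w > phi0^T w} for w ~ N(post_mean, post_cov)."""
    d = phi1 - phi0                     # treatment-effect direction
    m = float(d @ post_mean)            # posterior mean of the treatment effect
    v = float(d @ post_cov @ d)         # posterior variance of the treatment effect
    return 0.5 * (1.0 + math.erf(m / math.sqrt(2.0 * v)))
```

Before any data are observed, and with a zero prior mean, this rule treats with probability one half, so exploration is driven entirely by the prior.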

The opposite approach is to learn a common bandit model for all users. In this approach, the reward model is a single Bayesian regression model with no individual-level parameters:

$$\begin{aligned} R_{i, k}= \phi (S _{i, k},A_{i, k})^\top w+ \epsilon _{i, k}, \end{aligned} \tag{3}$$

where the common parameter vector \(w\) follows the prior distribution \(w\sim {\mathbf {N}}(\mu _w, \varSigma _w)\). See Fig. 2 for the graphical representation of this approach. We then use the posterior distribution of the weight vector \(w\) to sample treatments for each user. Here the posterior is calculated based on the available data from all users observed up to and including time k. This approach, which we refer to as Complete, may suffer from high bias when there is significant heterogeneity among users.

Fig. 2

Consider a setting with two users, here we show the relationship between select random variables in our model: \(R_{i,k}\) the reward for user i at decision time k, \(\epsilon _{k}\) the noise at time k and \(w_{pop}\) the latent weight vector. In Complete we see that each user’s parameters are the same. With each parameter update the weight vector for every user is also updated

3.4 Intelligent pooling across bandit problems

IntelligentPooling is an alternative to the two approaches mentioned above. Specifically, in IntelligentPooling data is pooled across users in an adaptive way, i.e., when there is strong homogeneity observed in the current data, the algorithm will pool more from others than when there is strong heterogeneity.

3.4.1 Model specification

We model the reward associated with taking action \(A_{i, k}\) for user i at decision time k by the linear model (1). Unlike Person-Specific where the person-specific weight vectors \(\{w_i, i \in [N]\}\) are assumed to be independent of each other, IntelligentPooling imposes structure on the \(w_{i}\)’s, in particular, a random-effects structure (Raudenbush and Bryk 2002; Laird and Ware 1982):

$$\begin{aligned} w_{i} = w_{pop} + u_i, \end{aligned} \tag{4}$$

where \(w_{pop}\) is a population-level parameter and \(u_i\) is a random effect that represents the person-specific deviation from \(w_{pop}\) for user i. The extent to which the posterior means for \(w_{pop}\) and \( u_i\) are based on user i’s data relative to the population depends on the variances of the random effects (for a stylized example of this see Sect. 3.5). In Sect. 6 we show how we can modify this structure to include time-specific parameters, or a time-specific random effect. A graphical representation for IntelligentPooling is shown in Fig. 3.

We assume the prior on \(w_{pop}\) is Gaussian with prior mean \(\mu _{w}\) and variance \(\varSigma _w\). \(u_i\) is also assumed to be Gaussian with mean \({\mathbf {0}}\) and covariance \(\varSigma _u\). Furthermore, we assume that \(u_i\) and \(u_j\) are independent for \(i \ne j\), and that \(w_{pop}\) is independent of \(\{u_i\}\). The prior parameters \(\mu _{w}, \varSigma _w\) as well as the variance of the random effect \(\varSigma _u\), and the residual variance \(\sigma _\epsilon ^2\) are hyper-parameters. In (4), there is a random effect \(u_i\) on each element of \(w_i\). In practice, one can use domain knowledge to specify which of the parameters should include random effects; this will be the case in the feasibility study described in Sect. 5. Conditioned on the latent variables \((w_{pop}, u_i)\), as well as the current context and action, the expected reward is

$$\begin{aligned} E[R_{i,k}|w_{pop}, u_i, S_{i,k}=s, A_{i,k}=a]= \phi (s,a)^T (w_{pop}+ u_i). \end{aligned}$$
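
To make the generative structure concrete, the following sketch draws user weights from the random-effects model (4) and evaluates the conditional expected reward above. It is an illustration only; the function names are hypothetical, and the independence of \(w_{pop}\) and the \(u_i\) is assumed as stated in the model specification.

```python
import numpy as np

def sample_user_weights(n_users, mu_w, Sigma_w, Sigma_u, rng):
    """Draw w_pop ~ N(mu_w, Sigma_w) and w_i = w_pop + u_i with u_i ~ N(0, Sigma_u)."""
    w_pop = rng.multivariate_normal(mu_w, Sigma_w)
    u = rng.multivariate_normal(np.zeros(len(mu_w)), Sigma_u, size=n_users)
    return w_pop, w_pop + u              # second output has one row per user

def expected_reward(phi_sa, w_pop, u_i):
    """E[R | w_pop, u_i, s, a] = phi(s, a)^T (w_pop + u_i)."""
    return float(phi_sa @ (w_pop + u_i))
```

When \(\varSigma _u\) is the zero matrix every user shares the population weights, which corresponds to the Complete model; larger \(\varSigma _u\) spreads the user weights around \(w_{pop}\).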

3.4.2 Model connections to Gaussian processes

Under the Gaussian assumption on the distribution of the reward and prior, the Bayesian linear model of the reward (1) together with the random effect model (4) can be viewed as an example of a Gaussian Process with a special kernel (see Eq. 5). We use this connection to derive the posterior distribution and facilitate the hyper-parameter selection. An additional advantage of viewing the Bayesian mixed effects model as a Gaussian Process model is that we can now flexibly redesign our reward model simply by introducing new kernel functions. Here, we assume a linear model with person-specific random effects. In Sect. 6 we discuss a generalization to time-specific random effects. Additionally, one could adopt non-linear kernels and incorporate more complex structures on the reward function.

3.4.3 Posterior distribution of the weights on the feature vector

In the setting where both the prior and the linear model for the reward follow a Gaussian distribution, the posterior distribution of \(w_{i}\) follows a Gaussian distribution and there are analytic expressions for these updates, as shown in (Williams and Rasmussen 2006). Below we provide the explicit formula of the posterior distribution based on the connection to a Gaussian Process regression. Suppose at the time of updating the posterior distribution, the available data collected from all current users is \(\mathcal {D}\), where \(\mathcal {D}\) consists of n tuples of state, action, reward and user index \(x = (s, a, r, i)\). The mixed effects model (Eqs. 1 and 4) induces a kernel function K. For any two tuples in \(\mathcal {D}\), e.g., \(x_l = (s_{l}, a_{l}, r_{l}, i_l), l = 1, 2\)

$$\begin{aligned} K(x_1, x_2) = \phi (s_1, a_1)^\top (\varSigma _w+ 1_{\{i_1 = i_2\}} \varSigma _{u} ) \phi (s_2, a_2). \end{aligned} \tag{5}$$

Note that the above kernel depends on \(\varSigma _w\) and on \(\varSigma _u\), one of the hyper-parameters that will be updated using the empirical Bayes approach (see below). The kernel matrix \({\mathbf {K}}\) is of size \(n \times n\) and each element is the kernel value between two tuples in \(\mathcal {D}\). The posterior mean and variance of \(w_{i}\) given the currently available data \(\mathcal {D}\) can be calculated by

$$\begin{aligned} \begin{aligned} {\hat{w}}_{i}&= \mu _w+ M_{i}^\top ({\mathbf {K}} + \sigma _{\epsilon }^2 I_{n})^{-1} {{\tilde{R}}}_{n}\\ \varSigma _{i}&= \varSigma _w+ \varSigma _u - M_{i}^\top ({\mathbf {K}} + \sigma _{\epsilon }^2 I_{n})^{-1} M_{i} \end{aligned} \end{aligned}$$

where \({{\tilde{R}}}_{n}\) is the vector of the rewards centered by the prior means, i.e., each element corresponds to a tuple \((s, a, r, j)\) in \(\mathcal {D}\) and is given by \(r - \phi (s, a)^\top \mu _{w}\), and \(M_{i}\) is a matrix of size n by p (recall p is the length of \(w_i\)), with each row corresponding to a tuple \((s, a, r, j)\) in \(\mathcal {D}\) and given by \(\phi (s, a)^\top (\varSigma _w+ {1}_{\{j = i\}} \varSigma _{u} )\).
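
The posterior update above can be sketched directly from the kernel matrix. The code below is an illustrative implementation under the stated Gaussian assumptions; the function name and argument layout are hypothetical. It stacks the feature vectors \(\phi (s, a)\) of all tuples in \(\mathcal {D}\) as rows of `Phi`, and solves one linear system rather than forming the matrix inverse explicitly.

```python
import numpy as np

def pooled_posterior(Phi, r, users, i, mu_w, Sigma_w, Sigma_u, sigma_eps2):
    """Posterior mean and covariance of w_i = w_pop + u_i given all users' data.

    Phi: (n, p) rows phi(s, a); r: (n,) rewards; users: (n,) user index per tuple.
    """
    n = Phi.shape[0]
    same = (users[:, None] == users[None, :]).astype(float)       # 1{i_1 = i_2}
    K = Phi @ Sigma_w @ Phi.T + same * (Phi @ Sigma_u @ Phi.T)     # kernel matrix
    mine = (users == i).astype(float)[:, None]                     # 1{j = i}
    M = Phi @ Sigma_w + mine * (Phi @ Sigma_u)                     # (n, p) matrix M_i
    rhs = np.column_stack([r - Phi @ mu_w, M])                     # centered rewards and M_i
    sol = np.linalg.solve(K + sigma_eps2 * np.eye(n), rhs)
    w_hat = mu_w + M.T @ sol[:, 0]
    Sigma_i = Sigma_w + Sigma_u - M.T @ sol[:, 1:]
    return w_hat, Sigma_i
```

The same quantities can be obtained by a joint Bayesian linear regression on \((w_{pop}, u_1, \dots , u_N)\); the kernel form is convenient because its cost depends on the number of tuples n rather than on the number of users times p.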

Fig. 3

Consider a setting with two users, here we show the relationship between select random variables in our model: \(R_{i,k}\) the reward for user i at decision time k, \({\epsilon _{i,k}}\) the noise for user i at time k, \(w_{pop}\) the latent weight vector and \(u_i\) the random effect for user i. In IntelligentPooling we see that some parameters (\(w_{pop}\)) are shared across the population while others (\(u_i\)) are user specific

3.4.4 Treatment selection

To select a treatment for user i at the \(k^{th}\) decision time, we use the posterior distribution of \(w_{i}\) formed at the most recent update time T. That is, for the context \(S_{i, k}\) of user i at the \(k^{th}\) decision time, IntelligentPooling selects the treatment \(A_{i,k} = 1\) with the probability calculated in the same formula as in (2) but with a different posterior distribution as discussed above.

3.4.5 Setting hyper-parameter values

Recall that the algorithm requires the hyper-parameters \(\mu _{w}, \varSigma _w\), \(\varSigma _u\), and \(\sigma _\epsilon ^2\). The prior mean \(\mu _{w}\) and variance \(\varSigma _w\) of the population parameter \(w_{pop}\) can be set according to previous data or domain knowledge (see Sect. 5 for a discussion on how the prior distribution is set in the feasibility study). As we mention in Sect. 3.1, the variance components in the mixed effects model impact how the users pool the data from others (see Sect. 3.5 for a discussion) and might be difficult to pre-specify. IntelligentPooling uses, at the update times, the empirical Bayes (Carlin and Louis 2010) approach to choose/update \(\lambda = (\varSigma _u, \sigma _\epsilon ^2)\) based on the currently available data. To be more specific, suppose at the time of updating the hyper-parameters, the available data is \(\mathcal {D}\). We choose \(\lambda \) to maximize \(l(\lambda | \mathcal {D})\), the marginal log-likelihood of the observed reward, marginalized over the population parameters \(w_{pop}\) and the random effects \(u_i\). The marginal log-likelihood \( l(\lambda | \mathcal {D})\) can be expressed as

$$\begin{aligned} \begin{aligned} l(\lambda | \mathcal {D}) = -\frac{1}{2} \Big \{ {{\tilde{R}}}_{n}^\top [{\mathbf {K}}(\lambda )&+ \sigma _\epsilon ^2 I_{n}]^{-1} {{\tilde{R}}}_{n} + \text {log}\det [{\mathbf {K}}(\lambda ) + \sigma _\epsilon ^2 I_{n}] + n\,\text {log}(2\pi ) \Big \} \end{aligned} \end{aligned}$$

where \({\mathbf {K}}(\lambda )\) is the kernel matrix as a function of parameters \(\lambda = (\varSigma _{u}, \sigma _{\epsilon }^2)\). The above optimization can be efficiently solved using existing Gaussian Process regression packages; see Sect. 4.2 for more details.
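
As an illustration of the empirical Bayes step, the sketch below evaluates the marginal log-likelihood above and maximizes it by a plain grid search. The grid search is a stand-in chosen for brevity and is not the optimizer used in this work, which relies on Gaussian Process regression packages. The sketch further assumes \(\varSigma _u = \sigma _u^2 I\), and all function names are hypothetical.

```python
import numpy as np

def marginal_loglik(Phi, r, users, mu_w, Sigma_w, Sigma_u, sigma_eps2):
    """Marginal log-likelihood l(lambda | D) of the rewards under the mixed effects model."""
    n = Phi.shape[0]
    same = (users[:, None] == users[None, :]).astype(float)
    K = Phi @ Sigma_w @ Phi.T + same * (Phi @ Sigma_u @ Phi.T)
    C = K + sigma_eps2 * np.eye(n)
    r_tilde = r - Phi @ mu_w                       # rewards centered by the prior mean
    _, logdet = np.linalg.slogdet(C)
    quad = r_tilde @ np.linalg.solve(C, r_tilde)
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))

def empirical_bayes_grid(Phi, r, users, mu_w, Sigma_w, su2_grid, se2_grid):
    """Pick (sigma_u^2, sigma_eps^2) on a grid, assuming Sigma_u = sigma_u^2 * I."""
    p = Phi.shape[1]
    candidates = (
        (marginal_loglik(Phi, r, users, mu_w, Sigma_w, su2 * np.eye(p), se2), su2, se2)
        for su2 in su2_grid
        for se2 in se2_grid
    )
    best = max(candidates, key=lambda t: t[0])
    return best[1], best[2]
```

A gradient-based optimizer over a full \(\varSigma _u\) would replace the grid in practice; the grid version only makes the objective explicit.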


3.5 Intuition for the use of random effects

IntelligentPooling uses random effects to adaptively pool users’ data based on the degree to which users exhibit heterogeneous rewards. That is, the person-specific random effect should outweigh the population term if users are highly heterogeneous. If users are highly homogeneous, the person-specific random effect should be outweighed by the population term. The amount of pooling is controlled by the hyper-parameters, e.g., the variance components of the random effects.

To gain intuition, we consider a simple setting where the feature vector \(\phi \) in the reward model (Eq. 1) is one-dimensional (i.e., \(p =1\)) and there are only two users (i.e., \(i=1,2\)). Denote the prior distributions of population parameter \(w_{pop}\) by \({\mathbf {N}}(0, \sigma _{w}^2)\) and the random effect \(u_i\) by \({\mathbf {N}}(0, \sigma _u^2)\). Below we investigate how the hyper-parameter (e.g., \(\sigma _u^2\) in this simple case) impacts the posterior distribution.

Fig. 4

The posterior mean of \(w_1\), \({\hat{w}}_{1}\). As the variance of random effect \(\sigma _u^2\) decreases, \(\gamma \) increases and the posterior mean approaches the population-informed estimation (Complete) and departs from the person-specific estimation (Person-Specific).

Let \(k_i\) be the number of decision times of user i at an updating time. In this simple setting, the posterior mean \({\hat{w}}_{1}\) can be calculated explicitly:

$$\begin{aligned} {\hat{w}}_{1} = \frac{[\delta \gamma + (1-\gamma ^2) C_2] Y_1 + \delta \gamma ^2 Y_2}{(1-\gamma ^2)C_1 C_2 + \delta \gamma (C_1 + C_2) + (\delta \gamma )^2} \end{aligned}$$

where for \(i=1,2\), \(C_i = \sum _{k=1}^{k_i} \phi (A_{i, k}, S_{i, k})^2\), \(Y_i = \sum _{k=1}^{k_i} \phi (A_{i, k}, S_{i, k}) R_{i, k}\), \(\gamma = \sigma _w^2/(\sigma _w^2 + \sigma _u^2)\) and \(\delta = \sigma _\epsilon ^2/\sigma _w^2\). Similarly, the posterior mean of \(w_2\) is given by

$$\begin{aligned} {\hat{w}}_{2} = \frac{[\delta \gamma + (1-\gamma ^2) C_1] Y_2 + \delta \gamma ^2 Y_1}{(1-\gamma ^2)C_1 C_2 + \delta \gamma (C_1 + C_2) + (\delta \gamma )^2} \end{aligned}$$

When \(\sigma _u^2 \rightarrow 0\) (i.e., the variance of random effect goes to 0), we have \(\gamma \rightarrow 1\) and both posterior means (\({\hat{w}}_{1}, {\hat{w}}_{2}\)) approach the posterior mean under Complete (Eqn 3) using prior \({\mathbf {N}}(0, \sigma _w^2)\)

$$\begin{aligned} {\hat{w}}_{1}, {\hat{w}}_{2} \rightarrow \frac{Y_1 + Y_2}{C_1 + C_2 + \delta }. \end{aligned}$$

Alternatively, when \(\sigma _u^2 \rightarrow \infty \), we have \(\gamma \rightarrow 0\) and the posterior means (\({\hat{w}}_{1}, {\hat{w}}_{2}\)) each approach their respective posterior means under Person-Specific (Eqn 1) using a non-informative prior

$$\begin{aligned} {\hat{w}}_{1} \rightarrow \frac{Y_1}{C_1}, ~{\hat{w}}_{2}\rightarrow \frac{Y_2}{C_2}. \end{aligned}$$

Figure 4 illustrates that when \(\gamma \) goes from 0 to 1, the posterior mean \({\hat{w}}_{i}\) smoothly transitions from the person-specific estimates to the population estimates.
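
The two-user closed form and its limits can be checked numerically with a short sketch. The function below simply evaluates the expressions for \({\hat{w}}_{1}\) and \({\hat{w}}_{2}\) given above; its name and argument order are hypothetical.

```python
def pooled_mean_two_users(C1, C2, Y1, Y2, sigma_w2, sigma_u2, sigma_eps2):
    """Posterior means (w1_hat, w2_hat) in the one-dimensional, two-user setting."""
    gamma = sigma_w2 / (sigma_w2 + sigma_u2)      # share of prior variance that is common
    delta = sigma_eps2 / sigma_w2                 # noise-to-prior variance ratio
    denom = (1 - gamma**2) * C1 * C2 + delta * gamma * (C1 + C2) + (delta * gamma) ** 2
    w1 = ((delta * gamma + (1 - gamma**2) * C2) * Y1 + delta * gamma**2 * Y2) / denom
    w2 = ((delta * gamma + (1 - gamma**2) * C1) * Y2 + delta * gamma**2 * Y1) / denom
    return w1, w2
```

Setting `sigma_u2 = 0` gives \(\gamma = 1\) and both outputs equal \((Y_1 + Y_2)/(C_1 + C_2 + \delta )\), while a very large `sigma_u2` drives them toward \(Y_1/C_1\) and \(Y_2/C_2\), matching the two limits derived above.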

3.6 Regret

We prove a regret bound for a modification of IntelligentPooling similar to that in Agrawal and Goyal 2012; Vaswani et al. 2017 in a simplified setting. Further details are provided in Appendix 1. Let d be the length of the weight vector \(w_i\) in the Bayesian mixed effects model of the reward in Eq. 1. Recall that \(\varSigma _w\) is the prior covariance of the weight vector \(w_{pop}\), \(\varSigma _u\) is the covariance of the random effect \(u_i\) and \(\sigma _\epsilon ^2\) is the variance of the error term. Let \(K_i\) be the number of decision times for user i up to a given calendar time and \(T = \sum _{i=1}^N K_i\) be the total number of decision times encountered by all N users in the study up to the calendar time. We define the regret of the algorithm after T decision times by \({\mathcal {R}}(T) = \sum _{i=1}^N \sum _{k=1}^{K_i} \max _{a} \phi (S_{i,k}, a)^T w_{i} - \phi (S_{i,k}, A_{i, k})^T w_{i}\).

Theorem 1

With probability \(1-\delta \), where \(\delta \in (0,1)\), the total regret of the modified Thompson Sampling with IntelligentPooling after T total decision times is:

$$\begin{aligned} {\mathcal {R}}(T) ={{\tilde{O}}}\Bigg (dN\sqrt{T}\sqrt{\log \Big (\frac{\text {Tr}(\varSigma _w)+\text {Tr}(\varSigma _u)+\text {Tr}(\varSigma _u^{-1})}{d}+\frac{T}{\sigma _\epsilon ^2dN} \Big ) \log {\frac{1}{\delta }}}\Bigg ) \end{aligned}$$


Observe that, up to logarithmic terms, this regret bound is \({\tilde{O}}(dN\sqrt{T})\). Recall that (Vaswani et al. 2017) introduces a similar regret bound for a Thompson Sampling algorithm which utilizes user-similarity information. The bound from (Vaswani et al. 2017), \({{\tilde{O}}}(dN\sqrt{T/\lambda })\), additionally depends on a hyper-parameter \(\lambda \) that is not included in our model. In (Vaswani et al. 2017), \(\lambda \) controls the strength of prior user-similarity information. Instead of introducing a hyper-parameter our model follows a mixed effects Bayesian structure which allows user similarities (as expressed in the extent to which users’ data is pooled) to be updated with new data. Thus, in certain regimes of hyper-parameter \(\lambda \), IntelligentPooling will incur much smaller regret, as demonstrated empirically in Sect. 4.3.

4 Experiments

This work was conducted to prepare for deployment of IntelligentPooling in a live trial. Thus, to evaluate IntelligentPooling we construct a simulation environment from a precursor trial, HeartStepsV1 (Klasnja et al. 2015). This simulation allows us to evaluate the proposed algorithm under various settings that may arise in implementation. For example, heterogeneity in the observed rewards may be due to unknown subgroups across which users’ reward functions differ. Alternatively, this heterogeneity may vary across users in a more continuous manner. We consider both scenarios in simulated trials. In Sects. 4.1-4.3 we evaluate the performance of IntelligentPooling against baselines and a state-of-the-art algorithm. In Sect. 5 we assess feasibility of IntelligentPooling in a pilot deployment in a clinical trial.

4.1 Simulation environment

HeartStepsV1 was a 6-week micro-randomized trial of an Android-based physical activity intervention with 41 sedentary adults. The intervention consisted of two push interventions: planning and contextually-tailored activity suggestions. Activity suggestions acted as action cues and were designed to provide users with actionable options for engaging in short bouts of activity in their current situation. The content of the suggestions was tailored based on the users’ location, weather, time of day, and day of the week. For each individual, on each day of the study, the HeartSteps system randomized whether or not to send an activity suggestion five times a day. The intended outcome of the suggestions—the proximal outcome used to evaluate their efficacy—was the step count in the 30 minutes following suggestion randomization.

HeartStepsV1 data was used to construct all features within the environment, and to guide choices such as how often to update the feature values. Recall that \(S_{i,k}\) and \(R_{i,k}\) denote the context features and reward of user i at the \(k^{th}\) decision time. The reward is the log step counts in the thirty minutes immediately following a decision time. In HeartStepsV1 three treatment actions were considered: \(A_{i,k}=1\) corresponded to a smartphone notification containing an activity suggestion designed to take 3 minutes to perform, \(A_{i,k}=0\) corresponded to a smartphone notification containing an anti-sedentary message designed to take approximately 30 seconds to perform and \(A_{i,k}=-1\) corresponded to not sending a message. However, in the simulation only the actions 1, 0 are considered.

Fig. 5
figure 5

Contextual features for a simulated User are composed of both general environmental features (such as time of day) and individual features (such as location). At decision times a simulated user receives a message determined by the current treatment policy. Periodically this policy is updated according to a learning algorithm which outputs a new posterior distribution for each User

Figure 5 describes the simulation while Table 1 describes context features and rewards. Each context feature in Table 1 was constructed from HeartStepsV1 data. For example, we found that in HeartStepsV1 data splitting participants’ prior 30 minute step count into the two categories of high or low best explained the reward. Additional details about this process are included in Appendix 4.

The temperature and location are updated throughout a simulated day according to probabilistic transition functions constructed from HeartStepsV1. The step counts for a simulated user are generated from participants in HeartStepsV1 as follows. We construct a one-hot feature vector containing the group-ID of a participant, the time of day, the day of the week, the temperature, the preceding activity level, and the location. Then for each possible realization of the one-hot encoding we calculate the empirical mean and empirical standard deviation of all step counts observed in HeartStepsV1. The corresponding empirical mean and empirical standard deviation from HeartStepsV1 form \(\mu _{S _{i,k}}\) and \(\sigma _{S _{i,k}}\) respectively. At each 30 minute window, if a treatment is not delivered, step counts are generated according to

$$\begin{aligned} R_{i,k} = {\mathbf {N}}(\mu _{S _{i,k}},\sigma ^2_{S _{i,k}}). \end{aligned}$$

Table 1 The value used in encoding each feature is shown in parentheses

Heterogeneity This model, which we denote Heterogeneity, allows us to compare the performance of the approaches under different levels of population heterogeneity. The step count after a decision time is a modification of Eq. 8 to reflect the interaction between context and treatment on the reward and heterogeneity in treatment effect. Let \(\beta \) be a vector of coefficients of \(S_{i,k}\) which weigh the relative contributions of the entries of \(S_{i,k}\) that interact with treatment on the reward. The magnitudes of the entries of \(\beta \) are set using HeartStepsV1. Step counts (\(R_{i,k}\)) are generated as

$$\begin{aligned} R_{i,k} = {\mathbf {N}}(\mu _{S _{i,k}},\sigma ^2_{S _{i,k}})+ A_{i,k}(S _{i,k}^T\beta _{i} + Z_i). \end{aligned}$$

The inclusion of \(Z_i\) will allow us to evaluate the relative performance of each approach under different levels of population heterogeneity. Let \(\beta ^l_i\) be the entry in \(\beta _i\) corresponding to the location term for the \(i^{th}\) user. We consider three scenarios (shown in Table 6) to generate \(Z_i\), the person-specific effect, and \(\beta ^l_i\) the location-dependent effect. The performance of each algorithm under each scenario will be analyzed in Sect. 4.3. In the smooth scenario, \(\sigma \) is equal to the standard deviation of the observed treatment effects \([f(S_{i,k})^\top \beta \ :\ S_{i,k} \in \textsc {HeartStepsV1}{}]\). The settings for all \(Z_i\) and \(\beta ^l_i\) terms are discussed in Sect. D.
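The generative model above can be sketched as a single sampling function. This is a minimal illustration, not the simulation code used in the paper: the function name `simulate_reward`, the context vector and the coefficient values are all hypothetical, and the standard deviation is set to zero in the example only to make the output deterministic.

```python
import numpy as np

def simulate_reward(mu, sigma, action, s, beta_i, z_i, rng):
    """Draw R = N(mu, sigma^2) + A * (S^T beta_i + Z_i).

    mu, sigma : empirical mean / std of step counts for the current context
    action    : 0 (no suggestion) or 1 (suggestion sent)
    s         : context entries that interact with treatment
    beta_i    : user-specific coefficients
    z_i       : person-specific effect
    """
    baseline = rng.normal(mu, sigma)
    treatment_effect = float(s @ beta_i) + z_i
    return baseline + action * treatment_effect

rng = np.random.default_rng(0)
# Hypothetical values, not taken from HeartStepsV1.
s = np.array([1.0, 0.0, 1.0])
beta_i = np.array([0.1, -0.05, 0.2])
r0 = simulate_reward(4.0, 0.0, 0, s, beta_i, z_i=0.3, rng=rng)  # untreated
r1 = simulate_reward(4.0, 0.0, 1, s, beta_i, z_i=0.3, rng=rng)  # treated
print(r0, r1)
```

Varying the spread of `z_i` across simulated users is what changes the level of population heterogeneity.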

In the bi-modal scenario each simulated user is assigned a base-activity level: low-activity users (group 1) or high-activity users (group 2). When a simulated user joins the trial they are placed into either group one or two with equal probability. Whether or not it is optimal to send a treatment (an activity suggestion) for user i at their \(k^{th}\) decision time depends both on their context, and on the values of \(z_1,\beta ^l_1\) and \(z_2,\beta ^l_2\). The values of \(z_1,\beta ^l_1\) and \(z_2,\beta ^l_2\) are set so that for all users in group 1, it is optimal to send a treatment under 75% of the contexts they will experience. Yet for all users in group 2, it is only optimal to send a treatment under 25% of the contexts they will experience. Group membership is not known to any of the algorithms. The settings for all values in Table 6 are included in Sect. D.

Table 2 Settings for Z in three cases of homogeneous, bimodal and smoothly varying populations

4.2 Model for the reward function in IntelligentPooling

In Sect. 3 we introduced the feature vector \(\phi (S _{i,k},A_{i,k}) \in \mathbb {R}^p\). This vector is used in the model for the reward and transforms a user’s contextual state variables \(S _{i,k}\) and the action \(A_{i,k}\) as follows:

$$\begin{aligned} \begin{aligned} \phi (S _{i,k},A_{i,k})^T =&\big ( S _{i,k}^T, \pi _{i,k} S _{i,k}^T, (A_{i,k}-\pi _{i,k})S _{i,k}^T \big ), \end{aligned} \end{aligned}$$

where \(S_{i, k} = \{1, \text {time of day}, \text {day of the week}, \text {preceding activity level}, \text {location}\}\). Recall that the bandit algorithms produce \(\pi _{i,k}\) which is the probability that \(A_{i,k}=1\). The inclusion of the term \((A_{i,k}-\pi _{i,k})S _{i,k}\) is motivated by Liao et al. 2016; Boruvka et al. 2018; Greenewald et al. 2017, who demonstrated that action-centering can protect against mis-specification in the baseline effect (e.g., the expected reward under the action 0). In \(\textsc {HeartStepsV1}\) we observed that users varied in their overall responsivity and that a user’s location was related to their responsivity. In the simulation, we assume person-specific random effects on four parameters in the reward model (i.e., the coefficients of terms in \(S \) involving the intercept and location).
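The action-centered feature vector can be assembled by concatenating three scaled copies of the context. The sketch below assumes a NumPy representation; the function name `phi` and the example context values are illustrative.

```python
import numpy as np

def phi(s, action, pi):
    """Stack (S, pi * S, (A - pi) * S) into one feature vector."""
    s = np.asarray(s, dtype=float)
    return np.concatenate([s, pi * s, (action - pi) * s])

# S = (1, time of day, day of the week, preceding activity level, location);
# the encoded values below are made up for illustration.
s = [1, 0, 1, 1, 0]
x = phi(s, action=1, pi=0.4)
print(x.shape)
```

Because \(A_{i,k}\) equals 1 with probability \(\pi _{i,k}\), the last block \((A_{i,k}-\pi _{i,k})S_{i,k}\) has mean zero given the context, which is what separates the treatment-effect terms from the baseline terms.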

Finally, we constrain the randomization probability to be within [0.1, 0.8] to ensure continual learning. The update time for the hyper-parameters is set to be every 7 days. All approaches are implemented in Python and we implement GP regression with the software package GPyTorch (Gardner et al. 2018).
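The constraint on the randomization probability amounts to a clipping step. How the unclipped probability is obtained is specified by Algorithm 1, which is not reproduced in this section; the sketch below assumes, for illustration only, that it is the posterior probability that a Gaussian-distributed treatment effect is positive. The function name and arguments are hypothetical.

```python
import math

def randomization_probability(mean, std, lower=0.1, upper=0.8):
    """P(effect > 0) under an assumed N(mean, std^2) posterior, clipped to [lower, upper]."""
    prob_positive = 0.5 * (1.0 + math.erf(mean / (std * math.sqrt(2.0))))
    return min(upper, max(lower, prob_positive))

print(randomization_probability(0.0, 1.0))   # uncertain effect: no clipping needed
print(randomization_probability(5.0, 1.0))   # strongly positive: clipped at the upper limit
print(randomization_probability(-5.0, 1.0))  # strongly negative: clipped at the lower limit
```

Keeping the probability away from 0 and 1 means both actions continue to be observed for every user, so the posterior can keep updating.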

4.3 Simulation results

In this section, we compare the use of a mixed effects model for the reward function in \(\textsc {IntelligentPooling}\) to two standard methods used in mHealth, \(\textsc {Complete}\) and \(\textsc {Person}-\textsc {Specific}\) from Sect. 3.3. Recall that IntelligentPooling includes person-specific random effects, as described in Eq. 14. In \(\textsc {Person}-\textsc {Specific}\), all users are assumed to be different and there is no pooling of data, while in \(\textsc {Complete}\) we treat all users the same and learn one set of parameters across the entire population.

Additionally, to assess IntelligentPooling’s ability to pool across users we compare our approach to Gang of Bandits (Cesa-Bianchi et al. 2013), which we refer to as GangOB. As this model requires a relational graph between users, we construct a graph using the generative model (9) and Table 6 connecting users according to each of the three settings: homogeneous, bi-modal and smooth. For example, with knowledge of the generative model users can be connected to other users as a function of their \(Z_i\) terms. As we will not have true access to the underlying generative model in a real-life setting we distort the true graph to reflect this incomplete knowledge. That is, we add ties to dissimilar users at 50% of the strength of the ties between similar users.
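The graph distortion can be illustrated with a weighted adjacency matrix. The sketch below covers only the bi-modal case and assumes that two users are "similar" exactly when they share a group; the function name `distorted_graph` and the unit weight for similar users are assumptions, not details taken from the paper.

```python
import numpy as np

def distorted_graph(groups, noise=0.5):
    """Tie strength 1.0 between users in the same group, `noise` between groups."""
    groups = np.asarray(groups)
    same = groups[:, None] == groups[None, :]
    adj = np.where(same, 1.0, noise)
    np.fill_diagonal(adj, 0.0)  # no self-ties
    return adj

adj = distorted_graph([1, 1, 2, 2])
print(adj)
```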

From the generative model (9), the optimal action for user i at the \(k^{th}\) decision time is \(a^*_{i,k} = \mathbb {1}_{\{S_{i,k}^T\beta _{i}^ *+Z_i \ge 0\}}\). The regret is

$$\begin{aligned} \text {regret}_{i,k}=|S_{i,k}^T\beta _{i}^ *+Z_i| \mathbb {1}_{\{a^*_{i,k}\ne A_{i,k}\}} \end{aligned}$$

where \(\beta ^*_i\) is the optimal \(\beta \) for the \(i^{th}\) user.
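The optimal action and the per-decision regret follow directly from the sign and magnitude of the treatment effect. A minimal sketch, with a hypothetical function name and made-up inputs:

```python
import numpy as np

def decision_regret(s, beta_star, z, action):
    """Regret of `action` when the treatment effect is S^T beta* + Z."""
    effect = float(np.dot(s, beta_star)) + z
    optimal = 1 if effect >= 0 else 0
    return abs(effect) * (optimal != action)

# Made-up context and coefficients giving a negative treatment effect of -0.2.
s, beta_star, z = [1.0, 1.0], [0.2, -0.5], 0.1
print(decision_regret(s, beta_star, z, action=1))  # treated although not treating is optimal
print(decision_regret(s, beta_star, z, action=0))  # optimal action, zero regret
```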

In these simulations each trial has 32 users. Each user remains in the trial for 10 weeks and the entire length of the trial is 15 weeks, where the last cohort joins in week six. The number of users who join each week is a function of the recruitment rate observed in \(\textsc {HeartStepsV1}\). In all settings we run 50 simulated trials.

First, Fig. 6 provides the regret averaged across all users across 50 simulated trials where the reward distribution follows (9) for each of the Table 6 categories. Fig. 6 plots, for each week n on the horizontal axis, the average regret over all users in their nth week in the trial, e.g. in their first week, their second week, etc. In the bi-modal setting there are two groups, where all users in group one have a positive response to treatment when experiencing their typical context, while the users in group two have a negative response to treatment under their typical context. An optimal policy would learn to typically send treatments to users in the first group, and to typically not send them to users in the second. To evaluate each algorithm’s ability to learn this distinction we show the percentage of time each group received a message in Table 3.

Fig. 6
figure 6

Heterogeneity generative model Regret averaged across all users for each week in the trial, i.e. average regret of all users in their first week of the trial

Table 3 The fraction of time that messages were sent to users in each group

The relative performance of the approaches depends on the heterogeneity of the population. When the population is very homogeneous Complete excels, while its performance suffers as heterogeneity increases. Person-Specific is able to personalize; as shown by Table 3, it can differentiate between individuals. However, it learns slowly and can only approach the performance of Complete in the smooth setting of Table 6 where users differ the most in their response to treatment. Both IntelligentPooling and GangOB are more adaptive than either Complete or Person-Specific. GangOB consistently outperforms Person-Specific and achieves lower regret than Complete in some settings. In the homogeneous setting we see that GangOB can utilize social information more effectively than Person-Specific does, while in the smooth setting it can adapt to individual differences more effectively than Complete. Yet, IntelligentPooling demonstrates stronger and swifter adaptability than does GangOB, consistently achieving lower regret at quicker rates. Finally, the algorithms differ in their suitability for real-world applications, especially when data is limited. GangOB requires reliable values for hyper-parameters and can depend on fixed knowledge about relationships between users. IntelligentPooling can learn how to pool between individuals over time and without prior knowledge.

5 IntelligentPooling feasibility study

The simulated experiments provide insights into the potential of this approach for a live deployment. As we see reasonable performance in the simulated setting, we now discuss an initial pilot deployment of IntelligentPooling in a real-life physical activity clinical trial.

5.1 Feasibility study design

The feasibility study of IntelligentPooling involves 10 participants added to a larger 90-day clinical trial of HeartSteps v2, an mHealth physical activity intervention. The purpose of the larger clinical trial is to optimize the intervention for individuals with Stage 1 hypertension. Study participants with Stage 1 hypertension were recruited from Kaiser Permanente Washington in Seattle, Washington. The study was approved by the institutional review board of the Kaiser Permanente Washington Health Research Institute (under number 1257484-14).

HeartSteps v2 is a cross-platform mHealth application that incorporates several intervention components, including weekly activity goals, feedback on goal progress, planning, motivational messages, prompts to interrupt sedentary behavior, and—most relevant to this paper—actionable, contextually-tailored suggestions for individuals to perform a short physical activity (suggesting, roughly, a 3 to 5 minute walk). In this study physical activity is tracked with a commercial wristband tracker, the Fitbit Versa smart watch.

In this version of the intervention, activity suggestions are randomized five times per day for each participant on each day of the 90-day trial. These decision times are specified by each user at the start of the study, and they roughly correspond to the participant’s typical morning commute, lunch time, mid-afternoon, evening commute, and after dinner periods. The treatment options for activity suggestions are binary: at a decision time, the system can either send or not send a notification with an activity suggestion. When provided, the content of the suggestion is tailored to current sensor data (location, weather, time of day, and day of the week). Examples of these suggestions are provided in Klasnja et al. 2018. At a decision time, activity suggestions are randomized only if the system considers that the user is available for the intervention—i.e., that it is appropriate to intervene at that time (see Fig. 8 for criteria used to determine if it is appropriate to send an activity suggestion at a decision time). Subject to these availability criteria, IntelligentPooling determines whether to send a suggestion at each decision time. The posterior distribution was updated once per day, prior to the beginning of each day. Figure 7 provides a schematic of the feasibility study.

The feasibility study included the second set of 10 participants in the trial of HeartSteps v2, following the initial 10 enrolled participants. IntelligentPooling (Algorithm 1) is deployed for each of the second set of 10 participants. At each decision time for these 10 participants, IntelligentPooling uses all data up to that decision time (i.e. from the initial ten participants as well as from the subsequent ten participants). Thus the feasibility study allows us to assess performance of IntelligentPooling after the beginning of a study instead of the performance at the beginning of the study (when there is little data) or the performance at the end of the study (when there is a large amount of data and the algorithm can be expected to perform well).

In the feasibility study, the features used in the reward model were selected to be predictive of the baseline reward and/or the treatment effect, based on the data analysis of HeartStepsV1; see Sect. 6.2 in (Liao et al. 2020) for details. All features used in the reward model are shown in Table 4. The feature engagement represents the extent to which a user engages with the mHealth application measured as a function of how many screen views are made within the application within a day. The feature dosage represents the extent to which a user has received treatments (activity suggestions). This feature increases and decreases depending on the number of activity suggestions recently received. The feature location refers to whether a user is at home or work (encoded as a 1) or somewhere else (encoded as a 0). The temperature feature value is set according to the temperature at a user’s current location (based on phone GPS). The variation feature value is set according to the variation in step count in the hour around that decision point over the prior seven-day period. As before we construct a feature vector \(\phi \); here, however, we only use select terms to estimate the treatment effect. Here,

$$\begin{aligned} \begin{aligned} \phi (S _{i,k},A_{i,k})^T =&\big ( S _{i,k}^T, \pi _{i,k} S _{i,k}^{'T}, (A_{i,k}-\pi _{i,k})S _{i,k}^{'T} \big ), \end{aligned} \end{aligned}$$

where \(S _{i, k} = \{1, \text {temperature}, \text {yesterday's step count}, \text {preceding} \text { activity level}, \text {step variation}, \text {engagement}, \text {dosage}, \text {location}\} \) and \(S '_{i, k} = \{1, \text {step variation}, \text {engagement}, \text {dosage}, \text {location}\}\) is a subset of \(S_{i, k}\).

We provide a full description of these features in Sect. E. The prior distribution was also constructed based on HeartStepsV1; see Sect. 6.3 in (Liao et al. 2020) for more details. As this feasibility study only includes a small number of users, a simple model with only two person-specific random effects, each on the intercept term in \(S \) and \(S '\) (Eq. 12) was deployed.

Fig. 7
figure 7

Setup of FeasibilityStudy. Users can receive treatments up to five times a day during the 90 days. Users enter the trial asynchronously

Fig. 8
figure 8

Availability criteria

Table 4 State feature descriptions for FeasibilityStudy

Here we discuss how much data we have to personalize the policy to each user. Recall that the 10 users only receive interventions when they meet the availability criteria outlined in Fig. 8; thus, in practice, we have a limited number of decision points from which to learn a personalized policy. In the case of perfect availability, we would have at most 450 decision points per person. However, due to the criteria in Fig. 8, the algorithm is used with only approximately 23% of each user’s decision points. Pooling users’ data allows us to learn more rapidly. On the day that the first pooled user joined the feasibility study there were 107 data points from the first set of 10 users.
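As a rough check of these figures, with five decision times per day over the 90-day trial and the approximate 23% availability rate quoted above:

$$\begin{aligned} 5 \times 90 = 450, \qquad 0.23 \times 450 \approx 104, \end{aligned}$$

so each user contributes on the order of 100 usable decision points over the whole trial.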

The 10 users received an average of 0.20 (\(\pm 0.015\)) messages a day. The average log step count in the 30-minute window after a suggestion was sent was 4.47, while it was 3.65 in the 30-minute windows after suggestions were not sent. Figure 9 shows the entire history of treatment selection probabilities for all of the users who received treatment according to IntelligentPooling. We see that the treatment probabilities tended to be low, though they covered the whole range of possible values.

Fig. 9
figure 9

We see that IntelligentPooling covers the full range of treatment selection probabilities. The tendency seems to be to send with a lower rather than higher probability

Fig. 10
figure 10

Posterior mean and standard deviation of the coefficient of \(A_{i,k}\) in Eq. 12 for all users in the feasibility study

Fig. 11
figure 11

Posterior mean of the coefficient of \(A_{i,k}\) in Eq. 12 for users A and B in the feasibility study

Fig. 12
figure 12

Mean squared distance of the posterior mean from prior mean of the coefficients of \(A_{i,k}\)

We would like to assess the ability of IntelligentPooling to personalize and learn quickly. To do so we perform an analysis of the learning algorithms of IntelligentPooling, Complete and Person-Specific on batch data containing tuples of (S, A, R). Note that the actions in this batch data were selected by IntelligentPooling; however, here we are not interested in the action selection components of each algorithm but instead in their ability to learn the posterior distribution of the weights on the feature vector.

Personalization By comparing how the decisions to treat under IntelligentPooling differ from those under Complete, we gather preliminary evidence concerning whether IntelligentPooling personalizes to users. Figure 10 shows the posterior mean of the coefficient of the \(A_{i,k}\) term in the estimation of the treatment effect, for all users in the feasibility study on the 90th day after the last user joined the study. We show this term not only for IntelligentPooling but also for Complete and Person-Specific. We see that for some users this coefficient is below zero while for others it is above. While the terms under IntelligentPooling differ from Complete they do not vary as much as those learned by Person-Specific. Yet, crucially, the variance is much lower for these terms.

Figure 11 displays the posterior mean of the coefficient of the \(A_{i,k}\) term in the estimation of the treatment effect. This coefficient represents the overall effect of treatment on one of the users, User A. During the prior 7 days User A had not experienced much variation in activity at this time and the user’s engagement is low. Note that the treatment appears to have a positive effect on a different user, User B, in this context whereas on User A there is little evidence of a positive effect. If Complete had been used to determine treatment, User A might have been over-treated.

Speed of policy learning We consider the speed at which IntelligentPooling diverges from the prior, relative to the speed of divergence for Person-Specific. Figure 12 provides the Euclidean distance between the learned posterior and prior parameter vectors (averaged across the data from the 10 users at each time). From Fig. 12 we see that Person-Specific hardly varies over time in contrast to IntelligentPooling and Complete, which suggests that Person-Specific learns more slowly.
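The divergence measure described in the text can be sketched as the Euclidean norm of the difference between each user's posterior mean and the prior mean, averaged over users. The function name and inputs below are illustrative assumptions; the exact quantity plotted in Fig. 12 may be scaled differently.

```python
import numpy as np

def mean_distance_from_prior(posterior_means, prior_mean):
    """Average Euclidean distance between per-user posterior means and the prior mean."""
    diffs = np.asarray(posterior_means, dtype=float) - np.asarray(prior_mean, dtype=float)
    return float(np.linalg.norm(diffs, axis=1).mean())

# Made-up example: one user has moved away from the prior, one has not.
print(mean_distance_from_prior([[3.0, 4.0], [0.0, 0.0]], [0.0, 0.0]))
```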

In conclusion IntelligentPooling was found to be feasible in this study. In particular the algorithm was operationally stable within the computational environment of the study, produced decision probabilities in a timely manner, and did not adversely impact the functioning of the overall mHealth intervention application. Overall, IntelligentPooling produced treatment selection probabilities which covered the full range of available probabilities, though treatments tended to be sent with a low probability.

6 Non-stationary environments

An additional challenge in mHealth settings is that users’ response to treatment can vary over time. To address this challenge we show that our underlying model can be extended to include time-varying random effects. This allows each policy to be aware of how a user’s response to treatment might vary over time. We propose a new simulation to evaluate this approach and show that IntelligentPooling achieves state-of-the-art regret, adjusting to non-stationarity even as user populations vary from heterogeneous to homogeneous.

6.1 Time-varying random effect

In addition to user-specific random effects we extend our model to include time-specific random effects. Consider the Bayesian mixed effects model with person-specific and time-varying effects: for user i at the \(k^{th}\) decision time,

$$\begin{aligned} { R_{i, k} = \phi (S_{i, k}, A_{i, k})^\top {w_{i, k}} + \epsilon _{i, k}}. \end{aligned}$$

In addition, we impose the following additive structure on the parameters \(w_{i,k}\):

$$\begin{aligned} w_{i,k} = w_{pop} + u_i + v_k , \end{aligned}$$

where \(w_{pop}\) is the population-level parameter, \(u_i\) represents the person-specific deviation from \(w_{pop}\) for user i and \(v_k\) is the time-varying random effect allowing \(w_{i, k}\) to vary with time in the study.

The prior terms for this model are as introduced in Sect. 3.4. Additionally, \(v_k\) has mean \({\mathbf {0}}\) and covariance \( D_v\). The covariance between two relative decision times in the trial is \(\text {Cov}(v_k, v_{k'}) = \rho (k, k') D_v\), where \(\rho (k, k') = \exp (-dist(k, k')^2/\sigma _{\rho })\) for a distance function dist and a scale parameter \(\sigma _{\rho }\). There is no change to Algorithm 1 except that now the algorithm would select the action based on the posterior distribution of \(w_{i, k}\), which depends on both the user and time in the study.
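The covariance structure of the time-varying effects can be assembled as a block matrix. The sketch below assumes the distance function is the absolute difference \(|k - k'|\), which the text does not specify; the function name and the values of \(D_v\) and \(\sigma _{\rho }\) are likewise illustrative.

```python
import numpy as np

def time_covariance(times, D_v, sigma_rho):
    """Block matrix whose (k, k') block is rho(k, k') * D_v."""
    t = np.asarray(times, dtype=float)
    dist = np.abs(t[:, None] - t[None, :])  # assumed distance: |k - k'|
    rho = np.exp(-dist**2 / sigma_rho)
    return np.kron(rho, D_v)

D_v = np.array([[1.0, 0.2], [0.2, 0.5]])  # made-up covariance of v_k
cov = time_covariance([0, 1, 2], D_v, sigma_rho=2.0)
print(cov.shape)
```

Nearby decision times receive a correlation close to 1 and distant ones a correlation close to 0, so information about \(v_k\) is shared mostly between adjacent periods.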

6.2 Experiments

We now modify our original simulation environment so that users’ responses will vary over time. To do so we introduce the generative model Disengagement. This generative model captures the phenomenon of disengagement. That is, as users are increasingly exposed to treatment over time, they can become less responsive. This model adds a further term to (9), \(A_{i,k}X_w^T\beta _w \) where \(X_w\) is defined as follows. Let \(w_{i,k}\) be the highest number of weeks user i has completed at time k; \(X_w\) encodes a user’s current week in a trial, \(X_w =[\mathbb {1}_{\{w_{i,k}=0\}},\dots ,\mathbb {1}_{\{w_{i,k}=11\}}]\). We set \(\beta _w\) such that the longer a user has been in treatment, the less they respond to a treatment message. When a simulated user is at a decision time the user will receive a treatment message according to whichever RL policy is being run through the simulation.
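The additional Disengagement term can be sketched as a one-hot week indicator multiplied by a coefficient vector. The linear decline used for \(\beta _w\) below is a hypothetical choice for illustration; the paper's actual values are not given in this section.

```python
import numpy as np

N_WEEKS = 12  # indicators for weeks 0..11, matching X_w

def week_one_hot(weeks_completed):
    """X_w: one-hot encoding of the number of weeks a user has completed."""
    x = np.zeros(N_WEEKS)
    x[weeks_completed] = 1.0
    return x

def disengagement_term(action, weeks_completed, beta_w):
    """A * X_w^T beta_w, the extra term added to the Heterogeneity model."""
    return action * float(week_one_hot(weeks_completed) @ beta_w)

# Hypothetical beta_w: the response declines linearly with time in the trial.
beta_w = -0.1 * np.arange(N_WEEKS)
print(disengagement_term(1, 3, beta_w))
print(disengagement_term(0, 3, beta_w))
```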

In order to evaluate the effectiveness of our time-varying model we compare to Time-Varying Gaussian Process Thompson Sampling (TV-GP) (Bogunovic et al. 2016). This approach incorporates temporal information for non-stationary environments and was shown to be competitive with stationary models. To compare this method to IntelligentPooling we use a linear kernel for the spatial component. We then modify Eq. 6 to compute the posterior distribution by removing the random effects and modifying the kernel (Eq. 5) to include the temporal terms introduced in (Bogunovic et al. 2016).

Fig. 13
figure 13

Disengagement generative model Regret averaged across all users for each week in the trial, i.e. average regret of all users in their first week of the trial

Table 5 Average fraction of times treatment was sent (action=1), over 50 simulations (generative model Heterogeneity with homogeneous \(Z^h\) setting)

Figure 13 provides the regret averaged across all users across 50 simulated trials where the reward distribution follows generative model \(\textsc {Disengagement}\). As before, Fig. 13 plots, for each week n on the horizontal axis, the average regret over all users in their \(n^{th}\) week in the trial, e.g. in their first week, their second week, etc. In Disengagement, the time-specific response to treatment is set so that a negative response to treatment is introduced in the seventh week of the trial.

In the Disengagement condition, as users become increasingly less responsive to treatment, good policies should learn to treat less. Thus, Table 5 provides the average number of times a treatment is sent in the last week of the trial for both the first and last cohort. We expect that a policy which learns not to treat will treat less often in the last week of the last cohort than in the last week of the first cohort.

7 Limitations

A significant limitation with this work is that our pilot study involved a small number of participants. Our results from this work must be considered with caution as preliminary evidence towards the feasibility of deploying IntelligentPooling, and bandit algorithms in general, in mHealth settings. Moreover, we cannot claim to provide generalizable evidence that this algorithm can improve health outcomes; for this larger studies with more participants must be run. We offer our findings as motivation for such future work.

Our proposed model is designed to overcome the challenges faced when learning personalized policies in limited data settings. As such, if data were abundant our model would likely have limited effectiveness compared to more complex models. For example, a more complex model could allow us to pool between users as a function of their similarity. Our current model instead determines the extent to which a given user deviates from the population and does not consider between-user similarities. A limitation with our current understanding of mHealth is that it is unclear what a good similarity measure would be. We leave the question of designing a data-efficient algorithm for learning such a measure as future work.

A component of IntelligentPooling is the use of empirical Bayes to update the model hyper-parameters. Here, we used an approximate procedure. However, with our model it is possible to produce exact updates in a streaming fashion and we are currently developing such an approach.

Ideally, we would evaluate IntelligentPooling against all other approaches in a clinical trial setting. However, here we only demonstrated the feasibility of our approach on a limited number of users and did not have the resources to similarly test the other approaches. To overcome this limitation we constructed a realistic simulation environment so that we could evaluate on different populations without the costly investment of designing multiple arms of a real-life trial. While the simulated experiments and the feasibility study together demonstrate the practicality of our approach, in future work one might deploy all potential approaches in simultaneous live trials.

Finally, IntelligentPooling can incorporate a time-specific random effect to capture the phenomenon of responsivity changing over the course of a study. There is much to be improved with this model. For example, the first cohort in a study will not have prior cohorts to learn from, and the final cohort will have the greatest amount of data to benefit from. Other models might treat different cohorts with greater equality. Furthermore, this representation does not incorporate alternative temporal information, such as continually shifting weather patterns, where temperatures might change slowly and gradually alter one’s desire to exercise outside.

8 Conclusion

When data on individuals is limited, a natural tension exists between personalizing (a choice which can introduce variance) and pooling (a choice which can introduce bias). In this work we have introduced a novel algorithm for personalized reinforcement learning, IntelligentPooling, that presents a principled mechanism for balancing this tension. We demonstrate the practicality of our approach in the setting of mHealth. In simulation we achieve improvements of 26% over a state-of-the-art method, while in a live clinical trial we show that our approach shows promise of personalization on even a limited number of users. We view adaptive pooling as a first step in addressing the trade-offs between personalization and pooling. The question of how to quantify the benefits and risks for individual users is an open direction for future work.