1 Introduction

With the vast population of smart-device users worldwide (smartphones and wearable devices such as the Fitbit and Jawbone), mobile health (mHealth) technologies have become increasingly popular in the scientific community. The goal of mHealth is to use smart devices as platforms to collect and analyze raw data (weather, location, social activity, stress, etc.) and, based on that data, to deliver in-time interventions tailored to each user's ongoing status and changing needs, helping users lead healthier lives, for example by reducing alcohol abuse [4] and managing obesity [11].

Formally, the tailoring of mHealth interventions is modeled as a sequential decision making (SDM) problem. The aim is to learn the optimal decision rule that decides when, where and how to deliver interventions [7, 10, 13, 17] so as to best serve users. This is a brand-new research topic. Currently, there are two types of reinforcement learning (RL) methods for mHealth, with distinct assumptions: (a) the off-policy, batch RL [16, 17] assumes that all users in the mHealth study are completely homogeneous: they share all information, and one identical RL model is learned for all users; (b) the on-policy, online RL [7, 17] assumes that all users are completely different: they share no information, and a separate RL model is run for each user. These assumptions are a reasonable starting point for mHealth research. However, when mHealth is applied to more practical situations, they have the following drawbacks: (a) the off-policy, batch RL method ignores the fact that the behavior of all users may be too complicated to be modeled with a single identical RL, which leads to potentially large bias in the learned policy; (b) for the on-policy, online RL method, an individual user's trajectory data is hardly enough to support a separate RL learning procedure, which is likely to result in unstable policies with high variance [14].

A more realistic assumption lies between these two extremes: a user may be similar to some, but not all, users, and similar users tend to exhibit similar behaviors. In this paper, we propose a novel group-driven RL for mHealth in an actor-critic setting [3]. The core idea is to discover the similarity (cohesion) network of the users. Specifically, we employ a clustering method to mine the group structure. Taking the group information into consideration, we learn K shared RL models (one per group); each RL learning procedure makes use of all the data in its group. This balances the conflicting goals of reducing the complexity of the data used by each RL model while enriching the number of samples available to it.

2 Preliminaries

The Markov Decision Process (MDP) provides a mathematical tool to model a dynamic system [2, 3]. It is defined as a 5-tuple \(\left\{ \mathcal {S},\mathcal {A},\mathcal {P},\mathcal {R},\gamma \right\} \), where \(\mathcal {S}\) is the state space and \(\mathcal {A}\) is the action space. The state transition model \(\mathcal {P}:\mathcal {S}\times \mathcal {A}\times \mathcal {S}\mapsto \left[ 0,1\right] \) gives the probability of transiting from one state s to another state \(s'\) under a given action a. \(\mathcal {R}:\mathcal {S}\times \mathcal {A}\mapsto \mathbb {R}\) is the corresponding reward, which is assumed to be bounded over the state and action spaces. \(\gamma \in [0,1)\) is a discount factor that reduces the influence of future rewards. The stochastic policy \(\pi \left( \cdot \mid s\right) \) determines how the agent interacts with the system by providing, for each state s, a probability distribution over all possible actions. We consider a parameterized stochastic policy, i.e., \(\pi _{\theta }\left( a\mid s\right) \), where \(\theta \) is the unknown coefficient vector.

Formally, the quality of a policy \(\pi _{\theta }\) is evaluated by a value function \(Q^{\pi _{\theta }}\left( s,a\right) \in \mathbb {R}^{\left| \mathcal {S}\right| \times \left| \mathcal {A}\right| }\) [12]. It specifies the expected total reward an agent achieves when starting from state s, first choosing action a and then following the policy \(\pi _{\theta }\). It is defined as follows [3]:

$$\begin{aligned} Q^{\pi _{\theta }}\left( s,a\right) =\mathbb {E}_{a_{t}\sim \pi _{\theta },s_{t}\sim \mathcal {P}}\left\{ \sum _{t=0}^{\infty }\gamma ^{t}\mathcal {R}\left( s_{t},a_{t}\right) \mid s_{0}=s,a_{0}=a\right\} . \end{aligned}$$
(1)

The goal of various RL methods is to learn an optimal policy \(\pi _{\theta ^{*}}\) that maximizes the Q-value for all state-action pairs [2]. The objective is \(\pi _{\theta ^{*}}=\arg \max _{\theta }\widehat{J}\left( \theta \right) \) (this procedure is called the actor update [3]), where

$$\begin{aligned} \widehat{J}\left( \theta \right) =\sum _{s\in \mathcal {S}}d_{\text {ref}}\left( s\right) \sum _{a\in \mathcal {A}}\pi _{\theta }\left( a\mid s\right) Q^{\pi _{\theta }}\left( s,a\right) , \end{aligned}$$
(2)

where \(d_{\text {ref}}\left( s\right) \) is a reference distribution over states and \(Q^{\pi _{\theta }}\left( s,a\right) \) is the value of the parameterized policy \(\pi _{\theta }\). Obviously, we need an estimate of \(Q^{\pi _{\theta }}\left( s,a\right) \) (i.e. the critic update) before the objective (2) can be evaluated.

3 Cohesion Discovery for the RL Learning

Suppose we are given a set of N users, each with a trajectory of T points. Thus in total we have \(NT=N\times T\) tuples, summarized in \(\mathcal {D}=\left\{ \mathcal {D}_{n}\mid n=1,\cdots ,N\right\} \) for all N users, where \(\mathcal {D}_{n}=\left\{ \mathcal {U}_{i}\mid i=1,\cdots ,T\right\} \) collects the T tuples of the n-th user and \(\mathcal {U}_{i}=\left( s_{i},a_{i},r_{i},s_{i}'\right) \) is the i-th tuple in \(\mathcal {D}_{n}\).
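For concreteness, the purely illustrative sketch below shows one way to lay out this data set in NumPy; the sizes N, T and the state dimension p are placeholders borrowed from the study in Sect. 4.

```python
import numpy as np

# Hypothetical layout of the pooled data set D: D[n] is the trajectory D_n of the
# n-th user, and each entry is one tuple U_i = (s_i, a_i, r_i, s_i').
N, T, p = 50, 42, 3                       # example sizes, matching the study in Sect. 4
D = [[(np.zeros(p),                       # s_i  : current state
       0,                                 # a_i  : action taken (0 or 1)
       0.0,                               # r_i  : immediate reward
       np.zeros(p))                       # s_i' : next state
      for _ in range(T)] for _ in range(N)]
```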

3.1 The Pooled-RL and Separate RL (Separ-RL)

The first RL method (Pooled-RL) assumes that all N users are completely homogeneous and follow the same MDP; they share all information and learn one identical RL model for all users [16]. In this setting, the critic update (which seeks a solution satisfying the linear Bellman equation [2, 3]) is

$$\begin{aligned} \mathbf {w}=f\left( \mathbf {w}\right) =\arg \min _{\mathbf {h}}\frac{1}{\left| \mathcal {D}\right| }\sum _{\mathcal {U}_{i}\in \mathcal {D}}\left\| \mathbf {x}\left( s_{i},a_{i}\right) ^{\intercal }\mathbf {h}-\left[ r_{i}+\gamma \mathbf {y}\left( s_{i}';\theta \right) ^{\intercal }\mathbf {w}\right] \right\| _{2}^{2}+\zeta _{c}\left\| \mathbf {h}\right\| _{2}^{2}, \end{aligned}$$
(3)

where \(\mathbf {w}=f\left( \mathbf {w}\right) \) is a fixed-point problem; \(\left| \mathcal {D}\right| \) is the number of tuples in \(\mathcal {D}\); \(\mathbf {x}_{i}=\mathbf {x}\left( s_{i},a_{i}\right) \) is the value feature at time point i; \(\mathbf {y}_{i}=\mathbf {y}\left( s_{i}';\theta \right) =\sum _{a\in \mathcal {A}}\mathbf {x}\left( s_{i}',a\right) \pi _{\theta }\left( a\mid s_{i}'\right) \) is the expected feature at the next time point; \(\zeta _{c}\) is a tuning parameter. The least-squares temporal difference for Q-value (LSTDQ) [5, 6] provides a closed-form solution to (3) as follows

$$\begin{aligned} \widehat{\mathbf {w}}=\left( \zeta _{c}\mathbf {I}+\frac{1}{\left| \mathcal {D}\right| }\sum _{\mathcal {U}_{i}\in \mathcal {D}}\mathbf {x}_{i}\left( \mathbf {x}_{i}-\gamma \mathbf {y}_{i}\right) ^{\intercal }\right) ^{-1}\left( \frac{1}{\left| \mathcal {D}\right| }\sum _{\mathcal {U}_{i}\in \mathcal {D}}\mathbf {x}_{i}r_{i}\right) . \end{aligned}$$
(4)
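The closed form (4) is a ridge-regularized linear solve. Below is a minimal NumPy sketch of it; the arrays X, Y and r (the features \(\mathbf{x}_i\), the expected next-state features \(\mathbf{y}_i\) and the rewards \(r_i\)) are assumed to be precomputed from the tuples in \(\mathcal{D}\) under the current policy.

```python
import numpy as np

def lstdq(X, Y, r, gamma, zeta_c):
    """Closed-form critic update of Eq. (4).

    X : (n, d) array whose rows are the value features x(s_i, a_i)
    Y : (n, d) array whose rows are y(s_i'; theta) = sum_a x(s_i', a) * pi_theta(a | s_i')
    r : (n,)   array of immediate rewards r_i
    """
    n, d = X.shape
    A = zeta_c * np.eye(d) + (X.T @ (X - gamma * Y)) / n
    b = (X.T @ r) / n
    return np.linalg.solve(A, b)          # estimated critic weights w_hat
```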

As \(d_{\text {ref}}\left( s\right) \) is generally unavailable, the T-trial objective for (2) is defined as

$$\begin{aligned} \hat{\theta }=\arg \max _{\theta }\frac{1}{\left| \mathcal {D}\right| }\sum _{\mathcal {U}_{i}\in \mathcal {D}}\sum _{a\in \mathcal {A}}Q\left( s_{i},a;\mathbf {\widehat{\mathbf {w}}}\right) \pi _{\theta }\left( a|s_{i}\right) -\frac{\zeta _{a}}{2}\left\| \theta \right\| _{2}^{2}, \end{aligned}$$
(5)

where \(Q\left( s_{i},a;\widehat{\mathbf {w}}\right) =\mathbf {x}\left( s_{i},a\right) ^{\intercal }\widehat{\mathbf {w}}\) is the estimated Q-value based on the critic update result in (4), and \(\zeta _{a}\) is a tuning parameter to prevent overfitting. In case of large feature spaces, one can iteratively update \(\widehat{\mathbf {w}}\) via (4) and \(\widehat{\theta }\) via (5) to reduce the computational cost.
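As a sketch of the actor update (5), the regularized objective can be maximized directly with a generic optimizer. The code below assumes a Boltzmann (softmax) policy of the form used later in Sect. 4, and that the policy features Phi and critic values Q have been precomputed for every tuple and action; it is an illustration, not the only possible implementation.

```python
import numpy as np
from scipy.optimize import minimize

def policy_probs(theta, phi_sa):
    """pi_theta(. | s) for one state; phi_sa is the (|A|, q) matrix of policy features."""
    logits = -(phi_sa @ theta)             # Boltzmann form, cf. the policy class in Sect. 4
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def neg_actor_objective(theta, Phi, Q, zeta_a):
    """Negative of the T-trial actor objective (5).

    Phi : (n, |A|, q) policy features phi(s_i, a) for every tuple i and action a
    Q   : (n, |A|)    critic values   Q(s_i, a; w_hat) = x(s_i, a)^T w_hat
    """
    J = np.mean([policy_probs(theta, phi_sa) @ q_sa for phi_sa, q_sa in zip(Phi, Q)])
    return -(J - 0.5 * zeta_a * (theta @ theta))

def actor_update(Phi, Q, zeta_a, theta0):
    """Maximize (5) by quasi-Newton ascent on the negated objective."""
    return minimize(neg_actor_objective, theta0, args=(Phi, Q, zeta_a)).x
```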

The Pooled-RL works well when all N users are very similar. However, in mHealth studies there are large behavioral discrepancies among users because they differ in age, race, income, religion, education level etc. In such cases the Pooled-RL is too simple to simultaneously fit all N users' behaviors, and it easily yields a heavily biased value function and policy.

The second RL method (Separ-RL), such as Lei's online contextual bandit for mHealth [7, 15], assumes that all users are completely heterogeneous: they share no information, and a separate online RL is run for each user. The objective functions are very similar to (3), (4), (5). This method should perform well when each user's data set is large. However, it generally costs a lot of time and other resources to collect enough data for Separ-RL learning. Taking HeartSteps as an example, the trial lasts 42 days and collects only 210 tuples per user. What is worse, the data contains missing values and noise, which further reduces the effective sample size. Such small sample sizes easily lead to unstable policies with high variance.

3.2 Group Driven RL Learning (Gr-RL)

We observe that users in mHealth studies are generally similar to some (but not all) other users in the sense that they may share features such as age, gender, race, religion, education level, income and other socioeconomic status [8]. To this end, we propose a group-based RL for mHealth that shares information across similar users to improve performance. Specifically, users are assumed to form groups and to share information with others in the same group. The main idea is to divide the N users into K groups and learn a separate RL model for each group. The samples of users in a group are pooled together, which not only keeps the data for each RL model simpler than that of the Pooled-RL, but also greatly enriches the samples available to each RL model compared with the Separ-RL, with an average increase of \(\left( N/K-1\right) \times 100\%\) in sample size (cf. Sect. 3.1).
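For example, with \(N=50\) users and \(K=5\) equally sized groups, each group-level RL is learned from \(10T\) tuples rather than T, i.e. an increase of \(\left(50/5-1\right)\times 100\%=900\%\) in sample size over the Separ-RL.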

To cluster the N users, we employ one of the most standard clustering methods, i.e., K-means. The behavior information (states and rewards) in the trajectory serves as the feature: the T tuples of a user are stacked together as \(\mathbf {z}_{n}=\left[ s_{1},r_{1},\cdots ,s_{T},r_{T}\right] ^{\intercal }\). With this feature, the clustering objective is \(J=\sum _{n=1}^{N}\sum _{k=1}^{K}r_{nk}\left\| \mathbf {z}_{n}-\varvec{\mu }_{k}\right\| ^{2}\), where \(\varvec{\mu }_{k}\) is the k-th cluster center and \(r_{nk}\in \left\{ 0,1\right\} \) is the binary indicator variable describing which of the K clusters the data point \(\mathbf {z}_{n}\) belongs to. After the clustering step, we have the group information \(\left\{ \mathcal {G}_{k}\mid k=1,\cdots ,K\right\} \), each group containing a set of similar users. With the clustering results, the new objective for the critic update is \(\mathbf {w}_{k}=f\left( \mathbf {w}_{k}\right) =\mathbf {h}_{k}^{*}\) for \(k=1,\cdots ,K\), where \(\mathbf {h}_{k}^{*}\) is estimated by

$$\begin{aligned} \min _{\left\{ \mathbf {h}_{k}\mid k=1,\cdots ,K\right\} }\sum _{k=1}^{K}\left\{ \frac{1}{\left| \mathcal {G}_{k}\right| }\sum _{\mathcal {U}_{i}\in \mathcal {G}_{k}}\left\| \mathbf {x}{}_{i}^{\intercal }\mathbf {h}_{k}-\left( r_{i}+\gamma \mathbf {y}_{i}^{\intercal }\mathbf {w}_{k}\right) \right\| _{2}^{2}+\zeta _{c}\left\| \mathbf {h}_{k}\right\| _{2}^{2}\right\} , \end{aligned}$$
(6)

which can be solved via LSTDQ. The objective for the actor update is

$$\begin{aligned} \max _{\left\{ \theta _{k}\mid k=1,\cdots ,K\right\} }\sum _{k=1}^{K}\left\{ \frac{1}{\left| \mathcal {G}_{k}\right| }\sum _{\mathcal {U}_{i}\in \mathcal {G}_{k}}\sum _{a\in \mathcal {A}}Q\left( s_{i},a;\widehat{\mathbf {w}}_{k}\right) \pi _{\theta _{k}}\left( a|s_{i}\right) -\frac{\zeta _{a}}{2}\left\| \theta _{k}\right\| _{2}^{2}\right\} . \end{aligned}$$
(7)

The objectives (6) and (7) can be solved independently for each cluster. By properly setting the value of K, we can balance the conflicting goals of reducing the discrepancy among pooled users and increasing the number of samples for each RL learning: (a) a small K suits the case where T is small and the users are generally similar; (b) a large K suits the case where T is large and users generally differ from one another. Besides, the proposed method is a generalization of the conventional Pooled-RL and Separ-RL: (a) when \(K=1\), the proposed method is equivalent to the Pooled-RL; (b) when \(K=N\), it is equivalent to the Separ-RL.
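A compact sketch of the whole Gr-RL procedure, under the same assumptions as the earlier snippets, is given below. It uses scikit-learn's KMeans for the clustering step and the lstdq / actor_update routines sketched in Sect. 3.1; build_value_features and build_policy_features are hypothetical helpers that assemble \(\mathbf{x}_i\), \(\mathbf{y}_i\), \(\phi(s_i,a)\) and \(Q(s_i,a;\mathbf{w}_k)\) from the raw tuples.

```python
import numpy as np
from sklearn.cluster import KMeans

def group_rl(D, gamma, zeta_c, zeta_a, K, policy_dim, n_iters=10):
    """Sketch of Gr-RL: cluster the users, then run one actor-critic per group.

    D is a list of N trajectories; each trajectory is a list of (s, a, r, s_next) tuples.
    """
    # 1. Cluster users on the stacked feature z_n = [s_1, r_1, ..., s_T, r_T].
    Z = np.array([np.concatenate([np.append(s, r) for (s, a, r, s_next) in traj])
                  for traj in D])
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(Z)

    thetas = []
    for k in range(K):
        # 2. Pool the tuples of all users assigned to group G_k.
        tuples = [u for n, traj in enumerate(D) if labels[n] == k for u in traj]
        theta_k = np.zeros(policy_dim)
        for _ in range(n_iters):                             # alternate (6) and (7)
            X, Y, r = build_value_features(tuples, theta_k)  # x_i, y_i, r_i (hypothetical helper)
            w_k = lstdq(X, Y, r, gamma, zeta_c)              # critic update, Eq. (6)
            Phi, Q = build_policy_features(tuples, w_k)      # phi(s_i, a), Q(s_i, a; w_k) (hypothetical helper)
            theta_k = actor_update(Phi, Q, zeta_a, theta_k)  # actor update, Eq. (7)
        thetas.append(theta_k)
    return labels, thetas                                    # group labels and one policy per group
```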

4 Experiments

Three RL methods are compared: (a) the Pooled-RL, which pools the data across all users and learns one identical policy for all users [16, 17]; (b) the Separ-RL, which learns a separate RL policy for each user using only his or her own data [7]; (c) the proposed group-driven RL (Gr-RL).

The HeartSteps dataset is used in the experiment. It is a 42-day trial study with 50 participants. For each participant, 210 decision points are collected, i.e. five decisions per participant per day. At each time point, the intervention action can specify the intervention type as well as whether or not to send an intervention; interventions are sent via smartphones or via wearable devices such as a wristband [1]. In our study, a policy chooses from two actions \(\left\{ 0,1\right\} \): \(a=1\) indicates sending the positive intervention, while \(a=0\) means no intervention [16, 17]. Specifically, the parameterized stochastic policy is assumed to take the form \(\pi _{\theta }\left( a\mid s\right) =\frac{\exp \left[ -\theta ^{\intercal }\phi \left( s,a\right) \right] }{\sum _{a'}\exp \left[ -\theta ^{\intercal }\phi \left( s,a'\right) \right] }\), where \(\theta \in \mathbb {R}^{q}\) is the unknown parameter vector and \(\phi \left( \cdot ,\cdot \right) \) is the feature processing for the policy, i.e., \(\phi \left( s,a\right) =\left[ as^{\intercal },a\right] ^{\intercal }\in \mathbb {R}^{q}\), which differs from the feature \(\mathbf {x}\left( s,a\right) \) used for the value function.
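To make the policy class concrete, a minimal NumPy sketch of the feature maps and of \(\pi_\theta\) is given below; the value feature \(\mathbf{x}(s,a)\) anticipates the definition given in Sect. 4.1, and the example values are placeholders.

```python
import numpy as np

def policy_features(s, a):
    """phi(s, a) = [a * s^T, a]^T, the policy feature of Sect. 4."""
    return np.append(a * s, a)

def value_features(s, a):
    """x(s, a) = [1, s^T, a, s^T a]^T, the value feature defined in Sect. 4.1."""
    return np.concatenate(([1.0], s, [a], a * s))

def pi_theta(theta, s, actions=(0, 1)):
    """pi_theta(a | s) = exp[-theta^T phi(s, a)] / sum_a' exp[-theta^T phi(s, a')]."""
    logits = np.array([-theta @ policy_features(s, a) for a in actions])
    logits -= logits.max()                  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# With p = 3 state features, theta has q = p + 1 = 4 entries; theta = 0 gives a uniform policy.
print(pi_theta(np.zeros(4), np.array([0.2, -0.1, 0.4])))    # -> [0.5 0.5]
```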

4.1 Experimental Settings

For the \(n^{\text {th}}\) user, a trajectory of T tuples \(\mathcal {D}_{n}=\left\{ \left( s_{i},a_{i},r_{i}\right) \right\} _{i=1}^{T}\) is collected via the micro-randomized trial [7, 10]. The initial state is sampled from a Gaussian distribution \(S_{0}\sim \mathcal {N}_{p}\left\{ 0,\varSigma \right\} \), where \(\varSigma \) is the \(p\times p\) covariance matrix with pre-defined elements. The action \(a_{t}=1\) is drawn from the random policy that provides interventions with probability 0.5, i.e. \(\mu \left( 1\mid s_{t}\right) =0.5\) for all states \(s_{t}\). For \(t\ge 1\), the state and immediate reward are generated as follows

$$\begin{aligned} S_{t,1}&=\beta _{1}S_{t-1,1}+\xi _{t,1},\nonumber \\ S_{t,2}&=\beta _{2}S_{t-1,2}+\beta _{3}A_{t-1}+\xi _{t,2},\\ S_{t,3}&=\beta _{4}S_{t-1,3}+\beta _{5}S_{t-1,3}A_{t-1}+\beta _{6}A_{t-1}+\xi _{t,3},\nonumber \\ S_{t,j}&=\beta _{7}S_{t-1,j}+\xi _{t,j},\qquad \text {for}\ j=4,\ldots ,p\nonumber \end{aligned}$$
(8)
$$\begin{aligned} R_{t}=\beta _{14}\times \left[ \beta _{8}+A_{t}\times (\beta _{9} +\beta _{10}S_{t,1}+\beta _{11}S_{t,2})+\beta _{12}S_{t,1}-\beta _{13}S_{t,3} +\varrho _{t}\right] , \end{aligned}$$
(9)

where \(\varvec{\beta }=\left\{ \beta _{i}\right\} _{i=1}^{14}\) are the main parameters of the MDP; \(\left\{ \xi _{t,i}\right\} _{i=1}^{p}\sim \mathcal {N}\left( 0,\sigma _{s}^{2}\right) \) is the noise in the state model (8) and \(\varrho _{t}\sim \mathcal {N}\left( 0,\sigma _{r}^{2}\right) \) is the noise in the reward model (9). To mimic N users that are similar but not identical, we need N different \(\varvec{\beta }\)s, each of which is similar to a set of others. Formally, there are two steps to obtain \(\varvec{\beta }\) for the i-th user: (a) select the m-th basic vector \(\varvec{\beta }_{m}^{\text {basic}}\), which determines which group the i-th user belongs to; (b) add noise, \(\varvec{\beta }_{i}=\varvec{\beta }_{m}^{\text {basic}}+\varvec{\delta }_{i}\ \text {for}\ i\in \left\{ 1,2,\cdots ,N_{m}\right\} \), to make each user different from the others, where \(N_{m}\) is the number of users in the m-th group, \(\varvec{\delta }_{i}\sim \mathcal {N}\left( 0,\sigma _{b}\mathbf {I}_{14}\right) \) is the noise and \(\mathbf {I}_{14}\in \mathbb {R}^{14\times 14}\) is the identity matrix. The value of \(\sigma _{b}\) specifies how different the users are. In our experiment, we set \(M=5\) groups, each with \(N_{m}=10\) people, leading to \(N=50\) users in total. The basic \(\varvec{\beta }\)s for the M groups are set as follows

$$\begin{aligned} \varvec{\beta }_{1}^{\text {basic}}&=\left[ 0.40,0.25,0.35,0.65,0.10,0.50,0.22,2.00,0.15,0.20,0.32,0.10,0.45,800\right] \\ \varvec{\beta }_{2}^{\text {basic}}&=\left[ 0.45,0.35,0.40,0.70,0.15,0.55,0.30,2.20,0.25,0.25,0.40,0.12,0.55,700\right] \\ \varvec{\beta }_{3}^{\text {basic}}&=\left[ 0.35,0.30,0.30,0.60,0.05,0.65,0.28,2.60,0.35,0.45,0.45,0.15,0.50,650\right] \\ \varvec{\beta }_{4}^{\text {basic}}&=\left[ 0.55,0.40,0.25,0.55,0.08,0.70,0.26,3.10,0.25,0.35,0.30,0.17,0.60,500\right] \\ \varvec{\beta }_{5}^{\text {basic}}&=\left[ 0.20,0.50,0.20,0.62,0.06,0.52,0.27,3.00,0.15,0.15,0.50,0.16,0.70,450\right] , \end{aligned}$$

Besides, the noise levels are set to \(\sigma _{s}=\sigma _{r}=1\) and \(\sigma _{b}=0.01\). The other parameters are \(p=3\), \(q=4\), \(\zeta _{a}=\zeta _{c}=0.01\). The feature processing for the value estimation \(Q^{\pi _{\theta }}\left( s,a\right) \) is \(\mathbf {x}\left( s,a\right) =\left[ 1,s^{\intercal },a,s^{\intercal }a\right] ^{\intercal }\in \mathbb {R}^{2p+2}\) for all the compared methods.
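The generative model (8)-(9) is straightforward to simulate. The sketch below produces one user's trajectory under the micro-randomized policy \(\mu(1\mid s)=0.5\); taking \(\varSigma\) as the identity and treating \(\sigma_b\) as a standard deviation are simplifying assumptions of this illustration. make_users draws the N per-user parameter vectors around the group-level \(\varvec{\beta}^{\text{basic}}\)s.

```python
import numpy as np

def simulate_user(beta, T=42, p=3, sigma_s=1.0, sigma_r=1.0, rng=None):
    """One trajectory from the MDP (8)-(9) under the random policy mu(1 | s) = 0.5.

    beta is the user's 14-dimensional parameter vector (beta_1 ... beta_14 in the paper,
    stored 0-indexed here).  Returns a list of (s, a, r, s_next) tuples.
    """
    rng = np.random.default_rng() if rng is None else rng
    s = rng.normal(size=p)                          # S_0 ~ N_p(0, Sigma); Sigma = I assumed here
    traj = []
    for _ in range(T):
        a = int(rng.integers(0, 2))                 # micro-randomized trial: P(A_t = 1) = 0.5
        # reward (9): depends on the current state and the action just taken
        rho = rng.normal(0.0, sigma_r)
        r = beta[13] * (beta[7] + a * (beta[8] + beta[9] * s[0] + beta[10] * s[1])
                        + beta[11] * s[0] - beta[12] * s[2] + rho)
        # transition (8): next state depends on the current state and action
        xi = rng.normal(0.0, sigma_s, size=p)
        s_next = np.empty(p)
        s_next[0] = beta[0] * s[0] + xi[0]
        s_next[1] = beta[1] * s[1] + beta[2] * a + xi[1]
        s_next[2] = beta[3] * s[2] + beta[4] * s[2] * a + beta[5] * a + xi[2]
        if p > 3:                                   # S_{t,j} = beta_7 S_{t-1,j} + xi_{t,j}, j = 4..p
            s_next[3:] = beta[6] * s[3:] + xi[3:]
        traj.append((s, a, r, s_next))
        s = s_next
    return traj

def make_users(basic_betas, n_per_group=10, sigma_b=0.01, rng=None):
    """beta_i = beta_m^basic + delta_i with delta_i ~ N(0, sigma_b * I_14)."""
    rng = np.random.default_rng() if rng is None else rng
    return [b + rng.normal(0.0, sigma_b, size=14) for b in basic_betas for _ in range(n_per_group)]
```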

Fig. 1. Average reward of the three compared RL methods: (a) Pooled-RL, (b) Separ-RL, (c) Gr-RL\(_{K=3}\) and Gr-RL\(_{K=7}\). The left sub-figure shows the results when the trajectory is short, i.e. \(T=42\); the right one shows the results when \(T=100\). A larger value is better.

4.2 Evaluation Metric and Results

In the experiments, the expectation of the long-run average reward (ElrAR) \(\mathbb {E}\left[ \eta ^{\pi _{\hat{\theta }}}\right] \) is used to evaluate the quality of a learned policy \(\pi _{\hat{\theta }}\) [9, 10]. Intuitively, in the HeartSteps application ElrAR measures the average number of steps a user takes each day when provided with interventions via the learned policy \(\pi _{\hat{\theta }}\). Specifically, ElrAR is obtained in two steps [10]: (a) get \(\eta ^{\pi _{\hat{\theta }}}\) for each user by averaging the rewards over the last 4,000 elements of a long-run trajectory of 5,000 tuples; (b) ElrAR \(\mathbb {E}\left[ \eta ^{\pi _{\hat{\theta }}}\right] \) is obtained by averaging the \(\eta ^{\pi _{\hat{\theta }}}\)'s over all users.
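One possible implementation of this evaluation protocol is sketched below; it reuses the pi_theta sketch from Sect. 4 and assumes a hypothetical one-step simulator step(s, a, beta, rng) built from Eqs. (8)-(9).

```python
import numpy as np

def long_run_average_reward(theta, beta, horizon=5000, last=4000, p=3, rng=None):
    """eta^{pi_theta} for one user: roll the learned policy out on this user's MDP for
    `horizon` steps and average the rewards over the last `last` steps."""
    rng = np.random.default_rng() if rng is None else rng
    s, rewards = rng.normal(size=p), []
    for _ in range(horizon):
        a = rng.choice([0, 1], p=pi_theta(theta, s))   # act according to the learned policy
        r, s = step(s, a, beta, rng)                   # one transition of the MDP (8)-(9), hypothetical helper
        rewards.append(r)
    return np.mean(rewards[-last:])

# ElrAR: average eta over all N users; for Gr-RL, theta is the policy of the user's group.
# elrar = np.mean([long_run_average_reward(thetas[labels[n]], betas[n]) for n in range(N)])
```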

Table 1. The average reward (ElrAR) of the compared RL methods when the discount factor \(\gamma \) changes from 0 to 0.95: (a) Pooled-RL, (b) Separ-RL, (c) Gr-RL\(_{K=3}\) and Gr-RL\(_{K=7}\). A larger value is better. The best value is shown in bold and the second-best value is also highlighted.

The experimental results are summarized in Table 1 and Fig. 1, comparing three RL methods: (a) Pooled-RL, (b) Separ-RL, (c) Gr-RL\(_{K=3}\) and Gr-RL\(_{K=7}\). \(K=3,7\) is the number of cluster centers used in our algorithm, which is deliberately set different from the true number of groups \(M=5\) to show that Gr-RL does not require the true value of M. Table 1 has two sub-tables. The top sub-table summarizes the results of the compared methods under six settings of the discount factor \(\gamma \) when the trajectory is short, i.e. \(T=42\), while the bottom one shows the results when the trajectory is long, i.e. \(T=100\). Each row shows the results under one discount factor \(\gamma =0,\cdots ,0.95\); the last row shows the average performance over the six \(\gamma \) settings.

As we can see, Gr-RL\(_{K=3}\) and Gr-RL\(_{K=7}\) generally perform similarly and are always among the best. These results demonstrate that our method does not require the true number of groups and is robust to the value of K. On average, the proposed method improves the ElrAR by 82.4 and 80.3 steps when \(T=42\), and by 49.8 and 51.7 steps when \(T=100\), compared with the best of the competing methods, i.e. Separ-RL. There are two interesting observations: (1) the improvement of our method decreases as the trajectory length T increases; (2) when the trajectory is short, i.e. \(T=42\), it is better to use a small K, which emphasizes enriching the data set, whereas when the trajectory is long, i.e. \(T=100\), it is better to use a large K, which keeps the data for each RL model simple.

5 Conclusions and Discussion

In this paper, we propose a novel group-driven RL method for mHealth. Compared with the state-of-the-art RL methods for mHealth, it is based on a more practical assumption that admits discrepancies between users and assumes that a user is similar to some (but not all) other users. The proposed method balances the conflicting goals of reducing the discrepancy among pooled users and increasing the number of samples for each RL learning process. Extensive experimental results verify that our method gains obvious advantages over the state-of-the-art RL methods in mHealth.