Abstract
There is a growing interest in using reinforcement learning (RL) to personalize sequences of treatments in digital health to support users in adopting healthier behaviors. Such sequential decision-making problems involve decisions about when to treat and how to treat based on the user’s context (e.g., prior activity level, location, etc.). Online RL is a promising data-driven approach for this problem as it learns based on each user’s historical responses and uses that knowledge to personalize these decisions. However, to decide whether the RL algorithm should be included in an “optimized” intervention for real-world deployment, we must assess the data evidence indicating that the RL algorithm is actually personalizing the treatments to its users. Due to the stochasticity in the RL algorithm, one may get a false impression that it is learning in certain states and using this learning to provide specific treatments. We use a working definition of personalization and introduce a resampling-based methodology for investigating whether the personalization exhibited by the RL algorithm is an artifact of the RL algorithm stochasticity. We illustrate our methodology with a case study by analyzing the data from a physical activity clinical trial called HeartSteps, which included the use of an online RL algorithm. We demonstrate how our approach enhances data-driven truth-in-advertising of algorithm personalization both across all users as well as within specific users in the study.
Data availability
Under the current data policies for HeartSteps V2/V3, the research team cannot make the data publicly available.
Code availability
The code used for generating resampled trajectories and reproducing the plots is available at the following link.
Notes
This is our informal working definition of personalization. In Sect. 2.3 we formally define personalization and ways to measure personalization.
Details on this feature and others are provided further in Table 1.
We remark that throughout this paper, we use resampling and resimulations interchangeably. In particular, our methodology resimulates user trajectories that we generate by resampling states and rewards using generative models, and resampling actions by re-running the RL algorithm.
Note that the underlying problem might restrict the allowed actions depending on the value of s. For example, in HeartSteps, sending a notification is not allowed when the user is driving.
References
Albers, N., Neerincx, M. A., & Brinkman, W.-P. (2022). Addressing people’s current and future states in a reinforcement learning algorithm for persuading to quit smoking and to be physically active. PLoS ONE, 17(12), e0277295.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256. https://doi.org/10.1023/A:1013689704352
Auer, P., & Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1–2), 55–65. https://doi.org/10.1007/s10998-010-3055-6
Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics, 6, 679–684.
Bibaut, A., Chambaz, A., Dimakopoulou, M., Kallus, N., & Laan, M. (2021). Post-contextual-bandit inference.
Boger, J., Poupart, P., Hoey, J., Boutilier, C., Fernie, G. R., & Mihailidis, A. (2005). A decision-theoretic approach to task assistance for persons with dementia. In: Kaelbling, L.P., Saffiotti, A. (eds.), IJCAI-05, Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30–August 5, 2005, pp. 1293–1299. Professional Book Center, UK. http://ijcai.org/Proceedings/05/Papers/1186.pdf.
Boruvka, A., Almirall, D., Witkiewitz, K., & Murphy, S. A. (2018). Assessing time-varying causal effect moderation in mobile health. Journal of the American Statistical Association, 113(523), 1112–1121.
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. The Journal of Machine Learning Research, 2, 499–526.
Buja, A., Cook, D., & Swayne, D. F. (1996). Interactive high-dimensional data visualization. Journal of Computational and Graphical Statistics. https://doi.org/10.2307/1390754
Dempsey, W., Liao, P., Klasnja, P., Nahum-Shani, I., & Murphy, S. A. (2015). Randomised trials for the fitbit generation. Significance, 12(6), 20–23. https://doi.org/10.1111/j.1740-9713.2015.00863.x
Ding, P., Feller, A., & Miratrix, L. (2016). Randomization inference for treatment effect variation. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 78, 655–671.
Dwaracherla, V., Lu, X., Ibrahimi, M., Osband, I., Wen, Z., & Roy, B. V. (2020). Hypermodels for exploration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, Addis Ababa. https://openreview.net/forum?id=ryx6WgStPB.
Dwivedi, R., Tian, K., Tomkins, S., Klasnja, P., Murphy, S., & Shah, D. (2022). Counterfactual inference for sequential experiments.
Eckles, D., & Kaptein, M. (2019). Bootstrap Thompson sampling and sequential decision problems in the behavioral sciences. SAGE Open. https://doi.org/10.1177/2158244019851675
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. Monographs on Statistics and applied Probability. https://doi.org/10.1201/9780429246593
Elmachtoub, A. N., McNellis, R., Oh, S., & Petrik, M. (2017). A practical method for solving contextual bandit problems using decision trees. https://doi.org/10.48550/arxiv.1706.04687.
Fisher, R. A. (1935). The design of experiments. Oliver and Boyd.
Forman, E. M., Berry, M. P., Butryn, M. L., Hagerman, C. J., Huang, Z., Juarascio, A. S., LaFata, E. M., Ontañón, S., Tilford, J. M., & Zhang, F. (2023). Using artificial intelligence to optimize delivery of weight loss treatment: Protocol for an efficacy and cost-effectiveness trial. Contemporary Clinical Trials, 124, 107029.
Forman, E. M., Kerrigan, S. G., Butryn, M. L., Juarascio, A. S., Manasse, S. M., Ontañón, S., Dallal, D. H., Crochiere, R. J., & Moskow, D. (2019). Can the artificial intelligence technique of reinforcement learning use continuously-monitored digital data to optimize treatment for weight loss? Journal of Behavioral Medicine, 42(2), 276–290.
Gelman, A. (2004). Exploratory data analysis for complex models. Journal of Computational and Graphical Statistics. https://doi.org/10.1198/106186004X11435
Good, P. I. (2006). Resampling methods. Springer.
Hadad, V., Hirshberg, D. A., Zhan, R., Wager, S., & Athey, S. (2019). Confidence intervals for policy evaluation in adaptive experiments.
Hanna, J. P., Stone, P., & Niekum, S. (2017). Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the international joint conference on autonomous agents and multiagent systems, AAMAS, vol. 1.
Hao, B., Abbasi-Yadkori, Y., Wen, Z., & Cheng, G. (2019). Bootstrapping upper confidence bound. In Advances in neural information processing systems, vol. 32.
Hao, B., Ji, X., Duan, Y., Lu, H., Szepesvari, C., & Wang, M. (2021). Bootstrapping fitted q-evaluation for off-policy inference. In Proceedings of the 38th international conference on machine learning, vol. 139.
Hoey, J., Poupart, P., Boutilier, C., & Mihailidis, A.(2005). POMDP models for assistive technology. In: Bickmore, T.W. (ed.), Caring machines: AI in Eldercare, Papers from the 2005 AAAI Fall Symposium, Arlington, Virginia, USA, November 4-6, 2005. AAAI Technical Report, vol. FS-05-02, pp. 51–58. AAAI Press, Washington, D.C. https://www.aaai.org/Library/Symposia/Fall/2005/fs05-02-009.php.
Liang, D., Charlin, L., McInerney, J., & Blei, D. M. (2016). Modeling user exposure in recommendation. In Proceedings of the 25th international conference on World Wide Web. WWW ’16, pp. 951–961. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE. https://doi.org/10.1145/2872427.2883090.
Liao, P., Greenewald, K., Klasnja, P., & Murphy, S. (2020). Personalized HeartSteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1), 1–22. https://doi.org/10.1145/3381007
Piette, J. D., Newman, S., Krein, S. L., Marinec, N., Chen, J., Williams, D. A., Edmond, S. N., Driscoll, M., LaChappelle, K. M., Maly, M., et al. (2022). Artificial intelligence (AI) to improve chronic pain care: Evidence of AI learning. Intelligence-Based Medicine, 6, 100064.
Qian, T., Yoo, H., Klasnja, P., Almirall, D., & Murphy, S. A. (2021). Estimating time-varying causal excursion effects in mobile health with binary outcomes. Biometrika, 108(3), 507–527.
Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W. W., & Cheng, G. (2021). Online bootstrap inference for policy evaluation in reinforcement learning.
Rosenbaum, P. (2002). Observational studies. Springer.
Russo, D., & Roy, B. V. (2014). Learning to optimize via information-directed sampling. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 1583–1591. https://proceedings.neurips.cc/paper/2014/hash/301ad0e3bd5cb1627a2044908a42fdc2-Abstract.html.
Russo, D., Roy, B. V., Kazerouni, A., Osband, I., & Wen, Z. (2018). A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1), 1–96. https://doi.org/10.1561/2200000070
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. IEEE Transactions on Neural Networks. https://doi.org/10.1109/tnn.1998.712192
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294.
Tomkins, S., Liao, P., Yeung, S., Klasnja, P., & Murphy, S. (2019). Intelligent pooling in Thompson sampling for rapid personalization in mobile health.
Tukey, J. W. (1977). Exploratory data analysis (Vol. 2). Reading.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Nauka.
Wang, C.-H., Yu, Y., Hao, B., & Cheng, G. (2020). Residual bootstrap exploration for bandit algorithms.
White, M., & White, A. (2010). Interval estimation for reinforcement-learning algorithms in continuous-state domains. In Advances in neural information processing systems 23: 24th annual conference on neural information processing systems 2010, NIPS 2010.
Yang, J., Eckles, D., Dhillon, P., & Aral, S. (2020). Targeting for long-term outcomes. arXiv:2010.15835.
Yom-Tov, E., Feraru, G., Kozdoba, M., Mannor, S., Tennenholtz, M., & Hochberg, I. (2017). Encouraging physical activity in patients with diabetes: Intervention using a reinforcement learning system. Journal of Medical Internet Research, 19(10), 338. https://doi.org/10.2196/JMIR.7994
Zhang, K. W., Janson, L., & Murphy, S. A. (2020). Inference for batched bandits.
Zhang, K. W., Janson, L., & Murphy, S. A. (2023). Statistical inference after adaptive sampling for longitudinal data.
Zhou, M., Mintz, Y., Fukuoka, Y., Goldberg, K., Flowers, E., Kaminsky, P. M., Castillejo, A., & Aswani, A. (2018). Personalizing mobile fitness apps using reinforcement learning. In: Said, A., Komatsu, T. (eds.), Joint Proceedings of the ACM IUI 2018 Workshops Co-located with the 23rd ACM Conference on Intelligent User Interfaces (ACM IUI 2018), Tokyo, Japan, March 11. CEUR Workshop Proceedings, vol. 2068. CEUR-WS.org, Tokyo (2018). https://ceur-ws.org/Vol-2068/humanize7.pdf.
Funding
RK is supported by NIGMS Biostatistics Training Grant Program under Grant No. T32GM135117. SG and SM acknowledge support by NIH/NIDA P50DA054039, and NIH/NIBIB and OD P41EB028242. RD acknowledges support by NSF DMS-2022448, and DSO National Laboratories grant DSO-CO21070. RD and SM also acknowledge support by NSF under Grant No. DMS-2023528 for the Foundations of Data Science Institute (FODSI). PK acknowledges support by NIH NHLBI R01HL125440 and 1U01CA229445. SM also acknowledges support by NIH/NCI U01CA229437, and NIH/NIDCR UH3DE028723. KWZ is supported by the Siebel Foundation and by NSF CBET-2112085 and by the NSF Graduate Research Fellowship Program under Grant No. DGE1745303.
Author information
Authors and Affiliations
Contributions
All authors contributed to conceptualization and methodology. Data Preparation and Software: PC, PL, RK, and SG. Analysis: RK. Writing: RK, RD, SM, and SG. Funding and Administration: PK, SM. Supervision: RD, KZ, SM. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Conflict of interest
KWZ worked as a summer intern at Apple. The other authors declare no conflict of interest.
Ethical approval
We adhere to the policies outlined in https://www.springer.com/gp/editorial-policies/ethical-responsibilities-of-authors.
Consent to participate
The reported study was approved by the Kaiser Permanente Washington Institutional Review Board (IRB 1257484-16). All participants completed a written informed consent to take part in the study.
Consent for publication
Adhering to the IRB and consent forms, individual level data that could be used for identification are not released.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Work done by authors Peng Liao and Prasidh Chhabria while they were at Harvard University.
Editors: Emma Brunskill, Minmin Chen, Omer Gottesman, Lihong Li, Yuxi Li, Yao Liu, Zonging Lu, Niranjani Prasad, Zhiwei Qin, Csaba Szepesvari, Matthew Taylor.
Appendices
Details on the RL framework for HeartSteps
We now provide further details regarding the RL framework for the HeartSteps study discussed in Sect. 3.2 (Table 1).
Clipping probabilities
The function h appearing in (10) is given by
Prior and posterior formulation
Using the notation \(\theta ^\top \triangleq (\alpha _0^\top , \alpha _1^\top , \beta ^\top )\), the prior for \(\theta\) was specified as \(\mathcal {N}(\mu _0, \Sigma _0)\), where
and \(\{ \mu _{\alpha _0}, \mu _\beta , \Sigma _{\alpha _0}, \Sigma _{\beta }\}\) were computed from the prior study (see Liao et al. 2020, Sect. 6 for details on how priors were constructed, and (17) for the specific values used by us; note that the HeartSteps team decided to update the priors from those presented in Liao et al. (2020)). Given the Gaussian prior and the Gaussian working model, the posterior for \(\theta\) on day d is also Gaussian and is given by \(\mathcal {N}(\overline{\mu }_d, \overline{\Sigma }_d)\), where these posterior parameters are recursively updated as
where \(\phi (S_t, A_t)^\top \triangleq [g(S_t)^\top , \pi _t f(S_t)^\top , (A_t-\pi _t)f(S_t)^\top ]\) collects all the feature vectors from the working model (8). The updates (15a) and (15b) are denoted by PosteriorUpdate in Algorithm 2. For k-dimensional \(\beta\), the posterior parameters \(\mu _{d, \beta }, \Sigma _{d, \beta }\) for \(\beta\) are respectively given by the last k entries of \(\overline{\mu }_d\) and the \(k\times k\) sub-matrix formed by taking the last k columns and rows of \(\overline{\Sigma }_d\).
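The PosteriorUpdate step is the standard conjugate update for a Gaussian linear model. The following is a minimal sketch under simplifying assumptions (a known reward noise variance `sigma2`, and variable names of our choosing; the deployed algorithm differs in these details):

```python
import numpy as np

def posterior_update(mu_prev, Sigma_prev, Phi, R, sigma2):
    """One night's PosteriorUpdate for a Gaussian working model.

    mu_prev, Sigma_prev: previous posterior N(mu, Sigma) over theta.
    Phi: (n, p) matrix stacking the rows phi(S_t, A_t)^T for the day's
         decision times.
    R: (n,) vector of the corresponding observed rewards.
    sigma2: reward noise variance (assumed known in this sketch).
    """
    Sigma_prev_inv = np.linalg.inv(Sigma_prev)
    # Precision accumulates the outer products of the day's features.
    Sigma_new = np.linalg.inv(Sigma_prev_inv + Phi.T @ Phi / sigma2)
    # Mean combines the previous mean with the day's reward information.
    mu_new = Sigma_new @ (Sigma_prev_inv @ mu_prev + Phi.T @ R / sigma2)
    return mu_new, Sigma_new
```

The marginal posterior for \(\beta\) is then read off from the last k entries of `mu_new` and the trailing \(k\times k\) block of `Sigma_new`, as described above.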
Model estimates used by ParaSim for resampling trajectories
For a user trajectory \((S_{t}, A_{t}, R_{t})_{t= 1}^{T}\) from the reward model (8), we estimate the parameters \((\alpha , \beta )\) using the updates (15) albeit without action centering. Hence the estimates \((\hat{\alpha }_T ^\top , \hat{\beta }_T^\top )\) (that inform the model parameters used by ParaSim after suitable modifications in Sect. 3.3) are given by
where \(\widetilde{\phi }(S_t, A_t)^\top \triangleq \left[ g(S_t)^\top , A_t f(S_t)^\top \right]\).
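These estimates amount to a regularized least-squares fit of the rewards on the non-centered features \(\widetilde{\phi }(S_t, A_t)\). A sketch, using a scalar prior precision in place of the full prior covariance (function and variable names are ours, not from the deployed code):

```python
import numpy as np

def estimate_reward_model(g, f, A, R, prior_prec=1.0, sigma2=1.0):
    """Estimate (alpha, beta) without action centering.

    g: (T, p) state features g(S_t); f: (T, q) features f(S_t);
    A: (T,) binary actions; R: (T,) rewards.
    Uses phi_tilde(S_t, A_t) = [g(S_t), A_t * f(S_t)].
    prior_prec is a scalar stand-in for the inverse prior covariance.
    """
    Phi = np.hstack([g, A[:, None] * f])                 # (T, p+q)
    M = prior_prec * np.eye(Phi.shape[1]) + Phi.T @ Phi / sigma2
    theta_hat = np.linalg.solve(M, Phi.T @ R / sigma2)
    alpha_hat, beta_hat = theta_hat[:g.shape[1]], theta_hat[g.shape[1]:]
    return alpha_hat, beta_hat
```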
Prior means and variances
We now summarize the exact values of prior parameters used by the RL algorithm. For \(\ell \in \mathbb {N}\), let \(\textrm{diag}(a_1, \ldots , a_{\ell }) \in \mathbb {R}^{\ell \times \ell }\) denote an \(\ell \times \ell\) diagonal matrix with its j-th diagonal entry equal to \(a_j\) for \(j=1, \ldots , \ell\). Then the mean and variance parameters used in (14) are given by
where the features of \(\alpha _0\) are ordered as (Intercept, temperature, prior 30 min step count, yesterday step count, \(\texttt {dosage}\), \(\texttt {engagement}\), \(\texttt {location}\), \(\texttt {variation}\)) and that for \(\beta\) and \(\alpha _1\) are ordered as (Intercept, \(\texttt {dosage}\), \(\texttt {engagement}\), \(\texttt {location}\), \(\texttt {variation}\)).
Details on \(\texttt {Score\_int}_{}\) computation for HeartSteps
We now describe the smoothed versions of \(\texttt {Score\_int}_{1}\) (1) and \(\texttt {Score\_int}_{2,\textsf{z}}\) (3) that we use to add stability to our HeartSteps results in Sects. 3.4 and 3.5.
At a high level, we use the following steps: (i) We use moving windows to average the advantage forecasts and work with an averaged forecast on a daily scale, i.e., for \(d=\lceil {(t-1)/5}\rceil\) and not for each decision time \(t\in [T]\) as in (1) and (3). (ii) While computing the interestingness scores, we omit days when the quality of data is not good due to low availability or low diversity of features. (iii) We compute the forecasts without changing any state feature from the observed data other than \(\texttt {dosage}\). That is, if the observed feature value \(\textsf{z} _t=1\), we do not compute a counterfactual forecast by artificially forcing \(\textsf{z} _t = 0\). (iv) Finally, we do not consider a user’s interestingness score if they have only a few days of good data.
We now describe these steps in detail for a user with a total of T decision times in their data trajectory (i.e., \(D\triangleq \lfloor {T/5}\rfloor\) days of data). We note that the total number of decision times might vary across users.
1. Sliding window: For each day \(d \in \{ 1, \ldots , D\}\), we define a sliding window \(W_d\) using all 5 decision times on day d when computing \(\texttt {Score\_int}_{1}\) and all 5 decision times on each of the days \(\{ d-1, d, d+1\}\) (total 15 decision times) when computing \(\texttt {Score\_int}_{2, \textsf{z}}\). That is,
$$\begin{aligned} W_d \triangleq {\left\{ \begin{array}{ll} \{ 5(d-1)+1, 5(d-1)+2, \ldots , 5d\} \cap [T] &{}\quad \text {for}\quad \texttt {Score\_int}_{1}, \\ \{ 5(d-2)+1, 5(d-2)+2, \ldots , 5(d+1)\} \cap [T] &{}\quad \text {for}\quad \texttt {Score\_int}_{2, \textsf{z}}. \end{array}\right. } \end{aligned}$$
2. Characterizing a good data day: Next, when considering \(\texttt {Score\_int}_{1}\), we define an indicator variable \(G_{d, 1}\) to denote a good day. It is set to 1 if the following two conditions hold: (a) the user was available for at least 2 decision times in \(W_d\), i.e., \(\sum _{t\in W_d} I_t \ge 2\), and (b) the RL algorithm posterior was updated on the night of day \(d-1\); we impose this additional constraint to deal with the real-time and missing data update issues. We set \(G_{d, 1}=0\) in all other cases. When considering \(\texttt {Score\_int}_{2, \textsf{z}}\), we define the variable \(G_{d, 2, \textsf{z}}\) to denote a good data day based on whether the user’s observed states exhibit enough diversity in the value of the variable \(\textsf{z}\) for the decision times in \(W_d\). In particular, we set \(G_{d, 2, \textsf{z}}=1\) when the following two conditions hold: (a) the feature \(\textsf{z}\) takes each of the values 1 and 0 at least twice over the decision times in \(W_d\) when the user was available for randomization (\(I_t=1\)), i.e.,
$$\begin{aligned} \sum _{t\in W_d} I_t \textsf{z} _t \ge 2 \quad \text {and}\quad \sum _{t\in W_d} I_t (1-\textsf{z} _t) \ge 2, \end{aligned}$$where \(\textsf{z} _t\) denotes the value of the variable \(\textsf{z}\) for the user at decision time t, and (b) the RL algorithm posterior was updated on the night of at least one of the days in \(\{ d-1, d, d+1\}\). In all other cases, we set \(G_{d, 2, \textsf{z}}=0\).
3. Interestingness score for a user trajectory: We consider a user for interestingness only if the fraction of good days is greater than a certain threshold, i.e.,
$$\begin{aligned} \frac{1}{D} \sum _{d=1}^{D} G_{d,1 } \ge (1-\gamma ) \ \text {for}\ \texttt {Score\_int}_{1} \quad \text {or}\quad \frac{1}{D} \sum _{d=1}^{D} G_{d, 2, \textsf{z}} \ge (1-\gamma ) \ \text {for}\ \texttt {Score\_int}_{2, \textsf{z}}, \end{aligned}$$ (18)
for a suitable \(\gamma \in (0, 1)\). (Note that increasing the value of \(\gamma\) lowers the cutoff for a user to become eligible for being considered for interestingness.) For such a user, we define the interestingness scores as follows:
$$\begin{aligned} \texttt {Score\_int}_{1} (\mathcal {U})&\triangleq \frac{1}{\sum _{d=1}^{D} G_{d, 1}} \sum _{d=1}^{D} G_{d, 1} \varvec{1}\left( \frac{\sum _{t\in W_d} I_t \hat{\Delta }_t(S_t) }{\sum _{t\in W_d}I_t}> 0 \right) \\ \texttt {Score\_int}_{2, \textsf{z}} (\mathcal {U})&\triangleq \frac{1}{\sum _{d=1}^{D} G_{d, 2, \textsf{z}}} \sum _{d=1}^{D} G_{d, 2, \textsf{z}} \varvec{1}\left( \frac{\sum _{t\in W_d} I_t \textsf{z} _t \hat{\Delta }_t(S_t) }{\sum _{t\in W_d}I_t \textsf{z} _t} > \frac{\sum _{t\in W_d}I_t (1-\textsf{z} _t) \hat{\Delta }_t(S_t) }{\sum _{t\in W_d}I_t (1-\textsf{z} _t)} \right) , \end{aligned}$$where we multiply by indicators \(G_{d, 1}\) and \(G_{d, 2, \textsf{z}}\) to include only “good days” in our score computations. Note that \(\frac{\sum _{t\in W_d} I_t \textsf{z} _t \hat{\Delta }_t(S_t) }{\sum _{t\in W_d}I_t \textsf{z} _t}\) is the stable proxy (without counterfactual imputation) for the quantity \(\hat{\Delta }_{t}(S_t(\textsf{z} =1))\) in (3).
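The three steps above, specialized to \(\texttt {Score\_int}_{1}\), can be sketched as follows (array and function names are ours; the posterior-update condition of step 2 is passed in as a per-day boolean, and indices are 0-based):

```python
import numpy as np

def score_int_1(delta_hat, I, updated, gamma=0.4):
    """Smoothed Score_int_1 for one user.

    delta_hat: (T,) advantage forecasts Delta_t(S_t).
    I: (T,) availability indicators I_t.
    updated: (D,) booleans; updated[d-1] is True if the posterior was
        updated on the night of day d-1.
    Returns the score, or None if the user fails requirement (18).
    """
    T = len(delta_hat)
    D = T // 5
    good, positive = [], []
    for d in range(1, D + 1):
        W = np.arange(5 * (d - 1), 5 * d)          # day d's 5 decision times
        g = (I[W].sum() >= 2) and updated[d - 1]   # good data day G_{d,1}
        good.append(g)
        if g:
            # Availability-weighted average forecast over the window.
            avg = (I[W] * delta_hat[W]).sum() / I[W].sum()
            positive.append(avg > 0)
    if np.mean(good) < 1 - gamma:                  # requirement (18)
        return None
    return np.mean(positive)                       # fraction of good days with avg > 0
```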
Remark 3
Note that we omit from our results all users who do not satisfy the good day requirement (18). The number of omitted users depends on the value of \(\gamma\); see Fig. 10 for the histograms of \(\sum _{d}G_{d, 1}/D\) and \(\sum _{d}G_{d, 2, \textsf{z}}/D\) across the 91 users in HeartSteps. For the results in Sects. 3.4 and 3.5, we use \(\gamma =0.4\), which allows 63, 60, 12, and 43 users to be considered, respectively, for interestingness of type 1, and of type 2 for the features \(\texttt {variation}\), \(\texttt {location}\), and \(\texttt {engagement}\).
Another look at user 2’s advantage forecasts
Figure 11a reproduces the advantage forecasts for user 2 from Fig. 1b. In addition, panels (b) and (c) of Fig. 11 show the analogs of the panel (a), where user 2’s standardized advantage forecasts are color-coded based on the values of \(\texttt {location}\) and \(\texttt {engagement}\) respectively. Overall, we observe from the three panels in Fig. 11 that user 2 does not appear interesting of type 2 for \(\textsf{z} \in \{ \texttt {location}, \texttt {engagement} \}\) since the standardized advantage forecasts are not well separated when \(\textsf{z} = 0\) versus \(\textsf{z} = 1\) like that for \(\textsf{z} = \texttt {variation}\) in panel (a) (or equivalently in Fig. 1b). In particular, for this user, we have \(\texttt {Score\_int}_{2, \texttt {location}} = 0.38\), and \(\texttt {Score\_int}_{2, \texttt {engagement}} =0.38\), while \(\texttt {Score\_int}_{2,\texttt {variation}} =0\).
Deeper dive into interestingness of type 1 for HeartSteps
To further refine the conclusions from Fig. 3, we consider one-sided variants of the definition (2) for the number of interesting users. For the reader’s convenience, we reproduce Fig. 3 in panel (a) of Fig. 12, alongside the corresponding results with one-sided interesting user counts, defined by counting the users with \(\texttt {Score\_int}_{1} \ge 0.9\) and \(\texttt {Score\_int}_{1} \le 0.1\) separately in panels (b) and (c).
From Fig. 12b, we find that in the original data 17 users exhibit \(\texttt {Score\_int}_{1} \ge 0.9\); we denote this user count by \(\texttt {\#User\_int}_{1} ^+\). However, the value of \(\texttt {\#User\_int}_{1} ^+\) is always significantly smaller than 17 across the 500 trials with resampled trajectories. In Table 2, we denote this analysis as Type \(1^+\).
On the other hand, Fig. 12c shows that one user exhibits \(\texttt {Score\_int}_{1} \le 0.1\) in the original data; we denote this count by \(\texttt {\#User\_int}_{1} ^-\). We also find that all 500 trials have \(\texttt {\#User\_int}_{1} ^->1\). In Table 2, we denote this analysis as Type \(1^-\).
Overall, we conclude that the data presents evidence in favor of the claim that the RL algorithm is potentially personalizing by learning that many users benefit from being sent an activity message. However, many users might exhibit \(\texttt {Score\_int}_{1} \le 0.1\), so that sending the message might appear less beneficial than not sending it for these users, just due to algorithmic stochasticity. Consequently, the value of \(\texttt {\#User\_int}_{1}\)—the number of interesting users with \(|\texttt {Score\_int}_{1}-0.5| \ge 0.4\), which is also equal to \(\texttt {\#User\_int}_{1} ^+ + \texttt {\#User\_int}_{1} ^-\)—can be as high as 18 (the observed value in the original data) due to algorithmic stochasticity.
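The one-sided and two-sided counts, and their comparison against the resampled trials, can be sketched as follows (function names are ours; `scores` holds one \(\texttt {Score\_int}_{1}\) value per eligible user and `resampled_scores` is a trials-by-users array from the resimulations):

```python
import numpy as np

def count_interesting(scores, delta=0.4):
    """One-sided and two-sided interesting-user counts from score values."""
    scores = np.asarray(scores, dtype=float)
    n_plus = np.sum(scores >= 0.5 + delta)    # #User_int^+ (score >= 0.9)
    n_minus = np.sum(scores <= 0.5 - delta)   # #User_int^- (score <= 0.1)
    return n_plus, n_minus, n_plus + n_minus  # last entry is #User_int

def chance_fraction(observed_scores, resampled_scores, delta=0.4):
    """Fraction of resampled trials with at least as many interesting users
    as observed in the original data."""
    obs_count = count_interesting(observed_scores, delta)[2]
    trial_counts = [count_interesting(trial, delta)[2]
                    for trial in resampled_scores]
    return np.mean(np.asarray(trial_counts) >= obs_count)
```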
Stability of conclusions with respect to the choice of \((\delta , \gamma )\)
Next, we investigate the stability of the claims made above for \(\texttt {Score\_int}_{1}\) and the one-sided variants to the choice of hyper-parameters \(\delta\) and \(\gamma\), appearing in (2) and (18), respectively. Note that for a given definition of \(\texttt {Score\_int}_{}\), increasing \(\gamma\) in (18) for a fixed \(\delta\) in (2) allows more users to become eligible for being considered as interesting, both in the original data and the resampled trials. Similarly, decreasing \(\delta\) in (2) for a fixed \(\gamma\) in (18) would typically lead to a larger number of interesting users, both in the original data and the resampled trials.
The results of this exploration for the choices \(\delta \in \{ 0.35, 0.40, 0.45\}\) and \(\gamma \in \{ 0.65, 0.70, 0.75\}\) are presented in Fig. 13. For a given panel, the value in the cell corresponding to the value of \(\delta\) on the horizontal axis and \(\gamma\) on the vertical axis is equal to the fraction of the 500 trials for which the number of interesting users \(\texttt {\#User\_int}_{}\) computed using those hyperparameter choices was at least as large as that in the original data. Looking at Fig. 13b, c, we find that the conclusions drawn from Fig. 12b, c with \((\delta , \gamma )=(0.4, 0.75)\) about \(\texttt {\#User\_int}_{1} ^+\) and \(\texttt {\#User\_int}_{1} ^-\) remain stable even if we slightly perturb the values of \(\delta\) and \(\gamma\). In particular, across the \(3\times 3\) choices for \(\delta\) and \(\gamma\), the value of \(\texttt {\#User\_int}_{1} ^+\) would not appear as high as the observed value in the original data just by chance. On the other hand, the value of \(\texttt {\#User\_int}_{1} ^-\) might appear higher than the observed value in the original data simply due to algorithmic stochasticity. Given the competing nature of these two quantities and the fact that \(\texttt {\#User\_int}_{1} = \texttt {\#User\_int}_{1} ^+ + \texttt {\#User\_int}_{1} ^-\), the resulting fraction of trials with a count at least as high as \(\texttt {\#User\_int}_{1}\) in the original data is quite sensitive to the particular choice of \((\delta , \gamma )\), as Fig. 13a illustrates.
Stability of HeartSteps results for interestingness of type 2
We perform stability analysis for \(\texttt {\#User\_int}_{2,\textsf{z}}\) with respect to the choice of \((\delta , \gamma )\) similarly to that done for \(\texttt {\#User\_int}_{1}\) above in Fig. 13 and provide the results in Fig. 14.
Panels (a), (b), and (c) display the results, respectively, for \(\textsf{z} = \texttt {variation}, \texttt {location}\), and \(\texttt {engagement}\). In a given panel, the value in the cell corresponding to the value of \(\delta\) on the horizontal axis and \(\gamma\) on the vertical axis is equal to the fraction of the 500 trials for which the number of interesting users \(\texttt {\#User\_int}_{2, \textsf{z}}\) computed using those hyperparameter choices was at least as large as that in the original data. Across the \(3\times 3\) choices for \(\delta\) and \(\gamma\), we notice that for interestingness of type 2 for the features \(\texttt {variation}\), \(\texttt {location}\), and \(\texttt {engagement}\), the fraction remains stable around 0, 0, and 1, respectively, the same as the fractions in Fig. 7 for \((\delta , \gamma )=(0.4, 0.75)\).
Interesting users of type 2 for \(\texttt {location}\) and \(\texttt {engagement}\)
We now demonstrate the analysis (like in Figs. 1b and 8) for two different users, who exhibit potential interestingness of type 2 for \(\texttt {location}\) and \(\texttt {engagement}\).
A potentially interesting user of type 2 for \(\texttt {location}\)
Fig. 15 displays the advantage forecasts for a user, whom we call user 3 to distinguish them from the two users associated with Fig. 1. The three panels in Fig. 15 plot user 3’s advantages color-coded by the value of the three features for that user. We find that this user admits \(\texttt {Score\_int}_{2, \texttt {variation}} =0.68\), \(\texttt {Score\_int}_{2, \texttt {location}} =0\), and \(\texttt {Score\_int}_{2, \texttt {engagement}} =0.43\), so that this user would be deemed potentially interesting of type 2 for \(\texttt {location}\) (and not the other features) as per our definition (4).
Next, we evaluate how likely the user graph in Fig. 15b would appear just by chance. Panels (a) and (b) of Fig. 16 visualize two resampled trajectories of user 3 (chosen uniformly at random from user 3’s 500 resampled trajectories) generated under the generative model that there is no differential advantage of sending a message based on the value of \(\texttt {location}\). The color coding is as in Fig. 15b, namely, the forecasts are marked in red triangles if \(\texttt {location}\) = 1 and blue circles if \(\texttt {location}\) = 0. In panel (c) of Fig. 16, we plot the histogram for the \(\texttt {Score\_int}_{2, \texttt {location}}\) for this user across all 500 resampled trajectories and denote the observed value in the original data as a vertical dotted line.
Figure 16a, b show that the resampled trajectories do not appear interesting of type 2 for \(\texttt {location}\) as in Fig. 15b; the two trajectories, respectively, have \(\texttt {Score\_int}_{2,\texttt {location}} =\) 0.50 and 0.42. Moreover, panel (c) shows that the interestingness score of 0, which was observed for user 3 in the original data (Fig. 15b), never appears across any of the resampled trajectories. Thus we can conclude that the data presents evidence that the RL algorithm potentially personalized for user 3 by learning to treat the user differentially based on \(\texttt {location}\), and this personalization would not likely arise simply due to algorithmic stochasticity.
A potentially interesting user of type 2 for \(\texttt {engagement}\)
Fig. 17 displays the advantage forecasts for a user, whom we call user 4 to distinguish them from the three users associated with Figs. 1 and 15. The three panels in Fig. 17 plot user 4’s advantages color-coded by the three features; these admit \(\texttt {Score\_int}_{2, \texttt {variation}} =0.65\), \(\texttt {Score\_int}_{2, \texttt {location}} =0.9\), and \(\texttt {Score\_int}_{2, \texttt {engagement}} =0.037\), respectively. Thus, based on our definition (4), this user is potentially interesting of type 2 for \(\texttt {engagement}\), but not for \(\texttt {variation}\). The user does not satisfy the criterion (\(\gamma = 0.75\) in (18)) for being considered a potentially interesting user for \(\texttt {location}\) due to a lack of diversity in the values taken by their \(\texttt {location}\) feature.
Next, we evaluate how likely the user graph in Fig. 17c would appear just by chance. Panels (a) and (b) of Fig. 18 visualize two resampled trajectories of user 4 (chosen uniformly at random from user 4’s 500 resampled trajectories) generated under the generative model that there is no differential advantage of sending a message based on the value of \(\texttt {engagement}\). The color coding is as in Fig. 17c, namely, the forecasts are marked in red triangles if \(\texttt {engagement}\) = 1 and blue circles if \(\texttt {engagement}\) = 0. In panel (c) of Fig. 18, we plot the histogram for the \(\texttt {Score\_int}_{2, \texttt {engagement}}\) for this user across all 500 resampled trajectories and denote the observed value in the original data as a vertical dotted line.
Figure 18a, b show that the resampled trajectories do not appear interesting of type 2 for \(\texttt {engagement}\) as in Fig. 17c; the two trajectories, respectively, have \(\texttt {Score\_int}_{2,\texttt {engagement}} =\) 0.94, and 0.41. However, panel (c) shows that the interestingness score of 0.037, which was observed for user 4 in the original data (Fig. 17c), appears for around 20% of the resampled trajectories. Thus we can conclude that the data presents evidence that user 4’s interestingness score for \(\texttt {engagement}\) might appear extreme simply due to algorithmic stochasticity.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ghosh, S., Kim, R., Chhabria, P. et al. Did we personalize? Assessing personalization by an online reinforcement learning algorithm using resampling. Mach Learn (2024). https://doi.org/10.1007/s10994-024-06526-x