Abstract
There is a growing interest in using reinforcement learning (RL) to personalize sequences of treatments in digital health to support users in adopting healthier behaviors. Such sequential decision-making problems involve decisions about when to treat and how to treat based on the user’s context (e.g., prior activity level, location, etc.). Online RL is a promising data-driven approach for this problem as it learns based on each user’s historical responses and uses that knowledge to personalize these decisions. However, to decide whether the RL algorithm should be included in an “optimized” intervention for real-world deployment, we must assess the data evidence indicating that the RL algorithm is actually personalizing the treatments to its users. Due to the stochasticity in the RL algorithm, one may get a false impression that it is learning in certain states and using this learning to provide specific treatments. We use a working definition of personalization and introduce a resampling-based methodology for investigating whether the personalization exhibited by the RL algorithm is an artifact of the RL algorithm stochasticity. We illustrate our methodology with a case study by analyzing the data from a physical activity clinical trial called HeartSteps, which included the use of an online RL algorithm. We demonstrate how our approach enhances data-driven truth-in-advertising of algorithm personalization both across all users as well as within specific users in the study.
Data availability
Under the current data policies for HeartSteps V2/V3, the research team cannot make the data publicly available.
Code availability
The code used for generating resampled trajectories and reproducing the plots is available at the following link.
Notes
This is our informal working definition of personalization. In Sect. 2.3 we formally define personalization and ways to measure personalization.
Details on this feature and others are provided further in Table 1.
We remark that throughout this paper, we use resampling and resimulations interchangeably. In particular, our methodology resimulates user trajectories that we generate by resampling states and rewards using generative models, and resampling actions by re-running the RL algorithm.
Note that the underlying problem might restrict the allowed actions depending on the value of s. For example, in HeartSteps, sending a notification is not allowed when the user is driving.
References
Albers, N., Neerincx, M. A., & Brinkman, W.-P. (2022). Addressing people’s current and future states in a reinforcement learning algorithm for persuading to quit smoking and to be physically active. PLoS ONE, 17(12), e0277295.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3), 235–256. https://doi.org/10.1023/A:1013689704352
Auer, P., & Ortner, R. (2010). UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1–2), 55–65. https://doi.org/10.1007/s10998-010-3055-6
Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics, 6, 679–684.
Bibaut, A., Chambaz, A., Dimakopoulou, M., Kallus, N., & Laan, M. (2021). Post-contextual-bandit inference.
Boger, J., Poupart, P., Hoey, J., Boutilier, C., Fernie, G. R., & Mihailidis, A. (2005). A decision-theoretic approach to task assistance for persons with dementia. In: Kaelbling, L.P., Saffiotti, A. (eds.), IJCAI-05, Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30–August 5, 2005, pp. 1293–1299. Professional Book Center, UK. http://ijcai.org/Proceedings/05/Papers/1186.pdf.
Boruvka, A., Almirall, D., Witkiewitz, K., & Murphy, S. A. (2018). Assessing time-varying causal effect moderation in mobile health. Journal of the American Statistical Association, 113(523), 1112–1121.
Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. The Journal of Machine Learning Research, 2, 499–526.
Buja, A., Cook, D., & Swayne, D. F. (1996). Interactive high-dimensional data visualization. Journal of Computational and Graphical Statistics. https://doi.org/10.2307/1390754
Dempsey, W., Liao, P., Klasnja, P., Nahum-Shani, I., & Murphy, S. A. (2015). Randomised trials for the fitbit generation. Significance, 12(6), 20–23. https://doi.org/10.1111/j.1740-9713.2015.00863.x
Ding, P., Feller, A., & Miratrix, L. (2016). Randomization inference for treatment effect variation. Journal of the Royal Statistical Society: Series B: Statistical Methodology, 78, 655–671.
Dwaracherla, V., Lu, X., Ibrahimi, M., Osband, I., Wen, Z., & Roy, B. V. (2020). Hypermodels for exploration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, Addis Ababa. https://openreview.net/forum?id=ryx6WgStPB.
Dwivedi, R., Tian, K., Tomkins, S., Klasnja, P., Murphy, S., & Shah, D. (2022). Counterfactual inference for sequential experiments.
Eckles, D., & Kaptein, M. (2019). Bootstrap Thompson sampling and sequential decision problems in the behavioral sciences. SAGE Open. https://doi.org/10.1177/2158244019851675
Efron, B., & Tibshirani, R. J. (1994). An introduction to the bootstrap. Monographs on Statistics and applied Probability. https://doi.org/10.1201/9780429246593
Elmachtoub, A. N., McNellis, R., Oh, S., & Petrik, M. (2017). A practical method for solving contextual bandit problems using decision trees. https://doi.org/10.48550/arxiv.1706.04687.
Fisher, R. A. (1935). The design of experiments. Oliver and Boyd.
Forman, E. M., Berry, M. P., Butryn, M. L., Hagerman, C. J., Huang, Z., Juarascio, A. S., LaFata, E. M., Ontañón, S., Tilford, J. M., & Zhang, F. (2023). Using artificial intelligence to optimize delivery of weight loss treatment: Protocol for an efficacy and cost-effectiveness trial. Contemporary Clinical Trials, 124, 107029.
Forman, E. M., Kerrigan, S. G., Butryn, M. L., Juarascio, A. S., Manasse, S. M., Ontañón, S., Dallal, D. H., Crochiere, R. J., & Moskow, D. (2019). Can the artificial intelligence technique of reinforcement learning use continuously-monitored digital data to optimize treatment for weight loss? Journal of Behavioral Medicine, 42(2), 276–290.
Gelman, A. (2004). Exploratory data analysis for complex models. Journal of Computational and Graphical Statistics. https://doi.org/10.1198/106186004X11435
Good, P. I. (2006). Resampling methods. Springer.
Hadad, V., Hirshberg, D. A., Zhan, R., Wager, S., & Athey, S. (2019). Confidence intervals for policy evaluation in adaptive experiments.
Hanna, J. P., Stone, P., & Niekum, S. (2017). Bootstrapping with models: Confidence intervals for off-policy evaluation. In Proceedings of the international joint conference on autonomous agents and multiagent systems, AAMAS, vol. 1.
Hao, B., Abbasi-Yadkori, Y., Wen, Z., & Cheng, G. (2019). Bootstrapping upper confidence bound. In Advances in neural information processing systems, vol. 32.
Hao, B., Ji, X., Duan, Y., Lu, H., Szepesvari, C., & Wang, M. (2021). Bootstrapping fitted q-evaluation for off-policy inference. In Proceedings of the 38th international conference on machine learning, vol. 139.
Hoey, J., Poupart, P., Boutilier, C., & Mihailidis, A.(2005). POMDP models for assistive technology. In: Bickmore, T.W. (ed.), Caring machines: AI in Eldercare, Papers from the 2005 AAAI Fall Symposium, Arlington, Virginia, USA, November 4-6, 2005. AAAI Technical Report, vol. FS-05-02, pp. 51–58. AAAI Press, Washington, D.C. https://www.aaai.org/Library/Symposia/Fall/2005/fs05-02-009.php.
Liang, D., Charlin, L., McInerney, J., & Blei, D. M. (2016). Modeling user exposure in recommendation. In Proceedings of the 25th international conference on World Wide Web. WWW ’16, pp. 951–961. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE. https://doi.org/10.1145/2872427.2883090.
Liao, P., Greenewald, K., Klasnja, P., & Murphy, S. (2020). Personalized HeartSteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1), 1–22. https://doi.org/10.1145/3381007
Piette, J. D., Newman, S., Krein, S. L., Marinec, N., Chen, J., Williams, D. A., Edmond, S. N., Driscoll, M., LaChappelle, K. M., Maly, M., et al. (2022). Artificial intelligence (AI) to improve chronic pain care: Evidence of AI learning. Intelligence-Based Medicine, 6, 100064.
Qian, T., Yoo, H., Klasnja, P., Almirall, D., & Murphy, S. A. (2021). Estimating time-varying causal excursion effects in mobile health with binary outcomes. Biometrika, 108(3), 507–527.
Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W. W., & Cheng, G. (2021). Online bootstrap inference for policy evaluation in reinforcement learning.
Rosenbaum, P. (2002). Observational studies. Springer.
Russo, D., & Roy, B. V. (2014). Learning to optimize via information-directed sampling. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 1583–1591. https://proceedings.neurips.cc/paper/2014/hash/301ad0e3bd5cb1627a2044908a42fdc2-Abstract.html.
Russo, D., Roy, B. V., Kazerouni, A., Osband, I., & Wen, Z. (2018). A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1), 1–96. https://doi.org/10.1561/2200000070
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. IEEE Transactions on Neural Networks. https://doi.org/10.1109/tnn.1998.712192
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294.
Tomkins, S., Liao, P., Yeung, S., Klasnja, P., & Murphy, S. (2019). Intelligent pooling in Thompson sampling for rapid personalization in mobile health.
Tukey, J. W. (1977). Exploratory data analysis (Vol. 2). Reading.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition. Nauka.
Wang, C.-H., Yu, Y., Hao, B., & Cheng, G. (2020). Residual bootstrap exploration for bandit algorithms.
White, M., & White, A. (2010). Interval estimation for reinforcement-learning algorithms in continuous-state domains. In Advances in neural information processing systems 23: 24th annual conference on neural information processing systems 2010, NIPS 2010.
Yang, J., Eckles, D., Dhillon, P., & Aral, S. (2020). Targeting for long-term outcomes. arXiv:2010.15835.
Yom-Tov, E., Feraru, G., Kozdoba, M., Mannor, S., Tennenholtz, M., & Hochberg, I. (2017). Encouraging physical activity in patients with diabetes: Intervention using a reinforcement learning system. Journal of Medical Internet Research, 19(10), 338. https://doi.org/10.2196/JMIR.7994
Zhang, K. W., Janson, L., & Murphy, S. A. (2020). Inference for batched bandits.
Zhang, K. W., Janson, L., & Murphy, S. A. (2023). Statistical inference after adaptive sampling for longitudinal data.
Zhou, M., Mintz, Y., Fukuoka, Y., Goldberg, K., Flowers, E., Kaminsky, P. M., Castillejo, A., & Aswani, A. (2018). Personalizing mobile fitness apps using reinforcement learning. In: Said, A., Komatsu, T. (eds.), Joint Proceedings of the ACM IUI 2018 Workshops Co-located with the 23rd ACM Conference on Intelligent User Interfaces (ACM IUI 2018), Tokyo, Japan, March 11. CEUR Workshop Proceedings, vol. 2068. CEUR-WS.org, Tokyo (2018). https://ceur-ws.org/Vol-2068/humanize7.pdf.
Funding
RK is supported by NIGMS Biostatistics Training Grant Program under Grant No. T32GM135117. SG and SM acknowledge support by NIH/NIDA P50DA054039, and NIH/NIBIB and OD P41EB028242. RD acknowledges support by NSF DMS-2022448, and DSO National Laboratories grant DSO-CO21070. RD and SM also acknowledge support by NSF under Grant No. DMS-2023528 for the Foundations of Data Science Institute (FODSI). PK acknowledges support by NIH NHLBI R01HL125440 and 1U01CA229445. SM also acknowledges support by NIH/NCI U01CA229437, and NIH/NIDCR UH3DE028723. KWZ is supported by the Siebel Foundation and by NSF CBET-2112085 and by the NSF Graduate Research Fellowship Program under Grant No. DGE1745303.
Author information
Authors and Affiliations
Contributions
All authors contributed to conceptualization and methodology. Data Preparation and Software: PC, PL, RK, and SG. Analysis: RK. Writing: RK, RD, SM, and SG. Funding and Administration: PK, SM. Supervision: RD, KZ, SM. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Conflict of interest
KWZ worked as a summer intern at Apple. The other authors declare no conflict of interest.
Ethical approval
We adhere to the policies outlined in https://www.springer.com/gp/editorial-policies/ethical-responsibilities-of-authors.
Consent to participate
The reported study was approved by the Kaiser Permanente Washington Institutional Review Board (IRB 1257484-16). All participants completed a written informed consent to take part in the study.
Consent for publication
Adhering to the IRB and consent forms, individual level data that could be used for identification are not released.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Work done by authors Peng Liao and Prasidh Chhabria while they were at Harvard University.
Editors: Emma Brunskill, Minmin Chen, Omer Gottesman, Lihong Li, Yuxi Li, Yao Liu, Zonging Lu, Niranjani Prasad, Zhiwei Qin, Csaba Szepesvari, Matthew Taylor.
Appendices
Details on the RL framework for HeartSteps
We now provide further details regarding the RL framework for the HeartSteps study discussed in Sect. 3.2 (Table 1).
Clipping probabilities
The function h appearing in (10) is given by
Prior and posterior formulation
Using the notation \(\theta ^\top \triangleq (\alpha _0^\top , \alpha _1^\top , \beta ^\top )\), the prior for \(\theta\) was specified as \(\mathcal {N}(\mu _0, \Sigma _0)\), where
and \(\{ \mu _{\alpha _0}, \mu _\beta , \Sigma _{\alpha _0}, \Sigma _{\beta }\}\) were computed from the prior study (see Liao et al. 2020, Sect. 6 for details on how priors were constructed, and (17) for the specific values used by us; note that the HeartSteps team decided to update the priors from those presented in Liao et al. (2020)). Given the Gaussian prior and the Gaussian working model, the posterior for \(\theta\) on day d is also Gaussian and is given by \(\mathcal {N}(\overline{\mu }_d, \overline{\Sigma }_d)\), where these posterior parameters are recursively updated as
where \(\phi (S_t, A_t)^\top \triangleq [g(S_t)^\top , \pi _t f(S_t)^\top , (A_t-\pi _t)f(S_t)^\top ]\) collects all the feature vectors from the working model (8). The updates (15a) and (15b) are denoted by PosteriorUpdate in Algorithm 2. For k-dimensional \(\beta\), the posterior parameters \(\mu _{d, \beta }, \Sigma _{d, \beta }\) for \(\beta\) are respectively given by the last k entries of \(\overline{\mu }_d\) and the \(k\times k\) sub-matrix formed by taking the last k columns and rows of \(\overline{\Sigma }_d\).
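The PosteriorUpdate step is the standard conjugate update for a Gaussian linear model. The following is a minimal sketch under simplifying assumptions (a known reward noise variance `sigma2`, and variable names of our choosing; the deployed algorithm differs in these details):

```python
import numpy as np

def posterior_update(mu_prev, Sigma_prev, Phi, R, sigma2):
    """One night's PosteriorUpdate for a Gaussian working model.

    mu_prev, Sigma_prev: previous posterior N(mu, Sigma) over theta.
    Phi: (n, p) matrix stacking the rows phi(S_t, A_t)^T for the day's
         decision times.
    R: (n,) vector of the corresponding observed rewards.
    sigma2: reward noise variance (assumed known in this sketch).
    """
    Sigma_prev_inv = np.linalg.inv(Sigma_prev)
    # Precision accumulates the outer products of the day's features.
    Sigma_new = np.linalg.inv(Sigma_prev_inv + Phi.T @ Phi / sigma2)
    # Mean combines the previous mean with the day's reward information.
    mu_new = Sigma_new @ (Sigma_prev_inv @ mu_prev + Phi.T @ R / sigma2)
    return mu_new, Sigma_new
```

The marginal posterior for \(\beta\) is then read off from the last k entries of `mu_new` and the trailing \(k\times k\) block of `Sigma_new`, as described above.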
Model estimates used by ParaSim for resampling trajectories
For a user trajectory \((S_{t}, A_{t}, R_{t})_{t= 1}^{T}\) from the reward model (8), we estimate the parameters \((\alpha , \beta )\) using the updates (15) albeit without action centering. Hence the estimates \((\hat{\alpha }_T ^\top , \hat{\beta }_T^\top )\) (that inform the model parameters used by ParaSim after suitable modifications in Sect. 3.3) are given by
where \(\widetilde{\phi }(S_t, A_t)^\top \triangleq \left[ g(S_t)^\top , A_t f(S_t)^\top \right]\).
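These estimates amount to a regularized least-squares fit of the rewards on the non-centered features \(\widetilde{\phi }(S_t, A_t)\). A sketch, using a scalar prior precision in place of the full prior covariance (function and variable names are ours, not from the deployed code):

```python
import numpy as np

def estimate_reward_model(g, f, A, R, prior_prec=1.0, sigma2=1.0):
    """Estimate (alpha, beta) without action centering.

    g: (T, p) state features g(S_t); f: (T, q) features f(S_t);
    A: (T,) binary actions; R: (T,) rewards.
    Uses phi_tilde(S_t, A_t) = [g(S_t), A_t * f(S_t)].
    prior_prec is a scalar stand-in for the inverse prior covariance.
    """
    Phi = np.hstack([g, A[:, None] * f])                 # (T, p+q)
    M = prior_prec * np.eye(Phi.shape[1]) + Phi.T @ Phi / sigma2
    theta_hat = np.linalg.solve(M, Phi.T @ R / sigma2)
    alpha_hat, beta_hat = theta_hat[:g.shape[1]], theta_hat[g.shape[1]:]
    return alpha_hat, beta_hat
```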
Prior means and variances
We now summarize the exact values of prior parameters used by the RL algorithm. For \(\ell \in \mathbb {N}\), let \(\textrm{diag}(a_1, \ldots , a_{\ell }) \in \mathbb {R}^{\ell \times \ell }\) denote an \(\ell \times \ell\) diagonal matrix with its j-th diagonal entry equal to \(a_j\) for \(j=1, \ldots , \ell\). Then the mean and variance parameters used in (14) are given by
where the features of \(\alpha _0\) are ordered as (Intercept, temperature, prior 30 min step count, yesterday step count, \(\texttt {dosage}\), \(\texttt {engagement}\), \(\texttt {location}\), \(\texttt {variation}\)) and that for \(\beta\) and \(\alpha _1\) are ordered as (Intercept, \(\texttt {dosage}\), \(\texttt {engagement}\), \(\texttt {location}\), \(\texttt {variation}\)).
Details on \(\texttt {Score\_int}_{}\) computation for HeartSteps
We now describe the smoothed versions of \(\texttt {Score\_int}_{1}\) (1) and \(\texttt {Score\_int}_{2,\textsf{z}}\) (3) that we use to add stability to our HeartSteps results in Sects. 3.4 and 3.5.
At a high level, we use the following steps: (i) We use moving windows to average the advantage forecasts and work with an averaged forecast on a daily scale, i.e., for \(d=\lceil {(t-1)/5}\rceil\) and not for each decision time \(t\in [T]\) as in (1) and (3). (ii) While computing the interestingness scores, we omit days when the quality of data is not good due to low availability or low diversity of features. (iii) We compute the forecasts without changing any state feature from the observed data other than \(\texttt {dosage}\). That is, if the observed feature value \(\textsf{z} _t=1\), we do not compute a counterfactual forecast by artificially forcing \(\textsf{z} _t = 0\). (iv) Finally, we do not consider a user’s interestingness score if they have only a few days of good data.
We now describe these steps in detail for a user with a total of T decision times in their data trajectory (i.e., \(D\triangleq \lfloor {T/5}\rfloor\) days of data). We note that the total number of decision times might vary across users.
1. Sliding window: For each day \(d \in \{ 1, \ldots , D\}\), we define a sliding window \(W_d\) using all 5 decision times on day d when computing \(\texttt {Score\_int}_{1}\) and all 5 decision times on each of the days \(\{ d-1, d, d+1\}\) (total 15 decision times) when computing \(\texttt {Score\_int}_{2, \textsf{z}}\). That is,
$$\begin{aligned} W_d \triangleq {\left\{ \begin{array}{ll} \{ 5(d-1)+1, 5(d-1)+2, \ldots , 5d\} \cap [T] &{}\quad \text {for}\quad \texttt {Score\_int}_{1}, \\ \{ 5(d-2)+1, 5(d-2)+2, \ldots , 5(d+1)\} \cap [T] &{}\quad \text {for}\quad \texttt {Score\_int}_{2, \textsf{z}}. \end{array}\right. } \end{aligned}$$
2. Characterizing a good data day: Next, when considering \(\texttt {Score\_int}_{1}\), we define an indicator variable \(G_{d, 1}\) to denote a good day. It is set to 1 if the following two conditions hold: (a) the user was available for at least 2 decision times in \(W_d\), i.e., \(\sum _{t\in W_d} I_t \ge 2\), and (b) the RL algorithm posterior was updated on the night of day \(d-1\); we impose this additional constraint to deal with the real-time and missing data update issues. We set \(G_{d, 1}=0\) in all other cases. When considering \(\texttt {Score\_int}_{2, \textsf{z}}\), we define the variable \(G_{d, 2, \textsf{z}}\) to denote a good data day based on whether the user’s observed states exhibit enough diversity in the value of the variable \(\textsf{z}\) for the decision times in \(W_d\). In particular, we set \(G_{d, 2, \textsf{z}}=1\) when the following two conditions hold: (a) the feature \(\textsf{z}\) takes each of the values 1 and 0 at least twice over the decision times in \(W_d\) when the user was available for randomization (\(I_t=1\)), i.e.,
$$\begin{aligned} \sum _{t\in W_d} I_t \textsf{z} _t \ge 2 \quad \text {and}\quad \sum _{t\in W_d} I_t (1-\textsf{z} _t) \ge 2, \end{aligned}$$where \(\textsf{z} _t\) denotes the value of the variable \(\textsf{z}\) for the user at decision time t, and (b) the RL algorithm posterior was updated on the night of at least one of the days in \(\{ d-1, d, d+1\}\). In all other cases, we set \(G_{d, 2, \textsf{z}}=0\).
3. Interestingness score for a user trajectory: We consider a user for interestingness only if the fraction of good days is greater than a certain threshold, i.e.,
$$\begin{aligned} \frac{1}{D} \sum _{d=1}^{D} G_{d,1 } \ge (1-\gamma ) \ \text {for}\ \texttt {Score\_int}_{1} \quad \text {or}\quad \frac{1}{D} \sum _{d=1}^{D} G_{d, 2, \textsf{z}} \ge (1-\gamma ) \ \text {for}\ \texttt {Score\_int}_{2, \textsf{z}}, \end{aligned}$$ (18)
for a suitable \(\gamma \in (0, 1)\). (Note that increasing the value of \(\gamma\) lowers the cutoff for a user to become eligible for being considered for interestingness.) For such a user, we define the interestingness scores as follows:
$$\begin{aligned} \texttt {Score\_int}_{1} (\mathcal {U})&\triangleq \frac{1}{\sum _{d=1}^{D} G_{d, 1}} \sum _{d=1}^{D} G_{d, 1} \varvec{1}\left( \frac{\sum _{t\in W_d} I_t \hat{\Delta }_t(S_t) }{\sum _{t\in W_d}I_t}> 0 \right) \\ \texttt {Score\_int}_{2, \textsf{z}} (\mathcal {U})&\triangleq \frac{1}{\sum _{d=1}^{D} G_{d, 2, \textsf{z}}} \sum _{d=1}^{D} G_{d, 2, \textsf{z}} \varvec{1}\left( \frac{\sum _{t\in W_d} I_t \textsf{z} _t \hat{\Delta }_t(S_t) }{\sum _{t\in W_d}I_t \textsf{z} _t} > \frac{\sum _{t\in W_d}I_t (1-\textsf{z} _t) \hat{\Delta }_t(S_t) }{\sum _{t\in W_d}I_t (1-\textsf{z} _t)} \right) , \end{aligned}$$where we multiply by indicators \(G_{d, 1}\) and \(G_{d, 2, \textsf{z}}\) to include only “good days” in our score computations. Note that \(\frac{\sum _{t\in W_d} I_t \textsf{z} _t \hat{\Delta }_t(S_t) }{\sum _{t\in W_d}I_t \textsf{z} _t}\) is the stable proxy (without counterfactual imputation) for the quantity \(\hat{\Delta }_{t}(S_t(\textsf{z} =1))\) in (3).
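The three steps above, specialized to \(\texttt {Score\_int}_{1}\), can be sketched as follows (array and function names are ours; the posterior-update condition of step 2 is passed in as a per-day boolean, and indices are 0-based):

```python
import numpy as np

def score_int_1(delta_hat, I, updated, gamma=0.4):
    """Smoothed Score_int_1 for one user.

    delta_hat: (T,) advantage forecasts Delta_t(S_t).
    I: (T,) availability indicators I_t.
    updated: (D,) booleans; updated[d-1] is True if the posterior was
        updated on the night of day d-1.
    Returns the score, or None if the user fails requirement (18).
    """
    T = len(delta_hat)
    D = T // 5
    good, positive = [], []
    for d in range(1, D + 1):
        W = np.arange(5 * (d - 1), 5 * d)          # day d's 5 decision times
        g = (I[W].sum() >= 2) and updated[d - 1]   # good data day G_{d,1}
        good.append(g)
        if g:
            # Availability-weighted average forecast over the window.
            avg = (I[W] * delta_hat[W]).sum() / I[W].sum()
            positive.append(avg > 0)
    if np.mean(good) < 1 - gamma:                  # requirement (18)
        return None
    return np.mean(positive)                       # fraction of good days with avg > 0
```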
Remark 3
Note that we omit from our results all users who do not satisfy the good day requirement (18). The number of omitted users depends on the value of \(\gamma\); see Fig. 10 for the histograms of \(\sum _{d}G_{d, 1}/D\) and \(\sum _{d}G_{d, 2, \textsf{z}}/D\) across the 91 users in HeartSteps. For the results in Sects. 3.4 and 3.5, we use \(\gamma =0.4\), which allows 63, 60, 12, and 43 users to be considered, respectively, for interestingness of type 1, and of type 2 for the features \(\texttt {variation}\), \(\texttt {location}\), and \(\texttt {engagement}\).
Another look at user 2’s advantage forecasts
Figure 11a reproduces the advantage forecasts for user 2 from Fig. 1b. In addition, panels (b) and (c) of Fig. 11 show the analogs of the panel (a), where user 2’s standardized advantage forecasts are color-coded based on the values of \(\texttt {location}\) and \(\texttt {engagement}\) respectively. Overall, we observe from the three panels in Fig. 11 that user 2 does not appear interesting of type 2 for \(\textsf{z} \in \{ \texttt {location}, \texttt {engagement} \}\) since the standardized advantage forecasts are not well separated when \(\textsf{z} = 0\) versus \(\textsf{z} = 1\) like that for \(\textsf{z} = \texttt {variation}\) in panel (a) (or equivalently in Fig. 1b). In particular, for this user, we have \(\texttt {Score\_int}_{2, \texttt {location}} = 0.38\), and \(\texttt {Score\_int}_{2, \texttt {engagement}} =0.38\), while \(\texttt {Score\_int}_{2,\texttt {variation}} =0\).
Deeper dive into interestingness of type 1 for HeartSteps
To further refine the conclusions from Fig. 3, we consider one-sided variants of the definition (2) for the number of interesting users. For the reader’s convenience, we reproduce Fig. 3 in panel (a) of Fig. 12, alongside the corresponding results with one-sided interesting user counts, defined by counting the users with \(\texttt {Score\_int}_{1} \ge 0.9\) and \(\texttt {Score\_int}_{1} \le 0.1\) separately in panels (b) and (c).
From Fig. 12b, we find that in the original data 17 users exhibit \(\texttt {Score\_int}_{1} \ge 0.9\); we denote this user count by \(\texttt {\#User\_int}_{1} ^+\). However, the value of \(\texttt {\#User\_int}_{1} ^+\) is always significantly smaller than 17 across the 500 trials with resampled trajectories. In Table 2, we denote this analysis as Type \(1^+\).
On the other hand, Fig. 12c shows that one user exhibits \(\texttt {Score\_int}_{1} \le 0.1\) in the original data; we denote this count by \(\texttt {\#User\_int}_{1} ^-\). We also find that all 500 trials have \(\texttt {\#User\_int}_{1} ^->1\). In Table 2, we denote this analysis as Type \(1^-\).
Overall, we conclude that the data presents evidence in favor of the claim that the RL algorithm is potentially personalizing by learning that many users benefit from being sent an activity message. However, many users might exhibit \(\texttt {Score\_int}_{1} \le 0.1\), so that sending the message might appear less beneficial than not sending it for these users, just due to algorithmic stochasticity. Consequently, the value of \(\texttt {\#User\_int}_{1}\)—the number of interesting users with \(|\texttt {Score\_int}_{1}-0.5| \ge 0.4\), which is also equal to \(\texttt {\#User\_int}_{1} ^+ + \texttt {\#User\_int}_{1} ^-\)—can be as high as 18 (the observed value in the original data) due to algorithmic stochasticity.
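The one-sided and two-sided counts, and their comparison against the resampled trials, can be sketched as follows (function names are ours; `scores` holds one \(\texttt {Score\_int}_{1}\) value per eligible user and `resampled_scores` is a trials-by-users array from the resimulations):

```python
import numpy as np

def count_interesting(scores, delta=0.4):
    """One-sided and two-sided interesting-user counts from score values."""
    scores = np.asarray(scores, dtype=float)
    n_plus = np.sum(scores >= 0.5 + delta)    # #User_int^+ (score >= 0.9)
    n_minus = np.sum(scores <= 0.5 - delta)   # #User_int^- (score <= 0.1)
    return n_plus, n_minus, n_plus + n_minus  # last entry is #User_int

def chance_fraction(observed_scores, resampled_scores, delta=0.4):
    """Fraction of resampled trials with at least as many interesting users
    as observed in the original data."""
    obs_count = count_interesting(observed_scores, delta)[2]
    trial_counts = [count_interesting(trial, delta)[2]
                    for trial in resampled_scores]
    return np.mean(np.asarray(trial_counts) >= obs_count)
```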
Stability of conclusions with respect to the choice of \((\delta , \gamma )\)
Next, we investigate the stability of the claims made above for \(\texttt {Score\_int}_{1}\) and the one-sided variants to the choice of hyper-parameters \(\delta\) and \(\gamma\), appearing in (2) and (18), respectively. Note that for a given definition of \(\texttt {Score\_int}_{}\), increasing \(\gamma\) in (18) for a fixed \(\delta\) in (2) allows more users to become eligible for being considered as interesting, both in the original data and the resampled trials. Similarly, decreasing \(\delta\) in (2) for a fixed \(\gamma\) in (18) would typically lead to a larger number of interesting users, both in the original data and the resampled trials.
The results of this exploration for the choices \(\delta \in \{ 0.35, 0.40, 0.45\}\) and \(\gamma \in \{ 0.65, 0.70, 0.75\}\) are presented in Fig. 13. For a given panel, the value in the cell corresponding to the value of \(\delta\) on the horizontal axis and \(\gamma\) on the vertical axis is equal to the fraction of the 500 trials for which the number of interesting users \(\texttt {\#User\_int}_{}\) computed using those hyperparameter choices was at least as large as that in the original data. Looking at Fig. 13b, c, we find that the conclusions drawn from Fig. 12b, c with \((\delta , \gamma )=(0.4, 0.75)\) about \(\texttt {\#User\_int}_{1} ^+\) and \(\texttt {\#User\_int}_{1} ^-\) remain stable even if we slightly perturb the values of \(\delta\) and \(\gamma\). In particular, across the \(3\times 3\) choices for \(\delta\) and \(\gamma\), the value of \(\texttt {\#User\_int}_{1} ^+\) would not appear as high as the observed value in the original data just by chance. On the other hand, the value of \(\texttt {\#User\_int}_{1} ^-\) might appear higher than the observed value in the original data simply due to algorithmic stochasticity. Given the competing nature of these two quantities and the fact that \(\texttt {\#User\_int}_{1} = \texttt {\#User\_int}_{1} ^+ + \texttt {\#User\_int}_{1} ^-\), the resulting fraction of trials with a count at least as high as \(\texttt {\#User\_int}_{1}\) in the original data is quite sensitive to the particular choice of \((\delta , \gamma )\), as Fig. 13a illustrates.
Stability of HeartSteps results for interestingness of type 2
We perform stability analysis for \(\texttt {\#User\_int}_{2,\textsf{z}}\) with respect to the choice of \((\delta , \gamma )\) similarly to that done for \(\texttt {\#User\_int}_{1}\) above in Fig. 13 and provide the results in Fig. 14.
Panels (a), (b), and (c) display the results, respectively, for \(\textsf{z} = \texttt {variation}, \texttt {location}\), and \(\texttt {engagement}\). In a given panel, the value in the cell corresponding to the value of \(\delta\) on the horizontal axis and \(\gamma\) on the vertical axis is equal to the fraction of the 500 trials for which the number of interesting users \(\texttt {\#User\_int}_{2, \textsf{z}}\) computed using those hyperparameter choices was at least as large as that in the original data. Across the \(3\times 3\) choices for \(\delta\) and \(\gamma\), we notice that for interestingness of type 2 for the features \(\texttt {variation}\), \(\texttt {location}\), and \(\texttt {engagement}\), the fraction remains stable around 0, 0, and 1, respectively, the same as the fractions in Fig. 7 for \((\delta , \gamma )=(0.4, 0.75)\).
Interesting users of type 2 for \(\texttt {location}\) and \(\texttt {engagement}\)
We now demonstrate the analysis (like in Figs. 1b and 8) for two different users, who exhibit potential interestingness of type 2 for \(\texttt {location}\) and \(\texttt {engagement}\).
A potentially interesting user of type 2 for \(\texttt {location}\)
Fig. 15 displays the advantage forecasts for a user, whom we call user 3 to distinguish them from the two users associated with Fig. 1. The three panels in Fig. 15 plot user 3’s advantages color-coded by the value of the three features for that user. We find that this user admits \(\texttt {Score\_int}_{2, \texttt {variation}} =0.68\), \(\texttt {Score\_int}_{2, \texttt {location}} =0\), and \(\texttt {Score\_int}_{2, \texttt {engagement}} =0.43\), so that this user would be deemed potentially interesting of type 2 for \(\texttt {location}\) (and not the other features) as per our definition (4).
Next, we evaluate how likely the user graph in Fig. 15b would appear just by chance. Panels (a) and (b) of Fig. 16 visualize two resampled trajectories of user 3 (chosen uniformly at random from user 3’s 500 resampled trajectories) generated under the generative model that there is no differential advantage of sending a message based on the value of \(\texttt {location}\). The color coding is as in Fig. 15b, namely, the forecasts are marked in red triangles if \(\texttt {location}\) = 1 and blue circles if \(\texttt {location}\) = 0. In panel (c) of Fig. 16, we plot the histogram for the \(\texttt {Score\_int}_{2, \texttt {location}}\) for this user across all 500 resampled trajectories and denote the observed value in the original data as a vertical dotted line.
Figure 16a, b show that the resampled trajectories do not appear interesting of type 2 for \(\texttt {location}\) as in Fig. 15b; the two trajectories, respectively, have \(\texttt {Score\_int}_{2,\texttt {location}} =\) 0.50 and 0.42. Moreover, panel (c) shows that the interestingness score of 0, which was observed for user 3 in the original data (Fig. 15b), never appears across any of the resampled trajectories. Thus we can conclude that the data presents evidence that the RL algorithm potentially personalized for user 3 by learning to treat the user differentially based on \(\texttt {location}\), and this personalization would not likely arise simply due to algorithmic stochasticity.
A potentially interesting user of type 2 for \(\texttt {engagement}\)
Fig. 17 displays the advantage forecasts for a user, whom we call user 4 to distinguish them from the three users associated with Figs. 1 and 15. The three panels in Fig. 17 plot user 4’s advantages color-coded by the three features; these admit \(\texttt {Score\_int}_{2, \texttt {variation}} =0.65\), \(\texttt {Score\_int}_{2, \texttt {location}} =0.9\), and \(\texttt {Score\_int}_{2, \texttt {engagement}} =0.037\), respectively. Thus, based on our definition (4), this user is potentially interesting of type 2 for \(\texttt {engagement}\), but not for \(\texttt {variation}\). The user does not satisfy the criterion (\(\gamma = 0.75\) in (18)) for being considered a potentially interesting user for \(\texttt {location}\) due to a lack of diversity in the values taken by their \(\texttt {location}\) feature.
Next, we evaluate how likely the user graph in Fig. 17c would appear just by chance. Panels (a) and (b) of Fig. 18 visualize two resampled trajectories of user 4 (chosen uniformly at random from user 4’s 500 resampled trajectories) generated under the generative model that there is no differential advantage of sending a message based on the value of \(\texttt {engagement}\). The color coding is as in Fig. 17c, namely, the forecasts are marked in red triangles if \(\texttt {engagement}\) = 1 and blue circles if \(\texttt {engagement}\) = 0. In panel (c) of Fig. 18, we plot the histogram for the \(\texttt {Score\_int}_{2, \texttt {engagement}}\) for this user across all 500 resampled trajectories and denote the observed value in the original data as a vertical dotted line.
Figure 18a, b show that the resampled trajectories do not appear interesting of type 2 for \(\texttt {engagement}\) as in Fig. 17c; the two trajectories, respectively, have \(\texttt {Score\_int}_{2,\texttt {engagement}} =\) 0.94, and 0.41. However, panel (c) shows that the interestingness score of 0.037, which was observed for user 4 in the original data (Fig. 17c), appears for around 20% of the resampled trajectories. Thus we can conclude that the data presents evidence that user 4’s interestingness score for \(\texttt {engagement}\) might appear extreme simply due to algorithmic stochasticity.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ghosh, S., Kim, R., Chhabria, P. et al. Did we personalize? Assessing personalization by an online reinforcement learning algorithm using resampling. Mach Learn (2024). https://doi.org/10.1007/s10994-024-06526-x