Abstract
We expand methods for estimating an optimal treatment regime (OTR) from the personalized medicine literature to educational data mining applications. As part of this development, we detail and modify the current state of the art, assess the efficacy of the approaches for student success studies, and provide practitioners the machinery to apply the methods to their specific problems. Our particular interest is to estimate an optimal treatment regime for students enrolled in an introductory statistics course at San Diego State University (SDSU). The available treatments are combinations of three programs SDSU implemented to foster student success in this large-enrollment, bottleneck STEM course. We leverage tree-based reinforcement learning approaches based on either an inverse probability-weighted purity measure or an augmented inverse probability-weighted purity measure. The resulting OTR promises to significantly increase the average grade in the introductory course and also reveals the need for program recommendations to students, as only very few, on their own, selected their optimal treatment.
References
Bang H, Robins JM (2005) Doubly robust estimation in missing data and causal inference models. Biometrics 61(4):962–973
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins J (2018) Double/debiased machine learning for treatment and structural parameters. Economet J 21(1):C1–C68
Doubleday K, Zhou H, Fu H, Zhou J (2018) An algorithm for generating individualized treatment decision trees and random forests. J Comput Graph Stat 27:849–860
Gill RD, Robins JM (2001) Causal inference for complex longitudinal data: the continuous case. Ann Stat 29(6):1785–1811
Kusner M, Russell C, Loftus J, Silva R (2019) Making decisions that reduce discriminatory impacts. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning, PMLR, proceedings of machine learning research, vol 97, pp 3591–3600
Kusner MJ, Loftus J, Russell C, Silva R (2017) Counterfactual fairness. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates Inc., New York, pp 4066–4076
Laber EB, Zhao YQ (2015) Tree-based methods for individualized treatment regimes. Biometrika 102:501–514
Luedtke AR, van der Laan MJ (2016) Optimal individualized treatments in resource-limited settings. Int J Biostat 12(1):283–303
Martin DC, Arendale DR (1992) Supplemental instruction: improving first-year student success in high-risk courses (2nd ed). National Resource Center for The First Year Experience
Meier Y, Xu J, Atan O, van der Schaar M (2016) Predicting grades. IEEE Trans Signal Process 64(4):959–972
Powell MG, Hull DM, Beaujean AA (2020) Propensity score matching for education data: worked examples. J Exp Educ 88(1):145–164
Qian M, Murphy SA (2011) Performance guarantees for individualized treatment rules. Ann Stat 39(2):1180–1210
R Core Team (2020) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Robins J (1986) A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Math Model 7(9–12):1393–1512
Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70:41–55
Rubin DB (1980) Randomization analysis of experimental data: the Fisher randomization test comment. J Am Stat Assoc 75(371):591–593
Schneider M, Preckel F (2017) Variables associated with achievement in higher education: a systematic review of meta-analyses. Psychol Bull 143:565–600
Sobel ME (2006) What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference. J Am Stat Assoc 101(476):1398–1407
Spoon MK, Beemer J, Whitmer J, Fan J, Frazee PJ, Andrew J, Bohonak JS, Levine AR (2016) Random forests for evaluating pedagogy and informing personalized learning. Educ Data Min 20:20–50
Tao Y, Wang L (2017) Adaptive contrast weighted learning for multi-stage multi-treatment decision-making. Biometrics 73(1):145–155
Tao Y, Wang L, Almirall D (2018) Tree-based reinforcement learning for estimating optimal dynamic treatment regimes. Ann Appl Stat 12(3):1914–1938
Toth B, van der Laan M (2018) Targeted learning of optimal individualized treatment rules under cost constraints. In: Biopharmaceutical applied statistics symposium, pp 1–22
VanderWeele TJ, Hernan MA (2013) Causal inference under multiple versions of treatment. J Causal Inference 1(1):1–20
Xu Y, Greene T, Bress A, Sauer B, Bellows B, Zhang Y, Weintraub W, Moran A, Shen J (2020) Estimating the optimal individualized treatment rule from a cost-effectiveness perspective. Biometrics 20:1–15
Zhang B, Tsiatis AA, Laber EB, Davidian M (2012) A robust method for estimating optimal treatment regimes. Biometrics 68:1010–1018
Zhao Y, Zeng D, Rush AJ, Kosorok MR (2012) Estimating individualized treatment rules using outcome weighted learning. J Am Stat Assoc 107(499):1106–1118
Communicated by Ryan Baker.
Appendices
Proofs
Lemma 1
If Assumptions 1 through 4 hold, we have
\(E\{Y^*(\pi (X))\}=E\left[ \dfrac{\mathbb {1}\{A=\pi (X)\}\,Y}{P\{A=\pi (X)|X\}}\right]\).
Proof
Recall the definition of the OTR in Eq. (1)
Hence, it is sufficient to show that
We begin using the law of iterated expectation to get
Now, we leverage Assumption 3, stating that—conditioned on the characteristics X—all potential outcomes \(Y^*(a)\) are independent of the treatment A, to add the condition in (13) that we assign treatment according to a treatment regime \(\pi (X)\), which gives us
where we used in the last step the definition of the conditional expectation. Note that the expression in Eq. (14) is well defined as Assumption 4 guarantees \(P\{A=\pi (X)|X\}>0\). As
we can rewrite Eq. (14) as
where we used Assumption 2 in Eq. (16) and that \(P\{A=\pi (X)|X\}\) is a \(\sigma (X)\)-measurable function, where \(\sigma (X)\) is the \(\sigma\)-algebra generated by X. This ends our proof. \(\square\)
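Lemma 1 can also be checked by simulation. The following minimal R sketch is illustrative only and not part of the paper's code: the treatment is binary, and the assignment and outcome models are invented. Because the potential outcomes are known by construction, the inverse probability-weighted mean of the observed outcome can be compared against \(E\{Y^*(\pi (X))\}\) directly:

```r
# Monte Carlo check of the IPW identity: under consistency (Assumption 2),
# no unmeasured confounders (Assumption 3), and positivity (Assumption 4),
# the IPW mean of the observed outcome recovers E{Y*(pi(X))} for a fixed regime.
set.seed(1)
n    <- 1e5
X    <- runif(n)                           # a single student characteristic
p1   <- plogis(2 * X - 1)                  # P(A = 1 | X), binary treatment
A    <- rbinom(n, 1, p1)
Y0   <- 1 + X + rnorm(n)                   # potential outcomes Y*(0), Y*(1)
Y1   <- 2 + 3 * X + rnorm(n)
Y    <- ifelse(A == 1, Y1, Y0)             # observed outcome (consistency)

pi_x  <- as.numeric(X > 0.5)               # some fixed regime pi(X)
p_pi  <- ifelse(pi_x == 1, p1, 1 - p1)     # P{A = pi(X) | X} > 0
ipw   <- mean((A == pi_x) * Y / p_pi)      # right-hand side of Lemma 1
truth <- mean(ifelse(pi_x == 1, Y1, Y0))   # E{Y*(pi(X))} from the simulation
```

Both estimates agree up to Monte Carlo error.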
Lemma 2
If Assumptions 1 through 4 hold, we have
\(E\left[ \dfrac{\mathbb {1}\{A=\pi (X)\}\{Y-g(X)\}}{P\{A=\pi (X)|X\}}\right] =E\{Y^*(\pi (X))\}-E\{g(X)\}\)
for any arbitrary function \(g:{\mathbb {R}}^p\mapsto {\mathbb {R}}\) (Laber and Zhao 2015).
Proof
Let \(g:{\mathbb {R}}^p\mapsto {\mathbb {R}}\) be an arbitrary function to define
Then, it holds that
and therefore
\(\square\)
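Lemma 2 can be illustrated the same way: subtracting an arbitrary \(g(X)\) from the outcome shifts the weighted mean by \(E\{g(X)\}\) uniformly over all regimes, so the maximizing regime is unchanged. A minimal R sketch (illustrative only; binary treatment and all model choices invented):

```r
# Subtracting g(X) from Y shifts the IPW mean by an estimate of E{g(X)};
# since the shift does not depend on pi, the argmax over regimes is preserved.
set.seed(3)
n    <- 1e5
X    <- runif(n)
p1   <- plogis(2 * X - 1)                 # P(A = 1 | X)
A    <- rbinom(n, 1, p1)
Y    <- 1 + X + A * (3 * X - 1) + rnorm(n)
pi_x <- as.numeric(X > 0.5)               # an arbitrary regime
p_pi <- ifelse(pi_x == 1, p1, 1 - p1)
g    <- function(x) 5 * x^2               # an arbitrary function g
lhs  <- mean((A == pi_x) * (Y - g(X)) / p_pi)
rhs  <- mean((A == pi_x) * Y / p_pi) - mean(g(X))
```

The two quantities `lhs` and `rhs` agree up to Monte Carlo error, matching the identity in Lemma 2.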
Adjusted R code of Tao et al. (2018)
1.1 Function DTRtree to grow tree with \({\mathcal {P}}^{\text {Tao}}\)
1.2 Function LZtree to grow tree with \({\mathcal {P}}^{LZ}\)
1.3 Example
This subsection provides the R code of the file 03 Example Application to demonstrate how to deploy the developed methods. The functions DTRtree and LZtree are available in the file 01 TRL Functions—along with other functions.
The example considers simulated data in the file 02 Example Data, which has a structure similar to the student success data but is not based on actual student data. For each student, an SAT Math Score, HSGPA, Age, Gender, and URM (underrepresented minority status) were simulated along with an overall grade. Three treatments were assumed to be available to every student: no program (encoded with 1), MSLC (encoded with 2), and SI (supplemental instruction; encoded with 3).
To estimate an OTR for the given data using either \({\mathcal {P}}^{\text {Tao}}\) or \({\mathcal {P}}^{LZ}\), we first need to choose a maximal tree depth, a minimal purity gain \(\lambda\), and a minimal node size \(\gamma\), following our algorithm in Sect. 2.3.
The remaining steps of the algorithm are then performed by the functions DTRtree for \({\mathcal {P}}^{\text {Tao}}\) and LZtree for \({\mathcal {P}}^{LZ}\).
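To make the role of these tuning parameters concrete, the following self-contained R sketch mimics a single step of the algorithm on simulated data: for each candidate cutoff, the IPW-best treatment is assigned in each child node, and the split is scored by the resulting purity, subject to a minimal node size \(\gamma\). This is an illustration of the mechanism only, not the paper's implementation; the data-generating model is invented so that splitting on Age at 18 is optimal by construction.

```r
# One greedy split under an IPW purity: score each candidate cutoff by the
# size-weighted IPW value of the best treatment in each child node.
set.seed(2)
n   <- 20000
age <- sample(17:25, n, replace = TRUE)
A   <- sample(1:3, n, replace = TRUE)        # randomized, so P(A = a | X) = 1/3
p   <- rep(1/3, n)
Y   <- 2 + 3 * (age <= 18) * (A == 2) + 3 * (age > 18) * (A == 3) + rnorm(n)

# IPW estimate of E{Y*(a)} among the students indexed by idx
ipw_value <- function(a, idx) mean((A[idx] == a) * Y[idx] / p[idx])

purity_of_split <- function(cutoff, gamma = 50) {
  left  <- which(age <= cutoff)
  right <- which(age > cutoff)
  if (length(left) < gamma || length(right) < gamma) return(-Inf)  # node size
  best_l <- max(sapply(1:3, ipw_value, idx = left))   # best treatment, left
  best_r <- max(sapply(1:3, ipw_value, idx = right))  # best treatment, right
  (length(left) * best_l + length(right) * best_r) / n
}

cutoffs  <- 17:24
best_cut <- cutoffs[which.max(sapply(cutoffs, purity_of_split))]
best_cut                                     # recovers the cutoff 18
```

In the full algorithm, this search runs over all covariates and cutoffs, a split is accepted only if the purity gain exceeds \(\lambda\), and the procedure recurses on the child nodes up to the maximal depth.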
The output taotree is given as a matrix:
For example, the first node is split with the help of the third covariate—which is age—at a value of 18. The purity \({\mathcal {P}}^{\text {Tao}}\) associated with this split is 608. No treatment is assigned as node 1 is not a terminal node. All students who are at most 18 are sent to node 2 which assigns treatment 2 (MSLC) to each student. All students who are older than 18 are differentiated according to their SAT Math score—which is the first covariate in the dataset. If they achieved an SAT Math Score of at most 580, they are recommended to attend SI (encoded as 3); otherwise, they should not attend any program (encoded as 1). Figure 5 displays the graphical representation of the output taotree.
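To connect the matrix output to regime assignments, the following R sketch rebuilds a hypothetical matrix consistent with the description above and walks a student down the tree. The column layout (node, splitting covariate index, cutoff, purity, assigned treatment) and the node numbering (children of node \(i\) are \(2i\) and \(2i+1\)) are assumptions for illustration; the actual taotree matrix may differ.

```r
# Hypothetical reconstruction of the taotree matrix described in the text.
taotree <- rbind(
  c(node = 1, X = 3,  cutoff = 18,  purity = 608, trt = NA),  # split on Age
  c(node = 2, X = NA, cutoff = NA,  purity = NA,  trt = 2),   # Age <= 18: MSLC
  c(node = 3, X = 1,  cutoff = 580, purity = NA,  trt = NA),  # split on SAT Math
  c(node = 6, X = NA, cutoff = NA,  purity = NA,  trt = 3),   # SAT <= 580: SI
  c(node = 7, X = NA, cutoff = NA,  purity = NA,  trt = 1)    # SAT >  580: none
)

# Follow one student's covariates down the tree to the recommended treatment.
assign_trt <- function(tree, x) {
  node <- 1
  repeat {
    row <- tree[tree[, "node"] == node, , drop = FALSE]
    if (!is.na(row[1, "trt"])) return(row[1, "trt"])          # terminal node
    node <- if (x[row[1, "X"]] <= row[1, "cutoff"]) 2 * node else 2 * node + 1
  }
}

t1 <- assign_trt(taotree, c(SATMath = 600, HSGPA = 3.5, Age = 17))  # 2: MSLC
t2 <- assign_trt(taotree, c(SATMath = 550, HSGPA = 3.0, Age = 20))  # 3: SI
```

The two calls reproduce the narrative above: the 17-year-old is sent to node 2 and assigned MSLC, while the 20-year-old with an SAT Math Score of 550 reaches node 6 and is assigned SI.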
About this article
Cite this article
Wilke, M.C., Levine, R.A., Guarcello, M.A. et al. Estimating the optimal treatment regime for student success programs. Behaviormetrika 48, 309–343 (2021). https://doi.org/10.1007/s41237-021-00140-0