Abstract
We expand methods for estimating an optimal treatment regime (OTR) from the personalized medicine literature to educational data mining applications. As part of this development, we detail and modify the current state of the art, assess the efficacy of the approaches for student success studies, and provide practitioners the machinery to apply the methods to their specific problems. Our particular interest is to estimate an optimal treatment regime for students enrolled in an introductory statistics course at San Diego State University (SDSU). The available treatments are combinations of three programs SDSU implemented to foster student success in this large-enrollment, bottleneck STEM course. We leverage tree-based reinforcement learning approaches based on either an inverse probability-weighted purity measure or an augmented inverse probability-weighted purity measure. The resulting OTR promises to significantly increase the average grade in the introductory course and also reveals the need for program recommendations to students, as only very few, on their own, selected their optimal treatment.
References
Bang H, Robins JM (2005) Doubly robust estimation in missing data and causal inference models. Biometrics 61(4):962–973
Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins J (2018) Double/debiased machine learning for treatment and structural parameters. Economet J 21(1):C1–C68
Doubleday K, Zhou H, Fu H, Zhou J (2018) An algorithm for generating individualized treatment decision trees and random forests. J Comput Graph Stat 27:849–860
Gill RD, Robins JM (2001) Causal inference for complex longitudinal data: the continuous case. Ann Stat 29(6):1785–1811
Kusner M, Russell C, Loftus J, Silva R (2019) Making decisions that reduce discriminatory impacts. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning, PMLR, proceedings of machine learning research, vol 97, pp 3591–3600
Kusner MJ, Loftus J, Russell C, Silva R (2017) Counterfactual fairness. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates Inc., New York, pp 4066–4076
Laber EB, Zhao YQ (2015) Tree-based methods for individualized treatment regimes. Biometrika 102:501–514
Luedtke AR, van der Laan MJ (2016) Optimal individualized treatments in resource-limited settings. Int J Biostat 12(1):283–303
Martin DC, Arendale DR (1992) Supplemental instruction: improving first-year student success in high-risk courses (2nd ed). National Resource Center for The First Year Experience
Meier Y, Xu J, Atan O, van der Schaar M (2016) Predicting grades. IEEE Trans Signal Process 64(4):959–972
Powell MG, Hull DM, Beaujean AA (2020) Propensity score matching for education data: worked examples. J Exp Educ 88(1):145–164
Qian M, Murphy SA (2011) Performance guarantees for individualized treatment rules. Ann Stat 39(2):1180–1210
R Core Team (2020) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/
Robins J (1986) A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Math Model 7(9–12):1393–1512
Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70:41–55
Rubin DB (1980) Randomization analysis of experimental data: the Fisher randomization test comment. J Am Stat Assoc 75(371):591–593
Schneider M, Preckel F (2017) Variables associated with achievement in higher education: a systematic review of meta-analyses. Psychol Bull 143:565–600
Sobel ME (2006) What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference. J Am Stat Assoc 101(476):1398–1407
Spoon MK, Beemer J, Whitmer J, Fan J, Frazee PJ, Andrew J, Bohonak JS, Levine AR (2016) Random forests for evaluating pedagogy and informing personalized learning. Educ Data Min 20:20–50
Tao Y, Wang L (2017) Adaptive contrast weighted learning for multi-stage multi-treatment decision-making. Biometrics 73(1):145–155
Tao Y, Wang L, Almirall D (2018) Tree-based reinforcement learning for estimating optimal dynamic treatment regimes. Ann Appl Stat 12(3):1914–1938
Toth B, van der Laan M (2018) Targeted learning of optimal individualized treatment rules under cost constraints. In: Biopharmaceutical applied statistics symposium, pp 1–22
VanderWeele TJ, Hernan MA (2013) Causal inference under multiple versions of treatment. J Causal Inference 1(1):1–20
Xu Y, Greene T, Bress A, Sauer B, Bellows B, Zhang Y, Weintraub W, Moran A, Shen J (2020) Estimating the optimal individualized treatment rule from a cost-effectiveness perspective. Biometrics 20:1–15
Zhang B, Tsiatis AA, Laber EB, Davidian M (2012) A robust method for estimating optimal treatment regimes. Biometrics 68:1010–1018
Zhao Y, Zeng D, Rush AJ, Kosorok MR (2012) Estimating individualized treatment rules using outcome weighted learning. J Am Stat Assoc 107(499):1106–1118
Communicated by Ryan Baker.
Appendices
Proofs
Lemma 1
If Assumptions 1 through 4 hold, we have
\(E\{Y^*(\pi (X))\}=E\left[ \dfrac{\mathbb {1}\{A=\pi (X)\}\,Y}{P\{A=\pi (X)|X\}}\right]\).
Proof
Recall the definition of the OTR in Eq. (1)
Hence, it is sufficient to show that
We begin using the law of iterated expectation to get
Now, we leverage Assumption 3, stating that—conditioned on the characteristics X—all potential outcomes \(Y^*(a)\) are independent of the treatment A, to add the condition in (13) that we assign treatment according to a treatment regime \(\pi (X)\), which gives us
where we used in the last step the definition of the conditional expectation. Note that the expression in Eq. (14) is well defined as Assumption 4 guarantees \(P\{A=\pi (X)|X\}>0\). As
we can rewrite Eq. (14) as
where we used Assumption 2 in Eq. (16) and that \(P\{A=\pi (X)|X\}\) is a \(\sigma (X)\)-measurable function, where \(\sigma (X)\) is the \(\sigma\)-algebra generated by X. This ends our proof. \(\square\)
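Lemma 1 can also be checked by simulation. The following minimal R sketch is illustrative only and not part of the paper's code: the treatment is binary, and the assignment and outcome models are invented. Because the potential outcomes are known by construction, the inverse probability-weighted mean of the observed outcome can be compared against \(E\{Y^*(\pi (X))\}\) directly:

```r
# Monte Carlo check of the IPW identity: under consistency (Assumption 2),
# no unmeasured confounders (Assumption 3), and positivity (Assumption 4),
# the IPW mean of the observed outcome recovers E{Y*(pi(X))} for a fixed regime.
set.seed(1)
n    <- 1e5
X    <- runif(n)                           # a single student characteristic
p1   <- plogis(2 * X - 1)                  # P(A = 1 | X), binary treatment
A    <- rbinom(n, 1, p1)
Y0   <- 1 + X + rnorm(n)                   # potential outcomes Y*(0), Y*(1)
Y1   <- 2 + 3 * X + rnorm(n)
Y    <- ifelse(A == 1, Y1, Y0)             # observed outcome (consistency)

pi_x  <- as.numeric(X > 0.5)               # some fixed regime pi(X)
p_pi  <- ifelse(pi_x == 1, p1, 1 - p1)     # P{A = pi(X) | X} > 0
ipw   <- mean((A == pi_x) * Y / p_pi)      # right-hand side of Lemma 1
truth <- mean(ifelse(pi_x == 1, Y1, Y0))   # E{Y*(pi(X))} from the simulation
```

Both estimates agree up to Monte Carlo error.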
Lemma 2
If Assumptions 1 through 4 hold, we have
\(E\left[ \dfrac{\mathbb {1}\{A=\pi (X)\}\{Y-g(X)\}}{P\{A=\pi (X)|X\}}\right] =E\{Y^*(\pi (X))\}-E\{g(X)\}\)
for any arbitrary function \(g:{\mathbb {R}}^p\mapsto {\mathbb {R}}\) (Laber and Zhao 2015).
Proof
Let \(g:{\mathbb {R}}^p\mapsto {\mathbb {R}}\) be an arbitrary function to define
Then, it holds that
and therefore
\(\square\)
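Lemma 2 can be illustrated the same way: subtracting an arbitrary \(g(X)\) from the outcome shifts the weighted mean by \(E\{g(X)\}\) uniformly over all regimes, so the maximizing regime is unchanged. A minimal R sketch (illustrative only; binary treatment and all model choices invented):

```r
# Subtracting g(X) from Y shifts the IPW mean by an estimate of E{g(X)};
# since the shift does not depend on pi, the argmax over regimes is preserved.
set.seed(3)
n    <- 1e5
X    <- runif(n)
p1   <- plogis(2 * X - 1)                 # P(A = 1 | X)
A    <- rbinom(n, 1, p1)
Y    <- 1 + X + A * (3 * X - 1) + rnorm(n)
pi_x <- as.numeric(X > 0.5)               # an arbitrary regime
p_pi <- ifelse(pi_x == 1, p1, 1 - p1)
g    <- function(x) 5 * x^2               # an arbitrary function g
lhs  <- mean((A == pi_x) * (Y - g(X)) / p_pi)
rhs  <- mean((A == pi_x) * Y / p_pi) - mean(g(X))
```

The two quantities `lhs` and `rhs` agree up to Monte Carlo error, matching the identity in Lemma 2.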
Adjusted R code of Tao et al. (2018)
1.1 Function DTRtree to grow tree with \({\mathcal {P}}^{\text {Tao}}\)
1.2 Function LZtree to grow tree with \({\mathcal {P}}^{LZ}\)
1.3 Example
This subsection provides the R code of the file 03 Example Application to demonstrate how to deploy the developed methods. The functions DTRtree and LZtree are available in the file 01 TRL Functions—along with other functions.
The example considers simulated data in the file 02 Example Data, which has a structure similar to the student success data but is not based on actual student data. For each student, an SAT Math Score, HSGPA, Age, Gender, and URM (underrepresented minority status) were simulated along with an overall grade. Three treatments were assumed to be available to every student: no program (encoded with 1), MSLC (encoded with 2), and SI (supplemental instruction; encoded with 3).
To estimate an OTR for the given data using either \({\mathcal {P}}^{\text {Tao}}\) or \({\mathcal {P}}^{LZ}\), we first need to choose a maximal tree depth, a minimal purity gain \(\lambda\), and a minimal node size \(\gamma\), following our algorithm in Sect. 2.3.
The remaining steps of the algorithm are then performed by the functions DTRtree for \({\mathcal {P}}^{\text {Tao}}\) and LZtree for \({\mathcal {P}}^{LZ}\).
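To make the role of these tuning parameters concrete, the following self-contained R sketch mimics a single step of the algorithm on simulated data: for each candidate cutoff, the IPW-best treatment is assigned in each child node, and the split is scored by the resulting purity, subject to a minimal node size \(\gamma\). This is an illustration of the mechanism only, not the paper's implementation; the data-generating model is invented so that splitting on Age at 18 is optimal by construction.

```r
# One greedy split under an IPW purity: score each candidate cutoff by the
# size-weighted IPW value of the best treatment in each child node.
set.seed(2)
n   <- 20000
age <- sample(17:25, n, replace = TRUE)
A   <- sample(1:3, n, replace = TRUE)        # randomized, so P(A = a | X) = 1/3
p   <- rep(1/3, n)
Y   <- 2 + 3 * (age <= 18) * (A == 2) + 3 * (age > 18) * (A == 3) + rnorm(n)

# IPW estimate of E{Y*(a)} among the students indexed by idx
ipw_value <- function(a, idx) mean((A[idx] == a) * Y[idx] / p[idx])

purity_of_split <- function(cutoff, gamma = 50) {
  left  <- which(age <= cutoff)
  right <- which(age > cutoff)
  if (length(left) < gamma || length(right) < gamma) return(-Inf)  # node size
  best_l <- max(sapply(1:3, ipw_value, idx = left))   # best treatment, left
  best_r <- max(sapply(1:3, ipw_value, idx = right))  # best treatment, right
  (length(left) * best_l + length(right) * best_r) / n
}

cutoffs  <- 17:24
best_cut <- cutoffs[which.max(sapply(cutoffs, purity_of_split))]
best_cut                                     # recovers the cutoff 18
```

In the full algorithm, this search runs over all covariates and cutoffs, a split is accepted only if the purity gain exceeds \(\lambda\), and the procedure recurses on the child nodes up to the maximal depth.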
The output taotree is given as a matrix:
For example, the first node is split with the help of the third covariate—which is age—at a value of 18. The purity \({\mathcal {P}}^{\text {Tao}}\) associated with this split is 608. No treatment is assigned as node 1 is not a terminal node. All students who are at most 18 are sent to node 2 which assigns treatment 2 (MSLC) to each student. All students who are older than 18 are differentiated according to their SAT Math score—which is the first covariate in the dataset. If they achieved an SAT Math Score of at most 580, they are recommended to attend SI (encoded as 3); otherwise, they should not attend any program (encoded as 1). Figure 5 displays the graphical representation of the output taotree.
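To connect the matrix output to regime assignments, the following R sketch rebuilds a hypothetical matrix consistent with the description above and walks a student down the tree. The column layout (node, splitting covariate index, cutoff, purity, assigned treatment) and the node numbering (children of node \(i\) are \(2i\) and \(2i+1\)) are assumptions for illustration; the actual taotree matrix may differ.

```r
# Hypothetical reconstruction of the taotree matrix described in the text.
taotree <- rbind(
  c(node = 1, X = 3,  cutoff = 18,  purity = 608, trt = NA),  # split on Age
  c(node = 2, X = NA, cutoff = NA,  purity = NA,  trt = 2),   # Age <= 18: MSLC
  c(node = 3, X = 1,  cutoff = 580, purity = NA,  trt = NA),  # split on SAT Math
  c(node = 6, X = NA, cutoff = NA,  purity = NA,  trt = 3),   # SAT <= 580: SI
  c(node = 7, X = NA, cutoff = NA,  purity = NA,  trt = 1)    # SAT >  580: none
)

# Follow one student's covariates down the tree to the recommended treatment.
assign_trt <- function(tree, x) {
  node <- 1
  repeat {
    row <- tree[tree[, "node"] == node, , drop = FALSE]
    if (!is.na(row[1, "trt"])) return(row[1, "trt"])          # terminal node
    node <- if (x[row[1, "X"]] <= row[1, "cutoff"]) 2 * node else 2 * node + 1
  }
}

t1 <- assign_trt(taotree, c(SATMath = 600, HSGPA = 3.5, Age = 17))  # 2: MSLC
t2 <- assign_trt(taotree, c(SATMath = 550, HSGPA = 3.0, Age = 20))  # 3: SI
```

The two calls reproduce the narrative above: the 17-year-old is sent to node 2 and assigned MSLC, while the 20-year-old with an SAT Math Score of 550 reaches node 6 and is assigned SI.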
About this article
Cite this article
Wilke, M.C., Levine, R.A., Guarcello, M.A. et al. Estimating the optimal treatment regime for student success programs. Behaviormetrika 48, 309–343 (2021). https://doi.org/10.1007/s41237-021-00140-0