Estimating the optimal treatment regime for student success programs

Abstract

We extend methods for estimating an optimal treatment regime (OTR) from the personalized medicine literature to educational data mining applications. As part of this development, we detail and modify the current state of the art, assess the efficacy of the approaches for student success studies, and provide practitioners with the machinery to apply the methods to their specific problems. Our particular interest is to estimate an optimal treatment regime for students enrolled in an introductory statistics course at San Diego State University (SDSU). The available treatments are combinations of three programs SDSU implemented to foster student success in this large-enrollment, bottleneck STEM course. We leverage tree-based reinforcement learning approaches based on either an inverse probability-weighted purity measure or an augmented probability-weighted purity measure. The OTR deduced in this way promises to significantly increase the average grade in the introductory course and also reveals the need for program recommendations to students, as only very few students selected their optimal treatment on their own.


References

  • Bang H, Robins JM (2005) Doubly robust estimation in missing data and causal inference models. Biometrics 61(4):962–973

  • Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins J (2018) Double debiased machine learning for treatment and structural parameters. Economet J 21(1):C1–C68

  • Doubleday K, Zhou H, Fu H, Zhou J (2018) An algorithm for generating individualized treatment decision trees and random forests. J Comput Graph Stat 27:849–860

  • Gill RD, Robins JM (2001) Causal inference for complex longitudinal data: the continuous case. Ann Stat 29(6):1785–1811

  • Kusner M, Russell C, Loftus J, Silva R (2019) Making decisions that reduce discriminatory impacts. In: Chaudhuri K, Salakhutdinov R (eds) Proceedings of the 36th international conference on machine learning, PMLR, proceedings of machine learning research, vol 97, pp 3591–3600

  • Kusner MJ, Loftus J, Russell C, Silva R (2017) Counterfactual fairness. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates Inc., New York, pp 4066–4076

  • Laber EB, Zhao YQ (2015) Tree-based methods for individualized treatment regimes. Biometrika 102:501–514

  • Luedtke AR, van der Laan MJ (2016) Optimal individualized treatments in resource-limited settings. Int J Biostat 12(1):283–303

  • Martin DC, Arendale DR (1992) Supplemental instruction: improving first-year student success in high-risk courses (2nd ed). National Resource Center for The First Year Experience

  • Meier Y, Xu J, Atan O, van der Schaar M (2016) Predicting grades. IEEE Trans Signal Process 64(4):959–972

  • Powell MG, Hull DM, Beaujean AA (2020) Propensity score matching for education data: worked examples. J Exp Educ 88(1):145–164

  • Qian M, Murphy SA (2011) Performance guarantees for individualized treatment rules. Ann Stat 39(2):1180–1210

  • R Core Team (2020) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/

  • Robins J (1986) A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Math Model 7(9–12):1393–1512

  • Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70:41–55

  • Rubin DB (1980) Randomization analysis of experimental data: the fisher randomization test comment. J Am Stat Assoc 75(371):591–593

  • Schneider M, Preckel F (2017) Variables associated with achievement in higher education: a systematic review of meta-analyses. Psychol Bull 143:565–600

  • Sobel ME (2006) What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference. J Am Stat Assoc 101(476):1398–1407

  • Spoon MK, Beemer J, Whitmer J, Fan J, Frazee PJ, Andrew J, Bohonak JS, Levine AR (2016) Random forests for evaluating pedagogy and informing personalized learning. Educ Data Min 20:20–50

  • Tao Y, Wang L (2017) Adaptive contrast weighted learning for multi-stage multi-treatment decision-making. Biometrics 73(1):145–155

  • Tao Y, Wang L, Almirall D (2018) Tree-based reinforcement learning for estimating optimal dynamic treatment regimes. Ann Appl Stat 12(3):1914–1938

  • Toth B, van der Laan M (2018) Targeted learning of optimal individualized treatment rules under cost constraints. In: Biopharmaceutical applied statistics symposium, pp 1–22

  • VanderWeele TJ, Hernan MA (2013) Causal inference under multiple versions of treatment. J Causal Inference 1(1):1–20

  • Xu Y, Greene T, Bress A, Sauer B, Bellows B, Zhang Y, Weintraub W, Moran A, Shen J (2020) Estimating the optimal individualized treatment rule from a cost-effectiveness perspective. Biometrics 20:1–15

  • Zhang B, Tsiatis AA, Laber EB, Davidian M (2012) A robust method for estimating optimal treatment regimes. Biometrics 68:1010–1018

  • Zhao Y, Zeng D, Rush AJ, Kosorok MR (2012) Estimating individualized treatment rules using outcome weighted learning. J Am Stat Assoc 107(499):1106–1118

Author information

Corresponding author

Correspondence to Richard A. Levine.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by Ryan Baker.

Appendices

Proofs

Lemma 1

If Assumptions 1 through 4 hold, we have

$$\begin{aligned} \pi ^{\text {opt}}(X) = \text {arg max}_{\pi \in \varPi } {\mathbb {E}}\left[ \frac{Y{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] . \end{aligned}$$

Proof

Recall the definition of the OTR in Eq. (1)

$$\begin{aligned} \pi ^{\text {opt}} = \text {arg max}_{\pi \in \varPi } {\mathbb {E}}[Y^*\{\pi (X)\}]. \end{aligned}$$

Hence, it is sufficient to show that

$$\begin{aligned} {\mathbb {E}}[Y^*\{\pi (X)\}] = {\mathbb {E}}\left[ \frac{Y{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] . \end{aligned}$$

We begin using the law of iterated expectation to get

$$\begin{aligned} {\mathbb {E}}[Y^*\{\pi (X)\}]&={\mathbb {E}}\left[ {\mathbb {E}}\left[ Y^*\{\pi (X)\}|X\right] \right] \nonumber \\&= {\mathbb {E}}\left[ {\mathbb {E}}\left[ \sum _{a\in {\mathcal {A}}}Y^*(a){\mathbb{1}}_{\{\pi (X)=a\}}\bigg |X\right] \right] . \end{aligned}$$
(13)

Now we leverage Assumption 3, which states that, conditioned on the characteristics X, all potential outcomes \(Y^*(a)\) are independent of the treatment A. This allows us to add to (13) the condition that treatment is assigned according to the regime \(\pi (X)\), which gives

$$\begin{aligned} {\mathbb {E}}[Y^*\{\pi (X)\}]&={\mathbb {E}}\left[ {\mathbb {E}}\left[ \sum _{a\in {\mathcal {A}}}Y^*(a){\mathbb{1}}_{\{\pi (X)=a\}}|X, A=\pi (X)\right] \right] \nonumber \\&={\mathbb {E}}\left[ \frac{{\mathbb {E}}\left[ \sum _{a\in {\mathcal {A}}}Y^*(a){\mathbb{1}}_{\{\pi (X)=a\}}{\mathbb{1}}_{\{A=\pi (X)\}}|X\right] }{P\{A=\pi (X)|X\}}\right] , \end{aligned}$$
(14)

where the last step uses the definition of conditional expectation. Note that the expression in Eq. (14) is well defined, as Assumption 4 guarantees \(P\{A=\pi (X)|X\}>0\). Since

$$\begin{aligned} {\mathbb{1}}_{\{\pi (X)=a\}}{\mathbb{1}}_{\{A=\pi (X)\}}&=\left\{ \begin{array}{ll} 1 &{} \, \text {if} \, \pi (X)=a \, \text {and} \, A=\pi (X) \\ 0 &{} \, \text {otherwise}\\ \end{array} \right. \nonumber \\&=\left\{ \begin{array}{ll} 1 &{} \, \text {if} \, \pi (X)=a=A\\ 0 &{} \, \text {otherwise} \\ \end{array} \right. \nonumber \\&=\left\{ \begin{array}{ll} 1 &{} \, \text {if} \, A=a \, \text {and} \, A=\pi (X)\\ 0 &{} \, \text {otherwise} \\ \end{array} \right. \nonumber \\&={\mathbb{1}}_{\{A=a\}}{\mathbb{1}}_{\{A=\pi (X)\}}, \end{aligned}$$
(15)

we can rewrite Eq. (14) as

$$\begin{aligned} {\mathbb {E}}[Y^*\{\pi (X)\}]&={\mathbb {E}}\left[ \frac{{\mathbb {E}}\left[ \sum _{a\in {\mathcal {A}}}Y^*(a){\mathbb{1}}_{\{A=a\}}{\mathbb{1}}_{\{A=\pi (X)\}}|X\right] }{P\{A=\pi (X)|X\}}\right] \nonumber \\&={\mathbb {E}}\left[ \frac{{\mathbb {E}}\left[ \left\{ \sum _{a\in {\mathcal {A}}}Y^*(a){\mathbb{1}}_{\{A=a\}}\right\} {\mathbb{1}}_{\{A=\pi (X)\}}|X\right] }{P\{A=\pi (X)|X\}}\right] \nonumber \\&={\mathbb {E}}\left[ \frac{{\mathbb {E}}\left[ Y{\mathbb{1}}_{\{A=\pi (X)\}}|X\right] }{P\{A=\pi (X)|X\}}\right] \nonumber \\&={\mathbb {E}}\left[ {\mathbb {E}}\left[ \frac{Y{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\bigg | X\right] \right] \nonumber \\&={\mathbb {E}}\left[ \frac{Y{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] , \end{aligned}$$
(16)

where Eq. (16) uses Assumption 2 together with the fact that \(P\{A=\pi (X)|X\}\) is \(\sigma (X)\)-measurable, \(\sigma (X)\) denoting the \(\sigma\)-algebra generated by X. This completes the proof. \(\square\)
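As a minimal illustration (not part of the supplementary files), the sample analog of the value functional in Lemma 1 can be computed for a candidate regime as follows; all object names in this sketch are assumptions.

# Sample analog of the IPW value in Lemma 1 (sketch; names are illustrative).
#   Y      numeric outcome vector
#   A      observed treatment vector
#   pi_x   treatments recommended by the candidate regime, i.e. pi(X_i)
#   p_hat  estimated propensities P{A_i = pi(X_i) | X_i}, assumed positive (Assumption 4)
ipw_value <- function(Y, A, pi_x, p_hat) {
  mean(Y * (A == pi_x) / p_hat)
}

Maximizing this quantity over a class of candidate regimes is the empirical counterpart of the arg max in Lemma 1.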

Lemma 2

If Assumptions 1 through 4 hold, we have

$$\begin{aligned} \pi ^{\text {opt}}(X) = \text {arg max}_{\pi \in \varPi } {\mathbb {E}}\left[ \frac{\{Y-g(X)\}{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] . \end{aligned}$$

for any arbitrary function \(g:{\mathbb {R}}^p\mapsto {\mathbb {R}}\) (Laber and Zhao 2015).

Proof

Let \(g:{\mathbb {R}}^p\mapsto {\mathbb {R}}\) be an arbitrary function and define

$$\begin{aligned} L_g\{\pi (X)\}=\frac{\{Y-g(X)\}{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}. \end{aligned}$$
(17)

Then, it holds that

$$\begin{aligned} {\mathbb {E}}[L_g\{\pi (X)\}]&= {\mathbb {E}}\left[ \frac{\{Y-g(X)\}{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] \\&= {\mathbb {E}}\left[ \frac{Y{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] -{\mathbb {E}}\left[ \frac{g(X){\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] \\&= {\mathbb {E}}\left[ \frac{Y{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] -{\mathbb {E}}\left[ {\mathbb {E}}\left[ \frac{g(X){\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\bigg |X\right] \right] \\&={\mathbb {E}}\left[ \frac{Y{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] -{\mathbb {E}}\left[ \frac{g(X)}{P\{A=\pi (X)|X\}}{\mathbb {E}}[{\mathbb{1}}_{\{A=\pi (X)\}}|X]\right] \\&={\mathbb {E}}\left[ \frac{Y{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] -{\mathbb {E}}\left[ \frac{g(X)}{P\{A=\pi (X)|X\}}P\{A=\pi (X)|X\}\right] \\&={\mathbb {E}}\left[ \frac{Y{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] -{\mathbb {E}}\left[ g(X)\right] , \end{aligned}$$

and therefore

$$\begin{aligned} \text {arg max}_{\pi \in \varPi } {\mathbb {E}}[L_g\{\pi (X)\}]&= \text {arg max}_{\pi \in \varPi } {\mathbb {E}}\left[ \frac{Y{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] -{\mathbb {E}}\left[ g(X)\right] \\&= \text {arg max}_{\pi \in \varPi } {\mathbb {E}}\left[ \frac{Y{\mathbb{1}}_{\{A=\pi (X)\}}}{P\{A=\pi (X)|X\}}\right] \\&=\pi ^{\text {opt}}(X). \end{aligned}$$

\(\square\)
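Lemma 2 has a direct empirical reading: evaluating the IPW value on the shifted outcome \(Y-g(X)\) changes the population objective only by the constant \({\mathbb {E}}[g(X)]\), which does not depend on the regime, so the maximizing regime is unchanged. A minimal sketch (object names as in the hypothetical ipw_value above):

# IPW value computed on the shifted outcomes Y - g(X) (sketch).
# In expectation this equals the plain IPW value minus E[g(X)], a term that does
# not depend on the regime, so the maximizing regime is the same for every g.
shifted_ipw_value <- function(Y, A, pi_x, p_hat, g_x) {
  mean((Y - g_x) * (A == pi_x) / p_hat)
}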

Adjusted R code of Tao et al. (2018)

1.1 Function DTRtree to grow tree with \({\mathcal {P}}^{\text {Tao}}\)

[R code listing: figures a and b]

1.2 Function LZtree to grow tree with \({\mathcal {P}}^{LZ}\)

[R code listing: figures c and d]
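The listings referenced above are available only as figures here. As a rough, simplified stand-in for the idea of an inverse probability-weighted purity (this is not the implementation of \({\mathcal {P}}^{LZ}\) or \({\mathcal {P}}^{\text {Tao}}\) from 01 TRL Functions), the following sketch scores a candidate binary split by assigning each child node the treatment with the largest IPW mean outcome and summing the two node values; all names are assumptions.

# Simplified sketch of an IPW-based split score (not the paper's purity measure).
#   Y         outcome vector
#   A         treatment vector, assumed coded 1, ..., K
#   prop_hat  n x K matrix of estimated propensities, prop_hat[i, a] = P(A_i = a | X_i)
#   x, cut    splitting covariate and candidate cut point
ipw_split_score <- function(Y, A, prop_hat, x, cut) {
  node_value <- function(idx) {
    if (length(idx) == 0) return(0)  # an empty child node contributes nothing
    vals <- sapply(seq_len(ncol(prop_hat)), function(a) {
      # IPW estimate of the mean outcome in this node under fixed treatment a
      sum((A[idx] == a) / prop_hat[idx, a] * Y[idx]) / length(idx)
    })
    max(vals)  # value of the best single treatment for this node
  }
  left  <- which(x <= cut)
  right <- which(x > cut)
  node_value(left) + node_value(right)
}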

1.3 Example

This subsection provides the R code of the file 03 Example Application to demonstrate how to deploy the developed methods. The functions DTRtree and LZtree are available, along with other functions, in the file 01 TRL Functions.

[R code listing: figure e]

The example considers simulated data in the file 02 Example Data, which has a structure similar to the student success data but is not based on actual student records. For each student, an SAT Math score, HSGPA, Age, Gender, and URM status were simulated along with an overall grade. Three treatments were assumed to be available to every student: no program (encoded as 1), MSLC (encoded as 2), and SI (encoded as 3).

[R code listing: figure f]
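Because the data-loading listing is shown only as a figure, the following sketch generates a data frame with the same column structure as described above; all column names, codings, and distributions are assumptions made for illustration and do not reproduce 02 Example Data.

# Hypothetical stand-in for 02 Example Data (names, codings, and distributions assumed).
set.seed(1)
n <- 500
example_data <- data.frame(
  SATMath = round(rnorm(n, mean = 550, sd = 80)),             # SAT Math score
  HSGPA   = round(runif(n, min = 2.0, max = 4.0), 2),         # high school GPA
  Age     = sample(17:25, size = n, replace = TRUE),
  Gender  = rbinom(n, size = 1, prob = 0.5),
  URM     = rbinom(n, size = 1, prob = 0.3),                  # under-represented minority indicator
  A       = sample(1:3, size = n, replace = TRUE),            # 1 = no program, 2 = MSLC, 3 = SI
  Grade   = pmax(0, pmin(4, rnorm(n, mean = 2.8, sd = 0.7)))  # overall grade (4-point scale assumed)
)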

To estimate an OTR for the given data using either \({\mathcal {P}}^{\text {Tao}}\) or \({\mathcal {P}}^{LZ}\), we first need to choose a maximal tree depth, a minimal purity gain \(\lambda\), and a minimal node size \(\gamma\), following our algorithm in Sect. 2.3.

[R code listing: figure g]
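The parameter settings in the figure above are not reproduced here; an illustrative choice (the values are assumptions, not the ones used in the paper) could look like:

# Illustrative tuning parameters for the tree-growing algorithm (values assumed).
max_depth <- 3     # maximal tree depth
lambda    <- 0.05  # minimal purity gain required to accept a split
gamma     <- 20    # minimal node size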

The remaining steps of the algorithm are then performed by the functions DTRtree for \({\mathcal {P}}^{\text {Tao}}\) and LZtree for \({\mathcal {P}}^{LZ}\).

[R code listing: figure h]
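Since the calls themselves appear only as a figure, the following is a hypothetical sketch of how they might look; the actual argument names and order of DTRtree and LZtree are defined in 01 TRL Functions and may differ from those assumed here.

# Hypothetical calls (argument names are assumptions; see 01 TRL Functions for the
# actual interfaces of DTRtree and LZtree).
X <- example_data[, c("SATMath", "HSGPA", "Age", "Gender", "URM")]
taotree <- DTRtree(Y = example_data$Grade, A = example_data$A, X = X,
                   depth = max_depth, lambda = lambda, minsize = gamma)
lztree  <- LZtree(Y = example_data$Grade, A = example_data$A, X = X,
                  depth = max_depth, lambda = lambda, minsize = gamma)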

The output taotree is given as a matrix:

[R output: figure i]

For example, the first node is split on the third covariate, which is age, at a value of 18. The purity \({\mathcal {P}}^{\text {Tao}}\) associated with this split is 608. No treatment is assigned, as node 1 is not a terminal node. All students aged at most 18 are sent to node 2, which assigns treatment 2 (MSLC) to each of them. All students older than 18 are further differentiated according to their SAT Math score, the first covariate in the dataset: if they achieved an SAT Math score of at most 580, they are recommended to attend SI (encoded as 3); otherwise, they should not attend any program (encoded as 1). Figure 5 displays the graphical representation of the output taotree.
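The decision rule read off from taotree can also be written directly as an R function; the following sketch implements exactly the regime described in the preceding paragraph (treatment codes as in the example data).

# The estimated regime described above, written out as a plain R function.
# Treatment codes: 1 = no program, 2 = MSLC, 3 = SI.
assign_treatment <- function(age, sat_math) {
  if (age <= 18) {
    2L                      # node 2: MSLC for students aged at most 18
  } else if (sat_math <= 580) {
    3L                      # SI for older students with SAT Math score of at most 580
  } else {
    1L                      # no program otherwise
  }
}
assign_treatment(age = 19, sat_math = 560)  # returns 3 (SI)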

Fig. 5: Example tree grown with the help of the purity measure \({\mathcal {P}}^{\text {Tao}}\)

About this article

Cite this article

Wilke, M.C., Levine, R.A., Guarcello, M.A. et al. Estimating the optimal treatment regime for student success programs. Behaviormetrika 48, 309–343 (2021). https://doi.org/10.1007/s41237-021-00140-0
