
Leveraging Granularity: Hierarchical Reinforcement Learning for Pedagogical Policy Induction


Abstract

In interactive e-learning environments such as Intelligent Tutoring Systems, pedagogical decisions can be made at different levels of granularity. In this work, we focus on making decisions at two levels, whole problems vs. single steps, and explore three types of granularity: problem-level only (Prob-Only), step-level only (Step-Only), and both problem and step levels (Both). More specifically, for Prob-Only, our pedagogical agent decides whether the next problem should be a worked example (WE) or problem solving (PS). In WEs, students observe how the tutor solves a problem, while in PSs students solve the problem themselves. For Step-Only, the agent decides whether to elicit the student’s next solution step or to tell the step directly. Here the student and the tutor co-construct the solution, and we refer to this type of task as collaborative problem-solving (CPS). For Both, the agent first decides whether the next problem should be a WE, a PS, or a CPS; based on the problem-level decision, it then makes step-level decisions on whether to elicit or tell each step. In a series of classroom studies, we compare the three types of granularity under random yet reasonable pedagogical decisions. Results showed that while Prob-Only may be less effective for High students and Step-Only may be less effective for Low ones, Both can be effective for both High and Low students. Motivated by these findings, we propose and apply an offline, off-policy Gaussian Process-based Hierarchical Reinforcement Learning (HRL) framework to induce a hierarchical pedagogical policy that makes adaptive, effective decisions at both the problem and step levels. In an empirical classroom study, our results showed that the HRL policy is significantly more effective than a Deep Q-Network (DQN) induced step-level policy and a random yet reasonable step-level baseline policy.


Notes

  1. Fewer students were assigned to the WE condition, because another purpose of this study was to collect training data for inducing the HRL policy.

  2. A square root was used in this definition to reduce the variance and the difference between different incoming competence groups; see Appendix D for a comparison of the two NLG definitions.

References

  • Anderson, J. R. (1993). Problem solving and learning. American Psychologist, 48(1), 35.


  • Anderson, J. R., Corbett, A. T., Koedinger, K. R., & Pelletier, R. (1995). Cognitive tutors: Lessons learned. The journal of the learning sciences, 4(2), 167–207.


  • Andrychowicz, M., Baker, B., & et al. (2018). Learning dexterous in-hand manipulation. arXiv:1808.00177.

  • Azizsoltani, H., Kim, Y. J., Ausin, M. S., Barnes, T., & Chi, M. (2019). Unobserved is not equal to non-existent: Using gaussian processes to infer immediate rewards across contexts. IJCAI, 1974–1980.

  • Azizsoltani, H., & Sadeghi, E. (2018). Adaptive sequential strategy for risk estimation of engineering systems using gaussian process regression active learning. Engineering Applications of Artificial Intelligence, 74(July), 146–165.


  • Barto, A. G., & Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems, 13(1-2), 41–77.


  • Beck, J., Woolf, B. P., & Beal, C. R. (2000). Advisor: a machine learning architecture for intelligent tutor construction. AAAI/IAAI, 2000(552-557), 1–2.


  • Chaiklin, S., et al. (2003). The zone of proximal development in vygotsky’s analysis of learning and instruction. Vygotsky’s educational theory in cultural context, 1, 39–64.


  • Chi, M., & Vanlehn, K. (2007). Accelerated future learning via explicit instruction of a problem solving strategy. Frontiers In Artificial Intelligence And Applications, 158, 409.


  • Chi, M., & VanLehn, K. (2010). Meta-cognitive strategy instruction in intelligent tutoring systems: how, when, and why. Educational Technology & Society, 13(1), 25–39.


  • Chi, M., VanLehn, K., Litman, D., & Jordan, P. (2011). Empirically evaluating the application of reinforcement learning to the induction of effective and adaptive pedagogical strategies. User Modeling and User-Adapted Interaction, 21 (1-2), 137–180.


  • Clement, B., Oudeyer, P. Y., & Lopes, M. (2016). A comparison of automatic teaching strategies for heterogeneous student populations. In EDM 16-9th international conference on educational data mining.

  • Cronbach, L. J., & Snow, R. E. (1977). Aptitudes and instructional methods: A handbook for research on interactions. Irvington.

  • Cuayáhuitl, H., Dethlefs, N., Frommberger, L., Richter, K. F., & Bateman, J. (2010). Generating adaptive route instructions using hierarchical reinforcement learning. In International conference on spatial cognition (pp. 319–334). Springer.

  • Doroudi, S., Aleven, V., & Brunskill, E. (2017). Robust evaluation matrix: Towards a more principled offline exploration of instructional policies. In Proceedings of the fourth (2017) ACM conference on learning@ scale (pp. 3–12).

  • Doroudi, S., Aleven, V., & Brunskill, E. (2019). Where’s the reward? International Journal of Artificial Intelligence in Education, 29(4), 568–620. https://doi.org/10.1007/s40593-019-00187-x.


  • Eaton, M. L. (1983). Multivariate statistics: A vector space approach. New York: John Wiley & Sons.


  • Feller, W. (2008). An introduction to probability theory and its applications Vol. 2. Hoboken: Wiley.


  • Goldberg, P. W., Williams, C. K., & et al. (1998). Regression with input-dependent noise: a gaussian process treatment. In NIPS (pp. 493–499).

  • Guo, D., Shamai, S., & Verdú, S. (2005). Mutual information and minimum mean-square error in gaussian channels. IEEE Transactions on Information Theory, 51(4), 1261–1282.


  • Haarnoja, T., Zhou, A., & et al. (2018). Soft actor-critic algorithms and applications. arXiv:1812.05905.

  • Iglesias, A., Martínez, P., Aler, R., & Fernández, F. (2009). Learning teaching strategies in an adaptive and intelligent educational system through reinforcement learning. Applied Intelligence, 31(1), 89–106.


  • Iglesias, A., Martínez, P., Aler, R., & Fernández, F. (2009). Reinforcement learning of pedagogical policies in adaptive and intelligent educational systems. Knowledge-Based Systems, 22(4), 266–270.


  • Kalyuga, S., & Renkl, A. (2010). Expertise reversal effect and its instructional implications: Introduction to the special issue. Instructional Science, 38 (3), 209–215.


  • Kulkarni, T. D., Narasimhan, K., Saeedi, A., & Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems (pp. 3675–3683).

  • Lillicrap, T. P., Hunt, J. J., & et al. (2015). Continuous control with deep reinforcement learning. arXiv:1509.02971.

  • Liz, B., Dreyfus, T., Mason, J., Tsamir, P., Watson, A., & Zaslavsky, O. (2006). Exemplification in mathematics education. In Proceedings of the 30th conference of the international group for the psychology of mathematics education. ERIC, (Vol. 1 pp. 126–154 ).

  • Mandel, T., Liu, Y. E., Levine, S., Brunskill, E., & Popovic, Z. (2014). Offline policy evaluation across representations with applications to educational games. In Proceedings of the 2014 international conference on autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems (pp. 1077–1084).

  • McLaren, B. M., van Gog, T., Ganoe, C., Yaron, D., & Karabinos, M. (2014). Exploring the assistance dilemma: Comparing instructional support in examples and problems. In Intelligent tutoring systems (pp. 354–361). Springer.

  • McLaren, B. M., & Isotani, S. (2011). When is it best to learn with all worked examples?. In International conference on artificial intelligence in education (pp. 222–229). Springer.

  • McLaren, B. M., Lim, S. J., & Koedinger, K. R. (2008). When and how often should worked examples be given to students? new results and a summary of the current state of research. In Cogsci (pp. 2176–2181).

  • Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., & et al. (2015). Human-level control through deep reinforcement learning. Nature, 518 (7540), 529.


  • Najar, A. S., Mitrovic, A., & McLaren, B. M. (2014). Adaptive support versus alternating worked examples and tutored problems: Which leads to better learning?. In UMAP (pp. 171–182). Springer.

  • Peng, X. B., Berseth, G., Yin, K., & Van De Panne, M. (2017). Deeploco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Transactions on Graphics (TOG), 36(4), 41.


  • Phobun, P., & Vicheanpanya, J. (2010). Adaptive intelligent tutoring systems for e-learning systems. Procedia-Social and Behavioral Sciences, 2(2), 4064–4069.


  • Rafferty, A. N., Brunskill, E., Griffiths, T. L., & Shafto, P. (2016). Faster teaching via pomdp planning. Cognitive science, 40(6), 1290–1332.


  • Rasmussen, C. E. (2004). Gaussian processes in machine learning. In Advanced lectures on machine learning (pp. 63–71). Springer.

  • Renkl, A., Atkinson, R. K., Maier, U. H., & Staley, R. (2002). From example study to problem solving: Smooth transitions help learning. The Journal of Experimental Education, 70(4), 293–315.


  • Rowe, J., Mott, B., & Lester, J. (2014). Optimizing player experience in interactive narrative planning: a modular reinforcement learning approach. In Tenth artificial intelligence and interactive digital entertainment conference.

  • Rowe, J. P., & Lester, J. C. (2015). Improving student problem solving in narrative-centered learning environments: a modular reinforcement learning framework. In International conference on artificial intelligence in education (pp. 419–428). Springer.

  • Ryan, M., & Reid, M. (2000). Learning to fly: an application of hierarchical reinforcement learning. In Proceedings of the 17th international conference on machine learning. Citeseer.

  • Salden, R. J., Aleven, V., Schwonke, R., & Renkl, A. (2010). The expertise reversal effect and worked examples in tutored problem solving. Instructional Science, 38(3), 289–307.


  • Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience replay. arXiv:1511.05952.

  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning (pp. 1889–1897).

  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.

  • Schwab, D., & Ray, S. (2017). Offline reinforcement learning with task hierarchies. Machine Learning, 106(9-10), 1569–1598.


  • Schwonke, R., Renkl, A., Krieg, C., Wittwer, J., Aleven, V., & Salden, R. (2009). The worked-example effect: Not an artefact of lousy control conditions. Computers in Human Behavior, 25(2), 258–266.


  • Shen, S., Ausin, M. S., Mostafavi, B., & Chi, M. (2018). Improving learning & reducing time: a constrained action-based reinforcement learning approach. In Proceedings of the 26th conference on user modeling, adaptation and personalization (pp. 43–51). ACM.

  • Shen, S., & Chi, M. (2016). Reinforcement learning: the sooner the better, or the later the better?. In Proceedings of the 2016 conference on user modeling adaptation and personalization (pp. 37–44). ACM.

  • Shih, B., Koedinger, K. R., & Scheines, R. (2011). A response time model for bottom-out hints as worked examples. Handbook of educational data mining, 201–212.

  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., & et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484.


  • Silver, D., Hubert, T., Schrittwieser, J., & et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419), 1140–1144.


  • Snow, R. E. (1991). Aptitude-treatment interaction as a framework for research on individual differences in psychotherapy. Journal of Consulting and Clinical Psychology, 59(2), 205.


  • Stamper, J. C., Eagle, M., Barnes, T., & Croy, M. (2011). Experimental evaluation of automatic hint generation for a logic tutor. In International conference on artificial intelligence in education (pp. 345–352). Springer.

  • Sutton, R. S., Precup, D., & Singh, S. (1999). Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), 181–211.


  • Sweller, J., & Cooper, G. A. (1985). The use of worked examples as a substitute for problem solving in learning algebra. Cognition and Instruction, 2 (1), 59–89.


  • Swetz, F. (1995). To know and to teach: Mathematical pedagogy from a historical context. Educational Studies in Mathematics, 29(1), 73–88.


  • Swetz, F. J. (1987). Capitalism and arithmetic: The new math of the 15th century, including the full text of the Treviso arithmetic of 1478, translated by David Eugene Smith. Open Court Publishing.

  • Van Gog, T., Kester, L., & Paas, F. (2011). Effects of worked examples, example-problem, and problem-example pairs on novices’ learning. Contemporary Educational Psychology, 36(3), 212–218.


  • Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double q-learning. In AAAI. Phoenix, AZ, (Vol. 2 p. 5).

  • Vanlehn, K. (2006). The behavior of tutoring systems. IJAIED, 16(3), 227–265.


  • VanLehn, K., Bhembe, D., Chi, M., Lynch, C., Schulze, K., Shelby, R., Taylor, L., Treacy, D., Weinstein, A., & Wintersgill, M. (2004). Implicit versus explicit learning of strategies in a non-procedural cognitive skill. In International conference on intelligent tutoring systems (pp. 521–530). Springer.

  • Vinyals, O., Babuschkin, I., Czarnecki, W., & et al. (2019). Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575, 350.


  • Wang, X., Chen, W., Wu, J., Wang, Y. F., & Yang Wang, W. (2018). Video captioning via hierarchical reinforcement learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4213–4222).

  • Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2015). Dueling network architectures for deep reinforcement learning. arXiv:1511.06581.

  • Williams, J.D. (2008). The best of both worlds: unifying conventional dialog systems and pomdps. In INTERSPEECH (pp. 1173–1176).

  • Zhou, G., Azizsoltani, H., Ausin, M. S., Barnes, T., & Chi, M. (2019). Hierarchical reinforcement learning for pedagogical policy induction. In International conference on artificial intelligence in education.

  • Zhou, G., & Chi, M. (2017). The impact of decision agency & granularity on aptitude treatment interaction in tutoring. In Proceedings of the 39th annual conference of the cognitive science society (pp. 3652–3657).

  • Zhou, G., Lynch, C., Price, T. W., Barnes, T., & Chi, M. (2016). The impact of granularity on the effectiveness of students’ pedagogical decision. In Proceedings of the 38th annual conference of the cognitive science society (pp. 2801–2806).

  • Zhou, G., Price, T. W., Lynch, C., Barnes, T., & Chi, M. (2015). The impact of granularity on worked examples and problem solving. In Proceedings of the 37th annual conference of the cognitive science society (pp. 2817–2822).

  • Zhou, G., Wang, J., Lynch, C., & Chi, M. (2017). Towards closing the loop: Bridging machine-induced pedagogical policies to learning theories. In EDM.

  • Zhou, G., Yang, X., Azizsoltani, H., Barnes, T., & Chi, M. (2020). Improving student-tutor interaction through data-driven explanation of hierarchical reinforcement induced pedagogical policies. In Proceedings of the 28th conference on user modeling, adaptation and personalization. ACM.


Acknowledgements

This research was supported by the NSF Grants: CAREER: Improving Adaptive Decision Making in Interactive Learning Environments (1651909), Integrated Data-driven Technologies for Individualized Instruction in STEM Learning Environments (1726550), Generalizing Data-Driven Technologies to Improve Individualized STEM Instruction by Intelligent Tutors (2013502), and Educational Data Mining for Individualized Instruction in STEM Learning Environments (1432156).


Corresponding author

Correspondence to Guojing Zhou.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: An Example Training Problem

Table 7 shows an example training problem. Steps in the problem are organized as a series of main steps. In training, students first select the main step to work on and then carry it out. To reduce students’ typing and calculation load, the tutor completes the “specify given” and “solve equation” procedures for them, as shown in the “Tutor” column.

Table 7 An example training problem

Appendix B: Analysis on Granularity and Incoming Competence with Median Split

Table 8 shows the test score and training time results for the High and Low incoming competence groups (split by the median of the pre-test). A two-way ANCOVA for the Full post-test on the factors of granularity and incoming competence, using the pre-test score as a covariate, showed a marginally significant interaction effect: F(2, 291) = 2.59, p = 0.077, η = 0.011, but no significant main effect of granularity or incoming competence. Subsequent contrast analysis showed that for High students, there was a trend for the BothH group to score higher than the ProbH group: t(291) = 1.81, p = 0.071, d = 0.39. The StepH group also scored 4.3 points higher than the ProbH group, but the difference was not significant: t(291) = − 1.53, p = 0.127, d = 0.35. The StepH and BothH groups scored similarly with no significant difference: t(291) = 0.28, p = 0.782, d = 0.07. For Low students, there was a trend for the ProbL group to score higher than the StepL group: t(291) = 1.66, p = 0.098, d = 0.29. BothL also appeared to score higher than StepL, but the difference was not significant: t(291) = 1.38, p = 0.168, d = 0.26. ProbL and BothL scored similarly with no significant difference: t(291) = − 0.16, p = 0.870, d = 0.02.

Table 8 Learning performance and time on task results with median split

For time on task, a two-way ANOVA on granularity and incoming competence showed a significant interaction effect: F(2, 292) = 3.15, p = 0.044, η = 0.020, and a significant main effect of granularity: F(2, 292) = 4.89, p = 0.008, η = 0.031, in that Prob-Only and Both spent less time than Step-Only: t(295) = − 3.00, p = 0.003, d = 0.40 and t(295) = − 2.08, p = 0.039, d = 0.32, respectively. Subsequent contrast analysis showed that for High students, StepH spent significantly more time than ProbH and BothH: t(292) = 2.47, p = 0.014, d = 0.45 and t(292) = 3.20, p = 0.002, d = 0.77, respectively; but there was no significant difference between ProbH and BothH: t(292) = − 0.83, p = 0.406, d = 0.15. For Low students, StepL and BothL spent significantly more time than ProbL: t(292) = 2.07, p = 0.039, d = 0.40 and t(292) = 2.00, p = 0.047, d = 0.44, respectively; but there was no significant difference between StepL and BothL: t(292) = 0.11, p = 0.915, d = 0.02.
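For reference, a two-way ANCOVA/ANOVA of this kind can be run with statsmodels. The sketch below is a minimal illustration under assumed column names (posttest, pretest, time_on_task, granularity, competence) and a placeholder file name; it is not the study’s actual analysis script.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical dataframe: one row per student with post-test and pre-test scores,
# time on task, granularity condition (Prob-Only / Step-Only / Both), and
# competence group (High / Low) from a median split on the pre-test.
df = pd.read_csv("granularity_scores.csv")  # placeholder file name

# Two-way ANCOVA on the post-test: granularity x competence, pre-test as covariate.
ancova = smf.ols("posttest ~ C(granularity) * C(competence) + pretest", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=2))

# Two-way ANOVA on time on task (no covariate).
anova = smf.ols("time_on_task ~ C(granularity) * C(competence)", data=df).fit()
print(sm.stats.anova_lm(anova, typ=2))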

Appendix C: Features Used for State Representation

C.1 Autonomy

Autonomy features describe the amount of work the student or the tutor has done, either recently or over a long period; a short sketch of how a few of these counters could be computed appears after the lists below. The following 4 features describe the amount of work the student or the tutor has done recently.

  • ntellsSinceElicit: The number of tells the student has received since the last elicit.

  • ntellsSinceElicitKC: ntellsSinceElicit for the current KC.

  • nElicitSinceTell: The number of elicits the student has received since the last tell.

  • nElicitSinceTellKC: nElicitSinceTell for the current KC.

The following 6 features describe the amount of work the student or the tutor has done over a long period.

  • pctElicit: The total number of elicit steps divided by the total number of steps the students have received so far.

  • pctElicitKC: pctElicit for the current KC.

  • pctElicitSession: pctElicit for the current session.

  • pctElicitKCSession: pctElicit for the current KC and the current session.

  • nTellSession: The total number of tells the student has received so far in the current session.

  • nTellKCSession: nTellSession for the current KC.
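As a concrete illustration of how such counters could be derived, the following minimal Python sketch computes ntellsSinceElicit and pctElicit from a simplified, hypothetical decision log; the Step record and log layout are assumptions for illustration only, not the tutor’s actual data schema.

from dataclasses import dataclass

@dataclass
class Step:
    """One tutorial decision in a simplified, hypothetical log."""
    action: str  # "elicit" or "tell"
    kc: str      # knowledge component practiced in this step

def ntells_since_elicit(log):
    """ntellsSinceElicit: consecutive tells since the most recent elicit."""
    count = 0
    for step in reversed(log):
        if step.action == "elicit":
            break
        count += 1
    return count

def pct_elicit(log):
    """pctElicit: elicit steps divided by all steps received so far."""
    if not log:
        return 0.0
    return sum(s.action == "elicit" for s in log) / len(log)

log = [Step("elicit", "complement"), Step("tell", "complement"), Step("tell", "add2")]
print(ntells_since_elicit(log))  # 2 tells since the last elicit
print(pct_elicit(log))           # 1/3 of the steps so far were elicits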

C.2 Temporal

Temporal features describe time-related information, such as the amount of time the student has spent on the current session or on a specific KC. The following five features are calculated based on the difference between two timestamps, such as the difference between the current timestamp and the beginning of the current session.

  • durationKCBetweenDecision: The time since the last tutorial decision was made on the current KC.

  • timeInSession: The time that has elapsed since the start of the current session.

  • timeBetweenSession: The time elapsed between the end of the previous session and the beginning of the current one.

  • timeOnCurrentProblem: The time elapsed since the start of the current problem.

  • timeOnLastStepKCElicit: The time the student spent on the last elicit step with the same KC as the current step.

In the following, the total time is defined as the sum of the time the student has spent on the steps that were the focus of the training. All other intervals, such as between-problem intervals or time spent on irrelevant steps, were excluded. The following 12 features describe the total amount of time the student has spent on certain materials.

  • timeOnTutoring: The total time the student has spent on the tutoring.

  • timeOnTutoringTell: The total time the student has spent on tells.

  • timeOnTutoringElicit: The total time the student has spent on Elicits.

  • timeOnTutoringKC: The total time the student has spent on the current KC.

  • timeOnTutoringKCTell: The total time the student has spent on the current KC with tell.

  • timeOnTutoringKCElicit: The total time the student has spent on the current KC with elicit.

  • timeOnTutoringSession: The total time the student has spent on the current session.

  • timeOnTutoringSessionTell: timeOnTutoringSession with tells.

  • timeOnTutoringSessionElicit: timeOnTutoringSession with elicits.

  • timeOnTutoringProblem: The total time the student has spent on the current problem.

  • timeOnTutoringProblemTell: timeOnTutoringProblem with tells.

  • timeOnTutoringProblemElicit: timeOnTutoringProblem with elicits.

The following 12 features describe the student’s working speed; a brief sketch of how these averages could be computed appears after the list.

  • avgTimeOnStep: The average time the student spent on each step.

  • avgTimeOnStepTell: The average time the student spent on each tell step.

  • avgTimeOnStepElicit: The average time the student spent on each elicit step.

  • avgTimeOnStepKC: avgTimeOnStep for the current KC.

  • avgTimeOnStepKCTell: avgTimeOnStepTell for the current KC.

  • avgTimeOnStepKCElicit: avgTimeOnStepElicit for the current KC.

  • avgTimeOnStepSession: avgTimeOnStep for the current session.

  • avgTimeOnStepSessionTell: avgTimeOnStepTell for the current session.

  • avgTimeOnStepSessionElicit: avgTimeOnStepElicit for the current session.

  • avgTimeOnStepProblem: avgTimeOnStep for the current problem.

  • avgTimeOnStepProblemTell: avgTimeOnStepTell for the current problem.

  • avgTimeOnStepProblemElicit: avgTimeOnStepElicit for the current problem.
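The working-speed averages can be computed in the same way as the counters above; the sketch below illustrates avgTimeOnStepKC and avgTimeOnStepKCElicit over hypothetical per-step timing records (the record layout is an assumption for illustration only).

from statistics import mean

# Hypothetical per-step records: (kc, action, seconds spent on the step).
records = [
    ("complement", "elicit", 42.0),
    ("complement", "tell", 15.5),
    ("add2", "elicit", 60.0),
]

def avg_time_on_step_kc(records, kc):
    """avgTimeOnStepKC: average seconds per step for the given KC."""
    times = [t for k, _, t in records if k == kc]
    return mean(times) if times else 0.0

def avg_time_on_step_kc_elicit(records, kc):
    """avgTimeOnStepKCElicit: average seconds per elicit step for the given KC."""
    times = [t for k, a, t in records if k == kc and a == "elicit"]
    return mean(times) if times else 0.0

print(avg_time_on_step_kc(records, "complement"))         # 28.75
print(avg_time_on_step_kc_elicit(records, "complement"))  # 42.0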

C.3 Problem Solving

Problem solving features describe the context of the learning environment, such as the difficulty of the current problem and the students’ progress. The following seven features describe the student’s progress and the amount of practice they have done.

  • stepOrdering: The total number of steps the student has received so far.

  • stepOrderingSession: stepOrdering for the current session.

  • stepOrderingPb: stepOrdering for the current problem.

  • nKCs: The number of steps the student has completed for the current KC.

  • nKCsAsElicit: The number of elicit steps the student has completed for the current KC.

  • nKCsSession: nKCs for the current session.

  • nKCsSessionElicit: nKCsAsElicit for the current session.

The following nine features describe the category and difficulty level of the current problem or step.

  • earlyTraining: For the first two problems and the first conditional probability problem, the value is 1 and for the rest, the value is 0.

  • simpleProblem: For the first two problems and the first two conditional probability problems, the value is 1 and for the rest, the value is 0.

  • newLevelDifficulty: If the current problem is more complicated than the prior problem, the value is 1; otherwise, the value is 0. In our case, the value is one for the first, third, fifth, eighth, tenth, and twelfth problem.

  • performanceDifficulty: Students’ average performance on the current KC (calculated based on our historical data); more specifically, \(\frac {correct~elicits}{total~elicits}\) across all students.

  • principleDifficulty: The difficulty of the principle needed for the current step, which depends on the equation of the principle. If the step does not require a probability principle, the value is 1 (easiest).

  • principleCategory: If the current step requires a probability theorem principle, the value is 1; if it requires a conditional probability principle, the value is 2; and if it does not require a probability principle the value is 0.

  • problemDifficulty: The difficulty of the current problem, which is calculated based on the principles needed to solve the problem.

  • problemComplexity: The value of this feature is determined by the number of principle applications needed to solve the current problem, 2 for easy problems (first, eighth, eleventh), 3 for medium problems (second, third, ninth and tenth) and 4 for hard problems (fourth, fifth, sixth, seventh, and twelfth).

  • problemCategory: If the problem does not require any conditional probability principle to solve, the value is 0, otherwise the value is 1.

The following three features describe the number of principles that appeared in the current problem or session.

  • nPrincipleInProblem: The number of principles needed to solve the current problem (some principles may be applied more than once).

  • nDistinctPrincipleInSession: The total number of distinct principles that have appeared in the current session.

  • nPrincipleInSession: The total number of principles that have appeared in the current session.

The following nine features describe the tutor’s use of words and probability concepts.

  • nTutorConceptsSession: The number of probability concepts the tutor has mentioned so far in the current session.

  • tutAverageConcepts: The average number of probability concepts the tutor has mentioned in each step.

  • tutAverageConceptsSession: tutAverageConcepts for the current session.

  • tutConceptsToWords: The number of probability concepts the tutor has mentioned divided by the total number of words the tutor has used so far.

  • tutConceptsToWordsSession: tutConceptsToWords for the current session.

  • tutAverageWords: The average number of words the tutor used in each step.

  • tutAverageWordsSession: tutAverageWords for the current session.

  • tutAverageWordsElicit: The average number of words the tutor used in each elicit step.

  • tutAverageWordsSessionElicit: tutAverageWordsElicit for the current session.

The following feature is about quantitative and qualitative steps.

  • quantitativeDegree: The number of quantitative steps (select principle and apply principle) the student has received divided by the total number of steps the student has completed.

The following six features describe the number of times each probability principle is needed to solve the current problem. Conditional probability principles are not included because they are not heavily needed for problem solving (in terms of occurrence), and the conditional probability problems appear late in the training process.

  • nAdd2Prob: The number of times the Addition Theorem for Two Events is needed to solve the current problem.

  • nAdd3Prob: The number of times the Addition Theorem for Three Events is needed to solve the current problem.

  • nDeMorProb: The number of times the De Morgan’s Theorem is needed to solve the current problem.

  • nIndeProb: The number of times the Independent Theorem is needed to solve the current problem.

  • nCompProb: The number of times the Complement Theorem is needed to solve the current problem.

  • nMutualProb: The number of times the Mutually Exclusive Theorem is needed to solve the current problem.

C.4 Performance

Performance features describe the students’ competence level. The following twelve features are performance measures calculated based on the number of correct/incorrect steps or the percentage of correct steps; a short sketch of how the two headline measures, pctCorrect and pctOverallCorrect, could be computed follows the list below.

  • pctCorrect: The number of elicit steps the student has correctly solved (on the first attempt) divided by the total number of elicit steps the student has received so far.

  • pctOverallCorrect: Denote the number of tell steps the student has received so far as tells, the number of elicit steps the student has correctly solved as correct elicits, and the total number of steps the student has received so far as steps. The feature value is calculated following the equation \(\frac {tells + correct~elicits}{steps}\).

  • nCorrectKC: The total number of elicit steps the student has correctly solved for the current KC so far.

  • nIncorrectKC: The total number of elicit steps the student failed to solve on the first attempt for the current KC so far.

  • pctCorrectKC: pctCorrect for the current KC.

  • pctOverallCorrectKC: pctOverallCorrect for the current KC.

  • nCorrectKCSession: nCorrectKC for the current session.

  • nIncorrectKCSession: nIncorrectKC for the current session.

  • pctCorrectSession: pctCorrect for the current session.

  • pctCorrectKCSession: pctCorrectKC for the current session.

  • pctOverallCorrectSession: pctOverallCorrect for the current session.

  • pctOverallCorrectKCSession: pctOverallCorrectKC for the current session.
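As referenced above, the following minimal sketch shows how pctCorrect and pctOverallCorrect could be computed from per-step records; the record layout (an action label plus a first-attempt correctness flag) is a hypothetical simplification for illustration only.

def pct_correct(steps):
    """pctCorrect: correct first-attempt elicits divided by all elicit steps."""
    elicits = [s for s in steps if s["action"] == "elicit"]
    if not elicits:
        return 0.0
    return sum(s["first_attempt_correct"] for s in elicits) / len(elicits)

def pct_overall_correct(steps):
    """pctOverallCorrect: (tells + correct elicits) / total steps."""
    if not steps:
        return 0.0
    tells = sum(s["action"] == "tell" for s in steps)
    correct_elicits = sum(
        s["action"] == "elicit" and s["first_attempt_correct"] for s in steps
    )
    return (tells + correct_elicits) / len(steps)

steps = [
    {"action": "tell", "first_attempt_correct": False},
    {"action": "elicit", "first_attempt_correct": True},
    {"action": "elicit", "first_attempt_correct": False},
]
print(pct_correct(steps))          # 0.5: one of the two elicits was correct
print(pct_overall_correct(steps))  # ~0.667: (1 tell + 1 correct elicit) / 3 steps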

The following twelve features describe the steps the student has received, and the time elapsed, since the last wrong elicit step.

  • nStepSinceLastWrong: The number of steps (both elicit and tell) the student has completed since the last wrong elicit step (where the student failed the first attempt).

  • nStepSinceLastWrongKC: nStepSinceLastWrong for the current KC.

  • nTellsSinceLastWrong: The number of tell steps the student has received since the last wrong elicit step.

  • nTellsSinceLastWrongKC: nTellsSinceLastWrong for the current KC.

  • nStepSinceLastWrongSession: nStepSinceLastWrong for the current session.

  • nStepSinceLastWrongKCSession: nStepSinceLastWrong for the current KC in the current session.

  • nTellsSinceLastWrongSession: nTellsSinceLastWrong for the current session.

  • nTellsSinceLastWrongKCSession: nTellsSinceLastWrong for the current KC in the current session.

  • timeSinceLastWrongStepKC: The time that has elapsed since the last wrong elicit step for the current KC.

  • nCorrectElicitStepSinceLastWrong: The number of elicit steps the student has successfully solved since the last wrong elicit step.

  • nCorrectElicitStepSinceLastWrongKC: nCorrectElicitStepSinceLastWrong for the current KC.

  • nCorrectElicitStepSinceLastWrongKCSession: nCorrectElicitStepSinceLastWrong for the current KC in the current session.

The following eight features describe students’ performance on the steps that require a probability principle (the select- or apply-principle steps).

  • pctCorrectPrin: pctCorrect for the steps that require a probability principle.

  • pctCorrectPrinSession: pctCorrectPrin for the current session.

  • nStepSinceLastWrongPrin: nStepSinceLastWrong for the steps that require a probability principle.

  • nTellsSinceLastWrongPrin: nTellsSinceLastWrong for the steps that require a probability principle.

  • nStepSinceLastWrongPrinSession: nStepSinceLastWrongPrin for the current session.

  • nTellsSinceLastWrongPrinSession: nTellsSinceLastWrongPrin for the current session.

  • nCorrectElicitStepSinceLastWrongPrin: nCorrectElicitStepSinceLastWrong for the steps that require a probability principle.

  • nCorrectElicitStepSinceLastWrongPrinSession: nCorrectElicitStepSinceLastWrongPrin for the current session.

The following four features describe students’ performance on the first occurred select- and apply-principle steps in each problem, which are more complicated than the other principle-related steps.

  • pctCorrectFirst: pctCorrect for the first occurred select- and apply-principle steps in each problem.

  • nStepsSinceLastWrongFirst: nStepSinceLastWrong for the first occurred select- and apply-principle steps in each problem.

  • nTellsSinceLastWrongFirst: nTellsSinceLastWrong for the first occurred select- and apply-principle steps in each problem.

  • nCorrectElicitStepSinceLastWrongFirst: nCorrectElicitStepSinceLastWrong for the first occurred select- and apply-principle steps in each problem.

The following two features describe students’ performance on the last problem.

  • pctCorrectLastProb: pctCorrect for all the steps in the last problem.

  • pctCorrectLastProbPrin: pctCorrect for all the steps that require a probability principle in the last problem.

The following 18 features describe students’ current competence on the six probability principles.

  • pctCorrectAdd2Select: pctCorrect for the select-principle steps that require selecting the Addition Theorem for Two Events.

  • pctCorrectAdd3Select: pctCorrect for the select-principle steps that require selecting the Addition Theorem for Three Events.

  • pctCorrectCompSelect: pctCorrect for the select-principle steps that require selecting the Complement Theorem.

  • pctCorrectDeMorSelect: pctCorrect for the select-principle steps that require selecting the De Morgan’s Law.

  • pctCorrectIndeSelect: pctCorrect for the select-principle steps that require selecting the Independent Theorem.

  • pctCorrectMutualSelect: pctCorrect for the select-principle steps that require selecting the Mutually Exclusive Theorem.

  • pctCorrectAdd2Apply: pctCorrect for the apply-principle steps that require entering the equation of the Addition Theorem for Two Events.

  • pctCorrectAdd3Apply: pctCorrect for the apply-principle steps that require entering the equation of the Addition Theorem for Three Events.

  • pctCorrectCompApply: pctCorrect for the apply-principle steps that require entering the equation of the Complement Theorem.

  • pctCorrectDeMorApply: pctCorrect for the apply-principle steps that require entering the equation of the De Morgan’s Law.

  • pctCorrectIndeApply: pctCorrect for the apply-principle steps that require entering the equation of the Independent Theorem.

  • pctCorrectMutualApply: pctCorrect for the apply-principle steps that require entering the equation of the Mutually Exclusive Theorem.

  • pctCorrectAdd2All: pctCorrect for the select- or apply-principle steps that require the Addition Theorem for Two Events.

  • pctCorrectAdd3All: pctCorrect for the select- or apply-principle steps that require the Addition Theorem for Three Events.

  • pctCorrectCompAll: pctCorrect for the select- or apply-principle steps that require the Complement Theorem.

  • pctCorrectDeMorAll: pctCorrect for the select- or apply-principle steps that require the De Morgan’s Law.

  • pctCorrectIndeAll: pctCorrect for the select- or apply-principle steps that require the Independent Theorem.

  • pctCorrectMutualAll: pctCorrect for the select- or apply-principle steps that require the Mutually Exclusive Theorem.

The following feature describes students’ competence in selecting main steps.

  • pctCorrectSelectMain: pctCorrect for the steps that require the student to select the next main step.

C.5 Hints

The following five features describe the number of hints the student requested in a certain period.

  • nTotalHint: The total number of hints the student has requested so far.

  • nTotalHintSession: nTotalHint for the current session.

  • nHintKC: nTotalHint for the current KC.

  • nHintSessionKC: nTotalHint for the current KC in the current session.

  • nTotalHintProblem: nTotalHint for the current problem.

The following six features describe the student’s hint request behavior or working behavior in hint-requested steps.

  • AvgTimeOnStepWithHint: The average time the students spent on each hint-requested step.

  • durationSinceLastHint: The time that has elapsed since the last hint was requested.

  • stepsSinceLastHint: The number of steps the student has completed since the last hint-requested step.

  • stepsSinceLastHintKC: stepsSinceLastHint for the current KC.

  • totalTimeStepsHint: The total time the student has spent on hint-requested steps.

  • totalStepsHint: The total number of steps where hints were requested.

Appendix D: Two NLG Definitions

In order to choose a reliable reward measure, we compared two NLG definitions, \(\frac {posttest-pretest}{1-pretest}\) and \(\frac {posttest-pretest}{\sqrt {1-pretest}}\), using the data collected in the Granularity studies. Table 9 shows a comparison of the two NLG scores for the High and Low groups respectively (to be consistent with the score range in the main paper, all numbers are multiplied by 100). As expected, the square root reduced the variance, especially for the High group (from 120.24 to 35.18). In addition, the square root raised the average of the High group from -35.79 to -10.04 and reduced the average of the Low group from 15.23 to 10.15, which reduced the difference between the High and Low groups from 51.02 to 20.19.

Table 9 A comparison of the two NLG definitions
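The effect of the square-root denominator can be checked with a short calculation. The sketch below compares the two NLG definitions on a couple of made-up pre/post-test pairs (scores on a 0-1 scale, multiplied by 100 to match the tables); the example data is illustrative only.

import math

def nlg_plain(pre, post):
    """(posttest - pretest) / (1 - pretest)."""
    return (post - pre) / (1.0 - pre)

def nlg_sqrt(pre, post):
    """(posttest - pretest) / sqrt(1 - pretest)."""
    return (post - pre) / math.sqrt(1.0 - pre)

# A high-pretest student who slips slightly, and a low-pretest student who gains.
for pre, post in [(0.9, 0.85), (0.3, 0.5)]:
    print(f"pre={pre:.2f} post={post:.2f}  "
          f"NLG={100 * nlg_plain(pre, post):6.1f}  "
          f"sqrt-NLG={100 * nlg_sqrt(pre, post):6.1f}")
# The square root shrinks the large negative NLG of the high-pretest student
# (-50.0 vs. -15.8) far more than it changes the low-pretest student's gain.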

Appendix E: Gaussian Processes for Q-function Approximation

Standard GP regression is used to approximate the Q-function. Recall that in the context of GPs, a function can be specified by a mean and a covariance function. In Q-function approximation, the GP takes state-action-Q observations \((S,A) \rightarrow Q\) and a prior covariance function (kernel) as input and specifies the Q-function’s posterior mean and covariance: \((\hat{\mathbf{Q}}^{\pi})^{\boldsymbol{\prime}} \sim \mathcal{N} (\overline{(\hat{\mathbf{Q}}^{\pi})^{\boldsymbol{\prime}}}, \text{COV} ((\hat{\mathbf{Q}}^{\pi})^{\boldsymbol{\prime}}))\). To model possible uncertainty, we add independent and identically distributed noise to the prior covariance function: \(\mathcal{E} \sim \mathcal{N} (0,\sigma_{n}^{2})\). According to the theorem of conditional probability density functions for multivariate Gaussians (Rasmussen, 2004), the mean \(\overline{(\hat{\mathbf{Q}}^{\pi})^{\boldsymbol{\prime}}}\) and covariance \(\text{COV}((\hat{\mathbf{Q}}^{\pi})^{\boldsymbol{\prime}})\) of the posterior distribution can be calculated using the following two equations (Goldberg et al., 1998; Rasmussen, 2004):

$$ \overline{(\hat{\mathbf{Q}}^{\pi})^{\boldsymbol{\prime}}} = K(\boldsymbol{X}^{\boldsymbol{\prime}}, \boldsymbol{X}) \left[K(\boldsymbol{X},\boldsymbol{X}) + \sigma_{n}^{2} \mathbf{I}\right]^{-1} \overline{\hat{\mathbf{Q}}^{\pi}} $$
(7)
$$ \text{COV}\left((\hat{\mathbf{Q}}^{\pi})^{\boldsymbol{\prime}}\right) = K(\boldsymbol{X}^{\boldsymbol{\prime}}, \boldsymbol{X}^{\boldsymbol{\prime}}) + \mathbf{C}_{\hat{\mathbf{Q}}^{\pi}} - K(\boldsymbol{X}^{\boldsymbol{\prime}},\boldsymbol{X}) \left[K(\boldsymbol{X},\boldsymbol{X}) + \sigma_{n}^{2}\mathbf{I}\right]^{-1} K(\boldsymbol{X},\boldsymbol{X}^{\boldsymbol{\prime}}). $$
(8)

where \(\boldsymbol{X}\) denotes the observation points (the state-action pairs (S, A) in our training data), \(\overline{\hat{\mathbf{Q}}^{\pi}}\) and \(\mathbf{C}_{\hat{\mathbf{Q}}^{\pi}}\) are the mean and covariance matrix of the corresponding observed Q values for the (S, A) pairs, \(\boldsymbol{X}^{\boldsymbol{\prime}}\) denotes the approximation points (the state-action pairs whose Q-values the GP estimates), \(\sigma_{n}^{2}\) is the noise variance parameter, \(K(\boldsymbol{X}, \boldsymbol{X})\) is the covariance matrix evaluated on the observation points, \(K(\boldsymbol{X}^{\boldsymbol{\prime}}, \boldsymbol{X}^{\boldsymbol{\prime}})\) is the covariance matrix evaluated on the approximation points, and \(K(\boldsymbol{X}, \boldsymbol{X}^{\boldsymbol{\prime}})\) is the covariance matrix evaluated between the observation and approximation points (Rasmussen, 2004).
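The two equations above can be implemented with a few matrix operations. The following is a minimal NumPy sketch of the posterior computation, under stated assumptions: a squared-exponential kernel, made-up observation data, and the \(\mathbf{C}_{\hat{\mathbf{Q}}^{\pi}}\) term (the covariance of the inferred Q values themselves) omitted for brevity. It is an illustration of standard GP regression, not the paper’s actual training pipeline.

import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_q_posterior(X, q_obs, X_star, noise_var=0.1, length_scale=1.0):
    """Posterior mean and covariance of Q at the approximation points X_star.

    X:      (n, d) observed state-action pairs
    q_obs:  (n,)   observed Q values at X
    X_star: (m, d) state-action pairs whose Q values are estimated
    """
    K = rbf_kernel(X, X, length_scale)               # K(X, X)
    K_s = rbf_kernel(X_star, X, length_scale)        # K(X', X)
    K_ss = rbf_kernel(X_star, X_star, length_scale)  # K(X', X')
    K_inv = np.linalg.inv(K + noise_var * np.eye(len(X)))
    mean = K_s @ K_inv @ q_obs                       # Eq. 7
    cov = K_ss - K_s @ K_inv @ K_s.T                 # Eq. 8 without the C term
    return mean, cov

# Tiny made-up example: five observed state-action points in a 3-d feature space.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
q_obs = rng.normal(size=5)
X_star = rng.normal(size=(2, 3))
mean, cov = gp_q_posterior(X, q_obs, X_star)
print(mean.shape, cov.shape)  # (2,) (2, 2)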


Cite this article

Zhou, G., Azizsoltani, H., Ausin, M.S. et al. Leveraging Granularity: Hierarchical Reinforcement Learning for Pedagogical Policy Induction. Int J Artif Intell Educ 32, 454–500 (2022). https://doi.org/10.1007/s40593-021-00269-9
