The Tutorial Dialogue of AutoTutor

This article reflects on a paper published in 2001 on “Teaching Tactics and Dialogue in AutoTutor” (Graesser et al. 2001). AutoTutor is a pedagogical agent that holds a conversation with students in natural language and simulates the dialogue moves of human tutors as well as ideal pedagogical strategies (Graesser et al. 2004; Graesser et al. 2008; see Nye et al. 2014, for an in-depth history of 17 years of AutoTutor). My colleagues and I were inspired by the notion that there is something about the conversation mechanisms in a tutoring session that helps people learn (Graesser et al. 1995). And indeed, untrained tutors do help students learn better than classroom interactions and various other ecological controls (Graesser et al. 2011). We were also inspired by the notion that some of the discourse moves of tutors could be improved if they were guided by ideal pedagogical principles. A combination of natural discourse interaction and ideal tutor moves would therefore be the magic formula to improve student learning.

We wrestled with the possibility that ideal computer tutoring moves may sometimes differ from the normal conversational moves of human tutors. For example, our analyses of human tutors revealed that they are prone to follow principles of conversational politeness, so they are reluctant to give negative feedback when a student’s contribution is incorrect or vague (Graesser et al. 1995; Person et al. 1995). Accurate feedback sometimes needs to be sacrificed in order to promote confidence and self-efficacy in the student (Lepper and Woolverton 2002). However, many students expect computers to be accurate rather than polite. Consequently, there is a trade-off between feedback accuracy and the promotion of politeness or self-esteem. There also appeared to be illuminating differences in the pragmatic ground rules of communication with computers versus humans (Person et al. 1995). Given the various trade-offs and incompatible predictions, we envisioned a program of research to investigate the impact of specific tutoring strategies and conversation patterns on student learning and motivation. This program of research continues to evolve among colleagues investigating automated tutorial dialogue both in Memphis (Nye et al. 2014; Rus et al. 2013) and in other labs (e.g., Dzikovska et al. 2014; Johnson and Lester 2015; Ward et al. 2013).

The AutoTutor project was launched in 1997, at a point in history when animated conversational agents emerged and penetrated learning environments. The agents were computerized talking heads or embodied animated avatars that generate speech, actions, facial expressions, and gestures. Some of the agents were very rigid and scripted, whereas AutoTutor attempted to adapt to the knowledge states, verbosity, and emotional states of the learner. AutoTutor was indeed successful in tracking the student’s knowledge states and adaptively generating dialogue moves (Graesser et al. 2004; Jackson and Graesser 2006; Nye et al. 2014; VanLehn et al. 2007). We also developed an affect-sensitive AutoTutor that responded intelligently to the emotions of the student, such as confusion, frustration, and boredom (D’Mello and Graesser 2012). The power of conversational agents is that designers can precisely specify what an agent expresses and does under specific conditions, a level of precision that human tutors could never exhibit. Agents can guide the learner on what to do next, deliver didactic instruction, hold collaborative conversations, and model ideal behavior, strategies, reflections, and social interactions. Pedagogical agents have become increasingly popular in contemporary adaptive learning environments, to name just a few systems: DeepTutor (Rus et al. 2013), Betty’s Brain (Biswas et al. 2010), iSTART (McNamara et al. 2006), Crystal Island (Rowe et al. 2010), Guru Tutor (Olney et al. 2012), and Operation ARIES (Millis et al. 2011). These systems have covered topics in STEM (physics, biology, computer literacy), reading comprehension, scientific reasoning, and other domains and skills.

AutoTutor and these other systems with pedagogical agents have helped students learn compared to various control conditions. In the case of AutoTutor, reviews covering multiple studies report average learning gains between 0.3 sigma (Nye et al. 2014) and 0.8 sigma (Graesser et al. 2008) when compared to reading text for an equivalent amount of time; the effect sizes are substantially higher in comparisons with pre-tests and no-study controls (Graesser et al. 2004; VanLehn et al. 2007). Human tutors have not differed greatly from AutoTutor and other ITSs with natural language interaction in experiments that provide direct comparisons with trained human tutors (Olney et al. 2012; VanLehn 2011; VanLehn et al. 2007). For example, in a direct comparison between AutoTutor and one-to-one human tutoring with experienced tutors in computer-mediated conversations (either typed or spoken), the learning gains on the topic of Newtonian physics were virtually equivalent (VanLehn et al. 2007). Given these encouraging results from human and computer tutoring, we investigated what it is about conversation that helps student learning and motivation.

Conversation Patterns in AutoTutor and Human Tutors

We conducted a series of experiments that attempted to identify the features of AutoTutor that might account for improvements in learning (Graesser et al. 2004, 2008; Kopp et al. 2012; VanLehn et al. 2007). It is beyond the scope of this article to cover all of these features, but a few are particularly noteworthy. One noteworthy finding is that it is not the talking head that accounts for most of the improvement, but rather the content of what the agent says and what the student says. The talking head has only a small advantage over the agent conveying its dialogue moves in print or spoken modalities. Learning from AutoTutor is not appreciably different from conditions where the learner is guided to read small snippets of text or summaries of a solution at opportunistic points in time. From the standpoint of student input modality, learning is no different whether students express their contributions by speech or keyboard (D’Mello et al. 2011). Simply put, it is the content that matters: what gets expressed at the right time in a conversation?

Another noteworthy conclusion concerns the robustness of the core conversation mechanisms in both AutoTutor and most human tutoring. As mentioned earlier, many of the core conversation mechanisms in AutoTutor are similar to those of human tutoring. We documented the major conversation mechanisms of human tutors who tutored middle school children in mathematics and college students in research methods. The detailed anatomy of human tutoring was based on nearly 100 tutoring sessions that were videotaped, transcribed, and analyzed in depth (Graesser et al. 1997; Graesser and Person 1994; Graesser et al. 1995; Person et al. 1994; Person et al. 1995). In particular, one discourse mechanism in both AutoTutor and human tutoring is called expectation and misconception-tailored dialogue (EMT dialogue). Human tutors anticipate particular correct answers (called expectations) and particular misunderstandings (misconceptions) when they ask the students challenging questions (or pose problems) and track the students’ answers. As the students express their answers, which are distributed over multiple conversational turns, their contributions are compared with the expectations and misconceptions through semantic pattern matching. The tutors give feedback on the students’ answers according to how well they match the expectations or misconceptions. Some feedback is short, consisting of positive, neutral, or negative expressions conveyed in words, intonation, or facial expressions. After the short feedback, the tutor tries to lead the student to express the expectations (good answers) through multiple dialogue moves, such as pumps (“What else?”), hints, or prompts to get the student to express specific words. When the student fails to answer the question correctly, the tutor contributes the information as assertions. These pump-hint-prompt-assertion cycles are implemented in AutoTutor (and are frequent in human tutoring, Graesser et al. 1995) to extract or cover particular sentence-like expectations. Eventually, all of the expectations are covered and the exchange on the main question or problem is finished.
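To make the mechanism concrete, the following is a minimal sketch (in Python) of how a pump-hint-prompt-assertion cycle might cover a single expectation. The class and function names, the crude word-overlap matcher, and the coverage threshold are illustrative assumptions rather than AutoTutor’s actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Expectation:
    text: str            # the sentence-like good answer
    hints: List[str]     # hints that lead toward the expectation
    prompts: List[str]   # prompts that elicit specific missing words
    coverage: float = 0.0

def semantic_match(student_turns: List[str], expectation: Expectation) -> float:
    """Crude word-overlap stand-in for AutoTutor's semantic matcher."""
    said = {w.lower() for turn in student_turns for w in turn.split()}
    expected = {w.lower() for w in expectation.text.split()}
    return len(said & expected) / max(len(expected), 1)

def next_move(expectation: Expectation, step: int) -> str:
    """Escalate from pump to hints to prompts; assert the answer as a last resort."""
    if step == 0:
        return "What else can you add?"                 # pump
    if step - 1 < len(expectation.hints):
        return expectation.hints[step - 1]              # hint
    prompt_idx = step - 1 - len(expectation.hints)
    if prompt_idx < len(expectation.prompts):
        return expectation.prompts[prompt_idx]          # prompt
    return expectation.text                             # assertion

def cover_expectation(expectation: Expectation,
                      get_student_turn: Callable[[str], str],
                      threshold: float = 0.7) -> None:
    """Cycle through dialogue moves until the expectation is covered or asserted."""
    turns: List[str] = []
    step = 0
    while expectation.coverage < threshold:
        move = next_move(expectation, step)
        if move == expectation.text:   # tutor asserts the expectation and moves on
            expectation.coverage = 1.0
            break
        turns.append(get_student_turn(move))
        expectation.coverage = semantic_match(turns, expectation)
        step += 1
```

In practice the same loop runs over the full set of expectations for a main question, with the tutor selecting the next uncovered expectation once the current one is covered.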

It is feasible to implement EMT dialogue computationally because it relies on semantic pattern matching and attempts to achieve pattern completion (through hints and prompts). This is a simpler mechanism than interpreting natural language from scratch, which is beyond the boundaries of reliable natural language processing. EMT dialogue is not only frequent in human tutoring, but it also creates reasonably smooth conversations in AutoTutor and helps students learn. Interestingly, human tutors rarely use sophisticated tutoring strategies that are difficult to implement on a computer, such as bona fide Socratic tutoring, modeling-scaffolding-fading, building on prerequisites, and dialogue moves that scaffold metacognitive strategies (Cade et al. 2008; Graesser et al. 1995). Automated computer tutors may show major advantages over human tutors when systems can reliably implement these more sophisticated strategies.

AutoTutor successfully implemented nearly all of the conversational mechanisms of human tutors, but one notable exception is that it could not handle most student questions. Student questions are infrequent in most classroom and tutoring environments because the teacher or tutor tends to control the agenda (Graesser et al. 1995). However, when students do ask questions, the relevance and correctness of the answers are disappointing in AutoTutor, as they are in other automated environments. We have had to implement diversionary tactics to handle students’ questions, such as “How would you answer that question?” or “AutoTutor cannot answer that question now.” As a consequence, student questioning unfortunately extinguishes quickly in tutoring sessions with AutoTutor (Graesser and McNamara 2010).

We continued to question the use of human tutors, even expert tutors, as the gold standard in the design of AutoTutor. We identified a number of blind spots and questionable tactics of human tutors (Graesser et al. 2011) that could potentially be improved by incorporating ideal tutoring strategies. For example, tutors are prone to give a summary recap of a solution to a problem, or an answer to a difficult question, that required many conversational turns. It would sometimes be better to have the student give the summary recap in order to promote active student learning, to encourage the student to practice articulating the information, or to allow the tutor to diagnose remaining deficits. As another example, tutors often assume that the student understands what the tutor expresses in an exchange, whereas students often do not understand, even partially. Indeed, there often is a large gulf between the knowledge of the student and that of the tutor. It would sometimes be better for the tutor to ask follow-up questions to verify the extent to which the student understands what the tutor is attempting to communicate. Ideal tutoring strategies are needed to augment or replace some of the typical conversation patterns in human tutoring.

One of the pervasive challenges throughout the development of AutoTutor and subsequent learning environments has been optimizing the semantic match scores between the students’ verbal contributions and AutoTutor’s anticipated answers (both the expectations and the misconceptions). The student’s contributions over dozens of conversational turns in a single dialogue are constantly compared semantically with the set of expectations and misconceptions. A speech act classifier segments the student’s verbal input within a turn into speech acts and assigns each speech act to a category, such as a question, statement, metacognitive expression (e.g., “I do not know”), or short response, as designated in the Dialogue Advancer Network (Graesser et al. 2001). The statements are the only speech acts that are compared with the expectations and misconceptions through the semantic matching algorithms. An expectation (or misconception) is considered covered if its match score meets or exceeds a threshold parameter.
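As a concrete illustration, here is a hedged sketch of how a speech act classifier might gate which student contributions reach the semantic matcher. The category labels follow the description above, but the regular expressions, the segmentation rule, and the function names are illustrative assumptions, not the actual Dialogue Advancer Network.

```python
import re
from typing import List

# Illustrative patterns only; the real classifier uses a richer scheme.
SPEECH_ACT_PATTERNS = [
    ("question",       re.compile(r"\?\s*$|^(what|why|how|when|where|who|does|is|can)\b", re.I)),
    ("metacognitive",  re.compile(r"\b(i (do not|don't) know|i'?m (lost|confused)|no idea)\b", re.I)),
    ("short_response", re.compile(r"^(yes|no|ok(ay)?|right|sure|maybe)\W*$", re.I)),
]

def classify_speech_act(utterance: str) -> str:
    """Assign a single category to one speech act; default to 'statement'."""
    for label, pattern in SPEECH_ACT_PATTERNS:
        if pattern.search(utterance.strip()):
            return label
    return "statement"

def statements_for_matching(turn_text: str) -> List[str]:
    """Segment a turn into speech acts and keep only the statements,
    since only statements are compared with expectations and misconceptions."""
    speech_acts = re.split(r"(?<=[.!?])\s+", turn_text.strip())
    return [sa for sa in speech_acts if sa and classify_speech_act(sa) == "statement"]

def is_covered(match_score: float, threshold: float = 0.7) -> bool:
    """An expectation (or misconception) counts as covered at or above threshold."""
    return match_score >= threshold
```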

We have evaluated many semantic matchers over the years. The best results come from a combination of latent semantic analysis (LSA) (Landauer et al. 2007), frequency-weighted word overlap (rarer words and negations have higher weight), and regular expressions. In fact, LSA plus regular expressions has achieved reliability scores against human experts that are comparable to the agreement between pairs of human experts (Cai et al. 2011). Interestingly, syntactic computations did not prove useful in these analyses because a high percentage of the students’ contributions are telegraphic, elliptical, and ungrammatical. Researchers who have developed tutorial dialogue systems with deep syntactic parsers (e.g., BEETLE II, Dzikovska et al. 2014) routinely point out the limitations of syntactic parsers when students’ language contributions are of low quality.
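The sketch below shows one way the three matchers could be blended into a single score. The blending weights, the bag-of-words cosine standing in for a corpus-trained LSA cosine, and the function names are assumptions for illustration, not AutoTutor’s tuned implementation.

```python
import math
import re
from collections import Counter
from typing import Dict, Optional

def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine as a crude stand-in for an LSA cosine, which would
    use vectors trained on a large corpus (Landauer et al. 2007)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def weighted_overlap(student: str, expectation: str, word_freq: Dict[str, int]) -> float:
    """Word overlap in which rarer words carry more weight (a fuller version
    would also up-weight negations)."""
    s_words = {w.lower() for w in student.split()}
    e_words = {w.lower() for w in expectation.split()}
    def weight(w: str) -> float:
        return 1.0 / math.log(2 + word_freq.get(w, 0))
    total = sum(weight(w) for w in e_words) or 1.0
    return sum(weight(w) for w in e_words & s_words) / total

def combined_match(student: str, expectation: str, word_freq: Dict[str, int],
                   regex: Optional[str] = None,
                   weights=(0.5, 0.4, 0.1)) -> float:
    """Blend the LSA-like cosine, weighted overlap, and a regular-expression check."""
    w_lsa, w_overlap, w_regex = weights
    regex_hit = 1.0 if regex and re.search(regex, student, re.I) else 0.0
    return (w_lsa * bow_cosine(student, expectation)
            + w_overlap * weighted_overlap(student, expectation, word_freq)
            + w_regex * regex_hit)
```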

We learned, after many years, that a semantic match algorithm with impressive fidelity will not necessarily go the distance in meeting the students’ wishes. Two problems continue to haunt us. The first problem concerns the students’ standards on what it means to cover a sentence-like answer correctly. If a good answer has four content words (A, B, C, D) that ideally are expressed, the students want full credit if they express only one or two of the distinctive words (e.g., A and B). They get frustrated when their partial answers receive only neutral or negative feedback from the tutor; the students think they have covered the sentence answer, but AutoTutor does not score it as covered unless they express the remaining words (C and D). The students assume that shared knowledge should be sufficient to fill in the remaining words, whereas AutoTutor wants to see a more complete answer articulated. The second problem concerns the semantic blur that invariably occurs between expectations and misconceptions when the matching relies on statistical algorithms like LSA, word overlap, and regular expressions. Students may get negative feedback when their statements match a misconception more than an expectation, or positive feedback when they express something erroneous. This semantic blur produces inaccurate feedback, which can end up confusing or frustrating the student. Although we do everything we can to engineer the content and threshold parameters, these errors still occasionally occur because of the vagueness of language. One practical solution to this challenge is to have AutoTutor give neutral short feedback after uncertain or borderline semantic matches so that the student is not misled or frustrated when the matches are imperfect. Another approach is to provide more discriminating hints and prompts when there is a semantic blur between expectations and misconceptions. The hints and prompts would more cleanly differentiate a correct expectation from a misconception.
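A minimal sketch of such a neutral-zone feedback policy is shown below, assuming a single best-matching expectation score and a single best-matching misconception score per student statement; the threshold and margin values are illustrative, not AutoTutor’s parameters.

```python
def short_feedback(expectation_score: float, misconception_score: float,
                   cover_threshold: float = 0.7, blur_margin: float = 0.15) -> str:
    """Return 'positive', 'neutral', or 'negative' short feedback."""
    # Semantic blur: the two scores are too close to discriminate,
    # so stay neutral rather than risk misleading the student.
    if abs(expectation_score - misconception_score) < blur_margin:
        return "neutral"
    # The statement matches a misconception more strongly than any expectation.
    if misconception_score > expectation_score and misconception_score >= cover_threshold:
        return "negative"
    # The expectation is clearly covered.
    if expectation_score >= cover_threshold:
        return "positive"
    # Borderline partial answer: hold off on negative feedback.
    if expectation_score >= cover_threshold - blur_margin:
        return "neutral"
    return "negative"
```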

Future AutoTutor Directions and Trialogues

Many spinoffs from AutoTutor have been developed since its inception in 1997 and the publication of Graesser et al. (2001). Nye et al. (2014) reported that dozens of systems have evolved from AutoTutor in the Institute for Intelligent Systems at the University of Memphis. These systems have covered many STEM topics, reading comprehension, writing, and scientific reasoning, with names like DeepTutor, GuruTutor, GnuTutor, AutoMentor, iDRIVE, iSTART, Writing-Pal, and Operation ARIES (and ARA). A recent system has integrated AutoTutor with ALEKS in mathematics, a system commercialized by McGraw-Hill that has helped middle school students in the Memphis area (Hu et al. 2012). The Memphis team has recently started developing AutoTutor for basic electronics and electricity in an ElectronixTutor that is funded by the Office of Naval Research. The suite of AutoTutor applications is starting to cover a large curriculum landscape. Researchers at other universities, businesses, and organizations are increasingly licensing the AutoTutor Script Authoring Tool (ASAT; Cai et al. 2015) to develop their own content and integrate it with our generic AutoTutor Conversation Engine (ACE). For example, Wolfe et al. (2015) used the AutoTutor authoring tools to develop a website on genetic risk factors for breast cancer, called BRCA, and reported learning gains above those from the existing website on the same topic. Educational Testing Service is licensing ASAT for assessment on a variety of competencies (English language learning, science, mathematics) in the context of virtual worlds with agents (Zapata-Rivera et al. 2015). The Army Research Laboratory has incorporated AutoTutor in its open-source Generalized Intelligent Framework for Tutoring (GIFT, Sottilare et al. 2013). AutoTutor continues to grow as it migrates to new systems with new names and applications.

In recent years we have developed trialogues, in which the human interacts with two agents, typically a student agent and a tutor agent, in three-party conversations (Graesser et al. 2014; Graesser et al. 2015a, b; Millis et al. 2011). Two agents theoretically add considerable benefits because they can model successful conversational interactions, such as asking good questions and receiving good answers (Gholson et al. 2009), or stage arguments that create cognitive disequilibrium, productive confusion, and deeper learning (D’Mello et al. 2014; Lehman et al. 2013). Trialogues can help rectify some of the problems with AutoTutor dialogues discussed previously. For example, when the human’s answer is incomplete, the student agent can fill in the missing words and articulate a more complete answer; this not only models good answers but also circumvents any negative short feedback to the human.

Graesser et al. (2015b) identified seven trialogue designs that can be used in learning environments. The two agents in each design can take on different roles, but typically one is a tutor and the other a student peer.

  1. Vicarious learning with human observer. Two agents interact and model ideal behavior, answers to questions, or reasoning.

  2. Vicarious learning with limited human participation. The same as #1 except that the agents occasionally turn to the human and ask a prompt question, with a yes/no or single-word answer.

  3. Tutor agent interacting with human and student agent. There is a tutorial dialogue with the human, but the student agent periodically contributes and receives feedback.

  4. Expert agent staging a competition between the human and a peer agent. There is a competitive game between the human and peer agent, with the expert agent organizing the event.

  5. Human teaches/helps a student agent with facilitation from the tutor agent. As the human tries to help the peer agent, the tutor agent rescues a problematic situation.

  6. Human interacts with two peer agents that vary in proficiency. The peer agents can vary in knowledge and skills.

  7. Human interacts with two agents expressing contradictions, arguments, or different views. The discrepancies between agents stimulate cognitive disagreement, confusion, and potentially deeper learning.

Our current hypothesis is that these seven trialogue designs should be administered adaptively, depending on the student’s knowledge and other psychological attributes. The vicarious learning designs (1 and 2) are appropriate for learners with limited knowledge and skills, whereas designs 5 and 7 are suited to more capable students attempting to achieve deeper knowledge. Design 4 is motivating for learners by virtue of the game competition. Research needs to be conducted to assess empirically the conditions under which different trialogue designs facilitate learning and motivation.
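As a concrete (and admittedly oversimplified) illustration of this hypothesis, the following sketch assigns a trialogue design from an estimate of the student’s prior knowledge; the knowledge bands, the 0-to-1 scale, and the motivation flag are assumptions for illustration, not empirically validated rules.

```python
def select_trialogue_design(prior_knowledge: float, needs_motivation: bool = False) -> int:
    """Map an estimated prior-knowledge level (0..1) to one of the seven designs."""
    if needs_motivation:
        return 4            # competitive game staged by the expert agent
    if prior_knowledge < 0.33:
        return 1            # vicarious learning (design 2 adds light participation)
    if prior_knowledge < 0.66:
        return 3            # tutor agent interacting with human and student agent
    return 7                # contradictions/arguments for deeper learning
                            # (design 5, teaching a peer agent, is another option)
```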

Trialogues have been routinely incorporated in our recent AutoTutor applications. Scientific reasoning is the focus of an instructional game called Operation ARIES! (Millis et al. 2011), which was subsequently commercialized by Pearson Education as Operation ARA (Halpern et al. 2012). ARIES is an acronym for Acquiring Research Investigative and Evaluative Skills, whereas ARA is an acronym for Acquiring Research Acumen. Agent trialogues are currently being developed in computer interventions to train comprehension strategies for adults with reading difficulties in the Center for the Study of Adult Literacy (CSAL, http://csal.gsu.edu/content/homepage). Interestingly, some trialogue designs have always been used in McNamara’s iSTART trainer for reading comprehension (McNamara et al. 2006). ETS is currently using trialogues for assessment and is licensing our ASAT and ACE facilities for that purpose (Zapata-Rivera et al. 2015).

It is of course possible to build systems with more than two agents and more than one human. One can imagine communities of humans and cyber agents interacting in varying numbers. The cyber agents will need conversation mechanisms that are adaptive and flexible, in a similar vein to AutoTutor dialogues and trialogues. At that point we enter the arenas of collaborative problem solving (Fiore et al. 2010; Graesser et al. 2015a) and computer-supported collaborative learning (Dillenbourg 1999; Rosé et al. 2008). These are two areas on our horizon during the next decade.