Introduction

Within Science, Technology, Engineering, and Math (STEM) domains, physics is considered by many to be one of the most difficult topics to master, in part because many of the underlying principles are counter-intuitive. Students’ naïve conceptions have proven highly resistant to traditional lecture-based methods of instruction, and this finding has been the impetus for a tremendous amount of research into alternative, more effective instructional techniques. One approach that has proven effective is called the conceptual change method (Duit and Treagust 2003). This approach relies on two key processes. First, students should be presented with unavoidable and incontrovertible evidence that explicitly contradicts their existing naïve conception(s). Second, they should be guided just enough to allow them to generate an accurate conceptualization of the phenomena and principles within the domain.

This approach is consistent with the basic principles of many learning theories. For example, the second stage relies on the tutor taking on a Socratic role (Seeskin 1987) and leverages what we know about student learning through post-activity reflection and self-explanation (Chi et al. 1994; Katz et al. 2003, 2007; King 1997; Kolb 1984; Lee and Hutchison 1998; Mestre et al. 2011). At its core, this stage recognizes the importance of getting students to generate or construct their own interpretation and understanding of a domain (Duffy and Jonassen 1992; Osborne and Wittrock 1983).

A consideration of the conceptual change approach indicates that an intelligent system capable of implementing this type of science instruction needs to include both a simulation environment for the first stage and a tutorial dialogue capability for the second. Interestingly, while there are many computer-based tutoring systems that use simulation as a training tool (e.g. de Jong and van Joolingen 1998) and tutorial dialogue systems that ask students to provide justifications for problem-solving steps (Aleven et al. 2002; Khuwaja et al. 1994; VanLehn et al. 2007) and to explain their conceptual reasoning (Graesser et al. 1999; Jordan et al. 2006; Litman and Silliman 2004), relatively few existing systems combine experimentation in an interactive environment with self-explanation (Pon-Barry et al. 2004; Rosé et al. 2004).

In this paper, we tackle the problem of developing an effective curriculum based on the conceptual change method within the context of an Intelligent Tutoring System (ITS) for a sub-topic of physics: electricity and electric circuits. We leveraged existing research on common misconceptions in this domain to create a curriculum with exercises conducted in a simulated circuit workbench, designed to contradict such misconceptions, and post-exercise guided reflective dialogues to help students formulate new conceptions. To our knowledge, the Basic Electricity and Electronics Tutorial Learning Environment (Beetle II) is the first ITS that integrates the interactive experimentation and reflective dialogue components of the conceptual change approach, using natural language processing to understand student explanations and provide context-specific feedback.

Part of the challenge in building such a system is the requirement to deal with a dynamically changing simulation environment. Student utterances need to be interpreted in the context of the current state of the simulation, and tutoring feedback must also be adapted to the current context. This generally means that feedback must be generated “on the fly”, which presents a serious challenge for the natural language processing (NLP) module in a tutoring system.

The Beetle II system addresses this challenge by providing a natural language interpreter and a tutorial planner and generator that work together to provide dynamic feedback generation. The Beetle II dialogue component supports context-specific interpretation and diagnosis of natural language student explanations, and implements a set of generic tutoring tactics that can be instantiated in different problem-solving contexts. This allows for dynamic selection and automatic generation of context-adaptive feedback to student answers. Our approach improves upon the state of the art in dialogue ITSs that support automatic feedback generation by tackling more complex language input, while still allowing for integration with an interactive simulation.

After reviewing previous work in NLP for tutorial dialogue, we describe the design and implementation of the Beetle II system, including the construction of the curriculum and a system architecture to support dynamic feedback generation. We evaluate the overall system’s effectiveness in terms of learning gain and user satisfaction and assess the contribution of the NLP module. We then carry out a detailed evaluation of the natural language interpreter, looking at its overall performance and how that correlates with learning outcomes. Finally, we describe an error analysis focusing on the different types of interpretation failures possible in the system, and their relationship with desired outcomes. Based on this evaluation, we discuss future directions for research and development.

Previous Work

It has been solidly established that eliciting self-explanation (Chi et al. 1994) and contentful talk (Litman et al. 2009; Purandare and Litman 2008) from students is correlated with increased learning gain. Therefore, there has been substantial interest in developing tutoring systems in STEM domains that are able to understand and provide feedback to natural language input.

Decisions about how to provide feedback during natural language dialogue can be seen in the more general context of providing formative feedback in an ITS. The types of feedback that an ITS can provide include simple verification feedback that confirms whether or not an answer is correct, correct response (or “bottom-out”) feedback that gives away the correct answer, and different forms of elaborated feedback, which include giving general advice on the topic, re-teaching the material, flagging problematic parts of student responses, or giving hints, cues and prompts (Shute 2008). We will focus on the provision of “informative tutoring feedback” or “tutoring tactics” (Chi et al. 2009; Narciss and Huth 2004; Narciss 2004), that is, feedback that combines verification, error flagging and hints adapted to the context.

In this section, we review work on dialogue ITSs that provide feedback for relatively short (1 to 2 sentence) explanation questions asking students to explain their observations or justify problem-solving steps. Several existing tutorial dialogue systems focus instead on helping students to produce short essays in answer to an explanation-type question (Graesser et al. 1999; Jordan et al. 2006; Litman and Silliman 2004). These systems incorporate NLP approaches that are highly effective in assessing longer texts. However, methods suitable for essay assessment may not perform well when applied to shorter explanations (Ventura et al. 2004).

Systems that provide feedback on short explanations have used both statistical and symbolic methods for answer analysis. The statistical methods employ techniques from textual entailment and paraphrase identification to compare the text of the student answer against sets of possible correct and incorrect answers anticipated by system developers (Jordan et al. 2007; McCarthy et al. 2008; VanLehn et al. 2007). These methods are generally robust to unexpected student inputs, and require less upfront investment in NLP infrastructure than symbolic methods. However, since they rely on matching text strings rather than building semantic representations, they use finite-state machines for dialogue management and require system developers to manually author feedback messages for every machine state. This requires careful attention to authoring tools to avoid redundancy that can cause student confusion (Jordan et al. 2005), and to deal with unanticipated answers which in one study accounted for 30 % of student utterances (Jordan et al. 2009).

In contrast, symbolic methods involve using rule-based parsers and reasoners to build semantic structures representing fine-grained details of utterance content. The availability of such detailed representations supports dynamic adaptive feedback generation because it allows a system to maintain a library of generic tutoring tactics and then choose a tactic at each point of the interaction according to a particular tutoring policy, instantiating it with relevant content from the student answer (Callaway et al. 2006; Glass 2000; Pon-Barry et al. 2004). Symbolic NLP approaches are more brittle to unexpected student input than statistical approaches, and require more upfront investment in parsing and interpretation infrastructure to develop and deploy in new domains. However, the structured semantic representations they produce offer particular advantages for integrating tutoring with simulation-based environments or with environments where problems are dynamically generated, such as cognitive tutors (Aleven et al. 2002). These representations allow such systems to offer feedback that explicitly refers to the current state of the simulation or problem-solving situation.

The development of the Beetle II system is an attempt to expand the application of the symbolic NLP approach to support dynamic feedback generation in conjunction with a curriculum that embodies best practices in science teaching and requires support for moderately complex natural language input. Compared to other systems that support dynamic feedback generation, the Beetle II system allows for significantly more complex input than the CIRCSIM-Tutor (Glass 2000), which asks questions requiring 1–2 word answers; and Beetle II supports explanations of about the same average length as the SCoT-DC tutor (Pon-Barry et al. 2004), but attempts to handle a wider range of topics (SCoT-DC covers three main topics, whereas the Beetle II curriculum covers 10 main topics, with 14 exercises involving around 150 different natural language questions asked by the system).

Another system that supports symbolic understanding and reasoning about student explanations is the Geometry Explanation Tutor (Aleven et al. 2002), which aims to give feedback on sentence-long explanations for geometry problems. However, unlike Beetle II, it generates feedback by representing all possible partially correct explanations in its domain ontology. The Beetle II system instead implements a set of generic tutoring tactics (discussed in the “System Implementation” section), an approach which does not require possible incorrect answers to be specified in advance as concepts in the ontology.

Among the systems that use statistical NLP approaches to understanding, the closest in philosophy to Beetle II is the AUTOTUTOR system (Graesser et al. 1999). The tutoring in AUTOTUTOR is based around expectations of the answer, expressed as natural language statements. Student input is matched against those expectations using latent semantic analysis. For each missed expectation, the instructional designers author a list of feedback messages that correspond to different tutoring tactics: pumps (or “contentless prompts”), hints and assertions. This provides the system with a collection of tactics it can choose from at each point in the interaction according to the desired tutoring policy. However, since no structured representation is used, the feedback messages cannot be generated automatically and have to be authored for each problem.

The Beetle II domain was chosen to be more restricted than the AUTOTUTOR domains, so that it allows for building domain reasoners and symbolic models, but is more complex than the systems that use symbolic approaches discussed above.Footnote 1 One of our research aims was to investigate whether we can effectively handle more complex language by using a domain adaptation technique originally developed for task-oriented spoken dialogue systems. In contrast to previous approaches to symbolic NLP in tutoring (which rely either on entirely domain-specific processing modules, or on syntactic parsers with domain-specific lexicons), our architecture uses a wide-coverage syntactic parser and a domain-independent lexicon and ontology that is adapted to a given domain with the aid of ontology mapping rules. This architecture has been successfully used to improve portability and robustness of parsing and semantic interpretation in task-oriented dialogue systems (Dzikovska et al. 2008). This gave us reason to believe that it can effectively support understanding explanations in a tutorial dialogue system that combines conceptual learning with experimentation in a simulation environment.

Next, we provide a general description of the curriculum, the high-level tutoring strategies used in the system, and the role that NLP plays in supporting them.

Curriculum Development

As mentioned earlier, physics is considered by many to be an extremely difficult field to master, due in large part to the many concepts that run counter to our everyday experience. We focused our curriculum on the sub-topic of electricity and electric circuits. This is a foundational topic for electrical engineering and for training electricians, and a common part of school physics curricula because it helps students understand the electrical appliances and machinery they encounter in everyday life. Our research goal, however, is not to teach circuits per se, but to understand how teaching techniques involving hands-on interaction and reflective dialogue, known to be effective when used by human tutors, can be implemented in a computer system integrating simulation and natural language dialogue.

In order to develop our learning materials, we leveraged the existing research in science education as much as possible. At the heart of over two decades of research in physics education we find a number of studies designed to elicit and identify common misconceptions (Chinn and Brewer 1993; Halloun and Hestenes 1985a; Tallant 1993). From there, researchers went on to develop psychometrically robust tests of conceptual understanding to support the body of research evaluating the effectiveness of a wide variety of different teaching strategies (Carey 1986; Eylon and Linn 1988; Farnham-Diggory 1994; Halloun and Hestenes 1985b; Scott et al. 1992; Vosniadou and Brewer 1987). By tying into this body of research, we were able to establish learning objectives that align with an established set of common misconceptions, apply an instructional strategy that has been demonstrated to be effective, and evaluate our system using tests and conceptual inventories that are known to be reliable and valid.

In the rest of this section, we will elaborate on the specifics of the course that we implemented in Beetle II, at three levels. We begin by addressing the content, i.e. the specific concepts that our curriculum addresses and how those concepts are sequenced. Next, we address the primary high-level instructional strategy that was used to deliver this content. Finally, we briefly explain how we arrived at the low-level tutoring tactics implemented in the system in order to help a student complete each step required by the instructional strategy.

Instructional Content and Approach

We implemented approximately four hours of instruction, using content and activities drawn from the materials published by the Physics Education Group at the University of Washington (McDermott et al. 1995). These materials are based on years of research in science education conducted by this group and others, and have been iteratively refined through multiple experiences in college classrooms. The curriculum explicitly covers topics such as open and closed paths, voltage, series and parallel circuits, and finding faults in a circuit with a multimeter. In addition, the simulation-based exercises themselves are carefully designed to elicit and counter common misconceptions in this domain. For example, many people intuitively analyze circuits sequentially. They begin by looking at the power source and then follow the circuit with their eyes, anticipating that the electrical phenomena will occur in sequence until there is a disruption in the circuit or they return to the power source. This sequential reasoning underlies several misconceptions, including the misconception that current flows out of a battery and through light bulbs until it reaches a gap, and the misconception that each light bulb consumes a bit of the current that passes through it. Our curriculum includes numerous exercises using a variety of different circuit configurations that consistently demonstrate that students must reason globally about a circuit, because a sequential analysis does not correctly predict electrical phenomena.

The sequence of concepts and activities described by the Physics Education Group is nested within the conceptual change instructional approach. (The low-level tactics that we implemented were based on an analysis of human tutoring behavior with this same curriculum, as will be discussed later in this section.) At the core of this approach is an analysis of the fundamental concepts in a scientific domain and a series of activities designed to illustrate these concepts and contradict commonly held misconceptions. Students are taken through several cycles of making predictions based on their existing conceptions, completing activities, and then reflecting on the outcomes and implications for their conceptions of the field. There are repeated demonstrations in the literature of the effectiveness of this approach (Redish et al. 1997; Thacker et al. 1994).

The lessons in Beetle II apply this instructional approach within the context of a simulated circuit workbench. For each topic, the students are asked to predict the behavior of a given circuit and explain that prediction. Next, the students test their predictions in the circuit simulator by building the circuits and observing their behavior. After the test is complete, the students are asked whether the simulated results matched their predictions. Finally, when predictions are not borne out, students are asked to try to generate an explanation for the behavior of the circuit that they just observed. We refer to this general flow as the “Predict-Verify-Evaluate” strategy (PVE). Most of the curriculum is set up in this manner, which encourages students to think deeply about the phenomena that they observe and attempt to infer the underlying explanatory principles in the domain.
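To make the flow concrete, the sketch below outlines a single PVE cycle; the session object and its methods are hypothetical placeholders rather than the system’s actual interface.

```python
# Minimal sketch of one Predict-Verify-Evaluate (PVE) cycle. The session object
# and its methods are hypothetical placeholders.

def pve_cycle(session, exercise):
    # Predict: state and explain the expected behavior of the circuit.
    prediction = session.ask_prediction(exercise)

    # Verify: build the circuit in the simulator and observe what happens.
    observation = session.run_in_simulator(exercise)

    # Evaluate: compare prediction and observation; when they disagree, the
    # student is guided to explain the behavior they actually observed.
    if not session.matches(prediction, observation):
        session.reflective_dialogue(exercise, prediction, observation)
```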

It may be informative at this point to compare our strategy to the strategies implemented in other ITSs that address misconceptions, such as Why2-ATLAS (Jordan et al. 2006), ITSPOKE (Litman and Silliman 2004) and AUTOTUTOR (Graesser et al. 1999). These systems have a library of remediations tailored to the assorted misconceptions. They ask students questions, and use natural language interpretation to attempt to diagnose misconceptions present in the answer and to select a matching remediation. With our tutoring strategy we do not use misconception-specific dialogue-based remediations. The simulation-based exercises are designed to elicit and counter common misconceptions. Students who make faulty predictions of circuit behavior based on misconceptions are engaged by the tutor in a dialogue designed to have them explicitly acknowledge the inability of their existing conceptual framework to accommodate the observed phenomena and then to “reverse-engineer” a more appropriate conceptual understanding of the basic principles of electricity and circuits.

We hypothesize that two things are working together to make our curriculum effective in overcoming common misconceptions. The first is our content—the presentation of a carefully designed sequence of concepts. The second is our primary tutoring strategy—the exercises associated with those concepts embedded within the Predict-Verify-Evaluate cycle.

In our conceptualization of this strategy, the role of the tutor is to help students accomplish each step within the cycle, i.e. ensure that the student correctly performs the required simulations, that they remember their predictions and are aware of any differences between the predictions and the observations, and that they complete each reflection dialogue with knowledge of the correct explanation. This cannot be fully accomplished without a tutorial dialogue capability that helps students to generate for themselves accurate statements about the underlying principles of basic electricity and electronics.

The challenge for the dialogue component is then to select an appropriate generic tutoring tactic aimed at completing the current step, and then instantiate it, on the fly, within the context created by both the student’s statements and the state of the objects in the simulation environment. Our choice of tactics was based on a human-human tutoring study in the context of our curriculum, described in the next section.

Tutoring Tactics

Before implementing our curriculum in the Beetle II system, we piloted it with three experienced human tutors, each of whom worked through the exercises individually with multiple students, providing feedback during the prediction and reflection stages as necessary. All of the tutors were knowledgeable in the area of electricity and circuits. Two of the tutors had experience providing military technical training and the other had experience teaching college courses and providing private tutoring. A total of thirty students participated in this pilot study for approximately four hours each. The curriculum was presented to the student on a laptop and the student interacted with their tutor via a chat interface. All of the student and tutor questions, answers, and feedback were logged.

After all of the participant data were collected, our team analyzed the transcripts in order to identify effective tactics used by our human tutors. First, we developed a hierarchical coding scheme for the low-level techniques that the tutors used in response to student answers (both correct and incorrect) (Steinhauser et al. 2007). The scheme described micro-level decisions taken by the tutors while trying to help the student perform post-exercise reflection (for example, giving hints to help the student arrive at a better explanation of what they observed), and used eight different tactic codes at the lowest level.

Once all of the transcripts were coded according to this scheme,Footnote 2 we turned our focus to evaluating the effectiveness of each tactic and the effectiveness of the most commonly occurring sequences of remediation tactics (when a single tactic was not successful at eliciting the correct answer from a student). Each mini tutorial dialogue that followed each question in the curriculum (across all participants and all tutors) was classified according to its sequence of codes and according to the success of its final result. A mini tutorial dialogue was classified as successful if it resulted in the student generating a full and correct answer and as unsuccessful if the tutor was the one who eventually provided the full and correct answer to the student.

For example, one common sequence found in the transcripts was:

  • Student: Partially correct, some missing information

  • Tutor:  Remediates with generic prompt for more information

  • Student: Fully correct

  • Sequence: Successful

These sequences were tallied and the most frequently used successful sequences were selected for implementation as tutoring tactics in the Beetle II system. While the ultimate goal was to develop a system that could be used to test a wide variety of tutoring tactics, we felt that this selection would yield a strong initial version of the system. Our analysis also indicated that tutorial dialogues that were not successful after three tutor turns were rarely successful in the end. Hence, we recommended that the Beetle II system default to providing the full and correct answer to the student if three attempts at eliciting the answer were not successful.
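As an illustration of the tallying step, a few lines of Python suffice; the input format (one coded tactic sequence plus a success flag per mini dialogue) and the example codes are assumptions, not our actual coding data.

```python
# Count how often each coded tactic sequence occurred and how often it succeeded.

from collections import Counter

def tally_sequences(mini_dialogues):
    """mini_dialogues: iterable of (tuple_of_tactic_codes, successful) pairs."""
    attempts, successes = Counter(), Counter()
    for codes, successful in mini_dialogues:
        attempts[codes] += 1
        if successful:
            successes[codes] += 1
    # Rank sequences by how often they led to a successful outcome.
    ranked = sorted(successes.items(), key=lambda kv: kv[1], reverse=True)
    return ranked, attempts

# Example with made-up tactic codes:
ranked, attempts = tally_sequences([
    (("prompt",), True),
    (("prompt", "hint"), True),
    (("prompt",), True),
    (("prompt", "hint", "hint"), False),
])
print(ranked[0])   # (('prompt',), 2)
```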

Based on this analysis, we developed a scheme to structure the tactics generated by a computer tutor. The list of tactics, found in the Appendix, reflects a collection of the various tactics that our tutors used. The details of how tactics are chosen and instantiated at each step of the dialogue are discussed in the “Tutorial planner and natural language generation” section. The Appendix also illustrates how the PVE strategy and the dialogue remediation tactics work together in the Beetle II ITS to address student misconceptions. It contains examples of students making incorrect predictions in the beginning and arriving at the correct answer during reflection dialogue, either by themselves after observing the discrepancy in simulation, or with the aid of the system.

Assessment

To allow us to test the effectiveness of our instruction, we developed pre- and post-tests by leveraging published conceptual inventories in the domain of electricity and electronics, particularly the “Determining and Interpreting Resistive Electric Circuits Concepts Test” (Engelhardt and Beichner 2004) and the “Electric Circuit Diagnostic Test” (Sokoloff 1996). After selecting a subset of questions that were relevant to our curriculum, we conducted pilot testing of these items with 106 students at the US Navy Training Command Great Lakes. In this pilot test, we followed each multiple choice question with the open-ended question, “Why?” The primary objective of this pilot effort was to determine the extent to which the distractors were valid representations of common misconceptions in the field. In other words, we evaluated the extent to which the students’ written explanations for selecting a distractor were consistent with the misconception that the distractor was purported to represent. The results of this analysis provided positive evidence for the validity of the test items. In addition, these data allowed us to assemble two different tests that were well-matched in terms of both difficulty and content coverage. We used one test, with 22 questions, for our pre-test and the other, with 21 questions, for our post-test in the current study. Note that further evidence for the equivalence of these two tests is presented later, through the performance of the no-training control condition of our study.

System Implementation

Architecture

A screenshot of the system is shown in Fig. 1. The student interface includes an area to display reading material, a circuit simulator, and a dialogue history window. All interactions with the system are typed to avoid issues with speech recognition.Footnote 3 Students read pre-authored curriculum slides and carry out exercises which involve experimenting with the circuit simulator. They are asked to explain observed behavior by answering questions such as “Why was bulb C off when switch Z was open?”. The system also asks high-level questions, such as “Under what circumstances will a switch affect a bulb?”. The graphical user interface is integrated with the open-source GNUCAP circuit simulator,Footnote 4 which is used to correctly display bulb brightness and provide accurate voltage and current readings for arbitrary circuits.

Fig. 1 Screenshot of the Beetle II system

Fig. 2 Example dialogue with the system, excerpted from Fig. 1

An excerpt from the example dialogue in Fig. 1, formatted for readability, is shown in Fig. 2. Students are intended to learn that a bulb will light up when it is in a closed path with a battery. Thus, when the student says that the bulb is contained in a closed path, the system treats the answer as partially correct but incomplete. It restates the student answer to confirm the correct part, a tactic that is effective in human-human tutoring (e.g. Dzikovska et al. 2008), and then prompts the student to complete the answer. When the student is unable to continue correctly, the system gives two different hints until the student arrives at a correct answer. We will use the first exchange (turns 2–3) of this dialogue as a running example to illustrate the system’s capabilities. Some additional examples demonstrating system behavior and strategies used, excerpted from our evaluation corpus, are provided in the Appendix.

The system architecture is shown in Fig. 3. The system behavior can be seen as an instance of the two-loop approach from VanLehn (2006). The outer loop consists of the system presenting a series of high-level activities to the user. Each activity is made up of a series of questions, which roughly correspond to steps within VanLehn’s inner loop.Footnote 5 Both the activities and the questions are generated by a curriculum planner. For each question, the system provides a dialogue-based feedback service. The tutor processes student input (either natural language or a circuit submission) and gives feedback and hints until the correct answer is established, either by the student answering correctly, or by the tutor deciding that further remediation will be unproductive and therefore telling the student the correct answer and moving on to the next question.
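A short sketch of this two-loop control flow is given below; the object and method names are illustrative placeholders rather than the system’s actual API, and the three-attempt cut-off mirrors the recommendation from our human tutoring analysis.

```python
# Sketch of the two-loop control flow (after VanLehn 2006). All names are
# illustrative placeholders, not the actual Beetle II interfaces.

MAX_ATTEMPTS = 3  # after this, the tutor bottoms out with the correct answer

def run_curriculum(curriculum_planner, feedback_service):
    for activity in curriculum_planner.activities():            # outer loop
        for question in activity.questions():                   # inner loop
            attempts = 0
            while True:
                student_input = question.get_student_input()    # text or circuit
                diagnosis = feedback_service.diagnose(question, student_input)
                if diagnosis.is_correct():
                    feedback_service.acknowledge(diagnosis)
                    break
                if attempts >= MAX_ATTEMPTS:
                    # Further remediation judged unproductive: give the answer.
                    feedback_service.bottom_out(question)
                    break
                feedback_service.remediate(diagnosis)            # hint or prompt
                attempts += 1
```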

Fig. 3 System architecture diagram

To provide feedback on natural language answers, the feedback service uses a standard interpretation pipeline, with domain-independent parsing and generation components supported by domain-specific components for tutorial decision-making. Student answers are first analyzed by a deep parser. A contextual interpreter is then used to provide reference resolution and to transform the parser output into the domain-specific representation used by the domain reasoner. The diagnoser checks the student answer representation for factual correctness and matches it against the expected answer, returning a detailed diagnosis containing a list of correctly mentioned objects and relations, those that are missing, and those that contradict the expected answer. Next, the diagnosis is passed to the tutorial planner, which decides on an appropriate response, including which tutoring tactics to use. Finally, the natural language generator converts the tutorial planner’s high-level decision into the text of a response to be presented to the student.
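The per-turn processing order can be summarized as a simple pipeline; the sketch below is a minimal illustration whose interfaces are assumptions made for exposition, not the actual APIs of the Beetle II modules.

```python
# Minimal sketch of the per-turn feedback pipeline for a natural language answer.
# parser, interpreter, diagnoser, tutorial_planner and generator stand for the
# modules described in the text; their interfaces here are assumed.

def feedback_for(student_text, dialogue_state,
                 parser, interpreter, diagnoser, tutorial_planner, generator):
    lf = parser.parse(student_text)                   # domain-independent semantics
    answer = interpreter.resolve(lf, dialogue_state)  # domain-specific representation
    diagnosis = diagnoser.compare(answer, dialogue_state.expected_answer)
    plan = tutorial_planner.choose_tactics(diagnosis, dialogue_state)
    return generator.realize(plan)                    # text shown to the student
```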

Curriculum Planning and Exercises

The high-level lesson flow, including the presentation of reading material and exercise sequencing, is managed by the curriculum planner. For each exercise, the planner stores the reading material to be presented and the questions the system will ask the student, together with the associated circuits and reference answers (described later in the “Domain Reasoning and Diagnosis” section). These are used by the dialogue manager to load the appropriate circuits into the simulator and set up the dialogue context at the beginning of each exercise.

At present, exercises and questions are presented to all students in the same fixed sequence, based on a lesson plan specified by the curriculum designers. We are considering implementing a more flexible approach in the future, allowing the curriculum planner to choose both exercises and questions to be presented based on a student model and an adaptive planner (e.g. Cho 2000). The fixed lesson plan was selected as the simplest possible option, since the focus of our work has been on adaptive feedback generation in the feedback service. However, the dialogue manager we have implemented in the system is fully capable of handling a dynamic lesson plan, and an adaptive curriculum planner could be substituted in without any changes to the rest of the system’s architecture.

The curriculum includes four different types of questions:

  • Identify: Object and attribute identification questions, e.g. “In this circuit, which bulbs will be on and which bulbs will be off?” These questions ask the student to identify one or more objects or attributes. They require short answers which are much simpler to interpret than explanation questions, but still require the system to correctly resolve references to objects on the screen (“The bulb in 1”), ellipsis (“all off”), etc., requiring proper interpretation and integration with the domain reasoner.

  • Explain: Explanation and definition questions, e.g. “What are the conditions for a bulb to light?” These questions are key to supporting self-explanation, and student answers to these are the most difficult to process.

  • Multiple choice: Multiple choice questions, requiring students to answer “yes” or “no”, or to choose from a set of alternatives. These questions can be processed straightforwardly using simple template matching.

  • Number: Number questions, either a measurement report (“What voltage reading do you get?”) or a count (“How many closed paths can you find in diagram 1?”). These questions require some post-processing, but again, can be dealt with straightforwardly using a simple grammar.
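The sketch below illustrates how answer handling might be dispatched by question type; the helper function names are hypothetical, and the actual processing for each type is described in the following subsections.

```python
# Illustrative dispatch over the four question types. The helper functions are
# hypothetical placeholders for the processing described in this section.

def handle_answer(question, student_text):
    if question.qtype == "multiple_choice":
        return match_template(question, student_text)   # simple template matching
    if question.qtype == "number":
        return parse_number(student_text)               # small grammar + post-processing
    # "identify" and "explain" answers go through the full NLP pipeline:
    # parsing, reference resolution, and diagnosis against a reference answer.
    return interpret_and_diagnose(question, student_text)
```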

In the remainder of this section, we discuss the interpretation and dynamic feedback generation pipeline, focusing on the processing necessary to produce automatically generated feedback for “identify” and “explain” questions.

Dialogue Management

The interpretation and feedback generation pipeline is coordinated by the dialogue manager. Our approach is based on the information-state update approach to dialogue management (Larsson and Traum 2000), in which the dialogue manager maintains the central representation of the dialogue state; individual components consult this state to support context-specific interpretation and generation, and update it with the results of their processing. The dialogue manager calls the components in the feedback service pipeline (interpretation, domain reasoning and diagnosis, then tutorial planning and generation) to process the student answer and generate appropriate feedback, and decides when to pass control to the curriculum planner to move on to the next question.

Dialogue state is represented by a cumulative answer analysis which tracks, over multiple turns, the correct, incorrect, and missing parts of the answer. Every time the student attempts an answer, the objects and relations mentioned are added to the dialogue state, as is content from the hints produced by the tutor. For example, if the system asks “Which components are in a closed path?”, expecting the answer “the bulb and the battery”, the student may say “the bulb.” When prompted to say more, they may complete their answer by saying “the battery.” The dialogue manager combines the information from the two answers and concludes that the complete answer has been provided, allowing the student to move on to the next question.
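A simplified, set-based sketch of this bookkeeping is shown below; the real dialogue state tracks structured objects and relations rather than strings, so this is only an approximation of the idea.

```python
# Minimal sketch of the cumulative answer analysis kept in the dialogue state.
# Plain strings stand in for the structured objects and relations the system
# actually tracks.

class CumulativeAnswer:
    def __init__(self, expected_parts):
        self.expected = set(expected_parts)   # e.g. {"bulb", "battery"}
        self.covered = set()                  # correct parts mentioned so far

    def add_turn(self, mentioned_parts):
        """Record parts mentioned by the student (or given away in tutor hints)."""
        self.covered |= set(mentioned_parts) & self.expected

    def missing(self):
        return self.expected - self.covered

    def complete(self):
        return not self.missing()

# The example from the text, answered over two turns:
state = CumulativeAnswer({"bulb", "battery"})
state.add_turn({"bulb"})      # first student turn: "the bulb"
state.add_turn({"battery"})   # after a prompt: "the battery"
assert state.complete()
```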

Next we describe the implementation of the pipeline components.

Interpretation Components

To parse student utterances, the system uses a two-stage process. First, the TRIPS dialogue parser (Allen et al. 2007) produces a domain-independent semantic representation including high-level word senses and semantic role labels. Next, the contextual interpreter uses a reference resolution approach similar to Byron (2002) and a set of hand-written rules specific to our domain to transform the domain-independent parser output into the domain-specific representation employed by the domain reasoner, using the ontology mapping mechanism described in Dzikovska et al. (2008).

Example output from the TRIPS parser and the contextual interpreter is shown in Fig. 4. The details of the parser representation are beyond the scope of this paper. Note, however, that it identifies a referential pronoun “it”, and provides some additional syntactic information (not shown here for brevity, see Allen et al. (2007) for details) that serves as the input to the reference resolution algorithm. It also identifies that the word “in” is used in its spatial, and not temporal, sense.

Fig. 4 (a) The domain-independent semantic representation produced by the TRIPS parser, and (b) the corresponding domain semantic representation produced by the Beetle II interpreter for turn 2 in our example dialogue (Fig. 2). Some details are omitted for readability

The contextual interpreter takes the parser output and resolves the pronoun to a specific bulb in the environment, LightBulb-4-1, using information from the domain reasoner about the objects visible on screen, and the dialogue history, which records the most salient object based on the last question asked.Footnote 6 The meaning of the word “in” is then refined beyond the “spatial location” sense assigned by the parser: the interpreter determines that it corresponds to the knowledge base concept contains, used to represent the relationship between a component and a path (and not, for example, state-of, as it would in the sentence “the two terminals are in the same state”).

This two-stage interpretation approach has been adapted from previous work on multi-domain task-oriented dialogue systems (Dzikovska et al. 2008) as a means to improve system robustness to unexpected answer phrasings. One of the major sources of interpretation failures in systems that use symbolic interpretation methods is missing lexical entries and grammar rules in hand-coded grammars and lexicons for the domain. The use of a domain-independent parser and semantic representation allows the system to recognize syntactic variations which express the same meaning, e.g. “The bulb is contained in a closed path” and “There is a closed path containing a bulb.” The domain-independent ontology used by the TRIPS parser allows the system to recognize synonyms. For example, if the student says “Terminals 1 and 2 are linked”, the system will identify “linked” as an instance of the LF::Attach domain-independent sense, which is a synonym for “connected”. Both words are automatically mapped to the connected-to concept understood by the domain reasoner through a single mapping rule that links LF::Attach in the domain-independent ontology to connected-to in the domain-specific ontology.
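Conceptually, a mapping rule acts like a small lookup from domain-independent senses to domain concepts, as in the sketch below; only LF::Attach and connected-to come from the example above, and the rule-table format is an assumption for illustration.

```python
# Toy illustration of ontology mapping: one rule maps a domain-independent
# sense (covering several synonyms) to a single domain-specific concept.

MAPPING_RULES = {
    "LF::Attach": "connected-to",   # covers "connected", "linked", ...
}

def map_to_domain(parser_sense):
    """Map a TRIPS-style sense to a domain concept, or None if no rule applies."""
    concept = MAPPING_RULES.get(parser_sense)
    if concept is None:
        # No rule: the contextual interpreter cannot ground this sense, which
        # may trigger the error recovery policy discussed later in this section.
        return None
    return concept

print(map_to_domain("LF::Attach"))   # -> connected-to
```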

The TRIPS grammar and lexicon incorporate information from existing wide-coverage resources (Dzikovska et al. 2004; Swift 2005) and have been used in different dialogue domains over many years, further extending their coverage. Therefore, lexical entries and grammar rules are less likely to be missing compared to domain-specific grammars used in symbolic semantic interpreters for earlier dialogue ITSs, though occasional gaps in coverage are unavoidable. In addition, interpretation may fail where concepts that have similar meanings in a specific limited domain are not synonymous in a broader context. For example, if a student says “a broken path” instead of “an open path”, the equivalence may not be recognized because “broken” and “open” are not synonyms in a domain-independent ontology. This issue is addressed in part by writing additional rules to cover domain-specific semantic information not encoded in the domain-independent ontology, and in part by using an error recovery policy that points out the specific terms in an utterance that the interpreter fails to recognize, discussed later in this section.

Domain Reasoning and Diagnosis

The domain reasoner uses the Knowledge Machine system (Clark and Porter 1999; Dzikovska et al. 2006) to answer factual questions about the state of the world (e.g. whether a given switch is open, or which bulbs are lit in a given circuit). It uses a description logic formalism and incorporates a knowledge base with an ontology of the objects that can be seen on the screen, their properties, and relations between them (currently 14 object types and around 50 properties and relations). The reasoner supports contextually-appropriate interpretation and generation, and, most importantly, diagnosis of answer correctness.

While student answers to multiple choice and number questions are either right or wrong, for “identify” and “explain” questions the situation is more complicated. The diagnoser compares the objects and relations in the student answer against the reference answer to determine the correct, contradictory, missing and irrelevant answer parts (Dzikovska et al. 2008). At present, the system uses a heuristic matching algorithm to classify answer parts into the appropriate category. In future we will consider using a statistical classifier similar to Nielsen et al. (2008). For “identify” questions, reference answers are automatically computed based on the information provided by the domain reasoner. For “explain” questions, reference answers are provided by instructional designers to ensure that they match the curriculum goals.

Each student explanation is checked on two levels, verifying both (a) factual and (b) explanation correctness. Going back to our example, we must check (a) that the bulb identified as LightBulb-4-1 is indeed in a closed path, and (b) that being in a closed path is an acceptable explanation for the bulb being lit. Different remediation tactics are needed depending on whether the student made a factual error (i.e. they misread the diagram and the bulb is not in a closed path) or produced an incorrect or incomplete explanation (i.e. the bulb is indeed in a closed path, but they failed to mention that a battery needs to be in the same closed path for the bulb to light). The diagnoser verifies factual correctness by querying the domain reasoner and explanation correctness by matching the student answer against the reference answer. An example of the output from the diagnoser is given in Fig. 5. In this case, everything the student said was correct with respect to the reference answer, and there were two missing parts (corresponding to “the battery must be in the same closed path”).
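The sketch below approximates this two-level check using set membership over atomic facts; the actual diagnoser uses a heuristic matcher over structured objects and relations, and query_fact stands in for a call to the domain reasoner.

```python
# Simplified two-level diagnosis: (a) factual check against the simulation state,
# (b) comparison of the explanation against the reference answer.

from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    factual_errors: list = field(default_factory=list)   # wrong about the circuit itself
    matched: list = field(default_factory=list)          # correct parts of the explanation
    missing: list = field(default_factory=list)          # expected parts not mentioned
    extra: list = field(default_factory=list)            # mentioned but not expected

def diagnose(student_facts, reference_facts, query_fact):
    d = Diagnosis()
    for fact in student_facts:
        if not query_fact(fact):          # (a) factual check via the domain reasoner
            d.factual_errors.append(fact)
        elif fact in reference_facts:     # (b) explanation check against the reference
            d.matched.append(fact)
        else:
            d.extra.append(fact)
    d.missing = [f for f in reference_facts if f not in student_facts]
    return d

# Made-up fact labels, loosely corresponding to the diagnosis in Fig. 5:
d = diagnose(
    student_facts=["bulb-in-closed-path"],
    reference_facts=["bulb-in-closed-path", "battery-in-path", "same-path"],
    query_fact=lambda fact: True,   # pretend the domain reasoner confirms everything
)
print(d.matched, d.missing)
```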

Fig. 5 The diagnosis produced for turn 2 in our example dialogue

Tutorial Planner and Natural Language Generation

The system generates tutoring feedback automatically based on the diagnosis of the student answer. In order to generate a response, the tutorial planner first makes a high-level decision on which tactic or set of tactics to use, and then passes them to a natural language generator to realize as text.

The tutorial planner supports a set of generic tutoring tactics and implements a top-level tutoring policy for choosing appropriate tactics at each point of the interaction. An example textual summary of a decision rule is shown in Fig. 6. It says that early in the dialogue, if the student is on the right track but some parts of the answer are missing, the system should give positive feedback, acknowledge the correct portion of the student’s statement, and go back for more information, using a generic “keep going” prompt to try to elicit the missing parts of the student’s answer.

Fig. 6 Pseudocode for the rule used by the tutorial planner to decide on a sequence of tactics in turn 1 in our example dialogue. Tutorial planner output from application of this rule is shown in Fig. 7a

Figure 7a shows the tutorial planner output for turn 3 in our example, produced based on the diagnosis from Fig. 5 and the rule in Fig. 6. The tactic decisions are expanded into planning operators using relevant portions of the diagnosis structure, specifying the content to be acknowledged. This output is passed on to the natural language generator. The system uses a domain-specific sentence planner to determine how to allocate content to sentences and to choose syntactic structures and lexical items. The sentence planner output is shown in Fig. 7b. It shows, for example, that the system decided to use the ACKNOWLEDGE-CORRECT template for confirming correctness (e.g. saying “Right”). When reinforcing the correct portion of the student answer, the sentence planner decided that it was most appropriate to use the phrase “bulb A”, as opposed to “the bulb” or “it”, and that the focus of the sentence should be the bulb, rather than the path (in the latter case, the same content would be realized as “There is a closed path containing bulb A”.) This representation is finally passed to an existing wide-coverage surface realizer, FUF/SURGE (Elhadad and Robin 1992), which outputs the corresponding text.

Fig. 7 Tutorial planner output resulting from the application of the tutoring rule in Fig. 6, along with the sentence planner output and the realized text

The tutorial planner typically has multiple possible tactics available at each point in the interaction. In our example, instead of making the decision to reinforce correct student content and use a contentless prompt, the tutorial planner could have used the content from the missing field in the diagnosis to generate hints. Some example hints generated by the system are shown in turns 5 and 7 in Fig. 2. For a low specificity hint the system typically selects an as-yet unmentioned object and hints at it, in our case, saying that the answer should mention the battery. For high-specificity hints, it attempts to hint at a two-place relation, in our example saying that the battery must be contained in something, with the hope that the student can fill in the blank by remembering the concept of a closed path.

The choice of which of the applicable prompt, hint or restatement tactics to use is governed by a rule-based tutoring policy. In our current policy, if remediation is necessary, the system starts by giving a contentless prompt (optionally restating the correct part of the answer, if any), then proceeds to more specific hints and suggestions for reading slides as appropriate for the context. Finally, if the student is unable to arrive at the correct answer after three to four attempts, the system gives away the answer using the bottom-out tactic. The specific decisions about which prompt or hint to use at each step take into consideration the error type (by checking matched, contradictory and missing parts of the diagnosis, as shown in Fig. 6), the number of incorrect answers received in response to the current question, and the number of previous answers that the system failed to interpret.Footnote 7
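A much-simplified version of this escalation policy is sketched below; the thresholds and tactic names are illustrative, and the actual policy also conditions on the specific error type and on the number of uninterpretable answers.

```python
# Illustrative escalation policy: prompt first, then increasingly specific hints,
# then bottom-out. Thresholds and tactic names are assumptions for exposition.

def choose_tactics(has_correct_parts, has_missing_parts, incorrect_attempts):
    """Return the tactic sequence for the next tutor turn."""
    if incorrect_attempts == 0 and has_correct_parts and has_missing_parts:
        # Early in the dialogue: acknowledge what was right, then prompt for more.
        return ["acknowledge_correct_part", "contentless_prompt"]
    if incorrect_attempts == 1:
        return ["low_specificity_hint"]    # e.g. hint at an unmentioned object
    if incorrect_attempts == 2:
        return ["high_specificity_hint"]   # e.g. hint at a two-place relation
    return ["bottom_out"]                  # give away the full correct answer

print(choose_tactics(True, True, 0))   # ['acknowledge_correct_part', 'contentless_prompt']
```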

In addition to remediation tactics after flawed answers, the system implements different tactics for responding to correct answers, to predictions and to interpretation difficulties. If the student gives an acceptable answer, the tutorial planner can choose to either accept it and move on, or to accept it, but restate the answer using better terminology, based on the policy discussed in Dzikovska et al. (2008). For predictions, the system initially accepts the prediction neutrally (without acknowledging correctness) and then either confirms that the student was correct, or re-iterates the difference between the prediction and the actual outcome in the evaluation stage (see dialogue with Participant 9 in the Appendix). The list of tactics and example dialogues annotated with tactic names are given in the Appendix.

Finally, the tutorial planner implements an error recovery policy (Dzikovska et al. 2009) to deal with non-interpretable utterances. As discussed in the “Interpretation Components” section, the domain-independent parser and ontology support alternative syntactic phrasings and synonyms, but interpretation may fail, for example, when the student uses terms that are related in the domain but not labeled synonymous in the domain-independent ontology. To help mitigate interpretation failures, the tutorial planner attempts to produce a message that describes the problem without giving away the answer. For example, if the student refers to a broken or damaged path, the system will say, “I’m sorry, I’m having a problem understanding. I don’t understand when you say that paths are damaged. Batteries or bulbs can be damaged, but not paths”.Footnote 8 The help message is accompanied by a hint at the appropriate level, again depending on the number of previous incorrect and non-interpretable answers. In task-oriented spoken dialogue systems, this method has been used successfully to help users learn to phrase their requests in a way that is understood by the system (Hockey et al. 2003). We evaluate its effectiveness in the context of tutorial dialogue in the next section.
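The sketch below reconstructs the example recovery message from a small applicability table; the table is an invented fragment, whereas the real system derives this information from its domain ontology.

```python
# Reconstruction of the recovery message from the example above. The
# applicability table is a made-up fragment for illustration only.

APPLICABLE = {"damaged": ["batteries", "bulbs"]}   # which objects a predicate applies to

def recovery_message(predicate, noun_plural):
    allowed = APPLICABLE.get(predicate, [])
    if noun_plural in allowed:
        return None   # the predicate applies to this object type: nothing to flag
    allowed_str = " or ".join(allowed).capitalize()
    return ("I'm sorry, I'm having a problem understanding. "
            f"I don't understand when you say that {noun_plural} are {predicate}. "
            f"{allowed_str} can be {predicate}, but not {noun_plural}.")

print(recovery_message("damaged", "paths"))
```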

Evaluation

Setup

A key decision in any system evaluation is the nature of the control condition(s). In this study, we included two control conditions. The first was a no-training control condition, where participants completed the pre-test and the post-test, separated by an hour of distractor tasks (a lesson on differential equations). The purpose of including this condition was to generate the data needed to calculate effect sizes for the training systems of interest using Cohen’s d.

Next, given that the focus of this research is on developing the NLP capability necessary to support tutorial dialogue during the reflection phase of the PVE cycle, we determined that our second control condition should be a system that applied the same curriculum and only differed from our dynamic, adaptive, NLP-enabled ITS in the nature of the follow-up dialogue. The Beetle II tutor we have described in this paper aims to elicit the correct answer from the student over multiple turns through adaptive natural language feedback guiding them towards the correct answer. We will refer to this as the “elicit answer” system (elicit). In addition, we built another version of the system which immediately tells the student the correct answer without analyzing the answer or providing any explicit corrective feedback (the “tell answer” system, tell).Footnote 9 The user interface, lesson materials, activities and even the tutor questions were exactly the same in the two versions. The only thing that differed was the type of feedback the students received after answering a question (including all identify, explain, multiple choice, and number questions discussed in the previous section).

The elicit version of Beetle II processed student input and produced a diagnosis and dynamically generated feedback and hints as described in the previous section. In the tell version of Beetle II, the system did not attempt to provide any explicit feedback on the accuracy of student answers. Instead, each time a student answered a question the tutor would simply give them a neutral acknowledgment, followed by a statement of the correct answer (bottom-out). An example interaction with the tell system is given in Fig. 8. The system moves on to the next question without attempting remediation. It is left entirely up to the student to notice whether their answer matches the answer given by the system, and to determine whether any differences in phrasings and terminology are meaningful.

Fig. 8 Example dialogue with the tell system, corresponding to turns 1–3 with the elicit system in Fig. 2

Procedure

After reviewing the informed consent paperwork, all participants filled out a demographic questionnaire and took the pre-test. The participants were then introduced to Beetle II and given a brief demonstration of its functionality. The students spent the majority of the experimental session working through two lessons with either the tell or the elicit system, typically spending 2–3 hours on the lessons. The type of feedback provided (elicit versus tell) was the only difference between these two conditions.

After the students completed the lessons, they took a post-test, which included 21 multiple choice questions, and filled out a usability and satisfaction questionnaire. The REVU-IT (Report on the Enjoyment, Value, and Usability of an Intelligent Tutoring System) questionnaire (Dzikovska et al. 2011) was used to assess usability and satisfaction with Beetle II. REVU-IT asked participants to rate their agreement, on a 5-point Likert scale, with both positive and negative statements regarding the lesson materials (10 questions), the circuit-building workspace (13 questions), the computer tutor (35 questions), and their overall reaction to Beetle II (5 questions). They were then debriefed, thanked and paid. The full study took 3–4 hours per participant.

In the no-training condition, participants first reviewed an informed consent document and then completed the same demographic survey and pre-test as the participants in the other two conditions. Next, they spent an hour completing a distractor task, which involved working through a lesson on differential equations within the context of modeling predator and prey relationships. This was followed by the post-test. They were then debriefed, thanked and paid. The no-training group took 2 hours to complete the study.

Corpus

Guided by VanLehn et al. (2007), we conducted a power analysis and determined that a total of 120 participants (40 per condition) would give us reasonable statistical power (0.85) to reject the null hypothesis (alpha = 0.05), assuming an effect size of 0.75 between our conditions. Participants were recruited from a Southeastern University in the US using the University’s online recruiting tool. The recruiting system advertised for students who had no prior knowledge of electronics and electricity, and participants had to confirm the lack of prior knowledge before they were enrolled in the study. A total of 122 participants took part in the study. The no-training group consisted of 41 participants, who ranged in age from 18 to 38 years old (M = 21.4). Participants in the training conditions were randomly assigned to the elicit or tell condition. There were 41 participants in the elicit condition and 40 participants in the tell condition. Data from eight participants in the training conditions were dropped: four participants due to technical errors with the system, three who were outliers with respect to learning gain, and one who was an outlier with respect to interpretation problems.Footnote 10 Outliers were defined as being more than three standard deviations from the mean for those variables.

Our analysis data set therefore includes 35 participants in the elicit condition, whose ages ranged from 18 to 37 years (M = 21), and 38 participants in the tell condition, aged from 18 to 42 years (M = 21.5). In both training conditions, students were required to work through 26 exercises with a total of 215 questions asked by the system, 150 of which required natural language answers (the remaining questions required building circuits only). In the elicit condition, students took an average of 422 turns (SD = 32), including an average of 230 natural language turns (SD = 24); transcripts contained on average 1417 student words (SD = 217). In the tell condition, students took an average of 335 turns (SD = 6.25), including an average of 156 natural language turns (SD = 2.04); transcripts contained on average 900 student words (SD = 218).

The differences in the number of turns and in the number of words were significant between conditions (average number of student language turns: t(71) = 18.78, p < 0.0001; average number of student words: t(71) = 10.14, p < 0.0001). This is as expected, given that the students in the elicit condition were asked to revise their answers if they were incorrect, while the students in the tell condition were allowed to move on regardless of answer correctness.

Results

We begin by presenting descriptive and inferential statistics on our two primary outcome measures, student learning and student satisfaction. Student learning is measured both in terms of individual gain scores and effect size by condition. While other studies have found that student satisfaction is not always directly related to learning outcomes (Jackson et al. 2009; Papadopoulos et al. 2009), we include it as an important outcome measure because it contributes to a student’s willingness to continue to use a training system.

Next, we focus on one of our primary system characteristics—natural language interpretation of student answers. Accurate language understanding is a prerequisite for providing accurate feedback, and was therefore the main focus of our initial evaluation. Evaluating other aspects of the interaction, in particular the quality and appropriateness of system feedback, is the next step in our planned research program and is discussed in more detail in the “Future Work” section.

In our evaluation of the Beetle II interpretation component, we first describe how human coders judged the correctness of student answers. Then we provide a number of measures of system accuracy in interpreting student answers, when compared to this human assessment. Finally we look more deeply into the impact of interpretation quality on our two primary outcome measures (student learning and satisfaction). These final analyses point to different strategies for improving the system’s performance and the effects they may have on the desired learning outcomes.

Overall System Effectiveness

As expected, there was no learning gain in the no-training group. The participants in this condition had a mean pre-test score of 0.39 (SD = 0.13) and mean post-test score of 0.38 (SD = 0.14), with an average normalized learning gain of −0.04 (SD = 0.22). Learning gain was calculated using the formula: \(\frac{\mathit{posttest} - \mathit{pretest}}{1 - \mathit{pretest}}\). The maximum score on both the pre- and post-tests was 1.0 (i.e. 100 %); thus the maximum learning gain using this calculation is 1.0.
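To make the gain computation concrete, the following minimal sketch applies the formula above to hypothetical pre- and post-test scores; the function name and the example values are illustrative and are not taken from the study data.

```python
def normalized_gain(pretest: float, posttest: float) -> float:
    """Normalized learning gain: the fraction of the available headroom
    (1 - pretest) that the student actually gained by post-test."""
    return (posttest - pretest) / (1.0 - pretest)

# Hypothetical student: 0.40 on the pre-test, 0.70 on the post-test.
print(round(normalized_gain(0.40, 0.70), 2))  # 0.5 of the possible improvement achieved
```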

In contrast, both the elicit and tell versions of Beetle II were highly effective at teaching basic electricity and electronics concepts, with participants in both conditions learning significantly between pre- and post-test. The elicit condition had a mean pre-test score of 0.35 (SD = 0.13) and mean post-test score of 0.75 (SD = 0.12). The average normalized learning gain in elicit was 0.61 (SD = 0.15), a significant increase between pre- and post-test (paired t-test, t(34) = −19.15, p < 0.00001). The tell condition had a mean pre-test score of 0.34 (SD = 0.13) and mean post-test score of 0.77 (SD = 0.14). The average normalized learning gain in tell was 0.65 (SD = 0.21), again a significant increase between pre- and post-test (paired t-test, t(37) = −17.38, p < 0.00001). However, there was no significant difference in normalized learning gain between conditions, t(71) = −0.863, p = 0.39.

To determine the effect of the elicit and tell conditions, effect sizes were calculated using Cohen’s d as \(\frac{M_{\mathit{exp}}-M_{\mathit{notrain}}}{\sigma_{\mathit{pooled}}}\), where \(M_{\mathit{exp}}\) is the mean of the corresponding experimental condition (elicit or tell), \(M_{\mathit{notrain}}\) is the mean of the no-training condition, and \(\sigma_{\mathit{pooled}}\) is the pooled standard deviation for the no-training condition and the relevant experimental condition. Both experimental conditions had a large effect on student learning. When compared to the no-training condition, the elicit system had an effect size of d = 1.72, and the tell system had an effect size of d = 1.69.
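The paper does not spell out how the pooled standard deviation was computed; the sketch below assumes the standard sample-size-weighted pooling formula, so it should be read as an illustration of the effect size definition rather than an exact reconstruction of the reported values.

```python
import math

def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    # Sample-size-weighted pooled standard deviation (assumed pooling formula).
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))

def cohens_d(m_exp: float, m_notrain: float, sd_pooled: float) -> float:
    # Cohen's d as defined in the text: mean difference over the pooled SD.
    return (m_exp - m_notrain) / sd_pooled
```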

With respect to student satisfaction, we focus on the two sections of the REVU-IT questionnaire that are related to tutoring quality (Footnote 11). The Tutor score represents the subset of questions assessing satisfaction with the natural language tutor; for example, “I felt that the tutor understood me well” or “I found our dialogues to be boring.” The Overall score covers the five questions addressing overall satisfaction with the system, e.g. “I would use this system again in the future to continue to learn about electricity.” The full questionnaire, and the divisions into specific subsets, are presented in Dzikovska et al. (2011). The mean Overall satisfaction scores were 3.4 out of 5 (SD = 0.87) for elicit and 3.67 out of 5 (SD = 0.85) for tell. There was no significant difference between conditions (t(71) = −1.31, p = 0.20). In contrast, the Tutor satisfaction score was significantly higher in tell than in elicit (tell: M = 3.33, SD = 0.65; elicit: M = 2.57, SD = 0.61; t(71) = −5.13, p < 0.0001).

Since the only difference between the elicit and tell systems is the adaptive natural language feedback that elicit uses to draw out a correct response from the student, we surmised that the quality of the natural language interpretation may have affected the learning outcomes. This led us to carry out a detailed evaluation of the performance of the NLP module in the elicit system.

Evaluating NLP Performance

Creating the Gold Standard

Evaluating NLP components for tutorial dialogue systems is complicated, because the representations used by current systems are domain specific, and therefore there are no shared data sets or standard methodologies for evaluating system performance. Ideally, we would like to have a “gold standard” semantic representation for each student answer, and then compare how well the representations generated by the system match against such standard representations. However, this requires human annotation with good inter-rater reliability, which is labor intensive and impractical on a large scale. Therefore, to evaluate the interpretation component, we use a previously established methodology based on comparing the correctness judgments made by the Beetle II diagnoser with accuracy codes assigned to the same utterances by human raters (Dzikovska et al. 2012). This approach simplifies creation of the gold standard (Footnote 12).

The outputs from the Beetle II diagnoser are mapped into a 5-class annotation scheme as shown in Table 1. This scheme is based on the DEMAND coding scheme for assessing correctness of student answers in human-human dialogue (Campbell et al. 2009). All student utterances in the corpus were manually labeled with DEMAND labels (κ = 0.69, see Steinhauser et al. 2010) and automatically converted into our 5-class annotation scheme (Footnote 13). The resulting corpus consists of all student utterances in the data, each associated with two labels: a gold standard label based on the manual annotation, and an automatic label based on the output of the diagnoser. Examples of gold standard labels assigned to student utterances in the corpus are given in Table 2. The system is evaluated by comparing the labels and computing standard classification metrics.

Table 1 Label set used for evaluating interpretation accuracy
Table 2 Examples of gold standard annotations for student answers in our evaluation corpus

Evaluation Data Set

The evaluation data set consists of all student utterances typed in response to tutor questions, together with the labels automatically assigned by the diagnoser and the manually annotated gold standard labels, as discussed earlier. We report evaluation scores for the elicit system only, since the tell system did not attempt any remediation. The entire data set covers the 150 questions requiring natural language answers asked by the system, and 8004 student answers. In our evaluation, we built three main test sets, based on the question types discussed in the “Curriculum and Exercise Management” section (Footnote 14).

  • All questions: a test set containing every student response submitted to the system, together with the manual label and the automatically assigned class (Footnote 15).

  • Explain: student responses to all explanation and definition questions, except those which asked for a prediction.

  • Identify: all student responses to object or attribute identification questions, except those which asked for a prediction.

Table 3 shows the characteristics of our evaluation sets in terms of the number of student answers (total number of items in the set), and the number of unique questions from which the evaluation set items were drawn.

Table 3 Characteristics of our evaluation sets

Evaluation Metrics

We focus on several different aspects of the system’s performance. First, we need to have a metric that reflects the student’s overall experience: if a student is interacting with the system, how accurately is the system interpreting their answers overall, and, correspondingly, how appropriate is the feedback that they are getting? Second, since we are particularly interested in supporting open-ended explanation questions, we want to evaluate the system’s performance on these questions separately. Finally, we want to establish baseline performance for comparing more advanced versions of the system against this initial version, and for comparing different NLP approaches to the natural language interpretation task.

From the “student experience” point of view, the most relevant metric is overall interpretation accuracy (referred to simply as “accuracy” in Tables 5 and 8). It is defined as the relative frequency of instances in which the label assigned by the system matches the gold standard label: \(\mathit{accuracy} = \frac{\#\mathit{matching}}{\#\mathit{total}} = 1 - \frac{\#\mathit{errors}}{\#\mathit{total}}\). Thus the accuracy value on the All questions set reflects the proportion of instances in which the system makes a correct decision about which tutoring tactic to use.

However, accuracy scores do not take into account the potentially skewed distribution of answer classes. Table 4 shows that some classes are much more common than others in the corpus. For example, 60 % of all student answers are rated as correct, but only 14 % are labeled pc_incomplete and only 2 % overall are rated as irrelevant. With such unbalanced class distributions, the overall accuracy metric often favors systems that focus on the most frequent classes at the expense of ignoring less frequent ones. In our case, it is possible to achieve 60 % interpretation accuracy simply by classifying every student utterance as correct and never attempting any remediation. From the point of view of establishing a baseline and comparing different system versions, we need to ensure that the system is doing well across all classes, and not just the most common ones. Therefore, following Dzikovska et al. (2012), we report per-class \(F_1\) scores for each class, to show how well each class is recognized, and the macro-average \(F_1\) score, to characterize overall system performance.

Table 4 Gold standard label distribution in our evaluation sets. Proportion of total label count in parentheses

For each class, the \(F_1\) score can be defined as (Footnote 16)

$$F_1(c) = \frac{2 \cdot \text{true positive}(c)}{2 \cdot \text{true positive}(c) + \text{false negative}(c) + \text{false positive}(c)} $$

The macro-average \(F_1\) score is defined as the unweighted average of the individual class scores,

$$F_1(\mathit{macro}) = \frac{1}{N_c}\sum_c F_1(c) $$

where \(N_c\) is the number of classes (5 in our case).
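To make the metric definitions concrete, here is a minimal sketch that computes overall accuracy, per-class \(F_1\) and macro-average \(F_1\) from two parallel lists of gold standard and system labels; it is an illustration of the definitions above, not the evaluation code used in the study.

```python
def evaluate(gold, predicted):
    """Overall accuracy, per-class F1 and macro-average F1 for parallel label lists."""
    labels = sorted(set(gold) | set(predicted))  # in our setting, the five classes of Table 1
    accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    per_class_f1 = {}
    for c in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, predicted) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, predicted) if g == c and p != c)
        denominator = 2 * tp + fp + fn
        per_class_f1[c] = 2 * tp / denominator if denominator else 0.0
    macro_f1 = sum(per_class_f1.values()) / len(labels)  # unweighted average over classes
    return accuracy, per_class_f1, macro_f1
```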

We report three other evaluation metrics related to student experience. First, we report a score based on a binary “accept/reject” decision as proposed in Dzikovska et al. (2012). Mismatches between system and human judgments are labeled as errors when computing the overall accuracy score but do not necessarily mean that the system will give inappropriate feedback. For example, in the dialogue with participant 24 in the Appendix, student turn msg255 is misdiagnosed by the system as irrelevant, when human annotators assigned a pc_incomplete label. Ideally, the system should acknowledge that the student’s answer is largely correct. Instead, the system gives hedged feedback (“Hmmm...”) followed by a hint. Although not optimal, this feedback is still reasonable and does help the student to improve their answer. Therefore we computed an alternative accuracy measure, “accept-reject accuracy” (Footnote 17). Using this metric, an answer counts as accepted if the diagnoser labeled it as correct. An answer counts as rejected if the diagnoser labeled it as flawed (Footnote 18) or the interpreter rejected it as non-interpretable. In other words, an accepted answer is one where the system tells the student that it is correct and moves on, and a rejected answer is one where the system tells the student that there is a problem, and either asks the student to improve it or provides a correct alternative itself. This metric effectively assesses how well the system is able to decide when to accept the student answer as “good enough” and when to attempt a remediation. It does not consider whether the assigned label will allow the tutorial planner to choose an adequate remediation tactic (Footnote 19). Comparing this metric against the overall accuracy shows how the interpreter’s performance is affected by confusions between different types of answer flaws (as opposed to differentiating between correct and flawed answers).
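The binary reduction can be sketched as follows; the label name correct is taken from the scheme above, and collapsing every other diagnosis (flawed or non-interpretable) into a rejection is the assumption described in the text.

```python
def accept_reject(label: str) -> str:
    # "accept": the system treats the answer as correct and moves on;
    # "reject": the system signals a problem and attempts remediation.
    return "accept" if label == "correct" else "reject"

def accept_reject_accuracy(gold, predicted):
    pairs = list(zip(gold, predicted))
    return sum(accept_reject(g) == accept_reject(p) for g, p in pairs) / len(pairs)
```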

Finally, we break down the errors that contribute to overall accuracy to look separately at the relative frequency of non-interpretable and misinterpreted utterances. It is well known from spoken dialogue system research that there is a trade-off between the frequency of misunderstandings (i.e. situations where the system incorrectly understands the user and takes the wrong actions) and non-understandings (i.e. situations where the system asks the user to repeat or rephrase the utterance rather than risk taking an incorrect action). Most dialogue systems make a decision to reject certain utterances as non-interpretable when it is not possible to find an interpretation that is reliable enough to use in response generation (Bohus and Rudnicky 2005). Such rejections can be frustrating to users. However, giving feedback based on a misinterpreted utterance can mislead and confuse the students as well. We observed many dialogue breakdowns caused by misinterpretation in early system pilots, and therefore developed a policy for dealing with non-understandings, where the system acknowledges its lack of understanding and gives a message describing the source of the problem (discussed in more detail in the next subsection). Separating out the non-interpretable from the misinterpreted utterances provides additional information about the misunderstanding vs. non-understanding trade-off within the Beetle II system.

Evaluation Results

We report the values of our evaluation metrics for the complete set of student utterances, and for the Explain and Identify subsets separately. Overall accuracy, “accept-reject” accuracy, macro-average \(F_1\), and the relative frequencies of non-interpretable and misinterpreted utterances are reported in Table 5. Table 6 breaks down the macro-average \(F_1\) into individual class metrics. The evaluation scores show that the system makes largely correct decisions about whether to accept a student answer or to request that it be rephrased, as reflected in its high “accept-reject” accuracy; this is also reflected in a high \(F_1\) score for the correct class. However, the system often makes incorrect decisions about which class a flawed answer belongs to. This is reflected in lower scores for the classes other than correct, and correspondingly lower macro-average \(F_1\) scores. Unsurprisingly, explanation questions are the most difficult to interpret, and thus the evaluation scores for this subset are much lower than the overall system average on all metrics. We first provide an intrinsic evaluation of system performance, and return to the relationship between overall metrics and learning outcomes in the next two sections.

Table 5 Overall evaluation scores for the ELICIT system
Table 6 Per-class \(F_1\) scores for the ELICIT system

For intrinsic evaluation of classifier accuracy, a reasonable comparison is to the “majority class” baseline, in which the interpreter would label every student utterance using the most frequent class according to the gold standard. As can be seen from Table 4, this would mean labeling every student utterance as correct. Table 7 shows the comparison between this baseline and the Beetle II interpreter. The baseline has a macro-averaged \(F_1\) score of 0.15 and an accuracy score of 0.60 on our All questions data set, and a macro-averaged \(F_1\) of 0.20 and an accuracy of 0.44 on the Explain subset. The Beetle II interpreter outperforms the majority class baseline in both cases, and by a particularly large margin on macro-averaged \(F_1\). The differences between the interpreter evaluation scores and the majority class baseline scores are statistically significant for all three data sets, and for both macro-averaged \(F_1\) and accuracy scores, with p < 0.01 (Footnote 20).
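As a sanity check on the baseline figures for the All questions set, the macro-averaged \(F_1\) can be reconstructed from the label distribution in Table 4 (treating proportions of the data set as counts): labeling everything correct yields a non-zero \(F_1\) only for the correct class (60 % of answers) and zero for the remaining four classes,

$$F_1(\mathit{correct}) = \frac{2 \times 0.60}{2 \times 0.60 + 0 + 0.40} = 0.75, \qquad F_1(\mathit{macro}) = \frac{0.75 + 0 + 0 + 0 + 0}{5} = 0.15 $$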

Table 7 Comparison in evaluation scores between the elicit system and a majority class baseline

Our discussion in this paper focuses on the overall interpretation metrics. We also conducted further system evaluation looking at confusion matrices and examining how different confusions correlate with learning gain. The confusion matrices and evaluation results are provided in Dzikovska et al. (2012).

Another question is how the system’s accuracy compares to other state-of-the-art NLP methods. Direct comparison to other tutorial dialogue systems is difficult because of the lack of systematic reporting of standardized interpretation quality measures in the existing literature. However, the data from our corpus was used as part of a shared task on student response analysis at the 7th International Workshop on Semantic Evaluation (SemEval-2013), with the goal of investigating the performance of state-of-the-art NLP approaches in our setting. The participants were provided with a data set of question-answer pairs based on explanation questions extracted from the Beetle II evaluation corpus, and challenged to build a system that directly classifies student answers into the same 5 classes that we used in our evaluation. The participating teams used approaches based on semantic similarity scores, recognizing textual entailment, domain adaptation, machine translation, combining different features, and supervised classification methods. The systems and results are discussed in detail in Dzikovska et al. (2013). For the 5-way classification task, the results ranged from 0.44 to 0.71 accuracy (0.31 to 0.62 macro-averaged \(F_1\)), and for the 2-way “accept-reject” task the accuracy ranged between 0.64 and 0.84.

These results are not directly comparable to those reported in Table 5, because the evaluation corpora in the SemEval task were split into training and test sets in order to support training and testing of statistical classifiers. However, they show that our system, which was developed based on an analysis of the transcripts from 8 pilot sessions, performs in the same range as state-of-the-art statistical NLP approaches trained on data collected from 73 students interacting with the system (combining the data from both the elicit and tell conditions). It also shows that the analysis task is intrinsically difficult, with none of the systems reaching ceiling. We describe avenues for performance improvement in the Discussion section.

Impact of Interpretation Quality

While there is much scope for improvement in interpretation quality, it is also important to understand whether such improvements would actually contribute to better outcomes. It is reasonable to hypothesize that more accurate interpretation should lead to more appropriate feedback and, correspondingly, to higher overall learning gain and satisfaction. There is a long tradition of investigating the importance of different interaction parameters by correlating system performance characteristics with desired outcomes in spoken dialogue systems within the PARADISE evaluation framework (Walker et al. 2000). Similar studies have been carried out for dialogue-based ITSs (Aleven et al. 2004; Litman and Forbes-Riley 2005; Pon-Barry et al. 2004). Although such correlational analyses are not proof of causality, they can help develop actionable hypotheses about system improvement that can then be tested in user experiments (Rotaru and Litman 2009). We therefore carried out an exploratory analysis investigating the relationship between interpretation quality and learning gain in our data set.

We first computed the interpretation quality scores listed in Table 5 for each student separately, and then computed the correlations between these individual scores and that student’s learning gain and satisfaction ratings. As argued above, explanation questions constitute an important subset of our data with respect both to tutoring and to difficulties in natural language processing. Therefore, we also investigated the relationship between system performance on Explain questions for individual students and their learning outcomes. The correlations we found are summarized in Table 8 (Footnote 21).
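The per-student analysis can be sketched as follows, assuming Pearson’s r (as suggested by the r values reported below); the function and variable names are illustrative.

```python
from scipy.stats import pearsonr

def metric_outcome_correlation(per_student_metric, per_student_outcome):
    """Correlate a per-student interpretation quality score (e.g. interpretation
    accuracy over that student's utterances) with a per-student outcome
    (e.g. normalized learning gain or a satisfaction score)."""
    r, p = pearsonr(per_student_metric, per_student_outcome)
    return r, p
```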

Table 8 Correlations between interpretation quality metrics and outcome metrics in ELICIT. Significant correlations (with p < 0.05) in bold

The results show that overall interpretation accuracy is significantly correlated with both learning gain (r = 0.38, p = 0.02) and Tutor satisfaction (r = 0.37, p = 0.03). Recall from the previous section that the errors in the overall accuracy computation are split into two types: non-interpretable utterances, in which the system finds no possible interpretation, and misinterpretations, where the system assigns an incorrect class to the utterance. Our evaluation shows that they occur at similar rates (see Table 5). However, the results in Table 8 show that only the frequency of non-interpretable utterances correlates with learning gain and user satisfaction (learning gain: r = −0.40, p = 0.02; Overall satisfaction: r = −0.35, p = 0.04; Tutor satisfaction: r = −0.48, p = 0.004). In contrast, misinterpretations are not significantly correlated with either learning gain or user satisfaction. This suggests that students are sensitive to the system explicitly acknowledging interpretation failures, but may be less sensitive to non-optimal feedback.

When looking only at explanation questions, overall interpretation accuracy was not correlated with learning gain (r = 0.24, p = 0.15), but was correlated with Tutor satisfaction (r = 0.33, p = 0.05). When error types were examined separately, this correlation was again significant only for non-interpretable utterances, mirroring the pattern observed in the All questions data set. This underscores the need to better understand the impact of non-interpretable utterances on learning outcomes. We examine this issue further in the next subsection.

Impact of Non-interpretable Utterances

Our results clearly show that the frequency of non-interpretable utterances is negatively correlated both with learning gain and with user satisfaction. As with the overall interpretation quality scores, one inference is that if the interpreter can be improved, this will lead to an overall reduction in error scores, and hopefully to a correlated improvement in learning gain. We outline some possibilities for improving interpretation robustness in the “Future Work” section. However, even with technology improvements we can expect that some proportion of student utterances will be difficult or impossible to interpret. This has been observed in spoken dialogue system research in general. Tutoring introduces additional difficulties for interpretation because, when students struggle with unfamiliar domain terminology, they sometimes say things that even human tutors find difficult to understand and evaluate (Footnote 22). Moreover, different students have different degrees of success in mimicking the terminology used by the system (Steinhauser et al. 2011). Therefore, it is important to evaluate the impact of non-interpretable utterances on learning outcomes, in order to gain a better understanding of appropriate error recovery policies.

As discussed in the evaluation metrics section, in developing a dialogue system there is always a trade-off between the rate of non-interpretable and misinterpreted utterances. It is possible to build a system that never reports any non-interpretable utterances by having the interpreter assign a pre-determined class (e.g. correct or irrelevant) to every utterance it fails to interpret, and then letting the tutorial planner make tutoring decisions based on that analysis. However, this will mean that some utterances previously treated as non-interpretable will be misinterpreted instead, resulting in an increase in inappropriate feedback decisions. In early system pilots, we observed situations where misinterpretations caused considerable confusion for the students. We were particularly concerned about situations in which the system misinterpreted correct answers as flawed and produced feedback that further confused the students and derailed the dialogue (Dzikovska et al. 2009).

In light of the pilot results, we decided that rejecting some student utterances as non-interpretable was preferable to certain misunderstandings. Therefore, as discussed in the “System Implementation” section, we implemented an error recovery policy that gives students information about terms that were not understood by the system when an interpretation failure occurs. We also hoped that these help messages would teach students to use terminology correctly, which can in turn improve learning outcomes, since higher levels of lexical cohesion with the tutor can be positively correlated with learning gain (Ward and Litman 2006). However, given that we observed significant negative correlations between the frequency of non-interpretable utterances and both learning gain and user satisfaction, it appears that the error recovery policy implemented in the system was not effective.

In order to improve our recovery policy implementation, we performed a more detailed analysis of the features of the utterances that the system failed to interpret, and of the correlations between different help message types and learning outcomes. The first thing worth noting is that non-interpretable utterances show a markedly different distribution of answer classes from interpretable utterances, as can be seen from Table 9. In particular, recall that in the corpus as a whole 60 % of utterances are labeled correct and 21 % are labeled contradictory. Among the utterances that the system fails to interpret, however, only 25 % are labeled correct, while 38 % are labeled contradictory. Among the interpretable answers, 67 % are correct and 17 % are contradictory. In fact, answers labeled contradictory were disproportionately likely to be non-interpretable (overall, P(nonint|contradictory) = 0.30, while P(nonint|correct) = 0.07). This difference in distributions is statistically significant: \(\chi^2(4) = 875.5\), p < 0.0001.
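The significance test reported above can be reproduced from a 2 × 5 contingency table of gold standard label counts (interpretable vs. non-interpretable rows, the five answer classes as columns). The counts in the sketch below are placeholders; the actual figures are those of Table 9.

```python
from scipy.stats import chi2_contingency

# Placeholder counts only; substitute the per-class counts from Table 9.
observed = [
    [670, 140, 170, 15, 5],   # interpretable utterances, by gold standard class
    [50, 40, 76, 20, 14],     # non-interpretable utterances, by gold standard class
]
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)  # the paper reports chi-squared(4) = 875.5, p < 0.0001
```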

Table 9 Gold standard label distribution for interpretable vs. non-interpretable subsets. Proportion of subset total in parentheses

This analysis shows that flawed student answers are intrinsically more difficult to interpret, probably because they are more likely to contain vague or incorrect terminology that the system (and in some cases even human tutors) may find difficult to understand. We can also conclude that, relatively speaking, there is little risk involved in treating non-interpretable utterances as flawed and requesting that the student improve them (Footnote 23). Thus, our initial concern about the consequences of treating difficult-to-interpret utterances as flawed was not supported by the data. But clearly something different is needed in place of the error recovery policy that we are currently using.

We further examined the impact of interpretation failures by taking the fine-grained classification of interpretation problems developed in Dzikovska et al. (2010) and identifying four broad classes, based on the cause of the problem and the way the system deals with it (a schematic sketch of this error recovery dispatch follows the list):

  • no-semantic-analysis-possible: the system can find neither a full parse nor a reasonable fragment combination that covers the input; or it finds a parse but does not recognize the meaning of any of the content words (Footnote 24); or (in rare instances) a crash in one of the components prevents feedback from being generated. The system tells the student that they were not understood and asks them to rephrase (possibly with a hint).

  • identifiable-failure-point: the system knows most of the words in the sentence, but there is an unknown word or word combination that interferes with finding a complete interpretation (Footnote 25), or there is a pronoun that the system cannot resolve. The system points out the word or word combination it has not understood and optionally gives a hint.

  • wrong-input-form: the student submits a circuit when the system expects typed input, or submits an unexpected form of answer, e.g. typing a long sentence, phrased in a way that leaves the answer ambiguous, in response to a question that asks them to name a circuit or component. The system then tells the student what answer format it is expecting, e.g. “Sorry, this isn’t the form of answer that I expected. I’m looking for the name of an object”.

  • restriction-failure: the student’s input violates knowledge base expectations. The student is told which expectation was violated, with an example of a correct phrasing, e.g. “I don’t understand when you say that circuits are lit. Bulbs can be lit, but not circuits”.
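A schematic sketch of this dispatch is shown below. The message wording is modeled on the examples above, but the InterpretationProblem structure, its field names and the exact phrasings are illustrative assumptions rather than the system’s actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InterpretationProblem:
    kind: str                                   # one of the four categories above
    failed_span: Optional[str] = None           # unknown word / unresolved pronoun
    expected_form: Optional[str] = None         # e.g. "the name of an object"
    violated_restriction: Optional[str] = None  # e.g. "circuits are lit"
    correction_example: Optional[str] = None    # e.g. "Bulbs can be lit, but not circuits."

def recovery_message(problem: InterpretationProblem) -> str:
    if problem.kind == "no-semantic-analysis-possible":
        return "Sorry, I didn't understand that. Please try rephrasing your answer."
    if problem.kind == "identifiable-failure-point":
        return f"I'm not sure what you mean by '{problem.failed_span}'. Could you rephrase?"
    if problem.kind == "wrong-input-form":
        return ("Sorry, this isn't the form of answer that I expected. "
                f"I'm looking for {problem.expected_form}.")
    if problem.kind == "restriction-failure":
        return (f"I don't understand when you say that {problem.violated_restriction}. "
                f"{problem.correction_example}")
    raise ValueError(f"unknown problem type: {problem.kind}")
```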

The frequency of different problem types is shown in Table 10. When a problem occurs, the system attempts to provide a targeted help message that indicates the specific portion of the student input causing the problem. This can be done for all cases except no-semantic-analysis-possible, where there is insufficient information about the underlying cause. In addition, for wrong-input-form and restriction-failure the system is able to provide some guidance with respect to acceptable inputs.

Table 10 Frequencies of different interpretation problems in the corpus

A correlational analysis revealed that these four types of errors pattern differently with respect to learning gain and user satisfaction. The results of the correlational analyses are presented in Table 11. Students clearly found the situations where the system could give no help frustrating: there was a significant negative correlation between the frequency of no-semantic-analysis errors and Tutor and Overall satisfaction (Tutor: r = −0.48, p = 0.004; Overall: r = −0.36, p = 0.04). However, the frequency of such problems was not related to learning gain (r = −0.29, p = 0.09). In the two situations where the system was able to give help and clearly articulate its expectations, there was no significant correlation with user satisfaction. However, the frequency of such errors was negatively correlated with learning gain (wrong input form: r = −0.38, p = 0.02; restriction failure: r = −0.39, p = 0.02), indicating that students who repeatedly failed to phrase their answers in a way that the system could process also tended to learn less.

Table 11 Correlations between individual interpretation failure types and learning outcomes

These results suggest that different factors may be influencing our two primary outcome metrics, learning gain and user satisfaction. To improve user satisfaction with the system, it may be best to concentrate on reducing the number of cases where the system is unable to give any help, which is best achieved by improving overall system robustness. However, in order to improve learning gain, the focus should be placed on the wrong-input-form and restriction-failure cases. There are two possibilities for what is going wrong here. First, it may be that the feedback currently produced is ineffective and needs to be improved. A second possibility is that such cases indicate students who are confused (perhaps struggling with domain terminology), or simply inattentive, and therefore unable to modify their answer even when they are given clear guidance about how to do so (this problem has previously been reported in tutorial dialogue; Glass and Evens 2008). With such students, it may be counterproductive for the system to take responsibility for “interpretation failures”; instead, it may be helpful to devise strategies to recapture the student’s attention. Determining which of these two hypotheses is correct is planned as part of our future work.

This analysis also suggests that in developing and evaluating interpretation modules for tutorial dialogue systems, different metrics may be important depending on the desired target outcomes. Most current systems place the emphasis on improving learning gain. However, improving user satisfaction is important for keeping students engaged long-term. Thus, in evaluating interpretation quality, system developers may benefit from more fine-grained analyses such as the one described in this section, in order to better target the error types related to the desired outcome metric.

Discussion

We began this paper with the argument that a cyclical process of experimentation followed by explanation is an effective method for teaching conceptual material in STEM domains, but that it poses a significant challenge for the development of an NLP-enabled Intelligent Tutoring System. Our primary results have certainly borne out both pieces of this argument. Our curriculum, embedded in both versions of the system (tell and elicit), yielded impressive effect sizes of close to two sigma when compared to a no-training control condition. Still, analyses of interpretation quality showed that there is significant room for improvement in processing student answers.

The next logical step is to carefully consider the evaluation data and identify those areas that appear most promising and most important for further development. First, however, we will briefly revisit our hypothesis that the NLP capability is an important component of this system and see what, if anything, our results to date have to say on this point. Recall that the second stage of the conceptual change approach uses a tutorial dialogue with the student to elicit reflection on the observed phenomena and to guide the student to generate an accurate conceptualization of the underlying principles in the domain. One of our control groups (the tell system) was designed explicitly to test our hypothesis that the effectiveness of our system would be improved by the use of an NLP component, and our results did not yield statistically significant support for that hypothesis.

Obviously, we would have been delighted if this first version of a system using our adaptive NLP feedback (elicit) had yielded a significantly larger learning gain than the tell system. While that was not the case, we believe it would be premature to draw any conclusions about the potential value added by an NLP capability. First, the evaluation of the interpretation module shows that the current instantiation of NLP in our system is not yet robust enough to provide a fair test of the hypothesis that incorporating NLP capabilities improves the effectiveness of a training system built around the conceptual change approach embedded in the exercise sequence.

Next, recall from the introduction that student reflection and generation are integral to several different learning theories, each supported by multiple bodies of empirical evidence. While it is possible that properties of our domain and curriculum limit the benefits of natural language interaction, no single study would ever be sufficient to justify abandoning that component of the instructional process. Thus, improving the system’s NLP capabilities remains relevant in the context of supporting research into ways of implementing effective instructional approaches in STEM.

Finally, while we did see substantial learning taking place, our students did not reach ceiling. Mean gain scores were 61 % and 65 % in elicit and tell respectively, indicating that there is still room for improvement. While it is certainly possible to make changes to the static content of the tell system, the elicit system affords a larger variety of manipulations. Basic principles of system development, such as the law of diminishing returns, suggest that gains are more likely to be achieved by focusing on improving weak system components and adding new capabilities than by fine-tuning those aspects of a system that are already well developed.

Our evaluation points at a set of issues and limitations within the natural language processing module that can be addressed and evaluated in the future as part of improving the NLP components, which we discuss in the next section.

Future Work

One of the key outcomes of our evaluation is a rich data set which can be used for system improvement. The system design and implementation were informed by the analysis of dialogues from a human-human study; however, the full richness of human interaction is beyond the reach of current NLP technology, and thus we inevitably had to choose a subset of tutoring tactics for the system to use. Moreover, there are phenomena in human-computer dialogue that are not present in human-human dialogue. These include the need to deal with interpretation failures from the ITS, and also negative metacognitive and social utterances from the students, which may be suppressed in human-human communication because of politeness effects (Dzikovska et al. 2010; Steinhauser et al. 2010). We therefore cannot rely on human-human data collection as a guide for system behavior in those situations. Now that the system architecture and initial versions of the system modules have been implemented and evaluated with a sufficient number of participants, we can use the collected data to improve the system.

The Beetle II system uses symbolic NLP to dynamically generate feedback adapted to the problem, the state of the simulation environment and the previous dialogue history, based on a library of generic tutoring tactics. In our first evaluation, we have shown that the system is effective, but have not been able to demonstrate that the adaptive feedback in elicit leads to better outcomes than the tell control condition in which the students are always shown the correct answer, without adaptive feedback.

Our evaluation focused on natural language interpretation, since it is the first stage of student input processing, and thus interpretation problems are likely to cause problems in feedback generation as well. However, there are at least four sources of potential problems in the system’s behavior, and correspondingly four main areas for improvement:

  • Interpretation quality

  • The appropriateness of the feedback automatically generated by the system

  • The choice and implementation of the generic tutoring tactics included in the library

  • The tutoring policy employed in choosing the best possible tactic

While our evaluation demonstrates that interpretation quality is correlated both with learning gain and user satisfaction, it does not account for the impact of the automatically generated feedback. Two aspects of feedback quality need to be taken into account: the appropriateness of the feedback produced by the tactics already implemented in the system, and the choice of the implemented tactics. To evaluate the appropriateness of the feedback already produced by the system, tutor utterances need to be annotated for appropriateness by human raters. Similar studies have previously been carried out in tutorial dialogue (Aleven et al. 2004) and task-oriented spoken dialogue systems (Möller et al. 2007). We plan to develop an annotation scheme to enable us to annotate and evaluate feedback quality as part of our future work.

In addition, the number and variety of tutoring tactics available can affect the quality of instruction offered by the system. In devising our library of tactics, we examined the tactics that our human tutors used in the pilot study and selected the subset that were most frequently employed with successful outcomes. Our architecture is capable of supporting a larger range of tactics than is currently used by the system. For example, early on we implemented the tactic of pointing out a counter-example to a student’s flawed answer based on the circuits visible in the simulation environment. We did not use this tactic in our final system because it was not part of the most frequently successful remediation sequences; however, this and other additional tactics can be implemented and evaluated in the future.

The challenge with implementing a large number of tactics is deciding which to apply when more than one tactic is applicable at a given time, as is usually the case. We chose to use a fixed policy for applying the tactics, based on common suggestions in the literature and on what we saw our human tutors do: start with contentless prompts, progress to more specific hints, and finally give away the answer if the student is struggling. Recent work suggests that significant improvements in learning outcomes can be achieved by using statistical data analysis, either to detect common situations in the data where the current remediation policy is failing and additional tutoring can be beneficial (Forbes-Riley and Litman 2011), or by using reinforcement learning to learn the best policy (Chi et al. 2011). Our system is particularly well placed for applying reinforcement learning techniques. Reinforcement learning methods typically require a set of actions (i.e. tutoring tactics) to be available at each point of the interaction, and they optimize the expected outcome by repeatedly trying different actions in similar situations in order to learn the best action to take. Because our system can dynamically generate different feedback messages in every situation, it is an excellent platform for reinforcement learning, which we plan to undertake in future work.
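As a toy illustration of the direction sketched above (not the method of Chi et al. 2011, and not part of the current system), an epsilon-greedy learner over a discrete set of tutoring tactics might look like the following; the state discretization, tactic names and reward signal are all assumptions.

```python
import random
from collections import defaultdict

class EpsilonGreedyTacticPolicy:
    """Toy bandit-style learner: estimates the value of each tutoring tactic in
    each (coarsely discretized) dialogue state from observed rewards, e.g. local
    answer improvement or eventual learning gain."""

    def __init__(self, tactics, epsilon=0.1):
        self.tactics = list(tactics)
        self.epsilon = epsilon
        self.value = defaultdict(float)  # (state, tactic) -> running mean reward
        self.count = defaultdict(int)

    def choose(self, state):
        if random.random() < self.epsilon:
            return random.choice(self.tactics)  # explore
        return max(self.tactics, key=lambda t: self.value[(state, t)])  # exploit

    def update(self, state, tactic, reward):
        key = (state, tactic)
        self.count[key] += 1
        self.value[key] += (reward - self.value[key]) / self.count[key]

# Hypothetical tactic names, for illustration only.
policy = EpsilonGreedyTacticPolicy(["contentless_prompt", "specific_hint", "bottom_out"])
```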

With respect to interpretation quality, the data we have collected can help improve system performance. We recently carried out encouraging experiments showing that our symbolic interpreter can be combined with a statistical classifier to improve robustness while retaining the benefits of dynamically generated feedback in most cases. The resulting system performs significantly better on our evaluation data set than either the symbolic interpreter or the stand-alone statistical classifier (Dzikovska et al. 2013a, b). In addition, approaches have been developed to data-mine automatically parsed data sets in order to identify frequent word sequences that lead to parse failures, and thus to rapidly improve the coverage of grammar-based systems (Sagot and de La Clergerie 2006; van Noord 2004). These techniques can be adapted for our symbolic interpreter, thus using the collected data to efficiently improve symbolic interpretation.
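One simple way to combine the two components is a back-off arrangement, sketched below; this is an assumption for illustration, and the actual combination method evaluated in Dzikovska et al. (2013a, b) may differ.

```python
def combined_label(utterance, symbolic_interpreter, statistical_classifier):
    """Back-off combination: keep the symbolic analysis (and the detailed diagnosis
    it supports) whenever one is available, and fall back on the statistical
    classifier's label when the symbolic interpreter rejects the utterance.
    Both components are assumed to return one of the five answer classes, with
    the symbolic interpreter returning None for non-interpretable input."""
    label = symbolic_interpreter(utterance)
    if label is not None:
        return label, "symbolic"
    return statistical_classifier(utterance), "statistical"
```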

We are currently working on improving both the interpretation and the feedback generation, as discussed above, to develop and evaluate an improved version of the system, in order to investigate whether improvements in intrinsic evaluation scores will translate into improved learning gain and user satisfaction.

In addition, recall from the “Instructional Approach” section that the two key components of our approach are careful curriculum and exercise design intended to address common misconceptions, and the capability to conduct tutorial dialogue to support reflection and generation of correct explanations. Our tell control condition does not separate out the first component. We therefore developed another “stripped-down” baseline that provides students with the same reading material and has them build the same circuits, but does not require them to make predictions or generate any explanations. We plan to evaluate this system with human participants and compare it to the system versions described in this paper, in order to separate out the contribution made by instructional design to the overall effectiveness of the Beetle II system.

Improving the effectiveness of STEM-based Intelligent Tutoring Systems is both a high priority goal and a significant challenge for researchers and system developers. While there is clearly still much work to be done, we believe that the Beetle II system advances the state of the art by providing natural language interpretation and generation components which integrate a simulation environment with dynamic, adaptive tutorial feedback generation, in the context of an instructional approach that requires understanding natural language input as part of reflective dialogue.