Participants
The 52 participants were recruited from literacy classes of the Center for the Study of Adult Literacy (CSAL) in both Metro-Atlanta (n = 20) and Metro-Toronto (n = 32). Participants ranged in age from 16 to 69 years (M = 40.0, SD = 15.0), and the majority (73%) were female. Participants read at grade levels from 1.9 to 8.9 (M = 3.9, SD = 1.6), as measured by the Woodcock–Johnson III Passage Comprehension subtest (Woodcock et al. 2001). Among the participants, 30% reported having been diagnosed with a learning disability or having attended special education programs as children, 70% were native English speakers, and 69% had received public assistance at some point. During the intervention, participants completed an average of 71% of the 30 lessons.
Procedure
After giving informed consent, participants answered demographic questions about their age, gender, race/ethnicity, educational background, native language, age of English language acquisition, and whether they had ever received public assistance (e.g., food stamps).
Before the first day of the literacy classes, the Woodcock–Johnson III Passage Comprehension test (Woodcock et al. 2001) was administered to obtain participants' pre-test scores in reading comprehension. Over a period of 4 months, participants were offered 100 h of intervention. In the AutoTutor intervention, participants first received evidence-based, teacher-led classroom instruction on decoding or comprehending the material presented by the computer agent. Afterwards, participants engaged in solving word or reading comprehension problems. Finally, participants were given independent reading time.
The AutoTutor component of the intervention covered 30 curriculum lessons, and each lesson took 20–50 min to complete. One concern was that the teacher-led instruction could affect, either positively or negatively, participants' performance within AutoTutor. However, our previous analysis (Shi et al. 2017) showed that participants made no significant learning gains within any of the four theoretical levels over the course of the intervention. We therefore felt confident that participants' performance on the four theoretical levels reflected their pre-existing reading abilities. All participants received assessments of their reading abilities as compensation for taking part in the study, and the study was approved by the Institutional Review Boards of Georgia State University, the University of Memphis, Brock University, and the University of Toronto.
Measures
The version of AutoTutor presented to participants consisted of 30 lessons designed to help adults with low literacy improve their reading skills, especially the skills required for the comprehension of text. The system is adaptive in the sense that many lessons have easy, medium, and difficult materials (words and texts), with difficulty measured by Coh-Metrix (Graesser et al. 2014b), and students are assigned materials at different levels based on their previous responses. Coh-Metrix is a computer tool that analyzes many different linguistic features of words, sentences, and multi-sentence texts. For the texts in the CSAL lessons, for example, Coh-Metrix computes text formality, a composite score that reflects the overall difficulty level of a text.
Within most of the AutoTutor lessons, students first received materials at a medium level of difficulty and answered 8–12 questions embedded in conversations with the computer agents. Depending on the student's performance on these questions, the student next received either easier or harder material: higher performance on the medium materials led to more difficult material, whereas lower performance led to easier material (a minimal sketch of this branching logic appears below). Because writing is generally problematic for adults with low literacy skills (Olney et al. 2017), the interactions in AutoTutor for CSAL are largely point-and-click multiple-choice questions or drag-and-drop activities. However, it is the conversational component, scaffolded in AutoTutor through EMT-structured conversations, that is hypothesized to drive learning.
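To make the branching rule concrete, the following minimal sketch expresses it in R. The function name and the 0.7 cutoff are our own illustrative assumptions; the paper does not report the actual threshold AutoTutor uses.

```r
# Illustrative sketch of the adaptive branching rule; the threshold is
# assumed for illustration, not taken from the AutoTutor specification.
next_difficulty <- function(prop_correct_medium, threshold = 0.7) {
  if (prop_correct_medium >= threshold) {
    "hard"  # strong performance on medium items -> more difficult material
  } else {
    "easy"  # weak performance on medium items -> easier material
  }
}

next_difficulty(9 / 12)  # e.g., 9 of 12 medium questions correct -> "hard"
```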
Instead of a simple dialogue between a tutor agent and the human learner, the conversation in AutoTutor for CSAL involves two agents and one human learner, who participate in trialogues (Graesser et al. 2014a; see also Johnson et al. 2017). Trialogues offer several affordances appropriate for adult learners with low domain ability that are not available in dialogues. For example, in a trialogue, adult learners can learn vicariously by observing interactions between a student agent and a tutor agent, so that learning is possible even with minimal skills. A peer agent and the human student may share a misconception, which the peer agent can present to the tutor. The tutor agent, in turn, gives negative, corrective feedback to the peer agent, which prevents or reduces the negative motivational impact that such feedback might otherwise have on the human learner. Trialogues also afford competition between the adult learner and a peer agent in a game setting, which can be highly motivating (Graesser et al. 2016b). For example, Fig. 1 depicts a game scenario in AutoTutor in which the adult learner competes with the peer agent to correctly answer questions while a cumulative score is kept. The game is designed so that the adult learner always wins, in order to promote self-esteem and self-efficacy. The goal in Fig. 1, taken from a “Word” lesson, is to figure out the meaning of a target word that has multiple meanings, based on the sentence in which it appears. When the learner answers incorrectly or provides an incomplete answer, the EMT trialogue kicks in: the learner receives a hint from one of the two agents and gets another chance with somewhat more guidance.
The AutoTutor data collected in the log files for each participant included the participant ID, the number of times a lesson was attempted, the number of times a question within a lesson was attempted, the accuracy of answering a question on the first attempt (0 or 1), the time in seconds to answer a question in the lesson on the first attempt, the difficulty of the materials (easy, medium, hard), and the theoretical levels that a lesson addressed. Other measures were collected, but they are not relevant to the present article. The data were stored in a data management system on a central computer server that handled all of the participants' data, which were collected on personal computers over the web.
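For concreteness, a single row of these log data might be represented as follows. The column names are our own shorthand for the fields listed above, not the system's actual schema.

```r
# Hypothetical representation of one log record (field names are ours).
log_record <- data.frame(
  participant_id    = "P001",     # participant identifier (illustrative)
  lesson_attempts   = 1L,         # times the lesson was attempted
  question_attempts = 1L,         # times the question was attempted
  first_accuracy    = 1L,         # accuracy on first attempt (0 or 1)
  first_time_s      = 12.4,       # seconds to answer on first attempt
  difficulty        = "medium",   # easy, medium, or hard
  level             = "SM"        # theoretical level(s) the lesson addressed
)
```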
The present study analyzed a subset of the data that were collected. We extracted data for each participant for each of the questions within the 29 lessons that were completed; we eliminated the single lesson that focused on Syntax because there was only one lesson in that category. If a participant did not complete a particular lesson, all of the observations for that lesson were excluded from the analyses. We further reduced the dataset by considering accuracy and times only for questions pertaining to medium-level words or texts. All participants received, at minimum, questions at the medium difficulty level at the beginning of each lesson (and, for some lessons, for the complete lesson), which ensured that all participants had the opportunity to contribute data to the same set of questions. Thus, we used only data corresponding to questions at medium levels of difficulty (a sketch of these reduction steps appears below). On average, participants completed 23 lessons (range 2–29), and each lesson contained an average of 14.6 medium-level questions (range 6–30).
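Assuming a long-format data frame `d` with one row per participant-question pair and the columns sketched earlier, the reduction steps could look roughly like this; the `lesson` and `lesson_completed` columns are assumed for illustration.

```r
# Sketch of the data reduction under assumed column names:
# `lesson` holds the lesson name and `lesson_completed` flags whether
# that participant finished that lesson.
d <- subset(d, lesson != "Syntax")       # drop the single Syntax lesson
d <- subset(d, lesson_completed)         # drop incomplete lessons per participant
d <- subset(d, difficulty == "medium")   # keep medium-level questions only
```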
The medium level of the lessons varied in the number of texts the user interacted with and in the length of those texts. The medium level for lessons whose primary comprehension component focused on discourse (i.e., not Words) typically consisted of a single text, such as an informational article or a fictional story, around 250–300 words in length. The data from these lessons included the responses to a set of sequentially presented questions about the text. For instance, the medium-level text for the lesson Compare and Contrast discusses differences and similarities in the athletic careers of Kobe Bryant and Michael Jordan, and each question addressed some aspect of this single text. The medium level for lessons whose primary comprehension component was Words contained multiple stimuli, with a single question asked per stimulus. For example, in the medium level of the lesson Multiple Meaning Words, participants were presented a series of questions in a fixed order. For each question (as illustrated in Fig. 1), participants were shown a sentence (around 8–20 words in length) containing a word with multiple meanings (e.g., bank, check) and were asked to choose the correct definition of the word based on the context of the sentence. In this way, each data point for this lesson corresponded to a single response to a single stimulus.
When we examined the distributions of the resulting data, we found that response time per question was positively skewed, which is typical of response time data. To reduce bias from potential outliers, we truncated response times by replacing any observation more than three standard deviations above a participant's mean with the value at three standard deviations above that mean; this truncation was performed separately for each participant.
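This per-participant truncation can be expressed in a few lines of R; `cap_at_3sd` is our own helper name, and the column names follow the earlier sketches.

```r
# Cap response times at three SDs above each participant's own mean.
cap_at_3sd <- function(x) pmin(x, mean(x) + 3 * sd(x))

# ave() applies the capping function within each participant's data.
d$first_time_s <- ave(d$first_time_s, d$participant_id, FUN = cap_at_3sd)
```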
We defined accuracy as the score (1 as correct, 0 as incorrect) the participant received on the first attempt on each question of a lesson.
We defined time on a question as the number of seconds it took a participant to answer a particular question, from the onset of the question to the participant's click on an answer. Time was assumed to be a relevant indicator of reading proficiency. Participants were unaware that the time they spent answering questions would be assessed.
Each lesson tapped 1–3 of the four theoretical levels, i.e., Word, Textbase, Situation Model, and Rhetorical Structure. We assigned a measure of the relevance of each of the four theoretical components to each of the lessons, defining the relevance of a theoretical level for a lesson as the extent to which the level was tapped in the lesson. The assigned codes were primary, secondary, tertiary, or no relevance of a component to a lesson. We quantified these orderings so that components with primary relevance for a lesson received a value of 1.00, secondary relevance a value of 0.67, tertiary relevance a value of 0.33, and no relevance a value of 0.00. Table 1 shows the 29 AutoTutor lessons and the relevance of each theoretical component for each lesson. The columns labeled W, TB, SM, and RS designate the relevance measure for Word, Textbase, Situation Model, and Rhetorical Structure, respectively, for each lesson. The Levels column summarizes this information, listing the components that are relevant to each lesson in order of relevance (i.e., the first component listed is the most relevant). For example, Stories 1 addresses aspects of comprehension primarily at the level of the Situation Model (1.00), then Textbase (0.67), and lastly Rhetorical Structure (0.33), but not Word (0.00).
Table 1 Relevance of comprehension components, average accuracy, and average time spent per question for each of the 29 AutoTutor lessons
We also categorized each question into one of the four theoretical levels. The category for a particular question within a particular lesson was simply the primary theoretical level that characterized the lesson (Table 1). For example, Table 1 shows that questions in Compare and Contrast belonged to the category RS, since this lesson primarily focused on aspects of Rhetorical Structure. We then defined performance for each theoretical level as the average accuracy on questions within that theoretical level (see the sketch below).
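The relevance coding and the per-level performance scores described in the last two paragraphs can be computed as follows, under the same assumed column names; `relevance_value` and `perf` are our own names for illustration.

```r
# Numeric coding of relevance, following the scheme described above.
relevance_value <- c(primary = 1.00, secondary = 0.67,
                     tertiary = 0.33, none = 0.00)

# Performance per theoretical level: mean first-attempt accuracy over
# questions, where each question's level is its lesson's primary level.
perf <- aggregate(first_accuracy ~ participant_id + level,
                  data = d, FUN = mean)
```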
Our measure of a participant's reading grade level was the Woodcock–Johnson III Passage Comprehension subtest (Woodcock et al. 2001), which had been administered prior to the AutoTutor intervention. This comprehension subtest is a complex, conceptually driven processing task that measures the ability to construct mental representations of text during reading. In the test, participants silently read passages and filled in a missing word. The reliability is 0.83 for ages 5–19 and 0.88 for adults (McGrew and Woodcock 2006).
Data analyses
To compare differences in accuracy and time among the four theoretical levels, we first computed descriptive statistics on the accuracy and time data for the four theoretical components. Any trends we observed in the descriptive analysis were further investigated using mixed-effects models (Bates et al. 2014). We used a logistic mixed-effects model to predict question accuracy (1: correct, 0: incorrect) and a linear mixed-effects model to predict time spent on a question (in seconds). Item (question) was the unit of analysis for both models. Our rationale for using mixed-effects models instead of an ANOVA was to avoid the language-as-fixed-effect fallacy, or more properly, the stimuli-as-fixed-effect fallacy (Baayen et al. 2008; Clark 1973). The AutoTutor curriculum contains different lessons that address different comprehension levels, as well as different questions within each lesson. It is not appropriate to assume independence among the observed outcomes (accuracy and time), since participants completed multiple questions within multiple lessons for a given comprehension level. In addition, variability in accuracy (and/or time) was confounded with variability among subjects (participants), lessons, and items (questions) for each comprehension level (fixed effect). Thus, to test whether accuracy (and/or time) differed among the four theoretical levels of comprehension (fixed effect), subject (participant), item (question), and lesson were entered into the mixed-effects models as random intercepts. We also included random subject (participant) slopes on the theoretical levels, as well as random effects for questions nested within lessons, because participants' performance might vary across theoretical levels and questions were designed to be nested within lessons. To confirm the results of the mixed-effects models, a follow-up correlational analysis was performed between the four continuous relevance measures of the theoretical levels and adult learners' mean accuracy and average time per question (in seconds) on the 29 lessons (Table 1). This approach allowed us to determine whether adult learners' accuracy and time measures on the 29 lessons were correlated with the reading comprehension components on which the lessons focused.
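In lme4 syntax (Bates et al. 2014), the two models described above could be specified roughly as follows. The variable names are the ones assumed in the earlier sketches, and the random-effects structure is our reading of the description rather than the authors' verbatim code.

```r
library(lme4)

# Logistic mixed-effects model for first-attempt accuracy (0/1), with the
# theoretical level as the fixed effect, by-participant random intercepts
# and slopes, and random intercepts for questions nested within lessons.
m_acc <- glmer(first_accuracy ~ level + (1 + level | participant_id) +
                 (1 | lesson / question_id),
               data = d, family = binomial)

# Linear mixed-effects model for time per question (seconds), with the
# same random-effects structure.
m_time <- lmer(first_time_s ~ level + (1 + level | participant_id) +
                 (1 | lesson / question_id),
               data = d)
```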
Analyses were also conducted on the relationship between performance on the four theoretical levels and reading grade level. We used multiple linear regression to predict reading grade level from accuracy on each of the four theoretical levels. Prior to this, we computed the correlation matrix for these variables and tested for potential multicollinearity among the performance measures of the four theoretical levels. We used the R programming language (R Core Team 2013) to carry out all aspects of our data analysis.
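The regression and the multicollinearity check might look like this in R. Here `d_subj` is a hypothetical participant-level data frame with one row per participant, holding mean accuracy on each theoretical level (W, TB, SM, RS) and the Woodcock–Johnson grade level; the variance inflation factors assume the car package is available.

```r
# Correlation matrix among per-participant accuracies on the four levels.
cor(d_subj[, c("W", "TB", "SM", "RS")])

# Multiple linear regression predicting reading grade level.
fit <- lm(grade_level ~ W + TB + SM + RS, data = d_subj)
summary(fit)

# Variance inflation factors as a multicollinearity check
# (assumes the car package is installed).
car::vif(fit)
```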