Introduction

In recent years there has been a proliferation of automated scoring systems to analyze written text, including student-generated responses (Magliano and Graesser 2012; Shermis et al. 2016). In particular, automated scoring has been used to assess the writing ability of students who vary in age and language proficiency (Crossley et al. 2016; Weigle 2013; Wilson et al. 2016). Because of its precision and consistency in analyzing linguistic and structural aspects of writing (Deane and Quinlan 2010), automated scoring has the potential to contribute to an understanding of the instructional needs of poor writers. One segment of this population is adults who have completed secondary education with low skills but have nevertheless been admitted to postsecondary institutions and aspire to earn a college degree (MacArthur et al. 2016). This is a large population with low literacy skills, many of whom attend college developmental (also known as remedial) courses (Bailey et al. 2010).

Automated essay scoring appears to perform well for the purpose of assessing readiness for college writing demands (Elliot et al. 2012; Klobucar et al. 2013; Ramineni 2013). However, studies using automated scoring to understand the specific nature and pattern of adults’ writing have focused on college students who are average writers or English language learners (e.g., Crossley et al. 2012; McNamara et al. 2013; Reilly et al. 2014; Zhang 2015), while students explicitly identified as low-skilled have been overlooked. This paper reports an analysis of the writing of low-skilled college students, utilizing both automated and human scoring.

Our initial aim was to determine whether linguistic variables analyzed using the Coh-Metrix automated scoring system (Graesser et al. 2004), previously found to predict human-scored writing quality in a sample of average-performing college undergraduates (McNamara et al. 2010), also held for our lower-skilled group. To do this, we analyzed a corpus of writing samples gathered in a larger study (Perin et al. 2015). McNamara et al. (2010) ran all of the Coh-Metrix measures and identified the three strongest predictors of human-scored writing quality. We analyzed our corpus using these three variables but did not find them to be statistically significant predictors. Consequently, we expanded the analysis to repeat the McNamara et al. (2010) procedure of running all of the Coh-Metrix variables in the five linguistic classes used in the prior study.

Automated Scoring to Supplement Human Writing Evaluation

Automated scoring systems have for the most part been developed to ease the human burden of scoring students’ writing, and, to validate them, major effort has been expended in attempts to predict expert human ratings from machine-generated scores (Bridgeman 2013). When used in the classroom, machine-based writing assessment can free the teacher from the laborious process of evaluating students’ writing. In consequence, more writing can be assigned, resulting in more instruction and practice for students (Shermis et al. 2016). Another major purpose of using automated scoring is to understand reading comprehension and writing processes through the lens of linguistic variables and text difficulty (Abdi et al. 2016; McNamara et al. 2014).

Objections to automated scoring of writing have been expressed (Deane 2013; Magliano and Graesser 2012; Perelman 2013; Shermis et al. 2016; Winerip 2012, April 22). Critics argue that some scores may be technically flawed, which can lead to misleading feedback to teachers and students. Further, automated scoring systems cannot yet interpret the meaning of a piece of writing, identify off-topic content, or determine whether it is well argued. Moreover, the methodology of studies concluding that automated scores are just as accurate as human scores (Shermis and Hamner 2013) has been questioned (Perelman 2013). In addition, “signaling effects” (Shermis et al. 2016, p. 405) have been noted by which a test-taker’s performance may be altered through knowledge that the writing will be evaluated by a machine rather than a person. Designers of a Massive Open Online Course (MOOC) noted potential signaling problems: “AES [i.e. automated essay scoring] largely eliminates human readers, whereas … what student writers need and desire above all else is a respectful reader who will attend to their writing with care and respond to it with understanding of its aims.” (Comer and White 2016, p. 327).

Balancing the difficulties of automated scoring systems against their compelling advantages of speed and volume, these systems may be best used in conjunction with, rather than as a replacement for, human scoring (Ramineni 2013). It should also be kept in mind that human scoring is hardly without its problems, and may suffer from low interrater reliability (Bridgeman 2013).

Postsecondary Writing Skills

Comparisons of literacy skills expected upon exit from secondary education and those required for postsecondary learning reveal considerable misalignment (Acker 2008; Williamson 2008), which may help explain why many high school graduates are underprepared for college writing demands (National Center for Education Statistics 2012). Approximately 68 % of colleges in the United States offer developmental writing courses, and 14 % of entering students enroll in them, with a larger proportion (approximately 23 %) in community colleges (Parsad and Lewis 2003).

In postsecondary courses across the subject areas, students are expected to be able to respond appropriately to an assigned prompt. To write effectively at this level, the student needs to be sensitive to the informational needs of readers and produce text relevant to a given purpose. Concepts should be discussed coherently and accurately, and the writing should conform to accepted conventions of spelling and grammar (Fallahi 2012; MacArthur et al. 2015). Writing is frequently assigned in subject-area classrooms, where students are expected to summarize, synthesize, analyze and respond to information taught in a course, and adduce arguments for a position (O’Neill et al. 2012). Two types of writing have long featured prominently in subject-area assignments at the college level: persuasive writing and written summarization (Bridgeman and Carlson 1984; Hale et al. 1996; Wolfe 2011). Both are often based on sources, demonstrating close connections between writing and reading comprehension processes (Shanahan 2016). Although essay writing and summarization can be considered two different genres, they overlap considerably in the sense that essays often contain summarized information. Although elementary and secondary teachers report assigning summary-writing, college instructors rarely do (Burstein et al. 2014), which may be because summarization becomes more of an invisible skill, embedded in frequently-assigned tasks such as proposal writing.

Persuasive Writing

In a persuasive essay, the writer states and defends an opinion on an issue. At the more advanced level of argumentative writing, an opposing position is acknowledged and rebutted, leading to a reasoned conclusion (De La Paz et al. 2012; Hillocks 2011; Newell et al. 2011). The ability to write a persuasive or argumentative essay is more than just writing; it is a demonstration of critical thinking (Ferretti et al. 2007). To write a persuasive essay the writer must first recognize that there is an issue to argue about and then, through a series of steps, suggest a resolution to the problem (Golder and Coirier 1994). In the West, the structure of argumentation consists of the statement of a position, or claim; reasons for the position; an explanation and elaboration of the reasons; anticipation of alternative positions, or counterargument; rebuttal of the counterargument; and a conclusion (Ferretti et al. 2009; Nussbaum and Schraw 2007). Students with writing difficulties tend to omit mention of the counterargument (Ferretti et al. 2000) because of a “myside bias” (Wolfe et al. 2009). However, when writers are prompted, counterarguments do appear (De La Paz 2005; Ferretti et al. 2009).

Written Summarization

A summary is defined as the condensation of information to its key ideas (A. L. Brown and Day 1983; Westby et al. 2010). Summarizing information from text requires both reading comprehension and writing skills (Mateos et al. 2008; Shanahan 2016), and in fact summary-writing is often used as a measure of reading comprehension (Li and Kirby 2016; Mason et al. 2013). Based on analysis of writing samples and think-aloud protocols, an early study (A. L. Brown and Day 1983) proposed that writing a summary involves several steps based on processing information in text. The key process is analyzing the material for importance. Upon reading a passage, the person intending to summarize it first eliminates information judged unimportant or redundant, then finds and paraphrases a topic sentence if one exists in the text, collapses lists into superordinate terms, and, finally, composes a topic sentence if one is not present in the source text.

Persuasive Essay Writing and Written Summarization of Low-Skilled Adults

The limited literature on the writing skills of low-skilled adults has all been based on human scoring (MacArthur and Lembo 2009; MacArthur and Philippakos 2012, 2013; MacArthur et al. 2015; Miller et al. 2015; Perin et al. 2013; Perin and Greenberg 1993; Perin et al. 2003, 2015). These studies have found weaknesses in all areas tested. Low-skilled college writers produce essays and summaries that are inappropriately short and characterized by errors of grammar, style and usage. When summarizing text, these individuals include only a small proportion of the main ideas from the source, and their persuasive essays often contain material not pertinent to an argument. Overall, studies have rated the quality of essays and summaries written by this population as low, although structured strategy intervention (MacArthur et al. 2015), online guided instruction (Miller et al. 2015), and repeated practice (Perin et al. 2013) result in improvement.

The availability of automated scoring presents an opportunity to broaden what is known about writing in this population through analysis of linguistic variables that would be extremely time-consuming to rate by hand. The additional knowledge about learners’ abilities could lead to the sharpening and differentiation of instruction to meet students’ needs (Holtzman et al. 2005). As a step towards this goal, the current study used the Coh-Metrix scoring engine to determine whether a previously-reported pattern of predictiveness of the writing quality of typically-achieving college undergraduates would also be found among lower-skilled students.

Coh-Metrix Automated Scoring

The Coh-Metrix automated scoring engine was designed to assess linguistic dimensions of text difficulty, especially as an alternative to traditional (and problematic) measures of readability (Graesser et al. 2004). The system, a compilation of linguistic measures drawn from pre-existing automated systems, centers on the construct of cohesion, a text characteristic comprising words, phrases and relations that appear explicitly and permit the mental construction of connections between ideas referred to in the text. Text cohesiveness thus permits coherence, a cognitive variable describing how a reader makes meaning of the text (Crossley et al. 2016; Graesser et al. 2004). Although initially used to analyze text difficulty in the context of reading comprehension, Coh-Metrix has also been applied to assessment of the quality of student writing, especially on the assumption that it should be comprehensible to a reader (Crossley et al. 2016; Crossley and McNamara 2009; McNamara et al. 2010). In its ability to rate linguistic aspects of writing, Coh-Metrix overlaps with other AES systems such as e-rater of the Educational Testing Service and Project Essay Grade of Measurement Incorporated (Burstein et al. 2013; Shermis et al. 2016). However, while many of these scoring engines are available through commercial vendors, Coh-Metrix is an open source product and, further, has been utilized in a large body of research (McNamara et al. 2014).

The assumption underlying the Coh-Metrix work is that effective expression of ideas in writing depends on those ideas being expressed in cohesive fashion. Therefore, writing assessed as being of high quality would be expected to score high on Coh-Metrix variables (McNamara et al. 2010). Coh-Metrix is able to rate input text on over 150 linguistic variables, controlling for text length (Crossley et al. 2012; McNamara et al. 2014). While writing is a complex act that calls on multiple forms of knowledge, when using AES it may be most feasible to start with linguistic variables since they are visible in a text (Crossley et al. 2011). Linguistic variables analyzed by Coh-Metrix are causal cohesion, or ratio of causal verbs to causal particles; connectives, or words and phrases that connect ideas and clauses; logical operators such as “or,” “and,” and “not;” lexical overlap, comprising noun, argument, stem and content word overlap; semantic coreferentiality, measured as weighted occurrences of words in sentences, paragraphs, or larger portions of text; anaphoric reference, or the use of pronouns to refer to previously-mentioned material; polysemy, which analyzes words used in a text for multiple meanings; hypernymy, which measures the specificity of a word in relation to the object it depicts; lexical diversity, or the number of different words written; word frequency, i.e. the frequency of occurrence of words in a 17.9 million word corpus of English text; word information (concreteness, familiarity, imageability, and meaningfulness); syntactic complexity, measured in terms of the mean number of words appearing before a main verb, the mean number of sentences and clauses, and the mean number of modifiers per noun phrase; syntactic similarity, or the frequency of use of any one syntactic construction; and basic text measures including the number of words written, number of paragraphs, and mean number of sentences per paragraph (Crossley et al. 2011; McNamara et al. 2010). All of these variables are seen as “cohesive cues” that would be expected to characterize student writing judged to be of high quality (McNamara et al. 2010, p. 62), develop over the school years (Crossley et al. 2011), and distinguish the writing of native versus non-native speakers of English (Crossley et al. 2016).
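To make a few of the simpler measures concrete, the sketch below computes rough analogues of logical operator incidence and the basic text counts (word count, sentence count, mean words per sentence) for a short passage. It is an illustration of what such indices quantify, not the Coh-Metrix implementation; the function name, the simplified operator list, and the sample text are ours, and Coh-Metrix itself relies on parsers, part-of-speech taggers, and reference corpora rather than the naive tokenization used here.

```python
import re

# Rough, illustrative analogues of a few Coh-Metrix-style indices.
# Simplifications for exposition only; Coh-Metrix uses syntactic parsers,
# reference corpora, and more careful tokenization.

LOGICAL_OPERATORS = {"and", "or", "not", "if", "then"}  # simplified set

def simple_indices(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)
    return {
        "word_count": n_words,
        "sentence_count": len(sentences),
        "mean_words_per_sentence": n_words / max(len(sentences), 1),
        # Coh-Metrix reports incidence scores per 1000 words
        "logical_operator_incidence": 1000 * sum(w in LOGICAL_OPERATORS for w in words) / max(n_words, 1),
    }

if __name__ == "__main__":
    sample = ("Teenagers are not simply badly behaved. If stress rises, then "
              "behavior suffers. Schools and families can help.")
    print(simple_indices(sample))
```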

In a study on which the current research is based, McNamara et al. (2010) analyzed a corpus of persuasive essays written by college undergraduates in order to determine whether machine-generated linguistic scores would predict human writing quality ratings. Five groups of Coh-Metrix variables comprising 53 linguistic variables (in version 2 of the automated engine) were hypothesized by the authors to be related to the quality of a written text: coreference; connectives; syntactic complexity; lexical diversity; and word frequency, concreteness and imageability. These variables were selected for analysis because of earlier work pointing to their relation to reading comprehension. A corpus of 120 persuasive essays written by adult students in a freshman composition class at Mississippi State University (MSU) was analyzed. MSU is a four-year college whose entry criteria appear typical of non-selective institutions (Hoxby and Turner 2015) in terms of high school grade point average, class standing, and test scores (see http://www.policies.msstate.edu/policypdfs/1229.pdf). This information is worth noting because data from this group are being compared to our current sample of community college (also known as two-year or junior college, see Cohen et al. 2013) developmental education students, i.e., low-skilled adults who, despite having completed secondary education, were underprepared for the college curriculum (Bailey et al. 2010).

The MSU essays were written on a variety of self-chosen essay topics; analysis showed no prompt effects. The writing was untimed, and experience- rather than source-based. The authors refer to the topics as “argumentative” (McNamara et al. 2010, p. 65) but it is noted that the prompts request an opinion without specifying a full argumentative structure, which would include counterargument and rebuttal (Ferretti et al. 2009; Hillocks 2011). The same approach was used with the current low-skilled sample. The corpus of 120 essays was scored by experienced, trained human raters using a holistic six-point persuasive quality rubric used to score the College Board’s Scholastic Aptitude Test (SAT). The rubric requires raters to consider point of view; critical thinking; use of examples, reasons and evidence; organization and focus; coherence and flow of ideas; skillful use of language; use of vocabulary; variety of sentence structure; and accuracy of grammar, usage and mechanics (McNamara et al. 2010, Appendix A). The corpus was divided through a median split into high- and low-proficiency essays. Mean essay length was 749 words for the high- and 700 words for the low-proficiency essays.

Discriminant analysis was used to predict group membership (high- versus low-proficiency). The analysis used 53 Coh-Metrix Version 2.0 indices organized in five classes, called coreference, connectives, syntactic complexity, lexical diversity, and word characteristics. Several measures were found reliably to predict group membership. From these significant predictors, the authors selected those with the three highest effect sizes: “number of words before the main verb,” within the syntactic complexity class; “measured textual lexical diversity (MTLD),” within the lexical diversity class; and “Celex logarithm frequency including all words,” within the word characteristics class. Subsequent stepwise regression showed that the lower quality essays displayed more high-frequency words, and the higher quality essays used a more diverse vocabulary and more complex syntax.

The Current Study

Methods

Participants

Participants were N = 211 students enrolled in two community colleges in a southern state. All were attending top- and intermediate-level developmental reading and writing courses one and two levels below the freshman composition level at which students in the McNamara et al. (2010) study were enrolled. The current sample was 54 % Black/African-American and 64 % female. All participants were proficient speakers of English although 25 % had a native language other than English. Mean age was 24.55 years (SD = 10.66), with 71 % aged 18–24 years. Mean reading comprehension score on the Nelson-Denny Reading Test (a timed multiple-choice measure, J. I. Brown et al. 1993) was at the 22nd percentile for high school seniors at the end of the school year, and mean score on the Woodcock-Johnson III Test of Writing Fluency (a timed measure of general sentence-writing ability requiring constructed responses based on stimulus words, Woodcock et al. 2001) was at the 27th percentile.

Measures

Text-based persuasive essay and summary writing were group-administered by graduate assistants in 30-min tasks. The sources for the writing were two articles from the newspaper USA Today. The articles were selected to correspond to topics taught in high-enrollment, introductory-level, college-credit general education courses with significant reading and writing requirements in the two colleges. College enrollment statistics showed the highest enrollments in such courses to be psychology and sociology. The participants had not yet taken the courses. Tables of contents of the introductory-level psychology and sociology textbooks used at the two colleges were examined for topics to inform a search for source text. The criteria for the selection of articles were relevance to topics listed in the tables of contents, word count, and readability level feasible for the participants. The two articles selected were on the psychology topic of stress experienced by teenagers and the sociology topic of intergenerational tensions in the workplace. The psychology topic was used for the persuasive essay, and the sociology topic was used for the written summarization task.

Several paragraphs were eliminated from each article in order to keep the task to 30 min. (A short assessment was a condition of data collection imposed by the colleges.) The deleted paragraphs presented examples to illustrate main points in the articles and did not add new meaning. Two experienced teachers read the reduced-length articles and verified that neither cohesion nor basic meaning had been lost as a result of the reduction.

Text characteristics are shown in Table 1. Word counts for the articles were 650 for the psychology text and 676 for the sociology text. Flesch–Kincaid readabilities, obtained from the Microsoft Word software program, were 10th–11th grade, and the texts measured at 1250–1340 Lexile levels, interpreted on the Lexile website http://www.lexile.com/about-lexile/grade-equivalent/grade-equivalent-chart/ as corresponding to an approximately 12th-grade (end-of-year) level. In the context of the Common Core State Standards, students at the end of secondary education who can comprehend text at a 1300 Lexile level are considered ready for college and career-related reading (National Governors’ Association and Council of Chief State School Officers 2010, Appendix A).

Table 1 Text Characteristics
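As a point of reference for the readability figures reported above, the sketch below shows the standard Flesch-Kincaid grade-level formula applied to raw text. The syllable counter is a crude vowel-group heuristic of our own, and Microsoft Word’s tokenization and syllabification may differ, so results should be treated as approximate; the sample passage is invented for illustration.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count vowel groups; real syllabification is more involved.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

if __name__ == "__main__":
    sample = ("Workers of different generations sometimes clash in the workplace. "
              "Supervisors can reduce tension by clarifying expectations.")
    print(round(flesch_kincaid_grade(sample), 1))
```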

Instructions for the persuasive essay directed students to read the newspaper article and then write their opinion on a controversy discussed in the article (whether or not bad behavior among teenagers could be attributed to stress). Length was not specified for the essay. For the summary, students were instructed to read the text and then summarize it in one or two paragraphs. Both writing samples were handwritten. On both tasks, students were told that they could mark the article in any way they wished and use a dictionary if desired, and they were asked to write in their own words. The writing samples were transcribed, correcting for spelling, capitalization, and punctuation in order to reduce bias in scoring for meaning (Graham 1999; MacArthur and Philippakos 2010; Olinghouse 2008). However, we did not correct grammar because of the threat of changing intended meanings in this corpus of generally poorly-written texts.

Each writing sample was scored for quality by trained research assistants who also had teaching experience. A seven-point holistic scale (based on MacArthur et al. 2015) was used for the essay and a 16-point analytic scale (based on Westby et al. 2010) for the summary. The holistic essay quality rubric covered the clarity of expression of ideas, the organization of the essay, the choice of words, the flow of language and variety of sentences written, and the use of grammar. Examples of two score points are shown in the Appendix. The analytic quality rubric used to evaluate the summaries consisted of four 4-point scales: (1) topic/key sentence, main idea; (2) text structure (logical flow of information); (3) gist, content (quantity, accuracy, and relevance); and (4) sentence structure, grammar, and vocabulary. The scores on the four scales were summed to create a summary quality score. Examples of score points are provided in the Appendix.

The assistants were trained and then practiced scoring training sets of 10 protocols obtained previously in pilot testing until they reached a criterion of 80 % exact agreement. Extensive training was needed because the writing samples were often difficult to understand. When training was complete, pairs of assistants scored the research set. Inter-rater reliability (Pearson product-moment correlation) was r = .85 for both the summary and the essay. Exact interscorer agreement was 77 % for the essay and 23 % for summary quality. We attribute the low level of exact agreement on summary quality to raters’ difficulty in following many of the writers’ expressions of ideas. Interscorer agreement within one point was 100 % for essay quality and 44 % for the summary. Agreement within two points on the summary rubric was 71 %.
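The agreement statistics reported above can be computed directly from paired ratings. The sketch below shows one way to do so; the ratings in the usage example are hypothetical and serve only to illustrate the calculations (Pearson r, exact agreement, and agreement within one or two points).

```python
import numpy as np

def interrater_stats(rater_a, rater_b) -> dict:
    # Paired quality ratings from two scorers for the same writing samples
    a, b = np.asarray(rater_a, dtype=float), np.asarray(rater_b, dtype=float)
    diffs = np.abs(a - b)
    return {
        "pearson_r": float(np.corrcoef(a, b)[0, 1]),
        "exact_agreement_pct": float(100 * np.mean(diffs == 0)),
        "within_one_point_pct": float(100 * np.mean(diffs <= 1)),
        "within_two_points_pct": float(100 * np.mean(diffs <= 2)),
    }

# Hypothetical ratings for illustration only
print(interrater_stats([3, 4, 2, 5, 3, 6], [3, 4, 3, 5, 2, 6]))
```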

Procedure

To investigate whether Coh-Metrix text indices had the ability to discriminate good and poor quality writing samples produced by community college developmental education students, a modified replication of the analysis of McNamara et al. (2010) was applied to the current persuasive essays and summaries. As the two types of writing had different scoring rubrics, each corpus was analyzed separately. The study was conducted in two parts. First, for each essay and summary, the three Coh-Metrix variables reported by McNamara et al. (2010) as the strongest predictors of human-scored writing quality (syntactic complexity, lexical diversity, and word frequency) were analyzed for the current dataset. Second, each essay and summary was re-analyzed using the full set of Coh-Metrix variables employed by McNamara et al. (2010).

The aim of the McNamara et al. (2010) study was to investigate the use of discriminant analysis to develop a model to predict writing quality from Coh-Metrix variables. Their first step was to create two groups, high and low proficiency, based on a median split of human scores. They then compared these groups using an ANOVA to identify, from the 53 Coh-Metrix variables, those that had the largest effect sizes between high and low quality writing. After identifying the variables, they ran correlations to establish that collinearity would not be a problem for the discriminant analysis. They developed and tested a discriminant model and then investigated the individual contributions of the three strongest linguistic indices using a stepwise regression model to predict the essay rating as a continuous variable. In our first analysis, we followed this procedure, but using only the three strongest variables identified by McNamara et al. (2010). In our second analysis, we followed their full procedure. We note that McNamara et al. (2010) used version 2.0 of Coh-Metrix, and we used version 3.0, which was the version available in the public domain at the time of our analysis. The five classes were the same in the two studies. In version 3.0 the classes were called Referential Cohesion, Lexical Diversity, Connectives, Syntactic Complexity, and Word Information and covered 52 of the 53 variables in version 2.0.
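The sketch below outlines this analytic sequence (median split, per-index ANOVA screening, collinearity check, discriminant analysis, and regression on the continuous ratings) for hypothetical input data. The function and parameter names are ours, and an ordinary least-squares regression on the retained indices stands in for the stepwise procedure used in the original analyses; it is a sketch of the workflow, not a reproduction of either study’s code.

```python
import pandas as pd
from scipy.stats import f_oneway
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LinearRegression

def replication_pipeline(indices: pd.DataFrame, human_scores: pd.Series, top_k: int = 3):
    """indices: one column per Coh-Metrix index, one row per writing sample;
    human_scores: continuous human quality ratings on the same samples."""
    # 1. Median split into low (0) and high (1) proficiency groups
    group = (human_scores > human_scores.median()).astype(int)

    # 2. One-way ANOVA per index; retain the indices with the largest F values
    f_vals = {col: f_oneway(indices.loc[group == 0, col],
                            indices.loc[group == 1, col]).statistic
              for col in indices.columns}
    selected = sorted(f_vals, key=f_vals.get, reverse=True)[:top_k]

    # 3. Collinearity check: flag retained pairs with |r| >= .70
    corr = indices[selected].corr().abs()
    collinear = [(a, b) for i, a in enumerate(selected) for b in selected[i + 1:]
                 if corr.loc[a, b] >= 0.70]

    # 4. Discriminant analysis predicting group membership from the indices
    lda = LinearDiscriminantAnalysis().fit(indices[selected], group)
    classification_accuracy = lda.score(indices[selected], group)

    # 5. Regression on the continuous ratings (stands in for stepwise regression)
    reg = LinearRegression().fit(indices[selected], human_scores)
    r_squared = reg.score(indices[selected], human_scores)

    return selected, collinear, classification_accuracy, r_squared
```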

Results

For both analyses, based on a median split of the human scores, the low-proficiency persuasive essays had holistic scores of 0–3 and the high proficiency essays had scores of 4–7. The low-proficiency written summaries had analytic scores of 0–8, and high proficiency written summaries scored 9–16.

Initial Three Measures

Persuasive Essay

As seen in Table 2, there was a statistically significant difference between the two groups in terms of their average score on the essay evaluations but no significant differences between the groups on the text indices.

Table 2 Initial Three Measures: Descriptive Statistics and ANOVA for Persuasive Essay

Collinearity between the indices was ruled out by correlations, which ranged from r = −.20 to .09. While there was a significant correlation between word frequency and lexical diversity, it was moderate (−.20) and below the .70 threshold for collinearity issues (Brace et al. 2012). Given the lack of statistically significant differences between the two groups on the text indices and the low number of high proficiency essays (only 26 out of 201, with a mean of 4.04, barely higher than the lowest score for the group), it is not surprising that the discriminant analysis was unable to distinguish between the two groups of essays based on the text indices. The discriminant function provided a one-group solution, placing all the essays in the low proficiency group (df = 3, n = 211, χ2 = 3.15, p = .37).

A stepwise regression was then performed to examine whether any of the three text indices were predictive of the human essay evaluation ratings using the continuous scores as opposed to the dichotomous split score employed for the discriminant analysis. However, the model was not significant, producing an R² of .01 (F(3, 207) = 0.727, p = .537), with no significant predictors.

Written Summary

As can be seen in Table 3, there was a statistically significant difference between the two groups on the summary evaluations, and the differences between groups on two of the text indices, lexical diversity and word frequency, approached significance.

Table 3 Initial Three Measures: Descriptive Statistics and ANOVA for Written Summary

As with the persuasive essays, there was no collinearity between measures on the written summaries; no correlation was above the threshold of .70 (correlations ranged from r = −.38 to .13). Given the lack of collinearity, a discriminant analysis was performed. This analysis did not reach a conventional level of significance (df = 3, n = 211, χ2 = 4.73, p = .19), and the two-group solution it produced had low accuracy, with only 55.8 % of the cases correctly classified.

A stepwise regression was performed to examine whether any of the three text indices were predictive of the human summary ratings using the continuous scores as opposed to the dichotomous split score employed for the discriminant analysis. While there was a significant model (F(3, 204) = 2.90, p < .05), it accounted for only 4 % of the variance, with lexical diversity being the only statistically significant predictor.

In summary, the first part of the study found that the three Coh-Metrix measures reported by McNamara et al. (2010) as the strongest predictors of human-scored writing quality in a sample of typically-achieving college undergraduates did not predict writing quality in the current sample of developmental education students.

Full Set of Measures

We proceeded to the second part of the study by analyzing all 52 Coh-Metrix measures from McNamara et al.’s (2010) five classes. Following the general procedure of the earlier study, we created a training set consisting of a randomly-selected 75 % of the cases in order to establish a discriminant model, and used the remaining 25 % to validate the model.
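A minimal sketch of this 75/25 split and holdout validation, using synthetic stand-in data, is shown below. Whether the original split was stratified or seeded is not reported, so the stratify and random_state settings are assumptions of ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in data: rows are writing samples, columns are retained
# Coh-Metrix indices; y marks low (0) versus high (1) proficiency.
rng = np.random.default_rng(0)
X = rng.normal(size=(211, 2))
y = (rng.random(211) > 0.5).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = LinearDiscriminantAnalysis().fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))
print("validation accuracy:", model.score(X_valid, y_valid))
```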

Persuasive Essay

We found that two of the measures, representing two Coh-Metrix classes, showed statistically significant differences between the high and low proficiency groups: “argument overlap in all sentences, binary, mean,” from the Referential Cohesion class, and “type-token ratio, all words,” from the Lexical Diversity class (see Table 4). In “argument overlap,” a noun, pronoun or noun phrase in one sentence refers to a noun, pronoun or noun phrase in another sentence (Graesser et al. 2004). The “type-token ratio” measure divides the number of unique words (types) in the writing sample by the total word count (tokens) as a measure of vocabulary diversity (Crossley and McNamara 2011). The correlation between these two variables, r = −.046, was not statistically significant, indicating that collinearity would not be a problem in the discriminant model.

Table 4 Full Set of Variables: Descriptive Statistics and ANOVA for Persuasive Essay
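As an illustration of the two indices just described, the sketch below computes the type-token ratio directly and a rough stand-in for binary argument overlap. Coh-Metrix identifies nouns and pronouns with a part-of-speech tagger; this simplified version, which is ours, instead treats any word outside a short function-word list as a candidate argument, so its values only approximate the actual index.

```python
import re
from itertools import combinations

# Minimal function-word list used only for this illustration
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in",
                  "on", "for", "with", "is", "are", "was", "were", "be"}

def type_token_ratio(text: str) -> float:
    # Unique words (types) divided by total words (tokens)
    words = re.findall(r"[A-Za-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0

def mean_binary_overlap(text: str) -> float:
    # Proportion of sentence pairs sharing at least one non-function word,
    # a crude proxy for "argument overlap in all sentences, binary, mean"
    sentences = [set(re.findall(r"[A-Za-z']+", s.lower())) - FUNCTION_WORDS
                 for s in re.split(r"[.!?]+", text) if s.strip()]
    pairs = list(combinations(sentences, 2))
    return sum(bool(s1 & s2) for s1, s2 in pairs) / len(pairs) if pairs else 0.0
```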

The discriminant function was significant (df = 2, n = 211, χ2 = 14.19, p < .01), predicting 67 % of the training group and 66 % of the validation group correctly (see Table 5 for the classification table). To better understand the accuracy of the model beyond the groups’ recall scores (the number of correct classifications divided by the sum of correct classifications and missed predictions), we calculated precision scores (the number of correct predictions divided by the sum of correct predictions and false positives) and F1 scores, the harmonic mean of the precision and recall scores (Crossley et al. 2012). The classification table (Table 5) and the various accuracy scores (Table 6) show that while the model adequately classified the low proficiency essays, there were many false positives in the high proficiency predictions, lowering both the precision and the F1 scores for this group.

Table 5 Full Set of Variables for Persuasive Essay: Predicted Text Type Versus Actual Text Type
Table 6 Full Set of Variables for Persuasive Essay: Recall, Precision and F1 scores for Discriminant Analysis
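For clarity, the sketch below shows how the recall, precision, and F1 values reported in Table 6 are computed from the cells of a classification table such as Table 5; the counts in the usage example are hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    # recall: correct predictions for a group / all actual members of that group
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # precision: correct predictions for a group / all predictions of that group
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

# Hypothetical counts for one group in a two-group classification table
print(classification_metrics(tp=70, fp=30, fn=25))
```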

A stepwise regression was performed to determine whether either of the two text indices was predictive of the human essay ratings using the continuous scores, as opposed to the dichotomous split score employed for the discriminant analysis. The model was statistically significant (F(2, 208) = 25.83, p < .01), accounting for 20 % of the variance, with both predictors making a significant contribution (see Table 7).

Table 7 Full Set of Variables: Regression Analysis to Predict Persuasive Essay Ratings

Written Summary

We found that eight of the measures, representing three Coh-Metrix classes, showed statistically significant differences between the high and low proficiency groups: one from the Referential Cohesion class, two from the Lexical Diversity class, and three from the Word Information class. Selecting the largest effect size in each class provided the following three variables: “content word overlap, adjacent sentences, proportional, standard deviation,” from the Referential Cohesion class; “VOCD, all words,” from the Lexical Diversity class; and “familiarity for content words, mean,” from the Word Information class (see Table 8).

Table 8 Full Set of Variables: Descriptive Statistics and ANOVA for Written Summary

“Content word overlap” measures the frequency with which content words, i.e., nouns, verbs (except auxiliary verbs), adjectives, and adverbs, are shared across sentences (Crossley and McNamara 2011). “VOCD” is a measure of the richness of vocabulary usage, indicated by the number of different words and the number of types of words in a language sample (McCarthy and Jarvis 2007). The “familiarity for content words” measure refers to the frequency of appearance of a content word in a large, representative sample of print (Graesser et al. 2004). The correlations between the variables ranged from r = −.09 to .07. None were statistically significant and all were below the .70 threshold, indicating that collinearity would not be a problem in the discriminant model.

The discriminant function was significant (df = 3, n = 208, χ2 = 34.13, p < .001), predicting 70 % of the training group and 68 % of the validation group correctly (see Table 9 for the classification table). Recall, precision, and F1 scores were calculated (see Table 10) to better understand the accuracy of the model (see above for definitions of the scores). The model’s relatively high and stable scores across both the training and validation sets indicated that it is a fairly stable predictive model.

Table 9 Full Set of Variables for Written Summary: Predicted Text Type Versus Actual Text Type
Table 10 Full Set of Variables for Written Summary: Recall, Precision and F1 scores for Discriminant Analysis

A stepwise regression was performed to examine whether any of the three text indices were predictive of the human summary evaluation ratings using the continuous scores as opposed to the dichotomous split score employed for the discriminant analysis. The model was statistically significant (F(3, 205) = 19.14, p < .001), accounting for 22 % of the variance, with all three predictors making a statistically significant contribution (see Table 11).

Table 11 Full Set of Variables: Regression Analysis to Predict Ratings of Written Summary

In summary, in the second part of the study we found that, for our sample of college developmental education students, two Coh-Metrix variables predicted human holistic scores on persuasive essays, and three predicted human analytic scores on written summaries. There was no overlap between the current measures and those found by McNamara et al. (2010) to predict the human persuasive essay scores of typically-achieving college students.

Discussion

Using persuasive essays and written summaries from low-skilled, college developmental education students, this study failed to replicate relationships between Coh-Metrix automated linguistic scores and human quality scores reported by McNamara et al. (2010) for typically-achieving college undergraduates. When the study expanded to an analysis of all Coh-Metrix variables analyzed by McNamara et al. (2010), ten measures were found to predict human-scored writing quality across the two types of writing. Two Coh-Metrix measures reliably predicted human persuasive essay quality scores and eight predicted human summary quality scores.

The first part of the study asked whether three Coh-Metrix text indices found by McNamara et al. (2010) to predict human-scored writing quality of persuasive essays written by average-performing college students (syntactic complexity, measured as number of words before the main verb; lexical diversity, measured as measured textual lexical diversity (MTLD); and word frequency, measured as Celex logarithm frequency including all words) were similarly predictive in a sample of low-skilled adults. Specifically, we wished to learn whether these linguistic variables would be able to predict group status (low or high proficiency) for human ratings of both essays and summaries written by a group of college students attending developmental reading and writing courses. It was found that these particular text indices did not differ by group or predict group membership in the low-skilled sample, and that in predicting the rating score as a continuous variable, only lexical diversity made a significant albeit minor contribution, accounting for just 3.6 % of the variance.

A comparison of the means and standard deviations on the three Coh-Metrix text indices for the college students in the McNamara et al. (2010) study and the developmental college students in the current sample shows similar results for the essay scores, with the summary scores being lower (see Table 12).

Table 12 Means and Standard Deviations for the McNamara et al. (2010) and Current Studies

However, when examining the standard deviations, it can be seen that there is a much wider distribution in the low-proficiency groups, with many of the standard deviations being more than twice as large. Large standard deviations have also been reported in previous work on the writing of low-skilled students across the age span (Carretti et al. 2016; Graham et al. 2014; MacArthur and Philippakos 2013; Perin et al. 2003, 2013). While this problem has ramifications in terms of the lack of statistical significance in the current analysis (larger standard deviations create more overlap between groups and make them less distinguishable from each other), the more important issue is why there is so much variation in the essays and summaries of the low-proficiency writers.

In essence, we propose that the low-proficiency writers in this population, as judged by human scorers, found more diverse ways to be poor writers, as evaluated by the Coh-Metrix text indices, than the poor writers in the McNamara et al. (2010) study. This finding points to challenges in developing targeted interventions, as the issues that define their poor writing are wide-ranging and idiosyncratic. Current practice in developmental education is to group this highly varied population in single classrooms based on standardized placement scores that are not designed to be diagnostic of specific literacy needs (Hughes and Scott-Clayton 2011), whereas these students’ differing patterns of skills seem to require more individualized approaches. Structured strategy instruction in writing that permits small-group and individualized assistance (Kiuhara et al. 2012), which has been found effective in experimental work in college developmental education (MacArthur et al. 2015), seems a promising way to reform current practice.

The current exploration highlights a possible utility for Coh-Metrix, or other automated scoring systems, to support developmental writing students at the college level. Given the heterogeneity of the writing skills of students who will inevitably be grouped for instruction under current practices, perhaps an automated scoring system could be leveraged by instructors to identify specific writing problems. Focusing only on difficulties identified by the automated scoring engine rather than a wider range of skills, some of which students may already have mastered, may lead to more efficient use of class time. Such diagnostic use of automated scores could potentially free up instructors to focus more on content and meaning. However, further research would be needed to test the usefulness of this idea and identify the scoring system and indices that would best inform such differentiated instruction.

In fact, instruction in writing in the content areas prioritizes aspects of meaning and interpretation not (yet) amenable to automated scoring. For example, science and social studies instruction often focuses on critical thinking about controversial issues (Sampson et al. 2011; Wissinger and De La Paz 2015). It may not be feasible or considered useful by content-area educators to focus on the types of linguistic variables generated in automated scoring that predict human-scored writing quality, such as the global cohesion variable of adjacent overlap of function words across paragraphs, a predictor of quality reported by Crossley et al. (2016). Given limitations on instructional time, even more questionable might be to spend time teaching students not to repeat function words, a negative predictor of human-scored writing quality (Crossley et al. 2016). While automated scoring may be used to generate indices of writing quality that may help identify good and poor writers, the specific areas scored, as they stand now, may not serve as diagnostics for instructional planning.

The methodology in the current study departed in three ways from that of McNamara et al. (2010). First, the participants had different levels of writing skill (developmental versus average college level). Second, the persuasive writing prompts were different. Third, the earlier study called for experience-based writing whereas the current investigation used text-based tasks. Writing from text versus experience may be an especially important factor in comparing the predictiveness across the studies. McNamara et al.’s (2010) participants wrote from experience, and specific content was not an element considered in scoring. In contrast, the current group was directed to refer to a text they had read when responding, and scoring took inclusion and relevance of content into account. Since the scoring of experience-based writing focuses primarily on style and structure of the response while assessment of text-based writing also includes judgment of content (Magliano and Graesser 2012), the holistic scores in the two studies may center on different constructs. Further, the expectation that the same predictor variables would be operational in persuasive essays (as in the McNamara et al. 2010 study) and summaries needs further exploration.

Since the Coh-Metrix measures used in the first part of the present study did not result in a two-group solution (all of the writers were essentially in one group), we proceeded, in the second part of our study, to analyze the full set of measures in McNamara et al.’s (2010) five linguistic classes. Running this larger analysis, we found that, across our two prompts, ten automated measures predicted human-scored group membership. Several themes emerge when considering this finding.

First, it is interesting that different automated measures predicted human scores for the two prompts (persuasive essay and written summary). This finding supports other work suggesting that individuals’ writing skill, when measured on the same construct using different prompts, is not necessarily stable (Danzak 2011; Olinghouse and Leaird 2009; Olinghouse and Wilson 2013). These findings suggest that future investigations of the relation between automated and human scores should measure performance within the same samples on different prompts and genres.

Second, there are reasons to be cautious about drawing conclusions about the relation between the automated and human scores when, as in the current case, very few of a large set of variables were found to be predictive. Of the 104 Coh-Metrix variables tested (52 variables tested for each writing prompt) in the second part of our study, only two reliably differentiated persuasive essay quality and only eight differentiated written summary quality. In fact, given the large number of variables tested, these may be chance findings. Further, trends towards any particular pattern of results were not evident in the current dataset, as evidenced by low F values (fewer than 25 % of the F values were larger than 2, which corresponds to a p value of approximately .2) and large standard deviations. The large degree of variation in the data, when the Coh-Metrix variables were run from all of McNamara et al.’s (2010) five classes, supports the suggestion made above that there may be many ways to write poorly.

In fact, the investigation was atheoretical in regard to writing ability, as our concern was simply whether the automated measures were able to predict the human scores. However, in an investigation guided by a theory of the relation between linguistic measures and overall writing quality, the number of variables found to be reliable predictors would not matter, because theory-based expectations about variables that might discriminate high from low proficiency writers could be tested. In this case, even if all of the theoretically relevant, individual automated measures were not statistically significant predictors, the right combination could produce an effective discriminant model to distinguish the characteristics of writing of individuals who vary in proficiency.

Another suggestion for future directions in research on the relation between human and automated scoring is to compare outcomes for three distinct populations varying in writing quality: highly proficient writers such as graduate students; average-achieving writers such as the typically-performing undergraduates studied by McNamara et al. (2010); and low-proficiency college developmental education writers as in the current study.

Further, the alignment of human and automated scores should be explored, as developmental evidence on college students may show different growth patterns on human- and machine-scored linguistic and proficiency variables (Crossley et al. 2016). Overall, it will be important, as research in writing skills using automated scoring goes forward, to verify the accuracy of human scores and the theoretical relevance of automated linguistic scores, and consider how each type of evaluation might most usefully inform classroom instruction. In this context, it will be important to compare outcomes on the same populations for different automated scoring engines that are becoming available. Findings for Coh-Metrix have been reported in a large number of research publications but these are early days in this field, and the relative utility of the various available scoring engines for improving educational outcomes remains to be demonstrated.