Introduction

Educational assessment is a systematic method of gathering information or artifacts about a learner and learning processes in order to draw inferences about a person’s dispositions (E. Baker, Chung, & Cai, 2016). Various forms of assessment exist, including single- and multiple-choice items, selection/association, hot spots, knowledge mapping, or visual identification. However, using natural language (e.g., written prose or essays) is regarded as the most useful and valid technique for assessing higher-order learning processes and learning outcomes (Flower & Hayes, 1981). Essays are scholarly analytical or interpretative compositions with a specific focus on a phenomenon in question. Valenti, Neri, and Cucchiarelli (2003) as well as Zupanc and Bosnic (2015) note that written essays provide learners with the opportunity to demonstrate higher-order thinking skills and in-depth understanding of a subject matter. However, evaluating, grading, and providing feedback on written essays are time-consuming, labor-intensive, and possibly biased by an unfair human rater.

For more than 50 years, the concept of developing and implementing computer-based systems that support automated assessment of and feedback on written prose has been discussed (Page, 1966). Technology-enhanced assessment systems have enriched standard or paper-based assessment approaches, some of which hold much promise for supporting learning processes and learning outcomes (Webb, Gibson, & Forkosh-Baruch, 2013; Webb & Ifenthaler, 2018). While much effort in institutional and national systems is focused on harnessing the power of technology-enhanced assessment approaches in order to reduce costs and increase efficiency (Bennett, 2015), a range of different technology-enhanced assessment scenarios have been the focus of educational research and development, albeit often at a small scale (Stödberg, 2012). For example, technology-enhanced assessments may involve a pedagogical agent providing feedback during a learning process (Johnson & Lester, 2016). Other scenarios of technology-enhanced assessment include analyses of a learner’s decisions and interactions during game-based learning (Bellotti, Kapralos, Lee, Moreno-Ger, & Berta, 2013; Kim & Ifenthaler, 2019), scaffolding for dynamic task selection including related feedback (Corbalan, Kester, & van Merriënboer, 2009), remote asynchronous expert feedback on collaborative problem-solving tasks (Rissanen et al., 2008), or semantically rich and personalized feedback as well as adaptive prompts for reflection through data-driven assessments (Ifenthaler & Greiff, 2021; Schumacher & Ifenthaler, 2021).

It is expected that such technology-enhanced assessment systems meet a number of specific requirements, such as (a) adaptability to different subject domains, (b) flexibility for experimental as well as learning and teaching settings, (c) management of huge amounts of data, (d) rapid analysis of complex and unstructured data, (e) immediate feedback for learners and educators, as well as (f) generation of automated reports of results for educational decision-making.

Given the ongoing developments in computer technology, data analytics, and artificial intelligence, there have been advances in automated assessment systems, which may improve the feasibility, objectivity, reliability, and validity of the assessment of written prose as well as provide instant feedback during learning processes (Whitelock & Bektik, 2018). Accordingly, automated essay grading (AEG) systems, or automated essay scoring (AES) systems, are defined as a computer-based process of applying standardized measurements to open-ended or constructed-response text-based test items. Measurements of written text include observable components such as content, style, organization, mechanics, and so forth (Shermis, Burstein, Higgins, & Zechner, 2010). As a result, the AES system generates a single score or a detailed evaluation of predefined assessment features (Ifenthaler, 2016).

This chapter describes the evolution and features of automated scoring systems, discusses their limitations, and concludes with future directions for research and practice.

Synopsis of Automated Scoring Systems

The first widely known automated scoring system, Project Essay Grader (PEG), was conceptualized by Ellis Batten Page in the late 1960s (Page, 1966, 1968). PEG relies on proxy measures, such as average word length, essay length, and the number of certain punctuation marks, to determine the quality of an open-ended response item. Despite promising findings from research on PEG, acceptance and use of the system remained limited (Ajay, Tillett, & Page, 1973; Page, 1968). The advent of the Internet in the 1990s and related advances in hardware and software sparked further interest in designing and implementing AES systems. Developers primarily aimed to address concerns about time, cost, reliability, and generalizability in the assessment of writing. AES systems have been used as co-raters in large-scale standardized writing assessments since the late 1990s (e.g., e-rater by the Educational Testing Service). While initial systems focused on the English language, a wide variety of languages have been included in further developments, such as Arabic (Azmi, Al-Jouie, & Hussain, 2019), Bahasa Malay (Vantage Learning, 2002), Hebrew (Vantage Learning, 2001), German (Pirnay-Dummer & Ifenthaler, 2011), or Japanese (Kawate-Mierzejewska, 2003). More recent developments of AES systems utilize advanced machine learning approaches and elaborate natural language processing algorithms (Glavas, Ganesh, & Somasundaran, 2021).

For almost 60 years, different terms related to the automated assessment of written prose have been used mostly interchangeably. The most frequently used terms are automated essay scoring (AES) and automated essay grading (AEG); however, more recent research also uses the terms automated writing evaluation (AWE) and automated essay evaluation (AEE) (Zupanc & Bosnic, 2015). While the above-mentioned systems focus on written prose comprising several hundred words, a separate field has developed that focuses on short answers, referred to as automatic short answer grading (ASAG) (Burrows, Gurevych, & Stein, 2015).

Functions of Automated Scoring Systems

AES systems mimic human evaluation of written prose by using various methods of scoring, that is, statistics, machine learning, and natural language processing (NLP) techniques. Implemented features of AES systems vary widely, yet the systems are mostly trained with large sets of expert-rated sample open-ended assessment items in order to internalize the features that are relevant to human scoring. AES systems compare the features of new test items with those in the training sets, identify similarities between high- or low-scoring training responses and the new responses, and then apply the scoring information gained from the training sets to the new item responses (Ifenthaler, 2016).
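As a rough illustration of this train-then-score logic, the following minimal Python sketch fits a regression model on a few hand-crafted surface features extracted from expert-scored sample responses and then predicts a score for a new response. The feature set, the Ridge regressor, and the toy essays and scores are illustrative assumptions rather than the implementation of any particular AES system.

```python
# Minimal sketch of the train-then-score logic of an AES system (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge

def extract_features(essay: str) -> list:
    """Toy surface features: word count, average word length, sentence count."""
    words = essay.split()
    sentences = [s for s in essay.split(".") if s.strip()]
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    return [len(words), avg_word_len, len(sentences)]

# Expert-rated training responses (texts and human scores are invented placeholders).
training_essays = [
    "Photosynthesis converts light energy into chemical energy. Plants use chlorophyll.",
    "Plants make food. They are green.",
    "Photosynthesis is the process by which plants transform light, water, and carbon "
    "dioxide into glucose and oxygen, storing energy in chemical bonds.",
]
human_scores = [3.0, 1.0, 5.0]

# Learn how the surface features relate to the expert scores.
X_train = np.array([extract_features(e) for e in training_essays])
model = Ridge(alpha=1.0).fit(X_train, human_scores)

# Score a new, unseen response by applying the learned feature weights.
new_response = "Plants capture sunlight with chlorophyll and turn it into sugar."
predicted = model.predict(np.array([extract_features(new_response)]))[0]
print(f"Predicted score: {predicted:.2f}")
```

In a realistic setting, the training set would comprise hundreds or thousands of expert-rated responses and far richer linguistic attributes than these three surface counts.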

The underlying methodology of AES systems varies; however, recent research mainly focuses on natural language processing approaches (Glavas et al., 2021). AES systems focusing on content use Latent Semantic Analysis (LSA), which assumes that terms or words with similar meaning occur in similar parts of written text (Wild, 2016). Other content-related approaches include Pattern Matching Techniques (PMT). The idea of depicting semantic structures, which include concepts and the relations between them, has its source in two fields: semantics (especially propositional logic) and linguistics. Semantically oriented approaches include ontologies and semantic networks (Pirnay-Dummer, Ifenthaler, & Seel, 2012). A semantic network represents information as a collection of objects (nodes) and binary associations (directed labeled edges), the former standing for individuals (or concepts of some sort) and the latter standing for binary relations over these. Accordingly, a representation of the knowledge in a written text by means of a semantic network corresponds to a graphical representation where each node denotes an object or concept, and each labeled edge denotes one of the relations used in the knowledge representation. Despite the differences between semantic networks, three types of edges are usually contained in all network representation schemas (Pirnay-Dummer et al., 2012): (a) generalization, which connects a concept with a more general one; the generalization relation between concepts is a partial order and organizes concepts into a hierarchy; (b) individualization, which connects an individual (token) with its generic type; and (c) aggregation, which connects an object with its attributes (parts, functions) (e.g., wings – part of – bird). Another method of organizing semantic networks is partitioning, which involves grouping objects and elements or relations into partitions that are organized hierarchically, so that if partition A is below partition B, everything visible or present in B is also visible in A unless otherwise specified (Hartley & Barnden, 1997).
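The LSA idea can be sketched minimally as follows, assuming scikit-learn as the toolkit: a small corpus is mapped into a low-dimensional latent semantic space via truncated SVD of a TF-IDF matrix, and each student response is compared with a reference text by cosine similarity in that space. The reference text, the responses, and the number of latent components are illustrative assumptions; a real system would fit the model on a much larger corpus.

```python
# Minimal LSA-style content comparison (illustrative sketch, not a production AES component).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

reference = "Photosynthesis transforms light energy into chemical energy stored in glucose."
responses = [
    "Plants convert sunlight into chemical energy in the form of sugar.",   # on topic
    "The French Revolution began in 1789 and reshaped European politics.",  # off topic
]

corpus = [reference] + responses
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

# Project the TF-IDF vectors into a two-dimensional latent semantic space.
latent = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Cosine similarity between each response and the reference in the latent space.
for text, vector in zip(responses, latent[1:]):
    similarity = cosine_similarity([latent[0]], [vector])[0, 0]
    print(f"{similarity:.2f}  {text}")
```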

From an information systems perspective, understood as a set of interrelated components that accumulate, process, store, and distribute information to support decision making, several preconditions and processes are required for a functioning AES system (Burrows et al., 2015; Pirnay-Dummer & Ifenthaler, 2010):

1. Assessment scenario: The assessment task, with a specific focus on written prose, needs to be designed and implemented. Written text is collected from learners and from experts (the latter serving as a reference for later evaluation).

2. Preparation: The written text may contain characters that could disturb the evaluation process. Thus, a specific character set is expected, and all other characters may be deleted. Tags may also be deleted, as may other metadata expected within each text.

3. Tokenizing: The prepared text is split into sentences and tokens. Tokens are words, punctuation marks, quotation marks, and so on. Tokenizing is somewhat language dependent, meaning that different tokenizing methods are required for different languages.

4. Tagging: There are different approaches and heuristics for tagging sentences and tokens. A combination of rule-based and corpus-based tagging seems most feasible when the subject domain of the content is unknown to the AES system. Tagging and the rules governing it constitute a complex field of linguistic methods (Brill, 1995).

5. Stemming: Specific assessment attributes may require that inflections of a word be treated as one (e.g., the singular and plural forms “door” and “doors”). Stemming reduces all words to their word stems (steps 3–5 are illustrated in the sketch following this list).

6. Analytics: Using further natural language processing (NLP) approaches, the prepared text is analyzed with regard to predefined assessment attributes (see below), resulting in models and statistics.

7. Prediction: Further algorithms produce scores or other output variables based on the analytics results.

8. Veracity: Based on available historical data or reference data, the analytics scores are compared in order to build trust in and establish the validity of the AES result.
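As a rough illustration of the tokenizing, tagging, and stemming steps (3–5), the following sketch uses the NLTK toolkit; the example sentences are invented, and a production pipeline would add language-specific resources and far more robust preparation.

```python
# Minimal sketch of tokenizing, part-of-speech tagging, and stemming with NLTK.
# Requires: pip install nltk, plus nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger") on first use.
import nltk
from nltk.stem import PorterStemmer

text = "The doors were opened. Students submitted their essays before the deadline."
stemmer = PorterStemmer()

for sentence in nltk.sent_tokenize(text):               # split prepared text into sentences
    tokens = nltk.word_tokenize(sentence)                # split sentences into word/punctuation tokens
    tagged = nltk.pos_tag(tokens)                         # rule-/corpus-based part-of-speech tags
    stems = [stemmer.stem(token) for token in tokens]     # reduce inflected forms to word stems
    print(tagged)
    print(stems)
```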

Common assessment attributes of AES systems have been identified by Zupanc and Bosnic (2017), including linguistic (lexical, grammar, mechanics), style, and content attributes. Among the 28 lexical attributes, frequencies of characters, words, and sentences are commonly used. More advanced lexical attributes include average sentence length, use of stopwords, variation in sentence length, or the variation of specific words. Other lexical attributes focus on readability or lexical diversity, utilizing specific measures such as the Gunning Fog index, nominal ratio, or type-token ratio (DuBay, 2007). Another 37 grammar attributes are frequently implemented, such as the number of grammar errors, the complexity of the sentence tree structure, or the use of prepositions and of forms of adjectives, adverbs, nouns, and verbs. A few attributes focus on mechanics, for example, the number of spelling errors, capitalization errors, or punctuation errors. Attributes that focus on content include similarities with source or reference texts or content-related patterns (Attali, 2011). Specific semantic attributes have been described as concept matching and proposition matching (Ifenthaler, 2014). Both attributes are based on similarity measures (Tversky, 1977). Concept matching compares the sets of concepts (single words) within written texts to determine the use of terms. This measure is especially important for different assessments that operate in the same domain. Propositional matching compares only fully identical propositions between two knowledge representations. It is a good measure for quantifying complex semantic relations in a specific subject domain. The balanced semantic matching measure uses both concepts and propositions to match the semantic potential between the knowledge representations. Such content- or semantics-oriented attributes focus on the correctness of content and its meaning (Ifenthaler, 2014).
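A minimal sketch of the concept-matching idea under the Tversky (1977) similarity framework is shown below, assuming that the concepts of a text can be approximated by its content words; systems such as AKOVIA extract concepts and propositions with far richer NLP, so the helper functions here are purely hypothetical.

```python
# Concept matching via a Tversky-style set similarity (illustrative sketch only).
STOPWORDS = frozenset({"the", "a", "of", "and", "in", "is", "into", "by", "which"})

def concepts(text: str) -> set:
    """Approximate the concepts of a text by its lower-cased content words."""
    return {w.strip(".,;") for w in text.lower().split() if w not in STOPWORDS}

def tversky(a: set, b: set, alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky ratio model: |A∩B| / (|A∩B| + alpha*|A−B| + beta*|B−A|)."""
    common = len(a & b)
    if not (a or b):
        return 0.0
    return common / (common + alpha * len(a - b) + beta * len(b - a))

reference = "Photosynthesis converts light energy into chemical energy in the chloroplast."
response = "In photosynthesis the chloroplast converts light into chemical energy."

print(f"Concept matching: {tversky(concepts(reference), concepts(response)):.2f}")
```

Proposition matching would follow the same logic, with the compared sets containing subject–relation–object triples instead of single concepts.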

Overview of Automated Scoring Systems

Instructional applications of automated scoring systems are developed to facilitate the process of scoring and feedback in writing classrooms. These AES systems mimic human scoring by using various attributes; however, implemented attributes vary widely.

The market for commercial and open-source AES systems has seen steady growth since the introduction of PEG. The majority of available AES systems extract a set of attributes from written prose and analyze them using an algorithm to generate a final output. Several overviews document the distinct features of AES systems (Dikli, 2011; Ifenthaler, 2016; Ifenthaler & Dikli, 2015; Zupanc & Bosnic, 2017). Burrows et al. (2015) identified five eras throughout the almost 60 years of research on AES: (1) concept mapping, (2) information extraction, (3) corpus-based methods, (4) machine learning, and (5) evaluation.

Zupanc and Bosnic (2017) note that four commercial AES systems have been predominant in application: PEG, e-rater, IEA, and IntelliMetric. Open-access or open-code systems have been available for research purposes (e.g., AKOVIA); however, they are yet to be made available to the general public. Table 1 provides an overview of current AES systems, including a short description of the applied assessment methodology, output features, information about test quality, and specific requirements. The overview is far from complete; however, it includes the major systems that have been reported in previous summaries and systematic literature reviews on AES systems (Burrows et al., 2015; Dikli, 2011; Ifenthaler, 2016; Ifenthaler & Dikli, 2015; Ramesh & Sanampudi, 2021; Zupanc & Bosnic, 2017). Several AES systems also have instructional versions for classroom use. In addition to their instant scoring capacity on a holistic scale, the instructional AES systems are capable of generating diagnostic feedback and scoring on an analytic scale as well. The majority of AES systems focus on style or content quality and use NLP algorithms in combination with variations of regression models. Depending on the methodology, an AES system requires training samples for building a reference for future comparisons. However, the test quality, precision, or accuracy of several AES systems is not publicly available or has not been reported in rigorous empirical research (Wilson & Rodrigues, 2020).

Table 1 Overview of AES systems

Open Questions and Directions for Research

There are several concerns regarding the precision of AES systems and the lack of semantic interpretation capabilities of the underlying algorithms. The reliability and validity of AES systems have been extensively investigated (Landauer, Laham, & Foltz, 2003; Shermis et al., 2010). The correlations and agreement rates between AES systems and expert human raters have been found to be fairly high; however, the agreement rate is not at the desired level yet (Gierl, Latifi, Lai, Boulais, & Champlain, 2014). It should be noted that many of these studies highlight the results of adjacent agreement between humans and AES systems rather than those of exact agreement (Ifenthaler & Dikli, 2015). Exact agreement is harder to achieve, as it requires two or more raters to assign the exact same score to an essay, while adjacent agreement requires two or more raters to assign scores within one scale point of each other. It should also be noted that correlation studies are mostly conducted in high-stakes assessment settings rather than classroom settings; therefore, AES versus human inter-rater reliability rates may not be the same in specific assessment settings. The rate is expected to be lower in classroom settings, since the content of an essay is likely to be more important in such low-stakes assessment contexts.
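The difference between exact and adjacent agreement can be made concrete with a small sketch; the score vectors below are invented for illustration.

```python
# Exact vs. adjacent agreement between a human rater and an AES system (toy scores).
human = [4, 3, 5, 2, 4, 3, 1, 5]
aes   = [4, 2, 5, 3, 4, 4, 1, 3]

exact = sum(h == a for h, a in zip(human, aes)) / len(human)
adjacent = sum(abs(h - a) <= 1 for h, a in zip(human, aes)) / len(human)

print(f"Exact agreement:    {exact:.2f}")     # identical scores only
print(f"Adjacent agreement: {adjacent:.2f}")  # scores within one scale point
```

On these toy scores, the adjacent agreement (0.88) is considerably higher than the exact agreement (0.50), which mirrors the reporting pattern criticized above.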

The validity of AES systems has been critically reflected upon since the introduction of the initial applications (Page, 1966). A common approach for testing validity is the comparison of scores from AES systems with those of human experts (Attali & Burstein, 2006). Accordingly, questions arise about the role of AES systems in promoting purposeful writing or authentic open-ended assessment responses, because the underlying algorithms view writing as a formulaic act and allow writers to concentrate more on formal aspects of language such as organization, vocabulary, grammar, and text length with little or no attention to the meaning of the text (Ifenthaler, 2016). Validation of AES systems may include the correct use of specific assessment attributes, the openness of algorithms and underlying aggregation and analytics techniques, as well as a combination of human and automated approaches before communicating results to learners (Attali, 2013). Closely related to the issue of validity is the concern regarding the reliability of AES systems. In this context, reliability assumes that AES systems produce repeatedly consistent scores within and across different assessment conditions (Zupanc & Bosnic, 2015). Another concern is the bias of the underlying algorithms, that is, algorithms have their source in a human programmer, which may introduce additional error structures or even features of discrimination (e.g., cultural bias based on selective text corpora). Criticism has been directed at the commercial marketing of AES systems to speakers of English as a second or foreign language (ESL/EFL) when the underlying methodology has been developed for the English language with native English speakers in mind. In an effort to assist ESL/EFL speakers in writing classrooms, many developers have incorporated a multilingual feedback function into the instructional versions of AES systems. Receiving feedback in the first language has proven benefits, yet it may not be sufficient for ESL/EFL speakers to improve their writing in English. It would be more beneficial for non-native speakers of English if developers took common ESL/EFL errors into consideration when building the algorithms of AES systems. Another area of concern is that writers can trick AES systems. For instance, if the written text produced is long and includes certain types of vocabulary that the AES system is familiar with, an essay can receive a high score regardless of the quality of its content. Therefore, developers have been trying to prevent cheating by incorporating additional validity algorithms (e.g., flagging written text with unusual elements for human scoring) (Ifenthaler & Dikli, 2015). The validity and reliability concerns result in speculations regarding the credibility of AES systems, considering that the majority of research on AES is conducted or sponsored by the developing companies. Hence, there is a need for more research addressing the validity and reliability issues raised above, preferably conducted by independent researchers (Kumar & Boulanger, 2020).

Despite the above-mentioned concerns and limitations, educational organizations choose to incorporate instructional applications of AES systems in classrooms, mainly to increase student motivation toward writing and to reduce the workload of the involved teachers. They assume that if AES systems assist students with the grammatical errors in their writing, teachers will have more time to focus on content-related issues. Still, research on students’ perceptions of AES systems and their effects on motivation as well as on learning processes and learning outcomes is scarce (Stephen, Gierl, & King, 2021). In contrast, other educational organizations are hesitant to implement AES systems, mainly because of validity issues related to domain-knowledge-based evaluation. As Ramesh and Sanampudi (2021) exemplify, the domain-specific meaning of “cell” may be different in biology than in physics. Other concerns that may lower the willingness to adopt AES systems in educational organizations include fairness, consistency, transparency, privacy, security, and ethical issues (Ramineni & Williamson, 2013; Shermis, 2010).

AES systems can make the result of an assessment available instantly and may produce immediate feedback whenever the learner needs it. Such instant feedback provides autonomy to the learner during the learning process; that is, learners are not dependent on possibly delayed feedback from teachers. Several attributes implemented in AES systems can produce an automated score, for instance, the correctness of syntactic aspects. Still, automated and informative feedback regarding content and semantics remains limited. Alternative feedback mechanisms have been suggested; for example, Automated Knowledge Visualization and Assessment (AKOVIA) provides automated graphical feedback models, generated on the fly, which have been successfully tested for preflection and reflection in problem-based writing tasks (Lehmann, Haehnlein, & Ifenthaler, 2014). Other studies using AKOVIA feedback models highlight the benefits of informative feedback being available whenever the learner needs it and show that its impact on problem solving was identical to that of feedback models created by domain experts (Ifenthaler, 2014).

Questions for future research on AES systems may address (a) construct validity (i.e., comparing AES systems with other systems or human rater results), (b) interindividual and intraindividual consistency and robustness of the AES scores obtained (e.g., in comparison with different assessment tasks), (c) the correlative nature of AES scores with other pedagogical or psychological measures (e.g., interest, intelligence, prior knowledge), (d) fairness and transparency of AES systems and related scores, as well as (e) ethical concerns related to AES systems (Elliot & Williamson, 2013). From a technological perspective, (f) the feasibility of the automated scoring system (including the training of the AES system using prescored, expert/reference, or comparison texts) is still a key issue with regard to the quality of assessment results. Other requirements include (g) the instant availability, accuracy, and confidence of the automated assessment. From a pedagogical perspective, (h) the form of the open-ended or constructed-response test needs to be considered. The (i) assessment capabilities of the AES system, such as the assessment of different languages, content-oriented assessment, coherence assessment (e.g., writing style, syntax, spelling), domain-specific feature assessment, and plagiarism detection, are critical for a large-scale implementation. Further, (j) the form of feedback generated by the automated scoring system might include simple scoring but also rich semantic and graphical feedback. Finally, (k) the integration of an AES system into existing applications, such as learning management systems, needs to be further investigated by developers, researchers, and practitioners.

Implications for Open, Distance, and Digital Education

The evolution of Massive Open Online Courses (MOOCs) has prompted important questions about online education and its automated assessment (Blackmon & Major, 2017; White, 2014). Education providers such as Coursera, edX, and Udacity predominantly apply so-called auto-graded assessments (e.g., single- or multiple-choice assessments). Implementing automated scoring for open-ended assessments is still on the agenda of such providers but is not yet fully developed (Corbeil, Khan, & Corbeil, 2018).

With the increased availability of vast and highly varied amounts of data from learners, teachers, learning environments, and administrative systems within educational settings, further opportunities arise for advancing AES systems in open, distance, and digital education. Analytics-enhanced assessment extends standard methods of AES systems by harnessing formative as well as summative data from learners and their contexts in order to facilitate learning processes in near real time and to help decision-makers improve learning environments. Hence, analytics-enhanced assessment may provide multiple benefits for students, schools, and the involved stakeholders. However, as noted by Ellis (2013), analytics currently fail to make full use of educational data for assessment.

Interest in collecting and mining large sets of educational data on student background and performance has grown over the past years and is generally referred to as learning analytics (R. S. Baker & Siemens, 2015). In recent years, the incorporation of learning analytics into educational practices and research has further developed. However, while new applications and approaches have brought forth new insights, there is still a shortage of research addressing the effectiveness and consequences with regard to AES systems. Learning analytics, which refers to the use of static and dynamic data from learners and their contexts for (1) the understanding of learning and the discovery of traces of learning and (2) the support of learning processes and educational decision-making (Ifenthaler, 2015), offers a range of opportunities for formative and summative assessment of written text. Hence, the primary goal of learning analytics is to better meet students’ needs by offering individual learning paths, adaptive assessments and recommendations, or adaptive and just-in-time feedback (Gašević, Dawson, & Siemens, 2015; McLoughlin & Lee, 2010), ideally, tailored to learners’ motivational states, individual characteristics, and learning goals (Schumacher & Ifenthaler, 2018). From an assessment perspective focusing on AES systems, learning analytics for formative assessment focuses on the generation and interpretation of evidence about learner performance by teachers, learners, and/or technology to make assisted decisions about the next steps in learning and instruction (Ifenthaler, Greiff, & Gibson, 2018; Spector et al., 2016). In this context, real- or near-time data are extremely valuable because of their benefits in ongoing learning interactions. Learning analytics for written text from a summative assessment perspective is utilized to make judgments that are typically based on standards or benchmarks (Black & Wiliam, 1998).

In conclusion, analytics-enhanced assessments of written essays may reveal personal information and insights into an individual learning history; however, they are not accredited and far from being unbiased, comprehensive, and fully valid at this point in time. Much remains to be done to mitigate these shortcomings in a way that learners will truly benefit from AES systems.

Cross-References