Developing an adaptive tool to select, plan, and scaffold oral assessment tasks for undergraduate courses

Development Article


The increasing linguistic and cultural diversity of undergraduate classrooms at English-language institutions has imposed additional pedagogical and assessment challenges on instructors, many of whom lack the knowledge necessary to design classroom activities and assessments that are fair to all students regardless of their background and language abilities. The development of an adaptive instrument for instructors who do not specialize in English language learning represents an attempt to adjust instructional practices to meet this need. This paper reports on the development of an instrument that undergraduate instructors can use to plan their courses at universities where English is the language of instruction. The instrument’s intended use is illustrated through an example that involves the planning of an interdisciplinary undergraduate course. To build this adaptive tool, a taxonomy describing the relevant components of assessments that involve oral communication was developed and externally reviewed. The questions used in the instrument were then developed and piloted with a group of university undergraduate instructors, after which the instrument was further refined. Although piloting revealed an increase in instructor awareness of how language abilities relate to assessment, further research is needed to determine the extent to which this tool affects instructors’ classroom or assessment practices.


Keywords: Assessment · Planning · Ethics · Undergraduates · Teaching practices · Adaptive instructional tools



Abbreviations

L1: First language
ELL: English language learner
ILTA: International Language Testing Association
IELTS: International English Language Test System
TOEFL iBT: Test of English as a Foreign Language Internet-Based Test
ETS: Educational Testing Service
TOEIC: Test of English for International Communication


As university undergraduate classrooms become increasingly diverse with respect to language and culture, ensuring fairness in current teaching and assessment practices has become a critical issue. Unlike the notion of equality, which pursues equal treatment of all students, fairness requires differential treatment that assists students in achieving the same standards and minimizes the divergence in educational attainment across various social groups (Laing and Todd 2012).

Central to the discussion of fairness in undergraduate classrooms is the educational attainment of English language learners (henceforth written as ELLs). Since ELLs experience more language difficulty than peers who learned English as a first language (L1) (Zeidner and Bensoussan 1988; Shakya and Horsfall 2000; Leki 2007), it is essential that post-secondary institutions and instructors recognize their needs and pay particular attention to fairness. This is especially important in assessment settings where ELLs might have difficulty demonstrating their knowledge and where the results of the assessment are used for high-stakes decisions such as passing a course or graduating from an educational program.

Oral assessment demands high levels of English proficiency and thus may isolate and disadvantage ELLs (Zeidner and Bensoussan 1988; Leki 2007). Data from the US Census Bureau indicate that less than 20 % of the foreign-born population graduates from high school (Gambino et al. 2014), and the ELLs who graduate receive lower scores than their English-speaking peers on assessments of vocabulary (11.6 %), reading (11.0 %), and writing (19.0 %) (NAEP Data Explorer 2014). Unfortunately, oral skills are not being tracked, but given that writing and speaking both involve language production, it is reasonable to conclude that ELLs are similarly behind in their ability to use oral language. ELLs might find oral assessment tasks particularly challenging and anxiety provoking (Al-Issa and Al-Qubtan 2010; Otoshi and Heffernen 2008; Webster 2002; King 2002) since these tasks demand that students collect, inquire about, construct, and deliver information using integrated reading, listening, and speaking skills (King 2002). However, little attention has been given to the fairness of using oral assessment when a class consists of both native English speakers and ELLs.

As a result, an instrument was developed to help undergraduate instructors reflect on their teaching practices, select oral assessment tasks, and conduct them in a way that is fair to everyone. In addition to supporting the inclusion of an under-represented subpopulation, this computer-based instrument attempts to align the curriculum of the course with its assessment regardless of the instructional domain. It has been argued that instructors commonly treat assessment as an afterthought or separate it entirely from teaching, even though pedagogy and assessment are inherently linked and should be included within the planning of the course curriculum (Wieman et al. 2010; Welsh 2012; Brown 2008). This instrument, therefore, focuses on ensuring that the selected assessment tasks are consistent with instructors’ classroom practices.

This paper first describes the challenges and concerns that relate to the fair use of oral assessment within the classroom and the assessment practices that are currently being used at the undergraduate level. It then describes a taxonomy that serves as the theoretical grounding for the developed instrument. Finally, a formative evaluation of a comprehensive assessment-planning tool is described and the instrument’s use by an undergraduate instructor is presented.

Background literature

Since the developed instrument is intended to help instructors select and plan the oral assessments that they will be using in classroom settings, the literature review does not cover the basics of psychometrics or test development and validation. Rather, it starts with a discussion of fairness issues as they relate to current undergraduate assessment practices. It then presents evidence of the limitations of the English proficiency levels that are currently required for admitting undergraduate students. The subsequent methodology section provides additional background on oral assessment for classroom settings.

Fairness issues in current classroom assessment

A review of literature related to fairness issues and current classroom assessment practices provided the basis for constructing a taxonomy and the later development of potential instrument items. Recent studies on assessment fairness cover a broad range of topics, including appropriate task design, administration and scoring, content relevance and coverage, task quality, and equal opportunities for learning and access to assessment (Kunnan 2000; Saville 2003, 2005).

Among these topics, task design plays a pivotal role in achieving fairness: as the initial step in developing an assessment, it determines the assessment’s quality. If tasks are designed in a way that favors or disadvantages a subgroup of students, then inferences about the students’ ability may be unfair. Consequently, task selection is critical, and arguments have been made for integrating formal selection processes into assessment development (McNamara and Roever 2006).

Flint and Johnson (2011) conducted the most comprehensive work that investigated the design of fair assessments using qualitative methods to identify how undergraduate students perceived different assessment practices. Those identified as unfair were characterized by a lack of authenticity or relevance to real-world tasks, the failure to reward genuine effort, the absence of adequate feedback, a lack of long-term benefits, and a reliance on factual recall rather than higher-order thinking and problem-solving skills. Additional work on the fairness of participation-based assessment, which is often oral, has argued that participation grades are among the most subjective (Melvin 1988) and that they increase racial and gender discrimination, ignore cultural diversity, and reduce student motivation (Gilson 1994).

Responding to these perceptions, Flint and Johnson (2011) advised instructors to explicitly link assessment tasks to students’ future workplaces; to convey concrete standards and criteria; to employ many types of assessment tasks; to involve students in assessment processes; and to establish expectations that are realistic, sustainable, and educationally sound. Melvin (1988) also argued that even in seminars where participation plays a crucial role and warrants evaluation, other methods, such as combining peer and instructor ratings, should be employed to increase assessment fairness.

These general insights into designing fair assessment tasks can help instructors, but they fail to address how to ensure fairness when assessment tasks are culturally or linguistically sensitive (e.g., presentations or group work in which ELLs may lack the cultural or linguistic abilities needed to interact as fully as their L1 peers). Moreover, there seems to be little awareness among university instructors of how ELLs may be disadvantaged when classroom participation or other forms of oral assessment are graded (Heywood 2000), even though ELLs have been shown to underperform relative to their L1 peers in domains where specialized language is used, such as chemistry, psychology, or linguistics (Kieffer et al. 2009). While most of this research examined written tests, the temporal nature of oral language makes this problem even more pronounced when oral assessment methods are employed.

The developed instrument could help address this need. A recent survey of faculty in the United States revealed that 52.3 % use oral presentations and 46.2 % use group work in their courses (Webber 2012). Most faculty also reported grading classroom participation (Rogers 2013), and many of the open-ended tasks that are used in the sciences can be performed orally (Goubeaud 2009). Oral tasks are also frequently used in online learning environments that employ video conferencing or the posting of video reports (Borup et al. 2012; Hu et al. 2000).

Entrance requirements and band descriptors

ELLs are generally required to provide proof of their English language ability to gain admission to university undergraduate programs; this proof is often the student’s score on a standardized test (Coley 1999), such as the International English Language Test System (IELTS) (IELTS 2007), Test of English as a Foreign Language (TOEFL iBT) (Educational Testing Service 2008, 2012), or Test of English for International Communication (TOEIC) (Anaheim University 2012; University of California, San Diego 2012). Most institutions require that students achieve minimum scores in order to grant them admission (IELTS 2012a); these scores typically range from IELTS 5.5 to 7.0 but can be as low as 4.0 or as high as 8.5 out of 9.0 (IELTS 2009). Admission requirements may not ensure that students can participate fully in classroom or assessment tasks since the minimum proficiency level required by the tasks may exceed the school’s entrance requirement (Cheng et al. 2007).

In some cases, a student may have what is perceived to be sufficient language proficiency but might not, since instruction in university courses requires multiple literacies (Leki 2007). For example, a student may be able to read and write but have difficulty participating in classroom discussions. Similarly, students may require English communication skills that are unrelated to their discipline, such as asking questions, negotiating, understanding idioms or instructions, or engaging in discussion.

English proficiency tests employ band descriptors to detail score levels. These band descriptors provide a general description of the language abilities of individuals who have received certain scores and can be found on test developer websites (Educational Testing Service 2004, 2010; IELTS 2012b). Since band descriptors detail the language proficiency of the learner, they can be used to determine whether classroom oral assessment tasks might be beyond a student’s abilities. For example, a student at IELTS speaking band 2 is described as taking long pauses between words, only saying memorized utterances or isolated words, having unintelligible speech, and being unable to produce basic sentence forms (IELTS 2012b). Obviously, students at this level cannot successfully perform many of the tasks that take place within classroom settings unless they are entirely scripted (see Table 2 for more details on the types of tasks that are used). In contrast, students at IELTS speaking band 6 can usually be understood but occasionally mispronounce words in a way that reduces clarity and are willing to speak even though they may lose coherence occasionally and hesitate or correct themselves (IELTS 2012b). Students at this level are capable of performing well when given adequate preparation but may struggle with unplanned oral activities.

As a precursor to the instrument, a taxonomy was developed that includes minimum recommended scores that were identified by consulting band descriptors. The identified minimum recommendation, in many cases, exceeds those commonly used for admissions (e.g., IELTS 6.0). The use of oral tasks should, therefore, be carefully considered by instructors of undergraduate courses to ensure that all students have the same opportunity to demonstrate learning and that assessment methods match pedagogical requirements (Deng and Carless 2010).


Methodology

The creation of an instrument to help classroom instructors select and plan oral assessment tasks was informed by theory and user-centered design practices.

Taxonomy development

The instrument was based on a taxonomy that was developed by synthesizing relevant literature from the field of language assessment. The taxonomy was subsequently modified to account for current educational contexts using the classroom experiences and observations of the authors in their roles as both instructors and ELLs. The taxonomy was then externally reviewed by a specialist in language assessment and by several instructors of English as a second or foreign language who had received training in language assessment. These reviews confirmed the comprehensive nature of the taxonomy and identified no major shortcomings.

The taxonomy has two major components. The first portion of the taxonomy (Table 1) describes factors applying to all language tasks, including validity, ethics, and the effect of the assessment method on classroom practices (i.e., washback). The second focuses on specific assessment tasks and the varied considerations related to their use in classroom settings (see Table 2). The sociocultural and linguistic components of the tasks as well as their logistical, practical, and environmental concerns are also detailed in the second part of the taxonomy, and a recommendation for the minimum level of English proficiency that is needed to ensure that students can perform the task satisfactorily is provided. This does not mean the taxonomy only applies to ELLs; the majority of items from both tables apply to all students.
Table 1

A taxonomy of the validity, ethics, and assessment consequences of oral assessment; the constituent sub-categories and selected related questions for each are provided

Validity concerns
- Construct validity (a): Does the test measure what you want it to measure?
- Free of construct-irrelevant variables: Can your students understand the instructions? Do students use the expected skills to complete the assessment? Do you allow enough time to demonstrate the skills you are trying to assess?
- Content validity: Is the content of your assessment tasks relevant to the course material? Do students learn the content prior to the assessment?
- Criterion validity: What are the grading criteria? Do they adequately distinguish different levels of performance? Do they clearly and explicitly describe student performance?
- Reliability: Have you trained other assessors (e.g., teaching assistants)? Do assessors assign grades against established criteria? Would other competent assessors agree on the conclusion of the assessment?
- Is your assessment designed to guide ongoing and future learning? Do your assessment tasks require skills that can be used in extended contexts?

Ethical concerns
- Language bias: Are your assessment tasks free of group-specific vocabulary or reference pronouns? Do you explain figurative language? Do you explain English vocabulary that might be colloquial, regional, or unfamiliar?
- Content bias: Is the material free of controversial or inflammatory content (e.g., religion or evolution)? Do you explain unfamiliar content to students with different cultural or linguistic backgrounds?
- Do your assessment tasks depict groups in stereotyped situations (e.g., immigrant people running a Laundromat or girls needing help)?
- Format bias: Does the student have experience with the format and procedures used?
- Equal chance: Will all students have equal opportunity to participate in the assessment? What procedures have you put in place to ensure equal participation and interaction?
- Access to information: Do all students have access to the same information about the test and its administration? Is information regarding the test purpose, format, and administration available ahead of the test and in accessible formats?
- Are different student needs accommodated (e.g., linguistic)? Do you allow students to engage in voice recording when they have difficulty understanding oral instructions?
- Do students understand how they will be assessed? Are they given appropriate feedback? Can they understand the results?
- Other considerations: Do you intervene when you observe students within your class engaging in behaviors that show cultural insensitivity, racial biases, or stereotyping?

Assessment consequences
- Is instruction given prior to the test related directly to the test construct? How closely are your classroom practices and assessment practices related?
- Will classroom practices that you are targeting towards assessment benefit all students equally? Are you only developing skills that relate to the assessments?

(a) We recognize that most instructors will not be measuring a psychological construct, but they may be measuring proficiency in a particular area. We have, therefore, left this row in the taxonomy

Table 2

A taxonomy of the contextual and logistical components for each type of oral assessment task

Common across all task types: asking and answering (required skill); class size (logistical and practical concern); appropriate conditions, e.g., temperature, lighting, location (environmental concern); affective factors, e.g., student hypervigilance during video recording or peer pressure (socio-cultural aspect); volume of voice, intonation, and stress (linguistic aspects).

Task-specific concerns and minimum recommended test scores (Min. Recom. Test Score):
- One-to-one interviews: noise pollution; physical orientation (e.g., physical distance). IELTS: 6.5; TOEIC: 160
- Oral exams: group work; physical orientation. IELTS: 7.0; TOEIC: 160; TOEFL: 100
- Peer group discussion: group work; physical orientation. IELTS: 7.0; TOEIC: 160; TOEFL: 100
- …: group work; equipment availability; noise pollution; physical orientation. IELTS: 5.5; TOEIC: 130
- …: group work; equipment availability; noise pollution; physical orientation. IELTS: 6.0; TOEIC: 150
- …: group work; equipment availability. IELTS: 7.0; TOEIC: 170; TOEFL: 100
- Role play or simulation: group work; equipment availability; noise pollution; physical orientation. IELTS: 7.5; TOEIC: 180; TOEFL: 107
- Speaking portfolio: group work; equipment availability. Min. score: maximum of the selected tasks

The minimum recommended level of language proficiency (Min. Recom. Test Score) for each task is described using English language-proficiency test scores

Component 1: validity, ethics, and assessment consequences

Validity includes construct validity (i.e., the assessment measures what it is supposed to measure), content validity (i.e., the assessment content and methods are consistent with the course content and objectives), criterion validity (i.e., criteria directly or indirectly measure an intended construct), and reliability (i.e., the consistency of the assessment’s results) (Bachman 1990). In contrast, concerns over ethics and assessment consequences are related to the broader social context that encompasses classroom practices.

Kunnan’s (2004) fairness framework and several professional codes of ethics were consulted (Code of ethics for ILTA 2001; Code of Fair Testing Practices in Education 1988; Standards for educational and psychological testing 1999). The information from these sources was combined to create a comprehensive, high-level view of assessment. Other assessment materials (Cizek et al. 2011; Nitko 2011; Popham 2000) were also incorporated with the aim of helping instructors examine whether the language, content, and format of the assessment are bias free; the information about the assessment is equally accessible to all students; and the assessment is administered under physically and psychologically comfortable conditions. In addition, the consequences of the assessment, which include its effects on classroom practices, were included. Wall and Alderson’s definition of washback as “the impact of a test on teaching” (1993, p. 41) was used to inform the taxonomy; it holds that tests can become powerful determiners, both positively and negatively, of what happens in classroom practice. While some instructors deliberately map their classroom practices onto their chosen assessment methods, many may not be aware of how their chosen assessment influences their classroom instructional practices (Bailey 1999).

Component 2: task-specific considerations

Table 2 details oral assessment task-specific considerations and contains categories of criteria that should be considered when selecting assessment methods. This includes the skills that may be required for each type of task; logistical, practical, and environmental concerns; and linguistic and sociocultural influences on task performance. Many of these considerations are consistent with those described by Bachman and Palmer (2012).

The oral assessment tasks identified include oral exams, one-to-one interviews, peer-group discussions, in-class participation, presentations, role-playing or simulations, and demonstrations. A final option, the speaking portfolio, was also included; it can be composed of any combination of the other task types since it documents student performance on oral tasks and is intended to show the student’s overall ability. Oral exams were defined as presentations that are accompanied by questions from an examining committee or group of people.

The tasks were identified based on those reported in the literature and those that the taxonomy developers had either used or experienced. Once a set of tasks had been identified and defined, the literature was re-consulted to ensure that no types of oral assessment task had been overlooked. Since the literature is sometimes inconsistent when defining assessment tasks (Joughin 1998; Van Moere 2012; Gan 2013), they were specified with respect to the language-based skills that are often performed as part of a larger classroom activity or skill. These skills included the ability to ask and answer questions, negotiate, debate, present information, work in a group, and perform requests.

A minimum recommended level of English language proficiency was also determined for each of the assessment task types, allowing instructors to use their knowledge of their institution’s standards when determining which teaching and assessment practices to use or how they may need to modify them. Considering that a major empirical study assessing the definitive minimum test score requirements for each task has yet to be undertaken, scores were determined using a process similar to that reported by O’Neil et al. (2007). First, the required skills for each task type were evaluated and deconstructed into their relevant components. These were then independently mapped onto the speaking scores of all three major language tests based on the abilities detailed in the tests’ band descriptors (Educational Testing Service 2004, 2010; IELTS 2012b). The developers then met and resolved discrepancies in the minimum scores that they had assigned. For example, peer group discussions were rated as more difficult than one-to-one interviews because students would have to engage with multiple speakers, which requires a larger degree of negotiation and an understanding of more complex requests. The addition of these more difficult skills could require a higher score (i.e., IELTS 7.0) than one-to-one interviews (i.e., IELTS 6.5). An exploration of graduate school admissions requirements at top-tier US universities shows that most require a score of at least 7.0 on the IELTS (IELTS 2009). This is consistent with our recommendation of IELTS 7.0 for oral exams.
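The resulting mapping amounts to a per-task lookup against Table 2. The following sketch (in Python rather than the instrument's C#.NET implementation; the function names are illustrative and only the IELTS values from Table 2 are used) shows how a course's entrance requirement could be compared against a task's recommended minimum, and how a speaking portfolio's minimum follows from the "maximum of the selected tasks" rule:

```python
# Minimum recommended IELTS speaking scores per oral assessment task
# (values from Table 2; the same idea applies to the TOEIC or TOEFL scales).
MIN_IELTS = {
    "one-to-one interview": 6.5,
    "oral exam": 7.0,
    "peer group discussion": 7.0,
    "role play or simulation": 7.5,
}

def task_is_suitable(task, entrance_requirement):
    """True if students admitted at `entrance_requirement` (IELTS)
    meet the task's recommended minimum."""
    return entrance_requirement >= MIN_IELTS[task]

def portfolio_minimum(selected_tasks):
    """A speaking portfolio's requirement is the maximum of the
    minimums of the tasks it documents (Table 2)."""
    return max(MIN_IELTS[t] for t in selected_tasks)

# A school admitting at IELTS 6.0 falls below the recommended
# minimum for every task listed above.
print(task_is_suitable("one-to-one interview", 6.0))  # False
print(portfolio_minimum(["one-to-one interview", "oral exam"]))  # 7.0
```

This makes concrete the observation in the background section: common entrance requirements (e.g., IELTS 6.0) sit below the recommended minimum for every oral task type in the taxonomy.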

It should be acknowledged that the validity of band descriptors when mapped to specific task types might be problematic. Despite the considerable effort invested in testing, adapting, and retesting descriptors against task types and student ability, band descriptors are not always entirely consistent (North and Schneider 1998). However, the scales are employed systematically because the descriptors alone provide a sufficiently valid basis for assessment on specific task types when applied across a vast number of test takers (Ang-Aw and Chuen Meng Goh 2011), and considerable research effort has gone into building consensus around the use of these broad scales for oral assessment (Taguchi 2007; Fulcher and Reiter 2003). Moreover, researchers have considered the different levels of ability students must have to complete different task types depending on the task’s inherent difficulty (Robinson et al. 1995), but specific empirically derived guidelines have yet to be determined.

The resulting taxonomy describes the tasks using five categories of pedagogical, logistical, and sociolinguistic considerations. The analysis revealed commonalities across task types for some categories (i.e., table columns). As a result, the characteristics that apply to every task type within a category were separated from those that are specific to individual tasks. For example, the Logistical and Practical Concerns column lists class size and time as being jointly required by all task types. Beyond this, the final five tasks require or are subject to equipment availability even though they may require different equipment (e.g., a projector, a computer, or a recording device). Many similarities were also observed across tasks once the sociocultural and linguistic aspects columns had been expanded. The recurring considerations were, therefore, placed in the row of characteristics that are common across all task types. Though the taxonomy states that certain tasks may be affected by environmental concerns or require equipment, there could be cases where the factors the taxonomy suggests do not present themselves. Keeping this in mind, it is important to address concerns such as equipment availability even if it is merely to determine that no equipment needs exist for the intended assessment context.

Many of the terms used throughout the sociocultural and linguistic categories are technical in nature and require definition; formality, register, and politeness are among them. Formality relates to language that is characterized by attention to form (Labov 1972), where the speaker follows standard language conventions, form, and pronunciation. Register is “a variety of language defined according to its use in social situations” (Crystal 1991, p. 295). Trudgill (1983) argued that register is linked to occupation, profession, discipline, or topic; it should, therefore, be considered during instruction and assessment since different registers could be employed during the completion of any of the oral assessment tasks. Although politeness, which can be defined as “conforming to socially agreed codes of good conduct” (Nwoye 1992), may on the surface seem similar to register, it is a separate construct and can be realized using different registers.

The environmental concerns category refers to the classroom and assessment environment and has many items that are consistent across tasks. This category is possibly the most fluid of the six and the one most affected by local conditions. Environmental concerns relate to classroom size, temperature, lighting, noise levels, and other similar aspects of the physical environment. As with the other categories, different oral assessment tasks can be carried out in different ways depending on the environment. Since both the physical environment and the mental state of students are important, affective factors, such as student ‘hypervigilance’ during video recording or everyday occurrences of peer pressure, were included because they can affect student performance, especially for students who have lower English proficiency levels (Shohamy 1982; Scott 1986; Young 1986).

Instrument development

An instrument was developed based on the taxonomy using an argument-based approach to assessment validity (Chapelle et al. 2010): it was designed to ask how assessments are prepared, administered, and used.

Initial questions were developed based on the elements of each category within the taxonomy (see Tables 1 and 2) until every element had at least one question associated with it; some questions were associated with multiple elements. Additional questions were added to ensure that issues related to the logistics of the assessment methods and their authenticity with respect to what is done in the workplace and educational institution were considered (Ishii and Baba 2003; MacDonald et al. 2004; Association of Language Testers in Europe 2012). Since some oral assessment tasks may require additional language skills, necessitating the consideration of other forms of input and output (Frost et al. 2011), related questions were added, and established rating scales were consulted when formulating the response options for closed questions (Spector 1992).

Several rounds of revisions were performed, resulting in the modification, removal, and addition of questions to ensure that the primary concerns were being covered. This was done to limit the time burden placed on instructors by reducing the number of questions asked (Dix et al. 2004; Spector 1992); including all of the questions from the taxonomy would have made the instrument unnecessarily long. Moreover, items were reworded to make them more understandable to instructors unfamiliar with language assessment (Nielsen 1994). The development and revision of instrument items were, therefore, based on guidelines from the area of scale development; this includes monitoring how pilot testers respond to questionnaire items (Fowler 2009).

Recommendations pertaining to which oral assessment methods might be more appropriate for an instructor’s course are made after the instructor has answered a series of questions. Recommendations focus on task authenticity and student familiarity with tasks based on the students’ academic program, specific course, and later use in the workplace. Recommendations are suggestions that detail why certain assessment methods may be more appropriate. Suggestions are also made to ensure the inclusion and readiness of ELLs should an instructor decide to use an assessment method that may be less appropriate. The aim of the instrument is not to prescribe methods or to suggest that the classroom be adjusted to the individual cultural and linguistic backgrounds of each learner. Rather, it is to encourage the consideration of these students’ needs wherever possible in relation to instructional practices that center on the L1 majority.

The instrument was automated for computer administration using the C#.NET programming language, which has been used successfully in the past for adaptive language-learning and assessment systems (Demmans Epp and McCalla 2011; Demmans Epp et al. 2013). The creation of an adaptive computer program allowed the complexity of the relationships between different pedagogical and assessment practices to be captured without requiring that instructors who might use the tool fully understand all of the implications of their decisions. The finished result is a self-contained executable file for Windows-based computers that is available for download. The program presents questions to instructors using the same general sequence of screens; however, instructor responses can result in the addition of questions to those screens, and the recommendations change based on instructor responses.
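The adaptive behaviour just described — follow-up questions appearing on a screen only when an earlier response makes them relevant — can be reduced to a set of display rules keyed on prior responses. The sketch below is an illustrative Python simplification of that idea; the actual tool is a C#.NET application, and the question texts and trigger values here are invented:

```python
# Illustrative reduction of the instrument's adaptive questioning:
# each screen has base questions plus follow-ups gated on earlier responses.

base_questions = ["Who will assess the oral tasks?"]

# Follow-up questions keyed by the (question, answer) pair that triggers them.
follow_ups = {
    ("Who will assess the oral tasks?", "other assessors"): [
        "How will the other assessors be trained?",
        "Will assessors share a common scoring rubric?",
    ],
}

def questions_for_screen(responses):
    """Return the base questions plus any follow-ups triggered by prior responses."""
    shown = list(base_questions)
    for (question, answer), extras in follow_ups.items():
        if responses.get(question) == answer:
            shown.extend(extras)
    return shown

# Instructor-only assessment triggers no extra questions...
print(len(questions_for_screen({"Who will assess the oral tasks?": "instructor only"})))  # 1
# ...while planning to use other assessors adds the training questions.
print(len(questions_for_screen({"Who will assess the oral tasks?": "other assessors"})))  # 3
```

Keeping the gating rules in data rather than control flow makes it straightforward to add or revise follow-up questions as the taxonomy evolves.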

Instrument validation: question and visual design

Instrument questions were piloted with a group of 20 human–computer interaction researchers whose teaching experience included undergraduate courses at English-language institutions. Pilot testers were asked to complete the questions in pairs while considering one of their prior undergraduate teaching contexts. Testers provided feedback on the clarity of instrument items as well as their appropriateness, organization, and presentation. Items were eliminated, added, or their wording and presentation changed. Potential alternate wordings for instrument items were discussed.

Following item piloting, a structured brainstorming session was conducted to determine how best to present the instrument’s recommendations to users. During this session, participants worked independently using various paper and craft resources (e.g., markers, paper, sticky notes, and scissors) to design a graphical user interface for communicating the recommendations that would be given to users based on their responses within the instrument. Each person presented his or her design ideas to the group, where they were discussed and built upon.

Both the pilot testers’ responses to questions and the user interface design artifacts that they created were collected. These were later analyzed by the system developers in order to identify questions that were misinterpreted and to determine how to best present information within the instrument.

Results: instrument validation

The systematic review and pilot testing resulted in several questions being reworded to reduce the use of linguistic, psycholinguistic, and assessment terminology. For example, several pilot testers sought clarification for the term “non-majority accent”. Items were also modified to improve clarity with “how much cultural diversity does your class have?” becoming “to what extent do you share a common cultural background?”. Pilot testing also resulted in several questions being made optional and the addition of don’t know and maybe responses for some questions. Some of the questions that had been worded in a way that encouraged a yes or no response were reworded to reflect the extent to which an instructor performed an activity since pilot testers indicated that they should be allowed to provide more detailed information; they believed that their practices were more nuanced than the question implied. For example, “do you explain behavioural expectations…” was changed to “to what extent do you explain behavioural expectations…”.

During the brainstorming session, participants suggested the use of a table to present the assessment recommendations. This suggestion included columns listing the pros and cons for each assessment task (Fig. 1). These became the columns that explain why a specific method might be appropriate and that suggest ways for increasing fairness should that method be chosen.
Fig. 1

Participant-suggested format for how the recommendations should be presented


Beyond validating the instrument and informing the wording of instrument items, pilot testing illustrated the need for such tools, since most of the instructors who participated admitted that they had not previously considered their students’ language abilities when determining teaching and assessment practices. In some cases, pilot testers reported that their participation helped clarify some of the challenges that their students were facing and provided them with additional information that they could use when teaching. Nevertheless, the usefulness of the instrument and its recommendations depend on the quality of the information that is entered by instructors as well as their willingness to reflect on their practices.

Resulting instrument

The instrument is intended to be used when planning a course. Some questions may necessitate the one-time retrieval of information from other sources (e.g., the minimum English proficiency score required for admission) but none of the questions is required. The instrument will work if some questions go unanswered.

Instructors proceed through the questions and are shown a collection of recommendations that are generated based on their input. The instrument shows instructors which of their classroom practices are consistent with each of the potential oral assessment methods and suggests recommendations that could help improve fairness. The instrument also provides a guideline for the level of English proficiency required for students to succeed at each assessment task. Instructors then select the assessment methods they plan to use and are asked a series of questions that are intended to encourage reflection and support their planning. This last stage is especially important when an instructor chooses a task where students’ English proficiency falls below the recommended guideline.
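Conceptually, this recommendation stage scores each candidate assessment method against the instructor’s reported practices and attaches a proficiency guideline. The Python sketch below illustrates that shape only; the task profiles, practice names, and proficiency thresholds are invented placeholders, and the real recommendation logic in the C#.NET tool is considerably richer:

```python
# Hypothetical per-task profiles: classroom practices that support the task and an
# illustrative minimum English-proficiency guideline (all values invented).
TASK_PROFILES = {
    "group work": {"supports": {"uses_group_work", "peer_interaction"}, "min_proficiency": 5.5},
    "individual presentation": {"supports": {"lecture_practice"}, "min_proficiency": 6.5},
}

def recommend(practices, class_proficiency):
    """Rank tasks by how many reported classroom practices they match, and flag
    tasks whose proficiency guideline exceeds the class's proficiency level."""
    rows = []
    for task, profile in TASK_PROFILES.items():
        matched = practices & profile["supports"]
        rows.append({
            "task": task,
            "consistent_practices": sorted(matched),
            "needs_support": class_proficiency < profile["min_proficiency"],
        })
    return sorted(rows, key=lambda r: len(r["consistent_practices"]), reverse=True)

rows = recommend({"uses_group_work", "peer_interaction"}, class_proficiency=6.0)
print(rows[0]["task"])           # group work
print(rows[0]["needs_support"])  # False
```

A task flagged with `needs_support` corresponds to the case described above, where the instrument prompts additional reflection because students’ English proficiency falls below the recommended guideline for the chosen task.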

The instrument’s use is illustrated within the context of planning an interdisciplinary course on language technology, an elective where students are required to meet either a linguistics or a computer science prerequisite. Students are assigned to pairs, with each pair consisting of one linguistics student and one computer science student, so that every pair has the minimum required background knowledge from both disciplines.

The instrument first asks questions about the instructor’s institutional context (Fig. 2). This includes information about the minimum English proficiency that is required for admission as well as the use of oral assessment tasks within the institution as a whole. These questions are answered with respect to the instructor’s knowledge and beliefs about the school. After answering the questions, the instructor is shown classroom-related questions (Fig. 3) such as the number of students and resources, the classroom’s physical size, and the social and cultural characteristics of the class.
Fig. 2

The instructor’s responses to the institution-level questions

Fig. 3

The instructor’s responses to the questions about her physical class and her course’s background

The instructor clicks on the Go to assessment goals button and advances to the Learning and Assessment Goals screen, where she specifies what students will learn, the purposes of the assessments, and the importance of different types of communication tasks with respect to the skills that students will need (Fig. 4). The instructor then answers questions about classroom practices (Fig. 5), including questions about the use of different types of activities within the classroom and how teaching and assessment are approached. In the example, the instructor indicates that she plans to use group work regularly.
Fig. 4

The instructor’s responses to questions about the learning objectives of the course

Fig. 5

The instructor’s responses to questions about classroom practices

After completing this section, the instructor answers questions about assessment practices. These questions are sub-divided into sections: the assessment environment (Fig. 6), scoring practices and student preparation (Fig. 7), and student inclusion (Fig. 8). For this course, the instructor and the tutorial assistant will grade all project components including any presentations or demonstrations students might perform. The questions about who will be performing the assessment and how they will be trained provide an example of adaptive follow-up questions that are only asked when the instructor is considering using other assessors.
Fig. 6

The instructor’s responses to questions about the planned assessment environment and criteria

Fig. 7

How the instructor plans to score student work

Fig. 8

The instructor’s responses to questions about student inclusion

Once the questions have been answered, the instructor is given a set of recommendations from which oral assessment tasks can be selected (Figs. 9, 10). In this case, the instructor selects the group work, demonstrations, and participation options. Additional recommendations are provided below the table of task-specific recommendations (Fig. 11).
Fig. 9

The task-based recommendations that are made as a result of the instructor’s input

Fig. 10

Additional task-based recommendations

Fig. 11

Additional recommendations

Following this, the instructor is asked a series of planning-related questions and can print a report that details both the answers provided and the recommendations that were made (Fig. 12). Finally, an option allows the completed instrument to be printed so that instructors can keep a copy for their records.
Fig. 12

The instructor’s responses to the reflection and planning questions


There are too many inter-related factors involved in assessment and classroom practices for the taxonomy or instrument to account for all possibilities, and not everything within the taxonomy and instrument will apply across all disciplines. For example, a dance demonstration may not involve the use of oral language. Later use may reveal that items reflecting certain aspects of the taxonomy should be added or that others can be removed. It may also be difficult for instructors of courses whose students do not target specific careers to answer all instrument questions. The developed instrument has not been tested through a long-term deployment with instructors from a representative sample of institutions or disciplines; however, it should still be usable to support reflection and planning.


Although universities require undergraduate students who are ELLs to meet minimum language requirements to gain admission, these learners may encounter assessment methods and classroom practices that demand greater proficiency levels than those needed for admission. An instrument was developed to help undergraduate instructors select and apply assessment methods that include ELLs, so that these learners are not marginalized by oral assessment tasks that fail to consider their ability to comprehend or participate in the assessment.

The instrument was based on a taxonomy that was developed using the principles espoused in the ELL assessment literature. This taxonomy was also reviewed by third parties with experience in ELL instruction and assessment. The developed instrument collects information from the instructor during the planning phase of his or her course and recommends which oral assessment methods might be suitable for that instructor’s course, pedagogical goals, and context. The instrument then asks several questions to encourage instructor reflection and further aid the instructor in planning his or her assessments. It is hoped that the use of this instrument will aid in planning so that ELLs can demonstrate their domain knowledge and are not inadvertently penalized for having limited English proficiency. Pilot testing of the developed instrument has shown that it holds the potential to increase instructor awareness. However, further study should be performed to determine its effectiveness as a planning tool and its impact on instructor practices.


  1. ELL is being used to represent any learner whose primary language(s) or language(s) spoken at home is not English and who may still be trying to achieve proficiency in English, regardless of the context in which s/he is found.



We would like to thank the reviewers for their guidance. We would also like to thank our instructor, classmates, and participants for their guidance, contributions, and feedback. The first author held W. Garfield Weston and Walter C. Sumner Memorial Fellowships.


  1. Al-Issa, A. S., & Al-Qubtan, R. (2010). Taking the floor: Oral presentations in EFL classrooms. TESOL Journal, 1, 227–246. doi:10.5054/tj.2010.220425.
  2. Anaheim University. (2012). Anaheim University—Entrance requirements for accredited online degree and certificate programs. Resource document. Anaheim University. Retrieved December 11, 2012.
  3. Ang-Aw, H., & Chuen Meng Goh, C. (2011). Understanding discrepancies in rater judgement on national-level oral examination tasks. RELC Journal, 42(1), 31–51. doi:10.1177/0033688210390226.
  4. American Psychological Association (1988). Code of fair testing practices in education. Washington, D.C.: Joint Committee on Testing Practices, American Psychological Association.
  5. American Educational Research Association (1999). Standards for educational and psychological testing. Washington, D.C.: American Educational Research Association.
  6. Association of Language Testers in Europe. (2012). The content analysis checklists project. Resource document. Retrieved December 7, 2012.
  7. Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, UK: Oxford University Press.
  8. Bachman, L. F., & Palmer, A. S. (2012). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford, UK: Oxford University Press.
  9. Bailey, K. (1999). Washback in language testing. Princeton, NJ: Educational Testing Service.
  10. Borup, J., West, R. E., & Graham, C. R. (2012). Improving online social presence through asynchronous video. Internet and Higher Education, 15, 195–203.
  11. Brown, J. D. (2008). Testing-context analysis: Assessment is just another part of language curriculum development. Language Assessment Quarterly, 5, 275–312.
  12. Chapelle, C. A., Enright, M., & Jamieson, J. (2010). Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice, 29, 3–13.
  13. Cheng, L., Klinger, D. A., & Zheng, Y. (2007). The challenges of the Ontario Secondary School Literacy Test for second language students. Language Testing, 24, 185–208. doi:10.1177/0265532207076363.
  14. Cizek, G. J., Schmid, L. A., & Germuth, A. A. (2011). A checklist for evaluating K-12 assessment programs. Kalamazoo, MI: The Evaluation Center, Western Michigan University.
  15. Coley, M. (1999). The English language entry requirements of Australian universities for students of non-English speaking background. Higher Education Research & Development, 18, 7–17. doi:10.1080/0729436990180102.
  16. Crystal, D. (1991). A dictionary of linguistics and phonetics. Oxford, UK: Basil Blackwell.
  17. Demmans Epp, C., & McCalla, G. I. (2011). ProTutor: Historic open learner models for pronunciation tutoring. In G. Biswas, S. Bull, J. Kay, & A. Mitrovic (Eds.), Artificial intelligence in education (pp. 441–443). Auckland, New Zealand: Springer.
  18. Demmans Epp, C., Tsourounis, S., Djordjevic, J., & Baecker, R. M. (2013). Interactive event: Enabling vocabulary acquisition while providing mobile communication support. In H. C. Lane, K. Yacef, J. Mostow, & P. Pavlik (Eds.), Artificial intelligence in education (pp. 932–933). Memphis, TN: Springer.
  19. Deng, C., & Carless, D. (2010). Examination preparation or effective teaching: Conflicting priorities in the implementation of a pedagogic innovation. Language Assessment Quarterly, 7, 285–302. doi:10.1080/15434303.2010.510899.
  20. Dix, A., Finlay, J. E., Abowd, G. D., & Beale, R. (2004). Human–computer interaction (3rd ed.). Harlow, England: Pearson/Prentice-Hall.
  21. Educational Testing Service. (2004). iBT/Next generation TOEFL Test: Independent speaking rubrics (scoring standards). Princeton, NJ: Educational Testing Service.
  22. Educational Testing Service. (2008). Top universities in UK accept TOEFL® scores. Princeton, NJ: Educational Testing Service.
  23. Educational Testing Service. (2010). User guide (speaking & writing). Princeton, NJ: Educational Testing Service.
  24. Educational Testing Service. (2012). TOEFL destinations directory. Princeton, NJ: Educational Testing Service.
  25. Flint, N., & Johnson, B. (2011). Towards fairer university assessment: Recognizing the concerns of students. New York, NY: Taylor & Francis.
  26. Fowler, F. J. (2009). Survey research methods (4th ed.). Thousand Oaks, CA: Sage Publications.
  27. Frost, K., Elder, C., & Wigglesworth, G. (2011). Investigating the validity of an integrated listening-speaking task: A discourse-based analysis of test takers’ oral performances. Language Testing, 29, 345–369. doi:10.1177/0265532211424479.
  28. Fulcher, G., & Reiter, R. (2003). Task difficulty in speaking tests. Language Testing, 20(3), 321–344. doi:10.1191/0265532203lt259oa.
  29. Gambino, C. P., Acosta, Y. D., & Grieco, E. M. (2014). English-speaking ability of the foreign-born population in the United States: 2012. Washington, DC: US Census Bureau.
  30. Gan, Z. (2013). Task type and linguistic performance in school-based assessment situation. Linguistics and Education, 24(4), 535–544. doi:10.1016/j.linged.2013.08.004.
  31. Gilson, C. (1994). Of dinosaurs and sacred cows: The grading of classroom participation. Journal of Management Education, 18, 227–236. doi:10.1177/105256299401800207.
  32. Goubeaud, K. (2009). How is science learning assessed at the postsecondary level? Assessment and grading practices in college biology, chemistry and physics. Journal of Science Education and Technology, 19, 237–245. doi:10.1007/s10956-009-9196-9.
  33. Heywood, J. (2000). Assessment in higher education: Student learning, teaching, programmes and institutions. Philadelphia, PA: Jessica Kingsley Pub.
  34. Hu, C., Sharpe, L., Crawford, L., Gopinathan, S., Khine, M. S., Moo, S. N., & Wong, A. (2000). Using lesson video clips via multipoint desktop video conferencing to facilitate reflective practice. Journal of Information Technology for Teacher Education, 9(3), 377–388. doi:10.1080/14759390000200093.
  35. IELTS. (2007). Handbook 2007. Cambridge, UK: University of Cambridge.
  36. IELTS. (2009). US recognition list: Educational institutions, professional organizations and accrediting bodies recognizing IELTS. Cambridge, UK: University of Cambridge.
  37. IELTS. (2012a). Institutions: Who accepts IELTS? Cambridge, UK: University of Cambridge. Retrieved December 7, 2012.
  38. IELTS. (2012b). Speaking: Band descriptors (public version). Cambridge, UK: University of Cambridge. Retrieved November 14, 2012.
  39. International Language Testing Association (2001). Code of ethics for ILTA.
  40. Ishii, D., & Baba, K. (2003). Locally developed oral skills evaluation in ESL/EFL classrooms: A checklist for developing meaningful assessment procedures. TESL Canada Journal, 21, 79–95.
  41. Joughin, G. (1998). Dimensions of oral assessment. Assessment and Evaluation in Higher Education, 23(4), 367–378.
  42. Kieffer, M. J., Lesaux, N. K., Rivera, M., & Francis, D. J. (2009). Accommodations for English language learners taking large-scale assessments: A meta-analysis on effectiveness and validity. Review of Educational Research, 79, 1168–1201. doi:10.3102/0034654309332490.
  43. King, J. (2002). Preparing EFL learners for oral presentations. Internet TESL Journal, 8(3).
  44. Kunnan, A. J. (2000). Fairness and justice for all. In A. J. Kunnan (Ed.), Fairness and validation in language assessment (pp. 1–14). Cambridge, UK: Cambridge University Press.
  45. Kunnan, A. J. (2004). Test fairness. In M. Milanovic & C. Weir (Eds.), European language testing in a global context: Proceedings of the ALTE Barcelona conference (pp. 27–48). Cambridge, UK: Cambridge University Press.
  46. Labov, W. (1972). Sociolinguistic patterns. Philadelphia, PA: University of Pennsylvania Press.
  47. Laing, K., & Todd, L. (2012). Fair or foul? Towards practice and policy in fairness in education. Newcastle, UK: Newcastle University. Retrieved April 28, 2014.
  48. Leki, I. (2007). Undergraduates in a second language: Challenges and complexities of academic literacy development. New York, NY: Lawrence Erlbaum Associates.
  49. MacDonald, K., Alderson, J., & Lai, L. (2004). Selecting and using computer-based language tests (CLBTs) to assess language proficiency: Guidelines for educators. TESL Canada Journal, 21, 93–104.
  50. McNamara, T. F., & Roever, C. (2006). Language testing: The social dimension. Malden, MA: Blackwell Pub.
  51. Melvin, K. B. (1988). Rating class participation: The prof/peer method. Teaching of Psychology, 15, 137–139. doi:10.1207/s15328023top1503_7.
  52. NAEP Data Explorer. (2014). 2011 national vocabulary, reading, and writing scores. USA: Institute of Educational Sciences National Center for Education Statistics.
  53. Nielson, J. (1994). Heuristic evaluation. In J. Nielson & R. L. Mack (Eds.), Usability inspection methods (pp. 25–62). New York, NY: John Wiley & Sons.
  54. Nitko, A. J. (2011). Educational assessment of students (6th ed.). Boston, MA: Pearson/Allyn & Bacon.
  55. North, B., & Schneider, G. (1998). Scaling descriptors for language proficiency scales. Language Testing, 15(2), 217–262. doi:10.1177/026553229801500204.
  56. Nwoye, O. G. (1992). Linguistic politeness and socio-cultural variations of the notion of face. Journal of Pragmatics, 18, 309–328.
  57. O’Neil, T., Buckendahl, C., Plake, B., & Taylor, L. (2007). Recommending a nursing-specific passing standard for the IELTS examination. Language Assessment Quarterly, 4, 295–317.
  58. Otoshi, J., & Heffernen, N. (2008). Factors predicting effective oral presentations in EFL classrooms. Asian EFL Journal, 10(1), 65–78.
  59. Popham, W. J. (2000). Modern educational measurement: Practical guidelines for educational leaders. Boston, MA: Allyn and Bacon Pub.
  60. Robinson, P., Ting, S., & Urwin, J. J. (1995). Investigating second language task complexity. RELC Journal, 26(2), 62–79. doi:10.1177/003368829502600204.
  61. Rogers, S. L. (2013). Calling the question: Do college instructors actually grade participation? College Teaching, 61, 11–22. doi:10.1080/87567555.2012.703974.
  62. Saville, N. (2003). The process of test development and revision within UCLES EFL. In C. J. Weir & M. Milanovic (Eds.), Continuity and innovation: Revising the Cambridge Proficiency in English Examination, 1913–2002 (pp. 57–120). Cambridge, UK: Cambridge University Press.
  63. Saville, N. (2005). Setting and monitoring professional standards: A QMS approach. Cambridge ESOL: Research Notes, 22, 2–5.
  64. Scott, M. (1986). Student affective reactions to oral language tests. Language Testing, 3, 99–118.
  65. Shakya, A., & Horsfall, J. (2000). ESL undergraduate nursing students in Australia: Some experiences. Nursing and Health Sciences, 2, 163–171.
  66. Shohamy, E. (1982). Affective considerations in language testing. The Modern Language Journal, 66, 13–17. doi:10.2307/327810.
  67. Spector, P. (1992). Summated rating scale construction: Introduction. Newbury Park, CA: Sage Publications.
  68. Taguchi, N. (2007). Task difficulty in oral speech act production. Applied Linguistics, 28(1), 113–135. doi:10.1093/applin/aml051.
  69. Trudgill, P. (1983). Sociolinguistics: An introduction to language and society. New York, NY: Penguin.
  70. University of California, San Diego. (2012). GLI—Admissions information. Resource document. University of California, San Diego. Retrieved December 11, 2012.
  71. Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. Language Testing, 29(3), 325–344. doi:10.1177/0265532211424478.
  72. Wall, D., & Alderson, J. C. (1993). Examining washback: The Sri Lankan impact study. Language Testing, 10, 41–69. doi:10.1177/026553229301000103.
  73. Webber, K. (2012). The use of learner-centered assessment in US colleges and universities. Research in Higher Education, 53, 201–228.
  74. Webster, F. (2002). A genre approach to oral presentations. Internet TESL Journal, 8(7).
  75. Welsh, A. J. (2012). Exploring undergraduates’ perceptions of the use of active learning techniques in science lectures. Journal of College Science Teaching, 42, 80–87.
  76. Wieman, C., Perkins, K., & Gilbert, S. (2010). Transforming science education at large research universities: A case study in progress. Change: The Magazine of Higher Learning, 42, 7–14.
  77. Young, D. J. (1986). The relationship between anxiety and foreign language oral proficiency ratings. Foreign Language Annals, 19, 439–445. doi:10.1111/j.1944-9720.1986.tb01032.x.
  78. Zeidner, M., & Bensoussan, M. (1988). College students’ attitudes towards written versus oral tests of English as a foreign language. Language Testing, 5, 100–114.

Copyright information

© Association for Educational Communications and Technology 2015

Authors and Affiliations

  • Carrie Demmans Epp (1)
  • Gina Park (2)
  • Christopher Plumb (3)

  1. Department of Computer Science, University of Toronto, Toronto, Canada
  2. Ontario Institute for Studies in Education (OISE), University of Toronto, Toronto, Canada
  3. Abu Dhabi National Oil Company Technical Institute, Abu Dhabi, United Arab Emirates
