Instructionally Sensitive Assessments (Close, Proximal, Distal Assessments)
Keywords: Test Item; Item Response Theory; Cognitive Demand; Item Statistic; Expert Teacher
Instructional sensitivity refers to the validity of the inferences made about the ability of a test or test item to reflect or differentiate the instruction received by students, based on their performance on that test or test item. The focus of instructional sensitivity is the overlap between test and instruction. The scores produced by an instructionally sensitive test should distinguish accurately between students who have and have not been taught a given content or those who have and have not been effectively taught that content. Because of the importance of the validity of test interpretations, instructional sensitivity should be regarded as a psychometric property of the tests – as important as other psychometric properties (Polikoff 2010). Interpretations of the effectiveness of teachers’ and schools’ instructional practices are not valid if they are based on instructionally insensitive tests.
Two terms closely related to instructional sensitivity are instructional validity and curricular validity. Instructional validity was originally introduced by Feldhusen et al. (1976) to refer to two sources of data for evaluating the validity of test score interpretations: the specification of the knowledge or performance domain being tested and evidence that instruction on the specified domain was provided. They argued that the concept of content or curricular validity considered only the content of the test and provided no information about whether and how that content was delivered to students. In 1979, McClung used the term instructional validity in the Debra P. vs. Turlington case. He defined it as “an actual measure of whether the schools are providing students with instruction on the knowledge and skills measured by the test” (p. 683, emphasis added). He defined curricular validity as “a measure of how well test items represent the objectives of the curriculum” (p. 682). It is important to note that McClung regarded curricular validity as “theoretical” and instructional validity as empirical, since judgments on the latter need to be supported by evidence that the students were exposed to the knowledge and skills required to answer the test items correctly. This distinction makes clear that, even when a test appears to have an appropriate “fit” to a curriculum based on the content areas sampled, the fit does not ensure that students have actually been instructed on these content areas. That is, an assessment that has curricular validity may have different degrees of instructional validity across classrooms. The connection between instructional validity and instructional sensitivity is direct. Both terms focus on the need for evidence of the instruction received by students on the topics being tested. Curricular validity is part of instructional sensitivity. While instructional validity can be used interchangeably with instructional sensitivity, curricular validity cannot.
Two other terms usually linked to instructional sensitivity are instructional alignment and opportunity to learn. They are techniques to measure the characteristics of instruction to which students are exposed, and, therefore, they are not conceptually equivalent to instructional sensitivity. Instructional alignment refers to the match between the content of instruction and the content of an assessment, based on teachers’ reports about the content being taught and the cognitive demands with which the content was taught. Opportunity to learn (OTL) refers to whether or not students have had the opportunity to study a particular content. It is a concept introduced in the First International Mathematics Survey in the 1960s with the purpose of ensuring valid comparisons in international testing programs. While there are multiple measures of OTL, they essentially address whether certain content was covered and, in some instruments, what proportion of time was spent covering it.
An important source of variation in how instructional sensitivity is defined lies in the conceptualization of the instruction students received, that is, in “what” aspect of instruction researchers pay attention to. Two aspects of instruction have been the focus of the research on instructional sensitivity: the content being taught and the quality with which the content is taught. This difference is relevant when it comes to the methods used to gather information about the instruction that students receive.
Examining Instructional Sensitivity of Tests
There are three major categories of methods for examining instructional sensitivity (Polikoff 2010): statistical, instruction-based, and judgmental. Statistical methods focus on item statistics based on students’ responses to items. One especially important item statistic is the pretest-posttest difference index (PPDI) proposed in the 1960s. PPDI is the proportion of students (p value) passing the item on the posttest minus the proportion of students passing the item on the pretest. The difference between pre-instruction and post-instruction scores is treated as an indicator of instructional effectiveness. PPDI is considered to be a robust indicator because it allows detection of the effects of instruction (on different tests and with different samples of students), it is easy to implement and understand (as gain scores), and its use in item selection for tests leads to a better ability to distinguish between students who have and have not received instruction (Polikoff 2010). Other item statistics involve the use of item response theory (IRT). One of them is ZDIFF, the normalized difference between IRT-based item difficulty estimates on the pretest and the posttest or from two different samples of students (see Polikoff 2010).
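The two item statistics described above can be illustrated with a minimal Python sketch. The PPDI function follows the definition in the text directly. The ZDIFF function is only an assumption: the text says the difference in IRT difficulty estimates is normalized but does not give the formula, so the sketch divides the difference by the standard error of the difference, and the sign convention (pretest minus posttest, so that a positive value means the item became easier) is likewise assumed. The function names and the response data are hypothetical.

```python
import math
from typing import List, Sequence


def p_values(responses: Sequence[Sequence[int]]) -> List[float]:
    """Proportion of students answering each item correctly.

    `responses` has one row per student and one column per item,
    scored 1 (correct) or 0 (incorrect).
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_students for j in range(n_items)]


def ppdi(pretest: Sequence[Sequence[int]],
         posttest: Sequence[Sequence[int]]) -> List[float]:
    """Pretest-posttest difference index: posttest p value minus pretest p value."""
    return [post - pre for pre, post in zip(p_values(pretest), p_values(posttest))]


def zdiff(b_pre: float, se_pre: float, b_post: float, se_post: float) -> float:
    """ASSUMED form of ZDIFF: difficulty difference over the SE of the difference.

    b_pre and b_post are IRT difficulty estimates; se_pre and se_post are
    their standard errors. The exact normalization is not given in the text.
    """
    return (b_pre - b_post) / math.sqrt(se_pre ** 2 + se_post ** 2)


# Hypothetical scored responses: 4 students (rows) by 3 items (columns).
pretest = [[0, 0, 1], [0, 1, 1], [0, 0, 1], [1, 0, 1]]
posttest = [[1, 0, 1], [1, 1, 1], [1, 0, 1], [1, 1, 0]]

item_ppdi = ppdi(pretest, posttest)
print(item_ppdi)  # item 1 gains 0.75, item 2 gains 0.25, item 3 loses 0.25
```

Under this reading, an item with a PPDI near zero, or a negative one like the third item here, would be a candidate for review as instructionally insensitive, while a large positive PPDI suggests the item responds to instruction.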
Instruction-based methods focus on two sources of information: students’ responses to items and some measure of the instruction students received. The study of the content and/or the quality with which it is delivered has used a wide variety of approaches. These approaches include multiple measures of instruction (e.g., teachers’ reports about content covered or content taught/not taught, quality of instruction measured by direct observation or teacher surveys, or analysis of curriculum materials), multiple research designs (e.g., comparing expert teachers vs. less expert teachers), and multiple analytic methods for examining the link between instruction and performance (e.g., simple comparisons of means, regression, IRT, hierarchical linear modeling – HLM). Studies using instruction-based methods have produced conflicting results about instructional sensitivity (Polikoff 2010).
Judgmental methods use experts’ judgments about tests and test items. Judgments can target (1) the alignment or congruence of the test items with learning goals, targets, or objectives (henceforth learning goals) using a simple rating of yes/no/unsure; (2) the appropriateness or suitability of test items for measuring certain learning goals using a rating scale; (3) the correspondence of items to learning goals; (4) the curricular learning goals test items appear to assess; and (5) the clarity with which the curricular learning goals tapped by a test help teachers to understand what is being assessed. Unfortunately, little to no empirical support exists for the effectiveness of examining instructional sensitivity by focusing on any of these targets.
Developing Instructionally Sensitive Tests
All the methods and approaches mentioned above focus on examining the instructional sensitivity of extant tests, not on the development of instructionally sensitive tests. Recently, an approach for developing instructionally sensitive tests has been proposed (Ruiz-Primo and Li 2008). The approach, generated by the DEISA (Developing and Evaluating Instructionally Sensitive Assessments) project, builds on the notion of variations in the proximity of assessments to the enacted curriculum (i.e., close, proximal, and distal; see Ruiz-Primo et al. 2002). At a close level, assessments are curriculum sensitive; they are close to the content and activities of the curriculum. At a proximal level, assessments consider the knowledge and skills relevant to the curriculum, but their contexts (e.g., scenarios) differ from those studied in the unit. At a distal level, assessments are based on state or national standards for a particular domain. Close assessments are assumed to be more sensitive than proximal or distal assessments to the impact of instruction. Proximal assessments are assumed to pose greater demands on students than close assessments; to succeed on these assessments, students need to transfer what they have learned to new contexts, which is likely to happen only if they have received high-quality instruction. The learning goals tapped by distal assessments most likely differ from the goals of the curriculum students studied. Large-scale assessments are distal; they are assumed to be less sensitive to the instruction received by students.
The DEISA approach proposes the idea of “bundles of triads” to develop test items. Each triad has one close item and two types of proximal items, one near proximal and one far proximal. (Since distal items are selected from state, national, and international large-scale tests, they have not been the focus of the project, which focuses on test development.) A triad is used to (1) establish, based on information on student performance on the close item, whether the learning of the concept, principle, or explanation model took place after instruction and (2) manipulate different contexts with the two types of proximal items so that some evidence can be obtained on how well students can transfer their learning as a result of the instruction. Items with different distances to the enacted curriculum are produced through variations in the questions they pose, their cognitive demands, and their contexts. Regarding the items’ questions, near proximal and far proximal item questions may be less familiar to students, compared to the questions studied in the curriculum, yet they tap the same content or inquiry process. Regarding the items’ cognitive demands, near proximal and far proximal items are designed to require students to go beyond what was studied in the curriculum, for example, by requiring students to use a pattern of reasoning that differs from that used in the curriculum activities (e.g., if a science curriculum examines causes of erosion, near proximal and far proximal items may ask about factors that can contribute to reducing erosion). Regarding the items’ contexts, near proximal and far proximal items have different scenarios from those used in the curriculum. For example, aspects of the scenarios that are changed may involve organisms, variables, and levels or values of variables.
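One way to picture the triad structure described above is as a small data model. The Python sketch below is only an illustration and is not taken from the DEISA project: the class names, field names, and item wording are hypothetical, loosely based on the erosion example in the text. It shows a triad as one close item plus a near proximal and a far proximal item tied to the same learning goal, and a bundle as a list of such triads.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Proximity(Enum):
    """Distance of an item from the enacted curriculum."""
    CLOSE = "close"
    NEAR_PROXIMAL = "near proximal"
    FAR_PROXIMAL = "far proximal"


@dataclass(frozen=True)
class Item:
    item_id: str          # hypothetical identifier
    proximity: Proximity
    question: str         # what the item asks
    context: str          # scenario in which the question is posed


@dataclass(frozen=True)
class Triad:
    """One close item and two proximal items targeting the same learning goal."""
    learning_goal: str
    close: Item
    near_proximal: Item
    far_proximal: Item

    def __post_init__(self) -> None:
        # Guard against placing an item in the wrong slot of the triad.
        expected = {
            "close": Proximity.CLOSE,
            "near_proximal": Proximity.NEAR_PROXIMAL,
            "far_proximal": Proximity.FAR_PROXIMAL,
        }
        for slot, proximity in expected.items():
            if getattr(self, slot).proximity is not proximity:
                raise ValueError(f"{slot} item must have proximity {proximity.value}")

    def items(self) -> List[Item]:
        return [self.close, self.near_proximal, self.far_proximal]


# Hypothetical triad built around the erosion example.
erosion_triad = Triad(
    learning_goal="Explain how moving water shapes land",
    close=Item("E1", Proximity.CLOSE,
               "What causes erosion?", "stream table used in the unit"),
    near_proximal=Item("E2", Proximity.NEAR_PROXIMAL,
                       "What could reduce erosion?", "stream table used in the unit"),
    far_proximal=Item("E3", Proximity.FAR_PROXIMAL,
                      "What could reduce erosion?", "hillside farm not studied in the unit"),
)

bundle: List[Triad] = [erosion_triad]  # a "bundle of triads"
```

Keeping the three items in named slots makes it straightforward to compare performance by distance within the same learning goal, which is the comparison the triad is designed to support.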
The DEISA approach has been empirically evaluated through four iterations with different science curricula. Available evidence indicates that the DEISA approach can be used to obtain information relevant to developing items that can be sensitive to the quality of instruction students received.
Information about the content and the quality of the instruction used to deliver it was collected through videotapes, interviews, questionnaires, and focus groups. Information based on the PPDI and group comparisons indicates that the approach enables developers to construct items that vary in instructional sensitivity. On average, the effect sizes (ES) of the difference between pretest and posttest scores across the tested science modules are largely consistent with the distance of the items: ES close items = 0.95, ES near proximal items = 0.71, ES far proximal items = 0.30, and ES distal items = 0.41. The one departure from the expected ordering is that distal items showed a somewhat larger effect than far proximal items. Results about the pattern linking quality of instruction and students’ performance are mixed; different measures of quality of instruction have led to different patterns. These results are consistent with findings from other studies using measures of quality of instruction (see Polikoff 2010).
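For readers unfamiliar with pretest-posttest effect sizes, the Python sketch below shows one common way such a value can be computed. It is an assumption, not the project's documented procedure: the text does not say which effect size formula was used, so the sketch uses a standardized mean difference with the pretest and posttest standard deviations pooled. The function name and the scores are hypothetical and are not DEISA data.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def pre_post_effect_size(pre: Sequence[float], post: Sequence[float]) -> float:
    """Standardized mean gain: (mean post - mean pre) / pooled SD.

    ASSUMPTION: the pooled SD is the root mean of the two sample variances.
    Other choices (e.g., dividing by the pretest SD only) are also in use.
    """
    pooled_sd = sqrt((stdev(pre) ** 2 + stdev(post) ** 2) / 2)
    return (mean(post) - mean(pre)) / pooled_sd


# Hypothetical total scores on a set of close items for three students.
close_pre = [2, 4, 6]
close_post = [4, 6, 8]

es_close = pre_post_effect_size(close_pre, close_post)
print(round(es_close, 2))  # mean gain of 2 points over a pooled SD of 2
```

Computing the same quantity separately for close, near proximal, far proximal, and distal items would yield the kind of comparison by item distance reported above.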
Importance of Instructional Sensitivity
Accountability tests are largely instructionally insensitive mainly because, due to the sampling procedures used for large-scale testing, very little of what is taught is tested. As a consequence, test results reflect socioeconomic status, general ability, or maturation rather than effective instruction. As Popham and Ryan (2012) suggested, “Clearly, if the tests being employed in these evaluations [to evaluate success of schools] are not up to the job, then many of the resultant evaluative decisions about the effectiveness of schools and teachers will be mistaken. Mistaken decisions about the caliber of schools or teachers, of course, will have both short-term and long-term harmful effects on the quality of education we supply to our students” (p. 1).
Test developers should provide empirical evidence about instructional sensitivity with the same care given to other psychometric properties of tests (e.g., discrimination and difficulty). They should plan ahead of time for studies to gather the necessary information. At a minimum, statistical approaches to measuring instructional sensitivity (e.g., PPDI and ZDIFF) should be used to provide such evidence.
More research is needed to better determine the link between quality of instruction and student performance. For now, there is a wealth of evidence indicating that instructional sensitivity is an important characteristic of criterion-referenced assessments and that, when it is lacking, the validity of decisions based on tests is threatened.
- Feldhusen JF, Hynes K, Ames CA (1976) Is a lack of instructional validity contributing to the decline of achievement test scores? Educ Technol 16(7):13–16
- McClung MS (1979) Competency testing programs: legal and educational issues, 47 Fordham L. Rev. 651. http://ir.lawnet.fordham.edu/flr/vol47/iss5/2. Accessed 24 Jan 2013
- Popham WJ, Ryan JM (2012) Determining a high-stakes test’s instructional sensitivity. Paper presented at the National Council on Measurement in Education annual meeting, Vancouver
- Ruiz-Primo MA, Li M (2008) Building a methodology for developing and evaluating instructionally sensitive assessments. Proposal submitted to National Science Foundation. Award ID: DRL-0816123. National Science Foundation, Washington, DC