International Handbook of Educational Evaluation, pp. 489–531
Psychometric Principles in Student Assessment
Abstract
“Validity, reliability, comparability, and fairness are not just measurement issues, but social values that have meaning and force outside of measurement wherever evaluative judgments and decisions are made” (Messick, 1994, p. 2).
Keywords
Measurement Model, Differential Item Functioning, Item Response Theory, True Score, Item Response Theory Model
References
- Adams, R., Wilson, M.R., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23.
- Almond, R.G., & Mislevy, R.J. (1999). Graphical models and computerized adaptive testing. Applied Psychological Measurement, 23, 223–237.
- American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
- Anderson, J.R., Boyle, C.F., & Corbett, A.T. (1990). Cognitive modeling and intelligent tutoring. Artificial Intelligence, 42, 7–49.
- Bennett, R.E. (2001). How the internet will help large-scale assessment reinvent itself. Education Policy Analysis Archives, 9(5). Retrieved from http://epaa.asu.edu/epaa/v9n5.html
- Bradlow, E.T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.
- Brennan, R.L. (1983). The elements of generalizability theory. Iowa City, IA: American College Testing Program.
- Brennan, R.L. (2001). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38(4), 295–317.
- Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
- Bryk, A.S., & Raudenbush, S. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage.
- Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
- Cohen, J.A. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
- Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
- Cronbach, L.J. (1989). Construct validation after thirty years. In R.L. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147–171). Urbana, IL: University of Illinois Press.
- Cronbach, L.J., Gleser, G.C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
- Cronbach, L.J., & Meehl, P.E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
- Dayton, C.M. (1999). Latent class scaling analysis. Thousand Oaks, CA: Sage.
- Dibello, L.V., Stout, W.F., & Roussos, L.A. (1995). Unified cognitive/psychometric diagnostic assessment likelihood based classification techniques. In P. Nichols, S. Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 361–389). Hillsdale, NJ: Erlbaum.
- Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
- Embretson, S.E. (1998). A cognitive design systems approach to generating valid tests: Application to abstract reasoning. Psychological Methods, 3, 380–396.
- Ercikan, K. (1998). Translation effects in international assessments. International Journal of Educational Research, 29, 543–553.
- Ercikan, K., & Julian, M. (2002). Classification accuracy of assigning student performance to proficiency levels: Guidelines for assessment design. Applied Measurement in Education, 15, 269–294.
- Falmagne, J.-C., & Doignon, J.-P. (1988). A class of stochastic procedures for the assessment of knowledge. British Journal of Mathematical and Statistical Psychology, 41, 1–23.
- Fischer, G.H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359–374.
- Gelman, A., Carlin, J., Stern, H., & Rubin, D.B. (1995). Bayesian data analysis. London: Chapman & Hall.
- Greeno, J.G., Collins, A.M., & Resnick, L.B. (1996). Cognition and learning. In D.C. Berliner & R.C. Calfee (Eds.), Handbook of educational psychology (pp. 15–46). New York: Macmillan.
- Gulliksen, H. (1950/1987). Theory of mental tests. New York: John Wiley/Hillsdale, NJ: Lawrence Erlbaum.
- Haertel, E.H. (1989). Using restricted latent class models to map the skill structure of achievement test items. Journal of Educational Measurement, 26, 301–321.
- Haertel, E.H., & Wiley, D.E. (1993). Representations of ability structures: Implications for testing. In N. Frederiksen, R.J. Mislevy, & I.I. Bejar (Eds.), Test theory for a new generation of tests. Hillsdale, NJ: Lawrence Erlbaum.
- Hambleton, R.K. (1989). Principles and selected applications of item response theory. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 147–200). Phoenix, AZ: American Council on Education/Oryx Press.
- Hambleton, R.K., & Slater, S.C. (1997). Reliability of credentialing examinations and the impact of scoring models and standard-setting policies. Applied Measurement in Education, 10, 19–39.
- Holland, P.W., & Thayer, D.T. (1988). Differential item performance and the Mantel-Haenszel procedures. In H. Wainer & H.I. Braun (Eds.), Test validity (pp. 129–145). Hillsdale, NJ: Lawrence Erlbaum.
- Holland, P.W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.
- Jöreskog, K.G., & Sörbom, D. (1979). Advances in factor analysis and structural equation models. Cambridge, MA: Abt Books.
- Kadane, J.B., & Schum, D.A. (1996). A probabilistic analysis of the Sacco and Vanzetti evidence. New York: Wiley.
- Kane, M.T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
- Kelley, T.L. (1927). Interpretation of educational measurements. New York: World Book.
- Kuder, G.F., & Richardson, M.W. (1937). The theory of estimation of test reliability. Psychometrika, 2, 151–160.
- Lane, S., Wang, N., & Magone, M. (1996). Gender-related differential item functioning on a middle-school mathematics performance assessment. Educational Measurement: Issues and Practice, 15(4), 21–27, 31.
- Lazarsfeld, P.F. (1950). The logical and mathematical foundation of latent structure analysis. In S.A. Stouffer, L. Guttman, E.A. Suchman, P.F. Lazarsfeld, S.A. Star, & J.A. Clausen (Eds.), Measurement and prediction (pp. 362–412). Princeton, NJ: Princeton University Press.
- Levine, M., & Drasgow, F. (1982). Appropriateness measurement: Review, critique, and validating studies. British Journal of Mathematical and Statistical Psychology, 35, 42–56.
- Linacre, J.M. (1989). Many faceted Rasch measurement. Doctoral dissertation, University of Chicago.
- Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
- Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
- Martin, J.D., & VanLehn, K. (1995). A Bayesian approach to cognitive assessment. In P. Nichols, S. Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 141–165). Hillsdale, NJ: Erlbaum.
- Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 13–103). New York: American Council on Education/Macmillan.
- Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.
- Messick, S., Beaton, A.E., & Lord, F.M. (1983). National Assessment of Educational Progress reconsidered: A new design for a new era (NAEP Report 83-1). Princeton, NJ: National Assessment of Educational Progress.
- Mislevy, R.J., Steinberg, L.S., & Almond, R.G. (in press). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives.
- Mislevy, R.J., Steinberg, L.S., Almond, R.G., Haertel, G., & Penuel, W. (in press). Leverage points for improving educational assessment. In B. Means & G. Haertel (Eds.), Evaluating the effects of technology in education. Hillsdale, NJ: Erlbaum.
- Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (1999). A cognitive task analysis, with implications for designing a simulation-based assessment system. Computers in Human Behavior, 15, 335–374.
- Mislevy, R.J., Steinberg, L.S., Breyer, F.J., Almond, R.G., & Johnson, L. (in press). Making sense of data from complex assessment. Applied Measurement in Education.
- Myford, C.M., & Mislevy, R.J. (1995). Monitoring and improving a portfolio assessment system (Center for Performance Assessment Research Report). Princeton, NJ: Educational Testing Service.
- National Research Council (1999). How people learn: Brain, mind, experience, and school. Committee on Developments in the Science of Learning. Bransford, J.D., Brown, A.L., & Cocking, R.R. (Eds.). Washington, DC: National Academy Press.
- National Research Council (2001). Knowing what students know: The science and design of educational assessment. Committee on the Foundations of Assessment. Pellegrino, J., Chudowsky, N., & Glaser, R. (Eds.). Washington, DC: National Academy Press.
- O'Neil, K.A., & McPeek, W.M. (1993). Item and test characteristics that are associated with differential item functioning. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 255–276). Hillsdale, NJ: Erlbaum.
- Patz, R.J., & Junker, B.W. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342–366.
- Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 221–262). New York: American Council on Education/Macmillan.
- Pirolli, P., & Wilson, M. (1998). A theory of the measurement of knowledge content, access, and learning. Psychological Review, 105, 58–82.
- Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research/Chicago: University of Chicago Press (reprint).
- Reckase, M. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9, 401–412.
- Rogosa, D.R., & Ghandour, G.A. (1991). Statistical models for behavioral observations (with discussion). Journal of Educational Statistics, 16, 157–252.
- Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17, 34 (No. 4, Part 2).
- Samejima, F. (1973). Homogeneous case of the continuous response level. Psychometrika, 38, 203–219.
- Schum, D.A. (1987). Evidence and inference for the intelligence analyst. Lanham, MD: University Press of America.
- Schum, D.A. (1994). The evidential foundations of probabilistic reasoning. New York: Wiley.
- SEPUP (1995). Issues, evidence, and you: Teacher's guide. Berkeley: Lawrence Hall of Science.
- Shavelson, R.J., & Webb, N.M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
- Spearman, C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
- Spearman, C. (1910). Correlation calculated with faulty data. British Journal of Psychology, 3, 271–295.
- Spiegelhalter, D.J., Thomas, A., Best, N.G., & Gilks, W.R. (1995). BUGS: Bayesian inference using Gibbs sampling, Version 0.50. Cambridge: MRC Biostatistics Unit.
- Tatsuoka, K.K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M.G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453–488). Hillsdale, NJ: Erlbaum.
- Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567–577.
- Toulmin, S. (1958). The uses of argument. Cambridge, England: Cambridge University Press.
- Traub, R.E., & Rowley, G.L. (1980). Reliability of test scores and decisions. Applied Psychological Measurement, 4, 517–545.
- van der Linden, W.J. (1998). Optimal test assembly. Applied Psychological Measurement, 22, 195–202.
- van der Linden, W.J., & Hambleton, R.K. (1997). Handbook of modern item response theory. New York: Springer.
- Wainer, H., Dorans, N.J., Flaugher, R., Green, B.F., Mislevy, R.J., Steinberg, L., & Thissen, D. (2000). Computerized adaptive testing: A primer (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
- Wainer, H., & Kiely, G.L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 195–201.
- Wiley, D.E. (1991). Test validity and invalidity reconsidered. In R.E. Snow & D.E. Wiley (Eds.), Improving inquiry in social science (pp. 75–107). Hillsdale, NJ: Erlbaum.
- Willingham, W.W., & Cole, N.S. (1997). Gender and fair assessment. Mahwah, NJ: Lawrence Erlbaum.
- Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13, 181–208.
- Wolf, D., Bixby, J., Glenn, J., & Gardner, H. (1991). To use their minds well: Investigating new forms of student assessment. In G. Grant (Ed.), Review of Research in Education, Vol. 17 (pp. 31–74). Washington, DC: American Educational Research Association.
- Wright, B.D., & Masters, G.N. (1982). Rating scale analysis. Chicago: MESA Press.
- Yen, W.M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213.
Copyright information
© Kluwer Academic Publishers 2003