Abstract
This paper reviews important methodological considerations for developing item banks and computerized adaptive scales (commonly called computerized adaptive tests in the educational measurement literature, yielding the acronym CAT), including issues of the reference population, dimensionality, dichotomous versus polytomous response scales, differential item functioning (DIF) and conditional scoring, mode effects, the impact of local dependence, and innovative approaches to assessment using CATs in health outcomes research.
Additional information
Thanks to David J. Weiss for extremely useful comments on an earlier draft. Any errors that remain are, of course, our own.
About this article
Cite this article
Thissen, D., Reeve, B.B., Bjorner, J.B. et al. Methodological issues for building item banks and computerized adaptive scales. Qual Life Res 16 (Suppl 1), 109–119 (2007). https://doi.org/10.1007/s11136-007-9169-5
DOI: https://doi.org/10.1007/s11136-007-9169-5