Methodological issues for building item banks and computerized adaptive scales

Thissen, David; Reeve, Bryce B.; Bjorner, Jakob Bue; Chang, Chih-Hung

doi:10.1007/s11136-007-9169-5

Original Paper
Published: 10 February 2007

Quality of Life Research, Volume 16, pages 109–119 (2007)

David Thissen¹,
Bryce B. Reeve²,
Jakob Bue Bjorner³,⁴ &
Chih-Hung Chang⁵

Abstract

This paper reviews important methodological considerations for developing item banks and computerized adaptive scales (commonly called computerized adaptive tests in the educational measurement literature, yielding the acronym CAT), including issues of the reference population, dimensionality, dichotomous versus polytomous response scales, differential item functioning (DIF) and conditional scoring, mode effects, the impact of local dependence, and innovative approaches to assessment using CATs in health outcomes research.

Author information

Authors and Affiliations

University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
David Thissen
National Cancer Institute, NIH, Bethesda, MD, USA
Bryce B. Reeve
QualityMetric Inc, Lincoln, RI, USA
Jakob Bue Bjorner
Health Assessment Lab, Waltham, MA, USA
Jakob Bue Bjorner
Northwestern University Feinberg School of Medicine, Chicago, IL, USA
Chih-Hung Chang

Corresponding author

Correspondence to David Thissen.

Additional information

Thanks to David J. Weiss for extremely useful comments on an earlier draft. Any errors that remain are, of course, our own.

About this article

Cite this article

Thissen, D., Reeve, B.B., Bjorner, J.B. et al. Methodological issues for building item banks and computerized adaptive scales. Qual Life Res 16 (Suppl 1), 109–119 (2007). https://doi.org/10.1007/s11136-007-9169-5

Received: 29 August 2006
Accepted: 22 December 2006
Issue Date: August 2007
DOI: https://doi.org/10.1007/s11136-007-9169-5
