Methodological issues for building item banks and computerized adaptive scales

  • Original Paper
Quality of Life Research

Abstract

This paper reviews important methodological considerations for developing item banks and computerized adaptive scales (commonly called computerized adaptive tests in the educational measurement literature, yielding the acronym CAT), including issues of the reference population, dimensionality, dichotomous versus polytomous response scales, differential item functioning (DIF) and conditional scoring, mode effects, the impact of local dependence, and innovative approaches to assessment using CATs in health outcomes research.

Author information

Correspondence to David Thissen.

Additional information

Thanks to David J. Weiss for extremely useful comments on an earlier draft. Any errors that remain are, of course, our own.

About this article

Cite this article

Thissen, D., Reeve, B.B., Bjorner, J.B. et al. Methodological issues for building item banks and computerized adaptive scales. Qual Life Res 16 (Suppl 1), 109–119 (2007). https://doi.org/10.1007/s11136-007-9169-5

  • DOI: https://doi.org/10.1007/s11136-007-9169-5
