Achieving a Stable Scale for an Assessment with Multiple Forms: Weighting Test Samples in IRT Linking

  • Jiahe Qian
  • Alina A. von Davier
  • Yanming Jiang
Conference paper
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 66)


In the quality control of an assessment with multiple forms, one goal is to maintain a stable scale over time. Variability and seasonality across examinee samples and test conditions can introduce variation into IRT linking and equating procedures and distort the “sampling exchangeability” assumed in the Draper–Lindley–de Finetti (DLD) framework of measurement validity. As an initial exploration of optimal design in linking, we sought an improved sampling design that yields invariant Stocking–Lord test characteristic curve (TCC) linking across testing seasons. We applied statistical weighting techniques, such as raking and poststratification, to produce a weighted sample distribution consistent with the target population distribution. To assess the effects of weighting on linking, we first drew multiple subsamples from an original sample and then compared the linking parameters estimated from the subsamples with those from the original sample. The results showed that the linking parameters from the weighted samples yielded smaller mean square errors (MSE) than those from the unweighted subsamples. The techniques developed here can be applied to (1) assessments, such as the GRE® and TOEFL®, with variability and seasonality among multiple forms, and (2) assessments, such as state assessments, whose linking decisions are based on small initial samples.
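The raking procedure mentioned above (iterative proportional fitting) adjusts sample weights so that the weighted marginal distributions of chosen covariates match known population margins. The following is a minimal sketch of that idea, not the authors' implementation; the two-way table, the covariates (e.g., gender by region), and the margin values are hypothetical illustrations.

```python
import numpy as np

def rake(counts, row_targets, col_targets, max_iter=100, tol=1e-8):
    """Iterative proportional fitting: return a weight for each cell of
    `counts` so that the weighted row/column totals match the targets."""
    counts = np.asarray(counts, dtype=float)
    weights = np.ones_like(counts)
    for _ in range(max_iter):
        # Scale weights so weighted row totals match the row targets.
        row_sums = (weights * counts).sum(axis=1, keepdims=True)
        weights *= row_targets.reshape(-1, 1) / row_sums
        # Scale weights so weighted column totals match the column targets.
        col_sums = (weights * counts).sum(axis=0, keepdims=True)
        weights *= col_targets.reshape(1, -1) / col_sums
        # Column margins now match exactly; stop once rows also agree.
        if np.allclose((weights * counts).sum(axis=1), row_targets, atol=tol):
            break
    return weights

# Hypothetical observed sample counts (rows: gender, cols: region).
sample = np.array([[30., 20.],
                   [25., 25.]])
# Known population margins (both must sum to the same grand total).
row_targets = np.array([60., 40.])
col_targets = np.array([45., 55.])

w = rake(sample, row_targets, col_targets)
weighted = w * sample
```

After convergence, `weighted` has the target margins, so weighted examinee statistics reflect the target population distribution rather than the seasonal sample composition. Poststratification is the special case in which the full joint population distribution (not just the margins) is known.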



The authors thank Jim Carlson, Shelby Haberman, Kentaro Yamamoto, Frank Rijmen, Xueli Xu, Tim Moses, and Matthias von Davier for their suggestions and comments. The authors also thank Shuhong Li and Jill Carey for their assistance in assembling data and Kim Fryer for editorial help. Any opinions expressed in this paper are those of the authors and not necessarily those of Educational Testing Service.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Jiahe Qian (1)
  • Alina A. von Davier (2)
  • Yanming Jiang (2)
  1. Research and Development, ETS, Princeton, USA
  2. ETS, Princeton, USA
