A Practical Procedure for the Construction and Reliability Analysis of Fixed-Length Tests with Randomly Drawn Test Items

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 571)


This paper describes a procedure for constructing valid and fair fixed-length tests whose items are drawn at random from an item bank. The procedure gives guidelines for setting up a typical achievement test, specifically the number of items needed in the bank and the number of items available for each position in a test. It also proposes a way to calculate the relative difficulty of each individual test and to correct each student’s obtained score using the mean difficulty across all students and the difficulty of that student’s particular test. In addition, two procedures are proposed for calculating the reliability of tests with randomly drawn items. These procedures rely on specific interpretations of commonly used methods for calculating Cronbach’s alpha and KR20, and on the Spearman-Brown prediction formula. A simulation in R illustrates the accuracy of the calculation procedures and their effects on pass-fail decisions.
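
For orientation, the two textbook formulas named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the procedure proposed in the paper: it assumes a complete persons-by-items matrix of 0/1 scores, whereas the paper addresses sparse data in which each student sees only a random draw of items. The function names `kr20` and `spearman_brown` and the example data are hypothetical.

```python
from statistics import pvariance


def kr20(scores):
    """Textbook KR-20 for a complete matrix of dichotomous (0/1) scores.

    `scores` is a list of rows, one row per person, one column per item.
    Assumption: every person answered every item (no sparse data).
    """
    n_persons = len(scores)
    n_items = len(scores[0])
    # Item p-values: proportion of persons answering each item correctly.
    p_values = [sum(row[i] for row in scores) / n_persons for i in range(n_items)]
    # Sum of item variances p * (1 - p).
    item_variance_sum = sum(p * (1 - p) for p in p_values)
    # Population variance of the total scores, matching the p * (1 - p) terms.
    total_variance = pvariance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - item_variance_sum / total_variance)


def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by `length_factor`."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical example: five persons, four items.
example_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
example_kr20 = kr20(example_scores)
example_doubled = spearman_brown(example_kr20, 2)
```

For this example matrix the KR-20 value is approximately 0.80, and doubling the test length raises the predicted reliability to approximately 0.89.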


Keywords: Sparse datasets; Classical test theory; Educational measurement; P-value; Reliability



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Faculty of Psychology and Education, Department of Research and Theory in Education, VU University Amsterdam, Amsterdam, The Netherlands
  2. Faculty of Social and Behavioural Sciences, University of Amsterdam, Amsterdam, The Netherlands
