
Automated Item Generation with Recurrent Neural Networks


Utilizing technology for automated item generation is not a new idea. However, test items used in commercial testing programs or in research are still predominantly written by humans, in most cases by content experts or professional item writers. Human experts are a limited resource, and testing agencies incur high costs in the continuous renewal of item banks needed to sustain testing programs. Using algorithms instead holds the promise of providing unlimited resources for this crucial part of assessment development. The approach presented here deviates in several ways from previous attempts to solve this problem. In the past, automatic item generation relied either on generating clones of narrowly defined item types, such as those found in language-free intelligence tests (e.g., Raven's Progressive Matrices), or on an extensive analysis of task components and the derivation of schemata to produce items with pre-specified variability that are hoped to have predictable levels of difficulty. Researchers committed to these previous approaches may view the proposed approach with skepticism; however, recent applications of machine learning have succeeded at tasks that seemed impossible for machines not long ago. The proposed approach uses deep learning to implement probabilistic language models, not unlike those used by Google Brain and Amazon Alexa for language processing and generation.
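The core of such a probabilistic language model can be illustrated with a minimal sketch: a recurrent network that emits one character at a time by sampling from a softmax distribution over a vocabulary, feeding each sampled character back as the next input. The vocabulary, hidden-state size, and randomly initialized (untrained) weights below are illustrative assumptions only; the models described in the paper are LSTM networks trained on corpora of existing items.

```python
import numpy as np

# Toy character vocabulary; a real item-generation model would be trained
# on the characters occurring in a corpus of personality items.
vocab = list("abcdefghijklmnopqrstuvwxyz .")
V = len(vocab)
H = 32                      # hidden-state size (assumed, for illustration)
rng = np.random.default_rng(0)

# Randomly initialized weights of a vanilla RNN; a trained model would
# learn these via backpropagation through time.
Wxh = rng.normal(0.0, 0.1, (H, V))   # input -> hidden
Whh = rng.normal(0.0, 0.1, (H, H))   # hidden -> hidden (recurrence)
Why = rng.normal(0.0, 0.1, (V, H))   # hidden -> output logits

def sample(n_chars: int, seed_idx: int = 0) -> str:
    """Generate text by sampling one character at a time from the softmax."""
    h = np.zeros(H)
    x = np.zeros(V)
    x[seed_idx] = 1.0                 # one-hot encoding of the seed character
    out = []
    for _ in range(n_chars):
        h = np.tanh(Wxh @ x + Whh @ h)          # recurrent state update
        logits = Why @ h
        p = np.exp(logits - logits.max())       # numerically stable softmax
        p /= p.sum()
        idx = rng.choice(V, p=p)                # sample next character
        out.append(vocab[idx])
        x = np.zeros(V)
        x[idx] = 1.0                            # feed the sample back in
    return "".join(out)

text = sample(40)
```

With untrained weights the output is gibberish; training the weights on item text is what makes the sampled sequences resemble plausible test items.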




  1. Thanks to an excellent suggestion received from reviewers of the first draft, it was decided to collect actual response data using automatically generated items and to compare these to response data from published human-generated personality items. Additional experiments with dropout, another reviewer suggestion, which allows networks to be trained with a form of regularization, will be conducted in future research.




Author information



Corresponding author

Correspondence to Matthias von Davier.



Cite this article

von Davier, M. Automated Item Generation with Recurrent Neural Networks. Psychometrika 83, 847–857 (2018).



  • deep learning
  • neural networks
  • automatic item generation
  • machine learning