Measuring Language Complexity Using Word Embeddings

  • Peter A. Whigham
  • Mansi Chugh
  • Grant Dick
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11320)


The analysis of word patterns from a corpus has previously been examined using a number of different word embedding models. These models create a numeric representation of word co-occurrence and are able to capture some of the syntactic and semantic relationships of words in a document. Language complexity has been assessed for many years using simple indexes and basic statistical properties (word frequency, etc.); however, little work has been done on using word embeddings to develop language complexity measures. This paper describes preliminary work on measuring language complexity using clustered word embeddings to produce network transition models. The structural measures of these transition networks are shown to represent basic properties of language complexity and may be used to infer some aspects of the underlying generative grammar.
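As a rough illustration of the idea of a transition network over word clusters, the following minimal sketch assumes each word has already been assigned to a cluster (for instance by clustering word2vec vectors). The cluster assignments, the sentences, the function names, and the choice of density as a structural measure are all hypothetical placeholders for illustration, not the paper's actual data, procedure, or measures.

```python
from collections import Counter

# Hypothetical cluster assignments; in practice these would come from
# clustering word embeddings (e.g. k-means over word2vec vectors).
word_to_cluster = {
    "the": 0, "a": 0,
    "cat": 1, "dog": 1,
    "sees": 2, "chases": 2,
}

def transition_counts(sentences, word_to_cluster):
    """Count cluster-to-cluster transitions between adjacent words."""
    counts = Counter()
    for sentence in sentences:
        # Words without a cluster are skipped, so their neighbours
        # are treated as adjacent (a simplifying assumption).
        clusters = [word_to_cluster[w] for w in sentence.split()
                    if w in word_to_cluster]
        counts.update(zip(clusters, clusters[1:]))
    return counts

def network_measures(counts):
    """Simple structural measures of the directed transition network."""
    nodes = {c for edge in counts for c in edge}
    n_nodes, n_edges = len(nodes), len(counts)
    # Density of a directed graph with self-loops allowed: edges / n^2.
    density = n_edges / (n_nodes ** 2) if n_nodes else 0.0
    return {"nodes": n_nodes, "edges": n_edges, "density": density}

sentences = ["the cat sees a dog", "a dog chases the cat"]
counts = transition_counts(sentences, word_to_cluster)
measures = network_measures(counts)
print(measures)
```

In this toy setting, a text whose cluster sequence uses more distinct transitions would yield a denser network, which is one plausible way a structural measure could track complexity.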


Word embedding; Language complexity; Word2vec; Grammar; Network



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Information Science, University of Otago, Dunedin, New Zealand
