Finding an appropriate lexical diversity measurement for a small-sized corpus and its application to a comparative study of L2 learners’ writings


The present study compares four lexical diversity measurements. A computational experiment combining corpus processing and statistical testing was conducted to identify the most effective measurement for evaluating small corpora of 350 to 550 words. The results show that the D-estimate is the most appropriate of the four measurements compared. The D-estimate also gave more stable results than the other measurements when the number of words varied between texts. The D-estimate was then applied to measure the morphological and grammatical diversity of writings by L2 learners of Korean, and a statistical test was conducted on whether the learners' mother tongues affect the degree of acquisition of grammatical morphemes. The test indicates that the native languages of the learners did not have a significant impact.
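To make the comparison concrete, the following is a minimal Python sketch of the four measures listed in the keywords (TTR, Guiraud's R, Yule's K, and the D-estimate). It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the input is assumed to be an already tokenized list, and the D-estimate follows the commonly described sampling procedure (mean TTR over random samples of 35 to 50 tokens, fitted to the curve TTR = (D/N)(sqrt(1 + 2N/D) - 1)), with a simple grid search swapped in for a proper curve-fitting routine.

```python
import math
import random
from collections import Counter


def ttr(tokens):
    """Type-token ratio: number of distinct types V divided by token count N."""
    return len(set(tokens)) / len(tokens)


def guiraud_r(tokens):
    """Guiraud's R: V / sqrt(N), which damps the text-length effect of TTR."""
    return len(set(tokens)) / math.sqrt(len(tokens))


def yule_k(tokens):
    """Yule's K: 10^4 * (sum_i i^2 * V_i - N) / N^2.

    V_i is the number of types that occur exactly i times.
    Higher K means more repetition, i.e. lower diversity.
    """
    n = len(tokens)
    spectrum = Counter(Counter(tokens).values())  # i -> V_i
    s2 = sum(i * i * v_i for i, v_i in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


def _model_ttr(d, n):
    """Theoretical TTR for sample size n under diversity parameter d."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)


def d_estimate(tokens, sizes=range(35, 51), trials=100, seed=0):
    """Estimate D by fitting the model curve to mean sampled TTRs.

    Assumptions (not taken from the paper): sample sizes 35..50,
    100 samples per size, and a grid search over D in [0.1, 200].
    """
    if len(tokens) < max(sizes):
        raise ValueError("need at least as many tokens as the largest sample size")
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = {}
    for n in sizes:
        # Mean TTR over random samples of n tokens drawn without replacement.
        observed[n] = sum(ttr(rng.sample(tokens, n)) for _ in range(trials)) / trials

    def squared_error(d):
        return sum((_model_ttr(d, n) - y) ** 2 for n, y in observed.items())

    grid = [x / 10 for x in range(1, 2001)]
    return min(grid, key=squared_error)
```

Because TTR falls as a text grows, two texts of different lengths are hard to compare with it directly. The D-estimate samples fixed-size subsets before fitting, which is one plausible reason for the stability across text lengths reported above.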




Author information



Correspondence to HwaYoung Jeong.

Cite this article

Choi, W., Jeong, H. Finding an appropriate lexical diversity measurement for a small-sized corpus and its application to a comparative study of L2 learners’ writings. Multimed Tools Appl 75, 13015–13022 (2016).

Keywords


  • L2 learning
  • TTR (Type-Token Ratio)
  • D-estimate
  • Yule’s K
  • Guiraud’s R
  • Lexical diversity