Table 3 Bible Corpus statistics

From: A massively parallel corpus: the Bible in 100 languages

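| Corpus | # Tokens | STTR (%) | Length (mean) | Length (SD) | Top-1,000 cover (%) |
| --- | --- | --- | --- | --- | --- |
| Bible-avg | 432,691 | 48.59 | 23.82 | 7.46 | 73.80 |
| Bible-eng | 789,635 | 34.42 | 28.35 | 12.58 | 88.69 |
| WSJ | 1,173,760 | 48.89 | 24.92 | 12.57 | 74.11 |
| 1984-novel | 122,644 | 47.56 | 19.99 | 15.20 | 81.89 |
| CHILDES | 366,509 | 32.17 | 4.45 | 3.04 | 93.60 |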
  1. STTR is the standardised type-token ratio. Length and SD are the mean and standard deviation of the number of tokens per verse (or per sentence for the other corpora). Bible-avg is the (macro) average over all languages in the corpus; WSJ is the Wall Street Journal portion of the Penn Treebank (Marcus et al. 1993); 1984-novel is George Orwell's novel 1984, which is part of the MULTEXT-East corpus (Erjavec 2004); CHILDES (MacWhinney 2000) is a corpus of child-directed speech utterances.
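As a rough illustration of how statistics of this kind can be computed, here is a minimal Python sketch. It is not the authors' code and rests on several assumptions that the table does not confirm: STTR is taken as the mean type-token ratio over consecutive non-overlapping windows (the 1,000-token default is a common convention, not stated here), "Top-1,000 cover" is read as the share of tokens accounted for by the 1,000 most frequent types, and the population standard deviation is used. The function names are hypothetical, and inputs are assumed to be non-empty.

```python
from collections import Counter
from statistics import mean, pstdev


def sttr(tokens, window=1000):
    """Mean type-token ratio (%) over consecutive non-overlapping windows.

    The window size is an assumption; a trailing partial window is dropped.
    """
    chunks = [tokens[i:i + window]
              for i in range(0, len(tokens) - window + 1, window)]
    if not chunks:  # text shorter than one window: fall back to plain TTR
        chunks = [tokens]
    return 100 * mean(len(set(c)) / len(c) for c in chunks)


def top_k_cover(tokens, k=1000):
    """Share (%) of tokens covered by the k most frequent types (assumed reading)."""
    counts = Counter(tokens)
    covered = sum(n for _, n in counts.most_common(k))
    return 100 * covered / len(tokens)


def length_stats(units):
    """Mean and population SD of tokens per unit (verse or sentence)."""
    lengths = [len(u) for u in units]
    return mean(lengths), pstdev(lengths)
```

Given a corpus as a list of tokenised verses, `length_stats(verses)` would yield the two length columns, while `sttr` and `top_k_cover` would be applied to the flattened token list. Tokenisation choices (case folding, punctuation handling) affect all of these figures and are left open here.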