Lexical sophistication is an important consideration in fields such as educational psychology, cognitive science, and artificial intelligence, where text complexity, learning trends, and language production are important areas of study. However, lexical sophistication can be measured in a number of different ways, with perhaps the most common measure being word frequency. Word frequency measures calculate how frequently a word occurs in general usage, as measured by a representative corpus such as the British National Corpus (BNC; BNC Consortium, 2007), the Corpus of Contemporary American English (COCA; Davies, 2010), or SUBTLEXus (Brysbaert & New, 2009). Word frequency has been shown to be a strong predictor of lexical-response and word-naming times (Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004; Forster & Chambers, 1973; Frederiksen & Kroll, 1976), as well as to be strongly correlated with a number of related developmental constructs, such as writing and speaking proficiency (Kyle & Crossley, 2015; Laufer & Nation, 1995; McNamara, Crossley, Roscoe, Allen, & Dai, 2015), and text complexity considerations, such as reading difficulty (Crossley, Dufty, McCarthy, & McNamara, 2007; Nation, 2006). Recent research, however, has suggested that features other than lexical frequency may explain lexical knowledge and development better than word frequency does (Adelman, Brown, & Quesada, 2006; Crossley, Salsbury, & McNamara, 2012; Johns & Jones, 2008; Kyle & Crossley, 2015; McDonald & Shillcock, 2001).

In this article, we introduce and test the reliability of the second version of the Tool for the Analysis of Lexical Sophistication (TAALES 2.0). TAALES 1.0 (Kyle & Crossley, 2015) was developed to provide researchers with a freely available tool that would automatically calculate a variety of classic and new indices of lexical sophistication. It included lexical features related to word frequency, word range, n-gram frequency, academic language, and psycholinguistic word properties. The second iteration of TAALES increases the breadth and depth of the available indices reported by TAALES by expanding the word frequency, word range, and n-gram frequency indices, and by adding indices related to word recognition norms, contextual distinctiveness, word neighborhood, semantic network, n-gram range, and n-gram strength of association. To help validate the indices reported in TAALES 2.0, we present two studies. In the first study, indices of lexical sophistication are used to model human judgments of lexical proficiency in free writes. In the second study, the TAALES 2.0 indices are used to model word-choice ratings in narrative essays.

TAALES 1.0

Given the importance of lexical sophistication in a number of fields as well as the relative difficulty of accessing methods for the assessment of lexical sophistication beyond word frequency and/or type–token ratios, we developed TAALES 1.0 (Kyle & Crossley, 2015). The most recent version of TAALES 1.0 (version 1.4) included 104 indices, related to word frequency, word range, n-gram frequency, academic language, and psycholinguistic word information. TAALES versions 1.0–1.4 have been used in a variety of domains, including the assessment of written lexical proficiency, speaking proficiency, and writing quality (Allen, Crossley, & McNamara, 2015; Allen & McNamara, 2015; Jung, Crossley, & McNamara, 2015; Kyle & Crossley, 2015); modeling lexical development (Crossley, Kyle, & Salsbury, 2016); identifying satire in product reviews (Skalicky & Crossley, 2015); and indexing humor in academic writing (Skalicky, Berger, Crossley, & McNamara, 2016). Below we provide a brief description of the constructs covered in TAALES 1.4 (see Kyle & Crossley, 2015, for a comprehensive treatment).

Word frequency

Word frequency refers to the number of times a word occurs in a corpus of texts. Words that are less frequent in a reference corpus (e.g., edifice, cuisine, egregious) are considered more sophisticated than words that occur frequently (e.g., building, food, bad). A great deal of research has demonstrated the relationship between the frequency of lexical items in normal language use and lexical sophistication. Reading research has demonstrated that texts that include less frequent lexical items tend to be considered more difficult (Crossley et al., 2007; Nation, 2006). Writing research has indicated that essays that include less frequent lexical items tend to be considered of higher quality (Guo, Crossley, & McNamara, 2013; Laufer & Nation, 1995; McNamara et al., 2015), and similar findings have been observed with regard to written lexical proficiency and speaking proficiency (Kyle & Crossley, 2015). TAALES 1.4 includes 36 frequency indices derived from the BNC (BNC Consortium, 2007), the Brown verbal frequency list (Brown, 1984; Svartvik & Quirk, 1980), the Kučera–Francis written frequency list (Kučera & Francis, 1967), SUBTLEXus (Brysbaert & New, 2009), and the Thorndike–Lorge written frequency list (Thorndike & Lorge, 1944).

Word range

Range refers to the number of texts in a corpus in which a particular item occurs. Although a robust relationship between frequency and language proficiency has been established, a word's frequency value may be inflated when a technical word occurs very often in a small set of documents in a given corpus but is otherwise extremely infrequent in general language usage. Range norms help control for this inflation, and may provide a better approximation of an individual’s exposure to a particular word. In the written portion of the BNC, for example, the words next, four, and cent have similar frequencies (approximately 381 times per million words). The words next and four also occur in a wide range of texts (approximately 90% of the texts in the written BNC), but the word cent occurs in far fewer texts (approximately 50%). This suggests that despite similar frequencies, next and four may be more likely to be encountered than cent. Range indices have recently been used to model writing quality (Kyle & Crossley, 2016), speaking proficiency (Kyle & Crossley, 2015), and lexical proficiency (Kyle & Crossley, 2015), and to explain variance in lexical-decision times (Adelman et al., 2006). Generally, words that occur in fewer contexts are considered more sophisticated. Accordingly, language samples that on average include words with a more restricted range tend to earn higher quality/proficiency scores. TAALES 1.4 includes 18 range indices derived from the BNC, Kučera–Francis written frequencies, and SUBTLEXus.
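The distinction between frequency and range can be illustrated with a minimal Python sketch. This is not the TAALES implementation: the function name `frequency_and_range`, the whitespace tokenization, and the four-document toy corpus are assumptions made only for illustration. The toy corpus is constructed so that three words share the same frequency but differ in range, mirroring the next/four/cent example.

```python
from collections import Counter

def frequency_and_range(documents):
    """Return (frequency per million tokens, proportion of documents) per word.

    Hypothetical helper: tokenization is a naive lowercase whitespace split.
    """
    token_counts = Counter()   # total occurrences of each word
    doc_counts = Counter()     # number of documents containing each word
    for doc in documents:
        tokens = doc.lower().split()
        token_counts.update(tokens)
        doc_counts.update(set(tokens))  # count each word once per document
    total_tokens = sum(token_counts.values())
    n_docs = len(documents)
    freq_per_million = {w: c / total_tokens * 1_000_000
                        for w, c in token_counts.items()}
    doc_range = {w: c / n_docs for w, c in doc_counts.items()}
    return freq_per_million, doc_range

# Toy corpus (invented): "next", "four", and "cent" each occur three times.
toy_corpus = [
    "next we met four friends",
    "the next day four birds sang",
    "cent cent cent",
    "four of us left next",
]
freq, rng = frequency_and_range(toy_corpus)
for word in ("next", "four", "cent"):
    print(word, round(freq[word]), rng[word])
```

In this toy corpus the three words have identical frequencies, but cent is confined to a single document, so its range is much lower than that of next and four.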

Academic language

Learning academic language is an important part of academic socialization (Hyland, 2009). Academic language includes words and phrases that occur frequently in academic contexts, but infrequently in general language use. Perhaps the most influential list of academic language is Coxhead’s (2000) academic word list (AWL), which has been integrated into English for academic purposes (EAP) research and pedagogy (Coxhead, 2011). A similar list of academic multiword formulas (the Academic Formulas List [AFL]) was developed by Simpson-Vlach and Ellis (2010). A higher proportion of academic language in a text would lead to a more sophisticated text. However, the few studies that have employed AFL and AWL indices have failed to find a relationship between the use of academic language and writing proficiency/lexical sophistication (Kyle & Crossley, 2015). TAALES 1.4 included 15 indices related to academic language derived from the AWL and the AFL.

N-gram frequency

The individual word has a long history as the unit of investigation in vocabulary and lexical development studies (e.g., Laufer & Nation, 1995; Nation, 2006). Recent research, however, has begun to highlight the importance of multiword units (Biber, Conrad, & Cortes, 2004; O’Donnell, Römer, & Ellis, 2013) in language development. N-gram frequencies have recently been used to model writing quality (Bestgen & Granger, 2014; Crossley, Cai, & McNamara, 2012; Kyle & Crossley, 2016), speaking proficiency (Kyle & Crossley, 2015), and to model scores of lexical proficiency (Kyle & Crossley, 2015). N-grams such as the end of, out of the, and a lot of occur frequently, whereas n-grams such as now not only, time some of, and is about being occur much less frequently. Across these assessment contexts, n-gram frequencies have generally been positively correlated with proficiency/quality scores (cf. Crossley, Cai, et al. 2012), suggesting that an indicator of linguistic development is the knowledge of how words tend to be combined. TAALES 1.4 includes 12 indices related to n-gram frequency derived from the BNC.

Psycholinguistic word information

The psycholinguistic properties of words have been of interest to psycholinguists, cognitive scientists, and second language acquisition researchers for some time (Coltheart, 1981; Crossley et al., 2016; Crossley, Weston, Sullivan, & McNamara, 2011; Toglia & Battig, 1978). Word properties such as concreteness, familiarity, meaningfulness, imageability, and age of acquisition have been used to model writing quality scores (Crossley & McNamara, 2011; Guo et al., 2013), speaking proficiency (Crossley & McNamara, 2013; Kyle & Crossley, 2015), lexical proficiency (Crossley, Salsbury, & McNamara, 2012; Kyle & Crossley, 2015), lexical-decision times (Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012), and word associations (Altarriba, Bauer, & Benvenuto, 1999). TAALES 1.4 includes 21 indices related to concreteness, familiarity, meaningfulness, and age of acquisition derived from the MRC database (Coltheart, 1981), Brysbaert, Warriner, and Kuperman (2014), and Kuperman et al. (2012).

TAALES 2.0

Like its predecessor, TAALES 2.0 is freely available (Footnote 1), easy to use, and compatible with most operating systems (Windows, Mac, and Linux). It is written in Python, is accessed via an intuitive graphical user interface (GUI; see Fig. 1), and requires no programming knowledge to operate. Unlike Web-based tools, TAALES 2.0 is housed on the user’s hard drive, which allows users to process data securely and without the need for an Internet connection. TAALES 2.0 includes all of the indices found in TAALES 1.4 and adds over 300 additional indices, related to word and n-gram frequency and range, n-gram strength of association, contextual distinctiveness, word recognition norms, semantic network, and word neighbors. Each of these is described below. For an overview of the indices included in TAALES 1.4 and TAALES 2.0, see Table 1. For instructions regarding the use of TAALES 2.0, see the user manual, which is available as supplementary material at www.kristopherkyle.com/supplementary-materials.html.

Fig. 1 TAALES 2.0 graphical user interface

Table 1 Comparison of indices included in TAALES 1.4 and TAALES 2.0

Word and n-gram frequency and range

Indices related to word and n-gram frequency and range have been shown to be important predictors of a number of constructs related to language development. One factor that may affect the accuracy of word frequency and range indices is the reference corpus from which the norms are derived. Reference corpora differ along several dimensions, including, but not limited to, mode, region, and purpose (i.e., register; Biber, Conrad, Reppen, et al., 2004; Hyland, 2009). Thus, TAALES 2.0 includes frequency and range norms for words and n-grams that are register-specific and derived from the Corpus of Contemporary American English (COCA; Davies, 2009). COCA comprises texts collected between 1990 and 2015 that represent five registers (academic, fiction, magazine, news, and spoken). In total, COCA includes approximately 520 million words. TAALES 2.0 includes six word frequency and six word range indices for each of the five COCA registers (30 total for each), 24 n-gram frequency (bigram and trigram) indices for each COCA register (120 total), and four n-gram range indices for each COCA register (20 total). Additionally, TAALES 2.0 adds two word frequency indices from the 131-million-word Hyperspace Analogue to Language (HAL) corpus (Lund & Burgess, 1996), compiled from Internet news groups (Footnote 2).

Age of exposure

TAALES 2.0 includes recently developed age of exposure (AoE) indices (Dascalu, McNamara, Crossley, & Trausan-Matu, 2016). These indices are based on computational models that estimate a word’s complexity on the basis of co-occurrence data and a word’s links to relevant semantic concepts within large corpora. Using latent Dirichlet allocation (LDA), which computationally infers underlying topics through a generative probabilistic process, Dascalu et al. developed measures of AoE values for the words found in the Touchstone Applied Science Associates (TASA) corpus (Footnote 3), which contains 13 grade-level textbooks in the United States (Landauer, Foltz, & Laham, 1998). Validations of the AoE values indicate that they are strongly related to human ratings of age of acquisition, word frequency, entropy, and human lexical-response latencies.

Word recognition norms

Word recognition scores report the average response latencies, standard deviations, and accuracies for a given word when used as a stimulus in lexical-decision and word-naming tasks. Lexical-decision latencies (i.e., response times) measure the time it takes participants to decide whether a word is a real word in English or not, whereas word-naming latencies measure the time it takes participants to begin reading a word aloud. These norms may reflect the ease or difficulty of processing a given word (Balota et al., 2004; Forster & Chambers, 1973; Frederiksen & Kroll, 1976). TAALES 2.0 calculates eight indices based on lexical-decision (LD) and word-naming (WN) behavioral norms obtained from The English Lexicon Project (ELP), a large publicly available psycholinguistic dataset (Balota et al., 2007). The ELP includes LD and WN task response latencies, standard deviations, and accuracies collected from 816 native English-speaking subjects. Word recognition norms were calculated in response to 40,481 real words (and an additional 40,481 nonwords for the LD task).

Contextual distinctiveness

Contextual distinctiveness measures the diversity of contexts in which a word is encountered. The constraints context puts on a word’s meaning may contribute to a more psychologically valid explanation of the word frequency effect than frequency of isolated occurrence alone (Adelman et al., 2006; Brysbaert & New, 2009; McDonald & Shillcock, 2001). Such constraints have been found to predict spoken lexical proficiency in L1 and L2 speech samples (Berger, Crossley, & Kyle, 2017). TAALES includes a number of different techniques for measuring contextual distinctiveness, ranging from free association norms to corpus-driven statistical approaches. These techniques are described below.

Free association norms

One approach to operationalizing contextual distinctiveness is to observe the number of other words commonly associated with a word, such that words with a greater number of associations are assumed to be less contextually distinct. Such information is available from free word association tasks, in which participants are given a stimulus word and asked to produce the first word (or words) that comes to mind.

Two sources of existing L1 word association norms are the Edinburgh Associative Thesaurus (EAT; Kiss, Armstrong, Milroy, & Piper, 1973) and the University of South Florida (USF; Nelson, McEvoy, & Schreiber, 2004) norms. Among other data, the EAT norms include the number of responses a given word receives when used as a stimulus in a written free association task. For example, the word worry elicits 65 different response types, whereas the word husband elicits only 15. The USF norms report the number of stimulus words that result in production of a given word as an associate in a free association task. Words elicited by a greater range of stimuli are considered more likely to come to mind in response to a variety of cues (Nelson et al., 2004). For example, the word love is produced in response to 181 different stimulus words, whereas a more contextually distinct word, such as bride, is produced in response to just six stimuli. TAALES 2.0 includes three indices related to free association norms taken from EAT and USF.

Corpus-driven approaches

Other approaches to measuring contextual distinctiveness are based on statistical regularities observed in large reference corpora. One such approach takes a lexical perspective and measures the probability of a given word statistically co-occurring with other words in general language usage (McDonald & Shillcock, 2001). For example, a word like today is found within a variety of lexical contexts (i.e., today co-occurs with many other words) and is thus less contextually distinct than a word like lone, which is less likely to co-occur with other words. Meanwhile, a semantic corpus-based approach to contextual distinctiveness (e.g., Hoffman, Lambon Ralph, & Rogers, 2013) observes the variety of semantic contexts in which a word occurs. The assumption underlying a semantic approach to contextual distinctiveness is that a word occurring in a variety of semantic contexts (e.g., one) is more semantically ambiguous and thus less contextually distinct than a word occurring in constrained semantic contexts (e.g., vibe). TAALES includes two indices related to corpus-derived contextual distinctiveness: semantic distinctiveness, as reported by Hoffman et al., and the McDonald co-occurrence probability (McDonald & Shillcock, 2001).

Word neighborhood

Word neighborhood refers to the words that share orthographic, phonographic, and/or phonological similarities with a particular word, all of which are correlated with one another (Peereman & Content, 1997). The size and characteristics of a word’s neighborhood have been shown to contribute to explaining variance in word-naming and recognition tasks (Adelman & Brown, 2007; Andrews, 1989; Balota et al., 2004; Coltheart, Davelaar, Jonasson, & Besner, 1977; Grainger, 1990; M. Yates, 2005; M. Yates, Locker, & Simpson, 2004). TAALES includes 14 indices related to word neighborhood information derived from the ELP (Balota et al., 2007). These are discussed briefly below.

Orthographic neighbors

An orthographic neighbor (Coltheart et al., 1977) is a real word that is formed by changing just one letter in the original word. For example, the word cat has 18 orthographic neighbors, including cab, cap, car, oat, sat, and so forth.
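The single-letter substitution definition can be sketched in a few lines of Python. The function name `orthographic_neighbors` and the small toy lexicon are hypothetical; TAALES itself draws its neighborhood counts from the ELP norms rather than computing them in this way.

```python
import string

def orthographic_neighbors(word, lexicon):
    """Return lexicon words formed by substituting exactly one letter of `word`."""
    neighbors = set()
    for i, original in enumerate(word):
        for letter in string.ascii_lowercase:
            if letter == original:
                continue  # a substitution must change the letter
            candidate = word[:i] + letter + word[i + 1:]
            if candidate in lexicon:
                neighbors.add(candidate)
    return sorted(neighbors)

# Toy lexicon (invented for illustration, far smaller than a real word list).
toy_lexicon = {"cat", "cab", "cap", "car", "oat", "sat", "cot", "cart", "at", "dog"}
print(orthographic_neighbors("cat", toy_lexicon))
```

Because only substitutions are generated, words of a different length (here cart and at) are never counted as neighbors.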

Phonographic neighbors

Phonographic neighbors differ in one letter and one phoneme. For example, whereas stove and shove are only orthographic neighbors, stone and stove are also phonographic neighbors (Adelman & Brown, 2007).

Phonological neighbors

Phonological neighbors differ by one phoneme, regardless of their orthography. For example, the word geese has seven phonological neighbors: cease, lease, niece, peace, gas, goose, and guess (M. Yates, 2009).

Semantic networks

A semantic network refers to the way that word forms are semantically related. Two key areas of semantic networks are polysemy and hypernymy. Both polysemy and hypernymy have been shown to be related to lexical development (Crossley, Salsbury, & McNamara, 2009, 2010) and L2 writing proficiency (Guo et al., 2013; Reynolds, 1995).

Polysemy

Polysemy refers to the number of related senses (i.e., meanings) a particular word form has. Words such as make and give have more senses than words such as construct and deliver. Research has suggested that as learners develop, they tend to use words with fewer senses (Crossley et al., 2010). Furthermore, research has demonstrated that polysemy scores are negatively correlated with L2 writing quality (e.g., Guo et al., 2013). TAALES 2.0 calculates polysemy values for all content words and for nouns, verbs, adjectives, and adverbs (five total indices). Polysemy scores represent the number of senses a word form has according to WordNet (Fellbaum, 1998).

Hypernymy

Hypernymy refers to the number of superordinate terms a particular word has. A word such as animal has but a few superordinate terms, whereas words such as greyhound, stag, and whitefish have many. Research has suggested that as individuals develop, they tend to have access to words with more superordinate terms (i.e., words that are more specific; Crossley et al., 2009). Additionally, hypernymy ratings have been shown to be positively correlated with L2 writing quality (Guo et al., 2013). TAALES 2.0 includes nine indices related to hypernymy for nouns, verbs, and a combination of nouns and verbs. One issue relating to the operationalization of hypernymy is that different senses of a particular word often have different superordinate terms (and different numbers of superordinate forms). Thus, TAALES 2.0 includes three versions of each hypernymy index, such that the first version comprises hypernymy values for the most frequent sense and path, the second comprises the average value for all senses (but the most frequent path for each), and the third version comprises the average value for all senses and all paths.

N-gram strength of association

Strength-of-association norms measure the conditional probability that words will occur together. Strength-of-association norms are related to n-gram frequency norms but control for the relative frequencies of the words that comprise n-grams by measuring the conditional probability of word co-occurrence. Such norms can show that the words in bigrams such as optimistic about are more strongly related than the ones in and the and in the. Bigram association strength has been shown to be positively correlated with L2 writing quality and longitudinal writing development (Bestgen & Granger, 2014). TAALES 2.0 includes 75 strength-of-association norms, covering both bigrams (25 indices) and trigrams (50 indices). These measures are described below and presented in Table 2. Two types of trigram indices are computed, such that the first word is considered Item 1 and the following bigram is considered Item 2, or the first bigram is considered Item 1 and the third word is considered Item 2.

Table 2 2×2 contingency table used in the calculation of strength-of-association norms

Mutual information

Mutual information (MI) scores represent the joint probability that two items will co-occur. Studies in corpus linguistics have suggested that MI scores tend to inflate the importance of low-frequency items (e.g., bigrams that consist of lower-frequency words tend to earn higher MI scores; Evert, 2005). N-grams such as spina bifida and lingua franca earn high MI scores, whereas n-grams such as an a and great the earn low MI scores. MI is calculated as the logarithm of the observed co-occurrence of two items divided by the expected co-occurrence of the two items:

$$ M I= \log \left(\frac{\mathrm{observed}}{\mathrm{expected}}\right)= \log \left(\frac{a}{\frac{\left( a+ b\right)\ast \left( a+ c\right)}{N}}\right). $$

Mutual information squared

Mutual information squared (MI2) scores are a variant of MI scores that attempt to mitigate the emphasis on low-frequency items (Evert, 2005). N-grams such as twentieth century and stainless steel earn high MI2 scores, whereas n-grams such as the all and some and earn low MI2 scores. MI2 is calculated as the logarithm of the squared observed co-occurrence of two items divided by the expected co-occurrence of the two items:

$$ M{I}^2= \log \left(\frac{{\mathrm{observed}}^2}{\mathrm{expected}}\right)= \log \left(\frac{a^2}{\frac{\left( a+ b\right)\ast \left( a+ c\right)}{N}}\right). $$
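The MI and MI2 formulas can be checked with a short Python sketch. Several details are assumptions rather than statements about TAALES: the cell labels (a = both items co-occur, b = Item 1 without Item 2, c = Item 2 without Item 1, d = neither, N = a + b + c + d), the base-2 logarithm (the formulas above do not specify a base), and the invented counts.

```python
import math

def expected_cooccurrence(a, b, c, n):
    """Expected co-occurrence count under independence: (a + b)(a + c) / N."""
    return (a + b) * (a + c) / n

def mutual_information(a, b, c, n):
    """MI = log(observed / expected); base 2 is an assumption of this sketch."""
    return math.log2(a / expected_cooccurrence(a, b, c, n))

def mutual_information_squared(a, b, c, n):
    """MI2 = log(observed^2 / expected); base 2 is an assumption of this sketch."""
    return math.log2(a ** 2 / expected_cooccurrence(a, b, c, n))

# Invented counts for a hypothetical bigram.
a, b, c, d = 8, 2, 2, 988
n = a + b + c + d
mi = mutual_information(a, b, c, n)           # log2(8 / 0.1), about 6.32
mi2 = mutual_information_squared(a, b, c, n)  # log2(64 / 0.1), about 9.32
print(round(mi, 2), round(mi2, 2))
```

The two scores differ by exactly log(a), which is why MI2 gives relatively more weight than MI to items with a higher observed co-occurrence count.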

T

Like MI scores, T scores represent the joint probability that two items will co-occur. Although MI scores tend to emphasize infrequent items, T scores tend to emphasize frequent items (Evert, 2005). N-grams such as of the and from the earn high T scores, whereas n-grams such as the between and the who earn low T scores. T is calculated as the observed frequency minus the expected frequency, divided by the square root of the observed frequency:

$$ T=\frac{\mathrm{observed}-\mathrm{expected}}{\sqrt{\mathrm{observed}}}=\frac{a-\frac{\left( a+ b\right)\ast \left( a+ c\right)}{N}}{\sqrt{a}}. $$

Delta P

Delta P scores represent the probability of an outcome (i.e., a particular word) based on a cue (i.e., another word). Delta P scores are directional, meaning that word order affects the score, unlike MI, MI2, and T scores. Delta P is calculated via the following formula: delta P = P(O | C) – P(O | –C); that is, delta P is the probability of an outcome given a cue minus the probability of an outcome without the cue. N-grams such as preformatted table and pursuant to earn high delta P scores, whereas n-grams such as would the and must the earn low delta P scores. With reference to Table 2, we calculate delta P with the second item as the outcome and the first item as the cue via:

$$ \mathrm{delta}\kern0.5em P=\left(\frac{a}{a+ b}\right)-\left(\frac{c}{c+ d}\right). $$
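The directionality of delta P can be illustrated with a minimal Python sketch. The cell labels are assumed (a = both items co-occur, b = Item 1 without Item 2, c = Item 2 without Item 1, d = neither), the counts are invented, and the reversed direction (`delta_p_reversed`) is included only to show the asymmetry; the text above describes only the direction with Item 2 as the outcome.

```python
def delta_p(a, b, c, d):
    """Delta P with Item 2 as the outcome and Item 1 as the cue.

    P(O | C) = a / (a + b); P(O | -C) = c / (c + d).
    """
    return a / (a + b) - c / (c + d)

def delta_p_reversed(a, b, c, d):
    """Illustrative reverse direction: Item 1 as outcome, Item 2 as cue.

    Swapping the roles of the items swaps cells b and c.
    """
    return delta_p(a, c, b, d)

# Invented counts: Item 1 is rare and is usually followed by Item 2,
# whereas Item 2 is common and is usually preceded by something else.
a, b, c, d = 80, 20, 920, 98980
forward = delta_p(a, b, c, d)
backward = delta_p_reversed(a, b, c, d)
print(round(forward, 3), round(backward, 3))
```

With these counts the forward score is far higher than the reversed one, a difference that symmetric measures such as MI cannot express.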

Approximate collexeme strength

Collexeme strength scores (Gries, Hampe, & Schönefeld, 2005) represent the joint probability that two items will co-occur. Collexeme strength is calculated using an exact test and does not assume a normal distribution. For these reasons, it has been argued to be superior to other association strength indices such as MI and T (Gries et al., 2005). Collexeme strength is calculated by taking the negative logarithm of the Fisher–Yates exact test (Fisher, 1934; F. Yates, 1934), which is calculated as:

$$ {p}_{\mathrm{observed}\kern0.5em \mathrm{distribution}}=\frac{\binom{a+ c}{a}\ast \binom{b+ d}{b}}{\binom{N}{a+ b}}+\Sigma \kern0.5em {p}_{\mathrm{all}\ \mathrm{more}\ \mathrm{extreme}\ \mathrm{distribution}\mathrm{s}}. $$
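As a rough illustration, the exact test can be computed with integer arithmetic in Python. The sketch rests on several assumptions that are not taken from the source: the cell labels (a = both items, b = Item 1 only, c = Item 2 only, d = neither), a one-sided test in the direction of attraction (tables with a larger a and the same margins count as "more extreme"), and a base-10 logarithm. The function names are hypothetical.

```python
import math
from fractions import Fraction

def hypergeom_prob(a, b, c, d):
    """Exact probability of one 2x2 table with its margins held fixed."""
    n = a + b + c + d
    return Fraction(math.comb(a + c, a) * math.comb(b + d, b),
                    math.comb(n, a + b))

def fisher_exact_attraction(a, b, c, d):
    """One-sided p: the observed table plus all tables with a larger `a`."""
    p = Fraction(0)
    while b >= 0 and c >= 0:
        p += hypergeom_prob(a, b, c, d)
        # Shift one co-occurrence into cell a while keeping all margins fixed.
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return p

def collexeme_strength(a, b, c, d):
    """Negative log10 of the exact p value (the base is an assumption)."""
    p = fisher_exact_attraction(a, b, c, d)
    # Take logs of the exact integers separately to avoid float underflow.
    return -(math.log10(p.numerator) - math.log10(p.denominator))

# Invented counts for a strongly attracted hypothetical bigram.
print(collexeme_strength(80, 20, 920, 98980))
```

Keeping the p value as an exact fraction until the logarithm is taken sidesteps floating-point underflow in this sketch, at the cost of slow big-integer arithmetic for corpus-sized counts.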

Although the use of an exact test has some benefits, in practical applications with large corpora (such as COCA), decimal rounding causes particularly attracted or repelled bigram items to equal 1 or 0, respectively. A solution is to approximate collexeme strength by multiplying the delta P value by the frequency of Item 1:

$$ \mathrm{approximate}\kern0.5em \mathrm{collexeme}\kern0.5em \mathrm{strength}=\left(\left(\frac{a}{a+ b}\right)-\left(\frac{c}{c+ d}\right)\right)\ast \left( a+ b\right). $$

This approximation is reportedly strongly correlated (r = .950) with collexeme strength (Gries, personal communication, December 19, 2014). N-grams such as for example and would be earn high approximate collexeme strength scores, whereas n-grams such as not the and more the earn low approximate collexeme strength scores.

Present studies

In these studies, we validate TAALES 2.0 by investigating whether TAALES 2.0 indices can be used to predict holistic scores of lexical proficiency in L1 and L2 writing samples, and analytic word-choice scores in L1 essays. The research questions that guide these validation studies are as follows:

  1. What is the relationship between the indices of lexical sophistication included in TAALES 2.0 and holistic scores of written lexical proficiency?

  2. What is the relationship between the indices of lexical sophistication included in TAALES 2.0 and word-choice scores for narrative essays?

Method

Corpora

Lexical proficiency corpus

The lexical proficiency corpus comprises free writes written by L1 and L2 English users reported by Crossley, Salsbury, McNamara, and Jarvis (2011). It includes 180 free writes written by L2 English learners enrolled in an English for academic purposes program at a university in the US. These texts were stratified to include equal numbers of texts from individuals with beginning (n = 60), intermediate (n = 60), and advanced (n = 60) English proficiency, based on institutional TOEFL scores. These samples were augmented with 60 unstructured writing samples from undergraduate native speakers leading to a total corpus of n = 240. The writing samples were evaluated by expert raters who used a holistic rubric related to lexical proficiency. Interrater reliability was acceptable (r = .796). The corpus has been used in a number of studies to explore the nature of lexical proficiency. Crossley, Salsbury, et al. (2011), for example, found that indices related to lexical diversity, word hypernymy, and content word frequency explained 44% of the variance in lexical proficiency scores. Texts that had higher lexical diversity and included words with fewer hypernyms and lower-frequency content words tended to earn higher scores. In a follow-up study, Kyle and Crossley (2015) found that indices related to bigram and trigram frequency, word range, familiarity, and meaningfulness scores explained 51.7% of the variance in lexical proficiency scores. Texts that included bigrams and trigrams that are less frequent and words that are less frequent, familiar, and meaningful tended to earn higher scores.

Word-choice corpus

The word-choice corpus comprises 716 narrative essays written by 10th graders in the United States who predominantly speak English as an L1. The corpus was collected as part of the Automated Student Assessment Prize (ASAP) and is described in Shermis and Hamner (2013). Essays were scored using a six-trait analytic rubric by at least two raters. For this study, we used analytic ratings related to word choice. The analytic rating for word choice indicates that essays that include “accurate, strong, specific words” would be scored higher for word choice than essays that include “general, vague words” and/or “an extremely limited range of words,” which would be scored lower. Interrater reliability for the word-choice ratings was moderate (Kappa = .482), although 98.2% of ratings were either exact or adjacent matches. To our knowledge, the word-choice scores have not been used in previous studies. Shermis and Hamner reported on a shared task in which participants attempted to automatically predict the overall essay score (which was calculated on the basis of all six analytic traits). Fully featured models (i.e., models that include predictors related to fluency, lexical sophistication, cohesion, and syntactic complexity) explained between 40% and 52% of the variance in essay scores.

TAALES 2.0 indices

All TAALES 2.0 indices related to word frequency, word range, psycholinguistic word information, age of exposure, academic language, contextual distinctiveness, word recognition norms, semantic network, n-gram frequency, n-gram range, n-gram strength of association, and word neighbors were used for the analysis. All TAALES indices are normed by text length. Any item in the text that is not represented in a particular index database (e.g., rare words and misspellings) is not counted toward text length. Some TAALES databases, such as Brysbaert et al.’s (2014) concreteness norms, are based on word lemmas, whereas others, such as Balota et al.’s (2007) lexical-decision norms, are based on raw (nonlemmatized) words. Furthermore, some databases/corpora count contractions (e.g., can't and won't) as two tokens (e.g., ca and n't), whereas others count them as one. TAALES is sensitive to each of these differences and tokenizes and/or lemmatizes the source texts as necessary. An Index Guide is provided as supplementary material at www.kristopherkyle.com/supplementary-materials.html. The document provides in-depth information regarding each index, including database sources and lemmatization information (among other pertinent information). It should also be noted that some databases are larger than others, which may affect index coverage. The MRC concreteness index, for example, is based on a database of concreteness ratings for 4,292 lemmas (Paivio, Yuille, & Madigan, 1968; Spreen & Schulz, 1966), whereas Brysbaert et al.’s (2014) database includes ratings for 40,000 lemmas. In addition to index scores, TAALES also provides optional output that indicates the number of text items that are covered by each database, providing word coverage for each text for each index.
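The norming and coverage logic described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the TAALES code: the function name `index_score`, the use of a simple mean over covered tokens, and the toy norm values are all hypothetical.

```python
def index_score(tokens, norms):
    """Return (mean norm value over covered tokens, proportion of tokens covered).

    Tokens missing from `norms` (e.g., rare words or misspellings) are
    excluded from the length used for norming.
    """
    covered = [norms[t] for t in tokens if t in norms]
    if not covered:
        return None, 0.0  # no token is covered, so no score can be computed
    return sum(covered) / len(covered), len(covered) / len(tokens)

# Toy norms (invented values) and a text containing one misspelling.
toy_norms = {"the": 5.0, "dog": 4.0, "barked": 2.0}
tokens = ["the", "dog", "barkd", "barked"]
score, coverage = index_score(tokens, toy_norms)
print(score, coverage)
```

Reporting coverage alongside the score makes it possible to see when an index value rests on only a small share of a text's words.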

Statistical analyses

To investigate the relationship between indices of lexical sophistication and human judgments of lexical proficiency in L1 and L2 free writes and word choice in L1 narrative essays, multiple regression models were developed. For each study, all TAALES 2.0 indices were checked for normality using histograms. Any index that was not normally distributed was removed from further consideration (see Footnote 4). We then set a correlation threshold of r = .100, which represents the lower bound of a meaningful correlation (Cohen, 1988), and an alpha level of α = .001 (to control for Type I errors). Any index that did not reach both thresholds was removed from further consideration. We then checked for multicollinearity, which can lead to exaggerated models. Any indices that were strongly correlated with one another (r ≥ .700) were flagged for further analysis. In each collinear group, only the index with the strongest correlation with the criterion variable was kept (see Footnote 5). The remaining indices were then entered into a stepwise regression that used the Akaike information criterion (AIC) method (Akaike, 1974). If any of the indices in the model demonstrated suppression (i.e., their beta weights had switched signs), those indices were removed and the regression was rerun. This process was repeated until the model included no suppressed variables. Finally, a follow-up tenfold cross-validated forced-entry linear regression was conducted using the indices included in the final model to ensure that the model was consistent across the dataset.
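The correlation-threshold and multicollinearity steps can be sketched as follows. This is an illustrative sketch rather than the procedure actually run for the studies: the function names, the toy data, and the greedy strongest-first pruning strategy are assumptions. The significance check is omitted because the standard library has no t-distribution; a statistics package would be needed for that step.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_indices(indices, scores, min_r=0.100, max_collinear_r=0.700):
    """Filter indices by |r| with the criterion, then prune collinear ones.

    indices: dict mapping index name -> list of values (one per text)
    scores:  list of criterion scores (one per text)
    """
    # Step 1: keep indices whose |r| with the criterion reaches the threshold.
    r_with_score = {name: pearson(vals, scores) for name, vals in indices.items()}
    candidates = [n for n, r in r_with_score.items() if abs(r) >= min_r]

    # Step 2 (assumed greedy strategy): visit candidates from strongest to
    # weakest and keep one only if it is not collinear with any kept index.
    candidates.sort(key=lambda n: abs(r_with_score[n]), reverse=True)
    kept = []
    for name in candidates:
        if all(abs(pearson(indices[name], indices[k])) < max_collinear_r for k in kept):
            kept.append(name)
    return kept


# Toy data with hypothetical index names and values.
scores = [1, 2, 3, 4, 5, 6]
indices = {
    "bigram_assoc": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "bigram_assoc_alt": [0.1, 0.2, 0.3, 0.4, 0.6, 0.5],  # collinear with bigram_assoc
    "polysemy": [2, 1, 3, 1, 3, 2],                      # weak but above threshold
    "noise": [1, -1, -1, 1, 0, 0],                       # r = 0 with scores
}
selected = select_indices(indices, scores)
print(selected)  # ['bigram_assoc', 'polysemy']
```

The indices that survive this filter would then be passed to the stepwise regression, the suppression check, and the cross-validation described above, none of which is shown here.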

Results

Study 1: Lexical proficiency

To validate the indices of lexical sophistication included in TAALES 2.0, 421 indices were used to model the variance in holistic scores of lexical proficiency in essays. Twenty-eight of the indices violated the assumption of normality and were removed from further consideration. Furthermore, 285 of the remaining 393 variables did not reach the minimum thresholds of r ≥ .100 and p < .001 and were also removed from further consideration. Of the remaining 108 variables, 84 were removed due to multicollinearity. The remaining 24 variables (see Table 3) were entered into a tenfold stepwise regression. The initial model included two variables with switched signs, which were subsequently removed. The final model, which included ten variables, explained 58.0% (R² = .580) of the variance in holistic lexical proficiency scores. This model was significant, F(10, 229), p < .001. When the model was cross-validated, it explained 56.4% of the variance (R² = .564), suggesting that the model is stable across the dataset. The model included indices related to association strength, n-gram proportion scores, range scores, lexical-decision and word-naming response times, age of exposure, word hypernymy and polysemy, and word frequency. The results indicated that texts rated as being more lexically proficient contained more sophisticated lexical features. Table 4 presents a summary of the regression model.

Table 3 Correlations between lexical proficiency scores and the TAALES 2.0 indices

Study 2: L1 word choice

To validate the indices of lexical sophistication included in TAALES 2.0, 421 indices were used to model the variance in analytic word-choice scores in L1 essays. Fourteen indices violated the assumption of normality and were removed from further consideration. One hundred twenty-eight of the remaining 407 variables did not reach the minimum thresholds of r ≥ .100 and p < .001 and were also removed from further consideration. Of the remaining 279 variables, 233 were removed due to multicollinearity. The remaining 46 variables (see Table 5) were entered into a tenfold stepwise regression. The initial model included 11 variables with switched signs, which were subsequently removed. Four additional indices were removed due to suppression in subsequent models. The final model, which included 11 variables, explained 32% (R² = .320) of the variance in analytic word-choice scores. This model was significant, F(11, 704), p < .001. When the model was cross-validated, it explained 30.5% of the variance (R² = .305), suggesting that the model is stable across the dataset. The final model included indices related to phonological neighbors, lexical-decision times, word familiarity and frequency, and association strength. The results indicated that texts that were scored higher in word choice included more sophisticated lexical features. Table 6 presents a summary of the regression model.

Table 4 Summary of lexical proficiency multiple regression model
Table 5 Correlations between word choice scores and the TAALES 2.0 indices
Table 6 Summary of word-choice multiple regression model

Discussion

This study introduces and helps validate TAALES 2.0, an easy-to-use, freely available, versatile tool for measuring a wide variety of indices related to lexical sophistication. It is hoped that researchers in a variety of fields related to discourse processing, text analysis, and language assessment will find TAALES 2.0 a useful mechanism for examining lexical sophistication in a variety of situations. We envision that TAALES 2.0 might prove beneficial for researchers examining the effects of text complexity on reading comprehension and text processing. Educational assessments related to language production might also be informed by the lexical features found in TAALES 2.0. Additionally, cognitive scientists might find the tool useful in helping to develop language stimuli for behavioral experiments. Computational social scientists may use the tool’s features to examine trends reported in traditional or social media. Here, we examined whether the indices reported in TAALES 2.0 were predictive of human judgments of lexical proficiency and word choice. We discuss these findings below.

Lexical proficiency

Ten TAALES indices were used in a model that explained approximately 58% of the variance in lexical proficiency scores. These results are stronger than those of models in previous studies, which explained between 44% (Crossley, Salsbury, et al., 2011) and 51.7% (Kyle & Crossley, 2015) of the variance in lexical proficiency scores. The predictor model both supports and extends previous models of lexical proficiency. Of the ten predictor variables in the final model, only one (Kučera–Francis Register Range CW) was included in TAALES 1.4, and only two others (COCA Magazine Trigram Proportion 80k and COCA Academic Frequency CW Logarithm) are conceptually related to the TAALES 1.4 indices. This suggests that TAALES 2.0 represents an important upgrade to previous versions, both practically and conceptually.

Four indices related to n-gram strength of association, n-gram frequency, and word range contributed over two thirds of the variance explained by the model (41.8%), whereas an index related to word frequency explained only 3.6% of the variance in lexical proficiency scores. Each index category that contributed to the final model is discussed below.

N-gram association strength and frequency

Indices related to n-gram association strength and n-gram frequency explained approximately 28% of the variance in lexical proficiency scores. Texts that included more strongly associated bigrams and trigrams and a higher percentage of frequent trigrams tended to earn higher lexical proficiency scores. This supports recent findings that suggest collocational knowledge is a key aspect of lexical proficiency (Bestgen & Granger, 2014; Jurafsky, Bell, Gregory, & Raymond, 2001; Kyle & Crossley, 2015; McDonald & Shillcock, 2003; Römer, 2009). The findings also suggest that n-gram frequency and strength-of-association indices may capture related but different aspects of collocational knowledge.

Word range

One index related to word range explained 13.9% of the variance in lexical proficiency scores. Texts that included words that occur in fewer registers tended to earn higher lexical proficiency scores. This may suggest that the use of words that are more register specific is an important indicator of lexical proficiency. These results support recent findings with regard to both lexical proficiency (Kyle & Crossley, 2015) and L2 writing quality (Kyle & Crossley, 2016).

Semantic networks

Two indices related to semantic networks explained 6.6% of the variance in lexical proficiency scores. Texts that included nouns and verbs with fewer hypernymic levels and verbs that were more polysemous tended to earn higher scores. This suggests that the use of less specific verbs and nouns is an indicator of lexical proficiency, which supports previous findings related to lexical development (Crossley et al., 2009; Crossley et al., 2011).

Word recognition norms

Indices related to word recognition norms explained 5.7% of the variance in lexical proficiency scores. Texts that included words with a wider standard deviation in lexical-decision times and that were named less accurately tended to earn higher scores. This suggests that words that are more difficult to process tend to be perceived as more sophisticated. These results generally support psycholinguistic accounts of word processing and extend psycholinguistic data to support predictions of holistic judgments of lexical proficiency in writing samples.

Word frequency

One index related to word frequency explained 3.6% of the variance in lexical proficiency scores. Texts that included less frequent content words tended to earn higher lexical proficiency scores. This negative trend aligns with previous research. The relatively limited role of frequency in the predictor model, however, suggests that other factors (e.g., n-gram strength of association and frequency and word range) are more directly related to the construct of lexical proficiency, supporting previous studies reporting that frequency is not the strongest predictor of word processing (Adelman et al., 2006; McDonald & Shillcock, 2001).

Age of exposure

One index related to age of exposure explained a small amount of the variance (0.3%) in lexical proficiency scores. The results indicate that texts including words that have lower co-occurrence patterns at later grade levels tended to earn higher scores.

Word choice

Eleven TAALES indices were used in a model that explained 32% of the variance in word-choice scores. Direct comparisons to previous studies are not possible because this is the first study to use the word-choice scores on their own. However, the word-choice scores have been analyzed in combination with other scores (e.g., ideas and content, voice, and organization) using NLP features to predict overall essay scores. This analysis, which included lexical features along with other features (e.g., cohesion and syntactic complexity), explained between 40% and 52% of the variance in the overall essay scores (Shermis & Hamner, 2013). Given that we used only a single construct (lexical sophistication), the results reported here seem reasonably strong, and they support and extend previous models of lexical proficiency. Of the eleven predictor variables in the final model, only three (MRC Familiarity AW, Brown Frequency AW, and SUBTLEXus Frequency CW Logarithm) were included in TAALES 1.4, and these explained a relatively small portion of the variance in word-choice scores (5.3%). New variables unique to TAALES 2.0, including word neighbor information, word recognition scores, and association measures, explained the lion’s share of the variance, suggesting that TAALES 2.0 represents an important upgrade to previous versions, both practically and conceptually.

Word neighbor information

Two indices related to word neighbor information explained 19.3% of the variance in word-choice scores. The average number of phonological neighbors accounted for most of this variance (18.6%). Essays that included words with fewer phonological neighbors (i.e., words that are more phonologically distinct) tended to earn higher word-choice scores. The mean orthographic neighbor frequency score for words in a text accounted for an additional 0.7% of the variance in word-choice scores. Essays that included words with less frequent orthographic neighbors tended to earn higher scores. These results are in line with psycholinguistic findings that demonstrate that performance on naming and lexical-decision tasks is faster for low-frequency words that have more orthographic neighbors, indicating that words with fewer orthographic neighbors are more complex (Andrews, 1989; Grainger, 1990; McCann & Besner, 1987).

Word recognition norms

One index related to word recognition norms explained 3.8% of the variance in word-choice scores. Essays that included words that are processed more slowly (as measured by a lexical-decision task) tended to earn higher scores. This suggests that words that are processed more slowly are considered more sophisticated by human raters. This finding is novel but is in line with previous research regarding processing difficulty (Balota et al., 2004; Forster & Chambers, 1973; Frederiksen & Kroll, 1976).

N-gram association strength

Five indices related to n-gram association strength cumulatively explained 3.7% of the variance in word-choice scores. Essays that included bigrams and trigrams that were more strongly associated tended to earn higher scores. This finding suggests that collocational knowledge is an important aspect of lexical knowledge and is in line with previous research in L2 contexts (Bestgen & Granger, 2014) and usage-based theories regarding lexical knowledge (Römer, 2009).

Word information

One index related to word information explained 2.9% of the variance in word-choice scores. Essays that included words that were less familiar tended to earn higher word-choice scores. These results align with previous studies (Crossley & McNamara, 2011; Guo et al., 2013; Kyle & Crossley, 2015).

Word frequency

Two indices related to word frequency explained 2.4% of the variance in word-choice scores. Essays that included less frequent content words tended to earn higher word-choice scores. The negative relationship between corpus frequency and word-choice scores generally aligns with previous studies (Guo et al., 2013; Kyle & Crossley, 2015; McNamara, Crossley, & McCarthy, 2009). One word frequency index (SUBTLEXus Frequency CW Logarithm) demonstrated the strongest correlation between lexical sophistication indices and word-choice scores (r = −.471), underscoring the important relationship between corpus frequency and word-choice scores. The most accurate predictor model, however, weighted other indices more heavily (e.g., phonological neighbors).

Overview of findings

This study reported on two validation studies that used TAALES 2.0 indices to successfully predict holistic scores of lexical proficiency in L2 and L1 writing (R² = .580), and analytic word-choice scores in L1 essays (R² = .320). The lexical proficiency model explained greater variance in lexical proficiency scores than previous models, providing evidence to support the inclusion of the new indices included in TAALES 2.0. Furthermore, the word-choice model explained a significant amount of the variance (with a medium effect), providing further supporting evidence for the validation of the TAALES 2.0 indices.

Although correlations between TAALES 2.0 indices and word-choice scores tended to be stronger than those between TAALES 2.0 indices and lexical proficiency scores, the lexical proficiency model explained more variance than the word-choice model. One interpretation of these seemingly contradictory results is that the majority of our lexical proficiency corpus was sampled from L2 learners, who may represent greater variation in lexical knowledge and production than is found in our L1-only word-choice corpus. In the lexical proficiency model, for example, three indices related to n-gram association strength, n-gram frequency, and word range each contributed a relatively large percentage of the variance explained by the model, indicating that a number of unique lexical features account for the variation in human judgments of L2 and L1 speakers. In contrast, for the word-choice model, a single index related to phonological neighbors accounted for the lion’s share of the variance in scores, suggesting less variation on the part of L1 writers, at least in terms of predicting word-choice scores.

The results generally support previous findings related to the lexical features reported by TAALES 1.4 (i.e., word frequency, word range, n-gram frequency, academic language, and psycholinguistic word information). Importantly, the results also present a number of novel findings regarding the additional lexical features included as part of TAALES 2.0 (i.e., contextual distinctiveness, word recognition norms, semantic network, n-gram association strength, n-gram range, and word neighbors). In the lexical proficiency study, for example, indices related to COCA-derived n-gram frequency and association strength accounted for 28% of the variance in lexical proficiency scores. N-gram association strength indices also accounted for a portion of the variance (3.7%) in word-choice scores. Indices related to word neighbor information accounted for over half of the variance explained by the word-choice model (19.3% of the total variance). Indices related to word recognition norms were included in both the lexical proficiency model and the word-choice model. Furthermore, these indices were among those that demonstrated some of the strongest correlations with lexical proficiency scores (r = .312) and word-choice scores (r = .417). Of the new TAALES 2.0 categories, the only one not represented in either predictor model was contextual distinctiveness. Although they were not included in the regression models, contextual distinctiveness indices did demonstrate significant correlations with word choice and lexical proficiency. In the case of lexical proficiency, the contextual distinctiveness indices did not meet the strict threshold of p < .001 that we used to control for Type I errors.

Limitations and future directions

A major limitation of TAALES is user knowledge. The tool contains a vast repository of lexical features that most new users will not be familiar with. Thus, there will be a steep learning curve for most. Although we have endeavored to provide background information on the indices reported by TAALES in this article and in the index guide, the sheer number of indices and the lexical constructs they represent will prove daunting for the new user. In addition, statistical analyses using TAALES require a solid knowledge not only of inferential statistics but also of variable selection. For instance, experienced, but not inexperienced, users will know that frequency indices based on logarithmic transformations will more likely be normally distributed, whereas raw frequency counts may be nonnormally distributed because of Zipfian tendencies in language. Likewise, experienced users will know that some of the variables reported by TAALES depend on word vectors that are not densely populated. For instance, a number of the Academic Word Sublists contain a small number of words that are rare in smaller texts, leading to nonnormally distributed data. However, for larger texts, these indices will report normal distributions. Of course, depending on the statistical analysis, normal distributions need not be an assumption. This is especially true for a number of machine learning techniques. In addition, TAALES is purposefully redundant in that it calculates a number of lexical features for a single lexical construct (e.g., frequency) and also calculates a number of similar variables from a single database (e.g., raw and logarithmic frequency counts, range scores, and n-gram counts from COCA). Such redundancy allows users greater capabilities to make decisions specific to their research question, but also introduces the potential for multicollinearity and suppression effects within an analysis. Again, many new TAALES users will face a learning curve when making experimental and statistical decisions.

There are also limitations specific to this study. Foremost, we only looked at two cross-sectional datasets. One of the datasets was relatively small (n = 240), and the other had relatively low rating reliability. It would be fruitful for future studies to investigate larger datasets with more reliable human ratings. Longitudinal datasets may also allow researchers to observe how lexical sophistication develops over time, particularly in L2 or younger L1 participants. Future studies should also investigate the complexity of predictor models related to L1 and L2 lexical proficiency. In this study, we found more indices accounting for a larger percentage of the variance in the predominantly L2 dataset than in the L1 dataset. Future research should investigate whether these differences are attributable to speaker status (e.g., L1 vs. L2) or other variables, such as writing tasks and scoring rubrics. Importantly, the nature of the statistical analyses used here required the removal of a large number of indices to avoid multicollinearity. Depending on their research question(s), researchers who use TAALES 2.0 may consider employing statistical techniques that minimize multicollinearity effects, such as factor analysis. Finally, although TAALES 2.0 has been refined considerably, some limitations remain. For example, not all indices allow researchers to distinguish between values for content versus function words in a text (e.g., word recognition norms).

Conclusion

This study introduces a major update to a freely available text analysis tool, TAALES 2.0. This tool was designed to allow for efficient and replicable analysis of lexical sophistication in a variety of domains, including educational psychology, cognitive science, and artificial intelligence (among many others). The results suggest that the increased construct coverage of TAALES 2.0 positively influenced the predictive validity of models related to lexical sophistication, providing predictive validation of the indices reported by the tool. The results of both studies also suggest that the nature of lexical sophistication is multifaceted and complex. Furthermore, the results suggest that the construct of lexical sophistication is not restricted to the properties of words in isolation, but involves collocational knowledge. These findings provide new avenues for varied endeavors such as developing behavioral stimuli, automatically assessing speaking and writing proficiency, and investigating reading difficulty, among others.