Introduction

Language scientists often need to generate sets of related words or words with specific properties. This might be in service of experimental control (e.g., words matched on length and frequency of occurrence, but differing in lexical neighborhood; Luce & Pisoni, 1998). Or the need might arise from a theoretically motivated or model-driven hypothesis; perhaps your theory proposes – or your model simulations predict – that shorter words embedded within a word should make that word more difficult to process, so you want to find words with many or few words embedded within them. Sets of related items and their characteristics can also be useful for clinical purposes. For example, frequency-weighted lexical neighborhoods have proven useful for clinical assessments and interventions (e.g., Kirk, Pisoni, & Osberger, 1995; Morrisette & Gierut, 2002; Sommers & Danielson, 1999; Storkel, Bontempo, Aschenbrenner, Maekawa, & Lee, 2013; Storkel, Maekawa, & Hoover, 2010). So how do we generate these lists?

Various excellent tools already exist. For example, three web-based tools are Michael Vitevitch's phonotactic probability (Vitevitch & Luce, 1998, 1999) and neighborhood density calculators (http://www.people.ku.edu/~mvitevit/PhonoProbHome.html), the English Lexicon Project (https://elexicon.wustl.edu/; Balota et al., 2007), and the recent Auditory English Lexicon Project (https://inetapps.nus.edu.sg/aelp; Goh, Yap, & Chee, 2020). Other tools exist for semantic variables or languages other than English, such as Lexique, which includes English and French (http://www.lexique.org/; New, Pallier, Brysbaert, & Ferrand, 2004), the multilingual CLEARPOND (https://clearpond.northwestern.edu/; Marian, Bartolotti, Chabal, & Shook, 2012), and EsPal (https://www.bcbl.eu/databases/espal/; Duchon, Perea, Sebastián-Gallés, Martí, & Carreiras, 2013) for Spanish, but it takes considerable independent work for a researcher to combine these resources with, for example, neighborhood statistics from the other tools.

Furthermore, while these tools are incredibly useful, they have limitations. Many require using web interfaces, so a researcher's workflow must include interacting with the websites, documenting the steps taken, and importing lists of items into the researcher's local workflow (e.g., into R; R Core Team, 2019). One might argue that this is not a major inconvenience, but other limitations are more severe. For example, so far as we are aware, the computer code used to search lexicons on the sites listed above is not readily available, so a researcher can neither easily confirm the code's validity nor extend it (for example, to include a new type of potential competitor). Another limitation is that some tools have a predefined lexicon, and a researcher cannot substitute another in its place. Substituting your own lexicon might be useful if you simply prefer a different lexicon, if you were using an artificial lexicon (either with human subjects or with a computational model), or if you wanted to examine an understudied language or dialect. Finally, we assume that many labs and researchers have developed and will continue to develop their own code for lexical searches. This duplication of effort is unfortunate. An open-source, extensible tool shared via a version-control repository would allow researchers to collaborate and share their extensions, reducing duplication of effort.

We have developed a lightweight R package, LexFindR (Li, Crinnion, & Magnuson, 2020), which addresses these limitations. LexFindR comes with a suite of lexical relation finders for common competitor types used in studies of spoken and/or visual word recognition (e.g., neighbors, cohort [onset] competitors, and rhymes), but is also easily extended to incorporate new definitions. LexFindR is also fast, as it uses R’s parallelization capabilities to leverage multiple CPU cores (typically found even on contemporary laptops) and efficient core capabilities of R (e.g., R’s apply family of functions). Appendix A provides an example of how to put together aspects of the examples throughout the paper in order to efficiently gather information about multiple lexical dimensions in one script. In the following sections, we review how to install and use LexFindR. Details about how to share extensions with the community via LexFindR’s GitHub repository are provided in Appendix B.

Using LexFindR

Installing and loading LexFindR

The package is implemented in R and can be used like any R package. The package is available from the R package repository, CRAN. Users can install the stable version using the Tools > Install Packages menu in RStudio, or via the following command:

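For example, from the R console:

    install.packages("LexFindR")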

The most current development version can be installed from GitHub with the following commands:

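A sketch using the devtools package; the repository path shown is an assumption and should be checked against the package's GitHub page:

    # install.packages("devtools")  # if devtools is not already installed
    devtools::install_github("maglab-uconn/LexFindR")  # repository path assumed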

Once installed, the package can be loaded with the following command.

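For example:

    library(LexFindR)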

Getting started

The package comes with two lexicons: the 212-word slex lexicon (with only 14 phonemes) from the TRACE model of spoken word recognition (McClelland & Elman, 1986), included as a small data set for the user to experiment with, and a larger lexicon (lemmalex) that we compiled from various open-access, non-copyrighted materials. The primary source is the SUBTLEX subtitle corpus (Brysbaert & New, 2009), which we cross-referenced with the copyrighted Francis and Kučera (1982) database to reduce the word list to "lemma" (base or uninflected) forms. Pronunciations for both lexicons were drawn from the larger CMU Pronouncing Dictionary (CMU Computer Science, 2020) without lexical stress (with those for slex transcribed by Nenadić & Tucker, 2020a). The second lexicon is large enough to demonstrate the full capabilities of the package. The two data sets are automatically loaded when we load LexFindR. We can use the tidyverse (Wickham et al., 2019) glimpse function to view essential details about the lexicons, and view their first few lines.

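A sketch of inspecting both lexicons (glimpse is attached via the tidyverse; head is base R):

    library(tidyverse)
    glimpse(slex)       # 212 rows; fields Item, Pronunciation, Frequency
    head(slex)
    glimpse(lemmalex)
    head(lemmalex)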

Both lexicons are loaded as R dataframes with three fields. "Item" is a label (orthography in the case of lemmalex, and transcriptions in the original phonemic conventions used for the TRACE model in the case of slex). "Pronunciation" is a space-delimited phonemic transcription using the ARPAbet conventions of the CMU Pronouncing Dictionary (ARPAbet transcriptions for TRACE items are from Nenadić & Tucker, 2020b). We will discuss shortly how to specify alternative delimiters, including a "null" delimiter for working with orthographic forms or pronunciation forms that use one character per phoneme without spaces. "Frequency" is occurrences-per-million words; frequencies are based on Kučera and Francis (1967) for slex and on Brysbaert and New (2009) for lemmalex.

More information about the lexicons can be queried with the '?' command (we do not present the output here as it is rather extensive):

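For example:

    ?slex
    ?lemmalex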

Note that you can use any lexicon you can load into an R dataframe. You may find it convenient to use the same field names as in slex and lemmalex, but it is not necessary. For work on phonological word forms, you typically will have both “Item” (usually orthography) and “Pronunciation”, but as we will see later, you can do useful things with LexFindR with any list of forms, including orthographic forms. To use this package with orthographic forms, refer to the section below on Working with orthography or other “undelimited” forms, or other delimiters.

LexFindR commands

Table 1 provides a list of LexFindR commands along with brief descriptions. To use any of the LexFindR functions, we provide a target pattern and a word list to compare it to. LexFindR will compare the target pattern to the patterns in the word list to find items that have particular relations to the target. The functions can return indices of items that meet the criteria of the function, but we can also tell LexFindR to return instead the list of matching forms, the list of accompanying labels for matching forms (e.g., spellings), or the frequencies of matching forms. As we progress through examples, we will see when these different options are useful.

Table 1 LexFindR functions briefly described

Cohorts

Let’s begin with cohorts. Cohorts are words that overlap at word onset, and are called “cohorts” because they comprise the set of words predicted to be strongly activated as a spoken word is heard (and thus to form the recognition cohort) by the Cohort Model (Marslen-Wilson & Welsh, 1978). While definitions vary, LexFindR is equipped to handle overlap in any number of phonemes. By default, it uses a very common cohort definition: overlap in the first two phonemes. However, it contains a parameter – overlap – to allow the researcher to adjust how many initial phonemes must match for two words to be cohorts. We can get the set of cohort indices for a pattern with a command like this for the pronunciation of CAR:

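A sketch, assuming the target comes first and the lexicon of pronunciations second, and assuming the ARPAbet transcription "K AA R" for CAR:

    car_pron <- "K AA R"                        # space-delimited ARPAbet (assumed transcription)
    get_cohorts(car_pron, slex$Pronunciation)
    # returns the indices 66 67 68 69 70 71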

This tells us that slex entries 66-71 are cohorts of CAR (overlapping in at least the initial two positions, since 2 is the default overlap). To get the competitors themselves rather than the indices, we could specify that we want forms:

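Here we assume the option for returning forms is a logical argument named form:

    get_cohorts(car_pron, slex$Pronunciation, form = TRUE)  # form = TRUE assumed to return pronunciations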

To see the labels of those items (in TRACE’s phonemic transcriptions), we can use standard R conventions (and should see the phonemic transcriptions for COLLEAGUE, COP, COPY, CAR, CARD, and CARPET):

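A sketch using the returned indices with standard R subsetting:

    cohort_idx <- get_cohorts(car_pron, slex$Pronunciation)
    slex$Item[cohort_idx]   # TRACE-style labels of the cohort members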

Alternatively, we could request the count of cohorts:

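Here we assume the count option is a logical argument named count:

    get_cohorts(car_pron, slex$Pronunciation, count = TRUE)
    # returns 6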

That is not a large number of cohorts. Let’s compare it to the count we get from lemmalex:

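The same call against the lemmalex pronunciations (count argument assumed as above):

    get_cohorts(car_pron, lemmalex$Pronunciation, count = TRUE)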

As expected, we get many more from a more realistically sized lexicon. Note that most LexFindR functions have exactly the same structure, returning indices by default, but with options to return forms or counts.

Finally, let’s see how we can change the cohort definition in terms of how many phonemes must match. Let’s say we want to try a definition of cohorts with overlap in the first three phonemes for the cohort of CARD:

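A sketch, assuming "K AA R D" as the ARPAbet transcription of CARD:

    get_cohorts("K AA R D", slex$Pronunciation, form = TRUE, overlap = 3)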

We could repeat any of the preceding example commands with 3-phoneme overlap by simply adding “overlap = 3” to each command.

Neighborhood

Neighbors are another competitor type often considered in word recognition research. The standard neighbor definition for spoken words comes from the Neighborhood Activation Model (NAM; Luce & Pisoni, 1998). While NAM includes a graded similarity rule, researchers most often use the simpler DAS rule: two words are considered neighbors (and are expected to be strongly activated if either one is heard) if they differ by no more than a single phonemic deletion, addition, or substitution. For example, CAR (/kar/) has many neighbors, including the deletion neighbor ARE (note that neighbors are based on pronunciation here, not spelling), addition neighbors SCAR and CARD, and substitution neighbors at every position, such as BAR, CORE, and COP (though as we will see, CAR has no medial [vowel] substitution neighbors in slex). Let's look at CAR's neighbors in slex, using analogous commands to those we used for cohorts.

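A sketch paralleling the cohort examples (indices, forms, labels, and count; the form and count argument names are assumptions, as above):

    get_neighbors(car_pron, slex$Pronunciation)                 # indices of CAR's neighbors
    get_neighbors(car_pron, slex$Pronunciation, form = TRUE)    # their pronunciations
    neighbor_idx <- get_neighbors(car_pron, slex$Pronunciation)
    slex$Item[neighbor_idx]                                     # their labels
    get_neighbors(car_pron, slex$Pronunciation, count = TRUE)   # how many there are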

Note that in visual word recognition, it is much more common to consider only substitution neighbors (often called "Coltheart's N"; Coltheart, Davelaar, Jonasson, & Besner, 1977). So if you are working with orthography, you may only want substitution neighbors. Or perhaps you would like to explore the relative impact of deletion, addition, and substitution neighbors. LexFindR's get_neighbors function anticipates the potential need for such flexibility. By default, it assumes you want all three, but you can specify any single type or any combination with the neighbors argument, specifying deletion neighbors with "d", addition neighbors with "a", and/or substitution neighbors with "s". Here are some examples:

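Sketches of restricting the neighbor types; here we assume the neighbors argument accepts the letter codes described above, singly or as a vector:

    get_neighbors(car_pron, slex$Pronunciation, neighbors = "d", form = TRUE)          # deletion neighbors only
    get_neighbors(car_pron, slex$Pronunciation, neighbors = "a", form = TRUE)          # addition neighbors only
    get_neighbors(car_pron, slex$Pronunciation, neighbors = "s", form = TRUE)          # substitution neighbors only
    get_neighbors(car_pron, slex$Pronunciation, neighbors = c("d", "a"), form = TRUE)  # deletions and additions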

Of course, we can easily do other things using basic R commands, such as determine what proportion of CAR’s neighbors are substitution neighbors:

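A sketch using the counts:

    n_sub <- get_neighbors(car_pron, slex$Pronunciation, neighbors = "s", count = TRUE)
    n_all <- get_neighbors(car_pron, slex$Pronunciation, count = TRUE)
    n_sub / n_all   # proportion of CAR's neighbors that are substitution neighbors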

Other competitor types

In addition to cohorts and neighbors, LexFindR comes with analogous functions for several other similarity types.

  • get_rhymes: returns items that mismatch at word onset by no more than a specified number of phonemes, controlled by a mismatch argument that the user can supply. The default mismatch is 1 phoneme, meaning the function returns items that mismatch at word onset by at most one phoneme (so not a standard definition of poetic rhyme or phonological rime). With this default, rhymes will include items that are addition or deletion neighbors at first position (e.g., CAR's rhymes will include ARE and SCAR) as well as substitution neighbors at position 1 (e.g., BAR, TAR). If mismatch were set to 2, for example, CAR would additionally match any 3-phoneme word ending in /r/ and any 4-phoneme word ending in /ar/. (Example calls for this and the following three functions are sketched just after this list.)

  • get_embeds_in_target: returns items that are embedded within a target word. For SCAR, this would include ARE and CAR.

  • get_target_embeds_in: returns items that the target embeds within. For CAR, this would include SCAR and CARD.

  • get_homoforms: returns items with the same form as the target. We use “homoform” because these would be homophones for phonological forms but homonyms for orthographic forms.
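For illustration, sketches of calls to these four functions (the transcriptions and the form argument are assumptions, as above):

    get_rhymes(car_pron, slex$Pronunciation, form = TRUE)              # mismatch = 1 by default
    get_embeds_in_target("S K AA R", slex$Pronunciation, form = TRUE)  # words embedded in SCAR, e.g., ARE, CAR
    get_target_embeds_in(car_pron, slex$Pronunciation, form = TRUE)    # words CAR embeds in, e.g., SCAR, CARD
    get_homoforms(car_pron, slex$Pronunciation, form = TRUE)           # items with the same form as CAR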

LexFindR also anticipates the possibility that a researcher may want to find competitor types that do not overlap. For example, CARD is both a cohort and a neighbor of CAR, so which set should it appear in? We propose a novel category called nohorts – neighbors that are also cohorts – and provide “P” (pure) versions of several competitor-type functions that return non-overlapping sets.

  • get_nohorts: Cohorts and neighbors are overlapping sets, although not all cohorts are neighbors (e.g., CAR and CARPET are cohorts but not neighbors) and not all neighbors are cohorts. Nohorts are the intersection of cohorts and neighbors. Note that the target word will be part of the nohort set, and not part of cohortsP or neighborsP, which we define next.

  • get_cohortsP: the set of “pure” cohorts that are not also neighbors.

  • get_neighborsP: the set of “pure” neighbors that are not also cohorts or rhymes.

  • get_embeds_in_targetP: set of items that embed in the target that are not also cohorts or neighbors.

  • get_target_embeds_inP: set of items that the target embeds in that are not also cohorts or neighbors.

The nohort and “P” functions use the base-R intersect and setdiff functions to find set intersections and differences. To see the code for any function in R, you can simply enter the function name with no arguments and no following parentheses. Let’s look at the code for get_nohorts. Many of the details provided may not be useful for a typical user, but the intersect command is the interesting part of this example.

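For example:

    get_nohorts
    # prints the function body; note the call to intersect()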

Now let’s examine the get_neighborsP function to see how setdiff is used to find “pure” sets.

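For example:

    get_neighborsP
    # prints the function body; note the nested setdiff() calls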

This function uses nested setdiff calls to first find neighbors excluding cohorts and then to exclude rhymes from that set. A user could use these functions as examples to create their own specific subsets of items.

Form length

You may wish to calculate form length. This is easy to do with base R. If you use CMU pronunciations, as in lemmalex, you can count the space-delimited phonemes by splitting each form on whitespace and applying the lengths command in R.

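A sketch for space-delimited pronunciations:

    # number of phonemes in each lemmalex pronunciation
    lengths(strsplit(lemmalex$Pronunciation, " "))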

If you have a null-delimited form, where each character is a single letter or phoneme, you can use the nchar function.

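For example, for undelimited forms such as the TRACE transcriptions in slex$Item:

    nchar(slex$Item)   # one character per phoneme, so nchar gives form length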

Uniqueness point

We have added one other common lexical dimension to the LexFindR functions (get_uniqpt), which is the uniqueness point (UP) of a form. This is the position at which an item becomes the only completion in the lexicon. For example, in slex, /kard/ (CARD) becomes unique at position 4, as does /karpˆt/ (CARPET). SCAR becomes unique at position 3. CAR (/kar/) is not unique at its final position, so its uniqueness point is set to its length plus one.

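A sketch, again assuming target-then-lexicon argument order and the ARPAbet transcriptions shown:

    get_uniqpt(car_pron, slex$Pronunciation)      # CAR: returns 4 (length 3 + 1; never unique)
    get_uniqpt("S K AA R", slex$Pronunciation)    # SCAR: returns 3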

Again, CAR is not unique by word offset, so its UP is its length plus one. SCAR becomes unique at position 3, one before its offset. Let’s consider some additional useful steps. We could normalize UPs by dividing them by word length plus one, the maximal possible score. So CARD would have a normalized UP of 0.8 (4/5), while CARPET’s would be 0.57 (4/7), and CAR’s would be 1.0 (4/4). Here are some examples.

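A sketch of normalizing, using CARD ("K AA R D" assumed) as the example:

    up  <- get_uniqpt("K AA R D", slex$Pronunciation)    # 4
    len <- lengths(strsplit("K AA R D", " "))            # 4 phonemes
    up / (len + 1)                                       # 0.8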

Helper functions

LexFindR includes two helper functions that can be applied to the output of other functions: get_fw and get_fwcp.

Log frequency weights: get_fw

Intuitively, the number (count) of potential competitors may be important, but some competitors might have more influence than others; in particular, words with higher frequency-of-occurrence may compete more strongly. So we may wish to consider the frequencies of competitors. We can use the indices returned by functions like get_cohorts or get_neighbors to get the frequencies of the items. Let's do this for the word CAR in slex and lemmalex and get some summary statistics.

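A sketch; the variable names match those used below:

    slex_cohort_idx <- get_cohorts(car_pron, slex$Pronunciation)
    slex_cohort_frequencies <- slex$Frequency[slex_cohort_idx]
    summary(slex_cohort_frequencies)

    llex_cohort_idx <- get_cohorts(car_pron, lemmalex$Pronunciation)
    llex_cohort_frequencies <- lemmalex$Frequency[llex_cohort_idx]
    summary(llex_cohort_frequencies)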

Typically, frequencies are log scaled, as this provides a better fit when they are used to predict human behavior (e.g., word recognition time). It would be useful, therefore, to weight the count of competitors by log frequencies. The LexFindR helper function get_fw does this. You supply it with a list of frequencies, and it takes their logs and returns the sum. This is simple enough that you could do it with basic R functions yourself. However, get_fw provides some useful error checking. Specifically, it checks whether the minimum frequency in your set of frequencies is less than 1, since taking the log would return a negative value. If so, it also suggests a minimum constant to specify for pad to add to each frequency before taking the log. Let’s consider how we might use this. First, let’s try using get_fw to give us summed log frequencies for the frequencies we collected above for CAR’s slex cohorts.

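For example:

    get_fw(slex_cohort_frequencies)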

This gives us the sum without any problem, as the minimum frequency in slex_cohort_frequencies is greater than 1. Now let’s try with llex_cohort_frequencies.

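For example:

    get_fw(llex_cohort_frequencies)
    # returns 55.64038, along with a warning that the minimum frequency is below 1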

Now we get a value (55.64038) but also a warning because the minimum value is less than 1. So let's add the pad option. Padding with 1 brings our minimum to a value greater than 1, so that no log frequency is zero or negative.

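For example:

    get_fw(llex_cohort_frequencies, pad = 1)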

Log Frequency-Weighted Competitor Probabilities: get_fwcp

We could go a step beyond frequency weights and calculate the Frequency-Weighted Competitor Probability (FWCP) of a word, inspired by the Neighborhood Activation Model's Frequency-Weighted Neighborhood Probability (FWNP; Luce & Pisoni, 1998). This is calculated as the ratio of the target word's log frequency to the summed log frequencies of all words meeting the competitor definition, as in the following equation.

$$FWCP = \frac{\log(Frequency_{target})}{\sum_{c \in competitors} \log(Frequency_{c})}$$

Notably, on most competitor definitions, this includes the target word itself, so we can think of the ratio as expressing what proportion of the “frequency weight” of the target’s competitors is contributed by the target itself. For spoken words, the larger the ratio, the more easily the target word tends to be recognized. To calculate this with LexFindR, we supply a set of competitor frequencies and the target word’s frequency to the get_fwcp function. Note that we can include a pad option as for get_fw, and it will be applied to both the target word’s frequency and the list of competitor frequencies; again, this should be done if the minimum frequency value is less than 1. Let’s verify that the minimum frequency in slex is greater than 1.

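For example:

    min(slex$Frequency)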

The next two code blocks demonstrate how to get the FWCP for neighbors (i.e., the FWNP) and then for cohorts.

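A sketch; we assume get_fwcp takes the target frequency first and the competitor frequencies second:

    car_freq <- slex$Frequency[slex$Pronunciation == car_pron]

    # FWNP: frequency-weighted probability over CAR's neighbors
    neighbor_freqs <- slex$Frequency[get_neighbors(car_pron, slex$Pronunciation)]
    get_fwcp(car_freq, neighbor_freqs)

    # FWCP over CAR's cohorts
    cohort_freqs <- slex$Frequency[get_cohorts(car_pron, slex$Pronunciation)]
    get_fwcp(car_freq, cohort_freqs)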

Note that get_fwcp is not simply computing the ratio of target-to-competitor frequencies; it is first converting the frequencies to log frequencies. If your lexicon file has frequencies already in log form, you should not use the get_fwcp function, but instead you should calculate the ratios directly. Also note that it is fairly standard to express frequencies as occurrences-per-million. If your basis is different (e.g., occurrences-per-six million), you may want to transform your frequencies to the more standard per-million basis. Finally, we recommend that you examine distributions before using the results of get_fwcp, as these often exhibit difficult-to-mitigate deviations from normality. One may be better served by examining target frequencies and competitor frequency weights (obtained with get_fw) separately.

Working with orthography or other “undelimited” forms, or other delimiters

By default, LexFindR functions expect the forms you supply to be space-delimited, which is the typical convention for CMU pronunciations. Using a delimiter allows you to have form codes (typically phoneme codes) made up of more than one character. But what if you want to work with orthography, or a phoneme code that uses one character per phoneme without delimiters? You can simply specify sep = "" to indicate that your forms have a "null" delimiter. We can illustrate this with the orthography in the "Item" field in lemmalex.

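A sketch using orthographic forms (the example word and the form argument are our choices):

    # orthographic cohorts of "car" in lemmalex (items sharing the first two letters)
    get_cohorts("car", lemmalex$Item, sep = "", form = TRUE)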

Now let’s try it with TRACE’s original phoneme encodings, which use one character per phoneme. Those original forms are in the “Item” field of slex:

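A sketch, assuming "kar" as TRACE's one-character-per-phoneme transcription of CAR:

    get_cohorts("kar", slex$Item, sep = "", form = TRUE)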

Batch processing with target list and lexicon

Often, we may need to get the competitors for each word in the lexicon, with respect to the entire lexicon. This would be a prerequisite for selecting words with relatively many vs. few neighbors, for example. One way to do this would be to use the base R function lapply. Here is how we could do this for cohorts. The final glimpse command will show us the first few instances of each field.

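A sketch; target_df and cohort_idx follow the names used in the text:

    target_df <- slex
    target_df$cohort_idx <- lapply(target_df$Pronunciation,
                                   function(p) get_cohorts(p, slex$Pronunciation))
    glimpse(target_df)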

Consider the cohort_idx field. We can see that /ad/ (ODD) has only one cohort (itself), while /ar/ (ARE) has four (items 2, 3, 4, 5, or /ar/, /ark/, /art/, and /artˆst/, i.e., ARE, ARK, ART, ARTIST).

What if we also want the lists of cohort forms or labels and frequencies? Rather than calling the function three times, we could speed up the process (speed will be very important when we work with large lexicons!) by calling get_cohorts only once, and then using the indices to get the other items we want. In the next example, we keep working with target_df and its new field cohort_idx (which has the list of indices [row counts] of records that meet the cohort definition for each target).

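A sketch reusing the stored indices; the new field names match those discussed below:

    target_df$cohort_str   <- lapply(target_df$cohort_idx, function(i) slex$Item[i])
    target_df$cohort_freq  <- lapply(target_df$cohort_idx, function(i) slex$Frequency[i])
    target_df$cohort_count <- lengths(target_df$cohort_idx)
    target_df$cohort_fw    <- sapply(target_df$cohort_freq, get_fw)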

Let’s look at the results:

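For example:

    glimpse(target_df)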

We can see that cohort_idx, cohort_str, and cohort_freq all contain lists, and we can verify that for a given word, the lists are the same length (e.g., one frequency for each cohort form). There should be only one value per target word in cohort_count and cohort_fw, which we can see is the case as well.

Working with different target and lexicon lists

In some cases, you may only want to get details for a subset of items in the lexicon – or even for a list of forms that are not in the lexicon. In these cases, you can simply specify a shorter target list rather than making the target list and lexicon the same. Note that of course, if you do not have frequencies for your items, you will not be able to use the get_fwcp command. As an example, we might want to examine what the neighborhoods of the words in the TRACE lexicon would be in the context of a realistically sized lexicon. We can do this by using slex as our target list and lemmalex as our lexicon.

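A sketch using slex items as targets against the lemmalex lexicon:

    target_df <- slex
    target_df$cohort_idx   <- lapply(target_df$Pronunciation,
                                     function(p) get_cohorts(p, lemmalex$Pronunciation))
    target_df$cohort_count <- lengths(target_df$cohort_idx)
    glimpse(target_df)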

Comparing this to our earlier results, we see that ODD would have four cohorts in lemmalex instead of one within slex.

Parallelizing for speed

If we are getting competitors for every word in a lexicon, speed becomes a concern, especially if we want to do this for many competitor types. To quantify this, let’s time how long it takes to calculate cohorts for all words in lemmalex. We will use the R tictoc package (Izrailev, 2014) to time the process. For this demonstration, we are using a MacBook Pro with an Intel Core i9 CPU and 32 GB of RAM.

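A sketch of the timed, serial version; the second (sapply) step is included so there is an additional apply call, as referred to below, and pad = 1 is used because lemmalex contains frequencies below 1:

    library(tictoc)

    tic("cohorts for all of lemmalex, serial")
    lemma_df <- lemmalex
    lemma_df$cohort_idx <- lapply(lemma_df$Pronunciation,
                                  function(p) get_cohorts(p, lemmalex$Pronunciation))
    lemma_df$cohort_fw  <- sapply(lemma_df$cohort_idx,
                                  function(i) get_fw(lemmalex$Frequency[i], pad = 1))
    toc()
    # on the laptop described above, roughly 111 seconds elapsed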

On our demonstration laptop, get_cohorts with lapply took ~111 seconds (on an older workstation we tested, it took several minutes). If you only have to do this once, that may be tolerable. But we can do better! We could easily parallelize using the R future framework and the future.apply package (Bengtsson, 2013), which provides parallel equivalents of the apply family, such as future_lapply. There are various ways to engage multiple cores with this package, as detailed in its documentation. The plan(multisession, workers = num_cores) approach is quite convenient, and works on Windows, Macintosh, and Linux with RStudio and base R. In the following code block, we show how to load future.apply and set things up to use multiple cores.

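A sketch of the setup; availableCores() is one way to choose the number of workers:

    library(future.apply)   # also attaches the future package

    num_cores <- availableCores()
    plan(multisession, workers = num_cores)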

With this setup, the only thing left to do is to replace our apply functions with their future.apply equivalents. In the example below, we just replace lapply with future_lapply to parallelize the function that gets competitors (there is no real need to do this with the other apply call, as it is not the bottleneck; in fact, it is so poorly suited for parallelization that it is slowed by a factor of ~10 if we parallelize it with its future.apply equivalent).

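A sketch of the parallel version; only the lapply call is replaced with future_lapply:

    tic("cohorts for all of lemmalex, parallel")
    lemma_df <- lemmalex
    lemma_df$cohort_idx <- future_lapply(lemma_df$Pronunciation,
                                         function(p) get_cohorts(p, lemmalex$Pronunciation))
    lemma_df$cohort_fw  <- sapply(lemma_df$cohort_idx,
                                  function(i) get_fw(lemmalex$Frequency[i], pad = 1))
    toc()
    # roughly 35 seconds elapsed on the same laptop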

We see an improvement from 111 seconds to approximately 35; it took a bit more than three times longer without parallelization. On the older workstation, the improvement was more dramatic, from several minutes to around 35 seconds (around ten times faster with parallelization). Again, such differences may not seem important if you are running a search once, but if you want to do many different kinds of searches, or explore novel similarity definitions, speed will become important. In Appendix A, we present an example of parallelized code for conducting several LexFindR competitor searches in series.

Conclusions

LexFindR fills important gaps in the language scientist's toolkit. It provides a free, fast, extensible, tested, and readily shared tool that can be integrated into a typical analysis workflow within R. Researchers inclined to contribute extensions to LexFindR should refer to Appendix B for basic guidance on how to do so. We hope our fellow researchers will find LexFindR useful.