Introduction

Language scientists often need to generate sets of related words or words with specific properties. This might be in service of experimental control (e.g., words matched on length and frequency of occurrence, but differing in lexical neighborhood; Luce & Pisoni, 1998). Or the need might arise from a theoretically motivated or model-driven hypothesis; perhaps your theory proposes – or your model simulations predict – that shorter words embedded within a word should make that word more difficult to process, so you want to find words with many or few words embedded within them. Sets of related items and their characteristics can also be useful for clinical purposes. For example, frequency-weighted lexical neighborhoods have proven useful for clinical assessments and interventions (e.g., Kirk, Pisoni, & Osberger, 1995; Morrisette & Gierut, 2002; Sommers & Danielson, 1999; Storkel, Bontempo, Aschenbrenner, Maekawa, & Lee, 2013; Storkel, Maekawa, & Hoover, 2010). So how do we generate these lists?

Various excellent tools already exist. For example, three web-based tools are Michael Vitevitch's phonotactic probability (Vitevitch & Luce, 1998, 1999) and neighborhood density calculators (http://www.people.ku.edu/~mvitevit/PhonoProbHome.html), the English Lexicon Project (https://elexicon.wustl.edu/; Balota et al., 2007), and the recent Auditory English Lexicon Project (https://inetapps.nus.edu.sg/aelp; Goh, Yap, & Chee, 2020). Other tools exist for semantic variables or languages other than English, such as Lexique, which includes English and French (http://www.lexique.org/; New, Pallier, Brysbaert, & Ferrand, 2004), the multilingual CLEARPOND (https://clearpond.northwestern.edu/; Marian, Bartolotti, Chabal, & Shook, 2012), and EsPal (https://www.bcbl.eu/databases/espal/; Duchon, Perea, Sebastián-Gallés, Martí, & Carreiras, 2013) for Spanish, but it takes considerable independent work for a researcher to combine these resources with, for example, neighborhood statistics from the other tools.

Furthermore, while these tools are incredibly useful, they have limitations. Many require using web interfaces, so a researcher's workflow must include interacting with the websites, documenting the steps taken, and importing lists of items into the researcher's local workflow (e.g., into R; R Core Team, 2019). One might argue that this is not a major inconvenience, but other limitations are more severe. For example, so far as we are aware, the computer code used to search lexicons on the sites listed above is not readily available, so a researcher can neither easily confirm the code's validity nor extend it (for example, to include a new type of potential competitor). Another limitation is that some tools have a predefined lexicon, and a researcher cannot substitute another in its place. Substituting your own lexicon might be useful if you simply prefer a different lexicon, if you were using an artificial lexicon (either with human subjects or with a computational model), or if you wanted to examine an understudied language or dialect. Finally, we assume that many labs and researchers have developed and will continue to develop their own code for lexical searches. This duplication of effort is unfortunate. An open-source, extensible tool shared via a version-control repository would allow researchers to collaborate and share their extensions, reducing duplication of effort.

We have developed a lightweight R package, LexFindR (Li, Crinnion, & Magnuson, 2020), which addresses these limitations. LexFindR comes with a suite of lexical relation finders for common competitor types used in studies of spoken and/or visual word recognition (e.g., neighbors, cohort [onset] competitors, and rhymes), but is also easily extended to incorporate new definitions. LexFindR is also fast, as it uses R’s parallelization capabilities to leverage multiple CPU cores (typically found even on contemporary laptops) and efficient core capabilities of R (e.g., R’s apply family of functions). Appendix A provides an example of how to put together aspects of the examples throughout the paper in order to efficiently gather information about multiple lexical dimensions in one script. In the following sections, we review how to install and use LexFindR. Details about how to share extensions with the community via LexFindR’s GitHub repository are provided in Appendix B.

Using LexFindR

Installing and loading LexFindR

The package is implemented in R and can be used like any R package. The package is available from the R package repository, CRAN. Users can install the stable version using the Tools > Install Packages menu in RStudio, or via the following command:

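For example, from the R console:

    install.packages("LexFindR")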

The most current development version can be installed from GitHub with the following commands:

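A sketch using the devtools package; the repository path shown is an assumption and should be checked against the package's GitHub page:

    # install.packages("devtools")  # if devtools is not already installed
    devtools::install_github("maglab-uconn/LexFindR")  # repository path assumed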

Once installed, the package can be loaded with the following command.

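For example:

    library(LexFindR)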

Getting started

The package comes with two lexicons: the 212-word slex lexicon (with only 14 phonemes) from the TRACE model of spoken word recognition (McClelland & Elman, 1986), included as a small data set for the user to experiment with, and a larger lexicon (lemmalex) that we compiled from various open-access, non-copyrighted materials. The primary source is the SUBTLEX subtitle corpus (Brysbaert & New, 2009), which we cross-referenced with the copyrighted Francis and Kučera (1982) database to reduce the word list to "lemma" (base or uninflected) forms. Pronunciations for both lexicons were drawn from the larger CMU Pronouncing Dictionary (CMU Computer Science, 2020) without lexical stress (with those for slex transcribed by Nenadić & Tucker, 2020a). The second lexicon is large enough to demonstrate the full capabilities of the package. The two data sets are automatically loaded when we load LexFindR. We can use the tidyverse (Wickham et al., 2019) glimpse function to view essential details about the lexicons, and view their first few lines.

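A sketch of inspecting both lexicons (glimpse is attached via the tidyverse; head is base R):

    library(tidyverse)
    glimpse(slex)       # 212 rows; fields Item, Pronunciation, Frequency
    head(slex)
    glimpse(lemmalex)
    head(lemmalex)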

Both lexicons are loaded as R dataframes with three fields. "Item" is a label (orthography in the case of lemmalex, and transcriptions in the original phonemic conventions used for the TRACE model in the case of slex). "Pronunciation" is a space-delimited phonemic transcription using the ARPAbet conventions of the CMU Pronouncing Dictionary (ARPAbet transcriptions for TRACE items are from Nenadić & Tucker, 2020b). We will discuss shortly how to specify alternative delimiters, including a "null" delimiter for working with orthographic forms or pronunciation forms that use one character per phoneme without spaces. "Frequency" is occurrences-per-million words; frequencies are based on Kučera and Francis (1967) for slex and on Brysbaert and New (2009) for lemmalex.

More information about the lexicons can be queried with the '?' command (we do not present the output here as it is rather extensive):

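For example:

    ?slex
    ?lemmalex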

Note that you can use any lexicon you can load into an R dataframe. You may find it convenient to use the same field names as in slex and lemmalex, but it is not necessary. For work on phonological word forms, you typically will have both “Item” (usually orthography) and “Pronunciation”, but as we will see later, you can do useful things with LexFindR with any list of forms, including orthographic forms. To use this package with orthographic forms, refer to the section below on Working with orthography or other “undelimited” forms, or other delimiters.

LexFindR commands

Table 1 provides a list of LexFindR commands along with brief descriptions. To use any of the LexFindR functions, we provide a target pattern and a word list to compare it to. LexFindR will compare the target pattern to the patterns in the word list to find items that have particular relations to the target. The functions can return indices of items that meet the criteria of the function, but we can also tell LexFindR to return instead the list of matching forms, the list of accompanying labels for matching forms (e.g., spellings), or the frequencies of matching forms. As we progress through examples, we will see when these different options are useful.

Table 1 LexFindR functions briefly described

Cohorts

Let’s begin with cohorts. Cohorts are words that overlap at word onset, and are called “cohorts” because they comprise the set of words predicted to be strongly activated as a spoken word is heard (and thus to form the recognition cohort) by the Cohort Model (Marslen-Wilson & Welsh, 1978). While definitions vary, LexFindR is equipped to handle overlap in any number of phonemes. By default, it uses a very common cohort definition: overlap in the first two phonemes. However, it contains a parameter – overlap – to allow the researcher to adjust how many initial phonemes must match for two words to be cohorts. We can get the set of cohort indices for a pattern with a command like this for the pronunciation of CAR:

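A sketch, assuming the target comes first and the lexicon of pronunciations second, and assuming the ARPAbet transcription "K AA R" for CAR:

    car_pron <- "K AA R"                        # space-delimited ARPAbet (assumed transcription)
    get_cohorts(car_pron, slex$Pronunciation)
    # returns the indices 66 67 68 69 70 71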

This tells us that slex entries 66-71 are cohorts of CAR (overlapping in at least the initial two positions, since 2 is the default overlap). To get the competitors themselves rather than the indices, we could specify that we want forms:

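Here we assume the option for returning forms is a logical argument named form:

    get_cohorts(car_pron, slex$Pronunciation, form = TRUE)  # form = TRUE assumed to return pronunciations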

To see the labels of those items (in TRACE’s phonemic transcriptions), we can use standard R conventions (and should see the phonemic transcriptions for COLLEAGUE, COP, COPY, CAR, CARD, and CARPET):

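A sketch using the returned indices with standard R subsetting:

    cohort_idx <- get_cohorts(car_pron, slex$Pronunciation)
    slex$Item[cohort_idx]   # TRACE-style labels of the cohort members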

Alternatively, we could request the count of cohorts:

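Here we assume the count option is a logical argument named count:

    get_cohorts(car_pron, slex$Pronunciation, count = TRUE)
    # returns 6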

That is not a large number of cohorts. Let’s compare it to the count we get from lemmalex:

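The same call against the lemmalex pronunciations (count argument assumed as above):

    get_cohorts(car_pron, lemmalex$Pronunciation, count = TRUE)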

As expected, we get many more from a more realistically sized lexicon. Note that most LexFindR functions have exactly the same structure, returning indices by default, but with options to return forms or counts.

Finally, let’s see how we can change the cohort definition in terms of how many phonemes must match. Let’s say we want to try a definition of cohorts with overlap in the first three phonemes for the cohort of CARD:

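A sketch, assuming "K AA R D" as the ARPAbet transcription of CARD:

    get_cohorts("K AA R D", slex$Pronunciation, form = TRUE, overlap = 3)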

We could repeat any of the preceding example commands with 3-phoneme overlap by simply adding “overlap = 3” to each command.

Neighborhood

Neighbors are another competitor type often considered in word recognition research. The standard neighbor definition for spoken words comes from the Neighborhood Activation Model (NAM; Luce & Pisoni, 1998). While NAM includes a graded similarity rule, researchers most often use the simpler DAS rule: two words are considered neighbors (and are expected to be strongly activated if either one is heard) if they differ by no more than a single phonemic deletion, addition, or substitution. For example, CAR (/kar/) has many neighbors, including the deletion neighbor ARE (note that neighbors are based on pronunciation here, not spelling), addition neighbors SCAR and CARD, and substitution neighbors at every position, such as BAR, CORE, and COP (though as we will see, CAR has no medial [vowel] substitution neighbors in slex). Let's look at CAR's neighbors in slex, using analogous commands to those we used for cohorts.

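A sketch paralleling the cohort examples (indices, forms, labels, and count; the form and count argument names are assumptions, as above):

    get_neighbors(car_pron, slex$Pronunciation)                 # indices of CAR's neighbors
    get_neighbors(car_pron, slex$Pronunciation, form = TRUE)    # their pronunciations
    neighbor_idx <- get_neighbors(car_pron, slex$Pronunciation)
    slex$Item[neighbor_idx]                                     # their labels
    get_neighbors(car_pron, slex$Pronunciation, count = TRUE)   # how many there are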

Note that in visual word recognition, it is much more common to consider only substitution neighbors (often called "Coltheart's N"; Coltheart, Davelaar, Jonasson, & Besner, 1977). So if you are working with orthography, you may only want substitution neighbors. Or perhaps you would like to explore the relative impact of deletion, addition, and substitution neighbors. LexFindR's get_neighbors function anticipates the potential need for such flexibility. By default, it assumes you want all three, but you can specify any single type or any combination with the neighbors argument, specifying deletion neighbors with "d", addition neighbors with "a", and/or substitution neighbors with "s". Here are some examples:

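Sketches of restricting the neighbor types; here we assume the neighbors argument accepts the letter codes described above, singly or as a vector:

    get_neighbors(car_pron, slex$Pronunciation, neighbors = "d", form = TRUE)          # deletion neighbors only
    get_neighbors(car_pron, slex$Pronunciation, neighbors = "a", form = TRUE)          # addition neighbors only
    get_neighbors(car_pron, slex$Pronunciation, neighbors = "s", form = TRUE)          # substitution neighbors only
    get_neighbors(car_pron, slex$Pronunciation, neighbors = c("d", "a"), form = TRUE)  # deletions and additions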

Of course, we can easily do other things using basic R commands, such as determine what proportion of CAR’s neighbors are substitution neighbors:

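A sketch using the counts:

    n_sub <- get_neighbors(car_pron, slex$Pronunciation, neighbors = "s", count = TRUE)
    n_all <- get_neighbors(car_pron, slex$Pronunciation, count = TRUE)
    n_sub / n_all   # proportion of CAR's neighbors that are substitution neighbors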

Other competitor types

In addition to cohorts and neighbors, LexFindR comes with analogous functions for several other similarity types.

  • get_rhymes: returns items that mismatch at word onset by no more than a specified number of phonemes, controlled by a mismatch argument that the user can supply. The default mismatch is 1 phoneme, meaning the function returns items that mismatch at word onset by at most one phoneme (so not a standard definition of poetic rhyme or phonological rime). With this default, rhymes will include items that are addition or deletion neighbors at first position (e.g., CAR's rhymes will include ARE and SCAR) as well as substitution neighbors at position 1 (e.g., BAR, TAR). If mismatch were set to 2, for example, CAR would additionally match any 3-phoneme word ending in /r/ and any 4-phoneme word ending in /ar/. (Example calls for this and the following three functions are sketched just after this list.)

  • get_embeds_in_target: returns items that are embedded within a target word. For SCAR, this would include ARE and CAR.

  • get_target_embeds_in: returns items that the target embeds within. For CAR, this would include SCAR and CARD.

  • get_homoforms: returns items with the same form as the target. We use “homoform” because these would be homophones for phonological forms but homonyms for orthographic forms.
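For illustration, sketches of calls to these four functions (the transcriptions and the form argument are assumptions, as above):

    get_rhymes(car_pron, slex$Pronunciation, form = TRUE)              # mismatch = 1 by default
    get_embeds_in_target("S K AA R", slex$Pronunciation, form = TRUE)  # words embedded in SCAR, e.g., ARE, CAR
    get_target_embeds_in(car_pron, slex$Pronunciation, form = TRUE)    # words CAR embeds in, e.g., SCAR, CARD
    get_homoforms(car_pron, slex$Pronunciation, form = TRUE)           # items with the same form as CAR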

LexFindR also anticipates the possibility that a researcher may want to find competitor types that do not overlap. For example, CARD is both a cohort and a neighbor of CAR, so which set should it appear in? We propose a novel category called nohorts – neighbors that are also cohorts – and provide “P” (pure) versions of several competitor-type functions that return non-overlapping sets.

  • get_nohorts: Cohorts and neighbors are overlapping sets, although not all cohorts are neighbors (e.g., CAR and CARPET are cohorts but not neighbors) and not all neighbors are cohorts. Nohorts are the intersection of cohorts and neighbors. Note that the target word will be part of the nohort set, and not part of cohortsP or neighborsP, which we define next.

  • get_cohortsP: the set of “pure” cohorts that are not also neighbors.

  • get_neighborsP: the set of “pure” neighbors that are not also cohorts or rhymes.

  • get_embeds_in_targetP: set of items that embed in the target that are not also cohorts or neighbors.

  • get_target_embeds_inP: set of items that the target embeds in that are not also cohorts or neighbors.

The nohort and “P” functions use the base-R intersect and setdiff functions to find set intersections and differences. To see the code for any function in R, you can simply enter the function name with no arguments and no following parentheses. Let’s look at the code for get_nohorts. Many of the details provided may not be useful for a typical user, but the intersect command is the interesting part of this example.

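For example:

    get_nohorts
    # prints the function body; note the call to intersect()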

Now let’s examine the get_neighborsP function to see how setdiff is used to find “pure” sets.

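For example:

    get_neighborsP
    # prints the function body; note the nested setdiff() calls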

This function uses nested setdiff calls to first find neighbors excluding cohorts and then to exclude rhymes from that set. A user could use these functions as examples to create their own specific subsets of items.

Form length

You may wish to calculate form length. This is easy to do with base R. If you use CMU pronunciations, as in lemmalex, you can count the space-delimited phonemes by splitting each form on whitespace and applying the lengths command in R.

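A sketch for space-delimited pronunciations:

    # number of phonemes in each lemmalex pronunciation
    lengths(strsplit(lemmalex$Pronunciation, " "))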

If you have a null-delimited form, where each character is a single letter or phoneme, you can use the nchar function.

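For example, for undelimited forms such as the TRACE transcriptions in slex$Item:

    nchar(slex$Item)   # one character per phoneme, so nchar gives form length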

Uniqueness point

We have added one other common lexical dimension to the LexFindR functions (get_uniqpt), which is the uniqueness point (UP) of a form. This is the position at which an item becomes the only completion in the lexicon. For example, in slex, /kard/ (CARD) becomes unique at position 4, as does /karpˆt/ (CARPET). SCAR becomes unique at position 3. CAR (/kar/) is not unique at its final position, so its uniqueness point is set to its length plus one.

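A sketch, again assuming target-then-lexicon argument order and the ARPAbet transcriptions shown:

    get_uniqpt(car_pron, slex$Pronunciation)      # CAR: returns 4 (length 3 + 1; never unique)
    get_uniqpt("S K AA R", slex$Pronunciation)    # SCAR: returns 3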

Again, CAR is not unique by word offset, so its UP is its length plus one. SCAR becomes unique at position 3, one before its offset. Let’s consider some additional useful steps. We could normalize UPs by dividing them by word length plus one, the maximal possible score. So CARD would have a normalized UP of 0.8 (4/5), while CARPET’s would be 0.57 (4/7), and CAR’s would be 1.0 (4/4). Here are some examples.

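A sketch of normalizing, using CARD ("K AA R D" assumed) as the example:

    up  <- get_uniqpt("K AA R D", slex$Pronunciation)    # 4
    len <- lengths(strsplit("K AA R D", " "))            # 4 phonemes
    up / (len + 1)                                       # 0.8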

Helper functions

LexFindR includes two helper functions that can be applied to the output of other functions: get_fw and get_fwcp.

Log frequency weights: get_fw

Intuitively, the number (count) of potential competitors may be important, but some competitors might have more influence than others; in particular, words with higher frequency-of-occurrence may compete more strongly. So we may wish to consider the frequencies of competitors. We can use the indices returned by functions like get_cohorts or get_neighbors to get the frequencies of the items. Let's do this for the word CAR in slex and lemmalex and get some summary statistics.

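A sketch; the variable names match those used below:

    slex_cohort_idx <- get_cohorts(car_pron, slex$Pronunciation)
    slex_cohort_frequencies <- slex$Frequency[slex_cohort_idx]
    summary(slex_cohort_frequencies)

    llex_cohort_idx <- get_cohorts(car_pron, lemmalex$Pronunciation)
    llex_cohort_frequencies <- lemmalex$Frequency[llex_cohort_idx]
    summary(llex_cohort_frequencies)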

Typically, frequencies are log scaled, as this provides a better fit when they are used to predict human behavior (e.g., word recognition time). It would be useful, therefore, to weight the count of competitors by log frequencies. The LexFindR helper function get_fw does this. You supply it with a list of frequencies, and it takes their logs and returns the sum. This is simple enough that you could do it with basic R functions yourself. However, get_fw provides some useful error checking. Specifically, it checks whether the minimum frequency in your set of frequencies is less than 1, since taking the log would return a negative value. If so, it also suggests a minimum constant to specify for pad to add to each frequency before taking the log. Let’s consider how we might use this. First, let’s try using get_fw to give us summed log frequencies for the frequencies we collected above for CAR’s slex cohorts.

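For example:

    get_fw(slex_cohort_frequencies)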

This gives us the sum without any problem, as the minimum frequency in slex_cohort_frequencies is greater than 1. Now let’s try with llex_cohort_frequencies.

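For example:

    get_fw(llex_cohort_frequencies)
    # returns 55.64038, along with a warning that the minimum frequency is below 1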

Now we get a value (55.64038) but also a warning because the minimum value is less than 1. So let's add the pad option. Padding with 1 brings our minimum to a value greater than 1, so that no log frequency is zero or negative.

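For example:

    get_fw(llex_cohort_frequencies, pad = 1)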

Log Frequency-Weighted Competitor Probabilities: get_fwcp

We could go a step beyond frequency weights and calculate the Frequency-Weighted Competitor Probability (FWCP) of a word, inspired by the Neighborhood Activation Model's Frequency-Weighted Neighborhood Probability (FWNP; Luce & Pisoni, 1998). This is calculated as the ratio of the target word's log frequency to the summed log frequencies of all words meeting the competitor definition, as in the following equation.

$$FWCP = \frac{\log(Frequency_{target})}{\sum_{c \in competitors} \log(Frequency_{c})}$$

Notably, on most competitor definitions, this includes the target word itself, so we can think of the ratio as expressing what proportion of the “frequency weight” of the target’s competitors is contributed by the target itself. For spoken words, the larger the ratio, the more easily the target word tends to be recognized. To calculate this with LexFindR, we supply a set of competitor frequencies and the target word’s frequency to the get_fwcp function. Note that we can include a pad option as for get_fw, and it will be applied to both the target word’s frequency and the list of competitor frequencies; again, this should be done if the minimum frequency value is less than 1. Let’s verify that the minimum frequency in slex is greater than 1.

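For example:

    min(slex$Frequency)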

The next two code blocks demonstrate how to get the FWCP for neighbors (i.e., the FWNP) and then for cohorts.

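A sketch; we assume get_fwcp takes the target frequency first and the competitor frequencies second:

    car_freq <- slex$Frequency[slex$Pronunciation == car_pron]

    # FWNP: frequency-weighted probability over CAR's neighbors
    neighbor_freqs <- slex$Frequency[get_neighbors(car_pron, slex$Pronunciation)]
    get_fwcp(car_freq, neighbor_freqs)

    # FWCP over CAR's cohorts
    cohort_freqs <- slex$Frequency[get_cohorts(car_pron, slex$Pronunciation)]
    get_fwcp(car_freq, cohort_freqs)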

Note that get_fwcp is not simply computing the ratio of target-to-competitor frequencies; it is first converting the frequencies to log frequencies. If your lexicon file has frequencies already in log form, you should not use the get_fwcp function, but instead you should calculate the ratios directly. Also note that it is fairly standard to express frequencies as occurrences-per-million. If your basis is different (e.g., occurrences-per-six million), you may want to transform your frequencies to the more standard per-million basis. Finally, we recommend that you examine distributions before using the results of get_fwcp, as these often exhibit difficult-to-mitigate deviations from normality. One may be better served by examining target frequencies and competitor frequency weights (obtained with get_fw) separately.

Working with orthography or other “undelimited” forms, or other delimiters

By default, LexFindR functions expect the forms you supply to be space-delimited, which is the typical convention for CMU pronunciations. Using a delimiter allows you to have form codes (typically phoneme codes) made up of more than one character. But what if you want to work with orthography, or a phoneme code that uses one character per phoneme without delimiters? You can simply specify sep = "" to indicate that your forms have a "null" delimiter. We can illustrate this with the orthography in the "Item" field in lemmalex.

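A sketch using orthographic forms (the example word and the form argument are our choices):

    # orthographic cohorts of "car" in lemmalex (items sharing the first two letters)
    get_cohorts("car", lemmalex$Item, sep = "", form = TRUE)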

Now let’s try it with TRACE’s original phoneme encodings, which use one character per phoneme. Those original forms are in the “Item” field of slex:

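A sketch, assuming "kar" as TRACE's one-character-per-phoneme transcription of CAR:

    get_cohorts("kar", slex$Item, sep = "", form = TRUE)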

Batch processing with target list and lexicon

Often, we may need to get the competitors for each word in the lexicon, with respect to the entire lexicon. This would be a prerequisite for selecting words with relatively many vs. few neighbors, for example. One way to do this would be to use the base R function lapply. Here is how we could do this for cohorts. The final glimpse command will show us the first few instances of each field.

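A sketch; target_df and cohort_idx follow the names used in the text:

    target_df <- slex
    target_df$cohort_idx <- lapply(target_df$Pronunciation,
                                   function(p) get_cohorts(p, slex$Pronunciation))
    glimpse(target_df)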

Consider the cohort_idx field. We can see that /ad/ (ODD) has only one cohort (itself), while /ar/ (ARE) has four (items 2, 3, 4, 5, or /ar/, /ark/, /art/, and /artˆst/, i.e., ARE, ARK, ART, ARTIST).

What if we also want the lists of cohort forms or labels and frequencies? Rather than calling the function three times, we could speed up the process (speed will be very important when we work with large lexicons!) by calling get_cohorts only once, and then using the indices to get the other items we want. In the next example, we keep working with target_df and its new field cohort_idx (which has the list of indices [row counts] of records that meet the cohort definition for each target).

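A sketch reusing the stored indices; the new field names match those discussed below:

    target_df$cohort_str   <- lapply(target_df$cohort_idx, function(i) slex$Item[i])
    target_df$cohort_freq  <- lapply(target_df$cohort_idx, function(i) slex$Frequency[i])
    target_df$cohort_count <- lengths(target_df$cohort_idx)
    target_df$cohort_fw    <- sapply(target_df$cohort_freq, get_fw)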

Let’s look at the results:

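For example:

    glimpse(target_df)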

We can see that cohort_idx, cohort_str, and cohort_freq all contain lists, and we can verify that for a given word, the lists are the same length (e.g., one frequency for each cohort form). There should be only one value per target word in cohort_count and cohort_fw, which we can see is the case as well.

Working with different target and lexicon lists

In some cases, you may only want to get details for a subset of items in the lexicon – or even for a list of forms that are not in the lexicon. In these cases, you can simply specify a shorter target list rather than making the target list and lexicon the same. Note that of course, if you do not have frequencies for your items, you will not be able to use the get_fwcp command. As an example, we might want to examine what the neighborhoods of the words in the TRACE lexicon would be in the context of a realistically sized lexicon. We can do this by using slex as our target list and lemmalex as our lexicon.

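A sketch using slex items as targets against the lemmalex lexicon:

    target_df <- slex
    target_df$cohort_idx   <- lapply(target_df$Pronunciation,
                                     function(p) get_cohorts(p, lemmalex$Pronunciation))
    target_df$cohort_count <- lengths(target_df$cohort_idx)
    glimpse(target_df)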

Comparing this to our earlier results, we see that ODD would have four cohorts in lemmalex instead of one within slex.

Parallelizing for speed

If we are getting competitors for every word in a lexicon, speed becomes a concern, especially if we want to do this for many competitor types. To quantify this, let’s time how long it takes to calculate cohorts for all words in lemmalex. We will use the R tictoc package (Izrailev, 2014) to time the process. For this demonstration, we are using a MacBook Pro with an Intel Core i9 CPU and 32 GB of RAM.

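A sketch of the timed, serial version; the second (sapply) step is included so there is an additional apply call, as referred to below, and pad = 1 is used because lemmalex contains frequencies below 1:

    library(tictoc)

    tic("cohorts for all of lemmalex, serial")
    lemma_df <- lemmalex
    lemma_df$cohort_idx <- lapply(lemma_df$Pronunciation,
                                  function(p) get_cohorts(p, lemmalex$Pronunciation))
    lemma_df$cohort_fw  <- sapply(lemma_df$cohort_idx,
                                  function(i) get_fw(lemmalex$Frequency[i], pad = 1))
    toc()
    # on the laptop described above, roughly 111 seconds elapsed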

On our demonstration laptop, get_cohorts with lapply took ~111 seconds (on an older workstation we tested, it took several minutes). If you only have to do this once, that may be tolerable. But we can do better! We could easily parallelize using the R future framework and the future.apply package (Bengtsson, 2013), which provides parallel equivalents of the apply family, such as future_lapply. There are various ways to engage multiple cores with this package, as detailed in its documentation. The plan(multisession, workers = num_cores) approach is quite convenient, and works on Windows, Macintosh, and Linux with RStudio and base R. In the following code block, we show how to load future.apply and set things up to use multiple cores.

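A sketch of the setup; availableCores() is one way to choose the number of workers:

    library(future.apply)   # also attaches the future package

    num_cores <- availableCores()
    plan(multisession, workers = num_cores)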

With this setup, the only thing left to do is to replace our apply functions with their future.apply equivalents. In the example below, we just replace lapply with future_lapply to parallelize the function that gets competitors (there is no real need to do this with the other apply call, as it is not the bottleneck; in fact, it is so poorly suited for parallelization that it is slowed by a factor of ~10 if we parallelize it with its future.apply equivalent).

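A sketch of the parallel version; only the lapply call is replaced with future_lapply:

    tic("cohorts for all of lemmalex, parallel")
    lemma_df <- lemmalex
    lemma_df$cohort_idx <- future_lapply(lemma_df$Pronunciation,
                                         function(p) get_cohorts(p, lemmalex$Pronunciation))
    lemma_df$cohort_fw  <- sapply(lemma_df$cohort_idx,
                                  function(i) get_fw(lemmalex$Frequency[i], pad = 1))
    toc()
    # roughly 35 seconds elapsed on the same laptop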

We see an improvement from 111 seconds to approximately 35; it took a bit more than three times longer without parallelization. On the older workstation, the improvement was more dramatic, from several minutes to around 35 seconds (around ten times faster with parallelization). Again, such differences may not seem important if you are running a search once, but if you want to do many different kinds of searches, or explore novel similarity definitions, speed will become important. In Appendix A, we present an example of parallelized code for conducting several LexFindR competitor searches in series.

Conclusions

LexFindR fills important gaps in the language scientist's toolkit. It provides a free, fast, extensible, tested, and readily shared tool that can be integrated into a typical analysis workflow within R. Researchers inclined to contribute extensions to LexFindR should refer to Appendix B for basic guidance on how to do so. We hope our fellow researchers will find LexFindR useful.