The efficiency with which bilinguals make language-related decisions in the face of similarity is a truly remarkable feat. When visually confronted with nearly identical translation equivalents, such as English-Portuguese “paper-papel”, speakers can quickly assign them to different languages, retrieve correct pronunciations from different phonological systems and ascribe meaning. Interlanguage form similarity has been widely studied in both experimental (e.g., Brenders et al., 2011; Christoffels et al., 2006; Comesaña et al., 2012, 2015; Costa et al., 2000, 2005; de Groot & Nas, 1991; Dijkstra et al., 1999, 2010; Lemhöfer & Dijkstra, 2004; Soares et al., 2018b, 2019b; van Hell & Dijkstra, 2002; Voga & Grainger, 2007) and real-life (e.g., Cunningham & Graham, 2000; Holmes & Ramos, 1993; Peters & Webb, 2018) settings, and the bulk of research has suggested that there is a processing advantage when translations formally resemble one another. For instance, when asked to recognize (e.g., Dijkstra et al., 2010; Ferré et al., 2017; Lemhöfer et al., 2004; Voga & Grainger, 2007) or name (e.g., Broersma et al., 2016; Costa et al., 2000, 2005; Hoshino & Kroll, 2008) words in a foreign language, bilinguals typically provide faster and more accurate responses to cognates, i.e., translation equivalents with similar (e.g., English-Portuguese “theory-teoria”) or identical (e.g., “animal-animal”) form, than to noncognates (words that share their meaning but not their form, e.g., English-Portuguese “house-casa”). They have also been shown to learn (e.g., de Groot & Keijzer, 2000; Lotto & de Groot, 1998; Peters & Webb, 2018; Valente et al., 2017) and remember (e.g., de Groot & Keijzer, 2000) cognates more effectively than noncognates (see however Arana et al., 2022 and Comesaña et al., 2012, 2015, for reversed or null effects as a function of stimulus list composition; see also Pureza et al., 2016 for poorer resolution of Tip-of-the-Tongue states in cognates vs. noncognates).

To examine how similarity affects processing in bilinguals and multilinguals, researchers typically retrieve the most accurate translation for the words in their stimulus sets, and subsequently assess the extent to which their forms overlap, either by collecting subjective ratings (i.e., asking bilinguals to provide personal similarity judgements via Likert-type scales), or using computational algorithms, such as Van Orden’s orthographic similarity (Van Orden, 1987), which indicates the proportion of identical characters shared by two words, and the Levenshtein distance (LD; Levenshtein, 1966), which returns the minimum number of insertions, deletions and/or substitutions required to transform one string into another. Recently, the normalized LD (NLD; Schepens et al., 2012), estimated by dividing the LD of two words by the number of letters in the longest string and subtracting the result from one, has become prevalent in the literature for a number of reasons. First, varying on a continuum between 0 (no similarity, e.g., English-Portuguese “sky-céu”) and 1 (exact match, e.g., “banana-banana”), the result is easily interpretable and allows for an objective characterization of translation equivalents as identical cognates, non-identical cognates or noncognates, with a threshold of .50 typically considered for cognate inclusion (Schepens et al., 2012). Second, it is sensitive to word length, and thus character additions and deletions have a smaller impact in the final NLD in longer than shorter words (for instance, in the English-European Portuguese translation pairs “air-ar” and “intellectual-intelectual”, even though only one character is deleted in each pair to transform one word into the other, the NLD is .67 in the former and .92 in the latter). Third, it is independent of word and language order, i.e., the orthographic NLD for the translation pair “house-casa” is .20 regardless of whether the input is “house-casa” or “casa-house”. Unlike Van Orden’s orthographic similarity, this is a key feature in this algorithm, since it satisfies the commutative property of similarity (Guasch et al., 2013). Lastly, web-accessible tools (e.g., Guasch et al., 2013) that allow for the submission of large lists of translation equivalents and instantly return their NLD have also contributed to its growing popularity.

The efficiency of the NLD has turned it into the gold standard for the analysis of orthographic overlap, but the shortage of equivalent indices for phonology has led to a range of approaches in the assessment of phonological similarity. Even though a few existing lexical databases already return some sort of phonological information for multiple languages, e.g., Celex (English, German, and Dutch; Baayen et al., 1995), CLEARPOND (English, Dutch, German, Spanish, and French; Marian et al., 2012), or WordGen (English, German, Dutch, and French; Duyck et al., 2004), only CLEARPOND performs cross-language comparisons (Marian, 2017), but it focuses solely on neighborhood analyses, providing information on the words that can be formed in one language by changing one phoneme in a word of another language, e.g., the English word “height” [haɪt] and its German phonological neighbors “hat” [hat] (he/she/it has; deletion of [ɪ]), “heiß" [haɪs] (hot; replacement of [t] with [s]) and “seit” [zaɪt], (since; replacement of [h] with [z]). Consequently, selecting and characterizing translation equivalents in terms of how phonologically similar they are from a reliable, centralized tool is still virtually impossible, leading to the adoption of different strategies, from time-consuming collections of subjective ratings (e.g., Brenders et al., 2011; Dijkstra et al., 1999, 2010; Hoshino & Kroll, 2008; Poort & Rodd, 2019; Post da Silveira & van Leussen, 2015; Schwartz et al., 2007) to manual analyses of the proportion of identical phonetic syllables or segments in the source and target strings (e.g., Blumenfeld & Marian, 2005; Comesaña et al., 2012, 2015; Costa et al., 2000; Valente et al., 2017; Voga & Grainger, 2007). In addition, the difficulty in promptly accessing reliable estimates of phonological similarity for sufficiently large stimulus sets has widened the gap between the number of studies exploring the role of orthographic and phonological similarity in second-language processing, as already acknowledged in the literature (see Comesaña et al., 2015 and Dijkstra et al., 1999 for an overview; see also Dijkstra et al., 2010), and has critical implications for the tenets of bilingual computational models (e.g., Dijkstra et al., 2019; Dijkstra & van Heuven, 2002; van Heuven et al., 1998), most of which are rooted on orthographic similarity alone (Comesaña et al., 2015).

More recently, some bilingual and multilingual lexical databases offering estimates of interlanguage phonological similarity were put forth in the literature, but with considerable limitations nonetheless. Poort and Rodd (2019) advanced an English-Dutch stimulus set containing translation equivalents with varying degrees of form similarity, and interlingual homographs (i.e., words with similar form but different meaning in two languages, e.g., “angel-angel” – heavenly being vs. sting of bee or wasp), and Post da Silveira and van Leussen (2015) proposed a bilingual lexicon comprising equisyllabic cognates and noncognates for Brazilian Portuguese-American English. Despite the benefits of these resources for second-language research, only very small stimulus sets (Poort & Rodd, 2019: 58 identical cognates, 76 non-identical cognates, 78 noncognates, and 72 interlingual homographs; Post da Silveira & van Leussen, 2015: 64 cognates and 40 noncognates) were included in a single language combination in each database. In addition, while the phonological similarity estimates in the Poort and Rodd study were based on subjective ratings, Post da Silveira and van Leussen used the classical NLD, which does not take phoneme similarities into account, and hence their objective indices of phonological similarity may be slightly underestimated (for instance, in the American English-Brazilian Portuguese translation pair “minister-ministro” [ˈmɪnɪstəɹ-mi'nistɾu], substituting the nearly identical phonemes /ɪ/ with /i/ and /ɹ/ with /ɾ/ is as costly as any other consonant-vowel substitution). To this end, a more fine-tuned, NLD-based phonological similarity index was proposed by Schepens (Schepens, 2010; see also Schepens et al., 2013) in a lexical database for multiple languages that compared the phonetic transcriptions of two translation equivalents, and applied a modified version of the LD that introduced the degree of similarity between the source and target phonemes as substitution costs. However, the materials only comprised high-frequency cognates, and thus researchers interested in manipulating different frequency values or form similarity degrees (e.g., low and medium-frequency words and noncognates) cannot retrieve such stimuli from this database.

Another issue in reference to the assessment of interlanguage form similarity concerns the absence of an index of phonographic similarity in the literature, i.e., a method that combines the degree of orthographic and phonological overlap of two translation equivalents in a single measure. In effect, similar measures of phonographic similarity within languages have been examined with monolingual populations, and suggested that phonographic effects outperform orthographic and/or phonological measures taken separately. For instance, to test whether onset-nucleus-coda subsyllabic components mediate visual word recognition, Nuerk and collaborators (2000) introduced a measure of subcomponent frequency (SCF) for phonographic sublexical units, i.e., orthographic subsets that are phonology-dependent (e.g., the bigram <or> in “horse” [hɔːs] and “morse” [mɔːs], but not in “worse” [wɜːs]). Results from a lexical decision task revealed that the phonographic SCF facilitated visual word recognition, and that bigram frequency (a purely orthographic measure) did not produce an effect when SCF was controlled for. Research looking into within-language neighborhoods has also shown that the phonographic N (the number of words of equal length in letters and phonemes that can be generated by a single letter and phoneme substitution, addition or deletion; Peereman & Content, 1997; Siew & Vitevitch, 2019) produces a processing advantage for words that have more rather than fewer phonographic neighbors, and that it is a more significant predictor of subjects’ performances than the orthographic and phonological N individually (see Adelman & Brown, 2007, and Siew & Vitevitch, 2019 for overviews). Although an interlanguage phonographic similarity measure has never been investigated with bilingual or multilingual populations, a number of studies using different types of stimuli have shown the close interdependence of orthographic and phonological interactions. For instance, in cognate recognition and naming, performances become altered under the influence of phonologically similar words in another language, even when using language pairs with different scripts, e.g., Greek-Spanish (Dimitropoulou et al., 2011), Greek-French (Voga & Grainger, 2007), Japanese-English (Ando et al., 2014) and Hebrew-English (Gollan et al., 1997). In a slightly different line of research, using interlingual pseudo-homophones, i.e., pseudowords that sound identical to real words (e.g., “tauw” is a pseudo-homophone of the Dutch word “touw”, meaning “rope”) as cross-language primes to their translation equivalents in English (“rope”), Duyck (2005) found differences in performances when English targets were preceded by pseudo-homophone primes compared to their graphemic controls. The fact that these phonological similarity effects were observed during silent reading, often with masked priming (a useful technique to investigate the activation of orthographic, phonological, and semantic codes that influence the early stages of visual word recognition) or pseudo-homophones (which do not involve activation of semantic representations), reinforces the assumption that the presentation of orthographic input necessarily activates a fast and automatic phonological representation that does not require lexical access (Dimitropoulou et al., 2011). These findings on the strong reciprocity of orthographic and phonological activation at pre-lexical stages of bilingual processing (see also Clifton, 2015 and Schotter et al., 2012 for a review of studies investigating the influence of phonological similarity during parafoveal processing in reading with monolinguals and bilinguals) lend strong support to the assumption that an interlanguage phonographic similarity measure could be a more solid predictor of bilinguals’ performances than the orthographic and phonological NLDs in isolation.

To address these limitations, here we introduce PHOR-in-One, a fully integrated multilingual lexical database containing 6160 translation equivalents in English – both American and British varieties – European Portuguese, German, and Spanish, and an array of interlanguage form similarity measures, including the classical orthographic NLD, an adapted phonological NLD, and a phonographic NLD as an estimate of the degree of overall form similarity of two translation equivalents. Based on an adaptation of Schepens’ (2010) algorithm, the phonological NLD considers the articulatory, acoustic, and perceptive features of the phonemes as substitution costs. For instance, to determine the phonological NLD of the English-Portuguese pair “house-casa”, the algorithm takes their phonetic transcriptions [‘haʊs-‘kazɐ], identifies the relative positions of the phonemes to be replaced ([k] with [h], [a] with [aʊ], and [s] with [z]) from the International Phonetic Alphabet (IPA; International Phonetic Association, 1999) feature space, computes the Euclidean distance between them, and finally adds the cost of inserting [ɐ]. It should be noted that substantial modifications to the original algorithm (Schepens, 2010) were introduced for our computation, as discussed ahead. The total transformation cost is subsequently normalized, allowing for the characterization of translation pairs as phonologically identical (e.g., English-German “fish-Fisch [‘fɪʃ-‘fɪʃ], phonological NLD = 1), phonologically similar (e.g., English-Portuguese “voice-voz” [‘vɔɪs-‘vɔʃ], phonological NLD = .90), and phonologically dissimilar (e.g., English-Portuguese “age-idade” [‘eɪdʒ- i‘ðaðɨ], phonological NLD = 0.25), all of which are included in our database. As for the phonographic NLD, it is computed by intersecting the orthographic and phonological NLDs of each translation pair (see ahead for details) and also ranges between 0 (entirely different orthography and phonology, e.g., American English-European Portuguese “butler-mordomo” [‘bʌtɫəɹ̠-mɔɾ’ðomu]) and 1 (orthographically and phonologically identical words, e.g., English-German “test-Test” [‘tɛst-‘tɛst]). To our knowledge, an interlanguage phonographic similarity measure has never been advanced in the literature, but examining the overlapping areas of the phonological and orthographic layers may offer valuable insights on the interaction of phonology and orthography in various language processes (Siew & Vitevitch, 2019). Overall, the three cross-language form similarity estimates in PHOR-in-One will allow experimenters to manipulate different degrees of orthographic, phonological, and phonographic overlap using continuous variables, rather than relying on an arbitrary threshold to define cognates (Tainturier, 2019). The orthographic and phonological NLD will be particularly useful to select stimuli with contrasting degrees of orthographic and phonological overlap (O+P-, O-P+), while the phonographic NLD can be used to select stimuli with high (O+P+) or low (O-P-) orthographic and phonological overlap. Although future studies should compare the three estimates, and explore how well they can capture subjects’ performances, they will contribute to test recent accounts (e.g., Iniesta et al., 2021) of the organization of phonological and orthographic interactions in visual and auditory processing in bilinguals.

Aside from the interlanguage form similarity estimates, PHOR-in-One comprises a range of features to ensure a comprehensive characterization of its stimuli within and across languages. Relevant linguistic information, such as number of letters and phonemes, phonetic transcription, and a number of frequency indices are provided for the full multilingual lexicon in a total of 30,800 words. Its easy-to-use format enables the automatic retrieval of words and their translation equivalents in each language, based on the application of specific linguist criteria (e.g., Part-of-Speech [PoS], NLD interval and per-million-word frequency). In addition, PHOR-in-One contains languages with different opacities (transparent: German and Spanish; intermediate: European Portuguese; opaque: English), timings (stress-timed: English, European Portuguese and German; syllable-timed: Spanish; see Nespor et al., 2011 for an overview of rhythm and timing, and also Campos et al., 2018 for an example of how timing can affect the role of sublexical units in processing) and families (Romance languages: European Portuguese and Spanish; Germanic languages: English and German). It also includes different word types, such as simple and compound words, multi-word expressions, identical and non-identical cognates and noncognates from all ranges of frequency and interlanguage form similarity. Distributed along a continuum of orthographic, phonological, and phonographic similarity, the different types of stimuli in PHOR-in-One will further encourage the development of research to examine how language transparency affects processing, and the circumstances under which bilinguals rely on grapheme-to-phoneme mappings or on more direct access to whole-word representations across languages (Iniesta et al., 2021). PHOR-in-One is thus a useful research instrument, in that it delivers the most fundamental measures for the selection, control and manipulation of experimental multilingual materials.

Materials and methods

PHOR-in-One lexical database

Entry compilation and translation procedures

The PHOR-in-One lexicon originated from two existing stimulus sets used previously as experimental materials at our lab, one containing 5048 European Portuguese, English and Spanish translation equivalents, and another one with 1779 translation equivalents in European Portuguese, English, and German. The two sets comprised words with an array of interesting features for research, including short, medium, and long words, as well as cognates and noncognates with low, medium, and high lexical frequency. Intersection of the two sets revealed the existence of 871 words in common, which, upon removal, originated an integrated lexicon of 5956 unique entries. Additional lexical entries were included for homonymous and polysemic words if they originated more than one orthographic form in another language. For instance, the English word “hug” generates different words in German, European Portuguese and Spanish, depending on whether the grammatical category is a noun (“Umarmung”, “abraço”, “abrazo”) or a verb (“umarmen”, “abraçar”, “abrazar”). To incorporate different forms in the three languages, two separate lexical entries were created, where the English word “hug” is duplicated. In the same vein, the European Portuguese noun “canto” translates as the German words “Eckball” (football corner), “Kante” or “Ecke” (corner of a room or table), and “Gesang” (singing), and hence four separate lexical entries were created to accommodate four different words in German, although the same word “canto” is displayed for European Portuguese. Unavailable German and Spanish translations in each list were added automatically. An American English lexicon was created from the British English words by adapting the spellings. The two varieties were included on account of their differences in terms of: i) terminology (e.g., British English “chemist, lift” vs. American English “drugstore, elevator”); ii) spelling (e.g., British English “characterise, honour, centre” vs. American English “characterize, honor, center”); iii) frequency of use (e.g., the word “analysis” has a raw frequency of 8456 occurrences in the original American English corpus, and 1101 in the British English corpus; conversely, the word “back” appears more frequently in British than American English, with 22,071 and 17,570 raw counts, respectively); and iv) pronunciation (approximately 70% of the lexical entries in PHOR-in-One present different phonetic transcriptions for American and British English). An expert in the four languages subsequently conducted a comprehensive review of the translations, applying corrections where needed. Finally, two native speakers of German and Spanish with different second-language combinations reviewed the translations.

The resulting multilingual lexicon includes 6160 lexical entries, each containing a wordform in American and British English, German, European Portuguese and Spanish, all fully aligned across languages, in a total of 30,800 words. Because of their features, the words in each language are particularly useful to cover a broad range of research requirements for the manipulation and/or control of verbal stimuli. The five lexica include a) words with varying lengths, ranging between two and 22 letters (American and British English: min = 2 [“go”]; max = 16 [“misunderstanding”]; German: min = 2 [“Ei”; egg]; max = 22 [“Erziehungsberechtigter”; guardian]; European Portuguese: min = 2 [“”; dust]; max = 16 [“congestionamento”; jam]; Spanish: min = 2 [“”; tea]; max = 15 [“existencialismo”; existentialism]); b) low, medium and high-frequency words, ranging between 0.01 and 7903.62 occurrences per million words (pmw; American English: min = 0.02 [“adequacy”]; max = 6161.41 [“have”]; British English: min = 0.01 [“artifact”]; max = 7903.62 [“have”]; German: min = 0.03 [“File”; file] max = 4201.37 [“haben”; have]; European Portuguese: min = 0.01 [“ermida”; hermitage]; max = 5512.72 [“bem”; well]; Spanish: min = 0.02 [“ánodo”; anode]; max = 5804.59 [“bien”; well]; SUBTLEX pmw occurrences in each language); and c) words from five different grammatical categories, namely nouns, verbs, adjectives, adverbs and interjections (in addition to five compound grammatical categories, as detailed ahead). Moreover, all languages include simple (e.g., “house”), closed-compound (items containing at least two stems, Lieber, 2010; e.g., “notebook”), and hyphenated (e.g., “t-shirt”) words (except for Spanish, which does not include hyphenated words). As an inherent consequence of the translation process, simple and closed-compound words in one language often originate multi-word expressions (items containing at least two words with unitary semantic or pragmatic function, Moon, 2015) in another language. Even though they are typically not included in analogous lexical databases, and make up for only a small portion of the PHOR-in-One lexica (American English: 0.73% of the lexicon; British English: 0.71%; German: 0.15%; European Portuguese: 0.08%; Spanish: 0.10%), we opted to preserve multi-word expressions (e.g., English-Portuguese “nut-fruto seco”) including phrasal (e.g., “warm up”) and reflexive (e.g., German “sich erinnern”, to remember) verbs, as they contribute to promote lexical diversity, while opening a window of opportunity to explore whether there may be processing differences between them and simple or closed-compound words (see Arnon & Christiansen, 2017 and Titone & Libben, 2014 for an overview of the role of multiword expressions in language learning abilities).

Spelling and pronunciation in each lexicon comply with the orthographic entries and phonetic transcriptions adopted in monolingual and multilingual dictionaries, namely the Dictionary of the Contemporary Portuguese Language (Casteleiro, 2001), the Dictionary of the Spanish Language (Real Academia Española, n.d.), the Duden Dictionary (Dudenredaktion, n.d.), and the Oxford English Dictionary (Oxford University Press, n.d.) for European Portuguese, Spanish, German, and English, respectively, which mirror the standard linguistic varieties of the corresponding languages. It is worth mentioning that Castilian Spanish was considered for the Spanish lexicon, and that the European Portuguese spelling reflects the norm before the 1990 Portuguese Language Orthographic Agreement, implemented in Portuguese-speaking countries in 2015, since there are no frequency norms available thus far in the literature for post-Agreement spelling. However, an additional column was created to accommodate the new spelling (N = 115; note that pre- and post-Agreement mismatches will not influence the phonological or phonographic similarity indices, since pronunciation has been maintained). Noun capitalization was preserved in German, as determined by convention, while other grammatical categories are lowercase. Some words are also capitalized in American and British English, namely those referring to nationalities and languages (e.g., “American”), months (e.g., “April”), holidays (e.g., “Christmas”), and some proper nouns (e.g., “Earth”; proper nouns were however generally excluded from PHOR-in-One, since they are typically not relevant for behavioral research).

Frequency and PoS assignment

Two kinds of wordform frequency information are offered in PHOR-in-One, namely monolingual printed corpus frequency, including spoken (e.g., conversations, interviews) and written (e.g., books, newspapers) records, and subtitle frequency. Although subtitle frequency accounts for more variance in lexical decision and naming tasks (see Brysbaert & New, 2009; Soares et al., 2014a, 2019a for reviews), some SUBTLEX studies lack relevant morphosyntactic information (e.g., SUBTLEX-ESP and SUBTLEX-DE do not include grammatical annotation), and as such, combined, the two resources offer a more comprehensive characterization of the words in the five lexica. Corpus frequency information was retrieved from the following preexisting lexical databases: the American National Corpus (ANC; Ide, 2009) for American English, Celex (Baayen et al., 1995) for British English, dlexDB (Heister et al., 2011) for German, ESPAL (Duchon et al., 2013) for Spanish, and P-PAL (Soares et al., 2018a) for European Portuguese. Subtitle frequencies in each language were retrieved from the corresponding preexisting SUBTLEX studies (American English: Brysbaert et al., 2012; British English: van Heuven et al., 2014; European Portuguese: Soares et al., 2014a; German: Brysbaert et al., 2011; Spanish: Cuetos et al., 2011). Absolute (raw counts), pmw and log10 (estimated by determining the base 10 logarithm of the absolute frequency counts + 1) wordform frequency norms are provided for each lexical entry for both printed and subtitle corpora. Certain frequency measures were unavailable in some of the original lexical databases (e.g., the ANC; Ide, 2009), and were thus calculated by us for cross-language comparability. For subtitle frequencies, Zipf norms (van Heuven et al., 2014) were also included, as long as they were available in the original SUBTLEX databases. Zipf frequency norms offer several advantages compared to other frequency indices, since they resemble a Likert scale and include the size of the corpus in the computation. The scale is intuitive to allow for a clear distinction between low- and high-frequency words (van Heuven et al., 2014), but unlike log10 frequencies offers a straightforward relationship with pmw frequency, with the values 1–3 indicating low-frequency words (frequencies of 1 per million words and lower) and the values 4–7 indicating high-frequency words (frequencies of 10 per million words and higher; Brysbaert & New, 2009; see also Soares et al., 2014a).

For corpus frequencies, two types of counts were included, namely PoS-independent (an index of the total number of times a wordform appears in the corpus, e.g., the word “act” appears 3354 times in the British English corpus overall), and PoS-dependent (an index of the number of times a wordform appears in the corpus with a specific grammatical category, e.g., the word “act” appears 2278 times as a noun and 1076 times as a verb in the British English corpus) frequency indices. As with other recent lexical databases (e.g., Duchon et al., 2013; Soares et al., 2018a), the two indices were included due to the existence of systematic differences in the processing of different grammatical categories (see Brysbaert et al., 2012 for a review), and hence some researchers may be more interested in intersecting frequency with PoS, rather than simply extracting overall frequency information.

The following five major grammatical categories were created: noun (75.07% of the full lexicon), verb (14.69%), adjective (9.40% of the lexicon), adverb (0.08%) and interjection (0.05%). Five additional compound categories, namely adjective-adverb (ADJ|ADV), adjective-noun (ADJ|N), adjective-noun-quantifier (ADJ|N|QUANT), adjective-noun-verb (ADJ|N|V), and adjective-verb (ADJ|V) were incorporated, since some words were originally annotated with complex PoS tags in the source corpora (approximately 0.71% of the full PHOR-in-One lexicon), and hence it was not possible to split their frequency into individual grammatical categories. The same PoS tag is true for the five languages in each lexical entry (note however that a language-specific PoS tag was added elsewhere for words whose PoS-dependent frequency information is indexed to a different grammatical category; see ahead for details). Data cleaning procedures were implemented during PoS and frequency compilation, which, in some cases, produced small frequency variations compared to the source lexical databases. Specifically, wordform frequencies originally annotated in the source corpora as non-word items, unknown/unclassified categories, web elements, dates and proper nouns were subtracted from the frequency of the corresponding lexical entries in PHOR-in-One. Subclass frequency counts, when available in the original lexical databases, were added to a single main grammatical class. For instance, the verb “accept” has an overall raw frequency of 1704 occurrences in PHOR-in-One, even though in Celex (Baayen et al., 1995) this value is split into four equal parts (426 occurrences in four separate lexical entries), each reflecting potential occurrences of the four inflections of the word. Wordforms originally unavailable in the source corpora (e.g., American and British English “inexistence” and “ice cream”) were intentionally assigned the frequency value “N.A.” on the basis of the following reasoning. First, a large portion of these words are multiword expressions, e.g., “city wall”, or hyphenated words, e.g., “t-shirt”, which are typically excluded from frequency lists (for instance, hyphenated words in SUBTLEX-UK were split into their individual components before counting word occurrences; van Heuven et al., 2014). Not only are they less relevant for most behavioral research, but they also pose significant challenges to the estimation of frequency values (see Gries, 2022, and O’Donnell, 2011 for overviews). Second, although a frequency of either zero or one is often indexed to non-occurring words (Brysbaert & Diependaele, 2013), these values may be an inaccurate approximation to the real frequency of the word. For instance, even though “abril” (European Portuguese for “April”) has a fairly high (4.63) log10 corpus frequency in P-Pal (Soares et al., 2018a), surprisingly it does not occur in SUBTLEX-PT. In this case, a frequency of zero might mistakenly suggest that the word “abril” is highly infrequent and potentially unfamiliar to most speakers, when in fact it may simply reflect a more specific context of occurrence (P-Pal is essentially a newspaper corpus, which may increase the chances of certain word types appearing compared to subtitle corpora). Third, the simple words assigned a frequency of N.A. in PHOR-in-One make up for a very small portion of the lexica, ranging between 0.02% (Spanish) and 2.39% (American English) for corpus frequency, and 0.60% (British English and European Portuguese) and 3.20% (German) for subtitle frequency. Therefore, we opted for N.A. so that these few words would not interfere with any statistical computations based on lexical frequency in the database.

Figure 1 displays PoS-independent summed corpus frequency distribution as a function of word length for each language.

Fig. 1
figure 1

PHOR-in-One summed corpus frequency distribution (per million words) by word length in each language. AmE = American English; BrE = British English; DE = German; EP = European Portuguese; ES = Spanish

As illustrated, European Portuguese, German, and Spanish present a similar summed frequency distribution per word length, although the total summed frequencies in each language are much larger for European Portuguese and Spanish (40,858,792 and 63,338,916 tokens, respectively) than German (14,357,946 tokens). The German lexicon includes longer words (max = 22 letters) than the remaining languages (where maximum lengths range between 15 and 17 letters). The American and British English lexica have nearly overlapping frequency distributions, both peaking at four letters. All languages have a positively skewed distribution, suggesting that lexical frequency decreases as the number of letters in the word increase, as confirmed in previous works (e.g., Corral et al., 2015; Grzybek, 2007; Soares et al., 2014a, 2018a). For instance, Soares and collaborators (2018a) showed that approximately 55% of the European Portuguese lexical frequencies in P-PAL are accounted for by one, two, and three-letter words. However, in PHOR-in-One, 53% of European Portuguese frequencies are accounted for by five, six, and seven-letter words, presumably due to the fact that function words (e.g., pronouns and determiners), which typically capture a large percentage of the frequency of occurrence in a corpus (Brysbaert et al., 2012; Soares et al., 2014a, 2018a), were excluded here.

Structure of the database

PHOR-in-One is a read-only Excel file with one spreadsheet divided into three interconnected sections, each offering different types of information. The first section (columns A–G) displays five translation equivalents fully aligned in American English (AmE_wordform), British English (BrE_wordform), German (DE_wordform), European Portuguese (EP_wordform) and Spanish (ES_wordform), as well as their global PoS (PoS [all]), which is true for all languages at once. Each lexical entry is uniquely represented with an identity key (ID) specified in column A. The second section (columns H to AK) comprises orthographic (NLD_orthg), phonological (NLD_phonoL) and phonographic (NLD_phonoG) similarity scores for each language combination. Both sections are represented in Fig. 2, which displays lexical entries 1 to 30, their ID, five translation equivalents in American and British English, German, European Portuguese, and Spanish, PoS information and the orthographic, phonological, and phonographic similarity scores for British English-European Portuguese.

Fig. 2
figure 2

PHOR-in-One lexical database (sections 1 and 2). AmE = American English; BrE = British English; DE = German; EP = European Portuguese; ES = Spanish. Depiction of the PHOR-in-One lexical database including five translation equivalents, (Columns B-F), their ID (Column A), global PoS (Column G), and orthographic (NLD_orthg_BrE_EP), phonological (NLD_phonoL_BrE_EP) and phonographic (NLD_phonoG_BrE_EP) NLD scores for British English-European Portuguese. Columns H-L, N-V and X-AF, containing form similarity estimates in other language combinations, are hidden

The third section offers a collection of linguistic information for each language individually. This includes number of letters (LEN_orth), phonetic transcription (phonetic_t) and number of phonemes (LEN_phon), as well as a set of PoS-independent and PoS-dependent absolute (abs), per-million-word (mln) and log10 of the absolute frequency + 1 (log10[abs+1]) corpus frequency indices (PoS-dependent frequency information is signaled with the use of the word “annotated” in the header, e.g., Celex_abs_annotated). The ensuing column (SpecPoS_annotated) presents a language-specific PoS tag, which will only be filled if the PoS-dependent corpus frequency estimates do not match the global PoS specified in column G. To illustrate, consider the American and British English wordform “accounting” in ID 42, and its translation equivalents “Buchhaltung”, “contabilidade” and “contabilidad” in German, European Portuguese, and Spanish, respectively, all labeled as nouns (N) in column G (PoS [all]). Although it was possible to extract PoS-dependent frequencies for this wordform as a noun in the American English corpus (SpecPoS_annotated_ANC is blank), in British English the word only occurred as a verb. Therefore, the value (V) specified in column BK (SpecPoS_annotated_Celex) indicates that the annotated frequency values for British English (Celex_abs_annotated, Celex_mln_annotated and Celex_log10(abs+1)_annotated) are indexed to the PoS verb. Subsequently, SUBTLEX absolute (Subtlex_abs), per-million-word (Subtlex_mln), log10 absolute frequency + 1 (Subtlex_log10[abs+1]), and Zipf (Subtlex_zipf; as mentioned, Zipf subtitle frequencies are not available for German or Spanish) frequencies are displayed. The same properties are available across languages in the same order of presentation. Figure 3 depicts the linguistic and frequency information available for British English, as represented in PHOR-in-One.

Fig. 3
figure 3

PHOR-in-One lexical database (section 3). BrE = British English. Depiction of the third section of the PHOR-in-One lexical database including British English wordforms (BrE_wordform) for lexical entries 1-20 (Column A), and their length in number of letters (LEN_orth_BrE), phonetic transcription (phonetic_t_BrE), number of phonemes (LEN_phon_BrE), PoS-independent (Columns BE-BG) and PoS-dependent (Columns BH-BJ) corpus frequency estimates

For clarity, header labels are identical across languages, either preceded or followed by an abbreviation that specifies the corresponding source corpus (ANC, Celex, dlexDB, P-PAL or ESPAL) or language (AmE, BrE, DE, EP and ES for American English, British English, German, European Portuguese, and Spanish, respectively). A total of 111 columns are included. PHOR-in-One is available for download as an Excel file at https://www.psi.uminho.pt//pt/CIPsi/Laboratorios_Investigacao/Psicolinguistica/Documents/PHOR_in_One_LDB.zip and in the supplementary materials.

Results and discussion

Form similarity estimates

Phonological and orthographic similarity

Computation of the interlanguage phonological similarity estimates in PHOR-in-One required setting up an integrated multilingual phonetic alphabet, and the subsequent retrieval, standardization, and verification of phonetic transcriptions for all its lexical entries. Two principles underpinned the development of the phonetic alphabet: i) compliance with the notation of the IPA (International Phonetic Association, 1999), and ii) assignment of a single, unambiguous phonetic symbol to each sound across languages. Accordingly, some adjustments were made in our implementation of the phonological similarity algorithm, since the original version (Schepens, 2010) often employed identical characters to represent different sounds across languages (e.g., [r] represented the Spanish alveolar trill /r/, the English post-alveolar approximant /ɹ̠/, the German uvular trill /ʁ/, and the German post-vocalic /r/, [ɐ]; [R] represented both the Spanish simple alveolar tap/flap /ɾ/ and the British English post-alveolar approximant /ɹ̠/ at word endings; see Appendix Table 9 for details). To avoid the use of overlapping notations for different phonemes across languages, more specific phonetic segments were included in our version of the phonological similarity algorithm for each language. In addition, allophones, i.e., different realizations of the same phoneme, e.g., [l] and [ɫ], were also considered in our alphabet. For instance, the lateral approximant /l/ has two different realizations in British English: clear [l] at word-beginning and same-syllable pre-vocalic positions (e.g., “land” [lænd] and “plate” [pleɪt], respectively), and dark/velarized [ɫ] at word endings and before consonants (e.g., “full” [‘fʊɫ] and “belt” [‘bɛɫt], respectively). Conversely, in American English /l/ is always dark/velarized (e.g., “label” [‘ɫeɪbəɫ]). Although these phonetic specifications are typically not represented in dictionaries or psycholinguistic lexical databases, allophones of /l/, /t/, /d/, /b/, /ɡ/, /r/, /n/, /m/ and /s/ were included to highlight different realizations of the same phoneme within and across languages, since recent evidence has suggested that bilingual individuals are sensitive to allophonic contrast (e.g., Burrows et al., 2019; Fabiano-Smith et al., 2015), and that allophones may in fact form the basis of pre-lexical processing during spoken-word recognition (e.g., Mitterer et al., 2018). Table 1 features the allophones incorporated into the PHOR-in-One multilingual alphabet.

Table 1 Allophone Inventory and contextual occurrence in PHOR-in-One for each language

Integration of the sounds of each language into a single multilingual inventory resulted in a phonetic alphabet that contains 37 pulmonic consonants and seven complex consonants, including four affricates (a cluster of two segments with the same place but different manners of articulation: [pf, ʦ, tʃ, dʒ]) and three co-articulated consonants (a cluster of segments with two simultaneous places of articulation: [kw, ɡw, γw]), distributed in the consonant space according to their articulatory features. The consonant matrix contains two overlapping tables (or dimensions), one for pulmonic consonants, and one for complex consonants, each with twelve columns and eight rows that reflect their place (e.g., labial, coronal, dorsal) and manner (e.g., plosive, fricative, approximant) of articulation. These two overlapping dimensions are included in Table 2, which displays the positioning of each consonant in the IPA feature space (see also Appendix Table 10 for the full list of consonants adopted in PHOR-in-One).

Table 2 Phonetic consonants adopted in PHOR-in-One and positioning in the IPA matrix reflecting their place and manner of articulation

The vowel inventory contains 39 (oral, nasal, short, long, and borrowed) vowels and 26 diphthongs, distributed in the IPA feature space according to their height (e.g., open, close) and backness (e.g., front, central). The vowel matrix contains four overlapping dimensions, for short vowels (oral and nasal, e.g., /i/ and /ũ/, respectively), long vowels (e.g., /i:/), borrowed vowels (e.g., /ɛ̃:/), and diphthongs (e.g., /aɪ/), respectively. Table 3 displays (oral and nasal) short vowel positions in the IPA feature space. Long vowels, borrowed vowels and diphthongs are displayed in Table 4 (see also Appendix Table 11 for the full list of vowels and diphthongs adopted in PHOR-in-One, with examples).

Table 3 Short (oral and nasal) vowels adopted in PHOR-in-One and positioning in the IPA matrix reflecting their height and backness
Table 4 Long vowels, borrowed vowels and diphthongs adopted in PHOR-in-One with positioning in the IPA matrix reflecting their height and backness

Phonemic mergers were considered in our multilingual phonetic alphabet in order to standardize American English phonetic transcriptions. Due to the multiplicity of accents, phonological changes over time have resulted in large phonetic variability, complexifying the use of a representative phonetic notation (see Labov et al., 2008 and Hughes et al., 2012 for details). Hence, for parsimony, the following American English mergers were adopted in PHOR-in-One: i) hurry-furry: merge of /ʌr/ with /ɜr/; ii) horse-hoarse: merge of /ɔː/ and /oʊ/ (includes other word pairs such as war-wore and morning-mourning); iii) fern-fir-fur: merge of /ɛ, ɪ, ʊ/ into [ɜɹ] in coda positions; iv) cot-caught: merge of /ɔː/, /ɔ/ and /ɒ/; and e) intervocalic /ɒr/ merges with /ɑr/ and /ɔr/ (see Labov, 2006; Labov et al., 2008; and Wells, 1982 for an overview). The vowels adopted for American and British English are displayed in Appendix Table 12 with examples. Considering all these adjustments, overall, our multilingual phonetic alphabet includes 108 phonemes, 15 more than the original version (Schepens, 2010).

Upon the definition of the multilingual phonetic notation, the phonetic transcriptions retrieved from The Free Dictionary were standardized (e.g., [g] was converted to [ɡ], and [ʧ] – one character – was converted to [tʃ] – two characters) and cross-checked using monolingual and bilingual dictionaries (European Portuguese: Casteleiro, 2001; German: Dudenredaktion, n.d., and Wiktionary; British English and American English: Oxford University Press, n.d.; Spanish: Real Academia Española, n.d. and Wiktionary), in addition to the following automatic phonetic converters online: Automatic Phonemic Transcriber, (Brondsted, n.d.), Res Publicae (Armario, 2008), Transcriptor Fonético (López, n.d.), Easy Pronunciation (Baytukalov, n.d.), tophonetics (Tophonetics, n.d.) and Text2Phonetics (Text2Phonetics, n.d.). If all transcriptions from these resources matched, they were automatically accepted as correct. When at least one of the transcriptions was different, they were verified by an expert in phonetics. European Portuguese transcriptions were retrieved from P-PAL (Soares et al., 2018a), or from The Free Dictionary and subsequently reviewed if unavailable.

Before computing the interlanguage phonological similarity measures, the phonetic transcriptions were automatically converted into an intermediate notation, so that complex characters like long vowels, e.g., [i:], and diphthongs, e.g., [əʊ], could be processed as units. For this purpose, an extended version of DISC and DISC++ (Schepens, 2010), a single-coded ASCII notation that visually resembles the IPA in a computer readable format (see the Celex English Linguistic Guide, 1995 for details) was created. Our extended version, DISC*, introduces new phonemes for American English and European Portuguese, and a set of allophones, as detailed in Tables 1 and 2 (see also Appendix Table 10 and 11 for the full IPA and DISC* alphabet with examples in each language).

To compute the interlanguage phonological similarity estimates, a modified version of the LD was implemented, in conjunction with an adaptation of the phoneme distance algorithm developed by Schepens (Schepens, 2010; see also Schepens et al., 2013). As mentioned, this algorithm is sensitive to phoneme qualities and modulates substitution costs according to the acoustic and articulatory features of the source and target phonemes. To estimate the cost of each substitution, it identifies the relative positions of the two phonemes in the consonant and vowel matrices of the IPA feature space (Tables 2, 3, and 4), and computes their Euclidean distance by applying Eq. (1), which determines the length of a line segment between them. The shorter the line, the smaller the distance.

$$Phoneme\ distance=\frac{\sqrt{{\left( column\ difference\right)}^2+{\left( row\ difference\right)}^2}}{Normalization\ constant}.$$
(1)

By way of example, the distance between the velar plosive [k] and the glottal fricative [h] is 5. The minimum distance is 0, for phonemes with the same place and manner of articulation (e.g., [p] and [b]), whereas the maximum distance is 11.70, between a bilabial plosive ([p] or [b]) and the glottal fricative [h].

After computing the Euclidean distance between each source and target phoneme, penalties are added if at least one of the two is a long vowel, diphthong or borrowed vowel (for vowel substitutions), and a long affricate or co-articulated consonant (for consonant substitutions). The base penalty is set at 0.4 (e.g., substituting a short vowel with a diphthong, or vice-versa). Penalties are cumulative when neither the source or target phoneme is a short vowel or a pulmonic consonant. For instance, in the British English-German translation pair “analyse-analysieren” [ænəɫaɪz-analy:zi:ʁən], the diphthong [aɪ] is replaced with the long vowel [y:], and hence two penalties are applied, producing a total penalty of 0.8. No penalties are applied between two short vowels, or two pulmonic consonants. The resulting phoneme distances, including penalties, are then divided by a normalization constant, so that individual phoneme substitution costs can be distributed between 0 and 2 (see Eq. 1). Upon testing six different values, a normalization constant of 5 was established for the implementation of the original algorithm, as it generated greater correlation coefficients with subjective ratings from previous studies (for further details see Schepens, 2010). Substitutions exceeding the maximum cost of 2 (e.g., the cost of substituting a bilabial plosive [p] or [b] with a glottal fricative [h] is 2.34) are automatically adjusted to 2. This cost redistribution between 0 and 2 is based on the premise that consonant/vowel substitutions are not allowed because they have distinct roles in word processing (e.g., Acha & Perea, 2010; Caramazza et al., 2000; Lee et al., 2002; Soares et al., 2020; Soares et al., 2014b). Therefore, in this algorithm, regarding substitutions as a deletion followed by an insertion, rather than a single operation, ensures that consonant/vowel substitutions instantly receive the maximum cost of 2 (Schepens, 2010).

Compared to the original version of the algorithm (Schepens, 2010), a major adjustment was carried out in our implementation, which impacted the computation of the phonological similarity scores. While insertions and deletions originally received a cost of one, here, those costs were adjusted to 2. The rationale behind this alteration was as follows. In the classical LD, every operation (substitution, insertion, and deletion) has an identical cost, i.e., 1. Given that phoneme substitutions here were set at a maximum cost of 2, insertions and deletions should also be adjusted to 2. In addition, due to the nature of the normalization formula in phonology, which multiplies the denominator by 2 as shown in Eq. (2), if insertion and deletion costs were set at 1 (as in the original proposal; Schepens, 2010), the resulting degree of phonological similarity would be overestimated, particularly for pairs involving multiple insertions or deletions.

$$phonological\ NLD=\frac{\sum phoneme\ distances}{Length\ of\ the\ longest\ string\ast 2}.$$
(2)

To illustrate, consider the German-English pair “Ende-end” [ɛndə-ɛnd], with an orthographic NLD of .75. Like orthography, the source and target phonetic strings have four and three characters, respectively, the first three elements are identical, and one operation (insertion/deletion) is required to transform one string into the other. As such, in this pair, the orthographic and phonological NLDs should be identical, i.e., .75. While an insertion/deletion cost of 1 results in a phonological NLD of .88, a cost of 2 ensures identical NLD scores for orthography and phonology. Therefore, taken together, these arguments support our assumption that an insertion/deletion cost of 2 is more suitable for this algorithm than a cost of 1.

With these adjustments, our adaptation of the phonological Levenshtein distance algorithm, including the use of phoneme distances as substitution costs, can be defined as

$$lev_{a,b}\;\left(i,j\right) = \begin{cases} \max\;\left(i\ast k,j\ast k\right) & \quad \text{if } \min\;\left(i,j\right)=0,\\\min \begin{cases} lev_{a,b}\;\left(i-1,j\right)+k_{a,b}\\ lev_{a,b}\;\left(i,j-1\right)+k_{a,b}\quad\\lev_{a,b}\;\left(i-1,j-1\right)+ cost_{a,b_{(a_i\neq b_j)}}\\\end{cases} & \quad \text{otherwise }\end{cases}$$
(3)

where k = 2Footnote 1. For clarity, Table 5 displays the adapted LD matrix derived by our modified algorithm for the computation of the phonological distance between the American/British English-European Portuguese translation equivalents “house-casa” using the single-coded DISC* notation.

Table 5 Phonological Levenshtein distance matrix for the American/British English-European Portuguese pair “house-casa” using DISC*

The value (3.4) displayed in the last entry corresponds to the final distance between the two strings, which is then normalized (see Eq. [2]). The resulting phonological NLD for this pair is .58.

As for orthography, the similarity scores were computed using the classical LD (Levenshtein, 1966). Table 6 displays the LD matrix for the computation of the orthographic distance between the American/British English-European Portuguese translation equivalents “house-casa” (the same as provided above for a direct comparison with phonology).

Table 6 Orthographic Levenshtein distance matrix for the English-European Portuguese pair “casa-house”

The value specified in the last entry of the matrix (4.00) corresponds to the final distance between the two strings, which is then normalized (Schepens et al., 2012). The resulting orthographic NLD for this pair is .20.

Table 7 displays the mean, median, maximum, and minimum orthographic and phonological NLD for each language pair in PHOR-in-One.

Table 7 Mean, median, maximum, and minimum orthographic and phonological NLD scores in PHOR-in-One, and number of translation equivalents with orthographic NLD greater than or equal to .5 and phonological NLD greater than or equal to .75 with examples

As illustrated, all pairs exhibit minimum and maximum orthographic NLDs of .00 and 1.00, respectively, indicating that orthographically entirely distinct and orthographically identical translation equivalents are included in the database. German-European Portuguese exhibits the smallest mean (.29) and median (.19) NLD, whereas European Portuguese-Spanish presents the largest (.77 and .86, respectively). As for phonological overlap, all language pairs have a minimum phonological NLD of .00, except for European Portuguese-Spanish (min = .12). British English-European Portuguese and British English-Spanish have a maximum phonological NLD of .98, indicating that there are no phonologically identical cognates for these language combinations in the database. The remaining language pairs have a maximum phonological NLD of 1.00. With the exception of European Portuguese-Spanish, mean phonological NLD scores are very close across language pairs, ranging between .49 for German-European Portuguese and .53 for British English-German. Conversely, mean orthographic NLD scores are more scattered, ranging between .29 for European Portuguese-German and German-Spanish, and .43 for British English-Spanish.

When orthographic and phonological similarity are taken together, European Portuguese-Spanish stands out for a number of reasons. First, 86% (N = 5,266) of European Portuguese-Spanish translation equivalents have an orthographic NLD score greater than or equal to .50, and the percentage is even higher for phonology (91%, N = 5,672), which signals the great formal proximity of the two languages. Second, not only are mean orthographic (.765) and phonological (.770) NLD scores considerably higher for this pair than for the other language combinations, they are also nearly identical. The fact that the mean difference between phonological and orthographic NLD scores for the remaining language pairs is substantially greater than zero (Min = .09 for British English-Spanish; Max = .22 for German-Spanish), suggests that the phonological NLD scores are on average higher than the orthographic NLDs. A practical example is the British English-European Portuguese translation pair “veil-véu” [veɪɫ-vɛw], with an orthographic NLD of .25 and a phonological NLD of .77. This asymmetry is presumably due to the fact that, unlike orthography (where grapheme similarity does not play a role, and where substituting non-identical phonemes always has a cost of 1), many phoneme substitutions produce a cost which is smaller than 1, thus necessarily resulting in higher NLDs. The difference between individual orthographic and phonological NLD scores may be particularly pronounced for translation equivalents that involve fewer costly operations, such as insertions and/or deletions, or consonant-vowel substitutions. Hence, to compensate for this increment, Schepens (2010) and Schepens and collaborators (2013) proposed using a cognate inclusive threshold of .75 for phonology.

To explore the relationship between the two indices, Fig. 4 depicts a histogram for orthographic and phonological NLD scores in each language pair.

Fig. 4
figure 4

Histogram of the distribution of orthographic and phonological NLD scores in each language pair. BrE = British English; DE = German; EP = European Portuguese; ES = Spanish. Language combinations involving American English were excluded as they were very similar to British English

The distribution suggests that orthographic overlap is, in general, more positively skewed (see purple bars in Fig. 4), indicating that the amount of translation equivalents decreases as NLD intervals increase. Conversely, phonological NLD scores (pink bars) seem to fall into a bell-shaped distribution in most language pairs. European Portuguese-Spanish presents a different pattern, since orthography and phonology are both negatively skewed and have nearly overlapping distributions. A Kendall’s Tau correlation analysis between orthographic and phonological NLDs was performed across all language pairs, as shown in Fig. 5 (for the sake of simplicity, language combinations involving American English were excluded, as they were very similar to British English).

Fig. 5
figure 5

Correlation matrix between the orthographic and phonological NLD scores across language combinations in PHOR-in-One. BrE = British English; DE = German; EP = European Portuguese; ES = Spanish. Blue circles in the correlogram denote positive correlations. Darker and larger circles denote stronger correlations. Correlation coefficients range from .08 (for the correlation between the orthographic NLD scores in British English-German and the phonological NLD scores in European Portuguese-Spanish, and vice-versa) and .59 (for the correlation between the orthographic and phonological NLD scores in European Portuguese-Spanish). All p < .001. Correlations involving American English were excluded, as they were very similar to British English

As expected, all correlations were significant (all p < .001), given the large number of data points in the analysis. Considering orthographic and phonological similarity for the same language pairs (e.g., the correlation between orthographic and phonological NLD scores within British English-German, tb = .54), only strong positive correlations were found (all tb greater than or equal to .46), showing that phonological similarity increases as orthographic NLD scores increase. Strong positive correlations were also observed for different language pairs, namely between the orthographic NLD for British English-European Portuguese and the phonological NLD for British English-Spanish (tb = .50), and between the orthographic NLD for German-European Portuguese and the phonological NLD for German-Spanish (tb = .47). Interestingly, the correlation between the orthographic NLD for British English-European Portuguese and the phonological NLD for British English-Spanish (tb = .50) is nearly identical to the correlation between the orthographic and phonological NLD within British English-European Portuguese (tb = .51), potentially due to the close proximity of European Portuguese and Spanish orthography and phonology. The correlation between orthographic and phonological NLD scores within European Portuguese-Spanish is the strongest (tb = .59) out of all comparisons, as expected.

The distribution of orthographic and phonological NLD scores for each language pair presented in this section shows that PHOR-in-One allows for the selection of stimuli with distinct formal features, including, i) orthographically and phonologically identical cognates, e.g., British English-German “lift-Lift” [lɪft-lɪft] (orthographic NLD = 1.00; phonological NLD = 1.00); ii) orthographically, but not phonologically related cognates, e.g., British English-German “psychologist-Psychologe” [saɪkɒlədʒɪst-psy:çolo:ɡə] (orthographic NLD = .75; phonological NLD = .41); iii) phonologically, but not orthographically related cognates, e.g., English-German “ice-Eis” [aɪs-aɪs] (orthographic NLD = .00, phonological NLD = 1.00); and iv) orthographically and phonologically distinct translation pairs, or noncognates, e.g., American English-Spanish “bear-oso” [bɛəɹ̠-os̺o] (orthographic NLD = .00; phonological NLD = 0.36). This variability is important for research, and is an adequate representation of how the words are distributed in a language (Siew & Vitevitch, 2019).

Phonographic similarity

A new objective index of interlanguage phonographic overlap is also introduced in PHOR-in-One. The definition of a phonographic NLD should satisfy the following guiding principles: i) express the degree of overall form similarity of two translation equivalents by intersecting their orthographic and phonological overlap; ii) ensure an intuitive categorization of translation pairs as low or high-similarity; iii) be distributed on a continuum between .00 and 1.00 for comparability with other form similarity estimates; iv) approximate the mean of the orthographic and phonological NLD scores. To apply these principles the geometric mean of the individual orthographic and phonological NLD scores was computed. We opted for the geometric rather than the arithmetic mean because it tends to dampen the effects of high values (Habib, 2012), thus levelling the differences reported above between the two measures. However, due to the nature of the geometric mean, which uses the product of n values, the resulting phonographic NLD is zero when at least one of the values is zero. For instance, in the British English-German translation equivalents ice-Eis [aɪs-aɪs], with an orthographic NLD of 0.00 and a phonological NLD of 1.00, the phonographic NLD using the geometric mean is zero. This is a serious limitation, in that it does not capture any information about non-zero values (de la Cruz & Kreft, 2019) and violates principle iv). To address this issue, an extension of the geometric mean was implemented here, which can handle zero values efficiently, and which has been used before in other scientific areas (e.g., Alexander et al., 2005; Williams, 1937). In this extension, 1 is added to individual NLD scores, before estimating the product of the orthographic and phonological NLD, and subsequently subtracted from the result, as expressed in (4),

$$\mathrm{G}\left(\mathrm{X}\right)={\left(\prod_{i=1}^n\left({x}_i+1\right)\right)}^{\frac{1}{n}}-1,$$
(4)

where x ≥ 0.

Table 8 details the total number of translation equivalents in PHOR-in-One with a phonographic NLD of .000 (noncognates with no orthographically and phonologically overlapping features), 1.000 (orthographically and phonologically identical words), and distributed between .001 and .499 and .500 and .999 with examples for two, three and four language combinations.

Table 8 Number of translation equivalents in four phonographic NLD intervals for each language combination in PHOR-in-One with examples

The numbers show that hardly any translation equivalents with a phonographic NLD of .00 exist in PHOR-in-One. Only one pair of such translations is included for British English-German (“poor-arm” [ˈpʊə - ˈa:ɐm]) and German-Spanish (“Topf-olla” [ˈtɔpf - ˈoʝa]), and three for British English-European Portuguese (e.g., “aunt-tia” [ˈɑ:nt - ˈtiɐ]) and British English-Spanish (e.g., “nail-uña” [ˈneɪɫ - ˈuɲa]). The remaining language combinations do not contain phonographically non-overlapping translation equivalents. Additionally, only British English-German (e.g., “film-Film” [ˈfɪɫm - ˈfɪlm]), German-Spanish (e.g., “Mango-mango” [ˈmaŋɡo - ˈmaŋɡo]) and European Portuguese-Spanish (e.g., “flor-flor” [ˈfɫoɾ - ˈfloɾ]) contain phonographically identical cognates (17, 7, and 59 translation equivalents, respectively). The European Portuguese-Spanish pair shares a larger number of phonographically similar (i.e., .50 ≤ NLD <1.00; e.g., “acesso-acceso”) and identical (NLD = 1.00) words than any other language combination (N = 5377 combined). When three language combinations are considered at once, a large number of phonographically distinct (i.e., .00 ≤ NLD < .50; min = 616 words in British English-European Portuguese-Spanish, e.g. “barn [ˈbɑ:n] - celeiro [sɨˈɫɐjɾu] - granero [ɡɾaˈneɾo]”; max = 2532 words in British English-German-Spanish, e.g., “wall [wɔ:ɫ] - Wand [vant] - pared [paɾeð]”) and phonographically similar (i.e., .50 ≤ NLD < 1.00; min = 1451 words in British English-German-Spanish, e.g., “insulin [ˈɪnsjʊlɪn] - Insulin [ɪnzuˈli:n] - insulina [ins̺uˈlina]”; max = 2610 words in British English-European Portuguese-Spanish, e.g., “lemon [ˈlɛmən] - limão [ɫiˈmɐ̃w] - limón [liˈmon]”) translation equivalents are part of the lexicon. A total of 1350 phonographically similar (e.g., “vein [ˈveɪn] - Vene [ˈve:nə] - veia [ˈvɐjɐ] - vena [ˈbena]”) and 485 phonographically distinct (e.g., “window [ˈwɪndəʊ] - Fenster [ˈfɛnstɐ] - janela [ʒɐˈnɛɫɐ] - Ventana [benˈtana]”) translation equivalents across the four languages at once are also included in PHOR-in-One.

To further assess how the phonographic NLD is distributed across languages, Fig. 6 depicts a histogram of the number of translation equivalents in each interval for six language pairs.

Fig. 6
figure 6

Histogram of the distribution of phonographic NLD scores in each language pair. BrE = British English; DE = German; EP = European Portuguese; ES = Spanish. Language combinations involving American English were excluded as they were very similar to British English

Compared to the histograms in Fig. 4, where for most language pairs orthographic NLD peaks between .00 and .20 and phonological NLD between .30 and .50 (except for British English-European Portuguese and British English-Spanish, both peaking between .30 and .70, and European Portuguese-Spanish, which presents a different distribution), phonographic NLD generally peaks between .20 and .30, suggesting that most translation equivalents in these language pairs share few orthographic and phonological features at once. British English-European Portuguese and British English-Spanish seem to be approximately binormally distributed, peaking between .20 and .30 and also between .60 and .70 (for British English-European Portuguese) or .70 and .80 (for British English-Spanish). European Portuguese-Spanish, however, remains negatively skewed, with a greater concentration of translation equivalents displaying a phonographic NLD between .90 and 1.00. Overall, compared to the orthographic and phonological NLD distributions in Figure 4, the histograms in Figure 6 reflect the fact that the phonographic NLD is a more intermediate index of similarity. To illustrate, the British English-European Portuguese translation equivalents “veil-véu” [veɪɫ-vɛw], which, as mentioned, have an orthographic and phonological NLD of .25 and .77, respectively, bear a phonographic NLD of .49.

The differences between the orthographic and phonological NLD scores signal the importance of controlling for both orthographic and phonological overlap in the selection of cognates and noncognates. In addition, the phonographic NLD may help calibrate these differences between the two measures, and potentially account for more variance in bilingual and multilingual performances, particularly for translation pairs with contrasting degrees of orthographic and phonological similarity. This is therefore a more straightforward method for assessing the degree of overall form similarity of two words, with particular relevance for researchers interested in selecting translation equivalents with high (O+P+) or low (O-P-) degrees of orthographic and phonological similarity.

Conclusions

In this paper we introduced PHOR-in-One, an extensive multilingual lexical database containing 6160 translation equivalents fully aligned in American and British English, German, European Portuguese and Spanish, as well as their linguistic, morphosyntactic and frequency characterization. To address a long-needed research requirement, PHOR-in-One offers three indices of interlanguage form similarity, including the classical orthographic NLD, an adapted phonological NLD, which considers the degree of proximity of the source and target phonemes in the IPA feature space as substitution costs, and an estimate of phonographic NLD, a simplified alternative to control for and/or manipulate orthographic and phonological similarity at once. Combined, these indices allow for the selection of comprehensive stimulus sets, including orthographically and phonologically distinct translation equivalents (or noncognates), orthographically but not phonologically related translation equivalents, phonologically but not orthographically related translation equivalents, and orthographically and phonologically identical cognates. PHOR-in-One will promote the adoption of comparable estimates of interlanguage form similarity across studies, increase speed and reliability in the process of stimulus selection, and contribute to expand the predictions of bilingual computational models on phonological processing and representation. Future studies should test how well the new indices of form similarity proposed here can capture subjects’ performances with different tasks, word types and populations.