Introduction

This article presents K-SPAN (Korean Surface Phonetics and Neighborhoods), the first lexical database of Korean to include transcriptions of surface phonetic forms and neighborhood density statistics. The database includes 63,836 entries, drawn from the Modern Korean Usage Frequency Survey 2 corpus (Kim 2005; Korean title: “2”).

Developing experiments with carefully controlled stimuli is a common activity of those who are interested in spoken language processing, such as speech scientists, experimental psychologists, and linguists. Especially for tasks involving speech production or perception it can be important to be able to control the phonetic or phonological content of stimulus items. For that reason, phonetized databases—which list phonetic transcriptions of words—are an invaluable resource. However, Korean, like other understudied languages, does not have such a database. The development of K-SPAN was motivated by a desire to remedy this gap.

A related motivation was to calculate the phonological neighborhood density (ND) of Korean words. Phonological ND is commonly used as a measure of word similarity in studies of the phonological structure of the lexicon. It is typically operationalized as the number of other words in the lexicon that differ from a target word by a phonological edit distance of one: that is, by the addition, deletion, or substitution of a single phoneme. ND has been shown to be relevant in explaining performance on a host of linguistic tasks.

In the realm of speech perception, it has been found that high ND is correlated with slower lexical access. For example, Luce and Pisoni (1998) showed that listeners exhibited longer reaction times to high ND words in lexical decision, identification in noise, and naming tasks. In speech production, many studies have shown evidence for hyperarticulation in high ND words (Munson and Solomon 2004; Scarborough 2004; Wright 2004), which has been interpreted in support of a listener-oriented view of phonetic reduction. Other studies have shown how ND may be useful in characterizing word learning: children’s lexicons tend to contain more high ND words at first, and over time expand to low ND words (Coady and Aslin 2003; Stokes 2010). In short, the concept of ND is applicable to a wide range of questions in language processing and acquisition.

However, the great majority of this literature is based on English. Although ND is not inherently a language-specific concept (i.e., all languages have words, which are composed of phonemes), the validity of ND as a meaningful psycholinguistic measure is understudied in languages other than English. (Holliday & Turnbull 2015 [Footnote 1] and Vitevitch & Stamer 2006 are notable exceptions.) Accordingly, one of the aims of the current paper is to extend research on ND effects to Korean.

Previous research on neighborhood density in Korean

Although there exists some previous work on the effects of ND in Korean, it has been limited in several ways. First, these studies have focused mostly on visual word recognition (Kwon, Lee, Lee, & Nam, 2011; Kwon & Nam, 2011; Kwon, 2014). While such work is valuable in its own right, it addresses a different set of questions and does not lend itself to cross-linguistic comparison with the studies on spoken language processing cited above. Second, all of these studies used a measure of ND that is based on the eumjeol, a unit of Korean orthography that usually, but not always, corresponds to a phonological syllable. This measure is not analogous to the phoneme-based measure used in most of the spoken language processing literature.

Although syllable-based ND measures have proven meaningful in studies of visual word recognition (e.g., Carreiras, Alvarez, & de Vega, 1993; Perea & Carreiras, 1998), they may be unsuited to the study of spoken Korean, owing to the many-to-many mapping between Korean orthography and pronunciation. The same spelling can be used for different pronunciations (e.g., ‘occurrence of a disease’ vs. ‘foot disease’), and different spellings can be used for the same pronunciation (e.g., ‘easy’ vs. ‘regards’). This fact is discussed in more detail in Section 3. To our knowledge, the only published study of ND effects involving spoken Korean, Song, Nam, & Koo (2012), investigated the effects of word frequency and ND on spoken word segmentation. This study used a syllable-based ND measure, as did other studies of spoken word segmentation (e.g., Cutler, Mehler, Norris, & Segui, 1986; Mehler, Dommergues, Frauenfelder, & Segui, 1981), but this choice nevertheless makes it incompatible with the majority of ND studies on English, which used a phoneme-based measure.

Perhaps the most substantial gap in this body of research on Korean lies in the representations upon which ND was calculated, irrespective of whether the measure was syllable- or phoneme-based. The aforementioned studies calculated ND from the orthographic forms of words, taking advantage of the fact that the Korean orthography, Hangeul, is relatively shallow: in many cases, the phonological form of a Korean word can be reliably derived from its orthographic form. Nevertheless, there are many exceptions, including the examples cited above, whose phonological forms cannot be derived from their orthographic forms. Sometimes this is the result of phonological processes that apply selectively based on morphological factors (e.g., whether or not the word is a compound noun), or only to words in certain lexical strata (e.g., Sino-Korean words). In other cases, the phonological form of a word is irregular and simply unpredictable from the orthographic form. Thus, given only the orthographic forms in a lexicon, the calculation of any kind of ND (other than orthographic syllable-based) becomes intractable.

Currently available Korean corpora

There exist several large corpora of Korean, but, to our knowledge, none of them provide both orthographic and phonological representations of words. The Sejong Corpus (Kim 2006) is the largest publicly available Korean corpus. This corpus contains a large amount of annotated textual data written in Hangeul (part of the corpus also contains Chinese characters). It contains two core subparts: a “spoken” corpus, which includes orthographic transcriptions of conversations and interviews, and a “written” corpus, which includes material such as press articles, textbooks, novels and poems from the 20th century. The written part of the corpus contains nearly 34 million tokens, corresponding to about 2.7 million types. The spoken part is much smaller and contains about 800,000 tokens, corresponding to nearly 115,000 types. Part of the Sejong Corpus is annotated for part of speech and includes lemmatic information, but no disambiguation is provided for lemmas that are homographs. This is an important issue since, as mentioned earlier, homographs in Korean are not necessarily homophones. As a result, it is not possible to reliably derive surface phonetic forms from the Sejong Corpus.

The KAIST Corpus [Footnote 2] contains 70 million words and is available to the public, but the downloadable .zip file consists of raw, unparsed Korean text contained in 11,629 separate .txt files. An annotated subset of 1 million words is also available, but still contains only orthographic forms, like the larger version, and is not lemmatized. While this is certainly a valuable resource, it carries the same limitations as the Sejong Corpus with respect to the calculation of ND.

Shin, Kiaer, & Cha (2013) reported phoneme frequency statistics from two corpora: the Yonsei Korean Language Dictionary [Footnote 3] and the Spoken Language Information Lab Corpus (Shin 2008), a corpus of spoken dialogue recorded from 57 native speakers of Seoul Korean. Either of these corpora could potentially be used for the calculation of ND statistics, but neither the phonetic forms of the Yonsei Korean Language Dictionary nor the Spoken Language Information Lab Corpus (in any form) is available to the public.

The current study

This paper presents a lexical database of surface phonetic forms [Footnote 4] and ND measures for Korean words, derived from a publicly available orthographic corpus. The corpus used as the basis for the current work was the Modern Korean Usage Frequency Survey 2 (Kim 2005; Korean title: “ 2”; henceforth MKUFS2). The MKUFS2 is a balanced lemmatized corpus containing 3,086,031 word tokens and 82,501 word types. Although it only provides orthographic forms, what sets the MKUFS2 apart from the corpora described above is its accessibility: it can be downloaded from the website of the National Institute of Korean Language (NIKL) as a single table file containing the orthographic form and lexical frequency of each word. Our work thus consisted of two main parts: phonetization of the orthographic forms, and calculation of ND based on the phonetized forms.

This work proceeded according to the following steps, to be described in detail in the next section. First, we retrieved the pronunciation of each word in the MKUFS2 that had an entry in the online Naver Korean dictionary. Second, because the pronunciation entries in the Naver dictionary are taken from the prescriptive forms in the Korean dictionary published by NIKL, we implemented several phonetic neutralizations that more accurately reflect the modern pronunciation of younger Seoul speakers. Lastly, we calculated several ND measures based on the modern pronunciations, the conservative pronunciations offered by NIKL, and the orthographic forms. These statistics are discussed in Section 3, along with a comparison between the segment-based frequencies measured in our database and those reported in previous studies.

This endeavor resulted in the creation of a database with the following information for each word: phonetic transcriptions of the modern and conservative pronunciations rendered in Worldbet (Hieronymus 1994) and in another easy-to-process encoding scheme, and both segment- and syllable-based ND measures (to be described below). Each word is further identified by the row number of its entry in the MKUFS2, which the user can then refer to alongside the current database.

Method

Background on Korean orthography–phonology mapping

Contemporary Korean is written using an alpha-syllabic system (Hangeul), invented in the 15th century, which was specifically designed to transcribe this language. Modern Hangeul comprises 40 letters (jamo), which are organized into graphical syllable-sized blocks (eumjeol). For example, the word ‘photo’ is composed of two eumjeol. The first eumjeol, , is composed of the jamo /s/ and /a/. The second eumjeol, , is composed of the jamo /i/, and /n/.

The syllabic nature of this alphabet allows Hangeul to encode differences in syllabification between words. For example, the sequence /tali/ can be written as /ta.li/ ‘leg’ or as /tal.i/ ‘moon + nom’. Note that the second eumjeol in the latter example features the jamo , which represents an empty (null) syllable onset. Due to the phonological process of resyllabification, both of these words are pronounced identically as [ta.li]. Crucially, however, the morphological distinction between them is preserved in the spelling.

Other examples of many-to-one mappings between Hangeul and pronunciation relate to phonological mergers and neutralizations. First, a number of phonemes that were formerly distinct, such as the vowels /e/ and /ε/, have merged in Modern Korean and are now pronounced identically by most speakers (see Eychenne & Jang, 2015; Hong, 1988, 36–89; Shin et al., 2013, 99–101 among others). Second, Korean possesses a rich set of phonological processes that neutralize some phonological contrasts in certain environments (Ahn, 1998; Shin et al., 2013, ch. 8). For instance, the contrast between the three bilabial plosives /p/ (lenis), /pʰ/ (aspirated) and /p*/ (fortis) is lost in coda position, where these phonemes are all realized as an unreleased bilabial stop [p̚]. Such phenomena are generally not problematic for a phonetization system since they are fully predictable.

There are, however, a number of processes that are sensitive to morphological information, involve a large amount of lexical idiosyncrasy, and are not reflected in standard Hangeul spelling. To take but one example, in some compound words in which the second morpheme starts with /i/ or /j/, an /n/ is inserted between the two morphemes. For example, /tam#jo/ ‘blanket’ is a compound of the Sino-Korean morpheme /tam/ ‘blanket’ and the native Korean word /jo/ ‘Korean-style mattress’. This word, which has the morphophonological structure /tam#jo/, undergoes [n]-insertion and is pronounced [tamnjo]. Note that this inserted [n] is not reflected in the spelling. However, not all words containing an /i/- or /j/-initial morpheme trigger this process. Thus, the word ‘Friday’, which contains the morphemes ‘gold’ and /jo/ ‘shining’, is transparently realized as . We will not delve into the complicated issues surrounding the range of morphology-sensitive processes in Korean in this paper (but see Shin et al., 2013, ch. 9, for an overview of the most important ones); for our purposes, it suffices to say that these processes make it extremely difficult to derive completely reliable phonetic transcriptions from orthographic forms alone.

Phonetization of MKUFS2

To deal with these unpredictable grapheme-to-phoneme correspondences, we opted for a more direct phonetization strategy by relying on existing publicly available resources (see Appendix for details, including web links). The MKUFS2 corpus is freely available for research purposes and can be downloaded from NIKL’s website. This corpus provides, among other things, a dictionary of grammatical morphemes and lexical items. For the purpose of this work, we only considered the dictionary of lexical items, which contains 82,501 lemmas, along with each word’s token frequency, part of speech, and an optional disambiguation column. In the case of homonyms, the disambiguation column clarified which lemma the entry referred to, and in the case of Sino-Korean or other loanwords, it contained the Chinese characters (hanja) or source language form.

In order to obtain the surface phonetic forms for each word, we used the free online Naver dictionary [Footnote 5]. For most words whose pronunciation differs from the spelling (predictably or not), Naver provides a pseudo-phonetic representation in Hangeul. For instance, the verb form ‘to be ripe’ is phonetized as , which transparently corresponds to the actual phonetic realization . This pseudo-phonetic representation shows the application of the non-predictable /n/ insertion rule discussed above, and also the predictable rule of post-obstruent tensing, which turns the underlying /t/ into tense [t*] because of the preceding obstruent. Homonyms were generally (but not systematically) identified with a numeric code that matched forms across the two corpora. For example, no phonetization is provided for ‘volume unit’, indicating that it is transparently phonetized as , whereas ‘sedan chair’ is phonetized as , with a long vowel in the first syllable.

Thus, the first step of the phonetization procedure was to obtain the pseudo-phonetic transcription in Hangeul, as provided by Naver, for each word in the MKUFS2 corpus. The text file containing the lexical items, which is provided in a legacy encoding (Windows code page 949), was first converted to Unicode (UTF-8) [Footnote 6]. For each word form, we retrieved the first result page(s), up to five. For unambiguous words, we extracted the only entry that was returned; for homonyms, we relied on a combination of the word’s numeric code and hanja disambiguation (where available) to attempt to identify the target entry, giving precedence to the hanja disambiguation in case of conflict. For each entry, we extracted the pseudo-phonetic form when one was provided; otherwise, we used the orthographic form as a pseudo-phonetization since, in that case, the pronunciation was totally transparent. Words that could not be identified were discarded. Failure to identify a word in Naver could have two causes. First, some words from the MKUFS2 corpus were simply not listed at all in Naver. Many of the unknown words were complex verbs composed of a base verb + helping verb (such as ‘to do’, ‘to give’ or ‘to become’), for example ‘to become simplified’. Second, some words were redirected to a similar, but different entry. For example, , a rare word with only one occurrence in the MKUFS2 corpus, was redirected to ; although Naver does provide several entries for the latter form, the first of which is phonetized as , this cannot be used to automatically and reliably derive the phonetization of . Therefore, search results such as this one were excluded. In total, out of the 82,501 lemmas found in the MKUFS2 corpus, 18,665 forms were discarded; we obtained 63,836 phonetic forms, representing 77.4 % of the original corpus [Footnote 7].
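The encoding conversion step can be sketched as follows. This is a minimal illustration rather than the script actually used: the function names and file paths are hypothetical, and only the source and target encodings (code page 949 and UTF-8) come from the description above.

```python
def cp949_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes from Windows code page 949 to UTF-8."""
    # "cp949" is the name of Python's built-in codec for this code page.
    return raw.decode("cp949").encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Rewrite a whole text file (paths are placeholders) as UTF-8."""
    with open(src_path, "rb") as src:
        converted = cp949_to_utf8(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(converted)


# Round trip on a short Hangeul string standing in for one line of the file.
legacy_bytes = "사진".encode("cp949")
utf8_bytes = cp949_to_utf8(legacy_bytes)
```

Decoding strictly (the default) has the advantage that any byte sequence that is not valid in code page 949 raises an error instead of being silently replaced.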

For a large number of words (5018 items, 7.9 % of the database), Naver provided two different pseudo-phonetic forms, representing two pronunciation variants, with or without the application of a number of optional (though widespread) processes, such as the reduction of /je/ to /e/ after a velar stop (/sikje/ ‘watch’ → [sike]), or the neutralization of /ø/ to /we/ (see Table 1). Although each process, taken in isolation, was systematically applied to either the first or second pronunciation variant, the phonetization was not entirely consistent regarding what type of pronunciation each variant was supposed to represent. For example, many processes that characterize a typical modern pronunciation in Seoul Korean were applied to the second variant, but some (such as the insertion of the glide /j/ between /i/ and ) were applied to the first variant. In addition, a number of features found in Modern Korean (e.g., loss of the length contrast, merger between /e/ and /ε/) were not indicated at all.

Table 1 Processes applied to obtain the modern pronunciation variants

In order to alleviate these problems and to make the database maximally useful, we created two pronunciation variants, labeled as “conservative” and “modern”. The conservative variant represents a somewhat archaic, if not artificial, pronunciation where all potential contrasts have been preserved. For example, the vowels and are transcribed as the monophthongs /y/ and /ø/, respectively, which corresponds to the normative pronunciation known as the “Standard Korean Pronunciation” (Shin et al., 2013, 97-99). The modern variant, on the other hand, represents a pronunciation typical of contemporary Seoul Korean. In order to obtain surface phonetic forms for the modern and conservative pronunciations, we first linearized each Hangeul pseudo-phonetic form using a standard code point decomposition algorithm (The Unicode Consortium, 2015, §3.12) which decomposes each eumjeol into its constituent jamo. As an example, the string was linearized into . For all the forms in the database that were phonetized with two variants, we aligned the two strings using the Minimum Edit Distance algorithm, as implemented in Cock et al. (2009). We then built a conservative and modern pronunciation by assigning each mismatched character in Naver’s pseudo-phonetic forms to the appropriate variant. The conservative forms, as mentioned above, retain all of the contrasts.
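The linearization and alignment steps can be sketched as below. This is an illustration under stated assumptions, not the code used to build the database: `linearize` follows the arithmetic Hangul syllable decomposition described in the Unicode Standard, while `align` is a plain dynamic-programming edit-distance alignment written here for self-containedness instead of the library implementation cited above. The function names, the gap symbol, and the romanized test strings are ours.

```python
import unicodedata

# Constants of the arithmetic Hangul syllable decomposition (Unicode Standard).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11,172 precomposed syllables


def linearize(text):
    """Decompose each precomposed eumjeol into its constituent jamo."""
    out = []
    for ch in text:
        s_index = ord(ch) - S_BASE
        if 0 <= s_index < S_COUNT:
            out.append(chr(L_BASE + s_index // N_COUNT))
            out.append(chr(V_BASE + (s_index % N_COUNT) // T_COUNT))
            t_index = s_index % T_COUNT
            if t_index:                      # 0 means no final consonant
                out.append(chr(T_BASE + t_index))
        else:
            out.append(ch)                   # leave other characters alone
    return "".join(out)


def align(a, b, gap="-"):
    """Return one minimum-edit-distance alignment as (a_item, b_item) pairs."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i - 1][j] + 1, d[i][j - 1] + 1)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:                    # trace back from the last cell
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((a[i - 1], gap))
            i -= 1
        else:
            pairs.append((gap, b[j - 1]))
            j -= 1
    return pairs[::-1]


# Mismatched positions are the ones that must be assigned to a variant.
mismatches = [p for p in align("sikje", "sike") if p[0] != p[1]]
```

In the usage line, the two romanized strings stand in for the two linearized variants of ‘watch’ mentioned earlier; the only mismatched position is the glide that is present in one variant and absent in the other.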

After the conservative and modern pronunciations were generated, we checked for potential errors (that is, cases when the phonetization provided by Naver was obviously incorrect) by searching for illegal phoneme strings. For example, underlying word-final /s/, which is common in /t/-final loanwords, is neutralized to an unreleased /t/ on the surface (e.g., /lopos/ “robot” is phonetically realized as [lopot]). Some of Naver’s phonetizations, however, contained errors such as this (e.g., “robot” being phonetized as [lopos] instead of [lopot]), and so in order to correct them we ran another script that checked for any anomalous phonetizations and applied an appropriate patch. We corrected 98 errors using this procedure. Because this method could not catch any errors that did not result in an illegal phoneme string, we also hand-checked a random subset of 1000 words to gauge whether there may be more errors, but did not find any.

Beyond outright errors, even the modern pronunciations provided by the Naver dictionary contained some contrasts that are not maintained by most speakers of modern Seoul Korean, so we applied several phonetic neutralizations to reflect the pronunciations that modern Seoul Korean speakers actually produce. The first neutralization comprised, broadly, the widely reported loss of contrast between /e/ and /ε/, as noted earlier. This neutralization included the realization of both /e/ and /ε/ as /ε/, both /we/ and /wε/ as /wε/, and both /je/ and /jε/ as /jε/. Note that because the realization of /ø/ as /wε/ was already reflected in the Naver modern pronunciations, this step effectively neutralized the conservative three-way contrast among /ø/, /we/, and /wε/. Finally, we neutralized vowel length distinctions since this feature appears to play a marginal role, if any at all, in the phonology of contemporary Seoul Korean (Lee & Ramsey, 2011, 296-297; Shin et al., 2013, 153; Sohn, 1999, 14). The full list of processes applied to the modern forms is given in Table 1.
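The vowel mergers and the loss of vowel length described above amount to a simple segment-level mapping, sketched below. The representation is an assumption made for illustration: each word is a list of segment strings and a long vowel carries a trailing ":". Neither convention is taken from the database itself, and the input words are toy examples.

```python
# Mergers named in the text: each key is realized as its value in the modern forms.
MERGERS = {"e": "ε", "we": "wε", "je": "jε", "ø": "wε"}


def modernize(segments):
    """Apply the vowel mergers and drop vowel length marks."""
    modern = []
    for seg in segments:
        seg = seg.rstrip(":")            # neutralize the length contrast
        modern.append(MERGERS.get(seg, seg))
    return modern


# Toy inputs: a long vowel is shortened, and /je/ merges with /jε/.
shortened = modernize(["k", "a:", "m", "a"])
merged = modernize(["s", "i", "k", "je"])
```

Keeping the mergers in a single lookup table makes it easy to switch individual neutralizations on or off when a different variety of Korean is of interest.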

A few representative examples, drawn from the final database, are provided in Table 2. These examples demonstrate the orthographic representation in Hangeul, the conservative pronunciation provided by Naver, and the modern pronunciation provided by Naver and subsequently updated based on the mergers described above. In the final database, both the conservative and modern pronunciations were rendered in Worldbet (Hieronymus 1994), an ASCII encoding scheme for the International Phonetic Alphabet. In addition, in order to facilitate the calculation of (possibly novel) lexical metrics, we also rendered these pronunciations using a simple encoding scheme which maps each segment (vowel, consonant or diphthong) to a single ASCII character. (This scheme is described in the documentation provided with the database.)

Table 2 Examples illustrating the result of the phonetization process

Calculation of ND measures

Neighborhood density was calculated in several different ways. First, we calculated a set of segment-based ND measures following Luce (1986) and Pisoni, Nusbaum, Luce, & Slowiaczek (1985). Two words were considered neighbors if they differed by the deletion, addition, or substitution of one and only one segment (i.e., an edit distance of one). The neighborhood relation is therefore symmetric (e.g., if /mak/ is a neighbor of /hak/, then /hak/ is a neighbor of /mak/), intransitive (e.g., although /mak/ is a neighbor of /hak/, and /hak/ is a neighbor of /han/, /mak/ and /han/ are not necessarily neighbors), and anti-reflexive (i.e., a word is not a neighbor of itself). We calculated ND using three different representations. The first two representations were the modern and conservative surface phonetic forms described above. The third representation will be referred to as orthographic, in which we treated the orthographic representation (in Hangeul) of each word as a linear string of jamo, instead of as arranged into syllable blocks [Footnote 8]. ND was then calculated based on an edit distance of one jamo. Note that words that differ by only one jamo may not differ phonetically. For example, ‘gourd’ /pak/ and ‘outside’ /pak*/ differ orthographically (and in their underlying phonological representation) in coda position, with the former having a lax velar stop and the latter having a tense velar stop. But because homorganic Korean stops are neutralized in coda position, both words have the same surface phonetic representation of [pak]. Thus, the modern and conservative forms can be thought of as phonetic forms, whereas the orthographic form corresponds more closely to a phonological form.
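The segment-based definition can be sketched as follows. This is one possible implementation, not necessarily the one used for K-SPAN: it indexes every word under "wildcard" patterns so that neighbors are found without comparing all pairs of words. Two simplifying assumptions are made: each word is a tuple of segment strings, and the lexicon is treated as a set of distinct forms, so homophonous entries are counted once. The toy lexicon reuses the /mak/, /hak/, /han/ example, plus a made-up form /ha/ to illustrate deletion and addition.

```python
from collections import defaultdict


def segment_nd(lexicon):
    """Map each word (a tuple of segments) to its number of neighbors."""
    words = set(lexicon)
    # Index each word under every pattern obtained by masking one segment.
    by_pattern = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            by_pattern[w[:i] + (None,) + w[i + 1:]].add(w)

    density = {}
    for w in words:
        neighbors = set()
        for i in range(len(w)):
            # Substitution: same length, same pattern, different segment at i.
            neighbors |= by_pattern.get(w[:i] + (None,) + w[i + 1:], set())
            # Deletion: removing segment i yields another word.
            shorter = w[:i] + w[i + 1:]
            if shorter in words:
                neighbors.add(shorter)
        for i in range(len(w) + 1):
            # Addition: a longer word whose masked segment sits at position i.
            neighbors |= by_pattern.get(w[:i] + (None,) + w[i:], set())
        neighbors.discard(w)             # a word is not its own neighbor
        density[w] = len(neighbors)
    return density


toy_lexicon = [("m", "a", "k"), ("h", "a", "k"), ("h", "a", "n"), ("h", "a")]
toy_nd = segment_nd(toy_lexicon)
```

In the toy lexicon, /hak/ has three neighbors (/mak/, /han/, and /ha/), whereas /mak/ has only one, which illustrates that the relation is symmetric but not transitive.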

Second, we calculated a set of syllable-based ND measures in which two words were considered neighbors if they differed by the substitution of one and only one syllable. The syllable-based measures were also calculated based on the three representations discussed above: modern, conservative, and orthographic. For example, consider the word ‘tree’ /namu/. This word has two syllables, /na/ and /mu/. Its syllable-based neighbors would be all bisyllabic words whose first syllable is /na/ or whose second syllable is /mu/. In this case, although no phonological processes would be applied to obtain the modern (/namu/) and conservative (/namu/) representations, syllable-based ND would still differ among them. For the modern and conservative syllable-based ND, the word ‘brand’ /nak.in/ would be considered a neighbor, since the word-medial /k/, which is a coda of the first syllable, is resyllabified as the onset of the second syllable, as in [na.kin]. For the orthographic representation, however, /nak.in/ would not be considered a neighbor, because the first syllable is represented orthographically as /nak/, whereas in the target word, ‘tree’ /namu/, it is /na/. Note that unlike in segment-based ND, syllable-based neighbors did not include words that differed by deletion or addition. Thus, ‘lumberjack’ /namuk*un/ would not be considered a syllable-based neighbor of ‘tree’ /namu/ in any of the three representations.
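The syllable-based measure differs from the segment-based one only in the unit of comparison and in allowing substitution alone. The sketch below assumes that each word is a tuple of syllable strings and that the lexicon is a set of distinct forms; the function name is ours, and the three-word lexicons simply restate the ‘tree’, ‘brand’, and ‘lumberjack’ example in romanized form.

```python
from collections import defaultdict


def syllable_nd(lexicon):
    """Count words that differ from each word by exactly one syllable."""
    words = set(lexicon)
    by_pattern = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            by_pattern[w[:i] + (None,) + w[i + 1:]].add(w)

    density = {}
    for w in words:
        neighbors = set()
        for i in range(len(w)):
            neighbors |= by_pattern[w[:i] + (None,) + w[i + 1:]]
        neighbors.discard(w)
        density[w] = len(neighbors)
    return density


# Surface (modern or conservative) syllabification: /nak.in/ surfaces as [na.kin].
surface = [("na", "mu"), ("na", "kin"), ("na", "mu", "k*un")]
# Orthographic syllabification keeps the coda in the first block.
orthographic = [("na", "mu"), ("nak", "in"), ("na", "mu", "k*un")]

surface_nd = syllable_nd(surface)
orthographic_nd = syllable_nd(orthographic)
```

With these inputs, ‘tree’ has one neighbor under the surface syllabification and none under the orthographic one, and ‘lumberjack’ is never a neighbor because it differs by an added syllable.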

Results

The resulting database, K-SPAN, which includes the surface phonetic forms and accompanying ND measures, is available in the Appendix. In this section, we summarize some of the salient trends in segment frequencies and ND measures calculated from the database. The trends in segment frequencies will be compared to those reported in Shin et al. (2013), who reported frequency trends from the Yonsei Korean Language Dictionary and the Spoken Language Information Lab Corpus (Shin 2008).

It should first be noted that the K-SPAN database differs from both the Yonsei Korean Language Dictionary and the Spoken Language Information Lab Corpus in several important ways. The lexical entries in K-SPAN were taken from the MKUFS2 (Kim 2005), which listed words in their dictionary form (i.e., stripped of any morphology), in the same way as the Yonsei Korean Language Dictionary. On the other hand, the entries in the MKUFS2 were gathered from a variety of sources, such as textbooks, novels, screenplays, and spoken dialogue, among others, and thus reflect actual usage. The Yonsei Korean Language Dictionary is an actual dictionary, however, and thus may include some very low-frequency words that could be absent from the MKUFS2. The Spoken Language Information Lab Corpus, of course, reflects actual usage, but is different from K-SPAN in that it was gathered entirely from speech and contains morphological markers that are absent in K-SPAN (e.g., the topic marker or the future and conditional modals /kes*/ and ).

Turning back to K-SPAN, the type and token frequencies of vowels and consonants in both the modern and conservative forms are given in Table 3. It can be seen that there are overall slightly more consonants than vowels, which is expected, given that a syllable may contain up to two consonants but necessarily has only one vowel. The bottom two rows of Table 3 show the number and percentage of consonants that are in onset or coda position. As expected, syllable onsets are more common than syllable codas. In addition, while syllables with a consonant onset are far more common than syllables with an empty onset, open syllables are more common than closed syllables. These results are comparable to those calculated from the corpora reported in Shin et al. (2013).

Table 3 Type and token frequencies of segment type in the modern (abbreviated m) and conservative (abbreviated c) forms, calculated over all 63,836 word types (186,239 syllables)

Frequency counts for individual consonants and vowels are given in Tables 4 and 5, which are sorted according to the modern form type frequency. Several trends are apparent. First, although the tense and aspirated consonants have lower type frequencies than the lax obstruents, nasals, and liquid, there are a few consonants whose type and token frequency rankings diverge markedly. Among these are the alveolar stops /t, tʰ, t*/, which all have a much higher relative token frequency than type frequency. We attribute this partly to the fact that the dictionary form of all verbs and adjectives ends with /ta/, resulting in /t/ being over-represented among high-frequency words. Depending on the coda of the preceding syllable, this /ta/ can also surface as [t*a] (when preceded by an obstruent, as in ‘eat’ surfacing as ) or as [tʰa] (when preceded by /h/, as in /anhta/ ‘do not’ and ‘good’ surfacing as [antʰa] and ).

Table 4 Consonant type and token frequencies for the modern (m), conservative (c), and orthographic (o) forms
Table 5 Vowel type and token frequencies for the modern (m), conservative (c), and orthographic (o) forms

Second, the frequencies of the lax obstruents /k/, /t/, /p/, , and /s/ are all lower in the modern and conservative forms than in the orthographic forms, likely reflecting the several processes that phonetically neutralize them. For example, coda lax obstruents surface as homorganic nasals when followed by a nasal or liquid, and onset lax obstruents surface as tense when preceded by an obstruent coda. The wide application of these processes should result in a decrease in the frequency of lax obstruents and an increase in the frequency of nasals and tense obstruents when comparing orthographic forms to phonetic surface forms, and that is exactly what we see in Table 4.

Lastly, it should be noted that the difference between the modern and conservative forms does not substantially impact consonant frequencies. The only consonants whose modern and conservative frequencies differ at all are /k/, /t/, /p/, and /h/. The most common process affecting consonants was same-place deletion, in which a lax stop in a coda-onset sequence of /kk*/, /tt*/, or /pp*/ was deleted [Footnote 9]. Another example was the deletion of /h/ between /n/ and /j/, such as in ‘balance’ surfacing as in the modern pronunciation.

Turning next to the vowel frequencies in Table 5, we see that the frequencies are heavily skewed, with only a few vowels accounting for the majority of counts. The most common vowel across the board is /a/, representing approximately 28 % of the type counts and 37 % of the token counts. The next most frequent vowel, /i/, is at most half as frequent. Regardless of the representation used, /a/, /i/, and account for over half of the type counts, and /a/ and /i/ alone account for over half of the token counts.

Another obvious pattern in the vowel frequencies is the total absence of certain vowels in the modern forms. Specifically, the absence of /e/, /ø/, /je/, and /we/ in the modern forms reflects their neutralization with /ε/, /wε/, /jε/, and /wε/, respectively. Conversely, these neutralizations are reflected in the frequencies of /ε/, /wε/, and /jε/, which are comparatively higher in the modern forms than in the conservative forms. Some of these vowels, /wε/ in particular, are in fact quite rare underlyingly.

Overall, the modern form type frequencies of individual consonants and vowels in the current database closely mirror those of the Yonsei Korean Language Dictionary as reported in Shin et al. (2013). With the exception of /t/, discussed above, the most frequent consonants and vowels are also the same, and the tense and aspirated consonants are also the least frequent across both corpora.

Finally, some summary statistics of the ND calculations are presented in Tables 6, 7, and 8. First, summary statistics of segment-based ND are given in Table 6. The first column contains the statistics for the entire database, and the columns to the right contain statistics for just the words with each corresponding number of syllables. For each of the three representations (modern, conservative, and orthographic), the range, mean, and median ND are provided, along with the percentage of words that have no neighbors (“% 0”). It can be seen that the maximum, mean, and median number of neighbors decreases with increasing syllable count, with the exception of the comparison between three- and four-syllable words. We presume this discrepancy is due to the fact that two-syllable nouns are so frequent, and many of them can take a two-syllable light verb to become a four-syllable verb or adjective (e.g., the noun ‘happiness’ can combine with the light verb /hata/ to become the adjective ‘happy’). Thus, the fact that many four-syllable words already share two of their syllables with many other words serves to counteract the general trend of longer words having fewer neighbors. Nevertheless, an important conclusion to be drawn from Table 6 is that the possible range of ND can vary greatly depending on the number of syllables in the word.

Table 6 Segment-based neighborhood density summary statistics
Table 7 Syllable-based neighborhood density summary statistics
Table 8 Spearman’s rho correlations among the different ND metrics. Mod, Cons, and Ortho refer to the modern, conservative, and orthographic representations, and Seg and Syll refer to the segment- and syllable-based measures, respectively

It can also be seen that every one-syllable word has at least one neighbor, and more than half of all words with three or more syllables have no neighbors. Thus, it is only the set of two-syllable words that contains some words with no neighbors while the majority of words still have some neighbors. Among words with five or more syllables, having even one neighbor seems to be the exception rather than the rule, which suggests that research on the effects of ND in Korean may not be applicable to longer words.
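To make the segment-based measure and the summary statistics concrete, the following is a minimal sketch rather than the procedure actually used to build the database. It assumes the edit-distance-one definition of a neighbor given in the introduction, uses a brute-force pairwise comparison that would be too slow for the full lexicon, and operates on a hypothetical toy lexicon whose segment symbols are illustrative and not taken from K-SPAN.

```python
from statistics import mean, median

def parse(form):
    """Turn 'k a.m a' into (('k', 'a'), ('m', 'a')): syllables of segments."""
    return tuple(tuple(syll.split()) for syll in form.split("."))

def one_edit_apart(a, b):
    """True if segment tuples differ by one addition, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    shorter, longer = sorted((a, b), key=len)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

def segment_nd(lexicon):
    """Map each distinct word to its number of segment-based neighbors."""
    # Syllable boundaries are ignored: each word is flattened to its segments.
    flat = {w: tuple(seg for syll in w for seg in syll) for w in set(lexicon)}
    return {w: sum(one_edit_apart(flat[w], flat[v]) for v in flat) for w in flat}

def summarize_by_syllable_count(nd):
    """Range, mean, median, and percentage of zero-ND words per syllable count."""
    groups = {}
    for word, count in nd.items():
        groups.setdefault(len(word), []).append(count)
    return {
        n: {"min": min(v), "max": max(v), "mean": mean(v),
            "median": median(v), "pct_zero": 100 * v.count(0) / len(v)}
        for n, v in sorted(groups.items())
    }

# Hypothetical toy lexicon (not K-SPAN data).
toy = [parse(f) for f in
       ["k a m", "k a n", "k a", "p a m", "k a.m a", "k a.m i", "s a.r a m.t u l"]]
nd = segment_nd(toy)
summary = summarize_by_syllable_count(nd)
for n_syll, stats in summary.items():
    print(n_syll, stats)
```

Even in this toy lexicon the one-syllable words all have neighbors while the lone three-syllable word has none, which mirrors in miniature the trend described for Table 6.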

Table 7 contains the same statistics calculated for syllable-based ND. Across the board, syllable-based ND tends to be higher than segment-based ND, which is expected given that syllable-based neighbors can differ by more segments than segment-based neighbors can. One result of this trend is that there is much greater variation in ND within each syllable count. Almost all two-syllable words, and most three- and four-syllable words, have at least one neighbor. On the other hand, there is very little variation in ND among monosyllabic words. Because syllable-based neighbors are defined as words that differ by the substitution of exactly one syllable, all monosyllabic words should be neighbors of each other. The reason ND is not uniform among them is that homophones are technically not neighbors of each other, and so a word’s ND is reduced by the number of homophones it has.
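The effect of homophones on monosyllabic ND can be illustrated with a short sketch. It assumes that neighbors are counted over lexical entries (so that homophonous entries are listed separately) and that a neighbor must have the same number of syllables and differ in exactly one of them; the entries below are hypothetical romanized forms, not K-SPAN data.

```python
from collections import Counter

def syllable_nd(entries):
    """Syllable-based ND for each distinct form.

    `entries` holds one tuple of syllables per lexical entry, so homophones
    appear more than once. A neighbor is an entry with the same syllable count
    that differs in exactly one syllable; identical forms are never neighbors.
    """
    form_counts = Counter(entries)
    # Index every entry under each pattern with one syllable masked out.
    pattern_counts = Counter()
    for form in entries:
        for i in range(len(form)):
            pattern_counts[form[:i] + (None,) + form[i + 1:]] += 1
    nd = {}
    for form, n_same in form_counts.items():
        total = 0
        for i in range(len(form)):
            pattern = form[:i] + (None,) + form[i + 1:]
            # Entries matching the pattern, minus the form itself and its homophones.
            total += pattern_counts[pattern] - n_same
        nd[form] = total
    return nd

# Hypothetical entries; ("pam",) is listed twice to stand for a homophone pair.
entries = [("pam",), ("pam",), ("nun",), ("mal",),
           ("sa", "ram"), ("sa", "rang"), ("ba", "ram")]
nd_syll = syllable_nd(entries)
for form, count in nd_syll.items():
    print(".".join(form), count)
```

With four monosyllabic entries, the forms without homophones have three neighbors each, whereas the homophonous form has only two, which is the source of the small variation among monosyllables noted above.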

Finally, Table 8 reports the Spearman’s rho correlations among lexical frequency and the six ND measures reported in the database. The top panel reports the correlations for the entire database. It can be seen that all of the ND measures are only weakly correlated with lexical frequency, and all of the ND measures are strongly correlated with each other. Because only the two-syllable words showed substantial variation in ND according to both the segment- and syllable-based measures, the bottom panel reports the correlations among the measures when only the two-syllable words are considered. The overall trends are similar, with frequency even more weakly correlated with ND, and the various ND measures only slightly less strongly correlated with each other.
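For readers who wish to compute comparable correlations over their own subsets of the database, Spearman’s rho can be obtained as the Pearson correlation of rank-transformed values, with tied values receiving the average of their ranks. The sketch below uses only the Python standard library; the frequency and ND values are invented for illustration and are not drawn from K-SPAN.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented values for six hypothetical words.
frequency = [120, 15, 3, 48, 7, 3]
segment_based_nd = [4, 9, 1, 6, 2, 0]
syllable_based_nd = [30, 75, 12, 51, 20, 8]

print(spearman_rho(frequency, segment_based_nd))
print(spearman_rho(segment_based_nd, syllable_based_nd))
```

Because the measure depends only on ranks, it is insensitive to the very different scales of segment- and syllable-based ND.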

This is not to say, of course, that these different ND measures will always pattern similarly. For example, ‘murder’ has 109 orthographic syllable neighbors but 505 modern syllable neighbors, as the modern surface form is . On the other hand, /p*ah.ta/ ‘crush’ has 556 orthographic syllable neighbors but only 58 modern syllable neighbors, as the modern surface form is [p*a.tha]. Disparities such as these can arise when an orthographically uncommon syllable undergoes some phonological process (e.g., neutralization, resyllabification) that renders its surface form the same as a frequent syllable. Alternatively, very common orthographic syllables (such as /ta/, the marker for all verbs and adjectives) can undergo some process (e.g., aspiration, in this case) that renders their surface forms far less frequent (e.g., /tha/).

Conclusions

Despite the large body of research on Korean language processing, there has been no publicly available phonetized lexical database of Korean until now. The database presented here, K-SPAN, provides surface phonetic forms derived in two different ways for 63,836 Korean words. When combined with the lexical frequencies and part of speech information provided in the MKUFS2 corpus (Kim 2005), a wide range of useful statistics may be computed. Among these, K-SPAN itself includes six different measures of neighborhood density: both segment- and syllable-based ND calculated from modern surface phonetic forms, conservative surface phonetic forms, and orthographic representations. The availability of K-SPAN opens several avenues for future research.

First, the surface phonetic forms, instantiated here as “conservative” and “modern” pronunciations, may be used to look up the pronunciation of Korean word forms without having to consult a Korean-language dictionary. Although there exist several freely available Korean corpora, including the MKUFS2, all of them are rendered orthographically (in Hangeul). K-SPAN therefore simplifies the calculation of various statistics over the Korean lexicon, such as n-gram phoneme frequencies, since the surface phonetic forms are rendered in an ASCII scheme. Such queries would be impossible in an orthographically rendered corpus. For example, several studies have examined the potential role of functional load, a measure of the strength of a phonological contrast, in phoneme mergers and neutralizations in Korean (Eychenne and Jang 2015; Silverman 2010) and across languages, including Korean (Oh et al. 2015; Wedel et al. 2013a, b). However, the Korean data used in these studies were “phonological” forms similar to our orthographic forms and/or forms phonetized by rules, which, as we have seen, often do not reflect the actual pronunciation. K-SPAN now offers a more reliable database that can be used to calculate such metrics.
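As a simple illustration of the kind of statistic that segmented surface forms make easy to compute, the sketch below counts n-gram type frequencies over a list of words. It assumes that each word has already been split into a list of segment symbols; the symbols and words shown are hypothetical and do not reproduce the K-SPAN transcription scheme, and the word-boundary marker "#" is likewise an assumption of this example.

```python
from collections import Counter

def ngram_type_frequencies(forms, n=2, boundary="#"):
    """Count segment n-grams across word types, padding each word with boundaries."""
    counts = Counter()
    for segments in forms:
        padded = [boundary] + list(segments) + [boundary]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts

# Hypothetical segmented surface forms; "p*" stands for a single tense consonant.
forms = [["k", "a", "m"], ["k", "a", "n"], ["p*", "a"]]
bigrams = ngram_type_frequencies(forms, n=2)
for gram, count in bigrams.most_common(3):
    print(" ".join(gram), count)
```

Treating each word as a list of symbols rather than a raw string keeps multi-character symbols such as a tense consonant from being split across n-grams. Token frequencies could be obtained in the same way by weighting each word by its corpus frequency.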

Second, the ND statistics provided in K-SPAN may be used to extend studies of ND effects to Korean. For example, it remains unknown whether or how ND affects spoken language production or perception in Korean. Previous studies have suggested that the eumjeol (or syllable) may play an important role in visual word recognition, but a proper comparison between the effects of segment- versus syllable-based ND has not been possible. Similarly, it has also not been explored whether there might be any meaningful difference between ND calculated on surface phonetic forms or orthographic forms (which, in Korean, more closely reflect underlying forms). Future work may also explore the usefulness of other types of ND measures, for instance position-sensitive ND, such as the first-syllable frequency metric used in Kwon et al. (2011), or ND measures calculated within a given syntactic category rather than across the lexicon, since it has been suggested that words compete more strongly when they can be substituted for one another in the speech stream (Wedel et al. 2013a).

The current database will therefore help researchers, including those who may not be literate in Korean, to explore the Korean lexicon in greater depth, thereby widening the empirical scaffolding upon which theories of the lexicon are built.