In this section, we describe our data collection and processing workflow. In this description, we will frequently use the terms ‘source language’ and ‘target language’. Unlike in lexicography, the target language will always be the language for which we want to collect the data, and the source languages are the languages in which dictionaries into the target language are available. These names remain the same even if we refer to data from a dictionary that translates target language lemmas back into one of the source languages.
Design decisions
Many similar large-scale databases (e.g. IDS or WOLD) rely on language experts or native speakers for their data. While this ensures the quality of the result, it is often difficult to involve enough experts to achieve good coverage, and for many of the smaller languages there are few or no experts or native speakers at all. Because of this, we decided to initially collect all of the data ourselves in a small team from written sources, and to ask experts and native speakers only later for confirmation and for help with missing and unclear data points.
Of course, this means that our initial version is not perfect and still requires corrections and revisions. Moreover, documentation for many languages is sparse, especially for verbal concepts, and we sometimes had to rely on small individual sources, or on old and possibly outdated ones. At times, two sources for the same language employed different orthographies, which had to be bridged to maintain compatibility. These orthographies would sometimes not adequately render the language’s phonology, and phonological information was often missing or presented in a variety of different transcription schemes.
As primary data sources, we rely exclusively on published dictionaries, on paper or in digital formats, from which information is typically extracted manually by going through all of the relevant entries. We initially experimented with OCR for pre-processing, but found that OCR models perform very poorly on bilingual dictionaries in particular, because they tend to be trained on running text in a single language and cannot handle the very peculiar formatting and mixed-language nature of scanned dictionary pages. Post-processing the OCR output turned out to be just as time-consuming as typing in the relevant information, especially because the extraction process involves various standardization steps, such as normalizing part-of-speech annotations and domain labels.
All lexical items are extracted in the native orthography whenever possible, both to preserve the information provided by the sources and to enable automated integration of comparable data from different sources. A phonetic representation is later inferred automatically from the orthography, or stored in addition where necessary (see Sect. 3.3). In all cases, we took the pragmatic approach of using the form that our dictionary sources use to translate our source language lemmas. Typically, no attempt was made to reduce lemmas to stems, or to normalize forms so as to reflect underlying representations. Only when different sources used different citation forms did we work within the grammatical system of the respective language and look up or derive the desired form in order to ensure consistent treatment. For instance, in the case of qualities in Korean, we opted for the present determiner form instead of the non-past indicative used for other verbs.
Compared to projects which are composed of individual expert contributions, our method has the advantage that we are familiar with all the data and have a good overview of the current status of our word lists, allowing us, as the data managers, to confidently implement corrections and additions that would require too much long-term commitment from expert contributors. Because we do not rely solely on external help, we can achieve complete coverage of the desired languages and were able to progress quickly in our initial data collection. The vast majority of our entries are filled, and the current version of our database, even though not yet reviewed by experts, is already fully functional for its intended purposes.
Data collection procedure
During the four years (2013–2017) of the data collection process, we developed systematic data collection procedures tuned towards mass processing of dictionary sources by non-experts in the respective target languages. Since, to our knowledge, this type of data collection has never before been performed on such a scale, we describe what have emerged as our best practices in some detail here. The discussion of our five-step process, which is also outlined in a workflow diagram in Fig. 2, also explains the motivation behind some of our non-obvious design decisions.
When selecting our lexical resources for target languages where several dictionaries were available, we did not base our decision on the immediate accessibility of the respective source language, but on a preference for works describing the standard language. For many Uralic languages, this meant not relying on scientific dictionaries in German or English (as seems to be common practice in Western Europe), but on Russian dictionaries of the sometimes very recently established standard languages. Moreover, we always preferred lexical resources where both translation directions were developed independently over resources where one of the two directions is only available as an index or a mechanical inversion of the translation pairs in the other.
(1) Create lookup lemma list In order to look up the best equivalents of our 1016 concepts across all the languages of Northern Eurasia, we had to bridge different source languages, for which we needed to develop lookup lemma lists based on our original list. This original list is in German, which became the main language of the database due to the nature of our pre-existing data, and because we found it to be much more efficient and less error-prone to write dictionary entries in the native language of most project members. While all the major source languages we needed to bridge were also target languages in our database, it would not have been optimal to simply re-use the prepared wordlists as lemma lists for lookup, because some of the most natural choices for a given concept are polysemous, while much more explicit alternative glosses exist. For instance, the concept of shedding tears is most naturally expressed by the verb “to cry” in contemporary English. However, especially in older dictionaries, the translations one finds under “cry” are centered around the concept of shouting, whereas the target concept can be looked up much more reliably under “weep”. A particularly difficult challenge was the selection of the best Russian equivalents for basic verbs, especially in the domains of movement and manipulation. It is largely unpredictable which of the many corresponding Russian verbs, which lexicalize slight differences in grammatical aspect and other meaning components, is used for this purpose in small dictionaries. In larger dictionaries, the nuances expressed by the different verbs are often modeled quite faithfully, but unnaturally, by regular derivational morphology in the target language. After comparing different sources and getting an impression of the most common practices, we decided in most cases to use the perfective forms (often with a disambiguating prefix) as the most useful Russian lookup lemmas. Many such considerations were involved in the development of our lookup lemma lists for English, German, and Russian. We consider these three the official gloss languages of the database because together, they provide enough sources to cover the bulk of North Eurasian languages. In addition to these three primary languages, our intention to only use the best available resources for each language created a need for lookup in a surprisingly large number of smaller languages. For these smaller languages, we did not produce independent lookup lemma lists as for the major languages, but took the less ideal decision of relying on the previously collected data for the source language instead. This was our strategy for the following source languages: Norwegian (for the Western Saami languages), Swedish (for some information on South Saami), Finnish (for Inari Saami and Skolt Saami), Estonian and Latvian (for Livonian), Hungarian (for some information on Northern Mansi and Nganasan), French (for Breton), Japanese (for some information on Ainu), and Chinese (for some information on Manchu).
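To make the shape of such a lookup lemma list concrete, the following sketch shows a hypothetical excerpt in Python. The concept identifiers, the column layout, and the Russian lemma choices are invented for illustration; only the contrast between “weep” and “cry” is taken from the discussion above.

```python
# Hypothetical excerpt of a lookup lemma list bridging the gloss languages.
# Concept identifiers, column layout and the Russian lemmas are illustrative
# only; the actual project files are not reproduced here.
LOOKUP_LEMMAS = [
    # (concept id, German gloss, English lookup lemma, Russian lookup lemma)
    ("WEINEN",   "weinen",   "weep",  "заплакать"),  # not "cry", which older
                                                     # dictionaries tie to shouting
    ("SCHREIEN", "schreien", "shout", "закричать"),
]
```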
(2) Lookup in source–target direction The second step is to look up all the lemmas in a source–target dictionary (e.g. Estonian–Livonian) and digitize the relevant parts of the target entries for each lemma as faithfully as possible while adapting them to our internal formats. This means that all equivalents are stored in the order in which they appear in the source, annotations given in the source (e.g. abbreviations such as fig. for ‘figurative’, and disambiguating information such as prototypical objects for verbs) are extracted, and care is taken to distinguish the separators between different senses of a lemma (frequently a semicolon) from those between alternative translations (frequently a comma). Working with a small team of data collectors coordinated by a single person made it possible to ensure largely consistent and standardized annotations across languages, a prerequisite for the later Step 4, in which all the extracted information is pre-processed automatically. Representation in the original standard orthographies ensures ease of automated retrieval and compatibility across different resources.
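The following sketch illustrates how such a digitized entry could be split into senses and alternative translations while keeping the annotations. The raw string format (‘;’ between senses, ‘,’ between alternatives, annotations in parentheses) is a simplified stand-in for our internal format, which is not spelled out in this article.

```python
import re

def parse_entry(raw):
    """Split a digitized dictionary entry into senses (separated by ';')
    and alternative translations (separated by ','), extracting bracketed
    annotations such as '(fig.)' along the way."""
    senses = []
    for sense in raw.split(";"):
        alternatives = []
        for alternative in sense.split(","):
            annotations = re.findall(r"\(([^)]*)\)", alternative)
            form = re.sub(r"\([^)]*\)", "", alternative).strip()
            if form:
                alternatives.append({"form": form, "annotations": annotations})
        if alternatives:
            senses.append(alternatives)
    return senses

# parse_entry("form1, form2 (fig.); form3")
# -> [[{'form': 'form1', 'annotations': []},
#      {'form': 'form2', 'annotations': ['fig.']}],
#     [{'form': 'form3', 'annotations': []}]]
```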
(3) Lookup in target–source direction The third step is the reverse lookup stage, where all lemmas in the target language that were collected in the previous step, which can amount to several thousand depending on the size of the dictionary, are looked up in a target–source dictionary (e.g. Livonian–Estonian). The lookup list for this step is produced by automatically inverting the completed lookup list from Step 2, and sorting it alphabetically in the target language for more efficient lookup. Otherwise, procedures and formats are exactly the same as in the preceding lookup step, creating mirror lists which model the lexical correspondences from the viewpoint of both languages, often making it possible to resolve potential polysemies. Usage information, especially on verbs, which can be extracted from example sentences found in good dictionaries, is frequently encoded in additional annotations to further enrich the decision basis.
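A minimal sketch of this inversion step in Python follows; the dict-of-lists representation of the lookup lists is a simplification of our actual file formats.

```python
from collections import defaultdict

def invert_lookup_list(source_to_target):
    """Invert a completed source->target lookup list (source lemma mapped to
    the list of target lemmas found for it) into an alphabetically sorted
    target->source lookup list for the reverse lookup step."""
    inverted = defaultdict(set)
    for source_lemma, target_lemmas in source_to_target.items():
        for target_lemma in target_lemmas:
            inverted[target_lemma].add(source_lemma)
    # sort target lemmas alphabetically for more efficient manual lookup
    return {target: sorted(sources) for target, sources in sorted(inverted.items())}
```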
(4) Automated aggregation In the fourth step, the gathered lemmas are mapped back onto the primary concept list. First, the looked-up source language translations are automatically mapped to one of the gloss languages (typically German, the native language of most contributors), which helps to ensure consistency of translations across target languages. The lemmas of the target language are then assigned to the corresponding concepts in a preliminary version of a selection file. To facilitate the work for the human collector, the system already discards some lemmas based on mismatches between their translations in the two lookup steps. However, the entire data set is also compiled into a PDF summary file, which lists all possible translations for each concept, even those discarded in the selection file, together with the translations from both lookups and the annotations provided by the dictionaries. This document, containing up to 500 pages of information about the 1016 concepts in our final list, provides the data compiler with a compact view of all the relevant information needed to efficiently perform the subsequent lemma selection decisions for each language-concept pair.
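The mismatch-based pre-filtering could look roughly like the following sketch, in which target-language candidates whose reverse lookup does not lead back to the original lookup lemma are set aside but not deleted. This is one plausible reading of the step described above, not our actual implementation.

```python
def prefilter_candidates(source_lemma, forward, backward):
    """Split the target-language candidates for a source lemma into those
    whose reverse lookup leads back to the source lemma and those that are
    set aside. 'forward' maps source lemmas to candidate target lemmas,
    'backward' maps target lemmas to their back-translations."""
    kept, set_aside = [], []
    for target_lemma in forward.get(source_lemma, []):
        if source_lemma in backward.get(target_lemma, []):
            kept.append(target_lemma)
        else:
            set_aside.append(target_lemma)
    # Candidates that are set aside still appear in the PDF summary file,
    # so the human data collector can overrule the automatic decision.
    return kept, set_aside
```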
(5) Lemma selection The fifth and final step consists in manually reviewing the pre-generated selection files which store the selection decisions concerning the best equivalent of each concept in the target language. While automated mapping into a gloss language is used for the automated pre-filtering and to create indices bridging different source languages, the decision process itself is always performed on the original data by a data collector with at least a good passive command of the source languages. If multiple translations seem fitting, one is selected based on disambiguation information provided by the stored dictionary annotations, consistency across multiple dictionaries (if available), and the order of translations in the dictionary entries (assuming that the most widely used word is listed first in the sources). The latter criterion provides another good reason to prefer school dictionaries over scientific ones, because they tend to focus on a single most natural translation, instead of trying to cover all senses, in the worst case in alphabetical order. If the dictionaries themselves do not provide enough information to make a decision, the data collector consults other sources, such as additional dictionaries, grammars or websites. For additional example sentences, we sometimes relied on the collaborative database Tatoeba (Ho and Simon 2016), and Google phrase searches in the target language often helped us to clarify the contexts in which words are used. Image searches have proved to be particularly useful for nouns, and even the word picked by translation tools such as Google Translate (as unreliable as automated translation generally is) for a gloss language lemma in the context of a sentence is sometimes helpful in deciding on a lemma. If no translation was found in the source dictionaries, the same additional sources are consulted to cover as many concepts as possible. For concepts where two or more target translations are required because the target language makes more semantic distinctions than the source language (e.g. ‘older brother’ vs. ‘younger brother’), or where the available resources are not sufficient to decide on a single term, multiple translations can also be given. These are only used sparingly, however, and one of our rules is to never use more than three translations. Each selection decision is annotated with one of four status values describing our level of certainty. The possible statuses are “Questionable” for concept-language pairs for which no information or only suspicious-looking data from unreliable sources like Google Translate was available, and “Review” for decisions that we are uncertain of, typically due to ambiguous sources, and which would need to be reviewed by an expert such as a native speaker or a linguist specializing in the language. “Validate” is the status of decisions where the sources seemed quite clear and we have no evidence contradicting our choice. For decisions with this status (which comprise almost 90% of the released database), we would still like to get confirmation by experts, but we do not consider them a very high priority. Finally, “Validated” is the status of selection decisions that have already been checked and confirmed by experts or native speakers. To facilitate further review, all data collected in the individual steps (source–target lookup, target–source lookup, selection) is retained and archived.
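As a schematic illustration of what a single selection decision carries, the following sketch encodes the constraints just mentioned (at most three translations, one of four status values); the record layout and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# The four certainty statuses named above; the record layout is hypothetical.
STATUSES = {"Questionable", "Review", "Validate", "Validated"}

@dataclass
class SelectionDecision:
    concept_id: str
    language: str
    lemmas: List[str]      # never more than three translations per concept
    status: str

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")
        if len(self.lemmas) > 3:
            raise ValueError("at most three translations may be given")
```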
In addition to their key role in the generation of lookup reports, separate machine-readable files for each type of information also facilitate automated consistency checking. They also remove the need to go back to the primary sources on paper during subsequent steps of the revision process.
Deriving phonetic representations
For computational approaches that do not start at the level of cognacy decisions, the main problem with many existing lexical databases is that they primarily focus on cognacy judgments, and that little effort is put into standardized and detailed phonetic transcriptions. The differences between the transcriptions employed by the different databases also make it quite difficult to combine their data in order to derive larger aggregate databases with better coverage.
The Austronesian Basic Vocabulary Database, for instance, while providing quite uniform transcriptions for languages which do not have an official orthography, only contains the written forms for those languages that do. ASJP consistently uses its own transcription scheme, which, however, reduces the sounds of the world’s languages to 41 equivalence classes that do not suffice to adequately transcribe central phonological distinctions in many languages. While in principle the ASJP encoding defines diacritics which would be able to express many of these distinctions, in practice only the 41 basic symbols are used consistently. IELex provides full IPA transcriptions, but only for some languages, whereas it relies on a mixture of original orthographies and transliteration for others. The dictionaries contained in the IDS do not even aim to include a uniform phonetic representation across all languages, favoring orthographic forms for most languages and otherwise leaving the decision of how to represent the words to the expert contributors.
To retrieve a phonetic representation of our lexical data, we developed a simple transcription system that can automatically transcribe orthographic input to IPA in Unicode with the help of language-specific conversion rules. While phonetic transcriptions for individual words are hard to come by especially for smaller and less well-documented languages, most grammars include at least an overview of the phonology, providing us with information on how words in the language are pronounced.
An obvious challenge for such an approach is that the dictionary forms are its only source of information, from which everything needs to be derived. This reliance on written forms causes problems whenever the standard orthography does not fully represent pronunciation. Examples include the non-phonemic weakly voiced vowels which are not represented in the orthographies of Tundra Nenets and Skolt Saami, palatalization in the nominative case of some Estonian nouns (caused by an elided front vowel which is only visible in other case forms), and epenthetic vowels which split up consonant clusters in Armenian and other languages. While substantial effort was put into predicting and implementing these phenomena whenever we became aware of them, in other cases we opted to reduce complexity by aiming for a phonemic notation that corresponds more closely to the orthography. Since many of these distinctions are not of central importance to historical linguistics, we decided that even a transcription which does not fully cover these phenomena was good enough for a first release. While some of the resulting transcriptions have a hybrid status somewhere between the phonetic and phonemic levels, the level of detail usually suffices to accurately represent distinctions which are relevant e.g. for sound correspondences. Pending the expert feedback that will lead to even better transcriptions, the automated transcriptions generally provide a good approximation of each word’s pronunciation, which at least makes uniform application of algorithms across languages feasible.
The main advantage of an automatic transcription system, even if it does not always yield perfect output, is that expert feedback on the results does not have to be applied manually to each affected word, but can usually be integrated as a new or modified rule that systematically adjusts faulty transcriptions. Also, different contributors who manually write IPA transcriptions for words in a language will unavoidably disagree on some details, whereas using an automated system ensures consistency across an entire wordlist. If there are exceptions to the language’s general pronunciation rules, a very common phenomenon in loanwords, our infrastructure makes it possible to override the automatic conversion by specifying the transcription directly in the word list. These mechanisms make our design much more flexible than the approach taken by other databases, where phonetic transcriptions are considered primary data which are maintained separately. There, the effort required for manual revisions makes it much more unlikely that existing transcriptions will be revised and adapted if, for example, it turns out that the same sound was represented by different experts in incompatible ways. Our impression is that these difficulties are the main reason why databases of expert contributions like IDS or WOLD do not contain uniform phonetic representations.
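The override mechanism mentioned above can be pictured as a simple precedence rule: an explicit transcription stored with a wordlist entry wins over the automatic conversion. The sketch below is illustrative only; the function and parameter names are not taken from our code.

```python
def ipa_for_entry(orthographic_form, explicit_ipa, transcriptor):
    """Return the IPA form for a wordlist entry. An explicit transcription
    stored in the word list (e.g. for loanwords that violate the general
    pronunciation rules) takes precedence over the automatic rule-based
    conversion performed by the language's transcriptor."""
    if explicit_ipa:
        return explicit_ipa
    return transcriptor(orthographic_form)
```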
In our system, the typical transcriptor for a language is defined by one or more plain text files containing lists of simple rewrite rules. In order to facilitate human editing of these rules, all input is first converted to X-SAMPA (Wells 1995), and these X-SAMPA transcriptions are then transformed into IPA in a second step. The rules can have the form sch → S to model simple letter-to-sound correspondences. To represent more complex phonological processes, it is possible to specify symbol classes such as frontVowel = [e i ä ö ü] and backVowel = [a o u]. These classes can then be referenced in a rule to systematically change symbols in certain environments, as in properly converting German ch into the ich and ach sounds: [frontVowel]ch → [.]C and [backVowel]ch → [.]X, where [.] represents the class on the left side. A rule can contain arbitrarily many classes, and a class can contain an arbitrary number of symbols and strings. We found that these two rule types are sufficient to concisely model the grapheme–phoneme correspondences of most languages.
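To make the rule notation concrete, the following sketch shows how such a rule file could be parsed into symbol classes and ordered rewrite rules. The exact file syntax (here with an ASCII arrow ‘->’) is guessed from the examples above rather than taken from our actual rule files.

```python
def parse_rule_file(lines):
    """Parse a plain-text rule file into symbol class definitions and
    ordered rewrite rules, following the notation sketched above."""
    classes, rules = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if "->" not in line:
            # symbol class definition, e.g.  frontVowel = [e i ä ö ü]
            name, members = line.split("=", 1)
            classes[name.strip()] = members.strip().strip("[]").split()
        else:
            # rewrite rule, e.g.  [frontVowel]ch -> [.]C
            lhs, rhs = (side.strip() for side in line.split("->", 1))
            rules.append((lhs, rhs))
    return classes, rules

example = [
    "frontVowel = [e i ä ö ü]",
    "backVowel = [a o u]",
    "sch -> S",
    "[frontVowel]ch -> [.]C",
    "[backVowel]ch -> [.]X",
]
CLASSES, RULES = parse_rule_file(example)
```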
The transcriptor program tries to apply each of these rules in a given file in order and greedily consumes a substring once it has been matched by a rule. Within the same rule file, that string cannot be matched again (e.g. to convert the front and back vowels of the previous example to their X-SAMPA equivalents). However, the output of the application of each rule file serves as the input to the next, so any number of files may be chained together to achieve the desired end result. It is often practical to place each type of phonetic process in a separate file, so that the generation of the final form proceeds in logically separate steps (e.g. Icelandic öngull → öNkudl → 9yNkYdl → 9yNkYtl_0 → œyŋkʏtl̥).
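The application procedure could be re-implemented roughly as follows: the rules of one file are tried in order, matched substrings are marked as consumed so that later rules in the same file cannot touch them, and the output of each file is fed into the next. This is a sketch of one plausible reading of that behaviour, with ‘[.]’ copying the first class matched; it is not our actual transcriptor code.

```python
import re

def class_regex(side, classes):
    """Turn a rule side such as '[frontVowel]ch' into a regex in which each
    class reference becomes a capturing group."""
    regex = ""
    for token in re.split(r"(\[[^\]]+\])", side):
        if token.startswith("[") and token.endswith("]"):
            regex += "(" + "|".join(map(re.escape, classes[token[1:-1]])) + ")"
        else:
            regex += re.escape(token)
    return regex

def apply_rule_file(text, classes, rules):
    """Apply the rules of one file in order. Substrings produced by a rule are
    marked as consumed and protected from all later rules in the same file."""
    segments = [(text, False)]                      # (substring, consumed?)
    for lhs, rhs in rules:
        pattern = re.compile(class_regex(lhs, classes))
        updated = []
        for segment, consumed in segments:
            if consumed or not segment:
                updated.append((segment, consumed))
                continue
            position = 0
            for match in pattern.finditer(segment):
                updated.append((segment[position:match.start()], False))
                replacement = rhs
                if "[.]" in rhs:                    # copy the matched class symbol
                    replacement = rhs.replace("[.]", match.group(1))
                updated.append((replacement, True))
                position = match.end()
            updated.append((segment[position:], False))
        segments = updated
    return "".join(segment for segment, _ in segments)

def transcribe(text, rule_files):
    """Chain rule files: the output of each file is the input to the next,
    e.g. orthography -> X-SAMPA -> adjusted X-SAMPA -> IPA."""
    for classes, rules in rule_files:
        text = apply_rule_file(text, classes, rules)
    return text

# With the classes and rules from the previous sketch:
# apply_rule_file("ich", CLASSES, RULES)    -> "iC"
# apply_rule_file("schach", CLASSES, RULES) -> "SaX"
```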
In the future, this simple transcription system will be replaced by a finite-state based system, which acts as an interface between our previously developed transcription rule files and the Helsinki Finite State Toolkit (HFST; Koskenniemi and Yli-Jyrä 2008). It converts the rule files into a series of regular expressions, from which HFST constructs the corresponding finite state transducer. Once this transducer has been created, it can be reused to quickly transcribe any input in the given language. This system is faster, especially on longer words and sentences, and could thus also be employed for transcribing whole texts. Moreover, the underlying transducer works independently of our system and can be distributed on its own. Additionally, HFST offers convenient tools to convert transducers into the internal formats of various other finite state tools.
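The conversion can be pictured as rendering each rewrite rule as a Xerox-style replace expression and composing the results, which HFST can then compile into a transducer. The sketch below only generates the regular-expression strings and glosses over the details of class handling and rule ordering that the actual converter has to deal with.

```python
def spaced(symbols):
    """Write each character as a separate symbol for the regex compiler."""
    return " ".join(symbols)

def rule_to_replace_expression(lhs, rhs, classes):
    """Render one rewrite rule as a Xerox-style replace expression.
    Only a single leading class reference is handled in this sketch."""
    if lhs.startswith("["):
        name, rest = lhs[1:].split("]", 1)
        context = "[ " + " | ".join(classes[name]) + " ]"
        return f"{spaced(rest)} -> {spaced(rhs.replace('[.]', ''))} || {context} _"
    return f"{spaced(lhs)} -> {spaced(rhs)}"

def rules_to_regex(rules, classes):
    """Compose the individual replace expressions in rule order."""
    return " .o. ".join(
        "[ " + rule_to_replace_expression(lhs, rhs, classes) + " ]"
        for lhs, rhs in rules
    )

# rules_to_regex([("sch", "S"), ("[frontVowel]ch", "[.]C")],
#                {"frontVowel": ["e", "i", "ä", "ö", "ü"]})
# -> '[ s c h -> S ] .o. [ c h -> C || [ e | i | ä | ö | ü ] _ ]'
```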
These HFST-based orthography-to-IPA transducers, and the code for compiling them from our rule files, will be made publicly available as part of an additional publication, which also describes transducer development in more detail. Together with the code, we also plan to release all the transducer definition files for the 103 languages for which we have automated transcription modules from either the orthography or standard transcriptions (e.g. Persian, Pashto, Japanese, Chinese).
In order to illustrate what the result of the entire workflow looks like in the current database release, we display a sample from the table containing the words for ‘rainbow’ in Fig. 3. In this snippet, we see many different writing systems, one language where we need to generate the IPA from a standard phonetic transcription (Japanese), one where some additional phonetic information from the dictionary needed to be modeled (Kalmyk), and many others where the pronunciation is quite predictable from the orthography, allowing us to automate the mapping using a chain of transducers. The last column contains the mentioned status values for each selection decision, in this case implying that we are quite uncertain about the words in Kalaallisut and Lak, and would prioritize these words when getting in contact with experts and native speakers, whereas we are already reasonably certain about our choices for all of the other words.
The preliminary release version 0.9 of NorthEuraLex has been available via the project webpageFootnote 1 for inspection and download since July 2017, and we are officially releasing it with this article. The web interface builds on the CLLD framework by Forkel et al. (2018a), which Pavel Sofroniev adapted for the purposes of this project. The database is licensed under a CC-BY-SA license, allowing anyone to use or extend it however they wish, as long as the original version is attributed to us by citing this article, and any extensions are published under the same terms. This policy goes a long way towards ensuring long-term availability, as evidenced by the fact that the database has already been added to the Zenodo online repository in a repackaged form.Footnote 2 This version is in the Cross-Linguistic Data Format (CLDF) as described by Forkel et al. (2018b), a linked data format which is quickly becoming the standard for lexicostatistical databases, and which will also be the release format for all future versions of our database. Along with the web interface, the CLDF specification is also the best reference for readers seeking to explore the possibilities of using the NorthEuraLex database in their own research.
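For readers who want to access the CLDF release programmatically, a minimal sketch using the pycldf library might look as follows; we assume the standard CLDF Wordlist layout of the repackaged Zenodo release, and the metadata filename may differ in the actual package.

```python
from pycldf import Dataset

# Load the repackaged CLDF release (the metadata filename is assumed here).
dataset = Dataset.from_metadata("cldf-metadata.json")

# CLDF wordlists store each word form together with its language and the
# parameter (concept) it expresses in the FormTable.
for form in dataset["FormTable"]:
    print(form["Language_ID"], form["Parameter_ID"], form["Form"])
```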
We cannot cite all of our primary sources in this article, but they are documented both in the web interface (in table format as well as next to the wordlists) and in the sources.tsv file of the CLDF release. The web interface includes a warning that our database does not represent a primary resource for any of the languages concerned, and that users who are interested in the data for a particular language, and intend to use lexical data in a context where some erroneous datapoints would lead to problems, should consult and cite these primary sources instead.