1 Introduction

Swiss society is characterised by highly complex linguistic practices in comparison with other European countries. In addition to four official languages (German, French, Italian, Romansh, in the order of the number of speakers) and a wide range of other languages spoken by foreigners living in Switzerland (25% of the population, according to the Swiss Federal Statistical OfficeFootnote 1), the local population speaks a great variety of local dialects. Unlike in other European countries, where dialect usage decreases in favour of standardised variants in most social domains, Swiss dialects are widely used in various domains, including education and public speech. This is especially true in the case of German dialects, which are the topic of this article.

Traditionally, the domains of use between standard German and Swiss German dialects have been divided according to the concept of medial diglossia (Kolde 1981; Siebenhaar and Wyler 1997), where standard German is used in written communication (and some institutionalised settings of oral communication) and the various dialects, fairly different from standard German, in spoken communication. Following this traditional division, Swiss German varieties are rarely studied outside of the narrowly focused research area of dialectology. With the development of computer-mediated communication, the traditional division between the two domains has become less clear, as Swiss varieties are increasingly written and recorded (Siebenhaar 2003). These developments call for and, at the same time, allow automatic processing of Swiss German for various purposes, including both research in digital humanities and developing practical applications.

In contrast to the increasing demand, basic tools for natural language processing of Swiss German texts are relatively undeveloped. Adapting existing tools developed for standard German has not proved successful. Explorative experiments (Hollenstein and Aepli 2014; Samardžić et al. 2015) have shown that even a small amount of data in Swiss varieties is more useful for training language processing tools than much larger data sets in standard German.

This paper presents an annotated corpus of spoken Swiss German, the ArchiMob corpus. We take advantage of natural language processing tools to provide additional annotation layers and show with some case studies that the corpus is suitable for research in language and humanities. We target specifically two issues: (a) the challenges of digitising a heterogeneous group of linguistic varieties that have no written tradition and (b) the opportunities that such a resource brings for the study of language and humanities in Switzerland.

Most of the existing Swiss German resources were developed in the context of dialectology and consist of isolated word types, such as the dialect lexicon Idiotikon (Staub et al. 1881) and the linguistic atlas of German-speaking Switzerland (Hotzenköcherle et al. 1962–1997; Christen et al. 2013); more recent digital resources in the same paradigm include Kolly and Leemann (2015). The Phonogrammarchiv of the University of ZurichFootnote 2 has a relatively rich collection of speech corpora, some of which range back more than 100 years. This archive is currently being processed in order to serve as a digital research resource, but this work is still in progress. The Bavarian Archive for Speech Signals provides speech corpora for “Regional varieties of German”,Footnote 3 but its Swiss parts seem to include regionally accented High German rather than Swiss German dialect.Footnote 4

In contrast, the resource presented here is one of the first multi-dialectal corpora available for Swiss German. As a corpus of continuous speech, it enables not only the analysis of formal linguistic features, but also of its content. Compared to two other recent corpora of Swiss German dialect—a corpus of SMS messages (Stark et al. 2009–2015) and a corpus of written texts (Hollenstein and Aepli 2014)—the ArchiMob corpus is larger, is aligned with the sound source and contains finer-grained metadata such as the dialect of the speaker. On the other hand, not all annotation layers of ArchiMob are verified manually. The ArchiMob corpus is also the only one that represents (transcribed) spoken language and features a particular content (historical narratives).Footnote 5

This paper starts with a presentation of the content of the ArchiMob corpus and its modalities of access (Sect. 2). In Sect. 3, we describe the encoding and annotation layers of the corpus in more detail. Section 4 summarises our experiments of automating the annotation tasks. Finally, Sect. 5 presents six case studies that rely on the ArchiMob corpus to investigate various aspects of dialectal variation and variation in the content, showcasing the interest of such a resource for digital humanities in general.

2 The ArchiMob corpus: from oral history to a digital research resource

The original Archimob project was initiated by a filmmaker, Frédéric Gonseth, in 1998 and was conducted by the Archimob association.Footnote 6 The goal of this collaboration between historians and filmmakers was to gather testimonies of personal experiences of life in Switzerland in the period from 1939 to 1945. The resulting archive contains 555 recordings of interviews covering topics such as political wrangling, daily life and even illicit love affairs during wartime. Out of these 555 recordings, 300 are in Swiss German. Each recording is produced with one informant using a semi-directive technique and usually is between 1h and 2h long. Informants come from all regions of Switzerland and represent both genders, different social backgrounds, and different political views. Most informants were born between 1910 and 1930.

The compilation of the present ArchiMob corpus started in 2004, when a collection of 52 VHS tapes was obtained from the Archimob association. The initial goal of the corpus compilation was to investigate dialectal phenomena such as the varying position of the indefinite article in adverbially complemented noun phrases (Richner-Steiner 2011) and comparative clauses in Swiss German (Friedli 2012).

Of these 52 recordings, nine were excluded either because of poor sound quality or because the interviewees were highly exposed to dialect and language contact, making their productions less interesting for dialectological research. The remaining 43 recordings were then digitised into the MP4 format.Footnote 7

Fig. 1
figure 1

Locations of the ArchiMob recordings included in the corpus, with different symbols for different transcribers (see Sect. 3.1). The grey area represents the German-speaking part of Switzerland. Canton boundaries are included as a proxy of major dialectal borders

The first release of the corpus contained 34 recordings transcribed with 15 540 tokens per recording on average (Samardžić et al. 2016). The second release, described in this article, contains all the 43 selected recordings, amounting to approximately 70 h of speech. Figure 1 shows the spatial distribution of the recordings included in the corpus, according to the origins of the speakers.

Fig. 2
figure 2

An example of a query result with Sketch Engine for the word gält ‘money’

The selected recordings were then transcribed and processed so that they can be searched for diverse phenomena of interest to the researchers. The processing steps, described in more detail below, include writing normalisation and part-of-speech tagging.

To meet different needs of the users, we make the corpus accessible in two ways: as online look-up via corpus query engines and as an XML archive download. The current point of access to the corpus is its web pageFootnote 8, but we consider integrating it in a larger infrastructure (such as CLARIN). The audio files are available on request.Footnote 9

2.1 Online access with corpus query engines

After considering suitable corpus query engines for online look-up, we decided to use two systems, each with some advantages and disadvantages: Sketch Engine (Kilgarriff et al. 2014) and ANNIS (Krause and Zeldes 2014).Footnote 10

An example of a search result with Sketch Engine is shown in Fig. 2. The system not only returns text passages with the exact match of the query word gält ‘money’, but also with dialectal variants such as gäld or gäut. Such a flexible search is made default in the simple search option. In order to relate different variants of the same word, we use normalised writing shown as grey subscript of the query word in Fig. 2. This normalised writing resembles standard German, but, as it is explained in more detail below (Sect. 3.2), it should not be considered an exact mapping between Swiss and standard German.

For a good functionality of our resource, it is crucial not to expect the user to know the exact normalisation of a word. The normalisation that we use is not a widely accepted standard in Switzerland and the user cannot be expected to know or to learn it.Footnote 11 To allow the users to search the corpus without knowing the normalisation, a new feature was implemented by the Sketch Engine team specifically for the purpose of our project. This feature allows the user to enter the query in any writing that seems plausible to her. If this writing occurred at least once in our corpus, we will be able to link and show instances of the queried item in all the other writings. Note that this approach to query is rather different from what used to be the practice in corpus query systems. Primarily conceived for working with text in standard languages, corpus query systems expect the user to know the exact writing for the query or to use regular expressions in order to approximate flexible search. Our solution enables searching resources with inconsistent writing in an intuitive and user-friendly way, making the resource accessible to a wider audience. This feature is thus potentially useful not only in the case of Swiss German but also for any non-standardised languages and varieties.

In addition to the new flexible search feature, Sketch Engine users can apply the standard functionality of the system to make more advanced queries using corpus query language, to manipulate resulting concordances, and to calculate different statistics (e.g., significant collocations).

Fig. 3
figure 3

An example of a query response with ANNIS for the normalised form milch ‘milk’, showing the dialectal variants miuch and müuch. Note that this example has been annotated automatically (see Sect. 4), illustrating some errors that might occur: in line 1, de should be tagged as ADV (adverb) instead of ART (article); in line 3, mìuch should be tagged as NN (noun) instead of PPER (personal pronoun). The normalised form blaue illustrates the difference between our normalisation and translation to standard German (correct translation would be geschlagen), described in more detail in Sect. 3.2

To meet the need of some potential ArchiMob corpus users for a more detailed visualisation of search results, we use the ANNIS corpus query system. An example of an ANNIS query result is shown in Fig. 3. We can see that the system shows all the information currently available in the corpus at the same time. Below each transcribed word, we can see its corpus ID, normalised writing and part-of-speech tag. However, the number of shown hits needs to remain small as such a detailed view quickly fills up the screen.

We did not implement the flexible search option in ANNIS because this system is intended to be used by more advanced users interested in linguistic details. This purpose is reflected not only in the detailed responses of the system, but also in the rather advanced querying skills that are needed in order to perform any searches.

An important difference between the two corpus query systems is that users outside the European Union need to have a paid account in order to access our corpus with the Sketch Engine,Footnote 12 whereas personal accounts on ANNIS are free.

2.2 XML download

In addition to online look-up, we provide an XML archive for download. The XML format of the documents in the archive follows the Text Encoding Initiative (TEI) recommendations whenever possible. We add specific elements only for the cases not explicitly covered by TEI (e.g. the attribute normalised). This format is the base for producing the formats required by the corpus query engines. An overview of the steps performed in order to obtain the final XML format is given in Fig. 4. These steps are described in more detail in the following sections.

Fig. 4
figure 4

Transcription and text processing work flow. Final (shared) data formats are marked with red background, intermediate text formats with beige, intermediate sound/video with grey, and annotation steps with green. (Color figure online)

The data are stored in three types of files:

  • Content files contain the text of the transcriptions.

  • Media files contain the alignment between transcribed text and the corresponding audio files.

  • Speaker file contains the socio-demographic information about the informants (region/dialect, age, gender, occupation) and the information about the speakers’ roles in the conversation (interviewer, interviewee).

The content files are segmented into utterances. The references to the speaker and the media file are specified as attributes of each utterance (element “u”), as shown in the following illustration:

  • <u start=”media_pointers#d1007-T176” xml:id=”d1007-u88” who=”person_db#EJos1007”>

Utterances consist of words (element “w”). Normalisation and part-of-speech tagging are encoded as attributes of the element “w”, as in:

  • <w id=”...” normalised=”einest” POS=”ADV” xml:id=”...”> ainisch</w>

In addition to usual annotated words, utterances can contain pauses (vocalised or not), repeated speech, and unclear (or untranscribable) passages. Pauses are not counted as words; they are annotated with a different label (<pause xml:id=”...”/>), as illustrated below. In repeated speech, the word in question is annotated as a word only once; the repeated fragments are annotated as deletion (<del> ... </del>). Unclear speech is annotated with a label that can span over multiple words.

  • <del type=”truncation” xml:id=”...”>hundertvierz/</del>

  • <del type=”truncation” xml:id=”...”>hundertvier/</del>

  • <w normalised=”hundertfünfundvierzig” tag=”NN” xml:id=”...”>hundertfiifevierzgi</w>

The media and the speaker files are simple XML documents that consist of lists of time and speaker IDs respectively associated with the corresponding information.

3 Corpus encoding and annotation

Transforming oral history recordings into a widely accessible research resource requires extensive processing. In this section, we describe the encoding and annotation steps that were undertaken in creating the resource described above, underlining the challenges specific to Swiss German.

3.1 Transcription and speech-to-text alignment

The 43 documents selected for inclusion in the corpus contain approximately 70 h of speech. They were transcribed in four phases by five transcribers, with an average of 30 person-hours invested in transcribing 1 h of recordings. The modes and the phases of transcription were not part of a single plan, but rather a result of different circumstances in which the work on the corpus took place. Table 1 sums up the time line of the annotation process.Footnote 13

Table 1 Overview of the four transcription phases

The transcribed text is divided into utterances that correspond to transcription units of an approximate average length of 4-8 seconds and aligned to sound at this level. The utterances are mostly fragments spanning over one or more sentence constituents. We do not mark sentence boundaries. As it is usual in spoken language corpora, utterances are grouped into turns. We do not mark the boundaries between turns explicitly. Instead, we annotate utterances with speaker IDs. A change in the speaker (and its role) signals a turn boundary.

The transcription units, aligned with the sound source, are manually formed by transcribers. Such alignment is part of the output of specialised tools like FOLKER and EXMARaLDA (Schmidt 2012, for both tools). Since no specialised tool was used in phase 1, the 16 documents produced in this phase needed to be aligned subsequently. We approach this task by first automatically aligning the transcriptions with the sound source at the level of words using the tool WebMAUS (Kisler et al. 2012). To obtain the utterance level alignment comparable to the output of the transcription tools, we join the WebMAUS alignment automatically into larger units and then import it into EXMARaLDA for manual correction. For around one third of the transcriptions, the automatic alignment did not work well enough to be used as a pre-processing step. In these cases, we first produced an approximation of the target segments automatically based on the pauses encoded in the transcription. We then imported the transcriptions into EXMARaLDA for manual correction.

There is no widespread convention for writing Swiss German. We use the writing system “Schwyzertütschi Dialäktschrift” proposed by Dieth (1986), as is standard in recent dialectological research. The transcription is expected to show the main phonetic properties of the variety but in a way that is legible for everybody who is familiar with standard German spelling (Dieth 1986, 10). The function of the grapheme inventory in the Dieth’s script depends on the dialect and its phonetic properties. For example, the grapheme \(\langle \hbox {e}\rangle \) stands for different vowel qualities, [e], [ε] or [ǝ], depending on the dialect, the accentuation of the syllable and—to a considerable degree—also to the dialectal background of the transcriber.

Dieth’s system, which is originally phonemic, can be implemented in different ways depending on how differentiated the phonetic qualities are to be expressed. The practice in using Dieth’s system changed over the transcription phases, so that more distinctions concerning the openness of vowels were made in the first phase than in the later phases [e.g., phase 1: èèr vs. phase 3: er (std. er, engl. ‘he’)].

A manual inspection of the transcriptions showed that these changes in the guidelines were not the only source of inconsistency. Different transcribers tended to make different decisions on how to implement the guidelines, not only regarding vowel quality, but also regarding word segmentation and other issues. These inconsistencies are one of the reasons why we introduce an additional annotation layer of normalised word tokens.

3.2 Normalisation

Variation in written Swiss German is observed at two levels. First, dialectal variation causes lexical units to be pronounced, and therefore also written, in a different way in different regions. Second, a lexical unit that can be considered phonetically invariant (within a region) is written in a different way on different occasions, due to occasional intra-speaker variation and, as mentioned above, to transcriber-related variation. In order to establish lexical identity of all writing variants that can be identified as “the same word”—needed to enable flexible search for instance—they need to be normalised to a single form.

Table 2 A segment of a transcribed (original) text with corresponding variants found in the same document and in other documents

Table 2 illustrates the range of potential variation with an arbitrarily chosen segment from our corpus. The table shows all the variants of the chosen words found in the same document, that is within a sample of the size of around 10 000 tokens transcribed by the same trained expert. In addition to these, more variants are found in other documents containing samples from other varieties. The shown variants include cases of regional variation (e.g., gsait, gsäit, gseit), variants due to changing transcription guidelines (e.g., hed, hèd) and variants caused by code-switching (e.g., in, main, main, mann, hat).

In the example of Table 2, all normalised forms correspond to standard German forms. Indeed, whenever a Swiss German word form corresponds to a standard German form in meaning and etymology, the standard German form is used for normalisation. However, it is important to note that we do not conceive normalisation as translation into standard German. A real translation to standard German would require substantial syntactic transformations and lexical replacement, whereas our normalisation is a word-by-word annotation of lexical identity whose exact forms can be seen as arbitrary. Some of our choices for the normalisation language require further explanation:

  • Swiss German word forms that do not have etymologically related standard German counterparts are normalised using a reconstructed common Swiss German form. For example, öpper ‘someone’ is normalised as etwer instead of the semantic standard German equivalent jemand, töff ‘motorbike’ to ff instead of standard German motorrad, gheie ‘to fall’ to geheien instead of fallen. Likewise, Swiss German vorig ‘remaining’ is normalised as vorig, even though this word means ‘previous’ in standard German.

  • Standard German conventions regarding word boundaries are often not applicable to Swiss German, where articles and pronouns tend to be cliticised. As a result, transcribers often produce single tokens that correspond to several standard German tokens (albeit with a lot of transcriber-related variation). In such cases, we allow several tokens on the normalised side. For example, hettemers is normalised as hätten wir es, and bimene is normalised as bei einem.

  • Sometimes, normalisation has the welcome side effect of disambiguating homophonous dialect forms. For example, de is normalised as der (definite article) or as dann (temporal adverb), depending on the context.

  • In other cases, a normalised form encompasses formally distinct dialect forms, due to morphosyntactic syncretism. For example, the first normalised word of Table 2, mein, will also be applied to dialect forms such as mis, miis, which are neuter forms of masculine min, miin.

An important feature of our approach is that we regard normalisation as a hidden annotation layer used only for automatic processing. As discussed above, the users are expected to formulate queries and the results are presented in a form of original writing (keeping the original inconsistency). This allows us to choose arbitrary representations, which users would find artificial and hard to adopt.

We describe our approach to normalisation in detailed guidelines, which we then apply to manual annotation of six documents taken from the transcription phase 1 (document IDs 1007, 1048, 1063, 1143, 1198, 1270) by three expert annotators. For this task, we used annotation tools that allowed annotators to quickly look up previous normalisations (if they exist) for the current word. We initially used VARD 2 (Baron and Rayson 2008), but we later switched to the better adapted SGT tool (Ruef and Ueberwasser 2013). These manually normalised documents were then used as a training set for automatic normalisation with character-level machine translation discussed in detail in Sect. 4.2.

3.3 Part-of-speech tagging

Annotation of part-of-speech tags is important for enabling more abstract queries in the corpus, regarding word classes and their combinations rather than concrete words. Part-of-speech tagging is a well studied task in NLP, and a rich offer of tools is available. These tools, however, are developed with written standardised languages in mind, while we need to apply them to a spoken non-standard variety.

We approach this task following Hollenstein and Aepli (2014), who adapted the widely used Stuttgart–Tübingen–Tagset (STTS) (Thielen et al. 1999) to a written version of Swiss German. The adaptation of the tag set addresses the following specific phenomena observed in Swiss German dialects:

  • A new label PTKINF is introduced for the infinitival particles go, cho, la, afa. These particles are used when the respective full verbs (to go, to come, to let, to begin) subcategorise an infinitival clause. As this phenomenon does not exist in standard German, the addition of a new label is warranted.

  • The label APPRART, used in standard German for preposition + definite article, is extended to preposition + indefinite article, as in bimene ‘at a’, which does not exist in standard German.

  • The labels VAFIN+ and VMFIN+ apply to verb forms with enclitics. The latter usually are pronouns, e.g., ts ‘has it’, hettemers ‘would have we it’. Conjunctions with enclitics are labelled as KOUS+, e.g., wemmer ‘when we’.

  • Whenever the zu-particle (phonologically reduced to z in Swiss German) is attached to the infinitive, the PTKZU+ tag is used: zflüge ‘to fly’.

  • Adverbs with enclitics (which can be articles or other adverbs) are given the ADV+ tag: sones ‘such a’.

We apply this adapted tag set in manual annotation of three test sets:

  • Test_0: 791 randomly selected segments from the same documents that are manually normalised (approximately 10%)

  • Test_1: 600 randomly selected segments transcribed in the phases 1 and 2

  • Test_2: 300 randomly selected segments transcribed in the phases 3 and 4

These test sets are used to assess the performance of different tagging models and strategies. By making separate tests, we intend to track potential effects of the variability in the corpus. Test_0 is the closest to the training data, which consist of the remaining 90% segments of the six documents included in manual normalisation, as described in more detail in Sect. 3.2. Test_1 and Test_2 are progressively more distant: Test_1 is partially transcribed by the same person as the documents used for training, while Test_2 contains the newest transcriptions.

4 Adaptation and evaluation of automatic processing tools

Our resource is intended to offer accurate and reliable information about the use of Swiss German. The size of the data set, however, should be big enough to allow quantitative analyses. We therefore aim to achieve the quality of manual encoding and annotation, but also to automate the processing steps to allow scaling up the data size. In Sect. 3, we described the encoding and annotation steps required to build the corpus. In this section, we describe the experiments with automatic systems adapted to perform these tasks.

4.1 Automatic transcription with automatic speech recognition

All the transcriptions included in the current version of the corpus are produced manually using the transcription tools listed in Table 1. However, these transcriptions now constitute an initial training set for future automatic processing with automatic speech recognition (ASR) systems.

To obtain a baseline for future improvements, we perform learning experiments with our current data set and a prototype speech recognition system developed in collaboration with a private company, an adaptation of the open-source Kaldi toolkit (Povey et al. 2011). We report here the results of this first evaluation.

The initial version of our ASR system is trained on the transcriptions representing the varieties of the larger Zurich area (see document IDs in Table 3). We choose this region as the largest relatively homogeneous subset of data containing around 48 h of speech. We start the training with a homogeneous sample in order to be able to assess the effects of adding heterogeneous samples to the training set at a later stage.

We test the system on six documents from different regions (see Table 3). For each document, we create 3-min samples starting: (a) in the middle of the document, (b) in the middle of the first half, and (c) in the middle of the second half. For each of the 18 samples, we manually count the following:

  • Gold transcriptionT: the number of word tokens in the manual transcription of the sample.

  • System outputO: the number of word tokens (different from ’unknown’) in the system output for the given sample.

  • Strict overlapS: the number of word tokens that are identical in the system output and the manual transcription.

  • Flexible overlapF: the number of word tokens in the system output that are not identical to the gold tokens, but still judged as correct by native speakers.

We then define the measures of precision and recall in terms of the collected counts. Precision is expressed as the proportion of correct word tokens, including the flexible overlap, in the system output (\(\frac{S+F}{O}\)). Recall is the proportion of words correctly recognised by the system in the gold transcription (\(\frac{S+F}{T}\)). F-measure is then calculated in the standard way. We take the average score over the three samples as a performance measure at document level. These average values are reported in Table 3.

Table 3 Initial ASR evaluation

Our evaluation shows that our ASR system is rather conservative: precision is systematically higher than recall. The variation in the performance over different documents (representing different regions) is considerable, but it is not explained solely by the known regional variation (discussed in more detail in the following section). In a subjective assessment by our transcribers, the current output of the system is judged not sufficient as a pre-processing for manual transcription.

We will therefore continue to work on improving the ASR by introducing more data and new learning methods. Improvements are possible for both the acoustic model and the language model. The acoustic model will benefit from a better representation of the phonetic features, while the language model will benefit from new training techniques including character-level modelling and neural networks. The baseline and the evaluation scheme described here will be crucial for monitoring improvements in the future.

4.2 Automatic normalisation with character-level statistical machine translation

The task of normalisation described in Sect. 3.2 is a recurring issue in dealing with different kinds of non-standard texts such as historical, spoken or computer-mediated communication (Dipper et al. 2013a, b; Bartz et al. 2013, e.g., for different non-standard varieties of German). Automatic word normalisation has been a popular topic in historical NLP over the last few years, resulting in a range of methods that are primarily useful for treating small edits in largely similar words (Baron and Rayson 2008; Bollmann 2012; Pettersson et al. 2013a).

More recently, character-level statistical machine translation (CSMT) has been successfully applied to normalisation of computer-mediated communication (De Clercq et al. 2013; Ljubešić et al. 2014) and historical texts (Pettersson et al. 2013b, 2014; Scherrer and Erjavec 2016). This method has originally been proposed for translation between closely related languages (Vilar et al. 2007; Tiedemann 2009). It requires less training data than word-level SMT but is limited to applications where regular changes occur at character level.

As for Swiss German dialects, word normalisation has already been manually performed by Stark et al. (2009–2015) using a collaborative annotation platform (Ruef and Ueberwasser 2013).

Our approach includes both manual and automatic annotation. We first normalise a small set of documents manually and train an automatic normalisation tool on these documents. After the initial evaluation, we try to improve the quality of the annotation in two ways. First, we explore improvements in the automatic methods. Second, we increase the training set by correcting manually some of the automatic output. With the iterative technical and manual improvement, we provide a good quality annotation that can be scaled up to larger data sets.

We choose CSMT for the automatic annotation because the string transformations that need to be performed in our case exceed the power of rule based or string-similarity methods. An alternative approach would be to use neural sequence-to-sequence methods (Cho et al. 2014; Sutskever et al. 2014). Neural methods are shown to outperform traditional statistical machine translation. However, experiments have not shown a clear advantage on the task of normalisation. While the recent shared task on normalisation of historical Dutch (Tjong Kim Sang et al. 2017) suggest that CSMT still performs better on this task, Honnet et al. (2017) have obtained better performance with neural methods. We intend to introduce neural methods in the future, examining these findings and exploring recent models especially suited for character-level string transformations in low input-data settings. These methods have been tested on morphological transformation tasks (Aharoni and Goldberg 2017; Makarov et al. 2017), but they can be extended to our normalisation task.

Table 4 Step-by-step improvements in automatic normalisation with CSMT

Table 4 shows the main steps in our adaptation of the standard CSMT for the task of dialect normalisation. In all the tests shown in the table, we use the system Moses (Koehn et al. 2007) with GIZA++ for word alignment.Footnote 14 We adapt the input for character-level calculations instead of the standard word level and we experiment with different system parameters and data sets.

We start by testing the default system settings using manually normalised documents (as discussed in Sect. 3.2). We first work with unchanged manual normalisation, which we term pre-release in Table 4. To account for the strong generalisation tendency of CSMT, we combine the output of CSMT system with simple memory-based learning. We take as the final output the most frequent normalisation for the test items seen in the training set and the CSMT output for the unseen items. This yields an accuracy score of 77.28% (Samardžić et al. 2015).

During manual inspection of the first results, it turned out that the initial normalisation guidelines were not explicit enough to guarantee consistent annotation by the three annotators. For example, the unambiguous Swiss German form dra was sometimes normalised as dran and sometimes as daran; both normalisations are correct Standard German words. Also, the Swiss German form gschaffet was sometimes normalised to the semantic Standard German equivalent gearbeitet and sometimes to its etymological equivalent geschafft. We thus revised both the manual normalisation and the corresponding guidelines. For the examples cited above, we gave preference to the longer form daran and to geschafft. Furthermore, descriptions of non-vocalised communicative phenomena that had accidentally ended up as tokens in three of the texts were excluded from the normalisation task.

We reran the experiments with the same settings as above, but with the improved data set (termed release in Table 4 because this version is included in the official corpus release), and obtained a rise in the accuracy from 77.28 to 84.13%. Detailed results are reported by Samardžić et al. (2016). The observed improvement in accuracy underlines the importance of clear and easy-to-follow guidelines, especially for smaller datasets like ours.

Fig. 5
figure 5

Dialectal origin of the texts used in the CSMT experiments. The six initial training texts are displayed with red circles, whereas the two additional training texts are displayed with blue stars. (Color figure online)

We further improved the normalisation tool by a) tuning the CSMT system to optimise the weights for the translation model and for the language model, b) increasing the translation unit from a word to an entire utterance (the condition termed segment in Table 4), c) augmenting the training set for the language model with a corpus of spoken standard German (condition LM2 in Table 4). These experiments, described in detail by Scherrer and Ljubešić (2016), lead to a considerable improvement in the performance cancelling the need for combining CSMT with memory-based learning. However, the best accuracy score of 90.46% is achieved only after introducing a constraint that selects the single observed normalisation for those test items that are seen in the training set with exactly one normalisation.

To assess whether adding more Swiss German examples is beneficial to CSMT, we correct manually the automatic output in two documents and then add the corrected documents to the training and tuning data (the bottom row in Table 4). Figure 5 shows the locations of the six initial and the two added documents. Note that the benefits of adding more data are not evident in the context of a highly varied data set such as Swiss German (Samardžić et al. 2016). Nevertheless, we obtain comparable performance as in the previous setting without using additional constraints.

4.3 Part-of-speech tagging with adapted taggers and active learning

Part-of-speech annotation is in principle portable across similar languages (Yarowsky et al. 2001). This fact, together with the fact that similar tagged corpora already exist, brought us to the decision to start part-of-speech tagging of the ArchiMob corpus by adapting the existing models and tools.

There are two potential similar sources that could be used for training an initial part-of-speech tagging model, both with some advantages and disadvantages.

  1. 1.

    TüBa-D/S (Hinrichs et al. 2000) is a corpus of spontaneous dialogues conducted in standard German (360 000 tokens in 38 000 utterances). This corpus is of the same genre as ArchiMob (spoken language), but it is a different language variety (standard German vs. Swiss German).

  2. 2.

    NOAH’s Corpus of Swiss German Dialects (Hollenstein and Aepli 2014) represents approximately the same variety (Swiss German), but it is small in size (73 000 tokens), from various sources of written language, and not normalised.

To assess which source provides better models for our ArchiMob data, we test them both on the Test_0 set (described in Sect. 3.3). We select test items from the documents that are manually normalised so that we can measure the performance on both original transcriptions and normalised words. Original transcriptions are closer to NOAH’s corpus, while normalised writing is closer to TüBa-D/S. Both corpora contain punctuation, whereas the ArchiMob corpus does not. We therefore removed all punctuation signs for the purpose of our experiments.

Table 5 Results of the part-of-speech tagging experiments in terms of accuracy and out-of-vocabulary words (OOV)

Table 5 shows the main outcome of these initial evaluation experiments (more detailed results are reported by Samardžić et al. (2016)). Both results are obtained using the BTagger (Gesmundo and Samardžić 2012), which has shown good performance on smaller training sets.

Table 6 The increase of the part-of-speech tagging performance through correction of entire documents

Due to the relatively large training set and surface form similarity, we expected to obtain the best initial score by training a tagger on TüBa-D/S and testing on the normalised version of ArchiMob. This expectation, however, proved wrong, as we obtained a considerably better score by training on NOAH’s corpus, despite the fact that this setting included much larger variation in writing (no normalisation is used) and that NOAH’s corpus is relatively small. Although the proportion of test words unseen in training is larger in NOAH’s setting (OOV in Table 5), the performance of the tagger is better.

We note that the additional tags introduced in NOAH’s corpus to account for morphosyntactic particularities of Swiss German dialects (see Sect. 3.3) help produce better results. Indeed, 2.45% of tokens in the gold standard are tagged with one of the additional tags; the NOAH’s tagger provides 68.05% accuracy on these tokens, whereas the TüBa-D/S tagger, having not seen the correct tags in the training data, gets them all wrong.

Following these findings, we set out to annotate more data in Swiss German by gradually adding manually corrected output of automatic tagging to the train set. Table 6 shows the steps in this process: we tag (still with BTagger) one document at the time, then correct it and add it to the train set in the next iteration. In every iteration, we note down the proportion of correctly tagged tokens in the new file before correction. We can see that the proportion of correct tags generally increases more in the first two than in the last two iterations.

With five ArchiMob documents added to the training set, we move on to improving the performance of the tagger. At this point, we replace BTagger with a conditional random fields (CRF) tagger, available as a Python library.Footnote 15 We decided to change the algorithm because the CRF tagger is a newer, better supported algorithm, more flexible, easier to use and embed in new tagging frameworks. A comparison of the performance of the two taggers showed that this change does not lead to a loss in the quality of the output.

We proceed with the improvements in two ways. First, we enrich the tagging model adding normalised forms, now available in the added ArchiMob documents, as features. Second, we increase the training data set by adding segments corrected through an active learning procedure.

The normalised forms used to enrich the model are annotated manually in the initial set of six documents described in Sect. 3.2 and in the extended set mentioned in Sect. 4.2. In other documents, we use the output of the automatic annotation (Sect. 4.2).

The active learning interactive annotator, developed for the purpose of this project by the TakeLab, University of Zagreb, runs the best current model on all currently unannotated utterances and identifies those items where the tagger is least confident. These items are presented to the human annotator for correction and then added to the training set for the next iteration. The procedure is repeated as long as it yields improvements on the test sets.

The active learning component is developed to meet the specific needs of our project: the user is presented with a pre-tagged low-confidence utterance and is asked to correct the tags that are wrong. Since these units are rather short, the interface displays a number of previous utterances, in order to provide enough context for the user to evaluate the tags with longer dependencies. The previous utterances are presented as simple text with no annotation.

The interface is run from the command line. It is configurable by means of an accompanying Python script, where the user can set: (a) the number of segments to annotate in one iteration, (b) the length of the previous context, (c) the span for the tagger’s hyper-parameter optimisation. These settings are made configurable because they depend on the size of the existing training set and on the time available for training the tagger and entering new annotations. As the annotation advances, the settings need to be adapted to ensure optimal use of the interface. The interface also allows separating the task of training the model from the annotation task. In this way, we can run the two components according to our own time schedule and goals.

The evaluation outcomes for the automatic part-of-speech tagging with the described improvements are shown in Table 7.Footnote 16 These experiments show that using the normalisation feature helps the tagger even in the cases of Test_1 and Test_2, where this annotation is noisy (automatic output without manual correction). As for the increase of the training data with active learning, we observe most benefits in the case of Test_2, where the performance is lower than on the other two sets. We can also see that the improvements are proportional to the number of corrected items (100 segments in the third row in Table 7 vs. 300 segments in the last row). As the performance approaches the threshold of 90% accuracy score, the impact of training data increase becomes limited.

Table 7 Accuracy scores (%) obtained in part-of-speech tagging experiments with CRF and ArchiMob data only

5 Studying linguistic variation using the ArchiMob corpus

The ArchiMob corpus is not only an interesting object of study for computational linguistics, it can also serve as a precious resource for dialectological and historical research, as has been intended from the beginning of the project. In this section, we present several case studies to illustrate the potential of the ArchiMob corpus. In Sect. 5.1, we investigate to what extent dialectal variation can be captured by looking at the transcriptions alone. Section 5.2 asks similar questions about linguistic variation, but tries to answer them by taking into account the normalisations. Section 5.3 illustrates how the annotations can be used to investigate the content of the texts.

For all case studies, we only use the utterances produced by the informants, not those produced by the interviewers. By doing so, we hope that the data material is as representative of the informant’s dialect as possible. Also, we remove diacritics from the phase 1 transcriptions in order to control for the most obvious effect of gradual changes to the guidelines.

5.1 Extracting dialectal variation patterns from speech transcriptions

The transcriptions of the ArchiMob corpus provide an interesting dataset for detecting dialectal variation patterns. Here, we discuss the tasks of identifying the dialectal origin of an utterance (dialect identification, Sect. 5.1.1) and of classifying the documents according to their linguistic similarity (dialect classification, Sect. 5.1.2).

5.1.1 Dialect identification

Language identification is an important task for natural language processing in general. While relatively simple methods perform well for languages that are sufficiently different, language identification for closely related languages is still a challenging task (e.g., Zampieri et al. 2014). Identifying the origin of dialect texts can be viewed as a particular case of this problem. In this spirit, data from the ArchiMob corpus were used to set up the German Dialect Identification task at the VarDial 2017 and 2018 workshops (Zampieri et al. 2017, 2018).

Four dialectal areas with a sufficient number of texts and which were known to be distinct enough were used in the identification tasks: Zurich (ZH), Basel (BS), Bern (BE), and Lucerne (LU). For each dialect area, utterances from at least three documents were selected as training data, and utterances from a different document were chosen for testing the systems.

Ten teams participated in the 2017 task, and eight teams in the 2018 task. Most participants obtained between 60% and 70% macro-averaged F1-scores. These figures are probably not far away from human performance: some utterances do not contain any dialect-specific cues and therefore cannot be reliably classified even by experts. It remains to be seen to what extent the addition of acoustic data can make up for lacking detail in the transcriptions. Further details about the task setup, the submitted systems and the obtained results can be found in the respective publications (Zampieri et al. 2017, 2018).

5.1.2 Dialect classification

Inspired by the dialect identification task, we wanted to extend this idea to the more general problem of automatic dialect classification, by (a) taking into account all documents of the corpus, and (b) not relying on predefined dialect areas. Concretely, we wanted to investigate to what extent dialect areas could be inferred directly from the data.

Uncovering dialect areas is one of the main goals of dialectometry (e.g., Goebl 1982, 1993). The traditional dialectometrical pipeline consists of the following steps (after Goebl 2010, 439):

  • The linguistic data, typically extracted from a dialectological atlas, is formatted into a data matrix of n enquiry points × m linguistic features. Each cell contains the local variant of a feature at an enquiry point.

  • A distance matrix of n points  × n points is derived from the data matrix, by pairwise comparison of the feature vectors of two enquiry points. The distance matrix typically is symmetric, with 0 values on the diagonal.

  • Then, a dimensionality reduction algorithm is applied to reduce each row of the distance matrix to a single value (or a small number of values v), leading to a value matrix of n points  × v values. A wide variety of algorithms have been proposed, and one of the simplest ones (also used in the following) is hierarchical clustering, which assigns each enquiry point a cluster ID, grouping the points with the most similar linguistic characteristics together.

  • The values of the value matrix are colour-coded and plotted onto a map. The hypothesis is that geographically close places will be clustered together, and where they are not, a dialectological explanation for this mismatch will have to be found.

The dialectometrical pipeline has mainly been applied to atlas data (Goebl 2005), where each column in the data matrix contains a feature known to vary across dialects. Introduction of corpus data into the study of regional variation allows collecting information about text frequency of the varying forms and constructions as they are spontaneously produced (Wolk and Szmrecsanyi 2016). The main disadvantages of this data source are uneven spatial coverage (naturally occurring texts tend to be more concentrated in particular regions) and sparseness of linguistic phenomena (the features of interest typically show only rarely in text). For instance, only 114 normalised word types are realised in all documents of the ArchiMob corpus. In addition to this, variation across recordings of free conversations, such as ArchiMob, may be due to personal preference and context, and not necessarily to dialectal variation. Studying linguistic distance using corpus data therefore requires departing from a typical dialect data matrix.

In a pilot study (Scherrer 2012), we explicitly tried to match words with similar transcriptions across dialects, using a preliminary version of the ArchiMob corpus. Here, we propose to use language modelling, a technique that has also been used in the dialect identification task (Gamallo et al. 2017b), to create a distance matrix directly. The last steps of the dialectometrical pipeline can then be applied as before.

In particular, we create a language model for each document of the ArchiMob corpus and show how well it fits all other documents of the collection. The assumption is that a language model will better fit a text of the same dialect than a text of a distant dialect. We estimate character 4-gram language models using the KenLM tool (Heafield 2011) with discount fallback, and we use document-level perplexity as a distance measure (Gamallo et al. 2017a). The resulting “distance” matrix is not symmetrical, as the perplexity of model A on text B is not guaranteed to be identical to the perplexity of model B on text A. Likewise, the diagonal does not necessarily contain 0 values, as the perplexity of model A on text A is not always equal to 0. For classification, we apply hierarchical clustering with Ward’s algorithm.Footnote 17

Fig. 6
figure 6

Dendrogram of the Ward clustering of language model perplexities. Each leaf is labelled with the document ID, the canton of its origin, and the transcriber initials. The background colors refer to the dialect regions inferred by Scherrer and Stoeckle (2016). (Color figure online)

Figure 6 shows the results of the clustering in the form of a dendrogram. It can be seen that the transcriber effect is quite strong: documents edited by the same transcriber tend to cluster together. The texts from the Zurich area (ZH) are partitioned into three distinct areas, according to the transcriber. Nevertheless, some dialectologically interesting groupings can be found: BL and BS refer to similar dialects, NW and UR as well, and VS is least connected to any other dialect area.

The clustering obtained with the ArchiMob documents can be compared with a similar experiment based on data from two Swiss German dialect atlases according to the traditional dialectometrical approach (Scherrer and Stoeckle 2016): each ArchiMob document in Fig. 6 is assigned a color, which corresponds to the cluster inferred at that geographical location by (Scherrer and Stoeckle 2016). The matching suggests that at least a subset of documents contains a dialectologically differentiated signal that can rival with much more cost-intensive atlas data.

As transcriber effects cannot be eliminated completely (partly also because of the transcription guideline changes), future work will focus on the statistical modelling of transcriber variation and dialectal variation as distinct effects (Wieling et al. 2011).

5.2 Normalisation as basis for dialectological comparison

In the previous section, we referred to the difficulty of comparing features in the different dialect texts: not all speakers use the same linguistic structures, but it is difficult to tease apart proper dialectological effects from subject-induced and personal preferences. However, the normalisation layer can help here: all linguistic elements (words, graphemes or characters) that are normalised the same way can be compared with each other. In this section, we explore two applications of this idea, one related to particular (phonological) phenomena, and one related to aggregate dialect measurements.

5.2.1 Investigating phonological variation in dialect texts

Investigating phonological properties in transcribed speech is challenging, and even more so if the transcription guidelines are known to have changed over time and transcriber differences are known to be prominent. Despite these challenges, we show in the following that known phonological variation patterns can be efficiently searched and compared across documents thanks to the normalisation layer.

Taking several methodological shortcuts, we define a phonological variable as a grapheme on the normalisation layer, and its possible dialectal realisations (variants) as the set of graphemes on the transcription layer it is aligned with. For example, the phonological variable represented by the normalisation grapheme ck has two levels, the dialectal variants k and gg, whose frequency distribution varies according to the origin of the texts.Footnote 18 For this definition to work, we a) need to align characters between transcription tokens and normalisation tokens, and b) group adjacent characters into multi-character graphemes when required.

A popular character alignment technique is based on Levenshtein distance, where the edit operations that contribute to the Levenshtein distance calculation are converted to alignment links. Several extensions have been proposed for dialect data, e.g., by prohibiting alignments of vowels with consonants (Wieling et al. 2009).

Character-level statistical machine translation (CSMT), which we already used for normalisation, is an alternative to these approaches. It does not presuppose any notion of word (or character) identity and thus works equally well with different character inventories or writing systems. The most widely used alignment models were proposed in the early days of statistical machine translation (Brown et al. 1993) and have been used for character alignment in our CSMT setting. Since character alignments are an integral part of the CSMT translation models, we can extract from our normalisation models any alignments of interest. We thus align characters and extract grapheme correspondences using exactly the same process as for creating the CSMT normalisation models, except that we create a distinct model for each document.

Graphemes do not always consist of single characters. Character sequences that frequently co-occur and that are frequently aligned in the same way should be grouped together as a multi-character grapheme. This process has also been studied in the field of statistical machine translation under the name of phrase extraction (Och et al. 1999), and can again be straightforwardly converted from the word level to the character level. The phrase table file created during CSMT model training lists all grapheme pairs together with their (co-)occurrence counts, allowing us to easily compute relative frequencies of the transcription graphemes.

Let us return to the example given above and examine how the normalised grapheme ck is realised in two arbitrarily chosen ArchiMob documents:

  • Document 1:    k 37.0%,    gg 63.0%

  • Document 2:    k 95.2%,    gg 2.4%,    ch 2.4%

Fig. 7
figure 7

Probabilities of ck dialectally realised as gg. The green areas represent the distribution of the gg variant in SDS map 2/095 drücken ‘to push’. (Color figure online)

This analysis can be extended to all texts of the ArchiMob corpus, and the frequency distributions of each variant can be plotted on a map. Figure 7 shows such a plot for the gg variant. The frequencies extracted from the ArchiMob texts can be compared with atlas data from the Sprachatlas der deutschen Schweiz (SDS; Hotzenköcherle et al. 1962–1997). The area of use of the gg variant according to the atlas (map 2/095) is reproduced as a green background on Fig. 7. Seven ArchiMob documents show relative frequencies higher than 0.5 for the gg variant. All these documents are located in the three regions where the atlas data also shows the gg variant: Basel (Northwest), St. Gallen (Northeast), and Glarus (Southeast).

Another interesting phenomenon is the vocalisation of intervocalic ll in Western Swiss German dialects. Figure 8 shows the relative frequencies of the vocalic variant u in the ArchiMob texts, and the occurrence of the same variant according to atlas data. Again, one can see that all ArchiMob speakers who vocalise are located in (or near) the areas where the atlas predicts vocalisation. However, the frequency values are spread widely. In the case of the three texts of the Bern area, this variation reflects—at least to some extent—the sociolinguistic status of l-vocalisation as a lower-class phenomenon (Siebenhaar 2000): the lower-class speaker, a gardener, uses vocalisation more often (69%) than the two upper middle-class speakers, a draughtsman (33%) and a doctor (49%). In Central Switzerland, two documents exhibit vocalisation, and both stem from the edges of the vocalisation area as defined by the SDS map. There are also some ArchiMob documents from within the vocalisation areas that do not show any evidence of this phenomenon. Whether this mismatch should be attributed to language change or to annotation effects remains to be analysed.

Fig. 8
figure 8

Probabilities of ll dialectally realised as u. The green areas represent the distribution of the u variant in SDS map 2/198 ‘Teller’. (Color figure online)

The two examples given above show that the geographic extension of some phonological variants extracted from the ArchiMob corpus coincide remarkably well with those of the Swiss German dialect atlas SDS. As the ArchiMob interviewees are about one generation younger than the informants of the SDS, the proposed technique can also be used to trace dialect change. For example, Christen (2001) has found l-vocalisation to extend eastwards to the city of Lucerne and Nidwald (the Southern shore of Lake Lucerne), but the ArchiMob documents from that area do not show any vocalisation. This suggests that this linguistic change may have set in more recently.

However, the method presented here is limited by the precision of the transcription: obviously, only variation patterns that are reflected in the transcription can be retrieved. For example, studies on the realisation of /r/ cannot be carried out (at least not without analysing the corresponding audio data) as the different variants are not distinguished in the transcription. Likewise, studies on vowel quality will not be reliable as not all documents of the ArchiMob corpus are transcribed in the same way. Still, we believe that the proposed approach can shed a new light on dialect variation and change in German-speaking Switzerland, complementing detailed phonetic information that can be found in other resources.

5.2.2 Dialectality measurements

The normalisation layer can also be used to perform aggregate analyses of the ArchiMob data. The case study presented here is inspired by a measure known as dialectality, a score that expresses the distance between a dialect text and the standard variety (Herrgen and Schmidt 1989; Herrgen et al. 2001). This method has seen a lot of success in Germany, where small-scale dialects are in the process of being replaced by larger-scale regiolects. Dialectality has proved to be a relevant measure of the degree of advancement of this process.

Fig. 9
figure 9

Dialectality values. (Color figure online)

The dialectality measure requires character-aligned phonetically transcribed data in the dialect and the standard language. The phonemes are compared pairwise, and for each phoneme pair a distance value is computed, based on the number of phonetic features that need to be changed. These distance values are then averaged across words and utterances to provide a single dialectality value for each text.

We simplify this idea drastically for our purposes. First, we assume the normalisation layer to be our standard language, which is not quite accurate. Second, we do not attempt to convert the transcriptions and normalisations into true phonetic transcriptions, as they are generally underspecified. Instead, we use plain Levenshtein distance to compute the distance value per word. Figure 9 plots the dialectality values of all ArchiMob texts.

The results show that the dialectality values do not differ much between documents. Also, they seem neither strongly correlated with geography nor with the transcriber. With the exception of the northernmost documents located very close to the German border, it shows that Northern dialects are not subject to higher assimilation pressure from standard German than southern Swiss dialects are. This is expected, as the standard variety is perceived in Switzerland as a vertical counterpart in the diglossia, not a “horizontal” counterpart on the geographical (in our case) North-South axis (Siebenhaar and Wyler 1997).

The lowest dialectality values tend to be found in the Zurich area, which suggests that the Zurich dialect acts as a sort of default dialect with a low number of characteristic traits. This effect has been found in several studies based on different datasets (Scherrer and Rambow 2010; Hollenstein and Aepli 2015; Scherrer and Stoeckle 2016).

5.3 Content analysis

In the previous sections, we have focused on the form of the linguistic data in the ArchiMob texts, leading naturally to interpretations in the field of dialectology. Another type of analysis, with potentially much larger impact in the field of digital humanities, refers to the content of the texts. Qualitative analyses could be carried out using methodologies from conversation analysis and interactional sociolinguistics. Quantitative analyses can also be envisaged, and two illustrations of the potential of the dataset for such quantitative analyses are given in the following.

5.3.1 Collocation analysis

Collocation analysis (or more generally co-occurrence analysis) is a simple and popular method in digital humanities for measuring associations between entities and concepts in texts. For each keyword, the most frequently co-occurring concepts are found, using weighting techniques from information retrieval. Collocation analysis can be performed on a whole corpus, or separately for different partitions of the corpus, in order to find changes in usage patterns.

Fig. 10
figure 10

Collocates of the word chind in the ArchiMob corpus. (Color figure online)

The SketchEngine tool provides a built-in collocation analysis method, which can be used interactively by account holders. As an example, Fig. 10 shows a collocation analysis for the word chind ‘child/children’,Footnote 19 performed over all documents of the ArchiMob corpus. The collocates are visualised in a circle around the keyword, with different colours representing different parts-of-speech. By looking at numerals (purple circles) alone, the ArchiMob corpus can provide us with an interesting insight into the traditional family sizes in Switzerland in the first half of the 20th century. Words relating to other concepts and their associations can be analysed in a similar way.

5.3.2 Lexical change

Schifferle (2017) convincingly demonstrates the usefulness of a corpus such as ArchiMob for lexicological studies. Changes in the usage of words, in turn, hint at societal changes and can therefore be potentially interesting for a wide range of disciplines such as sociology, psychology or political science.

In his research, Schifferle investigates the usage of the relationship terms koleeg ‘colleague, friend’ and fründ ‘friend’ in the 16 ArchiMob texts of phase 1. He finds that koleeg is used nearly exclusively by male speakers, whereas fründ is used almost exclusively by female speakers. This is partly an effect of the traditional gender role distribution where men were more likely to have work and military colleagues than women, but it also hints at some kind of taboo for male speakers regarding the use of fründ (or the feminine term fründin, also included in the study) for non-romantic relationships.

A second result concerns the contexts of use of koleeg. While recent dictionaries specify that Swiss German usage is not restricted to ‘work/club/military colleague’ but can refer to a close personal friendship, there is no historical evidence of this usage in the relevant lexicons. Although this usage is only marginally attested in the ArchiMob sample examined by Schifferle, exactly these occurrences may provide a precious insight into the emergence of this semantic shift.

6 Summary of contributions

In this paper, we introduce a general-purpose research resource consisting of manual transcriptions of oral history recordings in Swiss German, the ArchiMob corpus. We present its construction and functionality and show examples of its use for a range of research topics in digital humanities.

We have retraced the history of the construction of this corpus. The discontinuous work on this corpus, under different responsibilities, using different (sometimes not well adapted) tools, resulted in various types of inconsistencies, which we tried to minimize through various correction and harmonisation rounds. We have also presented additional annotation layers like normalisation and part-of-speech tagging, for which we were able to obtain competitive automatic annotation results in a difficult setting, emphasizing the usefulness of language technology for digital humanities.

A recurrent characteristics of this corpus—as alluded to in several applications described in Sect. 5—is transcriber inconsistency. Consistency of transcriptions is indeed a central point in dealing with dialect corpora in particular and with dialectological data in general (Mathussek 2016). We take this issue into account carefully. The large amount of variation in transcription is a central reason for providing a normalisation level of annotation; any research question that does not explicitly rely on the phonological realisation of the words and utterances can be addressed on the basis of the normalisation level alone. This layer allows more systematic studies of variance in writing as different versions of the same word can be identified and analysed.

Furthermore, the XML documents are annotated with the transcriber identification, which allows the researcher to create subcorpora that are minimally affected by this issue. While considerable effort has been spent on harmonizing the transcriptions since the first release of the corpus (Samardžić et al. 2016),Footnote 20 we sketch the potential application of automatic speech recognition (Sect. 4.1) to create virtual transcriptions that can then be compared with the existing manual ones. Future work will show if this approach leads to significant improvements. While the transcriber effects do influence some findings on the regional variability (Sect. 5), we address these effects directly by comparing our classifications with those performed using atlas data.

To summarize, we argue that a resource such as the ArchiMob corpus is an important reference for new quantitative approaches to the study of language use and variation, not only in dialectology, but also in social sciences and history. Our solutions to the challenges of encoding and annotating a polycentric spoken language constitute a collection of know-how that can facilitate further developments of similar resources. Finally, the tools for automatic processing that we adapted and evaluated are now available for use and further improvements in future development of similar resources, but also in more general processing of Swiss German recordings and texts.