A new corpus of geolocated ASR transcripts from Germany

Coats, Steven

doi:10.1007/s10579-023-09686-9

Project Notes
Open access
Published: 21 October 2023

Language Resources and Evaluation

Steven Coats ORCID: orcid.org/0000-0002-7295-3893¹

Abstract

This report describes the Corpus of German Speech (CoGS), a 56-million-word corpus of automatic speech recognition transcripts from YouTube channels of local government entities in Germany. Transcripts have been annotated with latitude and longitude coordinates, making the resource potentially useful for geospatial analyses of lexical, morpho-syntactic, and pragmatic variation; this is exemplified with an exploratory geospatial analysis of grammatical variation in the encoding of past temporal reference. Additional corpus metadata include video identifiers and timestamps on individual word tokens, making it possible to search for specific discourse content or utterance sequences in the corpus and download the underlying video and audio from the web, using open-source tools. The discourse content of the transcripts in CoGS touches upon a wide range of topics, making the resource potentially interesting as a data source for research in digital humanities and social science. The report also briefly discusses the permissibility of reuse of data sourced from German municipalities for corpus-building purposes in the context of EU, German, and American law, which clearly authorize such a use case.

1 Introduction

The quality of automatic speech recognition (ASR) transcripts has improved markedly in the last decade due to advances in neural network-based techniques and the application of large transformer models (Baevski et al., 2020; Xiong et al., 2017; Zhang et al., 2020), and the widespread use of ASR transcriptions in publicly available videos has opened new horizons for linguistic and social science research. From a linguistic perspective, vast amounts of ASR transcript data can now be harnessed to investigate lexical, morpho-syntactic, sociolinguistic, or pragmatic variation, and from the perspective of the digital humanities and social sciences, transcript data may be useful for the study of a wide range of cultural, political, geographical, and economic phenomena.

This report introduces a large corpus of German-language ASR transcript data collected from YouTube channels of local governments: the Corpus of German Speech (CoGS). The corpus, similar in terms of data collection methodology and structure to the Corpus of North American Spoken English, the Corpus of British Isles Spoken English, and the Corpus of Australian and New Zealand Spoken English (Coats, 2022a, b, 2023),^{Footnote 1} contains geo-tagged transcripts in which individual word tokens have been annotated with part-of-speech tags and word timing information, allowing the speech content to be immediately examined in the context of the original videos. The size of the corpus, its recency, and its indexing to accessible video and audio data may make it suitable for research into contemporary language use in Germany. The rest of the report is organized as follows: in Sect. 2, some related resources are briefly described, and Sect. 3 outlines the procedures used for the compilation of the corpus and describes its content. Section 4 provides an illustrative, small-scale example of the kind of exploratory analysis that can be undertaken with the corpus data, and Sect. 5 discusses the permissibility of re-sourcing YouTube data for research and caveats pertaining to the quality of the corpus data. Section 6 summarizes the report, providing an outlook on possibilities for future work with the resource.

2 Background and related work

There is a rich tradition of corpus compilation and analysis for German. The German Linguistic Atlas Research Center (Forschungszentrum Deutscher Sprachatlas) in Marburg hosts regionalsprache.de, a research platform for the study of linguistic variation of German (Schmidt et al. 2020). The service provides access to digitized versions of materials collected in the context of German-language linguistic atlases, ranging from Georg Wenker’s Sprachatlas des Deutschen Reichs to relatively recent regional linguistic atlases. More than 20,000 audio files are hosted on the site, including many recordings of informants reading Wenker’s original test sentences, designed to elicit phonological and syntactic features of dialectal speech; longer reading and conversational passages are also hosted on the site. Metadata include transcriptions and location information.

The Leibniz-Institut für Deutsche Sprache in Mannheim (IDS) maintains the Archiv für gesprochenes Deutsch (Archive for Spoken German),^{Footnote 2} the most comprehensive collection of corpora of transcribed spoken German. The collection includes a large number of specialized corpora corresponding to specific regional varieties, social groups (e.g., post-WW2 refugees from East Prussia, German-Turkish migrants, and others), or specific German-speaking communities found outside the core German-speaking area (for example in Namibia, Brazil, or Wisconsin, USA). Notable among the collections is the Zwirner Corpus (Stift & Schmidt, 2014), a collection of 5,796 recordings made from 1955 to 1972 in Western Germany, Liechtenstein, Vorarlberg in Austria, and Alsace-Lorraine in France. The recordings, which have a total length of more than 1,077 h, include speech from formal and informal contexts, and comprise a variety of genres, including spontaneous stories of informants, texts read out loud, conversations, and questionnaire items designed for dialectological purposes. As of early 2023, the digitized component of the Zwirner Corpus contains 2,922 recordings and 2,944 transcripts, comprising in total 4,863,521 word tokens. The metadata include detailed information about informants’ age, occupation, and birth, school, work, and living locations as well as information about the origin of parents and spouses.

Additional corpus resources hosted by the Archiv für gesprochenes Deutsch include the Deutsch heute (‘German today’) Corpus, a large resource comprising transcribed recordings made from 2006 to 2009 of mainly young people from Germany, Austria, Switzerland, South Tirol, Belgium, Luxembourg, and Liechtenstein (Kleiner, 2015), and the FOLK corpus (Forschungs- und Lehrkorpus Gesprochenes Deutsch, ‘Research and Teaching Corpus of Spoken German’, Schmidt, 2014b), which contains transcripts and audio or video of 400 naturalistic speech events from a total of 1,294 speakers, recorded in a variety of contexts in the years 2003–2021. Transcripts have been manually prepared according to the conventions of Conversation Analysis, both in dialect and standard orthography, and metadata include information about speakers, interaction context, and location. The corpus comprises in total 336 h of conversation, corresponding to 3,203,882 word tokens.

A number of corpora from the Archiv für gesprochenes Deutsch, including the Zwirner Corpus, the Deutsch heute Corpus, and FOLK, have been digitized and are available online through the Datenbank für Gesprochenes Deutsch (‘Database for Spoken German’, DGD, see Schmidt, 2014a, 2017; Kehrein & Vorberger, 2018 for overviews). Data from these resources have been utilized for a wide variety of studies, ranging from dialect phonology (e.g., Wagener, 2002) to syntax (e.g., Dubenion-Smith, 2010) and conversation analysis (e.g., Kaiser, 2016). FOLK metadata would permit geographical analysis of speech phenomena, but as far as is known, the resource has not yet been utilized to investigate regional variation in speech, perhaps because the initial impetus for corpus compilation was oriented towards interactional analyses.

Several corpora hosted by the Archive for Spoken German, including FOLK, have also been made accessible in the context of the Zugänge zu multimodalen Korpora gesprochener Sprache project (ZuMult, ‘Access to Multimodal Spoken Language Corpora’), an initiative of the IDS together with institutes at the universities of Hamburg and Leipzig. The project makes corpora of German speech, including transcripts, audio, and video, available to researchers working in fields such as sociolinguistics, dialectology, phonetics and phonology, speech and language technology, or cultural studies, among others.^{Footnote 3} As noted by the project’s website, developments in technology have made it possible to work with larger amounts of spoken language data on a broader scale, opening up new possibilities for the analysis of linguistic and societal phenomena.

A few language resources based on German speech have been developed for the purposes of improving automatic translation systems or for other NLP tasks (e.g., Beilharz et al., 2020; Jeuris & Niehues, 2022), but these do not contain naturalistic speech and have not been designed to study regional variation in language from a linguistic perspective.

The potential use cases of CoGS are similar to those suggested for the resources at DGD/ZuMult: the data can be used to investigate linguistic, cultural, and societal phenomena. Audio and video data, while not contained within the downloadable CoGS corpus, can be identified by using regular expression searches in the corpus; data can then be retrieved using open-source tools such as yt-dlp and ffmpeg.^{Footnote 4} Because CoGS is significantly larger than existing resources, it may have better coverage for rare syntactic constructions or interactional exchanges in speech. CoGS also has more extensive geographical coverage within Germany and more precise geographical metadata, compared to existing resources. Naturally, the corpus also has important shortcomings (see Sect. 5 below), and will not be suitable for every linguistic research question. In the next section, the procedures used to create the corpus and its content are briefly described.

3 Corpus compilation and content

Two principal considerations motivated the choice to focus on municipal YouTube channels for data collection for CoGS. First, YouTube channels of local governments tend to feature locally-produced content. For a researcher interested in regional language variation, locally-produced content uploaded by municipalities may better capture regional language features than does content from national or commercial YouTube channels (for a discussion, see Coats, 2023). Secondly, in contrast to commercial content, videos produced by local governments in Germany, and transcripts thereof, are considered to be public domain data which can be used for purposes such as teaching, scholarship, or research (see Sect. 5 below).

Data was harvested using custom scripts and open-source software tools, following the basic procedures outlined in Coats (2023). Homepages of local government authorities were scraped for links to YouTube and searches were additionally sent to the YouTube search page using an iterative method.

From information available from the German Federal Statistical Office and from Wikipedia, a list of 10,371 German municipalities with websites was created.^{Footnote 5} In a first step, the homepages of these websites were scraped for links to YouTube channels; results were then checked to ensure they were the YouTube channels of the municipalities in question, resulting in 481 channels. In a second step, for the remaining municipalities in the list for which no channel had been found, a script was devised to iteratively search YouTube for channels with the name of each municipality appended to the strings Gemeinde and Stadt; for each of these, the first three channel results were retrieved. This second step returned more than 21,000 channels, many of which were channels of churches or religious congregations in those locations. The results were filtered semi-automatically, and, in cases of ambiguity,^{Footnote 6} manually checked by inspecting the “About” page of the channel for links to official local government websites or text such as offizieller Kanal and by checking the titles and/or content of the videos in the channel. In cases where multiple channels existed for a single municipality but there was no additional disambiguation information available, the channel with the larger number of videos was selected. In some cases, if the channel had a common place name and it was impossible to determine its location, it was excluded. For municipalities that share a centralized administration, the channel of the central authority (Verbandsgemeinde etc.) was only used if it corresponded to the name of the municipality.

Most of the channels sampled represent official channels of municipality governments. However, in a few cases, if no official government channel was found for a municipality using these methods, but a channel with a clear local orientation was found, such as the channel of a local business organization or civic institution, these were also included. In total, 1,595 channels were identified in this manner. All available ASR transcripts were downloaded from these channels using yt-dlp, then filtered to remove non-German transcripts. In total, CoGS contains 39,495 transcripts from 1,313 channels in 1,308 locations, corresponding to 50,514,575 word tokens and over 7,223 h of video. The transcripts are from videos ranging in duration from just over 5 s to almost 7 h, but for the most part are from videos that are of short duration (mean value 10.98 min; standard deviation 24.24 min, median value 3.40 min).

Individual records in the corpus contain the following metadata fields: federal state and municipality provenance (Bundesland and Gemeinde), title and URL of the channel from which the transcript was downloaded, title of the corresponding video, 11-character YouTube ID code for the corresponding video,^{Footnote 7} date of upload of the corresponding video, length of the corresponding video (in seconds), length of the transcript (in number of word tokens), location of the channel’s municipality as street address, and location in latitude-longitude coordinates. In order to satisfy the “transformation” provision for fair use according to US copyright law (see Sect. 5, below), the CoGS transcript data is not in the original format(s) provided by YouTube, and has been additionally transformed: every 200 words, ten words have been replaced with the filler token @. The downloadable file containing the entire corpus is a comma-separated (CSV) table, suitable for ingestion and processing in Python, R, or other languages. Conversion to other formats (for example, xml and derived formats) can be undertaken with functions in widely used programming libraries or with custom scripts.^{Footnote 8}
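The filler-token transformation described above can be illustrated with a minimal sketch. The text states only that ten of every 200 words are replaced with @; which ten words are affected is not specified, so the choice below (the last ten tokens of each complete 200-token window) is an assumption made for illustration, and the function name mask_tokens is hypothetical.

```python
FILLER = "@"


def mask_tokens(tokens, block=200, masked=10, filler=FILLER):
    """Replace `masked` tokens in every complete window of `block` tokens.

    Assumption: the last `masked` tokens of each window are replaced.
    The report does not state which positions are affected.
    """
    out = list(tokens)
    # Walk over the end index of each complete window: 200, 400, ...
    for end in range(block, len(out) + 1, block):
        for i in range(end - masked, end):
            out[i] = filler
    return out


# Toy transcript of 450 placeholder words: two complete windows.
words = [f"w{i}" for i in range(450)]
masked_words = mask_tokens(words)
print(masked_words.count(FILLER), "of", len(masked_words), "tokens masked")
```

Because the replacement rate is uniform, roughly 5% of tokens are masked regardless of their identity, which is the property the report relies on for aggregate analyses.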

CoGS is available in two versions: one in which word tokens in the transcripts have no annotation, and one in which each word token has been annotated with timing information and a part-of-speech tag. Part-of-speech tagging was undertaken using the de_core_news_sm model from spaCy (Montani et al., 2022), applying the Stuttgart-Tübingen tagset (Brants et al., 2004). Because the syntax of spoken language can differ from that of written texts, some tags in CoGS may be inaccurate, and the accuracy of the tags for this data has not been evaluated against a manually annotated sample. One possible approach to deal with this issue would be to utilize the expanded STTS tagset of Westpfahl and Schmidt (2016), who trained a tagger on speech transcripts from FOLK. A Python implementation of the STTS2.0 tagger has been created during the preparation of CoGS,^{Footnote 9} and exploratory tests indicate it accurately tags CoGS transcripts.

The transcripts for CoGS do not contain speaker diarization – there is no indication of individual speaker turns or changes in speaker. Although speaker information can be manually or automatically annotated, the lack of diarization has implications for the kinds of studies that can be undertaken with the data (see Sect. 5 below).

As an example, the transcript for the video XCREK7CnyYA, entitled ‘Interview Staatsanzeiger mit Oberbürgermeister Jürgen Kessing am 22.11.2021’ (‘Interview Staatsanzeiger with Mayor Jürgen Kessing on 22.11.2021’),^{Footnote 10} from the channel of the municipality of Bietigheim-Bissingen, begins as follows:

ich begrüße heute oberbürgermeister jürgen kessing zum interview für den staatsanzeiger sehr geehrter herr kessing gewinnt seit 2004 oberbürgermeister der stadt bietigheim bissingen was gefällt ihnen denn am besten an ihrem amt ja das amt als oberbürgermeister das fällt einem natürlich besonders aus….

Several characteristics of YouTube ASR transcripts are apparent in the excerpt. First, German-language transcripts do not exhibit standard orthography or punctuation: all words have a lower-case first character, and transcripts contain no commas, question marks, or other punctuation. Second, there is no diarization: the interviewer’s question ‘… was gefällt ihnen denn am besten an ihrem amt’ (‘what do you like most about your job?’) is followed by the mayor’s response ‘ja das amt als oberbürgermeister…’ (‘well, the job as mayor…’), without any indication of a change in speaker turn. Third, the excerpt contains a few ASR errors: the word gewinnt occurs where there is speaker overlap (in the video, the interviewer’s ‘Sie sind’ and the mayor’s ‘hallo’ are articulated simultaneously), and the transcript reads ‘fällt einem natürlich besonders aus’ (approx. ‘of course especially falls away from someone’) instead of ‘füllt einem natürlich besonders aus’ (approx. ‘is of course especially fulfilling’).

CoGS transcripts are from all 16 German federal states. As can be seen in Fig. 1, channel density is highest in the southwestern federal state of Baden-Württemberg and in the southern part of Hesse, and lower in the northern part of the country and in the former German Democratic Republic, in the eastern part of the country.^{Footnote 11} In part, this reflects higher population densities in Germany’s southwest, but it may also reflect larger municipal budgets in southwestern Germany.

The size of CoGS is summarized by federal state in Table 1.

Table 1 Size of CoGS by geographical location

The content of CoGS transcripts is varied, ranging from announcements by mayors or municipal officials to news reports, recordings of meetings or official events, or content for children. CoGS content differs somewhat from that of the related resources CoNASE, CoBISE, and CoANZSE, which contain more transcripts of videos of municipal meetings. German municipalities are less likely to upload recordings of local government meetings to YouTube than are municipalities in Australia, New Zealand, the UK, or the US, possibly reflecting differences in attitudes as well as underlying legislation.

The overall content of the transcripts in CoGS was estimated by qualitatively inspecting a random sample of videos. Table 2 shows the number of videos and video titles, separated by the “|” character, for eight content categories, determined qualitatively on the basis of the video contents, for a random sample of 100 videos from CoGS. The category “Official announcement / Municipal event” is the largest category in the sample – it includes video recordings of municipal ceremonies and events as well as announcements of upcoming municipal activities. The category “News / Information” comprises 20 videos in the sample, mostly reports of local events. Fifteen of the videos, in the category “Mayor”, are from municipal leadership: communications by mayors or information about mayoral elections. Local tourism and promotion of local municipal services and businesses comprise 13 videos. Health announcements, most of which concern the Coronavirus, make up 8 videos, and seven videos in the sample are interviews or are focused primarily on interview content. Five videos are content for or about children, and two are of local church services or activities.

Table 2 Content category by title of a sample of 100 videos

The provisional categories proposed for the 100-video sample may differ in terms of communicative parameters: announcements from mayors and news reports tend to contain more monologic, prepared speech, while the categories “interviews” and “municipal events” often include interactive discourse. For the analyst interested in naturalistic conversational exchanges, one possible method for identifying suitable speech configurations in the corpus would be to select transcripts based on words contained in video titles. A search for Sitzung (‘session’, i.e., council session) as an independent word or compounding element, for example, returns 556 transcripts from 97 municipalities with a total length of 4,797,690 tokens, including Livestream der Sitzung des Technischen Ausschusses (‘Livestream of the session of the technical committee’), Öffentliche Gemeinderatssitzung in Esslingen am Neckar vom 22. November 2021 (‘Public municipal council session in Esslingen am Neckar from 22 November 2021’), Sitzung des Stuttgarter Gemeinderats (‘Session of the Stuttgart municipal council’), Sitzung Stadtrat Halberstadt − 25.11.2021 17_00 Uhr (‘Session city council Halberstadt – 25.11.2021 17_00 hours’), and others. Likewise, a search for transcripts with titles including terms such as Stadtrat or Gemeinderat (‘city council’ or ‘municipal council’) mostly returns council-related content, including meetings, and terms such as Gespräch (‘talk/conversation’), Interview, or Unterhaltung (‘conversation/discussion’) return a large number of transcripts that contain interactive speech. Along the same lines, title searches using regular expressions can be used to instantly generate sub-corpora pertaining to particular topics (e.g., Klima|Umwelt|Natur ‘climate|environment|nature’) or pragmatic content (a case-insensitive search for gr[uü]ß ‘greet’ in video titles returns almost 1,000 transcripts of official video greetings, for example).
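As a minimal sketch of the title-based selection described above, the following applies the three regular expressions mentioned in the text to a handful of records. The records, the video IDs, and the field names "title" and "video_id" are invented for illustration and are not taken from the corpus file; the helper name subcorpus is likewise hypothetical.

```python
import re

# Hypothetical records; field names and IDs are assumptions, not CoGS data.
records = [
    {"video_id": "aaaaaaaaaaa",
     "title": "Öffentliche Gemeinderatssitzung vom 22. November 2021"},
    {"video_id": "bbbbbbbbbbb", "title": "Grußwort des Oberbürgermeisters"},
    {"video_id": "ccccccccccc", "title": "Klimaschutz in unserer Stadt"},
    {"video_id": "ddddddddddd", "title": "Neujahrsgrüße aus dem Rathaus"},
]


def subcorpus(rows, pattern):
    """Return the rows whose title matches `pattern`, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [row for row in rows if rx.search(row["title"])]


# "Sitzung" as an independent word or as a compounding element.
sessions = subcorpus(records, r"sitzung")
# Official greetings: matches both "Gruß..." and "...grüße".
greetings = subcorpus(records, r"gr[uü]ß")
# Topic-based selection with alternation.
topics = subcorpus(records, r"Klima|Umwelt|Natur")

for name, rows in [("sessions", sessions), ("greetings", greetings),
                   ("topics", topics)]:
    print(name, [row["video_id"] for row in rows])
```

Substring matching is used deliberately here so that compounds such as Gemeinderatssitzung are captured; a word-boundary pattern would miss them.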

The underlying audio and video data from which the transcripts for CoGS have been created are not part of the corpus. The data is accessible from YouTube through their website, their APIs, or by using open-source tools such as youtube-dl and yt-dlp. However, there is no guarantee that the videos underlying the transcripts in CoGS will remain available in perpetuity, and it is likely that some videos may be made private or removed by the original uploaders, or removed or otherwise made inaccessible by YouTube.

4 Example exploratory analysis: verb forms for past reference

CoGS data could be suitable for the investigation of regional variation in syntax. A well-known feature subject to regional variation within German-speaking Europe concerns the expression of past temporality: speakers in more northern regions are more likely to use the simple past (Präteritum), whereas more southerly speakers are more likely to use a (historically newer) periphrastic construction consisting of an auxiliary verb with a past participle (Abraham & Conradie, 2001; Fischer, 2018).^{Footnote 12}

Figure 2 shows a map in which the color of individual transcript locations indicates the value of the local spatial correlation statistic Getis-Ord G_i^*, calculated on the basis of the relative frequency of simple past to past participle forms for 500 frequent German verbs.^{Footnote 13} Here, blue map markers indicate a positive local spatial autocorrelation value (i.e., a tendency for the ratio of simple past to past participle forms to be greater than the average value for all of the CoGS transcripts), whereas red markers indicate a negative spatial correlation value (transcripts centered on those locations have a lower than average simple past-past participle ratio), with hue value indicating the magnitude of the statistic.^{Footnote 14} The data from CoGS show a north-south distinction in the use of simple past and perfect forms which corresponds to characteristics of hochdeutsche and oberdeutsche varieties, but also may provide evidence for a contemporary tendency in northeastern Germany, in Saxony, Brandenburg, Berlin, and Saxony-Anhalt, towards more relative use of perfect than simple past. As an exploratory and preliminary result, this trend may be of interest, but it would need to be examined more closely by considering specific verbs and transcript genres in the locations where high or low values have been measured, and it would need to be verified against other data such as FOLK. In addition, the weighting of the spatial autocorrelation model would need to more carefully account for differences in channel sample size.
Nevertheless, the finding suggests that the zone of use of perfect forms for the encoding of past temporality in spoken German may be continuing to advance towards the north, resulting in ongoing Präteritumschwund (‘disappearance of preterite’).^{Footnote 15} While this phenomenon may warrant closer examination, CoGS data could potentially also be used to investigate and document regional variation in grammar and syntax by taking a large number of features into account. Here, multidimensional analyses rooted in dialectometry (Goebl, 1982; Lameli, 2013; Pickl & Pröll, 2019), which has a long and productive history in German-speaking dialectological research, may represent a suitable use case for CoGS data.
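For readers unfamiliar with the statistic, the following sketch computes Getis-Ord G_i^* in its standard z-score form for a toy set of four locations. The values and the binary neighbour weights are invented for illustration; the weighting scheme used for Fig. 2 is described in the footnotes and is not reproduced here, so this should be read as a generic illustration of the statistic rather than the analysis behind the map.

```python
import math


def getis_ord_gi_star(values, weights):
    """Getis-Ord G_i^* (z-score form) for each location i.

    G_i^* = (sum_j w_ij x_j - mean * sum_j w_ij)
            / (S * sqrt((n * sum_j w_ij^2 - (sum_j w_ij)^2) / (n - 1)))
    with mean = sum_j x_j / n and S = sqrt(sum_j x_j^2 / n - mean^2).
    The "star" variant includes location i in its own neighbourhood.
    Note: a row that weights all locations equally gives a zero denominator.
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for row in weights:
        sum_w = sum(row)
        sum_w2 = sum(w * w for w in row)
        numerator = sum(w * x for w, x in zip(row, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Toy data: four locations ordered north to south, with an invented
# ratio of simple past to past participle forms at each location.
ratios = [0.9, 0.8, 0.2, 0.1]
# Assumed binary weights: each location neighbours itself and the
# adjacent locations on the line.
w = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
scores = getis_ord_gi_star(ratios, w)
for ratio, score in zip(ratios, scores):
    print(f"ratio={ratio:.1f}  Gi*={score:+.3f}")
```

In this toy example the northern locations receive positive values and the southern ones negative values, which mirrors the sign convention used for the blue and red markers in the map.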

5 Discussion and caveats

CoGS contains a large number of transcripts from a broad range of speech types and locations, and while its size suggests that it may record significant linguistic variation, as exemplified in the preliminary analysis above, three caveats should be noted:

1. The right to utilize and share data obtained via collection scripts in the context of EU legal directives and American and German law.
2. ASR errors and methods for dealing with error-containing transcripts.
3. Lack of diarization/speaker information and demographic metadata.

In Europe, Directive 2003/98/EC of the European Parliament established a legal basis for reuse of public sector information such as content produced by local governments.^{Footnote 16} The directive authorizes “the use by persons or legal entities of documents held by public sector bodies, for commercial or non-commercial purposes other than the initial purpose within the public task for which the documents were produced”; documents are defined as “any content whatever its medium (written on paper or stored in electronic form or as a sound, visual or audiovisual recording)” or “any part of such content” (Directive 2003/98/EC 2003).^{Footnote 17} Germany canonized the principles of Directive 2003/98/EC with the Informationsweiterverwendungsgesetz (“Information reuse law”, IWG) of 2006 and updated the law with the Datennutzungsgesetz (“Data use law”) of 2021 (DNG 2021), which stipulates that public sector information such as content produced by municipalities must be open to reuse by default (“konzeptionell und standardmäßig offen”). From the perspective of EU and German law, the reuse of material created and uploaded by German local governments is clearly authorized. Furthermore, use of data such as that collected in CoGS is also in accord with acceptable use provisions of the EU’s 2018 General Data Protection Regulation (GDPR). The GDPR is mostly concerned with data protection rights from the perspective of sensitive personal individual data of the sort which is not included in CoGS, such as political party or trade union membership, marital status, medical information, ethnicity, or sexual behavior, among other categories. CoGS transcripts contain no demographic metadata in the form of information about individual speakers, and personal information of this sort is unlikely to be included in content uploaded to channels of German municipalities, which are also bound by GDPR provisions. 
Still, Article 85 of the GDPR explicitly acknowledges the right to process such data as justified by the “right to freedom of expression and information, including processing for journalistic purposes and the purposes of academic, artistic or literary expression”.^{Footnote 18} The legitimacy of repurposing data for research purposes is further noted in GDPR recitals pertaining to archiving, scientific or historical research, and statistical analysis (GDPR Recitals 156, 159, 160, 162).^{Footnote 19} CoGS availability is subject to end users’ acceptance of license conditions which specifically acknowledge GDPR principles. The resource recognizes individuals’ rights to rectification, to erasure, to be forgotten, and to restriction of processing; records will be pseudo-anonymized or removed at the request of individuals identified in the transcripts.

As far as copyright is concerned, CoGS and similar resources fall under the “fair use” provisions of copyright law in the United States, where YouTube is based, which stipulates that the use “of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright”.^{Footnote 20} Fair use provisions have served as the basis for the collection of large, well-known corpora such as those at english-corpora.org or public archive repositories such as the Internet Archive (archive.org). These large data collections, and others like them, have been supported by funding bodies such as the American National Science Foundation, and they include copyrighted material from commercial entities. Large German-language corpora, such as the German Reference Corpus (DeReKo; Kupietz et al. 2018), have also begun to incorporate data from YouTube and other commercial services in their text collections.

Furthermore, as noted in Sect. 3, a fixed proportion of tokens in the publicly-available version of the corpus (ten of every 200 words) has been replaced with filler tokens. The step has no effect on aggregate analyses, as it affects all words and constructions in the corpus at the same rate, and serves to further satisfy fair use provisions of American copyright law, specifically the “transformation” condition. It makes it impossible to re-create the original transcripts from the CoGS data (although an end user seeking to obtain a particular transcript can simply download it from YouTube).^{Footnote 21} CoGS and related resources, as corpora to be used for linguistic research purposes with no commercial relevance, thus clearly fall within the scope of EU and US laws pertaining to utilization of data created by public sector entities and fair use of copyrighted materials.

German copyright law was reformed in 2017 to specifically permit data collection of copyrighted materials for research purposes using scraping or other techniques. The Gesetz zur Angleichung des Urheberrechts an die aktuellen Erfordernisse der Wissensgesellschaft (“law on the adjustment of copyright to the needs of knowledge-based society”; UrhWissG, 2017) § 60d explicitly authorizes the automatic and systematic collection of copyrighted material in order to build a corpus, especially by means of normalizing, structuring, and categorizing the collected material (“Um eine Vielzahl von Werken (Ursprungsmaterial) für die wissenschaftliche Forschung automatisiert auszuwerten, ist es zulässig, das Ursprungsmaterial auch automatisiert und systematisch zu vervielfältigen, um daraus insbesondere durch Normalisierung, Strukturierung und Kategorisierung ein auszuwertendes Korpus zu erstellen”). Corpora created in this manner can then be shared for non-commercial purposes of scientific research. The copyright reform has facilitated the creation of language resources in Germany, for example of DeReKo, or of components of the Digital Dictionary of the German Language (DWDS) hosted by the Berlin-Brandenburg Academy of Sciences.^{Footnote 22}

In addition to legal considerations, there are several caveats pertaining to the nature of the CoGS data and metadata. ASR transcripts, including those in CoGS, contain errors. The word error rate (WER) of ASR is affected by many factors: audio recording quality, speech fluency or lack thereof, use of out-of-vocabulary words such as technical terms, slang, or dialect words, or individual speaker phonetic and prosodic characteristics such as regional accent, speech rate, or pitch (Aksënova et al., 2021). Two procedures were used to test the accuracy of the CoGS transcripts. First, the ASR and manually uploaded transcripts for the 2,408 videos in CoGS for which both types of transcript were available were compared, using the manually uploaded transcripts as the reference. This procedure found a median WER for the ASR transcripts of 0.17. This error rate may be sufficient to allow accurate inferences for frequent features in the corpora, but for out-of-vocabulary or rare items, ASR errors in the corpus may result in corpus searches for these items returning false negatives (zero or few hits of the targeted item, even though it has been used in the speech of the corresponding video). Secondly, because manually uploaded transcripts may also be generated by ASR algorithms, a small sample of transcript excerpts was manually inspected and corrected, then compared with the uncorrected ASR transcripts. This small sample, which comprised the first two minutes of 10 randomly-selected transcripts, exhibited a remarkably low mean WER of 0.07 (minimum 0.024, maximum 0.172), suggesting that for recordings with high audio quality, the transcripts in the corpus are accurate.^{Footnote 23}
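The WER comparison described above can be illustrated with a minimal sketch. It assumes the standard definition of WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript; the whitespace tokenization and lowercasing are simplifying assumptions, not necessarily the normalization used for CoGS, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: whitespace tokenization and case folding; the
    # normalization actually applied to CoGS is not specified here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


manual = "der Gemeinderat hat gestern getagt"
asr = "der Gemeinde Rat hat gestern getagt"
print(word_error_rate(manual, asr))  # one substitution + one insertion over 5 words
```

Per-transcript values computed in this way could then be summarized with a median or mean, as in the two procedures reported above.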

As the corpora in CoGS consist of orthographic transcripts of standard German, they are not suitable out-of-the-box for analysis of phonetic dialectal features. However, because the video and audio recordings corresponding to the transcripts are available on YouTube, they can serve as the starting point for a pipeline that (for example) identifies a particular sequence of word tokens and extracts the audio signal at the appropriate time for acoustic or prosodic analysis using common tools. This kind of approach may be useful for the characterization of regional phonetic/phonological variation in Germany.
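As a rough sketch of the audio-extraction step in such a pipeline, the following Python code builds an ffmpeg command that cuts a short clip around a token sequence. It assumes that the audio for the video has already been downloaded to a local file (for example with yt-dlp) and that start and end times in seconds are known from the token timestamps; the function names, file paths, padding value, and output settings (mono, 16 kHz) are illustrative assumptions rather than part of the CoGS tooling.

```python
import subprocess
from typing import List


def ffmpeg_clip_command(audio_path: str, start: float, end: float,
                        out_path: str, padding: float = 0.25) -> List[str]:
    """Build an ffmpeg call that extracts [start, end] plus some padding."""
    if end <= start:
        raise ValueError("end must be greater than start")
    clip_start = max(0.0, start - padding)
    duration = (end + padding) - clip_start
    return [
        "ffmpeg", "-y",              # overwrite the output file if it exists
        "-ss", f"{clip_start:.3f}",  # seek to the clip start in the input
        "-i", audio_path,
        "-t", f"{duration:.3f}",     # length of the clip in seconds
        "-vn",                       # drop any video stream
        "-ac", "1", "-ar", "16000",  # mono, 16 kHz (an arbitrary choice)
        out_path,
    ]


def extract_clip(audio_path: str, start: float, end: float, out_path: str) -> None:
    """Run ffmpeg; requires the ffmpeg binary to be installed."""
    subprocess.run(ffmpeg_clip_command(audio_path, start, end, out_path), check=True)


# Hypothetical file names, for illustration only.
print(ffmpeg_clip_command("3_H1-a29MJI.m4a", 10.0, 11.5, "clip.wav"))
```

The resulting clip could then be opened in a phonetic analysis tool for acoustic or prosodic measurement.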

To mitigate the possible effects of ASR errors, one possible approach, demonstrated by Coats (2022c) and suitable for combined quantitative-qualitative analyses, involves manual annotation of corpus hits to remove false positives, which can be undertaken relatively quickly due to the fact that all tokens have timing information: a script can instantly produce a custom URL linking to the exact time of utterance for any feature the analyst may wish to search for in the corpus. For a targeted word or phrase, the analyst can follow the links and listen to hundreds of clips per hour, filtering out the ones that contain an ASR error.
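The link-generation step can be sketched as follows. The sketch assumes that a transcript is available as a sequence of (word, start time in seconds) pairs, which is a hypothetical layout and not necessarily the format of the CoGS files, and it relies on the `t` parameter of YouTube watch URLs to jump to a given second; the function names and the two-second lead-in are illustrative choices.

```python
from typing import Iterator, Sequence, Tuple

Token = Tuple[str, float]  # (word, start time in seconds); hypothetical layout


def find_phrase(tokens: Sequence[Token], phrase: str) -> Iterator[float]:
    """Yield the start time of every occurrence of `phrase` in `tokens`."""
    target = phrase.lower().split()
    if not target:
        return
    words = [word.lower() for word, _ in tokens]
    n = len(target)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            yield tokens[i][1]


def timestamped_url(video_id: str, start_seconds: float, lead_in: float = 2.0) -> str:
    """Link to the video a few seconds before the utterance begins."""
    start = max(0, int(start_seconds - lead_in))
    return f"https://www.youtube.com/watch?v={video_id}&t={start}s"


# Invented example tokens; the video ID is the one used in the notes.
tokens = [("ich", 10.0), ("bin", 10.4), ("gegangen", 10.9),
          ("ich", 62.5), ("bin", 62.8), ("gegangen", 63.0)]
for seconds in find_phrase(tokens, "bin gegangen"):
    print(timestamped_url("3_H1-a29MJI", seconds))
```

Starting playback slightly before the hit gives the analyst some context for judging whether the transcribed form was actually uttered.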

A final caveat concerns speaker metadata: because YouTube ASR transcripts do not include speaker diarization information, a transcript may represent one, a few, or many speakers. While this may not present a problem for aggregate analyses of (for example) the regional distribution of particular language features, the lack of speaker metadata means that demographic traits commonly recorded in dialectological fieldwork or sociolinguistic interview situations, such as speaker gender, age, or residence history, are not available to the analyst, although it may be possible to annotate some demographic traits from the video context. The videos for which transcripts are collected in the corpus likely record the speech of local people, but without residence histories this cannot be confirmed for individual speakers. Analyses of regional language based on data from CoGS must take this fact into account.

Despite these drawbacks, the large size of the corpora may, compared to existing resources, allow more fine-grained analyses of speech patterns in terms of discourse context, and may offer a framework in which regional variation can be detected.

6 Summary and outlook

CoGS is a large corpus of ASR transcripts of videos published to YouTube by local governments in Germany, created by using scripts to access publicly available information from governments and from YouTube. The corpus is freely downloadable for use for research purposes, and like the resources contained in the ZuMult project, its data may be relevant for a range of research approaches, including conversation analysis, sociolinguistics/dialectology, phonetics/phonology, and lexicography, as well as in other scientific disciplines (e.g., language technology and natural language processing, qualitative social research, or political studies). CoGS data can be used to explore regional variation in grammar and can complement well-documented existing corpora such as the Zwirner Corpus, the Deutsch heute Corpus, FOLK, and other resources hosted at the DGD. Because CoGS is indexed to video and audio data, it can serve as a starting point for phonetic and multimodal analyses. From the perspective of pragmatics or conversation analysis, the corpus could be used to quickly find multiple naturalistic instances of particular types of interaction and exemplify instances of commonly analyzed categories such as self-repair or turn maintenance (Schegloff et al., 1977; Fox et al., 1996).

The data for CoGS has been collected from the public sector (municipalities in Germany) via YouTube, and for future releases of CoGS, data collection from German-speaking municipalities in Austria, Switzerland, Belgium, Italy, Liechtenstein, and Luxembourg is planned. EU and German judicial frameworks in recent years have emphasized open accessibility of data and encouraged reuse of documents and data originating in the public sector. Copyright law in the US and Germany also explicitly licenses the use of this kind of data for non-profit research purposes. These developments suggest that the methods used to create CoGS and related sources may represent a viable model for the collection of language data in the future.

Despite the fact that ASR transcripts contain errors and CoGS does not include speaker or demographic metadata, its accessibility, large size, and links to audio and video make it a versatile resource, suitable for a variety of investigations into the status and development of the German language and of German society from a variety of perspectives.

Notes

https://cc.oulu.fi/~scoats/CoGS.html, https://cc.oulu.fi/~scoats/CoNASE.html, https://cc.oulu.fi/~scoats/CoBISE.html.
https://agd.ids-mannheim.de/index.shtml.
https://zumult.org/.
https://github.com/yt-dlp/yt-dlp, https://ffmpeg.org.
https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/_inhalt.html.
Manual disambiguation was necessary, for example, for channels of municipalities with names such as “Berg”, “Burg”, “Neuenkirchen”, “Wald”, “Kirchdorf”, “Neustadt”, and others, which exist in many locations in Germany.
The URL for a given video is formed by appending its video ID to https://www.youtube.com/watch?v=. For the video ID “3_H1-a29MJI”, for example, the URL is https://www.youtube.com/watch?v=3_H1-a29MJI.
For example, in Python by using the to_xml function from the pandas package.
https://huggingface.co/stcoats/de_STTS2_folk.
“Staatsanzeiger” is the name of a weekly publication in the German state of Baden-Württemberg.
Currently, CoGS does not include transcripts from Austria, Switzerland, or German-speaking municipalities in other countries. An update of the corpus with additional data from Germany, as well as transcripts from outside Germany, is in the planning stages as of early 2023.
This is a simplification for illustrative purposes, as the choice of preterite or periphrastic perfect construction can also encode an aspectual or modal distinction and is subject to some syntactic constraints (see Abraham & Conradie, 2001). There is no reason to believe, however, that in randomly sampled speech transcript data, the marking of these features should exhibit a geographical pattern: people in Hamburg or Dortmund are no more or less likely to refer to completed actions than are people in Nuremberg or Stuttgart.
From https://verben.org/top-500-verben-deutsch/.
The map shows the value of Getis-Ord’s \(G_i^*\) statistic for the ratio \(\frac{\sum \text{simple past forms}}{\sum \text{simple past forms} + \sum \text{past participle forms}}\) calculated on the basis of the 200 nearest neighbors for 1,190 locations; in total N = 1,362,593 verb forms. The average distance to nearest neighbors was 96.68 km (std. dev. 41.96 km). As can be inferred from the locations of sampled channels, the largest average distance to the set of neighbors was for locations on the island of Rügen in Mecklenburg-Vorpommern (~ 280 km). The lowest average distance to the set of neighbors was for municipalities in Swabia such as Vaihingen and Mühlacker (~ 47 km).
The tendency was already noted for some verbs in Berlin and Brandenburg in Wenker’s Sprachatlas des Deutschen Reichs (Fischer, 2018: 19). The CoGS map may also capture some regional trends noted in the dialect grammars summarized by Fischer (2018): for example, continued use of some simple past forms in some Lower Bavarian locations (p. 41) or the transition area of the Rhine-Franconian dialect region (p. 45).
https://eur-lex.europa.eu/legal-content/en/ALL/?uri=CELEX%3A32003L0098. The directive was updated in 2019 https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019L1024&from=en.
In 2019 the directive was recast as the much more detailed Directive (EU) 2019/1024 (2019), which expounds the same principles.
https://gdpr-info.eu/art-85-gdpr/.
https://gdpr-info.eu/recitals/no-156/, https://gdpr-info.eu/recitals/no-159/, https://gdpr-info.eu/recitals/no-160/, https://gdpr-info.eu/recitals/no-162/.
U.S. Code Title 17, § 107. See also https://fairuse.stanford.edu/overview/fair-use.
A similar procedure has been used by Mark Davies, the creator of the english-corpora.org resources (see https://www.corpusdata.org/limitations.asp).
https://www.dwds.de/.
Nine of the 10 randomly selected transcripts were primarily monologic in nature, and one was an interview. Incorrectly transcribed items included proper nouns such as place names and personal names (e.g., Von-Müller-Gymnasium Regensburg, Bietigheim-Bissingen), relatively infrequent lexical items (e.g., Hindernisparcours ‘obstacle course/parcours’, akribisch ‘meticulous/meticulously’), and words that are relatively new in the German lexicon (e.g., belarussisch ‘Belarusian’, which replaced the former official term Weißrussisch in 2020; Koronaimpfung ‘corona vaccination’). Some lexical items were transcribed with an incorrect inflectional morpheme (e.g., ermordete ‘murdered’ as ermordet, keinen ‘no/none’ as kein, totalen ‘total’ as total ‘totally’, liegen gebliebene Dinge ‘things left undone’ as liegen gebliebener Dinge). In some cases words which were not enunciated clearly were incorrectly transcribed (e.g., abartig ‘disgusting’ instead of achtzig ‘eighty’). The manually corrected transcript excerpts were from the videos WELk_e9ybsg, HFwE87l_P6M, Acl-zDevOs0, r2P9fIWmoDU, GEBnhsnBW2A, UtA2Vr0qjM8, HJ3_TL1POuE, AAWlIPd0sVY, mOy_Lm6uqtY, and 4lpqFoBDsEE.

References

Abraham, W., & Conradie, C. J. (2001). Präteritumschwund und Diskursgrammatik. John Benjamins.
Aksënova, A., van Esch, D., Flynn, J., & Golik, P. (2021). How might we create better benchmarks for speech recognition? In Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, Association for Computational Linguistics, pp. 22–34. https://doi.org/10.18653/v1/2021.bppf-1.4.
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). Wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv:2006.11477v3 [cs.CL]. https://doi.org/10.48550/arXiv.2006.11477.
Beilharz, B., Sun, X., Karimova, S., & Riezler, S. (2020). LibriVoxDeEn: A corpus for German-to-English speech translation and German speech recognition. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 3590–3594. https://aclanthology.org/2020.lrec-1.441/.
Brants, S., Dipper, S., Eisenberg, P., Hansen, S., König, E., Lezius, W., Rohrer, C., Smith, G., & Uszkoreit, H. (2004). TIGER: Linguistic interpretation of a German corpus. Research on Language and Computation, 2, 597–620. https://doi.org/10.1007/s11168-004-7431-3.
Coats, S. (2022a). The Corpus of Australian and New Zealand Spoken English: A new resource of naturalistic speech transcripts. In P. Parameswaran, J. Biggs, & D. Powers (Eds.), Proceedings of the 20th Annual Workshop of the Australasian Language Technology Association (pp. 1–5). Australasian Language Technology Association.
Coats, S. (2022b). The Corpus of British Isles Spoken English (CoBISE): A new resource of contemporary British and Irish speech. In Proceedings of DHNB ’22: Digital Humanities in the Nordic and Baltic Countries Conference, March 15–18, 2022, Uppsala, Sweden. CEUR-WS.
Coats, S. (2022c). Naturalistic double modals in North America. American Speech. https://doi.org/10.1215/00031283-9766889.
Coats, S. (2023). Dialect corpora from YouTube. In B. Busse, N. Dumrukcic, & I. Kleiber (Eds.), Language and linguistics in a complex world (pp. 79–102). De Gruyter.
Directive 2003/98/EC. (2003). Directive 2003/98/EC of the European Parliament and of the Council of 17 November 2003 on the re-use of public sector information. Official Journal of the European Union, L 345(46), 90–96. https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32003L0098&from=EN.
Directive (EU) 2019/1024. (2019). Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information (recast). Official Journal of the European Union, L 172(62), 56–83. https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32019L1024&from=en.
DNG. (2021). Gesetz zur Änderung des E-Government-Gesetzes und zur Einführung des Gesetzes für die Nutzung von Daten des öffentlichen Sektors (Datennutzungsgesetz). Bundesgesetzblatt, Jahrgang 2021, Teil I Nr. 46: 2941–2946. http://www.bgbl.de/xaver/bgbl/start.xav?startbk=Bundesanzeiger_BGBl&jumpTo=bgbl121s2941.pdf.
Dubenion-Smith, S. A. (2010). Verbal complex phenomena in West Central German: Empirical domain and multi-causal account. Journal of Germanic Linguistics, 22(2), 99–191.
Fischer, H. (2018). Präteritumschwund im Deutschen: Dokumentation und Erklärung eines Verdrängungsprozesses. De Gruyter.
Fox, B. A., Hayashi, M., & Jasperson, R. (1996). Resources and repair: A cross-linguistic study of syntax and repair. In E. Ochs, E. A. Schegloff, & S. A. Thompson (Eds.), Interaction and Grammar (pp. 185–237). Cambridge University Press.
Goebl, H. (1982). Dialektometrie: Prinzipien und Methoden des Einsatzes der Numerischen Taxonomie im Bereich der Dialektgeographie. Verlag der Österreichischen Akademie der Wissenschaft.
IWG (2006). Gesetz über die Weiterverwendung von Informationen öffentlicher Stellen (Informationsweiterverwendungsgesetz-IWG). Bundesgesetzblatt, Jahrgang 2006, Teil I Nr. 60: 2913–2914. http://www.bgbl.de/xaver/bgbl/start.xav?startbk=Bundesanzeiger_BGBl&jumpTo=bgbl106s2913.pdf
Jeuris, P., & Niehues, J. (2022). LibriS2S: A German-English speech-to-speech translation corpus. arXiv:2204.10593v1 [cs.CL]. https://doi.org/10.48550/arXiv.2204.10593.
Kaiser, J. (2016). Reformulierungsindikatoren im gesprochenen Deutsch: Die Benutzung der Ressourcen DGD und FOLK für gesprächsanalytische Zwecke. Gesprächsforschung – Online-Zeitschrift zur Verbalen Interaktion, 17, 196–230.
Kehrein, R., & Vorberger, L. (2018). Dialekt- und Variationskorpora. In M. Kupietz, & T. Schmidt (Eds.), Korpuslinguistik (pp. 125–150). De Gruyter.
Kleiner, S. (2015). Deutsch heute und der Atlas zur Aussprache des deutschen Gebrauchsstandards. In R. Kehrein, A. Lameli, & S. Rabanus (Eds.), Regionale Variation des Deutschen: Projekte und Perspektiven (pp. 489–518). De Gruyter.
Kupietz, M., Lüngen, H., Kamocki, P., & Witt, A. (2018). The German Reference Corpus DeReKo: New Developments – New Opportunities. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 7–12 May 2018, Miyazaki, Japan, 4353–4360. European Language Resources Association (ELRA). https://aclanthology.org/L18-1689.
Lameli, A. (2013). Strukturen im Sprachraum: Analysen zur arealtypologischen Komplexität der Dialekte in Deutschland. De Gruyter. https://doi.org/10.1515/9783110331394.
Montani, I., Honnibal, M., Honnibal, M., Van Landeghem, S., Boyd, A., Peters, H., O’Leary McCann, P., Samsonov, M., Geovedi, J., O’Regan, J., Altinok, D., Orosz, G., Kristiansen, S. L., Miranda, L., de Kok, D., Roman, E., Bot, Fiedler, L., Howard, G., Edward, Phatthiyaphaibun, W., Tamura, Y., Bozek, S., murat, Daniels, R., Amery, M., Böing, B., Vanroy, B., & Tippa, P. K. (2022). explosion/spaCy: v3.3.0: Improved speed, new trainable lemmatizer, and pipelines for Finnish, Korean and Swedish (v3.3.0). Zenodo. https://doi.org/10.5281/zenodo.6504092.
Pickl, S., & Pröll, S. (2019). Ergebnisse geostatistischer Analysen arealsprachlicher Variation im Deutschen. In J. Herrgen & J. E. Schmidt (Eds.). Deutsch: Sprache und Raum - Ein internationales Handbuch der Sprachvariation (= HSK 30.4) (pp. 861–878). De Gruyter Mouton. https://doi.org/10.1515/9783110261295-032.
Schegloff, E. A., Jefferson, G., & Sacks, H. (1977). The preference for self-correction in the organization of repair in conversation. Language, 53, 361–382.
Schmidt, J. E., Herrgen, J., Kehrein, R., Lameli, A., & Fischer, H. (Eds.). (2020). Regionalsprache.de (REDE): Forschungsplattform zu den modernen Regionalsprachen des Deutschen. (with the assistance of Engsterhold, R., Girnth, H., Kasper, S., Limper, J., Oberdorfer, G., Pistor, T., Wolańska, A., Beitel, D., Gropp, M., Krapp, M. L., Lang, V., Lipfert, S., Pheiff, J., & Vielsmeier, B.). Forschungszentrum Deutscher Sprachatlas.
Schmidt, T. (2014a). The Database for Spoken German – DGD2. In Proceedings of the Ninth Conference on International Language Resources and Evaluation (LREC’14), pp. 1451–1457. http://www.lrec-conf.org/proceedings/lrec2014/pdf/171_Paper.pdf.
Schmidt, T. (2014b). The Research and Teaching Corpus of Spoken German - FOLK. In Proceedings of the Ninth Conference on International Language Resources and Evaluation (LREC’14), pp. 383–387. http://www.lrec-conf.org/proceedings/lrec2014/pdf/290_Paper.pdf.
Schmidt, T. (2017). DGD – die Datenbank für Gesprochenes Deutsch. Mündliche Korpora am Institut für Deutsche Sprache (IDS) in Mannheim. Zeitschrift für Germanistische Linguistik, 45(3), 451–463.
Stift, U. M., & Schmidt, T. (2014). Mündliche Korpora am IDS: Vom Deutschen Spracharchiv zur Datenbank für Gesprochenes Deutsch. In M. Steine & F. J. Berens (eds.), Ansichten und Einsichten: 50 Jahre Institut für Deutsche Sprache, 360–375. Institut für Deutsche Sprache. http://ids-pub.bsz-bw.de/frontdoor/index/index/docId/2477.
UrhWissG (2017). Gesetz zur Angleichung des Urheberrechts an die aktuellen Erfordernisse der Wissensgesellschaft (Urheberrechts-Wissensgesellschaftsgesetz). Bundesgesetzblatt, Jahrgang 2017, Teil I Nr. 61: 3346–3351. http://www.bgbl.de/xaver/bgbl/start.xav?startbk=Bundesanzeiger_BGBl&jumpTo=bgbl117s3346.pdf
Wagener, P. (2002). German dialects in real-time change. Journal of Germanic Linguistics, 14(3), 271–285.
Westpfahl, S., & Schmidt, T. (2016). FOLK-Gold ― A Gold Standard for Part-of-Speech-Tagging of Spoken German. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard et al. (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 1493–1499. European Language Resources Association. https://aclanthology.org/L16-1237.pdf.
Xiong, W., Droppo, J., Huang, X., Seide, F., Seltzer, M. L., Stolcke, A., Yu, D., & Zweig, G. (2017). Toward human parity in conversational speech recognition. IEEE/ACM Transactions on Audio Speech and Language Processing, 25(12), 2410–2423. https://doi.org/10.1109/TASLP.2017.2756440.
Zhang, Q., Lu, H., Sak, H., Tripathi, A., McDermott, E., Koo, S., & Kumar, S. (2020). Transformer transducer: A streamable speech recognition model with transformer encoders and rnn-t loss. arXiv:2002.02562v2 [eess.AS]. https://doi.org/10.48550/arXiv.2002.02562.

Acknowledgements

The author declares none.

Funding

No external funding has been utilized for the preparation of the manuscript.

Open Access funding provided by University of Oulu.

Author information

Authors and Affiliations

English, Faculty of Humanities, University of Oulu, Oulu, Finland
Steven Coats

Contributions

S. C. created the resource, wrote the manuscript, created the figures, and reviewed the manuscript. The author of the manuscript is responsible for everything in the manuscript.

Corresponding author

Correspondence to Steven Coats.

Ethics declarations

Statements and declarations

The author has no relevant financial or non-financial interests to disclose.

Competing interests

The authors declare no competing interests.

Conflict of interest

The author declares none.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Coats, S. A new corpus of geolocated ASR transcripts from Germany. Lang Resources & Evaluation (2023). https://doi.org/10.1007/s10579-023-09686-9

Accepted: 07 August 2023
Published: 21 October 2023
DOI: https://doi.org/10.1007/s10579-023-09686-9
