Abstract
In recent years, building reference speech corpora has been an important part of the activities providing the necessary linguistic infrastructure in many European countries, both for languages with many speakers (e.g., French, German, Spanish, Italian) and for those with fewer speakers (e.g., Swedish, Dutch, Czech, Slovak). This paper describes the creation of a reference speech corpus and its distribution to potential users, using the Slovene corpus GOS as a case study. It covers the corpus structure, fieldwork experience with recording, the labelling system, and the two levels of transcription (pronunciation-based and standardized), as well as the main characteristics of the corpus interface (a web concordancer) and the availability of the original corpus files.
Notes
In accordance with the Slovene Personal Data Protection Act (ZVOP), the collected data has been prepared in such a way that the speakers’ identity can no longer be determined. All data concerning the speakers themselves has therefore been anonymised in the conversations, and the transcription gives only the type of the anonymised data, for example, [name], [surname], or [address]. On the audio recordings, the anonymised parts are covered by a beep, and in non-public discourse, the voice frequency has been changed.
The Brazilian corpus C-ORAL-BRASIL faced a similar problem in constructing the “non-orthographic criteria” of transcription (Mello and Raso 2009) indicating new lexical structures, including acronyms, foreign words or vocal reductions related to verb paradigms, pronouns, prepositions, etc.
These and all other examples of phonetic phenomena are represented in the Slovene SAMPA conventions (Zemljak Jontes et al. 2002).
A similar type of transcription is used in various fields of discourse and pragmatic studies (e.g., the well-known transcription system of conversation analysis developed by Gail Jefferson; see Atkinson and Heritage 1984). Similar principles are applied when people write in internet forums or chat rooms, and occasionally in literature (e.g., Welsh 1997), when authors want to characterize colloquial spoken language.
References
Allwood, J., Björnberg, M., Grönqvist, L., Ahlsén, E., & Ottesjö, C. (2000). The spoken language corpus at the Linguistics Department, Göteborg University. Forum: Qualitative Social Research, 1(3).
Arhar, Š. (2007). Uporabniška evalvacija korpusa FidaPLUS: Zasnova vprašalnika, prvi rezultati. In M. Stabej (Ed.), Infrastruktura slovenščine in slovenistike (Proceedings of the 28th Obdobja Symposium) (pp. 19–26). Ljubljana: Znanstvena založba Filozofske fakultete.
Atkinson, J. M., & Heritage, J. (Eds.). (1984). Structures of social action: Studies in conversation analysis. Cambridge: Cambridge University Press.
Barras, C., Geoffrois, E., Wu, Z., & Liberman, M. (2000). Transcriber: Development and use of a tool for assisting speech corpora production. Speech Communication, special issue on Speech Annotation and Corpus Tools, 33(1–2), 5–22.
British Academic Spoken English (BASE). URL: http://www.reading.ac.uk/AcaDepts/ll/base_corpus/. Accessed 20 June 2012.
Burnard, L. (Ed.) (2007). Reference guide for the British National Corpus (XML Edition). URL: http://www.natcorp.ox.ac.uk/XMLedition/URG/. Accessed 20 June 2012.
Burnard, L., & Bauman, S. (2007). P5: Guidelines for electronic text encoding and interchange: 8 Transcriptions of speech. TEI: Text Encoding Initiative. URL: http://www.tei-c.org/Vault/P5/1.7.0/doc/tei-p5-doc/en/html/TS.html. Accessed 20 June 2012.
Cheng, W., Greaves, C., & Warren, M. (2008). A corpus-driven study of discourse intonation. Amsterdam: John Benjamins.
CHILDES—Child Language Data Exchange System. (2012). URL: http://childes.psy.cmu.edu/. Accessed 20 June 2012.
CLIPS. (2012). URL: http://www.clips.unina.it/en/index.jsp. Accessed 20 June 2012.
Communication in Slovene. (2008). URL: www.slovenscina.eu. Accessed 20 June 2012.
CORPUS.BYU.EDU. (2012). URL: http://corpus.byu.edu/corpora.asp. Accessed 20 June 2012.
Corpus de la Parole. (2012). URL: http://corpusdelaparole.in2p3.fr/. Accessed 20 June 2012.
Corpus of Spoken Slovak. (2012). URL: http://korpus.juls.savba.sk/shk.html. Accessed 20 June 2012.
Cresti, E., Bacelar do Nascimento, F., Moreno Sandoval, A., Veronis, J., Martin, P., & Choukri, K. (2004). The C-ORAL-ROM CORPUS: A multilingual resource of spontaneous speech for romance languages. In Proceedings of the fourth international conference on language resources and evaluation (LREC’04). Lisbon, Portugal.
Czech National Corpus—ORAL2008. (2008). Institute of the Czech National Corpus, Praha. URL: http://www.korpus.cz. Accessed 20 June 2012.
Czech National Corpus—SYN. (2012). Institute of the Czech National Corpus, Praha. URL: http://ucnk.ff.cuni.cz/english/. Accessed 15 November 2012.
Deutsches Spracharchiv (DSAv) und Datenbank Gesprochenes Deutsch (DGD). (2012). URL: http://dsav-oeff.ids-mannheim.de/. Accessed 20 June 2012.
EAGLES. (1996). Preliminary recommendations on spoken texts. EAGLES Document EAGTCWG-STP/P.
Erjavec, T. (2010). TEI Schema for GOS speech corpus of Slovene. URL: http://nl.ijs.si/ssj/gos/schema/tei_gos_doc.pdf. Accessed 20 June 2012.
Erjavec, T., Krek, S., Arhar, Š., Fišer, D., Ledinek, N., Saksida, A., Sivec, B., & Trebar, B. (2010). Oblikoskladenjske specifikacije JOS, v1.1. URL: http://nl.ijs.si/jos/msd/html-sl/. Accessed 20 June 2012.
FIDAplus, korpus slovenskega jezika. (2012). URL: http://www.fidaplus.net/. Accessed 20 June 2012.
Göteborg Spoken Language Corpus, GSLC. (2012). URL: http://www.ling.gu.se/projekt/tal/index.cgi?PAGE=3. Accessed 20 June 2012.
GOS, GOvorjena Slovenščina. (2012). URL: www.korpus-gos.net. Accessed 20 June 2012.
Grčar, M., Krek, S., & Dobrovoljc, K. (2012). Obeliks: statistični oblikoskladenjski označevalnik in lematizator za slovenski jezik. In T. Erjavec & J. Žganec Gros (Eds.), Zbornik Osme konference Jezikovne tehnologije. Ljubljana, Slovenia.
Grishina, E. (2006). Spoken Russian in the Russian National Corpus (RNC). In Proceedings of the fifth international conference on language resources and evaluation (LREC’06). Genova, Italy. URL: http://www.lrec-conf.org/proceedings/lrec2006/. Accessed 15 November 2012.
Hong Kong Corpus of Spoken English. (2012). URL: http://rcpce.engl.polyu.edu.hk/HKCSE/. Accessed 20 June 2012.
Izre’el, S., Hary, B., & Rahav, G. (2001). Designing CoSIH: The corpus of spoken Israeli Hebrew. International Journal of Corpus Linguistics, 6, 171–197.
Jacobson, M., & Baude, O. (2011). Corpus de la parole: Collecte, catalogage, conservation et diffusion des ressources orales sur le français et les langues de France. Traitement Automatique des Langues, 52(3), 47–69.
Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The sketch engine. In G. Williams & S. Vessier (Eds.), Proceedings of the 11th Euralex international congress (pp. 105–116). Lorient: Universite de Bretagne-Sud.
LABLITA Corpus of Spontaneous Spoken Italian. URL: http://lablita.dit.unifi.it/corpora/descriptions/lablita. Accessed 20 June 2012.
Mello, H., & Raso, T. (2009). Para a transcrição da fala espontânea: O caso do C-ORAL–BRASIL. Revista Portuguesa de Humanidades, 13(1), 153–178.
MICASE—Michigan Corpus of Academic Spoken English. (2012). URL: http://quod.lib.umich.edu/m/micase/. Accessed 20 June 2012.
Oostdijk, N., Goedertier, W., Van Eynde, F., Boves, L., Martens, J. P., Moortgat, M., et al. (2002). Experiences from the Spoken Dutch corpus project. In M. González Rodriguez & C. Paz Suárez Araujo (Eds.), Proceedings of the third international conference on language resources and evaluation (LREC’02) (pp. 340–347). Las Palmas, Canary Islands.
Pořízka, P. (2009). Olomouc corpus of Spoken Czech: Characterization and main features of the project. Linguistik online, 38/2. URL: http://www.linguistik-online.de/38_09/porizka.html. Accessed 20 June 2012.
Przepiórkowski, A., Górski, R. L., Lewandowska-Tomaszczyk, B., & Łazinski, M. (2008). Towards the National Corpus of Polish. In Proceedings of the international conference on language resources and evaluation (LREC’08). Marrakech, Morocco. URL: http://www.lrec-conf.org/proceedings/lrec2008/pdf/211_paper.pdf. Accessed 15 November 2012.
Rotovnik, T., Sepesy Maučec, M., & Kačič, Z. (2007). Large vocabulary continuous speech recognition of an inflected language using stems and endings. Speech Communication, 49(6), 437–452.
Rusko, M., & Garabík, R. (2007). Corpus of Spoken Slovak language. In J. Levická & R. Garabík (Eds.), Computer treatment of Slavic and East European Languages. Proceedings of the conference Slovko 2007 (pp. 222–236). Brno: Tribun.
Russian National Corpus. (2012). URL: http://ruscorpora.ru/en/search-spoken.html. Accessed 20 June 2012.
SACODEYL, European Youth language. (2012). URL: http://sacodeyl.inf.um.es/sacodeyl-search2/. Accessed 20 June 2012.
Savy, R., & Cutugno, F. (2009). CLIPS: Diatopic, diamesic and diaphasic variations in spoken Italian. In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), On-line proceedings of 5th corpus linguistics conference. URL: http://ucrel.lancs.ac.uk/publications/cl2009/, article 213. Accessed 15 November 2012.
Sketch Engine. (2012). URL: http://www.sketchengine.co.uk/. Accessed 20 June 2012.
Slovene Personal Data Protection Act (Zakon o varstvu osebnih podatkov—ZVOP-1, UL RS 86/4). (2012). http://zakonodaja.gov.si/rpsi/r06/predpis_ZAKO3906.html. Accessed 15 November 2012.
Spoken Dutch Corpus/Corpus Gesproken Nederlands. (2012). URL: http://lands.let.kun.nl/cgn/. Accessed 20 June 2012.
The British National Corpus, version 3 (BNC XML Edition). (2007). Distributed by Oxford University computing services on behalf of the BNC consortium. URL: http://www.natcorp.ox.ac.uk/. Accessed 20 June 2012.
Wagener, P. (2005). DGD—Datenbank Gesprochenes Deutsch Archivierung, Dokumentation und Erschließung des Deutschen Spracharchivs. Sprachreport, 3, 23–26.
Welsh, I. (1997). Trainspotting. (Translation: A. Skubic.) Ljubljana: DZS.
Widmann, J., Kohn, K., & Ziai, R. (2008). The SACODEYL search tool. Exploiting corpora for language learning purposes. In A. Frankenberg-Garcia, T. Rkibi, M. R. Cruz, R. Carvalho, C. Direito, & D. Santos-Rosa (Eds.), Proceedings of the 8th TALC conference (pp. 321–327). Lisbon: ISLA.
Zemljak Jontes, M., Kačič, Z., Dobrišek, S., Žganec Gros, J., & Weiss, P. (2002). Računalniški simbolni fonetični zapis slovenskega govora. Slavistična revija, 50(2), 159–169.
Žgank, A. (2010). Three-stage framework for unsupervised acoustic modeling using untranscribed spoken content. ETRI Journal, 32(5), 810–818.
Žgank, A., Rotovnik, T., & Sepesy Maučec, M. (2008). Slovenian spontaneous speech recognition and acoustic modelling of filled pauses and onomatopoeas. WSEAS Transactions on Signal Processing, 4(7), 388–397.
Acknowledgments
The GOS corpus was built within the Communication in Slovene project. The operation was partly financed by the European Union, the European Social Fund, and the Ministry of Education, Science, Culture and Sport of the Republic of Slovenia. The operation was carried out within the operational programme Human Resources Development for the period 2007–2013, developmental priorities: improvement of the quality and efficiency of educational and training systems 2007–2013. The online search tool, a concordancer, was developed as part of the project “Online concordancer for the national spoken corpus of Slovene”, funded within the priority axis Economic Development Infrastructure and the priority strategy Information Society within the operational programme ‘Strengthening Regional Development Potentials’ for the period 2007–2013. We thank all the speakers who participated in the recordings for the GOS corpus, the media companies that provided recordings from their archives, and the other companies and institutions, especially schools, where the recordings of official discourse were conducted.
Cite this article
Verdonik, D., Kosem, I., Vitez, A.Z. et al. Compilation, transcription and usage of a reference speech corpus: the case of the Slovene corpus GOS. Lang Resources & Evaluation 47, 1031–1048 (2013). https://doi.org/10.1007/s10579-013-9216-5
DOI: https://doi.org/10.1007/s10579-013-9216-5