Compilation, transcription and usage of a reference speech corpus: the case of the Slovene corpus GOS

Verdonik, Darinka; Kosem, Iztok; Zwitter Vitez, Ana; Krek, Simon; Stabej, Marko

doi:10.1007/s10579-013-9216-5

Original Paper
Published: 29 January 2013

Volume 47, pages 1031–1048, (2013)
Language Resources and Evaluation

Darinka Verdonik¹,
Iztok Kosem²,
Ana Zwitter Vitez²,
Simon Krek³ &
Marko Stabej⁴

Abstract

In recent years, building reference speech corpora has been an important part of the activities that provided the necessary linguistic infrastructure in many European countries, for languages with many speakers (e.g., French, German, Spanish, Italian) as well as for those with smaller numbers of speakers (e.g., Swedish, Dutch, Czech, Slovak). This paper describes the process of creating a reference speech corpus and distributing it to potential users, as was done in the case of the Slovene corpus GOS. The corpus structure, fieldwork experiences with recording, the labelling system, and two levels of transcription (pronunciation-based and standardized) are described, as well as the main characteristics of the corpus interface (web concordancer) and the availability of the original corpus files.

Notes

According to the Slovene Personal Data Protection Act (ZVOP), the collected data has been prepared in such a way that the speakers’ identity is no longer detectable. This is why we have anonymised all of the data in conversations concerning the speakers themselves, while the transcription only provides the type of the anonymised data, for example, [name], [surname], or [address]. On the audio recordings, the anonymised parts are covered by a beep, and within non-public discourse, the voice frequency has been changed.
The Brazilian corpus C-ORAL-BRASIL faced a similar problem in constructing the “non-orthographic criteria” of transcription (Mello and Raso 2009) indicating new lexical structures, including acronyms, foreign words or vocal reductions related to verb paradigms, pronouns, prepositions, etc.
These and all other examples of phonetic phenomena are represented in the Slovene SAMPA conventions (Zemljak Jontes et al. 2002).
A similar type of transcription is known in different fields of discourse and pragmatic studies (e.g., the transcription system of conversation analysis developed by Gail Jefferson is well known; see Atkinson and Heritage 1984), while similar principles are used when people write in internet forums or chat rooms, or occasionally in literature (e.g., Welsh 1997) when authors want to characterize colloquial spoken language.

References

Allwood, J., Björnberg, M., Grönqvist, L., Ahlsén, E., & Ottesjö, C. (2000). The spoken language corpus at the Linguistics Department, Göteborg University. Forum: Qualitative Social Research, 1/3.
Arhar, Š. (2007). Uporabniška evalvacija korpusa FidaPLUS: Zasnova vprašalnika, prvi rezultati. In M. Stabej (Ed.), Infrastruktura slovenščine in slovenistike (Proceedings of the 28th Obdobja Symposium) (pp. 19–26). Ljubljana: Znanstvena založba Filozofske fakultete.
Atkinson, J. M., & Heritage, J. (Eds.). (1984). Structures of social action: Studies in conversation analysis. Cambridge: Cambridge University Press.
Barras, C., Geoffrois, E., Wu, Z., & Liberman, M. (2000). Transcriber: Development and use of a tool for assisting speech corpora production. Speech Communication, special issue on Speech Annotation and Corpus Tools, 33(1–2), 5–22.
British Academic Spoken English (BASE). URL: http://www.reading.ac.uk/AcaDepts/ll/base_corpus/. Accessed 20 June 2012.
Burnard, L. (Ed.) (2007). Reference guide for the British National Corpus (XML Edition). URL: http://www.natcorp.ox.ac.uk/XMLedition/URG/. Accessed 20 June 2012.
Burnard, L., & Bauman, S. (2007). P5: Guidelines for electronic text encoding and interchange: 8 Transcriptions of speech. TEI (Text Encoding Initiative). URL: http://www.tei-c.org/Vault/P5/1.7.0/doc/tei-p5-doc/en/html/TS.html. Accessed 20 June 2012.
Cheng, W., Greaves, C., & Warren, M. (2008). A corpus-driven study of discourse intonation. Amsterdam: John Benjamins.
CHILDES—Child Language Data Exchange System. (2012). URL: http://childes.psy.cmu.edu/. Accessed 20 June 2012.
CLIPS. (2012). URL: http://www.clips.unina.it/en/index.jsp. Accessed 20 June 2012.
Communication in Slovene. (2008). URL: www.slovenscina.eu. Accessed 20 June 2012.
CORPUS.BYU.EDU. (2012). URL: http://corpus.byu.edu/corpora.asp. Accessed 20 June 2012.
Corpus de la Parole. (2012). URL: http://corpusdelaparole.in2p3.fr/. Accessed 20 June 2012.
Corpus of Spoken Slovak. (2012). URL: http://korpus.juls.savba.sk/shk.html. Accessed 20 June 2012.
Cresti, E., Bacelar do Nascimento, F., Moreno Sandoval, A., Veronis, J., Martin, P., & Choukri, K. (2004). The C-ORAL-ROM CORPUS: A multilingual resource of spontaneous speech for romance languages. In Proceedings of the fourth international conference on language resources and evaluation (LREC’04). Lisbon, Portugal.
Czech National Corpus—ORAL2008. (2008). Institute of the Czech National Corpus, Praha. URL: http://www.korpus.cz. Accessed 20 June 2012.
Czech National Corpus—SYN. (2012). Institute of the Czech National Corpus, Praha. URL: http://ucnk.ff.cuni.cz/english/. Accessed 15 November 2012.
Deutsches Spracharchiv (DSAv) und Datenbank Gesprochenes Deutsch (DGD). (2012). URL: http://dsav-oeff.ids-mannheim.de/. Accessed 20 June 2012.
EAGLES. (1996). Preliminary recommendations on spoken texts. EAGLES Document EAGTCWG-STP/P.
Erjavec, T. (2010). TEI Schema for GOS speech corpus of Slovene. URL: http://nl.ijs.si/ssj/gos/schema/tei_gos_doc.pdf. Accessed 20 June 2012.
Erjavec, T., Krek, S., Arhar, Š., Fišer, D., Ledinek, N., Saksida, A., Sivec, B., & Trebar, B. (2010). Oblikoskladenjske specifikacije JOS, v1.1. URL: http://nl.ijs.si/jos/msd/html-sl/. Accessed 20 June 2012.
FIDAplus, korpus slovenskega jezika. (2012). URL: http://www.fidaplus.net/. Accessed 20 June 2012.
Göteborg Spoken Language Corpus, GSLC. (2012). URL: http://www.ling.gu.se/projekt/tal/index.cgi?PAGE=3. Accessed 9th May 2012. Accessed 20 June 2012.
GOS, GOvorjena Slovenščina. (2012). URL: www.korpus-gos.net. Accessed 20 June 2012.
Grčar, M., Krek, S., & Dobrovoljc, K. (2012). Obeliks: statistični oblikoskladenjski označevalnik in lematizator za slovenski jezik. In T. Erjavec & J. Žganec Gros (Eds.), Zbornik Osme konference Jezikovne tehnologije. Ljubljana, Slovenia.
Grishina, E. (2006). Spoken Russian in the Russian National Corpus (RNC). In Proceedings of the fifth international conference on language resources and evaluation (LREC’06). Genova, Italy. URL: http://www.lrec-conf.org/proceedings/lrec2006/. Accessed 15 November 2012.
Hong Kong Corpus of Spoken English. (2012). URL: http://rcpce.engl.polyu.edu.hk/HKCSE/. Accessed 20 June 2012.
Izre’el, S., Hary, B., & Rahav, G. (2001). Designing CoSIH: The corpus of spoken Israeli Hebrew. International Journal of Corpus Linguistics, 6, 171–197.
Jacobson, M., & Baude, O. (2011). Corpus de la parole: Collecte, catalogage, conservation et diffusion des ressources orales sur le français et les langues de France. Traitement Automatique des Langues, 52(3), 47–69.
Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The sketch engine. In G. Williams & S. Vessier (Eds.), Proceedings of the 11th Euralex international congress (pp. 105–116). Lorient: Universite de Bretagne-Sud.
LABLITA Corpus of Spontaneous Spoken Italian. URL: http://lablita.dit.unifi.it/corpora/descriptions/lablita. Accessed 20 June 2012.
Mello, H., & Raso, T. (2009). Para a transcrição da fala espontânea: O caso do C-ORAL–BRASIL. Revista Portuguesa de Humanidades, 13(1), 153–178.
MICASE—Michigan Corpus of Academic Spoken English. (2012). URL: http://quod.lib.umich.edu/m/micase/. Accessed 20 June 2012.
Oostdijk, N., Goedertier, W., Van Eynde, F., Boves, L., Martens, J. P., Moortgat, M., et al. (2002). Experiences from the Spoken Dutch corpus project. In M. González Rodriguez, C. Paz Suárez Araujo (Eds.), Proceedings of the third international conference on language resources and evaluation (LREC’02) (pp. 340–347). Las Palmas, Canary Islands.
Pořízka, P. (2009). Olomouc corpus of Spoken Czech: Characterization and main features of the project. Linguistik online, 38/2. URL: http://www.linguistik-online.de/38_09/porizka.html. Accessed 20 June 2012.
Przepiórkowski, A., Górski, R. L., Lewandowska-Tomaszczyk, B., & Łazinski, M. (2008). Towards the National Corpus of Polish. In Proceedings of the international conference on language resources and evaluation (LREC’08). Marrakech, Morocco. URL: http://www.lrec-conf.org/proceedings/lrec2008/pdf/211_paper.pdf. Accessed 15 November 2012.
Rotovnik, T., Sepesy Maučec, M., & Kačič, Z. (2007). Large vocabulary continuous speech recognition of an inflected language using stems and endings. Speech Communication, 49(6), 437–452.
Rusko, M., & Garabík, R. (2007). Corpus of Spoken Slovak language. In J. Levická & R. Garabík (Eds.), Computer treatment of Slavic and East European Languages. Proceedings of the conference Slovko 2007 (pp. 222–236). Brno: Tribun.
Russian National Corpus. (2012). URL: http://ruscorpora.ru/en/search-spoken.html. Accessed 20 June 2012.
SACODEYL, European Youth language. (2012). URL: http://sacodeyl.inf.um.es/sacodeyl-search2/. Accessed 20 June 2012.
Savy, R., & Cutugno, F. (2009). CLIPS: Diatopic, diamesic and diaphasic variations in spoken Italian. In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), On-line proceedings of 5th corpus linguistics conference. URL: http://ucrel.lancs.ac.uk/publications/cl2009/, article 213. Accessed 15 November 2012.
Sketch Engine. (2012). URL: http://www.sketchengine.co.uk/. Accessed 20 June 2012.
Slovene Personal Data Protection Act (Zakon o varstvu osebnih podatkov—ZVOP-1, UL RS 86/4). (2012). http://zakonodaja.gov.si/rpsi/r06/predpis_ZAKO3906.html. Accessed 15 November 2012.
Spoken Dutch Corpus/Corpus Gesproken Nederlands. (2012). URL: http://lands.let.kun.nl/cgn/. Accessed 20 June 2012.
The British National Corpus, version 3 (BNC XML Edition). (2007). Distributed by Oxford University computing services on behalf of the BNC consortium. URL: http://www.natcorp.ox.ac.uk/. Accessed 20 June 2012.
Wagener, P. (2005). DGD—Datenbank Gesprochenes Deutsch Archivierung, Dokumentation und Erschließung des Deutschen Spracharchivs. Sprachreport, 3, 23–26.
Welsh, I. (1997). Trainspotting. (Translation: A. Skubic.) Ljubljana: DZS.
Widmann, J., Kohn, K., & Ziai, R. (2008). The SACODEYL search tool. Exploiting corpora for language learning purposes. In A. Frankenberg-Garcia, T. Rkibi, M. R. Cruz, R. Carvalho, C. Direito, & D. Santos-Rosa (Eds.), Proceedings of the 8th TALC conference (pp. 321–327). Lisbon: ISLA.
Zemljak Jontes, M., Kačič, Z., Dobrišek, S., Žganec Gros, J., & Weiss, P. (2002). Računalniški simbolni fonetični zapis slovenskega govora. Slavistična revija, 50(2), 159–169.
Žgank, A. (2010). Three-stage framework for unsupervised acoustic modeling using untranscribed spoken content. ETRI Journal, 32(5), 810–818.
Žgank, A., Rotovnik, T., & Sepesy Maučec, M. (2008). Slovenian spontaneous speech recognition and acoustic modelling of filled pauses and onomatopoeas. WSEAS Transactions on Signal Processing, 4(7), 388–397.

Acknowledgments

The GOS corpus was built within the Communication in Slovene project. The operation was partly financed by the European Union, the European Social Fund, and the Ministry of Education, Science, Culture and Sport of the Republic of Slovenia. The operation was carried out within the operational programme Human Resources Development for the period 2007–2013, developmental priorities: improvement of the quality and efficiency of educational and training systems 2007–2013. The online search tool, a concordancer, was made as part of the project “Online concordancer for the national spoken corpus of Slovene”, funded within the priority axis Economic Development Infrastructure and the priority strategy Information Society within the operational programme “Strengthening Regional Development Potentials” for the period 2007–2013. We thank all the speakers who participated in the recordings for the GOS corpus, the media companies that provided recordings from their archives, and the other companies and institutions, especially schools, where the recordings of official discourse were conducted.

Author information

Authors and Affiliations

University of Maribor, Maribor, Slovenia
Darinka Verdonik
Trojina, Institute for Applied Slovene Studies, Škofja Loka, Slovenia
Iztok Kosem & Ana Zwitter Vitez
Amebis, d.o.o., Kamnik, Slovenia
Simon Krek
University of Ljubljana, Ljubljana, Slovenia
Marko Stabej

Corresponding author

Correspondence to Darinka Verdonik.

About this article

Cite this article

Verdonik, D., Kosem, I., Vitez, A.Z. et al. Compilation, transcription and usage of a reference speech corpus: the case of the Slovene corpus GOS. Lang Resources & Evaluation 47, 1031–1048 (2013). https://doi.org/10.1007/s10579-013-9216-5

Issue Date: December 2013
DOI: https://doi.org/10.1007/s10579-013-9216-5
