Compilation, transcription and usage of a reference speech corpus: the case of the Slovene corpus GOS

  • Original Paper
  • Language Resources and Evaluation

Abstract

In recent years, building reference speech corpora has been an important part of the activities providing the necessary linguistic infrastructure in many European countries, both for languages with many speakers (e.g., French, German, Spanish, Italian) and for those with fewer speakers (e.g., Swedish, Dutch, Czech, Slovak). This paper describes the creation of a reference speech corpus and its distribution to potential users, using the Slovene corpus GOS as a case study. It covers the corpus structure, fieldwork experiences with recording, the labelling system, and the two levels of transcription (pronunciation-based and standardized), as well as the main characteristics of the corpus interface (a web concordancer) and the availability of the original corpus files.

Notes

  1. In accordance with the Slovene Personal Data Protection Act (ZVOP), the collected data has been prepared in such a way that the speakers’ identity can no longer be determined. To this end, we have anonymised all data in the conversations that concerns the speakers themselves; the transcription provides only the type of the anonymised data, for example, [name], [surname], or [address]. On the audio recordings, the anonymised parts are covered by a beep, and within non-public discourse, the voice frequency has been changed.

  2. The Brazilian corpus C-ORAL-BRASIL faced a similar problem in constructing the “non-orthographic criteria” of transcription (Mello and Raso 2009) indicating new lexical structures, including acronyms, foreign words or vocal reductions related to verb paradigms, pronouns, prepositions, etc.

  3. These and all other examples of phonetic phenomena are represented in the Slovene SAMPA conventions (Zemljak Jontes et al. 2002).

  4. A similar type of transcription is used in various fields of discourse and pragmatic studies (e.g., the well-known transcription system of conversation analysis developed by Gail Jefferson; see Atkinson and Heritage 1984), and similar principles are applied when people write in internet forums or chat rooms, or occasionally in literature (e.g., Welsh 1997) when authors want to characterize colloquial spoken language.
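
The transcript side of the anonymisation described in note 1 can be illustrated with a minimal sketch. It assumes that the personal data has already been marked by hand as character spans with a type; the `Span` class, the `anonymise` function, and the sample sentence are hypothetical and are not taken from the GOS tooling or its TEI schema.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Span:
    """A manually marked stretch of personal data (hypothetical structure)."""
    start: int  # index of the first character of the span
    end: int    # index one past the last character of the span
    kind: str   # type of the anonymised data, e.g. "name" or "address"


def anonymise(text: str, spans: Sequence[Span]) -> str:
    """Replace every marked span with its bracketed type, e.g. [name]."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("spans must not overlap")
        pieces.append(text[cursor:span.start])  # untouched text before the span
        pieces.append(f"[{span.kind}]")          # only the type is kept
        cursor = span.end
    pieces.append(text[cursor:])                 # untouched text after the last span
    return "".join(pieces)


if __name__ == "__main__":
    # Invented example sentence; the spans cover "Ana" and "Novak".
    sample = "I am Ana Novak"
    marked = [Span(5, 8, "name"), Span(9, 14, "surname")]
    print(anonymise(sample, marked))
```

Only the type of each item survives in the output, which mirrors the [name], [surname], [address] convention mentioned in the note. Masking the corresponding audio with a beep would be a separate step and is not sketched here.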

References

  • Allwood, J., Björnberg, M., Grönqvist, L., Ahlsén, E., Ottesjö, C. (2000). The spoken language corpus at the Linguistics Department, Göteborg University. Forum: Qualitative Social Research 1/3.

  • Arhar, Š. (2007). Uporabniška evalvacija korpusa FidaPLUS: Zasnova vprašalnika, prvi rezultati. In M. Stabej (Ed.), Infrastruktura slovenščine in slovenistike (Proceedings of the 28th Obdobja Symposium) (pp. 19–26). Ljubljana: Znanstvena založba Filozofske fakultete.

  • Atkinson, J. M., & Heritage, J. (Eds.). (1984). Structures of social action: Studies in conversation analysis. Cambridge: Cambridge University Press.

  • Barras, C., Geoffrois, E., Wu, Z., & Liberman, M. (2000). Transcriber: Development and use of a tool for assisting speech corpora production. Speech Communication, special issue on Speech Annotation and Corpus Tools, 33(1–2), 5–22.

  • British Academic Spoken English (BASE). URL: http://www.reading.ac.uk/AcaDepts/ll/base_corpus/. Accessed 20 June 2012.

  • Burnard, L. (Ed.) (2007). Reference guide for the British National Corpus (XML Edition). URL: http://www.natcorp.ox.ac.uk/XMLedition/URG/. Accessed 20 June 2012.

  • Burnard, L., & Bauman, S. (2007). P5: Guidelines for electronic text encoding and interchange: 8 Transcriptions of speech. TEI: Text Encoding Initiative. URL: http://www.tei-c.org/Vault/P5/1.7.0/doc/tei-p5-doc/en/html/TS.html. Accessed 20 June 2012.

  • Cheng, W., Greaves, C., & Warren, M. (2008). A corpus-driven study of discourse intonation. Amsterdam: John Benjamins.

  • CHILDES—Child Language Data Exchange System. (2012). URL: http://childes.psy.cmu.edu/. Accessed 20 June 2012.

  • CLIPS. (2012). URL: http://www.clips.unina.it/en/index.jsp. Accessed 20 June 2012.

  • Communication in Slovene. (2008). URL: www.slovenscina.eu. Accessed 20 June 2012.

  • CORPUS.BYU.EDU. (2012). URL: http://corpus.byu.edu/corpora.asp. Accessed 20 June 2012.

  • Corpus de la Parole. (2012). URL: http://corpusdelaparole.in2p3.fr/. Accessed 20 June 2012.

  • Corpus of Spoken Slovak. (2012). URL: http://korpus.juls.savba.sk/shk.html. Accessed 20 June 2012.

  • Cresti, E., Bacelar do Nascimento, F., Moreno Sandoval, A., Veronis, J., Martin, P., & Choukri, K. (2004). The C-ORAL-ROM CORPUS: A multilingual resource of spontaneous speech for romance languages. In Proceedings of the fourth international conference on language resources and evaluation (LREC’04). Lisbon, Portugal.

  • Czech National Corpus—ORAL2008. (2008). Institute of the Czech National Corpus, Praha. URL: http://www.korpus.cz. Accessed 20 June 2012.

  • Czech National Corpus—SYN. (2012). Institute of the Czech National Corpus, Praha. URL: http://ucnk.ff.cuni.cz/english/. Accessed 15 November 2012.

  • Deutsches Spracharchiv (DSAv) und Datenbank Gesprochenes Deutsch (DGD). (2012). URL: http://dsav-oeff.ids-mannheim.de/. Accessed 20 June 2012.

  • EAGLES. (1996). Preliminary recommendations on spoken texts. EAGLES Document EAGTCWG-STP/P.

  • Erjavec, T. (2010). TEI Schema for GOS speech corpus of Slovene. URL: http://nl.ijs.si/ssj/gos/schema/tei_gos_doc.pdf. Accessed 20 June 2012.

  • Erjavec, T., Krek, S., Arhar, Š., Fišer, D., Ledinek, N., Saksida, A., Sivec, B., & Trebar, B. (2010). Oblikoskladenjske specifikacije JOS, v1.1. URL: http://nl.ijs.si/jos/msd/html-sl/. Accessed 20 June 2012.

  • FIDAplus, korpus slovenskega jezika. (2012). URL: http://www.fidaplus.net/. Accessed 20 June 2012.

  • Göteborg Spoken Language Corpus, GSLC. (2012). URL: http://www.ling.gu.se/projekt/tal/index.cgi?PAGE=3. Accessed 20 June 2012.

  • GOS, GOvorjena Slovenščina. (2012). URL: www.korpus-gos.net. Accessed 20 June 2012.

  • Grčar, M., Krek, S., & Dobrovoljc, K. (2012). Obeliks: statistični oblikoskladenjski označevalnik in lematizator za slovenski jezik. In T. Erjavec & J. Žganec Gros (Eds.), Zbornik Osme konference Jezikovne tehnologije. Slovenia: Ljubljana.

  • Grishina, E. (2006). Spoken Russian in the Russian National Corpus (RNC). In Proceedings of the fifth international conference on language resources and evaluation (LREC’06). Genova, Italy. URL: http://www.lrec-conf.org/proceedings/lrec2006/. Accessed 15 November 2012.

  • Hong Kong Corpus of Spoken English. (2012). URL: http://rcpce.engl.polyu.edu.hk/HKCSE/. Accessed 20 June 2012.

  • Izre’el, S., Hary, B., & Rahav, G. (2001). Designing CoSIH: The corpus of spoken Israeli Hebrew. International Journal of Corpus Linguistics, 6, 171–197.

  • Jacobson, M., & Baude, O. (2011). Corpus de la parole: Collecte, catalogage, conservation et diffusion des ressources orales sur le français et les langues de France. Traitement Automatique des Langues, 52(3), 47–69.

  • Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The sketch engine. In G. Williams & S. Vessier (Eds.), Proceedings of the 11th Euralex international congress (pp. 105–116). Lorient: Universite de Bretagne-Sud.

  • LABLITA Corpus of Spontaneous Spoken Italian. URL: http://lablita.dit.unifi.it/corpora/descriptions/lablita. Accessed 20 June 2012.

  • Mello, H., & Raso, T. (2009). Para a transcrição da fala espontânea: O caso do C-ORAL–BRASIL. Revista Portuguesa de Humanidades, 13(1), 153–178.

  • MICASE—Michigan Corpus of Academic Spoken English. (2012). URL: http://quod.lib.umich.edu/m/micase/. Accessed 20 June 2012.

  • Oostdijk, N., Goedertier, W., Van Eynde, F., Boves, L., Martens, J. P., Moortgat, M., et al. (2002). Experiences from the Spoken Dutch corpus project. In M. González Rodriguez, C. Paz Suárez Araujo (Eds.), Proceedings of the third international conference on language resources and evaluation (LREC’02) (pp. 340–347). Las Palmas, Canary Islands.

  • Pořízka, P. (2009). Olomouc corpus of Spoken Czech: Characterization and main features of the project. Linguistik online, 38/2. URL: http://www.linguistik-online.de/38_09/porizka.html. Accessed 20 June 2012.

  • Przepiórkowski, A., Górski, R. L., Lewandowska-Tomaszczyk, B., & Łazinski, M. (2008). Towards the National Corpus of Polish. In Proceedings of the international conference on language resources and evaluation (LREC’08). Marrakech, Morocco. URL: http://www.lrec-conf.org/proceedings/lrec2008/pdf/211_paper.pdf. Accessed 15 November 2012.

  • Rotovnik, T., Sepesy Maučec, M., & Kačič, Z. (2007). Large vocabulary continuous speech recognition of an inflected language using stems and endings. Speech Communication, 49(6), 437–452.

  • Rusko, M., & Garabík, R. (2007). Corpus of Spoken Slovak language. In J. Levická & R. Garabík (Eds.), Computer treatment of Slavic and East European Languages. Proceedings of the conference Slovko 2007 (pp. 222–236). Brno: Tribun.

  • Russian National Corpus. (2012). URL: http://ruscorpora.ru/en/search-spoken.html. Accessed 20 June 2012.

  • SACODEYL, European Youth language. (2012). URL: http://sacodeyl.inf.um.es/sacodeyl-search2/. Accessed 20 June 2012.

  • Savy, R., & Cutugno, F. (2009). CLIPS: Diatopic, diamesic and diaphasic variations in spoken Italian. In M. Mahlberg, V. González-Díaz, & C. Smith (Eds.), On-line proceedings of 5th corpus linguistics conference. URL: http://ucrel.lancs.ac.uk/publications/cl2009/, article 213. Accessed 15 November 2012.

  • Sketch Engine. (2012). URL: http://www.sketchengine.co.uk/. Accessed 20 June 2012.

  • Slovene Personal Data Protection Act (Zakon o varstvu osebnih podatkov—ZVOP-1, UL RS 86/4). (2012). http://zakonodaja.gov.si/rpsi/r06/predpis_ZAKO3906.html. Accessed 15 November 2012.

  • Spoken Dutch Corpus/Corpus Gesproken Nederlands. (2012). URL: http://lands.let.kun.nl/cgn/. Accessed 20 June 2012.

  • The British National Corpus, version 3 (BNC XML Edition). (2007). Distributed by Oxford University computing services on behalf of the BNC consortium. URL: http://www.natcorp.ox.ac.uk/. Accessed 20 June 2012.

  • Wagener, P. (2005). DGD—Datenbank Gesprochenes Deutsch Archivierung, Dokumentation und Erschließung des Deutschen Spracharchivs. Sprachreport, 3, 23–26.

  • Welsh, I. (1997). Trainspotting. (Translation: A. Skubic.) Ljubljana: DZS.

  • Widmann, J., Kohn, K., & Ziai, R. (2008). The SACODEYL search tool. Exploiting corpora for language learning purposes. In A. Frankenberg-Garcia, T. Rkibi, M. R. Cruz, R. Carvalho, C. Direito, & D. Santos-Rosa (Eds.), Proceedings of the 8th TALC conference (pp. 321–327). Lisbon: ISLA.

  • Zemljak Jontes, M., Kačič, Z., Dobrišek, S., Žganec Gros, J., & Weiss, P. (2002). Računalniški simbolni fonetični zapis slovenskega govora. Slavistična revija, 50(2), 159–169.

  • Žgank, A. (2010). Three-stage framework for unsupervised acoustic modeling using untranscribed spoken content. ETRI Journal, 32(5), 810–818.

  • Žgank, A., Rotovnik, T., & Sepesy Maučec, M. (2008). Slovenian spontaneous speech recognition and acoustic modelling of filled pauses and onomatopoeas. WSEAS Transactions on Signal Processing, 4(7), 388–397.

Acknowledgments

The GOS corpus was built within the Communication in Slovene project. The operation was partly financed by the European Union, the European Social Fund, and the Ministry of Education, Science, Culture and Sport of the Republic of Slovenia. The operation was carried out within the operational programme Human Resources Development for the period 2007–2013, developmental priorities: improvement of the quality and efficiency of educational and training systems 2007–2013. The online search tool, a concordancer, was developed as part of the project “Online concordancer for the national spoken corpus of Slovene”, funded within the priority axis Economic Development Infrastructure and the priority strategy Information Society within the operational programme “Strengthening Regional Development Potentials” for the period 2007–2013. We thank all the speakers who participated in the recordings for the GOS corpus, the media companies that provided recordings from their archives, and the other companies and institutions, especially schools, where the recordings of official discourse were conducted.

Author information

Corresponding author

Correspondence to Darinka Verdonik.

About this article

Cite this article

Verdonik, D., Kosem, I., Vitez, A.Z. et al. Compilation, transcription and usage of a reference speech corpus: the case of the Slovene corpus GOS. Lang Resources & Evaluation 47, 1031–1048 (2013). https://doi.org/10.1007/s10579-013-9216-5

  • DOI: https://doi.org/10.1007/s10579-013-9216-5
