Language Resources and Evaluation, Volume 47, Issue 4, pp 1031–1048

Compilation, transcription and usage of a reference speech corpus: the case of the Slovene corpus GOS

  • Darinka Verdonik
  • Iztok Kosem
  • Ana Zwitter Vitez
  • Simon Krek
  • Marko Stabej
Original Paper


In recent years, building reference speech corpora has been an important part of providing the necessary linguistic infrastructure in many European countries, both for languages with many speakers (e.g., French, German, Spanish, Italian) and for those with fewer speakers (e.g., Swedish, Dutch, Czech, Slovak). This paper describes how a reference speech corpus is created and distributed to potential users, taking the Slovene corpus GOS as a case. It covers the corpus structure, fieldwork experience with recording, the labelling system, and the two levels of transcription (pronunciation-based and standardized), as well as the main characteristics of the corpus interface (a web concordancer) and the availability of the original corpus files.


Spoken language, Discourse, Recordings, Transcription conventions, Web concordancer



The GOS corpus was built within the Communication in Slovene project. The operation was partly financed by the European Union, the European Social Fund, and the Ministry of Education, Science, Culture and Sport of the Republic of Slovenia. It was carried out within the operational programme Human Resources Development for the period 2007–2013, developmental priorities: improvement of the quality and efficiency of educational and training systems 2007–2013. The online search tool, a concordancer, was made as part of the project “Online concordancer for the national spoken corpus of Slovene”, funded within the priority axis Economic Development Infrastructure and the priority strategy Information Society within the operational programme “Strengthening Regional Development Potentials” for the period 2007–2013. We thank all the speakers who participated in the recordings for the GOS corpus, the media companies that provided recordings from their archives, and the other companies and institutions, especially schools, where the recordings of official discourse were conducted.



Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Darinka Verdonik (1)
  • Iztok Kosem (2)
  • Ana Zwitter Vitez (2)
  • Simon Krek (3)
  • Marko Stabej (4)

  1. University of Maribor, Maribor, Slovenia
  2. Trojina, Institute for Applied Slovene Studies, Škofja Loka, Slovenia
  3. Amebis, d.o.o., Kamnik, Slovenia
  4. University of Ljubljana, Ljubljana, Slovenia
