Annotations in the Nordic Dialect Corpus

  • Janne Bondi Johannessen


In this chapter I focus on annotation in the Nordic Dialect Corpus, a corpus of dialectal speech from five closely related languages. Two types of annotation are central: the annotation of the speech itself, i.e. transcription, and the annotation of grammatical categories, i.e. tagging. Both are described and discussed, with a special focus on the success, or lack thereof, of some key choices.


Linguistic basis; Speech corpus; Nordic languages; Transcription; Tagging; Maps





Copyright information

© Springer Science+Business Media Dordrecht 2017

Authors and Affiliations

  1. The Text Laboratory and MultiLing, Department of Linguistics and Nordic Studies, University of Oslo, Oslo, Norway
