Language Resources and Evaluation, Volume 49, Issue 3, pp 753–775

The IMP historical Slovene language resources

Project Notes

Abstract

The paper describes the combined results of several projects which constitute a basic language resource infrastructure for printed historical Slovene. The IMP language resources consist of a digital library, an annotated corpus and a lexicon, which are interlinked and uniformly encoded following the Text Encoding Initiative Guidelines. The library holds about 650 units (mostly complete books) consisting of facsimiles with 45,000 pages as well as hand-corrected and structured transcriptions. The hand-annotated corpus has 300,000 tokens, where each word is tagged with its modernised word form, lemma, part-of-speech and, in cases of archaic words, its nearest contemporary equivalents. This information was extracted into the lexicon, which also covers an extended target-annotated corpus, resulting in 20,000 lemmas (of these 4,000 archaic) with 50,000 modern word forms and 70,000 attested forms. We have also developed a program to modernise, tag and lemmatise historical Slovene, and annotated the digital library with it, producing an automatically annotated corpus of 15 million words. To serve the humanities, the digital library and lexicon are available for reading and browsing on the web and the corpora via a concordancer. For language technology research and development the resources are available in source TEI XML under the Creative Commons Attribution licence. The paper presents the IMP resources, available from http://nl.ijs.si/imp/, the process of their compilation, encoding and dissemination, and concludes with directions for future research.
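The per-token annotation described above (attested form, modernised form, lemma, part-of-speech and, for archaic words, a contemporary equivalent) and its extraction into a lexicon can be sketched as follows. This is a minimal illustration only: the sample tokens are invented, and the attribute names `norm`, `lemma`, `pos` and `gloss` are assumptions chosen for readability, not the actual IMP TEI encoding.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Dict, List, Optional

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sample sentence. The attribute names are hypothetical and do not
# reproduce the IMP schema; the fragment is not claimed to be schema-valid TEI.
SAMPLE = """
<s xmlns="http://www.tei-c.org/ns/1.0">
  <w norm="človek" lemma="človek" pos="noun">zhlovek</w>
  <w norm="žlahta" lemma="žlahta" pos="noun" gloss="sorodstvo">shlahta</w>
</s>
"""


@dataclass
class Token:
    attested: str          # form as printed in the historical text
    modernised: str        # modernised word form
    lemma: str
    pos: str
    gloss: Optional[str]   # contemporary equivalent, only for archaic words


def read_tokens(xml_text: str) -> List[Token]:
    """Collect every <w> element of the fragment as a Token."""
    root = ET.fromstring(xml_text.strip())
    return [
        Token(
            attested=(w.text or "").strip(),
            modernised=w.get("norm", ""),
            lemma=w.get("lemma", ""),
            pos=w.get("pos", ""),
            gloss=w.get("gloss"),
        )
        for w in root.iter(TEI + "w")
    ]


def build_lexicon(tokens: List[Token]) -> Dict[str, dict]:
    """Group modern and attested forms under each lemma."""
    lexicon: Dict[str, dict] = {}
    for t in tokens:
        entry = lexicon.setdefault(
            t.lemma, {"modern": set(), "attested": set(), "archaic": False}
        )
        entry["modern"].add(t.modernised)
        entry["attested"].add(t.attested)
        # In this sketch a lemma counts as archaic if any token carries a gloss.
        if t.gloss is not None:
            entry["archaic"] = True
    return lexicon


tokens = read_tokens(SAMPLE)
lexicon = build_lexicon(tokens)
```

Keeping the attested form as element text and the annotations as attributes is only one possible layout; the point of the sketch is that a lexicon of lemmas, modern forms and attested forms can be derived mechanically from a token-level annotated corpus.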

Keywords

Historical language resources; Slovene language; Text Encoding Initiative; Non-standard language normalisation

Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  1. Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
