The Janes project: language resources and tools for Slovene user generated content

  • Darja Fišer
  • Nikola Ljubešić
  • Tomaž Erjavec
Project Notes

Abstract

The paper presents the results of the Janes project, which aimed to develop language resources and tools for Slovene user generated content. It first describes the 200-million-word Janes corpus, which contains tweets, forum posts, news comments, user and talk pages from Wikipedia, and blogs and blog comments, with each text accompanied by rich metadata. It then presents the processing tools developed for Slovene user generated content: a tokeniser, a word normaliser, a part-of-speech tagger and lemmatiser, and a named entity recogniser. A set of manually annotated datasets was also produced, both for tool training and for linguistic research. The resources and tools are publicly available under Creative Commons licences in the repository of the CLARIN.SI research infrastructure and on GitHub, and the corpora are also available through the CLARIN.SI concordancers.

Keywords

User generated content, Slovene language, Corpora, Manually annotated datasets, Text normalisation

Notes

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments, which helped us improve the paper. We would also like to thank Avtomobilizem.net, Kvarkadabra, MedOverNet, Mladina, Reporter and RTV Slovenija for permission to include their content in the Janes corpus. We are grateful to Jaka Čibej, Teja Goli, Dafne Marko, Eneja Osrajnik, Senja Pollak and Iza Škrjanec for manual metadata enrichment; to Špela Arhar Holdt, Jaka Čibej, Kaja Dobrovoljc, Simon Krek and Katja Zupan for their help with the development of the annotation guidelines; and to the annotators Teja Goli, Melanija Kožar, Vesna Koželj, Polona Logar, Klara Lubej, Dafne Marko, Barbara Omahen, Eneja Osrajnik, Predrag Petrović, Polona Polc, Aleksandra Rajković, Špela Reher, Iza Škrjanec and Katja Zupan for their dedicated work throughout the campaign. The work described in this paper was funded by the Slovenian Research Agency within the national basic research projects “Resources, Tools and Methods for the Research of Nonstandard Internet Slovene” (J6-6842, 2014–2018) and “Resources, methods and tools for the understanding, identification and classification of various forms of socially unacceptable discourse in the information society” (J7-8280, 2017–2020).

Copyright information

© Springer Nature B.V. 2018

Authors and Affiliations

  1. Department of Translation, Faculty of Arts, University of Ljubljana, Ljubljana, Slovenia
  2. Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb, Zagreb, Croatia
  3. Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia