The Janes project: language resources and tools for Slovene user generated content

  • Project Notes
  • Published in Language Resources and Evaluation

Abstract

The paper presents the results of the Janes project, which aimed to develop language resources and tools for Slovene user generated content. The paper first describes the 200 million word Janes corpus, containing tweets, forum posts, news comments, user and talk pages from Wikipedia, and blogs and blog comments, where each text is accompanied by rich metadata. It then presents the processing tools developed for Slovene user generated content: a tokeniser, a word normaliser, a part-of-speech tagger and lemmatiser, and a named entity recogniser. A set of manually annotated datasets was also produced, both for tool training and for linguistic research. The developed resources and tools are made publicly available under Creative Commons licences in the repository of the CLARIN.SI research infrastructure and on GitHub, while the corpora are also available through the CLARIN.SI concordancers.

Notes

  1. Since in the Janes project the focus was on the non-standard linguistic features of UGC, we did not include the regular Wikipedia articles because they are written for a mass audience, in a formal register, objective and impersonal style, and in standard language, very much like regular informative texts.

  2. The latest 3,200 tweets of a user can be harvested.

  3. For this we used BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/), writing a specialised grammar for each source of the HTML-based subcorpora, i.e. Forums, News and Blogs.

  4. https://github.com/clarinsi/wikitalk-extractor.

  5. https://www.clarin.si/kontext/, source available from https://github.com/ufal/lindat-kontext.

  6. https://www.clarin.si/noske/, source available from https://nlp.fi.muni.cz/trac/noske.

  7. https://www.clarin.eu/content/content-search.

  8. http://clarin.si/repository/xmlui.

  9. https://www.github.com/clarinsi/tweetpub.

  10. http://www.nltk.org/api/nltk.tokenize.html.

  11. https://www.github.com/clarinsi/csmtiser.

  12. http://nl.ijs.si/ME/V5/msd/.

  13. https://www.github.com/clarinsi/janes-ner.

  14. http://nl.ijs.si/janes/dogodki/.
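
Note 3 mentions source-specific extraction rules for the HTML-based subcorpora. The following is only a minimal sketch of what one such rule might look like; it is not the project's code. To stay runnable without extra dependencies it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup, and the class names (`post-body`, `comment-text`), the `SOURCE_RULES` table and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text of every <div> whose class matches one source rule."""

    def __init__(self, post_class):
        super().__init__()
        self.post_class = post_class  # hypothetical per-source CSS class
        self.depth = 0                # > 0 while inside a matching <div>
        self.chunks = []              # text pieces of the post being read
        self.posts = []               # finished posts

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1           # nested <div> inside a post
        elif self.post_class in (dict(attrs).get("class") or "").split():
            self.depth = 1            # a new post starts here

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:       # the post's own <div> just closed
                text = "".join(self.chunks)
                self.posts.append(" ".join(text.split()))  # squeeze whitespace
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_posts(html, post_class):
    """Return the plain text of each post found in one HTML page."""
    parser = PostExtractor(post_class)
    parser.feed(html)
    parser.close()
    return parser.posts


# Hypothetical rules: one class name per source; real sites differ.
SOURCE_RULES = {"forum": "post-body", "news": "comment-text"}

sample = '<div class="post-body">Zdravo, <b>kako</b> si?</div><div class="ad">x</div>'
print(extract_posts(sample, SOURCE_RULES["forum"]))
```

Keeping one rule per source in a small table like `SOURCE_RULES` means a new forum or news site only needs a new entry rather than a new parser, which is one plausible reading of "specialised grammars for each source".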

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments, which helped us improve the paper. We would also like to thank Avtomobilizem.net, Kvarkadabra, MedOverNet, Mladina, Reporter and RTV Slovenija for permission to include their content in the Janes corpus. We are grateful to Jaka Čibej, Teja Goli, Dafne Marko, Eneja Osrajnik, Senja Pollak and Iza Škrjanec for manual metadata enrichment; to Špela Arhar Holdt, Jaka Čibej, Kaja Dobrovoljc, Simon Krek and Katja Zupan for their help with the development of the annotation guidelines; and to the annotators Teja Goli, Melanija Kožar, Vesna Koželj, Polona Logar, Klara Lubej, Dafne Marko, Barbara Omahen, Eneja Osrajnik, Predrag Petrović, Polona Polc, Aleksandra Rajković, Špela Reher, Iza Škrjanec and Katja Zupan for their dedicated work throughout the campaign. The work described in this paper was funded by the Slovenian Research Agency within the national basic research projects “Resources, Tools and Methods for the Research of Nonstandard Internet Slovene” (J6-6842, 2014–2018) and “Resources, methods and tools for the understanding, identification and classification of various forms of socially unacceptable discourse in the information society” (J7-8280, 2017–2020).

Author information

Corresponding author

Correspondence to Darja Fišer.

About this article

Cite this article

Fišer, D., Ljubešić, N. & Erjavec, T. The Janes project: language resources and tools for Slovene user generated content. Lang Resources & Evaluation 54, 223–246 (2020). https://doi.org/10.1007/s10579-018-9425-z

  • DOI: https://doi.org/10.1007/s10579-018-9425-z
