Abstract
This contribution provides a representative sample of colloquial Slovak in an organized corpus. The corpus makes it possible to study spontaneous, interactive communication, which often contains misspelled or nonstandard words. It comprises the complete set of web discussions on various topics from a single site. Each discussion is annotated with its topic and speaker and is assigned to a specific section. The corpus is indexed so that it can be searched easily with regular expressions. The text of the discussions is processed with our tools for word tokenization, sentence boundary detection, and morphological analysis. Each token annotation also includes a corrected word form proposed by a statistical correction system.
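To make the annotation scheme described above more concrete, the following is a minimal sketch in Python of how one annotated post might be represented and searched. It is not the authors' format or tooling: the names `Token`, `Post`, and `search`, the field names, and the sample data are all assumptions made for illustration, and a plain linear scan stands in for the corpus index.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    surface: str    # word form as typed in the discussion
    corrected: str  # form proposed by a correction step (hypothetical field)
    tag: str = ""   # morphological tag, left empty in this sketch


@dataclass
class Post:
    section: str    # site section the discussion belongs to
    topic: str      # discussion topic
    speaker: str    # author of the post
    tokens: List[Token] = field(default_factory=list)

    def text(self, corrected: bool = False) -> str:
        """Join the tokens using either the surface or the corrected forms."""
        return " ".join(t.corrected if corrected else t.surface for t in self.tokens)


def search(posts: List[Post], pattern: str, corrected: bool = False) -> List[Post]:
    """Return the posts whose text matches a regular expression (linear scan)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [p for p in posts if rx.search(p.text(corrected))]


# Invented sample data: posts typed without diacritics, as is common on the web.
posts = [
    Post("Sport", "Hokej", "user1",
         [Token("dakujem", "ďakujem"), Token("za", "za"), Token("info", "info")]),
    Post("Auto", "Servis", "user2",
         [Token("dobry", "dobrý"), Token("den", "deň")]),
]

hits = search(posts, r"\bďakujem\b", corrected=True)
```

Searching over the corrected forms finds the first post even though its author typed the word without diacritics; the same pattern applied to the surface forms finds nothing.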
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Hládek, D., Staš, J., Juhár, J. (2014). Slovak Web Discussion Corpus. In: Przepiórkowski, A., Ogrodniczuk, M. (eds) Advances in Natural Language Processing. NLP 2014. Lecture Notes in Computer Science, vol 8686. Springer, Cham. https://doi.org/10.1007/978-3-319-10888-9_45
DOI: https://doi.org/10.1007/978-3-319-10888-9_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10887-2
Online ISBN: 978-3-319-10888-9
eBook Packages: Computer Science (R0)