Abstract
The paper presents an extended N-gram model designed for the analysis of texts in Polish. One possible application of the model is the automatic detection and correction of errors that occur during computer-based text editing. N-grams belong to the group of statistical methods in Natural Language Processing (NLP). They are created by analyzing sufficiently large language data resources called corpora. In the classic version, N-grams represent sequences of words of a given length that appear in the analyzed language resources. The presented approach introduces N-grams that also include the results of morphological analysis of the texts. As a result, three types of N-grams may be distinguished: lexical (containing the original words from the text or their base forms), morphosyntactic (sequences of morphosyntactic tags assigned to words) and mixed (a combination of lexical and morphological description). The extended model with the new types of N-grams accommodates language properties specific to Polish, such as free word order and complex inflection.
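The three N-gram types named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Token` structure, the function names, the hand-assigned simplified tags, and the reading of "mixed" as "words at chosen positions, tags elsewhere" are all assumptions made for illustration only.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    orth: str   # surface form as it appears in the text
    lemma: str  # base (dictionary) form
    tag: str    # morphosyntactic tag (simplified, hand-assigned here)


def ngrams(items, n):
    """Return every contiguous window of n items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def lexical(tokens, n, use_lemma=False):
    """Lexical N-grams: original words, or their base forms if use_lemma."""
    return ngrams([t.lemma if use_lemma else t.orth for t in tokens], n)


def morphosyntactic(tokens, n):
    """Morphosyntactic N-grams: sequences of tags only."""
    return ngrams([t.tag for t in tokens], n)


def mixed(tokens, n, lexical_positions):
    """Mixed N-grams (one assumed reading): keep the word at the given
    positions inside each window and the tag at all other positions."""
    return [
        tuple(t.orth if i in lexical_positions else t.tag
              for i, t in enumerate(window))
        for window in ngrams(tokens, n)
    ]


# Hypothetical, manually annotated sentence: "Ala ma kota" (Ala has a cat).
sentence = [
    Token("Ala", "Ala", "subst:sg:nom:f"),
    Token("ma", "mieć", "fin:sg:ter:imperf"),
    Token("kota", "kot", "subst:sg:acc:m2"),
]

word_bigrams = lexical(sentence, 2)
lemma_bigrams = lexical(sentence, 2, use_lemma=True)
tag_bigrams = morphosyntactic(sentence, 2)
mixed_bigrams = mixed(sentence, 2, lexical_positions={0})

# Counting N-grams over a corpus would yield the statistics of the model.
tag_counts = Counter(tag_bigrams)
```

In a real setting the tokens would come from a morphological analyzer and tagger run over a corpus rather than from a hand-written list, and the counts would be accumulated over many sentences.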
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Banasiak, D., Mierzwa, J., Sterna, A. (2018). Extended N-gram Model for Analysis of Polish Texts. In: Gruca, A., Czachórski, T., Harezlak, K., Kozielski, S., Piotrowska, A. (eds) Man-Machine Interactions 5. ICMMI 2017. Advances in Intelligent Systems and Computing, vol 659. Springer, Cham. https://doi.org/10.1007/978-3-319-67792-7_35
DOI: https://doi.org/10.1007/978-3-319-67792-7_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67791-0
Online ISBN: 978-3-319-67792-7
eBook Packages: Engineering, Engineering (R0)