Stylometry Analysis of Literary Texts in Polish

Walkowiak, Tomasz; Piasecki, Maciej

doi:10.1007/978-3-319-91262-2_68


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10842)

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

Abstract

In this work we compare methods for deriving text-representation features in two stylometric tasks: gender recognition and author recognition. The first group of methods uses the Bag-of-Words (BoW) approach, which represents each document as a vector of frequencies of selected features occurring in it. We analyze the following features: the 1000 most frequent lemmas, word forms, all lemmas, selected (content-insensitive) lemmas, bigrams of grammatical classes, and a mixture of grammatical-class bigrams, selected lemmas, and punctuation marks. In addition, we apply an approach based on the recently proposed fastText algorithm for vector-based text representation. We evaluate these approaches on two publicly available collections of Polish literary texts from the late 19th and early 20th centuries: one consisting of 99 novels by 33 authors, and a second consisting of 888 novels by 58 authors. Our study suggests that, depending on the corpus, the best results are obtained with either style features (grammatical bigrams) or semantic features (1000 lemmas extracted from the training set). We also observed that a proper division of the corpora into training and testing sets is important.
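The Bag-of-Words features named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (top_features, bigrams, bow_vector), the toy lemmas, and the tag names are assumptions made for illustration, and the input is assumed to be already lemmatized and tagged by an external tool. The sketch selects the most frequent lemmas from the training documents only, builds bigrams of grammatical classes, and turns a document into a vector of relative frequencies.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def top_features(train_docs: Iterable[Sequence[str]], n: int) -> List[str]:
    """Return the n most frequent tokens, counted over the training documents only."""
    counts: Counter = Counter()
    for doc in train_docs:
        counts.update(doc)
    return [token for token, _ in counts.most_common(n)]


def bigrams(tags: Sequence[str]) -> List[str]:
    """Join each pair of adjacent grammatical-class tags into one bigram feature."""
    return [f"{a}_{b}" for a, b in zip(tags, tags[1:])]


def bow_vector(doc: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Relative frequency of every vocabulary feature in a single document."""
    counts = Counter(doc)
    total = len(doc) or 1  # avoid division by zero for an empty document
    return [counts[feature] / total for feature in vocabulary]


# Toy, hypothetical data: each document is a list of lemmas.
train_lemmas = [
    ["być", "w", "dom", "i", "być"],
    ["i", "w", "las", "być"],
]

# Vocabulary of the 3 most frequent lemmas (the paper uses 1000).
lemma_vocab = top_features(train_lemmas, 3)

# Represent an unseen document with the training-set vocabulary.
test_vector = bow_vector(["być", "i", "i", "noc"], lemma_vocab)

# Style features: bigrams of (illustrative) grammatical-class tags.
tag_bigrams = bigrams(["prep", "subst", "fin"])
```

Building the vocabulary from the training documents alone keeps information from the test documents out of the feature selection step, which is one reason the split between training and testing data matters. The same bow_vector function can be applied to the output of bigrams to obtain grammatical-bigram frequency vectors.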

Notes

1. http://ws.clarin-pl.eu/websty.shtml
2. Lemmas are understood here simply as basic morphological forms selected to represent sets of word forms that differ only in the values of grammatical categories such as number, gender, or person.
3. http://ws.clarin-pl.eu/demo/inc/nkjp360-meaningless-no-prep-freq-above-3500.txt
4. Among the 1,000 most frequent lemmas, many express semantic content and are correlated with the topics of the text.

Author information

Authors and Affiliations

Faculty of Electronics, Wrocław University of Science and Technology, Wrocław, Poland
Tomasz Walkowiak
Faculty of Computer Science and Management, Wrocław University of Science and Technology, Wrocław, Poland
Maciej Piasecki

Corresponding author

Correspondence to Tomasz Walkowiak.

Editor information

Editors and Affiliations

Częstochowa University of Technology, Częstochowa, Poland
Leszek Rutkowski
Częstochowa University of Technology, Częstochowa, Poland
Rafał Scherer
Częstochowa University of Technology, Częstochowa, Poland
Marcin Korytkowski
University of Alberta, Edmonton, AB, Canada
Witold Pedrycz
AGH University of Science and Technology, Kraków, Poland
Ryszard Tadeusiewicz
University of Louisville, Louisville, KY, USA
Jacek M. Zurada

About this paper

Cite this paper

Walkowiak, T., Piasecki, M. (2018). Stylometry Analysis of Literary Texts in Polish. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2018. Lecture Notes in Computer Science, vol 10842. Springer, Cham. https://doi.org/10.1007/978-3-319-91262-2_68

DOI: https://doi.org/10.1007/978-3-319-91262-2_68
Published: 11 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91261-5
Online ISBN: 978-3-319-91262-2
eBook Packages: Computer Science, Computer Science (R0)
