Stylometry Analysis of Literary Texts in Polish

  • Tomasz Walkowiak
  • Maciej Piasecki
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10842)

Abstract

In this work we compare different methods of deriving features for text representation in two stylometric tasks: gender recognition and author recognition. The first group of methods uses the Bag-of-Words (BoW) approach, which represents each document as a vector of frequencies of selected features occurring in that document. We analyze the following feature sets: the 1000 most frequent lemmas, word forms, all lemmas, selected (content-insensitive) lemmas, bigrams of grammatical classes, and a mixture of bigrams of grammatical classes, selected lemmas and punctuation marks. In addition, we apply an approach based on the recently proposed fastText algorithm for vector-based representation of text. We evaluate these approaches on two publicly available collections of Polish literary texts from the late 19th and early 20th centuries: one consisting of 99 novels by 33 authors, and the other of 888 novels by 58 authors. Our study suggests that, depending on the corpus, the best results are obtained with either style features (grammatical bigrams) or semantic features (1000 lemmas extracted from the training set). We also observed that the way a corpus is divided into training and testing sets has an important effect on the results.
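The Bag-of-Words representation described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the documents have already been lemmatized and tagged by an external tool, and the function names (`top_features`, `bow_vector`, `class_bigrams`), the toy data and the use of relative frequencies are all assumptions made for illustration. The feature list is built from the training documents only, in line with the abstract's mention of lemmas extracted from the training set.

```python
from collections import Counter


def top_features(train_docs, k=1000):
    """Pick the k most frequent tokens over the training documents only."""
    counts = Counter()
    for tokens in train_docs:
        counts.update(tokens)
    return [token for token, _ in counts.most_common(k)]


def bow_vector(tokens, features):
    """Relative frequency of each selected feature in a single document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty documents
    return [counts[f] / total for f in features]


def class_bigrams(tags):
    """Turn a sequence of grammatical-class tags into bigram tokens."""
    return [f"{a}_{b}" for a, b in zip(tags, tags[1:])]


# Toy lemmatized training documents (hypothetical data).
train = [["kot", "spać", "i", "kot", "jeść"], ["pies", "i", "kot", "biec"]]
features = top_features(train, k=2)

# Vector for an unseen document, using the features fixed on the training set.
vector = bow_vector(["kot", "i", "kot", "pies"], features)

# Grammatical-class bigrams can be fed through the same two functions.
tag_bigrams = class_bigrams(["subst", "fin", "conj", "subst"])
```

The same `bow_vector` call works for lemmas, word forms or tag bigrams, because each feature set is just a different tokenization of the document. Only the list passed as `features` changes.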

Keywords

Stylometric, Natural language processing, Polish, Text analysis, Bag of words, Machine learning

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Faculty of Electronics, Wrocław University of Science and Technology, Wrocław, Poland
  2. Faculty of Computer Science and Management, Wrocław University of Science and Technology, Wrocław, Poland
