Part of the book series: Communications in Computer and Information Science (CCIS, volume 716)

Abstract

Authorship attribution aims to identify the author of an unseen text document based on text samples from different authors. In this paper we focus on authorship attribution of Polish texts using stylometric features based on part-of-speech (POS) tags. Polish is a highly inflected language, and in consequence over 1000 POS tags can be distinguished. This makes it possible to build a sufficiently large feature space by extracting POS information from documents and to classify them with machine learning methods. We report the results of experiments conducted with the Weka workbench using combinations of the following features: POS tags, an approximation of their bigrams, and simple document statistics.
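The feature families named in the abstract can be illustrated with a minimal sketch. It is not the pipeline used in the paper: the experiments there were run in Weka, and the abstract does not say how the bigram approximation is computed, so plain adjacent-tag bigrams are used here as a stand-in. The function names (`pos_features`, `attribute`), the nearest-centroid classifier, the mean token length statistic, and the toy tagged documents are all assumptions made for illustration; a real setup would obtain tags from a Polish morphological tagger.

```python
from collections import Counter
import math


def pos_features(tagged_doc):
    """Map a tagged document [(token, tag), ...] to a sparse feature dict."""
    if not tagged_doc:
        raise ValueError("document must contain at least one token")
    tags = [tag for _, tag in tagged_doc]
    n = len(tags)
    feats = {}
    # Relative frequency of each POS tag.
    for tag, count in Counter(tags).items():
        feats["tag=" + tag] = count / n
    # Relative frequency of adjacent tag pairs (plain bigrams: an assumption,
    # not the approximation used in the paper).
    pairs = list(zip(tags, tags[1:]))
    for (first, second), count in Counter(pairs).items():
        feats["bigram=" + first + "_" + second] = count / len(pairs)
    # One simple document statistic (hypothetical choice): mean token length.
    feats["stat=mean_token_len"] = sum(len(tok) for tok, _ in tagged_doc) / n
    return feats


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Average a list of sparse vectors; missing keys count as zero."""
    total = Counter()
    for vec in vectors:
        for key, value in vec.items():
            total[key] += value
    return {key: value / len(vectors) for key, value in total.items()}


def attribute(train, unseen):
    """Return the author whose centroid profile is most similar to `unseen`.

    `train` maps an author name to a list of tagged documents.
    """
    profiles = {
        author: centroid([pos_features(doc) for doc in docs])
        for author, docs in train.items()
    }
    x = pos_features(unseen)
    return max(profiles, key=lambda author: cosine(profiles[author], x))


# Toy data with simplified, illustrative tags (not real tagger output).
TRAIN = {
    "A": [
        [("kot", "subst"), ("czarny", "adj"), ("dom", "subst"), ("stary", "adj")],
        [("dom", "subst"), ("nowy", "adj")],
    ],
    "B": [
        [("idzie", "fin"), ("szybko", "adv"), ("biegnie", "fin"), ("wolno", "adv")],
        [("czyta", "fin"), ("wiele", "adv")],
    ],
}
UNSEEN = [("pies", "subst"), ("wielki", "adj"), ("las", "subst"), ("ciemny", "adj")]

print(attribute(TRAIN, UNSEEN))
```

In this sketch the mean token length lies on a larger scale than the tag frequencies, so a realistic feature set would likely be standardized before classification. With the full Polish tagset of over 1000 tags, the number of possible tag pairs grows quadratically, which is a plausible reason to work with an approximation of bigrams rather than the exact pairs used here.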

Corresponding author

Correspondence to Piotr Szwed.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Szwed, P. (2017). Authorship Attribution for Polish Texts Based on Part of Speech Tagging. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science, vol 716. Springer, Cham. https://doi.org/10.1007/978-3-319-58274-0_26

  • DOI: https://doi.org/10.1007/978-3-319-58274-0_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-58273-3

  • Online ISBN: 978-3-319-58274-0

  • eBook Packages: Computer Science, Computer Science (R0)
