Using Literal and Grammatical Statistics for Authorship Attribution

Problems of Information Transmission

Abstract

Markov chains are used as a formal mathematical model for sequences of elements of a text, and this model is applied to authorship attribution. As elements of a text, we consider sequences of letters or sequences of grammatical classes of words. It turns out that the frequencies of letter pairs and of pairs of grammatical classes in a Russian text are rather stable characteristics of an author and, apparently, could be used to attribute texts of disputed authorship. Several modifications of the method, using both letters and grammatical classes, are compared. The experiments involve 385 texts by 82 writers. The Appendix describes the research of D.V. Khmelev, in which data compression algorithms are applied to authorship attribution.
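The letter-pair idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a first-order Markov chain over letters, estimates transition probabilities from each candidate author's known texts, and attributes a disputed text to the author whose chain gives it the smallest negative log-likelihood. The function names (`letter_pairs`, `train`, `neg_log_likelihood`, `attribute`), the text normalization, and the add-one smoothing for unseen pairs are all assumptions made for this example.

```python
import math
from collections import Counter


def letter_pairs(text):
    """Count adjacent symbol pairs; every run of non-letters becomes one space."""
    chars = [c.lower() if c.isalpha() else " " for c in text]
    normalized = " ".join("".join(chars).split())
    return Counter(zip(normalized, normalized[1:]))


def train(texts):
    """Pool pair counts over an author's known texts (hypothetical helper)."""
    pairs = Counter()
    for t in texts:
        pairs.update(letter_pairs(t))
    # firsts[a] = number of pairs that start with symbol a
    firsts = Counter()
    for (a, _), n in pairs.items():
        firsts[a] += n
    return pairs, firsts


def neg_log_likelihood(text, model, alphabet_size):
    """Negative log-likelihood of the text's transitions under an author's chain.

    Add-one smoothing is an assumption of this sketch, used so that
    transitions unseen in training do not get zero probability.
    """
    pairs, firsts = model
    total = 0.0
    for (a, b), n in letter_pairs(text).items():
        p = (pairs[(a, b)] + 1) / (firsts[a] + alphabet_size)
        total -= n * math.log(p)
    return total


def attribute(text, corpora):
    """Return (best_author, scores) for a dict mapping author -> list of known texts."""
    models = {author: train(texts) for author, texts in corpora.items()}
    # Alphabet = every symbol seen in any training text or in the disputed text.
    symbols = {c for pair in letter_pairs(text) for c in pair}
    for pairs, _ in models.values():
        symbols.update(c for pair in pairs for c in pair)
    k = max(len(symbols), 1)
    scores = {a: neg_log_likelihood(text, m, k) for a, m in models.items()}
    return min(scores, key=scores.get), scores
```

Calling `attribute(disputed_text, {"Author A": [...], "Author B": [...]})` returns the best-scoring author together with the per-author scores. The same scheme would apply to grammatical classes by replacing letters with class labels, which requires a morphological tagger not shown here.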

Cite this article

Kukushkina, O.V., Polikarpov, A.A. & Khmelev, D.V. Using Literal and Grammatical Statistics for Authorship Attribution. Problems of Information Transmission 37, 172–184 (2001). https://doi.org/10.1023/A:1010478226705
