Abstract
Markov chains are used as a formal mathematical model for sequences of text elements, and this model is applied to authorship attribution. As elements of a text, we consider either letters or the grammatical classes of words. It turns out that the frequencies of letter pairs and of pairs of grammatical classes in a Russian text are rather stable characteristics of an author and, apparently, could be used to resolve disputed authorship. Results are compared for several modifications of the method, using both letters and grammatical classes. The experiments involve 385 texts by 82 writers. The Appendix describes research by D.V. Khmelev in which data compression algorithms are applied to authorship attribution.
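The letter-pair idea summarized in the abstract can be illustrated with a minimal sketch. It estimates a first-order Markov chain over letters from each author's known text and assigns a disputed text to the author whose chain gives it the highest likelihood. This is not the authors' exact procedure: the function names, the add-alpha smoothing, the `alphabet_size` parameter (33 is the size of the Russian alphabet), and the toy corpora are all assumptions made for illustration.

```python
import math
from collections import Counter


def transition_counts(text):
    """Count adjacent letter pairs and how often each letter starts a pair."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    pairs = Counter(zip(letters, letters[1:]))
    firsts = Counter(letters[:-1])
    return pairs, firsts


def neg_log_likelihood(text, model, alphabet_size=33, alpha=1.0):
    """Negative log-likelihood of the text's letter pairs under an author's chain.

    Add-alpha smoothing (an assumption of this sketch) keeps pairs that were
    never seen in the author's sample from producing log(0).
    """
    pairs, firsts = model
    test_pairs, _ = transition_counts(text)
    total = 0.0
    for (a, b), n in test_pairs.items():
        p = (pairs[(a, b)] + alpha) / (firsts[a] + alpha * alphabet_size)
        total -= n * math.log(p)
    return total


def attribute(text, corpora, alphabet_size=33):
    """Return the author whose letter-pair chain best explains the text."""
    models = {author: transition_counts(sample) for author, sample in corpora.items()}
    return min(
        models,
        key=lambda author: neg_log_likelihood(text, models[author], alphabet_size),
    )


# Hypothetical toy corpora; real experiments would use full texts per author.
corpora = {
    "A": "abababababab",
    "B": "aaaabbbbaaaabbbb",
}
guess_alternating = attribute("ababab", corpora)
guess_runs = attribute("aaabbb", corpora)
```

The same scheme would apply to grammatical classes by replacing the letter sequence with a sequence of class labels, provided a tagger or dictionary supplies them.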
REFERENCES
Khmelev, D.V., Using Markov Chains for Authorship Attribution, Vestn. MGU, Ser. 9, Philolog., 2000, no. 2, pp. 115-126.
Morozov, N.A., Linguistic Spectra: Methods for Telling Plagiary from True Creation of a Known Author. Stylometric Study, Izv. Otd. Rus. Yazyka i Slovesnosti Imp. Akad. Nauk, 1915, vol. 20, no. 4.
Markov, A.A., On One Application of the Statistical Method, Izv. Imp. Akad. Nauk, Ser. 6, 1916, no. 4, pp. 239-242.
Markov, A.A., An Example of Statistical Analysis of the Text of “Evgenii Onegin” Illustrating the Linking of Events into a Chain, Izv. Imp. Akad. Nauk, Ser. 6, 1913, no. 3, pp. 153-162.
Ot Nestora do Fonvizina. Novye metody opredeleniya avtorstva (From Nestor to Fonvizin. New Methods for Authorship Attribution), Moscow: Progress, 1994.
Holmes, D.I., The Evolution of Stylometry in Humanities Scholarship, Literary Linguist. Comput., 1997, vol. 13, no. 3, pp. 111-117.
Fomenko, V.P. and Fomenko, T.G., Author Invariant for Russian Literary Texts, in Metody kolichestvennogo analiza tekstov narrativnykh istochnikov (Methods of Quantitative Analysis of Texts from Narrative Sources), Moscow: Inst. Istorii SSSR, 1983, pp. 86-109.
Yaglom, A.M. and Yaglom, I.M., Veroyatnost' i informatsiya, Moscow: Nauka, 1960. Translated under the title Probability and Information, Boston: Reidel, 1983.
Dobrushin, R.L., Mathematical Methods in Linguistics, Matematicheskoe prosveshchenie, 1959, issue 6.
Zaliznyak, A.A., Grammaticheskii slovar' russkogo yazyka (Grammatical Dictionary of the Russian Language), Moscow: Rus. Yaz., 1977.
Grammatika sovremennogo russkogo literaturnogo yazyka (The Grammar of Modern Russian Literary Language), Moscow: Nauka, 1970.
Russkaya grammatika (Russian Grammar), 2 vols., Moscow: Nauka, 1980.
Ivchenko, G.I. and Medvedev, Yu.I., Matematicheskaya statistika (Mathematical Statistics), Moscow: Vyssh. Shkola, 1992.
Li, M. and Vitányi, P., An Introduction to Kolmogorov Complexity and Its Applications, New York: Springer, 1997.
Kolmogorov, A.N., Three Approaches to the Definition of the Notion of “Information Amount,” Probl. Peredachi Inf., 1965, vol. 1, no. 1, pp. 3-11.
Burrows, M. and Wheeler, D.J., A Block-Sorting Lossless Data Compression Algorithm, Digital SRC Research Report, 1994, no. 124. Available from ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz
Lempel, A. and Ziv, J., On the Complexity of Finite Sequences, IEEE Trans. Inf. Theory, 1976, vol. 22, no. 1, pp. 75-81.
Cleary, J.G. and Witten, I.H., Data Compression Using Adaptive Coding and Partial String Matching, IEEE Trans. Commun., 1984, vol. 32, no. 4, pp. 396-402.
Cormack, G.V. and Horspool, R.N., Data Compression Using Dynamic Markov Modelling, Comput. J., 1987, vol. 30, no. 6, pp. 541-550.
Cite this article
Kukushkina, O.V., Polikarpov, A.A. & Khmelev, D.V. Using Literal and Grammatical Statistics for Authorship Attribution. Problems of Information Transmission 37, 172–184 (2001). https://doi.org/10.1023/A:1010478226705