Abstract
This contribution results from an attempt to put into a more systematic, updated and readable form some notes I used in preparing my talk at the INdAM Workshop Mathematical Models and Methods for Planet Earth, held in Rome in May 2013. The aim is to discuss some recent mathematical approaches to textual data analysis, focusing on literary texts and on a few specific topics and examples: universal statistical properties of written language and the nature of long-range correlations in literary texts, with specific applications to authorship attribution, keyword extraction, and automatic text generation.
Notes
1. To be more precise, d_n is a pseudo-distance, since it does not satisfy the triangle inequality and is not even positive definite: two texts X, Y can be at distance d_n(X, Y) = 0 without being the same.
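The note above does not restate the definition of d_n, so the following is only a minimal sketch under an assumption: it takes d_n to be a dissimilarity between character n-gram frequency profiles, of the kind commonly used in authorship attribution. The function names and the exact normalisation are illustrative choices, not necessarily those of the chapter. The sketch shows how such a quantity can vanish on two different texts and can violate the triangle inequality.

```python
from collections import Counter


def ngram_freqs(text, n):
    """Relative frequencies of the character n-grams occurring in `text`."""
    if len(text) < n:
        raise ValueError("text must contain at least n characters")
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def ngram_dissimilarity(x, y, n):
    """Assumed form of an n-gram pseudo-distance (hypothetical, for illustration).

    Sums the squared relative frequency differences over all n-grams seen in
    either text, then divides by the total number of distinct n-grams.
    """
    fx, fy = ngram_freqs(x, n), ngram_freqs(y, n)
    total = 0.0
    for gram in set(fx) | set(fy):
        a, b = fx.get(gram, 0.0), fy.get(gram, 0.0)
        # a + b > 0 because the n-gram occurs in at least one of the texts
        total += ((a - b) / (a + b)) ** 2
    return total / (len(fx) + len(fy))


# Not positive definite: different texts with identical 1-gram frequencies.
d_same_profile = ngram_dissimilarity("abab", "baba", 1)

# Triangle inequality: compare the direct route with the detour through "ab".
d_direct = ngram_dissimilarity("a", "b", 1)
d_detour = ngram_dissimilarity("a", "ab", 1) + ngram_dissimilarity("ab", "b", 1)
```

Under this assumed definition, d_same_profile is zero although the two strings differ, and d_direct exceeds d_detour, so both properties mentioned in the note fail.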
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Esposti, M.D. (2014). Mathematical Models of Textual Data: A Short Review. In: Celletti, A., Locatelli, U., Ruggeri, T., Strickland, E. (eds) Mathematical Models and Methods for Planet Earth. Springer INdAM Series, vol 6. Springer, Cham. https://doi.org/10.1007/978-3-319-02657-2_8
DOI: https://doi.org/10.1007/978-3-319-02657-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02656-5
Online ISBN: 978-3-319-02657-2
eBook Packages: Mathematics and Statistics (R0)