
The deconstruction of a text: the permanence of the generalized Zipf law—the inter-textual relationship between entropy and effort amount


Abstract

Zipf’s law has intrigued people for a long time. This distribution models a certain type of statistical regularity observed in a text. George K. Zipf showed that, if a word is characterised by its frequency, then rank and frequency are not independent and approximately verify the relationship:

$${\text{Rank }} \times {\text{ frequency}} \approx {\text{constant}}$$

Various explanations of this law have been advanced. In this article, we discuss the Mandelbrot process, which comprises two very different approaches. In the first approach, Mandelbrot studies language generation as the transmission of a signal and bases his analysis on information theory, using the concept of entropy. In the second, geometric approach, he draws a parallel with fractal theory, in which each word of the text is a sequence of characters framed by two separators, i.e. a simple geometric pattern. This leads us to hypothesise that, since the observed statistical regularities have several possible explanations, Zipf’s law carries other patterns. To verify this hypothesis, we chose a text, which we modified and degraded in several successive steps. We call \(T_i\) the text degraded at step i. We then segmented \(T_i\) into words. We found that rank and frequency were not independent and approximately verified the relationship:

$${\text{Rank}}^{\beta_{i}} \times {\text{frequency}} \approx {\text{constant}}\quad \beta_{i} > 1$$
(1)

The coefficient \(\beta_i\) increases with each step i. We call Eq. (1) the generalized Zipf law. We found statistical regularities in the deconstruction of the text. We notably observed a linear relationship between the entropy \(H_i\) and the amount of effort \(E_i\) of the various degraded texts \(T_i\). To verify our assumptions, we degraded a text of approximately 200 pages. At each step, we calculated various parameters, such as the entropy, the amount of effort, and the coefficient \(\beta_i\). We observed an inter-textual relationship between entropy and the amount of effort, and this paper provides a proof of this relationship.
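To make the procedure concrete, here is a minimal sketch in Python of one deconstruction experiment; it is ours, not the authors' code. The separator set, the file name, and the rule of deleting the currently most frequent character at each step (cf. notes 5 and 11 below) are assumptions, and \(\beta_i\) is estimated here by a least-squares fit on the log-log rank-frequency plot.

```python
from collections import Counter
import math

SEPARATORS = set(" \n\t.,;:!?'\"()-")  # assumed separator set

def rank_frequency(text):
    # Segment the text into words (maximal runs of non-separator characters)
    # and return their frequencies in decreasing order (rank r = index + 1).
    cleaned = "".join(c if c not in SEPARATORS else " " for c in text)
    return sorted(Counter(cleaned.split()).values(), reverse=True)

def fit_beta(freqs):
    # Least-squares slope of Ln(frequency) vs. Ln(rank):
    # frequency ~ k / rank**beta, so beta = -slope.
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

def degrade(text):
    # One deconstruction step: delete the same character throughout the
    # text (note 5), here the currently most frequent one (note 11).
    counts = Counter(c for c in text if c not in SEPARATORS)
    return text.replace(counts.most_common(1)[0][0], "")

text = open("valeur_de_la_science.txt", encoding="utf-8").read()  # hypothetical file
for i in range(10):
    print(f"T_{i}: beta = {fit_beta(rank_frequency(text)):.3f}")
    text = degrade(text)
```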


Notes

  1. \(f\left( r \right) \approx \frac{k}{{\left( {r + B} \right)^{\beta } }}\quad \beta \approx 1\quad B > 0.\)

2. Character strings, unlike words, are devoid of meaning; semantics is absent.

  3. In fact, in this case, a word in a text is characterised by its frequency alone: variables and probabilities seem to merge.

4. In Simon (1955), paragraph 4, “The empirical distributions: Word frequency”, we find many ideas on the increasing size of the lexicon. We quote p. 434: “An author writes not only by processes of association—i.e. sampling earlier segments of the word sequence—but also by processes of imitation—i.e. sampling segments of word sequences from other works he has written, from works of other authors…”.

  5. This property is true since we delete the same character throughout the text.

  6. This hypothesis is verified at all steps of the deconstruction.

7. For each adjustment, we display the graph and check visually that the points are aligned and that there is no bias.

8. Here, entropy is measured using the natural logarithm; it should be divided by Ln(2) to express it in bits (a computational sketch follows these notes).

9. This text can be found at the following address: https://archive.org/details/principesdegogr00blacgoog, website consulted in February 2015.

10. We can show that this sequence is decreasing thanks to the proposition proven in “Annex 1”. Indeed, the \(S_i\) sequence decreases by construction, and the \(M_i - S_i\) sequence, which equals \(\frac{680.54}{{k_{i} }} - S_{i}\), is also decreasing.

11. We could indeed assume that the observed results arose from the fact that the characters were deleted in order of decreasing frequency.

12. The number of texts that can be deconstructed is very large (\(\approx 10^{7}\))!
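To complement notes 7 and 8, here is a minimal sketch (ours, not the authors' code) that computes the entropy of a degraded text in nats, converts it to bits, and computes the amount of effort under Mandelbrot's effort function \(C(r) = {\text{Ln}}(r)/{\text{Ln}}(V)\); taking \(V\) to be the number of distinct characters of the text is our assumption.

```python
import math

def entropy_and_effort(freqs, V):
    # freqs: word frequencies sorted in decreasing order (rank r = index + 1);
    # V: alphabet size used in the effort function C(r) = Ln(r)/Ln(V).
    total = sum(freqs)
    p = [f / total for f in freqs]
    H_nats = -sum(pr * math.log(pr) for pr in p)
    H_bits = H_nats / math.log(2)  # note 8: divide by Ln(2) to obtain bits
    E = sum(pr * math.log(r) / math.log(V) for r, pr in enumerate(p, start=1))
    return H_nats, H_bits, E

print(entropy_and_effort([120, 80, 40, 10], V=75))  # toy frequencies
```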

References

  • Benguigui, L., & Blumenfeld-Lieberthal, E. (2011). The end of a paradigm: Is Zipf’s law universal? Journal of Geographical Systems, 281, 69–77.

  • Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Review, 51(4), 661–703.

  • Egghe, L. (1990). The duality of informetric systems with applications to the empirical laws. Journal of Information Science, 16, 17–27.

  • Egghe, L. (2005). Power laws in the information production process: Lotkaian informetrics. Amsterdam: Elsevier.

  • Egghe, L. (2013). The functional relation between the impact factor and the uncitedness factor revisited. Journal of Informetrics, 7, 183–189.

  • Estoup, J. (1916). Gammes sténographiques (4th ed.). Paris: Institut Sténographique.

  • Ferrer i Cancho, R., & Elvevåg, B. (2010). Random texts do not exhibit the real Zipf’s law-like rank distribution. PLoS ONE, 5(3), e9411.

  • Lafouge, T., & Agouzal, A. (2015). The source-effort coverage of an exponential informetric process. Journal of Informetrics, 9(1), 156–168.

  • Lafouge, T., & Pouchot, S. (2012). Statistiques de l’intellect: Lois puissances inverses en sciences humaines et sociales. Sous la direction de E. Guichard. Paris: Publibook.

  • Lafouge, T., & Smolczewska, A. (2006). An interpretation of the effort function through the mathematical formalism of exponential informetric process. Information Processing and Management, 42, 1442–1450.

  • Li, W. (1992). Random texts exhibit Zipf’s-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6), 1842–1845.

  • Lotka, A. (1926). The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences, 16, 317–323.

  • Mandelbrot, B. (1953). An informational theory of the statistical structure of languages. In W. Jackson (Ed.), Communication theory (pp. 486–502). Woburn, MA: Butterworth.

  • Mandelbrot, B. (1977). The fractal geometry of nature. New York, NY: Freeman.

  • Mitzenmacher, M. (2003). A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1(2), 226–251.

  • Petruszewycz, M. (1972). Loi de Pareto ou loi log-normale: un choix difficile. Mathématiques et Sciences Humaines, 39, 37–52.

  • Petruszewycz, M. (1973). L’histoire de la loi d’Estoup-Zipf. Mathématiques et Sciences Humaines, 11(44), 41–56.

  • Piantadosi, S. T. (2014). Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5), 1112–1130.

  • Price, D. de S. (1976). A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27(5–6), 292–306.

  • Simon, H. A. (1955). On a class of skew distribution functions. Biometrika, 42(3/4), 425–440.

  • Smith, R. (2007). Investigation of the Zipf-plot of the extinct Meroitic language. Glottometrics, 15, 53–61.

  • Stumpf, M. P. H., & Porter, M. A. (2012). Critical truths about power laws. Science, 335, 665–666.

  • Zipf, G. (1949). Human behavior and the principle of least effort. Cambridge, MA: Addison-Wesley. Reprinted: New York: Hafner, 1965.


Author information

Correspondence to Thierry Lafouge.

Appendices

Annex 1

Theorem 4.1

Let a pair \((S, M) \in \left[ {1,\infty } \right[ \times {\mathbb{R}}^{ + }\). Then there exists a unique real \(x \ge 1\) such that:

$$\int_{1}^{S} {\frac{dr}{{r^{x} }}} = M$$

if and only if \(M \in \left] {0,{\text{Ln}}\left( S \right)} \right]\).

Proof

Let the function

$$F\left( x \right) = \int_{1}^{S} {\frac{dr}{{r^{x} }}}$$

where \(x \in \left[ {1,\infty } \right[\). This function is strictly decreasing and continuous. Then we have:

$$F\left( 1 \right) = {\text{Ln}}\left( S \right).$$

For x > 1, we have:

$$F\left( x \right) = \frac{1}{x - 1} \cdot \left( {1 - \frac{1}{{S^{x - 1} }}} \right)$$

If x tends towards ∞, then F(x) tends towards 0. The intermediate value theorem then allows us to conclude that there is a unique value x such that F(x) = M.
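Numerically, this unique solution can be found by bisection on the closed form of F; below is a minimal sketch in Python, under the assumptions of the theorem (function names and the tolerance are ours).

```python
import math

def F(x, S):
    # Closed form of the integral of r**(-x) from 1 to S.
    if abs(x - 1.0) < 1e-12:
        return math.log(S)
    return (1.0 - S ** (1.0 - x)) / (x - 1.0)

def solve_beta(S, M, tol=1e-10):
    # Unique x >= 1 with F(x) = M, which exists iff 0 < M <= Ln(S) (Theorem 4.1).
    assert 0.0 < M <= math.log(S)
    lo, hi = 1.0, 2.0
    while F(hi, S) > M:        # F is decreasing in x: widen until F(hi) <= M
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mid, S) > M:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(solve_beta(S=5853.0, M=math.log(5853) / 2))
```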

Proposition

We have the following results:

  • If \(S_i\) is a decreasing sequence and \(M_i - S_i\) is a decreasing sequence, then \(\beta_i\) is an increasing sequence.

Proof

We show the identity:

$$Y = \int_{1}^{{S_{i + 1} }} {\left( {\frac{1}{{r^{{\beta_{i + 1} }} }} - \frac{1}{{r^{{\beta_{i} }} }}} \right)dr} = M_{i + 1} - M_{i} + \int_{{S_{i + 1} }}^{{S_{i} }} {\frac{dr}{{r^{{\beta_{i} }} }}}$$

Indeed, \(\int_{1}^{{S_{i + 1} }} {\frac{dr}{{r^{{\beta_{i + 1} }} }}} = M_{i + 1}\) and \(\int_{1}^{{S_{i + 1} }} {\frac{dr}{{r^{{\beta_{i} }} }}} = M_{i} - \int_{{S_{i + 1} }}^{{S_{i} }} {\frac{dr}{{r^{{\beta_{i} }} }}}\).

Since \(\frac{1}{{r^{{\beta_{i} }} }} \le 1\) for \(r \ge 1\), we also have:

$$\int_{{S_{i + 1} }}^{{S_{i} }} {\frac{dr}{{r^{{\beta_{i} }} }}} \le \int_{{S_{i + 1} }}^{{S_{i} }} {dr} = S_{i} - S_{i + 1}$$

Recall that \(S_{i}\) is decreasing, and so is \(M_{i} - S_{i}\).

Hence, we can write:

$$\begin{aligned} Y & \le M_{i + 1} - M_{i} + S_{i} - S_{i + 1} \\ Y & \le \left( {M_{i + 1} - S_{i + 1} } \right) - (M_{i} - S_{i} ) \le 0 \\ \end{aligned}$$

Then we have \(Y \le 0\). Since the integrand \(\frac{1}{{r^{{\beta_{i + 1} }} }} - \frac{1}{{r^{{\beta_{i} }} }}\) keeps a constant sign on \(\left[ {1,\infty } \right[\), we can write:

$$Y \le 0\;{ \Rightarrow }\;\forall r \ge 1\quad \frac{1}{{r^{{\beta_{i + 1} }} }} - \frac{1}{{r^{{\beta_{i} }} }} \le 0$$

Thus

$$\forall r \ge 1\quad r^{{\beta_{i} }} \le r^{{\beta_{i + 1} }} \Leftrightarrow \beta_{i} \le \beta_{i + 1} .$$
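A quick numerical illustration of the proposition, reusing solve_beta from the sketch above; the sequences below are made up but satisfy both hypotheses (\(S_i\) decreasing, \(M_i - S_i\) decreasing) as well as \(M_i \le {\text{Ln}}(S_i)\).

```python
S = [1000.0, 999.5, 999.2]  # decreasing
M = [6.0, 5.4, 5.0]         # M_i - S_i also decreases: -994.0, -994.1, -994.2
betas = [solve_beta(s, m) for s, m in zip(S, M)]
assert betas == sorted(betas)  # beta_i increases, as the proposition predicts
print(betas)
```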

Annex 2

Annex 3

Proof of the relationship (9) between the amount of effort and entropy in Sect. 4.2.3

If we calculate the entropy \(H_{i} = - \sum\nolimits_{r = 1}^{{S_{i} }} {\text{Ln}} (p_{i} (r)) \cdot p_{i} \left( r \right)\), assuming that the distributions follow the generalized Zipf law, using the equation

$$f_{i} \left( r \right) \approx \frac{{k_{i} }}{{r^{{\beta_{i} }} }}\quad k_{i} > 0 \quad \beta_{i} \ge 1\quad r = 1, \ldots ,S_{i}$$

with \(p_{i} \left( r \right) = \frac{{h_{i} }}{{r^{{\beta_{i} }} }}\), \(h_{i} = \frac{{k_{i} }}{M} = \frac{1}{{M_{i} }}\) and \(\int_{1}^{{S_{i} }} {\frac{dr}{{r^{{\beta_{i} }} }}} = M_{i} = \frac{M}{{k_{i} }}\), we have:

$$H_{i} = - {\text{Ln}}\left( {h_{i} } \right) + \beta_{i} \mathop \sum \limits_{r = 1}^{{S_{i} }} \frac{{h_{i} }}{{r^{{\beta_{i} }} }}{\text{Ln}}\left( r \right)\quad \beta_{i} \ge 1,\quad h_{i} > 0$$

Using \(C_{i} \left( r \right) = \frac{{{\text{Ln}}\left( r \right)}}{{{\text{Ln}}\left( {V_{i} } \right)}}\), we have:

$$H_{i} = - {\text{Ln}}\left( {h_{i} } \right) + \beta_{i} \cdot {\text{Ln}}\left( {V_{i} } \right) \cdot \mathop \sum \limits_{r = 1}^{{S_{i} }} \frac{{h_{i} }}{{r^{{\beta_{i} }} }} \cdot C_{i} \left( r \right)$$

Using the amount of effort \(E_{i} = \mathop \sum \limits_{r = 1}^{{S_{i} }} C_{i} \left( r \right) \cdot p_{i} \left( r \right)\), we have:

$$H_{i} = - {\text{Ln}}\left( {h_{i} } \right) + \beta_{i} \cdot {\text{Ln}}\left( {V_{i} } \right) \cdot E_{i}$$

we obtain a relationship between entropy and the amount of effort:

$$H_{i} = \gamma_{i} + \delta_{i} \cdot E_{i} \quad \gamma_{i} = - {\text{Ln}}\left( {h_{i} } \right) > 0,\quad \delta_{i} = \beta_{i} \cdot {\text{Ln}}\left( {V_{i} } \right) > 0\quad (9)$$
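The identity (9) can be checked numerically. The sketch below (ours) builds a truncated power-law distribution and verifies \(H_{i} = -{\text{Ln}}(h_{i}) + \beta_{i} \cdot {\text{Ln}}(V_{i}) \cdot E_{i}\) exactly; the parameter values reuse the orders of magnitude of Annex 4 (S = 5853 sources, β = 1.12, and, as an assumption about the meaning of V, the 75 distinct characters).

```python
import math

def zipf_entropy_effort(S, beta, V):
    # Truncated power law p(r) = h / r**beta, r = 1..S, h fixed by normalisation.
    w = [r ** -beta for r in range(1, S + 1)]
    h = 1.0 / sum(w)
    p = [h * x for x in w]
    H = -sum(pr * math.log(pr) for pr in p)              # entropy (nats)
    E = sum(pr * math.log(r) / math.log(V)
            for r, pr in enumerate(p, start=1))          # amount of effort
    # Relationship (9): H = -Ln(h) + beta * Ln(V) * E, an exact identity here.
    assert abs(H - (-math.log(h) + beta * math.log(V) * E)) < 1e-9
    return H, E

print(zipf_entropy_effort(S=5853, beta=1.12, V=75))
```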

Proof of the relationship (16) between the amount of effort and entropy in Sect. 6.3

Let \(p_{i} \left( r \right) = B \cdot \exp \left( { - A \cdot F_{i} (r)} \right)\), \(A > 0\), \(i = 1, \ldots ,I\), where \(F_{i}\) is a strictly increasing unbounded effort function. We suppose the distribution is normalised:

$$\mathop \sum \limits_{r} p_{i} \left( r \right) = 1$$

then \(H_{i} = - {\text{Ln}}\left( B \right) + A \cdot E_{i}\).

Proof

$$\begin{aligned} H_{i} & = - \mathop \sum \limits_{r} {\text{Ln}}(p_{i} \left( r \right)) \cdot p_{i} \left( r \right) \\ H_{i} & = - \mathop \sum \limits_{r} {\text{Ln}}\left( {B \cdot \exp \left( { - A \cdot F_{i} \left( r \right)} \right)} \right) \cdot B \cdot \exp \left( { - A \cdot F_{i} \left( r \right)} \right) \\ H_{i} & = - \mathop \sum \limits_{r} {\text{Ln}}\left( B \right) \cdot B \cdot \exp \left( { - A \cdot F_{i} \left( r \right)} \right) + \mathop \sum \limits_{r} A \cdot F_{i} \left( r \right) \cdot B \cdot { \exp }\left( { - A \cdot F_{i} \left( r \right)} \right) \\ H_{i} & = - {\text{Ln}}\left( B \right)\mathop \sum \limits_{r} p_{i} \left( r \right) + A \cdot \mathop \sum \limits_{r} F_{i} \left( r \right) \cdot p_{i} \left( r \right) \\ E_{i} & = \mathop \sum \limits_{r} F_{i} \left( r \right) \cdot p_{i} \left( r \right) \\ H_{i} & = - {\text{Ln}}\left( B \right) + A \cdot E_{i} \\ \end{aligned}$$

If \(F_{i} \left( r \right) = \frac{{{\text{Ln}}\left( r \right)}}{{{\text{Ln}}\left( {V_{i} } \right)}}\) is the effort function defined by Mandelbrot, then:

$$p_{i} \left( r \right) = B \cdot \exp \left( { - A \cdot \frac{{{\text{Ln}}\left( r \right)}}{{{\text{Ln}}\left( {V_{i} } \right)}}} \right) = B \cdot \frac{1}{{r^{{\beta_{i} }} }}\quad \beta_{i} = \frac{A}{{{\text{Ln}}\left( {V_{i} } \right)}}.$$
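Relationship (16) can be checked the same way. The sketch below (ours) normalises \(p(r) = B \cdot \exp(-A \cdot F(r))\) over a finite support and verifies \(H = -{\text{Ln}}(B) + A \cdot E\); with Mandelbrot's effort function it reproduces the power law above. The support size and parameter values reuse Annex 4 and are our choice.

```python
import math

def check_relation_16(F_values, A):
    # p(r) = B * exp(-A * F(r)), with B fixed by normalisation over the support.
    B = 1.0 / sum(math.exp(-A * f) for f in F_values)
    p = [B * math.exp(-A * f) for f in F_values]
    H = -sum(pr * math.log(pr) for pr in p)
    E = sum(pr * f for pr, f in zip(p, F_values))
    assert abs(H - (-math.log(B) + A * E)) < 1e-9   # relationship (16)
    return H, E

# Mandelbrot's effort function F(r) = Ln(r)/Ln(V) gives p(r) = B / r**beta.
V, S, beta = 75, 5853, 1.12
A = beta * math.log(V)
print(check_relation_16([math.log(r) / math.log(V) for r in range(1, S + 1)], A))
```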

Annex 4

The text’s characteristics

  • Total number of signs: 344,292

  • Percentage of separators: 20.2 %

  • Percentage of characters: 79.8 %

  • Total number of segmented words: 61,086

  • Number of identified sources: 5853

  • Number of distinct characters: 75

  • Adjustment coefficient β: 1.12

This text is available online at the URL http://fr.wikisource.org/wiki/La_Valeur_de_la_Science, website consulted in February 2015.
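These characteristics can be recomputed with a short script. Here is a sketch (ours), assuming the paper's segmentation convention (a word is a maximal run of non-separator characters framed by separators) and a hypothetical local copy of the text; the separator set is also an assumption.

```python
from collections import Counter

SEPARATORS = set(" \n\t.,;:!?'\"()-")   # assumed separator set

text = open("la_valeur_de_la_science.txt", encoding="utf-8").read()  # hypothetical file
signs = len(text)
separators = sum(1 for c in text if c in SEPARATORS)
words = "".join(c if c not in SEPARATORS else " " for c in text).split()
sources = Counter(words)                # distinct words = identified sources

print("Total number of signs:", signs)
print("Percentage of separators:", 100.0 * separators / signs)
print("Percentage of characters:", 100.0 * (signs - separators) / signs)
print("Total number of segmented words:", len(words))
print("Number of identified sources:", len(sources))
print("Number of distinct characters:", len({c for c in text if c not in SEPARATORS}))
```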


Cite this article

Lafouge, T., Agouzal, A. & Lallich, G. The deconstruction of a text: the permanence of the generalized Zipf law—the inter-textual relationship between entropy and effort amount. Scientometrics 104, 193–217 (2015). https://doi.org/10.1007/s11192-015-1600-z
