Abstract
We present empirical data on misprints in citations to 12 high-profile papers. The great majority of misprints are identical to misprints in articles that earlier cited the same paper. The distribution of the number of misprint repetitions follows a power law. We develop a stochastic model of the citation process, which explains these findings and shows that about 70–90% of scientific citations are copied from the lists of references used in other papers. Citation copying can explain not only why some misprints become popular, but also why some papers become highly cited. We show that a model in which a scientist picks a few random papers, cites them, and copies a fraction of their references accounts quantitatively for the empirically observed distribution of citations.
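The copying mechanism summarized in the abstract can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the chapter's actual model: the function name `simulate_citing`, the parameters `picks` and `copy_prob`, their default values, and the choice to pick uniformly among all earlier papers are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def simulate_citing(num_papers=2000, picks=3, copy_prob=0.25, seed=0):
    """Grow a toy citation network one paper at a time.

    Assumed rule (illustrative only): each new paper picks up to `picks`
    earlier papers uniformly at random, cites them, and copies each of
    their references independently with probability `copy_prob`.
    Returns a list where entry i is the number of citations to paper i.
    """
    rng = random.Random(seed)
    references = []  # references[i] = set of papers cited by paper i
    citations = []   # citations[i] = citations received by paper i
    for new in range(num_papers):
        cited = set()
        if new > 0:
            for source in rng.sample(range(new), min(picks, new)):
                cited.add(source)
                # Copy a random fraction of the source's reference list.
                for ref in sorted(references[source]):
                    if rng.random() < copy_prob:
                        cited.add(ref)
        for target in cited:
            citations[target] += 1
        references.append(cited)
        citations.append(0)
    return citations


if __name__ == "__main__":
    counts = simulate_citing()
    histogram = Counter(counts)
    # Print how many papers received k citations, for the smallest k values.
    for k in sorted(histogram)[:10]:
        print(k, histogram[k])
```

Keeping `picks * copy_prob` below 1 keeps the average reference list finite in this toy version; a heavy-tailed citation histogram is what one would look for in the output.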
Notes
- 1.
See, for example, the discussion “Scientists Don’t Read the Papers They Cite” on Slashdot: http://science.slashdot.org/article.pl?sid=02/12/14/0115243&mode=thread&tid=134.
- 2.
Suppose that the number of occurrences of a misprint, $K$, as a function of its rank $r$ (where the rank is determined by the frequency of occurrence, so that the most popular misprint has rank 1, the second most frequent misprint has rank 2, and so on) follows a Zipf law: $K(r) = C/r^{\alpha}$. We want to find the number-frequency distribution, i.e., how many misprints appeared $n$ times. The number of misprints that appeared between $K_1$ and $K_2$ times is $r_2 - r_1$, where $K_1 = C/r_1^{\alpha}$ and $K_2 = C/r_2^{\alpha}$. Therefore, the number of misprints that appeared $K$ times, $N_K$, satisfies $N_K\,dK = -dr$, and hence $N_K = -dr/dK \sim K^{-1/\alpha - 1}$.
- 3.
Why did this happen? $T$ reaches its maximum when $R$ equals zero. Substituting $R = 0$ into (16.22), we get $T_{\mathrm{MAX}} = N(1 - 1/N^{M})$. For paper No. 4 we have $N = 2{,}578$ and $M = D/N = 32/2{,}578$. Substituting these values into the preceding equation, we get $T_{\mathrm{MAX}} = 239$. The observed value $T = 263$ is therefore higher than the expectation value of $T$ for any $R$. This does not necessarily indicate a discrepancy between the model and experiment; it may simply be a strong fluctuation. In fact, out of 1,000,000 runs of a Monte Carlo simulation of the MPM with the parameters of this paper and $R = 0.2$, exactly 49,712 runs (almost 5%) produced $T \geq 263$.
- 4.
There are also misprints where the author, journal, volume, and year are perfectly correct, but the page number is totally different. Probably, in such cases the citer mistakenly took the page number from a neighboring paper in the reference list he was lifting the citation from.
- 5.
In our initial report [22] we mentioned “over 24 thousand papers.” This number is incorrect, and the reader surely understands the reason: misprints. In fact, out of 24,295 “papers” in that dataset, only 18,560 turned out to be real papers and 5,735 “papers” turned out to be misprinted citations. These “papers” got 17,382 out of 351,868 citations. That is, every distinct misprint appeared on average three times. As one could expect, cleaning out misprints led to much better agreement between experiment and theory: compare Fig. 16.8 and Fig. 1 of [22].
- 6.
If one assumes that all papers are created equal, then the probability of winning $m$ out of $n$ possible citations when the total number of cited papers is $N$ is given by the Poisson distribution: $P = \frac{(n/N)^{m}}{m!}\,e^{-n/N}$. Using Stirling’s formula, one can rewrite this as $\ln P \approx m \ln\left(\frac{ne}{Nm}\right) - \frac{n}{N}$. After substituting $n = 330{,}000$, $m = 500$, and $N = 18{,}500$ into the above equation, we get $\ln P \approx -1{,}180$, or $P \approx 10^{-512}$.
- 7.
Sociologist of science Robert Merton observed [19] that when a scientist gets recognition early in his career, he is likely to get more and more recognition. He called it the “Matthew Effect” because in the Gospel according to Matthew (25:29) appear the words: “unto every one that hath shall be given”. The attribution of a special role to St. Matthew is unfair. The quoted words belong to Jesus and also appear in Luke’s and Mark’s gospels. Nevertheless, thousands of people who did not read The Bible copied the name “Matthew Effect.”
- 8.
From the mathematical perspective, a model almost identical to the RCS model was proposed earlier in [23]; the only difference is that it considered an undirected graph, while the citation graph is directed.
- 9.
The analysis presented here also applies to a more general case when m is not a constant, but a random variable. In that case m in all of the equations that follow should be interpreted as the mean value of this variable.
- 10.
Some of these references do not deal with citing, but with other social processes, which are modeled using the same mathematical tools. Here we rephrase the results of such papers in terms of citations for simplicity.
- 11.
The uncertainty in the value of α depends not only on the accuracy of the estimate of the fraction of citations that goes to papers from the previous year. We also arbitrarily defined a recent paper (in the sense of our model) as one published within a year. Of course, this is correct to within an order of magnitude, but the true value can be anywhere between half a year and 2 years.
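The arithmetic quoted in notes 3 and 6 can be checked numerically. The following is a minimal sketch that only evaluates the formulas as printed in those notes; the function names are hypothetical, and the exact Poisson log-probability (via the log-gamma function) is added purely for comparison with the Stirling approximation.

```python
import math


def t_max(n, d):
    """Note 3: T_MAX = N * (1 - 1 / N**M) with M = D / N."""
    m = d / n
    return n * (1.0 - n ** (-m))


def stirling_log_prob(m, n, total_papers):
    """Note 6 approximation: ln P ~ m * ln(n*e / (N*m)) - n/N."""
    lam = n / total_papers
    return m * math.log(lam * math.e / m) - lam


def poisson_log_prob(m, n, total_papers):
    """Exact Poisson ln P with mean n/N: m*ln(lam) - ln(m!) - lam."""
    lam = n / total_papers
    return m * math.log(lam) - math.lgamma(m + 1) - lam


if __name__ == "__main__":
    # Values taken from notes 3 and 6.
    print("T_MAX:", t_max(2578, 32))
    ln_p = stirling_log_prob(500, 330_000, 18_500)
    print("ln P (Stirling):", ln_p)
    print("ln P (exact):", poisson_log_prob(500, 330_000, 18_500))
    print("log10 P (Stirling):", ln_p / math.log(10))
```

The Stirling form drops the slowly varying factor $\sqrt{2\pi m}$, so the two log-probabilities should differ by only a few units, which is negligible at this magnitude.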
References
Simkin MV, Roychowdhury VP (2003) Read before you cite! Complex Systems 14: 269–274. Alternatively available at http://arxiv.org/abs/cond-mat/0212043
Simkin MV, Roychowdhury VP (2006) An introduction to the theory of citing. Significance 3: 179–181. Alternatively available at http://arxiv.org/abs/math/0701086
Simkin MV, Roychowdhury VP (2005) Stochastic modeling of citation slips. Scientometrics 62: 367–384. Alternatively available at http://arxiv.org/abs/cond-mat/0401529
Simon HA (1957) Models of Man. New York: Wiley.
Krapivsky PL, Redner S (2001) Organization of growing random networks. Phys. Rev. E 63, 066123; Alternatively available at http://arxiv.org/abs/cond-mat/0011094
Krapivsky PL, Redner S (2002) Finiteness and Fluctuations in Growing Networks. J. Phys. A 35: 9517; Alternatively available at http://arxiv.org/abs/cond-mat/0207107
Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1992) Numerical Recipes in FORTRAN: The Art of Scientific Computing. Cambridge: University Press (see Chapt. 14.3, p.617–620).
Simboli B (2003) http://listserv.nd.edu/cgi-bin/wa?A2=ind0305&L=pamnet&P=R2083. Accessed on 7 Sep 2011
Smith A (1983) Erroneous error correction. New Library World 84: 198.
Garfield E (1990) Journal editors awaken to the impact of citation errors. How we control them at ISI. Essays of Information Scientist 13:367.
SPIRES (http://www.slac.stanford.edu/spires/) data, compiled by H. Galic, and made available by S. Redner: http://physics.bu.edu/~redner/projects/citation. Accessed on 7 Sep 2011
Steel CM (1996) Read before you cite. The Lancet 348: 144.
Broadus RN (1983) An investigation of the validity of bibliographic citations. Journal of the American Society for Information Science 34: 132.
Moed HF, Vriens M (1989) Possible inaccuracies occurring in citation analysis. Journal of Information Science 15:95.
Hoerman HL, Nowicke CE (1995) Secondary and tertiary citing: A study of referencing behaviour in the literature of citation analyses deriving from the Ortega Hypothesis of Cole and Cole. Library Quarterly 65: 415.
Kåhre J (2002) The Mathematical Theory of Information. Boston: Kluwer.
Deming WE (1986) Out of the crisis. Cambridge: MIT Press.
Garfield E (1979) Citation Indexing. New York: John Wiley.
Merton RK (1968) The Matthew Effect in Science. Science 159: 56.
Price D de S (1976) A general theory of bibliometric and other cumulative advantage process. Journal of American Society for Information Science 27: 292.
Barabasi A-L, Albert R (1999) Emergence of scaling in random networks. Science 286: 509.
Simkin MV, Roychowdhury VP (2005) Copied citations create renowned papers? Annals of Improbable Research 11:24–27. Alternatively available at http://arxiv.org/abs/cond-mat/0305150
Dorogovtsev SN, Mendes JFF (2004) Accelerated growth of networks. http://arxiv.org/abs/cond-mat/0204102 (see Chap. 0.6.3)
Price D de S (1965) Networks of Scientific Papers. Science 149: 510.
Silagadze ZK (1997) Citations and Zipf-Mandelbrot law. Complex Systems 11: 487
Redner S (1998) How popular is your paper? An empirical study of citation distribution. Eur. Phys. J. B 4: 131.
Vazquez A (2001) Disordered networks generated by recursive searches. Europhys. Lett. 54: 430.
Ziman JM (1969) Information, communication, knowledge. Nature, 324: 318.
Günter R, Levitin L, Schapiro B, Wagner P (1996) Zipf’s law and the effect of ranking on probability distributions. International Journal of Theoretical Physics 35: 395
Nakamoto H (1988) Synchronous and diachronous citation distributions. In: Egghe L and Rousseau R (eds) Informetrics 87/88. Amsterdam: Elsevier.
Glänzel W, Schoepflin U (1994) A stochastic model for the ageing of scientific literature. Scientometrics 30: 49–64.
Pollmann T (2000) Forgetting and the aging of scientific publication. Scientometrics 47: 43.
Simkin MV, Roychowdhury VP (2007) A mathematical theory of citing. Journal of the American Society for Information Science and Technology 58: 1661–1673.
Harris TE (1963) The theory of branching processes. Berlin: Springer.
Bentley RA, Hahn MW, Shennan SJ (2004) Random drift and culture change. Proceedings of the Royal Society B: Biological Sciences 271: 1443–1450.
Redner S (2004) Citation Statistics From More Than a Century of Physical Review. http://arxiv.org/abs/physics/0407137
Wright S (1931) Evolution in Mendelian populations. Genetics 16: 97–159.
Simkin MV, Roychowdhury VP (2010) An explanation of the distribution of inter-seizure intervals. EPL 91: 58005
Bak P, Tang C, Wiesenfeld K (1988) Self-organized criticality. Phys. Rev. A 38: 364–374.
Simkin M. V., Roychowdhury V. P. (2008) A theory of web traffic. EPL 62: 28006. Accessed on 7 Sep 2011
Some Statistics about the MR Database http://www.ams.org/publications/60ann/FactsandFigures.html
Burrell Q L (2003) Predicting future citation behavior. Journal of the American Society for Information Science and Technology 54: 372–378.
Garfield E (1980) Premature discovery or delayed recognition -Why? Current Contents 21: 5–10.
Raan AFJ van (2004) Sleeping Beauties in science. Scientometrics 59: 467–472
Alstrøm P (1988) Mean-field exponents for self-organized critical phenomena. Phys. Rev. A 38: 4905–4906.
Bak P (1999) How Nature Works: the Science of Self-Organized Criticality. New York: Copernicus.
Bak P, Sneppen K (1993) Punctuated equilibrium and criticality in a simple model of evolution. Physical Review Letters 71: 4083–4086
Sokal A, Bricmont J (1998) Fashionable Nonsense. New York: Picador.
Hahn MW, Bentley RA (2003) Drift as a mechanism for cultural change: an example from baby names. Proc. R. Soc. Lond. B (Suppl.), Biology Letters, DOI 10.1098/rsbl.2003.0045.
Social Security Administration: Popular Baby Names http://www.ssa.gov/OACT/babynames/. Accessed on 7 Sep 2011
Simkin MV (2007) My Statistician Could Have Painted That! A Statistical Inquiry into Modern Art. Significance 14:93–96. Also available at http://arxiv.org/abs/physics/0703091
Naftulin DH, Ware JE, Donnelly FA (1973) The Doctor Fox Lecture: A Paradigm of Educational Seduction. Journal of Medical Education 48: 630–635.
Encyclopaedia of Mathematics (Ed. M. Hazewinkel). See: Bürmann–Lagrange series: http://eom.springer.de/b/b017790.htm. Accessed on 7 Sep 2011
Otter R (1949) The multiplicative process. The Annals of Mathematical Statistics 20: 206
© 2012 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Simkin, M.V., Roychowdhury, V.P. (2012). Theory of Citing. In: Thai, M., Pardalos, P. (eds) Handbook of Optimization in Complex Networks. Springer Optimization and Its Applications, vol 57. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-0754-6_16
DOI: https://doi.org/10.1007/978-1-4614-0754-6_16
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4614-0753-9
Online ISBN: 978-1-4614-0754-6
eBook Packages: Mathematics and Statistics (R0)