Theory of Citing

Simkin, M. V.; Roychowdhury, V. P.

doi:10.1007/978-1-4614-0754-6_16

Part of the book series: Springer Optimization and Its Applications (SOIA, volume 57)

Abstract

We present empirical data on misprints in citations to 12 high-profile papers. The great majority of misprints are identical to misprints in articles that earlier cited the same paper. The distribution of the numbers of misprint repetitions follows a power law. We develop a stochastic model of the citation process, which explains these findings and shows that about 70–90% of scientific citations are copied from the lists of references used in other papers. Citation copying can explain not only why some misprints become popular, but also why some papers become highly cited. We show that a model where a scientist picks a few random papers, cites them, and copies a fraction of their references accounts quantitatively for the empirically observed distribution of citations.
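
The copying mechanism described in the last sentence of the abstract can be illustrated with a small simulation. The sketch below is only a toy version under stated assumptions: the function name `simulate_random_citing`, the parameters `n_picked` and `copy_prob`, and their default values are hypothetical choices made for illustration, not the chapter's fitted parameters or its exact model.

```python
import random
from collections import Counter


def simulate_random_citing(n_papers=2000, n_picked=3, copy_prob=0.25, seed=0):
    """Toy random-citing model (illustrative assumptions, not the chapter's fit).

    Papers are published one at a time. Each new paper picks up to `n_picked`
    earlier papers uniformly at random, cites them, and copies each of their
    references independently with probability `copy_prob`.
    Returns a list where entry i is the number of citations to paper i.
    """
    rng = random.Random(seed)
    references = [[]]  # references[i] lists the papers cited by paper i; paper 0 cites nothing
    for new in range(1, n_papers):
        cited = set()
        picked = rng.sample(range(new), min(n_picked, new))
        for paper in picked:
            cited.add(paper)  # cite the randomly picked paper itself
            for ref in references[paper]:
                if rng.random() < copy_prob:  # copy a fraction of its references
                    cited.add(ref)
        references.append(sorted(cited))
    counts = Counter(ref for refs in references for ref in refs)
    return [counts.get(i, 0) for i in range(n_papers)]


citations = simulate_random_citing()
# Show the five most-cited papers as (citation count, paper index) pairs.
print(sorted(((c, i) for i, c in enumerate(citations)), reverse=True)[:5])
```

Keeping `n_picked * copy_prob` below 1 keeps the average reference list finite in this toy version; whether the resulting citation counts resemble real data depends on parameter values that this preview does not state.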

Notes

1.
See, for example, the discussion “Scientists Don’t Read the Papers They Cite” on Slashdot: http://science.slashdot.org/article.pl?sid=02/12/14/0115243&mode=thread&tid=134.
2.
Suppose that the number of occurrences of a misprint, $K$, as a function of its rank $r$, where the rank is determined by the frequency of occurrence (so that the most popular misprint has rank 1, the second most frequent misprint has rank 2, and so on), follows a Zipf law: $K(r) = C/r^{\alpha}$. We want to find the number-frequency distribution, i.e. how many misprints appeared $n$ times. The number of misprints that appeared between $K_1$ and $K_2$ times is $r_2 - r_1$, where $K_1 = C/r_1^{\alpha}$ and $K_2 = C/r_2^{\alpha}$. Therefore the number of misprints that appeared $K$ times, $N_K$, satisfies $N_K\,dK = -dr$. Since $r = (C/K)^{1/\alpha}$, it follows that $N_K = -dr/dK \sim K^{-1/\alpha - 1}$.
3.
Why did this happen? $T$ reaches its maximum when $R$ equals zero. Substituting $R = 0$ in (16.22) we get $T_{\mathrm{MAX}} = N(1 - 1/N^{M})$. For paper No. 4 we have $N = 2{,}578$ and $M = D/N = 32/2{,}578$. Substituting these values into the preceding equation, we get $T_{\mathrm{MAX}} = 239$. The observed value $T = 263$ is therefore higher than the expectation value of $T$ for any $R$. This does not necessarily indicate a discrepancy between the model and experiment; it may simply be a strong fluctuation. In fact, out of 1,000,000 runs of a Monte Carlo simulation of the MPM with the parameters of this paper and $R = 0.2$, exactly 49,712 runs (almost 5%) produced $T \geq 263$.
4.
There are also misprints where the author, journal, volume, and year are perfectly correct, but the page number is totally different. Probably, in such a case the citer mistakenly took the page number from a neighboring paper in the reference list he was lifting the citation from.
5.
In our initial report [22] we mentioned “over 24 thousand papers.” This number is incorrect and the reader surely understands the reason: misprints. In fact, out of 24,295 “papers” in that dataset only 18,560 turned out to be real papers and 5,735 “papers” turned out to be misprinted citations. These “papers” got 17,382 out of 351,868 citations. That is, every distinct misprint appeared on average three times. As one could expect, cleaning out misprints led to much better agreement between experiment and theory: compare Fig. 16.8 and Fig. 1 of [22].
6.
If one assumes that all papers are created equal, then the probability to win $m$ out of $n$ possible citations when the total number of cited papers is $N$ is given by the Poisson distribution: $P = \frac{(n/N)^{m}}{m!}\, e^{-n/N}$. Using Stirling’s formula one can rewrite this as $\ln P \approx m \ln\left(\frac{ne}{Nm}\right) - \frac{n}{N}$. After substituting $n = 330{,}000$, $m = 500$ and $N = 18{,}500$ into the above equation we get $\ln P \approx -1{,}180$, or $P \approx 10^{-512}$.
7.
Sociologist of science Robert Merton observed [19] that when a scientist gets recognition early in his career he is likely to get more and more recognition. He called it the “Matthew Effect” because in the Gospel according to Matthew (25:29) appear the words: “unto every one that hath shall be given”. The attribution of a special role to St. Matthew is unfair. The quoted words belong to Jesus and also appear in the Gospels of Luke and Mark. Nevertheless, thousands of people who did not read the Bible copied the name “Matthew Effect.”
8.
From the mathematical perspective, a model almost identical to the RCS model (the only difference being that it considered an undirected graph, while the citation graph is directed) was proposed earlier in [23].
9.
The analysis presented here also applies to a more general case when m is not a constant, but a random variable. In that case m in all of the equations that follow should be interpreted as the mean value of this variable.
10.
Some of these references do not deal with citing, but with other social processes, which are modeled using the same mathematical tools. Here we rephrase the results of such papers in terms of citations for simplicity.
11.
The uncertainty in the value of α depends not only on the accuracy of the estimate of the fraction of citations that goes to papers from the previous year. We also arbitrarily defined a recent paper (in the sense of our model) as one published within a year. Of course, this is correct to within an order of magnitude, but the true value can be anywhere between half a year and 2 years.
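
The arithmetic quoted in notes 3 and 6 can be rechecked numerically. The sketch below is a minimal check, assuming only the formulas as written in those notes; the function names are hypothetical, and `log_poisson_exact` is an added cross-check that uses the log-gamma function in place of Stirling’s formula.

```python
import math


def t_max(n, d):
    """Note 3: T_MAX = N * (1 - 1/N**M) with M = D/N, as quoted for R = 0."""
    m = d / n
    return n * (1.0 - n ** (-m))


def log_poisson_stirling(n, m, big_n):
    """Note 6: ln P ~ m * ln(n*e / (N*m)) - n/N (Stirling form from the note)."""
    return m * math.log(n * math.e / (big_n * m)) - n / big_n


def log_poisson_exact(n, m, big_n):
    """Cross-check (not in the note): ln of the Poisson probability via lgamma."""
    lam = n / big_n
    return m * math.log(lam) - math.lgamma(m + 1) - lam


# Note 3, paper No. 4: N = 2,578 and D = 32.
print("T_MAX:", t_max(2578, 32))

# Note 6: n = 330,000, m = 500, N = 18,500.
ln_p = log_poisson_stirling(330_000, 500, 18_500)
print("ln P (Stirling form):", ln_p)
print("ln P (lgamma form):  ", log_poisson_exact(330_000, 500, 18_500))
print("log10 P:", ln_p / math.log(10))
```

The first value comes out between 239 and 240, matching the 239 in note 3. The Stirling form gives a value near −1184, which note 6 rounds to −1,180; either way the probability is far below anything observable.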

References

Simkin MV, Roychowdhury VP (2003) Read before you cite! Complex Systems 14: 269–274. Alternatively available at http://arxiv.org/abs/cond-mat/0212043
Simkin MV, Roychowdhury VP (2006) An introduction to the theory of citing. Significance 3: 179–181. Alternatively available at http://arxiv.org/abs/math/0701086
Simkin MV, Roychowdhury VP (2005) Stochastic modeling of citation slips. Scientometrics 62: 367–384. Alternatively available at http://arxiv.org/abs/cond-mat/0401529
Simon HA (1957) Models of Man. New York: Wiley.
Krapivsky PL, Redner S (2001) Organization of growing random networks. Phys. Rev. E 63, 066123; Alternatively available at http://arxiv.org/abs/cond-mat/0011094
Krapivsky PL, Redner S (2002) Finiteness and Fluctuations in Growing Networks. J. Phys. A 35: 9517; Alternatively available at http://arxiv.org/abs/cond-mat/0207107
Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1992) Numerical Recipes in FORTRAN: The Art of Scientific Computing. Cambridge: University Press (see Chapt. 14.3, p.617–620).
Simboli B (2003) http://listserv.nd.edu/cgi-bin/wa?A2=ind0305&L=pamnet&P=R2083. Accessed on 7 Sep 2011
Smith A (1983) Erroneous error correction. New Library World 84: 198.
Garfield E (1990) Journal editors awaken to the impact of citation errors. How we control them at ISI. Essays of Information Scientist 13:367.
SPIRES (http://www.slac.stanford.edu/spires/) data, compiled by H. Galic, and made available by S. Redner: http://physics.bu.edu/~redner/projects/citation. Accessed on 7 Sep 2011
Steel CM (1996) Read before you cite. The Lancet 348: 144.
Broadus RN (1983) An investigation of the validity of bibliographic citations. Journal of the American Society for Information Science 34: 132.
Moed HF, Vriens M (1989) Possible inaccuracies occurring in citation analysis. Journal of Information Science 15:95.
Hoerman HL, Nowicke CE (1995) Secondary and tertiary citing: A study of referencing behaviour in the literature of citation analyses deriving from the Ortega Hypothesis of Cole and Cole. Library Quarterly 65: 415.
Kåhre J (2002) The Mathematical Theory of Information. Boston: Kluwer.
Deming WE (1986) Out of the crisis. Cambridge: MIT Press.
Garfield E (1979) Citation Indexing. New York: John Wiley.
Merton RK (1968) The Matthew Effect in Science. Science 159: 56.
Price D de S (1976) A general theory of bibliometric and other cumulative advantage process. Journal of American Society for Information Science 27: 292.
Barabasi A-L, Albert R (1999) Emergence of scaling in random networks. Science 286: 509.
Simkin MV, Roychowdhury VP (2005) Copied citations create renowned papers? Annals of Improbable Research 11:24–27. Alternatively available at http://arxiv.org/abs/cond-mat/0305150
Dorogovtsev SN, Mendes JFF (2004) Accelerated growth of networks. http://arxiv.org/abs/cond-mat/0204102 (see Chap. 0.6.3)
Price D de S (1965) Networks of Scientific Papers. Science 149: 510.
Silagadze ZK (1997) Citations and Zipf-Mandelbrot law. Complex Systems 11: 487
Redner S (1998) How popular is your paper? An empirical study of citation distribution. Eur. Phys. J. B 4: 131.
Vazquez A (2001) Disordered networks generated by recursive searches. Europhys. Lett. 54: 430.
Ziman JM (1969) Information, communication, knowledge. Nature 224: 318.
Günter R, Levitin L, Schapiro B, Wagner P (1996) Zipf’s law and the effect of ranking on probability distributions. International Journal of Theoretical Physics 35: 395
Nakamoto H (1988) Synchronous and diachronous citation distributions. In: Egghe L and Rousseau R (eds) Informetrics 87/88. Amsterdam: Elsevier.
Glänzel W, Schoepflin U (1994) A stochastic model for the ageing of scientific literature. Scientometrics 30: 49–64.
Pollmann T (2000) Forgetting and the aging of scientific publication. Scientometrics 47: 43.
Simkin MV, Roychowdhury VP (2007) A mathematical theory of citing. Journal of the American Society for Information Science and Technology 58: 1661–1673.
Harris TE (1963) The theory of branching processes. Berlin: Springer.
Bentley RA, Hahn MW, Shennan SJ (2004) Random drift and culture change. Proceedings of the Royal Society B: Biological Sciences 271: 1443–1450.
Redner S (2004) Citation Statistics From More Than a Century of Physical Review. http://arxiv.org/abs/physics/0407137
Wright S (1931) Evolution in Mendelian populations. Genetics 16: 97–159.
Simkin MV, Roychowdhury VP (2010) An explanation of the distribution of inter-seizure intervals. EPL 91: 58005
Bak P, Tang C, Wiesenfeld K (1988) Self-organized criticality. Phys. Rev. A 38: 364–374.
Simkin MV, Roychowdhury VP (2008) A theory of web traffic. EPL 82: 28006.
Some Statistics about the MR Database http://www.ams.org/publications/60ann/FactsandFigures.html
Burrell QL (2003) Predicting future citation behavior. Journal of the American Society for Information Science and Technology 54: 372–378.
Garfield E (1980) Premature discovery or delayed recognition -Why? Current Contents 21: 5–10.
Raan AFJ van (2004) Sleeping Beauties in science. Scientometrics 59: 467–472
Alstrøm P (1988) Mean-field exponents for self-organized critical phenomena. Phys. Rev. A 38: 4905–4906.
Bak P (1999) How Nature Works: The Science of Self-Organized Criticality. New York: Copernicus.
Bak P, Sneppen K (1993) Punctuated equilibrium and criticality in a simple model of evolution. Physical Review Letters 71: 4083–4086.
Sokal A, Bricmont J (1998) Fashionable Nonsense. New York: Picador.
Hahn MW, Bentley RA (2003) Drift as a mechanism for cultural change: an example from baby names. Proc. R. Soc. Lond. B (Suppl.), Biology Letters, DOI 10.1098/rsbl.2003.0045.
Social Security Administration: Popular Baby Names http://www.ssa.gov/OACT/babynames/. Accessed on 7 Sep 2011
Simkin MV (2007) My Statistician Could Have Painted That! A Statistical Inquiry into Modern Art. Significance 4: 93–96. Also available at http://arxiv.org/abs/physics/0703091
Naftulin DH, Ware JE, Donnelly FA (1973) The Doctor Fox Lecture: A Paradigm of Educational Seduction. Journal of Medical Education 48: 630–635.
Encyclopaedia of Mathematics (Ed. M. Hazewinkel). See: Bürmann–Lagrange series: http://eom.springer.de/b/b017790.htm. Accessed on 7 Sep 2011
Otter R (1949) The multiplicative process. The Annals of Mathematical Statistics 20: 206

Author information

Authors and Affiliations

Department of Electrical Engineering, University of California, Los Angeles, CA, 90095-1594, USA
M. V. Simkin & V. P. Roychowdhury

Corresponding author

Correspondence to M. V. Simkin.

Editor information

Editors and Affiliations

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, 3260, Florida, USA
My T. Thai
Department of Industrial & Systems Engin, University of Florida, Weil Hall 401, Gainesville, 32611-6595, Florida, USA
Panos M. Pardalos

About this chapter

Cite this chapter

Simkin, M.V., Roychowdhury, V.P. (2012). Theory of Citing. In: Thai, M., Pardalos, P. (eds) Handbook of Optimization in Complex Networks. Springer Optimization and Its Applications, vol 57. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-0754-6_16

DOI: https://doi.org/10.1007/978-1-4614-0754-6_16
Published: 29 September 2011
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4614-0753-9
Online ISBN: 978-1-4614-0754-6
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)