Theory of Citing

Part of the book series: Springer Optimization and Its Applications (SOIA, volume 57)

Abstract

We present empirical data on misprints in citations to 12 high-profile papers. The great majority of misprints are identical to misprints in articles that earlier cited the same paper. The distribution of the numbers of misprint repetitions follows a power law. We develop a stochastic model of the citation process, which explains these findings and shows that about 70–90% of scientific citations are copied from the lists of references used in other papers. Citation copying can explain not only why some misprints become popular, but also why some papers become highly cited. We show that a model where a scientist picks a few random papers, cites them, and copies a fraction of their references accounts quantitatively for the empirically observed distribution of citations.
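
The copying mechanism described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the chapter's own code: the function name `simulate_rcs`, the number of randomly picked papers `m = 3`, the copy probability `p = 0.25`, and the network size are all assumptions chosen for illustration.

```python
import random


def simulate_rcs(n_papers=2000, m=3, p=0.25, seed=0):
    """Return the citation count of every paper in a toy citation-copying model.

    Assumed rules (illustrative only): each new paper picks m distinct earlier
    papers at random, cites them, and copies each of their references
    independently with probability p.
    """
    rng = random.Random(seed)
    references = [set() for _ in range(m)]  # m seed papers with empty reference lists
    citations = [0] * m
    for new in range(m, n_papers):
        refs = set()
        for src in rng.sample(range(new), m):
            refs.add(src)  # cite the randomly picked paper itself
            # copy each entry of its reference list with probability p
            refs.update(r for r in references[src] if rng.random() < p)
        for r in refs:
            citations[r] += 1
        references.append(refs)
        citations.append(0)
    return citations


counts = simulate_rcs()
mean = sum(counts) / len(counts)
print("mean citations per paper:", round(mean, 2))
print("largest citation count:", max(counts))
```

Even though every paper is treated identically, the copying step lets papers that are already in many reference lists gain citations faster, so the largest count ends up well above the mean. Varying `p` changes how strong this effect is.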

Notes

  1. See, for example, the discussion “Scientists Don’t Read the Papers They Cite” on Slashdot: http://science.slashdot.org/article.pl?sid=02/12/14/0115243&mode=thread&tid=134.

  2. Suppose that the number of occurrences of a misprint, $K$, as a function of its rank $r$ (where the rank is determined by the frequency of occurrence, so that the most popular misprint has rank 1, the second most frequent misprint has rank 2, and so on), follows a Zipf law: $K(r) = C / r^{\alpha}$. We want to find the number-frequency distribution, i.e., how many misprints appeared $n$ times. The number of misprints that appeared between $K_1$ and $K_2$ times is $r_2 - r_1$, where $K_1 = C / r_1^{\alpha}$ and $K_2 = C / r_2^{\alpha}$. Therefore, the number of misprints that appeared $K$ times, $N_K$, satisfies $N_K \, dK = -dr$, and hence $N_K = -dr/dK \sim K^{-1/\alpha - 1}$.

  3. Why did this happen? $T$ reaches its maximum when $R$ equals zero. Substituting $R = 0$ in (16.22), we get $T_{MAX} = N(1 - 1/N^{M})$. For paper No. 4 we have $N = 2{,}578$ and $M = D/N = 32/2{,}578$. Substituting these into the preceding equation, we get $T_{MAX} = 239$. The observed value $T = 263$ is therefore higher than the expectation value of $T$ for any $R$. This does not necessarily indicate a discrepancy between the model and experiment; it can be a strong fluctuation. In fact, out of 1,000,000 runs of a Monte Carlo simulation of the MPM with the parameters of this paper and $R = 0.2$, exactly 49,712 runs (almost 5%) produced $T \geq 263$.

  4. There are also misprints where the author, journal, volume, and year are perfectly correct, but the page number is totally different. Probably, in such a case the citer mistakenly took the page number from a neighboring paper in the reference list he was lifting the citation from.

  5. In our initial report [22] we mentioned “over 24 thousand papers.” This number is incorrect, and the reader surely understands the reason: misprints. In fact, out of 24,295 “papers” in that dataset, only 18,560 turned out to be real papers, and 5,735 “papers” turned out to be misprinted citations. These “papers” got 17,382 out of 351,868 citations. That is, every distinct misprint appeared on average three times. As one could expect, cleaning out the misprints led to much better agreement between experiment and theory: compare Fig. 16.8 and Fig. 1 of [22].

  6. If one assumes that all papers are created equal, then the probability of winning $m$ out of $n$ possible citations when the total number of cited papers is $N$ is given by the Poisson distribution: $P = \frac{(n/N)^{m}}{m!} e^{-n/N}$. Using Stirling’s formula, one can rewrite this as $\ln P \approx m \ln\left(\frac{ne}{Nm}\right) - \frac{n}{N}$. After substituting $n = 330{,}000$, $m = 500$, and $N = 18{,}500$ into the above equation, we get $\ln P \approx -1{,}180$, or $P \approx 10^{-512}$.

  7. Sociologist of science Robert Merton observed [19] that when a scientist gets recognition early in his career, he is likely to get more and more recognition. He called it the “Matthew Effect” because the Gospel according to Matthew (25:29) contains the words: “unto every one that hath shall be given”. The attribution of a special role to St. Matthew is unfair. The quoted words belong to Jesus and also appear in the Gospels of Luke and Mark. Nevertheless, thousands of people who did not read the Bible copied the name “Matthew Effect.”

  8. A model that is, from the mathematical perspective, almost identical to the RCS model was proposed earlier in [23]. The only difference is that its authors considered an undirected graph, while the citation graph is directed.

  9. The analysis presented here also applies to the more general case where $m$ is not a constant, but a random variable. In that case $m$ in all of the equations that follow should be interpreted as the mean value of this variable.

  10. Some of these references do not deal with citing, but with other social processes, which are modeled using the same mathematical tools. Here we rephrase the results of such papers in terms of citations for simplicity.

  11. The uncertainty in the value of $\alpha$ depends not only on the accuracy of the estimate of the fraction of citations that go to the previous year’s papers. We also arbitrarily defined a recent paper (in the sense of our model) as one published within a year. Of course, this is correct to an order of magnitude, but the true value can be anywhere between half a year and 2 years.
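
The arithmetic in Notes 2, 3, and 6 can be checked numerically. The script below is a minimal sketch under stated assumptions: the function names are hypothetical, the Zipf parameters in the first function (`alpha = 1`, `c = 1e5`, and the two sample values of K) are illustrative and not taken from the chapter, and only the numbers passed in for Notes 3 and 6 come from the notes themselves.

```python
import math
from collections import Counter


def zipf_number_frequency_exponent(alpha=1.0, c=1e5, k_low=50, k_high=100):
    """Note 2: log-log slope of N_K for a discretized Zipf law K(r) = C / r**alpha.

    alpha, c, k_low and k_high are illustrative assumptions.
    """
    max_rank = int(c ** (1.0 / alpha))  # for larger ranks K(r) drops below 1
    # N_K = number of ranks whose (integer) occurrence count equals K
    counts = Counter(int(c / r ** alpha) for r in range(1, max_rank + 1))
    return math.log(counts[k_high] / counts[k_low]) / math.log(k_high / k_low)


def t_max(n_citations, n_distinct_misprints):
    """Note 3: T_MAX = N * (1 - 1 / N**M) with M = D / N."""
    m = n_distinct_misprints / n_citations
    return n_citations * (1.0 - 1.0 / n_citations ** m)


def log_poisson_stirling(m, n, n_papers):
    """Note 6: ln P ~ m * ln(n e / (N m)) - n / N."""
    return m * math.log(n * math.e / (n_papers * m)) - n / n_papers


def log_poisson_exact(m, n, n_papers):
    """Exact ln P for a Poisson distribution with mean n / N, using lgamma for ln(m!)."""
    lam = n / n_papers
    return m * math.log(lam) - math.lgamma(m + 1) - lam


alpha = 1.0
print("Note 2 slope:", zipf_number_frequency_exponent(alpha=alpha),
      "predicted:", -(1.0 / alpha + 1.0))
print("Note 3 T_MAX:", t_max(2578, 32))
print("Note 6 ln P (Stirling form):", log_poisson_stirling(500, 330_000, 18_500))
print("Note 6 ln P (exact Poisson):", log_poisson_exact(500, 330_000, 18_500))
```

With these inputs the measured slope matches the predicted exponent $-1/\alpha - 1$, $T_{MAX}$ comes out at about 239 (below the observed 263), and both forms of $\ln P$ fall between roughly −1,180 and −1,190, in line with the rounded value quoted in Note 6.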

References

  1. Simkin MV, Roychowdhury VP (2003) Read before you cite! Complex Systems 14: 269–274. Alternatively available at http://arxiv.org/abs/cond-mat/0212043

  2. Simkin MV, Roychowdhury VP (2006) An introduction to the theory of citing. Significance 3: 179–181. Alternatively available at http://arxiv.org/abs/math/0701086

  3. Simkin MV, Roychowdhury VP (2005) Stochastic modeling of citation slips. Scientometrics 62: 367–384. Alternatively available at http://arxiv.org/abs/cond-mat/0401529

  4. Simon HA (1957) Models of Man. New York: Wiley.

  5. Krapivsky PL, Redner S (2001) Organization of growing random networks. Phys. Rev. E 63, 066123; Alternatively available at http://arxiv.org/abs/cond-mat/0011094

  6. Krapivsky PL, Redner S (2002) Finiteness and Fluctuations in Growing Networks. J. Phys. A 35: 9517; Alternatively available at http://arxiv.org/abs/cond-mat/0207107

  7. Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1992) Numerical Recipes in FORTRAN: The Art of Scientific Computing. Cambridge: University Press (see Chapt. 14.3, p.617–620).

  8. Simboli B (2003) http://listserv.nd.edu/cgi-bin/wa?A2=ind0305&L=pamnet&P=R2083. Accessed on 7 Sep 2011

  9. Smith A (1983) Erroneous error correction. New Library World 84: 198.

  10. Garfield E (1990) Journal editors awaken to the impact of citation errors. How we control them at ISI. Essays of an Information Scientist 13: 367.

  11. SPIRES (http://www.slac.stanford.edu/spires/) data, compiled by H. Galic, and made available by S. Redner: http://physics.bu.edu/~redner/projects/citation. Accessed on 7 Sep 2011

  12. Steel CM (1996) Read before you cite. The Lancet 348: 144.

  13. Broadus RN (1983) An investigation of the validity of bibliographic citations. Journal of the American Society for Information Science 34: 132.

  14. Moed HF, Vriens M (1989) Possible inaccuracies occurring in citation analysis. Journal of Information Science 15:95.

  15. Hoerman HL, Nowicke CE (1995) Secondary and tertiary citing: A study of referencing behaviour in the literature of citation analyses deriving from the Ortega Hypothesis of Cole and Cole. Library Quarterly 65: 415.

  16. Kåhre J (2002) The Mathematical Theory of Information. Boston: Kluwer.

  17. Deming WE (1986) Out of the crisis. Cambridge: MIT Press.

  18. Garfield E (1979) Citation Indexing. New York: John Wiley.

  19. Merton RK (1968) The Matthew Effect in Science. Science 159: 56.

  20. Price D de S (1976) A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science 27: 292.

  21. Barabasi A-L, Albert R (1999) Emergence of scaling in random networks. Science 286: 509.

  22. Simkin MV, Roychowdhury VP (2005) Copied citations create renowned papers? Annals of Improbable Research 11:24–27. Alternatively available at http://arxiv.org/abs/cond-mat/0305150

  23. Dorogovtsev SN, Mendes JFF (2004) Accelerated growth of networks. http://arxiv.org/abs/cond-mat/0204102 (see Chap. 0.6.3)

  24. Price D de S (1965) Networks of Scientific Papers. Science 149: 510.

  25. Silagadze ZK (1997) Citations and Zipf-Mandelbrot law. Complex Systems 11: 487

  26. Redner S (1998) How popular is your paper? An empirical study of citation distribution. Eur. Phys. J. B 4: 131.

  27. Vazquez A (2001) Disordered networks generated by recursive searches. Europhys. Lett. 54: 430.

  28. Ziman JM (1969) Information, communication, knowledge. Nature 224: 318.

  29. Günter R, Levitin L, Schapiro B, Wagner P (1996) Zipf’s law and the effect of ranking on probability distributions. International Journal of Theoretical Physics 35: 395

  30. Nakamoto H (1988) Synchronous and diachronous citation distributions. In: Egghe L, Rousseau R (eds) Informetrics 87/88. Amsterdam: Elsevier.

  31. Glänzel W, Schoepflin U (1994) A stochastic model for the ageing of scientific literature. Scientometrics 30: 49–64.

  32. Pollmann T (2000) Forgetting and the aging of scientific publication. Scientometrics 47: 43.

  33. Simkin MV, Roychowdhury VP (2007) A mathematical theory of citing. Journal of the American Society for Information Science and Technology 58: 1661–1673.

  34. Harris TE (1963) The theory of branching processes. Berlin: Springer.

  35. Bentley RA, Hahn MW, Shennan SJ (2004) Random drift and culture change. Proceedings of the Royal Society B: Biological Sciences 271: 1443–1450.

  36. Redner S (2004) Citation Statistics From More Than a Century of Physical Review. http://arxiv.org/abs/physics/0407137

  37. Wright S (1931) Evolution in Mendelian populations. Genetics 16: 97–159.

  38. Simkin MV, Roychowdhury VP (2010) An explanation of the distribution of inter-seizure intervals. EPL 91: 58005

  39. Bak P, Tang C, Wiesenfeld K (1988) Self-organized criticality. Phys. Rev. A 38: 364–374.

  40. Simkin MV, Roychowdhury VP (2008) A theory of web traffic. EPL 62: 28006. Accessed on 7 Sep 2011

  41. Some Statistics about the MR Database http://www.ams.org/publications/60ann/FactsandFigures.html

  42. Burrell Q L (2003) Predicting future citation behavior. Journal of the American Society for Information Science and Technology 54: 372–378.

  43. Garfield E (1980) Premature discovery or delayed recognition -Why? Current Contents 21: 5–10.

  44. Raan AFJ van (2004) Sleeping Beauties in science. Scientometrics 59: 467–472

  45. Alstrøm P (1988) Mean-field exponents for self-organized critical phenomena. Phys. Rev. A 38: 4905–4906.

  46. Bak P (1999) How Nature Works: The Science of Self-Organized Criticality. New York: Copernicus.

  47. Bak P, Sneppen K (1993) Punctuated equilibrium and criticality in a simple model of evolution. Physical Review Letters 71: 4083–4086

  48. Sokal A, Bricmont J (1998) Fashionable Nonsense. New York: Picador.

  49. Hahn MW, Bentley RA (2003) Drift as a mechanism for cultural change: an example from baby names. Proc. R. Soc. Lond. B (Suppl.), Biology Letters, DOI 10.1098/rsbl.2003.0045.

  50. Social Security Administration: Popular Baby Names http://www.ssa.gov/OACT/babynames/. Accessed on 7 Sep 2011

  51. Simkin MV (2007) My Statistician Could Have Painted That! A Statistical Inquiry into Modern Art. Significance 4: 93–96. Also available at http://arxiv.org/abs/physics/0703091

  52. Naftulin DH, Ware JE, Donnelly FA (1973) The Doctor Fox Lecture: A Paradigm of Educational Seduction. Journal of Medical Education 48: 630–635.

  53. Encyclopaedia of Mathematics (Ed. M. Hazewinkel). See: Bürmann–Lagrange series: http://eom.springer.de/b/b017790.htm. Accessed on 7 Sep 2011

  54. Otter R (1949) The multiplicative process. The Annals of Mathematical Statistics 20: 206

Author information

Corresponding author

Correspondence to M. V. Simkin.

Copyright information

© 2012 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Simkin, M.V., Roychowdhury, V.P. (2012). Theory of Citing. In: Thai, M., Pardalos, P. (eds) Handbook of Optimization in Complex Networks. Springer Optimization and Its Applications, vol 57. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-0754-6_16
