Modelling citation networks

Abstract

The distribution of the number of academic publications against citation count for papers published in the same year is remarkably similar from year to year. We characterise the shape of such distributions by a ‘width’, \(\sigma ^2\), associated with fitting a log-normal to each distribution, and find the width to be approximately constant for publications published in different years. This similarity is not surprising: after all, why would papers published in one year be cited more than those published in another? Nevertheless, we show that simple citation models fail to capture this behaviour. We then provide a simple three-parameter citation network model which can reproduce the correct width over time. We use the citation network of papers from the hep-th section of arXiv to test our model. Our final model reproduces the data’s observed ‘width’ when around 20% of the citations in the model are made to recently published papers in the entire network (‘global information’). The remaining 80% of citations are made using the references from these papers’ bibliographies (‘local searches’). We note that this is consistent with other studies, though our motivation, to reproduce the above distribution over time, is very different. Finally, we find that, in the citation network model, varying the number of papers referenced by a new publication is important, as it alters the parameters in the model which are fitted to the data. This is not addressed in current models and needs further work.

Notes

  1. We omit year 2003 from our analysis because it is incomplete.

  2. The bin scale was chosen to ensure there were no empty bins below the bin containing the highest citation values.

  3. There is a small correction to this form due to the possibility that the same vertex i could be chosen more than once, which is excluded in the actual model.

  4. The full Price model may be solved exactly within the mean-field approximation in the infinite time limit. Those solutions are very close to numerical results found for finite-sized simulations such as ours. For \(\langle k^{({\text {in}})} \rangle =12.0\), these formulae give \(z=0.156\) if \(p=0.55\), and we find we need \(p=0.59\) in order to get the same value of \(z=0.169\) found in our data. However, we remove the first thousand papers created in our simulation, so we do not expect an exact match with the theoretical expressions.

  5. Again there is a small correction to this form to allow for the fact that we do not allow the same vertex i to be chosen more than once in model B.

  6. This is different from the ‘half-life’ values referred to later, which are measured from the data.

  7. Note this is why we call it the median half-life: if you plot the number of citations gained by a paper against year and take the median, this is the value we call the median half-life \(T_{\text {med}}\). Thus the value of \(T_{\text {med}}\) for a given paper increases over time. We call it the median half-life rather than a half-life because the half-life, as defined in a process with an exponential decay, is a fixed value, equal to our median half-life only in the limit of an infinitely old paper. We expect any estimate of the half-life of a paper’s citations to be roughly constant, whereas our median half-life \(T_{\text {med}}\) measurement varies from year to year, increasing until it reaches the formal half-life value.

  8. Note that the average in-degree of the full un-reduced arXiv network is 12.82. The average in-degree after two-step declustering is much less, 3.9. The fully transitively reduced network has an even smaller average in-degree of 2.27 (Clough et al. 2014; Goldberg 2013), as expected.

  9. For model C our final comparison tool is the \(\sigma ^2\) plot of model C and the hep-th data. Note that so far we have used z and the number of core papers C as our comparison tools.

  10. This estimate assumes no correlation between in- and out-degree; a better estimate for the average in-degree of core papers chosen using cumulative advantage is \(\langle (k^{({\text {in}})})^2 \rangle /\langle k^{({\text {in}})} \rangle\).

  11. We have also done another check. The proportion of core papers referenced per new publication is therefore, on average, \(\frac{C}{\langle k^{({\text {out}})} \rangle }=\frac{C}{C(1+q\langle k^{({\text {out}})} \rangle )}=\frac{3.9}{12.0} \approx 0.3\) for \(q=0.17\) or \(q=0.20\) (which are the mathematically expected value of q and the q derived from fitting model C to the hep-th data, respectively). In Simkin and Roychowdhury (2007) and de Solla Price (1965) this was found to be 0.1 and 0.15 for their models, respectively. Again, our values are consistent with these as they are low.

References

  1. Bentley, R., Hahn, M., & Shennan, S. (2004). Random drift and culture change. Proceedings of the Royal Society B, 271, 1443–1450.

  2. Brzezinski, M. (2015). Power laws in citation distributions: Evidence from Scopus. Scientometrics, 103(1), 213–228.

  3. Chung, F., Lu, L., Dewey, T. G., & Galas, D. J. (2003). Duplication models for biological networks. Journal of Computational Biology, 10, 677–687.

  4. Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Review, 51, 661–703.

  5. Clough, J. R., & Evans, T. S. (2014). What is the dimension of citation space? arXiv:1408.1274.

  6. Clough, J. R., Gollings, J., Loach, T. V., & Evans, T. S. (2014). Transitive reduction of citation networks. Journal of Complex Networks. arXiv:1310.8224.

  7. de Solla Price, D. J. (1965). Networks of scientific papers. Science, 149, 510–515.

  8. de Solla Price, D. J. (1976). A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27, 292–306.

  9. Dorogovtsev, S., & Mendes, J. (2000). Evolution of networks with aging of sites. Physical Review E, 62, 1842–1845.

  10. Dorogovtsev, S., & Mendes, J. F. F. (2001). Scaling properties of scale-free evolving networks: Continuous approach. Physical Review E, 63, 056125.

  11. Dorogovtsev, S., Mendes, J., & Samukhin, A. (2000). Structure of growing networks with preferential linking. Physical Review Letters, 85, 4633–4636.

  12. Eom, Y.-H., & Fortunato, S. (2011). Characterizing and modeling citation dynamics. PLoS ONE, 6, e24926.

  13. Evans, T. S., Hopkins, N., & Kaube, B. S. (2012). Universality of performance indicators based on citation and reference counts. Scientometrics, 93, 473–495. doi:10.1007/s11192-012-0694-9. arXiv:1110.3271.

  14. Evans, T. S., & Saramäki, J. (2005). Scale-free networks from self-organization. Physical Review E, 72, 026138. doi:10.1103/PhysRevE.72.026138. arXiv:cond-mat/0411390.

  15. Fowler, J. H., & Jeon, S. (2008). The authority of supreme court precedent. Social Networks, 30, 16–30.

  16. Geng, X., & Wang, Y. (2009). Degree correlations in citation networks model with aging. EPL, 88, 38002.

  17. Goldberg, S. R. (2013). Modelling citation networks. figshare. doi:10.6084/m9.figshare.1134542.

  18. Goldberg, S. R., & Evans, T. S. (2012). Universality of performance indicators based on citation and reference counts. figshare. doi:10.6084/m9.figshare.1134544. Retrieved 12 Aug 2014.

  19. Golosovsky, M., & Solomon, S. (2013). The transition towards immortality: Non-linear autocatalytic growth of citations to scientific papers. Journal of Statistical Physics, 151, 340–354.

  20. Hajra, K., & Sen, P. (2005). Aging in citation networks. Physica A, 346, 44–48.

  21. Hajra, K. B., & Sen, P. (2006). Modelling aging characteristics in citation networks. Physica A, 368, 575–582.

  22. KDD Cup. (2003). Network mining and usage log analysis. http://www.cs.cornell.edu/projects/kddcup/datasets.html. Accessed 1 Oct 2012.

  23. Krapivsky, P., & Redner, S. (2001). Organization of growing random networks. Physical Review E, 63, 066123.

  24. Laherrère, J., & Sornette, D. (1998). Stretched exponential distributions in nature and economy: ‘fat tails’ with characteristic scales. The European Physical Journal B-Condensed Matter and Complex Systems, 2, 525–539.

  25. Maslov, S., & Redner, S. (2008). Promise and pitfalls of extending Google’s PageRank algorithm to citation networks. The Journal of Neuroscience, 28, 11103–11105.

  26. Mitzenmacher, M. (2004). A brief history of generative models for power law and lognormal distributions. Internet Mathematics, 1, 226–251.

  27. Newman, M. (2010). Networks: An introduction. New York: Oxford University Press.

  28. Perc, M. (2014). The Matthew effect in empirical data. Journal of The Royal Society Interface, 11, 20140378–20140378.

  29. Pollmann, T. (2000). Forgetting and the ageing of scientific publications. Scientometrics, 47, 43–54.

  30. Radicchi, F., Fortunato, S., & Castellano, C. (2008). Universality of citation distributions: Toward an objective measure of scientific impact. Proceedings of the National Academy of Sciences of the United States of America, 105, 17268–17272.

  31. Redner, S. (1998). How popular is your paper? An empirical study of the citation distribution. The European Physical Journal B-Condensed Matter and Complex Systems, 4, 131–134.

  32. Ren, F.-X., Shen, H.-W., & Cheng, X.-Q. (2012). Modeling the clustering in citation networks. Physica A, 391, 3533–3539.

  33. Saramäki, J., & Kaski, K. (2004). Scale-free networks generated by random walkers. Physica A, 341, 80.

  34. Seglen, P. O. (1992). The skewness of science. Journal of the American Society for Information Science, 43, 628–638.

  35. Simkin, M. V., & Roychowdhury, V. P. (2005a). Copied citations create renowned papers? Annals of Improbable Research, 11, 24–27.

  36. Simkin, M. V., & Roychowdhury, V. P. (2005b). Stochastic modeling of citation slips. Scientometrics, 62, 367–384.

  37. Simkin, M. V., & Roychowdhury, V. P. (2007). A mathematical theory of citing. Journal of the American Society for Information Science and Technology, 58, 1661–1673.

  38. Smolinsky, L., Lercher, A., & McDaniel, A. (2015). Testing theories of preferential attachment in random networks of citations. Journal of the Association for Information Science and Technology. doi:10.1002/asi.23312.

  39. Sternitzke, C., Bartkowski, A., & Schramm, R. (2008). Visualizing patent statistics by means of social network analysis tools. World Patent Information, 30, 115–131.

  40. Stringer, M. J., Sales-Pardo, M., & Amaral, L. A. N. (2008). Effectiveness of journal ranking schemes as a tool for locating information. PLoS ONE, 3(2), e1683.

  41. Stringer, M. J., Sales-Pardo, M., & Amaral, L. A. N. (2010). Statistical validation of a global model for the distribution of the ultimate number of citations accrued by papers published in a scientific journal. Journal of the American Society for Information Science and Technology, 61, 1377–1385.

  42. van Raan, A. F. J. (2001). Two-step competition process leads to quasi power-law income distributions: Application to scientific publication and citation distributions. Physica A, 298, 530–536.

  43. Vázquez, A. (2000). Knowing a network by walking on it: Emergence of scaling. arXiv:cond-mat/0006132.

  44. Vázquez, A. (2001). Statistics of citation networks. arXiv:cond-mat/0105031.

  45. Vázquez, A. (2003). Growing networks with local rules: preferential attachment, clustering hierarchy and degree correlations. Physical Review E, 67, 056104.

  46. Wallace, M. L., Larivière, V., & Gingras, Y. (2009). Modeling a century of citation distributions. Journal of Informetrics, 3, 296–303.

  47. Waltman, L., van Eck, N. J., & van Raan, A. F. J. (2012). Universality of citation distributions revisited. Journal of the American Society for Information Science and Technology, 63, 72–77.

  48. Wu, Y., Fu, T. Z. J., & Chiu, D. M. (2014). Generalized preferential attachment considering aging. Journal of Informetrics, 8, 650–658.

  49. Zhu, H., Wang, X., & Zhu, J. (2003). Effect of aging on network structure. Physical Review E, 68, 056121.

Acknowledgments

We would like to thank James Gollings and James Clough for allowing us to use their transitive reduction code from which we created our own declustering code. We would like to thank Tamar Loach for sharing her results on related projects and M. V. Simkin for discussions about his work.

Author information

Corresponding author

Correspondence to S. R. Goldberg.

Appendix

Fitting procedure

We follow the procedure used in Evans et al. (2012) and Goldberg and Evans (2012). We use logarithmic binning, so that bin b has an integer edge \(c_b \in {\mathbb {Z}}\), with \(c_{b+1}\) equal to \(R c_b\) rounded to the nearest integer or to \((c_{b}+1)\), whichever is the larger, where R is some fixed bin scale chosen to ensure there are no empty bins below the bin containing the highest citation values. The edge of the first bin is chosen to be the lowest integer above the value \(0.1 \langle c \rangle\). In order to make the fit we compare the total count in the bth bin, \(n_b= \sum _{c=c_{b}}^{c_{b+1}} n(c)\), against the expected value

$$\begin{aligned} n_b^{({\text {expect}})} = (1+A) N \int _{c_{b}-0.5}^{c_{b+1}+0.5} {\text{d}}c \,\, \frac{1}{\sqrt{2\pi } \sigma c } \exp \left\{ -\frac{(\ln (c/\langle c \rangle )+(\sigma ^2/2)-B)^2}{2\sigma ^2 } \right\} . \end{aligned}$$
(9)

This gives us a sequence of data and model values which are compared using a non-linear least squares algorithm to give us values for \(\sigma ^2, A\) and B.
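The binning and fit described above can be sketched in Python. This is a minimal illustration rather than the code used for the paper: the function names, the bin scale `R=1.5`, the parameter bounds, and the choice of SciPy's `curve_fit` as the non-linear least squares routine are all assumptions. The integral in Eq. (9) is evaluated as a difference of normal cumulative distribution functions in \(\ln c\), with \(\mu = \ln \langle c \rangle - \sigma ^2/2 + B\).

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm


def log_bin_edges(c_max, mean_c, R=1.5):
    """Integer edges c_b with c_{b+1} = max(round(R * c_b), c_b + 1)."""
    # First edge: lowest integer above 0.1 * <c>.
    edges = [int(np.floor(0.1 * mean_c)) + 1]
    # Keep adding edges until the last one lies above the largest count.
    while edges[-1] <= c_max:
        edges.append(max(int(round(R * edges[-1])), edges[-1] + 1))
    return np.array(edges)


def expected_counts(edges, N, mean_c, sigma2, A, B):
    """Eq. (9): lognormal mass between c_b - 0.5 and c_{b+1} + 0.5."""
    sigma = np.sqrt(sigma2)
    mu = np.log(mean_c) - sigma2 / 2.0 + B
    lo = (np.log(edges[:-1] - 0.5) - mu) / sigma
    hi = (np.log(edges[1:] + 0.5) - mu) / sigma
    return (1.0 + A) * N * (norm.cdf(hi) - norm.cdf(lo))


def fit_lognormal(citations, R=1.5):
    """Least-squares fit of sigma^2, A and B to log-binned citation counts."""
    citations = np.asarray(citations)
    N, mean_c = citations.size, citations.mean()
    edges = log_bin_edges(citations.max(), mean_c, R)
    # n_b sums n(c) over c_b <= c <= c_{b+1}, inclusive as written in the text.
    observed = np.array([np.sum((citations >= lo) & (citations <= hi))
                         for lo, hi in zip(edges[:-1], edges[1:])])

    def model(_, sigma2, A, B):
        return expected_counts(edges, N, mean_c, sigma2, A, B)

    # Bounds are illustrative; they only keep sigma^2 and (1 + A) positive.
    popt, _ = curve_fit(model, np.arange(observed.size), observed,
                        p0=[1.0, 0.0, 0.0],
                        bounds=([1e-3, -0.9, -5.0], [25.0, 10.0, 5.0]))
    return dict(zip(("sigma2", "A", "B"), popt))


if __name__ == "__main__":
    # Synthetic citation counts, for illustration only (not the hep-th data).
    rng = np.random.default_rng(0)
    fake = np.rint(rng.lognormal(mean=2.0, sigma=1.1, size=20000)).astype(int)
    print(fit_lognormal(fake))
```

Applied to the papers of each publication year in turn, a routine of this kind would yield one \(\sigma ^2\) value per year.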

Lognormal out-degree distribution

In all the models above, the citation networks were created by first determining the number of references a new node would make (drawn from a normal distribution with mean 12.0 and standard deviation 3.0 references) and then applying a method for deciding which nodes to reference. However, we found that a lognormal distribution (Fig. 21) fits the out-degree distribution of the hep-th data better than a normal distribution (Fig. 4).

Fig. 21

This is a plot of the out-degree distribution of 27,000 publications from the hep-th arXiv data, in blue, on a log–log plot. Superimposed, in green, is a plot of 27,000 numbers generated by the lognormal distribution fitted to the out-degree of the hep-th arXiv data. We observe that the lognormal is a better fit to the data than the normal in Fig. 4. (Color figure online)
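The two ways of choosing the number of references for a new node can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the lognormal is fitted by taking the mean and standard deviation of \(\ln k\) (an assumed method, since the text does not specify one), and rounding to whole references with a minimum of one is likewise an assumption.

```python
import numpy as np


def fit_lognormal_mle(out_degrees):
    """Return (mu, sigma) of ln(k) for strictly positive out-degrees k."""
    logs = np.log(np.asarray(out_degrees, dtype=float))
    return logs.mean(), logs.std()


def draw_out_degrees(rng, size, mu=None, sigma=None, mean=12.0, sd=3.0):
    """Draw reference counts: lognormal if (mu, sigma) are given, else normal."""
    if mu is not None and sigma is not None:
        k = rng.lognormal(mean=mu, sigma=sigma, size=size)
    else:
        # Normal distribution with mean 12.0 and standard deviation 3.0.
        k = rng.normal(loc=mean, scale=sd, size=size)
    # Assumption: round to whole references and require at least one.
    return np.maximum(np.rint(k), 1).astype(int)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Placeholder sample standing in for the measured hep-th out-degrees.
    measured = draw_out_degrees(rng, 27000, mu=2.3, sigma=0.6)
    mu, sigma = fit_lognormal_mle(measured)
    print(mu, sigma, draw_out_degrees(rng, 5, mu=mu, sigma=sigma))
```

With real data, `measured` would be replaced by the out-degrees of the papers themselves, excluding any papers with no references since the logarithm is undefined for them.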

As further work we used this fitted lognormal to determine the number of references created by a new node in model C and ran it for our final parameters \((p,q)=(0.55,0.20)\) and \(\tau = 200\) papers. For each publication year, the \(\sigma ^2\) of the in-degree distribution of the data is divided by the corresponding \(\sigma ^2\) for this modified model C, and the ratio is plotted against year in Fig. 22. We find that the ratio is close to 1, although not as close as for the original model C (Fig. 20). Therefore the \(\sigma ^2\) plot does depend on the out-degree distribution, contradicting Ren et al. (2012), who say it is ‘innocuous’ to the in-degree distribution of the citation network. Although this out-degree distribution has been observed in the literature (Vázquez 2001), its use in a citation network model is novel.

Fig. 22

This is the ratio of \(\sigma ^2\) associated with the in-degree distribution of papers published in the same year for the data (years 1992, 1993, etc. relabelled 0, 1, etc.) divided by that of the modified model C (where the out-degree is determined by the lognormal distribution described above). The points are consistent with 1.0 within their error bars, so the modified model C is consistent with the data. The modified model C is therefore promising and a significant improvement on models A and B (Figs. 7 and 14, respectively). However, the modified model C is always lower than the data: the points are always above the 1.0 line and are not as close to 1.0 as those of the original model C, which implies that the parameters of model C need to be modified. So modifying the out-degree does change the in-degree, which contradicts Vázquez (2001). We conjecture that by changing the attention span parameter this model’s \(\sigma ^2\) plot could increase to match the data.

Although the \(\sigma ^2\) plot is lower than that of the original model C, we conjecture that by varying the \(\tau\) of model C the \(\sigma ^2\) plot could match that of the data. This may also increase the attention span to something closer to a year, as expected by Simkin and Roychowdhury (2007).

Cite this article

Goldberg, S.R., Anthony, H. & Evans, T.S. Modelling citation networks. Scientometrics 105, 1577–1604 (2015). https://doi.org/10.1007/s11192-015-1737-9

Keywords

  • Complex networks
  • Directed acyclic graphs
  • Bibliometrics
  • Citation networks

Mathematics Subject Classification

  • 91D30