Abstract
Collaborations and citations within scientific research grow simultaneously and interact dynamically. Modelling their coevolution helps to study many phenomena that can be approached only by combining citation and coauthorship data. We propose a geometric graph model for this coevolution, whose mechanism expresses the interacting impacts of authors and papers geometrically. The model is validated against a dataset of papers published in PNAS during 2007–2015. The validation shows that the model can reproduce a range of features observed in citation and coauthorship data, both combined and separately. In particular, the empirical distribution of citations per author has two limits, in which it appears as a generalized Poisson distribution and a power law respectively. Our model reproduces the shape of this distribution and explains how the shape emerges from the decisions of authors. The model also captures the empirically observed positive correlations between the numbers of authors’ papers, citations and collaborators.
References
Balietti, S., Goldstone, R. L., & Helbing, D. (2016). Peer review and competition in the art exhibition game. Proceedings of the National Academy of Sciences USA, 113(30), 8414–8419.
Barabási, A. L., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., & Vicsek, T. (2002). Evolution of the social network of scientific collaborations. Physica A, 311(3–4), 590–614.
Börner, K., Maru, J. T., & Goldstone, R. L. (2004). The simultaneous evolution of author and paper networks. Proceedings of the National Academy of Sciences of the United States of America, 101(suppl 1), 5266–5273.
Bornmann, L., & Daniel, H. D. (2009). Universality of citation distributions: A validation of Radicchi et al.’s relative indicator \(c_f = c/c_0\) at the micro level using data from chemistry. Journal of the American Society for Information Science and Technology, 60(8), 1664–1670.
Catanzaro, M., Caldarelli, G., & Pietronero, L. (2004). Assortative model for social networks. Physical Review E, 70, 037101.
Christensen, K., & Moloney, N. R. (2005). Complexity and criticality. London: Imperial College Press.
Consul, P. C., & Jain, G. C. (1973). A generalization of the Poisson distribution. Technometrics, 15(4), 791–799.
de Solla Price, D. J. (1965). Networks of scientific papers. Science, 149(3683), 510–515.
de Solla Price, D. J. (1976). A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27(5), 292–306.
Evans, T. S., Hopkins, N., & Kaube, B. S. (2012). Universality of performance indicators based on citation and reference counts. Scientometrics, 93, 473–495.
Glänzel, W., & Schubert, A. (2004). Analysing scientific networks through co-authorship. Handbook of quantitative science and technology research (pp 257–276).
Glänzel, W. (2002). Coauthorship patterns and trends in the sciences (1980–1998): A bibliometric study with implications for database indexing and search strategies. Library Trends, 50(3), 461.
Glänzel, W. (2011). National characteristics in international scientific co-authorship relations. Scientometrics, 51, 69–115.
Goldberg, S. R., Anthony, H., & Evans, T. S. (2015). Modelling citation networks. Scientometrics, 105, 1577–1604.
Kim, J., & Diesner, J. (2016). Distortive effects of initial-based name disambiguation on measurements of large-scale coauthorship networks. Journal of the Association for Information Science and Technology, 67(6), 1446–1461.
Krioukov, D., Kitsak, M., Sinkovits, R. S., Rideout, D., Meyer, D., & Boguñá, M. (2012). Network cosmology. Scientific Reports, 2, 793.
Kuhn, T., Perc, M., & Helbing, D. (2014). Inheritance patterns in citation networks reveal scientific memes. Physical Review X, 4(4), 041036.
Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the Association for Information Science and Technology, 58(7), 1019–1031.
Mali, F., Kronegger, L., Doreian, P., & Ferligoj, A. (2012). Dynamic scientific coauthorship networks. In A. Scharnhorst, K. Börner, & P. van den Besselaar (Eds.), Models of science dynamics (pp. 195–232). Berlin: Springer.
Martin, T., Ball, B., Karrer, B., & Newman, M.E.J. (2013). Coauthorship and citation in scientific publishing. arXiv:1304.0473.
Milojević, S. (2010). Modes of collaboration in modern science: Beyond power laws and preferential attachment. Journal of the Association for Information Science and Technology, 61(7), 1410–1423.
Milojević, S. (2014). Principles of scientific research team formation and evolution. Proceedings of the National Academy of Sciences USA, 111, 3984–3989.
Moody, J. (2004). The structure of a social science collaboration network: Disciplinary cohesion from 1963 to 1999. American Sociological Review, 69(2), 213–238.
Newman, M. (2004). Coauthorship networks and patterns of scientific collaboration. Proceedings of the National Academy of Sciences of the USA, 101, 5200–5205.
Perc, M. (2010). Growth and structure of Slovenia’s scientific collaboration network. Journal of Informetrics, 4, 475–482.
Perc, M. (2010). Zipf’s law and log-normal distributions in measures of scientific output across fields and institutions: 40 years of Slovenia’s research as an example. Journal of Informetrics, 4(3), 358–364.
Perc, M. (2013). Self-organization of progress across the century of physics. Scientific Reports, 3, 1720.
Perc, M. (2014). The Matthew effect in empirical data. Journal of The Royal Society Interface, 11(98), 20140378.
Radicchi, F., & Castellano, C. (2015). Understanding the scientific enterprise: citation analysis, data and modeling. In Social Phenomena. Springer. (pp 135–151).
Radicchi, F., Fortunato, S., Markines, B., & Vespignani, A. (2009). Diffusion of scientific credits and the ranking of scientists. Physical Review E, 80(5), 056103.
Squazzoni, F., & Gandelli, C. (2012). Saint Matthew strikes again: An agent-based model of peer review and the scientific community structure. Journal of Informetrics, 6(2), 265–275.
Tomassini, M., & Luthi, L. (2007). Empirical analysis of the evolution of a scientific collaboration network. Physica A, 385, 750–764.
Wagner, C. S., & Leydesdorff, L. (2005). Network structure, self-organization, and the growth of international collaboration in science. Research Policy, 34(10), 1608–1618.
Wang, D., Song, C., & Barabási, A. L. (2013). Quantifying long-term scientific impact. Science, 342, 127–132.
Xie, Z., Dong, E. M., Yi, D. Y., Ouyang, Z. Z., & Li, J. P. (2016a). Modelling transition phenomena of scientific coauthorship networks. arXiv:1604.08891.
Xie, Z., Duan, X. J., Ouyang, Z. Z., & Zhang, P. Y. (2015). Quantitative analysis of the interdisciplinarity of applied mathematics. PLoS One, 10(9), e0137424.
Xie, Z., Ouyang, Z. Z., & Li, J. P. (2016b). A geometric graph model for coauthorship networks. Journal of Informetrics, 10, 299–311.
Xie, Z., Ouyang, Z. Z., Liu, Q., & Li, J. P. (2016c). A geometric graph model for citation networks of exponentially growing scientific papers. Physica A, 456, 167–175.
Xie, Z., Ouyang, Z. Z., Zhang, P. Y., Yi, D. Y., & Kong, D. X. (2015). Modeling the citation network by network cosmology. PLoS One, 10(3), e0120687.
Xie, Z., & Rogers, T. (2016). Scale-invariant geometric random graphs. Physical Review E, 93, 032310.
Zhou, T., Wang, B. H., Jin, Y. D., He, D. R., Zhang, P. P., He, Y., et al. (2007). Modeling collaboration networks based on nonlinear preferential attachment. International Journal of Modern Physics C, 18, 297–314.
Acknowledgements
We thank Professor K. Christensen for valuable suggestions on the description of the “cross-over”, and Professor J. Y. Su for proofreading this paper. This work is supported by the teacher training project fund of the National University of Defense Technology (No. 434513512G).
Additional information
Zheng Xie, Zonglin Xie and Miao Li have contributed equally to this work.
Appendix
Detecting boundary for probability density functions
The boundary detection algorithm for probability density functions (PDFs) is listed in Table 6; it is taken from Xie et al. (2016b).
Simplifying the model
An obvious weakness of the proposed model is its large number of parameters. If we do not fit the distribution of references per paper and that of paper-team sizes, the model’s parameters can be reduced to those in Table 7. The reduction does not affect the type of the synthetic distributions of collaborators/papers per author, nor that of citations per paper/per author (Fig. 8).
The underlying formula for the distribution of citations per paper
We analyze only the underlying formula for the distribution type of synthetic “citations per paper” (in-degrees), which is similar to that in Xie et al. (2016c) and Xie and Rogers (2016). The analysis of the formula for the distribution type of synthetic “collaborators per author” is the same as that in Xie et al. (2016b). As shown in Fig. 4c, the synthetic in-degree distribution is a mixture of a generalized Poisson distribution and a power law, hence the formula is analyzed piecewise: the formula for the head and that for the tail are derived separately. The cross-over can be well fitted by the formula in the notes of Table 5.
The in-degrees contributed by the second half of Step 2.b are due to a random selection. Together with the preset small domain of f(x) in this step, the effect of the second half on in-degree distribution is small enough to be ignored, when compared with that contributed by the first half.
The first half of Step 2.b gives a node generated at time t an expected in-degree of \(k^-(t)\approx {\alpha _l}\delta p T^{ {\beta _l} } t^{- {\beta _l} }/\beta _l-1\), where \(l=2,3\) and \(\delta =N_1 /2\pi\). If t is large enough (say, larger than some large number \(T_1\)), \(k^-(t)\) is small and changes slowly with t. Hence the formula for the head is
which is a mixture Poisson distribution. A generalized Poisson distribution can be well fitted by a mixture Poisson distribution, which can be verified numerically.
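The mixture described here can be evaluated numerically. The following is a minimal sketch, not the authors' code: it averages Poisson probabilities over birth times t in \((T_1, T]\), assuming that nodes are generated uniformly in time and that the expected in-degree decays as a pure power of t (the additive constant in \(k^-(t)\) is omitted). The names `scale`, `T1`, `T`, `BETA` and all numerical values are hypothetical and chosen only for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of k events with mean lam (computed in log space)."""
    if lam <= 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def expected_in_degree(t, scale, beta):
    """Power-law decay of the expected in-degree with birth time t.

    `scale` stands in for alpha * delta * p * T**beta / beta; the additive
    constant in the text's expression is omitted here (assumption).
    """
    return scale * t ** (-beta)

def mixture_poisson_pmf(k, t_start, t_end, scale, beta):
    """Average the Poisson pmf over integer birth times t_start..t_end."""
    times = range(t_start, t_end + 1)
    total = sum(poisson_pmf(k, expected_in_degree(t, scale, beta)) for t in times)
    return total / len(times)

# Hypothetical values, chosen only for illustration.
T1, T, SCALE, BETA = 1000, 5000, 2000.0, 0.8
head = [mixture_poisson_pmf(k, T1 + 1, T, SCALE, BETA) for k in range(60)]
mean = sum(k * p for k, p in enumerate(head))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(head))
print(f"mass={sum(head):.6f} mean={mean:.3f} variance={variance:.3f}")
```

Because the Poisson means differ across birth times, the mixture's variance exceeds its mean. A generalized Poisson distribution with a positive second parameter is overdispersed in the same way, which is one reason the two shapes can be close.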
The formula for the tail is derived as follows, where the calculations are inspired by some of the general ideas explored for cosmological networks (Krioukov et al. 2012):
Here the Laplace approximation is used in the third step, and Stirling’s formula is used in the fourth step. When \(k\gg 0\), the integral in Eq. (2) is approximately independent of k, which can be verified as follows:
where \(L_1={\delta \alpha _l( T_1 +1)^{- {\beta _l}}} p T^{ {\beta _l} }/\beta _l\), \(L_2= {\delta \alpha _l } p T^{ {\beta _l} }/\beta _l\), and \(\rho =1+ {1}/{\beta _l}\). This derivative is approximately equal to 0 for \(k\gg 0\). Hence
Stirling’s formula is used in the first approximation. The second approximation holds for \(k\gg 0\). Hence \(P_L(k )\) is approximately a power-law distribution with exponent \(1+ {1}/{\beta _l}\). So we obtain that the in-degree distribution tail of the network generated in Step 2.b is a mixture of power-law distributions with exponents \(1+ {1}/{\beta _1}\) and \(1+ {1}/{\beta _2}\) respectively. Note that in Eq. (1), the condition \(k\gg 0\) does not hold, so the power-law does not emerge in the head of the distribution.
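The tail exponent can also be recovered by a change of variables. The following is a sketch under simplifying assumptions, and is not necessarily the sequence of steps behind Eq. (2): birth times are taken to be uniform on \([1, T_1]\), in-degrees are Poisson with mean \(\lambda(t)=L_2 t^{-\beta_l}\) (the additive constant in \(k^-(t)\) is neglected), and \(\lambda(t)\) and \(\tau\) are symbols introduced here for illustration.

```latex
\begin{align*}
P_L(k) &\approx \frac{1}{T_1}\int_{1}^{T_1}\frac{\lambda(t)^{k}e^{-\lambda(t)}}{k!}\,dt,
  \qquad \lambda(t)=L_2\,t^{-\beta_l},\\
&= \frac{L_2^{1/\beta_l}}{\beta_l T_1\,k!}\int_{\lambda(T_1)}^{L_2}\tau^{\,k-\rho}e^{-\tau}\,d\tau,
  \qquad \tau=\lambda(t),\quad \rho = 1+\frac{1}{\beta_l},\\
&\approx \frac{L_2^{1/\beta_l}}{\beta_l T_1}\,\frac{\Gamma(k+1-\rho)}{\Gamma(k+1)}
  \qquad (\lambda(T_1)\ll k\ll L_2),\\
&\approx \frac{L_2^{1/\beta_l}}{\beta_l T_1}\,k^{-\rho}.
\end{align*}
```

The second line uses \(dt=-\beta_l^{-1}L_2^{1/\beta_l}\tau^{-\rho}\,d\tau\). The third line holds when the peak of the integrand, at \(\tau=k-\rho\), lies well inside the integration range, and the last line is the standard large-k behaviour of a ratio of Gamma functions.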
Flexibility of the model
The proposed model is flexible enough to fit empirical data from different sciences. We have shown that the model can capture specific features of the empirical data PNAS 2007–2015, the papers of which mainly belong to the biological sciences. Here we consider data from the physical sciences: the papers of Physical Review E published during 2007–2016 (PRE 2007–2016). The data are gathered from the Web of Science. Authors are identified by their names on papers.
Synthetic data are generated through the proposed model to capture specific features of PRE 2007–2016. The parameters of the synthetic data are listed in Table 8. Comparisons of statistical indicators and of distributions are shown in Table 9 and Fig. 9 respectively.
TARL model
As constructively suggested by a reviewer, we compare the results of the TARL model with those of the proposed model. The pseudocode of the TARL model in Börner et al. (2004) is reproduced as follows.
- Initialization
  - Generate m “papers” and n “authors” with randomly assigned “topics”;
  - Randomly assign l “authors” to the “papers” within the same “topic”.
- For time \(t=1,2,\ldots ,T\) do:
  - Add s new “authors” with randomly assigned “topics”;
  - Deactivate the “authors” older than h;
  - For each “topic” do:
    - Randomly partition the “authors” within the “topic” into groups with size l;
    - For each group do:
      - Randomly read g “papers” from existing “papers” within the “topic”;
      - “Select a time-slice from (1 to \(t-1\)) with probability given in aging-function” (Börner et al. 2004);
      - Generate a new “paper” and randomly cite k papers (published or cited in this time-slice) from the read “papers” and their references up to w-th level.
The generated connections are restricted to the “papers” and “authors” within the same “topic”. If no aging-function is given, then all “papers” can be “read” equally. We set the number of “topics” to 4 and use no aging-function (so no time-slice). We let \(T=200\), \(g=1\), \(h=T\), \(k=2\), \(w=2\), \(m=n=l=s=4\). The generated distributions of “collaborators”/“papers” per “author” and of “citations” per “paper”/per “author” are shown in Fig. 10.
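The instantiation above can be sketched in a few lines of Python. This is an illustrative reading of the pseudocode, not the original TARL implementation of Börner et al. (2004), and it rests on several assumptions: the time-slice step is skipped (no aging-function); a group or initial team may be smaller than l when too few same-topic “authors” exist; and “references up to w-th level” is read as the read “papers” plus w rounds of following their reference lists. The function names `tarl_sketch` and `degree_counts` are hypothetical.

```python
import random
from collections import Counter

def tarl_sketch(T=200, topics=4, m=4, n=4, l=4, s=4, g=1, h=200, k=2, w=2, seed=0):
    """Simplified TARL-style growth without an aging function (assumption)."""
    rng = random.Random(seed)
    author_topic, author_birth = [], []                   # indexed by author id
    paper_topic, paper_authors, paper_refs = [], [], []   # indexed by paper id

    def add_author(t):
        author_topic.append(rng.randrange(topics))
        author_birth.append(t)

    def add_paper(topic, team, refs):
        paper_topic.append(topic)
        paper_authors.append(list(team))
        paper_refs.append(list(refs))

    # Initialization: n authors, then m papers written by same-topic authors.
    for _ in range(n):
        add_author(0)
    for _ in range(m):
        topic = rng.randrange(topics)
        pool = [a for a, tp in enumerate(author_topic) if tp == topic]
        add_paper(topic, rng.sample(pool, min(l, len(pool))), [])

    for t in range(1, T + 1):
        for _ in range(s):
            add_author(t)
        for topic in range(topics):
            # Authors older than h are deactivated.
            active = [a for a, tp in enumerate(author_topic)
                      if tp == topic and t - author_birth[a] <= h]
            rng.shuffle(active)
            existing = [p for p, tp in enumerate(paper_topic) if tp == topic]
            new_papers = []
            for i in range(0, len(active), l):
                team = active[i:i + l]
                read = rng.sample(existing, min(g, len(existing)))
                # Candidates: the read papers plus references up to level w.
                candidates, frontier = set(read), set(read)
                for _ in range(w):
                    frontier = {r for p in frontier for r in paper_refs[p]}
                    candidates |= frontier
                cited = rng.sample(sorted(candidates), min(k, len(candidates)))
                new_papers.append((topic, team, cited))
            # Papers of this time step become citable only afterwards.
            for paper in new_papers:
                add_paper(*paper)
    return author_topic, paper_authors, paper_refs

def degree_counts(author_topic, paper_authors, paper_refs):
    """Citations per paper, papers per author, and collaborator sets."""
    citations = Counter(r for refs in paper_refs for r in refs)
    papers_per_author = Counter(a for team in paper_authors for a in team)
    collaborators = {a: set() for a in range(len(author_topic))}
    for team in paper_authors:
        for a in team:
            collaborators[a].update(x for x in team if x != a)
    return citations, papers_per_author, collaborators

authors, teams, refs = tarl_sketch()
citations, papers_per_author, collaborators = degree_counts(authors, teams, refs)
print("papers:", len(teams), "authors:", len(authors))
print("max citations per paper:", max(citations.values(), default=0))
print("max papers per author:", max(papers_per_author.values(), default=0))
```

The three returned counters correspond to the quantities whose distributions are compared in the text; with \(h=T\), no “author” is ever deactivated in this sketch.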
The TARL model can generate a “coauthorship” network and a “citation” network that grow simultaneously. The “citation” network is scale-free (caused by recursive linking), and has a positive clustering coefficient (caused by citing the “papers” within the same “topic”). Our model expresses the citation factors considered in the TARL model (i.e. topics, aging and recursive follow-up of citation references) in a unified way, through the connection mechanism induced by the influential zones of “papers”. The aging of papers is expressed by decreasing the sizes of influential zones over t. In the TARL model, “papers” and “authors” are assigned specific “topics” directly. Our model instead uses a continuous representation: a node’s “topic” is expressed by its spatial coordinate, so the circles can be regarded as “topic spaces”. Note that such a circle is not a real topic space, which would be a high-dimensional space representing the textual contents of papers. In our model, a “paper” can incompletely “copy” the references of the “papers” it cites, which is induced by the overlapping of influential zones.
The TARL model considers neither the Matthew effect on the number of authors’ collaborators nor that on papers. In addition, the above instantiation assumes that the number of “papers” per “author” is a constant. Hence, the generated distribution of “papers” per “author” and that of “collaborators” per “author” (Fig. 10) have no power-law tails, whereas such tails emerge in the corresponding distributions from real data (Figs. 4, 9). Our model expresses those Matthew effects geometrically: older leaders have larger influential zones and thus obtain more “collaborators” and “papers”. Therefore, our model can reproduce those power-law tails.
Cite this article
Xie, Z., Xie, Z., Li, M. et al. Modeling the coevolution between citations and coauthorship of scientific papers. Scientometrics 112, 483–507 (2017). https://doi.org/10.1007/s11192-017-2359-1
DOI: https://doi.org/10.1007/s11192-017-2359-1