1 Introduction

The availability of large-scale and time-resolved data sets about economic, scientific or social activities opens new avenues to address the long-standing question of how we collaborate. This question becomes more important as globalization leads to a vast increase of collaborations in many areas of human activity, including science and economics [1–4]. In these areas, progress is mainly generated in collaboration and almost never in isolation. Hence, by understanding how we collaborate, we can re-design funding schemes and policies to allocate resources efficiently, and better foster innovation.

One could argue that collaboration patterns change with respect to the actors and the domain of activity, but there may also be evidence for common features across different domains. In the latter case, we could hypothesize that a unified modeling approach should be able to reproduce, and to explain, the structural and the dynamic features of collaborations in different domains. To demonstrate this is the aim of our paper. In doing so, we provide a new flexible model that allows us to understand collaboration patterns.

The present study is focused on two domains with a large impact on human development, (i) economy and (ii) science. Specifically, we refer to (i) firms collaborating in Research and Development (R&D) alliances and (ii) scientists collaborating in co-authored publications. For both cases, large, comprehensive and structured data sets about individual collaboration activities have become available. The data sets analyzed in this study are (i) the Thomson Reuters SDC Platinum database, listing around 15,000 inter-firm R&D alliances and (ii) a data set of over 300,000 co-authored papers in physics, which was obtained from the American Physical Society (APS) scholars database with additional disambiguation of author names. For the details we refer to Section 3.1.

The time-aggregated data about these collaboration events can be conveniently represented by means of a complex network, where the nodes are the actors, or agents as we denote them in the following, and the links are the recorded collaborations. The structural features of such collaboration networks have already been investigated in different domains. Previous works have, for instance, discussed the presence of clusters, or communities, both in R&D networks of firms [5, 6] and in co-authorship networks of scientists [7]. The existence of such communities also impacts performance criteria [8, 9] and affects knowledge transfer [10, 11] and the ability to innovate [12–14]. Other topological analyses focus on importance measures to characterize nodes [15–17].

However, even the most refined topological characterization of collaboration networks can only constitute a first step toward their comprehensive and systematic understanding. This has to include the mechanisms that shape the structure and dynamics of such networks at the level of nodes, or agents. In particular, we need to identify the rules, or strategies, that agents follow in choosing their collaboration partners - such that at the end the observed collaboration networks emerge.

To combine the empirical analysis with a formal approach of the network formation we have proposed data-driven modeling as a suitable methodology. It is, for the application at hand, comprised of the following four steps: (a) proposition of an agent-based model (ABM) that shall explain the formation of collaboration networks, (b) reconstruction of the collaboration networks using the empirical data from two different domains, (c) calibration of the free parameters of the ABM for each domain by means of the empirical networks, (d) validation of the ABM for each domain by reproducing network features not used for the calibration.

This leaves us with the question of which agent-based models are suitable for use in a data-driven approach. Some ABMs rooted in economics propose a utility function for an agent which weighs costs and benefits of collaborations [12, 18]. Agents create or maintain links only if this mutually increases their utility, and delete existing links otherwise. Such ABMs allow one to prove general features of, e.g., R&D networks such as sparseness or stability, dependent on certain cost functions. But because of theoretical assumptions about the utility function and the partner selection they cannot easily be calibrated against network data. Therefore, we have developed an ABM in the context of R&D collaborations [19] which assumes simple rules of link formation that are followed by agents with certain probabilities (see Section 2 for details). Such probabilities can be calibrated against available network data.

In this paper, we build on the existing ABM [19] which was already applied to R&D alliances [20, 21], but has not been extended to, or validated in, other domains yet. Hence, the goal of this work is twofold. On the one hand, we want to understand whether the same agent-based model can reproduce the topology of both R&D and co-authorship networks. On the other hand, we want to identify similarities and differences - at the microscopic level - with respect to the agents’ choice of collaboration partners. To the best of our knowledge no study has tried yet to unify findings in these two domains and find systematic, reproducible and universal patterns in collaboration networks. This investigation can also provide some evidence to our initial conjecture whether there may be a unified modeling approach for collaboration networks in different domains (see Figure 1).

Figure 1

Visualization of collaboration networks: (a) R&D alliances of firms, (b) co-authorship relations of scientists. For the data sets see Section 3.1. We show the complete R&D network with about 14,000 nodes and 21,000 links, but only a sampled co-authorship network with about 11,000 nodes and 32,000 links (i.e. 10% of randomly chosen co-authors). For both networks we use the layout algorithm of [22].

2 Agent-based model of collaborations

How do economic actors or scientists choose their collaboration partners? At first, one would argue that scientists as decision makers are quite different from firms. In addition, inside their respective domain, how they choose partners may very much depend on the specific economic sector or scientific discipline. Thus, there is no ad-hoc evidence that such a problem can be addressed using the same modeling framework.

On the other hand, in order to reproduce a macroscopic structure such as a collaboration network, we may not need to include all the microscopic details that distinguish economic from social agents. Instead, an agent-based model should abstract from these details, to capture only the essential features of the decision making process. In this sense, we aim at an agent-based model that includes a minimalistic set of microscopic rules. We argue that this agent-based model is correct if it is able to reproduce a specific set of macroscopic properties of the different collaboration networks, namely degree distribution, path length distribution, distribution of community sizes, that are not used for the calibration of the model. At the same time, the agent-based model has to provide degrees of freedom to allow a proper calibration to reflect the differences of the domains in their respective empirical data.

In order to achieve this goal, this study utilizes a previously proposed agent-based model [19] that has the above mentioned features. The model is flexible in that it builds on five probabilities to capture the choice of agents for collaborating with either established nodes or newcomers, which need to be calibrated. Obviously, different sets of probabilities may match the same macroscopic features. In order to distinguish between them, we adopt a Maximum-Likelihood approach that uses the mean degree, the mean path length, and the global clustering coefficient of the resulting collaboration network as quantities to be exactly matched.

In the model, agents represent nodes in a collaboration network and links between nodes represent collaboration events. Each agent is characterized by two individual attributes, activity \(a_{i}\) and label \(l_{i}\). Activity reflects the propensity to participate in a collaboration, while label represents the membership of the agent in a recognized ‘circle of influence’. In other words, it models the belonging of the firm or of the scientist to different groups implicitly defined by shared practices and behaviors. Such a membership attribute is in agreement with the analysis of real-world networks reported by [5, 23]. The agent’s dynamics can be divided into two steps. First, the agent decides with whom to link, which impacts the network topology and, if a newcomer is chosen, the size of the network. Second, she adjusts her label, i.e. she keeps her previous label if she already has one, or she adopts the label of the counterparty if she is a newcomer, or she receives a new label, as discussed below.

2.1 Activation

The model is initialized by assigning individual activities \(a_{i}\) to agents which are sampled without replacement from the empirical distribution of activities (see Section 3.1). Hence, these activities are different for each agent and kept constant in time for the simulation. Next, at each time step, we select an agent to initiate a collaboration with probability \(p_{i}\) proportional to its activity, \(p_{i}= \eta a_{i}\), where η is a rescaling parameter that we fix by imposing that \(\sum_{i} p_{i}\) is equal to the number of collaboration events empirically observed per day.
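The activation rule can be illustrated with a minimal Python sketch. The function names, the toy activities and the reading that every agent is activated independently with probability \(p_{i}\) in each time step are assumptions made for illustration; the text only fixes \(p_{i}=\eta a_{i}\) and the normalization of \(\sum_{i} p_{i}\).

```python
import random

def activation_probabilities(activities, events_per_day):
    """Return p_i = eta * a_i, with eta chosen so that sum(p_i) == events_per_day."""
    eta = events_per_day / sum(activities)
    return [eta * a for a in activities]

def activated_agents(probabilities, rng):
    """Assumed reading: agent i is activated independently with probability p_i."""
    return [i for i, p in enumerate(probabilities) if rng.random() < p]

activities = [0.5, 0.3, 0.2]   # toy activities, not empirical values
probs = activation_probabilities(activities, events_per_day=1.5)
active = activated_agents(probs, random.Random(0))
```

With this normalization the expected number of activations per time step equals the observed number of events per day, provided that no single \(p_{i}\) exceeds one.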

2.2 Non-labeled versus labeled agents

Activated agents can belong to two different groups: (a) newcomers, if they never engaged in a collaboration before, or (b) established agents, if they were already part of a previous collaboration. We distinguish between these groups by means of the agent label \(l_{i}\). Newcomers are non-labeled, \(l_{i}=0\), whereas established agents get a label depending on their first collaboration, \(l_{i}>0\).

2.3 Collaboration size

When an agent is activated, she initiates a collaboration. The number of partners for her collaboration, \(m_{i}\), is obtained by sampling at random from the empirical size distribution of collaborating groups (see Section 3.1). The selection of partners is independent of the activity or other characteristics of the agent.

2.4 Collaboration partners

Given the size of the collaboration, the initiator chooses partners either from the group of newcomers or from the group of established agents. This choice also depends on the label of the initiator herself and can be expressed by five probabilities. A labeled initiator links to another agent with the same label with probability \(p^{L}_{s}\), to an agent with a different label with probability \(p^{L}_{d}\), or to an agent without any label with probability \(p^{L}_{n}\). If the initiator is a newcomer, i.e. non-labeled, she links to a labeled agent with probability \(p^{N L}_{l}\) and to another newcomer with probability \(p^{N L}_{n}\). Because the probabilities have to sum up to one, we have two constraints, \(p^{L}_{s}+p^{L}_{d}+p^{L}_{n}=1\) and \(p^{N L}_{n}+p^{N L}_{l}=1\).
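A short Python sketch may help to make the five probabilities concrete. The dictionary keys, group names and numerical values below are hypothetical and chosen only for illustration; they are not calibrated values.

```python
import random

def choose_partner_group(initiator_labeled, probs, rng):
    """Pick the group from which one partner is drawn.
    probs holds the five model probabilities under hypothetical keys."""
    if initiator_labeled:
        groups = ["same_label", "different_label", "newcomer"]
        weights = [probs["pL_s"], probs["pL_d"], probs["pL_n"]]
    else:
        groups = ["labeled", "newcomer"]
        weights = [probs["pNL_l"], probs["pNL_n"]]
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("probabilities for this initiator type must sum to one")
    return rng.choices(groups, weights=weights, k=1)[0]

# Illustrative values only
probs = {"pL_s": 0.6, "pL_d": 0.1, "pL_n": 0.3, "pNL_l": 0.7, "pNL_n": 0.3}
rng = random.Random(1)
picks = [choose_partner_group(True, probs, rng) for _ in range(5)]
```

Because of the two constraints, only three of the five probabilities are independent, which is why the check on the sum is applied separately for labeled and non-labeled initiators.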

2.5 Link formation

The probabilities to choose collaboration partners only consider the two groups, newcomers and established agents. To specify which of the specific agents from these groups are chosen, we adopt the preferential attachment rule. Precisely, the initiator i selects, among all agents from the specific group, agent j as collaborator with a probability proportional to the degree \(k_{j}\) of j. If the initiator chooses a non-labeled agent (\(k_{j}=0\)) as collaborator, she will select uniformly at random from all non-labeled agents. After selecting the \(m_{i}\) partners, we link the initiator and all partners to one another, this way creating a clique of size \(m_{i}+1\).
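The partner selection within a group and the subsequent link formation can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the pairwise linking follows from reading the event as a clique, and repeated links between the same pair are assumed to be stored only once.

```python
import random
from itertools import combinations

def pick_partner(candidates, degree, rng):
    """Preferential attachment: choose j with probability proportional to k_j.
    If every candidate has degree zero (newcomers), choose uniformly at random."""
    weights = [degree.get(j, 0) for j in candidates]
    if sum(weights) == 0:
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]

def add_clique(edges, degree, members):
    """Link all members pairwise and update degrees for new links only."""
    for u, v in combinations(sorted(members), 2):
        if (u, v) not in edges:
            edges.add((u, v))
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1

edges, degree = set(), {}
add_clique(edges, degree, ["a", "b", "c"])   # initiator "a" with two partners
partner = pick_partner(["b", "c", "d"], degree, random.Random(0))
```

In this toy example agent "d" has degree zero and therefore cannot be selected by preferential attachment as long as candidates with positive degree are present.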

2.6 Label dynamics

In our model, agents are initialized as non-labeled agents, i.e. they are considered as newcomers. An agent receives a label only when entering the network (which may consist of disconnected communities). This can happen in two different ways: either the agent initiates a collaboration, or the agent is chosen as partner by an activated agent. In the first case, the agent gets a new label assigned that was not used before. In the second case, the agent adopts the label from the initiator of the collaboration. The label is a unique attribute of an agent, i.e. once an agent has obtained a label, this cannot be changed.

Let us emphasize that labels are dynamically generated during the computer simulations. This implies that the total number of distinct labels varies during each simulation and from one simulation to another.
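The label rules can be written down compactly. The following Python sketch is an illustration under stated assumptions: the function name is hypothetical and labels are represented by integers, with 0 for non-labeled agents as in Section 2.2.

```python
from itertools import count

def update_labels(labels, initiator, partners, new_label_ids):
    """Apply the label rules after one collaboration event.
    labels maps agent -> label, with 0 meaning non-labeled (newcomer)."""
    if labels.get(initiator, 0) == 0:
        labels[initiator] = next(new_label_ids)   # fresh label, never used before
    for j in partners:
        if labels.get(j, 0) == 0:
            labels[j] = labels[initiator]         # newcomer adopts the initiator's label
        # established partners keep their label: it never changes once set

labels = {"a": 0, "b": 0, "c": 0, "d": 0}
ids = count(1)
update_labels(labels, "a", ["b"], ids)        # a gets label 1, b adopts it
update_labels(labels, "c", ["b", "d"], ids)   # c gets label 2, b keeps 1, d adopts 2
```

Since a new label is created whenever a newcomer initiates a collaboration, the number of distinct labels depends on the random sequence of activations.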

Figure 2 summarizes the agent-based model described above. It illustrates the possible choices for the two different groups, newcomers and established agents. We note again that this choice progresses in three steps: First, activated agents choose (m times) between newcomers and established agents as partners. Subsequently, if activated agents already have a label assigned, they have the choice between the group with the same label or groups with a different label. Finally, within the groups, agents choose their partners with respect to their degree. Obviously, the number of agents in each group and the degree of agents change dynamically as the network evolves.

Figure 2

Two representative examples of collaboration selection and of label propagation. (a) A labeled agent (whose label is depicted in green) is activated at time t and has to form an alliance with \(m=2\) partners. She links to an agent having a different label (depicted in yellow) and one non-labeled, at time \(t+dt\). (b) Likewise, a non-labeled agent gets activated at time t and forms an alliance with \(m=2\) partners. She links with one non-labeled agent and one labeled (yellow) agent at time \(t + dt\).

3 Model calibration

3.1 Data sources

Our agent-based model, as already mentioned, will be calibrated and validated against data sets from two different domains, covering inter-firm R&D alliances and co-authorship of scientific papers. In the following, we describe the two data sets and afterwards how they are used as input for the model.

3.1.1 R&D network

To reconstruct the R&D network of collaborating firms we use the SDC Platinum database. It contains data about approximately 672,000 announced alliances from all countries between 1984 and 2009 with daily resolution. The economic actors participating in these alliances are of several types, e.g. investors, manufacturing firms and universities, but for simplicity we address them as firms. Each actor listed in the data set is associated with a SIC (Standard Industrial Classification) code that allows us to unambiguously assign its corresponding industrial sector. Further, the purpose of each alliance is characterized by various flags, e.g. manufacturing, licensing, research and development (R&D). We restrict ourselves to all alliances with the flag ‘R&D’, which gives us 14,829 alliances connecting 14,561 firms. The number of partners involved in each alliance can vary (see Section 3.2 for details). In most cases the alliance size is two; however, it can also be three or higher.

In order to reconstruct the R&D network, we focus on the time-aggregated data set. Each firm engaged in an R&D alliance becomes a node and undirected links connect nodes involved in the same alliance. By adopting this procedure, the 14,829 R&D alliances result in a total of 21,572 links connecting 14,561 nodes. To compare collaborations in different industrial sectors, we reconstruct six distinct R&D networks for the six largest industrial sectors. According to our data set, these are related to computer software, pharmaceuticals, R&D laboratory and testing, computer hardware, electronic components and communications equipment. An alliance is considered as part of a given sector if one of the collaborating firms has a matching SIC code. The details for the sectoral networks are given in Table 1. Additionally, we compare these sectoral networks with an aggregated R&D network, previously analyzed by [19], which was obtained by considering all the R&D alliances together, i.e. more than just the six largest industrial sectors.
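The reconstruction procedure can be sketched in a few lines of Python. The function names and the toy events are assumptions for illustration; in particular, we assume that repeated collaborations between the same pair of nodes collapse into a single undirected link.

```python
from itertools import combinations

def build_network(events):
    """Time-aggregated projection: every event links all of its participants pairwise.
    Repeated collaborations between the same pair collapse into one undirected link."""
    nodes, links = set(), set()
    for participants in events:
        nodes.update(participants)
        for u, v in combinations(sorted(set(participants)), 2):
            links.add((u, v))
    return nodes, links

def average_degree(n_nodes, n_links):
    """Average degree <k> = 2L / N of an undirected network."""
    return 2 * n_links / n_nodes

events = [("A", "B"), ("B", "C", "D"), ("A", "B")]   # toy events, not SDC data
nodes, links = build_network(events)
k_toy = average_degree(len(nodes), len(links))
k_rd = average_degree(14561, 21572)   # node and link counts quoted above
```

For the aggregated R&D network, the quoted counts give an average degree of about 2.96.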

Table 1 Number of nodes N, links L, collaboration events E and average degree \(\langle k \rangle^{\mathrm{OBS}}=2L/N\) for the aggregated R&D network, the six largest sectoral R&D networks, and the six representative co-authorship networks

3.1.2 Co-authorship network

To reconstruct the collaboration network of scientists, we use the data set from the American Physical Society (APS) about papers published in any APS journal, namely Physical Review Letters, Reviews of Modern Physics, and all Physical Review journals. From this data set we use the PACS codes of the papers to assign the papers to different research areas. We restrict ourselves to six specific PACS codes (more details follow) and to the period from 1984 to 2009, for which we use the time-aggregated data. In this way, we analyze the same time range for both the R&D and the co-authorship data.

This data set has the limitation that the authors are identified by strings which often contain inconsistencies, e.g. missing special characters or spelling mistakes. Thus, in order to make use of the APS data set, we have to disambiguate author names in a separate, but time-consuming, data processing step. The latter involves matching the titles of the papers in the APS data set with the Microsoft Academic Search (MSAS) service, where both papers and authors have unique identifiers. The MSAS is a search engine which mines data from a bibliographic database containing information about scholars and their publications from 15 different disciplines. We have used the Application Programming Interface (API) of MSAS to obtain information about scholars publishing in APS journals. This way, we obtain a list of unique authors that we can use.

It is worth noting that the matching procedure at the article level was not perfect. About 27% of the articles were not matched. These unmatched articles often had titles containing special characters needed to write LaTeX formulas and/or Greek letters. This problem affected mainly papers belonging to PACS 42. Among the matched articles we have sampled at random 100 articles and checked the authors’ lists. We have found that these lists were correct 89% of the time. The most common error was that one or two authors’ names were missing from the authors’ lists. More details about the coverage of MSAS and the accuracy of the disambiguation procedure are given in Appendix 1.

To reconstruct the co-authorship network, each unique author is represented by a node and links connect nodes that have co-authored at least one paper in the aggregated data set. Following this procedure, the 73,000 papers listed in the data set result in 300,000 links connecting 95,000 nodes.

In contrast to the R&D networks, where firms are characterized by SIC codes, authors are not associated with any classification. Authors can change their research subject during their career, thus making a categorization on the author level difficult. Instead, the classification, i.e. the PACS number, is assigned to the links of the network representing the papers. For this reason, we build co-authorship networks of different fields by using the PACS numbers assigned to papers. In order to have co-authorship networks comparable in size and density with the R&D networks, we select the following six representative PACS numbers: 03 (quantum mechanics, field theories and special relativity), 04 (general relativity and gravitation), 42 (optics), 72 (electronic transport in condensed matter), 74 (superconductivity) and 89 (other areas of applied and interdisciplinary physics, which for example include network theory). We report the sizes of these networks in Table 1.

3.2 Input quantities

Based on the two data sets, we now calculate the two empirical inputs needed for our agent-based model, namely the size distribution of the collaboration events and the activity distribution of the agents.

3.2.1 Size of collaboration events

In the SDC alliance data set, the size of a collaboration event is the number of firms per R&D alliance, while in the co-authorship data set it is the number of co-authors per paper. To study these, we analyzed the distributions of partners per collaboration event, \(P(m)\), in both considered data sets.

With respect to our six sectoral R&D networks, we find that the size distribution is right-skewed with values ranging between 2 and 20. It should be noted that the identification of the functional form of these distributions (e.g., power-law, exponential, log-normal and so on) is outside the scope of this study; therefore we leave it as a possible extension. Most of the collaborations are stipulated between two partners, but some alliances - the so-called consortia - involve three or more partners. In Figure 3 we report such distributions for two representative industrial sectors. Results for four more industrial sectors are presented in Appendix 1, confirming that the right-skewed distribution holds for all sectoral R&D networks, with only small differences in the tails of the respective distributions. These results are in line with the ones presented in [19] for the aggregated R&D network.

Figure 3

Distribution of the size of collaboration events. For two representative industrial sectors, computer software (top left) and pharmaceuticals (top right), distribution of the number of partners per alliance as measured from the SDC data set. For two representative co-authorship networks, superconductivity (bottom left) and interdisciplinary physics (bottom right), distribution of the number of authors per paper as measured from the APS-MSAS data set.

Regarding the size of scientific collaborations, we find results similar to the R&D alliances, i.e., most papers in our APS-MSAS data set have two co-authors, with a broad right-skewed size distribution for all PACS numbers investigated. From our analysis, we have excluded all papers written by only one author because we are interested in collaboration networks, whereas such papers would only generate isolated nodes. Also, in every economic data set on inter-firm alliances, a collaboration of size 1 could not exist by definition. Hence, for the purpose of comparing R&D and co-authorship networks, we do not consider single-author papers and the size of the collaboration events starts from 2 in all of our plots. Figure 3 gives representative examples from two PACS numbers. Unlike the sectoral R&D networks, the co-authorship networks exhibit a larger degree of variability among PACS numbers. This is due to the fact that the typical number of authors per paper strongly depends on the field. To give an example, the field of applied and interdisciplinary physics is characterized by significantly fewer authors per paper (at most 10) than the field of general relativity and gravitation (whose right tail reaches 55 authors per paper). In Figure 11 and Figure 12 in Appendix 1, we show the distribution of collaboration sizes for the six sectoral R&D networks and the six co-authorship networks, respectively.

3.2.2 Agents’ activity

Activity is one of the two key attributes assigned to agents in our model. We apply a measure developed in the setting of temporal networks [24], which has already been used to analyze various data sets [25–27], also in the context of R&D and co-authorship networks [19, 28].

Following these approaches, we argue that activity reflects the propensity of an agent to participate in a collaboration event. Precisely, we define the empirical activity of an agent i at time t as the number of collaboration events, \(e^{\Delta t}_{i,t}\), involving agent i during a time window Δt ending at time t divided by the total number of collaboration events, \(E^{\Delta t}_{t}\), involving any agent during the same period of time:

$$ a^{\Delta t}_{i,t} = \frac{{e^{\Delta t}_{i,t}}}{ {E^{\Delta t}_{t}}}. $$
(1)
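As an illustration of Eq. (1), the following Python sketch computes the empirical activity from a list of dated collaboration events. The function name, the yearly time stamps and the toy events are assumptions made for this example; the time window is taken to cover the Δt years ending at t.

```python
from collections import Counter

def empirical_activity(events, t, window):
    """events: list of (year, participants).
    Returns a_i for the window of `window` years ending at year t."""
    in_window = [p for year, p in events if t - window < year <= t]
    total = len(in_window)   # E: all events in the window
    # e_i: number of events in the window that involve agent i
    counts = Counter(i for p in in_window for i in set(p))
    return {i: c / total for i, c in counts.items()} if total else {}

events = [
    (2007, ("A", "B")),
    (2008, ("A", "C")),
    (2009, ("B", "C", "D")),
    (2001, ("A", "D")),      # outside the window used below
]
a = empirical_activity(events, t=2009, window=3)
```

Note that the activities defined in this way do not sum to one over all agents, because every collaboration event involves at least two agents.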

For both the SDC alliance and APS-MSAS data sets, we measure the empirical distribution of activity, \(P(a)\), for four different time windows, \(\Delta t=1,5,10\) and 26 years. When the time window is shorter than 26 years (the entire data set observation period), we compute the activity by shifting the time window in 1-year increments and then we average the results. For simplicity, from now on, we will write \(a^{\Delta t=26 \mathrm{\ years}}_{i,2009}\) as \(a_{i}\), which is the activity over the longest time window. Interestingly, we find that these distributions are independent of the size of the time window, which is a robust feature for both R&D and co-authorship collaborations. In Figure 4, we report these results for two representative sectoral R&D networks and two representative co-authorship networks. For a visualization of the complete results for the six sectoral R&D networks see [19] (Supplementary information) and for the six co-authorship networks see Figure 13 in Appendix 1.

Figure 4

Complementary cumulative distribution function (CCDF) of activities. At the top, CCDF of the empirical firm activities, measured for two representative industrial sectors (from the SDC data set, [19] Supplementary information). At the bottom, CCDF of the empirical author activities, measured for two representative co-authorship networks (from the APS-MSAS data set). We consider four different time windows Δt of 1, 5, 10 and 26 years.

3.3 Implementation and optimal model selection

To reproduce the collaboration networks from the two domains, we implement the agent-based model described in Section 2. For the simulations, we take the number of agents, N, and the total number of collaboration events, E, from the respective empirical networks. The two input parameters, size of the collaboration event, \(m_{i}\), and agent activity, \(a_{i}\), are obtained by sampling from the above distributions, \(P(m)\) and \(P(a)\). With that, the only free parameters in our model are the five probabilities \(p_{s}^{L}\), \(p_{d}^{L}\), \(p_{n}^{L}\), \(p_{l}^{NL}\), \(p_{n}^{NL}\), which we vary in order to find which combination gives the best match between the simulated and the observed network. For more information about the exploration of the parameter space see Appendix 2. For the comparison we use the following quantities: average degree, \(\left \langle k \right \rangle \), average path length, \(\left \langle l \right \rangle \), and global clustering coefficient, C, and define the respective relative errors \(\varepsilon _{ \langle k \rangle}\), \(\varepsilon _{ \langle l \rangle}\) and \(\varepsilon _{C} \) between the observed and the simulated quantities. We require that each of these errors be smaller than a threshold \(\varepsilon _{0}\). For all probability combinations we perform 25 simulations. We then select the combination that gives us the highest fraction of networks that match the criterion \(\varepsilon <\varepsilon _{0}\). The optimal probabilities are indicated using a star (e.g. \(p_{s}^{*L}\)).
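The selection criterion can be illustrated with a minimal Python sketch. The function names, the dictionary keys, the toy values and the threshold of 0.05 are assumptions for illustration only; the sketch merely shows how the fraction of matching simulations is computed and compared across probability combinations.

```python
def relative_error(simulated, observed):
    """Relative error between a simulated and an observed quantity."""
    return abs(simulated - observed) / abs(observed)

def match_fraction(runs, observed, threshold):
    """Fraction of simulated networks whose three relative errors are all below threshold.
    runs: list of dicts with hypothetical keys 'k', 'l' and 'C'."""
    hits = sum(
        all(relative_error(run[q], observed[q]) < threshold for q in ("k", "l", "C"))
        for run in runs
    )
    return hits / len(runs)

def best_combination(results, observed, threshold):
    """results maps a probability combination to its list of simulated runs."""
    return max(results, key=lambda combo: match_fraction(results[combo], observed, threshold))

observed = {"k": 3.0, "l": 5.0, "C": 0.4}   # toy target values
results = {
    (0.6, 0.1, 0.3, 0.7, 0.3): [{"k": 3.1, "l": 5.2, "C": 0.41}, {"k": 2.9, "l": 4.9, "C": 0.39}],
    (0.3, 0.3, 0.4, 0.5, 0.5): [{"k": 4.5, "l": 5.1, "C": 0.40}, {"k": 3.0, "l": 5.0, "C": 0.40}],
}
best = best_combination(results, observed, threshold=0.05)
```

In this toy example both runs of the first combination satisfy the criterion, whereas only one run of the second does, so the first combination is selected.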

In Table 2 we report the optimal set of probabilities for the collaboration networks from the two different domains. The networks simulated using the optimal set of probabilities will be called optimal simulated networks. In Table 3 in Appendix 2, we report \(\left \langle k \right \rangle \), \(\left \langle l \right \rangle \) and C of the optimal simulated networks, which can be compared with the respective values for the observed networks. With this, we are set for the validation of our agent-based model, which of course has to include features of the network that were not used for the calibration of the model.

Table 2 Optimal sets of probabilities to simulate the collaboration networks
Table 3 Summary of average statistics for the empirical and optimal simulated networks

4 Model validation

4.1 Reproducing four distributions

To validate our agent-based model, we compare the statistical properties of the empirical networks with those of the networks simulated using the optimal set of probabilities. For the comparison, we use macroscopic features such as distributions of degrees, path lengths, local clustering coefficients and sizes of the disconnected components. Additionally, we also investigate microscopic, or agent-centric, features such as labels. The validation procedure is similar to the one described in [19]. We emphasize that for the calibration we did not use information about these distributions, but only about the respective average values, \(\left \langle k \right \rangle \), \(\left \langle l \right \rangle \) and C, to calculate the relative errors.

Figure 5 and Figure 6 show these distributions for one representative sectoral R&D network and one co-authorship network. We observe a remarkable match between the simulated and the empirical distributions for all four quantities. In particular, the model reproduces the emergence of a giant component in both networks, together with many smaller components down to size two.

Figure 5

Distributions of node degrees (a), path lengths (b), local clustering coefficients (c) and component sizes (d) for the real and the 25 optimal simulated networks in ‘Pharmaceuticals’ (SIC code 283). The blue circles in our plots correspond to the mean values and the error bars correspond to the standard deviations of all the quantities we analyze on the 25 realizations of each optimal simulated collaboration network. In many cases, the error bars are not visible, because the values are very narrowly distributed across these 25 realizations.

Figure 6

Distributions of node degrees (a), path lengths (b), local clustering coefficients (c) and component sizes (d) for the real and the 25 optimal simulated networks in applied and interdisciplinary physics (PACS number 89).

4.2 Community structures and groups of influence

The second part of our validation regards the modular structure of the collaboration networks in terms of communities. We start by evaluating and comparing the community structure of the observed networks and of the simulated ones using the optimal set of probabilities. Then, we verify that the groups of influence defined by the agents’ labels well reproduce the community structure of the simulated networks.

4.2.1 Community structure of empirical and simulated networks

To detect the community structure in the observed networks, we employ a widely used algorithm, Infomap [29], which is based on the probability flow of random walks on networks. In Table 4 in Appendix 3, we report the number of communities found in each network. In Figure 7(a), we give a visual representation of the respective communities in the co-authorship network in applied and interdisciplinary physics.

Figure 7

Co-authorship network in applied and interdisciplinary physics (PACS number 89). (a) Visual representation of the empirical network, considering only the 30 largest clusters detected by the Infomap algorithm. Distinct clusters are represented by grouping nodes in distinct regions of the plot area. (b) Visual representation of one realization of the simulated network, considering only the 30 largest clusters detected by the Infomap algorithm. Distinct clusters are represented by node groups in distinct regions of the plot area. In addition, we depict our node labels by using different colors: most of the nodes in a given cluster share the same label.

Table 4 Modular properties for the aggregated R&D network, the six largest sectoral R&D networks, and the six representative co-authorship networks

In order to quantify the goodness of the community partitions detected by Infomap, we use a normalized modularity score Q. This coefficient is equal to 1 when all links connect only nodes belonging to the same community, equal to 0 for a network where links are placed randomly, and equal to −1 when links are formed only among nodes populating distinct communities. Interestingly, we find that all the R&D and co-authorship networks are characterized by a high modularity as reported in Table 4 in Appendix 3. Precisely, all the Q scores for partitions originated by Infomap are significantly higher than the equivalent scores on randomly generated networks with the same degree sequence, especially in the domain of co-authorship networks. We can safely conclude that our high Q values are indicative of a real modular structure, and not a simple artifact of the network’s size and density [30].
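To illustrate how a modularity score is obtained from a partition, the following Python sketch computes the standard, unnormalized Newman modularity, \(Q=\sum_{c} [L_{c}/L - (d_{c}/2L)^{2}]\), where \(L_{c}\) is the number of links inside community c and \(d_{c}\) the total degree of its nodes. This is an assumption made for illustration: the normalization applied to obtain the score used in this study is not reproduced here, and the function name and the toy network are hypothetical.

```python
from collections import defaultdict

def modularity(links, community):
    """Standard (unnormalized) Newman modularity of a partition of an undirected network."""
    n_links = len(links)
    internal = defaultdict(int)     # links with both ends in community c
    degree_sum = defaultdict(int)   # total degree of the nodes in community c
    for u, v in links:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(
        internal[c] / n_links - (degree_sum[c] / (2 * n_links)) ** 2
        for c in degree_sum
    )

# Toy network: two triangles joined by a single link
links = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
q = modularity(links, community)
```

For the toy network the partition into the two triangles gives \(Q = 5/14 \approx 0.36\), while placing all nodes in a single community gives \(Q=0\).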

To detect the community structure of the simulated networks, we employ the same procedure described above. We visualize the partitioning detected for one realization of the simulated co-authorship network in applied and interdisciplinary physics in Figure 7(b). The simulated distributions of cluster sizes match their empirical counterparts, which is far from trivial given that no information about the community structure was used for the calibration. We report this result for the ‘Pharmaceuticals’ R&D network in Figure 8(a), and for the co-authorship network in applied and interdisciplinary physics in Figure 8(b).

Figure 8

Size distribution of (i) the circles of influence in the 25 realizations of the optimal simulated network, (ii) the Infomap clusters in the 25 realizations of the optimal simulated network and (iii) the Infomap clusters in the empirical network for ‘Pharmaceuticals’ (SIC code 283) (a) and ‘Applied and interdisciplinary physics’ (PACS number 89) (b).

Further evidence of their similarity is the modularity score of the optimal simulated networks: \(Q^{*}=0.61 \pm0.01\) for the Pharmaceuticals R&D network, and \(Q^{*}=0.87 \pm0.01\) for the co-authorship network in interdisciplinary physics. These values are close to their empirical equivalents, 0.62 and 0.92 respectively. In all cases, the modularity scores are significantly greater (with a p-value computationally indistinguishable from zero) than those obtained for a set of 100 randomly generated networks with the same degree sequence, showing that the obtained modularity cannot be explained by the degree sequence alone.
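The comparison with degree-preserving random networks can be sketched as follows. This is a minimal sketch under assumptions: the random networks are generated by double edge swaps, which is one common way to keep the degree sequence fixed and not necessarily the procedure used for the results above, and the p-value is a simple rank-based estimate. All function names are hypothetical.

```python
import random


def degree_sequence(edges):
    """Map every node to its degree in an undirected edge list."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree


def degree_preserving_rewire(edges, n_swaps, seed=None):
    """Randomize a simple undirected network by double edge swaps.

    Each accepted swap replaces links (a, b) and (c, d) by (a, d) and (c, b),
    so every node keeps its degree. Swaps that would create a self-loop or a
    duplicate link are rejected. Assumes the input has no self-loops or
    duplicate links and contains at least two links.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:  # bounded number of tries
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:      # choose one of the two possible pairings
            c, d = d, c
        if a == d or c == b:
            continue                # would create a self-loop
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue                # would create a duplicate link
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


def empirical_p_value(observed, null_values):
    """Share of null scores at least as large as the observed one (+1 corrected)."""
    exceed = sum(1 for value in null_values if value >= observed)
    return (exceed + 1) / (len(null_values) + 1)


# Toy example: 100 randomized copies of a small network keep its degree sequence.
toy = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
nulls = [degree_preserving_rewire(toy, n_swaps=20, seed=s) for s in range(100)]
assert all(degree_sequence(g) == degree_sequence(toy) for g in nulls)
```

In such a comparison, the modularity score of the empirical or simulated network would be passed as `observed` and the scores of the randomized copies as `null_values`. With 100 copies, the smallest value this estimate can return is 1/101.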

4.2.2 Community structure using the agents’ labels

To estimate the overlap between the communities detected by the Infomap algorithm and the groups of influence defined by our agents’ labels, we use the normalized mutual information coefficient \(I_{\mathrm{norm}}\) [31]. We find that labels are able to reproduce the community structures of collaboration networks from both the economic and the scientific domains: we obtain \(I_{\mathrm{norm}}(\mathrm{Labels,~Infomap~clusters}) = 0.887 \pm0.003\) for the ‘Pharmaceuticals’ R&D network, and \(I_{\mathrm{norm}}( \mathrm{Labels,~Infomap~clusters}) = 0.952 \pm0.002\) for the co-authorship network in interdisciplinary physics. This result is even more remarkable if we consider that the Infomap algorithm detects structural clusters based on the probability flow of random walks in the network, while our label propagation mechanism consists of the assignment of a fixed membership attribute, which is not only closer to a real phenomenon, but also computationally simpler.
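For concreteness, the overlap between two partitions of the same set of agents can be computed as in the following sketch. It assumes the normalization \(2I/(H_{A}+H_{B})\), which is one common convention; the precise variant of \(I_{\mathrm{norm}}\) used above is defined in [31] and not restated here. The function name is hypothetical.

```python
from collections import Counter
from math import log


def normalized_mutual_information(labels_a, labels_b):
    """Normalized mutual information between two partitions of the same agents.

    labels_a[i] and labels_b[i] are the two group assignments of agent i.
    Assumption: the mutual information is normalized as 2 * I / (H_a + H_b).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a + h_b == 0.0:
        return 1.0  # both partitions consist of one single group
    mutual = sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )
    return 2.0 * mutual / (h_a + h_b)


# Toy example: agent labels versus detected clusters for six agents.
labels = ["r", "r", "r", "g", "g", "g"]
clusters = [1, 1, 2, 2, 2, 2]
print(normalized_mutual_information(labels, clusters))
```

The score is 1 when the two partitions coincide up to a renaming of the groups and 0 when they are statistically independent.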

4.3 Distribution of path lengths at link formation

Finally, we compare the empirical and the simulated networks with respect to the distribution of path lengths between every pair of agents at the moment preceding the link formation. This is different from the distribution of path lengths analyzed before, which was computed on the time-aggregated networks. We now ask whether agents preferentially form links with agents that are already part of the same connected component, with agents from another component, or with newcomers. The respective distribution of link types is shown in Figure 9(a) for the ‘Pharmaceuticals’ R&D network, and in Figure 10(a) for the co-authorship network in interdisciplinary physics. In all cases, links with agents inside the same connected component or with newcomers outnumber links between agents in different components. We emphasize the very good match between the empirical and the simulated frequencies of link types.

Figure 9

Temporal path length analysis for ‘Pharmaceuticals’ R&D network (SIC code 283). (a) Distribution of link types for empirical and simulated networks: ‘newcomer(s)’ means that at least one of the agents was isolated (i.e. not yet part of the network) before the link formation; ‘disconnected’ refers to agents already belonging to the network, but placed in two disconnected components; ‘connected’ refers to agents already belonging to the same network component prior to the link formation. (b) Distribution of path lengths at the moment of link formation (only for agents belonging to the same connected component).

Figure 10

Temporal path length analysis for the co-authorship network in applied and interdisciplinary physics (PACS number 89). (a) Distribution of link types for empirical and simulated networks: ‘newcomer(s)’ means that at least one of the agents was isolated (i.e. not yet part of the network) before the link formation; ‘disconnected’ refers to agents already belonging to the network, but placed in two disconnected components; ‘connected’ refers to agents already belonging to the same network component prior to the link formation. (b) Distribution of path lengths at the moment of link formation (only for agents belonging to the same connected component).

For links connecting agents that are already in the same connected component, we can further discuss the network distance, or path length, between the two agents. An interesting question is whether agents at larger network distances are still able to know each other and to form a link. Trivially, agents at distance 1 already have a collaboration (and can start a new one), whereas agents at distance 2 have one collaborator in common (triadic closure). We report our findings about the path length between agents before they engage in a collaboration in Figure 9(b) for the ‘Pharmaceuticals’ R&D network, and in Figure 10(b) for the co-authorship network in interdisciplinary physics. We see that in R&D networks agents preferably choose close collaborators for a new collaboration (path length up to 5), whereas in co-authorship networks agents prefer previous collaborators or collaborators at distance 2.
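The classification of link types and the path length preceding link formation can be sketched as follows. The sketch assumes that the collaboration events are available as a time-ordered list of pairwise links; collaborations with more than two partners would first have to be expanded into pairs, and how this is handled in the analysis above is not specified here. Function names are hypothetical.

```python
from collections import deque


def shortest_path_length(adjacency, source, target):
    """Breadth-first-search distance between two agents, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def classify_links(timed_links):
    """Label each link by the state of the network just before it is formed.

    Returns a list of (link_type, path_length) tuples, where path_length is
    only set for links of type 'connected'.
    """
    adjacency = {}
    records = []
    for u, v in timed_links:
        if u not in adjacency or v not in adjacency:
            records.append(("newcomer(s)", None))       # at least one new agent
        else:
            dist = shortest_path_length(adjacency, u, v)
            if dist is None:
                records.append(("disconnected", None))  # two different components
            else:
                records.append(("connected", dist))     # same component
        # Only now add the link, so the next event sees the updated network.
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return records


# Toy sequence of collaborations in temporal order.
events = [("a", "b"), ("b", "c"), ("a", "c"), ("a", "b"), ("d", "e"), ("c", "d")]
for event, record in zip(events, classify_links(events)):
    print(event, record)
```

In the toy sequence, the third link closes a triangle (path length 2) and the fourth repeats an existing collaboration (path length 1), which are the two cases singled out in the text.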

Let us emphasize that our model reproduces well two important characteristics of collaboration networks: the high number of repeated interactions and the phenomenon of triadic closure. These are known to have a positive impact on productivity [32] and to be a driving force in the formation of new collaborations [7]. This result is far from trivial, as we have included neither ad hoc microscopic rules nor information to reproduce such characteristics.

In conclusion, the model correctly predicts the formation of links between agents irrespective of whether they are already in the same network component, and it reproduces the distribution of shortest path lengths at the moment of link formation. In addition, it captures repeated interactions and the triadic closure phenomenon well without using any ad hoc microscopic information.

5 Discussion and conclusion

5.1 Commonalities in collaboration networks

In the present paper, we have explored the structure and dynamics of collaboration networks in two different domains, R&D alliances between firms and co-authorship relations between scientists. Despite their different origins, these collaboration networks share a number of common features that can even be found on the sub-domain level (SIC and PACS numbers). These empirical features include the right-skewed distribution of collaboration sizes (Figure 3) and the distribution of activities to engage in a collaboration (Figure 4), which are very stable across domains and over time, as well as the pronounced community structure of the networks and the existence of a giant connected component (Figure 7).

These commonalities motivated us to use the same agent-based model to explain the structure and dynamics of these collaboration networks. Precisely, we have compared the outcome on the systemic level, i.e. the networks simulated by the agent-based model and the observed networks, to conclude whether our assumptions for the interactions on the agent level are justified. We remark that reproducing systemic features along very different dimensions indeed lends evidence to the validity of our agent-based model, because it cannot simply be obtained by a fitting procedure. Specifically, our model is able to reproduce the distributions of degree, of path length, of local clustering coefficients, of component sizes and of path lengths between every pair of agents at the moment of link formation, without imposing any constraints on these features during the calibration procedure.

5.2 Strategies of agents choosing collaboration partners

The agent-based model builds on five probabilities to form a link with another agent, which depend on the label of the initiator (newcomer vs. established agent) and on the counterparty (newcomer vs. established agent with the same or a different label). These agent-centric probabilities are calibrated using only three macroscopic features of the empirical networks (mean values of degree, path length and clustering coefficient). Remarkably, we find that these probabilities have very similar values, regardless of the domains (R&D networks vs. co-authorship networks) and the sub-domains (SIC and PACS numbers).

Interpreting these probabilities as strategies of an agent to choose a collaboration partner, we can obtain the following insights:

  (i) For all R&D and co-authorship networks, established agents prefer to form links with other established agents (\(p^{*L}_{s}+p^{*L}_{d} > 55\%\)).

  (ii) When forming a link with an established agent, the initiator tends to select a counterparty with the same label, i.e. belonging to the same community (\(p^{*L}_{s} \ge p^{*L}_{d}\)). Comparing the two domains, we find that this general tendency is 10 times stronger in co-authorship networks. The probability to select a co-author from a different community, \(p^{*L}_{d}\), equals the lowest possible value, 5%, in all cases.

  (iii) A difference between domains is observed in the strategy of the newcomers. For R&D networks, newcomers tend to enter the network by forming links with established agents (\(p^{*N L}_{l} > p^{*N L}_{nl}\)). This finding is consistent with empirical evidence [5, 33]. For all co-authorship networks, however, newcomers tend to enter the network by forming links with other newcomers (\(p^{*N L}_{nl} > p^{*N L}_{l}\)), as shown by the fact that \(p^{*N L}_{nl} \ge 0.55\) in these networks.

The difference in the strategies of newcomers in R&D and co-authorship networks can be attributed to the higher entry barriers in economic systems compared to academic environments. The only exception to these general observations is the sectoral network ‘R&D, laboratory and testing’, where the strategies of newcomers resemble those in co-authorship networks. We attribute this deviation to the high technological dynamism in this sector.

5.3 Network-endogenous and -exogenous factors

Following the distinction in the literature [5], we argue that the strategies of agents in choosing their collaboration partners are determined by both endogenous and exogenous factors. These are known to be crucial in the formation and evolution of R&D alliances [5]. However, they have usually been considered separately in empirical and theoretical works [21, 33–36], and to our knowledge no study has analyzed their importance in co-authorship networks.

Network-endogenous factors cover the information that the initiator has about the network, for instance information about the network position (i.e. social capital) of its potential partners. Thus, these factors take into account collaboration patterns already present in the networks. These factors are captured by the probabilities to link to a labeled agent, \(p^{L}_{s}\), \(p^{L}_{d}\) and \(p^{N L}_{l}\). Network-exogenous factors do not consider such information, but instead use external information such as the technological, scientific or geographical proximity of the agents. These factors are captured by the probabilities to link to a newcomer, \(p^{L}_{n}\) and \(p^{N L}_{nl}\).

Comparing the two types of factors, we find that network-endogenous factors are predominant in the formation of new collaborations in each of the collaboration networks analyzed in this study. In other words, the existing network structures explain most of the newly formed links. In terms of linking probabilities, this means that \(p^{*L}_{s}+p^{*L}_{d}+p^{*N L}_{l}\) is always larger than \(p^{*L}_{n}+p^{*N L}_{nl}\) (where \(^{*}\) refers to the optimal probability) for all sectoral R&D networks and co-authorship networks. This result is also in line with the empirical finding [37, 38] that firms in R&D networks prefer to establish alliances with other firms which have a history of previous alliances.
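As a short consistency check, and assuming that the linking probabilities of each type of initiator sum to one, i.e. \(p^{L}_{s}+p^{L}_{d}+p^{L}_{n}=1\) and \(p^{N L}_{l}+p^{N L}_{nl}=1\) (this normalization is an assumption here, as it is not restated in this section), the comparison above reduces to a single threshold:

\[
p^{*L}_{s}+p^{*L}_{d}+p^{*N L}_{l} \;>\; p^{*L}_{n}+p^{*N L}_{nl} \;=\; 2-\bigl(p^{*L}_{s}+p^{*L}_{d}+p^{*N L}_{l}\bigr)
\quad\Longleftrightarrow\quad
p^{*L}_{s}+p^{*L}_{d}+p^{*N L}_{l} \;>\; 1 .
\]

Under this assumption, network-endogenous factors are predominant exactly when the three probabilities to link to a labeled agent add up to more than one.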

5.4 Reconstruction of communities by means of labels

In our model, labels represent the fact that agents belong to certain communities. This way, newcomers and established agents can be distinguished. Moreover, different labels allow us to differentiate further between groups of agents with a certain interest. The label dynamics explained in Section 2 provides a mechanism of label propagation.

We point out that our assumption about the label attribute agrees with the results reported in [23], which identified the presence of communities based on ground truth in real networks. Such communities include nodes that do not necessarily share features such as the same geographical origin or affiliation with the same institution. They are rather defined dynamically, through consecutive interactions and link formation. The same reasoning holds for both R&D and co-authorship networks, where communities of collaborating agents do not depend on their geographical or knowledge distance, but are defined by the subsequent propagation of a (virtual) membership attribute, which is the ‘label’.

It is remarkable that this rather abstract setup for labels is indeed able to reproduce the distributions of communities present in the collaboration networks from both domains (see Figure 8). The overlap in communities, measured through a normalized mutual information criterion, is around 90% for all collaboration networks. In Table 4 in Appendix 3, we have shown that such a community structure cannot be expected at random from the degree sequence. Thus, we can conclude that labels represent a simple and elegant way to capture various network-endogenous factors which drive agents in both domains, R&D collaborations and co-authorship networks, to form communities. While the existence of communities is an empirical fact, the rules for their formation are not fully understood. With this work, we provide evidence that such rules can be inferred from the empirical networks and are able to reproduce not only the community structure, but also other, more sophisticated features of the networks.