Unipartite networks
We start our analysis by looking at the unipartite networks. The first on which we focus is the network of co-citations of international aid organizations. In this network, two organizations are linked to each other if they are mentioned in the same web page a significant number of times.
Figure 1 depicts this network. The first noticeable property of the network is the tendency of organizations to cluster with similar organizations. This homophily can be estimated by calculating the modularity of the partition obtained by classifying organizations according to their class (as described in Sect. 2.1). Modularity is a measure estimating how many more edges are inside a node community than we would expect [22]. Modularity takes values between −1 and 1. In this case, we observe a value of 0.34. This is remarkably high considering that we obtain a modularity of 0.47 when we detect node communities with an actual community discovery algorithm (Infomap [23]). This means that the organization class is a reasonable way to partition this network, i.e. organizations of similar classes tend to cluster together.
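As a minimal sketch of how the modularity of a class-based partition can be computed, the snippet below implements the standard Newman formula for an undirected weighted graph. The toy edge list and the class labels ("UN", "NGO") are hypothetical and only illustrate the calculation; they are not taken from our data.

```python
from collections import defaultdict

def modularity(edges, label):
    """Newman modularity of a node partition on an undirected weighted graph.

    edges: iterable of (u, v, weight); label: dict mapping node -> class.
    """
    m = 0.0                       # total edge weight
    inside = defaultdict(float)   # edge weight falling inside each class
    degree = defaultdict(float)   # summed weighted degree of each class
    for u, v, w in edges:
        m += w
        degree[label[u]] += w
        degree[label[v]] += w
        if label[u] == label[v]:
            inside[label[u]] += w
    # Observed share of intra-class weight minus the share expected at random.
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1), ("c", "d", 1)]
label = {"a": "UN", "b": "UN", "c": "UN", "d": "NGO", "e": "NGO", "f": "NGO"}
q = modularity(edges, label)
print(round(q, 3))
```

The same function can be applied to any other partition of the nodes, for instance the one returned by a community discovery algorithm, which makes the two modularity values directly comparable.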
Note that this is a significance map, meaning that we only see the associations that are stronger than expected. There are strong associations that are not in this network, because they are a reflection of business-as-usual for the connected organizations. The distribution of edge weights is broad, and we depict it in Fig. 2. The distribution has a strong exponential cutoff. This means that very well cited organizations fail to be mentioned together as much as we would expect—as this would generate a power-law. Table 1 reports the top ten organizations associated with the World Bank according to the WS measure.
Table 1 Top ten organizations according to WS for the World Bank in the co-citation network
Keeping our attention on the organization-organization network, we now focus on its citation variant: here organizations are connected not if they co-appear in a page, but if they directly mention (or link) one another. In this case, connections are not symmetric: the edges have a direction going from the citer to the citee.
The first thing we are interested in investigating is which organization is the most cited and which one is the most central in this kind of structure. Table 2 reports the ten most central organizations in the network, according to betweenness centrality [24]. Betweenness centrality tells us the fraction of paths in the network that would get longer—or disconnected—if the node were to be removed from the network.
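For readers who want to reproduce this kind of ranking, the following is a small sketch of Brandes' algorithm for betweenness centrality on a directed graph. It assumes unweighted edges and returns unnormalised scores; whether edge weights should enter the computation is a modelling choice not covered by this sketch. The four-node chain is a hypothetical example.

```python
from collections import deque

def betweenness(adj):
    """Unnormalised betweenness on a directed, unweighted graph (Brandes).

    adj: dict mapping every node to the list of its successors.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting from the farthest nodes.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Hypothetical citation chain a -> b -> c -> d.
adj = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
bc = betweenness(adj)
print(bc)
```

In the chain, the two middle nodes lie on every shortest path between the nodes on either side of them, so they obtain the highest scores even though each has a single incoming edge.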
Table 2 The ten most central organizations in the citation network, and their incoming citation score
Note that being the most central node in the network does not necessarily mean that the node is the most cited. In fact, the third most cited organization (according to the sum of the WS of the edges pointing to it) is the World Health Organization, which is eighth most central. UNDP and FAO received fewer citations, but they have higher betweenness centrality. This means that the additional citations come either from the same edge or from other organizations that do not contribute to the centrality of the WHO.
In the citation network, another important question regards connected components: is there a citation path going from an organization to any other organization? A disconnected space would mean that there is information that cannot reach every organization in the system.
The citation network has only one weakly connected component. This means that there is a path between any two organizations, provided that we ignore the direction of the edges [25]. The majority of the network is composed of nodes that are the target of a citation, but do not cite back in sufficient amount to clear the backboning threshold. There is a core of 15 aid organizations that form the only non-trivial strongly connected component (i.e. a component containing more than one node). With the exception of Save the Children, all organizations in Table 2 are part of this core of the network.
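The distinction between weakly and strongly connected components can be illustrated with a short sketch. It uses a simple reachability-based approach that is adequate for small graphs (a linear-time algorithm such as Tarjan's would be preferable at scale). The edge list is a hypothetical example in which three organizations cite each other in a cycle and two more only cite into that cycle.

```python
def reachable(adj, start):
    """Set of nodes reachable from start (including start itself)."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def components(edges):
    """Return (weak components, strong components) of a directed edge list."""
    nodes = {n for e in edges for n in e}
    fwd = {n: set() for n in nodes}   # directed adjacency
    und = {n: set() for n in nodes}   # adjacency ignoring direction
    for u, v in edges:
        fwd[u].add(v)
        und[u].add(v)
        und[v].add(u)
    reach = {n: reachable(fwd, n) for n in nodes}
    weak, strong = set(), set()
    for n in nodes:
        weak.add(frozenset(reachable(und, n)))
        # n and m are strongly connected if each can reach the other.
        strong.add(frozenset(m for m in reach[n] if n in reach[m]))
    return weak, strong

# Hypothetical citations: A, B, C cite each other in a cycle; D and E cite in.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "A"), ("E", "B")]
weak, strong = components(edges)
core = [c for c in strong if len(c) > 1]   # non-trivial strong components
print(len(weak), core)
```

Here all five nodes form one weakly connected component, while only the cycle forms a non-trivial strongly connected component, mirroring the core-periphery pattern described above.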
We conclude the section on unipartite networks by looking at the other two types of networks we generated: the issue–issue and country-country networks. We do not perform analysis on them as this paper is more focused on aid organizations. However, they are objects with interesting descriptive power. They tell us how aid organizations talk about health issues and countries.
Figure 3 depicts the issue network. Two issues are connected if they co-appear in webpages more often than we would expect given their popularity. There are two interesting aspects. First, terms that are synonyms or quasi-synonyms are connected in this network, such as “DAH” and “Development Assistance for Health”, or “TB” and “Tuberculosis”. This is a sign that the methodology is valid, as it was able to catch such relationships.
The second aspect concerns our ability to define clusters of issues. We perform a simple community discovery with the Infomap algorithm [23], which returns the clusters depicted in the figure. The communities help us make sense of different issues as being part of macro categories. For instance, we can delineate a reproductive health macro category (in orange) as distinct from the one focused on infrastructure and governmental cooperation (in red).
Similar considerations can be made for the country network. Figure 4 depicts it. This network can also be used as a means of validating our Web data. We expect to find connections between countries that share a similar geographical and cultural position. Countries affected by the same issues should also be closely connected.
In fact, we see that there is a strong geographical component, showing an African, American, Middle East and Far East cluster. Romania is the only European country, and thus has smaller connections. Yet, its most significant connection is with its neighbor Turkey.
Bipartite networks
We now turn our attention to bipartite networks. In a bipartite network, we have two classes of nodes. Edges only connect nodes belonging to unlike classes. In this work we have three possible types of bipartite networks: organizations connected to keywords, organizations connected to countries, and countries connected to keywords. The connection criterion is the same used for unipartite networks: the co-appearance of the two terms in the same webpage.
The first network we analyze is the one connecting organizations to keywords. Our first question is: how can we estimate which issues are the most associated with an organization? To answer this question we create a derived WS measure. \(\mathit{WS}_{oi}\) is the total number of co-appearances of organization o and issue i. We can define \(\mathit{WS}_{i}\) as the total number of times i was mentioned. Note that \(\mathit{WS}_{i} \neq\sum_{\forall o \in O} \mathit{WS}_{oi}\), because we can have a page mentioning i and two different organizations, so that page would be double counted (see Footnote 4).
The ratio between \(\mathit{WS}_{oi}\) and \(\mathit{WS}_{i}\) is informative: it is the share of pages mentioning i that also mention o. The higher this share, the more o is seen as relevant to i by the aid community, even if o might not mention i at all. Table 3 reports the top and bottom issues according to this ratio for the World Bank. Note that the World Bank is mentioned more often together with AIDS and HIV than it is mentioned together with Infant Survival. However, HIV/AIDS are topics discussed very broadly and very often, and the World Bank appears in only around 4% of their mentions. This is still higher than the expectation under uniform mentions: that expectation is lower than 1%, given that we have more than 100 organizations in our sample.
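The counting behind this ratio can be sketched as follows. The page list is a hypothetical toy example (each page is reduced to the set of organizations and the set of issues it mentions), and the helper name `ws_percent` is ours. The example also shows why summing the co-appearances over organizations overshoots the number of mentions of an issue.

```python
from collections import Counter

# Hypothetical pages: (organizations mentioned, issues mentioned).
pages = [
    ({"World Bank", "WHO"}, {"HIV"}),
    ({"WHO"}, {"HIV"}),
    ({"World Bank"}, {"Infant Survival"}),
    ({"UNICEF"}, {"HIV", "Infant Survival"}),
]

ws_i = Counter()    # pages mentioning issue i
ws_oi = Counter()   # pages mentioning both organization o and issue i
for orgs, issues in pages:
    for i in issues:
        ws_i[i] += 1
        for o in orgs:
            ws_oi[o, i] += 1

def ws_percent(o, i):
    """Share (in percent) of the pages mentioning i that also mention o."""
    return 100 * ws_oi[o, i] / ws_i[i]

# The first page mentions two organizations, so it is counted twice here.
total_over_orgs = sum(n for (o, i), n in ws_oi.items() if i == "HIV")
print(ws_i["HIV"], total_over_orgs)
print(ws_percent("World Bank", "Infant Survival"))
```

In this toy corpus HIV is the more popular issue, yet the World Bank has a larger share of the Infant Survival pages, which is the kind of contrast the ratio is designed to expose.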
Table 3 The ten most associated issues to the World Bank according to WS
We can visualize the relationship between the World Bank and all issues on the issue network we built in the previous section. In Fig. 5 (left), we use the same layout used in Fig. 3. The only difference is in the color of the nodes. Instead of showing network communities, we show the value of WS% (\(= 100 \times \mathit{WS}_{oi} / \mathit{WS}_{i}\)). From the figure, we can see that the World Bank tends to have lower WS% only for very popular issues (large nodes), and it tends to be mentioned disproportionately often with smaller issues.
Is there an alignment between what is said about the World Bank (i.e. the terms with which it co-appears) and what the World Bank itself talks about (i.e. the terms it mentions on its website)? Fig. 5 (right) is an attempt to answer this question. The WS% indicator as calculated so far is equal to \(\mathit {WS}_{oi} / \mathit{WS}_{i}\). If we substitute the numerator, we can estimate the share of pages mentioning i coming from a given organization website o: \(\mathit{WS}^{o}_{i} / \mathit{WS}_{i}\). In a perfectly aligned world, these two ratios should be the same: an organization is mentioned together with an issue exactly as much as this organization itself mentions the issue. Disagreements in these two ratios imply that o thinks some other issue is more relevant to itself than i.
Figure 5 (right) shows that there is a correlation between the two ratios, but it is not perfect. The vast majority of issues is clustered into a clear linear relationship; however, there are a handful of outliers. Secondary/Tertiary Care and Avian Influenza are mentioned by the World Bank more than the World Bank is mentioned with them; on the other hand, the World Bank mentions Infant Survival and Neglected Disease proportionally less than it is mentioned about them.
A plausible explanation of why “Infant Survival” is found in the World Bank site more often with other partners (i.e. high co-appearances relative to citation), while other terms, such as “Avian Flu”, are found more often in isolation (i.e. high citation relative to co-appearances), could be that the World Bank often works to improve “Infant Survival” together with other health organizations (e.g. UNICEF and WHO). On the other hand, the World Bank often operates in the “Avian Flu” domain with non-(human-)health organizations (e.g. OIE, the World Organisation for Animal Health), as avian flu is relevant to both human and animal health.
We can repeat the same exercise in the organization-country bipartite network. Table 4 reports the results of the same analysis performed for Table 3; Fig. 6 depicts the same scattergram shown in Fig. 5 (right), shifting the attention from issues to countries.
Table 4 The ten most associated countries to the World Bank according to WS
The main difference between the two bipartite networks is that WS% distributes differently. There is less variation in the organization-country network than in the organization-keyword network. For the World Bank, the difference between the most associated country (Vietnam) and the least associated country (Nigeria) is just above 0.3 percentage points.
Just as with issues, for countries there is a strong correlation between the number of times the World Bank is mentioned about a country and the number of times it mentions the country itself, with some outliers. Romania is mentioned by the World Bank more than we would expect given the discourse of the rest of the aid community, while Jordan and Egypt are mentioned fewer times.
We conclude this section by looking at the country-issue bipartite network. In this case we cannot compare the association as emerging from co-occurrences against direct mentions, because we do not have a source of data on what a country talks about. For this reason, we limit ourselves to providing a simple ranking. We want to answer the question: which issue is most related to which country?
Table 5 provides the answer. There is only one issue that ranks as the top associated concern in more than two countries and that is Infant Survival. From Fig. 3 we could see that this issue is not mentioned often in the crawled websites. However, most of its mentions come from a collection of countries: the six countries listed in the table account for more than 77% of all mentions of the issue.
Table 5 The most associated issue per country according to WS
Multilayer networks
In this section we focus on multilayer networks, i.e. networks whose nodes can be connected with different criteria. We focus on a single variant among all the possible multilayer networks we could build. Analyses similar to the one delineated in this section can easily be carried out on the other types of multilayer networks. Here we are interested in investigating the network connecting issues together.
The layers of the network come from the different websites crawled. In other words, we build a unipartite issue–issue network just like we did in the Unipartite Networks section, but looking exclusively at data coming from one website at a time. In this way, we have a layer showing us the issue–issue network in the eyes of the World Bank, a layer for UNICEF, a layer for the WHO, and so on. Figure 7 (left) shows an example of such a structure.
In the Unipartite Networks section we saw that there is an issue–issue network arising from the combination of the discourse of all international aid organizations. From that network, we discovered that—for instance—HIV is related to Maternal Health. From where is this connection coming? How do the networks emerging from the speech of each aid organization combine to form the big picture shown in Fig. 3?
In [17], the authors introduced two concepts that are useful for thinking about this question. These concepts were “complementarity” and “redundancy”. If we look at all the edges in the multilayer network, we can record in how many layers a connection is reproduced. If on average all connections are present in most layers, then we can conclude that there is a high degree of redundancy: the deletion of an edge in a specific layer does not affect the connectivity of the multilayer network as a whole. However, if most edges are present in a single layer, then the deletion of a single edge from i to j in the only layer in which they are connected has a dramatic effect. If we perform such deletion, there is no direct way to go from i to j.
We can redefine redundancy ρ here to apply it to a single edge. The redundancy \(\rho_{ij}\) of the edge connecting i to j is simply the number of layers in which the edge appears, divided by the total number of layers in the network. If l is a layer containing a set of edges from our set of layers L,
$$\rho_{ij} = \frac{\sum_{\forall l \in L} \delta_{ij}^{l}}{|L|}, $$
with:
$$\delta_{ij}^{l} = \textstyle\begin{cases} 1 & \text{if } ij \in l, \\ 0 & \text{otherwise}. \end{cases} $$
In other words, \(\rho_{ij}\) is the fraction of layers in which i is connected to j.
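The definition of \(\rho_{ij}\) translates directly into code. In the sketch below each layer is represented as a set of undirected edges; the layer names (`org_1`, ...) and the edges are hypothetical placeholders.

```python
def edge(a, b):
    """Undirected edge between two issues."""
    return frozenset((a, b))

# Hypothetical layers: one issue-issue edge set per organization website.
layers = {
    "org_1": {edge("AIDS", "HIV"), edge("HIV", "TB")},
    "org_2": {edge("AIDS", "HIV")},
    "org_3": {edge("AIDS", "HIV"), edge("Malaria", "TB")},
    "org_4": set(),
}

def redundancy(e, layers):
    """rho_ij: number of layers containing the edge, divided by |L|."""
    return sum(e in layer for layer in layers.values()) / len(layers)

# Average over every edge that appears in at least one layer.
all_edges = set().union(*layers.values())
rho = {e: redundancy(e, layers) for e in all_edges}
mean_rho = sum(rho.values()) / len(rho)
print(rho[edge("AIDS", "HIV")], round(mean_rho, 3))
```

Note that this toy average only runs over edges found in some layer. If the average is taken over the edges of an aggregated network instead, edges that appear in no individual layer contribute a redundancy of zero.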
If we calculate the average \(\rho_{ij}\) value for all the edges in the issue–issue network shown in Fig. 3, we obtain a value of 0.035, meaning that the average edge appears in only 3.5% of all layers. This seems to suggest that the network is characterized by a certain level of complementarity. The intuition is confirmed if we look at the edges with the highest value of redundancy. The top eight edges, the ones present in 13 or more layers, are either synonyms or antonyms frequently used together: Malnutrition and Nutrition, Communicable Diseases and Non-communicable Disease, AIDS and HIV, and so on. If we were to collapse these issues as one, the average redundancy would be even lower than 0.035.
Moreover, there are 19 connections in the issue–issue network as shown in Fig. 3 that do not appear in a single layer. This means that it is the aggregation of correlated discourses that makes these connections emerge, even if there is not a single organization for which the two issues are co-mentioned a significant number of times. Table A4 reports the full list.
Another multilayer network analysis focuses on estimating the similarities between each layer and the whole structure. We can calculate two related, but complementary measures: Alignment and Impact. Alignment tells us how much a layer is similar to the whole network. To estimate Alignment, we calculate the correlation coefficient between the global issue–issue WS values and the ones coming from a single layer. A high Alignment value means that the connection strength between two issues in the eyes of an organization is correlated with the strength in the overall network. In other words, the organization agrees with the community about the relation between the two issues.
On the other hand, Impact tells us how much of the global structure is due to the organization itself. We define Impact as the normalized difference between Alignment and Alignment∗. Alignment∗ is the correlation between the WS values of the layer and the ones coming from the global network minus the layer itself (see Footnote 5). A high Impact value means that, if we were to remove the layer from the global network, the network would change into something rather different than the layer itself. Impact equal to zero would imply that removing the layer from the network would not change the distribution of the WS values at all.
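A rough sketch of the two correlations follows. It rests on two assumptions that are ours and made only for illustration: the global weight of an edge is taken to be the sum of its weights across layers, and the correlation is a Pearson coefficient. The normalization of Impact is not reproduced, so the function returns the raw difference between Alignment and Alignment∗. The edge weights are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def alignment_scores(layer, others):
    """layer: dict edge -> weight; others: list of such dicts (other layers).

    Assumption: the global weight of an edge is the sum over all layers.
    Returns (alignment, alignment_star, raw difference).
    """
    edges = sorted(set(layer).union(*others))
    own = [layer.get(e, 0.0) for e in edges]
    rest = [sum(o.get(e, 0.0) for o in others) for e in edges]
    whole = [a + b for a, b in zip(own, rest)]
    alignment = pearson(own, whole)        # layer vs. global network
    alignment_star = pearson(own, rest)    # layer vs. global minus the layer
    return alignment, alignment_star, alignment - alignment_star

# Hypothetical edge weights for one layer and for two other layers.
layer = {("AIDS", "HIV"): 3.0, ("HIV", "TB"): 1.0}
others = [{("AIDS", "HIV"): 2.0, ("Malaria", "TB"): 2.0},
          {("HIV", "TB"): 1.0, ("Malaria", "TB"): 1.0}]
alignment, alignment_star, raw_impact = alignment_scores(layer, others)
print(round(alignment, 3), round(alignment_star, 3), round(raw_impact, 3))
```

In this toy case the layer correlates positively with the whole network but negatively with what remains after removing it, so the raw difference is large: the apparent agreement is mostly produced by the layer itself.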
Figure 7 (right) shows the relationship between Alignment and Impact across all the organizations. We divide the space into four quadrants by slicing each axis at its median value. We name each quadrant as follows (where “high” means “higher than median”, and “low” means “lower than median”):
- Leaders. High Alignment and high Impact (e.g. WHO or UNICEF) mean that the organization agrees with the overall WS distribution of the aid community and its removal would cause a shift in this distribution. These are organizations which set the discourse of the health community.
- Followers. High Alignment and low Impact (e.g. NORAD or DCD) mean that the organization agrees with the overall WS distribution of the aid community, but its removal would not change this distribution. These are organizations which follow the discourse of the health community.
- Explorers. Low Alignment and high Impact (e.g. World Bank or IMF) mean that the organization disagrees with the overall WS distribution of the aid community, but its removal would cause a shift in this distribution. These are organizations which seem to explore different ways to provide health aid and that have enough power to shift the discourse of the community.
- Strugglers. Low Alignment and low Impact (e.g. IsDB or UNPBF) mean that the organization disagrees with the overall WS distribution of the aid community, and its removal would not change this distribution. These are organizations which are not following the mainstream of the health aid community, but do not have enough power to shift the discourse.
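The quadrant assignment described in the list above can be sketched in a few lines. The organization names and scores below are hypothetical placeholders, and the treatment of values exactly equal to the median (counted as "low" here) is an assumption of this sketch.

```python
from statistics import median

def quadrants(scores):
    """scores: dict mapping organization -> (alignment, impact)."""
    a_med = median(a for a, _ in scores.values())
    i_med = median(i for _, i in scores.values())
    names = {(True, True): "Leader", (True, False): "Follower",
             (False, True): "Explorer", (False, False): "Struggler"}
    # "High" means strictly higher than the median of that axis.
    return {org: names[a > a_med, i > i_med]
            for org, (a, i) in scores.items()}

# Hypothetical (alignment, impact) pairs.
scores = {"org_a": (0.9, 0.8), "org_b": (0.8, 0.1),
          "org_c": (0.2, 0.9), "org_d": (0.1, 0.2)}
roles = quadrants(scores)
print(roles)
```

Because the thresholds are medians, the labels are relative: an organization's role depends on the set of organizations it is compared against.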
Unsurprisingly, WHO, the main actor in health, is in the leader quadrant: its direction emerges as a defining force of the global health arena. Many organizations score a higher impact than WHO, for instance UNICEF and Doctors without Borders, as they likely use their influence in a more visible way for the population at large. The position of the World Bank as explorer also makes sense: the World Bank is a large organization which can coordinate a significant human and economic effort; however, it is not traditionally thought of as a health provider, and thus has a lower alignment level with the community.
Case study: World Bank
We now focus our attention on a single organization: the World Bank. The aim of this section is to provide some examples of organization-centric analyses. We focus on the World Bank because it is one of the largest organizations in international aid. Among its many commitments—helping nations to build more equitable societies and to improve fiscal performance and country competitiveness—the World Bank Group (WBG) also works in the health sector. To support countries in reaching the goal of UHC by 2030 (all people and communities can use the promotive, preventive, curative, rehabilitative and palliative health services they need, of sufficient quality to be effective, while also ensuring that the use of these services does not expose the user to financial hardship), the Bank provides financing, state-of-the-art analysis, and policy advice to improve service delivery and expand access to quality, affordable health care. During the period from fiscal year 2000 to 2016, the World Bank invested US$35 billion in the Health, Nutrition and Population (HNP) thematic areas. Over this period, the average annual lending increased significantly from US$1.3 billion to US$2.4 billion. The Bank currently manages an active HNP portfolio of $11.9 billion.
The WBG has focused its health sector investments and research in areas that are especially vital to helping countries achieve UHC by 2030, working closely with donors, development partners, governments, and the private sector. Some of these focus areas include ending preventable maternal and child mortality; reducing stunting and improving nutrition for infants and children; strengthening health systems and health financing; ensuring pandemic preparedness and response; promoting sexual and reproductive health and rights; and the prevention and treatment of communicable diseases.
In light of this, the World Bank is among the main actors in this area, and one of considerable impact—as highlighted in Fig. 7 (right). Moreover, we have access to additional data sources that make the analysis more substantive.
The first question we want to address is whether the connections we extracted from Web patterns reflect the actual activities of an organization. We construct an index measuring the Involvement of the World Bank with a given country. This index is formulated as follows:
$$\mathit{Involvement} = \log P + \log I + \log GP, $$
where:
- P is the number of World Bank financing projects supporting health services approved in a given country over the 2005–2016 period;
- I is the number of World Bank non-financing projects (advisory services and analytics supporting health services) approved by the World Bank;
- GP is the number of World Bank operational units delivering the support to health services in a given country over the 2005–2016 period.
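The index defined by the three terms above is a sum of logarithms, which can be sketched as below. The natural logarithm and the handling of zero counts (returning `None`, since the logarithm is undefined there) are assumptions of this sketch; the input counts are hypothetical.

```python
from math import log

def involvement(p, i, gp):
    """Involvement = log P + log I + log GP.

    Assumptions of this sketch: natural logarithm, and None is returned
    when any count is zero because the logarithm is undefined.
    """
    if min(p, i, gp) <= 0:
        return None
    return log(p) + log(i) + log(gp)

# Hypothetical counts for one country.
score = involvement(10, 5, 2)
print(round(score, 3))
```

Since the sum of logarithms equals the logarithm of the product, the index grows with the product of the three counts, and a country scores high only when all three are sizeable.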
We can correlate World Bank’s Involvement with WS. Figure 8 shows their direct relationship. The Spearman rank correlation between these two measures is 0.6, with a \(p\mbox{-value} \sim0.01\). Obtaining an almost significant p-value with only 17 observations is remarkable, even more so considering that there is one clear outlier (Romania).
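For reference, a Spearman rank correlation can be computed by ranking both variables (averaging the ranks of tied values) and taking the Pearson correlation of the ranks. The sketch below uses only the standard library; the input vectors are hypothetical and it does not compute a p-value.

```python
from math import sqrt

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in order[i:j + 1]:
            result[k] = average
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical values: a monotone but non-linear relationship.
involvement_scores = [1.0, 2.0, 3.0, 4.0]
ws_scores = [1.0, 4.0, 9.0, 16.0]
print(spearman(involvement_scores, ws_scores))
```

Working on ranks makes the measure insensitive to the scale of either variable, which is convenient when comparing a log-based index with a Web-derived score.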
This relationship might be driven by external factors. A country might be mentioned more overall because it is under the spotlight, or because its development level requires more attention. Here we propose two controls for addressing this objection. The popularity of a country might be estimated by looking at the page views of its page in Wikipedia (see Footnote 6). Its development level is instead approximated by its GDP per capita PPP. Table 6 shows these controls. Remarkably for such a small number of observations, WS is still significant.
Table 6 The direct relationship between World Bank’s broad support and WS, controlling for a country’s popularity and its GDP PPP per capita
To conclude this section, we perform an analysis borrowing tools from network epidemics [19]. Real communication channels between organizations do not usually take place on the Web, and thus are outside the data we analyze. However, we can interpret the organization-organization citation network as a trace of past communication: if two organizations are talking about each other and they link each other, it means that they are aware of each other's actions in the field, and are paying attention to each other's discourse. In this light, the citation network can be used as a forensic tool to analyze influence after it took place. Note that, in the original network, we establish a directed edge going from organization i to organization j if i cites j. Here, we are interested in knowing that j influenced i, so we reverse the direction of all edges, transforming the citation network into an influence network.
Susceptible–Infected (SI) models have been defined for network data [26]. In these models, nodes are assigned to one of two classes: Susceptible if the node can be infected with a disease, and Infected if the infection happened. In the simplest model (which we use here) a parameter β is specified: if more than a β fraction of the incoming edge weight of a Susceptible node comes from Infected nodes, then the node turns from Susceptible to Infected. These models have been successfully used to track the spread of information in social networks [27].
Here we assume that at time step zero only the World Bank is “infected” with a message it wants to spread. We run an SI model, recording at each time step the fraction of nodes that are part of the Infected pool. Here β represents the share of “infected” pages cited by an organization i that is necessary to infect i itself. If \(\beta = 0.1\), to infect i we require that at least 10% of the pages cited by i come from infected organizations. In practice, higher βs mean that we require more citations between organizations A and B to say that A has influenced B in the past.
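The threshold dynamics can be sketched as a short simulation. The influence network below is a hypothetical toy example, nodes are updated synchronously, and the threshold is applied as "at least β" (the text uses both "more than" and "at least", so this choice is an assumption of the sketch).

```python
def run_si(in_edges, seed, beta, max_steps=100):
    """Threshold SI model on a weighted influence network.

    in_edges: dict node -> {influencer: weight} (incoming influence edges).
    A susceptible node becomes infected when at least a beta fraction of
    its incoming weight comes from infected nodes. Updates are synchronous.
    Returns the final infected set and the infected share at each step.
    """
    infected = {seed}
    history = [len(infected) / len(in_edges)]
    for _ in range(max_steps):
        newly = set()
        for node, sources in in_edges.items():
            if node in infected:
                continue
            total = sum(sources.values())
            hit = sum(w for src, w in sources.items() if src in infected)
            if total > 0 and hit / total >= beta:
                newly.add(node)
        if not newly:
            break
        infected |= newly
        history.append(len(infected) / len(in_edges))
    return infected, history

# Hypothetical influence network; "seed_org" starts with the message.
in_edges = {
    "seed_org": {},
    "A": {"seed_org": 1.0},
    "B": {"A": 1.0, "C": 9.0},
    "C": {"B": 1.0},
    "D": {},               # no incoming influence: can never be reached
}
infected_low, history_low = run_si(in_edges, "seed_org", beta=0.1)
infected_high, history_high = run_si(in_edges, "seed_org", beta=0.2)
print(sorted(infected_low), len(history_low) - 1)
print(sorted(infected_high))
```

In this toy network, raising β from 0.1 to 0.2 blocks the spread at node B, because only a tenth of its incoming weight originates from an infected source; node D is unreachable at any β since it has no incoming edges.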
If we assume that all influence connections in the network can be used, the World Bank is able to infect almost the entire network (\({\sim}94\%\) of the nodes). The leftover nodes have no sufficiently strong incoming connections, and thus cannot be influenced. This is tested across a variety of β values. Figure 9 (left) depicts the result of these simulations. The higher the β parameter the harder it is for an infection to spread. However, in the case of the World Bank this only affects the speed of propagation of the information (from three to seven steps), not the final coverage. This implies that, according to this model, if the World Bank sends a message to the international aid community, likely 94% of organizations will receive it eventually, assuming that the real unobservable epidemics parameter β is equal to or lower than 0.15.
However, not all messages are equal. The content of the message likely influences its chances of being passed on. We can also simulate this case, by creating a multilayer view of the influence network, where each layer only contains citations made from pages containing only a specific keyword. We can run the SI model using exclusively edges coming from a single layer, which will now inform us about the power of the World Bank to influence organizations exclusively about a specific issue.
Figure 9 (right) depicts the result of these simulations. We also report the result of the simulation using all the layers, for reference. We can see that there are significant differences between issues. The World Bank is able to reach the vast majority of nodes in some cases, for instance when talking about Public Health (the final infected share of nodes is \({\sim}88\%\)). However, in the case of Avian Influenza the message finds a bottleneck in the multilayer network, and only reaches a third of the network.
How does the World Bank compare to other organizations? Here we choose three comparisons: WHO, UNICEF and USAID. We run the SI model using all connections from all keywords, and fixing β once again to 0.1. Figure 10 (left) depicts the result. We can see that the World Bank is noticeably slower than WHO, which reaches saturation faster. However, the World Bank outperforms UNICEF. When considering all keywords at the same time, the messages coming from USAID are dwarfed and reach a negligible portion of the network.
However, as pointed out before, these diffusion patterns are highly dependent on which keyword we are focusing on. Figure 10 (right) depicts the information spreading results when focusing on a specific one: “Nurse”. In this case, the four organizations are hardly distinguishable, with USAID having an influence potential on par with the World Bank and WHO.
Stability
The Web is notoriously dynamic: there is no guarantee that a webpage you can visit now is going to be online tomorrow. In the face of such dynamism, one could argue that any snapshot gathered from the Web could be noisy to the point of not being a good representation of the system.
Our original crawl was performed in February 2017. Eleven months later, in January 2018, we performed a second crawl of a selected number of websites to test the stability of our results. We selected roughly half of the websites and applied our data cleaning pipeline. The question is: is the 2017 WS measure a good predictor of 2018 WS? We answer with two analyses.
First, we perform a simple regression analysis. Regressing 2017 WS against 2018 WS yields an \(R^{2} \sim74.7\%\). This means that the vast majority of the variance in the 2018 measure can be predicted using 2017 data. This result reassures us that, overall, the networks we build are stable over time. A perfect correlation would have been as problematic as a low one: it would imply that the international aid community does not change, which would be a puzzling property for such a complex system. Since the correlation is not perfect, we are reassured that we are able to capture the evolution of the system.
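As an illustration of the stability check, the sketch below fits an ordinary least squares line with a single predictor and reports the coefficient of determination. The two weight vectors are hypothetical, and any transformation or weighting applied in the actual regression is not reproduced here.

```python
def r_squared(x, y):
    """R^2 of the least squares line predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Hypothetical edge weights measured in the two crawls.
ws_2017 = [1.0, 2.0, 3.0, 4.0]
ws_2018 = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(ws_2017, ws_2018)
print(round(r2, 3))
```

With a single predictor, this value equals the squared Pearson correlation between the two crawls, so it can be read as the share of variance in the later weights that is explained by the earlier ones.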
We also provide a visual representation of the relationship between the 2017 and 2018 versions of WS, in Fig. 11. Most of the data points lie on the diagonal, which means that most edges maintained the same weight.