Definitions
Suppose there exists an unknown directed cyclic network $G$ over which an infectious disease transmits. The observed surveillance data can be represented as tuples $\langle id_p, it_p, loc_p \rangle$, where $p$ is the index of a reported/confirmed case, $id_p$ is its unique identity, $it_p$ is the reported infection time, and $loc_p$ is the geographical location where the reported/confirmed case $p$ was infected.
After aggregating infection cases based on locations and infection times, a dataset $D = \{\langle v_i, ic_i, t_i \rangle \mid i = 0,1,2,\ldots,N,\ t \in T\}$ is collected. Here $i$ is the index of a specific node, $v_i$ is the unique identity of a geographical location (e.g., a province, a city, a township, or an urban area), $ic_i$ is the aggregated number of infection cases, and $t_i$ indicates a time step. $T$ is the considered time period of disease transmission. In this research, given only the observed data $D$, the underlying disease transmission network $G$ is inversely inferred. The estimated disease transmission network is referred to as $G^*$.
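The aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `aggregate_cases`, the tuple layout, and the weekly time step in the usage example are assumptions made for the example.

```python
from collections import Counter


def aggregate_cases(cases, time_step_of):
    """Aggregate case tuples (id_p, it_p, loc_p) into rows (v_i, ic_i, t_i).

    cases        -- iterable of (case_id, infection_time, location) tuples
    time_step_of -- hypothetical helper mapping an infection time to a time step
    """
    counts = Counter()
    for _case_id, infection_time, location in cases:
        counts[(location, time_step_of(infection_time))] += 1
    # One row per (location, time step), ordered by location and then time.
    return [(loc, n, t) for (loc, t), n in sorted(counts.items())]


# Usage: infection times are day numbers, aggregated into weeks (an assumption).
cases = [("p1", 3, "A"), ("p2", 5, "A"), ("p3", 9, "B")]
dataset = aggregate_cases(cases, time_step_of=lambda day: day // 7)
```

Each returned row holds the location identity, the aggregated case count, and the time step, in that order.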
Definition 1. Disease Transmission Network: Graph $G = \langle V, E \rangle$ is a directed cyclic network where $V = \{v_i \mid i = 0,1,2,\ldots,N\}$ is the set of nodes. The node $v_0$ represents the source node of the imported cases that could potentially cause local epidemics (the imported cases for a disease can be defined as the laboratory-confirmed infection cases where people have traveled to disease endemic regions or countries within days before the onset of the disease [26]). The nodes $v_i$ ($i = 1,2,\ldots,N$) correspond to the remaining nodes within the target region. $E = \{e_i \mid i = 1,2,\ldots,N\}$ denotes the set of directed edges with weights $W = \{w_i \mid i = 1,2,\ldots,N\}$, where $e_i = \{e_{ji} \mid j = 0,1,2,\ldots,N\}$ is the set of incoming links for node $i$ and $w_i = \{w_{ji} \mid j = 0,1,2,\ldots,N\}$ is the corresponding weight vector. Note that the source node $v_0$ does not have incoming links. Edges with non-zero weights can be understood as generalized transmission pathways that temporally correlate subpopulations in terms of their infection observations.
Unlike the network structures used in previous studies, the network structures used in this research contain three types of transmission pathways (shown in Figure 1). As the data describes a real-world situation, the assumption is that infected people can infect susceptible people within the same subpopulation (shown in Figure 2). This type of transmission pathway is defined as the internal transmission component. In addition, subpopulations within metapopulation-based disease transmission networks can be affected not only by subpopulations located in adjacent geographical regions, but also by imported cases. We define these respectively as the neighborhood transmission component and the external influence component.
Definition 2. Internal Transmission Component: Within each node (subpopulation), previously infected people may correlate to newly infected people without outside disturbances. This component is disease independent. Air-borne diseases such as influenza, vector-borne diseases such as malaria, and other infectious diseases all have this property. It is represented as an edge linking node $i$ to itself with weight $w_{ii}$ for each node $i$ in the disease transmission network $G$.
Definition 3. Neighborhood Transmission Component: Among groups of nodes (subpopulations), the temporal correlations among infected people could be caused by physically connected highways, air travel, adjacent borders, etc. This component signifies the interactions between infected people in different subpopulations. In $G$, it is represented as a directed link $e_{ij}$ from node $i$ to node $j$ with weight $w_{ij}$, indicating the correlation between infected people in $i$ and $j$.
Definition 4. External Influence Component: In disease transmission, the imported cases from foreign or distant endemic countries and regions are another major factor that can cause local epidemics [27]. Thus, we consider this factor in the disease transmission network as an external node connected to all the other nodes. In $G$, this is denoted as an edge $e_{0i}$ from the external node to node $i$ with weight $w_{0i}$.
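To make the three pathway types of Definitions 2 to 4 concrete, the sketch below classifies a directed edge $e_{ji}$ by its endpoints. The function name `pathway_type` and the string labels are hypothetical; only the indexing convention (node 0 is the external source and has no incoming links) is taken from the definitions.

```python
def pathway_type(j, i):
    """Classify the directed edge e_ji (from node j into node i)."""
    if i == 0:
        # The source node v0 has no incoming links (Definition 1).
        raise ValueError("the source node v0 has no incoming links")
    if j == i:
        return "internal"      # self-loop with weight w_ii (Definition 2)
    if j == 0:
        return "external"      # edge e_0i from the external node (Definition 4)
    return "neighborhood"      # edge between two subpopulations (Definition 3)


# Usage: incoming links of node 3 in a small, made-up network.
incoming = {j: pathway_type(j, 3) for j in (0, 2, 3)}
```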
Linear transmission model
To characterize a disease transmission process over $G$, we integrate the internal transmission component and the external influence component with the neighborhood transmission component. The internal transmission component characterizes the possible transmission relationships between previously infected people and currently infected people within each subpopulation. The assumption in [19] that “individuals do not change disease states during movements” is retained. Thus, the neighborhood transmission component describes the temporal correlations between infected people in different subpopulations. The external influence component depicts the introduction of imported cases from external sources. These three types of transmission pathways are defined mathematically as follows:
(1)
where $itc_i^t$, $ntc_i^t$, and $eic_i^t$ refer to the number of infection cases from the internal transmission, neighborhood transmission, and external influence components of node $i$ ($i \neq 0$) at time step $t$, respectively. $N_i$ is the number of neighbors of node $i$. $w_{ii}$, $w_{ji}$, and $w_{0i}$ are the corresponding edge weights. $ic_i$ is the total number of infection cases in node $i$, which can be written as a linear combination of the above three components plus an error term $\varepsilon$ that captures unpredicted biases. The assumption is that the error term for each node follows a zero-mean normal distribution, $\varepsilon \sim N(0, \beta)$:
(2)
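For orientation, one plausible form of the three components and of their combination is written out below. It assumes a one-step time lag (the lag used as the running example later in the text) and lets $j$ range over the $N_i$ neighbors of node $i$; both choices are assumptions made for illustration and may differ from the original equations.

```latex
\begin{align}
  itc_i^{t} &= w_{ii}\, ic_i^{t-1}, \\
  ntc_i^{t} &= \sum_{j=1}^{N_i} w_{ji}\, ic_j^{t-1}, \\
  eic_i^{t} &= w_{0i}\, ic_0^{t-1}, \\
  ic_i^{t}  &= itc_i^{t} + ntc_i^{t} + eic_i^{t} + \varepsilon,
  \qquad \varepsilon \sim N(0, \beta).
\end{align}
```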
Equations 1 and 2 characterize the temporal dynamics of the infection cases at each location. Note that in the real world, once a reported/confirmed case is diagnosed, physicians or hospitals take the necessary treatment and intervention measures, for example, medication or isolation. Thus, in the above linear transmission model, the infection cases at the current time step are treated as isolated in subsequent time steps.
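A minimal sketch of one step of such a linear model is given below. It assumes a one-step time lag and omits the error term; the function name `expected_components` and the dictionary layout for counts and weights are hypothetical choices for the example.

```python
def expected_components(i, prev_counts, weights):
    """Return (internal, neighborhood, external) case counts for node i.

    prev_counts -- dict: node index -> case count at the previous time step
    weights     -- dict: (j, i) -> weight w_ji of the edge from j into i
    """
    internal = weights.get((i, i), 0.0) * prev_counts[i]
    external = weights.get((0, i), 0.0) * prev_counts[0]
    # Neighborhood term: every incoming edge except the self-loop and the
    # edge from the external source node 0.
    neighborhood = sum(
        w * prev_counts[j]
        for (j, k), w in weights.items()
        if k == i and j not in (0, i)
    )
    return internal, neighborhood, external


# Usage with made-up counts and weights for a three-node network.
prev = {0: 4, 1: 10, 2: 6}
w = {(1, 1): 0.5, (2, 1): 0.25, (0, 1): 0.5, (1, 2): 0.3}
parts = expected_components(1, prev, w)
total = sum(parts)  # expected ic_1 at the current step, before adding noise
```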
Network inference problem
The network inference problem to be solved here is how to inversely infer the existence of the edges within the hidden disease transmission network $G$ and their corresponding weights $W = \{w_i \mid i = 0,1,2,\ldots,N\}$, given an observed surveillance dataset $D = \{\langle v_i, ic_i, t_i \rangle \mid i = 0,1,2,\ldots,N,\ t \in T\}$. Since the disease transmission process at the metapopulation level does not follow a directed acyclic graph pattern (Figure 2), it would be inaccurate to infer disease transmission networks by following the cascading process used in information diffusion [14].
To recover the network structure G, it is necessary to first write the likelihood function for a specific node i based on Eq. 2:
(3)
where all the parameters are the same as those in Eq. 2, except that the number of estimated neighbors of node $i$ within the inferred network $G^*$ takes the place of $N_i$. $\beta$ is the variance of the normal distribution for the error term $\varepsilon$. Based on this equation, we transform the network inference problem into an optimization problem, which is to find the optimal combination of neighbors with accurate weights for a specific node $i$.
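Under the zero-mean Gaussian error assumption, the per-node likelihood is easiest to work with in logarithmic form. The sketch below evaluates a Gaussian log-likelihood for one node from observed and model-predicted counts; the function name and argument layout are assumptions, and the predictions are taken as given rather than computed from a particular neighbor set.

```python
from math import log, pi


def node_log_likelihood(observed, predicted, beta):
    """Gaussian log-likelihood of one node's counts with error variance beta."""
    n = len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return -0.5 * n * log(2 * pi * beta) - sse / (2 * beta)


# Usage: a candidate neighbor set that predicts the counts better scores higher.
good = node_log_likelihood([1, 2, 3], [1, 2, 3], beta=1.0)
worse = node_log_likelihood([1, 2, 3], [1, 2, 4], beta=1.0)
```

Comparing this score across candidate neighbor sets is one way to read the optimization problem stated above.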
Then for the entire estimated network G∗, the objective is to maximize the likelihood function:
(4)
To evaluate the estimated network $G^*$, we use precision-recall measures. Specifically, we compare both the existence of edges and their corresponding weights in the synthetic network $G$ and the estimated network $G^*$.
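For the edge-existence part of this evaluation, precision and recall can be computed from the two edge sets as sketched below. The function name and the representation of an edge as a `(source, target)` pair are assumptions; comparing edge weights would need an additional error measure that is not shown.

```python
def edge_precision_recall(true_edges, inferred_edges):
    """Precision and recall of inferred edges against the true edge set."""
    true_edges, inferred_edges = set(true_edges), set(inferred_edges)
    true_positives = len(true_edges & inferred_edges)
    precision = true_positives / len(inferred_edges) if inferred_edges else 0.0
    recall = true_positives / len(true_edges) if true_edges else 0.0
    return precision, recall


# Usage with made-up edge sets for G and G*.
g_true = {(0, 1), (1, 1), (2, 1)}
g_star = {(0, 1), (1, 1), (3, 1), (2, 2)}
precision, recall = edge_precision_recall(g_true, g_star)
```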
Partial correlation network construction
Because a node can form its neighborhood from many possible combinations of other nodes, the solution space for the above problem is huge. As a first step, we reduce this space in order to improve both accuracy and performance in the subsequent tuning.
When the Pearson correlation is used to analyze the correlation between two selected nodes $i$ and $j$, a problem arises in the analysis of disease transmission networks. As shown in Figure 3(A), disease transmission may follow a path from node $i$ to $k$, and then to $j$. Take nodes $i$ and $j$ as the analysis targets. Although they are not directly connected, and the overall time-series surveillance data exhibits a time delay, they may still be correlated. Therefore, they may be connected in the approximate network structure $G_p$. The same problem exists in the case of Figure 3(B), where both nodes $i$ and $j$ are children of node $k$ in the disease transmission process. The correlation between nodes $i$ and $j$ is still strong even though the weights $w_{ki}$ and $w_{kj}$ are very different. To remove the biases produced by an intermediate node and by a shared parent node, a first-order partial correlation analysis is carried out.
The first-order partial correlation is a measurement of the dependence between two variables $X$ and $Y$ after removing or fixing a third variable $Z$. In our case, to compute it between nodes $i$ and $j$, the effect of another node $k$, where $k = 0,1,2,\ldots,N$ and $k \neq i,j$, is sequentially removed or fixed. From the results, only those coefficients that indicate strong correlations with significant p-values are chosen. It should be mentioned that a partial correlation analysis usually does not provide edge direction information [28, 29]. Therefore, to infer a directed relationship, we analyze the partial correlation with a time lag. The physical meaning of the time lag is a time step during the disease transmission process (e.g., one day, one week, or one month). Here, we use a time lag of one unit as an example, but other lags are also allowed. The direction is defined as running from the node using the previous-time-step time-series data to the node using the current-time-step time-series data. Defining the partial correlation coefficient between nodes $i$ and $j$ after fixing the variable of node $k$ as $\rho_{ij \cdot k}$, it can be computed as follows:
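$$\rho_{ij \cdot k} = \frac{\rho_{ij} - \rho_{ik}\,\rho_{jk}}{\sqrt{(1 - \rho_{ik}^2)(1 - \rho_{jk}^2)}} \qquad (5)$$
where $\rho_{ij}$, $\rho_{ik}$, and $\rho_{jk}$ are the correlation coefficients between each pair of nodes $i$, $j$, and $k$, respectively.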
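The lagged first-order partial correlation can be sketched in a few lines. The formula in `first_order_partial` is the standard first-order partial correlation; how the series are aligned in `lagged_partial_corr` (source and control taken at the earlier step, target at the later step) is an assumption, since the text does not state which time index the control node uses. All function names are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def first_order_partial(r_ij, r_ik, r_jk):
    """Partial correlation of i and j after fixing k, from pairwise correlations."""
    return (r_ij - r_ik * r_jk) / sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def lagged_partial_corr(src, dst, ctrl, lag=1):
    """Partial correlation for the directed pair src -> dst, controlling for ctrl."""
    x = src[: len(src) - lag]    # source node at the previous time step
    z = ctrl[: len(ctrl) - lag]  # control node, aligned with the source (assumption)
    y = dst[lag:]                # target node at the current time step
    return first_order_partial(pearson(x, y), pearson(x, z), pearson(y, z))


# Usage: dst follows src with a one-step delay, so the lagged value is close to 1.
rho = lagged_partial_corr([1, 2, 3, 4, 5], [0, 2, 4, 6, 8], [3, 1, 4, 1, 5])
```

When the correlation between $i$ and $j$ is fully explained by a shared parent $k$, so that $\rho_{ij} = \rho_{ik}\rho_{jk}$, the partial correlation is zero and the spurious edge drops out.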
Back-tracking Bayesian learning
Given the partial correlation network $G_p$, an approximate disease transmission network structure is obtained that contains possible neighbors for each node. However, some edges in $G_p$ still do not exist in the synthetic network $G$. A possible solution is to set the weights of these false positive edges within $G_p$ to zero during the inference process. This is similar to the procedure of removing irrelevant basis components, which is the basis for dimension reduction [30]. In the proposed inference method, Bayesian learning is based on the Sparse Bayesian Learning (SBL) framework [31]. Related work has been widely reported in signal processing studies [30]. Note that if two components are similar, SBL chooses only one of them in order to compress the relevant information. In our case, however, even if two nodes are similar, we aim to find both of them.
For a specific node $i$, the preprocessed surveillance dataset $D$ is divided into two subsets: an $M \times 1$ vector $y = \{\langle v_i, ic_i, t_i \rangle \mid t_i = 2,3,\ldots,M+1,\ M \in T\}$ and an $M \times |N_p|$ matrix $x = \{\langle v_j, ic_j, t_j \rangle \mid j \in N_p,\ t_j = 1,2,\ldots,M,\ M \in T-1\}$. $M$ is the size of the output variable $y$ and the input variable $x$. $N_p$ represents the indices of the possible neighbors of node $i$ based on $G_p$. $T-1$ is the previously considered time period of disease transmission. For ease of presentation, we omit the index $i$ for $y$, $x$, and other parameters in the following; unless stated otherwise, all parameters are formulated for node $i$. Here, we use a time lag of 1 between $y$ and $x$.
The relationship between $y$ and $x$ can be formulated based on the generalized linear transmission model introduced earlier as follows:
$$y = xw + \varepsilon \qquad (6)$$
where $w = \{w_j \mid j \in N_p\}$ is a vector indicating the weights of all possible incoming links estimated based on $G_p$, and $\varepsilon$ is an error term. As mentioned earlier, the solution space is huge. Thus, we hope to limit $w$ within a smooth range. Here we follow the framework of SBL and let both $w$ and $\varepsilon$ follow zero-mean Gaussian distributions with variances of $\alpha$ and $\beta$, respectively [31]. They are defined as:
(7)
Because there is no prior knowledge of $w$ and $\varepsilon$, it is reasonable to assign non-informative prior distributions, such as Gamma distributions, to $\alpha$ and $\beta$. Here, $\alpha$ and $\beta$ are assumed to have the same hyperparameters for all nodes.
Given the observed data $y$ and the priors governed by $\alpha$ and $\beta$, the posterior distribution of $w$ is:
(9)
which is a Gaussian distribution N(μ,Σ) with
(11)
where . “Type-II maximum likelihood” estimation combined with a maximum a posteriori probability (MAP) estimate [31] transforms the whole problem into maximizing the following marginal likelihood function:
(12)
Writing Eq. 12 in logarithmic form, we have:
(13)
with
The derivatives of Eq. 13 with respect to $\alpha_j$ and $\beta$ are [32]:
(15)
(16)
Setting Eqs. 15 and 16 to zero, the estimates of $\alpha_j$ and $\beta$ become:
(17)
(18)
The above iterative estimation procedure can be carried out using the Expectation-Maximization (EM) algorithm. In each iteration, the contributions to the marginal likelihood function are estimated for all the nodes in $G_p$. The node with the maximum contribution is selected as the candidate neighbor, and its corresponding weight is then computed.
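For reference, the widely used sparse Bayesian learning formulation with one precision hyperparameter per weight is summarized below. To avoid asserting how it maps onto the $\alpha$ and $\beta$ of the text, it is written with separate symbols: $a_j$ for the prior precision of $w_j$ and $\sigma^2$ for the noise variance. These symbols and the mapping are assumptions; the equations of the original derivation may be parameterized differently.

```latex
\begin{align}
  p(w \mid a) &= \prod_{j} N\!\left(w_j \mid 0,\ a_j^{-1}\right), \\
  \Sigma &= \left(\sigma^{-2} x^{\top} x + A\right)^{-1},
  \qquad A = \mathrm{diag}(a_j), \\
  \mu &= \sigma^{-2}\, \Sigma\, x^{\top} y, \\
  \log p(y \mid a, \sigma^2) &= -\tfrac{1}{2}\left[ M \log 2\pi + \log |C| + y^{\top} C^{-1} y \right],
  \qquad C = \sigma^2 I + x A^{-1} x^{\top}, \\
  a_j^{\mathrm{new}} &= \frac{\gamma_j}{\mu_j^{2}},
  \qquad \gamma_j = 1 - a_j \Sigma_{jj}, \\
  \left(\sigma^2\right)^{\mathrm{new}} &= \frac{\lVert y - x\mu \rVert^{2}}{M - \sum_j \gamma_j}.
\end{align}
```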
In the disease transmission network $G$, only links with positive weights exist, since they indicate the existence of transmission pathways. However, the prior distribution shown in Eq. 7 may cause $w$ to be negative. To avoid this, a constraint limiting $w$ to positive values is introduced. To incorporate this constraint into the above Bayesian learning framework, a back-tracking technique is used. During the EM learning procedure, the marginal likelihood function and other parameters are updated sequentially. Consequently, each time $\mu$, $\Sigma$, $\alpha_j$, and $\beta$ are updated, any $\alpha_j$ that fail the constraint are singled out, and their corresponding indices are put into a blacklist. The learning procedure, including the marginal likelihood value, is then rolled back to the previous step and proceeds by selecting only nodes that do not appear in the blacklist, while still maximizing the marginal likelihood function. The algorithm for Back-Tracking Bayesian Learning is shown in Figure 4.
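The control flow of the back-tracking step (select the best candidate, check the positivity constraint, roll back and blacklist on failure) can be illustrated with a deliberately simplified sketch. It is not the algorithm of Figure 4: the marginal-likelihood contribution is replaced here by a plain residual-fit score, and each weight is fitted on the current residual alone. The function name and data layout are hypothetical.

```python
def backtracking_select(x_cols, y, tol=1e-9):
    """Greedy neighbor selection with a blacklist and rollback.

    x_cols -- dict: candidate node -> list of lagged case counts
    y      -- list of current case counts for the target node
    Returns (weights, blacklist).
    """
    weights, blacklist = {}, set()
    residual = list(y)
    while True:
        best, best_gain, best_w = None, tol, 0.0
        for j, col in x_cols.items():
            if j in weights or j in blacklist:
                continue
            norm = sum(v * v for v in col)
            if norm == 0:
                continue
            dot = sum(v * r for v, r in zip(col, residual))
            gain = dot * dot / norm  # stand-in for the likelihood contribution
            if gain > best_gain:
                best, best_gain, best_w = j, gain, dot / norm
        if best is None:
            return weights, blacklist
        snapshot = (dict(weights), list(residual))  # state to roll back to
        weights[best] = best_w
        residual = [r - best_w * v for r, v in zip(residual, x_cols[best])]
        if best_w <= 0:
            # Constraint violated: restore the previous state and blacklist.
            weights, residual = snapshot
            blacklist.add(best)


# Usage with made-up series: "b" would only enter with a negative weight.
x_cols = {"a": [1, 2, 3, 4], "b": [4, 3, 2, 1]}
weights, blacklist = backtracking_select(x_cols, y=[1, 3, 5, 7])
```

In this example candidate "a" is accepted first, while "b" would only reduce the remaining residual with a negative weight, so it is rolled back and blacklisted.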